Skip to content

Latest commit

 

History

History
99 lines (80 loc) · 7.02 KB

File metadata and controls

99 lines (80 loc) · 7.02 KB

Output of a Monitoring Plugin

The output of a Monitoring Plugin consists of two parts on the first level, the Exit Code and output in textual form on stdout.

Exit Code

The Monitoring Plugin MUST make use of the Exit Code as a method to communicate a result to the Monitoring System. Since the Exit Code is more or less standardized over different systems as an integer number with a width of or greater than 8bit, the following mapping is used:

Exit Code (numerical) Meaning (short) Meaning (extended)
0 OK The execution of the Monitoring Plugin proceeded as planned and whatever it test appeared to function properly and the measured values are with their respective thresholds
1 WARNING The execution of the Monitoring Plugin proceeded as planned and whatever it test appeared to not function properly or the measured values are not with their respective thresholds. The problem(s) do(es) not seem exceptionally grave though and do(es) not require immediate attention
2 CRITICAL The execution of the Monitoring Plugin proceeded as planned and whatever it test appeared to not function properly or the measured values are not with their respective thresholds. The problem(s) do(es) seem exceptionally grave though and do(es) require immediate attention
3 UNKNOWN The execution of the Monitoring Plugin did not proceed as planned. The reasons might be manifold, e.g. missing permissions, missing libraries, no available network connection to the destination, etc.. In summary: The Monitoring Plugin could not determine the state of whatever it should have been checking and can therefore make no reliable statement about it.
4-125 reserved for future use

Textual Output

The original purpose of the output on stdout was to provide human readable information for the user of the Monitoring System, a way for the Monitoring Plugin to communicate further details on what happened. This purpose still exists, but was expanded with the, so called, perfdata (performance data) to allow the machine readable communication of measured values for further processing in the Monitoring System, e.g. for the creation of diagrams.

Therefore the further explanation is split into human readable output and perfdata.

Human readable output

This part of the output should give an user information about the state of the test and, in the case of problems, ideally hint what the origin of the problem might be or what the symptoms are. If the test relies on numeric values, this might be displayed to give an user more information about the specific problem. It might consist of one or more lines of printable symbols.

Examples:

Remaining space on filesystem "/" is OK

Sensor temperature is within thresholds

Available Memory is too low

Sensore temperature exceeds thresholds

are OK, but

Remaining space on filesystem "/" is OK ( 62GiB / 128GiB )

Sensor temperature is within thresholds ( 42°C )

Available Memory is too low ( 126MiB / 32GiB )

Sensor temperature exceeds thresholds ( 78°C > 70°C )

are better.

Although no strict guidelines for creating this part of the output can really be given, a developer should keep a potential user in mind. It might, for example, be OK to put the output in a single line if there are only one or two items of a similar type (think: multiple file systems, multiple sensors, etc.) are present, but not if there 10 or 100, although this might present a valid use case. If there are several different items exists in the output of the Monitoring Plugin, furthermore called partial results, they probably SHOULD be given their own line in the output.

Performance data

In addition to the human readable part the output can contain machine readable measurement values. These data points are separated from the human readable part by the "|" symbol which is in effect until the end of the output. The performance data then MUST consist of space separated single values, these MUST have the following format:

'label'=value[UOM][;warn[;crit[;min[;max]]]]

with the following definitions:

  1. label must consist of at least on non-space character, but can otherwise contain any printable characters except for the equals sign (=) or single quotes ('). If it contains spaces, it must be surrounded by single quotes

  2. value is a numerical value, might be either an integer or a floating point number. Using floating point numbers if the value is really discreet SHOULD be avoided. Also the representation of a floating point number SHOULD NOT use the "scientific notation" (e.g. 6.02e23 or -3e-45), since some systems might not be able to parse them correctly. Also values with a base other then 10 SHOULD be avoided (see below for more information on Byte values).

  3. UOM is the Unit of measurement (e.g. "B" for Bytes, "s" for seconds) which gives more context to the Monitoring System.

    • The following constraints MUST be applied:

      1. An UOM of % MUST be used for percentage values
      2. An UOM of c MUST be used for continuous counters (commonly used for the sum of bytes transmitted on an interface)
    • The following recommendations SHOULD be applied:

      1. The UOM for Byte values is B and although many systems do understand units like KB,KiB, MB, GB, TB they SHOULD be avoided, at the least to avoid the ugly hassle about people misinterpreting the base10 values as base2 values and the other way round. This is also a prime example where floating point number SHOULD NOT be used, since there are obviously only integer numbers included.
      2. The UOM for time is s, meaning seconds, SI-Prefixes (e.g. ms for milli seconds) are allowed if necessary or useful, but be aware, that many systems may not understand μs for micro seconds and expect us instead.
      3. In general, SI units and SI prefixes SHOULD be used as UOM if applicable, but the Monitoring System may not understand them correctly (mostly in uncommon cases), in that cases appropriate workarounds MAY be applied on the side of the Monitoring Plugin, but it would be nice make the developer of the Monitoring System aware of the problem. Since the values are not intented to be human readable normalized units are recommended (e.g. overall_power=14000000000W instead of overall_power=14GW)
  4. warn and crit are the threshold values for this measurement, which may have been given by the user as input, may be hardcoded in the Monitoring Plugin or may be retrieved from a file or a device or somewhere else during the execution of the tests. The unit used MUST be the same as for value. These values are not simple numbers, but range expressions.

  5. min and max are the minimal respectively the maximal value the value could possibly be. The unit is the same as for value. These values can be omitted, if the value is a percentage value, since min and max are always 0 and 100 in this case.