Monitoring framework

There are two separate monitoring systems (or frameworks) running in the [[New World]]. For simplicity, they are named after the servers which provide the principal monitoring activities:

# [[monitor1|Monitoring system #1]]
# [[monitor2|Monitoring system #2]]

The basic monitoring framework implemented in the [[New World]] infrastructure is designed to monitor and report on problems and faults with services or hosts AND to record historical performance data and provide graphing thereof.

The framework is not intended to be fancy, but to provide a "standard" for adding monitoring for new services and/or hosts, and to provide a standard look-and-feel for web pages providing access to historical performance data.

The nitty gritty is:
 
* The monitoring server is nw-syd-monitor1 and it serves the entire New World infrastructure and some "Old World" infrastructure. Visit http://nw-syd-monitor1.cse.unsw.edu.au/,
* It runs on standard Debian Buster. cfengine manages its configuration,
* It uses the Apache <code>httpd</code> web server (v2.4) (http://httpd.apache.org/),
* It uses Nagios both for service and host problem detection AND to collect performance data and store it in historical databases (https://www.nagios.org/),
* It uses RRDtool as both the primary (but not only) backend to store historical performance data collected by Nagios (see above) and to generate on-the-fly graphs for web display (https://oss.oetiker.ch/rrdtool/),
* It mainly uses SNMP servers on the monitored hosts to read performance data and for fault detection. These SNMP servers are polled from the monitoring server, as sketched below.
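
For example, such a poll might look like the following; the host name, community string and OIDs are illustrative assumptions, not the actual values used:

<pre>
# Hypothetical example of the kind of SNMP poll issued from the monitoring
# server; the host name, community string and OIDs are illustrative only.
snmpget -v2c -c public -Oqv nw-syd-somehost.cse.unsw.edu.au \
    IF-MIB::ifInOctets.1 IF-MIB::ifOutOctets.1
</pre>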
 
== Performance data collection ==
 
One of the more notable aspects of how the monitoring system works is the way it uses Nagios check commands to collect performance data samples from the various hosts and then writes them to RRDtool databases.
 
Typically, Nagios uses check commands (or scripts) to regularly poll hosts or services to make sure they're still working and to report changes of state (working-to-not-working, for example). Nagios runs multiple polls in parallel so that a service which is slow or not responding does not hold up the other checks.
 
The monitoring infrastructure takes advantage of this capability to combine service monitoring and data collection in one: the Nagios check commands collect performance data, write it into RRDtool databases, and report the results back to Nagios, which can then detect and report when data collection falters. Successfully collected and stored data can then be displayed as required.
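
The real collection scripts are not reproduced here; the following is a hedged sketch of the general pattern (poll, store, report), in which every name, path and OID is invented for illustration:

<pre>
#!/bin/sh
# Hypothetical sketch of a combined check-and-collect script. Only the
# overall pattern (poll, store, report to Nagios) comes from this page;
# the host argument, RRD path and OID below are invented for illustration.
HOST="$1"
RRD="/var/lib/rrdtool/$HOST/load.rrd"

# Poll the value via SNMP. If the poll fails, nothing is written to the
# database and the failure is reported to Nagios via the exit status.
VALUE=$(snmpget -v2c -c public -Oqv "$HOST" UCD-SNMP-MIB::laLoad.1) || {
    echo "CRITICAL: SNMP poll of $HOST failed"
    exit 2    # Nagios CRITICAL
}

# Store the sample; "N" timestamps it with the current time.
rrdtool update "$RRD" "N:$VALUE" || {
    echo "CRITICAL: rrdtool update of $RRD failed"
    exit 2
}

echo "OK: load=$VALUE"
exit 0    # Nagios OK
</pre>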
 
Thus, part of the Nagios configuration specifies when and how the data collection check commands are run. Notably, the individual check commands which collect data are run once per minute, with retries disabled: if a collection attempt fails, nothing is written to the database, Nagios reports the problem and then, eventually, tries again in a minute or so.
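
In Nagios terms, that scheduling might be expressed along the following lines; this is an illustrative sketch only, and the actual service names, check commands and hostgroups are those defined in the rrdtool*.cfg files described below:

<pre>
define service {
    ; Illustrative sketch only: the real definitions live in the
    ; rrdtool*.cfg files and will differ in their details.
    use                  generic-service
    hostgroup_name       rrdtool_poll_example
    service_description  RRDtool data collection (example)
    check_command        collect_example_data
    check_interval       1   ; run the collector once per minute
    max_check_attempts   1   ; no retries: a failed poll is reported at once
}
</pre>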
 
The collection scripts automatically create RRDtool databases as required, so once a new host or service is added to Nagios it automatically gets an RRDtool database and data collection begins.
 
The data collection scripts are located in <code>/usr/local/bin</code>.
 
RRDtool databases are created with 365 days' worth of history.
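
A minimal sketch of what that creation might look like, assuming one-minute samples kept unconsolidated for the full year; the path, data source name and consolidation choices are assumptions, not the actual ones:

<pre>
# Hedged sketch of database creation; the path, data-source name and RRA
# layout are illustrative assumptions. 525600 = 60 * 24 * 365 samples.
rrdtool create /var/lib/rrdtool/somehost/load.rrd \
    --step 60 \
    DS:load:GAUGE:120:0:U \
    RRA:AVERAGE:0.5:1:525600
</pre>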
 
== cfengine "products" and the warehouse ==
 
There are two principal products in the cfengine warehouse which have to do with the monitoring system:
 
* nagiosconf - manages Nagios generally, plus manages the historical data collection and archival,
* monitorconf - manages the display of historical data using RRDtool.
 
Because RRDtool performs double duty as both the historical data store AND the graph creation tool, it appears prominently in both of the above products.
 
== Important files and directories on nw-syd-monitor1 ==
 
{|class="firstcolfixed"
!width="33%"| File or directory path
!width="33%"| Description
!width="33%"| Managed by
|-
| /usr/local/bin<br />
/usr/local/lib
| Nagios check scripts, graph-drawing useful stuff, etc.
| monitorconf<br />
nagiosconf
|-
| /usr/lib/cgi-bin
| Default location of CGI scripts using Debian standard Apache configuration files
| Package: apache2
|-
| /usr/lib/cgi-bin/nagios4
| Location of CGI scripts which are part of the Debian nagios4 package and which display the Nagios status pages
| Package: nagios4
|-
| /usr/lib/cgi-bin/rrdtool
| Location of CGI scripts which display the performance graphs
| monitorconf
|-
| /etc/nagios4
| Top-level configuration directory for Nagios4
| Package: nagios4
|-
| /etc/nagios4/objects
| Directory containing *.cfg Nagios configuration files specific to CSE
| nagiosconf. This directory and its contents are initially installed by the nagios4 package, but the entire directory contents are blown away and replaced by cfengine
|-
| /etc/nagios4/objects/rrdtool.cfg
| Defines specific services and check commands used by the RRDtool data collectors
| nagiosconf
|-
| /etc/nagios4/objects/rrdtool
| Directory containing Nagios configuration files which set up data collection polling
| Contents maintained by [[host-configuration.html|update_hosts]]
|-
| /etc/nagios4/objects/rrdtool*.cfg
| Configuration files specific to updating RRDtool databases
|
|-
| /etc/nagios4/objects/rrdtool_hostgroup_poll_*.cfg
| Nagios hostgroups used to determine what data is collected from particular hosts
|
|}
 
== Automated configuration ==
 
The host and service configuration for Nagios (in <code>/etc/nagios4/objects</code>) refers to Nagios hostgroups rather than individual hosts. Hosts are selected for monitoring by including them in separate files (with "hostgroup" in the file name) containing hostgroup membership lists. Many, if not all, of these hostgroup files are generated automatically using ''host generator(s)'' (see [[host-configuration.html|Host configuration]]).
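
A generated hostgroup membership file might look something like this; the group name and member hosts are invented for illustration:

<pre>
define hostgroup {
    ; Illustrative sketch of a generated hostgroup membership file;
    ; the group name and member hosts are invented.
    hostgroup_name  rrdtool_poll_example
    members         nw-syd-hostA,nw-syd-hostB,nw-syd-hostC
}
</pre>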
 
== Historical performance graph standard template ==
 
The standard performance graph web page contains a series of graphs of whatever is being shown, with a set of selectors at the top of the page for choosing the graph span (in minutes, days, weeks, etc.), plus controls for stepping the graphs backwards and forwards in time, e.g. to show an arbitrary 3-hour span of data recorded 6 months ago. The time controls are loosely called "retro" because that is the name of the CGI parameter used to pass this setting to the scripts.
 
To avoid repetition of work and to provide a consistent interface, all of the performance graph display pages are created with CGI shell scripts using a set of HTML- and graph-generator functions included from <code>/usr/local/lib/rrdtool_plot.inc</code>.
 
The provided functions include:
 
{|class="firstcolfixed"
! Function name
! Description
|-
| StartPage
| Generates the HTML page header, page title and span controls
|-
| EndPage
| Outputs the bottom of the HTML page
|-
| DrawPlot
| Draws a single RRDtool graph as a base64-encoded inline PNG image in the HTML page. This is typically called multiple times on a page, once per graph
|}
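
Putting these together, a graph-display CGI script might be structured roughly as follows. Only the three function names and the include path come from this page; the argument conventions and RRD paths are assumptions:

<pre>
#!/bin/sh
# Hedged sketch of a graph-display CGI script. The function names and the
# include path are from this page; the argument conventions and RRD paths
# are illustrative assumptions.
. /usr/local/lib/rrdtool_plot.inc

StartPage "somehost performance"               # header, title, span controls
DrawPlot /var/lib/rrdtool/somehost/load.rrd    # one graph per call
DrawPlot /var/lib/rrdtool/somehost/memory.rrd
EndPage                                        # bottom of the page
</pre>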
