Monitor2

Revision as of 13:19, 27 April 2023

Typical graph page

The second monitoring system/framework in the New World, monitor2, is implemented on the server https://nw-syd-monitor2.cse.unsw.edu.au/. Unlike monitor1, it does not use Nagios and does not generate any alarms or warnings.

Instead, monitor2 collects and graphs data sampled from various servers. It is designed to implement a simple but flexible way of collecting data from these hosts, storing that data and graphing it, regardless of the data sources. It uses SNMP to collect data samples from the monitored hosts because SNMP operations only lightly load the host compared to, say, using SSH to log in to and query the host. Where the standard SNMP MIBs don't define the desired data, SNMP is extended with external scripts (currently written in bash for ease of portability and maintenance) to provide the required samples.

As of the date of writing, monitor2 supports:

  1. Disk activity (bytes read and written per local disk and partition)
  2. Network interface traffic (bytes read and written per network interface)
  3. CPU usage (percentage) and load average
  4. Memory usage (RAM)
  5. Device temperature(s)

Monitored hosts only require:

  • That the abovementioned extension scripts be copied into place (see file locations below),
  • That a custom snmpd.conf be copied into place, and
  • That snmpd be installed and started/run via systemd.

File locations

monitor2

The monitor2 server (see above):

  • Runs the various scripts used to collect data from the monitored hosts via SNMP, and also
  • Runs an Apache2 web server which, via CGI scripts written in bash (aided and abetted by gnuplot), graphs the collected data for user consumption.

monitor2 files and directories | Description
/etc/monitor2.conf | Site-specific configuration: top-level directory location, etc. (see below)
/etc/apache2/apache2.conf | Custom Apache2 configuration. Does not include any site, module or configuration files from the default Apache2 configuration directories - i.e., it's all done here!
/etc/systemd/system/datacollectpoll.service | systemd service file used to control our data collection service (see below)
/var/samples/monitor2 | Top level directory under which sampled/graphed hosts are collected in plot pool directories
/usr/local/infrastructure/monitor2/cgi-bin | Directory containing bash CGI scripts for each graphed data type. They use a common library to display pages with similar-looking controls at the top and graphs underneath
/usr/local/infrastructure/monitor2/plot-bin | ?
/usr/local/infrastructure/monitor2/html | Static HTML pages, including index.html which contains links to the CGI scripts
/usr/local/infrastructure/monitor2/bin | ?
/usr/local/infrastructure/monitor2/datacollectpoll | Directory containing the datacollectpoll Tcl script (run by systemd) and its configuration file

Monitored hosts

As noted above, hosts from which data is collected need only run snmpd and have the extension scripts installed. This simplicity is reflected in the shortness of the table below.

Files and directories on the monitored hosts | Description
/etc/snmp/snmpd.conf | Configuration file for the SNMP daemon running on the host. It contains the community name and lists the scripts used to extend the range of data the daemon can provide
/usr/local/snmpd_extend | Directory containing the extension scripts referred to by snmpd.conf
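
To illustrate what an extension script of this kind might look like, the sketch below converts a Linux thermal-zone reading to degrees Celsius. The script name, the `extend` line shown in the comment, and the sysfs path are all assumptions made for illustration; the real scripts in /usr/local/snmpd_extend may work quite differently.

```shell
#!/usr/bin/env bash
# Hypothetical snmpd extension script (not one of the real monitor2 scripts).
# An snmpd.conf line such as the following (assumed) would expose its output:
#   extend temperature /usr/local/snmpd_extend/get_temperature

# Convert millidegrees Celsius, as reported by Linux thermal sysfs files,
# to degrees with one decimal place. Assumes a non-negative integer reading.
millideg_to_celsius() {
    local milli="$1"
    printf '%d.%d\n' "$(( milli / 1000 ))" "$(( (milli % 1000) / 100 ))"
}

# The default path is an assumption; the first argument overrides it.
zone="${1:-/sys/class/thermal/thermal_zone0/temp}"
if [[ -f "$zone" && -r "$zone" ]]; then
    millideg_to_celsius "$(<"$zone")"
fi
```

Keeping the script to a single printed value per run fits the way snmpd returns the output of an extension command to the querying host.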

monitor2.conf

Configuration file used to set the environment variables used by the datacollectpoll script (and the same-named systemd service). See datacollectpoll.service below.

SLEEPTIME=25
SAMPLERETRIES=3
SAMPLETIMEOUT=2
SNMPCOMMUNITY="csereader"
DEFAULTPLOTPOOL="vlab"
DATADIR="/var/samples/monitor2"
DATACOLLECTPOLLCONF="/etc/datacollect.conf"

Environment variable | Description
SLEEPTIME | ?
SAMPLERETRIES | Passed to the snmpget and snmpwalk commands to set the number of retry attempts to make while reading data samples from a monitored host
SAMPLETIMEOUT | As above, but sets the timeout before retrying
DEFAULTPLOTPOOL | ?
DATADIR | Top-level directory where data samples are stored by the data collection scripts, and read from by the CGI page-plotting scripts
DATACOLLECTPOLLCONF | Location and name of the configuration file for datacollectpoll, the immortal script run by datacollectpoll.service
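
As a rough sketch of how these settings could reach the Net-SNMP tools: `-r` and `-t` are the standard Net-SNMP options for retries and timeout, and `-c` sets the community. The helper name `sample_cmd`, the SNMP version (v2c) and the OID are assumptions for illustration, not taken from the real collection scripts. The helper prints the command line rather than running it.

```shell
#!/usr/bin/env bash
# Values mirror the example monitor2.conf above.
SAMPLERETRIES=3
SAMPLETIMEOUT=2
SNMPCOMMUNITY="csereader"

# sample_cmd is a hypothetical helper: it prints (does not run) the snmpget
# command line for one host and one OID. SNMP v2c is an assumption.
sample_cmd() {
    local host="$1" oid="$2"
    printf 'snmpget -v2c -c %s -r %s -t %s %s %s\n' \
        "$SNMPCOMMUNITY" "$SAMPLERETRIES" "$SAMPLETIMEOUT" "$host" "$oid"
}

# 1.3.6.1.2.1.1.3.0 is the standard system uptime OID, used here as a stand-in.
sample_cmd node3 1.3.6.1.2.1.1.3.0
```

Under systemd, the EnvironmentFile= line in the service file is what would place these variables in the script's environment, so a script started that way would not need to set them itself.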

datacollectpoll.service

Location: /etc/systemd/system/datacollectpoll.service

[Unit]
Description=Data Collect Poll daemon for fleet monitoring
After=network.target

[Service]
User=monitor2
Group=monitor2
EnvironmentFile=/etc/monitor2.conf
ExecStart=/usr/local/infrastructure/datacollectpoll/datacollectpoll

[Install]
WantedBy=multi-user.target

Note: the "monitor2" user and group need to be created on the monitor2 server using the ID numbers (1000/1000) specified in the cfengine configuration. See /var/lib/cfengine3/masterfiles/monitorconf.inc (on the cfengine hub).

datacollectpoll

Immortal script (i.e., it never intentionally dies) managed and run by systemd. It runs commands at the intervals listed in datacollectpoll.conf to collect data samples, and stores them in plain-text files in the samples directory.

samples

Directory and subdirectories with plain-text files containing daily collections of data samples named by type and host.

datacollectpoll.conf

Format:

  • Blank lines are ignored.
  • '#' to end of line is a comment.
  • The first space-separated field on the line is the command to run. The command file must exist at the time the configuration file is loaded.
  • Subsequent space-separated fields are optional arguments passed to the aforementioned command. Although optional format-wise, the existing scripts take two arguments:
    • The poll group
    • The host to sample
  • An optional ';' (semicolon) followed by an interval in seconds changes the time between samples from XX to the specified interval
/home/cephmonitor/bin/get_disk_stats vmhost node3		 ; 30
/home/cephmonitor/bin/get_disk_stats vmhost node4		 ; 30
/home/cephmonitor/bin/get_disk_stats vmhost node6		 ; 30
/home/cephmonitor/bin/get_disk_stats vmhost node9		 ; 30

/home/cephmonitor/bin/get_temperature_stats ceph node1		 ; 120
/home/cephmonitor/bin/get_temperature_stats ceph node2		 ; 120
/home/cephmonitor/bin/get_temperature_stats ceph node5		 ; 120
/home/cephmonitor/bin/get_temperature_stats ceph node7		 ; 120
/home/cephmonitor/bin/get_temperature_stats ceph odroidn2	 ; 120
/home/cephmonitor/bin/get_temperature_stats ceph rockpro64	 ; 120
...
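
The format rules above can be sketched as a small bash function. This is only an illustration: the real datacollectpoll is a Tcl script and may parse the file differently. The name `parse_conf_line` and the `DEFAULT_INTERVAL` value are hypothetical, and the sketch leaves out the check that the command file exists.

```shell
#!/usr/bin/env bash
# parse_conf_line is a hypothetical helper for illustration only.
DEFAULT_INTERVAL=60   # assumed placeholder; the real default is not stated here

# Print "<interval>|<command and arguments>" for an entry line.
# Print nothing for blank or comment-only lines.
parse_conf_line() {
    local line="${1%%#*}"                  # drop '#' to end of line
    local cmd="$line"
    local interval="$DEFAULT_INTERVAL"
    local -a fields
    if [[ "$line" == *";"* ]]; then
        cmd="${line%%;*}"                  # everything before the first ';'
        interval="${line#*;}"              # everything after the first ';'
        read -r interval <<< "$interval"   # trim surrounding whitespace
    fi
    read -r -a fields <<< "$cmd"           # split on spaces and tabs
    if (( ${#fields[@]} == 0 )); then
        return 0
    fi
    printf '%s|%s\n' "$interval" "${fields[*]}"
}

parse_conf_line $'/home/cephmonitor/bin/get_disk_stats vmhost node3\t\t ; 30'
parse_conf_line '# a comment line'
```

The first call prints `30|/home/cephmonitor/bin/get_disk_stats vmhost node3`; the second prints nothing because the whole line is a comment.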

Plot pools