Monitor2

''[Screenshot: typical graph page]''
The second monitoring system/framework in the [[New World]], '''monitor2''', is implemented on the server https://nw-syd-monitor2.cse.unsw.edu.au/. Unlike [[monitor1]] it does not use Nagios and does not generate any alarms or warnings.


Instead, monitor2 collects and graphs data sampled from various servers. It is designed to implement a simple but flexible way of collecting data from these hosts, storing that data and graphing it, regardless of the data sources. It uses SNMP to collect data samples from the monitored hosts because SNMP operations only lightly load the host compared to, say, using SSH to log in to and query the host. Where the standard SNMP MIBs don't define the desired data, SNMP is extended with external scripts (currently written in <code>bash</code> for ease of portability and maintenance) to provide the required samples.
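For data that the standard MIBs already expose, such as per-interface byte counters, a single query from the monitoring server is all that is needed; for example (illustrative only - the community string here is a placeholder):

 # Walk the 64-bit per-interface input byte counters of a monitored host.
 snmpwalk -v2c -c '<community>' nw-syd-vxdb IF-MIB::ifHCInOctets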


__TOC__

As of the date of writing (June 2023), monitor2 supports:


# Disk activity (bytes read and written per local disk and partition)
# Network interface traffic (bytes read and written per network interface)
# CPU usage (percentage) and load average
# Memory usage (RAM)
# Device temperature(s)
# Logged-on user count


Monitored hosts only require:
* That the abovementioned extension scripts be copied into place (see file locations below),
* That a custom <code>snmpd.conf</code> be copied into place, and
* That <code>snmpd</code> be installed and started/run via <code>systemd</code>.
Monitored hosts are also organised into "plot pools", groups of hosts which presumably share some characteristic, and which are always plotted together on the same page. E.g., "vlab", "storecomp" (storage/compute cluster) or "kora" (kora lab workstation). The first time data is plotted via the web interface the plot pool defaults to the DEFAULTPLOTPOOL setting in <code>/etc/monitor2.conf</code> (see below).
== Notes about SNMP on Debian ==
By default, the full suite of SNMP MIBs is '''not''' installed when SNMP packages are installed on their own. Instead, you need to install the <code>snmp-mibs-downloader</code> package, which then runs <code>/usr/bin/download-mibs</code> as a post-install step. This applies only to the monitoring server.
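A typical way to get the full MIB set onto the monitoring server (a sketch assuming a stock Debian install):

 # The MIBs are packaged separately on Debian; the downloader package's
 # post-install step runs /usr/bin/download-mibs to fetch them.
 apt-get install snmp snmp-mibs-downloader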


== File locations ==

=== monitor2 ===


The monitor2 server (see above):
* Runs the various scripts used to collect data from the monitored hosts via SNMP, and
* Runs an Apache2 web server which, via CGI scripts written in <code>bash</code> (aided and abetted by <code>gnuplot</code>), graphs the collected data for user consumption.


{|
!monitor2 files and directories
!Description
|-
|<code style="white-space: nowrap;">/etc/[[Monitor2#monitor2.conf|monitor2.conf]]</code>
|<code style="white-space: nowrap;">/etc/monitor2.conf</code>
|Site-specific configuration: top-level directory location, etc.
|Site-specific configuration: top-level directory location, etc. (see below)
|-
|<code style="white-space: nowrap;">/etc/datacollectpoll.conf</code>
|Host sampling configuration file (see below)
|-
|<code style="white-space: nowrap;">/etc/apache2/apache2.conf</code>
|Custom Apache2 configuration. Does not ''include'' any site, module or configuration files from the default Apache2 configuration directories - i.e., it's all done here!
|-
|<code style="white-space: nowrap;">/etc/systemd/system/datacollectpoll.service</code>
|<code>systemd</code> service file used to control our data collection service (see below)
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/[[Monitor2#samples|samples]]</code>
|<code style="white-space: nowrap;">/var/samples/monitor2</code>
|Top level directory under which, firstly, sampled/graphed hosts are collected in [[Monitor2#plot_pools|plot pool]] directories
|Top level directory under which data for sampled/graphed hosts are collected in plot pool-specific directories
|-
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/cgi-bin</code>
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/cgi-bin</code>
Line 44: Line 58:
|-
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/plot-bin</code>
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/plot-bin</code>
|?
|Contains <code>commonx.sh</code> which is a set of <code>bash</code> functions used to give a common look-and-feel to the graph pages. source'd by the CGI scripts
|-
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/html</code>
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/html</code>
|Static HTML pages, including <code>index.html</code> which contains links to the CGI scripts
|Static HTML pages, including <code>index.html</code> which contains links to the CGI scripts
|-
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/bin</code>
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/getbin</code>
|?
|Scripts to sample particular data types
|-
|-
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/datacollectpoll</code>
|<code style="white-space: nowrap;">/usr/local/infrastructure/monitor2/datacollectpoll</code>
|Directory containing the <code>[[Monitor2#datacollectpoll|datacollectpoll]]</code> Tcl script (run by <code>systemd</code>) and its configuration file
|Directory containing the <code>[[Monitor2#datacollectpoll|datacollectpoll]]</code> Tcl script (run by <code>systemd</code>)
|}
|}


=== Monitored hosts ===

As noted above, hosts from which data is collected need only run <code>snmpd</code> and have the extension scripts installed. This simplicity is reflected in the shortness of the table below.

{|
!Files and directories on the monitored hosts
!Description
|-
|<code style="white-space: nowrap;">/etc/snmp/snmpd.conf</code>
|Configuration file for the SNMP daemon running on the host. It contains the community name plus it lists the scripts used to extend the range of data the daemon can provide
|-
|<code style="white-space: nowrap;">/usr/local/snmpd_extend</code>
|Directory containing the extension scripts referred to by <code>snmpd.conf</code>
|}
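To make the mechanism concrete, a hypothetical sketch (not a copy of the real <code>snmpd.conf</code>) of how an extension script is hooked in on a monitored host, and how the monitoring server then reads its output:

 # /etc/snmp/snmpd.conf on the monitored host (sketch):
 rocommunity <community>
 extend usercount /usr/local/snmpd_extend/usercount    # hypothetical script name
  
 # From the monitor2 server, the script's output is then readable via SNMP:
 snmpget -v2c -c '<community>' <host> 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."usercount"'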


== monitor2.conf ==
Configuration file that sets the environment variables used by the <code>datacollectpoll</code> script (and the same-named <code>systemd</code> service). See <code>datacollectpoll.service</code> below.


{|
|
  SLEEPTIME=25
  SAMPLERETRIES=3
  SAMPLETIMEOUT=2
  KEEP_SAMPLES_DAYS=28
  SNMPCOMMUNITY="csereader"
  DEFAULTPLOTPOOL="vlab"
  DATADIR="/var/samples/monitor2"
  DATACOLLECTPOLLCONF="/etc/datacollect.conf"
|}

{|
!Environment variable
!Description
|-
|SLEEPTIME
|?
|-
|SAMPLERETRIES
|Passed to <code>snmpget</code> and <code>snmpwalk</code> commands to set the number of retry attempts to make while reading data samples from a monitored host
|-
|SAMPLETIMEOUT
|As above, but the timeout (in seconds) before retrying
|-
|KEEP_SAMPLES_DAYS
|Number of days to keep data samples. Used by the <code>cron</code> job <code>~monitor2/getbin/delete_old_samples</code>
|-
|DEFAULTPLOTPOOL
|The initial plot pool selected when data is plotted via the web interface
|-
|DATADIR
|Top-level directory where data samples are stored by the data collection scripts, and read from by the CGI page-plotting scripts
|-
|DATACOLLECTPOLLCONF
|Location and name of the configuration file for <code>datacollectpoll</code>, the immortal script run by <code>datacollectpoll.service</code>
|}
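To make the variables concrete, here is a hedged sketch of the kind of call a <code>getbin</code> script might make with them; the OID queried and the "load" file-name suffix are illustrative assumptions, not taken from the real scripts:

 # Sample one value over SNMP using the monitor2.conf settings, then append
 # it to the documented per-day data file for the host's plot pool.
 . /etc/monitor2.conf
 host=nw-syd-vxdb; pool=vlab    # example host and plot pool from this page
 value=$(snmpget -v2c -c "$SNMPCOMMUNITY" -r "$SAMPLERETRIES" -t "$SAMPLETIMEOUT" \
         -Oqv "$host" UCD-SNMP-MIB::laLoad.1)
 secs=$(( $(date +%s) - $(date -d 00:00 +%s) ))    # seconds since midnight
 echo "$secs $value" >> "$DATADIR/$pool/$(date +%Y%m%d)-$host-load.dat"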


== datacollectpoll.service ==
Location: <code>/etc/systemd/system/datacollectpoll.service</code>


{|
|
  [Unit]
  Description=Data Collect Poll daemon for fleet monitoring
  After=network.target
  
  [Service]
  User=monitor2
  Group=monitor2
  EnvironmentFile=/etc/monitor2.conf
  ExecStart=/usr/local/infrastructure/datacollectpoll/datacollectpoll
  
  [Install]
  WantedBy=multi-user.target
|}
Note: the "monitor2" user and group need to be created on the monitor2 server using the ID numbers (1000/1000) specified in the cfengine configuration. See <code>/var/lib/cfengine3/masterfiles/monitorconf.inc</code> (on the cfengine hub).


== <code>datacollectpoll</code> ==


Immortal Tcl script (i.e., never intentionally dies) managed and run by <code>systemd</code>. It runs commands at intervals listed in <code>datacollectpoll.conf</code> to collect data samples and stores them in plain-text files in the <code>samples</code> directory.
 
It reads the list of poll commands to run, which includes host names and the plot pools to which they belong, and then proceeds to run these commands at specified intervals to collect the data. The scripts themselves are responsible for doing the actual polling and data storage - <code>datacollectpoll</code> simply runs the commands.
 
<code>datacollectpoll.conf</code> is automatically re-read each time it changes so <code>datacollectpoll</code> does not need to be restarted.
 
If a command fails (returns an exit code other than zero), <code>datacollectpoll</code> backs off before trying again.
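Purely as an illustration of the behaviour described above (the real implementation is the Tcl script, which also tracks per-command intervals rather than running everything each pass), the main loop amounts to something like:

 # Illustrative shell pseudocode only - not the actual Tcl implementation.
 while true; do
     # the poll list is re-read on every pass, so edits are picked up automatically
     grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$DATACOLLECTPOLLCONF" |
     while read -r line; do
         cmd=${line%%;*}      # command and its arguments; strip the ';' interval field
         $cmd || sleep 10     # crude back-off if the command fails
     done
     sleep "${SLEEPTIME:-25}"
 done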


== samples ==


Directory and subdirectories with plain-text files containing collections of data samples named by date, data type and host.
 
More specifically, each sample file name is formatted as follows:
 
YYYYMMDD-&lt;host&gt;-&lt;datatype&gt;.dat
 
Where:
* YYYYMMDD is the date. There's one file per 24-hour day
* &lt;host&gt; is the name of the monitored host
* &lt;datatype&gt; is the type of data - e.g., "netif" (network interface byte counts), "usercount" (user count, especially for VLAB and login servers), etc.
 
Each file is a plain text file consisting of a series of lines containing space-separated values:
* First field is the sample time since midnight in seconds
* Subsequent fields are typically numerical values (may be floating point). Where data is for potentially multiple attached peripherals, such as disk drives or network interfaces, the fields may consist of groups of peripheral names and their data.
 
=== samples directory structure (see also <code>/etc/monitor2.conf</code>) ===
 
root@nw-syd-monitor2:# '''find /var/samples/monitor2 -type d -ls'''
    277404      4 drwxr-xr-x  7 monitor2 monitor2    4096 Mar  7 13:52 /var/samples/monitor2
    280270    68 drwxr-xr-x  2 monitor2 monitor2    65536 Apr 27 00:00 /var/samples/monitor2/admin
    277405    412 drwxr-xr-x  2 monitor2 monitor2  417792 Apr 27 00:01 /var/samples/monitor2/vlab
    393252    12 drwxr-xr-x  2 monitor2 monitor2    12288 Apr 27 00:00 /var/samples/monitor2/homeserver
    278614    356 drwxr-xr-x  2 monitor2 monitor2  360448 Apr 27 00:00 /var/samples/monitor2/kora
    393224    64 drwxr-xr-x  2 monitor2 monitor2    61440 Apr 27 00:00 /var/samples/monitor2/storecomp
root@nw-syd-monitor2:#
 
=== sample data file extracts ===
 
==== Disk activity ====
 
<div style="overflow: auto; white-space: nowrap; font-family: monospace; border: solid 1px #f0f0f0; background-color: #f8f8f8; padding: 0px; margin: 0px;">root@nw-syd-monitor2:# <b>head /var/samples/monitor2/vlab/20230401-nw-syd-vxdb-disk.dat</b>
11 nvme0n1 1989867008 2570418688 1 0 0 nvme0n1p1 1966918656 2570417152 1 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2288322560 3538283008 1 1 1<br />
41 nvme0n1 1989867008 2573756928 0 0 0 nvme0n1p1 1966918656 2573755392 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2288322560 3540136448 1 1 1<br />
71 nvme0n1 1989867008 2577824256 0 0 0 nvme0n1p1 1966918656 2577822720 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2288322560 3541457920 0 1 1<br />
101 nvme0n1 1989867008 2581551616 0 0 0 nvme0n1p1 1966918656 2581550080 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2293008384 3543464448 0 1 1<br />
131 nvme0n1 1989867008 2586585600 0 0 0 nvme0n1p1 1966918656 2586584064 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2295695360 3553381376 1 1 1<br />
161 nvme0n1 1989944832 2596915712 0 0 0 nvme0n1p1 1966996480 2596914176 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2296547328 3557158912 0 1 1<br />
191 nvme0n1 1989944832 2598357504 0 0 0 nvme0n1p1 1966996480 2598355968 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2296928256 3558146560 0 1 1<br />
222 nvme0n1 1989944832 2600430080 0 0 0 nvme0n1p1 1966996480 2600428544 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2296928256 3571383296 0 1 1<br />
252 nvme0n1 1989944832 2601798144 0 0 0 nvme0n1p1 1966996480 2601796608 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2296928256 3588140544 1 1 1<br />
282 nvme0n1 1989944832 2602273280 0 0 0 nvme0n1p1 1966996480 2602271744 0 0 0 nvme0n1p14 4218880 0 0 0 0 nvme0n1p15 10947072 1536 0 0 0 nvme1n1 2296969216 3607321600 1 1 1<br />
root@nw-syd-monitor2:#</div>
 
Field #1 is the number of seconds since midnight. Fields #2 through #7 (repeated for each device) are:
# Device name
# Cumulative read byte count
# Cumulative write byte count
# ?
# ?
# ?
 
Note that in this case the host is ARM-based (not Intel) and this is reflected in the device names (i.e., they don't look like "sdX").
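Because the read/write counts are cumulative, per-second rates come from differencing consecutive samples; for example, a rough <code>awk</code> sketch for the first device on each line (fields 3 and 4, timestamps in field 1):

 # Print "time read-B/s write-B/s" for the first listed device.
 awk 'NR > 1 { dt = $1 - t; if (dt > 0) printf "%s %.0f %.0f\n", $1, ($3 - r) / dt, ($4 - w) / dt }
      { t = $1; r = $3; w = $4 }' /var/samples/monitor2/vlab/20230401-nw-syd-vxdb-disk.dat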
 
==== User count ====
 
root@nw-syd-monitor2:# '''head /var/samples/monitor2/vlab/20230401-nw-syd-vxdb-usercount.dat'''
26 26
146 25
266 25
386 28
506 29
626 28
746 28
866 28
986 28
1106 29
root@nw-syd-monitor2:#
 
Much simpler this time. Field #1 is the seconds since midnight and field #2 is the instantaneous user count at the time of the sample.
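Such a file is trivially graphed; e.g., a one-line <code>gnuplot</code> invocation along these lines (illustrative only - the real CGI scripts in <code>cgi-bin</code>/<code>plot-bin</code> wrap this in a common page layout):

 # Render one day's user counts as a PNG (gnuplot writes to stdout by default).
 gnuplot -e "set terminal png size 800,400; set xlabel 'seconds since midnight'; set ylabel 'logged-on users'; plot '/var/samples/monitor2/vlab/20230401-nw-syd-vxdb-usercount.dat' using 1:2 with lines title 'nw-syd-vxdb'" > usercount.png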


== datacollectpoll.conf ==

Format:
* Blank lines are ignored.
* '#' to end of line is a comment
* First space-separated field on the line is the command to run. The command must exist at the time the file is loaded
* Subsequent space-separated fields are optional arguments passed to the aforementioned command. Although optional format-wise, the existing scripts take two arguments:
** The poll group
** The host to sample
* An optional ';' (semicolon) followed by an interval in seconds changes the time between samples from the default of 30 seconds to the specified interval


Example:
{|
|
  /usr/local/infrastructure/monitor2/getbin/get_load_stats admin nw-syd-cfengine-hub ; 30
  /usr/local/infrastructure/monitor2/getbin/get_disk_stats admin nw-syd-cfengine-hub ; 30
  /usr/local/infrastructure/monitor2/getbin/get_netif_stats admin nw-syd-cfengine-hub ; 30
  /usr/local/infrastructure/monitor2/getbin/get_load_stats vlab nw-syd-armvx1 ; 30
  /usr/local/infrastructure/monitor2/getbin/get_disk_stats vlab nw-syd-armvx1 ; 30
  /usr/local/infrastructure/monitor2/getbin/get_netif_stats vlab nw-syd-armvx1 ; 30
  /usr/local/infrastructure/monitor2/getbin/get_usercount_stats vlab nw-syd-armvx1 ; 120
  /usr/local/infrastructure/monitor2/getbin/get_load_stats vlab nw-syd-vxdb ; 30
  /usr/local/infrastructure/monitor2/getbin/get_disk_stats vlab nw-syd-vxdb ; 30
  /usr/local/infrastructure/monitor2/getbin/get_netif_stats vlab nw-syd-vxdb ; 30
  /usr/local/infrastructure/monitor2/getbin/get_usercount_stats vlab nw-syd-vxdb ; 120
  ...
|}


== Deleting old data samples ==
 
Data sample files are cleaned out by the <code>getbin/delete_old_samples</code> cron job, run nightly by the monitor2 user.
 
The KEEP_SAMPLES_DAYS setting in <code>/etc/monitor2.conf</code> determines the number of days the data is kept.
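The cleanup boils down to an age-based <code>find</code>; a sketch of what such a nightly job might look like (not the actual <code>delete_old_samples</code> script):

 # Remove per-day sample files older than the configured retention period.
 . /etc/monitor2.conf
 find "$DATADIR" -name '*.dat' -type f -mtime +"$KEEP_SAMPLES_DAYS" -delete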
