Real World Computing
Monitoring with Nagios
The next steps
So you have managed to monitor the machine Nagios is running on, but you need it to monitor other machines across the network. Assuming you have direct access to those machines, there are two ways of doing this, depending on whether the service is visible from the monitoring host. If you want to monitor a remote web server, for example, you can run the 'check_http' plug-in and tell it to inspect the other machine for a web service. However, if you want to monitor something that is not directly observable from outside, such as disk space usage on a remote machine, you need the Nagios Remote Plug-in Executor (NRPE), which runs the plug-ins on the remote machine itself. NRPE comes in two parts: a daemon that runs on the remote machine and a plug-in that runs on the monitoring host. With both parts installed, check commands on the monitoring host use the 'check_nrpe' plug-in to run plug-ins on the remote host and collect their results.
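As a rough illustration of the NRPE round trip, the sketch below shows how the monitoring host might invoke 'check_nrpe' and interpret the result. It assumes the plug-in is on the PATH (on many installations it lives in the Nagios plug-in directory instead), and the host name 'fileserver' and remote command name 'check_disk' are hypothetical: the command name must match one defined in the NRPE daemon's configuration on the remote machine.

```python
import subprocess

# Conventional Nagios plug-in return codes mapped to state names.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_nrpe_command(host, remote_command, plugin="check_nrpe"):
    """Build the argument list for a check_nrpe call.

    'plugin' is an assumed location; adjust it to wherever your
    installation keeps the check_nrpe binary.
    """
    return [plugin, "-H", host, "-c", remote_command]


def run_remote_check(host, remote_command):
    """Run the check and return (state name, first line of plug-in output)."""
    result = subprocess.run(
        build_nrpe_command(host, remote_command),
        capture_output=True,
        text=True,
    )
    # Any return code outside 0-3 is treated as UNKNOWN.
    state = STATES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    return state, (lines[0] if lines else "")


# Hypothetical usage: run_remote_check("fileserver", "check_disk")
```

In a real deployment Nagios itself makes this call through a check command definition; the Python wrapper is only there to show what travels in each direction.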
However, you may be able to monitor a remote server without installing NRPE if that server offers an SNMP (Simple Network Management Protocol) service. Many Unix systems ship with an SNMP daemon, and SNMP is available under Windows as well. If you know your way around SNMP, you can monitor most resources this way, and there are Nagios plug-ins that check specific SNMP-monitored events.
With a bit of thought, you can even write your own plug-ins. A Nagios plug-in is just a program that prints a message describing the current state of a service (say, okay, warning or critical) along with some useful information, and sets its return code according to that state. For example, we have written a plug-in that checks the state of our Dell servers via their OpenManage SNMP interface, which lets us check fan speeds, temperatures and voltages remotely and be notified if there is a problem.
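To make the plug-in contract concrete, here is a minimal sketch of a disk usage check. It relies on the conventional plug-in return codes (0 for okay, 1 for warning, 2 for critical, 3 for unknown); the 80 and 90 per cent thresholds, the message wording and the function names are our own assumptions for illustration, not part of Nagios.

```python
import shutil

# Conventional Nagios plug-in return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (return code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, "DISK CRITICAL - %.1f%% used" % used_percent
    if used_percent >= warn:
        return WARNING, "DISK WARNING - %.1f%% used" % used_percent
    return OK, "DISK OK - %.1f%% used" % used_percent


def main(path="/"):
    """Check one filesystem, print the status line, return the code."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100.0 * usage.used / usage.total
    except OSError as exc:
        # If the measurement itself fails, report UNKNOWN rather than guess.
        print("DISK UNKNOWN - %s" % exc)
        return UNKNOWN
    code, message = evaluate(used_percent)
    print(message)
    return code
```

Saved as a script, it would finish by passing the value returned by main() to sys.exit(), so that Nagios sees the state in the process return code and the message on standard output.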
More advanced topics
There are two main areas we have not yet covered: how do you monitor Windows machines, and how do you monitor a machine you cannot access directly? There are various ways of monitoring Windows machines, but the one we chose was NC_Net, which comes with its own plug-in to be run on the monitoring host and an easy-to-install package for the Windows machines. Once it is up and running, you can check CPU load, uptime, disk space, processes and various other Windows metrics.
Monitoring machines that you cannot access seems, at first glance, like a mildly stupid idea, but the need arises all the time. We are talking about machines behind firewalls, which can probably talk to the monitoring host, but which the monitoring host may not be able to reach directly. To make this work, you have to set up slightly different checks and deploy another piece of software called Nagios Service Check Acceptor (NSCA). The service checks we have discussed so far are 'active' checks, which means Nagios runs them itself at regular intervals using plug-ins. We now want to define 'passive' checks, where Nagios processes the results but does not actually do the checking. To do this, you run another copy of Nagios behind the firewall that performs the active checking. This copy does not normally carry out notifications but forwards its results to the main Nagios monitoring host using NSCA. The main instance then processes the service checks it receives as though it had generated them itself and issues the appropriate notifications.
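To give an idea of what is forwarded, the sketch below formats one service check result in the tab-separated form that the NSCA client program, send_nsca, reads on its standard input by default: host name, service description, return code and plug-in output. The function name and the example host and service names are hypothetical, and the delimiter can be changed in a real installation.

```python
def format_nsca_line(host, service, return_code, output):
    """Build one tab-separated service check result for send_nsca."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    # Tabs or newlines in the text would break the record, so collapse
    # every run of whitespace to a single space.
    clean = " ".join(output.split())
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, clean)


# Hypothetical usage: the resulting line would be piped to send_nsca.
line = format_nsca_line("web01", "Disk Space", 1, "DISK WARNING - 85.0% used")
```

The host name and service description have to match a host and a passive service defined on the main monitoring host, otherwise the result has nothing to attach to.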
Before you ask, Nagios can tell how recently it received these service checks from behind the firewall, and will warn you that there may be a problem if the information it holds has gone stale. You can also use passive service checks to handle events. For example, your backup software might send an SNMP trap if there is no tape drive online. Such events can be fed into Nagios as passive check results, which then trigger notifications.
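Feeding an event into Nagios on the monitoring host itself usually means writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The sketch below only builds that line; the function name and the 'backup01' and 'Tape Drive' names are hypothetical, and the location of the command file depends on your installation, so it is not hard-coded here.

```python
import time


def format_external_command(host, service, return_code, output, now=None):
    """Build a passive service check result in external command format.

    'now' lets the caller supply a Unix timestamp; by default the current
    time is used.
    """
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


# Hypothetical event: a trap handler reporting a missing tape drive.
command = format_external_command(
    "backup01", "Tape Drive", 2, "No tape drive online", now=1000000000)
```

A trap handler would append the resulting line to the command file, and Nagios would treat it like any other passive result for that service.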





