Monitoring with Nagios
Posted on 29 Mar 2005 at 15:32
Simon Brock relates the woes of hardware problems on a Friday night, and provides a solution for system monitoring
Last Friday wasn't a particularly bad day for us, but it certainly was for one of our clients. The previous evening, half their network had disappeared, at least from where I was sitting, and the next morning they told us a network switch had broken and been replaced. Later that afternoon, some of their services disappeared again and we discovered a router was wrongly configured. It was six o'clock on Friday evening by now and no-one was answering the phone, so we sent them an email and went down the pub. An hour later, I received a call on my mobile phone; I was told the router was fixed and they asked me to check everything was working. Normally, we would ask if it could wait until we got home or returned to the office, but we had just moved all our monitoring over to a piece of software called Nagios, which has a WAP interface. We checked that all their services were working from the comfort of the pub.
However, Nagios is much more than a belated justification for WAP phones. One day, we woke up to discover we had more servers than we knew what to do with and that our business depended on machines we weren't monitoring very well. How could we find out what all those servers were doing, and know what had gone wrong and when? We already ran a bunch of scripts on various machines to check matters like disk space usage, but these were becoming increasingly hard to maintain and didn't check enough things. We also needed to monitor service quality as well as availability. Our company creates and hosts websites, and we need to know that pages are being served quickly from both inside and outside our own network.
We had to take a decision between writing more software ourselves, finding an open-source solution or buying in monitoring software. The first route wasn't really viable, as we needed a cross-platform solution to monitor both Windows and Unix servers. We could write easily enough for Unix, but weren't confident about writing for Windows. There is commercial software out there that seemed to do what we wanted, but it is often expensive (we hate to spend more on monitoring software than we paid for the server itself). That left open-source as our only option.
We soon uncovered a large collection of open-source monitoring software. However, unlike some applications where you know which solution is best - for example, the dominant open-source web server is Apache - in network monitoring there is no clear winner. We had to devise a way to choose, so we applied these three criteria, which we tend to use with all open-source solutions:
1. Does it work with what we have already?
2. Do other people use it?
3. Are many people working on it?
On the first point, we needed to monitor both Unix and Windows servers, and to monitor those parameters that interest us: resource usage (CPU and disk); service availability (is the web server up?); and service quality (how fast is this page generated?). It would also be good if we could monitor the hardware, to check that it is not getting too hot. We didn't want to have to install much other software either. For example, we use MySQL for virtually all our database work and didn't want to install a different database server (say, PostgreSQL) just to monitor the network.
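To make those kinds of measurement concrete, here is a minimal Python sketch of a resource check (disk usage) and a combined availability and quality check (page fetch time). It is not taken from our own scripts or from Nagios itself: the function names and the threshold values are hypothetical, chosen only for illustration. The one thing borrowed from Nagios is its plugin convention of reporting a state as 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
import shutil
import time
import urllib.request

# State codes following the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warn, crit):
    """Map a measurement to a state; higher values are treated as worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_disk(path=".", warn=80.0, crit=90.0):
    """Resource usage: return (state, percent_used) for the filesystem holding path.

    The 80/90 per cent thresholds are illustrative assumptions.
    """
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total
    return classify(percent, warn, crit), percent


def check_page(url, warn=1.0, crit=3.0, timeout=10.0):
    """Availability and quality: return (state, seconds) for fetching url.

    A failed fetch counts as CRITICAL, with None for the timing.
    The 1 and 3 second thresholds are illustrative assumptions.
    """
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return CRITICAL, None
    elapsed = time.perf_counter() - start
    return classify(elapsed, warn, crit), elapsed
```

Calling `check_disk("/")` on a Unix server, or `check_page` with the address of one of your own pages, returns a state code alongside the raw measurement, so the same result can drive both an alert and a graph.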
The second and third points about other users and developer status are equally important. On open-source sites like SourceForge, you will see a lot of projects that seem to do what you want, but either aren't used by anyone else or have had no development done on them for years. For software you are going to rely on, it is important to know there is a community that can help. A good way to check is to examine any mailing list that comes with the package. If there is not one, or it shows no postings for a year or so, there may be a reason.