Real World Computing
Where did the speed go?
Recognise this scenario? You buy a new computer to host some application, not a top-of-the-range machine but one that everyone agrees is powerful enough for the job. It runs like a dream for a while, then it all goes pear-shaped and the finger-pointing starts: "Application's fault"; "Admin's fault"; "Wrong disks in it..." The graphing tools are broken out, everyone has a theory and the machine takes the blame. But if the machine really is to blame, where did all that performance you bought go? When the packages come from multiple vendors this can be really difficult to uncover, and the traditional tools don't always help.
So in this month's column, I'll examine how to analyse machine performance and how to use the results to diagnose problems. I'll mostly be talking about DTrace running on Solaris and Macs, but before the rest of you slink off to another column, this utility will end up on Linux soon - and something like it is bound to be copied by Windows, too - so read on and admire the abstract beauty of this solution!
Analysing system performance means first getting the right information and then understanding what it means, the trouble being that you very quickly wind up with too much information that's too difficult to analyse. The simplest performance analysis is straightforward observation, typified by ill-informed comments such as, "When the application was slowing down the disk drive/network/street lights kept flickering" or "When the app was running slowly, there seemed to be more fan noise". All these comments could be true (even the street lights), but they lack scientific rigour - flashing lights might distinguish a busy disk from an idle one, but I defy anyone to distinguish busy from very busy that way.
The next step is to use standard operating system utilities - such as Windows Performance Monitor or Unix's "top" - to investigate resource utilisation and find any bottlenecks. The figures need interpreting, though. For example, if an application is using 90% of the CPU, the machine may not be fast enough, and that might be what's slowing down the application; or it could be that the other 10% of the time it's waiting for something else to finish and that's the problem; or, worse still, a poorly written app is consuming CPU time while it waits for something else to finish.
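To see why a bare utilisation figure can't separate those last two cases, here's a minimal Python sketch. It isn't taken from any of the tools mentioned above, and the function names (measure, blocking_wait, busy_wait) are made up for illustration. It times the same 0.2-second wait done two ways and compares the CPU time the process consumed with the wall-clock time that elapsed.

```python
import time


def measure(workload):
    """Run workload once; return (wall_seconds, cpu_seconds, cpu_share)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    share = cpu / wall if wall > 0 else 0.0
    return wall, cpu, share


def blocking_wait(seconds=0.2):
    # Well-behaved wait: the process sleeps and uses almost no CPU.
    time.sleep(seconds)


def busy_wait(seconds=0.2):
    # Badly behaved wait: polls the clock in a tight loop, burning CPU
    # while getting no useful work done.
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


if __name__ == "__main__":
    for name, fn in (("blocking_wait", blocking_wait), ("busy_wait", busy_wait)):
        wall, cpu, share = measure(fn)
        print(f"{name:14s} wall={wall:.3f}s cpu={cpu:.3f}s cpu_share={share:.0%}")
```

Both functions take about the same wall-clock time, but on an otherwise idle machine the busy-waiting one should show a CPU share close to 100%. To a tool such as "top" it looks just like an application doing real work.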
Some well-known metrics are almost designed to mislead you, the worst example in the Unix world being the "iowait" time, the percentage of its time the CPU has spent idle, waiting for input/output (I/O) to complete. For a simple system where a single application talks to a disk drive, this figure might make sense, but in more complex systems it soon becomes nonsense. A system has to wait for slow devices precisely because they're slow, but how can you tell whether it's waiting for something that's supposed to be slow or something that isn't? Indeed, the confusion caused by users misunderstanding iowait led Sun to abandon the metric altogether in Solaris 10, which now always reports it as zero! This caught me out when I was trying to work out why a Solaris 9 system seemed to be running more slowly than a Solaris 10 system, not realising that the two weren't reporting the same thing.
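Here's a rough illustration of how slippery iowait is, again in Python. It assumes the Linux /proc/stat layout, where the aggregate "cpu" line lists cumulative ticks for user, nice, system, idle and iowait in that order, rather than anything Solaris-specific. It ignores the later fields (irq, softirq and so on) for simplicity. The sample lines are synthetic, invented purely for the example, and the function names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait")


def parse_cpu_line(line):
    """Parse the first five counters of a Linux-style aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected a line starting with 'cpu'")
    return dict(zip(FIELDS, (int(v) for v in parts[1:6])))


def percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken in order and differ")
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


# Synthetic samples (made-up numbers, not measurements).
START = "cpu 1000 0 500 8000 500"
# An I/O-bound application running alone: the CPU sits idle waiting on the disk.
IO_APP_ALONE = "cpu 1100 0 550 8050 1300"
# The same slow I/O, but a CPU-hungry job now soaks up the idle time.
IO_APP_PLUS_HOG = "cpu 1900 0 550 8050 500"

if __name__ == "__main__":
    for label, sample in (("alone", IO_APP_ALONE), ("with hog", IO_APP_PLUS_HOG)):
        p = percentages(START, sample)
        print(f"{label:9s} user={p['user']:.0f}% iowait={p['iowait']:.0f}%")
```

In the first sample iowait comes out at 80%, and in the second at 0%, even though the disk is assumed to be just as slow in both. Because iowait is carved out of idle time, anything that keeps the CPU busy makes it vanish, which is why the figure says little about the state of the I/O subsystem on a busy machine.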
If some of these operating system utilities aren't very good, and some of the statistics they provide can be misleading, which of them are useful? The truth is that none of them should be considered in isolation - they all measure different things and they all work at a very high level, which may not be the level where your real problem is located. In general, to cure a symptom is not to cure the underlying disease. For example, an application might run slowly because of some disk-related performance issue. Putting in faster disks may alleviate the symptom and provide a quick fix, but it doesn't fix the real problem, which lies within the application, is still there and may return.
