Computing in the real world
SEARCH FOR: IN:
Guest  Level 00    Register Log in

Real World Computing

Where did the speed go?

5th March 2008 [PC Pro]

It was obvious there was a problem and there seemed to be a lot of disk activity, but what was that activity achieving? By tracking with dtrace, I discovered that Apple Mail was putting an awful lot of activity through one file: it turns out that Mail uses an open-source SQL database module called SQLite to track its mail messages, and mine had become distinctly less than optimal. A quick check on Google for more information about this file yielded a command to optimise it and once again I could read my emails - hurrah!

My second example concerns a much bigger problem. My company operates a collection of websites out of two hosting centres, which are connected by a 100Mb/sec link. In the hosting centre is a load-balanced web server farm that talks to a back-end database server, which is a quad-socket Opteron Sun with lots of memory and a dual-controller EMC fibre channel RAID array. These servers run MySQL and are duplicated so that an update in one centre gets copied to the other centre and runs there, too.

A few weeks ago, the performance of one of these systems began to routinely fall off a cliff, with web pages that would normally take tenths of seconds to create taking several seconds instead. The front-end web servers were all fine, barely breaking sweat, and to all intents and purposes the back-end databases looked fine, too. The obvious tools such as "top" yielded no useful information, and tweaking MySQL's performance variables didn't yield any major improvements, particularly as reproducing the problem was very difficult. There had to be another cause. The most suggestive aspect of the problem was that it affected only one hosting centre, so something was happening in one centre but not in the other. It's true that different websites did run out of the different centres, but all database write transactions were duplicated.

Ironically, the first hint came from the humble "iostat" I criticised earlier. It showed me that one particular device was very busy - up to 90% load - and that the average wait time was longer than I'd have liked. It was a real surprise that we were soaking up all of the 2Gb/sec bandwidth of fibre channel and the RAID. Without dtrace, we'd now have been pretty much on our own, down to pure guesswork. We needed to balance up the I/O load between the controllers in the RAID unit, but how were we going to do that? And could we really be sure what was causing it?

This is where dtrace really came into its own. The first thing I tried was using "rfileio" from the dtrace toolkit to find out the identity of the really busy files, which told me that a particular type of table in the databases was getting hit a lot, but what sort of traffic was it being hit by? Entering a script from the MySQL dtrace toolkit (available from www.pcpro.co.uk/links/163open) let me monitor a particular call in the MySQL database server (using the process provider mentioned above) and therefore told me what SQL queries the database was running - and it was running a lot of them. A pattern soon emerged in which I could see that there was a particular case where two queries were being run where one would have done, and it also exposed a logic problem.

The case of these two queries - which could be replaced with one - was no fault of the developer. What was happening was that a query was made to fetch one column from a table and then - very soon after - another column from the same table. I could see both these queries executing next to each other, which enabled me to merge them, a technique rather like peephole optimisation in compilers where redundant statements can be removed. Doing this work allowed me to completely transform the instantaneous performance of the application (how it performs normally) in addition to its performance under heavy load. One particular operation that routinely took more than a second is now performing consistently below 0.4 seconds, or roughly three times faster. It's also shown me some weaknesses in an application that will let me improve it.

Continued....