Digitising the British Library

17th January 2008 [PC Pro]

Around 1% of the pages scanned will be fold-outs, often containing illustrations or diagrams, that can't be scanned by the conventional machine. The operator makes a note of such pages on the computer system, and after the book is completed the fold-outs are scanned on a separate, larger overhead scanner to ensure all the pages in the book are retained. The computer software later integrates the separate fold-out files with the rest of the book pages.
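The article doesn't say how the software slots the fold-outs back in, but the idea can be illustrated with a minimal Python sketch. It assumes each scan is filed under its position in the book and that a fold-out scan takes the place of the flagged page; the function name, file names and data layout are all hypothetical, not the project's own.

```python
def merge_foldouts(main_scans, foldout_scans):
    """Return the scan files in book order.

    main_scans: dict of page position -> file from the standard scanner
    foldout_scans: dict of page position -> file from the overhead scanner
    (Both layouts are assumptions made for this illustration.)
    """
    positions = sorted(set(main_scans) | set(foldout_scans))
    # Where a fold-out scan exists for a position, use it;
    # otherwise fall back to the standard scan.
    return [foldout_scans.get(pos, main_scans.get(pos)) for pos in positions]


# Page 3 was flagged as a fold-out and scanned separately.
main = {1: "p001.jp2", 2: "p002.jp2", 4: "p004.jp2"}
foldouts = {3: "fold003.jp2"}
book = merge_foldouts(main, foldouts)
```

Keying both sets of scans by page position means the merge is just a sort, which is why the operator's note of where each fold-out belongs matters so much.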

Making sure the fold-out images are inserted back in the right place isn't the only challenge pictures present - the scanning environment also has to be monitored to ensure the colours are reproduced accurately. "Just one degree in temperature changes the light tuning and requires colour adjustments," says Helle. Consequently, there's no natural daylight in the air-conditioned, restricted-access bunker in which the scanning takes place, and all the scans made while we photographed the equipment had to be discarded to ensure the flashes didn't distort the images.

All of the pages - whether text- or image-based - are scanned at a resolution of 300dpi, which is way below the maximum resolution of even today's budget home scanners, but more than sufficient for this project. "We wouldn't have done much better at 600dpi - it [300dpi] meets the requirements for the project," says Fitzgerald. "Most people will read [the pages] online, but even at this resolution we could offer print-on-demand in the future."

Old character recognition

All of the books scanned during the project will be output in three formats: JPEG 2000 (the less-compressed JPEG successor), PDF and plain text. To get to plain text - crucial for the content of the book to be fully searchable and for the books to be correctly ordered by page number - the scanned pages have to be passed through optical character recognition (OCR) software.

As anyone who's used OCR software on even a modern printed page will testify, the process is far from flawless. However, scanning 200-year-old texts presents its own unique problems. "Words and spellings have changed over time," says Fitzgerald, so a search for "jail" could return no results from a book that uses the old-fashioned "gaol", for example. There are typographical hurdles, too: the long "s" of yesteryear is often confused with an "f" by the OCR software. Consequently, the scanned images are specially processed to give the software the best possible chance of success. "There are improvements you can make to OCR by making post-capture improvements to the images," claims Fitzgerald.
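To make the two problems concrete, here is a rough Python sketch of how they might be tackled in the text rather than in the image. This is not the Library's approach, which works on the scans before OCR; the variant table, the vocabulary and both function names are hypothetical.

```python
import itertools

# Hypothetical table mapping modern spellings to historical variants.
VARIANTS = {"jail": {"gaol"}}


def expand_query(term):
    """Return the search term plus any known historical spellings."""
    term = term.lower()
    return {term} | VARIANTS.get(term, set())


def fix_long_s(word, vocabulary):
    """Undo a long-s misread by swapping 'f' for 's' until a known word appears.

    Words already in the vocabulary are returned unchanged, as are words
    for which no substitution produces a known word.
    """
    if word in vocabulary:
        return word
    f_positions = [i for i, ch in enumerate(word) if ch == "f"]
    # Try replacing one 'f', then two, and so on.
    for count in range(1, len(f_positions) + 1):
        for combo in itertools.combinations(f_positions, count):
            chars = list(word)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate in vocabulary:
                return candidate
    return word


vocabulary = {"house", "safe", "fish"}
terms = expand_query("Jail")
repaired = [fix_long_s(w, vocabulary) for w in ["houfe", "fafe", "fish"]]
```

A dictionary check like this only helps when the misread word isn't itself a real word, which is one reason cleaning up the image before OCR is the more attractive route.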

Scans are batch-checked by Library staff.

The OCR documents have to pass an accuracy threshold set by the Library, although this level varies depending on the age of the book involved: titles from the early 1800s have a lower "OCR confidence level" than later tomes. "As long as there's enough OCR to retain some detail of the book, we will retain it," says Fitzgerald.
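The Library's actual thresholds aren't given, but an age-dependent check of this kind could be sketched in Python as follows. The cut-off year and both confidence values are invented purely for illustration, as are the function names.

```python
def required_confidence(year):
    """Return the minimum OCR confidence for a book published in `year`.

    The 1830 cut-off and the 0.60 / 0.80 values are hypothetical;
    the article does not state the Library's real figures.
    """
    return 0.60 if year < 1830 else 0.80


def passes_ocr_check(year, confidence):
    """True if the OCR output meets the bar for a book of that age."""
    return confidence >= required_confidence(year)


early = passes_ocr_check(1810, 0.65)  # older book, lower bar
later = passes_ocr_check(1880, 0.65)  # same score, later book, higher bar
```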

Fitzgerald admits the current system is far from perfect, but says there's hope for future improvements: "If there's an advance in OCR, we may go back and rescan the images."

Final checks

The OCR accuracy isn't the only element of the process to go through a rigorous quality check. As well as the operator manually checking the integrity of the scans at the lectern, CCS has a foreign office that undertakes a secondary check of the files. "If there's a bad image, or the [automatic] page recognition hasn't been detected... we have an extensive quality-assurance process in Romania," says Helle.
