
PCPro-Computing in the Real World Printed from www.pcpro.co.uk

Analysis

Digitising the British Library

Posted on 17 Jan 2008 at 11:42

All of the pages - whether text- or image-based - are scanned at a resolution of 300dpi, which is way below the maximum resolution of even today's budget home scanners, but more than sufficient for this project. "We wouldn't have done much better at 600dpi - it [300dpi] meets the requirements for the project," says Fitzgerald. "Most people will read [the pages] online, but even at this resolution we could offer print-on-demand in the future."

Old character recognition

All of the books scanned during the project will be output in three formats: JPEG 2000 (the successor to JPEG, which also supports lossless compression), PDF and plain text. To get to plain text - crucial if the content of each book is to be fully searchable and its pages correctly ordered by page number - the scanned pages have to be passed through optical character recognition (OCR) software.

As anyone who's used OCR software on even a modern printed page will testify, the process is far from flawless. Scanning 200-year-old texts, however, presents problems of its own. "Words and spellings have changed over time," says Fitzgerald. A search for "jail", for example, could return no results from a book that uses the old-fashioned "gaol". There are typographical hurdles, too: the long "s" of yesteryear is often confused with an "f" by the OCR software. Consequently, the scanned images are specially processed to give the software the best possible chance of success. "There are improvements you can make to OCR by making post-capture improvements to the images," claims Fitzgerald.
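One way to soften both search problems is to expand the query rather than correct the OCR text. The sketch below is an illustration only, not the Library's method: the variant table, the function names and the rule that only a non-final "s" can be misread as "f" are all assumptions made for the example.

```python
from itertools import product

# Hypothetical table of modern -> historical spellings (assumption, far from complete).
SPELLING_VARIANTS = {"jail": {"gaol"}, "show": {"shew"}}


def long_s_misreads(word):
    """Return the word plus every version where a non-final 's' was read as 'f'."""
    options = []
    for i, ch in enumerate(word):
        if ch == "s" and i < len(word) - 1:
            options.append(("s", "f"))  # long s may have been recognised as f
        else:
            options.append((ch,))
    return {"".join(combo) for combo in product(*options)}


def expand_query(term):
    """Expand a search term with old spellings and likely long-s misreads."""
    term = term.lower()
    spellings = {term} | SPELLING_VARIANTS.get(term, set())
    expanded = set()
    for spelling in spellings:
        expanded |= long_s_misreads(spelling)
    return expanded


def search(pages, term):
    """Return the page numbers whose OCR text contains any expanded variant."""
    variants = expand_query(term)
    return [
        number
        for number, text in sorted(pages.items())
        if variants & set(text.lower().split())
    ]
```

With pages such as `{1: "sent to the gaol", 2: "the prifon gate"}`, a search for "jail" would match page 1 and a search for "prison" would match page 2. The matching is deliberately naive: it splits on whitespace and ignores punctuation.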

Scans are batch-checked by Library staff.

The OCR documents have to pass an accuracy threshold set by the Library, although this level varies depending on the age of the book involved: titles from the early 1800s are held to a lower "OCR confidence level" than later tomes. "As long as there's enough OCR to retain some detail of the book, we will retain it," says Fitzgerald.
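An age-dependent threshold of this kind could be expressed as a simple lookup. The article gives no figures, so the cut-off year and the two confidence values below are placeholders invented for the sketch, as are the function names.

```python
def required_confidence(publication_year):
    """Return the minimum OCR confidence for a book of the given year.

    The year boundary and both values are placeholders, not the Library's figures.
    """
    return 0.60 if publication_year < 1850 else 0.80


def passes_ocr_check(publication_year, ocr_confidence):
    """Accept the OCR output if it meets the threshold for the book's age."""
    return ocr_confidence >= required_confidence(publication_year)
```

Under these assumed values, OCR output scoring 0.65 would be accepted for an 1810 title but rejected for one from 1890.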

Fitzgerald admits the current system is far from perfect, but says there's hope for future improvements: "If there's an advance in OCR, we may go back and rescan the images."

Final checks

The OCR accuracy isn't the only element of the process to go through a rigorous quality check. As well as the operator manually checking the integrity of the scans at the lectern, CCS has a foreign office that undertakes a secondary check of the files. "If there's a bad image, or the [automatic] page recognition hasn't been detected... we have an extensive quality-assurance process in Romania," says Helle.

But this does raise doubts: you have to wonder if staff who presumably only have English as a second language are the best people to be checking a digital library of British literature. CCS, itself a German company, insists all its staff are proficient in English.

The Library itself provides an extra safety net by batch-sampling the files delivered from CCS in line with ISO 2859-1, the standard covering sampling procedures for inspection by attributes. "It's impossible to individually examine every page delivered to the Library," says Fitzgerald. "We're very happy with the quality and consistency of the files being delivered to us."
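Batch sampling of this sort boils down to inspecting a random sample and accepting the whole batch if the number of defects stays within a limit. The sketch below shows that general idea only. In practice the sample size and acceptance number would be read from the tables in ISO 2859-1; here they are plain parameters, and `is_defective` is a hypothetical stand-in for whatever check the Library applies.

```python
import random


def accept_batch(pages, is_defective, sample_size, acceptance_number, seed=None):
    """Inspect a random sample and accept the batch if defects are within limit.

    sample_size and acceptance_number are supplied by the caller; this sketch
    does not reproduce the ISO 2859-1 tables.
    """
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    defects = sum(1 for page in sample if is_defective(page))
    return defects <= acceptance_number
```

A batch that fails would typically be sent back for rework or checked in full, although the article does not say which approach the Library takes.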

The Library estimates that 30TB of storage will be required to accommodate the entire output of the collection when it's finished in late 2009. And unlike its unique collection of books, the Library can ensure that copies of the scanned books are kept across multiple locations. The digital library is currently backed up offsite in Boston Spa, Yorkshire, and a site in Wales will soon be added.
