Digitising the British Library
Posted on 17 Jan 2008 at 11:42
All of the pages - whether text- or image-based - are scanned at a resolution of 300dpi, which is way below the maximum resolution of even today's budget home scanners, but more than sufficient for this project. "We wouldn't have done much better at 600dpi - it [300dpi] meets the requirements for the project," says Fitzgerald. "Most people will read [the pages] online, but even at this resolution we could offer print-on-demand in the future."
Old character recognition
All of the books scanned during the project will be output in three formats: JPEG 2000 (the successor to JPEG, which compresses more efficiently and can also store images losslessly), PDF and plain text. To get to plain text - crucial both for making the content of the book fully searchable and for ordering the pages correctly by page number - the scanned pages have to be passed through optical character recognition (OCR) software.
As anyone who's used OCR software on even a modern printed page will testify, the process is far from flawless, and scanning 200-year-old texts presents problems of its own. "Words and spellings have changed over time," says Fitzgerald, so a search for "jail" could return no results from a book that uses the old-fashioned "gaol", for example. There are typographical hurdles, too: the long "s" of yesteryear is often confused with an "f" by the OCR software. Consequently, the scanned images are specially processed to give the software the best possible chance of success. "There are improvements you can make to OCR by making post-capture improvements to the images," claims Fitzgerald.
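To make the two search problems described above concrete, here is a minimal Python sketch of one way a search layer could cope with them. It is not the Library's or CCS's actual method: the word list, the spelling-variant table and the function names are all hypothetical, invented purely for illustration. The idea is to try swapping a misread "f" back to "s" when a word isn't in a known lexicon, and to expand a query with its old-fashioned spellings.

```python
from itertools import product

# Hypothetical word list and variant table; real ones would be far larger.
LEXICON = {"congress", "gaol", "jail", "fish", "the", "of", "possession"}
SPELLING_VARIANTS = {"jail": {"gaol"}, "gaol": {"jail"}}


def fix_long_s(word, lexicon=LEXICON):
    """Undo a long-s misread by trying 's' in place of each 'f'."""
    lower = word.lower()
    if lower in lexicon or "f" not in lower:
        return lower
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    # Try every combination of keeping 'f' or swapping it for 's'.
    for choice in product("fs", repeat=len(positions)):
        letters = list(lower)
        for pos, ch in zip(positions, choice):
            letters[pos] = ch
        candidate = "".join(letters)
        if candidate in lexicon:
            return candidate
    return lower  # no known word found, so leave the OCR output alone


def expand_query(term):
    """Return the search term plus any known historical spellings."""
    term = term.lower()
    return {term} | SPELLING_VARIANTS.get(term, set())


def search(pages, term):
    """Return the page numbers whose OCR text contains the term or a variant."""
    wanted = expand_query(term)
    hits = []
    for number, text in sorted(pages.items()):
        words = {fix_long_s(w.strip(".,;:!?\"'()")) for w in text.split()}
        if wanted & words:
            hits.append(number)
    return hits
```

With these assumptions, a search for "jail" would also find a page containing "gaol", and a page where the OCR produced "Congrefs" would still match "congress". Trying every f/s combination grows quickly for words with many "f"s, so a production system would need something smarter.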
Scans are batch-checked by Library staff.
The OCR documents have to pass an accuracy threshold set by the Library, although this level varies depending on the age of the book involved: titles from the early 1800s have a lower "OCR confidence level" than later tomes. "As long as there's enough OCR to retain some detail of the book, we will retain it," says Fitzgerald.
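As a rough illustration of an age-dependent threshold like the one described above, the Python sketch below accepts a book if the average per-word confidence reported by the OCR engine meets a minimum that depends on the publication year. The year bands, the threshold values and the function names are assumptions made up for this example; the Library's real figures aren't given here.

```python
from statistics import mean

# Hypothetical bands: (first year the band applies to, minimum mean confidence).
# Older books get a more forgiving threshold.
THRESHOLDS = [(1800, 0.60), (1850, 0.75)]


def required_confidence(year):
    """Pick the minimum confidence for a book published in the given year."""
    required = THRESHOLDS[0][1]
    for start, minimum in THRESHOLDS:
        if year >= start:
            required = minimum
    return required


def passes_ocr_check(year, word_confidences):
    """Return True if the book's mean OCR confidence meets its threshold."""
    if not word_confidences:
        return False  # no recognised words at all
    return mean(word_confidences) >= required_confidence(year)
```

Under these made-up numbers, the same set of confidence scores could pass for a title from 1810 and fail for one from 1870.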
Fitzgerald admits the current system is far from perfect, but says there's hope for future improvements: "If there's an advance in OCR, we may go back and rescan the images."
Final checks
The OCR accuracy isn't the only element of the process to go through a rigorous quality check. As well as the operator manually checking the integrity of the scans at the lectern, CCS has a foreign office that undertakes a secondary check of the files. "If there's a bad image, or the [automatic] page recognition hasn't been detected... we have an extensive quality-assurance process in Romania," says Helle.
But this does raise doubts: you have to wonder if staff who presumably only have English as a second language are the best people to be checking a digital library of British literature. CCS, itself a German company, insists all its staff are proficient in English.
The Library itself provides an extra safety net by batch-sampling the files delivered from CCS in line with ISO 2859-1, the standard that sets out sampling procedures for inspection by attributes. "It's impossible to individually examine every page delivered to the Library," says Fitzgerald. "We're very happy with the quality and consistency of the files being delivered to us."
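The general idea behind this kind of acceptance sampling can be sketched in a few lines of Python: draw a random sample from a delivered batch, count the defective pages and accept the batch only if the count stays at or below an agreed limit. This is an illustration rather than the Library's procedure. The function and parameter names are hypothetical, and in practice the sample size and acceptance number would be read from the standard's tables for the batch size and quality level in use, not chosen by hand.

```python
import random


def accept_batch(pages, is_defective, sample_size, acceptance_number, seed=None):
    """Inspect a random sample and accept the batch if defects are few enough.

    pages             -- list of delivered page files (any objects)
    is_defective      -- function that inspects one page and returns True/False
    sample_size       -- how many pages to inspect (assumed to come from tables)
    acceptance_number -- maximum number of defects allowed in the sample
    """
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    defects = sum(1 for page in sample if is_defective(page))
    return defects <= acceptance_number
```

A rejected batch would then go back for a full check or a rescan, which is how sampling lets a small team vouch for a delivery without opening every file.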
The Library estimates that 30TB of storage will be required to accommodate the entire output of the collection when it's finished in late 2009. And unlike its unique collection of books, the Library can ensure that copies of the scanned books are kept across multiple locations. The digital library is currently backed up offsite in Boston Spa, Yorkshire, and a site in Wales will soon be added.

