Digitising the British Library
Posted on 17 Jan 2008 at 11:42
All of the pages - whether text- or image-based - are scanned at a resolution of 300dpi, which is way below the maximum resolution of even today's budget home scanners, but more than sufficient for this project. "We wouldn't have done much better at 600dpi - it [300dpi] meets the requirements for the project," says Fitzgerald. "Most people will read [the pages] online, but even at this resolution we could offer print-on-demand in the future."
Old character recognition
All of the books scanned during the project will be output in three formats: JPEG 2000 (the successor to JPEG, which also supports lossless compression), PDF and plain text. To get to plain text - crucial for the content of the book to be fully searchable and for the books to be correctly ordered by page number - the scanned pages have to be passed through optical character recognition (OCR) software.
As anyone who's used OCR software on even a modern printed page will testify, the process is far from flawless, and 200-year-old texts present problems of their own. "Words and spellings have changed over time," says Fitzgerald, so a search for "jail" could return no results from a book that uses the old-fashioned "gaol", for example. There are typographical hurdles, too: the long "s" of yesteryear is often confused with an "f" by the OCR software. Consequently, the scanned images are specially processed to give the software the best possible chance of success. "There are improvements you can make to OCR by making post-capture improvements to the images," says Fitzgerald.
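To see how a search tool might cope with both problems, here is a minimal Python sketch. It is not the Library's or CCS's method: the variant table, the function names and the rule that every non-final "s" is misread as "f" are all assumptions made for illustration, and real long-s typography followed looser conventions.

```python
# Hypothetical table of modern spellings and their older forms.
# A real system would draw this from a historical lexicon.
SPELLING_VARIANTS = {
    "jail": {"gaol"},
    "show": {"shew"},
}


def long_s_misread(word):
    """Return the word as OCR might read it if every non-final 's'
    (printed as a long s) were mistaken for an 'f'. Simplified rule."""
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def expand_query(term):
    """Return the set of strings to search for, given one search term."""
    term = term.lower()
    forms = {term} | SPELLING_VARIANTS.get(term, set())
    # Add the likely OCR misreading of each form.
    forms |= {long_s_misread(form) for form in forms}
    return forms
```

Searching for every string returned by `expand_query("jail")` would then also match pages that print "gaol", at the cost of a few false positives.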
Scans are batch-checked by Library staff.
The OCR documents have to pass an accuracy threshold set by the Library, although this level varies depending on the age of the book involved: titles from the early 1800s have a lower "OCR confidence level" than later tomes. "As long as there's enough OCR to retain some detail of the book, we will retain it," says Fitzgerald.
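An age-dependent threshold of this kind could be expressed roughly as follows. This is only a sketch: the Library's actual confidence levels and date bands are not given, so the figures, the date bands and the function names below are hypothetical, as is the choice to average per-word confidence scores.

```python
def required_confidence(year):
    """Minimum mean OCR confidence for a book published in `year`.
    The bands and values are invented for illustration."""
    if year < 1830:
        return 0.60
    if year < 1870:
        return 0.75
    return 0.85


def passes_ocr_check(year, word_confidences):
    """Accept a book if its mean per-word confidence meets the
    threshold for its age. An empty result never passes."""
    if not word_confidences:
        return False
    mean = sum(word_confidences) / len(word_confidences)
    return mean >= required_confidence(year)
```

With these made-up numbers, the same OCR output could pass for an 1810 title yet fail for one printed in 1890.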
Fitzgerald admits the current system is far from perfect, but says there's hope for future improvements: "If there's an advance in OCR, we may go back and rescan the images."
Final checks
The OCR accuracy isn't the only element of the process to go through a rigorous quality check. As well as the operator manually checking the integrity of the scans at the lectern, CCS has a foreign office that undertakes a secondary check of the files. "If there's a bad image, or the [automatic] page recognition hasn't been detected... we have an extensive quality-assurance process in Romania," says Helle.
But this does raise doubts: you have to wonder whether staff who presumably speak English only as a second language are the best people to be checking a digital library of British literature. CCS, itself a German company, insists all its staff are proficient in English.
The Library itself provides an extra safety net by batch-sampling the files delivered from CCS in line with ISO 2859-1, the standard covering sampling procedures for inspection. "It's impossible to individually examine every page delivered to the Library," says Fitzgerald. "We're very happy with the quality and consistency of the files being delivered to us."
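In outline, batch sampling means drawing a random sample from each delivery and accepting the whole batch only if the number of defective pages stays at or below an agreed limit. The Python sketch below shows that general idea. The sample size and acceptance number are passed in as plain parameters; they are not taken from the ISO 2859-1 tables, and the function and its arguments are hypothetical.

```python
import random


def inspect_batch(pages, is_defective, sample_size, accept_number, seed=None):
    """Single-sample acceptance check.

    pages         -- list of page records in the delivered batch
    is_defective  -- function returning True for a faulty page
    sample_size   -- how many pages to examine (hypothetical value)
    accept_number -- maximum defects allowed in the sample
    Returns (accepted, defects_found).
    """
    rng = random.Random(seed)
    n = min(sample_size, len(pages))
    sample = rng.sample(pages, n)  # draw without replacement
    defects = sum(1 for page in sample if is_defective(page))
    return defects <= accept_number, defects
```

A batch that fails would be examined more closely or sent back, rather than every page being checked by hand.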
The Library estimates that 30TB of storage will be required to accommodate the entire output of the collection when it's finished in late 2009. And unlike its unique collection of books, the Library can ensure that copies of the scanned books are kept across multiple locations. The digital library is currently backed up offsite in Boston Spa, Yorkshire, and a site in Wales will soon be added.
Printed from www.pcpro.co.uk


