The British Library is one of six legal deposit libraries, along with NLW, NLS, Oxford, Cambridge and Trinity, all of which are currently trying to deal with Electronic Legal Deposit [born-digital] material. Involved in a range of digitisation activities using a number of approaches – no grant-in-aid funding!
For any type of digitisation:
- TS – appropriate for material/volume, e.g. MDP in TIFF = 1.2 PB; using JP2, 25 TB
- WT – most cost is in inefficient pre- and post-capture processing
- PDD – affects ability to operationalise – direct efficiency/cost effects
- MS – affects costs and resource discovery options
- Meta – each stage of processing impacted
- OCR – processing & resource discovery implications – only good on post-1950 documents
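The storage figures above imply a dramatic compression gain, which is easy to sanity-check (a sketch: only the two totals come from the slide, the arithmetic is added here):

```python
# Master-storage arithmetic for the figures quoted above: the same corpus
# stored as uncompressed TIFF (~1.2 PB) versus lossy JP2 (~25 TB).
TIFF_TOTAL_TB = 1200  # 1.2 PB expressed in terabytes
JP2_TOTAL_TB = 25

ratio = TIFF_TOTAL_TB / JP2_TOTAL_TB
print(f"JP2 masters need ~{ratio:.0f}x less storage than TIFF")  # ~48x
```

A roughly 48:1 reduction is why the choice of technical standard matters so much at petabyte scale.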
Self-selecting, i.e. obvious Treasures.
Drivers: cultural restitution, wider public access.
Sometimes private sponsorship, especially for iconic items.
Cultural reunification projects, e.g. International Dunhuang Project and Codex Sinaiticus.
Focus on small-scale & high-quality showcases – often a re-cataloguing/metadata/resource-discovery tool improvement exercise in disguise!
Although it is often said there is one chance to capture these items, compelling new technology is often an exception to the rule.
Google entry to market.
EU i2010 response – devolved to national governments.
Microsoft entry to and withdrawal from the market – Internet Archive.
Complex rights landscape.
Range of capture approaches available, e.g. move from analogue to digital conversion as historical archives are processed and digital equipment quality improves/costs fall.
Some scanners are better at dealing with certain material, e.g. tight bindings, but exaggerate show-through…
We still don’t fully understand who our audience/stakeholders are and what they want! SCA is trying to provide guidance….
Industrial scanning & processing – central services.
Requires multiple capture loops to deal with material with specific handling issues.
R&D/CoC guidance on optimal capture & OCR required – project-based digitisation is unlikely to deliver long-term improvements in isolation.
General principles – then some points in more detail
Typical large-scale workflow – should highlight the QA batch-sampling method based on ISO 2859-1 – trend analysis. Proved that good quality is possible in a large-volume workflow!
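The batch-sampling QA step can be sketched as a single sampling plan (a hedged sketch: the sample size and acceptance number below are illustrative values, not the Library's actual ISO 2859-1 plan):

```python
import random

def inspect_batch(batch, is_defective, sample_size=80, acceptance_number=3):
    """Single sampling plan in the spirit of ISO 2859-1: draw a random
    sample of pages from the batch and accept the whole batch only if the
    number of defective pages found does not exceed the acceptance number."""
    sample = random.sample(batch, min(sample_size, len(batch)))
    defects = sum(1 for page in sample if is_defective(page))
    return defects <= acceptance_number

# Illustrative usage: a page flagged True is defective (skew, blur, cropping).
batch = [False] * 990 + [True] * 10
print("batch accepted:", inspect_batch(batch, is_defective=lambda page: page))
```

Recording the defect count per accepted batch over time gives the trend analysis the workflow relies on.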
Publisher and physical description: The earliest, unamended nineteenth-century catalogue records in GK were very brief. Often there is no information on the publisher or the number of pages, and of course no ISBN. Most often, however, ‘format’ is included; in effect a statement of how paper was used in the production process. In many cases publisher names have since been added, as have page numbers. But when the printed catalogue was converted it was not possible to separate the various statements. Some attempt has been made to make up for this, but not always successfully.

Ambiguous headings: GK made main entries under personal author, corporate body, title or sometimes the initial words of the title. When the catalogue was converted to UKMARC the coding made no distinction between the various types of main entry. When the data was converted to MARC 21, algorithms were written to code the type of entry appropriately: as personal author, corporate author, title, etc. However, some headings were ambiguous and could not be processed in this way. Those which could not be distinguished were placed in the 720 field, which therefore contains titles as well as some types of authors’ names.

Name authorities: We now use the Library of Congress / NACO file for name authority control. This ensures that one standard form of a person’s name is preferred in catalogue entries, but that access is also provided from variant forms. GK used its own name authority forms. This may mean that there will be no tie-up with books digitised by Microsoft from another library which has used a different name form for the same author.
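The conversion step that sorted GK main entries into MARC 21 fields can be sketched as a small rule-based classifier (the rules and corporate markers here are invented for illustration; the real algorithms used far more evidence). Headings the rules cannot place fall through to the 720 field, the behaviour described above:

```python
def classify_heading(heading):
    """Assign a GK main-entry heading to a MARC 21 field.

    Illustrative rules only. Field choices: 100 = personal author,
    110 = corporate author, 245 = title entry, 720 = uncontrolled or
    ambiguous name.
    """
    corporate_markers = ("LIBRARY", "SOCIETY", "UNIVERSITY", "MUSEUM")
    if any(m in heading.upper() for m in corporate_markers):
        return "110"
    if "," in heading and not heading.isupper():
        return "100"  # looks like "Surname, Forename"
    if heading.split()[0].lower() in ("the", "a", "an"):
        return "245"  # starts like a title
    return "720"  # ambiguous: park in the uncontrolled-name field

print(classify_heading("Dickens, Charles"))          # 100
print(classify_heading("BRITISH MUSEUM. LIBRARY"))   # 110
print(classify_heading("The Pilgrim's Progress"))    # 245
```

Any heading that defeats all three rules lands in 720, which is why that field mixes titles with some authors’ names.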
Books originally catalogued in GK, the British Museum General Catalogue of Printed Books.
The records were originally intended to be used in the context of a guard-book catalogue.
The records were converted to machine-readable format in the period 1987–92.
The data was copied as seen; errors in the printed catalogue were not systematically corrected.
The MARC format employed was a simplified version of UKMARC. On migration to the ILS some of the deficiencies of this format were addressed, but a comprehensive solution was not possible at that time.
Scalable services/systems required to deal with large volume of material Management information essential to improve outcomes and add value to collection holders/end users
Complexity of working with historical metadata/current rights landscape requires innovation to provide solutions.
Future shared responsibility [web] services which benefit all will become more prevalent.
Themed deliverables for material content streams, e.g. books/newspapers/journals/special collections. Ability to repurpose files to suit future requirements is essential.
Content will drive new services/resource discovery tools and change user demands – this item was digitised by Google but archived by the collection holder with the Internet Archive, as they do not currently have their own digital preservation solution.
POD [print on demand] – hardware/format wars – new channels for delivery will impact on capture approach and post-capture processing.
Need to join up disparate services to provide an efficient end-to-end solution.
User community involvement will accelerate the volume available, increase quality so it’s fit for purpose, and change the cost model.
Consolidation of processing required to solve outstanding issues.
The UK needs a more coordinated approach to ensure cultural memory is available for research and contributes to the UK plc bottom line – the report did not deliver the required vision or the building blocks to deliver it. More integrated competitors have advanced plans to digitise their own-language material – the UK is at risk as the language is so widespread – who will deliver?
Current content/tools need to expand if the resource is going to be one of the primary destination choices.
IMPACT CoC to drive cross discipline research and provide source material – datasets and collaborative correction extension.
Digitisation Overview
Neil Fitzgerald, IMPACT Project Delivery Manager
24th September 2009
British Library Overview
- We receive a copy of every publication produced in the UK and Ireland
- The collection includes 150 million items, in most known languages
- 3 million new items are incorporated every year
- We house manuscripts, maps, newspapers, magazines, prints and drawings, music scores, and patents
- The Sound Archive keeps sound recordings from 19th-century cylinders to the latest CD, DVD and minidisc recordings
- We house 8 million stamps and other philatelic items
- These require over 625 km of shelves, and grow 12 km every year
- We have on-site space for over 1,200 readers
- Over 16,000 people use the collections each day
- Online catalogues, information and exhibitions can be found on www.bl.uk
- We operate the world's largest document delivery service, providing millions of items a year to customers all over the world
Mass Digitisation Principles
- Continuous improvement
- Use standards to benefit resource discovery, interoperability & digital preservation
- Content selection by collection
- Critical mass required to build a useful service
- Workflow designed to deliver quality fit for purpose
- OCR’d where possible
Metadata Issues
- Language codes: the language of the resource was not systematically recorded. In some cases language was recorded as a note or as a component of a structured heading, but MARC language codes (based on ISO 639-2) are not present in the records
- Multivolume works: a record may refer to a single volume or to several volumes
- Shelfmarks: for a variety of reasons, including analytics, multiparts, transcription errors and lapses in maintenance, the same [or apparently the same] shelfmark sometimes relates to multiple logical or physical items
- Format: in effect a statement of how paper was used in the production process
- No ISBN
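The shelfmark ambiguity above is the kind of problem a simple integrity check can surface before ingest (a sketch; the record structure and field names are invented for illustration):

```python
from collections import defaultdict

def find_ambiguous_shelfmarks(records):
    """Group catalogue records by shelfmark and return those shelfmarks
    that map to more than one distinct item -- the situation described
    above, caused by analytics, multiparts, transcription errors, etc."""
    by_shelfmark = defaultdict(set)
    for rec in records:
        by_shelfmark[rec["shelfmark"]].add(rec["item_id"])
    return {sm: ids for sm, ids in by_shelfmark.items() if len(ids) > 1}

# Hypothetical records: two different items share one shelfmark.
records = [
    {"shelfmark": "1078.k.23", "item_id": "B0001"},
    {"shelfmark": "1078.k.23", "item_id": "B0002"},
    {"shelfmark": "RB.23.a.1", "item_id": "B0003"},
]
print(find_ambiguous_shelfmarks(records))  # flags '1078.k.23'
```

Running such a report before matching shelfmarks to scanned volumes avoids attaching OCR and metadata to the wrong physical item.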
Deliverables
Contractor provides the following data for each digital object [book]:
- Dynamically compressed lossy JP2 file as master image
- METS/ALTO .xml file containing:
  - MARC and MODS descriptive metadata
  - ALTO file per image representing layout and OCR text information
  - METS rights copyright metadata
  - MIX technical metadata per image
  - Filesec containing information about all files, sequence and hash code
  - Physical structmap containing information about physical structure, page sequence and page type [Title Page, Table of Contents Page, Foldout Page]
- SHA-1 hash code file for METS verification
- Bound PDF file containing all images as dynamically compressed JP2 files, hidden text, linearised to open at the TOC page, Title Page or first image
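The SHA-1 hash-code file lets the receiving library verify each METS delivery. A minimal sketch of that check (the hash-file layout assumed here, a hex digest on the first line, is an assumption, not the contract's actual format):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-1 so large METS files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_mets(mets_path, hash_path):
    """Compare the computed SHA-1 of the METS file with the contractor-supplied
    hash file, assumed (for this sketch) to hold the hex digest on line one."""
    with open(hash_path, "r", encoding="ascii") as fh:
        expected = fh.readline().split()[0].lower()
    return sha1_of_file(mets_path) == expected
```

A delivery batch would typically be rejected back to the contractor if any METS file fails this comparison.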