Digitisation Overview
Neil Fitzgerald, IMPACT Project Delivery Manager
24th September 2009
 
JISC workshop presentation

  • BL has 2 main sites, at STP [St Pancras] and Bspa [Boston Spa].
  • One of six legal deposit libraries, along with NLW, NLS, Oxford, Cambridge and Trinity, who are all currently trying to deal with Electronic Legal Deposit [born digital] material. Involved in a range of digitisation activities using a number of approaches, with no grant-in-aid funding!
  • For any type of digitisation:
    TS [technical standards]: must be appropriate for the material/volume, e.g. MDP in TIFF = 1.2 PB; using JP2, 25 TB.
    WT [workflow tools]: most cost is in inefficient pre- & post-capture processing.
    PDD [project driven digitisation]: affects ability to operationalise; direct efficiency/cost effects.
    MS [material selection]: affects costs and resource discovery options.
    Meta [metadata]: each stage of processing is impacted.
    OCR: processing & resource discovery implications; only good on post-1950 documents.
  • Self-selecting, i.e. obvious Treasures. Drivers: cultural restitution, wider public access. Sometimes private sponsorship, especially for iconic items. Cultural reunification projects, e.g. International Dunhuang Project and Codex Sinaiticus. Focus on small-scale & high-quality showcases; often a re-cataloguing/metadata/resource discovery tool improvement exercise in disguise! Although it is often said there is one chance to capture these items, compelling new technology is often an exception to the rule.
  • Google entry to market. EU i2010 response: devolved to national governments. Microsoft entry and withdrawal from market – Internet Archive. Complex rights landscape. Range of capture approaches available, e.g. move from analogue to digital conversion as historical archives are processed and digital equipment quality improves/costs fall. Some scanners are better at dealing with certain material, e.g. tight bindings, but exaggerate show-through…
  • We still don’t fully understand who our audience/stakeholders are and what they want! SCA is trying to provide guidance….
  • Industrial scanning & processing: central services. Requires multiple capture loops to deal with material with specific handling issues. R&D/CoC guidance on optimal capture & OCR required; project-based digitisation is unlikely to deliver long-term improvements in isolation.
  • General principles – then some points in more detail
  • Typical large-scale workflow. Should highlight the QA batch sampling method based on ISO 2859-1, with trend analysis. Proved that good quality is possible in a large-volume workflow!
  • Publisher and physical description: The earliest, unamended nineteenth century catalogue records in GK were very brief. Often there is no information on publisher and on the number of pages and of course no ISBN. Most often however ‘format’ is included; in effect a statement of how paper was used in the production process. In many cases publisher names have since been added as have page numbers. But when the printed catalogue was converted it was not possible to separate the various statements. Some attempt has been made to make up for this, but not always successfully.
    Ambiguous headings: GK made main entries under personal author, corporate body, title or sometimes initial words of the title. When the catalogue was converted to UKMARC the coding made no distinction between the various types of main entry. When the data was converted to MARC 21 algorithms were written to code the type of entry appropriately; as personal author, corporate author, title etc. However, some headings were ambiguous and could not be processed in this way. Those which could not be distinguished were placed in the 720 field, which therefore contains titles as well as some types of authors’ names.
    Name authorities: We now use the Library of Congress / NACO file for name authority control. This ensures that one standard form of a person’s name is preferred in catalogue entries, but that access is also provided from variant forms. GK used its own name authority forms. This may mean that there will be no tie-up with books digitised by Microsoft from another library which has used a different name form for the same author.
    Background: Books originally catalogued in GK, the British Museum General Catalogue of Printed Books. The records were originally intended to be used in the context of a guard book catalogue. The records were converted to machine readable format in the period 1987-92. The data was copied as seen; errors in the printed catalogue were not systematically corrected. The MARC format employed was a simplified version of UKMARC. On migration to the ILS some of the deficiencies of this format were addressed, but a comprehensive solution was not possible at that time.
  • Scalable services/systems required to deal with large volume of material. Management information essential to improve outcomes and add value to collection holders/end users.
  • Complexity of working with historical metadata/current rights landscape requires innovation to provide solutions.
  • Shared-responsibility [web] services which benefit all will become more prevalent in future.
  • Themed deliverables for material content streams, e.g. books/newspapers/journals/special collections. Ability to repurpose files to suit future requirements is essential.
  • Content will drive new services/resource discovery tools and change user demands. This item was digitised by Google but archived by the collection holder with IA, as they don’t currently have their own DP solution.
  • POD – hardware/format wars – new channels for delivery will impact on capture approach/post capture processing.
  • Need to join up disparate services to provide an efficient end-to-end solution.
  • User community involvement will accelerate the volume available, increase quality so it’s fit for purpose, and change the cost model.
  • Consolidation of processing required to solve outstanding issues.
  • UK needs a more coordinated approach to ensure cultural memory is available for research and to contribute to UK plc bottom line; the report did not deliver the required vision or the building blocks to deliver it. More integrated competitors have advanced plans to digitise own-language material. UK at risk as the language is so widespread: who will deliver?
  • Current content/tools need to expand if the resource is going to be one of the primary destination choices.
  • IMPACT CoC to drive cross discipline research and provide source material – datasets and collaborative correction extension.
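The notes on the contractor workflow mention QA by batch sampling with trend analysis. A minimal sketch of that idea follows. It assumes a simple random sample per batch and a fixed acceptance number. The function names, the sample size and the acceptance number are all hypothetical and are not taken from the presentation or from the standard's tables; a real sampling plan would look these values up for the chosen batch size and quality level.

```python
import random


def sample_batch(batch, sample_size, seed=None):
    """Draw a simple random sample (without replacement) from one batch."""
    rng = random.Random(seed)
    return rng.sample(batch, min(sample_size, len(batch)))


def inspect_batch(batch, is_defective, sample_size, accept_number, seed=None):
    """Accept the batch if the sample holds at most accept_number defects.

    sample_size and accept_number are illustrative assumptions here.
    """
    sample = sample_batch(batch, sample_size, seed)
    defects = sum(1 for item in sample if is_defective(item))
    return {
        "sampled": len(sample),
        "defects": defects,
        "accepted": defects <= accept_number,
    }


def defect_trend(results):
    """Defect rate per inspected batch, in order, for simple trend analysis."""
    return [
        r["defects"] / r["sampled"] if r["sampled"] else 0.0
        for r in results
    ]


if __name__ == "__main__":
    # Hypothetical batch: page numbers, where pages 0-2 failed image checks.
    pages = list(range(500))
    result = inspect_batch(pages, lambda p: p < 3, sample_size=50,
                           accept_number=1, seed=42)
    print(result)
```

Recording one result per batch and passing the list to `defect_trend` gives a series that can be plotted or checked for drift across deliveries.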

    1. Digitisation Overview. Neil Fitzgerald, IMPACT Project Delivery Manager. 24th September 2009
    3. British Library Overview
        • We receive a copy of every publication produced in the UK and Ireland
        • The collection includes 150 million items, in most known languages
        • 3 million new items are incorporated every year
        • We house manuscripts, maps, newspapers, magazines, prints and drawings, music scores, and patents
        • The Sound Archive keeps sound recordings from 19th-century cylinders to the latest CD, DVD and minidisk recordings
        • We house 8 million stamps and other philatelic items
        • These require over 625 km of shelves, and grow 12 km every year
        • We have on-site space for over 1,200 readers
        • Over 16,000 people use the collections each day
        • Online catalogues, information and exhibitions can be found on www.bl.uk
        • We operate the world's largest document delivery service providing millions of items a year to customers all over the world
    4. Key Challenges
        • Technical standards
        • Workflow tools
        • Project driven digitisation
        • Material selection
        • Metadata
        • OCR Accuracy
    5. Boutique Digitisation
    6. Present
    7. Strategic Content Alliance
    8. Google’s Scanning Patent
    9. Mass Digitisation Principles
        • Continuous improvement
        • Use standards to benefit resource discovery, interoperability & digital preservation
        • Content selection by collection
        • Critical mass required to build useful service
        • Workflow designed to deliver quality fit for purpose
        • OCR’d where possible
    10. Scanning Process: Contractor Workflow
    11. Metadata Issues
        • Language codes: The language of the resource was not systematically recorded. In some cases language was recorded as a note or as a component of a structured heading, but MARC language codes (based on ISO 639-2) are not present in the records
        • Multivolume works: A record may refer to a single volume or to several volumes
        • Shelfmarks: For a variety of reasons, including analytics, multiparts, transcription error and lapses in maintenance, the same [or apparently the same] shelfmark sometimes relates to multiple logical or physical items
        • Format: in effect a statement of how paper was used in the production process
        • No ISBN
    12. MDP Book Workflow
    13. Workflow Tools
    14. Copyright Tools
    15. Deliverables
        • Contractor provides the following data for each digital object [book]:
        • Dynamically compressed lossy JP2 file as master image
        • METS/ALTO.xml file containing:
            • MARC and MODS descriptive metadata
            • ALTO file per image representing layout and OCR text information
            • METS rights copyright metadata
            • MIX technical metadata per image
            • fileSec containing information about all files, sequence and hash code
            • Physical structMap containing information about physical structure, page sequence and page type [Title Page, Table of Contents Page, Foldout Page]
        • SHA-1 hash code file for METS verification
        • Bound PDF file containing all images as dynamically compressed JP2 files, hidden text, linearised to open TOC page, Title Page or first image
    16. Online Books
    17. E-Book & POD
    18. Permanent Access
    19. Collaborative Correction
    20. R&D Still Necessary
    21. Future
    22. Digital Britain
    23. Europeana
    24. Future Collaboration
    25. www.bl.uk [email_address]
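The Deliverables slide lists a SHA-1 hash code file for METS verification. A minimal sketch of such a check follows, using Python's standard `hashlib`. The layout of the hash file is not given in the slides, so the sketch assumes its first whitespace-separated token is the hexadecimal digest. The function names and file paths are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_mets(mets_path, hash_path):
    """Compare the METS file's SHA-1 with the digest stored in the hash file.

    Assumption: the first whitespace-separated token in the hash file is
    the hex digest. The slides do not specify the real layout.
    """
    with open(hash_path, "r", encoding="ascii") as handle:
        tokens = handle.read().split()
    if not tokens:
        return False  # an empty hash file cannot verify anything
    return sha1_of_file(mets_path) == tokens[0].lower()


if __name__ == "__main__":
    # Hypothetical file names for one delivered book.
    ok = verify_mets("book_0001_mets.xml", "book_0001_mets.sha1")
    print("METS verified" if ok else "METS hash mismatch")
```

Reading in chunks keeps memory use flat for large METS files. A mismatch would normally send the object back to the contractor rather than into ingest.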
