Published in: Education, Technology
  • But I do think this picture holds a number of the clues. I’m being perhaps a bit mysterious here aren’t I?

    1. 1. Large Scale Digitisation Initiatives – the British Library Experience. Aly Conteh, Digitisation Programme Manager, British Library. September 2009
    2. 3. What do we have?
        • 150 million items
        • Or 650 linear kilometres
        • + 11 kilometres every year
    3. 4. 3.5 billion pages
    4. 5. 825 million pages
    5. 6. 5.5 billion pages
    6. 8. The focus of LSDI
    7. 9. Historic Newspapers 1620 - 1900
        • Over 4 million pages digitised
        • Three Challenges
            • How to QA
            • How to Sustain
            • Need for better text extraction
    8. 10. How do you Quality Assure 4 million pages?
        • Outsource – but need to QA the QA Process
        • ISO 2859-1: “Sampling procedures for inspection by attributes: sampling schemes indexed by acceptance quality limit (AQL)”
        • Automation - JHOVE
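The sampling idea on the QA slide can be sketched as follows. This is a minimal illustration of acceptance sampling by attributes, not the procedure the British Library used: the sample size and acceptance number are hypothetical placeholders (real values would come from the tables in the standard), and `is_defective` stands in for whatever page-level check a project defines.

```python
import math
import random


def accept_batch(page_ids, is_defective, sample_size, acceptance_number, seed=0):
    """Inspect a random sample; accept the batch if defects <= acceptance_number.

    sample_size and acceptance_number are hypothetical inputs here; in practice
    they would be looked up in the sampling tables for a chosen AQL.
    """
    rng = random.Random(seed)
    sample = rng.sample(page_ids, sample_size)
    defects = sum(1 for page in sample if is_defective(page))
    return defects <= acceptance_number, defects


def probability_of_acceptance(defect_rate, sample_size, acceptance_number):
    """Binomial estimate of how often a batch with this defect rate is accepted."""
    return sum(
        math.comb(sample_size, k)
        * defect_rate**k
        * (1 - defect_rate) ** (sample_size - k)
        for k in range(acceptance_number + 1)
    )


if __name__ == "__main__":
    batch = list(range(10_000))            # hypothetical batch of page ids
    bad_pages = set(range(0, 10_000, 50))  # hypothetical: 2% of pages defective
    accepted, found = accept_batch(batch, lambda p: p in bad_pages, 200, 3)
    print("accepted:", accepted, "defects found in sample:", found)
    print("P(accept) at 2% defects:", probability_of_acceptance(0.02, 200, 3))
```

Checking a few hundred pages per batch instead of every page is what makes QA of millions of pages tractable; the second function shows how the chosen sample size and acceptance number trade off against the risk of passing a poor batch.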
    9. 13. Better Text Extraction January 1874
    10. 16. They had the internet in 1816! The Morning Chronicle (London, England), Saturday, May 18, 1816; Issue 14678
    11. 17. and DVD in 1803! The Morning Chronicle (London, England), Friday, June 10, 1803; Issue 10625
    12. 18. IMPACT
        • Significantly improving mass digitisation of historical printed text by
        • Innovating OCR software and language technology
        • Sharing expertise and building capacity across Europe
        • Ensuring that tools and services will be sustained after the end of the project
        IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
    13. 19. The IMPACT Consortium
        • Libraries
            • National Library of the Netherlands (KB)
            • The British Library (BL)
            • Bibliothèque nationale de France (BNF)
            • German National Library (DNB)
            • Bavarian State Library (BSB)
            • Göttingen State and University Library (UGOE)
            • Austrian National Library (ONB)
            • University of Innsbruck Library (UIBK)
        • Universities & Research centres
            • Dutch Institute for Lexicology (INL)
            • National Centre for Scientific Research – Demokritos (NCSR)
            • University of Salford (USAL)
            • University of Munich (CIS group)
            • University of Innsbruck (InfMath group)
            • University of Bath (UKOLN)
        • Industry partners
            • IBM (Haifa Research Lab)
            • ABBYY (Moscow)
    14. 20. Facts and figures
        • Project supported by the European Community under the FP7 ICT Work Programme.
        • Coordinated by the National Library of the Netherlands (KB)
        • EU funding: € 11 500 000
        • Start date: 1 January 2008
        • Duration: 48 months
        • From 2011: sustainable Centre of Competence with alternative resources
        • Web site: www.impact-project.eu
    15. 21. Microsoft – British Library 100,000 or 75,000 19th Century Books
    16. 22. Digitising 19th Century Books
    17. 23. Two more challenges
        • Greater Capacity
        • How do we store everything
    18. 24. Greater Capacity
    19. 26. The reality
        • Productivity = 2x – 3x
        • At peak: 2 shifts, 6 workstations = 1.5m pcm
        • Integration with book ordering and catalogue systems
        • Project completed 5 months early
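The peak figure on this slide can be broken down with simple arithmetic. The sketch below assumes "pcm" means pages per calendar month and that the six workstations ran in both shifts; the function names are illustrative, and the 20 million page total is taken from the damage statistic quoted elsewhere in the deck.

```python
PEAK_PAGES_PER_MONTH = 1_500_000  # "1.5m pcm" from the slide
SHIFTS = 2
WORKSTATIONS = 6


def pages_per_workstation_shift(pages_per_month, shifts, workstations):
    """Monthly output of one workstation during one shift."""
    return pages_per_month / (shifts * workstations)


def months_at_rate(total_pages, pages_per_month):
    """Months needed to scan total_pages at a constant monthly rate."""
    return total_pages / pages_per_month


if __name__ == "__main__":
    per_station = pages_per_workstation_shift(
        PEAK_PAGES_PER_MONTH, SHIFTS, WORKSTATIONS
    )
    print(f"{per_station:,.0f} pages per workstation-shift per month")
    # Assumption: peak rate held for the whole project, which it would not
    # have been in practice, so this is a lower bound on duration.
    months = months_at_rate(20_000_000, PEAK_PAGES_PER_MONTH)
    print(f"{months:.1f} months for 20 million pages at peak rate")
```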
    20. 27. only 2 of 20 million pages were damaged during scanning
    21. 28. Storage
        • CR2 = 23 MB (438 TB)
        • TIFF = 53 MB (1 PB)
        • JP2K = 17 MB (324 TB)
        • JP2K (lossy) = 4 MB (80 TB)
        • What format is right for your project?
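The totals on the Storage slide follow from per-image size multiplied by image count. The sketch below back-calculates the image count each line implies, assuming decimal units (1 TB = 1,000,000 MB) and that the per-image sizes on the slide are rounded averages; the names are illustrative only.

```python
# (MB per image, total TB) as quoted on the slide; 1 PB is written as 1000 TB.
SLIDE_FIGURES = {
    "CR2": (23, 438),
    "TIFF": (53, 1000),
    "JP2K": (17, 324),
    "JP2K (lossy)": (4, 80),
}

MB_PER_TB = 1_000_000  # assumption: decimal units


def implied_image_count(mb_per_image, total_tb):
    """Number of images implied by a per-image size and a total volume."""
    return total_tb * MB_PER_TB / mb_per_image


def total_storage_tb(mb_per_image, image_count):
    """Total storage in TB for image_count images of the given average size."""
    return mb_per_image * image_count / MB_PER_TB


if __name__ == "__main__":
    for fmt, (mb, tb) in SLIDE_FIGURES.items():
        count = implied_image_count(mb, tb)
        print(f"{fmt:13s} {mb:3d} MB x {count / 1e6:5.2f} million images = {tb} TB")
```

Under these assumptions every line implies roughly 19 to 20 million images, so the formats differ only in per-image size; lossy JP2K at 4 MB needs less than a tenth of the space of TIFF at 53 MB.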
    22. 29. Thank you
        • www.bl.uk
        • [email_address]
