Bratislava WS - Mühlberger - OCR in libraries_pdf

Published in: Education

  1. OCR in libraries: some practical remarks
     Günter Mühlberger, Department for Digitisation and Digital Preservation, University Innsbruck Library
     IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. OCR in Libraries
     • Not an easy chapter...
     • Is the glass half empty or half full?
     • Historical fonts: black letter, Gothic, Old Cyrillic, ...
     • Great attempts at full text:
       - JSTOR (1994)
       - Google (2004)
     • But: still many digital libraries without integrated full text
  3. OCR and Digitization
     • OCR changes everything!
     • The workflow has to be adapted at every step:
       - Preparation and selection of material
       - Image processing and scanning
       - Quality control
       - Storage and preservation
       - Correction and user involvement
       - Full-text search
       - Web interfaces for digital libraries
     • Significant increase in complexity
  14. Preparation
      • Which material will be selected for scanning? Options:
        - Bound volumes?
        - Microfilm?
        - Loose folios?
  15. Option: Bound volumes
      • Pros:
        - This is how books, journals and newspapers are held in the library
      • Cons:
        - Often tight binding, especially with newspapers
        - Often warping due to humidity
      • Remark:
        - Technical solution: scan robots make life easier and double the speed compared to manual operation, e.g. 700 to 1000 pages per hour
        - The investment in scan robots must not be underestimated
  16. Option: Microfilm
      • Pros:
        - If a microfilm is available, it is a cheap alternative
        - Easy option (no handling of volumes)
      • Cons:
        - Microfilms have the same problems as bound volumes
        - Microfilms were often produced with minimal quality control
        - Microfilms from before 1990 are often not in good condition
      • Remark:
        - If the microfilm was produced with good quality, then there is no significant difference in OCR quality
        - A case study with BL material will be published on the IMPACT site
  17. Option: Loose folios
      • Pros:
        - No tight binding, less warping
        - Extremely fast performance with industrial scanners, at a low price
        - Duplicates can be sent to off-shore providers in huge packages
      • Cons:
        - Not feasible for material from before 1850: libraries would run into justification problems
        - Organisational effort to arrange duplicates (but completeness has to be evaluated anyway)
      • Remark:
        - By far the best option for producing high quality with the fewest resources
        - Especially interesting for newspapers, 20th century material and grey literature
        - Used e.g. by MOA and JSTOR
  18. Good, bad and ugly images
      • Careful scanning is the alpha and omega
        - Scan robots and document scanners lower the demands on the operator, but individual skill is still decisive
      • The criteria for a good page image are simple:
        - sharp
        - well-defined fonts with clear curves
        - clear background, no show-through from the reverse side
        - no warping of the page and no geometrical distortions
        - complete shot with some white frame around the text borders
        - text lines parallel or at right angles to the page borders
        - no noise introduced by users
      • If you have perfect images, you can wait until OCR technology improves; with bad images you will never get good results
  23. Bad print: broken characters
  24. und wenn
  26. Bitonal or 8/24 bit, 300 or 400 ppi, JPEG or TIFF?
      • Bitonal vs. 8/24 bit
        - Rose Holley, D-Lib paper 2009: greyscale scanning does not lead to better results
        - Experiment: microfilm scanned bitonal or greyscale, no difference
      • Simple experiments show the opposite
        - Innsbrucker Zeitungsarchiv: bitonal and 24 bit
        - Results are clearly better with colour
      • 300 or 400 ppi resolution
        - Very small font: Word text in a 4 point font
      • JPEG vs. TIFF RGB
        - Tests with the Treventus ScanRobot, and also with other material, show no advantage of TIFF RGB images over compressed JPEGs
      • Modern documents with medium-sized fonts can be scanned at 300 ppi and bitonal, but documents with small fonts, challenging paper quality etc. should be scanned at 400 ppi and 8 or 24 bit, and can be stored as JPEGs with e.g. a 90% compression rate
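The closing recommendation of this slide can be read as a simple decision rule. The following is only a minimal sketch: the names `ScanSettings` and `recommend_settings` are hypothetical, the 8 pt threshold for a "small" font is an assumption the slides do not give, and the "90% compression rate" is interpreted here as a JPEG quality setting of 90.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanSettings:
    ppi: int
    bit_depth: int                # 1 = bitonal, 8 = greyscale, 24 = colour
    jpeg_quality: Optional[int]   # None where the slide gives no storage advice


def recommend_settings(font_size_pt: float,
                       challenging_paper: bool,
                       small_font_threshold_pt: float = 8.0) -> ScanSettings:
    """Rule of thumb from the slide; the threshold value is an assumption."""
    if font_size_pt < small_font_threshold_pt or challenging_paper:
        # Small fonts or difficult paper: 400 ppi, 8 bit (24 bit also allowed),
        # stored as JPEG.
        return ScanSettings(ppi=400, bit_depth=8, jpeg_quality=90)
    # Modern documents with medium-sized fonts: 300 ppi, bitonal.
    return ScanSettings(ppi=300, bit_depth=1, jpeg_quality=None)


if __name__ == "__main__":
    print(recommend_settings(font_size_pt=4, challenging_paper=False))
    print(recommend_settings(font_size_pt=11, challenging_paper=False))
```

Keeping the threshold as a parameter makes it easy to adjust after test runs on the actual collection.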
  27. Accuracy
      • Is the glass half full or half empty?
        - Rose Holley: below 90% word recognition is a poor result
        - Google: OCR every image, so every correctly recognized word is better than nothing
        - Painful errors?
        - Mature users?
      • Character vs. word accuracy
        - Word accuracy says much more and is much easier to determine: each word that would be found correctly in a full-text search can be counted as correct
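The difference between the two measures can be made concrete with a small sketch. It assumes a manually keyed ground-truth transcription is available, counts a word as correct if it occurs in the OCR output (the full-text-search criterion from the slide), and uses edit distance for character accuracy. The function names and the sample strings are illustrative only.

```python
from collections import Counter


def word_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth words that a full-text search would find."""
    truth = Counter(ground_truth.lower().split())
    ocr = Counter(ocr_text.lower().split())
    total = sum(truth.values())
    if total == 0:
        return 1.0
    found = sum(min(count, ocr[word]) for word, count in truth.items())
    return found / total


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """1 minus the edit distance relative to the ground-truth length."""
    if not ground_truth:
        return 1.0
    return max(0.0, 1 - levenshtein(ground_truth, ocr_text) / len(ground_truth))


if __name__ == "__main__":
    truth = "und wenn der Tag kommt"
    ocr = "und wcnn der Tag kommt"   # one broken character
    print(word_accuracy(truth, ocr), character_accuracy(truth, ocr))
```

A single broken character costs one character out of 22 but one word out of five, which is why the two figures can diverge so strongly on the same page.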
  28. Examples from real-world projects
      • Based on: ABBYY Recognition Server 2
        - Reichstagsprotokolle, 1925
        - Zedler, 1744
        - Coburger Zeitung, 1808
        - Judentum, 1803
        - Eckartshausen, 1792
        - Landesbauernkammer, 1921
        - Galvani, 1793
        - Hieber, 1722
        - Hofmann, 1875
        - Buschendorf, 1805
        - Schreiben, 1689
        - Latin texts
  29. Correction of OCR text
      • Until recently regarded as "absurd"
      • But:
        - Crowdsourcing
        - New technologies
      • Crowdsourcing: figures from the Australian Newspaper Project
        - Correction via a simple editor: line-by-line correction
        - Since August 2008, 6,000 users have contributed
        - 7 million lines in 318,000 articles were corrected
        - If you count 50 characters per line, this work is worth about 200,000 EUR (compared to the prices of service providers)
      • New technologies
        - IBM: CONCERT Tool, LMU: PostCorrection Tool
        - Productivity compared to simple rekeying will be enhanced by several factors (at least 1:5)
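The crowdsourcing figures imply a price per character that the slide does not state. The arithmetic below is a quick check under the slide's own assumption of 50 characters per line; the rate per 1,000 characters is derived here and is not quoted from any service provider.

```python
# Figures taken from the slide.
CORRECTED_LINES = 7_000_000
CORRECTED_ARTICLES = 318_000
CONTRIBUTING_USERS = 6_000
CHARS_PER_LINE = 50           # the slide's assumption
ESTIMATED_VALUE_EUR = 200_000

# Derived values (not stated on the slide).
total_chars = CORRECTED_LINES * CHARS_PER_LINE            # 350,000,000 characters
eur_per_1000_chars = ESTIMATED_VALUE_EUR / (total_chars / 1000)
lines_per_user = CORRECTED_LINES / CONTRIBUTING_USERS
lines_per_article = CORRECTED_LINES / CORRECTED_ARTICLES

if __name__ == "__main__":
    print(f"characters corrected:   {total_chars:,}")
    print(f"implied EUR/1000 chars: {eur_per_1000_chars:.2f}")
    print(f"lines per user (mean):  {lines_per_user:.0f}")
    print(f"lines per article:      {lines_per_article:.1f}")
```

The mean of roughly 1,167 lines per user says nothing about how the work was distributed among contributors; the slide gives no figures on that.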
  30. What to do with OCR results?
      • Structural enhancement
        - INEX: competition based on OCR files
        - Functional Extension Parser
      • Preservation
        - Complexity is significantly increased
        - Output: TXT, PDF, ABBYY XML
        - ALTO format
        - How to integrate users' corrections?
        - Proposal for enhancing the ALTO format
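To show what an ALTO file offers beyond plain text, here is a minimal sketch that reads words and their word confidence (`WC`) from an ALTO document using Python's standard library. The sample page and the 0.5 threshold are invented for illustration, real files carry coordinates and more metadata, and namespace versions differ between ALTO releases, which is why the code matches on local tag names only.

```python
import xml.etree.ElementTree as ET

# Invented sample; real ALTO files also carry HPOS/VPOS/WIDTH/HEIGHT etc.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String ID="S1" CONTENT="und" WC="0.98"/>
            <SP/>
            <String ID="S2" CONTENT="wcnn" WC="0.41"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(alto_xml: str):
    """Return one list per TextLine with (word, confidence or None) pairs."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = []
        for child in element:
            if local_name(child.tag) == "String":
                wc = child.get("WC")
                words.append((child.get("CONTENT", ""),
                              float(wc) if wc is not None else None))
        lines.append(words)
    return lines


def low_confidence_words(lines, threshold: float = 0.5):
    """Words whose confidence is below the (assumed) threshold."""
    return [word for line in lines for word, wc in line
            if wc is not None and wc < threshold]


if __name__ == "__main__":
    parsed = read_alto_lines(SAMPLE_ALTO)
    print(" ".join(word for word, _ in parsed[0]))
    print(low_confidence_words(parsed))
```

Words flagged this way are natural candidates to offer to users for correction first; how such corrections should be written back into ALTO is the open question the slide raises.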
  31. Digital library applications
      • Full-text search
        - JSTOR, Google, publishers
        - Faceted search (Solr)
      • Indexing through search engines
        - Site XML
      • Visibility of the OCR text
        - User training (learning by doing)
        - Necessary if correction is to be included
      • New research fields
        - Text mining
        - Linking of texts
        - Near duplicates, similarity and new identifiers
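The slides do not say how near duplicates would be detected. One common approach, swapped in here purely as an illustration, compares the sets of overlapping word sequences ("shingles") of two OCR texts using Jaccard similarity. The function names and the shingle length of 3 are assumptions.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(text_a: str, text_b: str, k: int = 3) -> float:
    """Size of the shingle intersection divided by the size of the union."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    first = "the council met on monday to discuss the new bridge"
    second = "the council met on monday to discuss the old bridge"
    print(jaccard_similarity(first, second))
```

Because OCR errors change individual words, two scans of the same article rarely match exactly, so a similarity threshold rather than exact equality is needed; a suitable value would have to be found by testing on real material.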
  32. Summary
      • OCR is a "must"
        - For documents of the 19th and 20th centuries, OCR generally provides useful or even very good results
        - Before 1800: improvements can be expected from IMPACT
        - Careful and exact scanning is always the main prerequisite, preferably at 400 ppi and 8 or 24 bit
        - Test runs with random sets
      • Modern applications
        - Full-text search
        - Visibility of the erroneous text
        - Options for users to correct the text
        - Several export formats (also for end users)
        - Site XML for search engines
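The "test runs with random sets" recommendation can be sketched as drawing a reproducible random sample of pages before committing to a full OCR run. The function name, the page identifiers and the sample size are hypothetical.

```python
import random


def draw_test_set(page_ids, sample_size: int, seed: int = 0) -> list:
    """Draw a reproducible random sample of page identifiers."""
    pages = list(page_ids)
    rng = random.Random(seed)   # fixed seed so the test run can be repeated
    return sorted(rng.sample(pages, min(sample_size, len(pages))))


if __name__ == "__main__":
    all_pages = [f"volume1_page{n:04d}" for n in range(1, 501)]
    print(draw_test_set(all_pages, sample_size=10))
```

Fixing the seed means the same pages can be run again after changing scanner or OCR settings, so results stay comparable.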
  33. Thank you for your attention!
