IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




OCR in libraries – some practical remarks

                          Günter Mühlberger
                          Department for Digitisation and Digital Preservation
                          University Innsbruck Library




OCR in Libraries
       Not an easy chapter...
       Is the glass half empty or half full?
       Historical fonts: Black letter, gothic, Old Cyrillic, ...
        Major efforts towards full-text
          – JSTOR (1994)
          – Google (2004)
 But: Still many digital libraries without integrated full-text




OCR and Digitization
 OCR changes everything!
 Workflow has to be adapted at all steps
          –      Preparation and selection of material
          –      Image processing & scanning
          –      Quality control
          –      Storage and preservation
          –      Correction and user involvement
          –      Full-text search
          –      Web interfaces for digital libraries
 Significant increase in complexity




Preparation
 Which material will be taken for scanning? Options:
          – Bound volumes?
          – Microfilm?
          – Loose folios?




Option: Bound volumes
 Bound volumes
          – Pros:
                      That’s the way books/journals/newspapers are in the library
          – Cons:
                      Often narrow binding, especially with newspapers
                      Often warping due to humidity
          – Remark
                      Technical solution: ScanRobots make life easier and double the speed
                       compared to manual interaction, e.g. 700 – 1000 pages per hour
                      Investment for ScanRobots must not be underestimated








Option: Microfilm
 Microfilm
          – Pros:
                      If a microfilm is available it is a cheap alternative
                      Easy option (no handling of volumes)
          – Cons:
                      Microfilms have the same problems as bound volumes
                      Microfilms were often produced with minimum quality control
                      Microfilms before 1990 are often not in a good condition
 Remark
          – If the microfilm was produced with good quality, then there is no
            significant difference in OCR quality
                      Case study with BL material will be published on IMPACT site







Option: Loose folios
 Pros
          – No narrow binding, less warping
          – Extremely fast performance with industry scanners – low price
          – Duplicates can be sent to off-shore providers in huge packages
 Cons
          – Not feasible for material before 1850 – libraries would run into justification problems
          – Organisational effort to organise duplicates (but completeness has to be evaluated
            anyway)
 Remark
          – By far the best option to produce high quality with the lowest resources
          – Especially interesting for newspapers, 20th century material and grey literature
          – Used e.g. by MOA, JSTOR





Good, bad and ugly images
 Careful scanning is the be-all and end-all
          – ScanRobots and document scanners lower the demands on the
            operator, but individual skill is still decisive
 Criteria for a good page image are simple:
          –      sharp
          –      distinct characters with clear contours
          –      clean background, no show-through from the reverse side
          –      no warping of the page and no geometric distortion
          –      complete capture with a white margin around the text area
          –      text lines parallel or at right angles to the page borders
          –      no noise introduced by users
 If you have perfect images you can wait until OCR technology
  improves; with bad images you will never get good results




Bad print – broken characters
[Figure: image examples showing the words „und“ and „wenn“]




Bitonal or 8/24 Bit – 300 or 400 ppi – JPEG or TIFF?
 Bitonal vs. 8/24 bit
          – Rose Holley, D-Lib Magazine paper (2009): greyscale scanning does not lead to better results
          – Experiment: microfilm scanned bitonal or greyscale – no difference
 Simple experiments, however, show the opposite
          – Innsbrucker Zeitungsarchiv: bitonal and 24 bit
          – Results are clearly better with colour
 300 or 400 ppi resolution
          – Very small fonts, e.g. Word text in a 4-point font
 JPEG vs. TIFF RGB
          – Tests with the Treventus ScanRobot, but also with other material, show
            no advantage of TIFF RGB images over compressed JPEGs
 Modern documents with medium-sized fonts can be scanned bitonal at 300
  ppi, but documents with small fonts, challenging paper quality etc.
  should be scanned at 400 ppi in 8 or 24 bit and can be stored as JPEGs
  with e.g. a 90% compression rate
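
A short worked calculation may help show why very small fonts push the recommendation towards 400 ppi. The sketch below assumes the usual definition of 1 point = 1/72 inch and treats the nominal font size as the character height, which is only an approximation. The function names, the two boolean inputs and the reading of the rule as "small fonts or challenging paper" are illustrative assumptions, not part of the slides.

```python
def font_height_px(point_size: float, ppi: int) -> float:
    """Nominal height in pixels of a font of `point_size` points scanned at `ppi`.

    Assumes 1 point = 1/72 inch; real glyphs are smaller than the nominal size.
    """
    return point_size / 72.0 * ppi


def recommend_scan_settings(small_fonts: bool, challenging_paper: bool) -> dict:
    """Hypothetical helper encoding the rule of thumb from the slide."""
    if small_fonts or challenging_paper:
        # Small fonts or difficult paper: 400 ppi, greyscale (8 bit) or colour (24 bit)
        return {"ppi": 400, "bit_depth": (8, 24), "store_as": "JPEG"}
    # Modern documents with medium-sized fonts: 300 ppi, bitonal (1 bit)
    return {"ppi": 300, "bit_depth": (1,), "store_as": None}  # format not stated on the slide


if __name__ == "__main__":
    for ppi in (300, 400):
        print(f"4 pt font at {ppi} ppi: about {font_height_px(4, ppi):.1f} px high")
    print(recommend_scan_settings(small_fonts=True, challenging_paper=False))
```

At 300 ppi a 4-point font is nominally under 17 pixels high, at 400 ppi about 22 pixels, which leaves noticeably more detail per character for the OCR engine.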




Accuracy
 Is the glass half full or half empty?
          – Rose Holley: below 90% word recognition is a poor result
          – Google: OCR every image, so every correctly recognized word is better
            than nothing
          – Painful errors?
          – Mature users?


 Character vs. word accuracy
          – Word accuracy says much more, and is much easier to determine: each word
            that would be found correctly in a full-text search can be counted as
            correct.
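
The difference between the two measures can be illustrated with a minimal sketch. It assumes that character accuracy is derived from the Levenshtein edit distance against a ground-truth transcription and that word accuracy is the share of ground-truth words that appear unchanged in the OCR output (ignoring case). Both definitions and the sample strings are assumptions for illustration, not the evaluation method used in the projects mentioned here.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """1 minus the edit distance divided by the ground-truth length."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


def word_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words a full-text search would find in the OCR text."""
    truth_words = truth.lower().split()
    if not truth_words:
        return 1.0
    remaining = Counter(ocr.lower().split())
    found = 0
    for word in truth_words:
        if remaining[word] > 0:  # each OCR word can match only once
            remaining[word] -= 1
            found += 1
    return found / len(truth_words)


if __name__ == "__main__":
    truth = "und wenn der Text gut ist"
    ocr = "und wcnn der Tcxt gut ist"  # two broken characters
    print(f"character accuracy: {character_accuracy(truth, ocr):.2f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.2f}")
```

In this sample two wrong characters out of 25 give a character accuracy of 0.92, but two of the six words can no longer be found, so word accuracy drops to about 0.67.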




Examples from real world projects
 Based on: ABBYY Recognition Server 2
          –      Reichstagsprotokolle, 1925
          –      Zedler, 1744
          –      Coburger Zeitung, 1808
          –      Judentum, 1803
          –      Eckartshausen, 1792
          –      Landesbauernkammer, 1921
          –      Galvani, 1793
          –      Hieber, 1722
          –      Hofmann, 1875
          –      Buschendorf, 1805
          –      Schreiben, 1689
          –      Lateinische Texte




Correction of OCR text
 Until recently regarded as „absurd“
 But:
          – Crowd sourcing
          – New technologies
 Crowd sourcing
          –      Figures from the Australian Newspaper Project:
          –      Correction via a simple editor: line-by-line correction
          –      Since August 2008, 6,000 users have contributed
          –      7 million lines in 318,000 articles were corrected
          –      At 50 characters per line this is worth about 200,000 EUR
                 (compared with the prices of service providers)
 New technologies
          – IBM: CONCERT Tool, LMU: PostCorrection Tool
          – Productivity is expected to improve several-fold compared with simple
            rekeying (at least 1:5)
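
The crowdsourcing figures above can be turned into a rough per-unit calculation. The sketch below only rearranges the numbers quoted on the slide (7 million lines, 50 characters per line, 318,000 articles, 6,000 users, about 200,000 EUR). The function name and the derived ratios are illustrative and are not figures reported by the project.

```python
def crowd_correction_stats(lines: int, chars_per_line: int, value_eur: float,
                           users: int, articles: int) -> dict:
    """Derive simple ratios from the crowdsourcing figures (illustrative only)."""
    total_chars = lines * chars_per_line
    return {
        "total_chars": total_chars,
        "eur_per_1000_chars": value_eur / total_chars * 1000,
        "lines_per_user": lines / users,
        "lines_per_article": lines / articles,
    }


if __name__ == "__main__":
    stats = crowd_correction_stats(
        lines=7_000_000, chars_per_line=50, value_eur=200_000,
        users=6_000, articles=318_000,
    )
    print(f"corrected characters:  {stats['total_chars']:,}")
    print(f"implied price:         {stats['eur_per_1000_chars']:.2f} EUR per 1,000 characters")
    print(f"average lines per user: {stats['lines_per_user']:.0f}")
    print(f"average lines per article: {stats['lines_per_article']:.0f}")
```

Under these assumptions the corrected text amounts to 350 million characters, which implies a price of roughly 0.57 EUR per 1,000 characters.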




What to do with OCR results?
 Structural enhancement
          – INEX: competition based on OCR files
          – Functional Extension Parser
 Preservation
          –      Complexity is significantly increased
          –      Output: TXT, PDF, ABBYY XML
          –      ALTO Format
          –      How to integrate users' corrections?
          –      Proposition for enhancing ALTO format
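
To make the ALTO point more concrete, here is a minimal sketch of reading words and their confidence values from an ALTO file with Python's standard library. The sample document is a hand-written, heavily simplified fragment (not schema-valid, and not from any of the projects above). The 0.5 confidence threshold and the helper names are assumptions. The sketch does not represent the proposed ALTO enhancement, only how low-confidence words might be singled out as candidates for user correction.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ALTO fragment: String elements carry the recognised
# word in CONTENT and the word confidence (0..1) in WC; SP marks a space.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="und" WC="0.98"/><SP/><String CONTENT="wcnn" WC="0.41"/>
    </TextLine>
    <TextLine>
      <String CONTENT="der" WC="0.95"/><SP/><String CONTENT="Text" WC="0.90"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list:
    """Return one list of (word, confidence) tuples per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [(s.get("CONTENT", ""), float(s.get("WC", "1.0")))
                     for s in elem if local_name(s.tag) == "String"]
            lines.append(words)
    return lines


def plain_text(lines: list) -> str:
    """Join the words of each line with spaces, one text line per TextLine."""
    return "\n".join(" ".join(word for word, _ in line) for line in lines)


def low_confidence(lines: list, threshold: float) -> list:
    """Words whose confidence is below the (assumed) threshold."""
    return [word for line in lines for word, wc in line if wc < threshold]


if __name__ == "__main__":
    parsed = alto_lines(SAMPLE_ALTO)
    print(plain_text(parsed))
    print("candidates for correction:", low_confidence(parsed, threshold=0.5))
```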




Digital library applications
 Full-text search
          – JSTOR, Google, publishers
          – Faceted search (Solr)
 Indexing through search engines
          – Site XML
 Visibility of the OCR text
          – User training (by doing)
          – Necessary if user correction is to be included
 New research fields
          – Text mining
          – Linking of texts
          – Near duplicates, similarity and new identifiers
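
As an illustration of the last research field, near-duplicate detection on OCR text can be sketched with character n-grams and Jaccard similarity. Character trigrams are used here because they tolerate isolated OCR errors better than whole-word comparison. This is a common general-purpose technique chosen for illustration, not a method named in the slides, and the sample strings are invented.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams after lower-casing and whitespace normalisation."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Similarity of two texts in the range 0..1 based on shared n-grams."""
    return jaccard(char_ngrams(text_a, n), char_ngrams(text_b, n))


if __name__ == "__main__":
    original = "und wenn der Text gut ist"
    noisy_copy = "und wcnn der Tcxt gut ist"   # same text with OCR errors
    unrelated = "Reichstagsprotokolle 1925"
    print(f"original vs noisy copy: {similarity(original, noisy_copy):.2f}")
    print(f"original vs unrelated:  {similarity(original, unrelated):.2f}")
```

A pair of texts whose similarity exceeds a threshold chosen for the collection could then be treated as near duplicates; a suitable threshold would have to be determined on real data.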




Summary
 OCR is a „must“
          – For documents of the 19th and 20th centuries OCR generally provides
            useful or even very good results
          – Before 1800: improvements can be expected from IMPACT
          – Careful and exact scanning is always the main prerequisite, preferably
            at 400 ppi and 8 or 24 bit
          – Test runs with random sets
 Modern applications
          –      Full-text search
          –      Visibility of the erroneous text
          –      Options for correcting the text by users
          –      Several export formats (also for end-users)
          –      Site XML for search engines




                Thank you for your attention!

Bratislava WS - Mühlberger - OCR in libraries_pdf
