IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Analysis and Post-Correction of OCR-processed historical documents
Ulrich Reffle

CIS
University of Munich




Overview
• Document-specific analysis of OCR results of historical documents
• A system for interactive OCR post-correction




24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de




Document-specific analysis of OCR results of historical documents








Why do we need special methods?
Problems specific to the processing of historical language in the context of
mass digitization:
  – High OCR error rates
  – No standardized language
→ Special resources and methods are needed for OCR, post-processing, and
  information retrieval (IR).

[Diagram: Digital image → OCR → OCR result → Post-correction → IR, annotated "Problem of historical language variation"]




Why do we need special methods?
The diversity of the input material makes document-specific parameter settings
important:
  – Distribution of spelling variants
  – Special vocabulary
  – OCR channel model

[Diagram: Digital image → OCR → OCR result → Post-correction → IR, annotated "Problem of historical language variation"]




Document-specific language and error profiles
• Language and error profiles provide document-specific characteristics of
  the language and the OCR errors.
• Language profile: shares of foreign languages (such as Latin or French),
  frequencies for language modeling, important patterns of spelling variation
  (in English, e.g. o→ou, v→u)
• Error profile: estimated error rate, important error patterns (such as e→c, i→l),
  frequent erroneous words
• Language and error profiles are computed fully automatically; no manual
  interaction or ground truth is needed.
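
To make the notion of an error pattern such as e→c concrete, here is a minimal Python sketch that counts character-level patterns between an intended word and the string the OCR produced. It is only an illustration: the slides do not describe how the profiler derives its patterns, and the real system needs no ground truth. The function names and the three word pairs are assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_patterns(intended: str, observed: str) -> list[str]:
    """Return one 'x→y' pattern for every span where observed differs from intended."""
    matcher = SequenceMatcher(a=intended, b=observed, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            patterns.append(f"{intended[i1:i2]}→{observed[j1:j2]}")
    return patterns


def pattern_frequencies(pairs) -> Counter:
    """Count how often each pattern occurs over (intended, observed) pairs."""
    counts = Counter()
    for intended, observed in pairs:
        counts.update(edit_patterns(intended, observed))
    return counts


# Hypothetical pairs, invented for illustration only.
PAIRS = [("the", "thc"), ("men", "mcn"), ("nine", "uine")]
profile = pattern_frequencies(PAIRS)

for pattern, frequency in profile.most_common():
    print(pattern, frequency)
```

The same counting scheme would apply to spelling-variation patterns if the pairs were (modern spelling, historical spelling) instead of (intended word, OCR output).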







Global profile of a document

Language profile
  Pattern   Frequency        Lexicon           Share
  t→th      120              Modern            82%
  i→y       106              Historic          9%
  ä→a       38               Place names       6%
  …         …                Latin             3%

Error profile
  Pattern   Frequency        Word class        Share
  e→c       51               Correct words     72%
  n→u       45               Erroneous words   20%
  t→i       34               Unknown words     8%
  …         …
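
The percentage columns of the global profile can be read as shares over all tokens of the document. The following Python sketch shows that arithmetic under a simplifying assumption: every token has already been assigned to exactly one lexicon. The function name and the token labels are hypothetical; the label counts were chosen only so that the result reproduces the lexicon shares shown on the slide.

```python
from collections import Counter


def lexicon_shares(labels) -> dict[str, int]:
    """Return the share of each label in percent, rounded to whole numbers."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total) for label, n in counts.most_common()}


# Hypothetical per-token labels (100 tokens), invented for illustration.
LABELS = ["Modern"] * 82 + ["Historic"] * 9 + ["Place names"] * 6 + ["Latin"] * 3

for label, share in lexicon_shares(LABELS).items():
    print(f"{label:12} {share}%")
```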





Local profile of all words of a document
• A weighted set of interpretations (correction suggestions) for each word of the
  document.

[Figure: stacked word cards „theil“ … „hatn“; the table shows the local profile of the OCR token „hatn“]

Correction suggestion   Modern spelling   Probability
hath                    has               0.95
hat                     hat               0.01
hate                    hate              0.04
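
A local profile of this kind can be represented as a ranked list of weighted candidates. The Python sketch below normalises raw candidate weights so that they sum to 1 and sorts them by probability. The `Candidate` class, the `normalise` function, and the raw weights are assumptions for illustration; the slides do not say how the weights are actually computed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    suggestion: str  # proposed reading of the OCR token
    modern: str      # modern spelling of that reading
    weight: float    # raw score before, probability after normalisation


def normalise(candidates: list[Candidate]) -> list[Candidate]:
    """Scale the weights to sum to 1 and sort by descending probability."""
    total = sum(c.weight for c in candidates)
    if total <= 0:
        raise ValueError("candidate weights must sum to a positive value")
    scaled = [Candidate(c.suggestion, c.modern, c.weight / total) for c in candidates]
    return sorted(scaled, key=lambda c: c.weight, reverse=True)


# Hypothetical raw weights for the OCR token "hatn".
RAW = [Candidate("hat", "hat", 1.0), Candidate("hath", "has", 95.0), Candidate("hate", "hate", 4.0)]
ranked = normalise(RAW)

for c in ranked:
    print(f"{c.suggestion:6} {c.modern:6} {c.weight:.2f}")
```

Keeping the full ranked list, rather than only the best candidate, is what lets a user interface offer the alternatives as correction suggestions.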







Summary
• Document-specific profiles …
    – are computed in a fully automated way from OCR output
    – provide characteristics of the language and the OCR error channel in order
      to adapt OCR and downstream processes.








System for interactive post-correction of OCR results








Post-correction system
• A graphical user interface for fast and convenient post-correction,
  specifically for OCRed historical documents
• Novel possibilities for the detection, presentation, and correction of systematic
  OCR errors








Post-correction system
[Screenshot of the user interface; labelled areas: OCR Editor, Special functionality, Image]




Proper treatment of spelling variants
• Historical spelling variants are identified with the help of historical lexica and
  language profiles.
• Local profiles include non-modern words as correction suggestions.
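
One way to picture how spelling-variation patterns such as t→th or i→y connect a historical word to a modern lexicon entry is the following Python sketch. It undoes a single pattern at a single position and keeps the result only if a modern lexicon contains it. This is an assumption-laden illustration, not the method used by the system: the function name, the two-word lexicon, and the restriction to one pattern application are all hypothetical.

```python
def modern_candidates(word: str, patterns, modern_lexicon) -> set[str]:
    """Undo one (modern, historical) pattern at one position; keep lexicon hits."""
    found = set()
    for modern_part, historical_part in patterns:
        start = word.find(historical_part)
        while start != -1:
            candidate = word[:start] + modern_part + word[start + len(historical_part):]
            if candidate in modern_lexicon:
                found.add(candidate)
            start = word.find(historical_part, start + 1)
    return found


# Hypothetical resources: patterns are (modern, historical) pairs, e.g. t→th.
PATTERNS = [("t", "th"), ("i", "y")]
MODERN_LEXICON = {"teil", "sein"}

print(modern_candidates("theil", PATTERNS, MODERN_LEXICON))
print(modern_candidates("seyn", PATTERNS, MODERN_LEXICON))
```

A word that maps to a modern entry in this way can be treated as a legitimate historical variant rather than as an OCR error.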








Conventional correction methods
• Correcting words in the text view
    – Manual input
    – Selection of a correction suggestion








Batch correction of systematic OCR errors
• Systematic OCR errors are identified by the error profile.
• Batches of errors can be corrected with just a few keystrokes.
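
The idea of batch correction can be sketched in Python as grouping suspicious tokens by the error pattern that explains them and then accepting a whole group at once. The data layout, the function names, and the sample tokens are assumptions for illustration; the slides do not describe the tool's internal data model.

```python
from collections import defaultdict


def group_by_pattern(suspects):
    """Map each error pattern to the (token index, correction) pairs it explains."""
    batches = defaultdict(list)
    for index, pattern, correction in suspects:
        batches[pattern].append((index, correction))
    return batches


def apply_batch(tokens, batch):
    """Return a copy of tokens with every correction of one batch applied."""
    corrected = list(tokens)
    for index, correction in batch:
        corrected[index] = correction
    return corrected


# Hypothetical OCR tokens and suspects: (token index, error pattern, correction).
TOKENS = ["thc", "mcn", "wcre", "uine"]
SUSPECTS = [(0, "e→c", "the"), (1, "e→c", "men"), (2, "e→c", "were"), (3, "n→u", "nine")]

batches = group_by_pattern(SUSPECTS)
print(apply_batch(TOKENS, batches["e→c"]))
```

Accepting the e→c batch fixes three tokens in one step and leaves the n→u case for a separate decision, which is how a single user action can replace many individual edits.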








Evaluation
• User experiment with 14 participants.
• The novel technology makes correction up to 2.7 times faster.








Availability
• The graphical interface will be distributed as open source.
• The document pre-processing that produces the language and error profiles is
  covered by a US patent application.
    – Pre-processing is offered as a web service, currently free of charge.








                                                             Thank you!

                                           http://ocr.cis.uni-muenchen.de
                                              uli@cis.uni-muenchen.de




IMPACT Final Conference - Ulrich Reffle

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Analysis and Post-Correction of OCR-processed historical documents Ulrich Reffle CIS University of Munich
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Overview  Document specific analysis of OCR results of historical documents  A system for interactive OCR post-correction 24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de 2
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Document specific analysis of OCR results of historical documents 24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de 3
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Why do we need special methods? Problems specific to the processing of historical language in the context of mass digitization: – High OCR error rates – No standardized language  Special resources and methods are needed for OCR, post-processing and Information Retrieval Problem of historical language variation Post- Digital OCR OCR- Correction IR image result 24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de 4
  • 5. Why do we need special methods?
       Diversity of input material makes document specific parameter settings important:
         – Distribution of spelling variants
         – Special vocabulary
         – OCR channel model
       [Diagram: Digital image → OCR → OCR result → Post-Correction → IR, annotated "Problem of historical language variation"]
  • 6. Document specific language and error profiles
       – Language and error profiles provide document specific characteristics of the language and OCR errors.
       – Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)
       – Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words
       – Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.
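
A pattern of the form left→right (such as t→th in the global profile on the next slide) can be read as a rewrite rule from a modern spelling to a historical one. The following Python sketch illustrates that reading only; the function name `apply_pattern` and the one-substitution-per-variant rule are assumptions made here, not the IMPACT implementation.

```python
def apply_pattern(word: str, left: str, right: str) -> set[str]:
    """Return every string obtained by rewriting ONE occurrence of
    `left` in `word` to `right` (hypothetical helper, illustration only)."""
    if not left:
        raise ValueError("left side of a pattern must not be empty")
    variants = set()
    start = word.find(left)
    while start != -1:
        # Rewrite only the occurrence at `start`, leave the rest untouched.
        variants.add(word[:start] + right + word[start + len(left):])
        start = word.find(left, start + 1)
    return variants


# Modern German "teil" with pattern t→th yields the historical "theil".
print(apply_pattern("teil", "t", "th"))
```

Applying a single occurrence per variant keeps each generated form traceable to exactly one pattern instance, which is what a per-pattern frequency count needs.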
  • 7. Global profile of a document
       Language profile:
         Pattern   Frequency      Lexicon        %
         t→th      120            Modern         82%
         i→y       106            Historic       9%
         ä→a       38             Place names    6%
         …         …              Latin          3%
       Error profile:
         Pattern   Frequency      Word class        %
         e→c       51             Correct words     72%
         n→u       45             Erroneous words   20%
         t→i       34             Unknown words     8%
         …         …
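
The slides do not say how the counts in the global profile are obtained. As a minimal sketch, assuming every token has already been assigned one best interpretation that lists the variant and error patterns it involves, the pattern frequencies and an error rate could be aggregated as below. The class `Interpretation`, the function `global_profile`, and the sample tokens are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """Best interpretation of one OCR token (hypothetical structure)."""
    token: str
    variant_patterns: tuple[str, ...] = ()  # e.g. ("t→th",)
    error_patterns: tuple[str, ...] = ()    # e.g. ("e→c",)


def global_profile(interpretations: list[Interpretation]):
    """Count pattern frequencies and the share of tokens with an OCR error."""
    variant_counts: Counter = Counter()
    error_counts: Counter = Counter()
    erroneous = 0
    for item in interpretations:
        variant_counts.update(item.variant_patterns)
        error_counts.update(item.error_patterns)
        if item.error_patterns:
            erroneous += 1
    total = len(interpretations)
    error_rate = erroneous / total if total else 0.0
    return variant_counts, error_counts, error_rate


# Invented sample data, not taken from the presentation.
sample = [
    Interpretation("theil", ("t→th",)),
    Interpretation("bey", ("i→y",)),
    Interpretation("thcil", ("t→th",), ("e→c",)),
    Interpretation("und"),
]
variants, errors, rate = global_profile(sample)
print(variants.most_common(), errors.most_common(), rate)
```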
  • 8. Local profile of all words of a document
       – Weighted set of interpretations / correction suggestions for each word of the document.
       Example tokens shown: „theil“, „hatn“
         Correction suggestion   Modern spelling   Probability
         hath                    has               0.95
         hat                     Hat               0.01
         hate                    hate              0.04
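
A local profile can be pictured as a small weighted candidate list per token. The sketch below uses the three suggestions and probabilities from slide 8, assuming they belong to the OCR token „hatn“; the names `Candidate`, `best_candidate` and `normalize` are hypothetical and only illustrate how such a list might be consumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    suggestion: str  # proposed correction, possibly a historical spelling
    modern: str      # modern spelling of the suggestion
    weight: float    # probability assigned by the profile


def best_candidate(candidates: list[Candidate]) -> Candidate:
    """Return the highest-weighted correction suggestion."""
    return max(candidates, key=lambda c: c.weight)


def normalize(candidates: list[Candidate]) -> list[Candidate]:
    """Rescale the weights so that they sum to 1."""
    total = sum(c.weight for c in candidates)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [Candidate(c.suggestion, c.modern, c.weight / total) for c in candidates]


# Values as printed on slide 8 (assumed to refer to the token "hatn").
hatn_profile = [
    Candidate("hath", "has", 0.95),
    Candidate("hat", "Hat", 0.01),
    Candidate("hate", "hate", 0.04),
]
print(best_candidate(hatn_profile).suggestion)
```

Note that the top suggestion here is a historical word form, which matches the later point that local profiles include non-modern words as correction suggestions.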
  • 9. Summary
       Document specific profiles …
         – are computed in a fully automated way from OCR output
         – provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.
  • 10. System for interactive post-correction of OCR results
  • 11. Post-correction system
       – A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents
       – Novel possibilities for detection, presentation and correction of systematic OCR errors.
  • 12. Post-correction system
       [Screenshot of the user interface; labels: OCR Editor, Special functionality, Image]
  • 13. Proper treatment of spelling variants
       – Historical spelling variants are identified with the help of historical lexica and language profiles.
       – Local profiles include non-modern words as correction suggestions.
  • 14. Conventional correction methods
       – Correcting words in the text view
         – Manual input
         – Selection of a correction suggestion
  • 15. Batch correction of systematic OCR errors
       – Systematic OCR errors are identified by the error profile.
       – Batches of errors can be corrected with just a few keystrokes.
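
The idea of batch correction can be sketched as grouping suspicious tokens by the error pattern that explains them and then accepting a whole group at once. The data layout (token positions mapped to a suggested correction and a pattern) and the names `build_batches` and `apply_batch` are assumptions for illustration, not the interface of the presented tool.

```python
from collections import defaultdict


def build_batches(suspects: dict[int, tuple[str, str]]) -> dict[str, list[tuple[int, str]]]:
    """Group suspicious token positions by their OCR error pattern.

    `suspects` maps a token position to (suggested correction, pattern).
    """
    batches: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for position, (correction, pattern) in sorted(suspects.items()):
        batches[pattern].append((position, correction))
    return dict(batches)


def apply_batch(tokens: list[str], batch: list[tuple[int, str]]) -> list[str]:
    """Return a copy of `tokens` with every correction in the batch applied."""
    corrected = list(tokens)
    for position, correction in batch:
        corrected[position] = correction
    return corrected


# Invented example: "thc" is "the" with e read as c, "aud" is "and" with n read as u.
tokens = ["thc", "cat", "aud", "thc", "dog"]
suspects = {0: ("the", "e→c"), 2: ("and", "n→u"), 3: ("the", "e→c")}
batches = build_batches(suspects)
print(apply_batch(tokens, batches["e→c"]))
```

Accepting the e→c batch fixes both occurrences of "thc" in one step, while the n→u batch stays open for a separate decision.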
  • 16. Evaluation
       – User experiment with 14 participants.
       – Novel technology makes correction up to 2.7 times faster.
  • 17. Availability
       – The graphical interface is going to be distributed as open source.
       – Document pre-processing to obtain language and error profiles is protected by a US patent application.
         – Pre-processing is offered as a web service, as of now free of charge.
  • 18. Thank you!
       http://ocr.cis.uni-muenchen.de
       uli@cis.uni-muenchen.de