IMPACT Final Conference - Ulrich Reffle
Postcorrection in IMPACT with Ulrich Reffle from the University of Munich

Usage Rights

CC Attribution-NoDerivs License

IMPACT Final Conference - Ulrich Reffle Presentation Transcript

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Analysis and Post-Correction of OCR-processed historical documents. Ulrich Reffle, CIS, University of Munich
  • 2. Overview
    – Document-specific analysis of OCR results of historical documents
    – A system for interactive OCR post-correction
    24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de
  • 3. Document-specific analysis of OCR results of historical documents
  • 4. Why do we need special methods? Problems specific to the processing of historical language in the context of mass digitization:
    – High OCR error rates
    – No standardized language
    Special resources and methods are therefore needed for OCR, post-processing and Information Retrieval (IR).
    [Figure: pipeline from digital image via OCR and OCR result to post-correction and IR, labeled "Problem of historical language variation"]
  • 5. Why do we need special methods? The diversity of the input material makes document-specific parameter settings important:
    – Distribution of spelling variants
    – Special vocabulary
    – OCR channel model
    [Figure: pipeline from digital image via OCR and OCR result to post-correction and IR, labeled "Problem of historical language variation"]
  • 6. Document-specific language and error profiles
    Language and error profiles describe document-specific characteristics of the language and of the OCR errors.
    – Language profile: shares of foreign languages (such as Latin or French), frequencies for language modeling, important patterns of spelling variation (in English, e.g. o→ou, v→u)
    – Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words
    Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.
  • 7. Global profile of a document
    Language profile:
      Pattern   Frequency     Lexicon        %
      t→th      120           Modern         82%
      i→y       106           Historic       9%
      ä→a       38            Place names    6%
      …         …             Latin          3%
    Error profile:
      Pattern   Frequency     Word status        %
      e→c       51            Correct words      72%
      n→u       45            Erroneous words    20%
      t→i       34            Unknown words      8%
      …         …
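The slides do not say how the global profile is computed, so the following is only a minimal sketch of how such a summary could be tallied once each token has already been labeled and each pattern occurrence recorded. The function names `lexicon_shares` and `top_patterns` and the input lists are hypothetical; the numbers are chosen to reproduce the language profile on slide 7.

```python
from collections import Counter


def lexicon_shares(labels):
    """Percentage share of each lexicon label among the given tokens."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total) for label, n in counts.items()}


def top_patterns(pattern_hits, k=3):
    """The k most frequent patterns with their counts, most frequent first."""
    return Counter(pattern_hits).most_common(k)


# Hypothetical per-token labels, chosen to reproduce the shares on the slide.
labels = ["modern"] * 82 + ["historic"] * 9 + ["place name"] * 6 + ["latin"] * 3
# Hypothetical pattern hits, one entry per observed occurrence of a pattern.
hits = ["t→th"] * 120 + ["i→y"] * 106 + ["ä→a"] * 38

shares = lexicon_shares(labels)
patterns = top_patterns(hits)
```

The hard part, deciding which label or pattern applies to a token without ground truth, is exactly what the profiling step described in the talk does and is not modeled here.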
  • 8. Local profile of all words of a document
    A weighted set of interpretations / correction suggestions for each word of the document.
    [Figure: example words „theil“ and „hatn“]
    Suggestions for „hatn“:
      Correction suggestion   Modern spelling   Probability
      hath                    has               0.95
      hat                     hat               0.01
      hate                    hate              0.04
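A local profile as described on slide 8 can be pictured as a mapping from each OCR token to a weighted list of suggestions. The sketch below is an illustration under that assumption, not the system's actual data model: the `Suggestion` class and the helpers `normalise` and `best` are hypothetical names, and the raw weights are chosen so that normalisation yields the probabilities shown for „hatn“.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    correction: str  # proposed reading of the OCR token
    modern: str      # modern spelling of that reading
    weight: float    # non-negative score, not necessarily normalised


def normalise(suggestions):
    """Rescale the weights so that they sum to 1."""
    total = sum(s.weight for s in suggestions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [Suggestion(s.correction, s.modern, s.weight / total) for s in suggestions]


def best(suggestions):
    """The suggestion with the highest weight."""
    return max(suggestions, key=lambda s: s.weight)


# Raw weights are hypothetical; they normalise to 0.95, 0.01 and 0.04.
local_profile = {
    "hatn": normalise([
        Suggestion("hath", "has", 95),
        Suggestion("hat", "hat", 1),
        Suggestion("hate", "hate", 4),
    ]),
}
top = best(local_profile["hatn"])
```

Keeping the modern spelling next to each suggestion lets a historical form such as "hath" be offered as the correction while still being linked to "has" for retrieval.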
  • 9. Summary: Document-specific profiles …
    – are computed in a fully automated way from OCR output
    – provide characteristics of the language and of the OCR error channel in order to adapt OCR and downstream processes.
  • 10. System for interactive post-correction of OCR results
  • 11. Post-correction system
    – A graphical user interface for fast and convenient post-correction, designed specifically for OCRed historical documents
    – Novel possibilities for detection, presentation and correction of systematic OCR errors
  • 12. Post-correction system
    [Screenshot with labeled areas: OCR editor, special functionality, image]
  • 13. Proper treatment of spelling variants
    – Historical spelling variants are identified with the help of historical lexica and language profiles.
    – Local profiles include non-modern words as correction suggestions.
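One way to picture how a spelling-variation pattern links a historical form to a modern word is to apply each pattern once to every modern lexicon entry and check whether the result equals the token in the text. This is a simplified sketch under that assumption, not the method used in the IMPACT tools; `variants`, `explain`, the tiny lexicon and the single-application restriction are all hypothetical. The patterns t→th and i→y are taken from slide 7.

```python
def variants(word, patterns):
    """All forms obtained by applying one pattern at one position of word.

    Patterns are (modern, historical) pairs with a non-empty modern side.
    """
    out = set()
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            form = word[:start] + right + word[start + len(left):]
            out.add((form, (left, right)))
            start = word.find(left, start + 1)
    return out


def explain(token, modern_lexicon, patterns):
    """Modern words that yield token through a single pattern application."""
    found = []
    for modern in modern_lexicon:
        if modern == token:
            continue  # already a modern word, nothing to explain
        for form, pattern in variants(modern, patterns):
            if form == token:
                found.append((modern, pattern))
    return sorted(found)


patterns = [("t", "th"), ("i", "y")]
lexicon = {"teil", "heil", "teller"}  # hypothetical modern lexicon
theil = explain("theil", lexicon, patterns)
teyl = explain("teyl", lexicon, patterns)
```

A token explained this way can be kept as a legitimate historical spelling instead of being flagged as an OCR error, which is the distinction slide 13 draws.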
  • 14. Conventional correction methods: correcting words in the text view
    – Manual input
    – Selection of a correction suggestion
  • 15. Batch correction of systematic OCR errors
    – Systematic OCR errors are identified by the error profile.
    – Batches of errors can be corrected with just a few keystrokes.
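The batch idea on slide 15 can be illustrated by grouping correction candidates by the error pattern that explains them and then applying one whole group at once. The sketch below assumes the candidates (token position, OCR token, proposed correction, pattern) are already available from an error profile; `group_by_pattern`, `apply_batch` and the sample tokens are hypothetical and not part of the presented system.

```python
from collections import defaultdict


def group_by_pattern(candidates):
    """Group (index, ocr_token, correction, pattern) tuples by pattern."""
    batches = defaultdict(list)
    for index, token, correction, pattern in candidates:
        batches[pattern].append((index, token, correction))
    return dict(batches)


def apply_batch(tokens, batch):
    """Return a copy of tokens with every correction of the batch applied."""
    corrected = list(tokens)
    for index, token, correction in batch:
        # Only touch a position that still holds the expected OCR token.
        if corrected[index] == token:
            corrected[index] = correction
    return corrected


tokens = ["thc", "old", "housc", "hatn"]  # hypothetical OCR output
candidates = [
    (0, "thc", "the", ("e", "c")),
    (2, "housc", "house", ("e", "c")),
    (3, "hatn", "hath", ("h", "n")),
]
batches = group_by_pattern(candidates)
fixed = apply_batch(tokens, batches[("e", "c")])
```

Accepting the e→c batch corrects both affected words in one step and leaves „hatn“ for a separate decision, which is how a few keystrokes can cover many errors.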
  • 16. Evaluation: In a user experiment with 14 participants, the novel technology made correction up to 2.7 times faster.
  • 17. Availability
    – The graphical interface will be distributed as open source.
    – Document pre-processing to obtain language and error profiles is protected by a US patent application.
    – Pre-processing is offered as a web service, as of now free of charge.
  • 18. Thank you! http://ocr.cis.uni-muenchen.de uli@cis.uni-muenchen.de