Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text

Presentation of the paper PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays

  1. PoCoTo: An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts. Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz. CIS - Center for Information and Language Processing, University of Munich; Gini GmbH, Munich.
  2. Motivation. Historical texts still contain many OCR errors, and these errors harm downstream applications. Interactive postcorrection is an option for improving quality. Why: selected and important texts, corpora, or parts of them can or must be lifted to a much higher level of accuracy, up to perfection; this is to some extent business driven. How: the user experience of the software has a major influence on the time and effort needed to improve accuracy.
  3. Approach. Features to raise productivity, within our competence and explorative: • Plug in language technology that unmasks orthographic variation in historical language and returns document-specific distributions of OCR errors. • The tool visualizes series of similar OCR errors. • Error series can be corrected in one shot. • Implement a productive UX through interface and functionality.
  4. Evaluation. The tool was developed in a university environment during the EU project IMPACT and has been maintained since, despite serious fluctuation. Practical user tests in three major European libraries have shown gains in time and correction rates. User ratings from practitioners are high. Maintaining interest; open for new languages and new functionalities. Language resources and tool are separated through a server-client model. Published as an open source tool on GitHub.
  5. Starting Point: Postcorrection Tool as a Carrier of Technology. Language technology is used to improve interactive postcorrection. Lexica, a matching tool, and a profiler are integrated as background technology. Document-centric knowledge from unsupervised analysis of the OCRed document is used to detect error classes and suggest corrections. Batch mode corrects many errors in "one shot". A rich graphical user interface lets users fully benefit from the "knowledge" of document-derived error classes.
  6. Flexible GUI. Panels: OCR; image; correction candidates and special workflows. Unlimited configuration of the views: OCR with image snippets; complete page image; correction candidates and special workflows. Font and window size are configurable.
  7. View: OCR + Image Snippets. OCRed text is presented to the user with word-image alignment. The natural flow of text is maintained, and comparison with the original text images is a lot easier than with focus hopping.
  8. View: Original Image. Alternative view with the complete page image. Useful for difficult-to-read words, if the word segmentation of the OCR is too poor, or if long-distance text understanding is needed.
  9. Manual Correction. Classical correction workflow through sequential manual input.
  10. Drop Down Selection of Correction Candidates. Speed-up through selection of proposed correction candidates. In line with what is usually offered: "Base Mode".
  11. Two-Channel Model for OCRed Historical Text. Modern word form (Wmod) → patterns applied ("pattern trace") → word form in ground truth (Wgt) → OCR errors applied ("OCR trace") → word form in OCRed text (Wocr). Estimation of the channel model: starting from the OCR token Wocr, find its "interpretation".
  12. Profiling of Historical OCRed Corpora with EM. Local guess: for each OCR token Wocr, an improved list of interpretations with probabilities (modern word, ground truth, OCR trace, hist trace). Global guess: an improved model for words, patterns, and OCR errors and their probabilities. The two steps alternate until the final result.
  13. Document Eckartshausen: result probabilities of historical patterns.
  14. LMF. Document Eckartshausen: result probabilities of OCR errors.
  15. Lexicons Triggered by Profiles. Valid historical words are not marked as errors even if they are not in the lexicon ("hypothetical lexicon"). Historical variants are proposed as correction candidates.
  16. Selection of Correction Candidates. Improved ranking of candidates through a document-specific language and error profile. Concordance error view with high-confidence corrections.
  17. Rapid Workflow - Batch Processing Identical Strings. High-probability identical strings are corrected as a batch. Concordance views are optional.
  18. Rapid Workflow - Batch Processing Identical Error Patterns. Strings with identical error patterns are corrected as a batch. In the example: n -> u.
  19. Controlled "Hard" Evaluations. [Plot: BSB Dokument1, corrections made (0 to 800) over time in minutes (0 to 90); series: User1 F, User2 F, User3 B, User4 B, User5 F, User6 B.] Measure points every 10 minutes for 90 minutes. Each user with a base/full session (inter/intra-user comparison). More corrections in Full Mode, on average 1.5x to 3x. Early gains: first 10 minutes.
  20. Closer Look into the Data
  21. Soft Evaluations. Questionnaires with all three institutions. Favorite aspect: batch corrections. Main problems: stability; correction of segmentation errors.
  22. Future Work. • Extend to new languages, e.g. Latin. • New correction scenarios, e.g. specific named entity correction. • Turn interest into a community and implement industrial tool partnerships for isolated parts of the software.
  23. Thanks for your attention! … and special thanks to the University of Alicante, the Bavarian State Library, and the Royal Library of the Netherlands for their time and effort during the experiments.
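
The two-channel model on slide 11 can be illustrated with a small sketch. It is not PoCoTo's implementation: the rule format, the restriction to at most one rewrite per channel, and all names (`rewrites`, `interpretations`, the toy lexicon and rules) are assumptions made for this example. It enumerates the "interpretations" of an OCR token, i.e. every chain Wmod → Wgt → Wocr that a modern lexicon, a set of historical patterns, and a set of OCR error patterns can explain.

```python
from typing import Iterator, List, NamedTuple, Optional, Sequence, Tuple

Rule = Tuple[str, str]                  # (left, right): rewrite `left` as `right`
Trace = Optional[Tuple[str, str, int]]  # applied rule plus position, or None


class Interpretation(NamedTuple):
    w_mod: str         # modern word form
    hist_trace: Trace  # historical pattern applied ("pattern trace")
    w_gt: str          # word form in the ground truth
    ocr_trace: Trace   # OCR error applied ("OCR trace")
    w_ocr: str         # word form in the OCRed text


def rewrites(word: str, rules: Sequence[Rule]) -> Iterator[Tuple[str, Trace]]:
    """Yield the word unchanged, then every single-rule, single-position rewrite."""
    yield word, None
    for left, right in rules:
        start = word.find(left)
        while start != -1:
            rewritten = word[:start] + right + word[start + len(left):]
            yield rewritten, (left, right, start)
            start = word.find(left, start + 1)


def interpretations(
    w_ocr: str,
    lexicon: Sequence[str],
    hist_rules: Sequence[Rule],
    ocr_rules: Sequence[Rule],
) -> List[Interpretation]:
    """Collect every lexicon word that reaches w_ocr through the two channels."""
    found = []
    for w_mod in lexicon:
        for w_gt, hist_trace in rewrites(w_mod, hist_rules):
            for candidate, ocr_trace in rewrites(w_gt, ocr_rules):
                if candidate == w_ocr:
                    found.append(
                        Interpretation(w_mod, hist_trace, w_gt, ocr_trace, w_ocr)
                    )
    return found


# Hypothetical toy data, not taken from the slides.
LEXICON = ["teil", "und"]
HIST_RULES = [("t", "th")]  # modern "t" spelled "th" in the historical text
OCR_RULES = [("u", "n")]    # ground-truth "u" misread as "n"

historical = interpretations("theil", LEXICON, HIST_RULES, OCR_RULES)
misread = interpretations("nnd", LEXICON, HIST_RULES, OCR_RULES)
```

Here "theil" is explained as a valid historical variant of "teil" with an empty OCR trace, while "nnd" is explained as "und" with one OCR error. A real profiler would attach probabilities to each trace and allow several rewrites per word.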

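The alternation of local and global guesses on slide 12 follows the usual expectation-maximization shape. The sketch below is a deliberately simplified stand-in, assuming each interpretation is a triple (modern word, historical pattern, OCR error) scored by the product of three independent weights; the function name `em_profile` and the toy tokens are hypothetical, and the actual profiler's model is not described on the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (modern word, historical pattern, OCR error); "" means "none applied"
Interp = Tuple[str, str, str]


def em_profile(
    candidates: List[List[Interp]], iterations: int = 5
) -> Tuple[List[Dict[str, float]], List[List[float]]]:
    """Alternate local guesses (per-token posteriors) and global guesses (weights).

    Assumes every token has at least one candidate interpretation.
    """
    # Start with uniform weights for words, historical patterns, and OCR errors.
    weights: List[Dict[str, float]] = [defaultdict(lambda: 1.0) for _ in range(3)]
    posteriors: List[List[float]] = []
    for _ in range(iterations):
        # Local guess: score each token's interpretations under the current model.
        posteriors = []
        for interps in candidates:
            scores = [
                weights[0][w] * weights[1][h] * weights[2][o] for w, h, o in interps
            ]
            total = sum(scores)
            posteriors.append([s / total for s in scores])
        # Global guess: re-estimate the weights from expected counts.
        counts: List[Dict[str, float]] = [defaultdict(float) for _ in range(3)]
        for interps, post in zip(candidates, posteriors):
            for interp, p in zip(interps, post):
                for k in range(3):
                    counts[k][interp[k]] += p
        weights = []
        for c in counts:
            z = sum(c.values())
            weights.append(defaultdict(float, {e: v / z for e, v in c.items()}))
    return weights, posteriors


# Hypothetical document: three unambiguous tokens showing the OCR error "u>n",
# plus one ambiguous token that could be "bunt" (u>n) or "bant" (a>n).
tokens = [
    [("und", "", "u>n")],
    [("und", "", "u>n")],
    [("und", "", "u>n")],
    [("bunt", "", "u>n"), ("bant", "", "a>n")],
]
weights, posteriors = em_profile(tokens)
```

Because "u>n" is frequent elsewhere in the document, the ambiguous token's posterior shifts toward "bunt" over the iterations. This is the sense in which a document-specific error profile can improve the ranking of correction candidates.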
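
The batch workflows on slides 17 and 18 amount to grouping tokens by a shared key and accepting the proposed correction for a whole group at once. The following is a minimal sketch under stated assumptions: the `Token` fields, the confidence threshold, and the reading of "n -> u" as "replace OCR n with u" are all hypothetical choices for illustration, not PoCoTo's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Token:
    ocr: str                         # string as it appears in the OCRed text
    suggestion: str                  # top-ranked correction candidate
    pattern: str                     # error pattern behind the suggestion, e.g. "n->u"
    confidence: float                # probability assigned to the suggestion
    corrected: Optional[str] = None  # filled in once the user accepts a correction


def batches_by_pattern(
    tokens: Iterable[Token], min_confidence: float = 0.9
) -> Dict[str, List[Token]]:
    """Group uncorrected, high-confidence tokens by their error pattern."""
    batches: Dict[str, List[Token]] = defaultdict(list)
    for token in tokens:
        if token.corrected is None and token.confidence >= min_confidence:
            batches[token.pattern].append(token)
    return dict(batches)


def correct_batch(batch: Iterable[Token]) -> int:
    """Accept the suggestion for every token in the batch; return how many changed."""
    changed = 0
    for token in batch:
        token.corrected = token.suggestion
        changed += 1
    return changed


# Hypothetical tokens; the pattern label "n->u" follows the slide's example.
tokens = [
    Token("nnd", "und", "n->u", 0.97),
    Token("nns", "uns", "n->u", 0.95),
    Token("anch", "auch", "n->u", 0.60),  # below threshold, left for manual review
    Token("rnit", "mit", "rn->m", 0.93),
]
batches = batches_by_pattern(tokens)
n_corrected = correct_batch(batches["n->u"])
```

Grouping by `token.ocr` instead of `token.pattern` gives the identical-strings variant from slide 17. In either case the low-confidence token stays untouched, so the user can still inspect it in a concordance view.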