Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text

PoCoTo
An Open Source System for Efficient
Interactive Postcorrection of OCRed
Historical Texts
Thorsten Vobl, Annette Gotscharek, Ulrich Reffle,
Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing
University of Munich
Gini GmbH Munich

Motivation
- OCR of historical texts still produces many errors
- Downstream applications are harmed
Option: improve quality with interactive postcorrection
Why: selected, important texts/corpora (or parts of them) can or must be lifted
to a much higher level of accuracy, up to perfection.
Somewhat “business driven”
How: the user experience of the software has a major influence on the time and
effort needed to improve accuracy.

Approach
Features to raise productivity (within our competence, and exploratory):
•  Plug in language technology that unmasks orthographic variation in historical
language and returns document-specific distributions of OCR errors.
•  Tool visualizes series of similar OCR errors
•  Error series can be corrected in one shot
•  Implement productive UX through interface and functionality

Evaluation
Tool developed in a university environment during the EU project IMPACT
and maintained since, despite serious fluctuation
Practical user tests in three major European libraries have shown
gains in time and correction rates. User ratings from practitioners are high.
Maintaining interest; open for new languages and new functionalities.
Language resources and tool are separated through a client-server model
Published as an open source tool on GitHub.

Starting Point: Postcorrection Tool as a Carrier of Technology
•  Language technology used for improvement of interactive postcorrection
•  Lexica, matching tool, profiler integrated as background technology
•  Document-centric knowledge from unsupervised analysis of the OCRed
document used for detection of error classes and suggested corrections
•  Batch mode for correcting many errors in “one shot”
•  Rich graphical user interface to let users fully benefit
from “knowledge” on document-derived error classes

Flexible GUI
[Screenshot: GUI panes labeled OCR, Image, and Correction candidates / Special workflows]
•  Unlimited configuration of the views:
–  OCR with image snippets
–  Complete image page
–  Correction candidates, special workflows
•  Font/window size configuration

View: OCR + Image Snippets
•  OCRed text is presented to the user with word-image alignment.
•  Natural flow of text is maintained; comparison with the original
text images is a lot easier than with focus hopping.

View: Original Image
•  Alternative view with the complete page image.
–  Useful for difficult-to-read words
–  Useful if word segmentation of the OCR is too poor
–  Useful if long-distance text understanding is needed

Manual Correction
•  Classical correction workflow through sequential manual input

Drop-Down Selection of Correction Candidates
•  Speed-up through selection of proposed correction candidates
•  In line with what is usually offered: “Base Mode”
Two-Channel Model for OCRed Historical Text
[Diagram: Wmod (modern word form) -> patterns applied (“pattern trace”) -> Wgt (word form in ground truth) -> OCR errors applied (“OCR trace”) -> Wocr (word form in OCRed text)]
•  “Interpretation” of the OCR token
•  Starting from OCR token Wocr: estimation of the channel model
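
The two-channel idea can be illustrated with a small sketch. It is not the profiler's implementation: the names `Interpretation`, `variants`, and `interpretations`, the tiny lexicon, and the rule lists are all hypothetical, and each channel applies at most one rewrite rule. Rules are written in generation direction, (modern part, historical part) for patterns and (true part, OCRed part) for OCR errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    modern: str            # Wmod
    ground_truth: str      # Wgt
    ocr: str               # Wocr
    pattern_trace: tuple   # historical patterns applied (Wmod -> Wgt)
    ocr_trace: tuple       # OCR errors applied (Wgt -> Wocr)


def variants(word, rules):
    """Yield the word unchanged (empty trace) plus every result of
    applying one rule at one position."""
    yield word, ()
    for left, right in rules:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], ((left, right),)
            start = word.find(left, start + 1)


def interpretations(ocr_token, lexicon, hist_patterns, ocr_errors):
    """Collect all (Wmod, Wgt, Wocr) chains that end in the given OCR token."""
    found = []
    for modern in lexicon:
        for gt, pattern_trace in variants(modern, hist_patterns):
            for ocr, ocr_trace in variants(gt, ocr_errors):
                if ocr == ocr_token:
                    found.append(
                        Interpretation(modern, gt, ocr, pattern_trace, ocr_trace)
                    )
    return found


# Hypothetical toy data: historical "th" for modern "t", OCR reads "u" as "n".
lexicon = ["teil", "und"]
hist_patterns = [("t", "th")]
ocr_errors = [("u", "n")]

for token in ("theil", "nnd"):
    print(token, interpretations(token, lexicon, hist_patterns, ocr_errors))
```

In this toy setting "theil" is explained purely by a historical pattern and is therefore not an OCR error, while "nnd" is explained by an OCR error on the modern word "und".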

Profiling of Historical OCRed Corpora with EM
•  Improved model for words, patterns, OCR errors and their probabilities
•  For each OCR token Wocr: improved list of interpretations with probabilities
•  Final result: modern word, ground truth, OCR trace, hist trace
[Diagram labels: local guess, global guess]
Result: Probabilities of Historical Patterns
Document: Eckartshausen

LMF
Result: Probabilities of OCR Errors
Document: Eckartshausen

Lexicons Triggered by Profiles
•  Valid historical words not marked as errors even if
not in the lexicon (“hypothetical lexicon”)
•  Historical variants proposed as correction candidates
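
One way to picture a “hypothetical lexicon” is to undo historical patterns on a token and check whether a modern lexicon word results. The sketch below assumes this reading; the function names, the pattern list, and the lexicon are hypothetical, and only one pattern at one position is undone. In the tool, the patterns would come from the document profile.

```python
def modern_candidates(token, lexicon, patterns):
    """Return modern lexicon words from which `token` arises by applying
    one historical pattern (modern_part -> hist_part) at one position."""
    found = set()
    for modern_part, hist_part in patterns:
        start = token.find(hist_part)
        while start != -1:
            candidate = token[:start] + modern_part + token[start + len(hist_part):]
            if candidate in lexicon:
                found.add(candidate)
            start = token.find(hist_part, start + 1)
    return found


def is_suspicious(token, lexicon, patterns):
    """Flag a token only if it is neither a lexicon word nor explainable
    as a historical variant of a lexicon word."""
    return token not in lexicon and not modern_candidates(token, lexicon, patterns)


# Hypothetical toy data.
lexicon = {"teil", "sein"}
patterns = [("t", "th"), ("ei", "ey")]

for token in ("theil", "seyn", "tbeil"):
    print(token, is_suspicious(token, lexicon, patterns))
```

Here "theil" and "seyn" are accepted as historical variants of "teil" and "sein", while "tbeil" stays flagged as a likely OCR error.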

Selection of Correction Candidates
•  Improved ranking of candidates through document-specific
language and error profile
•  Concordance error view with high-confidence corrections

Rapid Workflow - Batch Processing: Identical Strings
•  High-probability identical strings corrected as a batch
•  Concordance views optional
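
Batch correction of identical strings can be sketched as follows. This is a minimal illustration, not the tool's code: `batch_identical`, the suggestion table, and the confidence threshold are assumptions, and the threshold stands in for the user's confirmation of a high-probability batch.

```python
from collections import defaultdict


def batch_identical(tokens, suggestions, threshold=0.9):
    """tokens: OCR tokens in document order.
    suggestions: maps an OCR string to (correction, confidence).
    Returns the corrected token list and, per corrected string,
    the correction together with all positions it was applied to."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    corrected = list(tokens)
    applied = {}
    for token, indices in positions.items():
        if token not in suggestions:
            continue
        correction, confidence = suggestions[token]
        if confidence >= threshold:
            # One decision corrects every occurrence of the string.
            for index in indices:
                corrected[index] = correction
            applied[token] = (correction, indices)
    return corrected, applied


# Hypothetical toy data.
tokens = ["nnd", "der", "nnd", "vnd"]
suggestions = {"nnd": ("und", 0.95), "vnd": ("und", 0.6)}
corrected, applied = batch_identical(tokens, suggestions)
print(corrected)
print(applied)
```

The position lists in `applied` are what a concordance view would display, so that the user can inspect every occurrence before or after the batch is accepted.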

Rapid Workflow - Batch Processing: Identical Error Patterns
•  Strings with identical error patterns corrected as a batch
•  In the example: n -> u
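
Grouping corrections by error pattern can be sketched by stripping the common prefix and suffix of each (OCR string, correction) pair. The sketch assumes that the arrow in "n -> u" reads from OCR reading to correction, and that each word contains a single error region; the function names and word pairs are hypothetical.

```python
from collections import defaultdict


def error_pattern(ocr, correction):
    """Return (ocr_part, corrected_part): what remains of both strings
    after stripping their common prefix and common suffix."""
    limit = min(len(ocr), len(correction))
    prefix = 0
    while prefix < limit and ocr[prefix] == correction[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and ocr[-1 - suffix] == correction[-1 - suffix]:
        suffix += 1
    return ocr[prefix:len(ocr) - suffix], correction[prefix:len(correction) - suffix]


def group_by_pattern(pairs):
    """Group (ocr, correction) pairs that share the same error pattern."""
    groups = defaultdict(list)
    for ocr, correction in pairs:
        groups[error_pattern(ocr, correction)].append((ocr, correction))
    return dict(groups)


# Hypothetical toy data: two different words share the pattern n -> u.
pairs = [("nnd", "und"), ("hans", "haus"), ("tbeil", "theil")]
for pattern, members in group_by_pattern(pairs).items():
    print(pattern, members)
```

Unlike the identical-strings batch, one group here can span different words, so a single decision on the pattern n -> u covers both "nnd" and "hans".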

Controlled “Hard” Evaluations
[Figure: BSB Dokument1, corrections made (0 to 800) against time in minutes (0 to 90); one curve per user: User1 F, User2 F, User3 B, User4 B, User5 F, User6 B]
•  Measurement points every 10 minutes for 90 minutes
•  Each user with a base/full session (inter-/intra-user comparison)
•  More corrections in Full Mode: on average 1.5x to 3x
•  Early gains: first 10 minutes

Soft Evaluations
Questionnaires with all three institutions.
Favorite aspect:
Batch Corrections
Main problems:
Stability
Correction of Segmentation Errors

Future work
•  Extend to new Languages e.g. Latin
•  New Correction Scenarios e.g. specific Named
Entity Correction
•  Turn Interest into a Community and Implement
Industrial Tool Partnerships for isolated parts of
the Software

Thanks for your attention!
… and special thanks to University of Alicante, Bavarian State Library, Royal
Library of the Netherlands for their Time and Efforts during the Experiments
