A-I-PoCoTo — Combining Automated and Interactive
OCR Postcorrection
Tobias Englmeier, Florian Fink and Klaus U. Schulz
9. May 2019
Florian Fink (CIS) A-I-PoCoTo 9. May 2019 1 / 16
Overview
Automatic post-correction (A-PoCoTo)
Evaluation results
Automatic interactive post-correction (A-I-PoCoTo)
Summary
A-PoCoTo
Automatic post-correction of OCR results for historical documents using
supervised machine learning.
Multiple OCRs (OCR1, OCR2, ..., OCRn) can be used
Three steps with two profiling rounds
Three classifiers for 1, 2, ..., n OCRs
Classifiers are trained using logistic regression
Developed as a module of the OCR-D project [1]
[1] http://www.ocr-d.de/
PoCoTo
PoCoTo (Post-Correction Tool) is a tool for manual, interactive
post-correction of OCRed historical documents.
Initially a desktop application (2014) [2]
New version as a web application (2017)
Profiling is used for error detection and correction suggestions
Batch correction of (error) patterns
[2] Vobl, T., Gotscharek, A., Reffle, U., Ringlstetter, C., & Schulz, K. U. (2014, May).
PoCoTo: an open source system for efficient interactive postcorrection of OCRed
historical texts. In Proceedings of the First International Conference on Digital Access
to Textual Cultural Heritage (pp. 57-61). ACM.
Profile (global)
Given an OCRed historical text, profiling derives a ‘statistical picture’
(an estimate) of the language in the document using various background lexica [3]:
OCR errors and OCR error series
Historical patterns of the form mod → hist (t → th, ei → ey, ...)
Underlying modern words
The profile is used as a feature generator for the automatic
post-correction system
[3] Reffle, U., & Ringlstetter, C. (2013). Unsupervised profiling of OCRed historical
documents. Pattern Recognition, 46(5), 1346-1357.
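To make the mod → hist patterns concrete, here is a minimal Python sketch that expands a modern word into possible historical spellings. The function name `historical_variants` and the simplification that a pattern is applied either to every occurrence or not at all are assumptions for illustration; the profiler described in [3] is considerably more elaborate.

```python
from itertools import combinations

# The two patterns named on the slide (mod -> hist).
PATTERNS = [("t", "th"), ("ei", "ey")]

def historical_variants(modern, patterns):
    """Return the spellings obtained by applying every subset of patterns.

    Simplification (assumption): each chosen pattern rewrites all of its
    occurrences in the word, in the order the patterns are listed.
    """
    variants = set()
    for size in range(len(patterns) + 1):
        for subset in combinations(patterns, size):
            word = modern
            for mod, hist in subset:
                word = word.replace(mod, hist)
            variants.add(word)
    return variants

variants = historical_variants("teil", PATTERNS)
```

For the modern word "teil" this yields the unchanged form plus "theil", "teyl" and "theyl".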
Profile (local)
Profiling associates each token w_ocr of a document with a set of
interpretations of the form w_mod,cand →α w_hist,cand →β w_ocr.
The α channel (historical patterns) and the β channel (OCR errors) can be empty
Each interpretation has a weight
Each w_ocr has a ranked set of interpretations w_hist,cand
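One way to picture an interpretation is as a record holding the modern candidate, the historical candidate, the OCR token (w_ocr) and the two channels. The sketch below is an assumed data model, not the profiler's actual format; the class name, field names, example tokens and weights are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    modern: str       # w_mod,cand
    historical: str   # w_hist,cand
    ocr: str          # w_ocr
    # alpha channel: historical patterns as (mod, hist) pairs, may be empty
    hist_patterns: list = field(default_factory=list)
    # beta channel: OCR errors as (correct, recognised) pairs, may be empty
    ocr_errors: list = field(default_factory=list)
    weight: float = 0.0

def ranked(interpretations):
    """Order interpretations by weight, best first."""
    return sorted(interpretations, key=lambda i: i.weight, reverse=True)

# Made-up interpretations for the OCR token "tbeil".
candidates = [
    Interpretation("beil", "beil", "tbeil", [], [("", "t")], 0.25),
    Interpretation("teil", "theil", "tbeil", [("t", "th")], [("h", "b")], 0.60),
    Interpretation("teil", "teil", "tbeil", [], [("", "b")], 0.15),
]
best = ranked(candidates)[0]
```

Here the best interpretation reads "tbeil" as the historical spelling "theil" of modern "teil", with one historical pattern (t → th) and one OCR error (h → b).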
A-PoCoTo
[Figure: pipeline diagram. OCR1, OCR2, ..., OCRn → Alignment → Profiling → Lexicon Extension → Profiling → Ranking → Decision; the diagram also carries the label A-I-PoCoTo.]
Multiple OCRs
One master OCR
Additional support OCRs (optional)
Support OCRs are aligned token-wise with the master OCR
Each w_ocr has n − 1 additional OCR tokens w_ocr2, w_ocr3, ..., w_ocrn
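The slides do not say which alignment algorithm is used. As a rough sketch only, token-wise alignment of a support OCR to the master OCR can be approximated with `difflib.SequenceMatcher` from the Python standard library; the function name `align_to_master` and the rule of leaving unmatched master tokens as `None` are assumptions.

```python
from difflib import SequenceMatcher

def align_to_master(master, support):
    """Return one support token (or None) for every master token."""
    aligned = [None] * len(master)
    matcher = SequenceMatcher(a=master, b=support, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pair tokens position by position inside matching or replaced runs;
        # master tokens without a counterpart stay None.
        if tag in ("equal", "replace"):
            for offset in range(min(i2 - i1, j2 - j1)):
                aligned[i1 + offset] = support[j1 + offset]
    return aligned

master = ["Die", "Grenzboten", "find", "ein", "Blatt"]
support = ["Die", "Grenzboten", "sind", "ein", "Blatt"]
aligned = align_to_master(master, support)
```

In this example the master token "find" is paired with the support token "sind", which is the kind of disagreement later steps can use as a feature.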
A-PoCoTo — Lexicon Extension step
In the lexicon extension (LE) step, a classifier tries to find good w_ocr tokens to extend
the profiler’s back-end resources.
Classification starts after the first profiling round
Only w_ocr with a non-empty α or β channel are considered
Set of features for each w_ocr (token shape, candidate set, unigram
frequencies, agreeing OCRs, ...)
Each w_ocr is classified as True or False
True tokens are put into the extended lexicon for the second profiling
round
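The following is a toy illustration of the lexicon extension idea: turn each OCR token (w_ocr) into a feature vector and train a logistic regression classifier on labelled tokens. The concrete feature encoding, the training data and the plain stochastic gradient descent are assumptions; the slides name only the feature families and the use of logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(token, n_candidates, unigram_freq, n_agreeing_ocrs):
    """Hypothetical encoding of the feature families named on the slide."""
    return [
        1.0,                                   # bias term
        1.0 if token[:1].isupper() else 0.0,   # token shape: capitalised
        1.0 if token.isalpha() else 0.0,       # token shape: letters only
        1.0 / (1 + n_candidates),              # size of the candidate set
        math.log1p(unigram_freq),              # unigram frequency
        float(n_agreeing_ocrs),                # support OCRs that agree
    ]

def train(samples, labels, epochs=500, lr=0.1):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                weights[i] += lr * (y - p) * xi
    return weights

def predict(weights, x):
    """True means: add the token to the extended lexicon."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5

# Made-up training tokens: (token, candidates, unigram frequency, agreeing OCRs).
samples = [
    features("Grenzboten", 3, 12, 1),
    features("vnnd", 5, 40, 1),
    features("Gtenzb0ten", 2, 0, 0),
    features("vn#d", 4, 0, 0),
]
labels = [1, 1, 0, 0]
weights = train(samples, labels)
extend_lexicon = predict(weights, features("theil", 4, 7, 1))
```

Tokens classified as True would go into the extended lexicon before the second profiling round.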
A-PoCoTo — Ranking step
In the ranking step, the profiler’s candidates are re-ranked.
Classification starts after the second profiling round
All w_hist,cand for each w_ocr are considered
Set of features for each w_hist,cand (token shape, candidate unigram
frequencies, agreeing OCRs, ...)
The classifier classifies each w_hist,cand as True or False
Candidates are re-ranked using the classifier’s confidence values
(in [−1, 1])
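Re-ranking by classifier confidence could look roughly like the sketch below. The weights, the feature names and the mapping from a logistic probability p to a confidence 2p − 1 in [−1, 1] are all assumptions; the slides state only the confidence range, not how it is computed.

```python
import math

# Hypothetical weights; in the described system they would come from training.
WEIGHTS = {
    "bias": -2.0,
    "log_unigram_freq": 0.6,
    "agreeing_ocrs": 1.5,
    "profiler_weight": 2.0,
}

def confidence(feats):
    """Map a logistic probability p to a confidence in [-1, 1] (assumed: 2p - 1)."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in feats.items())
    p = 1.0 / (1.0 + math.exp(-z))
    return 2.0 * p - 1.0

def rerank(candidates):
    """Return (confidence, candidate) pairs, highest confidence first."""
    scored = [(confidence(feats), cand) for cand, feats in candidates.items()]
    return sorted(scored, reverse=True)

# Made-up candidates for the OCR token "tbeil"; the profiler prefers "beil".
candidates = {
    "beil": {"log_unigram_freq": 2.0, "agreeing_ocrs": 0, "profiler_weight": 0.6},
    "theil": {"log_unigram_freq": 3.0, "agreeing_ocrs": 1, "profiler_weight": 0.3},
}
ranking = rerank(candidates)
```

In this made-up case the agreeing support OCR lifts "theil" above "beil", reversing the profiler's original order.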
A-PoCoTo — Decision step
In the decision step, a classifier decides whether the best-ranked candidate for
each w_ocr should be used as a correction for that w_ocr.
The re-ranked candidate set for each w_ocr is considered
The features are the confidence of the highest-ranked candidate and its distance to the next
candidate
The classifier classifies the highest-ranked candidate as True or False
True candidates are used to correct the corresponding w_ocr
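The two decision features can be sketched as follows. The weights, the threshold parameter and the handling of a single-candidate list (the missing runner-up is treated as having confidence −1) are assumptions made for illustration, not details from the slides.

```python
import math

def decision_features(ranked):
    """ranked: list of (confidence, candidate) pairs, best first."""
    top = ranked[0][0]
    # Assumption: with a single candidate, the runner-up counts as -1.
    runner_up = ranked[1][0] if len(ranked) > 1 else -1.0
    return top, top - runner_up

def decide(ocr_token, ranked, weights=(-1.5, 2.0, 2.0), threshold=0.5):
    """Return the correction if the classifier says True, else the OCR token."""
    bias, w_top, w_gap = weights   # hypothetical, untrained weights
    top, gap = decision_features(ranked)
    p = 1.0 / (1.0 + math.exp(-(bias + w_top * top + w_gap * gap)))
    return ranked[0][1] if p >= threshold else ocr_token

clear = [(0.74, "theil"), (0.20, "beil")]
ambiguous = [(0.30, "beil"), (0.25, "teil")]
corrected = decide("tbeil", clear)        # confident and well separated
kept = decide("tbeil", ambiguous)         # low confidence, small gap
bolder = decide("tbeil", ambiguous, threshold=0.25)
```

A lower threshold makes this sketch correct more often. It is only one illustrative knob and not necessarily how the authors would adjust the step.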
A-PoCoTo — Evaluation results
Post-correction model trained on OCR-D ground truth [4]
Documents from the 16th to the 19th century
574 pages from 90 documents (3-6 pages per document)
Separate profile for each document
Two documents evaluated:
‘1557, Bodenstein, WieSichMeniglich’ (20 pages)
‘1841, Die Grenzboten’ (50 pages)
Four experiments:
1LE (master OCR only)
1noLE (master OCR only, LE step omitted)
2LE (one additional support OCR)
2noLE (one additional support OCR, LE step omitted)
[4] http://www.ocr-d.de/gt
A-PoCoTo — Evaluation results
2noLE gave the largest improvement in accuracy:
‘1557, Bodenstein, WieSichMeniglich’:
OCR word accuracy: 65.63% → 69.81%
‘1841, Die Grenzboten’:
OCR word accuracy: 77.57% → 80.63%
Lexicon extension offers no benefit
The ranking step helps find the best correction candidate
Support OCRs offer improvements (if not combined with the LE step)
Too many lost chances in both documents → the decision step is too
hesitant with corrections
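The reported word accuracies can be turned into a gain in percentage points and a relative reduction of word errors. The short calculation below is derived purely from the four numbers above; the function name `improvement` is hypothetical.

```python
# Word accuracy (%) before and after post-correction, 2noLE experiment.
results = {
    "1557, Bodenstein, WieSichMeniglich": (65.63, 69.81),
    "1841, Die Grenzboten": (77.57, 80.63),
}

def improvement(before, after):
    """Return (gain in percentage points, relative word-error reduction in %)."""
    gain = after - before
    error_reduction = 100.0 * gain / (100.0 - before)
    return round(gain, 2), round(error_reduction, 2)

summary = {title: improvement(*acc) for title, acc in results.items()}
# summary["1557, Bodenstein, WieSichMeniglich"] -> (4.18, 12.16)
# summary["1841, Die Grenzboten"]               -> (3.06, 13.64)
```

By this calculation, the automatic corrections remove roughly 12% to 14% of the remaining word errors in the two documents.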
A-I-PoCoTo
Combine the automatic post-correction with the interactive
post-correction of PoCoTo (work in progress).
Users review and approve (or reject) the additional entries of the
extended lexicon.
Users can inspect all correction decisions that were carried out (or not carried
out) and revert (or apply) them
The trained base models for the automatic post-correction can be
further improved with the manually corrected document
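Since this part is described as work in progress, the following is only a guess at a possible data model for reviewable decisions. The class `CorrectionDecision`, its fields and the example tokens are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CorrectionDecision:
    ocr_token: str   # token as produced by the master OCR
    candidate: str   # best-ranked correction candidate
    applied: bool    # what the automatic decision step chose

    def toggle(self):
        """User action: revert an applied correction or apply a skipped one."""
        self.applied = not self.applied

    @property
    def text(self):
        return self.candidate if self.applied else self.ocr_token

decisions = [
    CorrectionDecision("tbeil", "theil", applied=True),
    CorrectionDecision("vnnd", "und", applied=True),
    CorrectionDecision("Grenzbotcn", "Grenzboten", applied=False),
]
decisions[1].toggle()   # user reverts a correction that modernised the spelling
decisions[2].toggle()   # user applies a correction the system held back
corrected = [d.text for d in decisions]
```

The reviewed decisions could then double as additional labelled data for retraining the base models, as the last point above suggests.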
Summary
Automatic post-correction can improve accuracy
The lexicon extension step does not help → leave it out or use it only after
manual inspection
The feature-based re-ranking step improves the profiler’s ranking
Automatic post-correction is too cautious → change the training of
the decision step to make it bolder
Automatic post-correction can support the interactive post-correction
General problem: different alphabets across OCR engines, profiler
(and ground truth)