The document summarizes a post-correction system for historical optical character recognition (OCR) documents, developed within the IMPACT project, which is supported by the European Community under the FP7 ICT Work Programme and coordinated by the National Library of the Netherlands. The system enables efficient post-correction of OCR errors through a customizable user interface that displays OCR text, image snippets, and correction candidates. Underlying language models help identify historical spelling variants and distinguish them from OCR errors.
Slides of the second paper on the ULiS project, available at http://maxime-lefrancois.info/Publications
We are interested in bridging the world of natural language and the world of the semantic web, in particular to support multilingual access to the web of data. In this paper we introduce the ULiS project, which aims at designing a pivot-based NLP technique called the Universal Linguistic System, built entirely on semantic web formalisms and compliant with the Meaning-Text theory. Through the ULiS, a user could interact with an interlingual knowledge base (IKB) in controlled natural language. Linguistic resources are themselves part of a specific IKB, the Universal Lexical Knowledge base (ULK), so that actors may enhance their controlled natural language through requests in controlled natural language. We describe a basic interaction scenario at the system level and provide an overview of the architecture of the ULiS. We then introduce the core of the ULiS: the interlingual lexical ontology (ILexicOn), in which each interlingual lexical unit class (ILUc) supports the projection of its semantic decomposition on itself. We validate our model with a standalone ILexicOn, and introduce and explain a concise human-readable notation for it.
A plethora of programming languages have been and continue to be developed to keep pace with hardware advancements and the ever more demanding requirements of software development. As these increasingly sophisticated languages need to be well understood by both programmers and implementors, precise specifications are increasingly required. Moreover, the safety of programs written in these languages, and their adequacy with respect to requirements, need to be tested, analyzed and, if possible, proved. This dissertation proposes a rigorous, rewriting-based approach to defining programming languages, which makes it easy to design and test language extensions, and to specify and analyze the safety and adequacy of program executions.
To this aim, this dissertation describes the K Framework, an executable semantic framework inspired by rewriting logic but specialized and optimized for programming languages.
The K Framework consists of three components: (1) a language definitional technique; (2) a specialized notation; and (3) a resource-sharing concurrent rewriting semantics. The language definitional technique is a rewriting technique built upon the lessons learned from capturing and studying existing operational semantics frameworks within rewriting logic, and upon attempts to combine their strengths while avoiding their limitations. The specialized notation makes the technical details of the technique transparent to the language designer, and enhances modularity, by allowing the designer to specify the minimal context needed for a semantic rule. Finally, the resource-sharing concurrent semantics relies on the particular form of the semantic rules to enhance concurrency, by allowing overlapping rule instances (e.g., two threads writing in different locations in the store, which overlap on the store entity) to apply concurrently as long as they only overlap on the parts they do not change.
The main contributions of the dissertation are:
(1) a uniform recasting of the major existing operational semantics techniques within rewriting logic;
(2) an overview description of the K Framework and how it can be used to define, extend and analyze programming languages;
(3) a semantics for K concurrent rewriting obtained through an embedding in graph rewriting; and
(4) a description of the K-Maude tool, a tool for defining programming languages using the K technique on top of the Maude rewriting language.
Project number: 224348
Project acronym: AEGIS
Project title: Open Accessibility Everywhere: Groundwork, Infrastructure, Standards
Starting date: 1 September 2008
Duration: 48 Months
AEGIS is an Integrated Project (IP) within the ICT programme of FP7
Development, distribution and use of open source software comprise a market of data (source code, bug reports, documentation, number of downloads, etc.) from projects, developers and users. This large amount of data makes it difficult for the people involved to make sense of implicit links between software projects, e.g., dependencies, patterns, licenses. This context raises the question of what techniques and mechanisms can be used to help users and developers link related pieces of information across software projects. In this paper, we propose a framework for a marketplace enhanced using linked open data (LOD) technology for linking software artifacts within projects as well as across software projects. The marketplace provides the infrastructure for collecting and aggregating software engineering data as well as for developing services for mining, statistics, analytics and visualization of software data. Based on cross-linking software artifacts and projects, the marketplace enables developers and users to understand the individual value of components and their relationship to bigger software systems. Improved understanding creates new business opportunities for software companies: users will be better able to analyze and compare projects, developers can increase the visibility of their products, and hosts may offer plug-ins and services over the data to paying customers.
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U... - IJCI JOURNAL
Recent advancements in the field of natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially for low-resource languages. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper utilizes a recently proposed tokenization-free method, CANINE, to enhance the comprehension of natural language. Specifically, we employ this method to develop a tokenization-free Arabic language model. In this research, we evaluate our model's performance across a range of eight tasks using the Arabic Language Understanding Evaluation (ALUE) benchmark. Furthermore, we conduct a comparative analysis, pitting our tokenization-free model against existing Arabic language models that rely on sub-word tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on our Hugging Face page.
In conclusion, the results of our study demonstrate that the tokenization-free approach exhibits performance comparable to established Arabic language models that utilize sub-word tokenization techniques. Notably, in certain tasks, our model surpasses the performance of some of these existing models. This evidence underscores the efficacy of tokenization-free processing for the Arabic language, particularly in specific linguistic contexts.
Learning Usage of English KWICly with WebLEAP/DSR - Takashi Yamanoue
WebLEAP (Web Language Evaluation Assistant Program) is a system that helps us with writing in English. It informs us about the popularity of expressions by displaying the frequencies of subsequences of words that are included in the given sentences or expressions. It collects these data from the Internet by calling a Web search engine. We have reported its system organization and basic features in our previous papers, including at the ICITA2002 conference.
In this paper, we first summarize our motivations and the basic features of WebLEAP. Then we describe some of the new features of WebLEAP together with its new interface. The most significant features are "KWIC (Key Word in Context)" and "domain specification." We can see how expressions are actually used in context with the KWIC feature. We can specify the search domain, such as "uk," "cn," "jp," and "us," so that we can compare usage between different countries. Finally, we demonstrate its usefulness by giving some examples.
178 - A replicated study on duplicate detection: Using Apache Lucene to searc... - ESEM 2014
Context: Duplicate detection is a fundamental part of issue management. Systems able to predict whether a new defect report will be closed as a duplicate may decrease costs by limiting rework and collecting related pieces of information. Goal: Our work explores using Apache Lucene for large-scale duplicate detection based on textual content. Also, we evaluate the previous claim that results are improved if the title is weighted as more important than the description. Method: We conduct a conceptual replication of a well-cited study conducted at Sony Ericsson, using Lucene for searching in the public Android defect repository. In line with the original study, we explore how varying the weighting of the title and the description affects the accuracy. Results: We show that Lucene obtains the best results when the defect report title is weighted three times higher than the description, a bigger difference than has been previously acknowledged. Conclusions: Our work shows the potential of using Lucene as a scalable solution for duplicate detection.
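The 3:1 title-to-description weighting can be illustrated outside Lucene (in Lucene itself it would typically be expressed as a query-time field boost such as `title:...^3`). The following sketch is purely illustrative: it uses a plain bag-of-words cosine rather than Lucene's scoring, and all report data is made up.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def duplicate_score(query, report, title_weight=3.0):
    """Combined score with the title weighted higher than the description,
    mirroring the study's finding that a 3:1 ratio works best."""
    return (title_weight * cosine(bow(query["title"]), bow(report["title"]))
            + cosine(bow(query["description"]), bow(report["description"])))

new = {"title": "crash on rotate", "description": "app crashes when rotating the screen"}
old = {"title": "crash when rotating", "description": "rotating the device crashes the app"}
unrelated = {"title": "battery drain", "description": "battery drains fast overnight"}

# The likely duplicate outranks the unrelated report:
assert duplicate_score(new, old) > duplicate_score(new, unrelated)
```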
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services - Lynx Project
Free webinar on the Lynx Services Platform (LySP): architecture and basic services.
The main objective of the Lynx research and innovation project is to create an ecosystem of smart cloud services to better manage compliance, based on a Legal Knowledge Graph (LKG) which integrates and links multilingual and heterogeneous compliance data sources, including legislation, case law, standards, regulations and private contracts, among others.
This webinar provides insights into the smart services of the Lynx Services Platform (LySP), including demos of these LySP services, for instance: Named Entity Recognition (NER) by DFKI, Relation Extraction and Question Answering by SWC, Machine Translation by Tilde, and the Lexicala cross-lingual lexical data service by KDictionaries.
Presentation given by Ricardo Santos, member of the VIAF GDPR Working Group, at the VIAF annual meeting. The presentation shows the results of a survey on the privacy of author data in authority files.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The talk covers the key trends across hardware, cloud and open source, explores how these areas are likely to mature and develop over the short and long term, and considers how organisations can position themselves to adapt and thrive.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects' efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you're in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part "Essentials of Automation" series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here's what you'll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We'll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don't miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What's changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop in which the participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only materialize when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
TR5 Profiler and Post-Correction System - Ludwig-Maximilians-Universität München
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
TR5 Profiler and Post-Correction System
Ludwig-Maximilians-UniversitΓ€t MΓΌnchen
Centrum fΓΌr Informations- und Sprachverarbeitung
TR5 Post-Correction System
- User interface for easy post-correction of historical OCR'd documents
- Stand-alone user interface
- Innovative language technology enables identification and presentation of recognition errors, and efficient correction
Customizable user interface
Freely rearrangeable interface elements:
- OCR with image snippets
- Complete image
- Correction candidates / special functions
View: OCR and image clippings
Word-by-word presentation of recognized text and image clippings. Comparison of text and image follows reading order and is much easier than a side-by-side presentation of image and text.
View: Original image
- For difficult cases
- When word segmentation by the OCR fails
- Current word is highlighted
Word-by-word correction of text
- Correction by manual text entry
- Choosing correction candidates
- Faster correction thanks to candidates proposed by the post-correction system
Batch correction: efficient post-correction
Batch correction of:
- Several occurrences of an identical word
Batch correction: efficient post-correction
Batch correction of:
- Classes of systematic errors
- Errors where the correction candidate has a high degree of certainty
- Further possibilities: frequent errors, for instance location names
Post-correction system: Evaluation
User experiment with 14 individual instances.
Result: error correction with text and error profiling is 2.7 times faster.
Ulrich Reffle
Correction system (screenshots)
Why another post-correction system?
Targets a more specialist audience.
Thanks to the underlying language technology:
- Historical variants are recognized and not marked as errors, even when not in the historical lexicon
- Historical variants are proposed as correction candidates
- Typical error patterns are exploited
- Ranking of correction candidates
Underlying language technology
- Lexica and language models help in dealing with orthographical variants and unknown words.
- Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology.
- Approximate search in "hypothetical lexica".
- An analysis of the whole work (the "text and error profile") produces document-specific information about the language and the type of OCR errors.
Text and error profiles

Text profile:
- Coverage of lexica
- Typical variant patterns
→ Targeted selection of lexica
→ Better language models
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Recall and precision in IR

Error profile:
- Estimate of error rate
- Typical OCR errors
→ Better modeling of the error channel
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Treatment of systematic errors
Underlying logic: dual noisy channel model
Interpretation of OCR output tokens as the result of two "noisy channels":

modern word u → (variant patterns) → historical variant v → (OCR errors) → OCR result w

Given an OCR token w, give possible interpretations of w in terms of:
- the "underlying" modern word u (relevant for IR!)
- the correct historical word v and its derivation from u via "patterns"
- the OCR errors garbling v into w
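The two-channel interpretation above can be sketched in miniature. This is not the IMPACT profiler's implementation (which uses approximate search in large lexica); it brute-forces single-pattern derivations over a toy lexicon, with made-up pattern sets.

```python
# Toy data: a tiny modern lexicon and one pattern per channel (illustrative).
MODERN_LEXICON = {"teil", "tier"}
VARIANT_PATTERNS = {("t", "th")}   # historical spelling: t -> th
OCR_ERROR_PATTERNS = {("t", "i")}  # OCR confusion: t -> i

def apply_once(word, patterns):
    """All words reachable by applying one pattern at one position, plus the
    unchanged word (the channel may also pass the word through)."""
    out = {word}
    for left, right in patterns:
        i = word.find(left)
        while i != -1:
            out.add(word[:i] + right + word[i + len(left):])
            i = word.find(left, i + 1)
    return out

def interpretations(w):
    """All triples (u, v, w): modern word u, historical variant v, OCR token w."""
    results = []
    for u in MODERN_LEXICON:
        for v in apply_once(u, VARIANT_PATTERNS):       # channel 1: variant patterns
            if w in apply_once(v, OCR_ERROR_PATTERNS):  # channel 2: OCR errors
                results.append((u, v, w))
    return results

print(interpretations("iheil"))  # -> [('teil', 'theil', 'iheil')]
```

The token "iheil" is thus explained as modern "teil", rewritten historically to "theil", garbled by the OCR to "iheil", matching the example on the next slide.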
Historical variant and OCR error patterns
Example: the historical variant pattern t→th rewrites modern "teil" to "theil"; the OCR error pattern t→i then garbles "theil" into "iheil".
Relative frequency: 2.9% of all "t" are rewritten to "th".
Absolute frequency: the pattern was found 120 times in the current document.
Local view: interpretations of tokens
The "meaningful interpretations" for all tokens of the OCR text are the matches in all attached lexicons, using the given settings. The same surface match can be an occurrence of the spelling variant i→y or an occurrence of the OCR error i→y.
Global view: pattern frequencies
Increment counters to estimate (relative) frequencies:
- Occurrences of spelling variant i→y: +0.999771
- Occurrences of OCR error i→y: +0.000224948
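The fractional increments (+0.999771 vs. +0.000224948 for the same token) suggest that a token's competing interpretations are weighted by their probability under the current model rather than counted as whole occurrences. A minimal sketch of such soft counting; the pattern labels are hypothetical, the two numbers are taken from the slide, and the normalization step is our assumption:

```python
from collections import Counter

def add_fractional_counts(counters, interps):
    """interps: list of (pattern_labels, unnormalized probability) pairs for one
    token. Counters are incremented by the normalized posterior of each
    interpretation rather than by 1, so a single ambiguous token contributes
    fractional counts to competing patterns."""
    total = sum(p for _, p in interps)
    for patterns, p in interps:
        for pat in patterns:
            counters[pat] += p / total

counters = Counter()
# Two competing readings of one token (labels/structure are illustrative):
token_interps = [
    (["variant:i->y"], 0.999771),    # reading as a historical spelling variant
    (["ocr:i->y"], 0.000224948),     # reading as an OCR error
]
add_fractional_counts(counters, token_interps)
print(round(counters["variant:i->y"], 4))  # -> 0.9998
```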
Computation of profile: initialization
Initial global profile: a non-specific model with probabilities for
- words
- variant patterns
- errors
applied to the OCR result w0, w1, w2, w3, …
Computation of profile: global to local
The global profile (a model with probabilities for words, variant patterns and errors) is applied to the OCR result w0, w1, w2, w3, … to compute a local profile: a ranked list of interpretations for each token.
[Figure: global profile feeding per-token interpretation lists]
Computation of profile: local to global
[Diagram: the local profiles of the tokens w0, w1, w2, w3, … are aggregated into an improved global profile with updated probabilities for words, variant patterns, and errors.]
Computation of profile: iteration
[Diagram: the global-to-local and local-to-global steps are repeated; each pass yields an improved global profile and refined local profiles for the tokens w0, w1, w2, w3, …]
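The initialization / global-to-local / local-to-global / iteration sequence of the preceding slides is an EM-style loop. A toy sketch of just the control flow, with a one-pattern model and made-up per-token evidence weights (nothing here reflects the Profiler's actual channel model):

```python
# Toy EM-style loop mirroring the profile computation: a global probability
# that the pattern is a historical variant is pushed down to per-token
# posteriors (global -> local), and the posteriors are averaged back into a
# new global estimate (local -> global), repeatedly. Purely illustrative.

def local_profile(weight_variant, global_p):
    """Posterior that one pattern occurrence is a historical variant."""
    pv = global_p * weight_variant              # evidence for "variant"
    pe = (1 - global_p) * (1 - weight_variant)  # evidence for "OCR error"
    return pv / (pv + pe)

def iterate_profile(token_weights, global_p=0.5, rounds=10):
    for _ in range(rounds):
        # global -> local: per-occurrence posteriors under the current profile
        posteriors = [local_profile(w, global_p) for w in token_weights]
        # local -> global: fractional counts re-estimate the global probability
        global_p = sum(posteriors) / len(posteriors)
    return global_p

# hypothetical per-occurrence evidence (e.g. lexicon support) for "variant"
print(iterate_profile([0.9, 0.8, 0.95, 0.2]))
```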
Profiler Evaluation
Measure the quality
1. of global profiles
2. of OCR error detection
Challenges:
• Appropriate measures are not obvious
• Good evaluation data is difficult to gather
• Results need interpretation
Evaluation: Measures
(1) Global Profiles
Percentage of matches for the first 10 patterns in the ranked output lists
Two Values: Historical Patterns, OCR Patterns
(2) OCR Error Detection
Precision and Recall for the OCR errors detected by the Profiler
(3) Indirect evaluation
(for instance, by means of the post-correction system)
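Measure (2) is standard precision and recall over the set of tokens the Profiler flags as OCR errors versus the gold annotation. A minimal sketch with invented data:

```python
# Sketch of measure (2): precision and recall of OCR-error detection
# against gold labels. Token positions below are made up.

def precision_recall(predicted_errors, gold_errors):
    tp = len(predicted_errors & gold_errors)     # true positives
    precision = tp / len(predicted_errors) if predicted_errors else 0.0
    recall = tp / len(gold_errors) if gold_errors else 0.0
    return precision, recall

pred = {3, 7, 8, 12}        # token positions flagged as OCR errors
gold = {3, 7, 9, 12, 15}    # token positions that truly are OCR errors
print(precision_recall(pred, gold))  # (0.75, 0.6)
```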
Evaluation: Data preparation
(1) Deep Evaluation:
For each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
++ fully accurate  -- manual work
(2) Shallow Evaluation:
The OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretations are automatically assigned from the ground truth.
++ no manual work  -- not completely accurate
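The automatic alignment step can be sketched with Python's stdlib sequence matcher, which stands in here for whatever aligner the project actually uses; the sample sentence and historical spellings are invented:

```python
import difflib

# Sketch of the shallow-evaluation setup: align the OCR'ed token sequence
# with its re-typed ground truth so each OCR token inherits its gold form.
# difflib.SequenceMatcher is a stand-in for the project's aligner.

ocr  = "Die Sonne gehet auf vnd vnter".split()   # invented OCR output
gold = "Die Sonne gehet auf und unter".split()   # invented ground truth

matcher = difflib.SequenceMatcher(a=gold, b=ocr, autojunk=False)
pairs = []
for op, g0, g1, o0, o1 in matcher.get_opcodes():
    # keep 1:1 alignments; equal spans and same-length substitutions
    if op in ("equal", "replace") and (g1 - g0) == (o1 - o0):
        pairs += list(zip(gold[g0:g1], ocr[o0:o1]))

print(pairs)   # each OCR token paired with its ground-truth form
```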
Evaluation: Data
Deep: Eckartshausen (100 pages), Briefkunst (40 pages)
Shallow: 5 books each from the 16th, 17th, and 18th centuries
Evaluation: Eckartshausen
(1) Historical patterns: matches (first 10) 70%, precision (all) 68%, recall (all) 73%
(2) OCR patterns: matches (first 6) 67%, precision (all) 59%, recall (all) 19%
(3) OCR error detection: precision 86%, recall 46%
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics
[Plot legend: Hist. Var. / OCR]
Shallow Evaluation Results
                                     16th     17th     18th
HIST Patterns (first 10)             60%      74%      78%
OCR Patterns (first 10)              48%      70%      50%
Error Detection Precision            95%      92%      81%
Error Detection Recall               49%      43%      45%
Content Word Errors                  64%      44%      16%
Easy Interactive Correction
  (per 10,000 words)                 ≈3000    ≈1892    ≈720
Global Profile: Spelling variation patterns
Spelling variation profile
OCR Error Profile