TR5 Profiler and Post-Correction System. Ludwig Maximilians
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
TR5 Profiler and Post-Correction System
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung
2. TR5 Post-Correction System
User interface for easy postcorrection of historical OCR'd documents
Stand-alone user interface
Innovative language technology enables identification, presentation of recognition errors and efficient correction
3. Customizable user interface
Freely rearrangeable interface elements:
– OCR with image snippets
– Complete image
– Correction candidates / special functions
Labels in the screenshot: Font size; OCR and image fragments; Complete image; Correction candidates, Special functions
4. View: OCR and image clippings
Word by word presentation of recognized text and image clippings.
Comparison of text and image follows reading order and is much easier than side-by-side presentation of image and text.
5. View: Original image
– For difficult cases
– When word segmentation by OCR fails
– Current word is highlighted
6. Word by word correction of text
Correction by manual text entry
Choosing correction candidates
Faster correction thanks to candidates proposed by the postcorrection system
7. Batch correction: efficient postcorrection
Batch correction
– Several occurrences of an identical word
8. Batch correction: efficient postcorrection
Batch correction
– classes of systematic errors
– errors where the correction candidate has a high degree of certainty
– further possibilities
Frequent errors
For instance, location names
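The batch-correction idea on the slides above can be illustrated with a minimal Python sketch. It assumes that tokens are plain strings and that one correction is chosen per distinct OCR string; the names `group_occurrences` and `apply_batch` and the sample tokens are hypothetical and are not taken from the TR5 system.

```python
from collections import defaultdict

def group_occurrences(tokens):
    """Map each distinct OCR string to the positions where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return dict(positions)

def apply_batch(tokens, wrong, corrected):
    """Return a copy of tokens with every occurrence of `wrong` replaced."""
    return [corrected if token == wrong else token for token in tokens]

# Hypothetical OCR output; "iheil" stands in for a misrecognized "theil".
ocr_tokens = ["der", "iheil", "und", "der", "iheil"]
groups = group_occurrences(ocr_tokens)
fixed = apply_batch(ocr_tokens, "iheil", "theil")
```

Grouping first lets a user inspect all occurrences of a suspicious string once and then confirm a single correction for the whole group.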
9. Postcorrection system: Evaluation
User experiment with 14 individual instances
Result: Error correction thanks to text and error profiling is 2.7 times faster
Ulrich Reffle, 4,
10. Correction system (Korrektursystem)
11. Correction system (Korrektursystem)
12. Why another postcorrection system?
Targets a more specialist audience
Thanks to underlying language technology:
Historical variants are recognized and not marked as errors, even when not in the historical lexicon
Historical variants are proposed as correction candidates
Typical error patterns are exploited
Ranking of correction candidates
13. Underlying language technology
Lexica and language models help to deal with orthographical variants and unknown words.
Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
Approximate search in “hypothetical lexica”
An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors
14. Text and error profiles
Text profile:
Coverage of lexica
Typical variant patterns
→ Targeted selection of lexica
→ Better language models
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Recall and Precision in IR
Error profile:
Estimate of error rate
Typical OCR errors
→ Better modeling of error channel
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Treatment of systematic errors
15. Underlying logic: Dual noisy channel model
Interpretation of OCR output tokens as result of two “noisy channels”:
modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w
Given an OCR token w, give possible interpretations of w in terms of
• “underlying” modern word u (IR!)
• correct historical word v and its derivation from u via “patterns”
• OCR errors garbling v into w
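To make the two channels concrete, here is a minimal Python sketch that enumerates interpretations (u, v, w) for a single OCR token. It is an illustration under simplifying assumptions, not the LMU implementation: each channel applies at most one rewrite pattern, the lexicon is a plain list, and the function names are hypothetical. The example words come from the teil/theil/iheil slide in this deck.

```python
def apply_once(word, patterns):
    """Yield (result, pattern) for every single application of a rewrite pattern."""
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)

def interpretations(ocr_token, modern_lexicon, variant_patterns, ocr_patterns):
    """List (u, variant_pattern, v, ocr_pattern) tuples that explain ocr_token."""
    found = []
    for u in modern_lexicon:
        # Channel 1: modern word u becomes historical variant v (or stays unchanged).
        historical = [(u, None)] + list(apply_once(u, variant_patterns))
        for v, variant in historical:
            # Channel 2: OCR garbles v into w (or reads it correctly).
            garbled = [(v, None)] + list(apply_once(v, ocr_patterns))
            for w, error in garbled:
                if w == ocr_token:
                    found.append((u, variant, v, error))
    return found

result = interpretations("iheil", ["teil", "heil"], [("t", "th")], [("t", "i")])
```

Here `result` contains one interpretation: modern word "teil", historical variant "theil" via t→th, and OCR error t→i.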
16. Historical variant and OCR error patterns
Historical variants: teil → theil
OCR error patterns: theil → iheil
17. Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’
Absolute frequency: Pattern was found 120 times in the current document.
18. Local view: interpretations of tokens
– Local view: “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.
Occurrence of spelling variant “i→y”:
Occurrence of OCR error “i→y”:
19. Global view: pattern frequencies
– Global view: Increment counters to estimate (relative) frequencies.
Occurrences of spelling variant “i→y”: +0.999771
Occurrences of OCR error “i→y”: +0.000224948
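The fractional increments shown above suggest that one token can contribute partly to several counters. A minimal sketch of that idea follows, assuming each competing interpretation of a token carries a non-negative weight and the token's single count is split in proportion to those weights. The function name, labels, and weights are hypothetical.

```python
from collections import Counter

def add_fractional_counts(counters, weighted_interpretations):
    """Add each interpretation's share of one token to the global counters."""
    total = sum(weight for _, weight in weighted_interpretations)
    for label, weight in weighted_interpretations:
        counters[label] += weight / total

counters = Counter()
# Hypothetical weights for one token that could be read either way.
add_fractional_counts(counters, [("variant i>y", 3.0), ("ocr error i>y", 1.0)])
```

After this call the two counters hold 0.75 and 0.25, so the token still counts as exactly one occurrence overall.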
20. Computation of profile: initialization
Initial global profile
Non-specific model with probabilities for
• Words
• Variant Patterns
• Error
OCR result
w_0, w_1, w_2, w_3, …
21. Computation of profile: global to local
Initial global profile
Non-specific model with probabilities for
• Words
• Variant Patterns
• Error
Local profile
w_0: … → … → … (several such interpretation lines per token; likewise for w_1, w_2, w_3)
OCR result
w_0, w_1, w_2, w_3, …
22. Computation of profile: local to global
Global profile
Improved model with probabilities for
• Words
• Variant Patterns
• Error
Local profile
w_0: … → … → … (several such interpretation lines per token; likewise for w_1, w_2, w_3)
OCR result
w_0, w_1, w_2, w_3, …
23. Computation of profile: iteration
Global profile
Improved model with probabilities for
• Words
• Variant Patterns
• Error
Local profile
w_0: … → … → … (several such interpretation lines per token; likewise for w_1, w_2, w_3)
OCR result
w_0, w_1, w_2, w_3, …
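The loop described on the last four slides (initial global profile, global to local, local to global, iterate) can be sketched as a small expectation-maximization style procedure. This is a simplified illustration, not the Profiler's actual algorithm: it assumes each token's interpretations are reduced to a single label, and the names `local_profile` and `estimate_profile` as well as the data are hypothetical.

```python
def local_profile(candidates, global_profile):
    """Weight each candidate label of one token by the current global profile."""
    weights = [global_profile[label] for label in candidates]
    total = sum(weights)
    return [weight / total for weight in weights]

def estimate_profile(tokens, labels, rounds=10):
    """Alternate between local weights and global relative frequencies."""
    # Initial global profile: non-specific, every label equally likely.
    profile = {label: 1.0 / len(labels) for label in labels}
    for _ in range(rounds):
        counts = {label: 0.0 for label in labels}
        for candidates in tokens:
            # Global to local: split the token among its interpretations.
            shares = local_profile(candidates, profile)
            # Local to global: add the fractional shares to the counters.
            for label, share in zip(candidates, shares):
                counts[label] += share
        total = sum(counts.values())
        profile = {label: counts[label] / total for label in labels}
    return profile

# Hypothetical data: three tokens are unambiguous spelling variants,
# one token is ambiguous between a variant and an OCR error.
tokens = [["variant"], ["variant"], ["variant"], ["variant", "ocr_error"]]
profile = estimate_profile(tokens, ["variant", "ocr_error"])
```

In this toy run the unambiguous tokens pull the ambiguous one toward the "variant" reading, so the share of "ocr_error" shrinks with every round.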
24. Profiler Evaluation
Measure the quality
1. of global profiles
2. of OCR error detection
Challenges
Measures not obvious
Good evaluation data is difficult to gather
Results need interpretation
25. Evaluation: Measures
(1) Global Profiles
Percentage of matches for the first 10 patterns in the ranked output lists
Two Values: Historical Patterns, OCR Patterns
(2) OCR Error Detection
Precision and Recall for the OCR errors detected by the Profiler
(3) Indirect evaluation
(For instance, by means of the postcorrection system)
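Measures (1) and (2) can be sketched in a few lines of Python. The slide does not define "percentage of matches" precisely, so the sketch assumes it means the share of the first k ranked patterns that also occur in the ground truth. Function names and sample data are hypothetical.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected error positions against the true ones."""
    detected, actual = set(detected), set(actual)
    hits = len(detected & actual)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

def top_k_matches(ranked_patterns, true_patterns, k=10):
    """Share of the first k ranked patterns that also occur in the ground truth."""
    top = ranked_patterns[:k]
    if not top:
        return 0.0
    return sum(1 for pattern in top if pattern in true_patterns) / len(top)

# Hypothetical token positions flagged as OCR errors vs. true error positions.
p, r = precision_recall(detected=[1, 4, 7], actual=[1, 4, 5, 9])
# Hypothetical ranked pattern list vs. patterns found in the ground truth.
share = top_k_matches(["t:th", "i:y", "u:v"], {"t:th", "u:v"}, k=10)
```

High precision with low recall, as reported later in the deck, means that flagged tokens are mostly real errors while many errors remain unflagged.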
26. Evaluation: Data preparation
(1) Deep Evaluation:
For each token of the evaluation document the historical interpretation and the OCR interpretation have been manually annotated.
+ fully accurate
− manual work
(2) Shallow Evaluation:
The OCR’ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document the historical and the OCR interpretation is automatically assigned from the ground truth.
+ no manual work
− not completely accurate
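The automatic alignment used for the shallow evaluation is not specified on the slide. As one possible illustration, the sketch below aligns OCR tokens with ground-truth tokens using `difflib.SequenceMatcher` from the Python standard library; this is a stand-in technique, and the function name and sample tokens are hypothetical.

```python
from difflib import SequenceMatcher

def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens; None marks an unaligned side."""
    pairs = []
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Same number of tokens on both sides: pair them one to one.
            pairs.extend(zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
        else:
            # Insertions, deletions and uneven replacements stay unpaired.
            pairs.extend((token, None) for token in ocr_tokens[i1:i2])
            pairs.extend((None, token) for token in truth_tokens[j1:j2])
    return pairs

pairs = align_tokens(["der", "iheil", "vnd"], ["der", "theil", "vnd"])
```

Pairs whose two sides differ, such as ("iheil", "theil"), are the candidates for an automatically assigned OCR interpretation. Unpaired tokens are one source of the inaccuracy noted on the slide.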
27. Evaluation: Data
Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages
Shallow: 5 books each from the 16th, 17th and 18th century
28. Evaluation: Eckartshausen
(1) historical patterns
matches first 10 70%
precision all 68%
recall all 73%
(2) OCR patterns
matches first 6 67%
precision all 59%
recall all 19%
(3) OCR error detection
precision 86%
recall 46%
29. Graphical Evaluation: Eckartshausen
30. Graphical Evaluation: diacritics (figure panels: Hist. Var., OCR)
31. Shallow Evaluation Results
                                               16th           17th           18th
HIST Patterns, first 10                        60%            74%            78%
OCR Patterns, first 10                         48%            70%            50%
Error Detection Precision                      95%            92%            81%
Error Detection Recall                         49%            43%            45%
Content Words Errors                           64%            44%            16%
Easy Interactive Correction per 10,000 words   ≈ 3000 words   ≈ 1892 words   ≈ 720 words
32. Global Profile: Spelling variation patterns
33. Spelling variation profile
34. OCR Error Profile