TR5 Profiler and Post-Correction System. Ludwig Maximilians

Presented at the "IMPACT demonstration session at the BNE". October. Biblioteca Nacional de España.

  1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. TR5 Profiler and Post-Correction System. Ludwig-Maximilians-Universität München, Centrum für Informations- und Sprachverarbeitung.
  2. TR5 Post-Correction System. User interface for easy postcorrection of historical OCRed documents. Stand-alone user interface. Innovative language technology enables identification and presentation of recognition errors, and efficient correction.
  3. Customizable user interface. Font size. Freely rearrangeable interface elements: OCR with image snippets; complete image; correction candidates / special functions. Labels: OCR and image fragments; complete image; correction candidates, special functions.
  4. View: OCR and image clippings. Word-by-word presentation of recognized text and image clippings. Comparison of text and image follows reading order and is much easier than side-by-side presentation of image and text.
  5. View: Original image. For difficult cases; when word segmentation by OCR fails; current word is highlighted.
  6. Word-by-word correction of text. Correction by manual text entry. Choosing correction candidates. Faster correction thanks to candidates proposed by the postcorrection system.
  7. Batch correction: efficient postcorrection. Batch correction of several occurrences of an identical word.
  8. Batch correction: efficient postcorrection. Batch correction of: classes of systematic errors; errors where the correction candidate has a high degree of certainty; further possibilities. Frequent errors, for instance location names.
  9. Postcorrection system: Evaluation. User experiment with 14 individual instances. Result: error correction thanks to text and error profiling is 2.7 times faster. (Ulrich Reffle)
  10. Correction system
  11. Correction system
  12. Why another postcorrection system? Targets a more specialist audience. Thanks to the underlying language technology: historical variants are recognized and not marked as errors, even when not in the historical lexicon; historical variants are proposed as correction candidates; typical error patterns are exploited; ranking of correction candidates.
  13. Underlying language technology. Lexica and language models help in dealing with orthographical variants and unknown words. Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology. Approximate search in "hypothetical lexica". An analysis of the whole work ("text and error profile") produces document-specific information about the language and the type of OCR errors.
  14. Text and error profiles.
      Text profile: coverage of lexica; typical variant patterns. → Targeted selection of lexica. → Better language models. → Distinguishing historical variants and OCR errors. → Ranking of correction candidates. → Recall and precision in IR.
      Error profile: estimate of error rate; typical OCR errors. → Better modeling of error channel. → Distinguishing historical variants and OCR errors. → Ranking of correction candidates. → Treatment of systematic errors.
  15. Underlying logic: dual noisy channel model. Interpretation of OCR output tokens as the result of two "noisy channels": modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w. Given an OCR token w, give possible interpretations of w in terms of: the "underlying" modern word u (IR!); the correct historical word v and its derivation from u via "patterns"; the OCR errors garbling v into w.
  16. Historical variant and OCR error patterns. Historical variants: teil → theil. OCR error patterns: theil → iheil.
  17. Relative frequency: 2.9% of all 't' are rewritten to 'th'. Absolute frequency: the pattern was found 120 times in the current document.
  18. Local view: interpretations of tokens. "Meaningful interpretations" for all tokens of the OCR text are the matches in all attached lexicons, using the given settings. Occurrence of spelling variant "i→y"; occurrence of OCR error "i→y".
  19. Global view: pattern frequencies. Increment counters to estimate (relative) frequencies. Occurrences of spelling variant "i→y": +0.999771. Occurrences of OCR error "i→y": +0.000224948.
  20. Computation of profile: initialization. Initial global profile: non-specific model with probabilities for words, variant patterns, errors. OCR result: w0, w1, w2, w3, …
  21. Computation of profile: global to local. Initial global profile: non-specific model with probabilities for words, variant patterns, errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
  22. Computation of profile: local to global. Global profile: improved model with probabilities for words, variant patterns, errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
  23. Computation of profile: iteration. Global profile: improved model with probabilities for words, variant patterns, errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
  24. Profiler evaluation. Measure the quality (1) of global profiles, (2) of OCR error detection. Challenges: measures are not obvious; good evaluation data is difficult to gather; results need interpretation.
  25. Evaluation: Measures. (1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists; two values: historical patterns, OCR patterns. (2) OCR error detection: precision and recall for the OCR errors detected by the Profiler. (3) Indirect evaluation (for instance, by means of the postcorrection system).
  26. Evaluation: Data preparation. (1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated. Plus: fully accurate. Minus: manual work. (2) Shallow evaluation: the OCRed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth. Plus: no manual work. Minus: not completely accurate.
  27. Evaluation: Data. Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages. Shallow: 5 books each from the 16th, 17th and 18th century.
  28. Evaluation: Eckartshausen. (1) Historical patterns: matches in first 10: 70%; precision (all): 68%; recall (all): 73%. (2) OCR patterns: matches in first 6: 67%; precision (all): 59%; recall (all): 19%. (3) OCR error detection: precision 86%; recall 46%.
  29. Graphical Evaluation: Eckartshausen
  30. Graphical Evaluation: diacritics. Hist. Var.; OCR.
  31. Shallow Evaluation Results
                                                         16th            17th            18th
      HIST patterns, first 10                            60%             74%             78%
      OCR patterns, first 10                             48%             70%             50%
      Error detection, precision                         95%             92%             81%
      Error detection, recall                            49%             43%             45%
      Content words errors                               64%             44%             16%
      Easy interactive correction per 10,000 words       ≈ 3000 words    ≈ 1892 words    ≈ 720 words
  32. Global Profile: Spelling variation patterns
  33. Spelling variation profile
  34. OCR Error Profile
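
Note on slides 15 and 16 (dual noisy channel model). The slides do not show how interpretations are computed, so the following is only a minimal sketch of the idea, under stated assumptions: at most one variant pattern and at most one OCR error pattern per token, and a score that is the plain product of the weights involved. The names `rewrites` and `interpretations`, the lexicon weight, and the OCR error probability are hypothetical; only the words teil, theil, iheil and the 2.9% figure come from the slides.

```python
# Toy data. Only teil/theil/iheil and the 2.9% figure come from the slides;
# the lexicon weight and the OCR error probability are assumptions.
MODERN_LEXICON = {"teil": 1.0}
VARIANT_PATTERNS = {("t", "th"): 0.029}
OCR_ERROR_PATTERNS = {("t", "i"): 0.01}


def rewrites(word, patterns):
    """Return the unchanged word plus every result of applying one pattern once."""
    results = [(word, None, 1.0)]  # no pattern applied; weight 1.0 is a simplification
    for (left, right), prob in patterns.items():
        start = word.find(left)
        while start != -1:
            rewritten = word[:start] + right + word[start + len(left):]
            results.append((rewritten, (left, right), prob))
            start = word.find(left, start + 1)
    return results


def interpretations(token, lexicon, variant_patterns, ocr_error_patterns):
    """List (u, v, variant pattern, OCR error pattern, score) tuples explaining the OCR token."""
    found = []
    for modern, p_word in lexicon.items():
        # First channel: modern word u -> historical variant v.
        for historical, variant, p_variant in rewrites(modern, variant_patterns):
            # Second channel: historical variant v -> OCR result w.
            for ocr, error, p_error in rewrites(historical, ocr_error_patterns):
                if ocr == token:
                    score = p_word * p_variant * p_error
                    found.append((modern, historical, variant, error, score))
    # Best-scoring interpretation first, as a stand-in for candidate ranking.
    return sorted(found, key=lambda item: item[4], reverse=True)


for row in interpretations("iheil", MODERN_LEXICON, VARIANT_PATTERNS, OCR_ERROR_PATTERNS):
    print(row)
```

For the token iheil, this toy setup yields a single interpretation: modern word teil, historical variant theil via t → th, and an OCR error t → i. A real system would search a large lexicon approximately instead of generating forward from every entry.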

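Note on slides 19 to 23 (computation of the profile). The alternation between global and local profile can be sketched as an expectation-maximization-style loop. This is an assumption about the general shape of the procedure, not the Profiler's actual code: the pattern labels, the two toy tokens, and the choice to divide fractional counts by the number of tokens are all hypothetical simplifications (slide 17 defines relative frequency per occurrence of the rewritten character instead).

```python
from math import prod

# Each token has a list of candidate interpretations; each interpretation is
# the tuple of pattern labels it uses. Tokens and labels are hypothetical.
CANDIDATES = [
    [("variant i->y",)],                 # token with only a spelling-variant reading
    [("variant i->y",), ("ocr i->y",)],  # ambiguous token: variant or OCR error
]
INITIAL_PROFILE = {"variant i->y": 0.5, "ocr i->y": 0.5}  # non-specific start


def local_profile(candidates, profile):
    """Global to local: weight each interpretation by the current pattern probabilities."""
    weights = []
    for options in candidates:
        scores = [prod(profile[p] for p in patterns) for patterns in options]
        total = sum(scores)
        weights.append([s / total if total else 0.0 for s in scores])
    return weights


def global_profile(candidates, weights, profile):
    """Local to global: add fractional counts per pattern, then normalize."""
    counts = dict.fromkeys(profile, 0.0)
    for options, option_weights in zip(candidates, weights):
        for patterns, weight in zip(options, option_weights):
            for p in patterns:
                counts[p] += weight  # fractional increment, as on slide 19
    return {p: count / len(candidates) for p, count in counts.items()}


def compute_profile(candidates, profile, iterations=3):
    """Iterate global -> local -> global a fixed number of times."""
    for _ in range(iterations):
        weights = local_profile(candidates, profile)
        profile = global_profile(candidates, weights, profile)
    return profile


print(compute_profile(CANDIDATES, INITIAL_PROFILE))
```

In this toy run the unambiguous token keeps pulling the ambiguous one toward the spelling-variant reading, so the estimated weight of the OCR error pattern halves with every iteration.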
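
Note on slide 25, measure (2). Precision and recall for OCR error detection can be computed from two sets of token positions: those the Profiler flags and those that are errors according to the ground truth. The function name and the example positions below are illustrative assumptions and are not taken from the evaluation data.

```python
def precision_recall(detected, actual):
    """Return (precision, recall) for detected error positions against the ground truth."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical token positions: 4 flagged by the profiler, 6 real errors, 3 in common.
flagged = {1, 2, 3, 5}
ground_truth = {2, 3, 5, 6, 7, 8}
print(precision_recall(flagged, ground_truth))
```

High precision with lower recall, as reported on slides 28 and 31, means that most flagged tokens are real errors while a good share of the real errors goes unflagged.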