
Bne demoday postcorrection_and_profiler


Presentation introducing the Profiler and Post-Correction System, given by Jesse de Does at the demo session held at the BNE on 5 October 2011.

Published in: Technology, Education

  1. TR5 Profiler and Post-Correction System. Ludwig-Maximilians-Universität München, Centrum für Informations- und Sprachverarbeitung
  2. TR5 Post-Correction System
       • User interface for easy post-correction of historical OCR'd documents
       • Stand-alone user interface
       • Innovative language technology enables the identification and presentation of recognition errors and their efficient correction
  3. Customizable user interface
       • Freely rearrangeable interface elements:
           • OCR with image snippets
           • Complete image
           • Correction candidates / special functions
       • Screenshot labels: OCR and image fragments; correction candidates, special functions; complete image; font size
  4. View: OCR and image clippings
       • Word-by-word presentation of recognized text and image clippings.
       • Comparing text and image this way follows the reading order and is much easier than a side-by-side presentation of image and text.
  5. View: Original image
       • For difficult cases
       • When word segmentation by OCR fails
       • Current word is highlighted
  6. Word-by-word correction of text
       • Correction by manual text entry
       • Choosing correction candidates
       • Faster correction thanks to candidates proposed by the post-correction system
  7. Batch correction: efficient post-correction
       • Batch correction
           • Several occurrences of an identical word
  8. Batch correction: efficient post-correction
       • Batch correction
           • Classes of systematic errors
           • Errors where the correction candidate has a high degree of certainty
           • Further possibilities
               • Frequent errors
               • For instance, location names
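The simplest batch case on the slides, several occurrences of an identical word, can be illustrated with a few lines of Python. This is only a sketch: the function name `batch_correct` and the sample tokens are assumptions made for illustration and say nothing about how the actual Post-Correction System is implemented.

```python
def batch_correct(tokens, wrong, right):
    """Replace every occurrence of the OCR string `wrong` with `right`.

    Returns the corrected token list and the positions that were changed,
    so that a user interface could show what the batch operation touched.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == wrong]
    corrected = [right if tok == wrong else tok for tok in tokens]
    return corrected, positions


# Hypothetical OCR output in which "theil" was misread twice as "iheil".
tokens = ["der", "iheil", "und", "der", "andere", "iheil"]
corrected, changed = batch_correct(tokens, "iheil", "theil")
```

One confirmation by the user corrects both occurrences at once, which is where the time saving of batch correction comes from.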
  9. Post-correction system: Evaluation (Ulrich Reffle, 4 July 2011)
       • Result: error correction with text and error profiling is 2.7 times faster
       • User experiment with 14 individual instances
  10. Correction system
  11. Correction system
  12. Why another post-correction system?
       • Targets a more specialist audience
       • Thanks to the underlying language technology:
           • Historical variants are recognized and not marked as errors, even when they are not in the historical lexicon
           • Historical variants are proposed as correction candidates
           • Typical error patterns are exploited
           • Correction candidates are ranked
  13. Underlying language technology
       • Lexica and language models help deal with orthographical variants and unknown words.
       • Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
           • Approximate search in "hypothetical lexica"
           • An analysis of the whole work ("text and error profile") produces document-specific information about the language and the type of OCR errors
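To make "approximate search in hypothetical lexica" more concrete, here is a small Python sketch. It assumes that a hypothetical lexicon is built by applying a variant pattern once to each modern word, and it uses a brute-force Levenshtein comparison in place of whatever index structure the LMU software actually uses. All function names and the tiny word list are illustrative assumptions.

```python
def apply_pattern(word, left, right):
    """Yield every word obtained by rewriting one occurrence of `left` as `right`."""
    start = word.find(left)
    while start != -1:
        yield word[:start] + right + word[start + len(left):]
        start = word.find(left, start + 1)


def hypothetical_lexicon(modern_words, patterns):
    """Map each (possibly hypothetical) entry to its (modern word, pattern) origins."""
    lexicon = {w: [(w, None)] for w in modern_words}
    for w in modern_words:
        for left, right in patterns:
            for variant in apply_pattern(w, left, right):
                lexicon.setdefault(variant, []).append((w, (left, right)))
    return lexicon


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def approximate_search(token, lexicon, max_distance=1):
    """Return (distance, entry) pairs for all entries close to the OCR token."""
    hits = ((levenshtein(token, entry), entry) for entry in lexicon)
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Illustrative data only: two modern words and the pattern t -> th.
lexicon = hypothetical_lexicon(["teil", "heil"], [("t", "th")])
hits = approximate_search("iheil", lexicon)
```

In this toy example the OCR token "iheil" is one edit away from both the modern word "heil" and the hypothetical variant "theil", so both come back as candidates. Choosing between them is a ranking problem.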
  14. Text and error profiles
       • Text profile
           • Coverage of lexica
           • Typical variant patterns
           • Targeted selection of lexica
           • Better language models
               • Distinguishing historical variants and OCR errors
               • Ranking of correction candidates
               • Recall and precision in IR
       • Error profile
           • Estimate of error rate
           • Typical OCR errors
           • Better modeling of the error channel
               • Distinguishing historical variants and OCR errors
               • Ranking of correction candidates
               • Treatment of systematic errors
  15. Underlying logic: dual noisy channel model
       • Interpretation of OCR output tokens as the result of two "noisy channels"
       • modern word u -> (patterns) -> historical variant v -> (OCR errors) -> OCR result w
       • Given an OCR token w, give possible interpretations of w in terms of
           • the "underlying" modern word u (IR!)
           • the correct historical word v and its derivation from u via "patterns"
           • the OCR errors garbling v into w
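One way to read the dual noisy channel model is as a product of three factors: the probability of the modern word u, of the patterns turning u into v, and of the OCR errors turning v into w. The Python sketch below ranks interpretations on that basis. It is a simplified illustration, not the Profiler's actual scoring: the class and function names are assumptions, every probability is invented for the example, and the probability of a pattern or error not applying is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    modern: str              # u, the underlying modern word
    historical: str          # v, the correct historical word
    variant_patterns: tuple  # patterns rewriting u into v
    ocr_errors: tuple        # OCR errors garbling v into the token w


def score(interp, word_prob, pattern_prob, error_prob):
    """Unnormalised score: P(u) times pattern and error probabilities."""
    p = word_prob.get(interp.modern, 0.0)
    for pattern in interp.variant_patterns:
        p *= pattern_prob.get(pattern, 0.0)
    for error in interp.ocr_errors:
        p *= error_prob.get(error, 0.0)
    return p


def rank(interps, word_prob, pattern_prob, error_prob):
    """Return (normalised weight, interpretation) pairs, best first."""
    scored = [(score(i, word_prob, pattern_prob, error_prob), i) for i in interps]
    total = sum(s for s, _ in scored)
    if total == 0:
        return []
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s / total, i) for s, i in scored]


# All numbers below are made up for illustration.
word_prob = {"teil": 0.002, "heil": 0.0005}
pattern_prob = {("t", "th"): 0.03}
error_prob = {("t", "i"): 0.01, ("", "i"): 0.001}

candidates = [
    Interpretation("heil", "heil", (), (("", "i"),)),
    Interpretation("teil", "theil", (("t", "th"),), (("t", "i"),)),
]
ranked = rank(candidates, word_prob, pattern_prob, error_prob)
```

With these invented numbers, the interpretation teil -> theil -> iheil narrowly outranks heil -> heil -> iheil. Different profile values would reverse the order, which is why document-specific probabilities matter.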
  16. Historical variant and OCR error patterns
       • Historical variants: teil -> theil
       • OCR error patterns: theil -> iheil
  17. Relative frequency: 2.9% of all 't' are rewritten to 'th'. Absolute frequency: the pattern was found 120 times in the current document.
  18. Local view: interpretations of tokens
       • "Meaningful interpretations" for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.
       • Screenshot labels: occurrence of spelling variant "i->y"; occurrence of OCR error "i->y"
  19. Global view: pattern frequencies
       • Increment counters to estimate (relative) frequencies.
       • Occurrences of spelling variant "i->y": +0.999771
       • Occurrences of OCR error "i->y": +0.000224948
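The fractional increments shown on the slide suggest that each interpretation of a token adds its share of probability to the pattern counters, rather than a full count of 1. A minimal Python sketch of that idea follows. The function name, the data layout and the scores 9.0 and 1.0 are assumptions chosen for illustration.

```python
from collections import Counter


def update_counters(interpretations, variant_counts, error_counts):
    """Add fractional counts for one token.

    `interpretations` is a list of (score, variant_patterns, ocr_errors)
    tuples; scores are normalised so that the token contributes a total
    weight of 1 across all of its interpretations.
    """
    total = sum(score for score, _, _ in interpretations)
    if total == 0:
        return
    for score, variants, errors in interpretations:
        weight = score / total
        for pattern in variants:
            variant_counts[pattern] += weight
        for error in errors:
            error_counts[error] += weight


variant_counts, error_counts = Counter(), Counter()

# Hypothetical token "seyn": either a spelling variant of "sein" (i->y)
# or an OCR misreading of "sein" (i->y). Scores are invented.
update_counters(
    [(9.0, ["i->y"], []), (1.0, [], ["i->y"])],
    variant_counts,
    error_counts,
)
```

After this call the variant counter for "i->y" has grown by 0.9 and the error counter by 0.1. Dividing such accumulated counts by the number of places where the pattern could have applied would give relative frequencies like the 2.9% quoted on slide 17.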
  20. Computation of profile: initialization
       • OCR result: w0, w1, w2, w3, …
       • Initial global profile: non-specific model with probabilities for words, variant patterns and errors
  21. Computation of profile: global to local
       • OCR result: w0, w1, w2, w3, …
       • Initial global profile: non-specific model with probabilities for words, variant patterns and errors
       • Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … -> … -> …
  22. Computation of profile: local to global
       • OCR result: w0, w1, w2, w3, …
       • Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … -> … -> …
       • Global profile: improved model with probabilities for words, variant patterns and errors
  23. Computation of profile: iteration
       • OCR result: w0, w1, w2, w3, …
       • Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … -> … -> …
       • Global profile: improved model with probabilities for words, variant patterns and errors
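The alternation between global and local profiles resembles an expectation-maximisation loop. The toy Python sketch below shows that structure under strong simplifying assumptions: candidate interpretations are given in advance, each is a tuple of event labels, word probabilities are left out, and global probabilities are re-estimated as fractional counts divided by the number of tokens. None of the names or numbers come from the Profiler itself.

```python
from collections import Counter
from math import prod


def local_profile(candidates, global_profile, floor=1e-6):
    """Global -> local: weight each token's interpretations by the current model."""
    local = []
    for interps in candidates:
        scores = [prod(global_profile.get(e, floor) for e in interp) for interp in interps]
        total = sum(scores)
        local.append([s / total for s in scores])
    return local


def global_profile_from(candidates, local):
    """Local -> global: re-estimate event probabilities from fractional counts."""
    counts = Counter()
    for interps, weights in zip(candidates, local):
        for interp, weight in zip(interps, weights):
            for event in interp:
                counts[event] += weight
    return {event: count / len(candidates) for event, count in counts.items()}


def compute_profile(candidates, initial_profile, iterations=5):
    """Alternate the two steps a fixed number of times."""
    profile = dict(initial_profile)
    local = []
    for _ in range(iterations):
        local = local_profile(candidates, profile)
        profile = global_profile_from(candidates, local)
    return profile, local


# Hypothetical tokens "seyn", "bey", "frey": the first two can only be
# explained by the spelling variant i->y, the third is ambiguous.
candidates = [
    [("var:i->y",)],
    [("var:i->y",)],
    [("var:i->y",), ("err:i->y",)],
]
initial = {"var:i->y": 0.5, "err:i->y": 0.5}  # non-specific starting model
profile, local = compute_profile(candidates, initial)
```

The unambiguous tokens pull the global estimate towards the spelling variant, and with each pass the ambiguous third token is assigned more firmly to that reading. This is the effect the slides describe as moving from a non-specific to an improved model.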
  24. Profiler Evaluation
       • Measure the quality
           • of global profiles
           • of OCR error detection
       • Challenges
           • Measures are not obvious
           • Good evaluation data is difficult to gather
           • Results need interpretation
  25. Evaluation: Measures
       • (1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns
       • (2) OCR error detection: precision and recall for the OCR errors detected by the Profiler
       • (3) Indirect evaluation (for instance, by means of the post-correction system)
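Measures (1) and (2) can be written down in a few lines of Python. The sketch assumes that "matches for the first 10 patterns" means the share of the top-ranked patterns that also occur in the ground truth, which is one plausible reading of the slide. Function names and sample data are illustrative.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected error positions against the true ones."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_matches(ranked_patterns, true_patterns, k=10):
    """Share of the first k ranked patterns that appear in the ground truth."""
    top = ranked_patterns[:k]
    if not top:
        return 0.0
    return sum(pattern in true_patterns for pattern in top) / len(top)


# Invented example: token positions flagged as OCR errors vs. real errors.
precision, recall = precision_recall([1, 2, 3, 4], [2, 3, 5, 6, 7])

# Invented example: a ranked pattern list vs. annotated patterns.
share = top_k_matches(["t->th", "i->y", "u->v", "x->z"], {"t->th", "i->y", "u->v"}, k=4)
```

Here two of the four flagged positions are real errors (precision 0.5), two of the five real errors were found (recall 0.4), and three of the four top-ranked patterns are confirmed by the annotation (0.75).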
  26. Evaluation: Data preparation
       • (1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
           • ++ fully accurate
           • -- manual work
       • (2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.
           • ++ no manual work
           • -- not completely accurate
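The automatic alignment step of the shallow evaluation can be approximated with Python's standard `difflib` module. This is a stand-in chosen for illustration, since the slides do not say which alignment method was used. The function name and the sample tokens are assumptions.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens.

    Equal runs and same-length replacements are paired one to one.
    Where the segmentation differs, OCR tokens are paired with None,
    one reason why this kind of evaluation is not completely accurate.
    """
    pairs = []
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            pairs.extend(zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
        else:
            pairs.extend((token, None) for token in ocr_tokens[i1:i2])
    return pairs


# Hypothetical OCR output and re-typed ground truth.
ocr = ["der", "iheil", "der", "welt"]
truth = ["der", "theil", "der", "welt"]
pairs = align_tokens(ocr, truth)
errors = [(o, t) for o, t in pairs if t is not None and o != t]
```

Each aligned pair that differs, here ("iheil", "theil"), gives an OCR error whose correct form is known without manual annotation.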
  27. Evaluation: Data
       • Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages
       • Shallow: 5 books each from the 16th, 17th and 18th century
  28. Evaluation: Eckartshausen
       • Historical patterns
           • matches, first 10: 70%
           • precision, all: 68%
           • recall, all: 73%
       • OCR patterns
           • matches, first 6: 67%
           • precision, all: 59%
           • recall, all: 19%
       • (3) OCR error detection
           • precision: 86%
           • recall: 46%
  29. Graphical Evaluation: Eckartshausen
  30. Graphical Evaluation: diacritics (panels: Hist. Var., OCR)
  31. Shallow Evaluation Results
                                                       16th           17th           18th
       HIST patterns, first 10                         60%            74%            78%
       OCR patterns, first 10                          48%            70%            50%
       Error detection, precision                      95%            92%            81%
       Error detection, recall                         49%            43%            45%
       Content word errors                             64%            44%            16%
       Easy interactive correction per 10,000 words    ≈ 3000 words   ≈ 1892 words   ≈ 720 words
  32. Global Profile: Spelling variation patterns
  33. Spelling variation profile
  34. OCR Error Profile
