BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Jesse de Does gives presentation on the LMU OCR Profiler and Post Correction.

Delivered at British Library Demo Day on the 12th of July 2011.

Published in: Technology, Education

Slide notes
  • DictModule name=“modern” File=“../dicts/modern.dic” max_ocr_errors=3 max_spelling_variants
  • Transcript of "BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction"

    1. TR5 Profiler and Post-Correction System
       Ludwig-Maximilians-Universität München, Centrum für Informations- und Sprachverarbeitung
    2. TR5 Post-Correction System
         • User interface for easy post-correction of historical OCR'd documents
         • Stand-alone user interface
         • Innovative language technology enables the identification and presentation of recognition errors and their efficient correction
    3. Customizable user interface
         • Freely rearrangeable interface elements:
             • OCR with image snippets
             • Complete image
             • Correction candidates / special functions
       Screenshot labels: OCR and image fragments; correction candidates, special functions; complete image; font size
    4. View: OCR and image clippings
         • Word-by-word presentation of recognized text and image clippings.
         • Comparing text and image follows the reading order and is much easier than with a side-by-side presentation of image and text.
    5. View: Original image
         • For difficult cases
         • When word segmentation by OCR fails
         • Current word is highlighted
    6. Word-by-word correction of text
         • Correction by manual text entry
         • Choosing correction candidates
         • Faster correction thanks to candidates proposed by the post-correction system
    7. Batch correction: efficient post-correction
         • Batch correction
             • Several occurrences of an identical word
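Batch correction of several occurrences of an identical word can be pictured with the following minimal sketch. It is an illustration only, not the TR5 implementation: the function name `batch_correct` and the sample tokens are hypothetical.

```python
from collections import defaultdict


def batch_correct(tokens, wrong, right):
    """Replace every occurrence of the OCR token `wrong` with `right`.

    Returns the corrected token list together with the positions that were
    changed, so that a user could review them before accepting the batch.
    """
    # Index all positions by surface form once, so repeated batch
    # corrections do not need to rescan the whole text.
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    changed = positions.get(wrong, [])
    corrected = list(tokens)
    for index in changed:
        corrected[index] = right
    return corrected, changed


# Hypothetical OCR output in which "theil" was twice misread as "iheil".
tokens = ["der", "iheil", "und", "der", "iheil"]
corrected, changed = batch_correct(tokens, "iheil", "theil")
print(corrected, changed)
```

Returning the changed positions keeps the correction reviewable, which matters when a frequent token has more than one plausible reading.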
    8. Batch correction: efficient post-correction
         • Batch correction
             • Classes of systematic errors
             • Errors where the correction candidate has a high degree of certainty
             • Further possibilities
                 • Frequent errors
                 • For instance, location names
    9. Post-correction system: Evaluation
       Ulrich Reffle, 4 July 2011
         • Result: error correction supported by text and error profiling is 2.7 times faster
         • User experiment with 14 individual instances
    10. Correction system
    11. Correction system
    12. Why another post-correction system?
         • Targets a more specialist audience
         • Thanks to the underlying language technology:
         • Historical variants are recognized and not marked as errors, even when they are not in the historical lexicon
         • Historical variants are proposed as correction candidates
         • Typical error patterns are exploited
         • Ranking of correction candidates
    13. Underlying language technology
         • Lexica and language models help in dealing with orthographical variants and unknown words.
         • Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
             • Approximate search in “hypothetical lexica”
             • An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors
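To make the idea of approximate lexicon search concrete, here is a brute-force sketch that returns all lexicon entries within a fixed number of edits of an OCR token. It is a stand-in under stated assumptions, not the LMU technology: the slides do not describe the search algorithm, and the names `edit_distance`, `approximate_lookup`, `max_errors` and the toy lexicon are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_lookup(token, lexicon, max_errors):
    """Return (distance, entry) pairs within `max_errors` edits, closest first."""
    hits = ((edit_distance(token, entry), entry) for entry in lexicon)
    return sorted((d, entry) for d, entry in hits if d <= max_errors)


# Toy lexicon containing both modern and historical spellings.
lexicon = ["theil", "teil", "heil", "weil"]
print(approximate_lookup("iheil", lexicon, max_errors=1))
```

Scanning the whole lexicon is only workable for small word lists; it is shown here because it makes the error bound explicit.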
    14. Text and error profiles
         • Text profile
             • Coverage of lexica
             • Typical variant patterns
             • Targeted selection of lexica
             • Better language models
                 • Distinguishing historical variants and OCR errors
                 • Ranking of correction candidates
                 • Recall and precision in IR
         • Error profile
             • Estimate of error rate
             • Typical OCR errors
             • Better modeling of error channel
                 • Distinguishing historical variants and OCR errors
                 • Ranking of correction candidates
                 • Treatment of systematic errors
    15. Underlying logic: Dual noisy channel model
         • Interpretation of OCR output tokens as the result of two “noisy channels”:
           modern word u -> (patterns) -> historical variant v -> (OCR errors) -> OCR result w
         • Given an OCR token w, give possible interpretations of w in terms of
             • the “underlying” modern word u (IR!)
             • the correct historical word v and its derivation from u via “patterns”
             • the OCR errors garbling v into w
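One common way to read a two-channel model like this is to score each interpretation (u, v, w) by the product P(u) · P(v | u) · P(w | v) and to normalise over all candidates for the same token. The sketch below follows that reading. It is an assumption about how the channels combine, not a statement of the profiler's actual scoring, and every probability in it is an invented toy value.

```python
from typing import NamedTuple


class Interpretation(NamedTuple):
    modern: str        # u, the underlying modern word
    historical: str    # v, the correct historical spelling
    ocr: str           # w, the token produced by OCR
    p_modern: float    # P(u), toy value
    p_variant: float   # P(v | u), toy value for the variant patterns used
    p_ocr: float       # P(w | v), toy value for the OCR errors used

    def joint(self):
        """Unnormalised score of this interpretation."""
        return self.p_modern * self.p_variant * self.p_ocr


def rank(interpretations):
    """Return (posterior, interpretation) pairs, most probable first."""
    total = sum(item.joint() for item in interpretations)
    scored = [(item.joint() / total, item) for item in interpretations]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    # teil -> theil by the variant pattern t->th, then theil -> iheil by OCR.
    Interpretation("teil", "theil", "iheil", 0.01, 0.3, 0.02),
    # heil kept unchanged, then a spurious leading "i" inserted by OCR.
    Interpretation("heil", "heil", "iheil", 0.002, 1.0, 0.001),
]
for posterior, item in rank(candidates):
    print(f"{posterior:.3f}  {item.modern} -> {item.historical} -> {item.ocr}")
```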
    16. Historical variant and OCR error patterns
         • Historical variants: teil -> theil
         • OCR error patterns: theil -> iheil
    17. Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’.
        Absolute frequency: the pattern was found 120 times in the current document.
    18. Local view: interpretations of tokens
         • “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.
       Occurrence of spelling variant “i->y”:
       Occurrence of OCR error “i->y”:
    19. Global view: pattern frequencies
         • Increment counters to estimate (relative) frequencies.
       Occurrences of spelling variant “i->y”: +0.999771
       Occurrences of OCR error “i->y”: +0.000224948
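The fractional increments on this slide suggest that each interpretation of a token contributes its weight, rather than a whole count, to the global pattern counters. A minimal sketch of that bookkeeping follows. The function names and the `opportunities` figure are hypothetical; only the two weights come from the slide.

```python
from collections import defaultdict


def add_fractional_counts(counts, weighted_patterns):
    """Add each interpretation's weight to the counter of its pattern.

    `weighted_patterns` holds (weight, kind, pattern) triples for one token,
    where `kind` is "variant" or "ocr".
    """
    for weight, kind, pattern in weighted_patterns:
        counts[(kind, pattern)] += weight


def relative_frequency(counts, key, opportunities):
    """Estimated share of positions where the pattern applied."""
    return counts[key] / opportunities


counts = defaultdict(float)
# Weights taken from the slide: the same token read as a spelling variant
# or, far less plausibly, as an OCR error.
add_fractional_counts(counts, [
    (0.999771, "variant", "i->y"),
    (0.000224948, "ocr", "i->y"),
])
# Hypothetical: assume 'i' occurs 50 times in the text seen so far.
print(relative_frequency(counts, ("variant", "i->y"), opportunities=50))
```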
    20. Computation of profile: initialization
         • OCR result: w0, w1, w2, w3, …
         • Initial global profile: non-specific model with probabilities for
             • Words
             • Variant patterns
             • Errors
    21. Computation of profile: global to local
         • OCR result: w0, w1, w2, w3, …
         • Initial global profile: non-specific model with probabilities for words, variant patterns and errors
         • Local profile: for each token w0, w1, w2, w3, … a list of interpretations of the form … -> … -> …
    22. Computation of profile: local to global
         • OCR result: w0, w1, w2, w3, …
         • Local profile: for each token w0, w1, w2, w3, … a list of interpretations of the form … -> … -> …
         • Global profile: improved model with probabilities for words, variant patterns and errors
    23. Computation of profile: iteration
         • OCR result: w0, w1, w2, w3, …
         • The local profile (interpretations … -> … -> … per token) and the global profile (improved model with probabilities for words, variant patterns and errors) are recomputed from each other in turn
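The alternation between local and global profile resembles an expectation-maximisation loop. The sketch below is a deliberately reduced illustration of that reading, not the profiler's algorithm: each token carries only a list of candidate pattern labels, the initial profile is uniform, and the iteration count is fixed. The function name `compute_profile` and the labels are hypothetical.

```python
def compute_profile(tokens, patterns, iterations=5):
    """Alternate between local and global profile.

    `tokens` is a list in which each entry is the non-empty list of candidate
    pattern labels for one OCR token. Returns the final global profile and
    the local profile computed in the last iteration.
    """
    # Initialisation: non-specific model, here simply uniform.
    global_profile = {p: 1.0 / len(patterns) for p in patterns}
    local_profile = []

    for _ in range(iterations):
        # Global to local: weight each candidate by the current model.
        local_profile = []
        for candidates in tokens:
            norm = sum(global_profile[c] for c in candidates)
            local_profile.append(
                {c: global_profile[c] / norm for c in candidates})

        # Local to global: add up the fractional weights and renormalise.
        counts = {p: 0.0 for p in patterns}
        for posterior in local_profile:
            for pattern, weight in posterior.items():
                counts[pattern] += weight
        total = sum(counts.values())
        global_profile = {p: counts[p] / total for p in patterns}

    return global_profile, local_profile


patterns = ["hist:t->th", "ocr:t->i"]
tokens = [["hist:t->th"], ["hist:t->th"], ["hist:t->th", "ocr:t->i"]]
global_profile, local_profile = compute_profile(tokens, patterns)
print(global_profile)
print(local_profile[2])
```

In this toy run the ambiguous third token is pulled toward the pattern that the unambiguous tokens support, which is the effect the iteration is meant to produce.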
    24. Profiler Evaluation
         • Measure the quality
             • of global profiles
             • of OCR error detection
         • Challenges
             • Measures not obvious
             • Good evaluation data is difficult to gather
             • Results need interpretation
    25. Evaluation: Measures
         (1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.
         (2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.
         (3) Indirect evaluation (for instance, by means of the post-correction system).
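The first two measures can be written down in a few lines. Precision and recall are standard. The top-k match is my reading of "percentage of matches for the first 10 patterns": it assumes the predicted ranking is compared with the first k entries of a gold ranking, which the slide does not spell out. Function names and sample data are hypothetical.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected error positions against the truth."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_matches(predicted_ranked, gold_ranked, k=10):
    """Share of the first k predicted patterns found among the first k gold patterns."""
    overlap = set(predicted_ranked[:k]) & set(gold_ranked[:k])
    return len(overlap) / k


# Token positions flagged as OCR errors versus positions that really are errors.
print(precision_recall(detected={1, 2, 3, 4}, actual={2, 3, 5}))
print(top_k_matches(["t->th", "i->y", "u->v"], ["i->y", "t->th", "ä->a"], k=3))
```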
    26. Evaluation: Data preparation
         (1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
             ++ fully accurate
             -- manual work
         (2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.
             ++ no manual work
             -- not completely accurate
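The automatic alignment used for shallow evaluation can be approximated with Python's standard `difflib`. This is a sketch of the general idea only. The slides do not say which alignment method was used, and the function name `align` and the sample tokens are hypothetical.

```python
from difflib import SequenceMatcher


def align(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens.

    Equal runs and same-length replacements are paired one to one. Anything
    else (for example a word split in two by OCR) is kept as a joined span,
    which is one reason such an alignment is not completely accurate.
    """
    matcher = SequenceMatcher(None, ocr_tokens, truth_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            pairs.extend(zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
        else:
            pairs.append((" ".join(ocr_tokens[i1:i2]),
                          " ".join(truth_tokens[j1:j2])))
    return pairs


print(align(["der", "iheil", "und"], ["der", "theil", "und"]))
print(align(["the", "il", "und"], ["theil", "und"]))
```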
    27. Evaluation: Data
         • Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages
         • Shallow: 5 books each from the 16th, 17th and 18th century
    28. Evaluation: Eckartshausen
         • Historical patterns
             • matches, first 10: 70%
             • precision, all: 68%
             • recall, all: 73%
         • OCR patterns
             • matches, first 6: 67%
             • precision, all: 59%
             • recall, all: 19%
         • (3) OCR error detection
             • precision: 86%
             • recall: 46%
    29. Graphical Evaluation: Eckartshausen
    30. Graphical Evaluation: diacritics (chart series: Hist. Var., OCR)
    31. Shallow Evaluation Results

                                          16th      17th      18th
        HIST Patterns, first 10           60%       74%       78%
        OCR Patterns, first 10            48%       70%       50%
        Error Detection, precision        95%       92%       81%
        Error Detection, recall           49%       43%       45%
        Content Words Errors              64%       44%       16%
        Easy Interactive Correction       ≈ 3000    ≈ 1892    ≈ 720
          (words per 10,000 words)