Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion

Presentation of the paper "Correcting Noisy OCR: Context Beats Confusion" by John Evershed and Kent Fitch at DATeCH 2014. #digidays


Transcript

  • 1. Automatic OCR correction http://overproof.projectcomputing.com Correcting noisy OCR - Context beats Confusion [ presentation viewable at http://goo.gl/n85gR6 ]
  • 2. who are we?
      ● Australian software company
      ● developers John and Kent
      ● we put theory into practice
  • 3. main target - newspapers
      ● the first draft of history
      ● popular if made available
      ● usually poorly digitized
      ● too extensive for full human correction
  • 4. goals
      ● run on commodity cloud server
      ● optimal for noisy text
      ● at least 1000 words/sec
      ● correct at least 50% of errors
  • 5. division of labour [diagram; components: MANAGER, TRIAGE, CORE; labels: bad, good, models, models]
  • 6. snippets for the core
      ● prefer triaged good words at start/end
      ● column aware
      ● some easy corrections applied
      ● some suggestions supplied
      ● bag of topic words available
      ● surrounding noise level indicated
  • 7. error contexts
      ● spell: vowals or consonnants
      ● type: you jit teh wrng key
      ● OCR: roprcroiitativcs cf thc Coveriuient
      ● random: anygh<eg 0at7happen
  • 8. confusion cost matrix
      93: w ← w
      155: e ← e
      3750: c ← e
      4451: m ← rn
      6652: rn ← m
      11065: E ← m
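One way to use a cost table like the one on slide 8 is a weighted alignment between the OCR string and a candidate word, where a segment of up to two characters on either side can be swapped in one step. The sketch below is only an illustration and not the authors' implementation. Reading `a ← b` as "the OCR printed a where the truth was b" is an assumption, as are the two default costs and the names `segment_cost` and `alignment_cost`.

```python
# Costs copied from slide 8, keyed as (ocr_text, true_text).
# ASSUMPTION: "a <- b" means the OCR printed a where the truth was b.
CONFUSION_COST = {
    ("w", "w"): 93,
    ("e", "e"): 155,
    ("c", "e"): 3750,
    ("m", "rn"): 4451,
    ("rn", "m"): 6652,
    ("E", "m"): 11065,
}
DEFAULT_MATCH = 200       # hypothetical cost of an unlisted identical character
DEFAULT_MISMATCH = 12000  # hypothetical cost of any unlisted confusion
MAX_SEG = 2               # longest segment considered on either side


def segment_cost(ocr_seg, true_seg):
    """Cost of the OCR printing ocr_seg where the truth was true_seg."""
    if (ocr_seg, true_seg) in CONFUSION_COST:
        return CONFUSION_COST[(ocr_seg, true_seg)]
    if ocr_seg == true_seg and len(ocr_seg) == 1:
        return DEFAULT_MATCH
    return DEFAULT_MISMATCH


def alignment_cost(ocr, truth):
    """Cheapest segmentation of ocr and truth into confusable pieces."""
    inf = float("inf")
    best = [[inf] * (len(truth) + 1) for _ in range(len(ocr) + 1)]
    best[0][0] = 0
    for i in range(len(ocr) + 1):
        for j in range(len(truth) + 1):
            if best[i][j] == inf:
                continue
            for di in range(MAX_SEG + 1):
                for dj in range(MAX_SEG + 1):
                    if di == 0 and dj == 0:
                        continue
                    if i + di > len(ocr) or j + dj > len(truth):
                        continue
                    step = segment_cost(ocr[i:i + di], truth[j:j + dj])
                    if best[i][j] + step < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + step
    return best[len(ocr)][len(truth)]


print(alignment_cost("rnorning", "morning"))  # 7852
```

With these made-up defaults, "rnorning" aligns to "morning" through a single `rn ← m` step plus six plain matches, which is far cheaper than any alignment to "warning".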
  • 9. word cost (eg rnorniny|morning)
      language cost
      ● lexicon frequency
      ● entity list
      ● rare word list
      ● character 4-gram
      error cost
      ● edit sum
      ● visual correlation
      ● generator hint
  • 10. word character confusion
      m o r n i n g
      r n o r n i n y
  • 11. visual correlation
  • 12. suggestion methods
      ● gift
      ● common, cached
      ● language
      ● entities
      ● split/join
      ● generated (magic)
  • 13. searching for gold (A*) [search tree diagram of alternative characters at each position]
      purple nodes: working priority queue
      red nodes: output priority queue
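The search on slide 13 can be pictured as a best-first expansion of partial words held in a priority queue, with finished words collected as output. The sketch below is a toy A* under stated assumptions, not the product's generator: the confusion table, costs, lexicon and the name `suggest` are all hypothetical. The heuristic charges the cheapest per-character cost for every OCR character still unread, which never overestimates with these numbers.

```python
import heapq

# Hypothetical confusion table: OCR segment -> [(candidate truth, cost)].
ALTERNATIVES = {
    "rn": [("m", 3)],
    "y": [("g", 4)],
    "c": [("e", 4)],
}
KEEP_COST = 1  # hypothetical cost of keeping an OCR character as printed
LEXICON = {"morning", "mourning", "moaning", "warning"}
# Every prefix of every lexicon word, used to prune hopeless branches.
PREFIXES = {w[:k] for w in LEXICON for k in range(len(w) + 1)}


def suggest(ocr, limit=3):
    """Return up to `limit` lexicon words as (word, cost), cheapest first."""
    # Working queue entries: (g + h, g, position in ocr, text built so far).
    working = [(len(ocr) * KEEP_COST, 0, 0, "")]
    seen = set()
    output = []
    while working and len(output) < limit:
        _, g, pos, text = heapq.heappop(working)
        if (pos, text) in seen:
            continue
        seen.add((pos, text))
        if pos == len(ocr):
            if text in LEXICON:
                output.append((text, g))
            continue
        # Either keep the next character or apply a listed confusion.
        options = [(ocr[pos], KEEP_COST, 1)]
        for seg, alts in ALTERNATIVES.items():
            if ocr.startswith(seg, pos):
                options += [(true, cost, len(seg)) for true, cost in alts]
        for true, cost, used in options:
            new_text = text + true
            if new_text not in PREFIXES:
                continue
            new_pos, new_g = pos + used, g + cost
            h = (len(ocr) - new_pos) * KEEP_COST
            heapq.heappush(working, (new_g + h, new_g, new_pos, new_text))
    return output


print(suggest("rnorniny"))  # [('morning', 12)]
```

Pruning on lexicon prefixes keeps the queue small here; a generator that must also propose words outside the lexicon would need a softer penalty instead of a hard cut.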
  • 14. amazing generated suggestions
      Parhumuitar} ← Parliamentary
      I.iulwuvB ← Railways
      Itegtniont ← Regiment
      niltfltory ← adultery
      uj.rccu.eut ← agreement
      couniutfc.o ← committee
      cnuipuii ← company
      dctoimiuatJOu ← determination
      uiidcrtkikcr’a ← undertaker’s
  • 15. selecting best combination
      unsiejitlv: unsightly unseemly unsettle unsteady Unsightly urgently
      bohavlour: behaviour behavour behavior Behaviour behaviours behaving
      abonf: about above along been
      am: am an a in as
      unsiejitlv: unsightly unseemly unsettle unsteady Unsightly urgently
      disgrie: disgrace disagree disguise desire degree disease
      [NOTE: word joins and splits are also supported]
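Picking one candidate per position so that the whole sequence is cheapest is a classic dynamic-programming problem. The sketch below uses a Viterbi-style pass with a per-word cost plus a bigram cost between neighbours. It is an assumed stand-in for whatever the authors use, and every number in it is invented for illustration, so the winning pair says nothing about the system's real output. `best_sequence`, `bigram_cost` and `BIGRAM_COST` are hypothetical names.

```python
# Hypothetical bigram costs; any pair not listed gets a flat default.
BIGRAM_COST = {("unseemly", "behaviour"): 0.5}


def bigram_cost(prev, word):
    """Context cost of `word` following `prev` (lower is more likely)."""
    return BIGRAM_COST.get((prev, word), 3.0)


def best_sequence(candidates, pair_cost):
    """candidates: one list of (word, word_cost) per OCR word, in order."""
    # best[word] = (total cost, path) of the cheapest path ending in word.
    best = {word: (cost, [word]) for word, cost in candidates[0]}
    for column in candidates[1:]:
        nxt = {}
        for word, cost in column:
            total, path = min(
                ((prev_total + pair_cost(prev, word) + cost, prev_path)
                 for prev, (prev_total, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            nxt[word] = (total, path + [word])
        best = nxt
    return min(best.values(), key=lambda item: item[0])[1]


candidates = [
    [("unsightly", 2.0), ("unseemly", 2.5)],   # for "unsiejitlv"
    [("behaviour", 1.0), ("behaving", 3.0)],   # for "bohavlour"
]
print(best_sequence(candidates, bigram_cost))  # ['unseemly', 'behaviour']
```

In this toy run "unsightly" has the lower word cost on its own, yet the cheap bigram tips the choice to "unseemly", which is the kind of effect the talk's title points at.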
  • 16. training
      ● 5-grams - subset selection
      ● corpus 1,2,3-grams - statistical build
      ● extra word lists - easy
      ● error model - bootstrap or new pairs
  • 17. testing
      ● 65000 words ground truth including foreign (US) newspapers
      ● all measures exceeded goal:
          ○ search errors (article word types)
          ○ read errors (article word tokens)
          ○ entropy weighted term errors
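The difference between the type-based and token-based measures on slide 17 can be shown with a rough bag-of-words sketch. This is a simplification that ignores word order and alignment, and it is not the evaluation code behind the reported numbers. The function names `type_recall` and `token_error_rate` are hypothetical.

```python
from collections import Counter


def type_recall(truth_words, ocr_words):
    """Share of distinct ground-truth words that occur anywhere in the OCR."""
    truth_types = set(truth_words)
    return len(truth_types & set(ocr_words)) / len(truth_types)


def token_error_rate(truth_words, ocr_words):
    """Share of ground-truth tokens with no matching OCR token (bag of words)."""
    matched = sum((Counter(truth_words) & Counter(ocr_words)).values())
    return 1 - matched / len(truth_words)


truth = "the cat sat on the mat".split()
ocr = "the cat sat on thc mat".split()
print(type_recall(truth, ocr))       # 1.0
print(token_error_rate(truth, ocr))  # one of six tokens is wrong
```

In this example a search for any word in the article still finds it, because "the" survives once, while a reader still meets one damaged token.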
  • 18. SMH sample
      Measure               Before   After   Change
      Recall                83.8%    94.1%   recall misses reduced 63.3%
      Raw Error Rate        18.5%    5.5%    errors reduced 70.1%
      Weighted Error Rate   16.2%    6.7%    weighted errors reduced 59.4%
  • 19. questions? Presentation viewable at http://goo.gl/n85gR6
  • 21. National Library of Australia’s TROVE
      ● 1.4m distinct visitors/month
      ● 16m pageviews/month
      ● 80% of usage is old newspapers
          ○ 13m pages, over 600 titles
          ○ 85k lines corrected/day
  • 22. Even this massive volunteer effort cannot keep up
      ● < 2% of errors have been corrected
      ● % corrected is declining
      ● Hence searching is unreliable, OCR’ed text is hard to read and reuse
      ● Trove’s accuracy is “typical”
  • 24. 159 randomly selected news articles from The Sydney Morning Herald; 47.4K words hand-corrected to ground truth
  • 25. SMH sample
      Measure                 Before   After   Change
      Recall                  83.8%    94.1%   recall misses reduced 63.3%
      False positive recall   26.7%    9.1%    false positives reduced 65.8%
      Raw Error Rate          18.5%    5.5%    errors reduced 70.1%
      Weighted Error Rate     16.2%    6.7%    weighted errors reduced 59.4%
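The "reduced" figures on slide 25 read as relative reductions, that is (before - after) / before, with recall misses taken as 100% minus recall. The short check below recomputes them from the rounded percentages on the slide. The results land within about half a percentage point of the published figures, and the small gaps are presumably because the slide was computed from unrounded counts (an assumption). `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before


# Rounded percentages copied from the SMH sample on slide 25.
smh = {
    "recall misses": (100 - 83.8, 100 - 94.1),
    "false positives": (26.7, 9.1),
    "raw errors": (18.5, 5.5),
    "weighted errors": (16.2, 6.7),
}
for name, (before, after) in smh.items():
    print(f"{name}: {relative_reduction(before, after):.1%}")
```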
  • 30. 49 randomly selected news articles from LoC Chronicling America; 18.1K words hand-corrected to ground truth
  • 31. LOC sample
      Measure                 Before   After   Change
      Recall                  84.0%    93.1%   recall misses reduced 56.6%
      False positive recall   23.6%    8.8%    false positives reduced 62.8%
      Raw Error Rate          19.1%    6.4%    errors reduced 66.7%
      Weighted Error Rate     16.0%    7.7%    weighted errors reduced 51.8%
