Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion

Presentation of the paper Correcting Noisy OCR: Context Beats Confusion by John Evershed and Kent Fitch at DATeCH 2014. #digidays

Published in: Technology

  1. Automatic OCR correction http://overproof.projectcomputing.com
     Correcting noisy OCR - Context beats Confusion
     [ presentation viewable at http://goo.gl/n85gR6 ]
  2. who are we?
     ● Australian software company
     ● developers John and Kent
     ● we put theory into practice
  3. main target - newspapers
     ● the first draft of history
     ● popular if made available
     ● usually poorly digitized
     ● too extensive for full human correction
  4. goals
     ● run on commodity cloud server
     ● optimal for noisy text
     ● at least 1000 words/sec
     ● correct at least 50% of errors
  5. division of labour
     [diagram labels: bad models, good models, MANAGER, TRIAGE, CORE]
  6. snippets for the core
     ● prefer triaged good words at start/end
     ● column aware
     ● some easy corrections applied
     ● some suggestions supplied
     ● bag of topic words available
     ● surrounding noise level indicated
  7. error contexts
     ● spell: vowals or consonnants
     ● type: you jit teh wrng key
     ● OCR: roprcroiitativcs cf thc Coveriuient
     ● random: anygh<eg 0at7happen
  8. confusion cost matrix
     93: w ← w
     155: e ← e
     3750: c ← e
     4451: m ← rn
     6652: rn ← m
     11065: E ← m
  9. word cost (eg rnorniny|morning)
     language cost
     ● lexicon frequency
     ● entity list
     ● rare word list
     ● character 4-gram error cost
     ● edit sum
     ● visual correlation
     ● generator hint
  10. word character confusion
      m o r n i n g
      r n o r n i n y
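
The confusion costs and the rnorniny|morning example on the slides above can be combined in a small dynamic-programming sketch. This is an illustration only, not the authors' implementation. It assumes that `a ← b` means "truth `a` was read by the OCR as `b`" (the slides do not define the arrow), and the fallback costs `DEFAULT_MATCH` and `DEFAULT_MISMATCH` are hypothetical values, not figures from the talk.

```python
import math
from functools import lru_cache

# Costs quoted on the "confusion cost matrix" slide, keyed here as
# (truth_segment, ocr_segment). The direction of the arrow is an ASSUMPTION.
CONFUSION_COST = {
    ("w", "w"): 93,
    ("e", "e"): 155,
    ("c", "e"): 3750,
    ("m", "rn"): 4451,
    ("rn", "m"): 6652,
    ("E", "m"): 11065,
}
# Hypothetical fallbacks for pairs the slide does not list.
DEFAULT_MATCH = 150      # unlisted identical single characters
DEFAULT_MISMATCH = 9000  # unlisted single-character substitution, insertion or deletion
MAX_SEGMENT = 2          # longest segment tried on either side


def pair_cost(truth_seg, ocr_seg):
    """Cost of reading truth_seg as ocr_seg (either side may be empty)."""
    listed = CONFUSION_COST.get((truth_seg, ocr_seg))
    if listed is not None:
        return listed
    if len(truth_seg) > 1 or len(ocr_seg) > 1:
        return math.inf  # unlisted multi-character pairs are not allowed
    return DEFAULT_MATCH if truth_seg == ocr_seg else DEFAULT_MISMATCH


def confusion_cost(truth, ocr):
    """Cheapest way to segment and align truth against ocr."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(truth) and j == len(ocr):
            return 0
        options = []
        for a in range(MAX_SEGMENT + 1):
            for b in range(MAX_SEGMENT + 1):
                if (a == 0 and b == 0) or i + a > len(truth) or j + b > len(ocr):
                    continue
                step = pair_cost(truth[i:i + a], ocr[j:j + b])
                options.append(step + best(i + a, j + b))
        return min(options)

    return best(0, 0)


print(confusion_cost("morning", "rnorniny"))   # m←rn, five matches, g read as y
print(confusion_cost("mourning", "rnorniny"))  # needs an extra deleted "u"
```

Under these assumed costs, "morning" is a cheaper explanation of `rnorniny` than "mourning", because the listed `m ← rn` confusion absorbs the length difference without a separate insertion or deletion.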
  11. visual correlation
  12. suggestion methods
      ● gift
      ● common, cached
      ● language
      ● entities
      ● split/join
      ● generated (magic)
  13. searching for gold (A*)
      [search tree diagram, node labels: l i i ne r h hcii h li b n ... c e r o … i i 1 l n u … i i 1 l ...]
      purple nodes: working priority queue
      red nodes: output priority queue
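
The A* slide mentions a working priority queue and an output priority queue. A minimal sketch of that arrangement follows, under stated assumptions: the three-word lexicon, the `y → g` rewrite, and the `MATCH` and `DROPPED` costs are hypothetical, and only the 4451 cost for `rn → m` is taken from the confusion matrix slide (again assuming the arrow direction). It is not the authors' generator. States are pairs of (position in the OCR string, lexicon prefix built so far), and the heuristic is the number of unread OCR characters times the cheapest cost per OCR character, which never overestimates the remaining cost.

```python
import heapq

# Hypothetical lexicon; a real system would use a trie over a large word list.
LEXICON = {"morning", "mourning", "moaning"}
PREFIXES = {word[:k] for word in LEXICON for k in range(len(word) + 1)}

# (ocr_segment, truth_segment) -> cost. Only the rn -> m cost is from the slides.
REWRITES = {("rn", "m"): 4451, ("y", "g"): 5000}
MATCH = 150      # hypothetical cost of accepting an OCR character as printed
DROPPED = 9000   # hypothetical cost of a truth letter the OCR dropped
MIN_COST_PER_OCR_CHAR = min(
    [MATCH] + [cost / len(ocr_seg) for (ocr_seg, _), cost in REWRITES.items()]
)


def successors(ocr, pos, prefix):
    """Yield (new_pos, new_prefix, step_cost) for every allowed move."""
    if pos < len(ocr):
        yield pos + 1, prefix + ocr[pos], MATCH
    for (ocr_seg, truth_seg), cost in REWRITES.items():
        if ocr.startswith(ocr_seg, pos):
            yield pos + len(ocr_seg), prefix + truth_seg, cost
    for letter in "abcdefghijklmnopqrstuvwxyz":
        yield pos, prefix + letter, DROPPED


def suggest(ocr, limit=3):
    """Return up to `limit` (cost, word) suggestions, cheapest first."""
    working = [(len(ocr) * MIN_COST_PER_OCR_CHAR, 0, 0, "")]  # (f, g, pos, prefix)
    output = []      # finished words, kept as a heap of (cost, word)
    settled = set()  # states already expanded at their best cost
    while working and len(output) < limit:
        _, g, pos, prefix = heapq.heappop(working)
        if (pos, prefix) in settled:
            continue
        settled.add((pos, prefix))
        if pos == len(ocr) and prefix in LEXICON:
            heapq.heappush(output, (g, prefix))
        for new_pos, new_prefix, step in successors(ocr, pos, prefix):
            if new_prefix in PREFIXES and (new_pos, new_prefix) not in settled:
                new_g = g + step
                estimate = new_g + (len(ocr) - new_pos) * MIN_COST_PER_OCR_CHAR
                heapq.heappush(working, (estimate, new_g, new_pos, new_prefix))
    return [heapq.heappop(output) for _ in range(len(output))]


print(suggest("rnorniny"))
```

Pruning every partial word that is not a lexicon prefix keeps the search space small, which is one plausible way to reach the kind of throughput listed in the goals slide; the real system's pruning strategy is not described in the transcript.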
  14. amazing generated suggestions
      Parhumuitar} ← Parliamentary
      I.iulwuvB ← Railways
      Itegtniont ← Regiment
      niltfltory ← adultery
      uj.rccu.eut ← agreement
      couniutfc.o ← committee
      cnuipuii ← company
      dctoimiuatJOu ← determination
      uiidcrtkikcr’a ← undertaker’s
  15. selecting best combination
      unsiejitlv unsightly unseemly unsettle unsteady Unsightly urgently
      bohavlour behaviour behavour behavior Behaviour behaviours behaving
      abonf about above along been am am an a in as
      unsiejitlv unsightly unseemly unsettle unsteady Unsightly urgently
      disgrie disgrace disagree disguise desire degree disease
      [NOTE: word joins and splits are also supported]
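
One common way to select the best combination from per-word suggestion lists like those above is a Viterbi-style pass that adds each candidate's own word cost to a language cost for adjacent pairs. The sketch below is an assumption about how such a selection could work, not a description of the authors' system. The candidate words come from the slide, but every number in `CANDIDATES`, `BIGRAM_COST` and `UNSEEN_BIGRAM` is hypothetical and chosen only to show context overriding the cheapest isolated word.

```python
# Candidate words come from the slide; all costs are HYPOTHETICAL.
CANDIDATES = [
    {"unsightly": 3000, "unseemly": 3400},   # suggestions for "unsiejitlv"
    {"behaviour": 1200, "behaving": 4000},   # suggestions for "bohavlour"
]
BIGRAM_COST = {
    ("unseemly", "behaviour"): 500,
    ("unsightly", "behaviour"): 2500,
}
UNSEEN_BIGRAM = 6000  # hypothetical penalty for a word pair with no statistics


def best_combination(candidates):
    """Return (total_cost, words) for the cheapest path through the lattice."""
    # best[word] = (cost of the cheapest path ending in word, that path)
    best = {word: (cost, [word]) for word, cost in candidates[0].items()}
    for column in candidates[1:]:
        following = {}
        for word, word_cost in column.items():
            following[word] = min(
                (
                    prev_cost + BIGRAM_COST.get((prev, word), UNSEEN_BIGRAM) + word_cost,
                    path + [word],
                )
                for prev, (prev_cost, path) in best.items()
            )
        best = following
    return min(best.values())


isolated = min(CANDIDATES[0], key=CANDIDATES[0].get)
print(isolated)                      # cheapest first word when judged alone
print(best_combination(CANDIDATES))  # cheapest pair once context is included
```

With these made-up numbers the first word chosen in isolation differs from the one chosen in context, which is the effect the talk's title points at.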
  16. training
      ● 5-grams - subset selection
      ● corpus 1,2,3-grams - statistical build
      ● extra word lists - easy
      ● error model - bootstrap or new pairs
  17. testing
      ● 65000 words ground truth including foreign (US) newspapers
      ● all measures exceeded goal:
        ○ search errors (article word types)
        ○ read errors (article word tokens)
        ○ entropy weighted term errors
  18. SMH sample
      Measure              Before  After  Change
      Recall               83.8%   94.1%  recall misses reduced 63.3%
      Raw Error Rate       18.5%   5.5%   errors reduced 70.1%
      Weighted Error Rate  16.2%   6.7%   weighted errors reduced 59.4%
  19. questions?
      Presentation viewable at http://goo.gl/n85gR6
  21. National Library of Australia’s TROVE
      ● 1.4m distinct visitors/month
      ● 16m pageviews/month
      ● 80% of usage is old newspapers
        ○ 13m pages, over 600 titles
        ○ 85k lines corrected/day
  22. Even this massive volunteer effort cannot keep up
      ● < 2% of errors have been corrected
      ● % corrected is declining
      ● Hence searching is unreliable, OCR’ed text is hard to read and reuse
      ● Trove’s accuracy is “typical”
  24. 159 randomly selected news articles from The Sydney Morning Herald
      47.4K words hand-corrected to ground truth
  25. SMH sample
      Measure                Before  After  Change
      Recall                 83.8%   94.1%  recall misses reduced 63.3%
      False positive recall  26.7%   9.1%   false positives reduced 65.8%
      Raw Error Rate         18.5%   5.5%   errors reduced 70.1%
      Weighted Error Rate    16.2%   6.7%   weighted errors reduced 59.4%
  30. 49 randomly selected news articles from LoC Chronicling America
      18.1K words hand-corrected to ground truth
  31. LOC sample
      Measure                Before  After  Change
      Recall                 84.0%   93.1%  recall misses reduced 56.6%
      False positive recall  23.6%   8.8%   false positives reduced 62.8%
      Raw Error Rate         19.1%   6.4%   errors reduced 66.7%
      Weighted Error Rate    16.0%   7.7%   weighted errors reduced 51.8%
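
The "reduced" figures in the result tables read as relative reductions, (before - after) / before. The short check below recomputes them from the rounded Before and After percentages shown on the slides. The results land within about a percentage point of the slide figures; the small differences are presumably because the slides were computed from unrounded rates, which is an assumption. For Recall, the quantity being reduced is taken to be the miss rate, 100% minus recall.

```python
def relative_reduction(before, after):
    """Percentage by which a rate fell, relative to its starting value."""
    return 100.0 * (before - after) / before


# (before %, after %, reduction % as printed on the slide)
SMH = {
    "recall misses": (100 - 83.8, 100 - 94.1, 63.3),
    "false positives": (26.7, 9.1, 65.8),
    "raw errors": (18.5, 5.5, 70.1),
    "weighted errors": (16.2, 6.7, 59.4),
}
LOC = {
    "recall misses": (100 - 84.0, 100 - 93.1, 56.6),
    "false positives": (23.6, 8.8, 62.8),
    "raw errors": (19.1, 6.4, 66.7),
    "weighted errors": (16.0, 7.7, 51.8),
}

for sample_name, sample in (("SMH", SMH), ("LOC", LOC)):
    for measure, (before, after, on_slide) in sample.items():
        recomputed = relative_reduction(before, after)
        print(f"{sample_name} {measure}: recomputed {recomputed:.1f}%, slide {on_slide}%")
```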
