OCR challenges in historic documents and the contribution of IMPACT

Presentation by Clemens Neudecker (KB) at the IFLA satellite meeting “New Techniques for Old Documents” (16-18 August, Uppsala, Sweden)

  1. OCR challenges in historic documents and the contribution of IMPACT
     Clemens Neudecker, KB National Library of the Netherlands
  2. Background
     - Text that is not digital is virtually invisible
     - OCR (optical character recognition) technology does not produce satisfactory results for historic documents
     - A lack of institutional knowledge and expertise leads to “re-inventing the wheel”
     - Innovate OCR software and language technology
     - Share best practice and build capacity across Europe (guidelines, training, workshops)
  3. IMPACT – Improving access to text
     - Funded by the EC as part of the 7th Framework Programme
     - Coordinated by KB – National Library of the Netherlands
     - EU funding: €12,100,000
     - 26 partners: libraries, research institutes, industry partners
     - Start date: 1 January 2008
     - Duration: 48 months; 2011: Center of Competence
  4. Historic material: different problems
     - OCR errors: damaged material, bad quality scans, difficult layout, historic fonts, …
     - Historical language: spelling variants, orthographical variants, inflected forms, …
  5. Bad OCR results…
     la 112 B ik e my lat arrived the >Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath, ' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn- .die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine and grocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;' 4Stalled the AluidonG.: ceror' Lkndon, with sundries; : ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,; ;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman, for eathly Newpot;agd llford; -Tw Br.otherAs, lawces, fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per- wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for .:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foir ouck , + iii ballasto I _______~ ~ ~ ~~~Ai
  6. Bleed-through & shine-through
     - General description: When the printing ink was not dry, the letters of one page also appear on the facing page. Also, if the paper is relatively thin, the ink on the other side of the page may shine through.
     - Effects on OCRing: High, since it is the same ink (though lighter) and the shapes of the characters are directly disturbed.
  7. IMPACT: Binarisation
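
The slide above names binarisation without describing a method. As a point of reference only, the following is a minimal sketch of global Otsu thresholding, a standard baseline technique and not the IMPACT method. It assumes a greyscale page given as rows of integers from 0 (black) to 255 (white); the function names `otsu_threshold` and `binarise` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0  # number of pixels with value <= level
    sum_dark = 0     # sum of grey values of those pixels
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarise(image, threshold=None):
    """Map a greyscale image to 1 (ink) and 0 (background)."""
    if threshold is None:
        threshold = otsu_threshold([v for row in image for v in row])
    return [[1 if v <= threshold else 0 for v in row] for row in image]


# Made-up 2x4 page: ink = 30, paper = 220, faint bleed-through = 180.
page = [[220, 30, 220, 180],
        [220, 30, 220, 220]]
print(binarise(page))
```

In this toy page the faint bleed-through pixel ends up in the background class. On real scans a single global threshold often fails for exactly the degradations described above, which is presumably why improved binarisation appears as a separate topic in the deck.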
  8. Annotations in the text
     - General description: All notes, lines and drawings created by users, but also stamps, tapes etc. used within libraries.
     - Effects on OCRing: High, since both segmentation and the recognition process itself are disturbed.
  9. IMPACT: Improved binarisation
     - Original
     - State of the art
     - IMPACT
  10. Warping of paper
      - General description: Because of humidity, a page of an old book is very rarely flat; it is usually warped. Even pressing the paper against a glass plate does not remove the warping.
      - Effects on OCRing: Can be relatively high, especially in combination with bad printing (e.g. characters not aligned on the baseline of a line).
  11. IMPACT: Border removal
  12. IMPACT: Geometric correction I
  13. IMPACT: Geometric correction II
  14. Gothic typeface
      - General description: Historic fonts, obsolete characters such as the long s.
      - Effects on OCRing: High, since such fonts and characters are often not recognised correctly.
  15. IMPACT: Improved recognition
  16. Complex layout
      - General description: Due to difficult layouts, pages can be segmented incorrectly.
      - Effects on OCRing: High, since the text is not put in the right order.
  17. IMPACT: Segmentation
      - Blocks/Regions
      - Words
      - Glyphs
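
The segmentation slide lists the levels (blocks/regions, words, glyphs) but not how they are obtained. The following is a minimal sketch of one classic approach, projection profiles, and is not a description of the IMPACT tools. It assumes an already binarised page (1 = ink, 0 = background) with straight, horizontal text lines; `split_runs`, `segment_page` and the `word_gap` parameter are hypothetical names.

```python
def split_runs(profile, min_gap=1):
    """Return (start, end) pairs of runs containing ink; end is exclusive.

    A gap of fewer than min_gap empty positions does not split a run.
    """
    runs = []
    start = None
    gap = 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def segment_page(binary, word_gap=2):
    """Split a binary page into text lines, then words and glyphs per line."""
    width = len(binary[0])
    row_profile = [sum(row) for row in binary]
    lines = []
    for top, bottom in split_runs(row_profile):
        # Count ink per column, restricted to the rows of this text line.
        col_profile = [sum(binary[r][c] for r in range(top, bottom))
                       for c in range(width)]
        lines.append({
            "rows": (top, bottom),
            "words": split_runs(col_profile, min_gap=word_gap),
            "glyphs": split_runs(col_profile, min_gap=1),
        })
    return lines


# Made-up 4x6 page with two text lines.
page = [[1, 0, 1, 0, 0, 1],
        [1, 0, 1, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0, 0]]
for line in segment_page(page):
    print(line)
```

Projection profiles break down on warped pages, touching characters and multi-column layouts, which are the problem cases described in the surrounding slides.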
  18. IMPACT: Functional extension parser
      - Recognition of the structure of book pages
        - Print space
        - Standard font of the main text
        - Page numbers
      - Enrichment of OCR results with structural information
  19. Bad printing: blurred, broken, faded characters
      - General description: Depending on the printing technology used, letters may be blurred, broken or dotted.
      - Effects on OCRing: High, since characters are broken or joined together.
  20. IMPACT: Cooperative correction
      - Integrated web-based system for cooperative correction of OCR results
      - Character/Word/Page mode
      - Collaboratively correct OCR errors and use results for improving OCR
  21. IMPACT: Word spotting
      - Alternative technique for indexing historical documents
      - After word segmentation, relevant words are detected and highlighted
      - Keywords can be, e.g., person and location names
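
The word spotting slide does not say how segmented words are matched against a keyword. One approach often used for this kind of task is to compare word images by dynamic time warping (DTW) over per-column features. The sketch below assumes that approach, uses a single feature (ink density per column), and takes binary word images as input; `column_profile`, `dtw_distance`, `spot` and `max_distance` are hypothetical names, and none of this is taken from the IMPACT software.

```python
def column_profile(word_image):
    """Fraction of ink pixels in each column of a binary word image."""
    height = len(word_image)
    width = len(word_image[0])
    return [sum(row[c] for row in word_image) / height for c in range(width)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Cheapest way to reach this cell: from above, from the left,
            # or diagonally.
            cur.append(cost + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def spot(template, word_images, max_distance=0.1):
    """Return indices of word images whose profile is close to the template's."""
    reference = column_profile(template)
    hits = []
    for index, image in enumerate(word_images):
        profile = column_profile(image)
        distance = dtw_distance(reference, profile)
        # Normalise by the combined length so long words are not penalised.
        if distance / (len(reference) + len(profile)) <= max_distance:
            hits.append(index)
    return hits


# Made-up 2-row word images: a keyword, a stretched copy, an unrelated shape.
keyword = [[1, 0, 1], [1, 1, 1]]
stretched = [[1, 1, 0, 1], [1, 1, 1, 1]]
unrelated = [[0, 1, 0], [0, 0, 0]]
print(spot(keyword, [stretched, unrelated]))
```

DTW tolerates horizontal stretching, so the wider copy of the keyword still matches, while the unrelated shape is rejected at the chosen threshold.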
  22. Historical language
      Historical variants of the Dutch word ‘wereld’ (world): werelt, weerelt, wereld, weerelds, wereldt, werelden, weereld, werrelts, waerelds, weerlyt, wereldts, vveerelts, waereld, weerelden, waerelden, weerlt, werlt, werelds, sweerels, zwerlys, swarels, swerelts, werelts, swerrels, weirelts, tsweerelds, werret, vverelt, werlts, werrelt, worreld, werlden, wareld, weirelt, weireld, waerelt, werreld, werld, vvereld, weerelts, werlde, tswerels, werreldts, weereldt, wereldje, waereldje, weurlt, wald, weëled
  23. IMPACT: Historical dictionaries
      - OCR:
        - Lexica for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech
        - Generic tools for building historical lexica
        - FineReader with built-in standard Dutch dictionary: werreid
        - FineReader with IMPACT dictionary of historical Dutch: werreld
      - RETRIEVAL:
        - Key in ‘wereld’ and find ‘werreld’
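
The retrieval case on the slide above (key in ‘wereld’ and find ‘werreld’) can be sketched as query expansion over a historical lexicon. This is only an illustration under stated assumptions: the lexicon is a plain dictionary holding a handful of the variants listed in the deck, the sample page texts are made up, and `build_variant_index`, `expand_query` and `search` are hypothetical names rather than IMPACT APIs.

```python
import re

# Modern lemma -> attested historical variants (a small subset of the slide's list).
HISTORICAL_LEXICON = {
    "wereld": ["werelt", "weerelt", "werreld", "waereld", "weereld"],
}


def build_variant_index(lexicon):
    """Map every known form (modern or historical) to its modern lemma."""
    index = {}
    for lemma, variants in lexicon.items():
        index[lemma] = lemma
        for variant in variants:
            index[variant] = lemma
    return index


def expand_query(term, lexicon):
    """Return the set of all forms that should match the query term."""
    term = term.lower()
    lemma = build_variant_index(lexicon).get(term, term)
    return {lemma, *lexicon.get(lemma, [])}


def search(term, documents, lexicon):
    """Return the ids of documents containing any form of the query term."""
    forms = expand_query(term, lexicon)
    hits = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & forms:
            hits.append(doc_id)
    return sorted(hits)


# Made-up page texts, for illustration only.
pages = {"p1": "de gantsche werreld", "p2": "een boek", "p3": "De Wereld"}
print(search("wereld", pages, HISTORICAL_LEXICON))
```

Expanding the query leaves the OCR text untouched; the alternative is to normalise the indexed text to modern lemmas, which gives the same matches but discards the original spelling in the index.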
  24. IMPACT: Linguistic post-correction
      - The colors indicate different types of analysis result, such as a word found in the historical or hypothetical dictionary, or a suspected OCR error.
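
The kinds of analysis result mentioned on the slide above can be illustrated with a small lexicon lookup. The sketch below is not the IMPACT post-correction tool: it assumes two plain word sets (modern and historical), leaves out the hypothetical dictionary, and treats any other token as a suspected OCR error, suggesting the closest known word by Levenshtein distance. The names `edit_distance`, `classify` and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                      # deletion
                           cur[j - 1] + 1,                   # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def classify(token, modern, historical, max_distance=1):
    """Return (label, suggestion) for a single OCR token."""
    if token in modern:
        return ("modern", token)
    if token in historical:
        return ("historical", token)
    # Unknown token: look for the nearest known word as a correction candidate.
    candidates = sorted((edit_distance(token, word), word)
                        for word in modern | historical)
    if candidates and candidates[0][0] <= max_distance:
        return ("suspected OCR error", candidates[0][1])
    return ("suspected OCR error", None)


# Word sets assumed for illustration; 'werreid' is the OCR output from slide 23.
modern = {"wereld", "boek"}
historical = {"werreld", "werelt"}
for token in ["wereld", "werreld", "werreid"]:
    print(token, classify(token, modern, historical))
```

With these sets, the misrecognised ‘werreid’ is one substitution away from the historical form ‘werreld’, which is why a historical lexicon matters: against the modern set alone, the nearest word would be two edits away.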
  25. IMPACT: Interoperability framework
      Interaction, Modularisation, Evaluation
  26. Thank you!
      - http://www.impact-project.eu/
      - [email_address]
      - @impactocr
      - http://impactocr.wordpress.com/
