Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Presentation of the paper "OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress" by Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek and Florian Fink at DATeCH 2014. #digidays

  1. OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress
     Uwe Springmann (1), Dietmar Najock (2), Hermann Morgenroth (2), Helmut Schmid (1), Annette Gotscharek (1) and Florian Fink (1)
     (1) CIS, Ludwig-Maximilians-Universität München
     (2) Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin
  2. p. 2 (16) | OCR of historical printings of Latin texts | Springmann et al.
     Overview
     ● Why Latin?
     ● Problems
     ● Prospects
     ● Progress
  3. Why Latin?
     ● huge heritage: largest body of historical literary sources
     ● Latin publications dominate print production until about 1750
     ● many titles have never been reprinted
     ● either key or barrier to cultural heritage of the western world
     ● has been left out of the IMPACT project despite its importance
  4. Problems: Some problems for OCR engines
     ● historical fonts
     ● long s (ſ)
     ● historical ligatures: Æ, æ, Œ, œ, st
     ● polytonic Greek words
     ● diacritics
     ● abbreviations
     ● historical spellings
  5. Problems: Some problems for OCR engines (continued)
     ● historical typography and spelling are also a problem for early modern languages
     ● ambiguities of abbreviations (especially in incunabula) will not immediately lead to fully expanded, machine readable text
     ● but discretionary diacritics are helpful in POS/morphology disambiguation:
       – adverb/vocative: altè/alte
       – adverb/pronoun: quàm/quam
       – conjunction/preposition: cùm/cum
       – ablative/nominative: hastâ/hasta
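The four diacritic pairs on this slide can be read as a small lookup table. The following is only a sketch of that idea, not a tool from the talk: the dictionary name `READINGS` and the function `reading_hint` are hypothetical, and treating the unmarked form as the second reading of each pair is an assumption, since the slide calls the diacritics discretionary and a printer may have omitted them.

```python
import unicodedata

# Pairs taken from the slide: marked form -> first reading,
# unmarked form -> second reading. Assumption: the unmarked form is
# only a weak hint, because the diacritics are discretionary.
READINGS = {
    "altè": "adverb",       "alte": "vocative",
    "quàm": "adverb",       "quam": "pronoun",
    "cùm": "conjunction",   "cum": "preposition",
    "hastâ": "ablative",    "hasta": "nominative",
}

# Normalize keys so that precomposed and decomposed input both match.
READINGS = {unicodedata.normalize("NFC", k): v for k, v in READINGS.items()}


def reading_hint(token: str):
    """Return the reading suggested by the slide's pairs, or None if unknown."""
    key = unicodedata.normalize("NFC", token.lower())
    return READINGS.get(key)


if __name__ == "__main__":
    for word in ["cùm", "cum", "hastâ", "rosa"]:
        print(word, "->", reading_hint(word))
```

The practical consequence for OCR is that an engine or post-correction step which silently drops the grave or circumflex accent discards exactly the information this table relies on.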
  6. Prospects: State of the art – example pages
     [Images: example pages from printings dated 1544, 1649 and 1779]
  7. Prospects: State of the art – results for example pages
     Character accuracy in %; out-of-the-box performance, no language model (or default = English)

     Year   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7
     1544   83.14           70.32            74.59
     1649   88.07           84.87            78.98
     1779   82.13           80.77            75.46

     OCRopus hampered by bad image-text segmentation
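The slides report character accuracy without defining it. A common definition, assumed here and not confirmed by the source, is one minus the character-level edit distance between OCR output and ground truth, divided by the ground-truth length. The function names below are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Accuracy in percent, assuming accuracy = 1 - edit_distance / len(ground_truth)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Two typical confusions: circumflex lost, long s read as f.
    truth = "hast\u00e2 e\u017ft"
    ocr = "hasta eft"
    print(round(character_accuracy(truth, ocr), 2))
```

Under this definition the value can fall below zero when the OCR output contains many spurious characters, so some tools clip it or normalize by the longer string instead.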
  8. Prospects: Overcoming the obstacles
     ● Training (Tesseract, OCRopus)
       – (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
       – (b) transcribe some real pages and train on true historical fonts
     ● Lexical resources (Tesseract) in recognition
     ● Post-processing
       – correct OCR errors, not historical spelling (which might be of interest in itself)
       – add annotation: expand abbreviations, ligatures, normalize spelling
       – helpful: language model, lexicon of historical word forms
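The post-processing bullets separate two things: the transcription keeps the historical spelling, and normalization is added as annotation next to it. A minimal sketch of that separation follows. The mapping `EXPANSIONS` and the function `annotate` are hypothetical, the choice of expansions (for example Æ to AE) is an assumption, and abbreviation expansion is left out because it needs a lexicon and context.

```python
# Hypothetical expansion table for long s and a few ligatures.
# Accents such as the grave and circumflex are deliberately kept,
# since the slides describe them as useful for disambiguation.
EXPANSIONS = {
    "\u017f": "s",    # long s
    "\u00c6": "AE",   # Æ
    "\u00e6": "ae",   # æ
    "\u0152": "OE",   # Œ
    "\u0153": "oe",   # œ
    "\ufb06": "st",   # st ligature
}


def annotate(token: str) -> dict:
    """Keep the token as printed and add a normalized form alongside it."""
    normalized = "".join(EXPANSIONS.get(ch, ch) for ch in token)
    return {"original": token, "normalized": normalized}


if __name__ == "__main__":
    for tok in ["\u017fcienti\u00e6", "qu\u00e0m", "e\ufb06"]:
        print(annotate(tok))
```

Storing both forms means a search over the normalized layer still leads back to the text exactly as printed.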
  9. Progress: Postcorrection with the open-source tool PoCoTo
     (see the paper by Vobl et al., presented by Christoph Ringlstetter)
  10. Progress: Training on historical fonts (artificial images)
      Example: Pontanus, Progymnasmata Latinitatis (1589)
  11. Progress: Training on fonts, ideal lexicon
      Example: Pontanus, Progymnasmata Latinitatis (1589)
      Character accuracy in %

      Page   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7   Tesseract (font)   Tesseract (font + lex.)   OCRopus (font)
      15     87.79           80.88            80.70         91.02              93.90                     92.55
      16     82.94           77.41            76.94         80.12              85.65                     80.47
      17     85.25           75.98            86.07         85.41              91.56                     93.93
      18     85.93           79.51            85.53         88.29              92.68                     89.67
      19     87.94           80.09            79.09         86.06              90.15                     87.83

      OCRopus: no language model!
      red: accuracy better than Abbyy
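The red highlighting mentioned on the slide does not survive in a text transcript, but it can be recomputed from the numbers. The sketch below does that, assuming the six values per row belong to the six column headers in the order given; the names `COLUMNS`, `ACCURACY` and `better_than_abbyy` are illustrative.

```python
COLUMNS = [
    "Abbyy FR 11.1", "Tesseract 3.03", "OCRopus 0.7",
    "Tesseract (font)", "Tesseract (font + lex.)", "OCRopus (font)",
]

# Character accuracy in % per page, values copied from the slide.
ACCURACY = {
    15: [87.79, 80.88, 80.70, 91.02, 93.90, 92.55],
    16: [82.94, 77.41, 76.94, 80.12, 85.65, 80.47],
    17: [85.25, 75.98, 86.07, 85.41, 91.56, 93.93],
    18: [85.93, 79.51, 85.53, 88.29, 92.68, 89.67],
    19: [87.94, 80.09, 79.09, 86.06, 90.15, 87.83],
}


def better_than_abbyy(table=ACCURACY):
    """For each page, list the configurations whose accuracy exceeds Abbyy's."""
    result = {}
    for page, row in table.items():
        baseline = row[0]  # first column is Abbyy FR 11.1
        result[page] = [name for name, value in zip(COLUMNS[1:], row[1:])
                        if value > baseline]
    return result


if __name__ == "__main__":
    for page, winners in better_than_abbyy().items():
        print(page, winners)
```

On these numbers, Tesseract with font training plus lexicon is the only configuration that exceeds Abbyy on all five pages.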
  12. Progress: Training on historical fonts (real images)
      Example: Thanner, Petronij Arbitri Sathyra (1500)
      Character accuracy in %

      Page   Tesseract 3.03   OCRopus 0.7   OCRopus (trained)
      13     41.59            44.59         93.15
      14     52.38            57.77         94.61
      15     53.09            62.38         95.17
      16     59.09            61.45         93.27

      Pages 1-12: training set; pages 13-16: test set
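To summarize the test pages in one figure per engine, the per-page values can be averaged. This is a sketch only: the slides do not report averages, and an unweighted mean over pages is an assumption that ignores differences in page length. The names `ENGINES`, `TEST_PAGES` and `mean_accuracy` are illustrative.

```python
from statistics import mean

ENGINES = ("Tesseract 3.03", "OCRopus 0.7", "OCRopus (trained)")

# Character accuracy in % on the test pages 13-16, copied from the slide.
TEST_PAGES = {
    13: (41.59, 44.59, 93.15),
    14: (52.38, 57.77, 94.61),
    15: (53.09, 62.38, 95.17),
    16: (59.09, 61.45, 93.27),
}


def mean_accuracy(pages=TEST_PAGES):
    """Unweighted mean accuracy per engine over the given pages."""
    columns = zip(*pages.values())  # transpose rows into per-engine columns
    return {engine: mean(col) for engine, col in zip(ENGINES, columns)}


if __name__ == "__main__":
    for engine, value in mean_accuracy().items():
        print(f"{engine}: {value:.2f} % accuracy, {100 - value:.2f} % errors")
```

Read as error rates, the same figures show how much manual correction remains per page, which is the quantity that matters for the postcorrection step.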
  13. Progress: Summary
      ● very old printings are hard to OCR out of the box
      ● Tesseract and OCRopus can be trained to accuracies above those of ABBYY
      ● applying lexica as well as font training helps a lot
      ● OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
      ● postcorrection will do the rest
  14. Thank you for your interest!
