Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

DLF Forum 2015: Beyond eMOP

768 views

Published on

A report of eMOP's final outcomes and the work that we are continuing to do.

Published in: Education
  • Be the first to comment

DLF Forum 2015: Beyond eMOP

  1. 1. Beyond the Early Modern OCR Project or, What are We Going to do With Ourselves Now? Matthew Christy, Laura Mandell, Elizabeth Grumbach
  2. 2.  emop.tamu.edu/  2015 DLF Forum– Beyond eMOP  emop.tamu.edu/presentations/ #DLF15  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/Mell on/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 2 October 27, 20152015 DLF Forum - Beyond eMOP
  3. 3.  The Early Modern OCR Project (eMOP) is an  Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to  develop and test open-source tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents  from the hand press period, roughly 1475-1800.  eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. 3 2015 DLF Forum - Beyond eMOP Goals October 27, 2015
  4. 4. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800) Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed documents  44,000 EEBO  2,200 ECCO 4 October 27, 20152015 DLF Forum - Beyond eMOP Total: >300,000 documents & 45 million page images.
  5. 5. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking, bleedthrough  Old, low-quality, small tiff files  Noise, skew, warp, 5 October 27, 20152015 DLF Forum - Beyond eMOP
  6. 6. Page Images 2015 DLF Forum - Beyond eMOP 6 October 27, 2015
  7. 7. Results  ECCO  Avg. Scores  309,328 pages  86% correct  57% correct on 1st pages  Comparison to Prime Recognition’s OCR  93% claim  89% correct by our metrics  EEBO  Avg. Scores  1,475,026 pages  68% correct  No previous OCR to compare to October 27, 20152015 DLF Forum - Beyond eMOP 7 Using Google’s Tesseract open-source OCR engine with eMOP created training
  8. 8. Results (cont)  Breakdown against Groundtruth  Out of ~1.8 million pages of GT  82% EEBO  18% ECCO  We have GT for  25.4% of EEBO  0.98% of ECCO October 27, 20152015 DLF Forum - Beyond eMOP 8 % Correct # Pages % Pages 90-100%: 354,358 19.5% 80-89%: 524,825 28.9% 70-79%: 316,511 17.4% 60-69%: 193,708 10.7% 50-59%: 125,060 6.9% 40-49%: 85,875 4.7% 30-39%: 61,992 3.4% 20-29%: 48,539 2.7% 10-19%: 43,776 2.4% 0-9%: 60,912 3.4% 48.4 %
  9. 9. eMOP Outcomes - Github October 27, 20152015 DLF Forum - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 9
  10. 10. October 27, 20152015 DLF Forum - Beyond eMOP 10 Outcomes – Github Franken+ • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 1. Windows based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC Graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP Undergraduate student workers 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images that Tesseract's first stage of native training. early-modern-ocr.github.io/FrankenPlus/
  11. 11. October 27, 20152015 DLF Forum - Beyond eMOP 11 Outcomes – Github emop-dashboard • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/emop-dashboard/
  12. 12. 12 Outcomes-Github emop-controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py October 27, 20152015 DLF Forum - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach
  13. 13. October 27, 20152015 DLF Forum - Beyond eMOP 13 Outcomes – Github hOCR deNoising • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/hOCR-De-Noising/ Before After
  14. 14. October 27, 20152015 DLF Forum - Beyond eMOP 14 Outcomes – Github page evaluator • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-evaluator/  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  15. 15. October 27, 20152015 DLF Forum - Beyond eMOP 15 Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-corrector/ 1. Preliminary cleanup  remove punctuation from begin/end of tokens & empty lines  combine hyphenated tokens at end of lines  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3-gram dataset tbat I thoughc Ihe Was
  16. 16. October 27, 20152015 DLF Forum - Beyond eMOP 16 window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach tbat I thoughc Ihe Was window: l thoughc Ihe Candidates used for context matching: ■ l -> Set(l) ■ thoughc -> Set(thoughc, thought, though) ■ Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121 , volCount: 120) ContextMatch: though ike was (matchCount: 65 , volCount: 59) ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965) ContextMatch: though the was (matchCount: 197 , volCount: 196) ContextMatch: thought ice was (matchCount: 45 , volCount: 45) ContextMatch: thought ike was (matchCount: 112 , volCount: 108) ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822) ContextMatch: thought the was (matchCount: 91 , volCount: 91) that I thought she was
  17. 17. October 27, 20152015 DLF Forum - Beyond eMOP 17 Outcomes – Github juxta-CL • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/Juxta-cl/  Juxta-CL(command line)  created for eMOP  based on JuxtaCommons tool (juxtacommons.org/) created by Performant Software  several different comparison algorithms to choose from and other options  Levenshtein / Jaro-Winkler / Juxta  ignore: punctuation, caps, hyphens
  18. 18. October 27, 20152015 DLF Forum - Beyond eMOP 18 Outcomes – Github Cobre • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/Cobre/
  19. 19. October 27, 20152015 DLF Forum - Beyond eMOP 19 Outcomes - Tesseract Training • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/TesseractTraining/  Font Training  3 different types  Roman, Italic, Blackletter  12 different printers  1564-1764  40 different typeface combinations  Even more combined typefaces training files  EM Dictionaries  from multiple sources with alternate spellings  More  Franken+ training files  common OCR error file
  20. 20. October 27, 20152015 DLF Forum - Beyond eMOP 20 Outcomes – DB of EM Printers • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Parsing the imprint lines of all EEBO & ECCO docs to create the ImprintDB  Gathering:  Printed By  Printed For  Sold by  Location(“at the Rose and Crown in St. Paul's Church-Yard”, “at the signe of the Traytors head”)  Place (London, Cambridge…)  (Dates)  Those docs with ESTC numbers will be available via a public database early-modern-ocr.github.io/ImprintDB/
  21. 21.  Phase 1 of TCP EEBO hand transcriptions are now available for fulltext searching in 18thConnect  over 25,000 documents October 27, 20152015 DLF Forum - Beyond eMOP 21 Outcomes – Fulltext Search of TCP Phase 1 • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 18thconnect.org/search?a=EEBO&o=fulltext
  22. 22. October 27, 20152015 DLF Forum - Beyond eMOP 22 Outcomes – Editable EEBO OCR – TypeWright • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Can already use TypeWright in 18thConnect to edit ECCO docs (without a subscription)  Will soon be able to edit EEBO docs in TypeWright  TCP hand transcriptions  eMOP OCR transcriptions  When doc is fully corrected, users can request text and/or XML versions of corrected transcriptions for scholarly use.  First time EEBO transcriptions will be available for over 80,000 docs. 18thconnect.org/typewright
  23. 23. October 27, 20152015 DLF Forum - Beyond eMOP 23 Outcomes – Collection Evaluation • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  We’ve collected over 6 TB of data  Our post-processing algorithms collect data:  amount of skew  amount of noise  number and location of text columns  page quality  correctability  corrections made  Saved to the eMOP DB  First time this has been done on these collections  Pre-process and re-OCR pages  several iterations will identify bad page images
  24. 24. The Future of eMOP October 27, 20152015 DLF Forum - Beyond eMOP 24  We want eMOP to be viable long-term  Continue to be developed/supporte d  open-source repos  outreach  Looking for partnersfrom Kill Bill 2, 2004, Miramax
  25. 25. Future – Work  ImprintDB  We want to turn this into an online DB with functionality to  Query & download results  Recommend corrections to data  Analyze Data  We have collected a mass of data on nearly 45 million pages:  Skew & noise measures  Column coordinates  Corrections made  Scores & Estimates for “correctability”  We can use that data to improve the OCR workflow and judge its efficacy  Re-OCR  Using skew & noise measures & column info we can pre-process pages and re-OCR for better results October 27, 20152015 DLF Forum - Beyond eMOP 25
  26. 26. Future – Funded Work  NEH Implementation Grant Reading the First Books  OCR the first printed works of the New World  Using Ocular OCR engine with multiple language recognition  Opportunity to integrate another OCR engine into our workflow, and continue to develop our workflow and DB to work with other collections. October 27, 20152015 DLF Forum - Beyond eMOP 26
  27. 27. Future – Partnerships Notre Dame Libraries Penn St. Libraries Adam Matthew Digital Visualizing English Print U of S Carolina – Center for Digital Humanities  Consultation  Workshops  Share labor / resources  Test the robustness of our workflow in parts or as a whole on different collections and systems  Test how text analysis, topic modeling, lexical mapping, etc. can improve our data and results October 27, 20152015 DLF Forum - Beyond eMOP 27
  28. 28. The end For eMOP questions please contact us at : mchristy@tamu.edu egrumbac@tamu.edu mandell@tamu.edu 28 2015 DLF Forum - Beyond eMOP October 27, 2015

×