mchristy-DH2014-emop-bookhistory-tools

A presentation at DH2014-Lausanne of the Tesseract training methods and tools developed by eMOP (at the IDHMC at Texas A&M), and their uses for other book history and typeface history research.

  1. eMOP Book History Tools. Book History and Software Tools: Examining Typefaces for OCR Training in eMOP. Matt Christy, Todd Samuelson, Katayoun Torabi, Bryan Tarpley, Elizabeth Grumbach
  2. eMOP Info
     eMOP Website
     • emop.tamu.edu/
     • DH2014 Presentation: emop.tamu.edu/book-history-tools
     • eMOP Workflows: emop.tamu.edu/workflows
     • Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
     More eMOP
     • Facebook: Early Modern OCR Project
     • Twitter: #emop, @IDHMC_Nexus, @matt_christy, @EMGrumbach
     DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
  3. Early Modern OCR Project
     • The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation-funded grant project run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University. It develops and tests tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand-press period, roughly 1475-1800.
     • eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.
     • Specifically, eMOP’s goal is to make machine readable, or improve the machine readability of, 305,000 documents (45 million pages) of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO).
     • Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
  4. Training Tesseract
  5. Aletheia
     • www.primaresearch.org/tools.php (available for free, but requires registration)
     • Created by PRImA Research Labs, University of Salford, UK.
     • Windows-based tool.
     • Developed as a groundtruth creation tool.
     • Used by eMOP undergraduate student workers to create Tesseract training for a desired typeface.
     • Can identify glyphs on a page image, with page coordinates and Unicode values.
  6. Aletheia: Workflow
     • Binarization and Denoise are native Aletheia functions.
     • A team of undergraduate student workers refines and corrects glyph boxes and Unicode values where needed.
     • Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
  7. Aletheia: Glyph Recognition
     • Uses Tesseract to find glyphs.
  8. Aletheia: I/O
     • We then convert each PAGE XML file to a Tesseract box file using XSLT.
  9. Tesseract Training
  10. Franken+
      1. Windows-based tool that uses a MySQL DB.
      2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
      3. Designed to be easily used by eMOP undergraduate student workers.
      4. Takes Aletheia's output files as input.
      5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
      • Available open-source at: github.com/idhmc-tamu/FrankenPlus
  11. Franken+ Workflow
      1. Groups all glyphs with the same Unicode value into one window for comparison.
      2. Uses all selected glyphs to create a Franken-page image (TIFF), using a selected text as a base.
      3. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
  12. Franken+ Ingestion
  13. Franken+
      • All exemplars of the same glyph are displayed together.
      • Users can quickly identify and deselect:
        • Incorrectly labeled glyphs
        • Incomplete glyphs
        • Unrepresentative exemplars
        • Different-sized glyphs
  14. Franken+
  15. Training Tesseract
      Thiſ great conſumption to a fever turn'd, And ſo the oꝗld had fitſ; it joy'd, it mourn'd; And, aſ men thinke, that Agueſ phy ck are, And th'Ague being ſpent, give over care. Žo thou cke World, mꝗſtak'ſt thy ſelże to bee Well, when ãlaſ, thou'rt in a Lethargie. Her death did wound and tame thee than, and than Thou might'ſt ha e better ſpar'd the Sunne, or man. That wound waſ deep, but 'tiſ more miżery, That thou haſt loſt thy ſenſe and memor . 'Twaſ heavy then to heare thy voyce of mone, But thiſ iſ worſe, that thou art ſpeechle e growne. Thou haſt forgot thy name thou hadſt; thou waſt Nothing but ee, and her thou haſt o'rpaſt. For aſ a child kept from the Fount, untill Ä prince, expe ed long, come to fulfill The ceremonieſ, thou unnam'd had'ſt laid, Had not her comming, thee her palace made: Her name defin'd thee, gave thee forme, and frame, And thou forgett'ſt to celebrate th n me. Some monethſ e hath beene dead (but beìng dead, Meaſureſ of timeſ are all determined) But long e'ath beene away, long, long, et none Offerſ to tell uſ who it iſ that'ſ gone. But aſ in ſtateſ doubtfull of future heireſ, When ckne e without remedie empaireſ The preſent Prince, they're loth it ould be ſaid, The Prince doth langui , or the Prince iſ dead: So mankinde feeling no a generall tha ,
  16. Franken+ Results: AFTER / BEFORE
  17. eMOP Tesseract Training
  18. S-face / Y-face
      Weiss, Adrian. “Font Analysis as a Bibliographical Method: the Elizabethan Play-Quarto Printers and Compositors.” Studies in Bibliography 43 (1990): 95-164.
      • Weiss organized late 16th- and early 17th-century typefaces into these two general types (named for the first works in which they were identified):
        • Y-face, from an edition of The Malcontents
        • S-face, from Ben Jonson's Sejanus
  19. S-face / Y-face
  20. Other Applications
      • A close examination of the typefaces used by a printer
      • An investigation of the typefaces used in a work or in the same editions of a work
      • A reexamination of typefaces classified via a system (Proctor-Haebler)
  21. The end. For eMOP questions, please contact us at: mchristy@tamu.edu, egrumbac@tamu.edu
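
The Franken+ workflow on slides 10 and 11 (group glyphs by Unicode value, lay selected exemplars out over a base text, emit box coordinates) could be sketched roughly as follows. This is not Franken+'s own code, which is a Windows tool backed by MySQL. The `Glyph` record, the function names, and the layout parameters are all hypothetical, and the sketch only computes positions and box lines rather than writing a TIFF. It also assumes every glyph sits on a shared baseline, which ignores descenders.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Glyph:
    char: str    # Unicode value assigned in Aletheia
    width: int   # glyph image width in pixels
    height: int  # glyph image height in pixels
    source: str  # hypothetical identifier of the cropped glyph image


def group_by_char(glyphs):
    """Step 1: collect all exemplars that share a Unicode value."""
    groups = defaultdict(list)
    for g in glyphs:
        groups[g.char].append(g)
    return groups


def layout_line(text, groups, page_height, baseline=100, gap=2, space=12):
    """Steps 2-3: place exemplars along one line of base text, emit box lines."""
    pickers = {ch: cycle(exemplars) for ch, exemplars in groups.items()}
    x, placed, boxes = 0, [], []
    for ch in text:
        if ch == " " or ch not in pickers:
            x += space  # no exemplar available: leave a blank, do not guess
            continue
        g = next(pickers[ch])      # rotate so every selected exemplar is used
        top = baseline - g.height  # top-left origin, glyph rests on the baseline
        placed.append((g.source, x, top))
        # Tesseract box files measure from the bottom-left corner of the image:
        # <char> <left> <bottom> <right> <top> <page>
        boxes.append(
            f"{ch} {x} {page_height - baseline} {x + g.width} {page_height - top} 0"
        )
        x += g.width + gap
    return placed, boxes


glyphs = [
    Glyph("a", 10, 12, "a1"),
    Glyph("a", 11, 12, "a2"),
    Glyph("b", 9, 20, "b1"),
]
placed, boxes = layout_line("ab a", group_by_char(glyphs), page_height=200)
print("\n".join(boxes))
```

Rotating through the exemplars with `cycle` is one simple way to make sure each selected glyph image appears on the synthetic page; whether Franken+ picks exemplars this way is not stated in the slides.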

Editor's Notes

  • Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  • This is cheating: the result of scanning the same page we used to create the training.
  • So we think that Franken+ can be a really useful tool for the close examination of typefaces and book history, and that it opens up some of the admittedly tedious work to non-experts.
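
The Aletheia note above describes an XML file of glyphs with coordinates and Unicode values, and slide 8 says eMOP converts it to a Tesseract box file with XSLT. The Python below is a swapped-in illustration of that conversion, not the project's stylesheet. It assumes each `Glyph` carries a `Coords` element with a `points="x,y x,y ..."` attribute and that `Page` has an `imageHeight` attribute; other PAGE schema versions may encode coordinates differently. The namespace in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace so the sketch is not tied to one schema version."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_box(xml_text, page_index=0):
    """Return Tesseract box-file lines for every labelled glyph in one document."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local(el.tag) == "Page")
    height = int(page.get("imageHeight"))  # needed to flip the y axis

    lines = []
    for glyph in (el for el in root.iter() if local(el.tag) == "Glyph"):
        coords = next(el for el in glyph if local(el.tag) == "Coords")
        # Assumed encoding: points="x1,y1 x2,y2 ..." with a top-left origin.
        points = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        text = next(
            (el.text for el in glyph.iter() if local(el.tag) == "Unicode"), None
        )
        if not text:
            continue  # skip glyphs that were never labelled
        # Box files use a bottom-left origin:
        # <char> <left> <bottom> <right> <top> <page>
        lines.append(
            f"{text} {min(xs)} {height - max(ys)} {max(xs)} {height - min(ys)} {page_index}"
        )
    return lines


# Hand-written sample, not real Aletheia output.
SAMPLE = """<PcGts xmlns="http://example.org/page">
  <Page imageFilename="p1.tif" imageWidth="100" imageHeight="200">
    <TextRegion><TextLine><Word>
      <Glyph id="g1">
        <Coords points="10,20 30,20 30,50 10,50"/>
        <TextEquiv><Unicode>ſ</Unicode></TextEquiv>
      </Glyph>
    </Word></TextLine></TextRegion>
  </Page>
</PcGts>"""

print(page_xml_to_box(SAMPLE))
```

The key step is the vertical flip: page coordinates here count down from the top of the image, while box files count up from the bottom, so each y value is subtracted from the image height.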