Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Slides for the eMOP presentation at the Digital Humanities 2014 conference in Lausanne, Switzerland.

Published in: Education

  1. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards. Elizabeth Grumbach, Co-Project Manager, IDHMC; Laura Mandell, PI, IDHMC; Apostolos Antonacopoulos, PRImA Lab; Clemens Neudecker, Koninklijke Bibliotheek; Matthew Christy, Co-Project Manager, IDHMC; Loretta Auvil, SEASR Analytics; Todd Samuelson, Cushing Memorial Library. emop.tamu.edu
  2. Initial Goals; Challenges or Failures; Analysis; New Directions; Adaptability. Navigating the Storm | @EMGrumbach | emop.tamu.edu
  3. Straight from the grant proposal… “Our overarching goals”: 1) Train three open-access OCR engines to “read” early modern fonts. 2) Map specific font training onto specific sets of documents. 3) Create error-evaluation mechanisms for failed documents. 4) Use crowd-sourced correction tools specific to OCR errors. 5) Identify pages that are too flawed to be “readable”. 6) Share our workflow procedure and results, so that the community can use them in digitizing and transcribing early modern documents.
  4. Main Collaborators: CIIR (UMass Amherst); IDHMC + Cushing Memorial Library (Texas A&M); Koninklijke Bibliotheek (Netherlands); Performant Software Solutions (Charlottesville, Virginia); PRImA Labs (University of Salford, Manchester); PSI Labs (Texas A&M); SEASR (U of Illinois, Urbana-Champaign)
  5. Data Contributors + Collaborators: Early English Books Online (EEBO); Eighteenth Century Collections Online (ECCO); Text Creation Partnership (TCP); Brazos Computing Cluster (Texas A&M)
  6. Laura Mandell, Principal Investigator, eMOP; Director, IDHMC. @mandellc, idhmc@tamu.edu
  7. Early Modern Printing • Individual, hand-made typefaces • Worn and broken type • Poor-quality equipment/paper • Inconsistent line bases • Unusual page layouts and decorative page elements • Special characters & ligatures • Spelling variations • Mixed typefaces and languages. Slides by Matthew Christy
  8. • Irregular layouts • Print bleedthrough
  9. Document/Image Quality • Torn and damaged pages • Noise introduced to images of pages • Skewed pages • Warped pages • Missing pages • Inverted pages • Incorrect metadata • Extremely low-quality TIFFs (~50K)
  10. Slides by Matthew Christy
  11. Dream: Training Tesseract on different fonts and applying that training to the documents printed in those particular fonts will improve OCR quality. Reality: There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.
  12. Training Tesseract: Aletheia. Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images and to ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  13. Training Tesseract: Franken+. 1. Takes Aletheia's output files as input. 2. Groups all glyphs with the same Unicode value into one window for comparison. 3. Mistakenly coded glyphs are easily identified and re-coded. 4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired. 5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. 6. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces. 7. Also allows users to complete Tesseract training using the newly created box/TIFF file pairs, and to add optional dictionary and other files. 8. Outputs a .traineddata file used by Tesseract when OCRing page images.
  14. Clemens Neudecker, Koninklijke Bibliotheek. @cneudecker
  15. The case of IMPACT • IMPACT = IMProving ACcess to Text • EU FP7, 2008–2012 • €16.7 M budget • 22 partners (libraries, universities, companies) • Goal: Significantly improve OCR for historical documents
  16. Issue 1 • Expectation: The "IMPACT OCR" • Reality: A collection of very diverse tools and algorithms: some prototypes, some commercial tools, different programming languages, different levels of maturity • No integrated product possible!
  17. Issue 1 • Solution: Interoperability rather than integration • Change: Individual applications as pluggable modules in a web-based framework • Result: Flexible framework with additional benefits for testing, transparency, provenance
  18. Issue 2 • Diversity: Librarians, Computer Scientists, Computational Linguists, Humanists • Are we really speaking the same language? • Different focus points in the project: applicable solutions vs. academic publications
  19. Issue 2 • Solution: Create bonding activities, foster an atmosphere for knowledge exchange • Change: Buddy programme, social games, quizzes about partners • Result: Understanding your partners' backgrounds and ways of thinking enriches the experience for everyone
  20. Large Digitisation Projects: Two Key Perspectives. Apostolos Antonacopoulos, PRImA Research Lab
  21. Background. Since 2002 the PRImA Lab has been involved in large digitisation projects, creating software tools for all stages of the workflow • From image enhancement to layout analysis to OCR • Use-scenario-based evaluation of extracted text quality • Crowd/scholar-sourcing. Two general points are routinely underestimated: • (Really) understanding stakeholders and their roles • (Really) understanding the problems, their extent, and the effectiveness/requirements of potential solutions
  22. Stakeholders and their roles. This seems obvious and is often mentioned, but the significance of understanding this point and its effects is vastly underestimated. Content holders • Keen for their content to be widely available and used • Do not know their content well, nor its potential uses. Computer scientists • Have the technical expertise to solve many of the problems • Do not know the material and its use well enough to prioritise problems. DH researchers, the catalysts • Very knowledgeable about the material and its potential use • Have technical skills complementary to those of computer scientists
  23. Problem understanding. At the start of each project everyone is eager to deliver “big” results, but it is important to identify and understand a few key problems and solve them well. “Improve OCR results” is an ill-defined and short-sighted goal • OCR results measured only in terms of word accuracy are of little use • Layout is very important: even if all the words are recognised correctly, the reading order is unlikely to be correct, limiting potentially interesting uses • Page numbers, captions, running headers etc. should not be mixed with body text • Graphical elements / illustrations are important too. Think: useful data (investment) vs. just more of any data (instant gratification)
  24. Elizabeth Grumbach, Co-Project Manager, IDHMC. @EMGrumbach, egrumbac@tamu.edu
  25. “If an electronic scholarly project can’t fail and doesn’t produce new ignorance, then it isn’t worth a damn.” - John Unsworth, “Documenting the Reinvention of Text: The Importance of Failure”
  26. Initial Goals; Challenges or Failures; Analysis; New Directions; Adaptability
  27. Challenges or Failures; Analysis; New Directions; Adaptability. Challenges + Failures should be constantly or consistently communicated. Analysis + New Directions should lead to research and communication with similar projects. Adaptability should allow for new possibilities, new questions.
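
Slides 12 and 13 describe a glyph-level workflow: an XML file of glyphs with coordinates and Unicode values is grouped by Unicode value and turned into Tesseract box files. The following is only a minimal sketch of that idea. The XML layout, element names, and function names are hypothetical and are not Aletheia's or Franken+'s actual formats or code; the only borrowed convention is the Tesseract box-file line, `symbol left bottom right top page`, with the origin at the bottom-left of the image.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified glyph export (NOT Aletheia's real schema).
# Coordinates are assumed to use a top-left origin, in pixels.
SAMPLE_XML = """
<page height="100">
  <glyph unicode="a" x="10" y="20" width="8" height="12"/>
  <glyph unicode="b" x="20" y="18" width="8" height="14"/>
  <glyph unicode="a" x="30" y="20" width="9" height="12"/>
</page>
"""

def parse_glyphs(xml_text):
    """Return (page_height, list of glyph dicts) from the hypothetical XML."""
    root = ET.fromstring(xml_text)
    page_height = int(root.get("height"))
    glyphs = []
    for g in root.iter("glyph"):
        glyphs.append({
            "char": g.get("unicode"),
            "x": int(g.get("x")),
            "y": int(g.get("y")),
            "w": int(g.get("width")),
            "h": int(g.get("height")),
        })
    return page_height, glyphs

def group_by_char(glyphs):
    """Collect all exemplars of each Unicode value for side-by-side review."""
    groups = defaultdict(list)
    for g in glyphs:
        groups[g["char"]].append(g)
    return dict(groups)

def to_box_line(glyph, page_height, page=0):
    """Format one glyph as a box-file line, flipping y to a bottom-left origin."""
    left = glyph["x"]
    right = glyph["x"] + glyph["w"]
    top = page_height - glyph["y"]
    bottom = page_height - (glyph["y"] + glyph["h"])
    return f'{glyph["char"]} {left} {bottom} {right} {top} {page}'

page_height, glyphs = parse_glyphs(SAMPLE_XML)
groups = group_by_char(glyphs)
box_lines = [to_box_line(g, page_height) for g in glyphs]

for line in box_lines:
    print(line)
```

Grouping first and formatting second mirrors the order on slide 13: miscoded glyphs can be caught while all exemplars of one character are together, before any training files are written.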
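
Slide 23 argues that word accuracy alone says little, because every word can be recognised correctly while the reading order is still wrong. A small illustration, assuming a deliberately naive order-insensitive accuracy measure (an assumption for this sketch, not the evaluation method used by eMOP or PRImA) and made-up sample text:

```python
from collections import Counter

def bag_of_words_accuracy(reference, ocr_output):
    """Fraction of reference words also present in the OCR output, ignoring order."""
    ref = Counter(reference.split())
    out = Counter(ocr_output.split())
    matched = sum((ref & out).values())  # multiset intersection
    return matched / sum(ref.values())

def same_reading_order(reference, ocr_output):
    """True only if the words appear in exactly the same sequence."""
    return reference.split() == ocr_output.split()

# Made-up example: the reference is the intended reading order of a page.
reference = "the quick brown fox jumps over the lazy dog"
# Hypothetical OCR output with every word correct but the regions read
# in the wrong sequence (for example, lines from two columns interleaved).
ocr_output = "the quick over the brown fox lazy dog jumps"

accuracy = bag_of_words_accuracy(reference, ocr_output)
ordered = same_reading_order(reference, ocr_output)
print(f"word accuracy: {accuracy:.2f}, reading order preserved: {ordered}")
```

Here the order-insensitive score is perfect even though the text no longer reads correctly, which is the gap a layout-aware evaluation is meant to expose.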
