5. 5
• PRImA (Pattern Recognition & Image Analysis Research) Lab at
the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly
Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas
A&M University
• The Academy for Advanced Telecommunications and Learning
Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
OurPartners
6. The Problems
Early Modern Printing
Individual, hand-made
typefaces
Worn and broken type
Poor quality
equipment/paper
Inconsistent line bases
Unusual page layouts,
decorative page elements,
Special characters &
ligatures
Spelling variations
Mixed typefaces and
languages
over/under-inking
Old, low-quality, small tiff
files
Noise, skew, warp,
bleedthrough,
6
7. Page Images
DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP
7
8. 8
Workflows-Controller
• Powered by the eMOP
DB
• Collection processing is
managed via the online
Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
10. Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of
constant processing
• We will have to reprocess some
files that fail due to timeouts or
that require pre-processing
10
Editor's Notes
The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
These are not big numbers by STEM standards, but for the Humanities they are:
This is the largest dataset for any open-source, academic OCR project in North American, ever
For the Humanities research data is about quality of content rather than size
Our OCR must be as good as possible in order to generate reliable research
Some were great
most were not
Noisy
Skewed
Warped
Or they posed challenges for OCR engines
Multiple pages per image
Multiple columns
Images & decorative elements
Marginalia
Missing margins
many were terrible
Dashboard provides access to the DB via an API
Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS