1. Beyond the Early Modern OCR Project
or, What Are We Going to Do with Ourselves Now?
Matthew Christy, Laura Mandell, Elizabeth Grumbach
2. eMOP Info
eMOP Website: emop.tamu.edu/
2015 DLF Forum – Beyond eMOP: emop.tamu.edu/presentations/ (#DLF15)
eMOP Workflows: emop.tamu.edu/workflows
Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
Facebook: The Early Modern OCR Project
Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
3. Goals
The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation-funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test open-source tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.
eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.
4. The Numbers
Page Images
Early English Books Online (ProQuest) EEBO: ~125,000 documents, ~13 million page images (1475-1700)
Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)
Ground Truth
Text Creation Partnership (TCP): ~46,000 double-keyed, hand-transcribed documents
44,000 EEBO
2,200 ECCO
Total: >300,000 documents & 45 million page images.
5. The Problems
Early Modern Printing
Individual, hand-made typefaces
Worn and broken type
Poor-quality equipment/paper
Inconsistent line bases
Unusual page layouts, decorative page elements
Special characters & ligatures
Spelling variations
Mixed typefaces and languages
Over/under-inking, bleed-through
Old, low-quality, small TIFF files
Noise, skew, warp
10. Outcomes – Github: Franken+
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection Evaluation
• Outreach
1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces (a parsing sketch of the box-file format follows below).
early-modern-ocr.github.io/FrankenPlus/
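As a rough illustration of the training artifacts mentioned in point 5, here is a minimal Python sketch that reads a Tesseract-style box file, assuming the standard Tesseract 3 layout (one glyph per line: glyph, left, bottom, right, top, page, with pixel coordinates measured from the bottom-left of the image). The file name and helper function are hypothetical examples, not part of Franken+.

```python
# Minimal sketch: parse a Tesseract-style .box training file into
# (glyph, left, bottom, right, top, page) tuples. Assumes the standard
# Tesseract 3 box format; "emop_page.box" is a hypothetical example file.

def read_box_file(path):
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 6:
                continue  # skip malformed or empty lines
            glyph = parts[0]
            left, bottom, right, top, page = (int(p) for p in parts[1:6])
            entries.append((glyph, left, bottom, right, top, page))
    return entries

if __name__ == "__main__":
    for glyph, left, bottom, right, top, page in read_box_file("emop_page.box")[:10]:
        print(f"{glyph!r}: bbox=({left},{bottom})-({right},{top}) page={page}")
```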
12. Outcomes – Github: emop-controller
• Powered by the eMOP DB
• Collection processing is managed via the online Dashboard: emop-dashboard.tamu.edu
• Run by emop-controller.py, which uses the Dashboard API to schedule jobs and write results back to the DB (a sketch of that loop follows below)
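As a toy, single-machine rendering of that loop (the real controller schedules jobs on the Brazos cluster queues and writes output to the NAS rather than OCRing inline), here is a hedged Python sketch. The base URL, endpoint paths, and JSON fields are hypothetical, not the actual Dashboard API, and it assumes the third-party requests package.

```python
# Sketch of the controller pattern: poll a dashboard API for pending pages,
# OCR them, and post results back. Endpoints and field names are hypothetical.
import subprocess
import requests

DASHBOARD = "https://emop-dashboard.example/api"  # hypothetical base URL

def fetch_pending(limit=10):
    resp = requests.get(f"{DASHBOARD}/pages", params={"status": "pending", "limit": limit})
    resp.raise_for_status()
    return resp.json()["pages"]  # hypothetical response shape

def ocr_page(image_path, out_base):
    # "hocr" is a standard Tesseract output config; the output extension
    # (.hocr vs .html) varies by Tesseract version.
    subprocess.run(["tesseract", image_path, out_base, "hocr"], check=True)
    return out_base + ".hocr"

def report_result(page_id, hocr_path):
    with open(hocr_path, encoding="utf-8") as fh:
        resp = requests.post(f"{DASHBOARD}/pages/{page_id}/result", json={"hocr": fh.read()})
    resp.raise_for_status()

if __name__ == "__main__":
    for page in fetch_pending():
        hocr = ocr_page(page["image_path"], f"out/page_{page['id']}")
        report_result(page["id"], hocr)
```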
13. Outcomes – Github: hOCR deNoising
early-modern-ocr.github.io/hOCR-De-Noising/
(Before/after page images shown.)
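The speaker notes describe the approach as dropping hOCR "words" that are really noise, judged by each word's position and size. Here is a minimal Python sketch of that idea; the regex, size thresholds, and function name are illustrative assumptions, not the repo's actual implementation.

```python
# Sketch: blank out hOCR word spans whose bounding boxes look like specks
# (or absurdly tall blobs) rather than text. Thresholds are illustrative.
import re

WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>.*?</span>",
    re.DOTALL,
)

def denoise_hocr(hocr_text, min_w=4, min_h=4, max_h=200):
    """Return hOCR with suspiciously sized word boxes removed."""
    def keep_or_drop(match):
        x0, y0, x1, y1 = (int(match.group(i)) for i in range(1, 5))
        width, height = x1 - x0, y1 - y0
        if width < min_w or height < min_h or height > max_h:
            return ""  # treat as noise: drop the word span entirely
        return match.group(0)
    return WORD_RE.sub(keep_or_drop, hocr_text)
```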
14. Outcomes – Github: page evaluator
early-modern-ocr.github.io/page-evaluator/
Determines how correctable a page's OCR results are by examining the text. The score is based on the ratio of words that fit the correctable profile to the total number of words (a sketch follows below).
Correctable Profile
1. Clean tokens: remove leading and trailing punctuation; the remaining token must have at least 3 letters.
2. Spell-check tokens longer than 1 character.
3. Check the token profile: contains at most 2 non-alpha characters and at least 1 alpha character, has a length of at least 3, and does not contain a run of 4 or more repeated characters.
4. Also consider the length of tokens compared to the average for the page.
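For concreteness, here is a minimal Python sketch of a correctability score along these lines. It simplifies the profile above (the spell check is stubbed out with a placeholder word set and the token-length comparison is omitted), so treat it as an approximation, not the page-evaluator's actual code.

```python
# Sketch: correctability score = share of tokens fitting a correctable profile.
# DICTIONARY stands in for a real period-specific spell checker.
import re
import string

DICTIONARY = {"the", "that", "thought", "she", "was"}  # placeholder word list

def is_correctable(token):
    cleaned = token.strip(string.punctuation)          # 1. strip leading/trailing punctuation
    if len(cleaned) > 1 and cleaned.lower() in DICTIONARY:
        return True                                    # 2. spell-check tokens > 1 character
    alpha = sum(ch.isalpha() for ch in cleaned)
    non_alpha = len(cleaned) - alpha
    if len(cleaned) < 3 or alpha < 1 or non_alpha > 2:
        return False                                   # 3. basic shape checks
    if re.search(r"(.)\1{3,}", cleaned):
        return False                                   # 3. run of 4+ repeated characters
    return True

def correctability_score(page_text):
    tokens = page_text.split()
    if not tokens:
        return 0.0
    return sum(is_correctable(t) for t in tokens) / len(tokens)

print(correctability_score("tbat I thoughc Ihe Was ~~~ 1111111"))  # ~0.57
```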
15. Outcomes – Github: page corrector
early-modern-ocr.github.io/page-corrector/
1. Preliminary cleanup
remove punctuation from the beginning/end of tokens & remove empty lines
combine hyphenated tokens at the ends of lines
retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period-specific dictionary lookups to gather suggestions for words.
transformation rules, e.g. rn->m; c->e; 1->l
3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our (sanitized, period-specific) Google 3-gram dataset (walked through on the next slide, with a sketch after it).
tbat I thoughc Ihe Was
16. Outcomes – Github: page corrector
tbat I thoughc Ihe Was

window: tbat l thoughc
Candidates used for context matching:
tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844, volCount: 1474)

window: l thoughc Ihe
Candidates used for context matching:
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497, volCount: 486)
ContextMatch: l thought she (matchCount: 1538, volCount: 997)
ContextMatch: l thought the (matchCount: 2496, volCount: 1905)

window: thoughc Ihe Was
Candidates used for context matching:
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121, volCount: 120)
ContextMatch: though ike was (matchCount: 65, volCount: 59)
ContextMatch: though she was (matchCount: 556,763, volCount: 364,965)
ContextMatch: though the was (matchCount: 197, volCount: 196)
ContextMatch: thought ice was (matchCount: 45, volCount: 45)
ContextMatch: thought ike was (matchCount: 112, volCount: 108)
ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822)
ContextMatch: thought the was (matchCount: 91, volCount: 91)

that I thought she was
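To make the mechanics of that walkthrough concrete, here is a minimal Python sketch of suggestion gathering plus sliding-window context matching. The transformation list, the edit-distance helper standing in for dictionary lookups, the placeholder lexicon, and the tiny in-memory 3-gram table are all illustrative assumptions, not eMOP's actual rules or its sanitized, period-specific Google 3-gram data.

```python
# Sketch: gather candidate spellings per OCR token, then slide a 3-word window
# and keep the candidate combination with the highest 3-gram match count.
from itertools import product
import string

TRANSFORMS = [("rn", "m"), ("c", "e"), ("1", "l")]            # sample rules from the slide
DICTIONARY = {"that", "i", "thought", "she", "the", "was"}    # placeholder period lexicon
TRIGRAM_COUNTS = {                                            # stand-in for the 3-gram dataset
    ("that", "i", "thought"): 1844,
    ("i", "thought", "she"): 1538,
    ("thought", "she", "was"): 549531,
}

def dict_neighbors(word):
    """Dictionary words one substitution away (a crude stand-in for lookups)."""
    subs = {word[:i] + ch + word[i + 1:]
            for i in range(len(word)) for ch in string.ascii_lowercase}
    return {w for w in subs if w in DICTIONARY}

def candidates(token):
    cleaned = token.strip(string.punctuation).lower()
    cands = {cleaned} | dict_neighbors(cleaned)
    for old, new in TRANSFORMS:
        if old in cleaned:
            cands.add(cleaned.replace(old, new))
    return cands

def correct(tokens):
    best = list(tokens)
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        combo = max(product(*(candidates(t) for t in window)),
                    key=lambda c: TRIGRAM_COUNTS.get(c, 0))
        if TRIGRAM_COUNTS.get(combo, 0) > 0:
            best[i:i + 3] = combo
        # a fuller implementation would reconcile overlapping windows
    return " ".join(best)

print(correct("tbat I thoughc Ihe Was".split()))  # -> that i thought she was
```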
17. Outcomes – Github: Juxta-CL
early-modern-ocr.github.io/Juxta-cl/
Juxta-CL (command line)
created for eMOP
based on the JuxtaCommons tool (juxtacommons.org/) created by Performant Software
several different comparison algorithms to choose from, and other options
Levenshtein / Jaro-Winkler / Juxta
ignore: punctuation, caps, hyphens
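As a rough illustration of this kind of comparison, here is a minimal Python sketch that scores an OCR transcription against a ground-truth text with character-level Levenshtein distance, with optional normalization for punctuation, case, and hyphens. It is an approximation of the idea, not Juxta-CL's actual algorithm or option set.

```python
# Sketch: character-level similarity between OCR output and ground truth,
# using Levenshtein distance with optional normalization. Illustrative only.
import string

def normalize(text, ignore_punct=True, ignore_case=True, ignore_hyphens=True):
    if ignore_hyphens:
        text = text.replace("-", "")
    if ignore_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if ignore_case:
        text = text.lower()
    return " ".join(text.split())  # collapse whitespace

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(ocr_text, ground_truth, **opts):
    a, b = normalize(ocr_text, **opts), normalize(ground_truth, **opts)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("tbat I thoughc Ihe Was", "that I thought she was"))  # ~0.86
```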
19. Outcomes – Tesseract Training
early-modern-ocr.github.io/TesseractTraining/
Font Training
3 different typeface types: Roman, Italic, Blackletter
12 different printers, 1564-1764
40 different typeface combinations
Even more combined-typeface training files
EM Dictionaries
from multiple sources, with alternate spellings
More
Franken+ training files
common OCR error file
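To show how training like this is typically used, here is a minimal Python sketch that points Tesseract at a directory of custom traineddata and asks for hOCR output. The directory path and the training name (emop_blackletter) are hypothetical placeholders, not the actual file names in the eMOP training repo, and TESSDATA_PREFIX handling differs between Tesseract versions.

```python
# Sketch: run Tesseract with custom eMOP-style training and hOCR output.
# "emop_blackletter" and the tessdata path are hypothetical examples;
# depending on the Tesseract version, TESSDATA_PREFIX may need to point at
# the directory *containing* the tessdata folder rather than the folder itself.
import os
import subprocess

def ocr_with_training(image_path, out_base, training="emop_blackletter",
                      tessdata_dir="/opt/emop/tessdata"):
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    # "hocr" is a standard Tesseract output config producing hOCR (HTML) output.
    subprocess.run(["tesseract", image_path, out_base, "-l", training, "hocr"],
                   env=env, check=True)

ocr_with_training("page_0001.tif", "page_0001")
```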
20. Outcomes – DB of EM Printers
Parsing the imprint lines of all EEBO & ECCO docs to create the ImprintDB
Gathering:
Printed by
Printed for
Sold by
Location (“at the Rose and Crown in St. Paul's Church-Yard”, “at the signe of the Traytors head”)
Place (London, Cambridge…)
(Dates)
Those docs with ESTC numbers will be available via a public database
early-modern-ocr.github.io/ImprintDB/
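As a rough sketch of what such imprint parsing can look like, here is an illustrative Python regex pass over an imprint line. The patterns and the constructed sample imprint (placeholder names, with the location taken from the slide) are simplified assumptions for demonstration; the real ImprintDB parsing handles far more variation.

```python
# Sketch: pull "printed by", "printed for", "sold by", and a location out of
# an early modern imprint line with simple regexes. Illustrative only;
# real imprints vary wildly in wording, spelling, and punctuation.
import re

FIELDS = {
    "printed_by": r"printed\s+by\s+(?P<val>.+?)(?=,|;|\s+for\s|$)",
    "printed_for": r"\bfor\s+(?P<val>.+?)(?=,|;|\s+and\b|\s+sold\b|$)",
    "sold_by": r"sold\s+by\s+(?P<val>.+?)(?=,|;|\s+at\s|$)",
    "location": r"\bat\s+the\s+(?P<val>.+?)(?=,|;|$)",
}

def parse_imprint(imprint):
    result = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, imprint, flags=re.IGNORECASE)
        if match:
            result[name] = match.group("val").strip()
    return result

example = ("London: printed by A. B. for C. D., and sold by E. F. "
           "at the Rose and Crown in St. Paul's Church-Yard, 1678.")
print(parse_imprint(example))
```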
21. Outcomes – Fulltext Search of TCP Phase 1
Phase 1 of the TCP EEBO hand transcriptions is now available for fulltext searching in 18thConnect: over 25,000 documents.
18thconnect.org/search?a=EEBO&o=fulltext
22. Outcomes – Editable EEBO OCR – TypeWright
Can already use TypeWright in 18thConnect to edit ECCO docs (without a subscription)
Will soon be able to edit EEBO docs in TypeWright:
TCP hand transcriptions
eMOP OCR transcriptions
When a doc is fully corrected, users can request text and/or XML versions of the corrected transcription for scholarly use.
First time EEBO transcriptions will be available for over 80,000 docs.
18thconnect.org/typewright
23. Outcomes – Collection Evaluation
We’ve collected over 6 TB of data
Our post-processing algorithms collect data:
amount of skew
amount of noise
number and location of text columns
page quality
correctability
corrections made
Saved to the eMOP DB
First time this has been done on these collections
Pre-process and re-OCR pages
several iterations will identify bad page images
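A minimal sketch of how that evaluation data could drive a re-OCR pass is shown below; the metric names, thresholds, and stand-in records are hypothetical illustrations, not the eMOP DB schema or its actual cutoffs.

```python
# Sketch: use per-page evaluation metrics to pick candidates for
# pre-processing (deskew/denoise) and re-OCR. Fields and thresholds
# are hypothetical, not the actual eMOP DB schema.
SKEW_LIMIT = 2.0              # degrees
NOISE_LIMIT = 0.15            # fraction of page area flagged as noise
CORRECTABILITY_FLOOR = 0.5

pages = [  # stand-in for rows pulled from the eMOP DB
    {"id": 1, "skew": 0.4, "noise": 0.02, "correctability": 0.91},
    {"id": 2, "skew": 5.1, "noise": 0.04, "correctability": 0.48},
    {"id": 3, "skew": 0.9, "noise": 0.31, "correctability": 0.22},
]

def needs_reocr(page):
    return (page["skew"] > SKEW_LIMIT
            or page["noise"] > NOISE_LIMIT
            or page["correctability"] < CORRECTABILITY_FLOOR)

candidates = [p["id"] for p in pages if needs_reocr(p)]
print("re-OCR candidates:", candidates)  # -> [2, 3]
```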
24. The Future of eMOP
We want eMOP to be viable long-term
Continue to be developed/supported
open-source repos
outreach
Looking for partners
(image from Kill Bill 2, 2004, Miramax)
25. Future – Work
ImprintDB
We want to turn this into an online DB with functionality to:
Query & download results
Recommend corrections to data
Analyze Data
We have collected a mass of data on nearly 45 million pages:
Skew & noise measures
Column coordinates
Corrections made
Scores & estimates for “correctability”
We can use that data to improve the OCR workflow and judge its efficacy.
Re-OCR
Using skew & noise measures & column info, we can pre-process pages and re-OCR for better results.
26. Future – Funded Work
NEH Implementation Grant: Reading the First Books
OCR the first printed works of the New World
Using the Ocular OCR engine with multiple-language recognition
Opportunity to integrate another OCR engine into our workflow, and continue to develop our workflow and DB to work with other collections.
27. Future – Partnerships
Notre Dame Libraries
Penn St. Libraries
Adam Matthew Digital
Visualizing English Print
U of S Carolina – Center for Digital Humanities
Consultation
Workshops
Share labor / resources
Test the robustness of our workflow, in parts or as a whole, on different collections and systems
Test how text analysis, topic modeling, lexical mapping, etc. can improve our data and results
28. The end
For eMOP questions please contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu
mandell@tamu.edu
Editor's Notes
We have a number of contact points
The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation-funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
OPEN SOURCE
For our collections we had EEBO & ECCO (proprietary collections)
This constitutes most of the English-language EM documents available in digitized format.
And for our Groundtruth comparison we used the TCP.
The reason why this project is necessary is due to the nature of the documents in question.
This is the beginning of the printing era when a lot of things were in flux.
The technology was new.
Standards were loose or non-existent.
Quality was variable
Problems are endemic to both the original documents and their digital representations
Some were great
most were not
Noisy
Skewed
Warped
Or they posed challenges for OCR engines
Multiple pages per image
Multiple columns
Images & decorative elements
Marginalia
Missing margins
many were terrible
So, after almost 2 years of typeface research, and testing, and creating training for Tesseract (Google)
we OCRd this collection on anywhere from 128 – 500 processors simultaneously and continuously for 3 months
When we were done, we analyzed the scores for the 1.8 million pages that we could compare to the groundtruth we got from the TCP.
We were only 3% off from Prime Recognition, who were able to use multiple OCR engines, and who charge a lot of money.
So, we achieved 80% or better on almost half of our documents with groundtruth. Those results mostly reflect comparison to EEBO and consist of over a quarter of EEBO’s collection.
What we think this is telling us (lacking more detailed analysis) is that we did pretty well with EEBO, but REALLY bad images in that collection pulled our overall score average down.
Because the collections we OCRd are proprietary, those transcriptions are not necessarily part of our open-source outcomes, with some exceptions, which I’ll talk about in a minute.
Most of what we produced open-source is a lot of code and Tesseract training. That’s all in github and linked to from our website.
We created this tool to allow us to generate our own EM typeface training for Tesseract.
Tesseract’s native training method wouldn’t work for us.
We see this as a useful tool for Typeface research as well.
The eMOP dashboard allows us to control what docs get OCRd using which set of training.
We can also use it to see results down to a page level.
From here we can see scores against GT, estimated correctability, and correction stats
There’s also an admin page with an API for looking at more detailed run information and downloading other stats
Dashboard provides access to the DB via an API
Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS
I’d like to point out that you see very little pre-processing of page images in this workflow.
The size and nature of our image collections meant that it was prohibitively expensive to attempt to identify and fix potential issues BEFORE ocr’ing
So we concentrated on post-processing
One of those post-processing algorithms removes “noise” from Tesseract’s hOCR output: “words” that are really caused by noise, identified by looking at the position and size of each word.
Created by Performant Software
Cobre is a robust image comparison environment first designed for the Primeros Libros collection at TAMU, UT, and Mexico.
We’ve added the ability to edit text transcriptions and add document level metadata and annotations.
We ingested 3 copies of this document from EEBO into Cobre, along with OCR transcriptions.
Editors with a login can then add doc-level metadata and annotations
And can edit page transcriptions one at a time.
More importantly, you can compare page images and transcriptions for 2 or more pages.
Here we can see two copies of the same document side-by-side. The copy on the left is in much worse condition and so OCR’d much worse.
We have the ability to cut and paste all or part of the transcription on the right, but we have to be careful about differences (e.g. the W vs. VV of the first word).
That’s the remaining 2/3 of the collection, which could never before be searched.