eMOP-PennSt-lunch

Matt Christy
Matt ChristyLead Software Applications Developer
The Early Modern OCR Project
An open source workflow and tools for OCR
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
1
 emop.tamu.edu/
 Texas A&M Big Data Workshop
 emop.tamu.edu/TAMU-
BigData
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/
Mellon/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 The Early Modern OCR
Project
 Twitter
 #emop
 @IDHMC_Nexus
 @mandellc
 @matt_christy
 @EMGrumbach
2
Early Modern OCR Project
 The Early Modern OCR Project (eMOP) is an Andrew W.
Mellon Foundation funded grant project running out of the
Initiative for Digital Humanities, Media, and Culture (IDHMC)
at Texas A&M University, to develop and test tools and
techniques to apply Optical Character Recognition (OCR)
to early modern English documents from the hand press
period, roughly 1475-1800.
 eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
3
Specifically, eMOP’s goal is to make
machine readable, or improve the
readability, for 305,000 document/45
million pages of text from two major
proprietary databases: Eighteenth
Century Collections Online (ECCO)
and Early English Books Online (EEBO).
Generally, our aim is to use typeface
and book history techniques to train
modern OCR engines specifically on
the typefaces in our collection of
documents, and thereby improve the
accuracy of the OCR results.
The Numbers
Page Images
 Early English Books online
(Proquest) EEBO: ~125,000
documents, ~13 million
pages images (1475-1700)
 Eighteenth Century
Collections Online (Gale
Cengage) ECCO: ~182,000
documents, ~32 million
page images (1700-1800)
 Total: >300,000 documents
& 45 million page images.
Ground Truth
 Text Creation Partnership TCP:
~46,000 double-keyed hand
transcribed documents
 44,000 EEBO
 2,200 ECCO
4
5
6
• PRImA (Pattern Recognition & Image Analysis Research) Lab at
the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly
Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas
A&M University
• The Academy for Advanced Telecommunications and Learning
Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
OurPartnersemop.tamu.edu/participating-institutions
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality
equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters &
ligatures
 Spelling variations
 Mixed typefaces and
languages
 over/under-inking
7
Document/Image Quality
 Torn and damaged
pages
 Noise introduced to
images of pages
 Skewed pages
 Warped pages
 Missing pages
 Inverted pages
 Incorrect metadata
 Extremely low quality
TIFFs (~50K)
Page Images 8
TrainingTesseract9
Aletheia
10
www.primaresearch.org/too
ls.php
Available for free but requires
registration.
 Created by PRImA Research
Labs, University of Salford, UK.
 Windows based tool.
 Developed as a groundtruth
creation tool
 Used by eMOP undergraduate
student workers to create training
of desired typeface for Tesseract.
 Can identify glyphs on a page
image with page coordinates and
Unicode values.
Aletheia:Workflow11  Binarization and Denoise are native Aletheia functions
 A team of Undergraduate student workers refines and
corrects glyph boxes and unicode values, where needed.
 Output: A set of PAGE XML files with page coordinates and
unicode values for every identified glyph on each processed
TIFF image.
Aletheia: Glyph Recognition 12
Uses Tesseract to find glyphs
Aletheia: I/O 13
We then convert PAGE XML
file to Tesseract Box file using
XSLT
Tesseract Training 14
Franken+
1. Windows based tool that uses a
MySQL DB.
2. Developed for eMOP by IDHMC
Graduate student worker Bryan
Tarpley.
3. Designed to be easily used by
eMOP Undergraduate student
workers
4. Takes Aletheia's output files as
input.
5. Outputs the same box files and TIFF
images that Tesseract's first stage
of native training.
 Available open-source at:
github.com/idhmc-
tamu/FrankenPlus
15
Franken+Workflow16
1. Groups all glyphs with
the same Unicode
values into one window
for comparison.
2. Uses all selected glyphs
to create a Franken-
page image (TIFF) using
a selected text as a
base.
3. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training.
Franken+ Ingestion 17
Franken+ 18
 All exemplars of the
same glyph are
displayed together.
 Users can quickly
identify and
deselect:
 Incorrectly labeled
glyphs
 Incomplete glyphs
 Unrepresentative
exemplars
 Different sized glyphs
19
Franken+
TrainingTesseract
20
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
Franken+ Results 21
AFTER
BEFORE
eMOP
TesseractTraining22
23
Workflows-Controller
• Powered by the eMOP
DB
• Collection processing is
managed via the online
Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of
constant processing
• We will have to reprocess some
files that fail due to timeouts or
that require pre-processing
24
Post-ProcessingTriage
25
Post-Processing
Triage
26
Post-Processing 27
Triage
Treatment Diagnosis
Triage:De-noising
28
 Uses hOCR results
1. Determine average
word bounding box
size
2. Weed out boxes are
that too big or too
small
3. But keep small boxes
that have neighbors
that are “words”
Triage: De-noising 29
Before: 35% After: 58%
Triage: De-noising 30
Before After
Triage: Estimated Correctability 31
Page Evaluation
 Determine how correctable a
page’s OCR results are by
examining the text.
 The score is based on the ratio
of words that fit the
correctable profile to the total
number of words
Correctable Profile
1. Clean tokens:
 remove leading and trailing
punctuation
 remaining token must have at
least 3 letters
2. Spell check tokens >1
character
3. Check token profile :
 contain at most 2 non-alpha
characters, and
 at least 1 alpha character,
 have a length of at least 3,
 and do not contain 4 or more
repeated characters in a run
4. Also consider length of tokens
compared to average for the
page
32Triage: Estimated Correctability
Treatment: Page Correction 33
1. Preliminary cleanup
 remove punctuation from begin/end of
tokens
 remove empty lines and empty tokens
 combine hyphenated tokens that appear
at the end of a line
 retain cleaned & original tokens as
“suggestions”
2. Apply common transformations and
period specific dictionary lookups to
gather suggestions for words.
 transformation rules: rn->m; c->e; 1->l; e
34Treatment: Page Correction
3. Use context checking on a sliding window of 3 words,
and their suggested changes, to find the best context
matches in our(sanitized, period-specific) Google 3-
gram dataset
 if no context is found and only one additional suggestion
was made from transformation or dictionary, then
replace with this suggestion
 if no context and “clean” token from above is in the
dictionary, replace with this token
35Treatment: Page Correction
window: tbat l thoughc
Candidates used for context matching:
 tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial,
abat, tbat, teat)
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844 , volCount: 1474)
window: l thoughc Ihe
Candidates used for context matching:
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e,
ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497 , volCount: 486)
ContextMatch: l thought she (matchCount: 1538 , volCount: 997)
ContextMatch: l thought the (matchCount: 2496 , volCount: 1905)
tbat I thoughc Ihe Was
36Treatment: Page Correction
window: thoughc Ihe Was
Candidates used for context matching:
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e,
ene, ice, inc, tho, ime, ite, ive, the)
 Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121 , volCount: 120)
ContextMatch: though ike was (matchCount: 65 , volCount: 59)
ContextMatch: though she was (matchCount: 556,763 , volCount:
364,965)
ContextMatch: though the was (matchCount: 197 , volCount: 196)
ContextMatch: thought ice was (matchCount: 45 , volCount: 45)
ContextMatch: thought ike was (matchCount: 112 , volCount: 108)
ContextMatch: thought she was (matchCount: 549,531 , volCount:
325,822)
ContextMatch: thought the was (matchCount: 91 , volCount: 91)
that I thought she was
Treatment: Results 37
Treatment: Results 38
Tags DB
 All page images that don’t produce OCR text that
meets our standards will be tagged in the eMOP DB
with tags that indicate what is wrong with the page
image.
 skew, warp, noise, wrong font, etc.
 Tags can then be used to later apply the
appropriate pre-processing techniques and then
re-OCR the page to produce better results.
 This will be the first time that any sort of
comprehensive analysis has been done on the
page images of these collections.
 We’ll end up (ideally) with a list of page images that
simply need to be re-scanned.
39
 www.18thconnect.org/typew
right/
 Available through
18thconnect.org
An aggregator of 18th
Century metadata & content
This will be the first time that
EEBO documents are available
as searchable text, and that
they can be corrected by
scholars for their own use.
TypeWright
 Crowd-sourced correction
tool for OCR.
 Currently available for ECCO
collection, based on original
ECCO OCR results.
 Will be ingested with eMOP
OCR for ECCO and EEBO by
year-end.
 Users can correct a whole
document and then receive
the text or a lightly encoded
TEI XML version for use in a
digital edition or whatever.
40
TypeWright
Edit button identifies
“TypeWright” enabled
Texts.
41
1 of 41

Recommended

DLF Forum 2015: Beyond eMOP by
DLF Forum 2015: Beyond eMOPDLF Forum 2015: Beyond eMOP
DLF Forum 2015: Beyond eMOPMatt Christy
1.4K views28 slides
mchristy-DH2014-emop-bookhistory-tools by
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsMatt Christy
2.8K views21 slides
TCDL15 Beyond eMOP by
TCDL15 Beyond eMOPTCDL15 Beyond eMOP
TCDL15 Beyond eMOPMatt Christy
1.3K views31 slides
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB by
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBDigital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBMatt Christy
1.2K views12 slides
mchristy-Dh2014- emop-postOCR-triage by
mchristy-Dh2014- emop-postOCR-triagemchristy-Dh2014- emop-postOCR-triage
mchristy-Dh2014- emop-postOCR-triageMatt Christy
1.6K views21 slides
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFP by
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFPFrom Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFP
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFPMatt Christy
1.5K views19 slides

More Related Content

What's hot

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards by
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsLiz Grumbach
2.1K views27 slides
Dh2014 e mopcobre-complete by
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-completeLaura Mandell
2.9K views45 slides
Learning to assess Linked Data relationships using Genetic Programming by
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingVrije Universiteit Amsterdam
664 views28 slides
How well does your Instance Matching system perform? Experimental evaluation ... by
How well does your Instance Matching system perform? Experimental evaluation ...How well does your Instance Matching system perform? Experimental evaluation ...
How well does your Instance Matching system perform? Experimental evaluation ...Holistic Benchmarking of Big Linked Data
648 views19 slides
Aspects of NLP Practice by
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP PracticeVsevolod Dyomkin
1K views37 slides
Does Data Quality lays in facts, or in acts? by
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?jeansoulin
69 views59 slides

What's hot(11)

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards by Liz Grumbach
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Liz Grumbach2.1K views
Dh2014 e mopcobre-complete by Laura Mandell
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
Laura Mandell2.9K views
Does Data Quality lays in facts, or in acts? by jeansoulin
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?
jeansoulin69 views
Quick tour all handout by Yi-Shin Chen
Quick tour all handoutQuick tour all handout
Quick tour all handout
Yi-Shin Chen615 views
Digitization Projects Tech Con 2006 by Regina Koury
Digitization Projects Tech Con 2006Digitization Projects Tech Con 2006
Digitization Projects Tech Con 2006
Regina Koury267 views
The Semantic Web - Interacting with the Unknown by Steffen Staab
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the Unknown
Steffen Staab1.1K views
Quick Tour of Text Mining by Yi-Shin Chen
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text Mining
Yi-Shin Chen4.9K views

Similar to eMOP-PennSt-lunch

2013 siam-cse-big-data by
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
1.1K views32 slides
2015 illinois-talk by
2015 illinois-talk2015 illinois-talk
2015 illinois-talkc.titus.brown
1.2K views63 slides
2016 bergen-sars by
2016 bergen-sars2016 bergen-sars
2016 bergen-sarsc.titus.brown
915 views56 slides
eScience: A Transformed Scientific Method by
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
3.5K views59 slides
2014 abic-talk by
2014 abic-talk2014 abic-talk
2014 abic-talkc.titus.brown
1.5K views45 slides
A Distributed Architecture System for Recognizing Textual Entailment by
A Distributed Architecture System for Recognizing Textual EntailmentA Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentFaculty of Computer Science
725 views24 slides

Similar to eMOP-PennSt-lunch(20)

2013 siam-cse-big-data by c.titus.brown
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
c.titus.brown1.1K views
eScience: A Transformed Scientific Method by Duncan Hull
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull3.5K views
Machine Learning ICS 273A by butest
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest539 views
Machine Learning ICS 273A by butest
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest271 views
Knowledge base system appl. p 1,2-ver1 by Taymoor Nazmy
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1
Taymoor Nazmy45 views
HPCAC - the state of bioinformatics in 2017 by philippbayer
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
philippbayer370 views
2013 py con awesome big data algorithms by c.titus.brown
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
c.titus.brown3.5K views
ADAM—Spark Summit, 2014 by fnothaft
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
fnothaft45.9K views
Develop and design hybrid genetic algorithms with multiple objectives in data... by khalil IBRAHIM
Develop and design hybrid genetic algorithms with multiple objectives in data...Develop and design hybrid genetic algorithms with multiple objectives in data...
Develop and design hybrid genetic algorithms with multiple objectives in data...
khalil IBRAHIM27 views

Recently uploaded

DU Oral Examination Toni Santamaria by
DU Oral Examination Toni SantamariaDU Oral Examination Toni Santamaria
DU Oral Examination Toni SantamariaMIPLM
138 views2 slides
Gopal Chakraborty Memorial Quiz 2.0 Prelims.pptx by
Gopal Chakraborty Memorial Quiz 2.0 Prelims.pptxGopal Chakraborty Memorial Quiz 2.0 Prelims.pptx
Gopal Chakraborty Memorial Quiz 2.0 Prelims.pptxDebapriya Chakraborty
553 views81 slides
7 NOVEL DRUG DELIVERY SYSTEM.pptx by
7 NOVEL DRUG DELIVERY SYSTEM.pptx7 NOVEL DRUG DELIVERY SYSTEM.pptx
7 NOVEL DRUG DELIVERY SYSTEM.pptxSachin Nitave
56 views35 slides
Lecture: Open Innovation by
Lecture: Open InnovationLecture: Open Innovation
Lecture: Open InnovationMichal Hron
95 views56 slides
ICANN by
ICANNICANN
ICANNRajaulKarim20
63 views13 slides
Material del tarjetero LEES Travesías.docx by
Material del tarjetero LEES Travesías.docxMaterial del tarjetero LEES Travesías.docx
Material del tarjetero LEES Travesías.docxNorberto Millán Muñoz
68 views9 slides

Recently uploaded(20)

DU Oral Examination Toni Santamaria by MIPLM
DU Oral Examination Toni SantamariaDU Oral Examination Toni Santamaria
DU Oral Examination Toni Santamaria
MIPLM138 views
7 NOVEL DRUG DELIVERY SYSTEM.pptx by Sachin Nitave
7 NOVEL DRUG DELIVERY SYSTEM.pptx7 NOVEL DRUG DELIVERY SYSTEM.pptx
7 NOVEL DRUG DELIVERY SYSTEM.pptx
Sachin Nitave56 views
Lecture: Open Innovation by Michal Hron
Lecture: Open InnovationLecture: Open Innovation
Lecture: Open Innovation
Michal Hron95 views
Nico Baumbach IMR Media Component by InMediaRes1
Nico Baumbach IMR Media ComponentNico Baumbach IMR Media Component
Nico Baumbach IMR Media Component
InMediaRes1425 views
Community-led Open Access Publishing webinar.pptx by Jisc
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptx
Jisc69 views
Education and Diversity.pptx by DrHafizKosar
Education and Diversity.pptxEducation and Diversity.pptx
Education and Diversity.pptx
DrHafizKosar107 views
American Psychological Association 7th Edition.pptx by SamiullahAfridi4
American Psychological Association  7th Edition.pptxAmerican Psychological Association  7th Edition.pptx
American Psychological Association 7th Edition.pptx
SamiullahAfridi474 views
UWP OA Week Presentation (1).pptx by Jisc
UWP OA Week Presentation (1).pptxUWP OA Week Presentation (1).pptx
UWP OA Week Presentation (1).pptx
Jisc68 views
SIMPLE PRESENT TENSE_new.pptx by nisrinamadani2
SIMPLE PRESENT TENSE_new.pptxSIMPLE PRESENT TENSE_new.pptx
SIMPLE PRESENT TENSE_new.pptx
nisrinamadani2173 views
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) by AnshulDewangan3
 Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
AnshulDewangan3275 views
Chemistry of sex hormones.pptx by RAJ K. MAURYA
Chemistry of sex hormones.pptxChemistry of sex hormones.pptx
Chemistry of sex hormones.pptx
RAJ K. MAURYA119 views
The basics - information, data, technology and systems.pdf by JonathanCovena1
The basics - information, data, technology and systems.pdfThe basics - information, data, technology and systems.pdf
The basics - information, data, technology and systems.pdf
JonathanCovena177 views
The Open Access Community Framework (OACF) 2023 (1).pptx by Jisc
The Open Access Community Framework (OACF) 2023 (1).pptxThe Open Access Community Framework (OACF) 2023 (1).pptx
The Open Access Community Framework (OACF) 2023 (1).pptx
Jisc77 views

eMOP-PennSt-lunch

  • 1. The Early Modern OCR Project An open source workflow and tools for OCR Matthew Christy, Laura Mandell, Elizabeth Grumbach 1
  • 2.  emop.tamu.edu/  Texas A&M Big Data Workshop  emop.tamu.edu/TAMU- BigData  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 2
  • 3. Early Modern OCR Project  The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.  eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. 3 Specifically, eMOP’s goal is to make machine readable, or improve the readability, for 305,000 document/45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
  • 4. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)  Total: >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed documents  44,000 EEBO  2,200 ECCO 4
  • 5. 5
  • 6. 6 • PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK • SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign • PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University • The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University • The Brazos High Performance Computing Cluster (HPCC) OurPartnersemop.tamu.edu/participating-institutions
  • 7. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking 7 Document/Image Quality  Torn and damaged pages  Noise introduced to images of pages  Skewed pages  Warped pages  Missing pages  Inverted pages  Incorrect metadata  Extremely low quality TIFFs (~50K)
  • 10. Aletheia 10 www.primaresearch.org/too ls.php Available for free but requires registration.  Created by PRImA Research Labs, University of Salford, UK.  Windows based tool.  Developed as a groundtruth creation tool  Used by eMOP undergraduate student workers to create training of desired typeface for Tesseract.  Can identify glyphs on a page image with page coordinates and Unicode values.
  • 11. Aletheia:Workflow11  Binarization and Denoise are native Aletheia functions  A team of Undergraduate student workers refines and corrects glyph boxes and unicode values, where needed.  Output: A set of PAGE XML files with page coordinates and unicode values for every identified glyph on each processed TIFF image.
  • 12. Aletheia: Glyph Recognition 12 Uses Tesseract to find glyphs
  • 13. Aletheia: I/O 13 We then convert PAGE XML file to Tesseract Box file using XSLT
  • 15. Franken+ 1. Windows based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC Graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP Undergraduate student workers 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images that Tesseract's first stage of native training.  Available open-source at: github.com/idhmc- tamu/FrankenPlus 15
  • 16. Franken+Workflow16 1. Groups all glyphs with the same Unicode values into one window for comparison. 2. Uses all selected glyphs to create a Franken- page image (TIFF) using a selected text as a base. 3. Outputs the same box files and TIFF images that Tesseract's first stage of native training.
  • 18. Franken+ 18  All exemplars of the same glyph are displayed together.  Users can quickly identify and deselect:  Incorrectly labeled glyphs  Incomplete glyphs  Unrepresentative exemplars  Different sized glyphs
  • 20. TrainingTesseract 20 Thiſ great conſumption to a fever turn'd, And ſo the oꝗld had fitſ; it joy'd, it mourn'd; And, aſ men thinke, that Agueſ phy ck are, And th'Ague being ſpent, give over care. Žo thou cke World, mꝗſtak'ſt thy ſelże to bee Well, when ãlaſ, thou'rt in a Lethargie. Her death did wound and tame thee than, and than Thou might'ſt ha e better ſpar'd the Sunne, or man. That wound waſ deep, but 'tiſ more miżery, That thou haſt loſt thy ſenſe and memor . 'Twaſ heavy then to heare thy voyce of mone, But thiſ iſ worſe, that thou art ſpeechle e growne. Thou haſt forgot thy name thou hadſt; thou waſt Nothing but ee, and her thou haſt o'rpaſt. For aſ a child kept from the Fount, untill Ä prince, expe ed long, come to fulfill The ceremonieſ, thou unnam'd had'ſt laid, Had not her comming, thee her palace made: Her name defin'd thee, gave thee forme, and frame, And thou forgett'ſt to celebrate th n me. Some monethſ e hath beene dead (but beìng dead, Meaſureſ of timeſ are all determined) But long e'ath beene away, long, long, et none Offerſ to tell uſ who it iſ that'ſ gone. But aſ in ſtateſ doubtfull of future heireſ, When ckne e without remedie empaireſ The preſent Prince, they're loth it ould be ſaid, The Prince doth langui , or the Prince iſ dead: So mankinde feeling no a generall tha ,
  • 23. 23 Workflows-Controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py
  • 24. Brazos HPCC • 128 processors as a stakeholder • Access to background queues • Estimated at over 2 months of constant processing • We will have to reprocess some files that fail due to timeouts or that require pre-processing 24
  • 28. Triage:De-noising 28  Uses hOCR results 1. Determine average word bounding box size 2. Weed out boxes are that too big or too small 3. But keep small boxes that have neighbors that are “words”
  • 31. Triage: Estimated Correctability 31 Page Evaluation  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  • 33. Treatment: Page Correction 33 1. Preliminary cleanup  remove punctuation from begin/end of tokens  remove empty lines and empty tokens  combine hyphenated tokens that appear at the end of a line  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e
  • 34. 34Treatment: Page Correction 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3- gram dataset  if no context is found and only one additional suggestion was made from transformation or dictionary, then replace with this suggestion  if no context and “clean” token from above is in the dictionary, replace with this token
  • 35. 35Treatment: Page Correction window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) window: l thoughc Ihe Candidates used for context matching:  l -> Set(l)  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) tbat I thoughc Ihe Was
  • 36. 36Treatment: Page Correction window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121 , volCount: 120) ContextMatch: though ike was (matchCount: 65 , volCount: 59) ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965) ContextMatch: though the was (matchCount: 197 , volCount: 196) ContextMatch: thought ice was (matchCount: 45 , volCount: 45) ContextMatch: thought ike was (matchCount: 112 , volCount: 108) ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822) ContextMatch: thought the was (matchCount: 91 , volCount: 91) that I thought she was
  • 39. Tags DB  All page images that don’t produce OCR text that meets our standards will be tagged in the eMOP DB with tags that indicate what is wrong with the page image.  skew, warp, noise, wrong font, etc.  Tags can then be used to later apply the appropriate pre-processing techniques and then re-OCR the page to produce better results.  This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.  We’ll end up (ideally) with a list of page images that simply need to be re-scanned. 39
  • 40.  www.18thconnect.org/typew right/  Available through 18thconnect.org An aggregator of 18th Century metadata & content This will be the first time that EEBO documents are available as searchable text, and that they can be corrected by scholars for their own use. TypeWright  Crowd-sourced correction tool for OCR.  Currently available for ECCO collection, based on original ECCO OCR results.  Will be ingested with eMOP OCR for ECCO and EEBO by year-end.  Users can correct a whole document and then receive the text or a lightly encoded TEI XML version for use in a digital edition or whatever. 40

Editor's Notes

  1. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  2. These are not big numbers by STEM standards, but for the Humanities they are: This is the largest dataset for any open-source, academic OCR project in North American, ever For the Humanities research data is about quality of content rather than size Our OCR must be as good as possible in order to generate reliable research
  3. Some were great most were not Noisy Skewed Warped Or they posed challenges for OCR engines Multiple pages per image Multiple columns Images & decorative elements Marginalia Missing margins many were terrible
  4. Tesseract: Google Open source Accurate Fast Tesseract must first be trained to recognize the typeface used in the document(s) it is OCRing. Tesseract has a native mechanism to create that training, but it relies on having training pages with characteristics that can typically only be created with modern word processors, which don't have early modern fonts. To create early modern font training for Tesseract we had to adapt and create our own tools.
  5. Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  6. This is cheating: the result of scanning the same page we used to create the training.
  7. Franken+: Created by eMOP graduate student Bryan Tarpley. Used by the same team of undergraduates led by eMOP graduate student Kathy Torabi. Takes Aletheia's output files as input. Groups all glyphs with the same Unicode values into one window for comparison. Mistakenly coded glyphs are easily identified and re-coded. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. Outputs the same box files and TIFF images that Tesseract's first stage of native training. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files. Outputs a .traineddata file used by Tesseract when OCRing page images.
  8. This document had 3 distinct point sizes, but Franken+ revealed there were more like 5 or 6. In this case it probably doesn’t matter, but it points out the possibilities
  9. Dashboard provides access to the DB via an API Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS
  10. at any one time we currently have about 2000 pages being processed at the same time that’s subject to processor availability on the background queues about 8 pages per second.
  11. Before: 55% After: 73%