SlideShare a Scribd company logo
1 of 31
Beyond the Early Modern OCR
Project
or, What are We Going to do With Ourselves Now?
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
 emop.tamu.edu/
 Texas A&M Big Data Workshop
 emop.tamu.edu/TAMU-
BigData
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/
Mellon/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 The Early Modern OCR
Project
 Twitter
 #emop
 @IDHMC_Nexus
 @mandellc
 @matt_christy
 @EMGrumbach
3
April 27, 2015TCDL15 - Beyond eMOP
 The Early Modern OCR Project (eMOP) is an
 Andrew W. Mellon Foundation funded grant project
running out of the Initiative for Digital Humanities,
Media, and Culture (IDHMC) at Texas A&M
University, to
 develop and test tools and techniques to apply
Optical Character Recognition (OCR) to early
modern English documents
 from the hand press period, roughly 1475-1800.
 eMOP aims to improve the visibility of early
modern texts by making their contents fully
searchable. The current paradigm of searching
special collections for early modern materials by
either metadata alone or “dirty” OCR is
insufficient for scholarly research.
4
TCDL15 - Beyond eMOP
Goals
April 27, 2015
The Numbers
Page Images
 Early English Books online
(Proquest) EEBO: ~125,000
documents, ~13 million
pages images (1475-1700)
 Eighteenth Century
Collections Online (Gale
Cengage) ECCO: ~182,000
documents, ~32 million
page images (1700-1800)
 Total: >300,000 documents
& 45 million page images.
Ground Truth
 Text Creation Partnership TCP:
~46,000 double-keyed hand
transcribed docuemnts
 44,000 EEBO
 2,200 ECCO
5
April 27, 2015TCDL15 - Beyond eMOP
TCDL15 - Beyond eMOP
6
April 27, 2015
7
• PRImA (Pattern Recognition & Image Analysis Research) Lab at
the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly
Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas
A&M University
• The Academy for Advanced Telecommunications and Learning
Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
OurPartners
April 27, 2015TCDL15 - Beyond eMOP
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality
equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters &
ligatures
 Spelling variations
 Mixed typefaces and
languages
 over/under-inking
 Old, low-quality, small tiff
files
 Noise, skew, warp,
bleedthrough,
8
April 27, 2015TCDL15 - Beyond eMOP
Page Images
TCDL15 - Beyond eMOP
9
April 27, 2015
Results
 ECCO
 Avg. Scores
 309,328 pages
 86% correct
 Removing 1st page from
score
 87% correct
 57% correct on 1st pages
 Comparison to Prime
Recognition’s OCR
 93% claim
 (still running analysis)
 EEBO
 Avg. Scores
 1,475,026 pages
 68% correct
 Removing 1st page from
score
 68%
 No previous OCR to
compare to
April 27, 2015TCDL15 - Beyond eMOP
10
eMOP Outcomes - Github
April 27, 2015TCDL15 - Beyond eMOP
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
11
April 27, 2015TCDL15 - Beyond eMOP
12 Outcomes – Github
Franken+
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
1. Windows based tool
that uses a MySQL DB.
2. Developed for eMOP
by IDHMC Graduate
student worker Bryan
Tarpley.
3. Designed to be easily
used by eMOP
Undergraduate student
workers
4. Takes Aletheia's output
files as input.
5. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training.
early-modern-ocr.github.io/FrankenPlus/
April 27, 2015TCDL15 - Beyond eMOP
13 Outcomes – Github
emop-dashboard
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/emop-dashboard/
April 27, 2015TCDL15 - Beyond eMOP
14 Outcomes – Github
hOCR deNoising
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/hOCR-De-Noising/
April 27, 2015TCDL15 - Beyond eMOP
15 Outcomes – Github
hOCR deNoising
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
Before After
April 27, 2015TCDL15 - Beyond eMOP
16 Outcomes – Github
page evaluator
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/page-evaluator/
 Determine how correctable a
page’s OCR results are by
examining the text.
 The score is based on the ratio
of words that fit the correctable
profile to the total number of
words
Correctable Profile
1. Clean tokens:
 remove leading and trailing
punctuation
 remaining token must have
at least 3 letters
2. Spell check tokens >1
character
3. Check token profile :
 contain at most 2 non-alpha
characters, and
 at least 1 alpha character,
 have a length of at least 3,
 and do not contain 4 or
more repeated characters in
a run
4. Also consider length of
tokens compared to
average for the page
April 27, 2015TCDL15 - Beyond eMOP
17 Outcomes – Github
page corrector
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/page-corrector/
1. Preliminary cleanup
 remove punctuation from begin/end of tokens &
empty lines
 combine hyphenated tokens at end of lines
 retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period specific
dictionary lookups to gather suggestions for words.
 transformation rules: rn->m; c->e; 1->l; e
3. Use context checking on a sliding window of 3 words,
and their suggested changes, to find the best context
matches in our(sanitized, period-specific) Google 3-gram
dataset
tbat I thoughc Ihe Was
April 27, 2015TCDL15 - Beyond eMOP
18
window: tbat l thoughc
Candidates used for context matching:
 tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that,
tat, tba, ilial, abat, tbat, teat)
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844 , volCount:
1474)
Outcomes – Github
page corrector
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
tbat I thoughc Ihe Was
window: l thoughc Ihe
Candidates used for context matching:
■ l -> Set(l)
■ thoughc -> Set(thoughc, thought, though)
■ Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife,
ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497 , volCount:
486)
ContextMatch: l thought she (matchCount: 1538 , volCount:
997)
ContextMatch: l thought the (matchCount: 2496 , volCount:
1905)
window: thoughc Ihe Was
Candidates used for context matching:
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e,
ene, ice, inc, tho, ime, ite, ive, the)
 Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121 , volCount: 120)
ContextMatch: though ike was (matchCount: 65 , volCount: 59)
ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965)
ContextMatch: though the was (matchCount: 197 , volCount: 196)
ContextMatch: thought ice was (matchCount: 45 , volCount: 45)
ContextMatch: thought ike was (matchCount: 112 , volCount: 108)
ContextMatch: thought she was (matchCount: 549,531 , volCount:
325,822)
ContextMatch: thought the was (matchCount: 91 , volCount: 91)
April 27, 2015TCDL15 - Beyond eMOP
19 Outcomes – Github
juxta-CL
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/Juxta-cl/
 Juxta-CL(command line)
 created for eMOP
 based on JuxtaCommons tool
(juxtacommons.org/)
 several different comparison algorithms to
choose from and other options
 Levenshtein / Jaro-Winkler / Juxta
 ignore: punctuation, caps, hyphens
20
Outcomes-Github
emop-controller
• Powered by the eMOP DB
• Collection processing is managed via
the online Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
April 27, 2015TCDL15 - Beyond eMOP
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
April 27, 2015TCDL15 - Beyond eMOP
21
Outcomes - Tesseract Training
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/TesseractTraining/
 Font Training
 3 different types
 Roman, Italic, Blackletter
 12 different printers
 1564-1764
 40 different typeface combinations
 Even more combined typefaces training files
 Dictionaries
 from multiple sources with alternate spellings
 More
 Franken+ training files
 common OCR error file
April 27, 2015TCDL15 - Beyond eMOP
22
Outcomes – DB of EM Printers
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
 Parsing the imprint lines of all EEBO & ECCO
docs to create the ImprintDB
 Gathering:
 Printed By
 Printed For
 Sold by
 Location(“at the Rose and Crown in St. Paul's
Church-Yard”, “at the signe of the Traytors
head”)
 Place (London, Cambridge…)
 (Dates)
 Those docs with ESTC numbers will be
available via a public database
April 27, 2015TCDL15 - Beyond eMOP
23
Outcomes – Fulltext Search of
TCP Phase 1
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
18thconnect.org/search?a=EEBO&o=fulltext
 Phase 1 of TCP EEBO hand transcriptions will
be available for fulltext searching in
18thConnect
 over 25,000 documents
April 27, 2015TCDL15 - Beyond eMOP
24
Outcomes – Editable EEBO
OCR
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
 Can already use TypeWright in 18thConnect
to edit ECCO docs (without a subscription)
 Will soon be able to edit EEBO docs in
TypeWright
 TCP hand transcriptions
 eMOP OCR transcriptions
 When doc is fully corrected, users can
request text and/or XML versions of
corrected transcriptions for scholarly use.
 First time EEBO transcriptions will be available
for over 80,000 docs.
April 27, 2015TCDL15 - Beyond eMOP
25
Outcomes – Collection
Evaluation
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
 We’ve collected over 6 TB of data
 Our post-processing algorithms collect data:
 amount of skew
 amount of noise
 number and location of text columns
 page quality
 correctability
 corrections made
 Saved to the eMOP DB
 First time this has been done on these
collections
April 27, 2015TCDL15 - Beyond eMOP
26
Outcomes – Outreach
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection
Evaluation
• Outreach
 2013
 ASECS (Amer. Soc. of Eighteenth Century Scholars)
 KIAS (Kule Institute for Advanced Studies): Around the World
Symposium
 DocEng (ACM Symposium on Document Engineering)
 2014
 TxDHC
 TCDL
 DH (5 papers)
 SAA (pre-conference workshop on OCR)
 DHCS
 2015
 AAAI: (Association for the Advancement of Artificial Intelligence)
 TAMU Big Data Workshop
 Downstream from the Digital Humanities
 Penn St. OCR Workshop
 TxDHC
 TCDL
 DHSI (July)
The Future of eMOP
April 27, 2015TCDL15 - Beyond eMOP
27
 We want eMOP
to be viable long-
term
 Continue to be
developed/suppo
rted
 open-source
repos
 outreach
 Looking for
partners
from Kill Bill 2, 2004, Miramax
Future - Partners
 Hathi Trust
 Talking about possible “next-step” grant application
 Notre Dame Libraries
 Helping to recreate eMOP workflow on their systems to OCR
Cobbett's Complete Collection of State Trials
 Swapping labor
 Opportunity to test the robustness of our workflow and tools as a
whole unit
 Penn St. Libraries
 Held a workshop on OCR’ing with open source tools
 Opportunity to test the robustness of our workflow and tools as
discrete parts
April 27, 2015TCDL15 - Beyond eMOP
28
Future - Partners
 UT Austin
 Sub-awardee on an NEH grant to OCR Primeros Libros collection
[primeroslibros.org/]
 $$ for continued development of workflow and improved
hardware
 Opportunity to add another OCR engine and further test the
robustness of the workflow on other document sets
 step towards a voting algorithm
 Adam Matthew Digital
 Going to OCR some of their EM collection to see how we do;
maybe more if good results
 Opportunity to test robustness on similar collection; establish a
relationship with another industry leader
 possibly acquire more data to ingest into ARC nodes and
TypeWright
April 27, 2015TCDL15 - Beyond eMOP
29
Future - Partners
 Austin Fanzine Project
 Small, local project; personally interesting
 Opportunity to test tools and workflow on a different
collection and whole other type of documents (not that
dissimilar)
 Hathi Trust Research Center
 Talking about possibility of using our algorithms on their imprint
data to expand and share Imprint DB
 Opportunity to acquire more EM publishing data and make
available in one place
April 27, 2015TCDL15 - Beyond eMOP
30
Call to Libraries
April 27, 2015TCDL15 - Beyond eMOP
31
 Use the Code
 Give us feedback
 Develop the Code
 Create branches in Git
 Commit improvements/changes
 Contact Us
 Consultation
 Partnership
 for Labor or Funds
The end
For eMOP questions please
contact us at :
mchristy@tamu.edu
egrumbac@tamu.edu
32
TCDL15 - Beyond eMOP April 27, 2015

More Related Content

What's hot

Tamu big data-conf-1b
Tamu big data-conf-1bTamu big data-conf-1b
Tamu big data-conf-1bMatt Christy
 
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsMatt Christy
 
Dh2014 e mopcobre-complete
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-completeLaura Mandell
 
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4WMigrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4WPMB-BUG
 
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...InfinIT - Innovationsnetværket for it
 

What's hot (7)

Tamu big data-conf-1b
Tamu big data-conf-1bTamu big data-conf-1b
Tamu big data-conf-1b
 
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
 
Dh2014 e mopcobre-complete
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4WMigrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W
 
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
 

Viewers also liked

BrailleOCR: An Open Source Document to Braille Converter Application
BrailleOCR: An Open Source Document to Braille Converter ApplicationBrailleOCR: An Open Source Document to Braille Converter Application
BrailleOCR: An Open Source Document to Braille Converter Applicationpijush15
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructuresSolin TEM
 
reelyActive Brick & Mortar Retail Solution
reelyActive Brick & Mortar Retail SolutionreelyActive Brick & Mortar Retail Solution
reelyActive Brick & Mortar Retail SolutionreelyActive
 
手機自動化測試和持續整合
手機自動化測試和持續整合手機自動化測試和持續整合
手機自動化測試和持續整合Carl Su
 

Viewers also liked (9)

BrailleOCR: An Open Source Document to Braille Converter Application
BrailleOCR: An Open Source Document to Braille Converter ApplicationBrailleOCR: An Open Source Document to Braille Converter Application
BrailleOCR: An Open Source Document to Braille Converter Application
 
Tiny Google Projects
Tiny Google ProjectsTiny Google Projects
Tiny Google Projects
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructures
 
reelyActive Brick & Mortar Retail Solution
reelyActive Brick & Mortar Retail SolutionreelyActive Brick & Mortar Retail Solution
reelyActive Brick & Mortar Retail Solution
 
OCR using Tesseract
OCR using TesseractOCR using Tesseract
OCR using Tesseract
 
Node way
Node wayNode way
Node way
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
手機自動化測試和持續整合
手機自動化測試和持續整合手機自動化測試和持續整合
手機自動化測試和持續整合
 

Similar to TCDL15 Beyond eMOP

Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProopenminted_eu
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine LearningJoaquin Vanschoren
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015RIILP
 
Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414Space Codesign
 
Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414Gary Dare
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaopenseesdays
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring databodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring dataSara-Jayne Terp
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Build your own speech to text dataset in 30 days
Build your own speech to text dataset in 30 daysBuild your own speech to text dataset in 30 days
Build your own speech to text dataset in 30 daysDmytro Naumov
 
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptxGauravGamer2
 
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...TAUS - The Language Data Network
 
Numerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using MetaforNumerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using MetaforRomain Boman
 
Morphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometryMorphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometrySilvie Cinková
 

Similar to TCDL15 Beyond eMOP (20)

Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKPro
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
 
Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414
 
Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Lecture semantic augmentation
Lecture semantic augmentationLecture semantic augmentation
Lecture semantic augmentation
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Lecture 1 (bce-7)
Lecture   1 (bce-7)Lecture   1 (bce-7)
Lecture 1 (bce-7)
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Intro_to_ML
Intro_to_MLIntro_to_ML
Intro_to_ML
 
Build your own speech to text dataset in 30 days
Build your own speech to text dataset in 30 daysBuild your own speech to text dataset in 30 days
Build your own speech to text dataset in 30 days
 
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
 
MTM 2015
MTM 2015MTM 2015
MTM 2015
 
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...
TAUS Machine Translation Showcase, The Simplified Guide to Getting Started in...
 
Numerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using MetaforNumerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using Metafor
 
Morphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometryMorphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometry
 
Final presentation
Final presentationFinal presentation
Final presentation
 

Recently uploaded

Basic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationBasic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationNeilDeclaro1
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answersdalebeck957
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxUmeshTimilsina1
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 

Recently uploaded (20)

Basic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationBasic Intentional Injuries Health Education
Basic Intentional Injuries Health Education
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answers
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 

TCDL15 Beyond eMOP

  • 1. Beyond the Early Modern OCR Project or, What are We Going to do With Ourselves Now? Matthew Christy, Laura Mandell, Elizabeth Grumbach
  • 2.  emop.tamu.edu/  Texas A&M Big Data Workshop  emop.tamu.edu/TAMU- BigData  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 3 April 27, 2015TCDL15 - Beyond eMOP
  • 3.  The Early Modern OCR Project (eMOP) is an  Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to  develop and test tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents  from the hand press period, roughly 1475-1800.  eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. 4 TCDL15 - Beyond eMOP Goals April 27, 2015
  • 4. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)  Total: >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed docuemnts  44,000 EEBO  2,200 ECCO 5 April 27, 2015TCDL15 - Beyond eMOP
  • 5. TCDL15 - Beyond eMOP 6 April 27, 2015
  • 6. 7 • PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK • SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign • PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University • The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University • The Brazos High Performance Computing Cluster (HPCC) OurPartners April 27, 2015TCDL15 - Beyond eMOP
  • 7. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking  Old, low-quality, small tiff files  Noise, skew, warp, bleedthrough, 8 April 27, 2015TCDL15 - Beyond eMOP
  • 8. Page Images TCDL15 - Beyond eMOP 9 April 27, 2015
  • 9. Results  ECCO  Avg. Scores  309,328 pages  86% correct  Removing 1st page from score  87% correct  57% correct on 1st pages  Comparison to Prime Recognition’s OCR  93% claim  (still running analysis)  EEBO  Avg. Scores  1,475,026 pages  68% correct  Removing 1st page from score  68%  No previous OCR to compare to April 27, 2015TCDL15 - Beyond eMOP 10
  • 10. eMOP Outcomes - Github April 27, 2015TCDL15 - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 11
  • 11. April 27, 2015TCDL15 - Beyond eMOP 12 Outcomes – Github Franken+ • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 1. Windows based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC Graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP Undergraduate student workers 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images that Tesseract's first stage of native training. early-modern-ocr.github.io/FrankenPlus/
  • 12. April 27, 2015TCDL15 - Beyond eMOP 13 Outcomes – Github emop-dashboard • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/emop-dashboard/
  • 13. April 27, 2015TCDL15 - Beyond eMOP 14 Outcomes – Github hOCR deNoising • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/hOCR-De-Noising/
  • 14. April 27, 2015TCDL15 - Beyond eMOP 15 Outcomes – Github hOCR deNoising • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach Before After
  • 15. April 27, 2015TCDL15 - Beyond eMOP 16 Outcomes – Github page evaluator • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-evaluator/  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  • 16. April 27, 2015TCDL15 - Beyond eMOP 17 Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-corrector/ 1. Preliminary cleanup  remove punctuation from begin/end of tokens & empty lines  combine hyphenated tokens at end of lines  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3-gram dataset tbat I thoughc Ihe Was
  • 17. April 27, 2015TCDL15 - Beyond eMOP 18 window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach tbat I thoughc Ihe Was window: l thoughc Ihe Candidates used for context matching: ■ l -> Set(l) ■ thoughc -> Set(thoughc, thought, though) ■ Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121 , volCount: 120) ContextMatch: though ike was (matchCount: 65 , volCount: 59) ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965) ContextMatch: though the was (matchCount: 197 , volCount: 196) ContextMatch: thought ice was (matchCount: 45 , volCount: 45) ContextMatch: thought ike was (matchCount: 112 , volCount: 108) ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822) ContextMatch: thought the was (matchCount: 91 , volCount: 91)
  • 18. April 27, 2015TCDL15 - Beyond eMOP 19 Outcomes – Github juxta-CL • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/Juxta-cl/  Juxta-CL(command line)  created for eMOP  based on JuxtaCommons tool (juxtacommons.org/)  several different comparison algorithms to choose from and other options  Levenshtein / Jaro-Winkler / Juxta  ignore: punctuation, caps, hyphens
  • 19. 20 Outcomes-Github emop-controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py April 27, 2015TCDL15 - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach
  • 20. April 27, 2015TCDL15 - Beyond eMOP 21 Outcomes - Tesseract Training • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/TesseractTraining/  Font Training  3 different types  Roman, Italic, Blackletter  12 different printers  1564-1764  40 different typeface combinations  Even more combined typefaces training files  Dictionaries  from multiple sources with alternate spellings  More  Franken+ training files  common OCR error file
  • 21. April 27, 2015TCDL15 - Beyond eMOP 22 Outcomes – DB of EM Printers • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Parsing the imprint lines of all EEBO & ECCO docs to create the ImprintDB  Gathering:  Printed By  Printed For  Sold by  Location(“at the Rose and Crown in St. Paul's Church-Yard”, “at the signe of the Traytors head”)  Place (London, Cambridge…)  (Dates)  Those docs with ESTC numbers will be available via a public database
  • 22. April 27, 2015TCDL15 - Beyond eMOP 23 Outcomes – Fulltext Search of TCP Phase 1 • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 18thconnect.org/search?a=EEBO&o=fulltext  Phase 1 of TCP EEBO hand transcriptions will be available for fulltext searching in 18thConnect  over 25,000 documents
  • 23. April 27, 2015TCDL15 - Beyond eMOP 24 Outcomes – Editable EEBO OCR • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Can already use TypeWright in 18thConnect to edit ECCO docs (without a subscription)  Will soon be able to edit EEBO docs in TypeWright  TCP hand transcriptions  eMOP OCR transcriptions  When doc is fully corrected, users can request text and/or XML versions of corrected transcriptions for scholarly use.  First time EEBO transcriptions will be available for over 80,000 docs.
  • 24. April 27, 2015TCDL15 - Beyond eMOP 25 Outcomes – Collection Evaluation • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  We’ve collected over 6 TB of data  Our post-processing algorithms collect data:  amount of skew  amount of noise  number and location of text columns  page quality  correctability  corrections made  Saved to the eMOP DB  First time this has been done on these collections
  • 25. April 27, 2015TCDL15 - Beyond eMOP 26 Outcomes – Outreach • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  2013  ASECS (Amer. Soc. of Eighteenth Century Scholars)  KIAS (Kule Institute for Advanced Studies): Around the World Symposium  DocEng (ACM Symposium on Document Engineering)  2014  TxDHC  TCDL  DH (5 papers)  SAA (pre-conference workshop on OCR)  DHCS  2015  AAAI: (Association for the Advancement of Artificial Intelligence)  TAMU Big Data Workshop  Downstream from the Digital Humanities  Penn St. OCR Workshop  TxDHC  TCDL  DHSI (July)
  • 26. The Future of eMOP April 27, 2015TCDL15 - Beyond eMOP 27  We want eMOP to be viable long- term  Continue to be developed/suppo rted  open-source repos  outreach  Looking for partners from Kill Bill 2, 2004, Miramax
  • 27. Future - Partners  Hathi Trust  Talking about possible “next-step” grant application  Notre Dame Libraries  Helping to recreate eMOP workflow on their systems to OCR Cobbett's Complete Collection of State Trials  Swapping labor  Opportunity to test the robustness of our workflow and tools as a whole unit  Penn St. Libraries  Held a workshop on OCR’ing with open source tools  Opportunity to test the robustness of our workflow and tools as discrete parts April 27, 2015TCDL15 - Beyond eMOP 28
  • 28. Future - Partners  UT Austin  Sub-awardee on an NEH grant to OCR Primeros Libros collection [primeroslibros.org/]  $$ for continued development of workflow and improved hardware  Opportunity to add another OCR engine and further test the robustness of the workflow on other document sets  step towards a voting algorithm  Adam Matthew Digital  Going to OCR some of their EM collection to see how we do; maybe more if good results  Opportunity to test robustness on similar collection; establish a relationship with another industry leader  possibly acquire more data to ingest into ARC nodes and TypeWright April 27, 2015TCDL15 - Beyond eMOP 29
  • 29. Future - Partners  Austin Fanzine Project  Small, local project; personally interesting  Opportunity to test tools and workflow on a different collection and whole other type of documents (not that dissimilar)  Hathi Trust Research Center  Talking about possibility of using our algorithms on their imprint data to expand and share Imprint DB  Opportunity to acquire more EM publishing data and make available in one place April 27, 2015TCDL15 - Beyond eMOP 30
  • 30. Call to Libraries April 27, 2015TCDL15 - Beyond eMOP 31  Use the Code  Give us feedback  Develop the Code  Create branches in Git  Commit improvements/changes  Contact Us  Consultation  Partnership  for Labor or Funds
  • 31. The end For eMOP questions please contact us at : mchristy@tamu.edu egrumbac@tamu.edu 32 TCDL15 - Beyond eMOP April 27, 2015

Editor's Notes

  1. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  2. These are not big numbers by STEM standards, but for the Humanities they are: This is the largest dataset for any open-source, academic OCR project in North American, ever For the Humanities research data is about quality of content rather than size Our OCR must be as good as possible in order to generate reliable research
  3. Besides Proquest, Gale, and the TCP we have many technical partners. Also advisers from Impact at the KB and Salford.
  4. Some were great most were not Noisy Skewed Warped Or they posed challenges for OCR engines Multiple pages per image Multiple columns Images & decorative elements Marginalia Missing margins many were terrible
  5. Dashboard provides access to the DB via an API Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS