DLF Forum 2015: Beyond eMOP

Matt Christy
Matt ChristyLead Software Applications Developer
Beyond the Early Modern OCR
Project
or, What are We Going to do With Ourselves Now?
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
 emop.tamu.edu/
 2015 DLF Forum– Beyond eMOP
 emop.tamu.edu/presentations/
#DLF15
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/Mell
on/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 The Early Modern OCR
Project
 Twitter
 #emop
 @IDHMC_Nexus
 @mandellc
 @matt_christy
 @EMGrumbach
2
October 27, 20152015 DLF Forum - Beyond eMOP
 The Early Modern OCR Project (eMOP) is an
 Andrew W. Mellon Foundation funded grant project
running out of the Initiative for Digital Humanities, Media,
and Culture (IDHMC) at Texas A&M University, to
 develop and test open-source tools and techniques to
apply Optical Character Recognition (OCR) to early
modern English documents
 from the hand press period, roughly 1475-1800.
 eMOP aims to improve the visibility of early modern
texts by making their contents fully searchable. The
current paradigm of searching special collections for
early modern materials by either metadata alone or
“dirty” OCR is insufficient for scholarly research.
3
2015 DLF Forum - Beyond eMOP
Goals
October 27, 2015
The Numbers
Page Images
 Early English Books online
(Proquest) EEBO: ~125,000
documents, ~13 million pages images
(1475-1700)
 Eighteenth Century Collections
Online (Gale Cengage) ECCO:
~182,000 documents, ~32 million
page images (1700-1800)
Ground Truth  Text Creation Partnership TCP:
~46,000 double-keyed hand
transcribed documents
 44,000 EEBO
 2,200 ECCO
4
October 27, 20152015 DLF Forum - Beyond eMOP
Total: >300,000 documents & 45 million page images.
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters & ligatures
 Spelling variations
 Mixed typefaces and
languages
 over/under-inking,
bleedthrough
 Old, low-quality, small tiff files
 Noise, skew, warp,
5
October 27, 20152015 DLF Forum - Beyond eMOP
Page Images
2015 DLF Forum - Beyond eMOP
6
October 27, 2015
Results
 ECCO
 Avg. Scores
 309,328 pages
 86% correct
 57% correct on 1st pages
 Comparison to Prime
Recognition’s OCR
 93% claim
 89% correct by our metrics
 EEBO
 Avg. Scores
 1,475,026 pages
 68% correct
 No previous OCR to
compare to
October 27, 20152015 DLF Forum - Beyond eMOP
7
Using Google’s Tesseract open-source OCR engine with eMOP
created training
Results (cont)
 Breakdown against Groundtruth
 Out of ~1.8 million pages of GT
 82% EEBO
 18% ECCO
 We have GT for
 25.4% of EEBO
 0.98% of ECCO
October 27, 20152015 DLF Forum - Beyond eMOP
8
% Correct # Pages % Pages
90-100%: 354,358 19.5%
80-89%: 524,825 28.9%
70-79%: 316,511 17.4%
60-69%: 193,708 10.7%
50-59%: 125,060 6.9%
40-49%: 85,875 4.7%
30-39%: 61,992 3.4%
20-29%: 48,539 2.7%
10-19%: 43,776 2.4%
0-9%: 60,912 3.4%
48.4
%
eMOP Outcomes - Github
October 27, 20152015 DLF Forum - Beyond eMOP
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
9
October 27, 20152015 DLF Forum - Beyond eMOP
10 Outcomes – Github
Franken+
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
1. Windows based tool that
uses a MySQL DB.
2. Developed for eMOP by
IDHMC Graduate student
worker Bryan Tarpley.
3. Designed to be easily
used by eMOP
Undergraduate student
workers
4. Takes Aletheia's output
files as input.
5. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training.
early-modern-ocr.github.io/FrankenPlus/
October 27, 20152015 DLF Forum - Beyond eMOP
11 Outcomes – Github
emop-dashboard
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/emop-dashboard/
12
Outcomes-Github
emop-controller
• Powered by the eMOP DB
• Collection processing is managed via the
online Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
October 27, 20152015 DLF Forum - Beyond eMOP
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
October 27, 20152015 DLF Forum - Beyond eMOP
13 Outcomes – Github
hOCR deNoising
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/hOCR-De-Noising/
Before
After
October 27, 20152015 DLF Forum - Beyond eMOP
14 Outcomes – Github
page evaluator
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/page-evaluator/
 Determine how correctable a
page’s OCR results are by
examining the text.
 The score is based on the ratio of
words that fit the correctable
profile to the total number of
words
Correctable Profile
1. Clean tokens:
 remove leading and trailing
punctuation
 remaining token must have at
least 3 letters
2. Spell check tokens >1 character
3. Check token profile :
 contain at most 2 non-alpha
characters, and
 at least 1 alpha character,
 have a length of at least 3,
 and do not contain 4 or more
repeated characters in a run
4. Also consider length of tokens
compared to average for the
page
October 27, 20152015 DLF Forum - Beyond eMOP
15 Outcomes – Github
page corrector
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/page-corrector/
1. Preliminary cleanup
 remove punctuation from begin/end of tokens & empty lines
 combine hyphenated tokens at end of lines
 retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period specific dictionary
lookups to gather suggestions for words.
 transformation rules: rn->m; c->e; 1->l; e
3. Use context checking on a sliding window of 3 words, and their
suggested changes, to find the best context matches in
our(sanitized, period-specific) Google 3-gram dataset
tbat I thoughc Ihe Was
October 27, 20152015 DLF Forum - Beyond eMOP
16
window: tbat l thoughc
Candidates used for context matching:
 tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat,
tba, ilial, abat, tbat, teat)
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844 , volCount:
1474)
Outcomes – Github
page corrector
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
tbat I thoughc Ihe Was
window: l thoughc Ihe
Candidates used for context matching:
■ l -> Set(l)
■ thoughc -> Set(thoughc, thought, though)
■ Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide,
ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497 , volCount: 486)
ContextMatch: l thought she (matchCount: 1538 , volCount: 997)
ContextMatch: l thought the (matchCount: 2496 , volCount: 1905)
window: thoughc Ihe Was
Candidates used for context matching:
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene,
ice, inc, tho, ime, ite, ive, the)
 Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121 , volCount: 120)
ContextMatch: though ike was (matchCount: 65 , volCount: 59)
ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965)
ContextMatch: though the was (matchCount: 197 , volCount: 196)
ContextMatch: thought ice was (matchCount: 45 , volCount: 45)
ContextMatch: thought ike was (matchCount: 112 , volCount: 108)
ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822)
ContextMatch: thought the was (matchCount: 91 , volCount: 91)
that I thought she was
October 27, 20152015 DLF Forum - Beyond eMOP
17 Outcomes – Github
juxta-CL
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/Juxta-cl/
 Juxta-CL(command line)
 created for eMOP
 based on JuxtaCommons tool (juxtacommons.org/)
created by Performant Software
 several different comparison algorithms to choose
from and other options
 Levenshtein / Jaro-Winkler / Juxta
 ignore: punctuation, caps, hyphens
October 27, 20152015 DLF Forum - Beyond eMOP
18 Outcomes – Github
Cobre
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/Cobre/
October 27, 20152015 DLF Forum - Beyond eMOP
19
Outcomes - Tesseract Training
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
early-modern-ocr.github.io/TesseractTraining/
 Font Training
 3 different types
 Roman, Italic, Blackletter
 12 different printers
 1564-1764
 40 different typeface combinations
 Even more combined typefaces training files
 EM Dictionaries
 from multiple sources with alternate spellings
 More
 Franken+ training files
 common OCR error file
October 27, 20152015 DLF Forum - Beyond eMOP
20
Outcomes – DB of EM Printers
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
 Parsing the imprint lines of all EEBO & ECCO
docs to create the ImprintDB
 Gathering:
 Printed By
 Printed For
 Sold by
 Location(“at the Rose and Crown in St. Paul's
Church-Yard”, “at the signe of the Traytors head”)
 Place (London, Cambridge…)
 (Dates)
 Those docs with ESTC numbers will be available
via a public database
early-modern-ocr.github.io/ImprintDB/
 Phase 1 of TCP EEBO hand transcriptions are
now available for fulltext searching in
18thConnect
 over 25,000 documents
October 27, 20152015 DLF Forum - Beyond eMOP
21
Outcomes – Fulltext Search of
TCP Phase 1
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
18thconnect.org/search?a=EEBO&o=fulltext
October 27, 20152015 DLF Forum - Beyond eMOP
22
Outcomes – Editable EEBO OCR
– TypeWright
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
 Can already use TypeWright in 18thConnect to
edit ECCO docs (without a subscription)
 Will soon be able to edit EEBO docs in
TypeWright
 TCP hand transcriptions
 eMOP OCR transcriptions
 When doc is fully corrected, users can request
text and/or XML versions of corrected
transcriptions for scholarly use.
 First time EEBO transcriptions will be available
for over 80,000 docs.
18thconnect.org/typewright
October 27, 20152015 DLF Forum - Beyond eMOP
23
Outcomes – Collection
Evaluation
• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase
1
• EEBO OCR
• Collection
Evaluation
• Outreach
 We’ve collected over 6 TB of data
 Our post-processing algorithms collect data:
 amount of skew
 amount of noise
 number and location of text columns
 page quality
 correctability
 corrections made
 Saved to the eMOP DB
 First time this has been done on these collections
 Pre-process and re-OCR pages
 several iterations will identify bad page images
The Future of eMOP
October 27, 20152015 DLF Forum - Beyond eMOP
24
 We want eMOP to
be viable long-term
 Continue to be
developed/supporte
d
 open-source repos
 outreach
 Looking for partnersfrom Kill Bill 2, 2004, Miramax
Future – Work
 ImprintDB
 We want to turn this into an online DB with functionality to
 Query & download results
 Recommend corrections to data
 Analyze Data
 We have collected a mass of data on nearly 45 million pages:
 Skew & noise measures
 Column coordinates
 Corrections made
 Scores & Estimates for “correctability”
 We can use that data to improve the OCR workflow and judge its
efficacy
 Re-OCR
 Using skew & noise measures & column info we can pre-process
pages and re-OCR for better results
October 27, 20152015 DLF Forum - Beyond eMOP
25
Future – Funded Work
 NEH Implementation Grant
Reading the First Books
 OCR the first printed works of the New World
 Using Ocular OCR engine with multiple language recognition
 Opportunity to integrate another OCR engine into our workflow, and
continue to develop our workflow and DB to work with other
collections.
October 27, 20152015 DLF Forum - Beyond eMOP
26
Future – Partnerships
Notre Dame Libraries
Penn St. Libraries
Adam Matthew Digital
Visualizing English Print
U of S Carolina – Center for Digital Humanities
 Consultation
 Workshops
 Share labor / resources
 Test the robustness of our workflow in parts or as a whole on
different collections and systems
 Test how text analysis, topic modeling, lexical mapping, etc. can
improve our data and results
October 27, 20152015 DLF Forum - Beyond eMOP
27
The end
For eMOP questions please
contact us at :
mchristy@tamu.edu
egrumbac@tamu.edu
mandell@tamu.edu
28
2015 DLF Forum - Beyond eMOP October 27, 2015
1 of 28

Recommended

TCDL15 Beyond eMOP by
TCDL15 Beyond eMOPTCDL15 Beyond eMOP
TCDL15 Beyond eMOPMatt Christy
1.3K views31 slides
eMOP-PennSt-lunch by
eMOP-PennSt-luncheMOP-PennSt-lunch
eMOP-PennSt-lunchMatt Christy
1.4K views41 slides
mchristy-DH2014-emop-bookhistory-tools by
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsMatt Christy
2.8K views21 slides
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards by
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsLiz Grumbach
2.1K views27 slides
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB by
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBDigital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBMatt Christy
1.2K views12 slides
Mchristy-eMOP-workflows2-24x7 by
Mchristy-eMOP-workflows2-24x7Mchristy-eMOP-workflows2-24x7
Mchristy-eMOP-workflows2-24x7Matt Christy
1.7K views24 slides

More Related Content

What's hot

Tamu big data-conf-1b by
Tamu big data-conf-1bTamu big data-conf-1b
Tamu big data-conf-1bMatt Christy
1.3K views10 slides
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools by
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsMatt Christy
2.2K views48 slides
Dh2014 e mopcobre-complete by
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-completeLaura Mandell
2.9K views45 slides
Learning to assess Linked Data relationships using Genetic Programming by
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingVrije Universiteit Amsterdam
664 views28 slides
Aspects of NLP Practice by
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP PracticeVsevolod Dyomkin
1K views37 slides
Semantic web, python, construction industry by
Semantic web, python, construction industrySemantic web, python, construction industry
Semantic web, python, construction industryReinout van Rees
879 views7 slides

What's hot(12)

Tamu big data-conf-1b by Matt Christy
Tamu big data-conf-1bTamu big data-conf-1b
Tamu big data-conf-1b
Matt Christy1.3K views
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools by Matt Christy
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
Matt Christy2.2K views
Dh2014 e mopcobre-complete by Laura Mandell
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
Laura Mandell2.9K views
Semantic web, python, construction industry by Reinout van Rees
Semantic web, python, construction industrySemantic web, python, construction industry
Semantic web, python, construction industry
Reinout van Rees879 views
The Semantic Web - Interacting with the Unknown by Steffen Staab
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the Unknown
Steffen Staab1.1K views
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W by PMB-BUG
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4WMigrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W
PMB-BUG1.1K views
Chances and Challenges in Comparing Cross-Language Retrieval Tools by Giovanna Roda
Chances and Challenges in Comparing Cross-Language Retrieval ToolsChances and Challenges in Comparing Cross-Language Retrieval Tools
Chances and Challenges in Comparing Cross-Language Retrieval Tools
Giovanna Roda479 views
Quick tour all handout by Yi-Shin Chen
Quick tour all handoutQuick tour all handout
Quick tour all handout
Yi-Shin Chen615 views

Similar to DLF Forum 2015: Beyond eMOP

Space Codesign at TandemLaunch 20150414 by
Space Codesign at TandemLaunch 20150414Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414Space Codesign
732 views19 slides
Space Codesign at TandemLaunch Lunch & Learn 20150414 by
Space Codesign at TandemLaunch Lunch & Learn 20150414Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414Gary Dare
574 views19 slides
Python ml by
Python mlPython ml
Python mlShubham Sharma
170 views29 slides
MTM 2015 by
MTM 2015MTM 2015
MTM 2015Matīss
635 views60 slides
Infrastructure crossroads... and the way we walked them in DKPro by
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProopenminted_eu
505 views18 slides
Session 1 - The Current Landscape of Big Data Benchmarks by
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksDataBench
43 views33 slides

Similar to DLF Forum 2015: Beyond eMOP(20)

Space Codesign at TandemLaunch 20150414 by Space Codesign
Space Codesign at TandemLaunch 20150414Space Codesign at TandemLaunch 20150414
Space Codesign at TandemLaunch 20150414
Space Codesign732 views
Space Codesign at TandemLaunch Lunch & Learn 20150414 by Gary Dare
Space Codesign at TandemLaunch Lunch & Learn 20150414Space Codesign at TandemLaunch Lunch & Learn 20150414
Space Codesign at TandemLaunch Lunch & Learn 20150414
Gary Dare574 views
MTM 2015 by Matīss
MTM 2015MTM 2015
MTM 2015
Matīss 635 views
Infrastructure crossroads... and the way we walked them in DKPro by openminted_eu
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKPro
openminted_eu505 views
Session 1 - The Current Landscape of Big Data Benchmarks by DataBench
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data Benchmarks
DataBench43 views
Array computing and the evolution of SciPy, NumPy, and PyData by Travis Oliphant
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
Travis Oliphant515 views
from ai.backend import python @ pycontw2018 by Chun-Yu Tseng
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018
Chun-Yu Tseng889 views
Perl::Lint - Yet Another Perl Source Code Linter by moznion
Perl::Lint - Yet Another Perl Source Code LinterPerl::Lint - Yet Another Perl Source Code Linter
Perl::Lint - Yet Another Perl Source Code Linter
moznion3.3K views
Tensorflow on Android by Koan-Sin Tan
Tensorflow on AndroidTensorflow on Android
Tensorflow on Android
Koan-Sin Tan5.7K views
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx by GauravGamer2
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
2R-3KS03-OOP_UNIT-I (Part-A)_2023-24.pptx
GauravGamer21 view
1st MoveIt! Community Meeting by moveitrobot
1st MoveIt! Community Meeting1st MoveIt! Community Meeting
1st MoveIt! Community Meeting
moveitrobot3.4K views
“Houston, we have a model...” Introduction to MLOps by Rui Quintino
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
Rui Quintino228 views
Technical debt management strategies by Raquel Pau
Technical debt management strategiesTechnical debt management strategies
Technical debt management strategies
Raquel Pau1.4K views
ChefConf 2014 - Kanban will save You! Will it? by Marcin Mazurek
ChefConf 2014 - Kanban will save You! Will it?ChefConf 2014 - Kanban will save You! Will it?
ChefConf 2014 - Kanban will save You! Will it?
Marcin Mazurek970 views
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar by confluent
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
confluent4K views

Recently uploaded

11.28.23 Social Capital and Social Exclusion.pptx by
11.28.23 Social Capital and Social Exclusion.pptx11.28.23 Social Capital and Social Exclusion.pptx
11.28.23 Social Capital and Social Exclusion.pptxmary850239
112 views25 slides
Community-led Open Access Publishing webinar.pptx by
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptxJisc
69 views9 slides
AI Tools for Business and Startups by
AI Tools for Business and StartupsAI Tools for Business and Startups
AI Tools for Business and StartupsSvetlin Nakov
89 views39 slides
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) by
 Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)AnshulDewangan3
275 views12 slides
CWP_23995_2013_17_11_2023_FINAL_ORDER.pdf by
CWP_23995_2013_17_11_2023_FINAL_ORDER.pdfCWP_23995_2013_17_11_2023_FINAL_ORDER.pdf
CWP_23995_2013_17_11_2023_FINAL_ORDER.pdfSukhwinderSingh895865
501 views6 slides
Are we onboard yet University of Sussex.pptx by
Are we onboard yet University of Sussex.pptxAre we onboard yet University of Sussex.pptx
Are we onboard yet University of Sussex.pptxJisc
71 views7 slides

Recently uploaded(20)

11.28.23 Social Capital and Social Exclusion.pptx by mary850239
11.28.23 Social Capital and Social Exclusion.pptx11.28.23 Social Capital and Social Exclusion.pptx
11.28.23 Social Capital and Social Exclusion.pptx
mary850239112 views
Community-led Open Access Publishing webinar.pptx by Jisc
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptx
Jisc69 views
AI Tools for Business and Startups by Svetlin Nakov
AI Tools for Business and StartupsAI Tools for Business and Startups
AI Tools for Business and Startups
Svetlin Nakov89 views
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) by AnshulDewangan3
 Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
AnshulDewangan3275 views
Are we onboard yet University of Sussex.pptx by Jisc
Are we onboard yet University of Sussex.pptxAre we onboard yet University of Sussex.pptx
Are we onboard yet University of Sussex.pptx
Jisc71 views
Classification of crude drugs.pptx by GayatriPatra14
Classification of crude drugs.pptxClassification of crude drugs.pptx
Classification of crude drugs.pptx
GayatriPatra1465 views
Universe revised.pdf by DrHafizKosar
Universe revised.pdfUniverse revised.pdf
Universe revised.pdf
DrHafizKosar108 views
Structure and Functions of Cell.pdf by Nithya Murugan
Structure and Functions of Cell.pdfStructure and Functions of Cell.pdf
Structure and Functions of Cell.pdf
Nithya Murugan317 views
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx by Inge de Waard
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptxOEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx
Inge de Waard165 views
Use of Probiotics in Aquaculture.pptx by AKSHAY MANDAL
Use of Probiotics in Aquaculture.pptxUse of Probiotics in Aquaculture.pptx
Use of Probiotics in Aquaculture.pptx
AKSHAY MANDAL81 views
NS3 Unit 2 Life processes of animals.pptx by manuelaromero2013
NS3 Unit 2 Life processes of animals.pptxNS3 Unit 2 Life processes of animals.pptx
NS3 Unit 2 Life processes of animals.pptx
manuelaromero2013102 views
Narration lesson plan.docx by TARIQ KHAN
Narration lesson plan.docxNarration lesson plan.docx
Narration lesson plan.docx
TARIQ KHAN99 views
Class 10 English lesson plans by TARIQ KHAN
Class 10 English  lesson plansClass 10 English  lesson plans
Class 10 English lesson plans
TARIQ KHAN239 views

DLF Forum 2015: Beyond eMOP

  • 1. Beyond the Early Modern OCR Project or, What are We Going to do With Ourselves Now? Matthew Christy, Laura Mandell, Elizabeth Grumbach
  • 2.  emop.tamu.edu/  2015 DLF Forum– Beyond eMOP  emop.tamu.edu/presentations/ #DLF15  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/Mell on/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 2 October 27, 20152015 DLF Forum - Beyond eMOP
  • 3.  The Early Modern OCR Project (eMOP) is an  Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to  develop and test open-source tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents  from the hand press period, roughly 1475-1800.  eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. 3 2015 DLF Forum - Beyond eMOP Goals October 27, 2015
  • 4. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800) Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed documents  44,000 EEBO  2,200 ECCO 4 October 27, 20152015 DLF Forum - Beyond eMOP Total: >300,000 documents & 45 million page images.
  • 5. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking, bleedthrough  Old, low-quality, small tiff files  Noise, skew, warp, 5 October 27, 20152015 DLF Forum - Beyond eMOP
  • 6. Page Images 2015 DLF Forum - Beyond eMOP 6 October 27, 2015
  • 7. Results  ECCO  Avg. Scores  309,328 pages  86% correct  57% correct on 1st pages  Comparison to Prime Recognition’s OCR  93% claim  89% correct by our metrics  EEBO  Avg. Scores  1,475,026 pages  68% correct  No previous OCR to compare to October 27, 20152015 DLF Forum - Beyond eMOP 7 Using Google’s Tesseract open-source OCR engine with eMOP created training
  • 8. Results (cont)  Breakdown against Groundtruth  Out of ~1.8 million pages of GT  82% EEBO  18% ECCO  We have GT for  25.4% of EEBO  0.98% of ECCO October 27, 20152015 DLF Forum - Beyond eMOP 8 % Correct # Pages % Pages 90-100%: 354,358 19.5% 80-89%: 524,825 28.9% 70-79%: 316,511 17.4% 60-69%: 193,708 10.7% 50-59%: 125,060 6.9% 40-49%: 85,875 4.7% 30-39%: 61,992 3.4% 20-29%: 48,539 2.7% 10-19%: 43,776 2.4% 0-9%: 60,912 3.4% 48.4 %
  • 9. eMOP Outcomes - Github October 27, 20152015 DLF Forum - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 9
  • 10. October 27, 20152015 DLF Forum - Beyond eMOP 10 Outcomes – Github Franken+ • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 1. Windows based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC Graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP Undergraduate student workers 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images that Tesseract's first stage of native training. early-modern-ocr.github.io/FrankenPlus/
  • 11. October 27, 20152015 DLF Forum - Beyond eMOP 11 Outcomes – Github emop-dashboard • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/emop-dashboard/
  • 12. 12 Outcomes-Github emop-controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py October 27, 20152015 DLF Forum - Beyond eMOP • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach
  • 13. October 27, 20152015 DLF Forum - Beyond eMOP 13 Outcomes – Github hOCR deNoising • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/hOCR-De-Noising/ Before After
  • 14. October 27, 20152015 DLF Forum - Beyond eMOP 14 Outcomes – Github page evaluator • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-evaluator/  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  • 15. October 27, 20152015 DLF Forum - Beyond eMOP 15 Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/page-corrector/ 1. Preliminary cleanup  remove punctuation from begin/end of tokens & empty lines  combine hyphenated tokens at end of lines  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3-gram dataset tbat I thoughc Ihe Was
  • 16. October 27, 20152015 DLF Forum - Beyond eMOP 16 window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) Outcomes – Github page corrector • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach tbat I thoughc Ihe Was window: l thoughc Ihe Candidates used for context matching: ■ l -> Set(l) ■ thoughc -> Set(thoughc, thought, though) ■ Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121 , volCount: 120) ContextMatch: though ike was (matchCount: 65 , volCount: 59) ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965) ContextMatch: though the was (matchCount: 197 , volCount: 196) ContextMatch: thought ice was (matchCount: 45 , volCount: 45) ContextMatch: thought ike was (matchCount: 112 , volCount: 108) ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822) ContextMatch: thought the was (matchCount: 91 , volCount: 91) that I thought she was
  • 17. October 27, 20152015 DLF Forum - Beyond eMOP 17 Outcomes – Github juxta-CL • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/Juxta-cl/  Juxta-CL(command line)  created for eMOP  based on JuxtaCommons tool (juxtacommons.org/) created by Performant Software  several different comparison algorithms to choose from and other options  Levenshtein / Jaro-Winkler / Juxta  ignore: punctuation, caps, hyphens
  • 18. October 27, 20152015 DLF Forum - Beyond eMOP 18 Outcomes – Github Cobre • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/Cobre/
  • 19. October 27, 20152015 DLF Forum - Beyond eMOP 19 Outcomes - Tesseract Training • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach early-modern-ocr.github.io/TesseractTraining/  Font Training  3 different types  Roman, Italic, Blackletter  12 different printers  1564-1764  40 different typeface combinations  Even more combined typefaces training files  EM Dictionaries  from multiple sources with alternate spellings  More  Franken+ training files  common OCR error file
  • 20. October 27, 20152015 DLF Forum - Beyond eMOP 20 Outcomes – DB of EM Printers • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Parsing the imprint lines of all EEBO & ECCO docs to create the ImprintDB  Gathering:  Printed By  Printed For  Sold by  Location(“at the Rose and Crown in St. Paul's Church-Yard”, “at the signe of the Traytors head”)  Place (London, Cambridge…)  (Dates)  Those docs with ESTC numbers will be available via a public database early-modern-ocr.github.io/ImprintDB/
  • 21.  Phase 1 of TCP EEBO hand transcriptions are now available for fulltext searching in 18thConnect  over 25,000 documents October 27, 20152015 DLF Forum - Beyond eMOP 21 Outcomes – Fulltext Search of TCP Phase 1 • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach 18thconnect.org/search?a=EEBO&o=fulltext
  • 22. October 27, 20152015 DLF Forum - Beyond eMOP 22 Outcomes – Editable EEBO OCR – TypeWright • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  Can already use TypeWright in 18thConnect to edit ECCO docs (without a subscription)  Will soon be able to edit EEBO docs in TypeWright  TCP hand transcriptions  eMOP OCR transcriptions  When doc is fully corrected, users can request text and/or XML versions of corrected transcriptions for scholarly use.  First time EEBO transcriptions will be available for over 80,000 docs. 18thconnect.org/typewright
  • 23. October 27, 20152015 DLF Forum - Beyond eMOP 23 Outcomes – Collection Evaluation • Code Repo • Tesseract Training • ImprintDB • TCP EEBO Phase 1 • EEBO OCR • Collection Evaluation • Outreach  We’ve collected over 6 TB of data  Our post-processing algorithms collect data:  amount of skew  amount of noise  number and location of text columns  page quality  correctability  corrections made  Saved to the eMOP DB  First time this has been done on these collections  Pre-process and re-OCR pages  several iterations will identify bad page images
  • 24. The Future of eMOP October 27, 20152015 DLF Forum - Beyond eMOP 24  We want eMOP to be viable long-term  Continue to be developed/supporte d  open-source repos  outreach  Looking for partnersfrom Kill Bill 2, 2004, Miramax
  • 25. Future – Work  ImprintDB  We want to turn this into an online DB with functionality to  Query & download results  Recommend corrections to data  Analyze Data  We have collected a mass of data on nearly 45 million pages:  Skew & noise measures  Column coordinates  Corrections made  Scores & Estimates for “correctability”  We can use that data to improve the OCR workflow and judge its efficacy  Re-OCR  Using skew & noise measures & column info we can pre-process pages and re-OCR for better results October 27, 20152015 DLF Forum - Beyond eMOP 25
  • 26. Future – Funded Work  NEH Implementation Grant Reading the First Books  OCR the first printed works of the New World  Using Ocular OCR engine with multiple language recognition  Opportunity to integrate another OCR engine into our workflow, and continue to develop our workflow and DB to work with other collections. October 27, 20152015 DLF Forum - Beyond eMOP 26
  • 27. Future – Partnerships Notre Dame Libraries Penn St. Libraries Adam Matthew Digital Visualizing English Print U of S Carolina – Center for Digital Humanities  Consultation  Workshops  Share labor / resources  Test the robustness of our workflow in parts or as a whole on different collections and systems  Test how text analysis, topic modeling, lexical mapping, etc. can improve our data and results October 27, 20152015 DLF Forum - Beyond eMOP 27
  • 28. The end For eMOP questions please contact us at : mchristy@tamu.edu egrumbac@tamu.edu mandell@tamu.edu 28 2015 DLF Forum - Beyond eMOP October 27, 2015

Editor's Notes

  1. We have a number of contact points
  2. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research. OPEN SOURCE
  3. For our collections we had EEBO & ECCO (proprietary collections) This constitutes most of the English-language EM documents available in digitized format. And for our Groundtruth comparison we used the TCP.
  4. The reason why this project is necessary is due to the nature of the documents in question. This is the beginning of the printing era when a lot of things were in flux. The technology was new. Standards were loose or non-existant. Quality was variable Problems are endemic to both the original documents and their digital representations
  5. Some were great most were not Noisy Skewed Warped Or they posed challenges for OCR engines Multiple pages per image Multiple columns Images & decorative elements Marginalia Missing margins many were terrible
  6. So, after almost 2 years of typeface research, and testing, and creating training for Tesseract (Google) we OCRd this collection on anywhere from 128 – 500 processors simultaneously and continuously for 3 months When we were done, we analyzed the scores for the 1.8 million pages that we could compare to the groundtruth we got from the TCP. We were only 3% off from Prime Recognition who were able to use multiple OCR engines, and charge a lot of money.
  7. So, we achieved 80% or better on almost half of our documents with groudtruth and those results mostly reflect comparison to EEBO and consists of over a quarter of EEBO’s collection. What we think this is telling us (lacking more detailed analysis) is that we did pretty good with EEBO, but REALLY bad images in that collection pulled our overall score average down.
  8. Because the collections we OCRd are proprietary, those transcriptions are not necessarily part of our open-source outcomes, except with some exceptions, which I’ll talk about in a minute. Most of what we produced open-source is a lot of code and Tesseract training. That’s all in github and linked to from our website.
  9. We created this tool to allow us to generate our own EM typeface training for Tesseract. Tesseract’s native training method wouldn’t work for us. We see this as a useful tool for Typeface research as well.
  10. The eMOP dashboard allows us to control what docs get OCRd using which set of training. We can also use it to see results down to a page level. From here we can see scores against GT, estimated correctability, and correction stats There’s also an admin page with an API for looking at more detailed run information and download other stats
  11. Dashboard provides access to the DB via an API Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS I’d like to point out that you see very little pre-processing of page images in this workflow. The size and nature of our image collections meant that it was prohibitively expensive to attempt to identify and fix potential issues BEFORE ocr’ing So we concentrated on post-processing
  12. One of those post-processing algorithms: Remove “noise” from Tesseract’s hOCR output “words” that are really caused by noise look at position and size of the word to determine
  13. Created by Performant Software
  14. Cobre is a robust image comparison environment first designed for Primeros Libros collection at TAMU, UT and Mexico. We’ve added the ability to edit text transcriptions and add document level metadata and annotations. We ingested 3 copies from eebo of this document in Cobre along with OCR transcriptions Editors with a login can then add doc-level metadata and annotations And can edit page transcriptions one at a time. More importantly you can compare page images and transcriptions for 2 or mare pages. Here we can see two copies of the same document side-by-side. The copy on the left is in much worse condition and so OCR’d much worse. We have the ability to cut-and-paste all or part of the transcription on the right. But we have to be careful about differences. (W VV of first word)
  15. That’s the remaining 2/3 of the collection, which could never before be searched.