Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
Laura Mandell, PI, IDHMC
Apostolos Antonacopoulos, PRImA Lab
Clemens Neudecker, Koninklijke Bibliotheek
Matthew Christy, Co-Project Manager, IDHMC
Loretta Auvil, SEASR Analytics
Todd Samuelson, Cushing Memorial Library
emop.tamu.edu
Navigating the Storm:
eMOP, Big DH Projects, and Agile Steering Standards
Initial Goals
Challenges
Or
Failures
Analysis
New Directions
Adaptability
Navigating the Storm | @EMGrumbach | emop.tamu.edu
Straight from the grant proposal…
“Our overarching goals”
1) Train three open-access OCR engines to “read” early modern
fonts
2) Map specific font training onto specific sets of documents
3) Create error-evaluation mechanisms for failed documents
4) Use crowd-sourced correction tools specific to OCR errors
5) Identify pages that are too flawed to be “readable”
6) Share our workflow procedure and results, so that the
community can use them in digitizing and transcribing early
modern documents.
Main Collaborators
CIIR (UMass Amherst)
IDHMC + Cushing Memorial Library (Texas A&M)
Koninklijke Bibliotheek (Netherlands)
Performant Software Solutions (Charlottesville, Virginia)
PRImA Labs (University of Salford, Manchester)
PSI Labs (Texas A&M)
SEASR (U of Illinois, Urbana-Champaign)
Data Contributors + Collaborators
Early English Books Online (EEBO)
Eighteenth Century Collections Online (ECCO)
Text Creation Partnership (TCP)
Brazos Computing Cluster (Texas A&M)
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Laura Mandell, Principal Investigator, eMOP
Director, IDHMC
@mandellc
idhmc@tamu.edu
Early Modern
Printing
• Individual, hand-made
typefaces
• Worn and broken type
• Poor quality equipment/paper
• Inconsistent line bases
• Unusual page layouts,
decorative page elements
• Special characters & ligatures
• Spelling variations
• Mixed typefaces and languages
Slides by Matthew Christy
• Irregular Layouts
• Print Bleedthrough
Document/Image
Quality
• Torn and damaged
pages
• Noise introduced to
images of pages
• Skewed pages
• Warped pages
• Missing pages
• Inverted pages
• Incorrect metadata
• Extremely low quality
TIFFs (~50K)
Dream
Training Tesseract on different fonts and applying each training to the
documents printed in that particular font will improve OCR quality.
Reality
There may be as much difference between one letter and another in a
specific font as there is between letters in different fonts.
Training Tesseract
Aletheia
Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on
the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an
XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode
values.
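As a rough illustration of what such a file lets a downstream tool do, the sketch below reads glyphs, coordinates and Unicode values from a small XML string. The sample is hypothetical and heavily simplified: it assumes each Glyph element has a Coords element with a points attribute and a TextEquiv/Unicode child. Real Aletheia output is namespaced and its schema varies by version, so the element names should be checked against an actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real Aletheia output is namespaced and richer.
SAMPLE = """<Page>
  <Glyph id="g1">
    <Coords points="10,20 30,20 30,50 10,50"/>
    <TextEquiv><Unicode>a</Unicode></TextEquiv>
  </Glyph>
  <Glyph id="g2">
    <Coords points="35,18 52,18 52,50 35,50"/>
    <TextEquiv><Unicode>ſ</Unicode></TextEquiv>
  </Glyph>
</Page>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}Glyph' down to 'Glyph'."""
    return tag.rsplit("}", 1)[-1]


def read_glyphs(xml_text):
    """Return a list of (unicode_value, [(x, y), ...]) for every glyph found."""
    glyphs = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "Glyph":
            continue
        points, value = [], None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "Coords" and child.get("points"):
                # "x,y x,y ..." becomes a list of integer coordinate pairs.
                points = [tuple(int(n) for n in pair.split(","))
                          for pair in child.get("points").split()]
            elif name == "Unicode":
                value = child.text
        glyphs.append((value, points))
    return glyphs


glyphs = read_glyphs(SAMPLE)
for value, points in glyphs:
    print(value, points)
```

Matching on the local tag name rather than a fixed namespace is a deliberate shortcut here, so the same function tolerates files written against different schema versions.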
Training Tesseract
Franken+
1. Takes Aletheia's output files as input.
2. Groups all glyphs with the same Unicode values
into one window for comparison.
3. Mistakenly coded glyphs are easily identified and
re-coded.
4. A user can quickly compare all exemplars of a
glyph and choose just the best subset, if desired.
5. Uses all selected glyphs to create a Franken-page
image (TIFF) using a selected text as a base.
6. Outputs the same box files and TIFF images as
Tesseract's first stage of native training.
7. Also allows users to complete Tesseract training
using newly created box/TIFF file pairs, and add
optional dictionary and other files.
8. Outputs a .traineddata file used by Tesseract
when OCRing page images.
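Two of the steps above can be sketched in a few lines: grouping glyph exemplars by Unicode value (step 2) and writing them out as Tesseract box-file lines (step 6). This is not Franken+'s own code; the function names, the sample glyphs and the 100-pixel page height are all hypothetical. The only fixed point is the box-file line layout of symbol, left, bottom, right, top and page number, with y measured from the bottom of the image.

```python
from collections import defaultdict


def group_by_unicode(glyphs):
    """Group glyph exemplars by Unicode value so they can be compared side by side."""
    groups = defaultdict(list)
    for value, box in glyphs:
        groups[value].append(box)
    return dict(groups)


def box_line(value, box, image_height, page=0):
    """Format one glyph as a Tesseract box-file line.

    `box` is (left, top, right, bottom) with the origin at the top-left,
    as in most image tools; box files count y from the bottom of the image.
    """
    left, top, right, bottom = box
    return f"{value} {left} {image_height - bottom} {right} {image_height - top} {page}"


# Hypothetical glyph boxes on a 100-pixel-high page image.
glyphs = [("a", (10, 20, 30, 50)), ("ſ", (35, 18, 52, 50)), ("a", (60, 21, 80, 50))]

groups = group_by_unicode(glyphs)
lines = [box_line(value, box, image_height=100) for value, box in glyphs]
print("\n".join(lines))
```

Grouping first makes a mis-coded glyph easy to spot, because it appears as an odd one out among exemplars that should all look alike.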
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Clemens Neudecker, Koninklijke Bibliotheek
@cneudecker
The case of IMPACT
• IMPACT = IMProving ACcess to Text
• EU FP7, 2008 – 2012
• €16.7 M budget
• 22 partners (libraries, universities, companies)
• Goal: Significantly improve OCR for historical
documents
Issue 1
• Expectation: The "IMPACT OCR"
• Reality: A collection of very diverse tools,
algorithms, etc. Some prototypes, some
commercial tools, different programming
languages, different levels of maturity etc.
• No integrated product possible!
Issue 1
• Solution: Interoperability rather than integration
• Change: Individual applications as pluggable
modules in a web-based framework
• Result: Flexible framework with additional
benefits for testing, transparency, provenance
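The idea of pluggable modules can be illustrated with a very small sketch. It is not the IMPACT framework: the registry, the module names and the text-in, text-out interface are all assumptions made for the example. The point is only that tools written independently can be chained once they share one calling convention, and that the chain can record what was run.

```python
from typing import Callable, Dict, List, Tuple

# Registry of hypothetical processing modules, keyed by name.
MODULES: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a text-in, text-out tool to the registry."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        MODULES[name] = func
        return func
    return wrap


@register("normalise-long-s")
def normalise_long_s(text: str) -> str:
    """Replace the long s with a modern s."""
    return text.replace("ſ", "s")


@register("collapse-whitespace")
def collapse_whitespace(text: str) -> str:
    """Reduce every run of whitespace to a single space."""
    return " ".join(text.split())


def run_pipeline(text: str, steps: List[str]) -> Tuple[str, List[str]]:
    """Run the named modules in order and record which ones were applied."""
    provenance = []
    for name in steps:
        text = MODULES[name](text)
        provenance.append(name)
    return text, provenance


result, provenance = run_pipeline(
    "Paradiſe   loſt", ["normalise-long-s", "collapse-whitespace"]
)
print(result, provenance)
```

Because each module is looked up by name, a step can be swapped or tested in isolation, and the recorded list of steps gives a simple form of provenance.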
Issue 2
• Diversity: Librarians, Computer Scientists,
Computational Linguists, Humanists
• Are we really talking the same language?
• Different focus points in the project: applicable
solutions vs. academic publications
Issue 2
• Solution: Create bonding activities, foster
atmosphere for knowledge exchange
• Change: Buddy programme, social games,
quizzes about partners
• Result: Understand your partners' backgrounds and
their way of thinking;
enrich the experience for everyone
Large Digitisation Projects:
Two Key Perspectives
Apostolos Antonacopoulos
PRImA Research Lab
Background
Since 2002 the PRImA Lab has been involved in large digitisation
projects, creating software tools for all stages of the workflow
• From Image Enhancement to Layout Analysis to OCR
• Use-scenario based evaluation of extracted text quality
• Crowd/Scholar-sourcing
Two general points are routinely underestimated:
• (Really) Understanding stakeholders and their roles
• (Real) Understanding of problems, their extent and the
effectiveness/requirements of potential solutions
Stakeholders and their
roles
Seems obvious and often mentioned but the significance of
understanding this point and its effects is vastly underestimated
Content holders
• Keen for their content to be widely available and used
• Do not know their content well, nor its potential uses
Computer scientists
• Have technical expertise to solve many of the problems
• Do not know the material and its use to prioritise problems well
DH researchers – the catalysts
• Very knowledgeable of material and potential use
• Have complementary technical skills to computer scientists
Problem understanding
At the start of each project everyone is eager to deliver “big” results but
it is important to identify and understand a few key problems and solve
them well
“Improve OCR results” is an ill-defined and short-sighted goal
• Measured in terms of word-accuracy, OCR results are of little use
• Layout is very important
• Even if all the words are recognised correctly, the reading order is unlikely to be
correct, limiting potentially interesting uses.
• Page numbers, captions, running headers etc. should not be mixed with body text
• Graphical elements / illustrations are important too
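The gap between word accuracy and reading order can be shown with a small sketch. It assumes one simplified way of measuring word accuracy, a bag-of-words overlap, and uses the standard library's difflib as a stand-in for an order-sensitive comparison. The two-column "page" is invented for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words_accuracy(truth, ocr):
    """Share of ground-truth words that also occur in the OCR output, ignoring order."""
    truth_counts = Counter(truth.split())
    ocr_counts = Counter(ocr.split())
    matched = sum((truth_counts & ocr_counts).values())
    return matched / max(1, sum(truth_counts.values()))


def order_similarity(truth, ocr):
    """Similarity of the two word sequences, sensitive to the order of the words."""
    return SequenceMatcher(None, truth.split(), ocr.split()).ratio()


# Invented two-column page: the correct reading order is the left column, then the right.
left_column = ["alpha beta", "gamma delta"]
right_column = ["epsilon zeta", "eta theta"]
truth = " ".join(left_column + right_column)

# An engine that ignores layout reads straight across both columns, line by line.
ocr = " ".join(f"{l} {r}" for l, r in zip(left_column, right_column))

word_score = bag_of_words_accuracy(truth, ocr)
order_score = order_similarity(truth, ocr)
print(word_score, order_score)
```

Every word is recognised, so the order-blind score is perfect, while the order-sensitive score drops because the columns have been interleaved.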
Think: Useful data (investment) vs. just more of any data (instant
gratification)
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
@EMGrumbach
egrumbac@tamu.edu
“If an electronic scholarly project can’t fail and
doesn’t produce new ignorance, then it isn’t
worth a damn.”
- John Unsworth
“Documenting the Reinvention of Text: The Importance of Failure”
Navigating the Storm:
eMOP, Big DH Projects, and Agile Steering Standards
Challenges + Failures
should be constantly or consistently communicated.
Analysis + New Directions
should lead to research and communication with similar projects.
Adaptability
should allow for new possibilities, new questions.
  • 3. Straight from the grant proposal… “Our overarching goals” 1) Train three open-access OCR engines to “read” early modern fonts 2) Map specific font training onto specific sets of documents 3) Create error-evaluation mechanisms for failed documents 4) Use crowd-sourced correction tools specific to OCR errors 5) Identify pages that are too flawed to be “readable” 6) Share our workflow procedure and results, so that the community can use them in digitizing and transcribing early modern documents.
  • 4. Main Collaborators: CIIR (UMass Amherst); IDHMC + Cushing Memorial Library (Texas A&M); Koninklijke Bibliotheek (Netherlands); Performant Software Solutions (Charlottesville, Virginia); PRImA Labs (University of Salford, Manchester); PSI Labs (Texas A&M); SEASR (U of Illinois, Urbana-Champaign)
  • 5. Data Contributors + Collaborators Early English Books Online (EEBO) Eighteenth Century Collections Online (ECCO) Text Creation Partnership (TCP) Brazos Computing Cluster (Texas A&M) Main Collaborators
  • 6. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards Laura Mandell, Principal Investigator, eMOP Director, IDHMC @mandellc idhmc@tamu.edu
  • 7. Early Modern Printing • Individual, hand-made typefaces • Worn and broken type • Poor quality equipment/paper • Inconsistent line bases • Unusual page layouts, decorative page elements • Special characters & ligatures • Spelling variations • Mixed typefaces and languages Slides by Matthew Christy
  • 8. • Irregular Layouts • Print Bleedthrough
  • 9. Document/Image Quality • Torn and damaged pages • Noise introduced to images of pages • Skewed pages • Warped pages • Missing pages • Inverted pages • Incorrect metadata • Extremely low quality TIFFs (~50K)
  • 10. Slides by Matthew Christy 10
  • 11. Dream: Training Tesseract in different fonts and applying them to the documents printed in those particular fonts will improve OCR quality. Reality: There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.
  • 12. Training Tesseract Aletheia Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  • 13. Training Tesseract Franken+ 1. Takes Aletheia's output files as input. 2. Groups all glyphs with the same Unicode values into one window for comparison. 3. Mistakenly coded glyphs are easily identified and re-coded. 4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired. 5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. 6. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces. 7. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files. 8. Outputs a .traineddata file used by Tesseract when OCRing page images.
  • 14. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards Clemens Neudecker, Koninklijke Bibliotheek @cneudecker
  • 15. The case of IMPACT • IMPACT = IMProving ACcess to Text • EU FP7, 2008 – 2012 • €16.7 M budget • 22 partners (libraries, universities, companies) • Goal: Significantly improve OCR for historical documents
  • 16. Issue 1 • Expectation: The "IMPACT OCR" • Reality: A collection of very diverse tools, algorithms, etc. Some prototypes, some commercial tools, different programming languages, different levels of maturity etc. • No integrated product possible!
  • 17. Issue 1 • Solution: Interoperability rather than integration • Change: Individual applications as pluggable modules in a web-based framework • Result: Flexible framework with additional benefits for testing, transparency, provenance
  • 18. Issue 2 • Diversity: Librarians, Computer Scientists, Computational Linguists, Humanists • Are we really talking the same language? • Different focus points in the project: applicable solutions vs. academic publications
  • 19. Issue 2 • Solution: Create bonding activities, foster an atmosphere for knowledge exchange • Change: Buddy programme, social games, quizzes about partners • Result: Understanding your partners' backgrounds and their ways of thinking enriches the experience for everyone
  • 20. Large Digitisation Projects: Two Key Perspectives Apostolos Antonacopoulos PRImA Research Lab
  • 21. Background Since 2002 the PRImA Lab has been involved in large digitisation projects, creating software tools for all stages of the workflow • From Image Enhancement to Layout Analysis to OCR • Use-scenario based evaluation of extracted text quality • Crowd/Scholar-sourcing Two general points are routinely underestimated: • (Really) Understanding stakeholders and their roles • (Real) Understanding of problems, their extent and the effectiveness/requirements of potential solutions
  • 22. Stakeholders and their roles Seems obvious and is often mentioned, but the significance of understanding this point and its effects is vastly underestimated Content holders • Keen for their content to be widely available and used • Do not know their content well, nor its potential uses Computer scientists • Have technical expertise to solve many of the problems • Do not know the material and its use well enough to prioritise problems DH researchers – the catalysts • Very knowledgeable of material and potential use • Have complementary technical skills to computer scientists
  • 23. Problem understanding At the start of each project everyone is eager to deliver “big” results but it is important to identify and understand a few key problems and solve them well “Improve OCR results” is an ill-defined and short-sighted goal • Measured in terms of word-accuracy, OCR results are of little use • Layout is very important • Even if all the words are recognised correctly, the reading order is unlikely to be correct, limiting potentially interesting uses. • Page numbers, captions, running headers etc. should not be mixed with body text • Graphical elements / illustrations are important too Think: Useful data (investment) vs. just more of any data (instant gratification)
  • 24. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards Elizabeth Grumbach, Co-Project Manager, IDHMC @EMGrumbach egrumbac@tamu.edu
  • 25. “If an electronic scholarly project can’t fail and doesn’t produce new ignorance, then it isn’t worth a damn.” - John Unsworth “Documenting the Reinvention of Text: The Importance of Failure”
  • 26. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards Initial Goals Challenges Or Failures Analysis New Directions Adaptability
  • 27. Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards Challenges Or Failures Analysis New Directions Adaptability Challenges + Failures should be constantly or consistently communicated. Analysis + New Directions should lead to research and communication with similar projects. Adaptability should allow for new possibilities, new questions.
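
The Franken+ steps on slide 13 (take glyph listings as input, group glyphs by Unicode value, emit Tesseract box files) can be sketched roughly as follows. This is only an illustrative sketch, not the Franken+ implementation: the XML layout, the attribute names, and the helper functions `group_glyphs` and `to_box_lines` are all hypothetical, and Aletheia's real output is a richer format. The one fixed detail is the box-file line layout used by Tesseract's classic training, `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified glyph listing (top-left origin, pixel units).
# This is NOT Aletheia's actual schema; it only stands in for it.
SAMPLE_XML = """
<page height="100">
  <glyph unicode="a" x="10" y="20" w="8" h="12"/>
  <glyph unicode="b" x="20" y="18" w="8" h="14"/>
  <glyph unicode="a" x="30" y="20" w="9" h="12"/>
</page>
"""


def group_glyphs(xml_text):
    """Group glyph boxes by their Unicode value (cf. step 2 on slide 13)."""
    root = ET.fromstring(xml_text.strip())
    page_height = int(root.get("height"))
    groups = defaultdict(list)
    for glyph in root.iter("glyph"):
        box = tuple(int(glyph.get(key)) for key in ("x", "y", "w", "h"))
        groups[glyph.get("unicode")].append(box)
    return page_height, dict(groups)


def to_box_lines(page_height, groups, page=0):
    """Emit Tesseract box-file lines: 'symbol left bottom right top page'.

    Box files use a bottom-left origin, so the y axis is flipped here.
    """
    lines = []
    for char, boxes in groups.items():
        for x, y, w, h in boxes:
            left, right = x, x + w
            top = page_height - y
            bottom = page_height - (y + h)
            lines.append(f"{char} {left} {bottom} {right} {top} {page}")
    return lines


if __name__ == "__main__":
    height, grouped = group_glyphs(SAMPLE_XML)
    for line in to_box_lines(height, grouped):
        print(line)
```

Grouping by Unicode value first is what makes a side-by-side review of all exemplars of one glyph possible before any box file is written; mis-coded glyphs would show up as visual outliers within a group.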

Editor's Notes

  1. sf
  2. eMOP: the Early Modern OCR Project, funded by the Mellon Foundation for $734,000 over two years. Our initial goals were the following.
  3. Influenced us a lot when we were discussing putting together this paper and presentation, as we’ve all come to this international, interdisciplinary grant project from similar projects – we’ve faced challenges to our initial premises, we’ve not met milestones in the grant, yet we’ve produced interesting results and raised new research questions