Mchristy-eMOP-workflows2-24x7

Matt Christy
Matt ChristyLead Software Applications Developer
eMOP guide to OCR
Flowcharting a Course Through Open-Source Waters
 emop.tamu.edu/
 TCDL 24x7 Presentation
 emop.tamu.edu/news
 More information
 emop.tamu.edu/TxDHC-
flowcharts
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/
Mellon/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 Early Modern OCR Project
 Twitter
 #emop
 @IDHMC_Nexus
 @matt_christy
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
2
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality
equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters &
ligatures
 Spelling variations
 Mixed typefaces and
languages
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
3
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
4
The Problems
Document/Image Quality
 Torn and damaged
pages
 Noise introduced to
images of pages
 Skewed pages
 Warped pages
 Missing pages
 Inverted pages
 Incorrect metadata
 Extremely low quality
TIFFs (~50K)
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
5
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
6
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
7
Wrangling Data
The Numbers
 EEBO: ~125,000 documents, ~13
million pages images (1475-1700)
 ECCO: ~182,000 documents,
~32 million page images (1700-
1800)
 TCP: ~46,000 double-keyed
hand transcriptions (44,000
EEBO, 2,200 ECCO) -
Groundtruth
 Total: >300,000 documents & 45
million page images.
The Data
 ECCO page images (1 pg/
image)
 ECCO original OCR results
(doc-level XML files)
 ECCO TCP transcriptions (doc-
level XML and text files)
 EEBO page images (2 pgs/
image)
 EEBO TCP transcriptions (doc-
level XML and text files)
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
8
WranglingData
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
9
TrainingTesseract
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
10
Training Tesseract
Aletheia
Created by PRImA Research Labs. A team of undergraduates uses Aletheia to
identify each glyph on the page images, and ensure that the correct Unicode value
is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a
page with their corresponding coordinates and Unicode values.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
11
Training Tesseract
Franken+
1. Takes Aletheia's output files as input.
2. Groups all glyphs with the same
Unicode values into one window for
comparison.
3. Mistakenly coded glyphs are easily
identified and re-coded.
4. A user can quickly compare all
exemplars of a glyph and choose just
the best subset, if desired.
5. Uses all selected glyphs to create a
Franken-page image (TIFF) using a
selected text as a base.
6. Outputs the same box files and TIFF
images that Tesseract's first stage of
native training.
7. Also allows users to complete Tesseract
training using newly created box/TIFF
file pairs, and add optional dictionary
and other files.
8. Outputs a .traineddata file used by
Tesseract when OCRing page images.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
12
Control Tools
 Query Builder: Users can build
queries to find specific
documents, or sets of
documents, in the eMOP DB.
Document sets can be
labeled or grouped with an
identifier.
 Data Downloader: Available
via the QB. Allows identified
items to be downloaded:
page images, OCR results,
associated groundtruth.
 eMOP Dashboard: Users
select documents from the
database, individually or with
the identifier created in the
QB, for OCRing with specific
typeface training. The
Dashboard displays results
with Juxta/RETAS scores when
groundtruth is available.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
13
emop_controller
A java program that runs on the Brazos High Performance Computing Cluster
and ensures maximum utilization of the 128 processors available for our use by
scheduling various functions and processes of the controller separately.
1. The selected documents are marked in DB as scheduled and the
documents are queued for processing.
2. A cron job continuously checks the scheduled queue and OCRs any
unscheduled pages.
3. Each page is OCRd with Tesseract.
4. After OCRing each page's hOCR output file is de-noised.
5. hOCR pages that have matching Groundtruth are scored using Juxta and
RETAS algorithms.
6. File paths and scores are written to the eMOP DB.
7. hOCR documents are examined again and given an estimated
correctability (ECORR) score.
8. hOCR page results are sent to post-processing triage.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
14
emop_controller
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
15
De-noising
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
16
De-noising
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
17
On pages with a mix of text and image removes
parts of the image Tesseract identifies as “words”.
Post-ProcessingTriage
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
18
Post-Processing Triage
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
19
Post-Processing
Triage
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
20
 www.18thconnect.org/typew
right/
 Available through
18thconnect.org
An aggregator of 18th
Century metadata & content
This will be the first time that
EEBO documents are available
as searchable text, and that
they can be corrected by
scholars for their own use.
TypeWright
 Crowd-sourced correction
tool for OCR.
 Currently available for ECCO
collection, based on original
ECCO OCR results.
 Will be ingested with eMOP
OCR for ECCO and EEBO by
year-end.
 Users can correct a whole
document and then receive
the text or a lightly encoded
TEI XML version for use in a
digital edition or whatever.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
21
TypeWright
Edit button identifies
“TypeWright” enabled
Texts.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
22
Tags DB
 All page images that don’t produce OCR text that
meets our standards will be tagged in the eMOP DB
with tags that indicate what is wrong with the page
image.
 skew, warp, noise, wrong font, etc.
 Tags can then be used to later apply the
appropriate pre-processing techniques and then
re-OCR the page to produce better results.
 This will be the first time that any sort of
comprehensive analysis has been done on the
page images of these collections.
 We’ll end up (ideally) with a list of page images that
simply need to be re-scanned.
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
23
The end
For eMOP questions please
contact me at :
mchristy@tamu.edu
TCDL - eMOP: Flowcharting a Course Through Open-Source Waters
24
1 of 24

Recommended

mchristy-Dh2014- emop-postOCR-triage by
mchristy-Dh2014- emop-postOCR-triagemchristy-Dh2014- emop-postOCR-triage
mchristy-Dh2014- emop-postOCR-triageMatt Christy
1.6K views21 slides
TCDL15 Beyond eMOP by
TCDL15 Beyond eMOPTCDL15 Beyond eMOP
TCDL15 Beyond eMOPMatt Christy
1.3K views31 slides
mchristy-DH2014-emop-bookhistory-tools by
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsMatt Christy
2.8K views21 slides
DLF Forum 2015: Beyond eMOP by
DLF Forum 2015: Beyond eMOPDLF Forum 2015: Beyond eMOP
DLF Forum 2015: Beyond eMOPMatt Christy
1.4K views28 slides
eMOP-PennSt-lunch by
eMOP-PennSt-luncheMOP-PennSt-lunch
eMOP-PennSt-lunchMatt Christy
1.4K views41 slides
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards by
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsLiz Grumbach
2.1K views27 slides

More Related Content

What's hot

Dh2014 e mopcobre-complete by
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-completeLaura Mandell
2.9K views45 slides
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools by
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsMatt Christy
2.2K views48 slides
Can functional programming be liberated from static typing? by
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
1.8K views39 slides
The State of #NLProc by
The State of #NLProcThe State of #NLProc
The State of #NLProcVsevolod Dyomkin
996 views19 slides
Aspects of NLP Practice by
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP PracticeVsevolod Dyomkin
1K views37 slides
Crash-course in Natural Language Processing by
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language ProcessingVsevolod Dyomkin
1.6K views47 slides

What's hot(16)

Dh2014 e mopcobre-complete by Laura Mandell
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
Laura Mandell2.9K views
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools by Matt Christy
SAA 2014 Pre-conference Workshop - OCRing with Open Source ToolsSAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools
Matt Christy2.2K views
Can functional programming be liberated from static typing? by Vsevolod Dyomkin
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
Vsevolod Dyomkin1.8K views
Crash-course in Natural Language Processing by Vsevolod Dyomkin
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin1.6K views
Natural Language Processing in Practice by Vsevolod Dyomkin
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
Vsevolod Dyomkin3.2K views
Crash Course in Natural Language Processing (2016) by Vsevolod Dyomkin
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
Vsevolod Dyomkin1.8K views
Semantic web, python, construction industry by Reinout van Rees
Semantic web, python, construction industrySemantic web, python, construction industry
Semantic web, python, construction industry
Reinout van Rees879 views
Feature Engineering for NLP by Bill Liu
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
Bill Liu2.9K views
Recent trends in natural language processing by Balayogi G
Recent trends in natural language processingRecent trends in natural language processing
Recent trends in natural language processing
Balayogi G221 views
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W by PMB-BUG
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4WMigrer vers PMB: retour d\'expérience d\'une migration depuis S4W
Migrer vers PMB: retour d\'expérience d\'une migration depuis S4W
PMB-BUG1.1K views

Similar to Mchristy-eMOP-workflows2-24x7

Lecture 1 introduction to language processors by
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processorsRebaz Najeeb
1.3K views33 slides
Numerical Simulation of Nonlinear Mechanical Problems using Metafor by
Numerical Simulation of Nonlinear Mechanical Problems using MetaforNumerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using MetaforRomain Boman
951 views22 slides
Xml processing-by-asfak by
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfakAsfak Mahamud
1.3K views41 slides
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft... by
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
414 views30 slides
Compiler Design.pptx by
Compiler Design.pptxCompiler Design.pptx
Compiler Design.pptxSouvikRoy149
34 views10 slides
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration by
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationStuart Chalk
969 views18 slides

Similar to Mchristy-eMOP-workflows2-24x7(20)

Lecture 1 introduction to language processors by Rebaz Najeeb
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
Rebaz Najeeb1.3K views
Numerical Simulation of Nonlinear Mechanical Problems using Metafor by Romain Boman
Numerical Simulation of Nonlinear Mechanical Problems using MetaforNumerical Simulation of Nonlinear Mechanical Problems using Metafor
Numerical Simulation of Nonlinear Mechanical Problems using Metafor
Romain Boman951 views
Xml processing-by-asfak by Asfak Mahamud
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
Asfak Mahamud1.3K views
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft... by Dataconomy Media
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
Dataconomy Media414 views
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration by Stuart Chalk
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
Stuart Chalk969 views
CS4200 2019 Lecture 1: Introduction by Eelco Visser
CS4200 2019 Lecture 1: IntroductionCS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: Introduction
Eelco Visser1K views
Multi Lingual Websites In Umbraco by Paul Marden
Multi Lingual Websites In UmbracoMulti Lingual Websites In Umbraco
Multi Lingual Websites In Umbraco
Paul Marden4.6K views
Technology Stack Discussion by Zaiyang Li
Technology Stack DiscussionTechnology Stack Discussion
Technology Stack Discussion
Zaiyang Li586 views
Bitstuffing by Vishal Kr
BitstuffingBitstuffing
Bitstuffing
Vishal Kr551 views
Environment Canada's Data Management Service by Safe Software
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
Safe Software1.3K views
Wiki dev nlp by ICSM 2010
Wiki dev nlpWiki dev nlp
Wiki dev nlp
ICSM 2010576 views
Applications of data structures by Wipro
Applications of data structuresApplications of data structures
Applications of data structures
Wipro29.5K views
BigDataSpain 2016: Stream Processing Applications with Apache Apex by Thomas Weise
BigDataSpain 2016: Stream Processing Applications with Apache ApexBigDataSpain 2016: Stream Processing Applications with Apache Apex
BigDataSpain 2016: Stream Processing Applications with Apache Apex
Thomas Weise322 views
PhD Presentation by mskayed
PhD PresentationPhD Presentation
PhD Presentation
mskayed3.2K views

Recently uploaded

Google solution challenge..pptx by
Google solution challenge..pptxGoogle solution challenge..pptx
Google solution challenge..pptxChitreshGyanani1
82 views18 slides
GSoC 2024 by
GSoC 2024GSoC 2024
GSoC 2024DeveloperStudentClub10
63 views15 slides
American Psychological Association 7th Edition.pptx by
American Psychological Association  7th Edition.pptxAmerican Psychological Association  7th Edition.pptx
American Psychological Association 7th Edition.pptxSamiullahAfridi4
74 views8 slides
ICS3211_lecture 08_2023.pdf by
ICS3211_lecture 08_2023.pdfICS3211_lecture 08_2023.pdf
ICS3211_lecture 08_2023.pdfVanessa Camilleri
95 views30 slides
Community-led Open Access Publishing webinar.pptx by
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptxJisc
69 views9 slides
Material del tarjetero LEES Travesías.docx by
Material del tarjetero LEES Travesías.docxMaterial del tarjetero LEES Travesías.docx
Material del tarjetero LEES Travesías.docxNorberto Millán Muñoz
68 views9 slides

Recently uploaded(20)

American Psychological Association 7th Edition.pptx by SamiullahAfridi4
American Psychological Association  7th Edition.pptxAmerican Psychological Association  7th Edition.pptx
American Psychological Association 7th Edition.pptx
SamiullahAfridi474 views
Community-led Open Access Publishing webinar.pptx by Jisc
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptx
Jisc69 views
Structure and Functions of Cell.pdf by Nithya Murugan
Structure and Functions of Cell.pdfStructure and Functions of Cell.pdf
Structure and Functions of Cell.pdf
Nithya Murugan317 views
Nico Baumbach IMR Media Component by InMediaRes1
Nico Baumbach IMR Media ComponentNico Baumbach IMR Media Component
Nico Baumbach IMR Media Component
InMediaRes1425 views
SIMPLE PRESENT TENSE_new.pptx by nisrinamadani2
SIMPLE PRESENT TENSE_new.pptxSIMPLE PRESENT TENSE_new.pptx
SIMPLE PRESENT TENSE_new.pptx
nisrinamadani2173 views
Universe revised.pdf by DrHafizKosar
Universe revised.pdfUniverse revised.pdf
Universe revised.pdf
DrHafizKosar108 views
DU Oral Examination Toni Santamaria by MIPLM
DU Oral Examination Toni SantamariaDU Oral Examination Toni Santamaria
DU Oral Examination Toni Santamaria
MIPLM138 views
EIT-Digital_Spohrer_AI_Intro 20231128 v1.pptx by ISSIP
EIT-Digital_Spohrer_AI_Intro 20231128 v1.pptxEIT-Digital_Spohrer_AI_Intro 20231128 v1.pptx
EIT-Digital_Spohrer_AI_Intro 20231128 v1.pptx
ISSIP256 views
The basics - information, data, technology and systems.pdf by JonathanCovena1
The basics - information, data, technology and systems.pdfThe basics - information, data, technology and systems.pdf
The basics - information, data, technology and systems.pdf
JonathanCovena177 views
AI Tools for Business and Startups by Svetlin Nakov
AI Tools for Business and StartupsAI Tools for Business and Startups
AI Tools for Business and Startups
Svetlin Nakov89 views
JiscOAWeek_LAIR_slides_October2023.pptx by Jisc
JiscOAWeek_LAIR_slides_October2023.pptxJiscOAWeek_LAIR_slides_October2023.pptx
JiscOAWeek_LAIR_slides_October2023.pptx
Jisc72 views
Plastic waste.pdf by alqaseedae
Plastic waste.pdfPlastic waste.pdf
Plastic waste.pdf
alqaseedae110 views
UWP OA Week Presentation (1).pptx by Jisc
UWP OA Week Presentation (1).pptxUWP OA Week Presentation (1).pptx
UWP OA Week Presentation (1).pptx
Jisc68 views

Mchristy-eMOP-workflows2-24x7

  • 1. eMOP guide to OCR Flowcharting a Course Through Open-Source Waters
  • 2.  emop.tamu.edu/  TCDL 24x7 Presentation  emop.tamu.edu/news  More information  emop.tamu.edu/TxDHC- flowcharts  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @matt_christy TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 2
  • 3. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 3
  • 4. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 4
  • 5. The Problems Document/Image Quality  Torn and damaged pages  Noise introduced to images of pages  Skewed pages  Warped pages  Missing pages  Inverted pages  Incorrect metadata  Extremely low quality TIFFs (~50K) TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 5
  • 6. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 6
  • 7. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 7
  • 8. Wrangling Data The Numbers  EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  ECCO: ~182,000 documents, ~32 million page images (1700- 1800)  TCP: ~46,000 double-keyed hand transcriptions (44,000 EEBO, 2,200 ECCO) - Groundtruth  Total: >300,000 documents & 45 million page images. The Data  ECCO page images (1 pg/ image)  ECCO original OCR results (doc-level XML files)  ECCO TCP transcriptions (doc- level XML and text files)  EEBO page images (2 pgs/ image)  EEBO TCP transcriptions (doc- level XML and text files) TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 8
  • 9. WranglingData TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 9
  • 10. TrainingTesseract TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 10
  • 11. Training Tesseract Aletheia Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 11
  • 12. Training Tesseract Franken+ 1. Takes Aletheia's output files as input. 2. Groups all glyphs with the same Unicode values into one window for comparison. 3. Mistakenly coded glyphs are easily identified and re-coded. 4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired. 5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. 6. Outputs the same box files and TIFF images that Tesseract's first stage of native training. 7. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files. 8. Outputs a .traineddata file used by Tesseract when OCRing page images. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 12
  • 13. Control Tools  Query Builder: Users can build queries to find specific documents, or sets of documents, in the eMOP DB. Document sets can be labeled or grouped with an identifier.  Data Downloader: Available via the QB. Allows identified items to be downloaded: page images, OCR results, associated groundtruth.  eMOP Dashboard: Users select documents from the database, individually or with the identifier created in the QB, for OCRing with specific typeface training. The Dashboard displays results with Juxta/RETAS scores when groundtruth is available. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 13
  • 14. emop_controller A java program that runs on the Brazos High Performance Computing Cluster and ensures maximum utilization of the 128 processors available for our use by scheduling various functions and processes of the controller separately. 1. The selected documents are marked in DB as scheduled and the documents are queued for processing. 2. A cron job continuously checks the scheduled queue and OCRs any unscheduled pages. 3. Each page is OCRd with Tesseract. 4. After OCRing each page's hOCR output file is de-noised. 5. hOCR pages that have matching Groundtruth are scored using Juxta and RETAS algorithms. 6. File paths and scores are written to the eMOP DB. 7. hOCR documents are examined again and given an estimated correctability (ECORR) score. 8. hOCR page results are sent to post-processing triage. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 14
  • 15. emop_controller TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 15
  • 16. De-noising TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 16
  • 17. De-noising TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 17 On pages with a mix of text and image removes parts of the image Tesseract identifies as “words”.
  • 18. Post-ProcessingTriage TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 18
  • 19. Post-Processing Triage TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 19
  • 20. Post-Processing Triage TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 20
  • 21.  www.18thconnect.org/typew right/  Available through 18thconnect.org An aggregator of 18th Century metadata & content This will be the first time that EEBO documents are available as searchable text, and that they can be corrected by scholars for their own use. TypeWright  Crowd-sourced correction tool for OCR.  Currently available for ECCO collection, based on original ECCO OCR results.  Will be ingested with eMOP OCR for ECCO and EEBO by year-end.  Users can correct a whole document and then receive the text or a lightly encoded TEI XML version for use in a digital edition or whatever. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 21
  • 22. TypeWright Edit button identifies “TypeWright” enabled Texts. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 22
  • 23. Tags DB  All page images that don’t produce OCR text that meets our standards will be tagged in the eMOP DB with tags that indicate what is wrong with the page image.  skew, warp, noise, wrong font, etc.  Tags can then be used to later apply the appropriate pre-processing techniques and then re-OCR the page to produce better results.  This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.  We’ll end up (ideally) with a list of page images that simply need to be re-scanned. TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 23
  • 24. The end For eMOP questions please contact me at : mchristy@tamu.edu TCDL - eMOP: Flowcharting a Course Through Open-Source Waters 24