Tamu big data-conf-1b

Matt Christy
Matt ChristyLead Software Applications Developer
The Early Modern OCR Project
Big Data in the Humanities
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
 emop.tamu.edu/
 Texas A&M Big Data Workshop
 emop.tamu.edu/TAMU-
BigData
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/
Mellon/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 The Early Modern OCR
Project
 Twitter
 #emop
 @IDHMC_Nexus
 @mandellc
 @matt_christy
 @EMGrumbach
2
The Numbers
Page Images
 Early English Books online
(Proquest) EEBO: ~125,000
documents, ~13 million
pages images (1475-1700)
 Eighteenth Century
Collections Online (Gale
Cengage) ECCO: ~182,000
documents, ~32 million
page images (1700-1800)
 Total: >300,000 documents
& 45 million page images.
Ground Truth
 Text Creation Partnership TCP:
~46,000 double-keyed hand
transcribed docuemnts
 44,000 EEBO
 2,200 ECCO
3
http://emop.tamu.edu
4
5
• PRImA (Pattern Recognition & Image Analysis Research) Lab at
the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly
Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas
A&M University
• The Academy for Advanced Telecommunications and Learning
Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
OurPartners
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality
equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters &
ligatures
 Spelling variations
 Mixed typefaces and
languages
 over/under-inking
 Old, low-quality, small tiff
files
 Noise, skew, warp,
bleedthrough,
6
Page Images
DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP
7
8
Workflows-Controller
• Powered by the eMOP
DB
• Collection processing is
managed via the online
Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
Post-ProcessingTriage
9
Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of
constant processing
• We will have to reprocess some
files that fail due to timeouts or
that require pre-processing
10
1 of 10

Recommended

Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB by
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBDigital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB
Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DBMatt Christy
1.2K views12 slides
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFP by
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFPFrom Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFP
From Early Modern Printing to Post-Modern Indie Publishing: Using eMOP on AFPMatt Christy
1.5K views19 slides
eMOP-PennSt-lunch by
eMOP-PennSt-luncheMOP-PennSt-lunch
eMOP-PennSt-lunchMatt Christy
1.4K views41 slides
DLF Forum 2015: Beyond eMOP by
DLF Forum 2015: Beyond eMOPDLF Forum 2015: Beyond eMOP
DLF Forum 2015: Beyond eMOPMatt Christy
1.4K views28 slides
TCDL15 Beyond eMOP by
TCDL15 Beyond eMOPTCDL15 Beyond eMOP
TCDL15 Beyond eMOPMatt Christy
1.3K views31 slides
mchristy-DH2014-emop-bookhistory-tools by
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsMatt Christy
2.8K views21 slides

More Related Content

Similar to Tamu big data-conf-1b

Oulu2 by
Oulu2Oulu2
Oulu2Stuart Dunn
404 views17 slides
Defrosting the Digital Library: A survey of bibliographic tools for the next ... by
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Duncan Hull
26.5K views65 slides
Living the life electric by
Living the life electricLiving the life electric
Living the life electricDoctorG
796 views29 slides
Computation and Knowledge by
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
2.3K views50 slides
Describing Everything - Open Web standards and classification by
Describing Everything - Open Web standards and classificationDescribing Everything - Open Web standards and classification
Describing Everything - Open Web standards and classificationDan Brickley
9.4K views62 slides
Big Data in the Arts and Humanities by
Big Data in the Arts and HumanitiesBig Data in the Arts and Humanities
Big Data in the Arts and HumanitiesAndrew Prescott
940 views21 slides

Similar to Tamu big data-conf-1b(12)

Defrosting the Digital Library: A survey of bibliographic tools for the next ... by Duncan Hull
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Duncan Hull26.5K views
Living the life electric by DoctorG
Living the life electricLiving the life electric
Living the life electric
DoctorG796 views
Computation and Knowledge by Ian Foster
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster2.3K views
Describing Everything - Open Web standards and classification by Dan Brickley
Describing Everything - Open Web standards and classificationDescribing Everything - Open Web standards and classification
Describing Everything - Open Web standards and classification
Dan Brickley9.4K views
Big Data in the Arts and Humanities by Andrew Prescott
Big Data in the Arts and HumanitiesBig Data in the Arts and Humanities
Big Data in the Arts and Humanities
Andrew Prescott940 views
'E-Science and Archaeology' by Stuart Dunn
'E-Science and Archaeology''E-Science and Archaeology'
'E-Science and Archaeology'
Stuart Dunn340 views
Content Mining for Machines and Humans by petermurrayrust
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
petermurrayrust1.6K views
Content Mining for Machines and Humans by TheContentMine
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
TheContentMine128 views
ContentMine: Open Data and Social Machines by petermurrayrust
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
petermurrayrust2.6K views
Agents In An Exponential World Foster by Ian Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World Foster
Ian Foster575 views

Recently uploaded

STERILITY TEST.pptx by
STERILITY TEST.pptxSTERILITY TEST.pptx
STERILITY TEST.pptxAnupkumar Sharma
114 views9 slides
ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks Effectively by
ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks EffectivelyISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks Effectively
ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks EffectivelyPECB
457 views18 slides
UWP OA Week Presentation (1).pptx by
UWP OA Week Presentation (1).pptxUWP OA Week Presentation (1).pptx
UWP OA Week Presentation (1).pptxJisc
68 views11 slides
Azure DevOps Pipeline setup for Mule APIs #36 by
Azure DevOps Pipeline setup for Mule APIs #36Azure DevOps Pipeline setup for Mule APIs #36
Azure DevOps Pipeline setup for Mule APIs #36MysoreMuleSoftMeetup
88 views21 slides
American Psychological Association 7th Edition.pptx by
American Psychological Association  7th Edition.pptxAmerican Psychological Association  7th Edition.pptx
American Psychological Association 7th Edition.pptxSamiullahAfridi4
74 views8 slides

Recently uploaded(20)

ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks Effectively by PECB
ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks EffectivelyISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks Effectively
ISO/IEC 27001 and ISO/IEC 27005: Managing AI Risks Effectively
PECB 457 views
UWP OA Week Presentation (1).pptx by Jisc
UWP OA Week Presentation (1).pptxUWP OA Week Presentation (1).pptx
UWP OA Week Presentation (1).pptx
Jisc68 views
American Psychological Association 7th Edition.pptx by SamiullahAfridi4
American Psychological Association  7th Edition.pptxAmerican Psychological Association  7th Edition.pptx
American Psychological Association 7th Edition.pptx
SamiullahAfridi474 views
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx by Inge de Waard
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptxOEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx
OEB 2023 Co-learning To Speed Up AI Implementation in Courses.pptx
Inge de Waard165 views
The basics - information, data, technology and systems.pdf by JonathanCovena1
The basics - information, data, technology and systems.pdfThe basics - information, data, technology and systems.pdf
The basics - information, data, technology and systems.pdf
JonathanCovena177 views
Chemistry of sex hormones.pptx by RAJ K. MAURYA
Chemistry of sex hormones.pptxChemistry of sex hormones.pptx
Chemistry of sex hormones.pptx
RAJ K. MAURYA119 views
Narration lesson plan.docx by TARIQ KHAN
Narration lesson plan.docxNarration lesson plan.docx
Narration lesson plan.docx
TARIQ KHAN99 views
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) by AnshulDewangan3
 Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation) Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
Compare the flora and fauna of Kerala and Chhattisgarh ( Charttabulation)
AnshulDewangan3275 views
NS3 Unit 2 Life processes of animals.pptx by manuelaromero2013
NS3 Unit 2 Life processes of animals.pptxNS3 Unit 2 Life processes of animals.pptx
NS3 Unit 2 Life processes of animals.pptx
manuelaromero2013102 views
Community-led Open Access Publishing webinar.pptx by Jisc
Community-led Open Access Publishing webinar.pptxCommunity-led Open Access Publishing webinar.pptx
Community-led Open Access Publishing webinar.pptx
Jisc69 views

Tamu big data-conf-1b

  • 1. The Early Modern OCR Project Big Data in the Humanities Matthew Christy, Laura Mandell, Elizabeth Grumbach
  • 2.  emop.tamu.edu/  Texas A&M Big Data Workshop  emop.tamu.edu/TAMU- BigData  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 2
  • 3. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)  Total: >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed docuemnts  44,000 EEBO  2,200 ECCO 3
  • 5. 5 • PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK • SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign • PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University • The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University • The Brazos High Performance Computing Cluster (HPCC) OurPartners
  • 6. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking  Old, low-quality, small tiff files  Noise, skew, warp, bleedthrough, 6
  • 7. Page Images DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 7
  • 8. 8 Workflows-Controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py
  • 10. Brazos HPCC • 128 processors as a stakeholder • Access to background queues • Estimated at over 2 months of constant processing • We will have to reprocess some files that fail due to timeouts or that require pre-processing 10

Editor's Notes

  1. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  2. These are not big numbers by STEM standards, but for the Humanities they are: This is the largest dataset for any open-source, academic OCR project in North American, ever For the Humanities research data is about quality of content rather than size Our OCR must be as good as possible in order to generate reliable research
  3. Some were great most were not Noisy Skewed Warped Or they posed challenges for OCR engines Multiple pages per image Multiple columns Images & decorative elements Marginalia Missing margins many were terrible
  4. Dashboard provides access to the DB via an API Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS