A presentation at DH2014-Lausanne of the Tesseract training methods and tools developed by eMOP (at the IDHMC at Texas A&M), and their uses for other book history and typeface history research.
1.
eMOP Book History Tools
Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
Matt Christy,
Todd Samuelson,
Katayoun Torabi,
Bryan Tarpley,
Elizabeth Grumbach
2.
eMOP Info
eMOP Website: emop.tamu.edu/
DH2014 Presentation: emop.tamu.edu/book-history-tools
eMOP Workflows: emop.tamu.edu/workflows
Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
Facebook: Early Modern OCR Project
Twitter: #emop, @IDHMC_Nexus, @matt_christy, @EMGrumbach
DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
3.
Early Modern OCR Project
The Early Modern OCR Project (eMOP) is a grant project funded by the
Andrew W. Mellon Foundation and based at the Initiative for Digital
Humanities, Media, and Culture (IDHMC) at Texas A&M University.
It develops and tests tools and techniques for applying Optical
Character Recognition (OCR) to early modern English documents
from the hand-press period, roughly 1475-1800.
eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
Specifically, eMOP’s goal is to make
machine readable, or to improve the
readability of, 305,000 documents (45
million pages) of text from two major
proprietary databases: Eighteenth
Century Collections Online (ECCO)
and Early English Books Online (EEBO).
Generally, our aim is to use typeface
and book history techniques to train
modern OCR engines specifically on
the typefaces in our collection of
documents, and thereby improve the
accuracy of the OCR results.
4.
Training Tesseract
5.
Aletheia
www.primaresearch.org/tools.php
Available for free but requires
registration.
Created by PRImA Research
Labs, University of Salford, UK.
Windows based tool.
Developed as a groundtruth
creation tool.
Used by eMOP undergraduate
student workers to create training
data for the desired typeface for Tesseract.
Can identify glyphs on a page
image with page coordinates and
Unicode values.
6.
Aletheia: Workflow
Binarization and Denoise are native Aletheia functions.
A team of undergraduate student workers refines and
corrects glyph boxes and Unicode values where needed.
Output: A set of PAGE XML files with page coordinates and
Unicode values for every identified glyph on each processed
TIFF image.
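The PAGE XML output described above can be read with a few lines of Python. This is a minimal sketch, not the eMOP tooling: it assumes the glyph outline is stored in a `points` attribute on each `Coords` element (as in the 2013 PAGE schema; older versions use `Point` child elements instead), and the sample document is a simplified, hypothetical fragment rather than a schema-valid file.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PAGE XML fragment (not schema-valid).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageWidth="100" imageHeight="50">
    <TextRegion id="r1"><TextLine id="l1"><Word id="w1">
      <Glyph id="g1"><Coords points="10,5 20,5 20,25 10,25"/>
        <TextEquiv><Unicode>ſ</Unicode></TextEquiv></Glyph>
      <Glyph id="g2"><Coords points="22,10 30,10 30,25 22,25"/>
        <TextEquiv><Unicode>o</Unicode></TextEquiv></Glyph>
    </Word></TextLine></TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the XML namespace so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_glyphs(xml_text):
    """Return (unicode, x_min, y_min, x_max, y_max) for every labeled glyph."""
    root = ET.fromstring(xml_text)
    glyphs = []
    for element in root.iter():
        if local_name(element.tag) != "Glyph":
            continue
        coords = next((c for c in element if local_name(c.tag) == "Coords"), None)
        label = next((u for u in element.iter() if local_name(u.tag) == "Unicode"), None)
        if coords is None or label is None or not label.text:
            continue  # skip glyphs without an outline or a Unicode value
        points = [tuple(int(v) for v in pair.split(","))
                  for pair in coords.get("points", "").split()]
        if not points:
            continue
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        glyphs.append((label.text, min(xs), min(ys), max(xs), max(ys)))
    return glyphs


for glyph in read_glyphs(SAMPLE):
    print(glyph)
```

Reducing each outline to its bounding rectangle is a design choice for this sketch; it loses polygon detail but matches the rectangular boxes that Tesseract training expects.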
7.
Aletheia: Glyph Recognition
Uses Tesseract to find glyphs
8.
Aletheia: I/O
We then convert each PAGE XML
file to a Tesseract box file using
XSLT.
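eMOP performs this conversion with XSLT; the Python below is only an illustrative stand-in for the same step. It assumes each glyph has already been reduced to a character plus a bounding rectangle in image coordinates (origin at the top left), and it writes the Tesseract 3.x box format, `char left bottom right top page`, whose origin is at the bottom left. The function names are hypothetical.

```python
def to_box_line(char, x_min, y_min, x_max, y_max, image_height, page=0):
    """Format one glyph as a Tesseract box-file line.

    Input coordinates use a top-left origin; box files use a bottom-left
    origin, so the y values are flipped against the image height.
    """
    bottom = image_height - y_max
    top = image_height - y_min
    return f"{char} {x_min} {bottom} {x_max} {top} {page}"


def to_box_file(glyphs, image_height, page=0):
    """Join box lines for a list of (char, x_min, y_min, x_max, y_max) tuples."""
    lines = [to_box_line(*glyph, image_height=image_height, page=page)
             for glyph in glyphs]
    return "\n".join(lines) + "\n"


# Hypothetical glyphs on an image 50 pixels tall.
glyphs = [("ſ", 10, 5, 20, 25), ("o", 22, 10, 30, 25)]
print(to_box_file(glyphs, image_height=50), end="")
```

The y-axis flip is the detail that is easiest to get wrong: omitting it yields boxes that are mirrored vertically on the training image.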
9.
Tesseract Training
10.
Franken+
1. Windows based tool that uses a
MySQL DB.
2. Developed for eMOP by IDHMC
Graduate student worker Bryan
Tarpley.
3. Designed to be easily used by
eMOP Undergraduate student
workers
4. Takes Aletheia's output files as
input.
5. Outputs the same box files and TIFF
images that Tesseract's first stage
of native training produces.
Available open-source at:
github.com/idhmc-tamu/FrankenPlus
11.
Franken+ Workflow
1. Groups all glyphs with
the same Unicode
values into one window
for comparison.
2. Uses all selected glyphs
to create a Franken-
page image (TIFF) using
a selected text as a
base.
3. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training produces.
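The grouping and page-composition steps above can be sketched in Python. This is not Franken+'s implementation: the glyph dictionaries, the round-robin choice of exemplars, and the simple left-to-right layout with fixed line height are all assumptions made for illustration. The sketch computes only where each glyph would be placed; pasting the image crops into a TIFF is left out.

```python
from collections import defaultdict
from itertools import cycle


def group_by_unicode(glyphs):
    """Collect all exemplars that share the same Unicode value."""
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph["char"]].append(glyph)
    return groups


def layout_franken_page(base_text, groups, page_width, line_height,
                        gap=2, space_width=6):
    """Return (char, exemplar_id, x, y) placements for the base text.

    Exemplars are used in rotation (an assumption), and characters
    with no exemplar are skipped.
    """
    pickers = {char: cycle(items) for char, items in groups.items()}
    placements = []
    x, y = 0, 0
    for char in base_text:
        if char == "\n":
            x, y = 0, y + line_height
            continue
        if char == " ":
            x += space_width
            continue
        if char not in pickers:
            continue
        glyph = next(pickers[char])
        if x + glyph["w"] > page_width:  # wrap to the next line
            x, y = 0, y + line_height
        placements.append((char, glyph["id"], x, y))
        x += glyph["w"] + gap
    return placements


# Hypothetical exemplars: two for "a", one for "b" (widths in pixels).
glyphs = [
    {"id": "a1", "char": "a", "w": 10},
    {"id": "a2", "char": "a", "w": 12},
    {"id": "b1", "char": "b", "w": 8},
]
groups = group_by_unicode(glyphs)
for placement in layout_franken_page("aab a", groups, page_width=40, line_height=30):
    print(placement)
```

Because every placement records which character sits at which position, the matching box file can be written from the same list, with no recognition step needed.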
12.
Franken+ Ingestion
13.
Franken+
All exemplars of the
same glyph are
displayed together.
Users can quickly
identify and
deselect:
Incorrectly labeled
glyphs
Incomplete glyphs
Unrepresentative
exemplars
Different sized glyphs
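In Franken+ this review is done by eye. As a rough illustration of the "different sized glyphs" check only, the sketch below flags exemplars whose width or height differs from the group median by more than a tolerance. It is not a Franken+ feature; the function name, the dictionary fields, and the 25% tolerance are hypothetical.

```python
from statistics import median


def flag_size_outliers(exemplars, tolerance=0.25):
    """Return ids of exemplars whose width or height deviates from the
    group median by more than `tolerance` (as a fraction of the median)."""
    if not exemplars:
        return []
    median_w = median(e["w"] for e in exemplars)
    median_h = median(e["h"] for e in exemplars)
    flagged = []
    for e in exemplars:
        too_wide = abs(e["w"] - median_w) > tolerance * median_w
        too_tall = abs(e["h"] - median_h) > tolerance * median_h
        if too_wide or too_tall:
            flagged.append(e["id"])
    return flagged


# Hypothetical exemplars of one glyph; "e4" is much larger than the rest.
exemplars = [
    {"id": "e1", "w": 10, "h": 20},
    {"id": "e2", "w": 11, "h": 21},
    {"id": "e3", "w": 10, "h": 19},
    {"id": "e4", "w": 18, "h": 35},
]
print(flag_size_outliers(exemplars))
```

A check like this could only suggest candidates; mislabeled or incomplete glyphs of normal size would still need a human reviewer.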
14.
Franken+
15.
Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
16.
Franken+ Results
AFTER
BEFORE
17.
eMOP Tesseract Training
18.
S-face / Y-face
Weiss, Adrian. “Font Analysis as a Bibliographical Method: the
Elizabethan Play-Quarto Printers and Compositors.” Studies
in Bibliography 43 (1990): 95-164.
Weiss organized late 16th and early 17th century
typefaces into these two general types (named for the
first works in which they were identified)
Y-Face, from an edition of The Malcontent
S-Face, from Ben Jonson's Sejanus
19.
S-face / Y-face
20.
Other Applications
A close examination of the typefaces used by a printer
An investigation of the typefaces used in a work or in the
same editions of a work
A reexamination of typefaces classified via a system (Proctor-Haebler)
21.
The end
For eMOP questions, please
contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu
Editor's Notes
Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
This is cheating: the result of scanning the same page we used to create the training.
So we think that Franken+ can be a really useful tool for the close examination of typefaces and book history, and opens up some of the admittedly tedious work to non-experts.