
DHUG 2018 - Florida Thesis OCR



Daniel Vasicek discusses the processes undertaken to OCR various kinds of content from the University of Florida Special Collections to make them machine-readable for indexing.

Published in: Software


  1. Florida Records OCR. Daniel Vasicek, Data Scientist, Access Innovations. February 7, 2018
  2. Background
     • The University of Florida has a diverse set of records which they want to index, but often these records were not born digital and have poor OCR.
     • 29,842 directories containing thesis data (and more for Bryant collection documents) in various formats
     • Lots of variety: a combination of TXT files (2,471,339), XML files (29,859), and PDFs (26,124)
     • Many images in SOME of the PDFs
     • Some of the theses were digitized long ago using software that has greatly improved since then
     • We need to select the best text. How do we determine the best text?
     • Nineteen of the original University of Florida theses have no text
  3. Processing pipeline (flowchart): the original UF TXT and the original PDF with text information feed into pdftotext; images are extracted for tesseract OCR and merged into the text data, yielding the PDF's text information plus OCR data; analysis then determines the "best" text, which is indexed and converted to the final XIS XML file.
  4. Determine how to identify the best text
     • Each version of text for a record was compared against a large list of published words which were assumed to be "good". These included:
       • Standard dictionary words
       • Acronyms
       • Made-up words previously published
       • Common misspellings
       • Common word variations
     • The path that produced the largest number of "good" words was the one chosen for the final text
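The selection step above can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny `GOOD_WORDS` set stands in for the real multi-million-entry word list, and the candidate texts echo the example on slide 8.

```python
# Hypothetical sketch of "best text" selection: score each candidate
# extraction (legacy OCR, pdftotext output, tesseract output, ...) by how
# many of its tokens appear in a list of known-good words, and keep the
# highest-scoring candidate.
import re

# Stand-in for the real good-word list, which holds millions of entries.
GOOD_WORDS = {"there", "are", "three", "basic", "challenges",
              "in", "molecular", "biology"}

def good_word_score(text):
    """Count tokens that appear in the good-word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for t in tokens if t in GOOD_WORDS)

def pick_best_text(candidates):
    """Return the label of the candidate with the most good words."""
    return max(candidates, key=lambda k: good_word_score(candidates[k]))

candidates = {
    "legacy_ocr": "Therearethreebasicchallengesinmolecularbiology",
    "new_ocr": "There are three basic challenges in molecular biology",
}
print(pick_best_text(candidates))  # → new_ocr
```

The run-together legacy text scores zero because its single long token matches nothing, while the properly spaced text scores one point per recognized word.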
  5. Dictionary of "Real" Words (here are 11 "words" around the word "and"):
     • ancylis, anczo, and, anda
     • andab, andac, andacc, andaccuracy
     • andaction, andactivation, andactivity
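A sorted word list makes "neighborhood" queries like the one above cheap via binary search. A sketch using Python's standard `bisect` module, with the 11 entries from the slide as the illustrative word list:

```python
# Look up the alphabetical neighborhood of a word in a sorted word list.
import bisect

word_list = sorted(["ancylis", "anczo", "and", "anda", "andab", "andac",
                    "andacc", "andaccuracy", "andaction", "andactivation",
                    "andactivity"])

def neighborhood(word, k=5):
    """Return up to k entries on either side of `word` in the sorted list."""
    i = bisect.bisect_left(word_list, word)
    return word_list[max(0, i - k): i + k + 1]

print(neighborhood("and", k=2))  # → ['ancylis', 'anczo', 'and', 'anda', 'andab']
```

Note that concatenation artifacts like "andaccuracy" appear in the list because they were previously published, which is why the good-word list needs curation.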
  6. But what if there is no text?
     • There are 19 records (theses) that have no text.
     • Why study these first? There are only a few of them, and they are extreme. They will show some of the variety present in the rest of the records.
     • Methods for extracting text from these 19 records will help obtain better text for all the records as well as these 19.
     • 14 of the 19 produce text using the pdftotext function, which pulls text data out of the PDF!
     • Two had orientation issues and had to be rotated 270 degrees to improve the quality of the OCR enough to produce text
     • And 5 were improved using tesseract OCR
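The 270-degree rotation mentioned above is a simple preprocessing step. As a toy sketch of the operation on a raw pixel matrix (a real pipeline would rotate the page image with an imaging library such as Pillow before handing it to tesseract):

```python
# Rotate a 2D pixel matrix 270° clockwise (equivalently 90° counter-clockwise)
# before OCR. Illustrative only; real images carry multi-channel pixels.
def rotate_270_cw(pixels):
    """Transpose the matrix, then reverse the row order."""
    return [list(row) for row in zip(*pixels)][::-1]

page = [[1, 2],
        [3, 4]]
print(rotate_270_cw(page))  # → [[2, 4], [1, 3]]
```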
  7. Potential OCR Issues
     • Some pictures might have no text
     • Combination of text and images
     • Picture orientation is important
     • Color choices matter
     (Example problematic PDF image that originally produced no text.)
  8. Comparison of Old OCR with New OCR
     Original text:
     1.1 Introduction Therearethreebasicchallengesinmolecularbiology:( i)identifyingnewgenes;(ii)locatingthecodingregion so fthegenes;and(iii)analyzingthefunc- tionsofthegenes.Inthepast,researchersgenerallyw orkedonaonegene,oneexperiment"bas is.Onegoalofthehumangenomeprojectistoobtaint hegeneticcodesforthehumangenome.However,ha vingthecodesongenesisonlythers strand1tstep.Biologistsarealsointerestedindiscove ringthefunctionofgenesandtheinteractionsbetwee nindividualgenes.
     Access Innovations text:
     1.1 Introduction There are three basic challenges in molecular biology: (i) identifying new genes; (ii) locating the coding regions of the genes; and (iii) analyzing the functions of the genes. In the past, researchers generally worked on a “one gene, one experiment” basis. One goal of the human genome project is to obtain the genetic codes for the human genome. However, having the codes on genes is only the first step. Biologists are also interested in discovering the function of genes and the interactions between individual genes…
  9. Number of Images per PDF (bar chart)

     Images per PDF        Number of PDFs
     0                     5,445
     1 to 9                2,938
     10 to 99              7,194
     100 to 999            9,378
     1,000 to 9,999          943
     10,000 to 99,999        199
     100,000 to 999,999       26
     1 million +               1

     (The 26 + 1 = 27 records with over 100,000 images are called out on the chart.)
  10. Histogram of the Number of Images per PDF (The Law of Diminishing Returns)

      Images per PDF   % of PDFs   % of Images   Number of Images
      none             21          0             0
      <10              32          0.07          12,954
      <100             60          1.5           289,751
      <1,000           96          22            4,129,309
      <10,000          99.1        35            6,603,208
      <100,000         99.9        66            12,554,430
      <1,000,000       99.9962     94.5          17,998,752
      <10,000,000      100         100           19,037,665

      (Percentages of PDFs and of images are cumulative.)
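The cumulative percentages in the table above come directly from the per-PDF image counts. A hedged sketch of that computation with made-up counts (the real run used the 26,124 PDFs):

```python
# For each threshold t, report what fraction of PDFs have fewer than t
# images, and what fraction of all images those PDFs contain.
def diminishing_returns(image_counts, thresholds):
    total_pdfs = len(image_counts)
    total_images = sum(image_counts)
    rows = []
    for t in thresholds:
        kept = [c for c in image_counts if c < t]
        rows.append((t,
                     round(100 * len(kept) / total_pdfs, 1),
                     round(100 * sum(kept) / total_images, 1)))
    return rows

counts = [0, 0, 3, 8, 40, 500, 120000]  # images per PDF (illustrative)
for t, pct_pdfs, pct_images in diminishing_returns(counts, [10, 100, 1000]):
    print(f"<{t}: {pct_pdfs}% of PDFs, {pct_images}% of images")
```

Even in this toy data the pattern from the slide appears: most PDFs hold almost none of the images, because a handful of outlier records dominate the image total.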
  11. Why do some have so many pictures?
      • There are 27 records with over 100,000 images
      • One record has over a million images!
      • 344,000 copies of one picture (a 16x16 checkerboard) in a single record
      • Challenge balancing the time to process all the images vs. their utility
      • Still determining the cost/benefit ratio for when not to bother processing a PDF (we processed them ALL!)
      • Working on a programmatic way to determine when the images in a PDF are not useful
      • The initial batch of data ran in 14 parallel threads and produced over 15 GB of text files (19 million text files)
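A record holding 344,000 copies of the same checkerboard suggests one cheap optimization, not described in the talk but a natural fit: hash each extracted image's bytes and OCR each distinct payload only once. A sketch using the standard library:

```python
# Skip exact-duplicate images by hashing their raw bytes, so a record with
# 344,000 copies of one picture costs only one OCR pass.
import hashlib

def unique_images(image_blobs):
    """Yield only the first occurrence of each distinct image payload."""
    seen = set()
    for blob in image_blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield blob

blobs = [b"checkerboard-bytes"] * 5 + [b"diagram-bytes"]
print(len(list(unique_images(blobs))))  # → 2
```

This catches only byte-identical copies; near-duplicates (recompressed or rescaled images) would need a perceptual hash instead.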
  12. What Can We Take from This?
      • Accurate indexing must start with accurate text!
      • Legacy OCR data (and indexing) can often be improved.
      • There are a great many possible ways of making PDFs, and consequently many possible bottlenecks.
      • The best text can be determined programmatically by comparing against a list of good words. (And I probably need to make a better list of good words!)
      • While there are challenges, the ability to solve this problem exists and our techniques are solid!
  13. Questions?
  14. Thanks! Dan Vasicek