
DHUG 2018 - Florida Thesis OCR



Daniel Vasicek discusses the processes undertaken to OCR various kinds of content from the University of Florida Special Collections to make them machine-readable for indexing.

Published in: Software


  1. Florida Records OCR. Daniel Vasicek, Data Scientist, Access Innovations. February 7, 2018
  2. Background
     • The University of Florida has a diverse set of records which they want to index, but often these records were not born digital and have poor OCR.
     • 29,842 directories containing thesis data (and more for Bryant collection documents) in various formats
     • Lots of variety: a combination of TXT files (2,471,339), XML files (29,859), and PDFs (26,124)
     • Many images in SOME of the PDFs
     • Some of the theses were digitized long ago using software that has greatly improved since then
     • We need to select the best text. How do we determine the best text?
     • Nineteen of the original University of Florida theses have no text
  3. Processing pipeline (flowchart): the original UF TXT and the original PDF with text information feed into pdftotext; images are extracted for tesseract OCR and merged into the text data, yielding the PDF's text information plus OCR data; analysis then determines the "best" text, which is indexed and converted to the final XIS XML file.
  4. Determine how to identify the best text
     • Each version of text for a record was compared against a large list of published words which were assumed to be "good". These included:
       • Standard dictionary words
       • Acronyms
       • Made-up words previously published
       • Common misspellings
       • Common word variations
     • The path that produced the largest number of "good" words was the one chosen for the final text
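The selection step above can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny `GOOD_WORDS` set stands in for the real multi-million-entry word list, and the candidate texts echo the example on slide 8.

```python
# Hypothetical sketch of "best text" selection: score each candidate
# extraction (legacy OCR, pdftotext output, tesseract output, ...) by how
# many of its tokens appear in a list of known-good words, and keep the
# highest-scoring candidate.
import re

# Stand-in for the real good-word list, which holds millions of entries.
GOOD_WORDS = {"there", "are", "three", "basic", "challenges",
              "in", "molecular", "biology"}

def good_word_score(text):
    """Count tokens that appear in the good-word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for t in tokens if t in GOOD_WORDS)

def pick_best_text(candidates):
    """Return the label of the candidate with the most good words."""
    return max(candidates, key=lambda k: good_word_score(candidates[k]))

candidates = {
    "legacy_ocr": "Therearethreebasicchallengesinmolecularbiology",
    "new_ocr": "There are three basic challenges in molecular biology",
}
print(pick_best_text(candidates))  # → new_ocr
```

The run-together legacy text scores zero because its single long token matches nothing, while the properly spaced text scores one point per recognized word.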
  5. Dictionary of "Real" Words (here are 11 "words" around the word "and"):
     • ancylis, anczo, and, anda
     • andab, andac, andacc, andaccuracy
     • andaction, andactivation, andactivity
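A sorted word list makes "neighborhood" queries like the one above cheap via binary search. A sketch using Python's standard `bisect` module, with the 11 entries from the slide as the illustrative word list:

```python
# Look up the alphabetical neighborhood of a word in a sorted word list.
import bisect

word_list = sorted(["ancylis", "anczo", "and", "anda", "andab", "andac",
                    "andacc", "andaccuracy", "andaction", "andactivation",
                    "andactivity"])

def neighborhood(word, k=5):
    """Return up to k entries on either side of `word` in the sorted list."""
    i = bisect.bisect_left(word_list, word)
    return word_list[max(0, i - k): i + k + 1]

print(neighborhood("and", k=2))  # → ['ancylis', 'anczo', 'and', 'anda', 'andab']
```

Note that concatenation artifacts like "andaccuracy" appear in the list because they were previously published, which is why the good-word list needs curation.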
  6. But what if there is no text?
     • There are 19 records (theses) that have no text.
     • Why study these first? There are only a few of them, and they are extreme. They will show some of the variety present in the rest of the records.
     • Methods for extracting text from these 19 records will help obtain better text for all the records as well as these 19.
     • 14 of the 19 produce text using the pdftotext function, which pulls text data out of the PDF!
     • Two had orientation issues and had to be rotated 270 degrees to improve the quality of the OCR enough to produce text
     • And 5 were improved using tesseract OCR
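The 270-degree rotation mentioned above is a simple preprocessing step. As a toy sketch of the operation on a raw pixel matrix (a real pipeline would rotate the page image with an imaging library such as Pillow before handing it to tesseract):

```python
# Rotate a 2D pixel matrix 270° clockwise (equivalently 90° counter-clockwise)
# before OCR. Illustrative only; real images carry multi-channel pixels.
def rotate_270_cw(pixels):
    """Transpose the matrix, then reverse the row order."""
    return [list(row) for row in zip(*pixels)][::-1]

page = [[1, 2],
        [3, 4]]
print(rotate_270_cw(page))  # → [[2, 4], [1, 3]]
```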
  7. Potential OCR Issues
     • Some pictures might have no text
     • Combination of text and images
     • Picture orientation is important
     • Color choices matter
     (Example problematic PDF image that originally produced no text.)
  8. Comparison of Old OCR with New OCR
     Original text:
     1.1 Introduction Therearethreebasicchallengesinmolecularbiology:( i)identifyingnewgenes;(ii)locatingthecodingregion so fthegenes;and(iii)analyzingthefunc- tionsofthegenes.Inthepast,researchersgenerallyw orkedonaonegene,oneexperiment"bas is.Onegoalofthehumangenomeprojectistoobtaint hegeneticcodesforthehumangenome.However,ha vingthecodesongenesisonlythers strand1tstep.Biologistsarealsointerestedindiscove ringthefunctionofgenesandtheinteractionsbetwee nindividualgenes.
     Access Innovations text:
     1.1 Introduction There are three basic challenges in molecular biology: (i) identifying new genes; (ii) locating the coding regions of the genes; and (iii) analyzing the functions of the genes. In the past, researchers generally worked on a “one gene, one experiment” basis. One goal of the human genome project is to obtain the genetic codes for the human genome. However, having the codes on genes is only the first step. Biologists are also interested in discovering the function of genes and the interactions between individual genes…
  9. Number of Images per PDF (bar chart)

     Images per PDF        Number of PDFs
     0                     5,445
     1 to 9                2,938
     10 to 99              7,194
     100 to 999            9,378
     1,000 to 9,999          943
     10,000 to 99,999        199
     100,000 to 999,999       26
     1 million +               1

     (The 26 + 1 = 27 records with over 100,000 images are called out on the chart.)
  10. Histogram of the Number of Images per PDF (The Law of Diminishing Returns)

      Images per PDF   % of PDFs   % of Images   Number of Images
      none             21          0             0
      <10              32          0.07          12,954
      <100             60          1.5           289,751
      <1,000           96          22            4,129,309
      <10,000          99.1        35            6,603,208
      <100,000         99.9        66            12,554,430
      <1,000,000       99.9962     94.5          17,998,752
      <10,000,000      100         100           19,037,665

      (Percentages of PDFs and of images are cumulative.)
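The cumulative percentages in the table above come directly from the per-PDF image counts. A hedged sketch of that computation with made-up counts (the real run used the 26,124 PDFs):

```python
# For each threshold t, report what fraction of PDFs have fewer than t
# images, and what fraction of all images those PDFs contain.
def diminishing_returns(image_counts, thresholds):
    total_pdfs = len(image_counts)
    total_images = sum(image_counts)
    rows = []
    for t in thresholds:
        kept = [c for c in image_counts if c < t]
        rows.append((t,
                     round(100 * len(kept) / total_pdfs, 1),
                     round(100 * sum(kept) / total_images, 1)))
    return rows

counts = [0, 0, 3, 8, 40, 500, 120000]  # images per PDF (illustrative)
for t, pct_pdfs, pct_images in diminishing_returns(counts, [10, 100, 1000]):
    print(f"<{t}: {pct_pdfs}% of PDFs, {pct_images}% of images")
```

Even in this toy data the pattern from the slide appears: most PDFs hold almost none of the images, because a handful of outlier records dominate the image total.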
  11. Why do some have so many pictures?
      • There are 27 records with over 100,000 images
      • One record has over a million images!
      • 344,000 copies of one picture (a 16x16 checkerboard) in a single record
      • Challenge balancing the time to process all the images vs. their utility
      • Still determining the cost/benefit ratio for when not to bother processing a PDF (we processed them ALL!)
      • Working on a programmatic way to determine when the images in a PDF are not useful
      • The initial batch of data ran in 14 parallel threads and produced over 15 GB of text files (19 million text files)
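A record holding 344,000 copies of the same checkerboard suggests one cheap optimization, not described in the talk but a natural fit: hash each extracted image's bytes and OCR each distinct payload only once. A sketch using the standard library:

```python
# Skip exact-duplicate images by hashing their raw bytes, so a record with
# 344,000 copies of one picture costs only one OCR pass.
import hashlib

def unique_images(image_blobs):
    """Yield only the first occurrence of each distinct image payload."""
    seen = set()
    for blob in image_blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield blob

blobs = [b"checkerboard-bytes"] * 5 + [b"diagram-bytes"]
print(len(list(unique_images(blobs))))  # → 2
```

This catches only byte-identical copies; near-duplicates (recompressed or rescaled images) would need a perceptual hash instead.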
  12. What Can We Take from This?
      • Accurate indexing must start with accurate text!
      • Legacy OCR data (and indexing) can often be improved.
      • There are a great many possible ways of making PDFs, and consequently many possible bottlenecks.
      • The best text can be determined programmatically by comparing against a list of good words. (And I probably need to make a better list of good words!)
      • While there are challenges, the ability to solve this problem exists and our techniques are solid!
  13. Questions?
  14. Thanks! Dan Vasicek