Florida Records OCR
Daniel Vasicek
Data Scientist
Access Innovations
February 7, 2018
Background
• The University of Florida has a diverse set of records that it wants to index, but many of these records were not natively digital and have poor OCR.
• 29,842 directories containing thesis data (and more for Bryant collection documents) in various formats
• Lots of variety
• A combination of 2,471,339 TXT files, 29,859 XML files, and 26,124 PDFs
• Many images in SOME of the PDFs
• Some of the theses were digitized long ago using software that has greatly improved since then
• We need to select the best text. How do we determine the best text?
• Nineteen of the original University of Florida theses have no text
[Flowchart: text-selection pipeline]
• Source 1: Original UF TXT
• Source 2: Original PDF with text information, extracted with pdftotext
• Source 3: PDF with text information plus OCR data (images extracted for tesseract OCR and merged into the text data)
• Analysis to determine the “best” text
• Text indexed and converted to the final XIS XML file
Determine how to identify the best text
• Each version of the text for a record was compared against a large list of published words, which were assumed to be “good”.
• These included
• Standard dictionary words
• Acronyms
• Made-up words previously published
• Common misspellings
• Common word variations
• The path that produced the largest number of “good” words was the one chosen for the final text
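The selection rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names, the tokenizer (lowercase runs of letters), and the tiny word list are all assumptions made for the example.

```python
import re


def count_good_words(text, good_words):
    """Count tokens in `text` that appear in the `good_words` set.

    Tokenization (lowercased runs of ASCII letters) is an assumption;
    the slides do not say how the text was split into words.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in good_words)


def choose_best_text(candidates, good_words):
    """Return the name of the candidate text with the most good words.

    `candidates` maps a hypothetical path name (e.g. "original_txt",
    "pdftotext", "tesseract") to the text that path produced.
    """
    return max(candidates, key=lambda name: count_good_words(candidates[name], good_words))


# Hypothetical miniature word list and two versions of the same sentence.
good_words = {"there", "are", "three", "basic", "challenges"}
candidates = {
    "old_ocr": "Therearethreebasicchallenges",
    "new_ocr": "There are three basic challenges",
}
print(choose_best_text(candidates, good_words))
```

Because the old OCR output has no spaces, it collapses into a single token that is not in the word list, so the re-extracted text wins. The result is only as good as the word list: if fused strings are themselves listed as words, bad text can score well.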
Dictionary of “Real” Words
(Here are 11 “words” around the word “and”)
• ancylis anczo and anda
• andab andac andacc andaccuracy
• andaction andactivation andactivity
But what if there is no text?
• There are 19 records (theses) that have no text.
• Why study these first?
• There are only a few of these, and they are extreme. They will show some of the variety present in the rest of the records.
• Methods for extracting text from these 19 records will help obtain better text for all the records, not just these 19.
• 14 of the 19 produce text using the pdftotext function, which pulls text data out of the PDF!
• Two had orientation issues and had to be rotated 270 degrees to improve the quality of the OCR
• And 5 were improved using tesseract OCR
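The fallback order described here (use the text already embedded in the PDF, otherwise OCR the page images and try different orientations) can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's code: `ocr_at_angle` and `score` are hypothetical callables supplied by the caller. In practice they might wrap a tesseract run on a rotated image and a good-word count, but that wiring is left out here.

```python
def best_rotation(ocr_at_angle, score, angles=(0, 90, 180, 270)):
    """Run OCR at each rotation and keep the highest-scoring result.

    ocr_at_angle: hypothetical callable, angle in degrees -> OCR text.
    score: hypothetical callable, text -> number (e.g. a good-word count).
    Ties are resolved in favor of the earliest angle in `angles`.
    """
    results = {angle: ocr_at_angle(angle) for angle in angles}
    best = max(angles, key=lambda angle: score(results[angle]))
    return best, results[best]


def extract_text(embedded_text, ocr_at_angle, score):
    """Prefer text embedded in the PDF; fall back to OCR when it is empty."""
    if embedded_text.strip():
        return embedded_text
    _, text = best_rotation(ocr_at_angle, score)
    return text
```

Scoring every orientation avoids having to know in advance which pages are sideways, at the cost of four OCR passes per image.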
Potential OCR Issues
• Some pictures might have no text
• Combination of text and images
• Picture orientation is important
• Color choices matter
[Image: example of a problematic PDF image that originally produced no text]
Comparison of Old OCR with New OCR

Original Text (old OCR):
1.1 Introduction
Therearethreebasicchallengesinmolecularbiology:(
i)identifyingnewgenes;(ii)locatingthecodingregion
so fthegenes;and(iii)analyzingthefunc-
tionsofthegenes.Inthepast,researchersgenerallyw
orkedonaonegene,oneexperiment"bas
is.Onegoalofthehumangenomeprojectistoobtaint
hegeneticcodesforthehumangenome.However,ha
vingthecodesongenesisonlythers
strand1tstep.Biologistsarealsointerestedindiscove
ringthefunctionofgenesandtheinteractionsbetwee
nindividualgenes.

Access Innovations Text (new OCR):
1.1 Introduction
There are three basic challenges in molecular
biology: (i) identifying new genes; (ii) locating the
coding regions of the genes; and (iii) analyzing the
functions of the genes. In the past, researchers
generally worked on a “one gene, one
experiment” basis. One goal of the human
genome project is to obtain the genetic codes for
the human genome. However, having the codes
on genes is only the first step. Biologists are also
interested in discovering the function of genes and
the interactions between individual genes…
Number of Images per PDF

Images per PDF        Number of PDFs
0                     5445
1 to 9                2938
10 to 99              7194
100 to 999            9378
1000 to 9999          943
10000 to 99999        199
100000 to 999999      26
1 million +           1

27 records (26 + 1) have 100,000 or more images!
Histogram of the Number of Images per PDF
(The Law of Diminishing Returns)

All columns are cumulative.

Images per PDF   % of PDFs   % of Images   Number of Images
none             21          0             0
<10              32          0.07          12,954
<100             60          1.5           289,751
<1000            96          22            4,129,309
<10000           99.1        35            6,603,208
<100000          99.9        66            12,554,430
<1000000         99.9962     94.5          17,998,752
<10000000        100         100           19,037,665
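The "% of PDFs" column follows directly from the per-bin counts in the histogram on the previous slide. A short sketch of that arithmetic (the variable and function names are illustrative):

```python
from itertools import accumulate

# Per-bin PDF counts from the histogram slide, in order of increasing image count.
bins = ["0", "1 to 9", "10 to 99", "100 to 999", "1000 to 9999",
        "10000 to 99999", "100000 to 999999", "1 million +"]
pdf_counts = [5445, 2938, 7194, 9378, 943, 199, 26, 1]


def cumulative_percent(counts):
    """Return the running total of `counts` as a percentage of the grand total."""
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]


for label, pct in zip(bins, cumulative_percent(pdf_counts)):
    print(f"{label:>18}: {pct:8.4f}% of PDFs at or below this bin")
```

The counts sum to 26,124, the number of PDFs given in the background slide. Read against the table, the diminishing returns are plain: the PDFs with fewer than 100,000 images are 99.9% of the collection but hold only 66% of the images, so the remaining 27 records hold roughly a third of all images.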
Why do some have so many pictures?
• There are 27 records with over 100,000 images
• One record has over a million images!
• 344,000 copies of this picture (a 16x16 checkerboard) in a record
• Challenge: balancing the time to process all the images against their utility
• Still determining the cost-benefit ratio for when not to bother processing a PDF (we processed them ALL!)
• Working on a programmatic way to determine when the images in a PDF are not useful
• The initial batch of data ran in 14 parallel threads and produced over 15 GB of text files (19 million text files)
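The slide leaves the "programmatic way" open. One possible heuristic, offered only as a sketch and not as the method the project adopted, is to skip very small images and to OCR each distinct image once, so that 344,000 copies of the same checkerboard cost a single pass at most. The function name and the `min_bytes` threshold are assumptions for illustration.

```python
import hashlib


def select_images_for_ocr(images, min_bytes=256):
    """Return indices of images worth sending to OCR.

    images: sequence of raw image byte strings extracted from one PDF.
    min_bytes: hypothetical size cutoff below which an image is assumed
        to be decoration (for example a tiny repeated tile) rather than text.
    Exact duplicates are detected by SHA-256 digest and kept only once.
    """
    seen = set()
    selected = []
    for index, data in enumerate(images):
        if len(data) < min_bytes:
            continue  # too small to contain readable text (assumption)
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical bytes already selected
        seen.add(digest)
        selected.append(index)
    return selected
```

Exact hashing only catches byte-identical copies. Images that differ slightly in encoding would need a perceptual comparison, which is outside this sketch.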
What Can We Take from This?
• Accurate indexing must start with accurate text!
• Legacy OCR data (and indexing) can often be improved.
• There are a great many possible ways of making PDFs and consequently many possible bottlenecks.
• The best text can be determined programmatically by comparing against a list of good words. (And I probably need to make a better list of good words!)
• While there are challenges, the ability to solve this problem exists and our techniques are solid!
Questions?
Thanks!
Dan Vasicek
Daniel_Vasicek@accessinn.com

DHUG 2018 - Florida Thesis OCR
