Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing. Chris Freeland, Technical Director, Biodiversity Heritage Library. BioSystematics Berlin 2011, 22 Feb 2011. http://biodiversitylibrary.org/page/33061402
Digitization http://biodiversitylibrary.org/page/6165462
Workflow: Conservation, Digitization, Selection, Preparation, Post Production, (Re)publication
Scanning and derivatives: files are stored & sync’d across BHL clusters. [Diagram: master and derivative files (XML, JP2, PDF, JPG, TXT, DjVu, OCR) held in storage]
Optical Character Recognition (OCR) http://biodiversitylibrary.org/page/2836705
OCR is a *BIG* challenge. All book / literature digitization projects are affected, not just BHL. It is especially problematic in BHL: more than 50 languages represented in BHL; dates of publication from the 1400s to the 2000s; irregular typeface / typesetting; multiple languages on one page; botanical descriptions in Latin.
Abbildungenund Beschreibungen der FischeSyriens, nebst einerneuen Classification und Characteristik sämmtlicherGattungen der i JOH. JAKOB HECKEL,  Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' seheVerlagshandlung, 1843.
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a 6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. au:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi
2007 Name Finding Study: >35% OCR error rate for names only (35.16%). Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. [Chart: top OCR errors] Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
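The headline figure follows directly from the two counts on the slide. A minimal arithmetic check, with variable names chosen here purely for illustration:

```python
# Counts reported on the slide (Wei et al., 2008).
total_names = 3003   # names examined in the study
incorrect = 1056     # names incorrectly transcribed by OCR

# Error rate for names only, as a fraction of all names examined.
error_rate = incorrect / total_names

print(f"{error_rate:.2%}")
```

Formatted to two decimal places, this reproduces the 35.16% shown on the slide.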
Manual techniques for text correction: WikiSource; Trove (National Library of Australia)
WikiSource Example http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293
Goal: Semi-automated text correction. OCR + Machine Learning + Users. Let machines do raw processing. Develop algorithms for natural language processing & machine learning. Build a community of (human) users to help, with reCAPTCHA as an example. Why not just use reCAPTCHA? Google bought it. *More work needed here*
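One way to picture the "machines first, humans second" split is a fuzzy lookup that proposes a correction for an OCR'd name and leaves anything uncertain for a person to review. The sketch below is only an illustration under stated assumptions, not BHL's method: the `KNOWN_NAMES` list, the `suggest_correction` helper, and the 0.85 cutoff are all hypothetical, and the matching uses Python's standard `difflib` rather than any trained model.

```python
import difflib

# Hypothetical reference list; a real system would draw on a large name index.
KNOWN_NAMES = ["Quercus alba", "Quercus rubra", "Pinus strobus"]

def suggest_correction(ocr_name, known_names=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough.

    A None result marks the string as a candidate for human review.
    """
    matches = difflib.get_close_matches(ocr_name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# "1" misread for "l" is a typical OCR confusion.
print(suggest_correction("Quercus a1ba"))
# A string with no close match is left for a human to check.
print(suggest_correction("Xyzzy plugh"))
```

The cutoff trades automation against risk: a lower value corrects more strings automatically but makes wrong substitutions more likely, which is where community review would come in.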
Scientific names mapping  http://biodiversitylibrary.org/page/27782237
Name finding in action with uBio’s TaxonFinder… Image from scanner; converted to text via OCR; name finding via TaxonFinder (extract names, submit to NameBank); TaxonFinder API response.
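The "extract names" step can be sketched with a deliberately naive pattern match over OCR text. This is not TaxonFinder's algorithm and does not call the uBio service; the `BINOMIAL` pattern and the `find_candidate_names` helper are assumptions made for illustration only.

```python
import re

# Naive pattern for a Latin binomial: a capitalized genus followed by a
# lowercase species epithet of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_candidate_names(text):
    """Return strings shaped like 'Genus species' found in OCR text."""
    return [f"{genus} {species}" for genus, species in BINOMIAL.findall(text)]

ocr_text = "Specimens of Salmo trutta were collected near Wien."
print(find_candidate_names(ocr_text))
```

A pattern this simple also matches ordinary phrases such as "The specimens", so its output is only a list of candidates. Checking them against a name index, which is the role NameBank plays in the slide, is what would separate real names from false positives.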
Crowdsourcing http://biodiversitylibrary.org/page/20965795
CiteBank (http://citebank.org): new search index to BHL content; platform for journals/publishers/societies in need of tools to store & share their digitized content; access to “crowdsourced” articles from BHL scans.
Crowdsourcing Statistics & Analysis. Analysis: http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html At that time, more than 80% of the PDFs created had metadata attached by users. More than 50% contributed accurate article-level information. New analysis over more data this summer / fall. Now have more than 58,000 PDFs to analyze.
Open Data = More Use. Scholars: Rod Page (iPhylo, BioGUID, BioStor); Ryan Schenk. Other apps: EarthCape, ZipcodeZoo.
Conclusion. BHL is a massive dataset useful for multidisciplinary research: systematics, natural language processing, humanities. BHL is open: free to use at http://biodiversitylibrary.org, with open access data for scholarly use & reuse. BHL has APIs and data exports to enable reuse. BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others).
Questions? Chris Freeland, Technical Director, Biodiversity Heritage Library; Director, Center for Biodiversity Informatics, Missouri Botanical Garden. Missouri Botanical Garden, 4344 Shaw Blvd., St. Louis, MO 63110 USA. Email: chris.freeland@mobot.org Twitter: @chrisfreeland Blog / info: chrisfreeland.com


  • 1. Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing. Chris Freeland, Technical Director, Biodiversity Heritage Library. BioSystematics Berlin 2011, 22 Feb 2011. http://biodiversitylibrary.org/page/33061402
  • 3. Workflow: Conservation, Digitization, Selection, Preparation, Post Production, (Re)publication
  • 4. Scanning; Derivatives; Files are stored & sync’d across BHL clusters; Master; Derivatives; XML, JP2, PDF, JPG, TXT, DjVu; Storage; PDF, OCR, JP2, XML
  • 5. Optical Character Recognition (OCR) http://biodiversitylibrary.org/page/2836705
  • 6. OCR is a *BIG* challenge. All book / literature digitization projects are affected, not just BHL. Especially problematic in BHL: more than 50 languages represented; dates of publication from the 1400s to the 2000s; irregular typeface / typesetting; multiple languages on one page; botanical descriptions in Latin.
  • 7. Abbildungenund Beschreibungen der FischeSyriens, nebst einerneuen Classification und Characteristik sämmtlicherGattungen der i JOH. JAKOB HECKEL, Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' seheVerlagshandlung, 1843.
  • 8. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a 6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. au:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi
  • 9. 2007 Name Finding Study: 35.16% (>35%) OCR error rate for names only. Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. Top OCR errors. Source: Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
  • 10. Manual techniques for text correction: Wikisource; Trove (National Library of Australia)
  • 12. Goal: Semi-automated text correction. OCR + Machine Learning + Users. Let machines do raw processing. Develop algorithms for natural language processing & machine learning. Build a community of (human) users to help, with reCAPTCHA as an example. Why not just use reCAPTCHA? Google bought it. *More work needed here*
  • 13. Scientific names mapping http://biodiversitylibrary.org/page/27782237
  • 14. TaxonFinder API response; Name finding via TaxonFinder; Extract names; Submit to NameBank; Image from scanner; Converted to text via OCR. Name Finding in action with uBio’s TaxonFinder…
  • 28. CiteBank (http://citebank.org): new search index to BHL content; platform for journals/publishers/societies in need of tools to store & share their digitized content; access to “crowdsourced” articles from BHL scans.
  • 34. Crowdsourcing Statistics & Analysis. Analysis: http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html At that time, more than 80% of the PDFs created had metadata attached by users, and more than 50% contributed accurate article-level information. New analysis over more data this summer / fall. Now have more than 58,000 PDFs to analyze.
  • 35. Open Data = More Use. Scholars: Rod Page (iPhylo, BioGUID, BioStor), Ryan Schenk. Other apps: EarthCape, ZipcodeZoo.
  • 36. Conclusion. BHL is a massive dataset useful for multidisciplinary research: systematics, natural language processing, humanities. BHL is open: free to use at http://biodiversitylibrary.org, with open access data for scholarly use & reuse. BHL has APIs and data exports to enable reuse. BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others).
  • 37. Questions? Chris Freeland, Technical Director, Biodiversity Heritage Library; Director, Center for Biodiversity Informatics, Missouri Botanical Garden. Missouri Botanical Garden, 4344 Shaw Blvd., St. Louis, MO 63110 USA. Email: chris.freeland@mobot.org Twitter: @chrisfreeland Blog / info: chrisfreeland.com BioSystematics Berlin 2011, 22 Feb 2011
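
The name-finding step on slide 14 (OCR text in, scientific names out) and the error-rate figure on slide 9 can be illustrated with a rough sketch. This is not TaxonFinder's algorithm or API. The regular expression, the `KNOWN_GENERA` set, and both function names are hypothetical and chosen only for illustration. The sketch assumes a scientific name looks like a capitalized genus followed by a lowercase epithet, and that the genus must appear in a small reference list.

```python
import re

# Assumption: a candidate name is "Genus species" (capitalized word + lowercase word).
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for a real name dictionary such as NameBank.
KNOWN_GENERA = {"Salmo", "Quercus"}


def find_candidate_names(ocr_text):
    """Return 'Genus species' strings whose genus is in KNOWN_GENERA."""
    return [
        f"{genus} {epithet}"
        for genus, epithet in BINOMIAL.findall(ocr_text)
        if genus in KNOWN_GENERA
    ]


def error_rate(total_names, wrong_names):
    """Percentage of names that OCR transcribed incorrectly."""
    return 100.0 * wrong_names / total_names


if __name__ == "__main__":
    # Clean OCR text: both names are found.
    print(find_candidate_names("Salmo trutta and Quercus alba"))
    # OCR damage ("Sahno" for "Salmo") makes the name invisible to the lookup.
    print(find_candidate_names("The specimen of Sahno trutta"))
    # Figures from slide 9: 1,056 of 3,003 names were wrong.
    print(round(error_rate(3003, 1056), 2))
```

The second call shows why OCR quality matters for name mapping: a single misread character in the genus is enough for a dictionary-based lookup to miss the name entirely.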