BHL: Big Data, Big Challenges

Chris Freeland (@chrisfreeland)
Founding Technical Director, BHL
Sr. Director, University Academic Computing, Washington University
> 100,000 books, > 39 million pages, > 70 TB data
BHL Architecture: Window Seat Ed.
[Architecture diagram, three layers]
Access: APIs, UI, Data Exports
Logic: Geocoding, Name Finding, BHL DB, Data Transform Utilities (diagram also marks a Firewall)
Storage: Internet Archive, holding Images (JP2), PDF, Coordinate-based OCR, XML metadata
BHL Content: Structured Data
• Metadata from library catalogues at title/volume level
 <titleInfo>
 <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera
 relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus,
 secundum systema sexuale digestas...</title>
 </titleInfo>
 <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
 <name type="personal">
 <namePart>Linné, Carl von,</namePart>
 <namePart type="date">1707-1778</namePart>
 </name>
 <typeOfResource>text</typeOfResource>
 <genre authority="marcgt">book</genre>
 <originInfo>
 <place><placeTerm type="text">Holmiae :</placeTerm></place>
 <publisher>Impensis Laurentii Salvii,</publisher>
 <dateIssued>1753.</dateIssued>
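A minimal sketch of how catalogue metadata like this might be read programmatically. It assumes a well-formed, namespace-free copy of the excerpt: the `<mods>` wrapper, the closing `</originInfo>` tag, and the function name `summarize_mods` are illustrative additions, not part of the slide or of any BHL tool. Real MODS records normally carry an XML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the excerpt (wrapper and closing tags assumed).
MODS_FRAGMENT = """
<mods>
  <titleInfo><title>Caroli Linnaei ... Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>
"""


def _clean(value):
    """Strip the trailing cataloguing punctuation (", : .") seen in the record."""
    return value.rstrip(" ,:.") if value else value


def summarize_mods(xml_text):
    """Pull a few title-level fields out of a namespace-free MODS fragment."""
    root = ET.fromstring(xml_text)
    # Key each title by its "type" attribute; untyped titles are stored as "full".
    titles = {
        info.get("type", "full"): info.findtext("title")
        for info in root.findall("titleInfo")
    }
    name_parts = root.findall("name/namePart")
    author = next((p.text for p in name_parts if p.get("type") is None), None)
    dates = next((p.text for p in name_parts if p.get("type") == "date"), None)
    return {
        "title": titles.get("full"),
        "abbreviated_title": titles.get("abbreviated"),
        "author": _clean(author),
        "life_dates": dates,
        "place": _clean(root.findtext("originInfo/place/placeTerm")),
        "year": _clean(root.findtext("originInfo/dateIssued")),
    }


if __name__ == "__main__":
    for field, value in summarize_mods(MODS_FRAGMENT).items():
        print(f"{field}: {value}")
```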
BHL Content: Unstructured Data
More than 39 million pages!
... of uncorrected OCR
Abbild ungen und Beschreibungen
                   der

               Fische Syriens,
                    nebst
einer neuen Classification und Characteristik
           sämmtlicher Gattungen
                     der
                       i
             JOH. JAKOB HECKEL,
Inipectoi am k. k. Hof-Natur.-iUenkabinete in
    Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.




               STUTTGART.
  E. Schweizerbart' sehe Verlagshandlung,
                   1843.
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn
bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X
a�m cv(f b1air�'o�et ert oiensr �; �',
:�hlrfc�c wa ff�4am.diug bist a
6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem
b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck
wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra
tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM
w ?ffoaifrn w4wmeu nu weib e , wpiteI
voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J '
>bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl:
bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r
trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas
waIwutr Ober �ci ti 1V Ces ' wt
gbtiemwwajfu tpctt, afferain 9 c: b�titbfof
�r f eran m rs bra wlg auig4;f aer�m *mc vrt
blatcabtfm wfru an'deg~m rt blas Iaum
bwWt� run f ncmai b14ianf tJobrrfan
ebrut4net vnber Brwt Ober awawi*m.crriii
btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C
fca trc* cx u W�e�&mcyfbq4 Mabtt mmw
rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3
rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt
enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
How to connect/consume
http://biodivlib.wikispaces.com/Developer+Tools+and+API

[Diagram: access points and what they draw on]
Access: APIs, UI, Data Exports
BHL DB
Internet Archive: Images (JP2), PDF, Coordinate-based OCR, XML metadata
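As one illustration of consuming the stored files, the sketch below builds download URLs for the four kinds of content listed for the Internet Archive. It is an assumption-laden example, not the documented BHL interface: the file suffixes follow naming conventions commonly seen on archive.org items and may not exist for every item, and the identifier `example-item-0001` is made up. No network request is made.

```python
from urllib.parse import quote

IA_DOWNLOAD_BASE = "https://archive.org/download"

# Assumed suffixes, based on common Internet Archive derivative naming;
# check an item's actual file listing before relying on any of them.
DERIVATIVE_SUFFIXES = {
    "images_jp2": "_jp2.zip",       # page images (JP2), zipped
    "pdf": ".pdf",                  # whole-volume PDF
    "ocr_coordinates": "_djvu.xml", # OCR text with word coordinates
    "metadata_xml": "_meta.xml",    # item-level XML metadata
}


def derivative_urls(identifier):
    """Return a dict mapping each content kind to its assumed download URL."""
    safe = quote(identifier, safe="")
    return {
        kind: f"{IA_DOWNLOAD_BASE}/{safe}/{safe}{suffix}"
        for kind, suffix in DERIVATIVE_SUFFIXES.items()
    }


if __name__ == "__main__":
    # "example-item-0001" is a hypothetical identifier used only for illustration.
    for kind, url in derivative_urls("example-item-0001").items():
        print(f"{kind}: {url}")
```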
BHL Data Challenge: Name Finding
            pre-2007
BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since 2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
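The gap between candidate name strings and verified names can be illustrated with a toy example. This is not the TaxonFinder algorithm or the Global Names one: it is a deliberately naive stand-in that assumes a candidate is any capitalized word followed by a lowercase word, and that verification is a lookup in a small, hand-made name list. All function names are hypothetical.

```python
import re

# Naive candidate pattern: Capitalized word + lowercase word (a Latin-binomial shape).
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidates(text):
    """Return every string in the text that merely looks like a binomial."""
    return [f"{genus} {epithet}" for genus, epithet in CANDIDATE.findall(text)]


def verify(candidates, known_names):
    """Keep only candidates present in a reference list of names."""
    return sorted(set(candidates) & set(known_names))


def precision_recall(found, expected):
    """Precision and recall of a set of found names against the expected set."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    page_text = "Linnaeus described Bellis perennis and Quercus robur in 1753."
    known = {"Bellis perennis", "Quercus robur"}  # toy reference list

    candidates = find_candidates(page_text)
    print(candidates)                            # includes the false hit "Linnaeus described"
    print(precision_recall(candidates, known))   # candidates alone are imprecise
    print(verify(candidates, known))             # lookup removes the false hit
```

Even on one clean sentence the pattern produces a false candidate, which hints at why candidate strings outnumber verified names so heavily on uncorrected OCR.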
New Data Challenges
Challenges framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming

•   Correcting OCR
•   Rekeying Tables of Contents
•   Researching candidate Scientific Names
•   Image identification & extraction
    – http://biodivlib.wikispaces.com/Art+of+Life
    – Currently funded by NEH
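For the OCR-correction task, one possible way to measure how far a line of OCR is from a corrected transcription is character-level edit distance. The sketch below is only an illustration of that idea and is not described in the slides; the function names are hypothetical, and the reference string is an assumed correct reading of the first line of the title-page sample.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


if __name__ == "__main__":
    ocr_line = "Abbild ungen und Beschreibungen"   # as it appears in the OCR sample
    corrected = "Abbildungen und Beschreibungen"   # assumed correct reading
    print(edit_distance(ocr_line, corrected))      # one stray space
    print(character_error_rate(ocr_line, corrected))
```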
