BHL: Big Data, Big Challenges

Chris Freeland (@chrisfreeland)
Founding Technical Director, BHL
Sr. Director, University Academic Computing, Washington University

>100,000 books, >39 million pages, >70 TB data
BHL Architecture: Window Seat Ed.
Access
• APIs
• UI
• Data Exports

Logic
• Geocoding
• Name Finding
• BHL DB
• Data Transform Utilities
• Firewall

Storage
• Internet Archive
  – Images (JP2)
  – PDF
  – Coordinate-based OCR
  – XML metadata
BHL Content: Structured Data
• Metadata from library catalogues at title/volume level
 <titleInfo>
 <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera
 relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus,
 secundum systema sexuale digestas...</title>
 </titleInfo>
 <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
 <name type="personal">
 <namePart>Linné, Carl von,</namePart>
 <namePart type="date">1707-1778</namePart>
 </name>
 <typeOfResource>text</typeOfResource>
 <genre authority="marcgt">book</genre>
 <originInfo>
 <place><placeTerm type="text">Holmiae :</placeTerm></place>
 <publisher>Impensis Laurentii Salvii,</publisher><dateIssued>1753.</dateIssued>
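
As a rough illustration of why this catalogue metadata counts as "structured", the sketch below pulls a few fields out of a record like the excerpt above using Python's standard library. The `<mods>` wrapper element, the closing `</originInfo>` tag, the shortened title, and the `summarize` helper are assumptions added so the fragment parses; they are not part of the slide. Full MODS records normally declare an XML namespace, which would require namespace-aware queries.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the slide excerpt. The <mods> root and the closing
# </originInfo> are assumptions added for parsing; the title is shortened.
SAMPLE = """<mods>
  <titleInfo><title>Caroli Linnaei ... Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>"""


def summarize(xml_text):
    """Return a small dict of title-level fields from a MODS-like record."""
    root = ET.fromstring(xml_text)
    return {
        # First <titleInfo> holds the full title.
        "title": root.findtext("titleInfo/title"),
        "abbreviated": root.findtext("titleInfo[@type='abbreviated']/title"),
        # First <namePart> is the name; the typed one carries life dates.
        "author": root.findtext("name/namePart"),
        "dates": root.findtext("name/namePart[@type='date']"),
        # Catalogue punctuation ("1753.") is trimmed for convenience.
        "issued": (root.findtext("originInfo/dateIssued") or "").rstrip("."),
    }


if __name__ == "__main__":
    for field, value in summarize(SAMPLE).items():
        print(f"{field}: {value}")
```

Every field is addressable by element name and attribute, which is exactly what the OCR text on the following slides lacks.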
BHL Content: Unstructured Data
More than 39 million pages!

...of uncorrected OCR
Abbild ungen und Beschreibungen
                   der

               Fische Syriens,
                    nebst
einer neuen Classification und Characteristik
           sämmtlicher Gattungen
                     der
                       i
             JOH. JAKOB HECKEL,
Inipectoi am k. k. Hof-Natur.-iUenkabinete in
    Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.




               STUTTGART.
  E. Schweizerbart' sehe Verlagshandlung,
                   1843.
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn
bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X
a�m cv(f b1air�'o�et ert oiensr �; �',
:�hlrfc�c wa ff�4am.diug bist a
6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem
b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck
wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra
tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM
w ?ffoaifrn w4wmeu nu weib e , wpiteI
voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J '
>bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl:
bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r
trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas
waIwutr Ober �ci ti 1V Ces ' wt
gbtiemwwajfu tpctt, afferain 9 c: b�titbfof
�r f eran m rs bra wlg auig4;f aer�m *mc vrt
blatcabtfm wfru an'deg~m rt blas Iaum
bwWt� run f ncmai b14ianf tJobrrfan
ebrut4net vnber Brwt Ober awawi*m.crriii
btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C
fca trc* cx u W�e�&mcyfbq4 Mabtt mmw
rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3
rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt
enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
How to connect/consume
http://biodivlib.wikispaces.com/Developer+Tools+and+API

• APIs
• UI
• Data Exports
• BHL DB
• Internet Archive
  – Images (JP2)
  – PDF
  – Coordinate-based OCR
  – XML metadata
BHL Data Challenge: Name Finding
pre-2007
BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since 2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
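
To make the gap between "candidate name strings" and "verified names" concrete, here is a deliberately naive sketch. It is not the TaxonFinder or Global Names algorithm: it simply treats any capitalised word followed by a lowercase word as a candidate binomial, then keeps only candidates found in a reference list. The regular expression, both function names, and the `known` set are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Capitalised "genus" followed by a lowercase "epithet", e.g. "Salmo trutta".
# This pattern over-matches on purpose: ordinary phrases qualify too.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(ocr_text):
    """Count every string in the text that merely looks like a binomial."""
    return Counter(f"{genus} {epithet}"
                   for genus, epithet in BINOMIAL.findall(ocr_text))


def verified_names(candidates, known_names):
    """Keep only the candidates present in a reference list of names."""
    return {name: count for name, count in candidates.items()
            if name in known_names}


if __name__ == "__main__":
    text = ("The fish Salmo trutta is common. "
            "Salmo trutta and Quercus alba were recorded.")
    known = {"Salmo trutta", "Quercus alba"}  # hypothetical reference list
    found = candidate_names(text)
    print(found)                         # includes the false positive "The fish"
    print(verified_names(found, known))  # only names in the reference list
```

The false positive "The fish" shows why candidate strings far outnumber verified names, and why OCR errors that break a real name lower recall.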
New Data Challenges
Challenges framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming

• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate scientific names
• Image identification & extraction
  – http://biodivlib.wikispaces.com/Art+of+Life
  – Currently funded by NEH
