BHL: Big Data, Big Challenges

Presented at EOL Semantic Reasoning Workshop, 6-7 Sep 2012, Washington DC.

Published in: Technology, Education

Transcript of "BHL: Big Data, Big Challenges"

  1. BHL: Big Data, Big Challenges. Chris Freeland (@chrisfreeland), Founding Technical Director, BHL; Sr. Director, University Academic Computing, Washington University
  2. >100,000 books, >39 million pages, >70 TB data
  3. BHL Architecture: Window Seat Ed. Access: Data, APIs, UI, Exports. Logic: Geocoding, Firewall, Data Transform, Name Finding, BHL DB, Utilities. Storage: Images (JP2), Internet Archive, PDF, Coordinate-based OCR, XML metadata
  4. BHL Content: Structured Data • Metadata from library catalogues at title/volume level: <titleInfo> <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas...</title> </titleInfo> <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo> <name type="personal"> <namePart>Linné, Carl von,</namePart> <namePart type="date">1707-1778</namePart> </name> <typeOfResource>text</typeOfResource> <genre authority="marcgt">book</genre> <originInfo> <place><placeTerm type="text">Holmiae :</placeTerm></place> <publisher>Impensis Laurentii Salvii,</publisher><dateIssued>1753.</dateIssued>
  5. BHL Content: Unstructured Data. More than 39 million pages! ...of uncorrected OCR
  6. Abbild ungen und Beschreibungen der Fische Syriens, nebsteiner neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL,Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart sehe Verlagshandlung, 1843.
  7. [Sample page of uncorrected OCR output; the text is unreadable and contains undecodable characters]
  8. How to connect/consume: http://biodivlib.wikispaces.com/Developer+Tools+and+API (Data, APIs, UI, Exports; BHL DB; Internet Archive; Images (JP2), PDF, Coordinate-based OCR, XML metadata)
  9. BHL Data Challenge: Name Finding, pre-2007
  10. BHL Data Challenge: Name Finding
      • TaxonFinder algorithm in production since 2008
        – More than 100 million candidate name strings
        – More than 1.5 million unique, verified names
        – Available through UI, APIs, Data Exports & Internet Archive
      • New collaboration with Global Names
        – Improved algorithm, better precision & recall
        – More data!
  11. New Data Challenges (challenges framed as games): http://biodivlib.wikispaces.com/BHL+and+Gaming
      • Correcting OCR
      • Rekeying Tables of Contents
      • Researching candidate Scientific Names
      • Image identification & extraction
        – http://biodivlib.wikispaces.com/Art+of+Life
        – Currently funded by NEH
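
The structured metadata on slide 4 looks like a MODS catalogue record. As a rough illustration of how such a record might be consumed, here is a minimal Python sketch. It rests on several assumptions: the fragment is wrapped in a `<mods>` root and `<originInfo>` is closed so that it parses, the long main title is left out, and the MODS namespace that real records normally carry is omitted. The `summarize` function and its field names are hypothetical and are not part of any BHL tool.

```python
import xml.etree.ElementTree as ET

# Fragment from slide 4. Assumptions: a <mods> root has been added, the
# <originInfo> element has been closed, and no XML namespace is used.
SAMPLE = """<mods>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">book</genre>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <publisher>Impensis Laurentii Salvii,</publisher>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>"""


def summarize(xml_text):
    """Return a small dict of title-level fields (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def text(path):
        # First matching element's text, or None if the element is absent.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    def trimmed(path, chars):
        # Catalogue records carry trailing punctuation (",", ":", ".").
        value = text(path)
        return value.rstrip(chars) if value is not None else None

    return {
        "short_title": text("titleInfo[@type='abbreviated']/title"),
        "author": trimmed("name[@type='personal']/namePart", ", "),
        "place": trimmed("originInfo/place/placeTerm", ": "),
        "publisher": trimmed("originInfo/publisher", ", "),
        "year": trimmed("originInfo/dateIssued", ". "),
    }


if __name__ == "__main__":
    for field, value in summarize(SAMPLE).items():
        print(f"{field}: {value}")
```

Trailing punctuation is trimmed per field rather than globally, so the abbreviated title "Sp. Pl." keeps its final period while the date "1753." loses it.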
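
The name-finding slide distinguishes candidate name strings from unique, verified names. The toy Python sketch below illustrates that distinction only. It is not TaxonFinder or the Global Names algorithm: the regular expression, the `KNOWN_NAMES` list, and the sample sentence are all invented for illustration.

```python
import re

# Naive pattern for a Latin binomial: a capitalized word followed by a
# lowercase word. This is an assumption for illustration, not TaxonFinder.
CANDIDATE = re.compile(r"\b[A-Z][a-z]{2,}\s[a-z]{3,}\b")

# Hypothetical reference list standing in for a names authority.
KNOWN_NAMES = {"Quercus alba", "Salmo trutta"}


def find_names(text):
    """Return (candidates, verified) for one page of OCR text."""
    candidates = CANDIDATE.findall(text)
    verified = [name for name in candidates if name in KNOWN_NAMES]
    return candidates, verified


if __name__ == "__main__":
    page = ("The trout Salmo trutta was seen near Quercus alba. "
            "Sahno trutta is an OCR error.")
    candidates, verified = find_names(page)
    print("candidates:", candidates)
    print("verified:  ", verified)
```

On the invented sample, the pattern also picks up "The trout" (a false positive that verification discards) and "Sahno trutta" (an OCR-damaged name that verification cannot match). That gives a small-scale picture of why uncorrected OCR hurts both precision and recall.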