BHL: Big Data, Big Challenges

  1. BHL: Big Data, Big Challenges. Chris Freeland (@chrisfreeland), Founding Technical Director, BHL; Sr. Director, University Academic Computing, Washington University
  2. > 100,000 books, > 39 million pages, > 70 TB of data
  3. BHL Architecture: Window Seat Ed. [Diagram] Access: Data Exports, APIs, UI. Logic: Geocoding, Name Finding, Data Transform Utilities. Firewall. Storage: BHL DB; Internet Archive (Images (JP2), PDF, Coordinate-based OCR, XML metadata).
  4. BHL Content: Structured Data
      • Metadata from library catalogues at title/volume level
        <titleInfo>
          <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas...</title>
        </titleInfo>
        <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
        <name type="personal">
          <namePart>Linné, Carl von,</namePart>
          <namePart type="date">1707-1778</namePart>
        </name>
        <typeOfResource>text</typeOfResource>
        <genre authority="marcgt">book</genre>
        <originInfo>
          <place><placeTerm type="text">Holmiae :</placeTerm></place>
          <publisher>Impensis Laurentii Salvii,</publisher><dateIssued>1753.</dateIssued>
  5. BHL Content: Unstructured Data. More than 39 million pages! ...of uncorrected OCR
  6. Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
  7. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
  8. How to connect/consume: http://biodivlib.wikispaces.com/Developer+Tools+and+API [Diagram] Access: Data Exports, APIs, UI. Storage: BHL DB; Internet Archive (Images (JP2), PDF, Coordinate-based OCR, XML metadata).
  9. BHL Data Challenge: Name Finding (pre-2007)
  10. BHL Data Challenge: Name Finding
      • TaxonFinder algorithm in production since 2008
        – More than 100 million candidate name strings
        – More than 1.5 million unique, verified names
        – Available through UI, APIs, Data Exports & Internet Archive
      • New collaboration with Global Names
        – Improved algorithm, better precision & recall
        – More data!
  11. New Data Challenges (framed as games): http://biodivlib.wikispaces.com/BHL+and+Gaming
      • Correcting OCR
      • Rekeying Tables of Contents
      • Researching candidate Scientific Names
      • Image identification & extraction
        – http://biodivlib.wikispaces.com/Art+of+Life
        – Currently funded by NEH
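
The structured-data slide shows a catalogue record in MODS-style XML. As a rough sketch of how such a record might be consumed, the Python below pulls a few fields out of a shortened copy of that fragment. The `<mods>` wrapper, the closing `</originInfo>` tag, and the function name `summarize_mods` are assumptions added so the fragment parses on its own; real MODS records also declare an XML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record on the slide. The <mods> root and the closing
# </originInfo> tag are assumptions added so the fragment is well-formed.
MODS_FRAGMENT = """
<mods>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">book</genre>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <publisher>Impensis Laurentii Salvii,</publisher>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>
"""


def summarize_mods(xml_text):
    """Return a small dict of fields from an un-namespaced MODS-like record."""
    root = ET.fromstring(xml_text.strip())

    def text(path):
        # First matching element's text, or None if the element is absent.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    def tidy(value):
        # Drop trailing cataloguing punctuation such as "," ":" or ".".
        return value.rstrip(" ,:.") if value else value

    return {
        "short_title": text("titleInfo[@type='abbreviated']/title"),
        "author": tidy(text("name[@type='personal']/namePart")),
        "place": tidy(text("originInfo/place/placeTerm")),
        "publisher": tidy(text("originInfo/publisher")),
        "year": tidy(text("originInfo/dateIssued")),
    }


if __name__ == "__main__":
    print(summarize_mods(MODS_FRAGMENT))
```

The abbreviated title is left untouched on purpose, since its trailing period is part of the abbreviation rather than cataloguing punctuation.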

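
The Name Finding slide distinguishes candidate name strings from unique, verified names. The Python below is a deliberately simplified stand-in for that two-step idea, not the TaxonFinder algorithm itself: a regular expression proposes candidates that look like Latin binomials, and a lookup keeps only those found in a list of known names. The pattern, the `KNOWN_NAMES` set, and the function name `find_names` are all hypothetical.

```python
import re

# Candidate pattern: a capitalized word (genus-like) followed by a
# lowercase word (epithet-like). This over-generates on ordinary prose.
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for an index of verified scientific names.
KNOWN_NAMES = {"Quercus alba", "Homo sapiens"}


def find_names(text, known=KNOWN_NAMES):
    """Return (candidates, verified) name strings found in text."""
    candidates = [f"{genus} {epithet}" for genus, epithet in CANDIDATE.findall(text)]
    verified = [name for name in candidates if name in known]
    return candidates, verified


if __name__ == "__main__":
    sample = "The white oak, Quercus alba, grows here. Homo sapiens planted it."
    candidates, verified = find_names(sample)
    print(candidates)  # includes the false positive "The white"
    print(verified)
```

The gap between the two lists is the precision problem the slide alludes to: the pattern alone accepts ordinary phrases such as "The white", so a verification step against known names is needed. Uncorrected OCR also hurts recall, because a misread name never matches the pattern or the lookup.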