BHL: Big Data, Big Challenges
 


Presented at EOL Semantic Reasoning Workshop, 6-7 Sep 2012, Washington DC.

BHL: Big Data, Big Challenges Presentation Transcript

  • BHL: Big Data, Big Challenges. Chris Freeland (@chrisfreeland), Founding Technical Director, BHL; Sr. Director, University Academic Computing, Washington University
  • > 100,000 books, > 39 million pages, > 70 TB of data
  • BHL Architecture: Window Seat Ed. [Architecture diagram. Labels: Access (Data APIs, UI, Exports); Logic (Geocoding, Data Transform, Name Finding, Utilities); Firewall; BHL DB; Storage (Images (JP2), PDF, Coordinate-based OCR, XML metadata); Internet Archive]
  • BHL Content: Structured Data
    • Metadata from library catalogues at title/volume level:
      <titleInfo>
        <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas...</title>
      </titleInfo>
      <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
      <name type="personal">
        <namePart>Linné, Carl von,</namePart>
        <namePart type="date">1707-1778</namePart>
      </name>
      <typeOfResource>text</typeOfResource>
      <genre authority="marcgt">book</genre>
      <originInfo>
        <place><placeTerm type="text">Holmiae :</placeTerm></place>
        <publisher>Impensis Laurentii Salvii,</publisher>
        <dateIssued>1753.</dateIssued>
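As a rough illustration of why this kind of metadata counts as "structured", the sketch below pulls a few fields out of a record shaped like the one on the slide, using only Python's standard library. The `<mods>` wrapper element, the closing `</originInfo>` tag, the shortened title, and the helper name `summarize_mods` are all assumptions added for the example; real MODS records normally declare an XML namespace, which this sample omits just as the slide does.

```python
import xml.etree.ElementTree as ET

# Sample record adapted from the slide. The <mods> wrapper and the closing
# </originInfo> tag are assumptions added so the fragment parses; the long
# title is shortened here and keeps the slide's "..." elision.
MODS_SAMPLE = """
<mods>
  <titleInfo><title>Caroli Linnaei ... Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">book</genre>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <publisher>Impensis Laurentii Salvii,</publisher>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>
"""


def summarize_mods(xml_text):
    """Return a small dict of fields from a MODS-like record (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def text(path):
        # Look up the first matching element and return its stripped text.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    return {
        "short_title": text("titleInfo[@type='abbreviated']/title"),
        "author": text("name[@type='personal']/namePart"),
        "publisher": text("originInfo/publisher"),
        # Catalogue dates often carry trailing punctuation ("1753.").
        "year": (text("originInfo/dateIssued") or "").rstrip(".") or None,
    }


record = summarize_mods(MODS_SAMPLE)
print(record)
```

Because every field sits in a named element, extraction is a matter of path lookups. The uncorrected OCR discussed next offers no such handles.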
  • BHL Content: Unstructured Data. More than 39 million pages! ...of uncorrected OCR
  • Abbild ungen und Beschreibungen der Fische Syriens, nebsteiner neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL,Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart sehe Verlagshandlung, 1843.
  • *E.xvi�c�piteI von c. cXx.WptdvonfnrWmnbu�fbe;bcn.5 am cix bIa � S &3rn~ 41Xa�m cv(f b1air�o�et ert oiensr �; �,:�hlrfc�c wa ff�4am.diug bist a6aiw~s ff oJrJtwt nof bL4ecImt& blfafra memb t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ckwuo, ma144*4I bttE5rmbebt =rt3kn am4ratif vrmr Waff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weib e , wpiteIvoE5teiri ct c ober gtUcr cit cm` 91 cLi biar J >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl:bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :rtrtuft *e t � B Rn "� trv W1Rt ?Cm c blaswaIwutr Ober �ci ti 1V Ces wtgbtiemwwajfu tpctt, afferain 9 c: b�titbfof�r f eran m rs bra wlg auig4;f aer�m *mc vrtblatcabtfm wfru andeg~m rt blas IaumbwWt� run f ncmai b14ianf tJobrrfanebrut4net vnber Brwt Ober awawi*m.crriiibtafwfm uww c on$ it ttu wttkc 5,10 $ m~Cfca trc* cx u W�e�&mcyfbq4 Mabtt mmwrc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbtenb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
  • How to connect/consume: http://biodivlib.wikispaces.com/Developer+Tools+and+API [Diagram labels: Data APIs, UI, Exports; BHL DB; Internet Archive: Images (JP2), PDF, Coordinate-based OCR, XML metadata]
  • BHL Data Challenge: Name Finding pre-2007
  • BHL Data Challenge: Name Finding
    • TaxonFinder algorithm in production since 2008
      – More than 100 million candidate name strings
      – More than 1.5 million unique, verified names
      – Available through UI, APIs, Data Exports & Internet Archive
    • New collaboration with Global Names
      – Improved algorithm, better precision & recall
      – More data!
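To make the gap between "candidate name strings" and "verified names" concrete, here is a minimal sketch of name finding and of how precision and recall would be measured. The regular expression is a deliberately naive stand-in (a capitalised word followed by a lowercase word) and is not the TaxonFinder or Global Names algorithm; the sample sentence, the verified set, and the function names are all hypothetical.

```python
import re

# Naive pattern: a capitalised genus-like word followed by a lowercase
# epithet-like word. Illustrative stand-in only, not TaxonFinder.
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidates(text):
    """Return the set of two-word strings that look like binomial names."""
    return {f"{genus} {epithet}" for genus, epithet in CANDIDATE.findall(text)}


def precision_recall(candidates, verified):
    """Precision: share of candidates that are real names.
    Recall: share of real names that were found."""
    true_pos = len(candidates & verified)
    precision = true_pos / len(candidates) if candidates else 0.0
    recall = true_pos / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical page text and hand-verified names for that page.
page = "The specimens of Salmo trutta and Cyprinus carpio were collected near Vienna."
verified = {"Salmo trutta", "Cyprinus carpio"}

candidates = find_candidates(page)
precision, recall = precision_recall(candidates, verified)
print(sorted(candidates), precision, recall)
```

On this sample the pattern also picks up "The specimens", so recall is perfect while precision is only two in three. That is the kind of false positive a verification step, or a better algorithm, has to filter out of the candidate strings.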
  • New Data Challenges, framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming
    • Correcting OCR
    • Rekeying Tables of Contents
    • Researching candidate Scientific Names
    • Image identification & extraction
      – http://biodivlib.wikispaces.com/Art+of+Life
      – Currently funded by NEH