BHL: Big Data, Big Challenges
1. BHL: Big Data, Big Challenges
Chris Freeland @chrisfreeland
Founding Technical Director, BHL
Sr. Director, University Academic Computing,
Washington University
3. BHL Architecture: Window Seat Ed.
• Access
– Data
– APIs
– UI
– Exports
• Logic
– Geocoding
– Firewall
– Data Transform
– Name Finding
– BHL DB
– Utilities
• Storage
– Images (JP2)
– Internet Archive
– PDF
– Coordinate-based OCR
– XML metadata
4. BHL Content: Structured Data
• Metadata from Library catalogues at
title/volume level
<titleInfo>
<title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera
relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus,
secundum systema sexuale digestas...</title>
</titleInfo>
<titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
<name type="personal">
<namePart>Linné, Carl von,</namePart>
<namePart type="date">1707-1778</namePart>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">book</genre>
<originInfo>
<place><placeTerm type="text">Holmiae :</placeTerm></place>
<publisher>Impensis Laurentii Salvii,</publisher><dateIssued>1753.</dateIssued>
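A minimal sketch of how a title-level record like the one above could be read with Python's standard XML parser. The element names follow the excerpt; the `<mods>` wrapper element, the closing `</originInfo>` tag, the shortened title, and the `summarize` helper are assumptions made so the fragment parses on its own. They are not taken from BHL's actual exports, which may also use XML namespaces.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the record on the slide.
# The <mods> root and the closing </originInfo> are assumed here.
RECORD = """<mods>
  <titleInfo><title>Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">book</genre>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>"""


def summarize(xml_text):
    """Pull a few title-level fields out of a catalogue record (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Key each title by its "type" attribute; the untyped one is the full title.
    titles = {
        info.get("type", "full"): info.findtext("title")
        for info in root.findall("titleInfo")
    }
    # findtext returns the text of the first matching element.
    author = (root.findtext("name/namePart") or "").rstrip(", ")
    # Catalogue dates often carry trailing punctuation ("1753.").
    year = (root.findtext("originInfo/dateIssued") or "").strip(". ")
    return {
        "title": titles.get("full"),
        "abbreviated_title": titles.get("abbreviated"),
        "author": author,
        "year": year,
    }


if __name__ == "__main__":
    print(summarize(RECORD))
```

Keeping the cleanup of trailing punctuation in one place is a convenience for this sketch; a production pipeline would more likely preserve the catalogue values verbatim and normalize them in a separate step.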
6. Abbildungen und Beschreibungen
der
Fische Syriens,
nebst
einer neuen Classification und Characteristik
sämmtlicher Gattungen
der
Cyprinen.
JOH. JAKOB HECKEL,
Inspector am k. k. Hof-Naturalienkabinete in
Wien, mehr. gelehrt. Gesellschaften Mitglied.
STUTTGART.
E. Schweizerbart'sche Verlagshandlung,
1843.
7. [Sample page: raw, uncorrected OCR output; text not recoverable]
10. BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
– Improved algorithm, better precision & recall
– More data!
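To make the distinction between candidate name strings, verified names, precision, and recall concrete, here is a deliberately naive sketch. It is not the TaxonFinder or Global Names algorithm: the regular expression, the tiny `KNOWN_NAMES` lookup set, and the sample sentence are all assumptions for illustration. It flags any capitalized word followed by a lowercase word as a candidate binomial, checks candidates against the lookup set, and scores the raw candidates against a hand-labelled answer.

```python
import re

# Naive pattern: capitalized "genus" followed by a lowercase "epithet".
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

# Hypothetical stand-in for a dictionary of verified names.
KNOWN_NAMES = {"Cyprinus carpio", "Barbus barbus", "Salmo trutta", "Esox lucius"}

TEXT = (
    "Cyprinus carpio was described by Linnaeus. "
    "The fish Barbus barbus is common. Salmo trutta appears in plate 3."
)

# Hand-labelled names actually present in TEXT.
EXPECTED = {"Cyprinus carpio", "Barbus barbus", "Salmo trutta"}


def find_candidates(text):
    """Return every string that merely looks like a binomial name."""
    return {f"{genus} {epithet}" for genus, epithet in CANDIDATE.findall(text)}


def verify(candidates, known_names):
    """Keep only candidates that appear in the lookup set."""
    return candidates & known_names


def precision_recall(found, expected):
    """Precision = correct / found; recall = correct / expected."""
    correct = len(found & expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    candidates = find_candidates(TEXT)
    print(sorted(candidates))
    print(sorted(verify(candidates, KNOWN_NAMES)))
    print(precision_recall(candidates, EXPECTED))
```

In this sample the pattern also picks up "The fish", which is why the count of candidate strings exceeds the count of verified names and why precision on raw candidates falls below recall. On OCR text, misread characters would lower recall as well, since a damaged name no longer matches the lookup set.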
11. New Data Challenges
http://biodivlib.wikispaces.com/BHL+and+Gaming
Challenges framed as games
• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
– http://biodivlib.wikispaces.com/Art+of+Life
– Currently funded by NEH