BHL Technology Overview

Presentation to Smithsonian's Office of the Chief Information Officer.

Published in: Technology, Self Improvement
  1. Biodiversity Heritage Library (BHL): Technology Overview
     Chris Freeland
     Director, Bioinformatics, Missouri Botanical Garden
     Technical Director, Biodiversity Heritage Library
     [email_address]
     www.biodiversitylibrary.org
  2. BHL Partners
     • Museums
       • American Museum of Natural History (New York)
       • Natural History Museum (London)
       • Smithsonian Institution (Washington)
       • The Field Museum (Chicago)
     • Botanical Gardens
       • Missouri Botanical Garden
       • New York Botanical Garden
       • Royal Botanic Garden, Kew
     • University Libraries
       • Botany Libraries, Harvard University
       • Ernst Meyer Library of the Museum of Comparative Zoology, Harvard University
       • University of Illinois
     • Bioinformatics Institutes
       • MBL/WHOI
       • uBio.org
  3. Why have BHL?
     "In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original."
     Charles Davies Sherborn (1861-1942), Epilogue to Index Animalium, March 1922
  4. Unique Components of BHL
     • Combining metadata records from multiple libraries (similar, but different) and representing through a shared portal
     • Use of JPEG2000
     • Web 2.0 Mashups
     • Taxonomic data mining
     • Services
     • Rare & novel content
  5. Scanning process
     • Select Book
     • Pull from Shelf
     • Send to IA scanning center
     • Book is scanned & QA
     • Page images loaded on IA cluster
       • Derivatives created
     • Book returned to library
     • Files harvested from IA portal
     • Books available for display within BHL portal
  6. Mushrooms of America, edible and poisonous. Ed. by Julius A. Palmer, Jr., 1885.
  7. Scan & Store: Internet Archive
     • Scanning on Scribes
     • Storage in Petaboxes
  8. Scanning & Derivatives
     File formats (diagram labels: Master, Derivatives):
     • XML
     • JP2
     • PDF
     • JPG
     • TXT
     • DjVu
  9. Harvest from IA
     • Extract, Transform, Load (ETL)
     • Custom scripts to extract content via IA's APIs
     • Database scripts to transform to relational data structure
     • Load into database
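The extract, transform, load steps on this slide can be sketched roughly as follows. This is a minimal illustration, not BHL's harvesting code: the record fields, the two-table layout, and the page suffixes are assumptions, and the `extract` stub returns canned data instead of calling IA's APIs.

```python
import sqlite3


def extract():
    """Stand-in for the extract step; a real harvester would query IA here."""
    # Hypothetical record shape, using the example book from this talk.
    return [
        {
            "identifier": "mushroomsofameri00palm",
            "title": "Mushrooms of America, edible and poisonous",
            "year": "1885",
            "pages": ["_0009", "_0010"],  # hypothetical page suffixes
        }
    ]


def transform(records):
    """Flatten nested records into rows for two relational tables."""
    items, pages = [], []
    for rec in records:
        items.append((rec["identifier"], rec["title"], int(rec["year"])))
        for seq, suffix in enumerate(rec["pages"], start=1):
            file_name = rec["identifier"] + suffix + ".jp2"
            pages.append((rec["identifier"], seq, file_name))
    return items, pages


def load(conn, items, pages):
    """Create the (assumed) tables if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item "
        "(identifier TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(identifier TEXT, seq INTEGER, file_name TEXT)"
    )
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)", items)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", pages)
    conn.commit()


def run_etl(conn):
    items, pages = transform(extract())
    load(conn, items, pages)
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the sample book into an in-memory database; keeping the three steps as separate functions means the extract stub can be swapped for real API calls without touching the rest.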
  13. Portal page features (screenshot labels): Stable URL, Attribution, Name Finding, Page Turning, Zoom/Pan, Download/View, Browse, Search, Filter, Target/Object
  14. JPEG2000 (*.jp2) display
      • RAW original => 85% .jp2
      • LuraTech encoder
        • Wavelet compression
      • LizardTech decoder
        • Tiled on the fly, cached for performance
      • GSIV browser-based client viewer
        • 'AJAXian'
  15. (Architecture diagram) A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
      • Diagram labels: Browser, GSIV.js, www.biodiversitylibrary.org, BHLdb, IA, images.mobot.org, LizardTech ExpressServer, .jp2, .jpg
      • Request: /page/1274907 (pageid: 1274907)
      • Source image: http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
      • locate: BHL/IA architecture = 5.0+ sec transfer
      • Time to deliver image: 8+ sec
  16. Reuse, don't rebuild
  17. Name Finding in action with Taxonomic Intelligence…
      • TIF Image from Scanner
      • Converted to text via PrimeOCR
      • Name finding via TaxonFinder
      • Extract names
      • Submit to NameBank
      • SOAP response
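To make the "extract names" step concrete, here is a deliberately naive sketch that pulls candidate Latin binomials out of OCR text with a regular expression. It is not TaxonFinder and does not contact NameBank; the pattern and the function name `find_candidate_names` are assumptions for illustration only.

```python
import re

# Capitalized genus followed by a lowercase epithet,
# e.g. "Carcharodon carcharias". Purely a shape-based heuristic.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text):
    """Return unique candidate binomials in order of first appearance."""
    seen = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in seen:
            seen.append(name)
    return seen
```

A shape-only pattern like this also matches ordinary capitalized phrases at the start of sentences, which is one reason a pipeline would submit candidates to a names authority such as NameBank before treating them as taxa.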
  18. Names data mining
  19. Tag cloud from LCSH
      • Subject Heading from library catalog
      • Expressed as MARCXML
      • Tag Cloud
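The path from MARCXML subject headings to a tag cloud can be sketched as below. This assumes the headings sit in MARC field 650, subfield a (the standard place for topical subjects) and uses a simple linear font scaling; the function names and point sizes are illustrative, not BHL's implementation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subject_counts(marcxml):
    """Count topical subject headings (650 $a) across all records."""
    root = ET.fromstring(marcxml)
    counts = Counter()
    path = ".//marc:datafield[@tag='650']/marc:subfield[@code='a']"
    for sub in root.findall(path, MARC_NS):
        # Catalog records often end a heading with a period; drop it.
        term = (sub.text or "").strip().rstrip(".")
        if term:
            counts[term] += 1
    return counts


def tag_sizes(counts, min_pt=10, max_pt=24):
    """Scale counts linearly to font sizes for a tag cloud."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = max(hi - lo, 1)  # avoid dividing by zero when all counts match
    return {
        term: min_pt + (n - lo) * (max_pt - min_pt) // span
        for term, n in counts.items()
    }
```

Feeding the output of `subject_counts` into `tag_sizes` gives each heading a font size between `min_pt` and `max_pt`, with the most frequent heading rendered largest.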
  20. Geocoding LCSH
  21. RSS Feeds
      • Specific: Last 25 books published in German from NYBG
        • RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25/GER/NYBG
        • Allgemeine deutsche Garten-Zeitung, 7, 1829 (added: 04/03/2008)
        • Zeitschrift für wissenschaftliche Mikroskopie und für mikroskopische Technik. 2, 1885 (added: 03/28/2008)
        • Zeitschrift für technische Biologie. 7, 1919 (added: 03/27/2008)
        • …
      • General: Last 25 books from all libraries
        • RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25
        • Summa plantarum: v.1 (added: 05/01/2008)
        • Vegetable materia medica of the United States (added: 04/30/2008)
        • The family herbal; (added: 04/30/2008)
        • …
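The two feed locations on this slide suggest a URL pattern of count, then an optional language code, then an optional institution code. The helper below is a sketch built only on that inference: the function name and the assumption that an institution requires a language segment are not confirmed by the slides.

```python
RSS_BASE = "http://www.biodiversitylibrary.org/RecentRss"


def recent_feed_url(count=25, language=None, institution=None):
    """Build a 'recent books' feed URL following the pattern seen on the slide."""
    if institution and not language:
        # The slides only show institution after a language code,
        # so this sketch refuses to guess at any other shape.
        raise ValueError("institution is only shown together with a language")
    parts = [RSS_BASE, str(int(count))]
    if language:
        parts.append(language)
    if institution:
        parts.append(institution)
    return "/".join(parts)
```

For example, `recent_feed_url(25, "GER", "NYBG")` reproduces the German-language NYBG feed location listed above, and `recent_feed_url(25)` reproduces the general one.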
  22. Services
      • Names
        • v.1 released: http://www.biodiversitylibrary.org/services/name/NameService.asmx
      • Stable URLs
        • http://www.biodiversitylibrary.org/bibliography/1652
        • http://www.biodiversitylibrary.org/name/Carcharodon_carcharias
      • Future:
        • Citation Resolver
        • Titles Resolver
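The stable URL examples on this slide follow simple patterns that a client could build programmatically. The helpers below are a sketch based only on those examples: the function names are made up, and replacing spaces with underscores in names is inferred from the single `Carcharodon_carcharias` example.

```python
PORTAL = "http://www.biodiversitylibrary.org"


def bibliography_url(title_id):
    """Stable URL for a title, e.g. /bibliography/1652."""
    return f"{PORTAL}/bibliography/{int(title_id)}"


def name_url(scientific_name):
    """Stable URL for a scientific name; whitespace becomes underscores."""
    return f"{PORTAL}/name/" + "_".join(scientific_name.split())
```

Here `bibliography_url(1652)` and `name_url("Carcharodon carcharias")` reproduce the two example URLs listed on the slide.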
  23. BHL Name Services: http://www.biodiversitylibrary.org/services/name/NameService.asmx
  24. Provider Integration
      • Encyclopedia of Life
      • Atrium Andes Biodiversity
      • Wikipedia
      • EDIT Scratchpads
      • More to come…
  27. Hardware Infrastructure
      • Distributed
      • Partially redundant
        • Work needed
      • Mixed platforms
      • Mixed app frameworks
  28. (Diagram labels) MOBOT, Petabox cluster, Internet Archive
  30. File Storage Estimates
      • 4 MB per page including derivatives
      • 1 million pages = 4 TB storage
      • Expected output: 60-100 million pages
      • 240-400 TB for files
      • 10-20 GB for db
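The storage figures on this slide follow from one multiplication, shown here as a small sketch. It assumes decimal units (1 TB = 1,000,000 MB), which is what makes the slide's numbers come out exactly; the function name is illustrative.

```python
MB_PER_PAGE = 4  # per page, including derivatives (from the slide)


def storage_tb(pages, mb_per_page=MB_PER_PAGE):
    """Estimated file storage in terabytes, using 1 TB = 1,000,000 MB."""
    return pages * mb_per_page / 1_000_000


low = storage_tb(60_000_000)    # lower bound of expected output
high = storage_tb(100_000_000)  # upper bound of expected output
```

With these inputs, `storage_tb(1_000_000)` is 4.0 TB, and `low` and `high` are 240.0 and 400.0 TB, matching the range on the slide.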
  31. Future Work
      • Services
        • Citation Resolver
        • Titles Resolver
      • Interfaces
      • Editing
        • Authoritative
        • Community
      • Backend
  32. Fedora
      • Funded by Gordon and Betty Moore Foundation to adopt Fedora Commons
      • Working with Internet Archive to define use and practice
      • Project completion December 2009
  33. Thank You
      • Chris Freeland
      • [email_address]
      • BHL Portal: www.biodiversitylibrary.org
      • BHL Blog: biodiversitylibrary.blogspot.com
      • BHL collection at Internet Archive: www.archive.org/details/biodiversity