Cross-Community User Requirements and the Biodiversity Heritage Library


Published on 9 June 2011
Published in: Technology
  • No Marine in the midcoast! Degree was multidisciplinary: a good overview of botany, zoology, physiology, ecology, and conservation management. Expected to work in…something other than field work.
  • Cross-Community User Requirements and the Biodiversity Heritage Library

    2. 2. My background
        • M.S., Biological Sciences, Eastern Illinois University, 1997
        • B.S., Environmental Biology, Eastern Illinois University, 1996
        • Director, Center for Biodiversity Informatics, Missouri Botanical Garden, 2007 – date
        • Technical Director, Biodiversity Heritage Library, 2007 – date
        • Application Development Manager, Missouri Botanical Garden, 2003 – 2007
        • Web Project Leader, Missouri Botanical Garden, 2000 – 2003
        @chrisfreeland
    3. 3. Data sharing & integration
        [Diagram labels: Plant Names, Specimens, Descriptions, Citations]
    4. 4. Plant Sciences: Tropicos
        • Developed in-house at MOBOT since 1982
            • Originally developed to capture field notebook data & streamline printing herbarium sheet labels
            • Tool used by MOBOT staff, collaborators & a global audience of scholars & students
    5. 5. Core Components
        • Names
            • 1.2 million names + synonymy
            • Objective view
        • Specimens
            • 3.9 million specimen records
        • Images
            • 160,000 specimens, plants, drawings
            • IMLS National Leadership Grant, 1998
        • Literature
            • 1.2 million protologue citations, linked to BHL when available
            • 160,000 name-based citations
        • Projects
            • Floras, checklists & data gathering
            • Alternate classifications, project-specific views
    6. 6.
    7. 7. System Expansion
        • GIS integration
            • Enhanced mapping & analysis
            • Complements Analysis Unit
            • IMLS grant, 2009
        • Enhanced interfaces for keys
            • SDD export now available
        • Robust APIs, including names lookup service
            • Services instead of scraping
        • djatoka for JPEG2000 (JP2) image delivery
        [Map: MO Distribution: Caprifoliaceae]
    8. 8. Tropicos as Data Provider
        • GBIF
            • 3.9 million records; 2.1 million georeferenced
        • Taxonomic Name Resolution Service
            • Computed Acceptance, Synonymy
        • NameBank
            • Contributed names
        • Zipcode Zoo
            • 20,000 images shared between systems
        • The Plant List, in collaboration with Kew
    9. 9. Users & Requirements
        • Plant Science Scholars & Students
            • Status / history of name
                • Links to BHL
            • Specimens collected / specimens determined
            • Distribution
            • Multiple classifications
            • Acceptance
        • General Public
            • Common names, images, maps/distribution
    10. 10. Literature Repositories: BHL
        • Consortium of natural history museum & botanical garden libraries
            • Expanded to include technology partners and service providers
        • Goal of digitizing public domain biodiversity literature, and in-copyright materials where negotiable
        • Direct integration with Encyclopedia of Life (EOL)
    11. 11. BHL Partners
        The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions that have formed a partnership to digitize and make available the world's biodiversity literature.
        Now Online:
        • 90,000+ volumes
        • 34 million+ pages
    12. 12. BHL is a research space
        • The BHL corpus as a whole is a data set of biodiversity data in its own right. Embedded in it are:
            • Predator/prey relationships
            • Habitat/distribution data
            • Host/parasite data
            • Pathogen/disease vector data
        • Third-party researchers and projects are interested in mining the BHL texts for multiple research needs.
        • Using one site both for serving, accessing, and downloading digital texts and for data mining is messy. Better to separate the two and put a version of the corpus in a public-like cloud space.
    13. 13.
    14. 15. BHL by the Book
        • File types: PDF, OCR, XML, JP2
        • > 70 TB, growing every day…
        • One 380 pg (avg) volume = multiple files, varying sizes, relationships among them
    15. 16. Current distributed infrastructure
        • Internet Archive: Digitized content / files
        • MOBOT: Database & web application
        • MBL: Redundant cluster
        [Diagram labels: Metadata, Content]
    16. 17. BHL Vision: Global Infrastructure
        • Preservation System: multiple redundant copies of all digitized content.
        • Access System: files, metadata & services needed to deliver content.
        [Diagram labels: Data Ingest (×3), Sync, Replicate]
    17. 18. DuraCloud pilot
        • Community interest in cloud storage
            • (Funding organizations, too!)
        • Wanted to evaluate applicability of cloud storage for large-scale digitization activities
            • Solutions for efficient transfer of tens to hundreds of TB of data
            • Lower-cost alternatives to maintaining large data centers
    18. 19. BHL Policy Challenges
        • Money: At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Because MBL is a BHL member, the arrangement also carries a level of administrative commitment.
        • Skill level: Multiple global partners need all or some of the current holdings, and they have varying levels of technical skill. For some, shipping hard drives might be easier. For others, uploading to and downloading from the cloud might be preferable.
        • Control: There are no clear models for using the cloud in cultural-scientific digital projects. Early-adopter paranoia.
    19. 20. Data Transfer Methods & Limitations
        [Diagram: two ways of moving data from NodeA to NodeB, compared]
        • Problems with the first: hardware failure, data loss, shipping fees
        • Problems with the second: available bandwidth, upload/download fees
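To make the bandwidth limitation concrete, here is a rough back-of-the-envelope sketch. Only the 70 TB corpus size comes from the slides; the link speeds (100 Mbit/s and 1 Gbit/s) and the `efficiency` factor are illustrative assumptions, not measured BHL figures.

```python
def transfer_days(size_tb: float, mbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb decimal terabytes over a sustained link.

    efficiency is the assumed fraction of the nominal link speed actually
    achieved (1.0 = the full nominal rate, which is optimistic).
    """
    bits = size_tb * 1e12 * 8                       # terabytes -> bits
    seconds = bits / (mbit_per_s * 1e6 * efficiency)
    return seconds / 86_400                         # seconds -> days


if __name__ == "__main__":
    # 70 TB is the corpus size quoted in the deck; the rates are assumptions.
    for rate in (100, 1000):
        print(f"70 TB at {rate} Mbit/s: {transfer_days(70, rate):.1f} days")
```

Under these assumptions, a sustained 100 Mbit/s link needs roughly two months for a single full copy, and any shortfall in real throughput stretches that further.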
    20. 21. Data transfer: Cloud vs. Cluster
        • Inventory & audit lists
        • Checksums for data integrity
        • Heavy lifting at BHL scale, regardless of endpoint
            • Weeks to months, not minutes to days
        • Differences
            • In a cluster environment, you have to be intimately involved in hardware decisions, maintenance, troubleshooting
            • In a cloud environment, those worries are part of your fee
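The inventory, audit, and checksum steps listed on this slide can be sketched as follows. This is a minimal illustration, not the tooling used in the pilot: the function names, the choice of SHA-256, and the manifest layout (relative path mapped to hex digest) are all assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large scans never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Inventory list: every file under root, keyed by its path relative to root."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(source: dict[str, str], destination: dict[str, str]) -> dict[str, list[str]]:
    """Audit list: compare the source inventory with what arrived at the endpoint."""
    shared = source.keys() & destination.keys()
    return {
        "missing": sorted(source.keys() - destination.keys()),
        "extra": sorted(destination.keys() - source.keys()),
        "corrupted": sorted(k for k in shared if source[k] != destination[k]),
    }
```

The same comparison applies whether the destination is a cluster node or a cloud store, which fits the slide's point that the heavy lifting is the same regardless of endpoint.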
    21. 22. Challenges for adopting cloud storage
        • BHL is embedded in longstanding institutions with megainfrastructure.
            • Already support data storage & maintenance at BHL scale
        • Little funding for alternative infrastructure / storage
            • Current storage is (really, truly) free through Internet Archive
        • Costs associated with download / use of content
            • BHL is a global resource for a broad community
            • User community wants to “do things” with data
    22. 23. Lessons learned from pilot
        • Cloud infrastructure & applicability to BHL are no longer a mystery
        • Nothing is free
            • Except when it is
        • Cloud storage provides ability to quickly scale infrastructure
            • No lost time procuring & configuring hardware
        • Useful for the right kinds of datasets
            • It’s not the size of the corpus, it’s the size of the files
            • Huge files are problematic
    23. 24. More lessons learned
        • More possibilities than expected:
            • Features
            • Movement
            • Support available from commercial providers
            • Increasing menus of choices
        • There is no silver bullet
            • Cloud is just a different endpoint for file storage
            • It doesn’t solve all problems related to repository management
    24. 25. Global data sharing requires a social infrastructure
    25. 26. BHL Services & APIs
        • OpenURL
            • Facilitate links to citations: protologues, articles, references
                • Documentation:
            • Useful to Nomenclators, Reference Systems
                • IPNI
                • Tropicos
        • Names Service
            • Return all occurrences of a name throughout BHL digitized corpus
                • Documentation:
            • Working out a strategy for obscure species
            • Algorithm improvements to detect nomenclatural & taxonomic acts
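As a toy illustration of what "return all occurrences of a name" involves, the sketch below scans page-level OCR text for exact matches of known names. It is not how the BHL Names Service is implemented: the function name, the page identifiers, and the matching rule (case-insensitive, whitespace-normalized, whole-word) are all hypothetical.

```python
import re
from collections import defaultdict


def index_names(pages: dict[str, str], names: list[str]) -> dict[str, list[str]]:
    """Map each name to the page IDs whose OCR text contains it.

    pages maps a (hypothetical) page identifier to that page's OCR text.
    """
    occurrences: dict[str, list[str]] = defaultdict(list)
    for page_id, text in pages.items():
        # OCR output often breaks a binomial across lines; collapse whitespace first.
        normalized = " ".join(text.split())
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, normalized, flags=re.IGNORECASE):
                occurrences[name].append(page_id)
    return dict(occurrences)


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample_pages = {
        "page-1": "Lonicera\njaponica is reported from the region.",
        "page-2": "No scientific names on this page.",
    }
    print(index_names(sample_pages, ["Lonicera japonica"]))
```

Exact matching like this misses abbreviated genera, OCR misreadings, and historical spelling variants, which is one reason the slide lists obscure species and algorithm improvements as open work.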
    26. 27. BHL + Tropicos
        • A unique platform for biodiversity research
        • Built to serve taxonomists’ & other scientists’ investigations
            • But now serves multiple disciplines
        • Enhanced by 250+ years of accumulated knowledge
            • Complicated by 250+ years of collegial disagreement
        • Complementary to physical libraries & herbaria
    27. 28. pid=title:3934&volume=14&issue=&spage=301&date=1879
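The query string on this slide can be taken apart and rebuilt with the Python standard library, as sketched below. The host and path of the resolver are not shown in the transcript, so the example deliberately handles only the query portion; the helper names are hypothetical, and what each key means to the resolver should be checked against the BHL OpenURL documentation.

```python
from urllib.parse import parse_qs, urlencode

# Query string exactly as it appears on the slide.
QUERY = "pid=title:3934&volume=14&issue=&spage=301&date=1879"


def parse_query(query: str) -> dict[str, str]:
    """Split a query string into key/value pairs, keeping empty values such as issue=."""
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def build_query(fields: dict[str, str]) -> str:
    """Re-encode the fields, leaving ':' unescaped so pid=title:3934 stays readable."""
    return urlencode(fields, safe=":")


if __name__ == "__main__":
    fields = parse_query(QUERY)
    print(fields)
    print(build_query(fields))
```

Keeping blank values matters here: the citation on the slide has no issue number, and dropping `issue=` would change the request that a linking system such as a nomenclator sends.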
    28. 29. BHL OpenURL Disambiguation
        • Looking for:
        • BHL returns:
    29. 30. Services: OpenURL Results
    30. 31. Conclusion
    31. 32. Questions?
        • Chris Freeland
        • Technical Director, Biodiversity Heritage Library
        • Director, Center for Biodiversity Informatics, Missouri Botanical Garden
        • Missouri Botanical Garden
        • 4344 Shaw Blvd.
        • St. Louis, MO 63110 USA
        Email: [email_address]
        Twitter: @chrisfreeland
        Blog / info: