Cross-Community User Requirements and the Biodiversity Heritage Library




9 June 2011,





Usage Rights

© All Rights Reserved

  • No Marine in the midcoast! Degree was multi disciplinary; good overview of botany, zoology, physiology, ecology, conservation management, Expected to work in…something other than field work;

Presentation Transcript

  • My background
    • M.S., Biological Sciences, Eastern Illinois University, 1997
    • B.S., Environmental Biology, Eastern Illinois University, 1996
    • Director, Center for Biodiversity Informatics, Missouri Botanical Garden, 2007 – date
    • Technical Director, Biodiversity Heritage Library, 2007 – date
    • Application Development Manager, Missouri Botanical Garden, 2003 – 2007
    • Web Project Leader, Missouri Botanical Garden, 2000 – 2003
    • Twitter: @chrisfreeland
  • Data sharing & integration
    • [Diagram labels: Plant Names, Specimens, Descriptions, Citations]
  • Plant Sciences: Tropicos
    • Developed in-house at MOBOT since 1982
      • Originally developed to capture field notebook data & streamline printing herbarium sheet labels
      • Tool used by MOBOT staff, collaborators & a global audience of scholars & students
  • Core Components
    • Names
      • 1.2 million names + synonymy
      • Objective view
    • Specimens
      • 3.9 million specimen records
    • Images
      • 160,000 specimens, plants, drawings
      • IMLS National Leadership Grant, 1998
    • Literature
      • 1.2 million protologue citations, linked to BHL when available
      • 160,000 name-based citations
    • Projects
      • Floras, checklists & data gathering
      • Alternate classifications, project-specific views
  • System Expansion
    • GIS integration
      • Enhanced mapping & analysis
      • Complements Analysis Unit
      • IMLS grant, 2009
    • Enhanced interfaces for keys
      • SDD export now available
    • Robust APIs, including names lookup service
      • Services instead of scraping
    • djatoka for JPEG2000 (JP2) image delivery
    • [Map: MO Distribution: Caprifoliaceae]
  • Tropicos as Data Provider
    • GBIF
      • 3.9mil records; 2.1mil georeferenced
    • Taxonomic Name Resolution Service
      • Computed Acceptance, Synonymy
    • NameBank
      • Contributed names
    • Zipcode Zoo
      • 20,000 images shared between systems
    • The Plant List, in collaboration with Kew
  • Users & Requirements
    • Plant Science Scholars & Students
      • Status / history of name
        • Links to BHL
      • Specimens collected / specimens determined
      • Distribution
      • Multiple classifications
      • Acceptance
    • General Public
      • Common names, images, maps/distribution
  • Literature Repositories: BHL
    • Consortium of natural history museum & botanical garden libraries
      • Expanded to include technology partners and service providers
    • Goal of digitizing public domain biodiversity literature, and in-copyright materials where negotiable
    • Direct integration with Encyclopedia of Life (EOL)
  • BHL Partners
    • The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions who have formed a partnership to digitize and make available the world's biodiversity literature.
    • Now Online: 90,000+ volumes, 34 million+ pages
  • BHL is a research space
    • The BHL corpus as a whole is a biodiversity data set in its own right. Embedded in it are:
      • Predator/prey relationships
      • Habitat/distribution data
      • Host/parasite data
      • Pathogen/disease vector data
    • Third party researchers and projects are interested in mining the BHL texts for multiple research needs.
    • Using one site both for serving/accessing/downloading digital texts and for data mining is messy. Proposal: separate the two, and put a version of the corpus in a public-like cloud space.
  • BHL by the Book
    • File types: PDF, OCR, XML, JP2
    • > 70TB, growing every day…
    • One 380 pg (avg) volume = multiple files, varying sizes, relationships among them
  • Current distributed infrastructure
    • Internet Archive: Digitized content / files
    • MOBOT: Database & web application
    • MBL: Redundant cluster
    • [Diagram labels: Metadata, Content]
  • BHL Vision: Global Infrastructure
    • Preservation System: multiple redundant copies of all digitized content.
    • Access System: files, metadata & services needed to deliver content.
    • [Diagram labels: Data Ingest (×3), Sync, Replicate]
  • DuraCloud pilot
    • Community interest in cloud storage
      • (Funding organizations, too!)
    • Wanted to evaluate applicability of cloud storage for large-scale digitization activities
      • Solutions for efficient transfer of 10-100s TB data
      • Lower cost alternatives to maintaining large data centers
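To give a feel for why moving 10-100s of TB is hard, here is a minimal back-of-the-envelope sketch in Python. The link speeds and the `efficiency` factor are illustrative assumptions, not figures from the pilot; only the roughly 70 TB corpus size comes from the slides.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Estimate days needed to move `terabytes` over a network link.

    `efficiency` is a hypothetical fudge factor (0 < efficiency <= 1) for
    protocol overhead and contention; 1.0 means the full line rate is used.
    Decimal units are assumed: 1 TB = 1e12 bytes, 1 Mbit = 1e6 bits.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Assumed link speeds, chosen only for illustration.
    for mbps in (100, 1000):
        print(f"70 TB at {mbps} Mbit/s: {transfer_days(70, mbps):.1f} days")
```

Under these assumptions a sustained 100 Mbit/s link needs about two months for 70 TB, which is in line with the "weeks to months" scale described later in the deck.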
  • BHL Policy Challenges
    • Money - At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Since MBL is a BHL member, it also provides a level of administrative commitment.
    • Skill level - Multiple global partners need all or some of the current holdings, and they have varying levels of technical skill. For some, shipping hard drives might be easier. For others, uploading to and downloading from the cloud might be preferable.
    • Control - Cultural-scientific digital projects have no clear models for using the cloud. Early-adopter paranoia.
  • Data Transfer Methods & Limitations
    • Shipping hard drives (NodeA to NodeB)
      • Problems: Hardware failure, data loss, shipping fees
    • Network transfer (NodeA to NodeB)
      • Problems: Available bandwidth, upload/download fees
  • Data transfer: Cloud vs. Cluster
    • Inventory & audit lists
    • Checksums for data integrity
    • Heavy lifting at BHL scale, regardless of endpoint
      • weeks->months, not minutes->days
    • Differences
      • In cluster environment, have to be intimately involved in hardware decisions, maintenance, troubleshooting
      • In cloud environment, those worries are part of your fee
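The inventory, audit, and checksum steps named on this slide can be sketched roughly as follows. This is a minimal illustration, not BHL's actual tooling: the function names, the choice of SHA-256, and the report keys are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(source: dict[str, str], copy: dict[str, str]) -> dict[str, list[str]]:
    """Compare two inventories and report files that are missing, extra, or altered."""
    shared = source.keys() & copy.keys()
    return {
        "missing": sorted(source.keys() - copy.keys()),
        "unexpected": sorted(copy.keys() - source.keys()),
        "corrupt": sorted(k for k in shared if source[k] != copy[k]),
    }
```

The same comparison applies whether the receiving endpoint is a cluster or a cloud store: build an inventory on each side and audit one against the other after every transfer.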
  • Challenges for adopting cloud storage
    • BHL is embedded in longstanding institutions with megainfrastructure.
      • Already support data storage & maintenance at BHL scale
    • Little funding for alternative infrastructure / storage
      • Current storage is (really, truly) free through Internet Archive
    • Costs associated with download / use of content
      • BHL is a global resource for a broad community
      • User community wants to “do things” with data
  • Lessons learned from pilot
    • Cloud infrastructure & applicability to BHL are no longer a mystery
    • Nothing is free
      • Except when it is
    • Cloud storage provides ability to quickly scale infrastructure
      • No lost time procuring & configuring hardware
    • Useful for the right kinds of datasets
      • It’s not the size of the corpus, it’s the size of the files
      • Huge files are problematic
  • More lessons learned
    • More possibilities than expected:
      • Features
      • Movement
      • Support available from commercial providers.
      • Increasing menus of choices
    • There is no silver bullet
      • Cloud is just a different endpoint for file storage
      • It doesn’t solve all problems related to repository management
  • Global data sharing requires a social infrastructure
  • BHL Services & APIs
    • OpenURL
      • Facilitate links to citations: protologues, articles, references
        • Documentation:
      • Useful to Nomenclators, Reference Systems
        • IPNI
        • Tropicos
    • Names Service
      • Return all occurrences of a name throughout BHL digitized corpus
        • Documentation:
      • Working out a strategy for obscure species
      • Algorithm improvements to detect nomenclatural & taxonomic acts
  • BHL + Tropicos
    • A unique platform for biodiversity research
    • Built to serve taxonomists’ & other scientists’ investigations
      • But now serve multiple disciplines
    • Enhanced by 250+ years of accumulated knowledge
      • Complicated by 250+ years of collegial disagreement
    • Complementary to physical libraries & herbaria
  • pid=title:3934&volume=14&issue=&spage=301&date=1879
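The query string on this slide can be taken apart with Python's standard library to see which citation fields it carries. This is only a sketch: the helper names are hypothetical, the endpoint URL is not given in the slide and is deliberately left out, and reading `pid=title:3934` as a BHL title identifier and `spage` as the start page is an assumption based on common OpenURL usage.

```python
from urllib.parse import parse_qs, urlencode


def parse_openurl_query(query: str) -> dict[str, str]:
    """Split an OpenURL-style query string into a flat field dictionary.

    Blank values (such as an empty `issue`) are kept rather than dropped.
    """
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def build_openurl_query(fields: dict[str, str]) -> str:
    """Rebuild the query string, leaving ':' unescaped as in the slide."""
    return urlencode(fields, safe=":")


if __name__ == "__main__":
    query = "pid=title:3934&volume=14&issue=&spage=301&date=1879"
    fields = parse_openurl_query(query)
    for name, value in fields.items():
        print(f"{name!r}: {value!r}")
```

A nomenclator or reference system holding a citation's volume, start page, and year could fill these fields and append the resulting string to the resolver address given in the BHL documentation.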
  • BHL OpenURL Disambiguation
    • Looking for:
    • BHL returns:
  • Services: OpenURL Results
  • Conclusion
  • Questions?
    • Chris Freeland
    • Technical Director , Biodiversity Heritage Library
    • Director , Center for Biodiversity Informatics, Missouri Botanical Garden
    • Missouri Botanical Garden
    • 4344 Shaw Blvd.
    • St. Louis, MO 63110 USA
    • Email: [email_address]
    • Twitter: @chrisfreeland
    • Blog / info: