BiSciCol: Linking Information for Biodiversity Scientists

Describes the need for better ontologies and better identifier schemes in the quest to break down the walled gardens of biodiversity science.

Transcript

  • 1. The BiSciCol Project: Linking Information for Biodiversity Scientists
    John Deck, UC Berkeley
    BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Tom Conlin, Neil Davies, John Deck, Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Brian Stucky, Rob Whitton
  • 2. The Biodiversity Data Integration Challenge
    Adapted from The Economist, by David Simonds
  • 3. “I’m here to fight for truth, justice, and the American way.” – Superman
    Ontologies, vocabularies, and standards help provide a common understanding of the structure of information, allowing us to break data down to its fundamental parts.
  • 4. “Your identity is your most valuable possession. Protect it. And if anything goes wrong, use your powers.” - Elastigirl
    Identifiers allow us to tag, track, or reference any object or process. They must be awesome: persistent, unique, resolvable.
  • 5. The BiSciCol Strategy
    Spreadsheets / DwC Archives / Raw Data
    • Break down to fundamental parts
    • Assign awesome identifiers
    • Re-assemble and integrate
  • 6. A Data Integration Experiment
    Link records between VertNet and GenBank using the Darwin Core Triplet (InstitutionCode : CollectionCode : CatalogNumber)
    • 1,400,000 VertNet records
    • 460,739 GenBank records (filtered by VertNet institutions)
    Question: What % of harvested GenBank records could be linked to VertNet voucher specimen records using the Darwin Core Triplet?
    Back to Reality … Less than 1%!
  • 7. Identifier Challenges
    • Darwin Core triplets (at least as currently specified in standards, and implemented) do not do well for linking data.
    • NONE of the identifiers (that we found) employ strategies to ensure truly long-term persistence, decoupling metadata from the identifier itself.
    Interim Solutions
    • Fix DwC Triplets standards/validation (that’s you, GenBank), build a Triplet resolver
    • PURL
    Awesome Solutions
  • 8. BiSciCol Strategies to Address the Biodiversity Data Integration Challenge
    Ontologies, vocabularies, standards
    • Biological Collections Ontology (http://code.google.com/p/bco)
    • Genomic, Biodiversity, and Ecological standards alignment
    *+BCIDs
    • Free, persistent, scalable, resolvable and awesome identifiers for biodiversity data, built on CDL’s EZID system (http://biscicol.org/bcid/)
    *Triplifier
    • Chunks raw data into fundamental parts, then re-assembles as RDF and integrates with other data (http://biscicol.org/triplifier/)
    *Learn more about these projects at the Software Bazaar
    +More about BCIDs integrating with VertNet on Day 2
  • 9. Ontology / Vocabulary Challenges
    Need to clarify assumptions behind concepts
    • Individual / Material Sample / Specimen / Population
    • Different interpretations across domains: MIxS, INSDC, DwC, OBI
    Varying degrees of formalism
    • Checklists, spreadsheets, RDF, OBO, OWL
    Insufficient support for standards organizations
    • Consisting of tenuous structures maintained by informal networks of active volunteers
    Solutions:
    • Continually improve clarity in definitions
    • Work towards more robust standards governance frameworks
    • Implement test beds and better understand use cases
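
The linking experiment on slide 6 can be illustrated with a small sketch. This is not the BiSciCol team's code: the function names, the uppercase/whitespace normalization, and the sample records below are all assumptions made for illustration. It only shows the general idea of parsing Darwin Core Triplets (InstitutionCode : CollectionCode : CatalogNumber) on both sides and counting how many sequence records find a matching voucher.

```python
from typing import Iterable, NamedTuple, Optional


class Triplet(NamedTuple):
    institution_code: str
    collection_code: str
    catalog_number: str


def parse_triplet(raw: str) -> Optional[Triplet]:
    """Split 'InstitutionCode:CollectionCode:CatalogNumber' into its parts.

    Returns None unless the string has exactly three non-empty parts.
    Trimming whitespace and uppercasing is an assumed normalization,
    not something the Darwin Core standard prescribes.
    """
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) != 3 or not all(parts):
        return None
    return Triplet(*(p.upper() for p in parts))


def link_rate(vouchers: Iterable[str], sequence_refs: Iterable[str]) -> float:
    """Fraction of sequence records whose triplet matches a voucher triplet."""
    known = {t for t in map(parse_triplet, vouchers) if t is not None}
    total = 0
    linked = 0
    for raw in sequence_refs:
        total += 1
        triplet = parse_triplet(raw)
        if triplet is not None and triplet in known:
            linked += 1
    return linked / total if total else 0.0


# Hypothetical sample data, invented for this sketch.
voucher_records = ["MVZ:Herp:12345", "MVZ:Mamm:678"]
sequence_records = [
    "MVZ:Herp:12345",    # exact match
    "mvz : mamm : 678",  # matches only after normalization
    "MVZ 678",           # not a well-formed triplet
    "MVZ:Birds:1",       # well-formed, but no such voucher
]

print(f"linked: {link_rate(voucher_records, sequence_records):.0%}")
```

Even in this toy data, half of the references fail to link because the triplet is malformed or points at nothing, which is the kind of problem the slides attribute to triplets as currently specified and implemented.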