Encyclopedia of Life: Applying Concepts from Amazon and LEGO to Biodiversity Informatics

A presentation to the Genomic Standards Consortium 15 meeting in Bethesda, MD on 23 April 2013

Published in: Technology, Education

  • This is a very different kind of talk because I am not focusing on metagenomics or microbes per se. I would like to introduce what we are doing at the Encyclopedia of Life, in the hope that we can soon bridge the gap between those studies and studies of macroorganism diversity.
  • We have a working infrastructure as well as more than 200 partners. We harvest and sort text and multimedia by topic and by species and put it on our pages. Curation and user-added content from the crowds are added to the mix. This is fed back to providers, giving them traffic, quality control on their own content, and new content for them to use. We are already seeing spinoff products. We make it easy for developers, and everything is either public domain or CC-licensed so it can be re-used.
  • As this is a meeting about standards, I thought I would mention some of the standards we are using.
  • We now have over a million pages with content, some of it in other languages such as Arabic, Spanish, and Chinese. Our traffic comes mostly from the general public, from all over the world.
  • There are strong links between taxa represented in NCBI’s databases and others. Each dot here represents a project with a database holding some sort of biological data. The links between these databases are chiefly based on taxonomic names, so EOL has mapped every name and its identifiers in each of these hubs to bring the data together.
  • One of the benefits is that we can support third-party projects where linking and visualizing via names is critical. This is BioNames, a project by Rod Page & Ryan Schenk. They have visualized the taxonomic concepts in this family of bats. Where there are no images, there are obvious gaps in the Encyclopedia of Life. There is a timeline showing when species were described, a sample classification and distribution map, and links to some of the foundational literature. Essentially, they are re-organizing EOL data to suit their own use cases for taxonomists, and bringing in additional data not yet available on EOL.
  • Here is what archaeologists are doing with EOL.
  • The removal of objects is now forbidden in most countries and at many sites in the US. As a result, data collection methods have changed from describing a physical object accessible in the US to creating a full surrogate for an object that might be re-buried in the ground. Data collection has increased as the collection of objects has decreased. Still, individual systems of data collection (see examples on the right) have emerged which have developed over time, are handed down from mentors, and contain some technological adoption, particularly the adoption of Excel spreadsheets over relational databases. In all of our interviews there was no reference to existing guides on archaeological documentation, such as those of the Archaeology Data Service (UK) or DANS (Netherlands).
  • Most of our 5.4 million content objects are text blobs and here are the subjects of that text. Most often, our text objects are about distribution. But there are many other subjects involved including essays that include multiple subjects.
  • In the Information Visualization MOOC (Massive Open Online Course) led by Dr. Katy Börner of Indiana University, students TwyBethard (United States), Andrew Miles (United Kingdom), Edward Kok (Netherlands) and Mattia Della Libera (Italy) used GloBI data to create an insightful visualization of spatial marine food webs in the Gulf of Mexico.
  • In the next year and a half we are tackling these challenges with funding from the Sloan Foundation. We are starting with marine data. In the simplest view, we’ll be storing triples, each part of which can be linked to a definition so that the meaning is clearly defined. There might be five different ways to define an attribute like “body length” and we should be able to handle them all without losing the distinction. Of course we’ll also make sure each triple links back to a dataset and all the appropriate credits. This data will be organized on a data tab, perhaps sorted into the 35 or so “topics” that we currently have text chapters for, like size or reproduction, and we will also allow powerful downloading and searching capability. Finally, we’ll be setting up ways for other applications to grab the data and do interesting things with it. This semantic web technology isn’t new, but the way we’ll be using it with EOL is new.
  • We serve building blocks, but not quite like LEGO, because we are not a single source that mass-produces everything.
  • We are more like Amazon Marketplace, because we are an infrastructure that providers (i.e. merchants) can plug into to share their data with others.
  • We are in the midst of a genomics revolution. The cost to generate a full genome sequence is dropping more or less daily. What is all this genetic information DOING? How does it relate to what we can see and measure about organisms, their phenotypes, or their traits? How do these genes interact with the environment to result in both normal and abnormal development of traits, not just for lab-dwelling species like rats, but across the tree of life? How do evolutionary changes in DNA make a difference in the lives of organisms? TraitBank, which is not yet funded, would enable us to scale up and manage all kinds of trait data about all organisms.
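
The note on EOL connecting hubs describes linking databases through taxonomic names and their identifiers. A minimal Python sketch of that idea follows. It is an illustration only, not EOL's actual reconciliation method: the records, the identifier strings, and the helper names `normalize` and `build_index` are all hypothetical, and real name matching also has to deal with synonyms, homonyms, and authorship strings, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical records: (hub, identifier within that hub, scientific name).
# The identifiers are invented for illustration; they are not real IDs.
RECORDS = [
    ("EOL", "eol:0001", "Panthera leo"),
    ("GBIF", "gbif:1111", "Panthera leo "),
    ("NCBI", "ncbi:2222", "panthera  leo"),
    ("GBIF", "gbif:3333", "Myotis lucifugus"),
]


def normalize(name: str) -> str:
    """Collapse whitespace and fix case so trivially different spellings match."""
    parts = name.split()
    if not parts:
        return ""
    genus = parts[0].capitalize()
    epithets = [p.lower() for p in parts[1:]]
    return " ".join([genus, *epithets])


def build_index(records):
    """Map each normalized name to the identifier it carries in each hub."""
    index = defaultdict(dict)
    for hub, identifier, name in records:
        index[normalize(name)][hub] = identifier
    return dict(index)


index = build_index(RECORDS)
for name, ids in sorted(index.items()):
    print(name, ids)
```

Once such an index exists, a name found in one hub can be used to look up the matching identifiers in the others, which is the kind of linking the third-party projects in these notes depend on.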

    1. Encyclopedia of Life (eol.org, @eol). Cynthia Parr (@cydparr). GSC15, 23 April 2013
    2. A webpage for every species
    3. How EOL works: EOL, Crowds, Harvest, Third party applications
    4. EOL: Plinian Core, DwC description, SPM infoitem; using Dublin Core & Audubon Core for other metadata; Darwin Core Archive flat files as transport mechanism. Sharing process adds semantics to content objects
    5. EOL Today: Key Milestones in 2013. 1.1 million species pages; 240+ content providers; 3 million unique visitors from 223 countries & territories
    6. EOL connects hubs: EOL, GBIF, NCBI (with Anne Bowser, University of Maryland)
    7. BioNames: Rod Page, Ryan Schenk (iphylo.blogspot.com)
    8. Anatolia Zooarchaeology Case Study, led by Alexandria Archive Institute. Research goals and outcomes: improve archaeological data collection / documentation practices; better understanding of gaps (spatial and temporal); integrated biometrics show complex patterns (introduction of domestic and continued use of wild animals by region); aligning data to EOL taxon identifiers helps draw out patterns in relative proportion of taxa over time and space across many assemblages
    9. EOL Computable Data Challenge: (1) 14 different sites; (2) 34+ zooarchaeologists; (3) decoding, cleanup, metadata documentation; (4) 220,000+ specimens; (5) 450 entities linked to 143 EOL taxon concepts; (6) anatomical entities linked to Uberon.org; (7) biometrics linked to measurement ontology; (8) collaborative analysis. http://opencontext.org/
    10. [Bar chart: number of text objects (axis from 0 to 800,000) by subject of text object. Subjects shown: Distribution, Multiple topics, Habitat, Threats, Conservation, Trends, Associations, TrophicStrategy, PopulationBiology, Migration, LifeExpectancy, Behaviour, Diseases]
    11. Promote NLP text mining, crowdsourcing, standardizing: Species Interaction Datasets: Integration, Visualization, and Analysis (Poelen and Mungall); Discovering EnvO habitat terms in EOL contents (Pafilis); Altitude Specificity of Flower Coloration (Wright); Crowd-sourced data to examine morphological impacts of extinction risk in ray-finned fishes (Chang); Macroecological patterns in butterfly-hostplant associations (Ferrer-Paris)
    12. EOL GloBI: Global Biotic Interactions. Challenge: Species interaction datasets are mostly buried in flat files & custom formats. Plan: Build infrastructure for normalizing and aggregating species interaction datasets and make them accessible through flat files (Darwin Core Archive), web services, and semantic web endpoints (SPARQL). Eventually: Publish biotic interaction ontology re-using existing ontologies, re-integrate with EOL. Enable semantic interoperability to allow for cross-functional analysis (e.g. How does a parasite regulate gene expression of host?). Poelen, Mungall, Simons, Reiz
    13. http://globalbioticinteractions.wordpress.com/ 14 datasets containing 25k taxa, 422k interactions, for 3k locations; alpha version of ingestion, normalization, aggregation; alpha version of web API; alpha version of data exports. Dr. Katy Börner led Information Visualization MOOC
    14. Easy access to analyzable trait data. “Are blue organisms more common in high altitudes?” “How can I predict vulnerability to climate change based on life history characteristics?” “What organisms should I collect to fill in gaps in genome quality tissue collections?” Options: look for data type, download for all taxa; create a collection of taxa, download all data; use Reol: an R interface to EOL (Banbury, O'Meara) http://barbbanbury.info/barbbanbury/Reol.html; find more specialized data repositories
    15. Adding traits to EOL. Funded: Marine focus. <scientific name> <hasAvgBodyMass in g> <value>; <scientific name> <preysOn> <scientific name>. Harvest and display on data tab; add high-level semantics from coarse SPM ontology; downloads, fancy searching; machine access
    16. INSDC: 900,000 species; 4,000 genomes; 60 million DNA sequence records. How are these related to traits? Next step: TraitBank
    17. Thanks. Funding & other contributions: Sloan Foundation; Smithsonian Institution; David Rubenstein; Marine Biological Laboratory; Harvard University; our content partners; thousands of individual contributors, and hundreds of volunteer curators. Image credits: Jenny from Taipei; University of Birmingham. Cynthia Parr, Chief Scientist, @eol, @cydparr, parrc@si.edu. GloBI: Jorrit Poelen (lead/software), Chris Mungall (ontologies), James Simons (biologist) and Robert Reiz (software). Datasets shared by: Peter D. Roopnarine, Rachel Hertog, Carlos García-Robledo, James Simons, Jenny L. Wrast, C. Barnes, International Council for the Exploration of the Sea (ICES), Jose R. Ferrer Paris, Senol Akin, Malcolm Storey (BioInfo.org.uk), Ivy E. Baremore, Joel Sachs (SPIRE), Colt W. Cook, David A. Blewett. Alexandria Archive: Sarah Kansa, Eric Kansa, 34 other zooarchaeologists. BioNames: Rod Page, Ryan Schenk. MOOC: Katy Börner, TwyBethard, Andrew Miles, Mattia Della Libera
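
The triple patterns on the "Adding traits to EOL" slide (`<scientific name> <hasAvgBodyMass in g> <value>`, `<scientific name> <preysOn> <scientific name>`) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the planned EOL implementation: the `Triple` class, the `find` helper, the `example.org` term URIs, the species placeholders, the values, and the dataset labels are all hypothetical. It only illustrates two points from the talk: each predicate points at its own definition, so two different "body length" measures stay distinct, and every triple keeps a link back to its source dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str      # scientific name
    predicate: str    # URI of the term definition (hypothetical URIs below)
    value: object     # a measurement or another scientific name
    source: str       # dataset the statement came from, kept for credit


# Hypothetical term URIs: two differently defined "body length" attributes.
TOTAL_LENGTH = "http://example.org/terms/totalBodyLength"
SNOUT_VENT_LENGTH = "http://example.org/terms/snoutVentLength"
PREYS_ON = "http://example.org/terms/preysOn"

# Invented example data, for illustration only.
TRIPLES = [
    Triple("Species A", TOTAL_LENGTH, 120.0, "dataset-1"),
    Triple("Species A", SNOUT_VENT_LENGTH, 55.0, "dataset-2"),
    Triple("Species A", PREYS_ON, "Species B", "dataset-3"),
]


def find(triples, subject=None, predicate=None):
    """Return the triples matching the given subject and/or predicate."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
    ]


for t in find(TRIPLES, subject="Species A"):
    print(t.predicate, t.value, "from", t.source)
```

Because the two length predicates have different URIs, a query for one never silently returns values recorded under the other definition.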
