The emerging biodiversity data ecosystem


A talk given at iEvobio11, a conference about Informatics for Phylogenetics, Biodiversity and Evolutionary Biology, held in Norman, Oklahoma June 21-22, 2011

Published in: Technology

Speaker notes:
  • Conclusion is that there is value in treating all the biodiversity information systems as part of an interconnected ecosystem. We can study the connections, and we can assess the depth of information in the network. I’ll focus on EOL’s role in the system, but I hope to make observations that will be generally useful too.
  • Objects such as these are essentially chunks of text sorted by topic. They span biology from physiology to ecology to evolution. Each of these credits the source, and can receive comments or ratings, or can be trusted or untrusted by curators.
  • So, the approach of EOL is rather different from that of many other sites. EOL is a giant mashup that creates pages that are then available for curators (mostly credentialed scientists) to assess and rate, or for anybody to provide comments or tags. By the numbers: 160+ partner databases; 700 curators, 1000s of contributors, 46,000 members; 2.8 million pages; 600 thousand pages with Creative Commons content; over 2 million data objects and >1 million pages with links to research literature. Traffic in the past year: 1.7 million unique users, 6.2 million page views.
  • Represents about 1600 projects, and 1700 instances of data flow or hyperlinks between them. Size of the vertex, or node, reflects degree, or how many links the node has. We used the Clauset-Newman-Moore algorithm to determine which vertices grouped together, then gave each group a color code. Those nodes with a degree of 15 or higher are labeled, and their edges are shown thicker than the others. These are the hubs of this network, and they are reasonably well connected to each other. (go through and expand the acronyms)
  • Daphne Fautin’s Hexacorallians of the World
  • With this as a baseline, how connected and resilient is the network? Over time we want it to become more connected and resilient, both to enable discovery and recovery in case of catastrophic problems. We can also use this to develop effective mechanisms to annotate data and improve data quality. If the same data appear on different parts of the network, and someone reports an error, the repair of that data needs to propagate effectively. What are the factors that influence data flow quantity and effectiveness…
  • Brighter green indicates a higher % of descendants with text; the size of each square is the number of descendants, square-root scaled
  • Ecologically important – keystone species, indicator species
  • Inspired by community ecology and measures of species diversity, which of course were originally inspired by information theory, but we haven’t used those measures. Instead we combined these factors so that we could weight each one by how well it captures “a rich page.” We sampled dozens of pages and had team members assess them for their gestalt “richness” based on their own criteria. Then we compared those scores to those generated by the algorithm, and iteratively changed weights until we achieved a set of weights that appeared to reflect human perception of “richness.” Note that there’s a penalty: unvetted material is only worth about 75% of vetted material. Also, there are maximums for many of these input values: having 200 images may not make a page much richer than having 25 images. We reserve the right to change this to ensure that the index is as useful as possible. As with Google PageRank, we want to ensure that nobody can game the system.
  • Also note that there is an implication that a “rich page” is a “high quality page”: not necessarily true, but often it is. As EOL goes forward with our version 2 we’ll be gathering other inputs that can tell us if a page is successful, such as ratings of its objects.
  • Here’s what we are already doing for the OBIS specimens, which have rich environmental data associated with them. We could add similar values from other partners, for example from GenBank, where some sequenced samples are collected from known environments, or from ecological studies that aren’t part of the specimen-based system. A user could subscribe to this value and get alerts if new values come in that are outside this range. One could set up a model for this taxon and its relatives, predicting expected values; then if new values aggregated from any of EOL’s partners violate the model, the scientist who published the model gets a notification. It could be that there’s a flaw in the data integration or some violation of assumptions about the measurement workflow. Or it could be that there’s something we truly didn’t understand before. This truly leverages the scientific output of many researchers: better use of resources, more rapid advances in understanding of biological systems.
  • Analogous to the study of ecosystems, where we seek to build an understanding of entire systems with many kinds of inputs, both biotic and abiotic
  • In addition to the authors…
  • The emerging biodiversity data ecosystem
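
The hub labeling described in the network-analysis note (node size by degree, labels for nodes with degree 15 or higher) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code behind the talk: the link list is a made-up sample rather than the real 1700 links, the function names `degrees` and `hubs` are hypothetical, and the threshold is lowered to 3 so that the toy data yields hubs. The community-grouping step is not attempted here.

```python
from collections import defaultdict

# Hypothetical sample of links (data flow or hyperlinks) between projects.
# These pairs are illustrative only and are not taken from the real network.
links = [
    ("EOL", "GBIF"), ("EOL", "NCBI"), ("EOL", "TOLWeb"),
    ("GBIF", "NCBI"), ("GBIF", "OBIS"), ("Hexacorallians", "OBIS"),
]


def degrees(edges):
    """Count how many links touch each node (undirected degree)."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


def hubs(edges, min_degree):
    """Return nodes with degree >= min_degree, highest degree first."""
    deg = degrees(edges)
    candidates = [node for node, d in deg.items() if d >= min_degree]
    # Sort by descending degree, then by name so the order is stable.
    return sorted(candidates, key=lambda node: (-deg[node], node))


# The talk labels nodes with degree 15 or higher; 3 is used for the toy data.
print(hubs(links, min_degree=3))
```

With the full link list, the same degree counts could drive both node size and the labeling threshold.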

    1. 1. The emerging biodiversity data ecosystem: Cynthia Parr, Katja Schulz, Jennifer Hammock (Smithsonian Institution); Nathan Wilson, Patrick Leary (Marine Biological Laboratory); Richard Allen (Environmental Protection Agency)
    2. 2. Today’s story: What is EOL; Core questions; Network analysis; Hotlist development; Page richness algorithm; Conclusion: improving the health and richness of our knowledge network advances understanding
    3. 3. What is EOL: Global access to knowledge about life on earth
    4. 4. All species
    5. 5. Freely accessible & reusable: open access, open source
    6. 6. Available from a single portal in a common format
    7. 7. Quality
    8. 8. Always growing
       EOL Topics: Associations Behaviour ConservationStatus Cyclicity Cytology DiagnosticDescription Diseases Dispersal Distribution Evolution GeneralDescription Genetics Growth Habitat Legislation LifeCycle LifeExpectancy LookAlikes Management Migration MolecularBiology Morphology Physiology PopulationBiology Procedures Reproduction RiskStatement Size Threats Trends TrophicStrategy Uses Description Conservation Key Biology Ecology Introduction Education Barcode CitizenScience EducationResources Genome NucleotideSequences FunctionalAdaptations FossilHistory SystematicsOrPhylogenetics Development IdentificationResources
    9. 9. EOL is a content curation community: Content providers; Databases; Journals; LifeDesks; Public contributions; Curating; Aggregation; Commenting; Tagging
    10. 10. Core questions: Where is our knowledge about biodiversity? Where are the gaps? What are the most effective ways to fill gaps given our limited resources?
    11. 11. Network analysis, with Anne Bowser, University of Maryland: EOL; GBIF; NCBI; EOL connects hubs
    12. 12. The GBIF hub has subnetworks
    13. 13. Key individuals seek out hubs: TOLWeb
    14. 14. Implications and next steps: Need more data; Identify isolated projects & mechanisms for connecting them to the network; Improve resilience & redundancy; Distribute annotation & quality control; Model data flow quantity and impact
    15. 15. Viewer of Life on EOL – Kris Urie
    16. 16. Low % of descendants with text in Arthropods
    17. 17. Within arthropods coverage varies ... perhaps as expected
    18. 18. Developing the EOL hot list: Consultation with taxonomic experts; Development of criteria; Assembly of critical lists; Establishing targets for rich taxon pages, lesser known pages
    19. 19. EOL’s hot lists: Hot List (70,000 taxa): Conservation concern; Invasives; Model organisms; Ecologically important; Pests; Charismatics; Data availability. Red Hot List (2,800 taxa): Most searched; Top 100 invasives; Crops (food); Zoos & aquaria; High traffic; Higher taxa
    20. 20. Taxon page richness algorithm: Weights: 60%, 30%, 10%. Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status. Depth: # words per text object, # words total. Diversity: sources (partners). Score = a (Breadth) + b (Depth) + c (Diversity), range 0 – 1, threshold 0.4
    21. 21. Summary of EOL page richness: Overall: 640,000 have content; 2 % are rich; 25 % have only links to literature. Hot List: 28 % of 75K are rich; average richness = 0.30. Red Hot List: 56 % of 3K are rich; average richness = 0.43
    22. 22. Strategies for improving richness: Crowd-sourcing; Leveraging; Collections; Communities; Mobile apps; Enabling platforms; Enabling journals; Data mining BHL etc.; Version 2: Coming in Fall 2011!
    23. 23. The page richness index: Helps fill gaps with existing knowledge; Helps prioritize funding and training so that it has maximum impact on closing true gaps; Will be available via API; Computing and storing richness index on EOL is a step towards storing and serving computable data
    24. 24. Dynamic data summaries = new knowledge: Summarize data within a partner, then across partners. For example: compute an average value for one taxon (x specimens), compare to range of values across all taxa (621,393 samples). Atlantic Cod (Gadus morhua). Jen Hammock (EOL); Edward van den Berge (OBIS)
    25. 25. Conclusions: There is a lot of data out there in a lot of knowledge bases; Understanding how it is connected can help us improve the ecosystem: Quality control
    26. 26. Resilience
    27. 27. Richness assessment
        Large-scale data summaries can foster gap-filling and standing, dynamic knowledge analyses
    28. 28. Thank you: 160+ content partners; 2000 Flickr contributors; 1000s Wikipedia contributors; 43,000 EOL members. Funding: John D. and Catherine T. MacArthur Foundation, Alfred P. Sloan Foundation, Cornerstone Institutions, Private Donors. See Demo and Version 2 sneak peek in Software Bazaar. Leadership: Erick Mata, Bob Corrigan, Mark Westneat, Marie Studer, Tom Garnett, Jim Edwards, David Patterson. Developers: Peter Mangiafico, Jeremy Rice, Dimitri Mozzherin, David Shorthouse, Lisa Whalley and others. Biologists: Tanya Dewey, Audrey Aronowsky, Leo Shapiro
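
The taxon page richness slide gives only the shape of the score: a weighted sum of breadth, depth, and diversity on a 0 to 1 scale with a threshold of 0.4. The sketch below shows how such an index might be computed. It is not EOL's actual algorithm. Pairing the 60% / 30% / 10% weights with breadth, depth, and diversity follows the slide order and is an assumption; the caps in `CAPS`, the choice of inputs, and the names `richness` and `is_rich` are all hypothetical. The 0.75 factor reflects the remark in the notes that unvetted material is worth about 75% of vetted material.

```python
# Assumed pairing of the slide's weights with the three components.
WEIGHTS = {"breadth": 0.6, "depth": 0.3, "diversity": 0.1}
UNVETTED_FACTOR = 0.75   # unvetted material counts for about 75% of vetted
RICH_THRESHOLD = 0.4     # threshold shown on the slide

# Hypothetical maximums: counts above a cap add nothing further.
CAPS = {"images": 25, "topics": 10, "words": 2000, "sources": 5}


def capped_fraction(value, cap):
    """Scale a count to 0..1, saturating at the cap."""
    return min(value, cap) / cap


def richness(vetted, unvetted=None):
    """Combine capped inputs into a 0..1 richness score.

    vetted and unvetted are dicts of counts keyed like CAPS.
    """
    unvetted = unvetted or {}
    frac = {}
    for key, cap in CAPS.items():
        effective = vetted.get(key, 0) + UNVETTED_FACTOR * unvetted.get(key, 0)
        frac[key] = capped_fraction(effective, cap)
    breadth = (frac["images"] + frac["topics"]) / 2  # simplified breadth
    depth = frac["words"]
    diversity = frac["sources"]
    return (WEIGHTS["breadth"] * breadth
            + WEIGHTS["depth"] * depth
            + WEIGHTS["diversity"] * diversity)


def is_rich(score):
    """Treat a score at or above the threshold as rich (an assumption)."""
    return score >= RICH_THRESHOLD


page = {"images": 200, "topics": 5, "words": 1000, "sources": 2}
score = richness(page)
print(round(score, 2), is_rich(score))
```

Because of the caps, a page with 200 images scores the same as one with 25, which mirrors the point that piling up one kind of object should not inflate the index.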