Introduction to EOL.org for scientists

A talk given at the Semantic Reasoning workshop held at the National Museum of Natural History September 6, 2012. The audience included computer scientists and biological scientists interested in using EOL for their research.

  • Whirlwind tour of EOL. As you may know, the Encyclopedia of Life is a website providing global access to knowledge about life on Earth. Global: the whole world. Access: free, and freely reusable. Knowledge: synthesized, not raw. Life on Earth: biological diversity.
  • My goals are to give you the whirlwind tour with enough information to ring some bells in areas that might be of interest to you, and to inspire you to ask deeper questions.
  • I want to emphasize that EOL deals in summarized knowledge, not raw specimen data. For example, for the serpent's head cowrie, we have images like this one from the Moorea Biocode project, but instead of serving the individual specimen data, we show the overall distribution of specimens on a map from GBIF. We also get a summary of the environmental data associated with specimens in the Ocean Biogeographic Information System (OBIS) database. Imagine if we could do a summary like this across databases.
  • This is a graphical way of presenting the summarized data from OBIS, which Jen Hammock on my staff worked on with Edward Vanden Berghe and our team at the Marine Biological Laboratory. The salinity range for the species is shown here as just a small, specific slice of the range between the global ocean minimum and maximum. Looking just at 15 content providers we already work with, it is possible that numeric data such as lifespan or average body weight are already available for more than 800,000 species.
  • EOL takes information from about 200 sources so far, mostly scientific databases, but also including Flickr and Wikipedia, and automatically sorts it onto taxon pages. Our curators can then trust or untrust it, and anybody can provide comments or ratings. About a thousand credentialed scientists have already volunteered to help with quality control. Actions and comments get fed back to the original providers, and the material on EOL is also available to other applications via an Application Programming Interface, which I’ll talk more about in a moment. We’re partnering with over two hundred scientific databases as well as public contribution sites like Flickr and Wikipedia. Key figures: 100+ partner databases; 700 curators, 1000s of contributors, and 46,000 members; 2.8 million pages; 500 thousand pages with Creative Commons content; over 2 million data objects and >1 million pages with links to research literature. Traffic in the past year: 1.7 million unique users and 6.2 million page views.
  • Extension. Leveraging strengths.
  • Free for third party applications, as long as licenses are respected. Examples: field guides, mobile applications, web page widgets.
  • Please see me afterwards if you are interested in any of these topics
  • We have a feature where users can create customized collections of pages or objects on EOL. A scientist could search for a characteristic, say, red flowers, and create a collection of those taxa. Actually, we’ve been doing this with blue coloration in the “Life is blue” collection. If you wanted to test what might be driving the evolution of coloration, you could write a program that uses EOL to get the identifiers for those species in GenBank, or in some other EOL partner database that we’ve mapped to each of those taxon pages, and then use those identifiers to go to that database and pull raw data to analyze: for example, genetic sequences or specimen locations. In the future we hope to make steps 2 and 3 even easier, so you might just be able to click a button and download lots of raw data for your collection from certain data sources.
  • You can also use EOL for crowd-sourcing. For example, Jennifer Hammock has started a collection called “Mystery associates” and asked people to try to identify the partners shown in photos of some sort of ecological association. When they’ve been identified, like this predation interaction between a sea star and an anemone, she moves the image to the “known associates” collection. This adds to the information on food web interactions that we have from a number of partners, and would then be available to food web modelers. There are many other possible ways that the large crowds on EOL could be harnessed to generate new datasets from EOL content. And this is all possible to some degree now.
  • For the future, we are working on a few new angles. First, we are working to make a more phylogenetic organization available on EOL, because that will definitely help those who are doing comparative analyses and who want a true evolutionary framework. The deadline for submitting a large tree is this weekend, Monday really. The second challenge is to propose research work using computable data and EOL in some concrete way, perhaps, as I suggested, by using collections to harvest computable data, or perhaps by using text mining. Here the deadline for the idea is next month, and then we’re providing funds to accomplish the pilot project over the next year. Finally, in September here in Washington we’re bringing in computer scientists and biologists who have an interest in broad-scale, data-intensive science using biodiversity data. We expect this to lead to other projects and enhancements of the EOL platform. All this could, in my personal opinion, lead up to EOL beginning to serve as the Smithsonian’s phenotype repository. In parallel with GenBank, we could be the initial point of entry for ecologists or other biologists seeking large-scale structured information about the observable characteristics of organisms.
  • Also note that there is an implication that a “rich page” is a “high quality page”, which is not necessarily true, but often it is. As EOL goes forward with our version 2, we’ll be gathering other inputs that can tell us if a page is successful: ratings of its objects, for example. The numbers in yellow are definitely out of date.
  • Inspired by community ecology and measures of species diversity, which of course were originally inspired by information theory, but we haven’t used those measures. Instead we combined these factors in a way that lets us assign weights to them based on how well they capture “a rich page.” We sampled dozens of pages and had team members assess them for their gestalt “richness” based on their own criteria. Then we compared those scores to those generated by the algorithm, and iteratively changed weights until we achieved a set of weights that appeared to reflect human perception of “richness.” Note that there’s a penalty: unvetted material is worth only about 75% of vetted material. Also, there are maximums for many of these input values; having 200 images may not make a page much richer than having 25 images. We reserve the right to change this to ensure that the index is as useful as possible. As with Google PageRank, we want to ensure that nobody can game the system.
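
The collections workflow described in the notes (build a collection, look up each taxon's identifier at a partner database, then pull raw data from that partner) can be sketched roughly as follows. This is only an illustration of the shape of such a program, not EOL's actual API: every function name, the stub data, and the identifier formats are hypothetical, and the two lookups are passed in as plain callables so that no real service is assumed.

```python
from typing import Callable, Dict, Iterable

PageId = int


def collect_external_ids(
    page_ids: Iterable[PageId],
    lookup_partner_ids: Callable[[PageId], Dict[str, str]],
    partner: str,
) -> Dict[PageId, str]:
    """Step 2: map each taxon page in a collection to its ID at one partner."""
    found: Dict[PageId, str] = {}
    for page_id in page_ids:
        partner_ids = lookup_partner_ids(page_id)
        if partner in partner_ids:  # skip taxa the partner does not cover
            found[page_id] = partner_ids[partner]
    return found


def fetch_raw_data(
    external_ids: Dict[PageId, str],
    fetch_record: Callable[[str], dict],
) -> Dict[PageId, dict]:
    """Step 3: use the partner's own identifiers to retrieve raw records."""
    return {page_id: fetch_record(ext_id) for page_id, ext_id in external_ids.items()}


# HYPOTHETICAL stand-ins for a collection and for the real lookups.
FAKE_COLLECTION = [101, 102, 103]
FAKE_MAPPINGS = {
    101: {"NCBI": "tx-1", "GBIF": "g-9"},
    102: {"GBIF": "g-7"},
    103: {"NCBI": "tx-3"},
}


def fake_lookup(page_id: PageId) -> Dict[str, str]:
    return FAKE_MAPPINGS.get(page_id, {})


def fake_fetch(ext_id: str) -> dict:
    return {"id": ext_id, "records": []}


ncbi_ids = collect_external_ids(FAKE_COLLECTION, fake_lookup, "NCBI")
records = fetch_raw_data(ncbi_ids, fake_fetch)
print(ncbi_ids)
print(sorted(records))
```

In a real program the two stub functions would be replaced by calls to EOL and to the partner database; keeping them as parameters means the analysis code in step 4 does not need to know which partner supplied the data.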

    1. 1. Introduction to eol.org. Cynthia Parr (@cydparr). Semantic Reasoning workshop, Washington, DC, 6-7 September 2012. @eol
    2. 2. Whirlwind tour • What kind of information we have • How we assemble that information • How machines and people interact with EOL • Next steps
    3. 3. >1.1 million taxon pages with content from more than 200 providers and 1000s of individuals; 5 million content objects
    4. 4. Details tab: Leafy Seadragon example
    5. 5. Total of 1,344,711 images, 9,586 videos, 28,569 sounds
    6. 6. Maps
    7. 7. Literature
    8. 8. EOL has Global Partners and is internationalized: Norway, Dutch, USA, Taiwan, Mexico, China, Egypt, India, Costa Rica, Colombia, Peru, Australia, South Africa
    9. 9. EOL summarizes knowledge. Erosaria caputserpentis, Serpents Head Cowrie. Image from Moorea Biocode; map from GBIF; environmental ranges from OBIS. Depth range based on 51 specimens in 2 taxa. Water temperature and chemistry ranges based on 40 samples. Environmental ranges: Depth range (m): -5 to 67; Temperature range (°C): 23.011 to 28.496; Nitrate (umol/L): 0.048 to 0.923; Salinity (PPS): 33.821 to 35.837; Oxygen (ml/l): 4.349 to 4.825; Phosphate (umol/l): 0.088 to 0.228; Silicate (umol/l): 0.983 to 4.026
    10. 10. Erosaria caputserpentis, Serpents Head Cowrie. Salinity envelope (n=40). From OBIS
    11. 11. Richness scores. Cynthia Parr, Species Pages Group. Global Content Summit, 17-19 Jan 2011. http://eol.org/pages/704102
    12. 12. Whirlwind tour • What kind of information we have • How we assemble that information (big picture, subject semantics, names infrastructure, curation, richness score) • How machines and people interact with EOL • Next steps
    13. 13. EOL aggregates and curates. Sources: scientific databases, including BHL, GBIF, ALA, INBio, COL, Scratchpads, and LifeDesks; scientific journals. [Diagram labels: Aggregate, Curate, Comment, Rate, Collect, eol.org, Quality control]
    14. 14. Sharing process adds semantics to content objects. [Diagram labels: SPM, DwC, infoitem, description, Plinian Core] EOL v2: using Darwin Core Archive flat files as transport mechanism
    15. 15. [Bar chart: number of text objects (axis 0 to 800,000) by subject of text object. Subjects: Distribution, Multiple topics, Habitat, Threats, Conservation, Trends, Associations, TrophicStrategy, PopulationBiology, Migration, LifeExpectancy, Behaviour, Diseases]
    16. 16. Content objects are associated with taxon names. Wikimedia Commons: Physeter macrocephalus (note we actually have over 3.3 million named pages)
    17. 17. Names from different providers are matched: Physeter macrocephalus
            Animal Diversity Web .... Physeter catodon Linnaeus, 1758
            ARKive .................. Physeter macrocephalus Linné
            BioPix .................. Physeter macrocephalus L.
            INBio ................... Physeter catodon
            IUCN .................... Physeter Macrocephalus
            ITIS .................... Physeter macrocephalus Linnaeus, 1758
            MarLIN .................. Physeter macrocephalus Linné
            NCBI .................... Physeter Catodon
            Species 2000 ............ Physeter macrocephalus Linnaeus, 1758
            Taxon Concept ........... Physeter australasianus Desmoulins, 1822
            Wikimedia Commons ....... Physeter macrocephalus
            WORMS ................... Physeter macrocephalus Linnaeus 1758
    18. 18. Taxon concept pages: multiple hierarchies on Names tab
    19. 19. Problem: one taxon may have several names
            Animal Diversity Web .... Physeter catodon Linnaeus, 1758
            ARKive .................. Physeter macrocephalus Linné
            BioPix .................. Physeter macrocephalus L.
            INBio ................... Physeter catodon
            IUCN .................... Physeter Macrocephalus
            ITIS .................... Physeter macrocephalus Linnaeus, 1758
            MarLIN .................. Physeter macrocephalus Linné
            NCBI .................... Physeter Catodon
            Species 2000 ............ Physeter macrocephalus Linnaeus, 1758
            Taxon Concept ........... Physeter australasianus Desmoulins, 1822
            Wikimedia Commons ....... Physeter macrocephalus
            WORMS ................... Physeter macrocephalus Linnaeus 1758
    20. 20. Problem: the same name may apply to more than one taxon
    21. 21. EOL curation • Trust or untrust taxon associations • Add new taxon association • Set preferred hierarchies • Set preferred common names • Leave comments. Coming: Taxonomic concept curation
    22. 22. EOL is not Wikipedia … though we have more than 212,000 Wikipedia articles and 115,000 Wikimedia images. Can’t currently edit within text objects
    23. 23. Whirlwind tour • What kind of information we have • How we assemble that information • How machines and people interact with EOL (API, third party apps, collections and communities) • Next steps
    24. 24. EOL enables machine interaction. [Diagram labels: Aggregate, Curate, Comment, Rate, Collect, eol.org, API, Third party apps]
    25. 25. Third party applications: eol.org/api
    26. 26. People interact with EOL content & each other: Collections, Communities
    27. 27. Studies currently underway with University of Maryland • Cross-cultural study on motivation to engage in citizen science (Dana Rotman) • Interaction among scientists and non-scientists on EOL’s social network (Jae-wook Ahn) • Website traffic analysis to aid conservation communication (Yurong He and Bill Fagan)
    28. 28. Whirlwind tour • What kind of information we have • How we assemble that information • How machines and people interact with EOL • Next steps
    29. 29. Using EOL collections to get computable data. Step 1: Search on EOL for organisms with characteristics of interest. Add each one to an EOL collection. Step 2: Write a program using EOL API methods to retrieve the external database identifiers for the species in that collection. Step 3: Add to your program code to retrieve data using external database APIs. Step 4: Analyze, rinse, repeat. From Arthur Chapman
    30. 30. Crowd-sourcing for computable data. Lovell and Libby Langstroth, Calphotos. Foodwebs.org
    31. 31. Efforts underway. Phylogenetic trees: collaboration with Open Tree of Life project for draft tree. Computable data challenge: http://eol.org/info/data_challenge. Rod Page’s Bionames project. Alexandria Archive Institute. Devries and Thessen using DBpedia Spotlight to extract associations among taxa and add to Linked Open Data cloud. Sloan 2 project: Marine computable data. TraitBank ABI proposal
    32. 32. Research wishes • Collecting nominations for research ideas where EOL can help: http://eol.org/info/wishes_for_research (due 15 September) • Will follow with Rubenstein Fellows call for proposals
    33. 33. Thanks to our funders: John D. and Catherine T. MacArthur Foundation, Alfred P. Sloan Foundation, Smithsonian Institution, Marine Biological Laboratory, Harvard University, David Rubenstein, and other funders and donors. All our content providers and global partners. Volunteer curators and individual contributors via Flickr, Wikimedia, and members of EOL
    34. 34. Summary of EOL page richness. Overall: 950,000 have content; 2% are rich; ~22% have only links to literature. Hot List: 30% of 75K are rich; average richness = ~30. Red Hot List: 56% of 3K are rich; average richness = 43
    35. 35. [Chart: Long tail in databases contributing to EOL. Y-axis: number of taxa for which content is contributed to EOL (0 to 600,000; second panel viewed on log scale, 1 to 1,000,000). X-axis: partners in order of # taxa contributed to EOL (1 to 131)]
    36. 36. Taxon page richness algorithm: a (Breadth) + b (Depth) + c (Diversity), with weights a = 60%, b = 30%, c = 10%. Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status. Depth: # words per text object, # words total. Diversity: sources (partners). Score: 0 to 100, threshold 40
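
To make the richness formula on the last slide concrete, here is a minimal sketch of a weighted score with capped inputs. Only the 60/30/10 weights, the 0 to 100 range, the threshold of 40, and the roughly 75% value of unvetted material come from the talk. The specific caps, the choice of only four inputs, and the way each component is normalized are assumptions made for illustration, so this should not be read as EOL's actual algorithm.

```python
from typing import Dict

# From the slide: weights for breadth, depth, and diversity.
WEIGHTS = {"breadth": 0.60, "depth": 0.30, "diversity": 0.10}
UNVETTED_FACTOR = 0.75  # from the notes: unvetted material counts about 75%
RICH_THRESHOLD = 40     # ASSUMPTION: "threshold 40" read as the cutoff for "rich"

# ASSUMPTION: these caps and this reduced set of inputs are invented here.
CAPS = {"images": 25, "topics": 10, "words": 2000, "sources": 10}


def capped_fraction(count: float, cap: float) -> float:
    """Scale a count to 0..1, giving no extra credit beyond the cap."""
    return min(count, cap) / cap


def richness(vetted: Dict[str, float], unvetted: Dict[str, float]) -> float:
    """Return a 0-100 score from vetted and unvetted counts per input."""
    effective = {
        key: vetted.get(key, 0) + UNVETTED_FACTOR * unvetted.get(key, 0)
        for key in CAPS
    }
    fraction = {key: capped_fraction(effective[key], CAPS[key]) for key in CAPS}
    breadth = (fraction["images"] + fraction["topics"]) / 2
    depth = fraction["words"]
    diversity = fraction["sources"]
    return 100 * (
        WEIGHTS["breadth"] * breadth
        + WEIGHTS["depth"] * depth
        + WEIGHTS["diversity"] * diversity
    )


def is_rich(score: float) -> bool:
    return score >= RICH_THRESHOLD


# Two pages that differ only in image count, both at or above the image cap.
few = richness({"images": 25, "topics": 4, "words": 800, "sources": 3}, {})
many = richness({"images": 200, "topics": 4, "words": 800, "sources": 3}, {})
print(round(few, 1), round(many, 1), is_rich(few))
```

The caps are what keep a page with 200 images from outscoring one with 25, and applying the unvetted factor before capping means a page built entirely from unvetted material can reach at most about three quarters of the maximum score.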
