Frontiers of discovery with
Encyclopedia of Life
TraitBank research and other case studies
Cyndy Parr
Smithsonian Institution National Museum of Natural History
parrc@si.edu @cydparr http://www.slideshare.net/csparr
• How is EOL different
• How EOL gets used
• Introducing TraitBank
• Loading up TraitBank
• EOL & TraitBank in research
• Future of EOL & TraitBank
Outline
Take home messages
• EOL can be useful for research
• TraitBank is already awesome
• Mutualism between collections,
EOL, citizen science
• Let us know how we can help
EOL
Crowds
Harvest
Third party applications
How EOL is different
started 2008
text, media, literature
all species, genera, etc.
names infrastructure
data curation
2.6 million images
1.3 million taxa with content
Over 5 million visitors/year
75,000 registered members
eol.org
How EOL gets used
http://www.notesfromnature.org/
http://www.onezoom.org/ http://yanwong.me/
Links and images…what about research?
Search groups for
“EOL papers”
at Mendeley.com
Anatolia Zooarchaeology Case Study led
by Alexandria Archive Institute
1. 14 different sites
2. 34+ zooarchaeologists
3. Decoding, cleanup, metadata documentation
4. 220,000+ specimens
5. 450 entities linked to 143 EOL taxon concepts
6. Anatomical entities linked to Uberon.org
7. Biometrics linked to measurement ontology
8. Collaborative analysis
http://opencontext.org/
Kansa, E., Kansa, S. W., & Arbuckle, B. (2014). Publishing and Pushing:
Mixing Models for Communicating Research Data in Archaeology.
International Journal for Digital Curation, 9.
Page, R. D. M. (2013). BioNames: linking taxonomy,
texts, and trees. PeerJ, 1, e190. doi:10.7717/peerj.190
BioNames.org
Rod Page
But can we do more?
Introducing TraitBank
GenBank
60 million DNA sequence records
900,000 species
4,000 genomes
How are these related to traits?
Quick math
In Phenoscape
57 publications had 565,158 anatomical trait
descriptions for 2,527 kinds of organisms
= 223 traits/organism
In ZFIN
38,189 trait descriptions for 4,727 genes for
Zebrafish
1.9 million species on the planet
= LOTS OF TRAITS
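The quick math above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the three input figures are the ones quoted on the slide, while the extrapolation step (assuming every species could be described as densely as the Phenoscape taxa) is an assumption added here for illustration, not a claim from the talk.

```python
# Figures quoted on the slide.
phenoscape_descriptions = 565_158   # anatomical trait descriptions
phenoscape_taxa = 2_527             # kinds of organisms covered
species_on_planet = 1_900_000       # described species

# Average number of trait descriptions per kind of organism.
traits_per_organism = phenoscape_descriptions / phenoscape_taxa

# Naive extrapolation (illustrative assumption): every species is
# described as densely as the taxa in Phenoscape.
projected_descriptions = traits_per_organism * species_on_planet

print(f"{traits_per_organism:.1f} trait descriptions per organism")
print(f"{projected_descriptions:,.0f} projected trait descriptions")
```

The ratio comes to a little over 223 per organism, and the naive projection lands in the hundreds of millions, which is the point of "LOTS OF TRAITS".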
Why Smithsonian + EOL
• Numeric data
(measurements)
• Categorical data
(controlled vocabulary)
• Species interactions
• Mostly summaries for
populations, species
• Individual specimens
• Higher taxa
http://eol.org/traitbank
released January 2014
TraitBank Quick facts
TraitBank Data tab
TraitBank Metadata
TraitBank Search & download
TraitBank Search & download
TraitBank Data glossary
http://eol.org/data_glossary
Download
Making TraitBank data available to
Google Knowledge Graph and
anyone
TraitBank data sources
Sources include:
Databases
(OBIS, AnAge, Paleodb, Phenoscape)
Literature
(Dryad, Pangaea, Ecological Archives)
Natural History Collections
(Label data)
Legacy/unpublished data
Loading up TraitBank
TraitBank
~7 million records
326 traits
1.2 million taxa
40+ datasets
http://eol.org/collections/97700
Text mining
Environments-EOL
Evangelos Pafilis, Hellenic Centre for Marine Research (HCMR), Institute of
Marine Biology, Biotechnology and Aquaculture (IMBBC), Crete, Greece
491,616 habitat terms for 136,548 taxa
Text mining
Automated annotation Manual annotation
Morphological Data from NMNH KE EMu
Abi Nishimura
Project: Clean-up morphological data from
NMNH catalog and publish to TraitBank
Goal: Make it easier to access and analyze
this valuable morphological data
Sakurai Midori,
http://eol.org/data_objects/26918624
Raw data from Spectral Tarsier Tarsius tarsier
database search
RESULTS
• Primate data published (320 taxa)
• Comprehensive mammals data to
be published soon (4662 taxa)
• Bird catalog currently being mined
Wan Hong, http://eol.org/data_objects/29203274
Mineralization of tissue in
marine organisms
Jen Hammock with Steve Cairns
For modeling impacts of ocean acidification
143,000 records for 119,000 species and subspecies of Micro- and Macroalgae,
Cnidaria, Polychaetes, Bryozoans, Brachiopods, Sponges, Mollusks,
Echinoderms and Arthropods
Mineralized tissue =
● Biogenic silica
● Calcium carbonate
○ Calcite
○ and/or Aragonite
Other work in progress at NMNH
• Sarah Miller: growth form, habitat, and
elevation data from botany collection
specimen labels, summarizing elevation
• Reid Rumelt: behavior and other data
from Cornell University Macaulay Library
sound files and captions
• Katja Schulz:
Paleobiology Database
• BHL-MoBot: IMLS
Mining biodiversity
© Donald E. Hurlbert/Smithsonian National
Museum of Natural History
2013-14 EOL Rubenstein Fellows
EOL & TraitBank research
1. EnvO habitat terms (Pafilis et al.)
2. Altitude Specificity of Flower Coloration (Wright & Seltmann)
3. Morphological impacts of extinction risk in fish (Chang)
4. Butterfly-host plant associations (Ferrer-Paris et al.)
5. Global Biotic Interactions (GloBI, Poelen & Mungall et al.)
6. Reol: An R interface for EOL (Banbury, O’Meara)
7. Taxon Tree Tool (Lin)
Chang crowdsourcing
Jonathan Chang, UCLA
http://jonathanchang.org/
Amazon Mechanical Turk
EOL-BHL
Research Sprint
1. Character displacement across the Tree of Life
2. Illuminating the Dark Parts of the Tree of Life
3. Evolution in the usage of anatomical concepts in the biodiversity
literature
4. Planning for global change: using species interactions in
conservation
5. No place like home: Defining “habitat” for biodiversity science
6. Assessing risk status of Mexican amphibians
7. Quantifying color from digital imagery: color may determine
species’ responses to habitat edges and to climate change
8. More is less - Identifying global trends in species’ niche width
9. Identifying key species traits associated with climate change
vulnerability
NESCent-EOL-BHL Research Sprint
Quantifying color from digital imagery
1. Automate processing of almost 300k images (of EOL’s 2.4 million)
2. Identify pinned specimen images
3. Process these for color and pattern information
4. Put this info into TraitBank
Elise Larsen, Yan Wong
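Step 3 (processing images for color information) might look something like the sketch below. It is not the sprint team's pipeline, which the slide does not describe. It assumes an image has already been decoded into a list of (R, G, B) tuples, and the function names `mean_color` and `hsv_summary` are hypothetical. Only the Python standard library is used.

```python
import colorsys

def mean_color(pixels):
    """Average the R, G and B channels over a list of (r, g, b) tuples (0-255)."""
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return r, g, b

def hsv_summary(pixels):
    """Summarize the mean color as hue (degrees), saturation and value."""
    r, g, b = mean_color(pixels)
    # colorsys expects channel values scaled to the range 0..1.
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return {"hue_degrees": h * 360, "saturation": s, "value": v}

# Hypothetical 2x2 image of pure red pixels.
pixels = [(255, 0, 0)] * 4
summary = hsv_summary(pixels)
print(summary)
```

A summary like this is the kind of numeric record that could then be loaded into TraitBank (step 4), with the image as its provenance.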
Illuminating the Dark Parts of the Tree of Life
Jessica Oswald, Karen Cranston, Gordon Burleigh, Cyndy Parr
1. Query EOL, GBIF,
GenBank for # records
2. Create score for amount
of information available
3. Map score to phylogeny
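A minimal sketch of steps 2 and 3, under stated assumptions: the slide does not say how the score is built, so the log-scaled sum below is an illustrative choice, and the taxon names and record counts are made up. Step 1 (querying the repositories) is represented here by a hard-coded dictionary.

```python
import math

def information_score(record_counts):
    """Sum log10(1 + n) over repositories so one huge source does not dominate."""
    return sum(math.log10(1 + n) for n in record_counts.values())

# Step 1 stand-in: hypothetical record counts per taxon and repository.
counts_by_taxon = {
    "Taxon A": {"EOL": 120, "GBIF": 45_000, "GenBank": 900},
    "Taxon B": {"EOL": 3, "GBIF": 12, "GenBank": 0},
    "Taxon C": {"EOL": 0, "GBIF": 0, "GenBank": 0},
}

# Step 2: one score per taxon.
scores = {taxon: information_score(c) for taxon, c in counts_by_taxon.items()}

# Step 3 stand-in: the scores are keyed by tip label, so they can be
# attached to the tips of a tree; here we simply list the darkest first.
darkest_first = sorted(scores, key=scores.get)
print(darkest_first)
```

A taxon with no records anywhere scores exactly zero, which makes the "dark" tips easy to pick out once the scores are painted onto a phylogeny.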
Global Genome Initiative Data Portal
For every family:
• Use TraitBank to assemble counts of records in repositories
• Compute a score (percentile) to assess knowledge available
relative to other families
• Make it easy to browse to find families that require effort
Beta launch end of June
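The percentile score could be computed along these lines. This is a sketch, not the portal's implementation: the family names and counts are hypothetical, and the percentile is defined here as the share of families with strictly fewer records, which is one of several reasonable conventions.

```python
from bisect import bisect_left

def percentile_scores(counts):
    """Percent of families with strictly fewer records than each family."""
    ordered = sorted(counts.values())
    n = len(ordered)
    return {fam: 100.0 * bisect_left(ordered, c) / n for fam, c in counts.items()}

# Hypothetical record counts per family, as might be assembled from TraitBank.
family_record_counts = {
    "Family A": 0,
    "Family B": 15,
    "Family C": 15,
    "Family D": 2400,
}

scores = percentile_scores(family_record_counts)

# Families with the lowest percentile are the ones that most require effort.
needs_effort = sorted(scores, key=scores.get)[:2]
print(scores)
print(needs_effort)
```

Ties get the same percentile under this definition, so two equally sparse families are ranked as equally in need of attention.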
• NSF Genealogy of Life
• NSF Big Data
• TMON themed portals & traits
• Bocas del Toro revisionary taxonomy workshops
• NSF ABI Isotopes and Interactions
• Microsoft/WCMC Global Ecosystem Models
• And more mutualisms…
EOL & TraitBank future plans
Leveraging social networks
Ahn, J., et al. (2012). Visually Exploring Social Participation in Encyclopedia of
Life. In 2012 International Conference on Social Informatics (pp. 149–156). IEEE.
Rotman, D., et al. (2014). Motivations affecting initial and long-term participation in
citizen science projects in three countries. In iConference 2014 Proceedings (pp.
110-124).
http://biotracker.umd.edu
• motivation model for citizen scientists
• international attitudes of scientists and
citizens to working together
• factors that increase curation network
activity
• currently working on motivations of EOL
content partners
Annotation of a specimen record
Ovary size and reproductive state
Age markers
Fat status
Body mass and other size
attributes
Annotation of an observation record
For more
information
• See & cite Parr, et al. 2014 Biodiv. Data Journal
• See our TraitBank paper (in review)
http://www.semantic-web-journal.net/content/traitbank-practical-semantics-organism-attribute-data
• Talk to your favorite EOL person
• Become an EOL Curator
• See our NMNH collection of collections
http://eol.org/collections/743
Take home messages
• EOL can be useful for research
• TraitBank is already awesome
• Mutualism between collections,
EOL, citizen science
• Let us know how we can help
Atlas of Living Australia • Biodiversity Heritage Library Consortium • Chinese
Academy of Sciences • La Comisión Nacional para el Conocimiento y Uso de la
Biodiversidad (CONABIO) • The Field Museum • Harvard University • El Instituto
Nacional de Biodiversidad (INBio) • Marine Biological Laboratory • Missouri
Botanical Garden • Muséum National d’histoire Naturelle • Naturalis Netherlands
• New Library of Alexandria • Smithsonian Institution • South African National
Biodiversity Institute • All of our content providers and curators
Steve Cairns • John Keltner • Katie Barker • Jonathan Coddington • Sean Brady •
Tom Orrell • Chris Meyers • Patricia Gentilis • Sylvia Orli • Kate Lyons • Yan Wong •
Jon Norenburg • Torsten Dikow • Yurong He • Jenny Preece and others on
BioTracker team • Pensoft Publishing • EOL Science Advisory Board
Katja Schulz, Jen Hammock, Marie Studer, Jeff Holmes, Nathan Wilson, Patrick
Leary, Jeremy Rice, Lisa Walley, Bob Corrigan, Erick Mata, Dmitry Mozzherin, Abi
Nishimura • Sarah Miller • Anthony Goddard, Mark Westneat and former BioSynC
staff
http://eol.org @eol parrc@si.edu
Major Funding for TraitBank provided by the Alfred P. Sloan
Foundation. Fellows program supported by Daniel M.
Rubenstein, Research sprint by Richard Lounsbery
Foundation.


Editor's Notes

  • #4 It is early days yet
  • #5 We have a working infrastructure as well as more than 200 partners. We harvest and sort text and multimedia by topic and by species and put it on our pages. Curation and user-added content from the crowds is added to the mix. This is fed back to providers, giving them traffic, quality control on their own content, and new content for them to use. And we are already seeing spinoff products. We make it easy for developers, and everything is either public domain or CC-licensed so it can be re-used.
  • #6 1.3 million species pages with content. 250+ content providers. AND multi-lingual – latest global partner is the French National Museum of Natural History.
  • #7 General reference by the public, people listen to our podcasts, cited in Wikipedia, links from OneZoom from James Rosindell, Field Guides, Notes from Nature Games
  • #8 James Rosindell, Luke Harmon, Yan Wong and others. OneZoom photomosaic from all the descendants of a particular mammal ancestor, in the shape of Shrewdinger, a reconstruction of that ancestral mammal by
  • #10 Some papers are actually using EOL as a source of the information, whether they are properly citing the original sources or just mentioning that they used EOL.
  • #14 We are in the midst of a genomics revolution. The cost to generate a full genome sequence is dropping more or less daily. What is all this genetic information DOING? How does it relate to what we can see and measure about organisms, their phenotypes, or their traits? How does DNA interact with the environment to result in both normal and abnormal development? How did it evolve? How fast do DNA changes make a difference in the lives of organisms?
  • #15 Phenoscape is a database that is looking at anatomical traits in fishes. Looking just at 57 publications they have more than 500K descriptions for 2500 kinds of organisms. ZFIN is a model organism database for zebrafish, a common model organism for developmental biologists. In just this one species they have captured nearly 40,000 traits – just for ONE very well-studied SPECIES
  • #16 Strong libraries. Active researchers publishing in modern journals. Efforts like the Global Genome Initiative. But mainly because our scientists and collections represent more than a hundred and fifty years of deep experience describing the biological diversity all over the planet, AND we have a deep commitment to both the increase AND the diffusion of that knowledge.
  • #17 Recently, we expanded the scope of EOL to include the management and display of computable data about organisms. In January we released the first version of our TraitBank platform.
  • #18 TraitBank data are managed in a Virtuoso triple store, and a sample of the traits are shown on an overview tab.
  • #19 TraitBank data are managed in a Virtuoso triple store, and the trait information for each taxon is displayed in a Data tab on EOL taxon pages.
  • #20 Each record is annotated with rich metadata, including provenance, citation, information about methods etc.
  • #31 The “Spectral Tarsier” EOL page now contains organized, accessible NMNH measurement data (green arrows). You can see that it would now be easier to study a measurement like tail length. Note that many morphological categories ONLY have data from NMNH; it’s clearly important to make the database measurements accessible.
  • #32 Data come from Steve Cairns and the literature so far. Here’s a scenario. Most at species and subspecies level. Controlled vocabulary data, annotated w/verbal modifiers (eg: “High Mg C”, “inferred from Superfamily”). CHEME/ Saturation horizons with respect to all mineral phases are migrating toward the surface, potentially risking the survival of calcifiers in the neritic, shelf, and slope environment... Orr et al. (2005) predicted that by 2100 the Southern and the Arctic Oceans could be undersaturated with respect to aragonite, and then calcite would follow in ~50–100 years. This has also major implications for calcifying taxa at those latitudes. (http://www.esajournals.org/doi/full/10.1890/09-0553.1)
  • #33 PBDB: good, new taxonomy, structured data like time occurrence with error bars (including numeric), body size measurements
  • #34 How about some more focused work with EOL and traits? Next I’m going to give a few case studies where biologists really want to know the answers to some biological questions and are using TraitBank’s data, aggregation and integration and, to be quite honest, people power to answer the questions. Some of you may recall the Rubenstein program. In the last year of our Rubenstein program, we have funded projects that aim to lay the groundwork for biological research using EOL.
  • #36 Use BHL or EOL and other sources to tackle biological questions. Matched each awardee with an informatics expert. 4-7 February 2014, Durham, NC. Organized by Cynthia Parr and Craig McClain. Funded in part by the Richard Lounsbery Foundation.
  • #37 Some of these are conservation-oriented research, e.g. 1, 6, and 9. Other topics are more basic evolutionary biology or ecology research or
  • #41 We have focused so far on being the species-based repository for aggregating and integrating the information, not providing analysis tools but providing general access to it, which then can be served and repurposed for various other projects.
  • #42 EOL has also been a platform for social science research.
  • #43 IDigBio is relevant here.
  • #44 Also, BioCubes
  • #46 It is early days yet
  • #47 Major funding for the development of TraitBank was provided by the Alfred P. Sloan Foundation, with additional support from our global partner institutions. These are in addition to people that I called out earlier in the slides, and I’ve probably forgotten many.