The biodiversity informatics landscape: a systematics perspective


Presented by V. Smith at the Biodiversity Informatics Horizons conference, Sapienza – Università di Roma, Rome, Italy. 3-6 Sept. 2013.

Published in: Technology

  1. 1. The biodiversity informatics landscape: a systematics perspective Vince Smith Biodiversity Informatics Horizons Rome, 3-6 Sept 2013
  2. 2. Overview 1. Background – the biodiversity informatics domain • The problem (i.e. why are we here) • Representations of the domain (data, infrastructures, projects…) • Toward an integrated view (strategy) 2. Social challenges • Openness • Collaboration and communities • Standards, identifiers & protocols 3. (Big) data challenges • Mobilizing existing data (metadata, literature, collections) • New forms of data ([meta]genomics & observatories) 4. Synthetic challenges • Data aggregation & linking • Visualisation • Modeling 5. Next steps (data infrastructures & funding) • Lessons learned: new informatics opportunities in H2020
  3. 3. 1. Background
  4. 4. The problem – integrating biodiversity research How do we join up these activities? What infrastructures do we need? (technologies, tools, standards…) What processes do we need? (Modelling, workflows…) What data do we need? (Genes, localities…) How do we use this as a tool? Species conservation & protected areas Impacts of human development Biodiversity & human health Impacts of climate change Food, farming & biofuels Invasive alien species
  5. 5. Natural History – the foundation Darwin’s “tangled bank”… “It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.” C. Darwin, “On the Origin of Species”, 1859 Systematics, a foundational “law”
  6. 6. Ecological interactions
  7. 7. A granular understanding of biodiversity Genes • Individuals • Populations • Species • Interactions [Figure: levels of biodiversity data, from gene sequences (GenBank) through local populations and global biodiversity to biological networks]
  8. 8. An informatician’s view of biodiversity GenBank MorphBank Interactions Geospatial Census Genotype Phenotype Biotic Interactions Environment Human Effects IUCN Pop. data Niche & Pop. Ecology TreeBase Biodiversity Loss GBIF Phylogenetic Trees IPNI, Zoobank Taxonomy AquaMaps Geographic Distributions Extent of Occurrence Range Maps Conservation & management AquaMaps Forecasts of Change Data Products Systems Key problems • Landscape is complex, fragmented & hard to navigate • Many audiences (policy makers, scientists, amateurs, citizen scientists) • Many scales (global solutions to local problems) Figure adapted from Peterson et al. 2010
  9. 9. A project-centric view of biodiversity Scan / Mark-up PLAZI Inotaxa BHL eFloras CDM GNA (NameBank) Phylogenetic Tree of Life TreeBase CIPRES Descriptive / classification EoL Scratchpads CATE MorphoBank Wikipedia Molecular Databases NCBI/EMBL/DDBJ CBoL Barcode of Life Initiative Bibliographic IPNI Google Scholar Connotea ViTaL ISI Institutional EMu (=MOA) Recorder uBio TDWG Checklists Identification Key2Nature IdentifyLife Inter-Institutional Synthesis BCI BioCASE GeoCASE MaNIS PESI: ERMS Fauna Europaea Euro+Med PlantBase ORBIS WoRMS Flora Europaea Nomenclators Index Fungorum ZooBank IPNI (Kew/AUS/Harvard) ING AFD/APC/APUI NZOR CoL (Sp2000 & ITIS) ZooRecord LifeWatch GBIF Biodiversity ALA CONABIO CRIA (Brazil) IUCN SEEK OPAL DAISIE iNaturalist A snapshot from 2009, “the dance of the initiatives”
  10. 10. The strategic view: community informatics challenges GBIF GBIC Report (Coming soon) EU Biodiversity Strategy (2011) Biodiv. Inf. Challenges (2013) Grand Challenges for Biodiversity Informatics (integrating activities for H2020)
  11. 11. 2. Social challenges - Openness - Collaboration and communities - Standards, identifiers & links
  12. 12. Openness in biodiversity informatics “A piece of data or content is open if anyone is free to use, reuse, and redistribute it subject, at most, to the requirement to attribute and/or share-alike.” • Sharing data is a foundation for our activities • Normal practice in some communities (molecular) • Mandated by some funders & governments Many kinds of openness: • Open Access • Open Data • Open Science • Open Source E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels--2004-2011, June 2013, Science-Metrix Inc. “One-half of all papers are now freely available within a year or two of publication”
  13. 13. Openness in biodiversity informatics “A piece of data or content is open if anyone is free to use, reuse, and redistribute it subject, at most, to the requirement to attribute and/or share-alike.” • Sharing data is a foundation for our activities • Normal practice in some communities (molecular) • Mandated by some funders & governments Many kinds of openness: • Open Access • Open Data • Open Science • Open Source Incentivise through credit via citation (e.g. BDJ) Need to continue to incentivise openness
  14. 14. What are Scratchpads? Collaboration & communities Making taxonomy a team sport e.g., Scratchpad Virtual Research Communities (Taxa, Projects, Regions, Societies) 544 Scratchpad communities by 6,644 active registered users covering 91,631 taxa in 535,317 pages. In total more than 1,300,000 visitors. 81 paper citations in 2012. Our infrastructures need to facilitate collaboration
  15. 15. Standards, identifiers & protocols Facilitating data sharing across communities A foundation for integration Key requirements: • Need to be inclusive, practical & extensible • Readable by humans & machines • Widely used Good examples: • Darwin Core • CrossRef & DataCite DOIs • ORCID author identifiers Gaps / Problems • Reuse & persistence of identifiers • Vocabularies & ontologies (time consuming / little reward) Potential solutions • Build them into our credit systems • Show semantic reasoning potential (LOD & RDF demonstrators) Standards can’t be developed in isolation – they must be used
  16. 16. 3. (Big) data challenges - Mobilising existing data - New forms of data
  17. 17. Mobilising existing data Collections, literature & metadata How can we quickly, efficiently and cost-effectively mobilise biological data at scale? Collections • 1.5-3B specimens in collections worldwide • Fragmented efforts / heterogeneity of process • Needs ambition (NHM: 20M in 5 yrs.) & coord. Literature • >300M pages of biodiversity literature • BHL (41M pp.) an example of what can be done • Needs sustainability & article metadata NHM Digitisation BHL literature Metadata registries • Data about data (cheaper & scalable) • e.g. bibliographic data, dataset portals Informatics challenges • Storage & persistence • Automation & annotation • Incentives to digitise & fitness for use Bibliography of Life (RefFinder & RefBank)
  18. 18. Mobilising & managing new forms of data Metagenomics & ecological observatories These new data types do not depend on traditional taxonomy & systematics New molecular approaches • Molecular detection & monitoring of organisms is routine • Metagenomics (env. sequencing) commonplace • Becoming the primary route to understanding biodiversity 3-4 June 2013, NHM Ecological observatories • Automated biodiversity detection • Remote sensing (e.g. satellite & acoustic data, drones, camera traps) • Monitoring conspicuous, rare or invasive spp. (algal blooms, palms) • Monitoring human activity Informatics challenges • Very large quantities of data (2.5-10TB per researcher per yr.) • Doesn’t map well to existing data infrastructures • Challenges current networking & storage capacity • Digital and physical collections become equally important? 22 July, 2013
  19. 19. 4. Synthetic challenges - Data aggregation & linking - Visualisation - Modeling
  20. 20. Aggregation & linking Portals bringing together distributed & diverse forms of data Giving consistent and comprehensive access to all biological data Several approaches, with different advantages • Tightly coupled to a few data sources (e.g. eMonocot, CDM) • Loosely coupled to many sources (e.g. BioNames, Wikipedia) • Hybrid forms (e.g. Canadensys, EOL, GBIF) eMonocot: selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies) BioNames: scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles) Informatics challenges • Portals are hard to sustain • New methods of data discovery & access • Create new windows (views) on content • New data structures, new types of database
  21. 21. Visualisation Visually synthesizing large, linked biodiversity datasets Making biodiversity data accessible & understandable Research opportunities • Tools integration (e.g. GeoCat, CartoDB) • Span multiple audiences Outreach opportunities • Visually compelling storytelling • Crowdsourcing tools (e.g. Notes From Nature) Exploiting new technologies • Touch screens • Mobile • Location awareness Informatics challenges • Very specific to individual use cases • Sustainability issues NHM specimen records
  22. 22. Modeling the biosphere: a (the) 30 year goal? Reasoning across large, linked biodiversity datasets A clear, singular, long-term vision, which biodiversity data can contribute to Conceptually has many potential uses • Identifying trends • Explaining patterns • Making predictions • Real time alerts - when data contradicts current knowledge • The ultimate policy tool Major informatics challenges • Technically very difficult (many years off) • Needs effective prototypes & platforms • Some first steps e.g. OBOE, LEFT Nature 2013, doi:10.1038/493295a
  23. 23. 5. Next steps
  24. 24. Lessons learned: new opportunities in H2020 PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges) • Break out of the discipline, technical & project-centric activities (it is unsustainable, inefficient & bad for science) • Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities) • Bridge the disconnect between informaticians & users (make the users informaticians & informaticians users) • Our products are well suited to address these challenges • Use H2020 as a mechanism to achieve integration How do we join up these activities?
  25. 25. QUESTIONS
  26. 26. Possible biodiversity informatics design principles* = experience from 7 years with the Scratchpads = lessons for infrastructures in H2020? 1. Start with needs - focus on real user needs (not just the ‘official process’) 2. Do less - if someone else is doing it, link to it or use it 3. Design with data - prototype and test with real users on the live website 4. Do the hard work to make it simple - let the computer take the strain 5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable 6. Build for inclusion - it’s easier in the long run 7. Understand context - we are designing for people, not a screen or a brand 8. Build digital services, not websites - there is life beyond the website 9. Be consistent, not uniform - every circumstance is different 10. Make things open: it makes things better - it’s more sustainable *
  27. 27. Mobilising existing data: how to prioritise CONTENT FUN LEARNING OUTREACH Digitise a few things & invest in depth, description & promotion A LITTLE A LOT Digitise lots of things, put little effort into description & promotion AGGREGATION COLLECTIONS MANAGEMENT METADATA DATA MINING RESEARCH Nick Poole, UK Collections Trust
  28. 28. Collaboration & communities Making taxonomy a team sport Average dates when increasing numbers of taxonomists were involved in describing species CONE SNAILS BIRDS MAMMALS AMPHIBIANS SPIDERS PLANTS Joppa et al. 2011 • Very few recent single author papers • Most (fundable) science is cross-disciplinary • Need to incentivise data curation & annotation • Need mechanisms to share annotations Our infrastructures need to facilitate collaboration
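
Slide 15 asks for identifiers that are readable by humans and machines, and names ORCID author identifiers as a good example. As a minimal sketch of what machine readability can mean in practice, the Python below checks the final character of an ORCID iD, which is a check digit computed with ISO 7064 MOD 11-2. The function names are hypothetical and the sample iDs are illustrative; neither comes from the talk.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def orcid_is_valid(orcid: str) -> bool:
    """Check the length, digits and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


if __name__ == "__main__":
    for sample in ("0000-0002-1825-0097", "0000-0002-1825-0098"):
        print(sample, orcid_is_valid(sample))
```

A check like this only catches transcription errors. It says nothing about whether the iD is registered or persistent, which is the gap slide 15 lists under "Reuse & persistence of identifiers".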