The biodiversity informatics landscape: a systematics perspective

Vince Smith

Biodiversity Informatics Horizons
Rome, 3-6 Sept 2013

Overview
1. Background – the biodiversity informatics domain
• The problem (i.e. why are we here)
• Representations of the domain (data, infrastructures, projects…)
• Toward an integrated view (strategy)

2. Social challenges
• Openness
• Collaboration and communities
• Standards, identifiers & protocols

3. (Big) data challenges
• Mobilizing existing data (metadata, literature, collections)
• New forms of data ([meta]genomics & observatories)

4. Synthetic challenges
• Data Aggregation & linking
• Visualisation
• Modeling

5. Next steps (data infrastructures & funding)
• Lessons learned: new informatics opportunities in H2020

The problem – integrating biodiversity research
How do we join up these activities?
What infrastructures do we need?
(technologies, tools, standards…)
What processes do we need?
(Modelling, workflows…)
What data do we need?
(Genes, localities…)

How do we use this as a tool?
Species conservation & protected areas
Impacts of human development
Biodiversity & human health
Impacts of climate change
Food, farming & biofuels
Invasive alien species

Natural History – the foundation
Darwin’s “tangled bank”…

“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.”
C. Darwin, “On the Origin of Species”, 1859

Systematics, a foundational “law”

A granular understanding of biodiversity

[Figure: levels at which biodiversity is described, each with an example. Genes: sequences (GenBank). Individuals. Populations: local populations. Species: global biodiversity. Interactions: biological networks.]

An informatician’s view of biodiversity

[Figure. Legend: Data, Products, Systems. Labels: GenBank; MorphBank; Interactions; Geospatial; Census; Genotype; Phenotype; Biotic Interactions; Environment; Human Effects; IUCN; Pop. data; Niche & Pop. Ecology; TreeBase; Biodiversity Loss; GBIF; Phylogenetic Trees; IPNI, ZooBank; Taxonomy; AquaMaps; Geographic Distributions; Extent of Occurrence; Range Maps; Conservation & management; AquaMaps; Forecasts of Change]

Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)

Figure adapted from Peterson et al. 2010

A project-centric view of biodiversity
Scan / Mark-up: PLAZI, Inotaxa, BHL, eFloras
CDM
GNA (NameBank)
Phylogenetic: Tree of Life, TreeBase, CIPRES
Descriptive / classification: EoL, Scratchpads, CATE, MorphoBank, Wikipedia
Molecular Databases: NCBI/EMBL/DDBJ, CBoL, Barcode of Life Initiative
Bibliographic: IPNI, Google Scholar, Connotea, ViTaL, ISI
Institutional: EMu (=MOA), Recorder
uBio
TDWG
Checklists
Identification: Key2Nature, IdentifyLife
Inter-Institutional Synthesis: BCI, BioCASE, GeoCASE, MaNIS
PESI: ERMS, Fauna Europaea, Euro+Med Plantbase, ORBIS, WoRMS, Flora Europaea
Nomenclators: Index Fungorum, ZooBank, IPNI (Kew/AUS/Harvard), ING, AFD/APC/APUI, NZOR, CoL (Sp2000 & ITIS), ZooRecord
LifeWatch
GBIF
Biodiversity: ALA, CONABIO, CRIA (Brazil), IUCN, SEEK, OPAL, DAISIE, iNaturalist

A snapshot from 2009, “the dance of the initiatives”

The strategic view: community informatics challenges

GBIF GBIC Report
(Coming soon)

EU Biodiversity Strategy
(2011)

Biodiv. Inf. Challenges
(2013)

Grand Challenges for Biodiversity Informatics
(integrating activities for H2020)

2. Social challenges
- Openness
- Collaboration and communities
- Standards, identifiers & links

Openness in biodiversity informatics
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it subject, at most, to the requirement to attribute and/or share-alike.” http://opendefinition.org/

• Sharing data is a foundation for our activities
• Normal practice in some communities (molecular)
• Mandated by some funders & governments
Many kinds of openness:
• Open Access
• Open Data
• Open Science
• Open Source

E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels--2004-2011, June 2013, Science-Metrix Inc.

“One-half of all papers are now freely available within a year or two of publication”

Incentivise through credit via citation (e.g. BDJ)

Need to continue to incentivise openness

What are Scratchpads? (http://scratchpads.eu)
Collaboration & communities
Making taxonomy a team sport
e.g., Scratchpad Virtual Research Communities

[Figure labels: Taxa, Projects, Regions, Societies]
544 Scratchpad Communities by 6,644 active registered users covering 91,631 taxa in 535,317 pages.
In total more than 1,300,000 visitors
81 paper citations in 2012
Our infrastructures need to facilitate collaboration

Standards, identifiers & protocols
Facilitating data sharing across communities
A foundation for integration
Key requirements:
• Need to be inclusive, practical & extensible
• Readable by humans & machines
• Widely used
Good examples:
• Darwin Core
• CrossRef & DataCite DOIs
• ORCID author identifiers
Gaps / Problems
• Reuse & persistence of identifiers
• Vocabularies & ontologies (time consuming / little reward)
Potential solutions
• Build them into our credit systems
• Show semantic reasoning potential (LOD & RDF demonstrators)
Standards can’t be developed in isolation – they must be used
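
One concrete case of an identifier that is readable by both humans and machines is the ORCID iD: its final character is a check character computed with the ISO 7064 MOD 11-2 algorithm, so software can catch most typing errors before a lookup is attempted. The following is a minimal sketch of that check. The function names are illustrative and not part of any library, and the sample iD in the usage note is a widely used example value, assumed here for demonstration only.

```python
import re

# Layout of an ORCID iD: four hyphen-separated groups of four characters,
# where the last character may be a digit or "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_character(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check both the layout and the trailing check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

For example, `is_valid_orcid("0000-0002-1825-0097")` returns `True`, while changing the last character to `8` makes the check fail.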

3. (Big) data challenges
- Mobilising existing data
- New forms of data

Mobilising existing data
Collections, literature & metadata
How can we quickly, efficiently and cost-effectively mobilise biological data at scale?
Collections
• 1.5-3B specimens in collections worldwide
• Fragmented efforts / heterogeneity of process
• Needs ambition (NHM: 20M in 5 yrs.) & coordination
Literature
• >300M pages of biodiversity literature
• BHL (41M pp.) an example of what can be done
• Needs sustainability & article metadata

NHM
Digitisation

BHL
literature

Metadata registries
• Data about data (cheaper & scalable)
• e.g. bibliographic data, dataset portals
Informatics challenges
• Storage & persistence
• Automation & annotation
• Incentives to digitise & fitness for use

Bibliography of Life
(RefFinder & RefBank)
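
To give a sense of scale, the figures on this slide can be combined in a few lines of arithmetic. The sketch below assumes, purely for illustration, that the NHM ambition of 20M specimens in 5 years is a constant yearly rate, and asks how long a single institution working at that rate would need for the worldwide estimate. The function name and the constant-rate assumption are not from the talk.

```python
def years_at_rate(total_items: float, items_per_year: float) -> float:
    """Years needed to process total_items at a constant yearly rate."""
    return total_items / items_per_year


# Figures quoted on the slide.
NHM_RATE = 20_000_000 / 5          # assumed constant: 4M specimens per year
SPECIMENS_LOW = 1_500_000_000      # lower worldwide estimate
SPECIMENS_HIGH = 3_000_000_000     # upper worldwide estimate

low_years = years_at_rate(SPECIMENS_LOW, NHM_RATE)
high_years = years_at_rate(SPECIMENS_HIGH, NHM_RATE)

# BHL pages as a share of the literature. The slide gives ">300M" pages,
# so this share is an upper bound.
bhl_share = 41_000_000 / 300_000_000

print(f"{low_years:.0f}-{high_years:.0f} institution-years at the NHM target rate")
print(f"BHL covers at most {bhl_share:.1%} of the literature")
```

Under these assumptions a single institution would need 375 to 750 years, which illustrates why coordination across institutions matters as much as ambition.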

Mobilising & managing new forms of data
Metagenomics & ecological observatories
These new data types do not depend on
traditional taxonomy & systematics
New Molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (env. sequencing) commonplace
• Becoming the primary route to understanding biodiversity

3-4 June 2013, NHM

Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map well to existing data infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?
22 July, 2013

4. Synthetic challenges
- Data aggregation & linking
- Visualisation
- Modeling

Aggregation & linking
Portals bringing together distributed & diverse forms of data
Giving consistent and comprehensive access to all biological data

Several approaches, with different advantages
• Tightly coupled to a few data sources (e.g. eMonocot, CDM)
• Loosely coupled to many sources (e.g. BioNames, Wikipedia)
• Hybrid forms (e.g. Canadensys, EOL, GBIF)

• Portals are hard to sustain
• New methods of data discovery & access
• Create new windows (views) on content
• New data structures, new types of database

eMonocot: Selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
BioNames: Scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles)
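
The loosely coupled approach can be illustrated with a small sketch that links records from several sources by a normalised taxon name. Everything here is hypothetical: the source names, the record fields and the normalisation rule are assumptions made for illustration, not the method used by eMonocot, BioNames or any other portal named on this slide.

```python
from collections import defaultdict


def normalise_name(name: str) -> str:
    """Crude join key: lower-case and collapse whitespace (illustrative only)."""
    return " ".join(name.lower().split())


def link_by_name(sources: dict[str, list[dict]]) -> dict[str, dict[str, list[dict]]]:
    """Group records from every source under a shared normalised-name key."""
    linked: dict[str, dict[str, list[dict]]] = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            key = normalise_name(record["scientificName"])
            linked[key][source_name].append(record)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {key: dict(by_source) for key, by_source in linked.items()}


# Hypothetical records from three imaginary sources.
sources = {
    "names_index": [{"scientificName": "Poa annua", "nameId": "n1"}],
    "literature": [{"scientificName": "poa  annua", "articleId": "a7"}],
    "images": [{"scientificName": "Poa pratensis", "imageId": "i3"}],
}

linked = link_by_name(sources)
for name, by_source in linked.items():
    print(name, "->", sorted(by_source))
```

Matching on a name string alone is what lets this kind of linking scale to many sources, and also what makes it error-prone: homonyms, synonyms and spelling variants are not handled here. A tightly coupled portal would instead curate each link, trading scale for accuracy.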

Visualisation
Visually synthesizing large, linked biodiversity datasets
Making biodiversity data accessible &
understandable
Research opportunities
• Tools integration (e.g. GeoCat, CartoDB)
• Span multiple audiences
Outreach opportunities
• Visually compelling storytelling
• Crowdsourcing tools (e.g. Notes From Nature)
Exploiting new technologies
• Touch screens
• Mobile
• Location awareness
• Very specific to individual use cases
• Sustainability issues

NHM specimen records
http://data.nhm.ac.uk/globe/

Modeling the biosphere: a (the) 30 year goal?
Reasoning across large, linked biodiversity datasets
A clear, singular, long-term vision, to which biodiversity data can contribute
Conceptually has many potential uses
• Identifying trends
• Explaining patterns
• Making predictions
• Real time alerts
- when data contradicts current knowledge

• The ultimate policy tool
Major informatics challenges
• Technically very difficult (many years off)
• Needs effective prototypes & platforms
• Some first steps e.g. OBOE, LEFT

Nature 2013, doi:10.1038/493295a
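
The idea of an alert "when data contradicts current knowledge" can be sketched in its simplest form as a check of each new occurrence against a known range. The sketch below is a toy under stated assumptions: the bounding-box representation of a range, the record fields (named after Darwin Core terms) and the example taxon and coordinates are all hypothetical, and a real biosphere model would be far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownRange:
    """Hypothetical 'current knowledge': a simple latitude/longitude box."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        # Note: does not handle ranges that cross the antimeridian.
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def range_alerts(occurrences: list[dict], known_ranges: dict[str, KnownRange]) -> list[str]:
    """Return one message per occurrence that falls outside its known range."""
    alerts = []
    for occ in occurrences:
        known = known_ranges.get(occ["scientificName"])
        if known is None:
            continue  # no current knowledge to contradict
        lat, lon = occ["decimalLatitude"], occ["decimalLongitude"]
        if not known.contains(lat, lon):
            alerts.append(
                f"{occ['scientificName']} recorded outside known range at ({lat}, {lon})"
            )
    return alerts


# Invented taxon and coordinates, for illustration only.
known_ranges = {"Examplea fictiva": KnownRange(35.0, 45.0, -10.0, 5.0)}
occurrences = [
    {"scientificName": "Examplea fictiva", "decimalLatitude": 40.0, "decimalLongitude": 0.0},
    {"scientificName": "Examplea fictiva", "decimalLatitude": 60.0, "decimalLongitude": 0.0},
    {"scientificName": "Unknownus speciesus", "decimalLatitude": 0.0, "decimalLongitude": 0.0},
]

alerts = range_alerts(occurrences, known_ranges)
for message in alerts:
    print(message)
```

Only the second record triggers an alert; the third is skipped because there is no known range to compare it with.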

Lessons learned: new opportunities in H2020
PATHWAYS TO INTEGRATION
(by addressing these social, data & synthetic challenges)

• Break out of discipline-, technical- & project-centric activities (it is unsustainable, inefficient & bad for science)

• Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities)

• Bridge the disconnect between informaticians & users (make the users informaticians & the informaticians users)

• Our products are well suited to address these challenges
• Use H2020 as a mechanism to achieve integration

How do we join up these activities?

Possible biodiversity informatics design principles*
= experience from 7 years with the Scratchpads
= lessons for infrastructures in H2020?
1. Start with needs - focus on real user needs (not just the ‘official process’)
2. Do less - if someone else is doing it, link to it or use it
3. Design with data - prototype and test with real users on the live website
4. Do the hard work to make it simple - let the computer take the strain
5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
6. Build for inclusion – it’s easier in the long run
7. Understand context - we are designing for people, not a screen or a brand
8. Build digital services, not websites - there is life beyond the website
9. Be consistent, not uniform - every circumstance is different
10. Make things open: it makes things better - it’s more sustainable
*https://www.gov.uk/designprinciples

Mobilising existing data: how to prioritise
[Figure: a spectrum of digitised CONTENT, from A LITTLE to A LOT]
A LITTLE: Digitise a few things & invest in depth, description & promotion (FUN, LEARNING, OUTREACH)
A LOT: Digitise lots of things, put little effort into description & promotion (AGGREGATION, COLLECTIONS MANAGEMENT, METADATA, DATA MINING, RESEARCH)

Nick Poole, UK Collections Trust

Collaboration & communities
Making taxonomy a team sport
[Figure: Average dates when increasing numbers of taxonomists were involved in describing species. Panels: cone snails, birds, mammals, amphibians, spiders, plants. Joppa et al, 2011]

• Very few recent single author papers
• Most (fundable) science is cross-disciplinary
• Need to incentivise data curation & annotation
• Need mechanisms to share annotations
Our infrastructures need to facilitate collaboration
