NHM Data Portal: first steps toward the Graph-of-Life

Vince Smith, Ben Scott & Ed Baker
Informatics & Digital Collections Group, NHM London
SPNHC, Berlin, 23 June 2016

NHM Collection
Collection area      No of objects   No of type specimens   Physical register   Digital data
Palaeontology        6,919,207       43,146                 2,364,232           340,636
Mineralogy           423,563         615                    425,000             402,727
Botany               5,863,000       172,750                127,200             645,222
Entomology           33,753,257      612,796                57,197              255,000
Zoology              27,501,350      325,000                1,986,000           1,160,216
Library & archives   5,460,000       -                      -                   -
TOTAL                79,920,377      1,154,307              4,959,629           2,803,801
<3% of NHM specimens are digitised, &
even fewer are ‘computable’

Digital Science at the NHM
• Citizen science
• Big, open, linked data
• High-throughput digitisation
• Data portal and tools
• Text mining
• Robotics

NHM Digital Collections Access, pre-2015
• Developed with the best of intentions, but…
• 23 separate interfaces
• Hard to find, cite, access and integrate
• No maps, few images, slow, no statistics, no export,
few updates, no authors, no citation mechanisms,
no GBIF connection

NHM Data Portal
• Discovery of NHM collections & research data
• Easy access & reuse to promote collaboration
(website, API, R-package, RDF & direct download)
• 3.7m records, >1m images (+sound, video & 3D)
• Integrates with our collection management
system (weekly) & DAM system (for images)
• Traffic light data quality indicators
• Stable, citable (DataCite) identifiers on datasets &
GUIDs on records to measure impact
• Technically sustainable & scalable
• Default open licensing (CC-Zero, CC-BY, CC-BY-NC)
http://data.nhm.ac.uk

CKAN – the technical foundation for the portal
• Enterprise, open source data portal platform
• Developed by Open Knowledge Foundation
• Used by 31 national governments, 74
regional authorities, academia & large
commercial organisations
• Key features
o Publish & find datasets
o Store & manage large data
o Robust API
o Customise & extend
o Sustainable
http://ckan.org/ (e.g. http://data.gov.uk/)
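To make the "robust API" point concrete: CKAN exposes an Action API under /api/3/action/. The sketch below builds a datastore_search request and unwraps CKAN's JSON response envelope, without making a network call. The base URL is the portal address given on the slides; the resource id and the canned response are placeholders, not real portal values.

```python
import json
from urllib.parse import urlencode

PORTAL = "http://data.nhm.ac.uk"  # portal address from the slides

def datastore_search_url(resource_id, query=None, limit=10, base=PORTAL):
    """Build a CKAN datastore_search URL (CKAN Action API, version 3)."""
    params = {"resource_id": resource_id, "limit": limit}
    if query:
        params["q"] = query  # full-text search term
    return f"{base}/api/3/action/datastore_search?{urlencode(params)}"

def unwrap(response_text):
    """Return the records from a CKAN response, or raise if success is false."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "CKAN request failed"))
    return payload["result"]["records"]

# Placeholder resource id and canned response, for illustration only.
url = datastore_search_url("RESOURCE-ID-PLACEHOLDER", query="Pediculus", limit=5)
sample = '{"success": true, "result": {"records": [{"_id": 1, "genus": "Pediculus"}]}}'
records = unwrap(sample)
```

In practice the URL would be fetched with any HTTP client and the response body passed to unwrap.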

Primary views of each NHM dataset
• Point map
• Grid map
• Heat map
• Statistical overview
• Filterable table

Dataset & data record citation
• DataCite DOIs on every dataset
• Stable URI (UUID) on every record
• Prior identifiers aliased &
disambiguated
• Citation encouraged with clear
statements at dataset & record level
• Allows us to track cited usage
• Dynamic DOIs on subsets coming soon
Dataset DOI Specimen URI
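One way to read "track cited usage" is to scan reference lists for dataset DOIs. The sketch below is a minimal illustration of that idea, not the portal's actual tracking mechanism. The DOI prefix 10.9999 and the first two reference strings are made up for the example.

```python
import re
from collections import Counter

# General DOI shape: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

def count_cited_dois(texts, prefix):
    """Count DOIs registered under `prefix` across a list of reference strings."""
    counts = Counter()
    for text in texts:
        for doi in DOI_PATTERN.findall(text):
            doi = doi.rstrip(".,;)")  # trim trailing sentence punctuation
            if doi.startswith(prefix + "/"):
                counts[doi.lower()] += 1  # DOIs are case-insensitive
    return counts

# Hypothetical prefix and reference strings.
references = [
    "Museum (2015). Dataset: Collection specimens. https://doi.org/10.9999/abc123.",
    "See doi:10.9999/abc123 and doi:10.9999/xyz789",
    "Unrelated article, doi: 10.3897/zookeys.481.8788",
]
cited = count_cited_dois(references, prefix="10.9999")
```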

Traffic-light data quality indicators (via GBIF API)
• Major errors
• Minor errors
• No errors
NB: similar services are offered by CRIA for Brazilian data
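A traffic-light indicator of this kind can be sketched as a mapping from GBIF issue flags to three levels. The flag names below are GBIF occurrence issue codes, but the split into "major" and "minor" is an assumption made for this example; the slides do not give the portal's real rules.

```python
# Assumed set of "major" GBIF issue flags; anything else counts as minor.
MAJOR = {
    "ZERO_COORDINATE",
    "COORDINATE_OUT_OF_RANGE",
    "COUNTRY_COORDINATE_MISMATCH",
    "TAXON_MATCH_NONE",
}

def traffic_light(issues):
    """Map a record's list of GBIF issue flags to 'red', 'amber' or 'green'."""
    if not issues:
        return "green"   # no errors
    if MAJOR.intersection(issues):
        return "red"     # at least one major error
    return "amber"       # minor errors only

lights = [
    traffic_light([]),
    traffic_light(["COORDINATE_ROUNDED"]),
    traffic_light(["COORDINATE_ROUNDED", "ZERO_COORDINATE"]),
]
```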

Potential errors highlighted & “corrected”

Assembly Video
doi: 10.3897/zookeys.481.8788
Step-by-step instructions
Supports deposition of other research datasets

Easy addition of new datasets (rapid & semi-automated)
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
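The eight steps above correspond roughly to the fields of a CKAN package_create request. The sketch below assembles such a payload. The top-level keys (name, title, tags, resources, extras) are standard CKAN dataset fields, but the extras key names for temporal and geographic coverage, and the example dataset itself, are assumptions for illustration.

```python
import json
import re

def slugify(title):
    """CKAN dataset names allow lowercase alphanumerics, '-' and '_'."""
    return re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")

def build_dataset(title, file_url, description, tags, temporal, spatial,
                  extra_resources=()):
    """Assemble a package_create-style payload, following the form steps."""
    resources = [{"url": file_url, "description": description}]  # steps 2-3
    resources.extend(extra_resources)                            # step 5
    return {
        "name": slugify(title),                                  # step 1
        "title": title,
        "tags": [{"name": tag} for tag in tags],                 # step 4
        "resources": resources,
        # Steps 6-7: free-form extras; these key names are assumptions.
        "extras": [
            {"key": "temporal_coverage", "value": temporal},
            {"key": "spatial_coverage", "value": spatial},
        ],
    }

# Hypothetical dataset, for illustration only.
payload = build_dataset(
    title="Example louse-host records",
    file_url="http://example.org/louse-host.csv",
    description="One row per louse-host association",
    tags=["interactions", "Phthiraptera"],
    temporal="1900/2015",
    spatial="Global",
)
body = json.dumps(payload)  # step 8 would POST this to /api/3/action/package_create
```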

Data access & feedback
• Extensive API
• R integration
• Link to data curator team
• DwCA downloads
• RDF (Linked Open Data)
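A Darwin Core Archive (DwCA) download is a zip file whose meta.xml descriptor points at a core data table. The sketch below reads that core table using only the standard library. It is a simplification: it assumes the core file has a header row and takes column names from that row rather than from the field elements in meta.xml. The in-memory archive is built purely for demonstration.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"  # namespace of meta.xml

def read_core_rows(archive_bytes):
    """Return the core table of a Darwin Core Archive as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find(f"{DWC_TEXT_NS}core")
        location = core.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location").text
        # meta.xml writes a tab delimiter as the two characters "\t".
        delimiter = core.get("fieldsTerminatedBy", ",").replace("\\t", "\t")
        text = zf.read(location).decode(core.get("encoding", "utf-8"))
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Build a tiny demonstration archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "meta.xml",
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">'
        '<core fieldsTerminatedBy="," ignoreHeaderLines="1" '
        'rowType="http://rs.tdwg.org/dwc/terms/Occurrence">'
        "<files><location>occurrence.csv</location></files>"
        "</core></archive>",
    )
    zf.writestr("occurrence.csv",
                "catalogNumber,scientificName\nX1,Pediculus humanus\n")

rows = read_core_rows(buf.getvalue())
```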

Serving external data aggregators
GBIF, iDigBio, EOL, VertNet, CRIA

Data visualisations driven by API
DEMO DEMO DEMO

500,000,000 records downloaded
(since Feb. 2015, excluding major aggregators)

Tim Berners-Lee, the inventor of the Web and
Linked Data initiator, suggested a 5-star
deployment scheme for Open Data…
What does a 5-star Data Portal mean?

LOD gives us the means to connect our data
(i.e. graph queries across distributed datasets)
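The idea of a graph query across distributed datasets can be illustrated with plain triples. In the sketch below, two toy datasets share a taxon identifier, and a query joins them through it. All identifiers and predicate names are hypothetical, and a real deployment would use RDF and SPARQL rather than Python lists.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Two hypothetical datasets that share the identifier "taxon:Pediculus_humanus".
specimens = [
    ("specimen:1", "identifiedAs", "taxon:Pediculus_humanus"),
    ("specimen:1", "heldBy", "inst:NHM"),
]
interactions = [
    ("taxon:Pediculus_humanus", "parasiteOf", "taxon:Homo_sapiens"),
]
graph = specimens + interactions  # linking is just merging on shared identifiers

def hosts_for_institution(graph, institution):
    """Join across datasets: institution -> specimen -> taxon -> host."""
    hosts = set()
    for specimen, _, _ in match(graph, p="heldBy", o=institution):
        for _, _, taxon in match(graph, s=specimen, p="identifiedAs"):
            for _, _, host in match(graph, s=taxon, p="parasiteOf"):
                hosts.add(host)
    return hosts

nhm_hosts = hosts_for_institution(graph, "inst:NHM")
```

Neither dataset alone can answer "which hosts are represented by parasites held at this institution"; the shared identifier is what makes the join possible.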

Example 1: “what data are we publishing?”
Top 200 collection-holding institutions contributing specimen records to GBIF
• What proportion of our collections are accessible / digitised?
• What biases exist in our digitised collections?
• How much taxonomic redundancy exists in our collections?
Useful for policy setting:
- Planning digitisation strategies
(why should we all be digitising the same taxa first?)
- Identifying institutional collection strengths
(outside our community these are often not known)
- What is ‘unique’ in our collections
(taxonomically, geospatially, temporally)
- Disaster planning
(how many institutions hold the same material?)
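Questions about redundancy and uniqueness reduce to grouping records by taxon and counting the institutions that hold each one. The sketch below shows this on toy (institution, taxon) pairs. The institution codes and taxon names are placeholders, not GBIF results.

```python
from collections import defaultdict

def holdings_by_taxon(records):
    """Map each taxon to the set of institutions holding it."""
    holders = defaultdict(set)
    for institution, taxon in records:
        holders[taxon].add(institution)
    return holders

def unique_and_redundant(records):
    """Split taxa into those held by one institution and those held by several."""
    holders = holdings_by_taxon(records)
    unique = {t: next(iter(i)) for t, i in holders.items() if len(i) == 1}
    redundant = {t: len(i) for t, i in holders.items() if len(i) > 1}
    return unique, redundant

# Placeholder records: (institution code, taxon name).
records = [
    ("A", "Taxon one"), ("B", "Taxon one"), ("C", "Taxon one"),
    ("A", "Taxon two"),
    ("B", "Taxon three"), ("B", "Taxon three"),  # duplicates within one institution
]
unique, redundant = unique_and_redundant(records)
```

Here "unique" answers the disaster-planning question (material held in only one place), while "redundant" shows where several institutions might be digitising the same taxa.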

What collections are held globally? Where are these specimens from?
There are huge gaps and biases in what our collections hold & where these collections are from
[Maps: top 200 collections (scaled by size); specimen country of origin (darker is more)]

Our results are very incomplete, constrained by what we’ve digitised
[Chart: size of collection and proportion digitised for RBGE, RBGK, NHM, MNHN, RMCA, RBINS]
• Very small proportions of our collections are digitally accessible
• We don’t publish the overall size of our collections in a machine-readable way

Example 2: exploring ecological interactions
• Specimen data is one dimension of our
collections
• We need to know how organisms interact
E.g. predator-prey, pollinator-pollinated, host-parasite
• Museums have lots of this data
NHM Interactions data:
• Louse-host (12,000+)
• Helminth host-parasite (250,000+)
• Also large datasets: Coleoptera feeding on dipterocarp seeds, butterfly host-plants, British mammal-flea associations, bee flower pollinators, several parasitic wasp datasets, …
Increasingly published as RDF via NHM Data Portal

Global Biotic Interactions (GloBI) Database
• By Jorrit Poelen & colleagues
• Collates interaction datasets
• Currently >1.9M interactions
• EOL pulls these into Species Pages
• NHM Portal creates a combined
dataset to feed GloBI
• Produces Linked Open Data
– Create beautiful visualisations
http://www.globalbioticinteractions.org/
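Creating a combined dataset from separate interaction datasets mostly means mapping each one's columns onto a shared schema. The sketch below shows one possible shape for that step. The column names, the interaction label and the example rows are assumptions for illustration, not the NHM's or GloBI's actual schema.

```python
def normalise(rows, source_col, target_col, interaction, dataset):
    """Convert dataset-specific rows to a shared interaction schema."""
    return [
        {
            "source_taxon": row[source_col].strip(),
            "interaction": interaction,
            "target_taxon": row[target_col].strip(),
            "dataset": dataset,
        }
        for row in rows
        if row.get(source_col) and row.get(target_col)  # skip incomplete rows
    ]

# Illustrative rows with different column names per dataset.
louse_rows = [
    {"louse": "Pediculus humanus ", "host": "Homo sapiens"},
    {"louse": "", "host": "Homo sapiens"},  # incomplete, dropped
]
helminth_rows = [
    {"parasite": "Ascaris lumbricoides", "host_species": "Homo sapiens"},
]

combined = (
    normalise(louse_rows, "louse", "host", "parasiteOf", "louse-host")
    + normalise(helminth_rows, "parasite", "host_species", "parasiteOf", "helminth-host")
)
```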

GloBI’s Interaction Browser
• Predatory interactions for Eurythenes gryllus
• Visualisations highlight number, frequency & type of interaction
https://blog.globalbioticinteractions.org/2014/03/21/exploring-antarctic-interactions-using-globis-interaction-browser/
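The number, frequency and type of interactions shown by such a browser can be derived from a list of (source, interaction, target) records. The sketch below is a minimal tally for one focal taxon. The partner names and counts are placeholders, not GloBI data.

```python
from collections import Counter

def summarise(interactions, focal):
    """Tally interaction types and partners where `focal` is the source taxon."""
    by_type = Counter()
    by_partner = Counter()
    for source, kind, target in interactions:
        if source == focal:
            by_type[kind] += 1
            by_partner[target] += 1
    return by_type, by_partner

# Placeholder records: (source taxon, interaction type, target taxon).
interactions = [
    ("Eurythenes gryllus", "preysOn", "Prey A"),
    ("Eurythenes gryllus", "preysOn", "Prey A"),
    ("Eurythenes gryllus", "preysOn", "Prey B"),
    ("Other taxon", "preysOn", "Eurythenes gryllus"),
]
by_type, by_partner = summarise(interactions, "Eurythenes gryllus")
```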

Create beautiful visualisations with custom R scripts and existing libraries (e.g., igraph, Reol, rgdal)
https://blog.globalbioticinteractions.org/2014/06/06/a-food-web-map-of-the-world/

Conclusions
• Data portals like the NHM Portal allow us to contribute our data to specialist aggregators and to view it through their lens
• GBIF & GloBI are specialist aggregators serving LOD
• LOD allows us to combine big datasets to address new questions
– Tracking interactions & distribution of disease vectors
– Predicting crop pests, via the distribution and interactions of pests of crop
wild relatives
Next Steps
• Continue Portal development & encourage institutional adoption
• Consolidate NHM ecological interaction datasets
• Publish combined dataset on the NHM Data Portal
• GloBI to harvest the dataset and publish linked open data
• Develop visualisations for key NHM datasets

Acknowledgements
Ben Scott – Portal Engineer & Architect
Ed Baker – Data Researcher
Laurence Livermore – Project Management
Matt Woodburn – Data Architect
Vince Smith – SRO / Coordinator
