4. A New Voyage of Discovery
Three Focal Areas
1. Scientific discovery
2. Scientific Infrastructure
3. Scientific engagement
Five Challenges
1. The Digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills
Resources & funding
Measuring success
Digital Ambition: NHM Science Strategy 2013-2017
Scientific impact 1,000 papers in leading journals
Digital access 20M specimens available digitally
Engagement 1M face-to-face engagements
Collections Globally important collections
Diagnostic tools Diagnostic tools for key groups
Deep time Timeline of key transitions
Science & society Articulate the role of science
UK network Act as a national museum
Earth sciences Earth Sciences Centre
Funding £10M for Five Challenge Areas
5. Overview
1. Existing digital content, sources & formats
• Research data
• Collections data
2. Making collections data digital
• Priorities
• Protocols & pathfinder activities
• Crowdsourcing transcription
3. Aggregation & delivery
• The NHM data portal
• Data visualisation, data sub-portals
4. Identifiers, links & interoperability
• DataCite DOIs
• Third party aggregators
• Portal APIs, download & analytical functions
5. Timeline & constraints
• Data policies
• Next steps
Digitisation activities
Data portal
6. NHM Research Outputs
• 49 papers, 45 available online
(4 print only or behind pay walls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
o >1000 sequences
o 826 figures
o 76 tables
o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No consistent mechanism to access data
• Effectively invisible at the institutional level
One Month of NHM Science group papers
Data via Carolyn Lowry e-mail, 13th Feb. 2013
1. Existing digital content
7. NHM Collections Outputs: data
• Huge investment in NHM collection management system
• ≠ Imaging
• Most research projects need spatio-temporal records
• Different requirements for different purposes
NHM COLLECTIONS April 2013
Collection area | Estimated no. of specimens | No. records in database | % collection in database | % records with location info
Botany | 6,000,000 | 626,000 | ~10% | 96%
Entomology | 32,000,000 | 316,000 | <1% | 68%
Mineralogy | 500,000 | 422,000 | ~95% | 79%
Palaeontology | 9,000,000 | 342,000 | ~3% | 89%
Zoology | 28,000,000 | 1,131,000 | ~60% (via lots) | 69%
TOTAL | 76,000,000 | 2,837,000 | 3% (23%) |
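The totals in this table, and the distance to the strategy's "20M specimens available digitally" target, can be checked with a few lines of arithmetic. This is a minimal sketch: the counts are copied from the table, and treating one database record as one digitised specimen is an assumption (some records describe lots). The per-department specimen estimates sum to 75.5 million, which the slide rounds to 76 million.

```python
# Estimated specimens and database records per collection area (April 2013 table).
SPECIMENS = {
    "Botany": 6_000_000,
    "Entomology": 32_000_000,
    "Mineralogy": 500_000,
    "Palaeontology": 9_000_000,
    "Zoology": 28_000_000,
}
RECORDS = {
    "Botany": 626_000,
    "Entomology": 316_000,
    "Mineralogy": 422_000,
    "Palaeontology": 342_000,
    "Zoology": 1_131_000,
}

# Strategy target: 20M specimens available digitally.
TARGET_DIGITAL_SPECIMENS = 20_000_000

total_specimens = sum(SPECIMENS.values())
total_records = sum(RECORDS.values())

# Assumption: one record counts as one digitised specimen.
records_still_needed = TARGET_DIGITAL_SPECIMENS - total_records

print(f"Estimated specimens: {total_specimens:,}")
print(f"Database records:    {total_records:,}")
print(f"Gap to 20M target:   {records_still_needed:,}")
```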
1. Existing digital content
8. • Many, many imaging projects (highly fragmented)
• Circa 40 TB for major collections (excluding library)
• 120,000 images in KE EMu (many others not in KE!)
• Circa 250,000 via NHM Photo unit (limited metadata)
Collection area | No. image files | Disk space (GB)
Botany | 140,133 | 35,302
Entomology | 529,106 | 3,172
Mineralogy | 14,000 | 6
Palaeontology | 122,548 | 993
Zoology | 12,975 | 1,598
TOTAL | 818,762 | 41,070
NHM Collections Outputs: images
1. Existing digital content
9. Current data formats
• Darwin Core Archive (DwCA) & extensions (collections)
• Circa 2020 fields mapped to 50 fields to generate archive
• Images mainly JPG & TIFF
• Metadata using EML & Genesis II standard
• Research data files in a wide array of formats (blob files)
o Nexus (character data and Newick formatted phylogenetic trees)
o PhyloXML (an XML standard for representing phylogenetic trees)
o NeXML (an XML standard for representing character data)
o Sequence trace files (.scf sequence chromatogram format files)
o Environmental sequence files
o Taxon checklists (as Darwin Core Archive files)
o Non-NHM specimen lists (as Darwin Core Archive files)
o Output from the Imaging and Analysis Centre (Micro CT datafile formats)
o Collections of images from digitisation projects (as a collection of links or a zipped archive)
o Collection level descriptions
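The mapping of many internal database fields onto a small set of Darwin Core terms, mentioned on this slide, could look roughly like the sketch below. The internal field names on the left and the sample record are hypothetical; the terms on the right are real Darwin Core terms. It writes only the tab-delimited core data file, not a complete archive (a real Darwin Core Archive also carries a meta.xml descriptor).

```python
import csv
import io

# Hypothetical internal field names mapped to real Darwin Core terms.
FIELD_MAP = {
    "RegistrationNumber": "catalogNumber",
    "DeterminedName": "scientificName",
    "CollectorName": "recordedBy",
    "CollectionDate": "eventDate",
    "SiteCountry": "country",
    "SiteLatitude": "decimalLatitude",
    "SiteLongitude": "decimalLongitude",
}


def to_darwin_core(record: dict) -> dict:
    """Keep only mapped fields, renamed to Darwin Core terms; missing values become ''."""
    return {dwc: record.get(src, "") for src, dwc in FIELD_MAP.items()}


def write_core_file(records: list) -> str:
    """Return the tab-delimited text of a core data file with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(FIELD_MAP.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(to_darwin_core(record))
    return buffer.getvalue()


# Illustrative record only; "InternalNote" is unmapped and is therefore dropped.
sample = [
    {
        "RegistrationNumber": "EX-0001",
        "DeterminedName": "Pieris rapae",
        "SiteCountry": "United Kingdom",
        "InternalNote": "not exported",
    }
]
core_text = write_core_file(sample)
print(core_text)
```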
1. Existing digital content
11. • Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections
• Low hanging fruit (2D objects e.g. herb. sheets & slides)
• Linked to strategic collaborations & financial opportunities
o e.g. RBG Kew, RBG Edinburgh, Nat. Mus. Wales, Hunterian etc.
• Priorities dictate order – we plan to do it all (eventually)!
2. Making collections data digital
Digitisation Priorities
13. • Exercise to develop digitisation protocols across collections
o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
• Protocols mapped to high level collections descriptions
• Workflow software supporting rapid digitisation (to KE & DAMS)
• Pathfinder activities for less well understood projects
o Entomological dry material (30 M specimens)
- iCollections (specimen-by-specimen) approach
- SatScan (drawer level multi-specimen) approach
2. Making collections data digital
Digitisation Protocols
14. • Specimen-by-specimen, traditional, dedicated 6 person team
• Digitising British Isles Lepidoptera collection
• ~500,000 specimens, 5,000 drawers
• Re-curation & specimen imaging
• Complete label information including georeferencing
• For use in Climate Change initiative
2. Making collections data digital
iCollections Initiative
15. • 4-6 people over 3 years, work broken into small tasks by teams
• Average imaging rate 163 specimens per person per day
• Averaging >3min per specimen (prep., imaging & databasing)
• >£1/specimen
• BUT: 6,800 person years for the entire collection
2. Making collections data digital
iCollections Initiative
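The throughput figures on the two iCollections slides can be cross-checked against each other. This is a rough sketch only: the rate, the time per specimen and the collection size come from the slides, while the team size of five is an assumed midpoint of "4-6 people". It does not attempt to reproduce the 6,800 person-year estimate, which rests on assumptions the slide does not state.

```python
SPECIMENS_PER_PERSON_DAY = 163  # average imaging rate from the slide
MINUTES_PER_SPECIMEN = 3        # slide says ">3 min", so this is a lower bound
LEPIDOPTERA_SPECIMENS = 500_000 # ~500,000 British Isles Lepidoptera specimens
TEAM_SIZE = 5                   # assumption: midpoint of "4-6 people"


def minutes_worked_per_day(rate: int, minutes_each: float) -> float:
    """Working minutes per person per day implied by the rate and time per specimen."""
    return rate * minutes_each


def person_days(specimens: int, rate: int) -> float:
    """Person-days needed to process a number of specimens at a given daily rate."""
    return specimens / rate


daily_minutes = minutes_worked_per_day(SPECIMENS_PER_PERSON_DAY, MINUTES_PER_SPECIMEN)
total_person_days = person_days(LEPIDOPTERA_SPECIMENS, SPECIMENS_PER_PERSON_DAY)
team_working_days = total_person_days / TEAM_SIZE

print(f"Implied working time: {daily_minutes / 60:.1f} hours per person per day")
print(f"Person-days for the Lepidoptera collection: {total_person_days:,.0f}")
print(f"Working days for a team of {TEAM_SIZE}: {team_working_days:,.0f}")
```

Under these assumptions the team needs on the order of 600 working days, which is broadly in line with the three-year duration given on the slide.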
16. • Drawer level digitisation, segmented down to specimens
• Very fast imaging, no specimen handling, just one view
• No label information, but some data extracted from drawer
• Specimens retrospectively cropped & annotated
2. Making collections data digital
SatScan Initiative
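Segmenting a drawer image down to individual specimens can be illustrated with a connected-component pass over a binary mask. This is a toy sketch and not the SatScan software's actual method: it assumes the drawer image has already been thresholded into a grid of 1 (specimen) and 0 (background), and all names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_specimen_boxes(mask: List[List[int]]) -> List[Box]:
    """Return one bounding box per 4-connected region of 1s, in scan order."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited specimen pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the region described by a bounding box out of a row-major image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy drawer with two separate specimens.
drawer_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
boxes = find_specimen_boxes(drawer_mask)
specimen_crops = [crop(drawer_mask, box) for box in boxes]
print(boxes)
```

Each cropped region can then be annotated retrospectively, as the slide describes.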
18. • Dedicated specimen-level rapid annotation software
2. Making collections data digital
SatScan Initiative
19. Crowdsourcing & Transcription
• We have a massive transcription problem
• Experiments via Notes-from-Nature (a Zooniverse project)
• Transcribing the NHM ornithological accession registers
• Wikimedian in Residence (Wikisource transcription)
• 4-month project, including specimen label transcription
2. Making collections data digital
20. data.nhm.ac.uk
• A focus for deposition and discovery of major NHM data sets
• Promote innovation through re-use of museum data
• Open Access, at a dedicated subdomain of the NHM website
• Started Jan. 2013 (3 years), consultation throughout 2012
NHM Data Portal
Functional components of the data portal
3. Aggregation & Delivery
23. • Simple dataset upload workflow for non-collections data
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
3. Aggregation & Delivery
NHM Data Portal: Dataset upload
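The upload steps on this slide suggest a simple record that is filled in step by step. The sketch below is purely illustrative: the class, its field names, and the choice to treat only steps 1 to 3 as mandatory are all assumptions, not the portal's actual data model or validation rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetSubmission:
    """Hypothetical record mirroring the upload steps on the slide."""
    name: str = ""                                             # 1. Name the dataset
    file_location: str = ""                                    # 2. Upload / link the data file
    file_description: str = ""                                 # 3. Describe the data file
    themes: List[str] = field(default_factory=list)            # 4. Theme & tag
    tags: List[str] = field(default_factory=list)              # 4. Theme & tag
    additional_resources: List[str] = field(default_factory=list)  # 5. Additional resources
    temporal_coverage: Optional[Tuple[str, str]] = None        # 6. (start, end) dates
    geographic_coverage: Optional[str] = None                  # 7. e.g. a region name

    def incomplete_steps(self) -> List[str]:
        """List the steps assumed mandatory (1 to 3) that are still empty."""
        required = [
            ("1. Name the dataset", self.name),
            ("2. Upload / link the data file", self.file_location),
            ("3. Describe the data file", self.file_description),
        ]
        return [label for label, value in required if not value]

    def ready_to_save(self) -> bool:
        """Step 8 (Save & finish) is allowed once no mandatory step is missing."""
        return not self.incomplete_steps()


draft = DatasetSubmission(name="Example checklist")
print(draft.incomplete_steps())
```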
24. Interface labels: Zoomable map; Applied filters; Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records
• Dedicated interface to visualise & explore major datasets
• Focused on collections data, based on Canadensys.net, uses CartoDB
3. Aggregation & Delivery
NHM Data Portal: Data visualisation
26. • Using DataCite DOIs in the data portal
• datasets (2014) & specimens (2015)
• Unique, persistent and resolvable identifiers
• Easy to cite, alias existing specimen identifiers
• Conform to minimum DataCite requirements
• Landing page, min. metadata standard, fee, min. 10 yr. contract, DOI (pre)fixes
NHM Data Portal & DataCite
Breaks us out of the biodiversity data silo
4. Identifiers, links & interoperability
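A DOI is a prefix and a suffix joined by a slash, and it resolves through the doi.org resolver to a landing page. The sketch below shows that structure and a check for the five metadata properties that DataCite has treated as mandatory (identifier, creator, title, publisher, publication year; later schema versions also require a resource type). The prefix 10.5072 is a placeholder commonly used for testing, not an NHM prefix, and the suffix and function names are hypothetical.

```python
import re
from typing import Dict, List

# Core mandatory DataCite properties; later schema versions add a resource type.
REQUIRED_PROPERTIES = ("identifier", "creator", "title", "publisher", "publicationYear")

PLACEHOLDER_PREFIX = "10.5072"  # placeholder for illustration, not an NHM prefix

# Loose structural check: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def build_doi(prefix: str, suffix: str) -> str:
    """Join prefix and suffix and reject anything that is not DOI-shaped."""
    doi = f"{prefix}/{suffix}"
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return doi


def resolver_url(doi: str) -> str:
    """URL that redirects to the landing page registered for the DOI."""
    return f"https://doi.org/{doi}"


def missing_properties(metadata: Dict[str, str]) -> List[str]:
    """Return the mandatory properties that are absent or empty."""
    return [key for key in REQUIRED_PROPERTIES if not metadata.get(key)]


doi = build_doi(PLACEHOLDER_PREFIX, "example-dataset-1")  # hypothetical suffix
record = {"identifier": doi, "title": "Example dataset"}
print(resolver_url(doi))
print(missing_properties(record))
```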
27. • Content within the NHM data portal will be highly accessible
o Collections harvestable (e.g. by GBIF as a DwCA)
o Download DwCAs on any search facet
o Wide set of APIs available for datasets (part of CKAN)
• Sub-portals (selected content, themed by topic)
o e.g. Virtual Herbarium, NHM Science initiatives, geographic regions
• Analytical interface planned for 2015 (but not specified)
Data Aggregation, APIs & download
4. Identifiers, links & interoperability
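Because the dataset APIs come with CKAN, a dataset search would go through CKAN's Action API, where `package_search` is a standard action. The sketch below only builds the request URL and defines, without calling, a function that would fetch results. Serving the API from the root of data.nhm.ac.uk over HTTPS is an assumption, as are the function names and the example query.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# Portal address named in the slides; the scheme and API location are assumptions.
PORTAL = "https://data.nhm.ac.uk"


def package_search_url(query: str, rows: int = 10, base: str = PORTAL) -> str:
    """Build a CKAN Action API URL for a free-text dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


def search_dataset_names(query: str, rows: int = 10) -> List[str]:
    """Fetch matching dataset names. Makes a network call; not run in this sketch."""
    with urlopen(package_search_url(query, rows)) as response:
        payload = json.load(response)
    # CKAN wraps results as {"success": ..., "result": {"count": ..., "results": [...]}}.
    return [package["name"] for package in payload["result"]["results"]]


example_url = package_search_url("herbarium", rows=5)
print(example_url)
```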
28. • Data portal will be “open-by-default”
• Ambiguity in what this means & top down schizophrenia
• Conflicting mandates on open access & revenue opportunities
• Lots of guidance available, will use to form a common policy
• A cross institutional policy would be useful (but challenging)
Data Policies & Next Steps
5. Timeline & constraints
29. NHM Data portal timeline (Jan 2013, Jan 2014, Jan 2015, Jan 2016)
Milestones: Project start; Private alpha; Stable public beta; Full release & sub-portals
Phases: Requirements & dataset discovery; Internal feedback, data visualisation & DOIs; Subportals & analytical tools
Next 6 months
• More documentation (PID and Tech Spec)
• Consultation and advocacy (internal and external)
• Data mapping from KE EMu and software testing
• Development
o website wireframe design
o drafting data visualisation subcontract
o Construction of private alpha release
5. Timeline & constraints
Data Policies & Next Steps
30. NHM digitisation timeline (Jan 2013, 2014, 2015, 2016, 2017, 2018)
Milestones: Project start; Private alpha; Stable public beta; 20 Million!!
Phases: Path-finding & programme development; Major funding applications & a new gallery?; Digitise… Digitise… Digitise…
Next 6 months
• Initial conclusions from path-finding digitisation activities
• Initial grant funding bids developed
• Advocacy, outreach & development of a digitisation “programme”
• Investigate possibilities for gallery development
• Develop crowdsourcing strategy
5. Timeline & constraints
Data Policies & Next Steps
32. Digitisation Priorities
• Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.
[Chart: Crop Wild Relatives (accepted taxa only); axis scale 0 to 700]
2. Making collections data digital
33. • Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections
Nick Poole, UK Collections Trust
2. Making collections data digital
Digitisation Priorities
Editor's Notes
NHM has a huge amount of digital ambition. As an institution we have a new science strategy taking us to 2017, and “digital” as a concept runs through almost every aspect of that strategy. Just to underscore this, we put it on the front cover of the strategy.
This visualisation shows the 400k specimens across all the departments in the NHM that have good geo-locative data, and the length of each line corresponds to the collecting effort in that spot. It's not the most informative visualisation, but the intention is that this globe will grow with more points over time as we digitise. In fact it's going to have to grow a lot over the next 5 years.
Our science strategy commits us to digitise 20M specimens over the next five years. This will involve an enormous ramping up of effort, given that at the moment we only have about 2.8M records.
So my talk today is really about how we are going to ramp up to achieve this 20M figure, and I have structured this presentation according to the points that I was asked to speak on. So first off I’ll say a little about the digital content that we already have; then how we are creating new digital content; how we are delivering that content; how we are going to link that content up; and finally what the timeline is for doing this work. And I’m going to focus on the digitisation activities and the data portal, since these are the parts of this work that I am most closely associated with…
So first off then, what digital content do we (as an institution) already have? Well, in the context of our research this is best represented by the papers we publish. On average the NHM produces about 50 papers a month and about 80% of these have a significant amount of digital data associated with them. However, this content is mostly invisible to the institution. It's only accessible through the papers.
Showing all geotagged specimens on a map. You can click one of the specimen records to get an overview of the record. Then click through to see the full record.
Shows all the information related to the record. You can also click through to see the data mapped to Darwin Core fields.
DataCite
Aggregation and access
Open by default
Data portal timeline
Digitisation timeline
Questions
Example of how we set digitisation priorities
The choice of what digitisation granularity we need is linked to the outcome for the data.