2013 02 data portal science group update -v smith

data.nhm.ac.uk
NHM data portal update

Part of the informatics initiative (2013-15)

Vince Smith & Ben Scott

The problem – research data
Hard to find, access, cite and integrate
• 45 available online
(4 print only or behind pay walls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
o >1000 sequences
o 826 figures
o 76 tables
o 1 genome

• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No mechanism to integrate or version
• No way to repurpose data (retyping?)

49 NHM science group papers in last 4 weeks
Data via Carolyn Lowry e-mail, 13th Feb. 2013

The problem – collections data
Initial problems
•Don’t know / can’t find the website

Initial problems
Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38

Initial problems
•6 different data collections

Initial problems
•23 interfaces & datasets of varying importance

Initial problems
•No priority to collection datasets

119 specimens
Up to 28,000,000 specimens

Initial problems
•Entomology collections don’t exist (404)

Initial problems
•6 different collections
•Library doesn’t have any online collections!

Initial problems

Bigger issues
•Idiosyncratic browse or search
•No maps, few images & very slow
•No summary or statistics
•No download, export or custom views
•No integration with other data
•No author info or update info
•No means of specimen citation
•No exports to GBIF or associated projects

The data portal must correct these issues

The solution – data.nhm.ac.uk portal
High level issues
Functional requirements
•A central access point for NHM research & collections data
•The capacity to store, link and describe datasets
•Integrated search & browse of datasets
•The ability to cite datasets and specimen records in data sets
•The ability to integrate collections data
•Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
•The capacity to download, export & analyse data

Principles
•Open-by-default: Capacity for embargoed and private data
•Sustainable: Self-populated by NHM staff (except collections data)

Exclusions
•Not a replacement for DAMS or KeEMu (a Web interface for these systems)
•Publications out of scope (focused on data sets)
•All annotations on data link back to the source (e.g. KeEMu)

The solution – data.nhm.ac.uk portal
System Overview
Scope (Source Data)
•KeEMu (NHM)
•HerbCat (Kew)
•Other datasets etc… (Species dictionary, initiatives, Scratchpads etc)
•User contributed datasets

File types (formats)
•DwC-A
•PhyloXML
•neXML
•Nexus
•Excel, CSV
•Other

Registry (Discovery & download)
•NHM specimens
•Kew specimens
•Private

Subportals (Branded slices of data)
•Subportal 1, e.g. Disease initiative
•Subportal 2, e.g. Kew / NHM Virtual Herbarium

Explorer
•Map view
•Table view
•Statistics view
•Analytic view (R)

Portal overview – adding data sets
Quick & easy, semi-automated workflow

1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
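The eight steps could be backed by a simple record that tracks what the contributor has filled in so far. The sketch below is illustrative only: the class name, field names and the choice of which steps are mandatory are assumptions, not part of the portal specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumption: only the first three steps are treated as mandatory here.
REQUIRED_FIELDS = ("name", "data_file", "description")


@dataclass
class DatasetSubmission:
    """Hypothetical record holding what the add-dataset workflow collects."""
    name: str = ""                                        # 1. Name the dataset
    data_file: str = ""                                   # 2. Upload / link the data file
    description: str = ""                                 # 3. Describe the data file
    tags: List[str] = field(default_factory=list)         # 4. Theme & tag
    resources: List[str] = field(default_factory=list)    # 5. Additional resources
    temporal_coverage: Optional[Tuple[str, str]] = None   # 6. (start, end) dates
    geographic_coverage: Optional[str] = None             # 7. e.g. a place name

    def missing_steps(self) -> List[str]:
        """Return the mandatory fields that are still empty."""
        return [f for f in REQUIRED_FIELDS if not getattr(self, f)]

    def ready_to_save(self) -> bool:
        """8. Save & finish is allowed once nothing mandatory is missing."""
        return not self.missing_steps()


submission = DatasetSubmission(name="Example dataset", data_file="example.csv")
print(submission.missing_steps())
```

Keeping the optional steps (tags, coverage, extra resources) as defaults is one way to keep the workflow quick while still recording richer metadata when it is available.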

Portal overview – search interface
Discovering research data sets

Screenshot labels:
•Search
•Results
•Browse & search criteria
•Datasets matching criteria
•Individual dataset
•Advanced display options

Portal overview – data set display
Exploring research data sets

Screenshot labels:
•Name
•Authors
•License
•Tags
•Download
•Metadata about the dataset
•Technical Info. (extracted from data file)
•Geographic scope
•“Social”
•Developer tools

Portal overview – collections data
Main interface

Screenshot labels:
•Toggle map, table & stats views
•Search, download & display options
•No. records
•No. Georef. records
•Zoomable map
•Applied filters
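The two counters on the main interface (number of records and number of georeferenced records) can be illustrated with a short sketch. It assumes records are dictionaries keyed by the Darwin Core terms `decimalLatitude` and `decimalLongitude`; the function names and sample values are hypothetical.

```python
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[float]]


def is_georeferenced(record: Record) -> bool:
    """A record counts as georeferenced if it has a valid latitude and longitude."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def summarise(records: Iterable[Record]) -> Dict[str, int]:
    """Return the total record count and the georeferenced record count."""
    records = list(records)
    georeferenced = sum(1 for r in records if is_georeferenced(r))
    return {"records": len(records), "georeferenced": georeferenced}


# Illustrative sample data, not real specimen records.
sample = [
    {"decimalLatitude": 51.4967, "decimalLongitude": -0.1764},
    {"decimalLatitude": None, "decimalLongitude": None},
    {"decimalLatitude": 95.0, "decimalLongitude": 10.0},  # latitude out of range
]
print(summarise(sample))  # {'records': 3, 'georeferenced': 1}
```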

Portal overview – collections data
Additional interfaces
Collections views Specimen record views

Screenshot labels:
•Tables
•Statistical summary
•Full record
•Summary
•Download preview
•Data field mappings

Portal overview
Some example data portals & software

Data.gov.uk & CKAN
•UK government data portal
•Uses CKAN, open-source data portal platform
•Used by national & regional governments
•Links into Drupal, DataCite & NHM systems
•http://data.gov.uk & http://ckan.org/
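CKAN exposes an HTTP Action API, so dataset discovery can be scripted as well as done through the web interface. The sketch below only builds a `package_search` request URL and makes no network call. It assumes the NHM portal would expose the standard CKAN API at its root; the helper name and the example query are hypothetical.

```python
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API URL that searches datasets by free text."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Assumption: the portal host serves the CKAN API directly.
print(package_search_url("http://data.nhm.ac.uk", "herbarium"))
# http://data.nhm.ac.uk/api/3/action/package_search?q=herbarium&rows=10
```

A CKAN instance answers such a request with JSON, so the same URL could feed a script or an R analysis without any scraping of web pages.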

Canadensys & CartoDB
•Canadian network of biodiversity collections
•Almost 1 million specimens, 18 datasets
•Uses CartoDB mapping solution
•Create dynamic maps, analyse data and build location-aware geospatial applications
•Widely used, cloud data storage, PostGIS
•http://data.canadensys.net & http://cartodb.com/

Portal development
Timeline & resources
Year 1 – Dataset discovery
•Technical & functional specification (Vizz. subcontract)
•Data workflows (KeEMu & research datasets)
•Functional alpha prototype (CKAN)

Year 2 – Visualisation
•Mapping & statistical functionality (CartoDB)
•Social and annotation functions
•Stable beta release at http://data.nhm.ac.uk

Year 3 – Citation & analysis
•DataCite DOIs on datasets & specimens
•Initial Web analytical functions (R)
•Initiative sub-portals including Virt. Herbarium

Resources
•1x Developer (Ben Scott) for 3 years
•Vizzuality subcontract (circa £xxk - TBC)
•ICT capital, travel & software (circa £25k)

Portal consultation
Feedback & next steps
Documentation
•Overview specification - http://goo.gl/qjioh
•Project Initiation Document - http://goo.gl/oRr2j

Initial stakeholder meetings (Feb. – May)
•ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
•Darrell Siebert and the KE EMu user group
•NHM Collections Committee & Initiative leaders
•Kew Gardens & Virtual Herbarium Reps.
•GBIF, NBN, UK DataCite team at BL, NERC
•Digital Facility Team
•Vizzuality
Wider consultation
•Example data types / sets
•Specialist search options & vocabularies
•Specialist Earth Science needs

FEEDBACK & LINKS
Slides:
Feedback: vince+portal@vsmith.info
Specification: http://goo.gl/qjioh
PID: http://goo.gl/oRr2j

Two more things
Wikipedian in Residence
•Four month post with Science Museum
•Starting March / April
•Work with NHM staff to improve Wikipedia
•Run events with NHM staff & volunteers
•Work with the GLAM group at Imperial College
•Focus on NHM science themes & specimens
•Not about promotion of “The NHM”

Biodiversity Informatics Workshop – May 2013
•One full day - date TBC
•Outputs from ViBRANT & e-Monocot
•Includes Scratchpads & the Biodiversity Data Journal
•What we do, how it's used and where we are going
•Includes links to NHM informatics & digitisation initiatives

Portal overview – data citation
Unique identifiers for datasets & specimen records
Why cite data
•URLs are not persistent
•e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
•Measure our digital footprint
•Puts research data on par with articles
•Facilitates data mining
What gets an identifier
•NHM specimen records (suffix of NHM IDs), e.g. http://dx.doi.org/BMNH_PBI_00388325
•NHM research datasets (files)

•Insert into publications

How to cite data
•Digital Object Identifiers (DOIs)
•Widely used & understood on articles
•Operates in collaboration with DataCite
•Part of an International consortium
•Mixes NHM data with other domains
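As a rough illustration of how a specimen identifier could become a citable link, the sketch below composes a DOI from a registrant prefix and a suffix derived from the NHM ID, then builds the resolver URL. The `BMNH_` suffix pattern follows the slide's example. The function names are hypothetical and the prefix `10.XXXX` is a placeholder, since the actual DataCite prefix is not stated here.

```python
def doi_suffix(nhm_id: str) -> str:
    """Derive a DOI suffix from an NHM specimen ID (pattern taken from the slide)."""
    return f"BMNH_{nhm_id}"


def doi_url(prefix: str, suffix: str, resolver: str = "http://dx.doi.org") -> str:
    """Join resolver, registrant prefix and suffix into a resolvable DOI URL."""
    if not prefix.startswith("10."):
        raise ValueError("DOI prefixes start with '10.'")
    return f"{resolver.rstrip('/')}/{prefix}/{suffix}"


# "10.XXXX" is a placeholder prefix, not a registered one.
print(doi_url("10.XXXX", doi_suffix("PBI_00388325")))
# http://dx.doi.org/10.XXXX/BMNH_PBI_00388325
```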
