Research Data, or: How I Learned to Stop Worrying and Love the Policy
RDMF14: Research Data (and) Systems
York, 9th November 2015
Dr Torsten Reimer
Scholarly Communications Officer
Imperial College London
t.reimer@imperial.ac.uk / @torstenreimer
http://orcid.org/0000-0001-8357-9422
Why we fight – Compliance! Really?
“Well compliance is really important, yes that's
the whole reason we are doing it really. I mean
to comply with Research Council guidelines
yes. I am not saying the whole reason but
that's the main driver, yes.”
10.1371/journal.pone.0114734
There are issues with RCUK/EPSRC policy:
• cost-benefit analysis, anyone?
• expensive/issues around funding
• enough support/incentive for culture change?
• fine in theory, but is it workable in practice?
But…
Blame funders, or blame ourselves (hedgehog and hare)?
It seems wherever we go, the funders have
already been there: HEFCE open
access policy; EPSRC data policy…
Are the funders too fast? Or are we too slow?
Imagine the sector had agreed on best
practice years ago – and implemented
it in a sensible way!
Data Science hub and KPMG Data Observatory launch (04 Nov)
"At a research intensive university like
Imperial it is hard to do anything that
doesn't involve data."
James Stirling, Provost
"Data is at the heart of the human
condition."
Joanna Shields, UK Minister for Internet
Safety and Security
Considering these statements you’d think everyone, especially
Imperial, would have RDM all sorted, wouldn’t you?
… and yet we are losing research data
“In their parents' attic, in boxes in the garage, or stored on now-defunct floppy
disks — these are just some of the inaccessible places in which scientists have
admitted to keeping their old research data.”
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
Isn’t research meant to be reproducible?
The results of only 6 out of 53 ‘landmark’ studies were found to be
reproducible.
Drug development: Raise standards for preclinical cancer research.
DOI: 10.1038/483531a
“Several recent publications suggested that the seminal findings from
academic laboratories could only be reproduced 11–50% of the
time. The lack of data reproducibility likely contributes to the
difficulty in rapidly developing new drugs and biomarkers that
significantly impact the lives of patients with cancer and other
diseases.”
A Survey on Data Reproducibility in Cancer Research Provides
Insights into Our Limited Ability to Translate Findings from the
Laboratory to the Clinic. DOI: 10.1371/journal.pone.0063221
Case for a national infrastructure?
Currently, ~100 UK institutions spend effort to define and implement
an RDM infrastructure (storage, workflows, interfaces, metadata,
compliance, monitoring, business model etc.). Some aspects
have to be local, but…
…imagine a national research data infrastructure (say for data
publishing and preservation), run by RCUK:
• Economies of scale
• No issues with funding
• Just one system to interface with
• Increased visibility/discoverability
• Solution would by default be compliant
• No commercial “ownership” of public data
One RDM system to rule them all?
• Is community track record actually
better than funders’?
• Jisc offers components, but have we
found the right model for collaboration
(supplier? leader? partner?)?
• Commercial solutions exist – trust?
Should they define our infrastructure?
Funders set policy; 3rd parties set
infrastructure – we’ve been too slow again!
However, is one system actually suitable
(redundancy, competition, disciplines etc.)?
Until the one solution emerges (if ever), we should:
• consider defining minimum requirements (metadata,
identifiers, embargoes) for 3rd party solutions?
• use a flexible approach that enables us to learn and change
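A minimal sketch of what checking a 3rd party record against such minimum requirements could look like. The slide names only the three areas (metadata, identifiers, embargoes); the field names, the `DatasetRecord` type and both functions are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical minimum metadata fields; not an agreed sector schema.
REQUIRED_METADATA = ("title", "creator", "publication_year")


@dataclass
class DatasetRecord:
    metadata: dict
    identifier: Optional[str] = None      # e.g. a DOI
    embargo_until: Optional[date] = None  # None means no embargo


def missing_requirements(record: DatasetRecord) -> List[str]:
    """Return the minimum requirements this record does not meet."""
    problems = [f"metadata:{field}" for field in REQUIRED_METADATA
                if not record.metadata.get(field)]
    if not record.identifier:
        problems.append("identifier")
    return problems


def is_publicly_available(record: DatasetRecord, today: date) -> bool:
    """A record counts as public once any embargo date has been reached."""
    return record.embargo_until is None or today >= record.embargo_until
```

A solution that cannot supply these fields for every record would fail the check, whichever vendor runs it.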
Imperial College London
• Seven London campuses
• Four Faculties: Engineering,
Medicine, Natural Sciences
and Business School
• Ranked 3rd in Europe / 8th in the
world (THE 2015-16 rankings)
• Net income (2014): £855m, incl.
£351m research grants and contracts
• ~15,000 students, ~7,400 staff, incl. ~3,900 academic & research staff
• Staff publish 10-12,000 scholarly articles per year
• Largest data traffic into Janet network of all UK universities
Process of policy development
• 2014: Draft policy: “Statement of Strategic Aims”
• Lack of reliable data, in particular on the scale of data storage needs
• Concerns about cost of maintaining infrastructure
• Concerns about uncertainties and changing market / policy landscape
• Decision: re-think approach – more cost-effective, based on better data
• Approach: RDM Green Shoots and RDM Investigation
• Funded by Vice-Provost (Research)
• Green Shoots: 6 bottom-up, academic projects (2nd half of 2014)
• RDM investigation (Oct 2014-Jan 2015)
• Online survey (academics; 390 responses)
• ~40 interviews (academics)
• Workshops (academics & data managers)
RDM Green Shoots
• Haystack – a computational molecular data notebook
(Dr Mike Bearpark, Chemistry)
• Imperial College Healthcare Tissue Bank
(Prof. Gerry Thomas, Surgery & Cancer)
• Integrated Rule-based Data Management System for
Genome Sequencing Data (Dr Michael Mueller, Medicine)
• RDM in Computational and Experimental Molecular
Sciences (Prof. Henry Rzepa, Chemistry)
• RDM: Where software meets data (Dr Gerard Gorman &
Dr Matthew Piggott, Earth Science & Engineering)
• Time Series (Dr Nick Jones, Mathematics)
Example project: Time Series
Idea
• Provide a platform and technology which automatically connects researchers
through their time-series data, models and analysis methods
Achievements
• Online interdisciplinary collection of time-series data and time-series analysis code
• Functionality to automatically profile time series
• Functionality to automatically profile time series algorithms
• Functionality to use these profiles to place a user’s work in the context of others
RDM Benefits
• Incentivises data sharing by allowing data comparison – increases discoverability of
an academic’s data plus increases likelihood of finding other relevant data
• Resource also available to general public
More Information
• http://www.comp-engine.org/timeseries/
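To illustrate the idea of profiling time series and using the profiles to place one series in the context of others, here is a toy sketch. It is not the method used by the Time Series project: the three summary features (mean, standard deviation, lag-1 autocorrelation) and the function names `profile` and `nearest` are assumptions made for this example.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def profile(series: Sequence[float]) -> List[float]:
    """Summarise a series as [mean, standard deviation, lag-1 autocorrelation]."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0 or len(series) < 2:
        # A constant or single-point series has no meaningful autocorrelation.
        return [mu, sigma, 0.0]
    n = len(series)
    cov = sum((series[i] - mu) * (series[i + 1] - mu) for i in range(n - 1)) / n
    return [mu, sigma, cov / (sigma ** 2)]


def nearest(name: str, profiles: Dict[str, List[float]]) -> str:
    """Name of the other series whose profile is closest (Euclidean distance)."""
    target = profiles[name]
    others = {k: v for k, v in profiles.items() if k != name}
    return min(others, key=lambda k: math.dist(target, others[k]))
```

A real service would use many more features and normalise them before comparing; the point here is only that a profile turns "find related data" into a nearest-neighbour lookup.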
Online survey – where does active data live?
[Bar chart: use of different types of storage in % (axis 0–80). Categories:
College computer; External/portable storage; Cloud storage; Personal computer;
Departmental/group storage; College H drive; ICT central storage]
Online survey – growth of data volume
[Bar chart: research group data storage needs in % (axis 0–30), two series:
Now and In 2 years. Categories: < 10 GB; 10 GB – 100 GB; 100 GB – 1 TB;
1 TB – 10 TB; 10 TB – 100 TB; 100 TB – 1 PB; > 1 PB]
Findings (best practice)
• RDM principles are considered to be sound but not fully practised
• Sharing publicly-funded data accepted in principle but some question
value and cost
• Concerns about (metadata) effort to make shared data discoverable
• Metadata schemas are not yet widely available across disciplines
• Auto-generate metadata where possible
• Consensus that RDM training for PhDs is vital
(also to prevent data loss when they leave)
Findings (data)
• Re-generating the data used in publications would require 60-100% of the grant
• % of data that needs retaining to support publications: ~60%
• Data storage capacity will have to grow significantly
• Concerns around back-up and archiving, esp. considering data volume
• Popularity of cloud services (as opposed to College storage)
Researchers want a self-administered, secure, responsive solution
for data sharing, storing and archiving; open APIs preferred
(“Yes [storage] is really important. Basically, whenever we have been out
to talk to researchers, that's the thing they have latched on to and want to
talk about the most.” 10.1371/journal.pone.0114734)
Conclusions / policy implementation principles
• Provide platform-independent, flexible data storage
• Embed RDM training into PhD progression
• Where available, use existing workflows:
• Symplectic Elements: metadata management
• Spiral (DSpace): public (metadata) catalogue
• Additional infrastructure:
• use external resources
• no long-term commitment
• as flexible as possible
• cost-effective
Result: Imperial College RDM Policy
“Imperial College London is committed to
promoting the highest standards of
academic research, including excellence in
research data management. This includes a
robust digital curation infrastructure that
supports open data access and protects
confidential data. The College acknowledges
legal, ethical and commercial constraints on
data sharing and the need to preserve the
academic entitlement to publication.”
“Principal Investigators have overall
responsibility for the effective management
of research data generated within or obtained
for their research, including by their research
groups. The Library and ICT will provide
training, guidance and services to support
PIs.” http://imperial.ac.uk/research-data-management
[Workflow diagram (flowchart), labels as recovered:
• Research Project creates data/software (Data: Box; Software: GitHub)
• Project ends: data/software still needed? no: Delete; yes: continue
• Can it be published or embargoed externally? yes / no, leading to
External repository or Internal Storage
• Metadata, manual or automatic (on both branches), into Elements
• Can metadata be published? Library reviews; yes: Spiral]
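One possible reading of the end-of-project workflow above, written as a decision function. The diagram labels come from the slide; the mapping of "yes" to the external repository and "no" to internal storage is an assumption, and the names `Destination` and `end_of_project_destination` are hypothetical.

```python
from enum import Enum


class Destination(Enum):
    DELETE = "delete"
    EXTERNAL_REPOSITORY = "external repository"
    INTERNAL_STORAGE = "internal storage"


def end_of_project_destination(still_needed: bool,
                               can_publish_or_embargo_externally: bool) -> Destination:
    """Decide where data/software goes when the project ends."""
    if not still_needed:
        return Destination.DELETE
    if can_publish_or_embargo_externally:
        return Destination.EXTERNAL_REPOSITORY
    # Assumed: data that cannot leave the College stays in internal storage.
    return Destination.INTERNAL_STORAGE


def goes_into_public_catalogue(destination: Destination,
                               metadata_can_be_published: bool) -> bool:
    """Metadata reaches the public catalogue only for retained data whose
    metadata may be published (after Library review, per the diagram)."""
    return destination is not Destination.DELETE and metadata_can_be_published
```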
Summarising RDM in 6 steps
1. Make a data management plan: use DMPOnline
2. Store your data management plan centrally: use InfoEd
3. Store your live data securely and safely: use Box
4. Store your final data (and/or code) for 10+ years,
making it publicly available: use Zenodo
5. Tell the College where your data (and/or code) is
published or stored: use Symplectic
6. Reference your funding and your data in the
publications it underpins: tell your publisher
Box – Data storage, sharing and syncing
Roll-out across College:
• unlimited data
storage
• online access, easy
sharing, data syncing
• file viewers included
• backup, data remains
even when staff leave
• machine learning
tools to describe data
• API
Infrastructure summary
• Flexible, can react to market / policy changes
• Components can be exchanged, no additional
in-house infrastructure
• Make a start, collect data, learn – change as required
• Preservation infrastructure needs further work
(discussions with Arkivum about ‘framework’ for
costing into grants) – how much do we need
to retain beyond published data?
• It isn’t perfect, but we can make a start
RDM policy with research software requirements
“3.6.7 Cost Effectiveness – where computer-generated data may be
reliably recreated at a cost less than that of storing raw output data,
then the inputs and human-readable outputs of the relevant
programme may be stored instead along with a reference to or copy of
the software version used.”
“3.7 If software is developed as part of a research project, Principal
Investigators must archive the particular version of the software
used to generate or analyse the data in a repository and inform the
Library of its location, taking account of the points raised in 3.5
above. Principal Investigators are encouraged to follow the
Sustainability and Preservation Framework of the Software
Sustainability Institute.”
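The cost-effectiveness test in 3.6.7 can be sketched as a simple comparison. This is an illustration only: the policy text gives no formula, so the linear storage cost model, the parameter names and the function `store_inputs_instead` are assumptions.

```python
def store_inputs_instead(recreation_cost: float,
                         storage_cost_per_tb_year: float,
                         size_tb: float,
                         retention_years: float,
                         reliably_recreatable: bool) -> bool:
    """True if inputs and human-readable outputs may be stored instead of
    the raw output data, following the wording of 3.6.7."""
    if not reliably_recreatable:
        # The policy only allows this for data that can be reliably recreated.
        return False
    # Assumed cost model: a flat price per TB per year over the retention period.
    raw_storage_cost = storage_cost_per_tb_year * size_tb * retention_years
    return recreation_cost < raw_storage_cost
```

For example, 2 TB kept for 10 years at a hypothetical 100 per TB per year costs 2,000 to store; if rerunning the simulation costs 500, storing the inputs and the software version would be the cheaper option.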
Treat software as valuable research output
PyRDM Green Shoots project
Zenodo integrates with GitHub
College survey on distributed version control
Software Sustainability Institute – I am a fellow
ORCID – Open Researcher and Contributor ID
• Emerging global standard for identifying authors of academic outputs
• The College created ORCID iDs for academic staff in late 2014
(now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements)
• Imperial hosted launch of Jisc ORCID consortium with
50 UK universities in September 2015
http://www.imperial.ac.uk/orcid
Towards automating RDM reporting with ORCID
[Workflow diagram:]
1. Author links ORCID with CRIS
2. …shares ORCID iD with repository
3. …publishes dataset
4. DataCite DOI linked to ORCID iD
5. CRIS pulls metadata from ORCID / DataCite / Repository
But: is the external metadata likely to be complete “enough”?
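A sketch of the last step and of the "complete enough" question: merge what the CRIS receives from ORCID, DataCite and the repository, then list what is still missing. No real API is called here; the record layout, the `REQUIRED` field list and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical fields a CRIS might need for reporting on a dataset.
REQUIRED = ("doi", "title", "orcid", "publication_year")


def merge_records(sources: Iterable[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Merge records in priority order: earlier sources win, later ones fill gaps."""
    merged: Dict[str, str] = {}
    for record in sources:
        for key, value in record.items():
            if value and key not in merged:
                merged[key] = value
    return merged


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Fields still absent after merging, i.e. what a person would have to add."""
    return [field for field in REQUIRED if field not in record]
```

Usage, with made-up records (the DOI is a placeholder):

```python
from_orcid = {"orcid": "0000-0001-8357-9422", "doi": "10.1234/example"}
from_datacite = {"doi": "10.1234/example", "title": "Example dataset", "publication_year": None}
from_repository = {"title": "Example dataset (v1)"}

merged = merge_records([from_orcid, from_datacite, from_repository])
print(missing_fields(merged))  # the publication year is still missing
```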
Useful infrastructure makes compliance a by-product
• One workflow for data generation, publishing, reporting and curation
• Link data generation directly to storage (log into facility, data “at your
desk” before you are out of the “lab”)
• (HSS colleagues – “facility” can also be a book scanner)
• Automate reporting and generating / sharing of metadata
[Workflow diagram:]
1. Facilities write (meta)data into Box
2. Data processed / analysed from Box
3. Machine-learning adds metadata
4. Publish to repository from Box, with reference
5. Metadata directly or indirectly (ORCID) to CRIS
Make data useful for us, not just for external re-use
Now that we get data, shouldn’t we analyse it?
Add value by:
• connect researchers who have similar data interests
• connect researchers to relevant data
• present data in a way that’s suitable for public reuse
• develop data analytics and knowledge transfer service
• collect impact information on data
• Let’s make a start and learn from doing, from actual data
• Think about where we can coordinate (3rd party requirements)
• It is early stages, take a flexible approach
• Don’t wait for funders, interpret policies in a useful way and lead
=> If we lead instead of following there will be fewer unpleasant
surprises to deal with!
Image Credit (note NC licence!)
1. https://en.wikipedia.org/wiki/File:Dr._Strangelove_-_Group_Captain_Lionel_Mandrake.png
public domain
2. https://it.wikipedia.org/wiki/Why_We_Fight#/media/File:Why_We_Fight_title.jpg
public domain
3. https://commons.wikimedia.org/wiki/File:Hase_und_Igel_%281%29.jpg
public domain
4. https://www.flickr.com/photos/jdhancock/4617759902/ C-3PO vs. Data
(137/365), by JD Hancock, CC BY 2.0
5. https://en.wikipedia.org/wiki/One_Ring#/media/File:Unico_Anello.png
public domain
6. https://www.flickr.com/photos/dinnerseries/14994148089/ OXO tools,
by Didriks, CC BY 2.0
7. https://www.flickr.com/photos/albertovo5/3908190631/ How I Learned
To Stop Worrying..., by hjhipster, CC BY NC 2.0