1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu
What is a Data
Management Plan?
Sarah Jones
Digital Curation Centre
sarah.jones@glasgow.ac.uk
Twitter: @sjDCC
This work is licensed under the Creative
Commons CC-BY 4.0 licence
2. What is EUDAT?
EUDAT offers a pan-European solution, providing a
generic set of services to ensure a minimum level of
interoperability
Building common data
services in close
collaboration with 25+
communities
www.eudat.eu
3. What is a DMP and why write one?
Requirements under Horizon 2020
Example plans
Lessons and guidance
Overview
4. WHAT IS A DMP & WHY WRITE ONE?
Image CC-BY-NC-SA by Leo Reynolds www.flickr.com/photos/lwr/13442910354
5. Data Management Plans
A DMP is a brief plan to define:
how the data will be created
how it will be documented
who will be able to access it
where it will be stored
who will back it up
whether (and how) it will be shared & preserved
DMPs are often submitted as part of grant applications, but are
useful whenever researchers are creating data.
6. How do DMPs help?
NON PECUNIAE INVESTIGATIONIS CURATORE
SED VITAE FACIMUS PROGRAMMAS DATORUM PROCURATIONIS
(Not for the research funder, but for life we make data
management plans)
Make your research easier
Stop yourself drowning in irrelevant stuff
Save data for later
Avoid accusations of fraud or bad science
Write a data paper
Share your data for re-use
Get credit for it
8. Research data lifecycle
CREATING DATA, PROCESSING DATA, ANALYSING DATA, PRESERVING DATA, GIVING ACCESS TO DATA, RE-USING DATA
CREATING DATA: designing research, DMPs, planning consent, locating existing data, data collection and management, capturing and creating metadata
PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, managing and storing data
ANALYSING DATA: interpreting and deriving data, producing outputs, authoring publications, preparing for sharing
PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
RE-USING DATA: follow-up research, new research, undertaking research reviews, scrutinising findings, teaching & learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
9. What data organisation would a re-user like?
Planning trick 1: think backwards
CREATING DATA, PROCESSING DATA, PRESERVING DATA, GIVING ACCESS TO DATA, RE-USING DATA
11. Planning trick 2: include RDM
stakeholders
Institution
RDM policy
Facilities
€$£
Research funders
Publishers
Data Availability
policy
Commercial partners
www.openaire.eu/briefpaper-rdm-infonoads
12. DMPS IN HORIZON 2020
Image “Open Data” CC BY 2.0 by http://www.descrier.co.uk
13. Horizon 2020: Open Data Pilot
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Participants must:
Develop a Data Management Plan
Deposit research data in a repository
Take measures to enable third parties to
access, mine, exploit, reproduce and
disseminate (free of charge for any user)
Provide information via the chosen repository
about the tools that are needed to validate the
results
15. Approach:
as open as
possible, as
closed as
necessary
Image: ‘Balancing rocks’ by Viewminder CC-BY-SA-ND
www.flickr.com/photos/light_seeker/7780857224
16. Horizon 2020 and DMPs
In H2020 the Data Management Plan (DMP) is a regular
project deliverable, due by month 6.
A DMP is a living document: to be used, updated and
shared.
You can use the H2020 template in DMPonline.
The DMP is not part of the proposal evaluation, but
there is an optional section on data management
evaluated under impact.
If (part of your) data cannot be shared with everyone, you
may (partially) opt out of the pilot.
17. Findable
– Assign persistent IDs, provide rich metadata, register in a
searchable resource,...
Accessible
– Retrievable by their ID using a standard protocol, metadata remain
accessible even if data aren’t...
Interoperable
– Use formal, broadly applicable languages, use standard
vocabularies, qualified references...
Reusable
– Rich, accurate metadata, clear licences, provenance, use of
community standards...
www.force11.org/group/fairgroup/fairprinciples
Making data FAIR
18. 1. Data Summary
2. FAIR data
2.1 Making data findable, including provisions for metadata
2.2 Making data openly accessible
2.3 Making data interoperable
2.4 Increase data re-use (through clarifying licences)
3. Allocation of resources
4. Data security
5. Ethical aspects
6. Other issues
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
H2020 template
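As a rough illustration, the template headings above could be turned into a plain-text skeleton to start drafting from. This is only a sketch: the function name `dmp_skeleton`, the `TODO` markers and the output layout are assumptions made for illustration, and are not part of DMPonline or the EC template itself.

```python
# Section headings copied from the H2020 template slide.
# The data structure and function name are illustrative assumptions.
H2020_SECTIONS = [
    ("1. Data Summary", []),
    ("2. FAIR data", [
        "2.1 Making data findable, including provisions for metadata",
        "2.2 Making data openly accessible",
        "2.3 Making data interoperable",
        "2.4 Increase data re-use (through clarifying licences)",
    ]),
    ("3. Allocation of resources", []),
    ("4. Data security", []),
    ("5. Ethical aspects", []),
    ("6. Other issues", []),
]


def dmp_skeleton(project_title: str) -> str:
    """Return a plain-text DMP outline with a TODO marker under each heading."""
    lines = [f"Data Management Plan: {project_title}", ""]
    for heading, subsections in H2020_SECTIONS:
        lines.append(heading)
        if not subsections:
            lines.append("  TODO")
        for sub in subsections:
            lines.append(f"  {sub}")
            lines.append("    TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(dmp_skeleton("Example project"))
```

Because a DMP is a living document, a skeleton like this would typically be kept under version control and filled in as the project progresses.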
19. Common themes in DMPs
1. Description of data to be collected / created
(i.e. content, type, format, volume...)
2. Standards / methodologies for data collection &
management
3. Ethics and Intellectual Property
(highlight any restrictions on data sharing e.g. embargoes,
confidentiality)
4. Plans for data sharing and access
(i.e. how, when, to whom)
5. Strategy for long-term preservation
21. Example plans
108 DMPs from the National Endowment for the Humanities
www.neh.gov/divisions/odh/grant-news/data-management-plans-successful-grant-applications-2011-2014-now-available
20+ scientific DMPs submitted to the NSF (USA) provided by UCSD
• http://libraries.ucsd.edu/services/data-curation/data-management/dmp-samples.html
Example DMP collection from Leeds University
• https://library.leeds.ac.uk/research-data-tools
DMPs in RIO journal
• http://riojournal.com/browse_user_collection_documents.php?collection_id=3&journal_id=17
Further examples:
• www.dcc.ac.uk/resources/data-management-plans/guidance-examples
22. Example H2020 DMPs in Zenodo
Helix Nebula – High Energy Physics example
https://zenodo.org/record/48171#.WATexnriF40
Tweether – engineering (micro-electronics) example
https://zenodo.org/record/55791#.WATei3riF40
AutoPost – ICT example
https://zenodo.org/record/56107#.WATefXriF40
23. Data description examples
The final dataset will include self-reported demographic and
behavioural data from interviews with the subjects and laboratory data
from urine specimens provided.
From NIH data sharing statements
Every two days, we will subsample E. affinis populations growing under
our treatment conditions. We will use a microscope to identify the life
stage and sex of the subsampled individuals. We will document the
information first in a laboratory notebook and then copy the data into an
Excel spreadsheet. The Excel spreadsheet will be saved as a comma
separated value (.csv) file.
From DataOne – E. affinis DMP example
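The notebook-to-CSV step described in the E. affinis example could also be done directly in code. A minimal sketch, assuming hypothetical column names (`date`, `treatment`, `life_stage`, `sex`) and placeholder rows; none of these come from the original plan.

```python
import csv
import io

# Hypothetical column names; the original plan does not list them.
FIELDS = ["date", "treatment", "life_stage", "sex"]


def write_observations(rows, handle):
    """Write observation dicts to an open text handle as comma-separated values."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real measurements.
    sample = [
        {"date": "2016-01-01", "treatment": "A", "life_stage": "adult", "sex": "F"},
        {"date": "2016-01-03", "treatment": "A", "life_stage": "nauplius", "sex": "NA"},
    ]
    buffer = io.StringIO()
    write_observations(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file rather than an in-memory buffer, open it with `open(path, "w", newline="")` so the `csv` module controls the line endings.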
24. Metadata examples
Metadata will be tagged in XML using the Data Documentation Initiative (DDI)
format. The codebook will contain information on study design, sampling
methodology, fieldwork, variable-level detail, and all information necessary for
a secondary analyst to use the data accurately and effectively.
From ICPSR Framework for Creating a DMP
We will first document our metadata by taking careful notes in the laboratory notebook
that refer to specific data files and describe all columns, units, abbreviations, and
missing value identifiers. These notes will be transcribed into a .txt document that will
be stored with the data file. After all of the data are collected, we will then use EML
(Ecological Metadata Language) to digitize our metadata. EML is one of the accepted
formats used in ecology, and works well for the types of data we will be producing. We
will create these metadata using Morpho software, available through KNB. The
metadata will fully describe the data files and the context of the measurements.
From DataOne – E. affinis DMP example
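The intermediate step in this example, a .txt document stored with the data file that describes columns, units and missing value identifiers, could be sketched as follows. The `Column` class, the `data_dictionary` function, the file name and the `NA` code are all hypothetical; the sketch does not produce EML, which the plan creates separately with Morpho.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """One column of a data file (names and fields are illustrative)."""
    name: str
    description: str
    unit: str = "none"


def data_dictionary(data_file: str, columns: list, missing_value: str) -> str:
    """Build a plain-text description of one data file, to be stored next to it."""
    lines = [
        f"Data file: {data_file}",
        f"Missing value identifier: {missing_value}",
        "",
        "Columns:",
    ]
    for col in columns:
        lines.append(f"  {col.name}: {col.description} (unit: {col.unit})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical columns for illustration only.
    cols = [
        Column("date", "sampling date", "ISO 8601 date"),
        Column("life_stage", "life stage identified under the microscope"),
    ]
    print(data_dictionary("observations.csv", cols, "NA"))
```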
25. Data sharing examples
We will make the data and associated documentation available to users under a data-
sharing agreement that provides for: (1) a commitment to using the data only for research
purposes and not to identify any individual participant; (2) a commitment to securing the data
using appropriate computer technology; and (3) a commitment to destroying or returning the
data after analyses are completed.
From NIH data sharing statements
The videos will be made available via the bristol.ac.uk website (both as streaming media
and downloads) HD and SD versions will be provided to accommodate those with lower
bandwidth. Videos will also be made available via Vimeo, a platform that is already well
used by research students at Bristol. Appropriate metadata will also be provided to the
existing Vimeo standard.
All video will also be available for download and re-editing by third parties. To facilitate this
Creative Commons licenses will be assigned to each item. In order to ensure this usage is
possible, the required permissions will be gathered from participants (using a suitable
release form) before recording commences.
From University of Bristol Kitchen Cosmology DMP
26. Example restrictions
Because the STDs being studied are reportable diseases, we will be collecting
identifying information. Even though the final dataset will be stripped of
identifiers prior to release for sharing, we believe that there remains the
possibility of deductive disclosure of subjects with unusual characteristics. Thus,
we will make the data and associated documentation available to users only
under a data-sharing agreement.
From NIH data sharing statements
1. Share data privately within 1 year.
Data will be held in Private Repository, but metadata will be public
2. Release data to public within 2 years.
Encouraged after one year to release data for public access.
3. Request, in writing, data privacy up to 4 years.
Extensions beyond 3 years will only be granted for compelling cases.
4. Consult with creators of private CZO datasets prior to use.
PIs are required to seek consent before using private data they can access
From Boulder Creek Critical Zone Observatory DMP
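A staged release policy like the one above can be turned into concrete dates. The sketch below assumes the clock starts on the date the data were collected, which the excerpt does not state, and it uses the 1, 2 and 4 year figures from the numbered headings; the function and key names are hypothetical.

```python
from datetime import date


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Only raised for 29 February when the target year is not a leap year.
        return start.replace(year=start.year + years, day=28)


def release_deadlines(collected: date) -> dict:
    """Map each policy step to a deadline, assuming the clock starts at collection."""
    return {
        "share_privately_by": add_years(collected, 1),
        "release_publicly_by": add_years(collected, 2),
        "latest_privacy_extension": add_years(collected, 4),
    }


if __name__ == "__main__":
    for step, deadline in release_deadlines(date(2016, 3, 1)).items():
        print(f"{step}: {deadline.isoformat()}")
```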
27. Archiving examples
The investigators will work with staff at the UKDA to determine what to
archive and how long the deposited data should be retained. Future long-
term use of the data will be ensured by placing a copy of the data into the
repository.
From ICPSR Framework for Creating a DMP
Data will be provided in file formats considered appropriate for long-term
access, as recommended by the UK Data Service. For example, SPSS Portable
format and tab-delimited text for qualitative tabular data and RTF and
PDF/A for interview transcripts. Appropriate documentation necessary to
understand the data will also be provided. Anonymised data will be held
for a minimum of 10 years following project completion, in compliance
with LSHTM’s Records Retention and Disposal Schedule. Biological samples
(output 3) will be deposited with the UK BioBank for future use.
From Writing a Wellcome Trust Data Management and Sharing Plan
28. Share your example DMPs!
Send us links to your
DMPs
We will add them to
the DCC list
Aim to cover wide
range of disciplines
and funders
www.dcc.ac.uk/share-DMPs
29. LESSONS AND RESOURCES
Image ‘Energy Resources | Energie Quelle’ CC-BY-NC by K. H. Reichert www.flickr.com/photos/reupa/19502634575
30. Tips for writing DMPs
Seek advice - consult and collaborate
Consider good practice for your field
Base plans on available skills & support
Make sure implementation is feasible
Think about things early…
31. DCC support on DMPs
Webinars and training materials
How-to guides and other advisory documents
Checklist on what to cover in DMPs
Example DMPs
DMPonline
www.dcc.ac.uk/resources/data-management-plans
32. DMPonline
A web-based tool to help researchers write DMPs
Includes a template for Horizon 2020
https://dmponline.dcc.ac.uk
33. How the tool works
Click to write a
generic DMP
Or choose your
funder to get their
specific template
Pick your uni to
add local
guidance and to
get their template
if no funder
applies
Choose any
additional
optional
guidance
34. Writing plans: features
Ability to leave notes for
collaborators
Custom guidance from
funder, uni, discipline,
group...
Progress indicators
35. Where to find a data repository?
http://databib.org
http://service.re3data.org/search
The EC guidelines point to Re3data as one of the registries
that can be searched to find a home for data
www.fosteropenscience.eu/content/re3data-demo
36. How to select a repository?
Look for provision from your community, university, publisher, funder
etc
Check they match your particular data needs: e.g. formats accepted;
mixture of Open and Restricted Access.
See if they provide guidance on how to cite the deposited data.
Do they assign a persistent & globally unique identifier for sustainable
citations and to link back to particular researchers and grants?
Look for certification as a ‘Trustworthy Digital Repository’ with an
explicit ambition to keep the data available in the long term.
www.openaire.eu/opendatapilot-repository
37. How to license research data
Horizon 2020 guidelines point to CC-BY or CC-0
DCC How-to guide helps you to license data
www.dcc.ac.uk/resources/how-guides/license-research-data
EUDAT licensing wizard helps you pick a licence for data & software
http://ufal.github.io/public-license-selector
38. Metadata standards
Metadata Standards Directory
Broad, disciplinary listing of
standards and tools
Maintained by RDA group
http://rd-alliance.github.io/metadata-directory
Biosharing
A portal of data standards,
databases, and policies
Focused on life, environmental
and biomedical sciences
https://biosharing.org
39. Key messages
Data management is part of good practice whether you plan to
make the data open or not
– it benefits you!
If you plan to share data, consider this from the outset as
decisions made early on affect what you can do later.
The process of planning is the most important aspect of DMPs.
Think about the desired end result and plan for this.
Approach DMPs in whatever way best fits your project. Don’t
just let funder requirements drive things.
EUDAT offers a pan-European solution, providing a generic set of data services. These are being built in close collaboration with user communities.
So let’s begin by looking at the changing data landscape.
A Data Management Plan is often written early in the research process to determine what data will be created and how it will be managed. Sometimes you are asked for a DMP as part of a grant application, but a DMP is useful to write regardless, as it helps to develop consistent procedures from the outset.
You may know the old saying “We do not learn for school, but for life”. For planning and carrying out data management we’d like to encourage a similar attitude in researchers and other stakeholders.
There are lots of reasons to manage research data. You may be required to explain how you will manage your data by your funder or university. Ultimately though, it’s to make your research easier. If data are properly documented and organised, you can stop yourself drowning in irrelevant stuff and find the data when you need it – for example to validate findings. By managing your data you can also more easily share it with others to get more credit and impact.
Well-managed data opens up opportunities for re-use, integration and new science. And RDM is just part of a researcher’s life…
This research data lifecycle is taken from the UK Data Archive. It shows you the different processes and activities you’ll go through. As I’m sure you all know, data has a life beyond the project end.
Depending on your line of work, you may enter the cycle at ‘half past ten’, by re-using existing data, or at 12 o’clock:
Creating data: This is when you’ll design the research, write Data Management Plans, negotiate consent agreements, find any existing data you want to reuse, collect or capture your data and create any associated metadata.
Processing data: When processing your data, you’ll be entering, transcribing, checking, validating and cleaning it. You may also need to anonymise your data, and you should describe it and make sure it’s properly managed and stored.
Analysing data: When you analyse your data you’ll be interpreting it and creating derived data and outputs. You’ll probably also author publications and prepare the data for deposit and sharing.
Preserving data: Data repositories play a key role in preserving data. They will make sure it’s properly stored and archived, migrate the formats and storage medium, and create associated metadata and documentation to explain any changes made.
Access to data: It may be that you share your data via a repository or handle access requests yourself. Either way, you need to establish copyright, decide who can have access and promote the data.
Re-using data: Data can be re-used in follow-up studies, new research and research reviews, to evidence findings, or for teaching and learning. Try to keep an open mind about the different ways in which your data could be re-used and make it as open as possible.
Let’s adopt the perspective of a future data user, maybe yourself: what should your data organisation (folders with data, metadata and documentation) look like at the moment you start sharing outside your team and archiving?
When you are part of a large project which has been going on for some years already, this may be obvious, but for many researchers it isn’t clear from the start.
To answer that broad question, you want to come up, at an early stage, with answers regarding:
Types and formats of data;
New and/or existing;
Expected size;
Metadata;
Documentation;
Software.
It’s no fun to do the exercise by yourself, so use this as a communication opportunity.
Let’s move on to what H2020 requires from DMPs
As you will know, the EC is running a pilot on Open Research Data. In the pilot the EC requires that data are preserved for later use; a DMP should describe the what and the how.
Starting next year, this will hold for all project call areas.
As far as we know, opting out and partially opting out will remain possible as long as it is justified.
Working in a FAIR way can help you to deal with the first part of the previous slide.
It’s becoming an international ambition to make data FAIR. We’ve put suggestions back to the EC and they are reworking the guidelines, in which FAIR concepts will play a role.
As always, namedropping is easy, so you do have to think at an early stage about what complying with the FAIR principles means in your situation.
There are some pointers here to what it means for data to be FAIR.
For the DMP you can use a Word document in your project layout, but you can also use the template within DMPonline.
Here is where you can log in.
From the start, the DCC has offered guidance, independent of funder or discipline. EUDAT and OpenAIRE and others are developing extra guidance as well.
Remember also to give your open data and software a proper licence.
Guidance from the DCC can also help researchers to understand data licensing. This guide outlines the pros and cons of each approach, e.g. the limitations of some CC options.
The OA guidelines under Horizon 2020 point to CC-0 or CC-BY as a straightforward and effective way to make it possible for others to mine, exploit and reproduce the data. See p11 at: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf