brian.hole@ubiquitypress.com www.ubiquitypress.com / @ubiquitypress
Data management plans
• Provide a structured approach to data throughout its lifecycle
• Encourage adherence to best practices
• Now mandated by many funders
• US:
  • National Science Foundation (NSF)
  • National Endowment for the Humanities (NEH)
  • National Aeronautics and Space Administration (NASA)
  • National Oceanic and Atmospheric Administration (NOAA)
  • Institute of Museum and Library Services (IMLS)
  • Agency for Healthcare Research and Quality (AHRQ)
  • Gordon & Betty Moore Foundation
  • Alfred P. Sloan Foundation
• UK: Economic and Social Research Council (ESRC)
• Europe: Horizon 2020
• Other international mandates: http://www.sherpa.ac.uk
DMP structure
1. Administrative data
2. Data collection
3. Documentation and metadata
4. Ethics and legal compliance
5. Storage and backup
6. Selection and preservation
7. Data sharing
8. Responsibilities and resources
Source: DCC. (2013). Checklist for a Data Management Plan. v.4.0. Edinburgh: Digital Curation Centre.
Available online: http://www.dcc.ac.uk/resources/data-management-plans
1. Administrative data
Here you should record basic information to identify and
contextualise your plan. Identifiers may help to link your DMP
with information held in other systems. You should include:
• Basic information, e.g. project title, your name, contact details,
reference numbers / IDs
• A summary of the research to explain the purpose for which
data are being collected
• Details of related policies and procedures, e.g. institutional data
policy or departmental guidelines
Source: XKCD, http://xkcd.com/97/
2. Data collection
Here you should consider what data you will collect and how.
• How will you structure and name your folders and files?
• What quality assurance processes will you adopt?
• What standards or methodologies will you use to create data?
• Do your chosen formats and software enable sharing and
long-term access to the data?
• Are there any existing data that you can reuse?
Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1849
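One way to answer the folder and file naming question is to fix a convention up front and generate names in code rather than by hand. The following is a minimal sketch only: the `project_description_date_version` pattern and the `build_filename` helper are assumptions made for illustration, not part of the DCC checklist.

```python
import re
from datetime import date


def build_filename(project: str, description: str, collected: date,
                   version: int, extension: str) -> str:
    """Build a sortable name: project_description_YYYYMMDD_vNN.ext (hypothetical convention)."""

    def slug(text: str) -> str:
        # Lower-case the text and turn runs of other characters into single hyphens.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    suffix = extension.lstrip(".").lower()
    return (f"{slug(project)}_{slug(description)}_"
            f"{collected:%Y%m%d}_v{version:02d}.{suffix}")


print(build_filename("Soil Survey", "pH readings", date(2014, 3, 7), 2, "CSV"))
# prints: soil-survey_ph-readings_20140307_v02.csv
```

Putting the date in YYYYMMDD form and zero-padding the version keeps files in chronological order when sorted by name.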
3. Documentation and metadata
Here you should consider what information is needed for the data
to be read and interpreted in the future. Estimate how much
time and effort will be needed to create this supporting
documentation and ensure that you allow for sufficient resources.
• What documentation and metadata will accompany the data?
• How will you capture / create this documentation and metadata?
• What metadata standards will you use and why?
Source: Gary Larson
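As an illustration of capturing metadata from the beginning, the sketch below keeps a small descriptive record next to the data and checks it for gaps. The field names are borrowed from Dublin Core element names; the choice of which fields to require, the `missing_fields` helper, and the sample values are assumptions for this example, and your discipline may call for a different standard.

```python
import json

# Fields this sketch treats as mandatory (an assumption, not a rule of any standard).
REQUIRED_FIELDS = ("title", "creator", "date", "description", "format", "rights")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a metadata record."""
    return [name for name in REQUIRED_FIELDS
            if not str(record.get(name, "")).strip()]


# Hypothetical record describing one data file.
record = {
    "title": "Soil pH readings, northern transect",
    "creator": "Example Researcher",
    "date": "2014-03-07",
    "description": "Weekly pH measurements at ten sample sites.",
    "format": "text/csv",
    "rights": "",
}

print(json.dumps(record, indent=2))
print("Still to fill in:", missing_fields(record))
```

Running a check like this whenever a file is added makes the time and effort needed for documentation visible early instead of at deposit time.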
4. Ethics and legal compliance
Here you should consider any ethical or legal issues,
particularly in terms of restrictions they may place on
data sharing.
• How will you protect the identity of participants if required?
e.g. via anonymisation
• Will data sharing be postponed / restricted? e.g. to publish
or seek patents
• Have you gained consent for data sharing and preservation?
• How will the data be licensed for reuse?
Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1957
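One common first step towards protecting participant identity is to replace direct identifiers with keyed pseudonyms before data are shared. The sketch below uses Python's standard `hmac` and `hashlib` modules; the `pseudonymise` helper, the sample records and the key are hypothetical. Note that this is pseudonymisation rather than full anonymisation: other columns may still identify people, and the key must be stored separately from the shared data.

```python
import hashlib
import hmac


def pseudonymise(participant_id: str, secret_key: bytes, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    # A shortened hex digest is easier to read; keep it longer for large studies.
    return digest.hexdigest()[:length]


# Hypothetical raw records and key, for illustration only.
records = [
    {"participant": "alice@example.org", "score": 7},
    {"participant": "bob@example.org", "score": 4},
]
key = b"replace-with-a-secret-kept-outside-the-dataset"

shared = [
    {"participant": pseudonymise(row["participant"], key), "score": row["score"]}
    for row in records
]
print(shared)
```

Because the same identifier always maps to the same pseudonym under one key, records for a participant can still be linked across files without exposing who they are.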
5. Storage and backup
Here you should consider where the data will be stored and any
implications this has for backup, access and security.
• What are the risks to data security and how will these be managed?
• Who will be responsible for backup and recovery?
• Do you have sufficient storage or will you need to include charges for
additional services?
• How will you ensure that collaborators can access your data securely?
Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=2237
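A backup is only useful if it matches the original, so one practical check is to compare checksums of the working copy and the backup copy. This is a minimal sketch using the standard library; the `sha256_of` and `find_backup_problems` helpers and the directory names in the usage note are assumptions for illustration, not a recommended backup product or policy.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source_dir: Path, backup_dir: Path) -> list:
    """List source files that are missing from, or differ in, the backup."""
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append(f"changed: {relative.as_posix()}")
    return problems
```

For example, `find_backup_problems(Path("project_data"), Path("/mnt/backup/project_data"))` would return an empty list when every file has an identical backup copy; both paths here are placeholders.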
6. Selection and preservation
Here you should determine which data are of long-term value and should be
preserved. Decide how best to preserve those data, for example by depositing in
repositories.
• What are the foreseeable research uses for your data?
• Which data should be preserved and potentially shared?
• Which data must be retained or destroyed for contractual, legal,
or regulatory purposes?
• What is the long-term preservation plan for the dataset?
• Have you costed in the time and effort required to prepare
the data for preservation and sharing?
Source: XKCD, http://xkcd.com/309/
7. Data sharing
Here you should consider which data you will share and how. The methods
used will depend on a number of factors such as the type, size,
complexity and sensitivity of the data. Also consider how people might
acknowledge the reuse of your data (e.g. via citations) so you gain impact.
• When will you make the data available?
• Are any restrictions on data sharing required?
• With whom will you share the data, and under what conditions?
• What action will you take to overcome or minimise restrictions?
• How will potential users find out about your data?
Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=100
8. Responsibilities and resources
Here you should assign roles and responsibilities for all data management
activities. Also carefully consider any resources needed to deliver your plan.
These costs can usually be written into grant applications but need to be clearly
outlined and justified.
• Who is responsible for implementing the DMP, and ensuring it is reviewed and
revised?
• How will responsibilities be split across partner sites in collaborative
research projects?
• What resources will you require to deliver your plan?
• Is additional specialist expertise or equipment required?
Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1893
Online tools
• DMPOnline (UK Digital Curation Centre)
https://dmponline.dcc.ac.uk/
• DMPTool (California Digital Library)
https://dmp.cdlib.org/
GenBank
• Two upload tools: BankIt for short sequences, Sequin for complex or multiple
sequences
• Sequence data uploaded as a FASTA file
• Immediate or future release instruction
• Citation of a reference paper
• Names of source organisms and any related descriptive data
• Sequence features (e.g. CDS, gene, rRNA, tRNA, with nucleotide intervals and
product names) and topology
• Organism name, applicable source modifiers, location
• Genus and species names (if not previously provided in FASTA file)
• If name is new or unrecognized, provide best known taxonomic lineage
• If genus and/or species names are not known, provide most specific name known
(for example: Bacillus sp., Uncultured bacterium, Uncultured archaeon)
• Most complete name for any synthetic vector (for example: Cloning vector
pAB234, Transfer vector p789Abc)
• Source modifiers include: strain, clone, isolate, specimen-voucher,
isolation-source, country
• Location: organelle (mitochondrion, chloroplast, etc); map and/or chromosome
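To make the FASTA requirement concrete, the sketch below builds a record whose definition line carries the organism name and source modifiers in bracketed `[modifier=value]` form. That bracketed form is assumed here to be the convention accepted by the NCBI submission tools, so check the current submission guidelines before relying on it. The `fasta_record` helper, the sequence identifier, the sequence itself and the modifier values are all hypothetical.

```python
def fasta_record(seq_id: str, sequence: str, modifiers: dict, width: int = 70) -> str:
    """Format one FASTA record with source modifiers on the definition line."""
    tags = " ".join(f"[{name}={value}]" for name, value in modifiers.items())
    header = f">{seq_id} {tags}".rstrip()
    # Wrap the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(lines) + "\n"


# Hypothetical example values.
example = fasta_record(
    "Seq1",
    "ATGCATGCATGCATGCATGC",
    {"organism": "Bacillus sp.", "strain": "AB1", "country": "India"},
    width=10,
)
print(example, end="")
# prints:
# >Seq1 [organism=Bacillus sp.] [strain=AB1] [country=India]
# ATGCATGCAT
# GCATGCATGC
```

Generating the definition lines from a table of sample information avoids retyping organism names and modifiers for every sequence.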
ClinicalTrials.gov
• Web-based data entry system called the Protocol Registration and Results System
(PRS)
• Section 801 of the US Food and Drug Administration Amendments Act of 2007
requires clinical trial registration and the submission of results
• Standard format
• Study Type: ‘Observational’ or ‘Interventional’
• Outcome Measures: The Primary and Secondary Outcome Measure Titles and
Descriptions
• Outcome Measure Time Frame
• Conditions or Focus of the Study
• Intervention Information: Each intervention is entered separately using the
Intervention Type, Name, and Description data elements
• Eligibility: List of key inclusion and exclusion criteria
• Locations
FlyBase
• FlyBase contains a complete annotation of the Drosophila melanogaster genome
• It also includes a searchable bibliography of research on Drosophila genetics
• Detail which genes feature in your paper, and FlyBase will link
your paper to those genes for the next release cycle.
• Provide additional information about your publication during the
submission process to help the curators speed up your curation.
• The whole process takes about 5 minutes!
Dryad
Typical process:
• Authors submit their manuscripts to the journal for consideration.
• Journal provides information about manuscripts to Dryad through automated notices
from the manuscript processing system, which creates a provisional Dryad record for the
data.
• Journal invites authors to archive data in Dryad, through a custom submission link that
brings the author to the provisional record.
• Authors upload their files to Dryad through the submission link supplied by the journal; no
redundant information need be entered and the article details are correct.
• Dryad Curators process and approve the data files and register the Digital Object Identifier
(DOI), a permanent identifier that allows the data to be cited and tracked; curators convey
the DOI to the journal.
• Journal and publisher add the Dryad DOI to all forms of the final article, enabling readers
of the article to access the data.
• Dryad can also provide links to data in other repositories, including sequences in GenBank
and phylogenetic trees in TreeBASE.
• License: CC0
• Cost: $80 / ₹5,000
Nature Scientific Data
• Papers are called “data descriptors”
• Fill out and submit a paper template
• Requires an ISA-tab metadata file
• Quality of data a major focus
• CC-BY/NC
• APC £890 / ₹84,000
Ubiquity Press
1. The paper contents
a. The methods section of the paper must provide
sufficient detail that a reader can understand how
the resource was created.
b. The resource must be correctly described.
c. The reuse section must provide concrete and useful
suggestions for reuse of the resource.
2. The deposited resource
a. The repository must be suitable for the resource
and have a sustainability model.
b. Open license permits unrestricted access (e.g. CC0),
or access guaranteed if criteria met (must qualify)
c. A version in an open, non-proprietary format.
d. Labeled in such a way that a third party can make
sense of it.
e. Must be actionable.
The basics of the model
1) Low barrier data publication
• Data papers are short
• Peer review is quick and objective
• Low APC: £100 / ₹1,000
• No-questions-asked waivers
2) Online authoring
• Lower cost (straight to XML)
• Encourages shorter form
3) Open access only (CC-BY)
4) The publisher is not the repository
11 Best practices for data publication
• Record your methodology well – think about reproducibility
• Make sure you can export to open formats
• Record your selection and QA processes well
• Choose appropriate metadata standards, record from the beginning
• Ensure you obtain proper consent, and that it allows for open
publication if possible
• Consider the timing of data publication – e.g. to coincide with research
papers
• Consider potential reuse scenarios from the start
• Choose an appropriate repository
• Think about possible restrictions and access conditions early – justify
and seek to minimise
• Plan to publish with maximum dissemination – data paper?
• Allocate time and funding for data publication in grant proposals