Datat and donuts: how to write a data management plan

How to write
a data
management
plan
C. Tobin Magle, PhD
Sept 25, 2017
10:00-11:30 a.m.
Morgan Library Computer
Classroom 175
*inspired by content from CU
Boulder research computing

What is data
management?
The policies, practices and procedures needed to
manage the storage, access and preservation of data
produced from a research project

data management != data sharing
• but the same principles apply to both

Why should I care about data management?
Rinehart, AK. “Getting emotional about data” College & Research Libraries News September 2015 vol. 76 no.
8 437-440

*ok not everything, but most things

More researchers
https://www.nsf.gov/statistics/2016/nsf16300/digest/nsf16300.pdf

See arXiv:1402.4578 for details

Working Email
Data are extant
(If status known)
Status of data
(if response)
Response
(if email
working)
doi:10.1016/j.cub.2013.11.014

We are losing vast amounts of data
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
11
1
1
1
1
1
1
1
0
0
0
0
0
0
0
00
0
00 0
1
1
1 1
1
0

Research funding is tight
http://www.bu.edu/research/articles/funding-for-scientific-research/

Federal agencies advocate OA
https://obamawhitehouse.archives.gov/blog/2017/01/09/making-federal-research-results-
available-all

Private funders require sharing
http://www.gatesfoundation.org/how-we-work/general-information/open-access-policy

It’s good for science
• Improves research reproducibility
• Improves efficiency
• Spurs innovation

It’s good for you
• You are the future data user
• Your data get used (and cited)
• Exposure to collaborators
• More competitive grants

Where does data
management fit into
research?
Throughout the whole research cycle

Hypothesis
Experimental
design
The research cycle

Hypothesis Data
Experimental
design
The research cycle

Hypothesis Data
Experimental
design
Results
The research cycle

Hypothesis Data
Experimental
design
ResultsArticle
The research cycle

Hypothesis Data
Experimental
design
ResultsArticle
Data
Management
Plans
The research cycle

Hypothesis
Raw
data
Experimental
design
Tidy
Data
ResultsArticle
Data
Management
Plans
Cleaning
Analysis
The research cycle

Hypothesis
Raw
data
Experimental
design
Tidy
Data
ResultsArticle
Data
Management
Plans
Cleaning
Sharing
Analysis
Open Data
Closed
Data
Archiving
The research cycle

Hypothesis
Raw
data
Experimental
design
Tidy
Data
ResultsArticle
Data
Management
Plans
Cleaning
Sharing
Analysis
Open Data
Code Reproducible
Research
Closed
Data
Archiving
The research cycle

Hypothesis
Raw
data
Experimental
design
Tidy
Data
ResultsArticle
Data
Management
Plans
Cleaning
Sharing
Analysis
Open Data
Code Reproducible
Research
Reuse
Closed
Data
Archiving
The research cycle

What is research data?
• “The recorded factual material
commonly accepted in the
scientific community as
necessary to validate research
findings”
- White House Office of
Management and Budget
• Reality: anything that is a
(digital) product or your
research

What is a data
management plan?
A description of how you plan to describe, preserve
and share your research data.
Often required by funding agencies

Successful DMPs include
• A data inventory, including type(s) and size
• A strategy for describing the data
• A plan for preserving the data long term
• A method for access to the data
Always make sure to follow funder requirements

Data inventory
• What type of data are you going to collect?
• What file type will be produced?
• What size will these files be? How many files?
• What other research outputs will be produced?
• Code/Software?
• Templates/protocols?

Example
miRNA sequences
FASTQ files
1 GB per file
x 64 strains
x 3 replicates
-------------------
~200 GB
R scripts for
analysis and
visualization
Data use tutorials
• What type of data are you going to collect?
• What file type will be produced?
• What size will these files be? How many files?
• What other research outputs will be produced?
• Code/Software?
• Templates/protocols?

Data formats
• Avoid proprietary formats
• Know what software can read your data
Proprietary Format Alternative Format
Excel (.xls, .xlsx) Comma Separated Values (.csv)
Word (.doc, .docx) plain text (.txt)
PowerPoint (.ppt, .pptx) PDF/A (.pdf)
Photoshop (.psd) TIFF (.tif, .tiff)
Quicktime (.mov) MPEG-4 (.mp4)
MPEG 4 Protected audio (.m4p) MP3 (.mp3)

Exercise: Data Inventory
What kind of data are you going to collect?
What file type will be produced?
What size will these files be? How many files?
What other research outputs will be produced?

A strategy for describing the data
• Metadata: Relevant information
for re-creation and re-use
• Contact info
• How data was collected
• Details about collection
• Date, location of collection
• Units
• Can be as simple as a text file

Genomics example (README)
This project contains next-generation miRNA sequencing data from 64 mouse strains.
Brain tissue from 10 week old male mice were harvested, stored in RNA later. RNA was
extracted using an RNeasy kit, and miRNA libraries were produced using an Illumina kit.
They were run on an Illumina mySeq sequencer. The FASTQ Files produced were analyzed
in R using Bioconductor.
The data and descriptive will be made available on NCBI in the bioproject (PRJXXXX). The
scripts used to analyzed the data are available on github (URL). Tutorials for data use will
be made available in the Digital Collections of Colorado (handle).
Contact Tobin Magle (tobin.magle@colostate.edu) for more information.
http://orcid.org/0000-0003-3185-7034

Metadata standards
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
• Can be applied to anything
• Many discipline specific metadata standards
• EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
• MIAME: http://fged.org/projects/miame/
• Search for other standards:
• http://www.dcc.ac.uk/resources/metadata-standards
• https://fairsharing.org/standards/

Genomics example (NCBI template)

Exercise: Describe your data
What do people need to know to reuse your data?
Are there any discipline-specific metadata standards?
What format will you describe your data in (text, XML, tabular)?
What fields will you include (author, date, format, identifier?)

A plan for preserving the data long term
• What will you do to ensure
data are properly stored and
preserved?
• Include metadata and other
products needed for reuse
• Short vs long term

Recommendations for backing up data
• Store in geographically distinct
locations
• Automation: Will you remember to do it
manually?
• Security: Are you working with PHI?

Preservation questions
• What will you store?
• Who will be in charge?
• How long will you store it?
• Where will you store it?
• Multiple copies

Exercise: Preservation plan
What will you store?
Who will be responsible for the data (person or position)?
How long will you store it?
Where will you store it?
How will you back it up?

A method to access the data
• Important to funding agencies
• Reproduce existing research
• Promote further research
• Must be easily available:
• No “by request only”
• Embargoes are “ok”
• Data security: consider privacy
and IP issues before sharing

Data access and sharing best practices
• Non-proprietary formats
• Include metadata
• Proper storage
• Stable identifier
• Licensing: conditions for reuse

Trusted Repositories: store and share
• Discipline specific repositories
• Search:
http://service.re3data.org/browse/by-
subject/
• Generic:
• Figshare - https://figshare.com/
• Dryad - http://datadryad.org/
• CSU Digital Repository:
• http://lib.colostate.edu/digital-collections/ http://67.media.tumblr.com/6228cbe58a9652f1a85e8a
b1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png

Data archiving service
• Finished products for
sharing
• CSU Digital Repository
• Over 100 Datasets
• Satisfy requirements for
manuscripts and grants
• At no cost <1 TB
• $150/TB for 5 years
• $300/TB for >5 years

Stable identifiers
• URLs break
• Stable identifiers are
permanent in a database
• Some provide linking
capabilities
• DOI –
https://doi.org/10.1109/5.771073
• Handle-
http://hdl.handle.net/10217/177356

Licensing
• State your conditions for reuse
• Paper citation?
• Disclaimers
• Must justify limitations, describe
how you’ll advertise them
• Creative common licenses are a
good starting point

Exercise: Access methods
Where will people be able to access the data?
Does your discipline have a repository?
What kind of stable identifier will it have?
What are the conditions for reuse?
Are there any limitations to use of these data? Why?

DMPTool
• Review requirements from
different agencies
• https://dmptool.org/guidance
• Create new DMPs based on
funding agency templates
• Search public DMPs

Need help?
• Email: tobin.magle@colostate.edu
• DMPTool: http://dmptool.org/
• Data Management Services website:
http://lib.colostate.edu/services/data-management

Datat and donuts: how to write a data management plan

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Datat and donuts: how to write a data management plan

Similar to Datat and donuts: how to write a data management plan (20)

More from C. Tobin Magle

More from C. Tobin Magle (8)

Recently uploaded

Recently uploaded (20)

Datat and donuts: how to write a data management plan

Editor's Notes