Wisconsin Cyberinfrastructure Days
November 5, 2010
Dorothea Salo & Brad Houston
Document describing data (and/or digital
materials) that have been or will be gathered in
a study or project.
Often includes details on how data will be
organized, preserved, and accessed
Facilitates re-use of data sets by either PI or
Required component of grants for MANY
agencies (NSF and NIH)
Starting January 2011 for NEW, non-
Not voluntary – “integral part” of proposal
Data Management Plans for all data resulting
from any level of NSF funding
Supplementary 2-page document (max)
Optional: Also part of 15-page (max) Project
Must address both physical and digital data
“Efficiency and effectiveness” of the DMP will
be considered by NSF and disciplinary division
Must include sufficient information that peer
reviewers and project monitors can assess
present proposal and past performance
“Such dissemination of data is necessary for the
community to stimulate new advances as quickly as
possible and to allow prompt evaluation of the results
by the scientific community.” – NSF (italics mine)
Part of Openness trend in federal government
(data.gov - Open Government Initiative)
NIH Public Access Policy (2008)
Public access to federally funded research hearings
- Information Policy, Census and National Archives
Subcommittee of U.S. Congress (July, 2010)
It makes your research easier!
Data available in case you need it later
Helps avoid accusations of fraud or bad science
To share it for others to use and learn from
To get credit for producing it
To keep from drowning in irrelevant stuff
... especially at grant/project end
Gene expression microarray data: “Publicly
available data was significantly (p=0.006)
associated with a 69% increase in citations,
independently of journal impact factor, date of
publication, and author country of origin.”
Piwowar, Heather et al. “Sharing detailed research
data is associated with increased citation rate.” PLoS
One 2007. DOI: 10.1371/journal.pone.0000308
Maybe there’s an advantage here!
Discuss specific requirements for NSF
Data Management plans
Suggest ways to manage, share, and
archive data more effectively
Provide resources for more information
What data are you collecting or making?
Can it be recreated? How much would that cost?
How much of it? How fast is it growing? Does it
What file format(s)?
What’s your infrastructure for data collection and
How do you find it, or find what you’re looking for
How easy is it to get new people up to speed? Or
share data with others?
Who are the audiences for your data?
You (including Future You), your lab colleagues
(including future ones), your PIs
Disciplinary colleagues, at your institution or at others
Colleagues in allied disciplines
What are your obligations to others?
How do you and your lab get from where you
are to where you need to be?
Document, document, document all decisions and
Secret sauce: the more you strategize upfront,
the less angst and panic later.
“Make it up as you go along” is very bad practice!
But the best-laid plans go agley... so be flexible.
And watch your field! Best practices are still in flux.
All submitted plans must include, at
1. Expected Data: types, physical/electronic collections,
materials to be produced
2. Standards for data and metadata format and content
3. Policies for access and sharing, including provisions for
appropriate protection of privacy, confidentiality,
security, intellectual property, etc.
4. Policies and provisions for re-use, re-distribution, and
the production of derivatives
5. Plans for archiving data, samples, and other research
products, and for preservation of access to them
Four kinds of data defined by OMB:
Observational
Examples: Sensor data, telemetry, survey data, sample
Experimental
Examples: gene sequences, chromatograms, toroid
magnetic field data.
Simulation
Examples: climate models, economic models.
Derived or compiled
Examples: text and data mining, compiled database, 3D
models, data gathered from public documents.
Raw data is included in this definition
Drafts of scientific papers
Plans for future research
Peer reviews or communications with
Physical objects, such as gel samples
As early as possible, but no later than
guidelines laid down by relevant Directorate
Engineering Section: “no later than the acceptance
for publication of the main findings of the final data”
Earth Sciences: “No later than two (2) years after the
data were collected.”
Social and Economic Sciences: “within one year after
the expiration of an award”
Be aware of concerns that may require earlier
or later disclosure
FERPA? Human Subjects data? HIPAA?
Again, specific retention periods will depend
on the type of data and the Directorate
Example: Engineering Section suggests retention
period of “three years after either completion of the
grant project or public release of research data,
whichever is later”
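The "whichever is later" rule in the Engineering example is simple date arithmetic. A minimal sketch, assuming the retention clock starts at the later of the two dates and runs for whole calendar years; the function names and the sample dates are hypothetical, and your own Directorate's guidance governs the real period:

```python
from datetime import date


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February when needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        # Only reached when the source date is 29 February and the target year is not a leap year.
        return day.replace(year=day.year + years, day=28)


def retention_end(project_end: date, public_release: date, years: int = 3) -> date:
    """Return the date `years` after the later of project completion and public data release."""
    return add_years(max(project_end, public_release), years)


if __name__ == "__main__":
    # Hypothetical dates: the data release comes after the project ends, so it sets the clock.
    print(retention_end(date(2013, 8, 31), date(2014, 2, 15)))
```

Treat the result as the earliest date at which disposal could be considered, not as a deadline for deleting anything.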
Certain types of data will need to be retained
Patent data, longitudinal data sets, etc.
Ask: is your data of permanent value?
Analyzed data (incl. images, tables and tables of
numbers used for making graphs)
Metadata that defines how data was generated,
such as experiment descriptions, computer code,
and computer-calculation input
Investigators are expected to preserve/share
primary data, samples, physical collections, &
Provide easily accessible information about data
holdings, including quality assessments and
Data may be made available through submission to a
national data center, publication in a journal, book, or
accessible website, or through institutional archives
Data Management Plans are required even if a
project is not expected to generate data that
DMP should clearly explain non-sharing in light of
community-of-interest (COI) standards (peer review)
Between the lines: Not sharing will require
justification and close scrutiny by NSF
Sharing is preferred
Preparing, sharing, and archiving your data sets
Think about where you will put your data
Local? Network drive? Online data management
Think about how you (or others) will find your
Think about how others may use your data, when
Think about how to store your data in the long
term (or whether to store it long-term at all)
Will anybody be able to read these files at the
end of your time horizon?
Where possible, prefer file formats that are:
In wide use
Easy to data-mine, transform, recast
If you need to transform data for durability,
do it now, not later.
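One way to "do it now" for tabular text exports is to rewrite them as UTF-8 CSV while someone still remembers the original encoding and delimiter. A minimal sketch, assuming a tab-delimited Latin-1 export; the function name, the default encoding, and the default delimiter are illustrative assumptions, not properties of any particular instrument or package:

```python
import csv
from pathlib import Path


def to_utf8_csv(source: Path, target: Path,
                source_encoding: str = "latin-1", delimiter: str = "\t") -> int:
    """Rewrite a delimited text file as UTF-8 CSV and return the number of rows written."""
    rows = 0
    # newline="" lets the csv module handle line endings itself on both sides.
    with source.open("r", encoding=source_encoding, newline="") as src, \
            target.open("w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter=delimiter):
            writer.writerow(row)
            rows += 1
    return rows
```

Keep the original file alongside the converted one until you have checked that nothing was lost in translation.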
Fundamental question: What would someone
unfamiliar with your data need in order to
find, evaluate, understand, and reuse them?
Consider the differences between someone
inside your lab, someone outside your lab but
in your field, and someone outside your field.
Two parts: metadata and methods
About the project
Title, people, key dates, funders and grants
About the data
Title, key dates, creator(s), subjects, rights, included
files, format(s), versions, checksums
Interpretive aids: codebooks, data dictionaries,
Keep this with the data
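The project and dataset fields above can live in a small file stored in the same folder as the data. A minimal sketch, assuming a JSON file named metadata.json and SHA-256 checksums; the file name, the field layout, and the function names are hypothetical choices, not a metadata standard:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to spare memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(data_dir: Path, project: dict, dataset: dict) -> Path:
    """Write metadata.json into data_dir, listing every other file with its checksum."""
    files = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.name != "metadata.json":
            files.append({
                "path": path.relative_to(data_dir).as_posix(),
                "format": path.suffix.lstrip(".") or "unknown",
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    record = {"project": project, "dataset": dataset, "files": files}
    out = data_dir / "metadata.json"
    out.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out
```

A call such as write_metadata(Path("run_01"), {"title": "...", "funder": "..."}, {"title": "...", "creators": ["..."]}) would record the project and dataset fields next to the file list. If your discipline has a metadata standard, prefer that over a home-grown layout like this one.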
Reason #1 for not reusing someone else’s data: “I
don’t know enough about how it was gathered to
Document what you did. (A published article may
or may not be enough.)
Document any limitations of what you did.
If you ran code on the data, document the code and
keep it with the data.
Need a codebook? Or a data dictionary?
If I can’t identify at sight what each bit of your dataset
means, yes, you do need a codebook or data dictionary.
DO NOT FORGET UNITS!
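A data dictionary can be as simple as one entry per column, and a short script can flag columns nobody described or units nobody recorded. A minimal sketch; the column names, descriptions, and function names are invented for illustration:

```python
import csv
import io

# Hypothetical data dictionary: one entry per column, with units spelled out.
DATA_DICTIONARY = {
    "site_id": {"description": "Sampling site identifier", "units": None},
    "temp_c": {"description": "Water temperature", "units": "degrees Celsius"},
    "depth_m": {"description": "Sampling depth", "units": "metres"},
}


def undocumented_columns(csv_text: str, dictionary: dict) -> list:
    """Return the names in the CSV header row that the dictionary does not describe."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in dictionary]


def missing_units(dictionary: dict, unitless: set) -> list:
    """Return columns with no units recorded, apart from those declared unitless."""
    return [name for name, entry in dictionary.items()
            if entry.get("units") is None and name not in unitless]


if __name__ == "__main__":
    sample = "site_id,temp_c,ph\nA,4.2,7.1\n"
    print(undocumented_columns(sample, DATA_DICTIONARY))  # columns needing an entry
    print(missing_units(DATA_DICTIONARY, {"site_id"}))    # columns needing units
```

Declaring the unitless columns explicitly forces a decision about every column, so a missing unit is never just an oversight.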
Your own drive (PC, server, flash drive, etc.)
And if you lose it? Or it breaks?
Somebody else’s drive
Departmental or campus drive
Do they care as much about your data as you do?
What about versioning?
Library motto: Lots Of Copies Keeps Stuff Safe.
Two onsite copies, one offsite copy.
Keep confidentiality and security requirements in
mind, of course
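Copies only keep stuff safe if they match the original. A minimal sketch of the "two onsite, one offsite" habit, assuming all three locations are reachable as ordinary directories (for example, mounted drives); the function names and any paths you pass in are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, destinations: list) -> dict:
    """Copy source into each destination directory and report whether each copy verifies."""
    expected = file_digest(source)
    results = {}
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        # copy2 also preserves timestamps where the file system allows it.
        copy = Path(shutil.copy2(source, directory / source.name))
        results[str(copy)] = file_digest(copy) == expected
    return results
```

Rerunning the digest check on a schedule, not only at copy time, is what catches a copy that has quietly gone bad. For confidential data, each destination has to meet the same security requirements as the original.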
If data need to persist beyond project end, you have to
deal with a new kind of risk: organizational risk.
Servers come and go. So do labs. So do entire departments.
This is especially important if you share data! Don’t let it 404!
You need to find a trustworthy partner.
On campus: try the library or your campus research office. (No,
campus IT is usually not good enough.)
Off campus: look for a disciplinary data repository, or a journal
that accepts data. (It’s a good idea to do this as part of your
Let somebody else worry! You have new projects to get