1. What Have Scientists Planned for Data Sharing and Reuse?
A Content Analysis of NSF Awardees’ Data
Management Plans
Renata Curty, Youngseek Kim & Dr. Jian Qin
Baltimore, 4-5 April 2013
2. Motivation
While the NSF mandate gives researchers plenty of flexibility to define their own DMPs, and many academic institutions provide DMP writing support, little is known about how scientists actually address their data management and sharing strategies in their DMPs.
3. Study Design
Online Survey: 20 questions
Target Population: NSF awardees of standard grants from January 18, 2011 to November 5, 2012 (16,065 in total)
Random Sample: 1606 cases
Pilot Study: 100 Awardees (Survey Reformulation)
Final Deployment: 966 awardees; 169 responses (17.5%) and 68 DMPs collected
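A minimal sketch of the arithmetic behind these figures; the counts come from this slide, and the derived rates match those discussed in the notes at the end of the deck:

```python
# Sampling and response-rate arithmetic for the study design above.
total_awardees = 16065                        # NSF standard-grant awardees, Jan 18, 2011 - Nov 5, 2012
random_sample = int(total_awardees * 0.10)    # 10% random sample -> 1606 cases
deployed = 966                                # final survey deployment
responses = 169
dmps_shared = 68

print(f"response rate: {responses / deployed:.1%}")                          # ~17.5%
print(f"respondents who also shared a DMP: {dmps_shared / responses:.1%}")   # ~40.2%
```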
4. Awards Info
<Two pie charts (N = 166 each): amount awarded and NSF directorate. Awards were spread fairly evenly across the seven directorates (BIO, CISE, EHR, ENG, GEO, MPS, SBE), each accounting for roughly 10-18% of respondents.>
5. Awardees Info
Age: 25-34, 7%; 35-44, 41%; 45-54, 26%; 55-64, 19%; 65+, 7%
Organization Type: Academia, 93%
(N = 150 and 151)
6. Awardees Info
Position in Academia (N = 143): Full Professor, 40%; Associate Professor, 28%; Assistant Professor, 22%; Researcher, 6.77%
Others: Dean (3), Professor Emeritus (1), Professor of Practice (1), Lecturer/Instructor (1), Post-Doctoral Fellow (1), Emeritus Senior Scientist, Director, Expert Consultant, Administrative Faculty Position, Chair
Tenured (N = 138): Tenured, 62%; On Tenure Track, 25%; Non-Tenure Track, 11%; Retired, 2%
8. DMP is important to formalize data sharing practices in science (N = 166): M = 4.93, SD = 1.62
Writing a DMP for an NSF proposal is a challenging task (N = 167): M = 3.89, SD = 1.45
DMP is difficult to execute (N = 167): M = 3.79, SD = 1.51
Responses on a 7-point scale: strongly disagree, disagree, somewhat disagree, neither agree nor disagree, somewhat agree, agree, strongly agree
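The M and SD values above are presumably computed over responses coded 1 (strongly disagree) through 7 (strongly agree); a minimal sketch of that computation on a hypothetical response vector:

```python
import statistics

# Hypothetical 7-point Likert responses, coded 1 = strongly disagree ... 7 = strongly agree.
responses = [5, 4, 6, 3, 5, 7, 4, 5, 2, 6]

mean = statistics.mean(responses)    # the "M" reported above
sd = statistics.stdev(responses)     # the "SD" reported above (sample standard deviation)
print(f"M = {mean:.2f}, SD = {sd:.2f}")
```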
9. Types of Data / Documentation of Data
Types of data to be produced (% of respondents; count):
3D Models: 13.01% (19)
Audio Files: 12.33% (18)
Curriculum Materials: 21.23% (31)
Data Models: 27.40% (40)
Field Notes: 26.03% (38)
Experimental Data: 63.70% (93)
Images: 36.99% (54)
Interview Transcripts: 17.12% (25)
Patient Records: 0.68% (1)
Samples: 20.55% (30)
Software: 35.62% (52)
Spreadsheets: 40.41% (59)
Video Files: 21.23% (31)
Others: Computational Models, Surveys, DNA Sequences, Computer Codes, Crowdsourcing Data (Reviews)
Documentation of data will follow: disciplinary practices (46%), the research project's needs (37%), institutional recommendations/guidelines (17%)
N = 158
10. Challenges Encountered (N = 169)
Appropriate infrastructure to archive/preserve data: 41%
Lack of guidance from NSF: 36%
Data description & documentation: 30%
Lack of guidance from my institution: 29%
None: 26%
Which stage(s) of research to share the data: 25%
Level of granularity of data: 25%
Others: some projects do not generate data; conflict between DMP requirements and IRB requirements regarding social and behavioral research data; conflicts between intellectual property and data protection; long-term preservation issues; conflicts between individual/group and institutional strategies
11. Data Access & Availability (N = 167)
Planned access: Open, 45%; Available with some restrictions, 51%; Restricted, 5%
How the data will be made available (% of respondents; count):
By email request: 45.52% (61)
Personal website: 17.91% (24)
Research group/project website: 51.49% (69)
Institutional repository: 20.15% (27)
Disciplinary repository: 32.84% (44)
Others: “Publications”, “Available to NSF only”
13. Reuse Issues - Privacy, Anonymity & Confidentiality
“IRB restrictions on ability to share even deidentified data. Concern that sharing
even deidentified data will discourage participation in the study.”
“For myself, no. But for others to use my data, yes: for qualitative data, under IRB
requirements for the protection of human subjects around confidentiality and
anonymity, DMPs are nearly impossible to implement without perhaps some
kind of temporal restriction on them (like, ‘This archive can only be opened in 20 -
30 - 40 years’ or something like that)”
“The project involves human subject; so protections have to be put in place that
may limit reuse applications in the future.”
“HIPAA [Health Insurance Portability and Accountability Act] issues - obtaining self reporting data on human subjects.”
14. Reuse Issues - Context, Time Factor & Documentation
“My past data was collected on a unique system built specifically for the research project.
Need lots of context to reuse the data.”
“The only problems I see is that data can be taken out of context in a way that produces
results that might not be correct.”
“Data is specific to testing scenarios. The insight gleaned from our experimental data is of
more importance than the data itself.”
“My data is for specific purposes and it is hard to conceive of how someone would use it for
something else/different. Even with a significant amount of metadata it would be difficult for
someone to know all the circumstances under which the data was collected and why it was
collected.”
“All scientific data is collected in particular context. Mechanisms that facilitate the description
of that context are lacking. The creation of metadata that provides this information is a
cumbersome, boring task and there are few resources available to ease the burden.”
15. Reuse Issues - Format, Tools, Infrastructure
Interoperability & Standards
“Systems are always changing... It would be best if we could upload data to NSF so that it will be publicly available in the same way NIST [National Institute of Standards and Technology] publishes data.”
“Our raw data formats are extremely large, and need to be compressed into
reduced, on-line archives for sharing. It is not possible for me as an individual PI to
archive the raw data for others to examine.”
“My data is generally related to large software artifacts, so using it could involve
quite a bit of work to get those artifacts running. This is something that I explicitly
try to come up with solutions for in my DMPs.”
“Until NSF provides a free national repository for data archiving, we will not make
progress in this area. If such an archive was available, it would be sensible to
require researchers to place data there at the end of a grant and would allow other
researchers to take advantage of it in a practical way.”
16. DMPs – Preliminary Content Analysis
• Coding Scheme
Both deductive and inductive approaches were used, yielding 35 codes
Deductive codes were drawn from the NSF DMP policy and the University of Virginia's DMP guideline
Inductive codes emerged from the DMP statements themselves
• Data Analysis Procedure
A total of 766 utterances were identified, 642 of them unique
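As an illustration of the tallying step, here is a minimal, hypothetical sketch of how coded utterances could be deduplicated and counted per code; the records and labels below are illustrative, not the study's actual data or instrument:

```python
from collections import Counter

# Hypothetical coded utterances: (DMP id, code, utterance text).
coded_utterances = [
    ("dmp_01", "What to Generate", "We will generate geochemical data and physical samples."),
    ("dmp_01", "Which Repository", "Data will be deposited in our institutional repository."),
    ("dmp_02", "Which Repository", "Data will be deposited in our institutional repository."),
    ("dmp_02", "Data Format", "Tabular data will be stored as CSV files."),
]

# Deduplicate identical utterances (in the study, 642 of the 766 utterances were unique).
unique_utterances = {text for _, _, text in coded_utterances}

# Tally how often each code was applied across all DMPs.
code_counts = Counter(code for _, code, _ in coded_utterances)

print(len(coded_utterances), "utterances,", len(unique_utterances), "unique")
print(code_counts.most_common())
```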
17. DMPs’ Content
<Word cloud (Wordle) generated from the frequency of each code across the 68 DMPs>
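One way such a cloud could be produced is by feeding per-code tallies to a word-cloud library; this sketch assumes the third-party wordcloud and matplotlib packages and uses a handful of frequencies that echo the counts on the following slides:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Per-code frequencies (a small, illustrative subset of the 35 codes).
code_counts = {
    "What to Generate": 58,
    "Which Repository": 55,
    "Data Format": 38,
    "How Available": 37,
    "Ethical/Privacy Issues": 21,
}

# Scale each code's size by how often it appears across the 68 DMPs.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(code_counts)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```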
18. Coding Scheme
Types of Data: What to Generate; What Data Types; How to Create Data; Where to Get Existing Data
Metadata Standards: Data Format; Metadata Form; How to Create Metadata; Which Metadata Standard; Contextual Details Needed; Discoverability of the Data
Data Access & Sharing Process: When Available; How Available; What Available; Process for Gaining Access; How Long Retain the Right; Embargo Period; Ethical/Privacy Issues; Compliance with IRB Protocol; Whose Intellectual Property
Data Archiving Plan: Strategy for Archiving Data; Which Repository; Procedures for Long-Term Storage; Data Preservation Period; What Data Preserved for Long-Term; Transformation Required; Data Documentation; Related Information
Data Reuse Plan: Reusability of the Data; Restrictions to Access; Groups Interested In; Foreseeable Uses/Users
Others: Data Lifecycle; Data Curation; Budget
19. Types of Data
Code (frequency): examples
What to Generate (58): Geochemical data, physical samples, Mathematica (programming) code, course materials
What Data Types (37): Gene sequences, experimental data, interview transcripts, video recordings
How to Create Data (25): Experimental setup, field observation, simulation, survey, interviews
Where to Get Existing Data (13): Moore Laboratory of Zoology, ArcView/GIS inventories, prior study's database
Metadata Standard
Code (frequency): examples
Data Format (38): CSV file, TEMPO data file, XML format, SPSS file, plain text
Metadata Form (31): ArcGIS metadata file, XML-based standard file, GIS database file
How to Create Metadata (14): Use existing metadata standards, or develop their own metadata standards
Which Metadata Standard (15): Dublin Core, DNA sequence metadata, EML (Ecological Metadata Language)
Contextual Details Needed (10): All aspects of the development project documented, experimental procedure records
Data Discoverability (7): Searches built into library, searchable through project website
20. Data Access & Sharing Process
Code (frequency): examples
When Available (28): Post-publication, post-project, after data collection
How Available (37): Upon request, project website, GMOD CHADO databases, institutional repository
What Available (33): Original research data (genome assemblies), survey data, educational materials
Process for Gaining Access (25): Email request, material transfer agreement, direct access from web or repository
How Long Retain the Right (18): Withhold until publication, years after project ends, years after data production
Embargo Period (5): Years after data collection, period for commercialization
Ethical/Privacy Issues (21): Privacy information is not available for public
Compliance with IRB Protocol (13): IRB application submission for human subject research
Whose Intellectual Property (17): Property of the PI and Co-PIs, institutions, open access
21. Data Archiving
Code (frequency): examples
Strategy for Archiving Data (31): Hosted on the web servers at (university), ICPSR, disciplinary data repository
Which Repository (55): Organization website, institutional or discipline data repository
Procedures for Long-Term Storage (33): Submitted to databanks including NCBI GEO, GenBank, DataONE, Dryad
Data Preservation Period (11): Minimum of five years post-grant funding, long-term preservation through disciplinary data repositories
What Data Preserved for Long-Term (7): All data and materials generated by this award, genome sequencing data
Transformation Required (4): Keeping raw image data in its uncompressed form, transferred to IRI format
Data Documentation Submitted (11): Contextual details about experimental procedures, all aspects of the development project
Related Information Submitted (3): Metadata files, proposed study information, companion web page
22. Data Reuse Plan
Code (frequency): examples
Reusability of the Data (6): Descriptions about reusable methods (used by a research community to follow up)
Restrictions to Access (6): Access allowed for a certain group of researchers
Groups Interested In (8): Wider research community studying the Great Lakes, academic geography organizations, and geography teacher associations
Foreseeable Uses/Users (10): Available to engineers, clinicians, and medical researchers, sociologists and psychologists working in relevant sub-fields
Others
Code (frequency): examples
Data Lifecycle (1): Application of the Life Cycle Inventory databases
Data Curation (4): Curation (consortiums and partnerships)
Budget (9): Institution will absorb costs, no incremental costs, marginal costs
23. Data Available
<Bar chart: when data will be made available, across the 68 DMPs. Categories: after data collection, after project ends, after publication, years after data collection, years after project ends, years after publication, not specified, not mentioned; bar counts of 27, 13, 10, 8, 3, 3, 3, and 1.>
25. Some insights – DMPs’ Preliminary Analysis
More informal/personal data sharing procedures than formal/institutionalized data sharing and management plans
Most DMPs lack content on “Metadata Standard” and “Data Reuse Plan”
Few have plans for long-term archiving, and plans and ideas about long-term use of the data are very vague
Many DMPs addressed data archiving in institutional repositories that do not yet exist but are expected to be created
A few DMPs mentioned that interview transcripts will be available, without addressing IRB issues
26. Future Directions
Survey a larger number of Awardees
More exhaustive coding analysis and in-depth
exploration of the DMPs’ content
Analysis of DMPs to identify patterns, common
challenges and best practices across and within
different disciplinary communities
Random sample of 10% of the target population. After the pilot study (based on just 11 responses and no DMPs) we decided to add a few questions to the survey, to get a better sense of awardees' experiences with writing and executing their DMPs. The response rate was also affected by other factors: wrong or invalid email addresses; PIs who had changed institutions; sabbaticals or other leaves, for which we received automatic replies; and, in a few cases, PIs who had passed away after receiving the award. Some PIs also contacted us to explain that they would not participate because, despite NSF's mandate, their research does not produce data, or because they did not recall having a DMP. Only 40.24% of participants shared their DMP. Does that say anything about willingness to share? In some cases, participants said the DMP was simply not at hand when they filled out the survey.
Amount awarded: almost half of the awards fall in the $300,000 to $1,000,000 range. Good distribution of respondents across the seven NSF directorates, with a slightly larger share for Engineering. BIO: Biological Sciences; CISE: Computer and Information Science and Engineering; EHR: Education and Human Resources; ENG: Engineering; GEO: Geosciences; MPS: Mathematical and Physical Sciences; SBE: Social, Behavioral and Economic Sciences.
Respondents fall mostly in the 35-44 age range and, as expected, the majority belong to an academic institution.
Of these, most are tenured and full professors.
The map shows the geographical distribution of our respondents.
In three questions using a 7-point Likert format (strongly disagree to strongly agree), participants were asked about the importance of DMPs for formalizing data sharing practices in science. Results show that respondents tend to somewhat agree with the importance, but do not see the process of writing the DMP as challenging or the plan as hard to execute.
When we asked, “Do you foresee barriers to the reuse of the data your research is/will be producing?”, the word cloud shows the most recurrent topics in respondents' comments when they answered affirmatively.
Some excerpts from the comments.
Skepticism about enforcement and verification; disbelief that this will work without a unified platform. Some participants questioned whether the DMP's execution would ever be verified, because the data will be so dispersed: more paperwork for scientists with little effect on the real intention.
This last issue resonated in several other comments across the survey.
Reinforce: 68 DMPs.
Reuse is still covered very little in the DMPs; in some cases there are only very general statements about potential users.
Not specified: a time frame was not provided, but the DMP says the data will be available. Not mentioned: no reference to when the data will be available.
More ad hoc procedures. Fourth item: the famous saying about counting chickens before they hatch.