RDAP13 Renata Curty: What Have Scientists Planned for Data Sharing and Reuse? A C…

  • Random sample of 10% of the target population. After the pilot study (based on just 11 responses with no DMPs), we decided to add a few questions to the survey to get a better sense of respondents’ experiences writing and executing their DMPs. The response rate was affected by other factors, such as wrong or invalid emails, PIs who had changed institutions, PIs on sabbatical or other leave (from whom we got automatic responses), and in a few cases PIs who passed away after receiving the award. Also, some PIs contacted us to explain that they wouldn’t participate because, despite NSF’s mandate, their research does not produce data, or because they did not recall having a DMP. Only 40.24% of participants shared their DMP. Does this tell us anything about willingness to share? In some cases, participants said they did not have the DMP available when they were filling out the survey.
  • Amount: almost half of the awards fall in the $300,000 to $1 million range. Good distribution of respondents across the 7 NSF directorates, with a slightly larger share for Engineering. BIO: Biological Sciences; GEO: Geosciences; CISE: Computer and Information Science and Engineering; EHR: Education and Human Resources; MPS: Mathematical and Physical Sciences.
  • Respondents fall mostly in the 35-44 age range and, as expected, the majority belong to an academic institution.
  • Of these, most are tenured and full professors.
  • The map shows the geographical distribution of our respondents.
  • In 3 questions on a 7-point Likert scale (strongly disagree to strongly agree), participants were asked about the importance of DMPs for formalizing data sharing practices in science. Results show that respondents tend to somewhat agree that DMPs are important, but do not see the DMP as challenging to write or hard to execute in the future.
  • We asked: do you foresee barriers to the reuse of the data your research is producing or will produce? The word cloud shows the most recurrent topics in the comments of respondents who answered yes.
  • Some comment excerpts.
  • Skepticism about enforcement or verification. Disbelief that it will work without a unified platform. Mention that some participants questioned whether a DMP’s execution would ever be verified, because the data will be so dispersed: more paperwork for scientists with little effect on the real intention.
  • This last issue resonated in some other comments across the survey.
  • Reinforce 68 DMPs.
  • Reuse is still barely covered in the DMPs; in some cases there are only very general statements about potential users.
  • Not specified: no time was given, but the DMP says the data will be available. Not mentioned: no reference to when data will be available.
  • More ad hoc procedures. 4th item: the famous saying “Count the chickens before they hatch”.

Transcript

  • 1. What have Scientists Planned for Data Sharing and Reuse? A Content Analysis of NSF Awardees’ Data Management Plans Renata Curty, Youngseek Kim & Dr. Jian Qin Baltimore, 4-5 April 2013
  • 2. Motivation. While the NSF mandate gives researchers plenty of flexibility to define their own DMP and many academic institutions provide DMP writing support, little is known about how scientists address their strategies in their DMPs.
  • 3. Study Design Online Survey: 20 questions Target Population: NSF Awardees from January 18, 2011 to November 5, 2012 - Standard Grants - Total 16065 Random Sample: 1606 cases Pilot Study: 100 Awardees (Survey Reformulation) Final Deployment: 966 awardees, 169 responses (17.5%) and DMPs (68)
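  • 4. Awards Info. [Charts: Amount Awarded; NSF Directorate. Directorate labels: BIO, CISE, EHR, ENG, GEO, MPS, SBE. Shares as extracted, pairing with labels not preserved: 13%, 10%, 16%, 15%, 12%, 16%, 18%.] 166 166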
  • 5. Awardees Info. Age: 65+ 7%; 55-64 19%; 45-54 26%; 35-44 41%; 25-34 7%. Organization Type: Academia 93%. 150 151
  • 6. Awardees Info. Position in Academia. First chart: Tenured 62%; On Tenure Track 25%; Non-Tenure Track 11%; Retired 2%. Second chart: Full Professor 40%; Associate Professor 28%; Assistant Professor 22%; Researcher 6.77%. Others: Dean (3), Professor Emeritus (1), Professor of Practice (1), Lecturer/Instructor (1), Post-Doctoral Fellow (1), Emeritus Senior Scientist, Director, Expert Consultant, Administrative Faculty Position, Chair. 143 138
  • 7. Geographical Distribution. Created with Google Fusion Tables. 109
  • 8. Scale: Strongly disagree, Disagree, Somewhat disagree, Neither agree or disagree, Somewhat agree, Agree, Strongly agree. Percentages are listed as extracted from the charts; their pairing with scale points is not preserved.
    DMP is important to formalize data sharing practices in science (N=166): 10.84%, 10.24%, 22.89%, 33.13%, 13.25%, 3.01%, 6.63%; mean = 4.93, SD = 1.62.
    Writing a DMP for NSF proposal is a challenging task (N=167): 21.56%, 13.77%, 25.75%, 23.35%, 10.18%, 0.40%, 2.99%; mean = 3.89, SD = 1.45.
    DMP is difficult to execute (N=167): 22.75%, 11.38%, 25.75%, 23.35%, 4.79%, 8.98%, 2.99%; mean = 3.79, SD = 1.51.
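The two summary values given for each statement appear to be a mean and a standard deviation on the 1 to 7 coding of the scale (the symbols were lost in extraction). A minimal Python sketch of how such values can be computed from a frequency distribution. The counts below are hypothetical, not the survey's data, and the sample (n - 1) form of the standard deviation is an assumption, since the slides do not say which form was used:

```python
from math import sqrt

# HYPOTHETICAL response counts for one 7-point Likert item
# (1 = Strongly disagree ... 7 = Strongly agree). Not the survey's data.
counts = {1: 5, 2: 10, 3: 15, 4: 30, 5: 25, 6: 10, 7: 5}

n = sum(counts.values())

# Weighted mean of the scale points.
mean = sum(score * k for score, k in counts.items()) / n

# Sample variance with an n - 1 denominator (an assumption, see above).
variance = sum(k * (score - mean) ** 2 for score, k in counts.items()) / (n - 1)
sd = sqrt(variance)

print(f"N = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
```

On this coding, a mean near 5 corresponds to "Somewhat agree" and a mean just under 4 sits slightly on the disagree side of the neutral midpoint.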
  • 9. Types of Data / Documentation of Data
    Types of Data (share of respondents, count):
    3D Models: 13.01% (19)
    Audio Files: 12.33% (18)
    Curriculum Materials: 21.23% (31)
    Data Models: 27.40% (40)
    Field Notes: 26.03% (38)
    Experimental Data: 63.70% (93)
    Images: 36.99% (54)
    Interview Transcripts: 17.12% (25)
    Patient Records: 0.68% (1)
    Samples: 20.55% (30)
    Software: 35.62% (52)
    Spreadsheets: 40.41% (59)
    Video Files: 21.23% (31)
    Others: Computational Models, Surveys, DNA Sequences, Computer Codes, Crowdsourcing Data (Reviews)
    Documentation of Data. Will follow: Disciplinary practices 46%; Research project’s needs 37%; Institutional recommendations/guidelines 17%. 158
  • 10. Challenges Encountered. [Chart labels: None; Which stage(s) of research to share the data; Data Description & Documentation; Lack of guidance from my institution; Level of granularity of data; Lack of guidance from NSF; Appropriate infrastructure to archive/preserve data. Percentages as extracted (only “None 26%” is unambiguous): 26%, 25%, 30%, 29%, 25%, 36%, 41%.] Others: Some projects do not generate data; Conflict between DMP requirement and IRB requirements regarding social and behavioral research data; Conflicts: intellectual property and data protection; Long-term preservation issues; Conflicts: individual/group vs. institutional strategies. 169
  • 11. Data Access & Availability. Access level: Open 45%; Available with some restrictions 51%; Restricted 5%. Channels: By email request 45.52% (61); Personal website 17.91% (24); Research Group/Project Website 51.49% (69); Institutional Repository 20.15% (27); Disciplinary Repository 32.84% (44). Others: “Publications”, “Available to NSF only”. 167
  • 12. Barriers for Data Reuse 164
  • 13. Reuse Issues - Privacy, Anonymity & Confidentiality “IRB restrictions on ability to share even deidentified data. Concern that sharing even deidentified data will discourage participation in the study.” “For myself, no. But for others to use my data, yes: for qualitative data, under IRB requirements for the protection of human subjects around confidentiality and anonymity, DMPs are nearly impossible to implement without perhaps some kind of temporal restriction on them (like, ‘This archive can only be opened in 20 - 30 - 40 years’ or something like that)” “The project involves human subject; so protections have to be put in place that may limit reuse applications in the future.” “HIPAA [Health Insurance Portability and Accountability Act] issues - obtaining self reporting data on human subjects.”
  • 14. Reuse Issues - Context, Time Factor & Documentation “My past data was collected on a unique system built specifically for the research project. Need lots of context to reuse the data.” “The only problems I see is that data can be taken out of context in a way that produces results that might not be correct.” “Data is specific to testing scenarios. The insight gleaned from our experimental data is of more importance than the data itself.” “My data is for specific purposes and it is hard to conceive of how someone would use it for something else/different. Even with a significant amount of metadata it would be difficult for someone to know all the circumstances under which the data was collected and why it was collected.” “All scientific data is collected in particular context. Mechanisms that facilitate the description of that context are lacking. The creation of metadata that provides this information is a cumbersome, boring task and there are few resources available to ease the burden.”
  • 15. Reuse Issues - Format, Tools, Infrastructure Interoperability & Standards “Systems are always changing...It would be best if we could upload data to NSF so that it will be publicly available in the same way NIST [National Institute of Standards and Technology] publishes data.” “Our raw data formats are extremely large, and need to be compressed into reduced, on-line archives for sharing. It is not possible for me as an individual PI to archive the raw data for others to examine.” “My data is generally related to large software artifacts, so using it could involve quite a bit of work to get those artifacts running. This is something that I explicitly try to come up with solutions for in my DMPs.” “Until NSF provides a free national repository for data archiving, we will not make progress in this area. If such an archive was available, it would be sensible to require researchers to place data there at the end of a grant and would allow other researchers to take advantage of it in a practical way.”
  • 16. DMPs – Preliminary Content Analysis. Coding Scheme: used both deductive and inductive approaches; 35 codes; NSF DMP Policy and University of Virginia’s Guideline; emerged from DMP statements. Data Analysis Procedure: a total of 766 utterances were identified; 642 unique utterances.
  • 17. DMPs’ Content <Wordle Cloud Generated Based on Numbers of Each Code across the 68 DMPs>
  • 18. Coding Scheme
    Types of Data: What to Generate; What Data Types; How to Create; Where to Get Existing Data
    Metadata Standards: Data Format; Metadata Form; How to Create Metadata; Which Metadata Standard; Contextual Details Needed; Discoverability of the Data
    Data Access & Sharing Process: When Available; How Available; What Available; Process for Gaining Access; How Long Retain the Right; Embargo Period; Ethical/Privacy Issues; Compliance with IRB Protocol; Whose Intellectual Property
    Data Archiving Plan: Strategy for Archiving Data; Which Repository; Procedures for Long-Term Storage; Data Preservation Period; What Data Preserved for Long-Term; Transformation Required; Data Documentation; Related Information
    Data Reuse Plan: Reusability of the Data; Restrictions to Access; Groups Interested In; Foreseeable Uses/Users
    Others: Data Lifecycle; Data Curation; Budget
  • 19. Types of Data. Codes (Freq.): Examples
    What to Generate (58): Geochemical Data, Physical Samples, Mathematica (programming) Code, Course Materials
    What Data Types (37): Gene Sequences, Experimental Data, Interview Transcript, Video Recordings
    How to Create Data (25): Experimental Setup, Field Observation, Simulation, Survey, Interviews
    Where to Get Existing Data (13): Moore Laboratory of Zoology, ArcView/GIS Inventories, Prior Study’s Database
    Metadata Standard. Codes (Freq.): Examples
    Data Format (38): CSV file, TEMPO data file, XML format, SPSS file, plain text
    Metadata Form (31): ArcGIS Metadata file, XML-base standard file, GIS database file
    How to Create Metadata (14): Use existing metadata standards, or develop their own metadata standards
    Which Metadata Standard (15): Dublin Core, DNA Sequence Metadata, EML (Ecological Metadata Language)
    Contextual Details Needed (10): All aspects of the development project documented, experimental procedure record
    Data Discoverability (7): Searches Built into Library, Searchable through Project Website
  • 20. Data Access & Sharing Process. Codes (Freq.): Examples
    When Available (28): Post-Publication, Post-Project, After Data Collection
    How Available (37): Upon Request, Project Website, GMOD CHADO databases, Institutional Repository
    What Available (33): Original research data (genome assemblies), survey data, educational materials
    Process for Gaining Access (25): Email Request, Material Transfer Agreement, Direct Access from Web or Repository
    How Long Retain the Right (18): Withhold until Publication, Years after Project Ends, Years after Data Production
    Embargo Period (5): Years after data collection, Period for commercialization
    Ethical/Privacy Issues (21): Privacy information is not available for public
    Compliance with IRB Protocol (13): IRB application submission for human subject research
    Whose Intellectual Property (17): Property of the PI and Co-PIs, Institutions, Open-Access
  • 21. Data Archiving. Codes (Freq.): Examples
    Strategy for Archiving Data (31): Hosted on the Web Servers at (university), ICPSR, disciplinary data repository
    Which Repository (55): Organization website, institutional or discipline data repository
    Procedures for Long-Term Storage (33): Submitted to databanks including NCBI GEO, Genbank, DataONE, Dryad
    Data Preservation Period (11): Minimum of five years post-grant funding, Long-term preservation through disciplinary data repositories
    What Data Preserved for Long-Term (7): All data and materials generated by this award, Genome Sequencing Data
    Transformation Required (4): Keeping raw image data in its uncompressed form, transferred to IRI format
    Data Documentation Submitted (11): Contextual details about experimental procedures, all aspects of the development project
    Related Information Submitted (3): Metadata files, proposed study information, companion web page
  • 22. Data Reuse Plan. Codes (Freq.): Examples
    Reusability of the Data (6): Descriptions about reusable methods (Used by a research community to follow-up)
    Restrictions to Access (6): Access allowed for a certain group of researchers
    Groups Interested In (8): Wider research community studying the Great Lakes, academic geography organizations, and geography teacher associations
    Foreseeable Uses/Users (10): Available to engineers, clinicians, and medical researchers, sociologists and psychologists working in relevant sub-fields.
    Others. Codes (Freq.): Examples
    Data Lifecycle (1): Application of the Life Cycle Inventory databases
    Data Curation (4): Curation (Consortiums and Partnerships)
    Budget (9): Institution will absorb costs, no incremental costs, marginal costs
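The code frequencies on slides 19 to 22 can be totaled per category to compare how much coded content each part of the DMPs received. A minimal Python sketch, assuming the frequencies are read correctly from the flattened slide tables (the category totals do not depend on which code each count belongs to, only on which slide section it sits in):

```python
# Code frequencies per category, as read from slides 19-22.
# ASSUMPTION: counts are transcribed from garbled slide tables.
frequencies = {
    "Types of Data": [58, 37, 25, 13],
    "Metadata Standard": [38, 31, 14, 15, 10, 7],
    "Data Access & Sharing Process": [28, 37, 33, 25, 18, 5, 21, 13, 17],
    "Data Archiving": [31, 55, 33, 11, 7, 4, 11, 3],
    "Data Reuse Plan": [6, 6, 8, 10],
    "Others": [1, 4, 9],
}

totals = {category: sum(freqs) for category, freqs in frequencies.items()}

# Print categories from most to least coded content.
for category, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<32}{total:>4}")
```

Under this reading, Data Access & Sharing Process has the most coded utterances (197) and Data Reuse Plan has only 30, which fits the speaker note that reuse is barely covered in the DMPs.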
  • 23. Data Available. [Bar chart. Categories: After data collection; After project ends; After publication; Years after data collection; Years after project ends; Years after publication; Not Specified; Not Mentioned. Bar values as extracted, pairing with categories not preserved: 27, 13, 10, 8, 3, 3, 3, 1.]
  • 24. Types of Data Repositories for Long-Term Archiving. [Bar chart. Categories: Disciplinary Repository; External/Commercial Storage; Institutional Repository; Internal/Institutional Storage; Journal Repository/Supplement; Lab/Organization Website; Not mentioned/Specified. Bar values as extracted, pairing with categories not preserved: 14, 13, 13, 11, 11, 4, 2.]
  • 25. Some insights – DMPs’ Preliminary Analysis. More informal/personal data sharing procedures rather than formal/institutionalized data sharing and management plans. Most DMPs lack content on “Metadata Standard” and “Data Reuse Plan”. Few have plans for long-term archiving. Very vague plans and ideas about long-term use of their data. Many DMPs addressed data archiving in institutional repositories that do not exist yet but are expected to be created. A few DMPs mentioned that interview transcripts will be available, but without addressing IRB issues.
  • 26. Future Directions. Survey a larger number of awardees. More exhaustive coding analysis and in-depth exploration of the DMPs’ content. Analysis of DMPs to identify patterns, common challenges and best practices across and within different disciplinary communities.
  • 27. Thank you! rcurty@syr.edu Let’s Go Orange!