Brad Houston, University Records Officer, November 30, 2012
Document describing data (and/or digital materials) that have been or will be gathered in a study or project. Often includes details on how data will be organized, preserved, and accessed Facilitates re-use of data sets by either PI or other researchers Required component of grants for MANY agencies (NSF and NIH)
Starting January 2011 for NEW, non-collaborative proposals Not voluntary – “integral part” of proposal Data Management Plans for all data resulting from any level of NSF funding Supplementary 2-page document (max) Optional: Also part of 15-page (max) Project Description
Must address both physical and digital data “Efficiency and effectiveness” of the DMP will be considered by NSF and disciplinary division or directorate Must include sufficient information that peer reviewers and project monitors can assess present proposal and past performance As of January 2011, proposals will not be accepted without an accompanying data plan!
“Such dissemination of data is necessary for the community to stimulate new advances as quickly as possible and to allow prompt evaluation of the results by the scientific community.” – NSF (italics mine) Part of Openness trend in federal government (data.gov - Open Government Initiative) NIH Public Access Policy (2008) Public access to federally funded research hearings - Information Policy, Census and National Archives Subcommittee of U.S. Congress (July, 2010)
It makes your research easier! Data available in case you need it later Helps avoid accusations of fraud or bad science To share it for others to use and learn from To get credit for producing it To keep from drowning in irrelevant stuff ... especially at grant/project end
Gene expression microarray data: “Publicly available data was significantly (p=0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin.” Piwowar, Heather et al. “Sharing detailed research data is associated with increased citation rate.” PLoS One 2007. DOI: 10.1371/journal.pone.0000308 Maybe there’s an advantage here!
Discuss specific requirements for NSF Data Management plans Suggest ways to manage, share, and archive data more effectively Provide resources for more information
What data are you collecting or making? Can it be recreated? How much would that cost? How much of it? How fast is it growing? Does it change? What file format(s)? What’s your infrastructure for data collection and storage like? How do you find it, or find what you’re looking for in it? How easy is it to get new people up to speed? Or share data with others?
Who are the audiences for your data? You (including Future You), your lab colleagues (including future ones), your PIs Disciplinary colleagues, at your institution or at others Colleagues in allied disciplines The world! What are your obligations to others? Funder requirements Confidentiality issues IP questions Security
How do you and your lab get from where you are to where you need to be? Document, document, document all decisions and all processes! Secret sauce: the more you strategize upfront, the less angst and panic later. “Make it up as you go along” is very bad practice! But the best-laid plans go agley... so be flexible. And watch your field! Best practices are still in flux.
Four kinds of data defined by OMB:
Observational. Examples: sensor data, telemetry, survey data, sample data, neuroimages.
Experimental. Examples: gene sequences, chromatograms, toroid magnetic field data.
Simulation. Examples: climate models, economic models.
Derived or compiled. Examples: text and data mining, compiled database, 3D models, data gathered from public documents.
Preliminary analyses Raw data is included in this definition Drafts of scientific papers Plans for future research Peer reviews or communications with colleagues Physical objects, such as gel samples
As early as possible, but no later than guidelines laid down by the relevant Directorate:
Engineering Section: “no later than the acceptance for publication of the main findings of the final data”
Earth Sciences: “No later than two (2) years after the data were collected.”
Social and Economic Sciences: “within one year after the expiration of an award”
Be aware of concerns that may require earlier or later disclosure: FERPA? Human Subjects data? HIPAA?
Again, specific retention periods will depend on the type of data and the grant program Example: NSF Engineering Section suggests retention period of “three years after either completion of the grant project or public release of research data, whichever is later” Certain types of data will need to be retained longer Patent data, longitudinal data sets, etc. Ask: is your data of permanent value?
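The "whichever is later" rule quoted above is simple date arithmetic. A minimal sketch, assuming the three-year Engineering example applies to your award (the function name and the sample dates are hypothetical; confirm the actual period with your Program Officer):

```python
from datetime import date


def retention_end(project_completed: date, data_released: date, years: int = 3) -> date:
    """Return the date `years` after the later of the two milestone dates."""
    start = max(project_completed, data_released)
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


# Hypothetical dates: the grant ended before the data were released,
# so the release date starts the retention clock.
print(retention_end(date(2014, 8, 31), date(2015, 3, 15)))  # 2018-03-15
```

Data of permanent value (patent data, longitudinal sets) would not use a fixed end date at all.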
Analyzed data (incl. images, tables and tables of numbers used for making graphs) Metadata that defines how data was generated, such as experiment descriptions, computer code, and computer-calculation input
Investigators are expected to preserve/share primary data, samples, physical collections, & supporting materials Provide easily accessible information about data holdings, including quality assessments and guidance/finding aids Data may be made available through submission to national data center, publication in journal, book, or accessible website of institutional archives
All submitted plans must include, at minimum:
1. Expected Data: types, physical/electronic collections, materials to be produced
2. Standards for data and metadata format and content
3. Policies for access and sharing, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, etc.
4. Policies and provisions for re-use, re-distribution, and the production of derivatives
5. Plans for archiving data, samples, and other research products, and for preservation of access to them
In short: What kind of data will be produced by your research processes? Keep in mind: File formats of complete data sets Any software or code that will be needed/produced Physical samples or other individual data points Some divisions require retention of physical samples; consult your Program Officer
In short: how will you organize your data within datasets to make it widely accessible, and how will you make data sets identifiable? Keep in mind: Any data formatting standards for your particular discipline Any metadata (author, date, subject, etc.) that your program attaches automatically, and what you will need to attach manually How will you find your data for later consultation? How will others find it?
In short: How will you allow other researchers to find and use your data? Keep in mind: How will other researchers find your data? (i.e. How will you publicize its existence?) How will you provide access to your data? (CD-RW? Data Repository? Download via pantherFILE?) How will you prepare your data for sharing? Do you need to depersonalize or declassify anything?
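One way to picture "depersonalizing" a tabular data set before sharing is sketched below. It is illustrative only: the function name, column names, and salt are hypothetical, and dropping or hashing columns does not by itself satisfy HIPAA, FERPA, or human-subjects requirements.

```python
import csv
import hashlib
import io


def depersonalize(csv_text, drop, pseudonymize, salt):
    """Drop the `drop` columns and replace values in `pseudonymize` columns
    with a short salted hash, so rows stay linkable without exposing the ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in drop]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        cleaned = {name: row[name] for name in kept}
        for name in pseudonymize:
            if name in cleaned:
                digest = hashlib.sha256((salt + cleaned[name]).encode("utf-8"))
                cleaned[name] = digest.hexdigest()[:12]
        writer.writerow(cleaned)
    return out.getvalue()


# Hypothetical survey extract: names are removed, subject IDs are pseudonymized.
raw = "subject_id,name,score\ns1,Ann,3\ns1,Ann,4\ns2,Bob,5\n"
print(depersonalize(raw, drop={"name"}, pseudonymize=["subject_id"], salt="keep-this-secret"))
```

The salt has to stay private and out of the shared data set; otherwise the original IDs can be guessed and re-hashed.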
Data Management Plans are required even if a project is not expected to generate data that requires sharing DMP should clearly explain non-sharing in light of COI standards (peer review) Between the lines: Not sharing will require justification and close scrutiny by NSF Sharing is preferred
In short: How will researchers obtain the appropriate permissions to use your data? Keep in mind: Is a blanket permissions statement or a case-by-case policy more efficient/practical? What responsibilities will users of your data have re: privacy, intellectual property, etc.? How will you deal with users who violate these provisions?
In short: How will you make sure your data stays intact and available once you are done using it? Keep in Mind: What are your retention requirements? Is this a permanent data set? What storage media will you use? Are you prepared to migrate/emulate as needed? Do you have a data backup plan?
Preparing, sharing, and archiving your data sets
Think about where you will put your data Local? Network drive? Online data management system? Think about how you (or others) will find your data Think about how others may use your data, when found Think about how to store your data in the long term (or if to store it long-term at all)
Will anybody be able to read these files at the end of your time horizon? Where possible, prefer file formats that are: Open, standardized Documented In wide use Easy to data-mine, transform, recast If you need to transform data for durability, do it now, not later.
Fundamental question: What would someone unfamiliar with your data need in order to find, evaluate, understand, and reuse them? Consider the differences between someone inside your lab, someone outside your lab but in your field, and someone outside your field. Two parts: metadata and methods
About the project: Title, people, key dates, funders and grants About the data: Title, key dates, creator(s), subjects, rights, included files, format(s), versions, checksums Interpretive aids: codebooks, data dictionaries, algorithms, code Keep this with the data; think of it as a Readme file
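A Readme of this kind can be generated rather than typed by hand. The sketch below is one possible approach, not a prescribed format: the file name `README.json`, the field names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_readme(data_dir: Path, project: dict) -> dict:
    """Collect project details plus name, format, size, and checksum per file."""
    files = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.name != "README.json":
            files.append({
                "name": path.relative_to(data_dir).as_posix(),
                "format": path.suffix.lstrip(".").lower() or "unknown",
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return {"project": project, "files": files}


def write_readme(data_dir: Path, project: dict) -> Path:
    """Write the record next to the data so the two travel together."""
    target = data_dir / "README.json"
    target.write_text(json.dumps(build_readme(data_dir, project), indent=2), encoding="utf-8")
    return target
```

A call such as `write_readme(Path("my_dataset"), {"title": "...", "creators": ["..."], "funder": "NSF"})` would be rerun whenever files change, so the recorded checksums stay current.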
Reason #1 for not reusing someone else’s data: “I don’t know enough about how it was gathered to trust it.” Document what you did. (A published article may or may not be enough.) Document any limitations of what you did. If you ran code on the data, document the code and keep it with the data. Need a codebook? Or a data dictionary? If I can’t identify at sight what each bit of your dataset means, yes, you do need a codebook or data dictionary. DO NOT FORGET UNITS!
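The "do not forget units" rule can be checked mechanically. A minimal sketch, assuming the data dictionary is kept as a simple mapping from column name to description and units (that layout, the function name, and the sample columns are hypothetical):

```python
import csv
import io


def undocumented_columns(csv_text: str, dictionary: dict) -> list:
    """Return the column names that lack a description or units."""
    header = next(csv.reader(io.StringIO(csv_text)))
    problems = []
    for column in header:
        entry = dictionary.get(column, {})
        if not entry.get("description") or not entry.get("units"):
            problems.append(column)
    return problems


sample = "site_id,temp,depth\nA1,12.5,3.0\n"
codebook = {
    "site_id": {"description": "Sampling site code", "units": "none"},
    "temp": {"description": "Water temperature", "units": "degrees Celsius"},
    "depth": {"description": "Sensor depth"},  # units left out on purpose
}
print(undocumented_columns(sample, codebook))  # ['depth']
```

Writing "none" explicitly for unitless columns keeps a deliberate omission distinguishable from a forgotten one.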
Your own drive (PC, server, flash drive, etc.) And if you lose it? Or it breaks? Somebody else’s drive Departmental or campus drive “Cloud” drive Do they care as much about your data as you do? What about versioning? Library motto: Lots Of Copies Keeps Stuff Safe. Two onsite copies, one offsite copy. Keep confidentiality and security requirements in mind, of course
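Lots of copies only keep stuff safe if the copies still match. The sketch below compares each copy against a master directory by checksum; the function names and paths are placeholders for wherever your two onsite copies and one offsite copy actually live.

```python
import hashlib
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest.
    Reads whole files into memory; very large files would want chunked reads."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            result[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def compare_copies(master: Path, copies: list) -> dict:
    """For each copy, list files that are missing or differ from the master."""
    reference = checksums(master)
    report = {}
    for copy in copies:
        found = checksums(copy)
        report[str(copy)] = {
            "missing": sorted(set(reference) - set(found)),
            "changed": sorted(
                name for name in reference
                if name in found and found[name] != reference[name]
            ),
        }
    return report
```

Run on a schedule, a check like this catches a silently corrupted or incomplete copy while a good copy still exists elsewhere.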
If data need to persist beyond project end, you have to deal with a new kind of risk: organizational risk. Servers come and go. So do labs. So do entire departments. This is especially important if you share data! Don’t let it 404! You need to find a trustworthy partner. On campus: try the library or Research and Sponsored Programs. (UITS has a role but can’t do it alone!) Off campus: look for a disciplinary data repository, or a journal that accepts data. (It’s a good idea to do this as part of your planning process.) Let somebody else worry! You have new projects to get on with.
Informational websites UW-Madison: http://researchdata.wisc.edu/ UW-Milwaukee: http://dataplan.uwm.edu Don’t just use the site for your own campus! Data experts IT cyberinfrastructure experts Archivists/records managers MINDS@UW: minds.wisconsin.edu Data in final form that make sense as discrete files
For Information: NSF Grant Proposal Guide http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_index.jsp MIT Data Management and Publishing http://libraries.mit.edu/guides/subjects/data-management/index.html For storage/management (non-exhaustive): A partial list of potential repositories: http://databib.org Ask: can my home institution provide better service?
For assistance with writing your plan: California Digital Library DMP Creation Tool https://dmp.cdlib.org/ (Select “UWM” as institution) Data Conservancy DMP Template/Questionnaire http://dataconservancy.org/dataManagementPlans DataONE Best Practices Examples http://www.dataone.org/plans Data Curation Profiles (Purdue University) http://datacurationprofiles.org/ Digital Curation Center Tools Catalog http://www.dcc.ac.uk/resources/external/tools-services
Make sure your data plan covers at least the minimum requirements set out by NSF Create appropriate metadata to help you manage and find data Use open, universal standards and file formats Be prepared to preserve access tools along with data itself Be aware of time periods for data sharing and retention
Contact the presenter: Brad Houston, UW-Milwaukee email@example.com (414) 229-6979 This presentation available online at: http://www.slideshare.net/herodotusjr/data-management-plans-dmp-for-nsf