Open Access and other open movements have advocates who feel strongly about these causes. Like the patriots of the American Revolution, they feel that publishing is broken and that only radical action can change the system.
Publishers are offering the same content in more ways. This recent UK study shows that the more publishing and viewing options are available, the less expensive it becomes. This is the opposite of our expenditures, where items are becoming more expensive in newer formats.
Scholars, though, also care about the impact of their work, not just how it is published. The Open movement has created more metrics and groups like altmetrics.org that advocate for weblinks, bookmarks and online conversations on research to help measure impact for tenure and promotion decisions.
The open access conversation is focused on the dissemination of research products like peer-reviewed articles and books at the end of the research life cycle, whereas data management planning is most effective when it’s initiated before data collection begins and implemented throughout the research life cycle.
Elizabeth Brown
Scholarly Communications and Library Grants Officer
Binghamton University Libraries
October 18, 2012
What is an NSF Data Management Plan?
How and why was it created?
Why are Libraries a part of data management?
(Short Break)
Creating and Implementing NSF Data Management Plans
Preserving Research Data after a project is completed
To:
Understand current NSF and government data policy requirements.
Be aware of research support services within the Libraries.
Locate and use various resources to develop data management plans (DMPs) for NSF proposal(s).
Write a comprehensive DMP for NSF proposal(s).
Identify and plan for long-term preservation of research data from funded projects.
Storing Research Data “Forever”
Serge Goldstein
Associate CIO & Director of Academic Services
Princeton University
Fall 2010 Coalition for Networked Information Meeting
URL: http://www.youtube.com/watch?v=fQ-YEcV1k1A
Cyberinfrastructure: computing resources & networks, services, & people
Data management: technical processing and preparation of data for analysis
Data curation: selection of data for preservation and adding value for current and future use
Data citation: mechanisms to enable easy reuse and verification, track impact of data, and create structures to recognize and reward researchers (DataCite)
Data sharing: must take into account ethical and legal issues; a spectrum with many options
Source: Heather Coates and Kristi Palmer, Data management plans & planning: Meeting the NSF Requirement, March 7, 2012
URL: http://www.slideshare.net/goldenphizzwizards/meeting-the-nsf-dmp-requirement-20120307-final
Open Access
Open Educational Tools
Open Standards
Open Science
Open Source
Dorothea Salo, Battle of the Opens, Book of Trogool, March 15, 2010
http://en.wikipedia.org/wiki/File:Benjamin_Franklin_-_Join_or_Die.jpg
Houghton, J.W. (2011). "The costs and potential benefits of alternative scholarly publishing models." Information Research, 16(1), paper 469. [Available at http://InformationR.net/ir/16-1/paper469.html]
Saves time
Less reorganization for future projects
Increases efficiency
Compile and prioritize data collection(s)
Anticipate how your data will be used
Consider data preservation requirements and plan for them
Better aware of funding agency mandates and data preservation culture in your field
1. Types of Data
2. Data and Metadata Standards
3. Policies for Access and Sharing; Data Privacy and Protection
4. Data re-use and re-distribution
5. Data Archiving and Preservation
Expected data. The DMP should describe the types of data, samples, physical collections, software, curriculum materials, or other materials to be produced in the course of the project. It should then describe the expected types of data to be retained.
The Federal government defines ‘data’ in OMB Circular A-110 as: Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples). Research data also do not include:
(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.
PIs should use the opportunity of the DMP to give thought to matters such as:
• The types of data that their project might generate and eventually share with others, and under what conditions
• How data are to be managed and maintained until they are shared with others
• Factors that might impinge on their ability to manage data, e.g. legal and ethical restrictions on access to non-aggregated data
• The lowest level of aggregated data that PIs might share with others in the scientific community, given that community’s norms on data
• The mechanism for sharing data and/or making them accessible to others
• Other types of information that should be maintained and shared regarding data,
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
“This research project will generate data resulting from sensor recordings (i.e. earth pressures, accelerations, wall deformation and displacement and soil settlement) during the centrifuge experiments. In addition to the raw, uncorrected sensor data, converted and corrected data (in engineering units), as well as several other forms of derived data will be produced. Metadata that describes the experiments with their materials, loads, experimental environment and parameters will be produced. The experiments will also be recorded with still cameras and video cameras. Photos and videos will be part of the data collection.”
“A total storage demand of 50 GB is anticipated at the University of Michigan, and 50 GB at Auburn University.”
“Based on the previous viscoelastic turbulent channel flow simulations, the amount of resulting binary data is estimated around 40 TB per year. Some text format data files are also required for post-processing in the laboratory and are anticipated to be around 1 TB per year.”
“In one year, we will perform approximately 2 to 3 simulations. This means ~100 3D plots, 30 restart files, 1000 EUV, X-ray and LASCO-like images, 10 satellite files, 1000 2D plot files (total of about 150 GB of data per year).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
“The data, samples, and materials expected to be produced will consist of laboratory notebooks, raw data files from experiments, experimental analysis data files, simulation data, microscopy images, optical images, LabView acquisition programs, and quantum dot superlattice nanowire thermoelectric samples.... each of these data is described below:
A. Laboratory notebooks: The graduate student and PI will record by hand any observations, procedures, and ideas generated during the course of the research.
B. Experimental raw data files: These files will consist of ASCII text that represents data directly collected from the various electrical instruments used to measure the thermoelectric properties of the superlattice nanowire thermoelectric devices.
C. Experimental analysis data files: These files will consist of spreadsheets and plots of the raw data mentioned in Part A. The data in these files will have been manipulated to yield meaningful and quantitative values for the device efficiency and ZT. The analysis will be performed using best practice and acceptable methods for calculating device efficiency and ZT.
D. Simulation data: These data will represent the results from commercially available simulation and modeling software to model the quantum confinement.
E. Microscopy images: Images of the proposed silicon nanostructures will be generated by scanning electron microscopy (SEM), transmission electron microscopy (TEM) at high resolution to quantify wire diameter and roughness, and atomic force microscopy (AFM).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
Data formats and dissemination. The DMP should describe data formats, media, and dissemination approaches that will be used to make data and metadata available to others. Policies for public access and sharing should be described, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements. Research centers and major partnerships with industry or other user communities must also address how data are to be shared and managed with partners, center members, and other major stakeholders.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
Period of data retention. SBE is committed to timely and rapid data distribution. However, it recognizes that types of data can vary widely and that acceptable norms also vary by scientific discipline. It is strongly committed, however, to the underlying principle of timely access, and applicants should address how this will be met in their DMP statement.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
“The Dublin Core will be used as the standard for metadata. The metadata set mainly consists of fifteen elements, including title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. These elements have been ratified as both national (i.e., ANSI/NISO Standard Z39.85) and international standards (i.e., ISO Standard 15836). Further, they describe resources such as text, video, audio, and data files. These standard formats will be used in our study.”
“For each code made available, a users manual will be provided with instructions for compiling the source codes, installing and running the codes, formulating input data streams, and visualizing the output. Documentation will be in PDF format.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
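As an illustration of the Dublin Core example above, here is a minimal sketch of a metadata record built from the fifteen elements. The wrapper element name `metadata`, the function name `build_dc_record`, and the sample field values are assumptions made for this sketch; a real repository will prescribe its own record layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements named in the example DMP text.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string holding the supplied Dublin Core fields.

    `fields` maps element names to text values. The <metadata> wrapper
    is a hypothetical container chosen for this sketch.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    root = ET.Element("metadata")
    for name in DC_ELEMENTS:  # emit elements in a stable order
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values, for illustration only.
record = build_dc_record({
    "title": "Centrifuge sensor recordings, run 1",
    "creator": "Example Research Group",
    "format": "text/csv",
    "rights": "To be determined by the data sharing policy",
})
print(record)
```

Rejecting unknown element names keeps records consistent from one project to the next, which is the point of adopting a metadata standard.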
“Verilog, SPICE, and MATLAB files generated will be processed and submitted to FTP servers as .mat files with TXT documentation. The data will be distributed in several widely used formats, including ASCII, tab-delimited (for use with Excel), and MAT format. Instructional material and relevant technical reports will be provided as PDF. Digital video data files generated will be processed and submitted to the FTP servers in MPEG-4 (.mp4) and .avi formats. Variables will use a standardized naming convention consisting of a prefix, root, suffix system.”
“Plasma image data will be RGB colored JPG or TIFF format with resolution determined by the camera. Video data will be RGB colored AVI format.”
“Images from the scanning electron microscopes (SEMs) and focused ion beam workstations (FIBs) are saved in tagged image file format (TIFF), which is readily readable by a wide variety of imaging and processing applications.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
III. Policies for access and sharing and provisions for appropriate protection/privacy
As detailed in the project description, the CARE platform in intended to be a research cloud service that provides analytical middleware for use in analyzing health data. During the project, access will be limited to project team member and invited expert stakeholders through a password protected website. Commencing with Task 5 (month 26), means for access by the broader research community will be implemented. At that time, the project team will determine whether there is a need for initiating access charges, which may be appropriate for securing the longer terms sustainability of the CARE platform and analysis tools. All of the data that will be utilized are publicly available data sets that have been de-identified by public agencies and have passed their standards for privacy protection and assurance so that no individually identifiable data is provided. The datasets to be utilized within this project and other intellectual property have been released without restriction. Over the course of the study, the project team will meet with both the Community Health Institute and the SafeRoadMaps/CERS team to arrive at a data-sharing agreement for postproject utilization of their data. Such an agreement will provide a model for not only this partnership, but for licensing the CARE Platform analytics for use by other health data sets.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
“Before data is stored, it will be stripped of all institutional and individual identifiers to ensure confidentiality by staff of the Center following procedures developed by the researchers.”
“Audio files of interviews will be stored on a password protected secure server during the study and for two years after, and destroyed subsequently.”
“Exceptions to shared data include proprietary DTE GIS utility information (for security reasons) and software code of commercial interest to the projects GOALI partners or identified licensees. Both exceptions are permitted by the ENG DMP policy.... The research team will however develop a set of 3D GIS datasets for distribution the public. These datasets will represent non-existent buried infrastructure and will only be useful for the evaluation of the other research products.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
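The first example above describes stripping identifiers before storage. A minimal sketch of that step might look like the following. The field names in `IDENTIFIER_FIELDS` and the function `strip_identifiers` are hypothetical and are not the procedure used by the quoted researchers; each project would define its own list.

```python
# Hypothetical list of fields treated as direct identifiers.
IDENTIFIER_FIELDS = frozenset({"name", "email", "institution", "participant_id"})


def strip_identifiers(record, identifier_fields=IDENTIFIER_FIELDS):
    """Return a copy of `record` without the listed identifier fields.

    The input dictionary is left unchanged so the original can be
    handled (or destroyed) under the project's own policy.
    """
    return {key: value for key, value in record.items()
            if key not in identifier_fields}


# Hypothetical survey response, for illustration only.
response = {
    "name": "A. Researcher",
    "institution": "Example University",
    "years_of_experience": 7,
    "uses_repository": True,
}
cleaned = strip_identifiers(response)
print(cleaned)
```

Removing direct identifiers does not by itself guarantee confidentiality, because combinations of the remaining fields can still identify a person, so a step like this would sit inside the review procedures the DMP describes.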
IV. Policies and provisions for re-use, re-distribution
As noted in the project description, policies for provision and re-use will be developed as part of the research project. It is anticipated that there will be considerable interest in the platform and tools within the research and practice community, including academic researchers, health research agencies, and cloud service providers, among others. The need for such a tool was identified during a recent NSF sponsored symposium on Health Cyberinfrastructure, which was conducted by the PIs.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
Data storage and preservation of access. The DMP should describe physical and cyber resources and facilities that will be used for the effective preservation and storage of research data. These can include third party facilities and repositories.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
V. Plans for archiving and Preservation of access
The project website and service will contain all appropriate information and documentation for using the CARE platform and tool for health research discovery and analysis. The site will also contain all references, research papers, and related products developed throughout the course of the project. The San Diego Supercomputer Facility at UC San Diego will host the data throughout the research project and provide a minimum of three years of online access beyond the completion of the project. Data storage will be performed at the nominal rates charged by SDSC to any project using the facility. These are relatively modest (~$1000/TB) and can be borne ahead of time for the 3-year period. Should the CARE platform not extend beyond the three years (post grant), the data could then be archived at SDSC at even lower cost. A decision would have to be made at that point in time regarding how exactly to archive the data, and on paying for the archival storage.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
“For archiving, the data along with any related publications will be deposited in Libra, the UVA archival system, with an appropriate licensing statement. DOIs will be attached to all data stored from this project. Since the current preservation plan for Libra is indefinite data storage, preservation of access is assured.”
“Materials to be publicly shared will be stored with the Deep Blue repository, a service of the UM Libraries that provides deposit access and preservation services. Deposited items will be assigned a persistent URL that will be registered with the Handle System for assigning, managing, and resolving persistent identifiers (‘handles’) for digital objects and other Internet resources.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
What are your goals?
Who needs access and when?
When/if can data be shared/distributed?
Prepare for future funder mandates
Plan beyond individual PI/grant projects
• Campus Copyright policy
• Collaborator institution copyright and ownership policies, informal agreements
• Patent and provenance issues
• International copyright considerations
• Post-project data retention requirements
• Post-employment data agreements
URL: http://www2.binghamton.edu/academics/provost/faculty-staff-handbook/handbook-xii.html
Survey sample: 308 campus researchers with externally sponsored projects or submitted proposals (2009-2011); 91 survey respondents
Source: Binghamton University Research Faculty Survey, June 2011, Jim Wolf, Director of Academic Computing (ret.)
Source: Jim Wolf, Director of Academic Computing (ret.), June 2011
[Charts: Data Accessibility (legend: access granted to individuals, openly available to all, proprietary, private) and Data Preservation Timeframe (legend: forever, 3-7 yrs, <3 yrs). Storage categories: Local research group server, ITS storage, Library archive, Disciplinary repository (e.g., ICPSR).]
Source: Research Faculty Survey, Jim Wolf, Director of Academic Computing (ret.), June 2011
Create consistent, standardized metadata
Perform regular file fixity and format checks
Identify, update and migrate file formats
Mitigate and eliminate file degradation
Provide storage space, controlled access and an “exit strategy”
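A file fixity check, one of the tasks listed above, can be sketched as follows: record a checksum for every file at deposit time, then recompute and compare on a schedule. The function names and the choice of SHA-256 are assumptions made for this sketch, not a description of any particular repository's process.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (by relative path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_fixity(root, manifest):
    """Compare files under `root` with a stored manifest.

    Returns a dict of relative path -> "missing" or "changed";
    an empty dict means every listed file still matches.
    """
    root = Path(root)
    problems = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != expected:
            problems[relative_path] = "changed"
    return problems
```

A typical use would be to call `build_manifest` once when data are deposited, store the result alongside the data, and run `check_fixity` at regular intervals. Files added after the manifest was built are not reported by this sketch.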
Media Deterioration and Format Obsolescence Demonstrate that “Backups” are Inadequate for Long-Term Preservation
Sources: http://oldcomputers.net/macintosh.html; http://www.classiccmp.org/dunfield/pc/index.htm
Build content from one project to the next
Create a set of policies based on current best practices and funder requirements
Refine data collection, access, use, distribution, and preservation policies over time
Elizabeth Brown
LS-2504C
(607) email@example.com
Slideshare: http://www.slideshare.net/ebrown