This class is aimed at those engaged in the research life cycle, from applying for a research grant, through data collection, and ultimately to preparing the data for deposit in a public archive. Some projects generate such enormous amounts of data that managing it takes up much of the scientists’ time. Data management primarily occurs within the life cycle of a research project. Data sharing plans should be developed in conjunction with an archive to maximize the utility of the data to research and to ensure the availability of the data in the future.
Steps in the Research Life Cycle:

Proposal Planning & Writing
• Conduct a review of existing data sets
• Determine if the project will produce a new dataset (or combine existing ones)
• Investigate archiving challenges, consent and confidentiality
• Identify potential users of your data
• Determine costs related to archiving
• Contact archives for advice (look for archives)

Project Start Up
• Create a data management plan
• Make decisions about document form and content
• Conduct pretests and tests of materials and methods

Data Collection
• Follow best practice
• Organize files, backups and storage, and QA for data collection
• Access control and security

Data Analysis
• Manage file versions
• Document analysis and file manipulations

Data Sharing
• Determine file formats
• Contact the archive for advice
• More documenting and cleaning up of data

End of Project
• Write paper
• Submit report findings
• Deposit data in a data archive (repository)

Remember: managing data in a research project is a process that runs throughout the project. Good data management is the foundation for good research, especially if you are going to share your data. Good management is essential to ensure that data can be preserved and remain accessible in the long term, so it can be re-used and understood by other researchers. When managed and preserved properly, research data can be successfully used for future scientific purposes.
Planning the management of your data before you begin your research AND throughout its life cycle is essential to ensure its current usability and its long-term preservation and access.
• You can focus on research, not user requests. With a repository keeping your data, you can focus on your research rather than fielding requests or worrying about data on a web page. Your project may have lots of people working on it, and you will need to know what each is doing and has done. The project may last years.
• Funding agencies now require a data management plan.
• You can understand your data at a later time. Having your data documented will allow future users to understand your data and be able to use it.
• It takes less time to get data ready to share. If you follow the plan, the data should be ready for archiving; documenting the data throughout ensures that proper descriptions of the data are maintained.
Will the data contain direct or indirect identifiers that could be used to identify research participants? This is one of the challenges for archiving data, so you need to think about consent.
• Links on UVa compliance in research are on the handout.
• Health research links are on the handouts too. The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule is the first comprehensive Federal protection for the privacy of personal health information.
• Your discipline may have other policies, e.g., National Academy of Engineering (link on handouts).
• Intellectual property: determine copyright and ownership of research data. If you’ve gathered the data from multiple sources, you need to obtain permission to publish it.
Data sharing and data retention requirements for research data generated from a proposal/project. Before you start your plan, check the mandates, policies, and procedures of your grant funder and of UVa. Examples from UVa: UVa’s policy on recordkeeping in research and the UVa Health System Office of Research.

NIH: The NIH Data Sharing Policy & Implementation Guidance (2003) suggests including the following in proposals:
• Schedule for data sharing
• Format of final dataset
• Documentation to be provided
• Analytical tools to be provided, if any
• Need for data sharing agreement
• Mode of data sharing
NIH generally requires that files resulting from research awards be retained for at least three years after the final financial report has been filed. However, Commonwealth of Virginia record retention regulations are stricter (see below) and require that such records be retained five years after filing of the final financial report of a funding period.

NSF: Develop and submit specific plans to share materials collected with NSF support, except where this is inappropriate or impossible. These plans should cover how and where these materials will be stored at reasonable cost, and how access will be provided to other researchers, generally at their cost.

UVa: Data and notebooks resulting from sponsored research are the property of the University of Virginia. It is the responsibility of the principal investigator to retain all raw data in laboratory notebooks (or other appropriate format) for at least five years after completion of the research project (i.e., publication of a paper describing the work, or termination of the supporting research grant, whichever comes first) unless required to be retained longer by contract, law, regulation, or by some reasonable continuing need to refer to them.

UVa Health System: Responsible conduct of research includes data management (protection, sharing, retention times).
So how do I get started managing data? The handout has a link to Managing & Sharing Data with more details, and also a link to a Data Management Plan form. The plan should be written down, sort of like an instruction book.
Life cycle of a research project with respect to the data it creates:
• Data Collection: data collection, entry, checking and cleaning
• Data Analysis: analyze data, derive “new” data, document data
• Data Sharing: prepare data for submission
Managing the data in the Data Life Cycle includes backup and storage, version control, file conversions, and security and access control. Document all data details.
Here are the details about what we are going to manage in the Data Life Cycle.
National Science Board. (2005). Long-lived digital data collections: Enabling research and education in the 21st century. Retrieved from http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
Observational data cannot be recollected, remeasured, or verified, and are archived indefinitely. Data are typically time and/or location dependent. Much of the value of observational data is in its secondary analysis.
Experimental data can often be reproduced, although there are cases where experimental conditions or variables are unknown. Experimental data may be associated with a particular methodology or instrument.
These are sometimes lumped together as computational data. Data that is the result of computer models or simulations can be reproduced if adequate information is provided about the computer hardware, software, and inputs. Statistical data, computational models, and simulations can also be recreated and verified, as long as sufficient information is available.
Can you think of anything else as “data”? Most of the time we are managing the “digital” data, but what about the non-digital: lab notebooks, notes?
This slide shows the many differing types of data and the many different formats for each one.
Things to consider when choosing file formats: the collection/analysis format does not have to be the same as the preservation format, but if it is not, the data will need to be converted to an interchangeable format for archiving (we will talk about this later). You can choose one format for analysis because it may be faster to work in a proprietary format, but you will need to change to a non-proprietary format later for archiving (Prepare to Share). Migrate the data into such a format, and also keep a copy in the original software format.
Keep track of versions of documentation and data. Use directory structure and file naming conventions to help, or use version control software. Always record every change to a file, no matter how small. Record relationships between files.
• Directory structure: the top-level folder should include the project name and date. Each subsequent level should have its naming convention documented, e.g., categorize by people, experiment, or dataset version.
• File naming conventions: reserve the 3-letter file extension for application-specific codes, identify the project in the file name, and use dates in filenames. Some disciplines have their own recommendations for file naming.
• File structure: flat files vs. (relational) database.
• Keep the directory structure the same for backups.
I’ll go over more detail with examples in the next presentation on best practices.
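As a rough illustration of the directory and file naming conventions described above, here is a minimal Python sketch. The function name, the argument names, the order of the name parts, and the example project "soilstudy" are all hypothetical choices made for illustration; your discipline or archive may recommend a different pattern.

```python
from datetime import date
from pathlib import PurePosixPath


def data_file_path(project, project_start, category, dataset, file_date, version, extension):
    """Build a relative path following one possible naming convention.

    Top-level folder: project name plus project start date.
    Next level: a documented category (person, experiment, dataset version, ...).
    File name: project, dataset, file date and a zero-padded version number.
    The extension is left for the application-specific code (csv, txt, ...).
    """
    top_level = f"{project}_{project_start.isoformat()}"
    file_name = f"{project}_{dataset}_{file_date.strftime('%Y%m%d')}_v{version:02d}.{extension}"
    return PurePosixPath(top_level) / category / file_name


# Hypothetical example: second version of a pH dataset from "experiment1".
example = data_file_path(
    "soilstudy", date(2012, 7, 31), "experiment1", "ph", date(2012, 8, 15), 2, "csv"
)
print(example)  # soilstudy_2012-07-31/experiment1/soilstudy_ph_20120815_v02.csv
```

Generating names from a function rather than typing them by hand is one way to keep the convention consistent across team members and across backups.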
• Keep the master copy with an assigned team member.
• Restrict write access to specific members.
• Record changes with version control.
• Network: keep confidential data off internet servers (or behind firewalls), and put sensitive materials on computers not connected to the internet.
• Physical security: who has access to your office? Are repairs allowed by an outside company?
• Computer: keep virus protection up to date; protect your computer with a login password; do not send personal or confidential data via e-mail or FTP, and transmit it as encrypted data instead; impose confidentiality agreements on data users.
The linked Managing and Sharing Data document has an in-depth section on ethics, consent and confidentiality.
Data storage for collected data and for backups:
• Apply the same considerations to storage and backup options.
• Use formats that will be usable in the long term, not dependent on a software version.
• CD and DVD media life is not reliable. You may have to replace old media and maintain devices that can still read the proprietary formats or media types.
• Copy or migrate data files to new media between 2 and 5 years after they were created.
• Appropriate environmental conditions will increase the life span of media. Check the recommended environmental conditions for your particular media.
• Make sure the storage location is free from risk of fire and flood.
• Store “paper” data properly.
• Be aware of thefts, file changes and “loss” (data only on paper?).
Why back up data? Keeping reliable backups is an integral part of data management. Regular backups protect against data loss due to hardware failure, software or media faults, virus infection or hacking, power failure, and human error.
• Recommendation: 3 backup copies (original, external/local, external/remote).
• Full backups and incremental backups.
• Check the integrity of the files to ensure they were transmitted without error (checksum and file size). A checksum calculates a “value” from a block of data; compute it on both files, and if the “number” is the same, the copy is OK.
• If using a departmental server, check on backup/restore procedures (how quickly can you get files restored?). You may want to have the backup procedures controlled by you.
• Test your backup system, test restoring files, and don’t over-reuse backup media.
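The checksum and file size check described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and SHA-256 is an assumed choice of checksum (the notes do not name a specific algorithm).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """Compare file size first (cheap), then the checksum (thorough)."""
    original, backup = Path(original), Path(backup)
    if original.stat().st_size != backup.stat().st_size:
        return False
    return sha256_of(original) == sha256_of(backup)
```

A call such as `backup_matches("data/survey.csv", "/mnt/backup/data/survey.csv")` (hypothetical paths) returns True only when both the sizes and the checksums agree. Storing the checksum alongside the file also lets you re-check the copy later, for example after migrating to new media.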
Use some options for “storage” and others for backups. Cloud storage examples: Google Docs, DropBox, Windows Live SkyDrive, SpiderOak.
Documentation should start with the Data Management Plan. Starting at the beginning and continuing throughout reduces the likelihood that you will forget aspects of your data later.
• Document data collection, lab notebooks, and digitization information.
• Think about non-digital materials: papers, photos, reports, lab notebooks. They should be digitized and stored with the digital data.
• In order for the data to be used properly once it has been archived, the data must be documented. Data documentation (otherwise known as metadata) enables you to understand the data in detail and enables others to find it, use it and properly cite it.
• Use versioning software for the documentation files too.
Conform to community standards for recording data and metadata that adequately describe the context and quality of the data and help others find and use it. Document data validation and other quality assurance procedures, and modifications of the data.
Information should include: Title, Creator, Subject, Funders, Rights, Dates, Location, Methodology, Data Processing, Sources, File Formats, Variable Lists, and Code Lists.
You may need to put this information in a metadata standard: DDI, MODS, FGDC, DarwinCore, EML.
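One way to keep track of the information listed above is a simple project-level record that is checked for gaps as the project goes on. The sketch below is an assumption-laden illustration: the field names, the helper functions, and the example values are hypothetical, and the JSON output is not DDI, MODS, FGDC, DarwinCore or EML. A real deposit would map these fields into whichever standard the archive expects.

```python
import json

# Field names follow the list in the notes; the spelling of the keys is an assumption.
REQUIRED_FIELDS = (
    "title", "creator", "subject", "funders", "rights", "dates", "location",
    "methodology", "data_processing", "sources", "file_formats",
    "variable_list", "code_lists",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def to_json(record):
    """Serialize the record so it can be stored next to the data files."""
    return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical, partially completed record.
record = {
    "title": "Example soil pH survey",
    "creator": "A. Researcher",
    "file_formats": ["CSV"],
}
print(missing_fields(record))
```

Running a check like this at each project milestone shows which pieces of documentation still need to be written before the data are ready to share.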
Keep a copy of the data in its original form. Maintain it and the final version as read-only. As you analyze your data, there will be various changes, additions and deletions to the dataset. With detailed documentation, someone could replicate your findings from the original set to the final one. This enables:
• Reproducibility: others can validate findings
• Executability: others can re-run or re-use the analysis
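Marking the original and final files as read-only can be done through the operating system, or scripted. The following Python sketch clears the write permission bits on a file; the function names are hypothetical, and file permissions alone do not replace access control or backups.

```python
import os
import stat

# Write permission bits for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Remove write permission from a file, leaving other permissions alone."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current & ~WRITE_BITS)


def is_read_only(path):
    """Return True if no write permission bit is set on the file."""
    return not (os.stat(path).st_mode & WRITE_BITS)
```

A typical use would be to call `make_read_only` on the original data file right after collection, and again on the final version once analysis is complete.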
• 1st version: original data collection
• 2nd version: “cleaned” dataset
• 3rd version: combining variables and analysis
Filenames include the version number and “who”.
There are lots of ways to share your data without depositing it in a repository: e-mailing it to requestors, or posting it to a web site, Google, or another “cloud” sharing site. But then you have to maintain it, and it makes “finding” your data harder. Depositing it in an archive makes it easier to discover and preserve. If it’s well documented, then it is easy to use. Make sure the confidentiality of respondent data is preserved. You will need to create a version of the dataset without personal information.
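Creating a version of the dataset without personal information can start with dropping the columns that hold direct identifiers. The Python sketch below assumes the data are in a CSV file with a header row; the function name and the column names in the usage note are hypothetical. Removing direct identifiers does not address indirect identifiers, which still need to be reviewed separately.

```python
import csv


def write_public_copy(source_path, public_path, identifier_columns):
    """Copy a CSV file, leaving out the listed identifier columns.

    Returns the list of column names that were kept.
    """
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(public_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        kept = [name for name in (reader.fieldnames or []) if name not in identifier_columns]
        # extrasaction="ignore" silently skips the dropped columns in each row.
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return kept
```

For example, `write_public_copy("survey_full.csv", "survey_public.csv", {"name", "email"})` would write a copy without the `name` and `email` columns, leaving the restricted original untouched.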
The safest option to guarantee long-term data access is to convert data to standard formats. This applies to you, the researcher, even if you are not planning on sharing (publishing). These are formats more likely to be accessible in the future. The format of the file is a major factor in the ability to use the data in the future. As technology changes, plan for software and hardware obsolescence. System files (SAS, SPSS) are compact and efficient, but not very portable. Use software to “export” data to a portable (or transport) file, an “interchangeable format”. Convert proprietary formats to non-proprietary ones, and check for data errors in conversion.
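Checking for data errors in conversion can be as simple as reading the exported file back and comparing it with the source values. The Python sketch below assumes the data have already been loaded into a list of dictionaries (how you read a SAS or SPSS system file is outside its scope) and that CSV is the chosen non-proprietary target; the function names are hypothetical, and it assumes no missing (None) values.

```python
import csv


def export_to_csv(rows, fieldnames, path):
    """Write rows (a list of dicts) to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def conversion_errors(rows, fieldnames, path):
    """Read the exported CSV back and list any differences from the source rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        exported = list(csv.DictReader(handle))
    problems = []
    if len(exported) != len(rows):
        problems.append(f"row count differs: {len(rows)} in source, {len(exported)} in export")
    for index, (before, after) in enumerate(zip(rows, exported)):
        for name in fieldnames:
            # CSV stores everything as text, so compare the text form of each value.
            if str(before[name]) != after[name]:
                problems.append(f"row {index}, column {name}: {before[name]!r} became {after[name]!r}")
    return problems
```

An empty list from `conversion_errors` means every value survived the round trip as text. It does not confirm that details such as value labels or missing-value codes from the original software were carried over, so those still need to be documented separately.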
Examples of preferred format choices: formats for long-term digital preservation (open). Don’t expect that you (you won’t have time) or the archive will be able to convert older formats to new ones. There is a good chart in the UK document Managing and Sharing Data (page 9).
Let’s stop and make sure everyone knows or can define “metadata”: what you use to describe your data, the pieces of information that will allow someone to understand your data and how it was collected, thus enabling another person to replicate your results. In order for the data to be used properly once it has been archived, the data must be documented. If you have been documenting your data and files all along, this step should be easy.
In order for the data to be used properly once it has been archived, the data must be documented. Metadata accompanying a file should be written for a user 20 years into the future, or for someone who does not know about you or your work.
Where you archive your data has an impact on who can find your data. Are you looking for long-term preservation (how long would your data be useful)? Each option has advantages and disadvantages. Data centers may not be able to accept all data. Start looking at where you want to archive while doing your project. Base your data management plan on the expectations and criteria for archiving.
Data repositories may have criteria to evaluate and select datasets for preservation.
Select those that could provide long-term access
Managing the research life cycle
Managing the Research Data Life Cycle
Presented by Sherry Lake, ShLake@virginia.edu
July 31, 2012
University of Florida Data Management Workshop
Research Life Cycle
[Diagram: the research life cycle stages (Proposal Planning & Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project) shown alongside the Data Life Cycle (Data Discovery, Re-Use, Re-Purpose, Deposit, Data Archive).]
Why Manage Data?
• Saves time
• Others can understand your data
• Makes sharing/preserving data easier
• Reinforces open scientific inquiry and replication of results
• Increases the visibility of your research
• Facilitates new discoveries
• Reduces costs by avoiding duplication
• Required by funding agencies
(Stage: Proposal Planning & Writing)

Ethical and Legal Issues
• Confidentiality
  + Evaluate the sensitivity of your data
  + Comply with institution’s research guidelines
  + Comply with regulations for health research
  + May need to enable a restricted view of your data
• Intellectual Property
  + Copyright
  + Patents
(Stage: Proposal Planning & Writing)

Data Sharing and Retention Requirements
• Be Aware of Funding Requirements
  + Informal sharing statement
  + Separate Data Management Plan
• Know What Your Institution Requires
• Know What Your Department Requires
• Publisher’s Requirement
  + Nature Magazine
(Stage: Proposal Planning & Writing)

Create a Data Management Plan
• Appoint Data Manager Contact
• Describe data to be collected and methodology
• Include guidelines on data documentation
• Plan quality assurance and backup procedures
• Plan sharing of data for public use
• Include preservation plans
• Document copyright and intellectual property rights
(Stage: Project Start Up)
Data Life Cycle within Context of the Research Life Cycle
[Diagram repeated: research life cycle stages from Proposal Planning & Writing to End of Project, with the Data Life Cycle (Data Discovery, Re-Use, Re-Purpose, Deposit, Data Archive).]
Managing Data in the Data Life Cycle
• Data Collection and Organization
• Data Control & Security
• Backup & Storage
• Documentation and Metadata
• Processing and Analysis
• Preparing Data to Share

What is Data?
• Observational: data captured in real-time
  + Examples: sensor readings, telemetry, survey results, images
  + Usually irreplaceable
• Experimental: data from lab equipment
  + Examples: gene sequences, chromatograms, magnetic field readings
  + Often reproducible, but can be expensive

What is Data? (continued)
• Simulation: data generated from test models
  + Examples: climate models, economic models
  + Models & metadata (inputs) more important than output data
• Derived or compiled data
  + Examples: text and data mining, compiled database, 3D models
  + Reproducible (but very expensive)
Types and Formats of Data

Types                 Examples
Text                  ASCII, Word, PDF
Numerical             ASCII, SPSS, STATA, Excel, Access, MySQL
Multimedia            JPEG, TIFF, MPEG, QuickTime
Models                3D, statistical
Software              Java, C, Fortran
Domain-specific       FITS in astronomy, CIF in chemistry
Instrument-specific   Olympus Confocal Microscope Data Format
Organizing Your Files
• File Version Control
• Directory Structure/File Naming Conventions
• File Naming Conventions for Specific Disciplines
• File Structure
• Use Same Structure for Backups

Data Security & Access Control
Protection of data from unauthorized access, use, change, disclosure and destruction
• Network Security
• Physical Security
• Computer Systems & Files

Data Security & Access Control
• Network security
  + Keep confidential data off internet servers (or behind firewalls)
  + Put sensitive materials on computers not connected to the internet
• Physical security
  + Access to buildings and rooms
• Computer systems & files
  + Use passwords on files/systems
  + Virus protection

Data Storage
Things to consider when deciding on where and how to store your data:
• File Format
• Media Life and Format
• Disaster Recovery Plan
• Environmental Conditions
• Security

Backup Your Data
• Reduce the risk of damage or loss
• Use multiple locations (one off-site)
• Validate using checksums
• Create a backup schedule
• Use reliable backup medium
• Test your backup system (i.e., test file recovery)

Backup & Storage Options
• Personal Computer
• Departmental or University Server
• Tape Backups
• Subject archive
• CDs or DVDs: NOT recommended
• External Hard Drives
• Cloud Storage

Documentation
• Start at beginning of research and continue throughout
• Data documentation enables you to understand the data in detail
• Enables others to find it, use it and properly cite it

Data Documentation
Data documentation includes information on:
+ The Project
+ Data Collection Methods
+ Structure of the data files
+ Data sources used
+ Transformations of the data
At the data-level, information on:
+ Labels and descriptions for variables & records
+ Codes and classifications
+ Derived data algorithms
+ File format and software used
Data Collection
Best practices are detailed in the presentation that follows.
(Stage: Data Collection)

Data Processing & Analysis
Software tools to create, process and visualize the data
+ Programming languages (Fortran, PHP, Ruby, Python, C++, etc.)
+ Data collection software (LabView)
+ Analysis (SPSS, SAS, Matlab, Mathematica, R, etc.)
(Stage: Data Analysis)

Recording Processes
Record every change to a file, no matter how small
+ Document changes to files
+ Use file naming conventions
+ Headers inside the file
+ Log files (automatic)
+ Version Control Software (e.g. SVN)
+ File sharing software (Google Drive, DropBox, or others)
(Stage: Data Analysis)

Prepare to Share
Preparing data to share makes publishing data easier
• Archive Submission Policies/Guidelines
• File Format Conversion
• Documentation & Metadata
• Programming Code
• Citations to existing datasets
• Creation of un-restricted dataset
(Stage: Data Sharing)

Choosing File Formats
Accessible in the future
• Non-proprietary
• Open, documented standard
• Common, used by the research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
(Stage: Data Sharing)

Preferred Format Choices
• PDF, not Word
• ASCII, not Excel
• MPEG-4, not Quicktime
• TIFF or JPEG2000, not GIF or JPG
• XML or RDF, not RDBMS
Not software specific
(Stage: Data Sharing)

Documentation & Metadata
What is Metadata?
• Who created the data?
• What is the content of the data set?
• When was it created?
• Where was it collected?
• How was it developed?
• Why was it developed?
(Stage: Data Sharing)

Metadata Formats & Standards
• Provides structure to describe data: common terms, definitions, language, structure
• Many different standards (based on discipline): DDI, FGDC, EML
• Tools for creating metadata files: Nesstar (DDI), Metavist (FGDC), Morpho (EML)
(Stage: Data Sharing)
Archiving Your Data
• Informally on a peer-to-peer basis
• Make accessible on online project web page
• Make accessible on institutional web site
• Submitting to a journal
• Deposit in discipline specific repository
• Deposit in Institutional Repository

Advantages of Repositories
• Secure Environment
• Backups
• Quality of Data
• Promotion of Data
• Access Control to Data
• Easy Dissemination
• Long-term Preservation
• Online Resource Discovery
• Licensing Arrangements

Data Repositories
Examples of discipline specific repositories:
+ SIMBAD (Astronomy)
+ Protein Data Bank (Biology)
+ PubChem (Chemistry)
+ GEON (Earth Science)
+ Long Term Ecological Research (Ecology)
+ ICPSR (Social Sciences)
Databib is a tool for helping people identify and locate online repositories of research data. http://databib.org
Data Management Bibliography
Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/
Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
Van den Eynden, V., Corti, L., Woollard, M., & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
Questions?
Sherry Lake
Senior Scientific Data Consultant, UVA Library
firstname.lastname@example.org
Twitter: shlakeuva
Slideshare: http://www.slideshare.net/shlake
Web: http://www.lib.virginia.edu/brown/data