Best practices data management

Presentation for the REUs, Mountain Lake Biological Station, 5/28/2014.

  1. Best Practices: Creating and Managing Research Data
      Presented by Sherry Lake
      ShLake@virginia.edu
      http://dmconsult.library.virginia.edu/
      [Diagram: Data Life Cycle. Labels: Re-Purpose, Re-Use, Deposit, Data Collection, Data Analysis, Data Sharing, Proposal Planning Writing, Data Discovery, End of Project, Data Archive, Project Start Up]
  2. Why Manage Your Data?
  3. Best Practices for Creating Data
      1. Use Consistent Data Organization
      2. Use Standardized Naming, Codes and Formats
      3. Assign Descriptive File Names
      4. Perform Basic Quality Assurance / Quality Control
      5. Preserve Information - Use Scripted Languages
      6. Define Contents of Data Files; Create Documentation
      7. Use Consistent, Stable and Open File Formats
  4. Spreadsheet Examples
  5. Spreadsheets
  6. Consistent Data Organization
      • Spreadsheets (such as those found in Excel) are sometimes a necessary evil
        – They allow “shortcuts” which will result in your data not being machine-readable
      • But there are some simple steps you can take to ensure that you are creating spreadsheets that are machine-readable and will withstand the test of time
  7. Spreadsheets
  8. Spreadsheet Problems?
  9. Problems
      • Dates are not stored consistently
      • Values are labeled inconsistently
      • Data coding is inconsistent
      • Order of values is different
  10. Problems
      • Confusion between numbers and text
      • Different types of data are stored in the same columns
      • The spreadsheet loses interpretability if it is sorted
  11. How would you correct this file?
  12. Spreadsheet Best Practices
      • Include a header line as the 1st line (or record)
      • Label each column with a short but descriptive name
        – Names should be unique
        – Use letters, numbers, or “_” (underscore)
        – Do not include blank spaces or symbols (+ - & ^ *)
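The naming rules on this slide can be checked mechanically. A minimal Python sketch, assuming the header row is already available as a list of strings; `check_headers` and its message wording are hypothetical names chosen for illustration, not part of the presentation:

```python
import re

# Assumed rule set, taken from the slide: letters, numbers, and underscore only.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_headers(headers):
    """Return a list of human-readable problems found in a header row."""
    problems = []
    seen = set()
    for name in headers:
        # fullmatch rejects empty names, blanks, and symbols such as + - & ^ *
        if not VALID_NAME.fullmatch(name):
            problems.append(f"invalid characters in column name: {name!r}")
        if name in seen:
            problems.append(f"duplicate column name: {name!r}")
        seen.add(name)
    return problems


print(check_headers(["site_id", "precip", "temp_c"]))
print(check_headers(["site id", "temp", "temp"]))
```

An empty result means the header row follows all three naming rules.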
  13. Spreadsheet Best Practices
      • Columns of data should be consistent
        – Use the same naming convention for text data
      • Each line should be “complete”
      • Each line should have a unique identifier
  14. Spreadsheet Best Practices
      • Columns should include only a single kind of data
        – Text or “string” data
        – Integer numbers
        – Floating point or real numbers
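One way to test these rules on a file exported to CSV is sketched below in Python. The function names (`classify`, `mixed_columns`, `duplicate_ids`) and the sample data are assumptions made for illustration. The sketch treats integers and floats as compatible and only flags columns that mix text with numbers:

```python
import csv
import io


def classify(value):
    """Classify one cell as 'missing', 'integer', 'float', or 'text'."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def mixed_columns(csv_text):
    """Return the names of columns that contain both text and numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kinds = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            kinds[name].add(classify(row[name]))
    return sorted(
        name for name, found in kinds.items()
        if "text" in found and found & {"integer", "float"}
    )


def duplicate_ids(csv_text, id_column):
    """Return identifier values that appear on more than one line."""
    seen, repeats = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[id_column]
        if value in seen:
            repeats.append(value)
        seen.add(value)
    return repeats


sample = "id,count,depth_m\n1,4,0.5\n2,five,1.0\n2,6,1.5\n"
print(mixed_columns(sample))
print(duplicate_ids(sample, "id"))
```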
  15. Use Naming Standards & Codes
      • Use commonly accepted label names that describe the contents (e.g., precip for precipitation)
      • Use consistent capitalization (e.g., not: temp, Temp, and TEMP in same file)
      • Standard codes
        – State Postal (VA, MA)
        – FIPS Codes for Counties and County Equivalent Entities (http://www.census.gov/geo/reference/codes/cou.html)
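Inconsistent capitalization is easy to miss by eye. A small Python sketch, assuming the labels have been collected into a list; `case_variants` is a hypothetical helper that groups labels differing only in case:

```python
from collections import defaultdict


def case_variants(labels):
    """Group labels that are identical except for capitalization."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.lower()].add(label)
    # Keep only the groups where more than one spelling was used.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


print(case_variants(["temp", "Temp", "TEMP", "precip"]))
```

Any key in the result marks a label that should be rewritten to a single spelling.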
  16. Use Standardized Formats
      • Use standardized formats for units: International System of Units (SI), http://physics.nist.gov/Pubs/SP330/sp330.pdf
      • ISO 8601 standard for date and time: YYYY-MM-DDThh:mm:ss.sTZD, e.g., 2009-10-13T09:12:34.9Z or 2009-10-13T09:12:34.9+05:00
      • Spatial coordinates for latitude/longitude: +/- DD.DDDDD, e.g., -78.476 (longitude), +38.029 (latitude)
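A minimal Python sketch of the date and coordinate formats, using only the standard library. It writes timestamps in ISO 8601 extended notation (hyphens in the date, colons in the time) and coordinates as signed decimal degrees with five decimal places. The function names and the choice of millisecond precision are assumptions made for illustration:

```python
from datetime import datetime, timedelta, timezone


def iso_timestamp(moment):
    """Format a timezone-aware datetime as an ISO 8601 string."""
    if moment.tzinfo is None:
        raise ValueError("timestamp needs an explicit time zone")
    # Python writes UTC as +00:00, which is equivalent to the Z suffix.
    return moment.isoformat(timespec="milliseconds")


def format_coordinate(degrees):
    """Format decimal degrees with an explicit sign (+/- DD.DDDDD)."""
    return format(degrees, "+.5f")


utc_time = datetime(2009, 10, 13, 9, 12, 34, 900000, tzinfo=timezone.utc)
print(iso_timestamp(utc_time))
print(format_coordinate(-78.476), format_coordinate(38.029))
```

Requiring a time zone avoids ambiguous local times when data from several sites are combined.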
  17. File Names
  18. File Names
      • Use descriptive names
      • Not too long; CamelCase
      • Try to include time
        – Date using YYYYMMDD
        – Use version numbers
      • Don’t use spaces
        – May use “-” or “_”
      • Don’t change default extensions
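These conventions can be enforced by generating file names in code rather than typing them. A minimal Python sketch; `build_file_name`, the underscore separator, and the `v02` style of version number are assumptions chosen for illustration:

```python
import re
from datetime import date


def build_file_name(project, description, when, version, extension):
    """Join the parts into project_description_YYYYMMDD_vNN.extension."""
    parts = [project, description, when.strftime("%Y%m%d"), f"v{version:02d}"]
    name = "_".join(parts)
    # Reject spaces and symbols; only letters, digits, "-" and "_" are allowed.
    if re.search(r"[^A-Za-z0-9_-]", name):
        raise ValueError(f"file name contains spaces or symbols: {name!r}")
    return f"{name}.{extension}"


print(build_file_name("Biodiv", "heatExp", date(2005, 6, 1), 2, "csv"))
```

Writing the date as YYYYMMDD means that an alphabetical sort of the folder is also a chronological sort.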
  19. Organize Files Logically
      Make sure your file system is logical and efficient
      • Example folders: Biodiversity, Lake, Grassland, Experiments, Field Work
      • Example file names:
        – Biodiv_H20_heatExp_2005_2008.csv
        – Biodiv_H20_predatorExp_2001_2003.csv
        – Biodiv_H20_planktonCount_start2001_active.csv
        – Biodiv_H20_chla_profiles_2003.csv
      • File name components: Project Name, Location, Experiment Name, Date, File Format
  20. Data Validation
      • Check for missing, impossible, anomalous values
        – Plotting
        – Mapping
      • Examine summary statistics
      • Verify data transfers from notebooks to digital files
      • Verify data conversion from one file format to another
      Source: Hook et al. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
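The first two checks can be scripted. A minimal Python sketch, assuming values arrive as strings read from a file; the missing-value code `-9999`, the plausible range, and the `validate` function are hypothetical choices for illustration:

```python
import statistics

MISSING_CODE = "-9999"  # hypothetical missing-value code; use your own convention


def validate(values, low, high, missing_code=MISSING_CODE):
    """Sort raw string values into missing, impossible, and valid readings."""
    missing, impossible, valid = [], [], []
    for index, raw in enumerate(values):
        if raw == "" or raw == missing_code:
            missing.append(index)
            continue
        try:
            number = float(raw)
        except ValueError:
            impossible.append(index)  # not a number at all
            continue
        if low <= number <= high:
            valid.append(number)
        else:
            impossible.append(index)  # outside the plausible range
    summary = None
    if valid:
        summary = {
            "count": len(valid),
            "min": min(valid),
            "max": max(valid),
            "mean": statistics.mean(valid),
            "median": statistics.median(valid),
        }
    return {"missing": missing, "impossible": impossible, "summary": summary}


air_temp_c = ["12.5", "", "13.0", "-9999", "250", "abc", "11.5"]
print(validate(air_temp_c, low=-40, high=60))
```

The returned row positions point back to the lines that need to be checked against the field notebook.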
  21. Data Manipulation
      • You will need to repeat reduction and analysis procedures many times
        – You need to have a workflow that recognizes this
        – Scripted languages can help capture the workflow
        – You could just document all steps by hand
        – After the 20th iteration through your data set, however, you may feel more fondly towards scripted languages
      • Learn the analytical tools of your field
        – Talk to colleagues, etc. and choose at least one tool to master
  22. Preserve Information
      Keep Original (Raw) File
        – Do not include transformations, interpolations, etc.
        – Consider making the raw data “read-only”
      [Diagram labels: Processing Script (R); Save as a new file]
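The slide's diagram refers to an R processing script; the same idea is sketched here in Python as an illustration, not as the presenter's method. The raw file is marked read-only, and every transformation is written to a separate file. All function and file names are hypothetical:

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(path):
    """Clear the write permission bits so the raw file cannot be edited by accident."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def process_to_new_file(raw_path, out_path, transform):
    """Apply transform to each line of the raw file and save the result elsewhere."""
    if out_path.resolve() == raw_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    lines = raw_path.read_text(encoding="utf-8").splitlines()
    processed = "\n".join(transform(line) for line in lines) + "\n"
    out_path.write_text(processed, encoding="utf-8")


def demo():
    """Run the workflow in a temporary folder and report what happened."""
    with tempfile.TemporaryDirectory() as folder:
        raw = Path(folder) / "raw.csv"
        raw.write_text("site,temp_f\nA,50\n", encoding="utf-8")
        make_read_only(raw)
        out = Path(folder) / "processed.csv"
        process_to_new_file(raw, out, str.upper)
        still_writable = bool(raw.stat().st_mode & stat.S_IWUSR)
        return (
            still_writable,
            raw.read_text(encoding="utf-8"),
            out.read_text(encoding="utf-8"),
        )


print(demo())
```

Because the raw file is never modified, the processed file can be regenerated at any time by rerunning the script.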
  23. Preserving: Scripted Notes
      • Use a scripted language to process data
        – R Statistical package (free, powerful)
        – SAS
        – MATLAB
      • Processing scripts record processing
        – Steps are recorded in textual format
        – Can be easily revised and re-executed
        – Easy to document
      • GUI-based analysis may be easier, but is harder to reproduce
  24. Data Documentation (Metadata)
      • Informal or formal methods to describe your data
      • Important if you want to reuse your own data in the future
      • Also necessary when sharing your data
  25. Define Contents of Data Files
      • Create a Project Document File (Lab Notebook)
      • Details such as:
        – Names of data & analysis files associated with study
        – Definitions for data and codes (include missing value codes, names)
        – Units of measure (accuracy and precision)
        – Standards or instrument calibrations
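Some of these details can be kept in a machine-readable data dictionary stored next to the data. A minimal Python sketch; the four field names and the sample entry are assumptions chosen for illustration, not a formal metadata standard:

```python
import csv
import io

# Hypothetical set of fields; extend it with accuracy, calibration, and so on.
FIELDS = ["variable", "description", "units", "missing_code"]


def data_dictionary_csv(entries):
    """Serialize data dictionary entries as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented(columns, entries):
    """Return the data columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [name for name in columns if name not in documented]


entries = [
    {"variable": "precip", "description": "Daily precipitation",
     "units": "mm", "missing_code": "-9999"},
]
print(data_dictionary_csv(entries))
print(undocumented(["precip", "temp_c"], entries))
```

Running `undocumented` against each data file's header line shows which variables still need a definition.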
  26. Data Dictionary Example
  27. Data Dictionary Example
  28. Data Documentation
      Project Documentation
      • Context of data collection
      • Data collection methods
      • Structure, organization of data files
      • Data sources used
      • Data validation, quality assurance
      • Transformations of data from the raw data through analysis
      • Information on confidentiality, access and use conditions
      Dataset Documentation
      • Variable names and descriptions
      • Explanation of codes and schemas used
      • Algorithms used to transform data
      • File format and software (including version) used
  29. File Format Sustainability
      Types: Examples
      • Text: ASCII, Word, PDF
      • Numerical: ASCII, SPSS, STATA, Excel, Access, MySQL
      • Multimedia: Jpeg, tiff, mpeg, quicktime
      • Models: 3D, statistical
      • Software: Java, C, Fortran
      • Domain-specific: FITS in astronomy, CIF in chemistry
      • Instrument-specific: Olympus Confocal Microscope Data Format
  30. Choosing File Formats
      • Accessible Data (in the future)
        – Non-proprietary (software formats)
        – Open, documented standard
        – Common, used by the research community
        – Standard representation (ASCII, Unicode)
        – Unencrypted & Uncompressed
  31. Best Practices for Creating Data
      1. Use Consistent Data Organization
      2. Use Standardized Naming, Codes and Formats
      3. Assign Descriptive File Names
      4. Perform Basic Quality Assurance / Quality Control
      5. Preserve Information - Use Scripted Languages
      6. Define Contents of Data Files; Create Documentation
      7. Use Consistent, Stable and Open File Formats
  32. Following these Best Practices...
      • Will improve the usability of the data by you or by others
      • Will make your data “computer ready”
      • Will save you time
  33. Research Life Cycle
      [Diagram: Research Life Cycle and Data Life Cycle. Labels: Re-Purpose, Re-Use, Deposit, Data Collection, Data Analysis, Data Sharing, Proposal Planning Writing, Data Discovery, End of Project, Data Archive, Project Start Up]
  34. Managing Data in the Data Life Cycle
      • Choosing file formats
      • File naming conventions
      • Document all data details
      • Access control & security
      • Backup & storage
  35. Data Security & Access Control
      • Network security
        – Keep confidential or sensitive data off internet servers or computers connected to the internet
      • Physical security
        – Access to buildings and rooms
      • Computer Systems & Files
        – Use passwords on files/system
        – Virus protection
  36. Backup Your Data
      • Reduce the risk of damage or loss
      • Use multiple locations (here, near, far)
      • Create a backup schedule
      • Use reliable backup medium
      • Test your backup system (i.e., test file recovery)
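Testing file recovery can be partly automated by comparing checksums of the original and the backup copy. A minimal Python sketch using the standard library; the function names, the SHA-256 choice, and the temporary folders standing in for real backup locations are assumptions for illustration:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy source into backup_dir and confirm the copy is byte-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / source.name
    shutil.copy2(source, copy)
    return sha256_of(source) == sha256_of(copy)


def demo():
    """Back up a small file, then show that a damaged copy is detected."""
    with tempfile.TemporaryDirectory() as folder:
        source = Path(folder) / "field_data.csv"
        source.write_text("site,count\nA,4\n", encoding="utf-8")
        backup_dir = Path(folder) / "backup"
        verified = backup_and_verify(source, backup_dir)
        damaged = backup_dir / source.name
        damaged.write_text("site,count\nA,5\n", encoding="utf-8")
        detects_damage = sha256_of(source) != sha256_of(damaged)
        return verified, detects_damage


print(demo())
```

Running the comparison on a schedule, against each backup location, gives early warning when a copy has degraded.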
  37. Storage & Backup
  38. Sustainable Storage
      Lifespan of Storage Media: http://www.crashplan.com/medialifespan/
  39. Best Practices Bibliography
      • Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple guidelines for effective data management. Bulletin of the Ecological Society of America, 90(2), 205-214. http://dx.doi.org/10.1890/0012-9623-90.2.205
      • Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/.
      • Hook, L. A., Santhana Vannan, S. K., Beaty, T. W., Cook, R. B., & Wilson, B. E. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010.
  40. Best Practices Bibliography (Cont.)
      • Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf.
      • Van den Eynden, V., Corti, L., Woollard, M., & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf.
