Best Practices
Creating and Managing Research Data
Presented by Sherry Lake
ShLake@virginia.edu
http://dmconsult.library.virginia.edu/
Data Life Cycle
(Diagram – stages include: Proposal Planning & Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Archive, Deposit, Data Discovery, Re-Use, Re-Purpose)
Why Manage Your Data?
Best Practices for Creating Data
1. Use Consistent Data Organization
2. Use Standardized Naming, Codes and Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance / Quality Control
5. Preserve Information - Use Scripted Languages
6. Define Contents of Data Files; Create
Documentation
7. Use Consistent, Stable and Open File Formats
Spreadsheet Examples
Spreadsheets
Consistent Data Organization
• Spreadsheets (such as those found in Excel)
are sometimes a necessary evil
– They allow “shortcuts” that can result in your
data not being machine-readable
• But there are some simple steps you can take
to ensure that you are creating spreadsheets
that are machine-readable and will withstand
the test of time
Spreadsheets
Spreadsheet Problems?
Problems
• Dates are not stored consistently
• Values are labeled inconsistently
• Data coding is inconsistent
• The order of values differs between entries
Problems
• Confusion between numbers and text
• Different types of data are stored in the same columns
• The spreadsheet loses interpretability if it is sorted
How would you correct this file?
Spreadsheet Best Practices
• Include a Header Line as the 1st line (or record)
• Label each Column with a short but descriptive name
Names should be unique
Use letters, numbers, or “_” (underscore)
Do not include blank spaces or symbols (+ - & ^ *)
• Columns of data should be consistent
– Use the same naming convention for text data
• Each line should be “complete”
• Each line should have a unique identifier
Spreadsheet Best Practices
Spreadsheet Best Practices
• Columns should include only a single kind of data
– Text or “string” data
– Integer numbers
– Floating point or real numbers
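As a minimal sketch, a machine-readable layout following these practices might look like the CSV fragment below (column names, values, and the -9999 missing-value code are hypothetical and would be defined in your documentation):

site_id,sample_date,salinity_top,salinity_bottom,temp_c
1,2009-10-13,33.0,34.2,18.5
2,2009-10-13,32.8,-9999,19.1
3,2009-10-14,33.1,34.0,17.2

Every row is complete, every column holds a single kind of data, and the first column gives each record a unique identifier.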
Use Naming Standards & Codes
• Use commonly accepted label names that
describe the contents (e.g., precip for
precipitation)
• Use consistent capitalization (e.g., not: temp,
Temp, and TEMP in same file)
• Standard codes
– State Postal (VA, MA)
– FIPS Codes for Counties and County Equivalent
Entities
(http://www.census.gov/geo/reference/codes/cou.html)
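A short R sketch of applying these conventions (the data frame, column names, and values are made up for illustration):

# Enforce consistent, lowercase column names and recode full state
# names to their standard postal abbreviations.
df <- data.frame(Temp  = c(12.1, 14.3),
                 State = c("Virginia", "Massachusetts"),
                 stringsAsFactors = FALSE)
names(df) <- tolower(names(df))                      # temp, state: one capitalization style
postal <- c(Virginia = "VA", Massachusetts = "MA")   # standard postal codes
df$state <- unname(postal[df$state])
df   # temp = 12.1, 14.3; state = "VA", "MA"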
Use Standardized Formats
• Use standardized formats for units
International System of Units (SI)
http://physics.nist.gov/Pubs/SP330/sp330.pdf
• ISO 8601 Standard for Date and Time
YYYY-MM-DDThh:mm:ss.sTZD
2009-10-13T09:12:34.9Z
2009-10-13T09:12:34.9+05:00
• Spatial Coordinates for Latitude/Longitude
+/- DD.DDDDD
-78.476 (longitude)
+38.029 (latitude)
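A base-R sketch for producing these formats (values are illustrative):

# ISO 8601 timestamp in UTC and signed decimal-degree coordinates.
format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")   # e.g. "2009-10-13T09:12:34Z"
sprintf("%+.3f (latitude), %+.3f (longitude)", 38.029, -78.476)
# "+38.029 (latitude), -78.476 (longitude)"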
File Names
File Names
• Use descriptive names
• Not too long; CamelCase
• Try to include time
– Date using YYYYMMDD
– Use version numbers
• Don’t use spaces
– May use “-” or “_”
• Don’t change default
extensions
Organize Files Logically
Make sure your file system is logical and efficient
(Diagram: an example folder hierarchy labeled Biodiversity, Lake, Grassland, Experiments, Field Work)
Biodiv_H20_heatExp_2005_2008.csv
Biodiv_H20_predatorExp_2001_2003.csv
Biodiv_H20_planktonCount_start2001_active.csv
Biodiv_H20_chla_profiles_2003.csv
File name components: Project Name | Location | Experiment Name | Date | File Format
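One way to build such names programmatically, sketched in R with hypothetical components:

# Assemble a descriptive, space-free file name from its parts.
project <- "Biodiv"; location <- "H20"; experiment <- "heatExp"; years <- "2005_2008"
paste0(paste(project, location, experiment, years, sep = "_"), ".csv")
# "Biodiv_H20_heatExp_2005_2008.csv"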
Data Validation
• Check for missing, impossible, anomalous values
– Plotting
– Mapping
• Examine summary statistics
• Verify data transfers from notebooks to digital files
• Verify data conversion from one file format to another
Hook et al. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
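A quick QA/QC pass might look like this R sketch (the data frame, column names, and plausible ranges are hypothetical; adapt them to your own data):

# Spot missing, impossible, and anomalous values.
df <- data.frame(site_id     = 1:3,
                 sample_date = as.Date(c("2009-10-13", "2009-10-14", "2009-10-15")),
                 temp_c      = c(18.5, 19.1, 999))   # 999 is an obvious entry error
summary(df)                                # ranges, quartiles, NA counts
colSums(is.na(df))                         # missing values per column
which(df$temp_c < -5 | df$temp_c > 40)     # rows outside a plausible range (flags row 3)
hist(df$temp_c)                            # quick plot to reveal the outlier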
Data Manipulation
• You will need to repeat reduction and analysis
procedures many times
– You need to have a workflow that recognizes this
– Scripted languages can help capture the workflow
– You could just document all steps by hand
– After the 20th iteration through your data set, however, you may
feel more fondly toward scripted languages
• Learn the analytical tools of your field
– Talk to colleagues, etc. and choose at least one tool to
master
Preserve Information
Keep the Original (Raw) File
– Do not include transformations, interpolations, etc.
– Consider making the raw data “read-only”
Save processed output as a new file, using a processing script (e.g., R)
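A minimal R sketch of this pattern (file and column names are hypothetical):

# Keep the raw file untouched; write every derived product to a new file.
raw_file <- "plankton_raw.csv"
Sys.chmod(raw_file, mode = "0444")                 # make the raw data read-only (POSIX systems)
dat <- read.csv(raw_file)
dat$log_count <- log(dat$count)                    # the transformation lives in the script...
write.csv(dat, "plankton_processed.csv", row.names = FALSE)   # ...and its output in a new file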
Preserving: Scripted Notes
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts record the processing
– Steps are recorded in textual format
– Can be easily revised and re-executed
– Easy to document
• GUI-based analysis may be easier, but harder to
reproduce
Data Documentation (Metadata)
• Informal or formal methods to describe your
data
• Important if you want to reuse your own data
in the future
• Also necessary when sharing your data
Define Contents of Data Files
• Create a Project Document File (Lab
Notebook)
• Details such as:
– Names of data & analysis files associated with
study
– Definitions for data and codes (include missing
value codes, names)
– Units of measure (accuracy and precision)
– Standards or instrument calibrations
Data Dictionary Example
Data Dictionary Example
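The example slides are images and did not extract as text; as a rough, hypothetical sketch, a data dictionary row typically records a variable's name, definition, units, and missing-value code:

variable       description                          units        missing value code
site_id        unique identifier for sampling site  (none)       (none)
sample_date    date of sample, ISO 8601             YYYY-MM-DD   (none)
salinity_top   salinity at top of water column      PSU          -9999
temp_c         water temperature                    degrees C    -9999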
Data Documentation
Project Documentation
• Context of data collection
• Data collection methods
• Structure, organization of data files
• Data sources used
• Data validation, quality assurance
• Transformations of data from the raw data through analysis
• Information on confidentiality, access and use conditions
Dataset Documentation
• Variable names and descriptions
• Explanation of codes and schemas used
• Algorithms used to transform data
• File format and software (including version) used
File Format Sustainability
Types and Examples
Text: ASCII, Word, PDF
Numerical: ASCII, SPSS, STATA, Excel, Access, MySQL
Multimedia: JPEG, TIFF, MPEG, QuickTime
Models: 3D, statistical
Software: Java, C, Fortran
Domain-specific: FITS in astronomy, CIF in chemistry
Instrument-specific: Olympus Confocal Microscope Data Format
Choosing File Formats
• Accessible Data (in the future)
– Non-proprietary (software formats)
– Open, documented standard
– Common, used by the research community
– Standard representation (ASCII, Unicode)
– Unencrypted & Uncompressed
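One possible conversion step, sketched in R and assuming the readxl package for reading the proprietary source (file names are hypothetical):

# Convert a proprietary spreadsheet to an open, plain-text archival copy.
library(readxl)
dat <- read_excel("survey_2009.xlsx")
write.csv(dat, "survey_2009.csv", row.names = FALSE)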
Best Practices for Creating Data
1. Use Consistent Data Organization
2. Use Standardized Naming, Codes and Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance / Quality Control
5. Preserve Information - Use Scripted Languages
6. Define Contents of Data Files; Create Documentation
7. Use Consistent, Stable and Open File Formats
Following these Best Practices…
• Will improve the usability of the data by you or by others
• Your data will be “computer ready”
• Will save you time
Research Life Cycle
Data Life Cycle
(Diagram – same cycle as before: Proposal Planning & Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Archive, Deposit, Data Discovery, Re-Use, Re-Purpose)
Managing Data in the Data Life Cycle
• Choosing file formats
• File naming conventions
• Document all data details
• Access control & security
• Backup & storage
Data Security & Access Control
• Network security
– Keep confidential or sensitive data off internet servers or
computers connected to the internet
• Physical security
– Access to buildings and rooms
• Computer Systems & Files
– Use passwords on files/system
– Virus protection
Backup Your Data
• Reduce the risk of damage or loss
• Use multiple locations (here, near, far)
• Create a backup schedule
• Use a reliable backup medium
• Test your backup system (i.e., test file
recovery)
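A small R sketch of a scripted backup step (directory layout and file pattern are hypothetical; most groups will also rely on institutional or cloud backup services):

# Copy data files into a dated backup folder and spot-check recovery.
backup_dir <- file.path("backups", format(Sys.Date(), "%Y%m%d"))
dir.create(backup_dir, recursive = TRUE, showWarnings = FALSE)
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
file.copy(files, backup_dir, copy.date = TRUE)
head(read.csv(file.path(backup_dir, basename(files[1]))))   # test that a copy reads back in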
Storage & Backup
Sustainable Storage
Lifespan of Storage Media: http://www.crashplan.com/medialifespan/
Best Practices Bibliography
Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple
guidelines for effective data management. Bulletin of the Ecological Society of
America, 90(2), 205-214. http://dx.doi.org/10.1890/0012-9623-90.2.205
Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and
Publishing. Retrieved 05/31/2012, from
http://libraries.mit.edu/guides/subjects/data-management/.
Hook, L. A., Santhana Vannan, S.K., Beaty, T. W., Cook, R. B. and Wilson, B.E. (2010).
Best Practices for Preparing Environmental Data Sets to Share and Archive.
Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge
National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.
http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010.
Best Practices Bibliography (Cont.)
Inter-university Consortium for Political and Social Research (ICPSR). (2012).
Guide to social science data preparation and archiving: Best practices
throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012,
from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf.
Van den Eynden, V., Corti, L., Woollard, M. & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf.
Editor's Notes
  1. The following are seven basic data habits that will help improve the information content of your data and make it easier to share data with others: Some have estimated that researchers can spend up to 80% of their time finding, accessing, understanding, and preparing data and only 20% of their time actually analyzing the data. The habits described in this module will help scientists spend more time doing research and less time doing data management.
  2. Spreadsheets are widely used for simple analyses. They are easy to use, BUT they allow (encourage) users to structure data in ways that are hard to use with other software. You can use them like Word, with columns. These spreadsheets (in this format) are good for “human” interpretation, not computers – and since you will probably need to either write a program or use a software package, the “human” format is not best. These formats are good for presenting your findings, such as publishing, but they will be harder to use with other software later on (if you need to do any analysis). It is better to store the data so that it can be used in automated ways, with minimal human intervention.
  3. Example of Poor Data Practice for Collaboration and Sharing. This illustration shows an example of poor practice for working in spreadsheets for data collection. At first glance, it may appear this data is well formulated, but a closer look reveals a number of practices that will make it difficult to re-use in its present state. For example, there are calculations in the far right columns that appear to have been made during a data analysis phase but that do not represent valid data entries. Notice in the upper right corner a comment stating “Don’t use – old data”, and “Peter’s lab”. These remarks leave the viewer wondering who Peter is and which lab he was located in, as well as why this may not be the most accurate spreadsheet. One also may wonder what the “c” located in the far right column represents, and what the numbers at the bottom of the spreadsheet represent, since they are unaffiliated with a particular row of data in the spreadsheet. Notice there are numbers added in inconsistent places (two numbers at the bottom of the chart) and the letter “C” appears in an unlabeled column.
  4. Spreadsheets are widely used for simple analyses. They are easy to use; however, they allow (encourage) users to structure data in ways that are hard to use with other software. You can use them like you would a Word document, with columns and colors. These spreadsheets (in this format) are good for “human” interpretation, not computers – and since you will probably need to either write a program or use a software package, the “human” format is not best. These formats are good for presenting your findings (publishing), but they will be harder to use with other software later on (if you need to do any further analysis). It is better to store the data in formats that can be used in automated ways, with minimal human intervention.
  5. These are some well data measurements, where a salinity meter was used to measure the salinity (top and bottom) and the conductivity (top & bottom). Take a look at this spreadsheet… What’s wrong with it? Could this be easily automated? Sorted? Would you create a file like this?
  6. Dates are not stored consistently: sometimes the date is stored with a label (e.g., “Date:5/23/2005”), sometimes in its own cell (10/2/2005). Values are labeled inconsistently: sometimes “Conductivity Top”, other times “conductivity_top”. For salinity, sometimes two cells are used for top and bottom; in others they are combined in one cell. Data coding is inconsistent: sometimes YSI_Model_30, sometimes “YSI Model 30” – you sort of can’t tell if it’s a “label” or a data value. Tide State is sometimes a text description, sometimes a number. The order of values in the “mini-table” for a given sampling date is different: “Meter Type” comes first in the 5/23 table and second in the 10/2 table.
  7. Confusion between numbers and text: for most software, 39% or <30 are considered TEXT, not numbers (what is the average of 349 and <30?). Different types of data are stored in the same columns: many software products require that a single column contain either TEXT or NUMBERS (but not both!). The spreadsheet loses interpretability if it is sorted: dates are related to a set of attributes only by their position in the file, and once sorted that relationship is lost. Not sure why you would sort this.
  8. Hint – think about representing missing values and about sortability. You want each row to be a complete record, with no blank cells – think about a way to represent “missing values”. The file is designed to be machine readable, not just human readable. The original spreadsheet loses interpretability if it is sorted: dates are related to a set of attributes only by their position in the file, and once sorted that relationship is lost.
  9. -Sherry. The standard convention for many software programs (usually a yes/no “check” option) is for the 1st line (record) to be a header line that lists the names of the variables in the file. The rest of the records (lines) are data. Don’t make names too long; some software programs may not work with long variable names.
  10. Each line in the spreadsheet should have each cell filled. Otherwise, it isn’t machine-readable, and it won’t even survive a “sort” operation. Note we’ve changed the format of the date to an ISO YYYYMMDD format.
  11. Format the columns so they contain a single type of data… One problem with Excel is that it doesn’t like to show trailing zeros. So “33.0” in F2 is shown as “33” unless you change the formatting, as we have done here.
  12. (am/pm is NOT allowed.) T appears literally in the string. The minimum for a date is YYYY. YYYY = four-digit year; MM = two-digit month (01=January, etc.); DD = two-digit day of month (01 through 31); hh = two digits of hour (00 through 23); mm = two digits of minute (00 through 59); ss = two digits of second (00 through 59); s = one or more digits representing a decimal fraction of a second; TZD = time zone designator (Z or +hh:mm or -hh:mm). The latitude is 38.029N; the longitude is -78.476W.
  13. File names should reflect the contents of the file and uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type. Think about how the name will look in a directory with lots of other files; you want to be able to “pick it out”. Having trouble finding files, or telling which one is the most recent?
  14. File names are the easiest way to indicate the contents of the file. Use terse names that are still indicative of their content; you want to uniquely identify the data file. Be unique but reflect the file content. Think about the organizing principle; don’t just make up a system as you go along. Don’t make names too long – some scripting programs have a filename length limit for file importing (reading). Don’t use blanks/spaces in file names; some software may not be able to read file names with blanks. Think about how the name will look in a directory with lots of other files; you want to be able to “pick it out”.
  15. As with naming files, similar logic is useful when designing file directory structures and names, which you should ensure are logical and efficient in design.
  16. Perform basic quality assurance. Doing quality control will help you in your project, but it will also help those who want to use your data. Would you want to use data whose quality you were not sure of?
  17. You don’t want to change something (or delete something) that could be important later (and you don’t know now what that may be). Make corrections/deletions in a derivative file, never in the original file. Things to think about: operationally, you want to keep the raw data until you are “finished”. Whether you preserve the raw data after the project is over depends on various factors. Most importantly, can the data be easily regenerated? If this is experimental data, it often can. However, observational or survey data usually isn’t reproducible and needs to be preserved after the end of the project. How would you name the new file? If you use a scripted language, you can re-run analyses. It is important to take good notes of what changes you make to the data (file).
  18. To preserve your data and its integrity, save a "read-only" copy of your raw data files with no transformations, interpolation, or analyses. Use a scripted language such as “R”, “SAS” or “MATLAB” to process data in a separate file, located in a separate directory. In this example, an “R” call is made on the data set to plot the data and perform a log transform – this way, changes are not retained in the original, raw data file.
  19. Analysis “scripted” software: R, SAS, SPSS, Matlab. Analysis scripts are written records of the various steps involved in processing and analyzing data (a sort of “analytical metadata”). They are easily revised and re-executed at any time if you need to modify the analysis, versus a GUI, which is easier but does not leave a clear accounting of exactly what you have done. Document scripted code with comments on why data is being changed. The scripts you have written are an excellent record of data processing; they can also easily and quickly be revised and rerun in the event of data loss or requests for edits, and have the added benefit of allowing a future worker to follow up on or reproduce your processing. Keep in mind that while GUI-based programs are easy on the front end, they do not keep a record of changes to your data and make reproducing results difficult.
  20. Metadata and associated documentation are absolutely crucial for any potential use or reuse of data; no one can responsibly re-use or interpret data without accompanying compliant and standardized metadata or documentation. Metadata describe your data so that others can understand what your data set represents; they are thought of as "data about the data" or the "who, what, where, when, and why" of the data. Metadata should be written from the standpoint of someone reading it who is unfamiliar with your project, methods, or observations. What does a user, 20 years into the future, need to know to use your data properly? Informal documentation is something like a ReadMe file. Formal documentation uses a structured format like a data dictionary, codebook, or metadata standard. Different disciplines may have format standards. Informal is better than nothing.
  21. More documentation (documentation can also be called metadata): descriptions of the data file names (especially if you use acronyms and abbreviations); a record of why you are collecting the data; details of the methods of analysis; the names of all data and analysis files; definitions for the data (including coding keys); missing-value codes; and units of measure. Structured metadata (XML) format standards exist for some disciplines, such as the Ecological Metadata Language (EML).
  22. Here, the data dictionary specifies the units for each field (parameter).
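  For example, a minimal data dictionary might look like this (the field names and units are hypothetical):

  Field         Definition                        Type     Unit
  site_id       Unique site identifier            text     n/a
  sample_date   Date of sample (YYYYMMDD)         date     ISO 8601
  temp_C        Water temperature                 float    degrees Celsius
  precip_mm     Daily precipitation               float    millimeters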
  23. In fact, you probably already have metadata in some form; you just may not recognize it as such. For instance, among your work records you probably have notebooks stuffed with color-coded pages, or assorted keys to your data stored on your computer. Perhaps the most common form of metadata you already have is a file folder filled with notes on your data sources and the procedures you used to build your data. However, unless you have been unusually diligent, your information is probably not organized so that a stranger could stroll into your office at any time and read and understand it easily. Start documenting at the beginning of the project and continue throughout data collection and analysis: why you are collecting the data, and the exact details of the methods of collection and analysis. This is good for reproducibility; you can go back when questioned, or when updating your results, and reproduce the algorithms. There are also efficiencies in how the science is done: if you have to spend a lot of time figuring out what was done last time, you lose efficiency in reproducing results or updating an analysis. Along the same lines, documentation helps with sharing across groups. Much of the work we do nowadays is collaborative, involving more than one agency, university, or partner, and documenting the data and the analysis helps everyone on the team understand what is being done. Documenting the data and the analysis also creates a provenance record: a full history of when the project was started, how the analysis was done, and how the final results were produced.
  24. The collection/analysis format does not have to be the same as the preservation format, but if it is not, it will need to be converted to an interchangeable format for archiving (more on this later). Choose a file format that can be read well into the future and is independent of software changes. A fundamental practice: use stable file formats. Data re-use depends on the ability to return to a dataset, perhaps long after the proprietary software you used to develop it is no longer available. Remember floppy disks? It is difficult to find a computer that will read a floppy disk today, and we must think of digital data in a similar way. Select a consistent format that can be read well into the future and is independent of changes in applications. If your data collection process used proprietary file formats, it is a best practice to convert those files into a stable, well-documented, non-proprietary format to maximize others' ability to use and build upon your data. When possible, convert your tabular dataset into ASCII text format. To be accessible in the future, a format should be non-proprietary; an open, documented standard; common and used by the research community; a standard representation (ASCII, Unicode); unencrypted; and uncompressed.
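  A minimal R sketch of that conversion, assuming the readxl add-on package is available (the file and sheet names are hypothetical):

  # Convert a proprietary Excel workbook to plain-text CSV for archiving
  library(readxl)
  d <- read_excel("field_data_2009.xlsx", sheet = 1)       # hypothetical workbook
  write.csv(d, "field_data_2009.csv", row.names = FALSE)   # open, non-proprietary copy
  str(read.csv("field_data_2009.csv"))                     # spot-check the converted file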
  25. Storing data in recommended formats with detailed documentation will allow your data to be easily read many years into the future. Spreadsheets are widely used for simple analyses, but they have poor archival qualities: different versions over time are not compatible, and formulas are hard to capture or display. Plan what type of data you will be collecting, and choose a file format that can be read well into the future and is independent of software changes. The formats shown here are the ones most likely to be accessible in the future; otherwise you are left replacing old media and maintaining devices that can still read proprietary formats or obsolete media types. The format of a file is a major factor in the ability to use the data in the future, so as technology changes, plan for software and hardware obsolescence. System files (SAS, SPSS) are compact and efficient but not very portable; use the software to export the data to a portable (transport) file. Convert proprietary formats to non-proprietary ones, and check for data errors in the conversion.
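  For example, SPSS and SAS system files can be exported to portable CSV with the haven add-on package (an assumption; the file names are hypothetical):

  # Export statistical-package system files to a portable, non-proprietary format
  library(haven)
  svy <- read_sav("survey_2009.sav")                       # hypothetical SPSS file
  write.csv(svy, "survey_2009.csv", row.names = FALSE)
  lab <- read_sas("lab_results.sas7bdat")                  # hypothetical SAS file
  write.csv(lab, "lab_results.csv", row.names = FALSE)
  # Always check for data errors introduced by the conversion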
  26. To recap: create spreadsheets so they can be processed automatically. Use standards for dates/times, geospatial coordinates, species names, and other conventions from your discipline. Use descriptive file names; file names can help identify what is inside. For quality assurance, when planning data entry you can "program" data checks into forms (Access and Excel), create pick lists (codes), and define missing-data values. Scripted languages make it easier to replicate data transformations, and the scripts document them. Document EVERYTHING: dataset details, database details, collection notes and conditions. You will not remember everything 20 years from now; record what someone would need to know about your data in order to use it. Use stable file formats; it is easier if all files share the same format, and know which formats are better for the long term.
  27. Planning the management of your data before you begin your research, and throughout its lifecycle, is essential to ensure its current usability and its long-term preservation and access. With a repository keeping your data, you can focus on your research rather than fielding requests or worrying about data on a web page. Your project may have many people working on it, and you will need to know what each is doing and has done; a project may last years. Funding agencies now require a data management plan. Having your data documented will allow future users to understand your data and be able to use it. If you follow the plan, the data should be ready for archiving, and documenting the data throughout ensures that a proper description of the data is maintained.
  28. Collecting the data is just part of a research project. Here is a view of the complete life cycle of research; the data you collect (all the files and notes) need to be managed throughout the project. Steps in the research life cycle:
  Proposal planning and writing: conduct a review of existing data sets; determine whether the project will produce a new dataset (or combine existing ones); investigate archiving challenges, consent, and confidentiality; identify potential users of your data; determine costs related to archiving; contact archives for advice.
  Project start-up: create a data management plan; make decisions about document form and content; conduct pretests and tests of materials and methods.
  Data collection: follow best practices; organize files, backups, and storage; apply QA to data collection; manage access control and security.
  Data analysis: manage file versions; document analysis and file manipulations.
  Data sharing: determine file formats; contact an archive for advice; do more documenting and cleaning up of the data.
  End of project: write the paper, submit the report of findings, and deposit the data in a data archive (repository).
  Remember: managing data in a research project is a process that runs throughout the project. Good data management is the foundation for good research, especially if you are going to share your data. Good management is essential to ensure that data can be preserved and remain accessible in the long term, so they can be re-used and understood by other researchers. When managed and preserved properly, research data can be successfully used for future scientific purposes.
  29. Here are the details about what we are going to manage in the data life cycle. Many of the criteria for managing data are the best practices we already went over; the two highlighted items are ones we have not talked about yet.
  30. Assign the master copy to a specific team member and restrict write access to specific members; record changes with version control. Network security: keep confidential data off internet servers (or behind firewalls), and put sensitive materials on computers not connected to the internet. Physical security: consider who has access to your office, and whether repairs are done by an outside company. Computer security: keep virus protection up to date, make sure your computer has a login password, do not send personal or confidential data via e-mail or FTP, transmit data encrypted, and impose confidentiality agreements on data users.
  31. Why back up data? Keeping reliable backups is an integral part of data management. Regular backups protect against data loss due to hardware failure, software or media faults, virus infection or hacking, power failure, and human error. Recommendation: keep three backup copies (the original, an external/local copy, and an external/remote copy), using full backups plus incremental ones. If you are using a departmental server, check on its backup and restore procedures (how quickly can you get files restored?); you may want backup procedures that are controlled by you. Test your backup system, test restoring files, and do not over-reuse backup media.
  32. There are a variety of methods to store and share your data, from thumb drives to shared online environments: a personal computer, a departmental or university server, a home directory or UVa Collab (storage only), tape backups, or a subject archive. Each type of storage (and don't forget backups) has its strengths and weaknesses; you need to evaluate them for your research.
  33. The point is to think about migrating information from obsolete media to new media. Using CD-ROMs as data backups is popular: blank CDs are inexpensive, and copying data onto CDs is easy. However, this is the most unreliable of the data backup methods listed here. Who hasn't had the experience of putting a CD into a drive only to find that the data are unreadable and the disk "doesn't work"? CDs, like the floppy disks they replaced, have a limited shelf life. If you are writing your data backup files onto CDs, make sure that you make (and keep) multiple copies over time. An external hard drive for data backups is recommended: external hard drives are relatively cheap and easy to use; in many cases, all you have to do is plug the drive into your computer's USB port. And while hard drives do fail, their failure rate is much lower than that of backup media such as CDs. Cloud storage is another option, either for a fee or free up to about 10 GB; consider storage costs and data-transfer costs.