Data Management Lab: Session 3 Slides


What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.

Published in: Education


  1. Research Data Management, Spring 2014: Session 3. Practical strategies for better results. University Library Center for Digital Scholarship
  3. LEARNING OUTCOMES • Develop procedures for quality assurance and quality control activities.
  4. Data Integrity 1. Data have integrity if they have been maintained without unauthorized alteration or destruction. 2. Data integrity also means that data have a complete or whole structure. ( wiki100k/docs/Data_integrity.html)
  5. Data Quality • Fitness for use (depends on the context of your questions) • Data quality is the most important aspect of data management • Ensured by – Sufficient resources and expertise – Paying close attention to the design of data collection instruments – Creating appropriate entry, validation, and reporting processes – Ongoing QC processes – Understanding the data collected Chapman, 2005; Dept of Biostatistics – Data Management, IUSM
  6. Data Quality Standards • Check data for logical consistency. • Check data for reasonableness. • Ensure adherence to sound estimation methodologies. • Ensure adherence to monetary submission standards for stolen and recovered property. • Ensure that other statistical edit functions are processed within established parameters. FBI; Dept of Biostatistics – Data Management, IUSM
  7. Data Entry and Manipulation • Strategies for preventing errors from entering a dataset • Activities to ensure the quality of data before collection • Activities that involve monitoring and maintaining the quality of data during the study
  8. Data Entry and Manipulation • Define & enforce standards ◦ Formats ◦ Codes ◦ Measurement units ◦ Metadata • Assign responsibility for data quality ◦ Be sure the assigned person is educated in QA/QC
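Standards like these can be enforced mechanically at entry time. A minimal sketch in Python; the variable names, code list, and date format below are illustrative assumptions, not anything prescribed in these slides:

```python
# Sketch: enforcing agreed-upon codes and formats at entry time.
# The field names and allowed values are illustrative assumptions.
import re

ALLOWED_SEX_CODES = {"1", "2", "9"}               # e.g. 1=male, 2=female, 9=unknown
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO 8601 dates

def validate_record(record):
    """Return a list of standards violations for one data record."""
    errors = []
    if record.get("sex") not in ALLOWED_SEX_CODES:
        errors.append(f"sex code {record.get('sex')!r} not in standard code list")
    if not DATE_FORMAT.match(record.get("visit_date", "")):
        errors.append(f"visit_date {record.get('visit_date')!r} not in YYYY-MM-DD format")
    return errors
```

Writing the standard as a check that runs on every record is one way the assigned QA/QC person can enforce it rather than merely document it.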
  9. Quality Assurance v. Control • QA: the set of processes, procedures, and activities initiated prior to data collection to ensure that the expected level of quality will be reached and data integrity will be maintained. • QC: a system for verifying and maintaining a desired level of quality in a product or service.
  10. Quality Assurance in Practice • CRF (data collection instrument) review & validation • System/process testing & validation • Training, education, and communication within the team • Standard Operating Procedures, Standard Operating Guidelines • Site audits Dept of Biostatistics – Data Management, IUSM
  11. Quality Control in Practice • The set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection. • Examples: – Errors in individual data fields – Systematic errors – Violation of protocol – Staff performance issues – Fraud or scientific misconduct Dept of Biostatistics – Data Management, IUSM
  12. Activity Define data quality standards for the following variables: • Age • Height • BMI • Life satisfaction scale • Number of close friends Don’t forget to upload this to Box. Suggested file name: “Data Quality Standards”
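One way to approach this activity is to express each quality standard as an explicit range check. A sketch in Python; every threshold below (including the assumption of a 1–7 life satisfaction scale) is illustrative, not an answer supplied by the slides:

```python
# Illustrative range checks; every threshold here is an assumption.
RANGES = {
    "age": (0, 120),               # years
    "height_cm": (50, 250),        # centimeters
    "bmi": (10, 80),
    "life_satisfaction": (1, 7),   # assuming a 1-7 Likert-type scale
    "n_close_friends": (0, 100),
}

def passes_standard(variable, value):
    """True if the value falls inside the agreed range for that variable."""
    low, high = RANGES[variable]
    return low <= value <= high
```

Writing the standard down as code both enforces it and documents the decision for later users of the data.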
  13. References 1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email) 2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. 3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From /L05_DataQualityControlAssurance.pptx
  15. LEARNING OUTCOMES • Describe key considerations for selecting data collection tools.
  16. Choose your tools wisely
  17. Choose your tools wisely Allie Brosh, 2010
  18. Activity Draft a data collection instrument. See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX” Don’t forget to upload this to Box. Suggested file name: “Data Collection Tool”
  19. References 1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably. have-ebola-probably.html
  21. LEARNING OUTCOMES • Use best practices for coding. • Use best practices for data entry.
  22. Goals of Data Entry • Publishable results! – Valid data that are organized to support smooth analysis • Easy to import into an analytical program • Minimize manipulations and errors • Has a logical [data] structure
  23. Activity Draft a data coding scheme for data entry • Review the data entry best practices document in Box Don’t forget to upload this to Box. Suggested file name: “Coding Scheme”
  24. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From L04_DataEntryManipulation.pptx 2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist, presented at the AGU Workshop. From 3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From
  26. LEARNING OUTCOMES • Develop a screening and cleaning protocol and/or checklist.
  27. Data Entry and Manipulation Data Contamination • A process or phenomenon, other than the one of interest, that affects the variable value • Erroneous values (CC image by Michael Coghlan on Flickr)
  28. Data Entry and Manipulation • Errors of Commission ◦ Incorrect or inaccurate data entered ◦ Examples: malfunctioning instrument, mistyped data • Errors of Omission ◦ Data or metadata not recorded ◦ Examples: inadequate documentation, human error, anomalies in the field (CC image by Nick J Webb on Flickr)
  29. Data Entry and Manipulation • Double entry ◦ Data keyed in by two independent people ◦ Check for agreement with computer verification • Record a reading of the data and transcribe from the recording • Use a text-to-speech program to read data back (CC image by weskriesel on Flickr)
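The computer-verification step for double entry can be a simple field-by-field comparison of the two keyed-in datasets. A sketch in Python; the records and field names are made up for illustration:

```python
# Sketch: computer verification of double-keyed data.
def compare_entries(first_pass, second_pass):
    """Return (record_index, field) pairs where the two typists disagree."""
    disagreements = []
    for i, (a, b) in enumerate(zip(first_pass, second_pass)):
        for field in a:
            if a[field] != b.get(field):
                disagreements.append((i, field))
    return disagreements

entry1 = [{"id": "001", "age": "34"}, {"id": "002", "age": "51"}]
entry2 = [{"id": "001", "age": "34"}, {"id": "002", "age": "15"}]  # transposition error
print(compare_entries(entry1, entry2))  # → [(1, 'age')]
```

Each disagreement is then resolved against the original source document rather than by guessing which typist was right.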
  30. Data Entry and Manipulation • Design data storage well ◦ Minimize the number of times the same item must be entered ◦ Use consistent terminology ◦ Atomize data: one cell per piece of information • Document changes to data ◦ Avoids duplicate error checking ◦ Allows undoing changes if necessary
  31. Data Entry and Manipulation • Make sure data line up in the proper columns • Check for missing, impossible, or anomalous values • Perform statistical summaries (CC image by chesapeakeclimate on Flickr)
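A screening pass like this can be scripted. A minimal sketch, assuming a numeric column that uses `None` for missing values and has an agreed valid range; the example data are made up:

```python
# Sketch: screen one column for missing/impossible values and summarize it.
import statistics

def screen_column(values, low, high):
    """Report indices of missing and impossible values, plus a summary."""
    missing = [i for i, v in enumerate(values) if v is None]
    impossible = [i for i, v in enumerate(values)
                  if v is not None and not (low <= v <= high)]
    present = [v for v in values if v is not None]
    summary = {"n": len(present), "mean": statistics.mean(present),
               "min": min(present), "max": max(present)}
    return missing, impossible, summary

ages = [34, None, 51, 430, 28]   # 430 is an impossible age
missing, impossible, summary = screen_column(ages, 0, 120)
```

The summary statistics often surface problems (an impossible maximum, a suspicious mean) even before individual records are inspected.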
  32. Data Entry and Manipulation • Look for outliers ◦ Outliers are extreme values for a variable given the statistical model being used ◦ The goal is not to eliminate outliers but to identify potential data contamination [scatter plot illustrating an outlier omitted]
  33. Data Entry and Manipulation • Methods to look for outliers ◦ Graphical • Normal probability plots • Regression • Scatter plots ◦ Maps ◦ Subtract values from the mean
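The "subtract values from the mean" method amounts to flagging values that lie many standard deviations away. A sketch in Python; the 2-standard-deviation threshold and the readings are arbitrary choices for illustration:

```python
# Sketch: flag values far from the mean, measured in standard deviations.
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean.

    Flagged values are candidates for review, not automatic deletion."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

readings = [10, 12, 11, 13, 12, 95]
print(flag_outliers(readings))  # → [95]
```

Consistent with the previous slide, the output is a review list: each flagged value should be checked against the source record before any correction is made.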
  34. Data Entry and Manipulation • Data contamination is data resulting from a factor not examined by the study that alters data values • Data error types: commission or omission • Quality assurance and quality control are strategies for ◦ preventing errors from entering a dataset ◦ ensuring the quality of entered data ◦ monitoring and maintaining data quality throughout the project • Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle
  35. Discussion Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX”. What screening & cleaning procedures were used?
  36. References 1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at 2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001). 3. A. D. Chapman, “Principles of Data Quality. Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at resources/download-publications/bookelets/
  37. References 1. Cook, 2013, NACP Best Data Management Practices Workshop. From 3.02.03.ppt 2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From sw-section-5.pdf 3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From management-to-enable-data-sharing
  39. LEARNING OUTCOMES • Explain why automation provides better provenance than manual processes. • Identify effective tools for automating data processing and analysis.
  40. Choose your tools wisely • Documents • Excel • Access • SPSS, Minitab • Mathematica, MATLAB, Scilab • SAS, Stata • R • MapReduce • NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
  41. Data Formats Overview • Spreadsheets are amazingly flexible and are commonly used for data collection, analysis, and management • Spreadsheets are seldom self-documenting, and seldom well-documented • Subtle (and not so subtle) errors are easily introduced during entry, manipulation, and analysis • Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently • Spreadsheet file formats are proprietary and thus generally unsuitable for long-term archival purposes
  42. Data Entry and Manipulation • Spreadsheets: ◦ Great for charts, graphs, and calculations ◦ Flexible about cell content type (cells in the same column can contain numbers or text) ◦ Lack record integrity (a column can be sorted independently of all others) ◦ Easy to use, but harder to maintain as the complexity and size of the data grow • Databases: ◦ Easy to query to select portions of the data ◦ Data fields are typed (for example, only integers are allowed in integer fields) ◦ Columns cannot be sorted independently of each other ◦ Steeper learning curve than a spreadsheet
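The record-integrity point can be seen in a few lines of Python: sorting whole records keeps each subject's fields together, while sorting one column on its own (which a spreadsheet happily allows) silently scrambles the pairings. The data are made up:

```python
# Sketch: record integrity — sort rows as units, never columns alone.
ids     = ["003", "001", "002"]
heights = [180, 165, 172]   # heights[i] belongs to ids[i]

# Record-oriented: sort rows as units; each height stays with its id.
records = sorted(zip(ids, heights))
# records == [('001', 165), ('002', 172), ('003', 180)]

# Column-oriented mistake: sorting one column alone breaks the pairing —
# ids_sorted no longer lines up with the unsorted heights list.
ids_sorted = sorted(ids)
```

This is why databases refuse to sort columns independently: the row, not the cell, is the unit of integrity.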
  43. NACP Best Data Management Practices, February 3, 2013. 5. Preserve information (cont.) • Use a scripted language to process data – R statistical package (free, powerful) – SAS – MATLAB • Processing scripts are records of processing – Scripts can be revised and rerun • Graphical User Interface-based analyses may seem easy, but don’t leave a record
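The point about scripts being records of processing holds in Python just as it does in R: the script states exactly what was done and can be revised and rerun, unlike a sequence of GUI clicks. The cleaning rule and data below are illustrative assumptions:

```python
# Sketch: a scripted cleaning step that doubles as a provenance record.
import datetime

def clean_ages(rows):
    """Drop records with impossible ages; the rule itself documents the edit."""
    return [r for r in rows if 0 <= r["age"] <= 120]

raw = [{"id": "001", "age": 34}, {"id": "002", "age": 430}, {"id": "003", "age": 28}]
cleaned = clean_ages(raw)

# A log line records when the step ran and how much it changed;
# together with the script it forms a minimal provenance trail.
log_line = f"{datetime.date.today().isoformat()}: kept {len(cleaned)}/{len(raw)} records"
```

Rerunning the script on corrected raw data reproduces the output exactly, which is the provenance advantage the slide claims for scripted over GUI-based analysis.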
  44. Provenance, Audit Trails, etc. • “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al., 2005) – Ancestral data products from which the data evolved – The process of transformation of these ancestral data products • Uses: data quality, audit trail, replication recipe, attribution, informational
  45. More Considerations • Field names & descriptions • Structured entry • Validation • Record integrity • Missing data • Data/field types • File types: common, open, documented standard • Output required for analysis and visualization
  46. Demonstration & Discussion Run [analysis] in Excel and Stata. Compare output. • What features does Stata have that Excel does not? • How do these features support provenance and data integrity?
  47. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From L04_DataEntryManipulation.pptx