Data Management Lab: Session 3 slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Week 1 lecture for High School Bioinformatics course; covers why we need to use computers in biology, what bioinformatics/computational biology is, an introduction to machine learning, and examples from current research
RDAP 15: The Role of Assessment in Research Data Services (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Amanda Whitmire, Jake Carlson, Patricia Hswe, Susan Wells Parham, Lizzy Rolando and Brian Westra
“Using assessment of NSF data management plans to enable evidence-based evolution of research data services”
Travis Weller, Amalia Monroe-Gulick
“Evaluating Research Needs by Methodology: Assessment at the University of Kansas”
Kathleen Fear, Data Librarian, University of Rochester
“Where’s the data? Assessing researcher compliance with publisher requirements for data sharing”
Expert panel on industrialising microbiomics - with Unilever (Eagle Genomics)
A panel of experts, including Dr Barry Murphy, Microbiomics Science Lead at Unilever, Dr Craig McAnulla, Senior Consultant for Bioinformatics, and Dr Yasmin Alam-Faruque, Scientific Data Manager/Biocurator, discuss first-hand experience and views on how to get better insights faster from microbiome data.
3 data normalization (2014 lab tutorial) (Dmitry Grapov)
Get more information:
http://imdevsoftware.wordpress.com/2014/10/11/2014-metabolomic-data-analysis-and-visualization-workshop-and-tutorials/
Recently I had the pleasure of teaching statistical and multivariate data analysis and visualization at the annual Summer Sessions in Metabolomics 2014, organized by the NIH West Coast Metabolomics Center.
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING (IJCSEIT Journal)
Quality is a key concept in every analysis and in computing applications. Today we gather large volumes of information, store them in multidimensional form in data warehouses, and then analyze the data to support precise decision making in various fields. Studies have shown that much of this data is not useful for analysis because of quality problems caused by improper data handling techniques. This paper seeks a solution that builds data quality into the foundation of data repositories and avoids quality anomalies at the metadata level. It also proposes a new model of metadata architecture.
Are Your Students Ready for Lab?
11/5/2015
Presenters: Bill Heslop and Tony Baldwin, Directors and Co-founders, Learning Science Ltd.
LabSkills is an online program that prepares students for their lab sessions through assignments in OWLv2, the leading online learning system for Chemistry. LabSkills makes it easy for you to require students to complete laboratory preparation prior to attending lab with demonstrations, interactive simulations, and quizzes. The newest version of LabSkills PreLabs is an enhanced course with 10 new techniques, plus new mobile-compatible simulations. LabSkills content is easy to assign and is automatically graded. LabSkills is currently used by schools and universities in more than 30 countries worldwide. In this webinar, you will learn how to get your students: engaged with practical work, prepared when they get to the lab, confident in performing the experiments, and using the time in the lab effectively.
Corporate Data Quality Management Research and Services Overview (Boris Otto)
This presentation provides an overview of the research and services portfolio of the Business Engineering Institute (BEI) St. Gallen in the field of corporate data quality management (CDQM). CDQM comprises topics such as data governance, data quality measurement, master data management, and data architecture management. At the core of the research and service portfolio is the Competence Center Corporate Data Quality (CC CDQ). The CC CDQ is a consortium research project at the Institute of Information Management at the University of St. Gallen (IWI-HSG). Partner companies come from various industry and service sectors.
How big is big data? We have moved a long way since storing files on floppy disks. Our guest speaker, Matt LeMay, explores big data at a human scale. Find out more about this topic at our upcoming Lab this September in Singapore.
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only? (Harald Erb)
Talk held at the DOAG 2016 conference (2016.doag.org/de/home) discussing a data lab concept, including an architecture blueprint, collaboration, and tool examples based on Oracle solutions such as Oracle Big Data Discovery (in combination with Jupyter Notebook).
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga... (Jürgen Ambrosi)
Data is the new capital: like financial capital, it is a resource that must be managed, collected, and kept secure, but it must also be invested by organizations that want to gain a competitive advantage. Data is not a new resource, but only today, for the first time, is it available in abundance along with the technologies needed to maximize its return, just as electricity was a laboratory curiosity for a long time until it was made available to the masses and completely changed the face of modern industry. That is why accelerating this change requires an innovative approach to executing Big Data initiatives: an analytics laboratory as a catalyst for innovation (the Data Lab). In this webinar on Oracle technologies, we will use our usual approach of storytelling based on use cases and concrete experiences.
Spring 2014 Data Management Lab: Session 1 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Knowledge discovery is the process of extracting knowledge from large amounts of data. The quality of the knowledge generated by the knowledge discovery process greatly affects the quality of the resulting decisions. Existing data must be qualified and tested to ensure that knowledge discovery processes can produce knowledge or information that is useful and feasible; this matters for strategic decision making in an organization. A data warehouse is created by combining multiple operational databases and external data, a process that is very vulnerable to incomplete, inconsistent, and noisy data. Data mining provides a mechanism to correct these deficiencies before the data is finally stored in the data warehouse. This research presents a technique to improve the quality of information in the data warehouse.
Machine Learning for Predictive Data Analysis in Clinical Research (ClinosolIndia)
Machine learning (ML) techniques have the potential to revolutionize predictive data analysis in clinical research by enabling researchers to uncover insights, make informed decisions, and develop more personalized treatment approaches. Here's how machine learning can be applied to predictive data analysis in clinical research.
Data Cleaning and Validation: Best Practices for Data Integrity (ClinosolIndia)
Data cleaning and validation are critical processes to ensure the integrity, accuracy, and reliability of clinical data. These best practices can help maintain data quality and enhance the validity of research outcomes:
Define Data Cleaning and Validation Procedures Early: Establish clear data cleaning and validation procedures as part of the study protocol or data management plan. Define data validation rules, data range checks, and data cleaning criteria upfront to ensure consistency and adherence to predefined standards.
Use Electronic Data Capture (EDC) Systems: Implement EDC systems that offer built-in data validation checks, range validations, and skip patterns. EDC systems can prevent certain types of errors during data entry and facilitate real-time validation as data is collected.
Develop Data Validation Checks: Create automated validation checks to identify discrepancies, outliers, missing data, and inconsistencies. These checks can include cross-field validations, data range validations, and logical validations based on predefined rules.
Standardize Data Entry: Enforce standardized data entry formats and units to minimize variability and errors. Provide clear instructions to data entry personnel to ensure consistent and accurate data collection.
Implement Double Data Entry and Review: For critical data points, consider implementing a double data entry process where data is entered by two independent personnel. Any discrepancies between the two entries are flagged for resolution. A third reviewer can adjudicate discrepancies if necessary.
A simplified approach for quality management in data warehouse (IJDKP)
Data warehousing is continuously gaining importance as organizations realize the benefits of decision-oriented databases. However, the stumbling block to this rapid development is data quality issues at various stages of data warehousing. Quality can be defined as a measure of excellence or a state free from defects. Users appreciate quality products, and the available literature suggests that many organizations have significant data quality problems with substantial social and economic impacts. A metadata-based quality system is introduced to manage the quality of data in a data warehouse. The approach is used to analyze the quality of a data warehouse system by checking the expected values of quality parameters against the actual values. The proposed approach is supported by a metadata framework that can store additional information to analyze the quality parameters whenever required.
Data Management Lab: Data mapping exercise instructions (IUPUI)
Spring 2014 Data Management Lab: Session 1 Data mapping exercise instructions (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti... (Health Catalyst)
Healthcare organizations increasingly rely on data to inform strategic decisions. This growing dependence makes ensuring data across the organization is fit for purpose more critical than ever. Decision-making challenges associated with pandemic-driven urgency, variety of data, and lack of resources have further highlighted the critical importance of healthcare data quality and prompted more focus and investment. However, many data quality initiatives are too narrow in focus and reactive in nature or take longer than expected to demonstrate value. This leaves organizations unprepared for future events, like COVID-19, that require a rapid enterprise-wide analytic response.
What are some actionable ways you can help your organization guard against the data quality challenges uncovered this past year and better prepare to respond in the future? Join Taylor Larsen, Director of Data Quality for Health Catalyst, to learn more.
What You’ll Learn
- How data profiling and data quality assessments, in combination with your data catalog, can increase data quality transparency, expedite root cause analysis, and close data quality monitoring gaps.
- How to leverage AI to reduce data quality monitoring configuration and maintenance time and improve accuracy.
- How defining data quality based on its measurable utility (i.e., data represents information that supports better decisions) can provide a scalable way to ensure data are fit for purpose and avoid cost outstripping return.
DataONE Education Module 01: Why Data Management? (DataONE)
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Automating Data Science over a Human Genomics Knowledge Base (Vaticle)
# Automating Data Science over a Human Genomics Knowledge Base
Radouane Oudrhiri, the CTO of Eagle Genomics, will talk about how Eagle Genomics is building a platform for automating data science over a human genomics knowledge base. Rad will dive into the architecture of Eagle Genomics and also discuss how Grakn serves as the knowledge base foundation of the system. Rad will also give a brief history of databases, semantic expressiveness, and how Grakn fits into the big picture.
# Radouane Oudrhiri, CTO, Eagle Genomics
Radouane has extensive experience in leading world-class software and data-intensive system developments in industries ranging from telecom to healthcare, nuclear, automotive, and financial services. Radouane is a Lean/Six Sigma Master Black Belt specialising in high-tech, IT, and software engineering, and he is recognised as a leader and early adopter of Lean/Six Sigma and DFSS for IT and software. He is a fellow of the Royal Statistical Society (RSS) and a member of the ISO Technical Committee (TC69: Applications of Statistical Methods), where he is co-author of the Lean & Six Sigma standard (ISO 18404) as well as the new standard under development (Design for Six Sigma). He is also part of the newly formed international group on Big Data (nominated by BSI as the UK representative/expert). Radouane has also been chair of the working group on Measurement Systems for Automated Processes/Systems within the ISPE (International Society for Pharmaceutical Engineering).
How do you assess the quality and reliability of data sources in data analysi... (Soumodeep Nanee Kundu)
**Assessing the Quality and Reliability of Data Sources in Data Analysis**
Data is often referred to as the lifeblood of data analysis. It forms the foundation upon which decisions are made, insights are drawn, and actions are taken. However, not all data is created equal. The quality and reliability of data sources are paramount to the success of data analysis efforts. In this essay, we will explore the intricate process of assessing data quality and reliability, touching on the methods, considerations, and best practices to ensure the data used in the analysis is trustworthy and fit for purpose.
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control... (ClinosolIndia)
Ensuring data quality is crucial in clinical trials to generate reliable and valid results. High-quality data allows for accurate analysis, interpretation, and decision-making regarding the safety and efficacy of investigational products. Here are some best practices and quality control measures to enhance data quality in clinical trials:
Standardized Data Collection: Implement standardized data collection procedures, including the use of case report forms (CRFs) or electronic data capture (EDC) systems. Clearly define data elements, variables, and measurement scales to minimize inconsistencies and errors in data entry.
Training and Education: Provide comprehensive training to investigators, site staff, and data entry personnel on the protocol, data collection procedures, and Good Clinical Practice (GCP) guidelines. Training ensures understanding and adherence to the study requirements, leading to accurate and consistent data collection.
Source Data Verification (SDV): Perform source data verification to compare data recorded in the CRFs or EDC systems with the original source documents (e.g., medical records, laboratory reports). This process helps identify discrepancies, errors, or missing data, ensuring data accuracy and integrity.
Data Management Plan: Develop a robust data management plan that outlines procedures for data collection, handling, storage, and analysis. The plan should include data validation checks, query resolution processes, and data reconciliation between different data sources.
Electronic Data Capture (EDC) Systems: Utilize EDC systems to facilitate real-time data capture, improve data accuracy, and streamline data management processes. EDC systems often have built-in data validation checks, range checks, and skip patterns to minimize data entry errors.
Data Quality in Test Automation: Navigating the Path to Reliable Testing (Knoldus Inc.)
"Data Quality in Test Automation: Navigating the Path to Reliable Testing" delves into the crucial role of data quality within the realm of test automation. It explores strategies and methodologies for ensuring reliable testing outcomes by addressing challenges related to the accuracy, completeness, and consistency of test data. The discussion encompasses techniques for managing, validating, and optimizing data sets to enhance the effectiveness and efficiency of automated testing processes, ultimately fostering confidence in the reliability of software systems.
Ethical Principles for the All Data Revolution (Melissa Moody)
A presentation by Stephanie Shipp, from the Research Highlights session at the 2019 Women in Data Science Charlottesville Conference. Hosted by the UVA Data Science Institute.
Everything related to CDM. Importance of CDM, Flow Activities in Clinical Trials, Data Management Plan, Database Designing, Data Management tools, Essential Characters of the database, Standard Global Dictionaries, Data Review and Validation, Query Generation, Database Lock, Technology in CDM, and Professionals of CDM.
Similar to Data Management Lab: Session 3 Slides (20)
LITA’s Altmetrics and Digital Analytics Interest Group is proud to present Heather Coates, Richard Naples, and Lauren Collister in our second free webinar of the season. Heather will introduce the concept of altmetrics with a quick "Altmetrics 101," Richard will discuss the Smithsonian's implementation of Altmetric, and Lauren will share the University of Pittsburgh's experience with Plum Analytics.
Gather evidence to demonstrate the impact of your research (IUPUI)
This workshop is the 3rd in a series of 4 titled "Maximize your impact" offered by the IUPUI University Library Center for Digital Scholarship. Faculty must provide strong evidence of impact in order to achieve promotion and tenure. Having strong evidence in year 5 is made easier by strategic dissemination early in your tenure track. In this hands-on workshop, we will introduce key sources of evidence to support your case, demonstrate strategies for gathering this evidence, and provide a variety of examples. These sources include citation metrics, article level metrics, and altmetrics as indicators of impact to support your narrative of excellence.
An introduction to open science for the Library Journal webcast Case Studies for Open Science on February 9, 2016.
http://lj.libraryjournal.com/2016/01/webcasts/case-studies-for-open-science/
Academics must provide evidence to demonstrate the impact and outcomes of their scholarly work. This webinar, presented by librarians, will help faculty explore various forms of documentary evidence to support their case for excellence. Sponsored by the IUPUI Office of Academic Affairs.
Note: The webinar included demonstrations of Web of Science & Scopus, which the slides do not reflect.
Teaching data management in a lab environment (IASSIST 2014) (IUPUI)
Equipping researchers with the skills to effectively utilize data in the global data ecosystem requires proficiency with data literacies and electronic resource management. This is a valuable opportunity for libraries to leverage existing expertise and infrastructure to address a significant gap in data literacy education. This session will describe a workshop for developing core skills in data literacy. In light of the significant gap between common practice and effective strategies emerging from specific research communities, we incorporated elements of a lab format to build proficiency with specific strategies. The lab format is traditionally used for training procedural skills in a controlled setting, which is also appropriate for teaching many daily data management practices. The focus of the curriculum is to teach data management strategies that support data quality, transparency, and re-use. Given the variety of data formats and types used in health and social sciences research, we adopted a skills-based approach that transcends particular domains or methodologies. Attendees applied selected strategies using a combination of their own research projects and a carefully defined case study to build proficiency.
Objectives: To explore potential collaborations between academic libraries and Clinical Translational Science Award (CTSA)-funded institutes with respect to data management training and support.
Methods: The National Institutes of Health CTSAs have established a well-funded, crucial infrastructure supporting large-scale collaborative biomedical research. This infrastructure is also valuable for smaller, more localized research projects. While infrastructure and corresponding support are often available for large, well-funded projects, these services have generally not been extended to smaller projects. This is a missed opportunity on both accounts. Academic libraries providing data services can leverage CTSA-based resources, while CTSA-funded institutes can extend their reach beyond large biomedical projects to serve the long tail of research data.
Results: A year-long series of conversations with the Indiana CTSI Data Management Team resulted in resource sharing, consensus building about key issues in data management, provision of expert feedback on a data management training curriculum, and several avenues for future collaborations.
Conclusions: Data management training for graduate students and early career researchers is a vital area of need that would benefit from the combined infrastructure and expertise of translational science institutes and academic libraries. Such partnerships can leverage the instructional, preservation, and access expertise in academic libraries, along with the storage, security, and analytical expertise in translational science institutes to improve the management, protection, and access of valuable research data.
Data sharing promotes many goals of the NIH research endeavor. It is particularly important for unique data that cannot be readily replicated. Data sharing allows scientists to expedite the translation of research results into knowledge, products, and procedures to improve human health. Do you know what a data sharing plan should include? Are you aware of common practices and standards for data sharing? Do you know what services are available to help share your data responsibly? This workshop will begin to address these questions. Q&A will follow the presentation. Anyone interested in or planning to apply for NIH funding should attend. Note: The NIH data-sharing policy applies to applicants seeking $500,000 or more in direct costs in any year of the proposed research.
Data Management Lab: Session 4 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Data Management Lab: Session 4 Review Outline (IUPUI)
Data Management Lab: Session 4 Review Outline (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Data Management Lab: Session 3 Data Entry Best Practices (IUPUI)
Data Management Lab: Session 3 Data Entry Best Practices (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Data Management Lab: Session 3 Data Coding Best Practices (IUPUI)
Data Management Lab: Session 3 Data Coding Best Practices (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
Spring 2014 Data Management Lab: Session 2 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
4. Data Integrity
1. Data have integrity if they have been maintained without unauthorized alteration or destruction.
2. Data integrity is data that has a complete or whole structure.
(http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
5. Data Quality
• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data management
• Ensured by
– Sufficient resources and expertise
– Paying close attention to the design of data collection instruments
– Creating appropriate entry, validation, and reporting processes
– Ongoing QC processes
– Understanding the data collected
Chapman, 2005
Dept of Biostatistics – Data Management, IUSM
6. Data Quality Standards
• Check data for its logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for stolen and recovered property.
• Ensure that other statistical edit functions are processed within established parameters.
FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
7. Data Entry and Manipulation
• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the quality of data during the study
8. Data Entry and Manipulation
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
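One way to make "define & enforce standards" concrete is to write the agreed formats, codes, and units down as a small machine-readable field specification and check incoming records against it. The sketch below is illustrative only and is not part of the lab materials; the field names, codes, and limits are hypothetical, and it assumes Python is available.
# Minimal sketch: encode agreed standards (format, codes, units) as a field
# specification, then flag rows that violate them before entry is accepted.
# All field names and rules here are hypothetical examples.
import re

FIELD_SPEC = {
    "participant_id": {"type": str,   "format": r"^P\d{4}$"},      # e.g. P0042
    "sex":            {"type": str,   "codes": {"1", "2", "9"}},    # 1=male, 2=female, 9=unknown
    "weight_kg":      {"type": float, "min": 20.0, "max": 300.0},   # measurement unit: kilograms
}

def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one data record."""
    problems = []
    for field, spec in FIELD_SPEC.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
            continue
        if "format" in spec and not re.match(spec["format"], str(value)):
            problems.append(f"{field}: '{value}' does not match the required format")
        if "codes" in spec and str(value) not in spec["codes"]:
            problems.append(f"{field}: '{value}' is not an allowed code")
        if "min" in spec and float(value) < spec["min"]:
            problems.append(f"{field}: {value} is below the minimum {spec['min']}")
        if "max" in spec and float(value) > spec["max"]:
            problems.append(f"{field}: {value} is above the maximum {spec['max']}")
    return problems

print(check_record({"participant_id": "P0042", "sex": "3", "weight_kg": 72.5}))
# -> ["sex: '3' is not an allowed code"]
Keeping the specification in one place also makes it easy to hand off responsibility for data quality: the designated person maintains the specification rather than ad hoc rules scattered across files.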
9. Quality Assurance v. Control
• QA: set of processes, procedures, and activities that are initiated prior to data collection to ensure the expected level of quality will be reached and data integrity will be maintained.
• QC: a system for verifying and maintaining a desired level of quality in a product or service.
http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
10. Quality Assurance in Practice
• CRF (data collection instrument) review & validation
• System/process testing & validation
• Training, education, communication of a team
• Standard Operating Procedures, Standard Operating Guidelines
• Site audits
Dept of Biostatistics – Data Management, IUSM
11. Quality Control in Practice
• Set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection.
• Examples:
– Errors in individual data fields
– Systematic errors
– Violation of protocol
– Staff performance issues
– Fraud or scientific misconduct
Dept of Biostatistics – Data Management, IUSM
12. Activity
Define data quality standards for the following variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends
Don’t forget to upload this to Box.
Suggested file name “Data Quality Standards”
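If you want to go one step beyond writing your standards in prose, the activity variables above can also be expressed directly as range checks. This is an illustrative Python sketch, not a prescribed answer for the activity; the numeric limits shown are placeholder assumptions that each study would set for itself.
# Illustrative only: one possible way to encode data quality standards for the
# activity variables as range checks. The numeric limits are assumptions.
STANDARDS = {
    "age":               (18, 100),   # years; study-specific eligibility range
    "height_cm":         (100, 230),  # centimetres
    "bmi":               (12, 70),    # kg/m^2
    "life_satisfaction": (1, 7),      # e.g. a 1-7 Likert-type scale
    "n_close_friends":   (0, 50),     # count; upper bound flags implausible values
}

def out_of_range(variable: str, value: float) -> bool:
    low, high = STANDARDS[variable]
    return not (low <= value <= high)

sample = {"age": 34, "height_cm": 1.72, "bmi": 24.1, "life_satisfaction": 9, "n_close_friends": 3}
for var, val in sample.items():
    if out_of_range(var, val):
        print(f"Check {var}: value {val} is outside {STANDARDS[var]}")
# Flags a height recorded in metres instead of centimetres and an impossible
# life satisfaction score.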
13. References
1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email)
2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
18. Activity
Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”
Don’t forget to upload this to Box.
Suggested file name “Data Collection Tool”
19. References
1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably. http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
23. Goals of Data Entry
• Publishable results!
– Valid data that are organized to support smooth analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure
25. Activity
Draft data coding scheme for data entry
• Review data entry best practices document in Box
Don’t forget to upload this to Box.
Suggested file name “Coding Scheme”
26. References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx
2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist, presented at the AGU Workshop. From http://wiki.esipfed.org/index.php/2011AGUworkshop
3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
29. Data Entry and Manipulation
Data Contamination
• Process or phenomenon, other than the one of interest, that affects the variable value
• Erroneous values
CC image by Michael Coghlan on Flickr
30. Data Entry and Manipulation
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the field
CC image by Nick J Webb on Flickr
31. Data Entry and Manipulation
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
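The "double entry" check on this slide can be automated with a few lines of code that compare the two independently keyed files cell by cell. A minimal sketch with pandas follows; it assumes both files contain the same records and column names, and the file names and the record_id key are hypothetical, not from the lab materials.
# Minimal double-entry comparison sketch (assumes pandas is installed and that
# entry1.csv / entry2.csv are the two independently keyed versions of the data,
# with identical record IDs and column names).
import pandas as pd

first  = pd.read_csv("entry1.csv").set_index("record_id").sort_index()
second = pd.read_csv("entry2.csv").set_index("record_id").sort_index()

# compare() returns only the cells where the two entries disagree,
# labelled "self" (first entry) and "other" (second entry).
disagreements = first.compare(second)

if disagreements.empty:
    print("The two entries agree on every field.")
else:
    print(f"{len(disagreements)} record(s) with discrepancies to resolve:")
    print(disagreements)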
32. Data Entry and Manipulation
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
33. Data Entry and Manipulation
• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
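A quick screening pass like the one on this slide is easy to script. The sketch below is illustrative only; the file name study_data.csv and the record_id column are placeholder assumptions. It checks for missing values and prints statistical summaries that make impossible or anomalous values stand out.
# Illustrative screening pass: missing values plus statistical summaries.
# File and column names are placeholders for your own dataset.
import pandas as pd

df = pd.read_csv("study_data.csv")

# Missing values per column
print("Missing values per column:")
print(df.isna().sum())

# Statistical summaries: min/max quickly expose impossible values
# (e.g. a negative age or a height of 0).
print(df.describe(include="all"))

# Duplicate identifiers often indicate rows that did not line up correctly.
print("Duplicate IDs:", df["record_id"].duplicated().sum())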
34. Data Entry and Manipulation
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model being used
◦ The goal is not to eliminate outliers but to identify potential data contamination
[scatter plot illustrating an outlying value]
35. Data Entry and Manipulation
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Maps
◦ Subtract values from mean
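The last method listed above ("subtract values from mean") is essentially a z-score screen, and the graphical checks can be produced with a few lines as well. Below is an illustrative Python sketch; the column name, sample values, and the cut-off are assumptions, and (as the previous slide notes) the flagged points are candidates for review, not deletion.
# Illustrative outlier screen: deviation from the mean (z-scores) plus a scatter plot.
# Assumes pandas and matplotlib are installed; "value" is a placeholder column name.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"value": [12, 14, 13, 15, 14, 13, 55, 12, 14, 13]})

# Subtract values from the mean and scale by the standard deviation.
z = (df["value"] - df["value"].mean()) / df["value"].std()
candidates = df[z.abs() > 2.5]   # the cut-off is a judgment call; review, don't delete
print(candidates)

# Graphical check: a simple scatter plot makes the extreme point obvious.
plt.scatter(df.index, df["value"])
plt.xlabel("observation")
plt.ylabel("value")
plt.show()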
36. Data Entry and Manipulation
• Data contamination is data affected by a process or factor other than the one of interest, resulting in altered data values
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
◦ preventing errors from entering a dataset
◦ ensuring data quality for entered data
◦ monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle
37. Discussion
Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX”
What screening & cleaning procedures were used?
38. Data Entry and Manipulation
1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001).
3. A. D. Chapman, “Principles of Data Quality: Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
39. References
1. Cook, 2013, NACP Best Data Management Practices Workshop. From http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt
2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf
3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
42. Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
http://www.dataone.org/all-software-tools
43. Data Formats; Version 1.0
Overview
• Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management
• Spreadsheets are seldom self-documenting, and seldom well-documented
• Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis
• Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently
• Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes
44. Data Entry and Manipulation
Spreadsheets:
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity (can sort a column independently of all others)
• Easy to use – but harder to maintain as complexity and size of data grows
Databases:
• Easy to query to select portions of data
• Data fields are typed – for example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
45. NACP Best Data Management Practices, February 3, 2013
5. Preserve information (cont)
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts are records of processing
– Scripts can be revised, rerun
• Graphical User Interface-based analyses may seem easy, but don’t leave a record
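The slide recommends R, SAS, or MATLAB, but the point holds in any scripted language. Below is an illustrative Python sketch of a processing step written as a script so that the transformation itself becomes part of the record; the file and column names are placeholders, not from the lab. Because the script can be re-read and rerun, it also serves as part of the provenance and audit trail discussed on the next slide.
# Illustrative processing script: every step is written down, so the script
# itself documents how the derived file was produced and can be rerun later.
# Input/output file and column names are placeholders.
import pandas as pd

RAW_FILE   = "raw_measurements.csv"
CLEAN_FILE = "derived_measurements_v2.csv"

df = pd.read_csv(RAW_FILE)

# Step 1: drop records flagged as test entries during collection.
df = df[df["is_test"] == 0]

# Step 2: convert height from centimetres to metres for analysis.
df["height_m"] = df["height_cm"] / 100

# Step 3: write the derived dataset; the raw file is never modified.
df.to_csv(CLEAN_FILE, index=False)
print(f"Wrote {len(df)} records to {CLEAN_FILE}")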
46. Provenance, Audit Trails, etc.
• “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al, 2005)
– Ancestral data products from which the data evolved
– Process of transformation of these ancestral data products
• Uses: data quality, audit trail, replication recipe, attribution, informational
47. More Considerations
• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization
48. Demonstration & Discussion
Run [analysis] in Excel and Stata.
Compare output.
• What features does Stata have that Excel does not?
• How do these features support provenance and data integrity?
49. References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx