CRRC
Data Quality
Issues and “Fixes”
Dr. Fritz Scheuren
July 3, 2009
(for academic purposes only)
Two Definitions of Quality
• Conformance to Requirements
• (Traditional Producer-Oriented
Definition)
• Fitness for Use
• (Modern Client-Oriented
Definition)
Definition of Process Quality
• Process Improvements Focus
• (Do It Right the First Time)
• Can be Reduced to Slogans
• Can also lead to Continuous
Improvements
• Kaizen
Be Real: Four Quality Costs
• Costs of Reputation and Loss of
Business from Inaction
• Cost of Prevention to Avoid Errors
• Cost of Detection to Find Errors
• Cost of Repairing Errors Found
Quality and Cost: Two Worlds
Repair Methods
• Goal is “Fixing” to Fit Use
• Data Editing
• Data Imputation
• Data Fabrication
• Raking at NSS
Data Editing
• Honest Differences of Opinion or
Real Errors?
• Need for Redundancy in System for
Can’t-Fail Items
• Achieving Measurability to Frame
Expectations and Improvements
Types of Edits Illustrated
• Range Test
Age Negative
• Deterministic Tests
If Age <= 14, then code as Child
• Probabilistic Tests
If Income > $1,000,000, take a look
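The three kinds of edit on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production edit system: the function `run_edits`, the field names `age`, `income` and `status`, and the exact cut-offs (age 14 or under coded as a child, income above 1,000,000 sent for review) are hypothetical readings of the slide.

```python
def run_edits(record):
    """Return (edited_copy, flags) for one record; the input is not modified."""
    edited = dict(record)
    flags = []

    age = edited.get("age")
    income = edited.get("income")

    if age is not None and age < 0:
        # Range test: a negative age is impossible, so flag it as an error.
        flags.append("range: age is negative")
    elif age is not None and age <= 14:
        # Deterministic test: the rule decides the code outright.
        # Assumed reading of the slide: age 14 or under is coded as a child.
        edited["status"] = "Child"

    if income is not None and income > 1_000_000:
        # Probabilistic test: unusual but possible, so ask for a look
        # instead of changing the value.
        flags.append("review: income above 1,000,000")

    return edited, flags


# Hypothetical records, one per kind of edit.
records = [
    {"age": -3, "income": 20_000},
    {"age": 12, "income": 0},
    {"age": 45, "income": 2_500_000},
]
results = [run_edits(r) for r in records]
for edited, flags in results:
    print(edited, flags)
```

Note that only the deterministic test changes anything. The range and probabilistic tests only raise flags, which leaves the decision to a person.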
Practical Editing Tips
• Edit for Diagnosis, not just
Correction
• Don’t Edit Outside Your Confidence
Interval
• Preserve the Original Dataset as
Backup to Avoid Irreversible
Changes
• Keep Tallies of all Errors Found
Not all errors need to be
corrected
Resist your Perfectionist
Tendencies
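Two of these tips, preserving the original dataset and keeping tallies of all errors found, lend themselves to a short sketch. The code below is only an illustration: `edit_with_tally`, the `_failed` marker, the two checks and the sample records are all hypothetical.

```python
import copy
from collections import Counter


def edit_with_tally(original_records, checks):
    """Run checks on a deep copy; return (working_copy, tally of failures)."""
    working = copy.deepcopy(original_records)  # the original stays as backup
    tally = Counter()
    for record in working:
        for name, is_ok in checks.items():
            if not is_ok(record):
                tally[name] += 1  # count every error found
                # Mark the record for diagnosis; nothing is corrected here.
                record.setdefault("_failed", []).append(name)
    return working, tally


# Hypothetical checks and data.
checks = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "income_nonnegative": lambda r: r["income"] >= 0,
}
original = [
    {"age": 34, "income": 52_000},
    {"age": -1, "income": 18_000},
    {"age": 150, "income": -5},
]
working, tally = edit_with_tally(original, checks)
print(tally)
```

A tally by check shows which part of the collection system produces the most errors, which supports editing for diagnosis and not just correction.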
More Practical Edit Tips
• Use your skilled staff to
improve system rather than
just edit data
• Never just depend on Intuition
but still use it too!
• Employ Redundancy, Frugally!
Capture Recapture Methods
(Double Keying Example)
• Two-by-Two Table with Cells
A B
C D
• Comparing Data Keyed the Same each
time (A) with Errors Detected, (B and C)
• How to Estimate D?
• One Model D = BC/A?
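The model D = BC/A follows if the two keyings err independently, because the cells of the table then satisfy A × D = B × C. A minimal sketch of the arithmetic is below. It reads the slide as follows, which is an assumption: A counts fields keyed the same both times, B and C count fields where only one keying was in error, and D counts errors made in both keyings and so never detected. The function name and the counts are hypothetical.

```python
def estimate_undetected(a, b, c):
    """Estimate cell D as B*C/A, assuming the two keyings err independently."""
    if a <= 0:
        raise ValueError("A must be positive to estimate D")
    return b * c / a


# Hypothetical counts from a double-keying check.
a, b, c = 9_700, 200, 97
d_hat = estimate_undetected(a, b, c)
print(d_hat)
```

If the same kind of field is hard for both keyers, the errors are not independent and this model will tend to understate D.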
Bottom Line Take-Away
• Use Data Checking to
Understand Data’s Fitness for
Use
• Edit but Don’t Over-Edit
• Use Edit Checks to Prevent
Future Errors
Data Editing and Data
Imputation
• Joint Role of Imputation and
Editing: No Clear Line?
• Editing “fixes” Often are
Model-Based Hunches
• Data Quality (editing)
• Information Quality
(imputation)
Imputation Versus Editing
• What is Imputation?
• Handles Missing and
Misreported Data
• Imputation Goal is Roughly
Right! (Information Quality)
• Editing Goal is often “Correction”:
Exactly Right? (Data Quality)
Data Imputation Techniques
• Imputation Needs More
Justification when Data Quality
is the Goal
• Must be no more than Cosmetic
in Nature, if done at all
• Can only be Aggressively applied
for Information Quality Goal
Fellegi-Holt Example
• Identify Errors with Automated Edit
Detection Software
• Hot Deck acceptable values from
Records that Pass Edits
• Can be worth doing if errors are
minor or cosmetic (e.g., Rounding)
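The hot deck step can be sketched as below. This is not the Fellegi-Holt method itself, which also works out which fields to change; it shows only the donor step, with values taken from records that pass the edit. The function `hot_deck`, the `imputed` marker, the random choice of donor and the sample data are assumptions made for illustration.

```python
import random


def hot_deck(records, field, passes_edit, rng=None):
    """Replace `field` in failing records with a value from a passing record."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    donors = [r[field] for r in records if passes_edit(r)]
    if not donors:
        raise ValueError("no records pass the edit, so there are no donors")
    result = []
    for r in records:
        fixed = dict(r)  # leave the input records unchanged
        if not passes_edit(r):
            fixed[field] = rng.choice(donors)
            fixed["imputed"] = True  # paradata: mark the value as imputed
        result.append(fixed)
    return result


def age_ok(record):
    """Hypothetical edit: age must be present and between 0 and 120."""
    return record["age"] is not None and 0 <= record["age"] <= 120


data = [{"age": 31}, {"age": -4}, {"age": 58}, {"age": None}]
imputed = hot_deck(data, "age", age_ok)
print(imputed)
```

In practice donors are usually drawn from similar records, not from the whole file. The marker on each imputed value is what makes the impact measurable later.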
More on Imputation
• Treat Influential Errors Individually
not just Automatically
• That Said, Software Fixes can lead
to Better Documentation (Paradata
Matters)
• Need to Measure Variance Impacts
• Provide a natural brake on
Overediting but seldom used for this.
Edit/Imputation Summary
• Most Editing Mainly
Eliminates the Bad
• Replacing it with a
(Good?) Guess of some Sort
• Imputation emphasizes
Guessing even more
More Editing/Imputation
• Best Imputation Practice tries to
quantify Guessing impact on
Information Quality
• Editing has not improved as much as
Imputation
• Editing/Imputation needs more Joint
Theory, especially to Measure and
Use Mean Square Error Impacts
First Illustrative Example
• Fabrication/Falsification
• Illustrate the General Points
about Editing and Imputation
• Emphasize Importance of
Fabrication threat to Quality
Fabrication/Falsification
• Respondent/Interviewer
Make up Data
• How Common?
• How to Reduce?
• How to Detect?
Right Structure
Right Resources
• Examine Practice Elsewhere?
• www.amstat.org Website
• Key is right incentives
• Good staff/training
• But Eternal Vigilance
Second Illustration
• Raking Application at NSS
• To link up to Next Talk
• To illustrate Information
Quality that is fit for use
despite Data Quality
Raking Quality “Fix”
• What is Raking?
• How does it improve quality?
Not Data Quality
But Information Quality
• Sometimes both:
Better Point Estimates
More Stable (smaller variances)
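Raking is also known as iterative proportional fitting: weighted sample counts are scaled, first by rows and then by columns, until the margins match known control totals. The sketch below shows the idea on a two-way table. It does not reproduce the NSS application, and the function `rake`, the sample counts and the control totals are all hypothetical.

```python
def rake(table, row_targets, col_targets, iterations=100, tol=1e-9):
    """Iterative proportional fitting of a two-way table to target margins."""
    cells = [list(row) for row in table]  # work on a copy
    for _ in range(iterations):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [v * target / total for v in cells[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Stop once the row totals still match after the column step.
        if all(abs(sum(cells[i]) - t) < tol
               for i, t in enumerate(row_targets)):
            break
    return cells


# Hypothetical sample counts and known population margins.
sample = [[40.0, 30.0], [35.0, 45.0]]
raked = rake(sample, row_targets=[500.0, 500.0], col_targets=[480.0, 520.0])
for row in raked:
    print([round(v, 2) for v in row])
```

The individual records are not changed at all; only the weights move. That is why raking is described here as a fix to information quality and not to data quality.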
Quality Summary
• Editing: Data Quality
• Imputation: Information Quality
• Raking: Information Quality
• Fabrication Can Harm Both
• Must be guarded against always
Almost Done Now
• Tried to Stay Practical, with a Frank
Discussion of Key Weaknesses in
Current Practice
• Deeper Understanding of Data
Quality
• But at an Applied Level
Thank you
Fritz Scheuren
Scheuren@aol.com