ILCS Raking

  1. Data Quality Issues and “Fixes”
     CRRC
     Dr. Fritz Scheuren
     July 3, 2009
     (for academic purposes only)
  2. Two Definitions of Quality
     • Conformance to Requirements (Traditional Producer-Oriented Definition)
     • Fitness for Use (Modern Client-Oriented Definition)
  3. Definition of Process Quality
     • Process Improvements Focus (Do It Right the First Time)
     • Can be Reduced to Slogans
     • Can also lead to Continuous Improvements
     • Kaizen
  4. Be Real: Four Quality Costs
     • Costs of Reputation and Loss of Business from Inaction
     • Cost of Prevention to Avoid Errors
     • Cost of Detection to Find Errors
     • Cost of Repairing Errors Found
  5. Quality and Cost: 2 Worlds
  6. Repair Methods
     • Goal is “Fixing” to Fit Use
     • Data Editing
     • Data Imputation
     • Data Fabrication
     • Raking at NSS
  7. Data Editing
     • Honest Differences of Opinion or Real Errors?
     • Need for Redundancy in System for Can’t Fail Items
     • Achieving Measurability to Frame Expectations and Improvements
  8. Data Editing Techniques
     • Minimizing Processing Errors
     • Definitional (e.g., Range) Tests
     • Deterministic Tests
     • Probabilistic Tests
         – Outlier Tests
         – Ratio Tests
  9. Types of Edits Illustrated
     • Range Test
         Age Negative
     • Deterministic Tests
         If Age =14, then code as Child
     • Probabilistic Tests
         If Income > $1,000,000, take a look
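The three edit types illustrated on the slide above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name, the record layout, and the parameter names are hypothetical. The slide reads "Age =14"; the sketch assumes the intended rule is "age 14 or under", which is a guess. It likewise assumes the income rule means "above $1,000,000".

```python
def run_edits(record, child_age_cutoff=14, income_review_level=1_000_000):
    """Apply one range, one deterministic, and one probabilistic edit.

    Hypothetical sketch: field names and thresholds are assumptions.
    Returns a coded copy of the record plus a list of flags for review.
    """
    flags = []
    coded = dict(record)  # work on a copy so the original record is preserved

    age = record.get("age")
    income = record.get("income")

    # Range test: a negative age is impossible by definition.
    if age is not None and age < 0:
        flags.append("range: negative age")

    # Deterministic test: recode using a fixed rule.
    # Assumption: "14 or under" is the intended comparison.
    if age is not None and 0 <= age <= child_age_cutoff:
        coded["age_group"] = "Child"

    # Probabilistic test: the value is unusual but possible,
    # so it is flagged for a human to look at rather than changed.
    if income is not None and income > income_review_level:
        flags.append("review: unusually high income")

    return coded, flags
```

Note that only the deterministic rule changes the record; the other two tests produce flags, which keeps the edit useful for diagnosis and not just correction.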
  10. Practical Editing Tips
      • Edit for Diagnosis, not just Correction
      • Don’t Edit Outside Your Confidence Interval
      • Preserve the Original Dataset as Backup to Avoid Irreversible Changes
      • Keep Tallies of all Errors Found
  11. Not all errors need to be corrected
      Resist your Perfectionist Tendencies
  12. More Practical Edit Tips
      • Use your skilled staff to improve system rather than just edit data
      • Never just depend on Intuition but still use it too!
      • Employ Redundancy, Frugally!
  13. Capture Recapture Methods (Double Keying Example)
      • Two-by-Two Table with Cells
            A   B
            C   D
      • Comparing Data Keyed the Same each time (A) with Errors Detected (B and C)
      • How to Estimate D?
      • One Model: D = BC/A?
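The model D = BC/A from the double keying slide above is what a two-by-two table gives when the two keyings are treated as independent (so that A·D = B·C). A minimal sketch of that arithmetic follows; the function name and the counts are hypothetical and are not taken from the talk.

```python
def estimate_unseen_cell(a, b, c):
    """Estimate cell D of a two-by-two table as B*C/A.

    Assumes the two keying passes behave independently, which is
    the assumption behind the slide's "One Model".
    """
    if a <= 0:
        raise ValueError("cell A must be positive to estimate D")
    return b * c / a


# Hypothetical counts: A = 50, B = 10, C = 5
estimated_d = estimate_unseen_cell(50, 10, 5)
print(estimated_d)  # 1.0
```

If the two passes tend to make the same mistakes, independence fails and this estimate of D will usually be too low, which is one reason the slide poses the model as a question.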
  14. Bottom Line Take-Away
      • Use Data Checking to Understand Data’s Fitness for Use
      • Edit but Don’t Over-Edit
      • Use Edit Checks to Prevent Future Errors
  15. Data Editing and Data Imputation
      • Joint Role of Imputation and Editing: No Clear Line?
      • Editing “fixes” Often are Model-Based Hunches
      • Data Quality (editing)
      • Information Quality (imputation)
  16. Imputation Versus Editing
      • What is Imputation?
      • Handles Missing and Misreported Data
      • Imputation Goal is roughly right! (Information Quality)
      • Editing Goal often “correction”: Exactly right? (Data Quality)
  17. Data Imputation Techniques
      • Imputation Needs More Justification when Data Quality is the Goal
      • Must be no more than Cosmetic in Nature, if done at all
      • Can only be Aggressively applied for Information Quality Goal
  18. Fellegi-Holt Example
      • Identify Errors with Automated Edit Detection Software
      • Hot Deck acceptable values from Records that Pass Edits
      • Can be worth doing if errors are minor or cosmetic (e.g., Rounding)
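The "hot deck from records that pass edits" step on the slide above can be illustrated with a deliberately simplified sketch. It is not the full Fellegi-Holt method, which also works out the minimal set of fields to change; it only shows a sequential hot deck with a single edit rule. The edit rule, field name, and function names are hypothetical.

```python
def passes_edits(record):
    """Hypothetical edit rule: age must be present and within 0..120."""
    age = record.get("age")
    return age is not None and 0 <= age <= 120


def sequential_hot_deck(records, field="age"):
    """Impute failing values from the most recent record that passed edits.

    Returns new records plus a parallel list of flags marking which
    values were imputed, so the change is documented and reversible.
    """
    imputed, flags = [], []
    donor_value = None
    for original in records:
        rec = dict(original)  # copy: the original dataset is left untouched
        if passes_edits(rec):
            donor_value = rec[field]  # this record can serve as a donor
            flags.append(False)
        elif donor_value is not None:
            rec[field] = donor_value
            flags.append(True)
        else:
            flags.append(False)  # no donor seen yet; value left as reported
        imputed.append(rec)
    return imputed, flags
```

Keeping the flag list alongside the data is one way to act on the advice to preserve the original dataset and to keep tallies of errors found.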
  19. More on Imputation
      • Treat Influential Errors Individually not just Automatically
      • That Said, Software Fixes can lead to Better Documentation (Paradata Matters)
      • Need to Measure Variance Impacts
      • Provide a natural break to Overediting but seldom used for this.
  20. Edit/Imputation Summary
      • Most Editing Mainly Eliminates the Bad
      • Replacing it with a (Good?) Guess of some Sort
      • Imputation emphasizes Guessing even more
  21. More Editing/Imputation
      • Best Imputation Practice tries to quantify Guessing impact on Information Quality
      • Editing has not improved as much as Imputation
      • Editing/Imputation needs more Joint Theory, especially to Measure and Use Mean Square Error Impacts
  22. First Illustrative Example
      • Fabrication/Falsification
      • Illustrate the General Points about Editing and Imputation
      • Emphasize Importance of Fabrication threat to Quality
  23. Fabrication/Falsification
      • Respondent/Interviewer Make up Data
      • How Common?
      • How to Reduce?
      • How to Detect?
  24. Right Structure, Right Resources
      • Examine Practice Elsewhere?
      • www.amstat.org Website
      • Key is right incentives
      • Good staff/training
      • But Eternal Vigilance
  25. Second Illustration
      • Raking Application at NSS
      • To link up to Next Talk
      • To illustrate Information Quality that is fit for use despite Data Quality
  26. Raking Quality “Fix”
      • What is Raking?
      • How does it improve quality?
          Not Data Quality
          But Information Quality
      • Sometimes both: Better Point Estimates, More Stable (smaller variances)
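As background to "What is Raking?" on the slide above: raking is commonly carried out as iterative proportional fitting, where a table of survey totals is scaled row by row and then column by column until it agrees with known control totals. The sketch below shows that general idea only. It is not the NSS application, and the function name and all numbers are hypothetical.

```python
def rake(table, row_targets, col_targets, iterations=50):
    """Scale a table of positive cells to match row and column totals.

    Assumes every cell is positive and that the row targets and
    column targets add up to the same grand total.
    """
    if abs(sum(row_targets) - sum(col_targets)) > 1e-9:
        raise ValueError("row and column targets must share one grand total")
    cells = [list(row) for row in table]  # copy; the input is not modified
    for _ in range(iterations):
        # Step 1: scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [value * target / total for value in cells[i]]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
    return cells


# Hypothetical survey totals and control totals
raked = rake([[40.0, 30.0], [35.0, 50.0]], [80.0, 90.0], [100.0, 70.0])
```

The individual responses are never altered here, only the totals they add up to, which matches the slide's point that raking improves information quality rather than data quality.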
  27. Quality Summary
      • Editing: Data Quality
      • Imputation: Information Quality
      • Raking: Information Quality
      • Fabrication Can Harm Both
      • Must be guarded against always
  28. Almost Done Now
      • Tried to Stay Practical, with a Frank Discussion of Key Weaknesses in Current Practice
      • Deeper Understanding of Data Quality
      • But at an Applied Level
  29. Thank You (Շնորհակալություն)
      Fritz Scheuren
      Scheuren@aol.com