
# Overview of Missing Value Analysis

This presentation was given at the UCSD/VA Medical Center San Diego Addictions Seminar with the intent to orient the research groups to the options that modern missing value analysis afforded them and determine what type of support each group would prefer.

• Slide note: In the M step, the function will often have the same functional form as the complete-data log-likelihood, so maximizing it is computationally no different from finding the MLE in the complete-data case.

1. Statistical Analysis with Missing Data: A Survey of Options. Kevin Cummins, Addictions Research Seminar, October 18, 2006.
3. Objectives
   • Introduce the main concepts of missing value analysis
   • Provide some tentative guidance on dealing with missing values (MV)
   • Develop a discussion of MV's impact on the interpretation of our research findings
   • Identify where we should prioritize our development of MV analysis tools
4. Outline (33 slides)
   • Introduction: objective; the problem
   • Getting parameter estimates: complete case; imputation; maximum likelihood
   • Comparison of approaches
   • Hypothesis testing
5. Problems with Missing Data
   • Potential for bias
   • Analytical hurdles
   • Loss of power
6. Missing Value Pattern Example
   [Tabulated missing-data patterns (SPSS Missing Value Analysis output): number of cases in each missingness pattern across variables including PROP2, AGE, GENDER, SBF, SB4DP, SBPR, Q71AAQSOCBH, AAQGLPOS, JPRFHA, JPRFIM, SEQSUM1, AFESEXP, AFESCOH, and COPE1. Patterns with fewer than 1% of cases (3 or fewer) are not displayed; variables are sorted on missing patterns; the last column gives the number of complete cases if variables missing in that pattern are not used.]
7. Missing Value Mechanisms (MVM)
   • Missing by Necessity (NA)
   • Missing Completely at Random (MCAR): missingness does not depend on any variables, observed or unobserved
   • Missing at Random (MAR): missingness may depend on observed variables, but not on the missing values themselves
   • Not Missing at Random (NMAR): missingness depends on the variables with missing values themselves
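To make the mechanisms concrete, here is a minimal Python sketch (hypothetical toy data, not from the talk): y depends on an always-observed x, and y is deleted under each mechanism. The complete-case mean of y stays unbiased only under MCAR.

```python
import random

random.seed(0)

# Toy data: x is always observed; y = x + noise may go missing.
data = [(x, x + random.gauss(0, 1))
        for x in (random.gauss(0, 1) for _ in range(2000))]

def delete_y(pairs, mechanism):
    """Set y to None according to the chosen missingness mechanism."""
    out = []
    for x, y in pairs:
        if mechanism == "MCAR":        # ignores all data
            p = 0.3
        elif mechanism == "MAR":       # depends only on the observed x
            p = 0.6 if x > 0 else 0.1
        else:                          # NMAR: depends on y itself
            p = 0.6 if y > 0 else 0.1
        out.append((x, None if random.random() < p else y))
    return out

means = {}
for mech in ("MCAR", "MAR", "NMAR"):
    kept = [y for _, y in delete_y(data, mech) if y is not None]
    means[mech] = sum(kept) / len(kept)

# True mean of y is 0; the complete-case mean drifts under MAR and NMAR.
print({m: round(v, 2) for m, v in means.items()})
```

Under MAR the bias comes only through x, so conditioning on x (as likelihood-based methods do) can remove it; under NMAR it cannot.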
8. Graphical Examples of MVM
   [Path diagrams relating X, Z, Y, and the missingness indicator Ymis under each mechanism.]
9. Why MVM Assumptions Are Crucial
   • Missing data methods depend very strongly on the nature of the dependencies in these mechanisms
10. What Can You Do?
   • Complete case analyses
   • Weighting procedures
   • Imputation-based procedures
   • Model-based procedures
11. Complete Case Analysis
   • Benchmark
   + Simple and easy
   + Often satisfactory
   + Direct comparability among variables
   - Inefficient
   - Can lead to bias, unless MCAR
12. Complete Case Analysis (continued)
   • Can be improved under some designs by using weighting (Little & Rubin 2002)
   • Can be improved by dropping variables with many missing values
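As a concrete illustration, complete-case (listwise) deletion is just row filtering. The data below are hypothetical, not from the talk:

```python
# Hypothetical rows of (age, score); None marks a missing value.
rows = [(25, 3.1), (31, None), (47, 2.2), (None, 4.0), (38, 2.9)]

# Complete-case analysis: keep only rows with no missing values.
complete = [r for r in rows if None not in r]

print(complete)   # [(25, 3.1), (47, 2.2), (38, 2.9)]
print(len(rows) - len(complete), "cases discarded")
```

Even modest per-variable missingness discards many rows once several variables are analyzed jointly, which is the inefficiency noted above.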
13. Single Imputation
   Imputed observation: a calculated value used in place of a missing value
   • Explicit model imputation: mean imputation; regression imputation; stochastic regression imputation
   • Implicit model imputation: hot deck imputation; substitution imputation; cold deck imputation
   • Composite approaches
14. Single Imputation (continued)
   • Explicit model imputation: a formal statistical model is created to describe the distribution of the missing values; the assumptions are explicit.
   • Implicit model imputation: no formal model; an algorithm for selecting and assigning imputed values is created; the assumptions are implicit.
15. “The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where [its application is legitimate and where it creates serious biases]” (Dempster and Rubin 1983)
16. Mean Imputation
   • Replace the missing value with the variable's mean
   - Severe bias is possible
   - Covariance matrices will be attenuated
   • Some rectification is possible with conditional mean imputation
   +/- An improvement on a bad option
   - Not generally recommended
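A quick sketch of why mean imputation attenuates variance (simulated data; the sample sizes and deletion rate are my own choices): every imputed value sits exactly at the mean and so adds no spread.

```python
import random, statistics

random.seed(1)
y = [random.gauss(10, 2) for _ in range(1000)]   # true variance = 4

# Delete 40% of the values completely at random (MCAR).
observed = [v if random.random() > 0.4 else None for v in y]
obs_vals = [v for v in observed if v is not None]

# Mean imputation: fill every hole with the observed mean.
fill = statistics.mean(obs_vals)
imputed = [fill if v is None else v for v in observed]

var_obs = statistics.variance(obs_vals)   # close to the true variance
var_imp = statistics.variance(imputed)    # attenuated by the pile-up at the mean
print(round(var_obs, 2), round(var_imp, 2))
```

The attenuated variance then shrinks every covariance and correlation computed from the "completed" data, which is the bias flagged on the slide.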
17. Regression Imputation
   • Replace the missing value with the expected value from a regression; the regression models the missing variable using the other independent variables
   - Substantial bias issues, especially for variance estimates (and thus correlations are impacted)
   - Valid only with monotone missingness
18. Stochastic Regression Imputation
   • Replace the missing value using a regression model; in this case it is not the expected value but a random observation created with the model (including the stochastic/error term)
   + Reduced bias, better variance estimates
   + Can be recommended at times
   - Adding the stochastic term can reduce efficiency
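A minimal sketch of the idea (simulated data; the hand-rolled least-squares fit is my own scaffolding): fit a regression on the complete cases, then impute the prediction plus a random draw from the residual distribution, which preserves the spread that plain regression imputation would remove.

```python
import random, statistics

random.seed(2)

# Simulated data: y = 2x + noise; 30% of y deleted at random for simplicity.
pairs = []
for _ in range(500):
    x = random.gauss(0, 1)
    y = 2 * x + random.gauss(0, 1)
    pairs.append((x, None if random.random() < 0.3 else y))

# Least-squares fit on the complete cases.
done = [(x, y) for x, y in pairs if y is not None]
mx = statistics.mean(x for x, _ in done)
my = statistics.mean(y for _, y in done)
b = sum((x - mx) * (y - my) for x, y in done) / sum((x - mx) ** 2 for x, _ in done)
a = my - b * mx
resid_sd = statistics.stdev(y - (a + b * x) for x, y in done)

# Stochastic regression imputation: prediction plus a random residual.
imputed = [y if y is not None else a + b * x + random.gauss(0, resid_sd)
           for x, y in pairs]

print(round(statistics.stdev(imputed), 2))   # near the true sd of y
```

Dropping the `random.gauss(0, resid_sd)` term turns this back into plain regression imputation and visibly shrinks the standard deviation.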
19. Hot Deck Imputation
   • Replace missing values with values from similar sampling units
   - Unbiased only under MCAR
   - Inefficient estimators
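In its simplest form (a toy sketch; a real hot deck restricts donors to units that match the recipient on key covariates), each missing value is filled with a randomly chosen observed value:

```python
import random

random.seed(3)

# Toy variable with missing entries marked as None.
values = [4.0, None, 7.0, 5.0, None, 6.0, 8.0, None]

# Donor pool: the observed values. (A real hot deck would match
# donors to recipients on covariates before drawing.)
donors = [v for v in values if v is not None]

filled = [v if v is not None else random.choice(donors) for v in values]
print(filled)
```

Because imputed values are actual observed values, the marginal distribution looks plausible, but repeated donation inflates apparent sample size and makes estimators inefficient, as noted above.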
20. Cold Deck Imputation
   • Replace missing values with values from a source outside the current analysis' data
   - Theory for cold deck imputation is either lacking or obvious
21. Composite Approaches
   Example: hot deck + regression imputation (RI) in a longitudinal design
   1) Find the conditional expectation for the missing value
   2) Obtain a hot deck residual
   3) Combine the hot deck residual and the expectation to produce the imputed value
22. Properties of Imputation
   + Can be more powerful than complete-case analysis
   + Imputation produces completed data that can be plugged into standard analyses
   - Variances can be biased
   - P-values can be overly significant
23. Some Take-Homes
   • Imputation should be conditional (regression or matched cases)
   • Multivariate
   • Draw from distributions, not expected values
   • Use when there are few missing values
   • Key problem: inferences about parameters based on completed data do not account for imputation uncertainty (Little & Rubin 2002)
24. Methods Addressing Imputation Uncertainty
   • Replication methods: jackknife or bootstrap the analysis
     - Require large samples (Little & Rubin 2002)
     + Can be easy
   • Multiple imputation (MI): create multiple imputed data tables; the variability across the completed-data analyses is integrated into the assessment of parameter-estimate uncertainty
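A compact sketch of the multiple-imputation idea for a single mean (simulated data; the normal draws for imputation and this bare-bones use of Rubin's combining rules are my toy choices, not a full MI procedure):

```python
import random, statistics

random.seed(4)

y = [random.gauss(50, 10) for _ in range(300)]
obs = [v if random.random() > 0.25 else None for v in y]   # MCAR deletion
vals = [v for v in obs if v is not None]
m_obs, sd_obs = statistics.mean(vals), statistics.stdev(vals)

M = 20
means, within = [], []
for _ in range(M):
    # One stochastic imputation: draw each missing value from the
    # observed-data normal approximation (toy imputation model).
    comp = [v if v is not None else random.gauss(m_obs, sd_obs) for v in obs]
    means.append(statistics.mean(comp))
    within.append(statistics.variance(comp) / len(comp))   # squared SE of mean

# Rubin's rules: pool the M estimates and their uncertainty.
q_bar = statistics.mean(means)          # pooled estimate
W = statistics.mean(within)             # within-imputation variance
B = statistics.variance(means)          # between-imputation variance
total_var = W + (1 + 1 / M) * B
print(round(q_bar, 1), round(total_var ** 0.5, 3))
```

The between-imputation term B is what single imputation throws away: it is the extra variance reflecting that the missing values were never actually observed.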
25. MI vs. Resampling
   • Both make assumptions about the predictive distributions
   • In large samples, resampling produces consistent variance estimates with minimal assumptions, whereas MI variance estimates are strongly tied to the model and the MVM (Little & Rubin 2002)
   • MI can have Bayesian motivations, rendering it more applicable in small samples than resampling (Little & Rubin 2002)
   • A stochastic distribution must be assumed in MI
26. Model-Based Approach: Maximum Likelihood
   • Maximum likelihood (ML) is a method of estimating parameters, as is ordinary least squares
   + When ML is applied to incomplete data, the mean and covariance estimates are unbiased (under MAR)
27. Model-Based Approach: Maximum Likelihood
   • Accept a probability density function $f(y_i \mid \theta)$
   • Calculate the likelihood function $L(\theta) = \prod_{i=1}^{n} f(y_i \mid \theta)$
   • Maximize the likelihood, e.g. by solving $\frac{\partial}{\partial \theta} L(\theta) = \frac{\partial}{\partial \theta} \prod_{i=1}^{n} f(y_i \mid \theta) = 0$
   • Closed-form solutions may not be achievable with incomplete cases (cases with missing values)
28. Maximum Likelihood Estimators
   $\mu^{t+1} = \frac{1}{n} \sum_{i=1}^{n} E(y_i \mid Y_{obs}, \theta^t)$, which converges to $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$
   $(\sigma^2)^{t+1} = \frac{1}{n} \left[ \sum_{i=1}^{n} E(y_i^2 \mid Y_{obs}, \theta^t) \right] - (\mu^{t+1})^2$, which converges to $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} y_i^2 - \hat{\mu}^2$
29. ML with Missing Values
   • Under MAR, the marginal distribution of the observed data provides the correct likelihood for the unknown parameters, provided that the model is realistic
   • This means ML can be applied directly to the incomplete data
   • But the math gets much harder
30. EM Algorithm
   General:
   • Find the conditional expectation of the "missing data functions" given current estimates
   • Maximize the new completed-data log-likelihood to get new parameter estimates
   • Reiterate steps until estimates stabilize
   Multivariate normal:
   • Regression imputation of missing values using the means and the entire covariance matrix
   • Re-estimate the means and covariance matrix with the imputed values
   • Reiterate steps until estimates stabilize
31. EM Algorithm: Observations Are Not Imputed
   Generally, missing sufficient statistics, rather than observations, need to be re-estimated. Consider the trivial case of univariate normal data (Y), with $r$ of the $n$ values observed and complete-data MLE $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
   E step: $\sum_{i=1}^{n} E(y_i \mid Y_{obs}, \theta^t) = \sum_{i=1}^{r} y_i + (n - r)\mu^t$
32. The E Step
   $\sum_{i=1}^{n} E(y_i \mid Y_{obs}, \theta^t) = \sum_{i=1}^{r} y_i + (n - r)\mu^t$
   $\sum_{i=1}^{n} E(y_i^2 \mid Y_{obs}, \theta^t) = \sum_{i=1}^{r} y_i^2 + (n - r)\left[(\mu^t)^2 + (\sigma^2)^t\right]$
33. The M Step
   $\mu^{t+1} = \frac{1}{n} \sum_{i=1}^{n} E(y_i \mid Y_{obs}, \theta^t)$
   $(\sigma^2)^{t+1} = \frac{1}{n} \left[ \sum_{i=1}^{n} E(y_i^2 \mid Y_{obs}, \theta^t) \right] - (\mu^{t+1})^2$
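The E and M steps for this univariate normal case can be run directly. A minimal sketch (simulated MCAR data; the starting values are arbitrary): note that the iteration updates the expected sufficient statistics, never the individual observations, and converges to the observed-data MLEs.

```python
import random, statistics

random.seed(5)
n = 400
y = [random.gauss(5, 2) for _ in range(n)]
obs = [v for v in y if random.random() > 0.3]   # MCAR: ~30% deleted
r = len(obs)                                    # number of observed values

s1 = sum(obs)                   # observed part of sum(y_i)
s2 = sum(v * v for v in obs)    # observed part of sum(y_i^2)

mu, var = 0.0, 1.0              # arbitrary starting values
for _ in range(100):
    # E step: expected complete-data sufficient statistics
    e_sum = s1 + (n - r) * mu
    e_sumsq = s2 + (n - r) * (mu ** 2 + var)
    # M step: re-estimate the parameters
    mu = e_sum / n
    var = e_sumsq / n - mu ** 2

# The fixed point is the observed-data MLE pair.
print(round(mu, 3), round(var, 3))
print(round(s1 / r, 3), round(s2 / r - (s1 / r) ** 2, 3))
```

Solving the fixed-point equations by hand gives $\mu = s_1/r$ and $\sigma^2 = s_2/r - \mu^2$, i.e. the observed-data MLEs, which is what the loop converges to.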
34. EM Algorithm Concept
   • Find the conditional expectation of the "missing data functions" given current estimates
   • Maximize the new completed-data log-likelihood to get new parameter estimates
   • Reiterate steps until estimates stabilize
   Explicit formalization:
   $P(Y \mid \theta) = P(Y_{obs} \mid \theta)\, P(Y_{mis} \mid Y_{obs}, \theta)$
   $Q(\theta \mid \theta^t) = E_{Y_{mis}}\left[\ell(\theta \mid y)\right] = \int \ell(\theta \mid y)\, f(Y_{mis} \mid Y_{obs}, \theta^t)\, dY_{mis}$
   $Q(\theta^{t+1} \mid \theta^t) \ge Q(\theta^t \mid \theta^t)$
35. EM Algorithm: Observations Are Not Imputed (continued)
   Generally, missing sufficient statistics, rather than observations, need to be re-estimated. For the trivial univariate normal case (Y), the complete-data MLEs are
   $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} y_i^2 - \hat{\mu}^2$
36. Take-Homes on ML
   • If the model and MVM assumptions are good, ML is likely to be the best alternative
   • But it requires specialized software or specialist statisticians to help out
37. Comparison of Methods
   • Single imputation methods (bad)
   • Complete case (bad to good)
   • Conditional imputation (possibly okay)
   • Multiple imputation (okay to good)
   • Maximum likelihood (okay to good+)
38. Nonignorable Missing Data Models
   • Typically, missing data are NMAR
   • Include the missing value function in the likelihood
   • Need to know something about the function
39. What Is Missing?
   • Issues and methods for test statistics need further development or exposure
     - For MI there are workable, but not fully satisfactory, approaches: Wald tests, likelihood-ratio tests, and combined chi-squared tests (Schafer 1997)
     - For ML, there is unsatisfactory coverage of hypothesis testing in the literature
40. Notes on the Literature
   • New but growing (hub and spokes)
   • Often either too focused on one aspect of mathematical statistics, or too applied without clear support, and not comparative
41. Objectives (revisited)
   • Introduce the main concepts of missing value analysis (MVA)
   • Provide some tentative guidance on dealing with missing values
   • Develop a discussion of MVA's impact on research interpretations
   • Identify where we should prioritize our development of MVA tools