MICE
Multivariate Imputation by Chained Equations
Richard Jacques
21st July 2015
Multiple Imputation (MI)
MI is a statistical technique for handling missing data.
The key concept of MI is to use the distribution of the observed data
to estimate a set of plausible values for the missing data.
Random components are incorporated into these estimated values to
reflect their uncertainty.
Multiple datasets are created and then analyzed individually but
identically to obtain a set of parameter estimates.
The estimates are then combined into a single set of pooled parameter
estimates and standard errors.
IR White et al. Multiple imputation using chained equations: Issues and guidance for
practice. Statist. Med. 2011; 30:377-399.
Example Data
NHANES (National Health and Nutrition Examination Survey)
Four variables: age (age group), bmi (body mass index), hyp
(hypertension status), chl (cholesterol level)
> library(mice)
> nhanes[1:5,]
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
Inspecting Missing Data
> md.pattern(nhanes)
   age hyp bmi chl
13   1   1   1   1  0
 1   1   1   0   1  1
 3   1   1   1   0  1
 1   1   0   0   1  2
 7   1   0   0   0  3
     0   8   9  10 27
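md.pattern() returns a matrix in which each row corresponds to a missing data
pattern (1=observed, 0=missing). Row names give the number of cases with that
pattern, the last column counts the missing variables in the pattern, and the
bottom row gives the number of missing values per variable.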
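The counts that md.pattern() reports can be reproduced by hand. The following is a minimal base-R sketch, assuming only the first five rows of nhanes shown on the "Example Data" slide; the object names (dat, n_missing, pattern, pattern_counts) are illustrative and not part of mice.

```r
# First five rows of nhanes, copied from the "Example Data" slide.
dat <- data.frame(
  age = c(1, 2, 1, 3, 1),
  bmi = c(NA, 22.7, NA, NA, 20.4),
  hyp = c(NA, 1, 1, NA, 1),
  chl = c(NA, 187, 187, NA, 113)
)

# Number of missing values per variable
# (what the bottom row of md.pattern() reports).
n_missing <- colSums(is.na(dat))

# Encode each row as a pattern string: 1 = observed, 0 = missing.
pattern <- apply(!is.na(dat), 1,
                 function(r) paste(as.integer(r), collapse = ""))

# Number of rows sharing each pattern
# (what the row names of md.pattern() report).
pattern_counts <- table(pattern)

print(n_missing)
print(pattern_counts)
```

Because only five rows are used, the counts differ from the full 25-row output above; the logic is the same.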
Multiple Imputation: Main Steps
Imputation Steps    R Function    R Object Class
Incomplete Data                   data frame
                    mice( )
Imputed Data                      mids
                    with( )
Analysis Results                  mira
                    pool( )
Pooled Results                    mipo
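The three steps can be chained in a few lines. This is a sketch rather than a recommended analysis: the seed value and printFlag = FALSE are assumptions added here for reproducibility and quiet output, and the regression model is the one used later in these slides.

```r
library(mice)

# Step 1: impute. nhanes ships with mice; mice() returns a "mids" object.
imp <- mice(nhanes, m = 5, seed = 2015, printFlag = FALSE)

# Step 2: fit the same model to each of the m completed data sets.
# with() returns a "mira" object.
fit <- with(imp, lm(chl ~ age + bmi))

# Step 3: combine the m sets of estimates. pool() returns a "mipo" object.
pooled <- pool(fit)

class(imp)
class(fit)
class(pooled)
summary(pooled)
```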
Generating multiple imputations: mice()
> mice(data,m,method,predictorMatrix)
data: A data frame or matrix containing the incomplete data. Missing
values coded as NA.
m: Number of imputations (default = 5)
method: A single string, or a vector of strings, specifying the
imputation method used for each column in the data.
predictorMatrix: A square matrix specifying the set of predictors to be
used for each column.
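One common way to set method and predictorMatrix is to start from the defaults and edit them. The sketch below assumes the maxit = 0 "dry run" idiom; the particular edits (norm for bmi, dropping hyp as a predictor) are illustrative choices only, not recommendations from these slides.

```r
library(mice)

# Dry run: maxit = 0 performs no iterations but returns the default settings.
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
meth <- ini$method
pred <- ini$predictorMatrix

# Rows of the predictor matrix are the variables being imputed,
# columns are the candidate predictors (1 = use, 0 = do not use).
meth["bmi"]   <- "norm"  # impute bmi by Bayesian linear regression
pred[, "hyp"] <- 0       # do not use hyp as a predictor for any variable

imp <- mice(nhanes, m = 5, method = meth, predictorMatrix = pred,
            seed = 2015, printFlag = FALSE)

imp$method
imp$predictorMatrix
```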
Built-in imputation methods
Method     Description                          Scale Type
pmm        Predictive mean matching             numeric
norm       Bayesian linear regression           numeric
norm.nob   Linear regression, non-Bayesian      numeric
mean       Unconditional mean imputation        numeric
2l.norm    Two-level linear model               numeric
logreg     Logistic regression                  factor, 2 levels
polyreg    Polytomous (unordered) regression    factor, >2 levels
lda        Linear discriminant analysis         factor
sample     Random sample from observed data     any
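When no method is supplied, mice() picks a default according to the scale type of each column. A small sketch of this, assuming the nhanes2 data set that also ships with mice (not used elsewhere in these slides), in which age and hyp are stored as factors:

```r
library(mice)

# nhanes2 holds the same data as nhanes, with age and hyp coded as factors.
str(nhanes2)

# Dry run to inspect the default method chosen for each column.
# Complete columns get the empty method "" because nothing needs imputing.
ini <- mice(nhanes2, maxit = 0, printFlag = FALSE)
ini$method
```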
Example
> nhanes_mice<-mice(nhanes,m=5,method=c("","norm","pmm","mean"))
> nhanes_mice
Multiply imputed data set
Call:
mice(data = nhanes, m = 5, method = c("", "norm", "pmm", "mean"))
Number of multiple imputations: 5
Missing cells per column:
age bmi hyp chl
  0   9   8  10
Imputation methods:
   age    bmi    hyp    chl
    "" "norm"  "pmm" "mean"
VisitSequence:
bmi hyp chl
  2   3   4
PredictorMatrix:
    age bmi hyp chl
age   0   0   0   0
bmi   1   0   1   1
hyp   1   1   0   1
chl   1   1   1   0
Random generator seed value: NA
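The seed value is NA above because none was supplied, so repeating the call gives different imputations. A sketch of a reproducible version, assuming an arbitrary seed of 2015 (any integer would do) and printFlag = FALSE to suppress the iteration log:

```r
library(mice)

meth <- c("", "norm", "pmm", "mean")

# Two runs with the same seed should give the same imputations.
imp_a <- mice(nhanes, m = 5, method = meth, seed = 2015, printFlag = FALSE)
imp_b <- mice(nhanes, m = 5, method = meth, seed = 2015, printFlag = FALSE)

# Compare the first completed data set from each run.
identical(complete(imp_a, 1), complete(imp_b, 1))
```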
Diagnostics
Check plausibility of imputations for individual variables:
> nhanes_mice$imp$bmi
         1        2        3        4        5
1 25.10264 34.96051 23.63793 27.37651 30.08139
3 28.80055 29.08077 31.07668 31.37782 29.65301
4 20.64118 24.85547 25.44350 25.44396 25.42621
Examine complete data combined with imputed data:
> complete(nhanes_mice,1)
  age      bmi hyp   chl
1   1 25.10264   1 191.4
2   2 22.70000   1 187.0
3   1 28.80055   1 187.0
4   3 20.64118   1 191.4
5   1 20.40000   1 113.0
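A quick numeric plausibility check is to compare the range of the imputed values with the range of the observed values. The sketch below assumes the same methods as the example and an arbitrary seed; obs_range, imp_range and long are illustrative names. Draws from a regression model such as norm are not constrained to the observed range, so some difference is to be expected.

```r
library(mice)

imp <- mice(nhanes, m = 5, method = c("", "norm", "pmm", "mean"),
            seed = 2015, printFlag = FALSE)

# Range of the observed bmi values.
obs_range <- range(nhanes$bmi, na.rm = TRUE)

# Range of the imputed bmi values, one column per imputation.
imp_range <- sapply(imp$imp$bmi, range)

obs_range
imp_range

# Stack all completed data sets into one long data frame:
# .imp identifies the imputation, .id the original row.
long <- complete(imp, action = "long")
table(long$.imp)
```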
Data Analysis
with.mids() is used to perform the desired analysis for each imputed copy
of the data.
> fit<-with(nhanes_mice,lm(chl~age+bmi))
> summary(fit)
## summary of imputation 1 :
Call:
lm(formula = chl ~ age + bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-43.225 -10.881  -2.835   9.934  65.137
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -22.546     48.078  -0.469 0.643721
age           31.660      7.436   4.258 0.000322 ***
bmi            6.004      1.496   4.012 0.000585 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.43 on 22 degrees of freedom
Multiple R-squared: 0.5028, Adjusted R-squared: 0.4576
F-statistic: 11.12 on 2 and 22 DF, p-value: 0.0004593
## summary of imputation 2 :
Pooling Results
pool() takes the results from with.mids() and combines the separate
estimates and standard errors from each of the m imputed data sets to
give an overall estimate and standard error.
> est<-pool(fit)
> summary(est)
                  est        se           t       df    Pr(>|t|)       lo 95
(Intercept) -2.063050 56.538439 -0.03648934 12.54558 0.971466388 -124.658189
age         28.054106  8.827146  3.17816263 11.35829 0.008466749    8.700200
bmi          5.404212  1.736748  3.11168532 13.92380 0.007695105    1.677345
                 hi 95 nmis
(Intercept) 120.532089   NA
age          47.408013    0
bmi           9.131079    9
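The combination step is usually described by Rubin's rules. The notation below is an assumption made for this note and does not appear in the slides: \hat{Q}_i is a coefficient estimate from imputed data set i and U_i is its squared standard error. The degrees of freedom shown in the df column come from a further adjustment that is not reproduced here.

```latex
% Pooled estimate: the mean of the m per-imputation estimates.
\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i

% Within-imputation variance: the mean of the m squared standard errors.
\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i

% Between-imputation variance: the sample variance of the m estimates.
B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2

% Total variance; the pooled standard error is \sqrt{T}.
T = \bar{U} + \left( 1 + \frac{1}{m} \right) B
```

The est column corresponds to \bar{Q} and the se column to \sqrt{T}.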
Models
pool() can be used with any object having both coef() and vcov()
methods. The function will abort if an appropriate method is not
found.
pool() can also be used with results obtained with lme() and lmer(),
but only with the fixed part of the model.
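The coef() and vcov() requirement follows from what pooling needs: a vector of estimates and their variances from each fitted model. The sketch below pools by hand using Rubin's rules and compares with pool(). It assumes mice version 3 or later (where the per-imputation fits are stored in fit$analyses), an arbitrary seed, and illustrative object names.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 2015, printFlag = FALSE)
fit <- with(imp, lm(chl ~ age + bmi))
m   <- length(fit$analyses)

# One column per imputation, one row per coefficient.
Q <- sapply(fit$analyses, coef)                       # estimates
U <- sapply(fit$analyses, function(a) diag(vcov(a)))  # squared std. errors

q_bar <- rowMeans(Q)                # pooled estimate
u_bar <- rowMeans(U)                # within-imputation variance
b     <- apply(Q, 1, var)           # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance

by_hand <- cbind(estimate = q_bar, std.error = sqrt(t_var))
by_hand

# Should agree with the estimate and std.error columns reported by pool().
summary(pool(fit))
```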
References
S van Buuren, K Groothuis-Oudshoorn. MICE: Multivariate
Imputation by Chained Equations in R. Journal of Statistical Software
2011; 45(3).
IR White, P Royston, AM Wood. Multiple imputation using chained
equations: Issues and guidance for practice. Statistics in Medicine
2011; 30(4): 377-399.

SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) package
