SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) package

MICE
Multivariate Imputation by Chained Equations
Richard Jacques
21st July 2015

Multiple Imputation (MI)
MI is a statistical technique for handling missing data.
The key concept of MI is to use the distribution of the observed data
to estimate a set of plausible values for the missing data.
Random components are incorporated into these estimated values to
reflect their uncertainty.
Multiple datasets are created and then analyzed individually but
identically to obtain a set of parameter estimates for each dataset.
These estimates are then combined into a single set of pooled parameter
estimates and standard errors.
IR White et al. Multiple imputation using chained equations: Issues and guidance for
practice. Statist. Med. 2011; 30:377-399.

Example Data
NHANES (National Health and Nutrition Examination Survey)
Four variables: age (age group), bmi (body mass index), hyp
(hypertension status), chl (cholesterol level)
> library(mice)
> nhanes[1:5,]
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113

Inspecting Missing Data
> md.pattern(nhanes)
   age hyp bmi chl
13   1   1   1   1  0
 1   1   1   0   1  1
 3   1   1   1   0  1
 1   1   0   0   1  2
 7   1   0   0   0  3
     0   8   9  10 27
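md.pattern() returns a matrix in which each row corresponds to a missing
data pattern (1=observed, 0=missing). The row names give the number of
rows with each pattern, the last column gives the number of variables
missing in that pattern, and the last row gives the number of missing
values per variable.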
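The same kind of summary can be sketched in base R. The example below is only an illustration, not how md.pattern() is implemented: it retypes the five rows of nhanes shown on the "Example Data" slide into a hypothetical data frame called nhanes5, so the counts refer to those five rows and not to the full data set.

```r
# The five rows of nhanes printed on the "Example Data" slide, typed in by
# hand so the sketch runs without the mice package (nhanes5 is a made-up name).
nhanes5 <- data.frame(
  age = c(1, 2, 1, 3, 1),
  bmi = c(NA, 22.7, NA, NA, 20.4),
  hyp = c(NA, 1, 1, NA, 1),
  chl = c(NA, 187, 187, NA, 113)
)

# Response indicator matrix: 1 = observed, 0 = missing
r <- (!is.na(nhanes5)) * 1

# Number of missing values per variable
n_missing <- colSums(r == 0)

# Collapse each row to a pattern string and count the rows per pattern
pattern <- apply(r, 1, paste, collapse = " ")
pattern_counts <- table(pattern)

print(n_missing)
print(pattern_counts)
```

Each distinct string in pattern_counts plays the role of one row of the md.pattern() matrix, with the count as its row name.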

Multiple Imputation: Main Steps
Imputation Steps     R Function     R Object Class
Incomplete Data                     data frame
                     mice()
Imputed Data                        mids
                     with()
Analysis Results                    mira
                     pool()
Pooled Results                      mipo
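The three steps can be chained as in the sketch below, which assumes the mice package is installed. The seed and printFlag arguments are choices made here for reproducibility and quiet output, and the object names imp, fit and pooled are illustrative.

```r
library(mice)  # provides mice(), pool() and the nhanes example data

# Step 1: create m = 5 imputed data sets (object of class "mids").
# seed and printFlag are assumptions of this sketch.
imp <- mice(nhanes, m = 5, seed = 2015, printFlag = FALSE)

# Step 2: fit the same model to every completed data set (class "mira")
fit <- with(imp, lm(chl ~ age + bmi))

# Step 3: combine the five sets of estimates (class "mipo")
pooled <- pool(fit)

# Inspect the class of each intermediate object and the pooled results
class(imp)
class(fit)
class(pooled)
summary(pooled)
```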

Generating multiple imputations: mice()
> mice(data,m,method,predictorMatrix)
data: A data frame or matrix containing the incomplete data. Missing
values are coded as NA.
m: Number of imputations (default = 5)
method: A single string, or a vector of strings, specifying the
imputation method used for each column in the data.
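predictorMatrix: A square matrix specifying the set of predictors to be
used for each column. Each row corresponds to a variable to be imputed;
a 1 in a column means that variable is used as a predictor, a 0 means
it is not.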
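As an illustration of these arguments, the sketch below builds a predictor matrix by hand and drops hyp as a predictor of bmi. It assumes the mice package is installed. The choice of m = 3, pmm for every incomplete variable, and the seed and printFlag arguments are assumptions made for the example, not recommendations.

```r
library(mice)

# Start from "every variable predicts every other variable":
# ones everywhere except on the diagonal.
vars <- names(nhanes)
pred <- 1 - diag(length(vars))
dimnames(pred) <- list(vars, vars)

# Rows are the variables being imputed, columns are the predictors,
# so this removes hyp from the imputation model for bmi.
pred["bmi", "hyp"] <- 0

# One method per column; "" means the column is not imputed (age is complete)
meth <- c("", "pmm", "pmm", "pmm")

imp <- mice(nhanes, m = 3, method = meth, predictorMatrix = pred,
            seed = 1, printFlag = FALSE)

# The matrix stored in the result shows which predictors were used
imp$predictorMatrix
```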

Built-in imputation methods
Method     Description                          Scale Type
pmm        Predictive mean matching             numeric
norm       Bayesian linear regression           numeric
norm.nob   Linear regression, non-Bayesian      numeric
mean       Unconditional mean imputation        numeric
2l.norm    Two-level linear model               numeric
logreg     Logistic regression                  factor, 2 levels
polyreg    Polytomous (unordered) regression    factor, >2 levels
lda        Linear discriminant analysis         factor
sample     Random sample from observed data     any
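To give a feel for what predictive mean matching does, here is a deliberately simplified base R sketch on made-up toy data. It is not the mice implementation: the package's pmm method also draws the regression parameters at random so that imputations reflect parameter uncertainty, which is skipped here. The function name impute_one and the choice of k = 3 donors are assumptions of the example.

```r
set.seed(1)

# Hypothetical toy data: y depends on x, and three y values are missing
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0, NA, 20.3)
obs <- !is.na(y)

# 1. Fit a regression model to the observed cases
fit <- lm(y ~ x, subset = obs)

# 2. Predict y for every case, observed or not
pred <- predict(fit, newdata = data.frame(x = x))

# 3. For a missing case i, find the k observed cases whose predicted values
#    are closest to the prediction for i, then borrow the observed y of one
#    randomly chosen donor.
impute_one <- function(i, k = 3) {
  d <- abs(pred[obs] - pred[i])
  donors <- order(d)[seq_len(k)]
  y[obs][sample(donors, 1)]
}

y_imp <- y
y_imp[!obs] <- vapply(which(!obs), impute_one, numeric(1))
print(y_imp)
```

Because every imputed value is copied from an observed case, imputations from this approach always lie within the range of the observed data.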

Example
> nhanes_mice<-mice(nhanes,m=5,method=c("","norm","pmm","mean"))
> nhanes_mice
Multiply imputed data set
Call:
mice(data = nhanes, m = 5, method = c("", "norm", "pmm", "mean"))
Number of multiple imputations: 5
Missing cells per column:
age bmi hyp chl
  0   9   8  10
Imputation methods:
   age    bmi    hyp    chl
    "" "norm"  "pmm" "mean"
VisitSequence:
bmi hyp chl
  2   3   4
PredictorMatrix:
    age bmi hyp chl
age   0   0   0   0
bmi   1   0   1   1
hyp   1   1   0   1
chl   1   1   1   0
Random generator seed value: NA

Diagnostics
Check plausibility of imputations for individual variables:
> nhanes_mice$imp$bmi
         1        2        3        4        5
1 25.10264 34.96051 23.63793 27.37651 30.08139
3 28.80055 29.08077 31.07668 31.37782 29.65301
4 20.64118 24.85547 25.44350 25.44396 25.42621
Examine the observed data combined with the first set of imputed values:
> complete(nhanes_mice,1)
  age      bmi hyp   chl
1   1 25.10264   1 191.4
2   2 22.70000   1 187.0
3   1 28.80055   1 187.0
4   3 20.64118   1 191.4
5   1 20.40000   1 113.0

Data Analysis
with.mids() is used to perform the desired analysis for each imputed copy
of the data.
> fit<-with(nhanes_mice,lm(chl~age+bmi))
> summary(fit)
## summary of imputation 1 :
Call:
lm(formula = chl ~ age + bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-43.225 -10.881  -2.835   9.934  65.137
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -22.546     48.078  -0.469 0.643721
age           31.660      7.436   4.258 0.000322 ***
bmi            6.004      1.496   4.012 0.000585 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.43 on 22 degrees of freedom
Multiple R-squared: 0.5028, Adjusted R-squared: 0.4576
F-statistic: 11.12 on 2 and 22 DF, p-value: 0.0004593
## summary of imputation 2 :

Pooling Results
pool() takes the results from with.mids() and combines the separate
estimates and standard errors from each of the m imputed data sets to
give an overall estimate and standard error.
> est<-pool(fit)
> summary(est)
                  est        se           t       df    Pr(>|t|)       lo 95      hi 95 nmis
(Intercept) -2.063050 56.538439 -0.03648934 12.54558 0.971466388 -124.658189 120.532089   NA
age         28.054106  8.827146  3.17816263 11.35829 0.008466749    8.700200  47.408013    0
bmi          5.404212  1.736748  3.11168532 13.92380 0.007695105    1.677345   9.131079    9
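The combination step is commonly described by Rubin's rules. The sketch below applies them by hand to one coefficient in base R. The five estimates and standard errors are invented for illustration and are not the values behind the table above, and the degrees-of-freedom calculation that pool() also reports is left out.

```r
# Hypothetical estimates and standard errors of a single coefficient,
# one pair per imputed data set (m = 5). These numbers are made up.
q  <- c(5.1, 6.0, 4.8, 5.6, 5.5)
se <- c(1.5, 1.6, 1.4, 1.7, 1.5)
m  <- length(q)

q_bar <- mean(q)                    # pooled estimate: average of the m estimates
u_bar <- mean(se^2)                 # within-imputation variance
b     <- var(q)                     # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance
pooled_se <- sqrt(t_var)            # pooled standard error

print(c(estimate = q_bar, se = pooled_se))
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, because the between-imputation term adds the extra uncertainty caused by the missing data.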

Models
pool() can be used with any object having both coef() and vcov()
methods. The function will abort if an appropriate method is not
found.
pool() can also be used with results obtained with lme() and lmer(),
but only with the fixed part of the model.

References
S van Buuren, K Groothuis-Oudshoorn. mice: Multivariate
Imputation by Chained Equations in R. Journal of Statistical Software
2011; 45(3): 1-67.
IR White, P Royston, AM Wood. Multiple imputation using chained
equations: Issues and guidance for practice. Statistics in Medicine
2011; 30(4): 377-399.
