DRS-112 Exploratory Data Analysis (YUE, PGDRS).pdf
Statistical Methods to Handle Missing Data
Statistical Methods of Handling Missing Data: Comparison of Listwise Deletion and Multiple Imputation
Tianfan Song
Instructor: Meng-Hsuan (Tony) Wu
5.12.2016
Abstract:
This paper briefly reviews traditional and modern approaches for dealing with missing data. Since the most commonly used approach is listwise deletion, a comparison between listwise deletion and a popular modern approach, multiple imputation, is addressed. Through an example of analyzing a blood pressure dataset, this paper shows that MI should be used when researchers are not sure what type of missing data they are dealing with.
Keywords: MCAR, MAR, MNAR, deletion, imputation, comparison
1. Introduction:
Missing data is not a rare case in most datasets. There are typically three different types of missing data: missing at random, missing completely at random, and missing not at random. However, many analytic procedures, many of which were developed early in the twentieth century, were designed for complete data (Graham, 2009, p. 550). In the late twentieth century, different approaches to handling missing data were developed. Among these methods, listwise deletion, pairwise deletion, and single imputation as traditional approaches, and multiple imputation and maximum likelihood as modern approaches, are used a lot. In R, the default way to handle missing data is listwise deletion, and multiple imputation has become convenient to use.
2. Types of Missing Data:
Missing at random (MAR): If the probability of failing to observe a value does not depend on the unobserved data, then we call this type of missing data MAR. In other words, according to Rubin, the probability of missing data does not depend on the missing data itself, but only on the observed data (as cited in Schafer & Graham, 2002, p. 151).
Missing completely at random (MCAR): MCAR is a special case of MAR; it means the probability of missing data does not depend on the complete data at all (as cited in Schafer & Graham, 2002, p. 151). In this situation, we can think of the whole dataset as a matrix in which the missing values appear randomly throughout (Acock, 2005, p. 1014).
Missing not at random (MNAR): The missingness depends on variables that are not observed, including the missing value itself. For example, suppose a study gathers information about people's daily expenditure. Several participants refuse to provide information about their entertainment expenditure because they have lower salaries and spend little on entertainment, or because they are older than 80 and spend less on entertainment; the survey, however, does not ask about these factors. This type of missing value depends on the missing value itself.
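The three mechanisms can be illustrated with a short simulation. Everything below (sample size, coefficients, thresholds) is invented for this sketch, written in Python rather than the R used later in the paper:

```python
import numpy as np

# Hypothetical blood-pressure sample; names and coefficients are invented.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 90, n)
diastolic = rng.normal(80, 8, n)                 # measured, but not released
systolic = 100 + 0.6 * age + rng.normal(0, 8, n)

# MCAR: every value has the same chance of being lost.
mcar = systolic.copy()
mcar[rng.random(n) < 0.15] = np.nan

# MAR: loss depends only on the fully observed age, never on the
# unobserved systolic value itself.
mar = systolic.copy()
mar[age < 25] = np.nan

# MNAR: loss depends on diastolic pressure, which is absent from the
# released data, so the mechanism cannot be modeled from what we observe.
mnar = systolic.copy()
mnar[diastolic <= 80] = np.nan
```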
2.1 Example of Missing Data:
Table 1 illustrates the three types of missing data; bp means blood pressure. There are 41 randomly selected residents from the islands. The first column records each resident's age, the second column records gender, and the third column records the complete systolic blood pressure. The remaining columns show the values of systolic blood pressure that remain after imposing the three different types of missing data.
To simulate MAR on systolic blood pressure, systolic blood pressure is recorded only for individuals aged 25 or over. This could happen in reality, since younger people may not be concerned with recording their blood pressure. We can easily see that the missingness depends only on age, not on the systolic value itself. To simulate MCAR, 35 individuals are randomly selected and their systolic blood pressure is reported. The last column comes from MNAR; to simulate it, systolic blood pressure is reported only when diastolic blood pressure is over 80. Diastolic blood pressure is a variable not included in the dataset (it comes from a blood test result).
Table 1: Blood pressure sample from the islands. The four right-hand columns are systolic blood pressure: complete, and after imposing MAR, MCAR, and MNAR missingness ("-" marks a missing value).

Row  Age  Gender  Complete  MAR  MCAR  MNAR
1    82   m       148       148  -     148
2    39   f       135       135  135   135
3    23   f       125       -    125   125
4    41   m       136       136  -     136
5    88   f       166       166  166   166
6    25   f       114       114  114   -
7    57   m       148       148  148   148
8    20   m       111       -    111   -
9    32   m       134       134  -     134
10   52   m       137       137  137   137
11   42   f       136       136  136   136
12   27   f       127       127  127   127
13   69   f       150       150  150   150
14   81   f       164       164  164   164
15   18   f       120       -    120   -
16   32   m       132       132  132   132
17   59   m       127       127  127   127
18   47   f       133       133  133   133
19   24   m       124       -    124   124
20   37   m       135       135  -     135
21   60   f       140       140  140   140
22   22   f       121       -    121   121
23   29   f       127       127  127   -
24   77   f       158       158  158   158
25   47   m       139       139  139   139
26   27   m       126       126  126   126
27   54   f       141       141  141   141
28   56   m       140       140  140   140
29   50   f       118       118  118   -
30   37   m       135       135  -     135
31   39   m       141       141  141   141
32   43   m       126       126  126   126
33   58   f       130       130  130   130
34   20   f       103       -    103   -
35   61   m       142       142  142   142
36   32   f       124       124  124   -
37   45   m       120       120  -     -
38   37   m       132       132  132   132
39   31   f       135       135  135   135
40   29   m       131       131  131   131
41   33   f       130       130  130   130
3. Analyzing the Randomness of Missing Data
Since the methods mentioned below assume the values are missing completely at random or missing at random, it is important to figure out the type of missing data. One way to diagnose the randomness, which comes from Little and Rubin (1987), is to build a correlation matrix of missingness for every pair of variables (as cited in Tsikriktsis, 2005, p. 56). The method works as follows: for each variable, an observed value is replaced by one and a missing value by zero. Then a correlation matrix can be built with software like R, and the correlations in the off-diagonal entries indicate the degree of association between the missingness of each pair of variables. Low correlations show strong independence between the missingness of the pairs of variables, which implies strong randomness. According to Tsikriktsis (2005), there are no "guidelines for identifying the level of correlation needed to indicate that the missing data are not random" (p. 56). If we believe the correlation matrix shows randomness between all pairs of variables, then we may consider the dataset to have MCAR-type missing data. If we observe significant correlations between some pairs of variables, then we may need to assume that the dataset is not MCAR. In general, we cannot test whether MAR holds in a dataset unless we can obtain follow-up data from non-respondents (Glynn, Laird, & Rubin, 1993, pp. 992-993).
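The indicator-recoding diagnostic can be sketched in Python (the paper works in R; this numpy version is an illustrative translation, and the demo data are invented):

```python
import numpy as np

def missingness_correlations(data):
    """Little & Rubin style diagnostic: recode each variable as
    1 = observed, 0 = missing, then correlate the indicator columns.
    Near-zero off-diagonal correlations are consistent with MCAR."""
    indicators = (~np.isnan(data)).astype(float)
    # Columns with no missing values give a constant indicator, whose
    # correlation is undefined, so drop them first.
    varying = indicators.std(axis=0) > 0
    return np.corrcoef(indicators[:, varying], rowvar=False)

# Two variables whose values go missing independently should show a
# near-zero off-diagonal correlation.
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))
data[rng.random(500) < 0.2, 0] = np.nan
data[rng.random(500) < 0.2, 1] = np.nan   # column 2 stays complete
corr = missingness_correlations(data)
```

As the text notes, there is no agreed cutoff for "low"; the matrix is a diagnostic aid, not a formal test.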
4. Review of Traditional Approaches
Many traditional approaches, including deletion and single-imputation approaches, are heavily used by researchers. In this section, some common traditional approaches are reviewed.
4.1 Deletion approaches:
4.1.1 Listwise deletion
This method deletes the whole row of data when any of its values is missing, so it is also called complete-case analysis. Many researchers use this approach because it avoids making up data to insert in the missing places. Clearly, this method is simple, but it typically results in losing 20%-50% of the data (Acock, 2005, p. 1015). Despite the fact that this large loss of data reduces statistical power and generates biased estimates if the data are not MCAR, it is used by default in many statistical programs (Schafer & Graham, 2002, p. 151).
4.1.2 Pairwise deletion
In order to minimize the loss that occurs in listwise deletion, researchers sometimes use a method called pairwise deletion (available-case analysis), which means using all available data for each statistic. For example, in Table 1, when systolic blood pressure is MCAR, a statistic involving only age and gender can still use every individual's data, since neither of those variables has any missing values; only statistics involving systolic blood pressure are restricted to the 35 available cases. Even when MCAR holds, case deletion may still be inefficient (Schafer & Graham, 2002, p. 156).
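The difference in usable sample size between the two deletion schemes can be shown with a small numpy sketch (the matrix below is invented, but its shape and missingness count mimic the MCAR column of Table 1):

```python
import numpy as np

# Hypothetical 41-by-3 matrix: two complete covariates plus one outcome
# with 6 values deleted completely at random.
rng = np.random.default_rng(2)
data = rng.normal(size=(41, 3))
data[rng.choice(41, size=6, replace=False), 2] = np.nan

# Listwise deletion keeps only fully observed rows.
listwise = data[~np.isnan(data).any(axis=1)]

# Pairwise deletion computes each statistic from all rows available for
# that particular pair of variables.
pair_01 = (~np.isnan(data[:, 0]) & ~np.isnan(data[:, 1])).sum()  # complete pair
pair_02 = (~np.isnan(data[:, 0]) & ~np.isnan(data[:, 2])).sum()  # pair with holes
```

Listwise deletion leaves 35 rows for every analysis, while pairwise deletion still uses all 41 rows for the fully observed pair.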
4.2 Single Imputation
This is a common strategy; researchers use it when there is not too much missing data. The main idea is to insert a plausible guess for each missing value. Common choices include:
4.2.1 Substitution with the overall mean
When a value is missing, we substitute it with the mean of the observed values. This is an easy way to handle missing data, but it seriously reduces the variability, especially when the dataset has a large amount of missing data. Also, since this approach ignores the relationships between variables, it may weaken the estimates of covariance and correlation.
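The variance shrinkage can be seen concretely in a small numpy sketch; the numbers (mean 130, sd 12, 30% missing) are invented for illustration:

```python
import numpy as np

# A variable with invented mean and spread, 30% deleted completely at random.
rng = np.random.default_rng(3)
x = rng.normal(130, 12, 1000)
x_miss = x.copy()
x_miss[rng.random(1000) < 0.3] = np.nan

# Mean substitution: every hole gets the observed mean.
filled = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

# The mean is unchanged, but the imputed points add no spread at all,
# so the standard deviation shrinks by roughly sqrt(observed fraction).
```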
4.2.2 Dummy variable adjustment
If a single variable, like systolic blood pressure in Table 1, has some missing data, then a binary variable is created whose value is 1 when systolic blood pressure is missing and 0 when it is present. Then, residents who have a missing value for systolic blood pressure are assigned the mean, and the dummy variable is included in the regression. According to Allison's research, this approach has been discredited and should not be used (as cited in Graham, 2009, p. 555).
4.2.3 Regression estimation
For this approach, we first estimate the regression coefficients among the variables, and then we use those coefficients to estimate the missing values (Frane, 1976, p. 410). This approach uses the available observed data, but since we replace the missing data with estimated values from the regression model, we may inflate the coefficient of determination and the correlation between variables and reduce variability.
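The inflation effect can be demonstrated with a numpy sketch; the data-generating numbers are invented, and the fit uses ordinary least squares on the complete cases:

```python
import numpy as np

# Hypothetical data: systolic pressure rising with age, 25% of the
# systolic values missing completely at random.
rng = np.random.default_rng(4)
age = rng.uniform(18, 90, 300)
systolic = 100 + 0.6 * age + rng.normal(0, 8, 300)
miss = rng.random(300) < 0.25

# Fit the regression on the complete cases, then replace each missing
# value with its fitted value on the regression line.
b1, b0 = np.polyfit(age[~miss], systolic[~miss], 1)
imputed = systolic.copy()
imputed[miss] = b0 + b1 * age[miss]

# Imputed points sit exactly on the line (zero residual), so the
# observed correlation is overstated relative to the true data.
r_true = np.corrcoef(age, systolic)[0, 1]
r_imputed = np.corrcoef(age, imputed)[0, 1]
```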
5. Modern Approach
5.1 Multiple imputation (MI)
Unlike single imputation, in MI the missing values are replaced by M > 1 sets of simulated imputed values. In each set, the same missing value is replaced by a slightly different value, resulting in M plausible but different versions of the complete data (Collins, Schafer, & Kam, 2001, p. 335). Typically, M = 5 to M = 10 imputations are sufficient to yield highly efficient estimates (as cited in Collins, Schafer, & Kam, 2001, p. 335). The imputation equation is

Z = b_0 + b_1 X + b_2 Y + sE,

where Z is the estimate of the missing value, X and Y are fully observed variables, E is a random draw from a standard normal distribution, and s is the estimated standard deviation of the error term in the regression (Allison, 2012, p. 2).
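As a sketch of this stochastic imputation step, the snippet below fits the imputation regression on the complete cases and draws M = 5 sets of imputed values. The data and coefficients are invented, and this is a simplification: proper MI would also perturb the regression coefficients between imputations, a step omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 300, 5
x = rng.uniform(18, 90, n)               # fully observed
y = rng.normal(80, 8, n)                 # fully observed
z = 40 + 0.5 * x + 0.7 * y + rng.normal(0, 6, n)
miss = rng.random(n) < 0.2
z_obs = z.copy()
z_obs[miss] = np.nan

# Least-squares fit of Z on X and Y using the complete cases.
A = np.column_stack([np.ones(n), x, y])[~miss]
coef, ssr, *_ = np.linalg.lstsq(A, z[~miss], rcond=None)
s = np.sqrt(ssr[0] / ((~miss).sum() - 3))   # residual standard deviation

# M imputed datasets: fitted value plus fresh noise s*E each time, so the
# datasets agree where data are observed but differ where they are missing.
completed = []
for _ in range(M):
    z_m = z_obs.copy()
    z_m[miss] = (coef[0] + coef[1] * x[miss] + coef[2] * y[miss]
                 + s * rng.standard_normal(miss.sum()))
    completed.append(z_m)
```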
5.2 Maximum Likelihood (ML)
The maximum likelihood approach chooses the parameter values that give the highest possible probability or probability density to the observed data values. The likelihood function is

L(\theta) = \prod_{i=1}^{n} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta).

If the missing data are MAR, which is the assumption of ML, then the missing values are removed from the likelihood by a process of summation or integration (Collins, Schafer, & Kam, 2001, p. 335). For example, suppose observation i is missing y_{i1}. If y_{i1} is discrete, the joint probability of the remaining values is

f_i(y_{i2}, y_{i3}, \ldots, y_{ik}; \theta) = \sum_{y_{i1}} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta).

If y_{i1} is continuous, the joint probability is

f_i(y_{i2}, y_{i3}, \ldots, y_{ik}; \theta) = \int f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta) \, dy_{i1}.

The overall likelihood for n independent observations is the product of the likelihoods of all the observations. Suppose the last m observations are missing y_{i1}; the likelihood function for the full dataset is then

L(\theta) = \prod_{i=1}^{n-m} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta) \prod_{i=n-m+1}^{n} f_i(y_{i2}, \ldots, y_{ik}; \theta).

For ML we discuss the case where MAR applies only to the response variable and there are no auxiliary variables, just like Table 1 restricted to columns 1, 2, and 4.
6. Comparison between listwise deletion and multiple imputation
Among these approaches, complete-case analysis (listwise deletion) is the most common: of 262 journal articles published in 2010, 81% used complete-case analysis (Eekhout et al., 2012, p. 729), even though listwise deletion has the several shortcomings discussed above; it is also the default method in R. Meanwhile, multiple imputation has become easy to carry out in R thanks to several new packages. For these reasons, a comparison is given to show that multiple imputation deserves serious consideration. The interest is in finding the effect of age and gender on systolic blood pressure.
6.1 Original data
The original data has no missing values; it consists of the first three columns of Table 1. For this dataset we use the model systolic bp ~ age * gender. The results are shown in Output 1. From Output 1, we can see that age has a very low p-value while gender and the interaction term have high ones, which indicates that age is a useful predictor of systolic blood pressure, with coefficient 0.66, and gender is not significant.
6.2 Dealing with the MAR blood pressure sample
6.2.1 Dealing with listwise deletion
The results are shown in Output 2. With this method, both the gender predictor and the interaction have much lower p-values than with the original data and become significant.
6.2.2 Dealing with MI
For the MI approach, this paper uses the mice package in R. As mentioned above, usually 5 sets will generate efficient estimates, so we choose to generate 5 datasets, using predictive mean matching as the imputation method. Table 2 shows the results of the 5 different sets of imputations generated by the MI procedure.
Table 2: 5 sets of imputations ("Missing place" is the row number in Table 1).

Missing place  Imputation 1  Imputation 2  Imputation 3  Imputation 4  Imputation 5
3              135           114           126           134           126
8              126           114           127           124           126
15             127           126           127           127           114
19             131           127           114           135           127
22             131           127           127           132           127
34             127           131           127           131           127

Then we can examine the regressions fitted to the 5 imputed datasets, which are shown in Output 3. We can see that the intercepts and coefficients differ slightly across the 5 imputed datasets.
Now we combine the separate models; in mice this function is called "pool". The final results are shown in Output 4. For MI, we can see that age is the only useful predictor, with coefficient 0.57, and gender is not significant, just as in the original regression analysis.
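The combination step behind mice's pool() is Rubin's rules, which can be sketched in Python (an illustrative translation; the paper's actual analysis runs in R, and the numbers in the usage line are invented):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Rubin's rules for combining M per-imputation results:
    the pooled estimate is the mean of the M estimates, and its total
    variance is W + (1 + 1/M) * B, where W is the mean within-imputation
    variance and B the between-imputation variance of the estimates."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    q_bar = q.mean()               # pooled point estimate
    W = u.mean()                   # average within-imputation variance
    B = q.var(ddof=1)              # between-imputation variance
    T = W + (1 + 1 / M) * B        # total variance of the pooled estimate
    return q_bar, T

# e.g. five hypothetical age coefficients and their squared standard errors
estimate, total_var = pool_estimates([0.57, 0.55, 0.58, 0.56, 0.59],
                                     [0.010, 0.012, 0.011, 0.009, 0.010])
```

The between-imputation term B is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing data into the pooled standard errors.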
6.3 Dealing with the MCAR blood pressure sample
MCAR is a stronger assumption than MAR, and this situation is relatively rare compared to MAR and MNAR.
6.3.1 Dealing with listwise deletion
The results are presented in Output 5. Age remains the only useful predictor of systolic blood pressure, with coefficient 0.66; the p-values for gender and the interaction are lower compared to the original data.
6.3.2 Dealing with MI
The procedure is almost the same as shown in 6.2.2. For this section, the final results are shown in Output 6. For MI we can see that age is the only useful predictor, with coefficient 0.66, and gender is not significant, just as in the original regression analysis.
6.4 Dealing with the MNAR blood pressure sample
As mentioned above, we cannot test whether the missing data are MAR without further information about the non-respondents. In many cases, the MAR assumption may not hold as researchers claim, and MNAR is what researchers actually deal with.
6.4.1 Dealing with listwise deletion
The results for listwise deletion are shown in Output 7. Both predictors are significant at the 0.05 significance level; age has coefficient 0.60, males' average systolic blood pressure is 11.35 greater than females', and males have a more slowly growing systolic blood pressure than females.
6.4.2 Dealing with MI
The procedure is almost the same as shown in 6.2.2; we only provide the final results, in Output 8. For MI we can see that age is a useful predictor with coefficient 0.56; gender, with coefficient 8.37, is not significant, while the interaction is significant.
7. Discussion
Many missing data methods assume MCAR or MAR, but our data are often MNAR. Thus, by comparing how listwise deletion and MI perform under the three types of missing data, we can judge which one may work better. Table 3 summarizes the comparison by listing the coefficient and p-value of each predictor for the two methods under the three missing-data types.
Table 3: Coefficients and p-values for each predictor.

Original:
Predictor    age   genderm  age:genderm
Coefficient  0.66  11.69    -0.25
p-value      0.00  0.07     0.07

MAR (coefficient/p-value):
Method             age        genderm     age:genderm
Listwise deletion  0.65/0.00  16.67/0.03  -0.35/0.03
MI                 0.57/0.00  11.41/0.08  -0.27/0.05

MCAR (coefficient/p-value):
Method             age        genderm    age:genderm
Listwise deletion  0.66/0.00  8.48/0.27  -0.18/0.29
MI                 0.66/0.00  8.84/0.26  -0.18/0.32

MNAR (coefficient/p-value):
Method             age        genderm     age:genderm
Listwise deletion  0.60/0.00  11.35/0.04  -0.27/0.02
MI                 0.56/0.00  8.37/0.08   -0.23/0.03
From Table 3, we can see that with the original data only age is a significant predictor, with coefficient 0.66. When only the MAR assumption holds, if we use listwise deletion, both gender and the interaction become significant; but if we use MI, these two terms remain non-significant. At least for this case, we should prefer MI. When the stronger MCAR assumption holds, we can see both methods work well and generate efficient results: both coefficients of age are 0.66, and gender and the interaction are non-significant. Thus, if the missing type is MCAR, we may use either listwise deletion or MI. When we face the more general condition, MNAR, we can see that with listwise deletion all of the predictors become significant. But if we use MI, based on the hierarchy rule in regression modeling, although the interaction is significant at the 0.05 level, we will not include it in the model, since its parent term gender is not significant. As for the coefficient of age, we observe that with MI the difference from the original data is bigger than with listwise deletion; but since MI shows that age is the only significant term in the model, just as the original data show, we had better choose MI.
Since the dataset contains only 41 individuals from the islands, its analysis results are limited: each missing value plays a relatively big role. Especially for MNAR, there are 8 missing values, which would influence the results no matter which method we use.
8. Conclusion
Listwise deletion is simple and commonly used by researchers, and it may perform reasonably well in some situations (Schafer & Graham, 2002, p. 150). But multiple imputation produces estimates that have more desirable statistical properties: "they are consistent (and, hence, approximately unbiased in large samples), asymptotically efficient (almost), and asymptotically normal" if we follow the right procedure (Allison, 2012, p. 2). When researchers are not sure which type of missing values they are dealing with, they should consider not using the default method of analysis software such as R, and choose multiple imputation instead, as it can handle MAR, MCAR, and MNAR in a more proper way.
References
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for
nonignorable nonresponse with follow-ups. Journal of the American Statistical
Association, 88(423), 984-993.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis.
Psychometrika, 41(3), 409-415.
Allison, P. D. (2012, April). Handling missing data by maximum likelihood. In SAS global forum
(Vol. 312).
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art.
Psychological methods, 7(2), 147.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual review of
psychology, 60, 549-576.
Tsikriktsis, N. (2005). A review of techniques for treating missing data in OM survey research.
Journal of Operations Management, 24(1), 53-62.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological methods, 6(4), 330.
Eekhout, I., de Boer, R. M., Twisk, J. W., de Vet, H. C., & Heymans, M. W. (2012). Missing data: a systematic review of how they are reported and handled. Epidemiology, 23(5), 729-732.