DRS-112 Exploratory Data Analysis (YUE, PGDRS).pdf
Statistical Methods to Handle Missing Data
Statistical Methods of Handling Missing Data: Comparison of Listwise Deletion and Multiple Imputation
Tianfan Song
Instructor: Meng-Hsuan (Tony) Wu
5.12.2016
Abstract:
This paper briefly reviews traditional and modern approaches for dealing with missing data. Since the most commonly used approach is listwise deletion, a comparison between listwise deletion and a popular modern approach, multiple imputation, is addressed. Through an example of analyzing a blood pressure dataset, this paper shows that MI should be used when researchers are not sure what type of missing data they are dealing with.
Keywords: MCAR, MAR, MNAR, deletion, imputation, comparison
1. Introduction:
Missing data is not a rare case in most datasets. There are typically three different types of missing data: missing at random, missing completely at random, and missing not at random. However, many analytic procedures, many of which were developed early in the twentieth century, were designed for complete data (Graham, 2009, p. 550). In the late twentieth century, different approaches to handling missing data were developed. Among these methods, listwise deletion, pairwise deletion, and single imputation as traditional approaches, and multiple imputation and maximum likelihood as modern approaches, are used a lot. In R, the default way to handle missing data is listwise deletion, and multiple imputation has become convenient to use.
2. Types of Missing Data:
Missing at random (MAR): If the probability of failing to observe a value does not depend on the unobserved data, then we call this type of missing data MAR. In other words, according to Rubin, the probability of missing data does not depend on the missing data itself, but only on the observed data (as cited in Schafer & Graham, 2002, p. 151).
Missing completely at random (MCAR): MCAR is a special case of MAR; it means the probability of missing data does not depend on the complete data at all (as cited in Schafer & Graham, 2002, p. 151). In this situation, we can think of the whole dataset as a matrix in which the missing values appear randomly throughout (Acock, 2005, p. 1014).
Missing not at random (MNAR): The missingness depends on variables that are not observed, including the missing value itself. For example, suppose a study gathers information about people's daily expenditure. Several participants refuse to provide information about their entertainment expenditure because they have lower salaries and spend little on entertainment, or because they are older than 80 and spend less on entertainment; the survey, however, does not ask about these factors. This type of missing value depends on the missing value itself.
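The three mechanisms can be illustrated with a short simulation. Everything below (sample size, coefficients, thresholds) is invented for this sketch, written in Python rather than the R used later in the paper:

```python
import numpy as np

# Hypothetical blood-pressure sample; names and coefficients are invented.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 90, n)
diastolic = rng.normal(80, 8, n)                 # measured, but not released
systolic = 100 + 0.6 * age + rng.normal(0, 8, n)

# MCAR: every value has the same chance of being lost.
mcar = systolic.copy()
mcar[rng.random(n) < 0.15] = np.nan

# MAR: loss depends only on the fully observed age, never on the
# unobserved systolic value itself.
mar = systolic.copy()
mar[age < 25] = np.nan

# MNAR: loss depends on diastolic pressure, which is absent from the
# released data, so the mechanism cannot be modeled from what we observe.
mnar = systolic.copy()
mnar[diastolic <= 80] = np.nan
```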
2.1 Example of Missing Data:
Table 1 illustrates the three types of missing data; bp means blood pressure. There are 41 randomly selected residents from the islands. The first column records each resident's age, the second column records gender, and the third column records the complete systolic blood pressure. The remaining columns show the values of systolic blood pressure that remain after imposing the three different types of missing data.
To simulate MAR on systolic blood pressure, systolic blood pressure is recorded only for individuals aged 25 or over. This could happen in reality, since younger people may not be concerned with recording their blood pressure. We can easily see that the missingness depends only on age, not on the systolic value itself. To simulate MCAR, 35 individuals are randomly selected and their systolic blood pressure is reported. The last column comes from MNAR; to simulate it, systolic blood pressure is reported only when diastolic blood pressure is over 80. Diastolic blood pressure is a variable not included in the dataset (it comes from a blood test result).
Table 1: Blood pressure sample from the islands. The four right-hand columns are systolic blood pressure: complete, and after imposing MAR, MCAR, and MNAR missingness ("-" marks a missing value).

Row  Age  Gender  Complete  MAR  MCAR  MNAR
1    82   m       148       148  -     148
2    39   f       135       135  135   135
3    23   f       125       -    125   125
4    41   m       136       136  -     136
5    88   f       166       166  166   166
6    25   f       114       114  114   -
7    57   m       148       148  148   148
8    20   m       111       -    111   -
9    32   m       134       134  -     134
10   52   m       137       137  137   137
11   42   f       136       136  136   136
12   27   f       127       127  127   127
13   69   f       150       150  150   150
14   81   f       164       164  164   164
15   18   f       120       -    120   -
16   32   m       132       132  132   132
17   59   m       127       127  127   127
18   47   f       133       133  133   133
19   24   m       124       -    124   124
20   37   m       135       135  -     135
21   60   f       140       140  140   140
22   22   f       121       -    121   121
23   29   f       127       127  127   -
24   77   f       158       158  158   158
25   47   m       139       139  139   139
26   27   m       126       126  126   126
27   54   f       141       141  141   141
28   56   m       140       140  140   140
29   50   f       118       118  118   -
30   37   m       135       135  -     135
31   39   m       141       141  141   141
32   43   m       126       126  126   126
33   58   f       130       130  130   130
34   20   f       103       -    103   -
35   61   m       142       142  142   142
36   32   f       124       124  124   -
37   45   m       120       120  -     -
38   37   m       132       132  132   132
39   31   f       135       135  135   135
40   29   m       131       131  131   131
41   33   f       130       130  130   130
3. Analyzing the Randomness of Missing Data
Since the methods mentioned below assume the values are missing completely at random or missing at random, it is important to figure out the type of missing data. One way to diagnose the randomness, which comes from Little and Rubin (1987), is to build a correlation matrix of missingness for every pair of variables (as cited in Tsikriktsis, 2005, p. 56). The method works as follows: for each variable, an observed value is replaced by one and a missing value by zero. Then a correlation matrix can be built with software like R, and the correlations in the off-diagonal entries indicate the degree of association between the missingness of each pair of variables. Low correlations show strong independence between the missingness of the pairs of variables, which implies strong randomness. According to Tsikriktsis (2005), there are no "guidelines for identifying the level of correlation needed to indicate that the missing data are not random" (p. 56). If we believe the correlation matrix shows randomness between all pairs of variables, then we may consider the dataset to have MCAR-type missing data. If we observe significant correlations between some pairs of variables, then we may need to assume that the dataset is not MCAR. In general, we cannot test whether MAR holds in a dataset unless we can obtain follow-up data from non-respondents (Glynn, Laird, & Rubin, 1993, pp. 992-993).
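The indicator-recoding diagnostic can be sketched in Python (the paper works in R; this numpy version is an illustrative translation, and the demo data are invented):

```python
import numpy as np

def missingness_correlations(data):
    """Little & Rubin style diagnostic: recode each variable as
    1 = observed, 0 = missing, then correlate the indicator columns.
    Near-zero off-diagonal correlations are consistent with MCAR."""
    indicators = (~np.isnan(data)).astype(float)
    # Columns with no missing values give a constant indicator, whose
    # correlation is undefined, so drop them first.
    varying = indicators.std(axis=0) > 0
    return np.corrcoef(indicators[:, varying], rowvar=False)

# Two variables whose values go missing independently should show a
# near-zero off-diagonal correlation.
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))
data[rng.random(500) < 0.2, 0] = np.nan
data[rng.random(500) < 0.2, 1] = np.nan   # column 2 stays complete
corr = missingness_correlations(data)
```

As the text notes, there is no agreed cutoff for "low"; the matrix is a diagnostic aid, not a formal test.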
4. Review of Traditional Approaches
Many traditional approaches, including deletion and single-imputation approaches, are heavily used by researchers. In this section, some common traditional approaches are reviewed.
4.1 Deletion approaches:
4.1.1 Listwise deletion
This method deletes the whole row of data when any of its values is missing, so it is also called complete-case analysis. Many researchers use this approach because it avoids making up data to insert in the missing places. Clearly, this method is simple, but it typically results in losing 20%-50% of the data (Acock, 2005, p. 1015). Despite the fact that this large loss of data reduces statistical power and generates biased estimates if the data are not MCAR, it is used by default in many statistical programs (Schafer & Graham, 2002, p. 151).
4.1.2 Pairwise deletion
In order to minimize the loss that occurs in listwise deletion, researchers sometimes use a method called pairwise deletion (available-case analysis), which means using all available data for each statistic. For example, in Table 1, when systolic blood pressure is MCAR, a statistic involving only age and gender can still use every individual's data, since neither of those variables has any missing values; only statistics involving systolic blood pressure are restricted to the 35 available cases. Even when MCAR holds, case deletion may still be inefficient (Schafer & Graham, 2002, p. 156).
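The difference in usable sample size between the two deletion schemes can be shown with a small numpy sketch (the matrix below is invented, but its shape and missingness count mimic the MCAR column of Table 1):

```python
import numpy as np

# Hypothetical 41-by-3 matrix: two complete covariates plus one outcome
# with 6 values deleted completely at random.
rng = np.random.default_rng(2)
data = rng.normal(size=(41, 3))
data[rng.choice(41, size=6, replace=False), 2] = np.nan

# Listwise deletion keeps only fully observed rows.
listwise = data[~np.isnan(data).any(axis=1)]

# Pairwise deletion computes each statistic from all rows available for
# that particular pair of variables.
pair_01 = (~np.isnan(data[:, 0]) & ~np.isnan(data[:, 1])).sum()  # complete pair
pair_02 = (~np.isnan(data[:, 0]) & ~np.isnan(data[:, 2])).sum()  # pair with holes
```

Listwise deletion leaves 35 rows for every analysis, while pairwise deletion still uses all 41 rows for the fully observed pair.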
4.2 Single Imputation
This is a common strategy; researchers use it when there is not too much missing data. The main idea is to insert a plausible guess for each missing value. Common choices include:
4.2.1 Substitution with the overall mean
When a value is missing, we substitute it with the mean of the observed values. This is an easy way to handle missing data, but it seriously reduces the variability, especially when the dataset has a large amount of missing data. Also, since this approach ignores the relationships between variables, it may weaken the estimates of covariance and correlation.
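The variance shrinkage can be seen concretely in a small numpy sketch; the numbers (mean 130, sd 12, 30% missing) are invented for illustration:

```python
import numpy as np

# A variable with invented mean and spread, 30% deleted completely at random.
rng = np.random.default_rng(3)
x = rng.normal(130, 12, 1000)
x_miss = x.copy()
x_miss[rng.random(1000) < 0.3] = np.nan

# Mean substitution: every hole gets the observed mean.
filled = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

# The mean is unchanged, but the imputed points add no spread at all,
# so the standard deviation shrinks by roughly sqrt(observed fraction).
```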
4.2.2 Dummy variable adjustment
If a single variable, like systolic blood pressure in Table 1, has some missing data, then a binary variable is created whose value is 1 when systolic blood pressure is missing and 0 when it is present. Then, residents who have a missing value for systolic blood pressure are assigned the mean, and the dummy variable is included in the regression. According to Allison's research, this approach has been discredited and should not be used (as cited in Graham, 2009, p. 555).
4.2.3 Regression estimation
For this approach, we first estimate the regression coefficients among the variables, and then we use those coefficients to estimate the missing values (Frane, 1976, p. 410). This approach uses the available observed data, but since we replace the missing data with estimated values from the regression model, we may inflate the coefficient of determination and the correlation between variables and reduce variability.
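The inflation effect can be demonstrated with a numpy sketch; the data-generating numbers are invented, and the fit uses ordinary least squares on the complete cases:

```python
import numpy as np

# Hypothetical data: systolic pressure rising with age, 25% of the
# systolic values missing completely at random.
rng = np.random.default_rng(4)
age = rng.uniform(18, 90, 300)
systolic = 100 + 0.6 * age + rng.normal(0, 8, 300)
miss = rng.random(300) < 0.25

# Fit the regression on the complete cases, then replace each missing
# value with its fitted value on the regression line.
b1, b0 = np.polyfit(age[~miss], systolic[~miss], 1)
imputed = systolic.copy()
imputed[miss] = b0 + b1 * age[miss]

# Imputed points sit exactly on the line (zero residual), so the
# observed correlation is overstated relative to the true data.
r_true = np.corrcoef(age, systolic)[0, 1]
r_imputed = np.corrcoef(age, imputed)[0, 1]
```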
5. Modern Approach
5.1 Multiple imputation (MI)
Unlike single imputation, in MI the missing values are replaced by M > 1 sets of simulated imputed values. In each set, the same missing value is replaced by a slightly different value, resulting in M plausible but different versions of the complete data (Collins, Schafer, & Kam, 2001, p. 335). Typically, M = 5 to M = 10 imputations are sufficient to yield highly efficient estimates (as cited in Collins, Schafer, & Kam, 2001, p. 335). The imputation equation is

Z = b_0 + b_1 X + b_2 Y + sE,

where Z is the estimate of the missing value, X and Y are fully observed variables, E is a random draw from a standard normal distribution, and s is the estimated standard deviation of the error term in the regression (Allison, 2012, p. 2).
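As a sketch of this stochastic imputation step, the snippet below fits the imputation regression on the complete cases and draws M = 5 sets of imputed values. The data and coefficients are invented, and this is a simplification: proper MI would also perturb the regression coefficients between imputations, a step omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 300, 5
x = rng.uniform(18, 90, n)               # fully observed
y = rng.normal(80, 8, n)                 # fully observed
z = 40 + 0.5 * x + 0.7 * y + rng.normal(0, 6, n)
miss = rng.random(n) < 0.2
z_obs = z.copy()
z_obs[miss] = np.nan

# Least-squares fit of Z on X and Y using the complete cases.
A = np.column_stack([np.ones(n), x, y])[~miss]
coef, ssr, *_ = np.linalg.lstsq(A, z[~miss], rcond=None)
s = np.sqrt(ssr[0] / ((~miss).sum() - 3))   # residual standard deviation

# M imputed datasets: fitted value plus fresh noise s*E each time, so the
# datasets agree where data are observed but differ where they are missing.
completed = []
for _ in range(M):
    z_m = z_obs.copy()
    z_m[miss] = (coef[0] + coef[1] * x[miss] + coef[2] * y[miss]
                 + s * rng.standard_normal(miss.sum()))
    completed.append(z_m)
```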
5.2 Maximum Likelihood (ML)
The maximum likelihood approach chooses the parameter values that give the highest possible probability or probability density to the observed data values. The likelihood function is

L(\theta) = \prod_{i=1}^{n} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta).

If the missing data are MAR, which is the assumption of ML, then the missing values are removed from the likelihood by a process of summation or integration (Collins, Schafer, & Kam, 2001, p. 335). For example, suppose observation i is missing y_{i1}. If y_{i1} is discrete, the joint probability of the remaining values is

f_i(y_{i2}, y_{i3}, \ldots, y_{ik}; \theta) = \sum_{y_{i1}} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta).

If y_{i1} is continuous, the joint probability is

f_i(y_{i2}, y_{i3}, \ldots, y_{ik}; \theta) = \int f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta) \, dy_{i1}.

The overall likelihood for n independent observations is the product of the likelihoods of all the observations. Suppose the last m observations are missing y_{i1}; the likelihood function for the full dataset is then

L(\theta) = \prod_{i=1}^{n-m} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta) \prod_{i=n-m+1}^{n} f_i(y_{i2}, \ldots, y_{ik}; \theta).

For ML we discuss the case where MAR applies only to the response variable and there are no auxiliary variables, just like Table 1 restricted to columns 1, 2, and 4.
6. Comparison between listwise deletion and multiple imputation
Among these approaches, complete-case analysis (listwise deletion) is the most common: of 262 journal articles published in 2010, 81% used complete-case analysis (Eekhout et al., 2012, p. 729), even though listwise deletion has the several shortcomings discussed above; it is also the default method in R. Meanwhile, multiple imputation has become easy to carry out in R thanks to several new packages. For these reasons, a comparison is given to show that multiple imputation deserves serious consideration. The interest is in finding the effect of age and gender on systolic blood pressure.
6.1 Original data
The original data has no missing values; it consists of the first three columns of Table 1. For this dataset we use the model systolic bp ~ age * gender. The results are shown in Output 1. From Output 1, we can see that age has a very low p-value while gender and the interaction term have high ones, which indicates that age is a useful predictor of systolic blood pressure, with coefficient 0.66, and gender is not significant.
6.2 Dealing with the MAR blood pressure sample
6.2.1 Dealing with listwise deletion
The results are shown in Output 2. With this method, both the gender predictor and the interaction have much lower p-values than with the original data and become significant.
6.2.2 Dealing with MI
For the MI approach, this paper uses the mice package in R. As mentioned above, usually 5 sets will generate efficient estimates, so we choose to generate 5 datasets, using predictive mean matching as the imputation method. Table 2 shows the results of the 5 different sets of imputations generated by the MI procedure.
Table 2: 5 sets of imputations ("Missing place" is the row number in Table 1).

Missing place  Imputation 1  Imputation 2  Imputation 3  Imputation 4  Imputation 5
3              135           114           126           134           126
8              126           114           127           124           126
15             127           126           127           127           114
19             131           127           114           135           127
22             131           127           127           132           127
34             127           131           127           131           127

Then we can examine the regressions fitted to the 5 imputed datasets, which are shown in Output 3. We can see that the intercepts and coefficients differ slightly across the 5 imputed datasets.
Now we combine the separate models; in mice this function is called "pool". The final results are shown in Output 4. For MI, we can see that age is the only useful predictor, with coefficient 0.57, and gender is not significant, just as in the original regression analysis.
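The combination step behind mice's pool() is Rubin's rules, which can be sketched in Python (an illustrative translation; the paper's actual analysis runs in R, and the numbers in the usage line are invented):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Rubin's rules for combining M per-imputation results:
    the pooled estimate is the mean of the M estimates, and its total
    variance is W + (1 + 1/M) * B, where W is the mean within-imputation
    variance and B the between-imputation variance of the estimates."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    q_bar = q.mean()               # pooled point estimate
    W = u.mean()                   # average within-imputation variance
    B = q.var(ddof=1)              # between-imputation variance
    T = W + (1 + 1 / M) * B        # total variance of the pooled estimate
    return q_bar, T

# e.g. five hypothetical age coefficients and their squared standard errors
estimate, total_var = pool_estimates([0.57, 0.55, 0.58, 0.56, 0.59],
                                     [0.010, 0.012, 0.011, 0.009, 0.010])
```

The between-imputation term B is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing data into the pooled standard errors.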
6.3 Dealing with the MCAR blood pressure sample
MCAR is a stronger assumption than MAR, and this situation is relatively rare compared to MAR and MNAR.
6.3.1 Dealing with listwise deletion
The results are presented in Output 5. Age remains the only useful predictor of systolic blood pressure, with coefficient 0.66; the p-values for gender and the interaction are lower compared to the original data.
6.3.2 Dealing with MI
The procedure is almost the same as shown in 6.2.2. For this section, the final results are shown in Output 6. For MI we can see that age is the only useful predictor, with coefficient 0.66, and gender is not significant, just as in the original regression analysis.
6.4 Dealing with the MNAR blood pressure sample
As mentioned above, we cannot test whether the missing data are MAR without further information about the non-respondents. In many cases, the MAR assumption may not hold as researchers claim, and MNAR is what researchers actually deal with.
6.4.1 Dealing with listwise deletion
The results for listwise deletion are shown in Output 7. Both predictors are significant at the 0.05 significance level; age has coefficient 0.60, males' average systolic blood pressure is 11.35 greater than females', and males have a more slowly growing systolic blood pressure than females.
6.4.2 Dealing with MI
The procedure is almost the same as shown in 6.2.2; we only provide the final results, in Output 8. For MI we can see that age is a useful predictor with coefficient 0.56; gender, with coefficient 8.37, is not significant, while the interaction is significant.
7. Discussion
Many missing data methods assume MCAR or MAR, but our data are often MNAR. Thus, by comparing how listwise deletion and MI perform under the three types of missing data, we can judge which one may work better. Table 3 summarizes the comparison by listing the coefficient and p-value of each predictor for the two methods under the three missing-data types.
Table 3: Coefficients and p-values for each predictor.

Original:
Predictor    age   genderm  age:genderm
Coefficient  0.66  11.69    -0.25
p-value      0.00  0.07     0.07

MAR (coefficient/p-value):
Method             age        genderm     age:genderm
Listwise deletion  0.65/0.00  16.67/0.03  -0.35/0.03
MI                 0.57/0.00  11.41/0.08  -0.27/0.05

MCAR (coefficient/p-value):
Method             age        genderm    age:genderm
Listwise deletion  0.66/0.00  8.48/0.27  -0.18/0.29
MI                 0.66/0.00  8.84/0.26  -0.18/0.32

MNAR (coefficient/p-value):
Method             age        genderm     age:genderm
Listwise deletion  0.60/0.00  11.35/0.04  -0.27/0.02
MI                 0.56/0.00  8.37/0.08   -0.23/0.03
From Table 3, we can see that with the original data only age is a significant predictor, with coefficient 0.66. When only the MAR assumption holds, if we use listwise deletion, both gender and the interaction become significant; but if we use MI, these two terms remain non-significant. At least for this case, we should prefer MI. When the stronger MCAR assumption holds, we can see both methods work well and generate efficient results: both coefficients of age are 0.66, and gender and the interaction are non-significant. Thus, if the missing type is MCAR, we may use either listwise deletion or MI. When we face the more general condition, MNAR, we can see that with listwise deletion all of the predictors become significant. But if we use MI, based on the hierarchy rule in regression modeling, although the interaction is significant at the 0.05 level, we will not include it in the model, since its parent term gender is not significant. As for the coefficient of age, we observe that with MI the difference from the original data is bigger than with listwise deletion; but since MI shows that age is the only significant term in the model, just as the original data show, we had better choose MI.
Since the dataset contains only 41 individuals from the islands, its analysis results are limited: each missing value plays a relatively big role. Especially for MNAR, there are 8 missing values, which would influence the results no matter which method we use.
8. Conclusion
Listwise deletion is simple and commonly used by researchers, and it may perform reasonably well in some situations (Schafer & Graham, 2002, p. 150). But multiple imputation produces estimates that have more desirable statistical properties: "they are consistent (and, hence, approximately unbiased in large samples), asymptotically efficient (almost), and asymptotically normal" if we follow the right procedure (Allison, 2012, p. 2). When researchers are not sure which type of missing values they are dealing with, they should consider not using the default method of analysis software such as R, and choose multiple imputation instead, as it can handle MAR, MCAR, and MNAR in a more proper way.
References
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for
nonignorable nonresponse with follow-ups. Journal of the American Statistical
Association, 88(423), 984-993.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis.
Psychometrika, 41(3), 409-415.
Allison, P. D. (2012, April). Handling missing data by maximum likelihood. In SAS global forum
(Vol. 312).
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art.
Psychological methods, 7(2), 147.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual review of
psychology, 60, 549-576.
Tsikriktsis, N. (2005). A review of techniques for treating missing data in OM survey research.
Journal of Operations Management, 24(1), 53-62.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological methods, 6(4), 330.
Eekhout, I., de Boer, R. M., Twisk, J. W., de Vet, H. C., & Heymans, M. W. (2012). Missing data: a systematic review of how they are reported and handled. Epidemiology, 23(5), 729-732.