Statistical Methods of Handling Missing Data-Comparison of Listwise Deletion and Multiple
Imputation
Tianfan Song
Instructor: Meng-Hsuan (Tony) Wu
5.12.2016
Abstract:
This paper briefly reviews traditional and modern approaches for dealing with missing
data. Since the most commonly used approach is listwise deletion, a comparison between
listwise deletion and a popular modern approach, multiple imputation (MI), is addressed.
Through an example of analyzing a blood pressure dataset, this paper shows that MI should be
used when researchers are not sure what type of missing data they are dealing with.
Keywords: MCAR, MAR, MNAR, Deletion, Imputation, Comparison
1. Introduction:
Missing data are not rare in most datasets. There are typically three different types of
missing data: missing at random, missing completely at random, and missing not at random.
However, many analytic procedures, many of which were developed early in the twentieth
century, were designed for complete data (Graham, 2009, p.550). In the late twentieth
century, different approaches to handling missing data were developed. Among these
methods, listwise deletion, pairwise deletion, and single imputation as traditional approaches,
and multiple imputation and maximum likelihood as modern approaches, are widely used. In R,
the default way to handle missing data is listwise deletion, while multiple imputation has become
convenient to use.
2. Types of missing data:
Missing at random (MAR): If the probability of failing to observe a value does not depend on
the unobserved data, we call this type of missing data MAR. In other words, according to
Rubin, the probability of missing data does not depend on the missing data itself, but only on the
observed data (as cited in Schafer & Graham, 2002, p.151).
Missing completely at random (MCAR): MCAR is a special case of MAR; it means the
probability of missing data does not depend on the complete data at all (as cited in Schafer &
Graham, 2002, p.151). In this situation, we can think of the dataset as fixed while the missing
values appear randomly throughout the data matrix (Acock, 2005, p.1014).
Missing not at random (MNAR): The missingness depends on variables that are not
observed. For example, suppose a study gathers information about people's daily expenditure.
Several participants refuse to provide information about their entertainment expenditure
because they have lower salaries and spend little on entertainment, or because they are older
than 80 and spend less on entertainment, and the survey does not record these reasons. This
type of missing value depends on the missing value itself.
2.1 Example of missing data:
Table 1 illustrates the three types of missing data; "bp" stands for blood pressure. There are 41
randomly selected residents from the islands. The first column records each person's age, the
second records gender, and the third records the complete systolic blood pressure. The
remaining columns show the systolic blood pressure values that remain after imposing the three
different types of missing data.
To simulate MAR on systolic blood pressure, systolic blood pressure is recorded only for
individuals aged 25 or older. This could happen in reality, since younger people may be less
concerned with recording blood pressure. The missingness thus depends only on age, not on
systolic blood pressure itself. To simulate MCAR, 35 individuals are randomly selected and their
systolic blood pressure is reported. The last column comes from MNAR; to simulate it, systolic
blood pressure is reported only when diastolic blood pressure is over 80. Diastolic blood
pressure is a variable not included in the dataset.
Table 1: Blood pressure sample from the islands. (NA indicates a missing value.)
Age  Gender  Complete systolic bp  MAR systolic bp  MCAR systolic bp  MNAR systolic bp
82   m       148                   148              NA                148
39   f       135                   135              135               135
23   f       125                   NA               125               125
41   m       136                   136              NA                136
88   f       166                   166              166               166
25   f       114                   114              114               NA
57   m       148                   148              148               148
20   m       111                   NA               111               NA
32   m       134                   134              NA                134
52   m       137                   137              137               137
42   f       136                   136              136               136
27   f       127                   127              127               127
69   f       150                   150              150               150
81   f       164                   164              164               164
18   f       120                   NA               120               NA
32   m       132                   132              132               132
59   m       127                   127              127               127
47   f       133                   133              133               133
24   m       124                   NA               124               124
37   m       135                   135              NA                135
60   f       140                   140              140               140
22   f       121                   NA               121               121
29   f       127                   127              127               NA
77   f       158                   158              158               158
47   m       139                   139              139               139
27   m       126                   126              126               126
54   f       141                   141              141               141
56   m       140                   140              140               140
50   f       118                   118              118               NA
37   m       135                   135              NA                135
39   m       141                   141              141               141
43   m       126                   126              126               126
58   f       130                   130              130               130
20   f       103                   NA               103               NA
61   m       142                   142              142               142
32   f       124                   124              124               NA
45   m       120                   120              NA                NA
37   m       132                   132              132               132
31   f       135                   135              135               135
29   m       131                   131              131               131
33   f       130                   130              130               130
3. Analyzing the randomness of missing data
Since the methods mentioned below assume the values are missing completely at random or
missing at random, it is important to figure out the type of missing data. One way to diagnose the
randomness, which comes from Little and Rubin (1987), is to build a correlation matrix of
missingness for pairs of variables (as cited in Tsikriktsis, 2005, p.56). In detail: for each
variable, an observed value is replaced by one and a missing value by zero. A correlation
matrix can then be built with software such as R, and the correlations in the off-diagonal
entries indicate the degree of association between the missingness on each pair of variables.
Low correlations indicate that the missingness patterns are largely independent, which implies
strong randomness. According to Tsikriktsis (2005), there are no "guidelines for identifying the
level of correlation needed to indicate that the missing data are not random" (p.56). If we believe
the correlation matrix shows randomness between all pairs of variables, we may consider the
dataset to have MCAR missing data. If we observe significant correlations between some pairs
of variables, we should assume that the data are not MCAR. In general, we cannot test whether
MAR holds in a dataset unless we can obtain follow-up data from non-respondents (Glynn,
Laird, & Rubin, 1993, pp.992-993).
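The indicator-recoding step above can be sketched in a few lines. This is an illustrative Python sketch on a hypothetical toy dataset (the paper's analysis is done in R); the values and the missingness pattern below are invented for demonstration.

```python
import math

# Hypothetical toy data: two variables with missing entries marked None.
x = [82, None, 23, 41, None, 25, 57, 20]
y = [148, None, None, 136, None, 114, 148, 111]

# Step 1: recode each variable to a missingness indicator
# (1 = observed, 0 = missing), as in Little and Rubin's diagnostic.
ix = [0 if v is None else 1 for v in x]
iy = [0 if v is None else 1 for v in y]

def pearson(a, b):
    # Plain Pearson correlation between two numeric sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

# Step 2: correlate the indicators; a high value means the two variables
# tend to be missing together, which is evidence against MCAR.
r = pearson(ix, iy)
print(round(r, 3))
```

Here the two variables co-miss in several rows, so the indicator correlation is high, suggesting the missingness is not completely at random.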
4. Review of traditional approaches
Many traditional approaches, including deletion and single imputation, are heavily used by
researchers. In this section, some common traditional approaches are reviewed.
4.1 Deletion approaches:
4.1.1 Listwise deletion
This method deletes the whole row of data when any value in the row is missing, so it is
also called complete-case analysis. Many researchers use this approach because it avoids
making up data to insert in the missing places. Clearly, this method is simple, but it
typically results in a loss of 20%-50% of the data (Acock, 2005, p.1015). Despite the fact that this
large loss of data reduces statistical power and generates biased estimates if the data are not
MCAR, it is used by default in many statistical programs (Schafer & Graham, 2002, p.151).
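Complete-case deletion amounts to keeping only fully observed rows. A minimal Python sketch on hypothetical (age, systolic bp) pairs:

```python
# Listwise (complete-case) deletion: drop any row containing a missing value.
# The rows below are hypothetical (age, systolic bp) pairs.
rows = [(82, 148), (39, None), (23, 125), (None, 136), (88, 166)]

complete_cases = [r for r in rows if None not in r]
print(complete_cases)                              # only fully observed rows survive
print(len(rows) - len(complete_cases), "rows lost")
```

Even with only one missing cell per row, each affected row is discarded entirely, which is where the 20%-50% data loss mentioned above comes from.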
4.1.2 Pairwise deletion
In order to minimize the loss that occurs in listwise deletion, researchers sometimes use a
method called pairwise deletion (available-case analysis), which means using all available data
for each computation. For example, in Table 1, when systolic blood pressure is MCAR and we
want to find the relationship between age and gender, we can use every individual's data since
neither of those variables has any missing values. Even when MCAR holds, case deletion may
still be inefficient (Schafer & Graham, 2002, p.156).
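The difference from listwise deletion is that each pair of variables keeps its own set of usable rows. An illustrative Python sketch with hypothetical columns (age, diastolic bp, systolic bp):

```python
# Pairwise (available-case) deletion: for each pair of variables, use every
# row where BOTH members of that pair are observed, instead of dropping the
# row from the whole analysis. Columns are hypothetical: age, dbp, sbp.
data = [
    (82, 90, 148),
    (39, 85, None),   # sbp missing: row still usable for the (age, dbp) pair
    (23, 78, 125),
    (41, None, 136),  # dbp missing: row still usable for the (age, sbp) pair
]

def available_pairs(data, i, j):
    # Keep rows where both column i and column j are observed.
    return [(r[i], r[j]) for r in data if r[i] is not None and r[j] is not None]

age_dbp = available_pairs(data, 0, 1)
age_sbp = available_pairs(data, 0, 2)
print(len(age_dbp), len(age_sbp))  # each pair keeps 3 rows; listwise keeps only 2
```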
4.2 Single Imputation
Single imputation is a common strategy when there is not too much missing data; the main
idea is to insert a plausible guess for each missing value. Common choices include:
4.2.1 Substitution with the overall mean
When a value is missing, we substitute it with the variable's mean. This is an easy way to handle
missing data, but it seriously reduces variability, especially when the dataset has a large
amount of missing data. Also, since this approach ignores the relationships between variables, it
can weaken the estimates of covariance and correlation.
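The variance-shrinking effect is easy to demonstrate. A small Python sketch with hypothetical systolic values:

```python
import statistics

# Mean substitution and the variance shrinkage it causes.
# Hypothetical systolic values, plus two missing entries.
observed = [148, 135, 125, 136, 166, 114]
with_missing = observed + [None, None]

mean = statistics.mean(observed)
imputed = [v if v is not None else mean for v in with_missing]

# The imputed sample's variance is smaller: the filled-in values sit exactly
# at the mean and contribute nothing to the spread.
print(statistics.variance(observed) > statistics.variance(imputed))
```

The more values are imputed this way, the more the sample variance is understated.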
4.2.2 Dummy variable adjustment
If a single variable, such as systolic blood pressure in Table 1, has some missing data,
a binary indicator is created that takes the value 1 when systolic blood pressure is missing and 0
when it is present. Residents with a missing systolic blood pressure value are then assigned the
mean, and the dummy variable is included in the regression. According to Allison's research, this
approach has been discredited and should not be used (as cited in Graham, 2009, p.555).
4.2.3 Regression estimation
In this approach, we first estimate the regression coefficients among the variables, and then use
them to predict the missing values (Frane, 1976, p.410). This approach uses the
available observed data, but since we replace the missing data with fitted values from the
regression model, it can inflate the coefficient of determination and the correlations between
variables and reduce variability.
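A minimal Python sketch of the idea: fit a least-squares line of systolic bp on age using the complete cases, then fill each missing value with its prediction. All numbers are hypothetical.

```python
# Regression imputation sketch: fit sbp ~ age on complete cases, then fill
# each missing sbp with its fitted value. Hypothetical (age, sbp) pairs.
pairs = [(82, 148), (39, 135), (88, 166), (57, 148), (20, 111), (41, None)]

complete = [(x, y) for x, y in pairs if y is not None]
n = len(complete)
mx = sum(x for x, _ in complete) / n
my = sum(y for _, y in complete) / n
# Ordinary least-squares slope and intercept.
b1 = (sum((x - mx) * (y - my) for x, y in complete)
      / sum((x - mx) ** 2 for x, _ in complete))
b0 = my - b1 * mx

# Every missing value falls exactly on the fitted line, which is why this
# method inflates correlations and understates variability.
filled = [(x, y if y is not None else b0 + b1 * x) for x, y in pairs]
print(round(filled[-1][1], 1))
```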
5. Modern approaches
5.1 Multiple imputation (MI)
Unlike single imputation, in MI the missing values are replaced by M > 1 sets of
simulated imputed values. In each set, the same missing value is replaced by a slightly
different value, resulting in M plausible but different versions of the complete data (Collins,
Schafer, & Kam, 2001, p.335). Typically, M = 5 to M = 10 imputations are sufficient to yield
highly efficient estimates (as cited in Collins, Schafer, & Kam, 2001, p.335). The imputation
equation is:

Z = b0 + b1*X + b2*Y + s*E

where Z is the estimate of the missing value, X and Y are fully observed variables, E is a
random draw from a standard normal distribution, and s is the estimated standard deviation of
the error term in the regression (Allison, 2012, p.2).
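The imputation equation can be sketched with a single predictor. This Python sketch assumes the intercept, slope, and residual standard deviation rather than estimating them; the point is only that repeating the random draw M times produces M plausible completed datasets.

```python
import random

# Stochastic regression draw behind MI, with one predictor: each missing Z
# is filled with its regression prediction plus a random normal error, and
# the draw is repeated M times. Coefficients and residual sd are assumed.
b0, b1, s = 104.0, 0.66, 9.0      # hypothetical intercept, slope, residual sd
x_missing = [23, 20, 18]          # hypothetical ages whose systolic bp is missing
M = 5

random.seed(1)                    # fixed seed so the sketch is reproducible
imputations = [
    [b0 + b1 * x + s * random.gauss(0, 1) for x in x_missing]
    for _ in range(M)
]
print(len(imputations), len(imputations[0]))  # M sets, one value per missing cell
```

Because of the s*E term, the same cell receives a different value in each set, which is exactly what lets MI propagate the uncertainty that single imputation hides.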
5.2 Maximum Likelihood (ML)
The maximum likelihood approach chooses the parameter values that give the highest
possible probability or probability density to the observed data. The likelihood function for n
independent observations is:

L(θ) = ∏_{i=1}^{n} f_i(y_i1, y_i2, ..., y_ik; θ)

If the missing data are MAR, which is the assumption of ML, then the missing data values
are removed from the likelihood by a process of summation or integration (Collins, Schafer, &
Kam, 2001, p.335). For example, suppose observation i is missing y_1. If y_1 is discrete, its
contribution to the likelihood becomes:

f_i(y_i2, ..., y_ik; θ) = Σ_{y_1} f_i(y_i1, y_i2, ..., y_ik; θ)

If y_1 is continuous, the contribution is:

f_i(y_i2, ..., y_ik; θ) = ∫ f_i(y_i1, y_i2, ..., y_ik; θ) dy_1

The overall likelihood for n independent observations is the product of the likelihoods of all
the observations. Suppose the last m observations are missing y_1; the likelihood function for
the full dataset is:

L(θ) = ∏_{i=1}^{n-m} f_i(y_i1, y_i2, ..., y_ik; θ) × ∏_{i=n-m+1}^{n} f_i(y_i2, ..., y_ik; θ)

For ML we discuss only the case where MAR occurs on the response variable and there are no
auxiliary variables, as in columns 1, 2, and 4 of Table 1.
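The summation step can be made concrete with a toy discrete model. This Python sketch assumes y_1 ~ Bernoulli(θ) with known conditionals p(y_2 = 1 | y_1 = 1) = 0.9 and p(y_2 = 1 | y_1 = 0) = 0.2; all data and probabilities are invented, and the maximization is a crude grid search rather than a real optimizer.

```python
import math

P_Y2 = {1: 0.9, 0: 0.2}  # assumed p(y2 = 1 | y1)

def p_pair(y1, y2, theta):
    # Joint probability p(y1, y2; theta) under the toy model.
    py1 = theta if y1 == 1 else 1 - theta
    py2 = P_Y2[y1] if y2 == 1 else 1 - P_Y2[y1]
    return py1 * py2

def loglik(data, theta):
    total = 0.0
    for y1, y2 in data:
        if y1 is None:
            # y1 missing: sum it out of the joint, keeping only observed y2.
            p = p_pair(0, y2, theta) + p_pair(1, y2, theta)
        else:
            p = p_pair(y1, y2, theta)
        total += math.log(p)
    return total

# Hypothetical observations; None marks a missing y1.
data = [(1, 1), (0, 0), (1, 1), (None, 1), (None, 0), (0, 1)]
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: loglik(data, t))
print(theta_hat)
```

The two incomplete observations still contribute information about θ through y_2, instead of being discarded as listwise deletion would do.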
6. Comparison between listwise deletion and multiple imputation
Among these approaches, complete-case analysis is the most common: of 262 journal articles
published in 2010, 81% used complete-case analysis, i.e., listwise deletion (Eekhout et al.,
2012, p.729), despite the shortcomings discussed above; it is also the default method in R.
Meanwhile, multiple imputation has become easy to carry out in R thanks to several new
packages. For these reasons, a comparison is given to show that multiple imputation deserves
serious consideration. The interest is in the effect of age and gender on systolic blood
pressure.
6.1 Original data
The original data have no missing values; they are the first three columns of Table 1. For this
dataset we use the model systolic bp ~ age * gender. The results are shown in Output 1. From
Output 1, we can see that age has a very low p-value while gender and the interaction term
have high ones, which indicates that age is a useful predictor of systolic blood pressure, with
coefficient 0.66, and gender is not significant.
6.2 Dealing with the MAR blood pressure sample
6.2.1 Dealing with listwise deletion
The results are shown in Output 2. With this method, gender and the interaction have much
lower p-values than in the original analysis and become significant.
6.2.2 Dealing with MI
For the MI approach, this paper uses the mice package in R. As mentioned above, 5 imputed
datasets usually generate efficient estimates, so we generate 5 datasets and use predictive
mean matching as the imputation method. Table 2 shows the 5 different sets of imputations
generated by the MI procedure.
Table 2: 5 sets of imputation:
Missing place  Imputation 1  Imputation 2  Imputation 3  Imputation 4  Imputation 5
3              135           114           126           134           126
8              126           114           127           124           126
15             127           126           127           127           114
19             131           127           114           135           127
22             131           127           127           132           127
34             127           131           127           131           127
Then we can examine the regression results for the 5 imputed datasets, which are shown in
Output 3. The intercepts and coefficients differ slightly across the 5 imputed datasets.
Next we combine the separate models; in mice this function is called "pool". The final results
are shown in Output 4. With MI, age is the only useful predictor, with coefficient 0.57, and
gender is not significant, just as in the original regression analysis.
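The pooling step follows Rubin's rules: the M point estimates are averaged, and the pooled variance combines the average within-imputation variance with the between-imputation variance. The Python sketch below applies the rules to the five age coefficients from Output 3; the per-imputation standard errors are hypothetical placeholders, not values from the paper.

```python
import statistics

# Rubin's rules for combining M imputed-data analyses.
estimates = [0.5873, 0.5406, 0.5904, 0.5656, 0.5574]  # age coefficients, Output 3
ses = [0.075, 0.078, 0.074, 0.077, 0.076]             # assumed per-imputation SEs
M = len(estimates)

q_bar = statistics.mean(estimates)              # pooled point estimate
w_bar = statistics.mean(se ** 2 for se in ses)  # average within-imputation variance
b = statistics.variance(estimates)              # between-imputation variance
t = w_bar + (1 + 1 / M) * b                     # total variance of the pooled estimate
print(round(q_bar, 4), round(t ** 0.5, 4))
```

Since the pooled point estimate is just the mean of the five coefficients, this reproduces the age estimate reported in Output 4 (0.5683 to four decimals); the extra (1 + 1/M)·B term is what charges the estimate for the uncertainty due to the missing data.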
6.3 Dealing with the MCAR blood pressure sample
MCAR is a stronger assumption than MAR, and this situation is relatively rare compared to
MAR and MNAR.
6.3.1 Dealing with listwise deletion
The results are presented in Output 5. Age remains the only useful predictor of systolic
blood pressure, with coefficient 0.66; the p-values for gender and the interaction are higher
than for the original data, so both remain non-significant.
6.3.2 Dealing with MI
The procedure is almost the same as in 6.2.2. For this section, the final results are
shown in Output 6. With MI, age is the only useful predictor, with coefficient 0.66, and gender
is not significant, as in the original regression analysis.
6.4 Dealing with the MNAR blood pressure sample
As mentioned above, we can only test whether the missing data are MCAR; without further
information about the non-respondents, MAR cannot be verified. In many cases the MAR
assumption may not hold as researchers claim, and MNAR is what researchers actually deal
with.
6.4.1 Dealing with listwise deletion
The results for listwise deletion are shown in Output 7. Both predictors are significant at the
0.05 significance level; age has coefficient 0.60, males' average systolic blood pressure is 11.35
higher than females', and males' systolic blood pressure grows more slowly with age than
females'.
6.4.2 Dealing with MI
The procedure is almost the same as in 6.2.2; we only provide the final results, in
Output 8. With MI, age is a useful predictor with coefficient 0.56; gender is not significant
(coefficient 8.37, p = 0.09), while the interaction is significant at the 0.05 level.
7. Discussion
Many missing data methods assume MCAR or MAR, but our data are often MNAR. Thus, by
comparing how listwise deletion and MI perform under the three types of missing data, we can
analyze which one may work better. Table 3 summarizes the comparison by listing the
coefficient and p-value of each predictor for the two methods under the three missing data
types.
Table 3: Coefficient/p-value for each predictor.
Original:
Predictor    age        genderm     age:genderm
Coefficient  0.66       11.69       -0.25
p-value      0.00       0.07        0.07

MAR:
Method             age        genderm     age:genderm
Listwise deletion  0.65/0.00  16.67/0.03  -0.35/0.03
MI                 0.57/0.00  11.41/0.08  -0.27/0.05

MCAR:
Method             age        genderm     age:genderm
Listwise deletion  0.66/0.00  8.48/0.27   -0.18/0.29
MI                 0.66/0.00  8.84/0.26   -0.18/0.32

MNAR:
Method             age        genderm     age:genderm
Listwise deletion  0.60/0.00  11.35/0.04  -0.27/0.02
MI                 0.56/0.00  8.37/0.08   -0.23/0.03
From Table 3, we can see that with the original data only age is a significant predictor, with
coefficient 0.66. When only the MAR assumption holds, listwise deletion makes both gender
and the interaction significant, whereas MI keeps both terms non-significant; at least for this
case, we should prefer MI. When the stronger MCAR assumption holds, both methods work
well and generate efficient results: both coefficients of age are 0.66, and gender and the
interaction are non-significant. Thus, if the missing data type is MCAR, we may use either
listwise deletion or MI. Under the more general condition of MNAR, with listwise deletion all
of the predictors become significant. With MI, although the interaction is significant at the
0.05 level, by the hierarchy rule in regression models we would not include it, since gender
itself is not significant. As for the coefficient of age, MI departs from the original data more
than listwise deletion does; but since MI identifies age as the only term to keep in the model,
just as the original data do, MI is the better choice.
Since the dataset contains only 41 individuals from the islands, the analysis has its limits,
and each missing value plays a relatively big role. This is especially true for MNAR, where
there are 8 missing values, which influence the results no matter which method we use.
8. Conclusion
Listwise deletion is simple, is commonly used by researchers, and may perform
reasonably well in some situations (Schafer & Graham, 2002, p.150). But multiple imputation
produces estimates with more desirable statistical properties: "They are consistent (and,
hence, approximately unbiased in large samples), asymptotically efficient (almost), and
asymptotically normal" if the procedure is done correctly (Allison, 2012, p.2). When researchers
are not sure which type of missing values they are dealing with, they should consider not using
the default method of analysis software such as R and instead use multiple imputation, as it
can handle MAR, MCAR, and MNAR in a more proper way.
References
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for
nonignorable nonresponse with follow-ups. Journal of the American Statistical
Association, 88(423), 984-993.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis.
Psychometrika, 41(3), 409-415.
Allison, P. D. (2012, April). Handling missing data by maximum likelihood. In SAS global forum
(Vol. 312).
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art.
Psychological methods, 7(2), 147.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual review of
psychology, 60, 549-576.
Tsikriktsis, N. (2005). A review of techniques for treating missing data in OM survey research.
Journal of Operations Management, 24(1), 53-62.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological methods, 6(4), 330.
Eekhout, I., de Boer, R. M., Twisk, J. W., de Vet, H. C., & Heymans, M. W. (2012). Missing data:
a systematic review of how they are reported and handled. Epidemiology, 23(5), 729-732.
Appendix:
Output 1: regression results from the original data.
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.00143 3.74111 27.800 < 2e-16 ***
## age 0.66267 0.07677 8.632 2.15e-10 ***
## genderm 11.69622 6.30700 1.854 0.0717 .
## age:genderm -0.25421 0.13590 -1.871 0.0693 .
Output 2:
Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.68335 4.63542 22.583 < 2e-16 ***
## age 0.65129 0.08711 7.477 2.01e-08 ***
## genderm 16.67260 7.37165 2.262 0.0309 *
## age:genderm -0.35043 0.14954 -2.343 0.0257 *
Output 3:
analyses :
## [[1]]
##
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
##
## Coefficients:
## (Intercept) age gender2 age:gender2
## 108.7031 0.5873 14.4817 -0.3209
##
##
## [[2]]
##
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
##
## Coefficients:
## (Intercept) age gender2 age:gender2
## 111.6146 0.5406 9.2609 -0.2309
##
##
## [[3]]
##
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
##
## Coefficients:
## (Intercept) age gender2 age:gender2
## 108.4245 0.5904 12.4510 -0.2806
##
##
## [[4]]
##
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
##
## Coefficients:
## (Intercept) age gender2 age:gender2
## 110.0411 0.5656 10.5556 -0.2504
##
##
## [[5]]
##
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
##
## Coefficients:
## (Intercept) age gender2 age:gender2
## 110.5416 0.5574 10.3339 -0.2476
Output 4:
## est se t df Pr(>|t|)
## (Intercept) 109.8649804 3.79063778 28.983244 25.90669 0.000000e+00
## age 0.5682545 0.07543424 7.533111 29.92531 2.162265e-08
## gender2 11.4166290 6.32447102 1.805152 27.25506 8.211053e-02
## age:gender2 -0.2660750 0.13308720 -1.999253 30.37152 5.459700e-02
Output 5:
Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.00143 3.77911 27.520 < 2e-16 ***
## age 0.66267 0.07755 8.546 1.19e-09 ***
## genderm 8.48478 7.47132 1.136 0.265
## age:genderm -0.18035 0.16645 -1.084 0.287
Output 6:
## est se t df Pr(>|t|)
## (Intercept) 104.0014342 3.68164783 28.248610 35.14648 0.000000e+00
## age 0.6626701 0.07554582 8.771765 35.14648 2.243947e-10
## gender2 8.8357373 7.55535772 1.169466 14.58076 2.609787e-01
## age:gender2 -0.1769565 0.17028626 -1.039171 12.07404 3.190884e-01
Output 7:
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.32154 3.56119 30.698 < 2e-16 ***
## age 0.60077 0.06563 9.153 4.72e-10 ***
## genderm 11.35156 5.37859 2.111 0.0436 *
## age:genderm -0.27137 0.10904 -2.489 0.0188 *
Output 8:
## est se t df Pr(>|t|)
## (Intercept) 112.0912386 2.98080526 37.604348 20.57494 0.000000e+00
## age 0.5564151 0.05650112 9.847859 30.31590 5.846257e-11
## gender2 8.3651567 4.69331811 1.782355 28.87603 8.521021e-02
## age:gender2 -0.2261093 0.09822131 -2.302039 32.57357 2.786864e-02

More Related Content

What's hot

Factor analysis
Factor analysisFactor analysis
Factor analysis
saba khan
 
SELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODSSELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODS
KAMIL MAJEED
 
Statistical Analysis for Educational Outcomes Measurement in CME
Statistical Analysis for Educational Outcomes Measurement in CMEStatistical Analysis for Educational Outcomes Measurement in CME
Statistical Analysis for Educational Outcomes Measurement in CME
D. Warnick Consulting
 
Statistical Analysis Overview
Statistical Analysis OverviewStatistical Analysis Overview
Statistical Analysis Overview
Ecumene
 

What's hot (18)

Multiple imputation of missing data
Multiple imputation of missing dataMultiple imputation of missing data
Multiple imputation of missing data
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Factor analysis
Factor analysisFactor analysis
Factor analysis
 
Eda sri
Eda sriEda sri
Eda sri
 
SELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODSSELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODS
 
Statistical Analysis for Educational Outcomes Measurement in CME
Statistical Analysis for Educational Outcomes Measurement in CMEStatistical Analysis for Educational Outcomes Measurement in CME
Statistical Analysis for Educational Outcomes Measurement in CME
 
Burns And Bush Chapter 15
Burns And Bush Chapter 15Burns And Bush Chapter 15
Burns And Bush Chapter 15
 
Factor analysis using spss 2005
Factor analysis using spss 2005Factor analysis using spss 2005
Factor analysis using spss 2005
 
Basics of Educational Statistics (Inferential statistics)
Basics of Educational Statistics (Inferential statistics)Basics of Educational Statistics (Inferential statistics)
Basics of Educational Statistics (Inferential statistics)
 
Multivariate analysis - Multiple regression analysis
Multivariate analysis -  Multiple regression analysisMultivariate analysis -  Multiple regression analysis
Multivariate analysis - Multiple regression analysis
 
Statistical Analysis Overview
Statistical Analysis OverviewStatistical Analysis Overview
Statistical Analysis Overview
 
Blue property assumptions.
Blue property assumptions.Blue property assumptions.
Blue property assumptions.
 
Multivariate data analysis regression, cluster and factor analysis on spss
Multivariate data analysis   regression, cluster and factor analysis on spssMultivariate data analysis   regression, cluster and factor analysis on spss
Multivariate data analysis regression, cluster and factor analysis on spss
 
Standard error
Standard error Standard error
Standard error
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Exploratory Data Analysis for Biotechnology and Pharmaceutical Sciences
Exploratory Data Analysis for Biotechnology and Pharmaceutical SciencesExploratory Data Analysis for Biotechnology and Pharmaceutical Sciences
Exploratory Data Analysis for Biotechnology and Pharmaceutical Sciences
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
 
B025209013
B025209013B025209013
B025209013
 

Similar to Statistical Methods to Handle Missing Data

A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
CSCJournals
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
​Iván Rodríguez
 
What do youwant to doHow manyvariablesWhat level.docx
What do youwant to doHow manyvariablesWhat level.docxWhat do youwant to doHow manyvariablesWhat level.docx
What do youwant to doHow manyvariablesWhat level.docx
philipnelson29183
 
1 descriptive statistics
1 descriptive statistics1 descriptive statistics
1 descriptive statistics
Sanu Kumar
 
2014 IIAG Imputation Assessments
2014 IIAG Imputation Assessments2014 IIAG Imputation Assessments
2014 IIAG Imputation Assessments
Dr Lendy Spires
 
Summarization Techniques in Association Rule Data Mining For Risk Assessment ...
Summarization Techniques in Association Rule Data Mining For Risk Assessment ...Summarization Techniques in Association Rule Data Mining For Risk Assessment ...
Summarization Techniques in Association Rule Data Mining For Risk Assessment ...
IJTET Journal
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
cambridgeWD
 
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
cambridgeWD
 
Katagorisel veri analizi
Katagorisel veri analiziKatagorisel veri analizi
Katagorisel veri analizi
Burak Kocak
 
JSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzerJSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzer
Dennis Sweitzer
 

Similar to Statistical Methods to Handle Missing Data (20)

A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
 
Berd 5-6
Berd 5-6Berd 5-6
Berd 5-6
 
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MININGUNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
UNDERSTANDING LEAST ABSOLUTE VALUE IN REGRESSION-BASED DATA MINING
 
SPSS GuideAssessing Normality, Handling Missing Data, and Calculating Scores...
SPSS GuideAssessing Normality, Handling Missing Data, and Calculating  Scores...SPSS GuideAssessing Normality, Handling Missing Data, and Calculating  Scores...
SPSS GuideAssessing Normality, Handling Missing Data, and Calculating Scores...
 
What do youwant to doHow manyvariablesWhat level.docx
What do youwant to doHow manyvariablesWhat level.docxWhat do youwant to doHow manyvariablesWhat level.docx
What do youwant to doHow manyvariablesWhat level.docx
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
 
Advice On Statistical Analysis For Circulation Research
Advice On Statistical Analysis For Circulation ResearchAdvice On Statistical Analysis For Circulation Research
Advice On Statistical Analysis For Circulation Research
 

Statistical Methods to Handle Missing Data

  • 1. Statistical Methods of Handling Missing Data - Comparison of Listwise Deletion and Multiple Imputation
       Tianfan Song
       Instructor: Meng-Hsuan (Tony) Wu
       5.12.2016
  • 2. Abstract: This paper briefly reviews traditional and modern approaches for dealing with missing data. Since the most commonly used approach is listwise deletion, a comparison between listwise deletion and a popular modern approach, multiple imputation, is addressed. Through an example of analyzing a blood pressure dataset, this paper shows that MI should be used when researchers are not sure what type of missing data they are dealing with.
Key words: MCAR, MAR, MNAR, Deletion, Imputation, Comparison
1. Introduction:
Missing data is not a rare case in most datasets. There are typically three different types of missing data: missing at random, missing completely at random, and missing not at random. However, many analytic procedures, many of which were developed early in the twentieth century, were designed for complete data (Graham, 2009, p. 550). In the late twentieth century, different approaches to handling missing data were developed. Among these methods, listwise deletion, pairwise deletion, and single imputation are heavily used traditional approaches, while multiple imputation and maximum likelihood are widely used modern approaches. In R, the default way to handle missing data is listwise deletion, and multiple imputation has become convenient to use.
2. Types of missing data:
Missing at random (MAR): If the probability of failing to observe a value does not depend on the unobserved data, then we call this type of missing data MAR. In other words, according to
  • 3. Rubin, the probability of missing data does not depend on the missing data itself, but only on the observed data (as cited in Schafer & Graham, 2002, p. 151).
Missing completely at random (MCAR): MCAR is a special case of MAR; it means the probability of missing data does not depend on the complete data at all (as cited in Schafer & Graham, 2002, p. 151). For this situation, we can think of the whole dataset as a matrix in which the missing data appear randomly throughout (Acock, 2005, p. 1014).
Missing not at random (MNAR): The missingness depends on variables that are not observed. For example, suppose a study wants to gather information about people's daily expenditure. Several participants refuse to provide information about their entertainment expenditure because they have lower salaries and spend little on entertainment, or because they are older than 80 and spend less on entertainment, but the survey does not ask about these attributes. This type of missing value depends on the missing value itself.
2.1 Example of missing data:
Table 1 illustrates the three types of missing data; bp means blood pressure. There are 41 randomly selected residents from the islands. The first two columns record each individual's age and gender, and the third column records the complete systolic blood pressure. The remaining columns show the values of systolic blood pressure that remain after imposing the three different types of missing data. To simulate MAR on systolic blood pressure, systolic blood pressure is recorded only for individuals aged 25 or older. This could happen in reality, since younger people may not be concerned with recording blood pressure. We can easily see that the missing value depends only on age
  • 4. but not on systolic blood pressure itself. To simulate MCAR, 35 individuals are randomly selected and their systolic blood pressure is reported. The last column comes from MNAR; to simulate it, systolic blood pressure is reported only when diastolic blood pressure is over 80, and diastolic blood pressure is a variable not included in the dataset (a blood test result), so the missingness depends on unobserved information.
Table 1: Blood Pressure Sample from the Island.
Age  Gender  Complete systolic bp  MAR systolic bp  MCAR systolic bp  MNAR systolic bp
82   m       148                   148              NA                148
39   f       135                   135              135               135
23   f       125                   NA               125               125
41   m       136                   136              NA                136
88   f       166                   166              166               166
25   f       114                   114              114               NA
57   m       148                   148              148               148
20   m       111                   NA               111               NA
32   m       134                   134              NA                134
52   m       137                   137              137               137
42   f       136                   136              136               136
27   f       127                   127              127               127
69   f       150                   150              150               150
81   f       164                   164              164               164
18   f       120                   NA               120               NA
32   m       132                   132              132               132
59   m       127                   127              127               127
47   f       133                   133              133               133
24   m       124                   NA               124               124
37   m       135                   135              NA                135
60   f       140                   140              140               140
22   f       121                   NA               121               121
29   f       127                   127              127               NA
77   f       158                   158              158               158
47   m       139                   139              139               139
27   m       126                   126              126               126
  • 5. 54   f       141                   141              141               141
56   m       140                   140              140               140
50   f       118                   118              118               NA
37   m       135                   135              NA                135
39   m       141                   141              141               141
43   m       126                   126              126               126
58   f       130                   130              130               130
20   f       103                   NA               103               NA
61   m       142                   142              142               142
32   f       124                   124              124               NA
45   m       120                   120              NA                NA
37   m       132                   132              132               132
31   f       135                   135              135               135
29   m       131                   131              131               131
33   f       130                   130              130               130
3. Analyzing the randomness of missing data
Since the methods mentioned below assume the values are missing completely at random or missing at random, it is important to figure out the type of missing data. One way to diagnose the randomness, which comes from Little and Rubin (1987), is to build a correlation matrix of missingness for every pair of variables (as cited in Tsikriktsis, 2005, p. 56). Here is the detail of this method: for each variable, an observed value is replaced by one and a missing value by zero. Then a correlation matrix can be built with software such as R, and the correlations in the off-diagonal entries indicate the degree of association between the missingness of each pair of variables. Low correlations show strong independence within the pairs of variables, which implies strong randomness. According to Tsikriktsis (2005), there are no "guidelines for identifying the level of correlation needed to indicate that the missing data are not random" (p. 56). If we believe the correlation matrix shows randomness between all pairs of variables, then we may consider this dataset to have MCAR-type missing data. If we observe
  • 6. significant correlations between some pairs of variables, then we may need to assume that the dataset is not MCAR. In the general case, we cannot test whether MAR holds in a dataset unless we can obtain follow-up data from non-respondents (Glynn, Laird, & Rubin, 1993, pp. 992-993).
4. Review of traditional approaches
Many traditional approaches, including deletion and single imputation, are heavily used by researchers. In this section, some common traditional approaches are reviewed.
4.1 Deletion approaches:
4.1.1 Listwise deletion
This method deletes the whole row of data when one of its values is missing, so it is also called complete-case analysis. Many researchers use this approach because it avoids making up data to insert in the missing places. Clearly, this method is simple, but it typically results in losing 20%-50% of the data (Acock, 2005, p. 1015). Despite the fact that the large loss of data reduces statistical power and generates biased estimators if the data are not MCAR, it is used by default in many statistical programs (Schafer & Graham, 2002, p. 151).
4.1.2 Pairwise deletion
In order to minimize the loss that occurs in listwise deletion, researchers sometimes use a method called pairwise deletion (available-case analysis), which means using every available datum. For example, in Table 1, when systolic blood pressure is MCAR and we want to find the relationship between age and diastolic blood pressure, we can use every individual's data, since none of the diastolic blood pressure values is missing. Even when MCAR holds, case deletion may still be
  • 7. inefficient (Schafer & Graham, 2002, p. 156).
4.2 Single imputation
This is a common strategy; researchers use this approach when there is not too much missing data. The main idea is to insert a plausible guess into the missing part. Common choices include:
4.2.1 Substituting the overall mean
When there is a missing value, we substitute it with the mean. This is an easy way to handle missing data, but it seriously reduces the variability, especially when the dataset has a large amount of missing data. Also, since this approach ignores the relationships between variables, it may weaken the estimates of covariance and correlation.
4.2.2 Dummy variable adjustment
If a single variable, such as systolic blood pressure in Table 1, has some missing data, then a binary variable is created that equals 1 if the value of systolic blood pressure is missing and 0 if the value is present. Then, residents who have a missing value on systolic blood pressure are assigned the mean, and the dummy variable is included in the regression. According to Allison's research, this approach has been discredited and should not be used (as cited in Graham, 2009, p. 555).
4.2.3 Regression estimation
For this approach, we first estimate the coefficients among the variables, and then we use the regression coefficients to estimate the missing values (Frane, 1976, p. 410). This approach uses the available observed data, but since we replace the missing data with the estimated values from the regression model, we could inflate the coefficient of determination and the correlation between
  • 8. variables and reduce variability.
5. Modern approaches
5.1 Multiple imputation (MI)
Unlike single imputation, in MI the missing values are replaced by M > 1 sets of simulated imputed values. In different sets, the same missing value is replaced by slightly different values, resulting in M plausible but different versions of the complete data (Collins, Schafer, & Kam, 2001, p. 335). Typically, M = 5 to M = 10 imputations are sufficient to yield highly efficient estimates (as cited in Collins, Schafer, & Kam, 2001, p. 335). The imputation equation is:
Z* = b0 + b1·X + b2·Y + s·E
where Z* is the estimate of the missing value, X and Y are fully observed variables, E is a random draw from a standard normal distribution, and s is the estimated standard deviation of the error term in the regression (Allison, 2012, p. 2).
5.2 Maximum likelihood (ML)
The maximum likelihood approach chooses the parameter values that give the highest possible probability or probability density to the observed data values. The likelihood function is:
L(θ) = ∏_{i=1}^{n} f_i(y_i1, y_i2, ..., y_ik; θ)
If the missing data are MAR, which is the assumption of ML, then the missing data values are removed from the likelihood by a process of summation or integration (Collins, Schafer, & Kam, 2001, p. 335). For example, suppose observation i has a missing value y_i1. If y_i1 is
  • 9. discrete, then its contribution to the likelihood is:
f_i(y_i2, ..., y_ik; θ) = Σ_{y_i1} f_i(y_i1, y_i2, ..., y_ik; θ)
If y_i1 is continuous, then it is:
f_i(y_i2, ..., y_ik; θ) = ∫ f_i(y_i1, y_i2, ..., y_ik; θ) dy_i1
The overall likelihood for n independent observations is the product of the likelihoods of all the observations. Suppose the last m observations are missing y_i1; the likelihood function for the full dataset is:
L(θ) = ∏_{i=1}^{n-m} f_i(y_i1, y_i2, ..., y_ik; θ) · ∏_{i=n-m+1}^{n} f_i(y_i2, ..., y_ik; θ)
For ML we discuss only the case where MAR holds on the response variable alone and there are no auxiliary variables, as in Table 1 with columns 1, 2, and 4.
6. Comparison between listwise deletion and multiple imputation
Among these approaches, complete-case analysis, i.e. listwise deletion, is the most common: of 262 journal articles published in 2010, 81% used complete-case analysis (Eekhout et al., 2012, p. 729), although listwise deletion has the several shortcomings discussed above; it is also the default method in R. Meanwhile, multiple imputation has become easy to carry out in R thanks to several new packages. For these reasons, a comparison is given to show that multiple imputation deserves serious consideration. The interest is to find the effect of age and gender on systolic blood pressure.
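The imputation equation of Section 5.1 and the subsequent pooling step can be sketched numerically. The paper carries this out with R's mice package; the sketch below is illustrative only: plain Python/NumPy, invented data and variable names, and a single linear imputation model with a normal error draw rather than mice's predictive mean matching.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented complete data: outcome Z depends on two observed covariates X and Y.
n = 200
X = rng.uniform(20, 80, n)                          # plays the role of age
Y = rng.normal(80, 8, n)                            # a second observed covariate
Z = 100 + 0.6 * X + 0.2 * Y + rng.normal(0, 5, n)   # outcome, true X-slope 0.6

# Impose roughly 30% MCAR missingness on Z only.
miss = rng.random(n) < 0.3
Z_obs = Z.copy()
Z_obs[miss] = np.nan
obs = ~miss

# Fit the imputation regression Z ~ X + Y on the observed cases.
A = np.column_stack([np.ones(n), X, Y])
b, *_ = np.linalg.lstsq(A[obs], Z_obs[obs], rcond=None)
s = (Z_obs[obs] - A[obs] @ b).std(ddof=3)  # residual sd (3 fitted parameters)

# M imputations: Z* = b0 + b1 X + b2 Y + s E, with fresh normal noise E each time.
M = 5
est, var = [], []
for m in range(M):
    Z_m = Z_obs.copy()
    Z_m[miss] = A[miss] @ b + s * rng.standard_normal(miss.sum())
    # Analysis model on the completed data; track the X coefficient.
    bm, *_ = np.linalg.lstsq(A, Z_m, rcond=None)
    sigma2 = np.sum((Z_m - A @ bm) ** 2) / (n - 3)
    cov = sigma2 * np.linalg.inv(A.T @ A)
    est.append(bm[1])
    var.append(cov[1, 1])

# Rubin's rules: pool the M estimates and their variances.
qbar = np.mean(est)          # pooled point estimate of the X coefficient
W = np.mean(var)             # within-imputation variance
B = np.var(est, ddof=1)      # between-imputation variance
T = W + (1 + 1 / M) * B      # total variance of the pooled estimate
print(round(float(qbar), 2), round(float(np.sqrt(T)), 3))
```

Predictive mean matching, the mice default used later in this paper, replaces the s·E draw with an observed donor value whose predicted mean is closest; the "pool" step in mice is essentially the qbar/W/B arithmetic above.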
  • 10. 6.1 Original data
The original dataset has no missing values; it consists of the first three columns of Table 1. For this dataset we use the model: systolic bp ~ age * gender. The results are shown in Output 1. From Output 1, we can see that age has a very low p-value while gender and the interaction term have high ones, which indicates that age is a useful predictor of systolic blood pressure with coefficient 0.66 and that gender is not significant.
6.2 Dealing with the MAR blood pressure sample
6.2.1 Listwise deletion
The results are shown in Output 2. With this method, both gender and the interaction have much lower p-values than in the original analysis and become significant.
6.2.2 MI
For the MI approach, this paper uses the mice package in R. As mentioned above, 5 imputed datasets usually yield efficient estimates, so we choose to generate 5 datasets, using predictive mean matching as the imputation method. Table 2 shows the 5 different sets of imputations generated by the MI procedure.
Table 2: 5 sets of imputations:
Missing place  Imputation 1  Imputation 2  Imputation 3  Imputation 4  Imputation 5
 3             135           114           126           134           126
 8             126           114           127           124           126
15             127           126           127           127           114
19             131           127           114           135           127
  • 11. 22             131           127           127           132           127
34             127           131           127           131           127
Then we can examine the regression results of the 5 different imputations, which are shown in Output 3. We can see that the intercepts and coefficients are slightly different across the 5 imputed datasets. Now we combine the separate models; in mice this function is called "pool". The final results are shown in Output 4. With MI, age is the only useful predictor, with coefficient 0.57, and gender is not significant, as in the original regression analysis.
6.3 Dealing with the MCAR blood pressure sample
MCAR is a stronger assumption than MAR, and this situation is relatively rare compared to MAR and MNAR.
6.3.1 Listwise deletion
The results are presented in Output 5. Age remains the only useful predictor of systolic blood pressure, with coefficient 0.66; the p-values for gender and the interaction are lower compared to the original data.
6.3.2 MI
The procedure is almost the same as in 6.2.2. The final results are shown in Output 6. With MI, age is the only useful predictor, with coefficient 0.66, and gender is not significant, as in the original regression analysis.
6.4 Dealing with the MNAR blood pressure sample
  • 12. As mentioned above, without further information about the non-respondents we can only test whether the missing data are MCAR. In many cases, the MAR assumption may not hold as researchers claim, and MNAR is actually what researchers are dealing with.
6.4.1 Listwise deletion
The results for listwise deletion are shown in Output 7. Both predictors are significant at the 0.05 level; age has coefficient 0.60; males' average systolic blood pressure is 11.35 higher than females', and males have a more slowly growing systolic blood pressure than females.
6.4.2 MI
The procedure is almost the same as in 6.2.2; we only provide the final results, in Output 8. With MI, age is a useful predictor with coefficient 0.56, gender is not significant at the 0.05 level (coefficient 8.37, p = 0.09), and the interaction is significant (p = 0.03).
7. Discussion
Many missing data methods assume MCAR or MAR, but our data often are MNAR. Thus, by comparing how listwise deletion and MI perform under the three types of missing data, we can analyze which one may work better. Table 3 summarizes the comparison by listing the coefficient and p-value of each predictor for the two methods under the three missing data types.
Table 3:
Original:
Predictor:    age    genderm   age:genderm
coefficient   0.66   11.69     -0.25
  • 13. p-value       0.00   0.07      0.07
MAR (coefficient/p-value):
Method              age        genderm     age:genderm
Listwise deletion   0.65/0.00  16.67/0.03  -0.35/0.03
MI                  0.57/0.00  11.41/0.08  -0.27/0.05
MCAR (coefficient/p-value):
Method              age        genderm     age:genderm
Listwise deletion   0.66/0.00  8.48/0.27   -0.18/0.29
MI                  0.66/0.00  8.84/0.26   -0.18/0.32
MNAR (coefficient/p-value):
Method              age        genderm     age:genderm
Listwise deletion   0.60/0.00  11.35/0.04  -0.27/0.02
MI                  0.56/0.00  8.37/0.08   -0.23/0.03
From Table 3, we can see that in the original data only age is a significant predictor, with coefficient 0.66. When only the MAR assumption holds, if we use listwise deletion, both gender and the interaction become significant; but if we use MI, these two terms remain non-significant. At least for this case, we should prefer MI. When the stronger MCAR assumption holds, we can see that both
  • 14. methods work well and generate efficient results: both coefficients of age are 0.66, and gender and the interaction are non-significant. Thus, if the missing data type is MCAR, we may use either listwise deletion or MI. When we face the more general condition MNAR, we can see that with listwise deletion all of the predictors become significant. But if we use MI, then based on the hierarchy rule in regression models, although the interaction is significant at the 0.05 level we will not include it in the model. As for the coefficient of age, the difference from the original data is bigger when using MI than when using listwise deletion, but since MI shows that age is the only significant term in the model, just as the original data shows, we had better choose MI.
Since the dataset contains only 41 individuals on the islands, the analysis results have limits: each missing value plays a relatively big role. Especially for MNAR, there are 8 missing values, which would influence the results no matter which method we use.
8. Conclusion
Listwise deletion is simple, is commonly used by researchers, and may perform reasonably well in some situations (Schafer & Graham, 2002, p. 150). But multiple imputation produces estimates that have more desirable statistical properties: "They are consistent (and, hence, approximately unbiased in large samples), asymptotically efficient (almost), and asymptotically normal" if we follow the right procedure (Allison, 2012, p. 2). When researchers are not sure which type of missing values they are dealing with, they should consider not using the default method of many analysis programs such as R, and instead choose multiple imputation, as it can handle MAR, MCAR, and MNAR in a more proper way.
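As a numerical footnote to this conclusion (not part of the paper's analysis), a small simulation with invented data shows why complete-case analysis suffers when the highest readings go unrecorded (an MNAR-style mechanism), while even a simple regression imputation that exploits an observed covariate recovers much of the lost information. All names and numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented population: systolic bp rises with age.
n = 5000
age = rng.uniform(20, 80, n)
bp = 100 + 0.6 * age + rng.normal(0, 5, n)

# MNAR-style mechanism: the highest readings are the ones that go unrecorded.
miss = bp > 140
full_mean = bp.mean()
listwise_mean = bp[~miss].mean()   # complete-case mean, biased low

# Single regression imputation using the observed age/bp relation.
A = np.column_stack([np.ones(n), age])
b, *_ = np.linalg.lstsq(A[~miss], bp[~miss], rcond=None)
s = (bp[~miss] - A[~miss] @ b).std()
bp_filled = bp.copy()
bp_filled[miss] = A[miss] @ b + s * rng.standard_normal(miss.sum())
imputed_mean = bp_filled.mean()    # still imperfect, but closer to the truth

print(round(float(full_mean), 1), round(float(listwise_mean), 1),
      round(float(imputed_mean), 1))
```

Because the dropped cases are disproportionately older, the imputation model pushes their filled-in values back up toward plausible levels, so the imputed mean lands nearer the full-data mean than the complete-case mean does, mirroring the paper's preference for imputation when the missingness type is uncertain.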
  • 15. References
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. Journal of the American Statistical Association, 88(423), 984-993.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis. Psychometrika, 41(3), 409-415.
Allison, P. D. (2012, April). Handling missing data by maximum likelihood. In SAS Global Forum (Vol. 312).
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
Tsikriktsis, N. (2005). A review of techniques for treating missing data in OM survey research. Journal of Operations Management, 24(1), 53-62.
Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67(4), 1012-1028.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330-351.
Eekhout, I., de Boer, R. M., Twisk, J. W., de Vet, H. C., & Heymans, M. W. (2012). Missing data:
  • 16. a systematic review of how they are reported and handled. Epidemiology, 23(5), 729-732.
  • 17. Appendix:
Output 1: regression results from the original data.
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.00143    3.74111  27.800  < 2e-16 ***
## age           0.66267    0.07677   8.632 2.15e-10 ***
## genderm      11.69622    6.30700   1.854   0.0717 .
## age:genderm  -0.25421    0.13590  -1.871   0.0693 .
Output 2:
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.68335    4.63542  22.583  < 2e-16 ***
## age           0.65129    0.08711   7.477 2.01e-08 ***
## genderm      16.67260    7.37165   2.262   0.0309 *
## age:genderm  -0.35043    0.14954  -2.343   0.0257 *
Output 3:
## analyses :
## [[1]]
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
## Coefficients:
## (Intercept)          age      gender2  age:gender2
##    108.7031       0.5873      14.4817      -0.3209
##
## [[2]]
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
## Coefficients:
## (Intercept)          age      gender2  age:gender2
##    111.6146       0.5406       9.2609      -0.2309
##
## [[3]]
## Call:
  • 18. ## lm(formula = MAR.systolic.bp ~ age * gender)
## Coefficients:
## (Intercept)          age      gender2  age:gender2
##    108.4245       0.5904      12.4510      -0.2806
##
## [[4]]
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
## Coefficients:
## (Intercept)          age      gender2  age:gender2
##    110.0411       0.5656      10.5556      -0.2504
##
## [[5]]
## Call:
## lm(formula = MAR.systolic.bp ~ age * gender)
## Coefficients:
## (Intercept)          age      gender2  age:gender2
##    110.5416       0.5574      10.3339      -0.2476
Output 4:
##                      est         se         t       df     Pr(>|t|)
## (Intercept) 109.8649804 3.79063778 28.983244 25.90669 0.000000e+00
## age           0.5682545 0.07543424  7.533111 29.92531 2.162265e-08
## gender2      11.4166290 6.32447102  1.805152 27.25506 8.211053e-02
## age:gender2  -0.2660750 0.13308720 -1.999253 30.37152 5.459700e-02
Output 5:
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.00143    3.77911  27.520  < 2e-16 ***
## age           0.66267    0.07755   8.546 1.19e-09 ***
## genderm       8.48478    7.47132   1.136    0.265
## age:genderm  -0.18035    0.16645  -1.084    0.287
Output 6:
##                      est         se         t       df     Pr(>|t|)
## (Intercept) 104.0014342 3.68164783 28.248610 35.14648 0.000000e+00
## age           0.6626701 0.07554582  8.771765 35.14648 2.243947e-10
  • 19. ## gender2       8.8357373 7.55535772  1.169466 14.58076 2.609787e-01
## age:gender2  -0.1769565 0.17028626 -1.039171 12.07404 3.190884e-01
Output 7:
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.32154    3.56119  30.698  < 2e-16 ***
## age           0.60077    0.06563   9.153 4.72e-10 ***
## genderm      11.35156    5.37859   2.111   0.0436 *
## age:genderm  -0.27137    0.10904  -2.489   0.0188 *
Output 8:
##                      est         se         t       df     Pr(>|t|)
## (Intercept) 112.0912386 2.98080526 37.604348 20.57494 0.000000e+00
## age           0.5564151 0.05650112  9.847859 30.31590 5.846257e-11
## gender2       8.3651567 4.69331811  1.782355 28.87603 8.521021e-02
## age:gender2  -0.2261093 0.09822131 -2.302039 32.57357 2.786864e-02