2014 IIAG Imputation Assessments

Imputation assessments for the IIAG
1 Introduction
This document contains a write-up of some simulations performed by the Mo
Ibrahim Foundation to assess imputation methods. We begin below by describing the
imputation methods and the missingness mechanisms considered, as well as what
was measured. We then assess the accuracy and precision of the various methods,
and some characteristics of the predicted distributions. An assessment of the
amount of remaining missingness after imputation follows.
Finally, we draw some conclusions from the experiments, viz., that in terms of
accuracy and precision, linear interpolation is the most appropriate method for our
data; and that for data where whole country timeseries are missing, the most
accurate method is the so-called all variable multilevel method.
1.1 Methods of imputation under consideration
Below is the list of methods of imputation under consideration.
1. Mean substitution. Here missing interior datapoints are replaced by the mean
of the closest available datapoints on either side in the timeseries; and last
value carried forward (LVCF)/first value carried backwards (FVCB) is used for
the exterior missing data. The special rule for the Antiretroviral Treatment
Provision (ATP) and ART Provision for Pregnant Women (ARTPPW) variables (no
imputation between the year 2000 and the next available datapoint) is used.
2. Mean substitution, no ad hoc. As above, but with no ad hoc rules for ATP and
ARTPPW.
3. Linear interpolation. Interior missing data are replaced by linear interpolation,
while exterior missing data are replaced by LVCF/FVCB (that is, 0th order
extrapolation).
4. Linear interpolation, higher order extrapolation. Interior missing data are replaced
by linear interpolation, and exterior missing data by higher order extrapolation.
Higher order interpolation was also used but was found to be very inaccurate
(the analysis has been omitted).
5. All variable multilevel. Missing data replaced using a multilevel model trained
on all the data present.
6. Variable multilevel. Missing data for a variable replaced by a multilevel model
trained on all data available for that variable.
7. Regression imputation. Missing data replaced by multiple imputation by
chained equations, using linear regression on the variables.
The first four methods above will be referred to as interpolation-type methods.
Methods five and six will be referred to as multilevel methods, and the last three
(methods five to seven) will be referred to as regression-type methods.
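To make the difference between mean substitution (without the ad hoc rule) and linear interpolation concrete, here is a minimal sketch for a single country-variable timeseries. It is an illustration under assumptions, not the code used for the simulations: the function names are hypothetical, missing values are assumed to be coded as NaN, and the ATP/ARTPPW rule is left out.

```python
import numpy as np

def linear_interpolation(series):
    """Method 3: linear interpolation inside, LVCF/FVCB at the ends."""
    y = np.asarray(series, dtype=float)
    known = ~np.isnan(y)
    if not known.any():
        return y.copy()  # whole timeseries missing: nothing to impute from
    idx = np.arange(len(y))
    # np.interp holds the first/last known value constant beyond the ends
    return np.interp(idx, idx[known], y[known])

def mean_substitution(series):
    """Method 2: interior gaps get the mean of the nearest known neighbours."""
    y = np.asarray(series, dtype=float)
    known = np.flatnonzero(~np.isnan(y))
    out = y.copy()
    if known.size == 0:
        return out
    for i in np.flatnonzero(np.isnan(y)):
        left = known[known < i]
        right = known[known > i]
        if left.size and right.size:
            out[i] = (y[left[-1]] + y[right[0]]) / 2.0
        elif left.size:
            out[i] = y[left[-1]]   # last value carried forward
        else:
            out[i] = y[right[0]]   # first value carried backwards
    return out

example = [np.nan, 10.0, np.nan, np.nan, 16.0, np.nan]
print(linear_interpolation(example))
print(mean_substitution(example))
```

For this example series, linear interpolation fills the interior gap with 12 and 14, while mean substitution fills both interior points with 13; the exterior points are 10 and 16 under both methods.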
1.2 Missingness mechanisms
There are various ways in which missingness can be generated. We have simulated
two in these experiments.
1. Missing completely at random. Data are deleted from the IIAG dataset at random.
2. Data are deleted from the IIAG dataset by the following procedure: select a
variable at random; delete its data for a random number of years; delete its data
for a random number of countries.
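The two mechanisms can be sketched as follows. This is an illustration under assumptions rather than the simulation code: the function names are hypothetical, the dataset is assumed to be an array of shape (countries, years, variables), and the stand-in data are random numbers, not IIAG scores.

```python
import numpy as np

def delete_completely_at_random(data, proportion, rng):
    """Mechanism 1: delete each entry independently with the given probability."""
    out = np.array(data, dtype=float)  # copy, so the input is left untouched
    out[rng.random(out.shape) < proportion] = np.nan
    return out

def delete_variable_years_and_countries(data, rng):
    """Mechanism 2, applied to one randomly chosen variable.

    Assumes data has shape (countries, years, variables).
    """
    out = np.array(data, dtype=float)
    n_countries, n_years, n_variables = out.shape
    variable = rng.integers(n_variables)
    # A random number (possibly zero) of years and of countries to delete
    years = rng.permutation(n_years)[: rng.integers(n_years + 1)]
    countries = rng.permutation(n_countries)[: rng.integers(n_countries + 1)]
    out[:, years, variable] = np.nan      # those years, for all countries
    out[countries, :, variable] = np.nan  # those countries, for all years
    return out

rng = np.random.default_rng(0)
complete = rng.uniform(0, 100, size=(5, 10, 3))  # stand-in data, not IIAG values
mcar = delete_completely_at_random(complete, 0.2, rng)
structured = delete_variable_years_and_countries(complete, rng)
print(np.isnan(mcar).mean(), np.isnan(structured).mean())
```

Repeating the second function for several variables gives missingness concentrated in whole variable-years and whole country timeseries, which is the pattern that distinguishes it from the first mechanism.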
1.3 What was measured?
For each random selection of parameters which determine the amount and type of
missingness, the following quantities are computed:
1. The amount of missingness before imputation and the amount of missingness
remaining after imputation.
2. The quantiles and mean of the distance between the actual values of the
indicators and the imputed values.
3. The quantiles and mean of the distance between the actual and the imputed
broad governance score.
4. Quantiles, mean, standard deviation, skewness and kurtosis of the difference
between the actual indicator value and the imputed value.
5. The quantiles, mean and skewness of the difference of actual and imputed
broad governance score.
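As an illustration of how such summaries could be computed for one simulation run, consider the sketch below. Several choices are assumptions and not taken from the experiments: "distance" is taken to be the absolute difference, kurtosis is the fourth standardised moment (not excess kurtosis), the quantile levels are arbitrary, and the function name is hypothetical.

```python
import numpy as np

def summarise_imputation(actual, imputed, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarise imputed values against the actual values that were deleted."""
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    difference = actual - imputed
    distance = np.abs(difference)  # assumption: distance = absolute difference
    centred = difference - difference.mean()
    m2 = np.mean(centred ** 2)  # central moments of the differences
    m3 = np.mean(centred ** 3)
    m4 = np.mean(centred ** 4)
    return {
        "distance_mean": distance.mean(),
        "distance_quantiles": np.quantile(distance, probs),
        "difference_mean": difference.mean(),
        "difference_sd": difference.std(ddof=1),
        "difference_skewness": m3 / m2 ** 1.5,
        "difference_kurtosis": m4 / m2 ** 2,  # not excess kurtosis
    }

summary = summarise_imputation([1, 2, 3, 4], [1, 1, 5, 4])
print(summary["distance_mean"], summary["difference_mean"])
```

The distance summaries speak to accuracy, the standard deviation of the differences to precision, and the skewness and kurtosis to the shape of the error distribution. The moment ratios are undefined when all differences are equal.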
2 Accuracy and precision of predictions
To assess the merits of the imputation techniques, we measure accuracy and
precision. Accuracy means being on target, while precision is doing so consistently.
Being accurate but not precise means an even spread centered on the target, while
being both accurate and precise means a smaller spread centered on the target. Good
estimates must be both accurate and precise.
Mean substitution with and without the ad hoc rule for the ATP and ARTPPW
variables has a mean distance between actual and imputed values of 1.24 and 1.26
respectively. The ad hoc procedure can be expected to be more accurate, since it
tries to impute fewer values. The linear interpolation method of imputation is
similarly accurate. Regression imputation and the all variable multilevel method are
the least accurate, as can be expected given that the correlations they use are
lower, see Figure 1.
Figure 1: Mean variable distance
As far as the distance between the actual and imputed broad governance score is
concerned, the all variable multilevel method fares best, with the variable
multilevel method, mean substitution and linear interpolation in the second best
group of methods, see Figure 2.
Figure 2: Broad governance score distance
The standard deviations of the difference between actual and imputed variable
scores show that the variable multilevel method and the interpolation-type methods
are a great deal more precise than the other methods, with the interpolation-type
methods performing the best. All variable multilevel imputation and regression
imputation perform poorly in this respect, see Figure 3.
Figure 3: Variable difference standard deviation
The distribution of differences for the interpolation-type methods has shorter tails
than the corresponding distribution for the other methods, slightly more so for
linear interpolation than for mean substitution. From the kurtosis of the
differences between imputed and actual variable values, it can be seen that on average
the variable multilevel method produced the most peaked distribution, see Table 1.
Method                         1st Qu.   Median      Mean   3rd Qu.
Variable multilevel              24.69    53.82   1501.00    168.90
Linear interpolation             27.83    64.83   1057.00    249.50
Mean substitution, no ad hoc     26.08    61.48   1048.00    241.80
Mean substitution                25.93    60.16   1048.00    239.30
Higher order interpolation       17.84    47.46    678.30    179.20
Regression imputation             2.76     7.29    364.00     24.25
All variable multilevel           0.50     2.47     36.31      8.45
Table 1: Variable difference kurtosis
As far as bias is concerned, the mean skewness of the difference between actual
and imputed values is generally positive for all the timeseries imputation methods,
with mean skewness larger and positive for the variable multilevel method. The all
variable multilevel method performs the best in this regard, and the regression
imputation method also performs well, see Figure 4.
Figure 4: Variable difference skewness
2.0.1 Higher order extrapolation
Since the linear interpolation methods perform well, it would seem plausible that
higher order interpolation methods could produce better results. This turns out not
to be the case, however (results omitted).
In addition, at mid-to-high levels of missingness, higher order extrapolation (with
linear interpolation) produces less accurate results, as can be seen by regressing the
mean variable distance on the amount of original missingness and the order of the
extrapolation.
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -4.1188      0.1328   -31.02    0.0000
ord            -0.0401      0.0204    -1.96    0.0498
origMiss        9.9823      0.2033    49.11    0.0000
ord:origMiss    0.1212      0.0316     3.84    0.0001
Table 2: Variable accuracy and order of extrapolation
The coefficient of order in Table 2 is small and negative, while the interaction
term is larger and positive, which means that for a fixed level of original missingness
higher than 0.33, an increase in order will produce a decrease in accuracy.
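The 0.33 threshold follows directly from the coefficients in Table 2: the change in mean variable distance per unit increase in order is the coefficient of ord plus the interaction coefficient times the original missingness, and this changes sign where the two terms cancel. A short check, with variable and function names chosen here for illustration:

```python
# Coefficients copied from Table 2
coef_ord = -0.0401
coef_interaction = 0.1212

def order_effect(orig_miss):
    """Change in mean variable distance per unit increase in order."""
    return coef_ord + coef_interaction * orig_miss

# Level of original missingness at which the effect of order changes sign
threshold = -coef_ord / coef_interaction
print(round(threshold, 2))  # 0.33
```

Below the threshold the effect is negative (a slightly smaller mean distance for higher orders); above it the effect is positive, that is, higher order extrapolation is less accurate.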
The mean distance for indicators increases a great deal with the level of original
missingness, and it is also clear that the 0th order extrapolation (that is, the last
value carried forward, and the first value carried backwards) is the most accurate, see
Figure 5.
Figure 5: Variable distance quantiles and order of extrapolation
For missingness proportions smaller than 0.33, higher order extrapolation has a
small beneficial effect on the mean distance; for higher values of original
missingness, the median and the 10th and 90th percentiles of the imputation-actual
difference increase.
3 Remaining missingness
The amount of original missingness is of course predictive of the amount of
missingness remaining after imputation, but the imputation method and the type of
missingness also have a significant impact.
Introducing missingness completely at random, by selecting random entries in
the IIAG dataset to delete, results in a dataset in which the all variable multilevel
method can impute all the values, for almost all levels of missingness. Up to high
levels of original missingness, regression imputation also performs very well in this
regard, followed by linear extrapolation, constant extrapolation with linear
interpolation and the mean substitution method, which are almost indistinguishable,
followed by the variable multilevel method, see Table 3.
Table 3: Remaining missingness proportion
We also generate missingness as follows: make a random selection of indicators and,
for those variables, delete the data for all countries for a random selection of
years, and the data for all years for a random selection of countries. Again, the all
variable multilevel and the regression imputation methods both perform well,
leaving little unimputed. The proportion of remaining missingness is higher for this
type of missingness for all methods, except the variable multilevel method, which
performs a lot better, see Table 4.
Table 4: Remaining missingness
As is clear from Figure 6, the missingness mechanism is an important determinant
of how much data will remain unimputed. Below, the left panel shows data from the
missing completely at random mechanism, while the right panel shows data missing
according to the above mechanism.
Figure 6: Remaining missingness
3.1 Accuracy and precision, and country and time missingness
Holding the proportion of variables deleted constant, the deletion of whole country
timeseries is expected to have a larger effect on the remaining missingness for
interpolation-type imputation methods, since they cannot impute values for a
timeseries that contains no data, and this turns out to be the case, see Figure 7,
where the rows of panels have constant variable missingness levels in the intervals
(0, 0.33], (0.33, 0.67] and (0.67, 1]. Also, the variance in the amount of remaining
missingness increases with the proportion of countries missing.
While holding the proportion of deleted country timeseries fixed, the deletion of
variable-years affects the variable multilevel and the regression imputation more
than the interpolation-type methods, see Figure 8, where the rows of panels have
constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and
(0.67, 1]. Similarly, the variance in remaining missingness increases with the
proportion of deleted variable-years. In addition, the more country-missingness
there is, the less of an effect additional year-missingness has.
It is therefore clear that, for the interpolation-type imputation methods, the
remaining missingness is more affected by deleting country timeseries than by
deleting all countries for one year, while the variable multilevel method is
more affected by the deletion of variable-years. The all variable multilevel method is
not affected, while the regression imputation is somewhat affected.
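The asymmetry for interpolation-type methods can be reproduced on a toy example. The sketch below assumes a single variable stored as a countries-by-years array with NaN for missing values; the scores are made up and the function names are hypothetical.

```python
import numpy as np

def interpolate_rows(panel):
    """Linear interpolation with LVCF/FVCB along each row (country)."""
    out = np.array(panel, dtype=float)
    years = np.arange(out.shape[1])
    for row in out:  # each row is a view, so assignment updates out
        known = ~np.isnan(row)
        if known.any():  # a row with no data at all cannot be interpolated
            row[:] = np.interp(years, years[known], row[known])
    return out

def remaining_missingness(panel):
    """Proportion of entries still missing."""
    return np.isnan(panel).mean()

# Made-up scores for one variable: 3 countries (rows) by 4 years (columns)
panel = np.array([[10.0, 12.0, 14.0, 16.0],
                  [50.0, 50.0, 52.0, 54.0],
                  [30.0, 28.0, 26.0, 24.0]])

country_deleted = panel.copy()
country_deleted[1, :] = np.nan   # one whole country timeseries missing
year_deleted = panel.copy()
year_deleted[:, 2] = np.nan      # one year missing for all countries

print(remaining_missingness(interpolate_rows(country_deleted)))
print(remaining_missingness(interpolate_rows(year_deleted)))
```

The deleted country timeseries stays entirely missing after interpolation (a third of the entries here), whereas the deleted year is fully imputed from the neighbouring years.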
Because the interpolation-type methods leave more values unimputed, their accuracy
is not affected as much as that of the regression-type imputation methods, which
will attempt to impute values even though present data are sparse. Indeed, for fixed
levels of missing variables, increasing the proportion of missing country timeseries
decreases accuracy for the variable multilevel, all variable multilevel and
regression imputation methods, as can be seen in Figure 9 below, where the rows of
panels have constant variable missingness levels in the intervals (0, 0.33],
(0.33, 0.67] and (0.67, 1]. The most affected is the regression imputation,
followed by the all variable and variable multilevel methods.
Similarly, removing years for all countries decreases the accuracy for the
interpolation-type methods, up to a point after which imputation is no longer
possible. It should also be noted that the all variable multilevel and regression
imputation methods are affected similarly, while the accuracy of the variable
multilevel method does not follow this pattern, see Figure 10, where the rows of
panels have constant country missingness levels in the intervals (0, 0.33],
(0.33, 0.67] and (0.67, 1].
The amount of country coverage affects the accuracy of the prediction of the
broad governance score for all methods of imputation, with the variable multilevel
method the least affected.
4 Conclusion
1. The linear interpolation method is slightly more accurate and precise than
mean substitution, and is therefore the most suitable imputation approach
for our data. It also has the conceptual advantage of being consistent with the
idea that natural phenomena are continuous.
2. For imputing variables where a whole timeseries is empty, we are reduced to a
choice between all variable multilevel imputation and regression imputation.
The all variable multilevel imputation is much better in terms of accuracy of
prediction of the broad governance score and also in terms of leaving very little
missingness behind.
3. Country timeseries missingness is detrimental to the accuracy of
interpolation-type methods; the removal of additional years has a linear effect
on accuracy. Variable-year missingness is less harmful.