Imputation assessments for the IIAG 
1 Introduction 
This document contains a write-up of some simulations performed by the Mo 
Ibrahim Foundation to assess imputation methods. We begin below by describing the 
imputation methods and the missingness mechanisms considered, as well as what 
was measured. We then assess the accuracy and precision of the various methods, 
and some characteristics of the predicted distributions. An assessment of the 
amount of remaining missingness after imputation follows. 
Finally, we draw some conclusions from the experiments, viz., that in terms of 
accuracy and precision, linear interpolation is the most appropriate method for our 
data; and that for data where whole country timeseries are missing, the most 
accurate method is the so-called all variable multilevel method. 
1.1 Methods of imputation under consideration 
Below is the list of methods of imputation under consideration. 
1. Mean substitution. Here missing interior datapoints are replaced by the mean 
of the closest available datapoints on either side in the timeseries; and last 
value carried forward (LVCF)/first value carried backwards (FVCB) is used for 
the exterior missing data. The special rule for the Antiretroviral Treatment 
Provision (ATP) and ART Provision for Pregnant Women (ARTPPW) variables (no 
imputation between the year 2000 and the next available datapoint) is used. 
2. Mean substitution, no ad hoc. As above, but with no ad hoc rules for ATP and 
ARTPPW. 
3. Linear interpolation. Interior missing data are replaced by linear interpolation, 
while exterior missing data are replaced by LVCF/FVCB (that is, 0th order 
extrapolation). 
4. Linear interpolation, higher order extrapolation. Interior data replaced by linear 
interpolation, and exterior missing data replaced by higher order extrapolation. 
Higher order interpolation was also used but was found to be very inaccurate 
(the analysis has been omitted). 
5. All variable multilevel. Missing data replaced using a multilevel model trained 
on all the data present. 
6. Variable multilevel. Missing data for a variable replaced by a multilevel model 
trained on all data available for that variable. 
7. Regression imputation. Missing data replaced by multiple imputation by 
chained equations, using linear regression on the variables. 
The first four above will be referred to as interpolation-type methods. Numbers 
five and six will be referred to as multilevel methods, and the last three (numbers 
five to seven) will be referred to as regression-type methods. 
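The difference between methods 1 and 3 can be illustrated with a minimal sketch. It assumes a single country-variable timeseries stored as a list with None for missing years, and evenly spaced years; the function names are hypothetical and the ad hoc ATP/ARTPPW rule is not modelled.

```python
from typing import List, Optional

Series = List[Optional[float]]

def fill_exterior(series: Series) -> Series:
    """LVCF/FVCB: carry the first and last observed values outwards."""
    observed = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    if not observed:
        return out  # a wholly empty series cannot be filled this way
    first, last = observed[0], observed[-1]
    for i in range(first):
        out[i] = series[first]
    for i in range(last + 1, len(series)):
        out[i] = series[last]
    return out

def impute(series: Series, linear: bool = True) -> Series:
    """Fill interior gaps by linear interpolation (linear=True) or by the
    mean of the closest observed values on either side (linear=False)."""
    out = fill_exterior(series)
    observed = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(observed, observed[1:]):
        for i in range(left + 1, right):
            if linear:
                weight = (i - left) / (right - left)
                out[i] = series[left] + weight * (series[right] - series[left])
            else:
                out[i] = (series[left] + series[right]) / 2
    return out

example = [None, 2.0, None, None, 8.0, None]
linear_fill = impute(example, linear=True)
mean_fill = impute(example, linear=False)
```

For the example series, the two methods agree on the exterior values and differ only inside the gap, where mean substitution assigns the same value to every missing year.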
1.2 Missingness mechanisms 
There are various ways in which missingness can be generated. We have simulated 
two in these experiments. 
1. Missing completely at random. Data are deleted from the IIAG dataset at random. 
2. Missing by variable, year and country. Data are deleted from the IIAG dataset 
by the following procedure: select a variable at random; delete the data for a 
random number of years; delete the data for a random number of countries. 
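The two mechanisms can be sketched as follows. This is an illustration only: it assumes the dataset is a dictionary keyed by (variable, country, year), and it reads the second procedure as deleting all countries for the selected years and all years for the selected countries of the chosen variable. The function names and the toy data are hypothetical.

```python
import random

def delete_mcar(data, proportion, rng):
    """Set a random proportion of all cells to None."""
    keys = sorted(data)
    drop = set(rng.sample(keys, round(proportion * len(keys))))
    return {k: (None if k in drop else v) for k, v in data.items()}

def delete_structured(data, rng):
    """Pick one variable, then blank a random set of years (for all
    countries) and a random set of countries (for all years)."""
    variables = sorted({k[0] for k in data})
    countries = sorted({k[1] for k in data})
    years = sorted({k[2] for k in data})
    variable = rng.choice(variables)
    drop_years = set(rng.sample(years, rng.randint(0, len(years))))
    drop_countries = set(rng.sample(countries, rng.randint(0, len(countries))))
    return {
        k: (None if k[0] == variable
            and (k[1] in drop_countries or k[2] in drop_years) else v)
        for k, v in data.items()
    }

# Toy data: 2 variables x 3 countries x 4 years, all observed.
toy = {(v, c, y): 1.0
       for v in ("var1", "var2")
       for c in ("A", "B", "C")
       for y in (2000, 2001, 2002, 2003)}
rng = random.Random(0)
mcar = delete_mcar(toy, 0.25, rng)
structured = delete_structured(toy, rng)
```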
1.3 What was measured? 
For each random selection of parameters which determine the amount and type of 
missingness, the following quantities are computed: 
1. The amount of missingness before imputation and the amount of missingness 
remaining after imputation. 
2. The quantiles and mean of the distance between the actual values of the 
indicators and the imputed values. 
3. The quantiles and mean of the distance between the actual and the imputed 
broad governance score. 
4. Quantiles, mean, standard deviation, skewness and kurtosis of the difference 
between the actual indicator value and the imputed value. 
5. The quantiles, mean and skewness of the difference of actual and imputed 
broad governance score. 
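The summary statistics listed above can be computed along the following lines. This is a sketch under stated assumptions: "distance" is taken to be the absolute difference, the difference is actual minus imputed, and kurtosis is the non-excess fourth standardised moment; the report does not state which conventions were used.

```python
from statistics import mean, pstdev

def summarize(actual, imputed):
    """Summarise imputation error for paired actual and imputed values."""
    diffs = [a - b for a, b in zip(actual, imputed)]
    centre = mean(diffs)
    spread = pstdev(diffs)
    if spread > 0:
        standardised = [(d - centre) / spread for d in diffs]
        skewness = mean([z ** 3 for z in standardised])
        kurtosis = mean([z ** 4 for z in standardised])
    else:
        skewness = kurtosis = 0.0
    return {
        "mean_distance": mean([abs(d) for d in diffs]),  # accuracy
        "sd_difference": spread,                         # precision
        "skewness": skewness,                            # asymmetry (bias)
        "kurtosis": kurtosis,                            # tail weight
    }

stats = summarize([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0, 4.0])
```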
2 Accuracy and precision of predictions 
To assess the merits of the imputation techniques, we measure accuracy and 
precision. Accuracy means being on target, while precision is doing so consistently. 
Being accurate but not precise means an even spread centered on the target, while 
accuracy and precision means a smaller spread centered on the target. Good 
estimates must be both accurate and precise. 
Mean substitution with and without the ad hoc rule for the ATP and ARTPPW 
variables has a mean distance between actual and imputed value of 1.24 and 1.26 
respectively. The ad hoc procedure can be expected to be more accurate, since it 
tries to impute fewer values. The linear interpolation method of imputation is 
similarly accurate. Regression imputation and the all variable multilevel method are 
the least accurate, as can be expected given that the correlations that they use are 
lower, see Figure 1. 
Figure 1: Mean variable distance 
As far as distance from the broad governance score is concerned, the all variable 
multilevel method fares well, with the variable multilevel method and the mean 
substitution and linear interpolation method in the second best group of methods, 
see Figure 2. 
Figure 2: Broad governance score distance 
The standard deviations of the difference between actual and imputed variable 
scores show that the precision of the variable multilevel method and the 
interpolation-type methods is a great deal higher, with the interpolation-type 
methods performing the best. All variable multilevel imputation and regression 
imputation are quite bad in this respect, see Figure 3. 
Figure 3: Variable difference standard deviation 
The distribution of differences for the interpolation-type methods has shorter 
tails than the corresponding distribution for the other methods, and linear 
interpolation has slightly shorter tails than mean substitution. From the kurtosis 
of the differences between imputed and actual variable values, it can be seen that 
on average the variable multilevel method produced the most peaked distribution, 
see Table 1. 
Method                          1st Qu.   Median      Mean   3rd Qu.
Variable multilevel               24.69    53.82   1501.00    168.90
Linear interpolation              27.83    64.83   1057.00    249.50
Mean substitution, no ad hoc      26.08    61.48   1048.00    241.80
Mean substitution                 25.93    60.16   1048.00    239.30
Higher order interpolation        17.84    47.46    678.30    179.20
Regression imputation              2.76     7.29    364.00     24.25
All variable multilevel            0.50     2.47     36.31      8.45
Table 1: Variable difference kurtosis 
As far as bias is concerned, the mean skewness of the difference between actual 
and imputed is generally positive for all the timeseries imputation methods, with 
mean skewness larger and positive for the variable multilevel method. The all 
variable multilevel method performs the best in this regard, and the regression 
imputation method also performs well, see Figure 4. 
Figure 4: Variable difference skewness 
2.0.1 Higher order extrapolation 
Since the linear interpolation methods perform well, it would seem plausible that 
higher order interpolation methods could produce better results. This turns out not 
to be the case, however (results omitted). 
In addition, at mid-to-high levels of missingness, higher order extrapolation (with 
linear interpolation) produces less accurate results, as can be seen by regressing the 
mean variable distance on the amount of original missingness and the order of the 
extrapolation. 
                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      -4.1188       0.1328    -31.02     0.0000
ord              -0.0401       0.0204     -1.96     0.0498
origMiss          9.9823       0.2033     49.11     0.0000
ord:origMiss      0.1212       0.0316      3.84     0.0001
Table 2: Variable accuracy and order of extrapolation 
The coefficient of order in Table 2 is small and negative, while the interaction 
term is larger and positive, which means that for a fixed level of original missingness 
higher than 0.33, an increase in order will produce a decrease in accuracy. 
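The 0.33 threshold follows directly from the coefficients in Table 2: the effect of one additional order of extrapolation on the fitted response is the ord coefficient plus the interaction coefficient times the original missingness, and this changes sign where the two terms cancel. A short check, with the coefficients copied from Table 2 (the variable and function names are illustrative):

```python
# Coefficients copied from Table 2.
B_ORD = -0.0401          # main effect of extrapolation order
B_INTERACTION = 0.1212   # ord:origMiss interaction

def order_slope(orig_miss: float) -> float:
    """Change in the fitted response per unit increase in order,
    at a fixed proportion of original missingness."""
    return B_ORD + B_INTERACTION * orig_miss

# The slope is zero where B_ORD + B_INTERACTION * origMiss = 0.
threshold = -B_ORD / B_INTERACTION
print(f"slope changes sign at origMiss = {threshold:.3f}")
```

Below the threshold the slope is negative (higher order slightly lowers the fitted distance); above it the slope is positive (higher order raises it).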
The mean distance for indicators increases a great deal with the level of original 
missingness, and it is also clear that the 0th order extrapolation (that is, the last 
value carried forward, and the first value carried backwards) is the most accurate, see 
Figure 5. 
Figure 5: Variable distance quantiles and order of extrapolation 
For missingness proportions smaller than 0.33, higher order extrapolation has a 
small beneficial effect on the mean distance; for higher values of original 
missingness, the median and the 10th and 90th quantiles of the imputation-actual 
difference increase. 
3 Remaining missingness 
The amount of original missingness is of course predictive of the amount of 
missingness remaining after imputation, but the imputation method and the type of 
missingness also have a significant impact. 
Introducing missingness completely at random, by deleting randomly selected 
entries in the IIAG dataset, results in a dataset in which the all variable multilevel 
method can impute all the values, for almost all levels of missingness. Up to high 
levels of original missingness, regression imputation performs very well in this regard, 
followed by linear extrapolation, constant extrapolation with linear interpolation and 
the mean substitution method (which are almost indistinguishable), followed by 
the variable multilevel method, see Table 3. 
Method                          1st Qu.   Median   Mean   3rd Qu.
All variable multilevel            0.00     0.00   0.00      0.00
Regression imputation              0.00     0.01   0.10      0.04
Higher order interpolation         0.05     0.06   0.14      0.14
Linear interpolation               0.06     0.11   0.24      0.32
Mean substitution, no ad hoc       0.06     0.11   0.24      0.32
Mean substitution                  0.07     0.12   0.24      0.32
Variable multilevel                0.17     0.45   0.46      0.73
Table 3: Remaining missingness proportion 
We also generate missingness as follows: a random selection of indicators is made 
and, for those variables, data for all countries are deleted for a random selection of 
years, and data for all years are deleted for a random selection of countries. Again, 
the all variable multilevel and the regression imputation methods both perform well, 
leaving little unimputed. The proportion of remaining missingness is higher for this 
type of missingness for all methods except the variable multilevel method, which 
performs a lot better, see Table 4. 
Method                          1st Qu.   Median   Mean   3rd Qu.
All variable multilevel            0.00     0.00   0.00      0.00
Regression imputation              0.00     0.00   0.10      0.10
Variable multilevel                0.00     0.10   0.20      0.20
Mean substitution, no ad hoc       0.10     0.20   0.30      0.40
Mean substitution                  0.10     0.20   0.30      0.40
Linear interpolation               0.10     0.20   0.30      0.40
Higher order interpolation         0.10     0.20   0.30      0.40
Table 4: Remaining missingness 
As is clear from Figure 6, the missingness mechanism is an important determinant 
of how much data will remain unimputed. Below, the left panel shows data from the 
missing completely at random mechanism, while the right panel shows data missing 
according to the above mechanism. 
Figure 6: Remaining missingness 
3.1 Accuracy and precision, and country and time missingness 
Holding the proportion of variables deleted constant, the deletion of whole country 
timeseries is expected to have a larger effect on the remaining missingness for 
interpolation-type imputation methods, since these methods cannot impute a series 
that has no observed values. This turns out to be the case, see Figure 7, where the 
rows of panels have constant variable missingness levels in the intervals (0, 0.33], 
(0.33, 0.67] and (0.67, 1]. Also, the variance in the amount of remaining 
missingness increases with the proportion of countries missing. 
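A small sketch shows why whole-country deletion leaves missingness behind for interpolation-type methods while deleted years do not. It assumes one variable stored as a dictionary of country timeseries with None for missing values; the country labels, values and function names are hypothetical.

```python
def interpolate(series):
    """Linear interpolation inside the observed range, LVCF/FVCB outside.
    A series with no observed values is returned unchanged."""
    observed = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    if not observed:
        return out
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            out[i] = series[right]          # first value carried backwards
        elif right is None:
            out[i] = series[left]           # last value carried forward
        else:
            weight = (i - left) / (right - left)
            out[i] = series[left] + weight * (series[right] - series[left])
    return out

def remaining_missingness(panel):
    """Proportion of cells still missing after interpolation."""
    cells = [v for series in panel.values() for v in interpolate(series)]
    return sum(v is None for v in cells) / len(cells)

# One year deleted for every country: everything can be imputed.
years_deleted = {"A": [1.0, None, 3.0, 4.0], "B": [2.0, None, 2.0, 5.0]}
# One whole country deleted: that series stays empty.
country_deleted = {"A": [1.0, None, 3.0, None],
                   "B": [None, None, None, None],
                   "C": [2.0, 2.0, None, 4.0]}

after_years = remaining_missingness(years_deleted)
after_country = remaining_missingness(country_deleted)
```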
Figure 7: Remaining missingness 
While holding the proportion of deleted country timeseries fixed, the deletion of 
variable-years affects the variable multilevel and the regression imputation more 
than the interpolation-type methods, see Figure 8, where the rows of panels have 
constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and 
(0.67, 1]. Similarly, the variance in remaining missingness increases with the 
proportion of deleted variable-years. In addition, the more country-missingness 
there is, the less of an effect additional year-missingness has. 
Figure 8: Remaining missingness 
It is therefore clear that, for the interpolation-type imputation methods, the 
remaining missingness is more affected by deleting country timeseries than by 
deleting all countries for one year, while the variable multilevel method is 
more affected by the deletion of variable-years. The all variable multilevel method is 
not affected, while the regression imputation is somewhat affected. 
Because the interpolation-type imputations leave more values unimputed, their 
accuracy is affected less than that of the regression-type imputation methods, 
which attempt to impute values even when the available data are sparse. Indeed, 
for fixed levels of missing variables, increasing the proportion of missing country 
timeseries decreases accuracy for the variable multilevel, all variable multilevel 
and regression imputation, as can be seen in Figure 9 below, where the rows of 
panels have constant variable missingness levels in the intervals (0, 0.33], 
(0.33, 0.67] and (0.67, 1]. The most affected is the regression imputation, 
followed by the all variable and variable multilevel methods. 
Figure 9: Mean variable distance 
Similarly, removing years for all countries decreases the accuracy for the 
interpolation-type methods, up to a point after which imputation is no longer 
possible. It should also be noted that the all variable and regression imputations are 
affected similarly, while the accuracy of the variable multilevel method does not 
follow this pattern, see Figure 10, where the rows of panels have constant country 
missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. 
Figure 10: Mean variable distance 
The amount of country coverage affects the accuracy of the prediction of the 
broad governance score for all methods of imputation, with the variable multilevel 
the least affected. 
4 Conclusion 
1. The linear interpolation method is slightly more accurate and precise than 
mean substitution, and is therefore the most suitable imputation approach 
for our data. It also has the conceptual advantage of being consistent with the 
idea that natural phenomena are continuous. 
2. For imputing variables where a whole timeseries is empty, we are reduced to a 
choice between all variable multilevel imputation and regression imputation. 
The all variable multilevel imputation is much better in terms of accuracy of 
prediction of the broad governance score and also in terms of leaving very little 
missingness behind. 
3. Country timeseries missingness is detrimental to the accuracy of 
interpolation-type methods; the removal of additional years has a linear effect 
on accuracy. Variable-year missingness is less detrimental. 
VIP High Class Call Girls Amravati Anushka 8250192130 Independent Escort Serv...VIP High Class Call Girls Amravati Anushka 8250192130 Independent Escort Serv...
VIP High Class Call Girls Amravati Anushka 8250192130 Independent Escort Serv...Suhani Kapoor
 


2014 IIAG Imputation Assessments

Imputation assessments for the IIAG

1 Introduction

This document contains a write-up of some simulations performed by the Mo Ibrahim Foundation to assess imputation methods. We begin below by describing the imputation methods and the missingness mechanisms considered, as well as what was measured. We then assess the accuracy and precision of the various methods, and some characteristics of the predicted distributions. An assessment of the amount of remaining missingness after imputation follows.

Finally, we draw some conclusions from the experiments, viz., that in terms of accuracy and precision, linear interpolation is the most appropriate method for our data; and that for data where whole country timeseries are missing, the most accurate method is the so-called all variable multilevel method.

1.1 Methods of imputation under consideration

Below is the list of methods of imputation under consideration.

1. Mean substitution. Here missing interior datapoints are replaced by the mean of the closest available datapoints on either side in the timeseries; and last value carried forward (LVCF)/first value carried backwards (FVCB) is used for the exterior missing data. The special rule for the Antiretroviral Treatment Provision (ATP) and ART Provision for Pregnant Women (ARTPPW) variables (no imputation between the year 2000 and the next available datapoint) is used.
2. Mean substitution, no ad hoc. As above, but with no ad hoc rules for ATP and ARTPPW.
3. Linear interpolation. Interior missing data are replaced by linear interpolation, while exterior missing data are replaced by LVCF/FVCB (that is, 0th order extrapolation).
4. Linear interpolation, higher order extrapolation. Interior data are replaced by linear interpolation, and exterior missing data are replaced by higher order extrapolation. Higher order interpolation was also used but was found to be very inaccurate (the analysis has been omitted).
5. All variable multilevel. Missing data are replaced using a multilevel model trained on all the data present.
6. Variable multilevel. Missing data for a variable are replaced by a multilevel model trained on all data available for that variable.
7. Regression imputation. Missing data are replaced by multiple imputation by chained equations, using linear regression on the variables.
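As an illustration of how methods 1 and 3 differ, the following minimal Python sketch fills a single country timeseries for one variable. It is not the Foundation's implementation: the function name `impute_series`, the `method` argument and the use of `None` for a missing year are assumptions made for this example, and the ad hoc ATP/ARTPPW rule is not modelled.

```python
from typing import List, Optional

Series = List[Optional[float]]  # one value per year; None marks a missing year


def impute_series(values: Series, method: str = "linear") -> Series:
    """Fill the gaps in one timeseries.

    method="linear": interior gaps are filled by linear interpolation.
    method="mean":   interior gaps are filled by the mean of the nearest
                     observed values on either side.
    In both cases exterior gaps use FVCB (leading) and LVCF (trailing).
    A series with no observed value at all is returned unchanged.
    """
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        return list(values)

    filled = list(values)
    first, last = observed[0], observed[-1]

    # Exterior gaps: first value carried backwards, last value carried forward.
    for i in range(first):
        filled[i] = values[first]
    for i in range(last + 1, len(values)):
        filled[i] = values[last]

    # Interior gaps: handle each pair of neighbouring observed points.
    for left, right in zip(observed, observed[1:]):
        for i in range(left + 1, right):
            if method == "mean":
                filled[i] = (values[left] + values[right]) / 2
            else:
                weight = (i - left) / (right - left)
                filled[i] = values[left] + weight * (values[right] - values[left])
    return filled
```

For a six-year series observed only in its second and fifth years, with values 10 and 40, both methods carry 10 backwards and 40 forwards. Mean substitution fills both interior gaps with 25, whereas linear interpolation fills them with values rising from 10 towards 40 (approximately 20 and 30).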
The first four above will be referred to as interpolation-type methods. Numbers five and six will be referred to as multilevel methods, and the last three (numbers five to seven) will be referred to as regression-type methods.

1.2 Missingness mechanisms

There are various ways in which missingness can be generated. We have simulated two in these experiments.

1. Missing completely at random. Data are deleted from the IIAG dataset at random.
2. Data are deleted from the IIAG dataset by the following procedure: select a variable at random; delete the data for a random number of years; delete the data for a random number of countries.

1.3 What was measured?

For each random selection of parameters which determine the amount and type of missingness, the following quantities are computed:

1. The amount of missingness before imputation and the amount of missingness remaining after imputation.
2. The quantiles and mean of the distance between the actual values of the indicators and the imputed values.
3. The quantiles and mean of the distance between the actual and the imputed broad governance score.
4. Quantiles, mean, standard deviation, skewness and kurtosis of the difference between the actual indicator value and the imputed value.
5. The quantiles, mean and skewness of the difference between the actual and the imputed broad governance score.

2 Accuracy and precision of predictions

To assess the merits of the imputation techniques, we measure accuracy and precision. Accuracy means being on target, while precision is doing so consistently. Being accurate but not precise means an even spread centered on the target, while being both accurate and precise means a smaller spread centered on the target. Good estimates must be both accurate and precise.

Mean substitution with and without the ad hoc rule for the ATP and ARTPPW variables has a mean distance between actual and imputed value of 1.24 and 1.26 respectively. The ad hoc procedure can be expected to be more accurate, since it tries to impute fewer values.
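The two deletion mechanisms of Section 1.2, and the missingness proportion measured before and after imputation, might be sketched in Python as below. This is one reading of the procedure rather than the code used for the experiments: the `(variable, country, year)` key layout, the function names and the way the random counts are drawn are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]             # (variable, country, year): hypothetical layout
Panel = Dict[Key, Optional[float]]     # None marks a deleted or missing cell


def delete_mcar(data: Panel, proportion: float, rng: random.Random) -> Panel:
    """Mechanism 1: blank out a proportion of cells chosen uniformly at random."""
    keys = sorted(data)
    dropped = set(rng.sample(keys, round(proportion * len(keys))))
    return {k: (None if k in dropped else v) for k, v in data.items()}


def delete_structured(data: Panel, rng: random.Random) -> Panel:
    """Mechanism 2: for one random variable, blank out whole years and whole countries."""
    variables = sorted({k[0] for k in data})
    countries = sorted({k[1] for k in data})
    years = sorted({k[2] for k in data})

    variable = rng.choice(variables)
    # Assumption: the number of years and countries removed is uniform on 0..all.
    drop_years = set(rng.sample(years, rng.randint(0, len(years))))
    drop_countries = set(rng.sample(countries, rng.randint(0, len(countries))))

    return {
        k: (None if k[0] == variable
            and (k[1] in drop_countries or k[2] in drop_years) else v)
        for k, v in data.items()
    }


def missing_proportion(data: Panel) -> float:
    """Share of cells that are missing."""
    return sum(v is None for v in data.values()) / len(data)
```

Passing an explicit `random.Random` instance keeps each simulated deletion reproducible, which matters when the same deleted dataset has to be handed to several imputation methods for comparison.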
The linear interpolation method of imputation is similarly accurate. Regression imputation and the all variable multilevel method are the least accurate, as can be expected given that the correlations that they use are lower, see Figure 1.

Figure 1: Mean variable distance

As far as distance from the broad governance score is concerned, the all variable multilevel method fares well, with the variable multilevel method and the mean substitution and linear interpolation methods in the second best group of methods, see Figure 2.
Figure 2: Broad governance score distance

The standard deviations of the difference between actual and imputed variable scores show that the precision of the variable multilevel method and the interpolation-type methods is a great deal higher, with the interpolation-type methods performing the best. All variable multilevel imputation and regression imputation perform poorly in this respect, see Figure 3.
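The two quantities compared so far can be computed as in the following Python sketch. It assumes that "distance" means the absolute difference between actual and imputed value, which the text does not state explicitly, and the function name `accuracy_and_precision` is hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def accuracy_and_precision(actual: Sequence[float],
                           imputed: Sequence[float]) -> Tuple[float, float]:
    """Return (mean distance, standard deviation of the differences).

    A small mean distance indicates accuracy; a small standard deviation
    of the signed differences indicates precision.
    """
    differences = [a - i for a, i in zip(actual, imputed)]
    mean_distance = mean(abs(d) for d in differences)  # assumed: absolute distance
    spread = pstdev(differences)                       # population standard deviation
    return mean_distance, spread
```

Keeping the two numbers separate reflects the distinction drawn above: a method can be on target on average and still scatter widely around it.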
Figure 3: Variable difference standard deviation

The interpolation-type distribution of differences has shorter tails than the corresponding distribution for the other methods. This is slightly more pronounced for linear interpolation than for mean substitution. From the measure of kurtosis of the differences of imputed and actual variable values, it can be seen that on average the variable multilevel method produced the most peaked distribution, see Table 1.

Method                          1st Qu.   Median      Mean   3rd Qu.
Variable multilevel               24.69    53.82   1501.00    168.90
Linear interpolation              27.83    64.83   1057.00    249.50
Mean substitution, no ad hoc      26.08    61.48   1048.00    241.80
Mean substitution                 25.93    60.16   1048.00    239.30
Higher order interpolation        17.84    47.46    678.30    179.20
Regression imputation              2.76     7.29    364.00     24.25
All variable multilevel            0.50     2.47     36.31      8.45

Table 1: Variable difference kurtosis

As far as bias is concerned, the mean skewness of the difference between actual and imputed values is generally positive for all the timeseries imputation methods, with mean skewness larger and positive for the variable multilevel method. The all variable multilevel method performs the best in this regard, and the regression imputation method also performs well, see Figure 4.
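For readers who want to reproduce the shape statistics, the sketch below computes skewness and kurtosis from central moments. The report does not say which estimator was used, so the moment-based, non-excess definitions here are an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def central_moment(xs: Sequence[float], k: int) -> float:
    """k-th central moment of a sample (divides by n, not n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs: Sequence[float]) -> float:
    """Third standardised moment: positive when the right tail is longer."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs: Sequence[float]) -> float:
    """Fourth standardised moment (not excess kurtosis; a normal gives 3)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2
```

Both functions are undefined for a constant sample, where the second central moment is zero and the division fails.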
Figure 4: Variable difference skewness

2.0.1 Higher order extrapolation

Since the linear interpolation methods perform well, it would seem plausible that higher order interpolation methods could produce better results. This turns out not to be the case, however (results omitted). In addition, at mid-to-high levels of missingness, higher order extrapolation (with linear interpolation) produces less accurate results, as can be seen by regressing the mean variable distance on the amount of original missingness and the order of the extrapolation.

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      -4.1188       0.1328    -31.02     0.0000
ord              -0.0401       0.0204     -1.96     0.0498
origMiss          9.9823       0.2033     49.11     0.0000
ord:origMiss      0.1212       0.0316      3.84     0.0001

Table 2: Variable accuracy and order of extrapolation

The coefficient of order in Table 2 is small and negative, while the interaction term is larger and positive. The net effect of order, -0.0401 + 0.1212 × origMiss, is positive once original missingness exceeds 0.0401/0.1212 ≈ 0.33, which means that for a fixed level of original missingness higher than 0.33, an increase in order will produce a decrease in accuracy.

The mean distance for indicators increases a great deal with the level of original missingness, and it is also clear that the 0th order extrapolation (that is, the last value carried forward, and the first value carried backwards) is the most accurate, see Figure 5.

Figure 5: Variable distance quantiles and order of extrapolation

For missingness proportions smaller than 0.33, higher order extrapolation has a small, positive effect on the mean; for higher values of original missingness, the median and the 10th and 90th quantiles of the imputation-actual difference increase.

3 Remaining missingness

The amount of original missingness is of course predictive of the amount of missingness remaining after imputation, but the imputation method and the type of missingness also have a significant impact.

Introducing missingness completely at random, by selecting random entries in the IIAG dataset to delete, results in a dataset in which the all variable multilevel method can impute all the values, for almost all levels of missingness. Up to high levels of original missingness, regression imputation performs very well in this regard, followed by linear extrapolation, constant extrapolation with linear interpolation and the mean substitution method, which are almost indistinguishable, followed by the variable multilevel method, see Table 3.

Method                          1st Qu.   Median   Mean   3rd Qu.
All variable multilevel            0.00     0.00   0.00      0.00
Regression imputation              0.00     0.01   0.10      0.04
Higher order interpolation         0.05     0.06   0.14      0.14
Linear interpolation               0.06     0.11   0.24      0.32
Mean substitution, no ad hoc       0.06     0.11   0.24      0.32
Mean substitution                  0.07     0.12   0.24      0.32
Variable multilevel                0.17     0.45   0.46      0.73

Table 3: Remaining missingness proportion

We also generate missingness as follows: we make a random selection of indicators and, for those variables, delete the data for all countries for a random selection of years and the data for all years for a random selection of countries. Again, the all variable multilevel and the regression imputation methods both perform well, leaving little unimputed. The proportion of remaining missingness is higher for this type of missingness for all methods, except the variable multilevel method, which performs a lot better, see Table 4.

Method                          1st Qu.   Median   Mean   3rd Qu.
All variable multilevel            0.00     0.00   0.00      0.00
Regression imputation              0.00     0.00   0.10      0.10
Variable multilevel                0.00     0.10   0.20      0.20
Mean substitution, no ad hoc       0.10     0.20   0.30      0.40
Mean substitution                  0.10     0.20   0.30      0.40
Linear interpolation               0.10     0.20   0.30      0.40
Higher order interpolation         0.10     0.20   0.30      0.40

Table 4: Remaining missingness

As is clear from Figure 6, the missingness mechanism is an important determinant of how much data will remain unimputed. Below, the left panel shows data from the missing completely at random mechanism, while the right panel shows data missing according to the above mechanism.
Figure 6: Remaining missingness

3.1 Accuracy and precision, and country and time missingness

Holding the proportion of variables deleted constant, the deletion of whole country timeseries is expected to have a larger effect on the remaining missingness for interpolation-type imputation methods, since it will not be possible to impute values, and this turns out to be the case, see Figure 7, where the rows of panels have constant variable missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. Also, the variance in the amount of remaining missingness increases with the proportion of countries missing.
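Why whole-country deletion leaves more behind for interpolation-type methods can be seen in a small Python sketch. The `interpolate` and `remaining_missingness` helpers below are hypothetical and stand in for method 3 (linear interpolation with LVCF/FVCB) applied to one variable; they are not the code used in the experiments.

```python
from typing import Dict, List, Optional

Series = List[Optional[float]]  # one value per year; None marks a missing year


def interpolate(values: Series) -> Series:
    """Linear interpolation inside the observed range, LVCF/FVCB outside it."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        return list(values)  # nothing to anchor on: the series stays missing

    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = values[right]          # first value carried backwards
        elif right is None:
            filled[i] = values[left]           # last value carried forward
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def remaining_missingness(panel: Dict[str, Series]) -> float:
    """Share of cells still missing after interpolating each country's series."""
    cells = [v for series in panel.values() for v in interpolate(series)]
    return sum(v is None for v in cells) / len(cells)
```

Consider two panels that each have six of eight cells missing. With `{"A": [1.0, None, 3.0, None], "B": [None, None, None, None]}` country B has no observation to anchor on, so half the cells remain missing; with `{"A": [1.0, None, None, None], "B": [None, None, 2.0, None]}` every cell can be filled and nothing remains missing.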
Figure 7: Remaining missingness

While holding the proportion of deleted country timeseries fixed, the deletion of variable-years affects the variable multilevel and the regression imputation more than the interpolation-type methods, see Figure 8, where the rows of panels have constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. Similarly, the variance in remaining missingness increases with the proportion of deleted variable-years. In addition, the more country-missingness there is, the less of an effect additional year-missingness has.
Figure 8: Remaining missingness

For the interpolation-type imputation methods, the remaining missingness is therefore clearly more affected by deleting country timeseries than by deleting all countries for one year, while the variable multilevel method is more affected by the deletion of variable-years. The all variable multilevel method is not affected, while the regression imputation is somewhat affected.

Because the interpolation-type methods leave more values unimputed, their accuracy is not affected as much as that of the regression-type imputation methods, which will attempt to impute values even though present data are sparse. Indeed, for fixed levels of missing variables, increasing the proportion of missing country timeseries decreases accuracy for the variable multilevel, all variable multilevel and regression imputation, as can be seen in Figure 9 below, where the rows of panels have constant variable missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. The most affected is the regression imputation, followed by the all variable and variable multilevel methods.
Figure 9: Mean variable distance

Similarly, removing years for all countries decreases the accuracy for the interpolation-type methods, up to a point after which imputation is no longer possible. It should also be noted that the all variable and regression imputations are affected similarly, while the accuracy of the variable multilevel method does not follow this pattern, see Figure 10, where the rows of panels have constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1].
Figure 10: Mean variable distance

The amount of country coverage affects the accuracy of the prediction of the broad governance score for all methods of imputation, with the variable multilevel method the least affected.

4 Conclusion

1. The linear interpolation method is slightly more accurate and precise than mean substitution, and is therefore the most suitable imputation approach for our data. It also has the conceptual advantage of being consistent with the idea that natural phenomena are continuous.
2. For imputing variables where a whole timeseries is empty, we are reduced to a choice between all variable multilevel imputation and regression imputation. The all variable multilevel imputation is much better in terms of accuracy of prediction of the broad governance score and also in terms of leaving very little missingness behind.
3. Country timeseries missingness is detrimental to the accuracy of interpolation-type methods; the removal of additional years has a linear effect on accuracy. Variable-year missingness is less detrimental.