Managing missing values in routinely reported data:
One approach from the DRC
Matt Worges
Data for Impact Webinar Series
December 2, 2020
• Framing the Webinar through the D4I lens
• DHIS2 data: advantages and issues
• Exploring a DHIS2 data set
• What to do with blanks?
• Interpolation
• Recreate the “Truth”
• Interpolation diagnostics
Overview
• The D4I team was tasked with conducting an impact evaluation of
the USAID Integrated Health Project (IHP) implemented in 9
provinces of the DRC
• IHP goal: Reduce maternal, newborn, and child deaths through delivery of
integrated health services
• IHP objectives: Increase access to and use of quality health services in
the targeted health zones
IHP Impact Evaluation
• D4I research question: What was the impact of IHP on the
utilization of health services (e.g., treatment for childhood illnesses)
over the course of the study period?
• Measuring impact: D4I is assessing impact through a
difference-in-differences (DID) model with propensity score matching (PSM)
• Data source: We are using DHIS2 data for this impact evaluation
IHP Impact Evaluation – Approach
• PSM is widely used to mitigate confounding in observational
studies
• Complications arise when the covariates used to estimate the propensity
scores are only partially observed
• Interpolation/imputation approaches provide a potential solution for
handling missing data in the estimation of the propensity scores
• Recommended to derive the propensity score after applying interpolation or
imputation
IHP Impact Evaluation – Propensity Score Matching
• Addition/removal of health facilities at different time points
• Long runs of missing values
• Zero counts are typically not entered – they are left blank
• Cannot distinguish between truly missing and zero
• Data entry errors manifesting as outliers/anomalous points
• Reporting has improved over time, making older time points less
complete
Some DHIS2 Issues
• Missing data can result in:
• Reduced statistical power
• Biased estimators
• Reduced representativeness of the sample
• Generally incorrect inference and conclusions
Why do we care about missingness?
Overview of Approaches for Missing Data – Susan Buchman
• Time Series Characteristics
• Restricted to Haut-Katanga Province, DRC
• Uncomplicated + severe malaria cases (all ages)
• 24-month period from October 2018 to September 2020
• Health facility count = 1,362
• The monthly-aggregated time series appears to include both a seasonal
component and a positive trend
Data Set
Unprocessed Data – Missingness Visualized
HF Oct-18 Nov-18 Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19 Jun-19 Jul-19 Aug-19 Sep-19 Oct-19 Nov-19 Dec-19 Jan-20
hk Panda Hôpital Général de Référence 514 637 637 910 563 1375 678 483 839 773 929 792 694 1355 1219
hk Serge Amie Centre de Santé 300 306 274 300 320 440 522 582
hk AENAF Centre de Santé de Référence 91 60 212 154 65 279 114 59 213 55 131 38 399 227 222
hk Asvie Centre Médical 439 556 475 379 370 335 279 280 256 381 627 639
hk Mupanda Centre de Santé 610 479 363 610 641 408 573 248 237 279 455 319 203
hk Boma Publique Centre de Santé 294 293 304 293 308 318 178 225 326 325 240
hk Kawama Centre de Santé 174 176 2 283 280 304 286 288 4 275 379 319 264 313
hk Kabambakuku Centre de Santé 317 396 372 434 368 298 255 314 303 251 287 283
hk Kaboka Centre de Santé 419 314 201 240 350 199 151 197 274 257
hk Kasomeno Centre de Santé de Référence 282 307 306 265
hk Kikula Centre de Santé de Référence 221 241 246 275 167 318 393
hk Belle Vue Centre de Santé 135 157 555 350 124 102 92
Unprocessed Data – Missingness Visualized
Missing (28.6%) Present (71.4%) ‘visdat’ package
Malaria Cases – Haut-Katanga Province
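Unprocessed Data – Histogram of Missingness
Figure (‘ggplot2’ package): histogram of the number of missing values per health facility
Annotated bars: no missing values (complete case analysis); one missing value; two missing values; completely blank records (remove from data set)
Bar counts shown: 284, 193, 137, 27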
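The per-facility tally behind a histogram like this can be sketched in base R. The wide-format matrix `cases` (facilities in rows, months in columns) and all of its values are invented for illustration; the slides used the ‘visdat’ and ‘ggplot2’ packages for the actual figures.

```r
# Toy wide-format data: one row per health facility, one column per month.
# All values are invented for illustration.
cases <- rbind(
  hf_a = c(514, 637, 637, 910, 563, 1375),
  hf_b = c(300,  NA, 306, 274,  NA,  300),
  hf_c = c( NA,  NA,  NA,  NA,  NA,   NA),
  hf_d = c( 91,  60, 212,  NA, 154,   65)
)

# Overall share of missing cells (what a missingness map summarises).
pct_missing <- 100 * mean(is.na(cases))

# Number of missing months per facility (what the histogram counts).
n_missing <- rowSums(is.na(cases))

# Facilities usable for complete case analysis, and completely blank records.
complete_hfs <- names(n_missing)[n_missing == 0]
blank_hfs    <- names(n_missing)[n_missing == ncol(cases)]

table(n_missing)
```

Completely blank records (here `hf_c`) carry no information to interpolate from, which is presumably why the slide marks them for removal.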
Unprocessed Data – Outliers?
What are these doing here?
Are they malaria outbreaks?
Are they data entry errors?
Unprocessed Data – Outliers.
‘anomalize’ package
Something looks off here
This point didn’t show up as anomalous
• One method to remove outliers is to delete those values that are
± X standard deviations from the median
• The median is insensitive to extreme values in your time series
• Experiment with different thresholds (e.g., ± 4 SDs from the median
or ± 6 SDs from the median) to examine what happens to your data
Removing Egregious Outliers – One Approach
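A minimal base-R sketch of the median ± X SD rule described above. It assumes the median and SD are computed within each facility's own series, which the slides do not state; the function names and the toy series are hypothetical.

```r
# Flag values more than `k` standard deviations from the median of a series.
# `k` is the threshold to experiment with (e.g., 4, 4.5, or 6).
flag_outliers <- function(x, k = 4.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - centre) > k * spread
}

# Replace flagged values with NA so they can be interpolated later.
remove_outliers <- function(x, k = 4.5) {
  x[flag_outliers(x, k)] <- NA
  x
}

# Toy 24-month series with one implausible spike and one blank (invented values).
x <- c(300, 306, 274, 300, 320, 9000, 310, 295, NA, 305, 298, 312,
       301, 299, 307, 315, 290, 303, 296, 308, 311, 294, 302, 309)
cleaned <- remove_outliers(x, k = 4.5)
```

One thing to watch when the SD is computed per series: the spike inflates the SD itself. With 23 observed months, a lone spike sits only about sqrt(23) ≈ 4.8 SDs from the median, so in this toy series a threshold of 4.5 flags it and a threshold of 6 does not.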
Figure labels: malaria cases; median; standard deviation; + 4.5 SDs from the median
Annotation: This value would be removed from the data set
Anomalous Data Points
‘anomalize’ package
This is what I’m targeting for removal
Less concerned with these
Removing Egregious Outliers - Effects
Average Malaria Cases – Haut-Katanga Province
+4.5 SDs from the median
Removed 8 values or 0.025%
Unprocessed data set
Are missing values actually
zeros in the DRC DHIS2?
Link between Missingness & Median Case Counts
x-axis: Median Health Facility Malaria Cases (binned): 1-15, 16-30, 31-45, 46-60, 61-75, 76-90, 91-105, 106-120, 121-135, 136-150, >150
Generalization: the lower the median case count, the higher the
average number of missing values
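The comparison behind this generalization can be sketched as follows: compute each facility's median case count, bin the facilities, and average the number of missing months within each bin. The data and the coarser bin edges are illustrative assumptions, not the slide's values.

```r
# Toy wide-format counts (rows = facilities, columns = months); invented values.
cases <- rbind(
  small_1 = c(  4,  NA,   6,  NA,   5,  NA),
  small_2 = c( 12,  10,  NA,  NA,  14,  11),
  mid_1   = c( 40,  NA,  52,  47,  45,  50),
  large_1 = c(310, 295, 320, 301, 330, 315)
)

hf_median  <- apply(cases, 1, median, na.rm = TRUE)  # median of observed months
hf_missing <- rowSums(is.na(cases))                  # missing months per facility

# Bin facilities by median case count (bin edges are illustrative).
bins <- cut(hf_median, breaks = c(0, 15, 30, 45, 60, Inf),
            labels = c("1-15", "16-30", "31-45", "46-60", ">60"))

# Average number of missing months within each bin (NA for empty bins).
avg_missing_by_bin <- tapply(hf_missing, bins, mean)
```

A pattern like this is consistent with blanks standing in for zeros at small facilities, but it does not prove it.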
• Assume no item nonresponse?
• Examine this notion with two extreme examples
• One HF time series with large monthly values and 1 missing value
• One HF time series with low monthly values and 1 missing value
• Replace missing with zero and run anomaly detection
Assumption: Missing Values are Zeros
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Interpolation on
Univariate Time Series
• A univariate time series is a sequence of single observations at
regular and successive points in time
• Possible to decompose the time series into its trend, seasonal, and
irregular components
• We can use these time series characteristics in the interpolation process
Univariate Time Series
Panels: data, seasonal, trend, remainder (time axis: 2017 to 2020)
Loess Seasonal Decomposition of Average Malaria Cases
‘stats’ package
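A Loess seasonal decomposition of this kind can be produced with `stl()` from the ‘stats’ package. The series below is simulated with a trend and a 12-month cycle; it is not the Haut-Katanga data, and the `s.window` and `robust` settings are assumptions rather than the presenter's choices.

```r
# Simulated monthly series with a positive trend and a 12-month seasonal cycle.
set.seed(1)
n <- 48
month <- seq_len(n)
y <- 300 + 2 * month + 40 * sin(2 * pi * month / 12) + rnorm(n, sd = 10)
avg_cases <- ts(y, start = c(2017, 1), frequency = 12)

# Loess-based seasonal decomposition; robust = TRUE down-weights outliers.
fit <- stl(avg_cases, s.window = "periodic", robust = TRUE)

# Columns: seasonal, trend, remainder. They sum back to the original data.
components <- fit$time.series

# plot(fit)  # draws the data, seasonal, trend, and remainder panels
```

Note that `stl()` does not accept missing values by default and needs more than two full seasonal cycles, so a 24-month monthly series is too short on its own.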
Axis labels: Autocorrelation Function, Lag
Autocorrelation Function Plot (ACF plot)
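As a rough illustration of what an ACF plot shows for a seasonal monthly series, the sketch below uses `acf()` from the ‘stats’ package on simulated data (seasonal cycle plus noise, no trend). The series is an assumption for demonstration only.

```r
# Simulated monthly series with a 12-month cycle (synthetic values).
set.seed(1)
month <- 1:48
y <- ts(300 + 40 * sin(2 * pi * month / 12) + rnorm(48, sd = 10),
        frequency = 12)

# Autocorrelations for lags 0 to 24 months; plot = FALSE returns the numbers.
acf_fit <- acf(y, lag.max = 24, plot = FALSE)

# Element 1 is lag 0, so element 7 is lag 6 and element 13 is lag 12.
lag6  <- acf_fit$acf[7]
lag12 <- acf_fit$acf[13]

# acf(y, lag.max = 24)  # draws the ACF plot itself
```

With a 12-month cycle, observations six months apart tend to move in opposite directions (negative autocorrelation) and observations twelve months apart move together (positive autocorrelation).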
• Values in a series do not have violent, unexplained fluctuations
• Changes (increases/decreases) between neighboring points occur at
a uniform rate
Assumptions of Interpolation
• Easy to code (one line in R for long form data frame)
• df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
• Intuitive understanding of linearly interpolating across very short
gaps of missing values
• Probably a good approach for high case load facilities
• May not grossly deviate from the ‘truth’ when applied to low case load
facilities
A Role for Linear Interpolation?
‘imputeTS’ package
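With a long-form data frame holding many facilities, interpolation needs to run within each facility so that a gap is never bridged from one facility's series into the next. The sketch below is a base-R stand-in for gap-limited linear interpolation, not the ‘imputeTS’ implementation. The helper `interp_short_gaps` and the toy data are hypothetical. Leaving leading and trailing blanks untouched is a choice made here and may differ from how ‘imputeTS’ handles series ends.

```r
# Linearly interpolate interior gaps of at most `maxgap` consecutive NAs.
# Longer runs, and leading/trailing NAs, are left untouched.
interp_short_gaps <- function(x, maxgap = 2) {
  idx <- seq_along(x)
  obs <- !is.na(x)
  if (sum(obs) < 2) return(x)                       # nothing to join
  filled <- approx(idx[obs], x[obs], xout = idx)$y  # NA outside observed range
  runs <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[too_long] <- NA                            # restore long gaps
  filled
}

# Long-form toy data: one row per facility-month (invented values).
df <- data.frame(
  hf    = rep(c("A", "B"), each = 6),
  month = rep(1:6, times = 2),
  cases = c(100, NA, 120, NA, NA, 150,
             10, NA,  NA, NA, 50,  NA)
)

# ave() applies the function separately within each facility.
df$int_cases <- ave(df$cases, df$hf, FUN = interp_short_gaps)
```

Facility A's gaps of one and two months are filled. Facility B's three-month gap and its trailing blank stay missing.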
Linear Interpolation
Joining known values with linear segments
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Linearly interpolated
‘anomalize’ package
Seasonality in Interpolation
Un-imputed data
Linearly interpolated data w/o seasonality
Linearly interpolated data w/ seasonality
• Take seasonality into account
• na.interp from the ‘forecast’ package in R
• By default, uses linear interpolation for non-seasonal series. For seasonal series, a
robust STL decomposition is first computed. Then a linear interpolation is applied to
the seasonally adjusted data, and the seasonal component is added back.
• na.StructTS from the ‘zoo’ package in R
• Interpolate with seasonal Kalman filter
• These two functions use similar mechanisms to interpolate missing
data in that they both can ‘handle’ seasonality in the time series
Univariate Time Series Interpolation
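The idea of removing the seasonal pattern, interpolating linearly, and adding the pattern back can be illustrated with a deliberately simplified base-R sketch. It is not `na.interp` or `na.StructTS`: in place of a robust STL decomposition or a Kalman filter it uses per-calendar-month means as a crude seasonal component, and the function name `seasonal_interp` is hypothetical.

```r
# Seasonally adjusted linear interpolation for a seasonal ts (simplified).
seasonal_interp <- function(y) {
  stopifnot(is.ts(y), frequency(y) > 1)
  pos  <- as.integer(cycle(y))             # position within the seasonal cycle
  vals <- as.numeric(y)
  # Crude seasonal component: mean deviation from the overall mean by position.
  dev <- vals - mean(vals, na.rm = TRUE)
  seas_means <- tapply(dev, pos, mean, na.rm = TRUE)
  seasonal <- as.numeric(seas_means[as.character(pos)])
  adjusted <- vals - seasonal              # seasonally adjusted series
  idx <- seq_along(vals)
  obs <- !is.na(adjusted)
  adjusted[!obs] <- approx(idx[obs], adjusted[obs],
                           xout = idx[!obs], rule = 2)$y
  ts(adjusted + seasonal, start = start(y), frequency = frequency(y))
}

# Noise-free seasonal series with the value at a seasonal peak removed.
t_idx <- 1:36
truth <- 100 + 50 * sin(2 * pi * t_idx / 12)
vals <- truth
vals[15] <- NA
y <- ts(vals, frequency = 12)

filled <- seasonal_interp(y)                             # seasonal-aware fill
naive  <- approx(t_idx[-15], vals[-15], xout = t_idx)$y  # plain linear fill
```

In this toy case the plain linear fill cuts off the seasonal peak, while the seasonally adjusted fill recovers it. Real series with noise, trend, and few cycles will behave less cleanly.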
Seasonality Adjusted Time Series
Let’s reset and apply some
of these steps
Missingness Visualized – Unprocessed Data
Missing (28.6%) Present (71.4%) ‘visdat’ package
284 HFs with no missing data
Missingness Visualized – Removed New/Defunct HFs
Missing (13.8%) Present (86.2%) ‘visdat’ package
Missingness Visualized – Linear Interpolation (gaps ≤ 2)
Missing (6.7%) Present (93.3%) ‘visdat’ package
807 HFs with no missing data
Time Series Trends
New/defunct HFs and outliers have been removed from all time series
Recreate the “Truth”
• Use a data set containing only complete time series records
• 2.5% of data are zero values (primarily limited to smaller facilities)
• Introduce random missingness
• Randomly delete 15% of data points
• Delete 90% of remaining zero values
• Include runs of more than 2 missing values
• Apply various imputation methods and compare against the “truth”
• Replace all blanks with zeros
• Linear interpolation on gaps ≤ 2
• Use the two identified interpolation strategies that consider seasonality
A Quick Example
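The masking steps listed above might be sketched as follows on a synthetic complete data set. The Poisson counts, the number of facilities, and the way long runs are forced (four consecutive months in five facilities) are all illustrative assumptions; the slides do not say how runs longer than two were introduced.

```r
set.seed(42)
# Complete "truth": 50 facilities x 24 months of synthetic counts.
truth <- matrix(rpois(50 * 24, lambda = 3), nrow = 50)
masked <- truth

# Step 1: randomly delete 15% of all data points.
n_cells <- length(truth)
drop_random <- sample(n_cells, size = round(0.15 * n_cells))
masked[drop_random] <- NA

# Step 2: delete 90% of the zero values that remain.
zero_cells <- which(!is.na(masked) & masked == 0)
drop_zero <- zero_cells[sample.int(length(zero_cells),
                                   size = round(0.90 * length(zero_cells)))]
masked[drop_zero] <- NA

# Step 3: force runs longer than 2 months in a few facilities
# (run length and number of facilities are illustrative choices).
for (hf in sample(nrow(masked), 5)) {
  first <- sample(1:(ncol(masked) - 3), 1)
  masked[hf, first:(first + 3)] <- NA
}

# Cells whose imputed values can later be scored against `truth`.
held_out <- is.na(masked)
```

Because `truth` is kept alongside `masked`, any imputation method applied to `masked` can be compared cell by cell on `held_out`.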
Time Series Trends
Anomalous data points have been removed
Series shown: na.StructTS, na.interp
Average raw bias: na.StructTS = -1.18; na.interp = -0.03
MAPE: na.StructTS = 119.03; na.interp = 117.41
The RMSE difference is positive for 1,847 HFs, indicating that the
‘na.StructTS’ approach had a lower RMSE for 68% of HFs
‘na.StructTS’ approach has lower RMSE
‘na.interp’ approach has lower RMSE
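The three diagnostics reported here can be written in a few lines of base R. The definitions below are the conventional ones (bias as imputed minus true) and are assumptions about how the slide's figures were computed. In particular, MAPE is undefined where the true value is zero, so this sketch skips those points. The toy vectors are invented.

```r
# Raw bias: mean signed error (negative = imputed values too low on average).
raw_bias <- function(truth, imputed) mean(imputed - truth)

# Root mean squared error.
rmse <- function(truth, imputed) sqrt(mean((imputed - truth)^2))

# Mean absolute percentage error; points where truth is 0 are skipped
# (an assumption, not necessarily the rule used in the slides).
mape <- function(truth, imputed) {
  keep <- truth != 0
  100 * mean(abs((imputed[keep] - truth[keep]) / truth[keep]))
}

# Toy held-out values for one facility and two hypothetical imputations.
truth    <- c(10, 20, 0, 40)
method_a <- c(12, 18, 1, 40)
method_b <- c( 5, 25, 0, 50)

# Per-facility RMSE difference: positive means method A is closer to the truth.
rmse_diff <- rmse(truth, method_b) - rmse(truth, method_a)
```

Computing `rmse_diff` once per facility and counting how often it is positive gives a share comparable to the 68% quoted above. With many small counts, percentage errors become very large, which is one possible reason for MAPE values above 100.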
• Assess missingness
• Address egregious outliers
• Manage new/defunct facility records
• Decompose the time series
• Try a few different interpolation techniques and plot results
• Isolate a subset of records with no missing data
• Introduce missing data and then recreate the “truth”
Recap
This presentation was produced with the support of the United States Agency for International
Development (USAID) under the terms of the Data for Impact (D4I) associate award
7200AA18LA00008, which is implemented by the Carolina Population Center at the University of
North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.;
John Snow, Inc.; and Tulane University. The views expressed in this publication do not
necessarily reflect the views of USAID or the United States government.
www.data4impactproject.org
• DHIS2 time series do not always lend themselves well to multiple
imputation
• Multiple imputation is a preferable choice when there are variables
predictive of missingness that could be included in the imputation model
• With DHIS2 data, it can be difficult to locate other time-dependent variables to aid in
the imputation process
• DHIS2 time series may exhibit a missing-not-at-random (MNAR) structure
• Earlier time points have more missing data
• Zero values are more likely to be missing
Imputation
• Advantages of using DHIS2 data
• Access to a wide breadth of data elements/services
• Analyze at various levels of the health system
• National, regional, district, health facility
• Data are generally collected via standardized reporting tools
• Data tend to be reported at regular intervals allowing for frequent updates
to analyses
• However, not all data elements are well-reported, and it is typically
necessary to process/clean DHIS2 data
Why Use DHIS2 Data?

Managing missing values in routinely reported data: One approach from the Democratic Republic of the Congo

  • 1. Managing missing values in routinely reported data: One approach from the DRC
      Matt Worges
      Data for Impact Webinar Series
      December 2, 2020
  • 2. Overview
      • Framing the Webinar through the D4I lens
      • DHIS2 data: advantages and issues
      • Exploring a DHIS2 data set
      • What to do with blanks?
      • Interpolation
      • Recreate the “Truth”
      • Interpolation diagnostics
  • 3. IHP Impact Evaluation
      • The D4I team was tasked with conducting an impact evaluation of the USAID Integrated Health Project (IHP) implemented in 9 provinces of the DRC
      • IHP goal: Reduce maternal, newborn, and child deaths through delivery of integrated health services
      • IHP objectives: Increase access to and use of quality health services in the targeted health zones
  • 4. IHP Impact Evaluation – Approach
      • D4I research question: What was the impact of IHP on the utilization of health services (e.g., treatment for childhood illnesses) over the course of the study period?
      • Measuring impact: D4I is assessing impact through a difference-in-differences (DID) with propensity score matching (PSM) model
      • Data source: We are using DHIS2 data for this impact evaluation
  • 5. IHP Impact Evaluation – Propensity Score Matching
      • PSM is widely used to mitigate confounding in observational studies
      • Complications arise when the covariates used to estimate the propensity scores are only partially observed
      • Interpolation/imputation approaches provide a potential solution for handling missing data in the estimation of the propensity scores
      • Recommended to derive the propensity score after applying interpolation or imputation
  • 6. Some DHIS2 Issues
      • Addition/removal of health facilities at different time points
      • Long runs of missing values
      • Zero counts are typically not entered – they are left blank
      • Cannot distinguish between truly missing and zero
      • Data entry errors manifesting as outliers/anomalous points
      • Reporting has improved over time making older time points less complete
  • 7. Why do we care about missingness?
      • Missing data can result in:
          • Reduced statistical power
          • Biased estimators
          • Reduced representativeness of the sample
          • Generally incorrect inference and conclusions
      Overview of Approaches for Missing Data – Susan Buchman
  • 8. Data Set
      • Time Series Characteristics
          • Restricted to Haut-Katanga Province, DRC
          • Uncomplicated + severe malaria cases (all ages)
          • 24-month period from October 2018 to September 2020
          • Health facility count = 1,362
          • The monthly-aggregated time series appears to include both a seasonal and positive trend component
  • 9. Unprocessed Data – Missingness Visualized
      [Table: monthly malaria case counts by health facility (HF), columns Oct-18 through Jan-20. Blank (missing) cells were lost in extraction, so each row lists its reported values in order and they cannot be aligned to specific months.]
      • hk Panda Hôpital Général de Référence: 514, 637, 637, 910, 563, 1375, 678, 483, 839, 773, 929, 792, 694, 1355, 1219
      • hk Serge Amie Centre de Santé: 300, 306, 274, 300, 320, 440, 522, 582
      • hk AENAF Centre de Santé de Référence: 91, 60, 212, 154, 65, 279, 114, 59, 213, 55, 131, 38, 399, 227, 222
      • hk Asvie Centre Médical: 439, 556, 475, 379, 370, 335, 279, 280, 256, 381, 627, 639
      • hk Mupanda Centre de Santé: 610, 479, 363, 610, 641, 408, 573, 248, 237, 279, 455, 319, 203
      • hk Boma Publique Centre de Santé: 294, 293, 304, 293, 308, 318, 178, 225, 326, 325, 240
      • hk Kawama Centre de Santé: 174, 176, 2, 283, 280, 304, 286, 288, 4, 275, 379, 319, 264, 313
      • hk Kabambakuku Centre de Santé: 317, 396, 372, 434, 368, 298, 255, 314, 303, 251, 287, 283
      • hk Kaboka Centre de Santé: 419, 314, 201, 240, 350, 199, 151, 197, 274, 257
      • hk Kasomeno Centre de Santé de Référence: 282, 307, 306, 265
      • hk Kikula Centre de Santé de Référence: 221, 241, 246, 275, 167, 318, 393
      • hk Belle Vue Centre de Santé: 135, 157, 555, 350, 124, 102, 92
  • 10. Unprocessed Data – Missingness Visualized
      [Figure: Malaria Cases – Haut-Katanga Province (‘visdat’ package). Missing (28.6%), Present (71.4%)]
  • 11. Unprocessed Data – Histogram of Missingness
      [Figure: histogram (‘ggplot2’ package). Annotations: No missing values (complete case analysis); Completely blank records (remove from data set); One missing value; Two missing values. Bar labels: 284, 193, 137, 27]
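The two missingness summaries above (overall share of blank cells, and the count of blanks per facility) can be sketched in base R. This is a minimal illustration only: the facility names and case counts are made up, and the wide one-row-per-facility layout is an assumption, not the deck's actual data structure.

```r
# Hypothetical wide table: one row per facility, one column per month.
# All values are invented for illustration.
cases <- rbind(
  facility_a = c(51, 63, 63, 91, 56, 73),
  facility_b = c(30, NA, 31, NA, NA, 27),
  facility_c = c(NA, NA, NA, NA, NA, NA),
  facility_d = c(9, 6, NA, 21, 15, 7)
)

# Overall share of missing cells (what a missingness plot summarises)
pct_missing <- 100 * mean(is.na(cases))

# Number of missing months per facility (what the histogram bins)
n_missing <- rowSums(is.na(cases))
table(n_missing)

# Facilities usable for complete case analysis
complete <- cases[n_missing == 0, , drop = FALSE]

# Drop completely blank records, keep everything else
no_blank <- cases[n_missing < ncol(cases), , drop = FALSE]
```

Keeping `drop = FALSE` preserves the matrix shape even when only one facility survives a filter.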
  • 12. Unprocessed Data – Outliers?
      [Figure annotations: What are these doing here? Are they malaria outbreaks? Are they data entry errors?]
  • 13. Unprocessed Data – Outliers
      [Figure (‘anomalize’ package). Annotations: Something looks off here; This point didn’t show up as anomalous]
  • 14. Removing Egregious Outliers – One Approach
      • One method to remove outliers is to delete those values that are ± X standard deviations from the median
      • The median is insensitive to extreme values in your time series
      • Experiment with different thresholds (i.e., ± 4 SDs from the median or ± 6 SDs from the median) to examine what happens to your data
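A minimal base R sketch of the median ± X SD rule, for experimenting with thresholds. The function name `flag_outliers`, the synthetic series, and the choice to compute the median and SD over a single series are assumptions; the deck does not say whether its thresholds were computed per facility or over pooled data.

```r
# Flag values more than k standard deviations away from the median.
# Missing values are never flagged.
flag_outliers <- function(x, k = 4.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - centre) > k * spread
}

# Synthetic series: 23 ordinary months, one extreme value, one blank
cases <- c(rep(c(120, 130, 140), length.out = 23), 2500, NA)

# Try several thresholds and see how many points each would remove
for (k in c(4, 4.5, 6)) {
  cat("k =", k, "-> flagged:", sum(flag_outliers(cases, k)), "\n")
}

# Set flagged values to NA so they can be interpolated later
cleaned <- replace(cases, flag_outliers(cases, k = 4.5), NA)
```

One thing to watch when experimenting: the extreme value also inflates the standard deviation, so in a short series a high threshold may flag nothing at all. In this example the 2500 is flagged at 4 and 4.5 SDs but not at 6.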
  • 15. [Figure legend: Malaria cases; Median; Standard deviation; + 4.5 SDs from the median. Annotation: This value would be removed from the data set]
  • 16. Anomalous Data Points
      [Figure (‘anomalize’ package). Annotations: This is what I’m targeting for removal; Less concerned with these]
  • 17. Removing Egregious Outliers – Effects
      [Figure: Average Malaria Cases – Haut-Katanga Province. Series: Unprocessed data set; +4.5 SDs from the median (removed 8 values or 0.025%)]
  • 18. Are missing values actually zeros in the DRC DHIS2?
  • 19. Link between Missingness & Median Case Counts
      [Figure: x-axis Median Health Facility Malaria Cases (binned): 1-15, 16-30, 31-45, 46-60, 61-75, 76-90, 91-105, 106-120, 121-135, 136-150, >150]
      • Generalization: the lower the median case count, the higher the average number of missing values
  • 20. Assumption: Missing Values are Zeros
      • Assume no item nonresponse?
      • Examine this notion with two extreme examples
          • One HF time series with large monthly values and 1 missing
          • One HF time series with low monthly values and 1 missing
      • Replace missing with zero and run anomaly detection
  • 21. [Figure (‘anomalize’ package): two panels, each annotated “Initial missing value was replaced with 0”]
  • 23. Univariate Time Series
      • A univariate time series is a sequence of single observations at regular and successive points in time
      • Possible to decompose the time series into its trend, seasonal, and irregular components
      • We can use these time series characteristics in the interpolation process
  • 24. Loess Seasonal Decomposition of Average Malaria Cases
      [Figure (‘stats’ package). Panels: data, seasonal, trend, remainder; x-axis 2017 to 2020]
  • 27. Assumptions of Interpolation
      • Values in a series do not have violent, unexplained fluctuations
      • Change (increase/decrease) between points occurs at a uniform rate
  • 28. A Role for Linear Interpolation? (‘imputeTS’ package)
      • Easy to code (one line in R for a long-form data frame)
          • df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
      • Intuitive understanding of linearly interpolating across very short gaps of missing values
      • Probably a good approach for high case load facilities
      • May not grossly deviate from the ‘truth’ when applied to low case load facilities
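To make the gap limit concrete, here is a base R sketch of linear interpolation that fills only short runs of missing values. It illustrates the logic and is not the ‘imputeTS’ source code: the function name `interp_short_gaps` is hypothetical, and this sketch leaves leading and trailing gaps missing, which may differ from the package's behaviour.

```r
# Linearly interpolate runs of NAs no longer than `maxgap`;
# longer runs (and gaps at either end of the series) stay NA.
interp_short_gaps <- function(x, maxgap = 2) {
  idx <- seq_along(x)
  obs <- !is.na(x)
  if (sum(obs) < 2) return(x)  # need two known points to draw a line

  # Straight-line fill between known points; rule = 1 leaves the ends NA
  filled <- approx(idx[obs], x[obs], xout = idx, rule = 1)$y

  # Locate runs of consecutive NAs and undo the fill where a run is too long
  runs <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[too_long] <- NA
  filled
}

cases <- c(10, NA, 14, NA, NA, 20, NA, NA, NA, 28)
interp_short_gaps(cases, maxgap = 2)
# 10 12 14 16 18 20 NA NA NA 28
```

With a long-form data frame that stacks several facilities, a function like this would need to be applied per facility so that values are not interpolated across facility boundaries.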
  • 29. Linear Interpolation
      [Figure: joining known values with linear segments]
  • 30. [Figure (‘anomalize’ package): two panels, each annotated “Initial missing value was replaced with 0”]
  • 32. Seasonality in Interpolation
      [Figure series: Un-imputed data; Linearly interpolated data w/o seasonality; Linearly interpolated data w/ seasonality]
  • 33. Univariate Time Series Interpolation
      • Take seasonality into account
      • na.interp from the ‘forecast’ package in R
          • By default, uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back.
      • na.StructTS from the ‘zoo’ package in R
          • Interpolate with seasonal Kalman filter
      • These two functions use similar mechanisms to interpolate missing data in that they both can ‘handle’ seasonality in the time series
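The decompose, interpolate, and re-add idea described for seasonal series can be sketched with base R's `stl()`. This is a simplified illustration, not the ‘forecast’ implementation: the function name `seasonal_interp` is hypothetical, the rough linear pre-fill in step 1 is an assumption made only so that `stl()` receives a gap-free series, and the data are synthetic.

```r
# Simplified seasonal interpolation for a monthly (or other seasonal) ts.
# Requires more than two full seasonal periods of data for stl().
seasonal_interp <- function(x) {
  stopifnot(is.ts(x), frequency(x) > 1)
  idx <- seq_along(x)
  miss <- is.na(x)

  # Step 1 (assumption): rough linear fill so stl() has no gaps to reject
  rough <- approx(idx[!miss], x[!miss], xout = idx, rule = 2)$y
  rough_ts <- ts(rough, start = start(x), frequency = frequency(x))

  # Step 2: robust STL decomposition with a fixed seasonal pattern
  fit <- stl(rough_ts, s.window = "periodic", robust = TRUE)
  seasonal <- as.numeric(fit$time.series[, "seasonal"])

  # Step 3: linear interpolation of the seasonally adjusted observed values
  adjusted <- as.numeric(x) - seasonal  # still NA where x was missing
  adjusted <- approx(idx[!miss], adjusted[!miss], xout = idx, rule = 2)$y

  # Step 4: add the seasonal component back
  ts(adjusted + seasonal, start = start(x), frequency = frequency(x))
}

# Synthetic monthly series: upward trend plus a 12-month cycle
months <- 1:36
truth <- ts(200 + 2 * months + 40 * sin(2 * pi * months / 12),
            start = c(2017, 10), frequency = 12)
gappy <- truth
gappy[c(8, 9, 20)] <- NA
filled <- seasonal_interp(gappy)
```

Observed values pass through unchanged; only the gaps receive new values. For real analyses the packaged functions named on the slide are likely the better choice, since they handle edge cases this sketch ignores.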
  • 35. Let’s reset and apply some of these steps
  • 36. Missingness Visualized – Unprocessed Data
      [Figure (‘visdat’ package): Missing (28.6%), Present (71.4%). 284 HFs with no missing data]
  • 37. Missingness Visualized – Removed New/Defunct HFs
      [Figure (‘visdat’ package): Missing (13.8%), Present (86.2%)]
  • 38. Missingness Visualized – Linear Interpolation (gaps ≤ 2)
      [Figure (‘visdat’ package): Missing (6.7%), Present (93.3%). 807 HFs with no missing data]
  • 39. Time Series Trends
      [Figure note: New/defunct HFs and outliers have been removed from all time series]
  • 41. A Quick Example
      • Use a data set containing only complete time series records
          • 2.5% of data are zero values (primarily limited to smaller facilities)
      • Introduce random missingness
          • Randomly delete 15% of data points
          • Delete 90% of remaining zero values
          • Include runs of more than 2 missing values
      • Apply various imputation methods and compare against the “truth”
          • Replace all blanks with zeros
          • Linear interpolation on gaps ≤ 2
          • Use the two identified interpolation strategies that consider seasonality
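The mask-and-compare procedure above can be sketched in base R for a single series. Everything here is illustrative: the series is made up, the seed is arbitrary, the helper `score` is hypothetical, and raw bias is assumed to mean the average of (imputed minus true) over the deleted points, since the deck does not define it. The step that forces runs of more than 2 missing values is not reproduced.

```r
set.seed(42)  # arbitrary seed so the deletions are reproducible

# Complete synthetic series standing in for one facility's "truth"
truth <- c(0, 12, 15, 11, 0, 18, 22, 19, 25, 21, 17, 14,
           13, 16, 0, 20, 24, 23, 27, 22, 18, 15, 12, 10)

# Introduce missingness: drop 15% of points at random,
# then drop roughly 90% of the zeros that remain
masked <- truth
masked[sample(length(truth), size = round(0.15 * length(truth)))] <- NA
zeros <- which(!is.na(masked) & masked == 0)
masked[zeros[runif(length(zeros)) < 0.9]] <- NA

# Score an imputed series against the truth at the deleted points only
score <- function(imputed, truth, masked) {
  gap <- is.na(masked)
  err <- imputed[gap] - truth[gap]
  c(raw_bias = mean(err), rmse = sqrt(mean(err^2)))
}

# Two simple candidates: blanks as zeros, and plain linear interpolation
zero_fill <- replace(masked, is.na(masked), 0)
idx <- seq_along(masked)
obs <- !is.na(masked)
linear_fill <- approx(idx[obs], masked[obs], xout = idx, rule = 2)$y

rbind(zero_fill = score(zero_fill, truth, masked),
      linear    = score(linear_fill, truth, masked))
```

Repeating this over many facilities gives per-facility bias and RMSE values that can then be compared across methods.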
  • 43. Time Series Trends
      [Figure note: Anomalous data points have been removed]
  • 45. [Figure: two panels. na.StructTS: Average raw bias = -1.18; na.interp: Average raw bias = -0.03]
  • 47. The RMSE difference is positive for 1,847 HFs, indicating that the ‘na.StructTS’ approach had a lower RMSE for 68% of HFs
      [Figure regions: ‘na.StructTS’ approach has lower RMSE; ‘na.interp’ approach has lower RMSE]
  • 48. Recap
      • Assess missingness
      • Address egregious outliers
      • Manage new/defunct facility records
      • Decompose the time series
      • Try a few different interpolation techniques and plot results
      • Isolate a subset of records with no missing data
      • Introduce missing data and then recreate the “truth”
  • 50. This presentation was produced with the support of the United States Agency for International Development (USAID) under the terms of the Data for Impact (D4I) associate award 7200AA18LA00008, which is implemented by the Carolina Population Center at the University of North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.; John Snow, Inc.; and Tulane University. The views expressed in this publication do not necessarily reflect the views of USAID or the United States government. www.data4impactproject.org
  • 51. Imputation
      • DHIS2 time series do not always lend themselves well to multiple imputation
          • Multiple imputation is a preferable choice when there are variables predictive of missingness that could be included in the imputation model
          • With DHIS2 data, it can be difficult to locate other time-dependent variables to aid in the imputation process
      • DHIS2 time series may exhibit a missing-not-at-random (MNAR) missingness structure
          • Earlier time points have more missing data
          • Zero values are more likely to be missing
  • 52. Why Use DHIS2 Data?
      • Advantages of using DHIS2 data
          • Access to a wide breadth of data elements/services
          • Analyze at various levels of the health system
              • National, regional, district, health facility
          • Data are generally collected via standardized reporting tools
          • Data tend to be reported at regular intervals allowing for frequent updates to analyses
      • However, not all data elements are well-reported, and it is typically necessary to process/clean DHIS2 data