An overview of impact evaluation for organizations based on a program's Theory of Change, highlighting the need for a counterfactual and randomization (when possible) in order to convincingly demonstrate the effect of the program.
2. The Burden of Proof
Example: Medicine (Medieval)
• Four Humors Theory (not falsifiable)
• Four Humors Empirics (sub-optimal outcomes)
3. Medicine (Early Modern, 1850s)
• Theory: cholera is a waterborne disease transmitted through contaminated water (falsifiable)
• Disease Theory Empirics (decent outcomes)
4. Evaluation in Development
There is a growing awareness that robust program evaluation is essential in order for organizations to optimize their impact
• COMMON GOAL: To effectively execute programming that is impactful for beneficiaries
Results Based Management
5. Results Based Management (RBM)
• Results Based Management has two essential evaluation components
1. Program Monitoring is a continuous process of collecting data on operations
• Informs implementation and program management on effectiveness and accountability
• Compares how well a project or policy strategy is performing against initial design
2. Impact Evaluation is the periodic assessment of the causal effect of a project,
program or policy on beneficiary outcomes
• Estimates the change in outcomes attributable to the intervention
7. Example: Evaluation of a Water, Sanitation and Hygiene (WASH) Project
Results chain (each stage feeds the next; Program Monitoring spans the full chain):
1. WASH project implementation: materials produced; outreach channels established
2. Exposure to handwashing (HW) with soap promotions: beneficiaries reached; materials deemed appropriate
3. Changes in beliefs, knowledge and availability: materials understood; knowledge gained; attitudes influenced
4. Improved HW behavior among mothers and caretakers: beneficiaries have access to HW facilities; beneficiaries want to improve child health
5. Improved children's health: disease burden diminished; morbidity rate diminished
8. Example: Evaluation of a WASH Project
The same results chain, with the monitoring indicators collected at each stage:
1. WASH project implementation: materials produced; personnel employed; resources disbursed
2. Exposure to HW with soap promotions: participation rate; media access rate; materials uptake
3. Changes in beliefs, knowledge and availability: participant survey; willingness to pay; purchase records
4. Improved HW behavior among mothers and caretakers: observed behavior; household survey; physical tests
5. Improved children's health: medical statistics; anthropometry; reported wellbeing
(Program Monitoring covers all five stages)
9. Where’s the Impact?
Given the complex nature of measuring impact for interventions in the real world, significant effort must go into designing the Data-Generating Process (DGP) of an evaluation
• What do we mean when we talk about the effect of a program?
A. The difference in outcomes between people who participate in the program and those who don’t
• Observed effect: many informal evaluations focus on this
B. What happens to someone after she participates in the program?
• The Average Treatment on the Treated (ATT or TOT) effect
C. The difference between what happened to the person who participated in the program and what would have happened to that same person if she hadn’t participated in the program?
• The true Average Treatment Effect (ATE)
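The gap between the naive participant vs. non-participant comparison (A) and the true ATE (C) can be made concrete with a small simulation. This is a minimal sketch with invented numbers, not data from any real program: "motivation" is a hypothetical unobserved trait that both raises outcomes and drives self-selection into the program.

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 2.0  # assumed treatment effect of the hypothetical program

people = []
for _ in range(10_000):
    # Motivation raises outcomes on its own AND makes participation more likely
    motivation = random.gauss(0, 1)
    participates = motivation > 0.5                # self-selection, not random
    outcome = 10 + 3 * motivation + random.gauss(0, 1)
    if participates:
        outcome += TRUE_EFFECT
    people.append((participates, outcome))

treated = [y for p, y in people if p]
untreated = [y for p, y in people if not p]

# Definition A: the naive observed difference in outcomes
observed = statistics.mean(treated) - statistics.mean(untreated)

print(f"true ATE:        {TRUE_EFFECT:.2f}")
print(f"observed effect: {observed:.2f}")  # much larger than the true ATE
```

Because participants are systematically more motivated, the observed difference bundles the program effect together with the motivation gap, overstating the impact.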
10. Example 1: Microfinance
A. Difference in outcomes between people who participate in the program and those
who don’t: Micro-borrowers may be more highly motivated than others
B. What happens to someone after she participates in the program: change in a
borrower’s outcomes determined by outside factors that caused her to borrow as
well as the effect of microfinance
C. Difference between what happened and what would have happened: the difference
between what happened to a borrower’s business (family, health, etc.) and what
would have happened if microfinance were not available to them
11. Example 2: Improved Wood-burning Stoves
A. Difference in outcomes between people who participate in the program and those
who don’t: those who adopt high-tech stoves may be more concerned with good
health than those who don’t
B. What happens to someone after she participates in the program: the desire to have
an improved woodstove could be triggered by someone having become sick in the
family, and sick people usually get better
C. Difference between what happened and what would have happened: the difference
between a family’s respiratory health after adopting the stove compared to what
their health would have been if the stove were not available to them
12. The Role of the Counterfactual
• Concept of the counterfactual: program performance is relative to what/whom?
• At every stage of assessment we need a valid reference for comparison
• Without a counterfactual it is easy to draw false conclusions or misrepresent impact
[Chart: treatment group vs. control group outcomes over time, with the expected growth trend based on historical and control group data]
13. The Importance of a Valid Counterfactual
Without a legitimate counterfactual most impact evaluations lose credibility
• Participant-to-non-participant comparisons
• e.g. comparing student performance in private schools with that of kids in public schools; because of self-selection, outcomes are likely to differ anyway
• “Before-and-after” studies
• e.g. income of microfinance loan recipients before and after taking loans from an MFI; borrowers take loans when they have investment opportunities, so much of the apparent impact of microfinance is an illusion
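The before-and-after illusion can be shown in a few lines. All numbers here are made up for illustration: the borrower's income would have risen even without the loan, because the investment opportunity that prompted her to borrow was itself going to raise it.

```python
# Hypothetical microfinance borrower (illustrative numbers only)
income_before = 200.0
income_after = 260.0            # observed income after taking the loan
counterfactual_after = 240.0    # hypothetical: same opportunity, no loan

naive_estimate = income_after - income_before        # opportunity + loan effect
true_effect = income_after - counterfactual_after    # the loan's own effect

print(f"naive before/after estimate: {naive_estimate:.0f}")
print(f"effect vs. counterfactual:   {true_effect:.0f}")
```

The naive pre/post comparison attributes the whole 60-unit rise to the loan, while only a third of it is actually the loan's effect; the counterfactual column is exactly what a valid control group is meant to estimate.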
14. Example: Evaluation of a WASH Project
The same results chain and monitoring indicators, now annotated with the comparison used to assess each stage:
1. WASH project implementation: internal performance
2. Exposure to HW with soap promotions: program pre/post
3. Changes in beliefs, knowledge and availability: beneficiaries pre/post vs. non-beneficiaries
4. Improved HW behavior among mothers and caretakers: randomized non-beneficiaries
5. Improved children's health: randomized non-beneficiaries
15. Impact Evaluations Provide Critical Data
While monitoring data is essential to the efficient implementation of programs (resources used, goods and services produced, reach and reaction), only Impact Evaluation can answer questions about effectiveness:
• Determine if a program had impact, by measuring the causal effect between an intervention and
an outcome of interest
• Estimate the level of impact
• Compare real impact with the expected impact at the time of designing the intervention
• Determine adequate intensity of intervention
• Compare differential impact among geographical areas, communities, or interventions
• What is the effect of different sub-components of a program on specific outcomes?
• What is the right level of subsidy for a service?
• How would outcomes be different if the program design changed?
• Is the program cost-effective?
16. Interpreting OLS: Good Impact Data Feeds Robust Statistical Analysis
• Example: Soybean yield and fertilizer
yield = β0 + β1·fertilizer + u
β1 measures the effect of fertilizer on yield, holding all other factors fixed; the error term u captures rainfall, land quality, presence of parasites, …
• Example: A simple wage equation
wage = β0 + β1·educ + u
β1 measures the change in hourly wage given another year of education, holding all other factors fixed; u captures labor force experience, tenure with current employer, work ethic, intelligence, …
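The wage example described above can be estimated with the closed-form simple OLS slope. This is a minimal sketch on synthetic data (true slope 0.5, chosen for illustration); crucially, the error term u is drawn independently of education here, which is exactly the assumption that makes OLS unbiased.

```python
import random

random.seed(1)

# Synthetic data for wage = b0 + b1*educ + u, with true b0 = 1.0, b1 = 0.5
n = 5_000
educ = [random.uniform(8, 18) for _ in range(n)]
# u bundles experience, work ethic, etc.; here it is independent of educ,
# which is the (strong) assumption OLS needs to recover the true slope
u = [random.gauss(0, 2) for _ in range(n)]
wage = [1.0 + 0.5 * e + ui for e, ui in zip(educ, u)]

# Closed-form simple OLS: b1 = cov(educ, wage) / var(educ)
mean_e = sum(educ) / n
mean_w = sum(wage) / n
b1 = (sum((e - mean_e) * (w - mean_w) for e, w in zip(educ, wage))
      / sum((e - mean_e) ** 2 for e in educ))
b0 = mean_w - b1 * mean_e

print(f"estimated slope b1 = {b1:.3f}")  # close to the true 0.5
```

If u were instead correlated with educ (say, ability raised both schooling and wages), the same formula would still run but b1 would no longer estimate the causal effect; that failure is the endogeneity problem discussed below.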
17. Correlation Does NOT Imply Causation
Even with a valid counterfactual, evaluators must ensure that they are drawing conclusions based on causal inference
• Causal inference: evaluating whether a change in one variable (x) will lead to a change in another variable (y), assuming that nothing else changes (ceteris paribus)
Statistical tools can tell us a lot about how two variables covary, but covariation alone can lead to false conclusions
• Correlation does NOT imply causation
• To get to causal inference we generally need to know how the problem works in real life
18. The Endogeneity Problem
The challenge in establishing causality in impact evaluations is that many factors in development are endogenous, and not always observable
• Endogenous: originating from inside the system; in evaluations this typically means a factor that is co-influential, or a possible third variable that affects both treatment and outcome (e.g. aspirations)
• Education and earnings
• Voluntary participation and ambition
• Prices of substitute or complementary goods
• Exogenous: originating outside the system
• Interpreting an endogenous relationship as exogenous means risking treating a system with reverse causality as strictly causal
Evaluations that imply a causal relationship without accounting for endogeneity lack internal and external validity
• There is a high probability that if the intervention were tested again it would produce different outcomes
19. Validity Through Randomization
Randomization allows an evaluator to rule out the possibility that they are giving a causal, exogenous interpretation to an endogenous relationship
• Randomized Controlled Trials (RCTs) assign treatment through a lottery or another random process
• This generates two statistically identical groups
• The only difference between them is the treatment
22. Why Run Randomized Evaluations?
1. For programming:
• Gives all eligible beneficiaries the same probability of receiving the intervention
• Handles oversubscription: # eligible > available resources
• Selection criteria are ethical, quantitative, fair and transparent
2. For analysis:
• Ensures that the evaluation is measuring a causal relationship
• Allows straightforward, unbiased statistical analysis (e.g. OLS)
• Allows costs and benefits to be more accurately quantified
3. For donors:
• Produces the most accurate counterfactual, making the evaluation intuitive to all stakeholders
• Lets innovative programs be piloted and the most impactful be scaled with confidence
• Ensures that programs are constantly working to optimize their outcomes
23. Why Run Randomized Evaluations?
Recall the WASH results chain: the early stages are assessed through internal performance and program pre/post data, while the later stages (improved HW behavior, improved children's health) are assessed against randomized non-beneficiaries.
Randomized evaluations allow evaluators and managers to measure outcomes and improve programs while minimizing the need to test behavioral assumptions
24. Randomized Evaluations Are Not Always Appropriate
Running Randomized Evaluations requires significant time and
resources that may not be justified for some programs, especially:
• When the program is premature and still requires considerable “tinkering” to work well
• When the project is on too small a scale to randomize into two “representative groups”
• If a positive impact has been proven using rigorous methodology and resources are
sufficient to cover everyone
• After the program has already begun and it is not expanding elsewhere
• In emergency situations where ethical considerations suggest that acting to relieve
suffering is the immediate priority
25. Alternative Evaluation Methods
While randomized evaluation is the gold standard, there are credible “quasi-experimental” methods that can be used:
• Natural experiments that exploit “as if random” variation in program participation across individuals
• e.g. political boundaries, exogenous shocks
• Regression Discontinuity: comparing individuals just above and below an eligibility threshold
• e.g. idiosyncratic program prerequisites
• Difference-in-Differences: comparing beneficiaries with themselves and with other similar groups over time
• Statistical Matching: comparing beneficiaries to individuals with similar observable traits
These methods often require fundamental identifying assumptions and more sophisticated statistical analysis, which can undermine results if the assumptions fail
• For an overview of methodologies see: J-PAL impact-evaluation-methods (pdf)
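The Difference-in-Differences method listed above can be sketched in a few lines. The numbers are invented for illustration; the key identifying assumption is "parallel trends": absent the program, beneficiaries would have followed the same time trend as the comparison group.

```python
# Illustrative outcome levels (made-up numbers)
before_treat, after_treat = 100.0, 130.0   # beneficiaries
before_comp, after_comp = 95.0, 110.0      # similar non-beneficiaries

change_treat = after_treat - before_treat  # 30: program effect + time trend
change_comp = after_comp - before_comp     # 15: time trend alone

# Differencing the two changes cancels the shared trend,
# assuming parallel trends holds
did = change_treat - change_comp

print(f"difference-in-differences estimate: {did:.1f}")
```

A simple before/after comparison of beneficiaries would report 30; differencing out the comparison group's change attributes only 15 to the program, which is exactly the kind of correction a valid counterfactual provides.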