2. Monitoring and Evaluation
The Current Context:
  increasing calls for results and value for money
  shrinking budgets
  yet increasing complexity in program/project settings
So, what do monitoring and evaluation (M&E) have to do with these issues?
We need monitoring & evaluation to:
  identify projects that better meet needs and lead to improvements in targeted social, economic and environmental conditions
  improve decision-making
  achieve more of the intended outcomes
tell stories clearly in an evidence-based manner
enhance organizational learning
3. Definitions of Monitoring and Evaluation
Monitoring is an internal project activity.
It is the continuous assessment of project
implementation in relation to:
schedules
input use
infrastructure
services
Monitoring:
provides continuous feedback
identifies actual or potential successes/
problems
4. What’s more important?
  Knowing where you are on schedule?
  Knowing where you are on budget?
  Knowing where you are on work accomplished?
5. Definitions of Monitoring and Evaluation
Monitoring is very important in project planning and
implementation. It is like watching where you are
going while riding a bicycle; you can adjust as you go
along and ensure that you are on the right track.
Monitoring provides information that is useful for:
  Exploring the situation in the community and its projects;
  Determining whether the inputs to the project are well utilized;
  Identifying problems facing the community or project and finding solutions;
  Ensuring all activities are carried out properly, by the right people, and on time;
  Carrying lessons from one project experience over to another; and
  Determining whether the way the project was planned is the most appropriate way of solving the problem at hand.
6. Definitions of Monitoring and Evaluation
High-quality monitoring information encourages timely decision-making, ensures project accountability, and provides a robust foundation for evaluation and learning.
It is through the continuous monitoring of project
performance that you have an opportunity to learn
about what is working well and what challenges are
arising.
Job descriptions of staff involved in managing and
implementing projects should include assigned M&E
responsibilities.
7. Definitions of Monitoring and Evaluation
Evaluation is a systematic, objective, deliberate, purposeful, critical, trustworthy and ethical assessment of a project/program.
It assesses: relevance, coherence, efficiency, effectiveness, impact and sustainability.
  Interim evaluations: a first review of progress, a prognosis of likely effects, and a way to identify necessary adjustments in project design
  Terminal evaluations: evaluate the project's effects and potential sustainability (done at the end, for project completion reports)
Evaluation is:
"A periodic assessment of the relevance, performance, efficiency, effectiveness, impact and sustainability of a project in the context of stated objectives. It is usually undertaken as an independent examination with a view to drawing lessons that may guide future decision-making" (European Commission).
8. Definitions of Monitoring and Evaluation
Evaluation means information on:
Strategy
Whether we are doing the right things
– Rationale/justification
– Clear theory of change
Operation
Whether we are doing things right
– Effectiveness in achieving expected outcomes
– Efficiency in optimizing resources
– Client satisfaction
Learning
• Whether there are better ways of doing it
– Alternatives
– Best practices
– Lessons learned
9. Definitions of Monitoring & Evaluation
Monitoring: What are we doing?
Tracking inputs and outputs to assess whether
programs are performing according to plans
Evaluation: What have we achieved?
Attributing changes in outcomes to a particular
program/intervention requires one to rule out all
other possible explanations.
10. Definitions of Monitoring and Evaluation
Differences between monitoring and evaluation

                     Monitoring                        Evaluation
Frequency            Regular                           Episodic
Main action          Keeping track, oversight          Assessment, more analytical
Basic purpose        Improve efficiency,               Improve effectiveness, impact,
                     adjust work plan                  future programming
Focus                Inputs, outputs, process          Effectiveness, relevance,
                     outcomes, work plans              impact, cost-effectiveness
Information sources  Routine systems, field            Same as for monitoring, plus
                     observations, progress            surveys and studies
                     reports, rapid assessments
Undertaken by        Program managers, community       Program managers, supervisors,
                     workers, community                funders, external evaluators,
                     (beneficiaries), supervisors,     community (beneficiaries)
                     funders
Reporting to         Program managers, community       Same as for monitoring, plus
                     workers, community                policy makers
                     (beneficiaries), supervisors,
                     funders

Source: UNICEF (1991: 4)
13. Functions of M&E in the project cycle
Tests the soundness of a project's objectives and can lead to improvements in project design, through the process of selecting indicators for monitoring and through the use of design tools
Incorporates the views of stakeholders, as ownership brings mutual accountability:
  Reinforces ownership and highlights emerging problems through benefits recorded early in project implementation
  Shows the need for mid-course corrections
15. Guiding Principles of M&E
M&E is guided by the following key principles:
1. Systematic Inquiry – Staff conduct site-based inquiries
that gather both quantitative and qualitative data in a
systematic and high-quality manner
2. Honesty/Integrity – Staff display honesty and integrity in
their own behavior and contribute to the honesty and
integrity of the entire M&E process
3. Respect for People – Staff respect the security, dignity,
and self-worth of respondents, program participants,
clients, and other M&E stakeholders
4. Responsibilities to Stakeholders – Staff members
articulate and take into account the diversity of different
stakeholders’ interests and values that are relevant to
project M&E activities
16. Issues to consider before undertaking M&E
Before conducting an evaluation, it is crucial to ask:
  Why do the evaluation?
  How will the results be used?
  Who will be influenced by the findings?
Answers to these questions should determine:
  The process of evaluation
  The steps that need to be taken before quantitative data collection in the field is contemplated
  The designs of (quantitative) evaluations to meet the different needs of decision makers
17. What questions will the evaluation seek to answer?
About outcomes/impacts
What do people do differently as a result of the
program?
Who benefits and how?
Are the program’s accomplishments worth the
resources invested?
What are the strengths and weaknesses of the
program?
What, if any, are unintended secondary
consequences?
How well does the program respond to the
initiating need?
18. What questions will the evaluation seek to answer?
About program context
How well does the program fit in the local setting?
What in the socio-economic-political environment
inhibits or contributes to program success?
Who else works on similar concerns? Is there
duplication?
19. Participatory Monitoring and Evaluation
“It is a process of collaborative problem-solving through the
generation and use of knowledge. It is a process that leads to
corrective action by involving all levels of stakeholders in shared
decision-making.”
It is a collaborative process in which stakeholders at different levels work together to assess a project or policy and to take any corrective measures required.
The stakeholder groups typically involved in participatory M&E include end users of the project, NGOs, private-sector businesses involved in the project, and government staff.
Key Principles:
  Local people are active participants, not just sources of information
  Stakeholders evaluate, outsiders facilitate
  Focus on building stakeholder capacity for analysis and problem-solving
  The process builds commitment to implementing any recommended corrective actions
20. Participatory Monitoring and Evaluation
Methods in Participatory Monitoring and Evaluation
1. Stakeholder workshops: to bring together government officials, project management, and other stakeholders
2. Participatory methods such as participatory rural appraisal, SARAR (Self-esteem, Associative strengths, Resourcefulness, Action planning, Responsibility) and Beneficiary Assessment
  Participatory rural appraisal: visual methods, often used to analyze "before and after" situations through community mapping, problem ranking, wealth ranking, seasonal and daily time charts, and other tools.
3. Self-assessment methods: interviewing, focus group discussions
21. Participatory Monitoring and Evaluation

        Conventional M&E                   Participatory M&E
Who?    External experts                   Stakeholders, including communities and
                                           project staff; outsiders facilitate
What?   Predetermined indicators, to       Indicators identified by stakeholders, to
        measure inputs and outputs         measure process as well as outputs or outcomes
How?    Questionnaire surveys, by outside  Simple, qualitative or quantitative methods,
        "neutral" evaluators, distanced    by stakeholders themselves
        from the project
Why?    To make the project and staff      To empower stakeholders to take
        accountable to the funding agency  corrective action
24. Components of M&E design
Five components of good M&E design during project preparation help ensure M&E is relevant and used to good effect:
1. Clear statements of measurable objectives for
which indicators can be defined.
2. A structured set of indicators covering outputs of
goods and services generated by the project and
their impact on beneficiaries.
3. Provisions for collecting data and managing
project records
4. Institutional arrangements for gathering, analyzing
and reporting project data as well as investing in
capacity building
5. Proposals for how findings will be an input in
decision-making.
25. Defining project objectives and measuring them with M&E indicators
Problem analysis to structure objectives
Stakeholders identify causes and effects of
problems before defining objectives structured to
resolve these problems
Objectives should be:
Specific to the project interventions
Realistic in the timeframe
Measurable for evaluation
It is important to ask the following questions so that objectives are defined more precisely:
  How can the objectives be measured?
  How can the components of M&E lead to those objectives?
26. The logframe/ZOPP approach
Indicators need to be structured to match the analysis of
problems the project is trying to overcome
Logical framework/logframe/ZOPP approach:
Is used to define inputs, outputs, timetables, success
assumptions and performance indicators
Postulates a hierarchy of objectives for which
indicators are required
Identifies problems the project cannot deal with
directly (risks)
27. The logframe/ZOPP approach
[Figure: GTZ (GIZ) impact model]
Attribution gap:
  Caused by the existence of too many other significant factors
  Cannot be plausibly spanned using a linear, causal bridge
Source: Douthwaite et al. (2003: 250)
28. Impact pathways evaluation
Based on program theory evaluation and the logframe
  An explicit theory or model of how a project will bring about, or has brought about, impact
  Consists of a sequenced hierarchy of outcomes
  Represents a set of hypotheses about what needs to happen for the project output to be transformed over time into impact on highly aggregated development indicators
  Can be highly complementary to conventional assessments
Advantages of this approach:
  Consideration of wider impact helps achieve impact
  Complements conventional economic assessment
29. Impact pathways evaluation
Two main phases in impact pathway evaluation
1st phase: using program theory evaluation to guide self-monitoring and self-evaluation to establish the direct benefits of the project outputs in its pilot site(s).
  Task: to develop a theory or model of how the project sees itself achieving impact (called an impact pathway)
  Identifies the steps the project should take to scale out and scale up
    Scale-out: the innovation spreads from farmer to farmer, within the same stakeholder groups
    Scale-up: institutional expansion from grassroots organizations to policymakers, donors, development institutions, and other stakeholders, to build an enabling environment for change
30. Impact pathways evaluation
Answers to the following questions are recorded in a
matrix for each identified outcome in the impact pathway:
What would success look like?
What are the factors that influence the achievement
of each outcome?
Which of these can be influenced by the project?
What is the program currently doing to address these
factors to bring about this outcome?
What performance information should we collect?
How can we gather this information?
31. Impact pathways evaluation
2nd phase in impact pathway evaluation: An
independent ex-post impact assessment is carried out
some time (normally several years) after the project
has finished
Begins by establishing the extent to which the
impact pathway was valid in the pilot site(s) and
the extent to which scaling occurred
An attempt to bridge the attribution gap, using
phase 1 results as a foundation
33. Impact pathways evaluation: case study
Striga hermonthica is a parasitic weed (the "witch weed") which infests nearly 21 million ha in SSA
A project running since 1999 has used participatory research approaches to develop locally adapted integrated Striga control
Project output: on-farm research to adapt and validate integrated Striga control options in farmers' fields
Project goal: improved livelihoods for the 100 million people in Africa affected by Striga
34. Impact pathways evaluation
[Figure: impact pathway for integrated Striga control. Shaded boxes = monitored outcomes; unshaded boxes = to be evaluated in a future ex-post impact assessment. Complemented by a program theory matrix (not shown).]
Source: Douthwaite et al. (2003: 253)
35. Impact pathways evaluation
Example of using the impact pathways approach: the Challenge Program on Water and Food
  Used an ex-ante participatory impact assessment analysis to demonstrate to donors how project outputs will lead to development outcomes and widespread impacts after the project has ended
  A useful approach, since for technologies measurable impact can take 20 years from the time basic research begins
36. Impact pathways evaluation: The Challenge Program on Water and Food
A project involving 50 different organizations and almost 200
organizations in 9 river basins
3 systems level research themes:
Crop water productivity improvement, water and people in
catchments, and aquatic ecosystems and fisheries
Basin level theme: integrated water basin management
systems
Global scale theme: global and national water and food
systems
Measuring impacts:
  It can take 10 years to move from basic research to useful technologies, and then another 10 years to see wide-scale impacts
  Thus, an ex ante impact assessment approach is used to demonstrate to donors HOW project outputs WILL lead to development outcomes and widespread impacts after the end of the projects that developed them.
  Ex ante impact assessment also provides a solid base for a later ex post impact assessment
37. Classification axes: the indicator axis
A conceptual framework to help guide the design of an evaluation, from Habicht et al. (1999)
An evaluation may be aimed at one or more categories of decision makers, so the design must take into account their different needs
The first axis refers to the indicators: whether one is evaluating the performance of the intervention delivery or its impact on indicators
The second axis refers to the type of inference to be drawn: the level of confidence the decision maker needs that any observed effects were due to the intervention
38. Classification axes: the indicator axis
Indicators of provision, utilization, coverage and impact: what is to be evaluated, and what type of information is to be sought?
Outcomes of interest ("indicators"):
1. Provision: services must be provided (available and accessible to the target population, and of adequate quality)
2. Utilization: the population must accept and make use of the services
3. Coverage: utilization will result in a given population coverage, which represents the interface between service delivery and outreach to the population
4. Impact: coverage may lead to an impact
Choose indicators based on the decision makers and on cost
39. Classification axes: the indicator axis
If a weak link is discovered, investigate why
An impact can be expected only when the correct service is provided in a timely manner and is properly utilized by a sufficiently large number of beneficiaries
Example: a project offering loans to smallholders with the objective of increasing fertilizer use:
1. Provision: measure the availability of the loans to smallholders,
2. Utilization: measure the disbursement of the loans to smallholders,
3. Coverage: measure the proportion of smallholders that have been able to take out a new loan, and
4. Impact: measure the impact of the project on fertilizer use.
40. Classification axes: the indicator axis
Example of indicators for evaluating a fertilizer distribution program (objective: increase yields)

Indicator    Question                                  Example of indicators
Provision    Are the services available?               Number of cooperatives offering fertilizer per 1,000 population
             Are the services accessible?              Proportion of farmers within 10 km of a cooperative offering fertilizer
             Is the quality of the service adequate?   Number of days in a year when fertilizer is available
Utilization  Are the services being used?              Proportion of farmers buying fertilizer
Coverage     Is the target population being reached?   Proportion of all farmers who want to buy fertilizer who have bought fertilizer
Impact       Were there improvements in yields as a    Change in yield attributable to the fertilizer distribution program
             result of the fertilizer program?
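A minimal sketch (not from the slides) of how the provision, utilization and coverage indicators above could be computed from a farmer survey; the data are simulated and all variable names are hypothetical:

* simulated farmer survey for illustration only
clear
set seed 123
set obs 500
gen dist_coop   = runiform()*30              // km to nearest cooperative offering fertilizer
gen wants_fert  = runiform() < 0.7           // farmer wants to buy fertilizer
gen bought_fert = wants_fert & (runiform() < 0.6)
gen within10km  = dist_coop <= 10
summarize within10km                         // provision proxy: share within 10 km
summarize bought_fert                        // utilization: share of all farmers buying
summarize bought_fert if wants_fert          // coverage: buyers among those who want to buy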
41. 3.2.2 Types of inference
The second classification axis: how confident decision makers need to be that any observed effects are due to the project or program.
Both performance and impact evaluations may draw on adequacy, plausibility or probability assessments as the type of inference.
42. 3.2.2 Types of inference
There are 3 types of statements, reflecting different degrees of confidence end-users may require from the evaluation results:
1) Adequacy assessment:
  Determines whether some outcome occurred as expected,
  Relevant for evaluating process indicators (provision, utilization, coverage),
  For this, no control group is needed.
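A minimal sketch of such a check in Stata: compare an observed process indicator against a predefined adequacy criterion, with no control group; the variable bought_fert and the 60% target are hypothetical:

* adequacy check: is utilization consistent with the predefined 60% target?
clear
set seed 456
set obs 400
gen bought_fert = runiform() < 0.55    // hypothetical survey indicator
prtest bought_fert == 0.60             // one-sample test against the criterion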
43. 3.2.2 Types of inference
2) Plausibility assessment:
  Permits determination of whether the change can be attributed to the project,
  A control group needs to be used, internal or external,
  Note: selection bias (control groups often do not exhibit identical characteristics to the beneficiary group).
44. 3.2.2 Types of inference
3) Probability assessment:
  Ensures there is a small, known probability that differences between project and control areas were due to confounding, systematic bias or chance.
45. 3.2.2 Types of inference: Adequacy
1) Adequacy assessment: Did expected changes occur?
  Compares the performance or impact of the project with previously established criteria (in absolute or relative terms),
  Assesses how well project activities have met expected objectives,
  The evaluation may be cross-sectional (carried out once during or at the end of the project) or longitudinal (to detect trends, requiring baseline data and repeated measurements),
  Can use secondary data, which reduces costs
46. 3.2.2 Types of inference: Adequacy
1) Adequacy assessment: Did expected changes occur?
  Adequacy designs cannot causally link project activities with observed changes:
    They may show a lack of change in the indicators,
    But this does not automatically mean that the project was not effective.
  Even so, for many decision makers more complex evaluation designs will not be required, particularly since these would demand additional time, resources and expertise.
47. 3.2.2 Types of inference: Adequacy
Characteristics of adequacy assessment evaluations (measurements compared to predefined adequacy criteria; inference: objective met)

Performance (provision, utilization, coverage):
  Measurements: project activities
  In whom: implementation workers, project recipients
  Adequacy criterion: activities being performed as planned in the initial implementation schedule
  Cross-sectional: measured once, against an absolute value
  Longitudinal: change over time, against absolute and incremental values

Impact:
  Measurements: indicators
  In whom: project recipients or target population
  Adequacy criterion: observed change in behavior is of the expected direction and magnitude
  Cross-sectional: measured once, against an absolute value
  Longitudinal: change over time, against absolute and incremental values
48. 3.2.2 Types of inference: Plausibility
2) Plausibility assessment: Did the project seem to have an effect above and beyond external influences?
  Plausible: reasonable or probable,
  Goes beyond adequacy assessments by ruling out "confounding factors" (external factors),
  Mostly ex-post research designs, with longitudinal or cross-sectional samples,
  Control groups may be chosen before the evaluation starts or afterwards, during the analysis of the data
49. 3.2.2 Types of inference: Plausibility
2) Plausibility assessment: Did the project seem to have an effect above and beyond external influences?
  Several alternatives for a control group (these can be combined):
    Historical: the same target institutions/population
    Internal: institutions/geographical areas/individuals that should have received the full intervention, but did not (options: dose-response relation, case-control method)
    External: one or more institutions/areas without the project.
50. 3.2.2 Types of inference: Plausibility
Essential elements to establish the plausibility of impact:
The source of the impact being investigated,
The model/concept of impact used and how it
applies to the case at hand (such as the
logframe/ZOPP approach),
Objectives, limitations and attribution gap of the
impact
Theory of action on which the intervention or
strategy has been based,
Impact hypotheses (statements about the expected
impact),
Other factors that could have affected observed
changes and hypotheses,
Other informed opinions that support and contest
the study findings (views of beneficiaries and target
groups are particularly important).
Source: Baur et al. (2001:6)
51. 3.2.2 Types of inference: Plausibility
Control groups in plausibility assessments may include:
a) Historical control group: compare the change from before to after the project in the same target population, with an attempt to rule out external factors (a before-after longitudinal survey, or "panel")
b) Internal control group: individuals/areas that should have received the full intervention but did not, either because they could not be reached by the project or because they refused. Three subtypes, the first being:
  Compare the treatment group with a group not receiving the service
Sources: Habicht et al., 1997 and Schlesselman, 1982
52. 3.2.2 Types of inference: Plausibility
Control groups in plausibility assessments may include:
b) Internal control group, remaining subtypes:
  Compare the group with full treatment to several control groups that differ in uptake of the service (dose-response design),
  Case-control method: compare previous exposure to the project between those with and without the disease (or other outcome of interest).
c) External control group: one or more areas without the project. The comparison may be cross-sectional or longitudinal-control.
Sources: Habicht et al., 1997 and Schlesselman, 1982
53. 3.2.2 Types of inference: Plausibility
Intervention and control groups are supposed to be similar in all relevant characteristics except exposure to the intervention... yet in reality this is almost never true.
Why?
  Comparison groups can be influenced by confounding factors that do not affect the other groups as much.
How to reconcile?
  Measure probable confounders and account for them statistically,
  For historical controls: general socioeconomic development is a key confounder to consider.
So, is it plausible?
  Assessments encompass a continuum, ranging from weak to strong statements... but one cannot completely rule out all alternative explanations for the observed differences.
54. 3.2.2 Types of inference: Plausibility
Example: increasingly stronger plausibility assessments with different control groups:
  Diarrhea mortality fell rapidly in areas with control of diarrheal disease (CDD) interventions,
  Diarrhea mortality did not fall in areas without the CDD interventions (so the decline was not due to general changes in diarrhea in the area),
  Changes in other known determinants of mortality could not explain the observed decline,
  There was an inverse association between the intensity of the intervention in the project areas and diarrhea mortality,
  Mothers with knowledge of oral rehydration therapy (ORT) had fewer recent child deaths than those without such knowledge,
  Mortality among non-participants in the project area was similar to that in the control area.
Source: Habicht et al., 1997
55. 3.2.2 Types of inference: Plausibility
Characteristics of plausibility evaluations (measurements compared to a non-random control group)

Performance (provision, utilization, coverage):
  Measurement: program activities
  In whom: implementation workers, program recipients
  Inference: the intervention group appears to have better performance than the control group
  Cross-sectional: measured once, against the control group
  Longitudinal: change over time, before-after comparison
  Longitudinal-control: relative change, comparing before-after between intervention and control
56. 3.2.2 Types of inference: Plausibility
Characteristics of plausibility evaluations, continued (measurements compared to a non-random control group)

Impact:
  Measurement: indicators
  In whom: program recipients/target population
  Inference: changes appear to be more beneficial in the intervention group than in the control group
  Cross-sectional: measured once, against the control group
  Longitudinal: change over time, before-after comparison
  Longitudinal-control: relative change, comparing before-after between intervention and control
  Case-control: measured once in the target population, comparing exposure to the program between cases and controls
57. 3.2.2 Types of inference: Probability
3) Probability assessment: Did the project have an effect (p < 0.05)?
  Aims at ensuring there is only a small, known probability that the differences between the project and control areas were due to confounding, bias or chance
  Requires randomization of treatment and control activities across the comparison groups, so that the statistical statement is directly related to the intervention:
    Alternative? Obtain information on all observations (a census)!
    Randomization does not guarantee that all confounding is eliminated, yet it does ensure that the probability of confounding is measurable,
    Can be used even if the confounding factors are not known.
58. 3.2.2 Types of inference: Probability
Grouped errors
  Evaluation designs are often randomized over groups (not individuals), yet researchers use individual-level data,
  The error term may then not be independent across individuals within the same group,
  Can you think of any examples of when the error term may not be independent across individuals?
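A minimal sketch of the standard remedy, clustering standard errors at the unit of randomization; the village-level design and all variable names are hypothetical:

* randomization over villages, analysis on individuals:
* cluster-robust SEs allow for within-village correlation of errors
clear
set seed 789
set obs 20                              // 20 villages
gen village = _n
gen treat   = mod(village, 2)           // treatment assigned at the village level
gen vshock  = rnormal()                 // common village-level shock
expand 50                               // 50 individuals per village
gen y = 2 + 0.5*treat + vshock + rnormal()
regress y treat, vce(cluster village)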
59. 3.2.2 Types of inference: Probability
Difficulties which limit the use of probability assessments:
  The evaluator must be present at a very early stage of the project planning cycle to design the randomization,
  It is necessary to overcome political influence affecting the choice of where the intervention will take place,
  The stringencies of probability evaluations may result in situations artificially different from reality, and thus the evaluations may lack external validity.
Few experienced decision makers require measuring the effectiveness of every project through a probability design, but key individuals may have been trained to regard it as the "gold standard", and there are times when it is needed.
60. 3.2.2 Types of inference: Probability
Characteristics of probability evaluations (measurements compared to randomized control group(s); inference: the program has an effect, p < 0.05)

Performance (provision, utilization, coverage):
  Measurements: program activities
  In whom: implementation workers, program recipients
  Inference: the intervention group has better performance than the control group
  Longitudinal-control: relative change, comparing before-after between intervention and control

Impact:
  Measurements: behavioral indicators
  In whom: program recipients
  Inference: changes in behavior are more beneficial in the intervention group than in the control group
  Longitudinal-control: comparing before-after between intervention and control
61. 3.2.2 Types of inference: Probability
Null hypothesis (H0): program effect = 0
  The "no effect" or "no difference" case
Alternative hypothesis (H1): program effect ≠ 0
Alpha (type I error): the probability of saying there is a relationship when there actually is not (wrong)
Power: the probability of saying there is a relationship when there actually is one (correct)
Confidence level: the probability of saying there is no relationship when there is none (correct)
Beta (type II error): the probability of saying there is no relationship when there actually is one (wrong)
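A minimal sketch of how these quantities drive sample size, using Stata's power command; the 0.25 standard-deviation effect size is illustrative, not from the slides:

* required sample size to detect a 0.25 SD difference in means
* with alpha = 0.05 and power = 0.80 (i.e., a beta error of 0.20)
power twomeans 0 0.25, sd(1) alpha(0.05) power(0.8)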
62. 3.2.2 Types of inference: Probability

                      We reject H0         We fail to reject H0
H0 in reality true    Type I error         No error made
                      (chance is alpha)
H0 in reality false   No error made        Type II error
                                           (chance is beta)

H0: no effect; H1: effect
Reject H0 if the p-value is < alpha (e.g., 0.05); fail to reject H0 if the p-value is ≥ alpha.
63. 3.2.3 Internal and External Validity: 3.2.3.1 Internal Validity
Internal validity: the credibility or reliability of an estimate of project impact, conditional on the context in which it was carried out.
  It requires a strong justification that causally links the independent variables to the dependent variables, with the ability to eliminate confounding variables within the study
  Laboratory "true experiments" have high internal validity, but may have weak external validity
  Focus: whether observed changes can be attributed to the program and not to other possible causes.
Sources: Brewer (2000); Shadish et al. (2002)
64. 3.2.3.1 Internal Validity
Inferences possess internal validity if a causal relation between two variables is properly demonstrated.
When a researcher can confidently attribute the observed changes or differences in the dependent variable to the independent variable, and can rule out other explanations, then the causal inference is internally valid.
A causal inference is internally valid if 3 criteria are satisfied:
  The cause precedes the effect in time
  The cause and effect are related
  There are no plausible alternative explanations for the observed covariation.
Sources: Brewer (2000); Shadish et al. (2002)
65. 3.2.3.1 Internal Validity
Threats to internal validity
  History: Did a concurrent event affect the change in Y?
    Example: in a short experiment designed to investigate the effect of computer-based instruction, students missed some instruction because of a power failure at the school.
  Maturation: Were changes in Y due to normal developmental processes?
    Example: the performance of first graders in a learning experiment begins decreasing after 45 minutes because of fatigue.
  Statistical regression: Were differences between the two groups that could influence Y controlled for?
    Example: in an experiment involving reading instruction, subjects grouped because of poor pre-test reading scores show considerably greater gain than do the groups who scored average and high on the pre-test.
Source: CSULB
66. 3.2.3.1 Internal Validity
Threats to internal validity
  Selection: refers to selecting participants for the various groups in the study. Are the groups equivalent at the beginning of the study? Were subjects self-selected?
    Example: the experimental group in an instructional experiment consisted of a high-ability class, while the comparison group was an average-ability class.
  Experimental mortality: Did some subjects drop out?
    Example: in a health experiment designed to determine the effect of various exercises, those subjects who find the exercise most difficult stop participating.
  Testing: Did a pre-test affect the post-test?
    Example: in an experiment in which performance on a logical reasoning test is the dependent variable, a pre-test gives the subjects clues about the post-test.
Source: CSULB
67. 3.2.3.1 Internal Validity
Threats to internal validity
  Instrumentation: Did the measurement method change during the research?
    Example: two examiners for an instructional experiment administered the post-test with different instructions and procedures.
  Design contamination: Did the control group find out about the experimental treatment, or were they otherwise influenced?
    Example: in an expectancy experiment, students in the experimental and comparison groups "compare notes" about what they were told to expect.
Source: CSULB
68. 3.2.3.2 External Validity
External validity: the credibility or reliability of an estimate of project impact when applied to a context different from the one in which the evaluation was carried out.
  It means the ability to generalize the study to other settings,
  Participatory methods have little, if any, external validity,
  Inferences about a causal relation between two variables have external validity if they may be generalized from the unique and idiosyncratic settings, procedures and participants to other populations and conditions,
  Factors introduced to control internal validity may limit the external validity of the findings, as may relying on volunteers from a single geographic location.
69. 3.2.3.2 External Validity
Three frequently occurring issues which threaten the validity of a randomized experiment or a quasi-experiment are:
  Attrition
    When some members of the treatment and/or control group drop out from the sample
  Spillover
    When the program impact is not confined to the program participants
    Especially a concern when the program impacts many people or when the program provides a public good
  Noncompliance
    When some members of the treatment group do not receive the treatment or receive it improperly
70. 3.2.4 Combining types of indicators & inference
Other factors influencing the choice of evaluation design:
  Efficacy of the intervention: often unknown,
  Sector of the project: different sectors require different outcomes to be measured and different degrees of certainty,
  Timing and timeliness,
  Magnitude of sampling: the number of persons and the distances between them,
    Based on the willingness of decision makers to be given erroneous results,
    Usual practice: alpha of 5% and a beta error of 20%,
  Costs.
71. 3.2.4 Combining types of indicators & inference
Example: possible evaluations of diarrheal disease control programs:

Adequacy:
  Provision: changes in the availability of ORT in health centers
  Utilization: changes in the number of ORT packets distributed in health centers
  Coverage: measurement of the % of all diarrheal episodes treated with ORT in the population
  Impact: measurement of trends in diarrhea mortality in the intervention area
Plausibility:
  Provision and utilization: same as above, but comparing intervention with control services
  Coverage: comparison of ORT coverage between intervention and control areas (or dose-response)
  Impact: comparison of diarrheal mortality trends between intervention and control areas (or dose-response)
Probability:
  Same as for plausibility, but the intervention and control services/areas would have been randomized
72. Impact Evaluation
Why are development programs and policies designed and implemented?
  Because change is needed in livelihood outcomes
To check whether these developmental outcomes are achieved, we should do impact evaluation
The common practice at the project or program level is to monitor and assess inputs and immediate outputs of a program
But for evidence-based policy making, rigorous impact evaluation is needed
So, the current move is towards measuring outcomes and impacts in addition to inputs and processes
Source: Gertler et al. (2011)
73. Impact Evaluation
Two categories: prospective and retrospective.
  Prospective evaluations: developed at the same time as the program is being designed, and built into program implementation
    Baseline data are collected prior to program implementation for both treatment and comparison groups
  Retrospective evaluations: carried out to assess program impact after the program has been implemented, generating treatment and comparison groups ex post
Source: Gertler et al. (2011)
74. Problems in Impact Evaluation
Causal Inference & the Problem of the
Counterfactual
Whether “X” (an intervention) causes “Y” (an outcome
variable) is very difficult to determine
The main challenge is to determine what would have
happened to the beneficiaries if the intervention had not
existed
Evaluation question: what is the effect of a programme?
  Effect = Outcome A (with programme) - Outcome B (without programme)
  Problem: we only observe individuals that participate (outcome A) or do not participate (outcome B)... but never both A and B for everyone!
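The same point in potential-outcomes notation, which is standard in the impact-evaluation literature rather than taken from the slide; write Y_i(1) for outcome A and Y_i(0) for outcome B:

\[
\Delta_i = Y_i(1) - Y_i(0), \qquad
\text{ATE} = E[Y(1) - Y(0)], \qquad
\text{ATT} = E[Y(1) - Y(0) \mid T = 1]
\]
\[
E[Y \mid T=1] - E[Y \mid T=0]
= \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}}
+ \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
\]

Randomization drives the selection-bias term to zero; the designs that follow are different ways of arguing it away.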
80. 3.2.4 Experimental and quasi-experimental designs
Randomization: use randomization to obtain the counterfactual; "the gold standard" to some:
  Eligible participants are randomly assigned to a treatment group, who will receive program benefits, while the control group consists of people who will not receive program benefits,
  The treatment and control groups are identical at the outset of the project, except for participation in the project.
Quasi-experimental designs: use statistical/non-experimental research designs to construct the counterfactual.
81. (a) Experimental designs: 4 methods of randomization in Randomized Controlled Trials (RCTs)
1) Oversubscription method:
  Units are randomly assigned to the treatment and control groups and everyone has an equal chance,
  Appropriate when there is no reason to discriminate among applicants and when there are limited resources or limited implementation capacity (demand > supply of the program),
  Example: in Colombia in the mid-1990s, a lottery design was used to distribute government subsidies, namely vouchers to partially cover the cost of private secondary school for eligible students.
2) Randomized order of phase-in:
  Randomize the timing of receiving the program,
  Appropriate when the program is designed to cover the entire eligible population and there are budget/administrative constraints.
Sources: Duflo et al., 2006; ADB, 2006
82. (a) Experimental designs: 4 methods of randomization in Randomized Controlled Trials (RCTs)
3) Within-group randomization:
  Provide the program to some subgroups in each area,
  One of its problems is that it increases the likelihood that the comparison group is contaminated.
4) Encouragement design:
  Offer incentives to a randomly selected group of people,
  Appropriate when everyone is eligible and there is enough funding,
  The remaining population without the incentives is used as the control group,
  Challenge: the probability of participating is not 1 or 0 under encouragement or lack of encouragement.
Sources: Duflo et al., 2006; ADB, 2006
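A minimal sketch of the oversubscription (lottery) method in Stata; the 500 applicants and 250 program slots are hypothetical:

* lottery assignment among eligible applicants (demand > supply)
clear
set seed 2025
set obs 500                     // hypothetical eligible applicants
gen applicant_id = _n
gen u = runiform()              // lottery draw
sort u
gen treat = _n <= 250           // the first 250 in random order get the program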
83. (a) Experimental designs
Why RCTs?
  To analyze whether an intervention had an impact, a counterfactual is needed, because it is hard to ask counterfactual questions directly (Ravallion, 2008),
  Randomization guarantees statistical independence of the intervention from preferences (observed and unobserved),
  Overcomes selection bias in who receives the intervention,
  Internal validity is high,
  Requires less demanding econometric approaches,
  Led by the MIT Poverty Action Lab and the World Bank,
  Criticized by others (see Rodrik, 2008).
84. (a) Experimental designs
Potential drawbacks of RCTs:
  The evaluator must be present at a very early stage,
  Intended random assignments can be compromised,
  External validity,
  Political influences on where to place the intervention,
  Site effects (aspects of a program's setting, such as geographic or institutional aspects, interact with the treatment),
  Tendency to estimate abstract efficacy,
  Impracticality of maintaining treatment and control groups,
  Not possible for some policy questions.
Sources: Barrett and Carter, 2010; Maredia, 2009; Ravallion, 2008
85. (a) Experimental designs
How can external validity be increased for RCTs?
  Increase information on the characteristics of the population and sampling area,
  Try to answer the following questions before applying results from another study to the study area you are interested in:
    How does the population in the study differ from the population I am interested in?
    How do supporting inputs differ?
    How do alternative activities differ? Does the studied intervention substitute for or complement existing opportunities?
Source: Morduch, J. (2013)
86. (a) Experimental designs
Potential for future research incorporating RCTs:
  Complement with qualitative research and mixed-method approaches,
  Combine with weighting or matching,
  Analyze the impact of feeder roads,
  Incorporate field experiments, e.g. on time and risk preferences, and skills,
"Ultimately, the goal for evaluation should be to help decide what to do in the future. This is both for donors who need to know where to put their money, for skeptics who want to see that programs can work, and for implementers who need to know how best to design their programs." (Karlan, 2009, p. 8)
87. (a) Experimental designs
Why don't more RCTs occur? (Pritchett, 2002)
1) True believers believe they already know both the important issue and the correct instrument to address it.
2) The outcomes could undermine budgetary support for the organization.
3) The problem of deciding which issue is most important and which is the best instrument to address it: ignorance helps maintain budget size.
88. (a) Experimental designs
Conclusions on RCTs:
  RCTs can provide evidence before expanding programs, but one needs to weigh the pros and cons,
  Sample size needs to be considered,
  Understanding the context beforehand is critical,
  Was the intervention the right one for the problem?
  It is best to combine RCTs with other methods.
89. Field experiments
  Key assumption: randomized assignment
  Treatment effects: individual and intention-to-treat
  Internal and external validity
  Examples and exercises
92. Example: PROGRESA comparison groups

                          Treatment villages    Control villages
Eligible households       A                     B
Non-eligible households   C                     D

  Impact effect: A - B
  Spill-over effects: C - D
95. (b) Quasi-experimental designs
Impact assessment techniques:
  Propensity score matching: match program participants with non-participants, typically using individual observable characteristics,
  Difference-in-differences/double difference: compare observed changes in the outcome before and after, for a sample of participants and non-participants,
  Regression discontinuity design: individuals just on the other side of the cut-off point serve as the counterfactual,
  Instrumental variables: when program placement is correlated with participants' characteristics, correct the bias by replacing the variable characterizing program placement with another variable (an instrument).
96. The evaluation problem: recap
Evaluation question: what is the effect of a programme?
  Effect = Outcome A (with programme) - Outcome B (without programme)
  Problem: we only observe individuals that participate (outcome A) or do not participate (outcome B)... but never both A and B for everyone!

A) Propensity Score Matching
99. Single difference estimation
In this section we look at single difference methods that rely on the assumption of unconfounded assignment:
  Linear regression (OLS)
  Matching methods
Typically applied to cross-section data
Two crucial assumptions:
  1. We observe the factors that determine selection
  2. The intervention causes no spill-over effects
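A minimal simulated sketch of single-difference estimation by OLS under assumption 1 (selection depends only on observed covariates); y, T, x1 and x2 are hypothetical:

* selection into treatment depends only on the observed x1 and x2
clear
set seed 42
set obs 800
gen x1 = rnormal()
gen x2 = rnormal()
gen T  = runiform() < invlogit(0.5*x1 + 0.5*x2)
gen y  = 1 + 2*T + x1 + x2 + rnormal()   // true effect = 2
regress y T x1 x2                        // coefficient on T should be close to 2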
103. Intuition of Matching: The Perfect Clone
[Figure: each unit in the treatment group is paired with the most similar unit in the comparison group.]
Matching identifies a control group that is as similar as possible to the treatment group!
116. Which neighbour?
We can match to more than one neighbour:
  5 nearest neighbours? Or more?
  Radius matching: all neighbours within a specific range
  Kernel matching: all neighbours, but close neighbours receive larger weight than far neighbours.
Best approach?
  Look at the sensitivity of the results to the choice of approach
How many neighbours?
  Using more information reduces bias
  Using more control units than treated increases precision
  But using control units more than once decreases precision
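A minimal sketch of these choices using Stata's teffects psmatch on simulated data; kernel matching is not in official Stata and is usually done with community-contributed commands such as psmatch2:

* vary the number of nearest neighbours and impose a caliper
clear
set seed 7
set obs 1000
gen x1 = rnormal()
gen x2 = rnormal()
gen T  = runiform() < invlogit(x1 + x2)
gen y  = 1 + 2*T + x1 + x2 + rnormal()
teffects psmatch (y) (T x1 x2, logit), atet                 // 1 nearest neighbour
teffects psmatch (y) (T x1 x2, logit), atet nneighbor(5)    // 5 nearest neighbours
teffects psmatch (y) (T x1 x2, logit), atet caliper(0.05)   // only matches within 0.05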
120. Propensity Score Matching
What are propensity scores for?
  We want to know the effect of something
  We do not have random assignment
  We do have data on the pre-project characteristics that determined whether or not individuals received the treatment
Example:
  An NGO has built clinics in several villages
  Villages were not selected randomly
  We have data on village characteristics before the project was implemented
  What is the effect of the project on infant mortality?
121. Propensity Score Matching
What is the effect of the project on infant mortality?

T        imrate
treated  10
treated  15
treated  22
treated  19
control  25
control  19
control  4
control  8
control  6

The easiest and most straightforward answer is to compare average mortality rates in the two groups:
  (10+15+22+19)/4 - (25+19+4+8+6)/5 = 16.5 - 12.4 = 4.1
What does this mean? Does it mean that clinics have increased infant mortality rates? NO!
The pre-project characteristics of the two groups are very important for answering this question.
122. Propensity Score Matching

T        imrate  povrate  pcdocs
treated  10      0.5      0.01
treated  15      0.6      0.02
treated  22      0.7      0.01
treated  19      0.6      0.02
control  25      0.6      0.01
control  19      0.5      0.02
control  4       0.1      0.04
control  8       0.3      0.05
control  6       0.2      0.04

How similar are the treated and control groups?
On average, the treated group has a higher poverty rate and fewer doctors per capita.
123. Propensity Score Matching
The Basic Idea
1. Create a new control group
  For each observation in the treatment group, select the control observation that looks most like it based on the selection variables (aka background characteristics)
2. Compute the treatment effect
  Compare the average outcome in the treated group with the average outcome in the new control group
124. Propensity Score Matching
Exercise: take povrate and pcdocs one at a time to match the treated observations with controls; then take the two together. What do you observe?

S. No  T        imrate  povrate  pcdocs  Match using povrate  Match using pcdocs
1      treated  10      0.5      0.01
2      treated  15      0.6      0.02
3      treated  22      0.7      0.01
4      treated  19      0.6      0.02
5      control  25      0.6      0.01
6      control  19      0.5      0.02
7      control  4       0.1      0.04
8      control  8       0.3      0.05
9      control  6       0.2      0.04
125. Propensity Score Matching
Predicting Selection
What is a propensity score?
  The propensity score is the conditional probability that an individual receives the treatment.
Which model do we use to estimate propensity scores?
[Figure: treatment indicator Ti (0/1) plotted against X, with a fitted S-shaped logit/probit curve.]
  A linear model cannot be used, because predicted scores could fall below 0 or above 1,
  So we use a limited dependent variable model, logit or probit, as indicated in the graph.
We rely on two conditions, the CIA and the propensity score theorem:
  CIA (conditional independence assumption): outcomes are independent of treatment assignment given xi,
  The propensity score theorem: if outcomes are independent of treatment assignment given xi, they are also independent of treatment assignment given the propensity score p(xi).
126. Propensity Score Matching
Predicting Selection
How do we actually match treatment observations to control observations?
  In Stata, we use logistic or probit regression to predict:
    Prob(T = 1 | X1, X2, ..., Xk)
  In our example, the X variables are povrate and pcdocs
  So we run a logistic regression and save the predicted probability of treatment
  We call this the propensity score
  The commands are:
    logistic T povrate pcdocs
    predict ps1            // or any name you want for the propensity score
127. Propensity Score Matching

S. No  T        imrate  povrate  pcdocs  ps1
1      treated  10      0.5      0.01    0.4165713
2      treated  15      0.6      0.02    0.7358171
3      treated  22      0.7      0.01    0.9284516
4      treated  19      0.6      0.02    0.7358171
5      control  25      0.6      0.01    0.7527140
6      control  19      0.5      0.02    0.3951620
7      control  4       0.1      0.04    0.0016534
8      control  8       0.3      0.05    0.0268030
9      control  6       0.2      0.04    0.0070107

ps1 is the predicted probability of treatment, i.e., the propensity score.
Exercise: use the propensity score to match each treated observation with its nearest control, then find the average treatment effect on the treated:
  ((10+15+22+19)/4) - ((19+25+25+25)/4) = 16.5 - 23.5 = -7
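A minimal do-file sketch reproducing this toy example end to end; the point estimate matches the -7 computed above, though standard errors are not meaningful with nine observations:

* toy clinic example: propensity scores and the ATT by 1-nearest-neighbour matching
clear
input str7 group imrate povrate pcdocs
"treated" 10 0.5 0.01
"treated" 15 0.6 0.02
"treated" 22 0.7 0.01
"treated" 19 0.6 0.02
"control" 25 0.6 0.01
"control" 19 0.5 0.02
"control"  4 0.1 0.04
"control"  8 0.3 0.05
"control"  6 0.2 0.04
end
gen T = group == "treated"
logistic T povrate pcdocs
predict ps1                      // propensity score, as in the table above
teffects psmatch (imrate) (T povrate pcdocs, logit), atet   // ATT = -7 on these data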
128. Propensity Score Matching
How do we know how well the matching worked?
1. Look at covariate balance between the treated group and the new control group. They should be similar.
2. Compare the distributions of propensity scores in the treated group and the new control group. They should be similar.
3. Compare the distributions of propensity scores in the treated group and the original control group. If the two do not overlap much (little common support), matching might not work very well.
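Minimal checks along these lines, continuing from the toy do-file sketched above:

* 1-2: covariate and propensity-score balance by treatment status
bysort T: summarize povrate pcdocs ps1
* 3: visual check of overlap (common support) between the groups
histogram ps1, by(T)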
134-138. Identifying Assumption
Whatever happened to the control group over time is what would have happened to the treatment group in the absence of the program.
[Figure, built up across slides 134-138: pre- and post-intervention outcome trends for the treatment (T) and control (C) groups.]
  Using only pre- and post-data from the T group estimates the program effect while ignoring the general time trend.
  Using only the T vs. C comparison post-intervention ignores pre-existing differences between the T and C groups.
  The difference-in-difference estimate takes into account both the pre-existing differences between T and C and the general time trend.
140. Differences-in-Differences
First application of DID: John Snow (1855)
  Cholera epidemic in London in the mid-nineteenth century
  Prevailing theory: "bad air"
  Snow's hypothesis: contaminated drinking water
  He compared death rates from cholera in districts served by two water companies:
    In 1849 both companies obtained water from the dirty Thames
    In 1852, one of them moved its waterworks upriver to an area free of sewage
  Death rates fell sharply in the districts served by this water company!
146. Difference-in-Differences
The DID estimate is clearly presented in a 3 x 3 table (treated/control by pre/post, with differences in the margins)
All we need to estimate is 4 averages:
  Non-parametric regression
  Take differences and double differences
147. Difference-in-Difference
When can we use diff-in-diff?
  We want to evaluate a program, project or intervention,
  We have treatment and control groups,
  We observe them before and after the treatment.
But:
  Treatment is not random,
  Other things are happening while the project is in effect,
  We cannot control for all the potential confounders.
Key assumption:
  The trend in the control group approximates what would have happened in the treatment group in the absence of the treatment.
148. Difference-in-Difference
Assume that there was a free lunch program in Place A, and that the free lunch was expected to improve student outcomes.
[Figure: outcomes in Place A (treated) and Place B (control) in 2008 and 2010. D1 = change in A over time; D2 = change in B over time; D3 = gap between A and B in 2010; D4 = gap between A and B in 2008.]
  D1 = Dprogram + Dtrend, and D3 = Dprogram + Ddifference due to factors other than the program,
  so DID = D1 - D2 = D3 - D4 = Dprogram.
149. Difference-in-Difference

Y            Pre (2008)  Post (2010)  Diff
Treated (A)  20          90           70
Control (B)  30          70           40
Diff         -10         20           30

[Figure: scores for Treated (A) and Control (B) in 2008 (pre) and 2010 (post), on a 0-100 scale.]
150. Difference-in-Difference: data

Name  Y (score)  Dtr  Dpost
1     40         0    0
2     80         1    1
3     20         0    0
4     100        1    1
5     30         0    0
6     0          1    0
7     60         0    1
8     40         1    0
9     60         0    1
10    90         0    1

Dtr = dummy variable equal to 1 if the individual is in the treated group (A)
Dpost = time dummy equal to 1 if the individual takes the test in 2010 (post)
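A minimal sketch estimating the DID effect by regression on these data; the coefficient on the interaction term reproduces the double difference of 30 from the table above:

* DID regression: score = b0 + b1*Dtr + b2*Dpost + b3*(Dtr*Dpost) + e
clear
input score Dtr Dpost
40 0 0
80 1 1
20 0 0
100 1 1
30 0 0
0 1 0
60 0 1
40 1 0
60 0 1
90 0 1
end
gen DtrXpost = Dtr * Dpost
regress score Dtr Dpost DtrXpost   // coefficient on DtrXpost = 30, the DID estimate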