2. Monitoring and Evaluation
The Current Context:
  increasing calls for results and value for money
  shrinking budgets
  yet increasing complexity in program/project settings
So, what do monitoring and evaluation (M&E) have to do with these issues?
We need monitoring & evaluation to:
  identify projects that better meet needs and lead to improvements in targeted social, economic and environmental conditions
  improve decision-making
  achieve more of the intended outcomes
tell stories clearly in an evidence-based manner
enhance organizational learning
3. Definitions of Monitoring and Evaluation
Monitoring is an internal project activity.
It is the continuous assessment of project
implementation in relation to:
schedules
input use
infrastructure
services
Monitoring:
provides continuous feedback
identifies actual or potential successes/
problems
4. What’s more important?
  Knowing where you are on schedule?
  Knowing where you are on budget?
  Knowing where you are on work accomplished?
5. Definitions of Monitoring and Evaluation
Monitoring is very important in project planning and
implementation. It is like watching where you are
going while riding a bicycle; you can adjust as you go
along and ensure that you are on the right track.
Monitoring provides information that is useful for:
  Exploring the situation in the community and its projects;
  Determining whether the inputs to the project are well utilized;
  Identifying problems facing the community or project and finding solutions;
  Ensuring all activities are carried out properly, by the right people, and on time;
  Carrying lessons from one project experience over to another; and
  Determining whether the way the project was planned is the most appropriate way of solving the problem at hand.
6. Definitions of Monitoring and Evaluation
High-quality monitoring information encourages timely decision-making, ensures project accountability, and provides a robust foundation for evaluation and learning.
It is through the continuous monitoring of project
performance that you have an opportunity to learn
about what is working well and what challenges are
arising.
Job descriptions of staff involved in managing and
implementing projects should include assigned M&E
responsibilities.
7. Definitions of Monitoring and Evaluation
Evaluation is a systematic, objective, deliberate, purposeful, critical, trustworthy and ethical assessment of a project/program.
It assesses: relevance, coherence, efficiency, effectiveness, impact and sustainability.
  Interim evaluations: a first review of progress, a prognosis of likely effects, and a way to identify necessary adjustments in project design
  Terminal evaluations: evaluate the project's effects and potential sustainability (done at the end, for project completion reports)
Evaluation is:
"A periodic assessment of the relevance, performance, efficiency, effectiveness, impact and sustainability of a project in the context of stated objectives. It is usually undertaken as an independent examination with a view to drawing lessons that may guide future decision-making" (European Commission).
8. Definitions of Monitoring and Evaluation
Evaluation means information on:
Strategy
Whether we are doing the right things
– Rationale/justification
– Clear theory of change
Operation
Whether we are doing things right
– Effectiveness in achieving expected outcomes
– Efficiency in optimizing resources
– Client satisfaction
Learning
• Whether there are better ways of doing it
– Alternatives
– Best practices
– Lessons learned
9. Definitions of Monitoring & Evaluation
Monitoring: What are we doing?
Tracking inputs and outputs to assess whether
programs are performing according to plans
Evaluation: What have we achieved?
Attributing changes in outcomes to a particular
program/intervention requires one to rule out all
other possible explanations.
10. Definitions of Monitoring and Evaluation
Differences between monitoring and evaluation

                     Monitoring                        Evaluation
Frequency            Regular                           Episodic
Main action          Keeping track, oversight          Assessment, more analytical
Basic purpose        Improve efficiency,               Improve effectiveness, impact,
                     adjust work plan                  future programming
Focus                Inputs, outputs, process          Effectiveness, relevance,
                     outcomes, work plans              impact, cost-effectiveness
Information sources  Routine systems, field            Same as for monitoring, plus
                     observations, progress            surveys and studies
                     reports, rapid assessments
Undertaken by        Program managers, community       Program managers, supervisors,
                     workers, community                funders, external evaluators,
                     (beneficiaries), supervisors,     community (beneficiaries)
                     funders
Reporting to         Program managers, community       Same as for monitoring, plus
                     workers, community                policy makers
                     (beneficiaries), supervisors,
                     funders

Source: UNICEF (1991: 4)
13. Functions of M&E in the project cycle
Tests the soundness of a project's objectives and can lead to improvements in project design, through the process of selecting indicators for monitoring and through the use of design tools
Incorporates the views of stakeholders, as ownership brings mutual accountability:
  Reinforces ownership and highlights emerging problems through benefits recorded early in project implementation
  Shows the need for mid-course corrections
15. Guiding Principles of M&E
M&E is guided by the following key principles:
1. Systematic Inquiry – Staff conduct site-based inquiries
that gather both quantitative and qualitative data in a
systematic and high-quality manner
2. Honesty/Integrity – Staff display honesty and integrity in
their own behavior and contribute to the honesty and
integrity of the entire M&E process
3. Respect for People – Staff respect the security, dignity,
and self-worth of respondents, program participants,
clients, and other M&E stakeholders
4. Responsibilities to Stakeholders – Staff members
articulate and take into account the diversity of different
stakeholders’ interests and values that are relevant to
project M&E activities
16. Issues to consider before undertaking M&E
Before conducting an evaluation, it is crucial to ask:
  Why do the evaluation?
  How will the results be used?
  Who will be influenced by the findings?
Answers to these questions should determine:
  The process of evaluation
  The steps that need to be taken before quantitative data collection in the field is contemplated
  The designs of (quantitative) evaluations to meet the different needs of decision makers
17. What questions will the evaluation seek to answer?
About outcomes/impacts
What do people do differently as a result of the
program?
Who benefits and how?
Are the program’s accomplishments worth the
resources invested?
What are the strengths and weaknesses of the
program?
What, if any, are unintended secondary
consequences?
How well does the program respond to the
initiating need?
18. What questions will the evaluation seek to answer?
About program context
How well does the program fit in the local setting?
What in the socio-economic-political environment
inhibits or contributes to program success?
Who else works on similar concerns? Is there
duplication?
19. Participatory Monitoring and Evaluation
“It is a process of collaborative problem-solving through the
generation and use of knowledge. It is a process that leads to
corrective action by involving all levels of stakeholders in shared
decision-making.”
It is a collaborative process in which stakeholders at different levels work together to assess a project or policy and to take any corrective measures required.
The stakeholder groups typically involved in participatory M&E include end users of the project, NGOs, private-sector businesses involved in the project, and government staff.
Key Principles:
  Local people are active participants, not just sources of information
  Stakeholders evaluate, outsiders facilitate
  Focus on building stakeholder capacity for analysis and problem-solving
  The process builds commitment to implementing any recommended corrective actions
20. Participatory Monitoring and Evaluation
Methods in Participatory Monitoring and Evaluation
1. Stakeholder workshops: to bring together government officials, project management, and other stakeholders
2. Participatory methods such as participatory rural appraisal, SARAR (Self-esteem, Associative strengths, Resourcefulness, Action planning, Responsibility) and Beneficiary Assessment
  Participatory rural appraisal: visual methods, often used to analyze "before and after" situations through community mapping, problem ranking, wealth ranking, seasonal and daily time charts, and other tools.
3. Self-assessment methods: interviewing, focus group discussions
21. Participatory Monitoring and Evaluation

        Conventional M&E                   Participatory M&E
Who?    External experts                   Stakeholders, including communities and
                                           project staff; outsiders facilitate
What?   Predetermined indicators, to       Indicators identified by stakeholders, to
        measure inputs and outputs         measure process as well as outputs or outcomes
How?    Questionnaire surveys, by outside  Simple, qualitative or quantitative methods,
        "neutral" evaluators, distanced    by stakeholders themselves
        from the project
Why?    To make the project and staff      To empower stakeholders to take
        accountable to the funding agency  corrective action
24. Components of M&E design
Five components of good M&E design during project preparation help ensure M&E is relevant and used to good effect:
1. Clear statements of measurable objectives for
which indicators can be defined.
2. A structured set of indicators covering outputs of
goods and services generated by the project and
their impact on beneficiaries.
3. Provisions for collecting data and managing
project records
4. Institutional arrangements for gathering, analyzing
and reporting project data as well as investing in
capacity building
5. Proposals for how findings will be an input in
decision-making.
25. Defining project objectives and measuring them with M&E indicators
Problem analysis to structure objectives
Stakeholders identify causes and effects of
problems before defining objectives structured to
resolve these problems
Objectives should be:
Specific to the project interventions
Realistic in the timeframe
Measurable for evaluation
It is important to ask the following questions so that objectives are defined more precisely:
  How can the objectives be measured?
  How can the components of M&E lead to those objectives?
26. The logframe/ZOPP approach
Indicators need to be structured to match the analysis of
problems the project is trying to overcome
Logical framework/logframe/ZOPP approach:
Is used to define inputs, outputs, timetables, success
assumptions and performance indicators
Postulates a hierarchy of objectives for which
indicators are required
Identifies problems the project cannot deal with
directly (risks)
27. The logframe/ZOPP approach
[Figure: GTZ (GIZ) impact model]
Attribution gap:
  Caused by the existence of too many other significant factors
  Cannot be plausibly spanned using a linear, causal bridge
Source: Douthwaite et al. (2003: 250)
28. Impact pathways evaluation
Based on program theory evaluation and the logframe
  An explicit theory or model of how a project will bring about, or has brought about, impact
  Consists of a sequenced hierarchy of outcomes
  Represents a set of hypotheses about what needs to happen for the project output to be transformed over time into impact on highly aggregated development indicators
  Can be highly complementary to conventional assessments
Advantages of this approach:
  Consideration of wider impact helps achieve impact
  Complements conventional economic assessment
29. Impact pathways evaluation
Two main phases in impact pathway evaluation
1st phase: using program theory evaluation to guide self-monitoring and self-evaluation to establish the direct benefits of the project outputs in its pilot site(s).
  Task: to develop a theory or model of how the project sees itself achieving impact (called an impact pathway)
  Identifies the steps the project should take to scale out and scale up
    Scale-out: the innovation spreads from farmer to farmer, within the same stakeholder groups
    Scale-up: institutional expansion from grassroots organizations to policymakers, donors, development institutions, and other stakeholders, to build an enabling environment for change
30. Impact pathways evaluation
Answers to the following questions are recorded in a
matrix for each identified outcome in the impact pathway:
What would success look like?
What are the factors that influence the achievement
of each outcome?
Which of these can be influenced by the project?
What is the program currently doing to address these
factors to bring about this outcome?
What performance information should we collect?
How can we gather this information?
31. Impact pathways evaluation
2nd phase in impact pathway evaluation: An
independent ex-post impact assessment is carried out
some time (normally several years) after the project
has finished
Begins by establishing the extent to which the
impact pathway was valid in the pilot site(s) and
the extent to which scaling occurred
An attempt to bridge the attribution gap, using
phase 1 results as a foundation
33. Impact pathways evaluation: case study
Striga hermonthica is a parasitic weed (the "witch weed") which infests nearly 21 million ha in SSA
A project running since 1999 has used participatory research approaches to develop locally adapted integrated Striga control
Project output: on-farm research to adapt and validate integrated Striga control options in farmers' fields
Project goal: improved livelihoods for the 100 million people in Africa affected by Striga
34. Impact pathways evaluation
[Figure: impact pathway for integrated Striga control. Shaded boxes = monitored outcomes; unshaded boxes = to be evaluated in a future ex-post impact assessment. Complemented by a program theory matrix (not shown).]
Source: Douthwaite et al. (2003: 253)
35. Impact pathways evaluation
Example of using the impact pathways approach: the Challenge Program on Water and Food
  Used an ex-ante participatory impact assessment analysis to demonstrate to donors how project outputs will lead to development outcomes and widespread impacts after the project has ended
  A useful approach, since for technologies measurable impact can take 20 years from the time basic research begins
36. Impact pathways evaluation: The Challenge Program on Water and Food
A project involving 50 different organizations and almost 200
organizations in 9 river basins
3 systems level research themes:
Crop water productivity improvement, water and people in
catchments, and aquatic ecosystems and fisheries
Basin level theme: integrated water basin management
systems
Global scale theme: global and national water and food
systems
Measuring impacts:
  It can take 10 years to move from basic research to useful technologies, and then another 10 years to see wide-scale impacts
  Thus, an ex ante impact assessment approach is used to demonstrate to donors HOW project outputs WILL lead to development outcomes and widespread impacts after the end of the projects that developed them.
  Ex ante impact assessment also provides a solid base for a later ex post impact assessment
37. Classification axes: the indicator axis
A conceptual framework to help guide the design of an evaluation, from Habicht et al. (1999)
An evaluation may be aimed at one or more categories of decision makers, so the design must take into account their different needs
The first axis refers to the indicators: whether one is evaluating the performance of the intervention delivery or its impact on indicators
The second axis refers to the type of inference to be drawn: the level of confidence the decision maker needs that any observed effects were due to the intervention
38. Classification axes: the indicator axis
Indicators of provision, utilization, coverage and impact: what is to be evaluated, and what type of information is to be sought?
Outcomes of interest ("indicators"):
1. Provision: services must be provided (available and accessible to the target population, and of adequate quality)
2. Utilization: the population must accept and make use of the services
3. Coverage: utilization will result in a given population coverage, which represents the interface between service delivery and outreach to the population
4. Impact: coverage may lead to an impact
Choose indicators based on the decision makers and on cost
39. Classification axes: the indicator axis
If a weak link is discovered, investigate why
An impact can be expected only when the correct service is provided in a timely manner and is properly utilized by a sufficiently large number of beneficiaries
Example: a project offering loans to smallholders with the objective of increasing fertilizer use:
1. Provision: measure the availability of the loans to smallholders,
2. Utilization: measure the disbursement of the loans to smallholders,
3. Coverage: measure the proportion of smallholders that have been able to take out a new loan, and
4. Impact: measure the impact of the project on fertilizer use.
40. Classification axes: the indicator axis
Example of indicators for evaluating a fertilizer distribution program (objective: increase yields)

Indicator    Question                                  Example of indicators
Provision    Are the services available?               Number of cooperatives offering fertilizer per 1,000 population
             Are the services accessible?              Proportion of farmers within 10 km of a cooperative offering fertilizer
             Is the quality of the service adequate?   Number of days in a year when fertilizer is available
Utilization  Are the services being used?              Proportion of farmers buying fertilizer
Coverage     Is the target population being reached?   Proportion of all farmers who want to buy fertilizer who have bought fertilizer
Impact       Were there improvements in yields as a    Change in yield attributable to the fertilizer distribution program
             result of the fertilizer program?
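A minimal sketch (not from the slides) of how the provision, utilization and coverage indicators above could be computed from a farmer survey; the data are simulated and all variable names are hypothetical:

* simulated farmer survey for illustration only
clear
set seed 123
set obs 500
gen dist_coop   = runiform()*30              // km to nearest cooperative offering fertilizer
gen wants_fert  = runiform() < 0.7           // farmer wants to buy fertilizer
gen bought_fert = wants_fert & (runiform() < 0.6)
gen within10km  = dist_coop <= 10
summarize within10km                         // provision proxy: share within 10 km
summarize bought_fert                        // utilization: share of all farmers buying
summarize bought_fert if wants_fert          // coverage: buyers among those who want to buy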
41. 3.2.2 Types of inference
The second classification axis: how confident decision makers need to be that any observed effects are due to the project or program.
Both performance and impact evaluations may draw on adequacy, plausibility or probability assessments as the type of inference.
42. 3.2.2 Types of inference
There are 3 types of statements, reflecting different degrees of confidence end-users may require from the evaluation results:
1) Adequacy assessment:
  Determines whether some outcome occurred as expected,
  Relevant for evaluating process indicators (provision, utilization, coverage),
  For this, no control group is needed.
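A minimal sketch of such a check in Stata: compare an observed process indicator against a predefined adequacy criterion, with no control group; the variable bought_fert and the 60% target are hypothetical:

* adequacy check: is utilization consistent with the predefined 60% target?
clear
set seed 456
set obs 400
gen bought_fert = runiform() < 0.55    // hypothetical survey indicator
prtest bought_fert == 0.60             // one-sample test against the criterion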
43. 3.2.2 Types of inference
2) Plausibility assessment:
  Permits determination of whether the change can be attributed to the project,
  A control group needs to be used, internal or external,
  Note: selection bias (control groups often do not exhibit identical characteristics to the beneficiary group).
44. 3.2.2 Types of inference
3) Probability assessment:
  Ensures there is a small, known probability that differences between project and control areas were due to confounding, systematic bias or chance.
45. 3.2.2 Types of inference: Adequacy
1) Adequacy assessment: Did expected changes occur?
  Compares the performance or impact of the project with previously established criteria (in absolute or relative terms),
  Assesses how well project activities have met expected objectives,
  The evaluation may be cross-sectional (carried out once during or at the end of the project) or longitudinal (to detect trends, requiring baseline data and repeated measurements),
  Can use secondary data, which reduces costs
46. 3.2.2 Types of inference: Adequacy
1) Adequacy assessment: Did expected changes occur?
  Adequacy designs cannot causally link project activities with observed changes:
    They may show a lack of change in the indicators,
    But this does not automatically mean that the project was not effective.
  Even so, for many decision makers more complex evaluation designs will not be required, particularly since these would demand additional time, resources and expertise.
47. 3.2.2 Types of inference: Adequacy
Characteristics of adequacy assessment evaluations (measurements compared to predefined adequacy criteria; inference: objective met)

Performance (provision, utilization, coverage):
  Measurements: project activities
  In whom: implementation workers, project recipients
  Adequacy criterion: activities being performed as planned in the initial implementation schedule
  Cross-sectional: measured once, against an absolute value
  Longitudinal: change over time, against absolute and incremental values

Impact:
  Measurements: indicators
  In whom: project recipients or target population
  Adequacy criterion: observed change in behavior is of the expected direction and magnitude
  Cross-sectional: measured once, against an absolute value
  Longitudinal: change over time, against absolute and incremental values
48. 3.2.2 Types of inference: Plausibility
2) Plausibility assessment: Did the project seem to have an effect above and beyond external influences?
  Plausible: reasonable or probable,
  Goes beyond adequacy assessments by ruling out "confounding factors" (external factors),
  Mostly ex-post research designs, with longitudinal or cross-sectional samples,
  Control groups may be chosen before the evaluation starts or afterwards, during the analysis of the data
49. 3.2.2 Types of inference: Plausibility
2) Plausibility assessment: Did the project seem to have an effect above and beyond external influences?
  Several alternatives for a control group (these can be combined):
    Historical: the same target institutions/population
    Internal: institutions/geographical areas/individuals that should have received the full intervention, but did not (options: dose-response relation, case-control method)
    External: one or more institutions/areas without the project.
50. 3.2.2 Types of inference: Plausibility
Essential elements to establish the plausibility of impact:
The source of the impact being investigated,
The model/concept of impact used and how it
applies to the case at hand (such as the
logframe/ZOPP approach),
Objectives, limitations and attribution gap of the
impact
Theory of action on which the intervention or
strategy has been based,
Impact hypotheses (statements about the expected
impact),
Other factors that could have affected observed
changes and hypotheses,
Other informed opinions that support and contest
the study findings (views of beneficiaries and target
groups are particularly important).
Source: Baur et al. (2001:6)
51. 3.2.2 Types of inference: Plausibility
Control groups in plausibility assessments may include:
a) Historical control group: compare the change from before to after the project in the same target population, with an attempt to rule out external factors (a before-after longitudinal survey, or "panel")
b) Internal control group: individuals/areas that should have received the full intervention but did not, either because they could not be reached by the project or because they refused. Three subtypes, the first being:
  Compare the treatment group with a group not receiving the service
Sources: Habicht et al., 1997 and Schlesselman, 1982
52. 3.2.2 Types of inference: Plausibility
Control groups in plausibility assessments may include:
b) Internal control group, remaining subtypes:
  Compare the group with full treatment to several control groups that differ in uptake of the service (dose-response design),
  Case-control method: compare previous exposure to the project between those with and without the disease (or other outcome of interest).
c) External control group: one or more areas without the project. The comparison may be cross-sectional or longitudinal-control.
Sources: Habicht et al., 1997 and Schlesselman, 1982
53. 3.2.2 Types of inference: Plausibility
Intervention and control groups are supposed to be similar in all relevant characteristics except exposure to the intervention... yet in reality this is almost never true.
Why?
  Comparison groups can be influenced by confounding factors that do not affect the other groups as much.
How to reconcile?
  Measure probable confounders and account for them statistically,
  For historical controls: general socioeconomic development is a key confounder to consider.
So, is it plausible?
  Assessments encompass a continuum, ranging from weak to strong statements... but one cannot completely rule out all alternative explanations for the observed differences.
54. 3.2.2 Types of inference: Plausibility
Example: increasingly stronger plausibility assessments with different control groups:
  Diarrhea mortality fell rapidly in areas with control of diarrheal disease (CDD) interventions,
  Diarrhea mortality did not fall in areas without the CDD interventions (so the decline was not due to general changes in diarrhea in the area),
  Changes in other known determinants of mortality could not explain the observed decline,
  There was an inverse association between the intensity of the intervention in the project areas and diarrhea mortality,
  Mothers with knowledge of oral rehydration therapy (ORT) had fewer recent child deaths than those without such knowledge,
  Mortality among non-participants in the project area was similar to that in the control area.
Source: Habicht et al., 1997
55. 3.2.2 Types of inference: Plausibility
Characteristics of plausibility evaluations (measurements compared to a non-random control group)

Performance (provision, utilization, coverage):
  Measurement: program activities
  In whom: implementation workers, program recipients
  Inference: the intervention group appears to have better performance than the control group
  Cross-sectional: measured once, against the control group
  Longitudinal: change over time, before-after comparison
  Longitudinal-control: relative change, comparing before-after between intervention and control
56. 3.2.2 Types of inference: Plausibility
Characteristics of plausibility evaluations, continued (measurements compared to a non-random control group)

Impact:
  Measurement: indicators
  In whom: program recipients/target population
  Inference: changes appear to be more beneficial in the intervention group than in the control group
  Cross-sectional: measured once, against the control group
  Longitudinal: change over time, before-after comparison
  Longitudinal-control: relative change, comparing before-after between intervention and control
  Case-control: measured once in the target population, comparing exposure to the program between cases and controls
57. 3.2.2 Types of inference: Probability
3) Probability assessment: Did the project have an effect (p < 0.05)?
  Aims at ensuring there is only a small, known probability that the differences between the project and control areas were due to confounding, bias or chance
  Requires randomization of treatment and control activities across the comparison groups, so that the statistical statement is directly related to the intervention:
    Alternative? Obtain information on all observations (a census)!
    Randomization does not guarantee that all confounding is eliminated, yet it does ensure that the probability of confounding is measurable,
    Can be used even if the confounding factors are not known.
58. 3.2.2 Types of inference: Probability
Grouped errors
  Evaluation designs are often randomized over groups (not individuals), yet researchers use individual-level data,
  The error term may then not be independent across individuals within the same group,
  Can you think of any examples of when the error term may not be independent across individuals?
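A minimal sketch of the standard remedy, clustering standard errors at the unit of randomization; the village-level design and all variable names are hypothetical:

* randomization over villages, analysis on individuals:
* cluster-robust SEs allow for within-village correlation of errors
clear
set seed 789
set obs 20                              // 20 villages
gen village = _n
gen treat   = mod(village, 2)           // treatment assigned at the village level
gen vshock  = rnormal()                 // common village-level shock
expand 50                               // 50 individuals per village
gen y = 2 + 0.5*treat + vshock + rnormal()
regress y treat, vce(cluster village)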
59. 3.2.2 Types of inference: Probability
Difficulties which limit the use of probability assessments:
  The evaluator must be present at a very early stage of the project planning cycle to design the randomization,
  It is necessary to overcome political influence affecting the choice of where the intervention will take place,
  The stringencies of probability evaluations may result in situations artificially different from reality, and thus the evaluations may lack external validity.
Few experienced decision makers require measuring the effectiveness of every project through a probability design, but key individuals may have been trained to regard it as the "gold standard", and there are times when it is needed.
60. 3.2.2 Types of inference: Probability
Characteristics of probability evaluations (measurements compared to randomized control group(s); inference: the program has an effect, p < 0.05)

Performance (provision, utilization, coverage):
  Measurements: program activities
  In whom: implementation workers, program recipients
  Inference: the intervention group has better performance than the control group
  Longitudinal-control: relative change, comparing before-after between intervention and control

Impact:
  Measurements: behavioral indicators
  In whom: program recipients
  Inference: changes in behavior are more beneficial in the intervention group than in the control group
  Longitudinal-control: comparing before-after between intervention and control
61. 3.2.2 Types of inference: Probability
Null hypothesis (H0): program effect = 0
  The "no effect" or "no difference" case
Alternative hypothesis (H1): program effect ≠ 0
Alpha (type I error): the probability of saying there is a relationship when there actually is not (wrong)
Power: the probability of saying there is a relationship when there actually is one (correct)
Confidence level: the probability of saying there is no relationship when there is none (correct)
Beta (type II error): the probability of saying there is no relationship when there actually is one (wrong)
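A minimal sketch of how these quantities drive sample size, using Stata's power command; the 0.25 standard-deviation effect size is illustrative, not from the slides:

* required sample size to detect a 0.25 SD difference in means
* with alpha = 0.05 and power = 0.80 (i.e., a beta error of 0.20)
power twomeans 0 0.25, sd(1) alpha(0.05) power(0.8)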
62. 3.2.2 Types of inference: Probability

                      We reject H0         We fail to reject H0
H0 in reality true    Type I error         No error made
                      (chance is alpha)
H0 in reality false   No error made        Type II error
                                           (chance is beta)

H0: no effect; H1: effect
Reject H0 if the p-value is < alpha (e.g., 0.05); fail to reject H0 if the p-value is ≥ alpha.
63. 3.2.3 Internal and External Validity: 3.2.3.1 Internal Validity
Internal validity: the credibility or reliability of an estimate of project impact, conditional on the context in which it was carried out.
  It requires a strong justification that causally links the independent variables to the dependent variables, with the ability to eliminate confounding variables within the study
  Laboratory "true experiments" have high internal validity, but may have weak external validity
  Focus: whether observed changes can be attributed to the program and not to other possible causes.
Sources: Brewer (2000); Shadish et al. (2002)
64. 3.2.3.1 Internal Validity
Inferences possess internal validity if a causal relation between two variables is properly demonstrated.
When a researcher can confidently attribute the observed changes or differences in the dependent variable to the independent variable, and can rule out other explanations, then the causal inference is internally valid.
A causal inference is internally valid if 3 criteria are satisfied:
  The cause precedes the effect in time
  The cause and effect are related
  There are no plausible alternative explanations for the observed covariation.
Sources: Brewer (2000); Shadish et al. (2002)
65. 3.2.3.1 Internal Validity
Threats to internal validity
  History: Did a concurrent event affect the change in Y?
    Example: in a short experiment designed to investigate the effect of computer-based instruction, students missed some instruction because of a power failure at the school.
  Maturation: Were changes in Y due to normal developmental processes?
    Example: the performance of first graders in a learning experiment begins decreasing after 45 minutes because of fatigue.
  Statistical regression: Were differences between the two groups that could influence Y controlled for?
    Example: in an experiment involving reading instruction, subjects grouped because of poor pre-test reading scores show considerably greater gain than do the groups who scored average and high on the pre-test.
Source: CSULB
66. 3.2.3.1 Internal Validity
Threats to internal validity
  Selection: refers to selecting participants for the various groups in the study. Are the groups equivalent at the beginning of the study? Were subjects self-selected?
    Example: the experimental group in an instructional experiment consisted of a high-ability class, while the comparison group was an average-ability class.
  Experimental mortality: Did some subjects drop out?
    Example: in a health experiment designed to determine the effect of various exercises, those subjects who find the exercise most difficult stop participating.
  Testing: Did a pre-test affect the post-test?
    Example: in an experiment in which performance on a logical reasoning test is the dependent variable, a pre-test gives the subjects clues about the post-test.
Source: CSULB
67. 3.2.3.1 Internal Validity
Threats to internal validity
  Instrumentation: Did the measurement method change during the research?
    Example: two examiners for an instructional experiment administered the post-test with different instructions and procedures.
  Design contamination: Did the control group find out about the experimental treatment, or were they otherwise influenced?
    Example: in an expectancy experiment, students in the experimental and comparison groups "compare notes" about what they were told to expect.
Source: CSULB
68. 3.2.3.2 External Validity
External validity: the credibility or reliability of an estimate of project impact when applied to a context different from the one in which the evaluation was carried out.
  It means the ability to generalize the study to other settings,
  Participatory methods have little, if any, external validity,
  Inferences about a causal relation between two variables have external validity if they may be generalized from the unique and idiosyncratic settings, procedures and participants to other populations and conditions,
  Factors introduced to control internal validity may limit the external validity of the findings, as may relying on volunteers from a single geographic location.
69. 3.2.3.2 External Validity
Three frequently occurring issues which threaten the validity of a randomized experiment or a quasi-experiment are:
  Attrition
    When some members of the treatment and/or control group drop out from the sample
  Spillover
    When the program impact is not confined to the program participants
    Especially a concern when the program impacts many people or when the program provides a public good
  Noncompliance
    When some members of the treatment group do not receive the treatment or receive it improperly
70. 3.2.4 Combining types of indicators & inference
Other factors influencing the choice of evaluation design:
  Efficacy of the intervention: often unknown,
  Sector of the project: different sectors require different outcomes to be measured and different degrees of certainty,
  Timing and timeliness,
  Magnitude of sampling: the number of persons and the distances between them,
    Based on the willingness of decision makers to be given erroneous results,
    Usual practice: alpha of 5% and a beta error of 20%,
  Costs.
71. 3.2.4 Combining types of indicators & inference
Example: possible evaluations of diarrheal disease control programs:

Adequacy:
  Provision: changes in the availability of ORT in health centers
  Utilization: changes in the number of ORT packets distributed in health centers
  Coverage: measurement of the % of all diarrheal episodes treated with ORT in the population
  Impact: measurement of trends in diarrhea mortality in the intervention area
Plausibility:
  Provision and utilization: same as above, but comparing intervention with control services
  Coverage: comparison of ORT coverage between intervention and control areas (or dose-response)
  Impact: comparison of diarrheal mortality trends between intervention and control areas (or dose-response)
Probability:
  Same as for plausibility, but the intervention and control services/areas would have been randomized
72. Impact Evaluation
Why are development programs and policies designed and implemented?
  Because change is needed in livelihood outcomes
To check whether these developmental outcomes are achieved, we should do impact evaluation
The common practice at the project or program level is to monitor and assess inputs and immediate outputs of a program
But for evidence-based policy making, rigorous impact evaluation is needed
So, the current move is towards measuring outcomes and impacts in addition to inputs and processes
Source: Gertler et al. (2011)
73. Impact Evaluation
Two categories: prospective and retrospective.
  Prospective evaluations: developed at the same time as the program is being designed, and built into program implementation
    Baseline data are collected prior to program implementation for both treatment and comparison groups
  Retrospective evaluations: carried out to assess program impact after the program has been implemented, generating treatment and comparison groups ex post
Source: Gertler et al. (2011)
74. Problems in Impact Evaluation
Causal Inference & the Problem of the
Counterfactual
Whether “X” (an intervention) causes “Y” (an outcome
variable) is very difficult to determine
The main challenge is to determine what would have
happened to the beneficiaries if the intervention had not
existed
Evaluation question: what is the effect of a programme?
  Effect = Outcome A (with programme) - Outcome B (without programme)
  Problem: we only observe individuals that participate (outcome A) or do not participate (outcome B)... but never both A and B for everyone!
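The same point in potential-outcomes notation, which is standard in the impact-evaluation literature rather than taken from the slide; write Y_i(1) for outcome A and Y_i(0) for outcome B:

\[
\Delta_i = Y_i(1) - Y_i(0), \qquad
\text{ATE} = E[Y(1) - Y(0)], \qquad
\text{ATT} = E[Y(1) - Y(0) \mid T = 1]
\]
\[
E[Y \mid T=1] - E[Y \mid T=0]
= \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}}
+ \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
\]

Randomization drives the selection-bias term to zero; the designs that follow are different ways of arguing it away.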
80. 3.2.4 Experimental and quasi-experimental designs
Randomization: use randomization to obtain the counterfactual; "the gold standard" to some:
  Eligible participants are randomly assigned to a treatment group, who will receive program benefits, while the control group consists of people who will not receive program benefits,
  The treatment and control groups are identical at the outset of the project, except for participation in the project.
Quasi-experimental designs: use statistical/non-experimental research designs to construct the counterfactual.
81. (a) Experimental designs: 4 methods of randomization in Randomized Controlled Trials (RCTs)
1) Oversubscription method:
  Units are randomly assigned to the treatment and control groups and everyone has an equal chance,
  Appropriate when there is no reason to discriminate among applicants and when there are limited resources or limited implementation capacity (demand > supply of the program),
  Example: in Colombia in the mid-1990s, a lottery design was used to distribute government subsidies, namely vouchers to partially cover the cost of private secondary school for eligible students.
2) Randomized order of phase-in:
  Randomize the timing of receiving the program,
  Appropriate when the program is designed to cover the entire eligible population and there are budget/administrative constraints.
Sources: Duflo et al., 2006; ADB, 2006
82. (a) Experimental designs: 4 methods of randomization in Randomized Controlled Trials (RCTs)
3) Within-group randomization:
  Provide the program to some subgroups in each area,
  One of its problems is that it increases the likelihood that the comparison group is contaminated.
4) Encouragement design:
  Offer incentives to a randomly selected group of people,
  Appropriate when everyone is eligible and there is enough funding,
  The remaining population without the incentives is used as the control group,
  Challenge: the probability of participating is not 1 or 0 under encouragement or lack of encouragement.
Sources: Duflo et al., 2006; ADB, 2006
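A minimal sketch of the oversubscription (lottery) method in Stata; the 500 applicants and 250 program slots are hypothetical:

* lottery assignment among eligible applicants (demand > supply)
clear
set seed 2025
set obs 500                     // hypothetical eligible applicants
gen applicant_id = _n
gen u = runiform()              // lottery draw
sort u
gen treat = _n <= 250           // the first 250 in random order get the program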
83. (a) Experimental designs
Why RCTs?
  To analyze whether an intervention had an impact, a counterfactual is needed, because it is hard to ask counterfactual questions directly (Ravallion, 2008),
  Randomization guarantees statistical independence of the intervention from preferences (observed and unobserved),
  Overcomes selection bias in who receives the intervention,
  Internal validity is high,
  Requires less demanding econometric approaches,
  Led by the MIT Poverty Action Lab and the World Bank,
  Criticized by others (see Rodrik, 2008).
84. (a) Experimental designs
Potential drawbacks of RCTs:
  The evaluator must be present at a very early stage,
  Intended random assignments can be compromised,
  External validity,
  Political influences on where to place the intervention,
  Site effects (aspects of a program's setting, such as geographic or institutional aspects, interact with the treatment),
  Tendency to estimate abstract efficacy,
  Impracticality of maintaining treatment and control groups,
  Not possible for some policy questions.
Sources: Barrett and Carter, 2010; Maredia, 2009; Ravallion, 2008
85. (a) Experimental designs
How can external validity be increased for RCTs?
  Increase information on the characteristics of the population and sampling area,
  Try to answer the following questions before applying results from another study to the study area you are interested in:
    How does the population in the study differ from the population I am interested in?
    How do supporting inputs differ?
    How do alternative activities differ? Does the studied intervention substitute for or complement existing opportunities?
Source: Morduch, J. (2013)
86. (a) Experimental designs
Potential for future research incorporating RCTs:
  Complement with qualitative research and mixed-method approaches,
  Combine with weighting or matching,
  Analyze the impact of feeder roads,
  Incorporate field experiments, e.g. on time and risk preferences, and skills,
"Ultimately, the goal for evaluation should be to help decide what to do in the future. This is both for donors who need to know where to put their money, for skeptics who want to see that programs can work, and for implementers who need to know how best to design their programs." (Karlan, 2009, p. 8)
87. (a) Experimental designs
Why don't more RCTs occur? (Pritchett, 2002)
1) True believers believe they already know both the important issue and the correct instrument to address it.
2) The outcomes could undermine budgetary support for the organization.
3) The problem of deciding which issue is most important and which is the best instrument to address it: ignorance helps maintain budget size.
88. (a) Experimental designs
Conclusions on RCTs:
  RCTs can provide evidence before expanding programs, but one needs to weigh the pros and cons,
  Sample size needs to be considered,
  Understanding the context beforehand is critical,
  Was the intervention the right one for the problem?
  It is best to combine RCTs with other methods.
89. Field experiments
  Key assumption: randomized assignment
  Treatment effects: individual and intention-to-treat
  Internal and external validity
  Examples and exercises
92. Example: PROGRESA comparison groups

                          Treatment villages    Control villages
Eligible households       A                     B
Non-eligible households   C                     D

  Impact effect: A - B
  Spill-over effects: C - D
95. (b) Quasi-experimental designs
Impact assessment techniques:
  Propensity score matching: match program participants with non-participants, typically using individual observable characteristics,
  Difference-in-differences/double difference: compare observed changes in the outcome before and after, for a sample of participants and non-participants,
  Regression discontinuity design: individuals just on the other side of the cut-off point serve as the counterfactual,
  Instrumental variables: when program placement is correlated with participants' characteristics, correct the bias by replacing the variable characterizing program placement with another variable (an instrument).
96. The evaluation problem: recap
Evaluation question: what is the effect of a programme?
  Effect = Outcome A (with programme) - Outcome B (without programme)
  Problem: we only observe individuals that participate (outcome A) or do not participate (outcome B)... but never both A and B for everyone!

A) Propensity Score Matching
99. Single difference estimation
In this section we look at single difference methods that rely on the assumption of unconfounded assignment:
  Linear regression (OLS)
  Matching methods
Typically applied to cross-section data
Two crucial assumptions:
  1. We observe the factors that determine selection
  2. The intervention causes no spill-over effects
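A minimal simulated sketch of single-difference estimation by OLS under assumption 1 (selection depends only on observed covariates); y, T, x1 and x2 are hypothetical:

* selection into treatment depends only on the observed x1 and x2
clear
set seed 42
set obs 800
gen x1 = rnormal()
gen x2 = rnormal()
gen T  = runiform() < invlogit(0.5*x1 + 0.5*x2)
gen y  = 1 + 2*T + x1 + x2 + rnormal()   // true effect = 2
regress y T x1 x2                        // coefficient on T should be close to 2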
103. Intuition of Matching: The Perfect Clone
[Figure: each unit in the treatment group is paired with the most similar unit in the comparison group.]
Matching identifies a control group that is as similar as possible to the treatment group!
116. Which neighbour?
We can match to more than one neighbour:
  5 nearest neighbours? Or more?
  Radius matching: all neighbours within a specific range
  Kernel matching: all neighbours, but close neighbours receive larger weight than far neighbours.
Best approach?
  Look at the sensitivity of the results to the choice of approach
How many neighbours?
  Using more information reduces bias
  Using more control units than treated increases precision
  But using control units more than once decreases precision
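A minimal sketch of these choices using Stata's teffects psmatch on simulated data; kernel matching is not in official Stata and is usually done with community-contributed commands such as psmatch2:

* vary the number of nearest neighbours and impose a caliper
clear
set seed 7
set obs 1000
gen x1 = rnormal()
gen x2 = rnormal()
gen T  = runiform() < invlogit(x1 + x2)
gen y  = 1 + 2*T + x1 + x2 + rnormal()
teffects psmatch (y) (T x1 x2, logit), atet                 // 1 nearest neighbour
teffects psmatch (y) (T x1 x2, logit), atet nneighbor(5)    // 5 nearest neighbours
teffects psmatch (y) (T x1 x2, logit), atet caliper(0.05)   // only matches within 0.05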
120. Propensity Score Matching
What are propensity scores for?
  We want to know the effect of something
  We do not have random assignment
  We do have data on the pre-project characteristics that determined whether or not individuals received the treatment
Example:
  An NGO has built clinics in several villages
  Villages were not selected randomly
  We have data on village characteristics before the project was implemented
  What is the effect of the project on infant mortality?
121. Propensity Score Matching
What is the effect of the project on infant mortality?

T        imrate
treated  10
treated  15
treated  22
treated  19
control  25
control  19
control  4
control  8
control  6

The easiest and most straightforward answer is to compare average mortality rates in the two groups:
  (10+15+22+19)/4 - (25+19+4+8+6)/5 = 16.5 - 12.4 = 4.1
What does this mean? Does it mean that clinics have increased infant mortality rates? NO!
The pre-project characteristics of the two groups are very important for answering this question.
122. Propensity Score Matching

T        imrate  povrate  pcdocs
treated  10      0.5      0.01
treated  15      0.6      0.02
treated  22      0.7      0.01
treated  19      0.6      0.02
control  25      0.6      0.01
control  19      0.5      0.02
control  4       0.1      0.04
control  8       0.3      0.05
control  6       0.2      0.04

How similar are the treated and control groups?
On average, the treated group has a higher poverty rate and fewer doctors per capita.
123. Propensity Score Matching
The Basic Idea
1. Create a new control group
  For each observation in the treatment group, select the control observation that looks most like it based on the selection variables (aka background characteristics)
2. Compute the treatment effect
  Compare the average outcome in the treated group with the average outcome in the new control group
124. Propensity Score Matching
Exercise: take povrate and pcdocs one at a time to match the treated observations with controls; then take the two together. What do you observe?

S. No  T        imrate  povrate  pcdocs  Match using povrate  Match using pcdocs
1      treated  10      0.5      0.01
2      treated  15      0.6      0.02
3      treated  22      0.7      0.01
4      treated  19      0.6      0.02
5      control  25      0.6      0.01
6      control  19      0.5      0.02
7      control  4       0.1      0.04
8      control  8       0.3      0.05
9      control  6       0.2      0.04
125. Propensity Score Matching
Predicting Selection
What is a propensity score?
  The propensity score is the conditional probability that an individual receives the treatment.
Which model do we use to estimate propensity scores?
[Figure: treatment indicator Ti (0/1) plotted against X, with a fitted S-shaped logit/probit curve.]
  A linear model cannot be used, because predicted scores could fall below 0 or above 1,
  So we use a limited dependent variable model, logit or probit, as indicated in the graph.
We rely on two conditions, the CIA and the propensity score theorem:
  CIA (conditional independence assumption): outcomes are independent of treatment assignment given xi,
  The propensity score theorem: if outcomes are independent of treatment assignment given xi, they are also independent of treatment assignment given the propensity score p(xi).
126. Propensity Score Matching
Predicting Selection
How do we actually match treatment observations to control observations?
  In Stata, we use logistic or probit regression to predict:
    Prob(T = 1 | X1, X2, ..., Xk)
  In our example, the X variables are povrate and pcdocs
  So we run a logistic regression and save the predicted probability of treatment
  We call this the propensity score
  The commands are:
    logistic T povrate pcdocs
    predict ps1            // or any name you want for the propensity score
127. Propensity Score Matching

S. No  T        imrate  povrate  pcdocs  ps1
1      treated  10      0.5      0.01    0.4165713
2      treated  15      0.6      0.02    0.7358171
3      treated  22      0.7      0.01    0.9284516
4      treated  19      0.6      0.02    0.7358171
5      control  25      0.6      0.01    0.7527140
6      control  19      0.5      0.02    0.3951620
7      control  4       0.1      0.04    0.0016534
8      control  8       0.3      0.05    0.0268030
9      control  6       0.2      0.04    0.0070107

ps1 is the predicted probability of treatment, i.e., the propensity score.
Exercise: use the propensity score to match each treated observation with its nearest control, then find the average treatment effect on the treated:
  ((10+15+22+19)/4) - ((19+25+25+25)/4) = 16.5 - 23.5 = -7
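A minimal do-file sketch reproducing this toy example end to end; the point estimate matches the -7 computed above, though standard errors are not meaningful with nine observations:

* toy clinic example: propensity scores and the ATT by 1-nearest-neighbour matching
clear
input str7 group imrate povrate pcdocs
"treated" 10 0.5 0.01
"treated" 15 0.6 0.02
"treated" 22 0.7 0.01
"treated" 19 0.6 0.02
"control" 25 0.6 0.01
"control" 19 0.5 0.02
"control"  4 0.1 0.04
"control"  8 0.3 0.05
"control"  6 0.2 0.04
end
gen T = group == "treated"
logistic T povrate pcdocs
predict ps1                      // propensity score, as in the table above
teffects psmatch (imrate) (T povrate pcdocs, logit), atet   // ATT = -7 on these data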
128. Propensity Score Matching
How do we know how well the matching worked?
1. Look at covariate balance between the treated group and the new control group. They should be similar.
2. Compare the distributions of propensity scores in the treated group and the new control group. They should be similar.
3. Compare the distributions of propensity scores in the treated group and the original control group. If the two do not overlap much (little common support), matching might not work very well.
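Minimal checks along these lines, continuing from the toy do-file sketched above:

* 1-2: covariate and propensity-score balance by treatment status
bysort T: summarize povrate pcdocs ps1
* 3: visual check of overlap (common support) between the groups
histogram ps1, by(T)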
134-138. Identifying Assumption
Whatever happened to the control group over time is what would have happened to the treatment group in the absence of the program.
[Figure, built up across slides 134-138: pre- and post-intervention outcome trends for the treatment (T) and control (C) groups.]
  Using only pre- and post-data from the T group estimates the program effect while ignoring the general time trend.
  Using only the T vs. C comparison post-intervention ignores pre-existing differences between the T and C groups.
  The difference-in-difference estimate takes into account both the pre-existing differences between T and C and the general time trend.
140. Differences-in-Differences
First application of DID: John Snow (1855)
  Cholera epidemic in London in the mid-nineteenth century
  Prevailing theory: "bad air"
  Snow's hypothesis: contaminated drinking water
  He compared death rates from cholera in districts served by two water companies:
    In 1849 both companies obtained water from the dirty Thames
    In 1852, one of them moved its waterworks upriver to an area free of sewage
  Death rates fell sharply in the districts served by this water company!
146. Difference-in-Differences
The DID estimate is clearly presented in a 3 x 3 table (treated/control by pre/post, with differences in the margins)
All we need to estimate is 4 averages:
  Non-parametric regression
  Take differences and double differences
147. Difference-in-Difference
When can we use diff-in-diff?
  We want to evaluate a program, project or intervention,
  We have treatment and control groups,
  We observe them before and after the treatment.
But:
  Treatment is not random,
  Other things are happening while the project is in effect,
  We cannot control for all the potential confounders.
Key assumption:
  The trend in the control group approximates what would have happened in the treatment group in the absence of the treatment.
148. Difference-in-Difference
Assume that there was a free lunch program in Place A, and that the free lunch was expected to improve student outcomes.
[Figure: outcomes in Place A (treated) and Place B (control) in 2008 and 2010. D1 = change in A over time; D2 = change in B over time; D3 = gap between A and B in 2010; D4 = gap between A and B in 2008.]
  D1 = Dprogram + Dtrend, and D3 = Dprogram + Ddifference due to factors other than the program,
  so DID = D1 - D2 = D3 - D4 = Dprogram.
149. Difference-in-Difference

Y            Pre (2008)  Post (2010)  Diff
Treated (A)  20          90           70
Control (B)  30          70           40
Diff         -10         20           30

[Figure: scores for Treated (A) and Control (B) in 2008 (pre) and 2010 (post), on a 0-100 scale.]
150. Difference-in-Difference: data

Name  Y (score)  Dtr  Dpost
1     40         0    0
2     80         1    1
3     20         0    0
4     100        1    1
5     30         0    0
6     0          1    0
7     60         0    1
8     40         1    0
9     60         0    1
10    90         0    1

Dtr = dummy variable equal to 1 if the individual is in the treated group (A)
Dpost = time dummy equal to 1 if the individual takes the test in 2010 (post)
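A minimal sketch estimating the DID effect by regression on these data; the coefficient on the interaction term reproduces the double difference of 30 from the table above:

* DID regression: score = b0 + b1*Dtr + b2*Dpost + b3*(Dtr*Dpost) + e
clear
input score Dtr Dpost
40 0 0
80 1 1
20 0 0
100 1 1
30 0 0
0 1 0
60 0 1
40 1 0
60 0 1
90 0 1
end
gen DtrXpost = Dtr * Dpost
regress score Dtr Dpost DtrXpost   // coefficient on DtrXpost = 30, the DID estimate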