The International Food Policy Research Institute (IFPRI) organized a three-day training workshop on ‘Monitoring and Evaluation Methods’ on 10-12 March 2014 in New Delhi, India. The workshop is part of an IFAD grant to IFPRI to partner in the monitoring and evaluation (M&E) component of ongoing projects in the region. It is intended as a collaboration among project directors, M&E leaders, and M&E experts. Detailed interaction will cover evaluation routines, including sampling, questionnaire development, data collection and management techniques, and the production of an evaluation report. The workshop is designed to build a better understanding of the M&E needs of projects that are at different stages of implementation, addressing both generic issues in M&E programs and project-specific needs. Its objective is to produce a work plan for the M&E domains of the IFAD projects and to identify possibilities for collaboration between IFPRI and project leaders.
This document discusses determining appropriate sample sizes for surveys. It explains that sample size depends on factors like acceptable sampling error, population size, population variation, and subgroup analysis needs. Larger samples are needed for more serious decisions, varied populations, or subgroup analysis, while smaller samples suffice for rough estimates of homogenous populations. It provides a table to help select sample sizes for different population sizes, desired confidence levels, and margins of error. Response rates and minimizing non-response bias are also addressed.
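The table lookups described above can also be reproduced directly. Below is a minimal Python sketch of the standard proportion-based sample-size formula with an optional finite-population correction; the function name and example numbers are illustrative, not taken from the original document.

```python
import math
from statistics import NormalDist

def sample_size(margin_of_error, confidence=0.95, p=0.5, population=None):
    """Sample size for estimating a proportion; p=0.5 is the most
    conservative assumption about population variation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite-population correction for small populations
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size(0.05))                   # 385 at 95% confidence, +/-5%
print(sample_size(0.05, population=1000))  # 278: small populations need fewer
```

This mirrors how such tables are generated: each cell is this formula evaluated at one combination of population size, confidence level, and margin of error.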
Data Types with Matt Hansen at StatStuff
This document discusses the differences between continuous and discrete data types. Continuous data is measured on a continuum and is virtually infinite in scale or divisibility, with examples like dollars, time, and distance. Discrete data is measured by counts or classifications with limited scale and divisibility, with examples like yes/no, colors, and names. The document notes that while percentages are numeric, they actually represent discrete proportions. It also discusses count and classification data as two types of discrete data and provides examples of how each is used. Finally, it prompts the reader to analyze metrics from their own organization to determine if they are continuous or discrete and how they could potentially be measured differently.
This document provides an outline for a presentation on determining sample size. It discusses key concepts like what sample size is, why determining an appropriate sample size is important, and factors that affect sample size calculations like available resources, required accuracy, and study design. The presentation aims to help audiences understand how to determine sample sizes and how to apply the concept in research and studies.
Population vs. Sample Data with Matt Hansen at StatStuff
This document discusses the difference between population and sample data, and how samples are used to make inferences about populations in statistical analysis. It defines a population as representing every possible observation, while a sample is a subset that aims to fairly represent the population. It notes that using a sample introduces risk that the sample may not accurately reflect the true population parameters, and that statistical analysis aims to mitigate this risk. The document provides examples of how these concepts apply in practical organizational metrics that are measured through sampling.
Presented by Pascale Schnitzer and Carlo Azzarri, IFPRI at the Africa RISING–CSISA Joint Monitoring and Evaluation Meeting, Addis Ababa, Ethiopia, 11-13 November 2013
I hope this is a simple and powerful presentation for all. Rather than large blocks of text, it presents the material through images and graphics.
Happy Reading
Different Sources of Data with Matt Hansen at StatStuff
This document discusses different sources of data for statistical analysis, including source systems, system reports, and manual observations. It notes that source systems are the ideal primary source because they provide consistent, comprehensive, and reliable data, while system reports are also good sources that are fast but may lack detail. Manual observations are less reliable due to small sample sizes and inconsistencies. The document recommends considering the tradeoff between data accuracy and the time required to obtain the data from each potential source.
Primer on the application of statistical significance testing for business research purposes.
1) How to use statistics to make more informed decisions (and when not to use them).
2) Highlight differences between statistics in science vs business.
3) Highlight assumptions, limitations and best practices.
This document discusses various statistical techniques for analyzing metrics and detecting changes, including hypothesis testing, statistical process control (SPC), multivariate adaptive statistical filtering (MASF), and analysis of variance (ANOVA). It provides examples of how each technique works and the assumptions behind them. Specifically, it walks through using MASF and ANOVA to analyze server usage metrics to detect any deviations from normal patterns.
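As a hedged illustration of the ANOVA step described above, the following sketch runs a one-way test on hypothetical server-usage samples with scipy; the server names and numbers are invented for the example, not drawn from the original document.

```python
from scipy.stats import f_oneway

# Hypothetical hourly CPU-usage samples from three servers
server_a = [52, 55, 53, 56, 54]
server_b = [51, 54, 52, 55, 53]
server_c = [70, 73, 71, 74, 72]  # this one deviates from the usual pattern

# One-way ANOVA: do the group means differ more than chance would allow?
f_stat, p_value = f_oneway(server_a, server_b, server_c)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
if p_value < 0.05:
    print("At least one server's mean usage differs from the others")
```

A significant result only says that some mean differs; identifying which server deviates requires a follow-up comparison, which is the role the document assigns to techniques like MASF.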
Lawrence D. Schall (Professor of Finance and Business Economics, University of Washington)
Gary L. Sundem (Associate Professor of Accounting, University of Washington)
William R. Geijsbeek, Jr. (Finance Manager, The Boeing Company)
Session 3.4: Arifin, coffee agroforestry system in Sekampung watershed, Sumatra... – World Agroforestry (ICRAF)
Coffee agroforestry systems were studied in the Upper Sekampung Watershed in Sumatra, Indonesia to evaluate their social and economic impacts. 408 coffee-farmer households practicing multi-strata agroforestry systems were interviewed. Propensity score matching showed that adopters of agroforestry systems had higher total farm incomes than non-adopters, owing to higher revenues from timber and other crops. Agroforestry also reduced risks from issues such as land degradation and water shortages that farmers perceived as important. Overall, the coffee agroforestry systems improved farmers' livelihoods and provided environmental benefits such as reduced soil erosion.
Distributions: Normal with Matt Hansen at StatStuff
This lesson discusses normal distributions and how to test if a distribution is normal using a normality test. It begins with an overview of key characteristics of a normal distribution, including that it is symmetrical and bell-shaped. It then explains how to conduct a normality test, such as the Anderson-Darling test, in Minitab by examining a probability plot or running a normality test and looking at the resulting p-value. A p-value greater than 0.05 means the data are consistent with a normal distribution (normality is not rejected). The lesson concludes by having the student practice these techniques on sample and real data sets.
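Outside Minitab, the same normal-vs-non-normal check can be sketched in Python. This example uses scipy's Shapiro-Wilk test rather than Anderson-Darling (a different but analogous normality test) on invented data, applying the same p > 0.05 decision rule the lesson describes.

```python
import random
from scipy.stats import shapiro  # Shapiro-Wilk normality test

random.seed(42)
normal_data = [random.gauss(100, 15) for _ in range(200)]
skewed_data = [random.expovariate(1.0) for _ in range(200)]

stat_normal, p_normal = shapiro(normal_data)
stat_skewed, p_skewed = shapiro(skewed_data)

# Same decision rule as in the lesson: p > 0.05 -> treat as normal
print(f"gaussian sample: p = {p_normal:.3f}")
print(f"exponential sample: p = {p_skewed:.2e}")
```

The exponential sample is strongly skewed, so its p-value is tiny and normality is rejected; the Gaussian sample will typically pass.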
This document discusses managing healthcare costs in an era of healthcare reform. It includes an agenda for a presentation on the topic with sections on the state of analytics in healthcare, strategic profit and loss statements, use cases, best practices, sample reporting, and a question and answer session. It emphasizes that healthcare transformation requires integrated clinical, financial, administrative, and research data from across healthcare providers as well as analytics. It also notes that a lack of understanding of healthcare costs is a barrier to effective reimbursement approaches and that financial decision support is a top priority for providers.
This document discusses the importance of data quality and identifies various types of errors that can occur in tuberculosis (TB) program data. It defines data quality and outlines its key dimensions including intrinsic accuracy, contextual relevance and timeliness, representational interpretability, and accessibility. Sources of errors are identified at different stages of data management from recording to analysis. Strategies for ensuring data quality include training, supervision, computerization, and verification procedures such as routine data checking. Maintaining data quality is important for accurate program management and performance assessment in TB control.
5 essential steps for sample size determination in clinical trials – nQuery
In this free webinar hosted by nQuery Researcher & Statistician Eimear Keyes, we map out the 5 essential steps for sample size determination in clinical trials. At each step, Eimear will highlight the important function it plays and how to avoid the errors that will negatively impact your sample size determination and therefore your study.
Watch the Video: https://www.statsols.com/webinar/the-5-essential-steps-for-sample-size-determination
This document discusses factors to consider when determining sample size for statistical studies. It notes that sample size is usually based on the study's objective and should be stated in the study protocol. Key factors in determining sample size include estimates of population standard deviation, acceptable sampling error levels, and desired confidence levels. Several methods are described for calculating sample size, including traditional statistical models and Bayesian models. The document also discusses concepts like sampling distributions of means and proportions, and factors that affect sample size calculations for estimating proportions, such as specifying acceptable error levels, confidence levels, and population proportion estimates.
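For the mean-estimation case mentioned above, the relationship between the standard deviation estimate, the acceptable error, and the confidence level can be sketched in a few lines; sigma and the error bound below are illustrative values, not figures from the document.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, error, confidence=0.95):
    """n needed to estimate a population mean to within +/- error,
    given an estimate sigma of the population standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / error) ** 2)

# sigma estimated at 12 from a pilot study; want the mean within +/- 2
print(sample_size_for_mean(12, 2))  # 139
```

Tightening the error bound or raising the confidence level both increase n, which is why the document stresses stating these choices in the study protocol.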
This document is an International Standard on Auditing (UK) that provides guidance on audit sampling. Some key points:
1. The objective of audit sampling is to provide the auditor with a reasonable basis to draw conclusions about the entire population based on testing a sample of items.
2. The document defines terms related to audit sampling such as population, sampling risk, sampling unit, and tolerable misstatement.
3. It provides requirements for auditors around designing the sample, including sample size and selection methods. Factors that influence sample size include sampling risk, expected misstatement amount, and expected deviation rate.
4. Requirements are also provided for performing procedures on sample items, investigating deviations and misstatements, and evaluating the results of the sample.
Linked Administrative Data and Adaptive Design – MickeyJackson3
The document describes a simulation study examining whether an adaptive survey design using administrative data from the Civil Rights Data Collection (CRDC) could improve response rates and reduce nonresponse bias for the School Survey on Crime and Safety (SSOCS). The simulation assigned schools to receive targeted interventions based on predicted response propensity scores from models using either Common Core of Data (CCD) variables only or CCD plus CRDC variables. It found that a highly effective intervention was needed for any adaptive design to outperform random targeting, and that CRDC variables did not significantly improve predictions due to weak correlations with SSOCS variables. Even with strong interventions, nonresponse bias was largely eliminated through post-stratification weighting that used both CCD and CRDC variables.
Distributions: Non-Normal with Matt Hansen at StatStuff
This document discusses non-normal and bimodal distributions. It explains that non-normal distributions have bias or skewness, which can be caused by non-random sampling methods or processes influencing the results. The median is a better measure of central tendency for non-normal distributions. Bimodal distributions have two central tendencies, indicating observations from multiple populations. The document provides examples and instructs the reader to analyze sample data to identify normal and non-normal distributions using normality tests.
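The point about the median being a better measure of central tendency for skewed data can be shown in a few lines; the cycle-time figures below are invented for illustration.

```python
from statistics import mean, median

# Right-skewed sample: hypothetical cycle times with a long upper tail
cycle_times = [4, 5, 5, 6, 6, 7, 8, 9, 30, 45]

avg = mean(cycle_times)
mid = median(cycle_times)
print(avg)  # 12.5 -- pulled upward by the two outliers
print(mid)  # 6.5  -- closer to the bulk of the observations
```

The mean lands above eight of the ten observations, while the median stays with the bulk of the data, which is exactly the behavior the document describes for non-normal distributions.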
This document discusses data management and analysis for monitoring and evaluation. It covers topics such as data capture, data cleaning, data security, and data analysis. The objectives are to understand data management rules and roles, implement a data management system, and strengthen skills in data analysis and interpretation. Data capture methods include paper forms, databases, and personal digital assistants. Data cleaning involves checking for completeness, consistency, plausibility, duplicates, and outliers. Data security requires restricting access, backups, and anonymous storage. Data analysis turns raw data into useful information by answering questions through comparison, statistics, and interpretation.
Root Cause Analysis – A Practice to Understanding and Control the Failure Man... – inventionjournals
International Journal of Business and Management Invention (IJBMI) is an international journal intended for professionals and researchers in all fields of Business and Management. IJBMI publishes research articles and reviews within the whole field Business and Management, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Data analytics experts Metageni briefly explain how global information giant LexisNexis models user success from user analytics data using machine learning. A Moo.com tech talk for analysts and engineers with an interest in data science, covering the high level classifier method used in support of LexisNexis, working with their global digital team.
This document discusses factors to consider when determining sample size for research, including effect size, population standard deviation, power, and significance level. It provides examples of how to estimate these factors from literature, historical data, or expert opinion. Free online sample size calculators and software tools are listed. Steps for determining sample size include understanding the study objective, selecting the appropriate statistical analysis, calculating the sample size, and allowing for non-response. Sample size statements should outline these steps.
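The factors listed above (effect size, standard deviation, power, significance level) combine in a standard normal-approximation formula for comparing two means, sketched below; the inputs are illustrative, not from the document, and real studies often use exact methods instead.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for detecting a difference delta between two means,
    using the standard normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detect a 5-point difference, sigma = 10, 80% power, alpha = 0.05
print(n_per_group(delta=5, sigma=10))  # 63 per group
```

Raising power or shrinking the detectable difference increases the required n, and the result would then be inflated further to allow for non-response, as the steps above describe.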
Sample Size Calculations for Impact Evaluations – Marcos Vera
The document provides guidance on sample size calculations for impact evaluations. It discusses setting the sample size when outcomes are continuous variables, proportions, or in cluster-based evaluations. Key points covered include:
- Setting power, significance level, expected means or proportions for treatment and control groups, and standard deviations.
- Using online calculators or software to determine the minimum sample size.
- Adjusting the sample size to account for data loss, cost considerations like different costs for treated vs. control groups, and statistical dependence within clusters.
- Computing the design effect to adjust the sample size for cluster-based evaluations based on the intracluster correlation.
- Considering multiple outcomes and adjusting the sample size accordingly.
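The design-effect adjustment in the steps above can be sketched as follows; the cluster size and intracluster correlation (ICC) values are illustrative.

```python
import math

def adjust_for_clustering(n_srs, cluster_size, icc):
    """Inflate a simple-random-sample size by the design effect
    DEFF = 1 + (m - 1) * ICC, where m is the cluster size."""
    deff = 1 + (cluster_size - 1) * icc
    return math.ceil(n_srs * deff)

# 400 from an SRS calculation, 21 households per cluster, ICC = 0.05
print(adjust_for_clustering(400, 21, 0.05))  # DEFF = 2.0 -> 800
```

Even a modest ICC doubles the required sample here, which is why cluster-based impact evaluations budget for far more respondents than a simple random sample would need.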
Certified Specialist Business Intelligence (CSBI) Reflection – durantheseldine
Part 5 of 6
CSBI Course 5: Business Intelligence and Analytical and Quantitative Skills
● Thinking about the Basics
● The Basic Elements of Experimental Design
● Sampling
● Common Mistakes in Analysis
● Opportunities and Problems to Solve
● The Low Severity Level ED (SL5P) Case Setup as an Example of BI Work
● Meaningful Analytic Structures
Analysis and Statistics
A key aspect of the work of the BI/Analytics consultant is analysis. Analysis can be defined as how data is turned into information; information is the outcome when data is analyzed correctly.

Rigorous analysis gives the best chance of creating the sharpest picture of what the data might reveal, and it is the product of the proper application of statistics and experimental design. Statistics encompasses a complex and detailed series of disciplines, and statistical concepts are foundational to all descriptive, predictive, and prescriptive analytic applications. However, even the application of simple descriptive statistical calculations yields a great deal of usable information for transformational decision-making. The value of that information is amplified when these same simple statistics are used within the context of a well-designed experiment.

This module is not designed to teach statistics. It is designed to place statistical work within the appropriate context so that it can be leveraged most effectively in driving organizational performance. It also serves as a review of the basic knowledge needed for work with descriptive and inferential statistics.
The Basic Elements of Experimental Design
Analytic tools also can provide an enhanced ability to conduct experiments. More than just
allowing analysis of output of activities or processes, experiments can be performed on
processes and the output of processes. Experimenting on processes is a movement beyond
the traditional r.
- Discriminant analysis is a statistical technique used to separate cases into categories based on a set of independent variables. It develops predictive equations to classify dependent variables into categories.
- It can be used to predict whether a cancer drug will help or harm patients based on gene expression, or to assess credit risk and classify loan applicants as good or bad risks based on financial characteristics.
- The technique develops discriminant functions that provide the best separation between categories, and uses those functions to classify new cases into the appropriate groups.
The document provides an overview of adverse impact, how it is defined, measured, and evaluated. It discusses the legal framework around adverse impact from the Griggs v. Duke Power Co. case and the three phases of disparate impact litigation. It then focuses on how adverse impact is measured, discussing both statistical significance tests like the Z-test, Fisher's Exact Test, and practical significance measures like the 80% rule and standard deviation difference test. The document stresses the importance of evaluating both individual stages and cumulative adverse impact of a selection process. It also notes several factors that influence adverse impact and that it is a complex issue without single solutions.
This document discusses audit sampling, which involves selecting a subset of data from a population to make inferences about the whole population. It defines audit sampling and explains that it provides information on how many items to examine, which items to select, and how to evaluate sample results. The document outlines the general approaches of statistical and non-statistical sampling and explains key steps like planning, selecting, and evaluating a sample. It also discusses factors that affect sample size and how to project errors in a sample to the overall population.
SAMPLE SIZE CALCULATION IN DIFFERENT STUDY DESIGNS AT.pptxssuserd509321
The document discusses factors that affect sample size calculation in different study designs. It provides examples of calculating sample sizes for descriptive cross-sectional studies, case-control studies, cohort studies, comparative studies, and randomized controlled trials. The key factors discussed are the level of confidence, power, expected proportions or means in groups, margin of error, and standard deviation. Sample size is affected by the type of study design, variables being qualitative or quantitative, and the goal of establishing equivalence, superiority or non-inferiority between groups. Electronic resources are provided for calculating sample sizes.
2. Focusing on quantitative methods: we propose to use double-difference methods
• The central feature of the method is the use of longitudinal data to construct a
"difference-in-differences" or "double difference" estimate.
• The method relies on baseline data collected before project implementation and
follow-up data collected after it starts, giving a "before/after" comparison.
• Data are collected both from households receiving the program and from those that
do not ("with the program" / "without the program").
3. Double difference methods: continued
• Why are both "before/after" and "with/without" data necessary?
• Suppose data were collected only from beneficiaries.
• Suppose that between the baseline and the follow-up some adverse event occurs, with the
benefits of the program more than offset by the damage from that event. These effects
would show up in the difference over time in the intervention group, in addition to the
effects attributable to the program.
• More generally, restricting the evaluation to "before/after" comparisons makes it impossible
to separate program impacts from the influence of other events that affect beneficiary
households.
• To guard against this, add a second dimension to the evaluation design that includes data on
households "with" and "without" the program.
4. Summary of the method and its application
• The approach: by comparing changes in selected outcome indicators between the treatment
group and a comparable control group, the project impact is estimated quantitatively.
• The approach can also be applied to measure spillover effects from treated to non-treated
farmers in the treated areas.
• Spillovers are examined by comparing outcomes between non-treated households in treatment
areas and households in control areas.
• Moreover, impact heterogeneity across population sub-groups can be investigated.
• The sub-groups can be defined based on caste, gender, agro-ecological zones, etc.
• Such information will be collected in the baseline survey.
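The double-difference logic above can be sketched in a few lines. This is a minimal illustration with hypothetical mean incomes (the project would of course estimate this from household survey data, typically in a regression framework):

```python
# Minimal sketch of a double-difference (DiD) estimate, assuming we have
# mean outcomes (e.g. household income) for each group at baseline and
# follow-up. All numbers below are hypothetical illustrations.

def double_difference(treat_before, treat_after, ctrl_before, ctrl_after):
    """DiD impact = change in treatment group minus change in control group."""
    return (treat_after - treat_before) - (ctrl_after - ctrl_before)

# Hypothetical mean incomes: both groups are hit by the same adverse event,
# but the treatment group also receives the program.
impact = double_difference(treat_before=100.0, treat_after=108.0,
                           ctrl_before=98.0, ctrl_after=94.0)
print(impact)  # 12.0: the common shock is differenced out
```

Note how a simple "before/after" comparison of the treatment group alone (108 − 100 = 8) would understate the impact, because the adverse event pulled both groups down; the control group's change (−4) captures that common shock.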
5. Steps 2 and 3, continued: Matching
methods – second best
• Suppose the treatment and control groups cannot be separated cleanly.
• Then:
• Survey farmers to identify beneficiaries and non-beneficiaries, recognizing that self-selection may have occurred.
• Collect data on farmer, household, and location characteristics.
• Find "similar" farmers and compare their outcomes – the essence of the matching method.
• Question: how is similarity defined? It can involve many dimensions (education, land size, family size, crop,
and so on).
• Theory can make this multi-dimensionality problem manageable by reducing it to a single variable that can be
used for matching.
• That variable is the propensity score: the estimated probability of being a beneficiary. Each farmer
will have a propensity score.
• Farmers with a similar chance of being a beneficiary across treatment and control – i.e. a similar propensity
score – are logically the ones to match and compare outcomes for.
• Again, longitudinal data on beneficiary and non-beneficiary farmers can improve the impact estimates
from matching as well.
6. Technical note on matching
• Matching methods construct a comparison group by "matching" treatment to comparison units based
on observable characteristics (both farmer and location characteristics in the baseline survey).
• To apply matching methods, survey farmers to identify users and non-users of the project's benefits.
• The impact is estimated as the average difference between the outcome (or change in outcome) for each
treatment farmer and a weighted average of the outcomes (or changes in outcomes) of similar
comparison-group farmers from the matched sample.
• In essence: identify beneficiary and non-beneficiary farmers from the survey, and for each beneficiary
take the difference between his or her income and the incomes of several similar non-beneficiaries.
• Then take the average of those differences.
• That average is the project impact, called the average treatment effect on the treated (ATT), i.e. the
average impact of the project on the beneficiary group.
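The matching step can be sketched as follows. This is a simplified one-to-one nearest-neighbour version (the slides describe a weighted average over several similar comparison farmers); the propensity scores are assumed to come from an earlier logistic regression of treatment status on baseline characteristics, and all numbers are hypothetical:

```python
# Sketch of nearest-neighbour propensity-score matching for the ATT.
# Each farmer is represented as (propensity_score, outcome); the scores
# are assumed to be pre-estimated, e.g. by logistic regression.

def att_nearest_neighbour(treated, controls):
    """For each treated farmer, find the control farmer with the closest
    propensity score and average the outcome differences (the ATT)."""
    diffs = []
    for p_t, y_t in treated:
        # nearest control by propensity score
        p_c, y_c = min(controls, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

treated = [(0.80, 120.0), (0.60, 110.0)]
controls = [(0.78, 112.0), (0.55, 104.0), (0.20, 90.0)]
print(att_nearest_neighbour(treated, controls))  # (8 + 6) / 2 = 7.0
```

Note that the control farmer with score 0.20 is never used as a match: matching effectively discards comparison units that look nothing like the beneficiaries, which is exactly the point of the method.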
7. Issues: what if there is entry and exit from
the program, or if no one can be excluded?
• One option is to change the item being evaluated – if there is a lot of
flux, time spent in the program can become the item to evaluate.
• Over the five-year period of this project, this may not be much of
an issue.
• Consider an encouragement design if no one can be excluded. But
matching methods are already specified in FTF.
8. Power calculation
• Power calculations provide the smallest sample with which it is
possible to measure the impact of a program, that is, the smallest
sample that will allow meaningful (or desired) differences in
outcomes between the treatment and comparison groups to be
detected.
9. Some statistical concepts before the power
calculation
• Hypothesis testing – by convention, the hypothesis that any difference
found is due to chance alone is referred to as the null hypothesis.
• Statistical analysis determines whether the null hypothesis is rejected or not.
• If the analysis indicates that the difference or effect is unlikely to have
occurred by chance, the null hypothesis is rejected in favor of the
alternative hypothesis, which states that a real effect has occurred.
• The result is statistically "not significant" if the null hypothesis is not rejected,
and statistically "significant" if it is rejected.
• Clearly, a criterion is needed for rejecting the null hypothesis.
10. Statistical power: continued
• The criterion is referred to as the alpha level. Alpha is often set at 0.05 or 5%. Statistical
analysis is then carried out to calculate the probability that the difference
or effect is purely due to chance. The null hypothesis is rejected only if this
probability (the p-value) is equal to or less than the alpha level.
• This process, however, admits two possible errors.
• A false positive, or type I error, occurs if the null hypothesis is rejected incorrectly. There is a
5% chance of this occurring if the alpha level is set at 0.05.
• A type II error, or false negative, occurs if the null hypothesis is accepted
incorrectly. A beta level can be chosen as protection against this type of error.
• Statistical power = 1 − β.
• With the size of the type I error fixed, the sample size is chosen to keep the type II error small.
• Statistical power is conventionally set at 0.80 or 80%, i.e. there is a 20% chance of
accepting the null hypothesis in error.
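The alpha-level decision rule above can be made concrete with a simple two-sided z-test for a difference in means. The sample summaries here are hypothetical, and the normal approximation assumes reasonably large samples:

```python
# Sketch: a two-sided z-test for a difference in two group means, showing
# how the p-value is compared with the alpha level. All inputs are
# hypothetical; large samples are assumed so the normal approximation holds.
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(mean_t, mean_c, sd, n_t, n_c):
    se = sd * sqrt(1.0 / n_t + 1.0 / n_c)   # standard error of the difference
    z = (mean_t - mean_c) / se              # standardized difference
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-tailed p-value

alpha = 0.05
p = two_sided_p_value(mean_t=115.0, mean_c=100.0, sd=40.0, n_t=400, n_c=400)
print(p < alpha)  # True: reject the null hypothesis at the 5% level
```

With smaller samples or a smaller true difference, the same calculation would yield a p-value above alpha and the null hypothesis would not be rejected.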
11. How large should a sample size be?
• Unfortunately there is no simple answer to this question; it depends
on several factors.
• Effect size: the smallest difference or effect that the researcher
considers to be economically or policy relevant. In other words, what is
the difference between beneficiary and non-beneficiary outcomes
that would make the project qualify as a success?
• Fixing the effect size can be a difficult task. It can be based on monitoring,
qualitative data, a pilot study, previous studies, or expert elicitation, among
other things.
12. Power: Continued
• Alpha level
• For a smaller alpha level a larger sample size is needed and vice versa.
• Standard deviation
• Effects being investigated often involve comparing mean values measured in two or
more samples. Each mean value will be associated with a standard deviation. As the standard
deviation increases, a larger sample size is needed to achieve acceptable statistical power.
Again, the standard deviations expected in a sample need to be estimated based on
judgement, previous (pilot) studies and/or other published literature.
• One- or two-tailed statistical tests
• There are two types of alternative hypothesis. The first is one-tailed and is appropriate
when a difference in one direction is expected. For example, it might be hypothesised
that sample A has a higher income than sample B. The second is two-tailed and is
appropriate when a difference in either direction is expected.
• One-tailed alternative hypotheses require smaller sample sizes.
• However, the use of one-tailed tests should be justified and not be used purely to reduce
the sample size required.
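The factors above (alpha level, standard deviation, one- vs. two-tailed test) all enter the standard normal-approximation formula for the per-group sample size when comparing two means: n = 2σ²(z_alpha + z_power)²/Δ². A minimal sketch using only the Python standard library (the effect size of 15 units and σ = 50 are illustrative values, not from the slides):

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80, two_tailed=True):
    """Per-group n for detecting a mean difference delta (normal approximation)."""
    if two_tailed:
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    else:
        z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma**2 * (z_alpha + z_power) ** 2 / delta**2)

# Illustrative design: detect a 15-unit difference when sigma = 50,
# at alpha = 0.05 and 80% power.
n_two = sample_size_per_group(15, 50)                    # two-tailed test
n_one = sample_size_per_group(15, 50, two_tailed=False)  # one-tailed test
```

Running this shows the point made above: the one-tailed test needs a noticeably smaller per-group sample than the two-tailed test for the same design, because its critical value is smaller.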
13. Summary: power calculation to determine
the sample size for baseline (and end line)
• Power
• The ability of a study to detect an impact. Conducting a power calculation is a
crucial step in IE. The statistical power of an IE is the probability that it will detect
a difference between the treatment and comparison groups when in fact one
exists. An IE has high power if there is a low risk of not detecting real program
impacts, that is, of committing what is called a type II error.
• The calculation of the sample required for the impact evaluation depends on the
minimum effect size and the required level of confidence.
14. Sample size determination: continued
• A larger sample is generally better
• But there are resource constraints
• Non-sampling errors also increase (enumerators get tired and data quality
gets poorer)
• Minimum effect size: how large an increase in incomes must be detected
for the project to be treated as a success, say 15%?
• Required level of confidence: do you want to be 90 percent sure that the
effects detected are true, or 80 percent sure?
• Know that we can never be 100 percent sure unless we do a census
• Basic principle
• Detecting smaller impacts requires a larger sample size
• More confidence that the estimated impacts are true requires a larger
sample size
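These principles can also be checked by simulation rather than formulas: repeatedly draw treatment and comparison samples with the true effect built in, and count how often the test detects it. A sketch using only the standard library (the design values of 175 per group, a 15-unit effect, and σ = 50 are illustrative; they correspond roughly to an 80%-power design):

```python
import random
import statistics

random.seed(1)

def detects_effect(n, delta, sigma, z_crit=1.96):
    """Simulate one trial where the treatment mean is truly shifted by delta,
    then apply a two-sample z-test."""
    control = [random.gauss(100, sigma) for _ in range(n)]
    treated = [random.gauss(100 + delta, sigma) for _ in range(n)]
    se = ((statistics.pvariance(control) + statistics.pvariance(treated)) / n) ** 0.5
    z = (statistics.mean(treated) - statistics.mean(control)) / se
    return abs(z) > z_crit

# Empirical power = share of simulated trials in which the real effect is detected.
reps = 1000
power = sum(detects_effect(175, 15, 50) for _ in range(reps)) / reps
```

For this design the simulated power comes out close to the conventional 0.80 target; shrinking the effect size or the sample size in the call above drives it down, illustrating the basic principle stated in the slide.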
15. Larger samples better resemble the population (both
treatment and control) (Gertler et al. 2010)
16. Other technicalities for sample size
calculation
• There are both clustered (say, by district) and unclustered interventions
• There are groups for which impacts are especially important for the project
(the low-caste population, for example)
• If interventions are designed by cluster and target group, this has
implications for sample size
• It looks complex, but software does this in easy steps
• We still need to provide basic inputs: the size of the effect to detect, the
confidence level we want, the number of clusters and groups, etc.
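The cluster adjustment the software performs is commonly based on the design effect, DEFF = 1 + (m − 1) × ICC, where m is the number of units interviewed per cluster and ICC is the intra-cluster correlation. This is the standard formula rather than one given in the slides, and the numbers below (175 households per arm, 20 per district, ICC = 0.05) are purely illustrative:

```python
import math

def clustered_sample_size(n_unclustered, cluster_size, icc):
    """Inflate a simple-random-sample size by the design effect
    DEFF = 1 + (cluster_size - 1) * ICC."""
    deff = 1 + (cluster_size - 1) * icc
    return math.ceil(n_unclustered * deff)

# Illustrative: 175 households per arm under simple random sampling,
# interviewing 20 households per district with ICC = 0.05.
n_clustered = clustered_sample_size(175, 20, 0.05)
```

Even a modest ICC nearly doubles the required sample here, which is why ignoring clustering in the power calculation badly understates the sample needed.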
17. Read medical science papers for power stuff
• Sir Karl Popper (1959), the philosopher of science, theorized that we
can never prove anything; rather, our strongest support for an idea
comes from our repeated unsuccessful attempts to disprove that
idea.
• Sampling Procedure (random is best)
18. Take home- Errors in hypothesis testing
• A type II error occurs when we conclude that there is no difference
between treatments when in truth there is a difference
• i.e., we fail to reject H0 when H0 is in fact false
• The probability of making a type II error is denoted by β. Traditionally many
investigators have ignored β, but there is now increased recognition
of the importance of minimizing it.
• Power is the probability of finding an effect when an effect actually
exists.
20. Example: test of difference of means
in two populations
• The researcher fixes the probabilities of type I and type II errors
• Prob(type I error) = Prob(reject H0 when H0 is true) = α
• Smaller error → greater precision → need more information → need
larger sample size
• Prob(type II error) = Prob(don't reject H0 when H0 is false) = β
• Power = 1 − β
• More power → smaller type II error → need larger sample size
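The trade-off in these bullets can be made concrete. Under the normal approximation, the power of a two-tailed, two-sample test is Φ(Δ/(σ·√(2/n)) − z_{1−α/2}): holding the sample size fixed, tightening α (a smaller type I error) lowers power, i.e. raises β. A sketch with the same illustrative design values used earlier (n = 175 per group, Δ = 15, σ = 50):

```python
import math
from statistics import NormalDist

def power_two_sample(n, delta, sigma, alpha=0.05):
    """Power of a two-tailed two-sample z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = delta / (sigma * math.sqrt(2 / n))
    return NormalDist().cdf(noncentrality - z_crit)

# Same design, two choices of alpha: a stricter alpha costs power.
power_05 = power_two_sample(175, 15, 50, alpha=0.05)
power_01 = power_two_sample(175, 15, 50, alpha=0.01)
```

With these numbers, moving from α = 0.05 to α = 0.01 drops the power well below the 0.80 convention, so a design that wants both a smaller α and the same power must increase n.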