Introduction to hypothesis testing and statistical significance
1. How to do the maths:
introduction to hypothesis
testing
DR HAJER ELKOUT
2. Agenda:
• Null and alternative hypotheses.
• Outcome measures? Effect size: what is it?
• Confidence interval and its interpretation.
• Statistical significance and p-values.
• Odds ratio, relative risk and hazard ratio.
• Putting it all together: building conclusions.
3. The null hypothesis (H0)
• A null hypothesis proposes that there is no difference in outcomes, i.e. it assumes that the two populations being compared are not different.
• We commonly design research to disprove H0.
• We test H0 and, if there is enough evidence to say that H0 is wrong, we reject it in favour of the alternative hypothesis.
4. Alternative hypothesis (Ha / H1)
• Assumes that the two groups are different.
• Rejecting the null hypothesis suggests that the alternative hypothesis may be true.
5. Steps in significance testing
• State the research question.
• List the assumptions and state a hypothesis.
• Choose a statistical test and calculate the test statistic.
• Get the p-value.
• Make an inference and form conclusions.
6. Outcome measures
• Types of outcome measures:
• Count people (put them in groups/categories): percentages, proportions.
• Take a measurement on them (weight, height, BP): average, mean and standard deviation.
• Measure the time taken for an event to occur: time-to-event data, Kaplan-Meier curve.
7. 1 - Counting people
• Categorical data, mutually exclusive groups.
• We compare the proportions (%) in each group, regardless of the number of people in the groups.
• The risk of having the outcome in group A is, say, xx%, and the risk in group B is zz%.
• Relative risk (risk ratio) = xx% divided by zz%.
8. 2 - Taking measurements
• Continuous data.
• Could involve anything that has a decimal place or a count.
• Number of days in hospital.
• RBC count.
• Blood glucose level.
• Compare means.
9. Effect size: what is it?
• It is the summary measure that is used to interpret research data.
• Obtained by comparing outcome measures between groups of people.
• The type of effect size depends on the type of outcome measure.
10. Effect size: Risk ratio
• The ratio of the risk of an event in the experimental group to the risk in the control group.
11. Effect size: risk difference / risk reduction
• e.g. Drug B reduces the chance of a stroke from 20% to 17%. What is the absolute risk reduction (ARR)?
• Answer: 3%.
• e.g. The risk of an MI over 5 years is 6% with placebo and 3% with drug A, so the ARR is simply 3%.
• However, the relative risk reduction (RRR) is 50%. (The RRR is the proportional reduction in the event rate between the experimental and control groups.)
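The two worked examples on this slide can be reproduced in a few lines. This is a minimal Python sketch; the function names are illustrative and do not come from the slides.

```python
def absolute_risk_reduction(control_risk: float, treated_risk: float) -> float:
    """ARR: the simple difference between the two risks."""
    return control_risk - treated_risk


def relative_risk_reduction(control_risk: float, treated_risk: float) -> float:
    """RRR: the ARR expressed as a proportion of the control-group risk."""
    return (control_risk - treated_risk) / control_risk


# Stroke example: risk falls from 20% to 17%.
stroke_arr = absolute_risk_reduction(0.20, 0.17)

# MI example: risk falls from 6% (placebo) to 3% (drug A).
mi_arr = absolute_risk_reduction(0.06, 0.03)
mi_rrr = relative_risk_reduction(0.06, 0.03)

print(f"Stroke ARR: {stroke_arr:.1%}")
print(f"MI ARR: {mi_arr:.1%}, MI RRR: {mi_rrr:.0%}")
```

The same 3-point absolute reduction corresponds to very different relative reductions, depending on how high the control-group risk was to begin with.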
12. Note…
• Saying that this drug reduces your risk of an MI by 50% sounds great; but if your absolute risk was only 6%, this is the same as reducing it by 3 percentage points. So saying this drug reduces your risk by 50% or by 3% are both true statements, but they sound very different. Guess which one drug companies tend to prefer!
13. Risk of primary end point in the salmeterol group = 20/2366 = 0.845%
Risk of primary end point in the placebo group = 5/2319 = 0.216%
RR = 0.845/0.216 ≈ 3.9
i.e. patients in the salmeterol group were more likely to die or have a life-threatening experience than those in the placebo group.
SMART trial
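The arithmetic on this slide can be checked directly from the event counts. A minimal Python sketch, using only the four numbers quoted above (the helper name `risk` is illustrative):

```python
def risk(events: int, total: int) -> float:
    """Proportion of subjects in a group who had the event."""
    return events / total


salmeterol_risk = risk(20, 2366)  # primary end point, salmeterol group
placebo_risk = risk(5, 2319)      # primary end point, placebo group

# Risk ratio: risk in the experimental group divided by risk in the control group.
risk_ratio = salmeterol_risk / placebo_risk

print(f"Salmeterol risk: {salmeterol_risk:.3%}")
print(f"Placebo risk:    {placebo_risk:.3%}")
print(f"Risk ratio:      {risk_ratio:.2f}")
```

Working from the unrounded risks gives a ratio of about 3.9.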
14. Interpreting risk ratio
• When RR = 0.5 or 2, it is easy to explain, i.e. the risk is halved or doubled.
• When RR = 0.35 or 1.75, it is better to convert it to a percentage to interpret it.
• How far is it from 1?
• 0.35 - 1 = -0.65, x 100 = -65%; risk is reduced by 65%.
• 1.75 - 1 = 0.75, x 100 = 75%; risk is increased by 75%.
• Generally, if the RR is less than 2, convert it to a percentage.
• If it is more than 2, just say the risk is increased xx-fold.
15. Odds ratio
• Calculated in a different way from relative risk but can be interpreted in the same way.
• Risk is the number of subjects with the disease out of all subjects (if one is affected, risk = 1/n),
• while odds express the number with the disease relative to the number without (if one is affected, odds = 1/(n-1)).
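The difference between risk and odds is easiest to see with numbers. The sketch below is illustrative only; the group sizes of 10 and 1000 are made up for the example.

```python
def risk(affected: int, n: int) -> float:
    """Number with the disease out of all subjects."""
    return affected / n


def odds(affected: int, n: int) -> float:
    """Number with the disease relative to the number without."""
    return affected / (n - affected)


# One affected subject in a small and in a large group (hypothetical sizes).
for n in (10, 1000):
    print(f"n={n}: risk={risk(1, n):.4f}, odds={odds(1, n):.4f}")
```

When the outcome is rare, risk and odds are almost identical; they drift apart as the outcome becomes more common.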
16. Confidence Intervals (CI)
• A 95% CI is a range of values calculated so that, across repeated studies, it would contain the true result 95% of the time.
• With a CI you can easily tell whether statistical significance has been reached, without doing any maths!
• If the CI includes the value that reflects "no effect", the result is not statistically significant.
• This no-effect value is 1 for results expressed as ratios (e.g. relative risk, odds ratio) and 0 for results expressed as differences (e.g. difference in means or percentages, ARR).
17. Confidence Intervals (CI)
• Example: a survey to estimate the prevalence of smoking among 502 dental graduates.
• 41% were smokers, 95% CI = 37% to 45%.
• Our best estimate of the TRUE prevalence among ALL dental graduates is 41%, but we are not sure.
• However, we are 95% confident that the TRUE prevalence lies between 37% and 45%.
• And it is unlikely that the true prevalence lies outside this interval.
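The interval in this example can be approximated from the sample proportion and the sample size. The slide does not say which method was used; the sketch below assumes the simple normal-approximation (Wald) interval, and the function name is illustrative.

```python
import math


def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


# 41% smokers among 502 dental graduates.
low, high = wald_ci(0.41, 502)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

Under that assumption the limits round to 37% and 45%, matching the figures quoted above.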
18. Confidence Intervals
• The level of uncertainty is measured by the standard error.
• Smaller study → larger standard error → wider CI → no firm conclusion.
• A wide CI means a less precise result and a narrow CI a more precise result.
19. Hypothetical example of a 95% confidence interval
• Exposure: caffeine intake (high versus low)
• Outcome: incidence of breast cancer
• Risk ratio: 1.32 (point estimate)
• p-value: 0.14 (not statistically significant)
• 95% CI: 0.87 to 1.98
20. Interpretation
• Our best estimate is that women with high caffeine intake are 1.32 times as likely (i.e. 32% more likely) to develop breast cancer as women with low caffeine intake.
• However, we are 95% confident that the true risk ratio in the population lies between 0.87 and 1.98 (assuming an unbiased study). Because this interval includes 1, the result is not statistically significant.
21. Another example…
• The ARR (the difference in risk) is estimated to be 19.6%, with a 95% CI of 5.7% to 33.6%.
• The p-value of 0.006 means that an ARR of 19.6% or more would occur in only 6 in 1000 trials if streptomycin were only as effective as bed rest.
• The 95% CI suggests that the likely true benefit of streptomycin could be as small as 5.7% or as large as 33.6%.
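The "does the CI include the no-effect value?" rule can be applied mechanically to the two worked examples (the caffeine risk ratio and the streptomycin ARR). A minimal Python sketch; the function name is illustrative.

```python
def ci_excludes_no_effect(low: float, high: float, no_effect: float) -> bool:
    """True if the interval does not contain the no-effect value,
    i.e. the result is statistically significant at the CI's level."""
    return not (low <= no_effect <= high)


# Ratio measure (risk ratio): the no-effect value is 1.
caffeine_significant = ci_excludes_no_effect(0.87, 1.98, no_effect=1.0)

# Difference measure (ARR, in percentage points): the no-effect value is 0.
streptomycin_significant = ci_excludes_no_effect(5.7, 33.6, no_effect=0.0)

print("Caffeine RR significant:     ", caffeine_significant)
print("Streptomycin ARR significant:", streptomycin_significant)
```

The answers agree with the quoted p-values: 0.14 for the caffeine example and 0.006 for streptomycin.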
23. 2 x 2 table for calculation of measures of association

                       Outcome
Exposure     Present    Absent    TOTAL
Present      a          b         a+b
Absent       c          d         c+d
TOTAL        a+c        b+d       a+b+c+d
24. Odds ratio:
• OR = (a/c) / (b/d) = ad / bc
• = (odds of exposure among cases) / (odds of exposure among controls)
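As an illustration, the 2 x 2 cell labels can be filled in with the SMART trial counts quoted earlier (20 of 2366 events with salmeterol, 5 of 2319 with placebo) to compare the odds ratio with the risk ratio. This is a sketch; treating salmeterol as the "exposure" is an assumption made only for the example.

```python
# Cells of the 2 x 2 table: rows = exposure, columns = outcome.
a = 20          # exposed, outcome present
b = 2366 - 20   # exposed, outcome absent
c = 5           # unexposed, outcome present
d = 2319 - 5    # unexposed, outcome absent

# Odds ratio: (a/c) / (b/d), which simplifies to ad / bc.
odds_ratio = (a * d) / (b * c)

# Risk ratio: risk in the exposed row divided by risk in the unexposed row.
risk_ratio = (a / (a + b)) / (c / (c + d))

print(f"Odds ratio: {odds_ratio:.2f}")
print(f"Risk ratio: {risk_ratio:.2f}")
```

Because the outcome is rare in both groups, the two measures come out very close to each other.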
25. Figure from Agresti and Franklin, Statistics: The Art and Science of Learning from Data
(p. 468)
26. What is statistical significance?
• In any experiment there is going to be some difference between the results obtained from the experimental and control groups.
• Statistical significance tells us whether the difference is likely to be real or just due to chance.
27. Tests of significance
• Tests of statistical significance can be used to determine the extent to which chance has operated.
• The significance level of any difference is expressed as a p-value, with p standing for probability.
28. P-values
• The probability that an effect at least as extreme as that observed could have occurred by chance alone, i.e. if the null hypothesis were true.
• Informally, the probability that the observed results occurred by chance.
• Under the null hypothesis, the sample estimates of association differ only because of sampling variability.
• If we did the experiment 100 times on the same sample, it should give the same results.
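To make the definition concrete, here is one way a p-value could be computed for a comparison of two proportions. This sketch assumes a two-sided two-proportion z-test (a normal approximation) and applies it to the SMART event counts quoted earlier; it is not the analysis reported by the trial, and with only 5 events in one group the approximation is rough.

```python
import math


def two_proportion_z_test(events_a: int, n_a: int, events_b: int, n_b: int):
    """Return (z, two-sided p-value) under H0: both groups share the same risk."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    # Under H0 the best estimate of the common risk pools both groups.
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / standard_error
    # Probability of a standard normal value at least as extreme as |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(20, 2366, 5, 2319)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The p-value answers one question only: if the two groups truly had the same risk, how often would a difference at least this large appear by chance?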
29. Size of P-value
• It is a balance between the effect size, the study sample and the number of events (if applicable).
• When the outcome is a measurement, counting events is not applicable; instead we look at the standard deviation: small SD = small p-value.

Effect size    Study sample    No. of events    P-value
Large          Large           Large            Small
Small          Small           Small            Large
30. Cut-off for p-value
• Arbitrary cut-off of 0.05 (5% chance of a false-positive conclusion).
• A 0.05 significance level means that a result this extreme would occur by chance ≤5 times in 100 repetitions of the research.
• If p < 0.05: statistically significant, reject H0.
• If p ≥ 0.05: not statistically significant, do not reject H0.
31. Borderline p-value (>0.05 but <0.10)
• We cannot say with certainty that there is or isn't an effect.
• It does not give strong evidence either for or against an effect.
• Use terms such as: suggestion of an effect, indication, seems, tends.
• Perhaps there should be further confirmatory studies.
32. Limitations of significance tests
• Statistical significance does not mean practical significance.
• Significance tests don't tell us about the size of the effect (as a CI does).
• Some tests may be "statistically significant" just by chance (and some journals only report "significant" results!).
33. Type 1 error:
• When a result is statistically significant but is a chance finding and in fact there is no real difference, i.e. rejecting H0 when it is true.
• If you see an unexpected positive result (e.g. a small trial shows willow bark extract is effective for back pain), think: could this be a type 1 error? After all, every RCT of an ineffective treatment has about a 1 in 20 chance of a positive result, and a lot of RCTs are published…
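The "1 in 20" point compounds quickly when many trials are run. The sketch below assumes independent trials of treatments that truly have no effect, each tested at the 0.05 level; the function name and the trial counts are illustrative.

```python
def chance_of_any_false_positive(n_trials: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null trials is 'significant'."""
    return 1 - (1 - alpha) ** n_trials


for n_trials in (1, 5, 20):
    probability = chance_of_any_false_positive(n_trials)
    print(f"{n_trials:>2} trials: {probability:.0%} chance of at least one false positive")
```

With 20 such trials, the chance that at least one reports a "positive" result is roughly two in three.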
34. Type 2 error
• When the study finds no significant difference when in fact there is a real treatment difference.
• If a trial shows a non-significant result when you might not have expected it, think: could this be a type 2 error?
• Is the study under-powered to show a positive result? Systematic reviews, which increase study power and narrow the CI, are therefore very useful for reducing type 1 and type 2 errors.
36. Forming conclusions
• If the results are statistically significant, decide whether the observed differences are clinically important.
• If not significant, check whether the sample size was adequate, so that a clinically important difference has not been missed.
• "The power of the study" tells us how strongly we can conclude that there is no difference between the two groups.
37. • Statistical significance does not necessarily mean real significance.
• If the sample size is large, even very small differences can have a low p-value.
• Lack of significance does not necessarily mean that the null hypothesis is true.
• If the sample size is small, there could be a real difference, but we are not able to detect it.
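The effect of sample size on the ability to detect a real difference can be sketched with an approximate power calculation. The risks (20% versus 17%) are borrowed from the earlier stroke example; the sample sizes, the function name and the normal-approximation formula are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def approx_power(risk_control: float, risk_treated: float,
                 n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two proportions."""
    std_normal = NormalDist()
    standard_error = math.sqrt(
        risk_control * (1 - risk_control) / n_per_group
        + risk_treated * (1 - risk_treated) / n_per_group
    )
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # Probability that the observed difference clears the significance threshold.
    return std_normal.cdf(abs(risk_control - risk_treated) / standard_error - z_critical)


small_trial_power = approx_power(0.20, 0.17, n_per_group=100)
large_trial_power = approx_power(0.20, 0.17, n_per_group=3000)
print(f"100 per group:  power = {small_trial_power:.0%}")
print(f"3000 per group: power = {large_trial_power:.0%}")
```

The same real 3-point difference is very likely to be missed in the small trial and likely to be detected in the large one, which is why a non-significant result from a small study says little about whether an effect exists.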
38. Interpreting results
• Significant p-values do not mean the results are clinically significant.
• E.g. differences of 2-3 mm Hg in BP management can be statistically significant!
• Significant p-values do not mean that the result is unbiased, unconfounded or biologically plausible.
39. When reading a paper…
• Are the results valid?
• Were patients randomised?
• Were patients in the treatment and control groups similar with respect to other variables?
• Were patients aware of group allocation?
• Were clinicians aware of group allocation?
• Were outcome assessors aware of group allocation?
• Was follow-up complete?
40. Reading a paper…
• What are the results?
• How large was the treatment effect?
• How precise was the estimate of the treatment effect?
• When authors do not report the confidence interval
• How can I apply the results to patient care?
• Were the study patients similar to the patients in my practice?
• Were all clinically important outcomes considered?
• Are the likely treatment benefits worth the potential harm and costs?
41. Moral of the story
• Be skeptical when you hear reports of new medical advances.
• There may be no actual effect, i.e. the entire study may merely be a type 1 error (rejecting the null hypothesis when it is true).
• If an effect does exist, we may be seeing a sample outcome in the right-hand tail of the sampling distribution of possible sample effects, and the actual effect may be much weaker than reported.
42. Examples of methods for pharmaceutical companies to get the results they want from RCTs:
• Conduct a trial against a treatment known to be inferior.
• Trial your drug against too low a dose of a competitor drug.
• Conduct a trial of your drug against too high a dose of a competitor drug (making your drug seem less toxic).
• Conduct trials that are too small to show a difference from the competitor.
Smith, R. (2005) PLoS Medicine, 2 (5), e138.
43. Examples of methods for pharmaceutical companies to get the results they want from RCTs:
• Use multiple end points in the trial and select for publication those that give favourable results.
• Do multicentre trials and select for publication results from centres that are favourable.
• Conduct subgroup analyses and select for publication those that are favourable.
• Present results that are most likely to impress, for example reduction in relative rather than absolute risk.
44. Important questions…
• Who funds trials and does the funding source matter?
Public vs industry funding… Bias?
• How are trials selected and what questions are asked?
Clinically unimportant questions, or questions already adequately addressed.
• Which trials are more likely to make it into the literature?
Publication bias: selective publication of positive results and non-publication of negative results.
• How are biased studies identified when their conclusions are discredited?
Bad evidence is not clearly labelled; once papers enter the electronic literature, there they remain.