2. THE P-VALUE POLICE?
Often researchers see statistics (and statisticians) as
barriers to publishing their important work
However, good statistics can help you avoid wasting time
and money following false leads
My personal feeling is that if you are trying to use
statistics to show why your work is important and
publishable, then you need good statistics
3. ROLE OF EXPERIMENTAL DESIGN
Statistics can only be as good as the data
Good data requires thoughtfully designed experiments
Some failures of animal experiments to translate to human
trials have raised the issue of experimental design of
animal studies
NXY-059 for Stroke (Gawrylewski 2007)
Fluid resuscitation in bleeding trauma patients (Roberts 2002)
4. EXPERIMENTAL DESIGN
A well-designed experiment should
Produce unbiased comparisons between groups
Provide precise estimates
Well-designed experiments require
Clear objectives
Planning
Sample size large enough to achieve the objectives with good
power
5. EXPERIMENTAL DESIGN
Comparison/Control group
Concurrent controls
Internal control (before and after treatment)
Replication
Reduce effect of uncontrolled variation
Quantify the uncertainty in the results
Randomization
Computer generated
Blocking or stratification
Blinding
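As an illustration of computer-generated randomization with blocking, here is a minimal Python sketch. The function name `blocked_randomization`, the mouse and cage labels, and the fixed seed are all assumptions made for the example, not part of the original slides. It balances treatments within each block (here, a cage) and then shuffles the order.

```python
import random

def blocked_randomization(animal_ids, block_of, treatments=("control", "treated"), seed=2024):
    """Randomly assign treatments within each block (e.g. cage or litter)."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced and audited
    blocks = {}
    for animal in animal_ids:
        blocks.setdefault(block_of[animal], []).append(animal)
    assignment = {}
    for members in blocks.values():
        # Cycle through the treatment labels to cover the block, then shuffle them.
        labels = [treatments[i % len(treatments)] for i in range(len(members))]
        rng.shuffle(labels)
        for animal, label in zip(members, labels):
            assignment[animal] = label
    return assignment

# Hypothetical example: 8 mice housed in 2 cages of 4.
mice = [f"mouse{i}" for i in range(1, 9)]
cage = {m: ("cage A" if i < 4 else "cage B") for i, m in enumerate(mice)}
allocation = blocked_randomization(mice, cage)
for m in mice:
    print(m, cage[m], allocation[m])
```

Because the shuffle happens inside each cage, every cage contributes equally to both groups, so a cage effect cannot masquerade as a treatment effect.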
6. HYPOTHESIS TESTS
Hypothesis tests answer a yes/no question about a
population value
Example:
Quantitative assay for level of antibodies for a virus in mice
Does a vaccine have an effect on the levels of antibodies?
Null Hypothesis (H0) corresponds to no effect
Alternative Hypothesis (HA) indicates that there is an effect
7. HYPOTHESIS TESTS
Example:
Suppose there are 10 mice available for the experiment
Assay the mice for antibodies before and after vaccination
Xi is the difference in assay values for mouse number i
Is the mean value of the Xi close to 0? If so, the data are consistent with no effect
μ is population mean difference
Null hypothesis H0: μ=0
Alternative hypothesis HA: μ≠0
8. HYPOTHESIS TESTS
A hypothesis test asks whether the data give enough evidence to reject H0
Rejecting H0 indicates that either
H0 is wrong
A rare event occurred (type I error)
We cannot confirm H0 on the basis of a test
We may fail to reject H0, but we do not accept H0
9. HYPOTHESIS TESTS
Each test has an associated test statistic
For a paired t-test for the mouse vaccine data
We reject H0 when |T| > t* (two-sided test)
t* is chosen so that
Pr(Reject H0 when H0 is true) = α
In this case, t* is from a t-distribution with 9 degrees of freedom
(number of mice – 1)
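Test statistic: T = \bar{X} / (s / \sqrt{10}), where \bar{X} is the sample mean and s the sample standard deviation of the Xi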
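The paired t statistic for the mouse example can be computed in a few lines. This is a sketch only: the ten difference values are made up for illustration, and the critical value 2.262 is the standard two-sided 0.05 value for 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(differences):
    """T = mean(X) / (s / sqrt(n)) for the before/after differences X."""
    n = len(differences)
    return mean(differences) / (stdev(differences) / sqrt(n))

# Hypothetical after-minus-before assay differences for 10 mice.
diffs = [2.1, 0.4, 3.3, -0.5, 1.8, 2.6, 0.9, 1.2, 3.0, -0.2]

T_CRIT = 2.262  # t* for a two-sided 0.05 test with 9 degrees of freedom
t_stat = paired_t_statistic(diffs)
reject = abs(t_stat) > T_CRIT
print(f"T = {t_stat:.2f}, reject H0: {reject}")
```

The critical value is hard-coded, so the decision rule as written only applies to samples of 10 paired differences.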
11. HYPOTHESIS TESTS
                   Decision
Truth       Do not reject H0      Reject H0
H0 true     Right                 Type I error (α)
H0 false    Type II error (β)     Right (power)
Unfortunately, with testing comes the possibility of reaching
a wrong conclusion and making an error
12. HYPOTHESIS TESTS
Type I Error – reject H0 when it is true (false positive
finding)
Hypothesis tests are set up so that the user specifies the Type I Error
rate
Significance level α, almost always 0.05
Type II Error – failing to reject H0 when it is false (false
negative finding)
For a fixed sample size, as the Type I error rate is decreased, the rate of
Type II error is increased
13. HYPOTHESIS TESTS
The significance level is the rate of false positive findings
that you are willing to live with
Power is the probability of rejecting the null hypothesis when it
is false (1 - Type II Error rate)
Once the significance level is set, the Power for a given true effect and
variability is determined by the sample size
For the alternative shown in the figure, the power is 76%
14. HYPOTHESIS TESTS
For a two-sided t-test at the 0.05 level with 9 degrees of freedom, we
reject the null if T < -2.26 or T > 2.26
76% power if the true difference is 3.0
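The Type I error rate and the power quoted above can be checked by simulation. The sketch below draws many experiments of 10 normally distributed differences and counts how often |T| exceeds 2.262. It assumes that the "true difference of 3.0" is expressed in standard-error units (a true mean of 3.0 × s/√10), which is an interpretation of the figure rather than something the slides state. The function name and simulation settings are illustrative.

```python
import random
from math import sqrt

T_CRIT = 2.262   # two-sided 0.05 critical value, 9 degrees of freedom
N_MICE = 10

def rejection_rate(true_mean, sd=1.0, n=N_MICE, n_sims=20000, seed=1):
    """Fraction of simulated experiments in which |T| > T_CRIT."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = sum(x) / n
        s = sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
        t = m / (s / sqrt(n))
        if abs(t) > T_CRIT:
            rejections += 1
    return rejections / n_sims

# Under H0 (true mean 0) the rejection rate estimates the Type I error rate.
type_i_rate = rejection_rate(true_mean=0.0)
# Assumed alternative: true mean equal to 3.0 standard errors.
power = rejection_rate(true_mean=3.0 / sqrt(N_MICE))
print(f"Type I error rate ~ {type_i_rate:.3f}, power ~ {power:.3f}")
```

Under these assumptions the first rate should land near 0.05 and the second near the 76% shown on the slide, up to simulation noise.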
15. HYPOTHESIS TESTS
Role of sample size
In designing an experiment, one should determine an
appropriate sample size for the goals of the experiment
Given
Expected difference between groups
Expected variability of measurements
Significance level that will be used
Power to be targeted
One can determine the sample size to achieve the study goal
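For a rough idea of how the four inputs combine, a common normal-approximation formula for comparing two group means is n per group = 2 (z_{1-α/2} + z_{power})² σ² / Δ². The sketch below implements it; the function name and the example values (a 5-unit difference with a standard deviation of 8) are hypothetical, and dedicated software based on the t-distribution will give slightly larger answers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # e.g. about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # e.g. about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Hypothetical inputs: expected difference 5, standard deviation 8.
print(n_per_group(delta=5.0, sd=8.0))   # prints 41
```

Halving the expected difference roughly quadruples the required sample size, which is why the expected difference should be chosen realistically.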
16. HYPOTHESIS TESTS
Role of sample size
There are software packages and online power calculators
available for determining sample size
If the sample size is too small for the study goal, the test result is likely
to be negative even when a real effect exists (underpowered)
If the sample size is too large for the study goal, resources will be
wasted
18. HYPOTHESIS TESTS
P-value
Smallest level of significance for which you would reject the Null
Hypothesis with your data
Probability of obtaining data at least as extreme as what was found if the
Null Hypothesis were true
Provides a measure of the evidence against the Null Hypothesis
Small p-values (close to 0) show strong evidence against the null hypothesis
Large p-values (close to 1) show only weak evidence against the null hypothesis
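The "as extreme as what was found" definition can be made concrete with a sign-flip randomization test for paired differences. Note that this is a swapped-in technique, not the t-test used elsewhere in these slides: if the treatment has no effect, each difference is equally likely to be positive or negative, so all 2^n sign patterns can be enumerated. The function name and the data are hypothetical.

```python
from itertools import product

def sign_flip_p_value(differences):
    """Two-sided p-value: share of sign patterns whose |sum| is at least the observed |sum|."""
    observed = abs(sum(differences))
    magnitudes = [abs(d) for d in differences]
    total = 0
    extreme = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        total += 1
        flipped_sum = sum(s * m for s, m in zip(signs, magnitudes))
        if abs(flipped_sum) >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical after-minus-before differences for 10 mice (1024 sign patterns).
diffs = [2.1, 0.4, 3.3, -0.5, 1.8, 2.6, 0.9, 1.2, 3.0, -0.2]
p_value = sign_flip_p_value(diffs)
print(f"p = {p_value:.4f}")
```

When every difference has the same sign, only the all-positive and all-negative patterns are as extreme as the data, which gives the smallest possible two-sided p-value of 2 / 2^n.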
19. HYPOTHESIS TESTS
If p-value ≤ α then reject H0
The p-value is determined by
How far the data are from the Null Hypothesis
The sample size
When a real effect exists, the larger the sample, the smaller the p-value
tends to be and the greater the power
20. HYPOTHESIS TEST LIMITATIONS
P-values and hypothesis tests give a dichotomous
(significant/not significant) view of study results
Statistically significant means that a difference as large as the one
observed would be unlikely if chance alone were at work
Either H0 is not correct or
The observed data are a rare event – happening no more than
(100*α)% of the time when H0 is true
21. HYPOTHESIS TEST LIMITATIONS
Statistical significance doesn’t mean that the observed
difference is important
Could find a statistically significant result with a large sample
size when the observed difference is small and unimportant
Could have a large and important difference between groups
with a small sample size and not have statistical significance
Would especially be the case for an underpowered study
22. CONFIDENCE INTERVALS
Confidence intervals show the precision of the sample
values as estimates of population values
Provide a range of population values that are consistent with the
study findings
Often more informative than p-values
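A 95% confidence interval for a mean paired difference has the form mean ± t* · s/√n. The sketch below uses made-up differences for 10 mice and the standard critical value 2.262 for 9 degrees of freedom; both the data and the function name are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 95% critical value for 9 degrees of freedom (n = 10 only)

def mean_ci_95(differences, t_crit=T_CRIT_DF9):
    """Return (lower, upper) limits of the confidence interval for the mean difference."""
    n = len(differences)
    centre = mean(differences)
    half_width = t_crit * stdev(differences) / sqrt(n)
    return centre - half_width, centre + half_width

# Hypothetical after-minus-before differences for 10 mice.
diffs = [2.1, 0.4, 3.3, -0.5, 1.8, 2.6, 0.9, 1.2, 3.0, -0.2]
low, high = mean_ci_95(diffs)
print(f"mean difference 95% CI: ({low:.2f}, {high:.2f})")
```

An interval that excludes 0 corresponds to rejecting H0 at the 0.05 level, while its width shows how precisely the effect has been estimated.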
23. TEST OR INTERVAL LIMITATIONS
A significance test/confidence interval doesn’t provide a
check of the study design
Example: in a study of gene expression
Cancer tissue samples kept on ice while the normal tissue samples are processed
Observed differences in expression may be due to iced/not iced rather than
cancer/normal
A statistical procedure will never indicate that this is the reason for the result
24. ROLE OF DATA DISTRIBUTION
Particular tests are tuned for data from the normal
(Gaussian) distribution
Examples
T-test
Standard (Pearson) correlation
Often it is difficult to be sure that the data come from
the normal distribution
Plot histograms of data – bell-shaped and symmetric?
Plot ordered data values against expected normal values – is
a straight line obtained? (called QQ plots)
Plots require a substantial amount of data to be conclusive
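The points for a normal QQ plot can be computed without any plotting library. This sketch pairs each ordered data value with the quantile expected from a normal distribution with the same mean and standard deviation; the plotting positions (i + 0.5)/n are one common convention among several, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (expected normal value, observed value) pairs, sorted by observed value."""
    n = len(data)
    ordered = sorted(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting positions (i + 0.5) / n avoid the undefined quantiles at 0 and 1.
    expected = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, ordered))

# Hypothetical measurements.
values = [2.1, 0.4, 3.3, -0.5, 1.8, 2.6, 0.9, 1.2, 3.0, -0.2]
for expected, observed in normal_qq_points(values):
    print(f"{expected:6.2f}  {observed:6.2f}")
```

If the data are roughly normal, the two columns track each other closely and a scatter plot of the pairs falls near a straight line.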
25. ROLE OF DATA DISTRIBUTION
Some tests are specifically designed to work reasonably
well with data from any distribution
Called Nonparametric or distribution-free tests
Examples
Wilcoxon test (alternative to t-test)
Spearman correlation (alternative to standard correlation)
In some situations these may be less likely to reject the null
hypothesis of no difference than tests based on normal
data
May want to see if nonparametric results are similar to
those assuming normality
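To show what a rank-based test does, here is a sketch of the Wilcoxon rank-sum test for two independent groups, using a normal approximation for the p-value. It is a simplified illustration: tied values receive average ranks but no tie correction is applied to the variance, no continuity correction is used, and statistical packages often use exact p-values for small samples. The function name is an assumption.

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Return (rank sum of x, z statistic, two-sided p-value) by normal approximation."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # average rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Tiny hypothetical example in which the groups do not overlap at all.
w, z, p = rank_sum_test([1, 2, 3], [4, 5, 6])
print(f"W = {w}, z = {z:.2f}, p = {p:.3f}")
```

Only the ordering of the values enters the calculation, which is why the result does not depend on the data being normally distributed.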
26. EXAMPLE
Study question: what is the effect of calcium on blood
pressure in African-American men
Experiment: a randomized comparison
Treatment group of 10 men received a calcium supplement for
12 weeks
Control group of 11 men received a placebo during the same
period
Outcome is the difference in the seated systolic blood
pressure (BP) over the 12-week period
Lyle RM, et al., "Blood pressure and metabolic effects of calcium supplementation in normotensive white
and black men," JAMA, 257(1987), pp. 1772-1776
29. EXAMPLE
These plots aren’t very useful in determining the data
distribution
Don’t really suggest normality
Aren’t conclusively non-normal either
Ambiguity is typical with small numbers
Should probably look at both t-test and Wilcoxon test
If same results – everything is fine
If different results – probably trust nonparametric more
30. EXAMPLE
The t-test is not significant at the 0.05 significance
level
P-value = 0.12
The Wilcoxon test is not statistically significant at the
0.05 significance level
P-value = 0.33
The test results are consistent in that with either we fail
to reject the null hypothesis
Important difference? Check the confidence intervals
32. EXAMPLE
So we found a 5 mm Hg difference between
groups…
Might be large enough to be important?
But can’t rule out that this finding is due to chance (P-value >
α)
If 5 mm Hg is worth pursuing, would need to evaluate
this in a larger sample
Do the power and sample size calculation!
If not, pursue more promising therapies
33. MULTIPLE-TESTING
Another issue to be aware of is limits of ordinary statistical
significance when doing many tests
When we use a significance level of α=0.05, we allow
about 5 out of every 100 tests of a true null hypothesis to be false positives
When tens or hundreds of tests are run, false positive findings
are almost guaranteed
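The "almost guaranteed" claim follows from a short calculation. Assuming the tests are independent and every null hypothesis is true (a simplifying assumption), the chance of at least one false positive among m tests is 1 - (1 - α)^m. The function name below is illustrative.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 10, 100):
    print(f"{m:3d} tests: {prob_any_false_positive(m):.3f}")
```

With 10 tests the chance is already about 40%, and with 100 tests it exceeds 99%.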
36. MULTIPLE-TESTING
Methods exist (and new ones are being continually
developed) to deal with multiple testing issues
Bonferroni correction
Tukey’s method
False discovery rates
Which method is used is less important than that something is done
to account for the number of tests
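Two of the listed approaches are simple enough to sketch directly. The Bonferroni correction compares each p-value with α/m, and the Benjamini-Hochberg procedure (one standard way to control the false discovery rate) rejects the k smallest p-values, where k is the largest rank with p(k) ≤ (k/m)·α. The function names and the p-values are hypothetical; Tukey's method is not shown because it applies to pairwise comparisons of group means rather than to a list of p-values.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject

# Hypothetical p-values from 10 tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("unadjusted:", sum(p <= 0.05 for p in p_vals))
print("Bonferroni:", sum(bonferroni_reject(p_vals)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg_reject(p_vals)))
```

In this example five tests are significant without adjustment, one survives Bonferroni and two survive Benjamini-Hochberg, which illustrates that Bonferroni is the more conservative of the two.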
37. REFERENCES
Triola MM, Triola MF. Biostatistics for the Biological
and Health Sciences. Pearson Education Inc., 2006
Broman K. Statistics for Laboratory Scientists I, 2006
(Course Website)
http://ocw.jhsph.edu/courses/StatisticsLaboratoryScientistsI/
Festing MFW, Overend P, Das RG, Borja MC, Berdoy
M. The Design of Animal Experiments. Laboratory
Animal Handbooks #14. Royal Society of Medicine
Press Ltd., 2011
38. REFERENCES
Festing M. Principles: the need for better
experimental design. TRENDS in Pharmacological
Sciences, 24:341-5, 2003
Roberts I, Kwan I, Evans P, Haig S. Does animal
experimentation inform human healthcare?
Observations from a systematic review of
international animal experiments on fluid resuscitation.
BMJ, 324:474-6, 2002