Overview of statistics: Statistical testing (Part I)
1. Overview of Statistics I:
Statistical Testing
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch (BCBB)
Office of Cyber Infrastructure and Computational Biology (OCICB)
National Institute of Allergy and Infectious Diseases (NIAID)
2. Want To Publish Statistical Results?
Most research journals ask three big questions:
• Do your statistical tests inappropriately assume your
data is normal or Gaussian distributed?
Parametric tests vs. nonparametric tests
E.g. Student’s T-tests vs. Wilcoxon Rank Sum tests
• Do your statistical tests require large samples?
Approximate tests vs. exact tests
• Do you require adjustments for multiple testing?
False Discovery Rate (FDR) adjustments for high throughput biology
Tukey’s Honest Significant Difference (HSD) for analysis of variance
4. Outline
• Review the statistical testing process
• When do I use one-sided vs. two-sided tests?
• When do I use parametric vs. nonparametric tests?
• When do I use large-sample tests vs. exact tests?
• When do I need to adjust for multiple testing?
• Additional questions about testing?
5. Recall the Statistical Testing Process
• Formulate null and alternative hypotheses
E.g. (null) H0: μ1 = μ2 vs. (alternative) HA: μ1 ≠ μ2
• Calculate the appropriate test statistic
E.g. Student’s t-test, Wilcoxon test, …
• Compute the probability of observing the test statistic
(i.e. your sample data) under the null hypothesis
I.e. Compute a p-value
• Make a statistical decision
“Reject the null hypothesis” or “fail to reject the null hypothesis”
• Make a biological conclusion
E.g. New drug reduces viral load, vitamin C helps prevent cancer, …
6. Null and Alternative Hypotheses
• (null) Drug and placebo viral loads are equal vs. (alternative) viral
loads higher for placebo group than for drug group
• (null) H0: μP – μD ≤ 0 vs. (alternative) HA: μP – μD > 0
7. What is a Statistical Test?
$$\text{Test} = \frac{\text{Statistic} - \text{Null Value}}{\text{Error}} = \frac{\text{Difference}}{\text{Error}}$$
• Almost all tests used in inferential statistics can be
generalized as the ratio of a “difference” over an “error”
Difference between a statistic and null value (usually 0)
A statistic is nothing more than a numeric summary of the experimental
data with respect to the null hypothesis
A null value is an assumption about the population under the null
hypothesis
An error is an estimate of the variability (standard error) of the statistic’s sampling distribution
8. Example: Two-sample Student’s T-test
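$$T^* = \frac{\overbrace{(\bar{X}_1 - \bar{X}_2)}^{\text{statistic}} - \overbrace{0}^{\text{null value}}}{\underbrace{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}_{\text{standard error}}}$$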
• The “statistic” in a two-sample t-test is a difference between
the two sample means and the null value is zero
The hypothesis μ1 = μ2 implies μ1 – μ2 = 0
• The standard error is built from a pooled estimate of the common variance
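The "difference over error" structure of the two-sample T* can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pooled_t_statistic` and the toy data are assumptions made for this example, not part of the original slides.

```python
import math
import statistics

def pooled_t_statistic(x1, x2, null_value=0.0):
    """Two-sample Student's T* with a pooled (equal-variance) standard error."""
    n1, n2 = len(x1), len(x2)
    # Difference: the statistic (difference in sample means) minus the null value
    difference = (statistics.mean(x1) - statistics.mean(x2)) - null_value
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    # Error: standard error of the difference in means
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return difference / std_error

# Toy data chosen for illustration only
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9]
t_star = pooled_t_statistic(group_a, group_b)  # equals -sqrt(21), about -4.58
```

A large negative T* here reflects that the first group's mean sits several standard errors below the second group's mean.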
9. Null Distributions
Compute T* statistic
for our sample
Compare against the
distribution of all possible
T* statistics for all
possible samples from
the population under the
null hypothesis
• If the null hypothesis is true (i.e. no difference between groups),
then the T* statistics from most samples should be near zero
• Many null distributions (or sampling distributions) approximately
follow well known probability distributions, e.g. normal distribution
10. P-values
• A p-value is the probability of
observing data at least as extreme as yours, given that the
null hypothesis is actually true
• P-values do NOT represent the
probability that the null is true
• P-values do NOT represent the
probability that a model is incorrect
• P-values do NOT represent the
strength or size of an effect
If the null distribution follows a well
known probability distribution, like
the normal distribution, the p-values
are computed by integration
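To make "computed by integration" concrete, the sketch below approximates an upper-tail p-value by integrating the standard normal density numerically and compares it with the closed-form tail from the cumulative distribution function. The observed statistic `z_observed = 1.96`, the integration limit, and the step count are illustrative assumptions.

```python
from statistics import NormalDist

def upper_tail_by_integration(z, upper_limit=10.0, steps=20_000):
    """Approximate P(Z >= z) under N(0, 1) with the trapezoid rule."""
    pdf = NormalDist().pdf
    h = (upper_limit - z) / steps
    total = 0.5 * (pdf(z) + pdf(upper_limit))
    for i in range(1, steps):
        total += pdf(z + i * h)
    return total * h

z_observed = 1.96  # hypothetical test statistic
p_numeric = upper_tail_by_integration(z_observed)
p_closed = 1.0 - NormalDist().cdf(z_observed)  # same tail area from the CDF
```

Both values are the area under the null distribution to the right of the observed statistic; statistical software generally uses the CDF form.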
11. Statistical Decisions and
Biological Conclusions
• A statistical decision is a choice to “reject the null
hypothesis” or “fail to reject the null hypothesis”
The decision is based on a critical value or decision rule
E.g. Reject the null hypothesis if p-value < 0.05
• A biological conclusion is the final interpretation of
the statistical testing process in plain language
E.g. Vitamin C prevents cancer, drug reduced viral load, …
Make sure conclusion can be justified by the hypotheses
12. Type I and Type II Errors
                              Actual population difference?
                              Yes               No
Was the difference     Yes    OK                Type I Error
detected by the                                 (False Positive)
statistical test?      No     Type II Error     OK
                              (False Negative)
• Different types of experiments attempt to minimize
Type I errors, Type II errors or both kinds of errors
E.g. Type II errors are more important in medical testing
13. Type I and Type II Errors - Example
• Suppose mean viral load is 7,000
lower after taking a new drug
Drug population mean viral load 33,000
Placebo population mean viral load 40,000
• Samples from the population may
not be representative
By chance we sample 120 sickly patients
for the drug treatment group
By chance we sample 120 robust patients
for the placebo treatment group
• This data yields a Type II error
because of a strange sample
14. Review: Statistical Testing
• Formulate null and alternative hypotheses
Null and alternative hypotheses are mutually exclusive
and exhaustive statements about the population
Typically assume the null hypothesis is true, until we find
evidence to refute the null in favor of the alternative
E.g. H0: µ = 0 versus HA: µ ≠ 0
• Calculate the appropriate test statistic and find its
probability under the null hypothesis
• Make a statistical decision and biological conclusion
15. Two-sample Student’s T-test
$$T^* = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
where $s_p$ is the pooled standard deviation
• Student’s T-test is a parametric test
to compare means of two samples
from normally distributed populations
Assumptions of Student’s T-test:
Data from each group are normal
or sample sizes greater than 30
Equal variances among groups
Independent and identically
distributed (iid) normal random
errors from the group means
16. Parametric Statistical Tests
• Parametric tests assume the populations have known distributions
with unknown parameters that must be estimated (and compared)
17. Central Limit Theorem
• Student’s T-test is computed with
a difference of two sample
means
• Draw thousands of samples of
size n from one population to
view the distribution of their
sample means
• As sample size n increases, the
distribution of the sample means
becomes approximately Gaussian (normal),
even from non-normal
populations
• Student’s T-test does not require
normal data if sample size is
large
Figure: Sample means from a uniform distribution
are approximately normal for n = 20
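The simulation described above can be sketched as follows: draw many samples of size n = 20 from a uniform distribution and look at how their means behave. The seed, the number of draws, and the helper name `sample_means` are assumptions for this illustration.

```python
import random
import statistics

def sample_means(n, draws, rng):
    """Means of `draws` independent samples of size n from Uniform(0, 1)."""
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(draws)]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
means = sample_means(n=20, draws=5000, rng=rng)

center = statistics.fmean(means)            # should be near 0.5
spread = statistics.stdev(means)            # should be near sqrt((1/12) / 20)
expected_spread = (1 / 12 / 20) ** 0.5      # Var(Uniform(0,1)) = 1/12

# For a normal distribution, about 95% of values fall within 1.96 SD of the mean
within = sum(abs(m - 0.5) <= 1.96 * expected_spread for m in means) / len(means)
```

If the sample means are close to normal, `within` should land near 0.95 even though the underlying data are flat rather than bell-shaped.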
18. Equal Variance Assumption
$$T^* = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
• Usual two-sample Student’s T-test computations assume both
samples share approximately equal variances
• Welch’s correction computes an appropriate T-test when variances
are NOT equal between the two samples
• Welch’s correction is available in most software, so look carefully
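A minimal sketch of the unequal-variance statistic is shown below. The degrees-of-freedom line uses the Welch-Satterthwaite approximation, which the slides do not spell out and is included here as an assumption about how the correction is usually completed; the function name and toy data are also illustrative.

```python
import math
import statistics

def welch_t(x1, x2):
    """Welch's T* and approximate degrees of freedom for unequal variances."""
    n1, n2 = len(x1), len(x2)
    # Per-group variance of the sample mean: s_i^2 / n_i
    v1 = statistics.variance(x1) / n1
    v2 = statistics.variance(x2) / n2
    t_star = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (assumption: not stated on the slide)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_star, df

# Toy data chosen for illustration only
t_star, df = welch_t([1, 2, 3, 4, 5], [6, 7, 8, 9])
```

The approximate degrees of freedom always fall between the smaller group's n - 1 and the pooled value n1 + n2 - 2.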
19. Two-Sided Test
• Researchers expect two samples
will be different, but do not know
which will have the higher mean
E.g. viral load for drug group could
be higher or lower than placebo
E.g. H0: μP = μD vs. HA: μP ≠ μD
• Two-sided tests are less powerful,
but more general than analogous
one-sided tests of same data
The α = 0.05 rejection region of a
two-sided test is divided in two parts
Need larger T-statistic for significant
p-values with two-sided test
20. One-Sided Test
• Researchers expect one sample
will have a higher mean than the
other sample before testing
E.g. viral load for drug group should
be lower than placebo
E.g. H0: μP ≤ μD vs. HA: μP > μD
• One-sided tests are more powerful
but require more assumptions
The α = 0.05 rejection region of the
one-sided test is all on one side
Smaller T-statistics will produce
significant p-values in one-sided
tests
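The difference between the two rejection regions can be sketched numerically. As a simplifying assumption, the example uses the standard normal distribution as a large-sample stand-in for the t distribution, and the observed statistic `z = 1.80` is hypothetical.

```python
from statistics import NormalDist

null = NormalDist()   # large-sample stand-in for the t distribution (assumption)
alpha = 0.05

# One-sided test: all of alpha sits in one tail
one_sided_cutoff = null.inv_cdf(1 - alpha)        # about 1.645
# Two-sided test: alpha is split between two tails
two_sided_cutoff = null.inv_cdf(1 - alpha / 2)    # about 1.960

z = 1.80  # hypothetical observed statistic
p_one = 1 - null.cdf(z)                # upper-tail p-value
p_two = 2 * (1 - null.cdf(abs(z)))     # both tails
```

With this statistic the one-sided p-value falls below 0.05 while the two-sided p-value does not, which is the sense in which the two-sided test needs a larger statistic.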
21. Nonparametric Statistical Tests
• Nonparametric tests make no assumptions about the distribution of
a population and focus on more general descriptions, like medians
22. Wilcoxon Rank Sum Test
$$Z^* = \frac{\left(R_1 - \frac{n_1(n_1 + 1)}{2}\right) - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$
• The Wilcoxon rank sum test is a
nonparametric test to compare
the sums of two ranked samples
• If assumptions are met, the test
can be used to compare medians
Wilcoxon rank sum test assumes both
samples are from the same type of
distribution with different locations
Also known as the Mann-Whitney Test
23. Example: Wilcoxon Rank Sum Test
$$Z^* = \frac{\overbrace{\left(R_1 - \frac{n_1(n_1 + 1)}{2}\right)}^{\text{statistic}} - \overbrace{\frac{n_1 n_2}{2}}^{\text{null value}}}{\underbrace{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}_{\text{standard error}}}$$
• The “statistic” in a Wilcoxon rank sum test is equivalent to the
difference between the sums of the ranked data in each class
• The null hypothesis R1 = R2 produces a strange statistic, null value
and standard error due to relationships among sums of ranks
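The statistic, null value, and standard error of the large-sample Wilcoxon Z* can be computed by hand as sketched below. This illustration ranks the pooled data (averaging ranks for ties) and applies the formula as written; it assumes no tie or continuity correction, and the helper names and toy data are not from the slides.

```python
import math

def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_z(x1, x2):
    """Large-sample Z* for the Wilcoxon rank sum test (no tie correction)."""
    n1, n2 = len(x1), len(x2)
    ranks = average_ranks(list(x1) + list(x2))
    r1 = sum(ranks[:n1])                        # rank sum of the first sample
    statistic = r1 - n1 * (n1 + 1) / 2
    null_value = n1 * n2 / 2
    std_error = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (statistic - null_value) / std_error

# Toy data chosen for illustration only
z_star = wilcoxon_z([1, 2, 3, 4, 5], [6, 7, 8, 9])  # equals -sqrt(6), about -2.45
```

For samples this small, an exact version of the test would normally be preferred over the normal approximation.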
24. Does It Really Compare Medians?
• If two samples come from the same
type of distribution … YES
Median of a sample is comparable to its
middle ranked observation(s)
If two samples share a similar shape, the
sample with the significantly higher rank
sums will have the higher median too
• If two samples come from two very
different types of distributions … NO
The Wilcoxon rank sum test actually
compares the sums of the ranked data
Many counter examples have significant
Wilcoxon tests, but equal medians
Figure: Control and Treated samples
both have median = 100
Wilcoxon rank sum test has a
significant p-value = 0.0000027
25. Does Wilcoxon Compare Distributions?
• Wilcoxon rank sum test
does NOT compare the
distributions of samples
• Samples from two very
different distributions can
yield non-significant
Wilcoxon test p-values
• It is difficult to interpret
Wilcoxon rank sum test if
assumptions aren’t met
26. Student’s T vs. Wilcoxon
• Two-sample Student’s T-test
Assumes populations are normal, but works for all
populations if large sample sizes are used for both
classes
Assumes variances are equal or requires Welch
correction
• Wilcoxon Rank Sum Test
Assumes both samples are from the same type of
distribution
Often preferred over Student’s T-test by journals when normality is in doubt
• What if neither test is appropriate?
27. Example: Viral Loads
• Want to compare viral loads
under treatment and placebo
Viral loads are very high (> 10,000)
and skewed right for the placebo
group
Viral loads all equal zero for treatment
• Both Student’s T-test and the
Wilcoxon test are inappropriate
Don’t have normal data or equal
variances for Student’s T-test
Can’t use Wilcoxon test to compare
two different distributions
• Need an appropriate p-value, so
what statistical test can we use?
28. Exact Tests vs. Approximate Tests
• Exact statistical tests are based on probability
statements that are valid for any sample size
Usually based on combinatorial or resampling strategies
All resampling tests are considered exact tests
Some implementations of Wilcoxon tests use exact tests based on
combinatorial arguments for small sample sizes
• Approximate statistical tests are based on mathematical
arguments about convergence with large sample sizes
Student’s T-test is an approximate test based on arguments similar to
the Central Limit Theorem
Approximate tests may have inaccurate p-values for small samples
29. Example: Bootstrap Samples
Original Data    Bootstrap Samples
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1         3         1         2         …
A      2         3         5         2         …
A      3         2         3         2         …
A      4         4         1         4         …
A      5         2         2         1         …
B      6         7         9         6         …
B      7         6         6         7         …
B      8         8         6         8         …
B      9         8         9         9         …
• Bootstrapping uses sampling with replacement within each class
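Sampling with replacement within each class can be sketched with the standard library. The function name `bootstrap_sample` and the seed are assumptions for this illustration; the data mirror the classes A and B in the table.

```python
import random

def bootstrap_sample(data_by_class, rng):
    """Resample each class with replacement, keeping its original size."""
    return {label: rng.choices(values, k=len(values))
            for label, values in data_by_class.items()}

original = {"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9]}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
samples = [bootstrap_sample(original, rng) for _ in range(3)]
```

Because values are drawn with replacement, a bootstrap sample can repeat some observations and omit others, but it never mixes values across classes.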
30. Example: Jackknife Samples
Original Data    Jackknife Samples
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1                   1         1         …
A      2         2                   2         …
A      3         3         3                   …
A      4         4         4         4         …
A      5         5         5         5         …
B      6                   6         6         …
B      7         7                   7         …
B      8         8         8                   …
B      9         9         9         9         …
• Jackknifing uses “leave one out” sampling within each class
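Leave-one-out sampling for a single class can be sketched as below. The jackknife standard-error formula is not on the slide; it is the standard textbook form and is included here as an assumption about how the samples are typically used. Function names are illustrative.

```python
import math
import statistics

def jackknife_samples(values):
    """All leave-one-out samples: each omits exactly one observation."""
    return [values[:i] + values[i + 1:] for i in range(len(values))]

def jackknife_se(values, estimator=statistics.fmean):
    """Jackknife standard error of an estimator (standard formula, assumed)."""
    n = len(values)
    estimates = [estimator(s) for s in jackknife_samples(values)]
    center = statistics.fmean(estimates)
    return math.sqrt((n - 1) / n * sum((e - center) ** 2 for e in estimates))

class_a = [1, 2, 3, 4, 5]
loo_a = jackknife_samples(class_a)   # five samples of size four
se_mean_a = jackknife_se(class_a)    # matches s / sqrt(n) for the mean
```

For the sample mean, the jackknife standard error reproduces the usual s / sqrt(n), which makes it a convenient sanity check.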
31. Example: Permutation Samples
Original Data    Permutation Samples
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1         B         A         A         …
A      2         B         A         B         …
A      3         A         A         A         …
A      4         A         B         B         …
A      5         B         A         A         …
B      6         A         B         B         …
B      7         A         A         B         …
B      8         A         B         A         …
B      9         B         B         A         …
• Permutation tests scramble the class labels among the samples
32. Example: Viral Loads
• One-sided permutation test
to compare the medians
from two different samples
• Permutations of differences
in median are centered at 0
• Compute p-value using:
$$p = \frac{\#\{\text{permutations} > \text{true difference in medians}\}}{\text{total \# permutations}}$$
or
$$p = \frac{\#\{\text{permutations} < \text{true difference in medians}\}}{\text{total \# permutations}}$$
33. Example: Viral Loads
• Two-sided permutation test
uses absolute values of each
permutation and the true
difference in medians
• Permutations of differences
are all greater than zero
• Compute p-value using:
$$p = \frac{\#\{|\text{permutation}| > |\text{true difference in medians}|\}}{\text{total \# permutations}}$$
34. Multiple Testing
• Use Family-Wise Error-Rate (FWER) adjustments for
analysis of variance (ANOVA) tests or comparisons
among 3-20 groups of samples
One-way ANOVA is an extension of the Student’s T-test
Kruskal-Wallis is an extension of the Wilcoxon rank sum test
Permutation tests can be adjusted with Bonferroni and other methods
• Use False Discovery Rate (FDR) adjustments for high-
throughput biology experiments like microarrays
E.g. Microarrays, real time PCR, next gen sequencing, …
FDR methods are more powerful than family-wise error rate (FWER)
controlling methods, like those used in ANOVA, for high-throughput
methods with hundreds or thousands of tests
35. FWER Adjustments: Bonferroni
• Suppose you want to compare 5 new drugs against
a placebo, but you know all 5 drugs are ineffective
Compute 5 Student’s T-tests with false positive rate α = 0.05
Each test has a 95% chance to correctly find p > 0.05
Among all 5 independent tests, the chance of at least one false positive is:
1 – 0.95^5 ≈ 0.23 > 0.05
• The Bonferroni FWER adjustment
Divide the false positive rate α = 0.05 by the number of tests, so only
p-values smaller than α = 0.05 / 5 = 0.01 are significant
Multiply p-values by the number of tests for an “adjusted p-value”,
using the formula min(1, 5*p) for these five tests.
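The arithmetic on this slide can be checked with a short sketch. The family-wise error rate formula assumes the tests are independent, and the raw p-values below are hypothetical values chosen for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_adjust(p_values):
    """Bonferroni adjusted p-values: min(1, m * p) for m tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

fwer = family_wise_error_rate(0.05, 5)        # 1 - 0.95**5, about 0.226

raw = [0.004, 0.020, 0.030, 0.300, 0.800]     # hypothetical p-values
adjusted = bonferroni_adjust(raw)             # about [0.02, 0.10, 0.15, 1.0, 1.0]
significant = [p < 0.05 for p in adjusted]    # same as raw p < 0.05 / 5
```

Comparing adjusted p-values with 0.05 gives the same decisions as comparing raw p-values with 0.05 / 5 = 0.01.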
36. Other FWER Methods
• Tukey’s Honest Significant Difference
Uses the “studentized range” method for all pair-wise comparisons
E.g. for three groups, compare A vs. B, B vs. C and A vs. C
• Dunnett’s Multiple Comparisons Against a Control
Uses “standardized range method” for comparisons against a control
E.g. for three groups, compare A vs. C and B vs. C for control group C
• Popular, yet outdated methods
Fisher’s LSD, Student-Newman-Keuls, Duncan’s Test, …
37. False Discovery Rate Methods
• Consider a microarray experiment with 20,000 genes
Bonferroni requires a false positive rate α = 0.05 / 20,000 = 0.0000025
Few, if any, genes will be statistically significant using Bonferroni
• The purpose of a microarray experiment is different
than an ANOVA experiment comparing 3-20 groups
Microarray experiments are often considered “fishing expeditions”
Want to find approximately 100 genes of interest for follow up
experiments with quantitative real-time PCR or other methods
Willing to accept a few false positives among our significant results, if
we can capture all the biologically important genes in the process
38. FDR Methods (cont)
• Suppose you could test
5,000 genes that were not
differentially expressed
• Those 5,000 genes would
include many false
positives (about 5,000 × 0.05 = 250 at α = 0.05)
• The p-values should
follow a uniform
distribution from
p = 0.00 to p = 1.00
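A small simulation illustrates why null genes alone produce many false positives. As an assumption for this sketch, each null gene's test statistic is drawn from a standard normal distribution and converted to a two-sided p-value; the seed and variable names are illustrative.

```python
import random
from statistics import NormalDist

rng = random.Random(42)   # fixed seed so the sketch is repeatable
null = NormalDist()
n_genes = 5000

# Under the null hypothesis each test statistic is N(0, 1) (assumption)
p_values = [2 * (1 - null.cdf(abs(rng.gauss(0, 1)))) for _ in range(n_genes)]

false_positives = sum(p < 0.05 for p in p_values)          # expect roughly 250
below_half = sum(p < 0.5 for p in p_values) / n_genes      # expect roughly 0.5
```

Because null p-values are uniform, about 5% of them fall below 0.05 no matter how many genes are tested.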
39. FDR Methods (cont.)
• Add in 1,000 differentially
expressed genes (DEGs)
• All DEGs have p < 0.05
• Want to adjust the cut-off
value α = 0.05 until the list
of significant genes has a
controlled proportion of
false positives
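The slides do not name a specific FDR procedure. As one common choice, the sketch below implements the Benjamini-Hochberg step-up adjustment; treating it as the method intended here is an assumption, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical p-values
fdr_adjusted = benjamini_hochberg(raw)     # about [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

Genes whose adjusted p-value falls below a chosen level (for example 0.05) form a list in which the expected proportion of false positives is controlled at that level.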
40. Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455