Overview of statistics: Statistical testing (Part I)

Overview of Statistics I:
Statistical Testing
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch (BCBB)
Office of Cyber Infrastructure and Computational Biology (OCICB)
National Institute of Allergy and Infectious Diseases (NIAID)

Want To Publish Statistical Results?
Most research journals ask three big questions:
• Do your statistical tests inappropriately assume your
data is normal or Gaussian distributed?
P t i t t t i t t Parametric tests vs. nonparametric tests
 E.g. Student’s T-tests vs. Wilcoxon Rank Sum tests
D t ti ti l t t i l l ?• Do your statistical tests require large samples?
 Approximate tests vs. exact tests
• Do you require adjustments for multiple testing?
 False Discovery Rate (FDR) adjustments for high throughput biology
 Tukey’s Honest Significant Difference (HSD) for analysis of variance Tukey s Honest Significant Difference (HSD) for analysis of variance

Nature – Statistical Checklist

Outline
• Review the statistical testing process
• When do I use one-sided vs. two-sided tests?
• When do I use parametric vs. nonparametric tests?
• When do I use large-sample tests vs. exact tests?
• When do I need to adjust for multiple testing?
• Additional questions about testing?

Recall the Statistical Testing Process
• Formulate null and alternative hypotheses
 E.g. (null) H0: μ1 = μ2 vs. (alternative) HA: μ1 ≠ μ2
• Calculate the appropriate test statistic
 E.g. Student’s t-test, Wilcoxon test, …
• Compute the probability of observing the test statistic
(i.e. your sample data) under the null hypothesis
 I.e. Compute a p-value
• Make a statistical decision
 “Reject the null hypothesis” or “fail to reject the null hypothesis”
• Make a biological conclusion
 E.g. New drug reduces viral load, vitamin C helps prevent cancer, …

Null and Alternative Hypotheses
• (null) Drug and placebo viral loads are equal vs. (alternative) viral
loads higher for placebo group than for drug group
• (null) H0: μP – μD ≤ 0 vs. (alternative) HA: μP – μD > 0

What is a Statistical Test?
$$ \text{Test} = \frac{\text{Statistic} - \text{Null Value}}{\text{Error}} = \frac{\text{Difference}}{\text{Error}} $$


• Almost all tests used in inferential statistics can be
generalized as the ratio of a “difference” over an “error”
 Difference between a statistic and null value (usually 0)
 A statistic is nothing more than a numeric summary of the experimental
data with respect to the null hypothesis
 A null value is an assumption about the population under the null
hypothesis
 An error is an estimate of the sampling distribution error

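Example: Two-sample Student’s T-test
$$ T^* = \frac{\overbrace{(\bar{X}_1 - \bar{X}_2)}^{\text{statistic}} - \overbrace{0}^{\text{null value}}}{\underbrace{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \, \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}_{\text{standard error}}} $$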
• The “statistic” in a two-sample t-test is a difference between
the two sample means and the null value is zero
 The hypothesis μ1 = μ2 implies μ1 – μ2 = 0
• The standard error is based on a pooled estimate of the common variance
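As a rough illustration of the "difference over error" idea, the sketch below computes the pooled two-sample T* by hand using only the Python standard library. The function name `pooled_t_statistic` and the two small data sets are made up for this example and do not come from the slides.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x1, x2, null_value=0.0):
    """Two-sample Student's T* = (statistic - null value) / standard error."""
    n1, n2 = len(x1), len(x2)
    difference = mean(x1) - mean(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (difference - null_value) / standard_error

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9]
t_star = pooled_t_statistic(group_a, group_b)
print(round(t_star, 3))
```

A T* far from zero, as here, says the observed difference in means is large relative to its standard error.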

Null Distributions
Compute T* statistic for our sample
Compare against the distribution of all possible T* statistics for all
possible samples from the population under the null hypothesis
• If the null hypothesis is true (i.e. no difference between groups),
then the T* statistics from most samples should be near zero
• Many null distributions (or sampling distributions) approximately
follow well known probability distributions, e.g. normal distribution

P-values
• A p-value is the probability of observing data at least as extreme
as yours, given that the null hypothesis is actually true
• P-values do NOT represent the probability that the null is true
• P-values do NOT represent the probability that a model is incorrect
• P-values do NOT represent the strength or size of an effect
If the null distribution follows a well known probability distribution, like
the normal distribution, the p-values are computed by integration
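The "computed by integration" step can be sketched with the standard normal distribution, whose cumulative distribution function is the integral of its density. This assumes the test statistic is approximately standard normal under the null hypothesis; the function name `normal_p_value` and the example value 1.96 are illustrative only.

```python
from statistics import NormalDist

def normal_p_value(z, alternative="two-sided"):
    """P-value for a statistic z that is standard normal under the null."""
    null = NormalDist(mu=0.0, sigma=1.0)
    if alternative == "greater":
        # Area under the null density to the right of z.
        return 1.0 - null.cdf(z)
    if alternative == "less":
        # Area under the null density to the left of z.
        return null.cdf(z)
    # Two-sided: area in both tails beyond |z|.
    return 2.0 * (1.0 - null.cdf(abs(z)))

p_two_sided = normal_p_value(1.96)
p_one_sided = normal_p_value(1.96, alternative="greater")
print(round(p_two_sided, 4), round(p_one_sided, 4))
```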

Statistical Decisions and
Biological Conclusions
• A statistical decision is a choice to “reject the null
hypothesis” or “fail to reject the null hypothesis”
 The decision is based on a critical value or decision rule
 E.g. Reject the null hypothesis if p-value < 0.05
• A biological conclusion is the final interpretation of
the statistical testing process in plain language
 E.g. Vitamin C prevents cancer, drug reduced viral load, …
 Make sure conclusion can be justified by the hypotheses

Type I and Type II Errors
                                 Actual population difference?
                                 Yes                No
Was the difference       Yes     OK                 Type I Error
detected by the                                     (False Positive)
statistical test?        No      Type II Error      OK
                                 (False Negative)
• Different types of experiments attempt to minimize
Type I errors, Type II errors or both kinds of errors
 E.g. Type II errors are more important in medical testing
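A small simulation can make the Type I error rate concrete. The sketch below rests on assumptions that are not in the slides: normal data with a known standard deviation, a z-test in place of a T-test, 2,000 simulated experiments and a fixed random seed. Both groups are drawn from the same population, so every rejection is a false positive.

```python
import math
import random
from statistics import NormalDist, mean

def type_i_error_rate(n_sims=2000, n=30, sigma=1.0, alpha=0.05, seed=1):
    """Fraction of simulated null experiments that wrongly reject the null."""
    rng = random.Random(seed)
    null = NormalDist()
    # Standard error of a difference of two means with known sigma.
    standard_error = sigma * math.sqrt(2.0 / n)
    false_positives = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sigma) for _ in range(n)]
        b = [rng.gauss(0.0, sigma) for _ in range(n)]
        z = (mean(a) - mean(b)) / standard_error
        p_value = 2.0 * (1.0 - null.cdf(abs(z)))
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_sims

rate = type_i_error_rate()
print(rate)
```

Under these assumptions the printed rate should land close to the chosen α of 0.05.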

Type I and Type II Errors - Example
• Suppose mean viral load is 7,000
lower after taking a new drug
 Drug population mean viral load 33,000
 Placebo population mean viral load 40,000
• Samples from the population may
not be representative
 By chance we sample 120 sickly patients
for the drug treatment group
 By chance we sample 120 robust patients
for the placebo treatment group
• This data yields a Type II error
because of a strange sample

Review: Statistical Testing
• Formulate null and alternative hypotheses
 Null and alternative hypotheses are mutually exclusive
and exhaustive statements about the population
 Typically assume the null hypothesis is true, until we find
evidence to refute the null in favor of the alternative
 E.g. H0: µ = 0 versus HA: µ ≠ 0
• Calculate the appropriate test statistic and find its
probability under the null hypothesis
• Make a statistical decision and biological conclusion

Two-sample Student’s T-test
$$ T^* = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} $$
• Student’s T-test is a parametric test
to compare means of two samples
from normally distributed populations
Assumptions of Student’s T-test:
 Data from each group are normal
or sample sizes greater than 30
 Equal variances among groups
 Independent and identically
distributed (iid) normal random
errors from the group means

Parametric Statistical Tests
• Parametric tests assume the populations have known distributions
with unknown parameters that must be estimated (and compared)

Central Limit Theorem
• Student’s T-test is computed with
a difference of two sample
means
• Draw thousands of samples of
size n from one population to
view the distribution of their
sample means
• As sample size n increases, the
distribution of the sample means
becomes Gaussian (normal),
even from non-normal
populations
• Student’s T-test does not require
normal data if sample size is
large
Sample means from a uniform distribution
are approximately normal for n = 20
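The uniform-distribution example can be reproduced with a short simulation. This is only a sketch: the number of samples (5,000), the random seed and the function name `simulate_sample_means` are arbitrary choices for illustration. It checks that the sample means centre on 0.5, have a spread near the theoretical sqrt(1/12/n), and that roughly 95% fall within 1.96 standard errors, as a normal distribution would predict.

```python
import math
import random
from statistics import mean, stdev

def simulate_sample_means(n, n_samples=5000, seed=0):
    """Draw n_samples samples of size n from Uniform(0, 1); return their means."""
    rng = random.Random(seed)
    return [mean([rng.random() for _ in range(n)]) for _ in range(n_samples)]

n = 20
means = simulate_sample_means(n)
center = mean(means)
spread = stdev(means)
# Uniform(0, 1) has variance 1/12, so a mean of n draws has variance 1/(12 n).
theoretical_spread = math.sqrt(1 / 12 / n)
# Share of sample means within 1.96 standard errors of the true mean 0.5.
within = sum(abs(m - 0.5) <= 1.96 * theoretical_spread for m in means) / len(means)
print(round(center, 3), round(spread, 4), round(theoretical_spread, 4), round(within, 3))
```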

Equal Variance Assumption
$$ T^* = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
• Usual two-sample Student’s T-test computations assume both
samples share approximately equal variances
• Welch’s correction computes an appropriate T-test when variances
are NOT equal among the two samples
• Welch’s correction is available in most software, so look carefully
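A minimal sketch of the unequal-variance statistic follows. The T* formula matches the slide; the degrees of freedom use the Welch-Satterthwaite approximation, which the slides do not show and which is added here as an assumption about what "Welch's correction" typically involves. The function name `welch_t` and the data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(x1, x2, null_value=0.0):
    """Return (T*, approximate degrees of freedom) without pooling variances."""
    n1, n2 = len(x1), len(x2)
    # Each group contributes its own variance of the mean.
    v1 = variance(x1) / n1
    v2 = variance(x2) / n2
    t_star = (mean(x1) - mean(x2) - null_value) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (not taken from the slides).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_star, df

group_a = [1, 2, 3, 4, 5]   # hypothetical data
group_b = [6, 7, 8, 9]
t_star, df = welch_t(group_a, group_b)
print(round(t_star, 3), round(df, 2))
```

The degrees of freedom are generally not an integer and never exceed n1 + n2 - 2, the value used by the pooled test.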

Two-Sided Test
• Researchers expect two samples
will be different, but do not know
which will have the higher mean
 E.g. viral load for drug group could
be higher or lower than placebo
 E.g. H0: μP = μD vs. HA: μP ≠ μD
• Two-sided tests are less powerful,
but more general than analogous
one-sided tests of same data
 The α = 0.05 rejection region of a
two-sided test is divided in two parts
 Need larger T-statistic for significant
p-values with two-sided test

One-Sided Test
• Researchers expect one sample
will have a higher mean than the
other sample before testing
 E.g. viral load for drug group should
be lower than placebo
 E.g. H0: μP ≤ μD vs. HA: μP > μD
• One-sided tests are more powerful
but require more assumptions
 The α = 0.05 rejection region of the
one-sided test is all on one side
 Smaller T-statistics will produce
significant p-values in one-sided
tests
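The difference in rejection regions can be seen by comparing cut-off values. The sketch below uses the standard normal distribution as a stand-in for the T distribution, an assumption that is reasonable only for large samples; exact T cut-offs would be somewhat larger.

```python
from statistics import NormalDist

alpha = 0.05
null = NormalDist()  # large-sample stand-in for the T distribution

# Two-sided: alpha is split between both tails, alpha/2 in each.
two_sided_cutoff = null.inv_cdf(1 - alpha / 2)
# One-sided: all of alpha sits in a single tail.
one_sided_cutoff = null.inv_cdf(1 - alpha)

print(round(two_sided_cutoff, 3), round(one_sided_cutoff, 3))
```

A statistic between the two cut-offs would be significant in the one-sided test but not in the two-sided test of the same data.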

Nonparametric Statistical Tests
• Nonparametric tests make no assumptions about the distribution of
a population and focus on more general descriptions, like medians

Wilcoxon Rank Sum Test
$$ Z^* = \frac{\left[ R_1 - \dfrac{n_1 (n_1 + 1)}{2} \right] - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} $$
• The Wilcoxon rank sum test is a
nonparametric test to compare
the sums of two ranked samples
• If assumptions are met, the test
can be used to compare medians
 Wilcoxon rank sum test assumes both
samples are from the same type of
distribution with different locations
 Also known as the Mann-Whitney Test

Example: Wilcoxon Rank Sum Test
$$ Z^* = \frac{\overbrace{\left[ R_1 - \dfrac{n_1 (n_1 + 1)}{2} \right]}^{\text{statistic}} - \overbrace{\dfrac{n_1 n_2}{2}}^{\text{null value}}}{\underbrace{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}_{\text{standard error}}} $$
• The “statistic” in a Wilcoxon rank sum test is equivalent to the
difference between the sums of the ranked data in each class
• The null hypothesis R1 = R2 produces a strange statistic, null value
and standard error due to relationships among sums of ranks
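The Z* formula can be traced step by step in code. This is a simplified sketch: ties receive average ranks, but the standard error is not corrected for ties and no continuity correction is applied, so statistical packages may report slightly different values. The helper names `midranks` and `wilcoxon_z` and the data are hypothetical.

```python
import math

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def wilcoxon_z(x1, x2):
    """Large-sample Z* for the Wilcoxon rank sum test (no tie correction)."""
    n1, n2 = len(x1), len(x2)
    ranks = midranks(list(x1) + list(x2))
    r1 = sum(ranks[:n1])                      # rank sum of the first sample
    statistic = r1 - n1 * (n1 + 1) / 2
    null_value = n1 * n2 / 2
    standard_error = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (statistic - null_value) / standard_error

group_a = [1, 2, 3, 4, 5]   # hypothetical data
group_b = [6, 7, 8, 9]
z_star = wilcoxon_z(group_a, group_b)
print(round(z_star, 3))
```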

Does It Really Compare Medians?
• If two samples come from the same
type of distribution … YES
 Median of a sample is comparable to its
middle ranked observation(s)
 If two samples share a similar shape, the
sample with the significantly higher rank
sums will have the higher median too
• If two samples come from two very
different types of distributions … NO
 The Wilcoxon rank sum test actually
compares the sums of the ranked data
 Many counter examples have significant
Wilcoxon tests, but equal medians
Figure example:
 Control and Treated samples
both have median = 100
 Wilcoxon rank sum test has a
significant p-value = 0.0000027

Does Wilcoxon Compare Distributions?
• Wilcoxon rank sum test
does NOT compare the
distributions of samples
• Samples from two very
different distributions can
yield non-significant
Wilcoxon test p-values
• It is difficult to interpret
Wilcoxon rank sum test if
assumptions aren’t met

Student’s T vs. Wilcoxon
• Two-sample Student’s T-test
 Assumes populations are normal, but works for all
populations if large sample sizes are used for both
classes
 Assumes variances are equal or requires Welch
correction
• Wilcoxon Rank Sum Test
 Assumes both samples are from the same type of
distribution
 Generally preferred over Student’s T-test by many journals
• What if neither test is appropriate?

Example: Viral Loads
• Want to compare viral loads
under treatment and placebo
 Viral loads are very high (> 10,000)
and skewed right for the placebo
group
 Viral loads all equal zero for treatment
• Both Student’s T-test and the
Wilcoxon test are inappropriate
 Don’t have normal data or equal
variances for Student’s T-test
 Can’t use Wilcoxon test to compare
two different distributions
• Need an appropriate p-value, so
what statistical test can we use?

Exact Tests vs. Approximate Tests
• Exact statistical tests are based on probability
statements that are valid for any sample size
 Usually based on combinatorial or resampling strategies
 All resampling tests are considered exact tests
 Some implementations of Wilcoxon tests use exact tests based on
combinatorial arguments for small sample sizes
• Approximate statistical tests are based on mathematical
arguments about convergence with large sample sizes
 Student’s T-test is an approximate test based on arguments similar to
the Central Limit Theorem
 Approximate tests may have inaccurate p-values for small samples

Example: Bootstrap Samples
Original Data    Bootstrap Samples
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1         3         1         2         …
A      2         3         5         2         …
A      3         2         3         2         …
A      4         4         1         4         …
A      5         2         2         1         …
B      6         7         9         6         …
B      7         6         6         7         …
B      8         8         6         8         …
B      9         8         9         9         …
• Bootstrapping uses sampling with replacement within each class
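Sampling with replacement within each class can be sketched in a few lines. The data mirror the slide's original data; the function name `bootstrap_sample` and the random seed are arbitrary, so the drawn values will not match the slide's table.

```python
import random

original = {"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9]}

def bootstrap_sample(data_by_class, rng):
    """Resample each class with replacement, keeping its original size."""
    return {
        label: rng.choices(values, k=len(values))
        for label, values in data_by_class.items()
    }

rng = random.Random(42)  # fixed seed so the example is repeatable
samples = [bootstrap_sample(original, rng) for _ in range(3)]
for number, sample in enumerate(samples, start=1):
    print("Sample", number, sample)
```

Because values are drawn with replacement, a bootstrap sample can repeat some observations and omit others, as in the table above.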

Example: Jackknife Samples
Original Data    Jackknife Samples ("-" marks the observation left out)
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1         -         1         1         …
A      2         2         -         2         …
A      3         3         3         -         …
A      4         4         4         4         …
A      5         5         5         5         …
B      6         -         6         6         …
B      7         7         -         7         …
B      8         8         8         -         …
B      9         9         9         9         …
• Jackknifing uses “leave one out” sampling within each class
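"Leave one out" sampling is simple to express in code. The sketch below generates every jackknife sample for one class; the function name `jackknife_samples` is hypothetical, and how samples from the two classes are paired is left open here.

```python
original = {"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9]}

def jackknife_samples(values):
    """Return one sample per observation, each with that observation removed."""
    return [values[:i] + values[i + 1:] for i in range(len(values))]

jackknife_a = jackknife_samples(original["A"])
jackknife_b = jackknife_samples(original["B"])
print(jackknife_a[0])  # class A with its first observation left out
print(jackknife_b[0])  # class B with its first observation left out
```

Unlike the bootstrap, the jackknife is deterministic: a class with n observations yields exactly n samples of size n - 1.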

Example: Permutation Samples
Original Data    Permutation Samples
Class  Data      Sample 1  Sample 2  Sample 3  …
A      1         B         A         A         …
A      2         B         A         B         …
A      3         A         A         A         …
A      4         A         B         B         …
A      5         B         A         A         …
B      6         A         B         B         …
B      7         A         A         B         …
B      8         A         B         A         …
B      9         B         B         A         …
• Permutation tests scramble the class labels among the samples
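Scrambling the class labels while leaving the data in place can be sketched as follows. The function name `permute_labels` and the seed are arbitrary choices, so the shuffled labels will differ from the slide's table.

```python
import random

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = ["A"] * 5 + ["B"] * 4

def permute_labels(labels, rng):
    """Return a shuffled copy of the class labels; the data stay in place."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(7)  # fixed seed so the example is repeatable
permuted = permute_labels(labels, rng)
for value, label in zip(values, permuted):
    print(label, value)
```

Every permutation keeps the class sizes fixed (five A labels and four B labels here); only the assignment of labels to observations changes.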

Example: Viral Loads
• One-sided permutation test
to compare the medians
from two different samples
• Permutations of differences
in median are centered at 0
• Compute p-value using:
$$ p = \frac{\#\ \text{permutations} > \text{true difference in medians}}{\text{total}\ \#\ \text{permutations}} $$
or
$$ p = \frac{\#\ \text{permutations} < \text{true difference in medians}}{\text{total}\ \#\ \text{permutations}} $$

Example: Viral Loads
• Two-sided permutation test
uses absolute values of each
permutation and the true
difference in medians
• Permutations of differences
are all greater than zero
• Compute p-value using:
$$ p = \frac{\#\ \text{permutations} > \text{true difference in medians}}{\text{total}\ \#\ \text{permutations}} $$
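The one-sided and two-sided permutation p-values can be sketched for a small data set by enumerating every relabelling exactly, rather than sampling permutations at random. One deliberate difference from the formulas above: the sketch counts permutations at least as extreme (≥ or ≤ instead of strict inequalities), so the observed labelling counts itself and the p-value can never be zero. The function name and data are hypothetical.

```python
from itertools import combinations
from statistics import median

def permutation_test_medians(a, b, alternative="two-sided"):
    """Exact permutation p-value for the difference in medians, median(a) - median(b)."""
    pooled = list(a) + list(b)
    observed = median(a) - median(b)
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Each way of choosing which observations get label "a" is one permutation.
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = median(perm_a) - median(perm_b)
        if alternative == "greater":
            is_extreme = diff >= observed
        elif alternative == "less":
            is_extreme = diff <= observed
        else:
            is_extreme = abs(diff) >= abs(observed)
        if is_extreme:
            extreme += 1
        total += 1
    return extreme / total

group_a = [1, 2, 3, 4, 5]   # hypothetical data
group_b = [6, 7, 8, 9]
p_less = permutation_test_medians(group_a, group_b, alternative="less")
p_two_sided = permutation_test_medians(group_a, group_b)
print(round(p_less, 4), round(p_two_sided, 4))
```

Full enumeration is only practical for small samples; with larger data sets a random subset of permutations is normally used instead.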

Multiple Testing
• Use Family-Wise Error-Rate (FWER) adjustments for
analysis of variance (ANOVA) tests or comparisons
among 3-20 groups of samples
 One-way ANOVA is an extension of the Student’s T-test
 Kruskal-Wallis is an extension of the Wilcoxon rank sum test
 Permutation tests can be adjusted with Bonferroni and other methods
• Use False Discovery Rate (FDR) adjustments for
high-throughput biology experiments like microarrays
 E.g. Microarrays, real time PCR, next gen sequencing, …
 FDR methods are more powerful than family-wise error rate (FWER)
controlling methods, like those used in ANOVA, for high-throughput
methods with hundreds or thousands of tests

FWER Adjustments: Bonferroni
• Suppose you want to compare 5 new drugs against
a placebo, but you know all 5 drugs are ineffective
 Compute 5 Student’s T-tests with false positive rate α = 0.05
 Each test has a 95% chance to correctly find p > 0.05
 Among all 5 tests, the chance of at least one false positive is:
1 – 0.95^5 = 0.23 > 0.05
• The Bonferroni FWER adjustment
 Divide the false positive rate α = 0.05 by the number of tests, so only
p-values smaller than α = 0.05 / 5 = 0.01 are significant
 Multiply p-values by the number of tests for an “adjusted p-value”,
using the formula min(1, 5*p) for these five tests.
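The arithmetic in this example is easy to check in code. The sketch below computes the family-wise error rate for independent tests and applies the min(1, m*p) adjustment; the function names and the five p-values are made up for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_adjust(p_values):
    """Bonferroni adjusted p-values: min(1, m * p) for m tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

fwer = family_wise_error_rate(alpha=0.05, n_tests=5)
raw_p = [0.004, 0.03, 0.5, 0.2, 0.9]   # hypothetical p-values
adjusted_p = bonferroni_adjust(raw_p)
print(round(fwer, 2))
print(adjusted_p)
```

Comparing an adjusted p-value with 0.05 gives the same decision as comparing the raw p-value with 0.05 / 5 = 0.01.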

Other FWER Methods
• Tukey’s Honest Significant Difference
 Uses the “studentized range” method for all pair-wise comparisons
 E.g. for three groups, compare A vs. B, B vs. C and A vs. C
• Dunnett’s Multiple Comparisons Against a Control
 Uses “standardized range method” for comparisons against a control
 E.g. for three groups, compare A vs. C and B vs. C for control group C
• Popular, yet outdated methods
 Fisher’s LSD, Student-Newman-Keuls, Duncan’s Test, …

False Discovery Rate Methods
• Consider a microarray experiment with 20,000 genes
 Bonferroni requires a false positive rate α = 0.05 / 20,000 = 0.0000025
 Few, if any, genes will be statistically significant using Bonferroni
• The purpose of a microarray experiment is different
than an ANOVA experiment comparing 3-20 groups
 Microarray experiments are often considered “fishing expeditions”
 Want to find approximately 100 genes of interest for follow up
experiments with quantitative real-time PCR or other methods
 Willing to accept a few false positives among our significant results, if
we can capture all the biologically important genes in the process

FDR Methods (cont)
• Suppose you could test
5,000 genes that were not
differentially expressed
• Those 5,000 genes would
include many false
positives
• The p-values should
follow a uniform
distribution from
p = 0.00 to p = 1.00

FDR Methods (cont.)
• Add in 1,000 differentially
expressed genes (DEGs)
• All DEGs have p < 0.05
• Want to adjust the cut-off
value α = 0.05 until the list
of significant genes has a
controlled proportion of
false positives
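The slides do not name a specific FDR procedure. As one common example, the sketch below implements the Benjamini-Hochberg adjustment, chosen here as an assumption rather than as the presenter's method. Each p-value is scaled by m / rank, and the adjusted values are then forced to be monotone. The function name and input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]   # hypothetical p-values
adjusted_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adjusted_p])
```

Calling genes significant when their adjusted p-value is below 0.05 aims to keep the expected proportion of false positives in that list at or below 5%, which is less strict than the Bonferroni cut-off.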

Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455