Topics in this lecture series:
• Null hypothesis
• p-value
• Chi-square test
• G-test
• t-test
• ANOVA
• Non-parametric tests
• Statistical power
• Multiple test corrections

The same outline then returns, stamped "WRONG".
Imagine running 1,000 tests where a real effect is present in 10% (100 tests) and absent in 90% (900 tests), so Pr(real) = 0.1, with power = 80% and significance level = 5%. Of the 100 real effects, 80% are detected (80 true positives) and 20% are missed (20 false negatives). Of the 900 null cases, 95% test negative (855 true negatives) and 5% are "detected" (45 false positives).

False discovery rate:
FDR = false positives / discoveries = 45 / (45 + 80) = 0.36

False positive rate:
FPR = false positives / no-effect tests = 45 / 900 = 0.05

Colquhoun D., 2014, "An investigation of the false discovery rate and the misinterpretation of p-values", R. Soc. open sci. 1: 140216.
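The arithmetic above generalizes to any prevalence of real effects. A minimal sketch (the function name and structure are my own, not from the lecture):

```python
def error_rates(p_real, power, alpha, n_tests=1000):
    """Expected outcome counts and rates for n_tests hypothesis tests."""
    real = n_tests * p_real          # tests where a real effect exists
    null = n_tests * (1 - p_real)    # tests where there is no effect
    tp = real * power                # true positives: real effects detected
    fp = null * alpha                # false positives: nulls "detected"
    fdr = fp / (fp + tp)             # false discovery rate
    fpr = fp / null                  # false positive rate
    return tp, fp, fdr, fpr

tp, fp, fdr, fpr = error_rates(p_real=0.1, power=0.8, alpha=0.05)
# ≈ 80 true positives, 45 false positives, FDR 0.36, FPR 0.05
```

Varying `p_real` reproduces the later slides: rare effects push the FDR up even though the FPR stays pinned at α.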
If you publish a p < 0.05 result, you have a 36% chance of making a fool of yourself (Colquhoun D., 2014, R. Soc. open sci. 1: 140216).
What’s wrong with p-values?
Hand-outs available at http://is.gd/statlec
Marek Gierliński
Division of Computational Biology
Statistical testing

A statistical model consists of the null hypothesis (H0: no effect) together with all other assumptions. Choose a significance level, α = 0.05. Run the statistical test of the data against H0 to obtain a p-value (informally, the probability that the observed effect is random). If p < α, reject H0 (at your own risk) and conclude the effect is real; if p ≥ α, there is insufficient evidence.
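The flow above (data, plus a test against H0, yielding a p-value) can be sketched with a simple two-sample permutation test; the function and data are my own illustration, not part of the lecture:

```python
import random
import statistics

def permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sample permutation test of H0: both samples come from the
    same distribution. The p-value is the fraction of label shuffles
    whose absolute mean difference is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) -
                   statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

control = [19.1, 22.3, 18.7, 20.5, 21.0]
treated = [29.8, 31.2, 28.5, 30.1, 32.0]
p = permutation_p(control, treated)    # tiny p: the groups clearly differ
```

Note that the returned number is exactly what the next slide defines: the probability of data this extreme given that H0 is true, nothing more.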
p-value: given that H0 is true, the probability of the observed, or more extreme, data. It is not the probability that H0 is true.

"P-value is the degree to which the data are embarrassed by the null hypothesis." (Nicholas Maxwell)
"All other assumptions"

The decision rule (reject H0 when p < α, do not reject when p ≥ α, with α = 0.05) rests on more than the null hypothesis H0: no effect. Rejecting H0 also assumes that:
• experimental protocols were followed
• instruments were calibrated
• data were collected correctly
• all other assumptions about the biology are correct
• there are no other effects
• there are no silly mistakes
1. p-values test not only the null hypothesis, but everything else in the experiment.
Why large false discovery rate?

Simulated population of mice:
• no effect (97%): μC = 20 g, σ = 5 g
• real effect (3%): μF = 30 g, σ = 5 g

Power analysis: effect size Δm = 10 g, power 𝒫 = 0.9, significance level α = 0.05, giving sample size n = 5. Null hypothesis H0: μ = 20 g, tested with a one-sample t-test.
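This Gedankenexperiment can be reproduced with a stdlib-only simulation. The critical value 2.776 is the standard two-sided t cutoff for df = 4 at α = 0.05; the other numbers follow the slide's parameters, and the test count is my own choice:

```python
import math
import random
import statistics

random.seed(42)
T_CRIT = 2.776  # two-sided critical t value, df = 4, alpha = 0.05

def rejects_h0(sample, mu0=20.0):
    """One-sample t-test of H0: mu = mu0, rejected at alpha = 0.05."""
    n = len(sample)
    t = (statistics.fmean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return abs(t) > T_CRIT

tp = fp = 0
n_tests = 20_000
for _ in range(n_tests):
    real = random.random() < 0.03               # 3% of mice have a real effect
    mu = 30.0 if real else 20.0                 # muF = 30 g, muC = 20 g
    sample = [random.gauss(mu, 5.0) for _ in range(5)]  # sigma = 5 g, n = 5
    if rejects_h0(sample):
        if real:
            tp += 1
        else:
            fp += 1

fdr = fp / (fp + tp)   # comes out close to the slide's FDR of about 0.63
```

With only 3% real effects, false positives from the 97% null tests swamp the true positives, which is exactly what the next two slides show.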
Gedankenexperiment: distribution of p-values. (Histogram of p-values from the repeated simulated experiments; those below α = 0.05 count as positives.)

Gedankenexperiment: "significant" p-values. Splitting the positives at α = 0.05 into the no-effect and real-effect populations gives the false and true positives:

FDR = FP / (FP + TP) ≈ 0.63
Small α doesn't help. Repeating with α = 0.001:

FDR = FP / (FP + TP) ≈ 0.20
2. The chance of making a fool of yourself is much larger than α = 0.05.
FDR depends on the probability of a real effect. With no effect in 50% of tests and a real effect in 50%, at α = 0.05 the FDR ≈ 0.05.
3. When the effect is rare, you are screwed.
What does a p-value ~ 0.05 really mean? In the simulation, counting true and false positives only among tests whose p-value falls close to 0.05 (rather than all p < 0.05) gives FDR = 0.21.
Bayesian approach: consider all prior distributions.

• Berger & Sellke (Bayesian approach): p ~ 0.05 ⇒ FDR ≥ 0.3
• 3-sigma approach: p ~ 0.003 ⇒ FDR ≥ 0.04

Berger J.O., Sellke T., "Testing a point null hypothesis: the irreconcilability of P values and evidence", 1987, JASA, 82, 112-122.
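These bounds can be reproduced with the well-known minimum Bayes factor calibration BF ≥ −e·p·ln p (from the Sellke-Bayarri-Berger line of work, not quoted on the slide), assuming 50:50 prior odds; a sketch:

```python
import math

def min_false_positive_risk(p, prior_odds_h0=1.0):
    """Lower bound on the false positive risk implied by a p-value.
    Uses the minimum Bayes factor -e * p * ln(p) for H0 against any
    alternative (valid for p < 1/e), combined with the prior odds of
    'no effect' versus 'real effect'."""
    bf = -math.e * p * math.log(p)           # best case for the alternative
    posterior_odds_h0 = prior_odds_h0 * bf   # odds that H0 is true after the data
    return posterior_odds_h0 / (1 + posterior_odds_h0)

min_false_positive_risk(0.05)    # about 0.29, i.e. FDR >= 0.3 in round numbers
min_false_positive_risk(0.003)   # about 0.045, i.e. FDR >= 0.04
```

Even under the most favorable alternative, a p-value of 0.05 cannot push the chance that H0 is true much below 30%.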
4. When you get a p ~ 0.05, you are screwed.
Gedankenexperiment: reliability of p-values. Normal population, 100% real effect, one-sample t-test. With sample size 3 (power = 0.18) the p-values from repeated experiments vary wildly; with sample size 10 (power = 0.80) they are far more stable. p-values can be unreliable: underpowered studies lead to unreliable p-values.
Inflation of the effect size. In the simulation the real mean effect size is 5 g, but the mean effect size estimated from the "significant" experiments is 7.3 g. Underpowered studies lead to unreliable p-values and to overestimated effect sizes.
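This inflation (the "winner's curse") is easy to reproduce. The parameters below are my own stand-ins (true effect 5 g, σ = 5 g, n = 5, one-sample t-test against 0 with the df = 4 cutoff 2.776), so the exact inflated value differs from the slide's 7.3 g, but the phenomenon is the same:

```python
import math
import random
import statistics

random.seed(7)
T_CRIT = 2.776       # two-sided critical t, df = 4, alpha = 0.05
TRUE_EFFECT = 5.0    # grams

significant_effects = []
for _ in range(20_000):
    sample = [random.gauss(TRUE_EFFECT, 5.0) for _ in range(5)]
    m = statistics.fmean(sample)
    t = m / (statistics.stdev(sample) / math.sqrt(len(sample)))
    if abs(t) > T_CRIT:                  # keep only "significant" results
        significant_effects.append(m)

inflated = statistics.fmean(significant_effects)
# inflated is noticeably larger than the true 5 g effect
```

The filter only lets through samples whose observed mean happened to be large, so averaging the survivors systematically overshoots the truth.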
5. When your experiment is underpowered, you are screwed.
Neuroscience: most studies underpowered
Button et al. (2013) “Power failure: why small sample size undermines the
reliability of neuroscience”, Nature Reviews Neuroscience 14, 365-376
The effect size

With a sample size large enough, everything is "significant": here p = 0.004 with nF = 775 and nQ = 392. The effect size is more important, and looking at the whole data is even more important.
6. When you have lots of replicates, p-values are useless.
7. Statistical significance does not imply biological relevance.
Multiple test corrections can be tricky

The textbook case: 10,000 genes, 10,000 tests, Benjamini-Hochberg correction, RESULT. In practice, with a complex experiment and multi-dimensional data, the workflow often becomes: searching... nothing; searching... nothing; searching... nothing; searching... RESULT. Batch effects? No. After such a search, it is no longer clear which tests the correction should have covered.
8. It is not always obvious how to correct p-values.
What's wrong with p-values? A lot, because statistics:
• P-values test not only the targeted null hypothesis, but everything else in the experiment
• The chance of making a fool of yourself is much larger than α = 0.05
• When the effect is rare, you are screwed
• When you get a p ~ 0.05, you are screwed
• When your experiment is underpowered, you are screwed
• When you have lots of replicates, p-values are useless
• Statistical significance does not imply biological relevance
• Multiple test corrections are tricky
The plain fact is that 70 years ago
Ronald Fisher gave scientists a
mathematical machine for turning
baloney into breakthroughs, and flukes
into funding. It is time to pull the plug.
Robert Matthews, Sunday Telegraph,
13 September 1998.
Null hypothesis significance testing is a potent
but sterile intellectual rake who leaves in his
merry path a long train of ravished maidens
but no viable scientific offspring.
Paul Meehl, 1967, Philosophy of Science, 34,
103-115
The widespread use of “statistical
significance” as a license for
making a claim of a scientific
finding leads to considerable
distortion of the scientific process.
ASA statement on statistical
significance and p-values (2016)
By Jim Borgman, first published by the Cincinnati Enquirer, 27 April 1997
What's wrong with us?

Journal of the American Statistical Association, Vol. 54, No. 285 (Mar., 1959), pp. 30-34:

"There is some evidence that [...] research which yields nonsignificant results is not published. Such research being unknown to other investigators may be repeated independently until eventually by chance a significant result occurs [...] The possibility thus arises that the literature [...] consists in substantial part of false conclusions [...]"
Canonization of false facts

Nissen S.B., et al., "Research: Publication bias and the canonization of false facts", eLife 2016;5:e21451

(Figure: probability of canonizing a false claim as fact, plotted against the negative publication rate, both on a 0-1 scale, for α = 0.05 and power = 0.8.)
If you don’t publish negative results,
science is screwed
but...
there is a thin line between “negative
result” and “no result”
Data dredging, p-hacking: searching until you find the result you were looking for.
• Massaging data
• Post-hoc hypotheses
• Unaccounted multiple experiments/tests ("p = 0.06? Let's try again")
• Ignoring confounding effects
• Not reporting non-significant results
Evidence of p-hacking

Head M.L., et al., "The Extent and Consequences of P-Hacking in Science", PLoS Biol 13(3)

(Figure: distribution of p-values reported in publications; bin counts 𝑛F and 𝑛Q compared under H0: 𝑛F = 𝑛Q.)
Reproducibility crisis
Open Science Collaboration, “Estimating the reproducibility
of psychological science”, Science, 349 (2015)
Managed to reproduce only 39% of the results
Reproducibility crisis
Nature's survey of 1,576 researchers
The great reproducibility experiment
Are referees more likely to give red cards to black players?
• one data set
• 29 teams
• 61 scientists
• task: find the odds ratio

Silberzahn et al., "Many analysts, one dataset: Making transparent how variations in analytical choices affect results", https://osf.io/j5v8f
P-values are broken. We are broken. What do we do? What the hell do we do?
Before you do the experiment, talk to us: the Data Analysis Group, http://www.compbio.dundee.ac.uk/dag.html
• Specify the null hypothesis
• Design the experiment: randomization, statistical power
• Quality control: some crap comes out in statistics
We assumed the null hypothesis: never, ever say that a large p supports H0.

Ditch the α limit: use p-values as a continuous measure of data incompatibility with H0. p ~ 0.05 only means 'worth a look'; reporting a discovery based only on p < 0.05 is wrong. Use the three-sigma rule, that is p < 0.003, to demonstrate a discovery.
Reporting
• Always report the effect size and its confidence limits
• Show data (not dynamite plots)
• Don’t use the word ‘significant’
• Don’t use asterisks to mark ‘significant’ results in
figures
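One way to follow the first recommendation without distributional assumptions is a percentile bootstrap interval for the effect size; a sketch with made-up mouse weights (data and function are illustrative only):

```python
import random
import statistics

def bootstrap_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the difference
    of means (the effect size) between two samples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]       # resample each group
        rb = [rng.choice(b) for _ in b]       # with replacement
        diffs.append(statistics.fmean(ra) - statistics.fmean(rb))
    diffs.sort()
    lo_i = int((1 - level) / 2 * n_boot)
    hi_i = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_i], diffs[hi_i]

treated = [29.8, 31.2, 28.5, 30.1, 32.0]
control = [19.1, 22.3, 18.7, 20.5, 21.0]
effect = statistics.fmean(treated) - statistics.fmean(control)
lo, hi = bootstrap_ci(treated, control)
# report the effect size together with its confidence limits, in grams
```

Reporting "effect 10 g, 95% CI (lo, hi)" tells the reader far more than an asterisk ever could.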
Validation
Follow-up experiments to
confirm discoveries
Publication
Publish negative results
ASA Statement on Statistical Significance and P-Values
1. P-values can indicate how incompatible the data are with a specified statistical
model
2. P-values do not measure the probability that the studied hypothesis is true, or
the probability that the data were produced by random chance alone
3. Scientific conclusions and business or policy decisions should not be based only
on whether a p-value passes a specific threshold
4. Proper inference requires full reporting and transparency
5. A p-value, or statistical significance, does not measure the size of an effect or
the importance of a result
6. By itself, a p-value does not provide a good measure of evidence regarding a
model or hypothesis
https://is.gd/asa_stat