In the world beyond p < .05:
When & How to use p < .0499…
Yoav Benjamini
Tel Aviv University, Israel
replicability.tau.ac.il
Supported by ERC PSARPS and HBP grants
We cannot assure the replicability of results from a single study:
Only to enhance it
The gold standard of science: A discovery should be replicable
When to use p-values? Almost Always
1. Statistical significance testing (via the p-value) is the
first-line defense against being fooled by randomness:
• Valid calculation of the p-value requires minimal
assumptions – less than any other statistical method.
• The “assumptions” need not be merely assumed, but
can be assured by a properly designed experiment
(randomization + non-parametric test)
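A minimal sketch of the "randomization + non-parametric test" idea: when treatment labels were assigned at random, re-shuffling them gives the reference distribution of the test statistic with no distributional assumptions. The function name, the data, and the choice of a difference in means as the statistic are illustrative assumptions, not taken from the talk.

```python
import random

def permutation_p_value(treated, control, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_t = len(treated)

    def mean_diff(values):
        t, c = values[:n_t], values[n_t:]
        return sum(t) / len(t) - sum(c) / len(c)

    # Observed statistic, computed before any shuffling.
    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize the group labels
        if abs(mean_diff(pooled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value valid (never exactly 0).
    return (hits + 1) / (n_perm + 1)

# Hypothetical measurements from a randomized two-group experiment.
treated = [5.1, 6.3, 5.8, 7.0, 6.1, 6.6]
control = [4.2, 5.0, 4.8, 5.5, 4.9, 5.2]
print(permutation_p_value(treated, control))
```

The only assumption the p-value leans on is the random assignment itself, which the experimenter controls by design.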
When to use p-values? Almost Always
• Usable, even if the scale of measurement is
meaningless, and only directional decision is
needed: Significant difference gives sign
determination (the null need not be precisely true).
• In some emerging branches of science it’s the only
way to compare across conditions: GWAS, fMRI,
Brain Networks, Genomics pathways..
• The p-value offers a common concept and language
for addressing randomness across science
When to use with p<“some line”?
Often
2a. Selection according to some line is unavoidable
• When the analysis should lead to an action.
• A line is needed for power analysis in the design
stage, allowing justification of human and animal
sample sizes
• A minimal requirement for publication decision in
journals
• A bright line is needed when the action should be fair
When to use with p<“some line”?
Often
2b. Selection according to some line is unavoidable
in modern science
• A line is always used for selection when the analysis
results are too numerous to be included in total
• When highlighting results
• Giovannucci et al. (1995) look for relationships
between more than a hundred types of food intakes
and the risk of prostate cancer
• The abstract reports only three (marginal) 95%
confidence intervals (CIs), apparently only for those
relative risks whose CIs do not cover 1.
“Eat Ketchup and Pizza and avoid Prostate Cancer”
Epidemiology: a p-values free zone
Selection by a Table
SCIENCE, ‘07
The Main Table
11 selected by the table out of ~400,000
32,000 voxels searched
448,000 SNPs
Selection by a Figure
number of tests ~ 13,000,000,000
Goal: Association between volume changes at voxels with genotype (Stein et al.’10)
When to use it with p < 0.0499…?
3. Selection according to p < 0.0499… is not too bad
• In a well designed FDA regulated clinical trial with a
single primary endpoint, the .05 line is used to define
success or failure. It managed to screen ~50% too
much.
• Moreover, FDA does not approve a drug based on a
single successful trial. At least twice, and often more.
• Indeed, the p<.05 threshold served us well until the
80’s.
A change in the way science is conducted.
The industrialization of the scientific process
1888 1999
1950 2010
In general scientists were slow to adjust
[Figure: one paper per year from Genetics; number of gene expressions reported, on a log scale from 1 to 1,000,000. Annotations: micro-arrays invented; multiplicity addressed for the first time.]
When to use it with p < 0.0499…?
In large-scale problems, lower levels are used as thresholds for significance
in order to control some error rate,
with which we can maintain that
most of the selected findings will not be false!
The p < 5×10⁻⁸ used in whole-genome scans was chosen
to control the probability of making even one error (FWER) at 0.05.
And it works well, without getting rid of the p-value, or of the .0499…
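A small worked example of how a genome-wide line relates to the familiar .05. It assumes the Bonferroni rule and roughly one million effectively independent tests in a genome scan; both are assumptions made here for illustration, and the function name is hypothetical.

```python
def bonferroni_threshold(fwer, n_tests):
    """Per-test threshold that keeps the familywise error rate at most `fwer`."""
    return fwer / n_tests

# Assumption: about one million effectively independent tests in a genome scan.
print(bonferroni_threshold(0.05, 1_000_000))

# The voxel-by-SNP search quoted earlier (~13,000,000,000 tests) would need
# a far lower per-test line under the same rule.
print(bonferroni_threshold(0.05, 13_000_000_000))
```

The .05 is still there; it has moved from the single test to the family of tests.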
The Main Table
11 selected by the table out of ~400,000 FWER <.05
In depth analysis of 100 papers from NEJM 2002-2010. All
had multiple endpoints (Cohen and YB ‘16)
• # of endpoints in a paper 4-167 ; mean=27
• In 80% the issue of multiplicity was entirely ignored: p≤0.05
• All studies designated primary endpoints
• Conclusions were based on other endpoints when the
primary failed
The above reflects most of the published medical research
But not at the regulatory stage (Phase III trials).
Is this the reason that ~50% of Phase III trials fail?
Elsewhere? in medical research?
In Experimental Psychology?
Our analysis of the 100 studies in the Psychology reproducibility
project:
# of inferences per study: 4-700, average 72;
Only 11 addressed selection (and very, very partially)
“With a clean conscience” (Schnall et al. ’08, Psych. Science)
Presented with 6 moral dilemmas and asked “how wrong each action was”
Does priming for cleanliness affect the response?
One assessment of wrong-doing & 9 emotion ratings for each dilemma;
Two methods of priming, verbal & physical (separate experiments) (at least m=84)
Results: No significant difference on any of the emotions in any of the experiments;
only a contrast for disgust was significant
Each experiment barely significant on moral judgment over all dilemmas;
In 3 of the 6*2 particular dilemmas priming made a significant difference.
All tests at 0.05; No adjustment for selection.
Their Conclusion: The findings support the idea that moral judgment is affected by
priming for cleanliness.
The replication study could not replicate these results.
Ignoring selection in the reported results is a quiet killer of
replicability in problems of medium complexity
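To see why unadjusted testing at 0.05 is risky with this many inferences, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that all m = 84 tests are independent and that every null hypothesis is true; neither assumption comes from the study itself.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """Chance of one or more significant results among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

m = 84  # the "at least m=84" inferences counted on the slide
print(prob_at_least_one_false_positive(m))  # chance of some "finding" by luck alone
print(m * 0.05)                             # expected number of such findings
```

Under these assumptions a handful of significant results is what chance alone would be expected to deliver.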
How should p-values be used?
Like every other statistical tool
1. Address the effect of selection
2. Add other tools for inference: confidence intervals &
estimators if relevant and possible, Empirical/Bayes
(but follow 1 above)
3. Use the relevant variability
4. Replicate others’ work as a way of life
Address selective inference
Inference on a selected subset of the parameters that
turned out to be of interest after viewing the data!
How is selection manifested?
Out-of-study selection - not evident in the published work
File drawer problem / publication bias
The garden of forking paths, p-hacking,
Data dredging, Double dipping,
Inferactive Data Analysis
In-study selection - evident in the published work:
Selection by the Abstract, a Table, a Figure
Selection by highlighting those passing a threshold
Selection by modeling: AIC, Cp, BIC, FDR, LASSO,…
Address the effect of in-study selection
Report adjusted p-values by some method,
controlling
either The Familywise Error Rate (FWER)
or The rate Conditional on being selected
or The False Discovery Rate (FDR)
Alternatively, highlight/table/display only results
remaining statistically significant after adjustment
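A sketch of two common ways to compute adjusted p-values: Holm's step-down method (controls the FWER) and the Benjamini-Hochberg step-up method (controls the FDR, under independence or positive dependence). The function names and the p-values are illustrative; a result is declared significant when its adjusted p-value is at most the chosen level.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (FWER control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of pvals[i] in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical raw p-values
print(holm_adjust(pvals))
print(bh_adjust(pvals))
```

Reporting the adjusted values lets each reader apply their own line, while the selection effect is already accounted for.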
Address the effect of out-of-study
“bright lines” selection
Policy: Select to report the result only if the estimator is
significant at some level (e.g. publication bias / file drawer)
$Y$ given $|Y| \ge z_{1-\alpha/2}$,
Conditional Conf. Int. -> False Coverage Rate
Conditional density
Conditional max. likelihood estimator
Hedges ’84, Weinstein et al ’13, Taylor and others ‘14…
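A minimal sketch of a conditional maximum likelihood estimator under a bright-line reporting rule. It assumes the simplest setting, Y ~ N(mu, 1) reported only when |Y| >= c, which is an idealization chosen here for illustration; the function names are hypothetical and the maximization uses a plain ternary search rather than any method from the cited papers.

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def selection_prob(mu, c):
    """P(|Y| >= c) when Y ~ N(mu, 1)."""
    return norm_cdf(mu - c) + norm_cdf(-mu - c)

def cond_loglik(mu, y, c):
    """Log conditional density of y given |Y| >= c, up to an additive constant."""
    return -0.5 * (y - mu) ** 2 - math.log(selection_prob(mu, c))

def conditional_mle(y, c=1.959964, iters=200):
    """Maximize the conditional log-likelihood by ternary search between 0 and y."""
    lo, hi = min(0.0, y), max(0.0, y)
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if cond_loglik(a, y, c) < cond_loglik(b, y, c):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

# Estimates that barely pass the line are pulled strongly towards 0;
# estimates far beyond it are left almost unchanged.
for y in (2.0, 2.5, 3.0, 4.0, 6.0):
    print(y, round(conditional_mle(y), 3))
```

The search interval relies on the fact that, in this model, the conditional MLE lies between 0 and the observed value.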
[Figure panels: conditional MLE; conditional MLE and CI for correlation]
Hedges ‘84, Zhong and Prentice ’08, Fithian, Sun, Taylor (16) YB and Meir (16+)
Use to assess ‘publication bias’ & other bright line rules
CIs extend more towards 0 and beyond as the original estimators are closer to
significance. 77% of replications fall inside. Meir Zeevi and YB (17+)
[Figure: replication effect size (vertical axis, −0.50 to 1.00) vs. original effect size (horizontal axis, 0.00 to 1.00); points are marked by replication p-value (significant / not significant) and replication power (0.6 to 0.9).]
CI and estimators in original study conditional on being significant at 5%.
[Figure, two panels: replication effect size vs. original effect size. One panel: thresholding at p-value ≤ 0.005 (replication power 0.80 to 0.95). Other panel: thresholding at p-value ≤ 0.05 (replication power 0.6 to 0.9). Points are marked by replication p-value (significant / not significant).]
Marginal Confidence Intervals are too optimistic
2. Where possible, add confidence intervals & estimators, adjusted for selection as well
20 parameters to be estimated with 90% CIs
[Figure, five panels: estimates and 90% CIs for the 20 parameters, plotted against index (1 to 20) on a vertical axis from −4 to 4.]
3/20 do not cover
3/4 CIs do not cover when selected
Need for False Coverage-statement Rate (FCR) controlling intervals
Selection of this form harms Bayesian intervals as well
(Wang & Lagakos ‘07 EMR, Yekutieli 2012)
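A small simulation sketch of the point above. It assumes independent normal estimates with unit variance, a selection rule of "report only the parameters whose marginal 90% CI excludes 0", and the FCR adjustment that widens each selected interval to level 1 − R·q/m, where R is the number selected out of m. The true parameter values, the selection rule, and the function name are illustrative assumptions.

```python
import random
from statistics import NormalDist

def simulate_fcr(thetas, q=0.10, n_rep=2000, seed=1):
    """Average false coverage proportion among selected CIs: (marginal, FCR-adjusted)."""
    rng = random.Random(seed)
    m = len(thetas)
    z_marg = NormalDist().inv_cdf(1 - q / 2)  # half-width of a marginal 1-q CI
    false_marg = false_adj = 0.0
    for _ in range(n_rep):
        ys = [rng.gauss(t, 1.0) for t in thetas]
        # Select the parameters whose marginal CI excludes zero.
        selected = [i for i, y in enumerate(ys) if abs(y) > z_marg]
        r = len(selected)
        if r == 0:
            continue  # nothing selected: contributes 0 to both averages
        # FCR-adjusted half-width: level 1 - r*q/m instead of 1 - q.
        z_adj = NormalDist().inv_cdf(1 - r * q / (2 * m))
        miss_marg = sum(abs(ys[i] - thetas[i]) > z_marg for i in selected)
        miss_adj = sum(abs(ys[i] - thetas[i]) > z_adj for i in selected)
        false_marg += miss_marg / r
        false_adj += miss_adj / r
    return false_marg / n_rep, false_adj / n_rep

# Hypothetical truth: 15 parameters at 0 and 5 parameters at 2.5.
thetas = [0.0] * 15 + [2.5] * 5
marginal_fcr, adjusted_fcr = simulate_fcr(thetas)
print(marginal_fcr, adjusted_fcr)
```

Across all 20 parameters the marginal intervals still miss about 10% of the time; it is only among the selected ones that their coverage breaks down.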
Mouse phenotyping example: opposite single lab results
Kafkafi et al (’17 Nature Methods)
3. Address the relevant variability
The above GxL interaction is “a fact of life”
Genotype-by-Lab effect for a genotype in a new lab is not
known; but when its variability $s^2_{G \times L}$ can be estimated, use
$$\frac{\mathrm{Mean}(MG_1) - \mathrm{Mean}(MG_2)}{\left( s^2_{\mathrm{Within}} (1/n + 1/n) + s^2_{G \times L} \right)^{1/2}}$$
Interaction size is the right “yardstick” against which genetic
differences should be compared,
when the concern is about replicability in other labs.
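A sketch of the GxL-adjusted comparison, following the formula on the slide as it appears to read: the difference between the two genotype means is divided by the square root of the within-lab variance term plus the genotype-by-lab interaction variance. All numbers and names below are hypothetical, and the published method may weight the interaction term and choose degrees of freedom differently.

```python
import math

def gxl_adjusted_t(mean1, mean2, s2_within, n1, n2, s2_gxl):
    """Return (naive, GxL-adjusted) standardized differences between two genotypes."""
    within_term = s2_within * (1.0 / n1 + 1.0 / n2)
    naive_se = math.sqrt(within_term)
    adjusted_se = math.sqrt(within_term + s2_gxl)  # interaction widens the yardstick
    diff = mean1 - mean2
    return diff / naive_se, diff / adjusted_se

# Hypothetical single-lab results; s2_gxl would come from multi-lab data.
naive, adjusted = gxl_adjusted_t(mean1=10.0, mean2=8.0,
                                 s2_within=4.0, n1=10, n2=10, s2_gxl=0.8)
print(naive, adjusted)
```

Increasing the sample size shrinks only the within-lab term, so the adjusted statistic cannot be driven up by a larger single-lab study alone.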
Good design, large sample size, transparency, avoiding p-values: none of these will solve it. So GxL-adjust at each lab:
Single-lab analyses in all known replication studies
Kafkafi et al (’17 Nature Methods)
From the example to generality
Choosing the relevant level of variability is critical in order
to increase replicability, for any inferential procedure: tests,
confidence intervals, and estimates.
Many small studies are better than a single large one, even if
underpowered!
Clinical research: multiple centers with Center by
Treatment interaction
Educational research: random effects for schools &
teachers (and interactions)
Functional MRI: Random effect for subjects
Replicability is a minimal form of Generalizability
The ‘‘Many Labs’’ Replication Project ‘14
4. Replicate others’ work as a way of life
• Check consistency of effect’s directional decision (sign)
Significant (p≤.05) in both studies (Fisher’s definition); or
Both confidence intervals are entirely on same side.
If replicated, strong evidence against randomness, (p<0.0025),
but Scientifically much stronger:
combining different tests by different investigators
$r_{2/2}$-value = $\max(p_1, p_2)$
$r_{u/m}$-value: the smallest significance level at which
effects in at least u out of the m studies,
adjusted for selection, are significant
Have been used to analyze generalizability of results
In the Psychological Reproducibility Project 36% replicated
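A sketch of how a "replicated in at least u out of m studies" p-value can be computed. It uses a Bonferroni-style partial conjunction rule, (m − u + 1) times the u-th smallest p-value, which reduces to max(p1, p2) for u = m = 2. The rule and the function name are assumptions chosen for illustration, and the sketch leaves out the adjustment for selection that the r-value includes.

```python
def partial_conjunction_p(pvals, u):
    """Bonferroni-style p-value for 'the effect is present in at least u of the m studies'."""
    m = len(pvals)
    if not 1 <= u <= m:
        raise ValueError("u must be between 1 and the number of studies")
    p_sorted = sorted(pvals)
    return min(1.0, (m - u + 1) * p_sorted[u - 1])

# Two studies: the replicability p-value is the larger of the two.
print(partial_conjunction_p([0.01, 0.04], u=2))

# Three studies, asking for the effect in at least two of them.
print(partial_conjunction_p([0.01, 0.04, 0.30], u=2))

# If both independent studies must pass .05, pure chance does so with
# probability 0.05 squared, which is the 0.0025 quoted above.
print(0.05 ** 2)
```

Taking the maximum asks that every study carry its own weight, which is a stronger requirement than a pooled analysis that one large study could dominate.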
Replicate others’ work as a way of life
Reproducibility projects are not sustainable.
Neither is publishing many papers with negative results only.
Instead
• Every research proposal and paper should have a replicability-
check component of a result, considered by the authors
important for their proposed research.
• Its result will be reported whatever the outcome is, in the
extended-abstract/main-body in 1-2 searchable sentences.
• The authors of a replicated study will receive special recognition
for having published a result considered important enough by
others to invest the effort to replicate it.
• Researchers, Granting agencies, Publishers, Academic leaders

The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 

Recently uploaded

MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
Steve Thomason
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 

Recently uploaded (20)

MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 

Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."

  • 1. In the world beyond p < .05: When & How to use p < .0499… Yoav Benjamini Tel Aviv University, Israel replicability.tau.ac.il Supported by ERC PSARPS and HBP grants
  • 2. The gold standard of science: A discovery should be replicable. We cannot assure the replicability of results from a single study; we can only enhance it.
  • 3. When to use p-values? Almost Always 1. Statistical significance testing (via the p-value) is the first-line defense against being fooled by randomness: • Valid calculation of the p-value requires minimal assumptions, fewer than any other statistical method. • The “assumptions” need not be merely assumed, but can be assured by a properly designed experiment (randomization + non-parametric test)
  • 4. When to use p-values? Almost Always • Usable even if the scale of measurement is meaningless and only a directional decision is needed: a significant difference gives sign determination (the null need not be precisely true). • In some emerging branches of science it’s the only way to compare across conditions: GWAS, fMRI, Brain Networks, Genomics pathways. • The p-value offers a common concept and language for addressing randomness across science
  • 5. When to use with p<“some line”? Often 2a. Selection according to some line is unavoidable • When the analysis should lead to an action. • A line is needed for power analysis in the design stage, allowing justification of human and animal sample sizes • A minimal requirement for publication decision in journals • A bright line is needed when the action should be fair
  • 6. When to use with p<“some line”? Often 2b. Selection according to some line is unavoidable in modern science • A line is always used for selection when the analysis results are too numerous to be included in total • When highlighting results
  • 7. Epidemiology: a p-values free zone • Giovannucci et al. (1995) look for relationships between more than a hundred types of food intakes and the risk of prostate cancer • The abstract reports only three (marginal) 95% confidence intervals (CIs), apparently only for those relative risks whose CIs do not cover 1. “Eat Ketchup and Pizza and avoid Prostate Cancer”
  • 8. Selection by a Table SCIENCE, ‘07 Y Benjamini
  • 9. The Main Table 11 selected by the table out of ~400,000 Y Benjamini
  • 10. Selection by a Figure. Goal: Association between volume changes at voxels with genotype (Stein et al. ’10). 32,000 voxels searched, 448,000 SNPs; number of tests ~ 13,000,000,000
  • 11. When to use it with p < 0.0499…? 3. Selection according to p < 0.0499… is not too bad • In a well designed FDA regulated clinical trial with a single primary endpoint, the .05 line is used to define success or failure. It managed to screen ~50% too much. • Moreover, the FDA does not approve a drug based on a single successful trial: at least two are needed, and often more. • Indeed, the p<.05 threshold served us well until the ’80s. Then came a change in the way science is conducted.
  • 12. The industrialization of the scientific process 1888 1999 1950 2010
  • 13. In general scientists were slow to adjust. [Figure: one paper per year from Genetics; number of gene expressions reported, on a log scale from 1 to 1,000,000; annotations: micro-arrays invented, multiplicity addressed for the first time.]
  • 14. When to use it with p < 0.0499…? In large-scale problems, lower levels are used as the threshold for significance in order to control some error rate, with which we can maintain that most of the selected findings will not be false! The p < 5×10^-8 used in whole-genome scans was chosen to control the probability of making even one error (FWER) at 0.05. And it works well, without getting rid of the p-value, or of the .0499…
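A minimal sketch of where a threshold such as 5×10^-8 can come from. It assumes, which the slide does not state, a Bonferroni correction over roughly one million effectively independent tests; the function name and the test count are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the FWER at or below alpha
    under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumption for illustration: about one million effectively independent tests.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
print(f"{genome_wide:.1e}")
```

With a single test the same function returns 0.05, which is the sense in which the large-scale threshold keeps the familiar line rather than replacing it.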
  • 15. The Main Table 11 selected by the table out of ~400,000 FWER <.05 Y Benjamini
  • 16. Elsewhere? In medical research? In-depth analysis of 100 papers from NEJM, 2002-2010, all with multiple endpoints (Cohen and YB ‘16) • # of endpoints in a paper: 4-167; mean = 27 • In 80% the issue of multiplicity was entirely ignored: p ≤ 0.05 • All studies designated primary endpoints • Conclusions were based on other endpoints when the primary failed. The above reflects most of the published medical research, but not the regulatory stage (Phase III trials). The reason that ~50% of phase III trials fail?
  • 17. In Experimental Psychology? Our analysis of the 100 studies in the Psychology reproducibility project: # of inferences per study: 4-700, average 72; only 11 (very, very partially) addressed selection
  • 18. “With a clean conscience” (Schnall et al. ’08, Psych. Sci.) Participants were presented with 6 moral dilemmas and asked “how wrong each action was”. Does priming for cleanliness affect the response? One assessment of wrongdoing & 9 emotion ratings per dilemma; two methods of priming, verbal & physical (separate experiments) (at least m=84). Results: No significant difference on any of the emotions in any of the experiments; only a contrast for disgust was significant. Each experiment was barely significant on moral judgment over all dilemmas; in 3 of the 6*2 particular dilemmas priming made a significant difference. All tests at 0.05; no adjustment for selection. Their conclusion: The findings support the idea that moral judgment is affected by priming for cleanliness. The replication study could not replicate these results. Ignoring selection in the reported results is a quiet killer of replicability in problems of medium complexity
  • 19. How should p-values be used? Like every other statistical tool 1. Address the effect of selection 2. Add other tools for inference: confidence intervals & estimators if relevant and possible, Empirical/Bayes (but follow 1 above) 3. Use the relevant variability 4. Replicate others’ work as a way of life
  • 20. Address selective inference Inference on a selected subset of the parameters that turned out to be of interest after viewing the data! How is selection manifested? Out-of-study selection - not evident in the published work File drawer problem / publication bias The garden of forking paths, p-hacking, Data dredging, Double dipping, Inferactive Data Analysis
  • 21. In-study selection - evident in the published work: Selection by the Abstract, a Table, a Figure Selection by highlighting those passing a threshold Selection by modeling: AIC, Cp, BIC, FDR, LASSO,…
  • 22. Address the effect of in-study selection Report adjusted p-values by some method, controlling either The Familywise Error Rate (FWER) or The rate Conditional on being selected or The False Discovery Rate (FDR) Alternatively, highlight/table/display only results remaining statistically significant after adjustment
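As a rough illustration of reporting adjusted p-values, the sketch below computes Bonferroni-adjusted values (for FWER control) and Benjamini-Hochberg-adjusted values (for FDR control). The slide names the error rates, not specific procedures, so the choice of these two methods, the function names, and the example p-values are assumptions made here.

```python
def bonferroni_adjust(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvals[i] / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni_adjust(example)
fdr_adjusted = bh_adjust(example)
```

A result is then highlighted only if its adjusted p-value stays at or below the chosen level, for example 0.05.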
  • 23. Address the effect of out-of-study “bright lines” selection. Policy: select to report the result only if the estimator is significant at some level (e.g. publication bias / file drawer): Y given |Y| ≥ z_{1-α/2}. Tools: conditional density; conditional max. likelihood estimator; conditional conf. int. -> False Coverage Rate. Hedges ’84, Weinstein et al ’13, Taylor and others ‘14…
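A minimal sketch of a conditional maximum likelihood estimator under a bright-line rule. It assumes, as a simplification not stated on the slide, that Y is normal with unknown mean mu and known unit variance, and that Y is reported only when |Y| ≥ c. The function names and the ternary-search maximizer are choices made for this illustration.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def conditional_loglik(mu, y, c):
    """Log density (up to a constant) of Y ~ N(mu, 1) given |Y| >= c."""
    select_prob = normal_cdf(-c - mu) + normal_cdf(mu - c)
    return -0.5 * (y - mu) ** 2 - math.log(select_prob)


def conditional_mle(y, c=1.959964, iters=200):
    """Maximize the conditional log-likelihood over mu by ternary search.

    The truncated normal is an exponential family in mu, so the
    log-likelihood is concave and ternary search is adequate.
    """
    if abs(y) < c:
        raise ValueError("y would not have been selected: |y| < c")
    lo, hi = -abs(y) - 1.0, abs(y) + 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if conditional_loglik(m1, y, c) < conditional_loglik(m2, y, c):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


barely_significant = conditional_mle(2.0)  # shrinks strongly toward 0
clearly_significant = conditional_mle(5.0)  # almost no shrinkage
```

An estimate that barely crosses the line is pulled strongly toward zero, while an estimate far beyond the line is left nearly unchanged.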
  • 24. Conditional MLE Cond. MLE and CI for correlation Hedges ‘84, Zhong and Prentice ’08, Fithian, Sun, Taylor (16) YB and Meir (16+) Use to assess ‘publication bias’ & other bright line rules
  • 25. CIs extend more towards 0 and beyond as the original estimators are closer to significance. 77% of replications fall inside. Meir Zeevi and YB (17+). [Figure: replication effect size vs. original effect size; points marked by replication p-value (not significant / significant) and replication power (0.6 to 0.9).] CI and estimators in original study conditional on being significant at 5%.
  • 26. Thresholding at p-value ≤ 0.005 vs. thresholding at p-value ≤ 0.05. [Figure: two panels of replication effect size vs. original effect size, one per threshold; points marked by replication p-value (not significant / significant) and replication power (0.80 to 0.95 in the 0.005 panel, 0.6 to 0.9 in the 0.05 panel).]
  • 27. 2. Where possible add confidence intervals & estimators, adjusted for selection as well. Marginal confidence intervals are too optimistic
  • 28. 20 parameters to be estimated with 90% CIs. [Figure: 20 estimates with their confidence intervals, plotted by index (1 to 20) on a scale from −4 to 4, shown in successive panels.] 3/20 do not cover; 3/4 CIs do not cover when selected. Need for False Coverage-statement Rate (FCR) controlling intervals. Selection of this form harms Bayesian intervals as well (Wang & Lagakos ‘07 EMR, Yekutieli 2012)
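One well-known way to build FCR-controlling intervals is the Benjamini-Yekutieli (2005) adjustment: after selecting R of m parameters, construct each selected interval at marginal level 1 − R·q/m instead of 1 − q. The sketch below assumes normal estimators with a known standard error; the function name, the estimates, and the selected indices are hypothetical and are not data from the slide.

```python
from statistics import NormalDist


def fcr_adjusted_intervals(estimates, selected, q=0.10, se=1.0):
    """FCR-adjusted intervals for the selected parameters.

    Each selected parameter gets a marginal interval at level
    1 - R*q/m, where R = number selected and m = number of parameters.
    """
    m = len(estimates)
    r = len(selected)
    if r == 0:
        return {}
    level = 1.0 - r * q / m
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return {i: (estimates[i] - z * se, estimates[i] + z * se) for i in selected}


# Hypothetical setting: 20 estimates, 4 of them selected, target FCR of 10%.
estimates = [0.3, -1.1, 2.4, 0.0, 2.9, -0.4, 1.2, -2.6, 0.7, 0.1,
             -0.9, 3.1, 0.5, -0.2, 1.0, -1.4, 0.8, 0.2, -0.6, 1.5]
selected = [2, 4, 7, 11]
intervals = fcr_adjusted_intervals(estimates, selected, q=0.10)
half_width = (intervals[2][1] - intervals[2][0]) / 2.0
```

With 4 of 20 selected, each interval is built at the 98% level, so it is noticeably wider than the marginal 90% interval.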
  • 29. 3. Address the relevant variability. Mouse phenotyping example: opposite single-lab results. Kafkafi et al (’17 Nature Methods)
  • 30. The above GxL interaction is “a fact of life”. Interaction size is the right “yardstick” against which genetic differences should be compared, when the concern is about replicability in other labs. Good design, large sample size, transparency, avoiding p-values: none of these will solve it. So GxL-adjust at each lab: the Genotype-by-Lab effect for a genotype in a new lab is not known, but when its variability $s^2_{G \times L}$ can be estimated, use $\left(\mathrm{Mean}(M_{G1}) - \mathrm{Mean}(M_{G2})\right) / \left(s^2_{\mathrm{Within}}(1/n + 1/n) + s^2_{G \times L}\right)^{1/2}$
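A small sketch of the GxL-adjusted statistic on this slide, reading both s terms as variances (an assumption, since the superscripts are partly lost in the transcript) and taking equal group sizes n. The function name and all numbers are hypothetical.

```python
import math


def gxl_adjusted_statistic(mean_g1, mean_g2, s2_within, n, s2_gxl):
    """Difference of genotype means scaled by within-lab and GxL variability.

    s2_within: within-group variance in the single lab
    s2_gxl:    estimated genotype-by-lab interaction variance
    n:         animals per genotype (equal group sizes assumed)
    """
    if n < 1 or s2_within < 0 or s2_gxl < 0:
        raise ValueError("n must be >= 1 and variances must be non-negative")
    variance = s2_within * (1.0 / n + 1.0 / n) + s2_gxl
    if variance == 0:
        raise ValueError("total variance must be positive")
    return (mean_g1 - mean_g2) / math.sqrt(variance)


# Hypothetical numbers for illustration only.
naive = gxl_adjusted_statistic(12.0, 10.0, s2_within=4.0, n=10, s2_gxl=0.0)
adjusted = gxl_adjusted_statistic(12.0, 10.0, s2_within=4.0, n=10, s2_gxl=0.2)
```

Setting the interaction variance to zero recovers the usual single-lab statistic; any positive interaction variance lowers it, which makes a single-lab difference harder to declare significant.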
  • 31. Single-lab analyses in all known replication studies Kafkafi et al (’17 Nature Methods)
  • 32. From the example to generality. Choosing the relevant level of variability is critical in order to increase replicability, for any inferential procedure: tests, confidence intervals, and estimates. Many small studies are better than a single large one, even if underpowered! Clinical research: multiple centers with Center-by-Treatment interaction. Educational research: random effects for schools & teachers (and interactions). Functional MRI: random effect for subjects
  • 33. Replicability is a minimal form of Generalizability The ‘‘Many Labs’’ Replication Project ‘14
  • 34. 4. Replicate others’ work as a way of life • Check consistency of the effect’s directional decision (sign): significant (p ≤ .05) in both studies (Fisher’s definition), or both confidence intervals entirely on the same side. If replicated, this is strong evidence against randomness (p < 0.0025), but scientifically much stronger: combining different tests by different investigators. • r_{2/2}-value = max(p1, p2) • r_{u/m}-value: the smallest significance level at which effects in at least u out of the m studies, adjusted for selection, are significant. These have been used to analyze generalizability of results. In the Psychological Reproducibility Project 36% replicated
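A brief sketch of the replicability summaries named on this slide. The two-study value is the maximum of the two p-values, as stated. For the u-out-of-m case the code uses a Bonferroni-style partial conjunction p-value, (m − u + 1) times the u-th smallest p-value; that choice is an assumption made here, and it omits the adjustment for selection that the slide's r-value includes. Function names and p-values are illustrative.

```python
def r_value_two_studies(p1, p2):
    """Evidence that the effect holds in both studies: max(p1, p2)."""
    return max(p1, p2)


def partial_conjunction_p(pvals, u):
    """Bonferroni-style p-value for 'at least u of the m effects are real'.

    Not adjusted for selection across many features; this is only the
    per-feature building block.
    """
    m = len(pvals)
    if not 1 <= u <= m:
        raise ValueError("u must be between 1 and the number of studies")
    p_u = sorted(pvals)[u - 1]  # u-th smallest p-value
    return min(1.0, (m - u + 1) * p_u)


# Hypothetical p-values, for illustration only.
both = r_value_two_studies(0.01, 0.03)
two_of_three = partial_conjunction_p([0.01, 0.03, 0.2], u=2)
```

When u equals m the partial conjunction value reduces to the largest p-value, which agrees with the two-study rule.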
  • 35. Replicate others’ work as a way of life. Reproducibility projects are not sustainable. Neither is publishing many papers with negative results only. Instead • Every research proposal and paper should have a replicability-check component of a result, considered by the authors important for their proposed research. • Its result will be reported whatever the outcome is, in the extended-abstract/main-body in 1-2 searchable sentences. • The authors of a replicated study will receive special recognition for having published a result considered important enough by others to invest the effort to replicate it. • Researchers, Granting agencies, Publishers, Academic leaders