The Statistics Wars and Their Casualties
September 2022
The two statistical pillars of replicability: Addressing
selective inference and irrelevant variability
Yoav Benjamini
Statistics & O.R., School of Mathematical Sciences
The Sagol School of Neuroscience
Tel Aviv University
Outline
• Tukey’s two pillars of replicability
Addressing Selective Inference
Relevant Variability
• Some recent evidence
• The relation of StatWars to replicability issues
• Frequentist & Bayesian responses to the pillars
• NEJM guidelines: a casualty of the wars
Tukey’s last published work
A puzzling encyclopedic entry on Multiple Comparisons ¹
Opens: “a diversity of issues … that tend to be
important, difficult, and often unresolved.”
Details:
• The FDR approach in pairwise comparisons ²
• The Random Effects vs Fixed Effects analysis ³
What’s that to do with multiple comparisons?
¹ Jones et al (’22) International Encyclopedia of Statistics in the Social Sciences
² Williams, Jones & Tukey (’99) ³ Cornfield & Tukey (’56)
Genes & Behaviour: Crabbe et al (Science, ’99)
Compared 12 measures across strains at 3 labs
In spite of strict standardization,
Significant Lab*Genotype Interaction
“Thus, experiments characterizing mutants may yield
results that are idiosyncratic to a particular laboratory.”
We thought that using our computational tools would solve the
problem
Comparing 17 measures between 8 inbred strains of mice
At 3 labs: Golani at TAU, Elmer at MPRC, Kafkafi at NIDA ¹
MCP 2002, ¹ Kafkafi et al PNAS 2004
YB Berkeley ʼ10
Significance of 8 strain differences (p-values)

Behavioral Endpoint               Labs Fixed   Labs Mixed
Prop. Lingering Time              0.00001      0.0029
# Progression segments            0.00001      0.0068
Median Turn Radius (scaled)       0.00001      0.0092
Time away from wall               0.00001      0.0108
Distance traveled                 0.00001      0.0144
Acceleration                      0.00001      0.0146
# Excursions                      0.00001      0.0178
Time to half max speed            0.00001      0.0204
Max speed wall segments           0.00001      0.0257
Median Turn rate                  0.00001      0.0320
Spatial spread                    0.00001      0.0388
Lingering mean speed              0.00001      0.0588
Homebase occupancy                0.001        0.0712
# stops per excursion             0.0028       0.1202
Stop diversity                    0.027        0.1489
Length of progression segments    0.44         0.5150
Activity decrease                 0.67         0.8875

Legend: Strain x Lab interaction significant; Strain x Lab interaction not significant; FDR ≤ .05
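The "FDR ≤ .05" marking in the table can be illustrated with the Benjamini-Hochberg step-up procedure applied to each column of p-values. This is a minimal sketch: the slides do not say which FDR procedure produced the marking, so the use of BH here is an assumption, and the function name `benjamini_hochberg` is illustrative. The p-values are copied from the table.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if pvalues[idx] <= rank * q / m:
            n_reject = rank
    return sorted(order[:n_reject])


# p-values from the table: labs treated as fixed vs. as random (mixed model).
fixed = [0.00001] * 12 + [0.001, 0.0028, 0.027, 0.44, 0.67]
mixed = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178, 0.0204,
         0.0257, 0.0320, 0.0388, 0.0588, 0.0712, 0.1202, 0.1489, 0.5150,
         0.8875]

n_fixed = len(benjamini_hochberg(fixed))
n_mixed = len(benjamini_hochberg(mixed))
print("rejections with labs fixed:", n_fixed)
print("rejections with labs mixed:", n_mixed)
```

Treating labs as random makes the interaction part of the yardstick, so fewer strain differences survive the same FDR threshold.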
Recalling Mann’s warning in “Behavior Genetics in Transition” ¹ (Science, ’94):
“…jumping too soon to discoveries..” (and press discoveries)
“raises the issue of Replicability”
Tukey’s entry is about replicability of discoveries ¹
Addressing: Selective inference
The relevant variability
¹ Kafkafi et al (PNAS ’04)
Intensified due to the industrialization of the scientific process
Testing the approach1
• Took Single lab experimental results involving comparisons
between mouse strains from Mouse Phenotyping Database
• Carried out similar experiments in 3 labs: JAX, TAUL, and TAUM,
without standardization
• Used Random Lab Mixed Model Analysis to assess replicability of
original result
• Estimated $\gamma^2 = \sigma^2_{G \times L} / \sigma^2_{\text{within}}$ for each endpoint from the Database or
from our experiments,
and used it to adjust the single-lab results (by inflating the sd)
60% of single lab rejected results were non-replicable
12% of adjusted single lab rejected results were non-replicable
Jaljuli, Kafkafi et al ’22+ BioRxiv
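The adjustment "by inflating sd" can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the genotype-by-lab interaction adds a variance of γ²·s² to each strain mean (hence the `2 * gamma2` term for a difference of two strains), and it uses a normal approximation instead of a t distribution with adjusted degrees of freedom. The function name and all numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist


def gxl_adjusted_test(mean1, mean2, s_within, n1, n2, gamma2):
    """Compare two strain means in a single lab, with and without GxL inflation.

    Returns (p_naive, p_adjusted), both two-sided, normal approximation.
    """
    diff = mean1 - mean2
    # Usual single-lab standard error of the difference.
    se_naive = s_within * math.sqrt(1 / n1 + 1 / n2)
    # Assumed inflation: one interaction term gamma2 * s_within^2 per strain.
    se_adjusted = s_within * math.sqrt(1 / n1 + 1 / n2 + 2 * gamma2)
    norm = NormalDist()
    p_naive = 2 * (1 - norm.cdf(abs(diff / se_naive)))
    p_adjusted = 2 * (1 - norm.cdf(abs(diff / se_adjusted)))
    return p_naive, p_adjusted


# Hypothetical endpoint: 10 mice per strain, gamma^2 = 0.2.
p_naive, p_adjusted = gxl_adjusted_test(10.0, 8.0, 2.0, 10, 10, 0.2)
print(f"naive p = {p_naive:.4f}, GxL-adjusted p = {p_adjusted:.4f}")
```

With γ² = 0 the two p-values coincide; any positive γ² makes the single-lab result harder to declare significant.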
Reading the ’56 paper again
[Figure, fixed-effects version: EXPERIMENT (Fixed); Scientific Knowledge; island reached by statistical inference]
Reading the ’56 paper again
[Figure, mixed-effects version: EXPERIMENT (Mixed); Scientific Knowledge; island reached by statistical inference]
Inference on a selected subset of the parameters that turned out to
be of interest after viewing the data!
Relevant to all statistical methods – hurting replicability
Out-of-study selection - not evident in the published work
File drawer problem / publication bias
The garden of forking paths, p-hacking, cherry picking
significance chasing, HARKing, Data dredging,
All are widely discussed and addressed
e.g. by Transparency & Reproducibility standards
Selective inference
In-study selection - evident in the published work:
Selection by the Abstract, Discussion
Table, Figure
Selection by highlighting those passing a threshold
p<.05, p<.005, p<5*10^-8, *, **, 2-fold
Selection by modeling: AIC, Cp, BIC, LASSO,…
In complex research problems - in-study selection is unavoidable!
Selective inference
• Giovannucci et al. (1995) look for relationships between more
than a hundred types of food intakes and the risk of prostate
cancer
• The abstract reports three (marginal) 95% confidence intervals
(CIs), apparently only for those relative risks whose CIs do not
cover 1.
“Eat Ketchup and Pizza and avoid Prostate Cancer”
Selective inference hampers replicability
“Although the pooled RR for raw tomato consumption
was initially significant in 1995, this association has
remained nonsignificant since 2000 after the addition
of 7 studies…” Meta-analysis by Rowles et al (2017)
Any adjustment for selection yields Conf. Intervals covering 1
Error-rates for selective inference
Secondary endpoints from effect of fish oil consumption on CHD, stroke or
death from CVD . Manson et al ‘18 used by NEJM editorial
• Simultaneous over all possible selections
• Simultaneous over the selected
• Conditional on being selected
• On the average over the selected
• On the average over all
Addressing selective inference in psychology
Transparency problems 1/100
Reproducibility problems 6/100
Reproducible & selected p ≤ .05
56/88 (=64%) failed.
Evident selection
Adjusting via hierarchical FDR
22 with padj > .05; and screened
Of them
21 non-replicable results
1 replicable discovery lost
Failure rate 36/67 (=52%)
Power loss ~1/31
100 replication efforts, 64/100 failed
Addressing selective inference in psychology
Reducing level to p ≤ 0.005
Benjamin et al + 200 signatures
32 with p > .005; of them
21 non-replicable results
11 replicable discoveries
Failure rate 25/47 (=54%)
Power loss =1/3
Zeevi, et al, (‘21+)
The statistical wars
A. Identifying problems of non-evident selective inference
with the use of p-values and statistical testing
Ending with the ‘New Statistics’ and bans on p-values &NHST
B. The 2016 ASA guidelines regarding the p-values1
“… some statisticians prefer …to even replace p-values
with other approaches”
e.g. Bayes factors, confidence intervals, credence intervals
¹ Wasserstein & Lazar (Am. Stat. ’16)
C. The 2019 ASA conference and Editorial2
Don’t use p<.05; Don’t say “statistically significant”
¹ Wasserstein & Lazar (Am. Stat. ’16) ² Wasserstein, Schirm & Lazar (Am. Stat. ’19)
D. The ASA president’s task force statement ³
³ YB et al (AOAS ’21)
E. The disclaimer
By 2022, MacNaughton documented 41 explicit references
to the 2019 editorial as official ASA policy.
Past-president Kafadar’s letter to ASA board required:
• Either the board approves the editorial as policy; or
• A disclaimer is added
The editorial was written by the three editors acting as
individuals and reflects their scientific views not an
endorsed position of the American Statistical Association.
May 2022 in the online version only
Scientific Reproducibility and Statistical Significance
Symposium, Convened by ASA on June 3:
“The goals of the symposium are:”
1. To disseminate the stance of the ASA on the
appropriate use of results from null hypothesis
significance testing
2. To offer alternatives to such testing
3. To discuss changes to publication policies that
would benefit both individual scientists and science
writ large.
The war goes on
• The war between Bayesians and frequentists was fierce
in the 1950s.
• It receded to co-existence and mutual respect
• The replicability crisis was used by zealous Bayesians to
restart the war
• On the eve of the meeting preparing the 2016
statement I expressed my opinion that the statement
should be about statistics and replicability in general,
not merely focused on the p-value
• Unfortunately, I see no connection between the
StatWars and replicability.
When P values are reported for multiple outcomes without
adjustment for multiplicity, the probability of declaring a treatment
difference when none exists can be much higher than 5%. (July ‘19)
The Casualties: NEJM guidelines
1. P-values may not be reported (for secondary endpoints) if
multiplicity correction method was not specified in the protocol or in
the statistical analysis plan
2. Unadjusted (marginal) 95% CIs reported for all secondary
endpoints
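The NEJM remark that the probability of a false claim "can be much higher than 5%" is easy to quantify in the simplest case. The sketch below assumes m independent tests, each at level 0.05, with no true effects; independence is an assumption made only to get a closed form, and the function name is illustrative.

```python
def prob_any_false_positive(m, alpha=0.05):
    """Chance of at least one rejection among m independent true-null tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 17):
    print(f"{m:2d} outcomes: {prob_any_false_positive(m):.3f}")
```

The same arithmetic applies to unadjusted 95% CIs used as a decision tool, since a CI excluding the no-effect value is equivalent to a test at 0.05.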
The Casualties: NEJM guidelines
Wu et al, citing results of Manson et al NEJM 2018
Nature Reviews Cardiology (2019)
Fish oil supplementation …
had no significant effect on the composite primary end point of
CHD, stroke or death from CVD
but reduced the risk of
total CHD* (HR 0.83, 95% CI 0.71–0.97),
percutaneous coronary intervention (HR 0.78, 95% CI 0.63–0.95),
total myocardial infarction* (HR 0.72, 95% CI 0.59–0.90),
fatal myocardial infarction (HR 0.50, 95% CI 0.26–0.97).
The only 4 out of 17 that excluded 1 and were not exploratory.
What’s the problem?
• CI as decision tool
not crossing the no-effect value ⇔ significance testing at 0.05
The issue of multiplicity, as recognized by NEJM, does not disappear
• CIs coverage
Coverage deteriorates: taking the 17 estimates from above as the
parameters,
generated random means with SNR as estimated above, times k;
selected the CIs not crossing above 1; 10,000 simulations;
checked the average non-coverage over the so selected.
For k=1: 0.11; for k=0.5: 0.18; for k=0.01: 0.56
Selective inference by CIs is totally ignored
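A simulation of this kind can be sketched as below. Several things are assumptions rather than the author's setup: the 17 signal-to-noise ratios are hypothetical placeholders (the estimates from the trial are not listed in the slides), selection is two-sided (the nominal CI excludes the no-effect value), each estimate has unit standard error, and a simulation in which nothing is selected contributes 0 to the average.

```python
import random
from statistics import NormalDist

# Hypothetical standardized effects for 17 endpoints (placeholders only).
HYPOTHETICAL_SNR = [-3.0, -2.6, -2.4, -2.1] + [-1.0] * 6 + [-0.3] * 7


def selected_noncoverage(snr, k, n_sim=10_000, level=0.95, seed=0):
    """Average proportion of selected nominal CIs that miss their parameter."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    total = 0.0
    for _ in range(n_sim):
        selected = missed = 0
        for s in snr:
            theta = k * s                # true effect, scaled by k
            est = rng.gauss(theta, 1.0)  # estimate with unit standard error
            if abs(est) > z:             # nominal CI excludes 0: selected
                selected += 1
                if abs(est - theta) > z:  # the CI misses its own parameter
                    missed += 1
        total += missed / max(selected, 1)  # 0 when nothing is selected
    return total / n_sim


for k in (1, 0.5, 0.01):
    rate = selected_noncoverage(HYPOTHETICAL_SNR, k)
    print(f"k = {k}: non-coverage over the selected = {rate:.2f}")
```

As k shrinks toward zero the true effects vanish, every selected interval is a miss, and the rate approaches the chance that at least one of the 17 intervals excludes the no-effect value.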
We ¹ suggest:
Select if the nominal 95% CI
does not cross 0 (=log(1))
Assure False Coverage-Rate
control over the so selected
via the general BY-CIs:
* Replace the 4 nominal CIs
that do not cross 1 by
100(1-0.05*4/17)% ≈ 98.8% CIs
* For others use nominal CIs
¹ YB, Heller, Panagiotou arXiv ’21
To offer NEJM default guidelines retaining power for exploration
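The widening step can be sketched for the four selected hazard ratios quoted from Wu et al. The sketch assumes the reported intervals are symmetric on the log scale (Wald-type), recovers each standard error from the nominal interval width, and recentres at the log-scale midpoint; these are assumptions of the illustration, and the function name is hypothetical. The adjusted level 1 − 0.05·4/17 is the one given on the slide.

```python
import math
from statistics import NormalDist


def fcr_adjusted_ci(lower, upper, n_selected, n_total, q=0.05):
    """Widen a nominal (1 - q) CI for a ratio to level 1 - q * n_selected / n_total."""
    norm = NormalDist()
    z_nominal = norm.inv_cdf(1 - q / 2)
    z_adjusted = norm.inv_cdf(1 - q * n_selected / n_total / 2)
    # Work on the log scale, where the interval is assumed symmetric.
    center = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z_nominal)
    return (math.exp(center - z_adjusted * se),
            math.exp(center + z_adjusted * se))


# Nominal 95% CIs for the 4 selected endpoints (out of 17), from the slides.
reported = {
    "total CHD": (0.71, 0.97),
    "percutaneous coronary intervention": (0.63, 0.95),
    "total myocardial infarction": (0.59, 0.90),
    "fatal myocardial infarction": (0.26, 0.97),
}
adjusted = {name: fcr_adjusted_ci(lo, hi, 4, 17)
            for name, (lo, hi) in reported.items()}
for name, (lo, hi) in adjusted.items():
    print(f"{name}: {lo:.2f} to {hi:.2f}")
```

Only the selected intervals are widened; the other 13 keep their nominal 95% level, which is what preserves power for exploration.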
An epidemiologist with 13k followers
The Casualties: 2
From a Lancet reviewer
• How can statistical hypotheses and strategies for addressing
multiplicity issues be pre-specified in an observational
study, which typically requires exploratory analyses to
reduce bias?
• Can an author claim to have performed a confirmatory test
without this pre-specification?
Logical Fallacy:
Confirmatory analysis => Multiplicity adjustment
Therefore:
Multiplicity adjustment => Confirmatory analysis
Addressing the relevant variability
Frequentists: Ignored by many, but awareness is increasing
Bayesians: Recognized (hierarchical modelling)
Addressing evident selective inference
Frequentists: Recognized, more for p-values, less for CIs &
exploratory research
Bayesians: Ignored as a matter of principle
The 2 pillars in Frequentist & Bayesian eyes
E.g. Gelman, Hill & Yajima (’12)
Why we (usually) don't have to worry about multiple comparisons.
The underlying theoretical justification:
Since we condition on all the data,
Any selection after the data is viewed is already reflected in the
posterior distribution.
Are Bayesian intervals immune from selection’s harms?
Assumed prior: µi ~ N(0, 0.5²); yi ~ N(µi, 1); i = 1, 2, …, 10^6 (Gelman’s Ex.)
Parameters generated by: N(0, 0.5²), or 0.999*N(0, 0.5²) + 0.001*N(0, 0.5² + 3²)
Type of 95% confidence/credence intervals: Marginal; Bayesian Credibility; Marginal FCR-adjusted; Bayesian Credibility; BH-Selected FCR-adjusted
Intervals not covering their parameter: 5.0% 5.0% 5.1% 2.1%
Intervals not covering 0 (Selected): 7.3% 0.01% 0.03% 0.03%
Intervals not covering their parameter, out of the Selected: 48% 3.4% 1.0% 71.5% 2.1%
YB HDSR ‘19
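The effect of a slightly wrong prior on selected credible intervals can be sketched with a small simulation. It follows the setup stated on the slide (assumed prior N(0, 0.5²), unit noise, a 0.001 contamination with variance 0.5² + 3²), but uses far fewer draws than the 10^6 on the slide to keep it quick, so its output is a rough illustration and not a reproduction of the table. The function name is illustrative.

```python
import random


def selected_credible_misses(n, contam_weight, contam_sd, prior_sd=0.5, seed=0):
    """Count 95% credible intervals that exclude 0, and how many of those miss.

    Intervals are computed under the assumed N(0, prior_sd^2) prior with unit
    noise variance, whatever distribution actually generated the parameters.
    """
    rng = random.Random(seed)
    shrink = prior_sd ** 2 / (prior_sd ** 2 + 1.0)  # posterior mean = shrink * y
    half_width = 1.96 * shrink ** 0.5               # posterior sd = sqrt(shrink)
    selected = missed = 0
    for _ in range(n):
        mu = rng.gauss(0.0, prior_sd)
        if rng.random() < contam_weight:
            # Adding an independent N(0, contam_sd^2) term gives variance
            # prior_sd^2 + contam_sd^2 for the contaminating component.
            mu += rng.gauss(0.0, contam_sd)
        y = rng.gauss(mu, 1.0)
        center = shrink * y
        if abs(center) > half_width:          # interval excludes 0: selected
            selected += 1
            if abs(mu - center) > half_width:  # interval misses its parameter
                missed += 1
    return selected, missed


N_DRAWS = 300_000
for label, weight in (("prior correct", 0.0), ("0.1% contamination", 0.001)):
    sel, miss = selected_credible_misses(N_DRAWS, weight, 3.0, seed=1)
    share = miss / sel if sel else float("nan")
    print(f"{label}: {sel} selected, {share:.0%} of them miss their parameter")
```

The few parameters from the wide component are shrunk heavily toward 0 by the assumed prior, and they make up a large share of what gets selected.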
From Gelman’s Blog August ‘22
Erik van Zwet offers a solution to the above example:
“We can mix the N(0,0.5) with almost any wider normal
distribution with almost any probability and then very large
effects will hardly be shrunken.”
He demonstrates it by the prior 0.99*N(0, 0.5²) + 0.01*N(0, 6²):
of 741 credible intervals not covering 0, the proportion not
covering the parameter is 0.07 (CI: 0.05 to 0.09)
Gelman’s response: I continue to think that Bayesian inference
completely solves the multiple comparisons problem.
So, I generated data from this prior, to which I added 5% N(4, 0.5²),
and got 29% of the so selected not covering their parameter
(with an order of magnitude power loss relative to frequentist
FDR-adjusted testing and FCR CIs)
My point: Bayesians should worry about selective inference if
they care about replicability,
And cannot hide behind the theoretical guarantees.
Some Bayesians do that
Connections with FDR in large inferential problems
Genovese & Wasserman, ’02 Storey et al ’03…
Fdr and fdr variations on FDR in empirical Bayes framework
Efron et al ’13 …
Purely Bayes model where selection should be addressed
Yekutieli et al ’13
Thresholding of posterior odds using BH
Take-away messages
Replicability can be enhanced mainly by addressing
Selective inference
• Evident selective inference is as harmful as non-evident
• Needed in exploratory research even when the pool is small
• Needed for CIs
• ASA attitude against p-values and statistical significance is
political and harms replicability
The relevant variability
• Prefer random effect (mixed model) analysis
• Many small studies are better than one/few large ones
Thanks to JWT for the insight
Take-away messages
Most Bayesians ignore selective inference
But appreciate addressing the relevant variability
For frequentists it’s the opposite
In both research communities practitioners try to avoid
addressing them,
Until the research complexity is so large that selective
inference is addressed
Until results are so non-replicable that the relevant
variability is addressed
This usually takes too long
Thanks to JWT for the insight
Thanks!
www.replicability.tau.ac.il
The industrialization of the scientific process
[Figure: panels labeled 1888, 1950, 1999, 2010]
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 

Recently uploaded (20)

Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 

The two statistical cornerstones of replicability: addressing selective inference and irrelevant variability

  • 1. The Statistics Wars and Their Casualties September 2022 The two statistical pillars of replicability: Addressing selective inference and irrelevant variability Yoav Benjamini Statistics & O.R., School of Mathematical Sciences The Sagol School of Neuroscience Tel Aviv University
  • 2. Outline • Tukey’s two pillars of replicability Addressing Selective Inference Relevant Variability • Some recent evidence • The relation of StatWars to replicability issues • Frequentist & Bayesian responses to the pillars • NEJM guidelines: a casualty of the wars
  • 3. Tukey’s last published work A puzzling encyclopedic entry on Multiple Comparisons. Opens: “a diversity of issues … that tend to be important, difficult, and often unresolved.” Details: • The FDR approach in pairwise comparisons2 • The Random Effects vs Fixed Effects analysis 3 What’s that to do with multiple comparisons? 1 Jones et al (‘22) International Encyclopedia of Statistics in the Social Sciences 2 Williams Jones & Tukey ( ’99) 3 Cornfield & Tukey (’56)
  • 4. Genes & Behaviour: Crabbe et al. (Science, ‘99) Compared 12 measures across strains at 3 labs In spite of strict standardization, Significant Lab*Genotype Interaction “Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory.” We thought that using our computational tools would solve the problem Comparing 17 measures between 8 inbred strains of mice At 3 labs: Golani at TAU, Elmer MPRC, Kafkafi NIDA1 MCP 2002, 1Kafkafi et al PNAS 2004
  • 5. YB Berkeley ʼ10 Significance of 8 Strain differences
      Behavioral Endpoint               Labs Fixed   Labs Mixed
      Prop. Lingering Time              0.00001      0.0029
      # Progression segments            0.00001      0.0068
      Median Turn Radius (scaled)       0.00001      0.0092
      Time away from wall               0.00001      0.0108
      Distance traveled                 0.00001      0.0144
      Acceleration                      0.00001      0.0146
      # Excursions                      0.00001      0.0178
      Time to half max speed            0.00001      0.0204
      Max speed wall segments           0.00001      0.0257
      Median Turn rate                  0.00001      0.0320
      Spatial spread                    0.00001      0.0388
      Lingering mean speed              0.00001      0.0588
      Homebase occupancy                0.001        0.0712
      # stops per excursion             0.0028       0.1202
      Stop diversity                    0.027        0.1489
      Length of progression segments    0.44         0.5150
      Activity decrease                 0.67         0.8875
    Legend: Strain x Lab Interaction significant; Strain x Lab Interaction not significant; FDR ≤ .05
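The "FDR ≤ .05" mark on slide 5 can be illustrated with a minimal sketch of the Benjamini-Hochberg step-up procedure applied to the two p-value columns. This is a sketch, not a reconstruction of the slide: it assumes plain BH at q = 0.05 was used, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans, True where BH rejects at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank
    # Reject the k smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# P-values as listed on slide 5 (17 behavioral endpoints).
labs_fixed = [0.00001] * 12 + [0.001, 0.0028, 0.027, 0.44, 0.67]
labs_mixed = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178,
              0.0204, 0.0257, 0.0320, 0.0388, 0.0588, 0.0712, 0.1202,
              0.1489, 0.5150, 0.8875]

n_fixed = sum(benjamini_hochberg(labs_fixed))
n_mixed = sum(benjamini_hochberg(labs_mixed))
print("rejections, labs fixed:", n_fixed)
print("rejections, labs mixed:", n_mixed)
```

Treating labs as random yields fewer rejections than treating them as fixed, which is the contrast the slide draws.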
  • 6. Recalling Mann’s warning in “Behavior Genetics in transition”1 (Science, 94) “…jumping too soon to discoveries..” (and press discoveries) “raises the issue of Replicability” Tukey’s entry is about replicability of discoveries1 Addressing : Selective Inference The relevant variability 1 Kafkafi et al (PNAS ’04) Intensified due to the industrialization of the scientific process
  • 7. Testing the approach1 • Took single-lab experimental results involving comparisons between mouse strains from Mouse Phenotyping Database • Carried out similar experiments in 3 labs: JAX, TAUL, and TAUM, without standardization • Used Random Lab Mixed Model Analysis to assess replicability of the original result • Estimated $\gamma^2 = \sigma^2_{G \times L} / \sigma^2_{\text{within}}$ for each endpoint from the Database or from our experiments, and used it to adjust the single-lab results (by inflating the sd) 60% of single-lab rejected results were non-replicable 12% of adjusted single-lab rejected results were non-replicable Jaljuli, Kafkafi et al ’22+ BioRxiv
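The adjustment "by inflating sd" on slide 7 can be sketched as follows. The slide does not give the formula, so this assumes one common form in which the ratio of strain-by-lab interaction variance to within-lab variance (`gamma2` below) enters the standard error of a two-strain difference as an extra term `2 * gamma2`. The function name and all numbers are hypothetical.

```python
import math


def gxl_adjusted_t(mean1, mean2, s_within, n1, n2, gamma2):
    """t-like statistic for a strain difference with an inflated standard error.

    gamma2 is the assumed ratio of interaction variance to within-lab
    variance; gamma2 = 0 gives the ordinary pooled two-sample statistic.
    """
    se = s_within * math.sqrt(1.0 / n1 + 1.0 / n2 + 2.0 * gamma2)
    return (mean1 - mean2) / se


# Hypothetical single-lab summary: two strains, 10 animals each.
plain = gxl_adjusted_t(10.0, 8.0, 2.0, 10, 10, gamma2=0.0)
adjusted = gxl_adjusted_t(10.0, 8.0, 2.0, 10, 10, gamma2=0.25)
print("unadjusted statistic:", round(plain, 3))
print("adjusted statistic:  ", round(adjusted, 3))
```

The adjusted statistic is smaller in magnitude, so a single-lab result has to be stronger before it is declared a discovery. A full analysis would also adjust the degrees of freedom, which this sketch omits.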
  • 8. Reading the ‘56 paper again EXPERIMENT Fixed Scientific Knowledge Island reached by statistical Inference
  • 9. Reading the ‘56 paper again EXPERIMENT Mixed Scientific Knowledge Island reached by statistical Inference
  • 10. Selective inference Inference on a selected subset of the parameters that turned out to be of interest after viewing the data! Relevant to all statistical methods – hurting replicability Out-of-study selection - not evident in the published work File drawer problem / publication bias The garden of forking paths, p-hacking, cherry picking, significance chasing, HARKing, data dredging. All are widely discussed and addressed, e.g. by Transparency & Reproducibility standards
  • 11. Selective inference In-study selection - evident in the published work: Selection by the Abstract, Discussion, Table, Figure Selection by highlighting those passing a threshold: p<.05, p<.005, p<5*10^-8, *, **, 2-fold Selection by modeling: AIC, Cp, BIC, LASSO,… In complex research problems - in-study selection is unavoidable!
  • 12. Selective inference hampers replicability • Giovannucci et al. (1995) look for relationships between more than a hundred types of food intakes and the risk of prostate cancer • The abstract reports three (marginal) 95% confidence intervals (CIs), apparently only for those relative risks whose CIs do not cover 1. “Eat Ketchup and Pizza and avoid Prostate Cancer”
  • 13. “Although the pooled RR for raw tomato consumption was initially significant in 1995, this association has remained nonsignificant since 2000 after the addition of 7 studies…” Meta-analysis by Rowles et al (2017) Any adjustment for selection yields Conf. Intervals covering 1
  • 14. Error-rates for selective inference Secondary endpoints from effect of fish oil consumption on CHD, stroke or death from CVD . Manson et al ‘18 used by NEJM editorial • Simultaneous over all possible selections • Simultaneous over the selected • Conditional on being selected • On the average over the selected • On the average over all
  • 15. Addressing selective inference in psychology Transparency problems 1/100 Reproducibility problems 6/100 Reproducible & selected p ≤ .05 56/88 (=64%) failed. Evident selection Adjusting via hierarchical FDR 22 with padj > .05; and screened Of them 21 non-replicable results 1 replicable discovery lost Failure rate 36/67 (=52%) Power loss ~1/31 100 replication efforts, 64/100 failed
  • 16. Addressing selective inference in psychology Reducing level to p ≤ 0.005 Benjamin et al + 200 signatures 32 with p > .005; of them 21 non-replicable results 11 replicable discoveries Failure rate 25/47 (=54%) Power loss =1/3 Zeevi, et al, (‘21+)
  • 17. The statistical wars A. Identifying problems of non-evident selective inference with the use of p-values and statistical testing Ending with the ‘New Statistics’ and bans on p-values & NHST
  • 18. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values1 “… some statisticians prefer …to even replace p-values with other approaches” e.g. Bayes factors, confidence intervals, credence intervals 1 Wasserstein & Lazar (Am. Stat. ‘16)
  • 19. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values1 C. The 2019 ASA conference and Editorial2 Don’t use p<.05; Don’t say “statistically significant” 1 Wasserstein & Lazar (Am. Stat. ‘16) 2 Wasserstein, Schirm & Lazar (Am. Stat. ’19)
  • 20. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values C. The 2019 ASA conference and Editorial D. The ASA president’s task force statement3 3 YB et al AOS ‘21
  • 21. E. The disclaimer By 2022, MacNaughton documented 41 explicit references to the 2019 editorial as official ASA policy. Past-president Kafadar’s letter to the ASA board required: • Either the board approves the editorial as policy; or • A disclaimer is added The editorial was written by the three editors acting as individuals and reflects their scientific views, not an endorsed position of the American Statistical Association. May 2022 in the online version only
  • 22. Scientific Reproducibility and Statistical Significance Symposium, Convened by ASA on June 3: “The goals of the symposium are:” 1. To disseminate the stance of the ASA on the appropriate use of results from null hypothesis significance testing 2. To offer alternatives to such testing 3. To discuss changes to publication policies that would benefit both individual scientists and science writ large. The war goes on
  • 23. • The war between Bayesians and frequentists was fierce in the 1950s. • It receded to co-existence and mutual respect • The replicability crisis was used by zealous Bayesians to restart the war • On the eve of the meeting preparing the 2016 statement I expressed my opinion that the statement should be about statistics and replicability in general - not merely focused on the p-value • Unfortunately, I see no connection between the StatWars and replicability.
  • 24. When P values are reported for multiple outcomes without adjustment for multiplicity, the probability of declaring a treatment difference when none exists can be much higher than 5%. (July ‘19) The Casualties: NEJM guidelines
  • 25. 1. P-values may not be reported (for secondary endpoints) if multiplicity correction method was not specified in the protocol or in the statistical analysis plan 2. Unadjusted (marginal) 95% CIs reported for all secondary endpoints The Casualties: NEJM guidelines
  • 26. Wu et al, citing results of Manson et al NEJM 2018 Nature Reviews Cardiology (2019) Fish oil supplementation … had no significant effect on the composite primary end point of CHD, stroke or death from CVD but reduced the risk of total CHD* (HR 0.83, 95% CI 0.71–0.97), percutaneous coronary intervention (HR 0.78, 95% CI 0.63–0.95), total myocardial infarction* (HR 0.72, 95% CI 0.59–0.90), fatal myocardial infarction (HR 0.50, 95% CI 0.26–0.97). The only 4 out of 17 that excluded 1 and were not exploratory.
  • 27. What’s the problem? • CI as decision tool: not crossing the no-effect value ⇔ significance testing at 0.05 The issue of multiplicity, as recognized by NEJM, does not disappear • CIs coverage Coverage deteriorates: Taking the 17 estimators from above as parameters. Generated random means with SNR as estimated above times k Selected the CIs not crossing above 1; 10,000 simulations Checked the average non-coverage over the so selected For k=1 0.11; For k=0.5 0.18; For k=0.01 0.56
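The coverage check described on slide 27 can be sketched as a small simulation. The 17 signal-to-noise ratios below are hypothetical placeholders (the slide's estimated values are not given in the transcript), and selection here means a nominal 95% CI that excludes the no-effect value on either side, which is an assumption about the slide's rule. The output is therefore not expected to reproduce the slide's numbers.

```python
import random


def noncoverage_among_selected(snr_values, k, n_sim=10_000, z=1.96, seed=0):
    """Share of selected nominal 95% CIs that miss their own parameter."""
    rng = random.Random(seed)
    selected = 0
    missed = 0
    for _ in range(n_sim):
        for snr in snr_values:
            mu = k * snr                  # true effect in standard-error units
            y = rng.gauss(mu, 1.0)        # estimate with unit standard error
            lo, hi = y - z, y + z         # nominal (marginal) 95% CI
            if lo > 0.0 or hi < 0.0:      # selected: CI excludes no effect
                selected += 1
                if not (lo <= mu <= hi):
                    missed += 1
    return missed / selected if selected else float("nan")


# Hypothetical SNRs standing in for the 17 estimated effects.
hypothetical_snr = [-2.4, -2.3, -2.2, -2.0, -1.5, -1.2, -1.0, -0.8, -0.6,
                    -0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.2, 0.5]

rate_k1 = noncoverage_among_selected(hypothetical_snr, k=1.0)
rate_small = noncoverage_among_selected(hypothetical_snr, k=0.01)
print("non-coverage among selected, k=1:   ", round(rate_k1, 3))
print("non-coverage among selected, k=0.01:", round(rate_small, 3))
```

Each interval covers its parameter 95% of the time marginally, yet the share of misses among the selected intervals exceeds 5% and grows as the true effects shrink.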
  • 28. Selective inference by CIs is totally ignored We1 suggest: Select if nominal 95% CI does not cross 0 (=log(1)) Assure False Coverage-Rate control over the so selected via the general BY-CIs: * Replace the 4 nominal CIs that do not cross 1 by 100(1-0.05*4/17)% CIs * For others use nominal CIs 1 YB, Heller, Panagiotou arXiv ‘21 To offer NEJM default guidelines retaining power for exploration
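The replacement step on slide 28 can be sketched for the four hazard ratios quoted on slide 26. The sketch uses the general BY construction, in which each of the R selected intervals out of m is rebuilt at level 1 − qR/m. It assumes the reported CIs are normal-approximation intervals, symmetric on the log scale, so that a standard error can be backed out of the published bounds. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def fcr_adjusted_ci(hr, lo95, hi95, n_selected, n_total, q=0.05):
    """Rebuild a reported 95% CI for a hazard ratio at level 1 - q*R/m."""
    norm = NormalDist()
    # Back out the log-scale standard error from the nominal 95% bounds.
    se = (math.log(hi95) - math.log(lo95)) / (2.0 * norm.inv_cdf(0.975))
    alpha_adj = q * n_selected / n_total
    z_adj = norm.inv_cdf(1.0 - alpha_adj / 2.0)
    center = math.log(hr)
    lower = math.exp(center - z_adj * se)
    upper = math.exp(center + z_adj * se)
    return lower, upper, 1.0 - alpha_adj


# Hazard ratios and nominal 95% CIs as quoted on slide 26.
reported = {
    "total CHD": (0.83, 0.71, 0.97),
    "percutaneous coronary intervention": (0.78, 0.63, 0.95),
    "total myocardial infarction": (0.72, 0.59, 0.90),
    "fatal myocardial infarction": (0.50, 0.26, 0.97),
}

adjusted = {name: fcr_adjusted_ci(hr, lo, hi, n_selected=4, n_total=17)
            for name, (hr, lo, hi) in reported.items()}

for name, (lower, upper, level) in adjusted.items():
    print(f"{name}: {100 * level:.1f}% CI {lower:.2f} to {upper:.2f}")
```

Every adjusted interval is wider than its nominal counterpart; whether it still excludes 1 depends on how far the nominal interval was from 1 to begin with.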
  • 29. An epidemiologist with 13k followers The Casualties: 2
  • 30.
  • 31. From a Lancet reviewer • How can statistical hypotheses and strategies for addressing multiplicity issues be pre-specified in an observational study, which typically requires exploratory analyses to reduce bias? • Can an author claim to have performed a confirmatory test without this pre-specification? Logical Fallacy: Confirmatory analysis => Multiplicity adjustment Therefore: Multiplicity adjustment => Confirmatory analysis
  • 32. Addressing the relevant variability Frequentists Ignored by many but increasing awareness Bayesians Recognized (Hierarchical modelling) Addressing evident selective inference Frequentists Recognized more for p-values less for CIs & exploratory research Bayesian Ignored as a matter of principle The 2 pillars in Frequentist & Bayesian eyes
  • 33. E.g. Gelman, Hill & Yajima, M. (‘12) Why we (usually) don't have to worry about multiple comparisons. The underlying theoretical justification: Since we condition on all the data, Any selection after the data is viewed is already reflected in the posterior distribution.
  • 34. Are Bayesian intervals immune from selection’s harms? Assumed Prior µi~N(0,0.5^2); yi~N(µi,1); i=1,2,…,10^6 (Gelman’s Ex.) Parameters generated by N(0,0.5^2) 0.999*N(0,0.5^2)+0.001*N(0,0.5^2+3^2) Type of 95% confidence/credence intervals Marginal Bayesian Credibility Marginal FCR-adjusted Bayesian Credibility BH-Selected FCR-adjusted Intervals not covering their parameter 5.0% 5.0% 5.1% 2.1% Intervals not covering 0: Selected 7.3% 0.01% 0.03% 0.03% Intervals not covering their parameter: Out of the Selected 48% 3.4% 1.0% 71.5% 2.1% YB HDSR ‘19
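The setting on slide 34 can be sketched as a simulation of credible intervals built under the assumed N(0, 0.5²) prior while the parameters actually come from the slightly contaminated mixture. The posterior under the assumed prior is normal with mean 0.2·y and variance 0.2. The sample size is reduced from the slide's 10^6 for speed, and the function name and seed are hypothetical, so the printed figures are illustrative rather than a reproduction of the table.

```python
import math
import random


def simulate_credible_intervals(n=200_000, eps=0.001, tau=0.5,
                                extra_sd=3.0, seed=1):
    """Non-coverage of 95% credible intervals, overall and among the selected.

    Intervals assume the prior N(0, tau^2); the true parameters come from
    (1 - eps) * N(0, tau^2) + eps * N(0, tau^2 + extra_sd^2).
    """
    rng = random.Random(seed)
    shrink = tau ** 2 / (tau ** 2 + 1.0)   # posterior mean = shrink * y
    half = 1.96 * math.sqrt(shrink)        # posterior variance equals shrink
    wide_sd = math.sqrt(tau ** 2 + extra_sd ** 2)
    n_miss = n_sel = n_miss_sel = 0
    for _ in range(n):
        sd = wide_sd if rng.random() < eps else tau
        mu = rng.gauss(0.0, sd)
        y = rng.gauss(mu, 1.0)
        lo, hi = shrink * y - half, shrink * y + half
        miss = not (lo <= mu <= hi)
        n_miss += miss
        if lo > 0.0 or hi < 0.0:           # selected: interval excludes 0
            n_sel += 1
            n_miss_sel += miss
    among_selected = n_miss_sel / n_sel if n_sel else float("nan")
    return n_miss / n, n_sel, among_selected


overall, n_selected, among_selected = simulate_credible_intervals()
print("overall non-coverage:       ", round(overall, 4))
print("number selected:            ", n_selected)
print("non-coverage among selected:", round(among_selected, 3))
```

Overall non-coverage stays near the nominal 5%, while the intervals that exclude 0 miss their parameters far more often, because the large effects that drive selection are shrunk too strongly toward 0.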
  • 35. From Gelman’s Blog August ‘22 Eric van Zwet offers a solution to the above example: “We can mix the N(0,0.5) with almost any wider normal distribution with almost any probability and then very large effects will hardly be shrunken.” He demonstrates it by the prior 0.99*N(0,0.5^2)+0.01*N(0,6^2) Of 741 credible intervals not covering 0, the proportion not covering the parameter is 0.07 (CI: 0.05 to 0.09) Gelman’s response: I continue to think that Bayesian inference completely solves the multiple comparisons problem. So, I generated data from this prior to which I add 5% N(4.052) And got 29% of the so selected not covering their parameter. (with an order of magnitude power loss relative to frequentist FDR adjusted testing and FCR CIs)
  • 36. My point: Bayesians should worry about selective inference if they care about replicability, And cannot hide behind the theoretical guarantees. Some Bayesians do that Connections with FDR in large inferential problems Genovese & Wasserman, ’02 Storey et al ’03… Fdr and fdr variations on FDR in empirical Bayes framework Efron et al ’13 … Purely Bayes model where selection should be addressed Yekutieli et al ’13 Thresholding of posterior odds using BH
  • 37. Take-away messages Replicability can be enhanced mainly by addressing Selective inference • Evident selective inference is as harmful as non-evident • Needed in exploratory research even when pool is small • Needed for CIs • ASA attitude against p-values and statistical significance is political and harms replicability The relevant variability • Prefer random effect (mixed model) analysis • Many small studies are better than one/few large ones Thanks to JWT for the insight
  • 38. Take-away messages Most Bayesians ignore selective inference But appreciate addressing the relevant variability For frequentists it’s the opposite In both research communities practitioners try to avoid addressing them, Until the research complexity is so large that selective inference is addressed Until results are so non-replicable that the relevant variability is addressed This usually takes too long Thanks to JWT for the insight
  • 39. Thanks! www.replicability.tau.ac.il The industrialization of the scientific process 1888 1999 1950 2010