Statistical "Reforms":
Fixing Science or Threats to
Replication and Falsification
Deborah G Mayo
June 7, 2022
The Du Bois-Wells Symposium on Socially Aware Data Science
and Ethical Issues in Modeling
2
Mounting failures of replication give a new urgency
to critically appraising proposed statistical reforms.
• While many are welcome:
o preregistration
o replication
o avoiding cookbook statistics
• Others are quite radical!
Replication Crisis Paradox
Critic of P-values: It’s much too easy to get
a small P-value
Crisis of replication: It is much too difficult to
replicate small P-values (with prespecified
hypotheses)
Is it easy or is it hard?
3
• R.A. Fisher: it’s easy to lie with statistics by
selective reporting (the “political principle that
anything can be proved by statistics” (1955,
75))
• Sufficient finagling—cherry-picking, data-
dredging, multiple testing, optional stopping—
may practically guarantee a preferred claim H
appears supported, even if it’s unwarranted
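This near-guarantee can be checked with a quick simulation. A minimal sketch (all numbers assumed for illustration, not from the talk): each of many "studies" runs 20 tests of true null hypotheses and selectively reports only its smallest nominal P-value.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: 10,000 studies, each running 20 one-sided z-tests
# of H0: mu <= 0 on n = 30 N(0, 1) observations -- every null is true.
rng = np.random.default_rng(0)
n_studies, n_tests, n = 10_000, 20, 30

z = rng.standard_normal((n_studies, n_tests, n)).mean(axis=2) * np.sqrt(n)
p_min = norm.sf(z).min(axis=1)  # the cherry-picked (smallest) P-value

# Chance that at least one of the 20 null tests reaches p < .05:
# analytically 1 - 0.95**20, about 0.64.
print(round((p_min < 0.05).mean(), 2))
```

Even with every null true, roughly two thirds of the studies can report a "significant" finding, which is the sense in which finagling practically guarantees support for a preferred claim.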
4
• “We knew many researchers - including
ourselves - who readily admitted to [biasing
selection effects]…but they thought it was wrong
the way it's wrong to jaywalk. …simulations
revealed it was wrong the way it's wrong to rob a
bank.” (Simmons, Nelson, and Simonsohn 2018,
255)
• “21 word solution” (2012)
5
This underwrites key features of significance tests:
• to bound the probabilities of misleading
inferences (error probabilities)
• to constrain the human tendency to selectively
favor the views one believes in.
They should not be replaced with methods less
able to control erroneous interpretations of data.
Not if we want socially aware data science.
6
(Simple) Statistical significance
tests
Significance tests (R.A. Fisher) are a small part of a
rich error statistical methodology:
“…to test the conformity of the particular data
under analysis with H0 in some respect….”
…the P-value: the probability of an even
larger value of t_obs arising merely from background
variability or noise (Mayo and Cox 2006, 81)
7
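The definition can be made concrete with a hypothetical one-sided Normal test (all numbers assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

# Assumed numbers: test H0: mu = 0 against mu > 0 with n = 100
# observations and known sigma = 1; observed sample mean 0.2.
n, sigma, xbar = 100, 1.0, 0.2
t_obs = xbar / (sigma / np.sqrt(n))  # observed test statistic, here 2.0

# P-value: probability of a value at least as large as t_obs
# merely from background variability, were H0 true.
p = norm.sf(t_obs)
print(round(p, 4))  # about 0.0228
```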
Testing reasoning, as we see it
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., the P-value is not small),
there’s scarcely evidence of inconsistency with H0
• A small P-value indicates H1: some underlying
discrepancy from H0, because very probably
(1 – P) you would have seen a smaller difference
than t_obs were H0 true.
• Even if the small P-value is valid, it isn’t evidence
of a scientific conclusion H*
Stat-Sub fallacy: H1 ⇒ H*
8
Neyman-Pearson (N-P) put
Fisherian tests on firmer
footing (1933):
Introduces an explicit alternative hypothesis H1:
H0: μ ≤ 0 vs. H1: μ > 0
• Constrains tests by requiring control of both the Type I
error (erroneously rejecting H0) and the Type II error
(erroneously failing to reject H0), along with power
(Neyman also developed confidence interval
estimation at the same time)
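Under assumed numbers (one-sided z-test, n = 25, σ = 1, α = 0.05), the error probabilities N-P require controlling can be computed directly:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: H0: mu <= 0 vs H1: mu > 0, n = 25, sigma = 1.
n, sigma, alpha = 25, 1.0, 0.05
se = sigma / np.sqrt(n)
cutoff = norm.isf(alpha) * se  # reject when xbar exceeds this

def power(mu):
    """P(reject; mu): probability xbar exceeds the cutoff under mu.
    At mu = 0 this is the Type I error; 1 - power(mu) is the Type II error."""
    return norm.sf((cutoff - mu) / se)

print(round(power(0.0), 3))  # equals alpha at the null boundary
print(round(power(0.5), 3))  # high power against mu = 0.5
```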
9
N-P tests as tools for optimal
performance:
Their success in optimal control of error
probabilities gives a new paradigm for
statistics
Yet a major criticism: they are only
“accept/reject” rules that should be “retired” for
anything but quality-control decisions, not
science (e.g., Amrhein et al. 2019)
10
• Our view is that reliable error control is crucial
for warranting specific inferences (not mere
long-run quality control)
• What bothers you about selective reporting,
cherry picking, stopping when the data look
good, P-hacking?
• It is not a problem about long-run performance:
11
12
• We cannot say the test has done its job in the
case at hand in avoiding sources of
misinterpreting data
Basis for the severe testing
philosophy of evidence
• Use error probabilities to assess capabilities
of tools to probe various flaws (“probativism”)
• Data supply good evidence for a claim only if it
has been subjected to and passes a test that
probably would have found it flawed or
specifiably false (if it is).
• Altering this capability changes the evidence (on
our view)
13
On a rival view of evidence…
“Two problems that plague frequentist [error
statistical] inference: multiple comparisons and
multiple looks, or, …data dredging and peeking at the
data. The frequentist solution to both problems
involves adjusting the P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(Meta-Research Innovation Center at Stanford)
14
This view of evidence
(probabilist) is based on the
Likelihood Principle (LP)
All the evidence is contained in the ratio of
likelihoods:
Pr(x0;H0)/Pr(x0;H1)
x0 supports H0 less well than H1 if H0 is less likely
than H1 in this technical sense
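A hypothetical illustration of the LP's comparative measure (numbers assumed), with H1 chosen to fit the observed x0 perfectly:

```python
from scipy.stats import norm

# Assumed data: a single observation x0 = 1.5 from N(mu, 1),
# comparing H0: mu = 0 against H1: mu = 1.5.
x0 = 1.5
lik_H0 = norm.pdf(x0, loc=0.0)
lik_H1 = norm.pdf(x0, loc=1.5)  # H1 fits x0 perfectly, so it is maximally likely

ratio = lik_H0 / lik_H1
print(round(ratio, 3))  # < 1: x0 supports H0 less well than H1
```

Because H1 was selected to match the data, it is maximally likely by construction, which is exactly the feature the next bullets criticize.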
15
Likelihood Principle (LP) vs error
statistics
• Any hypothesis that perfectly fits the data is
maximally likely
• Pr(H0 is less well supported than H1; H0) is high
for some H1 or other
• So even with strong support, the error
probability associated with the inference is high
16
All error probabilities
violate the Likelihood Principle
(LP):
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
17
Many “reforms” offered as
alternatives to significance tests
follow the LP
• “Bayes factors can be used in the complete absence
of a sampling plan…” (Bayarri, Benjamin, Berger,
Sellke 2016, 100)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given…. Data should be able to speak for itself.”
(Berger and Wolpert, The Likelihood Principle, 1988,
78)
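The stopping-rule dependence at issue can be seen in the classic binomial vs. negative binomial comparison (an illustration assumed here, not from the talk): the same 9 heads in 12 tosses yield the same likelihood function up to a constant, but different P-values.

```python
from scipy.stats import binom, nbinom

# Assumed data: 9 heads in 12 tosses, testing H0: p = 0.5 vs p > 0.5.
k, n = 9, 12

# Fixed-n (binomial) sampling plan: P(X >= 9 | n = 12, p = 0.5).
p_binom = binom.sf(k - 1, n, 0.5)

# Stop-at-3rd-tail (negative binomial) sampling plan: same data, same
# likelihood up to a constant, but now the P-value is
# P(at least 9 heads before the 3rd tail | p = 0.5).
p_nbinom = nbinom.sf(k - 1, 3, 0.5)

print(round(p_binom, 4), round(p_nbinom, 4))  # ~0.073 vs ~0.033
```

The LP says the two analyses must agree; error statistics says they differ, because only one plan controls the relevant error probability.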
18
In testing the mean of a
standard normal distribution,
Bayesian sequential (adaptive)
analysts say:
“The [regulatory] requirement of type I error control
for Bayesian adaptive designs causes them to lose
many of their philosophical advantages, such as
compliance with the likelihood principle” (Ryan et al.
2020, radiation oncology)
Our view: Why buy a philosophy that relinquishes
error probability control?
20
This is of relevance to AI/ML
If the field follows critics of explainable AI/ML:
“regulators should place more emphasis on
well-designed clinical trials, at least for some
higher-risk devices, and less on whether the
AI/ML system can be explained”. (Babic et al.
2021)
“Beware Explanations From AI in Health
Care”
Bayesians may (indirectly) block
implausible inferences
• With a low prior degree of belief on H (e.g.,
real effect), the Bayesian can block
inferring H
Concerns:
• Doesn’t show what has gone wrong, namely the
multiplicity
• The believability of post hoc hypotheses is
what makes them so seductive
• Claims can be highly probable (or even
known) while poorly probed.
23
How to obtain
and interpret Bayesian priors?
• Most (?) use nonsubjective or default priors, to
prevent prior beliefs from influencing posteriors
(the data dominate)
• There is no agreement on which of rival systems
to use. (They may not even be probabilities.)
(e.g., maximum entropy, invariance, maximizing the
missing information, coverage matching)
There may be ways to combine
Bayesian and error statistical accounts
(Gelman: Falsificationist Bayesian; Shalizi: error
statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in which
certain of the model’s implications are compared
directly to the data.” (Gelman and Shalizi 2013, 10,
20).
• One can’t also champion “abandoning statistical
significance”, as in a recent “reform”
25
A recent recommended “reform”:
Don’t say ‘significance’, don’t use
P-value thresholds
• In 2019, executive director of the American
Statistical Association (ASA) (and 2 co-authors*)
announce: “declarations of ‘statistical
significance’ be abandoned”
• “It is time to stop using the term ‘statistically
significant’ entirely.”
*Wasserstein, Schirm & Lazar
26
• We agree the actual P-value should be
reported (as all the founders of tests
recommended)
• But the 2019 Editorial says prespecified P-
value thresholds should not be used at all in
interpreting results.
27
• Many who signed on to the “no threshold view”
think that by removing P-value thresholds,
researchers lose the incentive to data-dredge,
multiple-test, and otherwise exploit
researcher flexibility
• D. Hand and I (2022) argue: banning the use of
P-value thresholds in interpreting data does not
diminish but rather exacerbates data-dredging
28
• In a world without predesignated thresholds,
it would be hard to hold data dredgers
accountable for a nominally small P-value
reached through ransacking, data dredging,
trying and trying again
• What distinguishes data dredged P-values
from valid ones is that they fail to meet a
prespecified error probability
“Statistical Significance and its Critics: Practicing damaging
science, or damaging scientific practice?” (Mayo and Hand
2022)
29
No tests, no falsification
• If you cannot say about any results, ahead of
time, that they will not be allowed to count in
favor of a claim C, then you do not have a
test of C
• Why insist on replications if at no point can
you say, the effect has failed to replicate?
• Nor can you test the assumptions of
statistical models and likelihood functions
30
ASA (President’s) Task Force on
Statistical Significance and
Replicability
• In 2019 the ASA President appointed a task force of 14
statisticians, put in the odd position of needing:
“to address concerns that [the ASA executive
director’s “no significance” editorial] might be
mistakenly interpreted as official ASA policy”
“P-values and significance testing, properly applied
and interpreted, are important tools that should not
be abandoned.” (Benjamini et al. 2021)
31
32
The ASA President’s Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
The Task Force also states:
“P-values and significance tests are among the
most studied and best understood statistical
procedures in the statistics literature”.
As an aside, Stephen Stigler asks:
“…[of] which of the exciting new methods of
modern data science and machine learning can the
same be said?” (Stigler 2022)
33
Our view: Reformulate Tests
• Instead of a binary cut-off (significant or not),
the particular outcome is used to infer
discrepancies that are, or are not, well
warranted
• Avoids fallacies of significance and
nonsignificance, and improves on confidence
interval estimation
34
Severity Reformulation
Severity function: SEV(Test T, data x, claim C)
In a nutshell: one tests several discrepancies
from a test hypothesis and infers those well or
poorly warranted
Akin to confidence distributions
35
36
To avoid Fallacies of Rejection
(e.g., magnitude error)
Testing the mean of a Normal distribution: H0: μ ≤ 0
vs. H1: μ > 0, consider
H0: μ ≤ μ1 vs. H1: μ > μ1
for μ1 = μ0 + γ. If you very probably would have
observed a more impressive (smaller) P-value were μ =
μ1, then the data are poor evidence that μ > μ1.
SEV(μ > μ1 ) is low
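A minimal sketch of this severity computation, under assumed numbers (σ = 1, n = 100, observed mean 0.2, just significant at the 0.025 level):

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: one-sided test H0: mu <= 0 vs H1: mu > 0,
# sigma = 1, n = 100, observed sample mean xbar = 0.2.
n, sigma, xbar = 100, 1.0, 0.2
se = sigma / np.sqrt(n)

def sev_greater(mu1):
    """SEV(mu > mu1): probability of a less impressive result
    (Xbar <= xbar) were mu = mu1; high SEV means the claim mu > mu1
    has passed a severe test with these data."""
    return norm.cdf((xbar - mu1) / se)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(mu1, round(sev_greater(mu1), 3))
```

On these assumed numbers, inferring μ > 0 is well warranted (SEV ≈ 0.98) while μ > 0.3 is poorly warranted (SEV ≈ 0.16), even though both are "consistent" with rejecting H0.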
Power vs. Severity for μ > μ1
37
To give an informal construal,
consider how severity avoids the
“large n problem”
• Fixing the P-value while increasing the sample
size n, the cut-off for rejection gets smaller
• You reach a point where the observed mean is
closer to the null than to various alternatives
38
Severity tells us:
• an observed difference just statistically significant at
level α indicates less of a discrepancy from the null if it
results from a larger (n1) rather than a smaller (n2)
sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that doesn’t
go off unless the house is fully ablaze?
• The larger sample size is like the one that goes off with
burnt toast
[assumptions are presumed to hold] 39
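The fire-alarm point can be computed directly: fix a result just significant at α = 0.025 (z = 1.96) and compare what it indicates about μ > 0.1 at two sample sizes (numbers assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

# Assumed: observed mean just at the alpha = 0.025 cutoff (z = 1.96),
# sigma = 1, testing H0: mu <= 0 vs H1: mu > 0.
z_alpha, sigma = 1.96, 1.0

def sev_greater(mu1, n):
    """Severity for mu > mu1 when the result sits exactly at the cutoff."""
    se = sigma / np.sqrt(n)
    xbar = z_alpha * se  # observed mean just significant at this n
    return norm.cdf((xbar - mu1) / se)

print(round(sev_greater(0.1, 100), 3))     # smaller n: mu > 0.1 well indicated
print(round(sev_greater(0.1, 10_000), 3))  # larger n: barely indicated at all
```

The n = 10,000 test is the burnt-toast alarm: it fires at discrepancies far smaller than 0.1, so its firing is poor evidence of a discrepancy that large.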
What About Fallacies of
Non-Significant Results?
• They don’t warrant a zero discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed, setting upper bounds
• If you very probably would have observed a
larger value of the test statistic (a smaller P-value)
were μ = μ1, then the data indicate that μ < μ1
SEV(μ < μ1) is high
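A sketch of the upper-bound reasoning, reusing assumed numbers (σ = 1, n = 100) with a nonsignificant observed mean of 0.1:

```python
import numpy as np
from scipy.stats import norm

# Assumed: nonsignificant result xbar = 0.1 from the one-sided test
# H0: mu <= 0 vs H1: mu > 0, sigma = 1, n = 100.
n, sigma, xbar = 100, 1.0, 0.1
se = sigma / np.sqrt(n)

def sev_less(mu1):
    """SEV(mu < mu1): probability of a larger difference than observed
    (Xbar > xbar) were mu = mu1; high SEV warrants the upper bound mu < mu1."""
    return norm.sf((xbar - mu1) / se)

for mu1 in (0.2, 0.3):
    print(mu1, round(sev_less(mu1), 3))
```

On these assumed numbers the data warrant ruling out discrepancies of 0.2 or more (SEV ≈ 0.84 and 0.98), without fallaciously inferring a zero discrepancy.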
40
Brief overview
• The sources of irreplication are not
mysterious: in many fields, latitude in
collecting and interpreting data makes it too
easy to dredge up impressive looking
findings even when spurious.
• Some of the reforms intended to fix science
enable rather than reveal illicit inferences
due to multiple testing and data-dredging
(they either obey the LP or ban thresholds)
41
• Banning the use of P-value thresholds in
interpreting data does not diminish but rather
exacerbates data-dredging and biasing
selection effects.
• If an account cannot specify outcomes that
will not be allowed to count as evidence for a
claim—if all thresholds are abandoned—then
there is no test of that claim
• We should instead reformulate tests so as to
avoid fallacies and report the extent of
discrepancies that are and are not indicated
with severity.
42
Mayo (1996, 2018); Mayo and Cox (2006):
Frequentist Principle of Evidence (FEV); SEV: Mayo
and Spanos (2006), Mayo and Hand (2022)
FEV/SEV (significant result): A small P-value is
evidence of a discrepancy γ from H0 if and only if there
is a high probability the test would have produced
d(X) < d(x0) were a discrepancy as large as γ absent
FEV/SEV (insignificant result): A moderate P-value is
evidence of the absence of a discrepancy γ from H0
only if there is a high probability the test would
have given a worse fit with H0 (i.e., d(X) > d(x0))
were a discrepancy γ to exist 43
References
• Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Scientists rise up against
statistical significance. Nature, 567, 305–307. https://doi.org/10.1038/d41586-019-00857-9
• Babic, B., Gerke, S., Evgeniou, T., & Cohen, I. G. (2021). Beware explanations from ai in
health care. Science, 373(6552), 284–286. https://doi.org/10.1126/science.abg1834
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A
Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90-
103.
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force
statement on statistical significance and replicability. The Annals of Applied Statistics.
(Online June 20, 2021.)
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-
Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical
Society, Series B (Methodological) 17 (1) (January 1): 69–78.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and
“Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Goodman SN. (1999). “Toward Evidence-based Medical Statistics. 2: The Bayes factor.” Annals of
Internal Medicine 1999; 130:1005 –1013.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of Statistical Inference,
edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of
Chicago Press.
• Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,
Cambridge: Cambridge University Press.
44
References cont.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in J.
Rojo (ed.), The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph
Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
• Mayo, D.G., Hand, D. Statistical significance and its critics: practicing damaging science, or
damaging scientific practice? Synthese 200, 220 (2022). https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical
Hypotheses." Philosophical Transactions of the Royal Society of London Series A 231, 289–337.
Reprinted in Joint Statistical Papers, 140–85.
• Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a
Bayesian adaptive trial design?. BMC Medical Research Methodology, 20(1), 1-9.
• Simmons, J. Nelson, L. and Simonsohn, U. (2012). “A 21 Word Solution.” Dialogue: The
Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4–7.
• Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives
on Psychological Science, 13(2), 255–259. https://doi.org/10.1177/1745691617698146
• Stigler, S. (2022). Comment at the NISS Awards Ceremony & Affiliate Luncheon Program
(Discussion on discussions leading to Task Force’s final report) on August 2, 2021 5 pm ET.
https://www.niss.org/events/niss-awards-ceremony-affiliate-luncheon-program
• Wasserstein, R., & Lazar, N. (2016). The ASA’s statement on p-values: Context, process and
purpose (and supplemental materials). The American Statistician, 70(2), 129–133.
https://doi.org/10.1080/00031305.2016.1154108
• Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial].
The American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913 45

The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 
The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...
 
D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides
 
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and BoundariesT. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
 

Recently uploaded

Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 

Recently uploaded (20)

Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 

Statistical Reforms: Fixing Science or Threats to Replication

variability or noise (Mayo and Cox 2006, 81)
7
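The P-value definition on this slide can be sketched numerically. A minimal illustration (my own, not from the slides), using a one-sided z-test of H0: μ ≤ 0 with σ known; the helper name and sample values are hypothetical:

```python
import math

def p_value_one_sided(x, sigma=1.0):
    """Pr(T >= t_obs; H0): the chance of an even larger statistic under H0."""
    n = len(x)
    t_obs = (sum(x) / n) * math.sqrt(n) / sigma  # z statistic for H0: mu <= 0
    # standard normal survival function via erf
    return 0.5 * (1 - math.erf(t_obs / math.sqrt(2)))

# A sample whose mean sits about 2 standard errors above 0
sample = [0.9, 0.4, 1.1, 0.2, 0.7, 0.5, 0.8, 0.3, 0.6, 1.0]
print(round(p_value_one_sided(sample), 3))  # prints 0.02
```

A sample exactly conforming to H0 (all zeros) gives a P-value of 0.5: larger differences would occur half the time merely from noise.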
Testing reasoning, as we see it
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., the P-value is not small),
there's scarcely evidence of inconsistency with H0
• A small P-value indicates H1: some underlying
discrepancy from H0, because very probably (1 – P)
you would have seen a smaller difference than t_obs
were H0 true
• Even if the small P-value is valid, it isn't evidence
of a scientific conclusion H*
Stat-Sub fallacy: H1 => H*
8
Neyman-Pearson (N-P) put Fisherian tests on firmer
footing (1933): introduces alternative hypotheses
H0, H1
H0: μ ≤ 0 vs. H1: μ > 0
• Constrains tests by requiring control of both Type I
error (erroneously rejecting H0) and Type II error
(erroneously failing to reject H0), and power
(Neyman also developed confidence interval
estimation at the same time)
9
N-P tests: tools for optimal performance
Their success in optimal control of error probabilities
gives a new paradigm for statistics
Yet a major criticism: they are only "accept/reject"
rules that should be "retired" for anything but quality
control decisions, not science (e.g., Amrhein et al. 2019)
10
• Our view is that reliable error control is crucial for
warranting specific inferences (not mere long-run
quality control)
• What bothers you with selective reporting, cherry
picking, stopping when the data look good, P-hacking?
• Not a problem about long-run performance —
11
12
• We cannot say the test has done its job in the case
at hand in avoiding sources of misinterpreting data
Basis for the severe testing philosophy of evidence
• Use error probabilities to assess capabilities of
tools to probe various flaws ("probativism")
• Data supply good evidence for a claim only if it has
been subjected to and passes a test that probably
would have found it flawed or specifiably false (if it is)
• Altering this capability changes the evidence (on
our view)
13
On a rival view of evidence…
"Two problems that plague frequentist [error
statistical] inference: multiple comparisons and
multiple looks, or, …data dredging and peeking at the
data. The frequentist solution to both problems
involves adjusting the P-value… But adjusting the
measure of evidence because of considerations that
have nothing to do with the data defies scientific
sense" (Goodman 1999, 1010)
(Meta-Research Innovation Center at Stanford)
14
This view of evidence (probabilist) is based on the
Likelihood Principle (LP)
All the evidence is contained in the ratio of likelihoods:
Pr(x0; H0)/Pr(x0; H1)
x supports H0 less well than H1 if H0 is less likely
than H1 in this technical sense
15
Likelihood Principle (LP) vs error statistics
• Any hypothesis that perfectly fits the data is
maximally likely
• Pr(H0 is less well supported than H1; H0) is high
for some H1 or other
• So even with strong support, the error probability
associated with the inference is high
16
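A toy illustration of these two LP slides (my own sketch, not the speaker's): after observing x heads in n tosses, the hypothesis rigged to fit the data, H1: p = x/n, is at least as likely as H0: p = 0.5 no matter what x turns out to be, so a maximally likely post hoc alternative is always available:

```python
from math import comb

def likelihood(p, x, n):
    """Binomial likelihood of heads-probability p, given x heads in n tosses."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n = 20
for x in (8, 11, 14):                # whatever the data happen to be...
    h1 = likelihood(x / n, x, n)     # post hoc hypothesis tailored to the data
    h0 = likelihood(0.5, x, n)
    print(x, h1 >= h0)               # ...the tailored H1 fits at least as well
```

Every line prints True, since p = x/n is the maximum likelihood estimate; Pr(such an H1 outperforms H0; H0 true) = 1, which is the high error probability the next slide points to.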
All error probabilities violate the Likelihood Principle
(LP):
"Sampling distributions, significance levels, power, all
depend on something more [than the likelihood
function]–something that is irrelevant in Bayesian
inference–namely the sample space" (Lindley 1971, 436)
17
Many "reforms" offered as alternatives to significance
tests follow the LP
• "Bayes factors can be used in the complete absence
of a sampling plan…" (Bayarri, Benjamin, Berger,
Sellke 2016, 100)
• "It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is not
given….Data should be able to speak for itself."
(Berger and Wolpert, The Likelihood Principle 1988, 78)
18
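The stopping-rule point can be checked by simulation. A hedged sketch (mine, not from the slides): the likelihood ignores the stopping rule, but "peeking" after each observation and stopping at the first nominal P < 0.05 drives the actual Type I error far above 0.05:

```python
import math
import random

random.seed(7)

def two_sided_p(z):
    """Two-sided standard-normal P-value."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeks_to_rejection(max_n=100):
    """Sample from N(0,1) (so H0 is true), testing after every observation."""
    total = 0.0
    for k in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if two_sided_p(total / math.sqrt(k)) < 0.05:
            return True  # stop as soon as the data "look good"
    return False

trials = 2000
rate = sum(peeks_to_rejection() for _ in range(trials)) / trials
print(rate)  # roughly 0.3-0.4: far above the nominal 0.05
```

The likelihood function is the same whether n was fixed at 100 or chosen by this rule; only the error probabilities register the difference, which is why relinquishing them matters.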
In testing the mean of a standard normal distribution
[figure omitted]
19
Bayesian sequential (adaptive) analysts say:
"The [regulatory] requirement of type I error control
for Bayesian adaptive designs causes them to lose
many of their philosophical advantages, such as
compliance with the likelihood principle" (Ryan et al.
2020, radiation oncology)
Our view: Why buy a philosophy that relinquishes
error probability control?
20
This is of relevance to AI/ML
If the field follows critics of explainable AI/ML:
"regulators should place more emphasis on well-
designed clinical trials, at least for some higher-risk
devices, and less on whether the AI/ML system can be
explained" (Babic et al. 2021, "Beware Explanations
From AI in Health Care")
21
Bayesians may (indirectly) block implausible
inferences
• With a low prior degree of belief in H (e.g., a real
effect), the Bayesian can block inferring H
22
Concerns:
• Doesn't show what has gone wrong—it's the
multiplicity
• The believability of post hoc hypotheses is what
makes them so seductive
• Claims can be highly probable (or even known)
while poorly probed
23
How to obtain and interpret Bayesian priors?
• Most (?) use nonsubjective or default priors, to
prevent prior beliefs from influencing posteriors—
data dominant
• There is no agreement on which of rival systems to
use (they may not even be probabilities): e.g.,
maximum entropy, invariance, maximizing the
missing information, coverage matching
24
There may be ways to combine Bayesian and error
statistical accounts (Gelman: falsificationist Bayesian;
Shalizi: error statistician)
"[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as 'error probes'
in Mayo's sense"
"[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call 'pure significance testing', in which
certain of the model's implications are compared
directly to the data." (Gelman and Shalizi 2013, 10, 20)
• Can't also champion "abandoning statistical
significance", as in a recent "reform"
25
A recent recommended "reform": Don't say
'significance', don't use P-value thresholds
• In 2019, the executive director of the American
Statistical Association (ASA) (and 2 co-authors*)
announce: "declarations of 'statistical significance'
be abandoned"
• "It is time to stop using the term 'statistically
significant' entirely."
*Wasserstein, Schirm & Lazar
26
• We agree the actual P-value should be reported (as
all the founders of tests recommended)
• But the 2019 Editorial says prespecified P-value
thresholds should not be used at all in interpreting
results
27
• Many who signed on to the "no threshold view"
think that by removing P-value thresholds,
researchers lose an incentive to data dredge,
multiple test, and otherwise exploit researcher
flexibility
• D. Hand and I (2022) argue: banning the use of
P-value thresholds in interpreting data does not
diminish but rather exacerbates data-dredging
28
• In a world without predesignated thresholds, it
would be hard to hold the data dredgers accountable
for reporting a nominally small P-value obtained
through ransacking, data dredging, trying and trying
again
• What distinguishes data-dredged P-values from
valid ones is that they fail to meet a prespecified
error probability
"Statistical Significance and its Critics: Practicing
damaging science, or damaging scientific practice?"
(Mayo and Hand 2022)
29
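This point about prespecified error probabilities is easy to simulate. A sketch (my own, with made-up numbers): a dredger who runs 20 independent tests on pure noise and reports only the smallest P-value finds a nominal P < 0.05 far more often than 5% of the time:

```python
import math
import random

random.seed(3)

def one_sample_p(n=30):
    """Two-sided z-test P-value for a fresh N(0,1) sample (H0 true)."""
    mean = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    z = mean * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

trials, n_tests = 2000, 20
hits = sum(min(one_sample_p() for _ in range(n_tests)) < 0.05
           for _ in range(trials))
print(hits / trials)  # near 1 - 0.95**20, about 0.64 — not 0.05
```

The reported P-value of 0.05 is nominally valid for each test taken alone; it is the prespecified family-wide error probability that the selective report fails to meet.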
No tests, no falsification
• If you cannot say about any results, ahead of time,
that they will not be allowed to count in favor of a
claim C, then you do not have a test of C
• Why insist on replications if at no point can you say
the effect has failed to replicate?
• Nor can you test the assumptions of statistical
models and likelihood functions
30
ASA (President's) Task Force on Statistical
Significance and Replicability
• In 2019, the ASA President appointed a task force of
14 statisticians put in the odd position of needing:
"to address concerns that [the ASA executive
director's 'no significance' editorial] might be
mistakenly interpreted as official ASA policy"
"P-values and significance testing, properly applied
and interpreted, are important tools that should not
be abandoned." (Benjamini et al. 2021)
31
32
The ASA President's Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
The Task Force also states:
"P-values and significance tests are among the most
studied and best understood statistical procedures in
the statistics literature".
As an aside to this, Stephen Stigler asks: "…of which
of the exciting new methods of modern data science
and machine learning can the same be said?"
33
Our view: Reformulate tests
• Instead of a binary cut-off (significant or not), the
particular outcome is used to infer discrepancies
that are or are not well-warranted
• Avoids fallacies of significance and nonsignificance,
and improves on confidence interval estimation
34
Severity reformulation
Severity function: SEV(Test T, data x, claim C)
In a nutshell: one tests several discrepancies from a
test hypothesis and infers those well or poorly
warranted
Akin to confidence distributions
35
36
To avoid fallacies of rejection (e.g., magnitude error)
Testing the mean of a Normal distribution:
H0: μ ≤ 0 vs. H1: μ > 0, consider
H0: μ ≤ μ1 vs. H1: μ > μ1, for μ1 = μ0 + γ
If you very probably would have observed a more
impressive (smaller) P-value if μ = μ1, then the data
are poor evidence that μ > μ1:
SEV(μ > μ1) is low
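The reasoning on this slide can be made numerical. A sketch of the severity calculation for this one-sided Normal test (my own illustration of SEV, with hypothetical numbers): SEV(μ > μ1) is the probability of a less impressive result than the one observed, were μ1 the true mean.

```python
import math

def sev_greater(xbar, mu1, sigma=1.0, n=100):
    """SEV(mu > mu1): Pr(Xbar <= observed xbar; mu = mu1) for the Normal test."""
    z = (xbar - mu1) * math.sqrt(n) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# H0: mu <= 0, sigma = 1, n = 100; observed xbar = 0.2 (z = 2, P ≈ 0.02)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(mu1, round(sev_greater(0.2, mu1), 3))
```

This prints 0.977, 0.841, 0.5, and 0.159 for the four discrepancies: μ > 0.1 is fairly well warranted, while inferring the larger discrepancy μ > 0.3 would be poorly warranted (low severity) despite the small P-value.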
37
Power vs Severity for μ > μ1
[figure omitted]
To give an informal construal, consider how severity
avoids the "large n problem"
• Fixing the P-value, increasing sample size n, the
cut-off gets smaller
• We get to a point where the observed mean is closer
to the null than to various alternatives
38
Severity tells us:
• an observed difference just statistically significant
at level α indicates less of a discrepancy from the
null if it results from a larger (n1) rather than a
smaller (n2) sample size (n1 > n2)
• What's more indicative of a large effect (fire): a
fire alarm that goes off with burnt toast, or one that
doesn't go off unless the house is fully ablaze?
• The larger sample size is like the one that goes off
with burnt toast [assumptions are presumed to hold]
39
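The burnt-toast asymmetry can be made concrete (hypothetical numbers, my own sketch, reusing the SEV calculation for the one-sided Normal test): hold the result just significant at z = 2 and vary n; the same nominal result warrants μ > 0.1 at n = 100 but not at n = 10,000.

```python
import math

def sev_greater(xbar, mu1, sigma=1.0, n=100):
    """SEV(mu > mu1): Pr(Xbar <= observed xbar; mu = mu1) for the Normal test."""
    z = (xbar - mu1) * math.sqrt(n) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for n in (100, 10_000):
    xbar = 2 / math.sqrt(n)  # observed mean sitting exactly at z = 2
    print(n, round(sev_greater(xbar, 0.1, n=n), 3))  # 0.841, then 0.0
```

At n = 10,000 the test would almost surely have given an even more impressive result were the discrepancy really 0.1, so the just-significant alarm there is the burnt-toast alarm: it indicates very little.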
What about fallacies of non-significant results?
• They don't warrant a 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed — set upper bounds
• If you very probably would have observed a larger
value of the test statistic (smaller P-value) were
μ = μ1, then the data indicate that μ < μ1:
SEV(μ < μ1) is high
40
Brief overview
• The sources of irreplication are not mysterious: in
many fields, latitude in collecting and interpreting
data makes it too easy to dredge up impressive-
looking findings even when spurious
• Some of the reforms intended to fix science enable
rather than reveal illicit inferences due to multiple
testing and data-dredging (either they obey the LP
or block thresholds)
41
• Banning the use of P-value thresholds in
interpreting data does not diminish but rather
exacerbates data-dredging and biasing selection
effects
• If an account cannot specify outcomes that will not
be allowed to count as evidence for a claim—if all
thresholds are abandoned—then there is no test of
that claim
• We should instead reformulate tests so as to avoid
fallacies and report the extent of discrepancies that
are and are not indicated with severity
42
Mayo (1996, 2018); Mayo and Cox (2006): Frequentist
Principle of Evidence (FEV); SEV: Mayo and Spanos
(2006), Mayo and Hand (2022)
FEV/SEV, significant result: A small P-value is
evidence of discrepancy γ from H0 if and only if there
is a high probability the test would have d(X) < d(x0)
were a discrepancy as large as γ absent
FEV/SEV, insignificant result: A moderate P-value is
evidence of the absence of a discrepancy γ from H0
only if there is a high probability the test would have
given a worse fit with H0 (i.e., d(X) > d(x0)) were a
discrepancy γ to exist
43
References
• Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Scientists rise up against statistical significance. Nature, 567, 305–307. https://doi.org/10.1038/d41586-019-00857-9
• Babic, B., Gerke, S., Evgeniou, T., & Cohen, I. G. (2021). Beware explanations from AI in health care. Science, 373(6552), 284–286. https://doi.org/10.1126/science.abg1834
• Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72, 90–103.
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President's task force statement on statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)
• Berger, J. O., & Wolpert, R. (1988). The Likelihood Principle (2nd ed.), Vol. 6, Lecture Notes–Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B (Methodological), 17(1), 69–78.
• Gelman, A., & Shalizi, C. (2013). Philosophy and the practice of Bayesian statistics, and Rejoinder. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38; 76–80.
• Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130, 1005–1013.
• Lindley, D. V. (1971). The estimation of many parameters. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of Statistical Inference (pp. 435–455). Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
• Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
44
References cont.
• Mayo, D. G., & Cox, D. R. (2006). Frequentist statistics as a theory of inductive inference. In J. Rojo (Ed.), The Second Erich L. Lehmann Symposium: Optimality, Lecture Notes–Monograph Series, Vol. 49 (pp. 247–275). Institute of Mathematical Statistics.
• Mayo, D. G., & Hand, D. (2022). Statistical significance and its critics: Practicing damaging science, or damaging scientific practice? Synthese, 200, 220. https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. British Journal for the Philosophy of Science, 57(2), 323–357.
• Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337. Reprinted in Joint Statistical Papers, 140–185.
• Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1), 1–9.
• Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4–7.
• Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives on Psychological Science, 13(2), 255–259. https://doi.org/10.1177/1745691617698146
• Stigler, S. (2022). Comment at the NISS Awards Ceremony & Affiliate Luncheon Program (discussion leading to the Task Force's final report), August 2, 2021. https://www.niss.org/events/niss-awards-ceremony-affiliate-luncheon-program
• Wasserstein, R., & Lazar, N. (2016). The ASA's statement on p-values: Context, process and purpose (and supplemental materials). The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond "p < 0.05" [Editorial]. The American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913
45