Statistical Inference as Severe
Testing: Beyond Performance
and Probabilism
Deborah G Mayo
Dept of Philosophy, Virginia Tech
Seminar in Advanced Research Methods
Dept of Psychology, Princeton University
November 14, 2023
Philosophical controversies in
statistics
Both ancient and up to the minute:
• How do humans learn about the world despite
threats of error due to incomplete and variable data?
Role of Probability: performance or
probabilism?
• Despite “unifications,” long-standing battles
simmer below the surface in today’s
“statistical crisis in science”
• What’s behind it?
Minimal principle of evidence
• We set sail with a minimal principle:
• We don’t have evidence for a claim C if little
if anything has been done that would have
found C flawed, even if it is flawed
Statistical inference as severe testing
• Probability is used to assess error-probing
capabilities
• Excavation tool for appraising reforms
Replication crisis leads to “reforms”
Several are welcome:
• preregistration of protocol, replication
checks, avoid cookbook statistics
Others are radical:
• and even lead to violating our minimal
requirement for evidence
• How to subject today’s “reforms” to your
own severe, critical examination?
Most often used tools are most
criticized
“Several methodologists have pointed out
that the high rate of nonreplication of
research discoveries is a consequence of
the convenient, yet ill-founded strategy of
claiming conclusive research findings solely
on the basis of a single study assessed by
formal statistical significance, typically for a
p-value less than 0.05. …” (Ioannidis 2005,
696)
R.A. Fisher
“[W]e need, not an isolated record, but a
reliable method of procedure. In relation to
the test of significance, we may say that a
phenomenon is experimentally
demonstrable when we know how to
conduct an experiment which will rarely fail
to give us a statistically significant result.”
(Fisher 1947, 14)
Simple significance tests (Fisher)
“to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the
test statistic, such that
• the larger the value of T the more
inconsistent are the data with H0;
p = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
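The Mayo–Cox schema above can be sketched numerically. A minimal illustration, assuming the familiar case of testing the mean of a normal distribution with known σ (the function names are mine, not from the text):

```python
from math import erf, sqrt

def normal_sf(z):
    # survival function of the standard normal: Pr(Z >= z)
    return 0.5 * (1 - erf(z / sqrt(2)))

def p_value(xbar, n, sigma=1.0):
    # Test H0: mu = 0; T = sqrt(n) * xbar / sigma is the test statistic,
    # and the larger T, the more inconsistent the data are with H0.
    t_obs = sqrt(n) * xbar / sigma
    return normal_sf(t_obs)

# e.g., xbar = 0.4 with n = 25 gives t_obs = 2.0 and p ≈ 0.023
```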
Testing Reasoning
• Small P-value indicates some underlying
discrepancy from H0 because very
probably you would have seen a less
impressive difference than t_obs were
H0 true.
• This still isn’t evidence of a genuine
statistical effect H1, let alone a scientific
conclusion H*
Fallacy of rejection
• H* makes claims that haven’t been
probed by the statistical test
• The moves from experimental
interventions to H* don’t get enough
attention, but statistical accounts
should block them
Neyman and Pearson tests (1933) put
Fisherian tests on firmer ground:
Introduce alternative hypotheses H0, H1
H0: μ = 0 vs. H1: μ > 0
• Trade-off between Type I errors and Type II errors
• Restricts the inference to statistical alternatives (in a model)
Fisher-Neyman (pathological) battles
(after 1935)
• The success of N-P optimal error control
led to a new paradigm in statistics,
overshadowing Fisher.
Contemporary casualties of Fisher-Neyman
(N-P) battles
• N-P & Fisher tests claimed to be an “inconsistent
hybrid” (Gigerenzer 2004, 590):
• Fisherians can’t use power; N-P testers can’t report
P-values, only fixed error probabilities (e.g., P < .05)
• In fact, Fisher & N-P recommended both pre-data
error probabilities and post-data P-values.
The two are mathematically
essentially identical
• They both fall under tools for “appraising
and bounding the probabilities of seriously
misleading interpretations of data”
(Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian
tests, resampling, randomization
Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking,
significance seeking, multiple testing,
post-data subgroups, trying and trying
again—may practically guarantee an
impressive-looking effect, even if it’s
unwarranted by evidence
• Violates severity
Severity Requirement
• We have evidence for a claim C only to
the extent C has been subjected to and
passes a test that would probably have
found C flawed, just if it is.
• This probability is the stringency or
severity with which it has passed the
test.
Requires a third role for probability
Probabilism. To assign a degree of confirmation,
support or belief in a hypothesis, given data x0 (absolute
or comparative)
(e.g., Bayesian, likelihoodist, Fisher at times)
Performance. Ensure long-run reliability of methods,
coverage probabilities (frequentist, behavioristic
Neyman-Pearson, Fisher)
Only probabilism is thought to be inferential or evidential
What happened to using probability to
assess error-probing capacity?
• Neither “probabilism” nor “performance” directly
captures error probing capacity
• Good long-run performance is a necessary, not a
sufficient, condition for severity
Key to solving a central problem
for frequentists
• Why is good performance relevant for
inference in the case at hand?
• What bothers you with selective
reporting, cherry picking, stopping when
the data look good, P-hacking?
• These are not problems about long runs
We cannot say the case at hand has
done a good job of avoiding the
sources of misinterpreting data
A claim C is not warranted _______
• Probabilism: unless C is true or
probable (gets a probability boost, made
comparatively firmer)
• Performance: unless it stems from a
method with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done
to probe ways we can be wrong about C
Fishing for significance alters
probative capacity
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at the
5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent!* (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!) [*Pr(no success) = (.95)^20]
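Selvin’s 64 percent can be checked directly; a one-line sketch (the variable names are illustrative):

```python
# Chance that at least one of 20 independent tests of true nulls
# reaches the 0.05 level: 1 - Pr(no test is significant).
alpha, k = 0.05, 20
actual_level = 1 - (1 - alpha) ** k   # ≈ 0.64, not 0.05
```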
• Frequentist (error statisticians) need to
adjust P-values to avoid being “fooled
by randomness”
• In a classic example in the volume,
adjustments for multiplicity led to
falsifying claimed infant training benefits
on personality (from the 1940s)
• Only 18 of 460 tests were found
statistically significant (11 in the
direction expected by Freudian theory
of the day)
Today’s Meta-research is not free of
philosophy of statistics
From the Bayesian perspective:
“adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman
1999, 1010)
(Co-director of the Meta-Research Innovation Center
at Stanford)
Likelihood Principle (LP)
A pivotal disagreement about statistical
evidence
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses
vary
Logic of Support
• Ian Hacking (1965) “Law of Likelihood”: data x
support hypothesis H0 less well than H1 if
Pr(x;H0) < Pr(x;H1)
(a view he rejected in 1980)
• “there always is such a rival hypothesis
viz., that things just had to turn out the
way they actually did” (Barnard 1972,
129).
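The Law of Likelihood, and Barnard’s objection, can be illustrated with a toy binomial example (my own numbers, not from the slides):

```python
from math import comb

def binom_lik(p, x=6, n=10):
    # likelihood of p after observing x successes in n trials
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# 6 heads in 10 tosses: H0: p = 0.5 vs H1: p = 0.6
lik0 = binom_lik(0.5)   # ≈ 0.205
lik1 = binom_lik(0.6)   # ≈ 0.251: the data "support" H1 over H0
# Barnard's point: a rival saying things just had to turn out this way
# assigns the data probability 1, so it wins any such comparison.
```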
Error Probability
• Pr(H0 is less well supported than H1; H0)
is high
for some H1 or other
On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance
levels, power, all depend on something
more [than the likelihood function]–
something that is irrelevant in Bayesian
inference–namely the sample space”
(Lindley 1971, 436)
Familiar “reforms” offered as alternative
to significance tests follow the LP
• “The Bayes factor only depends on the actually
observed data, and not on whether they have been
collected from an experiment with fixed or variable
sample size.”
(van Dongen, Sprenger, and Wagenmakers 2022)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for
itself”.
(Berger and Wolpert 1988, 78)
In testing the mean of a standard
normal distribution
Optional Stopping
• “if an experimenter uses this [optional
stopping] procedure, then with probability
1 he will eventually reject any sharp null
hypothesis, even though it be true”.
(Edwards, Lindman, and Savage in
Psychological Review 1963, 239)
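Edwards, Lindman, and Savage’s point is easy to verify by simulation; a sketch under simple assumptions (standard normal data, true H0: μ = 0, a z-test after every new observation):

```python
import random

def rejects_under_optional_stopping(max_n=500, z_crit=1.96, rng=random):
    # Keep sampling under a true null, peeking after each observation,
    # and stop at the first nominally 0.05-significant z-score.
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / n ** 0.5 >= z_crit:
            return True
    return False

random.seed(0)
trials = 1000
rate = sum(rejects_under_optional_stopping() for _ in range(trials)) / trials
# rate is already far above the nominal 0.05 with max_n = 500,
# and it tends to 1 as max_n grows without bound.
```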
The Stopping Rule Principle
From their Bayesian standpoint the stopping
rule is irrelevant
• “[the] irrelevance of stopping rules to
statistical inference restores a simplicity
and freedom to experimental design that
had been lost by classical emphasis on
significance levels (in the sense of Neyman
and Pearson).” (Edwards, Lindman, and
Savage 1963, 239)
Contrast this with: The 21 Word
Solution
• Replication researchers (re)discovered that
data-dependent hypotheses and stopping are a
major source of spurious significance levels.
• Simmons, Nelson, and Simonsohn (2011) place
at the top of their list the need to block flexible
stopping
• “Authors must decide the rule for terminating
data collection before data collection begins and
report this rule in the articles” (ibid. 1362).
You might think replication researchers
disavow the stopping rule principle
“[I]f the sampling plan is ignored, the
researcher is able to always reject the null
hypothesis, even if it is true. … Some people
feel that ‘optional stopping’ amounts to
cheating…. This feeling is, however,
contradicted by a mathematical analysis.”
(Eric-Jan Wagenmakers 2007, 785)
The mathematical analysis assumes the
likelihood principle
Replication Paradox
• Significance test critic: It’s too easy to satisfy
standard statistical significance thresholds
• You: Why is it so hard to replicate significance
thresholds with preregistered protocols?
• Significance test critic: Obviously the initial studies
were guilty of P-hacking, cherry-picking, data-
dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on these biasing selection effects.
• Significance test critic: Actually, “reforms”
recommend methods with no need to adjust P-values
due to multiplicity
But the value of preregistered reports
is error statistical
• Your appraisal is altered by considering the
probability that some hypotheses, stopping
point, subgroups, etc. could have led to a
false positive – even if informal
• True, there are many ways to correct
P-values (Bonferroni, false discovery rates).
• The main thing is to have an alert that the
reported P-values are invalid
Probabilists can still block intuitively
unwarranted inferences
(without error probabilities)
• Supplement with subjective beliefs
• Likelihoods + prior probabilities
Problems
• Could work in some cases, but doesn’t show
what researchers had done wrong—battle of
beliefs
• The believability of data-dredged hypotheses
is what makes them so seductive
• Additional source of flexibility: priors and
biasing selection effects
Most Bayesians (last decade) use
“default” priors
• Default priors are supposed to prevent prior beliefs
from influencing the posteriors – data dominant
How should we interpret them?
• “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Conventional priors may not even be
probabilities…” (Cox and Mayo 2010, 299)
• No agreement on rival systems for
default/non-subjective priors
(maximum entropy, invariance, maximizing the
missing information, coverage matching).
Criticisms of P-hackers lose force
• Promoting an account that downplays error
probabilities hands the data dredger who
deserves criticism a life-raft:
Bem’s “Feeling the future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than
chance at predicting the (erotic) picture
shown in the future
• Bem admits data dredging, but Bayesian
critics resort to a default Bayesian prior on a
point null hypothesis
Wagenmakers et al. 2011 “Why psychologists must
change the way they analyze their data”
Bem’s response
“Whenever the null hypothesis is sharply defined but the
prior distribution on the alternative hypothesis is diffused
over a wide range of values, as it is in...
Wagenmakers et al. (2011), it boosts the probability that
any observed data will be higher under the null
hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: ... strong
[frequentist] evidence in support of the experimental
hypothesis be contradicted by a ... Bayesian analysis.”
(Bem et al. 2011, 717)
*Bayes-Fisher disagreement
Many of Today’s Significance Test
Debates Trace to Bayes/Fisher
Disagreement
• The posterior probability Pr(H0|x) can be
high while the P-value is low (2-sided test)
Bayes/Fisher Disagreement
With a lump of prior given to a point null, and
the rest appropriately spread over the
alternative [spike and smear], an α significant
result can correspond to
Pr(H0|x) = (1 − α)! (e.g., 0.95)
with large n
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
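The spike-and-smear disagreement can be computed directly; a sketch assuming a 0.5 spike on H0 and an N(0, 1) smear over the alternative (my choice of prior scale; the exact posterior depends on it):

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def posterior_null(z=1.96, n=10_000, sigma=1.0, tau=1.0, spike=0.5):
    # Spike-and-smear prior: mass `spike` on H0: mu = 0, the rest
    # spread as N(0, tau^2) over the alternative.
    xbar = z * sigma / sqrt(n)                        # just significant at 0.05
    m0 = normal_pdf(xbar, sigma ** 2 / n)             # marginal likelihood under H0
    m1 = normal_pdf(xbar, tau ** 2 + sigma ** 2 / n)  # marginal under H1
    return spike * m0 / (spike * m0 + (1 - spike) * m1)

# With n = 10,000 a result significant at the 0.05 level yields
# Pr(H0 | x) ≈ 0.94, and the posterior approaches 1 as n grows.
```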
• To the Bayesian, the P-value exaggerates
the evidence against H0
• The significance tester balks at taking low
P-values as no evidence against, or even
evidence for, H0
“Concentrating mass on the point null
hypothesis is biasing the prior in favor
of H0 as much as possible” (Casella
and R. Berger 1987, 111)
“Redefine Statistical Significance”
“Spike and smear” is the basis for the move
to lower the P-value threshold to .005
(Benjamin et al. 2018)
Opposing megateam: Lakens et al. (2018)
• The problem isn’t lowering the probability
of Type I errors
• The problem is assuming there should be
agreement between quantities measuring
different things
Whether P-values exaggerate
“depends on one’s philosophy of
statistics …
…based on directly comparing P values
against certain quantities (likelihood ratios
and Bayes factors) that play a central role as
evidence measures in Bayesian analysis …
other statisticians do not accept these
quantities as gold standards,”… (Greenland,
Senn, Rothman, Carlin, Poole, Goodman,
Altman 2016, 342)
• A silver lining to distinguishing highly
probable and highly probed–can use
different methods for different contexts
“A Bayesian Perspective on Severity” van
Dongen, Sprenger, Wagenmakers [VSW]
(Psychonomic Bulletin and Review 2022):
“As Mayo emphasizes, the Bayes factor is insensitive
to variations in the sampling protocol that affect the
error rates, i.e., optional stopping” ...
They argue that Bayesians can satisfy severity
“regardless of whether the test has been conducted in
a severe or less severe fashion”. (VSW 2022)
What they mean is that data can be much more
probable on hypothesis H1 than on H0.
But severity in their comparative subjective Bayesian
sense does not mean H1 was well probed (in the error
statistical sense)
Bayes Factors (BF)
The Bayes factor proponents and
severe testers are on the same
side against
• “declarations of ‘statistical significance’ be
abandoned” (Wasserstein, Schirm & Lazar
2019).
• “whether a p-value passes any arbitrary
threshold should not be considered at all" in
interpreting data (ibid., 2)
No significance/no threshold view
ASA (President’s) Task Force on
Statistical Significance and
Replicability (2019-2021)
The ASA executive director’s “no threshold”
view is not ASA policy:
“P-values and significance testing, properly
applied and interpreted, are important tools
that should not be abandoned.” (Benjamini et
al. 2021)
Severity: Reformulate tests
in terms of discrepancies (effect sizes) that
are and are not severely tested
SEV(Test T, data x, claim C)
• In a nutshell: one tests several
discrepancies from a test hypothesis and
infers those well or poorly warranted
Mayo 1981-2018; Mayo and Spanos (2006); Mayo and
Cox (2006); Mayo and Hand (2022)
Avoid misinterpreting a 2 SE
significant result (let SE = 1)
Severity vs Power for μ > μ1
In the same way, severity avoids
the “large n” problem
• Fixing the P-value, increasing sample
size n, the cut-off gets smaller
• Large n is the basis for the
Jeffreys-Lindley paradox
Severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast]
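The severity reasoning for an observed 2 SE result can be sketched as follows (normal test statistic, SE = 1, as on the slide; the function is my own rendering of SEV):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_greater(mu1, xbar_obs=2.0, se=1.0):
    # SEV(mu > mu1): probability of a less impressive result than
    # xbar_obs, were the discrepancy only mu1.
    return Phi((xbar_obs - mu1) / se)

# After observing xbar_obs = 2 (SE = 1):
# mu > 0 is severely indicated (SEV ≈ 0.977),
# mu > 1 moderately (SEV ≈ 0.841),
# mu > 2 poorly (SEV = 0.5): that large a discrepancy is not warranted.
```

Replacing SE by σ/√n shows the large-n point: for a fixed P-value, the same significant result warrants a smaller discrepancy as n grows.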
Setting upper bounds
What About Fallacies of
Non-Significant Results?
• They don’t warrant 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed – set upper bounds
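A matching sketch for non-significant results (again the normal case with SE = 1; my own rendering of the upper-bound reasoning):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_upper(mu1, xbar_obs=1.0, se=1.0):
    # SEV(mu <= mu1): probability of a larger difference than the one
    # observed, were the discrepancy as large as mu1.
    return Phi((mu1 - xbar_obs) / se)

# A non-significant xbar_obs = 1 (SE = 1) severely rules out mu > 3
# (SEV(mu <= 3) ≈ 0.977) but not mu > 1.5 (SEV ≈ 0.69).
```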
Confidence intervals (CIs) are also
improved
Duality between tests and intervals: values within the
(1 − α) CI are non-rejectable at the α level.
• Differentiate the warrant for claims within the
interval
• Move away from fixed confidence levels (e.g., .95)
• Provide an inferential rationale
We get an inferential rationale
CI Estimate:
CI-lower < μ < CI-upper
Performance rationale: Because it came from a
procedure with good coverage probability
Severe Tester:
μ > CI-lower, because with high probability (.975) we
would have observed a smaller x̄ if μ ≤ CI-lower;
likewise for warranting μ < CI-upper
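The severe tester’s rationale for the CI bounds can be computed directly (illustrative numbers; normal case with SE = 1):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

xbar, se = 2.0, 1.0
lower, upper = xbar - 1.96 * se, xbar + 1.96 * se   # the 95% CI

# mu > lower: with mu = lower we'd very probably have seen a smaller xbar
sev_lower = Phi((xbar - lower) / se)        # Phi(1.96) ≈ 0.975
# mu < upper: with mu = upper we'd very probably have seen a larger xbar
sev_upper = 1 - Phi((xbar - upper) / se)    # also ≈ 0.975

# Claims within the interval get differentiated warrant, e.g.
# mu > xbar - 1 has severity Phi(1) ≈ 0.84, less than the bound claim.
```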
• I begin with a simple tool: the minimal
requirement for evidence
• We have evidence for C only to the
extent C has been subjected to and
passes a test it probably would have
failed if false
• Biasing selection effects make it easy to
find impressive-looking effects
erroneously
• They alter a method’s error probing
capacities
• They do not alter evidence (in traditional
probabilisms): Likelihood Principle (LP)
• On the LP, error probabilities consider
“imaginary data” and “intentions”
• To the severe tester, probabilists are
robbed of a main way to block spurious
results
• Probabilists may block inferences without
appeal to error probabilities: a high prior on
H0 (no effect) can result in a high posterior
probability for H0
• Problems: Increased flexibility, puts blame
in the wrong place, unclear how to interpret
• Gives a life-raft to the P-hacker and cherry
picker
• The aim of severe testing directs the
reinterpretation of significance tests and other
methods
• Severe probing (formal or informal) must take
place at every level: from data to statistical
hypothesis; to substantive claims
• A silver lining to distinguishing highly probable
and highly probed–can use different methods
for different contexts
Thank you!
I’d be glad to have questions
References
• Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of
Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science
23(2), 123–32.
• Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. (2011). Must psychologists change the way they
analyze their data? Journal of Personality and Social Psychology 101(4), 716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). Redefine statistical
significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-
017-0189-z
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task
force statement on statistical significance and replicability. The Annals of Applied
Statistics. https://doi.org/10.1080/09332480.2021.2003631.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6
Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical
Statistics.
• Birnbaum, A. (1970). Statistical methods in scientific inference (letter to the
Editor). Nature, 225(5237) (March 14), 1033.
• Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence
in the one-sided testing problem. Journal of the American Statistical Association,
82(397), 106-11.
• Cox, D. R. (2006). Principles of statistical inference. Cambridge University Press.
https://doi.org/10.1017/CBO9780511813559
• Cox, D. R., and Mayo, D. G. (2010). Objectivity and conditionality in frequentist
inference. In D. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges
on Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science, pp. 276–304. Cambridge: Cambridge University Press.
• Edwards, W., Lindman, H., and Savage, L. (1963). Bayesian statistical inference for
psychological research. Psychological Review, 70(3), 193-242.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and
Boyd.
• Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–
606.
• Goodman SN. (1999). Toward evidence-based medical statistics. 2: The Bayes
factor. Annals of Internal Medicine, 130, 1005 –1013.
• Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values,
confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 31,
337–350. https://doi.org/10.1007/s10654-016-0149-3
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University
Press.
• Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and
Braithwaite. In D. Mellor (Ed.), Science, Belief and Behavior: Essays in Honour of
R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Ioannidis, J. (2005). Why most published research findings are false. PLoS
Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). Justify your alpha. Nature Human Behavior 2, 168-71.
• Lindley, D. V. (1971). The estimation of many parameters. In V. Godambe & D.
Sprott, (Eds.), Foundations of Statistical Inference pp. 435–455. Toronto: Holt,
Rinehart and Winston.
• Mayo, D. (1981). In Defense of the Neyman-Pearson Theory of Confidence
Intervals. Philosophy of Science, 48(2), 269-280.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science
and Its Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond
the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2020). Significance tests: Vitiated or vindicated by the replication
crisis in psychology? Review of Philosophy and Psychology 12, 101-120.
DOI https://doi.org/10.1007/s13164-020-00501-w
• Mayo, D. G. (2020). P-values on trial: Selective reporting of (best practice guides
against) selective reporting. Harvard Data Science Review 2.1.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest.
Conservation Biology : The Journal of the Society for Conservation Biology, 36(1),
13861. https://doi.org/10.1111/cobi.13861.
• Mayo, D. G. and Cox, D. R. (2006). Frequentist statistics as a theory of inductive
inference. In J. Rojo, (Ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, pp. 247-275. Lecture Notes-Monograph Series, Volume 49, Institute of
Mathematical Statistics.
• Mayo, D.G., Hand, D. (2022). Statistical significance and its critics: practicing
damaging science, or damaging scientific practice? Synthese 200, 220.
• Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences.
In D. Cornfield & J. Williamson (Eds.) Foundations of Bayesianism, pp. 381-403.
Dordrecht: Kluwer Academic Publishers.
• Mayo, D. G., and A. Spanos. (2006). Severe testing as a basic concept in a
Neyman–Pearson philosophy of induction. British Journal for the Philosophy of
Science 57(2) (June 1), 323–357.
• Mayo, D. G., and A. Spanos (2011). Error statistics. In P. Bandyopadhyay and M.
Forster (Eds.), Philosophy of Statistics, 7, pp. 152–198. Handbook of the Philosophy
of Science. The Netherlands: Elsevier.
• Morrison, D. E., and R. E. Henkel, (Eds.), (1970). The Significance Test Controversy:
A Reader. Chicago: Aldine De Gruyter.
• Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of
statistical hypotheses. Philosophical Transactions of the Royal Society of London
Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. & Pearson, E. (1967). Joint statistical papers of J. Neyman and E. S.
Pearson. University of California Press.
• Open Science Collaboration (2015). Estimating the reproducibility of psychological
science. Science 349(6251), 943-51.
• Pearson, E. S. & Neyman, J. (1967). On the problem of two samples. In J. Neyman &
E.S. Pearson (Eds.) Joint Statistical Papers, pp. 99-115 (Berkeley: U. of Calif.
Press). First published 1930 in Bul. Acad. Pol.Sci. 73-96.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London:
Methuen.
• Selvin, H. (1970). A critique of tests of significance in survey research. In D. Morrison
and R. Henkel (Eds.). The Significance Test Controversy, pp., 94-106. Chicago:
Aldine De Gruyter.
• Simmons, J., Nelson, L. and Simonsohn, U. (2011). A false-positive psychology:
Undisclosed flexibility in data collection and analysis allows presenting anything as
significant. Psychological Science, 22(11), 1359-66.
• Simmons, J., et al. 2012. A 21 word solution. Dialogue: The Official Newsletter of the
Society for Personality and Social Psychology 26 (2), 4–7.
• van Dongen, N., Sprenger, J. & Wagenmakers, EJ. (2022). A Bayesian perspective
on severity: Risky predictions and specific hypotheses. Psychon Bull Rev 30, 516–
533. https://doi.org/10.3758/s13423-022-02069-1
• Wagenmakers, E-J., (2007). A practical solution to the pervasive problems of p
values. Psychonomic Bulletin & Review 14(5), 779-804.
• Wagenmakers et al. (2011). Why psychologists must change the way they analyze
their data: The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100(3): 426-32.
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”
(Editorial). The American Statistician 73(S1), 1–19.
https://doi.org/10.1080/00031305.2019.1583913
Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is
the datum of some other experiment,
and if it happens that P(x|µ) and
P(y|µ) are proportional functions of
µ (that is, constant multiples of each
other), then each of the two data x
and y have exactly the same thing to
say about the values of µ…” (Savage
1962, 17)
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severely
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
Final mayo's aps_talk
Final mayo's aps_talkFinal mayo's aps_talk
Final mayo's aps_talk
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
 
Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively
 

More from jemille6

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfjemille6
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfjemille6
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022jemille6
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inferencejemille6
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?jemille6
 
What's the question?
What's the question? What's the question?
What's the question? jemille6
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metasciencejemille6
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...jemille6
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Twojemille6
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...jemille6
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testingjemille6
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredgingjemille6
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probabilityjemille6
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...jemille6
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (jemille6
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...jemille6
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...jemille6
 
The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...jemille6
 
D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides jemille6
 
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and BoundariesT. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundariesjemille6
 

More from jemille6 (20)

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 
The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...
 
D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides
 
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and BoundariesT. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
 

Recently uploaded

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Recently uploaded (20)

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 

Simple significance tests (Fisher)
“to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, the test statistic, such that
• the larger the value of T the more inconsistent are the data with H0;
p = Pr(T ≥ tobs; H0)” (Mayo and Cox 2006, 81)
10
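As a concrete sketch of this definition (not from the slides; hypothetical numbers, a one-sided z-test with known σ, Python standard library only):

```python
from statistics import NormalDist

def z_pvalue(xbar, mu0, sigma, n):
    """One-sided p-value p = Pr(T >= t_obs; H0) for the standardized
    sample mean T = sqrt(n) * (xbar - mu0) / sigma under H0: mu = mu0."""
    t_obs = (xbar - mu0) * n ** 0.5 / sigma
    return 1 - NormalDist().cdf(t_obs)

# Hypothetical data: xbar = 0.4, sigma = 1, n = 25 gives t_obs = 2.0,
# so p = 1 - Phi(2.0), roughly 0.023: data fairly inconsistent with H0.
p = z_pvalue(0.4, 0.0, 1.0, 25)
```

The larger t_obs, the smaller p, matching "the larger the value of T the more inconsistent are the data with H0".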
Testing Reasoning
• Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than tobs were H0 true.
• This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
11
Fallacy of rejection
• H* makes claims that haven’t been probed by the statistical test
• The moves from experimental interventions to H* don’t get enough attention–but statistical accounts should block them
12
Neyman and Pearson tests (1933) put Fisherian tests on firmer ground:
• Introduce alternative hypotheses H0, H1: H0: μ = 0 vs. H1: μ > 0
• Trade-off between Type I errors and Type II errors
• Restricts the inference to statistical alternatives (in a model)
13
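The Type I/Type II trade-off can be made concrete. A sketch (hypothetical numbers) for the one-sided test of H0: μ = 0 vs. H1: μ > 0 with known σ: lowering α raises the Type II error rate, i.e., lowers power.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power(mu1, sigma, n, alpha):
    """Pr(reject H0; mu = mu1) for the test rejecting when
    sqrt(n) * xbar / sigma >= z_alpha (one-sided, sigma known)."""
    z_alpha = Z.inv_cdf(1 - alpha)
    shift = n ** 0.5 * mu1 / sigma   # mean of the test statistic under mu1
    return 1 - Z.cdf(z_alpha - shift)

# Hypothetical: mu1 = 0.5, sigma = 1, n = 25.
p_05 = power(0.5, 1.0, 25, 0.05)   # roughly 0.80
p_01 = power(0.5, 1.0, 25, 0.01)   # stricter alpha, lower power
```

Holding the sample size fixed, any reduction of the Type I error probability is paid for in power against the alternative.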
Fisher-Neyman (pathological) battles (after 1935)
• The success of N-P optimal error control led to a new paradigm in statistics, overshadowing Fisher.
14
Contemporary casualties of Fisher-Neyman (N-P) battles
• N-P & Fisher tests claimed to be an “inconsistent hybrid” (Gigerenzer 2004, 590):
• Fisherians can’t use power; N-P testers can’t report P-values, only fixed error probabilities (e.g., P < .05)
• In fact, Fisher & N-P recommended both pre-data error probabilities and post-data P-values
15
They are mathematically essentially identical
• They both fall under tools for “appraising and bounding the probabilities of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests, resampling, randomization
16
Both Fisher & N-P: it’s easy to lie with biasing selection effects
• Sufficient finagling—cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again—may practically guarantee an impressive-looking effect, even if it’s unwarranted by evidence
• Violates severity
17
Severity Requirement
• We have evidence for a claim C only to the extent C has been subjected to and passes a test that would probably have found C flawed, just if it is.
• This probability is the stringency or severity with which it has passed the test.
18
Requires a third role for probability
• Probabilism. To assign a degree of confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher at times)
• Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher)
Only probabilism is thought to be inferential or evidential
19
What happened to using probability to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly captures error-probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
20
Key to solving a central problem for frequentists
• Why is good performance relevant for inference in the case at hand?
• What bothers you with selective reporting, cherry picking, stopping when the data look good, P-hacking
• Not problems about long-runs—
21
We cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
22
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
23
Fishing for significance alters probative capacity
Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!* (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy 1970!)
[*Pr(no success) = (.95)^20 ≈ .36, so Pr(at least one “success”) ≈ .64]
24
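Selvin’s footnoted arithmetic, spelled out (the twenty tests treated as independent, each at the .05 level):

```python
# With 20 independent tests each at level .05, the chance that at least
# one comes out "significant" even when every null is true:
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k   # 1 - (.95)^20, about 0.64, not 0.05
```

So the "actual level of significance" of the fished-out result is roughly 64 percent, as the slide states.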
• Frequentists (error statisticians) need to adjust P-values to avoid being “fooled by randomness”
25
• In a classic example in the volume, adjustments for multiplicity led to falsifying claimed infant training benefits on personality (from the 1940s)
• Only 18 of 460 tests were found statistically significant (11 in the direction expected by Freudian theory of the day)
26
Today’s Meta-research is not free of philosophy of statistics
From the Bayesian perspective: “adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010)
(Co-director of the Meta-Research Innovation Center at Stanford)
27
Likelihood Principle (LP)
A pivotal disagreement about statistical evidence
In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
28
Logic of Support
• Ian Hacking (1965) “Law of Likelihood”: x supports hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (rejects in 1980)
• “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129).
29
Error Probability
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
30
On the LP, error probabilities appeal to something irrelevant
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436)
31
Familiar “reforms” offered as alternatives to significance tests follow the LP
• “The Bayes factor only depends on the actually observed data, and not on whether they have been collected from an experiment with fixed or variable sample size.” (van Dongen, Sprenger, and Wagenmakers 2022)
• “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.” (Berger and Wolpert 1988, 78)
32
In testing the mean of a standard normal distribution
33
Optional Stopping
• “if an experimenter uses this [optional stopping] procedure, then with probability 1 he will eventually reject any sharp null hypothesis, even though it be true”. (Edwards, Lindman, and Savage in Psychological Review 1963, 239)
34
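The Edwards, Lindman, and Savage point is easy to see by simulation. A rough Monte Carlo sketch (all numbers illustrative): sample from a true null N(0, 1), run a nominal two-sided .05-level z-test after every observation, and stop at the first “significant” result or at n = 100. The realized rejection rate far exceeds .05, and it would approach 1 as the horizon grows.

```python
import random

random.seed(1)  # reproducible illustration
Z_CRIT = 1.96   # nominal two-sided .05 cutoff

def rejects_with_peeking(n_max):
    """Sample from a TRUE null N(0,1), testing after each observation;
    return True if the running z-statistic ever crosses the cutoff."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)
        if abs(total / n ** 0.5) >= Z_CRIT:
            return True
    return False

trials = 2000
rate = sum(rejects_with_peeking(100) for _ in range(trials)) / trials
# rate lands well above the nominal 0.05 level
```

This is exactly the error-statistical worry: the sampling plan (here, the stopping rule) changes the test's capability of erroneously rejecting.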
The Stopping Rule Principle
From their Bayesian standpoint the stopping rule is irrelevant
• “[the] irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson).” (Edwards, Lindman, and Savage 1963, 239)
35
Contrast this with: The 21 Word Solution
• Replication researchers (re)discovered that data-dependent hypotheses and stopping are a major source of spurious significance levels.
• Simmons, Nelson, and Simonsohn (2011) place at the top of their list the need to block flexible stopping
• “Authors must decide the rule for terminating data collection before data collection begins and report this rule in the articles” (ibid. 1362).
36
You might think replication researchers disavow the stopping rule principle
“[I]f the sampling plan is ignored, the researcher is able to always reject the null hypothesis, even if it is true. ..Some people feel that ‘optional stopping’ amounts to cheating…. This feeling is, however, contradicted by a mathematical analysis.” (Eric-Jan Wagenmakers, 2007, 785)
The mathematical analysis assumes the likelihood principle
37
Replication Paradox
• Significance test critic: It’s too easy to satisfy standard statistical significance thresholds
• You: Why is it so hard to replicate significance thresholds with preregistered protocols?
• Significance test critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods that pick up on these biasing selection effects.
• Significance test critic: Actually, “reforms” recommend methods with no need to adjust P-values due to multiplicity
38
But the value of preregistered reports is error statistical
• Your appraisal is altered by considering the probability that some hypotheses, stopping point, subgroups, etc. could have led to a false positive–even if informal
• True, there are many ways to correct P-values (Bonferroni, false discovery rates).
• The main thing is to have an alert that the reported P-values are invalid
39
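Of the corrections just mentioned, Bonferroni is the simplest to state: test each of the m hypotheses at level α/m, which caps the family-wise error rate at α. A minimal sketch with hypothetical p-values:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return a reject/retain flag per hypothesis, holding the
    family-wise error rate at alpha by testing each at alpha/m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three hypothetical p-values; only the first clears .05/3 (about .0167)
flags = bonferroni_reject([0.001, 0.04, 0.20])
```

A p-value that looked "significant" on its own (0.04) no longer does once the multiplicity is on the record.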
Probabilists can still block intuitively unwarranted inferences (without error probabilities)
• Supplement with subjective beliefs
• Likelihoods + prior probabilities
40
Problems
• Could work in some cases, but doesn’t show what researchers had done wrong: a battle of beliefs
• The believability of data-dredged hypotheses is what makes them so seductive
• Priors are an additional source of flexibility, on top of biasing selection effects
41
  • 42. Most Bayesians (last decade) use “default” priors • Default priors are supposed to prevent prior beliefs from influencing the posteriors, letting the data dominate 42
  • 43. How should we interpret them? • “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299) • No agreement on rival systems for default/non-subjective priors (maximum entropy, invariance, maximizing the missing information, coverage matching). 43
  • 44. Criticisms of P-hackers lose force • In promoting an account that downplays error probabilities, critics hand the data dredger who deserves criticism a life-raft: 44
  • 45. Bem’s “Feeling the future” 2011: ESP? • Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture shown in the future • Bem admits data dredging, but Bayesian critics resort to a default Bayesian prior on the (point) null hypothesis Wagenmakers et al. 2011 “Why psychologists must change the way they analyze their data” 45
  • 46. Bem’s response “Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is in... Wagenmakers et al. (2011), it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox*: ... strong [frequentist] evidence in support of the experimental hypothesis [can] be contradicted by a ... Bayesian analysis.” (Bem et al. 2011, 717) *Bayes-Fisher disagreement 46
  • 47. Many of Today’s Significance Test Debates Trace to Bayes/Fisher Disagreement • The posterior probability Pr(H0|x) can be high while the P-value is low (2-sided test) 47
  • 48. Bayes/Fisher Disagreement With a lump of prior given to a point null, and the rest appropriately spread over the alternative [spike and smear], an α-significant result can correspond to Pr(H0|x) = (1 − α)! (e.g., 0.95) with large n Xᵢ ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0. 48
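The disagreement can be computed directly. The following is a minimal sketch of my own, assuming Pr(H0) = 1/2 (the spike), μ ~ N(0, τ²) with τ = 1 under the alternative (the smear), and σ = 1: hold the result at exactly the two-sided .05 cut-off (z = 1.96) and let n grow.

```python
import math

def normal_pdf(x, var):
    # Density of N(0, var) at x
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_H0(n, sigma=1.0, tau=1.0, z=1.96):
    # Just-significant 2-sided result at alpha = .05: xbar = z * SE
    se2 = sigma ** 2 / n
    xbar = z * math.sqrt(se2)
    m0 = normal_pdf(xbar, se2)             # marginal likelihood under H0: mu = 0
    m1 = normal_pdf(xbar, tau ** 2 + se2)  # under H1: mu ~ N(0, tau^2) "smear"
    bf01 = m0 / m1                         # Bayes factor in favor of H0
    return bf01 / (1 + bf01)               # posterior with Pr(H0) = 1/2 "spike"

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>7}: Pr(H0 | just-significant xbar) = {posterior_H0(n):.3f}")
```

The posterior for the point null climbs toward 1 as n increases, even though every result is statistically significant at α = .05: the Jeffreys-Lindley paradox in four lines of arithmetic.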
  • 49. • To the Bayesian, the P-value exaggerates the evidence against H0 • The significance tester balks at taking low p-values as no evidence against, or even evidence for, H0 49
  • 50. “Concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (Casella and R. Berger 1987, 111) 50
  • 51. “Redefine Statistical Significance” “Spike and smear” is the basis for the move to lower the P-value threshold to .005 (Benjamin et al. 2018) Opposing megateam: Lakens et al. (2018) 51
  • 52. • The problem isn’t lowering the probability of type I errors • The problem is assuming there should be agreement between quantities measuring different things 52
  • 53. 53 Whether P-values exaggerate, “depends on one’s philosophy of statistics … …based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … other statisticians do not accept these quantities as gold standards,”… (Greenland, Senn, Rothman, Carlin, Poole, Goodman, Altman 2016, 342)
  • 54. 54 • A silver lining to distinguishing highly probable and highly probed–can use different methods for different contexts
  • 55. 55 “A Bayesian Perspective on Severity” van Dongen, Sprenger, Wagenmakers [VSW] (Psychonomic Bulletin and Review 2022): “As Mayo emphasizes, the Bayes factor is insensitive to variations in the sampling protocol that affect the error rates, i.e., optional stopping” ... They argue that Bayesians can satisfy severity “regardless of whether the test has been conducted in a severe or less severe fashion”. (VSW 2022)
  • 56. 56 What they mean is that data can be much more probable on hypothesis H1 than on H0. But severity in their comparative subjective Bayesian sense does not mean H1 was well probed (in the error statistical sense)
  • 58. 58 The Bayes factor proponents and severe testers are on the same side against • “declarations of ‘statistical significance’ be abandoned” (Wasserstein, Schirm & Lazar 2019). • “whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data (ibid., 2) No significance/no threshold view
  • 59. 59 ASA (President’s) Task Force on Statistical Significance and Replicability (2019-2021) The ASA executive director’s “no threshold” view is not ASA policy: “P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned.” (Benjamini et al. 2021)
  • 60. 60 Severity Reformulate tests in terms of discrepancies (effect sizes) that are and are not severely tested SEV(Test T, data x, claim C) • In a nutshell: one tests several discrepancies from a test hypothesis and infers those well or poorly warranted Mayo 1981-2018; Mayo and Spanos (2006); Mayo and Cox (2006); Mayo and Hand (2022)
  • 61. 61 Avoid misinterpreting a 2 SE significant result (let SE = 1)
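For the normal testing example the severity computation is a one-liner. A minimal sketch with my own illustrative numbers (H0: μ = 0, known SE = 1, observed x̄ = 2, i.e., a just-2-SE result):

```python
import math

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sev_greater(xbar, mu1, se=1.0):
    # Severity for the claim mu > mu1 given observed mean xbar: the
    # probability of a less impressive result, were mu only as large as mu1
    return Phi((xbar - mu1) / se)

xbar = 2.0  # a 2-SE statistically significant result (SE = 1, H0: mu = 0)
for mu1 in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"SEV(mu > {mu1}) = {sev_greater(xbar, mu1):.3f}")
```

The 2-SE result severely warrants μ > 0 (SEV ≈ .977), but the claim μ > 2 gets severity only .5, and μ > 3 is poorly probed (≈ .16): the same data warrant some discrepancies and not others.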
  • 62. 62 Severity vs Power for μ > μ₁
  • 63. In the same way, severity avoids the “large n” problem • Fixing the P-value, increasing sample size n, the cut-off gets smaller • Large n is the basis for the Jeffreys- Lindley paradox 63
  • 64. Severity tells us: • an α-significant difference indicates less of a discrepancy from the null if it results from larger (n1) rather than a smaller (n2) sample size (n1 > n2 ) • What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one that doesn’t go off unless the house is fully ablaze? • [The larger sample size is like the one that goes off with burnt toast] 64
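The burnt-toast point can be made numerically. In this sketch (my own illustration, with σ = 1 and two arbitrary sample sizes), both tests observe exactly a 2-SE difference, yet the severity for inferring a discrepancy μ > 0.1 is higher for the smaller sample:

```python
import math

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sev_greater_at_n(n, mu1, sigma=1.0, z_obs=2.0):
    # Severity for mu > mu1 after a just-2-SE result from a sample of size n
    se = sigma / math.sqrt(n)
    xbar = z_obs * se
    return Phi((xbar - mu1) / se)

# The same z = 2 "alarm" from two sample sizes: which warrants mu > 0.1?
print(f"n = 10:  SEV(mu > 0.1) = {sev_greater_at_n(10, 0.1):.3f}")
print(f"n = 100: SEV(mu > 0.1) = {sev_greater_at_n(100, 0.1):.3f}")
```

The just-significant result from n = 10 warrants μ > 0.1 at about .95, the one from n = 100 only at about .84: the larger sample, like the alarm that rings for burnt toast, indicates less of a discrepancy.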
  • 66. What About Fallacies of Non-Significant Results? • They don’t warrant 0 discrepancy • Using severity reasoning: rule out discrepancies that very probably would have resulted in larger differences than observed; set upper bounds 66
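Setting upper bounds works the same way in the other direction. A minimal sketch, again with my own illustrative numbers (SE = 1, observed x̄ = 1, a non-significant result):

```python
import math

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sev_less(xbar, mu1, se=1.0):
    # Severity for mu < mu1 given a non-significant xbar: the probability
    # that a larger difference would have occurred, were mu as big as mu1
    return 1 - Phi((xbar - mu1) / se)

xbar = 1.0  # non-significant: only 1 SE above the null (SE = 1, H0: mu = 0)
for mu1 in (1.0, 2.0, 3.0):
    print(f"SEV(mu < {mu1}) = {sev_less(xbar, mu1):.3f}")
```

A discrepancy as large as μ = 3 would very probably (≈ .977) have produced a bigger difference than the one observed, so μ < 3 passes severely; μ < 1 does not (severity .5). Non-significance warrants ruling out large discrepancies, not inferring zero discrepancy.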
  • 67. Confidence Intervals CIs are also improved Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level. • Differentiate the warrant for claims within the interval • Move away from fixed confidence levels (e.g., .95) • Provide an inferential rationale 67
  • 68. 68 We get an inferential rationale CI Estimate: CI-lower < μ < CI-upper Performance rationale: Because it came from a procedure with good coverage probability Severe Tester: μ > CI-lower because with high probability (.975) we would have observed a smaller x̄ if μ ≤ CI-lower; likewise for warranting μ < CI-upper
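The severe tester's reading of each bound can be sketched directly (the observed mean and standard error below are hypothetical; the .975 is the one-sided tail of a two-sided 95% interval):

```python
import math

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xbar, se = 1.2, 0.5            # hypothetical observed mean and standard error
lower = xbar - 1.96 * se       # two-sided 95% CI: (0.22, 2.18)
upper = xbar + 1.96 * se

# Severity for mu > CI-lower: with probability .975 a smaller xbar would
# have been observed, were mu <= CI-lower
sev_lower = Phi((xbar - lower) / se)      # = Phi(1.96) = .975
sev_upper = 1 - Phi((xbar - upper) / se)  # same reasoning for mu < CI-upper
print(f"CI = ({lower:.2f}, {upper:.2f}); SEV = {sev_lower:.3f} for each bound")
```

Each bound is thus warranted inferentially at .975 in its own right, not merely because the procedure covers the true μ 95% of the time in the long run.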
  • 69. • I begin with a simple tool: the minimal requirement for evidence • We have evidence for C only to the extent C has been subjected to and passes a test it probably would have failed if false 69
  • 70. • Biasing selection effects make it easy to find impressive-looking effects erroneously • They alter a method’s error probing capacities • They do not alter evidence (in traditional probabilisms): Likelihood Principle (LP) • On the LP, error probabilities consider “imaginary data” and “intentions” 70
  • 71. • To the severe tester, probabilists are robbed of a main way to block spurious results • Probabilists may block inferences without appeal to error probabilities: a high prior on H0 (no effect) can result in a high posterior probability for H0 • Problems: increased flexibility, puts blame in the wrong place, unclear how to interpret • Gives a life-raft to the P-hacker and cherry picker 71
  • 72. • The aim of severe testing directs the reinterpretation of significance tests and other methods • Severe probing (formal or informal) must take place at every level: from data to statistical hypothesis; to substantive claims • A silver lining to distinguishing highly probable and highly probed–can use different methods for different contexts 72
  • 73. Thank you! I’d be glad to have questions 73
  • 74. References • Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science 23(2), 123–32. • Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology 100(3), 407-425. • Bem, D. J., Utts, J., and Johnson, W. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology 101(4), 716-719. • Benjamin, D., Berger, J., Johannesson, M., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z • Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on statistical significance and replicability. The Annals of Applied Statistics. https://doi.org/10.1080/09332480.2021.2003631 • Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6, Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics. • Birnbaum, A. (1970). Statistical methods in scientific inference (letter to the Editor). Nature, 225(5237) (March 14), 1033. 74
  • 75. • Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association, 82(397), 106-11. • Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press. https://doi.org/10.1017/CBO9780511813559 • Cox, D. R., and Mayo, D. G. (2010). Objectivity and conditionality in frequentist inference. In D. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, pp. 276–304. Cambridge: Cambridge University Press. • Edwards, W., Lindman, H., and Savage, L. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193-242. • Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd. • Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. • Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130, 1005–1013. • Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 31, 337–350. https://doi.org/10.1007/s10654-016-0149-3 75
  • 76. • Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press. • Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and Braithwaite. In D. Mellor (Ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60. • Ioannidis, J. (2005). Why most published research findings are false. PLoS Medicine 2(8), 0696–0701. • Lakens, D., et al. (2018). Justify your alpha. Nature Human Behaviour 2, 168-71. • Lindley, D. V. (1971). The estimation of many parameters. In V. Godambe & D. Sprott (Eds.), Foundations of Statistical Inference, pp. 435–455. Toronto: Holt, Rinehart and Winston. • Mayo, D. (1981). In defense of the Neyman-Pearson theory of confidence intervals. Philosophy of Science, 48(2), 269-280. • Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. • Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press. • Mayo, D. G. (2020). Significance tests: Vitiated or vindicated by the replication crisis in psychology? Review of Philosophy and Psychology 12, 101-120. https://doi.org/10.1007/s13164-020-00501-w 76
  • 77. 77 • Mayo, D. G. (2020). P-values on trial: Selective reporting of (best practice guides against) selective reporting. Harvard Data Science Review 2.1. • Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest. Conservation Biology: The Journal of the Society for Conservation Biology, 36(1), 13861. https://doi.org/10.1111/cobi.13861 • Mayo, D. G. and Cox, D. R. (2006). Frequentist statistics as a theory of inductive inference. In J. Rojo (Ed.), The Second Erich L. Lehmann Symposium: Optimality, pp. 247-275. Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics. • Mayo, D. G. and Hand, D. (2022). Statistical significance and its critics: Practicing damaging science, or damaging scientific practice? Synthese 200, 220. • Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences. In D. Corfield & J. Williamson (Eds.), Foundations of Bayesianism, pp. 381-403. Dordrecht: Kluwer Academic Publishers. • Mayo, D. G., and A. Spanos (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. British Journal for the Philosophy of Science 57(2) (June 1), 323–357. • Mayo, D. G., and A. Spanos (2011). Error statistics. In P. Bandyopadhyay and M. Forster (Eds.), Philosophy of Statistics, 7, pp. 152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
  • 78. 78 • Morrison, D. E., and R. E. Henkel (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. • Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85. • Neyman, J. & Pearson, E. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson. University of California Press. • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349(6251), 943-51. • Pearson, E. S. & Neyman, J. (1967). On the problem of two samples. In J. Neyman & E. S. Pearson (Eds.), Joint Statistical Papers, pp. 99-115. Berkeley: U. of Calif. Press. First published 1930 in Bul. Acad. Pol. Sci., 73-96. • Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen. • Selvin, H. (1970). A critique of tests of significance in survey research. In D. Morrison and R. Henkel (Eds.), The Significance Test Controversy, pp. 94-106. Chicago: Aldine De Gruyter. • Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-66.
  • 79. 79 • Simmons, J., et al. (2012). A 21 word solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7. • van Dongen, N., Sprenger, J., & Wagenmakers, E.-J. (2022). A Bayesian perspective on severity: Risky predictions and specific hypotheses. Psychonomic Bulletin & Review 30, 516–533. https://doi.org/10.3758/s13423-022-02069-1 • Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review 14(5), 779-804. • Wagenmakers, E.-J., et al. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426-32. • Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05” (Editorial). The American Statistician 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913
  • 80. Jimmy Savage on the LP: “According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, 17) 80
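Savage's point is standardly illustrated with binomial versus negative binomial sampling (an illustration I am adding, not from the slides): 9 successes and 3 failures yield likelihoods proportional to p⁹(1−p)³ under either plan, so on the LP the evidence about p is identical, yet the p-values for H0: p = .5 differ because the sampling plans differ.

```python
from math import comb

def binom_pvalue(k, n, p0=0.5):
    # n fixed in advance: one-sided p-value Pr(X >= k; p0)
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

def negbinom_pvalue(k, r, p0=0.5, kmax=1000):
    # Sample until r failures: Pr(K >= k successes; p0), where
    # Pr(K = j) = C(j + r - 1, j) * p0^j * (1 - p0)^r
    return sum(comb(j + r - 1, j) * p0**j * (1 - p0)**r for j in range(k, kmax))

# 9 successes, 3 failures: the likelihood is proportional to p^9 (1-p)^3
# under either plan, so the LP says the evidence about p is the same...
print(f"binomial p-value:          {binom_pvalue(9, 12):.4f}")   # ~ .073
print(f"negative-binomial p-value: {negbinom_pvalue(9, 3):.4f}") # ~ .033
```

This is why error statisticians reject the LP: the sampling (or stopping) plan alters the error probabilities, and hence what the test warrants, even when it leaves the likelihood function untouched up to a constant.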