The Statistics Wars: Errors and
Casualties
Deborah G. Mayo
Dept of Philosophy, Virginia Tech
Conference: Statistical Reasoning and Scientific Error
MuST Conference in Philosophy of Science
July 1, 2019
1
The Statistics Wars
2
Role of probability: performance
or probabilism?
(Frequentist vs. Bayesian)
• End of foundations? (Unifications, Peace
Treaties, and Eclecticism)
• Long-standing battles still simmer below the
surface (agreement on numbers)
3
Statistical inference as severe
testing
• Brush the dust off pivotal debates in relation to
today’s statistical crisis in science
• We set sail with a simple tool: a minimal
requirement for evidence
• Sufficiently general to apply to any methods
now in use
• Excavation tool: You needn’t accept this
philosophy to use it to get beyond today’s
statistical wars and appraise reforms
5
Statistical reforms
• Several are welcome: preregistration, avoidance of
cookbook statistics, calls for more replication
research
• Others are quite radical, and even violate our minimal
principle of evidence
• These are the errors and casualties I refer to
• Combating paradoxical, self-defeating “reforms” and
unthinking bandwagon effects requires taking on very
strong ideological leaders
6
In SIST: How to get beyond the
stat wars requires chutzpah
“You will need to critically evaluate …brilliant
leaders, high priests, maybe even royalty. Are
they asking the most unbiased questions in
examining methods, or are they like admen
touting their brand, dragging out howlers to
make their favorite method look good? (I am not
sparing any of the statistical tribes here.)” (p. 12)
• Statistics wars as proxy wars
7
Most often used tools are most
criticized
“Several methodologists have pointed out that the
high rate of nonreplication of research discoveries is
a consequence of the convenient, yet ill-founded
strategy of claiming conclusive research findings
solely on the basis of a single study assessed by
formal statistical significance, typically for a p-value
less than 0.05. …” (Ioannidis 2005, 696)
Do researchers do that?
8
R.A. Fisher
“[W]e need, not an isolated record, but a reliable
method of procedure. In relation to the test of
significance, we may say that a phenomenon is
experimentally demonstrable when we know how to
conduct an experiment which will rarely fail to give
us a statistically significant result.” (Fisher 1947, 14)
9
Simple significance tests (Fisher)
“p-value. …to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the test
statistic, such that
• the larger the value of T the more inconsistent are
the data with H0;
• T = t(Y) has a known probability distribution
when H0 is true.
…the p-value corresponding to any t_obs is
p = p(t_obs) = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
10
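A minimal sketch of this definition in code (not in the original slides), assuming a one-sided test of a Normal mean with known σ, so that T = √n(X̄ – μ0)/σ is N(0,1) under H0; the numbers are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def p_value(xbar, mu0, sigma, n):
    """One-sided p-value Pr(T >= t_obs; H0), with test statistic
    T = sqrt(n)*(Xbar - mu0)/sigma, which is N(0,1) under H0."""
    t_obs = sqrt(n) * (xbar - mu0) / sigma
    return 1 - NormalDist().cdf(t_obs)

# Illustrative numbers: n = 25, sigma = 1, observed mean 0.4 -> t_obs = 2.0
print(p_value(xbar=0.4, mu0=0.0, sigma=1.0, n=25))  # ~0.023
```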
Testing reasoning
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., the P-value is not small),
there’s scarcely evidence of incompatibility
with H0
• A small P-value indicates some underlying
discrepancy from H0, because very probably
you would have seen a less impressive
difference than t_obs were H0 true
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy: H ⇒ H*
11
12
Don’t let the tail wag the dog
“It would be a mistake to allow the tail to wag the
dog by being overly influenced by flawed statistical
inferences” (Cook et al. 2019)
In response to the suggestion that the concept of statistical
significance be dropped (Amrhein et al. 2019)
Neyman-Pearson (N-P) tests:
Null and alternative hypotheses H0, H1
that are exhaustive*
H0: μ ≤ 0 vs. H1: μ > 0
“no effect” vs. “some positive effect”
• So the fallacy of rejection (H1 ⇒ H*) is blocked
• Rejecting H0 only indicates the statistical alternative
H1 (how discrepant from the null)
*(introduces Type II error, and power)
13
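A companion sketch (an illustration, not from the slides) of the power these tests introduce, using the one-sided Normal test above with the values from the notes to slide 63 (n = 25, σ = 1, so σ_x̄ = 0.2, α = 0.025):

```python
from math import sqrt
from statistics import NormalDist

def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """Power of the one-sided z-test of H0: mu <= mu0 against mu = mu1:
    Pr(T >= z_alpha; mu1) = 1 - Phi(z_alpha - sqrt(n)*(mu1 - mu0)/sigma)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # 1.96 for alpha = 0.025
    return 1 - NormalDist().cdf(z_alpha - sqrt(n) * (mu1 - mu0) / sigma)

print(power(0.2))  # vs. mu1 = 0.2: ~0.17 (high Type II error)
print(power(0.8))  # vs. mu1 = 0.8: ~0.98
```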
Casualties from (Pathological)
Fisher-N-P Battles
• Neyman and Pearson (N-P) gave a rationale for
Fisherian tests
• All was hunky-dory until Fisher and Neyman
began fighting (from 1935), largely due to
professional and personality disputes
• A huge philosophical difference is read into
their in-fighting
14
Get beyond the inconsistent hybrid
charge
• Major casualties from the “inconsistent hybrid”:
Fisher: inferential; N-P: long-run performance
• Fisherians can’t use power
• N-P testers must adhere to fixed error
probabilities (P < α); can’t report P-values (P = p)
15
16
Erich Lehmann (Neyman’s student)
“It is good practice to determine … the smallest
significance level…at which the hypothesis would
be rejected for the given observation. …the P-
value gives an idea of how strongly the data
contradict the hypothesis [and] enables others
to reach a verdict based on the significance
level of their choice.”
(Lehmann and Romano 2005, 63-4)
Error Statistics
• Fisher and N-P both fall under tools for “appraising
and bounding the probabilities (under respective
hypotheses) of seriously misleading interpretations
of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests,
resampling, randomization
17
Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
• Violates severity
18
Severity Requirement:
If the test had little or no capability of finding flaws
with H (even if H is incorrect), then agreement
between data x0 and H provides poor (or no)
evidence for H
• Such a test fails a minimal requirement for a
stringent or severe test
• N-P and Fisher did not put it in these terms, but
our severe tester does
19
Requires a third role for
probability
Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis, given data x0
(absolute or comparative)
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of methods,
coverage probabilities (frequentist, behavioristic Neyman-
Pearson, Fisher (at times))
Only probabilism is thought to be inferential or evidential
20
What happened to using probability
to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly
captures assessing error probing capacity
• Good long-run performance is a necessary, not a
sufficient, condition for severity
21
Key to solving a key problem for
frequentists
• Why is good performance relevant for
inference in the case at hand?
• What bothers you about selective reporting,
cherry picking, stopping when the data look
good, P-hacking?
• These are not problems about long-runs—
22
We cannot say the case at hand has done a
good job of avoiding the sources of
misinterpreting data
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets
a probability boost, made comparatively firmer)
• Performance: unless it stems from a method
with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done to
probe ways we can be wrong about C
24
A severe test: My weight
Informal example: To test if I’ve gained weight
between the time I left for Germany and my return, I
use a series of well-calibrated and stable scales,
both before leaving and upon my return.
All show an over 4 lb gain; none shows a difference
when weighing a known, stable object (my book EGEK). I infer:
H: I’ve gained at least 4 pounds
25
26
• Properties of the scales are akin to the properties of
statistical tests (performance).
• No one claims the justification is merely long-run
performance, as if nothing could be said about my weight.
• We infer something about the source of the
readings from the high capability to reveal if any
scales were wrong
27
The severe tester is assumed to be in a
context of wanting to find things out
• I could insist all the scales are wrong—even though
they work fine weighing known objects—but this would
prevent correctly finding out about weight…
• What sort of extraordinary circumstance could
cause them all to go astray just when we don’t
know the weight of the test object?
• Argument from coincidence: goes beyond being
highly improbable
28
“According to modern logical empiricist orthodoxy,
in deciding whether hypothesis h is confirmed by
evidence e, . . . we must consider only the
statements h and e, and the logical relations
[C(h,e)] between them. It is quite irrelevant whether
e was known first and h proposed to explain it, or
whether e resulted from testing predictions drawn
from h”.
(Alan Musgrave 1974, p. 2)
Battles about roles of probability
trace to philosophies of inference
Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as
I’m using the term) the import of the data is via the
ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
29
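To see the LP at work, here is a hedged sketch (not from the slides) comparing binomial sampling (n fixed in advance) with negative binomial sampling (sample until the k-th success). The likelihoods are proportional, so the likelihood ratio, and any LP-respecting measure, is identical under both plans, even though the sampling distributions, and hence the error probabilities, differ:

```python
from math import comb

def binom_lik(p, k, n):
    # Likelihood of k successes in n trials, n fixed in advance
    return comb(n, k) * p**k * (1 - p)**(n - k)

def negbinom_lik(p, k, n):
    # Likelihood of needing n trials to reach the k-th success
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

k, n = 3, 12
p0, p1 = 0.5, 0.25
# The constants comb(n, k) and comb(n-1, k-1) cancel in each ratio,
# so the likelihood ratio is the same under both sampling plans:
print(binom_lik(p0, k, n) / binom_lik(p1, k, n))        # ~0.208
print(negbinom_lik(p0, k, n) / negbinom_lik(p1, k, n))  # same value
```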
Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”:
x supports hypothesis H0 less well than H1 if
Pr(x;H0) < Pr(x;H1)
(he rejected it in 1980)
• Any hypothesis that perfectly fits the data is
maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129). 30
Error probabilities
• Pr(H0 is less well supported than H1;H0 ) is high
for some H1 or other
“In order to fix a limit between ‘small’ and ‘large’
values of [the likelihood ratio] we must know how
often such values appear when we deal with a
true hypothesis.” (Pearson and Neyman 1967,
106)
31
Hunting for significance
(nominal vs. actual)
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ …The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!)
32
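Where Selvin’s 64 percent comes from, assuming the twenty examined differences are independent and all nulls are true:

```python
# With 20 independent differences each tested at the 0.05 level,
# the chance at least one appears 'significant' under true nulls:
actual_level = 1 - (1 - 0.05) ** 20
print(round(actual_level, 2))  # 0.64 -- the nominal 5% becomes 64%
```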
Spurious P-Value
The hunter reports: such results would be
difficult to achieve under the assumption of H0,
when in fact such results are easy to get under
the assumption of H0
• There are many more ways to be wrong with
hunting (different sample space)
• Need to adjust P-values or at least report the
multiple testing
33
Some accounts of evidence object:
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or…data
dredging and peeking at the data. The frequentist
solution to both problems involves adjusting the P-
value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation
Center at Stanford)
34
On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space”
(Lindley 1971, 436)
35
Casualty: Error control is lost
“[I]f the sampling plan is ignored, the researcher is able
to always reject the null hypothesis, even if it is true. This
example is sometimes used to argue that any statistical
framework should somehow take the sampling plan into
account. …This feeling is, however, contradicted by a
mathematical analysis.” (E-J Wagenmakers 2007, 785)
But the “proof” assumes the likelihood principle (LP) by
which error probabilities drop out. (Edwards, Lindman,
and Savage 1963)
A mathematical ‘principle of rationality’ (LP) gets weight
over what appears to be common sense
36
At odds with key way to advance
replication:
21 Word Solution
“We report how we determined our sample size,
and data exclusions (if any), all manipulations, and
all measures in the study” (Simmons, Nelson, and
Simonsohn 2012, 4).
• Replication researchers find flexibility with data-
dredging and stopping rules a major source of failed
replication (the “forking paths”, Gelman and Loken
2014)
37
Many “reforms” offered as
alternative to significance tests
follow the LP
• “Bayes factors can be used in the complete absence
of a sampling plan…” (Bayarri, Benjamin, Berger,
Sellke 2016, 100)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is not
given…. Data should be able to speak for itself.”
(Berger and Wolpert 1988, 78; authors of The
Likelihood Principle)
• No wonder reformers talk past each other
38
Tension within replication reforms
American Statistical Association 2016 (statement
on P-values):
• Principle 4: P-values and related analyses should not
be reported selectively. …Whenever a researcher
chooses what to present based on statistical results,
valid interpretation of those results is severely
compromised if the reader is not informed of the
choice and its basis.
• Other approaches: In view of the prevalent misuses
of and misconceptions concerning p-values, some
statisticians prefer to supplement or even replace p-
values with other approaches… such as likelihood
ratios or Bayes factors (p. 132)
39
Replication Paradox
• Test Critic: It’s too easy to satisfy standard
significance thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Test Critic: Obviously the initial studies were guilty
of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Test Critic: Actually “reforms” recommend methods
where the need to alter P-values due to data
dredging vanishes
40
Can probabilists still block intuitively
unwarranted inferences
(without error probabilities)?
• Supplement with subjective beliefs: What do I
believe? As opposed to What is the evidence?
(Royall)
• Likelihoodists + prior probabilities
41
Problems
• Additional source of flexibility, priors as well as
biasing selection effects
• Doesn’t show what researchers had done
wrong—battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
42
Peace Treaty (J. Berger 2003, 2006):
use “default” priors: unification
• Default priors are supposed to prevent prior beliefs
from influencing the posteriors–data dominant
• “The priors are not to be considered expressions of
uncertainty, ignorance, or degree of belief.
Conventional priors may not even be probabilities…”
(Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-
subjective priors (no “uninformative” priors)
43
Casualties: Criticisms of data-
dredgers lose force
Wanting to promote an account that downplays error
probabilities, Bayesian critics turn to other means: giving
H0 (no effect) a high prior probability in a Bayesian
analysis
• The researcher deserving criticism deflects this
saying: you can always counter an effect this way
(the defense violates our minimal principle of
evidence)
44
Bem’s “Feeling the future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than chance
at predicting the (erotic) picture shown in the future
• Some locate the start of the Replication Crisis with
Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior
on a (point) null hypothesis
45
Bem’s response
“Whenever the null hypothesis is sharply defined but
the prior distribution on the alternative hypothesis is
diffused over a wide range of values, as it is [here] it
boosts the probability that any observed data will be
higher under the null hypothesis than under the
alternative.
This is known as the Lindley-Jeffreys paradox*: A
frequentist [can always] be contradicted by a
…Bayesian analysis that concludes that the same data
are more likely under the null.” (Bem et al. 2011, 717)
*Bayes-Fisher disagreement
46
47
Key to today’s statistics wars: P-
values vs posteriors
• The posterior probability Pr(H0|x) can be large
while the P-value is small
• To a Bayesian this shows P-values exaggerate
the evidence against the null
• Significance testers object to highly significant
results being interpreted as no evidence against
the null – or even evidence for it! (High Type II
error)
Bayes (Jeffreys)/Fisher disagreement
(“spike and smear”)
• The “P-values exaggerate” arguments refer to
testing a point null hypothesis, a lump of prior
probability given to H0 (or a tiny region around 0):
Xi ~ N(μ, σ²)
H0: μ = 0 vs. H1: μ ≠ 0
• With the rest appropriately spread over the
alternative, an α-significant result can correspond to
Pr(H0|x) = (1 – α) (e.g., 0.95)!
48
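A sketch of the Bayes/Fisher disagreement with illustrative numbers (not from the slides): give H0 a spike π0 = 0.5 and smear the rest as μ ~ N(0, τ²). As the smear gets more diffuse, a result just significant at the two-sided 0.05 level yields a posterior on H0 approaching 1:

```python
from math import sqrt, exp, pi

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

n, sigma = 100, 1.0
se = sigma / sqrt(n)       # standard error of the mean: 0.1
xbar = 1.96 * se           # just significant at the two-sided 0.05 level
pi0 = 0.5                  # the "spike": lump of prior on H0: mu = 0

for tau in (1, 10, 100):   # the "smear": mu ~ N(0, tau^2) under H1
    m0 = normal_pdf(xbar, 0, se)                     # density under H0
    m1 = normal_pdf(xbar, 0, sqrt(tau**2 + se**2))   # marginal under H1
    post_H0 = pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)
    print(tau, round(post_H0, 3))   # 0.6, 0.936, 0.993
```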
“Concentrating mass on the point null hypothesis is
biasing the prior in favor of H0 as much as possible”
(Casella and R. Berger 1987, 111), whether in one- or
two-sided tests
Yet “spike and smear” is the basis for “Redefine
Statistical Significance” (Benjamin et al. 2018)
Opposing megateam: Lakens et al. (2018)
• Whether tests should use a lower Type 1 error
probability is separate; the problem is supposing
there should be agreement between quantities
measuring different things
49
50
Whether P-values exaggerate,
“depends on one’s philosophy of
statistics” (Greenland et al., 2016)
…based on directly comparing P values against
certain quantities (likelihood ratios and Bayes
factors) that play a central role as evidence
measures in Bayesian analysis … Nonetheless,
many other statisticians do not accept these
quantities as gold standards,”… (Greenland, Senn,
Rothman, Carlin, Poole, Goodman, Altman 2016,
342)
Recent example of a battle based on P-
values disagreeing with posteriors
• If we imagine randomly selecting a hypothesis
from an urn of nulls 90% of which are true
• Consider just 2 possibilities: H0: no effect
H1: meaningful effect, all else ignored,
• Take the prevalence of 90% as
Pr(H0) = 0.9, Pr(H1)= 0.1
• Reject H0 with a single (just) 0.05 significant result,
with cherry-picking, selection effects
Then it can be shown most “findings” are false
51
52
Diagnostic Screening (DS) model of tests
• Pr(H0 | Test T rejects H0) > 0.5
really: the prevalence of true nulls among those
rejected at the 0.05 level is > 0.5.
Call this the false finding rate (FFR)
• Pr(Test T rejects H0 | H0) = 0.05
Criticism: N-P Type I error probability ≠ FFR
(Ioannidis 2005, Colquhoun 2014)
With Prev(H0) = 0.9, α = 0.05, and power (1 – β) = 0.8:
FFR = 0.36, so the PPV = 0.64
53
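The arithmetic behind the slide’s numbers, in a few lines:

```python
# Imagine 1000 hypotheses: 900 true nulls, 100 real effects
prev_H0, alpha, power = 0.9, 0.05, 0.8
false_findings = prev_H0 * alpha        # 0.045: true nulls rejected
true_findings = (1 - prev_H0) * power   # 0.080: real effects detected
FFR = false_findings / (false_findings + true_findings)
print(round(FFR, 2), round(1 - FFR, 2))  # FFR = 0.36, PPV = 0.64
```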
54
But there are major casualties
Pr(H0 | Test T rejects H0) is not a Type I error
probability.
It transposes the conditional
It combines crude performance with a probabilist
assignment (What’s Prev(H1)?)
Fallacy of probabilistic instantiation: even if the
prevalence of true effects in the urn is 0.1, it does not
follow that a specific hypothesis gets a probability of
0.1 of being true, for a frequentist
Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in
which certain of the model’s implications are
compared directly to the data.” (Gelman and Shalizi
2013, 10, 20).
Falsify (statistically) with small P-values
55
56
Can’t also claim:
• “No p-value can reveal the plausibility, presence,
[or] truth of an association or effect” (Wasserstein
et al. 2019, 2)
• I recommend: “No p-value ‘by itself’…”
• Granted, P-values don’t give effect sizes
• Who says we only report P-values and stop?
57
Frequentist Family Feud
Confidence Interval (CI) “Crusade”
Leaders of the “New Statistics” (in psychology) often
wage a proxy war for replacing tests with CIs.
58
• The “CIs only” battlers encourage supplementing
tests with CIs, which is good; but there have been
casualties
• They promote the view: the only alternative to CIs is
the abusive cookbook variation of Fisherian
significance tests
• But CIs are inversions of tests: A 95% CI contains
the parameter values you can’t reject at the 5%
level
• The same man who co-developed N-P tests also
developed CIs (1930): Neyman
59
Duality of tests and CIs
(estimating μ in a Normal Distribution)
μ > M0 – 1.96σ/√n CI-lower
μ < M0 + 1.96σ/√n CI-upper
M0 : the observed sample mean
CI-lower: the value of μ that M0 is statistically
significantly greater than at P= 0.025
CI-upper: the value of μ that M0 is statistically
significantly lower than at P= 0.025
• You could get a CI by asking for these values, and
learn indicated effect sizes with tests
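A sketch of this duality (illustrative values; σ known, 95% two-sided, so each bound corresponds to P = 0.025):

```python
from math import sqrt

def ci_95(xbar, sigma, n):
    """95% CI for mu: all values of mu the observed mean xbar does not
    reject at P = 0.025 in either direction."""
    se = sigma / sqrt(n)
    return xbar - 1.96 * se, xbar + 1.96 * se

lo, hi = ci_95(xbar=0.4, sigma=1.0, n=25)
print(lo, hi)  # (0.008, 0.792); mu = 0 is just outside: rejected at P = 0.025
```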
CIs (as standardly used) inherit
problems of N-P tests
• Too dichotomous: in/out
• Justified in terms of long-run coverage
• All members of the CI treated on par
• Fixed confidence levels (need several
benchmarks)
60
61
The severe tester reformulates
tests with a discrepancy γ from H0
• Severity function: SEV(Test T, data x, claim C)
• Instead of a binary cut-off (significant or not), the
particular outcome is used to infer
discrepancies that are or are not warranted
62
To avoid Fallacies of Rejection
(e.g., magnitude error)
• If you very probably would have observed a more
impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ),
then the data are poor evidence that
μ > μ1.
• To connect with power: suppose M0 just reaches
statistical significance at level P, say 0.025, and
compute power in relation to this cut-off
• If the power against μ1 is high, then the data are
poor evidence that
μ > μ1.
[Figure: Power vs. severity for μ > μ1]
63
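A sketch of the severity computation behind this comparison, using the example in the notes to slide 63 (n = 25, σ = 1 so σ_x̄ = 0.2, α = 0.025); the observed mean is illustrative:

```python
from math import sqrt
from statistics import NormalDist

def sev_greater(mu1, m_obs, sigma=1.0, n=25):
    """SEV(mu > mu1) after a significant result: the probability of a
    less impressive (smaller) observed mean were mu = mu1."""
    se = sigma / sqrt(n)   # 0.2, as in the notes' example
    return NormalDist().cdf((m_obs - mu1) / se)

m_obs = 0.4  # just reaches significance at P = 0.025 (z = 2)
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(mu1, round(sev_greater(mu1, m_obs), 2))
# mu > 0.0 passes severely (0.98); mu > 0.4 does not (0.50); mu > 0.6 (0.16)
```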
Similarly, severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire
alarm that goes off with burnt toast, or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast]
64
What about fallacies of
non-significant results?
• They don’t warrant a 0 discrepancy (there are
discrepancies the test had low probability of
detecting)
• Not uninformative: we can find upper bounds μ1
such that SEV(μ < μ1) is high
• It’s with negative results (P-values not small) that
severity goes in the same direction as power
–provided model assumptions hold
65
Severity applies informally: to
probe links from the statistical effect to H*
• Casualty of focusing on whether the replication
gets low P-values:
• Much replication research ignores the larger
question: are they even measuring the
phenomenon they intend?
• Failed replication often construed: There’s a real
effect but it’s smaller
66
OSC: Reproducibility Project:
Psychology: 2011-15 (Science
2015) (led by Brian Nosek, U. VA)
• Crowd-sourced: replicators chose 100 articles
from three journals (2008)
67
• One of the non-replications: cleanliness and
morality: Do cleanliness primes make you
less judgmental?
“Ms. Schnall had 40 undergraduates unscramble
some words. One group unscrambled words
that suggested cleanliness (pure, immaculate,
pristine), while the other group unscrambled
neutral words. They were then presented with
a number of moral dilemmas, like whether it’s
cool to eat your dog after it gets run over by a
car.”
68
“Subjects who had unscrambled clean words
weren’t as harsh on the guy who chows down on
his chow.”
(Bartlett, Chronicle of Higher Education)
Is the cleanliness prime responsible?
69
Nor is there discussion of the multiple
testing in the original study
• Only 1 of the 6 dilemmas in the original study
showed statistically significant differences in degree
of wrongness–not the dog one
• No differences on 9 different emotions (relaxed,
angry, happy, sad, afraid, depressed, disgusted,
upset, and confused)
• Similar studies in experimental philosophy:
philosophers of science need to critique them
70
71
The statistics wars: Errors & casualties
• Mounting failures of replication …give a new urgency
to critically appraising proposed statistical reforms.
• While many reforms are welcome (preregistration of
experiments, replication, discouraging cookbook uses
of statistics), there have been casualties.
• The philosophical presuppositions …remain largely
hidden.
• Too often the statistics wars have become proxy wars
between competing tribe leaders, each keen to
advance one or another tool or school, rather than
build on efforts to do better science.
72
Efforts of replication researchers and open science
advocates are diminished when
• attention is centered on repeating hackneyed
howlers of statistical significance tests (statistical
significance isn’t substantive significance, no
evidence against isn’t evidence for),
• erroneous understanding of basic statistical terms
goes uncorrected, and
• bandwagon effects lead to popular reforms that
downplay the importance of error probability
control.
These casualties threaten our ability to hold
accountable the “experts,” the agencies, and all the
data handlers increasingly exerting power over our
lives.
73
NOTES ON SLIDES
SLIDE #57. The term “confidence interval crusade” is from Hurlbert and Lombardi
(2009).
SLIDE #59. We’re really involved with two one-sided tests:
“More commonly,” Cox and Hinkley (1974, p. 79) note “large and small values of
[the test statistic] indicate quite different kinds of departure… . Then it is best to
regard the tests in the two directions as two different tests… .one to examine the
possibility” μ > μ0, the other for μ < μ0 .
“…look at both [p+ and p−] and select the more significant of the two, i.e., the
smaller. To preserve the physical meaning of the significance level, we make a
correction for selection.” (ibid. p. 106)
In the continuous case, we double the separate p-values.
SLIDE #63. This figure alludes to a specific example, but the point is general. The
example is H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σ_x̄ = 0.2.
SLIDE #65. Note: This is akin to power analysis (first used by Neyman and
popularized by Jacob Cohen), except that SEV uses the actual outcome rather than
the one that just misses rejection. In accounts focused on Fisherian tests, power
considerations are absent. Note that severity is not the same as something called
post-hoc power, where the observed difference becomes the hypothesis for which
power is computed. I call it “shpower”. Another serious casualty is to warn against
“post hoc power” when “shpower” is meant.
74
Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is the
datum of some other experiment, and if it
happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have exactly
the same thing to say about the values of
µ…” (Savage 1962, p. 17)
75
FEV: Frequentist Principle of Evidence; Mayo
and Cox (2006); SEV: Mayo 1991, Mayo and
Spanos (2006)
FEV/SEV A small P-value indicates discrepancy γ from
H0, if and only if, there is a high probability the test
would have resulted in a larger P-value were a
discrepancy as large as γ absent.
FEV/SEV A moderate P-value indicates the absence of
a discrepancy γ from H0, only if there is a high
probability the test would have given a worse fit with
H0 (i.e., a smaller P-value) were a discrepancy γ
present.
76
Researcher degrees of freedom in
the Schnall study
After the priming task, participants rated six moral
dilemmas: “Dog” (eating one’s dead dog),
“Trolley” (switching the tracks of a trolley to kill
one workman instead of five), “Wallet” (keeping
money inside a found wallet), “Plane Crash”
(killing a terminally ill plane crash survivor to avoid
starvation), “Resume” (putting false information
on a resume), and “Kitten” (using a kitten for
sexual arousal). Participants rated how wrong
each action was from 0 (perfectly OK) to 9
(extremely wrong).
77
References
• Amrhein, V., Greenland, S. and McShane B. (2019). “Comment: Retire Statistical Significance”, Nature 567: 305-7. (Online, 20
March 2019).
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by Ian Hacking)”, British
Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of Higher Education (online)
June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in
Testing Hypotheses." Journal of Mathematical Psychology 72: 90-103.
• Bem, D. J. (2011). “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal
of Personality and Social Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. (2011). “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality
and Social Psychology 101(4), 716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). “Redefine Statistical Significance”, Nature Human Behaviour 2, 6–10.
• Berger, J. O. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and Rejoinder’, Statistical Science 18(1), 1–12;
28–32.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph Series. Hayward, CA:
Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of
the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-values’, Royal Society Open
Science 1(3), 140216.
• Cook, J., Fergusson, D., Ford, I., Gonen, M., Kimmelman, J., Korn, E., and Begg, C. (2019). “There Is Still a Place for
Significance Testing in Clinical Trials”, Clinical Trials 16(3): 223-4.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and Inference: Recent
Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Mayo and Spanos (eds.), 276–
304. CUP.
78
• Edwards, W., Lindman, H., and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological
Review 70(3), 193-242.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’” British Journal of
Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Goodman, S. N. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine
130: 1005–13.
• Greenland, S., Senn, S., Rothman, K., et al. (2016). ‘Statistical Tests, P values, Confidence Intervals, and Power: A Guide to
Misinterpretations’, European Journal of Epidemiology 31(4), 337–50.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1980). ‘The Theory of Probable Inference: Neyman, Peirce and Braithwaite’, in Mellor, D. (ed.), Science, Belief and
Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Hurlbert, S. and Lombardi, C. (2009). ‘Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the
NeoFisherian’, Annales Zoologici Fennici 46, 311–49.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). ‘Justify your Alpha’, Nature Human Behavior 2, 168–71.
• Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. New York: Springer.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical
Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago:
University of Chicago Press.
• Mayo, D. (2016). ‘Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R.
L. and Lazar, N. A. 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”’, The American Statistician 70(2)
(supplemental materials).
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge
University Press.
79
80
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J. (ed.) The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of
Mathematical Statistics: 247-275.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of
Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S.
Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands:
Elsevier.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De
Gruyter.
• Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, BJPS 25(1), 1–23.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”, Science 349(6251),
943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J. Neyman &
E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall, CRC press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
• Schnall, S., Benton, J., and Harvey, S. (2008). ‘With a Clean Conscience: Cleanliness Reduces the Severity of
Moral Judgments’, Psychological Science 19(12), 1219–22.
• Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research”, in The Significance Test Controversy, edited
by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
• Simmons, J. Nelson, L. and Simonsohn, U. (2012) “A 21 word solution”, Dialogue: 26(2), 4–7.
• Wagenmakers, E-J., (2007). “A Practical Solution to the Pervasive Problems of P values”, Psychonomic Bulletin &
Review 14(5): 779-804.
• Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, The
American Statistician 70(2), 129–33.
• Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American
Statistician 73(S1): 1-19.
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 

Recently uploaded (20)

Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Neyman-Pearson (N-P) tests
A null and alternative hypothesis H0, H1 that are exhaustive*
H0: μ ≤ 0 vs. H1: μ > 0
“no effect” vs. “some positive effect”
• So this fallacy of rejection H1 => H* is blocked
• Rejecting H0 only indicates statistical alternatives H1 (how discrepant from the null)
*(introduces the Type II error, and power)
13
Casualties from (Pathological) Fisher-N-P Battles
• Neyman and Pearson (N-P) give a rationale for Fisherian tests
• All was hunky dory until Fisher and Neyman began fighting (from 1935), largely due to professional and personality disputes
• A huge philosophical difference is read into their in-fighting
14
Get beyond the “inconsistent hybrid” charge
• Major casualties from the “inconsistent hybrid”: Fisher–inferential; N-P–long-run performance
• Fisherians can’t use power
• N-P testers must adhere to fixed error probabilities (P <); can’t report P-values (P =)
15
Erich Lehmann (Neyman’s student)
“It is good practice to determine … the smallest significance level …at which the hypothesis would be rejected for the given observation. …the P-value gives an idea of how strongly the data contradict the hypothesis [and] enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63–4)
16
Error Statistics
• Fisher and N-P both fall under tools for “appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests, resampling, randomization
17
Both Fisher & N-P: it’s easy to lie with biasing selection effects
• Sufficient finagling—cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence
• Violates severity
18
Severity Requirement: If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H
• Such a test fails a minimal requirement for a stringent or severe test
• N-P and Fisher did not put it in these terms, but our severe tester does
19
Requires a third role for probability
Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times))
Only probabilism is thought to be inferential or evidential
20
What happened to using probability to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly captures assessing error-probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
21
Key to solving a key problem for frequentists
• Why is good performance relevant for inference in the case at hand?
• What bothers you with selective reporting, cherry picking, stopping when the data look good, P-hacking?
• These are not problems about long runs—
22
—we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
23
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe the ways we can be wrong about C
24
A severe test: My weight
Informal example: To test if I’ve gained weight between the time I left for Germany and my return, I use a series of well-calibrated and stable scales, both before leaving and upon my return. All show an over 4 lb gain; none shows a difference in weighing EGEK.
I infer H: I’ve gained at least 4 pounds
25
• Properties of the scales are akin to the properties of statistical tests (performance).
• No one claims the justification is merely long run, such that nothing can be said about my weight.
• We infer something about the source of the readings from the high capability to reveal if any scales were wrong
26
The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong—they work fine with weighing known objects—but this would prevent correctly finding out about weight
• What sort of extraordinary circumstance could cause them all to go astray just when we don’t know the weight of the test object?
• The argument from coincidence goes beyond being highly improbable
27
Battles about roles of probability trace to philosophies of inference
“According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h.” (Alan Musgrave 1974, p. 2)
28
Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I’m using the term), the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
29
Comparative logic of support
• Ian Hacking’s (1965) “Law of Likelihood”: data x support hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (a view he rejects in 1980)
• Any hypothesis that perfectly fits the data is maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129)
30
Error probabilities
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
“In order to fix a limit between ‘small’ and ‘large’ values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis.” (Pearson and Neyman 1967, 106)
31
Hunting for significance (nominal vs. actual)
“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ …The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, 104)
(From Morrison & Henkel’s Significance Test Controversy, 1970!)
32
Spurious P-value
The hunter reports: such results would be difficult to achieve under the assumption of H0
When in fact such results are easy to get under the assumption of H0
• There are many more ways to be wrong with hunting (a different sample space)
• Need to adjust P-values, or at least report the multiple testing
33
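Selvin’s 64 percent is simply the probability of at least one nominally significant result among twenty tests of true nulls. A minimal sketch (Python, assuming the twenty tests are independent, a simplifying assumption):

```python
# Actual vs. nominal significance level when hunting across 20
# independent tests of true null hypotheses, each at alpha = 0.05.
alpha, k = 0.05, 20
actual = 1 - (1 - alpha) ** k
print(f"nominal level: {alpha:.2f}, actual level: {actual:.2f}")  # ~0.64
```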
Some accounts of evidence object:
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or …data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010)
(Goodman is co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
34
On the LP, error probabilities appeal to something irrelevant
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436)
35
Casualty: Error control is lost
“[I]f the sampling plan is ignored, the researcher is able to always reject the null hypothesis, even if it is true. This example is sometimes used to argue that any statistical framework should somehow take the sampling plan into account. …This feeling is, however, contradicted by a mathematical analysis.” (E-J Wagenmakers 2007, 785)
But the “proof” assumes the likelihood principle (LP), by which error probabilities drop out (Edwards, Lindman, and Savage 1963)
A mathematical ‘principle of rationality’ (LP) gets weight over what appears to be common sense
36
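Optional stopping (trying and trying again until the data look good) is easy to simulate. A minimal sketch (Python, assuming normal data with the null true and testing after every 10th observation up to a hypothetical maximum of n = 1000):

```python
import numpy as np
from scipy import stats

# Keep testing a true null (mu = 0) as data accrue, stopping the first
# time p < 0.05: the rejection rate far exceeds the nominal 5%.
rng = np.random.default_rng(1)
trials, hits = 500, 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, 1000)
    for n in range(10, 1001, 10):
        if stats.ttest_1samp(x[:n], 0.0).pvalue < 0.05:
            hits += 1
            break
print(f"rejection rate under H0: {hits / trials:.2f}")  # well above 0.05
```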
At odds with a key way to advance replication: the 21 Word Solution
“We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, 4)
• Replication researchers find flexibility with data-dredging and stopping rules a major source of failed replication (the “forking paths”, Gelman and Loken 2014)
37
Many “reforms” offered as alternatives to significance tests follow the LP
• “Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, and Sellke 2016, 100)
• “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given…. Data should be able to speak for itself.” (Berger and Wolpert 1988, 78; authors of The Likelihood Principle)
• No wonder reformers talk past each other
38
Tension within replication reforms
American Statistical Association 2016 (statement on P-values):
• Principle 4: “P-values and related analyses should not be reported selectively. …Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis.”
• Other approaches: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches… such as likelihood ratios or Bayes Factors” (p. 132)
39
Replication Paradox
• Test Critic: It’s too easy to satisfy standard significance thresholds
• You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)?
• Test Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods that pick up on, adjust for, and block these biasing selection effects.
• Test Critic: Actually, “reforms” recommend methods where the need to alter P-values due to data dredging vanishes
40
Can probabilists still block intuitively unwarranted inferences (without error probabilities)?
• Supplement with subjective beliefs: What do I believe? as opposed to What is the evidence? (Royall)
• Likelihoodists + prior probabilities
41
Problems
• An additional source of flexibility: priors as well as biasing selection effects
• Doesn’t show what researchers had done wrong—a battle of beliefs
• The believability of data-dredged hypotheses is what makes them so seductive
42
Peace Treaty (J. Berger 2003, 2006): use “default” priors: unification
• Default priors are supposed to prevent prior beliefs from influencing the posteriors–data dominant
• “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-subjective priors (no “uninformative” priors)
43
Casualties: Criticisms of data-dredgers lose force
• Wanting to promote an account that downplays error probabilities, Bayesian critics turn to other means–give H0 (no effect) a high prior probability in a Bayesian analysis
• The researcher deserving criticism deflects this, saying: you can always counter an effect this way (the defense violates our minimal principle of evidence)
44
Bem’s “Feeling the Future” 2011: ESP?
• Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture shown in the future
• Some locate the start of the Replication Crisis with Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior to a (point) null hypothesis
45
Bem’s response
“Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here], it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox*: A frequentist [can always] be contradicted by a …Bayesian analysis that concludes that the same data are more likely under the null.” (Bem et al. 2011, 717)
*the Bayes-Fisher disagreement
46
Key to today’s statistics wars: P-values vs posteriors
• The posterior probability Pr(H0|x) can be large while the P-value is small
• To a Bayesian this shows P-values exaggerate the evidence against the null
• Significance testers object to highly significant results being interpreted as no evidence against the null–or even as evidence for it! (a high Type II error)
47
Bayes (Jeffreys)/Fisher disagreement (“spike and smear”)
• The “P-values exaggerate” arguments refer to testing a point null hypothesis, with a lump of prior probability given to H0 (or a tiny region around 0):
Xi ~ N(μ, σ2), H0: μ = 0 vs. H1: μ ≠ 0
• With the rest appropriately spread over the alternative, an α-significant result can correspond to Pr(H0|x) = (1 – α)! (e.g., 0.95)
48
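The disagreement is easy to reproduce numerically. A minimal sketch (Python; the spike prior Pr(H0) = 0.5, σ = 1, and a N(0, 1) smear over the alternative are my illustrative assumptions) shows the posterior on the point null climbing toward (1 – α) at a just-significant result as n grows:

```python
import numpy as np
from scipy.stats import norm

# Spike-and-smear posterior on H0: mu = 0 at a sample mean that is
# just significant at the two-sided 0.05 level, for increasing n.
sigma, tau, prior_h0 = 1.0, 1.0, 0.5
for n in [10, 100, 1000, 10_000]:
    se = sigma / np.sqrt(n)
    xbar = 1.96 * se                          # just-significant mean
    m0 = norm.pdf(xbar, 0.0, se)              # marginal likelihood, H0
    m1 = norm.pdf(xbar, 0.0, np.sqrt(tau**2 + se**2))  # H1: mu ~ N(0,1)
    post_h0 = prior_h0 * m0 / (prior_h0 * m0 + (1 - prior_h0) * m1)
    print(f"n={n:6d}  P-value=0.05  Pr(H0|x)={post_h0:.2f}")
```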
“Concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (Casella and R. Berger 1987, 111), whether in 1- or 2-sided tests
Yet “spike and smear” is the basis for “Redefine Statistical Significance” (Benjamin et al. 2018); an opposing megateam: Lakens et al. (2018)
• Whether tests should use a lower Type I error probability is a separate issue; the problem is supposing there should be agreement between quantities measuring different things
49
Whether P-values exaggerate “depends on one’s philosophy of statistics” (Greenland et al. 2016)
“…based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis… Nonetheless, many other statisticians do not accept these quantities as gold standards…” (Greenland, Senn, Rothman, Carlin, Poole, Goodman, and Altman 2016, 342)
50
A recent example of a battle based on P-values disagreeing with posteriors
• Imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true
• Consider just 2 possibilities: H0: no effect vs. H1: meaningful effect, all else ignored
• Take the prevalence of 90% as Pr(H0) = 0.9, Pr(H1) = 0.1
• Reject H0 with a single (just) 0.05-significant result, with cherry-picking, selection effects
Then it can be shown that most “findings” are false
51
Diagnostic Screening (DS) model of tests
• Pr(H0|Test T rejects H0) > 0.5; really: the prevalence of true nulls among those rejected at the 0.05 level exceeds 0.5. Call this the False Finding Rate (FFR)
• Pr(Test T rejects H0|H0) = 0.05
Criticism: the N-P Type I error probability ≠ FFR (Ioannidis 2005, Colquhoun 2014)
52
FFR: False Finding Rate
With Prev(H0) = 0.9, α = 0.05 and power (1 – β) = 0.8: FFR = 0.36, and the PPV = 0.64
53
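The slide’s numbers follow from simple screening arithmetic; a minimal sketch (Python) reproduces them:

```python
# Diagnostic-screening arithmetic: among rejections at alpha = 0.05
# with power 0.8, when 90% of the tested nulls are true.
prev_h0, alpha, power = 0.9, 0.05, 0.8
false_findings = prev_h0 * alpha          # true nulls rejected
true_findings = (1 - prev_h0) * power     # real effects detected
ffr = false_findings / (false_findings + true_findings)
print(f"FFR = {ffr:.2f}, PPV = {1 - ffr:.2f}")  # FFR = 0.36, PPV = 0.64
```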
But there are major casualties
• Pr(H0|Test T rejects H0) is not a Type I error probability; it transposes the conditional
• It combines crude performance with a probabilist assignment (What’s Prev(H1)?)
• Fallacy of probabilistic instantiation: even if the prevalence of true effects in the urn is 0.1, it does not follow, for a frequentist, that a specific hypothesis gets a probability of 0.1 of being true
54
Some Bayesians reject probabilism (Gelman: falsificationist Bayesian; Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data.” (Gelman and Shalizi 2013, 10, 20)
Falsify (statistically) with small P-values
55
Can’t also claim:
• “No p-value can reveal the plausibility, presence, [or] truth of an association or effect” (Wasserstein et al. 2019, 2)
• I recommend “No p-value ‘by itself’…”
• Granted, P-values don’t give effect sizes
• But who says we only report P-values and stop?
56
Frequentist Family Feud: the Confidence Interval (CI) “Crusade”
Leaders of the “New Statistics” (in psychology) often wage a proxy war for replacing tests with CIs.
57
• The “CIs only” battlers encourage supplementing tests with CIs, which is good; but there have been casualties
• They promote the view that the only alternative to CIs is the abusive cookbook variation of Fisherian significance tests
• But CIs are inversions of tests: a 95% CI contains the parameter values you can’t reject at the 5% level
• The same man who co-developed N-P tests developed CIs (1930): Neyman
58
Duality of tests and CIs (estimating μ in a Normal distribution)
μ > M0 – 1.96σ/√n (CI-lower)
μ < M0 + 1.96σ/√n (CI-upper)
M0: the observed sample mean
CI-lower: the value of μ that M0 is statistically significantly greater than at P = 0.025
CI-upper: the value of μ that M0 is statistically significantly lower than at P = 0.025
• You could get a CI by asking for these values, and learn indicated effect sizes with tests
59
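The duality can be checked directly. A minimal sketch (Python; the numbers M0 = 0.5, σ = 1, n = 25 are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Each 95% CI bound is the mu that the observed mean M0 is just
# significantly greater (or lower) than at P = 0.025.
M0, sigma, n = 0.5, 1.0, 25
se = sigma / np.sqrt(n)
lower, upper = M0 - 1.96 * se, M0 + 1.96 * se
p_vs_lower = 1 - norm.cdf((M0 - lower) / se)  # test of mu = CI-lower
p_vs_upper = norm.cdf((M0 - upper) / se)      # test of mu = CI-upper
print(f"95% CI: ({lower:.3f}, {upper:.3f}); "
      f"P-values at the bounds: {p_vs_lower:.3f}, {p_vs_upper:.3f}")
```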
CIs (as standardly used) inherit problems of N-P tests
• Too dichotomous: in/out
• Justified in terms of long-run coverage
• All members of the CI treated on par
• Fixed confidence levels (need several benchmarks)
60
The severe tester reformulates tests with a discrepancy γ from H0
• Severity function: SEV(Test T, data x, claim C)
• Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted
61
To avoid fallacies of rejection (e.g., the magnitude error)
• If you very probably would have observed a more impressive (smaller) P-value were μ = μ1 (μ1 = μ0 + γ), the data are poor evidence that μ > μ1.
• To connect with power: suppose M0 just reaches statistical significance at level P, say 0.025, and compute power in relation to this cut-off
• If the power against μ1 is high, then the data are poor evidence that μ > μ1.
62
[Figure: Power vs Severity for μ > μ1]
63
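The contrast in the figure can be computed. A minimal sketch (Python), using the example given in the notes on slides (H0: μ ≤ 0, α = 0.025, n = 25, σx̄ = 0.2) and a hypothetical just-significant observed mean:

```python
from scipy.stats import norm

# Severity for "mu > mu1" uses the actual outcome xbar:
#   SEV(mu > mu1) = Pr(Xbar <= xbar; mu = mu1),
# whereas power is computed at the fixed cut-off (1.96 * se here).
se = 0.2                          # sigma/sqrt(n) from the notes
cutoff = 1.96 * se                # rejection threshold, alpha = 0.025
xbar = 0.41                       # hypothetical just-significant mean
for mu1 in [0.0, 0.2, 0.4, 0.6]:
    sev = norm.cdf((xbar - mu1) / se)
    power = 1 - norm.cdf((cutoff - mu1) / se)
    print(f"mu1={mu1:.1f}  SEV(mu > mu1)={sev:.2f}  power={power:.2f}")
```

Note how high power against μ1 = 0.6 goes with low severity for μ > 0.6, matching the slide’s warning that high power against μ1 makes the data poor evidence that μ > μ1.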
Similarly, severity tells us:
• an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one that doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off with burnt toast]
64
What about fallacies of non-significant results?
• They don’t warrant a 0 discrepancy (there are discrepancies the test had low probability of detecting)
• Nor are they uninformative: we can find an upper bound μ1 such that SEV(μ < μ1) is high
• It’s with negative results (P-values not small) that severity goes in the same direction as power–provided the model assumptions hold
65
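For the negative-result direction, the same function gives upper bounds. A minimal sketch (Python; the non-significant observed mean x̄ = 0.2, with σx̄ = 0.2 as before, is hypothetical):

```python
from scipy.stats import norm

# For a non-significant result, find discrepancies mu1 that can be
# ruled out: SEV(mu < mu1) = Pr(Xbar > xbar; mu = mu1).
se, xbar = 0.2, 0.2               # hypothetical non-significant mean
for mu1 in [0.2, 0.4, 0.6, 0.8]:
    sev = 1 - norm.cdf((xbar - mu1) / se)
    print(f"mu1={mu1:.1f}  SEV(mu < mu1)={sev:.2f}")
```

Here SEV(μ < 0.6) is high, so discrepancies of 0.6 or more are well ruled out, while the data say little against smaller discrepancies.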
Severity applies informally: to probe the links from statistical H1 to substantive H*
• A casualty of focusing on whether the replication gets low P-values:
• much replication research ignores the larger question: are they even measuring the phenomenon they intend?
• A failed replication is often construed as: there’s a real effect, but it’s smaller
66
OSC: Reproducibility Project: Psychology: 2011–15 (Science 2015) (led by Brian Nosek, U. VA)
• Crowd-sourced: replicators chose 100 articles from three journals (2008)
67
One of the non-replications, cleanliness and morality: do cleanliness primes make you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car.”
68
“Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Bartlett, Chronicle of Higher Education)
Is the cleanliness prime responsible?
69
Nor is there discussion of the multiple testing in the original study
• Only 1 of the 6 dilemmas in the original study showed statistically significant differences in degree of wrongness–and not the dog one
• No differences on 9 different emotions (relaxed, angry, happy, sad, afraid, depressed, disgusted, upset, and confused)
• Similar studies appear in experimental philosophy: philosophers of science need to critique them
70
The statistics wars: Errors & casualties
• Mounting failures of replication …give a new urgency to critically appraising proposed statistical reforms.
• While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties.
• The philosophical presuppositions …remain largely hidden.
• Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science.
71
Efforts of replication researchers and open science advocates are diminished when
• attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance; no evidence against isn’t evidence for),
• erroneous understanding of basic statistical terms goes uncorrected, and
• bandwagon effects lead to popular reforms that downplay the importance of error probability control.
These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
72
NOTES ON SLIDES
SLIDE #57. The term “confidence interval crusade” is from Hurlbert and Lombardi (2009).
SLIDE #59. We’re really involved with two 1-sided tests: “More commonly,” Cox and Hinkley (1974, p. 79) note, “large and small values of [the test statistic] indicate quite different kinds of departure… . Then it is best to regard the tests in the two directions as two different tests… . one to examine the possibility” μ > μ0, the other for μ < μ0. “…look at both [p+ and p–] and select the more significant of the two, i.e., the smaller. To preserve the physical meaning of the significance level, we make a correction for selection.” (ibid., p. 106) In the continuous case, we double the separate p-values.
SLIDE #63. This figure alludes to a specific example, but the point is general. The example is H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σx̄ = 0.2.
SLIDE #65. Note: this is akin to power analysis (first used by Neyman and popularized by Jacob Cohen), except that SEV uses the actual outcome rather than the one that just misses rejection. Focusing on Fisherian tests, power considerations are absent. Note that severity is not the same as what is called post-hoc power, where the observed difference becomes the hypothesis for which power is computed. I call that “shpower.” Another serious casualty is to warn against “post hoc power” when “shpower” is meant.
74
Jimmy Savage on the LP
“According to Bayes’ theorem, …if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17)
75
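Savage’s proportionality holds in the classic stopping-rule case: the same 3 successes in 12 Bernoulli trials, whether n = 12 was fixed (binomial) or sampling stopped at the 3rd success (negative binomial). A minimal sketch (Python; the numbers are my illustration, not Savage’s):

```python
from scipy.stats import binom, nbinom

# Binomial and negative binomial likelihoods for 3 successes in 12
# trials differ only by a constant factor in p, so the LP says the
# stopping rule makes no evidential difference.
x, n, r = 3, 12, 3
for p in [0.1, 0.25, 0.5, 0.75]:
    lik_b = binom.pmf(x, n, p)          # C(12,3) p^3 (1-p)^9
    lik_nb = nbinom.pmf(n - r, r, p)    # C(11,2) p^3 (1-p)^9
    print(f"p={p:.2f}  ratio = {lik_b / lik_nb:.3f}")  # constant 4.000
```

Error probabilities, by contrast, do depend on the stopping rule, which is where the error statistician parts company with the LP.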
FEV: Frequentist Principle of Evidence (Mayo and Cox 2006); SEV: severity (Mayo 1991, Mayo and Spanos 2006)
FEV/SEV: A small P-value indicates a discrepancy γ from H0, if and only if there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent.
FEV/SEV: A moderate P-value indicates the absence of a discrepancy γ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
76
Researcher degrees of freedom in the Schnall study
After the priming task, participants rated six moral dilemmas: “Dog” (eating one’s dead dog), “Trolley” (switching the tracks of a trolley to kill one workman instead of five), “Wallet” (keeping money inside a found wallet), “Plane Crash” (killing a terminally ill plane crash survivor to avoid starvation), “Resume” (putting false information on a resume), and “Kitten” (using a kitten for sexual arousal). Participants rated how wrong each action was, from 0 (perfectly OK) to 9 (extremely wrong).
77
References
• Amrhein, V., Greenland, S. and McShane, B. (2019). “Comment: Retire Statistical Significance”, Nature 567: 305–7.
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by Ian Hacking)”, British Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of Higher Education (online), June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J. and Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology 72: 90–103.
• Bem, D. (2011). “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology 100(3), 407–25.
• Bem, D., Utts, J. and Johnson, W. (2011). “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4), 716–19.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). “Redefine Statistical Significance”, Nature Human Behaviour 2, 6–10.
• Berger, J. O. (2003). “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder”, Statistical Science 18(1), 1–12; 28–32.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis”, Bayesian Analysis 1(3), 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed., Vol. 6, Lecture Notes–Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor)”, Nature 225(5237), 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of the American Statistical Association 82(397), 106–11.
• Colquhoun, D. (2014). “An Investigation of the False Discovery Rate and the Misinterpretation of P-values”, Royal Society Open Science 1(3), 140216.
• Cook, J., Fergusson, D., Ford, I., Gonen, M., Kimmelman, J., Korn, E. and Begg, C. (2019). “There Is Still a Place for Significance Testing in Clinical Trials”, Clinical Trials 16(3), 223–4.
• Cox, D. R. and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference”, in Mayo, D. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press, 276–304.
• Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3), 193–242.
• Fisher, R. A. (1947). The Design of Experiments, 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
• Goodman, S. N. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine 130, 1005–13.
• Greenland, S., Senn, S., Rothman, K., et al. (2016). “Statistical Tests, P values, Confidence Intervals, and Power: A Guide to Misinterpretations”, European Journal of Epidemiology 31(4), 337–50.
• Hacking, I. (1965). Logic of Statistical Inference, Cambridge: Cambridge University Press.
• Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, in Mellor, D. (ed.), Science, Belief and Behaviour: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, 141–60.
• Hurlbert, S. and Lombardi, C. (2009). “Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the NeoFisherian”, Annales Zoologici Fennici 46, 311–49.
• Ioannidis, J. (2005). “Why Most Published Research Findings Are False”, PLoS Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). “Justify Your Alpha”, Nature Human Behaviour 2, 168–71.
• Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed., New York: Springer.
• Lindley, D. V. (1971). “The Estimation of Many Parameters”, in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical Inference, Toronto: Holt, Rinehart and Winston, 435–55.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press.
• Mayo, D. G. (2016). “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R. L. and Lazar, N. A. 2016, ‘The ASA’s Statement on p-Values: Context, Process, and Purpose’”, The American Statistician 70(2) (supplemental materials).
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference”, in Rojo, J. (ed.), The Second Erich L. Lehmann Symposium: Optimality, Lecture Notes–Monograph Series, Vol. 49, Institute of Mathematical Statistics, 247–75.
• Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction”, British Journal for the Philosophy of Science 57(2), 323–57.
• Mayo, D. G. and Spanos, A. (2011). “Error Statistics”, in Bandyopadhyay, P. and Forster, M. (eds.), Philosophy of Statistics, Handbook of the Philosophy of Science, Vol. 7, The Netherlands: Elsevier, 152–98.
• Morrison, D. E. and Henkel, R. E. (eds.) (1970). The Significance Test Controversy: A Reader, Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, British Journal for the Philosophy of Science 25(1), 1–23.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”, Science 349(6251), 943–51.
• Pearson, E. S. and Neyman, J. (1967). “On the Problem of Two Samples”, in Joint Statistical Papers by J. Neyman & E. S. Pearson, Berkeley: University of California Press, 99–115.
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm, Boca Raton, FL: Chapman and Hall/CRC Press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion, London: Methuen.
• Schnall, S., Benton, J. and Harvey, S. (2008). “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments”, Psychological Science 19(12), 1219–22.
• Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research”, in Morrison, D. and Henkel, R. (eds.), The Significance Test Controversy, Chicago: Aldine De Gruyter, 94–106.
• Simmons, J., Nelson, L. and Simonsohn, U. (2012). “A 21 Word Solution”, Dialogue 26(2), 4–7.
• Wagenmakers, E-J. (2007). “A Practical Solution to the Pervasive Problems of P values”, Psychonomic Bulletin & Review 14(5), 779–804.
• Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician 70(2), 129–33.
• Wasserstein, R., Schirm, A. and Lazar, N. (2019). Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1), 1–19.

Editor's Notes

  1. They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences.
  2. Should probability enter to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Or to capture degrees of belief about claims? (performance or probabilism) The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious that everyone wants to believe we are long past them. We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method “works.” The truth is, long-standing battles still simmer below the surface of today’s debates about scientific integrity and irreproducibility. Reluctance to reopen wounds from old battles has allowed them to fester. The reconciliations and unifications have been revealed to have serious problems, and there’s little agreement on which to use or how to interpret them. As for eclecticism, it’s often not clear what is even meant by “works.” The presumption that all we need is an agreement on numbers–never mind if they’re measuring different things–leads to statistical schizophrenia.
  3. What’s behind the constant drumbeat today that science is in crisis? The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed. We set sail with a simple tool: if little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That’s what it means to view statistical inference as severe testing. In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction. Regardless of the type of claim, you don’t have evidence for it if nothing has been done to have found it flawed.
  4. With Bayes factors, one can make the null hypothesis comparatively more probable than a chosen alternative.