The ASA (2016) Statement on P-values:
How to Stop Refighting
the Statistics Wars
The CLA Quantitative Methods
Collaboration Committee &
Minnesota Center for Philosophy of Science
April 8, 2016
Deborah G Mayo
Brad Efron
“By and large, Statistics is a prosperous and
happy country, but it is not a completely
peaceful one. Two contending philosophical
parties, the Bayesians and the frequentists,
have been vying for supremacy over the past
two-and-a-half centuries. …Unlike most
philosophical arguments, this one has
important practical consequences. The two
philosophies represent competing visions of
how science progresses….” (2013, p. 130)
2
Today’s Practice: Eclectic
O Use of eclectic tools, little handwringing
about foundations
O Bayes-frequentist unifications
O Scratch a bit below the surface
foundational problems emerge….
O Not just 2: family feuds within (Fisherian,
Neyman-Pearson; tribes of Bayesians,
likelihoodists)
3
Why are the statistics wars more
serious today?
O Replication crises led to programs to
restore credibility: fraud busting,
reproducibility studies
O Taskforces, journalistic reforms, and
debunking treatises
O Proposed methodological reforms––many
welcome (preregistration)–some quite
radical
4
I was a philosophical observer at the
ASA P-value “pow wow”
5
“Don’t throw out the error control baby
with the bad statistics bathwater”
The American Statistician
6
O “Statistical significance tests are a small
part of a rich set of:
“techniques for systematically appraising
and bounding the probabilities … of
seriously misleading interpretations of
data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or
sampling theory)
7
One Rock in a Shifting Scene
O “Birnbaum calls it the ‘one rock in a
shifting scene’ in statistical practice
O “Misinterpretations and abuses of tests,
warned against by the very founders of
the tools, shouldn’t be the basis for
supplanting them with methods unable or
less able to assess, control, and alert us
to erroneous interpretations of data”
8
Error Statistics
O Statistics: Collection, modeling, drawing
inferences from data to claims about
aspects of processes
O The inference may be in error
O It’s qualified by a claim about the
method’s capabilities to control and alert
us to erroneous interpretations (error
probabilities)
9
“p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function t = t(y) of the data,
to be called the test statistic, such that
• the larger the value of t the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t as
p = p(t) = P(T ≥ t; H0)”
(Mayo and Cox 2006, p. 81)
10
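For the simplest case, a standard-normal test statistic under H0, the definition above can be sketched in a few lines (an illustrative sketch, not from the deck; the function name is mine):

```python
import math

def normal_p_value(t: float) -> float:
    """p = P(T >= t; H0) when T ~ N(0, 1) under H0 (survival function)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# The larger the observed t, the smaller p, the more inconsistent the
# data are with H0.
p = normal_p_value(1.96)  # about 0.025
```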
O “Clearly, if even larger differences
than t occur fairly frequently under H0
(p-value is not small), there’s scarcely
evidence of incompatibility
O But a small p-value doesn’t warrant
inferring a genuine effect H, let alone a
scientific conclusion H*–as the ASA
document correctly warns (Principle 3)”
11
A Paradox for Significance Test Critics
Critic: It’s much too easy to get a small P-value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
12
O R.A. Fisher: it’s easy to lie with statistics
by selective reporting (he called it the
“political principle”)
O Sufficient finagling—cherry-picking, P-hacking,
significance seeking—may practically guarantee a
researcher’s preferred claim C gets support, even
if it’s unwarranted by evidence
O Note: Rejecting a null taken as support for
some non-null claim C
13
Severity Requirement:
O If data x0 agree with a claim C, but the
test procedure had little or no capability
of finding flaws with C (even if the claim
is incorrect), then x0 provide poor
evidence for C
O Such a test fails a minimal requirement
for a stringent or severe test
O My account: severe testing based on
error statistics
14
Two main views of the role of
probability in inference (not in ASA doc)
O Probabilism. To assign a degree of
probability, confirmation, support or belief
in a hypothesis, given data x0. (e.g.,
Bayesian, likelihoodist)—with regard for
inner coherency
O Performance. Ensure long-run reliability
of methods, coverage probabilities
(frequentist, behavioristic Neyman-Pearson)
15
What happened to using probability to assess
the error probing capacity by the severity
criterion?
O Neither “probabilism” nor “performance”
directly captures it
O Good long-run performance is a
necessary, not a sufficient, condition for
severity
O That’s why frequentist methods can be
shown to have howlers
16
O Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
O It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
17
A claim C is not warranted _______
O Probabilism: unless C is true or probable
(gets a probability boost, is made
comparatively firmer)
O Performance: unless it stems from a
method with low long-run error
O Probativism (severe testing) something
(a fair amount) has been done to probe
ways we can be wrong about C
18
O If you assume probabilism is required for
inference, error probabilities are relevant
for inference only by misinterpretation
False!
O I claim, error probabilities play a crucial
role in appraising well-testedness
O It’s crucial to be able to say, C is highly
believable or plausible but poorly tested
O Probabilists can allow for the distinct task
of severe testing (you may not have to
entirely take sides in the stat wars)
19
The ASA doc gives no sense of
different tools for different jobs
O “To use an eclectic toolbox in statistics,
it’s important not to expect an agreement
on numbers from methods evaluating
different things
O A p-value isn’t ‘invalid’ because it does
not supply “the probability of the null
hypothesis, given the finding” (the
posterior probability of H0) (Trafimow and
Marks*, 2015)
*Editors of a journal, Basic and Applied Social
Psychology
20
O ASA Principle 2 says a p-value ≠
posterior but one doesn’t get the sense of
its role in error probability control
O It’s not that I’m keen to defend many
common uses of significance tests
O The criticisms are often based on
misunderstandings; consequently so are
many “reforms”
21
Biasing selection effects:
O One function of severity is to identify
problematic selection effects (not all are)
O Biasing selection effects: when data or
hypotheses are selected or generated (or
a test criterion is specified), in such a way
that the minimal severity requirement is
violated, seriously altered or incapable of
being assessed
O Picking up on these alterations is
precisely what enables error statistics to
be self-correcting—
22
Nominal vs actual significance levels
The ASA correctly warns that “[c]onducting
multiple analyses of the data and reporting
only those with certain p-values” leads to
spurious p-values (Principle 4)
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at
the 5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent! (Selvin, 1970, p. 104)
From Morrison & Henkel’s The Significance Test
Controversy (1970!)
23
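Selvin’s 64 percent can be checked directly: hunting through twenty null differences and testing the one that "seems large enough" means the actual level is the chance that at least one of 20 true nulls reaches p < .05 (a quick sketch, assuming independence; the variable names are mine):

```python
# Nominal vs. actual: each test has level 0.05, but selecting the most
# impressive of 20 looks changes the relevant sampling distribution.
alpha = 0.05
k = 20
actual_level = 1 - (1 - alpha) ** k
print(round(actual_level, 2))  # 0.64, not 0.05
```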
O They were clear on the fallacy: blurring
the “computed” or “nominal” significance
level, and the “actual” level
O There are many more ways you can be
wrong with hunting (different sample
space)
O Here’s a case where a p-value report is
invalid
24
You report: Such results would be difficult
to achieve under the assumption of H0
When in fact such results are common
under the assumption of H0
(Formally):
O You say Pr(P-value < Pobs; H0) ≈ Pobs (small)
O But in fact Pr(P-value < Pobs; H0) = high
25
O Nowadays, we’re likely to see the tests
blamed
O My view: Tests don’t kill inference, people
do
O Even worse are those statistical accounts
where the abuse vanishes!
26
On some views, taking account of biasing
selection effects “defies scientific sense”
Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as
they are more commonly called, data dredging
and peeking at the data. The frequentist solution
to both problems involves adjusting the P-value…But
adjusting the measure of evidence
because of considerations that have nothing
to do with the data defies scientific sense,
belies the claim of ‘objectivity’ that is often made
for the P-value” (Goodman 1999, p. 1010)
(To his credit, he’s open about this; heads the
Meta-Research Innovation Center at Stanford)
27
Technical activism isn’t free of philosophy
Ben Goldacre (of Bad Science) in a 2016
Nature article, is puzzled that bad statistical
practices continue even in the face of the
new "technical activism”:
The editors at Annals of Internal
Medicine,… repeatedly (but confusedly)
argue that it is acceptable to identify
“prespecified outcomes” [from results]
produced after a trial began; ….they say
that their expertise allows them to
permit — and even solicit —
undeclared outcome-switching
28
His paper: “Make journals report clinical
trials properly”
O He shouldn’t close his eyes to the
possibility that some of the pushback he’s
seeing has a basis in statistical
philosophy!
29
Likelihood Principle (LP)
The vanishing act links to a pivotal
disagreement in the philosophy of statistics
battles
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
P(x0;H1)/P(x0;H0)
The data x0 are fixed, while the hypotheses
vary
30
Jimmy Savage on the LP:
O “According to Bayes' theorem,…. if y is
the datum of some other experiment, and
if it happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have
exactly the same thing to say about the
values of µ…” (Savage 1962, p. 17)
31
All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space (Lindley 1971, p. 436)
The LP implies…the irrelevance of
predesignation, of whether a hypothesis
was thought of before hand or was
introduced to explain known effects
(Rosenkrantz, 1977, p. 122)
32
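A classic illustration of the sample-space dependence Lindley points to (standard in this literature, though not in the deck): 9 heads in 12 Bernoulli trials gives proportional likelihoods whether n = 12 was fixed (binomial) or sampling stopped at the 3rd tail (negative binomial), so the LP deems the evidence identical; yet the two designs give different p-values for H0: θ = 0.5 against θ > 0.5:

```python
from math import comb

# Binomial: n = 12 fixed, p-value = P(9 or more heads; theta = 0.5).
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12

# Negative binomial: sample until the 3rd tail; p-value = P(9 or more
# heads before the 3rd tail; theta = 0.5). The tail sum converges fast.
p_negbinom = sum(comb(k + 2, 2) * 0.5 ** (k + 3) for k in range(9, 500))

print(round(p_binomial, 4), round(p_negbinom, 4))  # 0.073 vs. 0.0327
```

Error statistics distinguishes the two reports precisely because the sampling distributions differ, which is what the LP declares irrelevant.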
Paradox of Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
33
“Trying and trying again”
O Keep sampling until H0 is rejected at
0.05 level
i.e., keep sampling until |M| ≥ 1.96 σ/√n
O Trying and trying again: Having failed
to rack up a 1.96 σ difference after 10
trials, go to 20, 30 and so on until
obtaining a 1.96 σ difference
34
Nominal vs. Actual
significance levels again:
O With n fixed the Type 1 error probability is
0.05
O With this stopping rule the actual
significance level differs from, and will be
greater than 0.05
O Violates Cox and Hinkley’s (1974) “weak
repeated sampling principle”
35
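The inflation is easy to see in a rough simulation of "trying and trying again" (a hypothetical sketch with standard-normal data and a cap on n; the exact rate depends on the cap):

```python
import math
import random

def rejects_eventually(n_max: int, rng: random.Random) -> bool:
    """Keep sampling until |mean| >= 1.96/sqrt(n), up to n_max draws."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)  # data generated under H0: mu = 0
        if abs(total / n) >= 1.96 / math.sqrt(n):
            return True
    return False

rng = random.Random(0)
trials = 2000
rate = sum(rejects_eventually(100, rng) for _ in range(trials)) / trials
# With the stopping rule, the actual Type 1 error rate comes out several
# times the nominal 0.05 (and it grows toward 1 as the cap is raised).
```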
1959 Savage Forum
Jimmy Savage audaciously declared:
“optional stopping is no sin”
so the problem must be with significance
levels
Peter Armitage:
“thou shalt be misled”
if thou dost not know the person tried and
tried again (p. 72)
36
O “The ASA correctly warns that
“[c]onducting multiple analyses of the
data and reporting only those with
certain p-values” leads to spurious p-values
(Principle 4)
O However, the same p-hacked hypothesis
can occur in Bayes factors; optional
stopping can be guaranteed to exclude
true nulls from HPD intervals
37
With One Big Difference:
O “The direct grounds to criticize inferences
as flouting error statistical control is lost
O They condition on the actual data,
O error probabilities take into account other
outcomes that could have occurred but
did not (sampling distribution)”
38
Tension: Does Principle 4 Hold for
Other Approaches?
O “In view of the prevalent misuses of and
misconceptions concerning p-values, some
statisticians prefer to supplement or even
replace p-values with other approaches”
(They include Bayes factors, likelihood ratios,
as “alternative measures of evidence”)
O They appear to extend “full reporting and
transparency” (principle 4) to all methods.
O Some controversy: should it apply only to
“p-values and related statistics”?
39
How might probabilists block intuitively
unwarranted inferences (without error
probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the
interpretation of the evidence, we wouldn’t
declare there’s statistical evidence of some
unbelievable claim (distinguishing shades of
grey and being politically moderate,
ovulation and voting preferences)
40
Rescued by beliefs
O That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
O Besides, researchers sincerely believe
their hypotheses
O So now you’ve got two sources of
flexibility, priors and biasing selection
effects
41
No help with our most important
problem
O How to distinguish the warrant for a
single hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
42
Most Bayesians are “conventional”
O Eliciting subjective priors too difficult,
scientists reluctant to allow subjective
beliefs to overshadow data
O Default, or reference priors are supposed
to prevent prior beliefs from influencing
the posteriors (O-Bayesians, 2006)
43
O A classic conundrum: no general non-informative
prior exists, so most are conventional
O “The priors are not to be considered
expressions of uncertainty, ignorance, or
degree of belief. Conventional priors may
not even be probabilities…” (Cox and
Mayo 2010, p. 299)
O Prior probability: An undefined
mathematical construct for obtaining
posteriors (giving highest weight to data,
or satisfying invariance, or matching or….)
44
Conventional Bayesian Reforms are
touted as free of selection effects
O Jim Berger gives us “conditional error
probabilities” CEPs
O “[I]t is considerably less important to disabuse
students of the notion that a frequentist error
probability is the probability that the hypothesis
is true, given the data”, since under his new
definition “a CEP actually has that
interpretation”
O “CEPs do not depend on the stopping rule”
(“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
2003)
45
By and large the ASA doc highlights
classic foibles
“In relation to the test of significance, we
may say that a phenomenon is
experimentally demonstrable when we
know how to conduct an experiment
which will rarely fail to give us a
statistically significant result”
(Fisher 1935, p. 14)
(“isolated” low P-value ≠> H: statistical
effect)
46
Statistical ≠> substantive (H ≠> H*)
“[A]ccording to Fisher, rejecting the null
hypothesis is not equivalent to accepting
the efficacy of the cause in question. The
latter...requires obtaining more significant
results when the experiment, or an
improvement of it, is repeated at other
laboratories or under other conditions”
(Gigerenzer et al. 1989, pp. 95-6)
47
O Flaws in alternative H* have not been probed
by the test,
O The inference from a statistically significant
result to H* fails to pass with severity
O “Merely refuting the null hypothesis is too
weak to corroborate” substantive H*, “we
have to have ‘Popperian risk’, ‘severe test’
[as in Mayo], or what philosopher Wesley
Salmon called ‘a highly improbable
coincidence’” (Meehl and Waller 2002,
p.184)
48
O Encouraged by something called NHSTs (null
hypothesis significance tests)–that supposedly
allow moving from statistical to substantive
O If defined that way, they exist only as
abuses of tests
O ASA doc ignores Neyman-Pearson (N-P)
tests
49
Neyman-Pearson (N-P) Tests:
Null and alternative hypotheses
H0, H1 that exhaust the parameter
space
O So the fallacy of rejection H → H* is
impossible
O Rejecting the null only indicates statistical
alternatives
50
P-values Don’t Report Effect Sizes
Principle 5
Who ever said to just report a P-value?
O “Tests should be accompanied by
interpretive tools that avoid the fallacies of
rejection and non-rejection. These
correctives can be articulated in either
Fisherian or Neyman-Pearson terms”
(Mayo and Cox 2006, Mayo and Spanos
2006)
51
To Avoid Inferring a Discrepancy
Beyond What’s Warranted:
the large-n problem
O Severity tells us: an α-significant
difference is indicative of less of a
discrepancy from the null if it results from
a larger (n1) rather than a smaller (n2)
sample size (n1 > n2)
52
O What’s more indicative of a large effect
(fire), a fire alarm that goes off with burnt
toast or one so insensitive that it doesn’t
go off unless the house is fully ablaze?
[The larger sample size is like the one that
goes off with burnt toast]
53
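The fire-alarm point can be put numerically (a hypothetical sketch with σ = 1, probing the claim μ > 0.1; the severity computation follows the Mayo and Spanos 2006 definition): a just-significant mean at n = 100 is 0.196 and warrants μ > 0.1 with high severity, while a just-significant mean at n = 10,000 is only 0.0196 and does not.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def severity_mu_greater(mu1: float, m_obs: float, n: int,
                        sigma: float = 1.0) -> float:
    """SEV(mu > mu1) after a significant result: P(M < m_obs; mu = mu1)."""
    return phi((m_obs - mu1) * math.sqrt(n) / sigma)

results = {}
for n in (100, 10_000):
    m_obs = 1.96 / math.sqrt(n)  # a sample mean just at the 1.96 cutoff
    results[n] = (m_obs, severity_mu_greater(0.1, m_obs, n))
# At n = 100 the claim mu > 0.1 passes with severity about 0.83;
# at n = 10,000 the same alpha-significant result gives it severity near 0.
```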
What About the Fallacy of
Non-Significant Results?
O Non-Replication occurs with non-significant
results, but there’s much confusion in
interpreting them
O No point in running replication research if
you view negative results as uninformative
54
O They don’t warrant 0 discrepancy
O Use the same severity reasoning to rule
out discrepancies that very probably
would have resulted in a larger difference
than observed; set upper bounds
O If you very probably would have observed
a more impressive (smaller) p-value than
you did, if μ > μ1 (where μ1 = μ0 + γ), then
the data are good evidence that μ < μ1
O Akin to power analysis (Cohen, Neyman)
but sensitive to x0
55
Improves on Confidence Intervals
“This is akin to confidence intervals (which
are dual to tests) but we get around their
shortcomings:
O We do not fix a single confidence level,
O The evidential warrant for different points
in any interval is distinguished”
O Go beyond “performance goal” to give
inferential construal
56
Simple Fisherian Tests Have Important
Uses
O Model validation:
George Box calls for ecumenism because
“diagnostic checks and tests of fit” he
argues “require frequentist theory
significance tests for their formal
justification” (Box 1983, p. 57)
57
“What we are advocating … is what Cox
and Hinkley (1974) call ‘pure significance
testing’, in which certain of the model’s
implications are compared directly to the
data, rather than entering into a contest
with some alternative model” (Gelman &
Shalizi p. 20)
O Fraudbusting and forensics: Finding
Data too good to be true (Simonsohn)
58
Concluding remarks: Reforms without
Philosophy of Statistics are Blind
O I end my commentary: “Failing to understand the
correct (if limited) role of simple significance tests
threatens to throw the error control baby out with the
bad statistics bathwater”
O Avoid refighting the same wars, or banning methods
based on cookbook abuses long lampooned
O Makes no sense to banish tools for testing
assumptions the other methods require and cannot
perform
59
O Don’t expect an agreement on numbers
from methods evaluating different things
O Recognize different roles of probability:
probabilism, long-run performance,
probativism (severe testing)
O Probabilisms may enable rather than
block illicit inferences due to biasing
selection effects
O Main paradox of the “replication crisis”
60
Paradox of Replication
O Critic: It’s too easy to satisfy standard significance
thresholds
O You: Why do replicationists find it so hard to achieve
them with preregistered trials?
O Critic: Most likely the initial studies were guilty of
p-hacking, cherry-picking, significance seeking, QRPs
O You: So, replication researchers want methods that
pick up on and block these biasing selection effects
O Critic: Actually the “reforms” recommend methods
where selection effects make no difference
61
Either you care about error probabilities
or not
O If not, experimental design principles (e.g.,
RCTs) may well go by the board
O Not enough to have a principle: we must be
transparent about data-dependent selections
O Your statistical account needs a way to make
use of the information
O “Technical activists” are not free of conflicts of
interest and of philosophy
62
Granted, error statistical improvements
are needed
O An inferential construal of error probabilities
wasn’t clearly given (Birnbaum)-my goal
O It’s not long-run error control (performance),
but severely probing flaws today
O I also grant an error statistical account needs
to say more about how to use
background information
63
Future ASA project
O Look at the “other approaches” (Bayes
factors, LRs, Bayesian updating)
O What is it for a replication to succeed or
fail on those approaches?
(can’t be just a matter of prior beliefs in
the hypotheses)
64
Finally, it should be recognized that
often better statistics cannot help
O Rather than search for more “idols”, do
better science, get better experiments and
theories
O One hypothesis must always be: our
results point to the inability of our study to
severely probe the phenomenon of
interest
65
Be ready to admit questionable science
O The scientific status of an inquiry is
questionable if it is unable to distinguish a
poorly run study from a poor hypothesis
O Continually violate minimal requirements
for severe testing
66
Non-replications often construed as
simply weaker effects
Two that didn’t replicate in psychology:
O Belief in free will and cheating
O Physical distance (of points plotted) and
emotional closeness
67
68
The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are
with a specified statistical model
O (2) P-values do not measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy
decisions should not be based only on whether a p-value
passes a specific threshold
O (4) Proper inference requires full reporting and
transparency
O (5) A p-value, or statistical significance, does not
measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure
of evidence regarding a model or hypothesis
69
Mayo and Cox (2010): Frequentist Principle of
Evidence (FEV); SEV: Mayo and Spanos (2006)
FEV/SEV: insignificant result: A moderate P-value
is evidence of the absence of a discrepancy δ
from H0, only if there is a high probability the
test would have given a worse fit with H0 (i.e.,
d(X) > d(x0) ) were a discrepancy δ to exist
FEV/SEV significant result: d(x0) is evidence of a
discrepancy δ from H0, if and only if there is a high
probability the test would have yielded d(X) < d(x0)
were a discrepancy as large as δ absent
70
Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0
σ known
(FEV/SEV): If d(x0) is not statistically significant,
then
μ < M0 + kε σ/√n passes the test T+ with
severity (1 – ε)
(FEV/SEV): If d(x0) is statistically significant, then
μ > M0 – kε σ/√n passes the test T+ with
severity (1 – ε)
where P(d(X) > kε) = ε
71
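These bounds are simple to compute in the σ-known case. A sketch with illustrative numbers (M0 = 0.2, n = 100, σ = 1, ε = 0.025; the function names are mine):

```python
import math
from statistics import NormalDist

def severity_upper_bound(m0: float, n: int, sigma: float, eps: float) -> float:
    """After a non-significant result with observed mean m0, the claim
    mu < m0 + k_eps*sigma/sqrt(n) passes T+ with severity (1 - eps),
    where P(d(X) > k_eps) = eps."""
    k_eps = NormalDist().inv_cdf(1 - eps)
    return m0 + k_eps * sigma / math.sqrt(n)

def severity_of_upper_bound(mu1: float, m0: float, n: int, sigma: float) -> float:
    """SEV(mu < mu1) = P(M > m0; mu = mu1): the probability of a worse
    fit with H0 than observed, were the discrepancy mu1 to exist."""
    return 1.0 - NormalDist(mu1, sigma / math.sqrt(n)).cdf(m0)

bound = severity_upper_bound(0.2, 100, 1.0, 0.025)  # 0.2 + 1.96/10 = 0.396
sev = severity_of_upper_bound(bound, 0.2, 100, 1.0)  # recovers 1 - eps = 0.975
```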
References
O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Berger, J. O. 2003 'Could Fisher, Jeffreys and Neyman Have Agreed on Testing?'
and 'Rejoinder,', Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
O Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', Bulletin
of the American Mathematical Society 50(1): 126-46.
O Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T.
and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and
Robustness. New York: Academic Press.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
O Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in
Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by
Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
72
O Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian
Statistics' and 'Rejoinder', British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
O Gigerenzer, G., Swijtink, Porter, T. Daston, L. Beatty, J, and Kruger, L. 1989. The
Empire of Chance. Cambridge: Cambridge University Press.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature
530(7588);online 02Feb2016.
O Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes
factor,” Annals of Internal Medicine 130: 1005–1013.
O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos
eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-
Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57 (2) (June 1): 323–357.
73
O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7
(3): 283–300.
O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical
Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First
published in Bul. Acad. Pol.Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
O Selvin, H. 1970. “A critique of tests of significance in survey research. In The
significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago:
Aldine De Gruyter.
O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.
O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology
37(1): pp. 1-2.
O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context,
process, and purpose”, The American Statistician 70(2): 129–133.
74
Abstract
If a statistical methodology is to be adequate, it needs
to register how “questionable research practices”
(QRPs) alter a method’s error probing capacities. If
little has been done to rule out flaws in taking data as
evidence for a claim, then that claim has not passed a
stringent or severe test. The goal of severe testing is
the linchpin for (re)interpreting frequentist methods so
as to avoid long-standing fallacies at the heart of
today’s statistics wars. A contrasting philosophy views
statistical inference in terms of posterior probabilities
in hypotheses: probabilism. Presupposing probabilism,
critics mistakenly argue that significance and
confidence levels are misinterpreted, exaggerate
evidence, or are irrelevant for inference.
Recommended replacements—Bayesian updating,
Bayes factors, likelihood ratios—fail to control severity.
75

Mayo &amp; parker   spsp 2016 june 16Mayo &amp; parker   spsp 2016 june 16
Mayo &amp; parker spsp 2016 june 16
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF Harvard
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paper
 
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
 
Gelman psych crisis_2
Gelman psych crisis_2Gelman psych crisis_2
Gelman psych crisis_2
 
Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)
 
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
Probing with Severity: Beyond Bayesian Probabilism and Frequentist PerformanceProbing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
 
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slides
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
 
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and FalsificationP-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
 
On p-values
On p-valuesOn p-values
On p-values
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1
 

Viewers also liked

LBS GES 2014 - EnergyDeck - Nic Mason
LBS GES 2014 - EnergyDeck - Nic MasonLBS GES 2014 - EnergyDeck - Nic Mason
LBS GES 2014 - EnergyDeck - Nic Masonglobalenergysummit
 
Sustainable Business: Realising the True Business Case.
Sustainable Business: Realising the True Business Case.Sustainable Business: Realising the True Business Case.
Sustainable Business: Realising the True Business Case.Mike Townsend
 
5 best home remedies for toothache
5 best home remedies for toothache5 best home remedies for toothache
5 best home remedies for toothacheMichal Vilimovsky
 
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...Sibelle El Labban
 
Sophie Drew -Solar Report
Sophie Drew -Solar ReportSophie Drew -Solar Report
Sophie Drew -Solar ReportSophie Drew
 
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...Global Business Events
 
Public Device & Biopharma Ophthalmology Company Showcase - QLT
Public Device & Biopharma Ophthalmology Company Showcase - QLTPublic Device & Biopharma Ophthalmology Company Showcase - QLT
Public Device & Biopharma Ophthalmology Company Showcase - QLTHealthegy
 
Présentation achats groupés des citoyens verviétois
Présentation achats groupés des citoyens verviétoisPrésentation achats groupés des citoyens verviétois
Présentation achats groupés des citoyens verviétoisWikipower
 
Energy Dispute Resolution: Is Arbitration Outdated?
Energy Dispute Resolution: Is Arbitration Outdated?Energy Dispute Resolution: Is Arbitration Outdated?
Energy Dispute Resolution: Is Arbitration Outdated?Florence Shool of Regulation
 
3.plant instrumentation and control theory
3.plant instrumentation and control theory3.plant instrumentation and control theory
3.plant instrumentation and control theoryNoor Azman Muhammad
 
Posterior Segment Company Showcase - Aerpio
Posterior Segment Company Showcase - AerpioPosterior Segment Company Showcase - Aerpio
Posterior Segment Company Showcase - AerpioHealthegy
 
Renewable Energy Policy: What comes after Feed-in Tariffs?
Renewable Energy Policy: What comes after Feed-in Tariffs?Renewable Energy Policy: What comes after Feed-in Tariffs?
Renewable Energy Policy: What comes after Feed-in Tariffs?Florence Shool of Regulation
 

Viewers also liked (16)

LBS GES 2014 - EnergyDeck - Nic Mason
LBS GES 2014 - EnergyDeck - Nic MasonLBS GES 2014 - EnergyDeck - Nic Mason
LBS GES 2014 - EnergyDeck - Nic Mason
 
Sustainable Business: Realising the True Business Case.
Sustainable Business: Realising the True Business Case.Sustainable Business: Realising the True Business Case.
Sustainable Business: Realising the True Business Case.
 
IFPRI-Performance Indicators on Functioning of KVK's-PG Chengappa
IFPRI-Performance Indicators on Functioning of KVK's-PG ChengappaIFPRI-Performance Indicators on Functioning of KVK's-PG Chengappa
IFPRI-Performance Indicators on Functioning of KVK's-PG Chengappa
 
Proyecto colectivas. Bate y Carrera
Proyecto colectivas. Bate y CarreraProyecto colectivas. Bate y Carrera
Proyecto colectivas. Bate y Carrera
 
5 best home remedies for toothache
5 best home remedies for toothache5 best home remedies for toothache
5 best home remedies for toothache
 
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...
The Effect of Roux-en-Y Gastric Bypass and Sleeve Gastrectomy Surgery on Diet...
 
Sophie Drew -Solar Report
Sophie Drew -Solar ReportSophie Drew -Solar Report
Sophie Drew -Solar Report
 
IFPRI Extensive and Intensive Margins of India’s Pulses Trade, Devesh Roy, IFPRI
IFPRI Extensive and Intensive Margins of India’s Pulses Trade, Devesh Roy, IFPRIIFPRI Extensive and Intensive Margins of India’s Pulses Trade, Devesh Roy, IFPRI
IFPRI Extensive and Intensive Margins of India’s Pulses Trade, Devesh Roy, IFPRI
 
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...
Tony Chukwueke, Director at Transcorp Energy - Potential of oil and gas in ni...
 
Public Device & Biopharma Ophthalmology Company Showcase - QLT
Public Device & Biopharma Ophthalmology Company Showcase - QLTPublic Device & Biopharma Ophthalmology Company Showcase - QLT
Public Device & Biopharma Ophthalmology Company Showcase - QLT
 
Présentation achats groupés des citoyens verviétois
Présentation achats groupés des citoyens verviétoisPrésentation achats groupés des citoyens verviétois
Présentation achats groupés des citoyens verviétois
 
Energy Dispute Resolution: Is Arbitration Outdated?
Energy Dispute Resolution: Is Arbitration Outdated?Energy Dispute Resolution: Is Arbitration Outdated?
Energy Dispute Resolution: Is Arbitration Outdated?
 
3.plant instrumentation and control theory
3.plant instrumentation and control theory3.plant instrumentation and control theory
3.plant instrumentation and control theory
 
IFPRI-Land Rights Challenges of Women and Poor in Conflict Areas-Somdatta Man...
IFPRI-Land Rights Challenges of Women and Poor in Conflict Areas-Somdatta Man...IFPRI-Land Rights Challenges of Women and Poor in Conflict Areas-Somdatta Man...
IFPRI-Land Rights Challenges of Women and Poor in Conflict Areas-Somdatta Man...
 
Posterior Segment Company Showcase - Aerpio
Posterior Segment Company Showcase - AerpioPosterior Segment Company Showcase - Aerpio
Posterior Segment Company Showcase - Aerpio
 
Renewable Energy Policy: What comes after Feed-in Tariffs?
Renewable Energy Policy: What comes after Feed-in Tariffs?Renewable Energy Policy: What comes after Feed-in Tariffs?
Renewable Energy Policy: What comes after Feed-in Tariffs?
 

Similar to Mayo minnesota 28 march 2 (1)

D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learningjemille6
 
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)jemille6
 
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...jemille6
 
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019jemille6
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severityjemille6
 
“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”jemille6
 
D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500jemille6
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Warsjemille6
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilismjemille6
 
The Statistics Wars and Their Casualties
The Statistics Wars and Their CasualtiesThe Statistics Wars and Their Casualties
The Statistics Wars and Their Casualtiesjemille6
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)jemille6
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)jemille6
 
Philosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of StatisticsPhilosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of Statisticsjemille6
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)jemille6
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testingpraveen3030
 
importance of P value and its uses in the realtime Significance
importance of P value and its uses in the realtime Significanceimportance of P value and its uses in the realtime Significance
importance of P value and its uses in the realtime SignificanceSukumarReddy43
 
A Bayesian Alternative To Null Hypothesis Significance Testing
A Bayesian Alternative To Null Hypothesis Significance TestingA Bayesian Alternative To Null Hypothesis Significance Testing
A Bayesian Alternative To Null Hypothesis Significance TestingDustin Pytko
 

Similar to Mayo minnesota 28 march 2 (1) (20)

D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
 
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
 
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
 
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”
 
D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Wars
 
Mayod@psa 21(na)
Mayod@psa 21(na)Mayod@psa 21(na)
Mayod@psa 21(na)
 
Basic Conecepts of Inferential Statistics _ Slideshare.pptx
Basic Conecepts of Inferential Statistics _ Slideshare.pptxBasic Conecepts of Inferential Statistics _ Slideshare.pptx
Basic Conecepts of Inferential Statistics _ Slideshare.pptx
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
 
The Statistics Wars and Their Casualties
The Statistics Wars and Their CasualtiesThe Statistics Wars and Their Casualties
The Statistics Wars and Their Casualties
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
Hypothesis Testing.pptx
Hypothesis Testing.pptxHypothesis Testing.pptx
Hypothesis Testing.pptx
 
Philosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of StatisticsPhilosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of Statistics
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
importance of P value and its uses in the realtime Significance
importance of P value and its uses in the realtime Significanceimportance of P value and its uses in the realtime Significance
importance of P value and its uses in the realtime Significance
 
A Bayesian Alternative To Null Hypothesis Significance Testing
A Bayesian Alternative To Null Hypothesis Significance TestingA Bayesian Alternative To Null Hypothesis Significance Testing
A Bayesian Alternative To Null Hypothesis Significance Testing
 

More from jemille6

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfjemille6
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfjemille6
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022jemille6
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inferencejemille6
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?jemille6
 
What's the question?
What's the question? What's the question?
What's the question? jemille6
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metasciencejemille6
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...jemille6
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Twojemille6
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...jemille6
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testingjemille6
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredgingjemille6
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probabilityjemille6
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...jemille6
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (jemille6
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...jemille6
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...jemille6
 
The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...jemille6
 
D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides jemille6
 
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and BoundariesT. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundariesjemille6
 

More from jemille6 (20)

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 
The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...The ASA president Task Force Statement on Statistical Significance and Replic...
The ASA president Task Force Statement on Statistical Significance and Replic...
 
D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides D. G. Mayo jan 11 slides
D. G. Mayo jan 11 slides
 
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and BoundariesT. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
 

Recently uploaded

Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Mayo minnesota 28 march 2 (1)

  • 1. The ASA (2016) Statement on P-values: How to Stop Refighting the Statistics Wars The CLA Quantitative Methods Collaboration Committee & Minnesota Center for Philosophy of Science April 8, 2016 Deborah G Mayo
  • 2. Brad Efron “By and large, Statistics is a prosperous and happy country, but it is not a completely peaceful one. Two contending philosophical parties, the Bayesians and the frequentists, have been vying for supremacy over the past two-and-a-half centuries. …Unlike most philosophical arguments, this one has important practical consequences. The two philosophies represent competing visions of how science progresses….” (2013, p. 130) 2
  • 3. Today’s Practice: Eclectic O Use of eclectic tools, little handwringing over foundations O Bayes-frequentist unifications O Scratch a bit below the surface and foundational problems emerge…. O Not just 2: family feuds within (Fisherian, Neyman-Pearson; tribes of Bayesians, likelihoodists) 3
  • 4. Why are the statistics wars more serious today? O Replication crises led to programs to restore credibility: fraud busting, reproducibility studies O Taskforces, journalistic reforms, and debunking treatises O Proposed methodological reforms––many welcome (preregistration), some quite radical 4
  • 5. I was a philosophical observer at the ASA P-value “pow wow” 5
  • 6. “Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician 6
  • 7. O “Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033) O These I call error statistical methods (or sampling theory).” 7
  • 8. One Rock in a Shifting Scene O “Birnbaum calls it the ‘one rock in a shifting scene’ in statistical practice O “Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data” 8
  • 9. Error Statistics O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes O The inference may be in error O It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities) 9
  • 10. “p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function t = t(y) of the data, to be called the test statistic, such that • the larger the value of t, the more inconsistent are the data with H0; • the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true. …the p-value corresponding to any t as p = p(t) = P(T ≥ t; H0)” (Mayo and Cox 2006, p. 81) 10
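The definition on this slide can be made concrete with a minimal sketch of my own (not from the slides), for the common case of a one-sided test of a Normal mean with known σ, where t(y) = √n(ȳ − μ0)/σ is standard Normal under H0:

```python
from math import erf, sqrt

def normal_sf(z):
    """Survival function of the standard Normal: P(Z >= z)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def p_value(data, mu0, sigma):
    """p = P(T >= t(y); H0) for the test statistic
    t(y) = sqrt(n)*(ybar - mu0)/sigma, which is standard Normal
    under H0: mu = mu0 (sigma known). Larger t means data more
    inconsistent with H0, so the p-value is the upper-tail probability."""
    n = len(data)
    ybar = sum(data) / n
    t = sqrt(n) * (ybar - mu0) / sigma
    return normal_sf(t)
```

When the observed mean equals μ0 the p-value is 0.5; an observed mean about 1.96σ/√n above μ0 gives p ≈ 0.025.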
  • 11. O “Clearly, if even larger differences than t occur fairly frequently under H0 (p-value is not small), there’s scarcely evidence of incompatibility O But a small p-value doesn’t warrant inferring a genuine effect H, let alone a scientific conclusion H*–as the ASA document correctly warns (Principle 3)” 11
  • 12. A Paradox for Significance Test Critics Critic: It’s much too easy to get a small P-value You: Why do they find it so difficult to replicate the small P-values others found? Is it easy or is it hard? 12
  • 13. O R.A. Fisher: it’s easy to lie with statistics by selective reporting (he called it the “political principle”) O Sufficient finagling—cherry-picking, P-hacking, significance seeking—may practically guarantee a researcher’s preferred claim C gets support, even if it’s unwarranted by evidence O Note: Rejecting a null taken as support for some non-null claim C 13
  • 14. Severity Requirement: O If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C O Such a test fails a minimal requirement for a stringent or severe test O My account: severe testing based on error statistics 14
  • 15. Two main views of the role of probability in inference (not in ASA doc) O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0. (e.g., Bayesian, likelihoodist)—with regard for inner coherency O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson) 15
  • 16. What happened to using probability to assess the error probing capacity by the severity criterion? O Neither “probabilism” nor “performance” directly captures it O Good long-run performance is a necessary, not a sufficient, condition for severity O That’s why frequentist methods can be shown to have howlers 16
  • 17. O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs— O It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data 17
  • 18. A claim C is not warranted _______ O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer) O Performance: unless it stems from a method with low long-run error O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C 18
  • 19. O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation False! O I claim, error probabilities play a crucial role in appraising well-testedness O It’s crucial to be able to say, C is highly believable or plausible but poorly tested O Probabilists can allow for the distinct task of severe testing (you may not have to entirely take sides in the stat wars) 19
  • 20. The ASA doc gives no sense of different tools for different jobs O “To use an eclectic toolbox in statistics, it’s important not to expect an agreement on numbers from methods evaluating different things O A p-value isn’t ‘invalid’ because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks*, 2015) *Editors of a journal, Basic and Applied Social Psychology 20
  • 21. O ASA Principle 2 says a p-value ≠ posterior but one doesn’t get the sense of its role in error probability control O It’s not that I’m keen to defend many common uses of significance tests O The criticisms are often based on misunderstandings; consequently so are many “reforms” 21
  • 22. Biasing selection effects: O One function of severity is to identify problematic selection effects (not all are) O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed O Picking up on these alterations is precisely what enables error statistics to be self-correcting— 22
  • 23. Nominal vs actual significance levels The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4) Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin, 1970, p. 104) (From Morrison & Henkel’s The Significance Test Controversy, 1970!) 23
  • 24. O They were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level O There are many more ways you can be wrong with hunting (different sample space) O Here’s a case where a p-value report is invalid 24
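Selvin’s 64 percent figure is easy to check directly: with twenty tests of true nulls, each at the 5 percent level, the probability that at least one comes out “significant” is 1 − 0.95²⁰. A quick sketch of my own, assuming the tests are independent:

```python
def actual_level(alpha, k):
    """Probability that at least one of k independent level-alpha
    tests rejects when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k

# Twenty looks at the data, each at the nominal 5 percent level:
print(round(actual_level(0.05, 20), 2))  # prints 0.64, not 0.05
```

The nominal level reports the error rate of one prespecified test; the actual level must account for the hunting.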
  • 25. You report: Such results would be difficult to achieve under the assumption of H0 When in fact such results are common under the assumption of H0 (Formally): O You say Pr(P-value < Pobs; H0) ~ Pobs small O But in fact Pr(P-value < Pobs; H0) = high 25
  • 26. O Nowadays, we’re likely to see the tests blamed O My view: Tests don’t kill inference, people do O Even worse are those statistical accounts where the abuse vanishes! 26
  • 27. On some views, taking account of biasing selection effects “defies scientific sense” Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010) (To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford) 27
  • 28. Technical activism isn’t free of philosophy Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”: The editors at Annals of Internal Medicine,… repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; ….they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching 28
  • 29. His paper: “Make journals report clinical trials properly” O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy! 29
  • 30. Likelihood Principle (LP) The vanishing act links to a pivotal disagreement in the philosophy of statistics battles In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses P(x0;H1)/P(x0;H0) The data x0 are fixed, while the hypotheses vary 30
  • 31. Jimmy Savage on the LP: O “According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17) 31
  • 32. All error probabilities violate the LP (even without selection effects): Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, p. 436) The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects (Rosenkrantz, 1977, p. 122) 32
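The LP’s “irrelevance of the sample space” can be seen in the classic binomial versus negative-binomial example (my illustration, not from the slides): 3 successes in 12 Bernoulli trials yields likelihood functions that are constant multiples of each other whether n = 12 was fixed in advance or sampling stopped at the 3rd success, even though the two designs have different sampling distributions, hence different error probabilities:

```python
from math import comb

def binomial_lik(theta, n=12, x=3):
    """Likelihood of x successes when the number of trials n was fixed."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def neg_binomial_lik(theta, x=3, n=12):
    """Likelihood when sampling stopped at the x-th success (on trial n)."""
    return comb(n - 1, x - 1) * theta**x * (1 - theta)**(n - x)

# The ratio is constant in theta, so by the LP the two experiments carry
# identical evidence about theta; error statistics distinguishes them.
ratios = [binomial_lik(t) / neg_binomial_lik(t) for t in (0.1, 0.3, 0.5)]
```

The constant ratio (here C(12,3)/C(11,2) = 4) is exactly why sampling distributions, significance levels, and power drop out on probabilist accounts.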
  • 33. Paradox of Optional Stopping: Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules: Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0. Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule: 33
  • 34. “Trying and trying again” O Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until |M| ≥ 1.96 s/√n O Trying and trying again: Having failed to rack up a 1.96 s difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 s difference 34
  • 35. Nominal vs. Actual significance levels again: O With n fixed the Type 1 error probability is 0.05 O With this stopping rule the actual significance level differs from, and will be greater than 0.05 O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle” 35
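The inflation from “trying and trying again” is easy to simulate (a rough sketch of my own; the cap of 100 draws and the seed are illustrative): sample from N(0, 1) under H0 and reject the moment the running mean exceeds 1.96/√n. The rejection rate under H0 comes out far above the nominal 0.05:

```python
import random

def rejects_under_stopping_rule(n_max, rng):
    """Under H0: mu = 0, sigma = 1, keep sampling and reject the
    moment |mean| >= 1.96/sqrt(n); give up after n_max draws."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / n) >= 1.96 / n**0.5:
            return True
    return False

rng = random.Random(1)  # fixed seed so the run is reproducible
trials = 2000
rate = sum(rejects_under_stopping_rule(100, rng) for _ in range(trials)) / trials
# rate lands in the 0.2-0.3 range here, far above the nominal 0.05
```

A researcher who keeps peeking is running many implicit tests, so the actual Type 1 error probability grows with the number of looks (and tends to 1 as n_max grows without bound).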
  • 36. 1959 Savage Forum Jimmy Savage audaciously declared: “optional stopping is no sin” so the problem must be with significance levels Peter Armitage: “thou shalt be misled” if thou dost not know the person tried and tried again (p. 72) 36
  • 37. O “The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4) O However, the same p-hacked hypothesis can occur in Bayes factors; optional stopping can be guaranteed to exclude true nulls from HPD intervals 37
  • 38. With One Big Difference: O “The direct grounds to criticize inferences as flouting error statistical control is lost O They condition on the actual data, O error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)” 38
  • 39. Tension: Does Principle 4 Hold for Other Approaches? O “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches” (They include Bayes factors, likelihood ratios, as “alternative measures of evidence”) O They appear to extend “full reporting and transparency” (principle 4) to all methods. O Some controversy: should it apply only to “p-values and related statistics”? 39
  • 40. How might probabilists block intuitively unwarranted inferences (without error probabilities)? A subjective Bayesian might say: If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences) 40
  • 41. Rescued by beliefs O That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs O Besides, researchers sincerely believe their hypotheses O So now you’ve got two sources of flexibility, priors and biasing selection effects 41
  • 42. No help with our most important problem O How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)? 42
  • 43. Most Bayesians are “conventional” O Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006) 43
  • 44. O A classic conundrum: no general non-informative prior exists, so most are conventional O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, p. 299) O Prior probability: An undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching or….) 44
  • 45. Conventional Bayesian Reforms are touted as free of selection effects O Jim Berger gives us “conditional error probabilities” CEPs O “[I]t is considerably less important to disabuse students of the notion that a frequentist error probability is the probability that the hypothesis is true, given the data”, since under his new definition “a CEP actually has that interpretation” O “CEPs do not depend on the stopping rule” (“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” 2003) 45
  • 46. By and large the ASA doc highlights classic foibles “In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, p. 14) (“isolated” low P-value ≠> H: statistical effect) 46
  • 47. Statistical ≠> substantive (H ≠> H*) “[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer 1989, pp. 95-6) 47
  • 48. O Flaws in alternative H* have not been probed by the test, O The inference from a statistically significant result to H* fails to pass with severity O “Merely refuting the null hypothesis is too weak to corroborate” substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called ‘a highly improbable coincidence’” (Meehl and Waller 2002, p.184) 48
  • 49. O Encouraged by something called NHSTs –that supposedly allow moving from statistical to substantive O If defined that way, they exist only as abuses of tests O ASA doc ignores Neyman-Pearson (N-P) tests 49
  • 50. Neyman-Pearson (N-P) Tests: A null and alternative hypotheses H0, H1 that exhaust the parameter space O So the fallacy of rejection H – > H* is impossible O Rejecting the null only indicates statistical alternatives 50
  • 51. P-values Don’t Report Effect Sizes Principle 5 Who ever said to just report a P-value? O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006) 51
  • 52. To Avoid Inferring a Discrepancy Beyond What’s Warranted: large n problem. O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2) 52
  • 53. O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze? [The larger sample size is like the one that goes off with burnt toast] 53
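The burnt-toast point can be put numerically (my own sketch, assuming μ0 = 0 and σ = 1): take a result that is just significant at the 0.025 level, so the observed mean is 1.96/√n. The discrepancy that passes with severity about 0.95 is bounded by the observed mean plus 1.645/√n, and this bound shrinks as n grows, so the same significance level warrants a smaller discrepancy from a larger sample:

```python
from math import sqrt

# Just-significant result at the 0.025 level: xbar = 1.96/sqrt(n).
# The claim mu < xbar + 1.645/sqrt(n) then passes with severity ~0.95.
bounds = {n: (1.96 + 1.645) / sqrt(n) for n in (25, 100, 400)}
for n, b in sorted(bounds.items()):
    print(n, round(b, 3))  # the warranted bound shrinks as n grows
```

With n = 25 the severity-0.95 bound is about 0.72σ; with n = 400 it is about 0.18σ: the more sensitive alarm (large n) goes off for the smaller fire.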
  • 54. What About the Fallacy of Non-Significant Results? O Non-Replication occurs with non- significant results, but there’s much confusion in interpreting them O No point in running replication research if you view negative results as uninformative 54
  • 55. O They don’t warrant 0 discrepancy O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed; set upper bounds O If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (where μ1 = μ0 + γ), then the data are good evidence that μ < μ1 O Akin to power analysis (Cohen, Neyman) but sensitive to x0 55
  • 56. Improves on Confidence Intervals “This is akin to confidence intervals (which are dual to tests) but we get around their shortcomings: O We do not fix a single confidence level, O The evidential warrant for different points in any interval are distinguished” O Go beyond “performance goal” to give inferential construal 56
  • 57. Simple Fisherian Tests Have Important Uses O Model validation: George Box calls for ecumenism because “diagnostic checks and tests of fit” he argues “require frequentist theory significance tests for their formal justification” (Box 1983, p. 57) 57
  • 58. “What we are advocating, ..is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model” (Gelman & Shalizi p. 20) O Fraudbusting and forensics: Finding Data too good to be true (Simonsohn) 58
  • 59. Concluding remarks: Reforms without Philosophy of Statistics are Blind O I end my commentary: “Failing to understand the correct (if limited) role of simple significance tests threatens to throw the error control baby out with the bad statistics bathwater O Avoid refighting the same wars, or banning methods based on cookbook methods long lampooned O Makes no sense to banish tools for testing assumptions the other methods require and cannot perform 59
  • 60. O Don’t expect an agreement on numbers from methods evaluating different things O Recognize different roles of probability: probabilism, long-run performance, probativism (severe testing) O Probabilisms may enable rather than block illicit inferences due to biasing selection effects O Main paradox of the “replication crisis” 60
  • 61. Paradox of Replication O Critic: It’s too easy to satisfy standard significance thresholds O You: Why do replicationists find it so hard to achieve them with preregistered trials? O Critic: Most likely the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs O You: So, replication researchers want methods that pick up on and block these biasing selection effects O Critic: Actually the “reforms” recommend methods where selection effects make no difference 61
  • 62. Either you care about error probabilities or not O If not, experimental design principles (e.g., RCTs) may well go by the board O Not enough to have a principle: we must be transparent about data-dependent selections O Your statistical account needs a way to make use of the information O “Technical activists” are not free of conflicts of interest and of philosophy 62
  • 63. Granted, error statistical improvements are needed O An inferential construal of error probabilities wasn’t clearly given (Birnbaum) – my goal O It’s not long-run error control (performance), but severely probing flaws today O I also grant an error statistical account needs to say more about how it will use background information 63
  • 64. Future ASA project O Look at the “other approaches” (Bayes factors, LRs, Bayesian updating) O What is it for a replication to succeed or fail on those approaches? (can’t be just a matter of prior beliefs in the hypotheses) 64
  • 65. Finally, it should be recognized that often better statistics cannot help O Rather than search for more “idols”, do better science, get better experiments and theories O One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest 65
  • 66. Be ready to admit questionable science O The scientific status of an inquiry is questionable if it is unable to distinguish a poorly run study from a poor hypothesis O Continually violating minimal requirements for severe testing 66
  • 67. Non-replications often construed as simply weaker effects Two that didn’t replicate in psychology: O Belief in free will and cheating O Physical distance (of points plotted) and emotional closeness 67
  • 68. 68
  • 69. The ASA’s Six Principles O (1) P-values can indicate how incompatible the data are with a specified statistical model O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold O (4) Proper inference requires full reporting and transparency O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis 69
  • 70. Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006) FEV/SEV insignificant result: A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist FEV/SEV significant result: d(X) > d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent 70
  • 71. Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known, M0 the observed mean (FEV/SEV): If d(x) is not statistically significant, then μ < M0 + kεσ/√n passes the test T+ with severity (1 – ε) (FEV/SEV): If d(x) is statistically significant, then μ > M0 – kεσ/√n passes the test T+ with severity (1 – ε), where P(d(X) > kε) = ε 71
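The severity assessments on this slide can be computed directly. A minimal sketch of my own (not the author’s code) for test T+ with σ known: the severity for the claim μ ≤ μ1, given observed mean x̄, is P(d(X) > d(x0); μ = μ1) = Φ(√n(μ1 − x̄)/σ):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity_upper(xbar, mu1, sigma, n):
    """Severity for the claim mu <= mu1 in test T+ (sigma known):
    the probability the test would have given a worse fit with H0
    than it did (d(X) > d(x0)), were mu as large as mu1."""
    return normal_cdf(sqrt(n) * (mu1 - xbar) / sigma)
```

For example, with x̄ = 0.2, σ = 1, n = 100, the claim μ ≤ 0.2 + 1.645/10 passes with severity ≈ 0.95 (kε = 1.645, ε = 0.05), while μ ≤ 0.2 itself has severity only 0.5; unlike a single confidence level, each bound gets its own severity assessment.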
  • 72. References O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen. O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder”, Statistical Science 18(1): 1-12; 28-32. O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402. O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225(5237): 1033. O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and D. F. J. Wu, 51-84. New York: Academic Press. O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall. O Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press. O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46. O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd. O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B 17(1): 69–78. 72
  • 73. O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80. O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press. O Goldacre, B. 2008. Bad Science. HarperCollins Publishers. O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature 530(7588), online 2 Feb 2016. O Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor,” Annals of Internal Medicine 130: 1005–1013. O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston. O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. O Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. First appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, 247-275. O Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323–357. 73
  • 74. O Mayo, D. G. and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier. O Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283–300. O Morrison, D. E. and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. O Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci. 73-96. O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel. O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen. O Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter. O Simonsohn, U. 2013. “Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone”, Psychological Science 24(10): 1875-1888. O Trafimow, D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology 37(1): 1-2. O Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician 74
  • 75. Abstract If a statistical methodology is to be adequate, it needs to register how “questionable research practices” (QRPs) alter a method’s error probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today’s statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements—Bayesian updating, Bayes factors, likelihood ratios—fail to control severity. 75