P-Value "Reforms":
Fixing Science or Threats to
Replication and Falsification
Deborah G. Mayo
Fixing Science
National Association of Scholars, Independent Institute
February 7, 2020
• Mounting failures of replication give a new urgency to
critically appraising proposed statistical reforms.
• While many are welcome:
o preregistration of experiments
o testing effects by replication
o discouraging cookbook uses of statistics
• Others are radical and, paradoxically, actually
obstruct the practices known to improve
replication.
• The statistics wars have often become proxy battles
between competing tribe leaders keen to advance
one or another method or philosophy
• I am concerned about unthinking bandwagon effects,
and "political groupthink" [a term on this conference's
website]
• Here the politics refers to "the politics of statistics"
• The problem calls for a mix of statistical,
philosophical, historical and political resources
Replication Paradox
(for Significance Test Critics)
Critic of tests: It’s much too easy to get a
small P-value
Crisis of replication: It is much too difficult to
get small P-values (when we try to replicate)
Is it easy or is it hard?
• R.A. Fisher: it’s easy to lie with statistics by
selective reporting (he called it the “political
principle,” 1955, 75)
• Sufficient finagling—cherry-picking, P-hacking,
significance seeking, multiple testing, the look-
elsewhere effect—may practically guarantee that a
preferred claim H gets support, even if it’s unwarranted
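To make "practically guarantee" concrete, here is a minimal simulation sketch (my illustration, not from the talk): a researcher measures 20 independent outcomes, none with a true effect, and reports whichever test comes out nominally significant.

```python
# Sketch: multiple testing practically guarantees a "significant" finding.
# Hypothetical setup: 20 independent outcome measures, NO true effects,
# n = 30 subjects; the researcher reports any nominally significant test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_outcomes, n = 10_000, 20, 30

hits = 0
for _ in range(n_studies):
    data = rng.normal(size=(n_outcomes, n))           # all nulls true
    pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
    hits += pvals.min() < 0.05                        # significance seeking
print(f"P(some nominal p < 0.05 despite no effects) = {hits / n_studies:.2f}")
# prints roughly 0.64 -- nowhere near the nominal 0.05
```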
Bad Evidence, No Test (BENT)
If the test procedure had little or no capability of
finding flaws with H (even if H is incorrect), then
agreement between data x0 and H provides poor (or
no) evidence for H
(“Mere supporting instances are too cheap to be
worth having” Popper 1983, 130)
• Such a test fails a minimal requirement for
evidence (the requirement that tests be minimally severe)
• The sources of irreplication are not mysterious
Taking a single small P-value as
grounds for a conclusive finding
“Several methodologists have pointed out that the
high rate of nonreplication of research discoveries
is a consequence of the convenient, yet ill-
founded strategy of claiming conclusive
research findings solely on the basis of a
single study assessed by formal statistical
significance, typically for a p-value less than
0.05.” (John Ioannidis 2005, 0696)
“[W]e need, not an isolated record, but a reliable
method of procedure. In relation to the test of
significance, we may say that a phenomenon is
experimentally demonstrable when we know how
to conduct an experiment which will rarely fail to
give us a statistically significant result.” (Fisher
1947, 14)
Statistical Falsification
• The claim of no genuine effect or
phenomenon is falsified statistically
Popper: “If we find reproducible deviations. . .
deduced from a probability estimate . . . then
we must assume that the probability estimate
is falsified”. (Popper 1959, 203)
Fisher’s Simple Significance Test
“…to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, to be
called the test statistic, such that
• the larger the value of T, the more inconsistent
are the data with H0;
• the random variable T = t(Y) has a known
probability distribution when H0 is true.
…the p-value corresponding to any tobs as
Pr(T ≥ tobs; H0)”
(Mayo and Cox 2006, 81)
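As a concrete instance of this recipe, the sketch below computes Pr(T ≥ tobs; H0) for a one-sample z-test of H0: μ = 0 with known σ (the numbers are my illustrative assumptions, not the slides').

```python
# Sketch of the p-value recipe just quoted, for a one-sample z-test of
# H0: mu = 0 with known sigma (illustrative numbers, not from the slides).
import math

def z_test_p_value(xbar, sigma, n):
    """Return (t_obs, p), with T = sqrt(n)*Xbar/sigma ~ N(0,1) under H0
    and p = Pr(T >= t_obs; H0)."""
    t_obs = math.sqrt(n) * xbar / sigma
    p = 0.5 * math.erfc(t_obs / math.sqrt(2))   # 1 - Phi(t_obs)
    return t_obs, p

t_obs, p = z_test_p_value(xbar=0.4, sigma=1.0, n=25)
print(f"t_obs = {t_obs:.2f}, p = {p:.3f}")      # t_obs = 2.00, p = 0.023
```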
Testing Reasoning
• If even larger differences than tobs occur frequently
under H0 (i.e., the P-value is not small), there’s
scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying
discrepancy from H0, because very probably you
would have seen a less impressive difference
than tobs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
The Stat-Sub fallacy: H => H*
Statistical ≠> substantive (H ≠> H*)
correlation ≠ cause
• H* makes claims that haven’t been probed by the
statistical test
“Merely refuting the null hypothesis is too weak to
corroborate” substantive H*, “we have to have
‘Popperian risk’, ‘severe test’ [as in Mayo]’”
(Meehl and Waller 2002, 184)
Neyman-Pearson (N-P) tests (1933):
A test specifies null and alternative hypotheses
H0, H1 that are exhaustive:
H0: μ ≤ 0 vs. H1: μ > 0
• This fallacy of rejection (H1 => H*) is impossible
• Rejecting H0 only indicates the statistical alternative
H1 (how discrepant from the null)
• We also get the type II error, and power
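A small sketch of type II error and power for this very test, H0: μ ≤ 0 vs. H1: μ > 0 (σ, n, α, and the alternatives μ1 are my illustrative choices):

```python
# Power and type II error for the one-sided test H0: mu <= 0 vs H1: mu > 0
# with known sigma; sigma, n, alpha and the mu1 values are illustrative.
from scipy import stats

sigma, n, alpha = 1.0, 25, 0.05
z_alpha = stats.norm.ppf(1 - alpha)                  # reject when Z >= 1.645

for mu1 in (0.1, 0.3, 0.5):
    # Under mu = mu1, Z = sqrt(n)*Xbar/sigma ~ N(sqrt(n)*mu1/sigma, 1)
    power = stats.norm.sf(z_alpha - (n ** 0.5) * mu1 / sigma)
    print(f"mu1 = {mu1}: power = {power:.2f}, type II error = {1 - power:.2f}")
# power grows with the discrepancy: ~0.13, ~0.44, ~0.80
```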
Neyman-Pearson (N-P) tests
• Criticisms of tests nearly always stick to the
simple Fisherian test and point nulls
• The simple test has important uses, but this focus
overlooks the ability of N-P tests to infer effect
sizes or discrepancies
• Neyman also developed confidence intervals
(1930)
N-P avoids fallacies of nonsignificant results
“If the power to detect a meaningful effect is high,
then repeatedly failing to find one warrants taking it
as absent.” (Neyman 1957, 16–17)
Using the observed P-value:
If a more impressive result (smaller P-value) is
probable when μ = μ1 (where μ1 = μ0 + γ), then the data
indicate that μ < μ1.
Despite philosophical debates
between Fisher & N-P
• They both fall under tools for “appraising and
bounding the probabilities (under respective
hypotheses) of seriously misleading interpretations
of data” (Birnbaum 1970, 1033): error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests,
resampling, randomization.
Fishing for significance
(nominal vs. actual)
N-P and Fisher show error control is lost with
selective reporting
“Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out to
be ‘significant at the 5 percent level.’ ….The
actual level of significance is not 5 percent, but
64 percent!” (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!)
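Selvin's 64 percent is just the complement of twenty independent chances to escape a 5 percent error (his figure assumes the twenty tests are independent):

```python
# Selvin's arithmetic: the chance that at least one of 20 independent
# tests at the 5% level reaches nominal significance by luck alone
print(1 - (1 - 0.05) ** 20)   # 0.6415..., i.e., about 64 percent
```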
Welcome Reform: The 21-Word Solution:
Report the Sampling Plan
• Replication researchers (re)discovered that data-
dependent hypotheses are a major source of
spurious significance levels.
“We report how we determined our sample size, all
data exclusions (if any), all manipulations, and all
measures in the study.”
(Simmons, Nelson, and Simonsohn 2012, 4)
Scapegoating
Others blame the tests!
Statistical significance tests don’t kill
inferences, people do
Even worse are those statistical accounts
where the abuse vanishes!
On some views, taking account of biasing
selection effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, …data
dredging and peeking at the data. The frequentist
solution to both problems involves adjusting the P-
value…
But adjusting the measure of evidence because of
considerations that have nothing to do with the
data defies scientific sense” (Goodman 1999, p.
1010)
(To his credit, he’s open about this; he heads the Meta-
Research Innovation Center at Stanford)
Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement
between falsification vs. confirmation accounts
(probabilisms)
• Probabilisms condition on the actual data
• Error probabilities consider outcomes that could
have occurred but did not (sampling distribution)
On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
• No wonder reformers often talk past each
other
Many “reform” methods offered
follow the LP
• “Bayes factors can be used in the complete
absence of a sampling plan” (Bayarri, Benjamin,
Berger, Sellke 2016, 100)
• “Data should be able to speak for itself.” (Berger
and Wolpert 1988, 78; The Likelihood Principle)
• If someone is selling you an account where
sampling plans don’t matter, you might want to
hold off buying.
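A standard textbook illustration of the stakes (my example, not the slides'): two sampling plans can yield the same data, 9 successes and 3 failures in Bernoulli trials, hence proportional likelihoods, yet different P-values for H0: θ = 0.5 against θ > 0.5.

```python
# Same data (9 successes, 3 failures), same likelihood, different sampling
# plans -- a standard illustration, not taken from the slides.
from scipy import stats

# Plan A: n = 12 trials fixed in advance -> binomial tail
p_fixed = stats.binom.sf(8, 12, 0.5)      # Pr(X >= 9; theta = 0.5)

# Plan B: sample until the 3rd failure is observed
# (scipy's nbinom here counts successes before the 3rd failure, theta = 0.5)
p_stop = stats.nbinom.sf(8, 3, 0.5)       # Pr(X >= 9; theta = 0.5)

print(f"fixed-n p = {p_fixed:.3f}, stop-at-3rd-failure p = {p_stop:.3f}")
# ~0.073 vs ~0.033: the LP says ignore the difference; error statistics doesn't
```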
Can probabilists still block intuitively
unwarranted inferences
(without error probabilities)?
• If our beliefs were mixed into the interpretation
of the evidence, we wouldn’t declare there’s
statistical evidence of some unbelievable claim
• Might work in some cases
Problems
• An additional source of flexibility: priors as well as
biasing selection effects
• Doesn’t show what researchers did
wrong—a battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, … can be
understood as ‘error probes’ in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in
which certain of the model’s implications are
compared directly to the data.” (Gelman and Shalizi
2013, 10, 20).
They can’t also champion “abandoning statistical
significance”
The 2019 call: Don’t say ‘significance’,
don’t use P-value thresholds
• The editors of the March 2019 TAS special issue "A World
Beyond p < 0.05"—Wasserstein, Schirm, and Lazar—
recommend that "declarations of ‘statistical significance’ be
abandoned" (p. 2).
• On their view, prespecified P-value thresholds
should never be used in interpreting results.
• It is not just a word ban but a gatekeeper ban
“Retiring statistical significance
would give bias a free pass.”
John Ioannidis (2019)
"...potential for falsification is a prerequisite for
science. Fields that obstinately resist refutation
can hide behind the abolition of statistical
significance but risk becoming self-ostracized
from the remit of science”.
I agree, and in Mayo (2019) I show why.
• Complying with the “no threshold” view precludes
the FDA's long-established drug review
procedures, as Wasserstein et al. 2019 recognize
• They think that, by removing P-value thresholds,
researchers lose an incentive to data dredge and
otherwise exploit researcher flexibility
• Even if true, it's a bad argument
(Decriminalizing robbery results in fewer robbery arrests)
• But it's not true.
• Even without the word “significance,” an eager
researcher still can’t take a large (non-
significant) P-value to indicate a genuine effect
• To do so would be to say: even though larger differences
would frequently occur by chance variability alone,
my data provide evidence they are not due to
chance variability
• In short, he would still need to report a reasonably
small P-value
• The eager investigator will need to “spin” his
results: ransack, data dredge
• In a world without predesignated thresholds, it
would be hard to hold him accountable for
reporting a nominally small P-value:
• “whether a p-value passes any arbitrary threshold
should not be considered at all" in interpreting
data (Wasserstein et al. 2019, 2)
New England Journal of Medicine
reacts to “abandon significance”
• The significance level from [a well-defined study] is
a reliable indicator of the extent to which the data
contradict a null hypothesis of no association
between an intervention and a response
• Clinicians and regulatory agencies must make
decisions about which treatment to use or to allow
to be marketed, and P values interpreted by
reliably calculated thresholds subjected to
appropriate adjustments [for multiple trials]
have a role in those decisions
(Harrington et al. 2019, 286, NEJM)
No tests, no falsification
• The “no thresholds” view also blocks common uses
of confidence intervals and Bayes factor standards
• If you cannot say about any results, ahead of time,
they will not be allowed to count in favor of a claim,
then you do not have a test of it
• Don’t confuse having a threshold with the terrible
practice of using a fixed P-value across all studies
in an unthinking manner
• We should reject the latter
My view: Reformulate Tests
• Instead of a binary cut-off (significant or not), the
particular outcome is used to infer discrepancies
that are or are not warranted
• Avoids fallacies of significance and
nonsignificance, and improves on confidence
interval estimation
Mayo and Cox (2006): Frequentist Principle of
Evidence (FEV); SEV: Mayo and Spanos (2006)
FEV/SEV (significant result): A small P-value is
evidence of a discrepancy γ from H0 if and only if there
is a high probability the test would have yielded d(X) < d(x0)
were a discrepancy as large as γ absent
FEV/SEV (insignificant result): A moderate P-value is
evidence of the absence of a discrepancy γ from H0
only if there is a high probability the test would
have given a worse fit with H0 (i.e., d(X) > d(x0))
were a discrepancy γ to exist
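A sketch of how SEV can be computed for a significant result in the one-sided normal test used earlier (H0: μ ≤ 0, with σ = 1 and n = 25 my illustrative assumptions; see Mayo and Spanos 2006 for the official definitions):

```python
# Severity sketch for a significant result in the one-sided normal test
# H0: mu <= 0 vs H1: mu > 0, sigma = 1, n = 25 (illustrative assumptions).
from scipy import stats

def sev_greater(xbar, mu1, sigma=1.0, n=25):
    """SEV(mu > mu1) = Pr(d(X) < d(x0); mu = mu1)
    = Phi(sqrt(n) * (xbar - mu1) / sigma)."""
    return stats.norm.cdf((n ** 0.5) * (xbar - mu1) / sigma)

xbar = 0.4                            # observed mean: d(x0) = 2.0, p = 0.023
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(f"SEV(mu > {mu1}) = {sev_greater(xbar, mu1):.2f}")
# 0.98, 0.84, 0.50, 0.16: the result severely warrants mu > 0,
# but not discrepancies as large as 0.4 or 0.6
```

The insignificant-result case is the mirror image, using Pr(d(X) > d(x0); μ = μ1) in place of its complement.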
New ASA Task Force on Significance
Tests and Replication
• to “prepare a … piece reflecting ‘good statistical
practice,’ without leaving the impression that p-
values and hypothesis tests … have no role” (Karen
Kafadar 2019)
Brief concluding remark
• The sources of irreplication are not mysterious:
in many fields, latitude in collecting and
interpreting data makes it too easy to dredge up
impressive-looking findings even when spurious.
• Some of the reforms intended to fix science
enable, rather than reveal, illicit inferences due to
P-hacking, multiple testing, and data dredging
(either because they obey the LP or because they block thresholds)
References
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and
Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.”
Journal of Mathematical Psychology 72: 90-103.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6
Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical
Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
• Cox, D. R. and Hinkley, D. (1974). Theoretical Statistics. London: Chapman and
Hall.
• Fisher, R. A. (1947). The Design of Experiments 4th ed. Edinburgh: Oliver and
Boyd.
• Fisher, R. A. (1955). “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
• FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical
Trials: Guidance for Industry (DRAFT GUIDANCE).” Retrieved from
https://www.fda.gov/media/102657/download
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian
Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
• Goodman, S. N. (1999). “Toward Evidence-based Medical Statistics. 2: The Bayes
Factor.” Annals of Internal Medicine 130: 1005–1013.
• Harrington, D., D’Agostino, R., Gatsonis, C., et al. (2019). “New Guidelines for Statistical
Reporting in the Journal.” New England Journal of Medicine 381: 285–286.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False.” PLoS
Medicine 2(8), 0696–0701.
• Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical
Analyses: Do Not Abandon Significance.” JAMA 321(21): 2067–2068.
doi:10.1001/jama.2019.4582
• Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner,
AmStat News, (Issue 510), (Dec. 2019).
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt,
Rinehart and Winston.
• Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. (2019). ”P-value thresholds: Forfeit at your peril”. European Journal of Clinical
Investigation, 49, e13170. https://doi.org/10.1111/eci.13170
• Mayo, D. and Cox, D.R. (2006). "Frequentist Statistics as a Theory of Inductive
Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical
Statistics: 247-275.
• Mayo, D. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction." The British Journal for the Philosophy
of Science 57(2): 323–57.
• Meehl, P., and Waller, N. (2002). “The Path Analysis Controversy: A New Statistical
Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–
300.
• Morrison, D. and Henkel, R., (eds.) (1970). The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
• NEJM Author Guidelines (2019): Retrieved from: https://www.nejm.org/author-
center/new-manuscripts on July 19, 2019.
• Neyman, J. (1957). “The Use of the Concept of Power in Agricultural
Experimentation.” Journal of the Indian Society of Agricultural Statistics IX(1), 9–
17.
• Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests
of Statistical Hypotheses." Philosophical Transactions of the Royal Society of
London Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S.
Pearson. Cambridge: Cambridge University Press.
• Pearson, E.S. & Neyman, J. (1930). “On the problem of two samples”, Joint Statistical
Papers by J. Neyman & E.S. Pearson, 99-115 (Cambridge: Cambridge University
Press). First published in Bul. Acad. Pol.Sci. 73-96.
• Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books. Reprinted
2000 The Logic of Scientific Discovery. London, New York: Routledge.
• Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and
Littlefield.
• Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research.” In The
Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago:
Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). “A 21 Word Solution.” Dialogue:
The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–
7.
• Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05”
[Editorial]. The American Statistician, 73(S1), 1–19.
https://doi.org/10.1080/00031305.2019.1583913

P-Value "Reforms": Fixing Science or Threat to Replication and Falsification

  • 1.
    P-Value "Reforms": Fixing Scienceor Threats to Replication and Falsification Deborah G Mayo Fixing Science National Assoc of Scholars, Independent Institute February 7, 2020
  • 2.
    2 P-Value "Reforms": FixingScience or Threats to Replication and Falsification • Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. • While many are welcome o preregistration of experiments o testing effects by replication, o discouraging cookbook uses of statistics • Others are radical and, paradoxically, actually obstruct the practices known to improve on replication.
  • 3.
    3 • The statisticswars have often become proxy battles between competing tribe leaders keen to advance one or another method or philosophy • I am concerned about unthinking bandwagon effects, and "political groupthink" [a term on this conference's website] • Here the politics refers to "the politics of statistics" • The problem calls for a mix of statistical, philosophical, historical and political resources
  • 4.
    Replication Paradox (for SignificanceTest Critics) Critic of tests: It’s much too easy to get a small P-value Crisis of replication: It is much too difficult to get small P-values (when we try to replicate) Is it easy or is it hard? 4
  • 5.
    • R.A. Fisher:it’s easy to lie with statistics by selective reporting, (he called it the “political principle” (1955, 75)) • Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim H gets support, even if it’s unwarranted 5
  • 6.
    Bad Evidence, NoTest (BENT) If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H (“Mere supporting instances are too cheap to be worth having” Popper 1983, 130) • Such a test fails a minimal requirement for evidence (tests are minimally severe) • The sources of irreplication are not mysterious 6
  • 7.
    Taking a singlesmall P-value as grounds for a conclusive finding “Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill- founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05.” (John Ioannidis 2005, 0696) 7
  • 8.
  • 9.
    “[W]e need, notan isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14) 9
  • 10.
    Statistical Falsification • Theclaim of no genuine effect or phenomenon is falsified statistically Popper: “If we find reproducible deviations. . . deduced from a probability estimate . . . then we must assume that the probability estimate is falsified”. (Popper 1959, 203) 10
  • 11.
    Fisher’s Simple SignificanceTest “…to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, to be called the test statistic, such that • the larger the value of T the more inconsistent are the data with H0; • The random variable T = t(Y) has a known probability distribution when H0 is true. …the p-value corresponding to any t0bs as Pr(T ≥ t0bs; H0)” (Mayo and Cox 2006, 81) 11
  • 12.
    Testing Reasoning • Ifeven larger differences than t0bs occur frequently under H0 (i.e., P-value is not small), there’s scarcely evidence of incompatibility with H0 • Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than t0bs were H0 true. • This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H* Stat-Sub fallacy H => H* 12
  • 13.
    Statistical ≠> substantive(H ≠> H*) correlation ≠ cause • H* makes claims that haven’t been probed by the statistical test “Merely refuting the null hypothesis is too weak to corroborate” substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo]’” (Meehl and Waller 2002,184) 13
  • 14.
    Neyman-Pearson (N-P) tests (1933): Atest (null) and alternative hypotheses H0, H1 that are exhaustive H0: μ ≤ 0 vs. H1: μ > 0 • This fallacy of rejection H1èH* is impossible • Rejecting H0 only indicates statistical alternatives H1 (how discrepant from null) • We get the type II error, and power 14
  • 15.
    Neyman-Pearson (N-P) tests •Criticisms of tests nearly always stick to the simple Fisherian test & point nulls • It has important uses but overlooks the ability to infer effect sizes or discrepancies in N-P tests • Neyman also developed confidence intervals (1930) 15
  • 16.
    N-P avoids fallaciesof non- significant Results “If the power to detect a meaningful effect is high, then repeatedly failing to find one warrants taking it as absent.” (Neyman 1957, 16–17) Using the observed P-value: If a more impressive result (smaller P-value) is probable, if μ = μ1 (where μ1 = μ0 + γ), then the data indicate that μ < μ1. 16
  • 17.
    Despite philosophical debates betweenFisher & N-P • They both fall under tools for “appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities • I place all under the rubric of error statistics • Confidence intervals, N-P and Fisherian tests, resampling, randomization. 17
  • 18.
    Fishing for significance (nominalvs. actual) N-P and Fisher show error control is lost with selective reporting “Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, 104) (Morrison & Henkel’s Significance Test Controversy 1970!) 18
  • 19.
    Welcome Reform: 21Word Solution: Report Sampling Plan • Replication researchers (re)discovered that data- dependent hypotheses are a major source of spurious significance levels. “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.” (Simmons, Nelson, and Simonsohn 2012, 4) 19
  • 20.
    Scapegoating Others blame thetests! Statistical significance tests don’t kill inferences, people do Even worse are those statistical accounts where the abuse vanishes! 20
  • 21.
    On some views,taking account of biasing selection effects “defies scientific sense” “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, …data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P- value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, p. 1010) (To his credit, he’s open about this; heads the Meta- Research Innovation Center at Stanford) 21
  • 22.
    Likelihood Principle (LP) Thevanishing act links to a pivotal disagreement between falsification vs. confirmation accounts (probabilisms) • Probabilisms, condition on the actual data • Error probabilities consider outcomes that could have occurred but did not (sampling distribution) 22
  • 23.
    On the LP,error probabilities appeal to something irrelevant “Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436) • No wonder reformers often talk past each other 23
  • 24.
    Many “reform” methodsoffered follow the LP • “Bayes factors can be used in the complete absence of a sampling plan” (Bayarri, Benjamin, Berger, Sellke 2016, 100) • “Data should be able to speak for itself.” (Berger and Wolpert 1988, 78; The Likelihood Principle) • If someone is selling you an account where sampling plans don’t matter, you might want to hold off buying. 24
  • 25.
    Probabilists can stillblock intuitively unwarranted inferences (without error probabilities)? • If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim • Might work in some cases 25
  • 26.
    Problems • Additional sourceof flexibility, priors as well as biasing selection effects • Doesn’t show what researchers had done wrong—battle of beliefs • The believability of data-dredged hypotheses is what makes them so seductive 26
  • 27.
    Some Bayesians rejectprobabilism (Gelman: Falsificationist Bayesian; Shalizi: error statistician) “[C]rucial parts of Bayesian data analysis, … can be understood as ‘error probes’ in Mayo’s sense” “[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data.” (Gelman and Shalizi 2013, 10, 20). Can’t also champion “abandoning statistical significance” 27
  • 28.
    The 2019: Don’tsay ‘significance’, don’t use P-value thresholds • Editors of the March 2019 issue TAS "A World Beyond p < 0.05"—Wasserstein, Schirm, Lazar— say "declarations of ‘statistical significance’ be abandoned" (p. 2). • On their view: Prespecified P-value thresholds should never be used in interpreting results. • it is not just a word ban but a gatekeeper ban 28
  • 29.
    “Retiring statistical significance wouldgive bias a free pass". John Ioannidis (2019) "...potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science”. I agree, and in Mayo (2019) I show why. 29
  • 30.
    • Complying withthe “no threshold” view precludes the FDA's long-established drug review procedures, as Wasserstein et al. 2019 recognize • They think by removing P-value thresholds, researchers lose an incentive to data dredge, and otherwise exploit researcher flexibility • Even if true, it's a bad argument. (Decriminalizing robbery results in less robbery arrests) • But it's not true. 30
  • 31.
    • Even withoutthe word significance, eager researchers still can’t take the large (non- significant) P-value to indicate a genuine effect • It would be to say: Even though larger differences would frequently occur by chance variability alone, my data provide evidence they are not due to chance variability • In short, he would still need to report a reasonably small P-value • The eager investigator will need to "spin" his results, ransack, data dredge 31
  • 32.
    • In aworld without predesignated thresholds, it would be hard to hold him accountable for reporting a nominally small P-value: • “whether a p-value passes any arbitrary threshold should not be considered at all" in interpreting data (Wasserstein et al. 2019, 2) 32
  • 33.
    New England Journalof Medicine reacts to abandon significance • The significance level from [a well-defined study] is a reliable indicator of the extent to which the data contradict a null hypothesis of no association between an intervention and a response • Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments [for multiple trials] have a role in those decisions (Harrington et al. 2019, 286, NEJM) 33
  • 34.
    No tests, nofalsification • The “no thresholds” view also blocks common uses of confidence intervals and Bayes factor standards • If you cannot say about any results, ahead of time, they will not be allowed to count in favor of a claim, then you do not have a test of it • Don’t confuse having a threshold for a terrible test with using a fixed P-value across all studies in an unthinking manner • We should reject the latter 34
  • 35.
    My view: ReformulateTests • Instead of a binary cut-off (significant or not) the particular outcome is used to infer discrepancies that are or are not warranted • Avoids fallacies of significance and nonsignificance, and improves on confidence interval estimation 35
  • 36.
    Mayo and Cox(2006): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006) FEV/SEV significant result : A small P-value is evidence of discrepancy γ from H0, if and only if, there is a high probability the test would have d(X) < d(x0) were a discrepancy as large as γ absent FEV/SEV: insignificant result: A moderate P-value is evidence of the absence of a discrepancy γ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy γ to exist 36
  • 37.
    New ASA TaskForce on Significance Tests and Replication • to “prepare a …piece reflecting “good statistical practice,” without leaving the impression that p- values and hypothesis tests…have no role.” (Karen Kafadar 2019) 37
  • 38.
    Brief concluding remark •The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. • Some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. (either they obey the LP or block thresholds) 38
  • 39.
  • 40.
    References • Bayarri, M.,Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90-103. • Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics. • Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033. • Cox, D. R. and Hinkley, D. (1974). Theoretical Statistics. London: Chapman and Hall. • Fisher, R. A. (1947). The Design of Experiments 4th ed. Edinburgh: Oliver and Boyd. • Fisher, R. A. (1955). “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78. • FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical Trials: Guidance for Industry (DRAFT GUIDANCE).” Retrieved from https://www.fda.gov/media/102657/download • Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80. 40
  • 41.
    • Goodman SN.(1999). “Toward Evidence-based Medical Statistics. 2: The Bayes factor.” Annals of Internal Medicine 1999; 130:1005 –1013. • Harrington D, D'Agostino R, Gatsonis C, et al. (2019). "New Guidelines for Statistical Reporting in the Journal. " N Engl J Med.381: 285-286. • Ioannidis, J. (2005). “Why Most Published Research Findings are False.” PLoS Medicine 2(8), 0696–0701. • Ioannidis J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance.” JAMA. 321(21): 2067–2068. doi:10.1001/jama.2019.4582 • Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner, AmStat News, (Issue 510), (Dec. 2019). • Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston. • Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press. • Mayo, D. (2019). ”P-value thresholds: Forfeit at your peril”. European Journal of Clinical Investigation, 49, e13170. https://doi.org/10.1111/eci.13170 41
  • 42.
    • Mayo, D.and Cox, D.R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics: 247-275. • Mayo, D. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction." The British Journal for the Philosophy of Science 57(2): 323–57. • Meehl, P., and Waller, N. (2002). “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283– 300. • Morrison, D. and Henkel, R., (eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. • NEJM Author Guidelines (2019): Retrieved from: https://www.nejm.org/author- center/new-manuscripts on July 19, 2019. • Neyman, J. (1957). “The Use of the Concept of Power in Agricultural Experimentation.” Journal of the Indian Society of Agricultural Statistics IX(1), 9– 17. • Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses." Philosophical Transactions of the Royal Society of London Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85. 42
  • 43.
    43 • Neyman, J.and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson. Cambridge: Cambridge University Press. • Pearson, E.S. & Neyman, J. (1930). “On the problem of two samples”, Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Cambridge: Cambridge University Press). First published in Bul. Acad. Pol.Sci. 73-96. • Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books. Reprinted 2000 The Logic of Scientific Discovery. London, New York: Routledge. • Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and Littlefield. • Selvin, H. (1970). “A critique of tests of significance in survey research. In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter. • Simmons, J. Nelson, L. and Simonsohn, U. (2012). “A 21 Word Solution.” Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4– 7. • Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial]. The American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913