Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent that it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, as with confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability assigned to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence that μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts; we need to consider what the method would have inferred had other data been observed. For each point μ′ in the interval, we assess how severely the claim μ > μ′ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
1. Error Control and Severity
Deborah G. Mayo
Dept. of Philosophy, Virginia Tech
November 13, 2022
Philosophy of Science Association
Multiplicity, Data-Dredging, and Error Control
3. (minimal) Severity Requirement:
If the test procedure had little or no capability of
finding flaws with C (even if present), then
agreement between data x0 and C provides poor
(or no) evidence for C
(“too cheap to be worth having” Popper 1983)
4. Data Dredging (Torturing): Hunting for Subgroups in RCTs
The case of the Drug CEO:
• No statistically significant benefit on the primary
endpoint (improved lung function)
• Nor on any of 10 secondary endpoints
• Ransacks the unblinded data for a subgroup
where those on the drug did better.
• Reports it as a statistically significant result from a
double-blind study
The method has a high probability of reporting drug
benefit (in some subgroup or other), even if none
exists—illicit P-value.
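To see the scale of the problem, here is a minimal simulation sketch (the subgroup count, sample sizes, and α are my illustrative assumptions, not numbers from the case):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_trials, n_subgroups, n_per_arm = 10_000, 10, 50

hits = 0
for _ in range(n_trials):                  # simulated RCTs: NO drug benefit anywhere
    for _ in range(n_subgroups):           # ransack subgroups post hoc
        drug = rng.normal(0, 1, n_per_arm)
        placebo = rng.normal(0, 1, n_per_arm)
        z = (drug.mean() - placebo.mean()) / np.sqrt(2 / n_per_arm)
        if norm.sf(z) < 0.05:              # nominally significant benefit
            hits += 1
            break

print(f"Pr(report a 'significant' benefit | no effect) ≈ {hits / n_trials:.2f}")
# ≈ 1 - 0.95**10 ≈ 0.40, not the nominal .05: the reported P-value is illicit.
```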
5. But some cases of multiplicity & data dredging satisfy severity
Searching a full database for a DNA match with a
criminal’s DNA:
• The probability is high of a mismatch with person i, if
i were not the criminal;
• So, the match is good evidence that i is the criminal.
• A non-match virtually excludes the person, thereby strengthening the inference.
It’s the severity, or lack of it, that distinguishes whether a data-dredged claim is warranted.
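A back-of-the-envelope sketch in the same spirit; the random match probability and database size are hypothetical numbers of my choosing:

```python
# Hypothetical numbers (my assumptions): random match probability (RMP) for an
# unrelated individual, and a database of 5 million profiles.
rmp = 1e-9
db_size = 5_000_000

# Severity-relevant error probability for the specific person i who matched:
# Pr(i matches | i is not the criminal) = rmp -> tiny, so a match with i is
# highly probative that i is the criminal.
print(f"Pr(match with i | i not the criminal) = {rmp:.0e}")

# The chance that SOMEONE in the database falsely matches is a different question:
p_some_false = 1 - (1 - rmp) ** db_size
print(f"Pr(some false match in whole database) ≈ {p_some_false:.4f}")
# ≈ 0.005 here; and each non-match virtually excludes that person, so searching
# the full database strengthens, rather than weakens, the inference.
```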
6. The data dredging at issue involves some element of double-counting
When data-driven discoveries are tested on new
data, it’s not data dredging
• The FDA gave the drug CEO funds to test his
‘exploratory’ hypothesis
• So it’s not a matter of whether the hypothesis might be worth studying
(Highly motivated CEOs can make hypotheses appear so.)
7.
• When the new study was stopped for futility
(2009), FDA said: you’re going to jail (for the
misleading press report)!
(The case reached the Supreme Court in 2013; Mayo 2020)
• Even if the follow-up had succeeded, the initial
data poorly tested the dredged claim
8. Ruling out chance vs. explaining a known effect
The dredged hypotheses need not be
prespecified to be kosher
• The same data were used to arrive at and
test the source of a set of blurred 1919
eclipse data (mirror distortion by the sun’s
heat)
Nor is it a problem that the same data are used to test multiple different claims (as Fisher recommended)
9. My main point
The problem is when the results (or hypotheses) are related in such a way that the tester ensures the only ones to emerge or be reported are in sync with the claim C at issue (even if C is false).
The successes are due to the biasing selection effects, not C’s truth.
10. Biasing Selection Effects:
When data or hypotheses are selected,
generated or interpreted in such a way as to
fail the severity requirement
(includes inability to assess severity even
approximately)
11. The Severity Requirement with Data Dredging and Multiplicity
• Severity has to be assessed according to the type of error that C claims is well ruled out by the data x.
12. In some cases we can compensate with P-value adjustments (huge literature)
• Even where such adjustments don’t give a quantitative severity assessment, they can reveal reasonably high or terrible severity*
• If you got 10 hits (out of 400 tests) and chance alone expects, say, 12, you haven’t distinguished your effect from random variation (see the sketch below)
• The main thing is: cases with selection effects shouldn’t be treated the same as if selection didn’t occur
*I prefer to report the nominal P-value, and try to adjust post-data.
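Here is the sketch for the 10-hits-in-400-tests illustration; the per-test α of .03 is my assumption, chosen so that chance alone expects about 12 hits:

```python
from scipy.stats import binom

n_tests, alpha, hits = 400, 0.03, 10   # alpha chosen so 400 * .03 = 12 expected

print(f"expected chance hits: {n_tests * alpha:.0f}")
p_at_least = binom.sf(hits - 1, n_tests, alpha)   # Pr(>= 10 hits | all nulls true)
print(f"Pr(>= {hits} hits | all nulls true) ≈ {p_at_least:.2f}")
# ≈ 0.76: 10 hits is fewer than chance alone expects, so nothing has been
# distinguished from random variation, whatever the 10 nominal P-values say.
```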
13. Post-data model selection is a field of its own
• AI/ML prediction models may compensate for using the “same” data by cross-validation and data splitting, at least with IID data (see the sketch below).
• They are not free from reproducibility and replicability crises.
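A minimal sketch of such data splitting under IID data (the subgroup setup and all numbers are hypothetical): dredging is confined to one half, so the held-out half retains an honest error probability:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# 20 candidate subgroups, IID data, no true effect anywhere:
data = {g: (rng.normal(0, 1, 100), rng.normal(0, 1, 100)) for g in range(20)}

def pval(drug, placebo):
    return ttest_ind(drug, placebo).pvalue

# Dredge on the FIRST half of each subgroup's data only:
best = min(data, key=lambda g: pval(data[g][0][:50], data[g][1][:50]))

# Confirm on the untouched SECOND half: an honest error probability,
# because the confirmation data played no role in the hunt.
p_confirm = pval(data[best][0][50:], data[best][1][50:])
print(f"dredged subgroup {best}: held-out p = {p_confirm:.2f}")
```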
14. Recap so far
I) Multiplicity and data dredging can alter
capabilities to probe errors (error probabilities)
a. True, but appropriate data-dredging can
satisfy relevant error probabilities
Next part:
b. True, but rivals to error statistics hold
principles of evidence where error
probabilities don’t matter
15. Non-error statistical principles of evidence
All the evidence is via likelihood ratios (LR) of
hypotheses
Pr(x0;H1)/Pr(x0;H0)
The data x0 are fixed, the hypotheses vary
• Error probabilities drop out; the analysis
“conditions” on the observed x
• The drug CEO’s data-dredged claim is
“supported” by x
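A minimal numeric sketch (my numbers) of how this plays out for the drug CEO: set the dredged alternative at the observed mean and the LR favors it, however the data were obtained:

```python
from scipy.stats import norm

x_bar, se = 1.4, 1.0   # dredged subgroup: observed mean benefit and its SE

# With x0 fixed, compare the likelihoods of two hypotheses:
lik_H0 = norm.pdf(x_bar, loc=0.0, scale=se)      # H0: no benefit
lik_H1 = norm.pdf(x_bar, loc=x_bar, scale=se)    # H1: the data-dredged claim

print(f"LR = Pr(x0; H1)/Pr(x0; H0) ≈ {lik_H1 / lik_H0:.1f}")
# ≈ 2.7 in favor of H1, and it is the same however many subgroups were
# ransacked to find it: the error probabilities have dropped out.
```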
16. All error probabilities violate the Likelihood Principle (LP)
• “Sampling distributions, significance levels,
power, all depend on something more [than the
likelihood function]–something that is irrelevant
in Bayesian inference–namely the sample
space.” (Lindley 1971, 436)
19. In (2-sided) testing the mean of a standard normal distribution
Another kind of dredging: Optional Stopping
Stopping rules are irrelevant for Bayesians (the Stopping Rule Principle)
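The standard illustration, as a simulation sketch (trial count and sample-size cap are my choices): test after every observation and stop the moment the result is nominally significant:

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_max = 5_000, 500   # simulated studies; cap on sample size

rejections = 0
for _ in range(n_trials):
    x = rng.normal(0, 1, n_max)          # H0 true: mean 0, known sigma = 1
    z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))   # z after each obs
    if np.any(np.abs(z) >= 1.96):        # stop as soon as it looks significant
        rejections += 1

print(f"actual type I error with optional stopping ≈ {rejections / n_trials:.2f}")
# Well above the nominal .05, and it goes to 1 as n_max grows ("sampling to a
# foregone conclusion"); yet the likelihood function, and so the Bayesian
# analysis, is unchanged by the stopping rule.
```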
20. “This principle is automatically satisfied by Bayesian analysis, but is viewed as crazy by many frequentists.” (Bayarri et al. 2004, 77)
21. Not just an example in phil stat, it leads to quandaries in real trials
“The [regulatory] requirement of type I error control
for Bayesian adaptive designs causes them to lose
many of their philosophical advantages, e.g.,
compliance with the likelihood principle”
(Ryan et al. 2020, radiation oncology)
They admit “the type I error was inflated in the
Bayesian adaptive designs … [but] adjustments to
posterior probabilities, are not required”.
22. • Default, non-subjective, O-Bayesians admit to
“technical violations” of the likelihood principle
(the prior can depend on the model)* (Ghosh et
al. 2010)
• We don’t see them embracing error statistical
principles—yet (it would be welcome)
(*That’s aside from violations of the LP in testing
model assumptions.)
23. Bayesians may block (or accept) claims from multiplicity/dredging with prior probabilities (without error probs)
Problems:
• Increases flexibility (selection effects + priors
which can also be data dependent)
• Doesn’t show what’s gone wrong—it’s the
multiplicity
24. Criticisms of P-hackers lose force
• Statistically significant results that fail to replicate
are often reanalyzed Bayesianly
• Rather than point to (sometimes blatant) biased selections, Bayesians show that with a high prior on H0 the data favor H0 over H1
• A P-value can be small while Pr(H0|x) non-small or
even large.
25. Bayes/Fisher Disagreement: Spike and Smear
• A point null hypothesis H0: μ = 0 vs. H1: μ ≠ 0, with Xi ~ N(μ, σ²): a lump of prior probability is placed on H0, or on a tiny area around it
• Depending on how you spike and how you smear, an α-significant result can even correspond to Pr(H0|x) = (1 − α)! (e.g., 0.95)
• But Pr(H0|x) can also agree with the small α
(though they’re measuring different things)
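A minimal sketch of the spike-and-smear computation, assuming a .5 spike on H0, a N(0, 1) smear under H1, and a just-significant result at n = 10,000 (all parameter choices are mine):

```python
import numpy as np
from scipy.stats import norm

n, sigma, tau = 10_000, 1.0, 1.0
x_bar = 1.96 * sigma / np.sqrt(n)   # just significant at two-sided alpha = .05

# Spike-and-smear prior: Pr(H0: mu = 0) = 0.5; under H1, mu ~ N(0, tau^2),
# so the marginal of the sample mean under H1 is N(0, sigma^2/n + tau^2).
lik_H0 = norm.pdf(x_bar, 0, sigma / np.sqrt(n))
lik_H1 = norm.pdf(x_bar, 0, np.sqrt(sigma**2 / n + tau**2))
post_H0 = 0.5 * lik_H0 / (0.5 * lik_H0 + 0.5 * lik_H1)

print(f"P-value = .05, but Pr(H0|x) ≈ {post_H0:.2f}")
# ≈ 0.94 here, and it climbs toward 1 as n grows: the alpha-significant
# result ends up "supporting" the spiked null (Bayes/Fisher disagreement).
```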
26. A life raft to data dredgers:
“Bayesians can easily discount my statistically significant result this way.”
• Even if it’s correct to reject the data-dredged claim, it can be right for the wrong reason:
• Put the blame where it belongs.
27. More stringent P-value thresholds?
• The high prior on H0 leads to the popular movement to require more stringent P-values (“Redefine Statistical Significance,” Benjamin et al. 2017).
• Lowering P-value thresholds may compensate for multiple testing (e.g., the look-elsewhere effect in high-energy particle physics) and for error control in Big Data “diagnostic screening” (e.g., interesting/not interesting)
28. • But the goal of the redefiners is to get the P-value
more in line with a Bayesian posterior on H0,
assuming H0 is given the high prior.
• Even advocates say it “does not address multiple
hypothesis testing, P-hacking” (Benjamin et al.
2017)
29. We should worry about biases in favor of “no effect”
• The severe tester computes the probability that this Bayesian analysis leads to erroneously failing to find various discrepancies (type 2 errors), as in the sketch below
• If it’s high for discrepancies of interest, she denies claims of no evidence against, let alone evidence for, H0
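A sketch of that computation, reusing the spike-and-smear setup above (parameters again my assumptions) and taking μ = 0.2 as the discrepancy of interest:

```python
import numpy as np
from scipy.stats import norm

n, sigma, tau, mu_true = 100, 1.0, 1.0, 0.2   # mu_true: discrepancy of interest
rng = np.random.default_rng(11)
x_bar = rng.normal(mu_true, sigma / np.sqrt(n), 100_000)  # simulated sample means

# Same spike-and-smear posterior as before, Pr(H0) = 0.5:
lik_H0 = norm.pdf(x_bar, 0, sigma / np.sqrt(n))
lik_H1 = norm.pdf(x_bar, 0, np.sqrt(sigma**2 / n + tau**2))
post_H0 = 0.5 * lik_H0 / (0.5 * lik_H0 + 0.5 * lik_H1)

type2 = np.mean(post_H0 > 0.5)   # analysis comes out favoring "no effect"
print(f"Pr(analysis favors H0 | mu = {mu_true}) ≈ {type2:.2f}")
# ≈ 0.56 here: a high probability of erroneously failing to find this
# discrepancy, so such a result is poor evidence FOR "no effect".
```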
30. Last piece: Implications for using error probabilities inferentially
• The problem of double counting/novel evidence
led me to the severity criterion, and to a distinct
use of error probabilities (probativism)
31. The severe tester reformulates tests with a discrepancy γ from H0
• Admittedly, small P-values can be misinterpreted
as indicating larger discrepancies than warranted
• We infer discrepancies (population effect sizes)
that are and are not warranted (and how well)
Mayo (1991, 2018); Mayo and Spanos (2006, 2010); Mayo and Cox (2006, 2010); Mayo and Hand (2022)
33. Akin to forming the lower confidence interval (CI) bound (estimating μ; SE is the standard error, SE = 1)
μ > X̄ − 2SE
The 98% lower confidence interval estimator: 98% of the time it would correctly estimate μ (performance)
Once the sample mean is observed, μ > x̄ − 2SE is the CI-lower estimate* (here, μ > 0); it cannot be assigned .98 probability.
*fiducial 2% value (Fisher 1936): μ < CI-lower estimator 2% of the
time. These values of μ are not rejectable at the .02 level (Neyman)
34. SEV gives an inferential view of CIs
Severe tester:
μ > CI-lower (e.g., μ > 0) is warranted because with high probability (.98) we would have observed a smaller sample mean if μ ≤ CI-lower.
We report several confidence levels, a confidence distribution (Cox 1958), or a severity distribution.
(Thornton will give her take)
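A minimal sketch of the severity computation behind this, assuming the slide 33 setup (x̄ = 2, SE = 1, so that CI-lower is 0):

```python
from scipy.stats import norm

x_bar, se = 2.0, 1.0            # observed mean and SE = 1, as on slide 33
ci_lower = x_bar - 2 * se
print(f"98% lower bound: mu > {ci_lower:.1f}")

# SEV(mu > mu'): the probability of observing a SMALLER sample mean than
# we did, were mu actually as small as mu'.
for mu_prime in [0.0, 0.5, 1.0, 1.5, 2.0]:
    sev = norm.cdf((x_bar - mu_prime) / se)
    print(f"SEV(mu > {mu_prime:3.1f}) = {sev:.3f}")
# 0.977, 0.933, 0.841, 0.691, 0.500: varying mu' traces out a severity
# (or confidence) distribution instead of a single binary report.
```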
35. Fallacies of non-significant results?
Say the cut-off for significance requires the sample mean to exceed 2
• It’s not evidence of no discrepancy from 0, but it is not uninformative (even in simple Fisherian tests)
• Can infer (with severity) the absence of
discrepancies that probably would have led to a
larger sample mean (smaller P-value)*
*Less coarse than power analysis
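A sketch for such a non-significant result; the observed mean of 1.5 (below the cut-off of 2) is my illustrative assumption:

```python
from scipy.stats import norm

x_bar, se = 1.5, 1.0   # non-significant: the mean needed to exceed 2

# SEV(mu <= mu'): the probability of observing a LARGER sample mean than
# we did, were the discrepancy actually mu' or more.
for mu_prime in [1.5, 2.5, 3.5]:
    sev = norm.cdf((mu_prime - x_bar) / se)
    print(f"SEV(mu <= {mu_prime:3.1f}) = {sev:.3f}")
# 0.500, 0.841, 0.977: discrepancies of 3.5 or more are well ruled out, while
# mu <= 1.5 is poorly probed, so the "negative" result is informative without
# being evidence of no discrepancy at all.
```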
38. Recap
I. Multiplicity and data dredging can alter
capabilities to probe errors (error probabilities)
a) True, but appropriate data-dredging can satisfy
relevant error probabilities
b) True, but rivals to error statistics hold principles
of evidence where error probabilities don’t
matter.
II. A severity formulation allows error probabilities of
methods to be construed evidentially.
40. References
• Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and Frequentist
Analysis’, Statistical Science 19, 58–80.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). ‘Redefine Statistical
Significance’, Nature Human Behaviour 2, 6–10.
• Berger, J. (2006). ‘The Case for Objective Bayesian Analysis’ and ‘Rejoinder’, Bayesian
Analysis 1(3), 385–402; 457–64.
• Cox, D. (1958). ‘Some Problems Connected with Statistical Inference’, Annals of
Mathematical Statistics 29(2), 357–72.
• Casella, G. and Berger, R. (1987a). ‘Reconciling Bayesian and Frequentist Evidence in
the One-sided Testing Problem’, Journal of the American Statistical Association 82
(397), 106–11.
• Fisher, R. A. (1936), ‘Uncertain Inference’, Proceedings of the American Academy of
Arts and Sciences 71, 248–58.
• Ghosh, J., Delampady, M., and Samanta, T. (2010). An Introduction to Bayesian
Analysis: Theory and Methods. New York: Springer.
• Godambe, V. and Sprott, D. (eds.) (1971). Foundations of Statistical Inference. Toronto:
Holt, Rinehart and Winston of Canada.
• Lindley, D. (1971). ‘The Estimation of Many Parameters’, in Godambe, V. and Sprott, D.
(eds.), pp. 435–55.
• Mayo, D. G. (1991). ‘Novel Evidence and Severe Tests’, Philosophy of Science 58(4), 523–52. Reprinted (1991) in The Philosopher’s Annual XIV, 203–32.
41. References (cont.)
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge, Chicago:
Chicago University Press. (1998 Lakatos Prize)
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2020). ‘P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting’, Harvard Data Science Review 2(1).
• Mayo, D. G. and Hand, D. (2022). ‘Statistical Significance and Its Critics: Practicing Damaging Science, or Damaging Scientific Practice?’, Synthese 200, 220. https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D. G. and Cox, D. R. (2006). ‘Frequentist Statistics as a Theory of Inductive Inference’, in Rojo, J. (ed.), Optimality: The Second Erich L. Lehmann Symposium, IMS Lecture Notes–Monograph Series 49, 77–97.
• Mayo, D. G. and Spanos, A. (2006). ‘Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction’, British Journal for the Philosophy of Science 57, 323–57.
• Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and
Littlefield.