Framing an Appropriate Research Question
Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (Selection Effects)
1. 2nd half of “Frequentist Statistics as a Theory of Inductive
Inference” (Selection Effects)
The idealized formulation in the initial definition of a significance
test starts with a hypothesis and a test statistic, obtains data, then
applies the test and looks at the outcome:
The hypothetical procedure involved in the definition of the
test then matches reasonably closely what was done;
The possible outcomes are the different possible values of the
specified test statistic.
This permits features of the distribution of the test statistic to
be relevant for learning about aspects of the mechanism
generating the data.
2. It often happens that either the null hypothesis or the test statistic
is influenced by preliminary inspection of the data, so that the
actual procedure generating the final test result is altered:*
This may alter the capability of the test to detect
discrepancies from the null hypothesis reliably, calling for
adjustments in its error probabilities.
This is required to ensure that the p-values serve their intended
purpose for frequentist inference, whether in behavioral or
evidential contexts.
* the objective of the test is to enable us to learn something about
the underlying data generating mechanism, and this learning is
made possible by correctly assessing the actual error probabilities.
3. Ad hoc Hypotheses, Non-novel Data, Double-Counting, etc.
The general point involved has been discussed extensively in both
philosophical and statistical literatures.
In the former, under such headings as requiring novelty or
avoiding ad hoc hypotheses (use-constructions, etc.);
In the latter, as rules against peeking at the data, shopping
for significance, data mining, etc., for taking selection effects into
account.
(This will come up again throughout the semester. Optional
stopping is an example of a data dependent strategy, as with “look
elsewhere” effects in the Higgs research)
These problems remain unresolved in general.
4. Error statistical considerations, coupled with a sound principle of
inductive evidence, may allow going further by providing criteria
for when various data dependent selections matter and how to take
account of their influence on error probabilities.
(Some items by Mayo and Spanos: “How to discount double
counting when it counts”; “Some surprising facts about surprising
facts”; chapters 7, 8, and 9 of EGEK, especially 8; “hunting without a
license” (Spanos).)
5. In particular, if the null hypothesis is chosen for testing just because
the test statistic is large, the probability of finding some such
discordance or other may be high even under the null.
Thus, following FEV, we would not have genuine evidence of
inconsistency with the null, and unless the p-value is modified
accordingly, the inference would be misleading.
6. Example 1: Hunting for Statistical Significance
Investigators have 20 independent sets of data, each reporting on
different but closely related effects.
After doing all 20 tests, with 20 nulls H0i, i = 1, …, 20,
they report only the smallest p-value, e.g., 0.05, and its
corresponding null hypothesis, say H0,13.
e.g., there is no difference between some treatment (a childhood
training regimen) and a factor, f13 (some personality characteristic
later in life).
Passages from EGEK (Morrison and Henkel)
7. This “hunting” procedure should be compared with a case where
H0,13 was preset as the single hypothesis to test, and the small p-value found.
In the hunting case, the possible results are the possible
statistically significant factors that might be found to show a
“calculated” statistically significant departure from the null. The
relevant type 1 error probability is the probability of finding at
least one such significant difference out of 20, even though the
global null is true (i.e., all twenty observed differences are due
to chance).
The probability that this procedure yields erroneous rejection
differs from, and will be much greater than, 0.05 (and is
approximately 0.64).
8. There are different, and indeed many more, ways one can err in
this example than when one null is preset, and this is reflected
in the adjusted p-value.
My blog (reblogged March 3, 2014)
Hardly a day goes by where I do not come across an article on the
problems for statistical inference based on fallaciously capitalizing
on chance: high-powered computer searches and “big” data trolling
offer rich hunting grounds out of which apparently impressive
results may be “cherry-picked”:
When the hypotheses are tested on the same data that suggested
them and when tests of significance are based on such data, then a
spurious impression of validity may result. The computed level of
significance may have almost no relation to the true level. . . .
9. Suppose that twenty sets of differences have been examined, that
one difference seems large enough to test and that this difference
turns out to be “significant at the 5 percent level.” Does this mean
that differences as large as the one tested would occur by chance
only 5 percent of the time when the true difference is zero? The
answer is no, because the difference tested has been selected from
the twenty differences that were examined. The actual level of
significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]
Critics of the Morrison and Henkel ilk clearly report that ignoring
a variety of “selection effects” results in a fallacious computation
of the actual significance level associated with a given inference;
the “computed” or “nominal” significance level differs from the
actual or warranted significance level.
10. [1] Selvin calculates this approximately by considering the
probability of finding at least one statistically significant difference
at the .05 level when 20 independent samples are drawn from
populations having true differences of zero, 1 – P(no such
difference): 1 – (.95)^20 = 1 – .36 = .64. This assumes, unrealistically,
independent samples, but without that it may be unclear how to
even approximately compute actual p-values.
This influence on long-run error is well known, but should it
influence the interpretation of the result in a context of inductive
inference?
According to frequentist or severity reasoning, it should.
It is not so easy to explain why:
11. The concern is not the avoidance of often announcing genuine
effects erroneously in a series; the concern is that this test performs
poorly as a tool for discriminating genuine from chance effects in
this case.
Because at least one such impressive departure, we know, is
common even if all are due to chance, the test has scarcely
reassured us that it has done a good job of avoiding such a
mistake in this case.
Even if there are other grounds for believing the genuineness of
the one effect that is found, we deny that this test alone has
supplied such evidence.
12. The "hunting procedure" does a very poor job in alerting us to,
in effect, temper our enthusiasm, even where such tempering is
warranted.
If the p-value is adjusted to reflect the actual error rate, the
test again becomes a tool that serves this purpose.
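For instance (a sketch of one standard correction, not one the text itself mandates): if the smallest of k independent p-values is reported, the adjusted p-value is the probability, under the global null, that the minimum would come out at least that small.

```python
def hunting_adjusted_p(p_min: float, k: int) -> float:
    """Actual error probability of reporting the smallest of k
    independent p-values: P(min p-value <= p_min | all k nulls true)."""
    return 1.0 - (1.0 - p_min) ** k

# The nominal 0.05 found by hunting over 20 nulls is actually ~0.64:
print(round(hunting_adjusted_p(0.05, 20), 3))  # 0.642
```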
13. Example 2. Hunting for a Murderer
(hunting for the source of a known effect by eliminative induction)
Testing for a DNA match with a given specimen, known to be that
of the murderer, a search through a data-base of possible matches
is done one at a time.
We are told, in a fairly well-known presentation of this case, that:
P(DNA match; not murderer) = very small
P(DNA match; murderer) ~ 1
The first individual, if any, from the data-base for which a match is
found is declared to truly match the criminal, i.e., to be the
murderer.
14. (The null hypothesis, in effect, asserts that the person tested does
NOT “match the criminal”; so the null is rejected iff there is an
observed DNA match.)
Example 2 is superficially similar to Example 1, finding a DNA
match being somewhat akin to finding a statistically significant
departure from a null hypothesis: one searches through data and
concentrates on the one case where a "match" with the criminal's
DNA is found, ignoring the non-matches.
If one adjusts for "hunting" in Example 1, shouldn't one do so
in broadly the same way in Example 2?
15. No!
(Although some have erroneously supposed frequentists say “yes”)
In Example 1 the concern is inferring a genuine, “reproducible"
effect, when in fact no such effect exists;
In Example 2, there is a known effect or specific event, the
criminal's DNA, and reliable procedures are used to track down the
specific cause or source (as conveyed by the low “erroneous-match” rate).
16. The probability is high that we would not obtain a match with
person i, if i were not the criminal; so, by FEV, finding the
match is excellent evidence that i is the criminal. Moreover,
each non-match found, by the stipulations of the example,
virtually excludes that person;
Note: the relevant contrast is between finding a match with the
first person tested and hunting for a match through a data-base.
The more negative results found, the more the inferred "match"
is fortified; whereas in Example 1 this is not so.
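The asymmetry can be made concrete with numbers (the figures below are assumptions for illustration, not from the text): with a per-comparison erroneous-match rate as low as the example stipulates, even an exhaustive search through a large data-base has only a small probability of turning up any false match, so no Example 1-style inflation occurs.

```python
# Assumed illustrative values: a one-in-a-billion erroneous-match rate
# per comparison, and a data-base of one million profiles.
false_match_rate = 1e-9
db_size = 1_000_000

# P(at least one FALSE match somewhere in the search), the analogue of
# the 0.64 computed for Example 1:
p_any_false = 1.0 - (1.0 - false_match_rate) ** db_size
print(f"{p_any_false:.1e}")  # ~1.0e-03: small, despite hunting through 10^6

# Contrast with Example 1: there, each of only 20 "looks" had a 0.05
# chance of a spurious hit, so a spurious hit was to be expected.
```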
18. Data-dependent Specification of distance or “cut-offs”
(case 1)
An analogy — The Texas Sharpshooter: testing a
sharpshooter's ability by having him shoot and then drawing a
bull's-eye around his results so as to yield the highest number of
bull's-eyes.
The skill that one is allegedly testing and making inferences
about is his ability to shoot when the target is given and fixed,
while that is not the skill actually responsible for the resulting
score.
19. Case 2:
By contrast, if the choice of specification is guided not by
considerations of the statistical significance of departures from the
original null hypothesis, but rather by the fact that one specification
is an empirically adequate statistical model while the other violates
assumptions, no adjustment for selection is called for.
Indeed, using a statistically adequate specification gives
reassurance that the calculated p-value is relevant for
interpreting the evidence reliably.
20. Need for Adjustments for Data-Dependent Selections
How does our conception of the frequentist theory of induction
help to guide the answers?
1. It must be considered whether the context is one where the key
concern is the control of error rates in a series of applications
(behavioristic goal), or whether it is a context of evaluating
specific evidence (inferential goal).
The relevant error probabilities may be altered for the former
context and not for the latter.
2. To determine the relevant hypothetical series on which to base
error frequencies one must identify the particular obstacles that
need to be avoided for a reliable inference in the particular case,
21. and the capacity of the test, as a measuring instrument, to have
revealed the presence of the obstacle.
22. Statistics in the Discovery of the Higgs
Everyone was excited about the announced evidence for the
discovery of a standard model (SM) Higgs particle based on a
“5 sigma observed effect” (July 2012).
But because this report is linked to the use of statistical significance
tests, some Bayesians raised criticisms.
23. They want to ensure that, before announcing the hypothesis H*:
“an SM Higgs boson has been discovered” (with such and such
properties),
H* has been given a severe run for its money:
that with extremely high probability we would have observed
a smaller excess of signal-like events, were we in a universe
where
H0: μ = 0 (the background-only hypothesis, as against the alternative μ > 0) holds.
So, very probably, H0 would have survived a cluster of tests T,
fortified with much cross-checking, were μ = 0.
24. Note what’s being given a high probability:
Pr(test T would produce less than 5 sigma; H0) > .9999997.
With probability .9999997, our methods would show that the
bumps disappear (as so often occurred), under the assumption
that the data are due to background (H0).
Assuming we want a posterior probability in H* seems to be a
slide from the value, for assessing the warrant for H*, of
knowing that this probability is high.
Granted, this inference relies on an implicit severity principle
of evidence.
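The .9999997 figure is just the standard normal probability of falling below 5 sigma; here is a quick check using only the Python standard library (assuming, as in the physics reports, a one-sided 5 sigma threshold):

```python
import math

# One-sided tail area beyond 5 standard deviations of a standard normal:
tail = 0.5 * math.erfc(5 / math.sqrt(2))
print(f"P(Z >= 5 sigma; H0) = {tail:.3e}")      # ~2.87e-07
print(f"P(Z <  5 sigma; H0) = {1 - tail:.7f}")  # 0.9999997
```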
25. Data provide good evidence for inferring H (just) to the extent
that H passes severely with x0, i.e., to the extent that H would
(very probably) not have survived the test so well were H false.
They then quantify various properties of the particle discovered
(inferring ranges of magnitudes)
26. The p-value police
Leading (subjective) Bayesian Dennis Lindley had a letter
sent around (to ISBA members):
Why demand such strong evidence?
(This could only be warranted if beliefs in the Higgs were extremely
low or the costs of error exorbitant.)
Are they so wedded to frequentist methods? Lindley asks.
“If so, has anyone tried to explain what bad science that is?”
27. Other critics rushed in to examine whether reports (by journalists
and scientists) misinterpreted the sigma levels as posterior
probability assignments to the models.
Many critics have claimed that the .99999 was fallaciously
being assigned to H* itself—a posterior probability in H*.
Surely there are misinterpretations, but many reports were not
misinterpreting.
What such critics are doing is interpreting a legitimate error
probability as a posterior in H*: SM Higgs.
28. Physicists did not assign a high probability to
H*: SM Higgs exists
(whatever that might mean).
Besides, many believe in beyond-the-standard-model
physics.
One may say informally, “so probably we have experimentally
demonstrated an SM-like Higgs”.
When you hear: what they really want are posterior
probabilities, ask: How are we to interpret prior probabilities?
Posterior probabilities?
29. This is a great methodological controversy in practice, one that
philosophers of science and evidence should be in on.
Our job is to clarify terms, is it not?