D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
D. Mayo: Replication Research Under an Error Statistical Philosophy
Replication Research Under an Error Statistical Philosophy
Deborah Mayo
Around a year ago on my blog:
“There are some ironic twists in the way psychology is
dealing with its replication crisis that may well threaten even
the most sincere efforts to put the field on firmer scientific
footing”
Philosopher’s talk: I see a rich source of problems that cry out
for ministrations of philosophers of science and of statistics
Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions (“ironies”)
#3 Solve problems, improve on methodology
• Philosophers usually stop with the first two, but I think
going on to solve problems is important.
This presentation is ‘programmatic’: what might replication research under an error statistical philosophy be?
My interest grew thanks to Caitlin Parker, whose MA thesis was on the topic.
Example of a conceptual clarification (#1)
Editors of a journal, Basic and Applied Social Psychology,
announced they are banning statistical hypothesis testing
because it is “invalid”
It’s invalid because it does not supply “the probability of the
null hypothesis, given the finding” (the posterior probability of
H0) (Trafimow and Marks 2015)
• Since the methodology of testing explicitly rejects the mode
of inference it is faulted for not supplying, it is incorrect to
claim the methods are invalid.
• Simple conceptual job that philosophers are good at
Example of revealing inconsistencies and tensions (#2)
Critic: It’s too easy to satisfy standard significance thresholds
You: Why do replicationists find it so hard to achieve
significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking,
cherry-picking, significance seeking, QRPs
You: So, the replication researchers want methods that pick up
on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where
selection effects and data dredging make no difference
Whether this tension can be resolved is a separate question.
• We are constantly hearing of how the “reward structure”
leads to taking advantage of researcher flexibility
• As philosophers, we can at least show how to hold their
feet to the fire, and warn of the perils of accounts that bury
the finagling
The philosopher is the curmudgeon (takes chutzpah!)
I’ll give examples of
#1 clarifying terms
#2 inconsistencies
#3 proposed solutions (though I won’t always number them)
Demarcation: Bad Methodology/Bad Statistics
• A lot of the recent attention grew out of the case of Diederik
Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: “I see a train-wreck looming,” setting up a “daisy chain” of replication.
• The Stapel investigators (2012 Tilburg Report, “Flawed
Science”) do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying
anything general about science versus pseudoscience.
Items in their list of “dirty laundry” include:
“An experiment fails to yield the expected statistically
significant results. The experimenters try and try again
until they find something (multiple testing, multiple
modeling, post-data search of endpoint or subgroups),
and the only experiment subsequently reported is the
one that did yield the expected results.”
“… continuing an experiment until it works as desired, or
excluding unwelcome experimental subjects or results,
inevitably tends to confirm the researcher’s research
hypotheses, and essentially render the hypotheses
immune to the facts.” (Report, 48)
--they walked into a “culture of verification bias”
Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e., little
or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a
stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test’s capabilities, according to
their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance
levels) may, but need not, provide severity assessments.
• A new name is needed: frequentist, sampling theory, Fisherian,
Neyman-Pearsonian are too associated with hard-line
views and personality conflicts (“It’s the methods, stupid”)
(example of new solutions #3)
Are philosophies about science relevant?
One of the final recommendations in the Report is this:
In the training program for PhD students, the relevant
basic principles of philosophy of science, methodology,
ethics and statistics that enable the responsible practice
of science must be covered. (p. 57)
A critic might protest:
“There’s nothing philosophical about my criticism of
significance tests: a small p-value is invariably, and
erroneously, interpreted as giving a small probability to the null
hypothesis that the observed difference is mere chance.”
Really? P-values are not intended to be used this way;
presupposing they should be stems from a conception of the role
of probability in statistical inference—this conception is
philosophical.
(of course criticizing them because they might be misinterpreted
is just silly)
Two main views of the role of probability in inference

Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.

Performance. To ensure long-run reliability of methods, coverage probabilities, control of the relative frequency of
erroneous inferences in a long-run series of trials.
What happened to the goal of scrutinizing bad science by the
severity criterion?
• Neither “probabilism” nor “performance” directly captures
it.
• Good long-run performance is a necessary not a sufficient
condition for avoiding insevere tests.
• The problems with selective reporting, multiple testing,
stopping when the data look good are not problems about
long-runs—
• It’s that we cannot say about the case at hand that it has
done a good job of avoiding the sources of
misinterpretation.
• Probabilism says H is not justified unless it’s true or probable (made firmer).
• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe ways we can be wrong about H.
• If it’s assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness (new philosophy for probability #3)
• Both H and not-H can be poorly tested, so a severe testing assessment violates the probability calculus
Understanding the Replication Crisis Requires Understanding How It Intermingles with PhilStat Controversies
• It’s not that I’m keen to defend many common uses of
significance tests
• It’s just that the criticisms (in psychology and elsewhere)
are based on serious misunderstandings of the nature and
role of these methods; consequently so are many “reforms”
• How can you be clear the reforms are better if you might be
mistaken about existing methods?
Criticisms concern a kind of Fisherian Significance Test

(i) Sample space: Let the sample be X = (X1, …, Xn), n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ

(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0)

(iii) Test statistic: a function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0: the larger d(x0), the further the outcome from what’s expected under H0, with respect to the particular question

(iv) Sampling distribution of the test statistic d(X)
The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = Pr(d(X) > d(x0); H0).

If p(x0) is sufficiently small, there’s an indication of discrepancy from the null.

(Even Fisher had implicit alternatives, by the way)
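To make the definition concrete, here is a minimal sketch in Python of the one-sided test just described (σ treated as known; the function name and numbers are illustrative, not from the talk):

```python
import numpy as np
from scipy import stats

def p_value(x, mu0=0.0, sigma=1.0):
    """One-sided p-value for H0: mu = mu0, given n iid Normal(mu, sigma^2) outcomes.

    Test statistic d(X) = sqrt(n)(xbar - mu0)/sigma; p(x0) = Pr(d(X) > d(x0); H0).
    """
    n = len(x)
    d_obs = np.sqrt(n) * (np.mean(x) - mu0) / sigma
    return stats.norm.sf(d_obs)  # Pr(d(X) > d_obs) under H0, where d(X) ~ N(0, 1)

rng = np.random.default_rng(1)
x0 = rng.normal(0.2, 1.0, size=100)  # simulated data with a true discrepancy of 0.2
print(p_value(x0))                   # a small value indicates discrepancy from H0
```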
P-value reasoning: from high capacity to curb enthusiasm

If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.

That merely indicates some discrepancy!
A genuine experimental effect is needed
“[W]e need, not an isolated record, but a reliable method of
procedure. In relation to the test of significance, we may say
that a phenomenon is experimentally demonstrable when we
know how to conduct an experiment which will rarely fail to
give us a statistically significant result.” (Fisher 1935, 14)
(low P-value ≠> H: statistical effect)
“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer 1989, 95-6)
(H ≠> H*)
Still, simple Fisherian tests have important uses:
• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model
Gelman and Shalizi (meeting of minds between a Bayesian and
an error statistician)
“What we are advocating, then, is what Cox and Hinkley (1974)
call ‘pure significance testing’, in which certain of the model’s
implications are compared directly to the data, rather than
entering into a contest with some alternative model.” (p.20)
Fallacy of Rejection (H –> H*): erroneously take statistical significance as evidence of research hypothesis H*

The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.
Merely refuting the null hypothesis is too weak to
corroborate substantive H*, “we have to have ‘Popperian
risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley
Salmon called ‘a highly improbable coincidence.’” (Meehl
and Waller 2002, 184)
(Meehl was wrong to blame Fisher)
NHSTs are pseudostatistical: why do psychologists speak of NHSTs, tests that supposedly allow moving from statistical to substantive? So defined, they exist only as abuses of tests: they exist as something you’re never supposed to do.

Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher’s tests with explicit alternatives.
Neyman-Pearson (N-P) Tests: null and alternative hypotheses H0, H1 that exhaust the parameter space

So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives)

Scotches criticisms that P-values are only under the null

Example: Test T+, with the sampling distribution of d(x) under null and alternatives:
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, “reject” H0; if d(x0) < cα, “do not reject” or “accept” H0
e.g. cα = 1.96 for α = .025
The sampling distribution yields Error Probabilities

Probability of a Type I error = P(d(X) > cα; H0) ≤ α.

Probability of a Type II error = P(d(X) < cα; µ1) = ß(µ1), for any µ1 > µ0.

The complement of the Type II error probability = power against µ1:
POW(µ1) = P(d(X) > cα; µ1)

Even without “best” tests, there are “good” tests
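A quick sketch of these computations for test T+ (Python; σ and the numbers are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy import stats

alpha = 0.025
c_alpha = stats.norm.ppf(1 - alpha)  # 1.96: cutoff for d(X), which is N(0, 1) under H0

def power(mu1, mu0=0.0, sigma=1.0, n=100):
    """POW(mu1) = P(d(X) > c_alpha; mu1), with d(X) = sqrt(n)(xbar - mu0)/sigma."""
    shift = np.sqrt(n) * (mu1 - mu0) / sigma  # mean of d(X) when mu = mu1
    return stats.norm.sf(c_alpha - shift)

print(power(0.0))   # 0.025 = alpha: the Type I error probability at the null
print(power(0.3))   # ~0.85: power against mu1 = 0.3; Type II error ß(0.3) ~ 0.15
```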
N-P test in terms of the P-value: reject H0 iff P-value < .025

• Even N-P report the attained significance level or P-value (Lehmann)
• “Reject”/“do not reject” are uninterpreted parts of the mathematical apparatus. Reject could be: “declare statistically significant at the p-level”
• “The tests… must be used with discretion and understanding” (N-P 1928, p. 58) (“it’s the methods, stupid”)
Why Inductive Behavior?

N-P justify tests (and confidence intervals) by performance: control of long-run error and coverage probabilities. They called this inductive behavior. Why?

• They were reaching conclusions beyond the data (inductive)
• If inductive inference is probabilist, then they needed a new term. In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we’d act rather than our beliefs

(I’m not knocking performance, but error probabilities also serve for particular inferences: evidential)
N-P tests can still commit a type of fallacy of rejection: infer a discrepancy beyond what’s warranted, especially with n sufficiently large (the large n problem).

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)

What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn’t go off unless the house is fully ablaze? (The larger sample size is like the one that goes off with burnt toast.)
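A small sketch of the large-n point (the numbers are illustrative): holding the cutoff fixed, a just-significant result corresponds to an ever smaller inferred discrepancy as n grows.

```python
import numpy as np

def discrepancy_at(z, sigma=1.0, n=100):
    """The sample mean implied by d(x0) = z: xbar = mu0 + z * sigma / sqrt(n)."""
    return z * sigma / np.sqrt(n)

# The same 1.96-significant result reflects a smaller discrepancy as n grows:
for n in (25, 100, 10_000):
    print(n, discrepancy_at(1.96, n=n))  # 0.392, 0.196, 0.0196
```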
Fallacy of Non-Significant Results: Insensitive Tests

• Negative results may not warrant 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed

Similar to Cohen’s power analysis, but sensitive to the outcome (the P-value distribution) (#3)

• I hear some replicationists say negative results are uninformative: not so (#2 ironies). No point in running replication research if your account views negative results as uninformative.
Error statistics gives an evidential interpretation to tests (#3)

Use results to infer discrepancies from a null that are well ruled out, and those which are not. I’d never just report a P-value.

Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence (FEV)
Mayo and Spanos (2006): SEV
One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0

d(x) is statistically significant (set lower bounds):

(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication of µ > µ0 + γ.

(ii) If the test had little (or even moderate) capacity (e.g. < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication of µ > µ0 + γ.

(If an even more impressive result is probable, due to guppies, it’s not a good indication of a great whale)
d(x) is not statistically significant (set upper bounds):

(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.

(ii) If the test had a low probability of a more statistically significant difference if µ > µ0 + γ, then d(x) is a poor indication that µ ≤ µ0 + γ (too insensitive to rule out discrepancy γ).

If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can’t detect (e.g., risks of concern)
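A minimal sketch of both severity directions for test T+ (Python; σ known, all numbers illustrative, and the SEV computations follow the Mayo and Spanos (2006) definitions as summarized on these slides):

```python
import numpy as np
from scipy import stats

def sev_upper(d_obs, gamma, sigma=1.0, n=100):
    """SEV(mu <= mu0 + gamma) after a non-significant d_obs:
    the probability of a more significant result, were mu = mu0 + gamma."""
    shift = np.sqrt(n) * gamma / sigma      # mean of d(X) at mu = mu0 + gamma
    return stats.norm.sf(d_obs - shift)     # Pr(d(X) > d_obs; mu0 + gamma)

def sev_lower(d_obs, gamma, sigma=1.0, n=100):
    """SEV(mu > mu0 + gamma) after a significant d_obs:
    the probability of a less significant result, were mu = mu0 + gamma."""
    shift = np.sqrt(n) * gamma / sigma
    return stats.norm.cdf(d_obs - shift)    # Pr(d(X) < d_obs; mu0 + gamma)

# A non-significant d_obs = 1.0 (n = 100) rules out gamma = 0.3 with high
# severity, but is a poor basis for ruling out gamma = 0.1:
print(sev_upper(1.0, 0.3))   # ~0.98: good indication mu <= mu0 + 0.3
print(sev_upper(1.0, 0.1))   # ~0.50: poor indication mu <= mu0 + 0.1
```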
Confidence Intervals also require supplementing

Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage (performance)
• All members of the CI treated as on par
• Fixed confidence level (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models
The evidential interpretation is crucial, but error probabilities can be violated by selection effects (also by violated model assumptions)

One function of severity is to identify which selection effects are problematic (not all are) (#3).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
Nominal vs actual significance levels

“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, p. 104)
• They were clear on the fallacy: blurring the “computed” or “nominal” significance level and the “actual” level
• There are many more ways you can be wrong with hunting (a different sample space)
35. SPP
D.
Mayo
35
This is a genuine example of an invalid or unsound method
You report: such results would be difficult to achieve under the assumption of H0.
When in fact such results are common under the assumption of H0.
(formally): You say Pr(P-value < Pobs; H0) ~ α (small)
but in fact Pr(P-value < Pobs; H0) = high, if not guaranteed
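Selvin’s 64 percent is just 1 − (.95)^20; a short simulation (Python, with illustrative settings) shows the same gap between nominal and actual levels:

```python
import numpy as np
from scipy import stats

# Selvin's arithmetic: hunt through 20 independent null comparisons and report
# the best; the actual level is 1 - 0.95**20 ~ 0.64, not the nominal 0.05.
print(1 - 0.95**20)

# The same point by simulation: the minimum p-value over 20 true-null t-tests.
rng = np.random.default_rng(0)
reps, k, n = 5_000, 20, 30
hits = sum(
    min(stats.ttest_1samp(rng.normal(0, 1, n), 0.0).pvalue for _ in range(k)) < 0.05
    for _ in range(reps)
)
print(hits / reps)  # close to .64: Pr(P-value < Pobs; H0) is high, as stated above
```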
• Nowadays, we’re likely to see the tests blamed for permitting such misuses (instead of the testers).
• Worse are those accounts where the abuse vanishes!
What defies scientific sense?

On some views, biasing selection effects are irrelevant…. Stephen Goodman (epidemiologist):

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (1999, p. 1010)
Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, Bayes factors (inference is comparative: data favor this over that. Is that even inference?)
All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436)

The information is just a matter of our “intentions”:

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz 1977, 122)
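The classic binomial vs. negative binomial comparison makes the sample-space point concrete (a sketch with the textbook numbers, not from the talk): same data, same likelihood ratio, different p-values.

```python
from scipy import stats

# 9 heads in 12 tosses; H0: theta = 0.5 vs. an alternative theta = 0.75.
# Fixed-n binomial sampling and toss-until-3-tails negative binomial sampling
# have proportional likelihoods, so the likelihood ratio is identical:
theta0, theta1 = 0.5, 0.75
lr = (theta1**9 * (1 - theta1)**3) / (theta0**9 * (1 - theta0)**3)
print(lr)  # ~4.8 under either sampling plan

# But the p-values differ, because the sample space differs:
p_binom = stats.binom.sf(8, 12, theta0)        # Pr(>= 9 heads in 12): ~0.073
p_nbinom = stats.nbinom.sf(8, 3, 1 - theta0)   # Pr(>= 9 heads before 3rd tail): ~0.033
print(p_binom, p_nbinom)
```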
Many current Reforms are Probabilist
Probabilist reforms to replace tests (and CIs) with likelihood
ratios, Bayes factors, HPD intervals, or just lower the P-value
(so that the maximal likely alternative gets .95 posterior)
while ignoring biasing selection effects, will fail.
The same p-hacked hypothesis can occur in Bayes factors;
optional stopping can exclude true nulls from HPD intervals.
With one big difference: Your direct basis for criticism and
possible adjustments has just vanished.
(lots of #2 inconsistencies)
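The optional-stopping point can be seen in a short simulation (Python; the settings are illustrative): with try-and-try-again sampling, a nominal .05 rule rejects a true null far more often than 5 percent of the time, while accounts obeying the LP register no difference.

```python
import numpy as np

rng = np.random.default_rng(2)

# Optional stopping against a true null H0: mu = 0 (sigma = 1 known):
# test after each new observation, stopping at the first |z| >= 1.96.
reps, n_max = 2_000, 500
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n_max)
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)          # z-statistic after each observation
    if np.any(np.abs(z[9:]) >= 1.96):      # begin looking at n = 10
        rejections += 1
print(rejections / reps)  # well above the nominal .05, and growing with n_max
```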
How might probabilists block intuitively unwarranted inferences? (Consider first the subjective Bayesian)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the interpretation
of the evidence, we wouldn’t be fooled
We know these things are unbelievable, a subjective Bayesian
might say
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
It wouldn’t help with our most important problem:
• How to distinguish the warrant for a single hypothesis H
with different methods (e.g., one has biasing selection
effects, another, registered results and precautions)?
So now you’ve got two sources of flexibility, priors and biasing
selection effects (which can no longer be criticized).
Besides, researchers really do believe their hypotheses.
Diederik Stapel says he always read the research literature
extensively to generate his hypotheses.
“So that it was believable and could be argued that this
was the only logical thing you would find.” (E.g., eating
meat causes aggression.)
(In “The Mind of a Con Man,” NY Times, April 26, 2013)
Conventional Bayesians

The most popular probabilisms these days are “non-subjective” (reference, default) or conventional, designed to prevent prior beliefs from influencing the posteriors:

“The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities….” (Cox and Mayo 2010, p. 299)

How might they avoid too-easy rejections of a null?
Cult of the Holy Spike

Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false” (#2)

Bottom line: by convenient choices of priors and alternatives, statistically significant differences can be evidence for the null.

The conflict often considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0
Posterior probabilities in H0, as a function of n (sample size):

p       z       n=50    n=100   n=1000
.10     1.645   .65     .72     .89
.05     1.960   .52     .60     .82
.01     2.576   .22     .27     .53
.001    3.291   .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82! From Berger and Sellke (1987), based on a Jeffreys prior.
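These entries can be reproduced with the spike-and-smear setup (a sketch; I am assuming the normal N(0, σ²) prior under H1 that Berger and Sellke associate with Jeffreys):

```python
import numpy as np

def posterior_H0(z, n, pi0=0.5):
    """Posterior on H0: mu = 0 with a .5 spike prior and, under H1, mu ~ N(0, sigma^2).
    Then z ~ N(0, 1) under H0 and z ~ N(0, 1 + n) marginally under H1."""
    bf01 = np.sqrt(1 + n) * np.exp(-z**2 / 2 + z**2 / (2 * (1 + n)))
    odds = (pi0 / (1 - pi0)) * bf01       # posterior odds of H0
    return odds / (1 + odds)

for n in (50, 100, 1000):
    print(n, round(posterior_H0(1.96, n), 2))   # .52, .60, .82: the .05 row above
```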
• With a z = 1.96 difference, the 95% CI (2-sided), or the .975 one-sided CI, excludes the null (0) from the interval
• Severity reasoning: were H0 true, the probability of getting d(x) < dobs is high (~.975), so SEV(µ > 0) ∼ .975
• But they give P(H0 | z = 1.96) = .82
• Error statistical critique: there’s a high probability that they give a posterior probability of .82 to H0: µ = 0 erroneously
• The onus is on probabilists to show that a high posterior for H constitutes having passed a good test.
Informal and Quasi-Formal Severity: H -> H*

• Error statisticians avoid the fallacy of going directly from statistical to research hypothesis H*
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3)

I will not discuss straw man studies (“chump effects”).

This is believable: men react more negatively to the success of their partners than to their failures (compared to women).
Studies have shown:
H: partner’s success lowers self-esteem in men
Macho Men

H*: partner’s success lowers self-esteem in men

I have no doubts that certain types of men feel threatened by the success of their female partners, wives or girlfriends. I’ve even known a few. Can this be studied in the lab?

Ratliff and Oishi (2013) did:
H*: “men’s implicit self-esteem is lower when a partner succeeds than when a partner fails.” Not so for women.
Their example does a good job, given the standards in place.
Treatments: subjects are randomly assigned to five “treatments”: think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or about a typical day (control).

Effects: a measure of “self-esteem”
Explicit: “How do you feel about yourself?”
Implicit: a test of word associations with “me” versus “other”.

None showed statistical significance in explicit self-esteem, so consider just the implicit measures.
Some null hypotheses: The average self-esteem score is no
different (these are statistical hypotheses)
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it’s a regular day
There are at least double this, given self-esteem could be
“explicit” or “implicit” (others too, e.g., the area of success)
Only null (a) was rejected statistically!

Should they have taken the research hypothesis as disconfirmed by the negative cases? Or as casting doubt on their test?
Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?

They opt for the third.
It’s not that they should have regarded their research
hypothesis H* as disconfirmed much less falsified.
This is precisely the nub of the problem! I’m saying the
hypothesis that the study isn’t well-run needs to be considered
• Is the artificial writing assignment sufficiently relevant to
the phenomenon of interest? (look at proxy variables)
• Is the measure of implicit self esteem (word associations) a
valid measure of the effect? (measurements of effects)
Take null hypothesis (b): the average self-esteem score is no different when partner beats (surpasses) me or I beat her.

Clearly they expected “she beat me in X” to have a greater negative impact on self-esteem than “she succeeded at X”. Still, they could view it as lending “some support to the idea that men interpret ‘my partner is successful’ as ‘my partner is more successful than me’” (p. 698), ….as do the authors.

That is, any success of hers is always construed by Macho man as: she beat me.
Bending over Backwards

For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of “self-sealing fallacy”.
I want to be clear that this is not a criticism of them given
existing standards
“I'm talking about a specific, extra type of integrity...bending
over backwards to show how you're maybe wrong, that you
ought to have when acting as a scientist.”
(R. Feynman 1974)
I’m describing what’s needed to show “sincerely trying to find flaws” under the austere account I recommend.

The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.
Conclusion: Replication Research in Psychology Under an
Error Statistical Philosophy
Replication problems can’t be solved without correctly
understanding their sources
Biggest sources of problems in replication crises:
(a) statistical H -> research H*, and
(b) biasing selection effects.
Reasons for (a): focus on P-values and Fisherian tests ignoring
N-P tests (and the illicit NHST that goes directly H–> H*)
Another reason, false dilemma:
probabilism or long-run performance
plus assuming that N-P can only give the latter
I argue for a third use of probability: Rather than report on
believability researchers need to report the properties of the
methods they used:
What was their capacity to have identified, avoided,
admitted bias?
What’s wanted is not a high posterior probability in H (however construed) but a high probability that the procedure would have unearthed flaws in H (a reinterpretation of N-P methods).
What’s replicable? Discrepancies that are severely warranted.

Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP).

There’s no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.
Informal assessments of probativeness are needed to scrutinize
statistical inferences in relation to research hypotheses H –> H*
One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.)
The scientific status of an inquiry is questionable if it cannot or
will not distinguish the correctness of inferences from problems
stemming from a poorly run study
If ordinary research reports adopted the Feynman “bending over
backwards” scrutiny, the interpretation of replication efforts
would be more informative (or perhaps not needed)
REFERENCES

Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). “Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer.” Journal of Clinical Oncology 26(7): 1186-1187.

Bartlett, T. (2012). “Daniel Kahneman Sees ‘Train-Wreck Looming’ for Social Psychology”. Chronicle of Higher Education Blog (Oct. 4, 2012), article with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338.

Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385-402.

Berger, J. O. & Sellke, T. (1987). “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion).” Journal of the American Statistical Association 82(397): 112-122.

Bhattacharjee, Y. (2013). “The Mind of a Con Man”. The New York Times Magazine (4/28/2013), p. 44.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). “Microarrays: retracing steps.” Nature Medicine 13(11): 1276-7.

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Cox, D. R. & Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.

Diaconis, P. (1978). “Statistical Problems in ESP Research”. Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)

Dienes, Z. (2011). “Bayesian versus Orthodox Statistics: Which Side Are You On?” Perspectives on Psychological Science 6(3): 274-290.

Feynman, R. (1974). “Cargo Cult Science.” Caltech Commencement Speech.

Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.

Gelman, A. (2011). “Induction and Deduction in Bayesian Data Analysis.” In Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science), edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley: 67-78.

Gelman, A. & Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics.” British Journal of Mathematical and Statistical Psychology 66(1): 8-38.

Gigerenzer, G. (2000). “The Superego, the Ego, and the Id in Statistical Reasoning.” In Adaptive Thinking: Rationality in the Real World. Oxford: OUP.

Goodman, S. N. (1999). “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005-1013.

Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.

Johansson, T. (2010). “Hail the impossible: p-values, evidence, and likelihood.” Scandinavian Journal of Psychology 52: 113-125.

Kruschke, J. K. (2010). “What to believe: Bayesian methods for data analysis”. Trends in Cognitive Science 14(7): 297-300.

Lehmann, E. L. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88(424): 1242-1249.

Levelt Committee, Noort Committee, Drenth Committee (2012). “Flawed science: The fraudulent research practices of social psychologist Diederik Stapel”. Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/

Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

Mayo, D. G. & Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference”. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. & Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323-357.

Mayo, D. G. & Spanos, A. (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

Meehl, P. E. & Waller, N. G. (2002). “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283-300.

Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. Nat. Acad. Press.

Morrison, D. E. & Henkel, R. E. (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

Neyman, J. (1957). “‘Inductive Behavior’ as a Basic Concept of Science.” Revue de l'Institut International de Statistique/Review of the International Statistical Institute 25(1/3): 7-22.

Neyman, J. & Pearson, E. S. (1928). “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I.” Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, University of California Press, Berkeley, 1967, pp. 1-66.)

Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.

Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). “Genomic signatures to guide the use of chemotherapeutics.” Nature Medicine 12(11): 1294-300. Epub 2006 Oct 22.

Potti, A. & Nevins, J. R. (2007). “Reply to Coombes, Wang & Baggerly.” Nature Medicine 13(11): 1277-8.

Ratliff, K. A. & Oishi, S. (2013). “Gender Differences in Implicit Self-Esteem Following a Romantic Partner’s Success or Failure”. Journal of Personality and Social Psychology 105(4): 688-702.

Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.

Savage, L. J. (1964). “The Foundations of Statistics Reconsidered.” In Studies in Subjective Probability, H. Kyburg & H. Smokler (eds.), 173-188. New York: John Wiley & Sons.

Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

Trafimow, D. & Marks, M. (2015). “Editorial”. Basic and Applied Social Psychology 37(1): 1-2.

Wagenmakers, E.-J. (2007). “A Practical Solution to the Pervasive Problems of P Values”. Psychonomic Bulletin & Review 14(5): 779-804.