D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
D. Mayo: Replication Research Under an Error Statistical Philosophy
Replication Research Under an Error Statistical Philosophy
Deborah Mayo
Around a year ago on my blog:
“There are some ironic twists in the way psychology is
dealing with its replication crisis that may well threaten even
the most sincere efforts to put the field on firmer scientific
footing”
Philosopher’s talk: I see a rich source of problems that cry out
for ministrations of philosophers of science and of statistics
Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions (“ironies”)
#3 Solve problems, improve on methodology
• Philosophers usually stop with the first two, but I think
going on to solve problems is important.
This presentation is ‘programmatic’: what might replication research under an error statistical philosophy be?
My interest grew thanks to Caitlin Parker, whose MA thesis was on the topic.
Example of a conceptual clarification (#1)
Editors of a journal, Basic and Applied Social Psychology,
announced they are banning statistical hypothesis testing
because it is “invalid”
It’s invalid because it does not supply “the probability of the
null hypothesis, given the finding” (the posterior probability of
H0) (Trafimow and Marks 2015)
• Since the methodology of testing explicitly rejects the mode
of inference it is faulted for not supplying, it is incorrect to
claim the methods are invalid.
• Simple conceptual job that philosophers are good at
Example of revealing inconsistencies and tensions (#2)
Critic: It’s too easy to satisfy standard significance thresholds
You: Why do replicationists find it so hard to achieve
significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking,
cherry-picking, significance seeking, QRPs
You: So, the replication researchers want methods that pick up
on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where
selection effects and data dredging make no difference
Whether this tension can be resolved is a separate question.
• We are constantly hearing of how the “reward structure”
leads to taking advantage of researcher flexibility
• As philosophers, we can at least show how to hold their
feet to the fire, and warn of the perils of accounts that bury
the finagling
The philosopher is the curmudgeon (takes chutzpah!)
I’ll give examples of
#1 clarifying terms
#2 inconsistencies
#3 proposed solutions (though I won’t always number them)
Demarcation: Bad Methodology/Bad Statistics
• A lot of the recent attention grew out of the case of Diederik
Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: “I see a train-wreck looming,” setting up a “daisy chain” of replication.
• The Stapel investigators (2012 Tilburg Report, “Flawed
Science”) do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying
anything general about science versus pseudoscience.
Items in their list of “dirty laundry” include:
“An experiment fails to yield the expected statistically
significant results. The experimenters try and try again
until they find something (multiple testing, multiple
modeling, post-data search of endpoint or subgroups),
and the only experiment subsequently reported is the
one that did yield the expected results.”
“… continuing an experiment until it works as desired, or
excluding unwelcome experimental subjects or results,
inevitably tends to confirm the researcher’s research
hypotheses, and essentially render the hypotheses
immune to the facts.” (Report, 48)
--they walked into a “culture of verification bias”
Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e., little
or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a
stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test’s capabilities, according to
their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance
levels) may, but need not, provide severity assessments.
• A new name is needed: frequentist, sampling theory, Fisherian,
Neyman-Pearsonian are too associated with hard-line
views and personality conflicts (“It’s the methods, stupid”)
(example of new solutions #3)
Are philosophies about science relevant?
One of the final recommendations in the Report is this:
In the training program for PhD students, the relevant
basic principles of philosophy of science, methodology,
ethics and statistics that enable the responsible practice
of science must be covered. (p. 57)
A critic might protest:
“There’s nothing philosophical about my criticism of
significance tests: a small p-value is invariably, and
erroneously, interpreted as giving a small probability to the null
hypothesis that the observed difference is mere chance.”
Really? P-values are not intended to be used this way;
presupposing they should be stems from a conception of the role
of probability in statistical inference—this conception is
philosophical.
(of course criticizing them because they might be misinterpreted
is just silly)
Two main views of the role of probability in inference

Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.

Performance. To ensure long-run reliability of methods, coverage probabilities, control of the relative frequency of
erroneous inferences in a long-run series of trials.
What happened to the goal of scrutinizing bad science by the
severity criterion?
• Neither “probabilism” nor “performance” directly captures
it.
• Good long-run performance is a necessary not a sufficient
condition for avoiding insevere tests.
• The problems with selective reporting, multiple testing,
stopping when the data look good are not problems about
long-runs—
• It’s that we cannot say about the case at hand that it has
done a good job of avoiding the sources of
misinterpretation.
• Probabilism says H is not justified unless it’s true or probable (made firmer).
• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe ways we can be wrong about H.
• If it’s assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness (new philosophy for probability #3)
• Both H and not-H can be poorly tested, so a severe testing assessment violates the probability calculus
Understanding the Replication Crisis Requires Understanding How It Intermingles with PhilStat Controversies
• It’s not that I’m keen to defend many common uses of
significance tests
• It’s just that the criticisms (in psychology and elsewhere)
are based on serious misunderstandings of the nature and
role of these methods; consequently so are many “reforms”
• How can you be clear the reforms are better if you might be
mistaken about existing methods?
Criticisms concern a kind of Fisherian Significance Test

(i) Sample space: Let the sample be X = (X1, …, Xn), n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ

(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0)

(iii) Test statistic: a function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0: the larger d(x0), the further the outcome from what’s expected under H0, with respect to the particular question

(iv) Sampling distribution of the test statistic d(X)
The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = Pr(d(X) > d(x0); H0).

If p(x0) is sufficiently small, there’s an indication of discrepancy from the null.

(Even Fisher had implicit alternatives, by the way)
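To make the definition concrete, here is a minimal sketch in Python of the one-sided test just described (σ treated as known; the function name and numbers are illustrative, not from the talk):

```python
import numpy as np
from scipy import stats

def p_value(x, mu0=0.0, sigma=1.0):
    """One-sided p-value for H0: mu = mu0, given n iid Normal(mu, sigma^2) outcomes.

    Test statistic d(X) = sqrt(n)(xbar - mu0)/sigma; p(x0) = Pr(d(X) > d(x0); H0).
    """
    n = len(x)
    d_obs = np.sqrt(n) * (np.mean(x) - mu0) / sigma
    return stats.norm.sf(d_obs)  # Pr(d(X) > d_obs) under H0, where d(X) ~ N(0, 1)

rng = np.random.default_rng(1)
x0 = rng.normal(0.2, 1.0, size=100)  # simulated data with a true discrepancy of 0.2
print(p_value(x0))                   # a small value indicates discrepancy from H0
```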
P-value reasoning: from high capacity to curb enthusiasm

If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.

That merely indicates some discrepancy!
A genuine experimental effect is needed
“[W]e need, not an isolated record, but a reliable method of
procedure. In relation to the test of significance, we may say
that a phenomenon is experimentally demonstrable when we
know how to conduct an experiment which will rarely fail to
give us a statistically significant result.” (Fisher 1935, 14)
(low P-value ≠> H: statistical effect)
“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer 1989, 95-6)
(H ≠> H*)
Still, simple Fisherian tests have important uses:
• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model
Gelman and Shalizi (meeting of minds between a Bayesian and
an error statistician)
“What we are advocating, then, is what Cox and Hinkley (1974)
call ‘pure significance testing’, in which certain of the model’s
implications are compared directly to the data, rather than
entering into a contest with some alternative model.” (p.20)
Fallacy of Rejection (H –> H*): erroneously take statistical significance as evidence of research hypothesis H*

The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.
Merely refuting the null hypothesis is too weak to
corroborate substantive H*, “we have to have ‘Popperian
risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley
Salmon called ‘a highly improbable coincidence.’” (Meehl
and Waller 2002, 184)
(Meehl was wrong to blame Fisher)
NHSTs are pseudostatistical: why do psychologists speak of NHSTs, tests that supposedly allow moving from statistical to substantive? So defined, they exist only as abuses of tests: they exist as something you’re never supposed to do.

Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher’s tests with explicit alternatives.
Neyman-Pearson (N-P) Tests: null and alternative hypotheses H0, H1 that exhaust the parameter space

So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives)

Scotches criticisms that P-values are only under the null

Example: Test T+, with the sampling distribution of d(x) under null and alternatives:
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, “reject” H0; if d(x0) < cα, “do not reject” or “accept” H0
e.g. cα = 1.96 for α = .025
The sampling distribution yields Error Probabilities

Probability of a Type I error = P(d(X) > cα; H0) ≤ α.

Probability of a Type II error = P(d(X) < cα; µ1) = ß(µ1), for any µ1 > µ0.

The complement of the Type II error probability = power against µ1:
POW(µ1) = P(d(X) > cα; µ1)

Even without “best” tests, there are “good” tests
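A quick sketch of these computations for test T+ (Python; σ and the numbers are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy import stats

alpha = 0.025
c_alpha = stats.norm.ppf(1 - alpha)  # 1.96: cutoff for d(X), which is N(0, 1) under H0

def power(mu1, mu0=0.0, sigma=1.0, n=100):
    """POW(mu1) = P(d(X) > c_alpha; mu1), with d(X) = sqrt(n)(xbar - mu0)/sigma."""
    shift = np.sqrt(n) * (mu1 - mu0) / sigma  # mean of d(X) when mu = mu1
    return stats.norm.sf(c_alpha - shift)

print(power(0.0))   # 0.025 = alpha: the Type I error probability at the null
print(power(0.3))   # ~0.85: power against mu1 = 0.3; Type II error ß(0.3) ~ 0.15
```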
N-P test in terms of the P-value: reject H0 iff P-value < .025

• Even N-P report the attained significance level or P-value (Lehmann)
• “Reject”/“do not reject” are uninterpreted parts of the mathematical apparatus. Reject could be: “declare statistically significant at the p-level”
• “The tests… must be used with discretion and understanding” (N-P 1928, p. 58) (“it’s the methods, stupid”)
Why Inductive Behavior?

N-P justify tests (and confidence intervals) by performance: control of long-run error and coverage probabilities. They called this inductive behavior. Why?

• They were reaching conclusions beyond the data (inductive)
• If inductive inference is probabilist, then they needed a new term. In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we’d act rather than our beliefs

(I’m not knocking performance, but error probabilities also serve for particular inferences: evidential)
N-P tests can still commit a type of fallacy of rejection: infer a discrepancy beyond what’s warranted, especially with n sufficiently large (the large n problem).

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)

What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn’t go off unless the house is fully ablaze? (The larger sample size is like the one that goes off with burnt toast.)
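A small sketch of the large-n point (the numbers are illustrative): holding the cutoff fixed, a just-significant result corresponds to an ever smaller inferred discrepancy as n grows.

```python
import numpy as np

def discrepancy_at(z, sigma=1.0, n=100):
    """The sample mean implied by d(x0) = z: xbar = mu0 + z * sigma / sqrt(n)."""
    return z * sigma / np.sqrt(n)

# The same 1.96-significant result reflects a smaller discrepancy as n grows:
for n in (25, 100, 10_000):
    print(n, discrepancy_at(1.96, n=n))  # 0.392, 0.196, 0.0196
```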
Fallacy of Non-Significant Results: Insensitive Tests

• Negative results may not warrant 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed

Similar to Cohen’s power analysis, but sensitive to the outcome (the P-value distribution) (#3)

• I hear some replicationists say negative results are uninformative: not so (#2 ironies). No point in running replication research if your account views negative results as uninformative.
Error statistics gives an evidential interpretation to tests (#3)

Use results to infer discrepancies from a null that are well ruled out, and those which are not. I’d never just report a P-value.

Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence (FEV)
Mayo and Spanos (2006): SEV
One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0

d(x) is statistically significant (set lower bounds):

(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication of µ > µ0 + γ.

(ii) If the test had little (or even moderate) capacity (e.g. < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication of µ > µ0 + γ.

(If an even more impressive result is probable, due to guppies, it’s not a good indication of a great whale)
d(x) is not statistically significant (set upper bounds):

(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.

(ii) If the test had a low probability of a more statistically significant difference if µ > µ0 + γ, then d(x) is a poor indication that µ ≤ µ0 + γ (too insensitive to rule out discrepancy γ).

If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can’t detect (e.g., risks of concern)
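A minimal sketch of both severity directions for test T+ (Python; σ known, all numbers illustrative, and the SEV computations follow the Mayo and Spanos (2006) definitions as summarized on these slides):

```python
import numpy as np
from scipy import stats

def sev_upper(d_obs, gamma, sigma=1.0, n=100):
    """SEV(mu <= mu0 + gamma) after a non-significant d_obs:
    the probability of a more significant result, were mu = mu0 + gamma."""
    shift = np.sqrt(n) * gamma / sigma      # mean of d(X) at mu = mu0 + gamma
    return stats.norm.sf(d_obs - shift)     # Pr(d(X) > d_obs; mu0 + gamma)

def sev_lower(d_obs, gamma, sigma=1.0, n=100):
    """SEV(mu > mu0 + gamma) after a significant d_obs:
    the probability of a less significant result, were mu = mu0 + gamma."""
    shift = np.sqrt(n) * gamma / sigma
    return stats.norm.cdf(d_obs - shift)    # Pr(d(X) < d_obs; mu0 + gamma)

# A non-significant d_obs = 1.0 (n = 100) rules out gamma = 0.3 with high
# severity, but is a poor basis for ruling out gamma = 0.1:
print(sev_upper(1.0, 0.3))   # ~0.98: good indication mu <= mu0 + 0.3
print(sev_upper(1.0, 0.1))   # ~0.50: poor indication mu <= mu0 + 0.1
```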
Confidence Intervals also require supplementing

Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage (performance)
• All members of the CI treated as on par
• Fixed confidence level (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models
The evidential interpretation is crucial, but error probabilities can be violated by selection effects (also by violated model assumptions)

One function of severity is to identify which selection effects are problematic (not all are) (#3).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
Nominal vs actual significance levels

“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, p. 104)
• They were clear on the fallacy: blurring the “computed” or “nominal” significance level and the “actual” level
• There are many more ways you can be wrong with hunting (a different sample space)
35. SPP
D.
Mayo
35
This is a genuine example of an invalid or unsound method
You report: such results would be difficult to achieve under the assumption of H0.
When in fact such results are common under the assumption of H0.
(formally): You say Pr(P-value < Pobs; H0) ~ α (small)
but in fact Pr(P-value < Pobs; H0) = high, if not guaranteed
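Selvin’s 64 percent is just 1 − (.95)^20; a short simulation (Python, with illustrative settings) shows the same gap between nominal and actual levels:

```python
import numpy as np
from scipy import stats

# Selvin's arithmetic: hunt through 20 independent null comparisons and report
# the best; the actual level is 1 - 0.95**20 ~ 0.64, not the nominal 0.05.
print(1 - 0.95**20)

# The same point by simulation: the minimum p-value over 20 true-null t-tests.
rng = np.random.default_rng(0)
reps, k, n = 5_000, 20, 30
hits = sum(
    min(stats.ttest_1samp(rng.normal(0, 1, n), 0.0).pvalue for _ in range(k)) < 0.05
    for _ in range(reps)
)
print(hits / reps)  # close to .64: Pr(P-value < Pobs; H0) is high, as stated above
```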
• Nowadays, we’re likely to see the tests blamed for permitting such misuses (instead of the testers).
• Worse are those accounts where the abuse vanishes!
What defies scientific sense?

On some views, biasing selection effects are irrelevant…. Stephen Goodman (epidemiologist):

“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (1999, p. 1010)
Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, Bayes factors (inference is comparative: data favor this over that. Is that even inference?)
All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436)

The information is just a matter of our “intentions”:

“The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz 1977, 122)
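The classic binomial vs. negative binomial comparison makes the sample-space point concrete (a sketch with the textbook numbers, not from the talk): same data, same likelihood ratio, different p-values.

```python
from scipy import stats

# 9 heads in 12 tosses; H0: theta = 0.5 vs. an alternative theta = 0.75.
# Fixed-n binomial sampling and toss-until-3-tails negative binomial sampling
# have proportional likelihoods, so the likelihood ratio is identical:
theta0, theta1 = 0.5, 0.75
lr = (theta1**9 * (1 - theta1)**3) / (theta0**9 * (1 - theta0)**3)
print(lr)  # ~4.8 under either sampling plan

# But the p-values differ, because the sample space differs:
p_binom = stats.binom.sf(8, 12, theta0)        # Pr(>= 9 heads in 12): ~0.073
p_nbinom = stats.nbinom.sf(8, 3, 1 - theta0)   # Pr(>= 9 heads before 3rd tail): ~0.033
print(p_binom, p_nbinom)
```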
Many current Reforms are Probabilist
Probabilist reforms to replace tests (and CIs) with likelihood
ratios, Bayes factors, HPD intervals, or just lower the P-value
(so that the maximal likely alternative gets .95 posterior)
while ignoring biasing selection effects, will fail.
The same p-hacked hypothesis can occur in Bayes factors;
optional stopping can exclude true nulls from HPD intervals.
With one big difference: Your direct basis for criticism and
possible adjustments has just vanished.
(lots of #2 inconsistencies)
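The optional-stopping point can be seen in a short simulation (Python; the settings are illustrative): with try-and-try-again sampling, a nominal .05 rule rejects a true null far more often than 5 percent of the time, while accounts obeying the LP register no difference.

```python
import numpy as np

rng = np.random.default_rng(2)

# Optional stopping against a true null H0: mu = 0 (sigma = 1 known):
# test after each new observation, stopping at the first |z| >= 1.96.
reps, n_max = 2_000, 500
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n_max)
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)          # z-statistic after each observation
    if np.any(np.abs(z[9:]) >= 1.96):      # begin looking at n = 10
        rejections += 1
print(rejections / reps)  # well above the nominal .05, and growing with n_max
```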
How might probabilists block intuitively unwarranted inferences? (Consider first the subjective Bayesian)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the interpretation
of the evidence, we wouldn’t be fooled
We know these things are unbelievable, a subjective Bayesian
might say
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
It wouldn’t help with our most important problem:
• How to distinguish the warrant for a single hypothesis H
with different methods (e.g., one has biasing selection
effects, another, registered results and precautions)?
So now you’ve got two sources of flexibility, priors and biasing
selection effects (which can no longer be criticized).
Besides, researchers really do believe their hypotheses.
Diederik Stapel says he always read the research literature
extensively to generate his hypotheses.
“So that it was believable and could be argued that this
was the only logical thing you would find.” (E.g., eating
meat causes aggression.)
(In “The Mind of a Con Man,” NY Times, April 26, 2013)
Conventional Bayesians

The most popular probabilisms these days are “non-subjective” (reference, default) or conventional, designed to prevent prior beliefs from influencing the posteriors:

“The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities….” (Cox and Mayo 2010, p. 299)

How might they avoid too-easy rejections of a null?
Cult of the Holy Spike

Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false” (#2)

Bottom line: by convenient choices of priors and alternatives, statistically significant differences can be evidence for the null.

The conflict often considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0
Posterior probabilities in H0, as a function of n (sample size):

p       z       n=50    n=100   n=1000
.10     1.645   .65     .72     .89
.05     1.960   .52     .60     .82
.01     2.576   .22     .27     .53
.001    3.291   .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82! From Berger and Sellke (1987), based on a Jeffreys prior.
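These entries can be reproduced with the spike-and-smear setup (a sketch; I am assuming the normal N(0, σ²) prior under H1 that Berger and Sellke associate with Jeffreys):

```python
import numpy as np

def posterior_H0(z, n, pi0=0.5):
    """Posterior on H0: mu = 0 with a .5 spike prior and, under H1, mu ~ N(0, sigma^2).
    Then z ~ N(0, 1) under H0 and z ~ N(0, 1 + n) marginally under H1."""
    bf01 = np.sqrt(1 + n) * np.exp(-z**2 / 2 + z**2 / (2 * (1 + n)))
    odds = (pi0 / (1 - pi0)) * bf01       # posterior odds of H0
    return odds / (1 + odds)

for n in (50, 100, 1000):
    print(n, round(posterior_H0(1.96, n), 2))   # .52, .60, .82: the .05 row above
```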
• With a z = 1.96 difference, the 95% CI (2-sided), or the .975 one-sided CI, excludes the null (0) from the interval
• Severity reasoning: were H0 true, the probability of getting d(x) < dobs is high (~.975), so SEV(µ > 0) ∼ .975
• But they give P(H0 | z = 1.96) = .82
• Error statistical critique: there’s a high probability that they give a posterior probability of .82 to H0: µ = 0 erroneously
• The onus is on probabilists to show that a high posterior for H constitutes having passed a good test.
Informal and Quasi-Formal Severity: H -> H*

• Error statisticians avoid the fallacy of going directly from statistical to research hypothesis H*
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3)

I will not discuss straw man studies (“chump effects”).

This is believable: men react more negatively to the success of their partners than to their failures (compared to women).
Studies have shown:
H: partner’s success lowers self-esteem in men
Macho Men

H*: partner’s success lowers self-esteem in men

I have no doubts that certain types of men feel threatened by the success of their female partners, wives or girlfriends. I’ve even known a few. Can this be studied in the lab?

Ratliff and Oishi (2013) did:
H*: “men’s implicit self-esteem is lower when a partner succeeds than when a partner fails.” Not so for women.
Their example does a good job, given the standards in place.
Treatments: subjects are randomly assigned to five “treatments”: think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or about a typical day (control).

Effects: a measure of “self-esteem”
Explicit: “How do you feel about yourself?”
Implicit: a test of word associations with “me” versus “other”.

None showed statistical significance in explicit self-esteem, so consider just the implicit measures.
Some null hypotheses: The average self-esteem score is no
different (these are statistical hypotheses)
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it’s a regular day
There are at least double this, given self-esteem could be
“explicit” or “implicit” (others too, e.g., the area of success)
Only null (a) was rejected statistically!

Should they have taken the research hypothesis as disconfirmed by the negative cases? Or as casting doubt on their test?
Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?

They opt for the third.
It’s not that they should have regarded their research
hypothesis H* as disconfirmed much less falsified.
This is precisely the nub of the problem! I’m saying the
hypothesis that the study isn’t well-run needs to be considered
• Is the artificial writing assignment sufficiently relevant to
the phenomenon of interest? (look at proxy variables)
• Is the measure of implicit self esteem (word associations) a
valid measure of the effect? (measurements of effects)
Take null hypothesis (b): the average self-esteem score is no different when partner beats (surpasses) me or I beat her.

Clearly they expected “she beat me in X” to have a greater negative impact on self-esteem than “she succeeded at X”. Still, they could view it as lending “some support to the idea that men interpret ‘my partner is successful’ as ‘my partner is more successful than me’” (p. 698), ….as do the authors.

That is, any success of hers is always construed by Macho man as: she beat me.
Bending over Backwards

For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of “self-sealing fallacy”.
I want to be clear that this is not a criticism of them given
existing standards
“I'm talking about a specific, extra type of integrity...bending
over backwards to show how you're maybe wrong, that you
ought to have when acting as a scientist.”
(R. Feynman 1974)
I’m describing what’s needed to show “sincerely trying to find flaws” under the austere account I recommend.

The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.
Conclusion: Replication Research in Psychology Under an
Error Statistical Philosophy
Replication problems can’t be solved without correctly
understanding their sources
Biggest sources of problems in replication crises:
(a) statistical H -> research H*, and
(b) biasing selection effects.
Reasons for (a): focus on P-values and Fisherian tests ignoring
N-P tests (and the illicit NHST that goes directly H–> H*)
Another reason, false dilemma:
probabilism or long-run performance
plus assuming that N-P can only give the latter
I argue for a third use of probability: Rather than report on
believability researchers need to report the properties of the
methods they used:
What was their capacity to have identified, avoided,
admitted bias?
What’s wanted is not a high posterior probability in H (however construed) but a high probability that the procedure would have unearthed flaws in H (a reinterpretation of N-P methods).
What’s replicable? Discrepancies that are severely warranted.

Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP).

There’s no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.
Informal assessments of probativeness are needed to scrutinize
statistical inferences in relation to research hypotheses H –> H*
One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.)
The scientific status of an inquiry is questionable if it cannot or
will not distinguish the correctness of inferences from problems
stemming from a poorly run study
If ordinary research reports adopted the Feynman “bending over
backwards” scrutiny, the interpretation of replication efforts
would be more informative (or perhaps not needed)
REFERENCES

Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). “Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer.” Journal of Clinical Oncology 26(7): 1186-1187.

Bartlett, T. (2012). “Daniel Kahneman Sees ‘Train-Wreck Looming’ for Social Psychology”. Chronicle of Higher Education Blog (Oct. 4, 2012), article with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338.

Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385-402.

Berger, J. O. & Sellke, T. (1987). “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion).” Journal of the American Statistical Association 82(397): 112-122.

Bhattacharjee, Y. (2013). “The Mind of a Con Man”. The New York Times Magazine (4/28/2013), p. 44.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). “Microarrays: retracing steps.” Nature Medicine 13(11): 1276-7.

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Cox, D. R. & Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.

Diaconis, P. (1978). “Statistical Problems in ESP Research”. Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)

Dienes, Z. (2011). “Bayesian versus Orthodox Statistics: Which Side Are You On?” Perspectives on Psychological Science 6(3): 274-290.

Feynman, R. (1974). “Cargo Cult Science.” Caltech Commencement Speech.

Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.

Gelman, A. (2011). “Induction and Deduction in Bayesian Data Analysis.” In Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science), edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley: 67-78.

Gelman, A. & Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics.” British Journal of Mathematical and Statistical Psychology 66(1): 8-38.

Gigerenzer, G. (2000). “The Superego, the Ego, and the Id in Statistical Reasoning.” In Adaptive Thinking: Rationality in the Real World. Oxford: OUP.

Goodman, S. N. (1999). “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005-1013.

Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.

Johansson, T. (2010). “Hail the impossible: p-values, evidence, and likelihood.” Scandinavian Journal of Psychology 52: 113-125.

Kruschke, J. K. (2010). “What to believe: Bayesian methods for data analysis”. Trends in Cognitive Science 14(7): 297-300.

Lehmann, E. L. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88(424): 1242-1249.

Levelt Committee, Noort Committee, Drenth Committee (2012). “Flawed science: The fraudulent research practices of social psychologist Diederik Stapel”. Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/

Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.

Mayo, D. G. & Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference”. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. & Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323-357.

Mayo, D. G. & Spanos, A. (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

Meehl, P. E. & Waller, N. G. (2002). “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283-300.

Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. Nat. Acad. Press.

Morrison, D. E. & Henkel, R. E. (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

Neyman, J. (1957). “‘Inductive Behavior’ as a Basic Concept of Science.” Revue de l'Institut International de Statistique/Review of the International Statistical Institute 25(1/3): 7-22.

Neyman, J. & Pearson, E. S. (1928). “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I.” Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, University of California Press, Berkeley, 1967, pp. 1-66.)

Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.

Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). “Genomic signatures to guide the use of chemotherapeutics.” Nature Medicine 12(11): 1294-300. Epub 2006 Oct 22.

Potti, A. & Nevins, J. R. (2007). “Reply to Coombes, Wang & Baggerly.” Nature Medicine 13(11): 1277-8.

Ratliff, K. A. & Oishi, S. (2013). “Gender Differences in Implicit Self-Esteem Following a Romantic Partner’s Success or Failure”. Journal of Personality and Social Psychology 105(4): 688-702.

Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.

Savage, L. J. (1964). “The Foundations of Statistics Reconsidered.” In Studies in Subjective Probability, H. Kyburg & H. Smokler (eds.), 173-188. New York: John Wiley & Sons.

Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

Trafimow, D. & Marks, M. (2015). “Editorial”. Basic and Applied Social Psychology 37(1): 1-2.

Wagenmakers, E.-J. (2007). “A Practical Solution to the Pervasive Problems of P Values”. Psychonomic Bulletin & Review 14(5): 779-804.