Replication Research Under an Error Statistical Philosophy
Deborah Mayo
Around a year ago on my blog:
“There are some ironic twists in the way psychology is
dealing with its replication crisis that may well threaten even
the most sincere efforts to put the field on firmer scientific
footing”
Philosopher’s talk: I see a rich source of problems that cry out
for ministrations of philosophers of science and of statistics
Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions (“ironies”)
#3 Solve problems, improve on methodology
• Philosophers usually stop with the first two, but I think
going on to solve problems is important.
This presentation is 'programmatic': what might replication
research under an error statistical philosophy be?
My interest grew thanks to Caitlin Parker whose MA thesis was
on the topic
Example of a conceptual clarification (#1)
Editors of a journal, Basic and Applied Social Psychology,
announced they are banning statistical hypothesis testing
because it is “invalid”
It's invalid because it does not supply "the probability of the
null hypothesis, given the finding" (the posterior probability of
H0) (Trafimow and Marks 2015)
• Since the methodology of testing explicitly rejects the mode
of inference it is faulted for not supplying, it would be incorrect
to claim the methods are invalid.
• Simple conceptual job that philosophers are good at
Example of revealing inconsistencies and tensions (#2)
Critic: It’s too easy to satisfy standard significance thresholds
You: Why do replicationists find it so hard to achieve
significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking,
cherry-picking, significance seeking, QRPs
You: So, the replication researchers want methods that pick up
on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where
selection effects and data dredging make no difference
Whether this can be resolved or not is separate.
• We are constantly hearing of how the “reward structure”
leads to taking advantage of researcher flexibility
• As philosophers, we can at least show how to hold their
feet to the fire, and warn of the perils of accounts that bury
the finagling
The philosopher is the curmudgeon (takes chutzpah!)
I’ll give examples of
#1 clarifying terms
#2 inconsistencies
#3 proposed solutions (though I won’t always number them)
Demarcation: Bad Methodology/Bad Statistics
• A lot of the recent attention grew out of the case of Diederik
Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: "I see a train-wreck looming," setting up a "daisy chain" of replication.
  
• The Stapel investigators (2012 Tilburg Report, "Flawed
Science") do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying
anything general about science versus pseudoscience.
Items in their list of “dirty laundry” include:
“An experiment fails to yield the expected statistically
significant results. The experimenters try and try again
until they find something (multiple testing, multiple
modeling, post-data search of endpoint or subgroups),
and the only experiment subsequently reported is the
one that did yield the expected results.”
"…continuing an experiment until it works as desired, or
excluding unwelcome experimental subjects or results,
inevitably tends to confirm the researcher's research
hypotheses, and essentially render the hypotheses
immune to the facts." (Report, 48)
--they walked into a “culture of verification bias”
	
  
Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e., little
or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a
stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test’s capabilities, according to
their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance
levels) may but need not provide severity assessments.
• A new name is needed: "frequentist", "sampling theory",
"Fisherian", "Neyman-Pearsonian" are too associated with
hard-line views and personality conflicts ("It's the methods, stupid")
(example of new solutions #3)
Are philosophies about science relevant?
One of the final recommendations in the Report is this:
In the training program for PhD students, the relevant
basic principles of philosophy of science, methodology,
ethics and statistics that enable the responsible practice
of science must be covered. (p. 57)
	
  
A critic might protest:
“There’s nothing philosophical about my criticism of
significance tests: a small p-value is invariably, and
erroneously, interpreted as giving a small probability to the null
hypothesis that the observed difference is mere chance.”
Really? P-values are not intended to be used this way;
presupposing they should be stems from a conception of the role
of probability in statistical inference—this conception is
philosophical.
(of course criticizing them because they might be misinterpreted
is just silly)
Two main views of the role of probability in inference

Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.

Performance. To ensure long-run reliability of methods, coverage probabilities, control the relative frequency of erroneous inferences in a long-run series of trials.
What happened to the goal of scrutinizing bad science by the
severity criterion?
• Neither “probabilism” nor “performance” directly captures
it.
• Good long-run performance is a necessary, but not a
sufficient, condition for avoiding insevere tests.
	
  
• The problems with selective reporting, multiple testing,
stopping when the data look good are not problems about
long-runs—
• It’s that we cannot say about the case at hand that it has
done a good job of avoiding the sources of
misinterpretation.
	
  
• Probabilism says H is not justified unless it's true or probable (made firmer).

• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe ways we can be wrong about H.

• If it's assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!

• Error probabilities have a crucial role in appraising well-testedness (new philosophy for probability #3)

• Both H and not-H can be poorly tested, so a severe testing assessment violates probability
  
Understanding the Replication Crisis Requires Understanding How it Intermingles with PhilStat Controversies
  	
  
	
  
• It’s not that I’m keen to defend many common uses of
significance tests
• It’s just that the criticisms (in psychology and elsewhere)
are based on serious misunderstandings of the nature and
role of these methods; consequently so are many “reforms”
• How can you be clear the reforms are better if you might be
mistaken about existing methods?
Criticisms concern a kind of Fisherian Significance Test

(i) Sample space: Let the sample be X = (X1, …, Xn): n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ.

(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0)

(iii) Test statistic: A function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0: the larger d(x0), the further the outcome from what's expected under H0, with respect to the particular question.

(iv) Sampling distribution of the test statistic: d(X)
  
The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = Pr(d(X) > d(x0); H0).

If p(x0) is sufficiently small, there's an indication of discrepancy from the null.

(Even Fisher had implicit alternatives, by the way)
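A minimal numerical sketch of this definition (Python, with hypothetical numbers; it assumes the test statistic d(X) = √n(x̄ − µ0)/σ of the Normal testing example above):

```python
from math import sqrt
from statistics import NormalDist

# One-sided p-value for the Fisherian test sketched above:
# d(X) = sqrt(n)*(xbar - mu0)/sigma is N(0,1) under H0: mu = 0.
# All numbers are hypothetical, for illustration only.
mu0, sigma, n = 0.0, 1.0, 100
xbar_obs = 0.2                                 # observed sample mean
d_obs = sqrt(n) * (xbar_obs - mu0) / sigma     # observed test statistic d(x0)
p_value = 1 - NormalDist().cdf(d_obs)          # Pr(d(X) > d(x0); H0)
print(f"d(x0) = {d_obs:.2f}, p = {p_value:.3f}")   # d(x0) = 2.00, p = 0.023
```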
P-value reasoning: from high capacity to curb enthusiasm

If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.

That merely indicates some discrepancy!
  
A genuine experimental effect is needed
“[W]e need, not an isolated record, but a reliable method of
procedure. In relation to the test of significance, we may say
that a phenomenon is experimentally demonstrable when we
know how to conduct an experiment which will rarely fail to
give us a statistically significant result.” (Fisher 1935, 14)
(low P-value ≠> H: statistical effect)
"[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter... requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions." (Gigerenzer 1989, 95-6) (H ≠> H*)
	
  
Still, simple Fisherian Tests have Important Uses

• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model

Gelman and Shalizi (meeting of minds between a Bayesian and an error statistician):

"What we are advocating, then, is what Cox and Hinkley (1974) call 'pure significance testing', in which certain of the model's implications are compared directly to the data, rather than entering into a contest with some alternative model." (p. 20)
  
Fallacy of Rejection (H –> H*): Erroneously take statistical significance as evidence of research hypothesis H*

The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test; the inference from a statistically significant result to H* fails to pass with severity.

Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence.'" (Meehl and Waller 2002, 184)

(Meehl was wrong to blame Fisher)
  
NHSTs are pseudostatistical:

Why do psychologists speak of NHSTs – tests that supposedly allow moving from statistical to substantive?

So defined, they exist only as abuses of tests: they exist as something you're never supposed to do

Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher's tests with explicit alternatives
  	
  
	
   	
  
Neyman-Pearson (N-P) Tests: null and alternative hypotheses H0, H1 that exhaust the parameter space

So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives)

Scotches criticisms that P-values are only under the null

Example: Test T+: sampling distribution of d(X) under null and alternatives. H0: µ ≤ µ0 vs. H1: µ > µ0

if d(x0) > cα, "reject" H0;
if d(x0) < cα, "do not reject" or "accept" H0.

e.g. cα = 1.96 for α = .025
  
	
  	
  
The sampling distribution yields Error Probabilities

Probability of a Type I error = P(d(X) > cα; H0) ≤ α.

Probability of a Type II error = P(d(X) < cα; µ1) = ß(µ1), for any µ1 > µ0.

The complement of the Type II error probability = power against µ1:

POW(µ1) = P(d(X) > cα; µ1)

Even without "best" tests, there are "good" tests
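A sketch of these quantities in code (Python, hypothetical numbers; it assumes the Normal test T+ above, where d(X) ~ N(√n(µ1 − µ0)/σ, 1) under an alternative µ1):

```python
from math import sqrt
from statistics import NormalDist

# Error probabilities for test T+ (H0: mu <= mu0 vs H1: mu > mu0),
# rejecting when d(X) = sqrt(n)*(xbar - mu0)/sigma > c_alpha.
z = NormalDist()
mu0, sigma, n, c_alpha = 0.0, 1.0, 100, 1.96   # c_alpha = 1.96 for alpha = .025

alpha = 1 - z.cdf(c_alpha)                     # Type I error: P(d(X) > c; H0)
mu1 = 0.3                                      # one particular alternative
shift = sqrt(n) * (mu1 - mu0) / sigma          # mean of d(X) under mu1
beta = z.cdf(c_alpha - shift)                  # Type II error at mu1
power = 1 - beta                               # POW(mu1) = P(d(X) > c; mu1)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, POW = {power:.3f}")
# alpha = 0.025, beta = 0.149, POW = 0.851
```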
  
	
   	
  
N-P test in terms of the P-value: reject H0 iff P-value < .025

• Even N-P report the attained significance level or P-value (Lehmann)

• "reject/do not reject" are uninterpreted parts of the mathematical apparatus

Reject could be: "Declare statistically significant at the p-level"

• "The tests… must be used with discretion and understanding" (N-P, 1928, p. 58)
("it's the methods, stupid")
  
	
  
	
   	
  
Why Inductive Behavior?

N-P justify tests (and confidence intervals) by performance: control of long-run error (coverage) probabilities

They called this inductive behavior. Why?

• They were reaching conclusions beyond the data (inductive)
• If inductive inference is probabilist, then they needed a new term.

In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we'd act rather than beliefs

(I'm not knocking performance, but error probabilities also serve for particular inferences—evidential)
  
N-P tests can still commit a type of fallacy of rejection: infer a discrepancy beyond what's warranted

––especially with n sufficiently large: the large n problem.

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)

What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? (The larger sample size is like the one that goes off with burnt toast.)
  
	
  
	
   	
  
Fallacy of Non-Significant Results: Insensitive Tests

• Negative results may not warrant 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed

Similar to Cohen's power analysis, but sensitive to the outcome—P-value distribution (#3)

• I hear some replicationists say negative results are uninformative: not so (#2 ironies)

No point in running replication research if your account views negative results as uninformative
  
Error statistics gives an evidential interpretation to tests (#3)

Use results to infer discrepancies from a null that are well ruled out, and those which are not

I'd never just report a P-value

Mayo (1996);
Mayo and Cox (2010): Frequentist Principle of Evidence: FEV;
Mayo and Spanos (2006): SEV
  
	
  
One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0

d(x) is statistically significant (set lower bounds):

(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication of µ > µ0 + γ.

(ii) If the test had little (or even moderate) capacity (e.g. < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication of µ > µ0 + γ.

(If an even more impressive result is probable, due to guppies, it's not a good indication of a great whale)
   	
  
d(x) is not statistically significant (set upper bounds):

(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.

(ii) If the test had a low probability of a more statistically significant difference if µ > µ0 + γ, then d(x) is a poor indication that µ ≤ µ0 + γ. (too insensitive to rule out discrepancy γ)

If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can't detect (e.g., risks of concern)
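A sketch of how these benchmarks might be computed (Python, hypothetical numbers; this is my reading of the SEV assessments in Mayo and Spanos 2006 for the Normal test T+, where SEV(µ > µ0 + γ) = Pr(d(X) ≤ d(x0); µ = µ0 + γ) for a significant result and SEV(µ ≤ µ0 + γ) = Pr(d(X) > d(x0); µ = µ0 + γ) for a non-significant one):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()

def sev_greater(xbar, mu0, gamma, sigma, n):
    """SEV(mu > mu0 + gamma) after a significant result in test T+:
    Pr(d(X) <= d(x0); mu = mu0 + gamma)."""
    return z.cdf(sqrt(n) * (xbar - (mu0 + gamma)) / sigma)

def sev_leq(xbar, mu0, gamma, sigma, n):
    """SEV(mu <= mu0 + gamma) after a non-significant result in test T+:
    Pr(d(X) > d(x0); mu = mu0 + gamma)."""
    return 1 - z.cdf(sqrt(n) * (xbar - (mu0 + gamma)) / sigma)

# Hypothetical data: mu0 = 0, sigma = 1, n = 100, observed xbar = 0.196,
# so d(x0) = 1.96. The same significant result warrants small
# discrepancies well, larger ones poorly:
for gamma in (0.0, 0.1, 0.2):
    print(f"SEV(mu > {gamma}) = {sev_greater(0.196, 0.0, gamma, 1.0, 100):.3f}")
# SEV(mu > 0.0) = 0.975, SEV(mu > 0.1) = 0.831, SEV(mu > 0.2) = 0.484
```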
  
Confidence Intervals also require supplementing

Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level

• Still too dichotomous: in/out, plausible/not plausible (permit fallacies of rejection/non-rejection).
• Justified in terms of long-run coverage (performance).
• All members of the CI treated on par.
• Fixed confidence level (SEV needs several benchmarks).
• Estimation is important, but we need tests for distinguishing real and spurious effects, and checking assumptions of statistical models.
  
	
  
The evidential interpretation is crucial, but error probabilities can be violated by selection effects (also by violated model assumptions)

One function of severity is to identify which selection effects are problematic (not all are) (#3).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
  	
  	
  
	
   	
  
Nominal vs actual significance levels

"Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' … The actual level of significance is not 5 percent, but 64 percent!" (Selvin 1970, p. 104)

• They were clear on the fallacy: blurring the "computed" or "nominal" significance level and the "actual" level

• There are many more ways you can be wrong with hunting (different sample space)
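Selvin's arithmetic is easy to check (Python; assuming twenty independent tests, each at the .05 level, with all nulls true):

```python
# Probability that at least one of twenty independent tests reaches
# nominal .05 significance even though every null hypothesis is true:
alpha, k = 0.05, 20
actual_level = 1 - (1 - alpha) ** k
print(f"actual level ~ {actual_level:.2f}")   # actual level ~ 0.64
```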
  
	
  
	
   	
  
This is a genuine example of an invalid or unsound method

You report: Such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0

(Formally): You say Pr(P-value < Pobs; H0) ~ α (small)
but in fact Pr(P-value < Pobs; H0) = high, if not guaranteed

• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).

• Worse are those accounts where the abuse vanishes!
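A small simulation of this invalidity (Python; the hunting scenario is assumed: twenty independent one-sided z-tests, every null true, only the smallest P-value reported):

```python
import random
from statistics import NormalDist

# A hunter runs 20 independent one-sided z-tests with every null true
# and reports only the smallest P-value. Under H0 each P is uniform(0,1),
# so the reported P falls below .05 about 64% of the time, not 5%.
random.seed(1)
z = NormalDist()
trials, k = 10_000, 20
hits = sum(
    min(1 - z.cdf(random.gauss(0, 1)) for _ in range(k)) < 0.05
    for _ in range(trials)
)
print(f"Pr(reported P < .05; all H0 true) ~ {hits / trials:.2f}")   # ~ 0.64
```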
  
What defies scientific sense?

On some views, biasing selection effects are irrelevant…. Stephen Goodman (epidemiologist):

"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)
  	
  
Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, Bayes factors (inference is comparative: data favor this over that–is that even inference?)
  
	
  
	
  
All error probabilities violate the LP (even without selection effects):

"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)

The information is just a matter of our "intentions":

"The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects." (Rosenkrantz 1977, 122)
  
Many current Reforms are Probabilist

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or just lower the P-value (so that the maximally likely alternative gets .95 posterior), while ignoring biasing selection effects, will fail.

The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.

With one big difference: your direct basis for criticism and possible adjustments has just vanished.

(lots of #2 inconsistencies)
	
  
How might probabilists block intuitively unwarranted inferences? (Consider first subjective)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the interpretation
of the evidence, we wouldn’t be fooled
We know these things are unbelievable, a subjective Bayesian
might say
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
It wouldn’t help with our most important problem:
• How to distinguish the warrant for a single hypothesis H
with different methods (e.g., one has biasing selection
effects, another, registered results and precautions)?
So now you’ve got two sources of flexibility, priors and biasing
selection effects (which can no longer be criticized).
Besides, researchers really do believe their hypotheses.
Diederik Stapel says he always read the research literature
extensively to generate his hypotheses.
“So that it was believable and could be argued that this
was the only logical thing you would find.” (E.g., eating
meat causes aggression.)
(In "The Mind of a Con Man," NY Times, April 26, 2013)
Conventional Bayesians

The most popular probabilisms these days are "non-subjective" (reference, default) or conventional, designed to prevent prior beliefs from influencing the posteriors:

"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…." (Cox and Mayo 2010, p. 299)

How might they avoid too-easy rejections of a null?
  
	
   	
  
Cult of the Holy Spike

Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This "spiked concentration of belief in the null" is at odds with the prevailing view "we know all nulls are false" (#2)

Bottom line: By convenient choices of priors and alternatives, statistically significant differences can be evidence for the null

The conflict often considers the two-sided test
H0: µ = 0 versus H1: µ ≠ 0
  	
  	
  
	
  
	
  	
  
Posterior Probabilities in H0 (n = sample size)

  p       z         n=50     n=100    n=1000
 .10     1.645      .65      .72      .89
 .05     1.960      .52      .60      .82
 .01     2.576      .22      .27      .53
 .001    3.291      .034     .045     .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82!

From Berger and Sellke (1987), based on a Jeffreys prior
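The table's entries can be reproduced (to rounding) by a short computation. A sketch, on my understanding of the Berger and Sellke (1987) setup: a .5 spike on H0: µ = 0 and, under H1, a Normal (Jeffreys-type) prior matched to the sample, giving Bayes factor B01 = √(1+n)·exp(−z²n/(2(1+n))) and posterior P(H0|z) = B01/(1 + B01):

```python
from math import sqrt, exp

def posterior_H0(z, n):
    """P(H0 | z) with a .5 spike on H0: mu = 0 and a N(0, sigma^2) prior
    under H1 -- my reading of Berger & Sellke's (1987) two-sided setup."""
    B01 = sqrt(1 + n) * exp(-z * z * n / (2 * (1 + n)))   # Bayes factor for H0
    return B01 / (1 + B01)

for z in (1.645, 1.960, 2.576, 3.291):
    row = ", ".join(f"n={n}: {posterior_H0(z, n):.2f}" for n in (50, 100, 1000))
    print(f"z = {z}: {row}")
# z = 1.96 gives n=50: 0.52, n=100: 0.60, n=1000: 0.82, matching the table
```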
  
	
  
• With a z = 1.96 difference, the 95% CI (2-sided), or the .975 one-sided CI, excludes the null (0) from the interval

• Severity reasoning: Were H0 true, the probability of getting d(X) < dobs is high (~.975), so SEV(µ > 0) ~ .975

• But they give P(H0 | z = 1.96) = .82

• Error statistical critique: there's a high probability that they give a posterior probability of .82 to H0: µ = 0 erroneously

• The onus is on probabilists to show a high posterior for H constitutes having passed a good test.
Informal and Quasi-Formal Severity: H –> H*

• Error statisticians avoid the fallacy of going directly from statistical to research hypothesis H*
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3)

I will not discuss straw man studies ("chump effects").

This is believable: men react more negatively to the success of their partners than to their failures (compared to women).

Studies have shown:
H: partner's success lowers self-esteem in men
Macho Men

H*: partner's success lowers self-esteem in men

I have no doubts that certain types of men feel threatened by the success of their female partners, wives or girlfriends.

I've even known a few.

Can this be studied in the lab? Ratliff and Oishi (2013) did:

H*: "men's implicit self-esteem is lower when a partner succeeds than when a partner fails."

Not so for women.

Their example does a good job, given the standards in place.
Treatments: Subjects are randomly assigned to five "treatments": think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or a typical day (control).

Effects: a measure of "self-esteem"
Explicit: "How do you feel about yourself?"
Implicit: a test of word associations with "me" versus "other".

None showed statistical significance in explicit self-esteem, so consider just the implicit measures
	
   	
  
	
  
Some null hypotheses: The average self-esteem score is no different (these are statistical hypotheses)
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it's a regular day

There are at least double this, given self-esteem could be "explicit" or "implicit" (others too, e.g., the area of success)

Only null (a) was rejected statistically!

Should they have taken the research hypothesis as disconfirmed by negative cases?
Or as casting doubt on their test?
  	
  
Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?

They opt for the third.

It's not that they should have regarded their research hypothesis H* as disconfirmed, much less falsified.

This is precisely the nub of the problem! I'm saying the hypothesis that the study isn't well run needs to be considered:
• Is the artificial writing assignment sufficiently relevant to the phenomenon of interest? (look at proxy variables)
• Is the measure of implicit self-esteem (word associations) a valid measure of the effect? (measurements of effects)
Take null hypothesis b): The average self-esteem score is no different when partner beats (surpasses) me or I beat her

Clearly they expected "she beat me in X" to have a greater negative impact on self-esteem than "she succeeded at X".

Still, they could view it as lending "some support to the idea that men interpret 'my partner is successful' as 'my partner is more successful than me'" (p. 698)…

…as do the authors.

That is, any success of hers is always construed by Macho man as: she beat me.
  
	
  
Bending over Backwards

For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of "self-sealing fallacy".

I want to be clear that this is not a criticism of them given existing standards.

"I'm talking about a specific, extra type of integrity... bending over backwards to show how you're maybe wrong, that you ought to have when acting as a scientist." (R. Feynman 1974)

I'm describing what's needed to show "sincerely trying to find flaws" under the austere account I recommend.

The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.
  	
  
Conclusion: Replication Research in Psychology Under an Error Statistical Philosophy

Replication problems can't be solved without correctly understanding their sources.

Biggest sources of problems in replication crises:
(a) Stat H –> research H*, and (b) biasing selection effects

Reasons for (a): focus on P-values and Fisherian tests, ignoring N-P tests (and the illicit NHST that goes directly H –> H*)
Another reason, a false dilemma:
probabilism or long-run performance,
plus assuming that N-P can only give the latter.

I argue for a third use of probability: rather than report on believability, researchers need to report the properties of the methods they used:

What was their capacity to have identified, avoided, admitted bias?

What's wanted is not a high posterior probability in H (however construed) but a high probability that the procedure would have unearthed flaws in H (reinterpretation of N-P methods)
  
What's replicable? Discrepancies that are severely warranted.

Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP)

There's no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.
  
	
   	
  
	
  
Informal assessments of probativeness are needed to scrutinize statistical inferences in relation to research hypotheses (H –> H*)

One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.)

The scientific status of an inquiry is questionable if it cannot or will not distinguish the correctness of inferences from problems stemming from a poorly run study.

If ordinary research reports adopted the Feynman "bending over backwards" scrutiny, the interpretation of replication efforts would be more informative (or perhaps not needed)
REFERENCES

Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). "Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer." Journal of Clinical Oncology 26(7): 1186-1187.
Bartlett, T. (2012). "Daniel Kahneman Sees 'Train-Wreck Looming' for Social Psychology". Chronicle of Higher Education Blog (Oct. 4, 2012), article with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338.
Berger, J. O. (2006). "The Case for Objective Bayesian Analysis." Bayesian Analysis 1(3): 385-402.
Berger, J. O. & Sellke, T. (1987). "Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion)." Journal of the American Statistical Association 82(397): 112-122.
Bhattacharjee, Y. (2013). "The Mind of a Con Man". The New York Times Magazine (4/28/2013), p. 44.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.
Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). "Microarrays: retracing steps." Nature Medicine 13(11): 1276-7.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. & Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.
Diaconis, P. (1978). "Statistical Problems in ESP Research". Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)
Dienes, Z. (2011). "Bayesian versus Orthodox Statistics: Which Side Are You On?" Perspectives on Psychological Science 6(3): 274-290.
Feynman, R. (1974). "Cargo Cult Science." Caltech Commencement Speech.
Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
Gelman, A. (2011). "Induction and Deduction in Bayesian Data Analysis." Edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley. Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science): 67-78.
Gelman, A. & Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology 66(1): 8-38.
Gigerenzer, G. (2000). "The Superego, the Ego, and the Id in Statistical Reasoning." In Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.
Goodman, S. N. (1999). "Toward evidence-based medical statistics. 2: The Bayes factor." Annals of Internal Medicine 130: 1005-1013.
Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.
Johansson, T. (2010). "Hail the impossible: p-values, evidence, and likelihood." Scandinavian Journal of Psychology 52: 113-125.
Kruschke, J. K. (2010). "What to believe: Bayesian methods for data analysis". Trends in Cognitive Sciences 14(7): 297-300.
Lehmann, E. L. (1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?" Journal of the American Statistical Association 88(424): 1242-1249.
Levelt Committee, Noort Committee, Drenth Committee. (2012). "Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel". Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/
Lindley, D. V. (1971). "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
Mayo, D. G. & Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference". In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), 1-27. Cambridge: Cambridge University Press. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. & Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57(2): 323-357.
Mayo, D. G. & Spanos, A. (2011). "Error Statistics." In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
Meehl, P. E. & Waller, N. G. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7(3): 283-300.
Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. National Academies Press.
Morrison, D. E. & Henkel, R. E. (Eds.). (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
Neyman, J. (1957). "'Inductive Behavior' as a Basic Concept of Philosophy of Science." Revue de l'Institut International de Statistique/Review of the International Statistical Institute 25(1/3): 7-22.
Neyman, J. & Pearson, E. S. (1928). "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I." Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, University of California Press, Berkeley, 1967, pp. 1-66.)
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). "Genomic signatures to guide the use of chemotherapeutics." Nature Medicine 12(11): 1294-300. Epub 2006 Oct 22.
Potti, A. & Nevins, J. R. (2007). "Reply to Coombes, Wang & Baggerly." Nature Medicine 13(11): 1277-8.
Ratliff, K. A. & Oishi, S. (2013). "Gender Differences in Implicit Self-Esteem Following a Romantic Partner's Success or Failure". Journal of Personality and Social Psychology 105(4): 688-702.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
Savage, L. J. (1964). "The Foundations of Statistics Reconsidered." In Studies in Subjective Probability, H. Kyburg & H. Smokler (eds.), 173-188. New York: John Wiley & Sons.
Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
Trafimow, D. & Marks, M. (2015). "Editorial". Basic and Applied Social Psychology 37(1): 1-2.
Wagenmakers, E.-J. (2007). "A Practical Solution to the Pervasive Problems of P Values". Psychonomic Bulletin & Review 14(5): 779-804.

jemille6
 

More from jemille6 (20)

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 

Recently uploaded

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 

Recently uploaded (20)

Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 

D. Mayo: Replication Research Under an Error Statistical Philosophy

  • 1. SPP  D.  Mayo   1   Replication Research Under an Error Statistical Philosophy Deborah Mayo Around a year ago on my blog: “There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing” Philosopher’s talk: I see a rich source of problems that cry out for ministrations of philosophers of science and of statistics
  • 2. SPP  D.  Mayo   2   Three main philosophical tasks: #1 Clarify concepts and presuppositions #2 Reveal inconsistencies, puzzles, tensions (“ironies”) #3 Solve problems, improve on methodology • Philosophers usually stop with the first two, but I think going on to solve problems is important. This presentation is ‘programmatic’- what might replication research under an error statistical philosophy be? My interest grew thanks to Caitlin Parker whose MA thesis was on the topic
  • 3. SPP  D.  Mayo   3   Example of a conceptual clarification (#1) Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (2015 Trafimow and Marks) • Since the methodology of testing explicitly rejects the mode of inference they don’t supply, it would be incorrect to claim the methods were invalid. • Simple conceptual job that philosophers are good at
  • 4. SPP  D.  Mayo   4   Example of revealing inconsistencies and tensions (#2) Critic: It’s too easy to satisfy standard significance thresholds You: Why do replicationists find it so hard to achieve significance thresholds? Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs You: So, the replication researchers want methods that pick up on and block these biasing selection effects. Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference
  • 5. SPP  D.  Mayo   5   Whether this can be resolved or not is separate. • We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility • As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling The philosopher is the curmudgeon (takes chutzpah!) I’ll give examples of #1 clarifying terms #2 inconsistencies #3 proposed solutions (though I won’t always number them) .
  • 6. SPP  D.  Mayo   6   Demarcation: Bad Methodology/Bad Statistics • A lot of the recent attention grew out of the case of Diederik Stapel, the social psychologist who fabricated his data. • Kahneman  in  2012  “I  see  a  train-­‐wreck  looming,”  setting   up  a  “daisy  chain”  of  replication.   • The Stapel investigators: 2012 Tilberg Report, “Flawed Science” do a good job of characterizing pseudoscience. • Philosophers tend to have cold feet when it comes to saying anything general about science versus pseudoscience.
"…continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher's research hypotheses, and essentially render the hypotheses immune to the facts." (Report, p. 48)

They walked into a "culture of verification bias".
SPP D. Mayo 8
Bad Statistics

Severity Requirement: If data x0 agree with a hypothesis H, but the test procedure had little or no capability, i.e., little or no probability of finding flaws with H (even if H is incorrect), then x0 provide poor evidence for H.

Such a test, we would say, fails a minimal requirement for a stringent or severe test.

• This seems utterly uncontroversial.

SPP D. Mayo 9
• Methods that scrutinize a test's capabilities, according to their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.
• New name: "frequentist", "sampling theory", "Fisherian", "Neyman-Pearsonian" are too associated with hard-line views and personality conflicts ("It's the methods, stupid").
(example of new solutions #3)

SPP D. Mayo 10
Are philosophies about science relevant?

One of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. (p. 57)
SPP D. Mayo 11
A critic might protest: "There's nothing philosophical about my criticism of significance tests: a small p-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis that the observed difference is mere chance."

Really? P-values are not intended to be used this way; presupposing they should be stems from a conception of the role of probability in statistical inference, and this conception is philosophical.

(Of course, criticizing them because they might be misinterpreted is just silly.)

SPP D. Mayo 12
Two main views of the role of probability in inference

Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.

Performance. To ensure long-run reliability of methods, coverage probabilities; to control the relative frequency of erroneous inferences in a long-run series of trials.

What happened to the goal of scrutinizing bad science by the severity criterion?

SPP D. Mayo 13
• Neither "probabilism" nor "performance" directly captures it.
• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.
• The problems with selective reporting, multiple testing, and stopping when the data look good are not problems about long runs.
• It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpretation.

SPP D. Mayo 14
• Probabilism says H is not justified unless it's true or probable (made firmer).
• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe ways we can be wrong about H.
• If it's assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness (new philosophy for probability #3).
• Both H and not-H can be poorly tested, so a severe testing assessment violates probability.

SPP D. Mayo 15
Understanding the Replication Crisis Requires Understanding How It Intermingles with PhilStat Controversies

• It's not that I'm keen to defend many common uses of significance tests.
• It's just that the criticisms (in psychology and elsewhere) are based on serious misunderstandings of the nature and role of these methods; consequently, so are many "reforms".
• How can you be sure the reforms are better if you might be mistaken about existing methods?

SPP D. Mayo 16
Criticisms concern a kind of Fisherian Significance Test

(i) Sample space: Let the sample X = (X1, …, Xn) be n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ.

(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0)

(iii) Test statistic: a function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0. The larger d(x0), the further the outcome from what's expected under H0, with respect to the particular question.

(iv) Sampling distribution of the test statistic d(X)
SPP D. Mayo 17
The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = Pr(d(X) > d(x0); H0)

If p(x0) is sufficiently small, there's an indication of discrepancy from the null.

(Even Fisher had implicit alternatives, by the way.)
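A minimal Python sketch of this computation for the Normal test above, assuming known σ (the observed mean, σ, and n below are made-up numbers):

    import numpy as np
    from scipy.stats import norm

    def p_value(xbar, mu0, sigma, n):
        """One-sided p-value for the test statistic d(X) = sqrt(n)(Xbar - mu0)/sigma."""
        d_obs = np.sqrt(n) * (xbar - mu0) / sigma
        return norm.sf(d_obs)  # Pr(d(X) > d_obs; H0)

    # Hypothetical data: xbar = 0.196, sigma = 1, n = 100 give d_obs = 1.96
    print(p_value(0.196, 0.0, 1.0, 100))  # ~0.025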
SPP D. Mayo 18
P-value reasoning: from high capacity to curb enthusiasm

If the hypothesis H0 is correct then, with high probability 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.

That merely indicates some discrepancy!

SPP D. Mayo 19
A genuine experimental effect is needed

"[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." (Fisher 1935, 14)
(low P-value ≠> H: statistical effect)

"[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions." (Gigerenzer 1989, 95-6)
(H ≠> H*)

SPP D. Mayo 20
Still, simple Fisherian tests have important uses:
• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model

Gelman and Shalizi (a meeting of minds between a Bayesian and an error statistician):
"What we are advocating, then, is what Cox and Hinkley (1974) call 'pure significance testing', in which certain of the model's implications are compared directly to the data, rather than entering into a contest with some alternative model." (p. 20)

SPP D. Mayo 21
Fallacy of Rejection (H –> H*): erroneously taking statistical significance as evidence for research hypothesis H*

The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.

Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence.'" (Meehl and Waller 2002, 184)
(Meehl was wrong to blame Fisher.)

SPP D. Mayo 22
NHSTs are pseudostatistical

Why do psychologists speak of NHSTs, tests that supposedly allow moving from the statistical to the substantive?

So defined, they exist only as abuses of tests: they exist as something you're never supposed to do.

Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher's tests with explicit alternatives.

SPP D. Mayo 23
Neyman-Pearson (N-P) Tests: a null and an alternative hypothesis, H0 and H1, that exhaust the parameter space

So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives).

This scotches criticisms that P-values are computed only under the null.

Example: Test T+, with the sampling distribution of d(X) under the null and alternatives:
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, "reject" H0;
if d(x0) < cα, "do not reject" or "accept" H0
(e.g., cα = 1.96 for α = .025)
SPP D. Mayo 24
The sampling distribution yields error probabilities:

Probability of a Type I error = P(d(X) > cα; H0) ≤ α
Probability of a Type II error = P(d(X) < cα; µ1) = β(µ1), for any µ1 > µ0
The complement of the Type II error probability is the power against µ1:
POW(µ1) = P(d(X) > cα; µ1)

Even without "best" tests, there are "good" tests.
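A sketch of these quantities for test T+ in Python (my illustration, with σ = 1, n = 100, and a made-up alternative µ1 = 0.3):

    import numpy as np
    from scipy.stats import norm

    def power(mu1, mu0=0.0, sigma=1.0, n=100, c_alpha=1.96):
        """POW(mu1) = Pr(d(X) > c_alpha; mu1), with d(X) = sqrt(n)(Xbar - mu0)/sigma."""
        return norm.sf(c_alpha - np.sqrt(n) * (mu1 - mu0) / sigma)

    print(norm.sf(1.96))   # Type I error probability at the cutoff: ~0.025
    print(1 - power(0.3))  # Type II error beta(mu1) at mu1 = 0.3: ~0.15
    print(power(0.3))      # power against mu1 = 0.3: ~0.85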
SPP D. Mayo 25
N-P test in terms of the P-value: reject H0 iff P-value < .025

• Even N-P report the attained significance level or P-value (Lehmann).
• "Reject"/"do not reject" are uninterpreted parts of the mathematical apparatus.
• "Reject" could be: "declare statistically significant at the p-level".
• "The tests… must be used with discretion and understanding" (Neyman and Pearson 1928, p. 58)
("it's the methods, stupid")

SPP D. Mayo 26
Why "inductive behavior"?

N-P justify tests (and confidence intervals) by performance: control of long-run error and coverage probabilities.

They called this inductive behavior. Why?
• They were reaching conclusions beyond the data (inductive).
• If inductive inference is probabilist, then they needed a new term.

In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we'd act rather than our beliefs.

(I'm not knocking performance, but error probabilities also serve, evidentially, for particular inferences.)

SPP D. Mayo 27
N-P tests can still commit a type of fallacy of rejection: inferring a discrepancy beyond what's warranted, especially with n sufficiently large (the large-n problem).

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2). (See the sketch below.)

What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? (The larger sample size is like the alarm that goes off with burnt toast.)
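To put numbers on the alarm analogy, a small sketch (mine, with σ = 1): the same just-significant z = 1.96 warrants, with severity .9, a discrepancy ten times smaller when n = 10,000 than when n = 100:

    import numpy as np
    from scipy.stats import norm

    # Largest gamma with SEV(mu > mu0 + gamma) = .90, given z = 1.96 and sigma = 1
    for n in (100, 10_000):
        gamma = (1.96 - norm.ppf(0.90)) / np.sqrt(n)
        print(n, round(gamma, 4))  # n=100: ~0.068; n=10000: ~0.0068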
SPP D. Mayo 28
Fallacy of Non-Significant Results: insensitive tests

• Negative results may not warrant a zero discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.

Similar to Cohen's power analysis, but sensitive to the outcome: the P-value distribution (#3).

• I hear some replicationists say negative results are uninformative: not so (#2, ironies).

There's no point in running replication research if your account views negative results as uninformative.

SPP D. Mayo 29
Error statistics gives an evidential interpretation to tests (#3)

Use results to infer discrepancies from a null that are well ruled out, and those which are not.

I'd never just report a P-value.

Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); Mayo and Spanos (2006): SEV

SPP D. Mayo 30
One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0

d(x) is statistically significant (set lower bounds):

(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication of µ > µ0 + γ.

(ii) If the test had little (or even moderate) capacity (e.g., < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication of µ > µ0 + γ.

(If an even more impressive result is probable, due to guppies, it's not a good indication of a great whale.)

SPP D. Mayo 31
d(x) is not statistically significant (set upper bounds):

(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.

(ii) If the test had a low probability of a more statistically significant difference if µ > µ0 + γ, then d(x) is a poor indication that µ ≤ µ0 + γ (too insensitive to rule out discrepancy γ).

If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can't detect (e.g., risks of concern). (A sketch of both computations follows.)
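Both computations reduce to a Normal tail area; here is a sketch (mine, not from the slides) for test T+ with known σ:

    import numpy as np
    from scipy.stats import norm

    def sev_greater(d_obs, gamma, sigma=1.0, n=100):
        """SEV(mu > mu0 + gamma) = Pr(d(X) <= d_obs; mu = mu0 + gamma)."""
        return norm.cdf(d_obs - np.sqrt(n) * gamma / sigma)

    def sev_leq(d_obs, gamma, sigma=1.0, n=100):
        """SEV(mu <= mu0 + gamma) = Pr(d(X) > d_obs; mu = mu0 + gamma)."""
        return norm.sf(d_obs - np.sqrt(n) * gamma / sigma)

    print(sev_greater(1.96, 0.0))  # ~.975: mu > mu0 is well indicated
    print(sev_greater(1.96, 0.2))  # ~.48: mu > mu0 + 0.2 is poorly indicated
    print(sev_leq(1.5, 0.3))       # ~.93: a negative result rules out gamma = 0.3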
SPP D. Mayo 32
Confidence intervals also require supplementing

Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level.

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection).
• Justified in terms of long-run coverage (performance).
• All members of the CI treated on a par.
• Fixed confidence level (SEV needs several benchmarks).
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking the assumptions of statistical models.

SPP D. Mayo 33
The evidential interpretation is crucial, but error probabilities can be violated by selection effects (and by violated model assumptions).

One function of severity is to identify which selection effects are problematic (not all are) (#3).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.

SPP D. Mayo 34
Nominal vs. actual significance levels

"Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' … The actual level of significance is not 5 percent, but 64 percent!" (Selvin 1970, p. 104)

• They were clear on the fallacy: blurring the "computed" or "nominal" significance level and the "actual" level.
• There are many more ways you can be wrong with hunting (a different sample space).
SPP D. Mayo 35
This is a genuine example of an invalid or unsound method.

You report: such results would be difficult to achieve under the assumption of H0,
when in fact such results are common under the assumption of H0.

Formally: you say Pr(P-value < Pobs; H0) ~ α (small),
but in fact Pr(P-value < Pobs; H0) is high, if not guaranteed.
(A quick check of Selvin's figure appears below.)

• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).
• Worse are those accounts where the abuse vanishes!
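Selvin's 64 percent is just 1 − .95^20. A quick check of the formal point (my sketch, hunting through 20 independent one-sided z-tests with all nulls true):

    import numpy as np

    print(1 - 0.95**20)  # ~0.64: the "actual" significance level

    rng = np.random.default_rng(1)
    z = rng.standard_normal((100_000, 20))  # 20 null tests per trial
    print((z.max(axis=1) > 1.645).mean())   # ~0.64, not the nominal 0.05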
SPP D. Mayo 36
What defies scientific sense?

On some views, biasing selection effects are irrelevant. Stephen Goodman (epidemiologist):

"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)

SPP D. Mayo 37
Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves.

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, Bayes factors (inference is comparative: the data favor this over that. Is that even inference?)

SPP D. Mayo 38
All error probabilities violate the LP (even without selection effects):

"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)

The information is just a matter of our "intentions":

"The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects." (Rosenkrantz 1977, 122)

SPP D. Mayo 39
Many current reforms are probabilist.

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, or HPD intervals, or to just lower the P-value (so that the maximally likely alternative gets a .95 posterior), will fail so long as they ignore biasing selection effects.

The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals. (A small simulation follows.)

With one big difference: your direct basis for criticism and possible adjustments has just vanished.

(lots of #2 inconsistencies)
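A sketch of the optional-stopping point (mine, with made-up settings): sample from a true null, test after every observation, and stop at the first nominal .05 rejection. The rejection rate is far above .05 and keeps rising with the maximum sample size, yet the likelihood function, and so any LP-respecting analysis, is untouched by the stopping rule:

    import numpy as np

    rng = np.random.default_rng(0)

    def ever_significant(n_max):
        x = rng.standard_normal(n_max)   # H0 true: mu = 0, sigma = 1
        n = np.arange(1, n_max + 1)
        z = np.cumsum(x) / np.sqrt(n)    # z-statistic after each observation
        return np.any(np.abs(z) > 1.96)  # any nominal two-sided rejection?

    for n_max in (100, 1000):
        hits = np.mean([ever_significant(n_max) for _ in range(2000)])
        print(n_max, hits)               # well above .05, growing with n_max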
SPP D. Mayo 40
How might probabilists block intuitively unwarranted inferences? (Consider first the subjective.)

When we hear there's statistical evidence for some unbelievable claim (distinguishing shades of grey and being politically moderate; ovulation and voting preferences), some probabilists claim: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn't be fooled.

We know these things are unbelievable, a subjective Bayesian might say.

That could work in some cases (though it still wouldn't show what researchers had done wrong): a battle of beliefs.

SPP D. Mayo 41
It wouldn't help with our most important problem:

• How to distinguish the warrant for a single hypothesis H reached by different methods (e.g., one with biasing selection effects; another with registered results and precautions)?

So now you've got two sources of flexibility, priors and biasing selection effects (which can no longer be criticized).

Besides, researchers really do believe their hypotheses.

SPP D. Mayo 42
Diederik Stapel says he always read the research literature extensively to generate his hypotheses: "So that it was believable and could be argued that this was the only logical thing you would find." (E.g., eating meat causes aggression.)

(In "The Mind of a Con Man," New York Times, April 26, 2013)

SPP D. Mayo 43
Conventional Bayesians

The most popular probabilisms these days are "non-subjective" (reference, default) or conventional, designed to prevent prior beliefs from influencing the posteriors:

"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities… ." (Cox and Mayo 2010, p. 299)

How might they avoid too-easy rejections of a null?

SPP D. Mayo 44
Cult of the Holy Spike

Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This "spiked concentration of belief in the null" is at odds with the prevailing view that "we know all nulls are false" (#2).

Bottom line: by convenient choices of priors and alternatives, statistically significant differences can be evidence for the null.

The conflict often considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0.

SPP D. Mayo 45
Posterior probabilities in H0, by sample size n:

p       z       n=50    n=100   n=1000
.10     1.645   .65     .72     .89
.05     1.960   .52     .60     .82
.01     2.576   .22     .27     .53
.001    3.291   .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null of .82!

From Berger and Sellke (1987), based on a Jeffreys prior. (See the sketch below.)
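The table can be reproduced in a few lines (my sketch; I take the alternative prior to be the unit-information Normal, µ ~ N(0, σ²), which I believe matches the Jeffreys-type setup Berger and Sellke analyze, with spike prior P(H0) = .5):

    import numpy as np

    def posterior_H0(z, n, prior_H0=0.5):
        """Posterior on H0: mu = 0, with mu ~ N(0, sigma^2) under the alternative."""
        B01 = np.sqrt(n + 1) * np.exp(-(z**2) * n / (2 * (n + 1)))  # Bayes factor
        odds = (prior_H0 / (1 - prior_H0)) * B01
        return odds / (1 + odds)

    for p, z in [(.10, 1.645), (.05, 1.960), (.01, 2.576), (.001, 3.291)]:
        print(p, [round(posterior_H0(z, n), 3) for n in (50, 100, 1000)])
    # the .05 row gives ~.52, .60, .82, as in the slide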
SPP D. Mayo 46
• With a z = 1.96 difference, the 95% CI (two-sided), or the .975 one-sided CI, excludes the null (0) from the interval.
• Severity reasoning: were H0 true, the probability of getting d(X) < d_obs is high (~.975), so SEV(µ > 0) ~ .975.
• But they give P(H0 | z = 1.96) = .82.
• Error statistical critique: there's a high probability that they assign a posterior probability of .82 to H0: µ = 0 erroneously.
• The onus is on probabilists to show that a high posterior for H constitutes having passed a good test.

SPP D. Mayo 47
Informal and Quasi-Formal Severity: H –> H*

• Error statisticians avoid the fallacy of going directly from a statistical to a research hypothesis H*.
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3).

I will not discuss straw-man studies ("chump effects").

This is believable: men react more negatively to the successes of their partners than to their failures (compared to women).

Studies have shown: H: partner's success lowers self-esteem in men.

SPP D. Mayo 48
Macho Men

H*: partner's success lowers self-esteem in men

I have no doubts that certain types of men feel threatened by the success of their female partners, wives or girlfriends. I've even known a few.

Can this be studied in the lab? Ratliff and Oishi (2013) did:

H*: "men's implicit self-esteem is lower when a partner succeeds than when a partner fails."

Not so for women.

Their example does a good job, given the standards in place.
SPP D. Mayo 49
Treatments: subjects are randomly assigned to five "treatments": think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or a typical day (control).

Effects: a measure of "self-esteem"
Explicit: "How do you feel about yourself?"
Implicit: a test of word associations with "me" versus "other".

None showed statistical significance in explicit self-esteem, so consider just the implicit measures.
SPP D. Mayo 50
Some null hypotheses (these are statistical hypotheses): the average self-esteem score is no different
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it's a regular day

There are at least double this number, given that self-esteem could be "explicit" or "implicit" (others too, e.g., the area of success).

Only null (a) was rejected statistically!

Should they have taken the research hypothesis as disconfirmed by the negative cases? Or as casting doubt on their test?

SPP D. Mayo 51
Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?

They opt for the third.

It's not that they should have regarded their research hypothesis H* as disconfirmed, much less falsified.

This is precisely the nub of the problem! I'm saying the hypothesis that the study isn't well run needs to be considered:
• Is the artificial writing assignment sufficiently relevant to the phenomenon of interest? (look at proxy variables)
• Is the measure of implicit self-esteem (word associations) a valid measure of the effect? (measurements of effects)

SPP D. Mayo 52
Take null hypothesis (b): the average self-esteem score is no different when partner beats (surpasses) me or I beat her.

Clearly they expected "she beat me in X" to have a greater negative impact on self-esteem than "she succeeded at X".

Still, they could view it as lending "some support to the idea that men interpret 'my partner is successful' as 'my partner is more successful than me'" (p. 698), …as do the authors.

That is, any success of hers is always construed by Macho Man as "she beat me".

SPP D. Mayo 53
Bending Over Backwards

For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of "self-sealing fallacy".

I want to be clear that this is not a criticism of them, given existing standards.

"I'm talking about a specific, extra type of integrity... bending over backwards to show how you're maybe wrong, that you ought to have when acting as a scientist." (R. Feynman 1974)

I'm describing what's needed to show "sincerely trying to find flaws" under the austere account I recommend.

The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.

SPP D. Mayo 54
Conclusion: Replication Research in Psychology Under an Error Statistical Philosophy

Replication problems can't be solved without correctly understanding their sources.

The biggest sources of problems in replication crises:
(a) statistical H –> research H*, and (b) biasing selection effects.

Reasons for (a): a focus on P-values and Fisherian tests, ignoring N-P tests (and the illicit NHST that goes directly from H to H*).

SPP D. Mayo 55
Another reason, a false dilemma: probabilism or long-run performance, plus assuming that N-P can only give the latter.

I argue for a third use of probability: rather than reporting on believability, researchers need to report the properties of the methods they used. What was their capacity to have identified, avoided, admitted bias?

What's wanted is not a high posterior probability in H (however construed) but a high probability that the procedure would have unearthed flaws in H (a reinterpretation of N-P methods).

SPP D. Mayo 56
What's replicable? Discrepancies that are severely warranted.

Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP).

There's no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.

SPP D. Mayo 57
Informal assessments of probativeness are needed to scrutinize statistical inferences in relation to research hypotheses (H –> H*).

One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.).

The scientific status of an inquiry is questionable if it cannot or will not distinguish the correctness of inferences from problems stemming from a poorly run study.

If ordinary research reports adopted the Feynman "bending over backwards" scrutiny, the interpretation of replication efforts would be more informative (or perhaps replication efforts would not be needed).
REFERENCES

Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). "Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer." Journal of Clinical Oncology 26(7): 1186-1187.
Bartlett, T. (2012). "Daniel Kahneman Sees 'Train-Wreck Looming' for Social Psychology." Chronicle of Higher Education blog (Oct. 4, 2012), with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338
Berger, J. O. (2006). "The Case for Objective Bayesian Analysis." Bayesian Analysis 1(3): 385-402.
Berger, J. O. & Sellke, T. (1987). "Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion)." Journal of the American Statistical Association 82(397): 112-122.
Bhattacharjee, Y. (2013). "The Mind of a Con Man." The New York Times Magazine (4/28/2013), p. 44.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.
Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). "Microarrays: retracing steps." Nature Medicine 13(11): 1276-7.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. & Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276-304. Cambridge: Cambridge University Press.
Diaconis, P. (1978). "Statistical Problems in ESP Research." Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)
Dienes, Z. (2011). "Bayesian versus Orthodox Statistics: Which Side Are You On?" Perspectives on Psychological Science 6(3): 274-290.
Feynman, R. (1974). "Cargo Cult Science." Caltech commencement speech.
Fisher, R. A. (1947). The Design of Experiments. 4th ed. Edinburgh: Oliver and Boyd.
Gelman, A. (2011). "Induction and Deduction in Bayesian Data Analysis." Rationality, Markets and Morals 2 (Special Topic: Statistical Science and Philosophy of Science), edited by D. G. Mayo, A. Spanos, and K. W. Staley: 67-78.
Gelman, A. & Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology 66(1): 8-38.
Gigerenzer, G. (2000). "The Superego, the Ego, and the Id in Statistical Reasoning." In Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.
Goodman, S. N. (1999). "Toward evidence-based medical statistics. 2: The Bayes factor." Annals of Internal Medicine 130: 1005-1013.
Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.
Johansson, T. (2010). "Hail the impossible: p-values, evidence, and likelihood." Scandinavian Journal of Psychology 52: 113-125.
Kruschke, J. K. (2010). "What to believe: Bayesian methods for data analysis." Trends in Cognitive Sciences 14(7): 297-300.
Lehmann, E. L. (1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?" Journal of the American Statistical Association 88(424): 1242-1249.
Levelt Committee, Noort Committee, Drenth Committee (2012). "Flawed science: The fraudulent research practices of social psychologist Diederik Stapel." Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/
Lindley, D. V. (1971). "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
Mayo, D. G. & Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. First appeared in The Second Erich L. Lehmann Symposium: Optimality (2006), Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. & Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57(2): 323-357.
Mayo, D. G. & Spanos, A. (2011). "Error Statistics." In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
Meehl, P. E. & Waller, N. G. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7(3): 283-300.
Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.) (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine. National Academies Press.
Morrison, D. E. & Henkel, R. E. (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
Neyman, J. (1957). "'Inductive Behavior' as a Basic Concept of Science." Revue de l'Institut International de Statistique/Review of the International Statistical Institute 25(1/3): 7-22.
Neyman, J. & Pearson, E. S. (1928). "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I." Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, Berkeley: University of California Press, 1967, pp. 1-66.)
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). "Genomic signatures to guide the use of chemotherapeutics." Nature Medicine 12(11): 1294-1300.
Potti, A. & Nevins, J. R. (2007). "Reply to Coombes, Wang & Baggerly." Nature Medicine 13(11): 1277-8.
Ratliff, K. A. & Oishi, S. (2013). "Gender Differences in Implicit Self-Esteem Following a Romantic Partner's Success or Failure." Journal of Personality and Social Psychology 105(4): 688-702.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht: D. Reidel.
Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
Savage, L. J. (1964). "The Foundations of Statistics Reconsidered." In Studies in Subjective Probability, edited by H. Kyburg & H. Smokler, 173-188. New York: John Wiley & Sons.
Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
Trafimow, D. & Marks, M. (2015). "Editorial." Basic and Applied Social Psychology 37(1): 1-2.
Wagenmakers, E.-J. (2007). "A Practical Solution to the Pervasive Problems of P Values." Psychonomic Bulletin & Review 14(5): 779-804.