Excursions into the garden of the forking paths

EXCURSIONS INTO THE GARDEN OF THE FORKING PATHS
P-VALUE FETISHISATION, REPLICATION CRISIS, AND THE TENSION BETWEEN INNOVATION AND CONFIRMATION
http://bit.ly/helmholtzdirnagl

Personal motivation:
Decades of futile translational stroke research
• Millions of animals killed
• Hundreds (thousands?) of neutral or negative clinical trials
• Thousands of researchers and clinicians globally
• Many billions spent on preclinical research?

Take home I:
The garden of the forking paths
http://bit.ly/2q2gtXq
http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
http://bit.ly/2JzblTR

Take home II
No scientific progress without reproducibility failures
To boldly go where no man…
The Good: essential non-reproducibility (Kuhn)
• Exploration at low base rate
• Innovation
• ‘Paradigm shift’
The Bad: detrimental non-reproducibility (Popper)
• Incompetence
• Bad designs
• Tacit knowledge (bad reporting)
• Low validity (bias)
• Misconduct

Take home III
Confirmation – weeding out the false positives of exploration
Jonathan Kimmelman
PLoS Biol. (2014) 12:e1001863.

>6000 cit.
PLoS Med. 2005;2:e124

Modified after Gary Larson
Bias: Subjective reality informed by one's preferences

Macleod MR, et al. (2015) Risk of Bias in Reports of In Vivo Research:
A Focus for Improvement. PLoS Biol 13: e1002273.
Low prevalence of methods to prevent bias

[Figure: standardised effect size by blinding status. Panels: Alzheimer's disease models (blinded conduct of experiment; blinded assessment of outcome) and stroke models (NXY-059; blinded assessment of behavioural outcome, No vs Yes). Y-axes: improvement in behavioural outcome and reduction in infarct size (standardised effect size, 0.0 to 1.2). > 30 studies, > 500 animals.]
Bias inflates effect sizes

PLoS Biol. 2016;14:e1002331
Effects of attrition in experimental
biomedical research

PLoS Biol. 2016;14:e1002331
Bias produces false positives and inflates
effect sizes

Overall median power of 730 primary neuroscience studies: 21 %
Power failure in neuroscience!

Not knowing what is false and what is not, the researcher sees 95 hypotheses as true, 45 of which are not.
α = 0.05; β = 0.5
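The arithmetic behind these numbers can be sketched in a few lines. The slide gives only α and β. The pool of 1000 tested hypotheses and the 10 % base rate of true hypotheses are assumptions, chosen here because they reproduce the 95 and 45 quoted above.

```python
def screen_hypotheses(n_hypotheses=1000, prior_true=0.10, alpha=0.05, beta=0.5):
    """Expected outcome of testing a pool of hypotheses.

    n_hypotheses and prior_true are illustrative assumptions;
    alpha and beta are the values given on the slide.
    """
    n_true = n_hypotheses * prior_true
    n_false = n_hypotheses - n_true
    true_positives = n_true * (1 - beta)   # true effects that are detected
    false_positives = n_false * alpha      # null effects that reach p <= alpha
    positives = true_positives + false_positives
    return {
        "positives": positives,
        "false_positives": false_positives,
        "ppv": true_positives / positives,  # share of positives that are real
    }


result = screen_hypotheses()
print(f"seen as true: {result['positives']:.0f}")
print(f"of which false: {result['false_positives']:.0f}")
print(f"positive predictive value: {result['ppv']:.2f}")
```

Lowering `prior_true` (more daring exploration) or raising `beta` (less power) pushes the positive predictive value down further.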

Mean group size n ≈ 8
Mean statistical power ≈ 45 %
False positive rate (p ≤ 0.05): ≈ 50 %
Overestimation of true effects: ≈ 50 %
“Low sample size bias“ leads to false
positives and effect size inflation
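Effect size inflation at low power can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the data behind the slide: a true standardised effect of 0.9 (hypothetical, chosen so that power at n = 8 per group lands near 45 %), unit variance in both groups, and a two-sided z-test with known variance in place of a t-test.

```python
import random
from statistics import NormalDist


def simulate_inflation(true_d=0.9, n=8, alpha=0.05, runs=20000, seed=1):
    """Return (empirical power, mean estimated effect among significant runs).

    true_d, runs and seed are illustrative assumptions; n = 8 follows the slide.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n) ** 0.5  # standard error of the mean difference when sd = 1
    significant = []
    for _ in range(runs):
        treated = sum(rng.gauss(true_d, 1) for _ in range(n)) / n
        control = sum(rng.gauss(0, 1) for _ in range(n)) / n
        diff = treated - control
        if abs(diff) / se > z_crit:
            significant.append(diff)  # only these would typically be reported
    power = len(significant) / runs
    mean_significant = sum(significant) / len(significant)
    return power, mean_significant


power, mean_sig = simulate_inflation()
print(f"empirical power: {power:.2f}")
print(f"true effect: 0.90, mean effect among significant results: {mean_sig:.2f}")
```

Because only estimates that clear the significance threshold are kept, their average sits well above the true effect. The lower the power, the larger this gap.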

Beyond bias: HARKing (hypothesizing after the results are known)

http://xkcd.com/882/
Beyond bias: p-hacking
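The linked xkcd comic tests 20 jelly bean colours and reports the one that reaches p < 0.05. A minimal sketch of why this goes wrong, assuming the tests are independent and every null hypothesis is true:

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 20):
    print(f"{k:2d} tests: P(at least one p <= 0.05) = {familywise_error_rate(k):.2f}")
```

Each additional analysis path (another subgroup, endpoint, or exclusion rule) behaves like another test in this calculation, even if only one path is ever reported.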

In exploratory investigation, researchers should aim at generating robust pathophysiological theories of disease.
In confirmatory investigation, researchers should aim at demonstrating strong and reproducible treatment effects in relevant animal models.
Currently we often see a mix-up of both modes. This prevents us from tailoring our study designs accordingly.
Exploration vs Confirmation

                                                          Exploratory  Confirmatory
Hypothesis                                                (+)          +++
Establish pathophysiology                                 +++          (+)
Sequence and details of experiments established at onset  (+)          +++
Primary endpoint                                          -            ++
Sample size calculation                                   (+)          +++
Blinding                                                  +++          +++
Randomization                                             +++          +++
External validity (aging, comorbidities, etc.)            -            ++
In/Exclusion criteria                                     ++           +++
Test statistics                                           +            +++
Preregistration                                           (-)          +++
Sensitivity (Type II error): find what might work         ++           +
Specificity (Type I error): weed out false positives      +            +++
Stroke 2016; 47:2148-2153

PLoS Biol. (2014) 12:e1001863.
Exploration vs Confirmation

Katharina Fritsch
https://www.timeshighereducation.com/
Replication (crisis?)

‘... non-reproducible single occurrences are of no significance to science ...’
‘We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.’
Sir Karl Popper (1902-1994), The Logic of Scientific Discovery (1934)

The lexicon of reproducibility
• Methods reproducibility: Same data, same tools, same results? Adds no additional evidence!
• Results reproducibility (aka “replication”): Technically competent repetition, i.e. a new study. Could be strict (identical conditions) or conceptual (altered conditions: does the causal claim extend to previously unsampled settings?).
• Inferential reproducibility: Same conclusions from study replication or re-analysis? Not all scientists come to the same conclusions from the same results, and they may make different analytic choices. What is concluded or recommended from a study is often the only thing that matters!
Adapted from Goodman et al. Sci Transl Med. 2016;8:341ps12.

What do we mean by 'reproducible'?
• Significance and P values: evaluating the replication effect against the null hypothesis of no effect.
• Evaluating the replication effect against the original effect size: is the original effect size within the 95% CI of the effect size estimate from the replication? Alternatively: comparing original and replication effect sizes.
• Meta-analysis combining original and replication effects: combining original and replication effect sizes for cumulative evidence.
• Subjective assessment of “Did it replicate?”
From the Open Science Collaboration, Psychology Replication, Science. 2015;349(6251):aac4716
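The first three criteria can give different answers for the same pair of studies. The sketch below applies them to made-up numbers (the effect sizes and standard errors are hypothetical, not taken from any study), using a normal approximation for the confidence interval and inverse-variance weighting for a fixed-effect meta-analysis.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95 % critical value


def ci95(estimate, se):
    """Normal-approximation 95 % confidence interval."""
    return estimate - Z95 * se, estimate + Z95 * se


def significant_vs_null(estimate, se):
    """Criterion 1: is the effect different from zero at alpha = 0.05?"""
    return abs(estimate / se) > Z95


def original_within_replication_ci(original, replication, replication_se):
    """Criterion 2: does the replication's 95 % CI contain the original effect?"""
    low, high = ci95(replication, replication_se)
    return low <= original <= high


def fixed_effect_meta(estimates, ses):
    """Criterion 3: inverse-variance weighted (fixed-effect) pooled estimate."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled, pooled_se


# Hypothetical original and replication results (illustration only)
orig_d, orig_se = 0.60, 0.20
rep_d, rep_se = 0.25, 0.12

print("replication significant vs null:", significant_vs_null(rep_d, rep_se))
print("original inside replication CI:",
      original_within_replication_ci(orig_d, rep_d, rep_se))
pooled, pooled_se = fixed_effect_meta([orig_d, rep_d], [orig_se, rep_se])
print(f"pooled effect: {pooled:.2f} (SE {pooled_se:.2f})")
```

With these numbers the replication "succeeds" by the first criterion and "fails" by the second, which is one way to see why a yes/no verdict on replication is too coarse.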

A false dichotomy
Replication Non-Replication

The emptiness of failed replication (?)
Mitchell J (2014) On the evidentiary emptiness of failed replications
http://jasonmitchell.fas.harvard.edu/Papers/Mitchell_failed_science_2014.pdf

The emptiness of failed replication
Does a failure to replicate mean that the original
result was a false positive? Or was the failed
replication a false negative?
Does successful replication mean that the original
result was correct? Or are both results false positives?

Hidden moderators - Contextual
sensitivity – Tacit knowledge
‚We analyzed 100 replication attempts in psychology and found that the
extent to which the research topic was likely to be contextually sensitive
(varying in time, culture, or location) was associated with replication
success. This relationship remained a significant predictor of replication
success even after adjusting for characteristics of the original and
replication studies that previously had been associated with replication
success (e.g., effect size, statistical power).‘
Proc Natl Acad Sci. 2016;113:6454-9.

"Standardization fallacy":
Low external validity, poor reproducibility
Nat Methods. 2009;6:257-61.
Trends Pharmacol Sci. 2016;37:509-10

p = 0.049 (p < α = 0.05)
Assume that the experimental result is correct, i.e. the measured difference equals the (unknown) treatment effect.
Repeat the experiment under identical conditions (i.e. 'strict replication').
What is the probability of reproducing the significant finding?
50 %! The true effect sits almost exactly on the significance threshold, so a repeat experiment is about as likely to land below it as above it.
How likely is strict replication?
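The 50 % figure can be checked with a short calculation. This is a sketch that assumes a two-sided z-test with known variance, the same sample size in the replication, and a true effect equal to the originally observed one. It counts only significant results in the original direction.

```python
from statistics import NormalDist


def replication_probability(p_original, alpha=0.05):
    """Probability that a strict replication is again significant in the same
    direction, if the true effect equals the originally observed effect."""
    norm = NormalDist()
    z_observed = norm.inv_cdf(1 - p_original / 2)  # z-score behind the original p
    z_critical = norm.inv_cdf(1 - alpha / 2)
    # The replication z-score is distributed as Normal(z_observed, 1)
    return 1 - norm.cdf(z_critical - z_observed)


for p in (0.049, 0.01, 0.001):
    print(f"original p = {p}: P(replication significant) = "
          f"{replication_probability(p):.2f}")
```

Only original results far below the threshold give a strict replication a high chance of reaching significance again.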

Replication failure as an indicator of
cutting edge research?
Dirnagl (2017) How likely are your hypotheses, really?
https://dirnagl.com/2017/04/13/how-original-are-your-scientific-hypotheses-really/

The garden of the forking paths
http://bit.ly/2q2gtXq
http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
http://bit.ly/2JzblTR

Fig. 6y demonstrates…
Brandt et al. Cell Metabolism 27, 2018, 118-135.e8

Resolving the tension:
Discovery & Replication
Suggested reading:
Wagenmakers EJ, Dutilh G, Sarafoglou A.
Perspect Psychol Sci. 2018 Jul;13(4):418-427
Chang and Eng Bunker circa 1865. Photo: Hulton/Getty

No scientific progress without
nonreproducibility
To boldly go where no man…
The Good: essential non-reproducibility (Kuhn)
• Exploration at low base rate
• Innovation
• ‘Paradigm shift’
The Bad: detrimental non-reproducibility (Popper)
• Incompetence
• Bad designs
• Tacit knowledge (bad reporting)
• Low validity (bias)
• Misconduct

Reduce Bias!
• Use blinding, randomization, in/exclusion criteria.
• Report results according to guidelines (e.g. ARRIVE).
Increase Power!
• Check your power. Achieve at least 80%.
• Do a priori sample size calculations.
• You probably need to increase your group sizes (n).
• Replicate.
Use statistics sensibly!
• P-values do not provide evidence regarding a model or hypothesis.
• Test statistics are overrated (and overused) in exploration.
• Think biological significance, think effect size.
• Replicate.
Practice Open Science
• Preregister.
• Publish NULL results.
• Make the original data available.
Don’t get lost in the garden of the forking paths
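For the a priori sample size calculation recommended above, a rough sketch for a two-group comparison follows. It uses the normal approximation n = 2 (z_{1-α/2} + z_{power})² / d² per group, which slightly underestimates the n a t-test needs at small group sizes. The effect sizes in the loop are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate group size for a two-sided, two-group comparison
    (normal approximation; effect_size_d is the standardised difference)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (1.5, 1.0, 0.5):
    print(f"d = {d}: about {n_per_group(d)} animals per group for 80 % power")
```

Dedicated power-analysis software gives exact values for a specific design, but even this approximation shows that a group size of around 8 reaches 80 % power only for very large effects.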

https://dirnagl.com/2018/05/16/can-non-replication-be-a-sin/
https://dirnagl.com/2017/04/13/how-original-are-your-scientific-hypotheses-really/
http://bit.ly/helmholtzdirnagl
@dirnagl
