Lecture on critically interpreting statistics for PPLE College, University of Amsterdam. License: CC-BY-SA 4.0, except embedded non-free illustrations. Sources are linked.
Today
I. Interpreting statistical results
II. Evaluating experimental evidence
   a. Not all published results are true
   b. Making sense of the mess
III. Doing good and open research
   a. Honest methods
   b. Open research
Simpson's Paradox
• Women apply more in some disciplines than in others
• Disciplines with more female applicants have lower success rates
• Per discipline, women are no less successful than men (a toy example follows below)
• But: gender bias may lie elsewhere
  – Why are "female" disciplines less successful?
Van der Lee & Ellemers, 2015; Albers, 2015
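To make the paradox concrete, here is a minimal sketch in Python. The disciplines and counts are made up for illustration (they are not the NWO data from Van der Lee & Ellemers): within each discipline, women and men succeed at the same rate, yet the aggregate rate favours men because women apply more often in the low-success discipline.

```python
# Hypothetical grant data: (women awarded, women applied, men awarded, men applied).
applications = {
    "discipline A": (20, 200, 5, 50),   # 10% success rate for both sexes
    "discipline B": (18, 60, 60, 200),  # 30% success rate for both sexes
}

for name, (wa, wn, ma, mn) in applications.items():
    print(f"{name}: women {wa / wn:.0%}, men {ma / mn:.0%}")

# Aggregating over disciplines reverses the picture.
wa = sum(v[0] for v in applications.values())
wn = sum(v[1] for v in applications.values())
ma = sum(v[2] for v in applications.values())
mn = sum(v[3] for v in applications.values())
print(f"aggregate: women {wa / wn:.0%}, men {ma / mn:.0%}")  # ~15% vs ~26%
```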
A Matter of Life and Death
• The case of Lucia de B.
  – Dutch nurse
  – Suspected of murdering up to 9 patients
  – Sentenced in 2004 for five murders and two attempted murders
• Statistical evidence
  – The probability of that many chance deaths during one nurse's shifts
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
Interpreting Statistical Results

                     Shifts with incident   Shifts without incident   Total
Lucia on shift                9                      133               142
Lucia not on shift            0                      887               887
Total                         9                     1020              1029
• Data accuracy
  – Actually, 5 incidents occurred during Lucia's shifts and 2 at other times
• Confounding variables
  – Deaths cluster in time
  – Stratify by day to account for time (a naive, unstratified test is sketched below)
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
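For illustration, here is what a naive analysis of the uncorrected table might look like: a one-sided Fisher exact test (a sketch assuming scipy is installed; the court's actual calculation differed). The p-value is tiny, but it answers the wrong question: it takes the table at face value and ignores both the data errors and the clustering of deaths in time.

```python
from scipy.stats import fisher_exact

# The uncorrected 2x2 table: rows are Lucia on / not on shift,
# columns are shifts with / without an incident.
table = [[9, 133],
         [0, 887]]

# One-sided test: are incidents over-represented on Lucia's shifts?
_, p_value = fisher_exact(table, alternative="greater")
print(f"p = {p_value:.1e}")  # very small, yet misleading without stratification
```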
First Conclusion
• Watch out for confounding variables
• Be wary of Simpson's paradox
  – Stratify data
• Statistics may not be "wrong", but they may answer the wrong question
Questionable Research Practices
• File drawer problem
  – Publication bias
• HARKing
  – "Hypothesising after results are known"
• Optional stopping
  – Collecting more data until a result is significant

"I confess, Oprah… I was doping when writing my international publications…"
Image: KU Leuven
The File Drawer
• Negative results often are not published
  – E.g., Bem (2011) did not report all measured variables
• Suppresses evidence against the effect
Rosenthal, 1979; Bakker, van Dijk, & Wicherts, 2012; Franco, Malhotra, & Simonovits, 2014
The File Drawer: Political Science
• Time-sharing Experiments in the Social Sciences (TESS)
  – Political science studies run by the American NSF

                Never published   Written, unpublished   Published   % Published
Null result           31                   7                11           22
Mixed result          10                  32                43           51
Strong result          4                  31                57           62

Franco, Malhotra, & Simonovits, 2014
HARKing
• Researchers may run studies, but only come up with hypotheses after looking at the data
  – "Fishing" for p-values
  – Many small, under-powered studies
  – Inflates the type I error rate
  – Confuses exploratory and confirmatory research
Kerr, 1998
Recap: Type I Error Rate
• The type I error rate is set by α
  – α = .05: a 1 in 20 chance to falsely reject the null
• Multiple testing inflates the type I error rate (see the sketch below)
  – 1 in 20 chance of a false positive per test
  – With two tests, the chance of a false positive is almost 1 in 10, etc.
• Questionable research practices increase the chance of false positive findings
http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
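The inflation is easy to compute: with k independent tests at α = .05, the chance of at least one false positive is 1 − (1 − α)^k. A minimal sketch:

```python
# Family-wise error rate for k independent tests at alpha = .05.
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    print(f"{k:2d} tests: P(at least one false positive) = {1 - (1 - alpha) ** k:.3f}")
# Two tests already give .098, i.e. almost 1 in 10, as stated above.
```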
HARKing: The Baby Factory
• "When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. […] 'You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that's what we were looking for.'"
Peterson, 2016
Why HARKing?
• Small studies often have low power
  – Even if an effect exists, the chance to detect it is low
  – Average power in social psychology is around 50%
  – I.e., even if there is an effect, every second study will fail to find it
• Gaining more power is expensive
  – Power does not increase linearly with sample size (see the sketch below)
Cohen, 1962; Rossi, 1990; http://rpsychologist.com/d3/NHST/
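A minimal sketch of these diminishing returns, assuming statsmodels is installed; the medium effect size (d = 0.5) and the sample sizes are illustrative:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sample t-test for a medium effect (Cohen's d = 0.5).
analysis = TTestIndPower()
for n in (20, 40, 80, 160):
    power = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per group: power = {power:.2f}")
# Power rises steeply at first, then flattens: each doubling of the
# sample buys less additional power than the previous one.
```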
Optional Stopping
• Continue collecting data until a significant result is obtained
  – n = 30, test, not significant; n = 35, test, not significant; n = 40, test, significant, stop
  – Inflates the type I error rate (see the simulation below)
  – Can be done correctly (Schönbrodt, 2016)
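A small simulation of exactly this peeking scheme under a true null effect (a sketch assuming numpy and scipy; the seed and counts are arbitrary):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_sims, false_positives = 10_000, 0

for _ in range(n_sims):
    data = rng.normal(0, 1, 40)        # the null is true: population mean is 0
    for n in (30, 35, 40):             # peek at n = 30, 35, 40
        if ttest_1samp(data[:n], 0).pvalue < 0.05:
            false_positives += 1       # stop at the first significant result
            break

print(false_positives / n_sims)        # noticeably above the nominal .05
```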
Optional Stopping: The Baby Factory
"Rather than waiting for the results from a set number of infants, experimenters began 'eyeballing' the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. […] When the preliminary data looked good, the test continued. […] But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed."
Peterson, 2016
Measuring Reproducibility
• Reproducibility Project: Psychology
  – 100 papers published in 3 journals in 2008
  – One result from each paper replicated once
  – High-powered replications
Open Science Collaboration, 2015
Interpreting (Non-)Replications
• A failure to replicate does not mean the effect is not real
  – The replication may have low power
  – Even with high power, non-replication is possible
  – There may be differences between original and replication, e.g. cultural variation
• Reproducibility projects estimate the proportion of non-replicable findings
Open Science Collaboration, 2015
Second Conclusion
• Many (key) results are unreliable
  – p-hacking and publication bias distort the scientific literature
    • File drawer
    • HARKing
    • Optional stopping
  – Reproducibility projects indicate many false positives in psychology and economics
• Science is becoming more open and honest
  – Aggregation of results
  – Incentives for replication
Meta-Analyses
• Statistically summarise results from many separate studies (a pooling sketch follows below)
  – Combine evidence for and against an effect
• Susceptible to publication bias
  – Excessively high effect size estimates
  – Can be detected with funnel plots
Flore & Wicherts, 2015
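As a minimal sketch of the statistics involved, here is inverse-variance pooling under a fixed-effect model, with made-up effect sizes and standard errors (real meta-analyses, including Flore & Wicherts, use more elaborate models and inspect funnel plots for publication bias):

```python
import numpy as np

effects = np.array([0.42, 0.31, 0.55, 0.10, 0.38])  # illustrative study effects
ses     = np.array([0.20, 0.15, 0.25, 0.10, 0.18])  # their standard errors

weights   = 1 / ses**2                               # inverse-variance weights
pooled    = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled effect = {pooled:.2f} ± {1.96 * pooled_se:.2f}")
# A funnel plot would scatter each study's effect against its standard
# error; asymmetry suggests that small null studies went unpublished.
```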
The File Drawer, Unlocked
• PsychFileDrawer
  – Repository of unpublished replications
  – Online repositories make publication easier
http://www.psychfiledrawer.org/chart.php?target_article=33
Registered Replication Reports
• Replication of specific effects
  – Important effects
  – Prior doubt about effects
• Independent replication
  – Not involving the original authors
  – Often involving multiple labs
• Pre-registered
  – No publication bias
http://www.psychologicalscience.org/index.php/replication/ongoing-projects
Students Can Contribute
• Student projects are particularly suitable for replication efforts
  – An opportunity to learn research practices
  – A contribution to the improvement of science
Grahe et al., 2012; Frank & Saxe, 2012; King, 2006
Third Conclusion
• Meta-analyses can summarise studies statistically
  – But: the file drawer problem persists
  – Online repositories make publication of null results easier
• Registered replication reports
  – Provide reliable estimates
  – Eliminate publication bias
Honest Research and Open Science
• Purely confirmatory research
  – Pre-register all statistical analyses
  – Only claim registered analyses as hypothesis tests
  – Use strong statistical tests
  – Openly share methods and data
Wagenmakers et al., 2012
Pre-registration
• Pre-register analyses
  – How are data going to be collected?
  – How many subjects are going to be recruited?
  – When are outliers excluded?
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
Exclusion Rules
• When one subject makes all the difference…
http://www.ted.com/talks/dan_ariely_beware_conflicts_of_interest
Pre-registration (cont.)
• Pre-register analyses
  – What statistical techniques are going to be used?
Statistical Diversity
• 1 data set
  – Do darker-skinned football players get more red cards?
  – Four different leagues
• 29 teams of analysts
  – Analytic choices, and the resulting conclusions, varied widely across teams
Silberzahn et al., 2015; http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508
Pre-registration (cont.)
• Several platforms support pre-registration
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
Registration and Exploration
• We need both exploratory and confirmatory research
  – Pre-registration does not prevent exploratory research
  – Exploratory and confirmatory analyses must be labelled as such
Tukey, 1980
Cross-validation
• So you've found a significant result…
  – … through exploration
• Cross-validate analyses (see the sketch below)
  – On a new data set
  – Or by splitting the data set
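A minimal sketch of the split-data variant with simulated data (the names, seed, and effect are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.2 * x + rng.normal(size=200)     # a weak true relationship

# Split once, before any analysis.
x_explore, x_confirm = x[:100], x[100:]
y_explore, y_confirm = y[:100], y[100:]

# Exploration half: look around freely and form a hypothesis.
r_explore, _ = pearsonr(x_explore, y_explore)

# Confirmation half: one pre-specified test of that hypothesis.
r_confirm, p_confirm = pearsonr(x_confirm, y_confirm)
print(f"explore r = {r_explore:.2f}; confirm r = {r_confirm:.2f} (p = {p_confirm:.3f})")
```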
In Neuroscience: Double Dipping
• Using the same data twice
  – First, to set the parameters of the analysis
  – Second, to run the analysis
• Over-fitting the model (see the sketch below)
  – Makes the model fit the data too closely
Kriegeskorte et al., 2009
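A minimal sketch of the problem: every variable below is pure noise, yet selecting the "best" one and testing it on the same data yields a biased, too-small p-value; selecting on one half and testing on the held-out half does not (the setup and numbers are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
data = rng.normal(0, 1, size=(50, 100))  # 50 subjects, 100 pure-noise variables

# Double dipping: select the strongest variable, then test it on the same data.
best = np.argmax(data.mean(axis=0))
print("same data:", ttest_1samp(data[:, best], 0).pvalue)   # biased downward

# Correct: select on the first half, test on the held-out second half.
best = np.argmax(data[:25].mean(axis=0))
print("held out: ", ttest_1samp(data[25:, best], 0).pvalue)  # unbiased
```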
Publishing Pre-registered Research
• Badges
  – E.g., Psychological Science
• Registered reports
  – About two dozen journals in psychology, medicine, and political science
Pre-registration Works
• In medicine, pre-registration is mandatory
  – When outcomes must be pre-registered, null results become more common
Kaplan & Irvin, 2015
Sharing Data
• Sharing data openly
  – For re-analysis
  – For meta-analysis
  – For archiving
  – For teaching
• Sharing materials openly
  – For replication
http://re3data.org; http://osf.io; http://figshare.com
Why Sharing Data Matters
• "Growth in a Time of Debt"
  – Key study used to justify austerity policies
  – Re-analysed by a 28-year-old graduate student
  – An Excel coding error had led to the significant results
Reinhart & Rogoff, 2010; Herndon, Ash, & Pollin, 2013
Fourth Conclusion
• Honest research
  – Explicit hypotheses
  – Pre-registered methods
  – Separation of exploratory and confirmatory analyses
• Open science
  – Detailed methods sections
  – Open data sharing
Six Lessons for a Critical Reader
1. Consider methods, not just p-values
2. Be wary of small studies, even if they are many
3. Appreciate meta-analyses, but watch out for publication bias
4. Independent replication is key – and you can contribute
5. Value pre-registered analyses
6. Use open data