Lecture on critically interpreting statistics for PPLE College, University of Amsterdam. License: CC-BY-SA 4.0, except embedded non-free illustrations. Sources are linked.
Today
I. Interpreting statistical results
II. Evaluating experimental evidence
   a. Not all published results are true
   b. Making sense of the mess
III. Doing good and open research
   a. Honest methods
   b. Open research
Simpson's Paradox
• Women apply more in some disciplines than in others
• Disciplines with more female applicants have lower success rates
• Per discipline, women are no less successful than men (a toy example follows below)
• But: gender bias may lie elsewhere
  – Why are "female" disciplines less successful?
Van der Lee & Ellemers, 2015; Albers, 2015
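To make the paradox concrete, here is a minimal sketch in Python. The disciplines and counts are made up for illustration (they are not the NWO data from Van der Lee & Ellemers): within each discipline, women and men succeed at the same rate, yet the aggregate rate favours men because women apply more often in the low-success discipline.

```python
# Hypothetical grant data: (women awarded, women applied, men awarded, men applied).
applications = {
    "discipline A": (20, 200, 5, 50),   # 10% success rate for both sexes
    "discipline B": (18, 60, 60, 200),  # 30% success rate for both sexes
}

for name, (wa, wn, ma, mn) in applications.items():
    print(f"{name}: women {wa / wn:.0%}, men {ma / mn:.0%}")

# Aggregating over disciplines reverses the picture.
wa = sum(v[0] for v in applications.values())
wn = sum(v[1] for v in applications.values())
ma = sum(v[2] for v in applications.values())
mn = sum(v[3] for v in applications.values())
print(f"aggregate: women {wa / wn:.0%}, men {ma / mn:.0%}")  # ~15% vs ~26%
```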
A Matter of Life and Death
• The case of Lucia de B.
  – Dutch nurse
  – Suspected of murdering up to 9 patients
  – Sentenced in 2004 for five murders and two attempted murders
• Statistical evidence
  – The probability of that many chance deaths during one nurse's shifts
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
Interpreting Statistical Results

                     Shifts with incident   Shifts without incident   Total
Lucia on shift                9                      133               142
Lucia not on shift            0                      887               887
Total                         9                     1020              1029
• Data accuracy
  – Actually, 5 incidents occurred during Lucia's shifts and 2 at other times
• Confounding variables
  – Deaths cluster in time
  – Stratify by day to account for time (a naive, unstratified test is sketched below)
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
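For illustration, here is what a naive analysis of the uncorrected table might look like: a one-sided Fisher exact test (a sketch assuming scipy is installed; the court's actual calculation differed). The p-value is tiny, but it answers the wrong question: it takes the table at face value and ignores both the data errors and the clustering of deaths in time.

```python
from scipy.stats import fisher_exact

# The uncorrected 2x2 table: rows are Lucia on / not on shift,
# columns are shifts with / without an incident.
table = [[9, 133],
         [0, 887]]

# One-sided test: are incidents over-represented on Lucia's shifts?
_, p_value = fisher_exact(table, alternative="greater")
print(f"p = {p_value:.1e}")  # very small, yet misleading without stratification
```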
First Conclusion
• Watch out for confounding variables
• Be wary of Simpson's paradox
  – Stratify data
• Statistics may not be "wrong", but they may answer the wrong question
Questionable Research Practices
• File drawer problem
  – Publication bias
• HARKing
  – "Hypothesising after results are known"
• Optional stopping
  – Collecting more data until a result is significant

"I confess, Oprah… I was doping when writing my international publications…"
Image: KU Leuven
The File Drawer
• Negative results often are not published
  – E.g., Bem (2011) did not report all measured variables
• Suppresses evidence against the effect
Rosenthal, 1979; Bakker, van Dijk, & Wicherts, 2012; Franco, Malhotra, & Simonovits, 2014
The File Drawer: Political Science
• Time-sharing Experiments in the Social Sciences (TESS)
  – Political science studies run by the American NSF

                Never published   Written, unpublished   Published   % Published
Null result           31                   7                11           22
Mixed result          10                  32                43           51
Strong result          4                  31                57           62

Franco, Malhotra, & Simonovits, 2014
HARKing
• Researchers may run studies, but only come up with hypotheses after looking at the data
  – "Fishing" for p-values
  – Many small, under-powered studies
  – Inflates the type I error rate
  – Confuses exploratory and confirmatory research
Kerr, 1998
Recap: Type I Error Rate
• The type I error rate is set by α
  – α = .05: a 1 in 20 chance to falsely reject the null
• Multiple testing inflates the type I error rate (see the sketch below)
  – 1 in 20 chance of a false positive per test
  – With two tests, the chance of a false positive is almost 1 in 10, etc.
• Questionable research practices increase the chance of false positive findings
http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
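The inflation is easy to compute: with k independent tests at α = .05, the chance of at least one false positive is 1 − (1 − α)^k. A minimal sketch:

```python
# Family-wise error rate for k independent tests at alpha = .05.
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    print(f"{k:2d} tests: P(at least one false positive) = {1 - (1 - alpha) ** k:.3f}")
# Two tests already give .098, i.e. almost 1 in 10, as stated above.
```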
HARKing: The Baby Factory
• "When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. […] 'You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that's what we were looking for.'"
Peterson, 2016
Why HARKing?
• Small studies often have low power
  – Even if an effect exists, the chance to detect it is low
  – Average power in social psychology is around 50%
  – I.e., even if there is an effect, every second study will fail to find it
• Gaining more power is expensive
  – Power does not increase linearly with sample size (see the sketch below)
Cohen, 1962; Rossi, 1990; http://rpsychologist.com/d3/NHST/
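A minimal sketch of these diminishing returns, assuming statsmodels is installed; the medium effect size (d = 0.5) and the sample sizes are illustrative:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sample t-test for a medium effect (Cohen's d = 0.5).
analysis = TTestIndPower()
for n in (20, 40, 80, 160):
    power = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per group: power = {power:.2f}")
# Power rises steeply at first, then flattens: each doubling of the
# sample buys less additional power than the previous one.
```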
Optional Stopping
• Continue collecting data until a significant result is obtained
  – n = 30, test, not significant; n = 35, test, not significant; n = 40, test, significant, stop
  – Inflates the type I error rate (see the simulation below)
  – Can be done correctly (Schönbrodt, 2016)
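A small simulation of exactly this peeking scheme under a true null effect (a sketch assuming numpy and scipy; the seed and counts are arbitrary):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_sims, false_positives = 10_000, 0

for _ in range(n_sims):
    data = rng.normal(0, 1, 40)        # the null is true: population mean is 0
    for n in (30, 35, 40):             # peek at n = 30, 35, 40
        if ttest_1samp(data[:n], 0).pvalue < 0.05:
            false_positives += 1       # stop at the first significant result
            break

print(false_positives / n_sims)        # noticeably above the nominal .05
```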
Optional Stopping: The Baby Factory
"Rather than waiting for the results from a set number of infants, experimenters began 'eyeballing' the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. […] When the preliminary data looked good, the test continued. […] But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed."
Peterson, 2016
Measuring Reproducibility
• Reproducibility Project: Psychology
  – 100 papers published in 3 journals in 2008
  – One result from each paper replicated once
  – High-powered replications
Open Science Collaboration, 2015
Interpreting (Non-)Replications
• A failure to replicate does not mean the effect is not real
  – The replication may have low power
  – Even with high power, non-replication is possible
  – There may be differences between original and replication, e.g. cultural variation
• Reproducibility projects estimate the proportion of non-replicable findings
Open Science Collaboration, 2015
Second Conclusion
• Many (key) results are unreliable
  – p-hacking and publication bias distort the scientific literature
    • File drawer
    • HARKing
    • Optional stopping
  – Reproducibility projects indicate many false positives in psychology and economics
• Science is becoming more open and honest
  – Aggregation of results
  – Incentives for replication
Meta-Analyses
• Statistically summarise results from many separate studies (a pooling sketch follows below)
  – Combine evidence for and against an effect
• Susceptible to publication bias
  – Excessively high effect size estimates
  – Can be detected with funnel plots
Flore & Wicherts, 2015
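As a minimal sketch of the statistics involved, here is inverse-variance pooling under a fixed-effect model, with made-up effect sizes and standard errors (real meta-analyses, including Flore & Wicherts, use more elaborate models and inspect funnel plots for publication bias):

```python
import numpy as np

effects = np.array([0.42, 0.31, 0.55, 0.10, 0.38])  # illustrative study effects
ses     = np.array([0.20, 0.15, 0.25, 0.10, 0.18])  # their standard errors

weights   = 1 / ses**2                               # inverse-variance weights
pooled    = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled effect = {pooled:.2f} ± {1.96 * pooled_se:.2f}")
# A funnel plot would scatter each study's effect against its standard
# error; asymmetry suggests that small null studies went unpublished.
```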
The File Drawer, Unlocked
• PsychFileDrawer
  – Repository of unpublished replications
  – Online repositories make publication easier
http://www.psychfiledrawer.org/chart.php?target_article=33
Registered Replication Reports
• Replication of specific effects
  – Important effects
  – Prior doubt about effects
• Independent replication
  – Not involving the original authors
  – Often involving multiple labs
• Pre-registered
  – No publication bias
http://www.psychologicalscience.org/index.php/replication/ongoing-projects
Students Can Contribute
• Student projects are particularly suitable for replication efforts
  – An opportunity to learn research practices
  – A contribution to the improvement of science
Grahe et al., 2012; Frank & Saxe, 2012; King, 2006
Third Conclusion
• Meta-analyses can summarise studies statistically
  – But: the file drawer problem persists
  – Online repositories make publication of null results easier
• Registered replication reports
  – Provide reliable estimates
  – Eliminate publication bias
Honest Research and Open Science
• Purely confirmatory research
  – Pre-register all statistical analyses
  – Only claim registered analyses as hypothesis tests
  – Use strong statistical tests
  – Openly share methods and data
Wagenmakers et al., 2012
Pre-registration
• Pre-register analyses
  – How are data going to be collected?
  – How many subjects are going to be recruited?
  – When are outliers excluded?
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
Exclusion Rules
• When one subject makes all the difference…
http://www.ted.com/talks/dan_ariely_beware_conflicts_of_interest
Pre-registration (cont.)
• Pre-register analyses
  – What statistical techniques are going to be used?
Statistical Diversity
• 1 data set
  – Do darker-skinned football players get more red cards?
  – Four different leagues
• 29 teams of analysts
  – Analytic choices, and the resulting conclusions, varied widely across teams
Silberzahn et al., 2015; http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508
Pre-registration (cont.)
• Several platforms support pre-registration
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
Registration and Exploration
• We need both exploratory and confirmatory research
  – Pre-registration does not prevent exploratory research
  – Exploratory and confirmatory analyses must be labelled as such
Tukey, 1980
Cross-validation
• So you've found a significant result…
  – … through exploration
• Cross-validate analyses (see the sketch below)
  – On a new data set
  – Or by splitting the data set
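A minimal sketch of the split-data variant with simulated data (the names, seed, and effect are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.2 * x + rng.normal(size=200)     # a weak true relationship

# Split once, before any analysis.
x_explore, x_confirm = x[:100], x[100:]
y_explore, y_confirm = y[:100], y[100:]

# Exploration half: look around freely and form a hypothesis.
r_explore, _ = pearsonr(x_explore, y_explore)

# Confirmation half: one pre-specified test of that hypothesis.
r_confirm, p_confirm = pearsonr(x_confirm, y_confirm)
print(f"explore r = {r_explore:.2f}; confirm r = {r_confirm:.2f} (p = {p_confirm:.3f})")
```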
In Neuroscience: Double Dipping
• Using the same data twice
  – First, to set the parameters of the analysis
  – Second, to run the analysis
• Over-fitting the model (see the sketch below)
  – Makes the model fit the data too closely
Kriegeskorte et al., 2009
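A minimal sketch of the problem: every variable below is pure noise, yet selecting the "best" one and testing it on the same data yields a biased, too-small p-value; selecting on one half and testing on the held-out half does not (the setup and numbers are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
data = rng.normal(0, 1, size=(50, 100))  # 50 subjects, 100 pure-noise variables

# Double dipping: select the strongest variable, then test it on the same data.
best = np.argmax(data.mean(axis=0))
print("same data:", ttest_1samp(data[:, best], 0).pvalue)   # biased downward

# Correct: select on the first half, test on the held-out second half.
best = np.argmax(data[:25].mean(axis=0))
print("held out: ", ttest_1samp(data[25:, best], 0).pvalue)  # unbiased
```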
Publishing Pre-registered Research
• Badges
  – E.g., Psychological Science
• Registered reports
  – About two dozen journals in psychology, medicine, and political science
Pre-registration Works
• In medicine, pre-registration is mandatory
  – When outcomes must be pre-registered, null results become more common
Kaplan & Irvin, 2015
Sharing Data
• Sharing data openly
  – For re-analysis
  – For meta-analysis
  – For archiving
  – For teaching
• Sharing materials openly
  – For replication
http://re3data.org; http://osf.io; http://figshare.com
Why Sharing Data Matters
• "Growth in a Time of Debt"
  – Key study used to justify austerity policies
  – Re-analysed by a 28-year-old graduate student
  – An Excel coding error had led to the significant results
Reinhart & Rogoff, 2010; Herndon, Ash, & Pollin, 2013
Fourth Conclusion
• Honest research
  – Explicit hypotheses
  – Pre-registered methods
  – Separation of exploratory and confirmatory analyses
• Open science
  – Detailed methods sections
  – Open data sharing
Six Lessons for a Critical Reader
1. Consider methods, not just p-values
2. Be wary of small studies, even if they are many
3. Appreciate meta-analyses, but watch out for publication bias
4. Independent replication is key – and you can contribute
5. Value pre-registered analyses
6. Use open data