Critically Interpreting Statistics
Statistics for Experimental Research
Simon Columbus
simon@simoncolumbus.com
Lecture on critically interpreting statistics for PPLE College, University of Amsterdam. License: CC-BY-SA 4.0, except embedded non-free illustrations. Sources are linked.

2. Today
I. Interpreting statistical results
II. Evaluating experimental evidence
   a. Not all published results are true
   b. Making sense of the mess
III. Doing good and open research
   I. Honest methods
   II. Open research
3. INTERPRETING STATISTICAL RESULTS
4. An Example: Gender Bias
          Applications   Successful
Male              1635        17.7%
Female            1188        14.9%
Van der Lee & Ellemers, 2015
5. Simpson’s Paradox
Kievit et al., 2013; Bickel, Hammel, & O’Connell, 1975
6. Simpson’s Paradox
Kievit et al., 2013; Bickel, Hammel, & O’Connell, 1975
7. Simpson’s Paradox
• Women apply more in some disciplines than in others
• Disciplines with more female applicants have lower success rates
• Per discipline, women are no less successful than men (see the sketch below)
• But: gender bias may lie elsewhere
– Why are “female” disciplines less successful?
Van der Lee & Ellemers, 2015; Albers, 2015
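A minimal sketch of the arithmetic behind the paradox. The per-discipline numbers are invented for illustration; only the overall pattern (not the figures) follows Van der Lee & Ellemers, 2015. Within each discipline women are awarded grants at least as often as men, yet the pooled rates favour men.
```python
# Simpson's paradox with hypothetical (invented) numbers:
# (applications, awards) per discipline and gender.
data = {
    "social sciences": {"male": (200, 30),  "female": (600, 95)},
    "physics":         {"male": (800, 170), "female": (200, 44)},
}

def rate(apps, awards):
    return awards / apps

# Per discipline, women do at least as well as men ...
for disc, g in data.items():
    print(disc,
          f"male {rate(*g['male']):.1%}",
          f"female {rate(*g['female']):.1%}")

# ... but pooling flips the direction, because women apply more often
# in the discipline with the lower overall success rate.
for sex in ("male", "female"):
    apps = sum(g[sex][0] for g in data.values())
    awards = sum(g[sex][1] for g in data.values())
    print("overall", sex, f"{awards / apps:.1%}")
```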
8. A Matter of Life and Death
• The case of Lucia de B.
– Dutch nurse
– Suspected of murdering up to 9 patients
– Sentenced in 2004 for five murders and two attempts
• Statistical evidence
– Probability of that many chance deaths during one nurse’s shifts
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
9. p = .000000003
10. Interpreting statistical results
                     Shifts with incident   Shifts without incident   Total
Lucia on shift                          9                       133     142
Lucia not on shift                      0                       887     887
Total                                   9                      1020    1029
• Data accuracy
– Actually, 5 incidents occurred during Lucia’s shifts and 2 at other times
• Confounding variables
– Deaths cluster in time
– Stratify by day to account for time
http://www.kennislink.nl/publicaties/toch-statistiek-in-de-zaak-lucia-de-b; Gill, Groeneboom, & de Jong, 2010
11. p = .038
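As an illustration only, the raw 2×2 table from slide 10 can be fed to Fisher's exact test. The resulting p-value is tiny, but it answers the wrong question: it ignores the miscounted incidents and the clustering of deaths in time. Neither the court's p = .000000003 nor the corrected p = .038 comes from this naive calculation.
```python
# Naive Fisher's exact test on the uncorrected 2x2 table (illustration only).
from scipy.stats import fisher_exact

table = [[9, 133],   # Lucia on shift:      incidents, no incident
         [0, 887]]   # Lucia not on shift:  incidents, no incident
_, p = fisher_exact(table, alternative="greater")
print(p)  # extremely small -- yet the model ignores data errors and
          # confounds such as deaths clustering in time (stratify by day)
```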
12. First Conclusion
• Watch out for confounding variables
• Be wary of Simpson’s paradox
– Stratify data
• Statistics may not be “wrong”, but they may answer the wrong question
13. EVALUATING EXPERIMENTAL EVIDENCE, PART I: PROBLEMS
14. Ioannidis, 2005
15. Ioannidis, 2005; McNeil, 2011
16. Bem, 2011; Wagenmakers et al., 2011
17. Questionable Research Practices
• File drawer problem
– Publication bias
• HARK-ing
– “Hypothesising after results are known”
• Optional stopping
– Collecting more data until a result is significant
“I confess, Oprah… I was doping when writing my international publications…” (Image: KU Leuven)
18. The File Drawer
• Negative results often are not published
– E.g., Bem (2011) did not report all measured variables
• Suppresses evidence against the effect
Rosenthal, 1979; Bakker, van Dijk, & Wicherts, 2012; Franco, Malhotra, & Simonovits, 2014
19. The File Drawer: Political Science
• Time-sharing Experiments in the Social Sciences
– Political science studies run by the American NSF
                Never published   Written, unpublished   Published   % Published
Null result                  31                      7          11            22
Mixed result                 10                     32          43            51
Strong result                 4                     31          57            62
Franco, Malhotra, & Simonovits, 2014
20. HARK-ing
• Researchers may run studies, but only come up with hypotheses after looking at the data
– “Fishing” for p-values
– Many small, underpowered studies
– Inflates the type I error rate
– Confuses exploratory and confirmatory research
Kerr, 1998
21. Recap: Type I Error Rate
• The type I error rate is set by α
– α = .05: a 1 in 20 chance to falsely reject the null
• Multiple testing inflates the type I error rate
– 1 in 20 chance of a false positive per test
– With two tests, the chance of a false positive is almost 1 in 10, etc. (see the calculation below)
• Questionable research practices increase the chance of false-positive findings
http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
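The “almost 1 in 10” figure is the familywise error rate for k tests, 1 − (1 − α)^k, assuming the tests are independent:
```python
# Familywise type I error rate for k independent tests at alpha = .05
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    print(k, round(1 - (1 - alpha) ** k, 3))
# k = 2  -> 0.098  (almost 1 in 10)
# k = 20 -> 0.642
```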
22. HARK-ing: The Baby Factory
• “When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. […] ‘You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.’”
Peterson, 2016
23. Why HARK-ing?
• Small studies often have low power
– Even if an effect exists, the chance of detecting it is low
– Average power in social psychology is around 50%
– I.e., even if there is an effect, every second study will fail to find it
• Gaining more power is expensive
– Power does not increase linearly with sample size (see the sketch below)
Cohen, 1962; Rossi, 1990; http://rpsychologist.com/d3/NHST/
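A rough Monte Carlo sketch of the power problem. The medium effect size (d = 0.5) and the group sizes are illustrative assumptions, not figures from the lecture.
```python
# Simulated power of a two-sample t-test for a medium effect (d = 0.5).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def power(n_per_group, d=0.5, alpha=0.05, n_sim=5000):
    hits = sum(
        ttest_ind(rng.normal(0, 1, n_per_group),
                  rng.normal(d, 1, n_per_group)).pvalue < alpha
        for _ in range(n_sim)
    )
    return hits / n_sim

for n in (30, 60, 120):
    print(n, power(n))
# roughly 0.47, 0.77, 0.97: doubling the sample does not double power,
# and at n = 30 per group about every second study misses a real effect
```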
24. Optional Stopping
• Continue collecting data until a significant result is obtained
– n = 30, test, not significant; n = 35, test, not significant; n = 40, test, significant, stop (simulated below)
– Inflates the type I error rate
– Can be done correctly (Schönbrodt, 2016)
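The stopping rule on this slide is easy to simulate under the null hypothesis. A sketch, with both groups drawn from the same distribution, so every “significant” result is a false positive:
```python
# Optional stopping under the null: test at n = 30, 35, 40 per group and
# stop at the first p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
looks, n_sim = (30, 35, 40), 20000

false_pos = 0
for _ in range(n_sim):
    a = rng.normal(size=max(looks))
    b = rng.normal(size=max(looks))
    if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
        false_pos += 1

print(false_pos / n_sim)  # noticeably above the nominal .05 despite a true null
```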
25. Optional Stopping: The Baby Factory
“Rather than waiting for the results from a set number of infants, experimenters began ‘eyeballing’ the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. […] When the preliminary data looked good, the test continued. […] But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.”
Peterson, 2016
26. Measuring Reproducibility
• Reproducibility Project: Psychology
– 100 papers published in 3 journals in 2008
– One result from each paper replicated once
– High-powered replications
Open Science Collaboration, 2015
27. Measuring Reproducibility
About 40 out of 100 results were successfully replicated.
Open Science Collaboration, 2015
28. Measuring Reproducibility
Replication project in behavioural economics
• 11 out of 18 studies were replicated successfully.
Camerer et al., 2016
29. Interpreting (Non-)Replications
• A failure to replicate does not mean the effect is not real
– The replication may have low power
– Even with high power, non-replication is possible
– There may be differences between original and replication, e.g. cultural variation
• Reproducibility projects estimate the proportion of non-replicable findings
Open Science Collaboration, 2015
30. Second Conclusion
• Many (key) results are unreliable
– p-hacking and publication bias distort the scientific literature
  • File drawer
  • HARK-ing
  • Optional stopping
– Reproducibility projects indicate many false positives in psychology and economics
31. EVALUATING EXPERIMENTAL EVIDENCE, PART II: SOLUTIONS
32. Second Conclusion
• Many (key) results are unreliable
– p-hacking and publication bias distort the scientific literature
  • File drawer
  • HARK-ing
  • Optional stopping
– Reproducibility projects indicate many false positives in psychology and economics
• Science is becoming more open and honest
– Aggregation of results
– Incentives for replication
33. Meta-Analyses
• Statistically summarise results from many separate studies
– Combine evidence for and against an effect
Flore & Wicherts, 2015
34. Meta-Analyses
• Statistically summarise results from many separate studies
– Combine evidence for and against an effect (see the sketch below)
• Susceptible to publication bias
– Excessively high effect size estimates
– Can be detected with funnel plots
Flore & Wicherts, 2015
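For intuition, the core of a fixed-effect meta-analysis is a few lines of inverse-variance weighting. The effect sizes and standard errors below are invented:
```python
# Fixed-effect meta-analysis by inverse-variance weighting (invented data).
import math

studies = [(0.42, 0.21), (0.10, 0.12), (0.55, 0.30), (-0.05, 0.09), (0.31, 0.18)]
# each tuple: (effect size d, standard error)

weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
se_pooled = math.sqrt(1 / sum(weights))
print(f"pooled d = {pooled:.2f}, 95% CI +/- {1.96 * se_pooled:.2f}")

# A funnel plot of each study's d against its standard error should be
# roughly symmetric around the pooled estimate; asymmetry suggests
# publication bias.
```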
35. The File Drawer, Unlocked
• PsychFileDrawer
– Repository of unpublished replications
– Online repositories make publication easier
http://www.psychfiledrawer.org/chart.php?target_article=33
36. Registered Replication Reports
• Replication of specific effects
– Important effects
– Prior doubt about effects
• Independent replication
– Not involving the original authors
– Often involving multiple labs
• Pre-registered
– No publication bias
http://www.psychologicalscience.org/index.php/replication/ongoing-projects
37. Curate Science
http://curatescience.org/#sbh2008a
38. Students Can Contribute
• Student projects are particularly suitable for replication efforts
– Opportunity to learn research practices
– Contribute to the improvement of science
Grahe et al., 2012; Frank & Saxe, 2012; King, 2006
39. Third Conclusion
• Meta-analyses can summarise studies statistically
– File drawer problem
– Online repositories make publication of null results easier
• Registered replication reports
– Provide reliable estimates
– Eliminate publication bias
40. HONEST RESEARCH AND OPEN SCIENCE
41. Honest Research and Open Science
• Purely confirmatory research
– Pre-register all statistical analyses
– Only claim registered analyses as hypothesis tests
– Use strong statistical tests
– Openly share methods and data
Wagenmakers et al., 2012
42. Pre-registration
• Pre-register analyses
– How are data going to be collected?
– How many subjects are going to be recruited?
– When are outliers excluded?
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
43. Exclusion Rules
When one subject makes all the difference…
http://www.ted.com/talks/dan_ariely_beware_conflicts_of_interest
44. Pre-registration
• Pre-register analyses
– How are data going to be collected?
– How many subjects are going to be recruited?
– When are outliers excluded?
– What statistical techniques are going to be used?
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
45. Statistical Diversity
• 1 data set
– Do darker-skinned football players get more red cards?
– Four different leagues
• 29 teams of analysts
Silberzahn et al., 2015; http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508
46. Statistical Diversity
Silberzahn et al., 2015; http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508
47. Pre-registration
• Pre-register analyses
– How are data going to be collected?
– How many subjects are going to be recruited?
– When are outliers excluded?
– What statistical techniques are going to be used?
• Several platforms
https://www.socialscienceregistry.org/; http://egap.org/content/registration; http://ridie.3ieimpact.org/; http://osf.io
48. Registration and Exploration
• We need both exploratory and confirmatory research
– Pre-registration does not prevent exploratory research
– Exploratory and confirmatory analyses must be labelled as such
Tukey, 1980
49. Cross-validation
• So you’ve found a significant result…
– … through exploration
• Cross-validate analyses
– New data set
– Split data set (see the sketch below)
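A minimal sketch of the split-data-set route: explore freely on one half, then treat only the pre-chosen test on the held-out half as confirmatory. The data and variables are stand-ins:
```python
# Split-sample cross-validation of an exploratory finding (stand-in data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x, y = rng.normal(size=200), rng.normal(size=200)

idx = rng.permutation(len(x))
explore, confirm = idx[:100], idx[100:]

r_exp, p_exp = pearsonr(x[explore], y[explore])     # look around freely here
r_conf, p_conf = pearsonr(x[confirm], y[confirm])   # only this half is confirmatory
print(p_exp, p_conf)
```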
50. In Neuroscience: Double Dipping
• Using the same data twice
– First, to set the parameters of the analysis
– Second, to run the analysis
• Over-fitting the model
– The model is made to fit the data too closely
Kriegeskorte et al., 2009
51. Registration and Exploration
• We need both exploratory and confirmatory research
– Pre-registration does not prevent exploratory research
– Exploratory and confirmatory analyses must be labelled as such
Tukey, 1980
52. Publishing Pre-registered Research
• Badges
– Psychological Science
• Registered reports
– About two dozen journals in psychology, medicine, and politics
53. Pre-registration Works
• In medicine, pre-registration is mandatory
– When outcomes must be pre-registered, null results become more common
Kaplan & Irvin, 2015
54. Sharing Data
• Sharing data openly
– For re-analysis
– For meta-analysis
– For archiving
– For teaching
• Sharing materials openly
– For replication
http://re3data.org; http://osf.io; http://figshare.com
55. Why Sharing Data Matters
• Growth in a Time of Debt
– Key study used to justify austerity policies
– Re-analysed by a 28-year-old graduate student
– An Excel coding error had led to the significant results
Reinhart & Rogoff, 2010; Herndon, Ash, & Pollin, 2013
56. Fourth Conclusion
• Honest research
– Explicit hypotheses
– Pre-registered methods
– Separating exploratory and confirmatory analyses
• Open Science
– Detailed methods sections
– Open data sharing
57. Six Lessons for a Critical Reader
1. Consider methods, not just p-values
2. Be wary of small studies, even if there are many of them
3. Appreciate meta-analyses, but watch out for publication bias
4. Independent replication is key – and you can contribute
5. Value pre-registered analyses
6. Use open data
58. ENJOY YOUR BREAK!
