14. Open data, open exploration
[Chart: years of life (from birth), by year, for males and females, split into years of healthy life and years of unhealthy life]
16. Bad science
Paper retractions are on the rise
[Chart: number of retractions per year (reviews and research articles), 2000–2016. Source: retractionwatch.com]
17. Bad science
Falsification vs errors
[Chart: retractions as a percentage of total articles (reviews and research articles), 2000–2016. Sources: retractionwatch.com, PubMed]
18. Deductive reasoning
Aleksic et al. F1000Res 3:271, 2014. DOI: 10.12688/f1000research.5686.2
Where are these errors coming from?
HARKing (Hypothesizing After the Results are Known)
19. Deductive reasoning
Penders & Janssens. Bioessays 40:1800173, 2018. DOI: 10.1002/bies.201800173
What does it mean if there is an error?
• Something is wrong with the study
• Something is wrong with the theory
• Cognitive bias
• Accumulation of credibility and status
20. How do we detect errors?
Openness spectrum
Peng. Science 334:1226–1227, 2011. DOI: 10.1126/science.1213847
21. Publication only
J Am Heart Assoc 7:e007678, 2018. DOI: 10.1161/JAHA.117.007678
Openness spectrum
22. How do we detect errors?
Peng. Science 334:1226–1227, 2011. DOI: 10.1126/science.1213847
Openness spectrum
24. How do we detect errors?
Peng. Science 334:1226–1227, 2011. DOI: 10.1126/science.1213847
Openness spectrum
25. The experimental method
[Diagram, after Leek & Peng: the data pipeline runs from experimental design and data collection (raw data), through data cleaning (tidy data) and basic data analysis (summary statistics), to hypothesis testing (the P value); scrutiny is concentrated on the P value, while the earlier steps receive little scrutiny]
Leek & Peng. Nature 520(7549):612, 2015. DOI: 10.1038/520612a.
Openness spectrum
27. The oblivious P-hacker: the Wansink case:
Poor study design, execution and analysis
http://www.timvanderzee.com/the-wansink-dossier-an-overview
28. The oblivious P-hacker: the Wansink case:
Poor study design, execution and analysis
http://www.timvanderzee.com/the-wansink-dossier-an-overview
29. P-hacking, not always deliberate
Poor study design, execution and analysis
Silberzahn et al., Adv Methods Pract Psychol Sci 1(3): 337–356, 2018
30. P-hacking, not always deliberate
Poor study design, execution and analysis
Preprint: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
Published version: Am Sci 102(6) : 460, 2014, DOI: 10.1511/2014.111.460
Garden of forking paths
[Diagram: from a single starting point, analysis choices branch repeatedly; of the many resulting paths, only a few end at P < 0.05]
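The forking-paths problem can be illustrated with a short simulation (a sketch, not from the slides: the branch count and sample sizes are arbitrary, and for simplicity each branch draws fresh data rather than re-analysing one correlated dataset):

```python
# Illustrative sketch: with no real effect anywhere, trying many
# analysis branches makes at least one "significant" result likely.
import math
import random
import statistics

random.seed(1)

def approx_p(a, b):
    """Two-sided p-value for a two-sample comparison (normal approximation)."""
    n = len(a)
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_branches = 15   # hypothetical forking choices: subgroups, outcomes, cut-offs
n_sims = 1000
hits = 0
for _ in range(n_sims):
    # every branch analyses pure noise: there is no true effect to find
    if any(approx_p([random.gauss(0, 1) for _ in range(30)],
                    [random.gauss(0, 1) for _ in range(30)]) < 0.05
           for _ in range(n_branches)):
        hits += 1

print(f"P(at least one branch with P < 0.05) = {hits / n_sims:.2f}")
```

With 15 independent branches and a 5% per-branch false-positive rate, theory predicts 1 − 0.95^15 ≈ 0.54, even though no branch has anything real to find.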
31. Call it out early
Countering p-hacking
Reviewer comments to authors
"…the analytical approach used by the authors
comes across as a 'p-hacked' exploratory
analysis of data originally collected for other
purposes."
32. Be upfront about the nature of the research
Unpublished data: https://github.com/kamermanpr/pain_threshold
Countering p-hacking
33. Final paragraph of the introduction
"We therefore conducted an exploratory analysis
of data from the two experiments, the results of
which we have reported elsewhere (Madden et al.,
2018)."
First sentence of the discussion
"This exploratory analysis followed our previous
observation that…"
Be upfront about the nature of the research
Countering p-hacking
34. Statistical misconceptions driving bias
Response to a reviewer’s comment on sample size
"Small sample size and lack of power is really
only an issue when retaining the null hypothesis
(i.e., a non-significant finding)."
Statistical power
35. Statistical power
Button et al. Nat Rev Neurosci 14:365-376, 2013. DOI: 10.1038/nrn3475.
Statistical misconceptions driving bias
The winner’s curse
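The winner's curse described by Button et al. can be reproduced with a toy simulation (a sketch only; the effect size, sample size, and normal-approximation test are my own choices, not taken from the paper):

```python
# Illustrative sketch of the winner's curse: in underpowered studies,
# the estimates that reach P < 0.05 are the lucky overestimates, so
# "significant" effect sizes are inflated relative to the truth.
import math
import random
import statistics

random.seed(2)

true_effect = 0.3    # true standardised mean difference
n = 20               # small per-group sample: low statistical power

def study():
    a = [random.gauss(true_effect, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(diff / se) / math.sqrt(2))))
    return diff, p

estimates = [study() for _ in range(5000)]
significant = [d for d, p in estimates if p < 0.05]   # the "winners"

print(f"true effect:                {true_effect}")
print(f"mean estimate, all studies: {statistics.mean(d for d, _ in estimates):.2f}")
print(f"mean estimate, 'winners':   {statistics.mean(significant):.2f}")
```

The average across all studies is unbiased, but the average among the significant ones is roughly double the true effect, which is exactly the curse.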
36. Response to a reviewer’s query on a p-value of 0.05
"…in our laboratory a finding is either statistically
significant or it is not…we use 0.05 as the
absolute threshold for determining significance."
Statistical misconceptions driving bias
Significance thresholds
37. American Statistical Association:
• "Scientific conclusions and policy decisions cannot be based on whether a p-value passes a threshold."
• "By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."
• "The p-value, or statistical significance, does not measure the size of an effect or the importance of a result."
Ronald Fisher: "No isolated experiment, however significant, can suffice for the
demonstration of natural phenomena."
Edwin Boring: "Scientific generalisation is a broader question than
mathematical description."
Rosnow & Rosenthal: "[S]urely, God loves the .06 nearly as much as the .05."
Statistical misconceptions driving bias
Significance thresholds
Wasserstein & Lazar. Am Stat 70: 129–133, 2016. DOI: 10.1080/00031305.2016.1154108.
Boring. Psychol Bull 16: 335–338, 1919.
Fisher. The Design of Experiments. Oliver and Boyd (Edinburgh), 1935.
Rosnow & Rosenthal. Am Psychologist 44: 1276–1284, 1989.
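The arbitrariness of a hard 0.05 cut-off is easy to demonstrate: identical replications of one modest true effect scatter their p-values on both sides of the threshold (a sketch; the effect size, sample size, and normal-approximation test are arbitrary choices of mine):

```python
# Illustrative sketch: replicate the same true effect 20 times and
# watch "significant or not" flip between runs purely by chance.
import math
import random
import statistics

random.seed(3)

def replicate(effect=0.4, n=30):
    """One replication: return the two-sided p-value (normal approximation)."""
    a = [random.gauss(effect, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

pvals = sorted(replicate() for _ in range(20))
n_sig = sum(p < 0.05 for p in pvals)
print(" ".join(f"{p:.3f}" for p in pvals))
print(f"'significant' in {n_sig}/20 identical replications")
```

Nothing about the underlying effect changes between runs; only sampling noise moves the p-value across 0.05, which is why a binary significant/non-significant verdict is such a fragile summary.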
40. "In God we trust, all others must bring data"
W. Edwards Deming
Acknowledgements
Supervisors
• Richard Brooksbank
• Helen Laburn
• Duncan Mitchell
• Neville Pitts
Collaborators
• David Bennett
• Kate Cherry
• Alan Karstaedt
• Zané Lombard
• Tory Madden
• Romy Parker
• Patricia Price
• Derick Raal
• Andrew Rice
• Annina Schmid
• Andreas Themistocleous
Co-workers
• Tapiwa Chinaka
• Stella Iacovides
• Prinisha Pillay
• Toni Wadley
• Zipho Zane
Heads of School
• Graham Mitchell
• Dave Gray
• William Daniels
Current & past students
Long-suffering family
• Andrea Fuller
• James & Robyn