What is the reproducibility crisis in science and what can we do about it?

Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop

  1. 1. What is the reproducibility crisis in science and what can we do about it? Dorothy V. M. Bishop Professor of Developmental Neuropsychology University of Oxford @deevybee
  2. 2. What is the problem? “There is increasing concern about the reliability of biomedical research, with recent articles suggesting that up to 85% of research funding is wasted.” Bustin, S. A. (2015). The reproducibility of biomedical research: Sleepers awake! Biomolecular Detection and Quantification. Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
  3. 3. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method based on original by Chris Chambers
  4. 4. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method based on original by Chris Chambers How common?
  5. 5. Which Article Should You Write? There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b). re Data Analysis: Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find additional evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something— anything —interesting. Writing the Empirical Journal Article Daryl J. Bem The Compleat Academic: A Practical Guide for the Beginning Social Scientist, 2nd Edition. Washington, DC: American Psychological Association, 2004. “This book provides invaluable guidance that will help new academics plan, play, and ultimately win the academic career game.” Explicitly advises HARKing!
  6. 6. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method based on original by Chris Chambers p-hacking P-hacking: doing many tests and only reporting the significant ones. Collecting extra data or removing outliers to push ‘nearly significant’ results over boundary. How common?
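How much this flexibility can inflate false positives is easy to check by simulation. The sketch below (Python; the sample sizes and number of interim looks are illustrative, not taken from the talk) runs a two-group comparison on pure noise, peeks at the p-value after each extra batch of participants, and stops as soon as p < .05: the long-run false-positive rate ends up two to three times the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_fp_rate(n_sims=2000, n_start=20, n_max=60, step=10):
    """Proportion of null simulations that reach p < .05 at ANY interim look."""
    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_max)  # both groups drawn from the same distribution,
        b = rng.normal(size=n_max)  # so any 'significant' difference is a false positive
        for n in range(n_start, n_max + 1, step):
            if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
                false_positives += 1  # stop and 'report' at the first significant look
                break
    return false_positives / n_sims

print(optional_stopping_fp_rate())  # well above the nominal 0.05, typically ~0.10-0.15
```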
  7. 7. John et al., 2012 – survey of psychologists
  8. 8. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method based on original by Chris Chambers p-hacking Low statistical power Sample size too small to detect real effect
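As a minimal illustration of what "low power" means here (not part of the original slides; the effect size and sample sizes are illustrative), the sketch below estimates by simulation the chance that a two-sample t-test detects a genuine medium-sized effect of d = 0.5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def simulated_power(n_per_group, d=0.5, alpha=0.05, n_sims=5000):
    """Estimate the power of a two-sample t-test for a true effect of size d."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)  # true difference of d standard deviations
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (10, 20, 50, 100):
    print(n, round(simulated_power(n), 2))
# With 20 per group, power for d = 0.5 is only about a third:
# most such studies would miss a real effect of that size.
```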
  9. 9. Button KS et al. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14:365-376. Median power of studies included in neuroscience meta-analyses
  10. 10. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method based on original by Chris Chambers p-hacking Low statistical power Publication bias Null findings don’t get published – literature distorted Fanelli, 2010: 92% papers report positive findings
  11. 11. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method p-hacking Low statistical power Publication bias Methods to avert bias not reported MacLeod et al, 2015: in vivo research, only around 25% papers reported randomisation/blinding Failure to control for bias
  12. 12. Generate and specify hypotheses Design study Collect data Analyse data & test hypotheses Interpret data Publish or conduct next experiment Hypothetico-deductive scientific method p-hacking Low statistical power Publication bias Failure to control for bias Poor quality control, e.g. misidentified cell lines/ reagents
  13. 13. Bustin (2015) on RNA biomarkers: “molecular techniques can be unfit for purpose” Poor fidelity of reagents/cell lines
  14. 14. None of this is new!
  15. 15. 1956 De Groot Failure to distinguish between hypothesis-testing and hypothesis-generating (exploratory) research -> misuse of statistical tests Historical timeline: concerns about reproducibility
  16. 16. 1956 De Groot 1975 Greenwald “As it is functioning in at least some areas of behavioral science research, the research- publication system may be regarded as a device for systematically generating and propagating anecdotal information.”
  17. 17. 1956 De Groot 1975 Greenwald The “file drawer” problem 1979 Rosenthal
  18. 18. 1956 De Groot 1975 Greenwald 1987 Newcombe “Small studies continue to be carried out with little more than a blind hope of showing the desired effect. Nevertheless, papers based on such work are submitted for publication, especially if the results turn out to be statistically significant.” 1979 Rosenthal
  19. 19. 1956 De Groot 1975 Greenwald 1987 Newcombe 1993 Dickersin & Min Clinical trials with ‘significant’ results substantially more likely to be published. “Most unpublished trials remained so because investigators thought the results were ‘not interesting’ or they ‘did not have enough time’” 1979 Rosenthal
  20. 20. 1956 De Groot 1975 Greenwald 1987 Newcombe 1993 Dickersin & Min “The misidentified cell lines reported here have already been unwittingly used in several hundreds of potentially misleading reports, including use as inappropriate tumor models and subclones masquerading as independent replicates.” 1999 Macleod et al 1979 Rosenthal
  21. 21. Why is this making headlines now? • Increase in studies quantifying the problem • Concern from those who use research: • Doctors and Patients • Pharma companies • Social media “It really is striking just for how long there have been reports about the poor quality of research methodology, inadequate implementation of research methods and use of inappropriate analysis procedures as well as lack of transparency of reporting. All have failed to stir researchers, funders, regulators, institutions or companies into action”. Bustin, 2014
  22. 22. Failure to appreciate power of ‘the prepared mind’ Natural instinct is to look for consistent evidence, not disproof Problems caused by researchers: 1
  23. 23. “The self-deception comes in that over the next 20 years, people believed they saw specks of light that corresponded to what they thought Vulcan should look like during an eclipse: round objects crossing the face of the sun, which were interpreted as transits of Vulcan.”
  24. 24. Seeing things in complex data requires skill Bailey and von Bonin (1951) noted problems in Brodmann's approach — lack of observer independency, reproducibility and objectivity Yet have stood test of time: still used today Brodmann areas, 1909
  25. 25. Seeing things in complex data requires skill Or pareidolia Bailey and von Bonin (1951) noted problems in Brodmann's approach — lack of observer independency, reproducibility and objectivity Yet have stood test of time: still used today Brodmann areas, 1909
  26. 26. Discusses failure to replicate studies on preferential looking in babies – role of experimenter expertise
  27. 27. Special expertise or Jesus in toast? How to decide • Eradicate subjectivity from methods • Adopt standards from industry for checking/double-checking • Automate data collection and analysis as far as possible • Make recordings of methods (e.g. Journal of Visualized Experiments) • Make data and analysis scripts open
  28. 28. Failure to understand statistics (esp. p-values and power) http://deevybee.blogspot.co.uk/2016/01/the-amazing-significo-why-researchers.html Problems caused by researchers: 2
  29. 29. Gelman A, and Loken E. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time. www.stat.columbia.edu/~gelman/research/unpublished/p_ hacking.pdf "El jardín de senderos que se bifurcan"
  30. 30. 1 contrast Probability of a ‘significant’ p-value < .05 = .05 Large population database used to explore link between ADHD and handedness https://figshare.com/articles/The_Garden_of_Forking_Paths/2100379
  31. 31. Focus just on Young subgroup: 2 contrasts at this level Probability of a ‘significant’ p-value < .05 = .10 Large population database used to explore link between ADHD and handedness
  32. 32. Focus just on Young on measure of hand skill: 4 contrasts at this level Probability of a ‘significant’ p-value < .05 = .19 Large population database used to explore link between ADHD and handedness
  33. 33. Focus just on Young, Females on measure of hand skill: 8 contrasts at this level Probability of a ‘significant’ p-value < .05 = .34 Large population database used to explore link between ADHD and handedness
  34. 34. Focus just on Young, Urban, Females on measure of hand skill: 16 contrasts at this level Probability of a ‘significant’ p-value < .05 = .56 Large population database used to explore link between ADHD and handedness
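The probabilities quoted across these slides follow from the familywise error formula: with k independent contrasts each tested at α = .05, the chance of at least one ‘significant’ result is 1 − 0.95^k. A short check (Python) reproduces the figures:

```python
# Chance of at least one p < .05 among k independent contrasts tested at alpha = .05
for k in (1, 2, 4, 8, 16):
    print(k, round(1 - 0.95 ** k, 2))  # -> 0.05, 0.1, 0.19, 0.34, 0.56
```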
  35. 35. Problem exacerbated because • Can now easily gather huge multivariate datasets • Can easily do complex statistical analyses Problems with exploratory analyses that use methods that presuppose hypothesis-testing approach
  36. 36. Huge bias for type I error
  37. 37. When p-hacking meets an ideological agenda: a particularly toxic mix
  38. 38. http://www.snopes.com/medical/disease/cdcwhistleblower.asp
  39. 39. Solutions a. Using simulated datasets to give insight into statistical methods
  40. 40. Illustrated with field of ERP/EEG • Flexibility in analysis in terms of: • Electrodes • Time intervals • Frequency ranges • Measurement of peaks • etc, etc • Often see analyses with 4- or 5-way ANOVA (group x side x site x condition x interval) • Standard stats packages correct p-values for N levels WITHIN a factor, but not for overall N factors and interactions. Cramer AOJ, et al. 2016. Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review 23:640-647
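Cramer et al.'s point can be illustrated with a simulation in the spirit of solution (a). The sketch below (Python with statsmodels; a null 2×2×2×2 design and uncorrected p-values, both illustrative assumptions rather than anything from the talk) fits a full-factorial ANOVA to pure noise and asks how often at least one of the 15 main effects or interactions comes out ‘significant’.

```python
import numpy as np
import pandas as pd
from itertools import product
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)

# Fully crossed 2x2x2x2 design, 10 observations per cell; the outcome is pure noise
cells = pd.DataFrame(list(product("ab", "ab", "ab", "ab")) * 10,
                     columns=["A", "B", "C", "D"])

def any_spurious_effect(design):
    df = design.assign(y=rng.normal(size=len(design)))
    table = anova_lm(ols("y ~ A * B * C * D", data=df).fit(), typ=2)
    pvals = table["PR(>F)"].drop("Residual")  # 15 main effects and interactions
    return (pvals < 0.05).any()

rate = np.mean([any_spurious_effect(cells) for _ in range(200)])
print(rate)  # roughly 0.5: with 15 uncorrected tests, 'something' is usually significant
```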
  41. 41. Solutions b. Distinguish exploration from hypothesis-testing analyses • Subdivide data into exploration and replication sets. • Or replicate in another dataset
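A minimal sketch of the split-half idea (Python; the dataset, the 20 candidate predictors, and the selection rule are all hypothetical): fish freely in the exploration half, then test only the chosen predictor in the held-out replication half.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical dataset: one outcome and 20 candidate predictors, all pure noise
n, n_predictors = 400, 20
X = rng.normal(size=(n, n_predictors))
y = rng.normal(size=n)

# Random half-split: explore freely on one half, confirm on the other
idx = rng.permutation(n)
explore, replicate = idx[:n // 2], idx[n // 2:]

# Exploration: go fishing for the predictor most correlated with the outcome
r_explore = [stats.pearsonr(X[explore, j], y[explore])[0] for j in range(n_predictors)]
best = int(np.argmax(np.abs(r_explore)))

# Confirmation: test only that pre-selected predictor on the held-out half
r, p = stats.pearsonr(X[replicate, best], y[replicate])
print(best, round(r, 3), round(p, 3))  # usually non-significant, as it should be under the null
```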
  42. 42. Solutions c. Masked data Comparison of coronary care units vs treatment at home From Ben Goldacre’s blog: http://www.badscience.net/2010/04/righteous-mischief-from-archie-cochrane/ Archie Cochrane
  43. 43. Solutions c. Masked data MacCoun R., Perlmutter S. 2015 Hide results to seek the truth. Nature 526, 187-189. “...temporarily and judiciously removing data labels and altering data values to fight bias and error”
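A minimal sketch of label-masking in the spirit of MacCoun and Perlmutter (Python; the column names and coding scheme are hypothetical, and a real pipeline would also handle the value-perturbation step they describe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical trial data
df = pd.DataFrame({
    "group": rng.choice(["treatment", "control"], size=100),
    "outcome": rng.normal(size=100),
})

# Blind the labels: analysts see only arbitrary codes while the analysis is developed
codes = {"treatment": "X", "control": "Y"}
if rng.random() < 0.5:  # randomise which code means which, so analysts cannot guess
    codes = {"treatment": "Y", "control": "X"}
blinded = df.assign(group=df["group"].map(codes))

# ... develop and freeze the full analysis pipeline on `blinded` ...

# The unblinding key is held separately and applied only once the analysis is locked
key = {v: k for k, v in codes.items()}
unblinded = blinded.assign(group=blinded["group"].map(key))
print(unblinded["group"].value_counts())
```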
  44. 44. Solutions d. Preregistration of analyses
  45. 45. https://www.technologyreview.com/s/601348/merck-wants-its-money-back-if- university-research-is-wrong/ Solutions e. Funding contingent on adoption of reproducible practices
  46. 46. • Reluctance to collaborate with competitors • Reluctance to share data • Fabricated data Problems caused by researchers. 3 Solutions to these may require changes to incentive structures, which leads us to....
  47. 47. http://deevybee.blogspot.co.uk/ Problems caused by journals • More concern for newsworthiness than methods • Won’t publish replications (or failures to replicate) • Won’t publish ‘negative’ findings
  48. 48. • Reward those with most grant income • Reward according to journal impact factor Problems caused by institutions
  49. 49. Income to institution increases with the amount of funding and so…. • The system encourages us to assume that: • Big grant is better than small grant • Many grants are better than one grant “This is Dr Bagshaw, discoverer of the infinitely expanding research grant” ©Cartoonstock
  50. 50. This is counterproductive because • Amount of funding needed to do research is not a proxy for value of that research • Some activities intrinsically more expensive • Does not make sense to disfavour research areas that cost less Daniel Kahneman
  51. 51. Furthermore.... • Desperate scramble for research funds leads to researchers being overcommitted -> poorly conducted studies • Ridiculous amount of waste due to the ‘academic backlog’
  52. 52. Journal impact factor as measure of quality • Mean number of citations to articles published in any given journal in the two preceding years • Originally designed to help libraries decide on subscriptions • Now often used as proxy for quality of an article Eugene Garfield
  53. 53. Problems with journal impact factors • Impact factor not a good indication of the citations for individual articles in the journal, because distribution very skewed • Typically, around half the articles have very few citations http://www.dcscience.net/colquhoun-nature-impact-2003.pdf N citations for sample of papers in Nature
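Why the mean is a poor summary of a skewed citation distribution can be seen with synthetic data (the lognormal distribution below is purely illustrative, not the Nature sample shown on the slide): the journal-level mean is pulled up by a handful of highly cited papers, while the typical article sits well below it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative skewed 'citations per article' distribution (synthetic, not real journal data)
citations = rng.lognormal(mean=1.0, sigma=1.2, size=2000).astype(int)

print("mean citations   :", round(citations.mean(), 1))   # inflated by a few highly cited papers
print("median citations :", int(np.median(citations)))    # what a typical article receives
print("share below mean :", round((citations < citations.mean()).mean(), 2))  # well over half
```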
  55. 55. Problems caused by employers • Reward research reproducibility over impact factor in evaluation • Consider ‘bang for your buck’ rather than amount of grant income • Reward those who adopt open science practices Solutions for institutions Nat Biotech, 32(9), 871-873. doi: 10.1038/nbt.3004 Marcia McNutt Science 2014 • VOL 346 ISSUE 6214
  56. 56. Problems caused by funders • Don’t require that all data be reported (though growing interest in data sharing) • No interest in funding replications • No interest in funding systematic reviews
  57. 57. http://www.acmedsci.ac.uk/policy/policy-projects/reproducibility-and-reliability-of-biomedical-research/symposium- resources-links/ Resources
