
# Psychological approaches to reproducibility

Talk given at All Souls seminar on Reproducibility and Open Research, 31/10/18
http://users.ox.ac.uk/~phys1213/ReproAtASC.html

Published in: Science

### Psychological approaches to reproducibility

1. 1. Insights from psychology on lack of reproducibility Dorothy V. M. Bishop Professor of Developmental Neuropsychology University of Oxford @deevybee Talk given at All Souls seminar on Reproducibility and Open Research, 31/10/18 http://users.ox.ac.uk/~phys1213/ReproAtASC.html
2. 2. The four horsemen of the Apocalypse: P-hacking, publication bias, low power, HARKing
3. 3. P-hacking: simple explainer using poker (http://deevybee.blogspot.com/2016/01/the-amazing-significo-why-researchers.html). Probability of ‘3 of a kind’ from an unbiased deck of cards = 1 in 50. • If a magician tells you he’ll deal you ‘3 of a kind’, and he does so, you should be impressed. • If a magician deals 50 hands, and one of them is ‘3 of a kind’, you should not be impressed. The ‘surprisingness’ of a result is only interpretable in the context of the full dataset.
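The poker argument can be checked with a few lines of arithmetic. The sketch below is not from the talk: it assumes a standard 52-card deck, 5-card hands, "exactly three of a kind" (no full house or four of a kind), and a freshly shuffled deck for each hand so that the hands are independent. The function names are illustrative only.

```python
from math import comb


def three_of_a_kind_probability() -> float:
    """Chance of exactly three of a kind in one 5-card hand from a fair deck."""
    # Choose the rank of the triple, 3 of its 4 suits, then two other
    # distinct ranks (so no pair forms) with any suit for each.
    favourable = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2
    return favourable / comb(52, 5)


def chance_of_at_least_one(p: float, hands: int) -> float:
    """Chance that at least one of `hands` independent hands succeeds."""
    return 1 - (1 - p) ** hands


p_single = three_of_a_kind_probability()
p_any = chance_of_at_least_one(p_single, 50)

print(f"one hand: {p_single:.4f} (about 1 in {1 / p_single:.0f})")
print(f"at least one in 50 hands: {p_any:.2f}")
```

Under these assumptions a single hand comes out at roughly 1 in 47, close to the slide's rounded "1 in 50", while 50 hands make at least one such hand more likely than not.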
6. 6. Historical timeline: concerns about reproducibility. 1956 De Groot: failure to distinguish between hypothesis-testing and hypothesis-generating (exploratory) research -> misuse of statistical tests. Describes P-hacking (though that term is not used). Situation when 10 statistical tests are done in a study: “….when N=10 it is as if one participates… in a game of chance with “probability of losing” α for each “draw” or “throw”. The probability that we do not lose a single time in 10 draws can be calculated in the case that the draws are independent; it equals (1 − α)^10. For α = 0.05, the traditional 5% level, this becomes 0.95^10 = 0.60. This means, therefore, that we have a 40% chance of rejecting at least one of our 10 null hypotheses – falsely”
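De Groot's calculation can be reproduced both analytically and by simulation. The following is a minimal sketch, assuming (as the quote does) that the tests are independent and that every null hypothesis is true, so each p-value is uniform on [0, 1]. The function names, number of simulated studies and seed are arbitrary choices for illustration.

```python
import random


def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_familywise_error(alpha: float, n_tests: int,
                              n_studies: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same quantity under true nulls."""
    rng = random.Random(seed)
    studies_with_false_positive = 0
    for _ in range(n_studies):
        # Under a true null hypothesis each p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(n_tests)):
            studies_with_false_positive += 1
    return studies_with_false_positive / n_studies


analytic = familywise_error(0.05, 10)
simulated = simulate_familywise_error(0.05, 10)

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

The analytic value is 1 − 0.95^10, the 40% quoted on the slide; the simulation should land close to it.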
7. 7. Timeline: 1956 De Groot; 1969 Cohen. Low power (Cohen, 1969): sample size too small to reliably detect a true effect of interest
8. 8. Timeline: 1956 De Groot; 1969 Cohen; 1975 Greenwald (prejudice against the null); 1979 Rosenthal (the “file drawer” problem). Publication bias: nonsignificant findings are not published, so the literature gives a distorted impression. “As it is functioning in at least some areas of behavioral science research, the research-publication system may be regarded as a device for systematically generating and propagating anecdotal information.”
9. 9. Timeline: 1956 De Groot; 1969 Cohen; 1975 Greenwald; 1979 Rosenthal; 1998 Kerr. HARKing (Kerr, 1998): • “Presenting post hoc hypotheses in a research report as if they were, in fact, a priori hypotheses.” • A way of “translating type I errors into theory”. In a survey by Kerr & Harris (1998), 52% of respondents said they knew of editors/reviewers encouraging HARKing
10. 10. If we’ve known about this for decades, why haven’t the problems been fixed? No one cause: need to consider research environment and incentives This talk: focus on cognitive biases that make it hard to do science well Idea: doing good science is in opposition to many of our natural ways of thinking
11. 11. Cognitive biases that make it hard to do science well • Failure to understand probability • Tendency to see patterns in things • Confirmation bias • Errors of omission seen as acceptable • Need for narrative
12. 12. Consider this problem (example from Daniel Kahneman & Amos Tversky): A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower. For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days? 1. The larger hospital 2. The smaller hospital 3. About the same (that is, within 5% of each other)
13. 13. Example from Daniel Kahneman & Amos Tversky (hospital problem, continued). Expected value (days per year with more than 60% boys): Hosp15 = 57 days; Hosp45 = 26 days
14. 14. Example from Daniel Kahneman & Amos Tversky (hospital problem, continued). Expected value: Hosp15 = 57 days; Hosp45 = 26 days. [Figure: daily results plotted by day for Hospital 15 and Hospital 45] Small sample gives noisier estimates: red line bounces around much more than blue line
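The expected number of days for each hospital can be approximated with an exact binomial calculation. This sketch assumes exactly 15 or 45 births every day, a boy probability of exactly 0.5, independent births and a 365-day year; none of these details are stated in the transcript, and the slide's own figures (57 and 26 days) may come from a simulation or slightly different assumptions, so the numbers need not match exactly.

```python
from math import comb


def prob_more_than_60_percent_boys(births: int, p_boy: float = 0.5) -> float:
    """Chance that strictly more than 60% of `births` babies are boys."""
    return sum(
        comb(births, k) * p_boy ** k * (1 - p_boy) ** (births - k)
        for k in range(births + 1)
        if 10 * k > 6 * births  # integer test for k / births > 0.6
    )


def expected_days(births: int, days_in_year: int = 365) -> float:
    """Expected number of days per year with more than 60% boys."""
    return days_in_year * prob_more_than_60_percent_boys(births)


days_small = expected_days(15)
days_large = expected_days(45)

print(f"Hospital 15: {days_small:.1f} days")
print(f"Hospital 45: {days_large:.1f} days")
```

Under these assumptions the smaller hospital comes out at about 55 days and the larger one at about half that, the same pattern as on the slide: the smaller sample crosses the 60% line roughly twice as often.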
15. 15. Insensitivity to sample size • People have strong intuitions about random sampling; • These intuitions are wrong in fundamental respects; • These intuitions are shared by naive subjects and by trained scientists; • Intuitions are applied with unfortunate consequences in the course of scientific inquiry Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.
16. 16. Work in progress: The experimenter game • You have a budget to improve reading ability across Oxfordshire – potential for roll-out to hundreds of schools. You’ve been offered a remedy that claims to boost children’s reading ability by half a standard deviation • If you buy it and it turns out useless, you’ll lose a lot of money • If you buy it and it really works, it will be worth a lot of money • You’re not sure whether to trust the vendor – you think there’s a 50:50 chance that it really works • You can run some tests on samples of children, but it costs money – the more children you test, the more expensive. • You have an optimization problem! • So what’s your experimental strategy?
17. 17. You decide to run a study with two groups of N children What value of N should you start with? - let’s try 20 per group Here’s a sample of data: do you think this is sufficient to decide whether to adopt/reject the intervention?
18. 18. This sample was drawn from population with no real difference
19. 19. Here’s another sample of data: do you think this is sufficient to decide whether to adopt/reject the intervention?
20. 20. This time the sample was drawn from a population with a true effect. With a small sample, the difference can look small even when there is a true effect – this illustrates the problem of LOW POWER
21. 21. You decide to increase sample per group to 50
22. 22. This time, the impression from the sample of data gives a better indication of the true effect in the population But how reliable this impression is depends on the effect size, i.e. the separation in the means of the population distributions
23. 23. Panels: population effect size = .2 and population effect size = .5. Separation between red dots (drawn from population with true effect) and grey dots (drawn from population with no effect) shows the sample size at which we can reliably detect a true effect
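The dependence of power on sample size and effect size can be sketched numerically. The code below is not the simulation behind the slides: it uses a normal approximation to a two-sided, two-group comparison of means (a t-test would give slightly lower power at small N), with alpha = 0.05 and the sample sizes chosen here purely for illustration.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate power for comparing two group means (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected z statistic when the true standardized difference is effect_size.
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


for d in (0.2, 0.5):
    for n in (20, 50, 100, 400):
        print(f"effect size {d}, N = {n:>3} per group: "
              f"power ~ {approx_power(d, n):.2f}")
```

With an effect size of .5, moving from 20 to 50 children per group roughly doubles the approximate power; with an effect size of .2, even 50 per group leaves power low.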
24. 24. Failure to appreciate power of ‘the prepared mind’ Tendency to see patterns in things
25. 25. Pareidolia
26. 26. Example from Lazic, S. (2016) Experimental Design for Laboratory Biologists Position of bomb hits: General has map of bomb hits and wants to know if bombs were dropped at random or whether some sites are being targeted. Which map suggests targeting? Blue, red or neither?
27. 27. General message: we tend to assume random data are regular, and so try to interpret patterns when there is irregularity. The blue map may look as if there is targeting, especially if there are potential targets at A or B. In fact, blue X and Y co-ordinates were selected at random. The red map does not suggest targeting, but it is not random. The co-ordinates were selected to be evenly distributed, and then jittered.
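The contrast between the two maps can be reproduced with a small simulation. This is a sketch under assumptions of my own, not Lazic's code: 100 points on a unit square, a 10 x 10 grid for the "even then jittered" map, and a variance-to-mean ratio of cell counts as a rough clumpiness measure (about 1 for random points, near 0 for evenly spread ones).

```python
import random
from statistics import mean, pvariance

GRID = 10  # 10 x 10 cells, so 100 points per map


def random_map(n_points: int, rng: random.Random) -> list:
    """Co-ordinates chosen completely at random (like the blue map)."""
    return [(rng.random(), rng.random()) for _ in range(n_points)]


def jittered_grid_map(grid: int, rng: random.Random) -> list:
    """One point per cell, nudged inside its cell (like the red map)."""
    cell = 1 / grid
    return [((i + rng.uniform(0.2, 0.8)) * cell,
             (j + rng.uniform(0.2, 0.8)) * cell)
            for i in range(grid) for j in range(grid)]


def dispersion_index(points: list, grid: int) -> float:
    """Variance-to-mean ratio of the number of points falling in each cell."""
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        counts[row * grid + col] += 1
    return pvariance(counts) / mean(counts)


rng = random.Random(42)
random_index = dispersion_index(random_map(GRID * GRID, rng), GRID)
jittered_index = dispersion_index(jittered_grid_map(GRID, rng), GRID)

print(f"random map:   {random_index:.2f}")
print(f"jittered map: {jittered_index:.2f}")
```

Truly random points produce empty cells and crowded cells side by side, which is exactly the clumpiness that tempts us to see targets.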
28. 28. But! Seeing novel patterns in complex data is one of the most important and exciting aspects of science! Consider Brodmann (1909): identified brain regions with different cell types – not obvious: required expertise and painstaking study. Bailey and von Bonin (1951) noted problems in Brodmann's approach: lack of observer independency, reproducibility and objectivity. Yet Brodmann’s areas stood the test of time: still used today
29. 29. Special expertise or Jesus in toast? How to decide • Eradicate subjectivity from methods • Adopt standards from industry for checking/double-checking • Automate data collection and analysis as far as possible • Make recordings of methods (e.g. Journal of Visualised Experiments) • Make data and analysis scripts open
30. 30. Confirmation bias
31. 31. How to do good science That is the idea that we all hope you have learned in studying science in school… …. It’s a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty—a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid— not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked—to make sure the other fellow can tell they have been eliminated. Richard Feynman, Caltech 1974 commencement address
32. 32. Wason task: a way of thinking about experimental design Each card has a number on one side and a patch of colour on the other. You are asked to test the hypothesis that – for these 4 cards - if an even number appears on one side, then the opposite side is red. • Are any of the cards irrelevant to the hypothesis? • Are any of the cards critical to the hypothesis? • Which card(s) would you turn over to test the hypothesis? A B C D
33. 33. Wason task: a way of thinking about experimental design Each card has a number on one side and patch of colour on the other. You are asked to test the hypothesis that – for these 4 cards - if an even number appears on one side, then the opposite side is red. • Usual response is B & C are critical. • But C is not critical (we’re testing ‘if P then Q’, not ‘if Q then P’) • D is critical as it has potential to disconfirm hypothesis – but usually overlooked A B C D
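The logic of the task can be made explicit by asking, for each card, whether any possible hidden side could break the rule. The card faces below are assumptions, since the transcript only gives the labels A to D: A is taken to show an odd number, B an even number, C a red patch and D a non-red patch, which is consistent with the slide's statement that B and D are critical and C is not.

```python
# Assumed visible faces (hypothetical; the transcript only gives labels A-D).
VISIBLE = {"A": 3, "B": 8, "C": "red", "D": "blue"}

NUMBERS = range(1, 11)     # possible hidden numbers (illustrative range)
COLOURS = ("red", "blue")  # possible hidden colours (illustrative set)


def violates(number: int, colour: str) -> bool:
    """A card breaks 'if even then red' only if it is even and not red."""
    return number % 2 == 0 and colour != "red"


def is_critical(face) -> bool:
    """A card is worth turning over if some hidden side could break the rule."""
    if isinstance(face, int):
        return any(violates(face, colour) for colour in COLOURS)
    return any(violates(number, face) for number in NUMBERS)


critical = [label for label, face in VISIBLE.items() if is_critical(face)]
print(critical)
```

Only cards that could disconfirm the hypothesis are informative: the red card cannot break the rule whatever number is behind it, whereas the non-red card can.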
34. 34. Wason task: Shows how confirmation bias can affect experimental design. We need to design experiments to look for disconfirmation of a theory. In practice: "To test a hypothesis, we think of a result that would be found if the hypothesis were true and then look for that result" (J. Baron, 1988, p. 231). In a survey of 84 scientists (physicists, biologists, psychologists, sociologists) Mahoney (1976) found fewer than 10% correctly identified the critical cards
35. 35. “The self-deception comes in that over the next 20 years, people believed they saw specks of light that corresponded to what they thought Vulcan should look during an eclipse: round objects crossing the face of the sun, which were interpreted as transits of Vulcan.” Confirmation bias at level of observations: Seeing what you expect to see
36. 36. • Cherry-picking may not be deliberate • We find it much easier to process and remember information that agrees with our viewpoint Confirmation bias affects how we remember and process information
37. 37. A personal example: slide from talks I gave on genetics of language disorder. Twin studies of SLI, probandwise concordance in same-sex twins (MZ vs DZ): Lewis & Thompson, 1992: .86 vs .48; Bishop et al, 1995: .70 vs .46; Tomblin & Buckwalter, 1998: .96 vs .69; Hayiou-Thomas et al, 2005: .36 vs .33. Twin concordance points to genetic influence when MZ > DZ
38. 38. Twin studies of SLI, probandwise concordance in same-sex twins (MZ vs DZ): Lewis & Thompson, 1992: .86 vs .48; Bishop et al, 1995: .70 vs .46; Tomblin & Buckwalter, 1998: .96 vs .69; Hayiou-Thomas et al, 2005: .36 vs .33. I continued to use the original slide after 2005, despite this additional study I had co-authored. I failed to mention this in talks for several years – I literally forgot about it – presumably because it did not fit!
39. 39. Example also illustrates how we will do further research to try to make sense of data that does not fit our ideas – but look far less closely when data does fit. For denouement of this story, see
40. 40. Confirmation bias affects literature reviews
41. 41. Most literature reviews cherry-pick the evidence (that’s why I’m not identifying this specific e.g.) “Regardless of etiology, cerebellar neuropathology commonly occurs in autistic individuals. Cerebellar hypoplasia and reduced cerebellar Purkinje cell numbers are the most consistent neuropathologies linked to autism [8, 9, 10, 11, 12, 13]. MRI studies report that autistic children have smaller cerebellar vermal volume in comparison to typically developing children [14].” Example: Study published in 2013 • I was surprised by this introduction to a paper, as it did not fit my impression of the literature on neuropathology in autism: but the authors seemed to cite a lot of supportive evidence • I checked to see if there was a relevant meta-analysis: there was….
42. 42. Meta-analysis: Traut et al (2018) https://doi.org/10.1016/j.biopsych.2017.09.029. Standardized mean difference is +ve when cerebellar volume is greater in ASD. Ref [14] (Webb et al) found a larger cerebellum, though it did find the area of the vermis smaller in ASD after covarying cerebellum size. Other studies mostly found no difference or an increase – the opposite of what was claimed in the 2013 paper
43. 43. Confirmation bias tends to produce errors of omission – these are generally thought to be less serious than errors of commission (i.e. making stuff up) But consequences can be major
44. 44. Errors of omission in reporting research “[I]t is a truly gross ethical violation for a researcher to suppress reporting of difficult-to-explain or embarrassing data in order to present a neat and attractive package to a journal editor.” (Greenwald, 1975, p. 19) “Failure to report results from a clinical trial is equivalent to fraud.” Iain Chalmers, personal communication
45. 45. Consequence of omission errors in literature reviews • When we read a peer-reviewed paper, we tend to trust the citations that back up a point • When we come to write our own paper, we cite the same materials • A good scientist won’t cite papers without reading them, but even this won’t save you from bias – you inherit it from prior papers • If prior papers only cite materials agreeing with a viewpoint, that viewpoint gets entrenched • You won’t know – unless you explicitly search – that there are other papers that give a different picture
46. 46. The (partial*) solution: always start with a systematic review. • Systematic review: collect and summarise all empirical evidence that fits pre-specified eligibility criteria to address a specific question • Meta-analysis: use statistical methods to summarise the results of these studies. *But depends on finding all relevant papers
47. 47. Example from Lazic, S. (2016) Experimental Design for Laboratory Biologists: 100 relevant studies on gene/disease association. 95 studies find no association; negative findings tend not to be mentioned in Abstracts. 5 false positive results; disease and gene mentioned in Abstract. Pubmed search for disease AND gene: 5 supporting and one negative study found
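The distortion in this example is easy to quantify. The short sketch below uses only the counts given on the slide; the function name is illustrative.

```python
def supporting_share(supporting: int, negative: int) -> float:
    """Fraction of studies in a set that report an association."""
    return supporting / (supporting + negative)


# Whole literature: 5 false positives among 100 relevant studies.
share_in_literature = supporting_share(5, 95)

# What the abstract-based search returns: 5 supporting, 1 negative.
share_in_search = supporting_share(5, 1)

print(f"literature: {share_in_literature:.0%} supporting")
print(f"search:     {share_in_search:.0%} supporting")
```

A reader relying on the search alone would see a literature that looks overwhelmingly supportive, when only 5% of the relevant studies found the association.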
48. 48. https://hbr.org/2016/10/theres-a-word-for-using-truthful-facts-to-deceive-paltering Omission errors, commission errors and paltering Paltering differs from • Lying by omission (the passive omission of relevant information) • Lying by commission (the active use of false statements) I have no data on this, but personal experience suggests paltering is common in literature reviews – and in reporting results
49. 49. Let’s take another look at that cerebellum paper: statements that are not untrue, but are misleading “Regardless of etiology, cerebellar neuropathology commonly occurs in autistic individuals. Cerebellar hypoplasia and reduced cerebellar Purkinje cell numbers are the most consistent neuropathologies linked to autism [8, 9, 10, 11, 12, 13]. MRI studies report that autistic children have smaller cerebellar vermal volume in comparison to typically developing children [14].” Impression of large body of work, but mostly reviews of same few studies Study 14 by Webb et al: found overall increase in cerebellum size: smaller vermis effect only after adjusting total cerebellar volume
50. 50. In terms of ethical behaviour, rank order the following behaviours: • Omission of relevant studies • Stating that a study found something that it didn’t • Stating a study result that was true, but in a misleading way
51. 51. I don’t know of studies looking at this in science reporting, but analogous behaviour was rated in studies of negotiation by Rogers et al (2016). Honesty judgement in negotiation study*: • Omission of relevant information: 23% • Lying (untrue statement): 5% • Stating something that is true, but in a misleading way (paltering): 32%. Neither omission of information nor paltering seen as honest, but both are more acceptable than lying. *Rogers, T. et al (2016). Artful paltering: The risks and rewards of using truthful statements to mislead others. Journal of Personality and Social Psychology, 112(3), 456-473.
52. 52. How common are these in literature reviews? Does it matter? • Omission of relevant studies • Stating that a study found something that it didn’t • Stating a study result that was true, but in a misleading way My view: Adoption of these behaviours in science is likely to depend on: (a) Is it rewarded? (b) Will it be detected? (c) If it is, could you avoid blame? (d) Are there obvious victims? (e) Is ‘everyone doing it’? Overlooked victims: • Potential users (patients, etc) • Researchers trying to build on results • Funders
53. 53. A further, overarching problem The need for narrative “Another reason why HARKed research reports may fare better in the review and publication process is that they not only provide a better fit to a specific good science script, they may also provide a better fit to the more general good story script. Positing a theory serves as an effective "initiating event." It gives certain events significance and justifies the investigators' subsequent purposeful activities directed at the goal of testing the hypotheses. And, when one HARKs, a "happy ending” (i.e., confirmation) is guaranteed.” Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social
54. 54. Darwinian processes in survival of ideas “Examples of memes are tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches. Just as genes propagate themselves in the gene pool by leaping from body to body via sperms or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation.” R. Dawkins
55. 55. Successful meme • Easy to understand, remember, and communicate to others • Not helped by reporting everything! • Not helped by reporting null results! • May be influenced by whether confers advantage to the person communicating • Survival does not depend on whether they are useful, true, or potentially harmful
56. 56. Cognitive biases pervade every step of the research process • Reading literature: confirmation bias, omissions • Experimental design: confirmation bias, law of small numbers • Experimental observations: seeing patterns, confirmation bias • Data analysis: confirmation bias, seeing patterns, law of small numbers, omissions • Scientific reporting: confirmation bias, omissions, need for narrative
57. 57. Will anything change? “It really is striking just for how long there have been reports about the poor quality of research methodology, inadequate implementation of research methods and use of inappropriate analysis procedures as well as lack of transparency of reporting. All have failed to stir researchers, funders, regulators, institutions or companies into action”. Bustin, 2014 Reasons for optimism • Concern from those who use research: • Doctors and patients • Pharma companies • Concern from funders • Increase in studies quantifying the problem • Social media
58. 58. Professor Dorothy Bishop, FRS, FMedSci, FBA, Wellcome Trust Principal Research Fellow, Department of Experimental Psychology, Anna Watts Building, Woodstock Road, Oxford, OX2 6GG. @deevybee http://deevybee.blogspot.com/2012/11/bishopblog-catalogue-updated-24th-nov.html https://www.slideshare.net/deevybishop https://orcid.org/0000-0002-2448-4033