
Simulating data to gain insights into power and p-hacking


Very basic introduction to simulating data to illustrate issues affecting reproducibility. Uses Excel and R, but assumes no prior knowledge of R. Please let me know of errors or things that need better explanation.

Published in: Science

  1. 1. Simulating data to gain insights into power and p-hacking Dorothy V. M. Bishop Professor of Developmental Neuropsychology University of Oxford @deevybee
  2. 2. Before you get started…. • The early exercises in this lesson use Microsoft Excel, which most people will have installed • The later exercises use R and R studio. This is free software. If you don’t have it, you’ll need to download it. As this can take time, it’s recommended that you do that before you go further. • Please follow instructions on the next slide.
  3. 3. Installing R • Open an internet browser and go to • Click the "download R" link in the middle of the page under "Getting Started." • Click on the link for a CRAN location close to you • Mac users: • Click on the "Download R for (Mac) OS X" link at the top of the page. • Click on the file containing the latest version of R under "Files." • Save the .pkg file, double-click it to open, and follow the installation instructions. • Windows users: • Click on the "Download R for Windows" link at the top of the page. • Click on the "install R for the first time" link at the top of the page. • Click "Download R for Windows" and save the executable file somewhere on your computer. • Run the .exe file and follow the installation instructions.
  4. 4. Installing R studio • R studio is a friendly interface for R. Once it is installed, you need not open the original R software: instead, you access R by opening the R studio application • Go to and click on the "Download RStudio" button. • Click on "Download RStudio Desktop." Mac users: • Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder. Windows users: • Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.
  5. 5. Why invent data? • If you can anticipate what your data will look like, you will also anticipate a lot of issues about study design that you might not have thought of • Analysing a simulated dataset can clarify what is optimal analysis/ how the analysis works • Simulating data with an anticipated effect is very useful for power analysis – deciding what sample size to use • Simulating data with no effect (i.e. random noise) gives unique insights into p-hacking
  6. 6. Ways to simulate data • For newbies: to get the general idea: Excel • Far better but involves steeper learning curve: R • Also (but not covered here) options in SPSS and Matlab: • e.g. • generation.html
  7. 7. Basic idea • Anything you measure can be seen as a combination of an effect of interest plus random noise • The goal of research is to find out • (a) whether there is an effect of interest • (b) if yes, how big it is • Classic hypothesis-testing with p-values focuses just on (a) – i.e. have we just got noise or a real effect? • We can simulate most scenarios by generating random noise, with or without a consistent added effect
  8. 8. Basic idea: generate a set of random numbers in Excel • Open a new workbook • In cell A1 type random number • In cell A2 type = rand() Grab the little square in the bottom right of A2 and pull it down to autofill the cells below to A8
  9. 9. Random numbers in Excel, ctd • You have just simulated some data! • Are your numbers the same as mine? • What happens when you type rand() in A9?
  10. 10. Random numbers in Excel, ctd. • Your numbers will be different to mine – that’s because they are random. • The numbers will change whenever you open the worksheet, or make any change to it. • Sometimes that’s fine, but for this demo we want to keep the same numbers. To control when random numbers update, select Manual in Formula|Calculation Options. • To update to new numbers use Calculate Now button.
  11. 11. Random numbers in Excel, ctd. • The rand() function generates random numbers between 0 and 1: Are these the kind of numbers we want?
  12. 12. Realistic data usually involves normally distributed numbers • Nifty way to do this in Excel: treat the generated numbers as probabilities (like p-values) • The normsinv() function turns a probability into the corresponding z-score
  13. 13. Normally distributed random numbers Try this: • Type = normsinv(A2) in cell B2 • Drag formula down to cell B8 • Now look at how the numbers in column A relate to those in column B. NB. In practice, we can generate normally distributed random numbers (i.e. z-scores) in just one step with formula: = normsinv(rand())
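For readers who move on to R later in this lesson, the same two-step idea can be sketched there too. This is only an illustration, not part of the original exercise: runif() plays the role of rand() and qnorm() the role of normsinv(); the variable names and the seed are assumptions chosen for the example.

```r
# Sketch: R equivalent of = normsinv(rand()) in Excel.
set.seed(10)                 # arbitrary seed so the numbers can be reproduced

uniformScores <- runif(8)    # like rand(): random numbers between 0 and 1
zScores <- qnorm(uniformScores)  # like normsinv(): probability -> z-score

# Side by side, as in columns A and B of the worksheet
print(data.frame(uniformScores, zScores))

# Values below .5 map to negative z-scores, values above .5 to positive ones
print(qnorm(c(.025, .5, .975)))
```

Because qnorm() only ever increases as its input increases, the ranking of the numbers in the first column is preserved in the second.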
  14. 14. Now we are ready to simulate a study where we have 2 groups to be compared on a t-test • Pull down the formula from columns A and B to extend to A11:B11 • Type a header ‘group’ in C1 • Type 1 in C2:C6 and 2 in C7:C11
  15. 15. What is formula for t-test in Excel? Basic rule for life, especially in programming: if you don’t know it, Google it TTEST formula in xls: You specify: Range 1 Range 2 tails (1 or 2) type 1 = paired 2 = unpaired equal variance 3 = unpaired unequal variance
  16. 16. Try entering the formula for the t-test in C12 =TTEST(B2:B6, B7:B11,2,2) What is the number that you get? This formula gives you a p-value Now press ‘calculate now’ 20 times, and keep a tally of how many p-values are < .05 in 20 simulations
  17. 17. • What has this shown you? • P-values ‘dance about’ even when data are entirely random • On average, one in 20 runs will give p < .05 when null hypothesis is true – no difference between groups See Geoff Cumming: Dance of the p-values Congratulations! You have done your first simulation
  18. 18. We’ll stick with Excel for one more simulation • So far, we’ve simulated the null hypothesis - random data. If we find a ‘significant’ difference, we know it’s a false positive • Next, we’ll simulate data with a genuine effect. • It’s easy to do this: we just add a constant to all the values for group 2 • Since we’re using z-scores, the constant will correspond to the effect size (expressed as Cohen’s d). • Let’s try an effect size of .5 • For cells B7, change the formula to = normsinv(A7)+.5 • Drag the formula down to cell B11 and hit ‘Calculate now’
  19. 19. I’ve added formulae to show the mean and SD for the two groups: = AVERAGE(B2:B6) = STDEV(B2:B6) = AVERAGE(B7:B11) = STDEV(B7:B11) Your values will differ. Why isn’t the difference in means for the two groups exactly .5?
  20. 20. I’ve added formulae to show the mean and SD for the two groups: = AVERAGE(B2:B6) = STDEV(B2:B6) = AVERAGE(B7:B11) = STDEV(B7:B11) Your values will differ. Why isn’t the difference in means for the two groups exactly .5? ANSWER: the means and SDs we specified (means .5 apart, SD of 1) describe the populations; each group is just a small sample from its population
  21. 21. Now add the formula for the t-test Is p < .05 ? It’s pretty unlikely you will see a significant result. Why?
  22. 22. Now add the formula for the t-test Is p < .05 ? It’s pretty unlikely you will see a significant result. Why? ANSWER: Sample too small – can’t pick out signal from noise
  23. 23. What have we learned so far? • The first simulation gave some insights into false positive rates: it shows how you can get a ‘significant’ result from random data • The second simulation illustrates the opposite situation: showing how often you can fail to get a significant p-value, even when there is a true effect (false negative) • This brings us on to the topic of statistical power: the probability of detecting a real effect with a given sample size • To build on these insights we need to do lots of simulations, and for that it’s best to move to R (which hopefully you have already installed: if not, see slides 3-4)
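As a preview of what moving to R buys you, here is a minimal sketch that repeats both Excel exercises thousands of times instead of 20. It is not one of the lesson's scripts: the function name onePvalue, the seed and the number of simulations are assumptions made for this illustration. The setting var.equal = TRUE is used to mirror Excel's TTEST type 2.

```r
# Sketch: the two Excel exercises, each repeated many times.
set.seed(1)          # arbitrary seed so the run can be repeated
nPerGroup <- 5       # 5 scores per group, as in the Excel sheet
nSims <- 10000       # number of simulated studies per scenario

# Returns the p-value from one simulated two-group study.
# 'effect' is added to the mean of group 2, as in the second exercise.
onePvalue <- function(n, effect) {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = effect, sd = 1)
  # var.equal = TRUE matches TTEST type 2 (unpaired, equal variance)
  t.test(group1, group2, var.equal = TRUE)$p.value
}

pNull <- replicate(nSims, onePvalue(nPerGroup, effect = 0))
pEffect <- replicate(nSims, onePvalue(nPerGroup, effect = 0.5))

falsePositiveRate <- mean(pNull < .05)  # proportion 'significant' with no effect
powerEstimate <- mean(pEffect < .05)    # proportion 'significant' with d = .5
print(c(falsePositiveRate = falsePositiveRate, power = powerEstimate))
```

The first proportion should sit close to .05; the second is an estimate of power with 5 per group, and it should be only a little higher.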
  24. 24. Benefits of simulating data in R • Can write a script that executes commands to generate data and then run it automatically many times and store results • Much faster than Excel, and reproducible • Can generate different distributions, correlated variables, etc. • Powerful plotting functions • A good way of starting to learn R Downside: Steep initial learning curve But remember: Google is your friend Tons of material about R on the internet Ready? Create a folder to save your work and fire up R studio!
  25. 25. Self-teaching scripts on Download, save and open this one: Simulation_ex1_intro.R Source pane: script Console: try commands out here Environment: check variables here
  26. 26. First thing to do: Set working directory • Working directory is where R will default to when reading and writing stuff • Easiest way to set it: Go to Session|Set working directory Note that when you do this, the command to set working directory will pop up on the console. On my computer I see: setwd("~/deevybee_repo")
  27. 27. Now we’ll go through the script: it will generate same type of 2-group data as we’ve done in the 2nd exercise in Excel Preliminaries: Install packages. Use Tools|Install Packages • Remember! A common reason for R code not to work is because you have not installed a package that you need. • After installing the package you have to use the library or require command in your script to load it for this session.
  28. 28. To run the code in lines 41-49… • Select the lines of code • Click on the Run button in the top bar • Check what happens in the console Running a script line by line is a good way to learn R
  29. 29. Now start simulating data! • rnorm is an inbuilt R function that generates random normal deviates Now run lines 56-68
  30. 30. Now start simulating data! • rnorm is an inbuilt R function that generates random normal deviates • Note that as well as results you specify being shown on the console, any variables you create are now featured in the environment pane Now run lines 56-68
  31. 31. Think about questions on lines 72-74 • If you’re confused, remember what you’ve been taught in basic statistics (I hope!) about the differences between a population and a sample. • The mean/SD we specify determines characteristics of the population from which we are sampling. See also: simulations-to-understand.html
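The population/sample distinction can be made concrete with a short sketch. The variable names (myN, myM, mySD, myvectorA) are modelled on those mentioned elsewhere in these slides, but the values and the seed are assumptions for this illustration and may not match the script.

```r
# Sketch: a sample mean is not the same as the population mean we specify.
set.seed(2)
myN <- 20    # sample size
myM <- 0     # population mean
mySD <- 1    # population SD

myvectorA <- rnorm(n = myN, mean = myM, sd = mySD)
print(mean(myvectorA))   # close to, but not exactly, myM
print(sd(myvectorA))     # close to, but not exactly, mySD

# Draw 1000 samples and look at how much the sample mean varies
sampleMeans <- replicate(1000, mean(rnorm(myN, myM, mySD)))
print(range(sampleMeans))
print(sd(sampleMeans))   # roughly mySD / sqrt(myN), the standard error
```

Try increasing myN and rerunning: the sample means should cluster more tightly around myM.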
  32. 32. Now we’ll run lines 79-91 to generate data for another group with different mean • If our scores are z-scores and the mean for group 1 is zero, then myM2 corresponds to Cohen’s d measure of effect size. • The final command creates interesting output on the console: results of a Welch 2-sample t-test (i.e. t-test with correction for unequal variances)
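A minimal sketch of this step, assuming variable names similar to those in the script (myM2 and myvectorA/myvectorB appear in these slides; myM1, the sample size and the seed are assumptions for the example):

```r
# Sketch: two groups of z-scores whose population means differ by myM2.
set.seed(3)
myN <- 20
myM1 <- 0      # population mean for group 1
myM2 <- 0.5    # population mean for group 2; equals Cohen's d when SD = 1
mySD <- 1

myvectorA <- rnorm(myN, mean = myM1, sd = mySD)
myvectorB <- rnorm(myN, mean = myM2, sd = mySD)

# t.test() does not assume equal variances unless told to,
# so by default it reports a Welch two-sample t-test
myt <- t.test(myvectorA, myvectorB)
print(myt)
```

Adding var.equal = TRUE to the t.test() call would give the classic equal-variance test instead, like type 2 in Excel's TTEST.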
  33. 33. Advantages of R over Excel • Can easily regenerate the data from the script • Very easy to change one parameter and generate a new dataset • We will see shortly how to repeatedly run a simulation and store results by using a loop • But first we’ll do some data reformatting and show a neat way of plotting the results
  34. 34. Making a data frame • A data frame is a way of storing the data that is rather like an Excel worksheet • You can store observations in rows and variables in columns • Data frames are versatile and can hold different variable types • We’ll put our newly created vectors into a data frame, mydf, with columns for group and score • We can easily view mydf by clicking on mydf in the Environment tab
  35. 35. Filling the data frame You can refer to a specific cell in a data frame with the row and column index, e.g. mydf[3, 2] refers to 3rd row and 2nd column. Note square brackets here. You can refer to a whole column by using $ and its name, e.g. mydf$Group. You can also refer to a specific row of a named column, e.g. mydf$Group[3]. Run lines 117-125
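The three ways of indexing can be tried out on a small example. This sketch builds mydf in a single step with data.frame(), which may differ from how the script fills it; the column names Group and Score follow the slides, while the group sizes and seed are assumptions.

```r
# Sketch: build a small data frame and index it in three ways.
set.seed(4)
myN <- 5
myvectorA <- rnorm(myN, mean = 0, sd = 1)
myvectorB <- rnorm(myN, mean = 0.5, sd = 1)

mydf <- data.frame(
  Group = rep(c(1, 2), each = myN),   # 1,1,1,1,1,2,2,2,2,2
  Score = c(myvectorA, myvectorB)
)

print(mydf[3, 2])       # 3rd row, 2nd column: the third score
print(mydf$Group)       # the whole Group column
print(mydf$Group[3])    # 3rd element of the Group column
```

Note that rep(c(1, 2), each = myN) repeats each value myN times in turn, whereas rep(c(1, 2), times = myN) would alternate 1, 2, 1, 2 and so on.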
  36. 36. Deconstructing the t-test result • One reason for making a data frame is that there are many functions in R that operate on data frames. • One of these is the pirateplot function from the yarrr package. This creates a nice kind of plot called a pirate plot, which shows the distribution of individual data points as well as other summary statistics. We want to make a pirate plot with a header that shows the t-test result • Run line 131: myt <- t.test(myvectorA,myvectorB) The comments explain this more, but basically you can extract bits of the output in myt using $. If you type in the console: myt$ A menu pops up showing you which parameters there are. Now run lines 145-149, which show how you can bolt together bits of output from the t-test to make a useful header for a plot
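Here is one way the pieces of a t-test result could be bolted together into a header. The exact wording of the header in the script is not shown on the slide, so the format below is an assumption; statistic, parameter and p.value are standard elements of the object that t.test() returns.

```r
# Sketch: pull numbers out of a t-test result and build a plot header.
set.seed(5)
myvectorA <- rnorm(20, mean = 0, sd = 1)
myvectorB <- rnorm(20, mean = 0.5, sd = 1)

myt <- t.test(myvectorA, myvectorB)
print(names(myt))   # lists the elements you can extract with $

myheader <- paste0(
  "t = ", round(myt$statistic, 2),     # the t-value
  ", df = ", round(myt$parameter, 1),  # degrees of freedom (Welch-adjusted)
  ", p = ", round(myt$p.value, 3)      # the p-value
)
print(myheader)
```

A string like myheader can then be passed to the main argument of a plotting function, as in the pirateplot call on the next slide.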
  37. 37. Make a pirate plot • Run line 151: • pirateplot(Score~Group,data=mydf,main =myheader, xlab="Group", ylab="Score") Your plot will be different from this because we are generating random numbers that vary on each run. The pirate plot is not a well-known type of graphic ; this is a perfect opportunity to practice Googling to learn more about it – you should try varying the script to see how you can affect the graph
  38. 38. Some general points to help you learn R 1. Basic rule for life, especially in programming: if you don’t know it, Google it In R, Google your error message 2. Best way to learn is by making mistakes If you see a line of code you don’t understand, play with it to find out what it does. Look at Environment tab, or type name of variable on the console to check its value E.g., you want repeating numbers? Type in the console to compare: rep (1,3) and rep (3,1)
  39. 39. Pause to play with the script. Make a note of any questions
  40. 40. Simulation_ex1_multioutput.R This is essentially the same as the previous script, except that: • The plots are sent to a pdf rather than being output on the Plots pane (see comments in the script for explanation) • You run the simulation repeatedly, with two different values for N The script is structured as 2 nested loops: for (i in 1:2){ #line 15 ……… #various commands here for (j in 1:10){ #line 21 ……… #various commands here } } • The first loop runs twice; the second loop, which is nested inside it, runs 10 times on each pass. So overall there are 20 runs • The value i, in the first loop, controls the sample size, which is either myNs[1] or myNs[2] • The value j, in the second loop, just acts as a counter, to ensure that there are 10 repetitions
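To see the nested-loop structure in action without the plotting, here is a stripped-down sketch. It is not the script itself: myNs follows the slide, the sample sizes and effect size are taken from the result slides that follow, and the other names (myES, nRuns, results) and the seed are assumptions. Results are stored in a data frame rather than sent to pdf files.

```r
# Sketch: two nested loops, storing one p-value per run.
set.seed(6)
myNs <- c(20, 100)   # two sample sizes (per group)
myES <- 0.3          # true effect size (Cohen's d)
nRuns <- 10          # repetitions at each sample size

results <- data.frame(N = integer(0), run = integer(0), p = numeric(0))

for (i in 1:2) {            # outer loop: which sample size
  myN <- myNs[i]
  for (j in 1:nRuns) {      # inner loop: counter for repetitions
    groupA <- rnorm(myN, mean = 0, sd = 1)
    groupB <- rnorm(myN, mean = myES, sd = 1)
    myp <- t.test(groupA, groupB)$p.value
    results <- rbind(results, data.frame(N = myN, run = j, p = myp))
  }
}

print(results)
# Proportion of runs with p < .05 at each sample size
print(tapply(results$p < .05, results$N, mean))
```

With only 10 runs per sample size the proportions will jump around from one seed to the next; increasing nRuns gives a steadier estimate.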
  41. 41. Run the whole script! Click on the Files tab in the bottom right-hand pane, and you’ll see you have created two new pdf files (you may need to scroll down to see them): Look at these files, paying particular attention to the proportion of runs where p < .05.
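  42. 42. 10 runs of simulation with N = 20 per group and effect size (d) = .3 [Figure: pirate plots for the 10 runs; asterisks mark runs with p < .05]
  43. 43. 10 runs of simulation with N = 100 per group and effect size (d) = .3 [Figure: pirate plots for the 10 runs; asterisks mark runs with p < .05]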
  44. 44. Points to note • Smaller samples are associated with more variable results. • With small sample sizes, true but weak effects will usually not give you a ‘significant’ result (i.e. p < .05). • In the example here, with effect size of .3, a sample of 100 per group only gives a significant result on around 60% of runs. • This is the same as saying the power of the study to detect an effect size of .3 is around .60 (i.e. 60%) • Many statisticians recommend power should be 80% or more (though this will depend on the purpose of the study).
  45. 45. Body of table shows sample size per group Jacob Cohen worked this all out in 1988
  46. 46. Estimating statistical power for your study For simple designs you can use the G*Power program (or Cohen’s formulae). For more complex designs, simulation is a better approach: just run the analysis on simulated data 10,000 times and then see how frequently your result is ‘significant’ by whatever criterion you plan to use. This requires you to have a sense of what your data will look like, and you have to have an estimate of the smallest effect size that you’d be interested in.
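A minimal sketch of power estimation by simulation for the simple two-group case, checked against R's built-in power.t.test() function (from the stats package that ships with R). The variable names, seed and number of simulations are assumptions for this illustration; for a complex design you would replace the t.test() line with your planned analysis.

```r
# Sketch: estimate power by simulation and compare with a formula-based value.
set.seed(7)
nSims <- 5000   # fewer than 10,000 to keep the run quick
myN <- 100      # sample size per group
myES <- 0.3     # smallest effect size of interest (Cohen's d)

pvals <- replicate(
  nSims,
  t.test(rnorm(myN, mean = 0, sd = 1), rnorm(myN, mean = myES, sd = 1))$p.value
)
simPower <- mean(pvals < .05)   # proportion of 'significant' runs

# Formula-based power for the same two-sided, two-sample design
analyticPower <- power.t.test(n = myN, delta = myES, sd = 1,
                              sig.level = .05)$power
print(c(simulated = simPower, analytic = analyticPower))

# Sample size per group needed for 80% power
nNeeded <- power.t.test(delta = myES, sd = 1, sig.level = .05, power = .80)$n
print(ceiling(nNeeded))
```

The two power values should agree closely, which is a useful sanity check before trusting a simulation of a design that has no formula.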
  47. 47. “Small studies continue to be carried out with little more than a blind hope of showing the desired effect. Nevertheless, papers based on such work are submitted for publication, especially if the results turn out to be statistically significant.” Weak statistical power has been, and continues to be, a major cause of problems with replication of findings 1987 Newcombe
  48. 48. Part 2: Simulating null results to illustrate p-hacking
  49. 49. P-hacking and type 1 error (false positives) Load simulation_ex2_correlations.R Often studies have multiple variables of interest. This script shows you how to use the mvrnorm function from the MASS package to simulate multivariate normal data It also demonstrates the dangers of p-hacking First just ensure the necessary packages are installed and load them using library(): run lines 11-14
  50. 50. Introduction to mvrnorm In R, if you want to know how to use a function, you can just type help, e.g. help(mvrnorm) But often the official help information is technical and unfriendly and you may find more useful and accessible information and examples by Googling. The essential arguments for mvrnorm are the sample size (n), mu, which is a vector of means (one per variable), and Sigma, the covariance matrix of the variables (for z-scores this is the same as the matrix of correlations between variables). We’ll ignore the other arguments provided on the Help page for this demo. To make life easy, we will again create z-scores for our data, so mean will be zero and SD = 1. We can set nvar to 7 and then specify mu = rep(0, nvar). You could just have mu = rep (0,7) or even mu = c(0,0,0,0,0,0,0) But a good script avoids hardcoding variables like this: you want to be able to try running the script with a range of values, and it’s much easier just changing the initial definition of nvar than retyping all the lines of code that use nvar.
  51. 51. Specifying covariance between variables You should be familiar with the correlation coefficient, r If we are using z-score and have r = .5, what is covariance?
  52. 52. Creating the covariance matrix • One benefit of using z-scores is that the covariance matrix is the same as the correlation matrix, so if we specify the amount of correlation between variables, then we can easily make the covariance matrix that we need. • For simplicity, we’ll just assume that all our simulated variables are intercorrelated by the same amount, a value we’ll call myCorr. • So, if we have 7 variables, and myCorr is 0, we will need a matrix like this: • The script achieves this just by making a matrix where all values = myCorr, and then overwriting the diagonal with myVar (which we’ve set to 1) • N.B. The script is set to simulate uncorrelated variables, so off-diagonal values are 0, but you could experiment with other values, by changing myCorr
  53. 53. Running mvrnorm Before starting, it’s a good idea to clear all variables: R does not do that automatically, and it can be problematic if you still have values of variables from an earlier session. To clear them all, click the little broom symbol on the Environment tab. Now run all the lines of the script up to and including line 51. Check the Environment tab, which will show all the variables you have created. Skip over the command on line 60 for the moment. That line starts a loop and if you try to run it, the system will hang waiting for a close curly bracket to match the open curly bracket. (You can get out of that by either hitting escape or typing a close curly bracket on the console). For now, just run line 69 mydata<-mvrnorm(n=myN, mu=rep(myM,nVar), Sigma=myCov)
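Putting the last few slides together, the covariance matrix and the mvrnorm call can be sketched as follows. The names myN, myM, nVar, myCov, myCorr and myVar follow the slides; the values of myN and the seed are assumptions, and the way the matrix is built here is one plausible reading of the description rather than a copy of the script. MASS is a recommended package that normally comes with R.

```r
# Sketch: build a covariance matrix and simulate multivariate normal data.
library(MASS)   # provides mvrnorm
set.seed(8)

nVar <- 7      # number of variables
myN <- 30      # number of cases (rows)
myM <- 0       # mean of every variable (z-scores)
myVar <- 1     # variance of every variable (z-scores)
myCorr <- 0    # correlation between every pair of variables

# Start with a matrix filled with myCorr, then overwrite the diagonal
myCov <- matrix(rep(myCorr, nVar * nVar), nrow = nVar)
diag(myCov) <- rep(myVar, nVar)
print(myCov)

mydata <- mvrnorm(n = myN, mu = rep(myM, nVar), Sigma = myCov)
print(dim(mydata))    # rows = cases, columns = variables
print(head(mydata))   # first six rows
```

Changing myCorr to, say, .5 and rerunning gives variables that are genuinely intercorrelated, which you can check with cor(mydata).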
  54. 54. As with the Excel simulation, the script generates a fresh set of numbers on each run, though we can modify the settings to override this. (Google ‘setting seed’ in R) First six rows of mydata look like this:
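A short sketch of what setting the seed does. set.seed() is a base R function; the seed value 123 and the variable names are arbitrary choices for this example.

```r
# Sketch: the same seed gives the same 'random' numbers.
set.seed(123)
firstRun <- rnorm(5)

set.seed(123)            # reset to the same seed
secondRun <- rnorm(5)

print(identical(firstRun, secondRun))   # same seed, same numbers

thirdRun <- rnorm(5)     # no reset: the generator carries on
print(identical(secondRun, thirdRun))   # different numbers
```

Putting a set.seed() line at the top of a script makes the whole simulation reproducible, which is helpful when you want someone else to get exactly the same output.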
  55. 55. Now we can analyse the simulated data! Let’s look at correlations between the seven variables Pick your favourite variables by selecting two numbers between 1 and 7 Thought experiment: We’ve simulated uncorrelated variables. In a single run, how likely is it that we’ll see: • No significant correlations • Some significant correlations • A significant correlation (p < .05) between your favourite variables
  56. 56. Correlation matrix for run 1 Output from simulation of 7 independent variables, where true correlation = 0 N = 30 Red denotes p < .05 ( r > .31 or < -.31);
  57. 57. Correlation matrix for run 2 Output from simulation of 7 independent variables, where true correlation = 0 N = 30 Red denotes p < .05 ( r > .31 or < -.31);
  58. 58. Correlation matrix for run 3 Output from simulation of 7 independent variables, where true correlation = 0 N = 30 Red denotes p < .05 ( r > .31 or < -.31); There is no relation between variables – why do we have significant values?
  59. 59. Correlation matrix for run 4 Output from simulation of 7 independent variables, where true correlation = 0 N = 30 Red denotes p < .05 ( r > .31 or < -.31); On any one run, we are looking at 21 correlations. So we should use Bonferroni corrected p-value: .05/21 = .002, corresponds to r = .51
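The thought experiment can be answered by simulation. The sketch below is not the lesson's script: it uses base R only, generates independent variables directly with rnorm(), and uses two-sided tests, which is an assumption, so the cut-offs for r may differ from the values quoted on the slides. The function name pairPvalues and the other variable names are hypothetical.

```r
# Sketch: how often do uncorrelated variables yield 'significant' correlations?
set.seed(9)
nVar <- 7
myN <- 30
nSims <- 2000
nPairs <- nVar * (nVar - 1) / 2   # 21 distinct pairs of variables

# Two-sided p-values for all pairwise correlations in one simulated dataset
pairPvalues <- function(n, nvar) {
  mydata <- matrix(rnorm(n * nvar), nrow = n)   # independent z-scores
  rmat <- cor(mydata)
  r <- rmat[upper.tri(rmat)]                    # one value per pair
  tval <- r * sqrt((n - 2) / (1 - r^2))         # convert r to a t-value
  2 * pt(-abs(tval), df = n - 2)
}

allp <- replicate(nSims, pairPvalues(myN, nVar))   # nPairs x nSims matrix

# Proportion of runs with at least one p < .05 among the 21 correlations
anyUncorrected <- mean(apply(allp, 2, function(p) any(p < .05)))
# The same with a Bonferroni-corrected threshold
anyBonferroni <- mean(apply(allp, 2, function(p) any(p < .05 / nPairs)))
# Proportion of runs where one pre-chosen 'favourite' pair has p < .05
favourite <- mean(allp[1, ] < .05)

print(c(anyUncorrected = anyUncorrected,
        anyBonferroni = anyBonferroni,
        favourite = favourite))
```

The contrast between the first and third numbers is the point of the exercise: a single pre-specified correlation is rarely significant by chance, but a table of 21 usually contains at least one that is.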
  60. 60. Now try to work through the script yourself • You can run the script to generate your own table of results (it is set up just to show the table for the final run). • The bit of the script for generating tables showing significant p- values in colour is complex: don’t worry if you don’t understand it. • Most important thing is that you should develop competence to play around with the script and see how the output changes depending on how you change the sample size, the number of variables, and the true correlation between variables.
  61. 61. • Use of .05 cutoff makes sense only in relation to an a-priori hypothesis Many ways in which ‘hidden multiplicity’ of testing can give false positive (p < .05) results • Data dredging from a large set of variables • Multi-way Anova with many main effects/interactions • Cramer, A. O. J., et al (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647. doi:10.3758/s13423-015-0913-5 • Trying various analytic approaches until one ‘works’ • Post-hoc division of data into subgroups In the latter 2 instances, it may be hard to estimate the appropriate correction – many binary choices -> multiplicative effects Key point: p-values can only be interpreted in terms of the context in which they are computed
  62. 62. 1 contrast Probability of a ‘significant’ p-value < .05 = .05 Large population database used to explore link between ADHD and handedness Demonstration of rapid expansion of comparisons with binary divisions
  63. 63. Focus just on Young subgroup: 2 contrasts at this level Probability of a ‘significant’ p-value < .05 = .10 Large population database used to explore link between ADHD and handedness
  64. 64. Focus just on Young on measure of hand skill: 4 contrasts at this level Probability of a ‘significant’ p-value < .05 = .19 Large population database used to explore link between ADHD and handedness
  65. 65. Focus just on Young, Females on measure of hand skill: 8 contrasts at this level Probability of a ‘significant’ p-value < .05 = .34 Large population database used to explore link between ADHD and handedness
  66. 66. Focus just on Young, Urban, Females on measure of hand skill: 16 contrasts at this level Probability of a ‘significant’ p-value < .05 = .56 Large population database used to explore link between ADHD and handedness
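The probabilities on the last five slides follow from a simple formula: if each of k contrasts is an independent test with a .05 chance of a false positive, the chance of at least one false positive is 1 - (1 - .05)^k. A short sketch (the variable names are assumptions, and independence of the contrasts is assumed):

```r
# Sketch: chance of at least one 'significant' result among k independent tests.
nContrasts <- c(1, 2, 4, 8, 16)   # doubles with each binary division
alpha <- .05

pAtLeastOne <- 1 - (1 - alpha)^nContrasts
print(data.frame(nContrasts, pAtLeastOne = round(pAtLeastOne, 2)))
```

Rounded to two decimal places, these reproduce the values of .05, .10, .19, .34 and .56 shown on the slides.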
  67. 67. Further reading 1956 De Groot Failure to distinguish between hypothesis-testing and hypothesis-generating (exploratory) research -> misuse of statistical tests de Groot, A. D. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, et al]. Acta Psychologica, 148, 188-194. doi:
  68. 68. R scripts available on : • Simulation_ex1_intro.R Suitable for R newbies. Demonstrates ‘dance of the p-values’ in a t-test. Bonus, you learn to make pirate plots • Simulation_ex2_correlations Generate correlation matrices from multivariate normal distribution. Bonus, you learn to use ‘grid’ to make nicely formatted tabular outputs. • Simulation_ex3_multiwayAnova.R Simulate data for a 3-way mixed ANOVA. Demonstrates need to correct for N factors and interactions when doing exploratory multiway Anova. • Simulation_ex4_multipleReg.R Simulate data for multiple regression. • Simulation_ex5_falsediscovery.R Simulate data for mixture of null and true effects, to demonstrate that the probability of the data given the hypothesis is different from the probability of the hypothesis given the data. Two simulations from Daniel Lakens’ Coursera Course – with notes! • 1.1 WhichPvaluesCanYouExpect.R • 3.2 OptionalStoppingSim.R Now even more: See OSF!