Learning R while exploring statistics


Introduction to simulating datasets in R. No prior knowledge of R required. Illustrates idea of spurious correlation, and accompanies this blogpost:


This exercise is designed to help you learn R while at the same time gaining insights into the phenomenon of illusory correlation. We will go through the following steps:

1. Downloading R and R Studio, an interface to the R programming language that is rather easier to work with than the basic interface.
2. Familiarisation with basic operations in R.
3. Generating simulated data: two correlated variables, X and Y.
4. Generating simulated data from two groups with different means on uncorrelated variables U and V to demonstrate spurious correlation between U and V.
5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V.
6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables.

These instructions apply to those working on a PC; I don't know whether the steps are equivalent on a Mac. For steps 5 and 6 it's assumed you have a basic understanding of simple regression.

1. Downloading R and R Studio

Downloading R

R is a powerful language for statistical computing, but much of the documentation is written for experts, and so it can be daunting for beginners. If you go to the website:

    http://www.r-project.org/

you will see instructions for how to download R. Do not be put off by the instruction to "choose your preferred CRAN mirror": this just means you should select a download site from the list provided that is geographically close to where you are.

You may then be offered further options that you may not fully understand.
Just persevere by selecting the Windows option from the "Download and install R" section, and then select base, which at last takes you to a page with straightforward download instructions. Installation of R will create a Start Menu item and an icon for R on your desktop.

Downloading R Studio

To download R Studio, go to this website and follow the instructions:

    http://rstudio.org/

If for any reason you prefer not to use R Studio, the examples should all work from the original R interface, but your screen may look different, and it may be difficult to arrange items such as figures in a sensible way.

2. Familiarisation with basic operations in R

After opening R Studio your screen will be divided into several windows. Move your cursor to the window called R console, in which you can type commands. You will see a > cursor. This cursor will not be shown in the examples below, but it indicates that the console is awaiting input from you.

At the > cursor, type:

    help.start()

As with other programming languages, you hit Enter at the end of each command. This will open a window showing links to various manuals. You may want to briefly explore this before going further.

Just to familiarise yourself with the console, type:

    1+2

R evaluates the expression and you see output:

    [1] 3

Version 1.1, 9th June 2012
The [1] at the beginning of the output line indicates that the answer is the first element of the variable. This looks confusing if you just have a single number, as in this case, but, as we will see, output can consist of an array of numbers.

Now type:

    x = 1+2

Nothing happens. But the variable x has been assigned, and if you now type x on the console, you will again see the output

    [1] 3

In R, the results of variable assignments are not shown automatically, but you can see them at any time by just typing the name of the variable. You can also see all current variables in the Workspace screen on the right.

The value assigned to variable x will remain assigned unless you explicitly remove it using the rm command. Type:

    rm(x)

You now see that x has disappeared from the Workspace. If you type x again, the console gives the message:

    Error: object 'x' not found

You can repeat an earlier command by pressing the up arrow until it reappears. Use this method to redo the assignment x=1+2, and then type X. Again you get the error message, because R is case-sensitive, and so X and x are different variables.

Now type:

    y = c(1, 3, 6, 7)

The Workspace tells you y is a numeric variable with four values, i.e. a vector. To see the values, type y on the console. You will see the vector of numbers:

    [1] 1 3 6 7

The c in the previous command is not a variable name, but rather denotes the operation of concatenation. It just instructs R to create a variable consisting of the sequence of material that follows in brackets.

Now type:

    x=

and hit Enter. The cursor changes to +. This is R telling you that the command is incomplete. If you now type 1+2 followed by Enter, your regular cursor returns, because the command is completed.

It can happen that you start typing a command and think better of it. To escape from an incomplete command, and restore the > cursor, just hit Escape.

The Console is useful for doing quick computations and checking out commands, but in general, when you do computations, you will want to use a script, i.e. a set of commands that you can save, so you can repeat the sequence of operations at any time. The script is written in the Source window (also known as the Editor window).

From the menu at the top of the screen select File|New|R script. You will see a new tab in the Source window, labelled Untitled1. You want to save it with a name. Select a name such as Demo1 and type this in the Source window, preceded by the symbol #.

It is important that the name contains no blank spaces. If you make a script name with blank spaces, this can create havoc later on, because when you try to execute it, R will interpret all but the first word as commands, and you will get misleading error messages that will have you scratching your head as to what they mean.
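To consolidate the console exercises above, here is a minimal sketch of what a first script such as Demo1 might contain. The vector y comes from the text; the use of 1:30 to show how the bracketed index works on longer output is my own illustration and not part of the original exercise.

```r
# Demo1
# Small script revisiting the console exercises: vectors, indexing, case

# c() concatenates its arguments into a vector; nothing is printed on assignment
y = c(1, 3, 6, 7)

# Typing the name prints the values, prefixed by the index of the first one shown
y

# Square brackets pick out individual elements
y[2]          # the second element
length(y)     # the number of elements

# With a longer vector the output wraps onto several lines, and each line
# starts with the index of its first element; this is what [1] is telling you
1:30

# R is case-sensitive: y exists, but Y has not been created in this script
exists("y")
exists("Y")
```

Running the lines one at a time, rather than all at once, makes it easier to match each command to its output in the Console.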
The hash symbol that you typed before the script name is used to create a comment in a script, i.e. a line that is used to remind the user of important information, but which is not executed when the script runs. It is customary to put the title of the script, plus information about its function, author and date at the head of the script.

Select the menu command File|SaveAs to save the script with that name.

Currently, your script doesn't do anything. Let's give it some content. In the Source window type:

    x=2+3
    y=4+5
    z=x+y

Now select the top menu item Edit|Run Code|Run All.

As the script executes, you will see the commands in the script repeated in the Console window, and the values of the variables x, y and z in the Workspace window. These variables will remain assigned to these values until explicitly cleared. You can test this by typing a command at the console such as:

    x-y

which will give the answer -4.

Important: Traditionally, R scripts use <- instead of =. So, you will see instances of scripts which have commands such as a <- 1+3. This is equivalent to a = 1+3. It is also possible to have the arrow going the other way, i.e., 1+3 -> a, which means the same thing.

My view of life is that you should never make two keystrokes when one will do, and so I persist with the use of the equals sign, but R purists disapprove of this. One reason for avoiding = in assigning values to variables is that it can be confusing, because the equals symbol is also used in other contexts, such as judging whether two things are the same (which in R is written with a double equals sign, ==). For the present, I'm not going to worry you further about this, but you may want to squirrel that fact away. Confusion between different uses of the = operator causes much grief, not just in R but in most programming languages.

Loops: A loop is a way of repeatedly executing the same code. Suppose we wanted to print out the ten times table, we could type 1*10, 2*10, 3*10, and so on.
But a simpler method is to use a loop, where we multiply 10 times a variable, myx, and specify the range of values that myx will take at the start of the loop. Thus we can type in the commands:

    for (myx in c(1:10)) {
      print(10*myx)
    }

The first line specifies the values that myx can take, i.e. c(1:10), which is the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The program executes all the commands between curly brackets repeatedly, moving myx on to the next value each time it does so, until it gets to the final value, whereupon it exits the loop.

Stopping a program: Sometimes a program has been written in a way that it keeps running and never stops. If you need to abort, type Ctrl+c; in R Studio, pressing Escape while the cursor is in the Console also interrupts a running command.

Commenting: A good script will contain many lines preceded by #. This indicates that the line is a comment: it does not contain commands to be executed, but provides explanation of how the script works.

Before you go any further, create a new directory that will contain all of your scripts, data, and workspace for a project. Then go to the menu and select Tools|Set Working Directory|Choose Directory and navigate to your new directory. This means that all your work will be saved in one place. Whenever you start up R from a file in that directory, it will continue as your working directory.

A note on quotes: If you paste a script into your R console or browser, quotes may get reformatted, causing an error. Always check: for R, single quotes should be straight quotes, not smart quotes (i.e. quotes that slant or curl in a different direction at the start and end of a quoted section). You may need to retype them if your system has reformatted them.

Further reading

The best way to learn R is to play with it. You should try typing in commands to see what happens. Use the R Manuals from the Help screen to get started. In addition, these texts are recommended.

Braun, W. J., & Murdoch, D. J. (2007). A first course in statistical programming with R. Cambridge: Cambridge University Press.

Crawley, M. J. (2007). The R Book. Chichester, UK: Wiley.

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S, 4th edition. New York: Springer. (Do not be put off by the title: it really should be entitled "with S and R".)

3. Generating simulated data: two correlated variables, X and Y

An important aspect of R is the ease with which you can generate simulated data. Playing with simulated data is one of the best ways of gaining an intuitive grasp of statistics. You can create a dataset with certain characteristics, and then see what happens when you analyse it in different ways.

Most introductions to statistics ignore the potential of simulated data, and simulation is often seen as an advanced topic. My view is that it should be one of the first things you learn to do.

As a first exercise in running a script in R, we shall generate a simulated set of data for two variables, look at some basic statistics for the variables, plot them, and save the data. We will be using the data for more interesting purposes later on, but for the time being, the aim is to familiarise you with some key R commands.
In addition, it is very useful to know how to simulate datasets with specific characteristics, as these can be used to check how various analyses work.

Unfortunately, most people do find R commands quite daunting, and the command needed to create simulated data will probably look horrific if you are a newbie. Also, in R the help commands are often not all that helpful, as they are written for statisticians. Don't lose your nerve. I shall walk you through it and all will become clear.

One of the first things you need to understand about R is that there is a huge number of functions that you can use to carry out various statistical, mathematical and graphic operations, but they aren't all available when you start up R. Many of them are available in packages which you have to specify if you want to use them. There's a nice explanation of how you can find and use packages here: http://ww2.coastal.edu/kingw/statistics/R-tutorials/package.html.

We're going to use commands from a package called MASS, which contains functions and datasets from Modern Applied Statistics with S by Venables and Ripley (see Further reading above).

All we need to do is to include the following line in our script:

    require(MASS)

Once that command is executed, all the functions from MASS will be available for us to use.

When learning R, it's a good idea to run each new command and see what, if anything, happens, and whether the workspace changes. If you just highlight one or more commands in the Editor window and then hit the Run button with a green arrow at the top of the window, this runs just those commands. If you run the require command as above, the Console just reassuringly tells you it is loading MASS.
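As a small aside on loading packages, the sketch below shows two other ways of getting at a package's functions that you will meet in other people's scripts. It assumes MASS is installed, which is the case for standard R distributions; nothing here is needed for the rest of the exercise.

```r
# library() is the more common way to load a package. It stops with an
# error if the package is missing, whereas require() returns FALSE and
# gives a warning, so a mistake is noticed sooner with library().
library(MASS)

# Check that a function from the package is now available
exists("mvrnorm")

# A single function can also be used without loading the whole package,
# by prefixing it with the package name and two colons
args(MASS::mvrnorm)
```

Either form works for this tutorial; the scripts that follow keep the require(MASS) line used in the text.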
Now, we're going to generate two columns of correlated numbers, X and Y. We'll start by creating a variable to hold their names. The next line of your script should be:

    mylabels=c('X','Y') # Put labels for the two variables in a vector

Remember: You could just omit the bit after the hash, which is a comment. It's there to remind you what you are doing. It may be obvious now, but, trust me, it won't be if you come back a week later. You should add your own comments, using language that will be helpful to you.

If you run this command you will see that the Workspace shows mylabels as a character variable with two values. It knows to treat mylabels as a character, rather than number, variable because you have enclosed the labels in quotes. Does it matter if you use single or double quotes? I couldn't remember, so I just tried making a different variable by typing a command on the console with double quotes; you should do the same. It's always good practice to just play around with commands and see what happens.

We are going to use a fancy command from MASS called mvrnorm. It's not uncommon to forget the precise format that a command needs, but help is at hand. On the console type help(mvrnorm), and you will find that the Help screen shows you the way the command is used. It first tells you what the arguments are for the command, i.e. the things you need to specify to make it work, it then terrifies you with a more technical explanation, and finally gives a worked example. The worked example may be helpful or may just baffle you completely.

Let's look at mvrnorm. The help screen starts as follows:

    mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE)

and then gives an account of what each argument is.

    n          the number of samples required.
    mu         a vector giving the means of the variables.
    Sigma      a positive-definite symmetric matrix specifying the covariance matrix of the variables.
    tol        tolerance (relative to largest variance) for numerical lack of positive-definiteness in Sigma.
    empirical  logical. If true, mu and Sigma specify the empirical not population mean and covariance matrix.

Thus the first things you need to specify are the number of cases to simulate (n), the mean of each variable (mu), and the covariance matrix (Sigma).

We are going to be working with z-scores, to make life easier. Remember that for z-scores, a correlation is equivalent to a covariance, and the SD and variance are both equal to 1.

We first specify the correlation that we want:

    myr = .5

Add that to your script, and run it, so that we have a value in myr. For Sigma, we need to specify the following matrix:

    1    myr
    myr  1

In R, you can create a matrix using the c (concatenate) command, but if you just typed c(1, myr, myr, 1), then this wouldn't work. Why not? Try typing this at the console and see. You'll find you have the right numbers, but they aren't in a 2x2 matrix. To get them properly arranged, you need to explicitly specify that you want a matrix with two rows and two columns.

So the full command is

    mysigma = matrix(c(1,myr,myr,1),2,2)
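The difference between the plain vector and the matrix can be checked directly, as in the sketch below. The last part goes one step beyond the text: it assumes, purely for illustration, two variables that are not z-scores and have standard deviations of 10 and 15 (the name mysd and those values are my own), and shows how the covariance matrix would then be built from the same correlation matrix.

```r
myr = .5

# c() on its own gives a plain vector of four numbers, not a matrix
myvec = c(1, myr, myr, 1)
is.matrix(myvec)

# matrix() arranges the same numbers into 2 rows and 2 columns
mysigma = matrix(myvec, nrow = 2, ncol = 2)
mysigma
isSymmetric(mysigma)

# Illustration only: for variables with SDs of 10 and 15, the covariance
# matrix is the correlation matrix scaled by the SDs on both sides.
# The off-diagonal value is then 10 * .5 * 15.
mysd = c(10, 15)
mycov = diag(mysd) %*% mysigma %*% diag(mysd)
mycov
```

With z-scores the scaling step is unnecessary, which is why the tutorial can pass the correlation matrix straight to mvrnorm as Sigma.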
The last two numbers in the matrix command indicate we want 2 rows and 2 columns. Look at mysigma. You could then try making another matrix, but with 1, 4 rather than 2, 2 at the end. I can't stress enough that to understand commands, you just have to try them out. If you aren't sure how something works, tweak a command and see what happens.

Note: there's nothing to stop you typing .5 rather than myr in the command above. It will give the same answer. But we want a flexible script that will allow us to play around and look at different values of the correlation, and if we use the variable myr in the code, rather than a specific value, this allows us to do that easily.

So all you now need is to specify the number of cases and the mean values for X and Y. We do this with the commands:

    myn=50 # we're going to create 50 rows of data
    mymean=c(0,0) # means are zero for both X and Y

We are now ready to go! What about the other arguments, tol and empirical? They are optional and we'll leave them alone for the moment, though we will look at empirical later on.

We need a variable name for our simulated data. Let's call it myarray. So we type:

    myarray=mvrnorm(n=myn, mymean,mysigma)

Now run the whole script. Each command is reflected in the console as it executes. But where are the results? The Workspace now confirms that you have created myarray, which is a matrix with 50 rows and 2 columns. To look at the results, just type myarray on the console. There are your 50 paired z-scores!

Before going further, I'll just explain why I've created variables that all start with my. This is not essential, but it's a fairly common method. It has the advantage that you are unlikely to inadvertently use a variable name that corresponds to an existing R command, and when reading a script it makes it generally easier to distinguish your variables from other parts of the R language.

We have created paired variables, but they aren't yet labelled. Assigning column names to a matrix in R is easy. Remember, we created mylabels earlier.
We can assign these as our column names as follows:

    colnames(myarray)=mylabels

So now you have built up a whole script to generate paired numbers, which looks like this:

    # simulate_XY
    # Script to simulate z-scores X and Y, with specific correlation
    require(MASS) # Load functions from Modern Applied Statistics with S
    mylabels=c('X','Y') # Labels to be used later for our variables
    myr=.5 # Correlation (can be changed)
    mysigma = matrix(c(1,myr, myr,1),2,2) # 2 x 2 covariance matrix
                                          # (with zscores, equiv to correlation matrix)
    myn=50 # N rows of data to simulate
    mymean=c(0,0) # Means for each variable (zero for zscores)
    myarray=mvrnorm(n=myn, mymean,mysigma) # create array of simulated data
    colnames(myarray)=mylabels # Assign labels to columns of simulated data

But you may be suspicious. How do you know that the numbers you have generated have a mean of zero and are actually correlated .5? You can use R commands to find out.

This command gives you a range of descriptive statistics, including the means:

    summary(myarray)
and this one gives the correlation matrix:

    cor(myarray)

At this point, you may start to think (depending on your locus of control) either that you have done something wrong, or that R is not very good. It's highly likely that your means will differ from zero, and the correlation will be smaller or bigger than .5. The reason is that we did not specify empirical = TRUE. R has faithfully generated a sample of observations from a population of values where the true correlation is .5, but because of sampling error, the observed value in this sample is likely to deviate from .5.

If you re-run the program, but this time alter the mvrnorm command to:

    myarray=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)

then you will find the means are zero (or, more likely, a real number that is infinitesimally small) and the correlation is .5.

Alternatively, you could remove the empirical argument (or specify empirical=FALSE, which has the same effect), but specify n = 50000, or another very large number. The larger the sample you take from the population, the closer the sample correlation will approach the population correlation.

It's always a good idea to plot data as well as looking at summary statistics. To see a scatterplot of your data, add this command to your script:

    plot(myarray)

A graph will now pop up in the Plots tab of the right hand lower window.

Finally, you might want to save your simulated data so you can use them at a later time. This command will write a data file to your current directory:

    write.table(myarray,"mysimdata")

If you want to get your data back on another occasion, this command will read the saved data into a data frame called newdata:

    newdata=read.table("mysimdata")

The mvrnorm command uses a random number generator, which means that each time you run the script, different numbers will be generated. If you want to always get the same numbers, you can do so by just specifying a seed for the random number generator.
This can be any number, but provided it is the same number each time, you'll get the same result. Just put this command somewhere before the mvrnorm command:

    set.seed(2)

If you have started from scratch and got this far, then you should take a break and reward yourself with a cup of coffee or whatever other substances hit the spot for you.

4. Generating simulated data from two groups with different means on uncorrelated variables U and V

We're now going to apply what we've learned to generate data from two separate groups on two variables that are uncorrelated. The only difference is that the means differ on both variables for the two groups. Let's set means for U and V for group A as -1 and for group B they'll be 1. We'll generate 60 cases for each group. We'll call these datasets myarrayA and myarrayB.

If you've followed what we've done so far, you should be able to work out how to do this. It will be a good exercise to try, as you learn R by thinking it through, rather than by just copying. But I'll give you a script to do it anyway, in case you get stuck:

    #demo_spurious_corr_script
    require(MASS) #Load functions from Modern Applied Statistics with S
    mylabels=c('U','V')
    myr=0 #U and V are uncorrelated, and so r is set to zero
    mysigma = matrix(c(1,myr, myr,1),2,2)
    myn=60
    set.seed(3)
    #Array for group A
    mymean=c(-1,-1) #mean zscore for group A
    myarrayA=mvrnorm(n=myn, mymean,mysigma) #Generate uncorrelated U and V for grp A
    colnames(myarrayA)=mylabels
    summary(myarrayA)
    cor(myarrayA)
    plot(myarrayA)
    #Array for group B
    mymean=c(1,1) #mean zscore for group B
    myarrayB=mvrnorm(n=myn, mymean,mysigma) #Generate uncorrelated U and V for grp B
    colnames(myarrayB)=mylabels
    summary(myarrayB)
    cor(myarrayB)
    plot(myarrayB)

We now want to stack the two arrays, one on top of the other, into one long array, and give this combined array a new name, myarrayAB. This can be achieved with a single command for concatenating rows, as follows:

    myarrayAB=rbind(myarrayA,myarrayB)

We can then look at the correlation for the combined groups:

    cor(myarrayAB)

Even though the correlation within either group was set to zero, the correlation for the combined groups is around .5 and highly significant. This is the phenomenon of spurious correlation.

To make it more concrete, consider if U and V were height and chest hairiness and groups A and B were males and females. Since men tend to be taller and hairier than women, you could find a spurious correlation between height and hairiness in a combined group, even though they are uncorrelated within either sex.

One reason I like simulations is that they can give you new insights into such phenomena. Note that we specified massive mean differences between our groups: one group with a mean z-score of +1 and the other with mean z-score of -1. When I first attempted this simulation, I used much smaller group differences, and was surprised at how hard it was to generate a spurious correlation. With a simulation like this, you can play around and get a good feel for the phenomenon by repeatedly generating datasets with different values.
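Here is a rough sketch of that kind of exploration. The large group size and the particular mean differences tried are my own choices, made so that sampling error is small, and the variable names myd, mydiffs, myobs and myexp are invented for this example. For two equal-sized groups with means of -d and +d on both variables, unit variance and zero correlation within each group, pooling the groups gives a covariance of d^2 and a variance of 1 + d^2, so the expected spurious correlation is d^2 / (1 + d^2).

```r
require(MASS)
set.seed(3)
myn = 5000                              # cases per group (large, to limit sampling error)
mysigma = matrix(c(1, 0, 0, 1), 2, 2)   # U and V uncorrelated within each group
mydiffs = c(0, .25, .5, 1, 2)           # group means are -d and +d on both variables

for (myd in mydiffs) {
  myarrayA = mvrnorm(n = myn, c(-myd, -myd), mysigma)
  myarrayB = mvrnorm(n = myn, c(myd, myd), mysigma)
  myarrayAB = rbind(myarrayA, myarrayB)
  myobs = cor(myarrayAB)[1, 2]          # correlation in the combined sample
  myexp = myd^2 / (1 + myd^2)           # value expected from the formula above
  print(round(c(d = myd, observed = myobs, expected = myexp), 3))
}
```

The formula gives .5 when d is 1, matching the simulation in the text, but only .2 when d is .5 and about .06 when d is .25, which fits the experience that small group differences produce very little spurious correlation.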
The phenomenon of spurious correlation is a source of major concern, especially for those interested in correlational data, but my impression is that its importance may have been overemphasised, because in practice it doesn't become a problem except in quite extreme situations where you have two groups with very different mean values.

5. Demonstrating how incorporating group identity in a linear model unmasks the spurious nature of the correlation between U and V

Let us stick with the interpretation of our simulated data as representing height and hairiness in males and females (ignoring the fact that the group mean differences are vastly greater than would be realistic). We now need to add to our combined dataset another column that specifies gender.

The R command rep will just create a vector of repeated numbers. We make a set of 60 values = .5 for males, and 60 values = -.5 for females. The reason for picking these specific values is because it helps interpretation of regression output if we set the average for two groups to zero and make the mean difference between them equal to one. However, it's not essential to do this, and you could have picked other numbers, such as 0 and 1, to indicate group identity.

    males=rep(.5,myn) #Create vector with myn repetitions of value .5
    females=rep(-.5,myn) #Create vector with myn repetitions of value -.5

Having made our two sets of numbers, we then join them together in a variable called gender. Group A, which has the lower means, occupies the first 60 rows of myarrayAB, so the female codes go first:

    gender=c(females, males)

Run these commands and then type gender at the console to check the result.

All that is now needed is to bolt this column on to our existing myarrayAB, which we can do with a single command for concatenating columns, cbind.

    myarrayAB=cbind(gender,myarrayAB)

Note that I have created a lot of intermediate variables in the course of generating myarrayAB. This is unnecessary and uses up memory. It would be possible to combine several steps in one command and so avoid creating the intermediate variables. However, when learning R, I think it is helpful to break commands down into small steps and create new variables, as this allows you to see the logic of what is being done, and to check the values of each variable. It also makes your scripts easier to understand when you come back to them later. Very experienced programmers may write much more compact code than this, but with modern computers, memory is seldom a problem unless working with very large data arrays, and so, apart from demonstrating how clever you are, compact code doesn't serve much function.

We now want to do a regression analysis. We will start with simple regression of V on U for the combined group data.

R has many powerful commands for doing regression, but it requires that the data are formatted in what is called a data frame. Fortunately, this transformation is trivially easy: we just add the command:

    mydata=data.frame(myarrayAB)

Commands for regression in R are formulated in terms of the general linear model.
This is a very general and flexible approach to statistical analysis that readily incorporates the more traditional methods beloved of psychologists such as analysis of variance. However, I suspect that many psychologists reading this won't find it a very intuitive way to think about data, and it takes a while to map the R commands onto pre-existing statistical knowledge.

The other thing that can be puzzling is that with programs such as SPSS, we are used to running a command and then looking at the output screen. Although R can be used in an analogous way, it is more usual to write the results to another variable. The variable that holds the results is likely to be a fairly complex structure, as we shall see. But the basic idea is that you don't just use a command to do the analysis: you actually specify a name for the output of the analysis.

The simplest form of regression is pretty easy. The command lm just stands for linear model, and requires two obligatory arguments: you have to specify a formula that indicates the relationship between predicted and predictor variables, and specify the dataset used to estimate regression coefficients. So let's illustrate this with our U and V variables.

Add this command to the script:

    myreg1=lm(V~U,mydata)

and then inspect the myreg1 variable that is created. This contains two coefficients: an intercept, which is close to zero, and a slope, which is close to 0.5.

Note that when you type myreg1 you also get information about the formula used to generate the coefficients, labelled Call. The output of lm contains a complex set of varied information in a structure. If you want to look at just part of the structure, you have to use the $ sign to indicate which bit. Try this, by just typing at the console:

    myreg1$call

and

    myreg1$coef

You will see that the portion after the $ indicates which bit of the myreg1 structure is referred to.

The term V~U tells the program to fit a straight line according to the formula:

    V = b1 + b2.U

where b1 is an intercept and b2 is a slope. It is the intercept and slope that are then generated when the lm command is executed.

We can use these outputs to plot the regression line. First plot the raw data, with the predictor U on the horizontal axis and V on the vertical axis, to match the regression formula. This command will achieve that:

    plot(V~U, mydata)

The command abline plots a straight line with a given intercept and slope. You could add a straight line through the intercept zero and with slope of 1, as follows:

    abline(0,1)

The regression line is simply the straight line with intercept and slope corresponding to the computed regression coefficients, and so can be plotted just by typing:

    abline(myreg1$coef)

The lty argument allows you to specify the type of line you want. This command will draw the regression line as a long-dashed line (lty=3 would give a dotted one):

    abline(myreg1$coef,lty=5)

As an aside here, I haven't used R very much, and when I first saw a command with lty I was confused and thought it was some kind of variable. This is, in my experience, a common difficulty with R. Various letter sequences that look like variables or functions aren't. What did I do? I Googled "R lty" and immediately all became clear. Perhaps the single most important advice if you want to learn R is to just use Google if you get stuck.

We now want to look at the regression with gender included. A simple modification to the syntax achieves this.
We have taken care to code gender so that the sum of the two gender codes is zero, and we can include it in the linear model, even though it is a categorical variable.

Here is the command:

    myreg2=lm(V~U+gender,mydata)

This corresponds to the regression equation:

    V = b1 + b2.U + b3.gender

If we type myreg2, we see that the output now has one intercept and two regression coefficients, like this:

    (Intercept)           U      gender
        0.03357     0.04529     1.68913

Your values may differ from this because the simulated data will be different, but the overall pattern will be similar. Note that the regression coefficient associated with U is now close to zero, whereas that associated with gender is much bigger.

Once we have run the model we can get much more detailed statistical output by requesting a summary, as follows:

    summary(myreg2)
Now we have not only the coefficients, but their standard errors, associated t-values and significance levels. This confirms that gender is a substantial predictor of V, and U is not.

Finally, you can use the anova command to produce an anova table comparing the fit of the two models:

    anova(myreg1,myreg2)

I've learned a lot about using R for regression analysis from this site. It also has information on how to do diagnostic plots, for instance. However, for the present, I won't get diverted into that, but will rather press on to look at what happens if you have groups defined on a variable that is highly correlated with one of the dependent variables.

6. Demonstrating how removing the effect of group will be misleading if group identity is highly dependent on one of the variables

You should by now be able to follow this script, which is heavily commented to explain each step. This time we are going to generate a multivariate normal distribution with 3 variables. Two of them, L1 and L2, are language measures and A is an auditory measure. The language measures show moderate correlation with the auditory measure and are highly intercorrelated with one another. Group identity (control or language impaired, coded 1 or -1 in the script) is defined in terms of whether the score on L1 is above a z-score of -1 or not. This, then, is analogous to the case of dyslexia or language impairment, where we define whether or not the child has the diagnosis on the basis of a low test score.

In a case like this, removing the effect of group can abolish the relationship between L2 and A, simply because L1 and L2 are highly intercorrelated.
It would be quite wrong to conclude from this that L2 and A are not related.

#demo_spurious_corr_script3
# Using a group variable that is highly correlated with one variable
# With these settings, gives the result that by including SLI category
# you remove influence of L2
require(MASS) #Load functions from Modern Applied Statistics with S
mylabels=c('L1','L2','A') #3 variables, two language and one auditory
myr=.8 #correlation between the language measures
myr2=.3 #correlation of both language measures with auditory
mysigma = matrix(c(1,myr,myr2, myr,1,myr2, myr2,myr2,1),3,3)
myn=60
set.seed(6) #change or comment out this line to get different set of estimates
mymean=c(0,0,0) #Means for L1, L2, and A are zero
myarray3=mvrnorm(n=myn, mymean,mysigma,empirical=TRUE)
colnames(myarray3)=mylabels
summary(myarray3)
cor(myarray3)
myL1=myarray3[,1] #first column
# Now determine which cases are control or SLI and put in mygroup variable
mygroup=rep.int(-1,myn) #default is SLI, coded -1
mycon=which(myL1> -1) #row index of those with L1 in con range
mygroup[mycon]=1 #These rows are assigned group code of 1 (control)
myarrayAB=cbind(mygroup,myarray3) #add mygroup to the data array
mydata=data.frame(myarrayAB)
#Regression with only Group included
myreg1=lm(A~mygroup,mydata)
summary(myreg1)
#Regression with both group and L2 included
myreg2=lm(A~L2+mygroup,mydata)
summary(myreg2)
anova(myreg1,myreg2)
#Regression if we exclude group ID
myreg3=lm(A~L2,mydata)
summary(myreg3)

The point I want to make with this simulation is that if we want to take out the effect of group identity from a correlation, then we need to think carefully about the logic of what we are doing.

In the previous example of spurious correlation, we defined gender quite independently of our two measures, height and hairiness. Although males and females differed substantially on both measures, their gender was not determined by those measures. In any logical causal route, we can confidently treat gender as a primary cause, and so it makes sense to take out its effect.

For certain developmental disorders (and indeed other conditions), the causal route is much less certain, because the disorder is diagnosed on the basis of measured variables. So, for instance, dyslexia is defined in terms of low scores on reading measures. In the simulation above, we looked at the correlation between L2 and A, and defined our disorder in terms of L1 - which was highly correlated with L2. We could have defined dyslexia in terms of L2 - you might like to try that: it will achieve a similar effect. The results we got from our simulation are actually sensible, but there is a danger they will be misinterpreted. What they are actually telling us is that language measures and auditory measures are significantly correlated, and this is evident regardless of whether we use a categorical language measure, where group identity is determined by cutoff on a test, or a quantitative measure. What this analysis is definitely not saying is that the correlation between language and auditory measures is spurious.

It's possible to imagine a situation where you could have a spurious association with these kinds of variables. For instance, poor social environment may affect both language measures and auditory measures.
To show that, we'd need to incorporate a measure of social environment in our regression analysis. But the bottom line is that if we want to argue that an association between variables X and Y is spurious, we must have a third variable, Z, that is (a) measurable and (b) not dependent on X or Y. Z may be highly correlated with X and Y: that's not a problem. The problem is when Z is determined by X or Y.
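As a follow-up to the suggestion above of defining the disorder in terms of L2 rather than L1, here is a minimal sketch of that variant. It reuses the settings from the script (correlations of .8 and .3, 60 cases, cutoff at a z-score of -1). The names mydataL2, groupL2 and reg_* are made up for this sketch, and no output is shown, so run it and inspect the summaries yourself.

```r
# Sketch of the "define group on L2" variant; object names are hypothetical.
library(MASS)  # provides mvrnorm

set.seed(6)
myr  <- 0.8   # correlation between the two language measures
myr2 <- 0.3   # correlation of each language measure with auditory
mysigma <- matrix(c(1,    myr,  myr2,
                    myr,  1,    myr2,
                    myr2, myr2, 1), 3, 3)

simdata <- mvrnorm(n = 60, mu = c(0, 0, 0), Sigma = mysigma, empirical = TRUE)
colnames(simdata) <- c("L1", "L2", "A")
mydataL2 <- data.frame(simdata)

# Group is now determined by a cutoff on L2 itself:
# 1 = control (L2 above -1), -1 = language impaired
mydataL2$groupL2 <- ifelse(mydataL2$L2 > -1, 1, -1)

reg_group_only <- lm(A ~ groupL2, data = mydataL2)       # group only
reg_both       <- lm(A ~ L2 + groupL2, data = mydataL2)  # L2 and group
reg_L2_only    <- lm(A ~ L2, data = mydataL2)            # L2 only

summary(reg_group_only)
summary(reg_both)
anova(reg_group_only, reg_both)
summary(reg_L2_only)
```

Because empirical=TRUE forces the sample correlations to equal the requested values, the slope for L2 in the last model is .3 whatever seed you use. The two models that include group do depend on the seed, since group membership depends on which cases fall below the cutoff.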