Let's look at an example of what a "perfect" correlation looks like when we graph data points, or observations, on a scatter plot.
So here is our example of perfect correlation. As you can see, the data points, or observations, range from 0.1 to 1.0.
And as you can see here, we can draw a line that perfectly fits all the data points; that is, it crosses through all of them. The formula for this line is Y = X, because when X = .1, Y also = .1. Whatever value we pick for X, whether it's .2, .5, or .6789, Y will have that exact same value.
Now, there are any number of possible perfect relationships between our X and Y variables. There could be a positive relationship between X and Y (as in the previous example), but there could also be a negative relationship; that is, as X increases, Y decreases, and as X decreases, Y increases. Y doesn't have to have a one-to-one relationship with X either: we could have Y = .5X (half of X), or Y equal to 2 or 3 times X. These are perfectly linear relationships (straight lines), but a perfect relationship could also be a more complicated function, like a curve. In the lower left corner you can see an example of Y = X cubed. (Keep in mind that Pearson's correlation, which we'll cover shortly, specifically measures linear association, so a perfectly curved relationship won't yield a correlation of exactly 1.0.) Really, the important concept with a perfect relationship is that we can draw a line that fits the data perfectly; that is, it will cross exactly through all our observations. While it's important to grasp this basic concept, we very rarely encounter such relationships between variables in the real world of behavioral or health statistics. In the real world, things are a lot messier…
In real-world social and health science research we very rarely, if ever, find perfect relationships between variables. As you can see here, we can't fit a straight line that will cross through all the data exactly. However, there are a couple of options for how we could describe this linear relationship.

One method would be to calculate a line of best fit for these data points. This method is called linear regression, and we will come to it later. The most common method of regression uses the method of least squares to calculate this line; we'll come to that in a bit.

Another option for describing this relationship is the method of correlation. We'll start with this method because it builds off of some of the elementary concepts we've discussed in class: calculating the means and standard deviations for different samples, as well as transforming observations into standard scores (or z-scores).
The standard deviation is a rough measure of the average amount by which observations deviate from their mean. In order to calculate the standard deviation, we first need to calculate the sum of squares for our sample.

To calculate the sum of squares, we first square each X observation and sum the results: 2 squared plus 4 squared plus 5 squared plus 7 squared plus 8 squared equals 158. We next add up all our X observations and square this total: 2 plus 4 plus 5 plus 7 plus 8 equals 26, and 26 squared equals 676. We divide this number by the number of observations, which in this case is 5, and subtract the quotient from our first number. This gives us 158 minus 676 divided by 5, which is equal to 22.8.

The variance is calculated by dividing the sum of squares by one less than the number of observations. We use this n minus 1 correction because we are calculating a sample standard deviation. The variance for our sample is 5.7. The standard deviation is then the square root of the variance: the square root of 5.7, which is approximately 2.39.
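The steps above can be sketched in Python, using the X values from the slide:

```python
# Sum of squares, variance, and sample standard deviation for the
# slide's X values, via the computational formula SS = sum(X^2) - (sum X)^2 / n.
import math

x = [2, 4, 5, 7, 8]
n = len(x)

sum_of_squared = sum(v ** 2 for v in x)   # sum of X squared = 158
squared_sum = sum(x) ** 2                 # (sum of X) squared = 676
ss = sum_of_squared - squared_sum / n     # 158 - 676/5 = 22.8
variance = ss / (n - 1)                   # n - 1 correction: 22.8 / 4 = 5.7
sd = math.sqrt(variance)                  # approximately 2.39

print(round(ss, 1), round(variance, 1), round(sd, 2))   # 22.8 5.7 2.39
```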
And finally, let's review how we compute a z-score. In order to convert a score to a standard score, we need an individual observation, the mean of the sample or population, and the standard deviation of the sample or population. We've already calculated our mean and sample standard deviation for our X variable, so now we'll convert one of our observations to a z-score. Here we'll convert our second observation, X-2, which is 4. Subtracting the mean of 5.2 from our observation of 4 gives us negative 1.2. We then divide by the standard deviation of 2.39 to arrive at our standard score, which is approximately negative .502.

This concludes our brief review, and brings us to a good point to continue our exploration of correlation.
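The z-score conversion above can be checked in a few lines (the small difference in the third decimal comes from using the unrounded standard deviation rather than 2.39):

```python
# Converting the second X observation (4) to a z-score using the
# sample mean (5.2) and the sample standard deviation computed earlier.
import math

x = [2, 4, 5, 7, 8]
n = len(x)
mean = sum(x) / n                                   # 5.2
ss = sum(v ** 2 for v in x) - sum(x) ** 2 / n       # 22.8
sd = math.sqrt(ss / (n - 1))                        # approximately 2.39

z = (x[1] - mean) / sd                              # (4 - 5.2) / sd
print(round(z, 3))                                  # -0.503
```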
This animated picture shows a number of different negative correlations that vary in strength. We can tell that each is a negative correlation because it slants downward running from left to right, so that as the X values increase, Y decreases.

The technique of correlation was developed by Karl Pearson, and thus the statistic is often referred to as Pearson's r. Typically, it's just reported as r. The value for r can range anywhere from -1.0 to +1.0. The sign indicates the direction of the relationship; that is, whether the association is negative or positive. The absolute value of r describes the strength of the linear relationship between the observations. Both 1.0 and negative 1.0 are perfect relationships, with 1.0 being a perfect positive relationship and negative 1.0 a perfect negative relationship. A Pearson's r of 0 indicates that there is no detectable relationship between our two variables with these observations.

Now, the size of Pearson's r describes the strength of the relationship between two variables, or the magnitude of the effect. An r of .0 to .2 is considered a very small, weak, or virtually absent relationship. An r from .2 to .4 is considered a relatively weak association, .4 to .6 a moderate association, and .6 to .8 a strong association. An r of .8 to 1.0 ranges from very strong to a perfect association. Remember that these are absolute values for r: an r of -.8 is also considered a very strong relationship, only negative in direction, and an r of -.8 is a stronger relationship than an r of positive .6. Very strong correlations of .8 or higher are found in the behavioral sciences when measuring the reliability of certain tests. For example, research validating certain tests, such as IQ tests, will have people tested more than once.
It is expected that the scores from the initial test and a second testing would be very close if the test is reliable.

Thus the value of r gives us information about both the strength and the direction of the relationship between our two variables. Although there are a number of different metrics for effect size, perhaps the most common effect size is reported by squaring the value of r, which is referred to as the coefficient of determination. R squared is a measure of the amount of variance in our Y variable that can be explained by variability in the X sample (and vice versa). The size of r can be deceiving, because when we square it to determine the coefficient of determination it can quickly decrease in size. For example, an r of .5 translates to an r squared of only .25, and an r of .6 to an r squared of .36.
To calculate r using the z-score method, first we have to calculate the mean and standard deviation for both our X and Y variables. Once we have these statistics, we can calculate z-scores for all X and Y observations. We then multiply the z-scores Zx and Zy for each pair of observations. Next, we sum up the products of these z-scores. And finally, we divide by the number of paired observations minus one, which in this case is 4. It's important to remember that you are dividing by the number of paired observations rather than the total number of X and Y observations. If you used n minus 1 for the total number of observations, you would be dividing 3.54 by 9, which would give you an r of approximately .39, a very different result.

Now that we've gone through this example of how to calculate r using the z-score method, I'm going to tell you to never use this approach in actual practice. This is the formula that Caldwell provides in "Statistics Unplugged" (at least in the 2nd edition I have), and it is a good formula for grasping conceptually what is going on when we calculate r. However, one of the reasons we would never use this formula in actual practice is the considerable amount of rounding involved in calculating z-scores and multiplying them. When I calculated the actual correlation of these data using Excel, the result was an r of .885965: we were pretty close, but there is considerable work in calculating each z-score individually and then multiplying. While we weren't off by much with an example problem of 5 paired observations, I'm sure you can imagine how much number crunching (and rounding) would be involved with a study of 100 or more participants, which isn't uncommon in actual research practice.
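For reference, the z-score method can be sketched as a function. The X values are the slide's; the Y values below are hypothetical stand-ins, since these notes don't list the actual Y data:

```python
# Pearson's r via the z-score method described above: convert each
# observation to a z-score, multiply paired z-scores, sum, divide by n - 1.
import math

def sample_sd(data):
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))

def pearson_r_zscores(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx, sy = sample_sd(x), sample_sd(y)
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    # Divide by the number of PAIRS minus one, not total observations.
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

x = [2, 4, 5, 7, 8]
y = [3, 4, 4, 7, 8]   # hypothetical Y values for illustration
print(round(pearson_r_zscores(x, y), 3))   # 0.956
```

Done in full floating-point precision there is no rounding error; the hand-calculation problem comes from rounding each z-score to a few decimals.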
So if you aren't supposed to use the z-score method for calculating r, then how are you supposed to compute it? Here is the formula we will actually be using. You probably recognize that we will need to calculate the sum of squares for both the X and Y variables; these are notated by SSx and SSy under the radical sign in the denominator of this formula. We went over this in the review. We will also need to calculate the sum of products for X and Y, which is something new. The formula for the sum of products is at the bottom of the slide. These formulas can look like a lot at first glance, so let's break down the actual steps.
First we have to calculate the sum of the X observations, the sum of the Y observations, the sum of the squared X observations, and the sum of the squared Y observations. We also have to compute X times Y for each pair and sum up these products. There are five paired observations in our example, so our n is equal to 5.

First, we'll compute the sum of products for XY: we take the sum of all X times Y products (171) and subtract the sum of X (26) times the sum of Y (26) divided by n (5). We then calculate the sum of squares for both the X and Y variables, which we have already reviewed. Now we can plug these results into our formula for r.
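These steps translate directly into the computational formula. Again, the Y values here are hypothetical stand-ins, since the notes don't reproduce the slide's Y column:

```python
# Pearson's r via the computational formula:
#   r = SPxy / sqrt(SSx * SSy)
# where SPxy = sum(XY) - (sum X)(sum Y)/n and SS = sum(X^2) - (sum X)^2 / n.
import math

def pearson_r(x, y):
    n = len(x)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)

x = [2, 4, 5, 7, 8]
y = [3, 4, 4, 7, 8]   # hypothetical Y values for illustration
print(round(pearson_r(x, y), 3))   # 0.956
```

This avoids computing any intermediate z-scores, which is why it involves far less rounding when done by hand.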
Now we plug our sum of squares X, sum of squares Y, and sum of products XY into our formula to calculate r. R is equal to the sum of products for XY (20.2) divided by the square root of the sum of squares X (22.8) times the sum of squares Y (22.8). The result is 20.2 divided by 22.8, which gives us a Pearson's r of approximately .886. This gives us a coefficient of determination, r squared, of .785, indicating that close to 79 percent of the variance in variable Y is accounted for by variation in variable X. These results are similar to what we found before, but you can see from these calculations how much less rounding off was involved.

Now, it's also important to note that I've only been talking about the direction of the relationship and the strength of the effect. An important question to ask is whether these results are statistically significant. Earlier we covered the concept of null hypothesis significance testing to determine whether there was a statistically significant difference between sample means. Using our degrees of freedom, we can compare our correlation coefficient to the critical value for a given alpha level in a statistics table. Degrees of freedom are calculated by taking the number of paired observations and subtracting the number of groups: in this case there were 5 paired observations and 2 groups, so our degrees of freedom would be 3. Our null hypothesis is that r = 0, or that there is essentially no relationship between the two variables. In the appendix of Caldwell's text you can find a table with critical values for r. With 3 degrees of freedom, the critical r value to reach an alpha level of .05 is .878. As our result exceeded that value, we reject the null hypothesis and conclude that there is a statistically significant relationship between these two variables.
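The table lookup can be cross-checked with a standard identity not covered in these notes: a correlation can be converted to a t statistic with t = r √(n − 2) / √(1 − r²) and compared to the critical two-tailed t at df = n − 2 (3.182 for df = 3 at alpha = .05). A minimal sketch with the slide's numbers:

```python
# Convert r to a t statistic and compare with the two-tailed
# critical t for alpha = .05 at df = 3 (which is 3.182).
import math

r = 0.886          # correlation from the slide
n = 5              # number of paired observations
df = n - 2         # degrees of freedom = 3

t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(round(t, 2), t > 3.182)   # 3.31 True: significant at .05
```

This agrees with the table of critical r values: .886 exceeds .878, and the corresponding t of about 3.31 exceeds 3.182.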
Up until this point we have been covering the concept of correlation. While correlation gives us information about the strength and direction of a relationship, it can be difficult to make actual predictions from our data. Let's say that in our example dataset, X refers to the number of birthday cards different people send out, and Y to the number of birthday cards they receive in return. Through our calculation of a correlation coefficient, we found a strong relationship between these two variables. However, if someone wanted to know how many birthday cards a person who sends out 6 cards would receive, we would be unable to answer: correlation tells us about the strength and direction of the relationship, but does not allow us to make predictions in units of measurement. Regression allows us to maintain our units of measurement so we can see how much blood pressure decreases with hours of exercise per month, or how much a person's life expectancy decreases in years relative to the number of cigarettes they smoke per week.

Linear regression allows us to create a straight line that fits as closely as possible to all the data points. It may cross through some of them, and it will certainly miss some of the points unless we have a perfect relationship, but it will be an approximate function describing the relationship between our variables X and Y while maintaining their original units of measurement. The line of best fit reduces the average distance of our data points to the line to the lowest possible value. The distance between the line and any data point it does not cross through is referred to as predictive error. The most common form of regression is calculated through the least squares method. With the least squares approach, the regression line minimizes not the total amount of predictive error, but rather the total squared predictive error, that is, the total of all squared predictive errors.
Here is the equation that can solve for the exact least squares regression line for any scatter plot. The equation is Y prime = bX + a. Y prime gives us the predicted value, X is the known value, and b and a represent numbers we can compute from our original correlation analysis.

You can recognize b as involving both our Pearson's r correlation coefficient and the sums of squares for the X and Y variables. The formula reads: b equals r times the square root of the sum of squares Y divided by the sum of squares X. b is the slope of our regression line.

In solving for a, we use the sample means for Y and X, represented by Y bar and X bar, along with the value of b we just determined. a gives us our Y-axis intercept, or where our line will cross the Y axis.

So let's have a look at how we could apply the least squares regression equation to our original data.
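As a sketch, the slope and intercept formulas above can be exercised on hypothetical data chosen to lie exactly on a known line, so the answer is easy to verify:

```python
# Least squares regression line Y' = bX + a, with
#   b = r * sqrt(SSy / SSx)   and   a = Ybar - b * Xbar.
import math

def least_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    r = sp_xy / math.sqrt(ss_x * ss_y)
    b = r * math.sqrt(ss_y / ss_x)    # slope
    a = my - b * mx                   # Y-axis intercept
    return b, a

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # hypothetical data lying exactly on y = 2x + 1
b, a = least_squares(x, y)
print(round(b, 3), round(a, 3))   # slope 2.0, intercept 1.0
predicted = b * 6 + a             # prediction for X = 6 gives 13.0
```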
And at the bottom we have the answer to our question about the number of birthday cards one could expect to receive after sending out 6 cards. Our calculation yields a result of approximately 6.5, so someone who sends out 6 birthday cards can reasonably expect to receive 6 to 7 back.
The standard error of estimate is a special type of standard deviation that reflects the magnitude of predictive error for our regression equation. So let's take a look at how we calculate the standard error of estimate in least squares linear regression. At the top, we have the definition formula, and at the bottom, the formula for the actual calculation. To calculate the standard error of estimate, we take the square root of the sum of squares Y times 1 minus r squared, divided by n minus 2. Let's look at how this works with our example data set.
To calculate the standard error of estimate for our example, we take the square root of the sum of squares Y (which is 22.8) times 1 minus r squared (our r was .886, so r squared is approximately .785), divided by n minus 2, which is 5 minus 2. This gives us a standard error of estimate of approximately 1.278.
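Using the slide's numbers, the calculation looks like this:

```python
# Standard error of estimate: sqrt(SSy * (1 - r^2) / (n - 2))
# with the slide's values SSy = 22.8, r = .886, n = 5.
import math

ss_y = 22.8
r = 0.886
n = 5

see = math.sqrt(ss_y * (1 - r ** 2) / (n - 2))
print(round(see, 3))   # 1.278
```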
Next we will briefly cover some of the assumptions and limitations of correlation and regression approaches.

First, we are assuming that our data fall in a relatively straight line. We can check this assumption by visually inspecting our data with scatter plots to make certain the data do not curve. If they do, other statistical techniques may be more appropriate than linear correlation and regression.

Next, we are assuming homoscedasticity, which essentially means that the data on our scatter plot are equally dispersed along the line. This also relates to issues of range limitations.

An important concept to keep in mind with correlation and regression has to do with the range of observations we're looking at. Unless we have a representative sample of the entire range of the distribution, we can't be certain that our observed relationship will apply to the actual population. Sometimes the function of the relationship between the two variables can be different at the upper and lower ends of our distributions. For example, a study conducted by Sternberg and Williams (1997) found a relatively weak relationship between GRE (graduate record exam) scores and academic performance in graduate school. However, the study was strongly criticized because the sample happened to be graduate students from Yale University. As you can probably imagine, although there was invariably some variability in their GRE scores, the scores in this study represented almost exclusively the upper range of the distribution; Yale is a very selective school, after all. This severely limits the conclusions that could be drawn from the study, as it does with other studies with similar kinds of range restrictions. Would we be able to make predictions about the performance of students whose GRE scores were in the normal or lower-than-average ranges? Not really from the results of this study.
At least not with a lot of confidence.

In terms of limitations, there is the potential problem of the inability to determine the direction of effect: that is, does X influence Y, or does Y influence X? For example, there has been research looking at the relationship between formal study of music and certain types of intellectual functioning, most commonly spatial-temporal reasoning, which is the ability to mentally manipulate objects in space and time, such as using a map to navigate. So there has been some research linking the study of music with high spatial-temporal reasoning. Does the formal study of music cause improvements in spatial-temporal reasoning? Or is it that individuals who are higher in spatial-temporal reasoning ability enjoy studying music? By correlating or regressing these two variables against each other, we can't tell. Now, I mentioned that the inability to determine directionality is a potential limitation of correlation and regression; that is because these are statistical techniques rather than research methods. We could potentially set up an experiment to determine the effect of music study on spatial-temporal reasoning and use regression to calculate the effect. In this case, the problem of directionality could be corrected by using a true experiment.

It is a common maxim in intro psych courses that correlation is not causation. In fact, correlation is only one piece of the puzzle in assigning causality. We do need correlation, which gives us information on the strength of the relationship, and regression can allow us to predict outcomes, but we need some other pieces to assign causality. First, we need temporal precedence: we must establish that the predictive event occurs first and the predicted outcome occurs later. This can be controlled for using true experiments and longitudinal studies.
With the music example, we could perform an intelligence test on young children every few years and track their music training (whether they study music or not). We could also control for other variables, perhaps other forms of stimulation that could account for the effects, like studying foreign languages, sports, arts, or other areas of academics. If we could establish that music training occurred first and spatial-temporal reasoning subsequently increased, we would have more information to support this claim.

The third piece of the puzzle needed to assign causality is some form of hypothesized causal mechanism. That is, we have to have some sort of understanding as to why these effects are occurring. For example, we have lots of evidence that smoking cigarettes causes lung cancer: research has established both the strength of the relationship and the order of the relationship (that is, people who smoke have a subsequently higher rate of lung cancer, rather than people who get lung cancer just enjoying smoking). We also have a causal mechanism in our understanding of the carcinogenic effects of toxins and pollutants such as tar in cigarettes. Thus, careful study allows us to assign causality when we have an understanding of the strength and direction of the relationship, the temporal order of events (one event occurring first, and the other event occurring afterward), and some sort of understanding of the causal mechanism thought to effect the change.

And with that, we have covered our introduction to correlation and regression. We will later cover the more advanced techniques of partial correlation and multiple regression. These techniques will allow us to look at the effects of multiple predictor variables at one time, as well as to control for the effects of other variables that we believe may be mediating our observed relationship.
Correlation & Regression<br />Grant M. Heller<br />Stats 5030<br />
Z-scores (Standard Scores)<br />Calculated by subtracting the mean from each observation and dividing the difference by the standard deviation.<br />
Types of Relationships<br />Positive Relationships<br />Negative Relationship<br />No Relationship<br />Strong vs. Weak Relationship<br />Image from: http://member.tripod.com/~BDaugherty/KeySkills/lineGraphs.html#SCATTER<br />
Correlation: strength of relationship<br />Pearson’s correlation coefficient (r)<br />Any value between -1.00 and +1.00<br />Sign (+/-) indicates direction of relationship (whether positive or negative)<br />Absolute value of r indicates the strength of the linear relationship.<br />Effect size (Cohen, 1988)<br />r ≤ .10 : small (weak) relationship<br />r of .30 : medium (moderate) relationship<br />r ≥ .50 : large (strong) relationship <br />Animation from: http://www.ats.ucla.edu/stat/sas/teach/corrrelation/corr.htm<br />
Correlation Coefficient<br />Z Score Formula:<br />To calculate r, we need to:<br />1) Calculate means and standard deviations for variables X and Y<br />2) Calculate Standard Scores (z scores) for all variable X and Y observations<br />3) Multiply the z scores for X & Y (Zx & Zy) for each pair of observations.<br />4) Sum the product of Zx * Zy<br />5) Divide the results by the number of paired observations minus one<br />
Calculation of r from example problemusing z-score method<br />
Steps for calculation of r<br />1) determine the number of paired observations (n).<br />2) sum all scores for X and for Y separately<br />3) find the product of each pair of X & Y scores (multiply)<br />4) sum the products of X & Y scores – save this #<br />5) square each X score and sum them up <br />do the same for each Y score, and save these 2 numbers.<br />
Assumptions & Limitations<br />Assumptions<br />Linearity<br />Homoscedasticity<br />Adequate range<br />Limitations<br />Question of direction of effect<br />Does X influence Y or does Y influence X?<br />
References<br />Caldwell, S. (2004). Statistics Unplugged (2nd ed.). Belmont, CA: Thompson.<br />Cohen, B. H. (2008). Explaining Psychological Statistics (3rd ed.). Hoboken, NJ: Wiley.<br />Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum. <br />Introduction to SAS. UCLA: Academic Technology Services, Statistical Consulting Group. http://www.ats.ucla.edu/stat/sas/notes2/ (accessed October 4, 2010). <br />Green, S. B., & Salkind, N. J. (2003). Using SPSS for Windows and Macintosh: Analyzing and Understanding Data (3rd ed.). Upper Saddle River, NJ: Prentice Hall.<br />Witte, R. S., & Witte, J. S. Statistics (8th ed.). Hoboken, NJ: Wiley.<br />