Correlation & Linear Regression


Lecture 10: Correlation and Linear Regression (4/10/2007)

Learning Objectives

The objective of this chapter is to discuss a powerful technique for examining whether or not two variables are related. Specifically, we will be talking about the ideas of:
• Simple regression
• Correlation

Key points
• Pearson correlation coefficient
• Coefficient of determination
• Slope of the regression line (the regression coefficient)
• Intercept
• SE of estimates
• t-value associated with the calculated correlation coefficient

PLSS 6390 Experimental Designs and Data Analysis

Correlation and Linear Regression: Overview
• In ANOVA, we described the effect of selected treatments (independent) on some measured variable (dependent). The independent treatments can be qualitative or quantitative, but the dependent variable is quantitative.
• Regression and correlation are powerful methods to describe and test associations between continuous variables.
• Regression and correlation are related but have different purposes.
• The goal of regression is to predict the value of one variable, Y (dependent), from measurements of the other variable, X (independent). When using regression, keep in mind that predicting Y from X in no way implies that X is the cause of Y. Demonstrating a cause-and-effect relationship requires careful experimental design with controls to rule out other causes.
• Correlation merely describes the association between two variables.
Correlation: Scatter Diagrams
• A correlation is a statistical technique used to determine the relationship between two variables.
• The simplest thing we can do with two variables that we believe are related is to draw a scattergram. A scatter diagram is a simple graph that plots values of our two variables on an X-Y axis.
• E.g., let us make a scattergram between the number of coffee cups consumed in a day by 9 students and the number of bathroom trips they take during the day.

Correlation: Scatter Diagrams

Individual   No. coffee cups   No. bathroom trips
John         1                 2
Paul         2                 1
Millie       3                 3
Jesse        3                 3
Arturo       3                 4
Shilpa       4                 6
Ravi         5                 6
Yona         6                 5
Adrian       6                 7
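To get a feel for the scattergram without any plotting software, here is a minimal Python sketch that prints a text version of the coffee-cup data (Python is used purely for illustration; the course itself works in SAS):

```python
# Coffee-cup / bathroom-trip data from the slide (9 students, John through Adrian).
cups  = [1, 2, 3, 3, 3, 4, 5, 6, 6]
trips = [2, 1, 3, 3, 4, 6, 6, 5, 7]
points = set(zip(cups, trips))   # duplicates (Millie and Jesse, both at 3,3) collapse

# Print a rough text scattergram: trips on the vertical axis, cups on the horizontal.
for y in range(max(trips), 0, -1):
    row = " ".join("*" if (x, y) in points else "." for x in range(1, max(cups) + 1))
    print(f"{y} | {row}")
print("     " + " ".join(str(x) for x in range(1, max(cups) + 1)))
```

Even in this crude form, the upward drift of the points from lower left to upper right is visible.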
Correlation: Scatter Diagrams

[Scatter diagram: No. bathroom trips (X2) vs. No. coffee cups (X1) for the 9 students]

Correlation: Scatter Diagrams
• A correlation measures three things:
The direction of the relationship:
• POSITIVE (+): As one variable increases the other increases, and as one decreases the other decreases
• NEGATIVE (−): As one variable increases the other decreases, and as one decreases the other increases
The form of the relationship: a correlation can be classified into 3 categories: linear, curvilinear (nonlinear), and no correlation
The degree of the relationship: ranges from +1 (a perfect positive relationship) to −1 (a perfect negative relationship)
Correlation Classification
• Linear correlation: Two variables may be correlated, and the relationship can be described by a linear model

Correlation Classification
• Non-linear correlation: Two variables may be correlated but not through a linear model. This type of model is called non-linear and might be that of a curve (curvilinear)
Correlation Classification
• No correlation: Two quantitative variables may not be correlated at all

Linear Correlation
• Variables that are correlated through a linear relationship can display either positive or negative correlation
• Positively correlated variables vary directly (as the value of one increases, the other increases), while negatively correlated variables vary as opposites (as the value of one increases, the other decreases)

[Two scatter diagrams of No. of bathroom trips vs. No. coffee cups: one showing negative correlation, one showing positive correlation]
Strength of Correlation
• Correlation may be strong, moderate, or weak.
• You can estimate the strength by observing the variation of the points around the line: large variation means a weak correlation, while small variation, i.e., data points distributed close to the line, is said to be a strong correlation.
• Note that the correlation type is independent of the strength

[Two scatter diagrams illustrating strong vs. weak correlation]

The Correlation Coefficient
• The strength of a linear relationship is measured by the correlation coefficient
• The sample correlation coefficient is given by the symbol "r"
• The population correlation coefficient has the symbol "ρ"
Interpreting r
• The sign of the correlation coefficient tells us the direction of the linear relationship
• If r is negative (< 0), the correlation is negative and the line slopes down
• If r is positive (> 0), the correlation is positive and the line slopes up
• The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship:
• | r | ≥ 0.90 implies a strong linear association
• For 0.65 < | r | < 0.90, the correlation is moderate
• For | r | ≤ 0.65, the linear association is weak

Correlation: Cautions and Fundamental Rule
• The correlation coefficient only gives us an indication of the strength of a linear relationship
• Two variables may have a strong curvilinear relationship, but they could have a weak value for r (linear correlation)
• Correlation DOES NOT imply causation
• Just because 2 variables are highly correlated does not mean that the explanatory variable causes the response (e.g., arm length and leg length of people are highly correlated, but neither variable determines the other)
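The sign-and-magnitude rules above can be captured in a small helper. This is an illustrative Python sketch of the slide's rules of thumb (the function name `describe_r` is ours, not part of the course material):

```python
def describe_r(r):
    """Classify a correlation coefficient by the slide's rules of thumb."""
    if r == 0:
        direction = "none"
    else:
        direction = "positive" if r > 0 else "negative"
    m = abs(r)            # strength depends only on magnitude
    if m >= 0.90:
        strength = "strong"
    elif m > 0.65:
        strength = "moderate"
    else:
        strength = "weak"
    return direction, strength

print(describe_r(0.997))   # the extrusion example: positive and strong
print(describe_r(-0.75))   # negative and moderate
```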
Setting Up a Correlation
• Let us consider an example in which an engineer would like to determine if a relationship exists between the extrusion temperature and the strength of a certain formulation of plastic. He oversees the production of 10 batches of plastic at various temperatures and records the strength results
• The 2 variables of interest in this study are the strength of the plastic and the extrusion temperature
• The independent variable is the extrusion temperature. This is the only variable over which the experimenter has control. He can select whatever level he sees as appropriate
• The response variable is strength. The value of strength is thought to be dependent on temperature
• In the arm length and leg length example, however, both variables are independent!

Setting Up a Correlation

Temperature   Strength
120           18
125           22
130           28
135           31
140           36
145           40
150           47
155           50
160           52
165           58
Setting Up a Correlation
• The scatter diagram allows us to deduce the nature of the relationship between the 2 variables.
• What can we conclude?

[Scatter diagram of strength (psi) vs. temperature (F)]

Correlation by Inspection
• Does there appear to be a relationship between the 2 variables?
• Classify the relationship as linear, curvilinear, or no relationship
• Classify the correlation as positive, negative, or no correlation
• Classify the strength of the correlation as strong, moderate, weak, or none
Computing the Correlation Coefficient "r"

    r = [1/(n − 1)] Σ [(x − x̄)/s_x][(y − ȳ)/s_y]

where (x − x̄)/s_x and (y − ȳ)/s_y are the z-scores for the x and y data, and n − 1 is the df.

Computing the Correlation Coefficient "r"

    r = [1/(n − 1)] Σ (Z_x)(Z_y)

• Using the data provided, calculate the coefficient of correlation for the relationship between the 2 variables (r = 0.997)
• What is the strength of this correlation?
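The z-score formula can be checked directly against the temperature/strength data. A minimal Python sketch (note the sample standard deviations use the n − 1 denominator, matching the df in the formula):

```python
from math import sqrt

def pearson_r_zscores(xs, ys):
    """r = [1/(n-1)] * sum(z_x * z_y), with sample SDs (n - 1 denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Temperature/strength data from the extrusion example
temp     = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]
strength = [18, 22, 28, 31, 36, 40, 47, 50, 52, 58]
r = pearson_r_zscores(temp, strength)   # about 0.997, as on the slide
```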
Computing the Correlation Coefficient "r"
• This coefficient is equivalent to:

    r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n][ΣY² − (ΣY)²/n]}

• It is also called Pearson's r

Testing for the Significance of r
• The only other thing we might want to do is find out if the correlation is statistically significant.
• We can formulate the hypotheses:
• H0: ρ = 0
• H1: ρ ≠ 0
• To test whether or not r is significantly different from zero, we use the t test for Pearson's r:
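The computational (sums-of-products) form avoids computing z-scores and gives the same answer. An illustrative Python sketch:

```python
from math import sqrt

def pearson_r_sums(xs, ys):
    """Computational form:
    r = [ΣXY - ΣXΣY/n] / sqrt([ΣX² - (ΣX)²/n] [ΣY² - (ΣY)²/n])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))

temp     = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]
strength = [18, 22, 28, 31, 36, 40, 47, 50, 52, 58]
r = pearson_r_sums(temp, strength)   # about 0.997, matching the z-score form
```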
Testing for the Significance of r
• To test whether or not r is significantly different from zero, we use the t test for Pearson's r:

    t_ob = r √[(n − 2)/(1 − r²)]

• Since this is like any other hypothesis test, we want to compare t_ob with t_crit. For this test, we use our given alpha level (generally 0.05 or 0.01) and df = n − 2. In this case, we subtract 2 from our sample size because we have two variables
• So, with α = 0.05, is the correlation significant?

Testing for the Significance of r

    t_ob = r √[(n − 2)/(1 − r²)] = 0.997 √[(10 − 2)/(1 − 0.994)] ≈ 36.4

• Now, like any other significance test, we find our critical value of t from the table (α = 0.05, df = 8: 2.306) and compare it to the obtained value. Since 36.4 > 2.306, we reject the null hypothesis and conclude that the correlation is statistically significant
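The t test is a one-liner in code. A Python sketch (note the result is sensitive to how much you round r²: with r = 0.997, 1 − r² ≈ 0.006, which gives t ≈ 36.4):

```python
from math import sqrt

r, n = 0.997, 10                       # values from the extrusion example
t_ob = r * sqrt((n - 2) / (1 - r ** 2))
t_crit = 2.306                         # t table: alpha = 0.05 (two-sided), df = n - 2 = 8
significant = abs(t_ob) > t_crit       # True: reject H0, the correlation is significant
```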
Calculating the Correlation Coefficient in SAS

    PROC CORR;
      VAR variable1 variable2;
    RUN;

• In SAS, the output will give some basic statistics for the variables correlated, the correlation coefficient, and the associated probability. If P < 0.05, r is significant, and if P < 0.01, then r is highly significant
• The correlation table is presented in the form of a matrix

Scatter Diagrams, Statistical Modeling, and Regression
• We've already seen that the best graphic illustrating the relation between 2 quantitative variables is a scatter diagram. Let us take this concept a step further and actually develop a mathematical model for the relationship between 2 quantitative variables
• Since the data appear to be linearly related, we can find a straight-line model that fits the data better than all possible straight-line models. This is the LINE OF BEST FIT (LOBF)
• To figure it out, first we need an idea of the general equation for a line. From algebra, any straight line can be described as: Y = a + bX, where a is the intercept and b the slope
The Straight-Line Model
• So, any straight line is completely defined by two parameters:
The slope (b) is the steepness, either positive or negative
The intercept (a) is where the graph crosses the vertical axis
• In our model, b is the estimator of the slope; the true value of the parameter is β and is unknown
• Similarly, a is an estimate of the intercept; the true value α is unknown

Calculating the Parameter Estimators a and b
• The problem of a regression in a nutshell is to figure out what values of a and b to use. To do that, we use the following formulas:

    b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = r (s_y / s_x)

• And

    a = ȳ − b x̄
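Applying these two formulas to the extrusion data recovers the estimates quoted on the next slide. An illustrative Python sketch:

```python
temp     = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]
strength = [18, 22, 28, 31, 36, 40, 47, 50, 52, 58]
n = len(temp)

sx, sy = sum(temp), sum(strength)
sxy    = sum(x * y for x, y in zip(temp, strength))
sxx    = sum(x * x for x in temp)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope: about 0.887
a = sy / n - b * (sx / n)                       # intercept (a = ȳ - b·x̄): about -88.24
```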
Calculating the Parameter Estimators a and b
• From our example, a = −88.2 and b = 0.887
• In general terms, any linear regression equation is:
Response = intercept + slope × (predictor)
So in our example: strength = −88.2 + 0.887(temperature)
• The interpretation of the slope is, "the amount of change in the response for every unit change in the independent variable"
• The interpretation of the y-intercept is, "the value of the response variable when the independent variable has a value of 0"
• Note that sometimes the intercept is meaningful, while in other cases it does not make sense at all. In our example, the intercept is −88.2, but there is no such thing as a strength of −88.2
Calculating the Parameter Estimators a and b
• In SAS, we use PROC REG or PROC GLM to determine the regression parameters:

    PROC REG;
      MODEL dependent = independent;
    RUN;

• Or

    PROC GLM;
      MODEL dependent = independent;
    RUN;

Cardinal Rule of Regression
• Never predict a response value from a predictor value that is outside of the experimental range
• From our example, the only predictions we can make (statistically) are predictions for strength where temperatures are between 120 and 165.
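The cardinal rule can be enforced directly in a prediction helper. A Python sketch using the fitted estimates from the example (the function name `predict_strength` and the ValueError guard are our illustration, not part of the course's SAS workflow):

```python
def predict_strength(temperature, a=-88.236, b=0.8873, lo=120, hi=165):
    """Predict strength from the fitted line, refusing to extrapolate
    outside the experimental temperature range (the cardinal rule)."""
    if not lo <= temperature <= hi:
        raise ValueError(f"temperature {temperature} is outside "
                         f"the fitted range [{lo}, {hi}]")
    return a + b * temperature

print(predict_strength(140))   # inside the range: a valid prediction, about 36
```

Calling `predict_strength(200)` raises an error rather than returning a statistically meaningless extrapolation.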
Regression Estimates: The Coefficient of Determination
• r² is called the coefficient of determination and is just the square of the correlation coefficient
• r² is a proportion, so it is a number between 0 and 1 inclusive (remember that r can only vary from −1 to +1)
• r² tells us how much of the variance in the dependent variable is explained by the independent variable, i.e., the amount of variation in the response that is due to variability in the predictor variable
• r² values close to 0 mean that our estimated model is a poor one, while values close to 1 imply that our model does a great job explaining the variation
• The value 1 − r² is the variability in the response variable that remains unexplained and is incorporated into the "uncertainty"

Regression Estimates: The Coefficient of Determination

[Scatter diagram of strength vs. temperature with fitted line: y = 0.8873x − 88.236, r² = 0.9939]
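Computing r² for the extrusion data reproduces the value shown on the fitted plot. An illustrative Python sketch:

```python
temp     = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]
strength = [18, 22, 28, 31, 36, 40, 47, 50, 52, 58]
n = len(temp)

sx, sy = sum(temp), sum(strength)
sxy    = sum(x * y for x, y in zip(temp, strength))
sxx    = sum(x * x for x in temp)
syy    = sum(y * y for y in strength)

# r² is the squared computational-form correlation coefficient
num = (n * sxy - sx * sy) ** 2
den = (n * sxx - sx ** 2) * (n * syy - sy ** 2)
r2  = num / den              # about 0.994: temperature explains ~99% of the variation
unexplained = 1 - r2         # about 0.006 goes to the "uncertainty"
```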
