Regression: Making predictions using data

Limitations of correlations
Correlations measure the strength and direction of the relationship between two variables within a population. There are two important limitations associated with correlations:
They cannot predict scores on one variable from knowledge of the other. Knowing how much bacon a person consumes does not let you predict their risk of heart disease.
They cannot measure relationships among more than two variables. A correlation cannot estimate how bacon consumption, exercise, and alcohol intake combine to predict heart disease.
Linear regression is a more flexible statistical technique that addresses both limitations.

Linear regression
Unlike a Pearson correlation, a linear regression formalizes the relationship between two variables as a line:
Y = bX + a
The components of this equation each have a special meaning:
Y = value of the Y variable (also called the outcome variable)
X = value of the X variable (also called the predictor variable)
b = slope of the line (how changes in X produce changes in Y)
a = intercept (the value of Y associated with X = 0)
A regression line is an algorithm that maps scores on the predictor variable to scores on the outcome variable.

Linear regression
But there are many possible lines that can capture the relationship between two variables. How do we determine the best line to represent a given set of data?
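To make the question concrete, here is a minimal sketch (using the eight data points from the scatterplot on these slides) that scores two arbitrary candidate lines by their total squared error. The candidate slope and intercept values are illustrative choices, not fitted estimates.

```python
# Compare candidate regression lines by their total squared error.
# Data: the eight (X, Y) points from the slide's scatterplot.
xs = [0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3]
ys = [3, 8, 6, 9, 3, 6, 11, 10]

def total_squared_error(b, a):
    """Sum of squared residuals for the line Y = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Two arbitrary candidate lines (illustrative, not the best fit):
print(total_squared_error(5.0, 3.0))   # error for the line Y = 5X + 3
print(total_squared_error(8.0, 1.0))   # error for the line Y = 8X + 1
```

Whichever candidate yields the smaller sum fits the data better; the line of best fit is the one no candidate can beat.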
[Scatterplot: eight data points with X = 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3 and Y = 3, 8, 6, 9, 3, 6, 11, 10]

Linear regression
But there are many possible lines that can capture the relationship between two variables. Each potential regression equation has a certain amount of error.
Error = the distance between the regression line and each datapoint. These distances are also called residuals.
The line of best fit is the line that minimizes the (squared) residuals: no other line can produce a smaller total error.

Line of best fit
To specify the equation for a line, Y = bX + a, we must estimate two values: the slope (b) and the intercept (a). The derivations for these estimates are complicated (matrix algebra), but the final forms of the equations are easy to use.
We can use the equation for the line of best fit to predict scores on the outcome variable for any value of the predictor variable. Predicted scores are represented with Ŷ.

Let's do an example! Height (X) Rated deepness of voice ...
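The "easy to use" final forms are the standard least-squares estimates, b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄. A short sketch applying them to the eight data points from the slide's scatterplot:

```python
# Least-squares estimates for the line of best fit, Y = bX + a:
#   b = sum((X - X_bar) * (Y - Y_bar)) / sum((X - X_bar)**2)
#   a = Y_bar - b * X_bar
xs = [0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3]
ys = [3, 8, 6, 9, 3, 6, 11, 10]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

def predict(x):
    """Predicted score (Y-hat) on the outcome for a predictor value x."""
    return b * x + a

print(round(b, 3), round(a, 3))   # fitted slope and intercept
print(round(predict(1.0), 3))     # predicted Y for X = 1.0
```

Once b and a are estimated, `predict` maps any predictor value to a Ŷ, which is exactly the prediction step the slides describe.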