9.2 Linear Regression and the Coefficient of Determination PART 1: LINEAR REGRESSION
Scatter Diagrams and Linear RelationshipsLooking at a scatter diagram, we can ask somequestions.1. Do the data indicate a linear relationship between x and y?2. Can you find an equation for the “best-fitting line” relating x and y? Can you use this relationship to make predictions?3. What fractional part of the variability in y can be associated with the variability in x? What fractional part of the variability in y is not associated with a corresponding variability in x?
Scatter Diagrams and Linear Relationships The first step in answering these questions is to try to express the relationship as a mathematical equation. There are many possible equations, but the simplest and most widely used is the linear equation, or the equation of a straight line. Because we will be using this line to predict the y values from the x values, we call x the explanatory variable and y the response variable. Our job is to find the linear equation that “best” represents the points of the scatter diagram.
Least-Squares Criterion For our criterion of “best-fitting line”, we use the least-squares criterion The least-squares criterion states that the line we fit to the data points must be such that the sum of the squares of the vertical distances from the points (x, y) to the line be made as small as possible. In Figure 9-10, d represents the difference between the y coordinate of the data point and the corresponding y coordinate on the line. Least-Squares Criterion Figure 9.10
Example In Denali National Park, Alaska, the wolf population is dependent on a large, strong caribou population. In this wild setting, caribou are found in very large herds. The well-being of an entire caribou herd is not threatened by wolves. In fact, it is thought that wolves keep caribou herds strong by helping prevent overpopulation. Let x be a random variable that represents the fall caribou population (in hundreds) in Denali National Park, and let y be a random variable that represents the late-winter wolf population in the park. A random sample of recent years gave the following information (Reference: U.S. department of the Interior, National Biological Service).
Examplea) Draw a scatter diagram displaying the dataSolution Caribou and Wolf Populations Figure 9.9
Example Caribou and Wolf Populations Figure 9.11
Influential Points Some points in the data set have a strong influence on the equation of the least-squares line. A data pair is influential if removing it would substantially change the equation of the least- squares line or other calculations associated with linear regression. An influential point often has an x-value near the extreme high or low value of the data set. If a data set has an influential point, look at it to make sure it is not the result of a data collection error or a recording error. A valid influential point affects the equation of the least-squares line.
Influential PointsInfluential Point Present Influential Point Removed Figure 9.12(a) Figure 9.12(b)
9.2 Linear Regression and the Coefficient of Determination PART 2: COEFFICIENT OF DETERMINATION
Example If r = 0.90, then r2 = 0.81 is the coefficient of determination. We can say that about 81% of the (variation) behavior of the y variable can be explained by the corresponding (variation) behavior of the x variable if we use the equation of the least-squares line. The remaining 19% of the (variation) behavior of the y variable is due to random chance or to the possibility of lurking variables that influence y.