Please Subscribe to this Channel for more solutions and lectures
http://www.youtube.com/onlineteaching
Chapter 10: Correlation and Regression
10.2: Regression
Chapter 10: Correlation and Regression
10.1 Correlation
10.2 Regression
10.3 Prediction Intervals and Variation
10.4 Multiple Regression
10.5 Nonlinear Regression
Objectives:
• Draw a scatter plot for a set of ordered pairs.
• Compute the correlation coefficient.
• Test the hypothesis H0: ρ = 0.
• Compute the equation of the regression line & the coefficient of determination.
• Compute the standard error of the estimate & a prediction interval.
Key Concepts: If the value of the correlation coefficient is significant, determine the equation of
the regression line.
Find the equation of the straight line that best fits the points in a scatterplot of paired sample data.
That best-fitting straight line is called the regression line, and its equation is called the regression
equation. The regression equation expresses a relationship between x (called the independent
variable, predictor variable or explanatory variable), and y (called the dependent variable or response
variable). The typical equation of a straight line is expressed in the form of y = mx + b, where b is
the y-intercept and m is the slope.
Regression Line: Given a collection of paired sample data, the regression line (or line of best fit, or
least-squares line) is the straight line that “best” fits the scatterplot of the data.
If there is not a significant linear correlation, the best predicted y-value is \(\bar{y}\), the mean of the sample y values.
If there is a significant linear correlation, the best predicted y-value is found by substituting the x-
value into the regression equation.
10.2 Regression
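Regression equation: \(\hat{y} = b_0 + b_1 x\), or \(\hat{y} = a + bx\)
Slope: \(b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}\)
Y-intercept: \(b_0 = \bar{y} - b_1\bar{x}\), with \(\bar{y} = \dfrac{\sum y}{n}\) and \(\bar{x} = \dfrac{\sum x}{n}\)
Population parameter: \(y = \beta_0 + \beta_1 x\)
Sample statistic: \(\hat{y} = b_0 + b_1 x\)
TI calculator: \(\hat{y} = a + bx\)
Also: \(b_1 = b = r\,\dfrac{s_y}{s_x}\), \(b_0 = a = \bar{y} - b_1\bar{x}\)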
r is the linear correlation coefficient
sy is the standard deviation of the sample y values
sx is the standard deviation of the sample x values.
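As a quick check that the two slope formulas above agree, here is a minimal Python sketch. The function names are illustrative and not from the slides; the data are the car rental values from Example 2 later in this section.

```python
from statistics import mean, stdev

# Car rental data from Example 2: cars (in 10,000s) and income (in billions)
xs = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
ys = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def line_from_sums(xs, ys):
    """Intercept and slope from the sums formula for b1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n          # b0 = y-bar - b1 * x-bar
    return b0, b1


def line_from_r(xs, ys):
    """Intercept and slope from b1 = r * sy / sx."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)            # sample standard deviations (n - 1)
    # r as the sum of z-score products divided by n - 1
    r = sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * sy / sx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0_sums, b1_sums = line_from_sums(xs, ys)
b0_r, b1_r = line_from_r(xs, ys)
print(round(b0_sums, 3), round(b1_sums, 3))
print(round(b0_r, 3), round(b1_r, 3))
```

Both routes should give the same line up to floating-point rounding, which is why the slides treat them as interchangeable.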
Regression equations are often useful for predicting the value of one variable, given
some specific value of the other variable:
1. Bad Model: If the regression equation does not appear to be useful for making
predictions, don't use the regression equation for making predictions. For bad
models, the best predicted value of a variable is simply its sample mean: \(\bar{y}\).
2. Good Model: Use the regression equation for predictions only if the graph of the
regression line on the scatterplot confirms that the regression line fits the points
reasonably well.
3. Correlation: Use the regression equation for predictions only if the linear
correlation coefficient r indicates that there is a linear correlation between the two
variables.
4. Scope: Use the regression line for predictions only if the data do not go much
beyond the scope of the available sample data.
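The decision between the sample mean and the regression equation can be sketched in Python as follows. This is only an illustration of guidelines 1 and 3: the function names are hypothetical, and the critical value of r has to be looked up in a correlation table and passed in (0.950 for n = 4 and 0.632 for n = 10 are the values quoted later in these slides).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def best_predicted_y(x, xs, ys, r_critical):
    """Return y-bar for a bad model, otherwise b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    if abs(pearson_r(xs, ys)) <= r_critical:
        return y_bar                         # no significant linear correlation
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0 + b1 * x


# Example 1 data (n = 4, critical r = 0.950 at alpha = 0.05)
ex1_pred = best_predicted_y(2, [1, 1, 3, 5], [2, 8, 6, 4], r_critical=0.950)

# Example 3 data (n = 10, critical r = 0.632 at alpha = 0.05)
boats = [68, 68, 67, 70, 71, 73, 76, 81, 83, 84]
deaths = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]
ex3_pred = best_predicted_y(85, boats, deaths, r_critical=0.632)

print(ex1_pred, round(ex3_pred, 2))
```

For Example 1 the correlation is not significant, so the sketch falls back to the mean of the y values; for Example 3 it uses the regression line.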
10.2 Regression, Making Predictions
Best fit means that the sum of the squares of the vertical distances (residuals) from each point to the line is at a minimum.
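A small Python sketch can make the "minimum" concrete: it fits the least-squares line and then checks that nudging the intercept or slope always increases the sum of squared residuals. The data are the four points from Example 1 in this section; the helper names and the perturbation sizes are arbitrary choices for illustration.

```python
# Four sample points from Example 1: (x, y)
points = [(1, 2), (1, 8), (3, 6), (5, 4)]


def fit_line(points):
    """Least-squares intercept b0 and slope b1 from the sums formulas."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


def sum_squared_residuals(points, b0, b1):
    """Sum of squared vertical distances from the points to y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)


b0, b1 = fit_line(points)
best = sum_squared_residuals(points, b0, b1)

# Try nearby lines: every one of them should do worse than the fitted line.
nearby = [
    sum_squared_residuals(points, b0 + d0, b1 + d1)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.2, 0.0, 0.2)
    if (d0, d1) != (0.0, 0.0)
]
print(round(best, 3), round(min(nearby), 3))
```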
10.2 Regression
Population parameter: \(y = \beta_0 + \beta_1 x\)
Sample statistic: \(\hat{y} = b_0 + b_1 x\)
TI calculator: \(\hat{y} = a + bx\)
Slope: \(b = b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}\)
Y-intercept: \(a = b_0 = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2}\)
Also: \(b_1 = b = r\,\dfrac{s_y}{s_x}\), \(b_0 = a = \bar{y} - b_1\bar{x}\)
\(r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}\), or: \(r = \dfrac{\sum (z_x z_y)}{n - 1}\)
x    1    1    3    5
y    2    8    6    4

x    y    x·y    x²    y²
1    2     2      1     4
1    8     8      1    64
3    6    18      9    36
5    4    20     25    16
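\(r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}\)
\(r = \dfrac{4(48) - (10)(20)}{\sqrt{4(36) - 10^2}\,\sqrt{4(120) - 20^2}} = \dfrac{-8}{\sqrt{44 \cdot 80}} = -0.135\)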
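The hand computation of r for Example 1 can be reproduced with a short Python sketch. It evaluates both forms of the formula given in this section (the sums form and the z-score form); the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Example 1 sample data
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
n = len(xs)

# Column sums from the x, y, xy, x^2, y^2 table
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Sums form of the correlation coefficient
r_sums = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)

# z-score form: r = sum(zx * zy) / (n - 1), using sample standard deviations
x_bar, y_bar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)
r_z = sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
          for x, y in zip(xs, ys)) / (n - 1)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r_sums, 3), round(r_z, 3))
```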
TI Calculator:
How to enter data:
1. Stat
2. Edit
3. ClrList 𝑳𝟏
4. Or Highlight & Clear
5. Type in your data in L1, ..
TI Calculator:
Linear Regression - test
1. Stat
2. Tests
3. LinRegTTest
4. Enter 𝑳𝟏 & 𝑳𝟐
5. Freq = 1
6. Choose ≠
7. Calculate
∑x = 10 ∑y = 20 ∑xy = 48 ∑x² = 36 ∑y² = 120
Example 1
Given the sample data:
a. Find the value of the linear correlation coefficient r
b. Test the claim that there is a linear correlation between
the two variables x and y. Use both (a) Method 1 and
(b) Method 2. (α = 0.05)
c. Find the regression equation.
d. Find the best predicted value of y, when x is equal to 2.
Social science Statistics Calculator Tab: https://www.socscistatistics.com/tests/
Correlation Coefficient Calculator: https://www.socscistatistics.com/tests/pearson/default.aspx
Linear Regression Calculator: https://www.socscistatistics.com/tests/regression/default.aspx
Example 1b
1) Null & alternative hypotheses: H0: ρ = 0, H1: ρ ≠ 0 (claim), two-tailed test (2TT).
2) Test statistic (TS), using r = −0.135:
\(t = \dfrac{r - \mu_r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{-0.135 - 0}{\sqrt{\dfrac{1 - (-0.135)^2}{4 - 2}}} = \dfrac{-0.135}{0.70064} = -0.1927\)
3) Distribution, CV, RR & NRR (α = 0.05):
Method 1: t-test, df = n − 2 = 2 → CV: t = ±4.303
Method 2: Use r with n = 4, from the Correlation Table → CV: r = ±0.950
4) Make a decision: t = −0.1927 lies between ±4.303 (Method 1) and r = −0.135 lies between ±0.950 (Method 2), so neither falls in the rejection region.
Decision:
a. Do not reject H0
b. The claim is not supported
c. There is no significant linear correlation between the 2 variables.
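The test statistic for Method 1 can be checked with a few lines of Python. This sketch only evaluates the formula and compares it with a critical value; the critical values themselves (±4.303 for df = 2 and ±2.306 for df = 8) are the table values quoted in the slides, not computed here, and the function names are illustrative.

```python
from math import sqrt


def t_statistic(r, n):
    """t = (r - 0) / sqrt((1 - r^2) / (n - 2)) for testing H0: rho = 0."""
    return r / sqrt((1 - r ** 2) / (n - 2))


def reject_h0(t, t_critical):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t) > t_critical


# Example 1b: r = -0.135, n = 4, CV t = 4.303
t_ex1 = t_statistic(-0.135, 4)
decision_ex1 = reject_h0(t_ex1, 4.303)

# Example 3: r = 0.922, n = 10, CV t = 2.306
t_ex3 = t_statistic(0.922, 10)
decision_ex3 = reject_h0(t_ex3, 2.306)

print(round(t_ex1, 4), decision_ex1)
print(round(t_ex3, 4), decision_ex3)
```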
Example 2: Finding r Using the following Formula
The data shown are for car rental companies in the United
States for a recent year. Find the correlation coefficient and the
equation of the regression line for the data, and graph the
line on the scatter plot.
Company   Cars x (in 10,000s)   Income y (in billions)   xy       x²        y²
A         63.0                  7.0                      441.00   3969.00   49.00
B         29.0                  3.9                      113.10    841.00   15.21
C         20.8                  2.1                       43.68    432.64    4.41
D         19.1                  2.8                       53.48    364.81    7.84
E         13.4                  1.4                       18.76    179.56    1.96
F          8.5                  1.5                       12.75     72.25    2.25
Sum       Σx = 153.8            Σy = 18.7                Σxy = 682.77   Σx² = 5859.26   Σy² = 80.67
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6
\(r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}\)
\(r = \dfrac{6(682.77) - (153.8)(18.7)}{\sqrt{6(5859.26) - 153.8^2}\,\sqrt{6(80.67) - 18.7^2}} = 0.982\) (strong positive relationship)
TI Calculator:
Linear Regression – test &
Correlation Coefficient 𝑟
1. Stat
2. Tests
3. LinRegTTest
4. Enter 𝑳𝟏 & 𝑳𝟐
5. Freq = 1
6. Choose ≠
7. Calculate
Example 2 Continued:
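Slope: \(b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \dfrac{6(682.77) - (153.8)(18.7)}{6(5859.26) - (153.8)^2} = 0.106\)
Y-intercept: \(b_0 = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} = \dfrac{(18.7)(5859.26) - (153.8)(682.77)}{6(5859.26) - (153.8)^2} = 0.396\)
Or: \(b_0 = \bar{y} - b_1\bar{x} = \dfrac{18.7}{6} - (0.106)\dfrac{153.8}{6}\), with \(\bar{y} = \dfrac{\sum y}{n}\) and \(\bar{x} = \dfrac{\sum x}{n}\)
Regression equation: \(\hat{y} = b_0 + b_1 x\), or \(y' = a + bx\)
→ \(y' = 0.396 + 0.106x\)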
The data shown are for car rental companies in the United States for a recent year. Find the equation of the regression line for the data, and graph the line on the scatter plot.
Find two points to sketch the graph of the regression line.
Example 2 Continued:
Choose any x values between 10 and 60 (the sample x values run from 8.5 to 63). Let x = 15 and x = 40:
\(y'(15) = 0.396 + 0.106(15) = 1.986\) → (15, 1.986)
\(y'(40) = 0.396 + 0.106(40) = 4.636\) → (40, 4.636)
Plot (15, 1.986) and (40, 4.636), and sketch the resulting line.
Predict the income of a car rental agency that has 200,000 automobiles.
Significant linear correlation → plug x = 20 into the regression equation \(y' = 0.396 + 0.106x\):
\(y'(20) = 0.396 + 0.106(20) = 2.516\)
When a rental agency has 200,000 automobiles, its predicted income is approximately 2.516 billion dollars.
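The predictions for Example 2 can be reproduced with the rounded coefficients from the slides. The sketch below also adds a simple range check in the spirit of the scope guideline; treating anything outside the smallest and largest sample x as out of scope is a stricter rule than "not much beyond", and both function names are hypothetical.

```python
# Sample x values (cars in 10,000s) from the Example 2 table
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]

# Rounded regression coefficients from the slides: y' = 0.396 + 0.106x
B0, B1 = 0.396, 0.106


def predict_income(x):
    """Predicted income (in billions) for x cars (in 10,000s)."""
    return B0 + B1 * x


def within_sample_scope(x, xs):
    """True when x lies inside the range of the sample x values."""
    return min(xs) <= x <= max(xs)


for x in (15, 40, 20, 100):
    note = "" if within_sample_scope(x, cars) else " (outside sample range)"
    print(f"x = {x}: y' = {predict_income(x):.3f}{note}")
```

The value x = 100 is included only to show the range check firing; the slides do not make a prediction there.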
Marginal Change: In working with two variables related by a regression equation, the marginal change in a
variable is the amount that it changes when the other variable changes by exactly one unit. The slope b1 in the
regression equation represents the marginal change in y that occurs when x changes by one unit.
10.2 Regression, Marginal Change, Outlier & Influential Points
For example: \(\hat{y} = b_0 + b_1 x\) → \(\hat{y} = -3.37 + 2.49x\)
The slope of 2.49 tells us that if we increase x by 1, the predicted value of y will increase by 2.49.
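The reason the slope equals the marginal change follows directly from the regression equation; a short derivation, using the same symbols as above:

```latex
\begin{aligned}
\hat{y}(x+1) - \hat{y}(x)
  &= \bigl[b_0 + b_1(x+1)\bigr] - \bigl[b_0 + b_1 x\bigr] = b_1 \\
\text{For } \hat{y} = -3.37 + 2.49x:\quad
\hat{y}(x+1) - \hat{y}(x) &= 2.49
\end{aligned}
```

The intercept cancels, so the change in the predicted value does not depend on where x starts.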
Outlier (O): In a scatterplot, an outlier is a point lying far away from the other data points.
Influential Points (IP): Paired sample data may include one or more influential points, which are points that
strongly affect the graph of the regression line.
The scatterplot located to the left shows the regression line. If we include an additional pair of data, x = 50 and y = 0, we get the regression line shown to the right below. The additional point (50, 0) is an influential point because the graph of the regression line did change considerably as shown. It is also an outlier because it is far from the other points.
Essentially, an influential point is an outlier that significantly affects
the slope of the regression line. As a result of that single outlier, the
slope of the regression line changes greatly resulting in changing the
shape of the line. Accordingly, the outlier is considered an influential
point. (All IPs are Os but all Os may not be IPs.)
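The effect of an influential point can be illustrated numerically. The slides do not say which data the scatterplots show; this sketch assumes they are the chocolate and Nobel data tabulated in Example 5 later in this section (their slope of about 2.49 matches the marginal-change example above) and refits the line after adding the point (50, 0). The helper name is illustrative.

```python
# Chocolate consumption (x) and Nobel Laureate rate (y), from the Example 5 table
chocolate = [4.5, 10.2, 4.4, 2.9, 3.9, 0.7, 8.5, 7.3, 6.3, 11.6, 2.5, 8.8,
             3.7, 1.8, 4.5, 9.4, 3.6, 2.0, 3.6, 6.4, 11.9, 9.7, 5.3]
nobel = [5.5, 24.3, 8.6, 0.1, 6.1, 0.1, 25.3, 7.6, 9.0, 12.7, 1.9, 12.7,
         3.3, 1.5, 11.4, 25.5, 3.1, 1.9, 1.7, 31.9, 31.5, 18.9, 10.8]


def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 from the sums formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Line for the original 23 pairs
b0_before, b1_before = fit_line(chocolate, nobel)

# Line after adding the single extra point (50, 0)
b0_after, b1_after = fit_line(chocolate + [50.0], nobel + [0.0])

print(f"without (50, 0): y = {b0_before:.4f} + {b1_before:.4f}x")
print(f"with    (50, 0): y = {b0_after:.4f} + {b1_after:.4f}x")
```

Under this assumption, one added point is enough to flatten the line almost completely, which is what makes it influential.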
Example 3
Given the sample data (the numbers of registered boats are in tens of thousands):
Year 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
X: Boats (10,000s) 68 68 67 70 71 73 76 81 83 84
Y: Manatee Deaths 53 38 35 49 42 60 54 67 82 78
a. Find the value of the linear correlation coefficient r.
b. Test the claim that there is a linear correlation between the two variables x and y.
Use both (a) Method 1 and (b) Method 2. (α = 0.05)
c. Find the regression equation.
d. Assume that in 2001 there were 850,000 registered boats. Because the table lists the
numbers of registered boats in tens of thousands, this means that for 2001 we have x
= 85. Given that x = 85, find the best predicted value of y, the number of manatee
deaths from boats.
e. Using the above pairs and the value of r, what proportion of the variation in numbers of
manatee deaths can be explained by the variation in the number of registered boats?
Example 3 Continued
a. \(r = 0.922\)
b. P-value = 0.000151 < α = 0.05
   Decision: Reject H0. The claim is supported. There is a significant linear correlation between the 2 variables.
c. Regression equation: \(\hat{y} = a + bx = -112.71 + 2.274x\)
d. Significant linear correlation → plug x = 85 into the regression equation:
   \(\hat{y} = -112.71 + 2.274(85) = 80.58\), or about 81 manatee deaths.
e. \(r = 0.9215\) → \(r^2 = 0.84920 = 84.92\%\): about 84.92% of the variation in the number of manatee deaths can be explained by the variation in the number of registered boats.
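The results for parts a, c, d and e of Example 3 can be reproduced from the table with a short Python sketch. It uses the sums formulas from this section; the variable names are illustrative, and the P-value in part b is not computed here because that would need a t-distribution routine outside the standard library.

```python
from math import sqrt

# Example 3 data: registered boats (in 10,000s) and manatee deaths, 1991-2000
boats = [68, 68, 67, 70, 71, 73, 76, 81, 83, 84]
deaths = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]
n = len(boats)

sum_x, sum_y = sum(boats), sum(deaths)
sum_xy = sum(x * y for x, y in zip(boats, deaths))
sum_x2 = sum(x * x for x in boats)
sum_y2 = sum(y * y for y in deaths)

# a. linear correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)

# c. regression equation y-hat = a + b x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# d. best predicted value at x = 85 (850,000 boats)
predicted = a + b * 85

# e. proportion of variation in y explained by x
r_squared = r ** 2

print(f"r = {r:.4f}, y-hat = {a:.2f} + {b:.3f}x")
print(f"prediction at x = 85: {predicted:.2f}, r^2 = {r_squared:.4f}")
```

The prediction printed here uses unrounded coefficients, so it can differ from the slide's 80.58 in the last digit.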
Example 3 Continued
1) Null & alternative hypotheses: H0: ρ = 0, H1: ρ ≠ 0 (claim), two-tailed test (2TT).
2) Test statistic (TS), using r = 0.922:
\(t = r\sqrt{\dfrac{n - 2}{1 - r^2}}\), or equivalently \(t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}\), with df = n − 2
\(t = \dfrac{0.922 - 0}{\sqrt{\dfrac{1 - 0.922^2}{10 - 2}}} = \dfrac{0.922}{0.13689} = 6.7352\)
3) Distribution, RR & NRR (α = 0.05):
Method 1: t-test, df = n − 2 = 8 → CV: t = ±2.306
Method 2: n = 10, from the Pearson Correlation Coefficient table → CV: r = ±0.632
4) Make a decision: t = 6.7352 > 2.306 (Method 1) and r = 0.922 > 0.632 (Method 2), so both fall in the rejection region.
Decision:
a. Reject H0
b. The claim is supported
c. There is a significant linear correlation between the 2 variables.
Example 4:
a. Use the regression line for the table to the right to predict the y value when x is 10.
b. Predict the IQ score of an adult who is exactly 175 cm tall. (IQ scores have a mean of 100.)
Solution (a): Good Model: Use the regression equation for predictions. Why?
\(\hat{y} = b_0 + b_1 x = -3.37 + 2.49x\)
\(\hat{y}(10) = -3.37 + 2.49(10) = 21.53 \approx 21.5\)
Solution (b): Bad Model: Use \(\bar{y}\) for predictions.
Knowing that there is no correlation between height and IQ score, we know that a
regression equation is not a good model, so the best predicted value of IQ score is the
mean, which is 100.
Least-Squares Property: A straight line satisfies the least-squares property if the
sum of the squares of the residuals is the smallest sum possible.
Residual: For a pair of sample x and y values, the residual is the difference between the observed
sample value of y and the y value that is predicted by using the regression equation.
Residual = \(y - \hat{y}\) → The residual plot is the collection of pairs \((x, y - \hat{y})\). The residual plot should not have
any obvious pattern. The residual plot should not become much wider (or thinner) when viewed from
left to right.
Example 5: a. Find the residual value for the sample point
with coordinates of (8, 4). b. Draw the residual plot. c. What is
the value of the marginal change? Regression equation: \(\hat{y} = b_0 + b_1 x = 1 + x\)
x 8 12 20 24
y 4 24 8 32
a. Predicted value at x = 8: \(\hat{y} = 1 + 8 = 9\)
Observed value at x = 8: \(y = 4\)
Residual: \(y - \hat{y} = 4 - 9 = -5\)
c. Marginal change = slope = 1
10.2 Regression, Least-Squares Property & Residual Plots
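For part b of the residual example, the points of the residual plot can be listed with a few lines of Python. This sketch takes the regression equation ŷ = 1 + x as given and only computes the pairs (x, y − ŷ); drawing them is left to any plotting tool. The names are illustrative.

```python
# Sample data from the residual example
xs = [8, 12, 20, 24]
ys = [4, 24, 8, 32]


def predicted(x):
    """Regression equation from the example: y-hat = 1 + x."""
    return 1 + x


# Residual = observed y minus predicted y
residuals = [y - predicted(x) for x, y in zip(xs, ys)]

# The residual plot is the set of pairs (x, residual)
residual_plot_points = list(zip(xs, residuals))

print(residual_plot_points)
print("sum of residuals:", sum(residuals))
print("sum of squared residuals:", sum(e * e for e in residuals))
```

The residuals of a least-squares line sum to zero, which gives a quick sanity check on the arithmetic.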
10.2 Regression Summary
Finding the Correlation Coefficient and the Regression Line Equation
Step 1 Make a table, as shown in step 2.
Step 2 Find the values of xy, x², and y². Place them in the appropriate columns and sum each
column.
Step 3 Substitute in the formula to find the value of r:
\(r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}\)
Step 4 When r is significant, substitute in the formulas to find the values of a and b for the
regression line equation y' = a + bx.
\(\hat{y} = b_0 + b_1 x\), or \(\hat{y} = a + bx\)
Slope: \(b_1 = b = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}\)
Y-intercept: \(b_0 = a = \bar{y} - b_1\bar{x}\), with \(\bar{y} = \dfrac{\sum y}{n}\) and \(\bar{x} = \dfrac{\sum x}{n}\)
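The four steps above can be sketched in Python using the car rental data from Example 2. The table layout and names are illustrative; the significance check in Step 4 is assumed to have been done separately against a table of critical values.

```python
from math import sqrt

# Steps 1 and 2: build the table with columns x, y, xy, x^2, y^2
data = [(63.0, 7.0), (29.0, 3.9), (20.8, 2.1),
        (19.1, 2.8), (13.4, 1.4), (8.5, 1.5)]
table = [
    {"x": x, "y": y, "xy": x * y, "x2": x * x, "y2": y * y}
    for x, y in data
]
n = len(table)

# Sum each column
sums = {col: sum(row[col] for row in table) for col in ("x", "y", "xy", "x2", "y2")}

# Step 3: correlation coefficient from the column sums
r = (n * sums["xy"] - sums["x"] * sums["y"]) / (
    sqrt(n * sums["x2"] - sums["x"] ** 2) * sqrt(n * sums["y2"] - sums["y"] ** 2)
)

# Step 4: slope b and intercept a of y' = a + bx (assumes r was found significant)
b = (n * sums["xy"] - sums["x"] * sums["y"]) / (n * sums["x2"] - sums["x"] ** 2)
a = sums["y"] / n - b * sums["x"] / n

print({col: round(total, 2) for col, total in sums.items()})
print(f"r = {r:.3f}, y' = {a:.3f} + {b:.3f}x")
```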
Example 5: (Skip)
Find the equation of the regression line in which the explanatory variable (or x
variable) is chocolate consumption and the response variable (or y variable) is the
corresponding Nobel Laureate rate. The table of data is on the next slide.
Chocolate (x)   Nobel (y)
4.5             5.5
10.2            24.3
4.4             8.6
2.9             0.1
3.9             6.1
0.7             0.1
8.5             25.3
7.3             7.6
6.3             9.0
11.6            12.7
2.5             1.9
8.8             12.7
3.7             3.3
1.8             1.5
4.5             11.4
9.4             25.5
3.6             3.1
2.0             1.9
3.6             1.7
6.4             31.9
11.9            31.5
9.7             18.9
5.3             10.8
Solution: REQUIREMENT
(1) The data are assumed to be a simple random
sample (SRS).
(2) The scatterplot is very roughly a
straight-line pattern.
(3) There are no outliers.
Example 5:
Use the first formulas for b1 and b0
to find the equation of the
regression line in which the
explanatory variable (or x variable)
is chocolate consumption and the
response variable (or y variable) is
the corresponding number of Nobel
Laureates.
Find the slope b1 and the y-intercept b0 as follows, where r is the linear correlation coefficient, sy is the standard deviation of the sample y values, and sx is the standard deviation of the sample x values:
\(b_1 = r\,\dfrac{s_y}{s_x} = 0.80061 \cdot \dfrac{10.2116}{3.2792} = 2.4931\)
\(b_0 = \bar{y} - b_1\bar{x} = 11.10435 - 2.4931 \cdot 5.8043 = -3.3667\)
\(\hat{y} = b_0 + b_1 x = -3.3667 + 2.4931x\)
Graphing the Regression Line: Shown below is the Minitab display of the scatterplot with the graph of the regression line included. We can see that the regression line fits the points well, but the points are not very close to the line.