Chapter 10: Regression and Correlation
The previous chapter looked at comparing populations to see if there is a difference
between the two. That involved two random variables where the same measurement was
taken but from two different groups. This chapter will look at two random variables but
we will have one group where we are looking at two different measurements taken from
each object, and see if there is a relationship between the two quantitative variables. To
do this, you look at regression, which finds the linear relationship, and correlation, which
measures the strength of a linear relationship.
Please note: there are many other types of relationships besides linear that can be found
for the data. This book will only explore linear, but realize that there are other
relationships that can be used to describe data (quadratic, exponential, etc.).
Section 10.1: Regression
When comparing two different quantitative variables, two questions come to mind: “Is
there a relationship between two variables?” and “How strong is that relationship?”
These questions can be answered using regression and correlation. Regression answers
whether there is a relationship (again this book will explore linear only) and correlation
answers how strong the linear relationship is. To introduce both of these concepts, it is
easier to look at a set of data. Below are the steps from chapter 2 for making a
scatterplot.
TECHNOLOGY: SCATTERPLOT
Using StatCrunch:
• Enter the data into 2 columns in the spreadsheet (see earlier instructions on entering
a list of data)
• Click Graph, Scatter Plot
• In the popup window that opens, choose the X Variable and Y Variable from the
drop-down menus
• Under “Graph properties” you can put a title
• Then click “Compute!”
Using your TI84:
• First push STAT 1 and enter the data into L1 and L2
• Push 2nd Y= to open the STAT PLOTS menu. Then push 1 to select Plot1
• You need to make your input screen look like the screen below (you may have
different list names depending on where you put your data)
• Then push ZOOM 9 to see the scatterplot
Example #10.1.1: Making a Scatterplot
Is there a relationship between the alcohol content and the number of calories in
12-ounce beer? To determine if there is one, a random sample was taken of
beer’s alcohol content and calories ("Calories in beer," 2011), and the data are in
table #10.1.1. Make a scatterplot of the data.
Table #10.1.1: Alcohol and Calorie Content in Beer
Brand Brewery Alcohol Content Calories in 12 oz
Big Sky Scape Goat Pale Ale Big Sky Brewing 4.70% 163
Sierra Nevada Harvest Ale Sierra Nevada 6.70% 215
Steel Reserve MillerCoors 8.10% 222
O'Doul's Anheuser Busch 0.40% 70
Coors Light MillerCoors 4.15% 104
Genesee Cream Ale High Falls Brewing 5.10% 162
Sierra Nevada Summerfest Beer Sierra Nevada 5.00% 158
Michelob Beer Anheuser Busch 5.00% 155
Flying Dog Doggie Style Flying Dog Brewery 4.70% 158
Big Sky I.P.A. Big Sky Brewing 6.20% 195
Solution:
It is helpful to state the random variables in the context of the problem.
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce beer
Figure #10.1.1: Scatter Plot of Beer Data
This scatter plot looks fairly linear. However, notice that one beer in the list,
O’Doul’s, is actually considered a non-alcoholic beer, so its value is probably an
outlier. The rest of the analysis will not include O’Doul’s. In general you cannot
simply remove data points, but in this case it is reasonable, since all of the other
beers have a much larger alcohol content.
[Plot: “Calories vs Alcohol Content”; horizontal axis: Alcohol Content (%); vertical axis: Calories in 12 oz Beer]
The scatterplot without O’Doul’s is as follows: (TI84 and StatCrunch graphs)
The relation looks fairly linear. The next step is to find a line that best fits the data and
the corresponding equation of that line.
In high school algebra you spent many, many months each year on linear functions. Most
of you are familiar with the equation Y = mX + b that is used in most algebra texts. Some
of the more current algebra textbooks use the equation Y = a + bX (this matches the
equation that the calculator uses and the linear equation we will use in this class).
• X is the independent variable and is also called the predictor variable
• Y is the dependent variable and is also called the response variable
• The coefficient of X is the slope and the constant term is the Y-intercept.
• In this course we are going to be using the equation 𝑌̂(𝑋) = 𝑎 + 𝑏(𝑋)
• The “hat” on the Y reminds us that this is an estimated or predicted value of Y
• slope = coefficient of X = $b = \frac{b}{1} = \frac{\text{change in } Y}{\text{change in } X}$, which means for each 1 unit
increase in X, Y changes by b units on average. Whether Y increases or decreases
for every 1 unit increase in X depends on the sign of the slope.
• Y-intercept = (0, 𝑎) which means when X = 0, Y = 𝑎 (Sometimes the Y-intercept
will have no physical meaning with respect to the linear regression problem that
we are doing because it falls outside of the values that make sense or outside the
range of values sampled from, but it is still part of the equation and is needed to
plot the line)
• This equation should only be used for X-values between Xmin and Xmax. Many
relationships between variables may look linear in a particular range of X-values
but once you go beyond those values the relation may no longer be linear.
There are many conditions to check when doing linear regression. At this level, we
will check only the following assumptions:
1. The set (X,Y) of ordered pairs is a random sample from the population of all such
possible (X,Y) pairs.
2. The scatter plot of X versus Y has a roughly linear pattern with no outliers.
• Section 10.3 introduces a hypothesis test that tells us whether what we see is
linear enough.
In graphing real-world data, the scatterplots do not look as perfect as the ones from
algebra. In this case we find what we call a best-fitting line using a process called
regression. The notation we use in this class for the equation of the best-fitting line (also
called the least-squares regression equation) is as follows:
𝑌̂( 𝑋) = 𝑎 + 𝑏(𝑋)
TECHNOLOGY: REGRESSION EQUATION, 𝑌̂( 𝑋) = 𝑎 + 𝑏(𝑋)
Using StatCrunch:
• Enter data into 2 columns in the spreadsheet (see earlier instructions on entering a
list of data)
• Click Stat, Regression, Simple Linear
• In the popup window that opens, choose the X Variable and Y Variable from the
drop-down menus
• Then click “Compute!”
Using your TI84:
• First push STAT 1 and enter the data into L1 and L2
• Then push STAT ← to open the TESTS menu. Scroll down until you see
“LinRegTTest” and push ENTER. You can then enter the names of the lists where
you put your data.
• You also need to tell the TI to store this linear regression equation in Y1 by typing
VARS → 1 1 next to "RegEQ" on the LinRegTTest input screen.
• Your input screen should look like the one below (you may have stored your data
in different lists). For now, which inequality you highlight does not matter, but it
will in the later section when we do the hypothesis test.
• After you highlight “Calculate” and push ENTER you will get the following output
screen. You will need to scroll down to see all of the output.
• This equation is written as 𝑌̂(𝑋) = 𝑎 + 𝑏(𝑋)
• slope = $\frac{b}{1}$, which tells us on average how much we expect Y to change when X
increases by 1 unit
• Y-intercept: (0, 𝑎), which tells us the value of Y when X is 0. For many
applications, we do not interpret the Y-intercept because X = 0 is out of the scope
of the data and usually does not make sense to talk about (like a baby that weighs 0
pounds)
Example #10.1.2: Finding the Equation of the Line of Best Fit
Use the data from Example #10.1.1 (removing O’Doul’s) to do the following:
a) Find the equation of the line of best fit.
Solution:
Alcohol content is the explanatory variable and number of calories is the response
variable.
TI84: Values of alcohol content are in L1 and values of calories are in L2.
StatCrunch: Using Stat, Regression, Simple Linear
So the equation of the line of best fit is as follows:
𝑌̂( 𝑋) = 25.03123606+ 26.31860776(𝑋), where 4.15% ≤ 𝑋 ≤ 8.1%
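The coefficients reported by the TI84 and StatCrunch can also be reproduced directly from the data. The following is a minimal sketch, not the calculator's actual procedure: it uses the standard least-squares formulas (slope = sum of cross-deviations divided by sum of squared X-deviations, intercept = Y-mean minus slope times X-mean), and the names `alcohol`, `calories`, and `least_squares` are illustrative choices rather than anything from the text.

```python
# Beer data from Table #10.1.1 with O'Doul's removed (9 beers).
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]   # X, percent
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]           # Y, per 12 oz


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared X-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # Y-intercept
    return a, b


a, b = least_squares(alcohol, calories)
print(f"Y-hat(X) = {a:.4f} + {b:.4f}(X)")
```

The printed intercept and slope should agree with the values 25.03123606 and 26.31860776 in the equation above, up to rounding.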
b) Draw the scatterplot and the line of best fit on the same set of axes.
Solution:
Since we told the calculator to store the equation in Y1, if you push ZOOM 9 you
will see the scatterplot and the line of best fit drawn on the same set of axes. That
graph is also on the second page of results in StatCrunch.
c) Interpret the slope and Y-intercept in context.
Solution:
Slope = $\frac{\text{change in } Y}{\text{change in } X} = \frac{\text{change in calories}}{\text{change in alcohol content}} \approx \frac{26.32 \text{ calories}}{1\%}$
The slope here tells us that for every 1% increase in the alcohol content of beer
we expect the calories to increase by 26.32 on average.
The Y-intercept here would be (0%, 25.03 calories). This has no meaning with
respect to this problem since we only looked at alcoholic beers (so it makes no
sense to talk about a beer with 0% alcohol).
What makes this the best fitting line? The process of regression is used to find the line
that best fits the data. The criteria for the best fitting line that technology will use are as
follows.
1. The line must pass through the point (X̅,Y̅)
2. The line must make the sum of the square of the residuals as small as possible
What the heck is a residual? When you draw a line that “best” fits the data, that line will
not be able to pass through all of the points (in fact it might not pass through a single
point from the data). You can see that in the graph in Example #10.1.2. The residuals
give us a way to measure how far the line is vertically from each point in the data set.
Residual – the difference between the actual Y value and the predicted Y value on the
regression line for a particular value of X, 𝑥0. This is the directed vertical distance
between the actual point in the data and the corresponding point on the regression line.
residual = 𝑌( 𝑥0)− 𝑌̂(𝑥0)
• Data points above the line will have positive residuals.
• Data points below the line will have negative residuals.
• The sum of the residuals is always 0
The regression line and the residuals are displayed in figure #10.1.2.
Figure #10.1.2: Scatter Plot of Beer Data with Regression Line and Residuals
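The two criteria for the best-fitting line and the residual properties listed above can be checked numerically. This is a small sketch on the beer data without O'Doul's; the helper name `sum_squared_residuals` and the choice to nudge the slope by 1 are illustrative assumptions, picked only to show that a different line gives a larger sum of squared residuals.

```python
# Beer data from Table #10.1.1 with O'Doul's removed (9 beers).
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]   # X, percent
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]           # Y, per 12 oz

n = len(alcohol)
x_bar = sum(alcohol) / n
y_bar = sum(calories) / n

# Least-squares slope and intercept for Y-hat = a + b*X.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(alcohol, calories))
     / sum((x - x_bar) ** 2 for x in alcohol))
a = y_bar - b * x_bar


def sum_squared_residuals(a_try, b_try):
    """Sum of squared residuals for the candidate line Y-hat = a_try + b_try*X."""
    return sum((y - (a_try + b_try * x)) ** 2 for x, y in zip(alcohol, calories))


# residual = actual Y - predicted Y, one value per beer.
residuals = [y - (a + b * x) for x, y in zip(alcohol, calories)]

print("sum of residuals:", round(sum(residuals), 6))
print("line at X-bar:", a + b * x_bar, "  Y-bar:", y_bar)
print("sum of squares, fitted line:     ", sum_squared_residuals(a, b))
print("sum of squares, slope nudged by 1:", sum_squared_residuals(a, b + 1))
```

The residuals should sum to zero up to floating-point rounding, the line should return Y-bar at X-bar, and any other candidate line should give a larger sum of squared residuals than the fitted one.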
Example #10.1.3: Computing Predicted Values and Residuals.
a.) Use the regression equation to predict the number of calories when the alcohol
content is 6.50% based on the data given in Table #10.1.2 ("Calories in beer,"
2011) from a random sample of 9 beers.
Table #10.1.2: Alcohol and Calorie Content in Beer without Outlier
Brand Brewery Alcohol Content Calories in 12 oz
Big Sky Scape Goat Pale Ale Big Sky Brewing 4.70% 163
Sierra Nevada Harvest Ale Sierra Nevada 6.70% 215
Steel Reserve MillerCoors 8.10% 222
Coors Light MillerCoors 4.15% 104
Genesee Cream Ale High Falls Brewing 5.10% 162
Sierra Nevada Summerfest Beer Sierra Nevada 5.00% 158
Michelob Beer Anheuser Busch 5.00% 155
Flying Dog Doggie Style Flying Dog Brewery 4.70% 158
Big Sky I.P.A. Big Sky Brewing 6.20% 195
Solution:
State random variables
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce
beer
In this case, 𝑥0 = 6.50
First check that 6.50 is between Xmin and Xmax from the data.
4.15 ≤ 6.50 ≤ 8.1
𝑌̂(6.50) ≈ 25.03123606+ 26.31860776(6.50) ≈ 196.1 calories
This equation was also stored in Y1. So you can also find this predicted
value as follows:
𝑌̂(6.50) = 𝑌1(6.50) ≈ 196.1 calories
The following keystrokes will type Y1: VARS → 1 1
If you are drinking a beer that has 6.50% alcohol content, then it is predicted
to have 196.1 calories. Notice, the mean number of calories of the sample of
9 beers is about 170.2 calories. The value of 196.1 seems like a better
estimate than the mean when looking at the original data. The regression
equation is a better estimate than just the mean, since the regression equation
takes into account the alcohol content.
b.) Use the regression equation to predict the number of calories when the alcohol
content is 2.00%.
Solution:
In this case, 𝑥0 = 2.00
First check that 2.00 is between Xmin and Xmax from the data.
2.00 is not between 4.15 and 8.1
The equation should not be used to predict the calories for this beer. We also
should not use the mean from the sample of 9 beers, since there were
no beers in the sample with an alcohol content as low as 2.00%.
c.) Find the residual associated with the beer that had 6.70% alcohol.
Solution:
In this case, 𝑥0 = 6.70
residual = 𝑌( 𝑥0)− 𝑌̂(𝑥0) = actual Y – predicted Y
residual = 𝑌(6.70)− 𝑌̂(6.70) = actual Y – predicted Y
To get 𝑌(6.70) you need to look in the data table. The beer with 6.70% alcohol is
Sierra Nevada Harvest Ale in Table #10.1.2. This beer has 215 calories. So 𝑌(6.70) = 215
To get 𝑌̂(6.70) you need to use the equation.
𝑌̂(6.70) ≈ 25.03123606+ 26.31860776(6.70) ≈ 201.4 calories
This beer with 6.70% alcohol actually had 215 calories but the linear model
predicted that it would have 201.4 calories.
residual = 𝑌(6.70) − 𝑌̂(6.70) ≈ 215 calories − 201.4 calories ≈ 13.6 calories
This residual means that the actual value was about 13.6 calories above the
predicted value.
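The three parts of this example follow one pattern: check that the X-value lies in the sampled range, evaluate the equation, and subtract the prediction from the actual value to get a residual. The following sketch illustrates that pattern using the coefficients from Example #10.1.2. The function name `predict_calories` and the decision to return `None` outside the range are assumptions made for illustration, not part of the text.

```python
# Coefficients of the regression equation from Example #10.1.2.
A = 25.03123606     # Y-intercept
B = 26.31860776     # slope (calories per 1% alcohol)

# Smallest and largest alcohol content in the sample of 9 beers.
X_MIN, X_MAX = 4.15, 8.10


def predict_calories(x):
    """Return the predicted calories, or None when x is outside [X_MIN, X_MAX]."""
    if not (X_MIN <= x <= X_MAX):
        return None          # do not extrapolate beyond the sampled X-values
    return A + B * x


pred_650 = predict_calories(6.50)     # part a: inside the range
pred_200 = predict_calories(2.00)     # part b: outside the range
residual_670 = 215 - predict_calories(6.70)   # part c: actual Y - predicted Y

print("prediction at 6.50%:", pred_650)
print("prediction at 2.00%:", pred_200)
print("residual at 6.70%:", residual_670)
```

Returning `None` for 2.00% mirrors the conclusion in part b that the equation should not be used there; a real application might raise an error instead.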
Example #10.1.4: Interpreting a Negative Slope
For a set of sample data of elevation (in ft) and high temperature (in ℉) for
randomly selected cities, the following equation of the least squares regression
line was computed. Interpret the slope in context.
𝑌̂(𝑋) ≈ 77.37 − 0.0039(𝑋), 3000 ≤ 𝑋 ≤ 7000
Solution:
Slope = $\frac{\text{change in } Y}{\text{change in } X} = \frac{\text{change in high temperature}}{\text{change in elevation}} \approx \frac{-0.0039\ ^\circ\text{F}}{1 \text{ ft}}$
The slope here tells us that for every 1 foot increase in elevation we expect the
high temperature to decrease by 0.0039 ℉ on average. (NOTE: The word
“decrease” takes care of the negative sign in the numeric value of the slope.)
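Because the slope is so small per foot, it can help to scale it to a larger change in elevation. As a worked illustration (the elevations 4000 ft and 5000 ft are chosen here only because both lie inside the stated range 3000 ≤ X ≤ 7000), the predicted difference in high temperature between two cities 1000 ft apart is:

```latex
\hat{Y}(5000) - \hat{Y}(4000)
  = \bigl(77.37 - 0.0039(5000)\bigr) - \bigl(77.37 - 0.0039(4000)\bigr)
  = -0.0039 \times 1000
  = -3.9\ ^\circ\mathrm{F}
```

So for every 1000 ft increase in elevation, the high temperature is predicted to decrease by about 3.9 ℉ on average.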
Section 10.1: Homework
1.) When an anthropologist finds skeletal remains, they need to figure out the height
of the person. The height of a person (in cm) and the length of their metacarpal
bone 1 (in cm) were collected and are in table #10.1.5 ("Prediction of height,"
2013).
Table #10.1.5: Data of Metacarpal versus Height
Length of Metacarpal (cm) Height of Person (cm)
45 171
51 178
39 157
41 163
48 172
49 183
46 173
43 175
47 173
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the height of a person for a metacarpal length of 44 cm or state why
you shouldn’t.
g) Predict the height of a person for a metacarpal length of 55 cm or state why
you shouldn’t.
h) Compute the residual for the person with a metacarpal length of 45 cm.
Interpret what this value means in the context of this problem.
2.) Table #10.1.6 contains the value of each house and the annual rental
income that the house brings in ("Capital and rental," 2013).
Table #10.1.6: Data of House Value versus Annual Rental Income
Value Rental Value Rental Value Rental Value Rental
81000 6656 77000 4576 75000 7280 67500 6864
95000 7904 94000 8736 90000 6240 85000 7072
121000 12064 115000 7904 110000 7072 104000 7904
135000 8320 130000 9776 126000 6240 125000 7904
145000 8320 140000 9568 140000 9152 135000 7488
165000 13312 165000 8528 155000 7488 148000 8320
178000 11856 174000 10400 170000 9568 170000 12688
200000 12272 200000 10608 194000 11232 190000 8320
214000 8528 208000 10400 200000 10400 200000 8320
240000 10192 240000 12064 240000 11648 225000 12480
289000 11648 270000 12896 262000 10192 244500 11232
325000 12480 310000 12480 303000 12272 300000 12480
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the rental income for a house worth $230,000 or state why you shouldn’t.
g) Predict the rental income for a house worth $400,000 or state why you shouldn’t.
h) Compute the residual for the house worth $214,000. Interpret what this value
means in the context of this problem.
3.) The World Bank collects information on the life expectancy of a person in each
country ("Life expectancy at," 2013) and the fertility rate (average number of
children per woman) in the country ("Fertility rate," 2013). The data for 24
randomly selected countries for the year 2011 are in table #10.1.7.
Table #10.1.7: Data of Fertility Rates versus Life Expectancy
Fertility Rate Life Expectancy
1.7 77.2
5.8 55.4
2.2 69.9
2.1 76.4
1.8 75.0
2.0 78.2
2.6 73.0
2.8 70.8
1.4 82.6
2.6 68.9
1.5 81.0
6.9 54.2
2.4 67.1
1.5 73.3
2.5 74.2
1.4 80.7
2.9 72.1
2.1 78.3
4.7 62.9
6.8 54.4
5.2 55.9
4.2 66.0
1.5 76.0
3.9 72.3
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the life expectancy of a country that has a fertility rate of 2.7 or state
why you shouldn’t.
g) Predict the life expectancy of a country that has a fertility rate of 8.1 or state
why you shouldn’t.
h) Compute the residual for the country with a fertility rate of 5.8. Interpret what
this value means in the context of this problem.
4.) The height and weight of baseball players are in table #10.1.9 ("MLB heights
weights," 2013).
Table #10.1.9: Heights and Weights of Baseball Players
Height (inches) Weight (pounds)
76 212
76 224
72 180
74 210
75 215
71 200
77 235
78 235
77 194
76 185
72 180
72 170
75 220
74 228
73 210
72 180
70 185
73 190
71 186
74 200
74 200
75 210
78 240
72 208
75 180
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the weight of a baseball player that is 75 inches tall or state why you
shouldn’t.
g) Predict the weight of a baseball player that is 68 inches tall or state why you
shouldn’t.
h) Compute the residual for the baseball player that is 76 inches tall and weighs
212 pounds. Interpret what this value means in the context of this problem.
5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg)
and calories were measured. ("Data hotdogs," 2013) The data are in table
#10.1.11.
Table #10.1.11: Calories and Sodium Levels in Beef Hotdogs
Calories Sodium
186 495
181 477
176 425
149 322
184 482
190 587
158 370
139 322
175 479
148 375
152 330
111 300
141 386
153 401
190 645
157 440
131 317
149 319
135 298
132 253
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the amount of sodium a beef hotdog has if it has 170 calories or state
why you shouldn’t.
g) Predict the amount of sodium a beef hotdog has if it has 120 calories or state
why you shouldn’t.
h) Compute the residual for the beef hotdog with 153 calories. Interpret what
this value means in the context of this problem.
6.) Per capita income in 1960 dollars for European countries and the percent of the
labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD
economic development," 2013).
Table #10.1.12: Percent of Labor in Agriculture and Per Capita Income for
European Countries
Country Percent in Agriculture Per capita income
Sweden 14 1644
Switzerland 11 1361
Luxembourg 15 1242
U. Kingdom 4 1105
Denmark 18 1049
W. Germany 15 1035
France 20 1013
Belgium 6 1005
Norway 20 977
Iceland 25 839
Netherlands 11 810
Austria 23 681
Ireland 36 529
Italy 27 504
Greece 56 324
Spain 42 290
Portugal 44 238
Turkey 79 177
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the per capita income in a country that has 21 percent of labor in
agriculture or state why you shouldn’t.
g) Predict the per capita income in a country that has 2 percent of labor in
agriculture or state why you shouldn’t.
h) Compute the residual for the country with 6 percent of labor in agriculture.
Interpret what this value means in the context of this problem.
7.) Cigarette smoking and cancer have been linked. The number of deaths per one
hundred thousand from bladder cancer and the number of cigarettes sold per
capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013) for 44
randomly selected countries. Create a scatter plot and find a regression equation
between cigarette smoking and deaths of bladder cancer. Then use the regression
equation to find the number of deaths from bladder cancer when the cigarette
sales were 20 per capita and when the cigarette sales were 6 per capita. Which
number of deaths that you calculated do you think is closer to the true number?
Why?
Table #10.1.13: Number of Cigarettes and Number of Bladder Cancer
Deaths in 1960
Cigarette Sales (per Capita) Bladder Cancer Deaths (per 100 Thousand) Cigarette Sales (per Capita) Bladder Cancer Deaths (per 100 Thousand) Cigarette Sales (per Capita) Bladder Cancer Deaths (per 100 Thousand)
18.20 2.90 42.40 6.54 28.92 4.79
25.82 3.52 28.64 5.98 25.91 5.21
18.24 2.99 21.16 2.90 26.92 4.69
28.60 4.46 29.14 5.30 24.96 5.27
31.10 5.11 19.96 2.89 22.06 3.72
33.60 4.78 26.38 4.47 16.08 3.06
40.46 5.60 23.44 2.93 27.56 4.04
28.27 4.46 23.78 4.89 21.17 4.04
20.10 3.08 29.18 4.99 21.25 5.14
27.91 4.75 18.06 3.25 22.86 4.78
26.18 4.09 20.94 3.64 28.04 3.20
22.12 4.23 20.08 2.94 30.34 3.46
21.84 2.91 22.57 3.21 23.75 3.95
23.44 2.86 14.00 3.31 23.32 3.72
21.58 4.65 25.89 4.63
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the number of deaths from bladder cancer when the cigarette sales
were 20 per capita or state why you shouldn’t.
g) Predict the number of deaths from bladder cancer when the cigarette sales
were 6 per capita or state why you shouldn’t.
h) Compute the residual for the country where cigarette sales were 18.20 per
capita. Interpret what this value means in the context of this problem.
8.) The weight of a car can influence the mileage that the car can obtain. A random
sample of cars’ weights and mileages was collected, and the data are in table #10.1.14
("Passenger car mileage," 2013). Create a scatter plot and find a regression
equation between weight of cars and mileage. Then use the regression equation to
find the mileage on a car that weighs 3800 pounds and on a car that weighs 2000
pounds. Which mileage that you calculated do you think is closer to the true
mileage? Why?
Table #10.1.14: Weights and Mileages of Cars
Weight (100 pounds) Mileage (mpg) Weight (100 pounds) Mileage (mpg)
22.5 53.3 35.0 31.3
22.5 41.1 35.0 28.0
22.5 38.9 35.0 28.0
25.0 40.9 35.0 28.0
27.5 46.9 40.0 23.6
27.5 36.3 40.0 23.6
30.0 32.2 40.0 23.4
30.0 32.2 40.0 23.1
30.0 31.5 45.0 19.5
30.0 31.4 45.0 17.2
30.0 31.4 45.0 17.0
35.0 32.6 55.0 13.2
35.0 31.3
a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression
equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not
make sense to do so.
f) Predict the mileage on a car that weighs 3800 pounds or state why you
shouldn’t.
g) Predict the mileage on a car that weighs 2000 pounds or state why you
shouldn’t.
h) Compute the residual for the car that weighs 5500 pounds (55.0 in the table).
Interpret what this value means in the context of this problem.
Section 10.2: Correlation
A correlation exists between two quantitative variables when the values of one
quantitative variable are somehow associated with the values of the other quantitative
variable.
When you see a pattern in the data you say there is a correlation in the data. Though this
book is only dealing with linear patterns, patterns can be other math models such as
exponential, logarithmic, or periodic. To see this pattern, you can draw a scatter plot of
the data.
Remember to read graphs from left to right, the same as you read words. If the graph
goes up the correlation is positive and if the graph goes down the correlation is negative.
The words “weak”, “moderate”, and “strong” are used to describe the strength of the
relationship between the two variables.
Figure 10.2.1: Correlation Graphs
We need a numeric way to measure the strength of the linear relation between two
variables. This measure needs to be unitless. If someone measures heights in inches and
weights in pounds and someone else takes the same group of people and measures
heights in centimeters and weights in kilograms, then whatever we use to measure the
strength of the relationship between height and weight should be the same. The strength
should not depend on the units of measurement. What statistic do we have that is
unitless? The z-score.
The formula below is what was developed by Karl Pearson to measure the strength of
linear relation between two quantitative variables. If we had to make these computations
by hand (which we don’t!) we would first need to convert all of the X-coordinates into
their corresponding z-scores and then the same for the Y-coordinates. We will be using
technology to compute this.
$r = \frac{\sum z_x \cdot z_y}{n - 1}$
Linear correlation coefficient – a number that describes the strength of the linear
relationship between the two variables. It is also called the Pearson correlation
coefficient after Karl Pearson, who developed it. The symbol for the sample linear
correlation coefficient is r. The symbol for the population correlation coefficient is ρ
(Greek letter rho).
r is always between -1 and 1, inclusive.
r = -1 means there is a perfect negative linear correlation
r = 1 means there is a perfect positive linear correlation.
The closer r is to 1 or -1, the stronger the linear correlation.
The closer r is to 0, the weaker the linear correlation.
BE CAREFUL: r = 0 does not mean there is no correlation. It just means there is no
linear correlation. There might be a very strong curved pattern, as in the last graph of
Figure 10.2.1.
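A tiny made-up data set shows how r can be 0 even when Y is completely determined by X. In the sketch below the points lie exactly on the parabola Y = X², which is an invented illustration and not data from this chapter. The function name `pearson_r` is also an assumption, and it follows the z-score formula given above.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample linear correlation: sum of z_x * z_y divided by (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)       # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical data: a perfect curved (quadratic) pattern, symmetric about X = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print("r for the curved pattern:", r_curved)
```

The positive products on the right side of the parabola cancel the negative products on the left, so r comes out as 0 (up to rounding) despite the perfect relationship.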
There are many conditions to check for linear correlation. At this level, we
will check only the following assumptions (these are the same
assumptions we had in the last section for regression):
1. The set (X,Y) of ordered pairs is a random sample from the population of all such
possible (X,Y) pairs.
2. The scatter plot of X versus Y has a roughly linear pattern with no outliers.
• Section 10.3 introduces a hypothesis test that tells us whether what we see is
linear enough.
The value of the sample linear correlation coefficient is on the same output screen that
was used in the last section to get the equation of the best-fitting line.
This sample linear correlation coefficient is computed from unitless z-scores, so it is
unitless.
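The claim that r does not depend on the units of measurement can be checked by recomputing it after a change of units. The sketch below uses the beer data without O'Doul's rather than heights and weights, and the particular conversions (percent to proportion, food Calories to kilojoules at 4.184 kJ each) are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample linear correlation: sum of z_x * z_y divided by (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Beer data from Table #10.1.1 with O'Doul's removed (9 beers).
alcohol_percent = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]

# The same beers measured in different units (illustrative conversions).
alcohol_proportion = [x / 100 for x in alcohol_percent]   # 4.70% -> 0.047
kilojoules = [y * 4.184 for y in calories]                # 1 food Calorie = 4.184 kJ

r_original = pearson_r(alcohol_percent, calories)
r_converted = pearson_r(alcohol_proportion, kilojoules)
print("r in original units: ", r_original)
print("r in converted units:", r_converted)
```

The two values should match up to floating-point rounding, because rescaling a variable rescales its mean and standard deviation by the same factor and leaves every z-score unchanged.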
TECHNOLOGY: LINEAR CORRELATION COEFFICIENT
Using StatCrunch:
• Enter data into 2 columns in the spreadsheet (see earlier instructions on entering a
list of data)
• Click Stat, Regression, Simple Linear
• In the popup window that opens, choose the X Variable and Y Variable from the
drop-down menus
• Then click “Compute!”
Using your TI84:
• First push STAT 1 and enter the data into L1 and L2
• Then push STAT ← to open the TESTS menu. Scroll down until you see
“LinRegTTest” and push ENTER. You can then enter the names of the lists where
you put your data.
• Your input screen should look like the one below (you may have stored your data
in different lists). For now, which inequality you highlight does not matter, but it
will in the later section when we do the hypothesis test.
• After you highlight “Calculate” and push ENTER you will get the following output
screen. You will need to scroll down to see all of the output.
• You will need to scroll down to the bottom of the output screens to see the value of
r.
Example #10.2.1: Calculating the Linear Correlation Coefficient, r
How strong is the positive relationship between the alcohol content and the
number of calories in 12-ounce beer? To determine if there is a positive linear
correlation, a random sample was taken of beer’s alcohol content and calories for
several different beers ("Calories in beer," 2011), and the data are in table #10.2.1.
Find the correlation coefficient and interpret that value.
Table #10.2.1: Alcohol and Calorie Content in Beer without Outlier
Brand Brewery Alcohol Content Calories in 12 oz
Big Sky Scape Goat Pale Ale Big Sky Brewing 4.70% 163
Sierra Nevada Harvest Ale Sierra Nevada 6.70% 215
Steel Reserve MillerCoors 8.10% 222
Coors Light MillerCoors 4.15% 104
Genesee Cream Ale High Falls Brewing 5.10% 162
Sierra Nevada Summerfest Beer Sierra Nevada 5.00% 158
Michelob Beer Anheuser Busch 5.00% 155
Flying Dog Doggie Style Flying Dog Brewery 4.70% 158
Big Sky I.P.A. Big Sky Brewing 6.20% 195
Solution:
State random variables
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce beer
Assumptions check:
1. The problem states that a random sample of beers was taken
2. The scatterplot of the data looked roughly linear with no outliers
TI84: Use the LinRegTTest in the STAT menu. The setup is in figure 10.2.2.
Figure #10.2.2: Setup for Linear Regression Test on TI-84
Figure #10.2.3: Results for Linear Regression Test on TI-84
StatCrunch: Using Stat, Regression, Simple Linear
The correlation coefficient is 𝑟 ≈ 0.913. This is close to 1, so it looks like there
is a strong, positive linear correlation between alcohol content and number of
calories for beer.
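The value of r from the technology output can be reproduced by hand with Pearson's z-score formula. The sketch below does this for the beer data in Table #10.2.1. The helper name `z_scores` is an illustrative choice, and the sample standard deviation (dividing by n − 1) is used to match the formula's denominator.

```python
from statistics import mean, stdev

# Beer data from Table #10.2.1 (9 beers, O'Doul's removed).
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]   # X, percent
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]           # Y, per 12 oz
n = len(alcohol)


def z_scores(values):
    """Convert each value to (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


z_x = z_scores(alcohol)
z_y = z_scores(calories)

# r = (sum of z_x * z_y) / (n - 1)
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)
print(f"r = {r:.3f}")
```

The result should agree with the value 𝑟 ≈ 0.913 reported above.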
Causation
One common mistake people make is to assume that because there is a correlation,
one variable causes the other. A correlation alone does not show this. That would be like saying the
amount of alcohol in the beer causes it to have a certain number of calories. However,
fermentation of sugars is what causes the alcohol content. The more sugars you have, the
more alcohol can be made, and the more sugar, the higher the calories. It is actually the
amount of sugar that causes both. Do not confuse the idea of correlation with the concept
of causation. Just because two variables are correlated does not mean one causes the
other to happen.
Example #10.2.2: Correlation Versus Causation
A study showed a strong linear correlation between per capita beer consumption and
teachers’ salaries. Does giving a teacher a raise cause people to buy more beer?
Does buying more beer cause teachers to get a raise?
Solution:
There is probably some other factor causing both of them to increase at the same
time. Think about this: In a town where people have little extra money, they won’t
have money for beer and they won’t give teachers raises. In another town where
people have more extra money to spend it will be easier for them to buy more beer
and they would be more willing to give teachers raises.
Remember a correlation only means a pattern exists. It does not mean that one variable
causes the other variable to change. Correlation does not imply causation.
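The idea of a hidden factor driving both variables can be illustrated with a small simulation. This is only a sketch with made-up numbers, not data from the study in the example. The variable `wealth` stands in for the hypothetical "extra money" in each town, and `beer` and `salary` are each generated from `wealth` plus independent random noise, so neither one influences the other.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def correlation(x, y):
    """Sample linear correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical towns: wealth is the hidden common cause (simulated values)
wealth = [random.gauss(0, 1) for _ in range(1000)]
beer = [w + random.gauss(0, 0.5) for w in wealth]
salary = [w + random.gauss(0, 0.5) for w in wealth]

# Beer and salary are strongly correlated even though neither causes the other
r_raw = correlation(beer, salary)

# Subtract the contribution of wealth; only the independent noise remains
beer_left = [b - w for b, w in zip(beer, wealth)]
salary_left = [s - w for s, w in zip(salary, wealth)]
r_left = correlation(beer_left, salary_left)

print("correlation of beer and salary:", round(r_raw, 2))
print("correlation after removing wealth:", round(r_left, 2))
```

In this simulated setting the first correlation is large and the second is close to zero, which is what one would expect when a third variable is responsible for the pattern.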
Explained Variation
As stated before, there is some variability in the dependent variable values, such as
calories. Some of the variation in calories is due to alcohol content and some is due to
other factors. How much of the variation in the calories is due to alcohol content?
Two beers can have the same alcohol content, yet one has more calories
because of its other ingredients. Some variability is explained by the model and some
variability is not explained. The coefficient of determination gives us the proportion of
the variation in Y that is explained by the model with X as its predictor variable.
Coefficient of determination – measures the proportion of the variability in Y that is
explained by the linear model with X as its predictor variable.
 This value is next to r² on the LinRegTTest output screen
 This proportion is often changed to a percentage when its value is interpreted.
Example #10.2.3: Finding the Coefficient of Determination
Find the coefficient of determination for the beer data in Example 10.2.1 and
interpret the value.
Solution:
From the calculator results,
𝑟² ≈ 0.834
Interpret:
Thus, about 83.4% of the variation in calories is explained by the linear
relationship between alcohol content and calories. The other 16.6% of the
variation in calories is due to other factors.
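To see where "explained" and "unexplained" variation come from, the coefficient of determination can also be computed directly from the least-squares line of section 10.1. The sketch below is one way to do this in Python, assuming only the data in table #10.2.1. The variable names (`total`, `unexplained`, `r_squared`) are illustrative.

```python
# Data from table #10.2.1 (alcohol content in percent, calories per 12 oz)
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]

n = len(alcohol)
mean_x = sum(alcohol) / n
mean_y = sum(calories) / n

# Least-squares slope and intercept for predicting calories from alcohol
sxx = sum((x - mean_x) ** 2 for x in alcohol)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, calories))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in calories about the mean
total = sum((y - mean_y) ** 2 for y in calories)

# Variation left over after the line has made its predictions
unexplained = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(alcohol, calories)
)

r_squared = 1 - unexplained / total
print(round(r_squared, 3))              # 0.834, proportion explained
print(round(100 * (1 - r_squared), 1))  # 16.6, percent due to other factors
```

For a straight-line model with one predictor, this proportion is the square of the correlation coefficient, which is why the calculator labels it r².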
Now that you have a correlation coefficient for the sample data, how can you tell if it is
significant or not to determine if this linear relation exists for the population of objects?
This will be answered in the next section.
Section 10.2: Homework
These problems use the same data as section 10.1.
1.) When an anthropologist finds skeletal remains, they need to figure out the height
of the person. The height of a person (in cm) and the length of their metacarpal
bone 1 (in cm) were collected and are in table #10.1.5 ("Prediction of height,"
2013). Find the correlation coefficient and coefficient of determination and then
interpret both.
2.) Table #10.1.6 contains the value of the house and the amount of rental income in
a year that the house brings in ("Capital and rental," 2013). Find the correlation
coefficient and coefficient of determination and then interpret both.
3.) The World Bank collects information on the life expectancy of a person in each
country ("Life expectancy at," 2013) and the fertility rate per woman in the
country ("Fertility rate," 2013). The data for 24 randomly selected countries for
the year 2011 are in table #10.1.7. Find the correlation coefficient and coefficient
of determination and then interpret both.
4.) The height and weight of baseball players are in table #10.1.9 ("MLB
heightsweights," 2013). Find the correlation coefficient and coefficient of
determination and then interpret both.
5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg)
and calories were measured ("Data hotdogs," 2013). The data are in table
#10.1.11. Find the correlation coefficient and coefficient of determination
and then interpret both.
6.) Per capita income in 1960 dollars for European countries and the percent of the
labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD
economic development," 2013). Find the correlation coefficient and coefficient
of determination and then interpret both.
7.) Cigarette smoking and cancer have been linked. The number of deaths per one
hundred thousand from bladder cancer and the number of cigarettes sold per
capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013). Find the
correlation coefficient and coefficient of determination and then interpret
both.
8.) The weight of a car can influence the mileage that the car can obtain. A random
sample of car weights and mileages was collected, and the data are in table #10.1.14
("Passenger car mileage," 2013). Find the correlation coefficient and
coefficient of determination and then interpret both.
9.) There is a negative correlation between police expenditure and crime rate. Does
this mean that spending more money on police causes the crime rate to decrease?
Explain your answer.
10.) There is a positive correlation between tobacco sales and alcohol sales. Does that
mean that using tobacco causes a person to also drink alcohol? Explain your
answer.
11.) There is a positive correlation between the average temperature in a location and
the mortality rate from breast cancer. Does that mean that higher temperatures
cause more women to die of breast cancer? Explain your answer.
12.) There is a positive correlation between the length of time a tableware company
polishes a dish and the price of the dish. Does that mean that the time a plate is
polished determines the price of the dish? Explain your answer.
Section 10.3: Inference for Regression and Correlation
In the last section we computed the sample linear correlation coefficient. How do we
know if there is enough evidence in the sample data to conclude that a linear relation
exists in the population? We perform a hypothesis test. In this case the parameter we
will be testing is rho, 𝜌, which is the population linear correlation coefficient.
Hypothesis Test for Population Correlation (Lin Reg T-Test)
1. State the random variable and the Parameter in words.
𝜌 = the linear correlation between __ and __ for all _____
rv 𝑟 = the linear correlation between __ and __ for __ randomly selected _______
2. State the null and alternative Hypotheses and the level of significance
TWO-TAILED TEST: 𝐻0: 𝜌 = 0 versus 𝐻𝐴: 𝜌 ≠ 0 (this tests for any kind of a linear relation)
LEFT-TAILED TEST: 𝐻0: 𝜌 = 0 versus 𝐻𝐴: 𝜌 < 0 (this tests for a negative linear relation)
RIGHT-TAILED TEST: 𝐻0: 𝜌 = 0 versus 𝐻𝐴: 𝜌 > 0 (this tests for a positive linear relation)
Also, state your 𝛼 level here.
3. State and check the Assumptions for a hypothesis test
a) The set (X,Y) of ordered pairs is a random sample from the population of
all such possible (X,Y) pairs.
b) The scatter plot of x versus y has a roughly linear pattern with no outliers.
4. Name the hypothesis test used
In this case the assumptions for the Lin Reg T-Test have been satisfied.
5. Find the sample statistic and Test statistic
Sample correlation coefficient:
𝑟 = value next to r on the output screen
Test Statistic:
$t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
6. Obtain the p-value and illustrate the meaning of the p-value, sample
statistic and test statistic
[Figure: sketches for the TWO-TAILED, LEFT-TAILED, and RIGHT-TAILED tests, each with axes labeled 𝑟 and t and centered at 0]
7. Make a decision about H0
Reject 𝐻0 if the p-value ≤ 𝛼 and fail to reject 𝐻0 if the p-value > 𝛼
8. State a conclusion in the context of the problem
 If you reject 𝐻0, then there is significant evidence to conclude (𝐻𝐴 in
context)
 If you fail to reject 𝐻0, then there is NOT significant evidence to conclude
(𝐻𝐴 in context)
 We never say “accept 𝐻0”
Example #10.3.1: Testing the Claim of a Linear Correlation
Is there a positive linear correlation between beer’s alcohol content and calories?
To determine if there is a positive linear correlation, a random sample was taken
of beer’s alcohol content and calories for several different beers ("Calories in
beer," 2011), and the data are in table #10.2.1. Test at the 5% level.
Solution:
1. State the random variable and the Parameter in words.
𝜌 = the linear correlation between alcohol content and number of calories for
all beers with an alcohol content between 4.15% and 8.1%
rv 𝑟 = the linear correlation between alcohol content and number of calories
for 9 randomly selected beers with an alcohol content between 4.15% and
8.1%
2. State the null and alternative Hypotheses and the level of significance
Since you are asked if there is a positive correlation, use 𝜌 > 0 as the alternative hypothesis.
𝐻0: 𝜌 = 0
𝐻𝐴: 𝜌 > 0
𝛼 = 0.05
3. State and check the Assumptions for the hypothesis test
1. The problem states that a random sample of 9 beers was taken
2. The scatterplot of the data looked roughly linear with no outliers
4. Name the hypothesis test used
In this case the assumptions for the Linear Regression T-Test have been met.
5. Find the sample statistic and Test statistic
TI84: Use LinRegTTest from STAT TESTS
Input screen:
Output screen:
StatCrunch: Use Stat, Regression, Simple Linear
We need the T-Stat and P-value in the “Slope” row
Sample correlation coefficient:
𝑟 ≈ 0.913
Test Statistic: $t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{0.9134413647}{\sqrt{\dfrac{1 - 0.8343751268}{9 - 2}}} \approx 5.94$
This means the value of r is about 5.94 standard deviations above the
hypothesized value in the null hypothesis.
6. Obtain the p-value and illustrate the meaning of the p-value, sample statistic
and test statistic.
p-value ≈ 2.884 × 10⁻⁴ ≈ 0.0003
7. Make a decision about 𝐻0
Since the p-value ≤ 0.05, reject 𝐻0
8. State a conclusion in the context of the problem
There is enough evidence to show that there is a positive correlation between
alcohol content and number of calories in all 12-ounce bottles of beer with
alcohol content between 4.15% and 8.1%.
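Steps 5 through 7 of the example can be reproduced with a short program. The sketch below is an illustration only, not what the TI-84 or StatCrunch does internally. It takes the unrounded 𝑟 from the example, computes the test statistic with n − 2 degrees of freedom, and approximates the tail area of the t-distribution by numerical integration (Simpson's rule), which is a stand-in chosen here so that nothing beyond Python's standard library is needed. The function names `t_pdf`, `upper_tail`, and `lin_reg_t_test` are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the area lies above 0; subtract the part between 0 and t
    return 0.5 - total * h / 3


def lin_reg_t_test(r, n, tail="two"):
    """Return (t, p-value) for H0: rho = 0, given sample r (|r| < 1) and size n."""
    df = n - 2
    t = r / math.sqrt((1 - r ** 2) / df)
    beyond = upper_tail(abs(t), df)  # area beyond |t| in one tail
    if tail == "two":
        p = 2 * beyond
    elif tail == "right":
        p = beyond if t >= 0 else 1 - beyond
    elif tail == "left":
        p = beyond if t <= 0 else 1 - beyond
    else:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    return t, p


# Values from the example: r from the calculator output, n = 9 beers
t, p = lin_reg_t_test(0.9134413647, 9, tail="right")
alpha = 0.05
print(round(t, 2), "reject H0" if p <= alpha else "fail to reject H0")
# prints: 5.94 reject H0
```

The `tail` argument corresponds to the three forms of the alternative hypothesis in step 2. A right-tailed test is used here because the question asks about a positive correlation.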
Section 10.3: Homework
For each problem, use the PHANTOMS process.
1.) When an anthropologist finds skeletal remains, they need to figure out the height
of the person. The height of a person (in cm) and the length of their metacarpal
bone one (in cm) were collected and are in table #10.1.5 ("Prediction of height,"
2013). Test at the 1% level for a positive correlation between length of
metacarpal bone one and height of a person.
2.) Table #10.1.6 contains the value of the house and the amount of rental income in
a year that the house brings in ("Capital and rental," 2013). Test at the 5% level
for a positive correlation between house value and annual rental amount.
3.) The World Bank collects information on the life expectancy of a person in each
country ("Life expectancy at," 2013) and the fertility rate per woman in the
country ("Fertility rate," 2013). The data for 24 randomly selected countries for
the year 2011 are in table #10.1.7. Test at the 1% level for a negative correlation
between fertility rate and life expectancy.
4.) The height and weight of baseball players are in table #10.1.9 ("MLB
heightsweights," 2013). Test at the 5% level for a positive correlation between
height and weight of baseball players.
5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg)
and calories were measured ("Data hotdogs," 2013). The data are in table
#10.1.11. Test at the 5% level for a positive correlation between number of
calories and amount of sodium.
6.) Per capita income in 1960 dollars for European countries and the percent of the
labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD
economic development," 2013). Test at the 5% level for a negative correlation
between percent of labor force in agriculture and per capita income.
7.) Cigarette smoking and cancer have been linked. The number of deaths per one
hundred thousand from bladder cancer and the number of cigarettes sold per
capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013). Test at the
1% level for a positive correlation between cigarette smoking and deaths from
bladder cancer.
8.) The weight of a car can influence the mileage that the car can obtain. A random
sample of car weights and mileages was collected, and the data are in table #10.1.14
("Passenger car mileage," 2013). Test at the 5% level for a negative
correlation between the weight of cars and mileage.
Data Source:
Brain2bodyweight. (2013, November 16). Retrieved from
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight
Calories in beer, beer alcohol, beer carbohydrates. (2011, October 25). Retrieved from
www.beer100.com/beercalories.htm
Capital and rental values of Auckland properties. (2013, September 26). Retrieved from
http://www.statsci.org/data/oz/rentcap.html
Data hotdogs. (2013, November 16). Retrieved from
http://wiki.stat.ucla.edu/socr/index.php/SOCR_012708_ID_Data_HotDogs
Fertility rate. (2013, October 14). Retrieved from
http://data.worldbank.org/indicator/SP.DYN.TFRT.IN
Health expenditure. (2013, October 14). Retrieved from
http://data.worldbank.org/indicator/SH.XPD.TOTL.ZS
Life expectancy at birth. (2013, October 14). Retrieved from
http://data.worldbank.org/indicator/SP.DYN.LE00.IN
MLB heightsweights. (2013, November 16). Retrieved from
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights
OECD economic development. (2013, December 04). Retrieved from
http://lib.stat.cmu.edu/DASL/Datafiles/oecdat.html
Passenger car mileage. (2013, December 04). Retrieved from
http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html
Prediction of height from metacarpal bone length. (2013, September 26). Retrieved from
http://www.statsci.org/data/general/stature.html
Pregnant woman receiving prenatal care. (2013, October 14). Retrieved from
http://data.worldbank.org/indicator/SH.STA.ANVC.ZS
Smoking and cancer. (2013, December 04). Retrieved from
http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html

Single or Multiple melodic lines structure
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answers
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 

Chapter 10

  • 1. Chapter 10: Regression and Correlation

The previous chapter looked at comparing populations to see if there is a difference between the two. That involved two random variables where the same measurement was taken but from two different groups. This chapter will look at two random variables, but we will have one group where we are looking at two different measurements taken from each object, and see if there is a relationship between the two quantitative variables. To do this, you look at regression, which finds the linear relationship, and correlation, which measures the strength of a linear relationship.

Please note: there are many other types of relationships besides linear that can be found for the data. This book will only explore linear, but realize that there are other relationships that can be used to describe data (quadratic, exponential, etc.).

Section 10.1: Regression

When comparing two different quantitative variables, two questions come to mind: “Is there a relationship between two variables?” and “How strong is that relationship?” These questions can be answered using regression and correlation. Regression answers whether there is a relationship (again this book will explore linear only) and correlation answers how strong the linear relationship is. To introduce both of these concepts, it is easier to look at a set of data. Below are the steps from chapter 2 for making a scatterplot.

TECHNOLOGY: SCATTERPLOT

Using StatCrunch:
- Enter the data into 2 columns in the spreadsheet (see earlier instructions on entering a list of data)
- Click Graph, Scatter Plot
- In the popup window that opens choose the X Variable and Y Variable from the drop-down menus
- Under “Graph properties” you can put a title
- Then click “Compute!”

Using your TI84:
- First push STAT 1 and enter the data into L1 and L2
- Push 2nd Y= to open the STAT PLOTS menu. Then push 1 to select Plot1
- You need to make your input screen look like the screen below (you may have different list names depending on where you put your data)
- Then push ZOOM 9 to see the scatterplot
  • 2. Example #10.1.1: Making a Scatterplot

Is there a relationship between the alcohol content and the number of calories in a 12-ounce beer? To determine if there is one, a random sample was taken of beer’s alcohol content and calories ("Calories in beer," 2011), and the data are in table #10.1.1. Make a scatterplot of the data.

Table #10.1.1: Alcohol and Calorie Content in Beer
Brand | Brewery | Alcohol Content | Calories in 12 oz
Big Sky Scape Goat Pale Ale | Big Sky Brewing | 4.70% | 163
Sierra Nevada Harvest Ale | Sierra Nevada | 6.70% | 215
Steel Reserve | MillerCoors | 8.10% | 222
O'Doul's | Anheuser Busch | 0.40% | 70
Coors Light | MillerCoors | 4.15% | 104
Genesee Cream Ale | High Falls Brewing | 5.10% | 162
Sierra Nevada Summerfest Beer | Sierra Nevada | 5.00% | 158
Michelob Beer | Anheuser Busch | 5.00% | 155
Flying Dog Doggie Style | Flying Dog Brewery | 4.70% | 158
Big Sky I.P.A. | Big Sky Brewing | 6.20% | 195

Solution: It is helpful to state the random variables in the context of the problem.
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce beer

Figure #10.1.1: Scatter Plot of Beer Data (plot title: Calories vs Alcohol Content; horizontal axis: alcohol content in %; vertical axis: calories in a 12-ounce beer)

This scatter plot looks fairly linear. However, notice that one beer in the list, O’Doul’s, is actually considered a non-alcoholic beer. That value is probably an outlier. The rest of the analysis will not include O’Doul’s. You should not remove data points without a good reason, but in this case it makes sense to, since all the other beers have a fairly large alcohol content.
  • 3. The scatterplot without O’Doul’s is as follows: (TI84 and StatCrunch graphs)

The relation looks fairly linear. The next step is to find a line that best fits the data and the corresponding equation of that line.

In high school algebra you spent many, many months each year on linear functions. Most of you are familiar with the equation Y = mX + b that is used in most algebra texts. Some of the more current algebra textbooks use the equation Y = a + bX (this matches the equation that the calculator uses and the linear equation we will use in this class).
- X is the independent variable and is also called the predictor variable
- Y is the dependent variable and is also called the response variable
- The coefficient of X is the slope and the constant term is the Y-intercept.
- In this course we are going to be using the equation $\hat{Y}(X) = a + b(X)$
- The “hat” on the Y reminds us that this is an estimated or predicted value of Y
- slope = coefficient of X = $b = \frac{b}{1} = \frac{\text{change in } Y}{\text{change in } X}$, which means for each 1 unit increase in X, Y changes by b units on average. Whether Y increases or decreases for every 1 unit increase in X depends on the sign of the slope.
- Y-intercept = $(0, a)$, which means when X = 0, Y = a (Sometimes the Y-intercept will have no physical meaning with respect to the linear regression problem that we are doing because it falls outside of the values that make sense or outside the range of values sampled from, but it is still part of the equation and is needed to plot the line)
- This equation should only be used for X-values between Xmin and Xmax. Many relationships between variables may look linear in a particular range of X-values but once you go beyond those values the relation may no longer be linear.

There are many conditions to check when doing linear regression. In this level of a course, we are just going to look at checking the following assumptions:
1. The set (X,Y) of ordered pairs is a random sample from the population of all such possible (X,Y) pairs.
2. The scatter plot of X versus Y has a roughly linear pattern with no outliers.
- We will get a hypothesis test in section 10.3 to tell us if what we see is linear enough or not.
  • 4. In graphing real-world data, the scatterplots do not look as perfect as the ones from algebra. In this case we find what we call a best-fitting line using a process called regression. The notation we use in this class for the equation of the best-fitting line (also called the least-squares regression equation) is as follows:

$\hat{Y}(X) = a + b(X)$

TECHNOLOGY: REGRESSION EQUATION, $\hat{Y}(X) = a + b(X)$

Using StatCrunch:
- Enter data into 2 columns in the spreadsheet (see earlier instructions on entering a list of data)
- Click Stat, Regression, Simple Linear
- In the popup window that opens choose the X Variable and Y Variable from the drop-down menus
- Then click “Compute!”

Using your TI84:
- First push STAT 1 and enter the data into L1 and L2
- Then push STAT ← to open the TESTS menu. Scroll down until you see “LinRegTTest” and push ENTER. You can then enter the names of the lists where you put your data.
- You also need to tell the TI to store this linear regression equation in Y1 by typing VARS → 1 1 next to "RegEQ" on the LinRegTTest input screen.
- Your input screen should look like the one below (you may have stored your data in different lists. For now, which inequality you highlight does not matter, but it will in the later section when we do the hypothesis test).
- After you highlight “Calculate” and push ENTER you will get the following output screen. You will need to scroll down to see all of the output.
- This equation is written as $\hat{Y}(X) = a + b(X)$
- slope = $b = \frac{b}{1}$, which tells us on average how much we expect Y to change when X increases by 1 unit
- Y-intercept: $(0, a)$, which tells us the value of Y when X is 0. For many applications, we do not interpret the Y-intercept because X = 0 is out of the scope of the data and usually does not make sense to talk about (like a baby that weighs 0 pounds)
  • 5. Example #10.1.2: Finding the Equation of the Line of Best Fit

Use the data from Example #10.1.1 (removing O’Doul’s) to do the following:

a) Find the equation of the line of best fit.

Solution: Alcohol content is the explanatory variable and number of calories is the response variable.
TI84: Values of alcohol content are in L1 and values of calories are in L2.
StatCrunch: Using Stat, Regression, Simple Linear

So the equation of the line of best fit is as follows:

$\hat{Y}(X) = 25.03123606 + 26.31860776(X)$, where $4.15\% \le X \le 8.1\%$

b) Draw the scatterplot and the line of best fit on the same set of axes.

Solution: Since we told the calculator to store the equation in Y1, if you push ZOOM 9 you will see the scatterplot and the line of best fit drawn on the same set of axes. That graph is also on the second page of results in StatCrunch.
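The text obtains the coefficients from StatCrunch or the TI84. As a cross-check, the following is a minimal Python sketch that applies the standard least-squares formulas, $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$, to the nine beers. The function name `least_squares` and the variable names are illustrative assumptions, not something the text defines.

```python
# Beer data from Table #10.1.1 with O'Doul's removed.
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]  # X, percent
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]          # Y, per 12 oz


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared X-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # Y-intercept
    return a, b


a, b = least_squares(alcohol, calories)
print(f"Y-hat(X) = {a:.4f} + {b:.4f}(X)")
```

The printed coefficients should agree with the calculator output in part a) up to rounding.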
  • 6. c) Interpret the slope and Y-intercept in context.

Solution:

Slope $= \frac{\text{change in } Y}{\text{change in } X} = \frac{\text{change in calories}}{\text{change in alcohol content}} \approx \frac{26.32 \text{ calories}}{1\%}$

The slope here tells us that for every 1% increase in the alcohol content of beer we expect the calories to increase by 26.32 on average.

The Y-intercept here would be (0%, 25.03 calories). This has no meaning with respect to this problem since we only looked at alcoholic beers (so it makes no sense to talk about a beer with 0% alcohol).

What makes this the best fitting line? The process of regression is used to find the line that best fits the data. The criteria for the best fitting line that technology will use are as follows.
1. The line must pass through the point $(\bar{X}, \bar{Y})$
2. The line must make the sum of the squares of the residuals as small as possible

What the heck is a residual? When you draw a line that “best” fits the data, that line will not be able to pass through all of the points (in fact it might not pass through a single point from the data). You can see that in the graph in Example #10.1.2. The residuals give us a way to measure how far the line is vertically from each point in the data set.

Residual: the difference between the actual Y value and the predicted Y value on the regression line for a particular value of X, $x_0$. This is the directed vertical distance between the actual point in the data and the corresponding point on the regression line.

residual $= Y(x_0) - \hat{Y}(x_0)$

- Data points above the line will have positive residuals.
- Data points below the line will have negative residuals.
- The sum of the residuals is always 0

The regression line and the residuals are displayed in figure #10.1.2.

Figure #10.1.2: Scatter Plot of Beer Data with Regression Line and Residuals
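The properties of residuals listed above can be checked numerically. The sketch below assumes the rounded coefficients reported in Example #10.1.2 and the nine-beer data set; the helper name `sum_sq_residuals` is hypothetical. It checks that the residuals sum to roughly 0, that the line passes through $(\bar{X}, \bar{Y})$, and that nudging the slope or the intercept makes the sum of squared residuals larger.

```python
# Beer data with O'Doul's removed; coefficients from Example #10.1.2.
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]
a, b = 25.03123606, 26.31860776


def sum_sq_residuals(intercept, slope):
    """Sum of squared residuals (actual Y minus predicted Y) for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(alcohol, calories))


# residual = actual Y - predicted Y, one per beer
residuals = [y - (a + b * x) for x, y in zip(alcohol, calories)]
residual_total = sum(residuals)

x_bar = sum(alcohol) / len(alcohol)
y_bar = sum(calories) / len(calories)
gap_at_mean = (a + b * x_bar) - y_bar   # about 0 if the line passes through the means

sse_best = sum_sq_residuals(a, b)
sse_steeper = sum_sq_residuals(a, b + 0.5)   # same intercept, steeper line
sse_shifted = sum_sq_residuals(a + 1.0, b)   # same slope, line shifted up

print(f"sum of residuals: {residual_total:.6f}")
print(f"gap at (x-bar, y-bar): {gap_at_mean:.6f}")
print(f"SSE best / steeper / shifted: {sse_best:.1f} / {sse_steeper:.1f} / {sse_shifted:.1f}")
```

Because the coefficients are rounded, the sum of residuals is only approximately 0 rather than exactly 0.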
  • 7. Example #10.1.3: Computing Predicted Values and Residuals.

a.) Use the regression equation to predict the number of calories when the alcohol content is 6.50% based on the data given in Table #10.1.2 ("Calories in beer," 2011) from a random sample of 9 beers.

Table #10.1.2: Alcohol and Calorie Content in Beer without Outlier
Brand | Brewery | Alcohol Content | Calories in 12 oz
Big Sky Scape Goat Pale Ale | Big Sky Brewing | 4.70% | 163
Sierra Nevada Harvest Ale | Sierra Nevada | 6.70% | 215
Steel Reserve | MillerCoors | 8.10% | 222
Coors Light | MillerCoors | 4.15% | 104
Genesee Cream Ale | High Falls Brewing | 5.10% | 162
Sierra Nevada Summerfest Beer | Sierra Nevada | 5.00% | 158
Michelob Beer | Anheuser Busch | 5.00% | 155
Flying Dog Doggie Style | Flying Dog Brewery | 4.70% | 158
Big Sky I.P.A. | Big Sky Brewing | 6.20% | 195

Solution: State random variables
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce beer

In this case, $x_0 = 6.50$

First check that 6.50 is between Xmin and Xmax from the data: $4.15 \le 6.50 \le 8.1$

$\hat{Y}(6.50) \approx 25.03123606 + 26.31860776(6.50) \approx 196.1$ calories

This equation was also stored in Y1, so you can also find this predicted value as follows:

$\hat{Y}(6.50) = Y_1(6.50) \approx 196.1$ calories

The following keystrokes will type a Y1: VARS → 1 1

If you are drinking a beer that has 6.50% alcohol content, then it is predicted to have 196.1 calories. Notice that the mean number of calories of the sample of 9 beers is about 170.2 calories. The value of 196.1 seems like a better estimate than the mean when looking at the original data. The regression equation gives a better estimate than just the mean, since the regression equation takes into account the alcohol content.

b.) Use the regression equation to predict the number of calories when the alcohol content is 2.00%.
  • 8. Solution: In this case, $x_0 = 2.00$

First check that 2.00 is between Xmin and Xmax from the data. 2.00 is not between 4.15 and 8.1.

The equation should not be used to predict the calories for this beer. We also should not use the mean from the sample of 9 beers, since there were no beers in the sample with an alcohol content as low as 2.00%.

c.) Find the residual associated with the beer that had 6.70% alcohol.

Solution: In this case, $x_0 = 6.70$

residual $= Y(x_0) - \hat{Y}(x_0)$ = actual Y - predicted Y
residual $= Y(6.70) - \hat{Y}(6.70)$ = actual Y - predicted Y

To get $Y(6.70)$ you need to look in the data table. This beer is highlighted in Table #10.1.2. This beer has 215 calories. So $Y(6.70) = 215$

To get $\hat{Y}(6.70)$ you need to use the equation.

$\hat{Y}(6.70) \approx 25.03123606 + 26.31860776(6.70) \approx 201.4$ calories

This beer with 6.70% alcohol actually had 215 calories but the linear model predicted that it would have 201.4 calories.

residual $= Y(6.70) - \hat{Y}(6.70) \approx 215 \text{ calories} - 201.4 \text{ calories} \approx 13.6$ calories

This residual means that the actual value was about 13.6 calories above the predicted value.

Example #10.1.4: Interpreting a Negative Slope

For a set of sample data of elevation (in ft) and high temperature (in ℉) for randomly selected cities, the following equation of the least squares regression line was computed. Interpret the slope in context.

$\hat{Y}(X) \approx 77.37 - 0.0039(X)$, $3000 \le X \le 7000$

Solution:

Slope $= \frac{\text{change in } Y}{\text{change in } X} = \frac{\text{change in high temperature}}{\text{change in elevation}} \approx \frac{-0.0039\ ℉}{1 \text{ ft}}$

The slope here tells us that for every 1 foot increase in elevation we expect the high temperature to decrease by 0.0039 ℉ on average. (NOTE: The word “decrease” takes care of the negative sign in the numeric value of the slope.)
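The three parts of Example #10.1.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the coefficients and the X-range are taken from Example #10.1.2, and the function name `predict_calories` is hypothetical. The range check mirrors the rule that the equation should only be used for X-values between Xmin and Xmax.

```python
# Coefficients and sampled X-range from Example #10.1.2 (beer data, no O'Doul's).
A, B = 25.03123606, 26.31860776
X_MIN, X_MAX = 4.15, 8.10


def predict_calories(alcohol_pct):
    """Predicted calories, refusing to extrapolate outside the sampled range."""
    if not X_MIN <= alcohol_pct <= X_MAX:
        raise ValueError(
            f"{alcohol_pct}% is outside [{X_MIN}%, {X_MAX}%]; do not predict"
        )
    return A + B * alcohol_pct


# a.) Prediction inside the range.
pred_650 = predict_calories(6.50)
print(f"Predicted calories at 6.50%: {pred_650:.1f}")

# b.) 2.00% is outside the range, so the sketch raises instead of predicting.
try:
    predict_calories(2.00)
    refused = False
except ValueError as err:
    refused = True
    print(err)

# c.) Residual = actual Y - predicted Y for the beer with 6.70% alcohol (215 calories).
residual_670 = 215 - predict_calories(6.70)
print(f"Residual at 6.70%: {residual_670:.1f}")
```

Raising an error for out-of-range inputs is one design choice; returning a flagged value would work equally well for the purpose of the example.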
  • 9. Section 10.1: Homework

1.) When an anthropologist finds skeletal remains, they need to figure out the height of the person. The height of a person (in cm) and the length of their metacarpal bone 1 (in cm) were collected and are in table #10.1.5 ("Prediction of height," 2013).

Table #10.1.5: Data of Metacarpal versus Height
Length of Metacarpal (cm) | Height of Person (cm)
45 | 171
51 | 178
39 | 157
41 | 163
48 | 172
49 | 183
46 | 173
43 | 175
47 | 173

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the height of a person for a metacarpal length of 44 cm or state why you shouldn’t.
g) Predict the height of a person for a metacarpal length of 55 cm or state why you shouldn’t.
h) Compute the residual for the person with a metacarpal length of 45 cm. Interpret what this value means in the context of this problem.
  • 10. 2.) Table #10.1.6 contains the value of the house and the annual rental income that the house brings in ("Capital and rental," 2013).

Table #10.1.6: Data of House Value versus Annual Rental Income
Value | Rental | Value | Rental | Value | Rental | Value | Rental
81000 | 6656 | 77000 | 4576 | 75000 | 7280 | 67500 | 6864
95000 | 7904 | 94000 | 8736 | 90000 | 6240 | 85000 | 7072
121000 | 12064 | 115000 | 7904 | 110000 | 7072 | 104000 | 7904
135000 | 8320 | 130000 | 9776 | 126000 | 6240 | 125000 | 7904
145000 | 8320 | 140000 | 9568 | 140000 | 9152 | 135000 | 7488
165000 | 13312 | 165000 | 8528 | 155000 | 7488 | 148000 | 8320
178000 | 11856 | 174000 | 10400 | 170000 | 9568 | 170000 | 12688
200000 | 12272 | 200000 | 10608 | 194000 | 11232 | 190000 | 8320
214000 | 8528 | 208000 | 10400 | 200000 | 10400 | 200000 | 8320
240000 | 10192 | 240000 | 12064 | 240000 | 11648 | 225000 | 12480
289000 | 11648 | 270000 | 12896 | 262000 | 10192 | 244500 | 11232
325000 | 12480 | 310000 | 12480 | 303000 | 12272 | 300000 | 12480

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the rental income for a house worth $230,000 or state why you shouldn’t.
g) Predict the rental income for a house worth $400,000 or state why you shouldn’t.
h) Compute the residual for the house worth $214,000. Interpret what this value means in the context of this problem.
  • 11. 3.) The World Bank collects information on the life expectancy of a person in each country ("Life expectancy at," 2013) and the fertility rate (average number of children per woman) in the country ("Fertility rate," 2013). The data for 24 randomly selected countries for the year 2011 are in table #10.1.7.

Table #10.1.7: Data of Fertility Rates versus Life Expectancy
Fertility Rate | Life Expectancy
1.7 | 77.2
5.8 | 55.4
2.2 | 69.9
2.1 | 76.4
1.8 | 75.0
2.0 | 78.2
2.6 | 73.0
2.8 | 70.8
1.4 | 82.6
2.6 | 68.9
1.5 | 81.0
6.9 | 54.2
2.4 | 67.1
1.5 | 73.3
2.5 | 74.2
1.4 | 80.7
2.9 | 72.1
2.1 | 78.3
4.7 | 62.9
6.8 | 54.4
5.2 | 55.9
4.2 | 66.0
1.5 | 76.0
3.9 | 72.3

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the life expectancy of a country that has a fertility rate of 2.7 or state why you shouldn’t.
g) Predict the life expectancy of a country that has a fertility rate of 8.1 or state why you shouldn’t.
h) Compute the residual for the country with a fertility rate of 5.8. Interpret what this value means in the context of this problem.
  • 12. 4.) The height and weight of baseball players are in table #10.1.9 ("MLB heights weights," 2013).

Table #10.1.9: Heights and Weights of Baseball Players
Height (inches) | Weight (pounds)
76 | 212
76 | 224
72 | 180
74 | 210
75 | 215
71 | 200
77 | 235
78 | 235
77 | 194
76 | 185
72 | 180
72 | 170
75 | 220
74 | 228
73 | 210
72 | 180
70 | 185
73 | 190
71 | 186
74 | 200
74 | 200
75 | 210
78 | 240
72 | 208
75 | 180

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the weight of a baseball player that is 75 inches tall or state why you shouldn’t.
g) Predict the weight of a baseball player that is 68 inches tall or state why you shouldn’t.
h) Compute the residual for the baseball player that is 76 inches tall and weighs 212 pounds. Interpret what this value means in the context of this problem.
  • 13. 5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg) and calories were measured ("Data hotdogs," 2013). The data are in table #10.1.11.

Table #10.1.11: Calories and Sodium Levels in Beef Hotdogs
Calories | Sodium
186 | 495
181 | 477
176 | 425
149 | 322
184 | 482
190 | 587
158 | 370
139 | 322
175 | 479
148 | 375
152 | 330
111 | 300
141 | 386
153 | 401
190 | 645
157 | 440
131 | 317
149 | 319
135 | 298
132 | 253

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the amount of sodium a beef hotdog has if it has 170 calories or state why you shouldn’t.
g) Predict the amount of sodium a beef hotdog has if it has 120 calories or state why you shouldn’t.
h) Compute the residual for the beef hotdog with 153 calories. Interpret what this value means in the context of this problem.
  • 14. 6.) Per capita income in 1960 dollars for European countries and the percent of the labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD economic development," 2013).

Table #10.1.12: Percent of Labor in Agriculture and Per Capita Income for European Countries
Country | Percent in Agriculture | Per capita income
Sweden | 14 | 1644
Switzerland | 11 | 1361
Luxembourg | 15 | 1242
U. Kingdom | 4 | 1105
Denmark | 18 | 1049
W. Germany | 15 | 1035
France | 20 | 1013
Belgium | 6 | 1005
Norway | 20 | 977
Iceland | 25 | 839
Netherlands | 11 | 810
Austria | 23 | 681
Ireland | 36 | 529
Italy | 27 | 504
Greece | 56 | 324
Spain | 42 | 290
Portugal | 44 | 238
Turkey | 79 | 177

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the per capita income in a country that has 21 percent of labor in agriculture or state why you shouldn’t.
g) Predict the per capita income in a country that has 2 percent of labor in agriculture or state why you shouldn’t.
h) Compute the residual for the country with 6 percent of labor in agriculture. Interpret what this value means in the context of this problem.
  • 15. 7.) Cigarette smoking and cancer have been linked. The number of deaths per one hundred thousand from bladder cancer and the number of cigarettes sold per capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013) for 44 randomly selected countries. Create a scatter plot and find a regression equation between cigarette smoking and deaths from bladder cancer. Then use the regression equation to find the number of deaths from bladder cancer when the cigarette sales were 20 per capita and when the cigarette sales were 6 per capita. Which number of deaths that you calculated do you think is closer to the true number? Why?

Table #10.1.13: Number of Cigarettes and Number of Bladder Cancer Deaths in 1960
Cigarette Sales (per Capita) | Bladder Cancer Deaths (per 100 Thousand) | Cigarette Sales (per Capita) | Bladder Cancer Deaths (per 100 Thousand) | Cigarette Sales (per Capita) | Bladder Cancer Deaths (per 100 Thousand)
18.20 | 2.90 | 42.40 | 6.54 | 28.92 | 4.79
25.82 | 3.52 | 28.64 | 5.98 | 25.91 | 5.21
18.24 | 2.99 | 21.16 | 2.90 | 26.92 | 4.69
28.60 | 4.46 | 29.14 | 5.30 | 24.96 | 5.27
31.10 | 5.11 | 19.96 | 2.89 | 22.06 | 3.72
33.60 | 4.78 | 26.38 | 4.47 | 16.08 | 3.06
40.46 | 5.60 | 23.44 | 2.93 | 27.56 | 4.04
28.27 | 4.46 | 23.78 | 4.89 | 21.17 | 4.04
20.10 | 3.08 | 29.18 | 4.99 | 21.25 | 5.14
27.91 | 4.75 | 18.06 | 3.25 | 22.86 | 4.78
26.18 | 4.09 | 20.94 | 3.64 | 28.04 | 3.20
22.12 | 4.23 | 20.08 | 2.94 | 30.34 | 3.46
21.84 | 2.91 | 22.57 | 3.21 | 23.75 | 3.95
23.44 | 2.86 | 14.00 | 3.31 | 23.32 | 3.72
21.58 | 4.65 | 25.89 | 4.63 | |

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the number of deaths from bladder cancer when the cigarette sales were 20 per capita or state why you shouldn’t.
g) Predict the number of deaths from bladder cancer when the cigarette sales were 6 per capita or state why you shouldn’t.
h) Compute the residual for the country where cigarette sales were 18.20 per capita. Interpret what this value means in the context of this problem.
  • 16. 8.) The weight of a car can influence the mileage that the car can obtain. A random sample of cars’ weights and mileage was collected and the data are in table #10.1.14 ("Passenger car mileage," 2013). Create a scatter plot and find a regression equation between weight of cars and mileage. Then use the regression equation to find the mileage on a car that weighs 3800 pounds and on a car that weighs 2000 pounds. Which mileage that you calculated do you think is closer to the true mileage? Why?

Table #10.1.14: Weights and Mileages of Cars
Weight (100 pounds) | Mileage (mpg) | Weight (100 pounds) | Mileage (mpg)
22.5 | 53.3 | 35.0 | 31.3
22.5 | 41.1 | 35.0 | 28.0
22.5 | 38.9 | 35.0 | 28.0
25.0 | 40.9 | 35.0 | 28.0
27.5 | 46.9 | 40.0 | 23.6
27.5 | 36.3 | 40.0 | 23.6
30.0 | 32.2 | 40.0 | 23.4
30.0 | 32.2 | 40.0 | 23.1
30.0 | 31.5 | 45.0 | 19.5
30.0 | 31.4 | 45.0 | 17.2
30.0 | 31.4 | 45.0 | 17.0
35.0 | 32.6 | 55.0 | 13.2
35.0 | 31.3 | |

a) State the random variables.
b) Make a scatterplot of X versus Y.
c) Find the equation of the best-fitting line (the least squares regression equation).
d) Interpret the slope in the context of this problem.
e) Interpret the Y-intercept in the context of this problem or state why it does not make sense to do so.
f) Predict the mileage on a car that weighs 3800 pounds or state why you shouldn’t.
g) Predict the mileage on a car that weighs 2000 pounds or state why you shouldn’t.
h) Compute the residual for the car that weighs 5500 pounds (55.0 in the table, since weight is recorded in hundreds of pounds). Interpret what this value means in the context of this problem.
  • 17. Section 10.2: Correlation

A correlation exists between two quantitative variables when the values of one quantitative variable are somehow associated with the values of the other quantitative variable. When you see a pattern in the data you say there is a correlation in the data. Though this book is only dealing with linear patterns, patterns can be other math models such as exponential, logarithmic, or periodic. To see this pattern, you can draw a scatter plot of the data.

Remember to read graphs from left to right, the same as you read words. If the graph goes up the correlation is positive and if the graph goes down the correlation is negative. The words “weak”, “moderate”, and “strong” are used to describe the strength of the relationship between the two variables.

Figure #10.2.1: Correlation Graphs

We need a numeric way to measure the strength of the linear relation between two variables. This measure needs to be unitless. If someone measures heights in inches and weights in pounds and someone else takes the same group of people and measures heights in centimeters and weights in kilograms, then whatever we use to measure the strength of the relationship between height and weight should be the same. The strength should not depend on the units of measurement. What statistic do we have that is unitless? z-scores.
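To see why z-scores are a natural unitless building block, the short Python sketch below converts the same heights from inches to centimeters and compares the z-scores. The eight heights are the first eight values from Table #10.1.9, used here purely as an illustration; the function name `z_scores` is an assumption, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert each value to a z-score using the sample mean and sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# First eight heights from Table #10.1.9, in inches, then the same heights in centimeters.
heights_in = [76, 76, 72, 74, 75, 71, 77, 78]
heights_cm = [h * 2.54 for h in heights_in]

z_in = z_scores(heights_in)
z_cm = z_scores(heights_cm)

# The largest difference between the two sets of z-scores should be essentially 0.
max_gap = max(abs(p - q) for p, q in zip(z_in, z_cm))
print(f"largest difference between inch and cm z-scores: {max_gap:.2e}")
```

Changing the unit rescales both the deviations from the mean and the standard deviation by the same factor, so the factor cancels in each z-score.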
  • 18. Chapter 10: Regression and Correlation 332 The formula below is what was developed by Karl Pearson to measure the strength of linear relation between two quantitative variables. If we had to make these computations by hand (which we don’t!) we would first need to convert all of the X-coordinates into their corresponding z-scores and then the same for the Y-coordinates. We will be using technology to compute this. 𝑟 = ∑ 𝑧 𝑥 ∙ 𝑧 𝑦 𝑛 − 1 Linear correlation coefficient – is a number that describes the strength of the linear relationship between the two variables. It is also called the Pearson correlation coefficient after Karl Pearson who developed it. The symbol for the sample linear correlation coefficient is r. The symbol for the population correlation coefficient is r (Greek letter rho) r is always between -1 and 1, inclusive. r = -1 means there is a perfect negative linear correlation r = 1 means there is a perfect positive linear correlation. The closer r is to 1 or -1, the stronger the linear correlation. The closer r is to 0, the weaker the linear correlation. BE CAREFUL: r = 0 does not mean there is no correlation. It just means there is no linear correlation. There might be a very strong curved pattern like in the last graph on the previous page. There are many conditions to check for linear correlation. In this level of a course, we are just going to look at checking the following assumptions (these are the same assumptions we had in the last section for regression): 1. The set (X,Y) of ordered pairs is a random sample from the population of all such possible (X,Y) pairs. 2. The scatter plot of X versus Y has a roughly linear pattern with no outliers.  We will get a hypothesis test in section 10.3 to tell us if what we see is linear enough or not. The value of the sample linear correlation coefficient is on the same output screen that was used in the last section to get the equation of the best-fitting line. 
This sample linear correlation coefficient is computed from unitless z-scores, so it is unitless.
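The z-score formula for r and the claim that r does not depend on units can be checked with a short Python sketch. The heights and weights below are made up purely for illustration, and the helper names `z_scores` and `pearson_r` are assumptions of this sketch rather than anything defined in the text.

```python
import math

def z_scores(values):
    """Convert a list of numbers to z-scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    """Sample linear correlation: sum of z_x * z_y divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical data, for illustration only
heights_in = [62, 65, 67, 70, 72, 74]
weights_lb = [120, 140, 150, 172, 180, 205]

# The same people measured in metric units
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(round(r_imperial, 6), round(r_metric, 6))
```

Both printed values should agree, because rescaling a variable changes its mean and standard deviation by the same factor and leaves every z-score unchanged.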
  • 19. TECHNOLOGY: LINEAR CORRELATION COEFFICIENT
Using StatCrunch:
 Enter data into 2 columns in the spreadsheet (see earlier instructions on entering a list of data)
 Click Stat, Regression, Simple Linear
 In the popup window that opens choose the X Variable and Y Variable from the drop-down menus
 Then click “Compute!”
Using your TI84:
 First push STAT 1 and enter the data into L1 and L2
 Then push STAT ← to open the TESTS menu. Scroll down until you see “LinRegTTest” and push ENTER. You can then enter the names of the lists where you put your data.
 Your input screen should look like the one below (you may have stored your data in different lists). For now, which inequality you highlight does not matter, but it will in the later section when we do the hypothesis test.
 After you highlight “Calculate” and push ENTER you will get the following output screen. You will need to scroll down to the bottom of the output screens to see the value of r.
  • 20. Example #10.2.1: Calculating the Linear Correlation Coefficient, r
How strong is the positive relationship between the alcohol content and the number of calories in 12-ounce beer? To determine if there is a positive linear correlation, a random sample was taken of beer’s alcohol content and calories for several different beers ("Calories in beer," 2011), and the data are in table #10.2.1. Find the correlation coefficient and interpret that value.
Table #10.2.1: Alcohol and Calorie Content in Beer without Outlier
Brand                          Brewery             Alcohol Content  Calories in 12 oz
Big Sky Scape Goat Pale Ale    Big Sky Brewing     4.70%            163
Sierra Nevada Harvest Ale      Sierra Nevada       6.70%            215
Steel Reserve                  MillerCoors         8.10%            222
Coors Light                    MillerCoors         4.15%            104
Genesee Cream Ale              High Falls Brewing  5.10%            162
Sierra Nevada Summerfest Beer  Sierra Nevada       5.00%            158
Michelob Beer                  Anheuser Busch      5.00%            155
Flying Dog Doggie Style        Flying Dog Brewery  4.70%            158
Big Sky I.P.A.                 Big Sky Brewing     6.20%            195
Solution:
State random variables
rv X = alcohol content in a randomly selected 12-ounce beer
rv Y = number of calories in that same randomly selected 12-ounce beer
Assumptions check:
1. The problem states that a random sample of beers was taken
2. The scatterplot of the data looked roughly linear with no outliers
TI84: Use the LinRegTTest in the STAT menu. The setup is in figure 10.2.2.
Figure #10.2.2: Setup for Linear Regression Test on TI-84
  • 21. Figure #10.2.3: Results for Linear Regression Test on TI-84
StatCrunch: Using Stat, Regression, Simple Linear
The correlation coefficient is r ≈ 0.913. This is close to 1, so it looks like there is a strong, positive linear correlation between alcohol content and number of calories for beer.
Causation
One common mistake people make is to assume that because there is a correlation, one variable causes the other. This is usually not the case. That would be like saying the amount of alcohol in the beer causes it to have a certain number of calories. However, fermentation of sugars is what causes the alcohol content. The more sugars you have, the more alcohol can be made, and the more sugar, the higher the calories. It is actually the amount of sugar that causes both. Do not confuse the idea of correlation with the concept of causation. Just because two variables are correlated does not mean one causes the other to happen.
Example #10.2.2: Correlation Versus Causation
A study showed a strong linear correlation between per capita beer consumption and teachers’ salaries. Does giving a teacher a raise cause people to buy more beer? Does buying more beer cause teachers to get a raise?
Solution:
There is probably some other factor causing both of them to increase at the same time. Think about this: In a town where people have little extra money, they won’t have money for beer and they won’t give teachers raises. In another town where people have more extra money to spend, it will be easier for them to buy more beer and they would be more willing to give teachers raises.
Remember a correlation only means a pattern exists. It does not mean that one variable causes the other variable to change.
Correlation does not imply causation.
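The town example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the variable names, the coefficients 0.5 and 0.6, and the noise levels are all hypothetical, and the data are randomly generated rather than taken from any study. A single hidden variable (town wealth) drives both beer consumption and teacher salaries, and neither of those two affects the other.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample linear correlation from sums of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the simulation is repeatable

# Hypothetical model: wealth drives both quantities; neither affects the other
wealth = [random.uniform(20, 100) for _ in range(2000)]
beer_noise = [random.gauss(0, 3) for _ in wealth]
salary_noise = [random.gauss(0, 4) for _ in wealth]
beer = [0.5 * w + e for w, e in zip(wealth, beer_noise)]
salary = [0.6 * w + e for w, e in zip(wealth, salary_noise)]

r_observed = pearson_r(beer, salary)              # correlation seen in the data
r_leftover = pearson_r(beer_noise, salary_noise)  # what remains once wealth is removed
print(round(r_observed, 3), round(r_leftover, 3))
```

The first value should be strongly positive even though beer and salary never influence each other in the model, while the second should be close to 0, which is the pattern a lurking variable produces.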
  • 22. Explained Variation
As stated before, there is some variability in the dependent variable values, such as calories. Some of the variation in calories is due to alcohol content and some is due to other factors. How much of the variation in the calories is due to alcohol content? You can have two beers at the same alcohol content, but beer one has higher calories because of the other ingredients. Some variability is explained by the model and some variability is not explained. The coefficient of determination gives us the proportion of the variation in Y that is explained by the model with X as its predictor variable.
Coefficient of determination: measures the proportion of the variability in Y that is explained by the linear model with X as its predictor variable.
 This value is next to r² on the LinRegTTest output screen
 This proportion is often changed to a percentage when its value is interpreted.
Example #10.2.3: Finding the Coefficient of Determination
Find the coefficient of determination for the beer data in Example 10.2.1 and interpret the value.
Solution:
From the calculator results, r² ≈ 0.834
Interpret: Thus, about 83.4% of the variation in calories is explained by the linear relationship between alcohol content and calories. The other 16.6% of the variation in calories is due to other factors.
Now that you have a correlation coefficient for the sample data, how can you tell if it is significant or not to determine if this linear relation exists for the population of objects? This will be answered in the next section.
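For readers without a TI-84 or StatCrunch, the values in Examples 10.2.1 and 10.2.3 can be cross-checked with a few lines of Python. This is a sketch, not the book's method: it uses the sums-of-deviations form of r, which is algebraically the same as the z-score formula, and the variable names are chosen here for illustration. The data are the nine beers in table #10.2.1.

```python
import math

# Alcohol content (%) and calories per 12 oz, from table #10.2.1
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]

n = len(alcohol)
mean_x = sum(alcohol) / n
mean_y = sum(calories) / n

# Sums of squared deviations and cross-products
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, calories))
sxx = sum((x - mean_x) ** 2 for x in alcohol)
syy = sum((y - mean_y) ** 2 for y in calories)

r = sxy / math.sqrt(sxx * syy)  # sample linear correlation coefficient
r_squared = r ** 2              # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

The printed values should match the calculator results quoted in the two examples, r ≈ 0.913 and r² ≈ 0.834.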
  • 23. Section 10.2: Homework
These problems use the same data as section 10.1.
1.) When an anthropologist finds skeletal remains, they need to figure out the height of the person. The height of a person (in cm) and the length of their metacarpal bone 1 (in cm) were collected and are in table #10.1.5 ("Prediction of height," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
2.) Table #10.1.6 contains the value of the house and the amount of rental income in a year that the house brings in ("Capital and rental," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
3.) The World Bank collects information on the life expectancy of a person in each country ("Life expectancy at," 2013) and the fertility rate per woman in the country ("Fertility rate," 2013). The data for 24 randomly selected countries for the year 2011 are in table #10.1.7. Find the correlation coefficient and coefficient of determination and then interpret both.
4.) The height and weight of baseball players are in table #10.1.9 ("MLB heightsweights," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg) and calories were measured ("Data hotdogs," 2013). The data are in table #10.1.11. Find the correlation coefficient and coefficient of determination and then interpret both.
6.) Per capita income in 1960 dollars for European countries and the percent of the labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD economic development," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
7.) Cigarette smoking and cancer have been linked. The number of deaths per one hundred thousand from bladder cancer and the number of cigarettes sold per capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
8.) The weight of a car can influence the mileage that the car can obtain. A random sample of cars’ weights and mileage was collected, and the data are in table #10.1.14 ("Passenger car mileage," 2013). Find the correlation coefficient and coefficient of determination and then interpret both.
  • 24. 9.) There is a negative correlation between police expenditure and crime rate. Does this mean that spending more money on police causes the crime rate to decrease? Explain your answer.
10.) There is a positive correlation between tobacco sales and alcohol sales. Does that mean that using tobacco causes a person to also drink alcohol? Explain your answer.
11.) There is a positive correlation between the average temperature in a location and the mortality rate from breast cancer. Does that mean that higher temperatures cause more women to die of breast cancer? Explain your answer.
12.) There is a positive correlation between the length of time a tableware company polishes a dish and the price of the dish. Does that mean that the time a plate is polished determines the price of the dish? Explain your answer.
Section 10.3: Inference for Regression and Correlation
In the last section we computed the sample linear correlation coefficient. How do we know if there is enough evidence in the sample data to conclude that a linear relation exists in the population? We perform a hypothesis test. In this case the parameter we will be testing is rho, ρ, which is the population linear correlation coefficient.
  • 25. Hypothesis Test for Population Correlation (Lin Reg T-Test)
1. State the random variable and the Parameter in words.
ρ = the linear correlation between __ and __ for all _____
rv r = the linear correlation between __ and __ for __ randomly selected _______
2. State the null and alternative Hypotheses and the level of significance
TWO-TAILED TEST: H0: ρ = 0, HA: ρ ≠ 0 (this tests for any kind of a linear relation)
LEFT-TAILED TEST: H0: ρ = 0, HA: ρ < 0 (this tests for a negative linear relation)
RIGHT-TAILED TEST: H0: ρ = 0, HA: ρ > 0 (this tests for a positive linear relation)
Also, state your α level here.
3. State and check the Assumptions for a hypothesis test
a) The set (X,Y) of ordered pairs is a random sample from the population of all such possible (X,Y) pairs.
b) The scatter plot of X versus Y has a roughly linear pattern with no outliers.
4. Name the hypothesis test used
In this case the assumptions for the Lin Reg T-Test have been satisfied.
5. Find the sample statistic and Test statistic
Sample correlation coefficient: r = value next to r on the output screen
Test Statistic:
$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$$
6. Obtain the p-value and illustrate the meaning of the p-value, sample statistic and test statistic
[Figure: sketches for the TWO-TAILED, LEFT-TAILED, and RIGHT-TAILED tests, each with an r axis and a t axis centered at 0]
7. Make a decision about H0
Reject H0 if the p-value ≤ α and fail to reject H0 if the p-value > α
8. State a conclusion in the context of the problem
 If you reject H0, then there is significant evidence to conclude (HA in context)
 If you fail to reject H0, then there is NOT significant evidence to conclude (HA in context)
 We never say “accept H0”
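Steps 5 through 7 can be sketched in Python using only the standard library. The function names, the tail labels 'two', 'left', and 'right', and the inputs r = 0.60, n = 12 are all assumptions made for this illustration. The p-value comes from a Student's t distribution with n - 2 degrees of freedom, and the tail area is approximated here by numerically integrating the t density with Simpson's rule, which stands in for whatever method the calculator uses internally.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: rho = 0 (requires -1 < r < 1)."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # math.gamma overflows for very large df, so this sketch suits small samples
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + u * u / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t), via Simpson's rule on the density from 0 to |t| (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3   # area under the density between 0 and |t|
    upper = 0.5 - area     # P(T > |t|)
    return upper if t >= 0 else 1 - upper

def p_value(r, n, tail):
    """tail is 'two', 'left', or 'right' (labels chosen for this sketch)."""
    right = t_upper_tail(t_statistic(r, n), n - 2)
    if tail == "right":
        return right
    if tail == "left":
        return 1 - right
    return 2 * min(right, 1 - right)

# Hypothetical inputs, for illustration only
r, n, alpha = 0.60, 12, 0.05
p = p_value(r, n, "two")
print("reject H0" if p <= alpha else "fail to reject H0")
```

Choosing the tail only changes which area under the t curve is reported; the test statistic itself is the same for all three versions of the test.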
  • 26. Example #10.3.1: Testing the Claim of a Linear Correlation
Is there a positive linear correlation between beer’s alcohol content and calories? To determine if there is a positive linear correlation, a random sample was taken of beer’s alcohol content and calories for several different beers ("Calories in beer," 2011), and the data are in table #10.2.1. Test at the 5% level.
Solution:
1. State the random variable and the Parameter in words.
ρ = the linear correlation between alcohol content and number of calories for all beers with an alcohol content between 4.15% and 8.1%
rv r = the linear correlation between alcohol content and number of calories for 9 randomly selected beers with an alcohol content between 4.15% and 8.1%
2. State the null and alternative Hypotheses and the level of significance
Since you are asked if there is a positive correlation, use ρ > 0.
H0: ρ = 0
HA: ρ > 0
α = 0.05
3. State and check the Assumptions for the hypothesis test
1. The problem states that a random sample of 9 beers was taken
2. The scatterplot of the data looked roughly linear with no outliers
4. Name the hypothesis test used
In this case the assumptions for the Linear Regression T-Test have been met.
5. Find the sample statistic and Test statistic
TI84: Use LinRegTTest from STAT TESTS
Input screen:
Output screen:
  • 27. StatCrunch: Use Stat, Regression, Simple Linear
We need the T-Stat and P-value in the “Slope” row
Sample correlation coefficient: r ≈ 0.913
Test Statistic:
$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.9134413647}{\sqrt{\frac{1 - 0.8343751268}{9 - 2}}} \approx 5.94$$
This means the value of r is about 5.94 standard deviations above the hypothesized value in the null hypothesis.
6. Obtain the p-value and illustrate the meaning of the p-value, sample statistic and test statistic.
p-value ≈ $2.884 \times 10^{-4}$ ≈ 0.0003
7. Make a decision about H0
Since the p-value ≤ 0.05, reject H0
8. State a conclusion in the context of the problem
There is enough evidence to show that there is a positive correlation between alcohol content and number of calories in all 12-ounce bottles of beer with alcohol content between 4.15% and 8.1%.
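The StatCrunch instruction to read the T-Stat from the “Slope” row works because, in simple linear regression, the t statistic for the slope is algebraically identical to the t statistic computed from r. The sketch below checks this on the beer data from table #10.2.1; the variable names are illustrative, and the slope and standard error formulas are the usual least-squares ones rather than anything specific to the calculator or StatCrunch.

```python
import math

# Alcohol content (%) and calories per 12 oz, from table #10.2.1
alcohol = [4.70, 6.70, 8.10, 4.15, 5.10, 5.00, 5.00, 4.70, 6.20]
calories = [163, 215, 222, 104, 162, 158, 155, 158, 195]

n = len(alcohol)
mean_x = sum(alcohol) / n
mean_y = sum(calories) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, calories))
sxx = sum((x - mean_x) ** 2 for x in alcohol)
syy = sum((y - mean_y) ** 2 for y in calories)

# Route 1: t from the correlation coefficient
r = sxy / math.sqrt(sxx * syy)
t_from_r = r / math.sqrt((1 - r ** 2) / (n - 2))

# Route 2: t from the least-squares slope and its standard error
slope = sxy / sxx
sse = syy - slope * sxy  # sum of squared residuals about the fitted line
se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t_from_slope = slope / se_slope

print(round(t_from_r, 2), round(t_from_slope, 2))
```

Both routes should give the same value, about 5.94, which is why either output can be used for the test.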
  • 28. Section 10.3: Homework
For each problem, use the PHANTOMS process.
1.) When an anthropologist finds skeletal remains, they need to figure out the height of the person. The height of a person (in cm) and the length of their metacarpal bone one (in cm) were collected and are in table #10.1.5 ("Prediction of height," 2013). Test at the 1% level for a positive correlation between length of metacarpal bone one and height of a person.
2.) Table #10.1.6 contains the value of the house and the amount of rental income in a year that the house brings in ("Capital and rental," 2013). Test at the 5% level for a positive correlation between house value and annual rental amount.
3.) The World Bank collects information on the life expectancy of a person in each country ("Life expectancy at," 2013) and the fertility rate per woman in the country ("Fertility rate," 2013). The data for 24 randomly selected countries for the year 2011 are in table #10.1.7. Test at the 1% level for a negative correlation between fertility rate and life expectancy.
4.) The height and weight of baseball players are in table #10.1.9 ("MLB heightsweights," 2013). Test at the 5% level for a positive correlation between height and weight of baseball players.
5.) A random sample of beef hotdogs was taken and the amount of sodium (in mg) and calories were measured ("Data hotdogs," 2013). The data are in table #10.1.11. Test at the 5% level for a positive correlation between number of calories and amount of sodium.
6.) Per capita income in 1960 dollars for European countries and the percent of the labor force that works in agriculture in 1960 are in table #10.1.12 ("OECD economic development," 2013). Test at the 5% level for a negative correlation between percent of labor force in agriculture and per capita income.
7.) Cigarette smoking and cancer have been linked. The number of deaths per one hundred thousand from bladder cancer and the number of cigarettes sold per capita in 1960 are in table #10.1.13 ("Smoking and cancer," 2013). Test at the 1% level for a positive correlation between cigarette smoking and deaths from bladder cancer.
8.) The weight of a car can influence the mileage that the car can obtain. A random sample of cars’ weights and mileage was collected, and the data are in table #10.1.14 ("Passenger car mileage," 2013). Test at the 5% level for a negative correlation between the weight of cars and mileage.
  • 29. Data Source:
Brain2bodyweight. (2013, November 16). Retrieved from http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight
Calories in beer, beer alcohol, beer carbohydrates. (2011, October 25). Retrieved from www.beer100.com/beercalories.htm
Capital and rental values of Auckland properties. (2013, September 26). Retrieved from http://www.statsci.org/data/oz/rentcap.html
Data hotdogs. (2013, November 16). Retrieved from http://wiki.stat.ucla.edu/socr/index.php/SOCR_012708_ID_Data_HotDogs
Fertility rate. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SP.DYN.TFRT.IN
Health expenditure. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SH.XPD.TOTL.ZS
Life expectancy at birth. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SP.DYN.LE00.IN
MLB heightsweights. (2013, November 16). Retrieved from http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights
OECD economic development. (2013, December 04). Retrieved from http://lib.stat.cmu.edu/DASL/Datafiles/oecdat.html
Passenger car mileage. (2013, December 04). Retrieved from http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html
Prediction of height from metacarpal bone length. (2013, September 26). Retrieved from http://www.statsci.org/data/general/stature.html
Pregnant woman receiving prenatal care. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SH.STA.ANVC.ZS
Smoking and cancer. (2013, December 04). Retrieved from http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html