Correlation

Elementary Statistics
Chapter 10: Correlation and Regression
10.1 Correlation

Chapter 10: Correlation and Regression
10.1 Correlation
10.2 Regression
10.3 Prediction Intervals and Variation
10.4 Multiple Regression
10.5 Nonlinear Regression
Objectives:
• Draw a scatter plot for a set of ordered pairs.
• Compute the correlation coefficient.
• Test the hypothesis 𝐻0: ρ = 0.
• Compute the equation of the regression line & the coefficient of determination.
• Compute the standard error of the estimate & a prediction interval.

Key Concept: In addition to hypothesis testing and confidence intervals, inferential statistics is used to determine whether a relationship exists between two or more quantitative variables.
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
A linear correlation exists between two variables when there is a correlation and the plotted points approximate a straight line.
This section considers only linear relationships, in which the points in a scatterplot approximate a straight-line pattern. It also presents methods for conducting a formal hypothesis test to decide whether there is a linear correlation between all population values of the two variables.
The linear correlation coefficient r is a number that measures how well paired sample data fit a straight-line pattern when graphed; that is, it measures the strength of the linear association between the two variables. Use the sample of paired data (sometimes called bivariate data) to find the value of r, and then use r to decide whether there is a linear correlation between the two variables.
Regression is a statistical method used to describe the nature of the relationship between variables—that is,
positive or negative, linear or nonlinear.
Questions:
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?

Scatterplots & the Strength of the Linear Correlation: r
A scatter plot is a graph of the ordered pairs (x, y), where
x: the independent variable
y: the dependent variable
a. Distinct straight-line, or linear, pattern. We say that there is a
positive linear correlation between x and y, since as the x values
increase, the corresponding y values also increase.
b. Distinct straight-line, or linear pattern. We say that there is a
negative linear correlation between x and y, since as the x values
increase, the corresponding y values decrease.
c. No distinct pattern, which suggests that there is no correlation
between x and y.
d. Distinct pattern suggesting a correlation between x and y, but the
pattern is not that of a straight line.

Linear Correlation Coefficient r
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
To answer these two questions, statisticians use the correlation coefficient, a numerical measure to determine
whether two or more variables are related and to determine the strength of the relationship between or among the
variables.
• Linear Correlation Coefficient r
 The linear correlation coefficient r measures the strength of the linear correlation between the paired quantitative x values and y
values in a sample. It determines whether there is a linear correlation between two variables.
3. What type of relationship exists?
There are two types of relationships: Simple and Multiple
In a simple relationship, there are two variables: an independent variable (predictor variable) and a dependent
variable (response variable).
In a multiple relationship, there are two or more independent variables that are used to predict one dependent
variable.
4. What kind of predictions can be made from the relationship?
Predictions are made daily in all areas. Examples include weather forecasting, stock market analyses, sales
predictions, crop predictions, gasoline price predictions, and sports predictions. Some predictions are more accurate
than others, due to the strength of the relationship. That is, the stronger the relationship is between variables, the
more accurate the prediction is.

Example 1
Construct a scatter plot for the data shown for car rental companies in the United States for a recent year.
Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.
TI Calculator:
How to enter data:
1. Stat
2. Edit
3. ClrList L1
4. Or Highlight & Clear
5. Type in your data in L1, ..
TI Calculator:
Scatter Plot:
1. Press Y= & clear
2. 2nd Y= (STATPLOT), Enter
3. On, Enter
4. Select X-list: L1
5. Select Y-list: L2
6. Mark: Select Character
7. Press Zoom & 9 to get ZoomStat

Calculating and Interpreting the Linear Correlation Coefficient, Denoted by r
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
There are several types of correlation coefficients. The one explained in this section is called the Pearson product moment correlation coefficient (PPMC).
The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.
The range of the correlation coefficient is from −1 to +1.
If there is a strong positive linear relationship between the variables, the value of r will be close to +1 (0.7 or larger).
If there is a strong negative linear relationship between the variables, the value of r will be close to −1 (−0.7 or smaller).
Notation:
n = number of pairs of sample data
∑x = sum of all x values
∑x² = sum of the squared x values (square each x value, then add)
(∑x)² = square of the sum of the x values (add the x values, then square the total). Avoid confusing ∑x² and (∑x)².
∑xy = sum of the products (multiply each x value by its corresponding y value, then add all such products)
r = linear correlation coefficient for sample data
ρ (rho) = linear correlation coefficient for a population of paired data

Given any collection of sample paired quantitative data, the linear correlation coefficient r can
always be computed, but the following requirements should be satisfied when using the sample
paired data to make a conclusion about linear correlation in the corresponding population of
paired data.
1. The sample of paired (x, y) data is a simple random sample of quantitative data.
2. Visual examination of the scatterplot must confirm that the points approximate a straight-line
pattern.
3. Because results can be strongly affected by the presence of outliers, any outliers must be removed if
they are known to be errors. The effects of any other outliers should be considered by calculating r
with and without the outliers included.
Note: Requirements 2 and 3 are simplified attempts at checking the formal requirement that the pairs of (x, y) data have a bivariate normal distribution.
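Computational formula:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$
Equivalent z-score formula:
$$ r = \frac{\sum \left(z_x z_y\right)}{n - 1} $$
z_x denotes the z score for an individual sample value x
z_y denotes the z score for the corresponding sample value y.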
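The two formulas for r can be checked against each other in a few lines of code. The following is a minimal sketch, not part of the original slides: the function names `pearson_r_sums` and `pearson_r_zscores` are illustrative, and the data are the five chocolate/Nobel pairs used in the examples of this section.

```python
import math

def pearson_r_sums(xs, ys):
    """Pearson r from the computational formula based on column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def pearson_r_zscores(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

chocolate = [5, 6, 4, 4, 5]
nobel = [6, 9, 3, 2, 11]

r_sums = pearson_r_sums(chocolate, nobel)
r_z = pearson_r_zscores(chocolate, nobel)
print(round(r_sums, 3), round(r_z, 3))  # both round to 0.795
```

Both functions give the same value up to floating-point rounding, which is the point of the sketch: the z-score form is the easier one to interpret, while the sums form is the easier one to compute by hand.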

Example 2: Finding r Using Technology
The table lists five paired data values. Use technology to find the value of the
correlation coefficient r for the data.
Chocolate 5 6 4 4 5
Nobel 6 9 3 2 11
Solution:
The value of r will be automatically calculated with
software or a calculator: r = 0.795

Example 3a: Finding r Using the Computational Formula
Use the formula to find the value of the linear correlation coefficient r for the five pairs of chocolate/Nobel data listed in the table.

x (Chocolate)   y (Nobel)   x²          y²          xy
5               6           25          36          30
6               9           36          81          54
4               3           16          9           12
4               2           16          4           8
5               11          25          121         55
∑x = 24         ∑y = 31     ∑x² = 118   ∑y² = 251   ∑xy = 159

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{5(159) - (24)(31)}{\sqrt{5(118) - 24^2}\;\sqrt{5(251) - 31^2}} = \frac{51}{\sqrt{14(294)}} = 0.795 $$

TI Calculator:
Linear Regression - test
1. Stat
2. Tests
3. LinRegTTest
4. Enter L1 & L2
5. Freq = 1
6. Choose ≠
7. Calculate

Example 3b: Finding r Using the z-Score Formula
Use the formula to find the value of the linear correlation coefficient r for the five pairs of chocolate/Nobel data listed in the table.
$$ r = \frac{\sum \left(z_x z_y\right)}{n - 1} $$
This formula has the advantage of making it easier to understand how r works. The variable x is used for the chocolate values, and the variable y is used for the Nobel values. Each sample value is replaced by its corresponding z score.
For example, for the chocolate values:
$$ \bar{x} = \frac{\sum x}{n} = 4.8, \qquad s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = 0.836660 $$
$$ x = 5 \;\rightarrow\; z_x = \frac{x - \bar{x}}{s_x} = \frac{5 - 4.8}{0.83666} = 0.23905 $$
Solution: The z scores for all of the chocolate values (see the third column) and the z scores for all of the Nobel values (see the fourth column) are below. The last column lists the products zx · zy.

x (Chocolate)   y (Nobel)   zx           zy           zx · zy
5               6           0.239046     −0.052164    −0.012470
6               9           1.434274     0.730297     1.047446
4               3           −0.956183    −0.834625    0.798054
4               2           −0.956183    −1.095445    1.047446
5               11          0.239046     1.251937     0.299270
                                                      ∑ (zx · zy) = 3.179746

$$ r = \frac{\sum \left(z_x z_y\right)}{n - 1} = \frac{3.179746}{5 - 1} = 0.795 $$

Example 4: Finding r Using the Computational Formula
Find the correlation coefficient for the given data.

Company   Cars x (in 10,000s)   Income y (in billions)   xy        x²         y²
A         63.0                  7.0                      441.00    3969.00    49.00
B         29.0                  3.9                      113.10    841.00     15.21
C         20.8                  2.1                      43.68     432.64     4.41
D         19.1                  2.8                      53.48     364.81     7.84
E         13.4                  1.4                      18.76     179.56     1.96
F         8.5                   1.5                      12.75     72.25      2.25

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6

$$ r = \frac{6(682.77) - (153.8)(18.7)}{\sqrt{6(5859.26) - 153.8^2}\;\sqrt{6(80.67) - 18.7^2}} = 0.982 $$
(strong positive relationship)

Null Hypothesis: H0: ρ = 0 (No correlation)
Alternative Hypothesis: H1: ρ ≠ 0 (Correlation)
Using P-Value from Technology to Interpret r:
P-value ≤ α: Reject H0 → There is sufficient evidence to support the claim of a linear correlation.
P-value > α: Fail to reject H0 → There is not sufficient evidence to support the claim of a linear correlation.
Using the Pearson Correlation Coefficient Table to Interpret r: Consider critical values from this table or technology as being both positive and negative:
• Correlation: If |r| ≥ critical value → There is sufficient evidence to support the claim of a linear correlation.
• No Correlation: If |r| < critical value → There is not sufficient evidence to support the claim of a linear correlation.
Properties of the Linear Correlation Coefficient r
1. −1 ≤ r ≤ 1.
2. If all values of either variable are converted to a different scale, the value of r does not change.
3. The value of r is not affected by the choice of x or y. Interchange all x values and y values, and the value of r will not change.
4. r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
5. r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value.
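Several of these properties can be observed numerically. The sketch below is illustrative only: `pearson_r` is a helper written for this example (it uses the mean-centered form of the formula, which is algebraically equivalent to the computational formula), the rescaling 10x + 3 is an arbitrary choice of a change of scale, and the extra point (20, 0) is a hypothetical outlier that does not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson r via mean-centered sums (equivalent to the sums formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [5, 6, 4, 4, 5]
y = [6, 9, 3, 2, 11]

r_original = pearson_r(x, y)
# Changing the scale of x (here 10x + 3, an arbitrary example)
r_rescaled = pearson_r([10 * v + 3 for v in x], y)
# Interchanging the roles of x and y
r_swapped = pearson_r(y, x)
# Adding one hypothetical outlier (20, 0), invented for illustration
r_with_outlier = pearson_r(x + [20], y + [0])

print(round(r_original, 3), round(r_rescaled, 3), round(r_swapped, 3))
print(round(r_with_outlier, 3))
```

With these particular numbers the rescaled and swapped values agree with the original r, while the single added point is enough to turn a fairly strong positive r into a negative one.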

10.1 Correlation
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable. A linear correlation exists between two variables when there is a correlation and the plotted points approximate a straight line. The linear correlation coefficient r is a number that measures how well paired sample data (called bivariate data) fit a straight-line pattern when graphed; it measures the strength of the linear association between the two variables. The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.
Properties of the Linear Correlation Coefficient r: −1 ≤ r ≤ 1
Formulas for r:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}}, \qquad \text{or:} \quad r = \frac{\sum \left(z_x z_y\right)}{n - 1} $$
z_x denotes the z score for an individual sample value x; z_y is the z score for the corresponding sample value y.
Test statistic (TS):
$$ t = r\sqrt{\frac{n - 2}{1 - r^2}} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad df = n - 2 $$
Or: use r itself as the test statistic.
Step 1: H0: ρ = 0, H1: ρ ≠ 0, claim & tails
Step 2: TS: t = r√((n − 2)/(1 − r²)), or: r
Step 3: CV using α from the t-table or r-table
Step 4: Make the decision to
a. Reject or not H0
b. The claim is true or false
c. Restate this decision: There is / is not sufficient evidence to support the claim that…
There is a linear correlation if |r| ≥ critical value
There is no correlation if |r| < critical value
$$ t = \frac{r - \mu_r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad \mu_r = 0 \ \ (\text{from } H_0) $$

Example 5
Test the significance of the given correlation coefficient using α = 0.05, n = 6 and r = 0.982.
Step 1: H0, H1, claim & tails
H0: ρ = 0, H1: ρ ≠ 0, claim, 2TT
Step 2: Calculate the test statistic (TS)
TS (t-distribution):
$$ t = r\sqrt{\frac{n - 2}{1 - r^2}} = 0.982\sqrt{\frac{6 - 2}{1 - 0.982^2}} = 10.3981, \qquad df = n - 2 $$
TS (2nd method, Pearson correlation): r = 0.982
Step 3: CV using α
CV (t-distribution): α = 0.05, df = n − 2 = 6 − 2 = 4 → t = ±2.776
CV (from Pearson correlation coefficient table): n = 6, α = 0.05 → r = ±0.811
Step 4: Decision:
a. Reject H0, since t = 10.3981 > 2.776 (equivalently, r = 0.982 > 0.811)
b. The claim is true
c. There is a significant relationship between the 2 variables.
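The test statistic in this example can be reproduced directly from the formula t = r√((n − 2)/(1 − r²)). The following is a small sketch; the function name `t_statistic` is illustrative, and the critical value 2.776 is simply the table value quoted in the example rather than something the code derives.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.982, 6
t = t_statistic(r, n)

# Critical value quoted in the example (alpha = 0.05, two-tailed, df = 4)
critical_t = 2.776
reject_h0 = abs(t) >= critical_t

print(round(t, 3), reject_h0)
```

The same function gives about 6.131 for r = 0.801 and n = 23, the values used in the chocolate/Nobel example.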

Example 6
Given the value of r = 0.801 for 23 pairs of data regarding chocolate consumption and numbers of Nobel Laureates, and using a significance level of 0.05, is there sufficient evidence to support a claim that there is a linear correlation between chocolate consumption and numbers of Nobel Laureates?
Step 1: H0, H1, claim & tails
H0: ρ = 0, H1: ρ ≠ 0, claim, 2TT
Step 2: Calculate the test statistic (TS)
TS (t-distribution):
$$ t = r\sqrt{\frac{n - 2}{1 - r^2}} = 0.801\sqrt{\frac{23 - 2}{1 - 0.801^2}} = 6.1314, \qquad df = n - 2 $$
TS (Pearson correlation): r = 0.801
Step 3: CV using α
CV (t-distribution): α = 0.05, df = n − 2 = 23 − 2 = 21 → t = ±2.080
CV (from Pearson correlation coefficient table): n = 23, α = 0.05
Table: the critical value lies between r = 0.396 and r = 0.444
Technology: r = ±0.413
Step 4: Decision:
a. Reject H0, since t = 6.1314 > 2.080 (equivalently, r = 0.801 > 0.413)
b. The claim is true
c. There is sufficient evidence to support the conclusion that for countries, there is a linear correlation between chocolate consumption and numbers of Nobel Laureates.
Interpretation: Although we have found a linear correlation, it would be absurd to think that eating more chocolate would help win a Nobel Prize.
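The critical values of r quoted in Examples 5 and 6 are consistent with the critical values of t. Solving t = r√(df/(1 − r²)) for r gives r = t/√(t² + df), so a critical t can be converted into a critical r. The sketch below assumes the table values of t quoted in the examples (2.776 and 2.080); the function name `critical_r` is illustrative.

```python
import math

def critical_r(critical_t, df):
    """Convert a critical t value (df = n - 2) into the matching critical r."""
    return critical_t / math.sqrt(critical_t ** 2 + df)

# Example 5: n = 6, df = 4, t = 2.776 (from the t-table)
r_cv_example5 = critical_r(2.776, 4)
# Example 6: n = 23, df = 21, t = 2.080 (from the t-table)
r_cv_example6 = critical_r(2.080, 21)

print(round(r_cv_example5, 3), round(r_cv_example6, 3))  # 0.811 0.413
```

This is why the two methods always lead to the same decision: comparing t with ±t critical is the same comparison as |r| with the critical value of r.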

Interpreting r: Explained Variation
The value of r² , called the coefficient of determination, is the proportion of the
variation in y that is explained by the linear relationship between x and y.
$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$
Using the 23 pairs of chocolate/Nobel data, we get r = 0.801. What proportion of the
variation in numbers of Nobel Laureates can be explained by the variation in the
consumption of chocolate?
Solution
With r = 0.801 we get r² = 0.642.
Interpretation
We conclude that 0.642 (or about 64%) of the variation in numbers of Nobel
Laureates can be explained by the linear relationship between chocolate consumption
and numbers of Nobel Laureates.
This implies that about 36% of the variation in numbers of Nobel Laureates cannot be
explained by rates of chocolate consumption.
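The ratio of explained to total variation can be computed explicitly for a small data set. The sketch below is an illustration under two assumptions: it uses the five chocolate/Nobel pairs from the earlier examples (the full 23-pair data set is not reproduced in these slides), and it uses the ordinary least-squares line to obtain the predicted values, which belongs to the regression material of Section 10.2. The function name `explained_fraction` is illustrative.

```python
def explained_fraction(xs, ys):
    """Explained variation / total variation for the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    explained = sum((p - mean_y) ** 2 for p in predicted)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained / total

# Five sample pairs from the earlier examples (r is about 0.795)
ratio = explained_fraction([5, 6, 4, 4, 5], [6, 9, 3, 2, 11])
print(round(ratio, 3))  # 0.632, which equals r squared for these five pairs

# For the 23-pair result quoted in the text, only r is needed
r_squared_23_pairs = 0.801 ** 2
print(round(r_squared_23_pairs, 3), round(1 - r_squared_23_pairs, 3))  # 0.642 0.358
```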

When the null hypothesis has been rejected for a specific α value, any of the following five possibilities can exist.
1. There is a direct cause-and-effect relationship between the variables. That is, x causes y.
For example, water causes plants to grow, poison causes death, and heat causes ice to melt.
2. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.
Suppose a researcher believes excessive coffee consumption causes nervousness, but the researcher fails to consider that the
reverse situation may occur. That is, it may be that an extremely nervous person craves coffee to calm his or her nerves.
3. The relationship between the variables may be caused by a third variable.
If a statistician correlated the number of deaths due to drowning and the number of cans of soft drink consumed daily during
the summer, he or she would probably find a significant relationship. However, the soft drink is not necessarily responsible
for the deaths, since both variables may be related to heat and humidity.
4. There may be a complexity of interrelationships among many variables.
A researcher may find a significant relationship between students’ high school grades and college grades. But there probably
are many other variables involved, such as IQ, hours of study, influence of parents, motivation, age, and instructors.
5. The relationship may be coincidental.
A researcher may be able to find a significant relationship between the increase in the number of people who are exercising and the increase in the number of people who are committing crimes. But common sense dictates that any relationship between these two values must be due to coincidence.
10.1 Correlation, Possible Relationships Between Variables

Interpreting r with Causation:
Correlation does not imply causality!
We noted previously that we should use common sense when interpreting results. Clearly, it would be absurd to think that eating more chocolate would help win a Nobel Prize.
Common Errors Involving Correlation:
1. Assuming that correlation implies causality
2. Using data based on averages (averages suppress individual variation and may inflate the correlation coefficient)
3. Ignoring the possibility of a nonlinear relationship
Hypotheses: When conducting a formal hypothesis test to determine whether there is a significant linear correlation between two variables, use the following null and alternative hypotheses, in which ρ represents the linear correlation coefficient of the population:
Null Hypothesis H0: ρ = 0 (No correlation)
Alternative Hypothesis H1: ρ ≠ 0 (Correlation)
