STAT 22000 Lecture Slides
Correlation
Yibi Huang
Department of Statistics
University of Chicago
Recall that when we introduced scatter plots in Chapter 1, we
assessed the strength of the association between two variables by eye.

[Figure: two scatter plots of Y against X, titled "Weak Association"
and "Strong Association"]
[Figure: the same two scatter plots, annotated with the range of Y
when X is fixed — a large spread of Y when X is known (weak
association) vs. a small spread of Y when X is known (strong
association)]
Correlation = Correlation Coefficient, r

Correlation r is a numerical measure of the direction and strength
of the linear relationship between two numerical variables.
r always lies between −1 and 1; the strength increases as you
move away from 0 toward either −1 or 1.
• r > 0: positive association
• r < 0: negative association
• r ≈ 0: very weak linear relationship
• large |r|: strong linear relationship
• r = −1 or r = 1: only when all the data points on the
scatterplot lie exactly along a straight line

[Number line from −1 to 1: negative association on the left, positive
association on the right; the strength grades from perfect at ±1,
through strong and weak, to no association at 0]
Positive Correlations

[Figure: six scatter plots of standardized data on axes from −3 to 3,
showing r = 0, r = 0.2, r = 0.4, r = 0.6, r = 0.8, and r = 0.9]
Negative Correlations

[Figure: six scatter plots of standardized data on axes from −3 to 3,
showing r = −0.1, r = −0.3, r = −0.5, r = −0.7, r = −0.95, and
r = −0.99]
Formula for Computing the Correlation Coefficient “r”

Given n data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), the
correlation coefficient r (or simply, the correlation) is defined as

    r = \frac{1}{n-1} \sum_{i=1}^{n}
        \underbrace{\left(\frac{x_i - \bar{x}}{s_x}\right)}_{\text{z-score of } x_i}
        \underbrace{\left(\frac{y_i - \bar{y}}{s_y}\right)}_{\text{z-score of } y_i},

where s_x and s_y are, respectively, the sample SDs of X and of Y.
Usually, we find the correlation using software
rather than by manual computation.
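As a sketch of the formula above (the helper name and the tiny data set are made up for illustration), r can be computed as the average product of z-scores and checked against NumPy's built-in correlation:

```python
import numpy as np

def correlation(x, y):
    """r = (1/(n-1)) * sum of (z-score of x_i) * (z-score of y_i),
    with z-scores taken using the sample SD (ddof=1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))

x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [2.1, 3.9, 7.8, 9.5, 16.2]
r = correlation(x, y)
print(round(r, 4))
print(round(float(np.corrcoef(x, y)[0, 1]), 4))  # agrees with the formula
```

In practice one would just call `np.corrcoef` (or `cor()` in R), as the slide notes.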
Why Does r Measure the Strength of a Linear Relationship?

[Figure: a scatter plot of standardized data, divided into four
quadrants by the vertical line x = x̄ and the horizontal line y = ȳ;
the upper-right and lower-left quadrants are marked "+", the
upper-left and lower-right quadrants "−"]

What is the sign of ((x_i − x̄)/s_x)((y_i − ȳ)/s_y)?
• Upper right: x_i − x̄ > 0 and y_i − ȳ > 0, so the product is positive.
• Lower left: x_i − x̄ < 0 and y_i − ȳ < 0, so the product is again positive.
• Upper left and lower right: the two differences have opposite signs,
so the product is negative.

Here r > 0: there are more positive contributions than negative ones.
What kind of points have large contributions to the correlation?
Correlation r Has No Unit

    r = \frac{1}{n-1} \sum_{i=1}^{n}
        \underbrace{\left(\frac{x_i - \bar{x}}{s_x}\right)}_{\text{z-score of } x_i}
        \underbrace{\left(\frac{y_i - \bar{y}}{s_y}\right)}_{\text{z-score of } y_i}

After standardization, neither the z-score of x_i nor the z-score of
y_i has a unit.
• So r is unit-free.
• So we can compare r between data sets whose variables are measured
in different units, or whose variables are different.
E.g., we may compare the r between [swim time and pulse]
with the r between [swim time and breathing rate].
Correlation r Has No Unit (2)

Changing the units of the variables does not change the correlation
coefficient r, because all the units are removed when we standardize
the variables (take z-scores).
E.g., no matter whether the temperatures are recorded in °F or in °C,
the correlations obtained are equal, because C = (5/9)(F − 32).

[Figure: two scatter plots of daily high vs. daily low temperature,
one in °F and one in °C; the point patterns are identical]
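A quick numerical check of this invariance (the simulated temperatures below are made-up stand-ins for the slide's data):

```python
import numpy as np

rng = np.random.default_rng(0)
low_f = rng.uniform(25, 55, size=30)             # hypothetical daily lows, in F
high_f = low_f + 20 + rng.normal(0, 3, size=30)  # hypothetical daily highs, in F

def f_to_c(f):
    # C = (5/9)(F - 32): a linear change of units
    return 5.0 / 9.0 * (f - 32.0)

r_f = np.corrcoef(low_f, high_f)[0, 1]
r_c = np.corrcoef(f_to_c(low_f), f_to_c(high_f))[0, 1]
print(r_f, r_c)  # equal up to floating-point rounding
```

Any linear change of units with positive slope leaves every z-score, and hence r, unchanged.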
“r” Does Not Distinguish x & y

Sometimes one uses the X variable to predict the Y variable. In this
case, X is called the explanatory variable and Y the response.
The correlation coefficient r does not distinguish between the two;
it treats x and y symmetrically:

    r = \frac{1}{n-1} \sum_{i=1}^{n}
        \left(\frac{x_i - \bar{x}}{s_x}\right)
        \left(\frac{y_i - \bar{y}}{s_y}\right)

Swapping the x- and y-axes doesn't change r (both plots have r = 0.74).

[Figure: daily high (°F) plotted against daily low (°F), and the same
data with the axes swapped]
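The symmetry is easy to confirm numerically (simulated data, not the slide's temperatures):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.7, size=50)

r_xy = np.corrcoef(x, y)[0, 1]  # correlation of (x, y)
r_yx = np.corrcoef(y, x)[0, 1]  # axes swapped
print(r_xy, r_yx)               # identical
```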
Correlation r Describes Linear Relationships Only

The scatter plot below shows a perfect nonlinear association: all
points fall on the quadratic curve y = 1 − 4(x − 0.5)^2.

[Figure: points on the curve y = 1 − 4(x − 0.5)^2 for x between 0 and
1, drawn as black dots on one half of the curve and white dots on the
other]

r of all black dots = 0.803;
r of all dots (black + white) = −0.019.

No matter how strong the association,
the r of a curved relationship is NEVER 1 or −1.
It can even be near 0, as in the plot above.
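A minimal sketch of the parabola example above: the association is perfect, yet r comes out essentially 0 because the rising half of the curve cancels the falling half.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 25)
y = 1.0 - 4.0 * (x - 0.5) ** 2  # perfect, but purely nonlinear, association

r = np.corrcoef(x, y)[0, 1]
print(r)  # essentially 0: no linear trend survives the symmetry
```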
Correlation Is VERY Sensitive to Outliers

Sometimes a single outlier can change r drastically.

[Figure: a scatter plot of about ten points with one extreme outlier
far from the trend of the rest]

For the plot,
    r = 0.0031 with the outlier,
    r = 0.6895 without the outlier.

Outliers whose removal markedly changes the form of an association
are called influential points.
Remark: Not all outliers are influential points.
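A small made-up illustration of an influential point (the numbers are not the slide's data): ten points on a tight linear trend, plus one extreme point far from it.

```python
import numpy as np

x = np.arange(1.0, 11.0)  # 1, 2, ..., 10
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1, 9.3, 9.8])

x_out = np.append(x, 30.0)  # one extreme point, far from the trend
y_out = np.append(y, 0.0)

r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_out, y_out)[0, 1]
print(r_without, r_with)  # the single outlier drags r down drastically
```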
When Data Points Are Clustered ...

[Figure: a scatter plot with two well-separated clusters of points]

In the plot, each of the two clusters exhibits a weak negative
association (r = −0.336 and −0.323), but the diagram as a whole shows
a moderately strong positive association (r = 0.849).
• This is an example of Simpson's paradox.
• An overall r can be misleading when data points are clustered.
• Cluster-wise r's should be reported as well.
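The clustering effect can be reproduced with simulated data (the numbers below are illustrative, not the slide's): each cluster has a mild negative trend, yet the offset between the clusters makes the pooled r strongly positive.

```python
import numpy as np

rng = np.random.default_rng(3)
# Cluster 1 centered at (0, 0), cluster 2 at (5, 5); each trends slightly down
x1 = rng.normal(0.0, 1.0, size=40)
y1 = -0.3 * x1 + rng.normal(0.0, 1.0, size=40)
x2 = rng.normal(5.0, 1.0, size=40)
y2 = 5.0 - 0.3 * (x2 - 5.0) + rng.normal(0.0, 1.0, size=40)

r1 = np.corrcoef(x1, y1)[0, 1]
r2 = np.corrcoef(x2, y2)[0, 1]
r_all = np.corrcoef(np.append(x1, x2), np.append(y1, y2))[0, 1]
print(r1, r2, r_all)  # pooled r is far larger than either cluster's r
```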
Always Check the Scatter Plots (1)
The 4 data sets below have identical x̄, ȳ, s_x, s_y, and r.
Dataset 1 Dataset 2 Dataset 3 Dataset 4
x y x y x y x y
10 8.04 10 9.14 10 7.46 8 6.58
8 6.96 8 8.14 8 6.77 8 5.76
13 7.58 13 8.75 13 12.76 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.36 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
Ave 9 7.5 9 7.5 9 7.5 9 7.5
SD 3.16 1.94 3.16 1.94 3.16 1.94 3.16 1.94
r 0.82 0.82 0.82 0.82
Always Check the Scatter Plots (2)

[Figure: scatter plots of Datasets 1–4, each on axes from 0 to 20 (x)
and 0 to 15 (y)]
• In Dataset 2, y can be predicted exactly from x. But r < 1,
because r only measures linear association.
• In Dataset 3, r would be 1 instead of 0.82 if the outlier were
actually on the line.
The correlation coefficient can be misleading in the presence of
outliers, multiple clusters, or nonlinear association.
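The table's claim can be verified directly (the values below are typed from the table above; the plots themselves of course require a graphics call):

```python
import numpy as np

# Datasets 1-3 share the same x values; Dataset 4 has its own
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.96, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.75, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.76, 7.11, 7.81, 8.84, 6.08, 5.36, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for i, (x, y) in enumerate(datasets, start=1):
    r = np.corrcoef(x, y)[0, 1]
    print(f"Dataset {i}: x-bar={x.mean():.1f}, y-bar={y.mean():.2f}, r={r:.2f}")
```

Identical summary statistics, four very different pictures: hence, always check the scatter plots.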
Correlation Indicates Association, Not Causation
Source: http://www.nejm.org/doi/full/10.1056/NEJMon1211064
Questions
• Why do both variables have to be numerical when computing
their correlation coefficient?
• If the law requires women to marry only men 2 years older
than themselves, what is the correlation of the ages between
husbands and wives?