1
Correlation and Regression
BY UNSA SHAKIR
2
Correlation and Regression
Correlation describes the strength of a linear relationship between two variables.
Regression tells us how to draw the straight line described by the correlation.
3
Correlation and Regression
• For example:
A sociologist may be interested in the relationship between education and self-esteem, or between income and number of children in a family.
Independent Variable → Dependent Variable
Education → Self-Esteem
Family Income → Number of Children
4
Correlation and Regression
• For example:
• May expect: As education increases, self-esteem
increases (positive relationship).
• May expect: As family income increases, the number
of children in families declines (negative relationship).
Independent Variable → Dependent Variable
Education → Self-Esteem (+)
Family Income → Number of Children (-)
5
Correlation
6
Correlation
• Correlation is a statistical technique used to determine the degree to which two variables are related.
• A correlation is a relationship between two
variables. The data can be represented by the
ordered pairs (x, y) where x is the independent
(or explanatory) variable, and y is the
dependent (or response) variable.
7
Correlation
A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.
Example:
x    1    2    3    4    5
y   -4   -2   -1    0    2
[Scatter plot of the five (x, y) points]
8
Linear Correlation
[Four scatter-plot panels, each with x on the horizontal axis and y on the vertical axis:]
Negative Linear Correlation: as x increases, y tends to decrease.
Positive Linear Correlation: as x increases, y tends to increase.
No Correlation
Nonlinear Correlation
9
Correlation Coefficient
• It is also called Pearson's correlation or
product moment correlation coefficient
• The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}. \]
10
The sign of r denotes the nature (direction) of the association, while the magnitude of r denotes the strength of the association.
11
If the sign is positive, the relationship is direct: an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable.
If the sign is negative, the relationship is inverse (indirect): an increase in one variable is associated with a decrease in the other.
12
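The value of r ranges between -1 and +1.
The value of r denotes the strength of the association, as illustrated by the following scale:
r = -1: perfect correlation (indirect)
-1 < r ≤ -0.75: strong (indirect)
-0.75 < r ≤ -0.25: intermediate (indirect)
-0.25 < r < 0: weak (indirect)
r = 0: no relation
0 < r < 0.25: weak (direct)
0.25 ≤ r < 0.75: intermediate (direct)
0.75 ≤ r < 1: strong (direct)
r = +1: perfect correlation (direct)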
13
If r = 0, there is no association or correlation between the two variables.
If 0 < |r| < 0.25: weak correlation.
If 0.25 ≤ |r| < 0.75: intermediate correlation.
If 0.75 ≤ |r| < 1: strong correlation.
If |r| = 1: perfect correlation.
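The cut-offs on this slide can be turned into a small lookup. The following is only a sketch: the function name `describe_correlation` is hypothetical, and it assumes the thresholds apply to the absolute value of r so that negative coefficients are labelled as indirect.

```python
def describe_correlation(r: float) -> str:
    """Label r with the slide's cut-offs (0.25 and 0.75), applied to |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "no correlation"
    if size == 1:
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.25:
        strength = "intermediate"
    else:
        strength = "weak"
    # The sign gives the direction: positive is direct, negative is indirect.
    direction = "direct" if r > 0 else "indirect"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.986))  # strong direct correlation
print(describe_correlation(-0.94))  # strong indirect correlation
print(describe_correlation(0.07))   # weak direct correlation
```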
14
Linear Correlation
[Four scatter-plot panels, each with x on the horizontal axis and y on the vertical axis:]
Strong negative correlation: r = -0.91
Weak positive correlation: r = 0.42
Strong positive correlation: r = 0.88
Nonlinear Correlation: r = 0.07
15
Calculating a Correlation Coefficient
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}. \]
In Words                                         In Symbols
1. Find the sum of the x-values.                 ∑x
2. Find the sum of the y-values.                 ∑y
3. Multiply each x-value by its corresponding
   y-value and find the sum.                     ∑xy
16
Calculating a Correlation Coefficient
In Words                                         In Symbols
4. Square each x-value and find the sum.         ∑x²
5. Square each y-value and find the sum.         ∑y²
6. Use these five sums to calculate the
   correlation coefficient.
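The six steps can be sketched in code as follows. This is a minimal illustration rather than a reference implementation; the function name `pearson_r` is an assumption, and the sample call uses the five (x, y) pairs from the deck's small worked example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r computed from the five sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)                               # step 1
    sum_y = sum(ys)                               # step 2
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # step 3
    sum_x2 = sum(x * x for x in xs)               # step 4
    sum_y2 = sum(y * y for y in ys)               # step 5
    # Step 6: plug the five sums into the formula for r.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    if denominator == 0:
        raise ValueError("r is undefined when x or y is constant")
    return numerator / denominator


print(round(pearson_r([1, 2, 3, 4, 5], [-3, -1, 0, 1, 2]), 3))  # 0.986
```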
17
Correlation Coefficient
Example:
Calculate the correlation coefficient r for the following data.
  x     y    xy    x²    y²
  1    -3    -3     1     9
  2    -1    -2     4     1
  3     0     0     9     0
  4     1     4    16     1
  5     2    10    25     4
∑x = 15   ∑y = -1   ∑xy = 9   ∑x² = 55   ∑y² = 15
18
Correlation Coefficient
Example continued:
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\,\sqrt{74}} \approx 0.986 \]
There is a strong positive linear correlation between x and y.
19
Correlation Coefficient
Example:
The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student who took a test the following Monday.
a.) Display the scatter plot.
b.) Calculate the correlation coefficient r.
Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
20
Correlation Coefficient
[Scatter plot: hours watching TV (x, 0 to 10) against test score (y, 0 to 100) for the 12 students]
21
Correlation Coefficient
Example continued:
Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy               0    85   164   222   285   340   380   420   348   455   525   500
x²               0     1     4     9     9    25    25    25    36    49    49   100
y²            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
∑x = 54   ∑y = 908   ∑xy = 3724   ∑x² = 332   ∑y² = 70836
22
Correlation Coefficient
Example continued:
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{12(3724) - (54)(908)}{\sqrt{12(332) - 54^2}\,\sqrt{12(70836) - 908^2}} \approx -0.831 \]
• There is a strong negative linear correlation.
• As the number of hours spent watching TV increases, the test scores tend to decrease.
23
Example:
A sample of 6 children was selected, and their age in years and weight in kilograms were recorded as shown in the following table. Find the correlation between age and weight.
Serial no.   Age (years)   Weight (kg)
1            7             12
2            6             8
3            8             12
4            5             10
5            6             11
6            9             13
24
Serial no.   Age (years) (x)   Weight (kg) (y)   xy     x²     y²
1            7                 12                84     49     144
2            6                 8                 48     36     64
3            8                 12                96     64     144
4            5                 10                50     25     100
5            6                 11                66     36     121
6            9                 13                117    81     169
Total        ∑x = 41           ∑y = 66           ∑xy = 461   ∑x² = 291   ∑y² = 742
25
\[ r = \frac{461 - \frac{(41)(66)}{6}}{\sqrt{\left(291 - \frac{(41)^2}{6}\right)\left(742 - \frac{(66)^2}{6}\right)}} = \frac{10}{\sqrt{(10.833)(16)}} \approx 0.76 \]
r ≈ 0.76: strong direct correlation
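This slide divides the sums by n, whereas the earlier slides multiply by n. The sketch below checks that the two forms agree on the age and weight sample; the function names are hypothetical and the code is only an illustration of the algebra.

```python
from math import sqrt, isclose

ages = [7, 6, 8, 5, 6, 9]             # x, years
weights = [12, 8, 12, 10, 11, 13]     # y, kg


def five_sums(xs, ys):
    """Return (sum x, sum y, sum xy, sum x^2, sum y^2)."""
    return (sum(xs), sum(ys),
            sum(x * y for x, y in zip(xs, ys)),
            sum(x * x for x in xs),
            sum(y * y for y in ys))


def r_times_n_form(xs, ys):
    n = len(xs)
    sx, sy, sxy, sx2, sy2 = five_sums(xs, ys)
    return (n * sxy - sx * sy) / (sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))


def r_divided_by_n_form(xs, ys):
    n = len(xs)
    sx, sy, sxy, sx2, sy2 = five_sums(xs, ys)
    return (sxy - sx * sy / n) / sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))


print(five_sums(ages, weights))                       # (41, 66, 461, 291, 742)
print(round(r_divided_by_n_form(ages, weights), 4))   # 0.7596
print(isclose(r_times_n_form(ages, weights),
              r_divided_by_n_form(ages, weights)))    # True
```

Dividing the numerator and both factors under the square root by n leaves the ratio unchanged, which is why either form can be used.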
26
EXAMPLE: Relationship between Anxiety and Test Scores
Anxiety (X)   Test score (Y)   X²     Y²     XY
10            2                100    4      20
8             3                64     9      24
2             9                4      81     18
1             7                1      49     7
5             6                25     36     30
6             5                36     25     30
∑X = 32       ∑Y = 32          ∑X² = 230   ∑Y² = 204   ∑XY = 129
27
Calculating Correlation Coefficient
\[ r = \frac{(6)(129) - (32)(32)}{\sqrt{\left(6(230) - 32^2\right)\left(6(204) - 32^2\right)}} = \frac{774 - 1024}{\sqrt{(356)(200)}} \approx -0.94 \]
r = -0.94
Indirect strong correlation
28
Example
Tree Height (y)   Trunk Diameter (x)   xy      y²       x²
35                8                    280     1225     64
49                9                    441     2401     81
27                7                    189     729      49
33                6                    198     1089     36
60                13                   780     3600     169
21                7                    147     441      49
45                11                   495     2025     121
51                12                   612     2601     144
∑y = 321          ∑x = 73              ∑xy = 3142   ∑y² = 14111   ∑x² = 713
29
Example
[Scatter plot: Trunk Diameter, x (0 to 14) against Tree Height, y (0 to 70)]
• r = 0.886 → relatively strong positive linear association between x and y
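When only the column totals are available, r can be computed directly from them. The sketch below plugs the totals from the tree table into the formula; the helper name `r_from_sums` is hypothetical.

```python
from math import sqrt


def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient from the sample size and the five sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Totals from the tree table: 8 trees, x = trunk diameter, y = tree height.
r = r_from_sums(n=8, sum_x=73, sum_y=321, sum_xy=3142, sum_x2=713, sum_y2=14111)
print(round(r, 3))  # 0.886
```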
30
31
Regression
32
Regression Analysis
• Regression is a technique concerned with predicting some variables from the known values of others.
• Simple regression is the process of predicting variable Y using variable X.
33
Types of Regression Models
[Four scatter-plot panels:]
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
34
Regression
Uses a variable (x) to predict some outcome
variable (y)
Tells you how values in y change as a
function of changes in values of x
35
Regression minimizes residuals
The regression line makes the sum of the squares of the residuals smaller than for any other line.
[Scatter plot with fitted regression line: Wt (kg) on the horizontal axis (60 to 120), SBP (mmHg) on the vertical axis (80 to 220)]
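The minimizing property can be checked numerically. The sketch below is an illustration, not the deck's own calculation: it fits a least squares line to the age and weight sample from the earlier correlation example (the SBP data behind the plot are not listed), then compares its sum of squared residuals with a few arbitrarily nudged lines. The function names and the nudge values are assumptions chosen for the demonstration.

```python
# Age (years) and weight (kg) for the six children in the earlier example.
ages = [7, 6, 8, 5, 6, 9]
weights = [12, 8, 12, 10, 11, 13]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares_line(ages, weights)
best = sum_squared_residuals(ages, weights, a, b)

# Hypothetical nudges to intercept and slope; each one defines a different line.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1), (0.3, -0.05)]
others = [sum_squared_residuals(ages, weights, a + da, b + db) for da, db in nudges]

print(round(best, 3))
print(all(other > best for other in others))  # True
```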
36
By using the least squares method (a procedure that minimizes the sum of the squared vertical deviations of the plotted points from a straight line), we are able to construct a best-fitting straight line through the scatter diagram points and then formulate a regression equation in the form of:
\[ \hat{y} = a + bX \]
or, equivalently,
\[ \hat{y} = \bar{y} + b(x - \bar{x}), \qquad b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \]
Expanding the second form shows that the intercept is a = \bar{y} - b\bar{x}.
The regression equation describes the regression line mathematically by showing its intercept and slope.
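The slope formula and the two forms of the regression equation can be sketched as follows. This is a hedged illustration: the function name `regression_coefficients` is hypothetical, the data are the tree measurements from the earlier correlation example, and the trunk diameter of 10 used for the check is an arbitrary value picked for the demonstration.

```python
def regression_coefficients(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # a = mean(y) - b * mean(x)
    return a, b


diameters = [8, 9, 7, 6, 13, 7, 11, 12]      # x
heights = [35, 49, 27, 33, 60, 21, 45, 51]   # y

a, b = regression_coefficients(diameters, heights)
mean_x = sum(diameters) / len(diameters)
mean_y = sum(heights) / len(heights)

x_new = 10  # hypothetical trunk diameter
centered_form = mean_y + b * (x_new - mean_x)   # y-hat = mean(y) + b(x - mean(x))
intercept_form = a + b * x_new                  # y-hat = a + bX

print(round(a, 3), round(b, 3))
print(abs(centered_form - intercept_form) < 1e-9)  # True
```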
37
Correlation and Regression
• The statistics equation for a line:
Ŷ = a + bX
Where:
Ŷ = the line's position on the vertical axis at any point (estimated value of the dependent variable)
X = the line's position on the horizontal axis at any point (value of the independent variable for which you want an estimate of Y)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero
38
Linear Equations
Y = bX + a
[Diagram of a straight line plotted on X and Y axes]
a = Y-intercept
b = Slope = (Change in Y) / (Change in X)
39
Exercise
A sample of 6 persons was selected; their age (x variable) and weight (y variable) are shown in the following table. Find the regression equation and the predicted weight when age is 8.5 years.
Serial no.   Age (x)   Weight (y)
1            7         12
2            6         8
3            8         12
4            5         10
5            6         11
6            9         13
40
Answer
Serial no.   Age (x)   Weight (y)   xy     x²     y²
1            7         12           84     49     144
2            6         8            48     36     64
3            8         12           96     64     144
4            5         10           50     25     100
5            6         11           66     36     121
6            9         13           117    81     169
Total        41        66           461    291    742
41
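\[ \bar{x} = \frac{41}{6} = 6.833, \qquad \bar{y} = \frac{66}{6} = 11 \]
\[ b = \frac{461 - \frac{(41)(66)}{6}}{291 - \frac{(41)^2}{6}} = \frac{10}{10.833} = 0.923 \]
Regression equation
\[ \hat{y}(x) = 11 + 0.923\,(x - 6.833) \]
42
\[ \hat{y}(x) = 4.693 + 0.923x \]
\[ \hat{y}(8.5) = 4.693 + 0.923 \times 8.5 \approx 12.54 \text{ kg} \]
\[ \hat{y}(7.5) = 4.693 + 0.923 \times 7.5 \approx 11.62 \text{ kg} \]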
43
We create a regression line by plotting two estimated values of y against their corresponding x values, then extending the line to the right and left.
44
Regression Line
Example:
a.) Find the equation of the regression line.
b.) Use the equation to find the expected value of y when x is 2.3.
  x     y    xy    x²    y²
  1    -3    -3     1     9
  2    -1    -2     4     1
  3     0     0     9     0
  4     1     4    16     1
  5     2    10    25     4
∑x = 15   ∑y = -1   ∑xy = 9   ∑x² = 55   ∑y² = 15
45
Regression Line
[Scatter plot of the five (x, y) points with the fitted regression line]
\[ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(9) - (15)(-1)}{5(55) - (15)^2} = \frac{60}{50} = 1.2 \]
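The slide works out only the slope. One way to finish both parts of the example is sketched below, assuming the intercept is taken as mean(y) minus m times mean(x) (so the line passes through the point of means); the names `m`, `b` and `predict` are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [-3, -1, 0, 1, 2]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope, as on the slide.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept (assumption stated above): mean(y) - m * mean(x).
b = sum_y / n - m * sum_x / n


def predict(x):
    """Expected y on the fitted line y-hat = m*x + b."""
    return m * x + b


print(round(m, 2))             # 1.2
print(round(b, 2))             # -3.8
print(round(predict(2.3), 2))  # -1.04
```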
46
Regression Line
Example:
The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student who took a test the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score
for a student who watches 9 hours of TV.
47
Regression Line
Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy               0    85   164   222   285   340   380   420   348   455   525   500
x²               0     1     4     9     9    25    25    25    36    49    49   100
y²            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
∑x = 54   ∑y = 908   ∑xy = 3724   ∑x² = 332   ∑y² = 70836
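The sums in the table are enough to answer both parts of the example. The sketch below applies the same slope formula used for the smaller example and again assumes the intercept is mean(y) minus slope times mean(x); the variable names are illustrative.

```python
hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

n = len(hours)
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n   # assumption: line passes through the means

# Part b: expected test score for a student who watches 9 hours of TV.
predicted_score = intercept + slope * 9

print(sum_x, sum_y, sum_xy, sum_x2)   # 54 908 3724 332
print(round(slope, 2), round(intercept, 2), round(predicted_score, 2))  # -4.07 93.97 57.36
```

Nine hours lies inside the observed range of 0 to 10 hours, so the prediction is an interpolation rather than an extrapolation.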
48
Exercise
• Find the correlation between age and blood pressure using the simple (Pearson) and Spearman's correlation coefficients, and comment.
• Find the regression equation.
• What is the predicted blood pressure for a 25-year-old man?
49
Serial   Age (x)   B.P. (y)   xy       x²
1        20        120        2400     400
2        43        128        5504     1849
3        63        141        8883     3969
4        26        126        3276     676
5        53        134        7102     2809
6        31        128        3968     961
7        58        136        7888     3364
8        46        132        6072     2116
9        58        140        8120     3364
10       70        144        10080    4900
50
Serial   Age (x)   B.P. (y)   xy       x²
11       46        128        5888     2116
12       53        136        7208     2809
13       60        146        8760     3600
14       20        124        2480     400
15       63        143        9009     3969
16       43        130        5590     1849
17       26        124        3224     676
18       19        121        2299     361
19       31        126        3906     961
20       23        123        2829     529
Total    852       2630       114486   41678
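The first part of the exercise asks for both the simple (Pearson) and Spearman coefficients. The deck does not show how Spearman's coefficient is obtained, so the sketch below assumes one common approach: replace each value by its rank, giving tied values the average of their positions, and then compute Pearson's r on the ranks. All function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman's coefficient as Pearson's r on the average ranks (assumed method)."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


ages = [20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
        46, 53, 60, 20, 63, 43, 26, 19, 31, 23]
pressures = [120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
             128, 136, 146, 124, 143, 130, 124, 121, 126, 123]

print(round(pearson_r(ages, pressures), 3))
print(round(spearman_rho(ages, pressures), 3))
```

The age column contains several repeated values, which is why the ranking step averages tied positions instead of assigning arbitrary distinct ranks.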
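51
\[ b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{114486 - \frac{(852)(2630)}{20}}{41678 - \frac{(852)^2}{20}} = \frac{2448}{5382.8} \approx 0.4548 \]
\[ \hat{y} = 112.13 + 0.4548x \]
For age 25:
B.P. = 112.13 + 0.4548 × 25 ≈ 123.5 mm Hg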