Correlation and regression

CORRELATION AND REGRESSION
Dr Abdul Aziz Tayoun
Consultant community medicine
Supervisor training center –SBPM
(Rawdhah)

TYPES OF RELATIONSHIPS
• Between two categorical variables:
Relative risk (RR).
Odds ratio (OR).
• Between two continuous variables:
Correlation coefficient (r).
Correlation coefficient squared (R², the coefficient of determination).

CORRELATION
• It is an association measure.
• It measures the association between two
continuous variables.
• It assumes that the association is linear.
• Linear association between two variables
means that one variable increases or decreases
by a fixed amount for a unit increase or decrease
in the other.

CORRELATION COEFFICIENT
• It measures the degree of association.
• It measures linear association.
• It is sometimes called Pearson’s correlation
coefficient.

STRENGTH OF ASSOCIATION
• The correlation coefficient is measured on a
scale that varies from +1 through 0 to -1.
• Complete correlation between two variables is
expressed by either +1 or -1.
• When one variable increases as the other
increases the correlation is positive.
• When one decreases as the other increases it
is negative.
• Complete absence of correlation is
represented by 0.

NEGATIVE RELATIONSHIP
[Figure: scatter plot of reliability against age of car, showing a negative relationship]

SCATTER DIAGRAMS
• When an investigator has collected two series of observations
and wishes to see whether there is a relationship between
them, the first step is to construct a scatter diagram.
• The vertical scale represents one set of measurements and
the horizontal scale the other.
• Usually we put the independent variable on the horizontal
axis and the dependent variable on the vertical axis.
• Sometimes it is not easy to know which variable is dependent
and which is independent.
• This is a matter of common-sense reasoning: it is logical to say that
the height of a person depends on their age, and not the
converse.

CALCULATION OF THE CORRELATION COEFFICIENT
• A paediatric registrar has measured the
pulmonary anatomical dead space (in ml) and
height (in cm) of 15 children.
• The data are given in the following table.
• The first step is to inspect the scatter diagram to
see whether the area covered by the dots centres on
a straight line or whether a curved line is
needed.
• The next step is to calculate the correlation
coefficient.

CHILD NUMBER   HEIGHT = X (cm)   DEAD SPACE = Y (ml)   XY
1              110               44                    4840
2              116               31                    3596
3              124               43                    5332
4              129               45                    5805
5              131               56                    7336
6              138               79                    10902
7              142               57                    8094
8              150               56                    8400
9              153               58                    8874
10             155               92                    14260
11             156               78                    12168
12             159               64                    10176
13             164               88                    14432
14             168               112                   18816
15             174               101                   17574
TOTAL          2169              1004                  150605
MEAN           144.6             66.93333333
SD             19.36786735       23.64761138

[Figure: scatter graph of height (cm, horizontal axis) and anatomical dead space (ml, vertical axis) for the 15 children]

THE FORMULA TO BE USED
With x representing the value of the independent variable (in this
case the height) and y representing the dependent variable (in
this case the anatomical dead space):
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
which can be shown to be equal to:
$$r = \frac{\sum xy - n\bar{x}\bar{y}}{(n - 1) S_x S_y}$$
where:
x = height in cm
y = anatomical dead space in ml
$\bar{x}$ = mean of height
$\bar{y}$ = mean of anatomical dead space
$S_x$ = standard deviation for height
$S_y$ = standard deviation for anatomical dead space

CALCULATION
$$r = \frac{150605 - 15 \times 144.6 \times 66.93333}{14 \times 19.3679 \times 23.6476} = \frac{150605 - 145178.4}{6412.06} = \frac{5426.6}{6412.06} = 0.846$$
$$R^2 = 0.846^2 = 0.716$$
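
The shortcut formula can be checked against the definition with a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative and not part of the slides. Because it keeps the unrounded means and standard deviations, the third decimal place may differ slightly from a hand calculation that rounds them first.

```python
import math
import statistics

# Height (cm) and anatomical dead space (ml) for the 15 children in the table.
heights = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def pearson_r(x, y):
    """Pearson's r via (sum(xy) - n*mean_x*mean_y) / ((n - 1) * S_x * S_y)."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (divisor n - 1)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (sum_xy - n * mean_x * mean_y) / ((n - 1) * s_x * s_y)


def pearson_r_deviations(x, y):
    """The same coefficient from the sums of deviations about the means."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, dead_space)
print(round(r, 3), round(r ** 2, 3))  # prints 0.846 0.716
```

Both functions return the same value up to floating-point error, which illustrates that the two forms of the formula are algebraically equivalent.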

COMMENTS ON THE RESULTS
• The correlation coefficient of 0.846 indicates a positive
correlation between the size of the pulmonary anatomical
dead space and the height of the child.
• But in the interpretation of correlation it is important to
remember that correlation is not causation.
• A part of the variation in one of the variables (as measured by
its variance) can be thought of as being due to the
relationship with the other variable, and another part as due
to undetermined, often random, causes.
• The part due to the dependence of one variable on the other
can be measured by R², and it is equal to 0.716 in our
example.
• So we can say that about 72% of the variation between children in
the size of anatomical dead space is accounted for by the height of the
child.

The value of r ranges between -1 and +1.
The value of r denotes the strength of the
association, as illustrated by the following scale.
r = -1: perfect correlation (indirect)
-1 to -0.75: strong
-0.75 to -0.25: intermediate
-0.25 to 0: weak
r = 0: no relation
0 to 0.25: weak
0.25 to 0.75: intermediate
0.75 to 1: strong
r = +1: perfect correlation (direct)
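
The scale can be written as a small lookup function. This is only a sketch: the function name is made up for illustration, and the scale does not say which band the boundary values 0.25 and 0.75 belong to, so assigning them to the stronger band is an assumption.

```python
def describe_correlation(r):
    """Label r using the bands weak (< 0.25), intermediate (< 0.75), strong."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no relation"
    direction = "direct" if r > 0 else "indirect"
    size = abs(r)
    if size == 1:
        return f"perfect {direction} correlation"
    if size < 0.25:
        strength = "weak"
    elif size < 0.75:  # assumption: 0.25 itself counts as intermediate
        strength = "intermediate"
    else:              # assumption: 0.75 itself counts as strong
        strength = "strong"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.846))  # prints strong direct correlation
```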

SIGNIFICANCE TEST FOR THE CORRELATION COEFFICIENT
To test whether the association is merely
apparent and might have arisen by
chance, we use the t test with the following
equation:
$$t = r \sqrt{\frac{n - 2}{1 - r^2}}$$
We enter the t table with n - 2 degrees of
freedom.

CALCULATION OF T
$$t = 0.846 \sqrt{\frac{15 - 2}{1 - 0.846^2}} = 0.846 \sqrt{\frac{13}{0.2843}} = 0.846 \sqrt{45.73} = 5.72$$
Entering the t table with 15 - 2 = 13 degrees
of freedom, we find p < 0.001.
So the correlation coefficient may be regarded
as highly significant.
Thus we have a very strong correlation between
dead space and height of the child, which is
most unlikely to have arisen by chance.
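
As a quick check of the arithmetic, the t statistic can be computed directly from r and n. This is a minimal sketch with an illustrative function name; it takes r rounded to three decimals, so a slightly different rounding of r shifts t in the second decimal place. Looking up the p value still requires a t table (or a statistics library) with n - 2 degrees of freedom.

```python
import math


def t_for_correlation(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two pairs of observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t = t_for_correlation(0.846, 15)
degrees_of_freedom = 15 - 2
print(round(t, 2), degrees_of_freedom)  # prints 5.72 13
```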

THE ASSUMPTIONS GOVERNING
THIS TEST ARE
1. Both variables are normally distributed.
2. There is a linear relationship between them.
3. The null hypothesis is that there is no
association between them.

SPEARMAN RANK CORRELATION
We use Spearman rank correlation when:
• The data may reveal outlying points well away
from the main body of the data.
• The variables may be quantitative discrete or
ordinal.

THE FORMULA FOR SPEARMAN
RANK CORRELATION (𝑟𝑠)
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where d is the difference in ranks of the two
variables for the same individual.
See the following table.

child number   height (cm)   dead space (ml)   sorted dead space   rank of y   d      d²
1              110           44                31                  3           2      4
2              116           31                43                  1           -1     1
3              124           43                44                  2           -1     1
4              129           45                45                  4           0      0
5              131           56                56                  5.5         0.5    0.25
6              138           79                56                  11          5      25
7              142           57                57                  7           0      0
8              150           56                58                  5.5         -2.5   6.25
9              153           58                64                  8           -1     1
10             155           92                78                  13          3      9
11             156           78                79                  10          -1     1
12             159           64                88                  9           -3     9
13             164           88                92                  12          -1     1
14             168           112               101                 15          1      1
15             174           101               112                 14          -1     1
TOTAL                                                                                 60.5
Derivation of Spearman rank correlation for the 15 children (height, anatomical dead space).
The heights are already in ascending order, so the rank of height equals the child number; d is the rank of dead space minus the rank of height.

CALCULATION OF SPEARMAN
RANK CORRELATION
$$r_s = 1 - \frac{6 \times 60.5}{15 \times (225 - 1)} = 1 - \frac{363}{15 \times 224} = 1 - \frac{363}{3360} = 1 - 0.108 = 0.892$$
In this case the value is very close to the Pearson
correlation coefficient.
For n > 10, the Spearman rank
correlation can be tested for significance using
the t test.
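
The ranking step, including the shared rank of 5.5 for the two children with a dead space of 56 ml, can be sketched in Python as follows. The helper names are illustrative. Note that the rank-difference formula is used here exactly as on the slide; when ties are present it is usually regarded as an approximation.

```python
heights = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rs(x, y):
    """Return (r_s, sum of squared rank differences)."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


rs, sum_d2 = spearman_rs(heights, dead_space)
print(sum_d2, round(rs, 3))  # prints 60.5 0.892
```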

DIFFERENCE BETWEEN CORRELATION
AND REGRESSION
• Correlation describes the strength of
association between two variables and is
completely symmetrical: the correlation
between A and B is the same as the correlation
between B and A. If the two variables are related, then when one changes by a
certain amount the other changes on average
by a certain amount.
• The regression equation represents how
much the dependent variable changes with
any given change in the independent
variable, and can be used to construct a

REGRESSION
Calculates the “best-fit” line for a certain set of data
The regression line makes the sum of the squares of
the residuals smaller than for any other line
Regression minimizes residuals
[Figure: scatter plot of SBP (mmHg, vertical axis) against Wt (kg, horizontal axis)]

ASSUMPTIONS FOR THE ORDINARY
LEAST SQUARES PROCEDURE
1. The relationship between X and Y is linear.
2. The dependent variable Y is metric
continuous
3. The residual term e , is normally distributed,
with a mean of zero , for each value of the
independent variable X.
4. The spread of the residual terms should be
the same, whatever the value of X.

REGRESSION EQUATION
Regression equation
describes the
regression line
mathematically
• Intercept
• Slope
[Figure: scatter plot of SBP (mmHg, vertical axis) against Wt (kg, horizontal axis)]

LINEAR EQUATIONS
Y = bX + a
a = Y-intercept
b = slope = (change in Y) / (change in X)

INTERPRETATION OF THE
EQUATION
X: represents the independent variable.
Y: represents the dependent variable.
a: represents the intercept, the value of y when x = 0.
b: represents the slope, the change in y when x
changes by one unit.
So the regression equation is more useful than the
correlation coefficient because it allows us to predict
the value of y when we know the value of x.

CALCULATION OF THE REGRESSION
MODEL
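$$b = \frac{\sum xy - n\bar{x}\bar{y}}{(n - 1) S_x^2}$$
$$a = \bar{y} - b\bar{x}$$
$$b = \frac{150605 - 15 \times 144.6 \times 66.93333}{14 \times 19.3679^2} = \frac{5426.6}{5251.6} = 1.033$$
$$a = 66.93 - 1.033 \times 144.6 = 66.93 - 149.37 = -82.4$$
$$y = -82.4 + 1.033x$$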
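
The slope and intercept can be reproduced from the raw data with the same formulas. This is a minimal sketch (the function name is illustrative). It keeps unrounded intermediate values, so the intercept comes out near -82.5 rather than the hand-calculated -82.4, which uses b rounded to 1.033.

```python
import statistics

heights = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # b = (sum(xy) - n*mean_x*mean_y) / ((n - 1) * S_x^2); variance() is S_x^2.
    b = (sum_xy - n * mean_x * mean_y) / ((n - 1) * statistics.variance(x))
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(heights, dead_space)
print(round(b, 3))  # prints 1.033
```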

INTERPRETATION OF THE RESULTS
• When the height is 0 the predicted anatomical dead
space is -82.4 ml, which is not logical; the
equation is valid only for the range between the
minimum and maximum heights in the
data, that is between 110 and 174 cm only.
• For every centimetre increase in height the
anatomical dead space increases by 1.033 ml
over the range of measurements made.

TESTING THE HYPOTHESIS B=0
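$$t = \frac{b}{SE(b)}$$
$$SE(b) = \frac{S_{res}}{\sqrt{\sum (x - \bar{x})^2}} = \frac{S_{res}}{\sqrt{(n - 1) S_x^2}}$$
$$S_{res} = \sqrt{\frac{\sum (y - y_{fit})^2}{n - 2}}$$
This can be shown algebraically to be equal to:
$$S_{res} = \sqrt{\frac{S_y^2 (1 - r^2)(n - 1)}{n - 2}}$$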

CALCULATION OF STANDARD ERROR
OF B
$$S_{res} = \sqrt{\frac{23.6476^2 (1 - 0.846^2)(15 - 1)}{15 - 2}} = \sqrt{\frac{559.21 \times 0.2843 \times 14}{13}} = \sqrt{\frac{2225.77}{13}} = \sqrt{171.21} = 13.08$$
$$SE(b) = \frac{S_{res}}{\sqrt{(n - 1) S_x^2}} = \frac{13.08}{\sqrt{14 \times 19.3679^2}} = \frac{13.08}{\sqrt{5251.6}} = \frac{13.08}{72.468} = 0.1805$$
$$t = \frac{1.033}{0.1805} = 5.72$$
This has 15 - 2 = 13 degrees of freedom, giving a p value < 0.001.
Note that the test of significance for the slope gives exactly the
same value of p as the test of significance for the correlation
coefficient, although the two tests are derived differently.
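
The equivalence of the two tests can be checked numerically. The sketch below (function name illustrative, standard library only) computes SE(b) from the residuals about the fitted line, then compares the t statistic for the slope with the t statistic obtained from r. Unrounded intermediates are used, so the values agree with the hand calculation only to about two decimal places.

```python
import math

heights = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def slope_and_correlation_t(x, y):
    """Return (SE(b), t for the slope, t for the correlation coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual standard deviation: sqrt(sum((y - y_fit)^2) / (n - 2)).
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(rss / (n - 2))
    se_b = s_res / math.sqrt(sxx)
    r = sxy / math.sqrt(sxx * syy)
    t_slope = b / se_b
    t_corr = r * math.sqrt((n - 2) / (1 - r ** 2))
    return se_b, t_slope, t_corr


se_b, t_slope, t_corr = slope_and_correlation_t(heights, dead_space)
print(round(se_b, 2), math.isclose(t_slope, t_corr))  # prints 0.18 True
```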

95% CONFIDENCE INTERVAL FOR B
$$95\%\ \text{CI for } b = b \pm t_{0.05} \times SE(b)$$
$$95\%\ \text{CI for } b = 1.033 \pm 2.16 \times 0.1805 = 1.033 \pm 0.3899$$
95% CI for b = (0.643 to 1.423)
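
The interval is simple enough to wrap in a helper. In this sketch the function name is illustrative, and 2.16 is the tabulated two-sided 5% value of t for 13 degrees of freedom quoted above; a different sample size would need a different tabulated value.

```python
def confidence_interval(estimate, standard_error, t_critical):
    """Return (lower, upper) limits: estimate +/- t_critical * standard_error."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin


low, high = confidence_interval(1.033, 0.1805, 2.16)
print(round(low, 3), round(high, 3))  # prints 0.643 1.423
```

Since the interval excludes zero, it agrees with the significance test for the slope.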

FROM THE REGRESSION MODEL WE
CAN CALCULATE THE VALUE OF Y FOR
ANY VALUE OF X
Question: what is the anatomical dead space
for a child measuring 125 cm, and for one measuring 150 cm?
Answer: y = -82.4 + 1.033x
y = -82.4 + 1.033 × 125 = 46.725 ml
y = -82.4 + 1.033 × 150 = 72.55 ml
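
A prediction helper can also enforce the earlier caveat that the equation is valid only within the observed range of heights. This is a sketch, not part of the slides: the function and constant names are illustrative, and refusing to extrapolate (rather than warning) is a design choice made here.

```python
INTERCEPT_ML = -82.4        # a, from the fitted regression equation
SLOPE_ML_PER_CM = 1.033     # b, from the fitted regression equation
MIN_HEIGHT_CM, MAX_HEIGHT_CM = 110, 174  # range of heights in the data


def predict_dead_space(height_cm):
    """Predicted anatomical dead space (ml) for a height inside 110-174 cm."""
    if not MIN_HEIGHT_CM <= height_cm <= MAX_HEIGHT_CM:
        raise ValueError("height is outside the range used to fit the line")
    return INTERCEPT_ML + SLOPE_ML_PER_CM * height_cm


for height in (125, 150):
    print(height, round(predict_dead_space(height), 3))
```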

THE ASSUMPTIONS ARE
1. The prediction errors are approximately
Normally distributed; note that this does not
mean the x or y variables have to be Normally
distributed.
2. The relationship between the two variables is
linear.
3. The scatter of points about the line is
approximately constant.

MULTIPLE REGRESSION
Multiple regression analysis is a straightforward
extension of simple regression analysis which
allows more than one independent variable.

THE MODEL FOR LINEAR
REGRESSION
$$y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$
where: $x_1$ is the first independent variable,
$x_2$ is the second independent variable,
and so on up to the kth independent variable $x_k$.
The term $\alpha$ is the intercept or constant term; it is
the value of y when all the independent variables
are zero.
$\varepsilon$ is the error term, usually assumed to have a
normal distribution and an average value of zero.
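
Leaving the error term aside, the fitted part of the model is just a weighted sum, as the following sketch shows. The function name is illustrative. With a single independent variable it reduces to the simple regression equation, which is checked here using the intercept and slope fitted earlier for dead space on height.

```python
def linear_prediction(alpha, betas, xs):
    """Return alpha + beta_1*x_1 + ... + beta_k*x_k (error term omitted)."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per independent variable")
    return alpha + sum(beta * x for beta, x in zip(betas, xs))


# k = 1: the simple regression of dead space (ml) on height (cm).
y_hat = linear_prediction(-82.4, [1.033], [150])
print(round(y_hat, 2))  # prints 72.55
```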

USES OF MULTIPLE REGRESSION
1. To look for relationships between continuous
variables, allowing for a third variable.
2. To adjust for differences in confounding
factors between groups.

MODEL BUILDING AND VARIABLE
SELECTION
• Automated variable selection : the computer
does it for you, this method is perhaps more
appropriate if you have little idea about which
variables are likely to be relevant to the
relationship.
• Manual selection: you do it yourself, if you
have a particular hypothesis to test and have a
good idea about which variables are likely to
be most relevant in explaining your

STARTING PROCEDURE FOR BOTH METHODS
• Identify a list of independent variables that you think
might possibly have some role in explaining the variation
in your dependent variable ( be as broad-minded as
possible).
• Draw a scatterplot of each of these candidate variables
against the dependent variable to examine for linearity.
• Perform a series of univariate regressions , regress each
candidate independent variable against the dependent
variable and see the p-value in each case.
• At this stage, all variables with a p-value no greater than about
0.2 should be considered for inclusion in the model; using
a cut-off smaller than this may fail to identify variables that

GOODNESS-OF-FIT: R²
When you add an extra variable to an existing
model and want to compare goodness-of-fit
with the old model, you use the adjusted R²,
not R².
R² will increase (or at least not decrease) when an extra independent
variable is added to the model, whether or not the variable is useful.
If the adjusted R² increases, then you know that the
explanatory power has increased.
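
The slides do not give a formula for the adjusted R²; the sketch below assumes the usual definition, 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of independent variables. The first call uses figures from the dead space example (R² of about 0.716, n = 15, k = 1); the second uses a hypothetical R² of 0.718 after adding a second variable, purely to illustrate that a tiny gain in R² can lower the adjusted value.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


simple_model = adjusted_r2(0.716, n=15, k=1)
extended_model = adjusted_r2(0.718, n=15, k=2)  # 0.718 is a made-up value
print(round(simple_model, 3), round(extended_model, 3))  # prints 0.694 0.671
```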

ADJUSTMENT AND CONFOUNDING
• One of the most attractive features of the multiple
regression model is its ability to adjust for the effects of
possible association between the independent variables.
• It is possible that two or more of the independent variables
will be associated.
• The beauty of the multiple regression model is that each
regression coefficient measures only the direct effect of
its independent variable on the dependent variable, and
controls or adjusts for any possible interaction from any
of the other variables in the model.

BASIC ASSUMPTIONS FOR MULTIPLE
LINEAR REGRESSION MODEL
1. Metric continuous dependent variable.
2. Linear relationship between the dependent
variable and each independent variable.
3. The residuals have constant spread across the
range of values of the independent variable.
4. The residuals are normally distributed for
each fitted value of the independent variable.
5. The independent variables are not perfectly
correlated with each other.
