Introduction to Correlation &
Regression Analysis
Farzad Javidanrad
November 2013
Some Basic Concepts:
o Variable: A letter (symbol) which represents the elements of a specific set.
o Random Variable: A variable whose values appear randomly, according to a probability distribution.
o Probability Distribution: A rule (function) which assigns a probability to the values of a random variable (individually or to a set of them). E.g., for the number of heads x when tossing a fair coin:

   x      0     1
   P(x)   0.5   0.5

In one trial the outcomes are H, T; in two trials they are HH, HT, TH, TT.
Correlation:
Is there any relation between:
❑ fast food sales and different seasons?
❑ a specific crime and religion?
❑ smoking cigarettes and lung cancer?
❑ maths score and overall score in an exam?
❑ temperature and earthquakes?
❑ cost of advertisement and number of items sold?

▪ To answer each question, two sets of corresponding data need to be collected randomly.
Let the random variable x represent the first group of data and the random variable y represent the second.
Question: Is it true that students who have a better overall result are also good at maths?
Our aim is to find out whether there is any linear association between x and y. In statistics, the technical term for linear association is "correlation". So we are looking to see whether there is any correlation between the two scores.
➢ "Linear association": the variables are related at their levels, i.e. x with y, not with y², y³, 1/y or even Δy.
Imagine we have a random sample of scores in a school as follows. In our example, the correlation between x and y can be shown in a scatter diagram:
[Scatter diagram: maths score (x, horizontal axis, 0–100) plotted against overall score (y, vertical axis, 0–100).]

Correlation between maths score and overall score: the graph shows a positive correlation between maths scores and overall scores, i.e. when x increases, y increases too.
Different scatter diagrams show different types of
correlation:
• Is this enough? Are we happy?
Certainly not! We think we know things better when they are described by numbers!
Although scatter diagrams are informative, to find the degree (strength) of a correlation between two variables we need a numerical measurement.
Adapted from www.pdesas.org
Following the work of Francis Galton on the regression line, Karl Pearson introduced in 1896 a formula for measuring the correlation between two variables, called the Correlation Coefficient or Pearson's Correlation Coefficient.
For a sample of size n, the sample correlation coefficient r_xy can be calculated by:

r_xy = [(1/n) Σ(xᵢ − x̄)(yᵢ − ȳ)] / [√((1/n) Σ(xᵢ − x̄)²) · √((1/n) Σ(yᵢ − ȳ)²)] = cov(x, y) / (Sₓ · S_y)

where x̄ and ȳ are the mean values of x and y in the sample and S represents the biased version of the "standard deviation"*. The covariance between x and y (cov(x, y)) shows how much x and y change together.
Alternatively, if there is an opportunity to observe all available data, the population correlation coefficient (ρ_xy) can be obtained by:

ρ_xy = E[(xᵢ − μₓ)(yᵢ − μ_y)] / [√(E(xᵢ − μₓ)²) · √(E(yᵢ − μ_y)²)] = cov(x, y) / (σₓ · σ_y)

where E, μ and σ are the expected value, mean and standard deviation of the random variables, respectively, and N is the size of the population.
Question: Under what conditions can we use this population correlation coefficient?
๏ƒ˜ If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“ ๐’™๐’š = ๐Ÿ
Maximum (perfect) positive correlation.
๏ƒ˜ If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“ ๐’™๐’š = โˆ’๐Ÿ
Maximum (perfect) negative correlation.
๏ƒ˜ If there is no linear association between ๐’™ and ๐’š
then ๐’“ ๐’™๐’š = ๐ŸŽ.
Note 1: If there is no linear association between two
random variables they might have non linear
association or no association at all.
For all ๐’‚ , ๐’ƒ โˆˆ ๐‘น
And ๐’‚ > ๐ŸŽ
For all ๐’‚ , ๐’ƒ โˆˆ ๐‘น
And ๐’‚ < ๐ŸŽ
In our example, the sample correlation coefficient is calculated from the following table:

xᵢ   yᵢ   xᵢ−x̄   yᵢ−ȳ   (xᵢ−x̄)(yᵢ−ȳ)   (xᵢ−x̄)²   (yᵢ−ȳ)²
70   73    12    13.9      166.8        144     193.21
85   90    27    30.9      834.3        729     954.81
22   31   −36   −28.1     1011.6       1296     789.61
66   50     8    −9.1      −72.8         64      82.81
15   31   −43   −28.1     1208.3       1849     789.61
58   50     0    −9.1        0            0      82.81
69   56    11    −3.1      −34.1        121       9.61
49   55    −9    −4.1       36.9         81      16.81
73   80    15    20.9      313.5        225     436.81
61   49     3   −10.1      −30.3          9     102.01
77   79    19    19.9      378.1        361     396.01
44   58   −14    −1.1       15.4        196       1.21
35   40   −23   −19.1      439.3        529     364.81
88   85    30    25.9      777          900     670.81
69   73    11    13.9      152.9        121     193.21
Sum:                      5196.9       6625    5084.15
๐’“ ๐’™๐’š =
๐Ÿ
๐’
(๐’™๐’Š โˆ’ ๐’™)(๐’š๐’Š โˆ’ ๐’š)
๐Ÿ
๐’
(๐’™๐’Š โˆ’ ๐’™) ๐Ÿ . ๐Ÿ
๐’
(๐’š๐’Š โˆ’ ๐’š) ๐Ÿ
= ๐Ÿ“๐Ÿ๐Ÿ—๐Ÿ”.๐Ÿ—
๐Ÿ”๐Ÿ”๐Ÿ๐Ÿ“ร—๐Ÿ“๐ŸŽ๐Ÿ–๐Ÿ’.๐Ÿ๐Ÿ“
=๐ŸŽ.๐Ÿ–๐Ÿ—๐Ÿ“
which shows an strong positive correlation between maths score and overall score.
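The calculation above can be reproduced with a short script (a sketch; tiny discrepancies with the slide's table come from its rounded deviations):

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient r_xy = cov(x, y) / (S_x * S_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)  # the 1/n factors cancel

# Maths scores (x) and overall scores (y) from the table
x = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]
print(round(pearson_r(x, y), 3))  # 0.895
```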
[Scatter diagrams illustrating positive linear association (panels with Sₓ > S_y, Sₓ = S_y and Sₓ < S_y), no linear association, and negative linear association. Adapted and modified from www.tice.agrocampus-ouest.fr]

r_xy = 1         perfect positive correlation
r_xy ≈ 1         strong positive correlation
0 < r_xy < 1     weak positive correlation
r_xy = 0         no correlation
−1 < r_xy < 0    weak negative correlation
r_xy ≈ −1        strong negative correlation
r_xy = −1        perfect negative correlation
Some properties of the correlation coefficient (sample or population):
a. It lies between −1 and 1, i.e. −1 ≤ r_xy ≤ 1.
b. It is symmetrical with respect to x and y, i.e. r_xy = r_yx. This means the direction of calculation is not important.
c. It is just a pure number, independent of the units of measurement of x and y.
d. It is independent of the choice of origin and scale of the measurements of x and y, that is: r_xy = r_(ax+b)(cy+d) for a, c > 0.
e. If f(x, y) = f(x)·f(y), i.e. x and y are statistically independent (where f(x, y) is the joint Probability Density Function, PDF), then r_xy = 0.
Important Note:
Many researchers wrongly construct a theory based just on a simple correlation test.
❑ Correlation does not imply causation.
If there is a high correlation between the number of cigarettes smoked and the number of infected lung cells, it does not necessarily mean that smoking causes lung cancer. A causality test (such as the Granger causality test) is different from a correlation test.
In a causality test it is important to know the direction of causality (e.g. x on y and not vice versa), but in correlation analysis we are only trying to find out whether two variables move together (in the same or opposite directions).
๐’™ and ๐’š are statistically independent,
where ๐’‡(๐’™, ๐’š) is the joint Probability
Density Function (PDF)
Determination Coefficient and Correlation Coefficient:
r_xy = ±1 ⟹ a perfect linear relationship between the variables, i.e. x is the only factor which describes the variation of y at the level (linearly): y = a + bx.
r_xy ≈ ±1 ⟹ x is not the only factor which describes the variation of y, but we can still imagine a line that represents this relationship, passing through most of the points or, in total, having a minimum vertical distance from them. This line is called the "line of best fit", or technically the "regression line".
Adapted from www.ncetm.org.uk/public/files/195322/G3fb.jpg
The graph shows a line of best fit between the age of a car and its price. Imagine the line has the equation y = a + bx.
The criterion for choosing a line among others is the goodness of fit, which can be calculated through the determination coefficient, r².
➢ In the previous example, the age of a car is only one factor among many others that explain the price of a car. Can you find some other factors?
If y and x represent the price and age of cars respectively, the percentage of the variation of y which is determined (explained) by the variation of x is called the "determination coefficient".
The determination coefficient can be understood better by Venn–Euler diagrams, where the shaded (overlapping) area shows the percentage of the variation of y which can be determined by x:

r² = 0: none of the variation of y can be determined by x (no linear association)
r² ≈ 0: a small percentage of the variation of y can be determined by x (weak linear association)
r² ≈ 1: a large percentage of the variation of y can be determined by x (strong linear association)
r² = 1: all of the variation of y can be determined by x and no other factors (complete linear association; y = x in the diagram)

It is easy to understand that 0 ≤ r² ≤ 1.
Although the determination coefficient (r²) is conceptually different from the correlation coefficient (r_xy), one can be calculated from the other; in fact:

r_xy = ±√(r²)

or, alternatively,

r² = b² · [(1/n) Σ(xᵢ − x̄)²] / [(1/n) Σ(yᵢ − ȳ)²] = b² · Sₓ² / S_y²

where b is the slope coefficient in the regression line y = a + bx.
Note: If y = a + bx is the regression line of y on x and x = c + dy is the regression line of x on y, then r² = b·d.
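The note above can be checked numerically: regressing y on x gives slope b, regressing x on y gives slope d, and their product equals r². A sketch, reusing the score data from the earlier table:

```python
def slope(u, v):
    """OLS slope of the regression of v on u: sum(u_dev * v_dev) / sum(u_dev ** 2)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    return suv / suu

x = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]

b = slope(x, y)    # slope of the regression of y on x
d = slope(y, x)    # slope of the regression of x on y
r_squared = b * d  # equals r_xy**2, about 0.80 for these scores
```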
Summary of Correlation & Determination Coefficients:
• Correlation means a linear association between two random variables, which could be positive, negative or zero.
• Linear association means that the variables are related at their levels (linearly).
• The correlation coefficient measures the strength of the linear association between two variables. It can be calculated for a sample or for the whole population.
• The value of the correlation coefficient lies between −1 and 1; the endpoints show the strongest (negative or positive) correlation, and moving towards zero the correlation becomes weaker.
• Correlation does not imply causation.
• The determination coefficient shows the percentage of the variation of one variable which can be described by another variable, and it is a measure of the goodness of fit for lines passing through the plotted points.
• The value of the determination coefficient lies between 0 and 1 and can be obtained by squaring the correlation coefficient.
• Knowing that two random variables are linearly associated is not very satisfying. There is sometimes a strong idea that the variation of one variable can solidly explain the variation of another.
• To test this idea (hypothesis) we need another analytical approach, which is called "regression analysis".
• In regression analysis we try to study or predict the mean (average) value of a dependent variable Y based on the knowledge we have about the independent (explanatory) variable(s) X₁, X₂, …, Xₙ. This is familiar to those who know the meaning of conditional probabilities, as we are going to build a linear model whose deterministic part is:

E(Y | X₁, X₂, …, Xₙ) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ
• The deterministic part of the regression model reflects the structure of the relationship between Y and the X's in a mathematical world, but we live in a stochastic world.
• God's knowledge (if the term is applicable) is deterministic, but our perception of everything in this world is always stochastic, and our model should be built accordingly.
• To understand the concept of a stochastic model, let's look at an example:
➢ If we build a model between monthly consumption expenditure C and monthly income I, the model cannot be deterministic (mathematical), such that for every value of I there is one and only one value of C (which is the concept of a functional relationship in maths). Why?
➢ Although income is the main variable determining the amount of consumption expenditure, many other factors, such as people's mood, their wealth, the interest rate, etc., are overlooked in a simple mathematical model such as C = f(I), yet their influence can change the value of C even at the same level of I. We believe that the average impact of all these omitted variables is random (sometimes positive and sometimes negative). So, in order to build a realistic model, we need to add a stochastic (random) term u to our mathematical model: C = f(I) + u
[Table: several different values of consumption expenditure C (e.g. £750, £800, £900, £1000, £1150, £1200) observed at the same levels of income I (£1000, £1400, …).]

The change in consumption expenditure comes from the change in income (I) or the change in some random elements (u), so we can write C = f(I) + u.
• The general stochastic model for our purpose would be as follows, which is called the "Linear Regression Model**":

Yᵢ = E(Yᵢ | X₁ᵢ, …, Xₙᵢ) + uᵢ

which can be written as:

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₙXₙᵢ + uᵢ

where i (i = 1, 2, …, n) indexes the observations (days, weeks, months, years, etc.) and uᵢ is an error (stochastic) term, representing all other influential variables which are not considered in the model and are ignored.
• The deterministic part of the model,

E(Yᵢ | X₁ᵢ, …, Xₙᵢ) = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₙXₙᵢ,

is called the Population Regression Function (PRF).
• The general form of the Linear Regression Model with k explanatory variables and n observations can be shown in matrix form as:

Y(n×1) = X(n×k) β(k×1) + u(n×1)

or simply:

Y = Xβ + u

where

Y = [Y₁, Y₂, …, Yₙ]ᵀ ,  β = [β₀, β₁, …, βₖ]ᵀ ,  u = [u₁, u₂, …, uₙ]ᵀ

and

X = ⎡ 1  X₁₁  X₂₁  …  Xₖ₁ ⎤
    ⎢ 1  X₁₂  X₂₂  …  Xₖ₂ ⎥
    ⎢ ⋮   ⋮    ⋮    ⋱   ⋮  ⎥
    ⎣ 1  X₁ₙ  X₂ₙ  …  Xₖₙ ⎦
Y is also called the regressand, and the columns of X are the regressors.
• β₀ is the intercept, while the βᵢ's are slope coefficients; together they are called regression parameters. The value of each slope parameter shows the effect of a one-unit change in the associated regressor Xᵢ on the mean value of the regressand Yᵢ. The idea is to estimate the unknown values of the population regression parameters using estimators based on sample data.
• The sample counterpart of the regression line can be written in the form:

Yᵢ = Ŷᵢ + ûᵢ

or

Yᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ⋯ + bₙXₙᵢ + eᵢ

where Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ⋯ + bₙXₙᵢ is the deterministic part of the sample model, called the "Sample Regression Function (SRF)", the bᵢ's are estimators of the unknown parameters βᵢ, and ûᵢ = eᵢ is a residual.
The following graph shows the important elements of the PRF and the SRF:

[Graph, adapted and altered from http://marketingclassic.blogspot.co.uk/2011_12_01_archive.html: for a given observation Yᵢ, the vertical distance to the PRF is the error term and the vertical distance to the SRF is the residual.]

In the PRF: Yᵢ − E(Y | Xᵢ) = uᵢ
In the SRF: Yᵢ − Ŷᵢ = ûᵢ = eᵢ

PRF: E(Y | Xᵢ) = β₀ + β₁Xᵢ (the estimation of Yᵢ based on the PRF)
SRF: Ŷᵢ = b₀ + b₁Xᵢ (the estimation of Yᵢ based on the SRF)

The PRF is a hypothetical line about which we have no direct knowledge, but we try to estimate its parameters based on the data in the sample.
• Now the question is how to calculate the bᵢ's based on the sample observations, and how to ensure that they are good and unbiased estimators of the βᵢ's in the population.
• There are two main methods of calculating the bᵢ's and constructing the SRF: the "method of Ordinary Least Squares (OLS)" and the "method of Maximum Likelihood (ML)". Here we focus on the OLS method, as it is the most widely used. For simplicity, we start with the two-variable PRF (Yᵢ = β₀ + β₁Xᵢ + uᵢ) and its SRF counterpart (Ŷᵢ = b₀ + b₁Xᵢ).
• According to the OLS method, we try to minimise the sum of the squared residuals in a hypothetical sample, i.e.

Σûᵢ² = Σeᵢ² = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − b₀ − b₁Xᵢ)²   (A)

• It is obvious from the previous equation that the sum of squared residuals is a function of b₀ and b₁, i.e.

Σeᵢ² = f(b₀, b₁)

because if these two parameters (intercept and slope) change, Σeᵢ² will change (see the graph on slide 25).
• Differentiating (A) partially with respect to b₀ and b₁, and following the first-order (necessary) conditions for optimisation in calculus, we have:

∂(Σeᵢ²)/∂b₀ = −2 Σ(Yᵢ − b₀ − b₁Xᵢ) = −2 Σeᵢ = 0
∂(Σeᵢ²)/∂b₁ = −2 ΣXᵢ(Yᵢ − b₀ − b₁Xᵢ) = −2 ΣXᵢeᵢ = 0   (B)

After simplification we reach two equations with two unknowns, b₀ and b₁ (the normal equations):

ΣYᵢ = n·b₀ + b₁ΣXᵢ
ΣXᵢYᵢ = b₀ΣXᵢ + b₁ΣXᵢ²
Where ๐’ is the sample size. So;
๐’ƒ ๐Ÿ =
๐‘ฟ๐’Š โˆ’ ๐‘ฟ ๐’€๐’Š โˆ’ ๐’€
๐‘ฟ๐’Š โˆ’ ๐‘ฟ ๐Ÿ
=
๐’™๐’Š ๐’š๐’Š
๐’™๐’Š
๐Ÿ
=
๐’„๐’๐’—(๐’™, ๐’š)
๐‘บ ๐’™
๐Ÿ
Where ๐‘บ ๐’™ is the biased version of sample standard deviation,
i.e. we have ๐’ instead of (๐’ โˆ’ ๐Ÿ) in denominator.
๐‘บ ๐’™ =
๐‘ฟ๐’Š โˆ’ ๐‘ฟ ๐Ÿ
๐’
And
๐‘0 = ๐‘Œ โˆ’ ๐‘1 ๐‘‹
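The two formulas above translate directly into code. A minimal sketch, with made-up data lying exactly on the line y = 1 + 2x so that the coefficients are recovered exactly:

```python
def ols_simple(X, Y):
    """Two-variable OLS: returns (b0, b1) for the SRF Yhat = b0 + b1*X."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y))
          / sum((xi - mx) ** 2 for xi in X))
    b0 = my - b1 * mx  # the fitted line passes through the point of means
    return b0, b1

X = [1, 2, 3, 4]
Y = [3, 5, 7, 9]  # exactly y = 1 + 2x
b0, b1 = ols_simple(X, Y)
print(b0, b1)  # 1.0 2.0
```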
โ€ข The ๐’ƒ ๐ŸŽ and ๐’ƒ ๐Ÿ obtained from OLS method are the point
estimators of ๐œท ๐ŸŽ and ๐œท ๐Ÿin the population but in order to test
some hypothesis about the population parameters we need to
have knowledge about the distributions of their estimators. For
that reason we need to make some assumptions about the
explanatory variables and the error term in PRF. (see the
equations in B to find the reason).
▪ The Assumptions Underlying the OLS Method:
1. The regression model is linear in terms of its parameters (coefficients).*
2. The values of the explanatory variable(s) are fixed in repeated sampling. This means that the nature of the explanatory variables (X's) is non-stochastic. The only stochastic variables are the error term (uᵢ) and the regressand (Yᵢ).
3. The disturbance (error) terms are normally distributed with zero mean and equal variance, given the values of the X's. That is: uᵢ ~ N(0, σ²).
4. There is no autocorrelation between error terms, i.e. cov(uᵢ, uⱼ) = 0 for i ≠ j. This means they are completely random, with no association between them or any pattern in their appearance.
5. There is no correlation between the error terms and the explanatory variables, i.e. cov(uᵢ, Xᵢ) = 0.
6. The number of observations (sample size) should be bigger than the number of parameters in the model.
7. The model should be logically and correctly specified in terms of functional form and even the type and nature of the variables entering the model.
These are the assumptions of the Classical Linear Regression Model (CLRM); they are sometimes called the Gaussian assumptions on linear regression models.
• Under these assumptions and the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when n → ∞, have a normal distribution:

b₀ ~ N( β₀ , [ΣXᵢ² / (n·Σxᵢ²)] · σ² )
b₁ ~ N( β₁ , σ² / Σxᵢ² )

where σ² is the variance of the error term (var(uᵢ) = σ²); it can itself be estimated through the estimator σ̂, where:

σ̂ = √( Σeᵢ² / (n − 2) )   or   σ̂ = √( Σeᵢ² / (n − k) ) when there are k parameters in the model.
• Based on the assumptions of the classical linear regression model (CLRM), the Gauss–Markov theorem asserts that the least squares estimators have the minimum variance among unbiased estimators. So they are the Best Linear Unbiased Estimators (BLUE).
▪ Interval Estimation for Population Parameters:
• In order to construct a confidence interval for the unknown β's (the PRF's parameters), we can follow either the Z distribution (if we have prior knowledge of σ) or the t-distribution (if we use σ̂ instead).
• The confidence interval for the slope parameter at any significance level α would be*:

P( b₁ − z(α/2)·σ_b₁ ≤ β₁ ≤ b₁ + z(α/2)·σ_b₁ ) = 1 − α

or

P( b₁ − t(α/2, n−2)·σ̂_b₁ ≤ β₁ ≤ b₁ + t(α/2, n−2)·σ̂_b₁ ) = 1 − α
▪ Hypothesis Testing for Parameters:
• The critical values (Z or t) in the confidence intervals can be used to find the rejection area(s) and test any hypothesis on the parameters.
• For example, to test H₀: β₁ = 0 against the alternative H₁: β₁ ≠ 0, after finding the critical t values (meaning we do not have prior knowledge of σ and use σ̂ instead) at a significance level α, we have two critical regions, and if the value of the test statistic

t = (b₁ − β₁) / ( σ̂ / √(Σxᵢ²) )

falls in a critical region, H₀: β₁ = 0 must be rejected.
• If there is more than one slope parameter, the degrees of freedom for the t-distribution will be the sample size n minus the number of estimated parameters including the intercept, i.e. for k parameters df = n − k.
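As a numeric sketch of this test, with a small made-up sample, the t statistic for H₀: β₁ = 0 in the two-variable model can be computed as:

```python
from math import sqrt

def t_statistic_slope(X, Y):
    """t = b1 / (sigma_hat / sqrt(sum of squared x-deviations)) for H0: beta1 = 0."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxx = sum((xi - mx) ** 2 for xi in X)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y)) / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(X, Y))
    sigma_hat = sqrt(rss / (n - 2))      # estimator of sigma with n - 2 df
    return b1 / (sigma_hat / sqrt(sxx))  # se(b1) = sigma_hat / sqrt(sxx)

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
print(round(t_statistic_slope(X, Y), 3))  # 2.121
```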
▪ Determination Coefficient r² and Goodness of Fit:
• In earlier slides we talked about the determination coefficient and its relationship with the correlation coefficient. The coefficient of determination r² comes to our attention when there is no issue with the estimation of the regression parameters.
• It is a measure which shows how well the SRF fits the data.
• To understand this measure properly, let's look at it from a different angle.

[Graph, adapted from Basic Econometrics, Gujarati, p. 76: an observation Yᵢ decomposed around the regression line into Ŷᵢ − Ȳ and the residual eᵢ.]

We know that Yᵢ = Ŷᵢ + eᵢ, where eᵢ = Yᵢ − Ŷᵢ. Subtracting Ȳ from both sides gives:

Yᵢ − Ȳ = (Ŷᵢ − Ȳ) + (Yᵢ − Ŷᵢ)

or, in deviation form,

yᵢ = ŷᵢ + eᵢ
By squaring both sides and summing over the sample we have:

Σyᵢ² = Σŷᵢ² + 2Σŷᵢeᵢ + Σeᵢ² = Σŷᵢ² + Σeᵢ²

where Σŷᵢeᵢ = 0 according to the OLS assumptions 3 and 5. Changing to the non-deviated form:

Σ(Yᵢ − Ȳ)² = Σ(Ŷᵢ − Ȳ)² + Σ(Yᵢ − Ŷᵢ)²

• Σ(Yᵢ − Ȳ)²: total variation of the observed Y values around their mean = Total Sum of Squares = TSS
• Σ(Ŷᵢ − Ȳ)²: total explained variation of the estimated Y values around their mean = Explained Sum of Squares (explained by the explanatory variables) = ESS
• Σ(Yᵢ − Ŷᵢ)²: total unexplained variation of the observed Y values around the regression line = Residual Sum of Squares (explained by the error terms) = RSS
Dividing both sides by the Total Sum of Squares (TSS) we have:

1 = ESS/TSS + RSS/TSS = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)² + Σ(Yᵢ − Ŷᵢ)² / Σ(Yᵢ − Ȳ)²

where Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)² = ESS/TSS is the percentage of the variation of the actual (observed) Yᵢ which is explained by the explanatory variables (by the regression line).
• A good reader knows that this is not a new concept; the determination coefficient r² was already described as a measure of the goodness of fit between different alternative sample regression functions (SRFs):

1 = r² + RSS/TSS  →  r² = 1 − RSS/TSS = 1 − Σeᵢ² / Σ(Yᵢ − Ȳ)²

• A good model must have a reasonably high r², but this does not mean that any model with a high r² is a good model. An extremely high r² could be the result of a spurious regression line, due to a variety of reasons such as non-stationarity of the data, cointegration problems, etc.
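The decomposition TSS = ESS + RSS, and the resulting r², can be verified numerically. A sketch reusing the small five-point sample from the t-test example:

```python
def sums_of_squares(X, Y):
    """Return (TSS, ESS, RSS) for the two-variable OLS fit; TSS = ESS + RSS."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y))
          / sum((xi - mx) ** 2 for xi in X))
    b0 = my - b1 * mx
    fitted = [b0 + b1 * xi for xi in X]
    tss = sum((yi - my) ** 2 for yi in Y)                    # total variation
    ess = sum((fi - my) ** 2 for fi in fitted)               # explained variation
    rss = sum((yi - fi) ** 2 for yi, fi in zip(Y, fitted))   # residual variation
    return tss, ess, rss

tss, ess, rss = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r2 = 1 - rss / tss  # same as ess / tss
print(round(tss, 3), round(ess, 3), round(rss, 3), round(r2, 3))  # 6.0 3.6 2.4 0.6
```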
• In a regression model with two parameters, r² can be calculated directly (since Ȳ = b₀ + b₁X̄, so Ŷᵢ − Ȳ = b₁(Xᵢ − X̄)):

r² = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)² = Σ(b₀ + b₁Xᵢ − b₀ − b₁X̄)² / Σ(Yᵢ − Ȳ)² = b₁² Σ(Xᵢ − X̄)² / Σ(Yᵢ − Ȳ)² = b₁² Σxᵢ² / Σyᵢ² = b₁² · S_X² / S_Y²

where S_X² and S_Y² are the (biased) variances of X and Y respectively.
▪ Multiple Regression Analysis:
• If there is more than one explanatory variable in the regression model, we need an additional assumption about the explanatory variables: there must be no exact linear relationship between them.
• The population and sample regression models for the three-variable model can be described as follows:

In the population: Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + uᵢ
In the sample:     Yᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + eᵢ

• The OLS estimators can be obtained by minimising Σeᵢ². So, in deviation form, the values of the SRF slope parameters are:

b₁ = [ (Σx₁ᵢyᵢ)(Σx₂ᵢ²) − (Σx₂ᵢyᵢ)(Σx₁ᵢx₂ᵢ) ] / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ]
b₂ = [ (Σx₂ᵢyᵢ)(Σx₁ᵢ²) − (Σx₁ᵢyᵢ)(Σx₁ᵢx₂ᵢ) ] / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ]

and the intercept parameter is calculated in non-deviated form as:

b₀ = Ȳ − b₁X̄₁ − b₂X̄₂
• Under the classical assumptions and the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when n → ∞, have a normal distribution:

b₁ ~ N( β₁ , σᵤ²·Σx₂ᵢ² / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ] )
b₂ ~ N( β₂ , σᵤ²·Σx₁ᵢ² / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ] )

• The distribution of the intercept parameter b₀ is not of primary concern, as in many cases it has no practical importance.
• If the variance of the disturbance (error) term (σᵤ²) is not known, the residual variance (sample variance) σ̂ᵤ², which is an unbiased estimator of the former, can be used:

σ̂ᵤ² = Σeᵢ² / (n − k)

where k is the number of parameters in the model (including the intercept b₀). Therefore, in a regression model with two slope parameters and one intercept parameter, the residual variance can be calculated by:

σ̂ᵤ² = Σeᵢ² / (n − 3)
So, for a model with two slope parameters, the unbiased estimates of the variances of these parameters are:

S²_b₁ = [Σeᵢ²/(n − 3)] · Σx₂ᵢ² / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ] = σ̂ᵤ² / [ Σx₁ᵢ²·(1 − r₁₂²) ]

and

S²_b₂ = [Σeᵢ²/(n − 3)] · Σx₁ᵢ² / [ (Σx₁ᵢ²)(Σx₂ᵢ²) − (Σx₁ᵢx₂ᵢ)² ] = σ̂ᵤ² / [ Σx₂ᵢ²·(1 − r₁₂²) ]

where r₁₂² = (Σx₁ᵢx₂ᵢ)² / (Σx₁ᵢ² · Σx₂ᵢ²).
▪ The Coefficient of Multiple Determination (R² and R̄²):
The same concept of the coefficient of determination used for a bivariate model can be extended to a multivariate model.
• If R² denotes the coefficient of multiple determination, it shows the proportion (percentage) of the total variation of Y explained by the explanatory variables, and it is calculated by:

R² = ESS/TSS = Σŷᵢ² / Σyᵢ² = (b₁Σyᵢx₁ᵢ + b₂Σyᵢx₂ᵢ) / Σyᵢ²

and we know that 0 ≤ R² ≤ 1.
➢ Note that R² can also be calculated through RSS, i.e.

R² = 1 − RSS/TSS = 1 − Σeᵢ² / Σyᵢ²   (C)
โ€ข ๐‘น ๐Ÿ is likely to increase by including an additional explanatory
variable (see ). Therefore, in case we have two alternative
models with the same dependent variable ๐’€ but different
number of explanatory variables we should not be misled by the
high ๐‘น ๐Ÿ
of the model with more variables.
โ€ข To solve this problem we need to bring the degrees of freedom
into our consideration as a reduction factor against adding
additional explanatory variables. So, the adjusted ๐‘น ๐Ÿ which can
be shown by ๐‘น ๐Ÿ is considered as an alternative coefficient of
determination and it is calculated as:
๐‘…2 = 1 โˆ’
๐‘’๐‘–
2
๐‘› โˆ’ ๐‘˜
๐‘ฆ๐‘–
2
๐‘› โˆ’ 1
= 1 โˆ’
๐‘› โˆ’ 1
๐‘› โˆ’ ๐‘˜
.
๐‘’๐‘–
2
๐‘ฆ๐‘–
2
= 1 โˆ’
๐‘›โˆ’1
๐‘›โˆ’๐‘˜
(1 โˆ’ ๐‘…2)
C
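The adjustment is a one-line formula. A sketch with illustrative (made-up) numbers, showing how R̄² shrinks R² as the parameter count rises relative to the sample size:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (n - 1)/(n - k) * (1 - R^2), k parameters incl. intercept."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

# With 20 observations and 3 parameters, an R^2 of 0.9 shrinks slightly:
print(round(adjusted_r2(0.9, n=20, k=3), 4))  # 0.8882
```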
▪ Partial Correlation Coefficients:
• For a three-variable regression model such as

Yᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + eᵢ

we can talk about three linear associations (correlations): between Y and X₁ (r_yx₁), between Y and X₂ (r_yx₂), and finally between X₁ and X₂ (r_x₁x₂). These are called simple (gross) correlation coefficients, but they do not reflect the true linear association between two variables, since the influence of the third variable on the other two is not removed.
• The net linear association between two variables can be obtained through the partial correlation coefficient, where the influence of the third variable is removed (the variable is held constant). Symbolically, r_yx₁·x₂ represents the partial correlation coefficient between Y and X₁, holding X₂ constant.
• The two partial correlation coefficients in our model can be calculated as follows:

r_yx₁·x₂ = ( r_yx₁ − r_yx₂·r_x₁x₂ ) / [ √(1 − r²_x₁x₂) · √(1 − r²_yx₂) ]

r_yx₂·x₁ = ( r_yx₂ − r_yx₁·r_x₁x₂ ) / [ √(1 − r²_x₁x₂) · √(1 − r²_yx₁) ]
• The correlation coefficient r_x₁x₂·y has no practical importance, specifically when the direction of causality runs from the X's to Y; in this case we can simply use the simple correlation coefficient:

r_x₁x₂ = Σx₁ᵢx₂ᵢ / √( Σx₁ᵢ² · Σx₂ᵢ² )

• Partial correlation coefficients can be used to find out which explanatory variable has more linear association with the dependent variable.
▪ Hypothesis Testing in Multiple Regression Models:
In a multiple regression model, hypotheses are formed to test different aspects of the model:
i. Testing a hypothesis about an individual parameter of the model. For example,

H₀: βⱼ = 0 against H₁: βⱼ ≠ 0

If σ is unknown and is replaced by σ̂, the test statistic

t = (bⱼ − βⱼ) / se(bⱼ) = bⱼ / se(bⱼ)  (under H₀)

follows the t-distribution with n − k df (for a regression model with three parameters, including the intercept, df = n − 3).
ii. Testing a hypothesis about the equality of two parameters in the model. For example,

$$H_0: \beta_i = \beta_j \quad \text{against} \quad H_1: \beta_i \neq \beta_j$$

Again, if $\sigma$ is unknown and is replaced by $\hat{\sigma}$, the test statistic

$$t = \frac{(b_i - b_j) - (\beta_i - \beta_j)}{se(b_i - b_j)} = \frac{b_i - b_j}{\sqrt{var(b_i) + var(b_j) - 2\,cov(b_i, b_j)}}$$

follows the t-distribution with $n - k$ df.
• If $|t| > t_{\alpha/2,\,(n-k)}$ we reject $H_0$; otherwise there is not enough evidence to reject it.
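The equality test can be sketched the same way; the coefficient estimates and the variance/covariance entries below are hypothetical values one would read off the estimated covariance matrix of the OLS coefficients:

```python
import math

b_i, b_j = 1.4, 0.9                          # hypothetical slope estimates
var_bi, var_bj, cov_bij = 0.04, 0.03, 0.01   # hypothetical var/cov entries

# standard error of the difference b_i - b_j
se_diff = math.sqrt(var_bi + var_bj - 2 * cov_bij)
t_stat = (b_i - b_j) / se_diff               # test statistic under H0: beta_i = beta_j
print(round(t_stat, 3))  # → 2.236
```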
iii. Testing a hypothesis about the overall significance of the estimated model by checking whether all the slope parameters are simultaneously zero. For example, to test

$$H_0: \beta_i = 0 \ (\forall i) \quad \text{against} \quad H_1: \exists\, \beta_i \neq 0$$

the analysis of variance (ANOVA) table can be used to find out whether the mean sum of squares (MSS) due to the regression (the explanatory variables) is very far from the MSS due to the residuals. If it is, the variation of the explanatory variables contributes more to the variation of the dependent variable than the variation of the residuals does, so the ratio

$$\frac{MSS \text{ due to regression (explanatory variables)}}{MSS \text{ due to residuals (random elements)}}$$

should be much higher than one.
• The ANOVA table for the three-variable regression model can be formed as follows:

Source of variation | Sum of Squares (SS) | df | Mean Sum of Squares (MSS)
Due to Explanatory Variables | $b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}$ | 2 | $(b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i})/2$
Due to Residuals | $\sum e_i^2$ | $n - 3$ | $\hat{\sigma}^2 = \sum e_i^2/(n - 3)$
Total | $\sum y_i^2$ | $n - 1$ |

• If the regression model is meaningless, we cannot reject the null hypothesis that all slope coefficients are simultaneously zero; otherwise, the test statistic

$$F = \frac{ESS/df}{RSS/df} = \frac{(b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i})/2}{\sum e_i^2/(n - 3)}$$

which follows the F-distribution with 2 and $n - 3$ df, must be much bigger than 1.
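A compact sketch of this ANOVA F test for the three-variable model, using hypothetical sums of squares (the ESS and RSS values below are made up for illustration):

```python
n = 25                    # hypothetical sample size
ess, rss = 480.0, 120.0   # hypothetical explained and residual sums of squares

df_reg, df_res = 2, n - 3
f_stat = (ess / df_reg) / (rss / df_res)  # MSS(regression) / MSS(residuals)
print(round(f_stat, 2))   # → 44.0
```

A value this far above 1 would lead us to reject the null hypothesis at any conventional significance level.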
• In general, to test the overall significance of the sample regression for a multi-variable model (e.g. with $k$ parameters including the intercept, i.e. $k - 1$ slope parameters), the null and alternative hypotheses and the test statistic are as follows:

$$H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$$
$$H_1: \text{at least one } \beta_i \neq 0$$

$$F = \frac{ESS/(k - 1)}{RSS/(n - k)}$$
โ€ข If ๐‘ญ > ๐‘ญ ๐œถ, ๐’Œโˆ’๐Ÿ, ๐’โˆ’๐’Œ we reject ๐‘ฏ ๐ŸŽ at the significance level of ๐œถ,
otherwise there is no enough evidence to reject it.
• It is sometimes easier to use the determination coefficient $R^2$ to run the above test, because

$$R^2 = \frac{ESS}{TSS} \;\Rightarrow\; ESS = R^2 \cdot TSS$$

and also

$$RSS = (1 - R^2) \cdot TSS$$
• The ANOVA table can also be written as:

Source of variation | Sum of Squares (SS) | df | Mean Sum of Squares (MSS)
Due to Explanatory Variables | $R^2 \sum y_i^2$ | $k - 1$ | $R^2 \sum y_i^2/(k - 1)$
Due to Residuals | $(1 - R^2) \sum y_i^2$ | $n - k$ | $\hat{\sigma}^2 = (1 - R^2) \sum y_i^2/(n - k)$
Total | $\sum y_i^2$ | $n - 1$ |

• So, the test statistic F can be written as:

$$F = \frac{R^2 \sum y_i^2/(k - 1)}{(1 - R^2) \sum y_i^2/(n - k)} = \frac{n - k}{k - 1} \cdot \frac{R^2}{1 - R^2}$$
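Using the $R^2$ form, the same overall F statistic can be computed without the raw sums of squares; the sample size, parameter count, and $R^2$ below are hypothetical:

```python
n, k = 25, 3   # hypothetical sample size and number of parameters (incl. intercept)
r2 = 0.8       # hypothetical determination coefficient

# F = (n - k)/(k - 1) * R^2/(1 - R^2)
f_stat = (n - k) / (k - 1) * r2 / (1 - r2)
print(round(f_stat, 2))  # → 44.0
```

This form is convenient because regression software always reports $R^2$, so the overall significance test needs no extra computation.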
iv. Testing hypotheses about parameters when they satisfy certain restrictions.*
e.g. $H_0: \beta_i + \beta_j = 1$ against $H_1: \beta_i + \beta_j \neq 1$
v. Testing hypotheses about the stability of the estimated regression model over a specific time period or across two cross-sectional units.**
vi. Testing hypotheses about different functional forms of regression models.***
Introduction to correlation and regression analysis

  • 1. Introduction to Correlation & Regression Analysis Farzad Javidanrad November 2013
  • 2. Some Basic Concepts: o Variable: A letter (symbol) which represents the elements of a specific set. o Random Variable: A variable whose values are randomly appear based on a probability distribution. o Probability Distribution: A corresponding rule (function) which corresponds a probability to the values of a random variable (individually or to a set of them). E.g.: ๐’™ 0 1 ๐‘ƒ(๐‘ฅ) 0.5 0.5 In one trial ๐ป, ๐‘‡ In two trials ๐ป๐ป, ๐ป๐‘‡, ๐‘‡๐ป, ๐‘‡๐‘‡
  • 3. Correlation: Is there any relation between: ๏ฑ fast food sale and different seasons? ๏ฑ specific crime and religion? ๏ฑ smoking cigarette and lung cancer? ๏ฑ maths score and overall score in exam? ๏ฑ temperature and earthquake? ๏ฑ cost of advertisement and number of sold items? ๏‚ง To answer each question two sets of corresponding data need to be randomly collected. Let random variable "๐’™" represents the first group of data and random variable "๐’š" represents the second. Question: Is this true that students who have a better overall result are good in maths?
  • 4. Our aim is to find out whether there is any linear association between ๐’™ and ๐’š. In statistics, technical term for linear association is โ€œcorrelationโ€. So, we are looking to see if there is any correlation between two scores. ๏ƒ˜ โ€œLinear associationโ€ : variables are in relations at their levels, i.e. ๐’™ with ๐’š not with ๐’š ๐Ÿ , ๐’š ๐Ÿ‘ , ๐Ÿ ๐’š or even โˆ†๐’š. Imagine we have a random sample of scores in a school as following:
  • 5. In our example, the correlation between ๐’™ and ๐’š can be shown in a scatter diagram: 0 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 Y X Correlation between maths score and overall score The graph shows a positive correlation between maths scores and overall scores, i.e. when ๐’™ increases ๐’š increases too.
  • 6. Different scatter diagrams show different types of correlation: โ€ข Is this enough? Are we happy? Certainly not!! We think we know things better when they are described by numbers!!!! Although, scatter diagrams are informative but to find the degree (strength) of a correlation between two variables we need a numerical measurement. Adopted from www.pdesas.org
  • 7. Following the work of Francis Galton on regression line, in 1896 Karl Pearson introduced a formula for measuring correlation between two variables, called Correlation Coefficient or Pearsonโ€™s Correlation Coefficient. For a sample of size ๐’, sample correlation coefficient ๐’“ ๐’™๐’š can be calculated by: ๐’“ ๐’™๐’š = ๐Ÿ ๐’ (๐’™๐’Š โˆ’ ๐’™)(๐’š๐’Š โˆ’ ๐’š) ๐Ÿ ๐’ (๐’™๐’Š โˆ’ ๐’™) ๐Ÿ . ๐Ÿ ๐’ (๐’š๐’Š โˆ’ ๐’š) ๐Ÿ = ๐’„๐’๐’—(๐’™, ๐’š) ๐‘บ ๐’™ . ๐‘บ ๐’š Where ๐’™ and ๐’š are the mean values of ๐’™ and ๐’š in the sample and ๐‘บ represents the biased version of โ€œstandard deviationโ€*. The covariance between ๐’™ and ๐’š ( ๐’„๐’๐’— ๐’™, ๐’š ) shows how much ๐’™ and ๐’š change together.
  • 8. Alternatively, if there is an opportunity to observe all available data, the population correlation coefficient (๐† ๐’™๐’š) can be obtained by: ๐† ๐’™๐’š = ๐‘ฌ ๐’™๐’Š โˆ’ ๐ ๐’™ . (๐’š๐’Š โˆ’ ๐ ๐’š) ๐‘ฌ ๐’™๐’Š โˆ’ ๐ ๐’™ ๐Ÿ. ๐‘ฌ(๐’š๐’Š โˆ’ ๐ ๐’š) ๐Ÿ = ๐’„๐’๐’—(๐’™, ๐’š) ๐ˆ ๐’™ . ๐ˆ ๐’š Where ๐‘ฌ, ๐ and ๐ˆ are expected value, mean and standard deviation of the random variables, respectively and ๐‘ต is the size of the population. Question: Under what conditions can we use this population correlation coefficient?
  • 9. ๏ƒ˜ If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“ ๐’™๐’š = ๐Ÿ Maximum (perfect) positive correlation. ๏ƒ˜ If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“ ๐’™๐’š = โˆ’๐Ÿ Maximum (perfect) negative correlation. ๏ƒ˜ If there is no linear association between ๐’™ and ๐’š then ๐’“ ๐’™๐’š = ๐ŸŽ. Note 1: If there is no linear association between two random variables they might have non linear association or no association at all. For all ๐’‚ , ๐’ƒ โˆˆ ๐‘น And ๐’‚ > ๐ŸŽ For all ๐’‚ , ๐’ƒ โˆˆ ๐‘น And ๐’‚ < ๐ŸŽ
  • 10. In our example, the sample correlation coefficient is: ๐’™๐’Š ๐’š๐’Š ๐’™๐’Š โˆ’ ๐’™ ๐’š๐’Š โˆ’ ๐’š ๐’™๐’Š โˆ’ ๐’™ . (๐’š๐’Š โˆ’ ๐’š) (๐‘ฅ๐‘–โˆ’ ๐‘ฅ )2 (๐‘ฆ๐‘–โˆ’ ๐‘ฆ )2 70 73 12 13.9 166.8 144 193.21 85 90 27 30.9 834.3 729 954.81 22 31 -36 -28.1 1011.6 1296 789.61 66 50 8 -9.1 -72.8 64 82.81 15 31 -43 -28.1 1208.3 1849 789.61 58 50 0 -9.1 0 0 82.81 69 56 11 -3.1 -34.1 121 9.61 49 55 -9 -4.1 36.9 81 16.81 73 80 15 20.9 313.5 225 436.81 61 49 3 -10.1 -30.3 9 102.01 77 79 19 19.9 378.1 361 396.01 44 58 -14 -1.1 15.4 196 1.21 35 40 -23 -19.1 439.3 529 364.81 88 85 30 25.9 777 900 670.81 69 73 11 13.9 152.9 121 193.21 5196.9 6625 5084.15 ๐’“ ๐’™๐’š = ๐Ÿ ๐’ (๐’™๐’Š โˆ’ ๐’™)(๐’š๐’Š โˆ’ ๐’š) ๐Ÿ ๐’ (๐’™๐’Š โˆ’ ๐’™) ๐Ÿ . ๐Ÿ ๐’ (๐’š๐’Š โˆ’ ๐’š) ๐Ÿ = ๐Ÿ“๐Ÿ๐Ÿ—๐Ÿ”.๐Ÿ— ๐Ÿ”๐Ÿ”๐Ÿ๐Ÿ“ร—๐Ÿ“๐ŸŽ๐Ÿ–๐Ÿ’.๐Ÿ๐Ÿ“ =๐ŸŽ.๐Ÿ–๐Ÿ—๐Ÿ“ which shows an strong positive correlation between maths score and overall score.
  • 11. Positive Linear Association No Linear Association Negative Linear Association ๐‘บ ๐’™ > ๐‘บ ๐’š ๐‘บ ๐’™ = ๐‘บ ๐’š ๐‘บ ๐’™ < ๐‘บ ๐’š ๐’“ ๐’™๐’š = ๐Ÿ Adapted and modified from www.tice.agrocampus-ouest.fr ๐’“ ๐’™๐’š โ‰ˆ ๐Ÿ ๐ŸŽ < ๐’“ ๐’™๐’š < ๐Ÿ ๐’“ ๐’™๐’š = ๐ŸŽ โˆ’๐Ÿ < ๐’“ ๐’™๐’š< ๐ŸŽ ๐’“ ๐’™๐’š โ‰ˆ โˆ’๐Ÿ ๐’“ ๐’™๐’š = โˆ’๐Ÿ Perfect Weak No Correlation Weak Strong Perfect Strong
  • 12. Some properties of the correlation coefficient: (Sample or population) a. It lies between -1 and 1, i.e. โˆ’๐Ÿ โ‰ค ๐’“ ๐’™๐’š โ‰ค ๐Ÿ. b. It is symmetrical with respect to ๐’™ and ๐’š, i.e. ๐’“ ๐’™๐’š = ๐’“ ๐’š๐’™ . This means the direction of calculation is not important. c. It is just a pure number and independent from the unit of measurement of ๐’™ and ๐’š. d. It is independent of the choice of origin and scale of ๐’™ and ๐’šโ€™s measurements, that is; ๐’“ ๐’™๐’š = ๐’“ ๐’‚๐’™+๐’ƒ ๐’„๐’š+๐’… (๐’‚, ๐’„ > ๐ŸŽ)
  • 13. e. ๐’‡ ๐’™, ๐’š = ๐’‡ ๐’™ . ๐’‡(๐’š) ๐’“ ๐’™๐’š = ๐ŸŽ Important Note: Many researchers wrongly construct a theory just based on a simple correlation test. ๏ฑ Correlation does not imply causation. If there is a high correlation between number of smoked cigarettes and the number of infected lungโ€™s cells it does not necessarily mean that smoking causes lung cancer. Causality test (sometimes called Granger causality test) is different from correlation test. In causality test it is important to know about the direction of causality (e.g. ๐’™ on ๐’š and not vice versa) but in correlation we are trying to find if two variables moving together (same or opposite directions). ๐’™ and ๐’š are statistically independent, where ๐’‡(๐’™, ๐’š) is the joint Probability Density Function (PDF)
  • 14. Determination Coefficient and Correlation Coefficient: ๐’“ ๐’™๐’š = ยฑ๐Ÿ perfect linear relationship between variables: i.e. ๐’™ is the only factor which describes variations of ๐’š at the level (linearly); ๐’š = ๐’‚ + ๐’ƒ๐’™ . ๐’“ ๐’™๐’š โ‰ˆ ยฑ๐Ÿ ๐’™ is not the only factor which describes variations of ๐’š but we can still imagine that a line represents this relationship which passing through most of the points or having a minimum vertical distance from them, in total. This line is called the โ€œline of best fitโ€ or known technically as โ€œregression lineโ€. Adopted from www.ncetm.org.uk/public/files/195322/G3fb.jpg The graph shows a line of best fit between age of a car and its price. Imagine the line has the equation of ๐’š = ๐’‚ + ๐’ƒ๐’™
• 15. The criterion for choosing a line among others is the goodness of fit, which can be measured by the determination coefficient, r². ➢ In the previous example, the age of a car is only one factor among many others that explain the price of a car. Can you find some other factors? If y and x represent the price and age of cars respectively, the percentage of the variation of y which is determined (explained) by the variation of x is called the "determination coefficient". The determination coefficient can be understood better through Venn-Euler diagrams:
• 16. [Venn-Euler diagrams: circles for y and x with increasing overlap, up to coinciding circles y = x.] r² = 0: none of the variation of y can be determined by x (no linear association). r² ≈ 0: a small percentage of the variation of y can be determined by x (weak linear association). r² ≈ 1: a large percentage of the variation of y can be determined by x (strong linear association). r² = 1: all the variation of y can be determined by x and no other factors (complete linear association). The shaded area shows the percentage of the variation of y which can be determined by x. It is easy to see that 0 ≤ r² ≤ 1.
• 17. Although the determination coefficient (r²) is conceptually different from the correlation coefficient (r_xy), each can be calculated from the other; in fact: r_xy = ±√(r²). Alternatively, r² = b² · [(1/n)Σ(x_i − x̄)²] / [(1/n)Σ(y_i − ȳ)²] = b²·S_x²/S_y², where b is the slope coefficient in the regression line ŷ = a + bx. Note: If ŷ = a + bx is the regression line (y on x) and x̂ = c + dy is the other regression line (x on y), then we have: r² = b·d.
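The relation r² = b·d can be verified with a quick sketch (invented data; `slope` is a helper computing the OLS slope in deviation form):

```python
def slope(x, y):
    """OLS slope of the regression of y on x: sum(x_i y_i) / sum(x_i^2)
    in deviation form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

x = [2, 4, 5, 7, 9]
y = [1, 3, 2, 6, 8]

b = slope(x, y)    # slope of the regression of y on x
d = slope(y, x)    # slope of the regression of x on y
r_squared = b * d  # equals the determination coefficient r^2
```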
• 18. Summary of Correlation & Determination Coefficients: • Correlation means a linear association between two random variables, which can be positive, negative or zero. • Linear association means that the variables are related at their levels (linearly). • The correlation coefficient measures the strength of the linear association between two variables. It can be calculated for a sample or for the whole population. • The value of the correlation coefficient lies between -1 and 1; these extremes indicate the strongest (negative or positive) correlation, and values closer to zero indicate weaker correlation. • Correlation does not imply causation. • The determination coefficient shows the percentage of the variation of one variable which can be described by another variable, and it is a measure of the goodness of fit for lines passing through the plotted points. • The value of the determination coefficient lies between 0 and 1 and can be obtained by squaring the correlation coefficient.
  • 19. โ€ข Knowing two random variables are just linearly associated is not much satisfactory. There are sometimes a strong idea that the variation of one variable can solidly explain the variation of another. โ€ข To test this idea (hypothesis) we need another analytical approach, which is called โ€œregression analysisโ€. โ€ข In regression analysis we try to study or predict the mean (average) value of a dependent variable ๐’€ based on the knowledge we have about independent (explanatory) variable(s) ๐‘ฟ ๐Ÿ, ๐‘ฟ ๐Ÿ,โ€ฆ, ๐‘ฟ ๐’. This is familiar for those who know the meaning of conditional probabilities; as we are going to make a linear model such as, which is a deterministic part of the model in regression analysis: ๐ธ(๐‘Œ ๐‘‹1, ๐‘‹2,โ€ฆ, ๐‘‹ ๐‘›) = ๐›ฝ0 + ๐›ฝ1 ๐‘‹1 + ๐›ฝ2 ๐‘‹2 + โ‹ฏ + ๐›ฝ ๐‘› ๐‘‹ ๐‘›
  • 20. โ€ข The deterministic part of the regression model does reflect the structure of the relationship between ๐’€ and ๐‘ฟโ€ฒ ๐’” in a mathematical world but we live in a stochastic world. โ€ข Godโ€™s knowledge (if the term is applicable) is deterministic but our perception about everything in this world is always stochastic and our model should be built in this way. โ€ข To understand the concept of stochastic model letโ€™s have an example: ๏ƒ˜ If we make a model between monthly consumption expenditure ๐‘ช and monthly income ๐‘ฐ, the model cannot be deterministic (mathematical) such that for every value of ๐‘ฐ there is one and only one value of ๐‘ช (which is the concept of functional relationship in maths). Why?
• 21. ➢ Although income is the main variable determining the amount of consumption expenditure, many other factors, such as people's mood, their wealth, the interest rate and so on, are overlooked in a simple mathematical model such as C = f(I), yet their influence can change the value of C even at the same level of I. We may believe that the average impact of all these omitted variables is random (sometimes positive and sometimes negative). So, in order to build a realistic model we need to add a stochastic (random) term u to our mathematical model: C = f(I) + u. [Table: example pairs of monthly income (I) and monthly consumption expenditure (C), values between £750 and £1400.] The change in consumption expenditure comes from a change in income (I) or from a change in some random elements (u), so we can write C = f(I) + u.
  • 22. โ€ข The general stochastic model for our purpose would be as following, which is called โ€œLinear Regression Model**โ€: ๐’€๐’Š = ๐‘ฌ(๐’€๐’Š ๐‘ฟ ๐Ÿ๐’Š, โ€ฆ , ๐‘ฟ ๐’๐’Š) + ๐’–๐’Š Which can be written as: ๐’€๐’Š = ๐œท ๐ŸŽ + ๐œท ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + ๐œท ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + โ‹ฏ + ๐œท ๐’ ๐‘ฟ ๐’๐’Š + ๐’–๐’Š Where ๐’Š (๐‘– = 1,2, โ€ฆ , ๐‘›) shows time period (days, weeks, months, years and etc.) and ๐’–๐’Š is an error (stochastic) term and also a representative of all other influential variables which are not considered in the model and ignored. โ€ข The deterministic part of the model ๐‘ฌ(๐’€๐’Š ๐‘ฟ ๐Ÿ๐’Š, โ€ฆ , ๐‘ฟ ๐’๐’Š) =๐œท ๐ŸŽ + ๐œท ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + ๐œท ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + โ‹ฏ + ๐œท ๐’ ๐‘ฟ ๐’๐’Š is called Population Regression Function (PRF).
  • 23. โ€ข The general form of the Linear Regression Model with ๐’Œ explanatory variables and ๐’ observations can be shown in the matrix form as: ๐’€ ๐‘›ร—1 = ๐‘ฟ ๐‘›ร—๐‘˜ ๐œท ๐‘˜ร—1 + ๐’– ๐‘›ร—1 Or simply: ๐’€ = ๐‘ฟ๐œท + ๐’– Where ๐’€ = ๐‘Œ1 ๐‘Œ2 โ‹ฎ ๐‘Œ๐‘› , ๐‘ฟ = 1 ๐‘‹11 ๐‘‹21 1 โ‹ฎ ๐‘‹12 โ‹ฎ ๐‘‹22 โ‹ฎ 1 ๐‘‹1๐‘› ๐‘‹2๐‘› โ€ฆ ๐‘‹ ๐‘˜1 โ€ฆ โ‹ฑ ๐‘‹ ๐‘˜2 โ‹ฎ โ€ฆ ๐‘‹ ๐‘˜๐‘› , ๐œท = ๐›ฝ0 ๐›ฝ1 โ‹ฎ ๐›ฝ ๐‘˜ and ๐’– = ๐‘ข1 ๐‘ข2 โ‹ฎ ๐‘ข ๐‘› ๐’€ is also called regressand and ๐‘ฟ is a vector of regressors.
  • 24. โ€ข ๐œท ๐ŸŽ is the intercept but ๐œท๐’Š โ€ฒ ๐’” are slope coefficients which are also called regression parameters. The value of each parameter shows the magnitude of one unit change in the associated regressor ๐‘ฟ๐’Š on the mean value of the regressand ๐’€๐’Š. The idea is to estimate the unknown value of the population regression parameters based on estimators which use sample data. โ€ข The sample counterpart of the regression line can be written in the form of: ๐’€๐’Š = ๐’€๐’Š + ๐’–๐’Š or ๐’€๐’Š = ๐’ƒ ๐ŸŽ + ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + โ‹ฏ + ๐’ƒ ๐’ ๐‘ฟ ๐’๐’Š + ๐’†๐’Š Where ๐’€๐’Š = ๐’ƒ ๐ŸŽ + ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ๐’Š + โ‹ฏ + ๐’ƒ ๐’ ๐‘ฟ ๐’๐’Š is the deterministic part of the sample model and is called โ€œSample Regression Function (SRF) โ€œand ๐’ƒ๐’Š โ€ฒ ๐’” are estimators of unknown parameters ๐œท๐’Š โ€ฒ ๐’” and ๐’–๐’Š = ๐’†๐’Š is a residual.
• 25. The following graph shows the important elements of the PRF and SRF: in the PRF, Y_i − E(Y|X_i) = u_i; in the SRF, Y_i − Ŷ_i = û_i = e_i. [Graph: an observation Y_i, its estimate based on the SRF (Ŷ_i = b0 + b1X_i) and its estimate based on the PRF (E(Y|X_i) = β0 + β1X_i); adopted and altered from http://marketingclassic.blogspot.co.uk/2011_12_01_archive.html] The PRF is a hypothetical line about which we have no direct knowledge, but we try to estimate its parameters based on the data in the sample.
  • 26. โ€ข Now the question is how to calculate ๐’ƒ๐’Š โ€ฒ ๐’” based on the sample observations and how to ensure that they are good and unbiased estimators of ๐œท๐’Š โ€ฒ ๐’” in the population? โ€ข There are two main methods of calculating ๐’ƒ๐’Š โ€ฒ ๐’” and constructing SRF, called the โ€œmethod of Ordinary Least Square (OLS)โ€ and the โ€œmethod of Maximum Likelihood (ML)โ€. Here, we focus on OLS method as it is used most comprehensively. Here, for simplicity, we start with two-variable PRF (๐’€๐’Š = ๐œท ๐ŸŽ + ๐œท ๐Ÿ ๐‘ฟ๐’Š) and its SRF counterpart (๐’€๐’Š = ๐’ƒ ๐ŸŽ + ๐’ƒ ๐Ÿ ๐‘ฟ๐’Š). โ€ข According to OLS method we try to minimise some of the squared residuals in a hypothetical sample; i.e. ๐’–๐’Š ๐Ÿ = ๐’†๐’Š ๐Ÿ = ๐’€๐’Š โˆ’ ๐’€๐’Š ๐Ÿ = ๐’€๐’Š โˆ’ ๐’ƒ ๐ŸŽ โˆ’ ๐’ƒ ๐Ÿ ๐‘ฟ๐’Š ๐Ÿ
  • 27. โ€ข It is obvious from previous equation that the sum of squared residuals is a function of ๐’ƒ ๐ŸŽ and ๐’ƒ ๐Ÿ, i.e. ๐’†๐’Š ๐Ÿ = ๐’‡(๐’ƒ ๐ŸŽ, ๐’ƒ ๐Ÿ) because if these two parameters (intercept and slope) change, ๐’†๐’Š ๐Ÿ will change (see the graph on the slide 25). โ€ข Differentiating A partially with respect to ๐’ƒ ๐ŸŽ and ๐’ƒ ๐Ÿ and following the first and necessary conditions for optimisation in calculus we have: ๐ ๐’†๐’Š ๐Ÿ ๐๐’ƒ ๐ŸŽ = โˆ’๐Ÿ ๐’€๐’Š โˆ’ ๐’ƒ ๐ŸŽ โˆ’ ๐’ƒ ๐Ÿ ๐‘ฟ๐’Š = โˆ’๐Ÿ ๐’†๐’Š = ๐ŸŽ ๐ ๐’†๐’Š ๐Ÿ ๐๐’ƒ ๐Ÿ = โˆ’๐Ÿ ๐‘ฟ๐’Š ๐’€๐’Š โˆ’ ๐’ƒ ๐ŸŽ โˆ’ ๐’ƒ ๐Ÿ ๐‘ฟ๐’Š = โˆ’๐Ÿ ๐‘ฟ๐’Š ๐’†๐’Š = ๐ŸŽ A B
• 28. After simplification we arrive at two equations (the normal equations) with two unknowns b0 and b1: ΣY_i = n·b0 + b1ΣX_i and ΣX_iY_i = b0ΣX_i + b1ΣX_i², where n is the sample size. So: b1 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = Σx_iy_i / Σx_i² = cov(x, y) / S_x², where S_x is the biased version of the sample standard deviation, i.e. with n instead of (n − 1) in the denominator: S_x = √( Σ(X_i − X̄)² / n ).
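A minimal sketch of these formulas on invented data: b1 from the deviation form, b0 from the first normal equation (b0 = Ȳ − b1X̄), and a check that the residuals satisfy the first-order conditions:

```python
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
dev_x = [xi - xbar for xi in X]
dev_y = [yi - ybar for yi in Y]

# b1 = sum(x_i y_i) / sum(x_i^2) in deviation form.
b1 = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / sum(dx * dx for dx in dev_x)
# From the first normal equation: b0 = Ybar - b1 * Xbar.
b0 = ybar - b1 * xbar

# The residuals satisfy the first-order conditions:
# sum(e_i) = 0 and sum(X_i * e_i) = 0.
e = [yi - (b0 + b1 * xi) for xi, yi in zip(X, Y)]
```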
  • 29. And ๐‘0 = ๐‘Œ โˆ’ ๐‘1 ๐‘‹ โ€ข The ๐’ƒ ๐ŸŽ and ๐’ƒ ๐Ÿ obtained from OLS method are the point estimators of ๐œท ๐ŸŽ and ๐œท ๐Ÿin the population but in order to test some hypothesis about the population parameters we need to have knowledge about the distributions of their estimators. For that reason we need to make some assumptions about the explanatory variables and the error term in PRF. (see the equations in B to find the reason). ๏‚ง The Assumptions Underlying the OLS Method: 1. The regression model is linear in terms of its parameters (coefficients).* 2. The values of the explanatory variable(s) are fixed in repeated sampling. This means that the nature of explanatory variables (๐‘ฟโ€ฒ ๐’”) is non-stochastic. The only stochastic variables are error term (๐’–๐’Š) and regressand (๐’€๐’Š). 3. The disturbance (error) terms are normally distributed with zero mean and equal variance; given the value of ๐‘ฟโ€ฒ ๐’”. That is: ๐’–๐’Š~๐‘ต(๐ŸŽ, ๐ˆ ๐Ÿ)
• 30. 4. There is no autocorrelation between the error terms, i.e. cov(u_i, u_j) = 0. This means they are completely random and there is no association between them or any pattern in their appearance. 5. There is no correlation between the error terms and the explanatory variables, i.e. cov(u_i, X_i) = 0. 6. The number of observations (sample size) should be bigger than the number of parameters in the model. 7. The model should be logically and correctly specified, both in its functional form and in the type and nature of the variables entering the model. These are the assumptions of the Classical Linear Regression Model (CLRM); they are sometimes called the Gaussian assumptions on linear regression models.
  • 31. โ€ข Under these assumptions and also the central limit theorem the OLS estimators in sampling distribution (repeated sampling) ,when ๐’ โ†’ โˆž, have a normal distribution: ๐’ƒ ๐ŸŽ~๐‘ต(๐œท ๐ŸŽ, ๐‘ฟ๐’Š ๐Ÿ ๐’ ๐’™๐’Š ๐Ÿ . ๐ˆ ๐Ÿ) ๐’ƒ ๐Ÿ~๐‘ต(๐œท ๐Ÿ, ๐ˆ ๐Ÿ ๐’™๐’Š ๐Ÿ ) where ๐ˆ ๐Ÿ is the variance of the error term (๐’—๐’‚๐’“ ๐’–๐’Š = ๐ˆ ๐Ÿ) and it can be estimated itself through ๐ˆ estimator, where: ๐ˆ = ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ ๐‘œ๐‘Ÿ ๐ˆ = ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐’Œ ๐‘คโ„Ž๐‘’๐‘› ๐‘กโ„Ž๐‘’๐‘Ÿ๐‘’ ๐‘–๐‘  ๐’Œ ๐‘๐‘Ž๐‘Ÿ๐‘Ž๐‘š๐‘’๐‘ก๐‘’๐‘Ÿ ๐‘–๐‘› ๐‘กโ„Ž๐‘’ ๐‘š๐‘œ๐‘‘๐‘’๐‘™.
  • 32. โ€ข Based on the assumptions of the classical linear regression model (CLRM), Gauss-Markov Theorem asserts that the least square estimators, among unbiased estimators, have the minimum variance. So they are the Best, Linear, Unbiased Estimators (BLUE). ๏‚ง Interval Estimation For Population Parameters: โ€ข In order to construct a confidence interval for unknown ๐œทโ€ฒ ๐’” (PRFโ€™s parameters) we can either follow Z distribution (if we have a prior knowledge about ๐ˆ) or t-distribution (if we use ๐ˆ instead). โ€ข The confidence intervals for the slope parameter at any level of significance ๐œถ would be*: ๐‘ท ๐’ƒ ๐Ÿ โˆ’ ๐’ ๐œถ ๐Ÿ . ๐ˆ ๐’ƒ ๐Ÿ โ‰ค ๐œท ๐Ÿ โ‰ค ๐’ƒ ๐Ÿ + ๐’ ๐œถ ๐Ÿ . ๐ˆ ๐’ƒ ๐Ÿ = ๐Ÿ โˆ’ ๐œถ Or ๐‘ท ๐’ƒ ๐Ÿ โˆ’ ๐’• ๐œถ ๐Ÿ,(๐’โˆ’๐Ÿ). ๐ˆ ๐’ƒ ๐Ÿ โ‰ค ๐œท ๐Ÿ โ‰ค ๐’ƒ ๐Ÿ + ๐’• ๐œถ ๐Ÿ,(๐’โˆ’๐Ÿ). ๐ˆ ๐’ƒ ๐Ÿ = ๐Ÿ โˆ’ ๐œถ
• 33. ▪ Hypothesis Testing for Parameters: • The critical values (Z or t) used in the confidence intervals can be used to find the rejection area(s) and test any hypothesis about the parameters. • For example, to test H0: β1 = 0 against the alternative H1: β1 ≠ 0, after finding the critical t values (meaning we have no prior knowledge of σ and use σ̂ instead) at significance level α, we have two critical regions, and if the value of the test statistic t = (b1 − β1) / (σ̂ / √(Σx_i²)) falls in a critical region, H0: β1 = 0 must be rejected. • If we have more than one slope parameter, the degrees of freedom for the t-distribution are the sample size n minus the number of estimated parameters including the intercept, i.e. for k parameters df = n − k.
• 34. ▪ Determination Coefficient r² and Goodness of Fit: • In earlier slides we talked about the determination coefficient and its relationship with the correlation coefficient. The coefficient of determination r² comes to our attention once there is no issue about the estimation of the regression parameters. • It is a measure which shows how well the SRF fits the data. • To understand this measure properly, let's look at it from a different angle. We know that Y_i = Ŷ_i + e_i, and in deviation form, after subtracting Ȳ from both sides, Y_i − Ȳ = (Ŷ_i − Ȳ) + e_i, where e_i = Y_i − Ŷ_i. [Graph of Y_i, Ŷ_i and Ȳ adopted from Basic Econometrics, Gujarati, p. 76]
  • 35. So; ๐’€๐’Š โˆ’ ๐’€ = ( ๐’€๐’Š โˆ’ ๐’€) + (๐’€๐’Š โˆ’ ๐’€๐’Š) Or in the deviation form ๐’š๐’Š = ๐’š๐’Š + ๐’†๐’Š By squaring both sides and adding all over the sample we have: ๐’š๐’Š ๐Ÿ = ๐’š๐’Š ๐Ÿ + ๐Ÿ ๐’š๐’Š ๐’†๐’Š + ๐’†๐’Š ๐Ÿ = ๐’š๐’Š ๐Ÿ + ๐’†๐’Š ๐Ÿ Where ๐’š๐’Š ๐’†๐’Š = ๐ŸŽ according to the OLSโ€™s assumptions 3 and 5. And if we change it to the non-deviated form: ๐’€๐’Š โˆ’ ๐’€ 2 = ๐’€๐’Š โˆ’ ๐’€ 2 + ๐’€๐’Š โˆ’ ๐’€๐’Š 2 Total variation of the observed Y values around their mean =Total Sum of Squares= TSS Total explained variation of the estimated Y values around their mean = Explained Sum of Squares (by explanatory variables)= ESS Total unexplained variation of the observed Y values around the regression line= Residual Sum of Squares (Explained by error terms)= RSS
• 36. Dividing both sides by the Total Sum of Squares (TSS) we have: 1 = ESS/TSS + RSS/TSS = Σ(Ŷ_i − Ȳ)²/Σ(Y_i − Ȳ)² + Σ(Y_i − Ŷ_i)²/Σ(Y_i − Ȳ)², where Σ(Ŷ_i − Ȳ)²/Σ(Y_i − Ȳ)² = ESS/TSS is the percentage of the variation of the actual (observed) Y_i which is explained by the explanatory variables (by the regression line). • A good reader knows that this is not a new concept; the determination coefficient r² was already described as a measure of the goodness of fit of different alternative sample regression functions (SRFs): 1 = r² + RSS/TSS → r² = 1 − RSS/TSS = 1 − Σe_i² / Σ(Y_i − Ȳ)².
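The decomposition TSS = ESS + RSS and r² = 1 − RSS/TSS can be checked numerically (same style of invented data):

```python
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

# Fit the bivariate OLS line.
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(X, Y))
      / sum((xi - xbar) ** 2 for xi in X))
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in X]

tss = sum((yi - ybar) ** 2 for yi in Y)                 # total variation
ess = sum((fi - ybar) ** 2 for fi in fitted)            # explained variation
rss = sum((yi - fi) ** 2 for yi, fi in zip(Y, fitted))  # residual variation

r_squared = ess / tss   # equals 1 - rss / tss
```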
  • 37. โ€ข A good model must have a reasonable high ๐’“ ๐Ÿ but this does not mean any model with a high ๐’“ ๐Ÿ is a good model. Extremely high level of ๐’“ ๐Ÿ could be as a result of having a spurious regression line due to the variety of reasons such as non-stationarity of data, cointegration problem and etc. โ€ข In a regression model with two parameters, ๐’“ ๐Ÿ can be directly calculated: ๐’“ ๐Ÿ = ๐’€ ๐’Šโˆ’ ๐’€ ๐Ÿ ๐’€ ๐’Šโˆ’ ๐’€ ๐Ÿ = ๐’ƒ ๐ŸŽ+๐’ƒ ๐Ÿ ๐‘ฟ๐’Šโˆ’๐’ƒ ๐ŸŽโˆ’๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ ๐’€ ๐’Šโˆ’ ๐’€ ๐Ÿ = ๐’ƒ ๐Ÿ ๐Ÿ ๐‘ฟ ๐’Šโˆ’๐‘ฟ ๐Ÿ ๐’€ ๐’Šโˆ’ ๐’€ ๐Ÿ = ๐’ƒ ๐Ÿ ๐Ÿ ๐’™ ๐’Š ๐Ÿ ๐’š ๐’Š ๐Ÿ = ๐’ƒ ๐Ÿ ๐Ÿ ๐‘บ ๐‘ฟ ๐Ÿ ๐‘บ ๐’€ ๐Ÿ Where ๐‘บ ๐‘ฟ ๐Ÿ and ๐‘บ ๐’€ ๐Ÿ are the standard deviations of ๐‘ฟ and ๐’€ respectively.
• 38. ▪ Multiple Regression Analysis: • If there are more than two explanatory variables in the regression model we need additional assumptions about the independence of the explanatory variables, in particular that there is no exact linear relationship between them. • The population and sample regression models for a three-variable model can be written as follows. In the population: Y_i = β0 + β1X_1i + β2X_2i + u_i. In the sample: Y_i = b0 + b1X_1i + b2X_2i + e_i. • The OLS estimators can be obtained by minimising Σe_i². The values of the SRF parameters in deviation form are as follows: b1 = [ (Σx_1i·y_i)(Σx_2i²) − (Σx_2i·y_i)(Σx_1i·x_2i) ] / [ (Σx_1i²)(Σx_2i²) − (Σx_1i·x_2i)² ]
  • 39. ๐’ƒ ๐Ÿ = ( ๐’™ ๐Ÿ๐’Š ๐’š๐’Š)( ๐’™ ๐Ÿ๐’Š ๐Ÿ ) โˆ’ ( ๐’™ ๐Ÿ๐’Š ๐’š๐’Š)( ๐’™ ๐Ÿ๐’Š ๐’™ ๐Ÿ๐’Š) ( ๐’™ ๐Ÿ๐’Š ๐Ÿ)( ๐’™ ๐Ÿ๐’Š ๐Ÿ) โˆ’ ( ๐’™ ๐Ÿ๐’Š ๐’™ ๐Ÿ๐’Š) ๐Ÿ And the intercept parameter will be calculated in the non-deviated form as: ๐’ƒ ๐ŸŽ = ๐’€ โˆ’ ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ โˆ’ ๐’ƒ ๐Ÿ ๐‘ฟ ๐Ÿ โ€ข Under the classical assumptions and also the central limit theorem the OLS estimators in sampling distribution (repeated sampling),when ๐’ โ†’ โˆž, have a normal distribution: ๐’ƒ ๐Ÿ~๐‘ต(๐œท ๐Ÿ, ๐ˆ ๐’– ๐Ÿ. ๐’™ ๐Ÿ๐’Š ๐Ÿ ( ๐’™ ๐Ÿ๐’Š ๐Ÿ)( ๐’™ ๐Ÿ๐’Š ๐Ÿ) โˆ’ ( ๐’™ ๐Ÿ๐’Š ๐’™ ๐Ÿ๐’Š) ๐Ÿ ) ๐’ƒ ๐Ÿ~๐‘ต(๐œท ๐Ÿ, ๐ˆ ๐’– ๐Ÿ. ๐’™ ๐Ÿ๐’Š ๐Ÿ ( ๐’™ ๐Ÿ๐’Š ๐Ÿ)( ๐’™ ๐Ÿ๐’Š ๐Ÿ) โˆ’ ( ๐’™ ๐Ÿ๐’Š ๐’™ ๐Ÿ๐’Š) ๐Ÿ )
  • 40. โ€ข The distribution of the intercept parameter ๐’ƒ ๐ŸŽ is not of primary concern as in many cases it has no practical importance. โ€ข If the variance of the disturbance (error) term (๐ˆ ๐’– ๐Ÿ ) is not known the residual variance (sample variance) can be used ( ๐ˆ ๐’– ๐Ÿ ), which is an unbiased estimator of the earlier: ๐ˆ ๐’– ๐Ÿ = ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐’Œ Where ๐’Œ is the number of parameters in the model (including the intercept ๐’ƒ ๐ŸŽ). Therefore, in a regression model with two slope parameters and one intercept parameter the residual variance can be calculated by: ๐ˆ ๐’– ๐Ÿ = ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ‘
• 41. So, for a model with two slope parameters, the unbiased estimates of the variances of these parameters are: S_b1² = [Σe_i²/(n − 3)]·Σx_2i² / [ (Σx_1i²)(Σx_2i²) − (Σx_1i·x_2i)² ] = σ̂_u² / [ Σx_1i²(1 − r²_12) ] and S_b2² = [Σe_i²/(n − 3)]·Σx_1i² / [ (Σx_1i²)(Σx_2i²) − (Σx_1i·x_2i)² ] = σ̂_u² / [ Σx_2i²(1 − r²_12) ], where r²_12 = (Σx_1i·x_2i)² / (Σx_1i²·Σx_2i²).
• 42. ▪ The Coefficient of Multiple Determination (R² and R̄²): The same concept of the coefficient of determination used for a bivariate model can be extended to a multivariate model. • If R² denotes the coefficient of multiple determination, it shows the proportion (percentage) of the total variation of Y explained by the explanatory variables, and it is calculated by: R² = ESS/TSS = Σŷ_i²/Σy_i² = (b1Σy_i·x_1i + b2Σy_i·x_2i)/Σy_i²   (C), and we know that: 0 ≤ R² ≤ 1. ➢ Note that R² can also be calculated through RSS, i.e. R² = 1 − RSS/TSS = 1 − Σe_i²/Σy_i².
  • 43. โ€ข ๐‘น ๐Ÿ is likely to increase by including an additional explanatory variable (see ). Therefore, in case we have two alternative models with the same dependent variable ๐’€ but different number of explanatory variables we should not be misled by the high ๐‘น ๐Ÿ of the model with more variables. โ€ข To solve this problem we need to bring the degrees of freedom into our consideration as a reduction factor against adding additional explanatory variables. So, the adjusted ๐‘น ๐Ÿ which can be shown by ๐‘น ๐Ÿ is considered as an alternative coefficient of determination and it is calculated as: ๐‘…2 = 1 โˆ’ ๐‘’๐‘– 2 ๐‘› โˆ’ ๐‘˜ ๐‘ฆ๐‘– 2 ๐‘› โˆ’ 1 = 1 โˆ’ ๐‘› โˆ’ 1 ๐‘› โˆ’ ๐‘˜ . ๐‘’๐‘– 2 ๐‘ฆ๐‘– 2 = 1 โˆ’ ๐‘›โˆ’1 ๐‘›โˆ’๐‘˜ (1 โˆ’ ๐‘…2) C
• 44. ▪ Partial Correlation Coefficients: • For a three-variable regression model such as Y_i = b0 + b1X_1i + b2X_2i + e_i we can talk about three linear associations (correlations): between Y and X1 (r_yx1), between Y and X2 (r_yx2), and between X1 and X2 (r_x1x2). These are called simple (gross) correlation coefficients, but they do not reflect the true linear association between two variables, as the influence of the third variable on the other two is not removed. • The net linear association between two variables can be obtained through the partial correlation coefficient, where the influence of the third variable is removed (the variable is held constant). Symbolically, r_yx1.x2 represents the partial correlation coefficient between Y and X1, holding X2 constant.
  • 45. โ€ข Two partial correlation coefficients in our model can be calculated as following: ๐’“ ๐’š๐’™ ๐Ÿ. ๐’™ ๐Ÿ = ๐’“ ๐’š๐’™ ๐Ÿ โˆ’ ๐’“ ๐’š๐’™ ๐Ÿ ๐’“ ๐’™ ๐Ÿ ๐’™ ๐Ÿ ๐Ÿ โˆ’ ๐’“ ๐Ÿ ๐’™ ๐Ÿ ๐’™ ๐Ÿ . ๐Ÿ โˆ’ ๐’“ ๐Ÿ ๐’š๐’™ ๐Ÿ ๐’“ ๐’š๐’™ ๐Ÿ. ๐’™ ๐Ÿ = ๐’“ ๐’š๐’™ ๐Ÿ โˆ’ ๐’“ ๐’š๐’™ ๐Ÿ ๐’“ ๐’™ ๐Ÿ ๐’™ ๐Ÿ ๐Ÿ โˆ’ ๐’“ ๐Ÿ ๐’™ ๐Ÿ ๐’™ ๐Ÿ . ๐Ÿ โˆ’ ๐’“ ๐Ÿ ๐’š๐’™ ๐Ÿ โ€ข The correlation coefficient ๐’“ ๐’™ ๐Ÿ ๐’™ ๐Ÿ.๐’š has no practical importance. Specifically, when the direction of causality is from ๐‘ฟโ€ฒ ๐’” to ๐’€ we can simply use the simple correlation coefficient in this case: ๐’“ = ๐’™ ๐Ÿ ๐’™ ๐Ÿ ๐’™ ๐Ÿ ๐Ÿ . ๐’™ ๐Ÿ ๐Ÿ โ€ข They can be used to find out which explanatory variable has more linear association with the dependent variable.
• 46. ▪ Hypothesis Testing in Multiple Regression Models: In a multiple regression model, hypotheses are formed to test different aspects of this type of regression model: i. Testing a hypothesis about an individual parameter of the model. For example: H0: β_j = 0 against H1: β_j ≠ 0. If σ is unknown and is replaced by σ̂, the test statistic t = (b_j − β_j)/se(b_j) = b_j/se(b_j) follows the t-distribution with n − k df (for a regression model with three parameters, including the intercept, df = n − 3).
• 47. ii. Testing a hypothesis about the equality of two parameters in the model. For example, H0: β_i = β_j against H1: β_i ≠ β_j. Again, if σ is unknown and is replaced by σ̂, the test statistic t = [ (b_i − b_j) − (β_i − β_j) ] / se(b_i − b_j) = (b_i − b_j) / √( var(b_i) + var(b_j) − 2cov(b_i, b_j) ) follows the t-distribution with n − k df. • If the value of the test statistic |t| > t_(α/2, n−k), we must reject H0; otherwise there is not enough evidence to reject it.
• 48. iii. Testing a hypothesis about the overall significance of the estimated model, by checking whether all the slope parameters are simultaneously zero. For example, to test H0: β_i = 0 (∀i) against H1: ∃β_i ≠ 0, the analysis of variance (ANOVA) table can be used to find out whether the mean sum of squares (MSS) due to the regression (the explanatory variables) is very far from the MSS due to the residuals. If it is, the variation of the explanatory variables contributes more towards the variation of the dependent variable than the variation of the residuals does, so the ratio [MSS due to regression (explanatory variables)] / [MSS due to residuals (random elements)] should be much higher than one.
  • 49. โ€ข The ANOVA table for the three-variable regression model can be formed as following: โ€ข If we believe that the regression model is meaningless so we cannot reject the null hypothesis that all slope coefficients are simultaneously equal to zero, otherwise the test statistic ๐น = ๐ธ๐‘†๐‘†/๐‘‘๐‘“ ๐‘…๐‘†๐‘†/๐‘‘๐‘“ = ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š + ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š ๐Ÿ ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ‘ Which follows the F-distribution with 2 and ๐’ โˆ’ ๐Ÿ‘ df must be much bigger than 1. Source of variation Sum of Squares (SS) df Mean Sum of Squares (MSS) Due to Explanatory Variables ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š + ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š 2 ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š + ๐’ƒ ๐Ÿ ๐’š๐’Š ๐’™ ๐Ÿ๐’Š ๐Ÿ Due to Residuals ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ‘ ๐ˆ ๐Ÿ = ๐’†๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ‘ Total ๐’š๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ
  • 50. โ€ข In general, to test the overall significance of the sample regression for a multi-variable model (e.g with ๐’Œ slope parameters) the null and alternative hypotheses and the test statistic are as following: ๐‘ฏ ๐ŸŽ: ๐œท ๐Ÿ = ๐œท ๐Ÿ = โ‹ฏ = ๐œท ๐’Œ = ๐ŸŽ ๐‘ฏ ๐Ÿ: ๐’‚๐’• ๐’๐’†๐’‚๐’”๐’• ๐’•๐’‰๐’†๐’“๐’† ๐’Š๐’” ๐’๐’๐’† ๐œท๐’Š โ‰  ๐ŸŽ ๐‘ญ = ๐‘ฌ๐‘บ๐‘บ ๐’Œโˆ’๐Ÿ ๐‘น๐‘บ๐‘บ ๐’โˆ’๐’Œ โ€ข If ๐‘ญ > ๐‘ญ ๐œถ, ๐’Œโˆ’๐Ÿ, ๐’โˆ’๐’Œ we reject ๐‘ฏ ๐ŸŽ at the significance level of ๐œถ, otherwise there is no enough evidence to reject it. โ€ข It is sometimes easier to use the determination coefficient ๐‘น ๐Ÿ to run the above test, because ๐‘น ๐Ÿ = ๐‘ฌ๐‘บ๐‘บ ๐‘ป๐‘บ๐‘บ โ†’ ๐‘ฌ๐‘บ๐‘บ = ๐‘น ๐Ÿ . ๐‘ป๐‘บ๐‘บ and also ๐‘น๐‘บ๐‘บ = ๐Ÿ โˆ’ ๐‘น ๐Ÿ . ๐‘ป๐‘บ๐‘บ
  • 51. โ€ข The ANOVA table can also be written as: โ€ข So, the test statistic F can be written as: ๐‘ญ = ๐‘น ๐Ÿ ๐’š๐’Š ๐Ÿ (๐’Œ โˆ’ ๐Ÿ) (๐Ÿ โˆ’ ๐‘น ๐Ÿ) ๐’š๐’Š ๐Ÿ (๐’ โˆ’ ๐’Œ) = ๐’ โˆ’ ๐’Œ ๐’Œ โˆ’ ๐Ÿ . ๐‘น ๐Ÿ ๐Ÿ โˆ’ ๐‘น ๐Ÿ Source of variation Sum of Squares (SS) df Mean Sum of Squares (MSS) Due to Explanatory Variables ๐‘น ๐Ÿ ๐’š๐’Š ๐Ÿ ๐’Œ โˆ’ ๐Ÿ ๐‘น ๐Ÿ ๐’š๐’Š ๐Ÿ ๐’Œ โˆ’ ๐Ÿ Due to Residuals (๐Ÿ โˆ’ ๐‘น ๐Ÿ ) ๐’š๐’Š ๐Ÿ ๐’ โˆ’ ๐’Œ ๐ˆ ๐Ÿ = (๐Ÿ โˆ’ ๐‘น ๐Ÿ ) ๐’š๐’Š ๐Ÿ ๐’ โˆ’ ๐’Œ Total ๐’š๐’Š ๐Ÿ ๐’ โˆ’ ๐Ÿ
• 52. iv. Testing a hypothesis about parameters when they satisfy certain restrictions,* e.g. H0: β_i + β_j = 1 against H1: β_i + β_j ≠ 1. v. Testing a hypothesis about the stability of the estimated regression model over a specific time period or across two cross-sectional units.** vi. Testing hypotheses about different functional forms of regression models.***