This document provides an introduction to correlation and regression analysis. It defines key concepts like variables, random variables, and probability distributions. It discusses how correlation measures the strength and direction of a linear relationship between two variables. Correlation coefficients range from -1 to 1, with values closer to these extremes indicating stronger correlation. The document also introduces determination coefficients, which measure the proportion of variance in one variable explained by the other. Regression analysis builds on correlation to study and predict the average value of one variable based on the values of other explanatory variables.
Multiple Correlation Coefficient: a correlation of one variable with multiple other variables. The Multiple Correlation Coefficient, R, is a measure of the strength of the association between the independent (explanatory) variables and the one dependent (prediction) variable. This presentation explains the concept of multiple correlation and its computation process.
It also covers correlation analysis in statistics: its types and importance, the scatter diagram method, Karl Pearson's correlation coefficient, and Spearman's rank correlation coefficient.
Multiple regression analysis is a powerful technique used for predicting the unknown value of a variable from the known value of two or more variables.
This presentation covered the following topics:
1. Definition of Correlation and Regression
2. Meaning of Correlation and Regression
3. Types of Correlation and Regression
4. Karl Pearson's methods of correlation
5. Bivariate Grouped data method
6. Spearman's Rank correlation Method
7. Scatter diagram method
8. Interpretation of correlation coefficient
9. Lines of Regression
10. Regression equations
11. Difference between correlation and regression
12. Related examples
FSE 200 (Adkins)
Simple Linear Regression
Correlation only measures the strength and direction of the linear relationship between two quantitative variables. If the relationship is linear, then we would like to try to model that relationship with the equation of a line. We will use a regression line to describe the relationship between an explanatory variable and a response variable.
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Ex. It has been suggested that there is a relationship between sleep deprivation of employees and the ability to complete simple tasks. To evaluate this hypothesis, 12 people were asked to solve simple tasks after having been without sleep for 15, 18, 21, and 24 hours. The sample data are shown below.
Subject   Hours without sleep, x   Tasks completed, y
   1               15                      13
   2               15                       9
   3               15                      15
   4               18                       8
   5               18                      12
   6               18                      10
   7               21                       5
   8               21                       8
   9               21                       7
  10               24                       3
  11               24                       5
  12               24                       4
Draw a scatterplot and describe the relationship. Lay a straight-edge on top of the plot and move it around until you find what you think might be a "line of best fit." Then try to predict the number of tasks completed for someone having been without sleep 16 hours.
Was your line the same as that of the classmate sitting next to you? Probably not. We need a method that we can use to find the "best" regression line to use for prediction. The method we will use is called least-squares. No line will pass exactly through all the points in the scatterplot. When we use the line to predict a y for a given x value, if there is a data point with that same x value, we can compute the error (residual): residual = observed y − predicted ŷ.
Our goal is going to be to make the vertical distances from the line as small as possible. The most commonly used method for doing this is the least-squares method.
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Equation of the Least-Squares Regression Line
· Least-Squares Regression Line: ŷ = b0 + b1·x
· Slope of the Regression Line: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², equivalently b1 = r·(s_y / s_x)
· Intercept of the Regression Line: b0 = ȳ − b1·x̄
Generally, regression is performed using statistical software. Clearly, given the appropriate information, the above formulas are simple to use.
Once we have the regression line, how do we interpret it, and what can we do with it?
The slope of a regression line is the rate of change: the amount of change in ŷ when x increases by 1.
The intercept of the regression line is the value of ŷ when x = 0. It is statistically meaningful only when x can take on values that are close to zero.
To make a prediction, just substitute an x-value into the equation and find ŷ.
To plot the line on a scatterplot, just find a couple of points on the regression line, one near each end of the range of x in the data. Plot the points and connect them with a line.
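The least-squares recipe above can be sketched in a few lines of Python. This is a minimal sketch using NumPy and the sleep-deprivation data from the example; the variable names are mine.

```python
import numpy as np

# Sleep-deprivation data from the example above
x = np.array([15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24], dtype=float)
y = np.array([13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4], dtype=float)

# Least-squares slope and intercept from the standard formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"y-hat = {b0:.4f} + ({b1:.4f})x")            # y-hat = 26.6667 + (-0.9444)x
print(f"prediction at x = 16: {b0 + b1 * 16:.2f}")  # 11.56 tasks
```

So the fitted line predicts roughly 11 to 12 completed tasks after 16 hours without sleep, which you can compare with the line you drew by eye.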
2. Some Basic Concepts:
o Variable: A letter (symbol) which represents the elements of a specific set.
o Random Variable: A variable whose values appear randomly, based on a probability distribution.
o Probability Distribution: A rule (function) which assigns a probability to the values of a random variable (individually or to a set of them). E.g., for a fair coin:

x      0    1
p(x)  0.5  0.5

In one trial the outcomes are H, T; in two trials they are HH, HT, TH, TT.
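The coin example above can be written out explicitly. A small sketch, enumerating the two-trial sample space and its probabilities with exact fractions:

```python
from itertools import product
from fractions import Fraction

# One fair-coin trial: X in {0, 1} with p(x) = 1/2, as in the table above
p = {0: Fraction(1, 2), 1: Fraction(1, 2)}
assert sum(p.values()) == 1          # probabilities must sum to one

# Two independent trials: the sample space HH, HT, TH, TT
two_trials = list(product("HT", repeat=2))
print(two_trials)                    # the four outcomes HH, HT, TH, TT
prob_each = Fraction(1, 2) ** 2
print(prob_each)                     # 1/4 for each outcome
```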
3. Correlation:
Is there any relation between:
• fast food sales and different seasons?
• a specific crime and religion?
• smoking cigarettes and lung cancer?
• maths score and overall score in an exam?
• temperature and earthquakes?
• cost of advertisement and number of sold items?
To answer each question, two sets of corresponding data need to be randomly collected. Let the random variable X represent the first group of data and the random variable Y represent the second.
Question: Is it true that students who have a better overall result are good in maths?
4. Our aim is to find out whether there is any linear association between X and Y. In statistics, the technical term for linear association is "correlation". So, we are looking to see if there is any correlation between the two scores.
"Linear association" means the variables are related at their levels, i.e. X with Y, not with transformed versions of the variables such as powers, reciprocals or roots (e.g. Y², 1/Y or √Y).
Imagine we have a random sample of scores in a school as follows:
5. In our example, the correlation between X and Y can be shown in a scatter diagram:
[Scatter diagram: maths score (X) against overall score (Y), both axes running 0 to 100. Title: "Correlation between maths score and overall score".]
The graph shows a positive correlation between maths scores and overall scores, i.e. when X increases, Y increases too.
6. Different scatter diagrams show different types of correlation:
• Is this enough? Are we happy?
Certainly not!! We think we know things better when they are described by numbers!!!!
Although scatter diagrams are informative, to find the degree (strength) of a correlation between two variables we need a numerical measurement.
Adopted from www.pdesas.org
7. Following the work of Francis Galton on the regression line, in 1896 Karl Pearson introduced a formula for measuring the correlation between two variables, called the Correlation Coefficient or Pearson's Correlation Coefficient.
For a sample of size n, the sample correlation coefficient r_xy can be calculated by:

r_xy = Σ(X_i − X̄)(Y_i − Ȳ) / √[ Σ(X_i − X̄)² · Σ(Y_i − Ȳ)² ] = cov(X, Y) / (S_X · S_Y)

Where X̄ and Ȳ are the mean values of X and Y in the sample and S represents the biased version of "standard deviation"*. The covariance between X and Y (cov(X, Y)) shows how much X and Y change together.
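Pearson's formula can be checked numerically. A minimal sketch with hypothetical score data (the numbers are invented for illustration); note that the biased S uses 1/n, but the factors of n cancel, so the plain sums give the same value of r:

```python
import numpy as np

# Hypothetical sample of maths scores (X) and overall scores (Y)
X = np.array([55.0, 62.0, 70.0, 48.0, 80.0, 66.0, 74.0, 58.0])
Y = np.array([60.0, 65.0, 72.0, 50.0, 85.0, 70.0, 78.0, 62.0])

# Pearson's r from the formula on this slide
num = np.sum((X - X.mean()) * (Y - Y.mean()))
den = np.sqrt(np.sum((X - X.mean()) ** 2) * np.sum((Y - Y.mean()) ** 2))
r = num / den
print(round(r, 4))

# np.corrcoef computes the same quantity
print(np.isclose(r, np.corrcoef(X, Y)[0, 1]))   # True
```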
8. Alternatively, if there is an opportunity to observe all available data, the population correlation coefficient (ρ_xy) can be obtained by:

ρ_xy = E[(X − μ_X)(Y − μ_Y)] / √[ E(X − μ_X)² · E(Y − μ_Y)² ] = cov(X, Y) / (σ_X · σ_Y)

Where E, μ and σ are the expected value, mean and standard deviation of the random variables, respectively, and N is the size of the population.
Question: Under what conditions can we use this population correlation coefficient?
9. • If Y = aX + b (for all a, b ∈ R with a > 0), then r_xy = 1: maximum (perfect) positive correlation.
• If Y = aX + b (for all a, b ∈ R with a < 0), then r_xy = −1: maximum (perfect) negative correlation.
• If there is no linear association between X and Y, then r_xy = 0.
Note 1: If there is no linear association between two random variables, they might have a non-linear association or no association at all.
11. Different values of the correlation coefficient, from perfect positive to perfect negative:

r_xy = 1: perfect positive correlation
r_xy ≈ 1: strong positive correlation
0 < r_xy < 1: positive correlation, weaker as it approaches zero
r_xy = 0: no correlation
−1 < r_xy < 0: negative correlation, weaker as it approaches zero
r_xy ≈ −1: strong negative correlation
r_xy = −1: perfect negative correlation

[Scatter diagrams: three panels showing positive linear association, no linear association and negative linear association, labelled S_Y > S_X, S_Y = S_X and S_Y < S_X respectively. Adapted and modified from www.tice.agrocampus-ouest.fr]
12. Some properties of the correlation coefficient (sample or population):
a. It lies between −1 and 1, i.e. −1 ≤ r_xy ≤ 1.
b. It is symmetrical with respect to X and Y, i.e. r_xy = r_yx. This means the direction of calculation is not important.
c. It is just a pure number, independent of the units of measurement of X and Y.
d. It is independent of the choice of origin and scale of X's and Y's measurements, that is:
r_xy = r_(aX+c)(bY+d)   (a, b > 0)
13. e. If f(X, Y) = f(X)·f(Y), i.e. X and Y are statistically independent (where f(X, Y) is the joint Probability Density Function, PDF), then r_xy = 0.
Important Note:
Many researchers wrongly construct a theory based solely on a simple correlation test.
• Correlation does not imply causation.
If there is a high correlation between the number of smoked cigarettes and the number of infected lung cells, it does not necessarily mean that smoking causes lung cancer. A causality test (sometimes called a Granger causality test) is different from a correlation test.
In a causality test it is important to know the direction of causality (e.g. X on Y and not vice versa), but in a correlation test we are trying to find out whether two variables move together (in the same or opposite directions).
14. Determination Coefficient and Correlation Coefficient:
r_xy = ±1: perfect linear relationship between the variables, i.e. X is the only factor which describes the variation of Y at the level (linearly): Y = a + bX.
r_xy ≠ ±1: X is not the only factor which describes the variation of Y, but we can still imagine a line that represents this relationship, passing through most of the points or having, in total, a minimum vertical distance from them. This line is called the "line of best fit", known technically as the "regression line".
[Graph: a line of best fit between the age of a car and its price; imagine the line has the equation Y = a + bX. Adopted from www.ncetm.org.uk/public/files/195322/G3fb.jpg]
15. The criterion for choosing one line among others is the goodness of fit, which can be calculated through the determination coefficient, r².
• In the previous example, the age of a car is only one factor among many others that explain the price of a car. Can you find some other factors?
If Y and X represent the price and age of cars respectively, the percentage of the variation of Y which is determined (explained) by the variation of X is called the "determination coefficient".
The determination coefficient can be understood better by Venn-Euler diagrams:
16. [Venn-Euler diagrams: the overlap of the y and x circles grows from no overlap, through partial overlap, to complete coincidence (y = x).]
r² = 0: none of the variation of y can be determined by x (no linear association).
r² ≈ 0: a small percentage of the variation of y can be determined by x (weak linear association).
r² ≈ 1: a large percentage of the variation of y can be determined by x (strong linear association).
r² = 1: all the variation of y can be determined by x and no other factors (complete linear association).
The shaded area shows the percentage of the variation of y which can be determined by x. It is easy to understand that 0 ≤ r² ≤ 1.
17. Although the determination coefficient (r²) is conceptually different from the correlation coefficient (r_xy), one can be calculated from the other; in fact:

r_xy = ±√(r²)

Or, alternatively:

r² = b² · Σ(X_i − X̄)² / Σ(Y_i − Ȳ)² = b² · S_X² / S_Y²

Where b is the slope coefficient in the regression line Y = a + bX.
Note: If Y = a + bX shows the regression line (Y on X) and X = c + dY shows another regression line (X on Y), then we have: r² = b·d
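The relation between the two regression slopes and r² can be verified numerically. A small sketch with made-up data, checking r² = b·d, where b is the slope of Y on X and d the slope of X on Y:

```python
import numpy as np

# Hypothetical roughly-linear data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
b = Sxy / np.sum((X - X.mean()) ** 2)   # slope of Y on X
d = Sxy / np.sum((Y - Y.mean()) ** 2)   # slope of X on Y

r = np.corrcoef(X, Y)[0, 1]
print(np.isclose(r ** 2, b * d))            # True: r² = b·d
print(np.isclose(abs(r), np.sqrt(b * d)))   # True: |r| = √(b·d)
```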
18. Summary of Correlation & Determination Coefficients:
• Correlation means a linear association between two random variables, which could be positive, negative or zero.
• Linear association means that the variables are related at their levels (linearly).
• The correlation coefficient measures the strength of the linear association between two variables. It can be calculated for a sample or for the whole population.
• The value of the correlation coefficient is between −1 and 1; the extremes show the strongest correlation (negative or positive), and moving towards zero the correlation becomes weaker.
• Correlation does not imply causation.
• The determination coefficient shows the percentage of the variation of one variable which can be described by another variable, and it is a measure of the goodness of fit for lines passing through the plotted points.
• The value of the determination coefficient is between 0 and 1 and can be obtained from the correlation coefficient by squaring it.
19. • Knowing that two random variables are linearly associated is not, by itself, very satisfying. There is sometimes a strong idea that the variation of one variable can solidly explain the variation of another.
• To test this idea (hypothesis) we need another analytical approach, which is called "regression analysis".
• In regression analysis we try to study or predict the mean (average) value of a dependent variable Y based on the knowledge we have about the independent (explanatory) variable(s) X_1, X_2, …, X_k. This is familiar to those who know the meaning of conditional probabilities, as we are going to make a linear model such as the following, which is the deterministic part of the model in regression analysis:

E(Y | X_1, X_2, …, X_k) = β_0 + β_1·X_1 + β_2·X_2 + … + β_k·X_k
20. • The deterministic part of the regression model does reflect the structure of the relationship between Y and the X's in a mathematical world, but we live in a stochastic world.
• God's knowledge (if the term is applicable) is deterministic, but our perception of everything in this world is always stochastic, and our model should be built in this way.
• To understand the concept of a stochastic model, let's have an example:
• If we make a model between monthly consumption expenditure C and monthly income I, the model cannot be deterministic (mathematical) such that for every value of I there is one and only one value of C (which is the concept of a functional relationship in maths). Why?
21. • Although income is the main variable determining the amount of consumption expenditure, many other factors, such as the mood of people, their wealth, the interest rate, etc., are overlooked in a simple mathematical model such as C = f(I), but their influences can change the value of C even at the same level of I. If we believe that the average impact of all these omitted variables is random (sometimes positive and sometimes negative), then in order to make a realistic model we need to add a stochastic (random) term u to our mathematical model: C = f(I) + u

[Table of sample observations of income I and consumption C:]
I       C
£1000   £1400
£800    £1000
£750    £900
£1200   £1150

The change in the consumption expenditure comes from the change of income (I) or the change of some random elements (u), so we can write C = f(I) + u.
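The point of the stochastic term can be made concrete with a tiny simulation. A sketch with assumed parameters a and b (invented for illustration): at the very same income level, the random term u produces different consumption values.

```python
import numpy as np

# Simulate C = a + b*I + u with hypothetical a, b and a random u
rng = np.random.default_rng(0)
a, b = 200.0, 0.8                          # assumed parameters, illustration only
I = np.array([1000.0, 1000.0, 1000.0])     # the same income, observed three times
u = rng.normal(0.0, 50.0, size=3)          # omitted influences, mean zero
C = a + b * I + u
print(C)   # three different values of C at the same I
```

A deterministic model C = f(I) could never produce this pattern, which is why the error term u is part of the regression model.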
22. • The general stochastic model for our purpose would be as follows, which is called the "Linear Regression Model**":

Y_t = E(Y_t | X_1t, …, X_kt) + u_t

Which can be written as:

Y_t = β_0 + β_1·X_1t + β_2·X_2t + … + β_k·X_kt + u_t

Where t (t = 1, 2, …, n) shows the time period (days, weeks, months, years, etc.) and u_t is an error (stochastic) term, which is also a representative of all other influential variables which are not considered in the model and are ignored.
• The deterministic part of the model

E(Y_t | X_1t, …, X_kt) = β_0 + β_1·X_1t + β_2·X_2t + … + β_k·X_kt

is called the Population Regression Function (PRF).
23. • The general form of the Linear Regression Model with k explanatory variables and n observations can be shown in matrix form as:

Y_(n×1) = X_(n×k) · β_(k×1) + u_(n×1)

Or simply:

Y = Xβ + u

Where

Y = [Y_1, Y_2, …, Y_n]′

X =
  1  X_11  X_21  …  X_k1
  1  X_12  X_22  …  X_k2
  ⋮    ⋮     ⋮   ⋱    ⋮
  1  X_1n  X_2n  …  X_kn

β = [β_0, β_1, …, β_k]′  and  u = [u_1, u_2, …, u_n]′

Y is also called the regressand and X is the matrix of regressors.
24. • β_0 is the intercept and the β_i's are slope coefficients, which are also called regression parameters. The value of each parameter shows the magnitude of the effect of a one-unit change in the associated regressor X_i on the mean value of the regressand Y_t. The idea is to estimate the unknown values of the population regression parameters based on estimators which use sample data.
• The sample counterpart of the regression line can be written in the form:

Y_i = Ŷ_i + e_i

or

Y_i = b_0 + b_1·X_1i + b_2·X_2i + … + b_k·X_ki + e_i

Where Ŷ_i = b_0 + b_1·X_1i + b_2·X_2i + … + b_k·X_ki is the deterministic part of the sample model and is called the "Sample Regression Function (SRF)", the b_i's are estimators of the unknown parameters β_i, and e_i = û_i is a residual.
25. The following graph shows the important elements of the PRF and SRF (adopted and altered from http://marketingclassic.blogspot.co.uk/2011_12_01_archive.html):

In the PRF: Y_i − E(Y | X_i) = u_i
In the SRF: Y_i − Ŷ_i = e_i = û_i

SRF: Ŷ_i = b_0 + b_1·X_i
PRF: E(Y | X_i) = β_0 + β_1·X_i

[Graph: for one observation Y_i, the vertical distances to the SRF (giving e_i) and to the PRF (giving u_i) are marked.] The PRF is a hypothetical line which we have no direct knowledge of, but we try to estimate its parameters based on the data in a sample.
26. • Now the question is how to calculate the b_i's based on the sample observations, and how to ensure that they are good and unbiased estimators of the β_i's in the population.
• There are two main methods of calculating the b_i's and constructing the SRF, called the "method of Ordinary Least Squares (OLS)" and the "method of Maximum Likelihood (ML)". Here we focus on the OLS method, as it is the most comprehensively used. For simplicity, we start with the two-variable PRF (Y_i = β_0 + β_1·X_i) and its SRF counterpart (Ŷ_i = b_0 + b_1·X_i).
• According to the OLS method we try to minimise the sum of the squared residuals in a hypothetical sample; i.e.

Σe_i² = Σû_i² = Σ(Y_i − Ŷ_i)² = Σ(Y_i − b_0 − b_1·X_i)²    (A)
27. • It is obvious from the previous equation that the sum of squared residuals is a function of b_0 and b_1, i.e.

Σe_i² = f(b_0, b_1)

because if these two parameters (intercept and slope) change, Σe_i² will change (see the graph on slide 25).
• Differentiating A partially with respect to b_0 and b_1 and following the first-order (necessary) conditions for optimisation in calculus, we have:

∂(Σe_i²)/∂b_0 = −2·Σ(Y_i − b_0 − b_1·X_i) = −2·Σe_i = 0
∂(Σe_i²)/∂b_1 = −2·Σ X_i·(Y_i − b_0 − b_1·X_i) = −2·Σ X_i·e_i = 0    (B)
28. After simplification we reach two equations with two unknowns, b_0 and b_1:

ΣY_i = n·b_0 + b_1·ΣX_i
ΣX_i·Y_i = b_0·ΣX_i + b_1·ΣX_i²

Where n is the sample size. So:

b_1 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = Σx_i·y_i / Σx_i² = cov(X, Y) / S_X²

Where S_X is the biased version of the sample standard deviation, i.e. we have n instead of (n − 1) in the denominator:

S_X = √( Σ(X_i − X̄)² / n )
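The pair of normal equations for the two-variable model can be solved jointly as a small linear system. A minimal NumPy sketch, reusing the sleep-deprivation data from the handout earlier in this document:

```python
import numpy as np

# Sleep-deprivation data from the earlier example
X = np.array([15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24], dtype=float)
Y = np.array([13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4], dtype=float)
n = len(X)

# Normal equations:  ΣY = n·b0 + b1·ΣX   and   ΣXY = b0·ΣX + b1·ΣX²
A = np.array([[n, X.sum()], [X.sum(), (X ** 2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
b0, b1 = np.linalg.solve(A, rhs)

# The closed-form slope Σxy/Σx² (deviation form) gives the same answer
b1_direct = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
print(np.isclose(b1, b1_direct))          # True
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")    # b0 = 26.6667, b1 = -0.9444
```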
29. And

b_0 = Ȳ − b_1·X̄

• The b_0 and b_1 obtained from the OLS method are the point estimators of β_0 and β_1 in the population, but in order to test hypotheses about the population parameters we need to have knowledge about the distributions of their estimators. For that reason we need to make some assumptions about the explanatory variables and the error term in the PRF (see the equations in B to find the reason).
• The Assumptions Underlying the OLS Method:
1. The regression model is linear in terms of its parameters (coefficients).*
2. The values of the explanatory variable(s) are fixed in repeated sampling. This means that the nature of the explanatory variables (X's) is non-stochastic. The only stochastic variables are the error term (u_i) and the regressand (Y_i).
3. The disturbance (error) terms are normally distributed with zero mean and equal variance, given the values of the X's. That is: u_i ~ N(0, σ²)
30. 4. There is no autocorrelation between error terms, i.e.
cov(u_i, u_j) = 0   (i ≠ j)
This means they are completely random and there is no association between them or any pattern in their appearance.
5. There is no correlation between the error terms and the explanatory variables, i.e.
cov(u_i, X_i) = 0
6. The number of observations (sample size) should be bigger than the number of parameters in the model.
7. The model should be logically and correctly specified in terms of its functional form and even the type and nature of the variables entering the model.
These are the assumptions of the Classical Linear Regression Model (CLRM), which are sometimes called the Gaussian assumptions on linear regression models.
31. • Under these assumptions, and also by the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when n → ∞, have a normal distribution:

b_0 ~ N( β_0, (ΣX_i² / (n·Σx_i²)) · σ² )
b_1 ~ N( β_1, σ² / Σx_i² )

where σ² is the variance of the error term (var(u_i) = σ²), and it can itself be estimated through the estimator σ̂², where:

σ̂² = Σe_i² / (n − 2)

(2 is the number of parameters in this model.)
32. • Based on the assumptions of the classical linear regression model (CLRM), the Gauss-Markov theorem asserts that the least-squares estimators, among unbiased estimators, have the minimum variance. So they are the Best Linear Unbiased Estimators (BLUE).
• Interval Estimation for Population Parameters:
• In order to construct a confidence interval for the unknown β's (the PRF's parameters) we can either follow the Z distribution (if we have prior knowledge about σ) or the t-distribution (if we use σ̂ instead).
• The confidence interval for the slope parameter at any level of significance α would be*:

P( b_1 − z_(α/2)·S_(b_1) ≤ β_1 ≤ b_1 + z_(α/2)·S_(b_1) ) = 1 − α

Or

P( b_1 − t_(α/2, n−2)·S_(b_1) ≤ β_1 ≤ b_1 + t_(α/2, n−2)·S_(b_1) ) = 1 − α
33. • Hypothesis Testing for Parameters:
• The critical values (Z or t) in the confidence intervals can be used to find the rejection area(s) and test any hypothesis on the parameters.
• For example, to test H_0: β_1 = 0 against the alternative H_1: β_1 ≠ 0, after finding the critical values of t (which means we do not have prior knowledge of σ and use σ̂ instead) at a significance level α, we will have two critical regions, and if the value of the test statistic

t = (b_1 − β_1) / S_(b_1)

falls in a critical region, H_0: β_1 = 0 must be rejected.
• In case we have more than one slope parameter, the degrees of freedom for the t-distribution will be the sample size n minus the number of estimated parameters including the intercept, i.e. for k parameters df = n − k.
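This test can be run end to end on the sleep-deprivation data used earlier. A sketch with the 5% two-sided critical value for df = 10 hard-coded from standard t tables (about 2.228), to keep the example dependency-free:

```python
import numpy as np

# Sleep-deprivation data from the earlier example
X = np.array([15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24], dtype=float)
Y = np.array([13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4], dtype=float)
n = len(X)

Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
resid = Y - (b0 + b1 * X)

sigma2_hat = np.sum(resid ** 2) / (n - 2)   # unbiased residual variance
se_b1 = np.sqrt(sigma2_hat / Sxx)           # standard error of b1

t_stat = (b1 - 0.0) / se_b1                 # test statistic for H0: beta1 = 0
t_crit = 2.228                              # two-sided 5% value, df = 10 (t tables)

print(f"t = {t_stat:.3f}")                  # t = -5.966
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
print(f"95% CI for beta1: ({b1 - t_crit * se_b1:.3f}, {b1 + t_crit * se_b1:.3f})")
```

Since |t| far exceeds the critical value, the slope is significantly different from zero: hours without sleep do help explain tasks completed.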
34. • Determination Coefficient r² and Goodness of Fit:
• In earlier slides we talked about the determination coefficient and its relationship with the correlation coefficient. The coefficient of determination r² comes to our attention when there is no issue about the estimation of the regression parameters.
• It is a measure which shows how well the SRF fits the data.
• To understand this measure properly, let's look at it from a different angle. We know that

Y_i = Ŷ_i + e_i

and, in deviation form, after subtracting Ȳ from both sides:

Y_i − Ȳ = (Ŷ_i − Ȳ) + e_i

We know that e_i = Y_i − Ŷ_i.
[Graph: decomposition of (Y_i − Ȳ) into the explained part (Ŷ_i − Ȳ) and the residual e_i. Adopted from Basic Econometrics, Gujarati, p. 76]
35. So:

Y_i − Ȳ = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i)

Or in deviation form:

y_i = ŷ_i + e_i

By squaring both sides and summing over the sample we have:

Σy_i² = Σŷ_i² + 2·Σŷ_i·e_i + Σe_i² = Σŷ_i² + Σe_i²

Where Σŷ_i·e_i = 0 according to the OLS assumptions 3 and 5.
And if we change it to the non-deviated form:

Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)²

• Σ(Y_i − Ȳ)²: total variation of the observed Y values around their mean = Total Sum of Squares = TSS
• Σ(Ŷ_i − Ȳ)²: total explained variation of the estimated Y values around their mean = Explained Sum of Squares (by the explanatory variables) = ESS
• Σ(Y_i − Ŷ_i)²: total unexplained variation of the observed Y values around the regression line = Residual Sum of Squares (explained by the error terms) = RSS
36. Dividing both sides by the Total Sum of Squares (TSS) we have:

1 = ESS/TSS + RSS/TSS = Σ(Ŷ_i − Ȳ)² / Σ(Y_i − Ȳ)² + Σ(Y_i − Ŷ_i)² / Σ(Y_i − Ȳ)²

Where Σ(Ŷ_i − Ȳ)² / Σ(Y_i − Ȳ)² = ESS/TSS is the percentage of the variation of the actual (observed) Y_i which is explained by the explanatory variables (by the regression line).
• A good reader knows that this is not a new concept; the determination coefficient r² was already described as a measure of the goodness of fit between different alternative sample regression functions (SRFs).

1 = r² + RSS/TSS  →  r² = 1 − RSS/TSS = 1 − Σe_i² / Σ(Y_i − Ȳ)²
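The TSS = ESS + RSS decomposition, and the two equivalent expressions for r², can be verified numerically. A sketch on the sleep-deprivation data used earlier:

```python
import numpy as np

# Sleep-deprivation data from the earlier example
X = np.array([15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24], dtype=float)
Y = np.array([13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4], dtype=float)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

TSS = np.sum((Y - Y.mean()) ** 2)       # total variation
ESS = np.sum((Y_hat - Y.mean()) ** 2)   # explained variation
RSS = np.sum((Y - Y_hat) ** 2)          # residual variation

print(np.isclose(TSS, ESS + RSS))                     # True
print(round(ESS / TSS, 4), round(1 - RSS / TSS, 4))   # 0.7807 0.7807
```

Both routes give the same r², about 0.78: roughly 78% of the variation in tasks completed is explained by hours without sleep.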
37. • A good model must have a reasonably high r², but this does not mean that any model with a high r² is a good model. An extremely high r² could be the result of a spurious regression line, due to a variety of reasons such as non-stationarity of the data, a cointegration problem, etc.
• In a regression model with two parameters, r² can be calculated directly:

r² = Σ(Ŷ_i − Ȳ)² / Σ(Y_i − Ȳ)² = Σ(b_0 + b_1·X_i − b_0 − b_1·X̄)² / Σ(Y_i − Ȳ)² = b_1² · Σ(X_i − X̄)² / Σ(Y_i − Ȳ)² = b_1² · S_X² / S_Y²

Where S_X² and S_Y² are the variances of X and Y respectively.
38. • Multiple Regression Analysis:
• If there are more than two explanatory variables in the regression line we need additional assumptions about the independence of the explanatory variables, and also that there is no exact linear relationship between them.
• The population and the sample regression models for a three-variable model can be described as follows:
In the population: Y_i = β_0 + β_1·X_1i + β_2·X_2i + u_i
In the sample: Y_i = b_0 + b_1·X_1i + b_2·X_2i + e_i
• The OLS estimators can be obtained by minimising Σe_i². So, the values of the SRF parameters in deviation form are as follows:

b_1 = [ (Σx_1i·y_i)·(Σx_2i²) − (Σx_2i·y_i)·(Σx_1i·x_2i) ] / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ]
39.

b_2 = [ (Σx_2i·y_i)·(Σx_1i²) − (Σx_1i·y_i)·(Σx_1i·x_2i) ] / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ]

And the intercept parameter will be calculated in non-deviated form as:

b_0 = Ȳ − b_1·X̄_1 − b_2·X̄_2

• Under the classical assumptions, and also by the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when n → ∞, have a normal distribution:

b_1 ~ N( β_1, σ²·(Σx_2i²) / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ] )
b_2 ~ N( β_2, σ²·(Σx_1i²) / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ] )
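The deviation-form formulas for b_1, b_2 and b_0 can be cross-checked against a matrix least-squares solve. A sketch with simulated data (the data-generating coefficients 1.0, 2.0 and -1.5 are invented for illustration):

```python
import numpy as np

# Hypothetical data for Y on X1, X2
rng = np.random.default_rng(1)
X1 = rng.normal(size=30)
X2 = rng.normal(size=30)
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=0.5, size=30)

# Deviation-form sums
x1, x2, y = X1 - X1.mean(), X2 - X2.mean(), Y - Y.mean()
S11, S22, S12 = np.sum(x1 * x1), np.sum(x2 * x2), np.sum(x1 * x2)
S1y, S2y = np.sum(x1 * y), np.sum(x2 * y)

den = S11 * S22 - S12 ** 2
b1 = (S1y * S22 - S2y * S12) / den
b2 = (S2y * S11 - S1y * S12) / den
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()

# Cross-check with the matrix form Y = Xβ + u
Xmat = np.column_stack([np.ones_like(X1), X1, X2])
beta = np.linalg.lstsq(Xmat, Y, rcond=None)[0]
print(np.allclose([b0, b1, b2], beta))   # True
```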
40. • The distribution of the intercept parameter b_0 is not of primary concern, as in many cases it has no practical importance.
• If the variance of the disturbance (error) term (σ²) is not known, the residual variance (sample variance) σ̂² can be used, which is an unbiased estimator of the former:

σ̂² = Σe_i² / (n − k)

Where k is the number of parameters in the model (including the intercept b_0). Therefore, in a regression model with two slope parameters and one intercept parameter the residual variance can be calculated by:

σ̂² = Σe_i² / (n − 3)
41. So, for a model with two slope parameters, the unbiased estimates of the variances of these parameters are:

S_(b_1)² = [ Σe_i² / (n − 3) ] · (Σx_2i²) / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ] = σ̂² / [ Σx_1i² · (1 − r_12²) ]

Where r_12² = (Σx_1i·x_2i)² / (Σx_1i² · Σx_2i²).

and

S_(b_2)² = [ Σe_i² / (n − 3) ] · (Σx_1i²) / [ (Σx_1i²)·(Σx_2i²) − (Σx_1i·x_2i)² ] = σ̂² / [ Σx_2i² · (1 − r_12²) ]
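The two expressions given for S_(b_1)² are algebraically equivalent, which is easy to confirm numerically. A sketch with simulated, deliberately correlated regressors (the data are invented for illustration):

```python
import numpy as np

# Check: Σx2² / (Σx1²·Σx2² - (Σx1x2)²) equals 1 / (Σx1²·(1 - r12²))
rng = np.random.default_rng(2)
X1 = rng.normal(size=25)
X2 = 0.6 * X1 + rng.normal(size=25)     # correlated regressors

x1, x2 = X1 - X1.mean(), X2 - X2.mean()
S11, S22, S12 = np.sum(x1 * x1), np.sum(x2 * x2), np.sum(x1 * x2)
r12 = S12 / np.sqrt(S11 * S22)          # correlation between X1 and X2

lhs = S22 / (S11 * S22 - S12 ** 2)
rhs = 1.0 / (S11 * (1.0 - r12 ** 2))
print(np.isclose(lhs, rhs))             # True: the two Var(b1) forms agree
```

Note how the variance blows up as r12 approaches ±1: this is the multicollinearity problem hinted at by the "no exact linear relationship" assumption.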
42. ▪ The Coefficient of Multiple Determination ($R^2$ and $\bar{R}^2$):
The concept of the coefficient of determination used for a bivariate model can be extended to a multivariate model.
• If $R^2$ denotes the coefficient of multiple determination, it shows the proportion (percentage) of the total variation of $Y$ explained by the explanatory variables, and it is calculated by:
$$R^2 = \frac{ESS}{TSS} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{b_2 \sum y_i x_{2i} + b_3 \sum y_i x_{3i}}{\sum y_i^2}$$
And we know that: $0 \le R^2 \le 1$.
Note that $R^2$ can also be calculated through the RSS, i.e.
$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum y_i^2}$$
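Both routes to $R^2$ (through the ESS and through the RSS) must give the same number for an OLS fit. A short sketch with made-up data:

```python
import numpy as np

# Made-up data; y, x2, x3 are deviations from means, as on the slides.
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den

# R^2 via ESS/TSS ...
R2_ess = (b2 * (y @ x2) + b3 * (y @ x3)) / (y @ y)
# ... and via 1 - RSS/TSS; for OLS the two coincide.
e = y - b2 * x2 - b3 * x3  # residuals (deviation form absorbs the intercept)
R2_rss = 1 - (e @ e) / (y @ y)
print(np.isclose(R2_ess, R2_rss), 0 <= R2_ess <= 1)  # True True
```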
43. • $R^2$ is likely to increase when an additional explanatory variable is included. Therefore, if we have two alternative models with the same dependent variable $Y$ but a different number of explanatory variables, we should not be misled by the higher $R^2$ of the model with more variables.
• To solve this problem we need to bring the degrees of freedom into consideration as a reduction factor against adding extra explanatory variables. So, the adjusted $R^2$, denoted $\bar{R}^2$, is considered as an alternative coefficient of determination and is calculated as:
$$\bar{R}^2 = 1 - \frac{\sum e_i^2 / (n-k)}{\sum y_i^2 / (n-1)} = 1 - \frac{n-1}{n-k} \cdot \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{n-1}{n-k}\,(1 - R^2)
$$
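A small worked example of the degrees-of-freedom penalty (the values of $n$, $k$ and $R^2$ here are made up): keeping $R^2$ fixed while pretending one more parameter was estimated lowers $\bar{R}^2$.

```python
# Illustrative numbers: n observations, k parameters incl. intercept.
n, k = 8, 3
R2 = 0.95

# Adjusted R^2 via the closed form on the slide.
R2_adj = 1 - (n - 1) / (n - k) * (1 - R2)
print(round(R2_adj, 3))  # 0.93

# A useless extra regressor (k -> 4) with R2 unchanged lowers R2_adj.
R2_adj_bigger = 1 - (n - 1) / (n - 4) * (1 - R2)
print(R2_adj_bigger < R2_adj)  # True
```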
44. ▪ Partial Correlation Coefficients:
• For a three-variable regression model such as
$$Y_i = b_1 + b_2 X_{2i} + b_3 X_{3i} + e_i$$
we can speak of three linear associations (correlations): between $Y$ and $X_2$ ($r_{12}$), between $Y$ and $X_3$ ($r_{13}$), and between $X_2$ and $X_3$ ($r_{23}$). These correlations are called simple (gross) correlation coefficients, but they do not reflect the true linear association between two variables, because the influence of the third variable on the other two is not removed.
• The net linear association between two variables can be obtained through the partial correlation coefficient, where the influence of the third variable is removed (the variable is held constant). Symbolically, $r_{12.3}$ represents the partial correlation coefficient between $Y$ and $X_2$ holding $X_3$ constant.
45. • The two partial correlation coefficients in our model can be calculated as follows:
$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\ \sqrt{1 - r_{23}^2}}$$
$$r_{13.2} = \frac{r_{13} - r_{12}\, r_{23}}{\sqrt{1 - r_{12}^2}\ \sqrt{1 - r_{23}^2}}$$
• The partial correlation coefficient $r_{23.1}$ has no practical importance. Specifically, when the direction of causality is from the $X$'s to $Y$, we can simply use the simple correlation coefficient in this case:
$$r_{23} = \frac{\sum x_{2i} x_{3i}}{\sqrt{\sum x_{2i}^2 \cdot \sum x_{3i}^2}}$$
• Partial correlation coefficients can be used to find out which explanatory variable has more linear association with the dependent variable.
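The partial-correlation formula can be sanity-checked against its residual-based definition: $r_{12.3}$ equals the simple correlation between the parts of $y$ and $x_2$ left over after regressing each on $x_3$. The data below are made up:

```python
import numpy as np

def corr(a, b):
    # Simple correlation coefficient of two mean-zero series.
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# Hypothetical data; deviations from means.
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

r12, r13, r23 = corr(y, x2), corr(y, x3), corr(x2, x3)

# Partial correlation between Y and X2, holding X3 constant.
r12_3 = (r12 - r13 * r23) / (np.sqrt(1 - r13**2) * np.sqrt(1 - r23**2))

# Residual-based check: correlate the parts of y and x2 not
# explained by x3.
ey  = y  - (y  @ x3) / (x3 @ x3) * x3
ex2 = x2 - (x2 @ x3) / (x3 @ x3) * x3
print(np.isclose(r12_3, corr(ey, ex2)))  # True
```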
46. ▪ Hypothesis Testing in Multiple Regression Models:
In a multiple regression model, hypotheses are formed to test different aspects of the model:
i. Testing a hypothesis about an individual parameter of the model. For example,
$$H_0: \beta_2 = 0 \quad \text{against} \quad H_1: \beta_2 \ne 0$$
If $\sigma_u$ is unknown and is replaced by $\hat{\sigma}_u$, the test statistic
$$t = \frac{b_2 - \beta_2}{se(b_2)} = \frac{b_2}{se(b_2)}$$
follows the t-distribution with $n - k$ df (for a regression model with three parameters, including the intercept, $df = n - 3$).
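A numerical sketch of the individual t-test, with made-up data; the critical value 2.571 is the standard two-sided 5% value for 5 df from t-tables:

```python
import numpy as np

# Illustrative data (n = 8, k = 3 parameters -> 5 df).
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
n, k = len(Y), 3

y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den
e = y - b2 * x2 - b3 * x3
sigma2 = (e @ e) / (n - k)

# H0: beta2 = 0  ->  t = b2 / se(b2), df = n - 3.
se_b2 = np.sqrt(sigma2 * (x3 @ x3) / den)
t_stat = b2 / se_b2
t_crit = 2.571  # two-sided 5% critical value, 5 df
print(abs(t_stat) > t_crit)  # True: reject H0 for these data
```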
47. ii. Testing a hypothesis about the equality of two parameters in the model. For example,
$$H_0: \beta_2 = \beta_3 \quad \text{against} \quad H_1: \beta_2 \ne \beta_3$$
Again, if $\sigma_u$ is unknown and is replaced by $\hat{\sigma}_u$, the test statistic
$$t = \frac{(b_2 - b_3) - (\beta_2 - \beta_3)}{se(b_2 - b_3)} = \frac{b_2 - b_3}{\sqrt{var(b_2) + var(b_3) - 2\,cov(b_2, b_3)}}$$
follows the t-distribution with $n - 3$ df.
• If $|t| > t_{\alpha/2,\,(n-3)}$ we must reject $H_0$; otherwise there is not enough evidence to reject it.
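A sketch of the equality test with made-up data. The covariance term $cov(b_2,b_3) = -\hat{\sigma}_u^2 \sum x_{2i}x_{3i} / [(\sum x_{2i}^2)(\sum x_{3i}^2) - (\sum x_{2i}x_{3i})^2]$ used below is the standard OLS result for the three-variable model:

```python
import numpy as np

# Illustrative data (n = 8, k = 3 parameters -> 5 df).
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
n, k = len(Y), 3

y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den
e = y - b2 * x2 - b3 * x3
sigma2 = (e @ e) / (n - k)

var_b2 = sigma2 * (x3 @ x3) / den
var_b3 = sigma2 * (x2 @ x2) / den
cov_b2b3 = -sigma2 * (x2 @ x3) / den

# H0: beta2 = beta3  ->  t = (b2 - b3) / se(b2 - b3), df = n - 3.
t_stat = (b2 - b3) / np.sqrt(var_b2 + var_b3 - 2 * cov_b2b3)
print(abs(t_stat) < 2.571)  # True: cannot reject H0 at 5% (5 df)
```

Note how the same data can reject $\beta_2 = 0$ individually yet fail to reject $\beta_2 = \beta_3$: the difference of two noisy estimates has a larger standard error.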
48. iii. Testing a hypothesis about the overall significance of the estimated model, by checking whether all the slope parameters are simultaneously zero. For example, to test
$$H_0: \beta_i = 0 \ (\forall i) \quad \text{against} \quad H_1: \exists\, \beta_i \ne 0$$
the analysis of variance (ANOVA) table can be used to find whether the mean sum of squares (MSS) due to the regression (the explanatory variables) is very far from the MSS due to the residuals. If it is, the variation of the explanatory variables contributes more to the variation of the dependent variable than the variation of the residuals does, so the ratio
$$\frac{MSS \text{ due to regression (explanatory variables)}}{MSS \text{ due to residuals}}$$
should be much higher than one.
49. • The ANOVA table for the three-variable regression model can be formed as follows:

Source of variation          | Sum of Squares (SS)                         | df      | Mean Sum of Squares (MSS)
Due to Explanatory Variables | $b_2 \sum y_i x_{2i} + b_3 \sum y_i x_{3i}$ | $2$     | $\frac{b_2 \sum y_i x_{2i} + b_3 \sum y_i x_{3i}}{2}$
Due to Residuals             | $\sum e_i^2$                                | $n - 3$ | $\hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - 3}$
Total                        | $\sum y_i^2$                                | $n - 1$ |

• If the regression model is meaningless, we cannot reject the null hypothesis that all slope coefficients are simultaneously equal to zero; otherwise the test statistic
$$F = \frac{ESS/df}{RSS/df} = \frac{(b_2 \sum y_i x_{2i} + b_3 \sum y_i x_{3i})/2}{\sum e_i^2/(n-3)}$$
which follows the F-distribution with 2 and $n - 3$ df, must be much bigger than 1.
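The overall F-test can be sketched numerically. The data are made up; the critical value 5.79 is the approximate 5% value of $F_{2,5}$ from F-tables:

```python
import numpy as np

# Illustrative data for the three-variable model (n = 8).
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
n = len(Y)

y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den

ess = b2 * (y @ x2) + b3 * (y @ x3)   # SS due to explanatory variables
rss = (y @ y) - ess                   # SS due to residuals
F = (ess / 2) / (rss / (n - 3))       # F with 2 and n-3 df
print(F > 5.79)  # True: reject H0 (F_{0.05, 2, 5} ~ 5.79)
```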
50. • In general, to test the overall significance of the sample regression for a multi-variable model (e.g. with $k - 1$ slope parameters), the null and alternative hypotheses and the test statistic are as follows:
$$H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$$
$$H_1: \text{at least one of the } \beta_i \ne 0$$
$$F = \frac{ESS/(k-1)}{RSS/(n-k)}$$
• If $F > F_{\alpha,\,k-1,\,n-k}$ we reject $H_0$ at the significance level $\alpha$; otherwise there is not enough evidence to reject it.
• It is sometimes easier to use the determination coefficient $R^2$ to run the above test, because
$$R^2 = \frac{ESS}{TSS} \;\Rightarrow\; ESS = R^2 \cdot TSS$$
and also
$$RSS = (1 - R^2) \cdot TSS$$
51. • The ANOVA table can also be written as:

Source of variation          | Sum of Squares (SS)      | df      | Mean Sum of Squares (MSS)
Due to Explanatory Variables | $R^2 \sum y_i^2$         | $k - 1$ | $\frac{R^2 \sum y_i^2}{k - 1}$
Due to Residuals             | $(1 - R^2) \sum y_i^2$   | $n - k$ | $\hat{\sigma}_u^2 = \frac{(1 - R^2) \sum y_i^2}{n - k}$
Total                        | $\sum y_i^2$             | $n - 1$ |

• So, the test statistic F can be written as:
$$F = \frac{R^2 \sum y_i^2 / (k-1)}{(1 - R^2) \sum y_i^2 / (n-k)} = \frac{n-k}{k-1} \cdot \frac{R^2}{1 - R^2}$$
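The $R^2$ form of the F statistic is just an algebraic rewriting of the sum-of-squares form, which can be confirmed numerically with made-up data:

```python
import numpy as np

# Illustrative data for a model with k = 3 parameters (n = 8).
Y  = np.array([10., 12., 15., 14., 18., 20., 21., 24.])
X2 = np.array([ 2.,  3.,  4.,  4.,  6.,  7.,  7.,  9.])
X3 = np.array([ 1.,  2.,  2.,  3.,  4.,  4.,  5.,  6.])
n, k = len(Y), 3

y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den

ess = b2 * (y @ x2) + b3 * (y @ x3)
rss = (y @ y) - ess
R2 = ess / (y @ y)

# F from the sums of squares, and F from R^2: they must agree.
F_ss = (ess / (k - 1)) / (rss / (n - k))
F_r2 = (n - k) / (k - 1) * R2 / (1 - R2)
print(np.isclose(F_ss, F_r2))  # True
```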
52. iv. Testing hypotheses about parameters when they satisfy certain restrictions.*
e.g. $H_0: \beta_2 + \beta_3 = 1$ against $H_1: \beta_2 + \beta_3 \ne 1$
v. Testing hypotheses about the stability of the estimated regression model over a specific time period or across two cross-sectional units.**
vi. Testing hypotheses about different functional forms of regression models.***