2. INTRODUCTION
WHAT IS REGRESSION?
“A technique for determining the statistical relationship between two or more
variables where a change in a dependent variable is associated with, and depends on,
a change in one or more independent variables.”
The term regression was first used by Galton. He found that although tall fathers
tend to have tall sons, and short fathers short sons, the average height of the sons of
tall fathers is less than the average height of their fathers, while the average height of
the sons of short fathers is more than the average height of their fathers. In other
words, the average height of the sons of both tall and short fathers regresses back,
or goes back, to the general average height. He described this phenomenon as
“REGRESSION”.
3. REGRESSION ANALYSIS
In statistics, regression analysis includes many techniques for modelling and
analyzing several variables, when the focus is on the relationship between one
dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value
of the dependent variable changes when any one of the independent variables is
varied, while the other independent variables are held fixed.
Regression analysis is widely used for prediction and forecasting. Regression
analysis is also used to understand which among the independent variables are
related to the dependent variable, and to explore the forms of these
relationships.
In restricted circumstances, regression analysis can be used to infer causal
relationships between the independent and dependent variables. However, this
can suggest illusory or spurious relationships, so caution is advisable.
4. DEPENDENT AND INDEPENDENT VARIABLES
Independent variables are regarded as inputs to a system and may
take on different values freely within the system.
Dependent variables are those whose values change as a consequence
of changes in other values in the system.
Independent variables are also called predictor or explanatory
variables and are denoted by X.
Dependent variables are also called response variables and are
denoted by Y.
5. LINES OF REGRESSION
A line that can be taken as representative of the general variation of the
data is called the line of best fit.
It is a line such that the sum of the distances of the points from the
line is minimum.
It is called “THE LINE OF REGRESSION”.
The distance is not measured by dropping a perpendicular from a
point to the line. Instead, we measure the deviations (1) vertically and
(2) horizontally, obtaining one line where the vertical distances are
minimized and another where the horizontal distances are minimized.
Thus we have two lines of regression.
6. Line of regression of Y on X
If the deviations of the points from the
line, measured along the y-axis, are
minimized, we get a line called the line of
regression of Y on X
Its equation is written in the form
Y = a + bX
Line of regression of X on Y
If the deviations of the points from the
line, measured along the x-axis, are
minimized, we get a line called the line of
regression of X on Y
Its equation is written in the form
X = a + bY
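As a sketch of how the two lines differ, the snippet below fits both by least squares: minimizing vertical deviations gives the line of Y on X, and swapping the roles of the variables (minimizing horizontal deviations) gives the line of X on Y. The sample points are made up for illustration.

```python
# Sketch: the two regression lines for a small made-up sample.

def regression_coeffs(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    b = num / den
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Line of regression of Y on X: vertical deviations minimized.
a_yx, b_yx = regression_coeffs(xs, ys)
# Line of regression of X on Y: horizontal deviations minimized,
# i.e. the roles of the two variables are swapped.
a_xy, b_xy = regression_coeffs(ys, xs)
print(f"Y on X: Y = {a_yx:.2f} + {b_yx:.2f} X")
print(f"X on Y: X = {a_xy:.2f} + {b_xy:.2f} Y")
```

The two lines coincide only when the correlation is perfect; otherwise they cross at the point of means.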
8. FIRST ORDER LINEAR REGRESSION
Y = a + mX + Ɛ
WHERE,
Y = Dependent variable
X = independent variable
m = slope of the line
a = Y- intercept
Ɛ = error variable
9. SLOPE AND INTERCEPT
The slope of a line is the change in Y for a one-unit increase in X.
Intercept is the height at which the line crosses the vertical axis and is
obtained by setting X = 0 in the equation.
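A tiny sketch of these two ideas, with arbitrarily chosen coefficients:

```python
# Tiny sketch of slope and intercept; the coefficients are chosen arbitrarily.
a, m = 5.0, 2.0               # a = Y-intercept, m = slope

def line(x):
    return a + m * x

print(line(0))                # intercept: height of the line at X = 0 -> 5.0
print(line(4) - line(3))      # slope: change in Y per one-unit increase in X -> 2.0
```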
10. ERROR VARIABLE
Random error term :
Ɛ- random variable or random error term.
The quantity Ɛ in the model equation is a random variable assumed
to be normally distributed with mean = 0 and variance = σ^2
Without Ɛ, any observed pair (x,y) would correspond to a point falling
exactly on the line.
Y = a + mX, is the true regression line
The inclusion of the random error term allows (x,y) to fall either above the
true regression line (when Ɛ>0) or below the line (when Ɛ<0).
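A minimal simulation of this model, with made-up values for a, m and σ, shows points scattering on either side of the true line:

```python
import random

# A minimal simulation of Y = a + mX + ε with ε ~ N(0, σ²).
# The parameter values here are made up for illustration.

def simulate(n, a, m, sigma, seed=0):
    """Generate n points (x, y) from the true regression line plus normal error."""
    rng = random.Random(seed)
    return [(x, a + m * x + rng.gauss(0.0, sigma)) for x in range(n)]

a, m = 2.0, 1.5
for x, y in simulate(5, a, m, sigma=0.5):
    eps = y - (a + m * x)            # recover the error term
    side = "above" if eps > 0 else "below"
    print(f"x={x}: (x, y) falls {side} the true line (eps={eps:+.2f})")
```

Setting sigma to 0 removes the error term, and every point then lies exactly on the line, as the slide notes.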
11. SCATTER DIAGRAM
Simplest method of obtaining the line of regression.
The data are plotted on graph paper with the independent
variable on the x-axis and the dependent variable on the y-axis.
If the correlation is perfect, i.e. r = 1 (positive or negative), we have a
perfect line of regression. The points lie on a straight line and the error is
zero.
However, in real practice we do not come across perfect correlation.
Usually the points are scattered in a narrow strip, and we have
to find a line which best represents all the points on the scatter
diagram. We draw a line which is as close to all the points as
possible.
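Since the tightness of the scatter is summarized by the correlation coefficient r, a small sketch can make this concrete. The helper below computes Pearson's r from scratch; the data points are made up for illustration.

```python
import math

# Sketch: Pearson's correlation coefficient r, computed from scratch.
# Points lying exactly on a straight line give r = ±1.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4]
print(pearson_r(xs, [2 * x + 1 for x in xs]))   # points exactly on a line -> 1.0
print(pearson_r(xs, [3, 1, 4, 2]))              # a scattered cloud: |r| < 1
```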
12. Scatter diagram with a perfect line
of regression, i.e. r = ±1.
Scatter diagram with error, i.e.
r ≠ ±1.
13. LEAST SQUARE METHOD
The famous German mathematician Carl Friedrich Gauss had investigated
the method of least squares as early as 1794, but unfortunately he did not
publish the method until 1809.
In the meantime, the method was discovered and published in 1806 by
the French mathematician Legendre, who quarrelled with Gauss about
who had discovered the method first.
The basic idea of the method of least squares is easy to understand. It
may seem unusual that when several people measure the same quantity,
they usually do not obtain the same results.
In fact, if the same person measures the same quantity several times, the
results vary. What, then, is the best estimate of the true measurement?
14. The method of least squares gives a way to find the best estimate,
assuming that the errors (i.e. the differences from the true value) are
random and unbiased. Let us consider a simple example.
Problem: Suppose we measure a distance four times, and obtain the
following results: 72, 69, 70 and 73 units What is the best estimate of the
correct measurement?
Let us denote the estimate of the true measurement by x, and form the
deviations (errors) from x, namely x − 72, x− 69, x− 70, and x − 73.
Let S be the sum of the squares of these errors, i.e. S = (x − 72)^2 + (x −
69)^2 + (x − 70)^2 + (x − 73)^2.
We seek the value of x that minimizes S. We can write S in the
equivalent form S = 4(x − 71)^2 + 10. We can see from this form (or we can
use calculus) that the minimum value of S is 10, attained when x = 71.
So the best estimate of the true measurement is 71 units!
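The worked example can be checked directly in code:

```python
# Checking the worked example: S(x) is the sum of squared errors for the
# four measurements, and it is minimized at their mean.
measurements = [72, 69, 70, 73]

def S(x):
    return sum((x - m) ** 2 for m in measurements)

best = sum(measurements) / len(measurements)   # least-squares estimate = mean
print(best)            # -> 71.0
print(S(best))         # minimum of S -> 10.0
print(S(70), S(72))    # nearby values give a larger S
```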
15. LINE OF BEST FIT USING LEAST SQUARE
METHOD
Suppose we have to find the best-fit line relating age and overall
satisfaction from our survey.

AGE (𝒚𝒊)    OVERALL SATISFACTION (𝒙𝒊)
22          8
19          7
50          9
25          10
44          5
23          6
30          7
22          6
21          8
40          4

[Scatter plot of overall satisfaction (x-axis, 0–12) against age (y-axis, 0–60)]
16. We have the equation of a line Y = aX + b,
where a = slope and b = Y-intercept.
By calculating the values of the slope and the intercept, the best-fit
line can be obtained.
These values can be found by simultaneously solving the normal
equations:
a Σ𝑥𝑖² + b Σ𝑥𝑖 = Σ𝑥𝑖 𝑦𝑖
a Σ𝑥𝑖 + b n = Σ𝑦𝑖
On solving these equations for the survey data we get a = −1.167 and b = 37.77.
Using these values in the equation of the line, we obtain the best-fit
line of regression.
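The normal equations can be solved directly for the survey data. The sketch below uses Cramer's rule for the 2×2 system:

```python
# Sketch: solving the two normal equations for the survey data above
# (following the slide's convention: a = slope, b = intercept).
xs = [8, 7, 9, 10, 5, 6, 7, 6, 8, 4]            # overall satisfaction
ys = [22, 19, 50, 25, 44, 23, 30, 22, 21, 40]   # age
n = len(xs)

Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

# Solve  a*Sxx + b*Sx = Sxy  and  a*Sx + b*n = Sy  by Cramer's rule.
det = Sxx * n - Sx * Sx
a = (Sxy * n - Sy * Sx) / det
b = (Sxx * Sy - Sx * Sxy) / det
print(f"a (slope) = {a:.3f}, b (intercept) = {b:.2f}")   # a ≈ -1.167, b ≈ 37.77
```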
18. MAIN APPLICATIONS OF LINEAR REGRESSION
If the goal is prediction or forecasting, then linear regression can be
used to fit a predictive model to an observed data set of y and x
values.
After developing such a model, if an additional value of X is given
without the accompanying value of Y, the fitted model can be used to
make a prediction for the value of Y.
Given a variable y and a number of variables X1, …, Xp that may be
related to y, linear regression analysis can be applied to quantify the
strength of the relationship between y and the Xj, to assess which Xj
may have no relationship with y at all, and to identify which subsets
of the Xj contain redundant information about y.
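As a small illustration of such a prediction, the sketch below uses (rounded) coefficients like those obtained in the survey example earlier; the new X value is hypothetical:

```python
# Illustration: predicting Y for a new X value from a fitted line.
# The coefficients are rounded values from the survey example; the new X
# (a satisfaction score of 7) is hypothetical.
a, b = -1.167, 37.77          # slope and intercept

def predict(x):
    return b + a * x          # Y = intercept + slope * X

print(round(predict(7), 2))   # predicted age for a satisfaction score of 7 -> 29.6
```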