2. Regression Analysis
• Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable (the response) and one or more independent variables (also called explanatory variables).
• For example, when a series of Y values (such as the monthly sales of cameras over a period of years) is causally connected with a series of X values (the monthly advertising budget), it is useful to establish a relationship between X and Y in order to forecast Y.
@Ravindra Nath Shukla (PhD Scholar) ABV-IIITM
3. Types of Regression
• There are various types of regression.
• Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variable(s) on the dependent variable.
• Some important types of regression models are:
4. Linear Regression
• Linear regression attempts to model the relationship between a dependent variable (Y) and independent variable(s) (X) by fitting a linear equation to observed data.
• The case of one independent variable is called Simple Linear Regression; with more than one independent variable, the process is called Multiple Linear Regression.
5. A. Simple Linear Regression (SLR)
• Linear regression models describe the relationship between variables by fitting a line to the observed data.
• Linear regression uses a straight line, while logistic and non-linear regression use curved lines.
• SLR assumes that, at least to some extent, the behavior of one variable (Y) is the result of a functional relationship between the two variables (X and Y).
6. Objectives of Regression
1. Establish whether there is a relationship between two variables (X and Y).
2. Forecast new observations.
7. Classical Assumptions of Linear Regression
1. Linear relationship.
2. No autocorrelation: the error terms are not correlated with one another.
3. No multicollinearity: the independent variables are not correlated with each other.
4. Homoscedasticity: the variance of the error term is the same across all values of the independent variables.
5. Multivariate normality: the data should be normally distributed.
8. Mathematical expression of linear regression
• Assume Y is a dependent variable that depends on the realization of the independent variable X.
• We know the equation of a straight line is:

  Y = mX + c        (1)

• where m = gradient (slope), and c = y-intercept, the height at which the line crosses the y-axis.
[Figure: a straight line plotted against X, crossing the Y axis at intercept c]
9.
• For the SLR model, Equation (1) can be written as:

  Y = β0 + β1 X        (2)

• where β0 = y-intercept, the value where the regression line crosses the y-axis, and β1 = coefficient (slope) of X.
[Figure: the regression line plotted against X, crossing the Y axis at intercept β0]
10.
• The slope (coefficient of X) denotes the relationship between X and Y: for a unit change in X, Y changes by β1 units.
• In other words, β1 represents the sensitivity of Y to changes in X.
• In the equation Y = β0 + β1 X, β0 gives the value of Y when X = 0.
[Figure: the regression line Y = β0 + β1 X, crossing the Y axis at intercept β0]
11. Example
Consider the demand data given in the table below.

  Month (X)   Demand (Y)
      1            9
      2           15
      3           32
      4           48
      5           52
      6           60
      7           39
      8           65
      9           90
     10           93

[Figure: scatter plot of demand (Y) against month (X) with fitted trend line y = 2.7333 + 8.6485x]

• Here, the slope (coefficient of X) is β1 = 8.6485, which denotes that for a one-unit change in X, Y changes by 8.6485.
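The fitted slope and intercept above can be reproduced with a short plain-Python sketch of the ordinary-least-squares formulas (the variable names here are my own, not from the deck):

```python
# Reproduce the fitted line y = 2.7333 + 8.6485x for the demand data
# using the OLS formulas for slope and intercept.
X = list(range(1, 11))                       # months
Y = [9, 15, 32, 48, 52, 60, 39, 65, 90, 93]  # demand

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# beta1 = sum((Xi - X_bar)(Yi - Y_bar)) / sum((Xi - X_bar)^2)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
        sum((x - x_bar) ** 2 for x in X)
beta0 = y_bar - beta1 * x_bar                # intercept = Y_bar - beta1 * X_bar

print(round(beta0, 4), round(beta1, 4))      # 2.7333 8.6485
```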
12.
• In general, the data points do not all intersect or pass through the regression line.
• Rather, there is a random error component, which can be measured as the distance between the true value and the predicted value.
• The regression model must include this error term:

  Y = β0 + β1 X + εi        (3)

  where β0 + β1 X is the non-random component and εi is the random component.

• For sample data, Equation (3) can be written as:

  y = b0 + b1 x + εi        (4)

[Figure: scatter plot of demand (Y) against month (X); the points deviate from the fitted line y = b0 + b1x + εi]
13. ORDINARY LEAST SQUARES (OLS)
• To minimize the error (random component), SLR uses the OLS method and calculates the values of β̂0 and β̂1 as:

  β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²  =  Cov(X, Y) / Var(X),   sums over i = 1, …, n

• The intercept is then

  β̂0 = Ȳ − β̂1 X̄
14. Standard error of slope and intercept

  SE(Intercept) = √( Σ εi² / (n − 2) ) × √( 1/n + X̄² / Σ (Xi − X̄)² )

  SE(Slope) = √( Σ εi² / (n − 2) ) / √( Σ (Xi − X̄)² )

  where the sums run over i = 1, …, n.
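Applied to the demand data from the earlier example, these formulas can be sketched in plain Python (the results should agree with the worked solution later in the deck up to rounding):

```python
# Standard errors of the OLS slope and intercept for the demand data.
# Both use s = sqrt(SSE / (n - 2)), the residual standard error.
from math import sqrt

X = list(range(1, 11))
Y = [9, 15, 32, 48, 52, 60, 39, 65, 90, 93]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

Sxx = sum((x - x_bar) ** 2 for x in X)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / Sxx
beta0 = y_bar - beta1 * x_bar

sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(X, Y))
s = sqrt(sse / (n - 2))                      # residual standard error

se_slope = s / sqrt(Sxx)
se_intercept = s * sqrt(1 / n + x_bar ** 2 / Sxx)

print(round(se_slope, 3), round(se_intercept, 3))  # 1.207 7.489
```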
15. Explanation of the OLS method
• To minimize the error (random component), SLR uses the OLS method.
• The method of least squares gives the best equation under the assumptions stated below:
1. The regression model is linear in the regression parameters.
2. The explanatory variable, X, is assumed to be non-random or non-stochastic (i.e., X is deterministic).
3. The conditional expected value of the error terms (residuals), E(εi | Xi), is zero.
4. In the case of time-series data, the error terms are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.
5. The variance of the errors, Var(εi | Xi), is constant for all values of Xi (homoscedasticity).
6. The error terms, εi, follow a normal distribution.
16.
• In ordinary least squares, the objective is to find the optimal values of β0 and β1 that minimize the sum of squared errors (SSE) given in Eq. (6):

  SSE = Σ εi² = Σ (Yi − β0 − β1 Xi)²,   sums over i = 1, …, n        (6)
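A quick numeric illustration (my own sketch) that the OLS estimates do minimize SSE on the demand data: perturbing either coefficient away from the fitted values only increases the sum of squared errors.

```python
# SSE(b0, b1) = sum((Yi - b0 - b1*Xi)^2); the OLS estimates minimize it.
X = list(range(1, 11))
Y = [9, 15, 32, 48, 52, 60, 39, 65, 90, 93]

def sse(b0, b1):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y))

b0_ols, b1_ols = 2.7333, 8.6485   # OLS estimates from the example
best = sse(b0_ols, b1_ols)

# Any (non-trivial) perturbation of either coefficient gives a larger SSE.
for db in (-0.5, -0.1, 0.1, 0.5):
    assert sse(b0_ols + db, b1_ols) > best
    assert sse(b0_ols, b1_ols + db) > best
```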
17.
• To find the optimal values of β0 and β1 that minimize SSE, we equate the partial derivatives of SSE with respect to β0 and β1 to zero [Eqs. (7) and (9)]:

  ∂SSE/∂β0 = −2 Σ (Yi − β0 − β1 Xi) = 0        (7)

• Solving Eq. (7) for β0, the estimated value of β0 is given by

  β̂0 = Ȳ − β̂1 X̄        (8)

• Differentiating SSE with respect to β1, we get

  ∂SSE/∂β1 = −2 Σ Xi (Yi − β0 − β1 Xi) = 0        (9)

• Substituting the value of β0 from Eq. (8) into Eq. (9), we get:
18.
• Thus, the value of β1 is given by

  β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²  =  Cov(X, Y) / Var(X),   sums over i = 1, …, n
21.
• The Excel functions give b = 8.6485 and a = 2.7333.
• Use them in the equation Y = a + bX to make a forecast.
• For example, for period 11 (X = 11), Forecast = 2.7333 + 11 × 8.6485 ≈ 97.87.
• Similarly, for period 12, Forecast = 2.7333 + 12 × 8.6485 ≈ 106.52.
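The forecasts can be reproduced with a small sketch using the fitted coefficients:

```python
# Forecast demand for future periods with the fitted line Y = a + bX.
a, b = 2.7333, 8.6485          # intercept and slope from the example

def forecast(x):
    return a + b * x

print(round(forecast(11), 2))  # 97.87
print(round(forecast(12), 2))  # 106.52
```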
22. Coefficient of Determination
• The coefficient of determination (R²), where R is the coefficient of correlation, is a measure of the variability in the dependent variable that is accounted for by the regression line.
• Calculation:

  R² = SSR / SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = 1 − SSE / SST
23.
• where Yi = actual value of Y, Ŷi = estimated value of Y, and Ȳ = mean value of Y.
• The coefficient of determination always falls between 0 and 1.
• For example, if r = 0.8, the coefficient of determination is r² = 0.64, meaning that 64% of the variation in Y is due to variation in X.
• The remaining 36% of the variation in Y is due to other variables.
• If the coefficient of determination is low, multiple regression analysis may be used to account for all variables affecting the dependent variable Y.
24. Solution: Coefficient of determination and standard errors

  Time (X)   Demand (Y)      Ŷ        (Y − Ŷ)²    (Y − Ȳ)²    (Ŷ − Ȳ)²
      1           9       11.3815       5.67      1705.69     1514.65
      2          15       20.03        25.30      1246.09      916.2729
      3          32       28.6785      11.03       334.89      467.4893
      4          48       37.327      113.91         5.29      168.2987
      5          52       45.9755      36.29         2.89       18.7013
      6          60       54.624       28.90        94.09       18.69698
      7          39       63.2725     589.15       127.69      168.2858
      8          65       71.921       47.90       216.09      467.4676
      9          90       80.5695      88.93      1576.09      916.2426
     10          93       89.218       14.30      1823.29     1514.611
  Means: X̄ = 5.5, Ȳ = 50.3    Totals: SSE = 961.41, SST = 7132.1, SSR = 6170.72

  where SST = Σ(Y − Ȳ)², SSR = Σ(Ŷ − Ȳ)², and SSE = Σε² = Σ(Y − Ŷ)².

  R² = SSR / SST = 6170.72 / 7132.1 = 0.8652

  R² = 1 − SSE / SST = 1 − 961.41 / 7132.1 = 0.8652

  SE(Intercept) = √(961.41 / 8) × √(1/10 + 30.25/82.5) = 10.96 × 0.683 = 7.488

  SE(Slope) = √(961.41 / 8) / √82.5 = 1.2069
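Both ways of computing R² can be verified on the demand data with a short sketch (SSR/SST and 1 − SSE/SST must agree for an OLS fit with an intercept):

```python
# Verify R² for the demand data two ways: SSR/SST and 1 - SSE/SST.
X = list(range(1, 11))
Y = [9, 15, 32, 48, 52, 60, 39, 65, 90, 93]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # residual variation

r2_a = ssr / sst
r2_b = 1 - sse / sst
print(round(r2_a, 4), round(r2_b, 4))  # 0.8652 0.8652
```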
25. Exercise
1. From the following data of a bookstore, ABC, derive the regression equation for the effect of purchases on sales of books. Also, calculate the standard errors and the coefficient of determination.
27.
• The coefficient of determination:

  R² = SSR / SST = Σ(Ŷ − Ȳ)² / Σ(Y − Ȳ)² = 2391.51 / 2868 = 0.8339

• Standard errors:

  SE(Intercept) = √(476.49 / 8) × √(1/10 + 8100/6380) = 7.718 × 1.172 = 9.045

  SE(Slope) = √(476.49 / 8) / √6380 = 0.0967

     X      Y       Ŷ       (Ŷ − Ȳ)²   (Y − Ȳ)²   (Y − Ŷ)²
     91     71     70.61       0.38         1         0.15
     97     75     74.29      18.42        25         0.50
    108     69     81.04     121.83         1       144.90
    121     97     89.01     361.35       729        63.85
     67     70     55.90     198.92         0       198.92
    124     91     90.85     434.67       441         0.02
     51     39     46.08     571.95       961        50.19
     73     61     59.58     108.68        81         2.03
    111     80     82.88     165.82       100         8.28
     57     47     49.76     409.50       529         7.64
  Means: X̄ = 90, Ȳ = 70    Totals: SSR = 2391.51, SST = 2868.00, SSE = 476.49

  where SSR = Σ(Ŷ − Ȳ)², SST = Σ(Y − Ȳ)², and SSE = Σε² = Σ(Y − Ŷ)².
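The exercise solution can be checked with the same plain-Python OLS sketch (purchases as X and sales as Y, following the exercise statement; the small differences from the slide's rounded intermediate figures are rounding):

```python
# Check the bookstore exercise: slope, intercept, and R² from the raw data.
X = [91, 97, 108, 121, 67, 124, 51, 73, 111, 57]   # purchases
Y = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]       # sales
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n              # 90 and 70

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))
r2 = 1 - sse / sst

print(round(b1, 4), round(b0, 2), round(r2, 4))    # 0.6132 14.81 0.8339
```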
28. B. Multiple Linear Regression (MLR)
• MLR predicts an outcome (dependent variable) based on several independent variables simultaneously.
• Why is this important?
• Behavior is rarely a function of just one variable; it is instead influenced by many variables. So the idea is that we should be able to obtain a more accurate predicted score by using multiple variables to predict our outcome.
29. [Figure: multiple regression as a many-to-one mapping. Several independent variables — e.g., km travelled (x1) and no. of deliveries (x2) — predict one dependent variable, travel time (y). With four IVs and one DV there are 10 pairwise relationships to consider, including IV-to-IV relationships (potential multicollinearity).]
30.
• The functional form of MLR is given by

  Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi

• The variable Y is the dependent variable (response variable or outcome variable);
• X1, X2, ..., Xk are independent variables (predictor variables or explanatory variables);
• β0 is a constant;
• β1, β2, ..., βk are called the partial regression coefficients corresponding to the explanatory variables X1, X2, ..., Xk respectively; and
• εi is the error term (or residual).
31.
• If εi = 0, then

  E(Yi) = β0 + β1 X1i + β2 X2i + ... + βk Xki

• The estimated value of Y will be

  Ŷi = b0 + b1 X1i + b2 X2i + ... + bk Xki

• In MLR, each coefficient is interpreted as the estimated change in Y corresponding to a unit change in one independent variable (e.g., X1), when all other variables (X2, X3, ..., Xk) are held constant.
33. New Considerations
• Adding more independent variables to a multiple regression does not mean the regression will be "better" or offer better predictions; in fact, it can make things worse. This is called OVERFITTING.
• The addition of more independent variables creates more relationships among them. Not only are the independent variables potentially related to the dependent variable, they are also potentially related to each other. When this happens, it is called MULTICOLLINEARITY.
34.
• The ideal is for all of the independent variables to be correlated with the dependent variable but NOT with each other.
• Because of multicollinearity and overfitting, a fair amount of prep work is required before conducting multiple regression analysis:
  - Correlations
  - Scatter plots
  - Simple regressions
35. Steps in MLR
1. Generate a list of potential variables; independent(s) and dependent
2. Collect data on the variables
3. Check the relationships between each independent variable and the
dependent variable using scatterplots and correlations
4. Check the relationships among the independent variables using
scatterplots and correlations
5. Conduct simple linear regressions for each IV/DV pair (Optional).
6. Use the non-redundant independent variables in the analysis to find
the best fitting model
7. Use the best fitting model to make predictions about the dependent
variable.
36. Example
Aditya Delivery Service (ADS) offers same-day delivery for letters, packages, and other small courier parcels. They use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Some trips involve more than one delivery.
The ADS company wants to estimate how long a delivery will take based on three factors:
1) the total distance of the trip in kilometers (KMs),
2) the number of deliveries that must be made during the trip, and
3) the daily price of petrol.
37.
In this case, we can predict the total travel time using the distance traveled, the number of deliveries on each trip, and the daily petrol price.

Step 1: Generate a list of potential variables, independent(s) and dependent.
Step 2: Collect data on the variables.
To conduct this analysis, a random sample of 10 past trips was taken, recording four pieces of information for each trip:

  Distance Travelled (Kms), (X1)   No. of Deliveries (X2)   Petrol Price ($), (X3)   Travel Time (hrs), (Y)
             89                             4                       3.84                      7
             66                             1                       3.19                      5.4
             78                             3                       3.78                      6.6
            111                             6                       3.89                      7.4
             44                             1                       3.57                      4.8
             77                             3                       3.57                      6.4
             80                             3                       3.03                      7
             66                             2                       3.51                      5.6
            109                             5                       3.54                      7.3
             76                             3                       3.25                      6.4
38. Step 3: Scatterplot IV to DV
Step 4: Scatterplot IV to IV

[Figure: pairwise scatter plots with fitted trend lines —
  Travel time (Y) vs distance travelled (X1): y = 0.0403x + 3.1856, R² = 0.8615
  Travel time (Y) vs no. of deliveries (X2): y = 0.4983x + 4.8454, R² = 0.8399
  Travel time (Y) vs petrol price (X3): no trend line shown
  No. of deliveries (X2) vs distance travelled (X1): y = 0.0763x − 2.97, R² = 0.9137
  Petrol price (X3) vs distance travelled (X1), and petrol price (X3) vs no. of deliveries (X2): no trend line shown]
39.
• Step 5: Conduct simple linear regressions for each IV/DV pair (optional).
• Step 6: Use the non-redundant independent variables in the analysis to find the best-fitting model.
  - In our example, Step 3 suggests that petrol price (X3) is only weakly related to Y, so including it risks overfitting.
  - Either X1 or X2 is redundant, as X1 and X2 show multicollinearity.
• Step 7: Use the best-fitting model to make predictions about the dependent variable.
40. Excel: Multiple regression
For practice purposes, we solve our example above considering one response variable Y and three predictor variables X1, X2, and X3 (the data are as in the table of Step 2).
41. SUMMARY OUTPUT

  Regression Statistics
  Multiple R           0.946
  R Square             0.895
  Adjusted R Square    0.842
  Standard Error       0.345
  Observations         10

  ANOVA
                df      SS      MS       F      Significance F
  Regression     3    6.056   2.019   16.991        0.002
  Residual       6    0.713   0.119
  Total          9    6.769

                                   Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
  Intercept                           6.211          2.321        2.677    0.037     0.533     11.890
  Distance Travelled (Kms), (X1)      0.014          0.022        0.636    0.548    -0.040      0.068
  No. of Deliveries (X2)              0.383          0.300        1.277    0.249    -0.351      1.117
  Petrol Price ($), (X3)             -0.607          0.527       -1.152    0.293    -1.895      0.682

Regression Equation:

  Ŷi = 6.211 + 0.014 X1i + 0.383 X2i − 0.607 X3i
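The Excel coefficients can be reproduced without Excel by solving the normal equations (XᵀX)b = Xᵀy directly; this is my own self-contained sketch (the small Gaussian-elimination helper is not from the deck):

```python
# Reproduce the Excel multiple-regression coefficients by solving the
# normal equations (X'X) b = X'y with a small Gaussian elimination.
X1 = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]
X2 = [4, 1, 3, 6, 1, 3, 3, 2, 5, 3]
X3 = [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25]
Y  = [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]

# Design matrix with an intercept column of ones.
rows = [[1.0, a, b, c] for a, b, c in zip(X1, X2, X3)]

def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

k = 4
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(k)]
b = solve(XtX, Xty)

print([round(v, 3) for v in b])  # ≈ [6.211, 0.014, 0.383, -0.607]
```

With numpy available, `numpy.linalg.lstsq` on the same design matrix gives the same coefficients more robustly.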
42. Manual solution: Multiple regression
Suppose, for practice purposes, we use Y as the dependent variable and X1 and X2 as independent variables:

  Travel Time (hrs), (Y)   Distance Travelled (Kms), (X1)   No. of Deliveries (X2)
          7                            89                            4
          5.4                          66                            1
          6.6                          78                            3
          7.4                         111                            6
          4.8                          44                            1
          6.4                          77                            3
          7                            80                            3
          5.6                          66                            2
          7.3                         109                            5
          6.4                          76                            3
47. Step 4: Place b0, b1, and b2 in the estimated linear regression equation:

  Ŷ = 3.733 + 0.02622 X1 + 0.1840 X2
48. SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9335
R Square 0.8714
Adjusted R Square 0.8347
Standard Error 0.3526
Observations 10
ANOVA
df SS MS F Significance F
Regression 2 5.8985 2.9493 23.7161 0.0008
Residual 7 0.8705 0.1244
Total 9 6.769
Coefficients Standard Error t Stat P-value
Intercept 3.7322 0.8870 4.2077 0.0040
X Variable 1 0.0262 0.0200 1.3101 0.2315
X Variable 2 0.1840 0.2509 0.7335 0.4871
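The two-predictor coefficients can be reproduced manually with the standard deviation-sum formulas for two regressors (a sketch of the manual solution; the Sjk names are my own shorthand for the deviation sums):

```python
# Two-predictor OLS by the manual deviation-sum formulas:
#   b1 = (S22*S1y - S12*S2y) / (S11*S22 - S12^2),
#   b2 = (S11*S2y - S12*S1y) / (S11*S22 - S12^2),
#   b0 = Y_bar - b1*X1_bar - b2*X2_bar.
X1 = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]
X2 = [4, 1, 3, 6, 1, 3, 3, 2, 5, 3]
Y  = [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]
n = len(Y)
m1, m2, my = sum(X1) / n, sum(X2) / n, sum(Y) / n

S11 = sum((a - m1) ** 2 for a in X1)
S22 = sum((b - m2) ** 2 for b in X2)
S12 = sum((a - m1) * (b - m2) for a, b in zip(X1, X2))
S1y = sum((a - m1) * (y - my) for a, y in zip(X1, Y))
S2y = sum((b - m2) * (y - my) for b, y in zip(X2, Y))

den = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / den
b2 = (S11 * S2y - S12 * S1y) / den
b0 = my - b1 * m1 - b2 * m2

print(round(b0, 4), round(b1, 4), round(b2, 4))  # 3.7322 0.0262 0.184
```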
49. Steps for manual solution: Multiple regression
Suppose we have the following dataset with one response variable Y and two predictor variables, X1 and X2.