- Regression analysis is used to predict the value of a dependent variable based on one or more independent variables and explain the relationship between them.
- There are different types of regression depending on whether the dependent variable is continuous or binary: ordinary least squares (OLS) regression is used for continuous dependent variables, while logistic regression is used for binary dependent variables.
- The simple linear regression model describes the relationship between one independent and one dependent variable as a linear equation. This can be extended to multiple linear regression with more than one independent variable.
2. What & Why
What is Regression?
Formulation of a functional relationship between a set of independent or explanatory variables (X's) and a dependent or response variable (Y):
Y = f(X)
Why Regression?
Knowledge of Y is crucial for decision making.
• Will he/she buy or not?
• Shall I offer him/her the loan or not?
• ………
X is available at the time of decision making and is related to Y, which makes it possible to predict Y.
3. Types of Regression
• If Y is continuous (e.g., sales volume, claim amount, % of sales growth): Ordinary Least Squares (OLS) Regression
• If Y is binary (0/1) (e.g., buy/no-buy, survive/not-survive, win/loss): Logistic Regression
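The binary branch of this split can be sketched in code. A minimal illustration of logistic regression fitted by gradient descent; the toy data, learning rate, and iteration count are my own assumptions, not from the slides:

```python
import numpy as np

# Toy binary-response data: y = 1 ("buy") tends to occur for larger x.
# These values are illustrative assumptions.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Logistic regression models P(Y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x))).
# Fit b0, b1 by gradient descent on the negative log-likelihood.
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # predicted probabilities
    b0 -= lr * np.mean(p - y)                 # gradient w.r.t. intercept
    b1 -= lr * np.mean((p - y) * x)           # gradient w.r.t. slope

p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))      # fitted probabilities rise with x
```

Unlike OLS, the output is a probability between 0 and 1, which is why this model suits a binary Y.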
4. Intro to Regression Analysis
• Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to explain, usually denoted by Y.
• Independent variable: the variable used to explain the dependent variable, usually denoted by X.
7. Simple Linear Regression Model
• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
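As a concrete sketch of fitting this model, the least squares formulas give the slope and intercept directly; the data values below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative data: y depends (roughly) linearly on x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates:
#   b1 = cov(x, y) / var(x)
#   b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x  # predicted values from the fitted line
```

For this data the fitted line is approximately y = 0.05 + 1.99x, close to the y = 2x pattern built into the points.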
23. Population Linear Regression
[Figure: scatter plot of Y against X showing the population regression line Y = β0 + β1X + u, with intercept β0 and slope β1; ui is the random error for this x value, the gap between an individual observation (e.g., an individual person's marks) at xi and the predicted value of Y for xi.]
Y = β0 + β1X + u
24. Population Regression Function
Y = β0 + β1X + u
where Y is the dependent variable, X the independent variable, β0 the population y-intercept, β1 the population slope coefficient, and u the random error term (residual); β0 + β1X is the linear component and u the random error component.
But can we actually get this equation?
If yes, what information will we need?
25. Sample Regression Function
[Figure: scatter plot of y against x showing the fitted line with intercept b0 and slope b1; ei is the residual at xi, the gap between the observed value of y for xi and the predicted value of Y for xi.]
y = b0 + b1x + e
26. Sample Regression Function
yi = b0 + b1xi + ei
where b0 is the estimate of the regression intercept, b1 the estimate of the regression slope, xi the independent variable, and ei the error term.
Notice the similarity with the Population Regression Function.
Can we do something about the error term?
27. The Error Term (Residual)
• Represents the influence of all the variables we have not accounted for in the equation
• It is the difference between the actual y values and the y values predicted by the Sample Regression Line
• Wouldn't it be good if we could reduce this error term?
• By the way, what are we trying to achieve by Sample Regression?
31. OLS Regression Properties
• The sum of the residuals from the least squares regression line is zero: ∑(y − ŷ) = 0
• The sum of the squared residuals, ∑(y − ŷ)², is a minimum
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1
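The first and third properties can be checked numerically. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.5, 6.1, 9.0, 10.2, 12.8])

# Least squares fit (same closed-form estimates as before).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Property: residuals sum to zero (up to floating-point error).
sum_is_zero = np.isclose(residuals.sum(), 0.0)
# Property: the fitted line passes through (mean(x), mean(y)).
passes_through_means = np.isclose(b0 + b1 * x.mean(), y.mean())
```

Both checks come out True for any data set, because they follow algebraically from the least squares formulas rather than from the particular numbers used.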
32. Limitations of Regression Analysis
• Parameter instability: this happens in situations where correlations change over time. It is very common in financial markets, where economic, tax, regulatory, and political factors change frequently.
• Public knowledge of a specific regression relation may cause a large number of people to react in a similar fashion towards the variables, negating its future usefulness.
• If any of the regression assumptions are violated, predictions of the dependent variable and hypothesis tests will not be valid.
33. General Multiple Linear Regression Model
• In simple linear regression, the dependent variable was assumed to depend on only one (independent) variable.
• In the general multiple linear regression model, the dependent variable derives its value from two or more variables.
• The general multiple linear regression model takes the following form:
Yi = b0 + b1X1i + b2X2i + ……… + bkXki + εi
where:
Yi = ith observation of dependent variable Y
Xki = ith observation of kth independent variable X
b0 = intercept term
bk = slope coefficient of kth independent variable
εi = error term of ith observation
n = number of observations
k = total number of independent variables
34. Estimated Regression Equation
• As we calculated the intercept and the slope coefficient in simple linear regression by minimizing the sum of squared errors, we similarly estimate the intercept and slope coefficients in multiple linear regression: the sum of squared errors, ∑ᵢ₌₁ⁿ εi², is minimized.
• The resultant estimated equation becomes:
Ŷi = b̂0 + b̂1X1i + b̂2X2i + ……… + b̂kXki
• Now the error in the ith observation can be written as:
εi = Yi − Ŷi = Yi − (b̂0 + b̂1X1i + b̂2X2i + ……… + b̂kXki)
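This minimization can be carried out numerically. A sketch using `numpy.linalg.lstsq`, which solves the least squares problem directly; the data are made up and noiseless so the fitted coefficients recover the generating ones exactly:

```python
import numpy as np

# Made-up data generated from Y = 1 + 2*X1 + 3*X2 (no noise, for illustration).
X1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
Y = 1.0 + 2.0 * X1 + 3.0 * X2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Minimize the sum of squared errors; returns the estimates [b0, b1, b2].
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ b        # estimated regression equation
errors = Y - Y_hat   # residuals; zero here because the data are noiseless
```

With real (noisy) data the residuals would not vanish, but the same call still returns the coefficients that make the sum of squared errors as small as possible.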
35. Assumptions of Multiple Regression Model
• There exists a linear relationship between the dependent and independent variables.
• The expected value of the error term, conditional on the independent variables, is zero.
• The error terms are homoskedastic, i.e. the variance of the error terms is constant across all observations.
• The expected value of the product of any two distinct error terms is zero, which implies that the error terms are uncorrelated with each other.
• The error term is normally distributed.
• The independent variables have no exact linear relationships among each other (no perfect multicollinearity).
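The last assumption can be checked via the rank of the design matrix. A sketch with hypothetical data containing a deliberately collinear column:

```python
import numpy as np

# Hypothetical predictors: x2 is an exact linear function of x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones(5), x1, x2])  # intercept column plus predictors

# If the rank is smaller than the number of columns, the assumption is
# violated and the OLS coefficients are not uniquely determined.
rank = np.linalg.matrix_rank(X)  # 2, not 3: perfect multicollinearity
```

Dropping one of the collinear columns (or combining them) restores full rank and makes the coefficients identifiable again.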