2. Bivariate Data: Correlation and Regression
Both correlation and regression are used to study
relationships in bivariate data
Correlation is a measure of the degree of
relatedness between two variables, e.g. whether two
stocks in an industry rise and fall in a related manner,
or how closely sales are related to advertising
expenditure
Correlation is bi-directional: the correlation between
X and Y is the same as that between Y and X
3. Regression seeks to understand the nature of the
relationship between the variables; the objective is
to predict the dependent variable for given values of
the independent variables
Regression is uni-directional: regressing Y on X is not
the same as regressing X on Y
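The contrast between bi-directional correlation and uni-directional regression can be illustrated with a minimal Python sketch. The data values and the helper names (`centered_sums`, `pearson_r`, `slope`) are illustrative assumptions, not part of the slides; the slope uses the standard least-squares formula.

```python
import math

# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def centered_sums(u, v):
    """Return (S_uv, S_uu, S_vv): sums of products of deviations from the means."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv, s_uu, s_vv


def pearson_r(u, v):
    """Pearson correlation; swapping u and v gives the same value."""
    s_uv, s_uu, s_vv = centered_sums(u, v)
    return s_uv / math.sqrt(s_uu * s_vv)


def slope(predictor, response):
    """Least-squares slope when `response` is regressed on `predictor`."""
    s_uv, s_uu, _ = centered_sums(predictor, response)
    return s_uv / s_uu


r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)      # same value: correlation is symmetric
b_y_on_x = slope(x, y)      # slope of Y regressed on X
b_x_on_y = slope(y, x)      # slope of X regressed on Y: a different line
```

With these numbers the two correlations coincide, while the slope of Y on X is 0.6 and the slope of X on Y is 1.0, which is not the reciprocal of 0.6.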
4. Coefficient of Correlation
Pearson’s product-moment correlation coefficient r is a
measure of the linear correlation of two continuous variables
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$
The value of r lies between -1 and +1; r = 0 implies no
linear correlation, while r = +1 implies perfect positive
correlation and r = -1 implies perfect negative correlation
The sign of r indicates the direction of the relationship
between X and Y, while the absolute value of r indicates
the strength of the relationship
However, r measures linear relationships only; a strong
curvilinear relationship can still give r close to 0
r does not have units
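The properties listed above can be checked with a short Python sketch of the product-moment formula. The data and the function name `pearson_r` are assumptions made for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)

# Changing the units of X (multiplying by 100) leaves r unchanged.
r_rescaled = pearson_r([100 * xi for xi in x], y)

# A perfect but non-linear (quadratic) relationship can still give r = 0.
x_sym = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(x_sym, [xi ** 2 for xi in x_sym])
```

Here `r` is about 0.77, rescaling X does not change it, and the symmetric quadratic data give a correlation of zero even though Y is completely determined by X.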
5. Types of linear regression models
Simple Linear regression:
$$Y = a + bX$$
Multiple Linear regression:
$$Y = a + bX_1 + cX_2 + dX_3$$
Y is called the dependent/explained variable, while
the X(s) are called independent/explanatory
variable(s); the assumed direction of causality is from the X(s) to Y
a, b, c, etc. are called regression coefficients
6. The Simple Linear Regression
Studying the simple linear regression model is
important because:
It helps us to understand visually the meaning of
various terms in regression analysis
Many relationships, though curvilinear in the long
term, are linear in the short term
Linear relationships exist between transformed
variables ($Y = a + bX + cX^2$ is a quadratic
relationship between Y and X, but a linear
relationship between Y, X and $X^2$)
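The point about transformed variables can be sketched in Python: treating X and X squared as two separate explanatory columns turns the quadratic fit into an ordinary linear least-squares problem. The helper names (`solve_linear_system`, `fit_least_squares`) and the data are hypothetical, and solving the normal equations by Gaussian elimination is just one simple way to do it.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def fit_least_squares(columns, y):
    """Least-squares coefficients for y = sum_j beta_j * columns[j]."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from Y = 1 + 2X + 3X^2 (an assumption for illustration).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1 + 2 * v + 3 * v ** 2 for v in x]

# Columns: intercept, X, and the transformed variable X^2.
columns = [[1.0] * len(x), x, [v ** 2 for v in x]]
a, b, c = fit_least_squares(columns, y)
```

Because the data were generated without noise, the fitted `a`, `b`, `c` recover the generating coefficients up to rounding.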
7. The Scatter Diagram
The scatter diagram answers:
Is there a relationship between Y and X? (Is the
independent variable correctly chosen?)
Is it an increasing or decreasing relationship? (Is it
a direct or inverse relationship?)
Is it a linear or curvilinear relationship? (Is it a
simple or a multiple linear regression?)
8. Estimating the coefficients: the concept of error in regression
As it is an empirical relationship, not all the Ys will
fall on any line which is fitted to the data
$$Y_i = a + bX_i + e_i$$
$e_i$ is called the error term or the disturbance
term
The slope (b) and the intercept (a) need to be
determined such that the error is minimised
9. Method of Least Squares
The best-fitting line is that which minimises $\sum e_i^2$
By minimising $\sum e_i^2$,
the $e_i$ do not cancel each other
Large errors are penalised
Using calculus, two normal equations are obtained:
$$\sum Y = na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$
which can be solved for b and a
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad a = \bar{Y} - b\bar{X}$$
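The closed-form solution for b and a can be sketched in a few lines of Python. The data and the function name `fit_simple_regression` are assumptions for illustration; the last lines check that the fitted coefficients satisfy both normal equations.

```python
def fit_simple_regression(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_simple_regression(x, y)

# Check the two normal equations with the fitted coefficients.
n = len(x)
normal_1 = sum(y) - (n * a + b * sum(x))
normal_2 = (sum(xi * yi for xi, yi in zip(x, y))
            - (a * sum(x) + b * sum(xi * xi for xi in x)))
```

For this data the slope is 0.6 and the intercept is 2.2, and both `normal_1` and `normal_2` are zero up to rounding.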
10. Using the regression equation to
predict Y
Y can be predicted for a given value of X using
the regression relationship
However, the regression equation should be
used for prediction only within the range of X
values used to build the model; outside that range
the fitted relationship may not hold
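One way to respect the range restriction in practice is to have the prediction step refuse X values outside the observed range. The following Python sketch assumes made-up data and a hypothetical helper `predict_within_range`; raising an error is a design choice here, and a warning would serve equally well.

```python
def fit_simple_regression(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def predict_within_range(x_new, a, b, x_observed):
    """Predict Y at x_new, refusing to extrapolate beyond the observed X range."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lowest}, {highest}]"
        )
    return a + b * x_new


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_simple_regression(x, y)
y_at_3_5 = predict_within_range(3.5, a, b, x)   # inside the range 1..5

try:
    predict_within_range(10, a, b, x)           # outside the range
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True
```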
11. Standard Error of Regression
The standard deviation of the error terms around the
regression line is called the standard error of
regression
$$s_e = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}} \quad \text{where } \hat{Y}_i = a + bX_i$$
Since the standard error computed from a sample is
used to estimate the standard error in the population
regression model, the sum of squared errors is divided by
n - 2 rather than n: estimating a and b from the
sample uses up two degrees of freedom
In a multiple linear regression, it is divided by n - k (k
is the number of coefficients including the intercept)
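A minimal Python sketch of the standard error of regression follows. The data and function names are assumptions for illustration; the parameter `k` is the number of estimated coefficients, so `k=2` corresponds to the simple regression case described above.

```python
import math


def fit_simple_regression(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def standard_error(x, y, a, b, k=2):
    """Standard error of regression: sqrt(SSE / (n - k))."""
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - k))


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_simple_regression(x, y)
s_e = standard_error(x, y, a, b)
```

For this data the sum of squared errors is 2.4, so `s_e` is the square root of 2.4 / 3, roughly 0.89, in the same units as Y.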
12. Standard Error of Regression
For ease of calculation, the standard error in a
simple regression can be computed as:
$$s_e = \sqrt{\frac{\sum Y_i^2 - a \sum Y_i - b \sum X_i Y_i}{n - 2}}$$
The standard error is in the same units as Y
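The computational form follows from the two normal equations of the least-squares method. A short derivation, writing $e_i = Y_i - a - bX_i$ for the residuals of the fitted line (a sketch of the reasoning, not taken from the slides):

```latex
\begin{aligned}
\sum e_i^2 &= \sum e_i \,(Y_i - a - bX_i) \\
           &= \sum e_i Y_i - a \sum e_i - b \sum e_i X_i \\
           &= \sum e_i Y_i
              && \text{since the normal equations give } \sum e_i = 0,\ \sum e_i X_i = 0 \\
           &= \sum (Y_i - a - bX_i)\, Y_i \\
           &= \sum Y_i^2 - a \sum Y_i - b \sum X_i Y_i
\end{aligned}
```

Dividing by n - 2 and taking the square root gives the computational form of the standard error.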
13. Approximate prediction interval
for Y
The standard error is used to construct an
approximate interval for Y, assuming that the
$e_i$ are normally distributed
An approximate 95% interval for $Y_i$ would be
$$\hat{Y}_i \pm 2 s_e$$
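The approximate interval can be sketched in Python as the predicted value plus or minus two standard errors. The data and the function names (`fit_and_standard_error`, `approx_interval`) are hypothetical, and the interval is only the rough rule described above, not an exact prediction interval.

```python
import math


def fit_and_standard_error(x, y):
    """Return (a, b, s_e) for the least-squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(sse / (n - 2))


def approx_interval(x_new, a, b, s_e, width=2.0):
    """Approximate 95% interval: predicted Y plus or minus `width` standard errors."""
    y_hat = a + b * x_new
    return y_hat - width * s_e, y_hat + width * s_e


# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, s_e = fit_and_standard_error(x, y)
lower, upper = approx_interval(3, a, b, s_e)
```

At X = 3 the predicted value is 4.0, so the interval is roughly 2.2 to 5.8 for this data.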
14. Another measure of correlation: the Coefficient of Determination
The coefficient of determination $r^2$ is defined as
$$r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{a \sum Y_i + b \sum X_i Y_i - n\bar{Y}^2}{\sum Y_i^2 - n\bar{Y}^2}$$
$r^2$ denotes the proportion (often quoted as a percentage) of the
variation in Y that is explained by its relationship to X
16. Coefficient of Determination
$r^2$ denotes the strength of a linear relationship
between Y and X
$r^2$ lies between 0 and 1; $r^2 = 0$ implies no linear relationship,
$r^2 = 1$ implies perfect correlation
$r^2$ is often expressed as a percentage; the higher the
percentage, the better the model explains Y; $r^2$ is a
measure of the goodness of fit of the regression
r is simply taken as $\sqrt{r^2}$; the sign of r is the sign of
the slope coefficient
r is just a number (it has no units)
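The relationship between the coefficient of determination, its computational form, and r can be checked with a short Python sketch. The data and variable names are assumptions for illustration only.

```python
import math

# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Definition: explained variation divided by total variation.
ss_explained = sum((a + b * xi - y_bar) ** 2 for xi in x)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r2_definition = ss_explained / ss_total

# Computational form using raw sums.
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_yy = sum(yi * yi for yi in y)
r2_shortcut = (a * sum_y + b * sum_xy - n * y_bar ** 2) / (sum_yy - n * y_bar ** 2)

# r is the square root of r^2, carrying the sign of the slope b.
r = math.copysign(math.sqrt(r2_definition), b)
```

Both forms give 0.6 for this data, meaning 60% of the variation in Y is explained by X, and r is about 0.77 with the positive sign of the slope.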