2. Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used for forecasting, time series modelling, and finding causal relationships between variables.
3. The simple linear regression model

y_i = \beta_1 + \beta_2 x_i + \varepsilon_i

is the two-variable (bivariate) linear model because it relates the two variables x and y.

Table 2.1 Terminology for simple regression
Y    dependent (explained, response, predicted) variable; regressand
X    independent (explanatory, control, predictor) variable; regressor
ε    error term or disturbance
β2   slope coefficient
β1   intercept coefficient (the constant term)
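As a quick illustration of this terminology, here is a minimal sketch that simulates data from the bivariate model. The parameter values, sample size, and noise level are made-up assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(42)

beta1, beta2 = 80.0, 10.0          # hypothetical intercept and slope
n = 40                             # hypothetical sample size
x = rng.uniform(5, 30, size=n)     # regressor X
eps = rng.normal(0, 25, size=n)    # error term / disturbance
y = beta1 + beta2 * x + eps        # regressand Y
```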
4. REGRESSION VERSUS CORRELATION
In correlation analysis, the primary objective is to measure the strength or degree of linear association between two variables.
In regression analysis, the primary objective is to estimate or predict the average value of one variable on the basis of the fixed values of other variables.
5. Least Squares Principle
The method of least squares estimates the parameters β1 and β2 by minimizing the sum of squared differences between the observations and the line in the scatter diagram.
The intercept and slope of the line that best fits the data under the least squares principle are b1 and b2, the least squares estimates of β1 and β2. The fitted line itself is then

\hat{y}_i = b_1 + b_2 x_i

The vertical distances from each point to the fitted line are the least squares residuals, given by

e_i = y_i - \hat{y}_i = y_i - (b_1 + b_2 x_i)

[Figure: the relationship among y_i, e_i, and the fitted regression line]
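A minimal sketch of these estimates, using the usual closed-form least squares solution; the data arrays are made up purely for illustration (the simulated x and y from the earlier sketch would work equally well).

```python
import numpy as np

def least_squares(x, y):
    # b2 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b1 = y_bar - b2 * x_bar
    x, y = np.asarray(x, float), np.asarray(y, float)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b1 = y.mean() - b2 * x.mean()
    return b1, b2

# made-up paired observations, purely to exercise the function
x = np.array([3.7, 6.8, 9.8, 13.2, 17.6, 20.3, 24.9, 28.1])
y = np.array([115.0, 160.2, 178.1, 221.5, 244.9, 283.0, 326.4, 368.2])

b1, b2 = least_squares(x, y)
y_hat = b1 + b2 * x        # fitted line, y_hat_i = b1 + b2 * x_i
resid = y - y_hat          # least squares residuals e_i
```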
7. Elasticity is the measurement of the proportional change of an
economic variable in response to a change in another.
The elasticity of a variable y with respect to another variable x is
\varepsilon = \frac{\text{percentage change in } y}{\text{percentage change in } x} = \frac{\Delta y / y}{\Delta x / x} = \frac{\Delta y}{\Delta x} \times \frac{x}{y}

Evaluated at the sample means of the fitted food expenditure model,

\varepsilon = b_2 \frac{\bar{x}}{\bar{y}} = 10.21 \times \frac{19.60}{283.57} = 0.71

We estimate that a 1% increase in weekly household income will lead, on average, to a 0.71% increase in weekly household expenditure on food.
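The arithmetic is easy to verify directly; this short check just reproduces the slide's numbers (b2 = 10.21, sample means 19.60 and 283.57).

```python
b2 = 10.21                      # estimated slope from the food expenditure model
x_bar, y_bar = 19.60, 283.57    # sample means of income and food expenditure
elasticity = b2 * x_bar / y_bar
print(round(elasticity, 2))     # 0.71
```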
8. Assessing the Fit of Regression Models
Three statistics are used in Ordinary Least Squares (OLS) regression to
evaluate model fit:
R-squared,
the overall F-test,
the Root Mean Square Error (RMSE).
All three are based on two sums of squares: Sum of Squares Total (SST)
and Sum of Squares Error (SSE). SST measures how far the data are
from the mean, and SSE measures how far the data are from the
model’s predicted values. Different combinations of these two values
provide different information about how the regression model compares
to the mean model.
9. Sum of Squares Total (SST) is a measure of the total sample variation in the yi; that is, it measures how spread out the yi are in the sample:

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

Sum of Squares Error (SSE) measures the sample variation of the yi around the fitted values, i.e. the part of the variation the model leaves unexplained:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
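A small sketch of both sums; it reuses the made-up data from the least squares sketch above and recomputes the fitted line inline so it runs on its own.

```python
import numpy as np

# made-up paired observations and their least squares fitted values
x = np.array([3.7, 6.8, 9.8, 13.2, 17.6, 20.3, 24.9, 28.1])
y = np.array([115.0, 160.2, 178.1, 221.5, 244.9, 283.0, 326.4, 368.2])
b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x

sst = np.sum((y - y.mean()) ** 2)   # total variation of y about its mean
sse = np.sum((y - y_hat) ** 2)      # variation of y about the fitted line
```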
10. The R-squared of the regression (the coefficient of determination) is defined as

R^2 = 1 - \frac{SSE}{SST} = \frac{SST - SSE}{SST}

R-squared is the ratio of the explained variation to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The value of R-squared is always between zero and one.
Adjusted R-squared incorporates the model's degrees of freedom: it will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom. Adjusted R-squared should always be used with models with more than one predictor variable; it is interpreted as the proportion of total variance that is explained by the model.
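Continuing the sketch above (it assumes sse, sst, and y are already computed), both statistics fall out directly; k = 1 in simple regression.

```python
n, k = len(y), 1                # number of observations and of predictors
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # penalizes lost d.f.
```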
11. The F-test
An F test is used to test the significance of R. The hypotheses are H0: ρ = 0 and H1: ρ ≠ 0, where ρ represents the population multiple correlation coefficient. The formula for the F test is

F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}

where n is the number of observations (data tuples x1, x2, . . . , y) and k is the number of independent variables. The degrees of freedom are d.f.N. = k and d.f.D. = n - k - 1.
A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set.
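Using the r2, n, and k from the previous sketch, the F statistic and its p-value follow directly; scipy's F distribution is assumed here for the tail probability.

```python
from scipy import stats

F = (r2 / k) / ((1 - r2) / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)   # P(F >= observed) with d.f. k and n-k-1
```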
12. Root Mean Square Error (RMSE) is the square root of the
variance of the residuals.
It indicates the absolute fit of the model to the data: how close the observed data points are to the model's predicted values.
RMSE is a good measure of how accurately the model predicts the
response, and it is the most important criterion for fit if the main
purpose of the model is prediction.
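Continuing the same sketch, RMSE is one line. Dividing SSE by the residual degrees of freedom n - k - 1 (rather than n) matches the residual standard error most regression output reports; that convention is an assumption here.

```python
import numpy as np

rmse = np.sqrt(sse / (n - k - 1))   # square root of the residual variance
```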
13. The p-value is the probability of obtaining a test statistic at least
as extreme as the one that was actually observed, assuming that the
null hypothesis is true. Rejecting the null hypothesis when the p-value is below 0.05 or 0.01 corresponds, respectively, to accepting a 5% or 1% chance of rejecting the null hypothesis when it is true.
The p-value for each term tests the null hypothesis that the
coefficient is equal to zero (no effect).
A low p-value (< 0.05) indicates that you can reject the null
hypothesis. In other words, a predictor that has a low p-value is likely
to be a meaningful addition to your model because changes in the
predictor's value are related to changes in the response variable.
Conversely, a larger (insignificant) p-value suggests that changes
in the predictor are not associated with changes in the response.
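To see these per-coefficient p-values in practice, any OLS routine that reports coefficient t tests will do; this sketch assumes statsmodels and the made-up x and y arrays from the earlier sketches.

```python
import statsmodels.api as sm

X = sm.add_constant(x)    # prepend the intercept column
fit = sm.OLS(y, X).fit()
print(fit.pvalues)        # p-value for each term's H0: coefficient = 0
```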