Correlation and Regression in R
Dr. K. Sreenivasa Rao, B.Tech, M.Tech, Ph.D, VBIT, Hyderabad
UNIT IV
Correlation and Regression Analysis
(NOS 9001)
Regression Analysis and Modeling –
Introduction:
Regression analysis is a form of predictive modeling technique
which investigates the relationship between a dependent (target)
variable and one or more independent (predictor) variables.
This technique is used for forecasting, time series modeling and
finding the causal effect relationship between the variables. For
example, relationship between rash driving and number of road
accidents by a driver is best studied through regression.
Regression analysis is an important tool for modeling and analyzing
data. Here, we fit a curve / line to the data points in such a
manner that the distances of the data points from the curve or line
are minimized.
Regression analysis estimates the relationship between two or more
variables.
Example:
Suppose we want to estimate the growth in sales of a company based
on current economic conditions. Recent company data indicates that
the growth in sales is around two and a half times the growth in the
economy. Using this insight, we can predict future sales of the
company based on current and past information.
There are multiple benefits of using regression analysis. They are as
follows:
* It indicates the significant relationships between dependent
variable and independent variable.
* It indicates the strength of impact of multiple independent
variables on a dependent variable.
Regression analysis also allows us to compare the effects of
variables measured on different scales, such as the effect of price
changes and the number of promotional activities. These benefits
help market researchers, data analysts, and data scientists evaluate
and select the best set of variables to be used for building
predictive models.
There are various kinds of regression techniques available to make
predictions.
These techniques are mostly driven by three metrics:
1. the number of independent variables,
2. the type of dependent variable, and
3. the shape of the regression line.
Linear Regression:
A simple linear regression model describes the relationship between
two variables x and y and can be expressed by the following equation:
y = α + βx + ϵ
The numbers α and β are called parameters, and ϵ is the error term.
If we choose the parameters α and β in the simple linear regression
model so as to minimize the sum of squares of the error term ϵ, we
will have the so called estimated simple regression equation. It
allows us to compute fitted values of y based on values of x.
In R we use the lm() function to do simple regression modeling.
We apply the simple linear regression model to the dataset cars. The
cars dataset has two variables (attributes), speed and dist, and 50
observations.
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> attach(cars)
By using the attach( ) function the database is attached to the R
search path. This means that the database is searched by R when
evaluating a variable, so objects in the database can be accessed by
simply giving their names.
> speed
[1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16
[28] 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
> plot(cars)
> plot(dist,speed)
The plot() function produces a scatterplot when given two numeric
variables.
The first variable listed is plotted on the horizontal axis.
Now apply the regression analysis on the dataset using lm( )
function.
> speed.lm=lm(speed ~ dist, data = cars)
The lm() function fits a model that describes the variable speed in
terms of the variable dist, and we save the linear regression model
in a new variable, speed.lm. In this call, the y (dependent) variable
is speed and the x (independent) variable is dist.
We get the intercept “C” and the slope “m” of the equation –
Y=mX+C
> speed.lm
Call:
lm(formula = speed ~ dist, data = cars)
Coefficients:
(Intercept) dist
8.2839 0.1656
> abline(speed.lm)
This function adds one or more straight lines to the current plot;
here it draws the fitted regression line.
> plot(speed.lm)
Applied to a fitted model, the plot() function displays four
diagnostic charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location,
and Residuals vs. Leverage.
What is a quantile in statistics?
In statistics and the theory of probability, quantiles are cutpoints
dividing the range of a probability distribution into contiguous
intervals with equal probabilities, or dividing the observations in a
sample in the same way. There is one less quantile than the number of
groups created.
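As a small illustration (not part of the original slides), R's quantile() function computes these cutpoints; here we use the cars dataset introduced above:

```r
# quartiles split the sample into 4 equal-probability groups,
# so there are 3 interior cutpoints (the 25%, 50%, and 75% quantiles)
q <- quantile(cars$speed)   # default probs: 0%, 25%, 50%, 75%, 100%
print(q)

# deciles: 9 interior cutpoints create 10 groups
deciles <- quantile(cars$speed, probs = seq(0.1, 0.9, by = 0.1))
length(deciles)   # 9, one less than the number of groups
```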
The residual data of the simple linear regression model is the
difference between the observed data of the dependent
variable y and the fitted values ŷ.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)
We now plot the residual against the observed values of the
variable waiting.
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0) # the horizon
Residual: The difference between the predicted value (based on the
regression equation) and the actual, observed value.
Outlier: In linear regression, an outlier is an observation with a
large residual. In other words, it is an observation whose
dependent-variable value is unusual given its value on the
predictor variables. An outlier may indicate a sample peculiarity, a
data entry error, or some other problem.
Leverage: An observation with an extreme value on a predictor
variable is a point with high leverage. Leverage is a measure of how
far an independent variable deviates from its mean. High leverage
points can have a great amount of effect on the estimate of
regression coefficients.
Influence: An observation is said to be influential if removing the
observation substantially changes the estimate of the regression
coefficients. Influence can be thought of as the product of leverage
and outlierness.
Cook's distance (or Cook's D): A measure that combines the
information of leverage and residual of the observation.
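These diagnostics are all available directly in R; a sketch using the speed.lm model from the cars example above (the 4/n cutoff is a common rule of thumb, not from the slides):

```r
speed.lm <- lm(speed ~ dist, data = cars)

lev <- hatvalues(speed.lm)       # leverage of each observation
std <- rstandard(speed.lm)       # standardized residuals (outlier screening)
cd  <- cooks.distance(speed.lm)  # combines leverage and residual information

# flag observations with Cook's distance above 4/n
n <- nrow(cars)
which(cd > 4 / n)
```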
Estimated simple regression equation:
We now apply the above simple linear regression model to estimate
the speed when the distance covered is 80.
First, extract the parameters of the estimated regression equation
with the coefficients function.
> coeffs = coefficients(speed.lm)
> coeffs
(Intercept) dist
8.2839056 0.1655676
Forecasting/Prediction:
We now predict the speed using the estimated regression equation.
> newdist = 80
> pred.speed = coeffs[1] + coeffs[2]*newdist
> pred.speed
(Intercept)
21.52931
(The "(Intercept)" label is carried over from coeffs[1]; the value is
the predicted speed.)
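The built-in predict() function gives the same fitted value without the manual arithmetic; a sketch:

```r
speed.lm <- lm(speed ~ dist, data = cars)

# newdata must be a data frame whose column name matches the predictor
predict(speed.lm, newdata = data.frame(dist = 80))
#        1
# 21.52931
```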
To create a summary of the fitted model:
> summary (speed.lm)
Call:
lm(formula = speed ~ dist, data = cars)
Residuals:
Min 1Q Median 3Q Max
-7.5293 -2.1550 0.3615 2.4377 6.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
dist 0.16557 0.01749 9.464 1.49e-12 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
OLS Regression:
Ordinary least squares (OLS), or linear least squares, is a method for
estimating the unknown parameters in a linear regression model,
with the goal of minimizing the differences between the observed
responses in a dataset and the responses predicted by the linear
approximation of the data.
This is applied in both simple linear and multiple regression where
the common assumptions are
(1) The model is linear in the coefficients of the predictor with an
additive random
error term
(2) The random error terms are
* normally distributed with 0 mean and
* a variance that doesn't change as the values of the predictor
covariates change.
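For the simple (one-predictor) case, the least-squares estimates have a closed form: the slope is the sample covariance of x and y divided by the variance of x, and the intercept follows from the sample means. A sketch checking this against lm():

```r
x <- cars$dist
y <- cars$speed

beta  <- cov(x, y) / var(x)        # OLS slope
alpha <- mean(y) - beta * mean(x)  # OLS intercept

round(c(alpha, beta), 4)           # same values lm() reports
```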
Correlation:
Correlation is a statistical measure that indicates the extent to
which two or more variables fluctuate together. It can show whether
and how strongly pairs of variables are related, measuring the
association between variables. Correlation may be positive or
negative, with the coefficient ranging between +1 and -1.
For example, height and weight are related; taller people tend to be
heavier than shorter people. A positive correlation indicates the
extent to which those variables increase or decrease in parallel; a
negative correlation indicates the extent to which one variable
increases as the other decreases.
When the fluctuation of one variable reliably predicts a similar
fluctuation in another variable, there’s often a tendency to think
that means that the change in one causes the change in the other.
However, correlation does not imply causation.
There may be, for example, an unknown factor that influences both
variables similarly.
An intelligent correlation analysis can lead to a greater
understanding of your data.
Correlation in R:
We use the cor( ) function to produce correlations.
A simplified format is cor(x, use=, method=), where:
* x: a matrix or data frame.
* use: specifies the handling of missing data. Options are all.obs
(assumes no missing data; missing data will produce an error),
complete.obs (listwise deletion), and pairwise.complete.obs
(pairwise deletion).
* method: specifies the type of correlation. Options are pearson,
spearman, or kendall.
> cor(cars)
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000
> cor(cars, use="complete.obs", method="kendall")
speed dist
speed 1.0000000 0.6689901
dist 0.6689901 1.0000000
> cor(cars, use="complete.obs", method="pearson")
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000
Correlation Coefficient:
The correlation coefficient of two variables in a data sample is their
covariance divided by the product of their individual standard
deviations. It is a normalized measurement of how the two are
linearly related.
Formally, the sample correlation coefficient is defined by
r = sxy / (sx · sy)
where sx and sy are the sample standard deviations, and sxy is the
sample covariance.
Similarly, the population correlation coefficient is defined by
ρ = σxy / (σx · σy)
where σx and σy are the population standard deviations, and σxy is
the population covariance.
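The sample formula translates directly into R code; a small sketch verifying it against the built-in cor() function using the faithful dataset:

```r
x <- faithful$eruptions
y <- faithful$waiting

# sample covariance divided by the product of the sample standard deviations
r <- cov(x, y) / (sd(x) * sd(y))

all.equal(r, cor(x, y))   # TRUE
```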
If the correlation coefficient is close to 1, the variables are
positively linearly related and the scatter plot falls almost along a
straight line with positive slope. If it is close to -1, the
variables are negatively linearly related and the scatter plot falls
almost along a straight line with negative slope. A value close to
zero indicates a weak linear relationship between the variables.
* r : correlation coefficient
* +1 : Perfectly positive
* -1 : Perfectly negative
* 0 – 0.2 : No or very weak association
* 0.2 – 0.4 : Weak association
* 0.4 – 0.6 : Moderate association
* 0.6 – 0.8 : Strong association
* 0.8 – 1 : Very strong to perfect association
Covariance:
Covariance provides a measure of the strength of the correlation
between two or more sets of random variates. Correlation is defined
in terms of the variance of x, the variance of y, and the covariance
of x and y (the way the two vary together; the way they co-vary) on
the assumption that both variables are normally distributed.
Covariance in R:
We apply the cov function to compute the covariance of eruptions
and waiting in faithful dataset
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
ANOVA:
Analysis of Variance (ANOVA) is a commonly used statistical
technique for investigating data by comparing the means of subsets
of the data. The base case is the one-way ANOVA, an extension of the
two-sample t-test for independent groups to situations where more
than two groups are being compared.
In one-way ANOVA the data is sub-divided into groups based on a
single classification factor and the standard terminology used to
describe the set of factor levels is treatment even though this might
not always have meaning for the particular application. There is
variation in the measurements taken on the individual components
of the data set and ANOVA investigates whether this variation can
be explained by the grouping introduced by the classification factor.
To investigate these differences we fit the one-way ANOVA model
using the lm function and look at the parameter estimates and
standard errors for the treatment effects.
> anova(speed.lm)
Analysis of Variance Table
Response: speed
Df Sum Sq Mean Sq F value Pr(>F)
dist 1 891.98 891.98 89.567 1.49e-12 ***
Residuals 48 478.02 9.96
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This table confirms the significance of the dist term that was
highlighted in the model summary. The function confint is used to
calculate confidence intervals for the model parameters; by default
these are 95% confidence intervals.
> confint(speed.lm)
2.5 % 97.5 %
(Intercept) 6.5258378 10.0419735
dist 0.1303926 0.2007426
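These intervals can be reproduced from the coefficient table as estimate ± t-quantile × standard error; a sketch:

```r
speed.lm <- lm(speed ~ dist, data = cars)
est <- summary(speed.lm)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)

tcrit <- qt(0.975, df.residual(speed.lm))        # two-sided 95% t quantile
lower <- est[, "Estimate"] - tcrit * est[, "Std. Error"]
upper <- est[, "Estimate"] + tcrit * est[, "Std. Error"]

cbind(lower, upper)   # matches confint(speed.lm)
```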
Heteroscedasticity:
Heteroscedasticity (also spelled heteroskedasticity) refers to the
circumstance in which the variability of a variable is unequal across
the range of values of a second variable that predicts it.
A scatterplot of these variables will often create a cone-like shape,
as the scatter (or variability) of the dependent variable (DV) widens
or narrows as the value of the independent variable (IV) increases.
The inverse of heteroscedasticity is homoscedasticity, which
indicates that a DV's variability is equal across values of an IV.
Hetero (different or unequal) is the opposite of Homo (same or
equal).
Skedastic means spread or scatter.
Homoskedasticity = equal spread.
Heteroskedasticity = unequal spread.
Detecting Heteroskedasticity
There are two ways in general.
The first is the informal way which is done through graphs and
therefore we call it the graphical method.
The second is through formal tests for heteroskedasticity, like the
following ones:
1. The Breusch-Pagan LM Test
2. The Glejser LM Test
3. The Harvey-Godfrey LM Test
4. The Park LM Test
5. The Goldfeld-Quandt Test
6. White's Test
Heteroscedasticity test in R:
bptest() performs the Breusch-Pagan test to formally check for the
presence of heteroscedasticity. To use bptest, you have to load the
lmtest package.
> install.packages("lmtest")
> library(lmtest)
> bptest(speed.lm)
studentized Breusch-Pagan test
data: speed.lm
BP = 0.71522, df = 1, p-value = 0.3977
If the test is positive (low p value), you should see if any
transformation of the dependent variable helps you eliminate
heteroscedasticity.
Autocorrelation:
Autocorrelation, also known as serial correlation, is the
cross-correlation of a signal with itself at different points in
time. Informally, it is the similarity between observations as a
function of the time lag between them.
It is a mathematical tool for finding repeating patterns, such as the
presence of a periodic signal obscured by noise, or identifying the
missing fundamental frequency in a signal implied by its harmonic
frequencies. It is often used in signal processing for analyzing
functions or series of values, such as time domain signals.
Autocorrelation is a mathematical representation of the degree of
similarity between a given time series and a lagged version of itself
over successive time intervals.
In statistics, the autocorrelation of a random process is
the correlation between values of the process at different times, as a
function of the two times or of the time lag. Let X be a stochastic
process, and t be any point in time (t may be an integer for a
discrete-time process or a real number for a continuous-time
process). Then Xt is the value (or realization) produced by a given
run of the process at time t. Suppose that the process has mean μt
and variance σt² at time t, for each t. Then the definition of the
autocorrelation between times s and t is
R(s, t) = E[(Xt − μt)(Xs − μs)] / (σt σs)
where E is the expected value operator. Note that this expression is
not well defined for all time series or processes, because the mean
may not exist, or the variance may be zero (for a constant process)
or infinite (for processes whose distribution lacks well-behaved
moments, such as certain types of power law). If the function R is
well defined, its value must lie in the range [−1, 1], with 1
indicating perfect correlation and −1 indicating perfect
anti-correlation.
[Figure: above, a plot of a series of 100 random numbers concealing a
sine function; below, the sine function revealed in a correlogram
produced by autocorrelation.]
[Figure: visual comparison of convolution, cross-correlation, and
autocorrelation.]
The function acf() in R computes estimates of the autocovariance or
autocorrelation function.
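As a sketch, we can hide a periodic signal in noise and let acf() estimate the correlogram (the period, series length, and noise level here are illustrative choices, not from the slides):

```r
set.seed(42)
t <- 1:200
x <- sin(2 * pi * t / 20) + rnorm(200)   # period-20 sine buried in noise

a <- acf(x, lag.max = 40, plot = FALSE)  # autocorrelation estimates by lag
a$acf[1]    # lag-0 autocorrelation is always exactly 1
```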
Tests:
The traditional test for the presence of first-order autocorrelation
is the Durbin–Watson statistic or, if the explanatory variables
include a lagged dependent variable, Durbin's h statistic. The
Durbin–Watson statistic can, however, be linearly mapped to the
Pearson correlation between values and their lags.
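The Durbin–Watson statistic itself is simple to compute from the residuals (lmtest's dwtest() also reports it together with a p-value); a sketch for the speed.lm model:

```r
speed.lm <- lm(speed ~ dist, data = cars)
e <- resid(speed.lm)

# DW = sum of squared successive differences / sum of squared residuals;
# values near 2 indicate little first-order autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)
dw
```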
A more flexible test, covering autocorrelation of higher orders and
applicable whether or not the regressors include lags of the
dependent variable, is the Breusch–Godfrey test. This involves an
auxiliary regression, wherein the residuals obtained from estimating
the model of interest are regressed on (a) the original regressors and
(b) k lags of the residuals, where k is the order of the test. The
simplest version of the test statistic from this auxiliary regression
is TR², where T is the sample size and R² is the coefficient of
determination. Under the null hypothesis of no autocorrelation, this
statistic is asymptotically distributed as χ² with k degrees of
freedom.
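A sketch of this auxiliary-regression recipe for the speed.lm model with k = 1 (lmtest's bgtest() automates this; setting the pre-sample lagged residual to 0 is a common convention, assumed here):

```r
speed.lm <- lm(speed ~ dist, data = cars)
e <- resid(speed.lm)

# regress the residuals on the original regressor and 1 lag of the residuals
elag <- c(0, head(e, -1))          # lagged residuals, pre-sample value 0
aux  <- lm(e ~ cars$dist + elag)

TR2  <- length(e) * summary(aux)$r.squared      # test statistic
pval <- pchisq(TR2, df = 1, lower.tail = FALSE) # chi-squared p-value
c(statistic = TR2, p.value = pval)
```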
Introduction to Multiple Regression:
Multiple regression is a flexible method of data analysis that may be
appropriate whenever a quantitative variable (the dependent
variable) is to be examined in relationship to any other factors
(expressed as independent or predictor variables).
Relationships may be nonlinear, independent variables may be
quantitative or qualitative, and one can examine the effects of a
single variable or multiple variables with or without the effects of
other variables taken into account.
Many practical questions involve the relationship between a
dependent variable of interest (call it Y) and a set of k independent
variables or potential predictor
variables (call them X1, X2, X3,..., Xk), where the scores on all
variables are measured for N cases. For example, you might be
interested in predicting performance on a job (Y) using information
on years of experience (X1), performance in a training program (X2),
and performance on an aptitude test (X3).
A multiple regression equation for predicting Y can be expressed as
follows:
Y' = A + B1X1 + B2X2 + ... + BkXk
To apply the equation, each Xj score for an individual case is
multiplied by the corresponding Bj value, the products are added
together, and the constant A is added to the sum. The result is Y',
the predicted Y value for the case.
Multiple Regression in R:
A sample of the dataset:
YEAR ROLL UNEM HGRAD INC
1 1 5501 8.1 9552 1923
23 23 15107 10.1 17813 2983
24 24 14831 7.5 17304 3069
25 25 15081 8.8 16756 3151
26 26 15127 9.1 16749 3127
27 27 15856 8.8 16925 3179
28 28 15938 7.8 17231 3207
29 29 16081 7.0 16816 3345
> #read data into variable
> datavar <- read.csv("dataset_enrollmentForecast.csv")
> #attach data variable
> attach(datavar)
> #two predictor model
> #create a linear model using lm(FORMULA, DATAVAR)
> #predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD)
> twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)
> #display model
> twoPredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = datavar)
Coefficients:
(Intercept) UNEM HGRAD
-8255.7511 698.2681 0.9423
> #what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and a spring high school graduating class (HGRAD) of 100,000?
> -8255.8 + 698.2 * 9 + 0.9 * 100000
[1] 88028
> #the predicted fall enrollment, given a 9% unemployment rate and a 100,000-student spring high school graduating class, is 88,028 students.
> #three predictor model
> #create a linear model using lm(FORMULA, DATAVAR)
> #predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC)
> threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)
> #display model
> threePredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = datavar)
Coefficients:
(Intercept) UNEM HGRAD INC
-9153.2545 450.1245 0.4065 4.2749
Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in
which two or more predictor variables in a multiple regression
model are highly correlated, meaning that one can be linearly
predicted from the others with a substantial degree of accuracy. In
this situation the coefficient estimates of the multiple regression
may change erratically in response to small changes in the model or
the data. Multicollinearity does not reduce the predictive power or
reliability of the model as a whole, at least within the sample data
set; it only affects calculations regarding individual predictors. That
is, a multiple regression model with correlated predictors can
indicate how well the entire bundle of predictors predicts the
outcome variable, but it may not give valid results about any
individual predictor, or about which predictors are redundant with
respect to others.
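A common way to quantify this is the variance inflation factor, VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the remaining predictors. A sketch on simulated data (the variables and the threshold of 10 are illustrative, not from the source):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)   # x2 is nearly a copy of x1
y  <- 1 + x1 + rnorm(100)

# regress one predictor on the other; a high R^2 signals collinearity
r2  <- summary(lm(x2 ~ x1))$r.squared
vif <- 1 / (1 - r2)
vif   # enormous here; values above 10 are usually considered problematic
```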
Key Assumptions of OLS:
Introduction
Linear regression models find several uses in real-life problems. For example, a
multi-national corporation wanting to identify factors that can affect the sales
of its product can run a linear regression to find out which factors are
important. In econometrics, Ordinary Least Squares (OLS) method is widely
used to estimate the parameter of a linear regression model. OLS estimators
minimize the sum of the squared errors (a difference between observed values
and predicted values). While OLS is computationally feasible and can be easily
used while doing any econometrics test, it is important to know the underlying
assumptions of OLS regression. This is because a lack of knowledge of OLS
assumptions would result in its misuse and give incorrect results for the
econometrics test completed. The importance of OLS assumptions cannot be
overemphasized. The next section describes the assumptions of OLS
regression.
Assumptions of OLS Regression
The necessary OLS assumptions, which are used to derive the OLS estimators
in linear regression models, are discussed below.
A1. The linear regression model is “linear in parameters.”
A2. There is a random sampling of observations.
A3. The conditional mean should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: there is homoscedasticity and no autocorrelation.
A6. Optional assumption: error terms should be normally distributed.
OLS Assumption 1: The linear regression model is “linear in parameters.”
When the dependent variable (Y) is a linear function of the independent
variables (X's) and the error term, the regression is linear in parameters
and not necessarily linear in the X's. For example, consider the following:
a) Y = β0 + β1X1 + β2X2 + ε
b) Y = β0 + β1X1² + β2X2 + ε
c) Y = β0 + β1²X1 + β2X2 + ε
In the above three examples, OLS assumption 1 is satisfied for a) and b).
For c), OLS assumption 1 is not satisfied because the model is not linear
in the parameter β1.
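Model b) is nonlinear in X1 but linear in the parameters, so lm() fits it directly; the I() wrapper treats the squared term as an ordinary regressor. A sketch on simulated data (the coefficients 2, 3, and −1 are arbitrary illustrative choices):

```r
set.seed(7)
X1 <- runif(200)
X2 <- runif(200)
Y  <- 2 + 3 * X1^2 - 1 * X2 + rnorm(200, sd = 0.1)

# linear in parameters, so OLS applies even though the model is nonlinear in X1
fit <- lm(Y ~ I(X1^2) + X2)
round(coef(fit), 2)   # estimates should be near 2, 3, and -1
```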
OLS Assumption 2: There is a random sampling of observations
This assumption of OLS regression says that:
• The sample taken for the linear regression model must be drawn randomly
from the population. For example, if you have to run a regression model to
study the factors that impact the scores of students in the final exam, then you
must select students randomly from the university during your data collection
process, rather than adopting a convenient sampling procedure.
• The number of observations taken in the sample for making the linear
regression model should be greater than the number of parameters to be
estimated. This makes sense mathematically too. If a number of parameters to
be estimated (unknowns) are more than the number of observations, then
estimation is not possible. If a number of parameters to be estimated
(unknowns) equal the number of observations, then OLS is not required. You
can simply use algebra.
• The X's should be fixed (i.e., independent variables should impact
dependent variables). It should not be the case that dependent variables
impact independent variables. This is because regression models study a
causal relationship, not a mere correlation between the two variables.
For example, if you run the regression with inflation as your dependent
variable and unemployment as the independent variable, the OLS
estimators are likely to be incorrect because with inflation and unemployment,
we expect correlation rather than a causal relationship.
• The error terms are random. This makes the dependent variable random.
OLS Assumption 3: The conditional mean should be zero.
The expected value of the error term of the OLS regression should be zero
given the values of the independent variables.
Mathematically, E(ε∣X) = 0. This is sometimes written simply as E(ε) = 0.
In other words, the distribution of error terms has zero mean and doesn't
depend on the independent variables X's. Thus, there must be no
relationship between the X's and the error term.
OLS Assumption 4: There is no multi-collinearity (or perfect collinearity).
In a simple linear regression model, there is only one independent variable and
hence, by default, this assumption will hold true. However, in the case of
multiple linear regression models, there are more than one independent
variable. The OLS assumption of no multi-collinearity says that there should be
no linear relationship between the independent variables. For example,
suppose you spend your 24 hours in a day on three things – sleeping, studying,
or playing. Now, if you run a regression with dependent variable as exam
score/performance and independent variables as time spent sleeping, time
spent studying, and time spent playing, then this assumption will not hold.
This is because there is perfect collinearity between the three independent
variables.
Time spent sleeping = 24 – Time spent studying – Time spent playing.
In such a situation, it is better to drop one of the three independent variables
from the linear regression model. If the relationship (correlation) between
independent variables is strong (but not exactly perfect), it still causes
problems in OLS estimators. Hence, this OLS assumption says that you should
select independent variables that are not correlated with each other.
An important implication of this assumption of OLS regression is that there
should be sufficient variation in the X's: the more variability in the X's,
the better the OLS estimates are at determining the impact of the X's on Y.
OLS Assumption 5: Spherical errors: There is homoscedasticity and no
autocorrelation.
According to this OLS assumption, the error terms in the regression should all
have the same variance.
Mathematically, Var(ε∣X) = σ².
If this variance is not constant (i.e. dependent on X’s), then the linear
regression model has heteroscedastic errors and likely to give incorrect
estimates.
This OLS assumption of no autocorrelation says that the error terms of
different observations should not be correlated with each other.
Mathematically, Cov(εi, εj ∣ X) = 0 for i ≠ j.
For example, when we have time series data (e.g. yearly data of
unemployment), then the regression is likely to suffer from autocorrelation
because unemployment next year will certainly be dependent on
unemployment this year. Hence, error terms in different observations will
surely be correlated with each other.
In simple terms, this OLS assumption means that the error terms should be
IID (Independent and Identically Distributed).
[Figure: homoscedastic vs. heteroscedastic errors. Image source: Laerd Statistics]
The above diagram shows the difference between Homoscedasticity and
Heteroscedasticity. The variance of errors is constant in case of homoscedasticity
while it’s not the case if errors are heteroscedastic.
OLS Assumption 6: Error terms should be normally distributed.
This assumption states that the errors are normally distributed, conditional
upon the independent variables. This OLS assumption is not required for the
validity of OLS method; however, it becomes important when one needs to
define some additional finite-sample properties. Note that only the error terms
need to be normally distributed. The dependent variable Y need not be
normally distributed.
The Use of OLS Assumptions
OLS assumptions are extremely important. If the OLS assumptions 1 to 5 hold,
then according to Gauss-Markov Theorem, OLS estimator is Best Linear
Unbiased Estimator (BLUE). These are desirable properties of OLS estimators
and require separate discussion in detail. However, the focus below is on
the importance of the OLS assumptions: what happens when they fail, and
how you can look out for potential errors when the assumptions do not
hold.
1. The Assumption of Linearity (OLS Assumption 1) – If you fit a linear model to
a data that is non-linearly related, the model will be incorrect and hence
unreliable. When you use the model for extrapolation, you are likely to get
erroneous results. Hence, you should always plot a graph of observed vs.
predicted values. If this graph is symmetrically distributed along the
45-degree line, then you can be sure that the linearity assumption holds.
If the linearity assumption doesn't hold, then you need to change the
functional form of the regression,
which can be done by taking non-linear transformations of independent
variables (i.e. you can take log X instead of X as your independent variable)
and then check for linearity.
2. The Assumption of Homoscedasticity (OLS Assumption 5) – If errors are
heteroscedastic (i.e. OLS assumption is violated), then it will be difficult to trust
the standard errors of the OLS estimates. Hence, the confidence intervals will
be either too narrow or too wide. Also, violation of this assumption has a
tendency to give too much weight on some portion (subsection) of the data.
Hence, it is important to fix this if error variances are not constant. You can
easily check if error variances are constant or not. Examine the plot
of residuals vs. predicted values, or residuals vs. time (for time series models).
Typically, if the data set is large, then errors are more or less homoscedastic. If
your data set is small, check for this assumption.
3. The Assumption of Independence/No Autocorrelation (OLS Assumption 5) –
As discussed previously, this assumption is most likely to be violated in
time series regression models; for cross-sectional data, intuition suggests
there is little need to investigate it. However, you can still check for
autocorrelation by viewing the residual time series plot. If
autocorrelation is present in the model, you can
try taking lags of independent variables to correct for the trend component. If
you do not correct for autocorrelation, then OLS estimates won’t be BLUE, and
they won’t be reliable enough.
4. The Assumption of Normality of Errors (OLS Assumption 6) – If error terms
are not normal, then the standard errors of OLS estimates won’t be reliable,
which means the confidence intervals would be too wide or narrow. Also, OLS
estimators won’t have the desirable BLUE property. A normal probability plot or
a normal quantile plot can be used to check if the error terms are normally
distributed or not. A bow-shaped deviated pattern in these plots reveals that
the errors are not normally distributed. Sometimes errors are not normal
because the linearity assumption is not holding. So, it is worthwhile to check
for linearity assumption again if this assumption fails.
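In R, the normal quantile plot and the Shapiro–Wilk test (a common formal normality check, not mentioned in the text above) can both be applied to the residuals; a sketch for the speed.lm model:

```r
speed.lm <- lm(speed ~ dist, data = cars)
e <- resid(speed.lm)

qqnorm(e)   # points near a straight line suggest normal errors
qqline(e)

# Shapiro-Wilk test: a small p-value suggests non-normal errors
shapiro.test(e)
```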
5. Assumption of No Multicollinearity (OLS assumption 4) – You can check for
multicollinearity by making a correlation matrix (though there are other
complex ways of checking them like Variance Inflation Factor, etc.). Almost a
sure indication of the presence of multi-collinearity is when you get
opposite (unexpected) signs for your regression coefficients (i.e., if you
expect that the independent variable positively impacts your dependent
variable but you get a negative sign on the coefficient from the regression
model). In that case it is highly likely that the regression suffers from
multi-collinearity.
important intuitively, then dropping that variable or any of the correlated
variables can fix the problem.
6. OLS assumptions 1, 2, and 4 are necessary for the setup of the OLS problem
and its derivation. Random sampling, observations being greater than the
number of parameters, and regression being linear in parameters are all part of
the setup of OLS regression. The assumption of no perfect collinearity allows
one to solve for first order conditions in the derivation of OLS estimates.
Conclusion
Linear regression models are extremely useful and have a wide range of
applications. When you use them, make sure that all the assumptions of OLS
regression are satisfied while doing an econometrics test so that your
efforts don't go to waste. These assumptions are extremely important, and
one cannot just neglect them. Having said that, many times these OLS
assumptions will be
violated. However, that should not stop you from conducting your econometric
test. Rather, when the assumption is violated, applying the correct fixes and
then running the linear regression model should be the way out for a reliable
econometric test.