Dr.K.Sreenivasa Rao B.Tech, M.Tech, Ph.D VBIT, Hyderabad
UNIT IV
Correlation and Regression Analysis
(NOS 9001)
Regression Analysis and Modeling –
Introduction:
Regression analysis is a form of predictive modeling technique
which investigates the relationship between a dependent (target)
and independent variable(s) (predictor).
This technique is used for forecasting, time series modeling and
finding the causal effect relationship between the variables. For
example, relationship between rash driving and number of road
accidents by a driver is best studied through regression.
Regression analysis is an important tool for modeling and analyzing
data. Here, we fit a curve or line to the data points in such a
manner that the sum of the squared distances of the data points
from the curve or line is minimized.
Regression analysis estimates the relationship between two or more
variables.
Example:
Let’s say we want to estimate the growth in sales of a company based
on current economic conditions. We have recent company data
which indicates that the growth in sales is around two and a half
times the growth in the economy. Using this insight, we can predict
future sales of the company based on current and past information.
There are multiple benefits of using regression analysis. They are as
follows:
* It indicates the significant relationships between dependent
variable and independent variable.
* It indicates the strength of impact of multiple independent
variables on a dependent variable.
Regression analysis also allows us to compare the effects of
variables measured on different scales, such as the effect of price
changes and the number of promotional activities. These benefits
help market researchers / data analysts / data scientists to
evaluate and select the best set of variables to be used for
building predictive models.
There are various kinds of regression techniques available to make
predictions.
These techniques are mostly driven by three metrics.
1. Number of independent variables,
2. Type of dependent variables and
3. Shape of regression line
Linear Regression:
A simple linear regression model describes the relationship between
two variables x and y and can be expressed by the following equation:
y = α + βx + ϵ
The numbers α and β are called parameters, and ϵ is the error term.
If we choose the parameters α and β in the simple linear regression
model so as to minimize the sum of squares of the error term ϵ, we
will have the so called estimated simple regression equation. It
allows us to compute fitted values of y based on values of x.
In R we use the lm() function to do simple regression modeling.
Apply the simple linear regression model to the data set cars. The
cars dataset has two variables (attributes), speed and dist, and
contains 50 observations.
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> attach(cars)
By using the attach( ) function the database is attached to the R
search path. This means that the database is searched by R when
evaluating a variable, so objects in the database can be accessed by
simply giving their names.
> speed
[1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16
[28] 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
> plot(cars)
> plot(dist,speed)
The plot() function gives a scatterplot whenever we give two numeric
variables.
The first variable listed will be plotted on the horizontal axis.
Now apply the regression analysis on the dataset using lm( )
function.
> speed.lm=lm(speed ~ dist, data = cars)
The lm() function fits a model that describes the variable speed by
the variable dist, and we save the linear regression model in a new
variable speed.lm. In the above call the y variable (dependent
variable) is speed and the x variable (independent variable) is dist.
We get the intercept “C” and the slope “m” of the equation –
Y=mX+C
> speed.lm
Call:
lm(formula = speed ~ dist, data = cars)
Coefficients:
(Intercept) dist
8.2839 0.1656
> abline(speed.lm)
This function adds one or more straight lines through the current
plot.
> plot(speed.lm)
The plot function displays four charts: Residuals vs. Fitted, Normal
Q-Q, Scale-Location, and Residuals vs. Leverage.
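To see all four diagnostic charts at once (rather than one by one), the plotting window can be split before calling plot(); this is a small sketch using the speed.lm model fitted above.
> par(mfrow = c(2, 2)) #split the plotting window into a 2 x 2 grid
> plot(speed.lm) #all four diagnostic charts on one page
> par(mfrow = c(1, 1)) #restore the default single-plot layout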
What is a quantile in statistics?
In statistics and the theory of probability, quantiles are cutpoints
dividing the range of a probability distribution into contiguous
intervals with equal probabilities, or dividing the observations in a
sample in the same way. There is one less quantile than the
number of groups created.
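For example, R’s quantile() function returns such cutpoints; the sketch below (using the dist variable of the cars dataset used above) computes the quartiles, i.e. the three cutpoints that divide the sample into four equal-probability groups.
> quantile(cars$dist) #default cutpoints at 0%, 25%, 50%, 75% and 100%
> quantile(cars$dist, probs = c(0.25, 0.50, 0.75)) #the three quartiles that create four groups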
The residual data of the simple linear regression model is the
difference between the observed data of the dependent
variable y and the fitted values ŷ.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)
We now plot the residual against the observed values of the
variable waiting.
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0) # the horizon
Residual: The difference between the predicted value (based on the
regression equation) and the actual, observed value.
Outlier: In linear regression, an outlier is an observation with large
residual. In other words, it is an observation whose
dependent-variable value is unusual given its value on the
predictor variables. An outlier may indicate a sample peculiarity or
may indicate a data entry error or other problem.
Leverage: An observation with an extreme value on a predictor
variable is a point with high leverage. Leverage is a measure of how
far an independent variable deviates from its mean. High leverage
points can have a great amount of effect on the estimate of
regression coefficients.
Influence: An observation is said to be influential if removing the
observation substantially changes the estimate of the regression
coefficients. Influence can be thought of as the product of leverage
and outlierness.
Cook's distance (or Cook's D): A measure that combines the
information of leverage and residual of the observation.
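R has helper functions for each of these quantities; a short sketch, assuming the speed.lm model fitted earlier, is:
> res = resid(speed.lm) #residuals of each observation
> lev = hatvalues(speed.lm) #leverage of each observation
> cd = cooks.distance(speed.lm) #Cook's distance combines residual and leverage
> which(cd > 4/nrow(cars)) #a common rule of thumb for flagging influential points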
Estimated simple regression equation:
We now apply the above simple linear regression model to estimate
the speed when the distance covered is 80.
Extract the parameters of the estimated regression equation with
the coefficients function.
> coeffs = coefficients(speed.lm)
> coeffs
(Intercept) dist
8.2839056 0.1655676
Forecasting/Prediction:
We now estimate the speed using the estimated regression equation.
> newdist = 80
> newspeed = coeffs[1] + coeffs[2]*newdist
> newspeed
(Intercept)
21.52931
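The same prediction can also be obtained with R’s predict() function by passing the fitted model and a data frame of new predictor values; this sketch should reproduce the value computed manually above.
> predict(speed.lm, newdata = data.frame(dist = 80))
       1
21.52931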
To create a summary of the fitted model:
> summary (speed.lm)
Call:
lm(formula = speed ~ dist, data = cars)
Residuals:
Min 1Q Median 3Q Max
-7.5293 -2.1550 0.3615 2.4377 6.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
dist 0.16557 0.01749 9.464 1.49e-12 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
OLS Regression:
Ordinary least squares (OLS), or linear least squares, is a method for
estimating the unknown parameters in a linear regression model,
with the goal of minimizing the differences between the observed
responses in a dataset and the responses predicted by the linear
approximation of the data.
This is applied in both simple linear and multiple regression where
the common assumptions are
(1) The model is linear in the coefficients of the predictors, with an
additive random error term.
(2) The random error terms are
* normally distributed with 0 mean, and
* have a variance that doesn't change as the values of the predictor
covariates change.
Correlation:
Correlation is a statistical measure that indicates the extent to
which two or more variables fluctuate together. It can show
whether and how strongly pairs of variables are related, and it
measures the association between variables. Correlation can be
positive or negative, with the coefficient ranging between +1 and -1.
For example, height and weight are related; taller people tend to be
heavier than shorter people. A positive correlation indicates the
extent to which those variables increase or decrease in parallel; a
negative correlation indicates the extent to which one variable
increases as the other decreases.
When the fluctuation of one variable reliably predicts a similar
fluctuation in another variable, there’s often a tendency to think
that means that the change in one causes the change in the other.
However, correlation does not imply causation.
There may be, for example, an unknown factor that influences both
variables similarly.
An intelligent correlation analysis can lead to a greater
understanding of your data.
Correlation in R:
We use the cor( ) function to produce correlations.
A simplified format of cor(x, use=, method= ) where
Option Description
x Matrix or data frame
use Specifies the handling of missing data. Options are
all.obs (assumes no missing data - missing data will
produce an error), complete.obs (listwise deletion), and
pairwise.complete.obs (pairwise deletion)
method Specifies the type of correlation. Options are pearson,
spearman or kendall.
> cor(cars)
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000
> cor(cars, use="complete.obs", method="kendall")
speed dist
speed 1.0000000 0.6689901
dist 0.6689901 1.0000000
> cor(cars, use="complete.obs", method="pearson")
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000
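To test whether an observed correlation is statistically significant, the cor.test() function can be used; a minimal sketch on the same variables (available here because cars was attached earlier) is:
> cor.test(speed, dist, method = "pearson") #reports r together with a t statistic and p-value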
Correlation Coefficient:
The correlation coefficient of two variables in a data sample is their
covariance divided by the product of their individual standard
deviations. It is a normalized measurement of how the two are
linearly related.
Formally, the sample correlation coefficient is defined by the
following formula, where sx and sy are the sample standard
deviations, and sxy is the sample covariance:
r = sxy / (sx · sy)
Similarly, the population correlation coefficient is defined as follows,
where σx and σy are the population standard deviations, and σxy is
the population covariance:
ρ = σxy / (σx · σy)
If the correlation coefficient is close to 1, it indicates that the
variables are positively linearly related and the scatter plot falls
almost along a straight line with positive slope. If it is close to -1,
it indicates that the variables are negatively linearly related and the
scatter plot almost falls along a straight line with negative slope.
And if it is close to zero, it indicates a weak linear relationship
between the variables.
* r : correlation coefficient
* +1 : Perfectly positive
* -1 : Perfectly negative
* 0 – 0.2 : No or very weak association
* 0.2 – 0.4 : Weak association
* 0.4 – 0.6 : Moderate association
* 0.6 – 0.8 : Strong association
* 0.8 – 1 : Very strong to perfect association
Covariance:
Covariance provides a measure of the strength of the correlation
between two or more sets of random variates. Correlation is defined
in terms of the variance of x, the variance of y, and the covariance
of x and y (the way the two vary together; the way they co-vary) on
the assumption that both variables are normally distributed.
Covariance in R:
We apply the cov function to compute the covariance of eruptions
and waiting in faithful dataset
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
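The correlation coefficient discussed above is simply this covariance rescaled by the two standard deviations; a quick check in R on the same variables is:
> cov(duration, waiting) / (sd(duration) * sd(waiting)) #covariance rescaled by the standard deviations
> cor(duration, waiting) #should give the same value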
ANOVA:
Analysis of Variance (ANOVA) is a commonly used statistical
technique for investigating data by comparing the means of subsets
of the data. The base case is the one-way ANOVA, which is an
extension of the two-sample t-test for independent groups, covering
situations where there are more than two groups being compared.
In one-way ANOVA the data is sub-divided into groups based on a
single classification factor and the standard terminology used to
describe the set of factor levels is treatment even though this might
not always have meaning for the particular application. There is
variation in the measurements taken on the individual components
of the data set and ANOVA investigates whether this variation can
be explained by the grouping introduced by the classification factor.
To investigate these differences we fit the one-way ANOVA model
using the lm function and look at the parameter estimates and
standard errors for the treatment effects.
> anova(speed.lm)
Analysis of Variance Table
Response: speed
Df Sum Sq Mean Sq F value Pr(>F)
dist 1 891.98 891.98 89.567 1.49e-12 ***
Residuals 48 478.02 9.96
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This table confirms that the predictor has a significant effect, which
was also highlighted in the model summary. The function confint
is used to calculate confidence intervals on the model parameters;
by default 95% confidence intervals are produced.
> confint(speed.lm)
2.5 % 97.5 %
(Intercept) 6.5258378 10.0419735
dist 0.1303926 0.2007426
Heteroscedasticity:
Heteroscedasticity (also spelled heteroskedasticity) refers to the
circumstance in which the variability of a variable is unequal across
the range of values of a second variable that predicts it.
A scatterplot of these variables will often create a cone-like shape,
as the scatter (or variability) of the dependent variable (DV) widens
or narrows as the value of the independent variable (IV) increases.
The inverse of heteroscedasticity is homoscedasticity, which
indicates that a DV's variability is equal across values of an IV.
Hetero (different or unequal) is the opposite of Homo (same or
equal).
Skedastic means spread or scatter.
Homoskedasticity = equal spread.
Heteroskedasticity = unequal spread.
Detecting Heteroskedasticity
There are two ways in general.
The first is the informal way which is done through graphs and
therefore we call it the graphical method.
The second is through formal tests for heteroskedasticity, like the
following ones:
1. The Breusch-Pagan LM Test
2. The Glejser LM Test
3. The Harvey-Godfrey LM Test
4. The Park LM Test
5. The Goldfeld-Quandt Test
6. White’s Test
Heteroscedasticity test in R:
bptest() performs the Breusch-Pagan test to formally check for the
presence of heteroscedasticity. To use bptest(), you have to load the
lmtest library.
> install.packages("lmtest")
> library(lmtest)
> bptest(speed.lm)
studentized Breusch-Pagan test
data: speed.lm
BP = 0.71522, df = 1, p-value = 0.3977
If the test is positive (low p value), you should see if any
transformation of the dependent variable helps you eliminate
heteroscedasticity.
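As an illustrative sketch (not prescribed by these notes), one common transformation is taking the logarithm of the dependent variable and re-running the test:
> log.lm = lm(log(speed) ~ dist, data = cars) #refit with a log-transformed response
> bptest(log.lm) #re-check for heteroscedasticity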
Autocorrelation:
Autocorrelation, also known as serial correlation or
cross-autocorrelation, is the cross-correlation of a signal with itself
at different points in time. Informally, it is the similarity between
observations as a function of the time lag between them.
It is a mathematical tool for finding repeating patterns, such as the
presence of a periodic signal obscured by noise, or identifying the
missing fundamental frequency in a signal implied by its harmonic
frequencies. It is often used in signal processing for analyzing
functions or series of values, such as time domain signals.
Autocorrelation is a mathematical representation of the degree of
similarity between a given time series and a lagged version of itself
over successive time intervals.
In statistics, the autocorrelation of a random process is
the correlation between values of the process at different times, as a
function of the two times or of the time lag. Let X be a stochastic
process, and t be any point in time. (t may be an integer for
a discrete-time process or a real number for a continuous-
time process.) Then Xt is the value (or realization) produced by a
given run of the process at time t. Suppose that the process
has mean μt and variance σt² at time t, for each t. Then the
definition of the autocorrelation between times s and t is
R(s, t) = E[(Xt − μt)(Xs − μs)] / (σt σs)
where "E" is the expected value operator. Note that this expression
is not well-defined for all-time series or processes, because the
mean may not exist, or the variance may be zero (for a constant
process) or infinite (for processes with distribution lacking well-
behaved moments, such as certain types of power law). If the
function R is well-defined, its value must lie in the range [−1, 1],
with 1 indicating perfect correlation and −1 indicating perfect anti-
correlation.
Figure: a plot of a series of 100 random numbers concealing a sine
function (above), and the sine function revealed in a correlogram
produced by autocorrelation (below).
Figure: visual comparison of convolution, cross-correlation and
autocorrelation.
The function acf ( ) in R computes estimates of the autocovariance
or autocorrelation function.
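For example, applying acf() to the residuals of the model fitted earlier gives a correlogram of the residuals; this is a small sketch, where spikes outside the confidence bands would suggest autocorrelation.
> acf(resid(speed.lm), main = "ACF of residuals") #correlogram of the residuals at increasing lags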
Test: -
The traditional test for the presence of first-order autocorrelation is
the Durbin-Watson statistic or, if the explanatory variables include
a lagged dependent variable, Durbin's h statistic. The
Durbin-Watson statistic can, however, be linearly mapped to the
Pearson correlation between values and their lags.
A more flexible test, covering autocorrelation of higher orders and
applicable whether or not the regressors include lags of the
dependent variable, is the Breusch–Godfrey test. This involves an
auxiliary regression, wherein the residuals obtained from estimating
the model of interest are regressed on (a) the original regressors and
(b) k lags of the residuals, where k is the order of the test. The
simplest version of the test statistic from this auxiliary regression is
TR², where T is the sample size and R² is the coefficient of
determination. Under the null hypothesis of no autocorrelation, this
statistic is asymptotically distributed as χ² with k degrees of
freedom.
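Both tests are available in the lmtest package loaded earlier; a minimal sketch on the model fitted above is:
> dwtest(speed.lm) #Durbin-Watson test for first-order autocorrelation
> bgtest(speed.lm, order = 2) #Breusch-Godfrey test up to order k = 2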
Introduction to Multiple Regression:
Multiple regression is a flexible method of data analysis that may be
appropriate whenever a quantitative variable (the dependent
variable) is to be examined in relationship to any other factors
(expressed as independent or predictor variables).
Relationships may be nonlinear, independent variables may be
quantitative or qualitative, and one can examine the effects of a
single variable or multiple variables with or without the effects of
other variables taken into account.
Many practical questions involve the relationship between a
dependent variable of interest (call it Y) and a set of k independent
variables or potential predictor
variables (call them X1, X2, X3,..., Xk), where the scores on all
variables are measured for N cases. For example, you might be
interested in predicting performance on a job (Y) using information
on years of experience (X1), performance in a training program (X2),
and performance on an aptitude test (X3).
A multiple regression equation for predicting Y can be expressed as
follows:
Y' = A + B1X1 + B2X2 + ... + BkXk
To apply the equation, each Xj score for an individual case is
multiplied by the corresponding Bj value, the products are added
together, and the constant A is added to the sum. The result is Y',
the predicted Y value for the case.
Multiple Regression in R:
The dataset below (dataset_enrollmentForecast.csv) contains 29 yearly
observations of fall enrollment (ROLL), the unemployment rate (UNEM),
the number of spring high school graduates (HGRAD), and per capita
income (INC).
   YEAR  ROLL UNEM HGRAD  INC
1     1  5501  8.1  9552 1923
2     2  5945  7.0  9680 1961
3     3  6629  7.3  9731 1979
4     4  7556  7.5 11666 2030
5     5  8716  7.0 14675 2112
6     6  9369  6.4 15265 2192
7     7  9920  6.5 15484 2235
8     8 10167  6.4 15723 2351
9     9 11084  6.3 16501 2411
10   10 12504  7.7 16890 2475
11   11 13746  8.2 17203 2524
12   12 13656  7.5 17707 2674
13   13 13850  7.4 18108 2833
14   14 14145  8.2 18266 2863
15   15 14888 10.1 19308 2839
16   16 14991  9.2 18224 2898
17   17 14836  7.7 18997 3123
18   18 14478  5.7 19505 3195
19   19 14539  6.5 19800 3239
20   20 14395  7.5 19546 3129
21   21 14599  7.3 19117 3100
22   22 14969  9.2 18774 3008
23   23 15107 10.1 17813 2983
24   24 14831  7.5 17304 3069
25   25 15081  8.8 16756 3151
26   26 15127  9.1 16749 3127
27   27 15856  8.8 16925 3179
28   28 15938  7.8 17231 3207
29   29 16081  7.0 16816 3345
> #read data into variable
> datavar <- read.csv("dataset_enrollmentForecast.csv")
> #attach data variable
> attach(datavar)
> #two predictor model
> #create a linear model using lm(FORMULA, DATAVAR)
> #predict the fall enrollment (ROLL) using the unemployment rate
(UNEM) and number of spring high school graduates (HGRAD)
> twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)
> #display model
> twoPredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = datavar)
Coefficients:
(Intercept) UNEM HGRAD
-8255.7511 698.2681 0.9423
> #what is the expected fall enrollment (ROLL) given this year's
unemployment rate (UNEM) of 9% and spring high school
graduating class (HGRAD) of 100,000
> -8255.8 + 698.2 * 9 + 0.9 * 100000
[1] 88028
> #the predicted fall enrollment, given a 9% unemployment rate and
100,000 student spring high school graduating class, is 88,028
students.
> #three predictor model
> #create a linear model using lm(FORMULA, DATAVAR)
> #predict the fall enrollment (ROLL) using the unemployment rate
(UNEM), number of spring high school graduates (HGRAD), and per
capita income (INC)
> threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC,
datavar)
> #display model
> threePredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = datavar)
Coefficients:
(Intercept) UNEM HGRAD INC
-9153.2545 450.1245 0.4065 4.2749
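As with the two-predictor model, the fitted equation can be used for prediction; the sketch below uses predict() with an unemployment rate of 9%, 100,000 high school graduates, and an illustrative per capita income of 3000 (the income value is assumed, not taken from the notes).
> #predict the fall enrollment from the three-predictor model
> predict(threePredictorModel,
+         newdata = data.frame(UNEM = 9, HGRAD = 100000, INC = 3000))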
Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in
which two or more predictor variables in a multiple regression
model are highly correlated, meaning that one can be linearly
predicted from the others with a substantial degree of accuracy. In
this situation the coefficient estimates of the multiple regressions
may change erratically in response to small changes in the model or
the data. Multicollinearity does not reduce the predictive power or
reliability of the model as a whole, at least within the sample data
set; it only affects calculations regarding individual predictors. That
is, a multiple regression model with correlated predictors can
indicate how well the entire bundle of predictors predicts the
outcome variable, but it may not give valid results about any
individual predictor, or about which predictors are redundant with
respect to others.
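A common way to check for multicollinearity is to inspect the pairwise correlations between the predictors, or to compute variance inflation factors (VIFs) with the car package; a sketch using the enrollment data above is:
> cor(datavar[, c("UNEM", "HGRAD", "INC")]) #pairwise correlations between the predictors
> install.packages("car")
> library(car)
> vif(threePredictorModel) #values well above 5 or 10 are commonly taken as a warning sign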
Key Assumptions of OLS:
Introduction
Linear regression models find several uses in real-life problems. For example, a
multi-national corporation wanting to identify factors that can affect the sales
of its product can run a linear regression to find out which factors are
important. In econometrics, Ordinary Least Squares (OLS) method is widely
used to estimate the parameter of a linear regression model. OLS estimators
minimize the sum of the squared errors (a difference between observed values
and predicted values). While OLS is computationally feasible and can be easily
used while doing any econometrics test, it is important to know the underlying
assumptions of OLS regression. This is because a lack of knowledge of OLS
assumptions would result in its misuse and give incorrect results for the
econometrics test completed. The importance of OLS assumptions cannot be
overemphasized. The next section describes the assumptions of OLS
regression.
Assumptions of OLS Regression
The necessary OLS assumptions, which are used to derive the OLS estimators
in linear regression models, are discussed below.
A1. The linear regression model is “linear in parameters.”
A2. There is a random sampling of observations.
A3. The conditional mean should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: There is homoscedasticity and no autocorrelation.
A6. Optional Assumption: Error terms should be normally distributed.
OLS Assumption 1: The linear regression model is “linear in parameters.”
When the dependent variable (Y) is a linear function of the independent
variables (X's) and the error term, the regression is linear in parameters
and not necessarily linear in the X's. For example, consider the following:
a) Y = β0 + β1X1 + β2X2 + ε
b) Y = β0 + β1X1² + β2X2 + ε
c) Y = β0 + β1²X1 + β2X2 + ε
In the above three examples, for a) and b) OLS assumption 1 is satisfied. For c)
OLS assumption 1 is not satisfied because it is not linear in the parameter β1.
OLS Assumption 2: There is a random sampling of observations
This assumption of OLS regression says that:
• The sample taken for the linear regression model must be drawn randomly
from the population. For example, if you have to run a regression model to
study the factors that impact the scores of students in the final exam, then you
must select students randomly from the university during your data collection
process, rather than adopting a convenient sampling procedure.
• The number of observations taken in the sample for making the linear
regression model should be greater than the number of parameters to be
estimated. This makes sense mathematically too. If a number of parameters to
be estimated (unknowns) are more than the number of observations, then
estimation is not possible. If a number of parameters to be estimated
(unknowns) equal the number of observations, then OLS is not required. You
can simply use algebra.
• The X's should be fixed (i.e. independent variables should impact
dependent variables). It should not be the case that dependent variables impact
independent variables. This is because, in regression models, the causal
relationship is studied, and not merely the correlation between the two variables.
For example, if you run the regression with inflation as your dependent
variable and unemployment as the independent variable, the OLS
estimators are likely to be incorrect because with inflation and unemployment,
we expect correlation rather than a causal relationship.
• The error terms are random. This makes the dependent variable random.
OLS Assumption 3: The conditional mean should be zero.
The expected value of the mean of the error terms of OLS regression should be
zero given the values of independent variables.
Mathematically, E(ε∣X) = 0. This is sometimes just written as E(ε) = 0.
In other words, the distribution of error terms has zero mean and doesn’t
depend on the independent variables X's. Thus, there must be no
relationship between the X's and the error term.
OLS Assumption 4: There is no multi-collinearity (or perfect collinearity).
In a simple linear regression model, there is only one independent variable and
hence, by default, this assumption will hold true. However, in the case of
multiple linear regression models, there are more than one independent
variable. The OLS assumption of no multi-collinearity says that there should be
no linear relationship between the independent variables. For example,
suppose you spend your 24 hours in a day on three things – sleeping, studying,
or playing. Now, if you run a regression with dependent variable as exam
score/performance and independent variables as time spent sleeping, time
spent studying, and time spent playing, then this assumption will not hold.
This is because there is perfect collinearity between the three independent
variables.
Time spent sleeping = 24 – Time spent studying – Time spent playing.
In such a situation, it is better to drop one of the three independent variables
from the linear regression model. If the relationship (correlation) between
independent variables is strong (but not exactly perfect), it still causes
problems in OLS estimators. Hence, this OLS assumption says that you should
select independent variables that are not correlated with each other.
An important implication of this assumption of OLS regression is that there
should be sufficient variation in the X's. The more variability there is in the
X's, the better the OLS estimates are in determining the impact of the X's on Y.
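As an illustrative sketch of perfect collinearity (with made-up numbers, not part of the notes), R's lm() reports an NA coefficient for a predictor that is an exact linear combination of the others:
> studying = c(8, 6, 7, 5, 9, 4)
> playing = c(4, 6, 5, 7, 3, 8)
> sleeping = 24 - studying - playing #exact linear combination of the other two predictors
> score = c(70, 62, 66, 58, 75, 50) #made-up exam scores for illustration
> collinear.lm = lm(score ~ studying + playing + sleeping)
> coef(collinear.lm) #the coefficient for sleeping is reported as NA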
OLS Assumption 5: Spherical errors: There is homoscedasticity and no
autocorrelation.
According to this OLS assumption, the error terms in the regression should all
have the same variance.
Mathematically, Var(ε∣X) = σ².
If this variance is not constant (i.e. dependent on the X's), then the linear
regression model has heteroscedastic errors and is likely to give incorrect
estimates.
This OLS assumption of no autocorrelation says that the error terms of
different observations should not be correlated with each other.
Mathematically, Cov(εi, εj ∣ X) = 0 for i ≠ j.
For example, when we have time series data (e.g. yearly data of
unemployment), then the regression is likely to suffer from autocorrelation
because unemployment next year will certainly be dependent on
unemployment this year. Hence, error terms in different observations will
surely be correlated with each other.
In simple terms, this OLS assumption means that the error terms should be
IID (Independent and Identically Distributed).
Image Source: Laerd Statistics
The above diagram shows the difference between Homoscedasticity and
Heteroscedasticity. The variance of errors is constant in case of homoscedasticity
while it’s not the case if errors are heteroscedastic.
OLS Assumption 6: Error terms should be normally distributed.
This assumption states that the errors are normally distributed, conditional
upon the independent variables. This OLS assumption is not required for the
validity of OLS method; however, it becomes important when one needs to
define some additional finite-sample properties. Note that only the error terms
need to be normally distributed. The dependent variable Y need not be
normally distributed.
The Use of OLS Assumptions
OLS assumptions are extremely important. If the OLS assumptions 1 to 5 hold,
then according to Gauss-Markov Theorem, OLS estimator is Best Linear
Unbiased Estimator (BLUE). These are desirable properties of OLS estimators
and require separate discussion in detail. However, below the focus is on the
importance of the OLS assumptions, discussing what happens when they fail
and how you can look out for potential errors when the assumptions are not
satisfied.
1. The Assumption of Linearity (OLS Assumption 1) – If you fit a linear model to
data that are non-linearly related, the model will be incorrect and hence
unreliable. When you use the model for extrapolation, you are likely to get
erroneous results. Hence, you should always plot a graph of observed vs. predicted
values. If this graph is symmetrically distributed along the 45-degree line, then
you can be sure that the linearity assumption holds. If linearity assumptions
don’t hold, then you need to change the functional form of the regression,
which can be done by taking non-linear transformations of independent
variables (i.e. you can take log X instead of X as your independent variable)
and then check for linearity.
2. The Assumption of Homoscedasticity (OLS Assumption 5) – If errors are
heteroscedastic (i.e. OLS assumption is violated), then it will be difficult to trust
the standard errors of the OLS estimates. Hence, the confidence intervals will
be either too narrow or too wide. Also, violation of this assumption has a
tendency to give too much weight on some portion (subsection) of the data.
Hence, it is important to fix this if error variances are not constant. You can
easily check whether error variances are constant or not. Examine the plot
of residuals vs. predicted values or residuals vs. time (for time series models).
Typically, if the data set is large, then errors are more or less homoscedastic. If
your data set is small, check for this assumption.
3. The Assumption of Independence/No Autocorrelation (OLS Assumption 5) –
As discussed previously, this assumption is most likely to be violated in time
series regression models and, hence, intuition says that there is no need to
investigate it. However, you can still check for autocorrelation by viewing
the residual time series plot. If autocorrelation is present in the model, you can
try taking lags of independent variables to correct for the trend component. If
you do not correct for autocorrelation, then OLS estimates won’t be BLUE, and
they won’t be reliable enough.
4. The Assumption of Normality of Errors (OLS Assumption 6) – If error terms
are not normal, then the standard errors of OLS estimates won’t be reliable,
which means the confidence intervals would be too wide or narrow. Also, OLS
estimators won’t have the desirable BLUE property. A normal probability plot or
a normal quantile plot can be used to check if the error terms are normally
distributed or not. A bow-shaped deviated pattern in these plots reveals that
the errors are not normally distributed. Sometimes errors are not normal
because the linearity assumption is not holding. So, it is worthwhile to check
for linearity assumption again if this assumption fails.
5. Assumption of No Multicollinearity (OLS assumption 4) – You can check for
multicollinearity by making a correlation matrix (though there are other
complex ways of checking it, like the Variance Inflation Factor, etc.). Almost a
sure indication of the presence of multi-collinearity is when you get opposite
(unexpected) signs for your regression coefficients (i.e. if you expect that the
independent variable positively impacts your dependent variable but you get a
negative sign of the coefficient from the regression model). It is highly likely
that the regression suffers from multi-collinearity. If the variable is not that
important intuitively, then dropping that variable or any of the correlated
variables can fix the problem.
6. OLS assumptions 1, 2, and 4 are necessary for the setup of the OLS problem
and its derivation. Random sampling, observations being greater than the
number of parameters, and regression being linear in parameters are all part of
the setup of OLS regression. The assumption of no perfect collinearity allows
one to solve for first order conditions in the derivation of OLS estimates.
Conclusion
Linear regression models are extremely useful and have a wide range of
applications. When you use them, be careful that all the assumptions of OLS
regression are satisfied while doing an econometrics test so that your efforts
don’t go to waste. These assumptions are extremely important, and one cannot
just neglect them. Having said that, many times these OLS assumptions will be
violated. However, that should not stop you from conducting your econometric
test. Rather, when the assumption is violated, applying the correct fixes and
then running the linear regression model should be the way out for a reliable
econometric test.
More Related Content

What's hot

Regression analysis
Regression analysisRegression analysis
Regression analysissaba khan
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsBabasab Patil
 
Properties of coefficient of correlation
Properties of coefficient of correlationProperties of coefficient of correlation
Properties of coefficient of correlationNadeem Uddin
 
Simple linear regression and correlation
Simple linear regression and correlationSimple linear regression and correlation
Simple linear regression and correlationShakeel Nouman
 
Correlation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft ExcelCorrelation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft ExcelSetia Pramana
 
Hypothesis and Hypothesis Testing
Hypothesis and Hypothesis TestingHypothesis and Hypothesis Testing
Hypothesis and Hypothesis TestingNaibin
 
Probability distribution 2
Probability distribution 2Probability distribution 2
Probability distribution 2Nilanjan Bhaumik
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression AnalysisSalim Azad
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regressionnszakir
 
Measures of Dispersion
Measures of DispersionMeasures of Dispersion
Measures of DispersionMohit Mahajan
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAnil Pokhrel
 
Stat3 central tendency & dispersion
Stat3 central tendency & dispersionStat3 central tendency & dispersion
Stat3 central tendency & dispersionForensic Pathology
 

What's hot (20)

Covariance vs Correlation
Covariance vs CorrelationCovariance vs Correlation
Covariance vs Correlation
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec doms
 
Properties of coefficient of correlation
Properties of coefficient of correlationProperties of coefficient of correlation
Properties of coefficient of correlation
 
Simple linear regression and correlation
Simple linear regression and correlationSimple linear regression and correlation
Simple linear regression and correlation
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Correlation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft ExcelCorrelation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft Excel
 
Hypothesis and Hypothesis Testing
Hypothesis and Hypothesis TestingHypothesis and Hypothesis Testing
Hypothesis and Hypothesis Testing
 
Probability distribution 2
Probability distribution 2Probability distribution 2
Probability distribution 2
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Bernoulli distribution
Bernoulli distributionBernoulli distribution
Bernoulli distribution
 
Kruskal Wall Test
Kruskal Wall TestKruskal Wall Test
Kruskal Wall Test
 
Continuous probability distribution
Continuous probability distributionContinuous probability distribution
Continuous probability distribution
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Regression
RegressionRegression
Regression
 
Binomial probability distributions
Binomial probability distributions  Binomial probability distributions
Binomial probability distributions
 
Measures of Dispersion
Measures of DispersionMeasures of Dispersion
Measures of Dispersion
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Stat3 central tendency & dispersion
Stat3 central tendency & dispersionStat3 central tendency & dispersion
Stat3 central tendency & dispersion
 

Similar to Correlation and regression in r

Correation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareCorreation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareshrikrishna kesharwani
 
NPTL Machine Learning Week 2.docx
NPTL Machine Learning Week 2.docxNPTL Machine Learning Week 2.docx
NPTL Machine Learning Week 2.docxMr. Moms
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regressionkishanthkumaar
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
linearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxlinearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxbishalnandi2
 
Chapter III.pptx
Chapter III.pptxChapter III.pptx
Chapter III.pptxBeamlak5
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptxHarsha Patel
 
Regression analysis and its type
Regression analysis and its typeRegression analysis and its type
Regression analysis and its typeEkta Bafna
 
Linear regression.pptx
Linear regression.pptxLinear regression.pptx
Linear regression.pptxssuserb8a904
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfDr. Radhey Shyam
 
Detail Study of the concept of Regression model.pptx
Detail Study of the concept of  Regression model.pptxDetail Study of the concept of  Regression model.pptx
Detail Study of the concept of Regression model.pptxtruptikulkarni2066
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis pptElkana Rorio
 

Similar to Correlation and regression in r (20)

Correation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareCorreation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R software
 
NPTL Machine Learning Week 2.docx
NPTL Machine Learning Week 2.docxNPTL Machine Learning Week 2.docx
NPTL Machine Learning Week 2.docx
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
 
Eviews forecasting
Eviews forecastingEviews forecasting
Eviews forecasting
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regression
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
working with python
working with pythonworking with python
working with python
 
linearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptxlinearregression-1909240jhgg53948.pptx
linearregression-1909240jhgg53948.pptx
 
Introduction to regression
Introduction to regressionIntroduction to regression
Introduction to regression
 
Chapter III.pptx
Chapter III.pptxChapter III.pptx
Chapter III.pptx
 
Introduction to Regression . pptx
Introduction     to    Regression . pptxIntroduction     to    Regression . pptx
Introduction to Regression . pptx
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Regression analysis and its type
Regression analysis and its typeRegression analysis and its type
Regression analysis and its type
 
Linear regression.pptx
Linear regression.pptxLinear regression.pptx
Linear regression.pptx
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdf
 
Detail Study of the concept of Regression model.pptx
Detail Study of the concept of  Regression model.pptxDetail Study of the concept of  Regression model.pptx
Detail Study of the concept of Regression model.pptx
 
ai.pptx
ai.pptxai.pptx
ai.pptx
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
R nonlinear least square
R   nonlinear least squareR   nonlinear least square
R nonlinear least square
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis ppt
 

Recently uploaded

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 

Recently uploaded (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 

Correlation and regression in r

Now apply regression analysis to the dataset using the lm( ) function.

> speed.lm = lm(speed ~ dist, data = cars)

The lm() call describes the variable speed by the variable dist and saves the fitted linear regression model in a new variable, speed.lm. Here the y (dependent) variable is speed and the x (independent) variable is dist. The fit gives us the intercept "C" and the slope "m" of the equation Y = mX + C.

> speed.lm

Call:
lm(formula = speed ~ dist, data = cars)
Coefficients:
(Intercept)         dist
     8.2839       0.1656

> abline(speed.lm)

The abline() function adds one or more straight lines to the current plot; here it draws the fitted regression line on the scatterplot.

> plot(speed.lm)

When applied to a fitted model, the plot() function displays four diagnostic charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage.
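By default R shows these four charts one at a time. A minimal sketch (not in the original notes) that places them on a single 2 x 2 grid:

> par(mfrow = c(2, 2))   # arrange the plotting device as a 2 x 2 grid
> plot(speed.lm)         # all four diagnostic charts appear together
> par(mfrow = c(1, 1))   # restore the default single-plot layout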
What is a quantile in statistics?

In statistics and probability theory, quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one less quantile than the number of groups created.
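As a small illustration (not part of the original notes), R's quantile() function computes sample quantiles; by default it returns the quartiles of a variable, here the dist column of the cars data used above:

> quantile(cars$dist)                                    # 0%, 25%, 50%, 75%, 100% quantiles
> quantile(cars$dist, probs = seq(0.1, 0.9, by = 0.2))   # any other set of probabilities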
The residuals of a simple linear regression model are the differences between the observed values of the dependent variable y and the fitted values ŷ.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)

We now plot the residuals against the observed values of the variable waiting.

> plot(faithful$waiting, eruption.res,
+      ylab="Residuals", xlab="Waiting Time",
+      main="Old Faithful Eruptions")
> abline(0, 0)   # the horizon
Residual: The difference between the predicted value (based on the regression equation) and the actual, observed value.

Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High-leverage points can have a great amount of effect on the estimates of the regression coefficients.

Influence: An observation is said to be influential if removing it substantially changes the estimates of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.

Estimated simple regression equation:
We will use the simple linear regression model fitted above to estimate the speed when the distance covered is 80. First extract the parameters of the estimated regression equation with the coefficients function.

> coeffs = coefficients(speed.lm)
> coeffs
(Intercept)        dist
  8.2839056   0.1655676

Forecasting/Prediction:
We now estimate the speed for the new distance using the estimated regression equation.

> newdist = 80
> distance = coeffs[1] + coeffs[2]*newdist
> distance
(Intercept)
   21.52931
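The same forecast can also be obtained with R's predict() function; this is a minimal sketch added for convenience (it is not in the original notes):

> predict(speed.lm, newdata = data.frame(dist = 80))   # same value as above, about 21.53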
To create a summary of the fitted model:

> summary(speed.lm)

Call:
lm(formula = speed ~ dist, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5293 -2.1550  0.3615  2.4377  6.4179

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

OLS Regression:
Ordinary least squares (OLS), or linear least squares, is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in a dataset and the responses predicted by the linear approximation of the data. It applies to both simple linear and multiple regression, where the common assumptions are:
(1) the model is linear in the coefficients of the predictors, with an additive random error term; and
(2) the random error terms are
  * normally distributed with mean 0, and
  * have a variance that does not change as the values of the predictor covariates change.

Correlation:
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. It can show whether and how strongly pairs of variables are related, i.e. it measures the association between variables. Correlation can be positive or negative, ranging between +1 and -1.
For example, height and weight are related: taller people tend to be heavier than shorter people. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

When the fluctuation of one variable reliably predicts a similar fluctuation in another variable, there is often a tendency to think that the change in one causes the change in the other. However, correlation does not imply causation. There may be, for example, an unknown factor that influences both variables similarly. An intelligent correlation analysis can lead to a greater understanding of your data.

Correlation in R:
We use the cor( ) function to produce correlations. A simplified format is cor(x, use=, method=), where:

Option   Description
x        Matrix or data frame.
use      Specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
method   Specifies the type of correlation. Options are pearson, spearman or kendall.

> cor(cars)
          speed      dist
speed 1.0000000 0.8068949
dist  0.8068949 1.0000000

> cor(cars, use="complete.obs", method="kendall")
          speed      dist
speed 1.0000000 0.6689901
dist  0.6689901 1.0000000

> cor(cars, use="complete.obs", method="pearson")
          speed      dist
speed 1.0000000 0.8068949
dist  0.8068949 1.0000000

Correlation Coefficient:
The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related. Formally, the sample correlation coefficient is defined as

    r = sxy / (sx * sy)

where sx and sy are the sample standard deviations and sxy is the sample covariance. Similarly, the population correlation coefficient is defined as

    ρ = σxy / (σx * σy)

where σx and σy are the population standard deviations and σxy is the population covariance.
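As a quick check (not in the original notes), the sample correlation can be computed directly from this definition and compared with cor():

> cov(cars$speed, cars$dist) / (sd(cars$speed) * sd(cars$dist))
[1] 0.8068949
> cor(cars$speed, cars$dist)   # the same value reported above
[1] 0.8068949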
If the correlation coefficient is close to 1, it indicates that the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1, the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. A value close to zero indicates a weak linear relationship between the variables.

Interpreting the correlation coefficient r:
* +1        : Perfectly positive
* -1        : Perfectly negative
* 0 – 0.2   : No or very weak association
* 0.2 – 0.4 : Weak association
* 0.4 – 0.6 : Moderate association
* 0.6 – 0.8 : Strong association
* 0.8 – 1   : Very strong to perfect association

Covariance:
Covariance provides a measure of the strength of the correlation between two or more sets of random variates. Correlation is defined in terms of the variance of x, the variance of y, and the covariance of x and y (the way the two vary together; the way they co-vary), on the assumption that both variables are normally distributed.

Covariance in R:
We apply the cov() function to compute the covariance of eruptions and waiting in the faithful dataset.
> duration = faithful$eruptions   # the eruption durations
> waiting = faithful$waiting      # the waiting period
> cov(duration, waiting)          # apply the cov function
[1] 13.978

ANOVA:
Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way ANOVA, which is an extension of the two-sample t test for independent groups, covering situations where more than two groups are compared. In one-way ANOVA the data are sub-divided into groups based on a single classification factor, and the standard terminology for the set of factor levels is "treatment", even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set, and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor. To investigate these differences we fit the model using the lm function and look at the ANOVA table for the parameter estimates and their standard errors.

> anova(speed.lm)
Analysis of Variance Table

Response: speed
          Df Sum Sq Mean Sq F value   Pr(>F)
dist       1 891.98  891.98  89.567 1.49e-12 ***
Residuals 48 478.02    9.96
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table confirms the significance of the effects highlighted in the model summary. The function confint is used to calculate confidence intervals on the model parameters, by default 95% confidence intervals.

> confint(speed.lm)
                2.5 %     97.5 %
(Intercept) 6.5258378 10.0419735
dist        0.1303926  0.2007426

Heteroscedasticity:
Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. A scatterplot of these variables will often show a cone-like shape, as the scatter (or variability) of the dependent variable (DV) widens or narrows as the value of the independent variable (IV) increases. The opposite of heteroscedasticity is homoscedasticity, which indicates that a DV's variability is equal across values of an IV. Hetero (different or unequal) is the opposite of homo (same or equal); skedastic means spread or scatter. Homoskedasticity = equal spread; heteroskedasticity = unequal spread.
Detecting Heteroskedasticity:
There are two general approaches. The first is informal and is done through graphs, so we call it the graphical method. The second is through formal tests for heteroskedasticity, such as the following:
1. The Breusch-Pagan LM Test
2. The Glejser LM Test
3. The Harvey-Godfrey LM Test
4. The Park LM Test
5. The Goldfeld-Quandt Test
6. White's Test

Heteroscedasticity test in R:
bptest() performs the Breusch-Pagan test to formally check for the presence of heteroscedasticity. To use bptest, you have to load the lmtest library.

> install.packages("lmtest")
> library(lmtest)
> bptest(speed.lm)

        studentized Breusch-Pagan test

data:  speed.lm
BP = 0.71522, df = 1, p-value = 0.3977

If the test is positive (low p-value), you should see whether a transformation of the dependent variable helps you eliminate the heteroscedasticity.
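For example, a minimal sketch (not from the original notes) of one such transformation, refitting the model with a log-transformed response and re-running the test; whether this actually removes heteroscedasticity depends on the data:

> speed.log.lm <- lm(log(speed) ~ dist, data = cars)   # log-transform the dependent variable
> bptest(speed.log.lm)                                 # re-check with the Breusch-Pagan test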
Autocorrelation:
Autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself at different points in time. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time-domain signals.

Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. In statistics, the autocorrelation of a random process is the correlation between values of the process at different times, as a function of the two times or of the time lag.

Let X be a stochastic process and t be any point in time (t may be an integer for a discrete-time process or a real number for a continuous-time process). Then Xt is the value (or realization) produced by a given run of the process at time t. Suppose that the process has mean μt and variance σt² at time t, for each t. Then the autocorrelation between times s and t is defined as

    R(s, t) = E[(Xs − μs)(Xt − μt)] / (σs σt)

where E is the expected value operator. Note that this expression is not well-defined for all time series or processes, because the mean may not exist, or the variance may be zero (for a constant process) or infinite (for processes whose distribution lacks well-behaved moments, such as certain types of power law). If the function R is well-defined, its value must lie in the range [−1, 1], with 1 indicating perfect correlation and −1 indicating perfect anti-correlation.
Above: A plot of a series of 100 random numbers concealing a sine function. Below: The sine function revealed in a correlogram produced by autocorrelation.
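A minimal sketch (not part of the original notes) that reproduces this kind of picture: a sine wave buried in random noise looks featureless in the raw plot, but its period shows up clearly in the correlogram produced by acf().

> set.seed(1)
> x <- sin(2 * pi * (1:100) / 20) + rnorm(100)   # periodic signal hidden in noise
> plot(x, type = "l")                            # looks like random numbers
> acf(x, lag.max = 40)                           # the correlogram reveals the period of 20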
Visual comparison of convolution, cross-correlation and autocorrelation.

The function acf() in R computes estimates of the autocovariance or autocorrelation function.

Tests:
The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic or, if the explanatory variables include a lagged dependent variable, Durbin's h statistic. The Durbin–Watson statistic can, however, be linearly mapped to the Pearson correlation between values and their lags.

A more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable, is the Breusch–Godfrey test. This involves an auxiliary regression in which the residuals obtained from estimating the model of interest are regressed on (a) the original regressors and (b) k lags of the residuals, where k is the order of the test. The simplest version of the test statistic from this auxiliary regression is TR², where T is the sample size and R² is the coefficient of determination. Under the null hypothesis of no autocorrelation, this statistic is asymptotically distributed as χ² with k degrees of freedom.
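A minimal sketch (not in the original notes) of how these checks might be run on the speed.lm model fitted earlier, using the dwtest() and bgtest() functions from the lmtest package loaded above; the cars data are not a genuine time series, so this only illustrates the mechanics:

> acf(resid(speed.lm))            # correlogram of the residuals
> dwtest(speed.lm)                # Durbin-Watson test for first-order autocorrelation
> bgtest(speed.lm, order = 2)     # Breusch-Godfrey test up to order k = 2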
Introduction to Multiple Regression:
Multiple regression is a flexible method of data analysis that may be appropriate whenever a quantitative variable (the dependent variable) is to be examined in relation to any other factors (expressed as independent or predictor variables). Relationships may be nonlinear, independent variables may be quantitative or qualitative, and one can examine the effects of a single variable or of multiple variables with or without the effects of other variables taken into account.

Many practical questions involve the relationship between a dependent variable of interest (call it Y) and a set of k independent variables or potential predictor variables (call them X1, X2, X3, ..., Xk), where the scores on all variables are measured for N cases. For example, you might be interested in predicting performance on a job (Y) using information on years of experience (X1), performance in a training program (X2), and performance on an aptitude test (X3). A multiple regression equation for predicting Y can be expressed as follows:

    Y' = A + B1X1 + B2X2 + ... + BkXk

To apply the equation, each Xj score for an individual case is multiplied by the corresponding Bj value, the products are added together, and the constant A is added to the sum. The result is Y', the predicted Y value for the case.

Multiple Regression in R:
The example below uses an enrollment data set (dataset_enrollmentForecast.csv) with 29 yearly observations of fall enrollment (ROLL), the unemployment rate (UNEM), the number of spring high school graduates (HGRAD), and per capita income (INC).

   YEAR  ROLL UNEM HGRAD  INC
1     1  5501  8.1  9552 1923
2     2  5945  7.0  9680 1961
3     3  6629  7.3  9731 1979
4     4  7556  7.5 11666 2030
5     5  8716  7.0 14675 2112
6     6  9369  6.4 15265 2192
7     7  9920  6.5 15484 2235
8     8 10167  6.4 15723 2351
9     9 11084  6.3 16501 2411
10   10 12504  7.7 16890 2475
11   11 13746  8.2 17203 2524
12   12 13656  7.5 17707 2674
13   13 13850  7.4 18108 2833
14   14 14145  8.2 18266 2863
15   15 14888 10.1 19308 2839
16   16 14991  9.2 18224 2898
17   17 14836  7.7 18997 3123
18   18 14478  5.7 19505 3195
19   19 14539  6.5 19800 3239
20   20 14395  7.5 19546 3129
21   21 14599  7.3 19117 3100
22   22 14969  9.2 18774 3008
23   23 15107 10.1 17813 2983
24   24 14831  7.5 17304 3069
25   25 15081  8.8 16756 3151
26   26 15127  9.1 16749 3127
27   27 15856  8.8 16925 3179
28   28 15938  7.8 17231 3207
29   29 16081  7.0 16816 3345

> # read the data into a variable
> datavar <- read.csv("dataset_enrollmentForecast.csv")
> # attach the data variable
> attach(datavar)
> # two-predictor model
> # create a linear model using lm(FORMULA, DATAVAR)
> # predict the fall enrollment (ROLL) using the unemployment rate (UNEM)
> # and the number of spring high school graduates (HGRAD)
> twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)
> # display the model
> twoPredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = datavar)

Coefficients:
(Intercept)         UNEM        HGRAD
 -8255.7511     698.2681       0.9423

> # what is the expected fall enrollment (ROLL) given this year's unemployment
> # rate (UNEM) of 9% and a spring high school graduating class (HGRAD) of 100,000?
> -8255.8 + 698.2 * 9 + 0.9 * 100000
[1] 88028
> # the predicted fall enrollment, given a 9% unemployment rate and a
> # 100,000-student spring graduating class, is about 88,028 students
> # (using rounded coefficients).

> # three-predictor model
> # create a linear model using lm(FORMULA, DATAVAR)
> # predict the fall enrollment (ROLL) using the unemployment rate (UNEM),
> # the number of spring high school graduates (HGRAD), and per capita income (INC)
> threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)
> # display the model
> threePredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = datavar)

Coefficients:
(Intercept)         UNEM        HGRAD          INC
 -9153.2545     450.1245       0.4065       4.2749
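The same kind of forecast can be made with predict(); this is a sketch added for illustration (not in the original notes), and the INC value of 3000 is purely hypothetical:

> # expected enrollment from the two-predictor model, as computed above
> predict(twoPredictorModel, newdata = data.frame(UNEM = 9, HGRAD = 100000))
> # expected enrollment from the three-predictor model, with a hypothetical INC of 3000
> predict(threePredictorModel, newdata = data.frame(UNEM = 9, HGRAD = 100000, INC = 3000))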
Multicollinearity:
In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
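As a quick illustration (not from the original notes), collinearity among the enrollment predictors can be screened with a correlation matrix, or with variance inflation factors via the vif() function, assuming the car package is installed:

> cor(datavar[, c("UNEM", "HGRAD", "INC")])   # pairwise correlations among the predictors
> library(car)                                # assumes the car package is available
> vif(threePredictorModel)                    # values well above 5-10 suggest problematic collinearity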
Key Assumptions of OLS:

Introduction:
Linear regression models find several uses in real-life problems. For example, a multi-national corporation wanting to identify factors that can affect the sales of its product can run a linear regression to find out which factors are important. In econometrics, the Ordinary Least Squares (OLS) method is widely used to estimate the parameters of a linear regression model. OLS estimators minimize the sum of the squared errors (the differences between observed and predicted values). While OLS is computationally feasible and easy to use in any econometric analysis, it is important to know its underlying assumptions, because a lack of knowledge of the OLS assumptions would result in its misuse and give incorrect results. The importance of the OLS assumptions cannot be overemphasized. The next section describes them.

Assumptions of OLS Regression:
The necessary OLS assumptions, which are used to derive the OLS estimators in linear regression models, are:

A1. The linear regression model is "linear in parameters."
A2. There is random sampling of observations.
A3. The conditional mean should be zero.
A4. There is no multicollinearity (or perfect collinearity).
A5. Spherical errors: there is homoscedasticity and no autocorrelation.
A6. (Optional) Error terms should be normally distributed.

OLS Assumption 1: The linear regression model is "linear in parameters."
When the dependent variable (Y) is a linear function of the independent variables (X's) and the error term, the regression is linear in parameters and not necessarily linear in the X's. For example, consider the following models (a sketch of fitting model (b) in R appears after the examples):

a) Y = β0 + β1X1 + β2X2 + ε
b) Y = β0 + β1X1² + β2X2 + ε
c) Y = β0 + β1²X1 + β2X2 + ε

In these three examples, OLS Assumption 1 is satisfied for (a) and (b). For (c) it is not satisfied, because the model is not linear in the parameter β1.
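A minimal sketch (not in the original notes, using simulated data) of how model (b) would be fitted in R; the I() wrapper lets the squared term enter the formula while the model remains linear in its parameters:

> # simulated data purely for illustration
> set.seed(2)
> df <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
> df$Y <- 1 + 2 * df$X1 + 0.5 * df$X1^2 - df$X2 + rnorm(100)
> # model (b): quadratic in X1, but still linear in the parameters
> fit.b <- lm(Y ~ X1 + I(X1^2) + X2, data = df)
> summary(fit.b)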
OLS Assumption 2: There is random sampling of observations.
This assumption of OLS regression says that:

• The sample taken for the linear regression model must be drawn randomly from the population. For example, if you have to run a regression model to study the factors that impact the scores of students in the final exam, then you must select students randomly from the university during your data collection process, rather than adopting a convenience sampling procedure.

• The number of observations taken in the sample for the linear regression model should be greater than the number of parameters to be estimated. This makes sense mathematically too: if the number of parameters to be estimated (unknowns) is greater than the number of observations, estimation is not possible; if they are equal, OLS is not required and you can simply use algebra.

• The X's should be fixed (i.e. the independent variables should impact the dependent variable, not the other way around). It should not be the case that the dependent variable impacts the independent variables. This is because regression models study a causal relationship, not merely a correlation between the two variables. For example, if you run a regression with inflation as your dependent variable and unemployment as the independent variable, the OLS estimators are likely to be incorrect, because with inflation and unemployment we expect correlation rather than a one-way causal relationship.

• The error terms are random. This makes the dependent variable random.

OLS Assumption 3: The conditional mean should be zero.
The expected value of the error terms of the OLS regression should be zero given the values of the independent variables. Mathematically,

    E(ε | X) = 0

This is sometimes just written as E(ε) = 0. In other words, the distribution of the error terms has zero mean and does not depend on the independent variables X's. Thus, there must be no relationship between the X's and the error term.

OLS Assumption 4: There is no multicollinearity (or perfect collinearity).
In a simple linear regression model there is only one independent variable, so by default this assumption holds. However, a multiple linear regression model has more than one independent variable. The OLS assumption of no multicollinearity says that there should be no exact linear relationship between the independent variables. For example, suppose you spend the 24 hours in a day on three things: sleeping, studying, or playing. Now, if you run a regression with exam score/performance as the dependent variable and time spent sleeping, time spent studying, and time spent playing as the independent variables, this assumption will not hold, because there is perfect collinearity between the three independent variables: time spent sleeping = 24 − time spent studying − time spent playing. In such a situation it is better to drop one of the three independent variables from the model. If the relationship (correlation) between independent variables is strong (but not exactly perfect), it still causes problems for the OLS estimators. Hence, this assumption says that you should select independent variables that are not correlated with each other.
An important implication of this assumption is that there should be sufficient variation in the X's: the more variability there is in the X's, the better the OLS estimates are at determining the impact of the X's on Y.

OLS Assumption 5: Spherical errors: there is homoscedasticity and no autocorrelation.
According to this assumption, the error terms in the regression should all have the same variance. Mathematically,

    Var(ε | X) = σ²

If this variance is not constant (i.e. it depends on the X's), then the linear regression model has heteroscedastic errors and is likely to give incorrect estimates.

The no-autocorrelation part of this assumption says that the error terms of different observations should not be correlated with each other. Mathematically,

    Cov(εi, εj | X) = 0  for i ≠ j

For example, when we have time series data (e.g. yearly unemployment data), the regression is likely to suffer from autocorrelation, because unemployment next year will certainly depend on unemployment this year, so the error terms of different observations will be correlated with each other. In simple terms, this assumption means that the error terms should be IID (independent and identically distributed).
A diagram from Laerd Statistics (image source) illustrates the difference between homoscedasticity and heteroscedasticity: the variance of the errors is constant under homoscedasticity, while it is not when the errors are heteroscedastic.

OLS Assumption 6: Error terms should be normally distributed.
This assumption states that the errors are normally distributed, conditional upon the independent variables. It is not required for the validity of the OLS method; however, it becomes important when one needs to derive some additional finite-sample properties. Note that only the error terms need to be normally distributed; the dependent variable Y need not be.

The Use of OLS Assumptions:
The OLS assumptions are extremely important. If OLS assumptions 1 to 5 hold, then, according to the Gauss-Markov theorem, the OLS estimator is the Best Linear Unbiased Estimator (BLUE). These desirable properties of OLS estimators require a separate, detailed discussion. Below, the focus is on the importance of the OLS assumptions: what happens when they fail, and how you can look out for potential errors when the assumptions do not hold.
1. The Assumption of Linearity (OLS Assumption 1): If you fit a linear model to data that are non-linearly related, the model will be incorrect and hence unreliable. When you use the model for extrapolation, you are likely to get erroneous results. Hence, you should always plot a graph of observed vs. predicted values. If this graph is symmetrically distributed along the 45-degree line, then you can be confident that the linearity assumption holds. If the linearity assumption does not hold, you need to change the functional form of the regression, which can be done by taking non-linear transformations of the independent variables (e.g. taking log X instead of X as your independent variable) and then checking for linearity again.

2. The Assumption of Homoscedasticity (OLS Assumption 5): If the errors are heteroscedastic (i.e. this OLS assumption is violated), then it will be difficult to trust the standard errors of the OLS estimates, and the confidence intervals will be either too narrow or too wide. Violation of this assumption also tends to give too much weight to some portion (subsection) of the data. Hence, it is important to fix this if the error variances are not constant. You can easily check whether the error variances are constant by examining the plot of residuals vs. predicted values, or residuals vs. time for time series models. Typically, if the data set is large, the errors are more or less homoscedastic; if your data set is small, check this assumption carefully.

3. The Assumption of Independence/No Autocorrelation (OLS Assumption 5): As discussed previously, this assumption is most likely to be violated in time series regression models, so intuition may suggest there is no need to investigate it elsewhere. However, you can still check for autocorrelation by viewing a time series plot of the residuals. If autocorrelation is present in the model, you can try taking lags of the independent variables to correct for the trend component.
If you do not correct for autocorrelation, the OLS estimates won't be BLUE and they won't be reliable enough.

4. The Assumption of Normality of Errors (OLS Assumption 6): If the error terms are not normal, then the standard errors of the OLS estimates won't be reliable, which means the confidence intervals will be too wide or too narrow. Also, the OLS estimators won't have the desirable BLUE property. A normal probability plot or a normal quantile plot can be used to check whether the error terms are normally distributed (see the sketch after this list); a bow-shaped pattern in these plots reveals that the errors are not normally distributed. Sometimes errors are not normal because the linearity assumption does not hold, so it is worthwhile to check the linearity assumption again if this assumption fails.

5. The Assumption of No Multicollinearity (OLS Assumption 4): You can check for multicollinearity by making a correlation matrix (though there are other, more sophisticated checks, such as the Variance Inflation Factor). An almost sure indication of the presence of multicollinearity is when you get opposite (unexpected) signs for your regression coefficients (e.g. you expect the independent variable to positively impact your dependent variable but you get a negative coefficient from the regression model); in that case it is highly likely that the regression suffers from multicollinearity. If the variable is not that important intuitively, then dropping that variable or any of the correlated variables can fix the problem.

6. OLS assumptions 1, 2, and 4 are necessary for the setup of the OLS problem and its derivation. Random sampling, the number of observations being greater than the number of parameters, and the regression being linear in parameters are all part of the setup of OLS regression. The assumption of no perfect collinearity allows one to solve the first-order conditions in the derivation of the OLS estimates.
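A minimal sketch (not part of the original notes) of the graphical checks mentioned in points 1 and 4, applied to the speed.lm model fitted earlier; shapiro.test() is an extra formal normality test added here for completeness:

> plot(fitted(speed.lm), cars$speed,            # observed vs. fitted values
+      xlab = "Fitted values", ylab = "Observed speed")
> abline(0, 1)                                  # points should scatter around the 45-degree line
> qqnorm(resid(speed.lm))                       # normal quantile plot of the residuals
> qqline(resid(speed.lm))
> shapiro.test(resid(speed.lm))                 # formal test of normality of the residuals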
Conclusion:
Linear regression models are extremely useful and have a wide range of applications. When you use them, make sure that all the assumptions of OLS regression are satisfied, so that your efforts don't go to waste. These assumptions are extremely important and cannot simply be neglected. Having said that, these OLS assumptions will often be violated in practice. That should not stop you from conducting your analysis; rather, when an assumption is violated, applying the correct fix and then running the linear regression model is the way to obtain a reliable result.