Linear regression establishes a relationship model between two variables, a predictor variable and a response variable. The relationship is represented by a linear equation where the exponent of both variables is 1, forming a straight line when graphed. Assumptions of linear regression include a linear relationship between variables, normally distributed residuals, and homoscedasticity. Linear regression is used to predict the response variable for new observations by fitting a linear model to observed data using functions like lm() and predict() in R.
Linear Regression
Regression analysis is a very widely used
statistical tool to establish a relationship model
between two variables.
One of these variables is called the predictor variable,
whose value is gathered through experiments.
The other variable is called the response variable,
whose value is derived from the predictor
variable.
• In linear regression these two variables are related
through an equation, where the exponent (power) of both
these variables is 1.
• Mathematically a linear relationship represents a straight
line when plotted as a graph.
• A non-linear relationship where the exponent of any
variable is not equal to 1 creates a curve.
• The general mathematical equation for a linear
regression is −
• y = ax + b
• Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
• Steps to Establish a Regression
• A simple example of regression is predicting the weight of a
person when their height is known. To do this we need to
have the relationship between the height and weight of a
person.
• The steps to create the relationship are −
• Carry out the experiment of gathering a sample of
observed values of height and corresponding weight.
• Create a relationship model using the lm() functions in R.
• Find the coefficients from the model created and build
the mathematical equation using them.
• Get a summary of the relationship model to know the
average error in prediction; these errors are called residuals.
• To predict the weight of new persons, use
the predict() function in R.
• Input Data
• Below is the sample data representing the observations −
• # Values of height
• 151, 174, 138, 186, 128, 136, 179, 163, 152, 131
• # Values of weight.
• 63, 81, 56, 91, 47, 57, 76, 72, 62, 48
• lm() Function
• This function creates the relationship model between the
predictor and the response variable.
• Syntax
• The basic syntax for lm() function in linear regression is −
• lm(formula,data)
• Following is the description of the parameters used −
• formula is a symbolic description of the relation between x and y.
• data is the data frame containing the variables to which the formula applies.
• Get the Summary of the Relationship
• x <- c(151, 174, 138, 186, 128, 136, 179, 163,
152, 131)
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• # Apply the lm() function.
• relation <- lm(y~x)
• print(summary(relation))
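As a companion to the step "find the coefficients from the model", coef() returns the fitted intercept and slope directly; a minimal sketch using the same data (the names a and b below just follow the slide's y = ax + b):

```r
# Same sample data as above.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y ~ x)

# coef() returns the fitted intercept (b) and slope (a) of y = ax + b.
b <- coef(relation)[["(Intercept)"]]
a <- coef(relation)[["x"]]
cat("y =", a, "* x +", b, "\n")
```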
• predict() Function
• Syntax
• The basic syntax for predict() in linear regression
is −
• predict(object, newdata)
• Following is the description of the parameters used −
• object is the model which has already been created
using the lm() function.
• newdata is a data frame containing the new value(s)
for the predictor variable.
• Predict the weight of new persons
• # The predictor vector.
• x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
• # The response vector.
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• # Apply the lm() function.
• relation <- lm(y~x)
• # Find weight of a person with height 170.
• a <- data.frame(x = 170)
• result <- predict(relation,a)
• print(result)
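As a cross-check (not in the slides), the value returned by predict() should match plugging the fitted coefficients into y = ax + b by hand:

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# predict() for a person of height 170.
pred <- predict(relation, data.frame(x = 170))

# Manual computation: intercept + slope * 170.
manual <- coef(relation)[[1]] + coef(relation)[[2]] * 170

cat("predict():", pred, " manual:", manual, "\n")
```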
Visualize the Regression Graphically
• # Create the predictor and response variable.
• x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• relation <- lm(y~x)
• # Give the chart file a name.
• png(file = "linearregression.png")
• # Plot the chart.
• plot(y, x, col = "blue", main = "Height & Weight Regression",
abline(lm(x~y)), cex = 1.3, pch = 16,
xlab = "Weight in Kg", ylab = "Height in cm")
• # Save the file.
• dev.off()
Regression assumptions
• Linear regression makes several assumptions
about the data, such as:
• Linearity of the data. The relationship between
the predictor (x) and the outcome (y) is assumed
to be linear.
• Normality of residuals. The residual errors are
assumed to be normally distributed.
• Homogeneity of residuals variance. The residuals
are assumed to have a constant variance
(homoscedasticity).
• Independence of residual error terms.
• Assumptions about the form of the model
• Assumptions about the errors
• Assumptions about the predictors
• The predictor variables x1, x2, ..., xn are
assumed to be linearly independent of each
other. If this assumption is violated, the
problem is called the collinearity problem.
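A simple (illustrative) way to screen for collinearity is to inspect pairwise correlations between the predictors; values near ±1 are a warning sign. A sketch using R's built-in mtcars data (the choice of cyl, disp, and hp as candidate predictors is an assumption made here for illustration):

```r
# Pairwise correlations between candidate predictor columns of mtcars.
preds <- mtcars[, c("cyl", "disp", "hp")]
print(round(cor(preds), 2))

# cyl and disp are very strongly correlated, so including both as
# predictors in one model would raise a collinearity concern.
```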
Validating linear assumptions
• Step 1 - Install the necessary libraries
• install.packages("ggplot2")
• install.packages("dplyr")
• library(ggplot2)
• library(dplyr)
• Step 2 - Read a csv file and explore the data
• data <- read.csv("/content/Data_1.csv")
• head(data) # head() returns the top 6 rows of the
dataframe
• summary(data) # returns the statistical summary of the
data columns
• plot(data$Width, data$Cost) # plot() gives a visual
representation of the relation between the variables Width
and Cost
• cor(data$Width,data$Cost) # correlation between the two
variables
Using Scatter Plot
• The linearity of the relationship between the dependent and
predictor variables of the model can be studied using scatter plots.
• No of hours   freshmen_score
• 2             55
• 2.5           62
• 3             65
• 3.5           70
• 4             77
• 4.5           82
• 5             75
• 5.5           83
• 6             85
• 6.5           88
• Study hours per student (HS$noofhours) plotted against freshmen score
(freshmen_score)
• It can be observed that the study time exhibits a
linear relationship with the freshmen score.
• Using R:
• x = 1:20
• y = x^2
• plot(lm(y~x)) # Residuals vs fitted plots
• plot(lm(dist~speed, data=cars))
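The hours/score table above can be fitted and plotted the same way; a short sketch, entering the table values by hand:

```r
# Study hours and freshmen scores from the table above.
hours <- c(2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5)
score <- c(55, 62, 65, 70, 77, 82, 75, 83, 85, 88)

# A correlation near 1 supports the linearity observation.
print(cor(hours, score))

fit <- lm(score ~ hours)
plot(hours, score, col = "blue", pch = 16,
     xlab = "No of hours", ylab = "freshmen_score")
abline(fit)
```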
Quantile-Quantile Plot
• The Quantile-Quantile plot (Q-Q plot) in the R
programming language plots the quantiles of two
variables against each other to check whether
the distributions of the two variables are
similar with respect to their
locations. The qqline() function in R is
used to draw a Q-Q line on the plot.
• R – Quantile-Quantile Plot
• Syntax: qqline(x, y, col)
• Parameters:
• x, y: X and Y coordinates of plot
• col: It defines color
• Returns: A QQ Line plot of the coordinates provided
• # Set seed for reproducibility
• set.seed(500)
•
• # Create random normally distributed values
• x <- rnorm(1200)
•
• # QQplot of normally distributed values
• qqnorm(x)
•
• # Add qqline to plot
• qqline(x, col = "darkgreen")
Implementation of QQ plot of
Logistically Distributed Values
• # Set seed for reproducibility
• set.seed(500)
•
• # Random values according to the logistic distribution
• y <- rlogis(800)
•
• # QQ plot of the logistically distributed values
• qqnorm(y)
•
• # Add qqline to plot
• qqline(y, col = "darkgreen")
The Scale-Location Plot
• The scale-location plot is very similar to residuals vs
fitted, but simplifies analysis of the homoskedasticity
assumption.
• It takes the square root of the absolute value of
standardized residuals instead of plotting the residuals
themselves.
• Recall that homoskedasticity means constant variance
in linear regression.
• More formally, in linear regression you have
y = Xβ + ε, where X is your design matrix, y is your vector of
responses, and ε is your vector of errors.
Homoskedasticity means Var(ε_i) = σ² for all i.
plot(lm(dist~speed,data=cars))
• We want to check two things:
• That the red line is approximately horizontal.
Then the average magnitude of the
standardized residuals isn’t changing much as a
function of the fitted values.
• That the spread around the red line doesn’t
vary with the fitted values. Then the variability
of magnitudes doesn’t vary much as a function
of the fitted values.
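The quantity on the scale-location plot's y-axis can be reproduced by hand with rstandard(); a minimal sketch on R's built-in cars data:

```r
model <- lm(dist ~ speed, data = cars)

# Square root of the absolute standardized residuals --
# the quantity that plot(model, which = 3) puts on the y-axis.
sl <- sqrt(abs(rstandard(model)))

plot(fitted(model), sl,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
```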
Residuals vs fitted values plots
• The fitted vs residuals plot is mainly useful for investigating:
• whether the linearity assumption holds: this is indicated by the mean
residual value being close to 0 in every region of fitted values,
shown by the red line staying close to the dashed line in the graph.
• whether the data contain outliers: this is indicated by some 'extreme'
residuals that lie far from the other residual points.
• When we can see a pattern in the graph, it indicates a violation
of linearity; in the example, the y equation is a 3rd-order polynomial
function.
• If the relationship between x and y is non-linear, the residuals
will be a non-linear function of the fitted values.
• data("cars")
• model <- lm(dist~speed, data=cars)
• plot(model, which = 1)
• The Scale-Location Plot
• The scale-location plot is very similar to residuals vs
fitted, but plots the square root of the standardized residuals
against the fitted values to verify the homoskedasticity
assumption. We want to look at:
• the red line: the red line represents the average of the
standardized residuals and must be approximately
horizontal. If the line is approximately horizontal and its
magnitude does not fluctuate much, the average of the
standardized residuals is approximately the same everywhere.
• the variance around the line: if the spread of the standardized
residuals around the red line doesn't vary with the fitted values,
the variance of the standardized residuals is approximately the
same for each fitted value, with no large fluctuations.
• modelmt <- lm(disp ~ cyl + hp, data = mtcars)
• plot(modelmt, which = 3)
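As a crude numerical companion to the plot (an illustration made here, not part of the slides or a formal test), one can split the observations at the median fitted value and compare residual variances in the two halves; wildly different variances would suggest heteroskedasticity:

```r
modelmt <- lm(disp ~ cyl + hp, data = mtcars)

# Split residuals at the median fitted value and compare variances.
f <- fitted(modelmt)
r <- residuals(modelmt)
low  <- r[f <= median(f)]
high <- r[f >  median(f)]
cat("var(low half): ", var(low),  "\n")
cat("var(high half):", var(high), "\n")

# A formal alternative is the Breusch-Pagan test (lmtest::bptest),
# which requires the lmtest package.
```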