Linear Regression
Regression analysis is a widely used
statistical tool for establishing a relationship model
between two variables.
One of these variables is called the predictor variable,
whose values are gathered through experiments.
The other variable is called the response variable,
whose values are derived from the predictor
variable.
• In Linear Regression these two variables are related
through an equation in which the exponent (power) of both
variables is 1.
• Mathematically a linear relationship represents a straight
line when plotted as a graph.
• A non-linear relationship where the exponent of any
variable is not equal to 1 creates a curve.
• The general mathematical equation for a linear
regression is −
• y = ax + b
• Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
• Steps to Establish a Regression
• A simple example of regression is predicting the weight of a
person when their height is known. To do this we need
the relationship between the height and weight of a
person.
• The steps to create the relationship are −
• Carry out the experiment of gathering a sample of
observed values of height and corresponding weight.
• Create a relationship model using the lm() functions in R.
• Find the coefficients from the model created and form
the mathematical equation using them.
• Get a summary of the relationship model to know the
average error in prediction, also called the residuals.
• To predict the weight of new persons, use
the predict() function in R.
• Input Data
• Below is the sample data representing the observations −
• # Values of height
• 151, 174, 138, 186, 128, 136, 179, 163, 152, 131
• # Values of weight.
• 63, 81, 56, 91, 47, 57, 76, 72, 62, 48
• lm() Function
• This function creates the relationship model between the
predictor and the response variable.
• Syntax
• The basic syntax for lm() function in linear regression is −
• lm(formula,data)
• Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y.
• data is the data frame on which the formula will be applied.
• x <- c(151, 174, 138, 186, 128, 136, 179, 163,
152, 131)
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• # Apply the lm() function.
• relation <- lm(y~x)
• print(relation)
• O/P
• Call: lm(formula = y ~ x)
• Coefficients:
  (Intercept)            x
     -38.4551       0.6746
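The intercept and slope printed above can also be extracted programmatically with coef() (base R); a minimal sketch using the same sample data:

```r
# Rebuild the model from the sample data.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# coef() returns the named vector of fitted coefficients.
b <- coef(relation)["(Intercept)"]  # intercept (b in y = ax + b)
a <- coef(relation)["x"]            # slope (a in y = ax + b)
cat("y =", round(a, 4), "* x +", round(b, 4), "\n")
```

This reproduces the equation y = 0.6746 * x + -38.4551 from the printed output.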
• Get the Summary of the Relationship
• x <- c(151, 174, 138, 186, 128, 136, 179, 163,
152, 131)
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• # Apply the lm() function.
• relation <- lm(y~x)
• print(summary(relation))
• predict() Function
• Syntax
• The basic syntax for predict() in linear regression
is −
• predict(object, newdata)
• Following is the description of the parameters used −
• object is the model which has already been created
using the lm() function.
• newdata is the data frame containing the new values
for the predictor variable.
• Predict the weight of new persons
• # The predictor vector.
• x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
• # The response vector.
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• # Apply the lm() function.
• relation <- lm(y~x)
• # Find weight of a person with height 170.
• a <- data.frame(x = 170)
• result <- predict(relation,a)
• print(result)
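predict() also accepts several new values at once: pass a data frame with one row per person, keeping the column name equal to the predictor name. A sketch (the heights 160 and 170 are chosen for illustration):

```r
# Same sample data and model as above.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# One row per new person; column name must match the predictor name "x".
new_heights <- data.frame(x = c(160, 170))
predicted <- predict(relation, new_heights)
print(predicted)
```

The second value matches the single-height example (about 76.22 kg for 170 cm).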
Visualize the Regression Graphically
• Create the predictor and response variable.
• x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
• y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
• relation <- lm(y~x)
• # Give the chart file a name.
• png(file = "linearregression.png")
• # Plot the chart with the regression line.
• plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")
• dev.off()
Regression assumptions
• Linear regression makes several assumptions
about the data, such as :
• Linearity of the data. The relationship between
the predictor (x) and the outcome (y) is assumed
to be linear.
• Normality of residuals. The residual errors are
assumed to be normally distributed.
• Homogeneity of residuals variance. The residuals
are assumed to have a constant variance
(homoscedasticity).
• Independence of residual error terms.
• Assumptions about the form of the model
• Assumptions about the errors
• Assumptions about the predictors
• The predictor variables x1, x2, ..., xn are
assumed to be linearly independent of each
other. If this assumption is violated, the
problem is called a collinearity problem.
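A quick first check for collinearity is the pairwise correlation matrix of the predictors; a sketch on the built-in mtcars data (the variables are chosen only for illustration):

```r
# Pairwise correlations among candidate predictors in mtcars.
predictors <- mtcars[, c("cyl", "disp", "hp", "wt")]
print(round(cor(predictors), 2))
# Correlations near +/-1 (e.g. cyl vs disp) signal collinearity:
# the affected coefficients become unstable and hard to interpret.
```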
Validating linear assumptions
• Step 1 - Install the necessary libraries
• install.packages("ggplot2")
• install.packages("dplyr")
• library(ggplot2)
• library(dplyr)
• Step 2 - Read a csv file and explore the data
• data <- read.csv("/content/Data_1.csv")
• head(data) # head() returns the top 6 rows of the
dataframe
• summary(data) # returns the statistical summary of the
data columns
• plot(data$Width,data$Cost) #the plot() gives a visual
representation of the relation between the variable Width
and Cost
• cor(data$Width,data$Cost) # correlation between the two
variables
Using Scatter Plot
• The linearity of the relationship between the dependent and
predictor variables of the model can be studied using scatter plots
• No. of hours   freshmen_score
• 2.0            55
• 2.5            62
• 3.0            65
• 3.5            70
• 4.0            77
• 4.5            82
• 5.0            75
• 5.5            83
• 6.0            85
• 6.5            88
• Plot study hours (HS$noofhours) against freshmen_score.
• It can be observed that study time exhibits a roughly
linear relationship with the freshmen score.
• Using R
• x <- 1:20
• y <- x^2
• plot(lm(y~x))  # produces the diagnostic plots, including Residuals vs Fitted
• plot(lm(dist~speed,data=cars))
Quantile-Quantile Plot
• A Quantile-Quantile plot (Q-Q plot) plots the
quantiles of two variables (or of a sample against a
theoretical distribution) against each other, in order
to check whether the two distributions are similar
with respect to their locations. The qqline() function
in R is used to add a reference line to a Q-Q plot.
• R – Quantile-Quantile Plot
• Syntax: qqline(y, col)
• Parameters:
• y: the sample data for which the reference line is computed
• col: the line colour
• Returns: adds a Q-Q reference line to the current Q-Q plot
• # Set seed for reproducibility
• set.seed(500)
•
• # Create random normally distributed values
• x <- rnorm(1200)
•
• # QQplot of normally distributed values
• qqnorm(x)
•
• # Add qqline to plot
• qqline(x, col = "darkgreen")
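The visual Q-Q check can be complemented with a formal normality test; shapiro.test() (base R, stats package) is one common choice. A sketch on the same simulated data:

```r
# Same simulated normal sample as in the Q-Q plot example.
set.seed(500)
x <- rnorm(1200)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a large p-value is consistent with normality.
res <- shapiro.test(x)
print(res)
```

A W statistic close to 1 indicates the sample is close to normal.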
Implementation of QQplot of
Logistically Distributed Values
• # Set seed for reproducibility
• set.seed(500)
•
• # Random values according to the logistic distribution
• y <- rlogis(800)
•
• # QQplot of logistically distributed values
• qqnorm(y)
•
• # Add qqline to plot
• qqline(y, col = "darkgreen")
The Scale Location Plot
• The scale-location plot is very similar to residuals vs
fitted, but simplifies analysis of the homoskedasticity
assumption.
• It takes the square root of the absolute value of
standardized residuals instead of plotting the residuals
themselves.
• Recall that homoskedasticity means constant variance
in linear regression.
• More formally, in linear regression you have
y = Xβ + ε, where X is your design matrix, y is your
vector of responses, and ε is your vector of errors;
homoskedasticity says Var(ε) = σ²I.
plot(lm(dist~speed,data=cars))
• We want to check two things:
• That the red line is approximately horizontal.
Then the average magnitude of the
standardized residuals isn’t changing much as a
function of the fitted values.
• That the spread around the red line doesn’t
vary with the fitted values. Then the variability
of magnitudes doesn’t vary much as a function
of the fitted values.
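The two quantities behind the scale-location plot can be computed by hand, which makes the construction explicit; a sketch using the built-in cars data (rstandard() gives the standardized residuals):

```r
model <- lm(dist ~ speed, data = cars)

# Scale-location ingredients: fitted values on the x-axis,
# sqrt of |standardized residuals| on the y-axis.
fitted_vals <- fitted(model)
scale_resid <- sqrt(abs(rstandard(model)))

plot(fitted_vals, scale_resid,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)",
     main = "Scale-Location (by hand)")
# The red trend line plays the role of the red line in plot(model, which = 3).
lines(lowess(fitted_vals, scale_resid), col = "red")
```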
Residuals vs fitted values plots
• The fitted vs residuals plot is mainly useful for investigating:
• whether the linearity assumption holds: this is indicated by the mean
residual value being close to 0 in every fitted-value region,
i.e. the red line staying close to the dashed horizontal line in
the graph.
• whether the data contain outliers: this is indicated by some 'extreme'
residuals that lie far from the other residual points.
• if a pattern is visible in the graph, it indicates a violation
of linearity; for example, when y is a 3rd-order polynomial
function of x, the residuals show a clear curve.
• if the relationship between x and y is non-linear, the residuals
will be a non-linear function of the fitted values.
• data("cars")
• model <- lm(dist~speed,data=cars)
• plot(model,which = 1)
• The Scale Location Plot
• The scale-location plot is very similar to residuals vs
fitted, but plots the square root of the standardized
residuals against the fitted values to verify the
homoskedasticity assumption. We want to look at:
• the red line: it represents the average of the
standardized residuals and should be approximately
horizontal; if it is, with little fluctuation in its
magnitude, the average standardized residual is
roughly constant across fitted values.
• variance around the line: the spread of standardized
residuals around the red line should not vary with
the fitted values, meaning the variance of the
standardized residuals is approximately the same at
each fitted value.
• modelmt <- lm(disp ~ cyl + hp, data = mtcars)
• plot(modelmt, which = 3)
