Data is present everywhere and has become a
part of our lives.
8/12/2022
Data itself has no meaning unless it is
contextually processed into information,
from which knowledge can be derived.
• Data – raw and unprocessed; obtained from end devices.
• Information – data that has been filtered, processed,
categorized, and condensed.
• Knowledge – information organized and structured to
achieve a specific objective.
Correlation
• Correlation quantifies the degree to which two variables
are associated. It does not fit a line through the data
points; it only shows how much one variable changes as the
other changes. When it is zero, there is no linear relationship.
When it is positive, one variable goes up as the other goes up.
When it is negative, one variable goes up as the other goes down.
• The Pearson correlation measures the degree to which a set
of data points form a straight-line relationship.
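As a quick sketch (using NumPy, with the same five points as Example 1 below), the Pearson correlation can be read off the correlation matrix that np.corrcoef returns:

```python
import numpy as np

# Data points also used in Example 1 below
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is the Pearson correlation of x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.806
```

A positive r here means y tends to go up as x goes up, but r alone does not give us the fitted line — that is what regression adds.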
Introduction to Linear Regression
 Regression is a statistical procedure that determines the
equation for the straight line that best fits a specific
set of data.
 The two terms essential to understanding Regression
Analysis:
 Dependent variables - The factors that we want to understand or
predict.
 Independent variables - The factors that influence the dependent
variable.
Introduction to Linear Regression (cont.)
 Any straight line can be represented by an equation of the
form Y = mX + c, where m and c are constants.
 The value of m is called the slope constant and
determines the direction and degree to which the line
is tilted.
 The value of c is called the Y-intercept and determines the
point where the line crosses the Y-axis.
LEAST SQUARES REGRESSION
 We can place the line "by eye": try to have the line as close as
possible to all points, and a similar number of points above and
below the line.
 But for better accuracy, let's see how to calculate the line
using Least Squares Regression.
 The least square method is the process of obtaining the line of
best fit for the given data set by reducing the sum of the squares
of the offsets (residual part) of the points from the line.
LEAST SQUARES METHOD
STEPS TO CALCULATE LINE OF BEST FIT
STEP 1: Calculate the slope:
m = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²)
STEP 2: Calculate the Y-intercept:
c = (Σy − mΣx) / N
EXAMPLE 1
x:  1  2  3  4  5
y:  2  5  3  8  7

x   y   xy   x²
1   2    2    1
2   5   10    4
3   3    9    9
4   8   32   16
5   7   35   25
∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55
EXAMPLE (contd…)
Find the value of m by using the formula:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = [(5×88) − (15×25)] / [(5×55) − 15²]
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3
EXAMPLE (contd…)
Find the value of c by using the formula:
c = (∑y − m∑x)/n
c = (25 − 1.3×15)/5
c = (25 − 19.5)/5
c = 5.5/5 = 1.1
So the required least-squares line is
y = mx + c = 1.3x + 1.1.
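The worked example above can be checked in a few lines of NumPy; a degree-1 np.polyfit performs exactly this least-squares fit and returns the slope and intercept:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# Degree-1 polyfit minimizes the sum of squared offsets,
# returning [slope, intercept]
m, c = np.polyfit(x, y, 1)
print(round(m, 2), round(c, 2))  # 1.3 1.1
```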
EXAMPLE 2
Sam found how many hours of sunshine vs how many ice creams were sold at
the shop from Monday to Friday:

"x" Hours of Sunshine   "y" Ice Creams Sold
2                       4
3                       5
5                       7
7                       10
9                       15
EXAMPLE (contd…)
Here are the (x, y) points and the
line y = 1.518x + 0.305 on a graph:

x   y    ŷ = 1.518x + 0.305   error (ŷ − y)
2   4    3.34                 −0.66
3   5    4.86                 −0.14
5   7    7.89                  0.89
7   10   10.93                 0.93
9   15   13.97                −1.03

Nice fit!
EXAMPLE (contd…)
Sam hears the weather forecast, which says "we expect 8 hours
of sun tomorrow", so he uses the above equation to estimate
that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 13 ice creams,
just in case. Yum.
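Fitting Sam's data with the same least-squares formulas reproduces the line used above (a sketch with NumPy):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 9])    # hours of sunshine
y = np.array([4, 5, 7, 10, 15])  # ice creams sold

n = len(x)
# Slope and intercept from the STEP 1 / STEP 2 formulas
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
c = (np.sum(y) - m * np.sum(x)) / n
print(round(m, 3), round(c, 3))  # 1.518 0.305
print(round(m * 8 + c, 2))       # prediction for 8 hours of sun: 12.45
```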
USING SUM OF CROSS AND SQUARED
DEVIATIONS
The slope can also be written as b₁ = SS_xy / SS_xx, with b₀ = ȳ − b₁x̄,
where the sum of cross-deviations of y and x is
SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n·x̄·ȳ
and the sum of squared deviations of x is
SS_xx = Σ(xᵢ − x̄)² = Σxᵢ² − n·x̄²
PYTHON CODE
import numpy as np
import matplotlib.pyplot as plt
def estimate_coef(x, y):
# number of observations/points
n = np.size(x)
# mean of x and y vector
m_x = np.mean(x)
m_y = np.mean(y)
# calculating cross-deviation and deviation about x
SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x
# calculating regression coefficients
b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
return (b_0, b_1)
PYTHON CODE (contd…)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # show the plot
    plt.show()
PYTHON CODE (contd…)

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
OUTPUT:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
Implementation of Linear
Regression using sklearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# creating a dummy dataset
np.random.seed(10)
x = np.random.rand(50, 1)
y = 3 + 3 * x + np.random.rand(50, 1)

# scatter plot
plt.scatter(x, y, s=10)
plt.xlabel('x_dummy')
plt.ylabel('y_dummy')
plt.show()
Implementation of Linear
Regression using sklearn (contd…)

# creating the model
from sklearn.linear_model import LinearRegression

# creating an object
regressor = LinearRegression()

# training the model
regressor.fit(x, y)

# using the training dataset for prediction
pred = regressor.predict(x)
Implementation of Linear
Regression using sklearn (contd…)

# model performance
from sklearn.metrics import r2_score, mean_squared_error
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)

# best-fit line
plt.scatter(x, y)
plt.plot(x, pred, color='Black', marker='o')

# results
print("Mean Squared Error :", mse)
print("R-Squared :", r2)
print("Y-intercept :", regressor.intercept_)
print("Slope :", regressor.coef_)
OUTPUT:
R-Squared : 0.9068822972556425
Y-intercept : [3.41354381]
Slope : [[3.11024701]]
POINTS TO REMEMBER
 Least squares is sensitive to outliers: a strange value
will pull the line towards it.
 Least squares also works for non-linear data, but the
formulas (and the steps taken) will be very different!
 The difference between the actual value of y and the
predicted value of y is called the residual.
Correlation Vs Regression
• Linear regression finds the best line that predicts y from
x, but correlation does not fit a line.
• Correlation is used when we measure both variables,
while linear regression is mostly applied when x is a
variable that is manipulated.
Utility of Regression
 Used in economic and business research
 Estimation of Relationship
 Prediction
Types
 There are several types of linear regression analysis:
 Simple linear regression
 One dependent variable
 One independent variable
 Multiple linear regression
 One dependent variable
 Two or more independent variables
Multiple Regression Model
The equation that describes how the dependent
variable y is related to the independent variables x1, x2, . . . ,
xp and an error term is:
y = β0 + β1x1 + β2x2 + . . . + βpxp + e
where:
β0, β1, β2, . . . , βp are the parameters, and e is
a random variable called the error term

Assumptions about the Error Term
The error e is a random variable with a mean of zero.
The variance of e, denoted by σ², is the same for all
values of the independent variables.
The values of e are independent.
The error e is a normally distributed random variable
reflecting the deviation between the y value and the
expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.

Multiple Regression Equation
The equation that describes how the mean value
of y is related to x1, x2, . . . , xp is:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
Estimated Multiple Regression Equation
A simple random sample is used to compute sample
statistics b0, b1, b2, . . . , bp that are used as the point
estimators of the parameters β0, β1, β2, . . . , βp:
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
Estimation Process
Multiple Regression Model:
y = β0 + β1x1 + β2x2 + . . . + βpxp + e
Multiple Regression Equation:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
The unknown parameters are β0, β1, β2, . . . , βp.
Sample data (x1, x2, . . . , xp, y) are used to compute the
Estimated Multiple Regression Equation:
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
The sample statistics b0, b1, b2, . . . , bp
provide estimates of β0, β1, β2, . . . , βp.
Least Squares Method
 Least Squares Criterion:
min Σ(yᵢ − ŷᵢ)²

Computation of Coefficient Values
• The formulas for the regression coefficients b0, b1, b2, . . . , bp
involve the use of matrix algebra.
• Computer software packages are available to perform the
calculations.
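As a sketch of what those packages do under the hood, the coefficients solve the normal equations b = (XᵀX)⁻¹Xᵀy; NumPy's lstsq does this in a numerically stable way. The data values here are made up for illustration:

```python
import numpy as np

# Toy data: two predictors, five observations (made-up values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

# Prepend a column of ones so b[0] is the intercept b0
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq minimizes the sum of squared residuals
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# At the least-squares solution the residuals are orthogonal
# to every column of A (the normal equations)
resid = y - A @ b
```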
Multiple Regression Model
 Example: Programmer Salary Survey
A software firm collected data for a sample of 20
computer programmers. A suggestion was made that
regression analysis could be used to determine if salary
was related to the years of experience and the score on
the firm’s programmer aptitude test.
Exper.   Score   Salary ($1000)
4        78      24.0
7        100     43.0
1        86      23.7
5        82      34.3
8        86      35.8
10       84      38.0
0        75      22.2
1        80      23.1
6        83      30.0
6        91      33.0
9        88      38.0
2        73      26.6
10       75      36.2
5        81      31.6
6        74      29.0
8        87      34.0
4        79      30.1
6        94      33.9
3        70      28.2
3        89      30.0
Multiple Regression Model
Suppose we believe that salary (y) is related to the years of
experience (x1) and the score on the programmer aptitude
test (x2) by the following regression model:
y = β0 + β1x1 + β2x2 + e
where
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of β0, β1, β2
Input Data:
x1   x2   y
4    78   24
7    100  43
.    .    .
3    89   30
The input data are fed into a computer package for solving
multiple regression problems, which produces the least
squares output: b0, b1, b2, R², etc.
Estimated Regression Equation
SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.
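That estimated equation can be reproduced from the 20-row data table with NumPy's least-squares solver (a sketch):

```python
import numpy as np

exper = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                  88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

# Design matrix with an intercept column
A = np.column_stack([np.ones(20), exper, score])
b, *_ = np.linalg.lstsq(A, salary, rcond=None)
print([round(v, 3) for v in b])  # approximately [3.174, 1.404, 0.251]
```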
Interpreting the Coefficients
In multiple regression analysis, we interpret each
regression coefficient as follows:
bi represents an estimate of the change in y
corresponding to a 1-unit increase in xi when all
other independent variables are held constant.
b1 = 1.404: Salary is expected to increase by $1,404 for
each additional year of experience (when the score on the
programmer aptitude test is held constant).
Interpreting the Coefficients (contd…)
b2 = 0.251: Salary is expected to increase by $251 for
each additional point scored on the programmer aptitude
test (when the years of experience are held constant).

Standard Error of Estimate
s = √(SSE / (n − p − 1))
where n is the number of observations and p is the number
of independent variables.
Interpret R-squared in Regression
Analysis
 Determines how well the model fits the data
 Goodness-of-fit measure for linear regression models
 Measures the strength of the relationship between
our model and the dependent variable on a convenient
0–100% scale

Interpret R-squared in Regression
Analysis (contd…)
We need to calculate two things:
 var(avg) = ∑(yᵢ − ȳ)²
 var(model) = ∑(yᵢ − ŷᵢ)²
R² = 1 − [var(model)/var(avg)]
   = 1 − [∑(yᵢ − ŷᵢ)² / ∑(yᵢ − ȳ)²]
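That calculation can be sketched in NumPy, reusing the five points from Example 1:

```python
import numpy as np

x = np.arange(1, 6)                      # x = 1..5
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])
m, c = np.polyfit(x, y, 1)               # least-squares fit
y_hat = m * x + c

var_model = np.sum((y - y_hat) ** 2)      # sum of squared residuals
var_avg = np.sum((y - np.mean(y)) ** 2)   # total squared deviation from the mean
r2 = 1 - var_model / var_avg
print(round(r2, 3))  # 0.65
```

Note that for simple linear regression this R² equals the square of the Pearson correlation (0.806² ≈ 0.65).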
Limitations of R-squared
 R-squared cannot be used to check whether the
coefficient estimates and predictions are biased.
 R-squared does not indicate whether the regression model
has an adequate fit.
Multiple Coefficient of Determination
R² = SSR/SST

Relationship Among SST, SSR, SSE
SST = SSR + SSE
∑(yᵢ − ȳ)² = ∑(ŷᵢ − ȳ)² + ∑(yᵢ − ŷᵢ)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
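The identity can be checked numerically (a sketch reusing a simple one-predictor fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])
m, c = np.polyfit(x, y, 1)
y_hat = m * x + c

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares due to error

# For a least-squares fit with an intercept, SST = SSR + SSE
print(np.isclose(sst, ssr + sse))
```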
Testing for Significance
In multiple regression, the F and t tests have different
purposes. (In simple linear regression, the F and t tests
provide the same conclusion.)

Testing for Significance: F Test
The F test is referred to as the test for overall significance.
It is used to determine whether a significant relationship
exists between the dependent variable and the set of all
the independent variables.

Testing for Significance: t Test
A separate t test is conducted for each of the independent
variables in the model; we refer to each of these t tests as
a test for individual significance.
If the F test shows overall significance, the t tests are
used to determine whether each of the individual
independent variables is significant.
Testing for Significance: Multicollinearity
The term multicollinearity refers to the correlation
among the independent variables.
When the independent variables are highly correlated
(say, |r | > .7), it is not possible to determine the
separate effect of any particular independent variable
on the dependent variable.
Testing for Significance: Multicollinearity (contd…)
Every attempt should be made to avoid including
independent variables that are highly correlated.
If the estimated regression equation is to be used only
for predictive purposes, multicollinearity is usually
not a serious problem.
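One quick screen for multicollinearity is the pairwise correlation between predictors, using the |r| > .7 rule of thumb from the slide above (a sketch with made-up, deliberately near-collinear data):

```python
import numpy as np

x1 = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6], dtype=float)
# x2 is nearly collinear with x1 by construction (2*x1 plus small noise)
x2 = 2 * x1 + np.array([1, -1, 0, 2, -2, 1, 0, -1, 2, 0], dtype=float)

r = np.corrcoef(x1, x2)[0, 1]
if abs(r) > 0.7:
    print("highly correlated predictors, r =", round(r, 2))
```

In practice the variance inflation factor (VIF) is a more thorough diagnostic, but pairwise correlations are a reasonable first check.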
Using the Estimated Regression Equation
for Estimation and Prediction
The procedures for estimating the mean value of y
and predicting an individual value of y in multiple
regression are similar to those in simple regression.
We substitute the given values of x1, x2, . . . , xp into
the estimated regression equation and use the
resulting value of ŷ as the point estimate.
Qualitative Independent Variables
In many situations we must work with qualitative
independent variables such as gender (male, female),
method of payment (cash, check, credit card), etc.
For example, x2 might represent gender, where x2 = 0
indicates male and x2 = 1 indicates female.
In this case, x2 is called a dummy or indicator variable.
Qualitative Independent Variables
 Example: Programmer Salary Survey
As an extension of the problem involving the computer
programmer salary survey, suppose that management also
believes that the annual salary is related to whether the
individual has a graduate degree in computer science or
information systems. The years of experience, the score on the
programmer aptitude test, whether the individual has a
relevant graduate degree, and the annual salary ($1000) for
each of the sampled 20 programmers are shown on the next
slide.
Exper.   Score   Degr.   Salary ($1000)
4        78      No      24.0
7        100     Yes     43.0
1        86      No      23.7
5        82      Yes     34.3
8        86      Yes     35.8
10       84      Yes     38.0
0        75      No      22.2
1        80      No      23.1
6        83      No      30.0
6        91      Yes     33.0
9        88      Yes     38.0
2        73      No      26.6
10       75      Yes     36.2
5        81      No      31.6
6        74      No      29.0
8        87      Yes     34.0
4        79      No      30.1
6        94      Yes     33.9
3        70      No      28.2
3        89      No      30.0
Qualitative Independent Variables
Estimated Regression Equation
ŷ = b0 + b1x1 + b2x2 + b3x3
where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if the individual does not have a graduate degree,
     1 if the individual does have a graduate degree
x3 is a dummy variable.
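Fitting that equation with the dummy variable is a one-line change to the earlier least-squares sketch: encode Yes/No as 1/0 and add the column to the design matrix. (The slide does not quote the resulting coefficients, so none are asserted here; b[3] is whatever the solver returns.)

```python
import numpy as np

exper = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                  88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
degree = np.array([0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
                   1, 0, 1, 0, 0, 1, 0, 1, 0, 0], dtype=float)  # Yes=1, No=0
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

A = np.column_stack([np.ones(20), exper, score, degree])
b, *_ = np.linalg.lstsq(A, salary, rcond=None)
# b[3] estimates the salary difference associated with holding a graduate degree,
# holding experience and aptitude score constant
resid = salary - A @ b
```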
Regression to the Mean
The Simple Explanation...
When you select a group from the extreme end of a
distribution, the group will do better on a subsequent
measure: the selected group's mean moves from where it
would have been with no regression toward the overall
mean. The group mean on the first measure appears to
"regress toward the mean" of the second measure.
[Figure: two distributions, marking the selected group's
mean and the overall mean on the first and second measures.]
Example I:
If the first measure is a pretest and you select the low
scorers, and the second measure is a posttest, regression
to the mean will make it appear as though the group gained
from pre to post. This apparent gain is a pseudo-effect.

Example II:
If the first measure is a pretest and you select the high
scorers, and the second measure is a posttest, regression
to the mean will make it appear as though the group lost
from pre to post. This apparent loss is also a pseudo-effect.
Some Facts
• This is purely a statistical phenomenon.
• This is a group phenomenon.
• Some individuals will move opposite to this group
trend.
Why Does It Happen?
• Regression artifacts occur whenever you sample
asymmetrically from a distribution.
• Regression artifacts occur with any two variables
(not just pre and posttest) and even backwards in
time!
What Does It Depend On?
The absolute amount of regression to the
mean depends on two factors:
 The degree of asymmetry (i.e., how far
from the overall mean of the first measure
the selected group's mean is)
 The correlation between the two
measures
A Simple Formula
The percent of regression to the mean is
Prm = 100(1 - r)
Where r is the correlation between the two measures.
For Example:
• If r = 1, there is no (i.e., 0%) regression to the mean.
• If r = 0, there is 100% regression to the mean.
• If r = .2, there is 80% regression to the mean.
• If r = .5, there is 50% regression to the mean.
Prm = 100(1 - r)
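The formula is straightforward to express in code (a sketch, reproducing the four cases above):

```python
def percent_regression_to_mean(r):
    """Percent of regression to the mean, Prm = 100 * (1 - r)."""
    return 100 * (1 - r)

for r in (1.0, 0.0, 0.2, 0.5):
    print(r, percent_regression_to_mean(r))
```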
Example
Assume a standardized test with an overall mean of 50.
You give your program to the lowest scorers, and their
pretest mean is 30. Assume that the correlation of
pre-post is .5. The formula gives
Prm = 100(1 − r)
    = 100(1 − .5)
    = 50%
Therefore the group's mean will regress 50% of the way
from 30 back toward 50, leaving a posttest mean of 40
and a 10-point pseudo-gain (a pseudo-effect).
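Numerically (a sketch; the expected posttest group mean is the overall mean plus r times the group's pretest deviation):

```python
overall_mean = 50.0
group_pre_mean = 30.0
r = 0.5

# The group keeps r of its deviation and regresses (1 - r) of it
# back toward the overall mean
post_mean = overall_mean + r * (group_pre_mean - overall_mean)
pseudo_gain = post_mean - group_pre_mean
print(post_mean, pseudo_gain)  # 40.0 10.0
```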
ANY QUERIES???
THANK YOU
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Regression

  • 1.
  • 2. Data is present everywhere and has become a part of our lives. (8/12/2022)
  • 3. Data itself has no meaning unless it is contextually processed into information, from which knowledge can be derived. Data: raw and unprocessed, obtained from end devices. Information: data that has been filtered, processed, categorized, and condensed. Knowledge: information organized and structured to achieve a specific objective.
  • 4. Correlation • Correlation quantifies the degree to which two variables are associated; it does not fit a line through the data points. It shows how much one variable tends to change when the other one does. When it is zero, there is no linear relationship. When it is positive, one variable tends to increase as the other increases. When it is negative, one variable tends to decrease as the other increases. • The Pearson correlation measures the degree to which a set of data points fall along a straight line.
  • 5. 5 Introduction to Linear Regression  Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.  The two terms essential to understanding Regression Analysis:  Dependent variables - The factors that we want to understand or predict.  Independent variables - The factors that influence the dependent variable.
  • 6. 6 Introduction to Linear Regression (cont.)  Any straight line can be represented by an equation of the form Y = mX + c, where m and c are constants.  The value of m is called the slope and determines the direction and degree to which the line is tilted.  The value of c is called the Y-intercept and determines the point where the line crosses the Y-axis.
  • 8. LEAST SQUARES REGRESSION  We can place the line "by eye": try to have the line as close as possible to all points, with a similar number of points above and below the line.  But for better accuracy, let's see how to calculate the line using Least Squares Regression.  The least squares method obtains the line of best fit for a given data set by minimizing the sum of the squares of the offsets (residuals) of the points from the line.
  • 10. STEPS TO CALCULATE LINE OF BEST FIT
  • 12. STEP 2: c = (Σy − m·Σx) / N
  • 13. EXAMPLE 1
      x: 1 2 3 4 5
      y: 2 5 3 8 7

      x | y | xy | x²
      1 | 2 |  2 |  1
      2 | 5 | 10 |  4
      3 | 3 |  9 |  9
      4 | 8 | 32 | 16
      5 | 7 | 35 | 25
      ∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55
  • 14. EXAMPLE (contd…) Find the value of m by using the formula m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²): m = [(5×88) − (15×25)] / [(5×55) − 15²] = (440 − 375) / (275 − 225) = 65/50 = 1.3. (∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55)
  • 15. EXAMPLE (contd…) Find the value of c by using the formula c = (∑y − m∑x) / n: c = (25 − 1.3×15)/5 = (25 − 19.5)/5 = 5.5/5 = 1.1. So the required least-squares equation is y = mx + c = 1.3x + 1.1. (∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55)
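The arithmetic in this worked example can be checked in NumPy (a quick sketch, not part of the original deck); np.polyfit with degree 1 performs the same least-squares fit:

```python
import numpy as np

# Data from Example 1
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# Normal-equation formulas from the slides
n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n
print(m, c)  # 1.3 1.1

# Cross-check with NumPy's built-in degree-1 least-squares fit
m2, c2 = np.polyfit(x, y, 1)
assert np.isclose(m, m2) and np.isclose(c, c2)
```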
  • 16. EXAMPLE 2 Sam recorded how many hours of sunshine there were vs. how many ice creams were sold at the shop from Monday to Friday:
      x (hours of sunshine): 2 3 5 7 9
      y (ice creams sold):   4 5 7 10 15
  • 17. EXAMPLE (contd…) Here are the (x, y) points and the line y = 1.518x + 0.305 on a graph:
      x | y  | ŷ = 1.518x + 0.305 | error (ŷ − y)
      2 |  4 |  3.34 | −0.66
      3 |  5 |  4.86 | −0.14
      5 |  7 |  7.89 |  0.89
      7 | 10 | 10.93 |  0.93
      9 | 15 | 13.97 | −1.03
      Nice fit!
  • 18. EXAMPLE (contd…) Sam hears the weather forecast, which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell y = 1.518 × 8 + 0.305 ≈ 12.45 ice creams. Sam makes fresh waffle cone mixture for 13 ice creams, just in case. Yum.
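Sam's fit and forecast can be reproduced the same way (a sketch for checking the slide's numbers, not part of the original deck):

```python
import numpy as np

# Hours of sunshine vs. ice creams sold, Monday to Friday
x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

# Degree-1 least-squares fit gives the slide's line
m, c = np.polyfit(x, y, 1)
print(round(float(m), 3), round(float(c), 3))  # 1.518 0.305

# Forecast for 8 hours of sun tomorrow
print(round(float(m * 8 + c), 2))  # 12.45
```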
  • 19. USING SUM OF CROSS AND SQUARED DEVIATIONS The coefficients can also be written as m = SS_xy / SS_xx and c = ȳ − m·x̄, where the sum of cross-deviations of y and x is SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n·x̄·ȳ, and the sum of squared deviations of x is SS_xx = Σ(xᵢ − x̄)² = Σxᵢ² − n·x̄².
  • 20. PYTHON CODE
      import numpy as np
      import matplotlib.pyplot as plt

      def estimate_coef(x, y):
          # number of observations/points
          n = np.size(x)
          # mean of x and y vector
          m_x = np.mean(x)
          m_y = np.mean(y)
          # calculating cross-deviation and deviation about x
          SS_xy = np.sum(y*x) - n*m_y*m_x
          SS_xx = np.sum(x*x) - n*m_x*m_x
          # calculating regression coefficients
          b_1 = SS_xy / SS_xx
          b_0 = m_y - b_1*m_x
          return (b_0, b_1)
  • 21. PYTHON CODE
      def plot_regression_line(x, y, b):
          # plotting the actual points as scatter plot
          plt.scatter(x, y, color="m", marker="o", s=30)
          # predicted response vector
          y_pred = b[0] + b[1]*x
          # plotting the regression line
          plt.plot(x, y_pred, color="g")
          # putting labels
          plt.xlabel('x')
          plt.ylabel('y')
          # function to show plot
          plt.show()
  • 22. PYTHON CODE
      def main():
          # observations / data
          x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
          y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
          # estimating coefficients
          b = estimate_coef(x, y)
          print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
          # plotting regression line
          plot_regression_line(x, y, b)

      if __name__ == "__main__":
          main()

      OUTPUT:
      Estimated coefficients:
      b_0 = -0.0586206896552
      b_1 = 1.45747126437
  • 23. Implementation of Linear Regression using sklearn
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt

      # creating a dummy dataset
      np.random.seed(10)
      x = np.random.rand(50, 1)
      y = 3 + 3 * x + np.random.rand(50, 1)

      # scatterplot
      plt.scatter(x, y, s=10)
      plt.xlabel('x_dummy')
      plt.ylabel('y_dummy')
      plt.show()
  • 24. Implementation of Linear Regression using sklearn
      # creating a model
      from sklearn.linear_model import LinearRegression

      # creating an object
      regressor = LinearRegression()

      # training the model
      regressor.fit(x, y)

      # using the training dataset for the prediction
      pred = regressor.predict(x)
  • 25. Implementation of Linear Regression using sklearn
      # model performance
      from sklearn.metrics import r2_score, mean_squared_error
      mse = mean_squared_error(y, pred)
      r2 = r2_score(y, pred)

      # best fit line
      plt.scatter(x, y)
      plt.plot(x, pred, color='Black', marker='o')

      # results
      print("Mean Squared Error : ", mse)
      print("R-Squared : ", r2)
      print("Y-intercept : ", regressor.intercept_)
      print("Slope : ", regressor.coef_)

      OUTPUT:
      R-Squared : 0.9068822972556425
      Y-intercept : [3.41354381]
      Slope : [[3.11024701]]
  • 26. POINTS TO REMEMBER  Least squares is sensitive to outliers: a strange value will pull the line towards it.  The method can also be applied to non-linear data, but the formulas (and the steps taken) will be very different!  The difference between the actual value of y and the predicted value of y is called the residual.
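The outlier sensitivity noted above is easy to demonstrate (an illustrative sketch with made-up data, not part of the deck):

```python
import numpy as np

# A clean linear trend: y = 2x
x = np.arange(10, dtype=float)
y = 2 * x
m_clean, _ = np.polyfit(x, y, 1)

# Append one strange value far above the trend and refit
x_out = np.append(x, 9.0)
y_out = np.append(y, 100.0)
m_out, _ = np.polyfit(x_out, y_out, 1)

# The single outlier pulls the fitted line toward itself
print(m_clean, m_out)
assert m_out > m_clean
```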
  • 27. • Linear regression finds the best line that predicts y from x, but Correlation does not fit a line. • Correlation is used when we measure both variables, while linear regression is mostly applied when x is a variable that is manipulated. Correlation Vs Regression
  • 28. Utility of Regression  Used in economic and business research  Estimation of Relationship  Prediction
  • 29. Types  There are two main types of linear regression analysis:
      Simple linear regression: one dependent variable, one independent variable
      Multiple linear regression: one dependent variable, two or more independent variables
  • 30. Multiple Regression Model The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε where: β0, β1, β2, . . . , βp are the parameters, and ε is a random variable called the error term
  • 31. Assumptions about the Error Term The error ε is a random variable with a mean of zero. The variance of ε, denoted by σ², is the same for all values of the independent variables. The values of ε are independent. The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
  • 32. Multiple Regression Equation The equation that describes how the mean value of y is related to x1, x2, . . . , xp is: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
  • 33. Estimated Multiple Regression Equation A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp: ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
  • 34. Estimation Process
      Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε, with unknown parameters β0, β1, β2, . . . , βp
      Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
      Sample Data: x1, x2, . . . , xp and y
      Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
      The sample statistics b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp
  • 35. Least Squares Method  Least Squares Criterion: min Σ(yᵢ − ŷᵢ)²  Computation of Coefficient Values: • The formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. • Computer software packages are available to perform the calculations.
  • 36.  Example: Programmer Salary Survey Multiple Regression Model A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test.
  • 38. Multiple Regression Model Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model: y = β0 + β1x1 + β2x2 + ε, where y = annual salary ($1000), x1 = years of experience, x2 = score on the programmer aptitude test
  • 39. Solving for the Estimates of b0, b1, b2
      Input Data:
      x1 | x2  | y
       4 |  78 | 24
       7 | 100 | 43
       . |   . |  .
       3 |  89 | 30
      → Computer Package for Solving Multiple Regression Problems → Least Squares Output: b0 = , b1 = , b2 = , R² = , etc.
  • 40. Estimated Regression Equation SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE) Note: Predicted salary will be in thousands of dollars.
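As a quick illustration of using this equation (the helper function is ours, not part of the deck; the sample point of 4 years' experience and a test score of 78 is taken from the input-data slide):

```python
# Estimated equation from the slides, salary in thousands of dollars:
# SALARY = 3.174 + 1.404*(EXPER) + 0.251*(SCORE)

def predict_salary(exper, score):
    """Point estimate of annual salary ($1000s) from the fitted equation."""
    return 3.174 + 1.404 * exper + 0.251 * score

# e.g. 4 years of experience, aptitude score of 78
print(round(predict_salary(4, 78), 3))  # 28.368 -> about $28,368
```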
  • 41. Interpreting the Coefficients In multiple regression analysis, we interpret each regression coefficient as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.
  • 42. Interpreting the Coefficients b1 = 1.404: Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on the programmer aptitude test is held constant).
  • 43. Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant). b2 = 0.251 Interpreting the Coefficients
  • 44. Standard Error of Estimate The standard error of the estimate is s = √(SSE / (n − p − 1)), the square root of the mean squared error, where n is the number of observations and p is the number of independent variables; it estimates the standard deviation σ of the error term.
  • 45. Interpret R-squared in Regression Analysis  Determines how well the model fits the data  Goodness-of-fit measure for linear regression models  Measures the strength of the relationship between our model and the dependent variable on a convenient 0–100% scale
  • 46. Interpret R-squared in Regression Analysis (contd…) We need to calculate two things:  var(avg) = Σ(yᵢ − ȳ)²  var(model) = Σ(yᵢ − ŷᵢ)² R² = 1 − [var(model)/var(avg)] = 1 − [Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²]
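The two quantities above can be sketched directly in NumPy (the helper is ours, not from the deck):

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - var(model)/var(avg), as defined on the slide."""
    var_model = np.sum((y - y_pred) ** 2)    # sum of squared residuals
    var_avg = np.sum((y - np.mean(y)) ** 2)  # total variation about the mean
    return 1 - var_model / var_avg

# Sanity checks on the definition
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])
print(r_squared(y, y))                          # 1.0: a perfect fit
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0: no better than the mean
```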
  • 47. Limitations of R-squared  R-squared cannot be used to check if the coefficient estimates and predictions are biased or not.  R-squared does not inform if the regression model has an adequate fit or not.
  • 48. Multiple Coefficient of Determination Relationship Among SST, SSR, SSE: SST = SSR + SSE, i.e. Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)² where: SST = total sum of squares, SSR = sum of squares due to regression, SSE = sum of squares due to error
  • 49. Testing for Significance: F Test F test is referred to as the test for overall significance. The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
  • 50. Testing for Significance: t Test A separate t test is conducted for each of the independent variables in the model. If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant. We refer to each of these t tests as a test for individual significance.
  • 51. In simple linear regression, the F and t tests provide the same conclusion. Testing for Significance In multiple regression, the F and t tests have different purposes.
  • 52. Testing for Significance: Multicollinearity The term multicollinearity refers to the correlation among the independent variables. When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
  • 53. Testing for Significance: Multicollinearity Every attempt should be made to avoid including independent variables that are highly correlated. If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
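One simple way to screen for the problem is a correlation matrix of the predictors; a hedged sketch on synthetic data (ours, not from the deck), using the |r| > .7 rule of thumb from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly a rescaled copy of x1
x3 = rng.normal(size=100)                      # unrelated predictor

# Pairwise correlations among the three predictors (columns)
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

# Flag predictor pairs with |r| > .7
print(abs(corr[0, 1]) > 0.7)  # True: x1 and x2 are highly correlated
print(abs(corr[0, 2]) > 0.7)  # False: x1 and x3 are not
```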
  • 54. Using the Estimated Regression Equation for Estimation and Prediction The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression. We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the resulting value of ŷ as the point estimate.
  • 55. In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc. For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female. Qualitative Independent Variables In this case, x2 is called a dummy or indicator variable.
  • 56. Qualitative Independent Variables  Example: Programmer Salary Survey As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems. The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the sampled 20 programmers are shown on the next slide.
  • 58. Estimated Regression Equation ŷ = b0 + b1x1 + b2x2 + b3x3 where: y = annual salary ($1000), x1 = years of experience, x2 = score on programmer aptitude test, x3 = 0 if the individual does not have a graduate degree, 1 if the individual does have a graduate degree; x3 is a dummy variable
  • 60.–66. The Simple Explanation... When you select a group from the extreme end of a distribution, the group will do better on a subsequent measure. The group mean on the first measure appears to "regress toward the mean" of the second measure. (Figure: two distributions annotated with the selected group's mean, the overall mean, where the group mean would have been with no regression, and the amount of regression to the mean.)
  • 67.–69. Example I: If the first measure is a pretest and you select the low scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group gained from pre to post. (Figure: pretest and posttest distributions; the apparent gain is labeled "Pseudo-effect".)
  • 70.–72. Example II: If the first measure is a pretest and you select the high scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group lost from pre to post. (Figure: pretest and posttest distributions; the apparent loss is labeled "Pseudo-effect".)
  • 73. Some Facts • This is purely a statistical phenomenon. • This is a group phenomenon. • Some individuals will move opposite to this group trend.
  • 74. Why Does It Happen? • Regression artifacts occur whenever you sample asymmetrically from a distribution. • Regression artifacts occur with any two variables (not just pre and posttest) and even backwards in time!
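The artifact can be demonstrated with a small simulation (our sketch, not from the deck; we assume a pre–post correlation of .5 and no true change between measures):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Pre and post measures: both mean 50, sd 10, correlation 0.5, no true change
pre = rng.normal(50, 10, n)
post = 50 + 0.5 * (pre - 50) + rng.normal(0, 10 * np.sqrt(0.75), n)

# Sample asymmetrically: select the low scorers on the pretest
low = pre < 40
print(pre[low].mean())   # well below 50
print(post[low].mean())  # closer to 50, despite no real improvement
assert post[low].mean() > pre[low].mean()
```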
  • 75. What Does It Depend On? The absolute amount of regression to the mean depends on two factors:  The degree of asymmetry (i.e., how far from the overall mean of the first measure the selected group's mean is)  The correlation between the two measures
  • 76. A Simple Formula The percent of regression to the mean is Prm = 100(1 - r) Where r is the correlation between the two measures.
  • 77. For Example: • If r = 1, there is no (i.e., 0%) regression to the mean. • If r = 0, there is 100% regression to the mean. • If r = .2, there is 80% regression to the mean. • If r = .5, there is 50% regression to the mean. Prm = 100(1 - r)
  • 78.–83. Example Assume a standardized test with a mean of 50. You give your program to the lowest scorers, and their mean is 30. Assume that the pre–post correlation is .5. The formula gives Prm = 100(1 − r) = 100(1 − .5) = 50%. Therefore the mean will regress up 50% of the way (from 30 toward 50), leaving a final mean of 40 and a 10-point pseudo-gain (the "Pseudo-effect").
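The slide's arithmetic can be wrapped in a small helper (ours, for illustration):

```python
def regressed_mean(group_mean, overall_mean, r):
    """Expected second-measure mean: the group moves (1 - r)
    of the distance back toward the overall mean."""
    prm = 100 * (1 - r)  # percent regression to the mean
    return group_mean + (prm / 100) * (overall_mean - group_mean)

# Slide example: overall mean 50, selected group mean 30, r = .5
print(regressed_mean(30, 50, 0.5))  # 40.0 -> a 10-point pseudo-gain

# Edge cases from the Prm = 100(1 - r) slide
assert regressed_mean(30, 50, 1.0) == 30  # r = 1: no regression
assert regressed_mean(30, 50, 0.0) == 50  # r = 0: full regression
```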