2. Data is present everywhere and has become a part of our life.
3. Data itself has no meaning unless it is contextually processed into information, from which knowledge can be derived.
Data – raw and unprocessed; obtained from end devices.
Information – data that has been filtered, processed, categorized and condensed.
Knowledge – information organized and structured to achieve a specific objective.
[Figure: the Data → Information → Knowledge hierarchy]
4. Correlation
• Correlation quantifies the degree to which two variables are associated. It does not fit a line through the data points. It tells you how much one variable tends to change when the other one does. When the correlation is zero, there is no relationship. When it is positive, one variable goes up as the other goes up. When it is negative, one variable goes down as the other goes up.
• The Pearson correlation measures the degree to which a set of data points form a straight-line relationship.
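As a quick numeric illustration, Pearson's r can be computed with numpy (a minimal sketch, using the x and y data from Example 1 below):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(r)  # ~0.81: a fairly strong positive association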
5. Introduction to Linear Regression
Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.
Two terms are essential to understanding regression analysis:
Dependent variable – the factor that we want to understand or predict.
Independent variables – the factors that influence the dependent variable.
6. Introduction to Linear Regression (cont.)
Any straight line can be represented by an equation of the form Y = mX + c, where m and c are constants.
The value of m is called the slope and determines the direction and degree to which the line is tilted.
The value of c is called the Y-intercept and determines the point where the line crosses the Y-axis.
8. LEAST SQUARES REGRESSION
We could place the line "by eye": try to have the line as close as possible to all points, with a similar number of points above and below the line.
But for better accuracy, we can calculate the line using Least Squares Regression.
The least squares method obtains the line of best fit for a given data set by minimizing the sum of the squares of the offsets (residuals) of the points from the line.
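To make "minimizing the sum of the squares of the offsets" concrete, here is a small sketch using the Example 1 data from the next slide: the least squares line has a smaller sum of squared offsets than any other line we might draw by eye.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

def sse(m, c):
    # sum of squared vertical offsets (residuals) from the line y = m*x + c
    return np.sum((y - (m * x + c)) ** 2)

print(sse(1.3, 1.1))  # ~9.1  -> the least squares line derived in Example 1
print(sse(1.0, 2.0))  # ~10.0 -> an eyeballed line gives a larger sum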
13. EXAMPLE 1
x: 1 2 3 4 5
y: 2 5 3 8 7

x    y    xy    x²
1    2    2     1
2    5    10    4
3    3    9     9
4    8    32    16
5    7    35    25
∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55
14. EXAMPLE (contd…)
Find the value of m by using the formula
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = [(5×88) − (15×25)] / [(5×55) − (15)²]
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3
15. EXAMPLE (contd…)
Find the value of c by using the formula
c = (∑y − m∑x) / n
c = (25 − 1.3×15) / 5
c = (25 − 19.5) / 5
c = 5.5/5 = 1.1
So, the required equation of least squares is
y = mx + c = 1.3x + 1.1.
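The hand computation can be verified in one line with np.polyfit (degree 1 gives the slope and intercept of the least squares line):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

m, c = np.polyfit(x, y, 1)  # degree-1 (straight line) least squares fit
print(m, c)  # ~1.3 ~1.1 -> matches y = 1.3x + 1.1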
16. EXAMPLE 2
Sam found how many hours of sunshine vs how many ice creams were sold at the shop from Monday to Friday:

"x" Hours of Sunshine    "y" Ice Creams Sold
2                        4
3                        5
5                        7
7                        10
9                        15
17. EXAMPLE (contd…)
Here are the (x, y) points and the line y = 1.518x + 0.305 on a graph:

x    y    ŷ = 1.518x + 0.305    error (ŷ − y)
2    4    3.34                  −0.66
3    5    4.86                  −0.14
5    7    7.89                   0.89
7    10   10.93                  0.93
9    15   13.97                 −1.03

Nice fit!
18. EXAMPLE (contd…)
Sam hears the weather forecast, which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 13 ice creams, just in case. Yum.
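As a quick check, the same fit and forecast can be reproduced with np.polyfit:

import numpy as np

hours = np.array([2, 3, 5, 7, 9])   # hours of sunshine (x)
sold = np.array([4, 5, 7, 10, 15])  # ice creams sold (y)

m, c = np.polyfit(hours, sold, 1)
print(m, c)       # ~1.518 and ~0.305, the line used above
print(m * 8 + c)  # ~12.45 ice creams expected for 8 hours of sun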
19. USING SUM OF CROSS AND SQUARED DEVIATIONS
The slope and intercept can also be written as b₁ = SS_xy / SS_xx and b₀ = ȳ − b₁·x̄,
where the sum of cross-deviations of y and x is
SS_xy = ∑(xᵢ − x̄)(yᵢ − ȳ) = ∑xᵢyᵢ − n·x̄·ȳ
and the sum of squared deviations of x is
SS_xx = ∑(xᵢ − x̄)² = ∑xᵢ² − n·x̄²
20. PYTHON CODE
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)
21. PYTHON CODE
def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()
22. PYTHON CODE
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

OUTPUT:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
23. Implementation of Linear Regression using sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# creating a dummy dataset
np.random.seed(10)
x = np.random.rand(50, 1)
y = 3 + 3 * x + np.random.rand(50, 1)

# scatter plot
plt.scatter(x, y, s=10)
plt.xlabel('x_dummy')
plt.ylabel('y_dummy')
plt.show()
24. Implementation of Linear Regression using sklearn
# creating a model
from sklearn.linear_model import LinearRegression

# creating an object
regressor = LinearRegression()

# training the model
regressor.fit(x, y)

# using the training dataset for prediction
pred = regressor.predict(x)
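Continuing the same session, the fitted parameters are exposed as the standard intercept_ and coef_ attributes, and the fitted line can be drawn over the scatter plot (the printed values are approximate, since the dummy data is random):

# inspecting the fitted parameters
print(regressor.intercept_, regressor.coef_)  # ~3.5 (3 plus the ~0.5 mean of the uniform noise) and ~3.0

# overlaying the fitted line on the scatter plot
plt.scatter(x, y, s=10)
plt.plot(x, pred, color='r')
plt.xlabel('x_dummy')
plt.ylabel('y_dummy')
plt.show()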
26. POINTS TO REMEMBER
Least squares is sensitive to outliers: a strange value will pull the line towards it.
Least squares can also be used for non-linear data, but the formulas (and the steps taken) will be very different!
The difference between the actual value of y and the predicted value of y is called the residual.
27. Correlation vs Regression
• Linear regression finds the best line that predicts y from x; correlation does not fit a line.
• Correlation is used when we measure both variables, while linear regression is mostly applied when x is a variable that is manipulated.
28. Utility of Regression
Used in economic and business research
Estimation of Relationship
Prediction
29. Types
There are several types of linear regression analysis:
Simple linear regression
One dependent variable
One independent variable
Multiple linear regression
One dependent variable
Two or more independent variables
30. Multiple Regression Model
The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is:
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
where:
β0, β1, β2, . . . , βp are the parameters, and ε is a random variable called the error term.
31. Assumptions about the Error Term
The error ε is a random variable with mean of zero.
The variance of ε, denoted by σ², is the same for all values of the independent variables.
The values of ε are independent.
The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
32. Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, . . . , xp is:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
33. Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
34. Estimation Process
Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε, with unknown parameters β0, β1, β2, . . . , βp.
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp.
Sample data (x1, x2, . . . , xp, y) are used to compute the Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
The sample statistics b0, b1, b2, . . . , bp provide estimates of the parameters β0, β1, β2, . . . , βp.
35. Least Squares Method
Least Squares Criterion: min ∑(yᵢ − ŷᵢ)²
Computation of Coefficient Values:
• The formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra.
• Computer software packages are available to perform the calculations.
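As a sketch of what such a package does under the hood, numpy's lstsq solves the least squares problem directly; the data here is made up for illustration, not the programmer survey:

import numpy as np

# illustrative data: 6 observations of two independent variables
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 6.], [6., 5.]])
y = np.array([5., 6., 11., 12., 17., 18.])  # constructed as y = 1 + 2*x1 + 1*x2

# prepend a column of ones so the intercept b0 is estimated as well
A = np.column_stack([np.ones(len(X)), X])

# solve min ||A*b - y||^2 for b = (b0, b1, b2)
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)  # ~[1. 2. 1.], recovering the constructed coefficients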
36. Example: Programmer Salary Survey
Multiple Regression Model
A software firm collected data for a sample of 20
computer programmers. A suggestion was made that
regression analysis could be used to determine if salary
was related to the years of experience and the score on
the firm’s programmer aptitude test.
38. Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
y = β0 + β1x1 + β2x2 + ε
where
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
39. Solving for the Estimates of β0, β1, β2
Input data:
x1   x2    y
4    78    24
7    100   43
.    .     .
.    .     .
3    89    30
The input data go into a computer package for solving multiple regression problems, which produces the least squares output: b0, b1, b2, R², etc.
41. Interpreting the Coefficients
In multiple regression analysis, we interpret each
regression coefficient as follows:
bi represents an estimate of the change in y
corresponding to a 1-unit increase in xi when all
other independent variables are held constant.
42. Interpreting the Coefficients
b1 = 1.404
Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).
43. Interpreting the Coefficients
b2 = 0.251
Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience are held constant).
45. Interpret R-squared in Regression Analysis
Determines how well the model fits the data
Goodness-of-fit measure for linear regression models
Measures the strength of the relationship between
our model and the dependent variable on a convenient
0–100% scale
46. Interpret R-squared in Regression Analysis (contd…)
We need to calculate two things:
var(avg) = ∑(yᵢ − ȳ)²
var(model) = ∑(yᵢ − ŷᵢ)²
R² = 1 − [var(model)/var(avg)] = 1 − [∑(yᵢ − ŷᵢ)² / ∑(yᵢ − ȳ)²]
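A minimal sketch of this calculation, reusing the actual and predicted values from the ice cream example:

import numpy as np

y = np.array([4, 5, 7, 10, 15])                     # actual values
y_hat = np.array([3.34, 4.86, 7.89, 10.93, 13.97])  # predictions from y = 1.518x + 0.305

var_avg = np.sum((y - np.mean(y)) ** 2)    # total variation around the mean
var_model = np.sum((y - y_hat) ** 2)       # variation left over after the fit
print(1 - var_model / var_avg)  # ~0.96: the line explains ~96% of the variation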
47. Limitations of R-squared
R-squared cannot be used to check whether the coefficient estimates and predictions are biased.
R-squared does not tell us whether the regression model has an adequate fit.
48. Multiple Coefficient of Determination
Relationship among SST, SSR, SSE:
SST = SSR + SSE, i.e. ∑(yᵢ − ȳ)² = ∑(ŷᵢ − ȳ)² + ∑(yᵢ − ŷᵢ)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
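The decomposition can be checked numerically with the ice cream data from Example 2 (the identity holds exactly for a least squares fit; here it is only approximate because the fitted values are rounded):

import numpy as np

y = np.array([4, 5, 7, 10, 15])
y_hat = np.array([3.34, 4.86, 7.89, 10.93, 13.97])
y_bar = np.mean(y)

sst = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)      # sum of squares due to error
print(sst, ssr + sse)  # ~78.8 vs ~78.79: equal up to rounding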
49. Testing for Significance: F Test
The F test is referred to as the test for overall significance.
The F test is used to determine whether a significant
relationship exists between the dependent variable
and the set of all the independent variables.
50. Testing for Significance: t Test
A separate t test is conducted for each of the independent variables in the model.
If the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant.
We refer to each of these t tests as a test for individual significance.
51. Testing for Significance
In simple linear regression, the F and t tests provide the same conclusion.
In multiple regression, the F and t tests have different purposes.
52. Testing for Significance: Multicollinearity
The term multicollinearity refers to the correlation
among the independent variables.
When the independent variables are highly correlated
(say, |r| > .7), it is not possible to determine the
separate effect of any particular independent variable
on the dependent variable.
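One quick screen for multicollinearity is to compute the pairwise correlations among the independent variables, e.g. with np.corrcoef (a sketch with made-up, deliberately collinear predictors):

import numpy as np

# made-up predictors where x2 is nearly a linear function of x1
x1 = np.array([4., 7., 3., 6., 8., 5.])
x2 = np.array([78., 100., 70., 95., 110., 85.])

r = np.corrcoef(x1, x2)[0, 1]
print(r)  # ~0.998, far above .7 -> these predictors are highly collinear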
53. Testing for Significance: Multicollinearity
Every attempt should be made to avoid including
independent variables that are highly correlated.
If the estimated regression equation is to be used only
for predictive purposes, multicollinearity is usually
not a serious problem.
54. Using the Estimated Regression Equation
for Estimation and Prediction
The procedures for estimating the mean value of y
and predicting an individual value of y in multiple
regression are similar to those in simple regression.
We substitute the given values of x1, x2, . . . , xp into
the estimated regression equation and use the
resulting value of ŷ as the point estimate.
55. Qualitative Independent Variables
In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female.
In this case, x2 is called a dummy or indicator variable.
56. Qualitative Independent Variables
Example: Programmer Salary Survey
As an extension of the problem involving the computer
programmer salary survey, suppose that management also
believes that the annual salary is related to whether the
individual has a graduate degree in computer science or
information systems. The years of experience, the score on the
programmer aptitude test, whether the individual has a
relevant graduate degree, and the annual salary ($1000) for
each of the sampled 20 programmers are shown on the next
slide.
58. Estimated Regression Equation
ŷ = b0 + b1x1 + b2x2 + b3x3
where:
ŷ = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
x3 is a dummy variable.
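A sketch of how such a dummy variable enters a fit in practice; the rows and salaries below are made up for illustration, constructed so the degree indicator is worth about 11 (i.e., $11,000):

import numpy as np
from sklearn.linear_model import LinearRegression

# columns: years of experience, aptitude score, graduate degree (0 = no, 1 = yes)
X = np.array([[4, 78, 0],
              [7, 100, 1],
              [3, 89, 0],
              [6, 92, 1]])
y = np.array([25.1, 45.8, 26.45, 42.4])  # annual salary ($1000), made-up values

model = LinearRegression().fit(X, y)
print(model.coef_[2])  # ~11.0: the estimated salary bump for a degree in this toy data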
61. The Simple Explanation...
When you select a group from the extreme end of a distribution...
[Figure: a distribution with the selected group's mean at one extreme and the overall mean marked]
62. The Simple Explanation...
...the group will do better on a subsequent measure.
63. The Simple Explanation...
[Figure: the second measure, marking where the group's mean would have been with no regression]
64. The Simple Explanation...
[Figure: where the group's mean actually is on the second measure]
65. The Simple Explanation...
The group mean on the first measure appears to "regress toward the mean" of the second measure.
66. The Simple Explanation...
This phenomenon is called regression to the mean.
67. Example I:
If the first measure is a pretest and you select the low scorers...
[Figure: pretest distribution with the low-scoring group selected]
68. Example I:
...and the second measure is a posttest...
[Figure: pretest and posttest distributions]
69. Example I:
...regression to the mean will make it appear as though the group gained from pre to post.
[Figure: pretest and posttest distributions with the pseudo-effect marked]
70. Example II:
If the first measure is a pretest and you select the high scorers...
[Figure: pretest distribution with the high-scoring group selected]
71. Example II:
...and the second measure is a posttest...
[Figure: pretest and posttest distributions]
72. Example II:
...regression to the mean will make it appear as though the group lost from pre to post.
[Figure: pretest and posttest distributions with the pseudo-effect marked]
73. Some Facts
• This is purely a statistical phenomenon.
• This is a group phenomenon.
• Some individuals will move opposite to this group
trend.
74. Why Does It Happen?
• Regression artifacts occur whenever you sample
asymmetrically from a distribution.
• Regression artifacts occur with any two variables
(not just pre and posttest) and even backwards in
time!
75. What Does It Depend On?
The absolute amount of regression to the mean depends on two factors:
The degree of asymmetry (i.e., how far the selected group's mean is from the overall mean of the first measure)
The correlation between the two measures
76. A Simple Formula
The percent of regression to the mean is
Prm = 100(1 - r)
where r is the correlation between the two measures.
77. For Example:
• If r = 1, there is no (i.e., 0%) regression to the mean.
• If r = 0, there is 100% regression to the mean.
• If r = .2, there is 80% regression to the mean.
• If r = .5, there is 50% regression to the mean.
Prm = 100(1 - r)
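A two-line sketch makes the formula easy to tabulate:

def prm(r):
    # percent of regression to the mean for a given pre-post correlation r
    return 100 * (1 - r)

for r in (1.0, 0.5, 0.2, 0.0):
    print(r, prm(r))  # 0%, 50%, 80% and 100%, matching the cases above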
80. Example
Assume a standardized test with a mean of 50. You give your program to the lowest scorers, and their mean is 30. Assume that the correlation of pre-post is .5.
[Figure: pretest and posttest scales with the overall mean at 50 and the selected group's pretest mean at 30]
81. Example
The formula is…
82. Example
Prm = 100(1 − r) = 100(1 − .5) = 50%
83. Example
Therefore the mean will regress up 50% of the way (from 30 toward 50), leaving a final mean of 40 and a 10-point pseudo-gain.
[Figure: posttest mean at 40, marked as the pseudo-effect]
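A quick check of this example in code, using the standard prediction that the group keeps a fraction r of its deviation from the overall mean (the rest regresses):

overall_mean, group_mean, r = 50, 30, 0.5

# the group keeps r of its deviation; the remaining (1 - r) regresses to the mean
posttest_mean = overall_mean + r * (group_mean - overall_mean)
print(posttest_mean)  # 40.0 -> a 10-point pseudo-gain, as on the slide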