5. Introduction
• All problems of science and engineering have dependent and independent
variables
• For a large class of problems, the relation between the dependent
and independent variables follows a ‘linear’ trend
• Without much fanfare, let us assume that we have
two variables, x and y, that have a linear trend
in the relationship between them.
• What we understand by a ‘linear trend’ is that the
data points lie ‘around’ a straight line, not
necessarily on the line
6. Introduction
• Let $(x_i, y_i)$, $i = 1, \dots, n$, be a set of points through
which we would like to pass a straight line
• Let the equation of the line be
$$\hat{y}_i = \alpha + \beta x_i$$
• $\hat{y}_i$ are the points on the line corresponding to
the points $x_i$.
• $y_i$ are the actual observed points, which in general
differ from $\hat{y}_i$.
• Consider the difference between the points $y_i$
and $\hat{y}_i$ to be $e_i$.
[Figure: straight line with intercept $\alpha$ and slope $\beta$; labels mark $x_i$, $y_i$, and the error $e_i$]
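As a minimal sketch of the definitions on this slide, the errors for one candidate line can be computed directly. The function name `residuals`, the sample points, and the chosen intercept and slope are all made up for illustration; they do not come from the slides.

```python
def residuals(xs, ys, alpha, beta):
    """Return e_i = y_i - (alpha + beta * x_i) for every observation."""
    return [y - (alpha + beta * x) for x, y in zip(xs, ys)]


# Made-up observations that scatter around the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 2.5, 5.5, 6.5]

errors = residuals(xs, ys, alpha=1.0, beta=2.0)
print(errors)  # [0.5, -0.5, 0.5, -0.5]
```

The errors alternate in sign, which is the sense in which the points lie ‘around’ the line rather than on it.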
7. Least-square linear regression
• Therefore
$$e_i = y_i - \hat{y}_i$$
$$e_i = y_i - \alpha - \beta x_i$$
• We need to put some conditions on $e_i$ so that we can obtain expressions for $\alpha$
and $\beta$.
• The quantities $e_i$ can be negative or positive. If we impose the condition
$\sum e_i = 0$, then it can be shown that there are infinitely many solutions for $\alpha$ and $\beta$.
• To address this issue, Gauss proposed to square $e_i$ and then minimize $\sum e_i^2$.
• The Sum of Squared Errors is written as
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
• SSE will be minimized with respect to $\alpha$ and $\beta$.
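The point about the zero-sum condition can be illustrated with a short sketch. It assumes a small made-up data set and two hypothetical helpers, `sse` and `error_sum`. It relies on the fact that any line passing through the point of means makes the errors sum to zero, so that condition alone cannot pick out one line, while SSE does distinguish between such lines.

```python
def sse(xs, ys, alpha, beta):
    """Sum of squared errors for the line y = alpha + beta * x."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


def error_sum(xs, ys, alpha, beta):
    """Plain (unsquared) sum of the errors e_i."""
    return sum(y - alpha - beta * x for x, y in zip(xs, ys))


# Made-up observations (not from the slides).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 2.5, 5.5, 6.5]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Two different slopes; each intercept is chosen so that the line
# passes through (x_mean, y_mean).
for beta in (0.0, 2.0):
    alpha = y_mean - beta * x_mean
    print(beta, error_sum(xs, ys, alpha, beta), sse(xs, ys, alpha, beta))
```

Both lines give an error sum of zero, yet their SSE values differ widely, so minimizing SSE is what separates a poor line from a good one.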
8. Least-square linear regression
• Therefore the problem ends up as an optimization problem.
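• To minimize SSE, we partially differentiate SSE with respect to $\alpha$ and $\beta$, and
set the derivatives equal to 0
$$\frac{\partial SSE}{\partial \alpha} = -2 \sum (y_i - \alpha - \beta x_i) = 0$$
$$\frac{\partial SSE}{\partial \beta} = -2 \sum (y_i - \alpha - \beta x_i)\, x_i = 0$$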
• Simplifying, we obtain
$$\alpha n + \beta \sum x_i = \sum y_i$$
$$\alpha \sum x_i + \beta \sum x_i^2 = \sum x_i y_i$$
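The partial derivatives on this slide can be sanity-checked numerically. The sketch below compares the analytic expressions with a central finite-difference approximation at an arbitrary trial point. The data, the trial point, the step size `h`, and the function names are assumptions made for illustration only.

```python
def sse(xs, ys, alpha, beta):
    """Sum of squared errors for the line y = alpha + beta * x."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


def sse_gradient(xs, ys, alpha, beta):
    """Analytic partial derivatives of SSE with respect to alpha and beta."""
    d_alpha = -2.0 * sum(y - alpha - beta * x for x, y in zip(xs, ys))
    d_beta = -2.0 * sum((y - alpha - beta * x) * x for x, y in zip(xs, ys))
    return d_alpha, d_beta


def numeric_gradient(xs, ys, alpha, beta, h=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    d_alpha = (sse(xs, ys, alpha + h, beta) - sse(xs, ys, alpha - h, beta)) / (2 * h)
    d_beta = (sse(xs, ys, alpha, beta + h) - sse(xs, ys, alpha, beta - h)) / (2 * h)
    return d_alpha, d_beta


# Made-up observations and an arbitrary trial line y = 0 + 1x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 2.5, 5.5, 6.5]
analytic = sse_gradient(xs, ys, alpha=0.0, beta=1.0)
numeric = numeric_gradient(xs, ys, alpha=0.0, beta=1.0)
print(analytic, numeric)
```

The two pairs should agree to several decimal places; both derivatives are non-zero here because the trial line is not the least-squares line.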
9. Least-square linear regression
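• Solving these, we obtain expressions for $\alpha$ and $\beta$.
$$\alpha = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
$$\beta = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$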
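One possible way to turn these two expressions into code is sketched below. The function name `fit_line` and the guard against a zero denominator are additions for illustration and are not part of the slides.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) from the closed-form least-squares expressions."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        # All x values are identical (or there are no points): no unique line.
        raise ValueError("x values must not all be equal")

    alpha = (sum_y * sum_xx - sum_x * sum_xy) / denom
    beta = (n * sum_xy - sum_x * sum_y) / denom
    return alpha, beta


# Made-up points lying exactly on y = 1 + 2x, so the fit should recover it.
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # 1.0 2.0
```

The shared denominator vanishes when every x value is the same, in which case no unique line exists; the sketch raises an error rather than dividing by zero.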
10. Example: The raw material used in the production of a certain synthetic fiber is stored in a location
without humidity control. Measurements of the relative humidity in the storage location and the
moisture content of a sample of the raw material were taken over 15 days. The following data were
recorded.
Construct a least-square regression model.
Considering the Relative Humidity as x, and Moisture
Content as y, we can obtain $\sum x_i = 692$, $\sum y_i = 186$,
$\sum x_i^2 = 33212$, $\sum x_i y_i = 8997$. Using these values with $n = 15$,
we obtain
$$\alpha = \frac{186 \times 33212 - 692 \times 8997}{15 \times 33212 - 692^2} = -2.51$$
$$\beta = \frac{15 \times 8997 - 692 \times 186}{15 \times 33212 - 692^2} = 0.32$$
Therefore, the equation of the line is y = –2.51 + 0.32 x
Relative humidity (x)  46 53 29 61 36 39 47 49 52 38 55 32 57 54 44
Moisture content (y)   12 15  7 17 10 11 11 12 14  9 16  8 18 14 12
[Figure: plot of Moisture Content (vertical axis, 0 to 20) against Relative Humidity (horizontal axis, 25 to 65)]
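The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch: the variable names are made up, and the final prediction at a relative humidity of 50 is an illustrative input that does not appear in the slides.

```python
humidity = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]
moisture = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]

# Sums needed by the closed-form expressions.
n = len(humidity)
sum_x = sum(humidity)
sum_y = sum(moisture)
sum_xx = sum(x * x for x in humidity)
sum_xy = sum(x * y for x, y in zip(humidity, moisture))
print(n, sum_x, sum_y, sum_xx, sum_xy)  # 15 692 186 33212 8997

denom = n * sum_xx - sum_x ** 2
alpha = (sum_y * sum_xx - sum_x * sum_xy) / denom
beta = (n * sum_xy - sum_x * sum_y) / denom
print(round(alpha, 2), round(beta, 2))  # -2.51 0.32

# Moisture content predicted by the fitted line at a humidity of 50.
print(round(alpha + beta * 50, 2))
```

The printed sums match the values quoted in the example, and the rounded coefficients match the fitted line.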
11. Goodness of fit
• To find the values of $\alpha$ and $\beta$, all we need are a set of values
of x and a set of values of y.
• In reality, x and y do not have to follow a linear trend,
yet we can still go ahead and find values of $\alpha$ and $\beta$.
• So, once a regression relation has been found, a relevant
question that arises is ‘how good is the fit?’
• We assess this by computing Pearson’s correlation coefficient, denoted
by $\rho$ and defined as
$$\rho = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
• where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively
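The definition above translates almost term by term into code. The sketch below uses a hypothetical helper named `pearson` and applies it to the humidity and moisture data from the earlier example; the resulting value is not stated in the slides.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Humidity and moisture data from the worked example.
humidity = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]
moisture = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]

rho = pearson(humidity, moisture)
print(round(rho, 3))  # roughly 0.95
```

A value this close to 1 suggests that a straight line describes the example data well.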
12. Goodness of fit
• The correlation coefficient ranges between −1 and 1. Different cases are shown
in the following figures.
[Figure: five panels illustrating $\rho = 1$, $\rho = 0$, $\rho = -1$, $0 < \rho < 1$, and $-1 < \rho < 0$]