2. Regression Analysis
• Also known as curve fitting.
• A suitable plot of the data indicates the nature of the relation between the independent and the dependent variables.
• If the prediction lies within the given data range it is interpolation; otherwise it is extrapolation.
• The fit can be linear, semi-log, log-log, or nonlinear.
• Regression also tells us about the error incurred in representing the data by the chosen relation.
The linear graph can be of the form
(i) $y = ax + b$ – linear fit – straight line on a linear graph
(ii) $y = ax^b$ – power-law fit – straight line on a log-log graph
(iii) $y = ae^{bx}$ – exponential fit – straight line on a semi-log graph
The non-linear relationship follows a polynomial relation of the form
$y = ax^3 + bx^2 + cx + d$.
The parameters a, b, c, d are known as the fit parameters and need to be determined
as a part of the regression analysis.
3. Linear between x and y
$y = ax + b$ – linear fit – straight line on a linear graph.
4. Linear between log x and log y
$y = ax^b$ – power-law fit – straight line on a log-log graph.
5. Non-linear between x and y
$y = ax^3 + bx^2 + cx + d$ – non-linear fit – polynomial relationship.
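As a sketch of how these transformations linearize a fit (the helper `linear_fit` is illustrative, not part of the notes): taking logarithms of a power law $y = ax^b$ gives $\ln y = \ln a + b \ln x$, a straight line in $(\ln x, \ln y)$, so an ordinary linear least-squares fit recovers both parameters.

```python
import math

def linear_fit(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    a = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    b = sy / n - a * sx / n
    return a, b

# Power-law data y = 2 * x**1.5 becomes a straight line on log-log axes:
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x**1.5 for x in xs]
slope, intercept = linear_fit([math.log(x) for x in xs],
                              [math.log(y) for y in ys])
# slope recovers the exponent b; exp(intercept) recovers the prefactor a
```

The same trick works for the exponential form: fitting $(x, \ln y)$ gives $\ln a$ and $b$ on a semi-log plot.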
6. Least Square method
Let’s consider that there is a linear relation between x and y, i.e. their trend is represented
by a straight line. In general, the straight line does not pass through any of the data points.
If we treat the straight line as a local mean, then the deviations are distributed
w.r.t. this local mean as a normal distribution. The least square principle can be
applied as:
Minimize
$$ s^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - y_f\right]^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - (ax_i + b)\right]^2 $$
where $y_f = ax + b$ is the desired linear fit to the data.
7. Least square method contd.
$s^2$ gives the variance w.r.t. the mean. Hence, minimization requires:
$$ \frac{\partial s^2}{\partial a} = -\frac{2}{n}\sum_{i=1}^{n}\left[y_i - (ax_i + b)\right]x_i = 0; \qquad \frac{\partial s^2}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\left[y_i - (ax_i + b)\right] = 0 $$
These equations may be rearranged as two simultaneous equations for a and b as
given below (known as the normal equations):
$$ a\sum x_i^2 + b\sum x_i = \sum x_i y_i $$
$$ a\sum x_i + nb = \sum y_i $$
Let’s define:
$$ \bar{x} = \frac{\sum x_i}{n},\quad \bar{y} = \frac{\sum y_i}{n},\quad \sigma_x^2 = \frac{\sum x_i^2}{n} - \bar{x}^2,\quad \sigma_y^2 = \frac{\sum y_i^2}{n} - \bar{y}^2,\quad \sigma_{xy} = \frac{\sum x_i y_i}{n} - \bar{x}\bar{y} $$
The quantity $\sigma_{xy}$ is known as the covariance, i.e. the influence of the variability of $x_i$ on $y_i$ and
vice versa.
8. Least square method contd.
With these definitions the slope and intercept of the fitted line may be written as
$$ a = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{\dfrac{\sum x_i y_i}{n} - \bar{x}\bar{y}}{\dfrac{\sum x_i^2}{n} - \bar{x}^2}; \qquad b = \bar{y} - a\bar{x} $$
The regression line passes through the point $(\bar{x}, \bar{y})$.
Example:
The data is expected to follow a linear relation y = ax + b. Find the slope and intercept.
Find the correlation coefficient.

yi   xi
1.2  1.0
2.0  1.6
2.4  3.4
3.5  4.0
3.5  5.2

Quickly verify.
9. Least square method contd.
Solution:

yi    xi    xiyi   xi²
1.2   1.0   1.2    1.0
2.0   1.6   3.2    2.56
2.4   3.4   8.16   11.56
3.5   4.0   14.0   16.0
3.5   5.2   18.2   27.04
Sum:  12.6  15.2   44.76  58.16

Use the normal equations:
$$ a\sum x_i^2 + b\sum x_i = \sum x_i y_i; \qquad a\sum x_i + nb = \sum y_i $$
Answers: a = 0.54; b = 0.879
Hence, y = 0.54x + 0.879
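The tabulated sums and the answers can be checked with a short script (a sketch; the variable names are illustrative, not from the notes):

```python
x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
n = len(x)
Sx, Sy = sum(x), sum(y)                       # column sums 15.2 and 12.6
Sxx = sum(v * v for v in x)                   # sum of xi^2 = 58.16
Sxy = sum(u * v for u, v in zip(x, y))        # sum of xi*yi = 44.76
# Solve the normal equations  a*Sxx + b*Sx = Sxy ;  a*Sx + b*n = Sy
a = (Sxy - Sx * Sy / n) / (Sxx - Sx * Sx / n)
b = Sy / n - a * Sx / n
```

The computed slope rounds to 0.54 and the intercept to the quoted 0.879 (to slide precision).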
10. Standard error
Let’s say the computed value of y is $y_f = ax + b$.
Suppose there are n data points. For a linear fit we have p = 2 parameters, a and b.
Hence, DOF = n − p = n − 2. [The two parameters a and b are calculated using the same data.]
Hence, the standard error is given by:
$$ e = \left\{\frac{\sum_{i=1}^{n}\left[y_i - y_f\right]^2}{n-2}\right\}^{1/2} = \left\{\frac{\sum_{i=1}^{n}\left[y_i - (ax_i + b)\right]^2}{n-2}\right\}^{1/2} $$
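A minimal sketch of the standard error for the worked example above, using the rounded slide coefficients a = 0.54 and b = 0.879 (so the result is only as accurate as that rounding):

```python
import math

x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
n, p = len(x), 2                  # two fit parameters: a and b
a, b = 0.54, 0.879                # slope and intercept from the worked example
# e = sqrt( sum of squared residuals / (n - p) )
e = math.sqrt(sum((v - (a * u + b)) ** 2 for u, v in zip(x, y)) / (n - p))
```

Dividing by n − p rather than n accounts for the two degrees of freedom consumed by estimating a and b from the same data.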
11. Goodness of fit
• A measure of how well the regression line represents the data.
• It is possible to fit two lines to the data by treating:
(a) "x" as the independent variable and "y" as the dependent variable, or
(b) "y" as the independent variable and "x" as the dependent variable, i.e. x = a′y + b′.
Then
$$ a' = \frac{\sigma_{xy}}{\sigma_y^2}; \qquad b' = \bar{x} - a'\bar{y} $$
The second fit line may be written as:
$$ y = \frac{1}{a'}x - \frac{b'}{a'} $$
The slope of this line is $1/a'$, which is not the same as “a”.
• If the two slopes are the same, the two regression lines coincide.
• The ratio of the slopes of the two lines is a measure of how good the form of the
fit is to the data.
12. Correlation coefficient
The correlation coefficient ρ is defined as:
$$ \rho^2 = aa' = \frac{\text{slope of 1st regression line}}{\text{slope of 2nd regression line}} = \frac{\sigma_{xy}^2}{\sigma_x^2\,\sigma_y^2} \quad\text{or}\quad \rho = \pm\frac{\sigma_{xy}}{\sigma_x\sigma_y} $$
• The sign of the correlation coefficient is determined by the sign of the covariance.
• The correlation is perfect if ρ = ±1.
• The correlation is poor if ρ ≈ 0.
• The absolute value of the correlation coefficient should be greater than 0.5 to indicate
that y and x are related.
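A sketch checking the identity $\rho^2 = aa'$ on the example data from the earlier slides (variable names are illustrative):

```python
import math

x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
n = len(x)
xm, ym = sum(x) / n, sum(y) / n
sxy = sum(u * v for u, v in zip(x, y)) / n - xm * ym   # covariance
sx2 = sum(u * u for u in x) / n - xm * xm              # variance of x
sy2 = sum(v * v for v in y) / n - ym * ym              # variance of y
a = sxy / sx2          # slope of the y-on-x regression
ap = sxy / sy2         # slope a' of the x-on-y regression
rho = math.sqrt(a * ap)    # rho**2 = a * a'
```

For this data ρ comes out close to 0.94, well above the 0.5 threshold, so y and x are strongly related.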
13. Polynomial regression
Sometimes the data may show a non-linear behavior that may be modeled by a
polynomial relation, e.g. $y_f = ax^2 + bx + c$.
The variance of the data with respect to the fit is again minimized with respect to the
three fit parameters a, b, c to get three normal equations.
The least square principle requires:
$$ s^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - (ax_i^2 + bx_i + c)\right]^2 $$
$$ \frac{\partial s^2}{\partial a} = -\frac{2}{n}\sum_{i=1}^{n}\left[y_i - (ax_i^2 + bx_i + c)\right]x_i^2 = 0; $$
$$ \frac{\partial s^2}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\left[y_i - (ax_i^2 + bx_i + c)\right]x_i = 0; $$
$$ \frac{\partial s^2}{\partial c} = -\frac{2}{n}\sum_{i=1}^{n}\left[y_i - (ax_i^2 + bx_i + c)\right] = 0 $$
14. Polynomial regression
The earlier equations give:
$$ a\sum x_i^4 + b\sum x_i^3 + c\sum x_i^2 = \sum x_i^2 y_i $$
$$ a\sum x_i^3 + b\sum x_i^2 + c\sum x_i = \sum x_i y_i $$
$$ a\sum x_i^2 + b\sum x_i + nc = \sum y_i $$
These are solved for the fit parameters.
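As a sketch, the three normal equations can be solved directly, here by Cramer's rule on synthetic data generated from a known quadratic (the helpers `det3` and `col_replaced` are illustrative, not from the notes); the true coefficients should be recovered exactly:

```python
def det3(m):
    """Determinant of a 3x3 matrix (cofactor expansion along the first row)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def col_replaced(A, r, j):
    """Copy of A with column j replaced by the right-hand side r."""
    return [[r[i] if k == j else A[i][k] for k in range(3)] for i in range(3)]

# Synthetic data from y = 2x^2 - 3x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2 * x * x - 3 * x + 1 for x in xs]
n = len(xs)
S1 = sum(xs); S2 = sum(x**2 for x in xs)
S3 = sum(x**3 for x in xs); S4 = sum(x**4 for x in xs)
T0 = sum(ys)
T1 = sum(x * y for x, y in zip(xs, ys))
T2 = sum(x * x * y for x, y in zip(xs, ys))
# Normal equations in matrix form
A = [[S4, S3, S2], [S3, S2, S1], [S2, S1, n]]
r = [T2, T1, T0]
D = det3(A)
a = det3(col_replaced(A, r, 0)) / D
b = det3(col_replaced(A, r, 1)) / D
c = det3(col_replaced(A, r, 2)) / D
```

For larger polynomial orders a general linear solver would replace Cramer's rule, but the structure of the normal equations is the same.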
15. Goodness of fit and the index of correlation
In the case of a non-linear fit we define a quantity
known as the index of correlation to determine the
goodness of the fit:
$$ \rho = \pm\sqrt{1 - \frac{\sum\left[y_i - y_f\right]^2}{\sum\left[y_i - \bar{y}\right]^2}} = \pm\sqrt{1 - \frac{s^2}{\sigma_y^2}} $$
• If the index of correlation is close to ±1, the fit is considered good.
• The index of correlation is identical to the correlation coefficient for a linear fit.
• The index of correlation compares the scatter of the data with respect to the
regression curve against the scatter of the data with respect to its own mean.
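The claim that the index of correlation reduces to the correlation coefficient for a linear fit can be checked numerically on the example data (a sketch; variable names are illustrative):

```python
import math

x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
n = len(x)
xm, ym = sum(x) / n, sum(y) / n
sxy = sum(u * v for u, v in zip(x, y)) / n - xm * ym
sx2 = sum(u * u for u in x) / n - xm * xm
sy2 = sum(v * v for v in y) / n - ym * ym
a = sxy / sx2              # least-squares slope
b = ym - a * xm            # least-squares intercept
# Scatter about the fit vs scatter about the mean:
ss_fit = sum((v - (a * u + b)) ** 2 for u, v in zip(x, y))
ss_mean = sum((v - ym) ** 2 for v in y)
R = math.sqrt(1 - ss_fit / ss_mean)     # index of correlation
rho = sxy / math.sqrt(sx2 * sy2)        # correlation coefficient
```

For a least-squares straight line the two quantities agree to machine precision; for genuinely non-linear fits only R is defined.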
16. General index of correlation
Let’s suppose a function z is defined as $z = f(x, y)$ and fitted as $z_f = ax + by + c$.
Standard error is based on the sum of squared residuals:
$$ S = \sum_{i=1}^{n}\left[z_i - (ax_i + by_i + c)\right]^2 $$
LS principle:
$$ \frac{\partial S}{\partial a} = -2\sum_{i=1}^{n}\left[z_i - (ax_i + by_i + c)\right]x_i = 0; $$
$$ \frac{\partial S}{\partial b} = -2\sum_{i=1}^{n}\left[z_i - (ax_i + by_i + c)\right]y_i = 0; $$
$$ \frac{\partial S}{\partial c} = -2\sum_{i=1}^{n}\left[z_i - (ax_i + by_i + c)\right] = 0 $$
Index of correlation:
$$ \rho \ (\text{or } R) = \pm\sqrt{1 - \frac{s^2}{\sigma^2}} $$
Basically this compares the variance w.r.t. the mean ($\sigma^2$) with the variance w.r.t. the local mean, i.e. the fit ($s^2$).
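The three least-square conditions above can be checked numerically: at the true plane coefficients the gradient of S vanishes, while a perturbed coefficient gives a non-zero gradient. A sketch (`grad_S` and the synthetic plane data are illustrative, not from the notes):

```python
def grad_S(a, b, c, pts):
    """Gradient of S = sum [z - (a*x + b*y + c)]^2 w.r.t. a, b, c."""
    ga = gb = gc = 0.0
    for x, y, z in pts:
        r = z - (a * x + b * y + c)
        ga += -2 * r * x
        gb += -2 * r * y
        gc += -2 * r
    return ga, gb, gc

# Exact data from the plane z = 2x + 3y + 1
pts = [(x, y, 2 * x + 3 * y + 1) for x in range(4) for y in range(4)]
g_true = grad_S(2.0, 3.0, 1.0, pts)   # all three conditions hold here
g_off = grad_S(2.5, 3.0, 1.0, pts)    # perturbed slope: gradient is non-zero
```

Setting the three components to zero and solving the resulting linear system is exactly the normal-equation procedure from the earlier slides, now in two independent variables.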
17. Parity plot
• The data and the fit may be compared by making a parity plot.
• The parity plot is a plot of the given data (z) along the abscissa and the fit (zf) along
the ordinate.
• The parity line is the line of equality between the two.
• The departure of the data from the parity line is an indication of the quality of the
fit.
• When the data is a function of more than one independent variable it is not
always possible to make plots between independent and dependent variables. In
such a case the parity plot is a way out.
18. General non-linear fit:
What if the fit equation is a non-linear relation that is neither a polynomial nor
reducible to the linear form?
Example:
$$ (1)\ y = ae^{bx} + cx + d \qquad\text{or}\qquad (2)\ y = ae^{b(x^2 + c\ln x + d)} $$
Here, parameter estimation requires the use of a search method to determine the best
parameter set that minimizes the sum of the squares of the residuals, i.e. to find (a,
b, …, p) such that S is minimized for $y_f = f(x;\ a, b, c, \ldots, p)$ [a general non-linear
function with p parameters],
where the sum of the squares of the residuals is given by
$$ S = \sum_{i=1}^{N}\left[y_i - y_f\right]^2 \ (\text{min}) $$
Hence, choose the parameters such that
$$ \frac{\partial S}{\partial a} = \frac{\partial S}{\partial b} = \frac{\partial S}{\partial c} = \cdots = \frac{\partial S}{\partial p} = 0 $$
In general it is not possible to set the partial derivatives with respect to the parameters
to zero to obtain the normal equations and thus obtain the fit parameters.
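The objective S is easy to express in code for any model. A sketch (the helper `residual_ss` and the chosen test parameters are illustrative, not from the notes); at the true parameters of exactly-generated data S is zero, and it grows as the parameters move away:

```python
import math

def residual_ss(params, xs, ys, model):
    """Sum of squared residuals S = sum (y_i - y_f)^2 for a general model."""
    return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))

# Form (1) from the slide: y = a*exp(b*x) + c*x + d
def model1(x, p):
    a, b, c, d = p
    return a * math.exp(b * x) + c * x + d

# Data generated exactly from the model:
true_p = (1.0, 0.5, 2.0, 3.0)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [model1(x, true_p) for x in xs]
S_true = residual_ss(true_p, xs, ys, model1)               # zero at the truth
S_off = residual_ss((1.1, 0.5, 2.0, 3.0), xs, ys, model1)  # positive elsewhere
```

A search method then only needs this single function of the parameters; it never has to solve the normal equations explicitly.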
19. General non-linear fit:
Let’s consider a 3-parameter system with a, b, c as the parameters.
Now assume certain values $a^{(0)}, b^{(0)}, c^{(0)}$, which give some value $S^{(0)}$ that may not be the
minimum.
Then evaluate
$$ \left.\frac{\partial S}{\partial a}\right|_{a^{(0)},b^{(0)},c^{(0)}},\quad \left.\frac{\partial S}{\partial b}\right|_{a^{(0)},b^{(0)},c^{(0)}},\quad \left.\frac{\partial S}{\partial c}\right|_{a^{(0)},b^{(0)},c^{(0)}} $$
If each of these is zero, then it’s a minimum.
Now, S being a function of the parameters, $S = f(a, b, c)$, we can write
$$ \nabla S = \frac{\partial S}{\partial a}\hat{a} + \frac{\partial S}{\partial b}\hat{b} + \frac{\partial S}{\partial c}\hat{c} $$
and the minimum is achieved when $\nabla S = 0$.
Magnitude of the gradient:
$$ \left|\nabla S\right| = \sqrt{\left(\frac{\partial S}{\partial a}\right)^2 + \left(\frac{\partial S}{\partial b}\right)^2 + \left(\frac{\partial S}{\partial c}\right)^2} $$
Components of the unit vector in the direction of the gradient:
$$ \frac{\partial S/\partial a}{\left|\nabla S\right|},\quad \frac{\partial S/\partial b}{\left|\nabla S\right|},\quad \frac{\partial S/\partial c}{\left|\nabla S\right|} $$
20. General non-linear fit:
To minimize, we now move in a direction opposite to the gradient, reducing the
parameter values by a step δ:
$$ a^{(1)} = a^{(0)} - \delta\,(\text{component along } \hat{a}) $$
$$ b^{(1)} = b^{(0)} - \delta\,(\text{component along } \hat{b}) $$
$$ c^{(1)} = c^{(0)} - \delta\,(\text{component along } \hat{c}) $$
Thus (note: δ will be the same for all):
$$ a^{(1)} = a^{(0)} - \delta\left.\frac{\partial S/\partial a}{\left|\nabla S\right|}\right|_{a^{(0)},b^{(0)},c^{(0)}};\quad b^{(1)} = b^{(0)} - \delta\left.\frac{\partial S/\partial b}{\left|\nabla S\right|}\right|_{a^{(0)},b^{(0)},c^{(0)}};\quad c^{(1)} = c^{(0)} - \delta\left.\frac{\partial S/\partial c}{\left|\nabla S\right|}\right|_{a^{(0)},b^{(0)},c^{(0)}} $$
This is repeated until S reaches its minimum value.
Since it moves along the steepest path, this is known as the Steepest Descent method.
NOTE: Initially you may choose larger values of δ, but once the iteration moves close to the
minimum you must reduce its magnitude.
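The update rule above can be sketched as a small generic routine (the helper `steepest_descent` and the test function are illustrative, not from the notes); here δ is kept fixed, per the NOTE a practical version would shrink it near the minimum:

```python
import math

def steepest_descent(grad, p0, delta=0.02, steps=200):
    """Repeatedly step a distance delta opposite to the unit gradient vector."""
    p = list(p0)
    for _ in range(steps):
        g = grad(p)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:            # gradient vanished: at a stationary point
            break
        p = [pi - delta * gi / norm for pi, gi in zip(p, g)]
    return p

# Check on a simple bowl S = (a-1)^2 + (b-2)^2 with minimum at (1, 2)
p_min = steepest_descent(lambda p: [2 * (p[0] - 1.0), 2 * (p[1] - 2.0)],
                         [0.0, 0.0])
```

With a fixed δ the iterate ends up oscillating within about δ of the minimum, which is exactly why the slides recommend reducing δ as the search closes in.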
21. Example : Steepest Descent
Q. Determine the fit parameters by general non-linear regression if
the data follows the form $y_f = ae^{bx} + cx$.

x: 0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8
y: 1.196, 1.379, 1.581, 1.79, 2.013, 2.279, 2.545, 2.842, 3.173, 3.5
22. Solution
Sum of squares of residuals:
$$ S = \sum_{i=1}^{10}\left[y_i - (ae^{bx_i} + cx_i)\right]^2 $$
Hence,
$$ \frac{\partial S}{\partial a} = 2\sum_{i=1}^{10}\left[y_i - (ae^{bx_i} + cx_i)\right](-e^{bx_i}); $$
$$ \frac{\partial S}{\partial b} = 2\sum_{i=1}^{10}\left[y_i - (ae^{bx_i} + cx_i)\right](-ax_ie^{bx_i}); $$
$$ \frac{\partial S}{\partial c} = 2\sum_{i=1}^{10}\left[y_i - (ae^{bx_i} + cx_i)\right](-x_i) $$
Assume $a^{(0)} = 1$, $b^{(0)} = 0.2$, $c^{(0)} = 0.1$. We get:
$$ S = 11.674;\quad \frac{\partial S}{\partial a} = -24.023;\quad \frac{\partial S}{\partial b} = -30.682;\quad \frac{\partial S}{\partial c} = -23.003 $$
23. Magnitude of gradient vector:
$$ \left|\nabla S\right| = \sqrt{\left(\frac{\partial S}{\partial a}\right)^2 + \left(\frac{\partial S}{\partial b}\right)^2 + \left(\frac{\partial S}{\partial c}\right)^2} = 45.25 $$
Hence the unit-vector components in the direction of the gradient:
$$ \frac{\partial S/\partial a}{\left|\nabla S\right|} = \frac{-24.023}{45.25} = -0.531;\quad \frac{\partial S/\partial b}{\left|\nabla S\right|} = \frac{-30.682}{45.25} = -0.678;\quad \frac{\partial S/\partial c}{\left|\nabla S\right|} = \frac{-23.003}{45.25} = -0.508 $$
Hence, with δ = 0.02:
$$ a^{(1)} = a^{(0)} - \delta\,(\text{component along }\hat{a}) = 1 - (0.02 \times -0.531) = 1.011 $$
$$ b^{(1)} = b^{(0)} - \delta\,(\text{component along }\hat{b}) = 0.2 - (0.02 \times -0.678) = 0.214 $$
$$ c^{(1)} = c^{(0)} - \delta\,(\text{component along }\hat{c}) = 0.1 - (0.02 \times -0.508) = 0.110 $$
24. Now for these values, $a^{(1)} = 1.011$, $b^{(1)} = 0.214$, $c^{(1)} = 0.11$, the new value of
S = 10.948.
This is repeated until the value of S comes below 0.01 or less (which is specified
already).
For this example, calculate the final values of a, b and c with a MATLAB program.
#Assignment
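The first iteration of the example can be reproduced with a short script (a Python sketch of the slides' calculation, not the assigned MATLAB program); it recovers $S^{(0)}$, the gradient components, and the updated parameters quoted above:

```python
import math

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
ys = [1.196, 1.379, 1.581, 1.79, 2.013, 2.279, 2.545, 2.842, 3.173, 3.5]

def S(a, b, c):
    """Sum of squared residuals for the model y_f = a*exp(b*x) + c*x."""
    return sum((y - (a * math.exp(b * x) + c * x)) ** 2
               for x, y in zip(xs, ys))

def grad(a, b, c):
    """Analytic gradient of S from the slide-22 derivatives."""
    dSda = dSdb = dSdc = 0.0
    for x, y in zip(xs, ys):
        r = y - (a * math.exp(b * x) + c * x)
        dSda += 2 * r * (-math.exp(b * x))
        dSdb += 2 * r * (-a * x * math.exp(b * x))
        dSdc += 2 * r * (-x)
    return dSda, dSdb, dSdc

a, b, c = 1.0, 0.2, 0.1        # initial guess from the slides
S0 = S(a, b, c)                # matches the quoted 11.674
g = grad(a, b, c)              # matches (-24.023, -30.682, -23.003)
norm = math.sqrt(sum(v * v for v in g))
delta = 0.02
a, b, c = (a - delta * g[0] / norm,
           b - delta * g[1] / norm,
           c - delta * g[2] / norm)   # matches (1.011, 0.214, 0.110)
S1 = S(a, b, c)                # smaller than S0: the step descended
```

Wrapping the update in a loop with a shrinking δ, as the NOTE on slide 20 advises, carries the iteration to convergence.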