SECTION – I




            REVIEW OF
RANDOM VARIABLES & RANDOM PROCESS




REVIEW OF RANDOM VARIABLES
  1.1 Introduction
  1.2 Discrete and Continuous Random Variables
  1.3 Probability Distribution Function
  1.4 Probability Density Function
  1.5 Joint Random Variables
  1.6 Marginal density functions
  1.7 Conditional density function
  1.8 Bayes' Rule for mixed random variables
  1.9 Independent Random Variables
  1.10 Moments of Random Variables
  1.11 Uncorrelated random variables
  1.12 Linear prediction of Y from X
  1.13 Vector space Interpretation of Random Variables
  1.14 Linear Independence
  1.15 Statistical Independence
  1.16 Inner Product
  1.17 Schwarz Inequality
  1.18 Orthogonal Random Variables
  1.19 Orthogonality Principle
  1.20 Chebyshev Inequality
  1.21 Markov Inequality
  1.22 Convergence of a sequence of random variables
  1.23 Almost sure (a.s.) convergence or convergence with probability 1
  1.24 Convergence in mean square sense
  1.25 Convergence in probability
  1.26 Convergence in distribution
  1.27 Central Limit Theorem
  1.28 Jointly Gaussian Random Variables




REVIEW OF RANDOM VARIABLES

1.1 Introduction
    •   Mathematically, a random variable is neither random nor a variable.
    •   It is a mapping from the sample space into the real line ("real-valued" random variable) or the complex plane ("complex-valued" random variable).
Suppose we have a probability space {S, ℑ, P}. Let X : S → ℜ be a function mapping the sample space S into the real line such that for each s ∈ S there exists a unique X(s) ∈ ℜ. Then X is called a random variable.
Thus a random variable associates the points in the sample space with real numbers.



Figure: A random variable as a mapping X(s) from the sample space S into the real line ℜ.

Notations:
    •   Random variables are represented by upper-case letters.
    •   Values of a random variable are denoted by lower-case letters.
    •   X = x means that x is the value of the random variable X.


Example 1: Consider the example of tossing a fair coin twice. The sample space is S = {HH, HT, TH, TT} and all four outcomes are equally likely. Then we can define a random variable X as follows:

     Sample point    Value of the random variable X = x    P{X = x}
     HH              0                                      1/4
     HT              1                                      1/4
     TH              2                                      1/4
     TT              3                                      1/4


Example 2: Consider the sample space associated with the single toss of a fair die. The sample space is given by S = {1, 2, 3, 4, 5, 6}. If we define the random variable X that associates with each outcome the number on the face of the die, then X takes values in {1, 2, 3, 4, 5, 6}.




1.2 Discrete and Continuous Random Variables
     •   A random variable X is called discrete if there exists a countable sequence of distinct real numbers x_i such that Σ_i p_X(x_i) = 1, where p_X(x_i) = P{X = x_i} is called the probability mass function. The random variable defined in Example 1 is a discrete random variable.
     •   A continuous random variable X can take any value from a continuous interval.
     •   A random variable may also be of mixed type. In this case the RV takes continuous values, but in addition a finite number of points each carry a nonzero probability mass.




1.3 Probability Distribution Function
We can define an event {X ≤ x} = {s : X(s) ≤ x, s ∈ S}.
The probability
    F_X(x) = P{X ≤ x} is called the probability distribution function.
Given F_X(x), we can determine the probability of any event involving values of the random variable X.
    •   F_X(x) is a non-decreasing function of x.
    •   F_X(x) is right-continuous: F_X approaches its value at x from the right.
    •   F_X(−∞) = 0
    •   F_X(∞) = 1
    •   P{x_1 < X ≤ x_2} = F_X(x_2) − F_X(x_1)




Example 3: Consider the random variable defined in Example 1. The distribution function F_X(x) is as given below:

     Value of the random variable X = x     F_X(x)
     x < 0                                   0
     0 ≤ x < 1                               1/4
     1 ≤ x < 2                               1/2
     2 ≤ x < 3                               3/4
     x ≥ 3                                   1

Figure: Staircase plot of F_X(x) versus x.

1.4 Probability Density Function
If F_X(x) is differentiable, f_X(x) = (d/dx) F_X(x) is called the probability density function and has the following properties.

    •   f_X(x) is a non-negative function
    •   ∫_{−∞}^{∞} f_X(x) dx = 1
    •   P(x_1 < X ≤ x_2) = ∫_{x_1}^{x_2} f_X(x) dx

Remark: Using the Dirac delta function we can define the density function for a discrete random variable.
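As a small numerical illustration (not part of the original notes), the sketch below checks the pdf properties for an assumed exponential density f_X(x) = e^{−x}, x ≥ 0: the total area is 1 and P(x_1 < X ≤ x_2) agrees with F_X(x_2) − F_X(x_1).

```python
import numpy as np

# Hypothetical example density: exponential, f_X(x) = exp(-x) for x >= 0
x = np.linspace(0.0, 50.0, 200001)        # fine grid; tail beyond 50 is negligible
f = np.exp(-x)

total_area = np.trapz(f, x)               # should be ~1 (property of any pdf)

x1, x2 = 0.5, 2.0
mask = (x > x1) & (x <= x2)
p_interval = np.trapz(f[mask], x[mask])   # P(x1 < X <= x2) by numerical integration

# Compare with F_X(x2) - F_X(x1), where F_X(x) = 1 - exp(-x) for this density
p_from_cdf = (1 - np.exp(-x2)) - (1 - np.exp(-x1))
print(total_area, p_interval, p_from_cdf)
```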


1.5 Joint Random Variables
X and Y are two random variables defined on the same sample space S.
P{X ≤ x, Y ≤ y} is called the joint distribution function and denoted by F_{X,Y}(x, y).
Given F_{X,Y}(x, y), −∞ < x < ∞, −∞ < y < ∞, we have a complete description of the random variables X and Y.
    •   P{0 < X ≤ x, 0 < Y ≤ y} = F_{X,Y}(x, y) − F_{X,Y}(x, 0) − F_{X,Y}(0, y) + F_{X,Y}(0, 0)
    •   F_X(x) = F_{X,Y}(x, +∞).
        To prove this,
            (X ≤ x) = (X ≤ x) ∩ (Y ≤ +∞)
            ∴ F_X(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = F_{X,Y}(x, +∞)
        Similarly F_Y(y) = F_{X,Y}(∞, y).
    •   Given F_{X,Y}(x, y), −∞ < x < ∞, −∞ < y < ∞, each of F_X(x) and F_Y(y) is called a marginal distribution function.
We can define the joint probability density function f_{X,Y}(x, y) of the random variables X and Y by
        f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y), provided it exists.
    •   f_{X,Y}(x, y) is always a non-negative quantity.
    •   F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(u, v) dv du




1.6 Marginal density functions
        f_X(x) = (d/dx) F_X(x)
               = (d/dx) F_{X,Y}(x, ∞)
               = (d/dx) ∫_{−∞}^{x} ( ∫_{−∞}^{∞} f_{X,Y}(u, y) dy ) du
               = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
    and f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx




1.7 Conditional density function
f_{Y/X}(y / X = x) = f_{Y/X}(y / x) is called the conditional density of Y given X.
Let us define the conditional distribution function.
We cannot define the conditional distribution function for continuous random variables X and Y by the relation
    F_{Y/X}(y / x) = P(Y ≤ y / X = x)
                   = P(Y ≤ y, X = x) / P(X = x)
because both the numerator and the denominator are zero in the above expression.
The conditional distribution function is therefore defined in the limiting sense as follows:
    F_{Y/X}(y / x) = lim_{∆x→0} P(Y ≤ y / x < X ≤ x + ∆x)
                   = lim_{∆x→0} P(Y ≤ y, x < X ≤ x + ∆x) / P(x < X ≤ x + ∆x)
                   = lim_{∆x→0} ∫_{−∞}^{y} f_{X,Y}(x, u) ∆x du / (f_X(x) ∆x)
                   = ∫_{−∞}^{y} f_{X,Y}(x, u) du / f_X(x)

The conditional density is defined in the limiting sense as follows:
    f_{Y/X}(y / X = x) = lim_{∆y→0} [F_{Y/X}(y + ∆y / X = x) − F_{Y/X}(y / X = x)] / ∆y
                       = lim_{∆y→0, ∆x→0} [F_{Y/X}(y + ∆y / x < X ≤ x + ∆x) − F_{Y/X}(y / x < X ≤ x + ∆x)] / ∆y        (1)
because (X = x) = lim_{∆x→0} (x < X ≤ x + ∆x).
The right-hand side of equation (1) is
    lim_{∆y→0, ∆x→0} [F_{Y/X}(y + ∆y / x < X ≤ x + ∆x) − F_{Y/X}(y / x < X ≤ x + ∆x)] / ∆y
        = lim_{∆y→0, ∆x→0} P(y < Y ≤ y + ∆y / x < X ≤ x + ∆x) / ∆y
        = lim_{∆y→0, ∆x→0} P(y < Y ≤ y + ∆y, x < X ≤ x + ∆x) / [P(x < X ≤ x + ∆x) ∆y]
        = lim_{∆y→0, ∆x→0} f_{X,Y}(x, y) ∆x ∆y / [f_X(x) ∆x ∆y]
        = f_{X,Y}(x, y) / f_X(x)

    ∴ f_{Y/X}(y / x) = f_{X,Y}(x, y) / f_X(x)                                          (2)

Similarly, we have
    f_{X/Y}(x / y) = f_{X,Y}(x, y) / f_Y(y)                                            (3)




From (2) and (3) we get Bayes' rule

    f_{X/Y}(x / y) = f_{X,Y}(x, y) / f_Y(y)
                   = f_X(x) f_{Y/X}(y / x) / f_Y(y)
                   = f_{X,Y}(x, y) / ∫_{−∞}^{∞} f_{X,Y}(x, y) dx                       (4)
                   = f_{Y/X}(y / x) f_X(x) / ∫_{−∞}^{∞} f_X(u) f_{Y/X}(y / u) du

Given the joint density function we can find out the conditional density function.


Example 4:
For random variables X and Y, the joint probability density function is given by
    f_{X,Y}(x, y) = (1 + xy)/4,   |x| ≤ 1, |y| ≤ 1
                  = 0             otherwise
Find the marginal densities f_X(x), f_Y(y) and the conditional density f_{Y/X}(y / x). Are X and Y independent?

    f_X(x) = ∫_{−1}^{1} (1 + xy)/4 dy = 1/2,   −1 ≤ x ≤ 1
Similarly
    f_Y(y) = 1/2,   −1 ≤ y ≤ 1
and
    f_{Y/X}(y / x) = f_{X,Y}(x, y) / f_X(x) = (1 + xy)/2,   |x| ≤ 1, |y| ≤ 1
                   = 0 otherwise
Since f_{X,Y}(x, y) ≠ f_X(x) f_Y(y), X and Y are not independent.
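A short numerical check of Example 4 (an illustrative sketch, not part of the notes): integrating the joint density over y recovers f_X(x) = 1/2, and the conditional density evaluates to (1 + xy)/2.

```python
import numpy as np

def f_xy(x, y):
    # Joint density of Example 4 on the square |x| <= 1, |y| <= 1
    return (1 + x * y) / 4.0

y = np.linspace(-1.0, 1.0, 2001)
for x0 in (-0.5, 0.0, 0.7):
    fx = np.trapz(f_xy(x0, y), y)             # marginal f_X(x0), should be 1/2
    fy_given_x = f_xy(x0, 0.5) / fx           # conditional f_{Y|X}(0.5 | x0)
    print(x0, fx, fy_given_x, (1 + x0 * 0.5) / 2)
```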


1.8 Bayes' Rule for mixed random variables
Let X be a discrete random variable with probability mass function P_X(x) and let Y be a continuous random variable. In practical problems we may have to estimate X from the observed Y. Then



    P_{X/Y}(x / y) = lim_{∆y→0} P_{X/Y}(x / y < Y ≤ y + ∆y)
                   = lim_{∆y→0} P_{X,Y}(x, y < Y ≤ y + ∆y) / P(y < Y ≤ y + ∆y)
                   = lim_{∆y→0} P_X(x) f_{Y/X}(y / x) ∆y / (f_Y(y) ∆y)
                   = P_X(x) f_{Y/X}(y / x) / f_Y(y)
                   = P_X(x) f_{Y/X}(y / x) / Σ_x P_X(x) f_{Y/X}(y / x)




Example 5:
Consider the observation Y = X + V (the adder diagram in the original figure), where X is a binary random variable with
    X = 1    with probability 1/2
      = −1   with probability 1/2
and V is Gaussian noise with mean 0 and variance σ².

Then
    P_{X/Y}(x = 1 / y) = P_X(1) f_{Y/X}(y / 1) / Σ_x P_X(x) f_{Y/X}(y / x)
                       = e^{−(y−1)²/2σ²} / ( e^{−(y−1)²/2σ²} + e^{−(y+1)²/2σ²} )
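The posterior in Example 5 is easy to evaluate; the sketch below (illustrative only, with an assumed σ = 1) also checks it by simulation: among observations close to a chosen y, the fraction generated with X = 1 matches the formula.

```python
import numpy as np

sigma = 1.0                           # assumed noise standard deviation
rng = np.random.default_rng(0)

def posterior_x1(y, sigma):
    # P(X = 1 | Y = y) from Bayes' rule for equally likely X = +/-1 in Gaussian noise
    num = np.exp(-(y - 1) ** 2 / (2 * sigma ** 2))
    den = num + np.exp(-(y + 1) ** 2 / (2 * sigma ** 2))
    return num / den

n = 200000
x = rng.choice([-1.0, 1.0], size=n)
yobs = x + sigma * rng.standard_normal(n)
y0 = 0.4
near = np.abs(yobs - y0) < 0.05       # observations near y0
print(posterior_x1(y0, sigma), np.mean(x[near] == 1.0))
```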




1.9 Independent Random Variables
Let X and Y be two random variables characterised by the joint distribution function
    F_{X,Y}(x, y) = P{X ≤ x, Y ≤ y}
and the joint density function f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y).
Then X and Y are independent if f_{X/Y}(x / y) = f_X(x) for all x and y, or equivalently
    f_{X,Y}(x, y) = f_X(x) f_Y(y), where f_X(x) and f_Y(y) are the marginal density functions.


1.10 Moments of Random Variables
     •    Expectation provides a description of the random variable in terms of a few
          parameters instead of specifying the entire distribution function or the density
          function
     •    It is far easier to estimate the expectation of a R.V. from data than to estimate its
          distribution
First moment or mean
    The mean µ_X of a random variable X is defined by
        µ_X = EX = Σ_i x_i p_X(x_i)            for a discrete random variable X
                 = ∫_{−∞}^{∞} x f_X(x) dx      for a continuous random variable X
    For any piecewise continuous function y = g(x), the expectation of the R.V. Y = g(X) is given by
        EY = Eg(X) = ∫_{−∞}^{∞} g(x) f_X(x) dx

Second moment
        EX² = ∫_{−∞}^{∞} x² f_X(x) dx
Variance
        σ_X² = ∫_{−∞}^{∞} (x − µ_X)² f_X(x) dx
    •   Variance is a central moment and a measure of dispersion of the random variable about the mean.
    •   σ_X is called the standard deviation.
For two random variables X and Y the joint expectation is defined as
    E(XY) = µ_{X,Y} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f_{X,Y}(x, y) dx dy
The correlation between random variables X and Y, measured by the covariance, is given by
    Cov(X, Y) = σ_{XY} = E(X − µ_X)(Y − µ_Y)
              = E(XY − Xµ_Y − Yµ_X + µ_Xµ_Y)
              = E(XY) − µ_Xµ_Y
The ratio
    ρ = E(X − µ_X)(Y − µ_Y) / √( E(X − µ_X)² E(Y − µ_Y)² ) = σ_{XY} / (σ_X σ_Y)
is called the correlation coefficient. The correlation coefficient measures the degree of linear dependence between the two random variables.
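These moments are easy to estimate from data, as noted above. The following sketch (illustrative, using simulated samples) estimates the mean, variance, covariance and correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100000
x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)   # Y correlated with X by construction

mu_x, mu_y = x.mean(), y.mean()
var_x = np.mean((x - mu_x) ** 2)             # second central moment (variance)
cov_xy = np.mean((x - mu_x) * (y - mu_y))    # Cov(X, Y) = E(XY) - mu_X mu_Y
rho = cov_xy / np.sqrt(var_x * np.mean((y - mu_y) ** 2))
print(mu_x, var_x, cov_xy, rho)              # rho should be close to 0.8
```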



1.11 Uncorrelated random variables
Random variables X and Y are uncorrelated if covariance
Cov ( X , Y ) = 0
Two random variables may be dependent and yet uncorrelated. If there is correlation between two random variables, one may be represented as a linear regression on the other. We will discuss this point in the next section.


1.12 Linear prediction of Y from X

    Ŷ = aX + b        (linear regression)
    Prediction error: Y − Ŷ

Mean square prediction error
    E(Y − Ŷ)² = E(Y − aX − b)²
Minimising this error with respect to a and b gives the optimal values. At the optimum,
    ∂/∂a E(Y − aX − b)² = 0
    ∂/∂b E(Y − aX − b)² = 0
Solving for a and b,
    Ŷ − µ_Y = (σ_{X,Y} / σ_X²)(x − µ_X)
so that
    Ŷ − µ_Y = ρ_{X,Y} (σ_Y / σ_X)(x − µ_X),   where ρ_{X,Y} = σ_{XY}/(σ_X σ_Y) is the correlation coefficient.
If ρ_{X,Y} = 0, then X and Y are uncorrelated.
    ⇒ Ŷ − µ_Y = 0
    ⇒ Ŷ = µ_Y is the best prediction.

Note that independence => Uncorrelatedness. But uncorrelated generally does not imply
independence (except for jointly Gaussian random variables).
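A minimal sketch (with simulated data, not from the notes) of the linear prediction above: the sample-based solution for a and b follows a = σ_{XY}/σ_X², b = µ_Y − aµ_X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x = 2.0 + rng.standard_normal(n)
y = 3.0 * x + 1.0 + 0.5 * rng.standard_normal(n)

# Closed-form optimum from the mean-square error minimisation
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()

mse = np.mean((y - (a * x + b)) ** 2)   # mean-square prediction error
print(a, b, mse)                        # expect a ~ 3, b ~ 1, mse ~ 0.25
```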


Example 6:
    Y = X² and X is uniformly distributed between (−1, 1).
X and Y are dependent, but they are uncorrelated, because
    Cov(X, Y) = σ_{XY} = E(X − µ_X)(Y − µ_Y)
              = EXY − EX EY
              = EX³ − EX EX²
              = 0    (∵ EX = EX³ = 0)
In fact, for any zero-mean symmetric distribution of X, X and X² are uncorrelated.

1.13 Vector space Interpretation of Random Variables
The set of all random variables defined on a sample space forms a vector space with respect to addition and scalar multiplication. This is very easy to verify.


1.14 Linear Independence
Consider the sequence of random variables X_1, X_2, ..., X_N.
If c_1X_1 + c_2X_2 + ... + c_NX_N = 0 implies that c_1 = c_2 = ... = c_N = 0, then X_1, X_2, ..., X_N are linearly independent.


1.15 Statistical Independence
X_1, X_2, ..., X_N are statistically independent if
    f_{X_1, X_2, ..., X_N}(x_1, x_2, ..., x_N) = f_{X_1}(x_1) f_{X_2}(x_2) ... f_{X_N}(x_N)
Statistical independence of zero-mean random variables also implies their linear independence.


1.16 Inner Product
If x and y are real vectors in a vector space V defined over the field of real numbers, the inner product <x, y> is a scalar such that ∀ x, y, z ∈ V and a ∈ ℜ

    1. <x, y> = <y, x>
    2. <x, x> = ‖x‖² ≥ 0
    3. <x + y, z> = <x, z> + <y, z>
    4. <ax, y> = a <x, y>

In the case of RVs, the inner product between X and Y is defined as
    <X, Y> = EXY = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f_{X,Y}(x, y) dy dx.

Magnitude / norm of a vector:
    ‖x‖² = <x, x>

So, for a R.V.,
    ‖X‖² = EX² = ∫_{−∞}^{∞} x² f_X(x) dx

•   The set of RVs, together with the inner product defined through the joint expectation operation and the corresponding norm, defines a Hilbert space.


1.17 Schwarz Inequality
For any two vectors x and y belonging to a Hilbert space V,
    |<x, y>| ≤ ‖x‖ ‖y‖
For RVs X and Y,
    E²(XY) ≤ EX² EY²

Proof:
Consider the random variable Z = aX + Y. Then
    E(aX + Y)² ≥ 0
    ⇒ a²EX² + EY² + 2aEXY ≥ 0
Non-negativity of the left-hand side implies that its minimum over a must also be non-negative.
For the minimum value,
    dEZ²/da = 0 ⇒ a = −EXY/EX²
and the corresponding minimum is
    E²(XY)/EX² + EY² − 2E²(XY)/EX² = EY² − E²(XY)/EX²
Since the minimum is non-negative,
    EY² − E²(XY)/EX² ≥ 0
    ⇒ E²(XY) ≤ EX² EY²
Now,
    ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y) = E(X − µ_X)(Y − µ_Y) / √( E(X − µ_X)² E(Y − µ_Y)² )
From the Schwarz inequality,
    |ρ(X, Y)| ≤ 1


1.18 Orthogonal Random Variables
Recall the definition of orthogonality. Two vectors x and y are called orthogonal if <x, y> = 0.
Similarly, two random variables X and Y are called orthogonal if EXY = 0.
If each of X and Y is zero-mean,
    Cov(X, Y) = EXY
Therefore, if EXY = 0 then Cov(X, Y) = 0 in this case.
For zero-mean random variables,
    orthogonality ⟺ uncorrelatedness
1.19 Orthogonality Principle
X is a random variable which is not observable. Y is another, observable random variable which is statistically dependent on X. Given a value of Y, what is the best guess for X? (This is an estimation problem.)
Let the best estimate be X̂(Y). Then E(X − X̂(Y))² is a minimum with respect to X̂(Y), and the corresponding estimation principle is called the minimum mean square error principle. For finding the minimum, we have
    ∂/∂X̂ E(X − X̂(Y))² = 0
    ⇒ ∂/∂X̂ ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − X̂(y))² f_{X,Y}(x, y) dy dx = 0
    ⇒ ∂/∂X̂ ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − X̂(y))² f_Y(y) f_{X/Y}(x / y) dy dx = 0
    ⇒ ∂/∂X̂ ∫_{−∞}^{∞} f_Y(y) ( ∫_{−∞}^{∞} (x − X̂(y))² f_{X/Y}(x / y) dx ) dy = 0

Since f_Y(y) in the above equation is always positive, the minimisation is equivalent to
    ∂/∂X̂ ∫_{−∞}^{∞} (x − X̂(y))² f_{X/Y}(x / y) dx = 0
or
    2 ∫_{−∞}^{∞} (x − X̂(y)) f_{X/Y}(x / y) dx = 0
    ⇒ ∫_{−∞}^{∞} X̂(y) f_{X/Y}(x / y) dx = ∫_{−∞}^{∞} x f_{X/Y}(x / y) dx
    ⇒ X̂(y) = E(X / Y = y)

Thus, the minimum mean-square error estimation involves the conditional expectation, which is difficult to obtain numerically.
Let us consider a simpler version of the problem. We assume that X̂(y) = ay and the estimation problem is to find the optimal value of a. Thus we have the linear minimum mean-square error criterion, which minimises E(X − aY)².
    d/da E(X − aY)² = 0
    ⇒ E d/da (X − aY)² = 0
    ⇒ E(X − aY)Y = 0
    ⇒ EeY = 0
where e is the estimation error.




The above result shows that for the linear minimum mean-square error criterion,
estimation error is orthogonal to data. This result helps us in deriving optimal filters to
estimate a random signal buried in noise.
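As an illustration of the orthogonality principle (a sketch with simulated zero-mean data, not part of the notes), the linear estimator X̂ = aY with a = E(XY)/E(Y²) makes the error e = X − aY orthogonal to the data Y.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100000
x = rng.standard_normal(n)              # unobservable signal (zero mean)
y = x + 0.7 * rng.standard_normal(n)    # observation = signal + noise

a = np.mean(x * y) / np.mean(y * y)     # linear MMSE coefficient E(XY)/E(Y^2)
e = x - a * y                           # estimation error

print(a, np.mean(e * y))                # E(eY) should be ~0 (orthogonality)
```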


The mean and variance also give some quantitative information about the spread of a RV. The following inequalities are extremely useful in many practical problems.


1.20 Chebyshev Inequality
Suppose X is a parameter of a manufactured item with known mean µ_X and variance σ_X². The quality control department rejects the item if the absolute deviation of X from µ_X is greater than 2σ_X. What fraction of the manufactured items does the quality control department reject? Can you roughly guess it?
The standard deviation gives us an intuitive idea of how the random variable is distributed about the mean. This idea is expressed more precisely in the remarkable Chebyshev inequality stated below. For a random variable X with mean µ_X and variance σ_X²,

    P{|X − µ_X| ≥ ε} ≤ σ_X²/ε²

Proof:
    σ_X² = ∫_{−∞}^{∞} (x − µ_X)² f_X(x) dx
         ≥ ∫_{|x − µ_X| ≥ ε} (x − µ_X)² f_X(x) dx
         ≥ ∫_{|x − µ_X| ≥ ε} ε² f_X(x) dx
         = ε² P{|X − µ_X| ≥ ε}

    ∴ P{|X − µ_X| ≥ ε} ≤ σ_X²/ε²


1.21 Markov Inequality
For a random variable X which takes only non-negative values,
    P{X ≥ a} ≤ E(X)/a,   where a > 0.
Proof:
    E(X) = ∫_{0}^{∞} x f_X(x) dx
         ≥ ∫_{a}^{∞} x f_X(x) dx
         ≥ ∫_{a}^{∞} a f_X(x) dx
         = a P{X ≥ a}

    ∴ P{X ≥ a} ≤ E(X)/a

Result: P{(X − k)² ≥ a} ≤ E(X − k)²/a
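A quick simulation (illustrative only, using an exponential sample) comparing the empirical tail probabilities with the Markov and Chebyshev bounds.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=200000)     # nonnegative RV, mean 1, variance 1

a, eps = 3.0, 2.0
p_markov = np.mean(x >= a)                      # P(X >= a)
p_cheby = np.mean(np.abs(x - x.mean()) >= eps)  # P(|X - mu| >= eps)

print(p_markov, x.mean() / a)                   # empirical vs Markov bound E(X)/a
print(p_cheby, x.var() / eps ** 2)              # empirical vs Chebyshev bound var/eps^2
```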


1.22 Convergence of a sequence of random variables
Let X_1, X_2, ..., X_n be a sequence of n independent and identically distributed random variables. Suppose we want to estimate the mean of the random variable on the basis of the observed data by means of the relation
    µ̂_X = (1/n) Σ_{i=1}^{n} X_i
How closely does µ̂_X represent µ_X as n is increased? How do we measure the closeness between µ̂_X and µ_X?
Notice that µ̂_X is a random variable. What do we mean by the statement "µ̂_X converges to µ_X"?
Consider a deterministic sequence x_1, x_2, ..., x_n, .... The sequence converges to a limit x if, corresponding to any ε > 0, we can find a positive integer m such that |x − x_n| < ε for n > m.
Convergence of a random sequence X_1, X_2, ..., X_n, ... cannot be defined in the same way.
A sequence of random variables is said to converge everywhere to X if
    |X(ξ) − X_n(ξ)| → 0 as n → ∞ for every ξ.



1.23 Almost sure (a.s.) convergence or convergence with probability 1
For the random sequence X_1, X_2, ..., X_n, ..., {X_n → X} is an event.
If
    P{s | X_n(s) → X(s)} = 1 as n → ∞,
that is,
    P{s : |X_n(s) − X(s)| < ε for all n ≥ m} → 1 as m → ∞,
then the sequence is said to converge to X almost surely or with probability 1.
One important application is the strong law of large numbers:
If X_1, X_2, ..., X_n, ... are iid random variables, then (1/n) Σ_{i=1}^{n} X_i → µ_X with probability 1 as n → ∞.




1.24 Convergence in mean square sense
If E(X_n − X)² → 0 as n → ∞, we say that the sequence converges to X in mean square (M.S.).


Example 7:
If X_1, X_2, ..., X_n, ... are iid random variables, then
    (1/n) Σ_{i=1}^{n} X_i → µ_X in the mean square sense as n → ∞.
We have to show that lim_{n→∞} E( (1/n) Σ_{i=1}^{n} X_i − µ_X )² = 0.
Now,
    E( (1/n) Σ_{i=1}^{n} X_i − µ_X )² = E( (1/n) Σ_{i=1}^{n} (X_i − µ_X) )²
        = (1/n²) Σ_{i=1}^{n} E(X_i − µ_X)² + (1/n²) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(X_i − µ_X)(X_j − µ_X)
        = nσ_X²/n² + 0        (because of independence)
        = σ_X²/n
    ∴ lim_{n→∞} E( (1/n) Σ_{i=1}^{n} X_i − µ_X )² = 0
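Example 7 can be checked by simulation (an illustrative sketch): the mean-square error of the sample mean decays as σ_X²/n.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0                                   # variance of the iid samples
trials = 2000

for n in (10, 100, 1000):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
    sample_means = x.mean(axis=1)
    mse = np.mean(sample_means ** 2)           # E(mu_hat - mu)^2 with mu = 0
    print(n, mse, sigma2 / n)                  # empirical MSE vs sigma^2/n
```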


1.25 Convergence in probability
P{|X_n − X| > ε} is a sequence of probabilities. X_n is said to converge to X in probability if this sequence of probabilities converges to 0, that is,
    P{|X_n − X| > ε} → 0 as n → ∞.
If a sequence is convergent in mean square, then it is convergent in probability also, because
    P{|X_n − X|² > ε²} ≤ E(X_n − X)²/ε²        (Markov inequality)
so that
    P{|X_n − X| > ε} ≤ E(X_n − X)²/ε²
If E(X_n − X)² → 0 as n → ∞ (mean square convergence), then
    P{|X_n − X| > ε} → 0 as n → ∞.




Example 8:
Let {X_n} be a sequence of random variables with
    P{X_n = 1} = 1 − 1/n
and
    P{X_n = −1} = 1/n
Clearly,
    P{|X_n − 1| > ε} = P{X_n = −1} = 1/n → 0 as n → ∞.
Therefore {X_n} converges in probability to the constant random variable X = 1.




1.26 Convergence in distribution
The sequence X_1, X_2, ..., X_n, ... is said to converge to X in distribution if
    F_{X_n}(x) → F_X(x) as n → ∞
at every point x where F_X(x) is continuous. Here the two distribution functions eventually coincide.




1.27 Central Limit Theorem
Consider independent random variables X_1, X_2, ..., X_n.
Let Y = X_1 + X_2 + ... + X_n.
Then
    µ_Y = µ_{X_1} + µ_{X_2} + ... + µ_{X_n}
and
    σ_Y² = σ_{X_1}² + σ_{X_2}² + ... + σ_{X_n}²

The central limit theorem states that under very general conditions the distribution of Y approaches N(µ_Y, σ_Y²) as n → ∞. The conditions are:

    1. The random variables X_1, X_2, ..., X_n are independent with the same mean and variance, but not necessarily identically distributed.
    2. The random variables X_1, X_2, ..., X_n are independent with different means and the same variance, and not identically distributed.
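A small simulation (illustrative only) of the central limit theorem: the standardized sum of iid uniform random variables is compared with the standard normal distribution at a few points.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, trials = 30, 200000
x = rng.uniform(0.0, 1.0, size=(trials, n))   # iid uniforms: mean 1/2, variance 1/12

y = x.sum(axis=1)
z = (y - n * 0.5) / np.sqrt(n / 12.0)         # standardized sum

for t in (-1.0, 0.0, 1.0, 2.0):
    phi = 0.5 * (1 + erf(t / sqrt(2)))        # standard normal CDF
    print(t, np.mean(z <= t), phi)            # empirical CDF vs N(0,1) CDF
```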




                                                                                          18
1.28 Jointly Gaussian Random Variables
Two random variables X and Y are called jointly Gaussian if their joint density function is

    f_{X,Y}(x, y) = A exp{ −1/(2(1 − ρ_{X,Y}²)) [ (x − µ_X)²/σ_X² − 2ρ_{X,Y}(x − µ_X)(y − µ_Y)/(σ_Xσ_Y) + (y − µ_Y)²/σ_Y² ] }

    where A = 1 / ( 2πσ_Xσ_Y √(1 − ρ_{X,Y}²) )

Properties:
(1) If X and Y are jointly Gaussian, then for any constants a and b the random variable Z = aX + bY is Gaussian with mean µ_Z = aµ_X + bµ_Y and variance
    σ_Z² = a²σ_X² + b²σ_Y² + 2abσ_Xσ_Yρ_{X,Y}
(2) If two jointly Gaussian RVs are uncorrelated (ρ_{X,Y} = 0), then they are statistically independent:
    f_{X,Y}(x, y) = f_X(x) f_Y(y) in this case.
(3) If f_{X,Y}(x, y) is a jointly Gaussian density, then the marginal densities f_X(x) and f_Y(y) are also Gaussian.
(4) If X and Y are jointly Gaussian random variables, then the optimum nonlinear estimator X̂ of X that minimises the mean square error ξ = E{[X − X̂]²} is a linear estimator X̂ = aY.



REVIEW OF RANDOM PROCESS

2.1 Introduction

Recall that a random variable maps each sample point in the sample space to a point in
the real line. A random process maps each sample point to a waveform.
   •   A random process can be defined as an indexed family of random variables
       { X (t ), t ∈ T } where T is an index set which may be discrete or continuous usually

       denoting time.
   •   The random process is defined on a common probability space {S , ℑ, P}.

   •          A random process is a function of the sample point ξ and index variable t and
       may be written as X (t , ξ ).
   •          For a fixed t (= t 0 ), X (t 0 , ξ ) is a random variable.

   •          For a fixed ξ (= ξ 0 ), X (t , ξ 0 ) is a single realization of the random process and
       is a deterministic function.
   •          When both t and ξ are varying we have the random process X (t , ξ ).
The random process X (t , ξ ) is normally denoted by X (t ).
We can define a discrete random process X [n] on discrete points of time. Such a random
process is more important in practical implementations.



Figure: Random Process — sample functions X(t, s1), X(t, s2), X(t, s3) corresponding to sample points s1, s2, s3 of S, plotted against t.
2.2 How to describe a random process?

To describe X(t) we have to use the joint density function of the random variables at different values of t.
For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. Thus a random process can be described by the joint distribution function
    F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = F(x_1, x_2, ..., x_n, t_1, t_2, ..., t_n),   ∀n ∈ N and ∀t_n ∈ T
Alternatively, we can determine all the possible moments of the process:
    E(X(t)) = µ_X(t) = mean of the random process at t,
    R_X(t_1, t_2) = E(X(t_1)X(t_2)) = autocorrelation function at t_1, t_2,
    R_X(t_1, t_2, t_3) = E(X(t_1)X(t_2)X(t_3)) = triple correlation function at t_1, t_2, t_3, etc.
We can also define the auto-covariance function C_X(t_1, t_2) of X(t) given by
    C_X(t_1, t_2) = E(X(t_1) − µ_X(t_1))(X(t_2) − µ_X(t_2))
                  = R_X(t_1, t_2) − µ_X(t_1)µ_X(t_2)



Example 1:

(a) Gaussian Random Process
    For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. These n random variables define a random vector X = [X(t_1), X(t_2), ..., X(t_n)]'. The process X(t) is called Gaussian if the random vector [X(t_1), X(t_2), ..., X(t_n)]' is jointly Gaussian with the joint density function given by
        f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = exp( −½ (X − µ_X)' C_X⁻¹ (X − µ_X) ) / ( (2π)^{n/2} √det(C_X) )
    where C_X = E(X − µ_X)(X − µ_X)' and µ_X = E(X) = [E(X_1), E(X_2), ..., E(X_n)]'.
(b) Bernoulli Random Process
(c) A sinusoid with a random phase.
2.3 Stationary Random Process

A random process X(t) is called strict-sense stationary if its probability structure is invariant with time. In terms of the joint distribution function,

    F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = F_{X(t_1+t_0), X(t_2+t_0), ..., X(t_n+t_0)}(x_1, x_2, ..., x_n)   ∀n ∈ N and ∀t_0, t_n ∈ T

For n = 1,
    F_{X(t_1)}(x_1) = F_{X(t_1+t_0)}(x_1)   ∀t_0 ∈ T
Let us assume t_0 = −t_1. Then
    F_{X(t_1)}(x_1) = F_{X(0)}(x_1)
    ⇒ EX(t_1) = EX(0) = µ_X(0) = constant

For n = 2,
    F_{X(t_1), X(t_2)}(x_1, x_2) = F_{X(t_1+t_0), X(t_2+t_0)}(x_1, x_2)
Putting t_0 = −t_2,
    F_{X(t_1), X(t_2)}(x_1, x_2) = F_{X(t_1−t_2), X(0)}(x_1, x_2)
    ⇒ R_X(t_1, t_2) = R_X(t_1 − t_2)
A random process X(t) is called a wide-sense stationary (WSS) process if
    µ_X(t) = constant
    R_X(t_1, t_2) = R_X(t_1 − t_2) is a function of the time lag only.

For a Gaussian random process, WSS implies strict sense stationarity, because this
process is completely described by the mean and the autocorrelation functions.
The autocorrelation function R_X(τ) = EX(t + τ)X(t) is a crucial quantity for a WSS process.
    •   R_X(0) = EX²(t) is the mean-square value of the process.
    •   R_X(−τ) = R_X(τ) for a real process (for a complex process X(t), R_X(−τ) = R_X*(τ)).
    •   |R_X(τ)| ≤ R_X(0), which follows from the Schwarz inequality:
            R_X²(τ) = {EX(t)X(t + τ)}²
                    = <X(t), X(t + τ)>²
                    ≤ ‖X(t)‖² ‖X(t + τ)‖²
                    = EX²(t) EX²(t + τ)
                    = R_X(0) R_X(0) = R_X²(0)
            ∴ |R_X(τ)| ≤ R_X(0)
    •   R_X(τ) is a positive semi-definite function in the sense that for any positive integer n and any real a_1, ..., a_n,
            Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j R_X(t_i, t_j) ≥ 0


   •   If X (t ) is periodic (in the mean square sense or any other sense like with

       probability 1), then R X (τ ) is also periodic.
For a discrete random sequence, we can define the autocorrelation sequence similarly.
    •   If R_X(τ) drops quickly, the signal samples are less correlated, which in turn means that the signal changes rapidly with time. Such a signal has strong high-frequency components. If R_X(τ) drops slowly, the signal samples are highly correlated and such a signal has fewer high-frequency components.
   •   R X (τ ) is directly related to the frequency domain representation of WSS process.



2.4 Spectral Representation of a Random Process

How do we obtain a frequency-domain representation of a random process?
    •   Wiener (1930) and Khinchin (1934) independently discovered the spectral representation of a random process. Einstein (1914) also used the concept.
    •   The autocorrelation function and the power spectral density form a Fourier transform pair.
Let us define
    X_T(t) = X(t),   −T < t < T
           = 0       otherwise
As T → ∞, X_T(t) will represent the random process X(t).
Define X_T(ω) = ∫_{−T}^{T} X_T(t) e^{−jωt} dt (in the mean square sense).
Figure: Region of integration in the (t_1, t_2) plane for the change of variables τ = t_1 − t_2.

    E[X_T(ω)X_T*(ω)]/2T = E|X_T(ω)|²/2T
        = (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} E[X_T(t_1)X_T(t_2)] e^{−jωt_1} e^{jωt_2} dt_1 dt_2
        = (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} R_X(t_1 − t_2) e^{−jω(t_1 − t_2)} dt_1 dt_2
        = (1/2T) ∫_{−2T}^{2T} R_X(τ) e^{−jωτ} (2T − |τ|) dτ
Substituting t_1 − t_2 = τ (so that t_2 = t_1 − τ is a line in the (t_1, t_2) plane), we get
    E[X_T(ω)X_T*(ω)]/2T = ∫_{−2T}^{2T} R_X(τ) e^{−jωτ} (1 − |τ|/2T) dτ
If R_X(τ) is integrable, then as T → ∞,
    lim_{T→∞} E|X_T(ω)|²/2T = ∫_{−∞}^{∞} R_X(τ) e^{−jωτ} dτ
E|X_T(ω)|²/2T is the contribution to the average power at frequency ω and is called the power spectral density. Thus
    S_X(ω) = ∫_{−∞}^{∞} R_X(τ) e^{−jωτ} dτ
and
    R_X(τ) = (1/2π) ∫_{−∞}^{∞} S_X(ω) e^{jωτ} dω

Properties
    •   EX²(t) = R_X(0) = (1/2π) ∫_{−∞}^{∞} S_X(ω) dω = average power of the process.
    •   The average power in the band (ω_1, ω_2) is (1/2π) ∫_{ω_1}^{ω_2} S_X(ω) dω.
    •   R_X(τ) is real and even ⇒ S_X(ω) is real and even.
    •   From the definition S_X(ω) = lim_{T→∞} E|X_T(ω)|²/2T, S_X(ω) is always non-negative.
    •   h_X(ω) = S_X(ω) / ∫_{−∞}^{∞} S_X(ω) dω = S_X(ω)/(2π EX²(t)) is the normalised power spectral density; it is always non-negative with unit area, so it has the properties of a pdf.
2.5 Cross-correlation & Cross Power Spectral Density
Consider two real random processes X(t) and Y(t). Joint stationarity of X(t) and Y(t) implies that the joint densities are invariant with a shift of time.
The cross-correlation function R_{X,Y}(τ) for jointly WSS processes X(t) and Y(t) is defined as
    R_{X,Y}(τ) = E X(t + τ)Y(t)
so that
    R_{Y,X}(τ) = E Y(t + τ)X(t)
               = E X(t)Y(t + τ)
               = R_{X,Y}(−τ)
    ∴ R_{Y,X}(τ) = R_{X,Y}(−τ)
Cross power spectral density:
    S_{X,Y}(ω) = ∫_{−∞}^{∞} R_{X,Y}(τ) e^{−jωτ} dτ
For real processes X(t) and Y(t),
    S_{X,Y}(ω) = S*_{Y,X}(ω)
The Wiener-Khinchin theorem is also valid for discrete-time random processes.
If we define R_X[m] = E X[n + m]X[n], then the corresponding PSD is given by
    S_X(ω) = Σ_{m=−∞}^{∞} R_X[m] e^{−jωm},   −π ≤ ω ≤ π
or
    S_X(f) = Σ_{m=−∞}^{∞} R_X[m] e^{−j2πfm},   −1/2 ≤ f ≤ 1/2
    ∴ R_X[m] = (1/2π) ∫_{−π}^{π} S_X(ω) e^{jωm} dω
For a discrete sequence the generalised PSD is defined in the z-domain as follows:
    S_X(z) = Σ_{m=−∞}^{∞} R_X[m] z^{−m}



If we sample a stationary random process uniformly we get a stationary random sequence.
Sampling theorem is valid in terms of PSD.
Examples 2:
    (1) R_X(τ) = e^{−a|τ|},   a > 0
        S_X(ω) = 2a/(a² + ω²),   −∞ < ω < ∞
    (2) R_X[m] = a^{|m|},   0 < a < 1
        S_X(ω) = (1 − a²)/(1 − 2a cos ω + a²),   −π ≤ ω ≤ π
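The second pair in Examples 2 can be verified numerically (an illustrative sketch): summing the truncated series Σ a^{|m|} e^{−jωm} reproduces (1 − a²)/(1 − 2a cos ω + a²).

```python
import numpy as np

a = 0.6
m = np.arange(-200, 201)                      # truncated lag axis (a^|m| decays fast)
r = a ** np.abs(m)                            # autocorrelation sequence R_X[m] = a^|m|

w = np.linspace(-np.pi, np.pi, 9)
S_sum = np.array([np.sum(r * np.exp(-1j * wk * m)).real for wk in w])
S_formula = (1 - a**2) / (1 - 2 * a * np.cos(w) + a**2)
print(np.max(np.abs(S_sum - S_formula)))      # should be ~0
```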




2.6 White noise process

Figure: Flat power spectral density S_X(f) = N/2 of a white noise process, plotted against f.

A white noise process X(t) is defined by
    S_X(f) = N/2,   −∞ < f < ∞
The corresponding autocorrelation function is given by
    R_X(τ) = (N/2) δ(τ),   where δ(τ) is the Dirac delta.
The average power of white noise is
    P_avg = ∫_{−∞}^{∞} (N/2) df → ∞

•   Samples of a white noise process are uncorrelated.
•   White noise is a mathematical abstraction; it cannot be realised, since it has infinite power.
•   If the system bandwidth (BW) is sufficiently narrower than the noise BW and the noise PSD is flat over the system band, we can model the noise as a white noise process. Thermal and shot noise are well modelled as white Gaussian noise, since they have a very flat PSD over a very wide band (several GHz).
•   For a zero-mean white noise process, the correlation of the process at any lag τ ≠ 0 is zero.
•   White noise plays a key role in random signal modelling.
•   It plays a role similar to that of the impulse function in the modelling of deterministic signals:

        White Noise → [ Linear System ] → WSS Random Signal
2.7 White Noise Sequence
For a white noise sequence x[n],
    S_X(ω) = N/2,   −π ≤ ω ≤ π
Therefore
    R_X[m] = (N/2) δ[m]
where δ[m] is the unit impulse sequence.

Figure: R_X[m] is an impulse of height N/2 at m = 0; S_X(ω) is flat at N/2 for −π ≤ ω ≤ π.

2.8 Linear Shift Invariant System with Random Inputs
Consider a discrete-time linear system with impulse response h[n]:

        x[n] → [ h[n] ] → y[n]

    y[n] = x[n] * h[n]
    E y[n] = E x[n] * h[n]
For a stationary input x[n],
    µ_Y = E y[n] = µ_X * h[n] = µ_X Σ_{k=0}^{l} h[k]
where l is the length of the impulse response sequence.
    R_Y[m] = E y[n]y[n − m]
           = E (x[n] * h[n])(x[n − m] * h[n − m])
           = R_X[m] * h[m] * h[−m]
R_Y[m] is a function of the lag m only. From the above we get
    S_Y(ω) = |H(ω)|² S_X(ω)

        S_XX(ω) → [ |H(ω)|² ] → S_YY(ω)
Example 3:
Suppose
    H(ω) = 1,   −ω_c ≤ ω ≤ ω_c
         = 0    otherwise
and
    S_X(ω) = N/2,   −∞ < ω < ∞
Then
    S_Y(ω) = N/2,   −ω_c ≤ ω ≤ ω_c
and
    R_Y(τ) = (1/2π) ∫_{−ω_c}^{ω_c} (N/2) e^{jωτ} dω = (N/2π) sin(ω_cτ)/τ = (Nω_c/2π) sinc(ω_cτ),   with sinc(x) = sin(x)/x.

Figure: R_Y(τ) is a sinc-shaped function of τ.

    •   Note that although the input is an uncorrelated process, the output is a correlated process.
Consider the case of a discrete-time system with a random sequence x[n] as input:

        x[n] → [ h[n] ] → y[n]

    R_Y[m] = R_X[m] * h[m] * h[−m]
Taking the z-transform, we get
    S_Y(z) = S_X(z) H(z) H(z⁻¹)
Notice that if H(z) is causal, then H(z⁻¹) is anti-causal.
Similarly, if H(z) is minimum-phase then H(z⁻¹) is maximum-phase.

        R_XX[m], S_XX(z) → [ H(z) ] → [ H(z⁻¹) ] → R_YY[m], S_YY(z)

Example 4:
If $H(z) = \frac{1}{1 - \alpha z^{-1}}$ and $x[n]$ is a unit-variance white-noise sequence, then
$$S_{YY}(z) = H(z)\,H(z^{-1}) = \left(\frac{1}{1 - \alpha z^{-1}}\right)\left(\frac{1}{1 - \alpha z}\right)$$
By partial fraction expansion and the inverse $z$-transform, we get
$$R_Y[m] = \frac{\alpha^{|m|}}{1 - \alpha^2}$$
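As a sanity check, here is a minimal simulation sketch (assuming NumPy and SciPy are available; the seed and parameter values are arbitrary choices): unit-variance white noise is passed through $H(z) = 1/(1-\alpha z^{-1})$ and the sample autocorrelation of the output is compared with $\alpha^{|m|}/(1-\alpha^2)$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
alpha, n = 0.8, 200_000
v = rng.standard_normal(n)              # unit-variance white noise x[n]
y = lfilter([1.0], [1.0, -alpha], v)    # y[n] = alpha*y[n-1] + v[n]

for m in range(4):
    r_hat = np.mean(y[m:] * y[:n - m])  # sample autocorrelation at lag m
    r_th = alpha**m / (1 - alpha**2)
    print(m, round(r_hat, 3), round(r_th, 3))
```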
2.9 Spectral factorization theorem
A stationary random signal $X[n]$ that satisfies the Paley-Wiener condition
$$\int_{-\pi}^{\pi} |\ln S_X(\omega)|\, d\omega < \infty$$
can be considered as the output of a linear filter fed by a white noise sequence.
If $S_X(\omega)$ is an analytic function of $\omega$ and the above condition holds, then
$$S_X(z) = \sigma_v^2\, H_c(z)\, H_a(z)$$

where
$H_c(z)$ is the causal minimum-phase transfer function,
$H_a(z)$ is the anticausal maximum-phase transfer function,
and $\sigma_v^2$ is a constant, interpreted as the variance of a white-noise sequence.
v[n] (innovation sequence) → $H_c(z)$ → X[n]

Figure: Innovation filter
Because $H_c(z)$ is a minimum-phase filter, the corresponding inverse filter exists.

X[n] → $\frac{1}{H_c(z)}$ → v[n]

Figure: Whitening filter
Since $\ln S_{XX}(z)$ is analytic in an annular region $\rho < |z| < \frac{1}{\rho}$, it has the Laurent expansion
$$\ln S_{XX}(z) = \sum_{k=-\infty}^{\infty} c[k]\, z^{-k}$$
where
$$c[k] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{XX}(\omega)\, e^{j\omega k}\, d\omega$$
is the $k$th cepstral coefficient. For a real signal, $c[k] = c[-k]$, and
$$c[0] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{XX}(\omega)\, d\omega$$
Hence
$$S_{XX}(z) = e^{\sum_{k=-\infty}^{\infty} c[k]z^{-k}} = e^{c[0]}\; e^{\sum_{k=1}^{\infty} c[k]z^{-k}}\; e^{\sum_{k=-\infty}^{-1} c[k]z^{-k}}$$
Let
$$H_C(z) = e^{\sum_{k=1}^{\infty} c[k]z^{-k}} = 1 + h_c[1]z^{-1} + h_c[2]z^{-2} + \ldots, \qquad |z| > \rho$$
(since $h_c[0] = \lim_{z\to\infty} H_C(z) = 1$).
$H_C(z)$ and $\ln H_C(z)$ are both analytic in $|z| > \rho$
$\Rightarrow H_C(z)$ is a minimum-phase filter.
Similarly, let
$$H_a(z) = e^{\sum_{k=-\infty}^{-1} c[k]z^{-k}} = e^{\sum_{k=1}^{\infty} c[k]z^{k}} = H_C(z^{-1}), \qquad |z| < \frac{1}{\rho}$$
Therefore,
$$S_{XX}(z) = \sigma_V^2\, H_C(z)\, H_C(z^{-1})$$
where $\sigma_V^2 = e^{c[0]}$.




Salient points
•   $S_{XX}(z)$ can be factorized into a minimum-phase factor and a maximum-phase factor, i.e. $H_C(z)$ and $H_C(z^{-1})$.
•   In general spectral factorization is difficult; however, for a signal with a rational power spectrum it can be done easily.
•   Since $H_C(z)$ is a minimum-phase filter, $\frac{1}{H_C(z)}$ exists and is stable; therefore we can filter the given signal with $\frac{1}{H_C(z)}$ to obtain the innovation sequence. (A numerical sketch follows this list.)
•   $X[n]$ and $v[n]$ are related through an invertible transform, so they contain the same information.
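Below is a minimal numerical sketch of cepstrum-based spectral factorization, assuming NumPy and a PSD sampled on an FFT grid that satisfies the Paley-Wiener condition. The PSD $S(\omega) = |1 + 0.5e^{-j\omega}|^2$ is a made-up test case whose known factorization ($\sigma_v^2 = 1$, $H_C(z) = 1 + 0.5z^{-1}$) lets the output be checked.

```python
import numpy as np

nfft = 4096
w = 2 * np.pi * np.arange(nfft) / nfft
S = np.abs(1 + 0.5 * np.exp(-1j * w)) ** 2         # example rational PSD

c = np.fft.ifft(np.log(S)).real                    # cepstral coefficients c[k]
sigma_v2 = np.exp(c[0])                            # sigma_v^2 = e^{c[0]}

c_plus = np.zeros(nfft)
c_plus[1:nfft // 2] = c[1:nfft // 2]               # keep causal part, k >= 1
Hc = np.fft.ifft(np.exp(np.fft.fft(c_plus))).real  # impulse response of H_C(z)

print(round(sigma_v2, 4))    # ~1.0
print(np.round(Hc[:3], 4))   # ~[1.0, 0.5, 0.0]
```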



2.10 Wold’s Decomposition
Any WSS signal $X[n]$ can be decomposed as a sum of two mutually orthogonal processes:
•   a regular process $X_r[n]$ and a predictable process $X_p[n]$, with $X[n] = X_r[n] + X_p[n]$;
•   $X_r[n]$ can be expressed as the output of a linear filter driven by a white noise sequence;
•   $X_p[n]$ is a predictable process, that is, a process that can be predicted from its own past with zero prediction error.
CHAPTER – 3: RANDOM SIGNAL MODELLING

3.1 Introduction
   The spectral factorization theorem enables us to model a regular random process as
   an output of a linear filter with white noise as input. Different models are developed
   using different forms of linear filters.
      •   These models are mathematically described by linear constant coefficient
          difference equations.
      •   In statistics, random-process modeling using difference equations is known as
          time series analysis.

3.2 White Noise Sequence
The simplest model is the white noise $v[n]$. We shall assume that $v[n]$ has zero mean and variance $\sigma_V^2$.




Figure: Autocorrelation $R_V[m] = \sigma_V^2\,\delta[m]$ (a single impulse at $m = 0$) and the flat power spectrum $S_V(\omega) = \frac{\sigma_V^2}{2\pi}$ of the white noise sequence.




3.3 Moving Average MA(q) Model

v[n] → FIR filter → X[n]

The difference equation model is
$$X[n] = \sum_{i=0}^{q} b_i\, v[n-i]$$

$\mu_v = 0 \Rightarrow \mu_X = 0$, and since $v[n]$ is an uncorrelated sequence,
$$\sigma_X^2 = \sum_{i=0}^{q} b_i^2\, \sigma_V^2$$
The autocorrelations are given by
$$R_X[m] = E\,X[n]X[n-m] = \sum_{i=0}^{q}\sum_{j=0}^{q} b_i b_j\, E\,v[n-i]\,v[n-m-j] = \sum_{i=0}^{q}\sum_{j=0}^{q} b_i b_j\, R_V[m-i+j]$$
Noting that $R_V[m] = \sigma_V^2\,\delta[m]$, a term contributes $\sigma_V^2$ only when $m - i + j = 0$, i.e. $i = m + j$. The maximum value of $m + j$ is $q$, so that
$$R_X[m] = \sum_{j=0}^{q-m} b_j b_{j+m}\,\sigma_V^2, \qquad 0 \le m \le q$$

and $R_X[-m] = R_X[m]$.
Writing the above two relations together,
$$R_X[m] = \sigma_v^2 \sum_{j=0}^{q-|m|} b_j b_{j+|m|}, \qquad |m| \le q$$
$$\phantom{R_X[m]} = 0, \qquad \text{otherwise}$$
Notice that $R_X[m]$ is related to the model parameters by a nonlinear relationship; thus finding the model parameters is not simple.
The power spectral density is given by
$$S_X(\omega) = \frac{\sigma_V^2}{2\pi}\,|B(\omega)|^2, \qquad B(\omega) = b_0 + b_1 e^{-j\omega} + \ldots + b_q e^{-jq\omega}$$
An FIR system contributes only zeros, so if the spectrum has valleys an MA model will fit it well.


3.3.1 Test for MA process

$R_X[m]$ drops abruptly to zero beyond some lag $m = q$.




Figure: Autocorrelation function of an MA process (zero beyond lag $q$).
Figure: Power spectrum of an MA process.



Example 1: MA(1) process
$$X[n] = b_1 v[n-1] + b_0 v[n]$$
Here the parameters $b_0$ and $b_1$ are to be determined. We have
$$\sigma_X^2 = (b_1^2 + b_0^2)\,\sigma_V^2$$
$$R_X[1] = b_1 b_0\,\sigma_V^2$$
From the above, $b_0$ and $b_1$ can be calculated using the variance and the autocorrelation at lag 1 of the signal (a numerical sketch follows).
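A minimal sketch of this calculation (the function name and test values are illustrative): with unit-variance driving noise and $\rho = R_X[1]/R_X[0]$, the ratio $\theta = b_1/b_0$ of the invertible solution satisfies the quadratic $\rho\theta^2 - \theta + \rho = 0$.

```python
import numpy as np

def ma1_from_autocorr(r0, r1):
    """Recover (b0, b1) of an MA(1) process from R_X[0] and R_X[1]."""
    rho = r1 / r0
    if abs(rho) > 0.5:
        raise ValueError("not a valid MA(1) autocorrelation")
    # root with |theta| <= 1 gives the invertible model
    theta = (1 - np.sqrt(1 - 4 * rho**2)) / (2 * rho)
    b0 = np.sqrt(r0 / (1 + theta**2))
    return b0, theta * b0

print(ma1_from_autocorr(1.25, 0.5))   # MA(1) with b0 = 1, b1 = 0.5
```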



3.4 Autoregressive Model
In time series analysis it is called the AR(p) model.

v[n] → IIR filter → X[n]

The model is given by the difference equation
$$X[n] = \sum_{i=1}^{p} a_i X[n-i] + v[n]$$

The transfer function $H(\omega)$ is given by
$$H(\omega) = \frac{1}{A(\omega)}, \qquad A(\omega) = 1 - \sum_{i=1}^{p} a_i e^{-j\omega i}$$
This is an all-poles model, and
$$S_X(\omega) = \frac{\sigma_v^2}{2\pi\,|A(\omega)|^2}$$

      If there are sharp peaks in the spectrum, the AR(p) model may be suitable.
Figure: Typical power spectrum $S_X(\omega)$ (with sharp peaks) and autocorrelation $R_X[m]$ of an AR process.

The autocorrelation function $R_X[m]$ is given by
$$R_X[m] = E\,X[n]X[n-m] = \sum_{i=1}^{p} a_i\, E\,X[n-i]X[n-m] + E\,v[n]X[n-m]$$
$$\therefore\; R_X[m] = \sum_{i=1}^{p} a_i R_X[m-i] + \sigma_V^2\,\delta[m], \qquad m \ge 0$$

The above relation gives a set of linear equations which can be solved to find the $a_i$'s. These equations are known as the Yule-Walker equations (a numerical sketch follows).
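A minimal sketch of solving the Yule-Walker equations for an AR(p) model from autocorrelation values $R_X[0], \ldots, R_X[p]$ (assuming NumPy/SciPy; the toy autocorrelation below is that of an AR(1) with $a_1 = 0.8$, $\sigma_v^2 = 1$, so the result can be checked by eye):

```python
import numpy as np
from scipy.linalg import toeplitz

def yule_walker(r):
    """r = [R_X[0], ..., R_X[p]] -> (a_1..a_p, sigma_v^2)."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    R = toeplitz(r[:p])               # matrix of R_X[m-i] values
    a = np.linalg.solve(R, r[1:])     # R_X[m] = sum_i a_i R_X[m-i], m = 1..p
    sigma_v2 = r[0] - a @ r[1:]       # from the m = 0 equation
    return a, sigma_v2

r = [0.8**m / (1 - 0.8**2) for m in range(3)]   # AR(1) autocorrelations
print(yule_walker(r))                            # a ~ [0.8, 0.0], sigma ~ 1.0
```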


Example 2: AR(1) process
$$X[n] = a_1 X[n-1] + v[n]$$
$$R_X[m] = a_1 R_X[m-1] + \sigma_V^2\,\delta[m]$$
$$\therefore\; R_X[0] = a_1 R_X[-1] + \sigma_V^2 \qquad (1)$$
and $R_X[1] = a_1 R_X[0]$, so that
$$a_1 = \frac{R_X[1]}{R_X[0]}$$
From (1), using $R_X[-1] = R_X[1]$,
$$\sigma_X^2 = R_X[0] = \frac{\sigma_V^2}{1 - a_1^2}$$
After some algebra we get
$$R_X[m] = \frac{a_1^{|m|}\,\sigma_V^2}{1 - a_1^2}$$
3.5 ARMA(p,q) – Autoregressive Moving Average Model
In most practical situations, the process may be considered as the output of a filter that has both zeros and poles.

v[n] → $H(\omega) = \frac{B(\omega)}{A(\omega)}$ → X[n]

The model is given by
$$X[n] = \sum_{i=1}^{p} a_i X[n-i] + \sum_{i=0}^{q} b_i v[n-i] \qquad \text{(ARMA 1)}$$
and is called the ARMA(p,q) model.
The transfer function of the filter is given by
$$H(\omega) = \frac{B(\omega)}{A(\omega)}$$
so that
$$S_X(\omega) = \frac{|B(\omega)|^2}{|A(\omega)|^2}\,\frac{\sigma_V^2}{2\pi}$$
A simulation sketch of eq. (ARMA 1) is given below.
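A minimal sketch (assuming SciPy; all parameter values are arbitrary) generating an ARMA(2,1) sample path from eq. (ARMA 1). Note the sign convention: with $X[n] = a_1 X[n-1] + a_2 X[n-2] + b_0 v[n] + b_1 v[n-1]$, the filter denominator is $[1, -a_1, -a_2]$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
a1, a2 = 0.6, -0.2        # AR coefficients a_1, a_2 (stable choice)
b = [1.0, 0.4]            # MA coefficients b_0, b_1

v = rng.standard_normal(10_000)       # white noise v[n]
x = lfilter(b, [1.0, -a1, -a2], v)    # ARMA(2,1) realization
print(x[:5])
```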



How do we get the model parameters? For $m \ge \max(p, q+1)$ there is no contribution from the $b_i$ terms to $R_X[m]$:
$$R_X[m] = \sum_{i=1}^{p} a_i R_X[m-i], \qquad m \ge \max(p, q+1)$$
From a set of $p$ such Yule-Walker equations, the $a_i$ parameters can be found.

Then we can rewrite the equation by forming the residual
$$\tilde{X}[n] = X[n] - \sum_{i=1}^{p} a_i X[n-i]$$
$$\therefore\; \tilde{X}[n] = \sum_{i=0}^{q} b_i v[n-i]$$
which is a pure MA(q) process; from this equation the $b_i$'s can be found.

The ARMA(p,q) model is an economical model: an AR(p)-only or MA(q)-only model may require a large number of parameters to represent the same process adequately. This concept in model building is known as the parsimony of parameters.
The difference equation of the ARMA(p,q) model, given by eq. (ARMA 1), can be reduced to $p$ first-order difference equations, giving a state-space representation of the random process as follows:
$$z[n] = A\,z[n-1] + B\,v[n]$$
$$X[n] = C\,z[n]$$
where
$$z[n] = [X[n]\;\; X[n-1]\; \ldots\; X[n-p]]'$$
$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_p \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad B = [1\;\; 0\; \ldots\; 0]', \qquad C = [b_0\;\; b_1\; \ldots\; b_q]$$

Such a representation is convenient for analysis.


3.6 General ARMA(p,q) Model Building Steps
      •   Identification of p and q.
      •   Estimation of model parameters.
      •   Check the modeling error.
      •   If it is white noise then stop.
          Else select new values for p and q
          and repeat the process.


3.7 Other models: for nonstationary random processes
•    ARIMA model: here the data, after differencing, can be fed to an ARMA model.
•    SARMA model: seasonal ARMA model. Here the signal contains a seasonal fluctuation term; after differencing by a step equal to the seasonal period, the signal becomes stationary and an ARMA model can be fitted to the resulting data.
CHAPTER – 4: ESTIMATION THEORY

4.1 Introduction
•     For speech, we have the LPC (linear predictive coding) model; the LPC parameters are to be estimated from observed data.
•     We may have to estimate the correct value of a signal from a noisy observation.


In RADAR signal processing, we estimate the target and its distance from the observed data. In sonar signal processing, an array of sensors picks up the signals generated by a submarine due to its mechanical movements, and we estimate the location of the submarine.


Generally estimation includes parameter estimation and signal estimation.
We will discuss the problem of parameter estimation here.


We have a sequence of observable random variables $X_1, X_2, \ldots, X_N$, represented by the vector
$$\mathbf{X} = [X_1\;\; X_2\; \ldots\; X_N]'$$
$\mathbf{X}$ is governed by a joint density function which depends on some unobservable parameter $\theta$:
$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N \mid \theta) = f_{\mathbf{X}}(\mathbf{x} \mid \theta)$$
where $\theta$ may be deterministic or random. Our aim is to make an inference about $\theta$ from an observed sample of $X_1, X_2, \ldots, X_N$.

An estimator $\hat\theta(\mathbf{X})$ is a rule by which we guess the value of an unknown $\theta$ on the basis of $\mathbf{X}$. $\hat\theta(\mathbf{X})$ is random, being a function of random variables. For a particular observation $x_1, x_2, \ldots, x_N$, we get what is known as an estimate (not an estimator).
Let $X_1, X_2, \ldots, X_N$ be a sequence of independent and identically distributed (iid) random variables with mean $\mu_X$ and variance $\sigma_X^2$.
$$\hat\mu_X = \frac{1}{N}\sum_{i=1}^{N} X_i \quad\text{is an estimator for } \mu_X.$$
$$\hat\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \hat\mu_X)^2 \quad\text{is an estimator for } \sigma_X^2.$$
An estimator is a function of the random sequence $X_1, X_2, \ldots, X_N$ that does not involve any unknown parameters. Such a function is generally called a statistic.



4.2 Properties of the Estimator
A good estimator should satisfy some properties. These properties are described in terms
of the mean and variance of the estimator.



4.3 Unbiased estimator
An estimator θˆ of θ is said to be unbiased if and only if Eθˆ = θ .
The quantity Eθˆ − θ is called the bias of the estimator.
Unbiasedness is necessary but not sufficient to make an estimator a good one.
Consider
$$\hat\sigma_1^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \hat\mu_X)^2$$
and
$$\hat\sigma_2^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \hat\mu_X)^2$$

for an iid random sequence X 1 , X 2 ,...., X N .

We can show that $\hat\sigma_2^2$ is an unbiased estimator.
$$E\sum_{i=1}^{N}(X_i - \hat\mu_X)^2 = E\sum_{i=1}^{N}(X_i - \mu_X + \mu_X - \hat\mu_X)^2 = E\sum_{i=1}^{N}\left\{(X_i - \mu_X)^2 + (\mu_X - \hat\mu_X)^2 + 2(X_i - \mu_X)(\mu_X - \hat\mu_X)\right\}$$
Now $E(X_i - \mu_X)^2 = \sigma_X^2$, and
$$E(\hat\mu_X - \mu_X)^2 = E\left(\mu_X - \frac{\sum X_i}{N}\right)^2 = \frac{1}{N^2}E\left(N\mu_X - \sum X_i\right)^2 = \frac{1}{N^2}E\left(\sum (X_i - \mu_X)\right)^2$$
$$= \frac{1}{N^2}E\sum (X_i - \mu_X)^2 + \frac{1}{N^2}\sum_i\sum_{j\ne i} E(X_i - \mu_X)(X_j - \mu_X) = \frac{1}{N^2}E\sum (X_i - \mu_X)^2 \;\;\text{(because of independence)} = \frac{\sigma_X^2}{N}$$
Also, since $\sum_i (X_i - \mu_X) = N(\hat\mu_X - \mu_X)$,
$$E\sum_i (X_i - \mu_X)(\mu_X - \hat\mu_X) = -N\,E(\hat\mu_X - \mu_X)^2 = -\sigma_X^2$$
$$\therefore\; E\sum_{i=1}^{N}(X_i - \hat\mu_X)^2 = N\sigma_X^2 + \sigma_X^2 - 2\sigma_X^2 = (N-1)\sigma_X^2$$
So
$$E\hat\sigma_2^2 = \frac{1}{N-1}E\sum (X_i - \hat\mu_X)^2 = \sigma_X^2$$
$\therefore\; \hat\sigma_2^2$ is an unbiased estimator of $\sigma_X^2$.
Similarly, the sample mean is an unbiased estimator:
$$\hat\mu_X = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad E\hat\mu_X = \frac{1}{N}\sum_{i=1}^{N} E X_i = \frac{N\mu_X}{N} = \mu_X$$
(A Monte Carlo illustration of the bias of $\hat\sigma_1^2$ versus $\hat\sigma_2^2$ is given below.)
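A minimal Monte Carlo sketch (assuming NumPy; the true variance of 4.0, sample size, and trial count are arbitrary choices) illustrating the bias of the $1/N$ variance estimator versus the unbiased $1/(N-1)$ version:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma2 = 10, 100_000, 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))

dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2
s1 = dev2.sum(axis=1) / N          # sigma_1^2 (biased)
s2 = dev2.sum(axis=1) / (N - 1)    # sigma_2^2 (unbiased)

print(np.mean(s1))   # ~ (N-1)/N * 4.0 = 3.6
print(np.mean(s2))   # ~ 4.0
```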




4.4 Variance of the estimator

The variance of the estimator $\hat\theta$ is given by
$$\mathrm{var}(\hat\theta) = E(\hat\theta - E\hat\theta)^2$$
For the unbiased case,
$$\mathrm{var}(\hat\theta) = E(\hat\theta - \theta)^2$$
The variance of the estimator should be as low as possible.
An unbiased estimator $\hat\theta$ is called a minimum variance unbiased estimator (MVUE) if
$$E(\hat\theta - \theta)^2 \le E(\hat\theta' - \theta)^2$$
where $\hat\theta'$ is any other unbiased estimator.
4.5 Mean square error of the estimator
$$\mathrm{MSE} = E(\hat\theta - \theta)^2$$
The MSE should be as small as possible. Among all unbiased estimators, the MVUE has the minimum mean square error.
The MSE is related to the bias and the variance as shown below:
$$\mathrm{MSE} = E(\hat\theta - \theta)^2 = E(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2$$
$$= E(\hat\theta - E\hat\theta)^2 + E(E\hat\theta - \theta)^2 + 2E(\hat\theta - E\hat\theta)(E\hat\theta - \theta)$$
$$= E(\hat\theta - E\hat\theta)^2 + E(E\hat\theta - \theta)^2 + 2(E\hat\theta - E\hat\theta)(E\hat\theta - \theta)$$
$$= \mathrm{var}(\hat\theta) + b^2(\hat\theta) + 0 \qquad (\text{why?})$$
So
$$\mathrm{MSE} = \mathrm{var}(\hat\theta) + b^2(\hat\theta)$$



4.6 Consistent Estimators
As we have more data, the quality of estimation should improve. This idea is used in defining a consistent estimator.

An estimator $\hat\theta$ is called a consistent estimator of $\theta$ if $\hat\theta$ converges in probability to $\theta$:
$$\lim_{N\to\infty} P\left(|\hat\theta - \theta| \ge \varepsilon\right) = 0 \quad\text{for any } \varepsilon > 0$$
A less rigorous test is obtained by applying the Markov inequality:
$$P\left(|\hat\theta - \theta| \ge \varepsilon\right) \le \frac{E(\hat\theta - \theta)^2}{\varepsilon^2} = \frac{\mathrm{MSE}}{\varepsilon^2}$$
If $\hat\theta$ is an unbiased estimator ($b(\hat\theta) = 0$), then $\mathrm{MSE} = \mathrm{var}(\hat\theta)$. Therefore, if
$$\lim_{N\to\infty} E(\hat\theta - \theta)^2 = 0,$$
then $\hat\theta$ is a consistent estimator.
Also note that $\mathrm{MSE} = \mathrm{var}(\hat\theta) + b^2(\hat\theta)$. Therefore, if the estimator is asymptotically unbiased (i.e. $b(\hat\theta) \to 0$ as $N \to \infty$) and $\mathrm{var}(\hat\theta) \to 0$ as $N \to \infty$, then $\mathrm{MSE} \to 0$ and $\hat\theta$ is a consistent estimator.
Example 1: $X_1, X_2, \ldots, X_N$ is an iid random sequence with unknown mean $\mu_X$ and known variance $\sigma_X^2$.
Let $\hat\mu_X = \frac{1}{N}\sum_{i=1}^{N} X_i$ be an estimator for $\mu_X$. We have already shown that $\hat\mu_X$ is unbiased. Also $\mathrm{var}(\hat\mu_X) = \frac{\sigma_X^2}{N}$. Is it a consistent estimator?
Clearly
$$\lim_{N\to\infty} \mathrm{var}(\hat\mu_X) = \lim_{N\to\infty} \frac{\sigma_X^2}{N} = 0.$$
Therefore $\hat\mu_X$ is a consistent estimator of $\mu_X$.



4.7 Sufficient Statistic
The observations $X_1, X_2, \ldots, X_N$ contain information about the unknown parameter $\theta$. An estimator should carry the same information about $\theta$ as the observed data; the concept of a sufficient statistic is based on this idea.
A measurable function $\hat\theta(X_1, X_2, \ldots, X_N)$ is called a sufficient statistic for $\theta$ if it contains the same information about $\theta$ as is contained in the random sequence $X_1, X_2, \ldots, X_N$. In other words, the joint conditional density
$$f_{X_1, X_2, \ldots, X_N \mid \hat\theta(X_1, X_2, \ldots, X_N)}(x_1, x_2, \ldots, x_N)$$
does not involve $\theta$.
There are a large number of sufficient statistics for a particular problem; one has to select a sufficient statistic with good estimation properties.


A way to check whether a statistic is sufficient is the factorization theorem, which states:
$\hat\theta(X_1, X_2, \ldots, X_N)$ is a sufficient statistic for $\theta$ if
$$f_{X_1, X_2, \ldots, X_N / \theta}(x_1, x_2, \ldots, x_N) = g(\theta, \hat\theta)\, h(x_1, x_2, \ldots, x_N)$$
where $g(\theta, \hat\theta)$ is a non-constant, nonnegative function of $\theta$ and $\hat\theta$, and $h(x_1, x_2, \ldots, x_N)$ does not involve $\theta$ and is a nonnegative function of $x_1, x_2, \ldots, x_N$.


Example 2: Suppose $X_1, X_2, \ldots, X_N$ is an iid Gaussian sequence with unknown mean $\mu_X$ and known variance 1. Then $\hat\mu_X = \frac{1}{N}\sum_{i=1}^{N} X_i$ is a sufficient statistic for $\mu_X$:
$$f_{X_1, \ldots, X_N / \mu_X}(x_1, \ldots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i - \mu_X)^2} = \frac{1}{(\sqrt{2\pi})^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}(x_i - \mu_X)^2}$$
$$= \frac{1}{(\sqrt{2\pi})^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}(x_i - \hat\mu_X + \hat\mu_X - \mu_X)^2}$$
$$= \frac{1}{(\sqrt{2\pi})^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}\left\{(x_i - \hat\mu_X)^2 + (\hat\mu_X - \mu_X)^2 + 2(x_i - \hat\mu_X)(\hat\mu_X - \mu_X)\right\}}$$
$$= \frac{1}{(\sqrt{2\pi})^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}(x_i - \hat\mu_X)^2}\; e^{-\frac{N}{2}(\hat\mu_X - \mu_X)^2}\; e^{0} \qquad (\text{why?})$$
The first exponential is a function of $x_1, \ldots, x_N$ only, and the second is a function of $\mu_X$ and $\hat\mu_X$. Therefore, by the factorization theorem, $\hat\mu_X$ is a sufficient statistic for $\mu_X$.




4.8 Cramer-Rao theorem
We have described the minimum variance unbiased estimator (MVUE), which is a very good estimator.
$\hat\theta$ is an MVUE if
$$E(\hat\theta) = \theta \quad\text{and}\quad \mathrm{Var}(\hat\theta) \le \mathrm{Var}(\hat\theta')$$
where $\hat\theta'$ is any other unbiased estimator of $\theta$.
Can we reduce the variance of an unbiased estimator indefinitely? The answer is given by the Cramer-Rao theorem.
Suppose $\hat\theta$ is an unbiased estimator based on a random sequence, denoted by the vector
$$\mathbf{X} = [X_1\;\; X_2\; \ldots\; X_N]'$$
Let $f_{\mathbf{X}}(x_1, \ldots, x_N/\theta)$ be the joint density function which characterizes $\mathbf{X}$. This function is also called the likelihood function. $\theta$ may also be random; in that case the likelihood function represents a conditional joint density function.
$$L(\mathbf{x}/\theta) = \ln f_{\mathbf{X}}(x_1, \ldots, x_N/\theta)$$
is called the log-likelihood function.
4.9 Statement of the Cramer-Rao theorem

If $\hat\theta$ is an unbiased estimator of $\theta$, then
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{I(\theta)}$$
where $I(\theta) = E\left(\frac{\partial L}{\partial\theta}\right)^2$; $I(\theta)$ is a measure of the average information in the random sequence and is called the Fisher information statistic.
Equality in the CR bound holds if $\frac{\partial L}{\partial\theta} = c(\hat\theta - \theta)$, where $c$ is a constant.
Proof: $\hat\theta$ is an unbiased estimator of $\theta$:
$$E(\hat\theta - \theta) = 0 \;\Rightarrow\; \int_{-\infty}^{\infty} (\hat\theta - \theta)\, f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 0$$
Differentiating with respect to $\theta$, we get
$$\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta}\left\{(\hat\theta - \theta)\, f_{\mathbf{X}}(\mathbf{x}/\theta)\right\} d\mathbf{x} = 0$$
(since the limits of integration are not functions of $\theta$), i.e.
$$\int_{-\infty}^{\infty} (\hat\theta - \theta)\,\frac{\partial}{\partial\theta} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} - \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 0$$
$$\therefore\; \int_{-\infty}^{\infty} (\hat\theta - \theta)\,\frac{\partial}{\partial\theta} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 1 \qquad (1)$$
Note that
$$\frac{\partial}{\partial\theta} f_{\mathbf{X}}(\mathbf{x}/\theta) = \frac{\partial}{\partial\theta}\left\{\ln f_{\mathbf{X}}(\mathbf{x}/\theta)\right\} f_{\mathbf{X}}(\mathbf{x}/\theta) = \left(\frac{\partial L}{\partial\theta}\right) f_{\mathbf{X}}(\mathbf{x}/\theta)$$
Therefore, from (1),
$$\int_{-\infty}^{\infty} (\hat\theta - \theta)\left\{\frac{\partial}{\partial\theta} L(\mathbf{x}/\theta)\right\} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 1$$
so that, splitting $f_{\mathbf{X}}(\mathbf{x}/\theta) = \sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}\,\sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}$ (valid since $f_{\mathbf{X}}(\mathbf{x}/\theta) \ge 0$),
$$\left\{\int_{-\infty}^{\infty} (\hat\theta - \theta)\sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}\;\frac{\partial L(\mathbf{x}/\theta)}{\partial\theta}\sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}\, d\mathbf{x}\right\}^2 = 1 \qquad (2)$$
Recall the Cauchy-Schwarz inequality
$$\langle a, b\rangle^2 \le \|a\|^2\, \|b\|^2$$
where the equality holds when $a = cb$ ($c$ any scalar). Applying this inequality to the L.H.S. of equation (2), we get
$$\left(\int_{-\infty}^{\infty} (\hat\theta - \theta)\sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}\;\frac{\partial L(\mathbf{x}/\theta)}{\partial\theta}\sqrt{f_{\mathbf{X}}(\mathbf{x}/\theta)}\, d\mathbf{x}\right)^2 \le \int_{-\infty}^{\infty} (\hat\theta - \theta)^2 f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x}\; \int_{-\infty}^{\infty} \left(\frac{\partial L(\mathbf{x}/\theta)}{\partial\theta}\right)^2 f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = \mathrm{var}(\hat\theta)\, I(\theta)$$
But the L.H.S. equals 1, so
$$\mathrm{var}(\hat\theta)\, I(\theta) \ge 1 \;\Rightarrow\; \mathrm{var}(\hat\theta) \ge \frac{1}{I(\theta)},$$
which is the Cramer-Rao inequality.
The equality holds when the two factors in the Cauchy-Schwarz inequality are proportional, i.e. when
$$\frac{\partial L(\mathbf{x}/\theta)}{\partial\theta} = c\,(\hat\theta - \theta)$$

Also, from $\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 1$, we get
$$\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 0 \;\Rightarrow\; \int_{-\infty}^{\infty} \frac{\partial L}{\partial\theta}\, f_{\mathbf{X}}(\mathbf{x}/\theta)\, d\mathbf{x} = 0$$
Taking the partial derivative with respect to $\theta$ again, we get
$$\int_{-\infty}^{\infty} \left\{\frac{\partial^2 L}{\partial\theta^2}\, f_{\mathbf{X}}(\mathbf{x}/\theta) + \frac{\partial L}{\partial\theta}\,\frac{\partial}{\partial\theta} f_{\mathbf{X}}(\mathbf{x}/\theta)\right\} d\mathbf{x} = 0$$
$$\therefore\; \int_{-\infty}^{\infty} \left\{\frac{\partial^2 L}{\partial\theta^2}\, f_{\mathbf{X}}(\mathbf{x}/\theta) + \left(\frac{\partial L}{\partial\theta}\right)^2 f_{\mathbf{X}}(\mathbf{x}/\theta)\right\} d\mathbf{x} = 0$$
$$E\left(\frac{\partial L}{\partial\theta}\right)^2 = -E\,\frac{\partial^2 L}{\partial\theta^2}$$
If   θˆ   satisfies CR -bound with equality, then              θˆ   is called an efficient estimator.
Remarks:
(1) The larger the information $I(\theta)$, the smaller the achievable variance of the estimator $\hat\theta$.
(2) Suppose $X_1, \ldots, X_N$ are iid. Then
$$I_1(\theta) = E\left(\frac{\partial}{\partial\theta}\ln f_{X_1/\theta}(x)\right)^2$$
$$\therefore\; I_N(\theta) = E\left(\frac{\partial}{\partial\theta}\ln f_{X_1, X_2, \ldots, X_N/\theta}(x_1, x_2, \ldots, x_N)\right)^2 = N\, I_1(\theta)$$


Example 3:
Let $X_1, \ldots, X_N$ be an iid Gaussian random sequence with known variance $\sigma^2$ and unknown mean $\mu$.
Suppose $\hat\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$, which is unbiased. Find the CR bound and hence show that $\hat\mu$ is an efficient estimator.
The likelihood function $f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\mu)$ is the product of the individual densities (since iid):
$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\mu) = \frac{1}{(\sqrt{2\pi}\,\sigma)^N}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2}$$
so that
$$L(\mathbf{x}/\mu) = -\ln\left((\sqrt{2\pi}\,\sigma)^N\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$

Now
$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu)$$
$$\therefore\; \frac{\partial^2 L}{\partial\mu^2} = -\frac{N}{\sigma^2}, \qquad E\,\frac{\partial^2 L}{\partial\mu^2} = -\frac{N}{\sigma^2}$$
$$\therefore\; \text{CR bound} = \frac{1}{I(\mu)} = \frac{1}{-E\,\frac{\partial^2 L}{\partial\mu^2}} = \frac{\sigma^2}{N}$$

Also
$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = \frac{N}{\sigma^2}\left(\frac{\sum_i x_i}{N} - \mu\right) = \frac{N}{\sigma^2}\,(\hat\mu - \mu)$$
Hence $\frac{\partial L}{\partial\mu} = c\,(\hat\mu - \mu)$ with $c = \frac{N}{\sigma^2}$, so $\hat\mu$ is an efficient estimator: its variance $\frac{\sigma^2}{N}$ attains the CR bound.
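A quick numerical sanity check (assuming NumPy; all parameter values are arbitrary): the empirical variance of the sample mean of $N$ iid $N(\mu, \sigma^2)$ variables should match the CR bound $\sigma^2/N$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N, trials = 3.0, 2.0, 25, 200_000
x = rng.normal(mu, sigma, size=(trials, N))
mu_hat = x.mean(axis=1)

print(np.var(mu_hat))   # empirical variance of the estimator
print(sigma**2 / N)     # CR bound: 0.16
```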



4.10 Criteria for Estimation
The estimation of a parameter is based on several well-known criteria. Each criterion optimizes some function of the observed samples with respect to the unknown parameter to be estimated. Some of the most popular estimation criteria are:
•      Maximum likelihood
•      Minimum mean square error
•      Bayes' method
•      Maximum entropy method



4.11 Maximum Likelihood Estimator (MLE)
Given a random sequence $X_1, \ldots, X_N$ and the joint density function $f_{X_1, \ldots, X_N/\theta}(x_1, x_2, \ldots, x_N)$ which depends on an unknown nonrandom parameter $\theta$:
$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\theta)$ is called the likelihood function (for continuous random variables; for discrete random variables it is the joint probability mass function).
$L(\mathbf{x}/\theta) = \ln f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\theta)$ is called the log-likelihood function.
The maximum likelihood estimator $\hat\theta_{MLE}$ is the estimator such that
$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\hat\theta_{MLE}) \ge f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\theta), \qquad \forall\theta$$
If the likelihood function is differentiable w.r.t. $\theta$, then $\hat\theta_{MLE}$ is given by
$$\frac{\partial}{\partial\theta} f_{\mathbf{X}}(x_1, \ldots, x_N/\theta)\bigg|_{\theta = \hat\theta_{MLE}} = 0 \qquad\text{or}\qquad \frac{\partial L(\mathbf{x}/\theta)}{\partial\theta}\bigg|_{\hat\theta_{MLE}} = 0$$


Thus the MLE is given by the solution of the likelihood equation above.
If we have a number of unknown parameters, given by $\boldsymbol{\theta} = [\theta_1\;\;\theta_2\;\ldots\;\theta_M]'$, then the MLE is given by the set of conditions
$$\frac{\partial L}{\partial\theta_1}\bigg|_{\theta_1 = \hat\theta_{1,MLE}} = \frac{\partial L}{\partial\theta_2}\bigg|_{\theta_2 = \hat\theta_{2,MLE}} = \cdots = \frac{\partial L}{\partial\theta_M}\bigg|_{\theta_M = \hat\theta_{M,MLE}} = 0$$




Example 4:
Let $X_1, \ldots, X_N$ be an iid sequence of $N(\mu, \sigma^2)$ random variables. Find the MLE for $\mu$ and $\sigma^2$.
$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_N/\mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2}$$
$$L(\mathbf{x}/\mu, \sigma^2) = \ln f_{\mathbf{X}}(x_1, \ldots, x_N/\mu, \sigma^2) = -\frac{N}{2}\ln 2\pi - N\ln\sigma - \frac{1}{2}\sum_{i=1}^{N}\left(\frac{x_i - \mu}{\sigma}\right)^2$$
$$\frac{\partial L}{\partial\mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N}(x_i - \hat\mu_{MLE}) = 0$$
$$\frac{\partial L}{\partial\sigma} = 0 \;\Rightarrow\; -\frac{N}{\hat\sigma_{MLE}} + \frac{\sum_i (x_i - \hat\mu_{MLE})^2}{\hat\sigma_{MLE}^3} = 0$$
Solving, we get
$$\hat\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\text{and}\qquad \hat\sigma_{MLE}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat\mu_{MLE})^2$$
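A minimal sketch (assuming NumPy/SciPy; data and seed are arbitrary) maximizing the Gaussian log-likelihood numerically and comparing with the closed-form MLEs derived above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1000)

def neg_log_like(params):
    # negative log-likelihood up to an additive constant; sigma is
    # parameterized as exp(log_sigma) to keep it positive
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return np.sum(np.log(sigma) + 0.5 * ((x - mu) / sigma) ** 2)

res = minimize(neg_log_like, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]) ** 2)   # numerical MLE for mu, sigma^2
print(x.mean(), x.var())                 # closed-form MLE: same values
```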


Example 5:
Let $X_1, \ldots, X_N$ be an iid sequence with
$$f_{X/\theta}(x) = \frac{1}{2}\, e^{-|x - \theta|}, \qquad -\infty < x < \infty$$
Show that the median of $X_1, \ldots, X_N$ is the MLE for $\theta$.
$$f_{X_1, X_2, \ldots, X_N/\theta}(x_1, x_2, \ldots, x_N) = \frac{1}{2^N}\, e^{-\sum_{i=1}^{N}|x_i - \theta|}$$
$$L(\mathbf{x}/\theta) = -N\ln 2 - \sum_{i=1}^{N}|x_i - \theta|$$
$\sum_{i=1}^{N}|x_i - \theta|$ is minimized by $\theta = \mathrm{median}(x_1, \ldots, x_N)$; hence the median is the MLE (a numerical check follows).
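A small numerical check (assuming NumPy; the data values are arbitrary) that the median minimizes the sum of absolute deviations, as claimed above:

```python
import numpy as np

x = np.array([0.3, -1.2, 2.5, 0.9, 4.1])
grid = np.linspace(-3, 6, 9001)
cost = np.abs(x[:, None] - grid[None, :]).sum(axis=0)  # sum_i |x_i - theta|

print(grid[np.argmin(cost)])   # ~0.9
print(np.median(x))            # 0.9
```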
Some properties of the MLE (without proof):
 •     The MLE may be biased or unbiased; it is asymptotically unbiased.
 •     The MLE is a consistent estimator.
 •     If an efficient estimator exists, it is the MLE:
       an efficient estimator $\hat\theta$ exists $\Rightarrow \frac{\partial}{\partial\theta}L(\mathbf{x}/\theta) = c(\hat\theta - \theta)$; at $\theta = \hat\theta$,
       $$\frac{\partial L(\mathbf{x}/\theta)}{\partial\theta}\bigg|_{\hat\theta} = c(\hat\theta - \hat\theta) = 0$$
       $\Rightarrow \hat\theta$ satisfies the likelihood equation, i.e. it is the MLE.
 •     Invariance property of the MLE: if $\hat\theta_{MLE}$ is the MLE of $\theta$, then $h(\hat\theta_{MLE})$ is the MLE of $h(\theta)$, where $h(\theta)$ is an invertible function of $\theta$.



4.12 Bayesian Estimators
We may have some prior information about $\theta$, in the sense that some values of $\theta$ are more likely (a priori information). We can represent this prior information in the form of a prior density function. In the following we omit the subscripts of the density functions just for notational simplicity.

Parameter $\theta$ with density $f_\Theta(\theta)$ → observation $\mathbf{x}$ with density $f(\mathbf{x}/\theta)$

The likelihood function will now be the conditional density $f(\mathbf{x}/\theta)$.

$$f_{X,\Theta}(\mathbf{x}, \theta) = f_\Theta(\theta)\, f_{X/\Theta}(\mathbf{x})$$
Also we have the Bayes rule
$$f_{\Theta/X}(\theta) = \frac{f_\Theta(\theta)\, f_{X/\Theta}(\mathbf{x})}{f_X(\mathbf{x})}$$
where $f_{\Theta/X}(\theta)$ is the a posteriori density function.
The parameter $\theta$ is a random variable and the estimator $\hat\theta(\mathbf{x})$ is another random variable. The estimation error is $\varepsilon = \hat\theta - \theta$.
We associate a cost function $C(\hat\theta, \theta)$ with every estimator $\hat\theta$; it represents the positive penalty for each wrong estimation, so $C(\hat\theta, \theta)$ is a nonnegative function.
The three most popular cost functions are:
•   the quadratic cost function $(\hat\theta - \theta)^2$;
•   the absolute cost function $|\hat\theta - \theta|$;
•   the hit-or-miss cost function (also called the uniform cost function), which is 0 when $|\hat\theta - \theta| \le \Delta/2$ and 1 otherwise.
Minimizing the cost means minimizing it on the average.



4.13 Bayesian Risk function or average cost
$$\bar{C} = E\,C(\hat\theta, \theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} C(\hat\theta, \theta)\, f_{X,\Theta}(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta$$
The estimator seeks to minimize the Bayesian risk.


Case I. Quadratic Cost Function
$$C(\hat\theta, \theta) = (\theta - \hat\theta)^2$$
The estimation problem is:
$$\text{minimize}\quad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (\theta - \hat\theta)^2\, f_{X,\Theta}(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta$$
with respect to $\hat\theta$. This is equivalent to minimizing
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (\theta - \hat\theta)^2\, f(\theta|\mathbf{x})\, f(\mathbf{x})\, d\theta\, d\mathbf{x} = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} (\theta - \hat\theta)^2\, f(\theta|\mathbf{x})\, d\theta\right) f(\mathbf{x})\, d\mathbf{x}$$


Since $f(\mathbf{x})$ is always positive, the above integral will be minimum if the inner integral is minimum. This results in the problem:
$$\text{minimize}\quad \int_{-\infty}^{\infty} (\theta - \hat\theta)^2\, f(\theta|\mathbf{x})\, d\theta$$
with respect to $\hat\theta$:
$$\Rightarrow\; \frac{\partial}{\partial\hat\theta}\int_{-\infty}^{\infty} (\theta - \hat\theta)^2\, f_{\Theta/X}(\theta)\, d\theta = 0$$
$$\Rightarrow\; -2\int_{-\infty}^{\infty} (\theta - \hat\theta)\, f_{\Theta/X}(\theta)\, d\theta = 0$$
$$\Rightarrow\; \hat\theta\int_{-\infty}^{\infty} f_{\Theta/X}(\theta)\, d\theta = \int_{-\infty}^{\infty} \theta\, f_{\Theta/X}(\theta)\, d\theta$$
$$\Rightarrow\; \hat\theta = \int_{-\infty}^{\infty} \theta\, f_{\Theta/X}(\theta)\, d\theta$$
$\therefore\; \hat\theta$ is the conditional mean, i.e. the mean of the a posteriori density. Since we are minimizing the quadratic cost, it is also called the minimum mean square error estimator (MMSE).


Salient Points
•   Information about the distribution of $\theta$ is available.
•   The a priori density function $f_\Theta(\theta)$ is available; the likelihood $f(\mathbf{x}/\theta)$ describes how the observed data depend on $\theta$.
•   We have to determine the a posteriori density $f_{\Theta/X}(\theta)$; this is determined from the Bayes rule.
Case II. Hit-or-Miss Cost Function
The cost is 0 when $|\hat\theta - \theta| \le \Delta/2$ and 1 otherwise, with $\Delta$ small. The risk is
$$\bar{C} = E\,C(\hat\theta, \theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} C(\hat\theta, \theta)\, f_{X,\Theta}(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} C(\hat\theta, \theta)\, f_{\Theta/X}(\theta)\, d\theta\right) f_X(\mathbf{x})\, d\mathbf{x}$$



We have to minimize $\int_{-\infty}^{\infty} C(\hat\theta, \theta)\, f_{\Theta/X}(\theta)\, d\theta$ with respect to $\hat\theta$. This is equivalent to minimizing
$$1 - \int_{\hat\theta - \Delta/2}^{\hat\theta + \Delta/2} f_{\Theta/X}(\theta)\, d\theta$$
This minimization is equivalent to maximizing
$$\int_{\hat\theta - \Delta/2}^{\hat\theta + \Delta/2} f_{\Theta/X}(\theta)\, d\theta \cong \Delta\, f_{\Theta/X}(\hat\theta) \qquad\text{when } \Delta \text{ is very small}$$
This will be maximum if $f_{\Theta/X}(\hat\theta)$ is maximum; that means we select the value of $\hat\theta$ that maximizes the a posteriori density. This is known as the maximum a posteriori (MAP) estimation principle.
This estimator is denoted by $\hat\theta_{MAP}$.


Case III. Absolute Cost Function
$$C(\hat\theta, \theta) = |\hat\theta - \theta|$$
$$\bar{C} = E\,|\hat\theta - \theta| = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |\hat\theta - \theta|\, f_{\Theta,X}(\theta, \mathbf{x})\, d\theta\, d\mathbf{x} = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} |\hat\theta - \theta|\, f_{\Theta/X}(\theta|\mathbf{x})\, d\theta\right) f_X(\mathbf{x})\, d\mathbf{x}$$


For the minimum,
$$\frac{\partial}{\partial\hat\theta}\int_{-\infty}^{\infty} C(\hat\theta, \theta)\, f_{\Theta/X}(\theta|\mathbf{x})\, d\theta = 0$$
$$\frac{\partial}{\partial\hat\theta}\left\{\int_{-\infty}^{\hat\theta} (\hat\theta - \theta)\, f_{\Theta/X}(\theta|\mathbf{x})\, d\theta + \int_{\hat\theta}^{\infty} (\theta - \hat\theta)\, f_{\Theta/X}(\theta|\mathbf{x})\, d\theta\right\} = 0$$




Leibniz rule for differentiation of an integral:
$$\frac{\partial}{\partial u}\int_{\phi_1(u)}^{\phi_2(u)} h(u, v)\, dv = \int_{\phi_1(u)}^{\phi_2(u)} \frac{\partial h(u, v)}{\partial u}\, dv + \frac{d\phi_2(u)}{du}\, h(u, \phi_2(u)) - \frac{d\phi_1(u)}{du}\, h(u, \phi_1(u))$$

Applying the Leibniz rule, we get
$$\int_{-\infty}^{\hat\theta} f_{\Theta/X}(\theta|\mathbf{x})\, d\theta - \int_{\hat\theta}^{\infty} f_{\Theta/X}(\theta|\mathbf{x})\, d\theta = 0$$
At $\hat\theta = \hat\theta_{MAE}$,
$$\int_{-\infty}^{\hat\theta_{MAE}} f_{\Theta/X}(\theta|\mathbf{x})\, d\theta = \int_{\hat\theta_{MAE}}^{\infty} f_{\Theta/X}(\theta|\mathbf{x})\, d\theta$$
So $\hat\theta_{MAE}$ is the median of the a posteriori density.
Example 6:
Let $X_1, X_2, \ldots, X_N$ be an iid Gaussian sequence with unit variance and unknown mean $\theta$. Further, $\theta$ is known to be a zero-mean Gaussian with unit variance. Find the MAP estimator for $\theta$.

Solution: We are given
$$f_\Theta(\theta) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\theta^2}$$
$$f_{X/\Theta}(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^N}\, e^{-\sum_{i=1}^{N} \frac{(x_i - \theta)^2}{2}}$$
Therefore
$$f_{\Theta/X}(\theta) = \frac{f_\Theta(\theta)\, f_{X/\Theta}(\mathbf{x})}{f_X(\mathbf{x})}$$
We have to find $\theta$ such that $f_{\Theta/X}(\theta)$ is maximum. Now $f_{\Theta/X}(\theta)$ is maximum when $f_\Theta(\theta)\, f_{X/\Theta}(\mathbf{x})$ is maximum,
$$\Rightarrow\; \ln f_\Theta(\theta)\, f_{X/\Theta}(\mathbf{x}) \text{ is maximum}$$
$$\Rightarrow\; -\frac{1}{2}\theta^2 - \sum_{i=1}^{N} \frac{(x_i - \theta)^2}{2} \text{ is maximum}$$
$$\Rightarrow\; \left[-\theta + \sum_{i=1}^{N}(x_i - \theta)\right]_{\theta = \hat\theta_{MAP}} = 0$$
$$\Rightarrow\; \hat\theta_{MAP} = \frac{1}{N+1}\sum_{i=1}^{N} x_i$$
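A small sketch of Example 6 on synthetic data (assuming NumPy; the seed and $N$ are arbitrary): under the $N(0,1)$ prior, the MAP estimate shrinks the sample mean by the factor $N/(N+1)$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = rng.standard_normal()          # theta ~ N(0, 1)
N = 20
x = theta_true + rng.standard_normal(N)     # X_i ~ N(theta, 1)

theta_mle = x.mean()                        # sum x_i / N
theta_map = x.sum() / (N + 1)               # sum x_i / (N + 1)
print(theta_true, theta_mle, theta_map)
```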


Example 7:
Consider a single observation $X$ that depends on a random parameter $\theta$. Suppose $\theta$ has the prior density
$$f_\Theta(\theta) = \lambda e^{-\lambda\theta}, \qquad \theta \ge 0,\; \lambda > 0$$
and
$$f_{X/\Theta}(x) = \theta e^{-\theta x}, \qquad x > 0$$
Find the MAP estimator for $\theta$.
$$f_{\Theta/X}(\theta) = \frac{f_\Theta(\theta)\, f_{X/\Theta}(x)}{f_X(x)}$$
$$\ln f_{\Theta/X}(\theta) = \ln f_\Theta(\theta) + \ln f_{X/\Theta}(x) - \ln f_X(x)$$
Therefore the MAP estimator is given by
$$\frac{\partial}{\partial\theta}\left[\ln f_\Theta(\theta) + \ln f_{X/\Theta}(x)\right]_{\hat\theta_{MAP}} = 0 \;\Rightarrow\; -\lambda + \frac{1}{\hat\theta_{MAP}} - x = 0$$
$$\Rightarrow\; \hat\theta_{MAP} = \frac{1}{\lambda + X}$$



Example 8: Binary Communication problem

X → + (with noise V) → Y

$X$ is a binary random variable with
$$X = \begin{cases} 1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$$
$V$ is Gaussian noise with mean 0 and variance $\sigma^2$. Find the MMSE estimate of $X$ from the observed data $Y$.

Then
$$f_X(x) = \frac{1}{2}\left[\delta(x-1) + \delta(x+1)\right]$$
$$f_{Y/X}(y/x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y-x)^2/2\sigma^2}$$
$$f_{X/Y}(x/y) = \frac{f_X(x)\, f_{Y/X}(y/x)}{\int_{-\infty}^{\infty} f_X(x)\, f_{Y/X}(y/x)\, dx} = \frac{e^{-(y-x)^2/2\sigma^2}\left[\delta(x-1) + \delta(x+1)\right]}{e^{-(y-1)^2/2\sigma^2} + e^{-(y+1)^2/2\sigma^2}}$$
Hence
$$\hat{X}_{MMSE} = E(X/Y) = \int_{-\infty}^{\infty} x\, \frac{e^{-(y-x)^2/2\sigma^2}\left[\delta(x-1) + \delta(x+1)\right]}{e^{-(y-1)^2/2\sigma^2} + e^{-(y+1)^2/2\sigma^2}}\, dx = \frac{e^{-(y-1)^2/2\sigma^2} - e^{-(y+1)^2/2\sigma^2}}{e^{-(y-1)^2/2\sigma^2} + e^{-(y+1)^2/2\sigma^2}} = \tanh\!\left(\frac{y}{\sigma^2}\right)$$
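A quick check of the tanh form of the MMSE estimator in Example 8 (assuming NumPy; $\sigma^2$ and the grid of $y$ values are arbitrary): compare $E(X|Y = y)$ computed directly from the two Gaussian weights with $\tanh(y/\sigma^2)$.

```python
import numpy as np

sigma2 = 0.5
y = np.linspace(-3, 3, 7)
w_plus = np.exp(-(y - 1)**2 / (2 * sigma2))    # weight of X = +1
w_minus = np.exp(-(y + 1)**2 / (2 * sigma2))   # weight of X = -1

x_mmse = (w_plus - w_minus) / (w_plus + w_minus)
print(np.allclose(x_mmse, np.tanh(y / sigma2)))   # True
```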
To summarize:

MLE: the simplest; $\hat\theta_{MLE}$ maximizes the likelihood function $f_{X/\Theta}(\mathbf{x})$.

MMSE: $\hat\theta_{MMSE} = E(\Theta/\mathbf{X})$.
•     Find the a posteriori density.
•     Find its average value by integration.
•     Lots of calculation, hence it is computationally expensive.

MAP: $\hat\theta_{MAP}$ is the mode of the a posteriori density $f_{\Theta/X}(\theta)$.

4.14 Relation between θ̂_MAP and θ̂_MLE

From

f_Θ/X(θ) = f_Θ(θ) f_X/Θ(x) / f_X(x)

ln f_Θ/X(θ) = ln f_Θ(θ) + ln f_X/Θ(x) − ln f_X(x)

θ̂_MAP is given by

∂/∂θ ln f_Θ(θ) + ∂/∂θ ln f_X/Θ(x) = 0

where the first term involves the a priori density and the second the likelihood function.


Suppose θ is uniformly distributed between θ_MIN and θ_MAX. Then

∂/∂θ ln f_Θ(θ) = 0

[Figure: the uniform a priori density f_Θ(θ) is flat between θ_MIN and θ_MAX]

If θ_MIN ≤ θ̂_MLE ≤ θ_MAX, then θ̂_MAP = θ̂_MLE.
If θ̂_MLE ≤ θ_MIN, then θ̂_MAP = θ_MIN.
If θ̂_MLE ≥ θ_MAX, then θ̂_MAP = θ_MAX.
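In other words, with a uniform prior the MAP estimate is simply the MLE clipped to the support of the prior. A minimal Python sketch (the numbers are hypothetical):

```python
import numpy as np

theta_min, theta_max = 0.0, 1.0     # support of the uniform prior (hypothetical)
theta_mle = 1.3                     # an MLE that falls outside the support
theta_map = np.clip(theta_mle, theta_min, theta_max)
print(theta_map)                    # 1.0, i.e. theta_MAP = theta_max
```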
CHAPTER – 5: OPTIMAL LINEAR FILTER: WIENER FILTER

5.1 Estimation of signal in presence of white Gaussian noise (WGN)

Consider the signal model

Y[n] = X[n] + V[n]

[Figure: the observation Y[n] is the sum of the signal X[n] and the noise V[n]]

where Y[n] is the observed signal, X[n] is 0-mean Gaussian with variance 1, and V[n]
is a white Gaussian sequence with mean 0 and variance 1, independent of X[n]. The problem is to find the best
guess for X[n] given the observations Y[i], i = 1, 2, ..., n.


Maximum likelihood estimation for X[n] determines that value of X[n] for which the
observed sequence Y[i], i = 1, 2, ..., n is most likely. Let us represent the random sequence
Y[i], i = 1, 2, ..., n by the random vector Y[n] = [Y[n], Y[n−1], ..., Y[1]]′ and the value
sequence y[1], y[2], ..., y[n] by y[n] = [y[n], y[n−1], ..., y[1]]′.
The likelihood function f_Y[n]/X[n](y[n]/x[n]) will be Gaussian with mean x[n]:

f_Y[n]/X[n](y[n]/x[n]) = (1/(√(2π))ⁿ) e^{−Σ_{i=1}^{n} (y[i]−x[n])²/2}

The maximum likelihood estimate is given by

∂/∂x[n] f_Y[n]/X[n](y[n]/x[n]) |_{x̂_MLE[n]} = 0

⇒ x̂_MLE[n] = (1/n) Σ_{i=1}^{n} y[i]

Similarly, to find x̂_MAP[n] and x̂_MMSE[n] we have to find the a posteriori density

f_X[n]/Y[n](x[n]/y[n]) = f_X[n](x[n]) f_Y[n]/X[n](y[n]/x[n]) / f_Y[n](y[n])

                       = (1/f_Y[n](y[n])) e^{−x²[n]/2 − Σ_{i=1}^{n} (y[i]−x[n])²/2}

Taking the logarithm,

log_e f_X[n]/Y[n](x[n]/y[n]) = −x²[n]/2 − Σ_{i=1}^{n} (y[i]−x[n])²/2 − log_e f_Y[n](y[n])

log_e f_X[n]/Y[n](x[n]/y[n]) is maximum at x̂_MAP[n]. Therefore, taking the partial derivative of
log_e f_X[n]/Y[n](x[n]/y[n]) with respect to x[n] and equating it to 0, we get

x[n] − Σ_{i=1}^{n} (y[i] − x[n]) |_{x̂_MAP[n]} = 0

⇒ x̂_MAP[n] = (Σ_{i=1}^{n} y[i]) / (n + 1)

Similarly, the minimum mean-square error estimator is given by

x̂_MMSE[n] = E(X[n]/y[n]) = (Σ_{i=1}^{n} y[i]) / (n + 1)
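A short Monte-Carlo sketch (illustrative only, assuming the unit-variance Gaussian model above) shows the shrinkage effect of the prior: the MAP/MMSE estimate Σy[i]/(n + 1) attains a slightly smaller mean-square error than the sample-mean MLE:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 50_000
x = rng.standard_normal(trials)                    # X ~ N(0,1), one per trial
y = x[:, None] + rng.standard_normal((trials, n))  # n noisy observations each

x_mle = y.mean(axis=1)                             # (1/n) sum y[i]
x_map = y.sum(axis=1) / (n + 1)                    # shrunk towards the prior mean 0

print("MSE of MLE:", np.mean((x - x_mle) ** 2))    # approx 1/n
print("MSE of MAP:", np.mean((x - x_map) ** 2))    # approx 1/(n+1), slightly smaller
```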
•   For MMSE we have to know the joint probability structure of the channel and the
    source and hence the a posteriori pdf.
•   Finding the pdf is computationally very expensive, and the resulting estimator is in general nonlinear.
•   Normally we may have only estimates of the first-order and second-order
    statistics of the data.
We look for a simpler estimator.
The answer is optimal filtering or Wiener filtering.


We have seen that we can estimate an unknown signal (desired signal) x[n] from an
observed signal y[n] on the basis of the known joint distributions of y[n] and x[n]. We
could have used criteria like MMSE or MAP that we applied for parameter
estimation. But such estimators are generally nonlinear, require the computation of a
posteriori probabilities and involve considerable computational complexity.
The approach taken by Wiener is to specify a form for the estimator that depends on a
number of parameters. The minimization of the error then results in the determination of an
optimal set of estimator parameters. A mathematically simple and computationally easier
estimator is obtained by assuming a linear structure for the estimator.
5.2 Linear Minimum Mean Square Error Estimator

[Figure: the desired signal x[n] passes through a system and is corrupted by noise to give the
observed signal y[n]; a filter operating on y[n] produces the estimate x̂[n]. The data window
used for the estimate at time n runs from y[n − M + 1] to y[n + N].]

The linear minimum mean square error criterion is illustrated in the above figure. The
problem can be stated as follows:

Given observations of data y[n − M + 1], y[n − M + 2], ..., y[n], ..., y[n + N], determine a set of
parameters h[M − 1], h[M − 2], ..., h[0], ..., h[−N] such that

x̂[n] = Σ_{i=−N}^{M−1} h[i] y[n − i]

and the mean square error E(x[n] − x̂[n])² is a minimum with respect to
h[−N], h[−N + 1], ..., h[0], h[1], ..., h[M − 1].
This minimization problem results in an elegant solution if we assume joint stationarity of
the signals x[n] and y[n]. The estimator parameters can then be obtained from the second-
order statistics of the processes x[n] and y[n].


The problem of determining the estimator parameters by the LMMSE criterion is also called
the Wiener filtering problem. Three subclasses of the problem are identified:


    1. The optimal smoothing problem: N > 0
    2. The optimal filtering problem: N = 0
    3. The optimal prediction problem: N < 0
In the smoothing problem, an estimate of the signal inside the duration of observation of
the signal is made. The filtering problem estimates the current value of the signal on the
basis of the present and past observations. The prediction problem addresses the issue of
optimal prediction of a future value of the signal on the basis of present and past
observations.



5.3 Wiener-Hopf Equations
The mean-square error of estimation is given by

Ee²[n] = E(x[n] − x̂[n])²
       = E(x[n] − Σ_{i=−N}^{M−1} h[i] y[n − i])²

We have to minimize Ee²[n] with respect to each h[i] to get the optimal estimator. The
corresponding minimization is given by

∂E{e²[n]}/∂h[j] = 0, for j = −N, ..., 0, ..., M − 1

(E being a linear operator, E and ∂/∂h[j] can be interchanged)

E e[n] y[n − j] = 0, j = −N, ..., 0, 1, ..., M − 1                   (1)

or, writing out e[n],

E(x[n] − Σ_{i=−N}^{M−1} h[i] y[n − i]) y[n − j] = 0, j = −N, ..., 0, 1, ..., M − 1     (2)

R_XY[j] = Σ_{i=−N}^{M−1} h[i] R_YY[j − i], j = −N, ..., 0, 1, ..., M − 1               (3)

This set of N + M equations in (3) is called the Wiener-Hopf equations or the normal
equations.


•   The result in (1) is the orthogonality principle, which implies that the error is
    orthogonal to the observed data.
•   x̂[n] is the projection of x[n] onto the subspace spanned by the observations
    y[n − M + 1], y[n − M + 2], ..., y[n], ..., y[n + N].
•   The estimation uses second-order statistics, i.e. the autocorrelation and cross-
    correlation functions.
•   If x[n] and y[n] are jointly Gaussian, then the MMSE and LMMSE estimators are
    equivalent. Otherwise we get a sub-optimal result.
•   Also observe that
    x[n] = x̂[n] + e[n]
    where x̂[n] and e[n] are the parts of x[n] respectively correlated and uncorrelated
    with y[n]. Thus the LMMSE estimator separates out that part of x[n] which is correlated
    with y[n]. Hence the Wiener filter can also be interpreted as a correlation canceller.
    (See Orfanidis.)



5.4 FIR Wiener Filter
x̂[n] = Σ_{i=0}^{M−1} h[i] y[n − i]

The model parameters are given by the orthogonality principle: the error e[n] = x[n] − x̂[n] satisfies

E(x[n] − Σ_{i=0}^{M−1} h[i] y[n − i]) y[n − j] = 0, j = 0, 1, ..., M − 1

Σ_{i=0}^{M−1} h[i] R_YY[j − i] = R_XY[j], j = 0, 1, ..., M − 1

In matrix form, we have

R_YY h = r_XY

where

R_YY = [ R_YY[0]      R_YY[−1]    ....  R_YY[1 − M] ]
       [ R_YY[1]      R_YY[0]     ....  R_YY[2 − M] ]
       [ ...                                        ]
       [ R_YY[M − 1]  R_YY[M − 2] ....  R_YY[0]     ]

r_XY = [ R_XY[0]  R_XY[1]  ...  R_XY[M − 1] ]′

and

h = [ h[0]  h[1]  ...  h[M − 1] ]′

Therefore,

h = R_YY⁻¹ r_XY
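In practice, R_YY and r_XY are estimated from data records. A minimal Python sketch of the resulting solver (the function name and the biased correlation estimates are illustrative choices, not from the original text):

```python
import numpy as np
from scipy.linalg import toeplitz

def fir_wiener(x, y, M):
    """Return the M-tap Wiener filter estimating x[n] from y[n], ..., y[n-M+1]."""
    N = len(y)
    # Biased estimates of R_YY[m] and R_XY[m] = E x[n] y[n-m], m = 0..M-1
    Ryy = np.array([np.dot(y[: N - m], y[m:]) / N for m in range(M)])
    Rxy = np.array([np.dot(x[m:], y[: N - m]) / N for m in range(M)])
    return np.linalg.solve(toeplitz(Ryy), Rxy)   # h = R_YY^{-1} r_XY
```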



5.5 Minimum Mean Square Error - FIR Wiener Filter
E(e²[n]) = E e[n] (x[n] − Σ_{i=0}^{M−1} h[i] y[n − i])
         = E e[n] x[n]          (∵ the error is orthogonal to the data)
         = E(x[n] − Σ_{i=0}^{M−1} h[i] y[n − i]) x[n]
         = R_XX[0] − Σ_{i=0}^{M−1} h[i] R_XY[i]

In matrix notation, the minimum mean square error is R_XX[0] − h′ r_XY.




[Figure: the Wiener estimator realized as a tapped-delay-line (FIR) structure: the input passes
through unit delays z⁻¹, the taps are weighted by h[0], h[1], ..., h[M − 1] and summed to
produce x̂[n].]
Example 1: Noise Filtering
Consider the case of a carrier signal in presence of white Gaussian noise:

x[n] = A cos[w₀n + φ], w₀ = π/4
y[n] = x[n] + v[n]

where φ is uniformly distributed in (0, 2π).
v[n] is a white Gaussian noise sequence of variance 1 and is independent of x[n]. Find the
parameters for the FIR Wiener filter with M = 3.

R_XX[m] = (A²/2) cos w₀m
R_YY[m] = E y[n] y[n − m]
        = E(x[n] + v[n])(x[n − m] + v[n − m])
        = R_XX[m] + R_VV[m] + 0 + 0
        = (A²/2) cos(w₀m) + δ[m]
R_XY[m] = E x[n] y[n − m]
        = E x[n](x[n − m] + v[n − m])
        = R_XX[m]

Hence the Wiener-Hopf equations are

[ R_YY[0]  R_YY[1]  R_YY[2] ] [ h[0] ]   [ R_XX[0] ]
[ R_YY[1]  R_YY[0]  R_YY[1] ] [ h[1] ] = [ R_XX[1] ]
[ R_YY[2]  R_YY[1]  R_YY[0] ] [ h[2] ]   [ R_XX[2] ]

[ A²/2 + 1         (A²/2)cos(π/4)  (A²/2)cos(π/2) ] [ h[0] ]   [ A²/2           ]
[ (A²/2)cos(π/4)   A²/2 + 1        (A²/2)cos(π/4) ] [ h[1] ] = [ (A²/2)cos(π/4) ]
[ (A²/2)cos(π/2)   (A²/2)cos(π/4)  A²/2 + 1       ] [ h[2] ]   [ (A²/2)cos(π/2) ]

Suppose A = 5. Then A²/2 = 12.5, (A²/2)cos(π/4) = 12.5/√2 and (A²/2)cos(π/2) = 0, so

[ 13.5     12.5/√2  0       ] [ h[0] ]   [ 12.5    ]
[ 12.5/√2  13.5     12.5/√2 ] [ h[1] ] = [ 12.5/√2 ]
[ 0        12.5/√2  13.5    ] [ h[2] ]   [ 0       ]

Solving,

h[0] = 0.707
h[1] = 0.34
h[2] = −0.226
Plot the filter performance for the above values of h[0], h[1] and h[2]. The following
figure shows the performance of a 20-tap FIR Wiener filter for noise filtering.
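A quick numerical check of the 3-tap solution above (Python; a denotes 12.5/√2):

```python
import numpy as np

a = 12.5 / np.sqrt(2)                    # (A^2/2) cos(pi/4)
Ryy = np.array([[13.5, a, 0.0],
                [a, 13.5, a],
                [0.0, a, 13.5]])
rxy = np.array([12.5, a, 0.0])           # R_XX[0], R_XX[1], R_XX[2]
print(np.linalg.solve(Ryy, rxy))         # approx [0.703, 0.340, -0.223],
                                         # agreeing with the rounded values above
```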
Example 2: Active Noise Control
Suppose the observed signal y[n] is given by

y[n] = 0.5 cos(w₀n + φ) + v₁[n]

where φ is uniformly distributed in (0, 2π) and v₁[n] = 0.6v[n − 1] + v[n] is an MA(1)
noise. We want to cancel v₁[n] with the help of another correlated noise v₂[n] given by

v₂[n] = 0.8v[n − 1] + v[n]

[Figure: the primary input x[n] + v₁[n] and the reference noise v₂[n], passed through a
2-tap FIR filter, are combined to produce the estimate x̂[n].]

The Wiener-Hopf equations are given by

R_V₂V₂ h = r_V₁V₂

where h = [h[0] h[1]]′,

R_V₂V₂ = [ 1.64  0.8  ]      and      r_V₁V₂ = [ 1.48 ]
         [ 0.8   1.64 ]                        [ 0.6  ]

∴ [ h[0] ]   [  0.9500 ]
  [ h[1] ] = [ −0.0976 ]
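The 2-tap solution can be verified in a few lines of Python:

```python
import numpy as np

R = np.array([[1.64, 0.8], [0.8, 1.64]])
r = np.array([1.48, 0.6])
print(np.linalg.solve(R, r))    # approx [0.95, -0.0976]
```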


Example 3:
(Continuous-time prediction) Suppose we want to predict the continuous-time process
X(t) at time (t + τ) by

X̂(t + τ) = aX(t)

Then by the orthogonality principle

E(X(t + τ) − aX(t)) X(t) = 0

⇒ a = R_XX(τ)/R_XX(0)

As a particular case, consider the first-order Markov process given by

(d/dt) X(t) = AX(t) + v(t)

In this case,

R_XX(τ) = R_XX(0) e^{−Aτ}

∴ a = R_XX(τ)/R_XX(0) = e^{−Aτ}

Observe that for such a process

E(X(t + τ) − aX(t)) X(t − τ₁) = R_XX(τ + τ₁) − aR_XX(τ₁)
                              = R_XX(0)e^{−A(τ+τ₁)} − e^{−Aτ} R_XX(0)e^{−Aτ₁}
                              = 0

Therefore, the linear prediction of such a process based on any past value is the same as the
linear prediction based on the current value.




5.6 IIR Wiener Filter (Causal)
Consider the IIR filter to estimate the signal x[n] shown in the figure below.

[Figure: the observation y[n] is passed through a causal filter h[n] to produce the estimate x̂[n]]

The estimator x̂[n] is given by

x̂[n] = Σ_{i=0}^{∞} h[i] y[n − i]

The mean-square error of estimation is given by

Ee²[n] = E(x[n] − x̂[n])²
       = E(x[n] − Σ_{i=0}^{∞} h[i] y[n − i])²

We have to minimize Ee²[n] with respect to each h[i] to get the optimal estimator.

Applying the orthogonality principle, we get the Wiener-Hopf equation

E(x[n] − Σ_{i=0}^{∞} h[i] y[n − i]) y[n − j] = 0, j = 0, 1, ...

from which we get

Σ_{i=0}^{∞} h[i] R_YY[j − i] = R_XY[j], j = 0, 1, ...
•   We have to find h[i], i = 0, 1, ..., ∞ by solving the above infinite set of equations.
•   This problem is better solved in the z-transform domain, though we cannot
    directly apply the convolution theorem of the z-transform, because the equations hold
    only for j ≥ 0.
Here comes Wiener's contribution.
The analysis is based on the spectral factorization theorem:

S_YY(z) = σ_v² H_c(z) H_c(z⁻¹)

where H_c(z) is a causal, minimum-phase filter.

[Figure: y[n] passed through the whitening filter H₁(z) = 1/H_c(z) gives the white innovation
sequence v[n]; the cascade of H₁(z) and a second filter H₂(z) forms the Wiener filter producing x̂[n].]
Now h₂[n] is the impulse response of the Wiener filter that estimates x[n] from the innovation
sequence v[n]:

x̂[n] = Σ_{i=0}^{∞} h₂[i] v[n − i]

Applying the orthogonality principle results in the Wiener-Hopf equation

E(x[n] − Σ_{i=0}^{∞} h₂[i] v[n − i]) v[n − j] = 0

∴ Σ_{i=0}^{∞} h₂[i] R_VV[j − i] = R_XV[j], j = 0, 1, ...

Since v[n] is white,

R_VV[m] = σ_V² δ[m]

∴ Σ_{i=0}^{∞} h₂[i] σ_V² δ[j − i] = R_XV[j], j = 0, 1, ...

so that

h₂[j] = R_XV[j]/σ_V², j ≥ 0

H₂(z) = [S_XV(z)]₊ / σ_V²

where [S_XV(z)]₊ is the causal part (i.e., the part containing non-positive powers of z) of the
power series expansion of S_XV(z).
Since

v[n] = Σ_{i=0}^{∞} h₁[i] y[n − i],

R_XV[j] = E x[n] v[n − j]
        = Σ_{i=0}^{∞} h₁[i] E x[n] y[n − j − i]
        = Σ_{i=0}^{∞} h₁[i] R_XY[j + i]

S_XV(z) = H₁(z⁻¹) S_XY(z) = S_XY(z) / H_c(z⁻¹)

∴ H₂(z) = (1/σ_V²) [S_XY(z)/H_c(z⁻¹)]₊

Therefore,

H(z) = H₁(z) H₂(z) = (1/(σ_V² H_c(z))) [S_XY(z)/H_c(z⁻¹)]₊
We have to
•   find the power spectrum of the data and the cross power spectrum of the desired
    signal and the data from the available model, or estimate them from the data;
•   factorize the power spectrum of the data using the spectral factorization theorem.



5.7 Mean Square Estimation Error – IIR Filter (Causal)

E(e²[n]) = E e[n] (x[n] − Σ_{i=0}^{∞} h[i] y[n − i])
         = E e[n] x[n]          (∵ the error is orthogonal to the data)
         = E(x[n] − Σ_{i=0}^{∞} h[i] y[n − i]) x[n]
         = R_XX[0] − Σ_{i=0}^{∞} h[i] R_XY[i]
         = (1/2π) ∫_{−π}^{π} S_X(w) dw − (1/2π) ∫_{−π}^{π} H(w) S*_XY(w) dw
         = (1/2π) ∫_{−π}^{π} (S_X(w) − H(w) S*_XY(w)) dw
         = (1/2πj) ∮_C (S_X(z) − H(z) S_XY(z⁻¹)) z⁻¹ dz
Example 4:
Consider the observation model

y[n] = x[n] + v₁[n]

with

x[n] = 0.8x[n − 1] + w[n]

where v₁[n] is an additive zero-mean Gaussian white noise with variance 1 and w[n]
is zero-mean white noise with variance 0.68. Signal and noise are uncorrelated.
Find the optimal causal Wiener filter to estimate x[n].

Solution:

[Figure: x[n] is generated by passing w[n] through the AR(1) system 1/(1 − 0.8z⁻¹)]

S_XX(z) = 0.68 / ((1 − 0.8z⁻¹)(1 − 0.8z))

R_YY[m] = E y[n] y[n − m]
        = E(x[n] + v₁[n])(x[n − m] + v₁[n − m])
        = R_XX[m] + R_V₁V₁[m]

S_YY(z) = S_XX(z) + 1

Factorizing,

S_YY(z) = 0.68/((1 − 0.8z⁻¹)(1 − 0.8z)) + 1
        = 2(1 − 0.4z⁻¹)(1 − 0.4z) / ((1 − 0.8z⁻¹)(1 − 0.8z))

∴ H_c(z) = (1 − 0.4z⁻¹)/(1 − 0.8z⁻¹)   and   σ_V² = 2

Also

R_XY[m] = E x[n] y[n − m]
        = E x[n](x[n − m] + v₁[n − m])
        = R_XX[m]

S_XY(z) = S_XX(z) = 0.68 / ((1 − 0.8z⁻¹)(1 − 0.8z))

∴ H(z) = (1/(σ_V² H_c(z))) [S_XY(z)/H_c(z⁻¹)]₊

       = (1/2) ((1 − 0.8z⁻¹)/(1 − 0.4z⁻¹)) [0.68/((1 − 0.8z⁻¹)(1 − 0.4z))]₊

Expanding the bracket in partial fractions,

0.68/((1 − 0.8z⁻¹)(1 − 0.4z)) = 1/(1 − 0.8z⁻¹) + 0.4z/(1 − 0.4z)

whose causal part is 1/(1 − 0.8z⁻¹). Hence

H(z) = (1/2) ((1 − 0.8z⁻¹)/(1 − 0.4z⁻¹)) (1/(1 − 0.8z⁻¹))
     = 0.5 / (1 − 0.4z⁻¹)

h[n] = 0.5(0.4)ⁿ, n ≥ 0
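As a numerical cross-check of this result (a sketch under the stated model, not part of the original text), a long causal FIR Wiener filter computed from the analytic correlations should approach the causal IIR solution tap by tap:

```python
import numpy as np
from scipy.linalg import toeplitz

M = 50
m = np.arange(M)
Rxx = (0.68 / (1 - 0.8**2)) * 0.8**m      # AR(1) autocorrelation of x[n]
Ryy = Rxx + (m == 0)                      # white observation noise of variance 1
h = np.linalg.solve(toeplitz(Ryy), Rxx)   # normal equations with R_XY[m] = R_XX[m]
print(h[:4])                              # approx 0.5*(0.4)**n = [0.5, 0.2, 0.08, 0.032]
```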



5.8 IIR Wiener Filter (Noncausal)

[Figure: y[n] is passed through a two-sided filter H(z) to produce the estimate x̂[n]]

The estimator x̂[n] is given by

x̂[n] = Σ_{i=−∞}^{∞} h[i] y[n − i]

For LMMSE, the error is orthogonal to the data:

E(x[n] − Σ_{i=−∞}^{∞} h[i] y[n − i]) y[n − j] = 0   ∀ j

Σ_{i=−∞}^{∞} h[i] R_YY[j − i] = R_XY[j], j = −∞, ..., 0, 1, ..., ∞

•   This form of the Wiener-Hopf equation is simple to analyse.
•   It is a full convolution, so it is easily solved in the frequency domain by taking the z-transform.
•   However, the resulting filter is noncausal and hence not realizable in real time.

H(z) S_YY(z) = S_XY(z)

so that

H(z) = S_XY(z)/S_YY(z)

or

H(w) = S_XY(w)/S_YY(w)
5.9 Mean Square Estimation Error – IIR Filter (Noncausal)
The mean square error of estimation is given by

E(e²[n]) = E e[n] (x[n] − Σ_{i=−∞}^{∞} h[i] y[n − i])
         = E e[n] x[n]          (∵ the error is orthogonal to the data)
         = E(x[n] − Σ_{i=−∞}^{∞} h[i] y[n − i]) x[n]
         = R_XX[0] − Σ_{i=−∞}^{∞} h[i] R_XY[i]
         = (1/2π) ∫_{−π}^{π} S_X(w) dw − (1/2π) ∫_{−π}^{π} H(w) S*_XY(w) dw
         = (1/2π) ∫_{−π}^{π} (S_X(w) − H(w) S*_XY(w)) dw
         = (1/2πj) ∮_C (S_X(z) − H(z) S_XY(z⁻¹)) z⁻¹ dz



Example 5: Noise filtering by noncausal IIR Wiener Filter
Consider the case of a signal in presence of white Gaussian noise:

y[n] = x[n] + v[n]

where v[n] is an additive zero-mean Gaussian white noise with variance σ_V². Signal
and noise are uncorrelated.

S_YY(w) = S_XX(w) + S_VV(w)

and

S_XY(w) = S_XX(w)

∴ H(w) = S_XX(w) / (S_XX(w) + S_VV(w))
       = (S_XX(w)/S_VV(w)) / (S_XX(w)/S_VV(w) + 1)

Suppose the SNR is very high. Then

H(w) ≅ 1

(i.e. the signal is passed unattenuated).
When the SNR is low,

H(w) ≅ S_XX(w)/S_VV(w)

(i.e. if the noise is high, the corresponding signal component is attenuated in proportion
to the estimated SNR).




[Figure: the noncausal Wiener filter magnitude H(w) follows the SNR across frequency,
close to 1 where the signal dominates and close to 0 where the noise dominates.]


Example 6: Image filtering by IIR Wiener filter
S_YY(w) = power spectrum of the corrupted image
S_VV(w) = power spectrum of the noise, estimated from the noise model
          or from a constant-intensity region (like the background) of the image

H(w) = S_XX(w) / (S_XX(w) + S_VV(w))
     = (S_YY(w) − S_VV(w)) / S_YY(w)
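A minimal 1-D Python sketch of this recipe (the 2-D image case is identical with fft2; the crude periodogram estimate and the clipping of negative values to zero are practical choices assumed here, not prescribed by the text):

```python
import numpy as np

def wiener_denoise(y, noise_var):
    """Frequency-domain Wiener filtering with H = (Syy - Svv)/Syy."""
    Y = np.fft.fft(y)
    Syy = np.abs(Y) ** 2 / len(y)                        # periodogram of noisy signal
    H = np.maximum(Syy - noise_var, 0.0) / (Syy + 1e-12) # clip negative estimates to 0
    return np.real(np.fft.ifft(H * Y))
```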


Example 7:
Consider the signal in presence of white noise given by the observation model y[n] = x[n] + v[n] with

x[n] = 0.8x[n − 1] + w[n]

where v[n] is an additive zero-mean Gaussian white noise with variance 1 and w[n]
is zero-mean white noise with variance 0.68. Signal and noise are uncorrelated.
Find the optimal noncausal Wiener filter to estimate x[n].

As in Example 4, S_XY(z) = S_XX(z) and S_YY(z) = S_XX(z) + 1, so

H(z) = S_XY(z)/S_YY(z)
     = (0.68/((1 − 0.8z⁻¹)(1 − 0.8z))) / (2(1 − 0.4z⁻¹)(1 − 0.4z)/((1 − 0.8z⁻¹)(1 − 0.8z)))
     = 0.34 / ((1 − 0.4z⁻¹)(1 − 0.4z))      (one pole inside and one pole outside the unit circle,
                                             so the filter is two-sided)
     = 0.4048/(1 − 0.4z⁻¹) + 0.4048 · 0.4z/(1 − 0.4z)

∴ h[n] = 0.4048(0.4)ⁿ u(n) + 0.4048(0.4)⁻ⁿ u(−n − 1)


[Figure: the impulse response h[n] is two-sided, decaying as (0.4)^|n| on both sides of n = 0]
CHAPTER – 6: LINEAR PREDICTION OF SIGNAL
6.1 Introduction
Given a sequence of observations

y[n − 1], y[n − 2], ..., y[n − M], what is the best prediction for y[n]?
(one-step-ahead prediction)

The minimum mean square error prediction ŷ[n] for y[n] is given by

ŷ[n] = E{y[n] | y[n − 1], y[n − 2], ..., y[n − M]}

which is, in general, a nonlinear predictor.
A linear prediction is given by

ŷ[n] = Σ_{i=1}^{M} h[i] y[n − i]

where h[i], i = 1, ..., M are the prediction parameters.
•   Linear prediction has a very wide range of applications.
•   For an exact AR(M) process, the linear prediction model of order M and the
    corresponding AR model have the same parameters. For other signals the LP model gives
    an approximation.


6.2 Areas of application
•   Speech modeling
•   Low-bit-rate speech coding
•   Speech recognition
•   ECG modeling
•   DPCM coding
•   Internet traffic prediction
LPC(10) is the popular linear prediction model used for speech coding. For a frame of
speech samples, the prediction parameters are estimated and coded. In CELP (Code-book
Excited Linear Prediction) the prediction error e[n] = y[n] − ŷ[n] is vector quantized and
transmitted.
ŷ[n] = Σ_{i=1}^{M} h[i] y[n − i] is an FIR Wiener filter; it is called the linear prediction filter.
Therefore

e[n] = y[n] − ŷ[n]
     = y[n] − Σ_{i=1}^{M} h[i] y[n − i]

is the prediction error, and the corresponding filter is called the prediction error filter.
The linear minimum mean square error estimates of the prediction parameters are given by
the orthogonality relation

E e[n] y[n − j] = 0 for j = 1, 2, ..., M

∴ E(y[n] − Σ_{i=1}^{M} h[i] y[n − i]) y[n − j] = 0, j = 1, 2, ..., M

⇒ R_YY[j] − Σ_{i=1}^{M} h[i] R_YY[j − i] = 0

⇒ R_YY[j] = Σ_{i=1}^{M} h[i] R_YY[j − i], j = 1, 2, ..., M

which are the Wiener-Hopf equations for the linear prediction problem, the same as the Yule-
Walker equations for an AR(M) process.
In matrix notation

[ R_YY[0]      R_YY[1]      ....  R_YY[M − 1] ] [ h[1] ]   [ R_YY[1] ]
[ R_YY[1]      R_YY[0]      ....  R_YY[M − 2] ] [ h[2] ]   [ R_YY[2] ]
[ .                                           ] [ .    ] = [ .       ]
[ R_YY[M − 1]  R_YY[M − 2]  ....  R_YY[0]     ] [ h[M] ]   [ R_YY[M] ]

R_YY h = r_YY
∴ h = (R_YY)⁻¹ r_YY


6.3 Mean Square Prediction Error (MSPE)

E(e²[n]) = E(y[n] − Σ_{i=1}^{M} h[i] y[n − i]) e[n]
         = E y[n] e[n]
         = E y[n](y[n] − Σ_{i=1}^{M} h[i] y[n − i])
         = R_YY[0] − Σ_{i=1}^{M} h[i] R_YY[i]
6.4 Forward Prediction Problem
The above linear prediction problem is the forward prediction problem. For notational
simplicity let us rewrite the prediction equation as

ŷ[n] = Σ_{i=1}^{M} h_M[i] y[n − i]

where the prediction parameters are now denoted by h_M[i], i = 1, ..., M.


6.5 Backward Prediction Problem
Given y[n], y[n − 1], ..., y[n − M + 1], we want to estimate y[n − M].
The linear prediction is given by

ŷ[n − M] = Σ_{i=1}^{M} b_M[i] y[n + 1 − i]

Applying the orthogonality principle,

E(y[n − M] − Σ_{i=1}^{M} b_M[i] y[n + 1 − i]) y[n + 1 − j] = 0, j = 1, 2, ..., M

This will give

R_YY[M + 1 − j] = Σ_{i=1}^{M} b_M[i] R_YY[j − i], j = 1, 2, ..., M

The corresponding matrix form is

[ R_YY[0]      R_YY[1]      ....  R_YY[M − 1] ] [ b_M[1] ]   [ R_YY[M]     ]
[ R_YY[1]      R_YY[0]      ....  R_YY[M − 2] ] [ b_M[2] ]   [ R_YY[M − 1] ]
[ .                                           ] [ .      ] = [ .           ]       (1)
[ R_YY[M − 1]  R_YY[M − 2]  ....  R_YY[0]     ] [ b_M[M] ]   [ R_YY[1]     ]


6.6 Forward Prediction
Rewriting the Mth-order forward prediction problem, we have

[ R_YY[0]      R_YY[1]      ....  R_YY[M − 1] ] [ h_M[1] ]   [ R_YY[1] ]
[ R_YY[1]      R_YY[0]      ....  R_YY[M − 2] ] [ h_M[2] ]   [ R_YY[2] ]
[ .                                           ] [ .      ] = [ .       ]       (2)
[ R_YY[M − 1]  R_YY[M − 2]  ....  R_YY[0]     ] [ h_M[M] ]   [ R_YY[M] ]

From (1) and (2) we conclude

b_M[i] = h_M[M + 1 − i], i = 1, 2, ..., M

Thus the forward prediction parameters in reverse order give the backward prediction
parameters.
The mean-square prediction error is

ε_M = E(y[n − M] − Σ_{i=1}^{M} b_M[i] y[n + 1 − i]) y[n − M]
    = R_YY[0] − Σ_{i=1}^{M} b_M[i] R_YY[M + 1 − i]
    = R_YY[0] − Σ_{i=1}^{M} h_M[M + 1 − i] R_YY[M + 1 − i]

which is the same as the forward prediction error. Thus

Backward prediction error = Forward prediction error.



Example 1:
Find the second-order predictor for y[n] given y[n] = x[n] + v[n], where v[n] is a 0-mean
white noise with variance 1 uncorrelated with x[n], and x[n] = 0.8x[n − 1] + w[n],
where w[n] is a 0-mean white noise with variance 0.68.

The linear predictor is given by

ŷ[n] = h₂[1] y[n − 1] + h₂[2] y[n − 2]

We have to find h₂[1] and h₂[2].
The corresponding Yule-Walker equations are

[ R_YY[0]  R_YY[1] ] [ h₂[1] ]   [ R_YY[1] ]
[ R_YY[1]  R_YY[0] ] [ h₂[2] ] = [ R_YY[2] ]

To find R_YY[0], R_YY[1] and R_YY[2]:

y[n] = x[n] + v[n]
R_YY[m] = R_XX[m] + δ[m]
x[n] = 0.8x[n − 1] + w[n]

∴ R_XX[m] = (0.68/(1 − (0.8)²)) (0.8)^|m| = 1.89 × (0.8)^|m|

R_YY[0] = 2.89, R_YY[1] = 1.51 and R_YY[2] = 1.21

Solving, h₂[1] = 0.4178 and h₂[2] = 0.2004.
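Checking the numbers in Python:

```python
import numpy as np

Ryy = np.array([[2.89, 1.51], [1.51, 2.89]])
ryy = np.array([1.51, 1.21])
print(np.linalg.solve(Ryy, ryy))    # approx [0.4178, 0.2004]
```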
6.7 Levinson-Durbin Algorithm
The Levinson-Durbin algorithm is the most popular technique for determining the LPC
parameters from a given autocorrelation sequence.
Consider the Yule-Walker equations for the mth-order linear predictor:

[ R_YY[0]      R_YY[1]      ....  R_YY[m − 1] ] [ h_m[1] ]   [ R_YY[1] ]
[ R_YY[1]      R_YY[0]      ....  R_YY[m − 2] ] [ h_m[2] ]   [ .       ]
[ .                                           ] [ .      ] = [ .       ]       (1)
[ R_YY[m − 1]  R_YY[m − 2]  ....  R_YY[0]     ] [ h_m[m] ]   [ R_YY[m] ]
Writing the same system in the reverse order,

[ R_YY[0]      R_YY[1]      ....  R_YY[m − 1] ] [ h_m[m]     ]   [ R_YY[m] ]
[ R_YY[1]      R_YY[0]      ....  R_YY[m − 2] ] [ h_m[m − 1] ]   [ .       ]
[ .                                           ] [ .          ] = [ .       ]       (2)
[ R_YY[m − 1]  R_YY[m − 2]  ....  R_YY[0]     ] [ h_m[1]     ]   [ R_YY[1] ]


Then the (m + 1)th-order predictor is given by

[ R_YY[0]      R_YY[1]      ....  R_YY[m − 1]  R_YY[m]     ] [ h_{m+1}[1]     ]   [ R_YY[1]     ]
[ R_YY[1]      R_YY[0]      ....  R_YY[m − 2]  R_YY[m − 1] ] [ h_{m+1}[2]     ]   [ R_YY[2]     ]
[ .                                                        ] [ .              ] = [ .           ]
[ R_YY[m − 1]  R_YY[m − 2]  ....  R_YY[0]      R_YY[1]     ] [ h_{m+1}[m]     ]   [ R_YY[m]     ]
[ R_YY[m]      R_YY[m − 1]  ....  R_YY[1]      R_YY[0]     ] [ h_{m+1}[m + 1] ]   [ R_YY[m + 1] ]

Let us partition this system, separating the last row and the last column. The first m rows give

[ R_YY[0]      ....  R_YY[m − 1] ] [ h_{m+1}[1] ]                  [ R_YY[m]     ]   [ R_YY[1] ]
[ .                              ] [ .          ] + h_{m+1}[m + 1] [ .           ] = [ .       ]   (3)
[ R_YY[m − 1]  ....  R_YY[0]     ] [ h_{m+1}[m] ]                  [ R_YY[1]     ]   [ R_YY[m] ]

and the last row gives

Σ_{i=1}^{m} h_{m+1}[i] R_YY[m + 1 − i] + h_{m+1}[m + 1] R_YY[0] = R_YY[m + 1]       (4)
Premultiplying (3) by R_YY⁻¹, we get

[ h_{m+1}[1] ]                         [ R_YY[m]     ]          [ R_YY[1] ]
[ .          ] + h_{m+1}[m + 1] R_YY⁻¹ [ .           ] = R_YY⁻¹ [ .       ]
[ h_{m+1}[m] ]                         [ R_YY[1]     ]          [ R_YY[m] ]

By (2) and (1), the two products on the left and right are the reversed and forward mth-order solutions:

[ h_{m+1}[1] ]                    [ h_m[m] ]   [ h_m[1] ]
[ .          ] + h_{m+1}[m + 1]   [ .      ] = [ .      ]
[ h_{m+1}[m] ]                    [ h_m[1] ]   [ h_m[m] ]

These equations can be rewritten as

h_{m+1}[i] = h_m[i] + k_{m+1} h_m[m + 1 − i], i = 1, 2, ..., m       (5)

where k_m = −h_m[m] is called the reflection coefficient or the PARCOR (partial
correlation) coefficient.
From equation (4) we get

Σ_{i=1}^{m} h_{m+1}[i] R_YY[m + 1 − i] + h_{m+1}[m + 1] R_YY[0] = R_YY[m + 1]

Using equation (5) and h_{m+1}[m + 1] = −k_{m+1},

Σ_{i=1}^{m} {h_m[i] + k_{m+1} h_m[m + 1 − i]} R_YY[m + 1 − i] − k_{m+1} R_YY[0] = R_YY[m + 1]

Σ_{i=1}^{m} h_m[i] R_YY[m + 1 − i] − k_{m+1} R_YY[0] + k_{m+1} Σ_{i=1}^{m} h_m[m + 1 − i] R_YY[m + 1 − i] = R_YY[m + 1]

k_{m+1} {R_YY[0] − Σ_{i=1}^{m} h_m[m + 1 − i] R_YY[m + 1 − i]} = −R_YY[m + 1] + Σ_{i=1}^{m} h_m[i] R_YY[m + 1 − i]

k_{m+1} = (−R_YY[m + 1] + Σ_{i=1}^{m} h_m[i] R_YY[m + 1 − i]) / ε[m]

        = (Σ_{i=0}^{m} h_m[i] R_YY[m + 1 − i]) / ε[m]

where

ε[m] = R_YY[0] − Σ_{i=1}^{m} h_m[m + 1 − i] R_YY[m + 1 − i]

is the mean-square prediction error. Here we have used the convention h_m[0] = −1.
∴ ε[m + 1] = R_YY[0] − Σ_{i=1}^{m+1} h_{m+1}[m + 2 − i] R_YY[m + 2 − i]

Using the recursion for h_{m+1}[i], we get

ε[m + 1] = ε[m](1 − k_{m+1}²)

which gives the MSE recursively. Since the MSE is non-negative,

k_m² ≤ 1
∴ |k_m| ≤ 1

•   If |k_m| < 1, the LPC error filter will be minimum-phase, and hence the corresponding
    synthesis filter will be stable.
•   Efficient realizations can be achieved in terms of the k_m.
•   k_m represents the direct correlation of the data y[n − m] with y[n] when the
    correlation due to the intermediate data y[n − m + 1], y[n − m + 2], ..., y[n − 1] is
    removed. It is defined by

k_m = E(e_m^f[n] e_m^b[n]) / R_YY[0]

where

e_m^f[n] = forward prediction error = y[n] − Σ_{i=1}^{m} h_m[i] y[n − i]

and

e_m^b[n] = backward prediction error = y[n − m] − Σ_{i=1}^{m} h_m[m + 1 − i] y[n + 1 − i]




6.8 Steps of the Levinson-Durbin algorithm
Given R_YY[m], m = 0, 1, 2, ...

Initialization:
Take h_m[0] = −1 for all m.
For m = 0,

ε[0] = R_YY[0]

For m = 1, 2, 3, ...

k_m = (Σ_{i=0}^{m−1} h_{m−1}[i] R_YY[m − i]) / ε[m − 1]

h_m[i] = h_{m−1}[i] + k_m h_{m−1}[m − i], i = 1, 2, ..., m − 1
h_m[m] = −k_m
ε[m] = ε[m − 1](1 − k_m²)

Continue up to the given final value of m.
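A direct Python transcription of these steps (a sketch using the text's sign convention h_m[0] = −1, h_m[m] = −k_m), checked against Example 1:

```python
import numpy as np

def levinson_durbin(R, M):
    """LPC parameters h_M[1..M], reflection coefficients k_1..k_M and the
    final mean-square prediction error, from R = [R_YY[0], ..., R_YY[M]]."""
    h = np.zeros(M + 1)
    h[0] = -1.0                              # convention h_m[0] = -1
    eps = R[0]                               # epsilon[0] = R_YY[0]
    ks = []
    for m in range(1, M + 1):
        k = np.dot(h[:m], R[m:0:-1]) / eps   # sum_{i=0}^{m-1} h_{m-1}[i] R[m-i]
        new_h = h.copy()
        for i in range(1, m):
            new_h[i] = h[i] + k * h[m - i]   # order update of the coefficients
        new_h[m] = -k                        # h_m[m] = -k_m
        h = new_h
        eps *= 1.0 - k * k                   # epsilon[m] = epsilon[m-1](1 - k_m^2)
        ks.append(k)
    return h[1:], np.array(ks), eps

# Check against Example 1: R_YY = [2.89, 1.51, 1.21]
h, ks, mse = levinson_durbin(np.array([2.89, 1.51, 1.21]), 2)
print(h)    # approx [0.4178, 0.2004]
```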


Some salient points
•   The reflection parameters and the mean-square error completely determine the LPC
    coefficients. Alternately, given the reflection coefficients and the final mean-square
    prediction error, we can determine the LPC coefficients.
•   The algorithm is order-recursive. By solving the mth-order linear prediction problem
    we get all lower-order solutions as well.
•   If the estimated autocorrelation sequence satisfies the properties of an autocorrelation
    function, the algorithm will yield stable coefficients.


6.9 Lattice filter realization of linear prediction error filters
Let
e_m^f[n] = prediction error due to mth-order forward prediction,
e_m^b[n] = prediction error due to mth-order backward prediction.
Then

e_m^f[n] = y[n] − Σ_{i=1}^{m} h_m[i] y[n − i]                                   (1)

e_m^b[n] = y[n − m] − Σ_{i=1}^{m} h_m[m + 1 − i] y[n + 1 − i]

From (1), we get

e_m^f[n] = y[n] − h_m[m] y[n − m] − Σ_{i=1}^{m−1} h_m[i] y[n − i]
         = y[n] + k_m y[n − m] − Σ_{i=1}^{m−1} (h_{m−1}[i] + k_m h_{m−1}[m − i]) y[n − i]
         = y[n] − Σ_{i=1}^{m−1} h_{m−1}[i] y[n − i] + k_m (y[n − m] − Σ_{i=1}^{m−1} h_{m−1}[m − i] y[n − i])
         = e_{m−1}^f[n] + k_m e_{m−1}^b[n − 1]

∴ e_m^f[n] = e_{m−1}^f[n] + k_m e_{m−1}^b[n − 1]

Similarly we can show that

e_m^b[n] = e_{m−1}^b[n − 1] + k_m e_{m−1}^f[n]
[Figure: one lattice stage: e_{m−1}^f[n] and the delayed e_{m−1}^b[n − 1] are cross-coupled
through the coefficient k_m to form e_m^f[n] and e_m^b[n].]
How do we initialize the lattice?
We have

e_0^f[n] = y[n]   and   e_0^b[n] = y[n]

Hence

e_0^f[n] = e_0^b[n] = y[n]

[Figure: the first lattice stage is fed with y[n] on both branches, with coefficient k₁ and a
unit delay z⁻¹ on the backward branch.]
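A minimal Python sketch of running data through the lattice (assuming the reflection coefficients k_m are already available, e.g. from the Levinson-Durbin recursion above):

```python
import numpy as np

def lattice_errors(y, ks):
    """Propagate y[n] through the lattice stages defined by the reflection
    coefficients ks; return the final forward and backward error sequences."""
    ef = np.asarray(y, dtype=float)
    eb = ef.copy()                                   # e_0^f = e_0^b = y
    for k in ks:
        eb_del = np.concatenate(([0.0], eb[:-1]))    # z^{-1}: e_b[n-1]
        ef, eb = ef + k * eb_del, eb_del + k * ef    # the two stage equations
    return ef, eb
```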


6.10 Advantages of the Lattice Structure
•   Modular structure: it can be extended by simply cascading another section. New stages
    can be added without modifying the earlier stages.
•   The same elements are used in each stage, so it is efficient for VLSI implementation.
•   Numerically well-behaved, as |k_m| < 1.
•   Each stage is decoupled from the earlier stages.


The decoupling follows from the fact that, for a W.S.S. signal, the backward error sequences
e_m^b[n], as a function of the order m, are uncorrelated:

e_i^b[n] and e_m^b[n], 0 ≤ i < m, are uncorrelated

(thus a Gram-Schmidt orthogonalisation may be obtained through the lattice filter). Indeed, since

e_k^b[n] = y[n − k] − Σ_{i=1}^{k} h_k[k + 1 − i] y[n + 1 − i]

E e_m^b[n] e_k^b[n] = E e_m^b[n] (y[n − k] − Σ_{i=1}^{k} h_k[k + 1 − i] y[n + 1 − i])
                    = 0   for 0 ≤ k < m

because e_m^b[n] is orthogonal to y[n], y[n − 1], ..., y[n − m + 1].
Thus the lattice filter can be used to whiten a sequence.
With this result, it can be shown that

k_m = − E(e_{m−1}^f[n] e_{m−1}^b[n − 1]) / E(e_{m−1}^b[n − 1])²        (i)

and

k_m = − E(e_{m−1}^f[n] e_{m−1}^b[n − 1]) / E(e_{m−1}^f[n])²            (ii)


Proof:
The mean-square forward prediction error is

E(e_m^f[n])² = E(e_{m−1}^f[n] + k_m e_{m−1}^b[n − 1])²

Minimizing this with respect to k_m,

2 E(e_{m−1}^f[n] + k_m e_{m−1}^b[n − 1]) e_{m−1}^b[n − 1] = 0

⇒ (i)

Similarly, minimizing E(e_m^b[n])² ⇒ (ii).




Example 2:
Consider the random signal model y[n] = x[n] + v[n], where v[n] is a 0-mean white noise
with variance 1 uncorrelated with x[n], and x[n] = 0.8x[n − 1] + w[n], where w[n] is a 0-
mean white noise with variance 0.68.
    a) Find the second-order linear predictor for y[n].
    b) Obtain the lattice structure for the prediction error filter.
    c) Use the above structure to design a second-order FIR Wiener filter to estimate
       x[n] from y[n].
CHAPTER – 7: ADAPTIVE FILTERS

7.1 Introduction
In practical situations, a system often operates in an uncertain environment where the
input conditions are not known and/or unexpected noise is present. Under such circumstances,
the system should have the flexibility to modify its parameters and make adjustments
based on the input signal and other relevant signals so as to obtain optimal
performance.
A system that searches for improved performance guided by a computational algorithm
for adjustment of the parameters or weights is called an adaptive system. An adaptive
system is time-varying.

The Wiener filter is a linear time-invariant filter.

In practical situations, the signal is nonstationary. Under such circumstances, the optimal
filter should be time-varying. The filter should have the ability to modify its parameters
based on the input signal and other relevant signals to obtain optimal performance.

How can this be done?
•   Assume stationarity within a certain data length. Buffering of data is required; this
    may work in some applications.
•   However, the time-duration over which stationarity is a valid assumption may be so
    short that accurate estimation of the model parameters is difficult.
•   One solution is adaptive filtering. Here the filter coefficients are updated as a
    function of the filtering error. The basic filter structure is as shown in Fig. 1.


[Fig. 1: the input y[n] drives a filter structure producing x̂[n]; the error e[n] between the
desired response x[n] and x̂[n] is fed back to an adaptive algorithm that updates the filter
coefficients.]

The filter structure is an FIR filter of known tap length, because the adaptation algorithm
updates each filter coefficient individually.
7.2 Method of Steepest Descent
Consider the FIR Wiener filter of length M. We want to compute the filter coefficients
iteratively.
Let us denote the time-varying filter parameters by
           hi [n], i = 0,1, ... M - 1
and define the filter parameter vector by
h[n] = \begin{bmatrix} h_0[n] \\ h_1[n] \\ \vdots \\ h_{M-1}[n] \end{bmatrix}

We want to find the filter coefficients so as to minimize the mean-square error Ee 2 [n]
where
e[n] = x[n] - \hat{x}[n]
     = x[n] - \sum_{i=0}^{M-1} h_i[n]\, y[n-i]
     = x[n] - h'[n]\,y[n]
     = x[n] - y'[n]\,h[n]

where y[n] = \begin{bmatrix} y[n] \\ y[n-1] \\ \vdots \\ y[n-M+1] \end{bmatrix}
Therefore
E e^2[n] = E\left(x[n] - h'[n]\,y[n]\right)^2
         = R_{XX}[0] - 2\,h'[n]\,r_{XY} + h'[n]\,R_{YY}\,h[n]

where r_{XY} = \begin{bmatrix} R_{XY}[0] \\ R_{XY}[1] \\ \vdots \\ R_{XY}[M-1] \end{bmatrix}
and
R_{YY} = \begin{bmatrix} R_{YY}[0] & R_{YY}[-1] & \cdots & R_{YY}[1-M] \\ R_{YY}[1] & R_{YY}[0] & \cdots & R_{YY}[2-M] \\ \vdots & & & \vdots \\ R_{YY}[M-1] & R_{YY}[M-2] & \cdots & R_{YY}[0] \end{bmatrix}
    •   The cost function E e^2[n] is quadratic in h[n].
    •   A unique global minimum exists.
    •   The minimum is obtained by setting the gradient of E e^2[n] to zero.
[Figure – the cost function E e^2[n]: a quadratic (bowl-shaped) surface in h[n].]
The optimal set of filter parameters is given by
h_{opt} = R_{YY}^{-1}\, r_{XY}


which is the FIR Wiener filter.


Many of the adaptive filter algorithms are obtained by simple modifications of the
algorithms for deterministic optimization. Most of the popular adaptation algorithms are
based on gradient-based optimization techniques, particularly the steepest descent
technique.

The optimal Wiener filter can be obtained iteratively by the method of steepest descent.
The optimum is found by updating the filter parameters by the rule
h[n+1] = h[n] + \frac{\mu}{2}\left(-\nabla E e^2[n]\right)
where
\nabla E e^2[n] = \begin{bmatrix} \dfrac{\partial E e^2[n]}{\partial h_0} \\ \vdots \\ \dfrac{\partial E e^2[n]}{\partial h_{M-1}} \end{bmatrix} = -2\,r_{XY} + 2\,R_{YY}\,h[n]
and µ is the step-size parameter.
So the steepest descent rule will now give
          h[n + 1] = h[n ] + µ (rXY − R YY h[n ] )
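As an illustration of this iteration, here is a minimal NumPy sketch of the steepest-descent update (the names steepest_descent, Ryy, rxy and the example statistics are assumptions for illustration only, not part of the original development):

```python
import numpy as np

def steepest_descent(Ryy, rxy, mu, n_iter=200):
    """Iteratively approach the FIR Wiener filter h = Ryy^{-1} rxy
    using the update h[n+1] = h[n] + mu * (rxy - Ryy @ h[n])."""
    h = np.zeros(len(rxy))
    for _ in range(n_iter):
        h = h + mu * (rxy - Ryy @ h)
    return h

# Assumed example statistics, for illustration only
Ryy = np.array([[2.0, 0.5], [0.5, 2.0]])
rxy = np.array([1.0, 0.3])
mu = 0.3                                  # must satisfy 0 < mu < 2 / lambda_max
print(steepest_descent(Ryy, rxy, mu))     # approaches np.linalg.solve(Ryy, rxy)
```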
7.3 Convergence of the steepest descent method
We have
h[n + 1] = h[n ] + µ (rXY − R YY h[n ] )
          = h[n ] − µR YY h[n ] + µrXY
         = (I − µR YY )h[n ] + µrXY
where I is the MxM identity matrix.
This is a coupled set of linear difference equations.


Can we break it into simpler equations?
R_{YY} can be diagonalized (the Karhunen-Loève / eigen-decomposition) by the following relation
R_{YY} = Q\Lambda Q'
where Q is the orthogonal matrix of eigenvectors of R_{YY} and \Lambda is a diagonal matrix with the corresponding eigenvalues as its diagonal elements.
 Also I = QQ ′ = Q ′Q
Therefore
h[n + 1] = (QQ′ − µQΛQ′)h[n] + µ ⋅ rXY
Multiply by Q′
Q′h[n + 1] = (I − µΛ)Q′h[n] + µQ′rXY
Define the transformed variables
\tilde{h}[n] = Q'h[n] \quad\text{and}\quad \tilde{r}_{XY} = Q'r_{XY}
Then
\tilde{h}[n+1] = (I - \mu\Lambda)\tilde{h}[n] + \mu\,\tilde{r}_{XY}
          = \begin{bmatrix} 1-\mu\lambda_1 & & 0 \\ & \ddots & \\ 0 & & 1-\mu\lambda_M \end{bmatrix}\tilde{h}[n] + \mu\,\tilde{r}_{XY}

This is a decoupled set of linear difference equations
\tilde{h}_i[n+1] = (1-\mu\lambda_i)\,\tilde{h}_i[n] + \mu\,\tilde{r}_{XY}[i], \quad i = 0,1,\ldots,M-1

and can be easily solved. The stability condition is given by
|1 - \mu\lambda_i| < 1
\Rightarrow -1 < 1 - \mu\lambda_i < 1
\Rightarrow 0 < \mu < 2/\lambda_i, \quad i = 1,\ldots,M

Note that all the eigenvalues of R_{YY} are positive.

Let \lambda_{max} be the maximum eigenvalue. Then
\lambda_{max} < \lambda_1 + \lambda_2 + \cdots + \lambda_M = \mathrm{Trace}(R_{YY})
\therefore 0 < \mu < \frac{2}{\mathrm{Trace}(R_{YY})} = \frac{2}{M\,R_{YY}[0]}
The steepest descent algorithm converges to the corresponding Wiener filter,
\lim_{n\to\infty} h[n] = R_{YY}^{-1}\, r_{XY},


if the step size µ is within the range specified by the above relation.


7.4 Rate of Convergence
The rate of convergence of the steepest descent algorithm depends on the factors (1 - \mu\lambda_i) in

\tilde{h}_i[n+1] = (1-\mu\lambda_i)\,\tilde{h}_i[n] + \mu\,\tilde{r}_{XY}[i], \quad i = 0,1,\ldots,M-1

Thus the rate of convergence depends on the statistics of the data and is related to the eigenvalue spread of the autocorrelation matrix. This is expressed using the condition number of R_{YY}, defined as k = \lambda_{max}/\lambda_{min}, where \lambda_{max} and \lambda_{min} are respectively the maximum and minimum eigenvalues of R_{YY}. The fastest convergence occurs when k = 1, corresponding to white noise.


7.5 LMS (Least-Mean-Square) Algorithm
Consider the steepest descent relation
h[n+1] = h[n] - \frac{\mu}{2}\,\nabla E e^2[n]
where
\nabla E e^2[n] = \begin{bmatrix} \dfrac{\partial E e^2[n]}{\partial h_0} \\ \vdots \\ \dfrac{\partial E e^2[n]}{\partial h_{M-1}} \end{bmatrix}
In the LMS algorithm E e^2[n] is approximated by the instantaneous squared error e^2[n] to achieve a computationally simple algorithm, so that
\nabla E e^2[n] \cong 2\,e[n]\begin{bmatrix} \dfrac{\partial e[n]}{\partial h_0} \\ \vdots \\ \dfrac{\partial e[n]}{\partial h_{M-1}} \end{bmatrix}
Now consider
e[n] = x[n] - \sum_{i=0}^{M-1} h_i[n]\, y[n-i]

\frac{\partial e[n]}{\partial h_j} = -y[n-j], \quad j = 0,1,\ldots,M-1

\therefore \begin{bmatrix} \dfrac{\partial e[n]}{\partial h_0} \\ \vdots \\ \dfrac{\partial e[n]}{\partial h_{M-1}} \end{bmatrix} = -\begin{bmatrix} y[n] \\ y[n-1] \\ \vdots \\ y[n-M+1] \end{bmatrix} = -y[n]

\therefore \nabla E e^2[n] \cong -2\,e[n]\,y[n]
The steepest descent update now becomes
    h[n + 1] = h[n] + µe[n ]y[n]
This modification is due to Widrow and Hoff, and the corresponding adaptive filter is known as the LMS filter.


Hence the LMS algorithm is as follows:
      Given the input signal y[n], the reference signal x[n] and the step size µ
1. Initialization: h_i[0] = 0, i = 0,1,2,\ldots,M-1
2. For n ≥ 0, compute the filter output
      \hat{x}[n] = h'[n]\,y[n]
   and the estimation error
      e[n] = x[n] - \hat{x}[n]
3. Tap-weight adaptation
      h[n+1] = h[n] + \mu\, e[n]\,y[n]
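A minimal sketch of these LMS steps in Python follows (the function lms and its interface are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def lms(x, y, M, mu):
    """Minimal LMS sketch: y is the filter input, x the reference signal,
    M the tap length and mu the step size."""
    N = len(y)
    h = np.zeros(M)
    e = np.zeros(N)
    for n in range(M, N):
        yv = y[n - M + 1:n + 1][::-1]   # data vector [y[n], ..., y[n-M+1]]
        x_hat = h @ yv                  # filter output
        e[n] = x[n] - x_hat             # estimation error
        h = h + mu * e[n] * yv          # tap-weight adaptation
    return h, e
```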

[Figure: LMS adaptive filter — the FIR filter h_i[n], i = 0,1,...,M−1 processes y[n] to produce x̂[n]; the error e[n] = x[n] − x̂[n] drives the LMS algorithm that updates the filter coefficients.]

7.6 Convergence of the LMS algorithm
As there is a feedback loop in the adaptive algorithm, convergence is generally not
assured. The convergence of the algorithm depends on the step size parameter µ .
   •   The LMS algorithm is convergent in the mean if the step-size parameter µ satisfies the condition
              0 < \mu < \frac{2}{\lambda_{max}}
Proof:
h[n + 1] = h[n] + µe[n ]y[n]
∴ Eh[n + 1] = Eh[n] + µ Ey[n]e[n]
                 = Eh[n] + µ Ey[n]( x[n] − y′[n]h[n])
                 = Eh[n] + µ rXY − µ Ey[n]y′[n]h[n]
Assuming the coefficient to be independent of data (Independence Assumption), we get
Eh[n+1] = Eh[n] + \mu\, r_{XY} - \mu\, E\left(y[n]y'[n]\right)Eh[n]
        = Eh[n] + \mu\, r_{XY} - \mu\, R_{YY}\, Eh[n]
Hence the mean value of the filter coefficients satisfies the steepest descent iterative
relation so that the same stability condition applies to the mean of the filter coefficients.
   •   In practical situations, knowledge of \lambda_{max} is not available; Trace(R_{YY}), which upper-bounds \lambda_{max}, can be used to obtain the conservative condition for convergence
              0 < \mu < \frac{2}{\mathrm{Trace}(R_{YY})}
   •   Also note that Trace(R_{YY}) = M\,R_{YY}[0] = total tap-input power of the LMS filter.
Generally, too small a value of µ results in slow convergence, whereas a large value of µ results in larger fluctuations about the mean. Choosing a proper value of µ is therefore very important for the performance of the LMS algorithm.


In addition, the rate of convergence depends on the statistics of the data and is related to the eigenvalue spread of the autocorrelation matrix, expressed by the condition number of R_{YY}, defined as k = \lambda_{max}/\lambda_{min}, where \lambda_{min} is the minimum eigenvalue of R_{YY}. The fastest convergence occurs when k = 1, corresponding to white noise. Thus the fastest way to train an LMS adaptive system is to use white noise as the training input; as the noise becomes more and more coloured, the speed of training decreases.


The average of each filter tap-weight converges to the corresponding optimal filter tap-weight. But this does not ensure that the coefficients themselves converge to the optimal values.


7.7 Excess mean square error
Consider the LMS difference equation:
        h[n + 1] = h[n]+ µe[n]y[n]
We have seen that the mean of the LMS coefficients converges to the steepest-descent solution. But this does not guarantee that the mean-square error of the LMS estimator converges to the mean-square error of the Wiener solution: the LMS coefficients fluctuate about the Wiener filter coefficients.
Let h_{opt} be the optimal Wiener filter impulse response. The instantaneous deviation of the LMS coefficients from h_{opt} is
\Delta h[n] = h[n] - h_{opt}

\varepsilon[n] = E e^2[n] = E\left\{x[n] - h'_{opt}y[n] - \Delta h'[n]y[n]\right\}^2
       = E\left\{x[n] - h'_{opt}y[n]\right\}^2 + E\,\Delta h'[n]\,y[n]y'[n]\,\Delta h[n] - 2E\left(e_{opt}[n]\,\Delta h'[n]\,y[n]\right)
       = \varepsilon_{min} + E\,\Delta h'[n]\,y[n]y'[n]\,\Delta h[n] - 2E\left(e_{opt}[n]\,\Delta h'[n]\,y[n]\right)
       = \varepsilon_{min} + E\,\Delta h'[n]\,y[n]y'[n]\,\Delta h[n]


assuming that the deviation is independent of the data and that E\,\Delta h[n] = 0.
Therefore,
        ε excess = E∆h ′[n]y[n]y ′[n]∆h[n]
An exact analysis of the excess mean-square error is quite complicated; its approximate value is given by

\varepsilon_{excess} = \varepsilon_{min}\;\frac{\displaystyle\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}}{\displaystyle 1-\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}}

The LMS algorithm is said to converge in the mean-square sense provided the step-size parameter satisfies the relations

\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i} < 1 \quad\text{and}\quad 0 < \mu < \frac{2}{\lambda_{max}}

If \sum_{i=1}^{M}\dfrac{\mu\lambda_i}{2-\mu\lambda_i} \ll 1, then

\varepsilon_{excess} \cong \varepsilon_{min}\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}

Further, if \mu \ll 1,

\varepsilon_{excess} \cong \varepsilon_{min}\,\frac{\mu}{2}\,\mathrm{Trace}(R_{YY})

The factor \dfrac{\varepsilon_{excess}}{\varepsilon_{min}} = \sum_{i=1}^{M}\dfrac{\mu\lambda_i}{2-\mu\lambda_i} is called the misadjustment factor of the LMS filter.


7.8 Drawback of the LMS Algorithm
Convergence is slow when the eigenvalue spread of the autocorrelation matrix is large.
The misadjustment factor, given by
\frac{\varepsilon_{excess}}{\varepsilon_{min}} \approx \frac{1}{2}\,\mu\,\mathrm{Trace}(R_{YY}),
is large unless µ is very small. Thus the selection of the step-size parameter is crucial
in the case of the LMS algorithm.
When the input signal is nonstationary the eigenvalues also change with time and
selection of µ becomes more difficult.
Example
The input to a communication channel is a test sequence x[n] = 0.8x[n-1] + w[n], where w[n] is a zero-mean, unit-variance white noise. The channel transfer function is given by
H(z) = z^{-1} - 0.5z^{-2} and the channel output is affected by additive white Gaussian noise of variance 1.
   (a) Find the FIR Wiener filter of length 2 for channel equalization
   (b) Write down the LMS filter update equations
   (c) Find the bounds of LMS step length parameters
   (d) Find the excess mean square error.

Solution:
From the given model:
[Figure: x[n] is passed through the channel H(z); additive channel noise gives the received signal y[n], which is fed to the equalizer to produce x̂[n]; the error is formed against x[n].]

R_{XX}[m] = \frac{\sigma_W^2}{1-0.8^2}\,(0.8)^{|m|}
R_{XX}[0] = 2.78,\quad R_{XX}[1] = 2.22,\quad R_{XX}[2] = 1.78,\quad R_{XX}[3] = 1.42
  y[n ] = x[n − 1] − 0.5 x[n − 2] + v ( n )
          ∴ RYY [m] = 1.25RXX [m ] − 0.5RXX [m + 1] − 0.5RXX [m − 1] + δ [m ]
             RYY [0] = 2.255
            RYY [1] = 0.5325
  also RXY [m] = Ex[n ] ( x[n − 1 − m] − 0.5 x[n − 2 − m ] + v[n − m])
            = RXX [m + 1] − 0.5 RXX [m + 2]
 ∴ RXY [0] = 1.33
  and RXY [1] = 1.07

Therefore, the Wiener solution is given by
\begin{bmatrix} 2.255 & 0.533 \\ 0.533 & 2.255 \end{bmatrix}\begin{bmatrix} h_0 \\ h_1 \end{bmatrix} = \begin{bmatrix} 1.33 \\ 1.07 \end{bmatrix}
\Rightarrow h_0 = 0.51, \quad h_1 = 0.35
MSE = R_{XX}[0] - h_0 R_{XY}[0] - h_1 R_{XY}[1]
The eigenvalues of R_{YY} are \lambda_1, \lambda_2 = 2.79, 1.72, so the step size must satisfy
\mu < \frac{2}{2.79} = 0.72
The excess mean-square error is
\varepsilon_{excess} = \varepsilon_{min}\;\frac{\displaystyle\sum_{i=1}^{2}\frac{\mu\lambda_i}{2-\mu\lambda_i}}{\displaystyle 1-\sum_{i=1}^{2}\frac{\mu\lambda_i}{2-\mu\lambda_i}}
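A quick numerical check of the Wiener solution quoted above — a sketch that simply solves the 2×2 system using the values given in the example:

```python
import numpy as np

# Values quoted in the example above
Ryy = np.array([[2.255, 0.533],
                [0.533, 2.255]])
rxy = np.array([1.33, 1.07])

h = np.linalg.solve(Ryy, rxy)      # Wiener solution
lam = np.linalg.eigvalsh(Ryy)      # eigenvalues of Ryy
print(h)                           # approximately [0.51, 0.35]
print(lam, 2 / lam.max())          # eigenvalues and step-size bound ~0.72
```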




7.9 Leaky LMS Algorithm
Minimizes the cost e^2[n] + \alpha\,\lVert h[n]\rVert^2,
where \lVert h[n]\rVert is the norm of the LMS weight vector and \alpha is a positive quantity.

The corresponding algorithm is given by
h[n+1] = (1-\mu\alpha)\,h[n] + \mu\, e[n]\,y[n]
where \mu\alpha is chosen to be less than 1. In such a situation the pole remains inside the unit circle, there is no instability problem and the algorithm converges.
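A one-line sketch of this leaky-LMS update (the helper name and interface are illustrative assumptions, based on the update rule above):

```python
import numpy as np

def leaky_lms_update(h, yv, x_n, mu, alpha):
    """One leaky-LMS step: h is the tap-weight vector, yv the data vector
    [y[n], ..., y[n-M+1]], x_n the reference sample, mu the step size and
    alpha the leakage factor (mu*alpha < 1 assumed)."""
    e_n = x_n - h @ yv
    return (1.0 - mu * alpha) * h + mu * e_n * yv, e_n
```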


7.10 Normalized LMS Algorithm
For convergence of the LMS algorithm
0 < \mu < \frac{2}{\lambda_{max}}
and the conservative bound is given by
0 < \mu < \frac{2}{\mathrm{Trace}(R_{YY})} = \frac{2}{M\,R_{YY}[0]} = \frac{2}{M\,E\left(y^2[n]\right)}

We can estimate this bound by estimating E\left(y^2[n]\right) by \frac{1}{M}\sum_{i=0}^{M-1} y^2[n-i], so that we get the bound
0 < \mu < \frac{2}{\sum_{i=0}^{M-1} y^2[n-i]} = \frac{2}{\lVert y[n]\rVert^2}

Then we can take
\mu = \frac{\beta}{\lVert y[n]\rVert^2}

and the LMS update becomes
h[n+1] = h[n] + \frac{\beta}{\lVert y[n]\rVert^2}\, e[n]\,y[n]

where 0 < \beta < 2.


7.11 Discussion - LMS
The NLMS algorithm has more computational complexity compared to the LMS
algorithm
Under certain assumptions, it can be shown that the NLMS algorithm converges for
0<β <2
\lVert y[n]\rVert^2 can be efficiently updated using the recursive relation
\lVert y[n]\rVert^2 = \lVert y[n-1]\rVert^2 + y^2[n] - y^2[n-M]

Notice that the NLMS algorithm does not change the direction of the update used in the steepest-descent algorithm.

If y[n] is close to zero, the denominator term \lVert y[n]\rVert^2 in the NLMS equation becomes very small and
h[n+1] = h[n] + \frac{\beta}{\lVert y[n]\rVert^2}\, e[n]\,y[n]
may diverge.

To overcome this drawback, a small positive number \varepsilon is added to the denominator of the NLMS equation. Thus
h[n+1] = h[n] + \frac{\beta}{\varepsilon + \lVert y[n]\rVert^2}\, e[n]\,y[n]
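A minimal sketch of the resulting normalized-LMS (NLMS) recursion, including the small constant ε in the denominator (the function nlms and its interface are illustrative assumptions):

```python
import numpy as np

def nlms(x, y, M, beta, eps=1e-6):
    """Minimal NLMS sketch: beta is the normalized step size (0 < beta < 2);
    eps protects the denominator when the input energy is close to zero."""
    N = len(y)
    h = np.zeros(M)
    e = np.zeros(N)
    for n in range(M, N):
        yv = y[n - M + 1:n + 1][::-1]                   # current data vector
        e[n] = x[n] - h @ yv                            # estimation error
        h = h + (beta / (eps + yv @ yv)) * e[n] * yv    # normalized update
    return h, e
```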



For computational efficiency, other modifications of the LMS algorithm have been suggested; some of the modified algorithms are the block LMS algorithm, the signed LMS algorithm, etc.
An LMS algorithm can also be obtained for an IIR filter, to adaptively update the parameters of the filter
y[n] = \sum_{i=1}^{M-1} a_i[n]\, y[n-i] + \sum_{i=0}^{N-1} b_i[n]\, x[n-i]
However, the IIR LMS algorithm has poor performance compared to the FIR LMS filter.
7.12 Recursive Least Squares (RLS) Adaptive Filter
    •   LMS convergence is slow
    •   The step-size parameter has to be chosen properly
    •   The excess mean-square error is high
    •   LMS minimizes the instantaneous squared error e^2[n],
        where e[n] = x[n] - h'[n]\,y[n] = x[n] - y'[n]\,h[n]


The RLS algorithm considers all the available data in determining the filter parameters; the filter should be optimal with respect to all the available data in a certain sense.


Minimizes the cost function
\varepsilon[n] = \sum_{k=0}^{n}\lambda^{n-k}\, e^2[k]
with respect to the filter parameter vector h[n] = \begin{bmatrix} h_0[n] \\ h_1[n] \\ \vdots \\ h_{M-1}[n] \end{bmatrix}
where λ is the weighting factor, known as the forgetting factor.


•   Recent data is given more weightage
•   For stationary case λ = 1 can be taken
•   λ ≅ 0.99 is effective in tracking local nonstationarity


The minimization problem is:

Minimize \varepsilon[n] = \sum_{k=0}^{n}\lambda^{n-k}\left(x[k] - y'[k]\,h[n]\right)^2 with respect to h[n].

The minimum is given by
\frac{\partial\varepsilon[n]}{\partial h[n]} = 0
\Rightarrow 2\sum_{k=0}^{n}\lambda^{n-k}\left(x[k]\,y[k] - y[k]y'[k]\,h[n]\right) = 0
\Rightarrow h[n] = \left(\sum_{k=0}^{n}\lambda^{n-k}\,y[k]y'[k]\right)^{-1}\sum_{k=0}^{n}\lambda^{n-k}\,x[k]\,y[k]

Let us define \hat{R}_{YY}[n] = \sum_{k=0}^{n}\lambda^{n-k}\,y[k]y'[k],
which is an estimator for the autocorrelation matrix R_{YY}.
Similarly, \hat{r}_{XY}[n] = \sum_{k=0}^{n}\lambda^{n-k}\,x[k]\,y[k] is an estimator for the cross-correlation vector r_{XY}.

Hence h[n] = \hat{R}_{YY}^{-1}[n]\,\hat{r}_{XY}[n]

Matrix inversion is involved, which makes the direct solution difficult. We therefore look for a recursive solution.


7.13 Recursive representation of \hat{R}_{YY}[n]

\hat{R}_{YY}[n] can be rewritten as follows
\hat{R}_{YY}[n] = \sum_{k=0}^{n-1}\lambda^{n-k}\,y[k]y'[k] + y[n]y'[n]
        = \lambda\sum_{k=0}^{n-1}\lambda^{n-1-k}\,y[k]y'[k] + y[n]y'[n]
        = \lambda\,\hat{R}_{YY}[n-1] + y[n]y'[n]

This shows that the autocorrelation matrix can be computed recursively from its previous value and the present data vector.
Similarly, \hat{r}_{XY}[n] = \lambda\,\hat{r}_{XY}[n-1] + x[n]\,y[n]

h[n] = \hat{R}_{YY}^{-1}[n]\,\hat{r}_{XY}[n]
     = \left(\lambda\,\hat{R}_{YY}[n-1] + y[n]y'[n]\right)^{-1}\hat{r}_{XY}[n]
For the matrix inversion above the matrix inversion lemma will be useful.


7.14 Matrix Inversion Lemma
If A, B, C and D are matrices of proper orders, A and C nonsingular
           (A + BCD) −1 = A −1 - A −1B (DA −1B + C −1 ) −1 DA −1

Taking A = \lambda\hat{R}_{YY}[n-1],\ B = y[n],\ C = 1 and D = y'[n], we have

\left(\hat{R}_{YY}[n]\right)^{-1} = \frac{1}{\lambda}\hat{R}_{YY}^{-1}[n-1] - \frac{1}{\lambda}\hat{R}_{YY}^{-1}[n-1]\,y[n]\left(y'[n]\frac{1}{\lambda}\hat{R}_{YY}^{-1}[n-1]\,y[n] + 1\right)^{-1}y'[n]\frac{1}{\lambda}\hat{R}_{YY}^{-1}[n-1]

      = \frac{1}{\lambda}\left(\hat{R}_{YY}^{-1}[n-1] - \frac{\hat{R}_{YY}^{-1}[n-1]\,y[n]y'[n]\,\hat{R}_{YY}^{-1}[n-1]}{\lambda + y'[n]\,\hat{R}_{YY}^{-1}[n-1]\,y[n]}\right)

Rename P[n] = \hat{R}_{YY}^{-1}[n]. Then
P[n] = \frac{1}{\lambda}\left(P[n-1] - k[n]\,y'[n]\,P[n-1]\right)
where k[n] is called the 'gain vector' and is given by
k[n] = \frac{P[n-1]\,y[n]}{\lambda + y'[n]\,P[n-1]\,y[n]}
The gain vector k[n], which is important in interpreting the adaptation, is also related to the current data vector y[n] by
k[n] = P[n]\,y[n]
To establish this relation, consider
P[n] = \frac{1}{\lambda}\left(P[n-1] - k[n]\,y'[n]\,P[n-1]\right)
Multiplying by \lambda, post-multiplying by y[n] and simplifying, we get
\lambda P[n]\,y[n] = \left(P[n-1] - k[n]\,y'[n]\,P[n-1]\right)y[n]
          = P[n-1]\,y[n] - k[n]\,y'[n]\,P[n-1]\,y[n]
          = \lambda\,k[n]
Therefore
h[n] = \hat{R}_{YY}^{-1}[n]\,\hat{r}_{XY}[n]
     = P[n]\left(\lambda\,\hat{r}_{XY}[n-1] + x[n]\,y[n]\right)
     = \lambda\,P[n]\,\hat{r}_{XY}[n-1] + x[n]\,P[n]\,y[n]
     = \lambda\,\frac{1}{\lambda}\left(P[n-1] - k[n]\,y'[n]\,P[n-1]\right)\hat{r}_{XY}[n-1] + x[n]\,P[n]\,y[n]
     = h[n-1] - k[n]\,y'[n]\,h[n-1] + x[n]\,k[n]
     = h[n-1] + k[n]\left(x[n] - y'[n]\,h[n-1]\right)


7.15 RLS Algorithm Steps
Initialization:
At n = 0:
   P[0] = \delta\, I_{M\times M}, with \delta a positive number
   y[0] = 0,\quad h[0] = 0
Choose \lambda.
Operation:
For n = 1 to Final do
      1.   Get x[n], y[n]
      2.   Compute the a priori error e[n] = x[n] - h'[n-1]\,y[n]
      3.   Calculate the gain vector k[n] = \dfrac{P[n-1]\,y[n]}{\lambda + y'[n]\,P[n-1]\,y[n]}
      4.   Update the filter parameters h[n] = h[n-1] + k[n]\,e[n]
      5.   Update the P matrix P[n] = \dfrac{1}{\lambda}\left(P[n-1] - k[n]\,y'[n]\,P[n-1]\right)
end do
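A minimal sketch of these RLS steps (the function rls and the example values of λ and δ are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def rls(x, y, M, lam=0.99, delta=100.0):
    """Minimal RLS sketch following the steps above."""
    N = len(y)
    h = np.zeros(M)
    P = delta * np.eye(M)                    # P[0] = delta * I
    e = np.zeros(N)
    for n in range(M, N):
        yv = y[n - M + 1:n + 1][::-1]        # data vector y[n]
        e[n] = x[n] - h @ yv                 # a priori error
        k = P @ yv / (lam + yv @ P @ yv)     # gain vector
        h = h + k * e[n]                     # parameter update
        P = (P - np.outer(k, yv @ P)) / lam  # P-matrix update
    return h, e
```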



7.16 Discussion – RLS


7.16.1 Relation with Wiener filter
We have the following optimality condition for the RLS filter:
\hat{R}_{YY}[n]\,h[n] = \hat{r}_{XY}[n]
where \hat{R}_{YY}[n] = \sum_{k=0}^{n}\lambda^{n-k}\,y[k]y'[k].

Dividing by n+1,
\frac{\hat{R}_{YY}[n]}{n+1} = \frac{\sum_{k=0}^{n}\lambda^{n-k}\,y[k]y'[k]}{n+1}

If we consider the elements of \hat{R}_{YY}[n]/(n+1), we see that each is an estimator of the autocorrelation at a specific lag, and
\lim_{n\to\infty}\frac{\hat{R}_{YY}[n]}{n+1} = R_{YY}
in the mean-square sense: the weighted-window sample autocorrelation is a consistent estimator.
Similarly,
\lim_{n\to\infty}\frac{\hat{r}_{XY}[n]}{n+1} = r_{XY}

Hence, as n \to \infty, the optimality condition can be written as
R_{YY}\,h[n] = r_{XY},
which is the Wiener-Hopf equation; the RLS solution therefore approaches the FIR Wiener filter.
7.16.2. Dependence on the initial values
Consider the recursive relation
\hat{R}_{YY}[n] = \lambda\,\hat{R}_{YY}[n-1] + y[n]y'[n]

Corresponding to the initialization
\hat{R}_{YY}^{-1}[0] = P[0] = \delta\, I
we have
\hat{R}_{YY}[0] = \frac{I}{\delta}
With this initial condition the matrix difference equation has the solution
\hat{R}'_{YY}[n] = \lambda^{n+1}\,\frac{I}{\delta} + \sum_{k=0}^{n}\lambda^{n-k}\,y[k]y'[k] = \lambda^{n+1}\,\frac{I}{\delta} + \hat{R}_{YY}[n]
Hence the optimality condition is modified as
\left(\lambda^{n+1}\,\frac{I}{\delta} + \hat{R}_{YY}[n]\right)\tilde{h}[n] = \hat{r}_{XY}[n]
where \tilde{h}[n] is the modified solution due to the assumed initial value of the P-matrix, i.e.
\frac{\lambda^{n+1}}{\delta}\,\hat{R}_{YY}^{-1}[n]\,\tilde{h}[n] + \tilde{h}[n] = h[n]
If we take \lambda < 1, the bias term on the left-hand side of the above equation will eventually die out and we will get
\tilde{h}[n] = h[n]


7.16.3. Convergence in stationary condition
            •    If the data is stationary, the algorithm will converge in mean at best in M
                 iterations, where M is the number of taps in the adaptive FIR filter.
            •    The filter coefficients converge in the mean to the corresponding Wiener filter
                 coefficients.
            •    Unlike the LMS filter which converges in the mean at infinite iterations, the
                 RLS filter converges in a finite time. Convergence is less sensitive to eigen
                 value spread. This is a remarkable feature of the RLS algorithm.
            •    The RLS filter can also be shown to converge to the Wiener filter in the
                 mean-square sense so that there is zero excess mean-square error.
7.16.4. Tracking non-stationarity
If \lambda is small, \lambda^{n-i} \cong 0 for i \ll n,
⇒ the filter is based on the most recent values. This also qualitatively explains why the filter can track non-stationarity in the data.




7.16.5. Computational Complexity
Several matrix multiplications result in ≅ 7M² arithmetic operations per update, which is quite large if the filter tap-length is high. Efficient implementations of the RLS algorithm are therefore required.
CHAPTER – 8: KALMAN FILTER

8.1 Introduction
[Figure: the signal x[n] is corrupted by additive noise to give the observation y[n]; a linear filter operates on y[n] to produce the estimate x̂[n].]
To estimate a signal x[ n] in the presence of noise,
   •    FIR Wiener Filter is optimum when the data length and the filter length are equal.
   •    IIR Wiener Filter is based on the assumption that infinite length of data sequence
        is available.
Neither of the above filters represents the physical situation. We need a filter that adds a
tap with each addition of data.


The basic mechanism of the Kalman filter (R.E. Kalman, 1960) is to estimate the signal recursively by a relation of the form
\hat{x}[n] = A_n\,\hat{x}[n-1] + K_n\,y[n]
The Kalman filter is also based on the innovation representation of the signal; we used this model earlier to develop the causal IIR Wiener filter.


8.2 Signal Model
The simplest Kalman filter uses the first-order AR signal model
x[n] = ax[n − 1] + w[ n]
where w[n] is a white noise sequence.
The observed data is given by
y[n] = x[ n] + v[ n]
where v[n] is another white noise sequence independent of the signal.
The general stationary signal is modeled by a difference equation representing the ARMA
(p,q) model. Such a signal can be modeled by the state-space model and is given by
                  x[n] = Ax[n − 1] + Bw[ n]                                   (1)
And the observations can be represented as a linear combination of the ‘states’ and the
observation noise.
                 y[n] = c′x[n] + v[ n]                                        (2)
Equations (1) and (2) have direct relation with the state space model in the control system
where you have to estimate the ‘unobservable’ states of the system through an observer
that performs well against noise.


Example 1:
Consider the AR(M) model
x[n] = a_1 x[n-1] + a_2 x[n-2] + \cdots + a_M x[n-M] + w[n]
Then the state-variable model for x[n] is given by
\mathbf{x}[n] = A\,\mathbf{x}[n-1] + b\,w[n]
where
\mathbf{x}[n] = \begin{bmatrix} x_1[n] \\ x_2[n] \\ \vdots \\ x_M[n] \end{bmatrix}, \quad x_1[n] = x[n],\ x_2[n] = x[n-1],\ \ldots,\ x_M[n] = x[n-M+1],

A = \begin{bmatrix} a_1 & a_2 & \cdots & a_{M-1} & a_M \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}
\quad\text{and}\quad
b = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
Our analysis will cover only the simple (scalar) Kalman filter.

The Kalman filter also uses the innovation representation of the stationary signal, as does the IIR Wiener filter. The innovation representation is shown in the following diagram.


[Figure: the observations y[0], y[1], ..., y[n] are passed through an orthogonalisation stage to produce the innovation sequence \tilde{y}[0], \tilde{y}[1], ..., \tilde{y}[n].]

Let \hat{x}[n] be the LMMSE estimate of x[n] based on the data y[0], y[1], ..., y[n].
In the above representation \tilde{y}[n] is the innovation of y[n] and contains the same information as the original sequence. Let \hat{E}(y[n] \mid y[n-1], \ldots, y[0]) be the linear prediction of y[n] based on y[n-1], \ldots, y[0].
Then
\tilde{y}[n] = y[n] - \hat{E}(y[n] \mid y[n-1], \ldots, y[0])
      = y[n] - \hat{E}(x[n] + v[n] \mid y[n-1], \ldots, y[0])
      = y[n] - \hat{E}(a\,x[n-1] + w[n] + v[n] \mid y[n-1], \ldots, y[0])
      = y[n] - \hat{E}(a\,x[n-1] \mid y[n-1], \ldots, y[0])
      = y[n] - a\,\hat{x}[n-1]
      = y[n] - x[n] + x[n] - a\,\hat{x}[n-1]
      = y[n] - x[n] + a\,x[n-1] + w[n] - a\,\hat{x}[n-1]
      = y[n] - x[n] + a\left(x[n-1] - \hat{x}[n-1]\right) + w[n]
      = v[n] + a\,e[n-1] + w[n]


which is a linear combination of three mutually independent, orthogonal sequences.
It can easily be shown that \tilde{y}[n] is orthogonal to each of y[n-1], y[n-2], \ldots, y[0].


The LMMSE estimate of x[n] based on y[0], y[1], ..., y[n] is the same as the estimate based on the innovation sequence \tilde{y}[0], \tilde{y}[1], \ldots, \tilde{y}[n-1], \tilde{y}[n]. Therefore,
\hat{x}[n] = \sum_{i=0}^{n} k_i\,\tilde{y}[i]
where the k_i can be obtained from the orthogonality relation.
Consider the relation
x[n] = \hat{x}[n] + e[n] = \sum_{i=0}^{n} k_i\,\tilde{y}[i] + e[n]
Then
E\left(x[n] - \sum_{i=0}^{n} k_i\,\tilde{y}[i]\right)\tilde{y}[j] = 0, \quad j = 0,1,\ldots,n
so that
k_j = E\,x[n]\,\tilde{y}[j]\,/\,\sigma_j^2, \quad j = 0,1,\ldots,n

Similarly,
x[n-1] = \hat{x}[n-1] + e[n-1] = \sum_{i=0}^{n-1} k_i'\,\tilde{y}[i] + e[n-1]
k_j' = E\,x[n-1]\,\tilde{y}[j]\,/\,\sigma_j^2, \qquad j = 0,1,\ldots,n-1
     = E\left(x[n] - w[n]\right)\tilde{y}[j]\,/\,a\sigma_j^2, \quad j = 0,1,\ldots,n-1
     = E\,x[n]\,\tilde{y}[j]\,/\,a\sigma_j^2, \qquad j = 0,1,\ldots,n-1
\therefore k_j' = k_j / a, \qquad j = 0,1,\ldots,n-1
Again,
\hat{x}[n] = \sum_{i=0}^{n} k_i\,\tilde{y}[i] = \sum_{i=0}^{n-1} k_i\,\tilde{y}[i] + k_n\,\tilde{y}[n]
      = a\sum_{i=0}^{n-1} k_i'\,\tilde{y}[i] + k_n\,\tilde{y}[n]
      = a\,\hat{x}[n-1] + k_n\left(y[n] - \hat{E}(y[n] \mid y[n-1], \ldots, y[0])\right)
      = a\,\hat{x}[n-1] + k_n\left(y[n] - a\,\hat{x}[n-1]\right)
      = (1 - k_n)\,a\,\hat{x}[n-1] + k_n\,y[n]
where \hat{E}(y[n] \mid y[n-1], \ldots, y[0]) is the linear prediction of y[n] based on the observations y[n-1], \ldots, y[0], which is the same as a\,\hat{x}[n-1].

\therefore \hat{x}[n] = A_n\,\hat{x}[n-1] + k_n\,y[n]
with A_n = (1 - k_n)\,a


Thus the recursive estimator \hat{x}[n] is given by
\hat{x}[n] = A_n\,\hat{x}[n-1] + k_n\,y[n]
or
\hat{x}[n] = a\,\hat{x}[n-1] + k_n\left(y[n] - a\,\hat{x}[n-1]\right)

The filter can be represented by the following diagram:
[Figure: y[n] minus the fed-back prediction a\hat{x}[n-1] is scaled by k_n and added to a\hat{x}[n-1] to give \hat{x}[n]; the output is delayed (z^{-1}) and multiplied by a to form the feedback.]

8.3 Estimation of the filter-parameters
Consider the estimator
\hat{x}[n] = A_n\,\hat{x}[n-1] + k_n\,y[n]
The estimation error is given by
e[n] = x[n] - \hat{x}[n]
Therefore e[n] must be orthogonal to the present and past observed data:
E\,e[n]\,y[n-m] = 0, \quad m \ge 0
We want to find A_n and k_n using this condition.

First consider the condition that e[n] is orthogonal to the current data:
E\,e[n]\,y[n] = 0
\Rightarrow E\,e[n]\left(x[n] + v[n]\right) = 0
\Rightarrow E\,e[n]\,x[n] + E\,e[n]\,v[n] = 0
\Rightarrow E\,e[n]\left(\hat{x}[n] + e[n]\right) + E\,e[n]\,v[n] = 0
\Rightarrow E\,e^2[n] + E\,e[n]\,v[n] = 0
\Rightarrow \varepsilon^2[n] + E\left(x[n] - A_n\,\hat{x}[n-1] - k_n\,y[n]\right)v[n] = 0
\Rightarrow \varepsilon^2[n] - k_n\,\sigma_V^2 = 0
\Rightarrow k_n = \frac{\varepsilon^2[n]}{\sigma_V^2}


We have to estimate \varepsilon^2[n] at every value of n. How do we do it?
Consider
\varepsilon^2[n] = E\,x[n]\,e[n]
        = E\,x[n]\left(x[n] - (1-k_n)\,a\,\hat{x}[n-1] - k_n\,y[n]\right)
        = \sigma_X^2 - (1-k_n)\,a\,E\,x[n]\,\hat{x}[n-1] - k_n\,E\,x[n]\,y[n]
        = (1-k_n)\,\sigma_X^2 - (1-k_n)\,a\,E\left(a\,x[n-1] + w[n]\right)\hat{x}[n-1]
        = (1-k_n)\,\sigma_X^2 - (1-k_n)\,a^2\,E\,x[n-1]\,\hat{x}[n-1]
Again
\varepsilon^2[n-1] = E\,x[n-1]\,e[n-1] = E\,x[n-1]\left(x[n-1] - \hat{x}[n-1]\right) = \sigma_X^2 - E\,x[n-1]\,\hat{x}[n-1]
Therefore,
E\,x[n-1]\,\hat{x}[n-1] = \sigma_X^2 - \varepsilon^2[n-1]

Hence
\varepsilon^2[n] = \frac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}\,\sigma_V^2

where we have substituted \sigma_W^2 = (1-a^2)\,\sigma_X^2.




We still have to find \varepsilon^2[0]. For this, assume x[-1] = \hat{x}[-1] = 0. Then from the relation
\varepsilon^2[n] = (1-k_n)\,\sigma_X^2 - (1-k_n)\,a^2\,E\,x[n-1]\,\hat{x}[n-1]
we get
\varepsilon^2[0] = (1-k_0)\,\sigma_X^2
Substituting k_0 = \dfrac{\varepsilon^2[0]}{\sigma_V^2} in the expression for \varepsilon^2[0] above, we get
\varepsilon^2[0] = \frac{\sigma_X^2\,\sigma_V^2}{\sigma_X^2 + \sigma_V^2}

8.4 The Scalar Kalman Filter Algorithm
Given: the signal-model parameters a and \sigma_W^2 and the observation-noise variance \sigma_V^2.

Initialisation: \hat{x}[-1] = 0

Step 1   n = 0. Calculate
      \varepsilon^2[0] = \dfrac{\sigma_X^2\,\sigma_V^2}{\sigma_X^2 + \sigma_V^2}
Step 2   Calculate the gain
      k_n = \dfrac{\varepsilon^2[n]}{\sigma_V^2}
Step 3   Input y[n] and estimate x[n]:
      Predict:  \hat{x}[n/n-1] = a\,\hat{x}[n-1]
      Correct:  \hat{x}[n] = \hat{x}[n/n-1] + k_n\left(y[n] - \hat{y}[n/n-1]\right)
      \therefore \hat{x}[n] = a\,\hat{x}[n-1] + k_n\left(y[n] - a\,\hat{x}[n-1]\right)
Step 4   n = n + 1. Calculate
      \varepsilon^2[n] = \dfrac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}\,\sigma_V^2
Step 5   Go to Step 2
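A minimal sketch of the scalar Kalman filter steps above (the function name and interface are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def scalar_kalman(y, a, sigma_w2, sigma_v2):
    """Scalar Kalman filter sketch following Steps 1-5 above."""
    sigma_x2 = sigma_w2 / (1.0 - a**2)                    # stationary signal variance
    eps2 = sigma_x2 * sigma_v2 / (sigma_x2 + sigma_v2)    # epsilon^2[0]
    x_hat = 0.0                                           # x_hat[-1] = 0
    out = np.zeros(len(y))
    for n, yn in enumerate(y):
        if n > 0:                                         # Step 4: update epsilon^2[n]
            eps2 = (sigma_w2 + a**2 * eps2) / (sigma_w2 + sigma_v2 + a**2 * eps2) * sigma_v2
        k = eps2 / sigma_v2                               # Step 2: Kalman gain
        x_hat = a * x_hat + k * (yn - a * x_hat)          # Step 3: predict and correct
        out[n] = x_hat
    return out
```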




   •      We have to initialize ε 2 [0].
   •      Irrespective of this initialization, k[ n] and ε [n] converge to final values.
   •      Considering a to be time varying, the filter can be used to estimate nonstationary
          signal.
Example: Given
x[n] = 0.6 x[n − 1] + w[n]              n≥0
y[n] = x[n] + v[n]                     n≥0
\sigma_W^2 = 0.25, \quad \sigma_V^2 = 0.5

Find the expression for the Kalman filter equations at convergence and the corresponding mean-square error.
Using
\varepsilon^2[n] = \frac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}\,\sigma_V^2
at convergence (\varepsilon^2[n] = \varepsilon^2[n-1] = \varepsilon^2) we get
\varepsilon^2 = \frac{0.25 + 0.36\,\varepsilon^2}{0.25 + 0.5 + 0.36\,\varepsilon^2}\times 0.5
Solving the resulting quadratic and taking the positive root,
\varepsilon^2 \approx 0.195
k_n = \varepsilon^2/\sigma_V^2 \approx 0.390
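A quick numerical check of the steady-state values above — a sketch that simply iterates the recursion to convergence:

```python
a, sw2, sv2 = 0.6, 0.25, 0.5
sx2 = sw2 / (1 - a**2)                         # stationary signal variance
eps2 = sx2 * sv2 / (sx2 + sv2)                 # epsilon^2[0]
for _ in range(100):                           # iterate to steady state
    eps2 = (sw2 + a**2 * eps2) / (sw2 + sv2 + a**2 * eps2) * sv2
print(round(eps2, 3), round(eps2 / sv2, 3))    # approximately 0.195 and 0.390
```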




8.5 Vector Kalman Filter
Consider the time-varying state-space model representing the nonstationary signal.
The state equation is
x[n] = A[ n]x[n - 1] + w[n]

where w[n] is the 0-mean Gaussian noise vector with covariance matrix QW .

The observed data y[n] is related to the states by
y[n] = c[n]\,\mathbf{x}[n] + v[n]
where v[n] is the zero-mean Gaussian noise vector with covariance matrix Q_V.

Generalising the scalar Kalman recursion above, we have
\hat{\mathbf{x}}[n] = A[n]\,\hat{\mathbf{x}}[n-1] + k_n\left(y[n] - c[n]\,A[n]\,\hat{\mathbf{x}}[n-1]\right)

Denote \hat{\mathbf{x}}[n/n] = \hat{\mathbf{x}}[n] = the best linear estimate of \mathbf{x}[n] given y[0], y[1], ..., y[n]
and \hat{\mathbf{x}}[n/n-1] = the best linear estimate of \mathbf{x}[n] given y[0], y[1], ..., y[n-1].
The corresponding state estimation errors are:
e[n/n] = \mathbf{x}[n] - \hat{\mathbf{x}}[n/n]
and
e[n/n-1] = \mathbf{x}[n] - \hat{\mathbf{x}}[n/n-1]
The mean-square estimation error in scalar Kalman filter will now be replaced by the
error covariance matrix denoted by P.
The a priori estimate of the error covariance matrix is then
P[n/n − 1] = Ee[n/n − 1]e′[n/n − 1]

and the a posteriori estimate error covariance is
P[n/n] = Ee[n/n]e′[n/n]


With these definitions and notations the vector Kalman filter algorithm is as follows:


8.6 The Vector Kalman filter algorithm
State equation: x[n] = A[ n]x[n - 1] + w[n]
Observation equation: y[ n] = c[n]x[ n] + v[ n]
Given (a) the state matrix A[n], n = 0, 1, 2, ... and the process-noise covariance matrix Q_W
      (b) the observation parameter matrix c[n], n = 0, 1, 2, ... and the observation-noise covariance matrix Q_V
      (c) the observed data y[n], n = 0, 1, 2, ...
Initialization:
      \hat{\mathbf{x}}[-1/-1] = \hat{\mathbf{x}}[-1] = 0
      P[-1/-1] = E\,\mathbf{x}[-1]\,\mathbf{x}'[-1]
Estimation:
      For n = 0, 1, 2, ... do
      State prediction
      \hat{\mathbf{x}}[n/n-1] = A[n]\,\hat{\mathbf{x}}[n-1]
      A priori error covariance matrix
      P[n/n-1] = A[n]\,P[n-1/n-1]\,A'[n] + Q_W
      Kalman gain
      k_n = P[n/n-1]\,c'[n]\left(c[n]\,P[n/n-1]\,c'[n] + Q_V\right)^{-1}
      State update
      \hat{\mathbf{x}}[n/n] = \hat{\mathbf{x}}[n/n-1] + k_n\left(y[n] - c[n]\,\hat{\mathbf{x}}[n/n-1]\right)
      A posteriori error covariance matrix
      P[n/n] = \left(I - k_n\,c[n]\right)P[n/n-1]
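A minimal sketch of the vector Kalman filter recursion above for a scalar observation y[n] = c x[n] + v[n] (the function name and interface are illustrative assumptions):

```python
import numpy as np

def vector_kalman(y, A, c, Qw, Qv, x0_cov):
    """Vector Kalman filter sketch: A is the state matrix, c the observation
    row vector (1-D array), Qw the process-noise covariance, Qv the scalar
    observation-noise variance, x0_cov the initial error covariance."""
    M = A.shape[0]
    x_hat = np.zeros(M)                                 # x_hat[-1] = 0
    P = x0_cov.copy()                                   # P[-1/-1]
    out = []
    for yn in y:
        x_pred = A @ x_hat                              # state prediction
        P_pred = A @ P @ A.T + Qw                       # a priori covariance
        k = P_pred @ c / (c @ P_pred @ c + Qv)          # Kalman gain (scalar obs.)
        x_hat = x_pred + k * (yn - c @ x_pred)          # state update
        P = (np.eye(M) - np.outer(k, c)) @ P_pred       # a posteriori covariance
        out.append(x_hat.copy())
    return np.array(out)
```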
CHAPTER -9: SPECTRAL ESTIMATION TECHNIQUES
                              OF STATIONARY SIGNALS


9.1 Introduction
The aim of spectral analysis is to determine the spectral content of a random process from
a finite set of observed data.
   Spectral analysis is a very old problem: it started with the Fourier series (1807), introduced to solve the wave equation.
   Sturm generalized it to arbitrary functions (1837).
   Schuster devised the periodogram (1897) to determine the frequency content numerically.


Consider the definition of the power spectrum of a random sequence \{x[n], -\infty < n < \infty\}:
S_{XX}(w) = \sum_{m=-\infty}^{\infty} R_{XX}[m]\,e^{-jwm}
where R_{XX}[m] is the autocorrelation function.
           The power spectral density is the discrete-time Fourier transform of the autocorrelation sequence.
           The definition involves an infinite autocorrelation sequence.
           But we have to use only finite data. This is not only because of our inability to handle infinite data, but also because the assumption of stationarity is valid only for a short duration. For example, a speech signal is stationary for about 20 to 80 ms.
Spectral analysis is a preliminary tool – it indicates that a particular frequency component may be present in the observed signal. The final decision has to be made by hypothesis testing.

Spectral analysis methods may be broadly classified as parametric and non-parametric. In the parametric method, the random sequence is modeled by a time-series model, the model parameters are estimated from the given data, and the spectrum is found by substituting the estimated parameters in the model spectrum.
We will first discuss the non-parametric techniques            for spectral estimation. These
techniques are based on the Fourier transform of the sample autocorrelation function
which is an estimator for the true autocorrelation function.
9.2 Sample Autocorrelation Functions
Two estimators of the autocorrelation function exist
\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-m} x[n]\,x[n+m]                (biased)

\hat{R}'_{XX}[m] = \frac{1}{N-m}\sum_{n=0}^{N-1-m} x[n]\,x[n+m]                (unbiased)
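A minimal sketch of the two sample-autocorrelation estimators (the helper name is an illustrative assumption):

```python
import numpy as np

def sample_autocorr(x, max_lag, biased=True):
    """Biased (divide by N) or unbiased (divide by N-m) sample autocorrelation."""
    N = len(x)
    r = np.zeros(max_lag + 1)
    for m in range(max_lag + 1):
        s = np.sum(x[:N - m] * x[m:])            # sum over n of x[n] x[n+m]
        r[m] = s / N if biased else s / (N - m)
    return r
```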

Note that
E\,\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-m} E\,x[n]\,x[n+m]
          = \frac{N-m}{N}\,R_{XX}[m]
          = R_{XX}[m] - \frac{m}{N}\,R_{XX}[m]
          = \left(\frac{N-m}{N}\right)R_{XX}[m]

Hence \hat{R}_{XX}[m] is a biased estimator of R_{XX}[m]. Had we divided the sum by N-m instead of N, the corresponding estimator would have been unbiased. Therefore \hat{R}'_{XX}[m] is an unbiased estimator.




\hat{R}_{XX}[m] is an asymptotically unbiased estimator: as N \to \infty, the bias of \hat{R}_{XX}[m] tends to 0. The variance of \hat{R}_{XX}[m] is very difficult to determine exactly, because it involves fourth-order moments. An approximate expression for the covariance of \hat{R}_{XX}[m] is given by Jenkins and Watt (1968) as

\mathrm{Cov}\left(\hat{R}_{XX}[m_1], \hat{R}_{XX}[m_2]\right) \cong \frac{1}{N}\sum_{n=-\infty}^{\infty}\left(R_{XX}[n]\,R_{XX}[n+m_2-m_1] + R_{XX}[n-m_1]\,R_{XX}[n+m_2]\right)

This means that the estimated autocorrelation values are highly correlated.
The variance of \hat{R}_{XX}[m] is obtained from the above as

\mathrm{var}\left(\hat{R}_{XX}[m]\right) \cong \frac{1}{N}\sum_{n=-\infty}^{\infty}\left(R_{XX}^2[n] + R_{XX}[n-m]\,R_{XX}[n+m]\right)

Note that the variance of \hat{R}_{XX}[m] is large for large lag m, especially as m approaches N.

Also, as N \to \infty, \mathrm{var}\left(\hat{R}_{XX}[m]\right) \to 0 provided \sum_{n=-\infty}^{\infty} R_{XX}^2[n] < \infty.

Thus, although the sample autocorrelation function is a consistent estimator, its Fourier transform is not, and here lies the problem of spectral estimation.
Though \hat{R}'_{XX}[m] is an unbiased and consistent estimator of R_{XX}[m], it is not used for spectral estimation because \hat{R}'_{XX}[m] does not satisfy the non-negative definiteness criterion.

9.3 Periodogram (Schuster, 1898)
\hat{S}_{XX}^{p}(w) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\,e^{-jwn}\right|^2, \quad -\pi \le w \le \pi

\hat{S}_{XX}^{p}(w) gives the power output of a bank of band-pass filters with impulse responses
h_i[n] = \frac{1}{N}\,e^{-jw_i n}\,\mathrm{rect}\!\left(\frac{n}{N}\right)
Each h_i[n] is a very poor band-pass filter.
Also
\hat{S}_{XX}^{p}(w) = \sum_{m=-(N-1)}^{N-1} \hat{R}_{XX}[m]\,e^{-jwm}
where \hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-|m|} x[n]\,x[n+m]

The periodogram is the Fourier transform of the sample autocorrelation function.


To establish the above relation, consider
x_N[n] = x[n] for 0 \le n \le N-1
        = 0 otherwise
\therefore \hat{R}_{XX}[m] = \frac{x_N[m] * x_N[-m]}{N}
\hat{S}_{XX}^{p}(w) = \frac{1}{N}\left(\sum_{n=0}^{N-1} x[n]\,e^{-jwn}\right)\left(\sum_{n=0}^{N-1} x[n]\,e^{jwn}\right) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\,e^{-jwn}\right|^2

So \hat{S}_{XX}^{p}(w) = \sum_{m=-(N-1)}^{N-1}\hat{R}_{XX}[m]\,e^{-jwm} and

E\,\hat{S}_{XX}^{p}(w) = \sum_{m=-(N-1)}^{N-1} E\left[\hat{R}_{XX}[m]\right]e^{-jwm} = \sum_{m=-(N-1)}^{N-1}\left(1-\frac{|m|}{N}\right)R_{XX}[m]\,e^{-jwm}

As N \to \infty the right-hand side approaches the true power spectral density S_{XX}(w). Thus the periodogram is an asymptotically unbiased estimator of the power spectral density.
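A minimal sketch of the periodogram evaluated at the DFT frequencies, together with an assumed white-noise experiment illustrating that its mean is close to the true spectrum while its variance does not shrink as N grows:

```python
import numpy as np

def periodogram(x):
    """Periodogram sketch: (1/N)|sum_n x[n] e^{-jwn}|^2 at w_k = 2*pi*k/N."""
    N = len(x)
    return np.abs(np.fft.fft(x))**2 / N

# Assumed experiment with zero-mean, unit-variance white Gaussian noise
x = np.random.randn(4096)
Pxx = periodogram(x)
print(Pxx.mean(), Pxx.var())   # mean ~ sigma_x^2 = 1; variance stays of order 1
```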
Proving the consistency of the periodogram is a difficult problem. We consider the simple case of a zero-mean Gaussian white-noise sequence in the following example.



Example 1:
The periodogram of a zero-mean white Gaussian sequence x[n], n = 0, \ldots, N-1.
The power spectral density is given by
S_{XX}(w) = \frac{\sigma_X^2}{2\pi}, \quad -\pi < w \le \pi
The periodogram is given by
\hat{S}_{XX}^{p}(w) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\,e^{-jwn}\right|^2
Let us examine the periodogram only at the DFT frequencies w_k = \frac{2\pi k}{N},\ k = 0,1,\ldots,N-1:
\hat{S}_{XX}^{p}(k) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\,e^{-j\frac{2\pi}{N}kn}\right|^2 = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\,e^{-jw_k n}\right|^2
        = \left(\sum_{n=0}^{N-1}\frac{x[n]\cos w_k n}{\sqrt{N}}\right)^2 + \left(\sum_{n=0}^{N-1}\frac{x[n]\sin w_k n}{\sqrt{N}}\right)^2
        = C_X^2(w_k) + S_X^2(w_k)
where C_X(w_k) and S_X(w_k) are the cosine and sine parts of \hat{S}_{XX}^{p}(w).

Let us consider
C_X(w_k) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\cos w_k n
which is a linear combination of samples of a Gaussian process.
Clearly E\,C_X(w_k) = 0, and
\mathrm{var}\,C_X(w_k) = \frac{1}{N}\,E\left(\sum_{n=0}^{N-1} x[n]\cos w_k n\right)^2
        = \frac{1}{N}\sum_{n=0}^{N-1} E\,x^2[n]\cos^2 w_k n + E(\text{cross terms})
Assume the sequence x[n ] to be independent.
Therefore,

\operatorname{var}\, C_X(w_k) = \frac{\sigma_X^2}{N}\sum_{n=0}^{N-1}\cos^2 w_k n
                              = \frac{\sigma_X^2}{N}\sum_{n=0}^{N-1}\frac{1+\cos 2w_k n}{2}
                              = \frac{\sigma_X^2}{N}\left(\frac{N}{2} + \cos\big((N-1)w_k\big)\frac{\sin Nw_k}{2\sin w_k}\right)
                              = \sigma_X^2\left(\frac{1}{2} + \cos\big((N-1)w_k\big)\frac{\sin Nw_k}{2N\sin w_k}\right)
For w_k = \frac{2\pi k}{N},

\frac{\sin Nw_k}{2\sin w_k} = 0 \quad \text{for } k \ne 0,\; k \ne \frac{N}{2} \text{ (assuming } N \text{ even)}.

\therefore \operatorname{var}\, C_X(w_k) = \frac{\sigma_X^2}{2} \quad \text{for } k \ne 0,\; \frac{N}{2}

Again, \frac{\sin Nw_k}{N\sin w_k} = 1 for k = 0.

\therefore \operatorname{var}\, C_X(w_k) = \sigma_X^2 \quad \text{for } k = 0,\; k = \frac{N}{2}
Similarly, considering the sine part,

S_X(w_k) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\sin w_k n

E\,S_X(w_k) = 0 \quad \because x[n] \text{ is zero mean}

\operatorname{var}\, S_X(w_k) = 0 \quad \text{for } k = 0 \text{ (the dc term has no sine component)}
                              = \frac{\sigma_X^2}{2} \quad \text{for } k \ne 0.
Therefore, for k = 0,
     C_X[k] \sim N(0, \sigma_X^2)
     S_X[k] = 0.
For k \ne 0,
     C_X[k] \sim N\!\left(0, \frac{\sigma_X^2}{2}\right)
     S_X[k] \sim N\!\left(0, \frac{\sigma_X^2}{2}\right).
We can also show that \operatorname{Cov}(C_X(w_k), S_X(w_k)) = 0. Being jointly Gaussian and uncorrelated,
C_X(w_k) and S_X(w_k) are independent Gaussian random variables.
The periodogram

\hat{S}^p_{XX}[k] = C_X^2[k] + S_X^2[k]

therefore has a chi-square distribution.
9.4 Chi square distribution
Let X_1, X_2, \ldots, X_N be independent zero-mean Gaussian random variables, each with variance \sigma_X^2. Then Y = X_1^2 + X_2^2 + \cdots + X_N^2 has a chi-square (\chi_N^2) distribution with N degrees of freedom, with mean

EY = E(X_1^2 + X_2^2 + \cdots + X_N^2) = N\sigma_X^2

and variance 2N\sigma_X^4.
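As a quick sanity check of these moments (an illustrative simulation only; N, \sigma and the number of trials are arbitrary choices), one can verify EY = N\sigma_X^2 and \operatorname{var} Y = 2N\sigma_X^4 numerically:

import numpy as np

# Monte Carlo check: Y = X_1^2 + ... + X_N^2 with X_i ~ N(0, sigma^2)
rng = np.random.default_rng(1)
N, sigma, trials = 8, 2.0, 200_000
X = sigma * rng.standard_normal((trials, N))
Y = (X ** 2).sum(axis=1)
print(Y.mean(), N * sigma**2)          # both close to 32
print(Y.var(), 2 * N * sigma**4)       # both close to 256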



For k \ne 0,

\hat{S}^p_{XX}[k] = C_X^2[k] + S_X^2[k]

has a \chi_2^2 distribution (two degrees of freedom).

E\hat{S}^p_{XX}[k] = \frac{\sigma_X^2}{2} + \frac{\sigma_X^2}{2} = \sigma_X^2 = S_{XX}[k]

\Rightarrow \hat{S}^p_{XX}[k] is unbiased.

\operatorname{var}(\hat{S}^p_{XX}[k]) = 2 \times 2\left(\frac{\sigma_X^2}{2}\right)^2 = \sigma_X^4 = S_{XX}^2[k], \quad \text{which is independent of } N.

\hat{S}^p_{XX}[0] = C_X^2[0] has a \chi_1^2 distribution (one degree of freedom).

E\hat{S}^p_{XX}[0] = \sigma_X^2

\Rightarrow \hat{S}^p_{XX}[0] is unbiased,

and \operatorname{var}(\hat{S}^p_{XX}[0]) = 2\sigma_X^4 = 2S_{XX}^2[0].


It can be shown that, for the independent Gaussian white noise sequence, at any frequency w

\operatorname{var}(\hat{S}^p_{XX}(w)) = S_{XX}^2(w)

Figure: Periodogram \hat{S}^p_{XX}(w) of a white noise sequence, plotted for -\pi \le w \le \pi.
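A numerical illustration of this point (not from the original notes; the record lengths, trial counts and seed are arbitrary choices): for simulated white Gaussian noise, the periodogram value at an interior frequency has mean close to S_{XX}(w) and variance close to S_{XX}^2(w), and the variance does not shrink as N grows.

import numpy as np

rng = np.random.default_rng(2)
trials = 2000
for N in (64, 256, 1024):
    k = N // 4                                      # a frequency away from 0 and pi
    X = rng.standard_normal((trials, N))            # white noise, sigma_x^2 = 1
    P = np.abs(np.fft.fft(X, axis=1)[:, k]) ** 2 / N
    print(N, P.mean(), P.var())                     # mean ~ 1, variance ~ 1, for every N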
For the general case,

\hat{S}^p_{XX}(w) = \sum_{m=-(N-1)}^{N-1} \hat{R}_{XX}[m]\, e^{-jwm}

where \hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-|m|} x[n]x[n+m] is the biased estimator of the autocorrelation function,

         = \sum_{m=-(N-1)}^{N-1} \left(1 - \frac{|m|}{N}\right)\hat{R}'_{XX}[m]\, e^{-jwm}


         = \sum_{m=-(N-1)}^{N-1} w_B[m]\,\hat{R}'_{XX}[m]\, e^{-jwm}

where \hat{R}'_{XX}[m] is the unbiased estimator of the autocorrelation and w_B[m] = 1 - \frac{|m|}{N} is the Bartlett window.
(The Fourier transform of a product of two sequences is the convolution of their Fourier transforms.) So,

\hat{S}^p_{XX}(w) = W_B(w) \ast FT\{\hat{R}'_{XX}[m]\}

E\hat{S}^p_{XX}(w) = W_B(w) \ast FT\{E\hat{R}'_{XX}[m]\}
                   = W_B(w) \ast S_{XX}(w)
                   = \int_{-\pi}^{\pi} W_B(w-\xi)\, S_{XX}(\xi)\, d\xi

As N \to \infty, \; E\hat{S}^p_{XX}(w) \to S_{XX}(w).

Now \operatorname{var}(\hat{S}^p_{XX}(w)) cannot, in general, be evaluated exactly in closed form. An approximate
expression for the covariance is given by


\operatorname{Cov}\!\left(\hat{S}^p_{XX}(w_1), \hat{S}^p_{XX}(w_2)\right) \sim S_{XX}(w_1)\, S_{XX}(w_2)\left[\left(\frac{\sin\frac{N(w_1+w_2)}{2}}{N\sin\frac{w_1+w_2}{2}}\right)^{2} + \left(\frac{\sin\frac{N(w_1-w_2)}{2}}{N\sin\frac{w_1-w_2}{2}}\right)^{2}\right]
\operatorname{var}\!\left(\hat{S}^p_{XX}(w)\right) \sim S_{XX}^2(w)\left[1 + \left(\frac{\sin Nw}{N\sin w}\right)^{2}\right]

\therefore \operatorname{var}\!\left(\hat{S}^p_{XX}(w)\right) \sim 2S_{XX}^2(w) \quad \text{for } w = 0, \pi
                                              \cong S_{XX}^2(w) \quad \text{for } 0 < w < \pi
Consider w_1 = 2\pi\frac{k_1}{N} and w_2 = 2\pi\frac{k_2}{N}, with k_1, k_2 integers. Then

\operatorname{Cov}\!\left(\hat{S}^p_{XX}(w_1), \hat{S}^p_{XX}(w_2)\right) \cong 0
This means that there is no correlation between two neighbouring spectral estimates. Therefore the
periodogram is not a reliable estimator of the power spectrum, for the following two reasons:

    (1) The periodogram is not a consistent estimator, in the sense that \operatorname{var}(\hat{S}^p_{XX}(w)) does
        not tend to zero as the data length approaches infinity.
    (2) For two distinct frequencies, the covariance of the periodogram values decreases as the data length
        increases. Thus the periodogram is erratic and fluctuates widely.
Our aim will be to look for a spectral estimator with as low a variance as possible, without
increasing the bias much.

9.5 Modified Periodograms
Data windowing
Multiply the data by a more suitable window than the rectangular one before computing the
periodogram. The resulting periodogram will have less bias (see the sketch below).
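A minimal sketch of such a modified (windowed) periodogram, assuming a Hann window as the example window and using the power normalization U = \frac{1}{N}\sum w^2[n] that also appears later in the Welch method; the function name is illustrative.

import numpy as np

def modified_periodogram(x, window=None):
    """Periodogram of windowed data; U = mean(w^2) keeps the scaling comparable
    to the ordinary periodogram (see the Welch method later in these notes)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    w = np.hanning(N) if window is None else np.asarray(window, dtype=float)
    U = np.mean(w ** 2)                       # window power normalization
    X = np.fft.fft(x * w)
    return np.abs(X) ** 2 / (U * N)

rng = np.random.default_rng(3)
x = rng.standard_normal(512)
S_mod = modified_periodogram(x)               # lower side-lobe leakage than the rectangular window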

9.5.1 Averaged Periodogram: The Bartlett Method
The motivation for this method comes from the observation that

\lim_{N\to\infty} E\hat{S}^p_{XX}(w) = S_{XX}(w).

We have to modify \hat{S}^p_{XX}(w) to get a consistent estimator of S_{XX}(w).

Given the data x[n], n = 0, 1, \ldots, N-1:
Divide the data into K non-overlapping segments, each of length L.
Determine the periodogram of each segment:

\hat{S}^{(k)}_{XX}(w) = \frac{1}{L}\left|\sum_{n=0}^{L-1} x^{(k)}[n]\, e^{-jwn}\right|^2

where x^{(k)}[n] = x[n + kL] is the data of the kth segment.

Then

\hat{S}^{(av)}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1} \hat{S}^{(k)}_{XX}(w).
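A minimal NumPy sketch of this averaging procedure, assuming the segment length L = N // K (any leftover samples are discarded); the function name bartlett_psd is illustrative.

import numpy as np

def bartlett_psd(x, K):
    """Average the periodograms of K non-overlapping segments of length L = N // K."""
    x = np.asarray(x, dtype=float)
    L = len(x) // K
    segs = x[:K * L].reshape(K, L)                    # K segments, each of length L
    pgrams = np.abs(np.fft.fft(segs, axis=1)) ** 2 / L
    return pgrams.mean(axis=0)                        # averaged estimate at w_k = 2*pi*k/L

rng = np.random.default_rng(4)
x = rng.standard_normal(4096)
print(bartlett_psd(x, K=1).var(), bartlett_psd(x, K=16).var())  # variance drops roughly as 1/K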
As shown earlier,

E\hat{S}^{(k)}_{XX}(w) = \int_{-\pi}^{\pi} W_B(w-\xi)\, S_{XX}(\xi)\, d\xi

where

w_B[m] = \begin{cases} 1 - \frac{|m|}{L}, & |m| \le L-1 \\ 0, & \text{otherwise} \end{cases}

W_B(w) = \frac{1}{L}\left(\frac{\sin\frac{wL}{2}}{\sin\frac{w}{2}}\right)^{2}

and

\hat{S}^{(k)}_{XX}(w) = \sum_{m=-(L-1)}^{L-1} \hat{R}_{XX}[m]\, e^{-jwm}


E\hat{S}^{av}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1} E\hat{S}^{(k)}_{XX}(w) = E\hat{S}^{(k)}_{XX}(w)

E\hat{S}^{av}_{XX}(w) = \int_{-\pi}^{\pi} W_B(w-\xi)\, S_{XX}(\xi)\, d\xi.

To find the mean of the averaged periodogram, the true spectrum is now convolved with the frequency
response W_B(w) of the Bartlett window. Reducing the length of the data from N points to L = N/K
results in a window whose spectral width is increased by a factor of K; consequently, the frequency
resolution is reduced by a factor of K.

Figure: Effect of reducing the window size on W_B(w) (original vs. modified window).



9.5.2 Variance of the averaged periodogram
\operatorname{var}(\hat{S}^{av}_{XX}(w)) is not simple to evaluate, as fourth-order moments are involved.

Simplification :
Assume the K data segments are independent.
Then

\operatorname{var}(\hat{S}^{av}_{XX}(w)) = \operatorname{var}\left(\frac{1}{K}\sum_{k=0}^{K-1} \hat{S}^{(k)}_{XX}(w)\right)
        = \frac{1}{K^2}\sum_{k=0}^{K-1} \operatorname{var}\, \hat{S}^{(k)}_{XX}(w)
        \sim \frac{1}{K^2} \times K \left(1 + \left(\frac{\sin wL}{L\sin w}\right)^{2}\right) S_{XX}^2(w)
        \sim \frac{1}{K} \times \text{(variance of the original periodogram)}.
So in practice the variance will be reduced by a factor somewhat less than K, because the data
segments will not be truly independent.

For large L and large K, \operatorname{var}(\hat{S}^{av}_{XX}(w)) tends to zero and \hat{S}^{av}_{XX}(w) is a consistent
estimator.


9.5.3 The Welch Method (Averaging Modified Periodograms)
(1)        Divide the data into overlapping segments (overlapping by about 50 to 75%).
(2)        Window the data so that the modified periodogram of each segment is
\hat{S}^{(mod)}_{XX}(w) = \frac{1}{UL}\left|\sum_{n=0}^{L-1} x[n]\, w[n]\, e^{-jwn}\right|^2

where

U = \frac{1}{L}\sum_{n=0}^{L-1} w^2[n]

is the window power normalization factor.
The window w[n] need not be an even function; it is used to control spectral leakage.

\sum_{n=0}^{L-1} x[n]w[n]e^{-jwn} is the DTFT of the windowed data x[n]w[n], where w[n] is zero outside
n = 0, \ldots, L-1 (for the rectangular window, w[n] = 1 on this range and the ordinary periodogram is recovered).
(3) Compute

\hat{S}^{(Welch)}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1} \hat{S}^{(mod,k)}_{XX}(w),

the average of the K modified periodograms (a code sketch is given below).
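A minimal sketch of the Welch procedure described above, assuming a Hann window and 50% overlap as default choices; the function name and parameter values are illustrative. In practice, library routines such as scipy.signal.welch implement the same idea.

import numpy as np

def welch_psd(x, L, overlap=0.5, window=None):
    """Welch estimate: average modified periodograms of overlapping, windowed segments."""
    x = np.asarray(x, dtype=float)
    w = np.hanning(L) if window is None else np.asarray(window, dtype=float)
    U = np.mean(w ** 2)                               # window power normalization
    step = max(1, int(L * (1.0 - overlap)))
    starts = range(0, len(x) - L + 1, step)
    pgrams = [np.abs(np.fft.fft(x[s:s + L] * w)) ** 2 / (U * L) for s in starts]
    return np.mean(pgrams, axis=0)

rng = np.random.default_rng(5)
x = rng.standard_normal(8192)
S_welch = welch_psd(x, L=256, overlap=0.5)            # much smoother than the raw periodogram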




9.6 Smoothing the periodogram : The Blackman and Tukey Method
       •   A widely used non-parametric method.
       •   The biased autocorrelation sequence estimate \hat{R}_{XX}[m] is used.
       •   Estimation of the autocorrelation at large lags involves fewer data points (N - m points for
           lag m), and so is more prone to error; we therefore give less weight to the higher-lag
           autocorrelation estimates.
       \hat{R}_{XX}[m] is multiplied by a window sequence w[m] which underweights the autocorrelation
       function at large lags. The window function w[m] has the following properties:
                0 \le w[m] \le 1
                w[0] = 1
                w[-m] = w[m]
                w[m] = 0 \text{ for } |m| > M.
w[0] = 1 is a consequence of the fact that the smoothed periodogram should not modify a smooth
spectrum, and so \int_{-\pi}^{\pi} W(w)\, dw = 1.


The smoothed periodogram is given by

\hat{S}^{BT}_{XX}(w) = \sum_{m=-(M-1)}^{M-1} w[m]\,\hat{R}_{XX}[m]\, e^{-jwm}


Issues involved:
1. How to select w[m]? There are a large number of windows available. Use a window with small
   side-lobes; this will reduce bias and improve resolution.
2. How to select M? Normally M \sim N/5 or M \sim 2\sqrt{N}, mostly based on experience, for data
   lengths N of the order of 1000 to 10,000.
\hat{S}^{BT}_{XX}(w) is the convolution of \hat{S}^p_{XX}(w) and W(w), the Fourier transform of the window sequence;
that is, it smooths the periodogram, decreasing the variance of the estimate at the expense of
reduced resolution.
E(\hat{S}^{BT}_{XX}(w)) = E\hat{S}^p_{XX}(w) \ast W(w)

where E\hat{S}^p_{XX}(w) = \int_{-\pi}^{\pi} S_{XX}(\theta)\, W_B(w-\theta)\, d\theta

or, from the time domain,

E\hat{S}^{BT}_{XX}(w) = E\sum_{m=-(M-1)}^{M-1} w[m]\hat{R}_{XX}[m]\, e^{-jwm}
                      = \sum_{m=-(M-1)}^{M-1} w[m]\, E\hat{R}_{XX}[m]\, e^{-jwm}

\hat{S}^{BT}_{XX}(w) can be shown to be asymptotically unbiased.

The variance of \hat{S}^{BT}_{XX}(w) is approximately

\operatorname{var}(\hat{S}^{BT}_{XX}(w)) \sim \frac{S_{XX}^2(w)}{N}\sum_{k=-(M-1)}^{M-1} w^2[k]
Some of the popular windows are the rectangular window, Bartlett window, Hamming window,
Hanning window, etc.
Procedure
1. Given the data x[n], n = 0,1.., N − 1
2. Find the periodogram at the DFT frequencies:

\hat{S}^p_{XX}(2\pi k/N) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}\right|^2

3. By IDFT, find the autocorrelation sequence.
4. Multiply by a proper lag window and take the FFT (see the sketch below).
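A minimal sketch of the Blackman-Tukey procedure above, computing the lag-domain form directly rather than going through the IDFT of the periodogram, and assuming a Bartlett lag window as the example window; the function name and the nfft grid are illustrative.

import numpy as np

def blackman_tukey_psd(x, M, nfft=1024):
    """Window the biased autocorrelation estimate up to lag M-1, then Fourier transform it."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Biased autocorrelation estimate R_hat[m] = (1/N) sum_n x[n] x[n+m], m = 0, ..., M-1
    r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(M)])
    lag_win = 1.0 - np.arange(M) / M               # w[m] = 1 - m/M (Bartlett lag window)
    r_w = r * lag_win
    # S_BT(w) = r_w[0] + 2 * sum_{m>=1} r_w[m] cos(w m), since R_hat is real and even
    w_grid = 2 * np.pi * np.arange(nfft) / nfft
    S = r_w[0] + 2.0 * sum(r_w[m] * np.cos(w_grid * m) for m in range(1, M))
    return S

rng = np.random.default_rng(6)
x = rng.standard_normal(2048)
S_bt = blackman_tukey_psd(x, M=64)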



9.7 Parametric Method
Disadvantages of classical spectral estimators, such as the Blackman-Tukey method using the
windowed autocorrelation function:


- Do not use a priori information about the spectral shape
- Do not make realistic assumptions about x[n] for n < 0 and n ≥ N .


Normally \hat{R}_X[m], m = 0, \pm 1, \ldots, \pm(N-1), can be estimated from the N sample values. From
these autocorrelations we can estimate a model for the signal, basically a time-series model. Once
the model is available, it is equivalent to knowing the autocorrelation at all lags, and hence it
gives better spectral estimation (better resolution). A stationary signal can be represented by an
ARMA(p, q) model:


x[n] = \sum_{i=1}^{p} a_i x[n-i] + \sum_{i=0}^{q} b_i v[n-i]



Figure: White noise v[n] driving a filter H(z) produces the signal x[n].

H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{i=0}^{q} b_i z^{-i}}{1 - \sum_{i=1}^{p} a_i z^{-i}}

and

S_{XX}(w) = |H(w)|^2 \sigma_V^2
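A small sketch that evaluates this ARMA power spectrum on a frequency grid for given parameters; the coefficient values in the example are arbitrary.

import numpy as np

def arma_psd(a, b, sigma_v2, nfft=1024):
    """Evaluate S_XX(w) = |B(e^{jw})/A(e^{jw})|^2 * sigma_v^2, with
    A(z) = 1 - sum_i a_i z^{-i} and B(z) = sum_i b_i z^{-i}."""
    w = 2 * np.pi * np.arange(nfft) / nfft
    A = 1.0 - np.exp(-1j * np.outer(w, np.arange(1, len(a) + 1))) @ np.asarray(a, dtype=float)
    B = np.exp(-1j * np.outer(w, np.arange(len(b)))) @ np.asarray(b, dtype=float)
    return sigma_v2 * np.abs(B) ** 2 / np.abs(A) ** 2

# Example: ARMA(1,1) with a_1 = 0.5, b_0 = 1, b_1 = 0.4 and unit-variance driving noise
S = arma_psd(a=[0.5], b=[1.0, 0.4], sigma_v2=1.0)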
•   Model the signal as an AR/MA/ARMA process.
    •   Estimate the model parameters from the given data.
    •   Find the power spectrum by substituting the values of the model parameters into the
        expression for the power spectrum of the model.

9.8 AR spectral estimation
 The AR(p) model is the most popular model-based technique. It is widely used for parametric
 spectral analysis because:
    •   it can approximate a continuous power spectrum arbitrarily well (though it might need a very
        large p to do so);
    •   efficient algorithms are available for model parameter estimation;
    •   if the process is Gaussian, the maximum-entropy spectral estimate is the AR(p) spectral
        estimate;
    •   many physical models suggest AR processes (e.g. speech);
    •   sinusoids can be expressed as AR-like models.


 The spectrum is given by

\hat{S}_{XX}(w) = \frac{\sigma_V^2}{\left|1 - \sum_{i=1}^{p} a_i e^{-jwi}\right|^2}

where the a_i are the AR(p) process parameters.

Figure: A typical AR spectrum S_{XX}(w) plotted against w.




9.9 The Autocorrelation method
The a_i are obtained by solving the Yule-Walker equations formed from the estimated autocorrelation
function:

\begin{bmatrix}
\hat{R}_{XX}[0] & \hat{R}_{XX}[1] & \cdots & \hat{R}_{XX}[p-1] \\
\hat{R}_{XX}[1] & \hat{R}_{XX}[0] & \cdots & \hat{R}_{XX}[p-2] \\
\vdots & & & \vdots \\
\hat{R}_{XX}[p-1] & \hat{R}_{XX}[p-2] & \cdots & \hat{R}_{XX}[0]
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} \hat{R}_{XX}[1] \\ \hat{R}_{XX}[2] \\ \vdots \\ \hat{R}_{XX}[p] \end{bmatrix}

\sigma_V^2 = \hat{R}_{XX}[0] - \sum_{i=1}^{p} a_i \hat{R}_{XX}[i]
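A minimal sketch of the autocorrelation (Yule-Walker) method, using the biased autocorrelation estimates and a plain linear solve instead of the Levinson-Durbin recursion; the synthetic AR(2) example and all parameter values are illustrative.

import numpy as np

def ar_autocorrelation_method(x, p, nfft=1024):
    """Estimate AR(p) parameters via the Yule-Walker equations and evaluate the AR spectrum."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(p + 1)])   # R_hat[0..p] (biased)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
    a = np.linalg.solve(R, r[1:p + 1])                                   # a_1, ..., a_p
    sigma_v2 = r[0] - a @ r[1:p + 1]                                     # driving-noise variance
    w = 2 * np.pi * np.arange(nfft) / nfft
    A = 1.0 - np.exp(-1j * np.outer(w, np.arange(1, p + 1))) @ a
    return a, sigma_v2, sigma_v2 / np.abs(A) ** 2

rng = np.random.default_rng(7)
# Synthetic AR(2) data: x[n] = 1.2 x[n-1] - 0.8 x[n-2] + v[n]
v = rng.standard_normal(5000)
x = np.zeros_like(v)
for n in range(2, len(v)):
    x[n] = 1.2 * x[n - 1] - 0.8 * x[n - 2] + v[n]
a_hat, s2, S_ar = ar_autocorrelation_method(x, p=2)
print(a_hat)   # should be close to [1.2, -0.8]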
9.10 The Covariance method
The problem of estimating the AR parameters may be posed as minimizing the sum-squared error

\varepsilon = \sum_{n=p}^{N-1} e^2[n]

with respect to the AR parameters, where

e[n] = x[n] - \sum_{i=1}^{p} a_i x[n-i]

This formulation results in

\begin{bmatrix}
\hat{R}_{XX}[1,1] & \hat{R}_{XX}[2,1] & \cdots & \hat{R}_{XX}[p,1] \\
\hat{R}_{XX}[1,2] & \hat{R}_{XX}[2,2] & \cdots & \hat{R}_{XX}[p,2] \\
\vdots & & & \vdots \\
\hat{R}_{XX}[1,p] & \hat{R}_{XX}[2,p] & \cdots & \hat{R}_{XX}[p,p]
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} \hat{R}_{XX}[0,1] \\ \hat{R}_{XX}[0,2] \\ \vdots \\ \hat{R}_{XX}[0,p] \end{bmatrix}

\sigma_V^2 = \hat{R}_{XX}[0,0] - \sum_{i=1}^{p} a_i \hat{R}_{XX}[0,i]
                         i =1

where

\hat{R}_{XX}[k,l] = \frac{1}{N-p}\sum_{n=p}^{N-1} x[n-k]\, x[n-l]

is an estimate of the autocorrelation function.
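A minimal sketch of the covariance method, solved here as the equivalent linear least-squares problem over n = p, ..., N-1 rather than by explicitly forming \hat{R}_{XX}[k,l]; the function name is illustrative.

import numpy as np

def ar_covariance_method(x, p):
    """Covariance-method AR estimate: minimize sum_{n=p}^{N-1} e^2[n],
    e[n] = x[n] - sum_i a_i x[n-i], as a linear least-squares problem."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.column_stack([x[p - i:N - i] for i in range(1, p + 1)])  # column i holds x[n-i]
    y = x[p:N]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma_v2 = np.mean((y - X @ a) ** 2)                            # prediction-error variance
    return a, sigma_v2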
Note that the autocorrelation matrix in the covariance method is not Toeplitz, so the equations
cannot be solved by efficient algorithms like the Levinson-Durbin recursion.
The flow of the AR spectral estimation procedure is as follows:
1. Given the data x[n], n = 0, 1, \ldots, N-1.
2. Estimate the autocorrelation values \hat{R}_{XX}[m], m = 0, 1, \ldots, p.
3. Select an order p.
4. Solve for a_i, i = 1, 2, \ldots, p, and \sigma_V^2.
5. Find

\hat{S}_{XX}(w) = \frac{\sigma_V^2}{\left|1 - \sum_{i=1}^{p} a_i e^{-jwi}\right|^2}



Some questions:
Can an AR(p) process model a band-pass signal?
       •   An AR(1) model can never model a band-pass process. If one sinusoid is present, an AR(2)
           process will be able to discern it: if there is a strong frequency component at w_0, an
           AR(2) process with poles at re^{\pm jw_0}, with r \to 1, will be able to pick out that
           frequency component (a numerical sketch follows).
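A numerical illustration of this point (the frequency, noise level, data length and seed below are arbitrary choices): fitting an AR(2) model to a noisy sinusoid gives poles close to e^{\pm jw_0}.

import numpy as np

rng = np.random.default_rng(8)
N, w0 = 2000, 0.9
x = np.cos(w0 * np.arange(N)) + 0.1 * rng.standard_normal(N)
r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(3)])   # R_hat[0], R_hat[1], R_hat[2]
R = np.array([[r[0], r[1]], [r[1], r[0]]])
a = np.linalg.solve(R, r[1:3])                                   # Yule-Walker for p = 2
poles = np.roots([1.0, -a[0], -a[1]])                            # roots of z^2 - a1 z - a2
print(np.abs(poles), np.angle(poles))                            # |poles| near 1, angles near +/- w0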


How do we select the order of the AR(p) process?
       •   The mean-square prediction error gives some guidance on whether the selected order is
           adequate. For spectral estimation, a criterion function of the order parameter p is
           minimized. For example:
- Minimize the Forward Prediction Error (FPE):

FPE(p) = \hat{\sigma}_P^2\, \frac{N + p + 1}{N - p - 1}

where N is the number of data samples and \hat{\sigma}_P^2 is the mean-square prediction error for order p.

Figure: FPE(p) plotted against the model order p.

•   Akaike Information Criterion (AIC)
      - Minimize

AIC(p) = \ln(\hat{\sigma}_v^2) + \frac{2p}{N}
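A minimal sketch of order selection using these criteria, computing FPE(p) and AIC(p) with the autocorrelation method for each candidate order; the function name and the choice of maximum order are illustrative.

import numpy as np

def order_selection(x, max_p):
    """Compute FPE(p) and AIC(p) for p = 1..max_p using the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    scores = []
    for p in range(1, max_p + 1):
        r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(p + 1)])
        R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
        a = np.linalg.solve(R, r[1:p + 1])
        s2 = r[0] - a @ r[1:p + 1]                 # prediction-error variance for order p
        fpe = s2 * (N + p + 1) / (N - p - 1)
        aic = np.log(s2) + 2 * p / N
        scores.append((p, fpe, aic))
    return scores

# e.g. choose the order minimizing AIC:
# best_p = min(order_selection(x, 20), key=lambda t: t[2])[0]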

Ssp notes

  • 1.
    SECTION – I REVIEW OF RANDOM VARIABLES & RANDOM PROCESS 1
  • 2.
    REVIEW OF RANDOMVARIABLES..............................................................................3 1.1 Introduction..............................................................................................................3 1.2 Discrete and Continuous Random Variables ........................................................4 1.3 Probability Distribution Function ..........................................................................4 1.4 Probability Density Function .................................................................................5 1.5 Joint random variable .............................................................................................6 1.6 Marginal density functions......................................................................................6 1.7 Conditional density function...................................................................................7 1.8 Baye’s Rule for mixed random variables...............................................................8 1.9 Independent Random Variable ..............................................................................9 1.10 Moments of Random Variables ..........................................................................10 1.11 Uncorrelated random variables..........................................................................11 1.12 Linear prediction of Y from X ..........................................................................11 1.13 Vector space Interpretation of Random Variables...........................................12 1.14 Linear Independence ...........................................................................................12 1.15 Statistical Independence......................................................................................12 1.16 Inner Product .......................................................................................................12 1.17 Schwary Inequality ..............................................................................................13 1.18 Orthogonal Random Variables...........................................................................13 1.19 Orthogonality Principle.......................................................................................14 1.20 Chebysev Inequality.............................................................................................15 1.21 Markov Inequality ...............................................................................................15 1.22 Convergence of a sequence of random variables ..............................................16 1.23 Almost sure (a.s.) convergence or convergence with probability 1 .................16 1.24 Convergence in mean square sense ....................................................................17 1.25 Convergence in probability.................................................................................17 1.26 Convergence in distribution................................................................................18 1.27 Central Limit Theorem .......................................................................................18 1.28 Jointly Gaussian Random variables...................................................................19 2
  • 3.
    REVIEW OF RANDOMVARIABLES 1.1 Introduction • Mathematically a random variable is neither random nor a variable • It is a mapping from sample space into the real-line ( “real-valued” random variable) or the complex plane ( “complex-valued ” random variable) . Suppose we have a probability space {S , ℑ, P} . Let X : S → ℜ be a function mapping the sample space S into the real line such that For each s ∈ S , there exists a unique X ( s ) ∈ ℜ. Then X is called a random variable. Thus a random variable associates the points in the sample space with real numbers. ℜ Notations: X (s) • Random variables are represented by s • upper-case letters. • Values of a random variable are denoted by lower case letters S • Y = y means that y is the value of a random variable X . Figure Random Variable Example 1: Consider the example of tossing a fair coin twice. The sample space is S= { HH,HT,TH,TT} and all four outcomes are equally likely. Then we can define a random variable X as follows Sample Point Value of the random P{ X = x} Variable X = x HH 0 1 4 HT 1 1 4 TH 2 1 4 TT 3 1 4 3
  • 4.
    Example 2: Considerthe sample space associated with the single toss of a fair die. The sample space is given by S = {1, 2,3, 4,5,6} . If we define the random variable X that associates a real number equal to the number in the face of the die, then X = {1, 2,3, 4,5,6} 1.2 Discrete and Continuous Random Variables • A random variable X is called discrete if there exists a countable sequence of distinct real number xi such that ∑ P ( x ) = 1. i m i Pm ( xi ) is called the probability mass function. The random variable defined in Example 1 is a discrete random variable. • A continuous random variable X can take any value from a continuous interval • A random variable may also de mixed type. In this case the RV takes continuous values, but at each finite number of points there is a finite probability. 1.3 Probability Distribution Function We can define an event { X ≤ x} = {s / X ( s ) ≤ x, s ∈ S} The probability FX ( x) = P{ X ≤ x} is called the probability distribution function. Given FX ( x), we can determine the probability of any event involving values of the random variable X . • FX (x) is a non-decreasing function of X . • FX (x) is right continuous => FX (x) approaches to its value from right. • FX (−∞) = 0 • FX (∞) = 1 • P{x1 < X ≤ x} = FX ( x) − FX ( x1 ) 4
  • 5.
    Example 3: Considerthe random variable defined in Example 1. The distribution function FX ( x) is as given below: Value of the random Variable X = x FX ( x) x<0 0 0 ≤ x <1 1 4 1≤ x < 2 1 2 2≤ x<3 3 4 x≥3 1 FX ( x) X =x 1.4 Probability Density Function d If FX (x) is differentiable f X ( x) = FX ( x) is called the probability density function and dx has the following properties. • f X (x) is a non- negative function ∞ • ∫f −∞ X ( x)dx = 1 x2 • P ( x1 < X ≤ x 2 ) = ∫f − x1 X ( x)dx Remark: Using the Dirac delta function we can define the density function for a discrete random variables. 5
  • 6.
    1.5 Joint randomvariable X and Y are two random variables defined on the same sample space S. P{ X ≤ x, Y ≤ y} is called the joint distribution function and denoted by FX ,Y ( x, y ). Given FX ,Y ( x, y ), -∞ < x < ∞, -∞ < y < ∞, we have a complete description of the random variables X and Y . • P{0 < X ≤ x, 0 < Y ≤ y} = FX ,Y ( x, y ) − FX ,Y ( x,0) − FX ,Y (0, y ) + FX ,Y ( x, y ) • FX ( x) = FXY ( x,+∞). To prove this ( X ≤ x ) = ( X ≤ x ) ∩ ( Y ≤ +∞ ) ∴ F X ( x ) = P (X ≤ x ) = P (X ≤ x , Y ≤ ∞ )= F XY ( x , +∞ ) FX ( x) = P( X ≤ x ) = P( X ≤ x, Y ≤ ∞ ) = FXY ( x,+∞) Similarly FY ( y ) = FXY (∞, y ). • Given FX ,Y ( x, y ), -∞ < x < ∞, -∞ < y < ∞, each of FX ( x) and FY ( y ) is called a marginal distribution function. We can define joint probability density function f X ,Y ( x, y ) of the random variables X and Y by ∂2 f X ,Y ( x, y ) = FX ,Y ( x, y ) , provided it exists ∂x∂y • f X ,Y ( x, y ) is always a positive quantity. x y • FX ,Y ( x, y ) = ∫∫ − ∞− ∞ f X ,Y ( x, y )dxdy 1.6 Marginal density functions fX (x) = d dx FX (x) = d dx FX ( x,∞ ) x ∞ = d dx ∫ ( ∫ f X ,Y ( x , y ) d y ) d x −∞ −∞ ∞ = ∫ f X ,Y ( x , y ) d y −∞ ∞ and fY ( y ) = ∫ f X ,Y ( x , y ) d x −∞ 6
  • 7.
    1.7 Conditional densityfunction f Y / X ( y / X = x) = f Y / X ( y / x) is called conditional density of Y given X . Let us define the conditional distribution function. We cannot define the conditional distribution function for the continuous random variables X and Y by the relation F ( y / x) = P(Y ≤ y / X = x) Y/X P(Y ≤ y, X = x) = P ( X = x) as both the numerator and the denominator are zero for the above expression. The conditional distribution function is defined in the limiting sense as follows: F ( y / x) = lim∆x →0 P (Y ≤ y / x < X ≤ x + ∆x) Y/X P (Y ≤ y , x < X ≤ x + ∆x) =lim∆x →0 P ( x < X ≤ x + ∆x) y ∫ f X ,Y ( x, u )∆xdu ∞ =lim∆x →0 f X ( x ) ∆x y ∫ f X ,Y ( x, u )du =∞ f X ( x) The conditional density is defined in the limiting sense as follows f Y / X ( y / X = x ) = lim ∆y →0 ( FY / X ( y + ∆y / X = x ) − FY / X ( y / X = x )) / ∆y = lim ∆y→0,∆x→0 ( FY / X ( y + ∆y / x < X ≤ x + ∆x ) − FY / X ( y / x < X ≤ x + ∆x )) / ∆y (1) Because ( X = x) = lim ∆x →0 ( x < X ≤ x + ∆x) The right hand side in equation (1) is lim ∆y →0,∆x→0 ( FY / X ( y + ∆y / x < X < x + ∆x) − FY / X ( y / x < X < x + ∆x)) / ∆y = lim ∆y →0, ∆x →0 ( P( y < Y ≤ y + ∆y / x < X ≤ x + ∆x)) / ∆y = lim ∆y →0, ∆x →0 ( P( y < Y ≤ y + ∆y, x < X ≤ x + ∆x)) / P ( x < X ≤ x + ∆x)∆y = lim ∆y →0, ∆x →0 f X ,Y ( x, y )∆x∆y / f X ( x)∆x∆y = f X ,Y ( x, y ) / f X ( x) ∴ fY / X ( x / y ) = f X ,Y ( x, y ) / f X ( x ) (2) Similarly, we have ∴ f X / Y ( x / y ) = f X ,Y ( x, y ) / fY ( y ) (3) 7
  • 8.
    From (2) and(3) we get Baye’s rule f X ,Y (x, y) ∴ f X /Y (x / y) = fY ( y) f (x) fY / X ( y / x) = X fY ( y) f (x, y) = ∞ X ,Y (4) ∫ f X ,Y (x, y)dx −∞ f ( y / x) f X (x) = ∞ Y/X ∫ f X (u) fY / X ( y / x)du −∞ Given the joint density function we can find out the conditional density function. Example 4: For random variables X and Y, the joint probability density function is given by 1 + xy f X ,Y ( x, y ) = x ≤ 1, y ≤ 1 4 = 0 otherwise Find the marginal density f X ( x), fY ( y ) and fY / X ( y / x). Are X and Y independent? 1 + xy 1 1 f X ( x) = ∫ −1 4 dy = 2 Similarly 1 fY ( y ) = -1 ≤ y ≤ 1 2 and f X ,Y ( x, y ) 1 + xy fY / X ( y / x ) = = , x ≤ 1, y ≤ 1 f X ( x) 4 = 0 otherwise ∴ X and Y are not independent 1.8 Baye’s Rule for mixed random variables Let X be a discrete random variable with probability mass function PX ( x) and Y be a continuous random variable. In practical problem we may have to estimate X from observed Y . Then 8
  • 9.
    PX /Y ( x / y ) = li m ∆y→ 0 PX /Y ( x / y < Y ≤ y + ∆ y )∞ PX ,Y (x, y < Y ≤ y + ∆y) = li m ∆y→ 0 PY ( y < Y ≤ y + ∆ y ) PX ( x ) fY / X ( y / x ) ∆ y = lim ∆y→ 0 fY ( y ) ∆ y PX ( x ) fY / X ( y / x ) = fY ( y ) PX ( x ) fY / X ( y / x ) == ∑ PX ( x ) fY / X ( y / x ) x Example 5: X Y + V X is a binary random variable with ⎧ 1 ⎪1 with probability 2 ⎪ X =⎨ ⎪ −1 with probability 1 ⎪ ⎩ 2 0 and variance σ .2 V is the Gaussian noise with mean Then PX ( x) fY / X ( y / x) PX / Y ( x = 1/ y ) = ∑ PX ( x) fY / X ( y / x) x − ( y −1) 2 / 2σ 2 e = − ( y −1)2 / 2σ 2 / 2σ 2 + (e − ( y +1) 2 e 1.9 Independent Random Variable Let X and Y be two random variables characterised by the joint density function FX ,Y ( x, y ) = P{ X ≤ x, Y ≤ y} and f X ,Y ( x, y ) = ∂2 ∂x∂y FX ,Y ( x, y ) Then X and Y are independent if f X / Y ( x / y ) = f X ( x) ∀x ∈ ℜ and equivalently f X ,Y ( x, y ) = f X ( x ) fY ( y ) , where f X (x) and fY ( y) are called the marginal density functions. 9
  • 10.
    1.10 Moments ofRandom Variables • Expectation provides a description of the random variable in terms of a few parameters instead of specifying the entire distribution function or the density function • It is far easier to estimate the expectation of a R.V. from data than to estimate its distribution First Moment or mean The mean µ X of a random variable X is defined by µ X = EX = ∑ xi P( xi ) for a discrete random variable X ∞ = ∫ xf X ( x)dx for a continuous random variable X −∞ For any piecewise continuous function y = g ( x ) , the expectation of the R.V. −∞ Y = g ( X ) is given by EY = Eg ( X ) = ∫ g ( x) f −∞ x ( x )dx Second moment ∞ EX 2 = ∫ x 2 f X ( x)dx −∞ Variance ∞ σ X = ∫ ( x − µ X ) 2 f X ( x) dx 2 −∞ • Variance is a central moment and measure of dispersion of the random variable about the mean. • σ x is called the standard deviation. • For two random variables X and Y the joint expectation is defined as ∞ ∞ E ( XY ) = µ X ,Y = ∫ ∫ xyf X ,Y ( x, y )dxdy −∞ −∞ The correlation between random variables X and Y , measured by the covariance, is given by Cov( X , Y ) = σ XY = E ( X − µ X )(Y − µY ) = E ( XY -X µY - Y µ X + µ X µY ) = E ( XY ) -µ X µY E( X − µX )(Y − µY ) σ XY The ratio ρ= = E( X − µX )2 E(Y − µY )2 σ XσY is called the correlation coefficient. The correlation coefficient measures how much two random variables are similar. 10
  • 11.
    1.11 Uncorrelated randomvariables Random variables X and Y are uncorrelated if covariance Cov ( X , Y ) = 0 Two random variables may be dependent, but still they may be uncorrelated. If there exists correlation between two random variables, one may be represented as a linear regression of the others. We will discuss this point in the next section. 1.12 Linear prediction of Y from X Y = aX + b ˆ Regression Prediction error Y − Yˆ Mean square prediction error E (Y − Y ) 2 = E (Y − aX − b) 2 ˆ For minimising the error will give optimal values of a and b. Corresponding to the optimal solutions for a and b, we have ∂ E (Y − aX − b) 2 = 0 ∂a ∂ E (Y − aX − b) 2 = 0 ∂b 1 Solving for a and b , Y − µY = 2 σ X ,Y ( x − µ X ) ˆ σX σ σ XY so that Y − µY = ρ X ,Y y ( x − µ X ) , where ρ X ,Y = ˆ is the correlation coefficient. σx σ XσY If ρ X ,Y = 0 then X and Y are uncorrelated. => Y − µ Y = 0 ˆ => Y = µ Y is the best prediction. ˆ Note that independence => Uncorrelatedness. But uncorrelated generally does not imply independence (except for jointly Gaussian random variables). Example 6: Y = X 2 and f X (x) is uniformly distributed between (1,-1). X and Y are dependent, but they are uncorrelated. Cov( X , Y ) = σ X = E ( X − µ X )(Y − µ Y ) Because = EXY = EX 3 = 0 = EXEY (∵ EX = 0) In fact for any zero- mean symmetric distribution of X, X and X 2 are uncorrelated. 11
  • 12.
    1.13 Vector spaceInterpretation of Random Variables The set of all random variables defined on a sample space form a vector space with respect to addition and scalar multiplication. This is very easy to verify. 1.14 Linear Independence Consider the sequence of random variables X 1 , X 2 ,.... X N . If c1 X 1 + c2 X 2 + .... + cN X N = 0 implies that c1 = c2 = .... = cN = 0, then X 1 , X 2 ,.... X N . are linearly independent. 1.15 Statistical Independence X 1 , X 2 ,.... X N are statistically independent if f X1, X 2 ,.... X N ( x1, x2 ,.... xN ) = f X1 ( x1 ) f X 2 ( x2 ).... f X N ( xN ) Statistical independence in the case of zero mean random variables also implies linear independence 1.16 Inner Product If x and y are real vectors in a vector space V defined over the field , the inner product < x, y > is a scalar such that ∀x, y, z ∈ V and a ∈ 1. < x, y > = < y, x > 2 2. < x, x > = x ≥ 0 3. < x + y, z > = < x, z > + < y, z > 4. < ax, y > = a < x, y > In the case of RVs, inner product between X and Y is defined as ∞ ∞ < X, Y > = EXY = ∫ ∫ xy f X ,Y (x, y )dy dx. −∞ −∞ Magnitude / Norm of a vector x =< x, x > 2 So, for R.V. 2 ∞ X = EX 2 = ∫ x 2 f X (x)dx −∞ • The set of RVs along with the inner product defined through the joint expectation operation and the corresponding norm defines a Hilbert Space. 12
  • 13.
    1.17 Schwary Inequality For any two vectors x and y belonging to a Hilbert space V | < x, y > | ≤ x y For RV X and Y E 2 ( XY ) ≤ EX 2 EY 2 Proof: Consider the random variable Z = aX + Y E (aX + Y ) 2 ≥ 0 . ⇒ a 2 EX 2 + EY 2 + 2aEXY ≥ 0 Non-negatively of the left-hand side => its minimum also must be nonnegative. For the minimum value, dEZ 2 EXY = 0 => a = − da EX 2 E 2 XY E 2 XY so the corresponding minimum is + EY 2 − 2 EX 2 EX 2 Minimum is nonnegative => E 2 XY EY 2 − ≥0 EX 2 => E 2 XY < EX 2 EY 2 Cov( X , Y ) E ( X − µ X )(Y − µY ) ρ ( X ,Y ) = = σ Xxσ X E ( X − µ X ) 2 E (Y − µY ) 2 From schwarz inequality ρ ( X ,Y ) ≤ 1 1.18 Orthogonal Random Variables Recall the definition of orthogonality. Two vectors x and y are called orthogonal if < x, y > = 0 Similarly two random variables X and Y are called orthogonal if EXY = 0 If each of X and Y is zero-mean Cov( X ,Y ) = EXY Therefore, if EXY = 0 then Cov ( XY ) = 0 for this case. For zero-mean random variables, Orthogonality uncorrelatedness 13
  • 14.
    1.19 Orthogonality Principle Xis a random variable which is not observable. Y is another observable random variable which is statistically dependent on X . Given a value of Y what is the best guess for X ? (Estimation problem). Let the best estimate be ˆ X (Y ) . Then E ( X − X (Y )) 2 ˆ is a minimum with respect ˆ to X (Y ) . And the corresponding estimation principle is called minimum mean square error principle. For finding the minimum, we have ∂ ∂Xˆ E ( X − X (Y ))2 = 0 ˆ ∞ ∞ ⇒ ∂ ∂Xˆ ∫ ∫ ( x − X ( y )) f X ,Y ( x, y )dydx = 0 ˆ 2 −∞ −∞ ∞ ∞ ⇒ ∂ ∂Xˆ ∫ ∫ ( x − X ( y )) fY ( y ) f X / Y ( x )dyd x = 0 ˆ 2 −∞ −∞ ∞ ∞ ⇒ ∂ ∂Xˆ ∫ fY ( y )( ∫ ( x − X ( y )) f X / Y ( x )dx )dy = 0 ˆ 2 −∞ −∞ Since fY ( y ) in the above equation is always positive, therefore the minimization is equivalent to ∞ ∫ ( x − X ( y )) f X / Y ( x)dx = 0 ∂ ˆ 2 ∂Xˆ −∞ ∞ Or 2 ∫ ( x − X ( y )) f X / Y ( x)dx = 0 ˆ -∞ ∞ ∞ ⇒ ∫ X ( y ) f X / Y ( x)dx = ∫ xf X / Y ( x)dx ˆ −∞ −∞ ⇒ X ( y) = E ( X / Y ) ˆ Thus, the minimum mean-square error estimation involves conditional expectation which is difficult to obtain numerically. Let us consider a simpler version of the problem. We assume that X ( y ) = ay and the ˆ estimation problem is to find the optimal value for a. Thus we have the linear minimum mean-square error criterion which minimizes E ( X − aY ) 2 . d da E ( X − aY ) 2 = 0 ⇒ E da ( X − aY ) 2 = 0 d ⇒ E ( X − aY )Y = 0 ⇒ EeY = 0 where e is the estimation error. 14
  • 15.
    The above resultshows that for the linear minimum mean-square error criterion, estimation error is orthogonal to data. This result helps us in deriving optimal filters to estimate a random signal buried in noise. The mean and variance also give some quantitative information about the bounds of RVs. Following inequalities are extremely useful in many practical problems. 1.20 Chebysev Inequality Suppose X is a parameter of a manufactured item with known mean µ X and variance σ X . The quality control department rejects the item if the absolute 2 deviation of X from µ X is greater than 2σ X . What fraction of the manufacturing item does the quality control department reject? Can you roughly guess it? The standard deviation gives us an intuitive idea how the random variable is distributed about the mean. This idea is more precisely expressed in the remarkable Chebysev Inequality stated below. For a random variable X with mean µ and variance σ 2 X X σX 2 P{ X − µ X ≥ ε } ≤ ε2 Proof: ∞ σ x = ∫ ( x − µ X ) 2 f X ( x)dx 2 −∞ ≥ ∫ ( x − µ X ) 2 f X ( x)dx X − µ X ≥ε ≥ ∫ ε 2 f X ( x) dx X − µ X ≥ε = ε 2 P{ X − µ X ≥ ε } σX 2 ∴ P{ X − µ X ≥ ε } ≤ ε2 1.21 Markov Inequality For a random variable X which take only nonnegative values E( X ) P{ X ≥ a} ≤ where a > 0. a ∞ E ( X ) = ∫ xf X ( x)dx 0 ∞ ≥ ∫ xf X ( x)dx a ∞ ≥ ∫ af X ( x)dx a = aP{ X ≥ a} 15
  • 16.
    E( X ) ∴P{ X ≥ a} ≤ a E ( X − k )2 Result: P{( X − k ) 2 ≥ a} ≤ a 1.22 Convergence of a sequence of random variables Let X 1 , X 2 ,..., X n be a sequence n independent and identically distributed random variables. Suppose we want to estimate the mean of the random variable on the basis of the observed data by means of the relation 1N µX = ∑ Xi ˆ n i =1 How closely does µ X represent ˆ µ X as n is increased? How do we measure the closeness between µ X and µ X ? ˆ Notice that µ X is a random variable. What do we mean by the statement µ X converges ˆ ˆ to µ X ? Consider a deterministic sequence x1 , x 2 ,....x n .... The sequence converges to a limit x if correspond to any ε > 0 we can find a positive integer m such that x − xn < ε for n > m. Convergence of a random sequence X 1 , X 2 ,.... X n .... cannot be defined as above. A sequence of random variables is said to converge everywhere to X if X (ξ ) − X n (ξ ) → 0 for n > m and ∀ξ . 1.23 Almost sure (a.s.) convergence or convergence with probability 1 For the random sequence X 1 , X 2 ,.... X n .... { X n → X } this is an event. P{s | X n ( s ) → X ( s )} = 1 as n → ∞, If P{s X n ( s ) − X ( s ) < ε for n ≥ m} = 1 as m → ∞, then the sequence is said to converge to X almost sure or with probability 1. One important application is the Strong Law of Large Numbers: 1 n If X 1 , X 2 ,.... X n .... are iid random variables, then ∑ X i → µ X with probability 1as n → ∞. n i =1 16
  • 17.
    1.24 Convergence inmean square sense If E ( X n − X ) 2 → 0 as n → ∞, we say that the sequence converges to X in mean square (M.S). Example 7: If X 1 , X 2 ,.... X n .... are iid random variables, then 1N ∑ X i → µ X in the mean square 1as n → ∞. n i =1 1N We have to show that lim E ( ∑ X i − µ X )2 = 0 n →∞ n i =1 Now, 1N 1 N E ( ∑ X i − µ X ) 2 = E ( ( ∑ ( X i − µ X )) 2 n i =1 n i =1 1 N 1 n n = 2 ∑ E ( X i − µ X ) 2 + 2 ∑ ∑ E ( X i − µ X )( X j − µ X ) n i=1 n i=1 j=1,j≠i nσ X2 = 2 +0 ( Because of independence) n σX 2 = n 1N ∴ lim E ( ∑ X i − µ X ) 2 = 0 n→∞ n i =1 1.25 Convergence in probability P{ X n − X > ε } is a sequence of probability. X n is said to convergent to X in probability if this sequence of probability is convergent that is P{ X n − X > ε } → 0 as n → ∞. If a sequence is convergent in mean, then it is convergent in probability also, because P{ X n − X 2 > ε 2 } ≤ E ( X n − X )2 / ε 2 (Markov Inequality) We have P{ X n − X > ε } ≤ E ( X n − X ) 2 / ε 2 If E ( X n − X ) 2 → 0 as n → ∞, (mean square convergent) then P{ X n − X > ε } → 0 as n → ∞. 17
  • 18.
    Example 8: Suppose {X n } be a sequence of random variables with 1 P ( X n = 1} = 1 − n and 1 P ( X n = −1} = n Clearly 1 P{ X n − 1 > ε } = P{ X n = −1} = →0 n as n → ∞. Therefore { X n } ⎯⎯ { X = 0} P → 1.26 Convergence in distribution The sequence X 1 , X 2 ,.... X n .... is said to converge to X in distribution if F X n ( x ) → FX ( x ) as n → ∞. Here the two distribution functions eventually coincide. 1.27 Central Limit Theorem Consider independent and identically distributed random variables X 1 , X 2 ,.... X n . Let Y = X 1 + X 2 + ... X n Then µ Y = µ X 1 + µ X 2 + ...µ X n And σ Y = σ X1 + σ X 2 + ... + σ X n 2 2 2 2 The central limit theorem states that under very general conditions Y converges to N ( µ Y , σ Y ) as n → ∞. The conditions are: 2 1. The random variables X , X ,..., X are independent with same mean and 1 2 n variance, but not identically distributed. 2. The random variables X , X ,..., X are independent with different mean and 1 2 n same variance and not identically distributed. 18
  • 19.
    1.28 Jointly GaussianRandom variables Two random variables X and Y are called jointly Gaussian if their joint density function is ⎡ ( x−µ X )2 ( x− µ )( y − µ ) ( y − µ )2 ⎤ − 1 2 ⎢ 2 − 2 ρ XY σ X σ Y + Y 2 ⎥ 2 (1− ρ X ,Y ) ⎢ σX σY ⎥ f X ,Y ( x, y ) = Ae ⎣ ⎦ X Y where A= 1 2πσ xσ y 1− ρ X ,Y 2 Properties: (1) If X and Y are jointly Gaussian, then for any constants a and b, then the random variable Z , given by Z = aX + bY is Gaussian with mean µ Z = aµ X + bµY and variance σ Z 2 = a 2σ X 2 + b 2σ Y 2 + 2abσ X σ Y ρ X ,Y (2) If two jointly Gaussian RVs are uncorrelated, ρ X ,Y = 0 then they are statistically independent. f X ,Y ( x, y ) = f X ( x ) f Y ( y ) in this case. (3) If f X ,Y ( x, y ) is a jointly Gaussian distribution, then the marginal densities f X ( x) and f Y ( y ) are also Gaussian. (4) If X and Y are joint by Gaussian random variables then the optimum nonlinear estimator X of X that minimizes the mean square error ξ = E{[ X − X ] 2 } is a linear ˆ ˆ estimator X = aY ˆ 19
  • 20.
    REVIEW OF RANDOMPROCESS 2.1 Introduction Recall that a random variable maps each sample point in the sample space to a point in the real line. A random process maps each sample point to a waveform. • A random process can be defined as an indexed family of random variables { X (t ), t ∈ T } where T is an index set which may be discrete or continuous usually denoting time. • The random process is defined on a common probability space {S , ℑ, P}. • A random process is a function of the sample point ξ and index variable t and may be written as X (t , ξ ). • For a fixed t (= t 0 ), X (t 0 , ξ ) is a random variable. • For a fixed ξ (= ξ 0 ), X (t , ξ 0 ) is a single realization of the random process and is a deterministic function. • When both t and ξ are varying we have the random process X (t , ξ ). The random process X (t , ξ ) is normally denoted by X (t ). We can define a discrete random process X [n] on discrete points of time. Such a random process is more important in practical implementations. X (t , s3 ) S s3 X (t , s2 ) s1 s2 X (t , s1 ) t Figure Random Process
  • 21.
    2.2 How todescribe a random process? To describe X (t ) we have to use joint density function of the random variables at different t . For any positive integer n , X (t1 ), X (t 2 ),..... X (t n ) represents n jointly distributed random variables. Thus a random process can be described by the joint distribution function FX ( t1 ), X ( t2 )..... X ( tn ) ( x1 , x 2 .....x n ) = F ( x1 , x 2 .....x n , t1 , t 2 .....t n ), ∀n ∈ N and ∀t n ∈ T Otherwise we can determine all the possible moments of the process. E ( X (t )) = µ x (t ) = mean of the random process at t. R X (t1 , t 2 ) = E ( X (t 1) X (t 2 )) = autocorrelation function at t1 ,t 2 R X (t1 , t 2 , t 3 ) = E ( X (t 1 ), X (t 2 ), X (t 3 )) = Triple correlation function at t1 , t 2 , t 3 , etc. We can also define the auto-covariance function C X (t1 , t 2 ) of X (t ) given by C X (t1 , t 2 ) = E ( X (t 1 ) − µ X (t1 ))( X (t 2 ) − µ X (t 2 )) = R X (t1 , t 2 ) − µ X (t1 ) µ X (t 2 ) Example 1: (a) Gaussian Random Process For any positive integer n, X (t1 ), X (t 2 ),..... X (t n ) represent n jointly random variables. These n random variables define a random vector X = [ X (t1 ), X (t2 ),..... X (tn )]'. The process X (t ) is called Gaussian if the random vector [ X (t1 ), X (t2 ),..... X (tn )]' is jointly Gaussian with the joint density function given by 1 − X'C−1X e 2 X f X (t1 ), X (t2 )... X (tn ) ( x1 , x2 ,...xn ) = where C X = E ( X − µ X )( X − µ X ) ' ( ) n 2π det(CX ) and µ X = E ( X) = [ E ( X 1 ), E ( X 2 )......E ( X n ) ] '. (b) Bernouli Random Process (c) A sinusoid with a random phase.
  • 22.
    2.3 Stationary RandomProcess A random process X (t ) is called strict-sense stationary if its probability structure is invariant with time. In terms of the joint distribution function FX ( t1 ), X (t 2 )..... X ( t n ) ( x1 , x 2 ..... x n ) = FX ( t1 + t0 ), X ( t 2 + t0 )..... X ( t n + t0 ) ( x1 , x 2 .....x n ) ∀n ∈ N and ∀t 0 , t n ∈ T For n = 1, FX ( t1 ) ( x1 ) = FX ( t1 +t0 ) ( x1 ) ∀t 0 ∈ T Let us assume t 0 = −t1 FX (t1 ) ( x1 ) = FX ( 0 ) ( x1 ) ⇒ EX (t1 ) = EX (0) = µ X (0) = constant For n = 2, FX ( t1 ), X ( t 2 ) ( x1 , x 2 .) = FX ( t1 +t0 ), X ( t 2 +t0 ) ( x1 , x 2 ) Put t 0 = −t 2 FX ( t1 ), X ( t2 ) ( x1 , x2 ) = FX ( t1 −t2 ), X ( 0 ) ( x1 , x2 ) ⇒ R X (t1 , t2 ) = R X (t1 − t2 ) A random process X (t ) is called wide sense stationary process (WSS) if µ X (t ) = constant R X (t1 , t2 ) = R X (t1 − t2 ) is a function of time lag. For a Gaussian random process, WSS implies strict sense stationarity, because this process is completely described by the mean and the autocorrelation functions. The autocorrelation function R X (τ ) = EX (t + τ ) X (t ) is a crucial quantity for a WSS process. • RX (0) = EX 2 (t ) is the mean-square value of the process. • RX ( −τ ) = RX (τ )for real process (for a complex process X(t), RX ( −τ ) = R* (τ ) X • RX (τ ) <RX (0) which follows from the Schwartz inequality RX (τ ) = {EX (t ) X (t + τ )}2 2 = < X (t ), X (t + τ ) > 2 X (t + τ ) 2 2 ≤ X (t ) = EX 2 (t ) EX 2 (t + τ ) = RX (0) RX (0) 2 2 ∴ RX (τ ) <RX (0)
  • 23.
    RX (τ ) is a positive semi-definite function in the sense that for any positive n n integer n and real a j , a j , ∑ ∑ ai a j RX (ti , t j )>0 i =1 j =1 • If X (t ) is periodic (in the mean square sense or any other sense like with probability 1), then R X (τ ) is also periodic. For a discrete random sequence, we can define the autocorrelation sequence similarly. • If R X (τ ) drops quickly , then the signal samples are less correlated which in turn means that the signal has lot of changes with respect to time. Such a signal has high frequency components. If R X (τ ) drops slowly, the signal samples are highly correlated and such a signal has less high frequency components. • R X (τ ) is directly related to the frequency domain representation of WSS process. 2.4 Spectral Representation of a Random Process How to have the frequency-domain representation of a random process? • Wiener (1930) and Khinchin (1934) independently discovered the spectral representation of a random process. Einstein (1914) also used the concept. • Autocorrelation function and power spectral density forms a Fourier transform pair Lets define X T (t) = X(t) -T < t < T = 0 otherwise as t → ∞, X T (t ) will represent the random process X (t ). T Define X T ( w) = ∫X −T T (t)e − jwt dt in mean square sense. t2 −τ t1 −τ dτ
  • 24.
    X T (ω) X T * (ω ) | X (ω ) |2 1 TT − jωt + jωt E =E T = ∫ ∫ EX T (t1 ) X T (t 2 )e 1 e 2 dt1dt 2 2T 2T 2T −T −T 1 T T − jω ( t1 −t2 ) = ∫ ∫ RX (t1 − t2 )e dt1dt2 2T −T −T 1 2T − jωτ = ∫ RX (τ )e (2T − | τ |)dτ 2T −2T Substituting t1 − t 2 = τ so that t 2 = t1 − τ is a line, we get X T (ω ) X T * (ω ) 2T |τ | E = ∫ R x (τ )e − jωτ (1 − ) dτ 2T − 2T 2T If R X (τ ) is integrable then as T → ∞, E X T (ω ) 2 ∞ limT →∞ = ∫ RX (τ )e − jωτ dτ 2T −∞ E X T (ω ) 2 = contribution to average power at freq ω and is called the power spectral 2T density. ∞ Thus ∫ R (τ )e − jωτ SX (ω ) = x dτ −∞ ∞ and R X (τ ) = ∫ −∞ SX (ω )e jωτ dw Properties ∞ • EX 2 (t) = R X (0) = ∫ S X (ω )dw = average power of the process. −∞ w2 • The average power in the band ( w1 , w2) is ∫ S X ( w)dw w1 • R X (τ ) is real and even ⇒ S X (ω ) is real, even. E X T (ω ) 2 • From the definition S X ( w) = limT →∞ is always positive. 2T S x (ω ) • h X ( w) = = normalised power spectral density and has properties of PDF, EX 2 (t ) (always +ve and area=1).
  • 25.
    2.5 Cross-correlation &Cross power Spectral Density Consider two real random processes X(t) and Y(t). Joint stationarity of X(t) and Y(t) implies that the joint densities are invariant with shift of time. The cross-correlation function RX,Y (τ ) for a jointly wss processes X(t) and Y(t) is defined as R X,Y (τ ) = Ε X (t + τ )Y (t ) so that RYX (τ ) = Ε Y (t + τ ) X (t ) = Ε X (t )Y (t + τ ) = R X,Y ( − τ ) ∴ RYX (τ ) = R X,Y ( − τ ) Cross power spectral density ∞ S X ,Y ( w) = ∫ RX ,Y (τ )e− jwτ dτ −∞ For real processes X(t) and Y(t) S X ,Y ( w) = SY , X ( w) * The Wiener-Khinchin theorem is also valid for discrete-time random processes. If we define R X [m] = E X [n + m] X [n] Then corresponding PSD is given by ∞ S X ( w) = ∑ Rx [ m ] e − jω m −π ≤ w ≤ π m =−∞ ∞ or S X ( f ) = ∑ Rx [ m] e− j 2π m −1 ≤ f ≤ 1 m =−∞ 1 π jωm ∴ RX [m ] = ∫ S X ( w)e dw 2π −π For a discrete sequence the generalized PSD is defined in the z − domain as follows ∞ S X ( z) = ∑ R [m ] z m =−∞ x −m If we sample a stationary random process uniformly we get a stationary random sequence. Sampling theorem is valid in terms of PSD.
Examples 2:

(1) R_X(\tau) = e^{-a|\tau|},  a > 0
    S_X(\omega) = \frac{2a}{a^2 + \omega^2},   -\infty < \omega < \infty

(2) R_X[m] = a^{|m|},  0 < a < 1
    S_X(\omega) = \frac{1 - a^2}{1 - 2a\cos\omega + a^2},   -\pi \le \omega \le \pi

2.6 White Noise Process

A white noise process X(t) is defined by the flat power spectral density
   S_X(f) = \frac{N}{2},   -\infty < f < \infty.
The corresponding autocorrelation function is
   R_X(\tau) = \frac{N}{2}\,\delta(\tau),
where δ(τ) is the Dirac delta. The average power of white noise is
   P_{avg} = \int_{-\infty}^{\infty} \frac{N}{2}\, df \to \infty.

• Samples of a white noise process are uncorrelated.
• White noise is a mathematical abstraction; it cannot be realized, since it has infinite power.
• If the system bandwidth (BW) is sufficiently narrower than the noise bandwidth and the noise PSD is flat over the system band, the noise can be modelled as a white noise process. Thermal and shot noise are well modelled as white Gaussian noise, since their PSDs are very flat over a very wide band (of the order of GHz).
• For a zero-mean white noise process, the correlation of the process at any lag τ ≠ 0 is zero.
• White noise plays a key role in random signal modelling: it plays a role similar to that of the impulse function in the modelling of deterministic signals. Passing white noise through a linear system produces a WSS random signal (figure: white noise → linear system → WSS random signal).
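Returning to transform pair (2) of Example 2 above, the relation is easy to verify numerically (a sketch, not part of the notes): truncate S_X(ω) = Σ_m a^{|m|} e^{-jωm} at a large lag and compare with the closed form.

```python
import numpy as np

a = 0.6
w = np.linspace(-np.pi, np.pi, 401)

# Truncated DTFT of the autocorrelation sequence R_X[m] = a^|m|.
M = 200                                   # a^200 is negligible for |a| < 1
m = np.arange(-M, M + 1)
S_numeric = np.sum((a ** np.abs(m))[:, None] * np.exp(-1j * np.outer(m, w)),
                   axis=0).real

# Closed form S_X(w) = (1 - a^2) / (1 - 2 a cos w + a^2).
S_closed = (1 - a**2) / (1 - 2 * a * np.cos(w) + a**2)

print("max |difference|:", np.abs(S_numeric - S_closed).max())  # ~ machine precision
```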
2.7 White Noise Sequence

For a white noise sequence x[n],
   S_X(\omega) = \frac{N}{2},   -\pi \le \omega \le \pi,
and therefore
   R_X[m] = \frac{N}{2}\,\delta[m],
where δ[m] is the unit impulse sequence. (The PSD is flat over (-π, π] and the autocorrelation is a single impulse of height N/2 at m = 0.)

2.8 Linear Shift-Invariant System with Random Inputs

Consider a discrete-time linear shift-invariant system with impulse response h[n], input x[n] and output y[n]:
   y[n] = x[n] * h[n]
   E\, y[n] = E\,(x[n] * h[n]).
For a stationary input x[n],
   \mu_Y = E\, y[n] = \mu_X \sum_{k=0}^{l} h[k],
where l is the length of the impulse response sequence. Also,
   R_Y[m] = E\, y[n]\, y[n-m] = E\,(x[n]*h[n])\,(x[n-m]*h[n-m]) = R_X[m] * h[m] * h[-m],
which is a function of the lag m only. Taking the discrete-time Fourier transform,
   S_Y(\omega) = |H(\omega)|^2\, S_X(\omega).
Example 3: Suppose
   H(\omega) = 1,   -\omega_c \le \omega \le \omega_c
            = 0,   otherwise,
and the input is white noise with S_X(\omega) = N/2, -\infty < \omega < \infty. Then
   S_Y(\omega) = \frac{N}{2},   -\omega_c \le \omega \le \omega_c,
and
   R_Y(\tau) = \frac{N\omega_c}{2\pi}\,\frac{\sin(\omega_c\tau)}{\omega_c\tau}.
• Note that although the input is an uncorrelated process, the output is a correlated process.

Consider the case of a discrete-time system with a random sequence x[n] as input and output y[n]:
   R_Y[m] = R_X[m] * h[m] * h[-m].
Taking the z-transform,
   S_Y(z) = S_X(z)\, H(z)\, H(z^{-1}).
Notice that if H(z) is causal, then H(z^{-1}) is anti-causal. Similarly, if H(z) is minimum-phase, then H(z^{-1}) is maximum-phase.

Example 4: If H(z) = \frac{1}{1 - \alpha z^{-1}} (with |\alpha| < 1) and x[n] is a unit-variance white-noise sequence, then
   S_Y(z) = H(z) H(z^{-1}) = \left(\frac{1}{1-\alpha z^{-1}}\right)\left(\frac{1}{1-\alpha z}\right).
By partial-fraction expansion and inverse z-transform, we get
   R_Y[m] = \frac{\alpha^{|m|}}{1-\alpha^2}.
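A short simulation of Example 4 (a sketch under the stated model; sample size and seed are arbitrary): filter unit-variance white noise through H(z) = 1/(1 − αz⁻¹) and compare the sample autocorrelation of the output with α^|m|/(1 − α²).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.7, 100_000

# y[n] = alpha*y[n-1] + x[n]  <=>  H(z) = 1/(1 - alpha z^-1)
x = rng.standard_normal(n)                 # unit-variance white noise
y = np.empty(n)
y[0] = x[0]
for i in range(1, n):
    y[i] = alpha * y[i - 1] + x[i]

def sample_autocorr(v, max_lag):
    """Biased sample autocorrelation R[m] = (1/N) sum_k v[k+m] v[k]."""
    return np.array([np.dot(v[m:], v[:len(v) - m]) / len(v)
                     for m in range(max_lag + 1)])

R_hat = sample_autocorr(y, 5)
R_theory = alpha ** np.arange(6) / (1 - alpha**2)
print(np.round(R_hat, 3))
print(np.round(R_theory, 3))               # the two should agree closely
```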
    2.9 Spectral factorizationtheorem A stationary random signal X [n ] that satisfies the Paley Wiener condition π ∫ | ln S X ( w) | dw < ∞ can be considered as an output of a linear filter fed by a white noise −π sequence. If S X (w) is an analytic function of w , π and ∫ | ln S X ( w) | dw < ∞ , then S X ( z ) = σ v2 H c ( z ) H a ( z ) −π where H c ( z ) is the causal minimum phase transfer function H a ( z ) is the anti-causal maximum phase transfer function and σ v2 a constant and interpreted as the variance of a white-noise sequence. Innovation sequence v[n] X [n ] H c ( z) Figure Innovation Filter Minimum phase filter => the corresponding inverse filter exists. 1 v[n] X [n ] H c ( z) Figure whitening filter 1 Since ln S XX ( z ) is analytic in an annular region ρ < z < , ρ ∞ ln S XX ( z ) = ∑ c[k ]z − k k =−∞ 1 π iwn where c[k ] = ∫ ln S XX ( w)e dw is the kth order cepstral coefficient. 2π −π For a real signal c[k ] = c[−k ] 1 π and c[0] = ∫ ln S XX ( w)dw 2π −π ∞ ∑ c[ k ] z − k S XX ( z ) = e k =−∞ ∞ −1 ∑ c[ k ] z − k ∑ c[ k ] z − k =e c[0] e k =1 e k =−∞ ∞ ∑ c[ k ] z − k Let H C ( z ) = ek =1 z >ρ = 1 + hc (1)z -1 + hc (2) z −2 + ......
    (∵ hc [0]= Lim z →∞ H C ( z ) = 1 H C ( z ) and ln H C ( z ) are both analytic => H C ( z ) is a minimum phase filter. Similarly let −1 ∑ c( k ) z− k H a ( z) = e k =−∞ ∞ ∑ c( k ) zk 1 = e k =1 = H C ( z −1 ) z< ρ Therefore, S XX ( z ) = σ V H C ( z ) H C ( z −1 ) 2 where σ V = ec ( 0) 2 Salient points • S XX (z ) can be factorized into a minimum-phase and a maximum-phase factors i.e. H C ( z ) and H C ( z −1 ). • In general spectral factorization is difficult, however for a signal with rational power spectrum, spectral factorization can be easily done. • Since is a minimum phase filter, 1 exists (=> stable), therefore we can have a H C (z) 1 filter to filter the given signal to get the innovation sequence. HC ( z) • X [n ] and v[n] are related through an invertible transform; so they contain the same information. 2.10 Wold’s Decomposition Any WSS signal X [n ] can be decomposed as a sum of two mutually orthogonal processes • a regular process X r [ n] and a predictable process X p [n] , X [n ] = X r [n ] + X p [n ] • X r [ n ] can be expressed as the output of linear filter using a white noise sequence as input. • X p [n] is a predictable process, that is, the process can be predicted from its own past with zero prediction error.
RANDOM SIGNAL MODELLING

3.1 Introduction

The spectral factorization theorem enables us to model a regular random process as the output of a linear filter driven by white noise. Different models are obtained by using different forms of the linear filter.

• These models are mathematically described by linear constant-coefficient difference equations.
• In statistics, random-process modelling using difference equations is known as time series analysis.

3.2 White Noise Sequence

The simplest model is the white noise v[n]. We shall assume that v[n] has zero mean and variance σ_V², so that
   R_V[m] = \sigma_V^2\,\delta[m]   and   S_V(\omega) = \frac{\sigma_V^2}{2\pi},   -\pi \le \omega \le \pi.

3.3 Moving Average MA(q) Model

In the MA(q) model, the white noise v[n] is passed through an FIR filter to produce X[n]. The difference-equation model is
   X[n] = \sum_{i=0}^{q} b_i\, v[n-i].
Since μ_V = 0, we have μ_X = 0, and because v[n] is an uncorrelated sequence,
   \sigma_X^2 = \sum_{i=0}^{q} b_i^2\, \sigma_V^2.
The autocorrelations are given by
   R_X[m] = E\, X[n] X[n-m]
          = \sum_{i=0}^{q}\sum_{j=0}^{q} b_i b_j\, E\, v[n-i]\, v[n-m-j]
          = \sum_{i=0}^{q}\sum_{j=0}^{q} b_i b_j\, R_V[m - i + j].
Noting that R_V[m] = σ_V² δ[m], only the terms with m - i + j = 0, i.e. i = m + j, survive. The maximum value of m + j is q, so that
   R_X[m] = \sigma_V^2 \sum_{j=0}^{q-m} b_j\, b_{j+m},   0 \le m \le q,
and R_X[-m] = R_X[m]. Writing the two relations together,
   R_X[m] = \sigma_V^2 \sum_{j=0}^{q-|m|} b_j\, b_{j+|m|},   |m| \le q
          = 0,   otherwise.
Notice that R_X[m] is related to the model parameters by a nonlinear relationship; thus finding the model parameters is not simple. The power spectral density is given by
   S_X(\omega) = \frac{\sigma_V^2}{2\pi}\,|B(\omega)|^2,   where   B(\omega) = b_0 + b_1 e^{-j\omega} + \dots + b_q e^{-jq\omega}.
An FIR system contributes only zeros, so if the spectrum has valleys, an MA model will fit well.

3.3.1 Test for an MA process

R_X[m] becomes zero abruptly after some value of m (namely m = q).

Figure: Autocorrelation function of an MA process
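The closed form above is easy to check numerically. The sketch below (hypothetical helper name, illustrative coefficients) computes R_X[m] from given b_i and σ_V² and compares it with the sample autocorrelation of a simulated MA(q) sequence.

```python
import numpy as np

def ma_autocorr(b, sigma_v2, max_lag):
    """R_X[m] = sigma_v^2 * sum_j b[j] b[j+|m|] for |m| <= q, else 0."""
    b = np.asarray(b, dtype=float)
    q = len(b) - 1
    return np.array([sigma_v2 * np.dot(b[:q - m + 1], b[m:]) if m <= q else 0.0
                     for m in range(max_lag + 1)])

b, sigma_v2 = [1.0, 0.5, -0.3], 1.0
print(ma_autocorr(b, sigma_v2, 5))         # theoretical R_X[0..5]; zero beyond m = q = 2

# Monte Carlo check
rng = np.random.default_rng(2)
v = rng.standard_normal(200_000) * np.sqrt(sigma_v2)
x = np.convolve(v, b, mode="valid")        # X[n] = sum_i b[i] v[n-i]
R_hat = [np.dot(x[m:], x[:len(x) - m]) / len(x) for m in range(6)]
print(np.round(R_hat, 3))
```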
Figure: Power spectrum of an MA process (the PSD S_X(ω) plotted against ω, alongside the autocorrelation R_X[m], which vanishes beyond lag q).

Example 1: MA(1) process
   X[n] = b_1 v[n-1] + b_0 v[n].
Here the parameters b_0 and b_1 are to be determined. With σ_V² known, we have
   \sigma_X^2 = (b_0^2 + b_1^2)\,\sigma_V^2
   R_X[1] = b_0 b_1\,\sigma_V^2.
From these, b_0 and b_1 can be calculated using the variance and the lag-1 autocorrelation of the signal.

3.4 Autoregressive Model

In time series analysis this is called the AR(p) model: the white noise v[n] drives an all-pole IIR filter whose output is X[n]. The model is given by the difference equation
   X[n] = \sum_{i=1}^{p} a_i X[n-i] + v[n].
Define the polynomial
   A(\omega) = 1 - \sum_{i=1}^{p} a_i e^{-j\omega i},
so that the all-pole transfer function is H(ω) = 1/A(ω) and
   S_X(\omega) = \frac{\sigma_V^2}{2\pi\,|A(\omega)|^2}.
If there are sharp peaks in the spectrum, the AR(p) model may be suitable (figure: a peaky S_X(ω) and the corresponding slowly decaying R_X[m]).

The autocorrelation function R_X[m] is given by
   R_X[m] = E\, X[n] X[n-m]
          = \sum_{i=1}^{p} a_i\, E\, X[n-i] X[n-m] + E\, v[n] X[n-m]
          = \sum_{i=1}^{p} a_i R_X[m-i] + \sigma_V^2\,\delta[m],   m \ge 0.
The above relation gives a set of linear equations which can be solved to find the a_i. These equations are known as the Yule-Walker equations.

Example 2: AR(1) process
   X[n] = a_1 X[n-1] + v[n]
   R_X[m] = a_1 R_X[m-1] + \sigma_V^2\,\delta[m]
   \therefore\; R_X[0] = a_1 R_X[1] + \sigma_V^2   (1)
   and   R_X[1] = a_1 R_X[0],
so that a_1 = R_X[1]/R_X[0]. From (1),
   \sigma_X^2 = R_X[0] = \frac{\sigma_V^2}{1 - a_1^2},
and after some arithmetic,
   R_X[m] = \frac{\sigma_V^2\, a_1^{|m|}}{1 - a_1^2}.
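Since the Yule-Walker equations are linear in the a_i, they can be solved directly. The sketch below (hypothetical function name) estimates AR coefficients and the driving-noise variance from an autocorrelation sequence and checks the AR(1) closed form above.

```python
import numpy as np

def yule_walker(r, p):
    """Solve sum_i a_i R[m-i] = R[m], m = 1..p, for the AR(p) coefficients.

    r : autocorrelation sequence [R[0], R[1], ..., R[p]]
    Returns (a, sigma_v2) with sigma_v2 = R[0] - sum_i a_i R[i].
    """
    r = np.asarray(r, dtype=float)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])
    sigma_v2 = r[0] - a @ r[1:p + 1]
    return a, sigma_v2

# AR(1) check: R_X[m] = sigma_v^2 a1^|m| / (1 - a1^2)
a1, sigma_v2 = 0.8, 0.68
r = sigma_v2 * a1 ** np.arange(0, 4) / (1 - a1**2)
print(yule_walker(r, 1))    # ~ (array([0.8]), 0.68)
print(yule_walker(r, 2))    # second coefficient ~ 0 for a true AR(1) process
```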
    3.5 ARMA(p,q) –Autoregressive Moving Average Model Under the most practical situation, the process may be considered as an output of a filter that has both zeros and poles. X [ n] v[n] B ( w) H ( w) = A( w) The model is given by p q x[n] = ∑ ai X [ n − i ] + ∑ bi v[ n − i ] (ARMA 1) i =1 i =0 and is called the ARMA( p, q ) model. The transfer function of the filter is given by B (ω ) H ( w) = A(ω ) B (ω ) σ V 22 S X ( w) = A(ω ) 2π 2 How do get the model parameters? For m ≥ max( p, q + 1), there will be no contributions from bi terms to RX [ m]. p RX [ m] = ∑ ai RX [ m − i ] m ≥ max( p, q + 1) i =1 From a set of p Yule Walker equations, ai parameters can be found out. Then we can rewrite the equation p X [n] = X [n] − ∑ ai X [n − i ] i =1 q ∴ X [n] = ∑ bi v[n − i ] i =0 From the above equation bi s can be found out. The ARMA( p, q ) is an economical model. Only AR ( p ) only MA( q ) model may require a large number of model parameters to represent the process adequately. This concept in model building is known as the parsimony of parameters. The difference equation of the ARMA( p, q ) model, given by eq. (ARMA 1) can be reduced to p first-order difference equation give a state space representation of the random process as follows: z[n] = Az[n − 1] + Bu[n] X [n] = Cz[n] where
    z[n] = [x[ n] X [n − 1].... X [ n − p]]′ ⎡ a1 a2 ......a p ⎤ ⎢ ⎥ A= ⎢ 0 1......0 ⎥ , B = [1 0...0]′ and ⎢............... ⎥ ⎢ ⎥ ⎢ 0 0......1 ⎥ ⎣ ⎦ C = [b0 b1...bq ] Such representation is convenient for analysis. 3.6 General ARMA( p, q) Model building Steps • Identification of p and q. • Estimation of model parameters. • Check the modeling error. • If it is white noise then stop. Else select new values for p and q and repeat the process. 3.7 Other model: To model nonstatinary random processes • ARIMA model: Here after differencing the data can be fed to an ARMA model. • SARMA model: Seasonal ARMA model etc. Here the signal contains a seasonal fluctuation term. The signal after differencing by step equal to the seasonal period becomes stationary and ARMA model can be fitted to the resulting data.
    CHAPTER – 4:ESTIMATION THEORY 4.1 Introduction • For speech, we have LPC (linear predictive code) model, the LPC-parameters are to be estimated from observed data. • We may have to estimate the correct value of a signal from the noisy observation. In RADAR signal processing In sonar signal processing Array of sensors Signals generated by the submarine due Mechanical movements of the submarine • Estimate the target, target distance from the observed data • Estimate the location of the submarine. Generally estimation includes parameter estimation and signal estimation. We will discuss the problem of parameter estimation here. We have a sequence of observable random variables X 1 , X 2 ,...., X N , represented by the vector ⎡ X1 ⎤ ⎢ ⎥ X X=⎢ 2 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢XN ⎥ ⎣ ⎦ X is governed by a joint density junction which depends on some unobservable parameter θ given by f X ( x1 , x 2 ,..., x N | θ ) = f X ( x | θ ) where θ may be deterministic or random. Our aim is to make an inference on θ from an observed sample of X 1 , X 2 ,...., X N . An estimator θˆ(X) is a rule by which we guess about the value of an unknown θ on the basis of X.
    θˆ(X) is arandom, being a function of random variables. For a particular observation x1 , x2 ,...., xN , we get what is known as an estimate (not estimator) Let X 1 , X 2 ,...., X N be a sequence of independent and identically distributed (iid) random variables with mean µ X and variance σ X . 2 1 N µ= ˆ ∑ X i is an estimator for µ X . N i =1 1 N σX = ˆ2 ∑ (Y1 − µ X ) is an estimator for ˆ 2 σX. 2 N i =1 An estimator is a function of the random sequence X 1 , X 2 ,...., X N and if it does not involve any unknown parameters. Such a function is generally called a statistic. 4.2 Properties of the Estimator A good estimator should satisfy some properties. These properties are described in terms of the mean and variance of the estimator. 4.3 Unbiased estimator An estimator θˆ of θ is said to be unbiased if and only if Eθˆ = θ . The quantity Eθˆ − θ is called the bias of the estimator. Unbiased ness is necessary but not sufficient to make an estimator a good one. N 1 Consider σ 12 = ˆ N ∑(X i =1 i − µ X )2 ˆ 1 N and σ 22 = ˆ ∑ ( X i − µ X )2 N − 1 i =1 ˆ for an iid random sequence X 1 , X 2 ,...., X N . We can show that σ 2 is an unbiased estimator. ˆ2 N E ∑ ( X i − µ X )2 = E ∑ ( X i − µ X + µ X − µ X )2 ˆ ˆ i =1 = E ∑ {( X i − µ X ) 2 + ( µ X − µ X ) 2 + 2( X i − µ X )( µ X − µ X )} ˆ ˆ Now E ( X i − µ X ) 2 = σ 2 2 ⎛ ∑ Xi ⎞ and E (µ X − µ X ) = E⎜ µ X − 2 ˆ ⎟ ⎝ N ⎠
    E = 2 ( N µ X − ∑ X i )2 N E = 2 (∑( X i − µ X )) 2 N E = 2 ∑( X i − µ X ) 2 + ∑ ∑ E ( X i − µ X )( X j − µ X ) N i j ≠i E = 2 ∑( X i − µ X ) 2 (because of independence) N σX 2 = N also E ( X i − µ X )( µ X − µ X ) = − E ( X i − µ X ) 2 ˆ N ∴ E ∑ ( X i − µ X ) 2 = Nσ 2 + σ 2 − 2σ 2 = ( N − 1)σ 2 ˆ i =1 1 So Eσ 2 = ˆ2 E ∑( X i − µ X ) 2 = σ 2 ˆ N −1 ∴σ 2 is an unbiased estimator of σ .2 ˆ2 Similarly sample mean is an unbiased estimator. N 1 µX = ˆ N ∑X i =1 i 1 N Nµ X Eµ X = ˆ N ∑ E{ X } = i =1 i N = µX 4.4 Variance of the estimator The variance of the estimator θˆ is given by var(θˆ) = E (θˆ − E (θˆ)) 2 For the unbiased case var(θˆ) = E (θˆ − θ ) 2 The variance of the estimator should be or low as possible. An unbiased estimator θˆ is called a minimum variance unbiased estimator (MVUE) if E (θˆ − θ ) 2 ≤ E (θˆ′ − θ ) 2 where θˆ′ is any other unbiased estimator.
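The derivation above shows that dividing by N − 1 gives an unbiased variance estimate, while dividing by N biases it by the factor (N − 1)/N. A short Monte Carlo sketch (illustrative parameters only, not part of the notes) makes this visible.

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials, sigma2 = 10, 100_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_hat = x.mean(axis=1, keepdims=True)

s2_biased   = ((x - mu_hat) ** 2).sum(axis=1) / N          # divide by N
s2_unbiased = ((x - mu_hat) ** 2).sum(axis=1) / (N - 1)    # divide by N-1

print("E[biased]   ~", s2_biased.mean())     # ~ sigma2 * (N-1)/N = 3.6
print("E[unbiased] ~", s2_unbiased.mean())   # ~ sigma2 = 4.0
```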
    4.5 Mean squareerror of the estimator MSE = E (θˆ − θ ) 2 MSE should be as as small as possible. Out of all unbiased estimator, the MVUE has the minimum mean square error. MSE is related to the bias and variance as shown below. MSE = E (θˆ − θ ) 2 = E (θˆ − Eθˆ + Eθˆ − θ ) 2 = E (θˆ − Eθˆ) 2 + E ( Eθˆ − θ ) 2 + 2 E (θˆ − Eθˆ)( Eθˆ − θ ) = E (θˆ − Eθˆ) 2 + E ( Eθˆ − θ ) 2 + 2( Eθˆ − Eθˆ)( Eθˆ − θ ) = var(θˆ) + b 2 (θˆ) + 0 ( why ?) MSE = var(θˆ) + b 2 (θ ) ˆ So 4.6 Consistent Estimators As we have more data, the quality of estimation should be better. This idea is used in defining the consistent estimator. An estimator θˆ is called a consistent estimator of θ if θˆ converges in probability to θ. lim N →∞ P ( θˆ-θ ≥ ε ) = 0 for any ε > 0 Less rigorous test is obtained by applying the Markov Inequality MSE ( P θˆ-θ ≥ ε ) ≤ E (θˆ-θ ) 2 ε2 If θˆ is an unbiased estimator ( b(θˆ) = 0 ), then MSE = var(θˆ). Therefore, if Lim E (θˆ−θ )2 =0 , then θˆ will be a consistent estimator. N →∞ Also note that MSE = var(θˆ) + b 2 (θˆ) Therefore, if the estimator is asymptotically unbiased (i.e. b(θˆ) → 0 as N → ∞) and var(θˆ) → 0 as N → ∞, then MSE → 0. ∴ Therefore for an asymptotically unbiased estimator θˆ, if var(θˆ) → 0, as N → ∞, then θˆ will be a consistent estimator.
    Example 1: X 1 , X 2 ,..., X N is an iid random sequence with unknown µ X and known variance σ X . 2 1 N Let µ X = ˆ ∑ Xi be an estimator for µ X . We have already shown that µ X is unbiased. ˆ N i=1 σX 2 Also var( µ X ) = ˆ Is it a consistent estimator? N σ2 Clearly lim var( µ X ) = lim X = 0. Therefore µ X is a consistent estimator of µ X . ˆ ˆ N →∞ N →∞ N 4.7 Sufficient Statistic The observations X 1 , X 2 .... X N contain information about the unknown parameter θ . An estimator should carry the same information about θ as the observed data. This concept of sufficient statistic is based on this idea. ˆ A measurable function θ ( X 1 , X 2 .... X N ) is called a sufficient statistic of θ if it contains the same information about θ as contained in the random sequence X 1 , X 2 .... X N . In other word the joint conditional density f X ˆ ( x1 , x2 ,... x N ) 1 , X 2 .. X N |θ ( X 1 , X 2 ... X N ) does not involve θ . There are a large number of sufficient statistics for a particular criterion. One has to select a sufficient statistic which has good estimation properties. A way to check whether a statistic is sufficient or not is through the Factorization theorem which states: θˆ( X 1 , X 2 .... X N ) is a sufficient statistic of θ if f X1 , X 2 .. X N /θ ( x1 , x2 ,...xN ) = g (θ ,θˆ)h( x1 , x2 ,...xN ) where g (θ , θˆ ) is a non-constant and nonnegative function of θ and θˆ and h( x1 , x2 ,...x N ) does not involve θ and is a nonnegative function of x1 , x2 ,...xN . Example 2: Suppose X 1 , X 2 ,..., X N is an iid Gaussian sequence with unknown mean µ X and known variance 1. 1 N Then µ X = ˆ ∑ Xi is a sufficient statistic of . N i=1
    ( ) 1 2 N 1 − x −µ X 2 i f X1 , X 2 .. X N / µ X ( x1 , x2 ,...xN ) = ∏ e i =1 2π ( ) 1N 2 1 − ∑ x −µ X 2 i =1 i = e ( 2π ) N ( ) 1N 2 1 − ∑ x −µ + µ −µ ˆ ˆ 2 i =1 i Because = e ( 2π ) N 1 − 1N ( ˆ 2 ˆ 2 ∑ ( x − µ X ) + ( µ − µ X ) + 2( xi − µ )( µ − µ X ) 2 i =1 i ˆ ˆ ) = e ( 2π ) N 1 − 1N ˆ 2 ∑ ( x −µ X ) 2 i =1 i − 1N ( ∑ ( µ −µ X ) ˆ 2 ) = e e 2 i =1 e0 ( why ?) ( 2π ) N The first exponential is a function of x1 , x2 ,...x N and the second exponential is a function of µ X and µ X . Therefore µ X is a sufficient statistics of µ X . ˆ ˆ 4.8 Cramer Rao theorem We described about the Minimum Variance Unbiased Estimator (MVUE) which is a very good estimator θˆ is an MVUE if E (θˆ ) = θ and Var (θˆ) ≤ Var (θˆ′) where ˆ θ′ is any other unbiased estimator of θ . Can we reduce the variance of an unbiased estimator indefinitely? The answer is given by the Cramer Rao theorem. Suppose θˆ is an unbiased estimator of random sequence. Let us denote the sequence by the vector ⎡X1 ⎤ ⎢X ⎥ X=⎢ 2⎥ ⎢ ⎥ ⎢ ⎥ ⎣X N ⎦ Let f X ( x1 ,......, x N / θ ) be the joint density function which characterises X. This function is also called likelihood function. θ may also be random. In that case likelihood function will represent conditional joint density function. L( x / θ ) = ln f X ( x1 ...... x N / θ ) is called log likelihood function.
    4.9 Statement ofthe Cramer Rao theorem If θˆ is an unbiased estimator of θ , then 1 Var (θˆ) ≥ I (θ ) ∂L 2 where I (θ ) = E ( ) and I (θ ) is a measure of average information in the random ∂θ sequence and is called Fisher information statistic. ∂L The equality of CR bound holds if = c(θˆ − θ ) where c is a constant. ∂θ Proof: θˆ is an unbiased estimator of θ ∴ E (θˆ − θ ) = 0 . ∞ ˆ ⇒ ∫ (θ − θ ) f X (x / θ )dx = 0. −∞ Differentiate with respect to θ , we get ∞ ∂ −∞ ∫ ∂θ {(θˆ − θ ) f X ( x / θ )}dx = 0. (Since line of integration are not function of θ.) ∞ ∞ ∂ = ∫ (θˆ − θ ) f (x / θ )}dy − ∫ f X (x / θ )dx = 0. ∂θ X −∞ −∞ ∞ ∞ ∂ ∴ ∫ (θˆ − θ ) f (x / θ )}dy = ∫ f X (x / θ )dx = 1. (1) ∂θ X −∞ −∞ ∂ ∂ Note that f X (x / θ ) = {ln f X ( x / θ )} f X ( x / θ ) ∂θ ∂θ ∂L = ( ∂θ ) f X (x / θ ) Therefore, from (1) ∞ ∂ ∫ (θˆ − θ ){∂θ L(x / θ )} f −∞ X (x / θ )}dx = 1. So that 2 ⎧∞ ˆ ∂ ⎫ ⎨ ∫ (θ − θ ) f X (x / θ ) L(x / θ ) f X (x / θ )dx dx ⎬ = 1 . (2) ⎩ −∞ ∂θ ⎭ since f X (x / θ ) is ≥ 0. Recall the Cauchy Schawarz Ineaquality 2 2 2 < a, b > < a b
    where the equalityholds when a = cb ( where c is any scalar ). Applying this inequality to the L.H.S. of equation (2) we get 2 ⎛∞ ˆ ∂ ⎞ ⎜ ∫ (θ − θ ) f X (x / θ ) L(x / θ ) f X (x / θ )dx dx ⎟ ⎝ −∞ ∂θ ⎠ ∞ ∞ 2 ⎛ ∂ ⎞ ≤ ∫ (θˆ-θ ) 2 f X (x / θ ) dx ∫ ⎜ ( ∂θ L(x / θ ) ⎟ f X (x / θ ) dx −∞ -∞ ⎝ ⎠ = var(θˆ ) I(θ ) ∴ L.H .S ≤ var(θˆ ) I(θ ) But R.H.S. = 1 var(θˆ) I(θ ) ≥ 1. 1 ∴ var(θˆ) ≥ , I (θ ) which is the Cramer Rao Inequality. The equality will hold when ∂ {L( x / θ ) f X ( x / θ )} = c(θˆ − θ ) f X ( x / θ ) , ∂θ so that ∂L(x / θ ) = c (θˆ-θ ) ∂θ ∞ Also from ∫ f X (x / θ )dx = 1, we get −∞ ∞ ∂ ∫ ∂θ −∞ f X (x / θ )dx = 0 ∞ ∂L ∴∫ f (x / θ )dx = 0 −∞ ∂θ X Taking the partial derivative with respect to θ again, we get ∞ ⎧ ∂2L ∂L ∂ ⎫ ∫ ⎨ 2 f X (x / θ ) + f X (x / θ ) ⎬ dx = 0 −∞ ⎩ ∂θ ∂θ ∂θ ⎭ ⎧ ∂2L ⎪∞ ⎛ ∂L ⎞ 2 ⎫ ⎪ ∴ ∫ ⎨ 2 f X (x / θ ) + ⎜ ⎟ f X ( x / θ ) ⎬ dx = 0 −∞ ⎪ ∂θ ⎝ ∂θ ⎠ ⎪ ⎩ ⎭ 2 ⎛ ∂L ⎞ ∂ 2L E⎜ ⎟ =-E 2 ⎝ ∂θ ⎠ ∂θ If θˆ satisfies CR -bound with equality, then θˆ is called an efficient estimator.
    Remark: (1) If theinformation I (θ ) is more, the variance of the estimator θ ˆ will be less. (2) Suppose X 1 ............. X N are iid. Then 2 ⎛ ∂ ⎞ I1 (θ ) = E ⎜ ln( f X /θ ( x)) ⎟ ⎝ ∂θ 1 ⎠ 2 ⎛ ∂ ⎞ ∴ I N (θ ) = E ⎜ ln( f X , X .. X / θ ( x1 x2 ..xN )) ⎟ ⎝ ∂θ 1 2 N ⎠ = NI1 (θ ) Example 3: Let X 1 ............. X N are iid Gaussian random sequence with known variance σ 2 and unknown mean µ . N 1 Suppose µ = ˆ N ∑X i =1 i which is unbiased. Find CR bound and hence show that µ is an efficient estimator. ˆ Likelihood function f X ( x1 , x 2 ,..... x N / θ ) will be product of individual densities (since iid) 1 N ( x − µ )2 1 − ∑ ∴ f X ( x1 , x 2 ,..... x N / θ ) = e 2σ 2 i =1 i ( ( 2π ) N σ N N 1 so that L( X / µ ) = −ln( 2π ) N σ N − 2σ 2 ∑(x i =1 i − µ )2 ∂L 1 N Now = 0 - 2 ( -2) ∑ (X i − µ ) ∂µ 2σ / i =1 ∂2L N ∴ = - 2 ∂µ 2 σ ∂2L N So that E =- 2 ∂µ 2 σ 1 1 1 σ2 ∴ CR Bound = = = N = I (θ ) ∂2L N - E 2 σ2 ∂µ ∂L 1 N N ⎛ Xi ⎞ = ∂θ 2σ 2 ∑ (X i =1 i − µ) = 2 ∑ ⎜ σ ⎝ i N - µ⎟ ⎠ N = (µ - µ ) ˆ σ2
    ∂L ˆ Hence - = c (θ - θ ) ∂θ and µ is an efficient estimator. ˆ 4.10 Criteria for Estimation The estimation of a parameter is based on several well-known criteria. Each of the criteria tries to optimize some functions of the observed samples with respect to the unknown parameter to be estimated. Some of the most popular estimation criteria are: • Maximum Likelihood • Minimum Mean Square Error. • Baye’s Method. • Maximum Entropy Method. 4.11 Maximum Likelihood Estimator (MLE) Given a random sequence X 1 ................. X N and the joint density function f X1................. X N /θ ( x1 , x2 ...xN ) which depends on an unknown nonrandom parameter θ . f X ( x 1 , x 2 , ........... x N / θ ) is called the likelihood function (for continuous function …., for discrete it will be joint probability mass function). L( x / θ ) = ln f X ( x1 , x 2 ,........., x N / θ ) is called log likelihood function. ˆ The maximum likelihood estimator θ MLE is such an estimator that f X ( x1 , x2 ........., xN / θˆMLE ) ≥ f X ( x1, x2 ,........, xN / θ ), ∀θ ˆ If the likelihood function is differentiable w.r.t. θ , then θ MLE is given by ∂ f X ( x1 , … x N /θ ) θ = 0 ˆ ∂θ MLE ∂L(x | θ ) or ˆ =0 ∂θ θ MLE Thus the MLE is given by the solution of the likelihood equation given above. ⎡θ1 ⎤ ⎢θ ⎥ If we have a number of unknown parameters given by θ = ⎢ ⎥ 2 ⎢ ⎥ ⎢ ⎥ ⎣θ N ⎦
Then the MLE is given by the set of conditions
   \frac{\partial L}{\partial \theta_1}\Big|_{\hat\theta} = \frac{\partial L}{\partial \theta_2}\Big|_{\hat\theta} = \dots = \frac{\partial L}{\partial \theta_M}\Big|_{\hat\theta} = 0.

Example 4: Let X_1, ..., X_N be an independent, identically distributed sequence of N(μ, σ²) random variables. Find the MLE for μ and σ².

   f_X(x_1, \ldots, x_N / \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2}
   L(X/\mu, \sigma^2) = -\frac{N}{2}\ln 2\pi - N\ln\sigma - \frac{1}{2}\sum_{i=1}^{N}\left(\frac{x_i-\mu}{\sigma}\right)^2
   \frac{\partial L}{\partial \mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N}(x_i - \hat\mu_{MLE}) = 0
   \frac{\partial L}{\partial \sigma} = 0 \;\Rightarrow\; -\frac{N}{\hat\sigma_{MLE}} + \frac{\sum_{i=1}^{N}(x_i - \hat\mu_{MLE})^2}{\hat\sigma_{MLE}^3} = 0.
Solving, we get
   \hat\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i   and   \hat\sigma_{MLE}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat\mu_{MLE})^2.

Example 5: Let X_1, ..., X_N be an independent, identically distributed sequence with
   f_{X/\theta}(x) = \frac{1}{2}\, e^{-|x-\theta|},   -\infty < x < \infty.
Show that the median of X_1, ..., X_N is the MLE for θ.

   f_{X_1,\ldots,X_N/\theta}(x_1, \ldots, x_N) = \frac{1}{2^N}\, e^{-\sum_{i=1}^{N}|x_i-\theta|}
   L(X/\theta) = -N\ln 2 - \sum_{i=1}^{N}|x_i-\theta|,
and \sum_{i=1}^{N}|x_i-\theta| is minimized by θ = median(X_1, ..., X_N).
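Both examples can be confirmed numerically. The sketch below (simulated data and a brute-force grid search; not part of the notes) shows that the Gaussian log-likelihood peaks at the sample mean and that Σ|x_i − θ| is minimized at the sample median.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.5, size=500)               # data with mu = 2, sigma = 1.5
theta_grid = np.linspace(0.0, 4.0, 4001)

# Example 4: Gaussian log-likelihood as a function of mu (sigma held fixed).
loglik = -0.5 * ((x[:, None] - theta_grid) ** 2).sum(axis=0)
print("grid argmax:", theta_grid[loglik.argmax()], " sample mean:", x.mean())

# Example 5: Laplacian log-likelihood is -N ln 2 - sum |x_i - theta|.
cost = np.abs(x[:, None] - theta_grid).sum(axis=0)
print("grid argmin:", theta_grid[cost.argmin()], " sample median:", np.median(x))
```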
    Some properties ofMLE (without proof) • MLE may be biased or unbiased, asymptotically unbiased. • MLE is consistent estimator. • If an efficient estimator exists, it is the MLE estimator. ˆ An efficient estimator θ exists => ∂ ˆ L(x / θ) = c( θ − θ) ∂θ ˆ at θ = θ, ∂L( x/θ) ˆ = c(θˆ − θˆ) = 0 ∂θ θ ˆ ⇒ θ is the MLE estimator. • Invariance Properties of MLE: ˆ ˆ If θ MLE is the MLE of θ , then h(θ MLE ) is the MLE of h(θ ), where h(θ ) is an invertible function of θ. 4.12 Bayescan Estimators We may have some prior information about θ in a sense that some values of θ are more likely (a priori information). We can represent this prior information in the form of a prior density function. In the folowing we omit the suffix in density functions just for notational simplicity. Obervation x f (x / θ ) Parameter θ with density fθ (θ ) The likelihood function will now be the conditional density f ( x / θ ). f X ,Θ ( x,θ ) = f Θ (θ ) f X / Θ ( x ) Also we have the Bayes rule f Θ (θ ) f X / Θ (x) f Θ / X (θ ) = f X ( x) where f Θ / X (θ ) is the a posteriori density function
    The parameter θis a random variable and the estimator θˆ( x ) is another random variable. ˆ Estimation error ε = θ − θ . ˆ ˆ We associate a cost function C (θ, θ ) with every estimator θ. It represents the positive penalty with each wrong estimation. C (ε ) ˆ Thus C (θ, θ ) is a non negative function. The three most popular cost functions are: ε ˆ Quadratic cost function ( θ − θ ) 2 C (ε ) ˆ Absolute cost function θ − θ Hit or miss cost function (also called uniform cost function) ε minimising means minimising on an average) 4.13 Bayesean Risk function or average cost C (ε ) 1 ∞ ∞ ˆ ˆ C = EC (θ, θ) = ∫ ∫ C(θ , θ) f −∞ −∞ X ,Θ ( x , θ )dx dθ -δ/2 δ/2 ε The estimator seeks to minimize the Bayescan Risk. Case I. Quadratic Cost Function ˆ ˆ C = (θ , θ) = (θ - θ) 2 Estimation problem is ∞ ∞ ˆ ∫ ∫ (θ − θ) 2 Minimize f X ,Θ ( x , θ )dx dθ −∞ −∞ ˆ with respect to θ . This is equivalent to minimizing ∞ ∞ ∫ ∫ (θ − θˆ) 2 f(θ | x)f ( x )dθ dx −∞−∞ ∞ ∞ = ∫ ( ∫ (θ − θˆ) f(θ | x)dθ ) f (x) dx 2 −∞ −∞ Since f ( x ) is always +ve, the above integral will be minimum if the inner integral is minimum. This results in the problem: ∞ ∫ (θ − θˆ) 2 Minimize f(θ | x)dθ ) −∞
    ˆ with respect toθ . ∞ ∂ ˆ ˆ ∫ => (θ − θ)2 f Θ / X (θ ) dθ = 0 ∂θ −∞ ∞ ˆ => −2 ∫ (θ − θ) f Θ / X (θ ) dθ = 0 −∞ ∞ ∞ ˆ => θ ∫ f Θ / X (θ ) dθ = −∞ ∫θ f −∞ Θ/ X (θ ) dθ ∞ ˆ => θ = −∞ ∫θ f Θ/ X (θ )dθ ˆ ∴θ is the conditional mean or mean of the a posteriori density. Since we are minimizing quadratic cost it is also called minimum mean square error estimator (MMSE). Salient Points • Information about distribution of θ available. • a priori density function f Θ (θ ) is available. This denotes how observed data depend on θ • We have to determine a posteriori density f Θ / X (θ ) . This is determined form the Bayes rule. 1 Case II Hit or Miss Cost Function ∞ ∞ ˆ ˆ Risk C = EC (θ, θ) = ∫ ∫ C(θ , θ) f −∞ −∞ X ,Θ ( x , θ )dx dθ ∞ ∞ ˆ ∫ ∫ c(θ , θ) f Θ / X (θ ) f X (x)dθ dx →∈ = −∞ −∞ −∆ ∆ 2 ∞ ∞ 2 ˆ = ∫ ( ∫ c(θ , θ) f −∞ −∞ Θ/ X (θ )dθ ) f X (x) dx We have to minimize ∞ ˆ ˆ −∞ ∫ C (θ,θ) f Θ/ X (θ )dθ with respect to θ. This is equivalent to minimizing ˆ ∆ θ + 2 =1- ∫ ∆ f Θ / X (θ ) dθ ˆ θ − 2
    This minimization isequivalent to maximization of ˆ ∆ θ + 2 ∆ ∫ ˆ f Θ / X (θ ) dθ ≅ ∆f Θ / X (θ ) when ∆ is very small ˆ θ − 2 ˆ This will be maximum if f Θ / X (θ ) is maximum. That means select that value of θ that maximizes the a posteriori density. So this is known as maximum a posteriori estimation (MAP) principlee. ˆ This estimator is denoted by θ MAP . Case III C (θˆ,θ ) = θˆ − θ C (ε ) C = Average cost=E θˆ − θ ∞ ∞ = ∫ ∫ θˆ − θ fθ , X (θ , x)dθ dx −∞ −∞ ε = θˆ − θ ∞ ∞ = ∫ ∫ θˆ − θ fθ (θ ) f X /θ (x)dθ dx −∞ −∞ ∞ ∞ = ∫ ∫ θˆ − θ f X /θ (x)dxfθ (θ )dθ −∞ −∞ For the minimum ∂ ∞ ∫ C (θˆ,θ ) fθ / X (θ | x) dθ = 0 ˆ −∞ ∂θ ∂ ⎧ θˆ ˆ ∞ ⎨ ∫ (θ − θ ) fθ / X (θ | x) dθ + ∫ (θ − θˆ) fθ / X (θ | x) dθ = 0 ∂θˆ ⎩ −∞ θˆ Leibniz rule for differentiation of an integration ∂ 02 ( u ) / / 02 ( u ) ∂h(u, v) ∫ h (u , v) dv = ∫ dv ∂u 01 (u ) / / 01 ( u ) ∂u / d0 (u ) / d0 (u ) + 2 h(u , 02 (u )) − 1 / / h(u , 01 (u )) du du Applying Leibniz rule we get θˆ ∞ ∫ fθ / X (θ | x) dθ − ∫ fθ / X (θ | x) dθ = 0 −∞ θˆ ˆ At the θ MAE θˆMAE ∞ ∫ fθ / X (θ | x) dθ − ∫ fθ / X (θ | x) dθ = 0 −∞ θˆMAE ˆ So θ MAE is the median of the a posteriori density
    Example 6: Let X1 , X 2 ...., X N be an iid Gaussian sequence with unity Variance and unknown mean θ . Further θ is known to be a 0-mean Gaussian with Unity Variance. Find the MAP estimator for θ . Solution: We are given 1 − 1θ 2 f Θ (θ ) = e 2 2π N ( xi −θ )2 1 −∑ f Θ / X (θ ) = e i=1 2 ( 2π ) N f Θ (θ ) f X / Θ (x) Therefore f Θ / X (θ ) = f X ( x) We have to find θ , such that f Θ / X (θ ) is maximum. Now f Θ / X (θ ) is maximum when f Θ (θ ) f X / Θ (x) is maximum. ⇒ ln f Θ (θ ) f X / Θ (x) is maximum 1 N (x − θ )2 ⇒− θ2 −∑ i is maximum 2 i =1 2 N ⎤ ⇒ θ − ∑ ( x i − θ )⎥ =0 i =1 ⎦ θ =θˆMAP 1 ⇒ θˆMAP = ∑ xi N + 1 i =1 Example 7: Consider single observation X that depends on a random parameter θ . Suppose θ has a prior distribution fθ (θ ) = λ e − λθ for θ ≥ 0, λ > 0 and f X / Θ ( x) = θ e −θ x x >0 find the MAP estimation for θ . f Θ (θ ) f X / Θ ( x) f Θ / X (θ ) = f X ( x) ln( f (θ | x)) = ln( f Θ (θ )) + ln( f X / Θ ( x)) − ln f X ( x) Therefore MAP estimator is given by.
    ln f X / Θ (x) ]θ = 0 ˆ ∂θ MAP 1 ⇒ θˆMAP = λ+X Example 8: Binary Communication problem X Y + V X is a binary random variable with ⎧1 with probability 1/2 X =⎨ ⎩ −1 with probability 1/ 2 0 and variance σ .2 V is the Gaussian noise with mean To find the MSE for X from the observed data Y . Then 1 f x ( x ) = [δ ( x − 1) + δ ( x + 1)] 2 1 2 2 fY / X ( y / x ) = e − ( y − x ) / 2σ 2π σ f ( x ) fY / X ( y / x ) f X /Y ( x / y) = ∞ X ∫ f X ( x ) fY / X ( y / x ) fx −∞ − ( y − x ) 2 / 2σ 2 e [δ ( x − 1) + δ ( x + 1)] = − ( y −1)2 / 2σ 2 2 / 2σ 2 e + e − ( y +1) − ( y − x )2 / 2σ 2 ˆ e ∞ [δ ( x − 1) + δ ( x + 1)] X MMSE = E ( X / Y ) = ∫ x − ( y −1)2 / 2σ 2 2 / 2σ 2 dx −∞ e + e − ( y +1) 2 / 2σ 2 2 / 2σ 2 e− ( y −1) − e − ( y +1) Hence = 2 2 2 / 2σ 2 e − ( y −1) / 2σ + e− ( y +1) = tanh( y / σ 2 )
    To summarise f X / Θ ( x) MLE: Simplest θˆMLE θ MMSE: θˆMMSE = E (Θ / X) MMSE • Find a posteriori density. • Find the average value by integration • Lots of calculation hence it is computationally exhaustive. MAP: f Θ / X (θ ) ˆ θ MAP = Mode of the a posteriori density f Θ / X (θ ). θˆMAP 4.14 Relation between θˆMAP and θˆMLE From f Θ (θ ) f X / Θ (x) f Θ / X (θ ) = f X ( x) ln( f Θ / X (θ )) = ln( f Θ (θ )) + ln( f X / Θ (x)) − ln( f X (x)) θˆMAP is given by ∂ ∂ ln f Θ (θ ) + ln( f X / Θ (x)) = 0 ∂θ ∂θ a priori density likelihood function. Supose θ is uniformly distributed between θ MIN and θ MAX . Then ∂ ln f Θ (θ ) = 0 ∂θ
    f Θ (θ) θ MIN θ MAX θ ˆ If θ MIN ≤ θ MLE ≤ θ MAX ˆ ˆ then θ MAP = θ MLE ˆ If θ MAP ≤ θ MIN ˆ then θ MAP = θ MIN ˆ If θ MLE ≥ θ MAX ˆ then θ MAP = θ MAX
    CHAPTER – 5:OPTIMAL LINEAR FILTER: WIENER FILTER 5.1 Estimation of signal in presence of white Gaussian noise (WGN) Consider the signal model is Y [n ] = X [n ] + V [n ] X Y + V where Y [n ] is the observed signal, X [n ] is 0-mean Gaussian with variance 1 and V [n ] is a white Gaussian sequence mean 0 and variance 1. The problem is to find the best guess for X [n ] given the observation Y [i ], i = 1, 2,..., n Maximum likelihood estimation for X [n ] determines that value of X [n ] for which the sequence Y [i ], i = 1, 2,..., n is most likely. Let us represent the random sequence Y[n] = [Y [n], Y [n − 1],.., Y [1]]' Y [i ], i = 1, 2,..., n by the random vector and the value sequence y[1], y[2],..., y[n] by y[n] = [ y[n], y[n − 1],.., y[1]]'. The likelihood function f Y[ n ] / X [ n ] ( y[ n] / x[ n]) will be Gaussian with mean x[n ] n ( y [ i ]− x[ n ])2 1 − ∑ f Y[ n ] / X [ n ] ( y[n] / x[n]) = i =1 2 e ( ) n 2π Maximum likelihood will be given by ∂ ( fY [1],Y [2],...,Y [ n ]/ X [ n ] ( y[1], y[2],..., y[n]) / x[n]) ˆ [ ] = 0 ∂x[n] xMLE n 1 n ⇒ χ MLE [n] = ˆ ∑ y[i] n i =1 Similarly, to find χ MAP [n ] and ˆ χ MMSE [n] we have to find a posteriori density ˆ f X [ n ] ( x[n]) f Y[ n ]/ X [ n ] (y[n] / x[n]) f X [ n ]/ Y[ n ] ( x[n] / y[n]) = f Y[ n ] (y[n]) n ( y [ i ]− x[ n ])2 − x [ n ]− ∑ 1 2 1 2 i =1 2 = e f Y[ n ] (y[n])
    Taking logarithm 1 n ( y[i] − x[n])2 log e f X [ n ]/ Y[ n ] ( x[n]) = − x 2 [n] − ∑ − log e f Y[ n ] (y[n]) 2 i =1 2 ˆ log e f X [ n ]/ Y[ n ] ( x[ n]) is maximum at xMAP [n]. Therefore, taking partial derivative of log e f X [ n ]/ Y[ n ] ( x[ n]) with respect to x[ n] and equating it to 0, we get n x[n] − ∑ ( y[i] − x[n]) =0 i =1 ˆ xMAP [ n ] n ∑ y[i] xMAP [n] = ˆ i =1 n +1 Similarly the minimum mean-square error estimator is given by n ∑ y[i] xMMSE [n] = E ( X [n]/ y[n]) = ˆ i =1 n +1 • For MMSE we have to know the joint probability structure of the channel and the source and hence the a posteriori pdf. • Finding pdf is computationally very exhaustive and nonlinear. • Normally we may be having the estimated values first-order and second-order statistics of the data We look for a simpler estimator. The answer is Optimal filtering or Wiener filtering We have seen that we can estimate an unknown signal (desired signal) x[ n] from an observed signal y[ n] on the basis of the known joint distributions of y[ n] and x[ n]. We could have used the criteria like MMSE or MAP that we have applied for parameter es timations. But such estimations are generally non-linear, require the computation of a posteriori probabilities and involves computational complexities. The approach taken by Wiener is to specify a form for the estimator that depends on a number of parameters. The minimization of errors then results in determination of an optimal set of estimator parameters. A mathematically sample and computationally easier estimator is obtained by assuming a linear structure for the estimator.
    5.2 Linear MinimumMean Square Error Estimator x[ n] y[ n ] ˆ x[ n] Syste + Filter m Noise y[ n − M − 1] …….. y[ n] … n − M −1 n n+ N The linear minimum mean square error criterion is illustrated in the above figure. The problem can be slated as follows: Given observations of data y[ n − M + 1], y[ n − M + 2].. y[ n],... y[ n + N ], determine a set of parameters h[ M − 1], h[ M − 2]..., h[o],...h[ − N ] such that M −1 x[n] = ˆ ∑ h[i] y[n − i] i =− N E ( x[ n] − x[n]) is 2 and the mean square error ˆ a minimum with respect to h[ − N ], h[ − N + 1]...., h[o], h[i ],..., h[ M − 1]. This minimization problem results in an elegant solution if we assume joint stationarity of the signals x[ n] and y[ n] . The estimator parameters can be obtained from the second order statistics of the processes x[ n] and y[ n]. The problem of deterring the estimator parameters by the LMMSE criterion is also called the Wiener filtering problem. Three subclasses of the problem are identified 1. The optimal smoothing problem N > 0 2. The optimal filtering problem N =0 3. The optimal prediction problem N < 0
    In the smoothingproblem, an estimate of the signal inside the duration of observation of the signal is made. The filtering problem estimates the curent value of the signal on the basis of the present and past observations. The prediction problem addresses the issues of optimal prediction of the future value of the signal on the basis of present and past observations. 5.3 Wiener-Hopf Equations The mean-square error of estimation is given by Ee2 [n] = E ( x[n] − x[n]) 2 ˆ M −1 = E ( x[n] − ∑ h[i] y[n − i]) i =− N 2 We have to minimize Ee 2 [n] with respect to each h[i ] to get the optimal estimation. Corresponding minimization is given by ∂E {e2 [n]} = 0, for j = − N ..0..M − 1 ∂h[ j ] ∂ ( E being a linear operator, E and can be interchanged) ∂h[ j ] Ee[ n] y[ n - j ] = 0, j = − N ...0,1,...M − 1 (1) or e[ n ] ⎛ M −1 ⎞ E ⎜ x[n] - ∑ h[i] y[n − i] ⎟ y[n - j ] = 0, j = − N ...0,1,...M − 1 (2) ⎝ i =− N a ⎠ M −1 RXY ( j ) = ∑ h[i]R i =− N a YY [ j − i ], j = − N ...0,1,...M − 1 (3) This set of N + M + 1 equations in (3) are called Wiener Hopf equations or Normal equations. • The result in (1) is the orthogonality principle which implies that the error is orthogonal to observed data. • ˆ x[ n] is the projection of x[ n] onto the subspace spanned by observations y[n − M ], y[ n − M + 1].. y[ n],... y[ n + N ]. • The estimation uses second order-statistics i.e. autocorrelation and cross- correlation functions.
    If x[ n] and y[ n] are jointly Gaussian then MMSE and LMMSE are equivalent. Otherwise we get a sub-optimum result. • Also observe that x[ n] = x[ n] + e[ n] ˆ ˆ where x[ n] and e[n] are the parts of x[ n] respectively correlated and uncorrelated with y[ n]. Thus LMMSE separates out that part of x[ n] which is correlated with y[ n]. Hence the Wiener filter can be also interpreted as the correlation canceller. (See Orfanidis). 5.4 FIR Wiener Filter M −1 x[ n] = ∑ h[i ] y[ n − i ] ˆ i =0 The model parameters are given by the orthogonality principlee e[ n ] ⎛ M −1 ⎞ E ⎜ x[n] - ⎝ ∑ h[i] y[n − i]⎟ y[n - j] = 0, i =0 ⎠ j = 0,1,...M − 1 M −1 ∑ h[i]R i =0 YY [ j − i ] = RXY ( j ), j = 0,1,...M − 1 In matrix form, we have R YYh = rXY where ⎡ RYY [0] RYY [−1] .... RYY [1 − M ]⎤ ⎢ R [1] RYY [0] .... RYY [2 − M ]⎥ R YY = ⎢ YY ⎥ ⎢... ⎥ ⎢ ⎥ ⎢ RYY [ M − 1] RYY [ N − 2] .... ⎣ RYY [0] ⎥ ⎦ and ⎡ RXY [0] ⎤ ⎢ R [1] ⎥ rXY = ⎢ XY ⎥ ⎢... ⎥ ⎢ ⎥ ⎢ RXY [ M − 1] ⎣ ⎥ ⎦ and
    ⎡ h[0] ⎤ ⎢ h[1] ⎥ h= ⎢ ⎥ ⎢... ⎥ ⎢ ⎥ ⎣ h[ M − 1] ⎦ Therefore, −1 h = R YYrXY 5.5 Minimum Mean Square Error - FIR Wiener Filter ⎛ M −1 ⎞ E (e [n] = Ee[n] ⎜ x[n] − ∑ h[i ] y[n − i ] ⎟ 2 ⎝ i =0 ⎠ =Ee[n] x[n] ∵ error isorthogonal to data ⎛ M −1 ⎞ =E ⎜ x[n] − ⎝ ∑ h[i] y[n − i]⎟ x[n] i =0 ⎠ M −1 =RXX [0] − ∑ h[i ]RXY [i ] i =0 R XX rXY Wiener Estimation • • h[0] ˆ x[n] Z-1 h[1] Z-1 h[ M − 1]
Example 1: Noise Filtering

Consider a sinusoidal carrier observed in white Gaussian noise:
   x[n] = A\cos(\omega_0 n + \phi),   \omega_0 = \pi/4
   y[n] = x[n] + v[n],
where φ is uniformly distributed in (0, 2π) and v[n] is a white Gaussian noise sequence of variance 1, independent of x[n]. Find the parameters of the FIR Wiener filter with M = 3.

   R_{XX}[m] = \frac{A^2}{2}\cos\omega_0 m
   R_{YY}[m] = E\, y[n] y[n-m] = E\,(x[n]+v[n])(x[n-m]+v[n-m]) = R_{XX}[m] + R_{VV}[m] = \frac{A^2}{2}\cos\omega_0 m + \delta[m]
   R_{XY}[m] = E\, x[n] y[n-m] = E\, x[n](x[n-m]+v[n-m]) = R_{XX}[m].
Hence the Wiener-Hopf equations are
   \begin{bmatrix} R_{YY}[0] & R_{YY}[1] & R_{YY}[2] \\ R_{YY}[1] & R_{YY}[0] & R_{YY}[1] \\ R_{YY}[2] & R_{YY}[1] & R_{YY}[0] \end{bmatrix}
   \begin{bmatrix} h[0] \\ h[1] \\ h[2] \end{bmatrix}
   =
   \begin{bmatrix} R_{XX}[0] \\ R_{XX}[1] \\ R_{XX}[2] \end{bmatrix},
i.e.
   \begin{bmatrix} \frac{A^2}{2}+1 & \frac{A^2}{2}\cos\frac{\pi}{4} & \frac{A^2}{2}\cos\frac{\pi}{2} \\ \frac{A^2}{2}\cos\frac{\pi}{4} & \frac{A^2}{2}+1 & \frac{A^2}{2}\cos\frac{\pi}{4} \\ \frac{A^2}{2}\cos\frac{\pi}{2} & \frac{A^2}{2}\cos\frac{\pi}{4} & \frac{A^2}{2}+1 \end{bmatrix}
   \begin{bmatrix} h[0] \\ h[1] \\ h[2] \end{bmatrix}
   =
   \begin{bmatrix} \frac{A^2}{2} \\ \frac{A^2}{2}\cos\frac{\pi}{4} \\ \frac{A^2}{2}\cos\frac{\pi}{2} \end{bmatrix}.
Suppose A = 5 V. Then
   \begin{bmatrix} 13.5 & \frac{12.5}{\sqrt 2} & 0 \\ \frac{12.5}{\sqrt 2} & 13.5 & \frac{12.5}{\sqrt 2} \\ 0 & \frac{12.5}{\sqrt 2} & 13.5 \end{bmatrix}
   \begin{bmatrix} h[0] \\ h[1] \\ h[2] \end{bmatrix}
   =
   \begin{bmatrix} 12.5 \\ \frac{12.5}{\sqrt 2} \\ 0 \end{bmatrix}.
Solving,
   \begin{bmatrix} h[0] \\ h[1] \\ h[2] \end{bmatrix}
   =
   \begin{bmatrix} 13.5 & \frac{12.5}{\sqrt 2} & 0 \\ \frac{12.5}{\sqrt 2} & 13.5 & \frac{12.5}{\sqrt 2} \\ 0 & \frac{12.5}{\sqrt 2} & 13.5 \end{bmatrix}^{-1}
   \begin{bmatrix} 12.5 \\ \frac{12.5}{\sqrt 2} \\ 0 \end{bmatrix},
giving approximately
   h[0] = 0.703,   h[1] = 0.340,   h[2] = -0.223.
Plot the filter performance for the above values of h[0], h[1] and h[2]. The following figure shows the performance of a 20-tap FIR Wiener filter for noise filtering.
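A minimal numpy check of this example (a sketch; the rounded coefficient values above come from solving this system):

```python
import numpy as np

A, w0, M = 5.0, np.pi / 4, 3

def Rxx(m):
    return (A**2 / 2) * np.cos(w0 * m)

# R_YY[m] = R_XX[m] + delta[m]; build the 3x3 Toeplitz system and the RHS.
Ryy = np.array([[Rxx(i - j) + (1.0 if i == j else 0.0) for j in range(M)]
                for i in range(M)])
rxy = np.array([Rxx(m) for m in range(M)])

h = np.linalg.solve(Ryy, rxy)
mmse = Rxx(0) - h @ rxy                    # R_XX[0] - sum_i h[i] R_XY[i]
print(np.round(h, 3), round(float(mmse), 3))   # h ~ [0.703, 0.340, -0.223]
```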
Example 2: Active Noise Control

Suppose the observed signal is
   y[n] = 0.5\cos(\omega_0 n + \phi) + v_1[n],
where φ is uniformly distributed in (0, 2π) and v_1[n] = 0.6 v[n-1] + v[n] is an MA(1) noise (v[n] being unit-variance white noise). We want to cancel v_1[n] with the help of another correlated noise,
   v_2[n] = 0.8 v[n-1] + v[n],
applied to a 2-tap FIR filter. The Wiener-Hopf equations are
   \mathbf{R}_{V_2V_2}\,\mathbf{h} = \mathbf{r}_{V_1V_2},   \mathbf{h} = [\,h[0]\;\; h[1]\,]',
with
   \mathbf{R}_{V_2V_2} = \begin{bmatrix} 1.64 & 0.8 \\ 0.8 & 1.64 \end{bmatrix},   \mathbf{r}_{V_1V_2} = \begin{bmatrix} 1.48 \\ 0.6 \end{bmatrix},
so that
   \begin{bmatrix} h[0] \\ h[1] \end{bmatrix} = \begin{bmatrix} 0.9500 \\ -0.0976 \end{bmatrix}.

Example 3: (Continuous-time prediction)

Suppose we want to predict the continuous-time process X(t) at time (t+τ) by
   \hat X(t+\tau) = a X(t).
Then, by the orthogonality principle,
   E\,(X(t+\tau) - aX(t))\,X(t) = 0 \;\Rightarrow\; a = \frac{R_{XX}(\tau)}{R_{XX}(0)}.
As a particular case, consider the first-order Markov process given by
   \frac{d}{dt}X(t) = -A X(t) + v(t),   A > 0,
for which R_{XX}(\tau) = R_{XX}(0)\, e^{-A|\tau|}, so that
   a = \frac{R_{XX}(\tau)}{R_{XX}(0)} = e^{-A\tau}   (\tau \ge 0).
Observe that for such a process, for any τ_1 ≥ 0,
   E\,(X(t+\tau) - aX(t))\,X(t-\tau_1) = R_{XX}(\tau+\tau_1) - a R_{XX}(\tau_1)
        = R_{XX}(0)\,e^{-A(\tau+\tau_1)} - e^{-A\tau}\, R_{XX}(0)\,e^{-A\tau_1} = 0.
Therefore, the linear prediction of such a process based on any past value is the same as the linear prediction based on the current value.
    d X (t ) = AX (t ) + v(t ) dt In this case, RXX (τ ) = RXX (0)e − Aτ RXX (τ ) ∴a = = e − Aτ RXX (0) Observe that for such a process E ( X (t + τ ) − aX (t )) X (t − τ 1 ) = 0 = RXX (τ + τ 1 ) − aRXX (τ 1 ) = RXX (0)e − A(τ +τ1 ) − e − A(τ ) RXX (0)e − A(τ1 ) =0 Therefore, the linear prediction of such a process based on any past value is same as the linear prediction based on current value. 5.6 IIR Wiener Filter (Causal) Consider the IIR filter to estimate the signal x[ n] shown in the figure below. y[n] ˆ x[n] h[n] ˆ The estimator x[ n] is given by ∞ x( n) = ∑ h(i ) y ( n − i ) ˆ i =0 The mean-square error of estimation is given by Ee2 [n] = E ( x[n] − x[n]) 2 ˆ ∞ = E ( x[n] − ∑ h[i ] y[n − i ]) 2 i =0 We have to minimize Ee 2 [n] with respect to each h[i ] to get the optimal estimation. Applying the orthogonality principle, we get the WH equation. ∞ E ( x[n] − ∑ h(i ) y[n − i ]) y[n − j ] = 0, j = 0, 1, ..... i =0 From which we get ∞ ∑ h[i] i =0 RYY [ j − i ] = RXY [ j ], j = 0, 1, .....
    We have to find h[i ], i = 0,1,...∞ by solving the above infinite set of equations. • This problem is better solved in the z-transform domain, though we cannot directly apply the convolution theorem of z-transform. Here comes Wiener’s contribution. The analysis is based on the spectral Factorization theorem: SYY ( z ) = σ v2 H c ( z ) H c ( z −1 ) y[n] v[n] 1 H1 ( z ) = H c ( z) Whitening filter y[n] v[n] ˆ x[n] H1 ( Z ) H 2 ( z) Wiener filter Now h2 [n] is the coefficient of the Wiener filter to estimate x[ n] I from the innovation sequence v[n]. Applying the orthogonality principle results in the Wiener Hopf equation ∞ x ( n) = ∑ h2 (i )v( n − i ) ˆ i =0 ⎧ ∞ ⎫ E ⎨ x[n] − ∑ h2 [i ]v[n − i ]⎬ v[n − j ]) = 0 ⎩ i =0 ⎭ ∞ ∴ ∑ h2 [i ]RVV [ j − i ] = RXV [ j ], j = 0,1,... i =0 RVV [m] = σ V δ [m] 2 ∞ ∴ ∑ h2 (i )σ V δ [ j − i ] = RXV ( j ), j = 0,1,... 2 i =0 So that RXV [ j ] h2 [ j ] = j≥ 0 σV 2 H 2 ( z) = [ S XV ( z )] + σV 2 where [ S XV ( z )] + is the positive part (i.e., containing non-positive powers of z ) in power series expansion of S XV ( z ).
    ∞ v[n] = ∑h1[i ] y[n − i ] i =0 RXV [ j ] = Ex[n]v[n − j ] ∞ = ∑ hi [i ]E x[n] y[n - j - i ] i =0 ∞ = ∑ h [i]R i =0 1 XY [j + i] 1 S XV ( z ) = H1 ( z −1 ) S XY ( z ) = S XY ( z ) H c ( z −1 ) 1 ⎡ S XY ( z ) ⎤ ∴ H 2 ( z) = 2 ⎢ ⎥ σ V ⎣ H c ( z −1 ) ⎦ + Therefore, 1 ⎡ S XY ( z ) ⎤ H ( z ) = H1 ( z ) H 2 ( z ) = σ H c ( z ) ⎢ H c ( z −1 ) ⎥ + 2 V ⎣ ⎦ We have to • find the power spectrum of data and the cross power spectrum of the of the desired signal and data from the available model or estimate them from the data • factorize the power spectrum of the data using the spectral factorization theorem 5.7 Mean Square Estimation Error – IIR Filter (Causal) ⎛ ∞ ⎞ E (e 2 [n] = Ee[n] ⎜ x[n] − ∑ h[i ] y[n − i ] ⎟ ⎝ i =0 ⎠ =Ee[n] x[n] ∵ error isorthogonal to data ⎛ ∞ ⎞ =E ⎜ x[n] − ⎝ ∑ h[i] y[n − i]⎟ x[n] i =0 ⎠ ∞ =RXX [0] − ∑ h[i ]RXY [i ] i =0 1 π 1 π = ∫πS ( w)dw − ∫ π H (w)S * ( w)dw 2π 2π − X − XY 1 π = ∫ π (S ( w) − H ( w) S XY ( w))dw * 2π − X 1 = 2π ∫ (S C X ( z ) − H ( z ) S XY ( z −1 )) z −1dz
    Example 4: y[n] =x[n] + v1[n] observation model with x[ n] = 0.8 x [ n -1] + w[ n] where v1[n] is and additive zero-mean Gaussian white noise with variance 1 and w[ n] is zero-mean white noise with variance 0.68. Signal and noise are uncorrelated. Find the optimal Causal Wiener filter to estimate x[ n]. Solution: w[n] 1 x[n] 1 − 0.8z −1 0.68 S XX ( z ) = (1 − 0.8 z −1 ) (1 − 0.8 z ) RYY [m] = E y[n] y[n − m] = E ( x[n] + v[m])( x[n − m] + v[n − m]) = RXX [m] + RVV [m] + 0 + 0 SYY ( z ) = S XX ( z ) + 1 Factorize 0.68 SYY ( z ) = +1 (1 − 0.8z −1 ) (1 − 0.8 z ) 2(1 − 0.4 z −1 )(1 − 0.4 z ) = (1 − 0.8z −1 ) (1 − 0.8z ) (1 − 0.4 z −1 ) ∴ H c ( z) = (1 − 0.8 z −1 ) and σV = 2 2 Also RXY [m] = E x[n] y[n − m] = E ( x[n])( x[n − m] + v[n − m]) = RXX [m] S XY ( z ) = S XX ( z ) 0.68 = (1 − 0.8z −1 ) (1 − 0.8z )
    1 ⎡ S XY ( z ) ⎤ ∴ H ( z) = ⎢ ⎥ σ V H c ( z ) ⎣ H c ( z −1 ) ⎦ + 2 1 (1 − 0.8 z −1 ) ⎡ 0.68 ⎤ = ⎢ ⎥ 2 (1 − 0.4 z −1 ) ⎣ (1 − 0.8 z −1 )(1 − 0.8 z ) ⎦ + 0.944 = (1 − 0.4 z ) −1 h[n] = 0.944(0.4) n n ≥ 0 5.8 IIR Wiener filter (Noncausal) y[n] ˆ x[n] H ( z) ˆ The estimator x[ n] is given by ∞ x[n] = ˆ ∑ h[i] y[n − i] i =−∞ For LMMSE, the error is orthogonal to data. ⎛ ∞ ⎞ E ⎜ x [ n] − ⎝ ∑ i =−∞ h [i ] y [n − i ] ⎟ y[n − j ] = 0 ∀j ∈ I ⎠ ∞ ∑ i =−∞ h[i ] RYY [ j − i ] = RXY [ j ], j = −∞,...0, 1, ...∞ • This form Wiener Hopf Equation is simple to analyse. • Easily solved in frequency domain. So taking Z transform we get • Not realizable in real time H ( z ) SYY ( z ) = S XY ( z ) so that S XY ( z ) H ( z) = SYY ( z ) or S XY ( w) H ( w) = SYY ( w)
    5.9 Mean SquareEstimation Error – IIR Filter (Noncausal) The mean square error of estimation is given by ⎛ ∞ ⎞ E ( e 2 [n ] = Ee[n ] ⎜ x[n ] − ∑ h[i ] y[n − i ] ⎟ ⎝ i =−∞ ⎠ =Ee[n ] x[n ] ∵ error isorthogonal to data ⎛ ∞ ⎞ =E ⎜ x[n ] − ⎝ ∑ i =−∞ h[i ] y[n − i ] ⎟ x[n ] ⎠ ∞ =RXX [0] − ∑ i =−∞ h[i ]RXY [i ] 1 π 1 π = ∫π S X ( w)dw − ∫ π H ( w) S * ( w)dw 2π 2π XY − − 1 π = ∫ π (S ( w) − H ( w) S XY ( w))dw * 2π X − 1 = 2π ∫ (S C X ( z ) − H ( z ) S XY ( z −1 )) z −1dz Example 5: Noise filtering by noncausal IIR Wiener Filter Consider the case of a carrier signal in presence of white Gaussian noise y[ n] = x[ n] + v[ n] where v[ n] is and additive zero-mean Gaussian white noise with variance σ V . Signal 2 and noise are uncorrelated SYY ( w) = S XX ( w) + SVV ( w) and S XY ( w) = S XX ( w) S XX ( w) ∴ H ( w) = S XX ( w) + SVV ( w) S XY ( w) SVV ( w) = S XX ( w ) +1 SVV ( w) Suppose SNR is very high H ( w) ≅ 1 (i.e. the signal will be passed un-attenuated). When SNR is low S XX ( w) H ( w) = SVV ( w)
    (i.e. If noiseis high the corresponding signal component will be attenuated in proportion of the estimated SNR. H ( w) SNR Signal Noise w Example 6: Image filtering by IIR Wiener filter SYY ( w) = power spectrum of the corrupted image SVV ( w) = power spectrum of the noise, estimated from the noise model or from the constant intensity ( like back-ground) of the image S XX ( w) H ( w) = S XX ( w) + SVV ( w) SYY ( w) − SVV ( w) = SYY ( w) Example 7: Consider the signal in presence of white noise given by x[ n] = 0.8 x [ n -1] + w[ n] where v[ n] is and additive zero-mean Gaussian white noise with variance 1 and w[ n] is zero-mean white noise with variance 0.68. Signal and noise are uncorrelated. Find the optimal noncausal Wiener filter to estimate x[ n].
    0.68 H ( z)= SXY ( z ) = (1-0.8z ) (1 − 0.8z ) -1 SYY ( z ) 2 (1-0.4z -1 ) (1 − 0.4z ) (1-0.6z ) (1 − 0.6z ) -1 0.34 = One pole outside the unit circle (1-0.4z ) (1 − 0.4z ) -1 0.4048 0.4048 = + 1-0.4z -1 1-0.4z ∴ h[n] = 0.4048(0.4) n u (n) + 0.4048(0.4) − n u (− n − 1) h[ n] n
    CHAPTER – 6:LINEAR PREDICTION OF SIGNAL 6.1 Introduction Given a sequence of observation y[ n - 1], y[ n - 2], …. y[ n - M ], what is the best prediction for y[ n]? (one-step ahead prediction) The minimum mean square error prediction ˆ y[ n] for y[ n] is given by y [n] = E { y[n] | y[n - 1], y[n - 2], ... , y [n - M ]} ˆ Which is a nonlinear predictor. A linear prediction is given by M y[n] = ∑ h[i ] y[n − i ] ˆ i =1 where h[i ], i = 1.... M are the prediction parameters. • Linear prediction has very wide range of applications. • For an exact AR (M) process, linear prediction model of order M and the corresponding AR model have the same parameters. For other signals LP model gives an approximation. 6.2 Areas of application • Speech modeling • ECG modeling • Low-bit rate speech coding • DPCM coding • Speech recognition • Internet traffic prediction LPC (10) is the popular linear prediction model used for speech coding. For a frame of speech samples, the prediction parameters are estimated and coded. In CELP (Code book Excited Linear Prediction) the prediction error e[ n] = y[ n] - y[ n] is vector quantized and ˆ transmitted. M y [n ] = ˆ ∑ h[i ] y[n - i ] i =1 is a FIR Wiener filter shown in the following figure. It is called the linear prediction filter.
    Therefore e[n]= y[n]-y [n] ˆ M = y[n] − ∑ h[i ] y[n - i ] i =1 is the prediction error and the corresponding filter is called prediction error filter. Linear Minimum Mean Square error estimates for the prediction parameters are given by the orthogonality relation E e[n] y[n - j ] = 0 for j = 1, 2 ,… , M M ∴ E ( y[n] − ∑ h[i ] y[n - i ])y[n − j ] = 0 j = 1, 2 ,…, M i =1 M ⇒ RYY [ j ] - ∑ h[i] R i =1 YY [ j - i] = 0 M ⇒ RYY [ j ] = ∑ h[i] R i =1 YY [ j - i] j = 1, 2 ,…, M which is the Wiener Hopf equation for the linear prediction problem and same as the Yule Walker equation for AR (M) Process. In Matrix notation ⎡ RYY [0] RYY [1] .... RYY [ M − 1] ⎤ ⎡ h[1] ⎤ ⎡ RYY [1] ⎤ ⎢ R [1] R [o] ... R [ M - 2] ⎥ ⎢ h[2] ⎥ ⎢ R [2] ⎥ ⎢ YY YY YY ⎥ ⎢ ⎥ ⎢ YY ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ = ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [ M -1] RYY [ M - 2] ...... RYY [0]⎥ ⎣ ⎦ ⎢ h[ M ]⎥ ⎣ ⎦ ⎢ RYY [ M ]⎦ ⎣ ⎥ R YY h = rYY ∴ h = ( R YY ) −1 rYY 6.3 Mean Square Prediction Error (MSPE) M E (e [n]) = E ( y[n] − ∑ h[i ] y[n − i ])e[n] 2 i =1 = Ey[n]e[n] M = Ey[n]( y[n] − ∑ h[i ] y[n − i ]) i =1 M = RYY [0] − ∑ h[i ]RYY [i ]) i =1
    6.4 Forward PredictionProblem The above linear prediction problem is the forward prediction problem. For notational simplicity let us rewrite the prediction equation as M y [n] = ˆ ∑h i =1 M [i ] y[n - i ] where the prediction parameters are being denoted by hM [i ], i = 1...M . 6.5 Backward Prediction Problem Given y[ n], y[ n - 1], … . y[ n - M + 1], we want to estimate y[n − M ]. The linear prediction is given by M y[n - M ] = ˆ ∑b i =1 M [i ] y[ n + 1 − i ]] Applying orthogonality principle. M E ( y[ n - M ] - ∑ bM [i ] y[ n + 1 − i ]) y [ n + 1 − j ] = 0 j = 1, 2..., M . i =1 This will give M RYY [ M + 1 − j ] = ∑ bM [i ] RYY [ j-i ] j = 1, 2 , …, M i =1 Corresponding matrix form ⎡ RYY [0] RYY [1] .... RYY [ M − 1] ⎤ ⎡bM [1] ⎤ ⎡ RYY [ M ] ⎤ ⎢ R [1] R [o] ... R [ M - 2] ⎥ ⎢b [2] ⎥ ⎢ R [ M − 1]⎥ ⎢ YY YY YY ⎥ ⎢ M ⎥ ⎢ YY ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ = ⎢ ⎥ (1) ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [ M -1] RYY [ M - 2] ...... RYY [0]⎥ ⎣ ⎦ ⎢bM [ M ]⎥ ⎣ ⎦ ⎢ RYY [1] ⎣ ⎥ ⎦ 6.6 Forward Prediction Rewriting the Mth-order forward prediction problem, we have ⎡ RYY [0] RYY [1] .... RYY [ M − 1] ⎤ ⎡ hM [1] ⎤ ⎡ RYY [1] ⎤ ⎢ R [1] R [o] ... RYY [ M - 2]⎥ ⎢ h [2] ⎥ ⎢ R [2] ⎥ ⎢ YY YY ⎥ ⎢ M ⎥ ⎢ YY ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ = ⎢ ⎥ (2) ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [ M -1] RYY [ M - 2] ...... RYY [0] ⎥ ⎣ ⎦ ⎢ hM [ M ]⎥ ⎣ ⎦ ⎢ RYY [ M ]⎥ ⎣ ⎦ From (1) and (2) we conclude
   b_M[i] = h_M[M+1-i],   i = 1, 2, ..., M.
Thus the forward prediction parameters in reverse order give the backward prediction parameters.

Mean-square backward prediction error:
   \varepsilon_M = E\left( y[n-M] - \sum_{i=1}^{M} b_M[i]\, y[n+1-i] \right) y[n-M]
                 = R_{YY}[0] - \sum_{i=1}^{M} b_M[i]\, R_{YY}[M+1-i]
                 = R_{YY}[0] - \sum_{i=1}^{M} h_M[M+1-i]\, R_{YY}[M+1-i],
which is the same as the forward prediction error. Thus

   backward prediction error = forward prediction error.

Example 1: Find the second-order predictor for y[n], given y[n] = x[n] + v[n], where v[n] is a zero-mean white noise with variance 1, uncorrelated with x[n], and x[n] = 0.8 x[n-1] + w[n], where w[n] is zero-mean white noise with variance 0.68.

The linear predictor is
   \hat y[n] = h_2[1]\, y[n-1] + h_2[2]\, y[n-2],
and we have to find h_2[1] and h_2[2]. The corresponding Yule-Walker equations are
   \begin{bmatrix} R_{YY}[0] & R_{YY}[1] \\ R_{YY}[1] & R_{YY}[0] \end{bmatrix}
   \begin{bmatrix} h_2[1] \\ h_2[2] \end{bmatrix}
   =
   \begin{bmatrix} R_{YY}[1] \\ R_{YY}[2] \end{bmatrix}.
To find R_{YY}[0], R_{YY}[1] and R_{YY}[2]:
   R_{YY}[m] = R_{XX}[m] + \delta[m],
   R_{XX}[m] = \frac{0.68}{1-(0.8)^2}(0.8)^{|m|} = 1.89\times(0.8)^{|m|},
so that R_{YY}[0] = 2.89, R_{YY}[1] = 1.51 and R_{YY}[2] = 1.21. Solving,
   h_2[1] = 0.4178   and   h_2[2] = 0.2004.
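A numerical check of Example 1 (a sketch; the correlation values are rounded as in the notes):

```python
import numpy as np

# R_YY[m] = R_XX[m] + delta[m], with R_XX[m] = 0.68/(1 - 0.8^2) * 0.8^|m| ~ 1.89 * 0.8^|m|
Ryy = [2.89, 1.51, 1.21]

A = np.array([[Ryy[0], Ryy[1]],
              [Ryy[1], Ryy[0]]])
b = np.array([Ryy[1], Ryy[2]])
print(np.round(np.linalg.solve(A, b), 4))     # ~ [0.4178, 0.2004]
```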
    6.7 Levinson DurbinAlgorithm Levinson Durbin algorithm is the most popular technique for determining the LPC parameters from a given autocorrelation sequence. Consider the Yule Walker equation for mth order linear predictor. ⎡ RYY [0] RYY [1] .... RYY [m − 1]⎤ ⎡hm [1] ⎤ ⎡ RYY [1] ⎤ ⎢ R [1] R [o] .... R [m-2] ⎥ ⎢h [2] ⎥ ⎢. ⎥ ⎢ YY YY yy ⎥ ⎢ m ⎥ ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ = ⎢ ⎥ (1) ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [m-1] ...... RYY [0] ⎣ ⎥ ⎦ ⎣hm [m]⎦ ⎣ RYY [m]⎦ Writing in the reverse order ⎡ RYY [0] RYY [1] .... RYY [m − 1]⎤ ⎡hm [m ] ⎤ ⎡ RYY [m]⎤ ⎢ R [1] R [o] .... R [m-2] ⎥ ⎢h [m − 1]⎥ ⎢. ⎥ ⎢ YY YY yy ⎥ ⎢ m ⎥ ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ = ⎢ ⎥ (2) ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [m-1] ...... RYY [0] ⎣ ⎥ ⎦ ⎣hm [1] ⎦ ⎣ RYY [1] ⎦ Then (m+1) the order predictor is given by ⎡ RYY [0] RYY [1] .... RYY [m − 1] RYY [m] ⎤ ⎡ hm +1[1] ⎤ ⎡ RYY [1] ⎤ ⎢ R [1] RYY [0] .... RYY [m − 2] RYY [m − 1] ⎥ ⎢ h [2] ⎥ ⎢ R [2] ⎥ ⎢ YY ⎥ ⎢ m +1 ⎥ ⎢ YY ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ = ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ RYY [m − 1] RYY [m − 2] .... RYY [0] RYY [1] ⎥ ⎢ hm +1[m] ⎥ ⎢ RYY [m] ⎥ ⎢ R [ m] ⎢ RYY [m − 1] .... ⎥ ⎥ ⎢ h [m + 1]⎥ ⎢ ⎥ ⎢ R [m + 1]⎥ ⎢ YY ⎥ ⎣ YY RYY [1] RYY [0] ⎦ ⎣ m +1 ⎦ ⎣ ⎦ Let us partition equation (2) as shown. Then ⎡ hm +1 [1] ⎤ ⎡ RYY [m] ⎤ ⎡ RYY [1] ⎤ ⎡ RYY [0] RYY [1] .... RYY [ m − 1] ⎤ ⎢h [2] ⎥ ⎢ R [m - 1]⎥ ⎢ R [2] ⎥ ⎢ R [1] RYY [0] .... RYY [ m − 2 ] ⎥ ⎢ m +1 ⎥ ⎢ YY ⎥ ⎢ YY ⎥ ⎢ YY ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ + h m +1 [m + 1] ⎢ ⎥= ⎢ ⎥ (3) ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ RYY [ m − 1] RYY [ m − 2 ] .... ⎣ RYY [0] ⎥ ⎦ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣ hm +1 [ m ]⎦ ⎣ RYY [1] ⎦ ⎣ RYY [ m ]⎦ and m ∑h i =1 m +1 [i ] RYY [ m + 1 − i ] + hm +1[ m + 1]RYY [0] = RYY [ m + 1] (4)
    −1 From equation (3)premultiplying by R YY , we get ⎡ hm +1 [1] ⎤ ⎡ RYY [m] ⎤ ⎡ RYY [1] ⎤ ⎢h [2] ⎥ ⎢ R [m - 1]⎥ ⎢ R [2] ⎥ ⎢ m +1 ⎥ ⎢ YY ⎥ ⎢ YY ⎥ ⎢. ⎥ −1 ⎢. ⎥ −1 ⎢. ⎥ ⎢ ⎥ + h m +1 [m + 1]R YY ⎢ ⎥ = R YY ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣ hm +1 [ m ]⎦ ⎣ RYY [1] ⎦ ⎣ RYY [ m ]⎦ = ⎡hm +1 [1] ⎤ ⎡ hm [ m ] ⎤ ⎡hm [1]) ⎤ ⎢ h [ 2] ⎥ ⎢h [m − 1]⎥ ⎢h [2] ⎥ ⎢ m +1 ⎥ ⎢ m ⎥ ⎢ m ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ + hm +1 [m + 1] ⎢ ⎥ = ⎢ ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢. ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣hm +1 [m ]⎦ ⎣hm [1] ⎦ ⎣hm [m ]⎦ The equations can be rewritten as hm+1 [i ] = hm [i ] + k m +1hm [m + 1 − i ] i = 1,2,...m (5) where k m = − hm [m ] is called the reflection coefficient or the PARCOR (partial correlation) coefficient. From equation (4) we get m ∑ hm +1 [i ] RYY [m + 1 − i ] + hm +1 [m ]RYY [0] = RYY [m + 1] i =1 using equation (5) m ∑{h i =1 m [i ] + k m +1 hm [m + 1 − i ] }RYY [m + 1 − i] − k m +1 RYY [0] = RYY [m + 1] m m ∑ hm [i]RYY [m + 1 − i] − k m+1 RYY [0] + k m+1 ∑ hm [m + 1 − i] RYY [m + 1 − i] = RYY [m + 1] i =1 i =1 m m k m +1{RYY [0] − ∑ hm [m + 1 − i ] RYY [m + 1 − i ]} = − RYY [m + 1] + ∑ hm [i ]RYY [m + 1 − i ] i =1 i =1 m − RYY [m + 1] + ∑ hm [i ]RYY [m + 1 − i ] k m +1 = i =1 ε [ m] m ∑h m [i]RYY [m + 1 − i ] = i =0 ε [ m] where m ε [m] = RYY [0] − ∑ hm [m + 1 − i] RYY [m + 1 − i] is the mean - square prediction error. i =1
    Here we haveused the assumption that hm [0] = −1 m +1 ∴ ε [m + 1] = RYY [0] − ∑ hm+1 [m + 2 − i ] RYY [m + 2 − i ] i =1 Using the recursion for hm+1 [i ] We get ( ε [m + 1] = ε [m] 1 − k m+1 2 ) will give MSE recursively. Since MSE is non negative km ≤ 1 2 ∴ km ≤ 1 • If km < 1, the LPC error filter will be minimum-phase, and hence the corresponding syntheses filter will be stable. • Efficient realization can be achieved in terms of k m . • km represents the direct correlation of the data y[ n − m] on y[ n] when the correlation due to the intermediate data y[ n − m + 1], y[n − m + 2],..., y[ n − 1] is removed. It is defined by f f Eem [n ]em [n ] km = Ryy (0) where n em [n ] = forward prediction error = y[n ] − ∑ hm [i ] y[n − i ] f i =1 and n em [n ] = backward prediction error=y[n − m ] − ∑ hm [m + 1 − i ] y[n + 1 − i ] b i =1 6.8 Steps of the Levinson- Durbin algorithm Given RYY [m] , m = 0, 1, 2, …. Initialization Take hm [0] = −1 for all m For m = 0, ε [0] = RYY [0] For m = 1,2,3... m −1 ∑h [i ]RYY [m − i ] m −1 km = i =0 ε [m − 1]
   h_m[i] = h_{m-1}[i] + k_m h_{m-1}[m-i],   i = 1, 2, ..., m-1
   h_m[m] = -k_m
   \varepsilon[m] = \varepsilon[m-1]\,(1 - k_m^2).
Go on computing up to the given final value of m.

Some salient points
• The reflection parameters and the mean-square error completely determine the LPC coefficients. Alternatively, given the reflection coefficients and the final mean-square prediction error, we can determine the LPC coefficients.
• The algorithm is order-recursive: by solving the m-th order linear prediction problem we also obtain all lower-order solutions.
• If the estimated autocorrelation sequence satisfies the properties of an autocorrelation function, the algorithm will yield stable coefficients.

6.9 Lattice filter realization of linear prediction error filters

Let e_m^f[n] be the prediction error due to m-th order forward prediction and e_m^b[n] the prediction error due to m-th order backward prediction. Then
   e_m^f[n] = y[n] - \sum_{i=1}^{m} h_m[i]\, y[n-i]                        (1)
   e_m^b[n] = y[n-m] - \sum_{i=1}^{m} h_m[m+1-i]\, y[n+1-i].
From (1), we get
   e_m^f[n] = y[n] - h_m[m]\, y[n-m] - \sum_{i=1}^{m-1} h_m[i]\, y[n-i]
            = y[n] + k_m y[n-m] - \sum_{i=1}^{m-1} \left(h_{m-1}[i] + k_m h_{m-1}[m-i]\right) y[n-i]
            = \left( y[n] - \sum_{i=1}^{m-1} h_{m-1}[i]\, y[n-i] \right) + k_m \left( y[n-m] - \sum_{i=1}^{m-1} h_{m-1}[m-i]\, y[n-i] \right)
            = e_{m-1}^f[n] + k_m\, e_{m-1}^b[n-1].
   \therefore\; e_m^f[n] = e_{m-1}^f[n] + k_m\, e_{m-1}^b[n-1].
Similarly, we can show that
   e_m^b[n] = e_{m-1}^b[n-1] + k_m\, e_{m-1}^f[n].
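Below is a compact sketch of the Levinson-Durbin recursion of Section 6.8, written in the more common sign convention (predictor ŷ[n] = Σ a_m[i] y[n−i] and reflection coefficient k_m = a_m[m]; the notes use h_m[0] = −1 and k_m = −h_m[m], so the signs of the k_m differ). It returns both the LPC and the reflection coefficients — exactly the quantities needed by the lattice stages of Section 6.9 — and is cross-checked against a direct Yule-Walker solve.

```python
import numpy as np

def levinson_durbin(r, p):
    """Order-recursive solution of the Yule-Walker equations.

    Convention: predictor y_hat[n] = sum_{i=1}^{m} a[i] y[n-i], k[m] = a_m[m].
    r : autocorrelation sequence [R[0], ..., R[p]]
    Returns (a, k, eps): predictor coefficients a[1..p], reflection
    coefficients k[1..p], and the final mean-square prediction error.
    """
    a = np.zeros(p + 1)
    k = np.zeros(p + 1)
    eps = r[0]
    for m in range(1, p + 1):
        acc = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k[m] = acc / eps
        a_new = a.copy()
        a_new[m] = k[m]
        a_new[1:m] = a[1:m] - k[m] * a[m - 1:0:-1]
        a = a_new
        eps *= (1.0 - k[m] ** 2)
    return a[1:], k[1:], eps

# Cross-check on the correlation values of Example 1 above.
r = np.array([2.89, 1.51, 1.21])
a, k, eps = levinson_durbin(r, 2)
R = np.array([[r[0], r[1]], [r[1], r[0]]])
print(np.round(a, 4), np.round(np.linalg.solve(R, r[1:3]), 4))   # both ~ [0.4178, 0.2004]
```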
[Figure: one stage of the lattice prediction-error filter. The inputs are $e_{m-1}^f[n]$ and $e_{m-1}^b[n]$; the forward path forms $e_m^f[n] = e_{m-1}^f[n] + k_m e_{m-1}^b[n-1]$ and the backward path, containing a unit delay $z^{-1}$, forms $e_m^b[n] = e_{m-1}^b[n-1] + k_m e_{m-1}^f[n]$.]

How to initialize the lattice? We have $e_0^f[n] = y[n] - 0$ and $e_0^b[n] = y[n] - 0$. Hence
$$e_0^f[n] = e_0^b[n] = y[n]$$

[Figure: the first lattice stage, fed by $y[n]$, with multipliers $k_1$ and a unit delay $z^{-1}$.]

6.10 Advantages of the Lattice Structure
- Modular structure: it can be extended by simply cascading another section. New stages can be added without modifying the earlier stages.
- The same elements are used in each stage, so the structure is efficient for VLSI implementation.
- Numerically well-behaved, since $|k_m| < 1$.
- Each stage is decoupled from the earlier stages. This follows from the fact that, for a WSS signal, the backward prediction errors $e_m^b[n]$ are uncorrelated as a function of the order $m$: $e_i^b[n]$ and $e_m^b[n]$, $0 \le i < m$, are uncorrelated (Gram-Schmidt orthogonalisation may thus be realised through the lattice filter).
$$e_k^b[n] = y[n-k] - \sum_{i=1}^{k}h_k[k+1-i]\,y[n+1-i]$$
$$E\,e_m^b[n]\,e_k^b[n] = E\,e_m^b[n]\Big(y[n-k] - \sum_{i=1}^{k}h_k[k+1-i]\,y[n+1-i]\Big) = 0 \qquad \text{for } 0 \le k < m$$
Thus the lattice filter can be used to whiten a sequence. With this result, it can be shown that
$$k_m = -\frac{E\big(e_{m-1}^f[n]\,e_{m-1}^b[n-1]\big)}{E\big(e_{m-1}^b[n-1]\big)^2} \qquad \text{(i)}$$
and
$$k_m = -\frac{E\big(e_{m-1}^f[n]\,e_{m-1}^b[n-1]\big)}{E\big(e_{m-1}^f[n]\big)^2} \qquad \text{(ii)}$$
Proof: the mean-square forward prediction error is
$$E\big(e_m^f[n]\big)^2 = E\big(e_{m-1}^f[n] + k_m e_{m-1}^b[n-1]\big)^2$$
This is to be minimized with respect to $k_m$:
$$\Rightarrow\; 2E\big(e_{m-1}^f[n] + k_m e_{m-1}^b[n-1]\big)e_{m-1}^b[n-1] = 0 \;\Rightarrow\; \text{(i)}$$
Minimizing $E\big(e_m^b[n]\big)^2$ similarly gives (ii).

Example 2: Consider the random signal model $y[n] = x[n] + v[n]$, where $v[n]$ is zero-mean white noise with variance 1, uncorrelated with $x[n]$, and $x[n] = 0.8x[n-1] + w[n]$, where $w[n]$ is a zero-mean white noise sequence with variance 0.68.
a) Find the second-order linear predictor for $y[n]$.
b) Obtain the lattice structure for the prediction error filter.
c) Use the above structure to design a second-order FIR Wiener filter to estimate $x[n]$ from $y[n]$.
CHAPTER – 7: ADAPTIVE FILTERS

7.1 Introduction

In practical situations the system operates in an uncertain environment where the input conditions are not known and/or unexpected noise is present. Under such circumstances the system should have the flexibility to modify its parameters, making adjustments based on the input signal and other relevant signals, in order to obtain optimal performance. A system that searches for improved performance, guided by a computational algorithm for adjusting the parameters or weights, is called an adaptive system. An adaptive system is time-varying.

The Wiener filter is a linear time-invariant filter. In practical situations the signal is nonstationary; under such circumstances the optimal filter should be time-varying. The filter should have the ability to modify its parameters based on the input signal and other relevant signals so as to obtain optimal performance. How to do this?
- Assume stationarity within a certain data length. Buffering of data is required, and this may work in some applications.
- The time duration over which stationarity is a valid assumption may be so short that accurate estimation of the model parameters is difficult.
- One solution is adaptive filtering: the filter coefficients are updated as a function of the filtering error.

The basic filter structure is shown in Fig. 1.

[Figure 1: adaptive filter. The input $y[n]$ is processed by the filter structure to produce $\hat{x}[n]$; the error $e[n]$ between the reference $x[n]$ and $\hat{x}[n]$ drives the adaptive algorithm that adjusts the filter.]

The filter structure is FIR of known tap-length, because the adaptation algorithm updates each filter coefficient individually.
7.2 Method of Steepest Descent

Consider the FIR Wiener filter of length $M$. We want to compute the filter coefficients iteratively. Denote the time-varying filter parameters by $h_i[n]$, $i = 0,1,\dots,M-1$ and define the filter parameter vector
$$\mathbf{h}[n] = \begin{bmatrix}h_0[n]\\ h_1[n]\\ \vdots\\ h_{M-1}[n]\end{bmatrix}$$
We want to find the filter coefficients that minimize the mean-square error $E\,e^2[n]$, where
$$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{i=0}^{M-1}h_i[n]\,y[n-i] = x[n] - \mathbf{h}'[n]\mathbf{y}[n] = x[n] - \mathbf{y}'[n]\mathbf{h}[n]$$
with
$$\mathbf{y}[n] = \begin{bmatrix}y[n]\\ y[n-1]\\ \vdots\\ y[n-M+1]\end{bmatrix}$$
Therefore
$$E\,e^2[n] = E\big(x[n]-\mathbf{h}'[n]\mathbf{y}[n]\big)^2 = R_{XX}[0] - 2\mathbf{h}'[n]\mathbf{r}_{XY} + \mathbf{h}'[n]\mathbf{R}_{YY}\mathbf{h}[n]$$
where
$$\mathbf{r}_{XY} = \begin{bmatrix}R_{XY}[0]\\ R_{XY}[1]\\ \vdots\\ R_{XY}[M-1]\end{bmatrix}
\quad\text{and}\quad
\mathbf{R}_{YY} = \begin{bmatrix}R_{YY}[0] & R_{YY}[-1] & \cdots & R_{YY}[1-M]\\ R_{YY}[1] & R_{YY}[0] & \cdots & R_{YY}[2-M]\\ \vdots & & & \vdots\\ R_{YY}[M-1] & R_{YY}[M-2] & \cdots & R_{YY}[0]\end{bmatrix}$$
The cost function $E\,e^2[n]$ is quadratic in $\mathbf{h}[n]$, so a unique global minimum exists; the minimum is obtained by setting the gradient of $E\,e^2[n]$ to zero.
[Figure: the quadratic cost function $E\,e^2[n]$ as a function of the filter parameters.]

The optimal set of filter parameters is given by
$$\mathbf{h}_{\mathrm{opt}} = \mathbf{R}_{YY}^{-1}\mathbf{r}_{XY}$$
which is the FIR Wiener filter.

Many adaptive filter algorithms are obtained by simple modifications of algorithms for deterministic optimization. Most of the popular adaptation algorithms are based on gradient-based optimization techniques, particularly the steepest descent technique. The optimal Wiener filter can be obtained iteratively by the method of steepest descent. The optimum is found by updating the filter parameters by the rule
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \frac{\mu}{2}\big(-\nabla E\,e^2[n]\big)$$
where
$$\nabla E\,e^2[n] = \begin{bmatrix}\dfrac{\partial E\,e^2[n]}{\partial h_0}\\ \vdots\\ \dfrac{\partial E\,e^2[n]}{\partial h_{M-1}}\end{bmatrix} = -2\mathbf{r}_{XY} + 2\mathbf{R}_{YY}\mathbf{h}[n]$$
and $\mu$ is the step-size parameter. The steepest descent rule therefore becomes
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\big(\mathbf{r}_{XY} - \mathbf{R}_{YY}\mathbf{h}[n]\big)$$
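As an illustration of the update rule just derived, here is a minimal NumPy sketch of the steepest descent iteration. The function name is an assumption, and the 2-tap statistics in the usage lines are borrowed from the channel-equalization example later in this chapter, purely for illustration.

```python
import numpy as np

def steepest_descent(R_yy, r_xy, mu, n_iter=200):
    """Iterate h[n+1] = h[n] + mu*(r_xy - R_yy @ h[n]) towards the Wiener solution."""
    h = np.zeros(len(r_xy))
    for _ in range(n_iter):
        h = h + mu * (r_xy - R_yy @ h)
    return h

# Usage sketch with hypothetical 2-tap statistics (0 < mu < 2/lambda_max):
R_yy = np.array([[2.255, 0.533], [0.533, 2.255]])
r_xy = np.array([1.33, 1.07])
print(steepest_descent(R_yy, r_xy, mu=0.1))   # approaches np.linalg.solve(R_yy, r_xy)
```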
7.3 Convergence of the steepest descent method

We have
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\big(\mathbf{r}_{XY} - \mathbf{R}_{YY}\mathbf{h}[n]\big) = \mathbf{h}[n] - \mu\mathbf{R}_{YY}\mathbf{h}[n] + \mu\mathbf{r}_{XY} = (\mathbf{I} - \mu\mathbf{R}_{YY})\mathbf{h}[n] + \mu\mathbf{r}_{XY}$$
where $\mathbf{I}$ is the $M\times M$ identity matrix. This is a coupled set of linear difference equations. Can we break it into simpler equations?

$\mathbf{R}_{YY}$ can be diagonalized (KL transform) by the relation
$$\mathbf{R}_{YY} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}'$$
where $\mathbf{Q}$ is the orthogonal matrix of the eigenvectors of $\mathbf{R}_{YY}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix with the corresponding eigenvalues as diagonal elements. Also $\mathbf{I} = \mathbf{Q}\mathbf{Q}' = \mathbf{Q}'\mathbf{Q}$. Therefore
$$\mathbf{h}[n+1] = (\mathbf{Q}\mathbf{Q}' - \mu\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}')\mathbf{h}[n] + \mu\mathbf{r}_{XY}$$
Premultiplying by $\mathbf{Q}'$ and defining the new variables $\tilde{\mathbf{h}}[n] = \mathbf{Q}'\mathbf{h}[n]$ and $\tilde{\mathbf{r}}_{XY} = \mathbf{Q}'\mathbf{r}_{XY}$, we obtain
$$\tilde{\mathbf{h}}[n+1] = (\mathbf{I} - \mu\boldsymbol{\Lambda})\tilde{\mathbf{h}}[n] + \mu\tilde{\mathbf{r}}_{XY}
= \begin{bmatrix}1-\mu\lambda_1 & & 0\\ & \ddots & \\ 0 & & 1-\mu\lambda_M\end{bmatrix}\tilde{\mathbf{h}}[n] + \mu\tilde{\mathbf{r}}_{XY}$$
This is a decoupled set of linear difference equations,
$$\tilde{h}_i[n+1] = (1-\mu\lambda_i)\,\tilde{h}_i[n] + \mu\,\tilde{r}_{XY}[i], \qquad i = 0,1,\dots,M-1$$
which can easily be analysed for stability. The stability condition is given by
$$|1-\mu\lambda_i| < 1 \;\Rightarrow\; -1 < 1-\mu\lambda_i < 1 \;\Rightarrow\; 0 < \mu < \frac{2}{\lambda_i}, \qquad i = 1,\dots,M$$
Note that all the eigenvalues of $\mathbf{R}_{YY}$ are positive. Let $\lambda_{\max}$ be the maximum eigenvalue. Then
$$\lambda_{\max} < \lambda_1 + \lambda_2 + \dots + \lambda_M = \mathrm{Trace}(\mathbf{R}_{YY})$$
$$\therefore\; 0 < \mu < \frac{2}{\mathrm{Trace}(\mathbf{R}_{YY})} = \frac{2}{M\,R_{YY}[0]}$$
The steepest descent algorithm converges to the corresponding Wiener filter,
$$\lim_{n\to\infty}\mathbf{h}[n] = \mathbf{R}_{YY}^{-1}\mathbf{r}_{XY}$$
if the step size $\mu$ is within the range specified by the above relation.

7.4 Rate of Convergence

The rate of convergence of the steepest descent algorithm depends on the factor $(1-\mu\lambda_i)$ in
$$\tilde{h}_i[n+1] = (1-\mu\lambda_i)\,\tilde{h}_i[n] + \mu\,\tilde{r}_{XY}[i], \qquad i = 0,1,\dots,M-1$$
Thus the rate of convergence depends on the statistics of the data and is related to the eigenvalue spread of the autocorrelation matrix. This is expressed using the condition number of $\mathbf{R}_{YY}$, defined as $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are respectively the maximum and minimum eigenvalues of $\mathbf{R}_{YY}$. The fastest convergence occurs when $\kappa = 1$, corresponding to white noise.

7.5 LMS (Least-Mean-Square) algorithm

Consider the steepest descent relation
$$\mathbf{h}[n+1] = \mathbf{h}[n] - \frac{\mu}{2}\nabla E\,e^2[n]$$
where
$$\nabla E\,e^2[n] = \begin{bmatrix}\dfrac{\partial E\,e^2[n]}{\partial h_0}\\ \vdots\\ \dfrac{\partial E\,e^2[n]}{\partial h_{M-1}}\end{bmatrix}$$
In the LMS algorithm $E\,e^2[n]$ is approximated by $e^2[n]$ to achieve a computationally simple algorithm:
$$\nabla E\,e^2[n] \cong \nabla e^2[n] = 2e[n]\begin{bmatrix}\dfrac{\partial e[n]}{\partial h_0}\\ \vdots\\ \dfrac{\partial e[n]}{\partial h_{M-1}}\end{bmatrix}$$
Now consider
$$e[n] = x[n] - \sum_{i=0}^{M-1}h_i[n]\,y[n-i] \;\Rightarrow\; \frac{\partial e[n]}{\partial h_j} = -y[n-j], \qquad j = 0,1,\dots,M-1$$
$$\therefore\; \begin{bmatrix}\dfrac{\partial e[n]}{\partial h_0}\\ \vdots\\ \dfrac{\partial e[n]}{\partial h_{M-1}}\end{bmatrix} = -\begin{bmatrix}y[n]\\ y[n-1]\\ \vdots\\ y[n-M+1]\end{bmatrix} = -\mathbf{y}[n]
\qquad\therefore\; \nabla e^2[n] = -2e[n]\mathbf{y}[n]$$
The steepest descent update now becomes
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\,e[n]\,\mathbf{y}[n]$$
This modification is due to Widrow and Hoff, and the corresponding adaptive filter is known as the LMS filter. Hence the LMS algorithm is as follows. Given the input signal $y[n]$, the reference signal $x[n]$ and the step size $\mu$:
1. Initialization: $h_i[0] = 0$, $i = 0,1,2,\dots,M-1$.
2. For each $n$, compute
the filter output $\hat{x}[n] = \mathbf{h}'[n]\mathbf{y}[n]$
and the estimation error $e[n] = x[n] - \hat{x}[n]$.
3. Tap-weight adaptation: $\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\,e[n]\,\mathbf{y}[n]$.

[Figure: the LMS adaptive filter. The FIR filter with coefficients $h_i[n]$, $i = 0,1,\dots,M-1$, filters $y[n]$ to produce $\hat{x}[n]$; the error $e[n] = x[n] - \hat{x}[n]$ drives the LMS algorithm that updates the coefficients.]

7.6 Convergence of the LMS algorithm

As there is a feedback loop in the adaptive algorithm, convergence is not guaranteed in general. The convergence of the algorithm depends on the step-size parameter $\mu$.
- The LMS algorithm is convergent in the mean if the step-size parameter $\mu$ satisfies the condition
$$0 < \mu < \frac{2}{\lambda_{\max}}$$
Proof:
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\,e[n]\,\mathbf{y}[n]$$
$$\therefore\; E\mathbf{h}[n+1] = E\mathbf{h}[n] + \mu E\,\mathbf{y}[n]e[n] = E\mathbf{h}[n] + \mu E\,\mathbf{y}[n]\big(x[n]-\mathbf{y}'[n]\mathbf{h}[n]\big) = E\mathbf{h}[n] + \mu\mathbf{r}_{XY} - \mu E\,\mathbf{y}[n]\mathbf{y}'[n]\mathbf{h}[n]$$
Assuming the coefficients to be independent of the data (the independence assumption), we get
$$E\mathbf{h}[n+1] = E\mathbf{h}[n] + \mu\mathbf{r}_{XY} - \mu E\,\mathbf{y}[n]\mathbf{y}'[n]\,E\mathbf{h}[n] = E\mathbf{h}[n] + \mu\mathbf{r}_{XY} - \mu\mathbf{R}_{YY}E\mathbf{h}[n]$$
Hence the mean value of the filter coefficients satisfies the steepest descent iterative relation, so the same stability condition applies to the mean of the filter coefficients.
- In practice, knowledge of $\lambda_{\max}$ is not available; $\mathrm{Trace}(\mathbf{R}_{YY})$ can be taken as a conservative estimate of $\lambda_{\max}$, so that for convergence
$$0 < \mu < \frac{2}{\mathrm{Trace}(\mathbf{R}_{YY})}$$
- Also note that $\mathrm{Trace}(\mathbf{R}_{YY}) = M\,R_{YY}[0]$ = the tap-input power of the LMS filter.
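A minimal NumPy sketch of the LMS recursion of Section 7.5 (initialization, filtering, error computation and tap-weight adaptation) is given below. The function name and signature are illustrative assumptions, not part of the original notes; x and y are assumed to be 1-D NumPy arrays of equal length.

```python
import numpy as np

def lms(x, y, M, mu):
    """LMS adaptive filter: reference x[n], input y[n], M taps, step size mu.
    Returns the final weights, the estimates xhat[n] and the errors e[n]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    h = np.zeros(M)                              # step 1: h_i[0] = 0
    xhat, e = np.zeros(len(y)), np.zeros(len(y))
    for n in range(M - 1, len(y)):
        yv = y[n - M + 1:n + 1][::-1]            # data vector y[n], y[n-1], ..., y[n-M+1]
        xhat[n] = h @ yv                         # filter output
        e[n] = x[n] - xhat[n]                    # estimation error
        h = h + mu * e[n] * yv                   # tap-weight adaptation
    return h, xhat, e
```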
Generally, too small a value of $\mu$ results in slow convergence, whereas large values of $\mu$ result in large fluctuations about the mean. Choosing a proper value of $\mu$ is therefore very important for the performance of the LMS algorithm. In addition, the rate of convergence depends on the statistics of the data and is related to the eigenvalue spread of the autocorrelation matrix. This is expressed using the condition number of $\mathbf{R}_{YY}$, defined as $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\min}$ is the minimum eigenvalue of $\mathbf{R}_{YY}$. The fastest convergence occurs when $\kappa = 1$, corresponding to white noise. This means that the fastest way to train an LMS adaptive system is to use white noise as the training input; as the input becomes more and more coloured, the speed of training decreases.

The average of each filter tap-weight converges to the corresponding optimal filter tap-weight. But this does not ensure that the coefficients themselves converge to the optimal values.

7.7 Excess mean square error

Consider the LMS difference equation
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \mu\,e[n]\,\mathbf{y}[n]$$
We have seen that the mean of the LMS coefficients converges to the steepest-descent solution. But this does not guarantee that the mean-square error of the LMS estimator converges to the mean-square error corresponding to the Wiener solution: there is a fluctuation of the LMS coefficients about the Wiener filter coefficients.

Let $\mathbf{h}_{\mathrm{opt}}$ be the optimal Wiener filter impulse response. The instantaneous deviation of the LMS coefficients from $\mathbf{h}_{\mathrm{opt}}$ is $\Delta\mathbf{h}[n] = \mathbf{h}[n] - \mathbf{h}_{\mathrm{opt}}$. Then
$$\begin{aligned}
\varepsilon[n] = E\,e^2[n] &= E\big\{x[n] - \mathbf{h}'_{\mathrm{opt}}\mathbf{y}[n] - \Delta\mathbf{h}'[n]\mathbf{y}[n]\big\}^2\\
&= E\big\{x[n]-\mathbf{h}'_{\mathrm{opt}}\mathbf{y}[n]\big\}^2 + E\,\Delta\mathbf{h}'[n]\mathbf{y}[n]\mathbf{y}'[n]\Delta\mathbf{h}[n] - 2E\big(e_{\mathrm{opt}}[n]\Delta\mathbf{h}'[n]\mathbf{y}[n]\big)\\
&= \varepsilon_{\min} + E\,\Delta\mathbf{h}'[n]\mathbf{y}[n]\mathbf{y}'[n]\Delta\mathbf{h}[n] - 2E\big(e_{\mathrm{opt}}[n]\Delta\mathbf{h}'[n]\mathbf{y}[n]\big)\\
&= \varepsilon_{\min} + E\,\Delta\mathbf{h}'[n]\mathbf{y}[n]\mathbf{y}'[n]\Delta\mathbf{h}[n]
\end{aligned}$$
assuming the independence of the deviation with respect to the data and $E\,\Delta\mathbf{h}[n] = \mathbf{0}$. Therefore
$$\varepsilon_{\mathrm{excess}} = E\,\Delta\mathbf{h}'[n]\mathbf{y}[n]\mathbf{y}'[n]\Delta\mathbf{h}[n]$$
An exact analysis of the excess mean-square error is quite complicated; its approximate value is given by
$$\varepsilon_{\mathrm{excess}} = \varepsilon_{\min}\,\frac{\displaystyle\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}}{1-\displaystyle\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}}$$
The LMS algorithm is said to converge in the mean-square sense provided the step-size parameter satisfies the relations
$$\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i} < 1 \qquad\text{and}\qquad 0 < \mu < \frac{2}{\lambda_{\max}}$$
If $\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i} \ll 1$, then
$$\varepsilon_{\mathrm{excess}} \approx \varepsilon_{\min}\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}$$
Further, if $\mu\lambda_i \ll 1$ for every $i$, so that $\frac{\mu\lambda_i}{2-\mu\lambda_i} \approx \frac{\mu\lambda_i}{2}$, then
$$\varepsilon_{\mathrm{excess}} \approx \varepsilon_{\min}\,\frac{\tfrac{\mu}{2}\,\mathrm{Trace}(\mathbf{R}_{YY})}{1-\tfrac{\mu}{2}\,\mathrm{Trace}(\mathbf{R}_{YY})} \approx \frac{1}{2}\,\varepsilon_{\min}\,\mu\,\mathrm{Trace}(\mathbf{R}_{YY})$$
The factor
$$\frac{\varepsilon_{\mathrm{excess}}}{\varepsilon_{\min}} = \sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}$$
is called the misadjustment factor of the LMS filter.

7.8 Drawbacks of the LMS Algorithm
- Convergence is slow when the eigenvalue spread of the autocorrelation matrix is large.
- The misadjustment factor, $\dfrac{\varepsilon_{\mathrm{excess}}}{\varepsilon_{\min}} \approx \dfrac{1}{2}\mu\,\mathrm{Trace}(\mathbf{R}_{YY})$, is large unless $\mu$ is made very small. Thus the selection of the step-size parameter is crucial for the LMS algorithm.
- When the input signal is nonstationary, the eigenvalues also change with time and the selection of $\mu$ becomes more difficult.
Example: The input to a communication channel is the test sequence
$$x[n] = 0.8x[n-1] + w[n]$$
where $w[n]$ is a zero-mean, unit-variance white noise. The channel transfer function is $H(z) = z^{-1} - 0.5z^{-2}$ and the channel output is corrupted by additive white Gaussian noise of variance 1.
(a) Find the FIR Wiener filter of length 2 for channel equalization.
(b) Write down the LMS filter update equations.
(c) Find the bounds on the LMS step-size parameter.
(d) Find the excess mean-square error.

Solution: From the given model

[Figure: $x[n]$ passes through the channel, noise is added to give $y[n]$, and the equalizer produces $\hat{x}[n]$.]

$$R_{XX}[m] = \frac{\sigma_W^2}{1-0.8^2}(0.8)^{|m|}$$
$$R_{XX}[0] = 2.78,\quad R_{XX}[1] = 2.22,\quad R_{XX}[2] = 1.78,\quad R_{XX}[3] = 1.42$$
$$y[n] = x[n-1] - 0.5x[n-2] + v[n]$$
$$\therefore\; R_{YY}[m] = 1.25R_{XX}[m] - 0.5R_{XX}[m+1] - 0.5R_{XX}[m-1] + \delta[m]$$
$$R_{YY}[0] = 2.255, \qquad R_{YY}[1] = 0.5325$$
Also
$$R_{XY}[m] = E\,x[n]\big(x[n-1-m] - 0.5x[n-2-m] + v[n-m]\big) = R_{XX}[m+1] - 0.5R_{XX}[m+2]$$
$$\therefore\; R_{XY}[0] = 1.33 \quad\text{and}\quad R_{XY}[1] = 1.07$$
Therefore the Wiener solution is given by
$$\begin{bmatrix}2.255 & 0.533\\ 0.533 & 2.255\end{bmatrix}\begin{bmatrix}h_0\\ h_1\end{bmatrix} = \begin{bmatrix}1.33\\ 1.07\end{bmatrix}$$
$$\mathrm{MSE} = R_{XX}[0] - h_0R_{XY}[0] - h_1R_{XY}[1]$$
$$h_0 = 0.51, \qquad h_1 = 0.35$$
$$\lambda_1, \lambda_2 = 2.79,\ 1.72$$
$$\mu < \frac{2}{2.79} = 0.72$$
Excess mean-square error:
$$\varepsilon_{\mathrm{excess}} = \varepsilon_{\min}\,\frac{\displaystyle\sum_{i=1}^{2}\frac{\mu\lambda_i}{2-\mu\lambda_i}}{1-\displaystyle\sum_{i=1}^{2}\frac{\mu\lambda_i}{2-\mu\lambda_i}}$$

7.9 Leaky LMS Algorithm

Minimizes $e^2[n] + \alpha\|\mathbf{h}[n]\|^2$, where $\|\mathbf{h}[n]\|$ is the norm of the LMS weight vector and $\alpha$ is a positive quantity. The corresponding algorithm is
$$\mathbf{h}[n+1] = (1-\mu\alpha)\mathbf{h}[n] + \mu\,e[n]\,\mathbf{y}[n]$$
where $\mu\alpha$ is chosen to be less than 1. In this situation the pole stays inside the unit circle, there is no instability problem, and the algorithm converges.

7.10 Normalized LMS Algorithm

For convergence of the LMS algorithm
$$0 < \mu < \frac{2}{\lambda_{\max}}$$
and the conservative bound is
$$0 < \mu < \frac{2}{\mathrm{Trace}(\mathbf{R}_{YY})} = \frac{2}{M\,R_{YY}[0]} = \frac{2}{M\,E\big(y^2[n]\big)}$$
We can estimate this bound by estimating $E\big(y^2[n]\big)$ by $\frac{1}{M}\sum_{i=0}^{M-1}y^2[n-i]$, which gives the bound
$$0 < \mu < \frac{2}{\sum_{i=0}^{M-1}y^2[n-i]} = \frac{2}{\|\mathbf{y}[n]\|^2}$$
Then we can take
$$\mu = \frac{\beta}{\|\mathbf{y}[n]\|^2}$$
and the LMS update becomes
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \frac{\beta}{\|\mathbf{y}[n]\|^2}\,e[n]\,\mathbf{y}[n], \qquad 0 < \beta < 2$$

7.11 Discussion - LMS
- The NLMS algorithm has higher computational complexity than the LMS algorithm.
- Under certain assumptions, it can be shown that the NLMS algorithm converges for $0 < \beta < 2$.
- $\|\mathbf{y}[n]\|^2$ can be efficiently computed using the recursive relation $\|\mathbf{y}[n]\|^2 = \|\mathbf{y}[n-1]\|^2 + y^2[n] - y^2[n-M]$.
- Notice that the NLMS algorithm does not change the direction of the update used in the steepest descent algorithm.
- If $\mathbf{y}[n]$ is close to zero, the denominator term $\|\mathbf{y}[n]\|^2$ in the NLMS equation becomes very small and $\mathbf{h}[n+1] = \mathbf{h}[n] + \frac{\beta}{\|\mathbf{y}[n]\|^2}e[n]\mathbf{y}[n]$ may diverge. To overcome this drawback, a small positive number $\epsilon$ is added to the denominator of the NLMS equation:
$$\mathbf{h}[n+1] = \mathbf{h}[n] + \frac{\beta}{\epsilon + \|\mathbf{y}[n]\|^2}\,e[n]\,\mathbf{y}[n]$$
- For computational efficiency, other modifications of the LMS algorithm have been suggested, for example the block-LMS algorithm, the signed LMS algorithm, etc.
- An LMS algorithm can also be obtained for an IIR filter, to adaptively update the parameters of the filter
$$y[n] = \sum_{i=1}^{M-1}a_i[n]\,y[n-i] + \sum_{i=0}^{N-1}b_i[n]\,x[n-i]$$
However, the IIR LMS algorithm has poor performance compared to the FIR LMS filter.
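The normalized update with the regularizing constant $\epsilon$ can be sketched as follows; this is a minimal illustrative NumPy sketch (the function name and defaults are assumptions), not a definitive implementation.

```python
import numpy as np

def nlms(x, y, M, beta, eps=1e-6):
    """Normalized LMS: step size beta/(eps + ||y[n]||^2), with 0 < beta < 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    h = np.zeros(M)
    e = np.zeros(len(y))
    for n in range(M - 1, len(y)):
        yv = y[n - M + 1:n + 1][::-1]                  # current data vector y[n]
        e[n] = x[n] - h @ yv                           # estimation error
        h = h + (beta / (eps + yv @ yv)) * e[n] * yv   # normalized update
    return h, e
```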
7.12 Recursive Least Squares (RLS) Adaptive Filter

Limitations of the LMS algorithm:
- LMS convergence is slow.
- The step-size parameter has to be chosen properly.
- The excess mean-square error is high.
- LMS minimizes only the instantaneous squared error $e^2[n]$, where $e[n] = x[n] - \mathbf{h}'[n]\mathbf{y}[n] = x[n] - \mathbf{y}'[n]\mathbf{h}[n]$.

The RLS algorithm considers all the available data for determining the filter parameters: the filter should be optimum, in a certain sense, with respect to all the available data. It minimizes the cost function
$$\varepsilon[n] = \sum_{k=0}^{n}\lambda^{n-k}e^2[k]$$
with respect to the filter parameter vector $\mathbf{h}[n] = \big[h_0[n]\ \ h_1[n]\ \cdots\ h_{M-1}[n]\big]'$, where $\lambda$ is the weighting factor known as the forgetting factor.
- Recent data are given more weight.
- For the stationary case $\lambda = 1$ can be taken.
- $\lambda \cong 0.99$ is effective in tracking local nonstationarity.

The minimization problem is: minimize $\varepsilon[n] = \sum_{k=0}^{n}\lambda^{n-k}\big(x[k]-\mathbf{y}'[k]\mathbf{h}[n]\big)^2$ with respect to $\mathbf{h}[n]$. The minimum is given by
$$\frac{\partial\varepsilon[n]}{\partial\mathbf{h}[n]} = 0 \;\Rightarrow\; 2\sum_{k=0}^{n}\lambda^{n-k}\big(x[k]\mathbf{y}[k] - \mathbf{y}[k]\mathbf{y}'[k]\mathbf{h}[n]\big) = 0$$
$$\Rightarrow\; \mathbf{h}[n] = \Big(\sum_{k=0}^{n}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k]\Big)^{-1}\sum_{k=0}^{n}\lambda^{n-k}x[k]\mathbf{y}[k]$$
Let us define
$$\hat{\mathbf{R}}_{YY}[n] = \sum_{k=0}^{n}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k]$$
which is an estimator of the autocorrelation matrix $\mathbf{R}_{YY}$. Similarly,
$$\hat{\mathbf{r}}_{XY}[n] = \sum_{k=0}^{n}\lambda^{n-k}x[k]\mathbf{y}[k]$$
is an estimator of the cross-correlation vector $\mathbf{r}_{XY}$. Hence
$$\mathbf{h}[n] = \big(\hat{\mathbf{R}}_{YY}[n]\big)^{-1}\hat{\mathbf{r}}_{XY}[n]$$
A matrix inversion is involved, which makes the direct solution difficult. We therefore look for a recursive solution.

7.13 Recursive representation of $\hat{\mathbf{R}}_{YY}[n]$

$\hat{\mathbf{R}}_{YY}[n]$ can be rewritten as
$$\hat{\mathbf{R}}_{YY}[n] = \sum_{k=0}^{n-1}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k] + \mathbf{y}[n]\mathbf{y}'[n] = \lambda\sum_{k=0}^{n-1}\lambda^{n-1-k}\mathbf{y}[k]\mathbf{y}'[k] + \mathbf{y}[n]\mathbf{y}'[n] = \lambda\hat{\mathbf{R}}_{YY}[n-1] + \mathbf{y}[n]\mathbf{y}'[n]$$
This shows that the autocorrelation matrix can be computed recursively from its previous value and the present data vector. Similarly
$$\hat{\mathbf{r}}_{XY}[n] = \lambda\hat{\mathbf{r}}_{XY}[n-1] + x[n]\mathbf{y}[n]$$
$$\mathbf{h}[n] = \big(\hat{\mathbf{R}}_{YY}[n]\big)^{-1}\hat{\mathbf{r}}_{XY}[n] = \big(\lambda\hat{\mathbf{R}}_{YY}[n-1] + \mathbf{y}[n]\mathbf{y}'[n]\big)^{-1}\hat{\mathbf{r}}_{XY}[n]$$
For the matrix inversion above, the matrix inversion lemma is useful.

7.14 Matrix Inversion Lemma

If $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ are matrices of proper orders, with $\mathbf{A}$ and $\mathbf{C}$ nonsingular,
$$(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\big(\mathbf{D}\mathbf{A}^{-1}\mathbf{B} + \mathbf{C}^{-1}\big)^{-1}\mathbf{D}\mathbf{A}^{-1}$$
Taking $\mathbf{A} = \lambda\hat{\mathbf{R}}_{YY}[n-1]$, $\mathbf{B} = \mathbf{y}[n]$, $\mathbf{C} = 1$ and $\mathbf{D} = \mathbf{y}'[n]$, we have
$$\big(\hat{\mathbf{R}}_{YY}[n]\big)^{-1} = \frac{1}{\lambda}\hat{\mathbf{R}}_{YY}^{-1}[n-1] - \frac{1}{\lambda}\hat{\mathbf{R}}_{YY}^{-1}[n-1]\mathbf{y}[n]\Big(\frac{1}{\lambda}\mathbf{y}'[n]\hat{\mathbf{R}}_{YY}^{-1}[n-1]\mathbf{y}[n] + 1\Big)^{-1}\mathbf{y}'[n]\frac{1}{\lambda}\hat{\mathbf{R}}_{YY}^{-1}[n-1]$$
$$= \frac{1}{\lambda}\Bigg(\hat{\mathbf{R}}_{YY}^{-1}[n-1] - \frac{\hat{\mathbf{R}}_{YY}^{-1}[n-1]\mathbf{y}[n]\mathbf{y}'[n]\hat{\mathbf{R}}_{YY}^{-1}[n-1]}{\lambda + \mathbf{y}'[n]\hat{\mathbf{R}}_{YY}^{-1}[n-1]\mathbf{y}[n]}\Bigg)$$
Rename $\mathbf{P}[n] = \hat{\mathbf{R}}_{YY}^{-1}[n]$. Then
$$\mathbf{P}[n] = \frac{1}{\lambda}\big(\mathbf{P}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\big)$$
where $\mathbf{k}[n]$, called the gain vector, is given by
$$\mathbf{k}[n] = \frac{\mathbf{P}[n-1]\mathbf{y}[n]}{\lambda + \mathbf{y}'[n]\mathbf{P}[n-1]\mathbf{y}[n]}$$
The gain vector $\mathbf{k}[n]$, which is important for interpreting the adaptation, is also related to the current data vector $\mathbf{y}[n]$ by
$$\mathbf{k}[n] = \mathbf{P}[n]\mathbf{y}[n]$$
To establish this relation, consider $\mathbf{P}[n] = \frac{1}{\lambda}\big(\mathbf{P}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\big)$. Multiplying by $\lambda$, post-multiplying by $\mathbf{y}[n]$ and simplifying,
$$\lambda\mathbf{P}[n]\mathbf{y}[n] = \big(\mathbf{P}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\big)\mathbf{y}[n] = \mathbf{P}[n-1]\mathbf{y}[n] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\mathbf{y}[n] = \lambda\mathbf{k}[n]$$
Therefore
$$\begin{aligned}
\mathbf{h}[n] &= \big(\hat{\mathbf{R}}_{YY}[n]\big)^{-1}\hat{\mathbf{r}}_{XY}[n] = \mathbf{P}[n]\big(\lambda\hat{\mathbf{r}}_{XY}[n-1] + x[n]\mathbf{y}[n]\big) = \lambda\mathbf{P}[n]\hat{\mathbf{r}}_{XY}[n-1] + x[n]\mathbf{P}[n]\mathbf{y}[n]\\
&= \lambda\frac{1}{\lambda}\big[\mathbf{P}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\big]\hat{\mathbf{r}}_{XY}[n-1] + x[n]\mathbf{P}[n]\mathbf{y}[n]\\
&= \mathbf{h}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{h}[n-1] + x[n]\mathbf{k}[n]\\
&= \mathbf{h}[n-1] + \mathbf{k}[n]\big(x[n] - \mathbf{y}'[n]\mathbf{h}[n-1]\big)
\end{aligned}$$

7.15 RLS algorithm steps

Initialization: at $n = 0$, $\mathbf{P}[0] = \delta\mathbf{I}_{M\times M}$ with $\delta$ a positive number, $\mathbf{y}[0] = \mathbf{0}$, $\mathbf{h}[0] = \mathbf{0}$. Choose $\lambda$.

Operation: for $n = 1$ to Final do
1. Get $x[n]$, $\mathbf{y}[n]$.
2. Compute the error $e[n] = x[n] - \mathbf{h}'[n-1]\mathbf{y}[n]$.
3. Calculate the gain vector $\mathbf{k}[n] = \dfrac{\mathbf{P}[n-1]\mathbf{y}[n]}{\lambda + \mathbf{y}'[n]\mathbf{P}[n-1]\mathbf{y}[n]}$.
4. Update the filter parameters: $\mathbf{h}[n] = \mathbf{h}[n-1] + \mathbf{k}[n]\,e[n]$.
5. Update the $\mathbf{P}$ matrix: $\mathbf{P}[n] = \dfrac{1}{\lambda}\big(\mathbf{P}[n-1] - \mathbf{k}[n]\mathbf{y}'[n]\mathbf{P}[n-1]\big)$.
end do

7.16 Discussion – RLS

7.16.1 Relation with the Wiener filter

We have the optimality condition for the RLS filter
$$\hat{\mathbf{R}}_{YY}[n]\mathbf{h}[n] = \hat{\mathbf{r}}_{XY}[n], \qquad \hat{\mathbf{R}}_{YY}[n] = \sum_{k=0}^{n}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k]$$
Dividing by $n+1$,
$$\frac{\hat{\mathbf{R}}_{YY}[n]}{n+1} = \frac{1}{n+1}\sum_{k=0}^{n}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k]$$
If we consider the elements of $\hat{\mathbf{R}}_{YY}[n]/(n+1)$, we see that each is an estimator of the autocorrelation at a specific lag, and
$$\lim_{n\to\infty}\frac{\hat{\mathbf{R}}_{YY}[n]}{n+1} = \mathbf{R}_{YY}$$
in the mean-square sense: the weighted-window sample autocorrelation is a consistent estimator. Similarly
$$\lim_{n\to\infty}\frac{\hat{\mathbf{r}}_{XY}[n]}{n+1} = \mathbf{r}_{XY}$$
Hence, as $n\to\infty$, the optimality condition can be written as $\mathbf{R}_{YY}\mathbf{h}[n] = \mathbf{r}_{XY}$, i.e. the RLS solution approaches the Wiener solution.
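The five steps of Section 7.15 translate directly into code. The following is a minimal NumPy sketch of the RLS recursion; the function name, the default forgetting factor and the value of $\delta$ are illustrative assumptions.

```python
import numpy as np

def rls(x, y, M, lam=0.99, delta=100.0):
    """RLS adaptive filter: forgetting factor lam, P initialized to delta*I."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    h = np.zeros(M)
    P = delta * np.eye(M)
    e = np.zeros(len(y))
    for n in range(M - 1, len(y)):
        yv = y[n - M + 1:n + 1][::-1]                 # data vector y[n]
        e[n] = x[n] - h @ yv                          # a priori error
        k = P @ yv / (lam + yv @ P @ yv)              # gain vector
        h = h + k * e[n]                              # filter parameter update
        P = (P - np.outer(k, yv) @ P) / lam           # P-matrix update
    return h, e
```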
7.16.2 Dependence on the initial values

Consider the recursive relation
$$\hat{\mathbf{R}}_{YY}[n] = \lambda\hat{\mathbf{R}}_{YY}[n-1] + \mathbf{y}[n]\mathbf{y}'[n]$$
Corresponding to the initialization $\hat{\mathbf{R}}_{YY}^{-1}[-1] = \mathbf{P}[-1] = \delta\mathbf{I}$, we have $\hat{\mathbf{R}}_{YY}[-1] = \mathbf{I}/\delta$. With this initial condition the matrix difference equation has the solution
$$\tilde{\mathbf{R}}_{YY}[n] = \lambda^{n+1}\hat{\mathbf{R}}_{YY}[-1] + \sum_{k=0}^{n}\lambda^{n-k}\mathbf{y}[k]\mathbf{y}'[k] = \lambda^{n+1}\frac{\mathbf{I}}{\delta} + \hat{\mathbf{R}}_{YY}[n]$$
Hence the optimality condition is modified to
$$\Big(\lambda^{n+1}\frac{\mathbf{I}}{\delta} + \hat{\mathbf{R}}_{YY}[n]\Big)\tilde{\mathbf{h}}[n] = \hat{\mathbf{r}}_{XY}[n]$$
where $\tilde{\mathbf{h}}[n]$ is the modified solution due to the assumed initial value of the $\mathbf{P}$-matrix:
$$\frac{\lambda^{n+1}}{\delta}\hat{\mathbf{R}}_{YY}^{-1}[n]\tilde{\mathbf{h}}[n] + \tilde{\mathbf{h}}[n] = \mathbf{h}[n]$$
If we take $\lambda$ less than 1, the bias term on the left-hand side of the above equation eventually dies down and we get $\tilde{\mathbf{h}}[n] \cong \mathbf{h}[n]$.

7.16.3 Convergence under stationary conditions
- If the data are stationary, the algorithm converges in the mean in, at best, $M$ iterations, where $M$ is the number of taps of the adaptive FIR filter.
- The filter coefficients converge in the mean to the corresponding Wiener filter coefficients.
- Unlike the LMS filter, which converges in the mean only asymptotically, the RLS filter converges in a finite time, and its convergence is much less sensitive to the eigenvalue spread. This is a remarkable feature of the RLS algorithm.
- The RLS filter can also be shown to converge to the Wiener filter in the mean-square sense, so that there is zero excess mean-square error.
7.16.4 Tracking non-stationarity

If $\lambda$ is small, $\lambda^{n-i} \cong 0$ for $i \ll n$, so the filter is effectively based on the most recent data only. This qualitatively explains why the filter can track non-stationarity in the data.

7.16.5 Computational complexity

The several matrix multiplications involved result in approximately $7M^2$ arithmetic operations per iteration, which is quite large if the filter tap-length is high. So we have to look for efficient implementations of the RLS algorithm.
CHAPTER – 8: KALMAN FILTER

8.1 Introduction

[Figure: the signal $x[n]$ is corrupted by additive noise to produce $y[n]$, which is passed through a linear filter to give the estimate $\hat{x}[n]$.]

To estimate a signal $x[n]$ in the presence of noise:
- the FIR Wiener filter is optimum when the data length and the filter length are equal;
- the IIR Wiener filter is based on the assumption that an infinite length of data is available.

Neither of the above filters represents the physical situation. We need a filter that adds a tap with each new data sample. The basic mechanism of the Kalman filter (R. E. Kalman, 1960) is to estimate the signal recursively by a relation of the form
$$\hat{x}[n] = A_n\hat{x}[n-1] + K_n y[n]$$
The Kalman filter is also based on the innovation representation of the signal; we used this model to develop the causal IIR Wiener filter.

8.2 Signal Model

The simplest Kalman filter uses the first-order AR signal model
$$x[n] = a\,x[n-1] + w[n]$$
where $w[n]$ is a white noise sequence. The observed data are given by
$$y[n] = x[n] + v[n]$$
where $v[n]$ is another white noise sequence, independent of the signal.

A general stationary signal is modeled by a difference equation representing an ARMA($p,q$) model. Such a signal can be described by the state-space model
$$\mathbf{x}[n] = \mathbf{A}\mathbf{x}[n-1] + \mathbf{B}w[n] \qquad (1)$$
and the observations can be represented as a linear combination of the states and the observation noise:
$$y[n] = \mathbf{c}'\mathbf{x}[n] + v[n] \qquad (2)$$
Equations (1) and (2) have a direct relation with the state-space model used in control systems, where one has to estimate the unobservable states of the system through an observer that performs well in the presence of noise.

Example 1: Consider the AR($M$) model
$$x[n] = a_1x[n-1] + a_2x[n-2] + \dots + a_Mx[n-M] + w[n]$$
Then a state-variable model for $x[n]$ is $\mathbf{x}[n] = \mathbf{A}\mathbf{x}[n-1] + \mathbf{b}w[n]$, where
$$\mathbf{x}[n] = \begin{bmatrix}x_1[n]\\ x_2[n]\\ \vdots\\ x_M[n]\end{bmatrix}, \qquad x_1[n] = x[n],\; x_2[n] = x[n-1],\;\dots,\; x_M[n] = x[n-M+1],$$
$$\mathbf{A} = \begin{bmatrix}a_1 & a_2 & \cdots & a_{M-1} & a_M\\ 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & \cdots & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & 1 & 0\end{bmatrix}
\qquad\text{and}\qquad
\mathbf{b} = \begin{bmatrix}1\\ 0\\ \vdots\\ 0\end{bmatrix}$$
Our analysis will cover only the simple (scalar) Kalman filter.

The Kalman filter uses the innovation representation of the stationary signal, as does the IIR Wiener filter. The innovation representation is shown in the following diagram.

[Figure: the data $y[0], y[1], \dots, y[n]$ pass through an orthogonalisation step to produce the innovations $\tilde{y}[0], \tilde{y}[1], \dots, \tilde{y}[n]$.]

Let $\hat{x}[n]$ be the LMMSE estimate of $x[n]$ based on the data $y[0], y[1], \dots, y[n]$. In the above representation $\tilde{y}[n]$ is the innovation of $y[n]$, and the innovation sequence contains the same information as the original sequence. Let $\hat{E}\big(y[n]\,|\,y[n-1],\dots,y[0]\big)$ denote the linear prediction of $y[n]$ based on $y[n-1],\dots,y[0]$.
Then
$$\begin{aligned}
\tilde{y}[n] &= y[n] - \hat{E}\big(y[n]\,|\,y[n-1],\dots,y[0]\big)\\
&= y[n] - \hat{E}\big(x[n]+v[n]\,|\,y[n-1],\dots,y[0]\big)\\
&= y[n] - \hat{E}\big(ax[n-1]+w[n]+v[n]\,|\,y[n-1],\dots,y[0]\big)\\
&= y[n] - a\hat{x}[n-1]\\
&= y[n] - x[n] + x[n] - a\hat{x}[n-1]\\
&= y[n] - x[n] + ax[n-1] + w[n] - a\hat{x}[n-1]\\
&= y[n] - x[n] + a\big(x[n-1]-\hat{x}[n-1]\big) + w[n]\\
&= v[n] + a\,e[n-1] + w[n]
\end{aligned}$$
which is a linear combination of three mutually orthogonal sequences. It can easily be shown that $\tilde{y}[n]$ is orthogonal to each of $y[n-1], y[n-2], \dots, y[0]$.

The LMMSE estimate of $x[n]$ based on $y[0], y[1],\dots,y[n]$ is the same as the estimate based on the innovation sequence $\tilde{y}[0], \tilde{y}[1],\dots,\tilde{y}[n]$. Therefore
$$\hat{x}[n] = \sum_{i=0}^{n}k_i\tilde{y}[i]$$
where the $k_i$ can be obtained from the orthogonality relation. Consider
$$x[n] = \hat{x}[n] + e[n] = \sum_{i=0}^{n}k_i\tilde{y}[i] + e[n]$$
Then
$$E\Big(x[n]-\sum_{i=0}^{n}k_i\tilde{y}[i]\Big)\tilde{y}[j] = 0, \qquad j = 0,1,\dots,n$$
so that
$$k_j = \frac{E\,x[n]\tilde{y}[j]}{\sigma_j^2}, \qquad j = 0,1,\dots,n$$
where $\sigma_j^2 = E\tilde{y}^2[j]$. Similarly,
$$x[n-1] = \hat{x}[n-1] + e[n-1] = \sum_{i=0}^{n-1}k_i'\tilde{y}[i] + e[n-1]$$
$$k_j' = \frac{E\,x[n-1]\tilde{y}[j]}{\sigma_j^2} = \frac{E\big(x[n]-w[n]\big)\tilde{y}[j]}{a\sigma_j^2} = \frac{E\,x[n]\tilde{y}[j]}{a\sigma_j^2}, \qquad j = 0,1,\dots,n-1$$
$$\therefore\; k_j' = \frac{k_j}{a}, \qquad j = 0,1,\dots,n-1$$
Again,
$$\begin{aligned}
\hat{x}[n] &= \sum_{i=0}^{n}k_i\tilde{y}[i] = \sum_{i=0}^{n-1}k_i\tilde{y}[i] + k_n\tilde{y}[n]\\
&= a\sum_{i=0}^{n-1}k_i'\tilde{y}[i] + k_n\tilde{y}[n]\\
&= a\hat{x}[n-1] + k_n\big(y[n]-\hat{E}(y[n]\,|\,y[n-1],\dots,y[0])\big)\\
&= a\hat{x}[n-1] + k_n\big(y[n]-a\hat{x}[n-1]\big)\\
&= (1-k_n)\,a\hat{x}[n-1] + k_n y[n]
\end{aligned}$$
where $\hat{E}(y[n]\,|\,y[n-1],\dots,y[0])$ is the linear prediction of $y[n]$ based on the observations $y[n-1],\dots,y[0]$, which is the same as $a\hat{x}[n-1]$.
$$\therefore\; \hat{x}[n] = A_n\hat{x}[n-1] + k_n y[n], \qquad A_n = (1-k_n)a$$
Thus the recursive estimator $\hat{x}[n]$ is given by
$$\hat{x}[n] = A_n\hat{x}[n-1] + k_n y[n]
\qquad\text{or}\qquad
\hat{x}[n] = a\hat{x}[n-1] + k_n\big(y[n]-a\hat{x}[n-1]\big)$$
The filter can be represented by the following diagram.

[Figure: the observation $y[n]$ is scaled by $k_n$ and added to $a\hat{x}[n-1]$, obtained by feeding $\hat{x}[n]$ back through a unit delay $z^{-1}$ and the gain $a$, to produce $\hat{x}[n]$.]

8.3 Estimation of the filter parameters

Consider the estimator $\hat{x}[n] = A_n\hat{x}[n-1] + k_n y[n]$. The estimation error is $e[n] = x[n]-\hat{x}[n]$. By the orthogonality principle, $e[n]$ must be orthogonal to the past and present observed data:
$$E\,e[n]\,y[n-m] = 0, \qquad m \ge 0$$
We want to find $A_n$ and $k_n$ using the above condition: $e[n]$ is orthogonal to the current and past data. First consider the condition that $e[n]$ is orthogonal to the current data:
$$\begin{aligned}
&E\,e[n]y[n] = 0\\
\Rightarrow\;& E\,e[n]\big(x[n]+v[n]\big) = 0\\
\Rightarrow\;& E\,e[n]x[n] + E\,e[n]v[n] = 0\\
\Rightarrow\;& E\,e[n]\big(\hat{x}[n]+e[n]\big) + E\,e[n]v[n] = 0\\
\Rightarrow\;& E\,e^2[n] + E\,e[n]v[n] = 0\\
\Rightarrow\;& \varepsilon^2[n] + E\big(x[n]-A_n\hat{x}[n-1]-k_ny[n]\big)v[n] = 0\\
\Rightarrow\;& \varepsilon^2[n] - k_n\sigma_V^2 = 0\\
\Rightarrow\;& k_n = \frac{\varepsilon^2[n]}{\sigma_V^2}
\end{aligned}$$
where $\varepsilon^2[n] = E\,e^2[n]$. We have to estimate $\varepsilon^2[n]$ at every value of $n$. How do we do it? Consider
$$\begin{aligned}
\varepsilon^2[n] &= E\,x[n]e[n] = E\,x[n]\big(x[n]-(1-k_n)a\hat{x}[n-1]-k_ny[n]\big)\\
&= \sigma_X^2 - (1-k_n)a\,E\,x[n]\hat{x}[n-1] - k_n E\,x[n]y[n]\\
&= (1-k_n)\sigma_X^2 - (1-k_n)a\,E\big(ax[n-1]+w[n]\big)\hat{x}[n-1]\\
&= (1-k_n)\sigma_X^2 - (1-k_n)a^2\,E\,x[n-1]\hat{x}[n-1]
\end{aligned}$$
Again,
$$\varepsilon^2[n-1] = E\,x[n-1]e[n-1] = E\,x[n-1]\big(x[n-1]-\hat{x}[n-1]\big) = \sigma_X^2 - E\,x[n-1]\hat{x}[n-1]$$
Therefore $E\,x[n-1]\hat{x}[n-1] = \sigma_X^2 - \varepsilon^2[n-1]$. Hence
$$\varepsilon^2[n] = \sigma_V^2\,\frac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}$$
where we have substituted $\sigma_W^2 = (1-a^2)\sigma_X^2$.

We still have to find $\varepsilon^2[0]$. For this, assume $x[-1] = \hat{x}[-1] = 0$. Then from the relation
$$\varepsilon^2[n] = (1-k_n)\sigma_X^2 - (1-k_n)a^2E\,x[n-1]\hat{x}[n-1]$$
we get $\varepsilon^2[0] = (1-k_0)\sigma_X^2$.
Substituting $k_0 = \varepsilon^2[0]/\sigma_V^2$ in the expression for $\varepsilon^2[0]$ above, we get
$$\varepsilon^2[0] = \frac{\sigma_X^2\sigma_V^2}{\sigma_X^2+\sigma_V^2}$$

8.4 The Scalar Kalman filter algorithm

Given: the signal model parameters $a$ and $\sigma_W^2$, and the observation noise variance $\sigma_V^2$.

Initialisation: $\hat{x}[-1] = 0$.

Step 1: $n = 0$. Calculate $\varepsilon^2[0] = \dfrac{\sigma_X^2\sigma_V^2}{\sigma_X^2+\sigma_V^2}$.

Step 2: Calculate $k_n = \dfrac{\varepsilon^2[n]}{\sigma_V^2}$.

Step 3: Input $y[n]$ and estimate $x[n]$:
Predict: $\hat{x}[n/n-1] = a\hat{x}[n-1]$
Correct: $\hat{x}[n] = \hat{x}[n/n-1] + k_n\big(y[n]-\hat{y}[n/n-1]\big)$
$$\therefore\; \hat{x}[n] = a\hat{x}[n-1] + k_n\big(y[n]-a\hat{x}[n-1]\big)$$

Step 4: $n = n+1$. Calculate
$$\varepsilon^2[n] = \sigma_V^2\,\frac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}$$

Step 5: Go to Step 2.

- We have to initialize $\varepsilon^2[0]$.
- Irrespective of this initialization, $k_n$ and $\varepsilon^2[n]$ converge to steady-state values.
- By considering $a$ to be time-varying, the filter can be used to estimate a nonstationary signal.
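A minimal sketch of the scalar Kalman recursion of Section 8.4, assuming the AR(1) signal model above; the function name and the derivation of $\sigma_X^2$ from $\sigma_W^2$ inside the function are illustrative assumptions.

```python
import numpy as np

def scalar_kalman(y, a, var_w, var_v):
    """Scalar Kalman filter for x[n] = a x[n-1] + w[n], y[n] = x[n] + v[n]."""
    var_x = var_w / (1.0 - a ** 2)            # stationary AR(1) signal variance
    eps = var_x * var_v / (var_x + var_v)     # eps^2[0]
    xhat = 0.0                                # xhat[-1] = 0
    out = np.zeros(len(y))
    for n, yn in enumerate(y):
        k = eps / var_v                       # gain k_n = eps^2[n] / sigma_V^2
        xpred = a * xhat                      # predict
        xhat = xpred + k * (yn - xpred)       # correct
        out[n] = xhat
        # recursion for eps^2[n+1]
        eps = var_v * (var_w + a**2 * eps) / (var_w + var_v + a**2 * eps)
    return out
```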
Example: Given
$$x[n] = 0.6x[n-1] + w[n], \quad n \ge 0, \qquad y[n] = x[n] + v[n], \quad n \ge 0$$
with $\sigma_W^2 = 0.25$ and $\sigma_V^2 = 0.5$, find the Kalman filter equations at convergence and the corresponding mean-square error.

Using
$$\varepsilon^2[n] = \sigma_V^2\,\frac{\sigma_W^2 + a^2\varepsilon^2[n-1]}{\sigma_W^2 + \sigma_V^2 + a^2\varepsilon^2[n-1]}$$
and setting $\varepsilon^2[n] = \varepsilon^2[n-1] = \varepsilon^2$ at convergence,
$$\varepsilon^2 = 0.5\,\frac{0.25 + 0.6^2\varepsilon^2}{0.25 + 0.5 + 0.6^2\varepsilon^2}$$
Solving the resulting quadratic and taking the positive root, $\varepsilon^2 \approx 0.195$, so that $k_n = \varepsilon^2/\sigma_V^2 \approx 0.390$.

8.5 Vector Kalman Filter

Consider the time-varying state-space model representing a nonstationary signal. The state equation is
$$\mathbf{x}[n] = \mathbf{A}[n]\mathbf{x}[n-1] + \mathbf{w}[n]$$
where $\mathbf{w}[n]$ is a zero-mean Gaussian noise vector with covariance matrix $\mathbf{Q}_W$. The observed data $\mathbf{y}[n]$ are related to the states by
$$\mathbf{y}[n] = \mathbf{c}[n]\mathbf{x}[n] + \mathbf{v}[n]$$
where $\mathbf{v}[n]$ is a zero-mean Gaussian noise vector with covariance matrix $\mathbf{Q}_V$.

Generalising Equation (4), we have
$$\hat{\mathbf{x}}[n] = \mathbf{A}[n]\hat{\mathbf{x}}[n-1] + \mathbf{k}_n\big(\mathbf{y}[n]-\mathbf{c}[n]\mathbf{A}[n]\hat{\mathbf{x}}[n-1]\big)$$
Denote by $\hat{\mathbf{x}}[n/n] = \hat{\mathbf{x}}[n]$ the best linear estimate of $\mathbf{x}[n]$ given $\mathbf{y}[0], \mathbf{y}[1],\dots,\mathbf{y}[n]$, and by $\hat{\mathbf{x}}[n/n-1]$ the best linear estimate of $\mathbf{x}[n]$ given $\mathbf{y}[0], \mathbf{y}[1],\dots,\mathbf{y}[n-1]$. The corresponding state-estimation errors are
$$\mathbf{e}[n/n] = \mathbf{x}[n]-\hat{\mathbf{x}}[n/n] \qquad\text{and}\qquad \mathbf{e}[n/n-1] = \mathbf{x}[n]-\hat{\mathbf{x}}[n/n-1]$$
The mean-square estimation error of the scalar Kalman filter is now replaced by the error covariance matrix, denoted $\mathbf{P}$.
The a priori estimate of the error covariance matrix is
$$\mathbf{P}[n/n-1] = E\,\mathbf{e}[n/n-1]\mathbf{e}'[n/n-1]$$
and the a posteriori error covariance matrix is
$$\mathbf{P}[n/n] = E\,\mathbf{e}[n/n]\mathbf{e}'[n/n]$$
With these definitions and notations, the vector Kalman filter algorithm is as follows.

8.6 The Vector Kalman filter algorithm

State equation: $\mathbf{x}[n] = \mathbf{A}[n]\mathbf{x}[n-1] + \mathbf{w}[n]$
Observation equation: $\mathbf{y}[n] = \mathbf{c}[n]\mathbf{x}[n] + \mathbf{v}[n]$

Given:
(a) the state matrix $\mathbf{A}[n]$, $n = 0,1,2,\dots$ and the process noise covariance matrix $\mathbf{Q}_W$;
(b) the observation matrix $\mathbf{c}[n]$, $n = 0,1,2,\dots$ and the observation noise covariance matrix $\mathbf{Q}_V$;
(c) the observed data $\mathbf{y}[n]$, $n = 0,1,2,\dots$

Initialization: $\hat{\mathbf{x}}[-1/-1] = \hat{\mathbf{x}}[-1] = \mathbf{0}$, $\mathbf{P}[-1/-1] = E\,\mathbf{x}[-1]\mathbf{x}'[-1]$.

Estimation: for $n = 0,1,2,\dots$ do
State prediction: $\hat{\mathbf{x}}[n/n-1] = \mathbf{A}[n]\hat{\mathbf{x}}[n-1]$
A priori error covariance: $\mathbf{P}[n/n-1] = \mathbf{A}[n]\mathbf{P}[n-1/n-1]\mathbf{A}'[n] + \mathbf{Q}_W$
Kalman gain: $\mathbf{k}_n = \mathbf{P}[n/n-1]\mathbf{c}'[n]\big(\mathbf{c}[n]\mathbf{P}[n/n-1]\mathbf{c}'[n] + \mathbf{Q}_V\big)^{-1}$
State update: $\hat{\mathbf{x}}[n/n] = \hat{\mathbf{x}}[n/n-1] + \mathbf{k}_n\big(\mathbf{y}[n]-\mathbf{c}[n]\hat{\mathbf{x}}[n/n-1]\big)$
A posteriori error covariance: $\mathbf{P}[n/n] = \big(\mathbf{I}-\mathbf{k}_n\mathbf{c}[n]\big)\mathbf{P}[n/n-1]$
CHAPTER – 9: SPECTRAL ESTIMATION TECHNIQUES OF STATIONARY SIGNALS

9.1 Introduction

The aim of spectral analysis is to determine the spectral content of a random process from a finite set of observed data. Spectral analysis is a very old problem: it started with the Fourier series (1807), introduced to solve the wave equation; Sturm generalized the idea to arbitrary functions (1837); Schuster devised the periodogram (1898) to determine frequency content numerically.

Consider the definition of the power spectrum of a random sequence $\{x[n], -\infty < n < \infty\}$:
$$S_{XX}(w) = \sum_{m=-\infty}^{\infty}R_{XX}[m]e^{-jwm}$$
where $R_{XX}[m]$ is the autocorrelation function. The power spectral density is the discrete-time Fourier transform of the autocorrelation sequence.

The definition involves the infinite autocorrelation sequence, but we have to use only finite data. This is not only because of our inability to handle infinite data, but also because the assumption of stationarity is valid only for a short duration; for example, the speech signal is stationary only over about 20 to 80 ms.

Spectral analysis is a preliminary tool: it indicates that a particular frequency content may be present in the observed signal; the final decision has to be made by hypothesis testing.

Spectral estimation methods may be broadly classified as parametric and non-parametric. In a parametric method the random sequence is modeled by a time-series model, the model parameters are estimated from the given data, and the spectrum is found by substituting the parameters in the model spectrum.

We will first discuss the non-parametric techniques for spectral estimation. These techniques are based on the Fourier transform of the sample autocorrelation function, which is an estimator of the true autocorrelation function.
9.2 Sample Autocorrelation Functions

Two estimators of the autocorrelation function exist:
$$\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-m}x[n]x[n+m] \qquad\text{(biased)}$$
$$\hat{R}'_{XX}[m] = \frac{1}{N-m}\sum_{n=0}^{N-1-m}x[n]x[n+m] \qquad\text{(unbiased)}$$
Note that
$$E\,\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-m}E\,x[n]x[n+m] = R_{XX}[m] - \frac{m}{N}R_{XX}[m] = \Big(\frac{N-m}{N}\Big)R_{XX}[m]$$
Hence $\hat{R}_{XX}[m]$ is a biased estimator of $R_{XX}[m]$. Had we divided the sum by $N-m$ instead of $N$, the corresponding estimator would have been unbiased; therefore $\hat{R}'_{XX}[m]$ is an unbiased estimator.

$\hat{R}_{XX}[m]$ is an asymptotically unbiased estimator: as $N\to\infty$, its bias tends to 0. The variance of $\hat{R}_{XX}[m]$ is very difficult to determine because it involves fourth-order moments. An approximate expression for the covariance of $\hat{R}_{XX}[m]$ is given by Jenkins and Watts (1968) as
$$\mathrm{Cov}\big(\hat{R}_{XX}[m_1],\hat{R}_{XX}[m_2]\big) \cong \frac{1}{N}\sum_{n=-\infty}^{\infty}\big(R_{XX}[n]R_{XX}[n+m_2-m_1] + R_{XX}[n-m_1]R_{XX}[n+m_2]\big)$$
This means that the estimated autocorrelation values are highly correlated. The variance of $\hat{R}_{XX}[m]$ is obtained from the above as
$$\mathrm{var}\big(\hat{R}_{XX}[m]\big) \cong \frac{1}{N}\sum_{n=-\infty}^{\infty}\big(R_{XX}^2[n] + R_{XX}[n-m]R_{XX}[n+m]\big)$$
Note that the variance of $\hat{R}_{XX}[m]$ is large for large lag $m$, especially as $m$ approaches $N$. Also, as $N\to\infty$, $\mathrm{var}\big(\hat{R}_{XX}[m]\big)\to 0$ provided $\sum_{n=-\infty}^{\infty}R_{XX}^2[n] < \infty$.
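The two estimators differ only in the normalization, as the following short NumPy sketch (with illustrative names) makes explicit.

```python
import numpy as np

def sample_autocorrelation(x, max_lag, biased=True):
    """Biased (divide by N) or unbiased (divide by N-m) autocorrelation estimates."""
    x = np.asarray(x, float)
    N = len(x)
    r = np.zeros(max_lag + 1)
    for m in range(max_lag + 1):
        s = np.dot(x[:N - m], x[m:])      # sum_{n=0}^{N-1-m} x[n] x[n+m]
        r[m] = s / N if biased else s / (N - m)
    return r
```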
As $N\to\infty$, $\mathrm{var}\big(\hat{R}_{XX}[m]\big)\to 0$. Thus, although the sample autocorrelation function is a consistent estimator, its Fourier transform is not, and here lies the problem of spectral estimation. Although $\hat{R}'_{XX}[m]$ is unbiased and consistent for $R_{XX}[m]$, it is not used for spectral estimation because $\hat{R}'_{XX}[m]$ does not satisfy the non-negative definiteness criterion.

9.3 Periodogram (Schuster, 1898)

$$\hat{S}^{p}_{XX}(w) = \frac{1}{N}\bigg|\sum_{n=0}^{N-1}x[n]e^{-jnw}\bigg|^2, \qquad -\pi \le w \le \pi$$
$\hat{S}^{p}_{XX}(w)$ gives the power output of band-pass filters with impulse response
$$h_i[n] = \frac{1}{N}e^{-jw_in}\,\mathrm{rect}\Big(\frac{n}{N}\Big)$$
and $h_i[n]$ is a very poor band-pass filter. Also
$$\hat{S}^{p}_{XX}(w) = \sum_{m=-(N-1)}^{N-1}\hat{R}_{XX}[m]e^{-jwm}$$
where $\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-|m|}x[n]x[n+|m|]$: the periodogram is the Fourier transform of the sample autocorrelation function. To establish this relation, consider
$$x_N[n] = \begin{cases}x[n], & 0 \le n \le N-1\\ 0, & \text{otherwise}\end{cases}
\qquad\therefore\; \hat{R}_{XX}[m] = \frac{x_N[m]*x_N[-m]}{N}$$
$$\hat{S}^{p}_{XX}(w) = \frac{1}{N}\Big(\sum_{n=0}^{N-1}x[n]e^{-jwn}\Big)\Big(\sum_{n=0}^{N-1}x[n]e^{jwn}\Big) = \frac{1}{N}\bigg|\sum_{n=0}^{N-1}x[n]e^{-jwn}\bigg|^2$$
So
$$\hat{S}^{p}_{XX}(w) = \sum_{m=-(N-1)}^{N-1}\hat{R}_{XX}[m]e^{-jwm}$$
$$E\,\hat{S}^{p}_{XX}(w) = \sum_{m=-(N-1)}^{N-1}E\big[\hat{R}_{XX}[m]\big]e^{-jwm} = \sum_{m=-(N-1)}^{N-1}\Big(1-\frac{|m|}{N}\Big)R_{XX}[m]e^{-jwm}$$
As $N\to\infty$, the right-hand side approaches the true power spectral density $S_{XX}(w)$; thus the periodogram is an asymptotically unbiased estimator of the power spectral density. Proving (or disproving) the consistency of the periodogram is a difficult problem; we consider the simple case of a Gaussian white noise sequence in the following example, examining the periodogram only at the DFT frequencies $w_k = \frac{2\pi k}{N}$, $k = 0,1,\dots,N-1$.

Example 1: The periodogram of a zero-mean white Gaussian sequence $x[n]$, $n = 0,\dots,N-1$. The power spectral density is
$$S_{XX}(w) = \sigma_X^2, \qquad -\pi < w \le \pi$$
The periodogram is
$$\hat{S}^{p}_{XX}(w) = \frac{1}{N}\bigg|\sum_{n=0}^{N-1}x[n]e^{-jwn}\bigg|^2$$
At the DFT frequencies $w_k = \frac{2\pi k}{N}$, $k = 0,1,\dots,N-1$,
$$\hat{S}^{p}_{XX}(k) = \frac{1}{N}\bigg|\sum_{n=0}^{N-1}x[n]e^{-j\frac{2\pi}{N}kn}\bigg|^2 = \frac{1}{N}\bigg|\sum_{n=0}^{N-1}x[n]e^{-jw_kn}\bigg|^2
= \bigg(\sum_{n=0}^{N-1}\frac{x[n]\cos w_kn}{\sqrt{N}}\bigg)^2 + \bigg(\sum_{n=0}^{N-1}\frac{x[n]\sin w_kn}{\sqrt{N}}\bigg)^2 = C_X^2(w_k) + S_X^2(w_k)$$
where $C_X(w_k)$ and $S_X(w_k)$ are the cosine and sine parts of $\hat{S}^{p}_{XX}$. Let us consider
$$C_X(w_k) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x[n]\cos w_kn$$
which is a linear combination of Gaussian random variables. Clearly $E\,C_X(w_k) = 0$ and
$$\mathrm{var}\,C_X(w_k) = E\bigg(\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x[n]\cos w_kn\bigg)^2 = \frac{1}{N}\sum_{n=0}^{N-1}E\,x^2[n]\cos^2w_kn + E(\text{cross terms})$$
Assume the sequence $x[n]$ to be independent, so that the cross terms vanish.
Therefore
$$\mathrm{var}\big(C_X(w_k)\big) = \frac{\sigma_X^2}{N}\sum_{n=0}^{N-1}\cos^2w_kn = \frac{\sigma_X^2}{N}\sum_{n=0}^{N-1}\frac{1+\cos 2w_kn}{2}
= \frac{\sigma_X^2}{N}\Big(\frac{N}{2} + \frac{\sin Nw_k}{2\sin w_k}\cos(N-1)w_k\Big)
= \sigma_X^2\Big(\frac{1}{2} + \cos(N-1)w_k\,\frac{\sin Nw_k}{2N\sin w_k}\Big)$$
For $w_k = \frac{2\pi}{N}k$,
$$\frac{\sin Nw_k}{2\sin w_k} = 0 \qquad\text{for } k \ne 0,\ k \ne \frac{N}{2}\ (\text{assuming }N\text{ even})$$
$$\therefore\; \mathrm{var}\big(C_X(w_k)\big) = \frac{\sigma_X^2}{2} \qquad\text{for } k \ne 0,\ k \ne \frac{N}{2}$$
Again, $\frac{\sin Nw_k}{N\sin w_k} = 1$ for $k = 0$, so
$$\mathrm{var}\big(C_X(w_k)\big) = \sigma_X^2 \qquad\text{for } k = 0,\ k = \frac{N}{2}$$
Similarly, considering the sine part
$$S_X(w_k) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x[n]\sin w_kn$$
we have $E\,S_X(w_k) = 0$ since $x[n]$ is zero-mean, and
$$\mathrm{var}\,S_X(w_k) = 0 \;\text{ for } k = 0 \text{ (the dc term has no sine part)}, \qquad \mathrm{var}\,S_X(w_k) = \frac{\sigma_X^2}{2} \;\text{ for } k \ne 0$$
Therefore the distributions are: for $k = 0$, $C_X[k] \sim N(0,\sigma_X^2)$ and $S_X[k] = 0$; for $k \ne 0$, $C_X \sim N\big(0,\tfrac{\sigma_X^2}{2}\big)$ and $S_X \sim N\big(0,\tfrac{\sigma_X^2}{2}\big)$. We can also show that $\mathrm{Cov}\big(S_X(w_k),C_X(w_k)\big) = 0$, so $C_X(w_k)$ and $S_X(w_k)$ are independent Gaussian random variables. The periodogram
$$\hat{S}^{p}_{XX}[k] = C_X^2[k] + S_X^2[k]$$
therefore has a chi-square distribution.
9.4 Chi-square distribution

Let $X_1, X_2, \dots, X_N$ be independent zero-mean Gaussian random variables, each with variance $\sigma_X^2$. Then $Y = X_1^2 + X_2^2 + \dots + X_N^2$ has a $\chi_N^2$ (chi-square) distribution with $N$ degrees of freedom, mean
$$EY = E\big(X_1^2+X_2^2+\dots+X_N^2\big) = N\sigma_X^2$$
and variance $2N\sigma_X^4$.

For $k \ne 0$, $\hat{S}_{XX}[k] = C_X^2[k] + S_X^2[k]$ is a $\chi_2^2$ variable:
$$E\,\hat{S}_{XX}[k] = \frac{\sigma_X^2}{2}+\frac{\sigma_X^2}{2} = \sigma_X^2 = S_{XX}[k] \;\Rightarrow\; \hat{S}_{XX}[k] \text{ is unbiased}$$
$$\mathrm{var}\big(\hat{S}_{XX}[k]\big) = 2\times 2\Big(\frac{\sigma_X^2}{2}\Big)^2 = S_{XX}^2[k], \text{ which is independent of } N$$
$\hat{S}_{XX}[0] = C_X^2[0]$ is a $\chi_1^2$ variable with one degree of freedom:
$$E\,\hat{S}_{XX}[0] = \sigma_X^2 \;\Rightarrow\; \hat{S}_{XX}[0] \text{ is unbiased}, \qquad \mathrm{var}\big(\hat{S}_{XX}[0]\big) = 2\sigma_X^4 = 2S_{XX}^2[0]$$
It can be shown that for a Gaussian independent (white) noise sequence, at any frequency $w$,
$$\mathrm{var}\big(\hat{S}^{p}_{XX}(w)\big) \approx S_{XX}^2(w)$$

[Figure: a typical periodogram $\hat{S}^{p}_{XX}(w)$ plotted over $-\pi \le w \le \pi$, showing its erratic fluctuations.]
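At the DFT frequencies the periodogram is simply $\frac{1}{N}$ times the squared magnitude of the DFT of the data, so it can be computed with a single FFT. A minimal sketch (with illustrative names) follows.

```python
import numpy as np

def periodogram(x, nfft=None):
    """Periodogram (1/N)|DFT{x}|^2 at the DFT frequencies w_k = 2*pi*k/nfft."""
    x = np.asarray(x, float)
    N = len(x)
    nfft = nfft or N
    X = np.fft.fft(x, nfft)
    return (np.abs(X) ** 2) / N
```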
For the general case,
$$\hat{S}^{p}_{XX}(w) = \sum_{m=-(N-1)}^{N-1}\hat{R}_{XX}[m]e^{-jwm}$$
where $\hat{R}_{XX}[m] = \frac{1}{N}\sum_{n=0}^{N-1-|m|}x[n]x[n+|m|]$ is the biased estimator of the autocorrelation function. Hence
$$\hat{S}^{p}_{XX}(w) = \sum_{m=-(N-1)}^{N-1}\Big(1-\frac{|m|}{N}\Big)\hat{R}'_{XX}[m]e^{-jwm} = \sum_{m=-(N-1)}^{N-1}w_B[m]\hat{R}'_{XX}[m]e^{-jwm}$$
where $\hat{R}'_{XX}[m]$ is the unbiased estimator of the autocorrelation and $w_B[m] = 1-\frac{|m|}{N}$ is the Bartlett window. Using the fact that the Fourier transform of a product of two sequences is the convolution of their transforms,
$$\hat{S}^{p}_{XX}(w) = W_B(w)*\mathrm{FT}\big\{\hat{R}'_{XX}[m]\big\}$$
$$E\,\hat{S}^{p}_{XX}(w) = W_B(w)*\mathrm{FT}\big\{E\,\hat{R}'_{XX}[m]\big\} = W_B(w)*S_{XX}(w) = \int_{-\pi}^{\pi}W_B(w-\xi)S_{XX}(\xi)\,d\xi$$
As $N\to\infty$, $E\,\hat{S}^{p}_{XX}(w) \to S_{XX}(w)$.

Now $\mathrm{var}\big(\hat{S}^{p}_{XX}(w)\big)$ cannot be found exactly (there is no analytical tool), but an approximate expression for the covariance is
$$\mathrm{Cov}\big(\hat{S}^{p}_{XX}(w_1),\hat{S}^{p}_{XX}(w_2)\big) \sim S_{XX}(w_1)S_{XX}(w_2)\Bigg[\bigg(\frac{\sin\frac{N(w_1+w_2)}{2}}{N\sin\frac{w_1+w_2}{2}}\bigg)^2 + \bigg(\frac{\sin\frac{N(w_1-w_2)}{2}}{N\sin\frac{w_1-w_2}{2}}\bigg)^2\Bigg]$$
$$\mathrm{var}\big(\hat{S}^{p}_{XX}(w)\big) \sim S_{XX}^2(w)\bigg[1+\Big(\frac{\sin Nw}{N\sin w}\Big)^2\bigg]$$
$$\therefore\; \mathrm{var}\big(\hat{S}^{p}_{XX}(w)\big) \sim 2S_{XX}^2(w) \text{ for } w = 0,\pi, \qquad \cong S_{XX}^2(w) \text{ for } 0 < w < \pi$$
Consider $w_1 = 2\pi\frac{k_1}{N}$ and $w_2 = 2\pi\frac{k_2}{N}$, with $k_1$ and $k_2$ distinct integers. Then
$$\mathrm{Cov}\big(\hat{S}^{p}_{XX}(w_1),\hat{S}^{p}_{XX}(w_2)\big) \cong 0$$
This means that there is no correlation between neighbouring spectral estimates. Therefore the periodogram is not a reliable estimator of the power spectrum, for the following two reasons:
(1) The periodogram is not a consistent estimator, in the sense that $\mathrm{var}\big(\hat{S}^{p}_{XX}(w)\big)$ does not tend to zero as the data length approaches infinity.
(2) For two distinct frequencies, the covariance of the periodogram values decreases as the data length increases; thus the periodogram is erratic and widely fluctuating.
Our aim will be to look for spectral estimators with as low a variance as possible, without increasing the bias too much.

9.5 Modified Periodograms

Data windowing: multiply the data by a more suitable window than the rectangular one before finding the periodogram. The resulting periodogram will have less bias.

9.5.1 Averaged Periodogram: The Bartlett Method

The motivation for this method comes from the observation that
$$\lim_{N\to\infty}E\,\hat{S}^{p}_{XX}(w) \to S_{XX}(w)$$
We have to modify $\hat{S}^{p}_{XX}(w)$ to get a consistent estimator of $S_{XX}(w)$. Given the data $x[n]$, $n = 0,1,\dots,N-1$:
- Divide the data into $K$ non-overlapping segments, each of length $L$.
- Determine the periodogram of each segment,
$$\hat{S}^{(k)}_{XX}(w) = \frac{1}{L}\bigg|\sum_{n=0}^{L-1}x[n]e^{-jwn}\bigg|^2 \qquad\text{for each segment}$$
- Then
$$\hat{S}^{(av)}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1}\hat{S}^{(k)}_{XX}(w)$$
As shown earlier,
$$E\,\hat{S}^{(k)}_{XX}(w) = \int_{-\pi}^{\pi}W_B(w-\xi)S_{XX}(\xi)\,d\xi$$
where
$$w_B[m] = \begin{cases}1-\dfrac{|m|}{L}, & |m| \le L-1\\ 0, & \text{otherwise}\end{cases}$$
$$W_B(w) = \frac{1}{L}\Bigg(\frac{\sin\frac{wL}{2}}{\sin\frac{w}{2}}\Bigg)^2$$
and
$$\hat{S}^{(k)}_{XX}(w) = \sum_{m=-(L-1)}^{L-1}\hat{R}_{XX}[m]e^{-jwm}$$
$$E\,\hat{S}^{(av)}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1}E\,\hat{S}^{(k)}_{XX}(w) = E\,\hat{S}^{(k)}_{XX}(w) = \int_{-\pi}^{\pi}W_B(w-\xi)S_{XX}(\xi)\,d\xi$$
Thus the mean of the averaged periodogram is the true spectrum convolved with the frequency response $W_B(w)$ of the Bartlett window. The effect of reducing the data length from $N$ points to $L = N/K$ is a window whose spectral width is increased by a factor $K$; consequently the frequency resolution is reduced by a factor $K$.

[Figure: the main lobe of the original $W_B(w)$ and the widened main lobe of the modified $W_B(w)$ after the window size is reduced.]

9.5.2 Variance of the averaged periodogram

$\mathrm{var}\big(\hat{S}^{(av)}_{XX}(w)\big)$ is not simple to evaluate, as fourth-order moments are involved. Simplification: assume the $K$ data segments are independent. Then
$$\mathrm{var}\big(\hat{S}^{(av)}_{XX}(w)\big) = \mathrm{var}\bigg(\frac{1}{K}\sum_{k=0}^{K-1}\hat{S}^{(k)}_{XX}(w)\bigg) = \frac{1}{K^2}\sum_{k=0}^{K-1}\mathrm{var}\,\hat{S}^{(k)}_{XX}(w)
\sim \frac{K}{K^2}\,S_{XX}^2(w)\bigg(1+\Big(\frac{\sin wL}{L\sin w}\Big)^2\bigg)
\sim \frac{1}{K}\times\big(\text{variance of the original periodogram}\big)$$
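In code, the Bartlett estimate is simply an average of per-segment periodograms; a minimal NumPy sketch (with illustrative names) is given below before the variance discussion continues.

```python
import numpy as np

def bartlett_psd(x, K, nfft=None):
    """Averaged periodogram: K non-overlapping segments of length L = N // K."""
    x = np.asarray(x, float)
    N = len(x)
    L = N // K
    nfft = nfft or L
    S = np.zeros(nfft)
    for k in range(K):
        seg = x[k * L:(k + 1) * L]
        S += np.abs(np.fft.fft(seg, nfft)) ** 2 / L   # periodogram of segment k
    return S / K                                      # average over the K segments
```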
Thus, according to the above approximation, the variance is reduced by a factor of at most $K$; in practice the reduction is by a factor less than $K$, because the data segments are not truly independent. For large $L$ and large $K$, $\mathrm{var}\big(\hat{S}^{(av)}_{XX}(w)\big)$ tends to zero and $\hat{S}^{(av)}_{XX}(w)$ becomes a consistent estimator.

The Welch Method (averaging modified periodograms)
(1) Divide the data into overlapping segments (overlapping by about 50 to 75%).
(2) Window the data, so that the modified periodogram of each segment is
$$\hat{S}^{(\mathrm{mod})}_{XX}(w) = \frac{1}{UL}\bigg|\sum_{n=0}^{L-1}x[n]w[n]e^{-jwn}\bigg|^2, \qquad U = \frac{1}{L}\sum_{n=0}^{L-1}w^2[n]$$
Here $\sum_{n=0}^{L-1}x[n]w[n]e^{-jwn}$ is the DTFT of the windowed segment $x[n]w[n]$; the window $w[n]$ need not be an even function and is used to control spectral leakage (for the rectangular window, $w[n] = 1$ for $n = 0,\dots,L-1$ and $0$ otherwise, the modified periodogram reduces to the ordinary one).
(3) Compute
$$\hat{S}^{(\mathrm{Welch})}_{XX}(w) = \frac{1}{K}\sum_{k=0}^{K-1}\hat{S}^{(\mathrm{mod})}_{XX}(w)$$

9.6 Smoothing the periodogram: The Blackman and Tukey Method
- A widely used non-parametric method.
- The biased autocorrelation estimate $\hat{R}_{XX}[m]$ is used.
- Estimation of the autocorrelation at large lags involves fewer data points ($N-m$ points for lag $m$) and is therefore more prone to error, so we give less importance to the higher-lag autocorrelations: $\hat{R}_{XX}[m]$ is multiplied by a window sequence $w[m]$ which underweights the autocorrelation function at large lags.
The window function $w[m]$ has the following properties:
$$0 < w[m] \le 1, \qquad w[0] = 1, \qquad w[-m] = w[m], \qquad w[m] = 0 \text{ for } |m| > M.$$
The condition $w[0] = 1$ is a consequence of the fact that the smoothed periodogram should not modify an already smooth spectrum, so that $\int_{-\pi}^{\pi}W(w)\,dw = 1$.

The smoothed periodogram is given by
$$\hat{S}^{BT}_{XX}(w) = \sum_{m=-(M-1)}^{M-1}w[m]\hat{R}_{XX}[m]e^{-jwm}$$
Issues concerned:
1. How to select $w[m]$? A large number of windows are available; use a window with small side-lobes. This will reduce bias and improve resolution.
2. How to select $M$? Normally $M \sim N/5$ or $M \sim 2\sqrt{N}$ (mostly based on experience), for data lengths $N$ of the order of 1000 to 10,000.

$\hat{S}^{BT}_{XX}(w)$ is the convolution of $\hat{S}^{p}_{XX}(w)$ with $W(w)$, the Fourier transform of the window sequence, i.e. a smoothing of the periodogram; this decreases the variance of the estimate at the expense of reduced resolution.
$$E\big(\hat{S}^{BT}_{XX}(w)\big) = E\,\hat{S}^{p}_{XX}(w)*W(w), \qquad E\,\hat{S}^{p}_{XX}(w) = \int_{-\pi}^{\pi}S_{XX}(\theta)W_B(w-\theta)\,d\theta$$
or, from the time domain,
$$E\,\hat{S}^{BT}_{XX}(w) = E\sum_{m=-(M-1)}^{M-1}w[m]\hat{R}_{XX}[m]e^{-jwm} = \sum_{m=-(M-1)}^{M-1}w[m]\,E\big[\hat{R}_{XX}[m]\big]e^{-jwm}$$
$\hat{S}^{BT}_{XX}(w)$ can be shown to be asymptotically unbiased, and its variance is approximately
$$\mathrm{var}\big(\hat{S}^{BT}_{XX}(w)\big) \sim \frac{S_{XX}^2(w)}{N}\sum_{k=-(M-1)}^{M-1}w^2[k]$$
Some of the popular windows are the rectangular window, the Bartlett window, the Hamming window, the Hanning window, etc. (A short code sketch follows.)
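A minimal NumPy sketch of the Blackman-Tukey estimator, here using a Bartlett (triangular) lag window as one possible choice of $w[m]$; the function name, the window choice and the FFT length are illustrative assumptions.

```python
import numpy as np

def blackman_tukey_psd(x, M, nfft=1024):
    """Blackman-Tukey estimate: window the biased autocorrelation with a
    triangular lag window of length M and Fourier transform it.
    Requires nfft >= 2*M - 1 and M >= 2."""
    x = np.asarray(x, float)
    N = len(x)
    # biased autocorrelation estimates for lags 0 .. M-1
    r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(M)])
    rw = r * (1.0 - np.arange(M) / M)        # apply the triangular lag window
    # place positive lags at the start and negative lags at the end so that the
    # FFT evaluates sum_{m=-(M-1)}^{M-1} w[m] R[m] e^{-j w_k m}
    a = np.zeros(nfft)
    a[:M] = rw
    a[-(M - 1):] = rw[:0:-1]
    return np.real(np.fft.fft(a))
```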
Procedure
1. Given the data $x[n]$, $n = 0,1,\dots,N-1$.
2. Find the periodogram $\hat{S}^{p}_{XX}(2\pi k/N) = \frac{1}{N}\big|\sum_{n=0}^{N-1}x[n]e^{-j2\pi kn/N}\big|^2$.
3. By IDFT, find the autocorrelation sequence.
4. Multiply by a proper window and take the FFT.

9.7 Parametric Method

Disadvantages of classical spectral estimators, such as the Blackman-Tukey method using the windowed autocorrelation function:
- they do not use a priori information about the spectral shape;
- they do not make realistic assumptions about $x[n]$ for $n < 0$ and $n \ge N$.
Normally $\hat{R}_{XX}[m]$, $m = 0,\pm 1,\dots,\pm(N-1)$, can be estimated from the $N$ sample values. From these autocorrelations we can estimate a model for the signal, basically a time-series model. Once the model is available, it is equivalent to knowing the autocorrelation at all lags and hence gives better spectral estimation (better resolution).

A stationary signal can be represented by an ARMA($p,q$) model:
$$x[n] = \sum_{i=1}^{p}a_ix[n-i] + \sum_{i=0}^{q}b_iv[n-i]$$

[Figure: white noise $v[n]$ driving the filter $H(z)$ to produce $x[n]$.]

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{i=0}^{q}b_iz^{-i}}{1-\sum_{i=1}^{p}a_iz^{-i}}$$
and
$$S_{XX}(w) = |H(w)|^2\sigma_V^2$$
The model-based approach is: model the signal as an AR/MA/ARMA process,
- estimate the model parameters from the given data, and
- find the power spectrum by substituting the values of the model parameters in the expression for the power spectrum of the model.

9.8 AR spectral estimation

The AR($p$) model is the most popular model-based technique. It is widely used for parametric spectral analysis because:
- it can approximate a continuous power spectrum arbitrarily well (though it might need a very large $p$ to do so);
- efficient algorithms are available for estimating the model parameters;
- if the process is Gaussian, the maximum-entropy spectral estimate is the AR($p$) spectral estimate;
- many physical models suggest AR processes (e.g. speech);
- sinusoids can be expressed as AR-like models.
The spectrum is given by
$$\hat{S}_{XX}(w) = \frac{\sigma_V^2}{\Big|1-\sum_{i=1}^{p}a_ie^{-jwi}\Big|^2}$$
where the $a_i$ are the AR($p$) process parameters.

[Figure: a typical AR spectrum as a function of $w$.]

9.9 The Autocorrelation method

The $a_i$ are obtained by solving the Yule-Walker equations corresponding to the estimated autocorrelation function:
$$\begin{bmatrix}\hat{R}_{XX}[0] & \hat{R}_{XX}[1] & \cdots & \hat{R}_{XX}[p-1]\\ \hat{R}_{XX}[1] & \hat{R}_{XX}[0] & \cdots & \hat{R}_{XX}[p-2]\\ \vdots & & & \vdots\\ \hat{R}_{XX}[p-1] & \hat{R}_{XX}[p-2] & \cdots & \hat{R}_{XX}[0]\end{bmatrix}\begin{bmatrix}a_1\\ a_2\\ \vdots\\ a_p\end{bmatrix} = \begin{bmatrix}\hat{R}_{XX}[1]\\ \hat{R}_{XX}[2]\\ \vdots\\ \hat{R}_{XX}[p]\end{bmatrix}$$
$$\sigma_V^2 = \hat{R}_{XX}[0] - \sum_{i=1}^{p}a_i\hat{R}_{XX}[i]$$
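A minimal NumPy sketch of the autocorrelation method: it forms the Yule-Walker equations from the biased autocorrelation estimates and solves them directly with `np.linalg.solve` (in practice the Levinson-Durbin recursion of Section 6.8 would be used); the function name and defaults are illustrative assumptions.

```python
import numpy as np

def ar_psd_autocorrelation(x, p, nfft=512):
    """AR(p) spectral estimate via the autocorrelation (Yule-Walker) method."""
    x = np.asarray(x, float)
    N = len(x)
    r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(p + 1)])   # biased R_XX[0..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
    a = np.linalg.solve(R, r[1:p + 1])                                   # Yule-Walker equations
    var_v = r[0] - a @ r[1:p + 1]                                        # driving-noise variance
    w = 2 * np.pi * np.arange(nfft) / nfft
    A = 1.0 - sum(a[i - 1] * np.exp(-1j * w * i) for i in range(1, p + 1))
    return var_v / np.abs(A) ** 2                                        # S_XX(w) on the grid w
```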
9.10 The Covariance method

The problem of estimating the AR parameters may also be posed as minimizing the sum of squared errors
$$\varepsilon = \sum_{n=p}^{N-1}e^2[n]$$
with respect to the AR parameters, where
$$e[n] = x[n] - \sum_{i=1}^{p}a_ix[n-i]$$
This formulation results in
$$\begin{bmatrix}\hat{R}_{XX}[1,1] & \hat{R}_{XX}[2,1] & \cdots & \hat{R}_{XX}[p,1]\\ \hat{R}_{XX}[1,2] & \hat{R}_{XX}[2,2] & \cdots & \hat{R}_{XX}[p,2]\\ \vdots & & & \vdots\\ \hat{R}_{XX}[1,p] & \hat{R}_{XX}[2,p] & \cdots & \hat{R}_{XX}[p,p]\end{bmatrix}\begin{bmatrix}a_1\\ a_2\\ \vdots\\ a_p\end{bmatrix} = \begin{bmatrix}\hat{R}_{XX}[0,1]\\ \hat{R}_{XX}[0,2]\\ \vdots\\ \hat{R}_{XX}[0,p]\end{bmatrix}$$
$$\sigma_V^2 = \hat{R}_{XX}[0,0] - \sum_{i=1}^{p}a_i\hat{R}_{XX}[0,i]$$
where
$$\hat{R}_{XX}[k,l] = \frac{1}{N-p}\sum_{n=p}^{N-1}x[n-k]x[n-l]$$
is an estimate of the autocorrelation function. Note that the autocorrelation matrix in the covariance method is not Toeplitz, so the equations cannot be solved by efficient algorithms like the Levinson-Durbin recursion.

The flow chart for AR spectral estimation is given below.
[Flow chart: given the data $x[n]$, $n = 0,1,\dots,N-1$, select an order $p$; estimate the autocorrelation values $\hat{R}_{XX}[m]$, $m = 0,1,\dots,p$; solve for $a_i$, $i = 1,2,\dots,p$, and $\sigma_V^2$; and find $\hat{S}_{XX}(w) = \sigma_V^2\big/\big|1-\sum_{i=1}^{p}a_ie^{-jwi}\big|^2$.]

Some questions:

Can an AR($p$) process model a band-pass signal?
- An AR(1) model will never be able to model a band-pass process. If one sinusoid is present, an AR(2) process will be able to discern it: if there is a strong frequency component at $w_0$, an AR(2) process with poles at $re^{\pm jw_0}$, with $r \to 1$, will be able to discriminate that frequency component.

How do we select the order of the AR($p$) process?
- The mean-square prediction error gives some guidance as to whether the selected order is adequate. For spectral estimation, some criterion function of the order parameter $p$ is minimized, for example the criteria given below.
- Minimize the final prediction error FPE($p$):
$$\mathrm{FPE}(p) = \hat{\sigma}_p^2\,\frac{N+p+1}{N-p-1}$$
where $N$ is the number of data points and $\hat{\sigma}_p^2$ is the mean-square prediction error (the variance, for the non-zero-mean case).
- Akaike Information Criterion: minimize
$$\mathrm{AIC}(p) = \ln\big(\hat{\sigma}_v^2\big) + \frac{2p}{N}$$