These notes contain comments on Orfanidis book, Wiener Filter, adaptive filter, Karhunen-
Loeve expansion, Kalman Filter, Blind Deconvolution, and others.
Introduction to Random Variables
In this section, we present a short review of probability concepts. Let $x$ be a random variable that lies in the range $-\infty < x < \infty$ and has probability density function (pdf) $f(x)$. Its mean $m$, variance $\sigma^2$, and $n$th moment are defined by the expectation values:

$$m = E[x] = \int_{-\infty}^{\infty} x f(x)\,dx$$

$$\sigma^2 = E\big[(x - E[x])^2\big] = E[x^2] - \big(E[x]\big)^2$$

$$E[x^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx$$

Notice that $f(x)$, $m$, $\sigma^2$, and $E[x^n]$ are all deterministic quantities.
For $N$ realizations $x_0, x_1, \ldots, x_{N-1}$ of the random variable $x$, the expectations are approximated by sample averages:

$$m = E[x] = \int_{-\infty}^{\infty} x f(x)\,dx \approx \frac{1}{N}\sum_{i=0}^{N-1} x_i$$

$$E[x^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx \approx \frac{1}{N}\sum_{i=0}^{N-1} x_i^n$$
The probability that the random variable x will assume a value within an interval of values [a, b]
is given by:
$$\mathrm{Prob}(a \le x \le b) = \int_a^b f(x)\,dx$$
A commonly used model for the pdf f(x) of the random variable x is the Gaussian or normal
distribution which is given as:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - E[x])^2}{2\sigma^2}\right), \qquad -\infty < x < \infty$$
In typical signal processing problems, such as designing filters to remove or separate noise from a signal, it is often assumed that the noise interference is Gaussian.
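As a quick numerical illustration of these definitions, the following sketch draws $N$ realizations of a Gaussian random variable (the parameters $m = 1$ and $\sigma = 2$ are arbitrary choices) and approximates the mean, variance, and second moment by the sample averages above:

```python
import numpy as np

# Draw N realizations of a Gaussian x and approximate E[x], the variance,
# and E[x^2] by sample averages. m_true and sigma_true are illustrative.
rng = np.random.default_rng(0)
m_true, sigma_true, N = 1.0, 2.0, 100_000
x = rng.normal(m_true, sigma_true, N)

m_hat = x.mean()                      # (1/N) sum x_i       ~ E[x] = 1
var_hat = ((x - m_hat) ** 2).mean()   # ~ E[(x - E[x])^2]   = sigma^2 = 4
second_moment = (x ** 2).mean()       # ~ E[x^2] = sigma^2 + m^2 = 5
print(m_hat, var_hat, second_moment)
```

The sample averages converge to the deterministic quantities $m$, $\sigma^2$, and $E[x^2]$ as $N$ grows.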
Joint and Conditional Densities, and Bayes’ Rule:
In many situations we deal with more than one random variable, i.e., random vectors. A pair of two different random variables $x = [x_1, x_2]^T$ may be thought of as a vector-valued random variable. Its statistical description requires knowledge of the joint probability density function (jpdf) $f(x) = f(x_1, x_2)$. The two random variables may or may not be independent of each other. A quantity that provides a measure for the degree of dependence of the two variables on each other is the conditional density $f(x_1 / x_2)$, which is the pdf of $x_1$ given $x_2$. The conditional pdf, the jpdf, and the pdf are related by Bayes' rule:

$$f(x) = f(x_1, x_2) = f(x_1 / x_2)\, f(x_2) = f(x_2 / x_1)\, f(x_1)$$
More generally, Bayes’ rule for two events A and B is given as:
$$\mathrm{Prob}(A, B) = \mathrm{Prob}(A / B)\,\mathrm{Prob}(B) = \mathrm{Prob}(B / A)\,\mathrm{Prob}(A)$$
The two random variables $x_1$ and $x_2$ are independent of each other if:

$$f(x) = f(x_1, x_2) = f(x_1)\, f(x_2), \qquad \text{i.e.} \qquad f(x_1 / x_2) = f(x_1)$$
The correlation between $x_1$ and $x_2$ is defined by the expectation value

$$R_{x_1 x_2} = E[x_1 x_2] = \int\!\!\int x_1 x_2\, f(x_1, x_2)\,dx_1\,dx_2, \qquad -\infty < x_1, x_2 < \infty$$
For $N$ realizations $x_{1,0}, \ldots, x_{1,N-1}$ of $x_1$ and $x_{2,0}, \ldots, x_{2,N-1}$ of $x_2$, we have:

$$R_{x_1 x_2} = E[x_1 x_2] = \int\!\!\int x_1 x_2\, f(x_1, x_2)\,dx_1\,dx_2 \approx \frac{1}{N}\sum_{i=0}^{N-1} x_{1,i}\, x_{2,i}$$
When $x_1$ and $x_2$ are independent, the correlation is the product of the expected values, i.e.

$$R_{x_1 x_2} = E[x_1 x_2] = E[x_1]\,E[x_2], \qquad x_1 \text{ and } x_2 \text{ independent}$$
Example: Assume that $x_2 = a x_1 + v$, where $a$ is a deterministic quantity and $v$ is a Gaussian random variable with zero mean and variance $\sigma_v^2$. We need to find the conditional pdf $f(x_2 / x_1)$. For a given value of $x_1$, we treat the term $a x_1$ as if it were a deterministic quantity, and the only randomness is due to $v$. Since $v$ is Gaussian, the conditional pdf of $x_2$ is also Gaussian, but with mean value $a x_1$ and variance $\sigma_v^2$. Thus we get:

$$f(x_2 / x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_v}\exp\left(-\frac{(x_2 - a x_1)^2}{2\sigma_v^2}\right), \qquad -\infty < x_1, x_2 < \infty$$
The concept of a random vector generalizes to any dimension. A vector of $N$ random variables

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$

is completely described if we know the joint pdf $f(x) = f(x_1, x_2, \ldots, x_N)$. The first-order statistic is the mean $m$. The second-order statistics of $x$ are its correlation matrix $R$ and its covariance matrix $\Sigma$, defined by:

$$m = E[x], \qquad R = E\big[x x^T\big], \qquad \Sigma = E\big[(x - m)(x - m)^T\big]$$
where the superscript $T$ denotes transpose. The $ij$th element of the correlation matrix $R$ is the correlation between the $i$th random variable $x_i$ and the $j$th random variable $x_j$, that is, $R_{ij} = E[x_i x_j]$. It is easily shown that the covariance and correlation matrices are related by

$$R = \Sigma + m m^T$$

When the mean is zero, $R$ and $\Sigma$ coincide. Both $R$ and $\Sigma$ are symmetric positive semidefinite matrices.
Example: The probability density of a Gaussian random vector $x = [x_1\; x_2\; \ldots\; x_N]^T$ is completely specified by its mean $m$ and covariance matrix $\Sigma$, that is,

$$f(x) = \frac{1}{(2\pi)^{N/2}(\det \Sigma)^{1/2}}\exp\left(-\frac{1}{2}(x - m)^T \Sigma^{-1} (x - m)\right)$$
Example: Under a linear transformation, a Gaussian random vector remains Gaussian, i.e., a linear function of a Gaussian is Gaussian. Let $x = [x_1\; x_2\; \ldots\; x_N]^T$ be a Gaussian random vector with mean $m_x$ and covariance matrix $\Sigma_x$. The linearly transformed vector $y = Bx$, where $B$ is a nonsingular $N \times N$ matrix, is Gaussian-distributed with mean and covariance

$$m_y = B m_x, \qquad \Sigma_y = B \Sigma_x B^T$$

These relations (mean and covariance) are valid also for non-Gaussian random vectors. They are easily derived as follows: $E[y] = E[Bx] = B\,E[x]$, and $\Sigma_y = E\big[(y - m_y)(y - m_y)^T\big] = B\,E\big[(x - m_x)(x - m_x)^T\big]B^T = B \Sigma_x B^T$.
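A minimal numerical check of the last example (the particular $m_x$, $\Sigma_x$, and $B$ below are arbitrary choices): the empirical mean and covariance of $y = Bx$ should approach $B m_x$ and $B \Sigma_x B^T$:

```python
import numpy as np

# Check empirically that y = B x has mean B m_x and covariance B Sigma_x B^T.
# m_x, Sigma_x, and B are arbitrary illustrative choices.
rng = np.random.default_rng(1)
m_x = np.array([1.0, -2.0])
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
B = np.array([[1.0, 1.0],
              [0.0, 2.0]])             # nonsingular 2x2

x = rng.multivariate_normal(m_x, Sigma_x, size=200_000)
y = x @ B.T                            # each row is B x for one realization

m_y_pred = B @ m_x                     # predicted mean B m_x
Sigma_y_pred = B @ Sigma_x @ B.T       # predicted covariance B Sigma_x B^T
m_y_emp = y.mean(axis=0)
Sigma_y_emp = np.cov(y.T)
print(m_y_emp, Sigma_y_emp)
```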
Random Signal Models
A stochastic process is a collection of random variables, i.e., a random vector whose index could be time, space, volume, or others. In this work, we focus on a time-domain index and on stationary processes. A stationary process is a process where, generally speaking, the statistical properties do not change with time. Thus, the vector $x = [x_1\; x_2\; \ldots\; x_N]^T$ is a sample in time of a stochastic process $X(n)$ observed at instants $n = 1, 2, \ldots, N$, giving $X(1), X(2), \ldots, X(N)$. Notice that we use a capital letter when dealing with stochastic processes. One of the most useful ways to model a random signal is to consider it as the output of a causal and stable linear filter $C(z)$ driven by a stationary uncorrelated (white-noise) sequence $\varepsilon(n)$ (sometimes written $\varepsilon_n$), where

$$C(z) = \sum_{n=0}^{\infty} c_n z^{-n}, \qquad E[\varepsilon(n)\varepsilon(n-k)] = R_\varepsilon(k) = \sigma_\varepsilon^2\,\delta(k), \qquad \delta(k) = \begin{cases} 1, & k = 0 \\ 0, & \text{elsewhere} \end{cases}$$

$\delta(k)$ is the delta function. Thus,

$$X(n) = C(n) * \varepsilon(n) = \sum_{i=0}^{n} c_i\, \varepsilon(n - i)$$
where "$*$" is the convolution operation. The above model is termed the moving average (MA) model. Another common model is the autoregressive moving average (ARMA) model, where $C(z)$ is the ratio of two polynomials in $z$, i.e.

$$X(z) = \frac{B(z)}{A(z)}\,\varepsilon(z), \qquad \text{i.e.} \qquad A(z)\,X(z) = B(z)\,\varepsilon(z)$$

and in the time domain it has the shape

$$\sum_i a_i\, X(n - i) = \sum_j b_j\, \varepsilon(n - j)$$
Example: If

$$X(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1}}\,\varepsilon(z)$$

then in the time domain we have

$$a_0 X(n) + a_1 X(n-1) = b_0\,\varepsilon(n) + b_1\,\varepsilon(n-1) + b_2\,\varepsilon(n-2)$$

When the order of the numerator $B(z)$ is zero ($b_0$ is nonzero and the rest of the parameters are zero), we have what is termed the autoregressive (AR) model, i.e.

$$a_0 X(n) + a_1 X(n-1) = b_0\,\varepsilon(n)$$
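A sketch of the AR(1) special case, simulated by driving the recursion with white Gaussian noise ($a_0$ normalized to 1; $a_1 = -0.8$ and $b_0 = 1$ are illustrative values). For a stable AR(1), the output variance should approach $b_0^2 \sigma_\varepsilon^2 / (1 - a_1^2)$:

```python
import numpy as np

# Generate an AR(1) signal a0*X(n) + a1*X(n-1) = b0*eps(n), a0 = 1,
# by direct recursion with white Gaussian noise (unit variance).
rng = np.random.default_rng(2)
N, a1, b0 = 50_000, -0.8, 1.0
eps = rng.normal(0.0, 1.0, N)

X = np.zeros(N)
for n in range(1, N):
    X[n] = -a1 * X[n - 1] + b0 * eps[n]   # X(n) = -a1*X(n-1) + b0*eps(n)

# theoretical stationary variance: b0^2 * sigma^2 / (1 - a1^2)
print(X.var(), b0**2 / (1 - a1**2))
```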
Maximum Likelihood Estimation:
Once the model is chosen, we use the data to find the model parameters. One of the most commonly used methods is the maximum likelihood method, because it yields consistent, asymptotically unbiased estimates of the parameters. We start with a simple AR(1) model, with only one lag in $X$, and then generalize the results to more than one lag. Assume that we have a set of $N+1$ data points, $X(0), X(1), X(2), \ldots, X(N)$. The signal is modeled as an AR(1) process:

$$X(n) = a_1 X(n-1) + \varepsilon(n)$$

For a given value of $X(n-1)$, and assuming $\varepsilon(n)$ is Gaussian with zero mean and variance $\sigma_\varepsilon^2$, the conditional pdf $f(X(n)/X(n-1))$ becomes:

$$f(X(n)/X(n-1)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(n) - a_1 X(n-1)\big)^2}{2\sigma_\varepsilon^2}\right)$$
For the maximum likelihood method, the basic idea is to form the jpdf of the observations, $f(X(0), X(1), X(2), \ldots, X(N))$, and find the estimates for the unknowns that maximize this likelihood function:

$$f(X(0), X(1), \ldots, X(N)) = f(X(0))\,f(X(1)/X(0))\,f(X(2)/X(1), X(0))\,f(X(3)/X(2), X(1), X(0)) \cdots$$

Define $\mathbf{X}_N = [X(0), X(1), X(2), \ldots, X(N)]^T$. Using Bayes' rule, the joint pdf is:

$$f(X(0), X(1), \ldots, X(N)) = f(\mathbf{X}_N) = f(X(N)/X(N-1), \ldots, X(0))\,f(X(N-1), \ldots, X(0)) = f(X(N)/\mathbf{X}_{N-1})\,f(\mathbf{X}_{N-1})$$

Similarly,

$$f(\mathbf{X}_{N-1}) = f(X(N-1)/X(N-2), \ldots, X(0))\,f(X(N-2), \ldots, X(0)) = f(X(N-1)/\mathbf{X}_{N-2})\,f(\mathbf{X}_{N-2})$$
We continue until we get to the initial conditions at $X(m), X(m-1), \ldots, X(0)$. This yields

$$f(X(0), X(1), \ldots, X(N)) = f(X(0), X(1), \ldots, X(m)) \prod_{i=m}^{N-1} f(X(i+1)/\mathbf{X}_i)$$

where $f(X(0), X(1), \ldots, X(m))$ is considered the initial condition; it could be known, or it could be neglected for a large amount of data, i.e., $N \gg m$.
For the AR(1) model, $X(n)$ is related only to $X(n-1)$, and we use Bayes' formula to find a reasonable expression for the jpdf as follows:

$$f(X(0), X(1), \ldots, X(N)) = f(X(0))\,f(X(1)/X(0))\,f(X(2)/X(1))\,f(X(3)/X(2)) \cdots = f(X(0)) \prod_{n=1}^{N} f(X(n)/X(n-1))$$
Substituting the derived expression for $f(X(n)/X(n-1))$, we get:

$$f(X(0), X(1), \ldots, X(N)) = f(X(0)) \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(n) - a_1 X(n-1)\big)^2}{2\sigma_\varepsilon^2}\right)$$

$$= f(X(0))\,\big(2\pi\sigma_\varepsilon^2\big)^{-N/2}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{n=1}^{N}\big(X(n) - a_1 X(n-1)\big)^2\right)$$
Taking the log of both sides, we get the log-likelihood function:

$$\log f(X(0), X(1), \ldots, X(N)) = \log f(X(0)) - \frac{N}{2}\log(2\pi) - \frac{N}{2}\log \sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_{n=1}^{N}\big(X(n) - a_1 X(n-1)\big)^2$$
Maximizing this quantity, ignoring $\log f(X(0))$, with respect to the unknowns $a_1$ and $\sigma_\varepsilon^2$, we get the desired results. Setting the derivative with respect to $a_1$ to zero:

$$0 = \frac{\partial}{\partial a_1}\left[-\frac{1}{2\sigma_\varepsilon^2}\sum_{n=1}^{N}\big(X(n) - a_1 X(n-1)\big)^2\right]_{a_1 = \hat{a}_1,\, \sigma_\varepsilon^2 = \hat{\sigma}_\varepsilon^2} = \frac{1}{\hat{\sigma}_\varepsilon^2}\sum_{n=1}^{N}\big(X(n) - \hat{a}_1 X(n-1)\big)X(n-1)$$

which reduces to:

$$\sum_{n=1}^{N}\big(X(n) - \hat{a}_1 X(n-1)\big)X(n-1) = 0$$
Rearranging, we get:

$$\hat{a}_1 = \frac{\sum_{n=1}^{N} X(n)\,X(n-1)}{\sum_{n=1}^{N} X(n-1)^2}$$
Similarly, for $\sigma_\varepsilon^2$ we get:

$$0 = -\frac{N}{2\hat{\sigma}_\varepsilon^2} + \frac{1}{2\hat{\sigma}_\varepsilon^4}\sum_{n=1}^{N}\big(X(n) - \hat{a}_1 X(n-1)\big)^2$$

Rearranging, we get:

$$\hat{\sigma}_\varepsilon^2 = \frac{1}{N}\sum_{n=1}^{N}\big(X(n) - \hat{a}_1 X(n-1)\big)^2$$
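The two closed-form estimates can be checked on simulated data; the sketch below uses an AR(1) with $a_1 = 0.7$ and $\sigma_\varepsilon = 1$ (illustrative choices):

```python
import numpy as np

# Conditional ML estimates for AR(1):
#   a1_hat = sum X(n)X(n-1) / sum X(n-1)^2
#   sigma2_hat = (1/N) sum (X(n) - a1_hat X(n-1))^2
rng = np.random.default_rng(3)
N, a1, sigma = 100_000, 0.7, 1.0
X = np.zeros(N + 1)
for n in range(1, N + 1):
    X[n] = a1 * X[n - 1] + rng.normal(0.0, sigma)

num = np.sum(X[1:] * X[:-1])          # sum_{n=1}^{N} X(n) X(n-1)
den = np.sum(X[:-1] ** 2)             # sum_{n=1}^{N} X(n-1)^2
a1_hat = num / den
resid = X[1:] - a1_hat * X[:-1]
sigma2_hat = np.mean(resid ** 2)      # (1/N) * sum of squared residuals
print(a1_hat, sigma2_hat)
```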
We use the same approach to find the parameters of MA(1); "1" means only one lag in $\varepsilon$. Assume that the signal is modeled as:

$$X(n) = \varepsilon(n) + b_1\,\varepsilon(n-1)$$

where the $\varepsilon(n)$ are independent identically distributed random variables with zero mean and variance $\sigma_\varepsilon^2$. We also assume that $\varepsilon(0) = 0$. Since $X(n)$ is a sum of two independent zero-mean Gaussian random variables, it is also Gaussian with zero mean and variance equal to the sum of the two variances. Thus $X(n) \sim N\big(0, (1 + b_1^2)\sigma_\varepsilon^2\big)$, where $\sim$ means "is distributed as" and $N$ stands for the normal distribution.
We need to find the joint pdf $f(X(0), X(1), X(2), \ldots, X(N))$. In many situations, the pdf of $X(0)$ will be ignored. Using Bayes' rule, the joint pdf is:

$$f(X(0), X(1), \ldots, X(N)) = f(X(0))\,f(X(1)/X(0))\,f(X(2)/X(1), X(0))\,f(X(3)/X(2), X(1), X(0)) \cdots$$
We need to find each element of the joint pdf. Let us start with $f(X(1)/X(0))$.

$n = 1$: $X(1) = \varepsilon(1) + b_1\,\varepsilon(0) = \varepsilon(1)$, since $\varepsilon(0)$ is zero by assumption. Thus,

$$f(X(1)/X(0)) = f(X(1)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{X(1)^2}{2\sigma_\varepsilon^2}\right)$$
$n = 2$: $X(2) = \varepsilon(2) + b_1\,\varepsilon(1)$, and since $\varepsilon(1) = X(1)$ we have $X(2) = \varepsilon(2) + b_1 X(1)$. Rearranging, $\varepsilon(2) = X(2) - b_1 X(1)$. For a given value of $X(1)$ and $X(0)$,

$$f(X(2)/X(1), X(0)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\varepsilon(2)^2}{2\sigma_\varepsilon^2}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(2) - b_1 X(1)\big)^2}{2\sigma_\varepsilon^2}\right)$$
$n = 3$: $X(3) = \varepsilon(3) + b_1\,\varepsilon(2) = \varepsilon(3) + b_1\big(X(2) - b_1 X(1)\big)$, i.e.

$$\varepsilon(3) = X(3) - b_1 X(2) + b_1^2 X(1)$$

For a given value of $X(2)$, $X(1)$, and $X(0)$ — and we also know $\varepsilon(2)$ with certainty, because $\varepsilon(2) = X(2) - b_1 X(1)$ — we get:

$$f(X(3)/X(2), X(1), X(0)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(3) - b_1 X(2) + b_1^2 X(1)\big)^2}{2\sigma_\varepsilon^2}\right)$$
$n = 4$: $X(4) = \varepsilon(4) + b_1\,\varepsilon(3) = \varepsilon(4) + b_1\big(X(3) - b_1 X(2) + b_1^2 X(1)\big)$, i.e.

$$\varepsilon(4) = X(4) - b_1 X(3) + b_1^2 X(2) - b_1^3 X(1)$$

For a given value of $X(3)$, $X(2)$, $X(1)$, and $X(0)$,

$$f(X(4)/X(3), X(2), X(1), X(0)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(4) - b_1 X(3) + b_1^2 X(2) - b_1^3 X(1)\big)^2}{2\sigma_\varepsilon^2}\right)$$
and in general:

$$f(X(n)/X(n-1), X(n-2), \ldots, X(0)) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\exp\left(-\frac{\big(X(n) - b_1\,\varepsilon(n-1)\big)^2}{2\sigma_\varepsilon^2}\right)$$

where the expression for $\varepsilon(n-1)$ becomes more and more complicated. We could continue in this way and find the joint pdf. Instead, we use the independence of the $\varepsilon(n)$ to find the joint pdf of the observations as a function of the joint pdf of the $\varepsilon(n)$, as follows:

$$\begin{bmatrix} X(1) \\ X(2) \\ \vdots \\ X(N) \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_1 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & b_1 & 1 \end{bmatrix} \begin{bmatrix} \varepsilon(1) \\ \varepsilon(2) \\ \vdots \\ \varepsilon(N) \end{bmatrix} \qquad \text{i.e.} \qquad X = B\varepsilon$$
The jpdf of $\varepsilon$ is given as:

$$f(\varepsilon) = \frac{1}{(2\pi)^{N/2}(\det \Sigma_\varepsilon)^{1/2}}\exp\left(-\frac{1}{2}\varepsilon^T \Sigma_\varepsilon^{-1} \varepsilon\right)$$

where $\varepsilon = [\varepsilon(1)\; \varepsilon(2)\; \cdots\; \varepsilon(N)]^T$, $X = [X(1)\; X(2)\; \cdots\; X(N)]^T$, and $\Sigma_\varepsilon = \sigma_\varepsilon^2 I$.
.
Since BX i.e. a linear function of a Gaussian vector, then it is also Gaussian and
0 BEXE ,
T
X BB
Thus,
XXXf X
T
X
N
1
2/
2
1
exp
det2
1
)(
11. 11
where
T
X BB . Maximizing the jpdf )(Xf with respect to the unknowns 1b and
2
we
get the desired maximum likelihood estimates. The process could be repeated for higher order
MA models.
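A sketch of this MA(1) procedure: build $B$, form $\Sigma_X = \sigma_\varepsilon^2 B B^T$, and maximize $f(X)$ over $b_1$ (for each candidate $b_1$ the maximizing $\sigma_\varepsilon^2$ has a closed form, so a one-dimensional grid search suffices). The sample size and the true $(b_1, \sigma_\varepsilon)$ are illustrative choices:

```python
import numpy as np

# ML for MA(1) via X = B eps, Sigma_X = sigma^2 * B B^T, with B lower
# bidiagonal (1 on the diagonal, b1 below it) and eps(0) = 0 assumed.
rng = np.random.default_rng(4)
N, b1_true, sigma = 400, 0.5, 1.0
eps = rng.normal(0.0, sigma, N + 1)
eps[0] = 0.0                                  # assumption eps(0) = 0
X = eps[1:] + b1_true * eps[:-1]              # X(n) = eps(n) + b1*eps(n-1)

def profile_loglik(b1):
    B = np.eye(N) + b1 * np.eye(N, k=-1)      # X = B eps
    G = B @ B.T                               # Sigma_X / sigma^2
    q = X @ np.linalg.solve(G, X)             # X^T G^{-1} X
    sigma2_hat = q / N                        # concentrated sigma^2
    _, logdet = np.linalg.slogdet(G)
    # log-likelihood up to constants, with sigma^2 concentrated out
    return -0.5 * (N * np.log(sigma2_hat) + logdet + N)

grid = np.linspace(-0.95, 0.95, 191)
b1_hat = grid[np.argmax([profile_loglik(b) for b in grid])]
print(b1_hat)
```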
We now develop the maximum likelihood method for ARMA(1,1). Assume that the data is modeled as:

$$X(n) = a_1 X(n-1) + \varepsilon(n) + b_1\,\varepsilon(n-1)$$

Define $y(n) = X(n) - a_1 X(n-1)$, and assume that $X(0) = 0$.
Then

$$\begin{bmatrix} y(1) \\ y(2) \\ \vdots \\ y(N) \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ -a_1 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & -a_1 & 1 \end{bmatrix} \begin{bmatrix} X(1) \\ X(2) \\ \vdots \\ X(N) \end{bmatrix} \qquad \text{i.e.} \qquad y = AX \quad \text{or} \quad X = A^{-1}y$$

If we are able to get the joint pdf of $y(n)$, we can get the joint pdf of $X(n)$, as we shall see.
As before, it is assumed that $\varepsilon(0) = 0$ and $X(0) = 0$.

$n = 1$: $y(1) = X(1) - a_1 X(0) = \varepsilon(1) + b_1\,\varepsilon(0) = \varepsilon(1)$

$n = 2$: $y(2) = X(2) - a_1 X(1) = \varepsilon(2) + b_1\,\varepsilon(1)$

and in general we get:

$$\begin{bmatrix} y(1) \\ y(2) \\ \vdots \\ y(N) \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_1 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & b_1 & 1 \end{bmatrix} \begin{bmatrix} \varepsilon(1) \\ \varepsilon(2) \\ \vdots \\ \varepsilon(N) \end{bmatrix} \qquad \text{i.e.} \qquad y = B\varepsilon$$

Thus $X = A^{-1}y = A^{-1}B\varepsilon$ is a zero-mean Gaussian vector,
and

$$f(X) = \frac{1}{(2\pi)^{N/2}(\det \Sigma_X)^{1/2}}\exp\left(-\frac{1}{2}X^T \Sigma_X^{-1} X\right)$$

where $\Sigma_X = A^{-1}B\,\Sigma_\varepsilon\,B^T (A^{-1})^T$. Maximizing the jpdf $f(X)$ with respect to the unknowns $a_1$, $b_1$, and $\sigma_\varepsilon^2$, we get the desired maximum likelihood estimates. The process could be repeated for higher-order ARMA models.
Matrix inversion:
In general, matrix inversion is time consuming. In some special cases, such as the triangular matrices we encounter here, inversion of a matrix is straightforward.
Example: Consider the upper triangular (bidiagonal) matrix

$$A = \begin{bmatrix} 1 & a_3 & 0 & 0 \\ 0 & 1 & a_2 & 0 \\ 0 & 0 & 1 & a_1 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Its inverse is

$$A^{-1} = \begin{bmatrix} 1 & -a_3 & a_3 a_2 & -a_3 a_2 a_1 \\ 0 & 1 & -a_2 & a_2 a_1 \\ 0 & 0 & 1 & -a_1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad \text{and} \qquad \big(A^{-1}\big)^T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -a_3 & 1 & 0 & 0 \\ a_3 a_2 & -a_2 & 1 & 0 \\ -a_3 a_2 a_1 & a_2 a_1 & -a_1 & 1 \end{bmatrix}$$

Since

$$A^T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ a_3 & 1 & 0 & 0 \\ 0 & a_2 & 1 & 0 \\ 0 & 0 & a_1 & 1 \end{bmatrix}$$

is a lower triangular matrix, and since we know that $(A^{-1})^T = (A^T)^{-1}$ and $(A^T)^{-1}A^T = I$, then $(A^T)^{-1}$ is also a lower triangular matrix:

$$\big(A^T\big)^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ b_{21} & 1 & 0 & 0 \\ b_{31} & b_{32} & 1 & 0 \\ b_{41} & b_{42} & b_{43} & 1 \end{bmatrix}$$

By inspection of $(A^T)^{-1}A^T = I$:

Second row times first column yields $b_{21} + a_3 = 0$. Then $b_{21} = -a_3$.

Third row times second column yields $b_{32} + a_2 = 0$. Then $b_{32} = -a_2$.

Fourth row times third column yields $b_{43} + a_1 = 0$. Then $b_{43} = -a_1$.

Third row times first column yields $b_{31} + b_{32} a_3 = 0$. Then $b_{31} = a_2 a_3$.

Fourth row times second column yields $b_{42} + b_{43} a_2 = 0$. Then $b_{42} = a_1 a_2$.

Fourth row times first column yields $b_{41} + b_{42} a_3 = 0$. Then $b_{41} = -a_1 a_2 a_3$.

This gives the complete inverse.
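The closed-form inverse worked out above is easy to verify numerically; the particular values of $a_1, a_2, a_3$ below are arbitrary:

```python
import numpy as np

# Verify the closed-form inverse of the upper-bidiagonal matrix A.
a1, a2, a3 = 0.3, -1.2, 2.0
A = np.array([[1.0, a3,  0.0, 0.0],
              [0.0, 1.0, a2,  0.0],
              [0.0, 0.0, 1.0, a1],
              [0.0, 0.0, 0.0, 1.0]])
A_inv = np.array([[1.0, -a3, a3 * a2, -a3 * a2 * a1],
                  [0.0, 1.0, -a2,     a2 * a1],
                  [0.0, 0.0, 1.0,     -a1],
                  [0.0, 0.0, 0.0,     1.0]])
print(np.allclose(A @ A_inv, np.eye(4)))   # prints True
```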
ARCH and GARCH Models:
In some situations, the random disturbance is not independent of the observations and depends on the data itself. This happens when the variance of the error term is not constant. Consider the AR(p) model with ARCH(M) disturbance given as:

$$X(n) = a_1 X(n-1) + a_2 X(n-2) + \cdots + a_p X(n-p) + \sqrt{h(n)}\,\varepsilon(n)$$

and

$$h(n) = c_0 + c_1 X^2(n-1) + c_2 X^2(n-2) + \cdots + c_M X^2(n-M)$$
How to Choose a Model; the Akaike information criterion (AIC):
Assuming a stationary signal, one is usually confronted with the question of which model is best for the data. The Akaike information criterion (AIC) was developed for this purpose. Let $M$ be the number of estimated parameters of the model, and let $L$ be the maximum value of the log-likelihood function (the log of the joint pdf of the observations after maximization). Then the AIC is defined as:

$$AIC = 2M - 2L$$
Given a set of data, we try several models and we select the model that minimizes the AIC.
Hypothesis Testing:
In its simplest form, we have two possible sources of a signal and we receive only one noisy version. The basic decision theory problem is to choose between the two sources based on the observations we receive. If we decide on the null hypothesis $H_0$, this means that the source is the first signal. If we decide on the alternative hypothesis $H_1$, this means that the source is the second signal. For example, we could receive an EKG signal and need to decide whether the patient is normal ($H_0$) or sick ($H_1$). In EEG, we need to identify the presence of epileptic danger ($H_1$) or decide that the patient is normal ($H_0$), etc.
Consider a set of $N$ observations $X = [X(1)\; X(2)\; \cdots\; X(N)]^T$; based on these observations, we need to decide which hypothesis is true, $H_0$ or $H_1$. To do that, we find the joint pdf of the observations under both hypotheses, i.e., we need $f(X/H_0)$ and $f(X/H_1)$. We then form the likelihood ratio $\Lambda(X)$, defined as

$$\Lambda(X) = \frac{f(X/H_1)}{f(X/H_0)}$$

Notice that $\Lambda(X)$ is a random quantity, but a scalar. The likelihood ratio test is to decide $H_1$ if $\Lambda(X) \ge \eta$, where $\eta$ is a threshold value, and to decide $H_0$ otherwise. If both hypotheses are equally likely, $\eta$ is set to 1.
Example: We receive a set of $N$ noisy measurements that are independent and identically distributed Gaussian random variables with known mean $m$ under hypothesis $H_1$, and zero mean under hypothesis $H_0$. This is stated as follows:

$$H_1:\; X(i) = m + n(i), \qquad i = 1, 2, \ldots, N$$

$$H_0:\; X(i) = n(i), \qquad i = 1, 2, \ldots, N$$

where $n(i)$ is Gaussian white noise with zero mean and variance $\sigma^2$, i.e.

$$f(n(i)) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{n(i)^2}{2\sigma^2}\right)$$
Thus,

$$f(X(i)/H_0) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{X(i)^2}{2\sigma^2}\right)$$

and

$$f(X(i)/H_1) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(X(i) - m)^2}{2\sigma^2}\right)$$
Since the observations are independent, we get:

$$f(X/H_0) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{X(i)^2}{2\sigma^2}\right)$$

and

$$f(X/H_1) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(X(i) - m)^2}{2\sigma^2}\right)$$
The likelihood ratio is:

$$\Lambda(X) = \frac{f(X/H_1)}{f(X/H_0)} = \frac{\displaystyle\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(X(i) - m)^2}{2\sigma^2}\right)}{\displaystyle\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{X(i)^2}{2\sigma^2}\right)}$$
After some manipulation and taking the natural log, we get:

$$\ln \Lambda(X) = \frac{m}{\sigma^2}\sum_{i=1}^{N} X(i) - \frac{N m^2}{2\sigma^2}$$
When a set of new data arrives, we substitute its values into the expression for $\ln \Lambda(X)$. If $\ln \Lambda(X) > \ln \eta$, we decide $H_1$; else we decide $H_0$. End of example.
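A sketch of this detector: compute $\ln \Lambda(X)$ from the closed form above and compare against $\ln \eta$ with $\eta = 1$. The values of $m$, $\sigma$, and $N$ are illustrative:

```python
import numpy as np

# Decide between H1: X(i) = m + n(i) and H0: X(i) = n(i) using
# ln Lambda(X) = (m/sigma^2) * sum X(i) - N*m^2/(2*sigma^2), eta = 1.
rng = np.random.default_rng(6)
m, sigma, N = 0.5, 1.0, 400

def decide(X):
    log_lambda = (m / sigma**2) * X.sum() - N * m**2 / (2 * sigma**2)
    return 1 if log_lambda > np.log(1.0) else 0   # H1 iff > ln(eta)

X_h1 = m + rng.normal(0.0, sigma, N)   # data generated under H1
X_h0 = rng.normal(0.0, sigma, N)       # data generated under H0
print(decide(X_h1), decide(X_h0))
```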
Composite Hypothesis or the Generalized Likelihood Test:
In some situations, some parameters are unknown and we still need to make a decision about the source of the signal. Specifically, assume that under $H_0$ the vector of unknown parameters is $\theta_0$, and under $H_1$ the vector of unknown parameters is $\theta_1$. In this case we have a set of training data from which we estimate the unknown parameters under both hypotheses. When new data arrives and we need to decide which hypothesis is true, we simply substitute the values of the new data into the generalized likelihood ratio $\Lambda_g(X)$. In summary:

$$\Lambda_g(X) = \frac{\max_{\theta_1} f(X/\theta_1, H_1)}{\max_{\theta_0} f(X/\theta_0, H_0)}$$

The likelihood ratio test is to decide $H_1$ if $\Lambda_g(X) \ge \eta$, where $\eta$ is a threshold value, and to decide $H_0$ otherwise. If both hypotheses are equally likely, $\eta$ is set to 1.
Example: Here the mean under $H_1$ is unknown, while under $H_0$ the mean is zero. When the data arrives, we estimate the mean $m$ as:

$$\hat{m} = \frac{1}{N}\sum_{i=1}^{N} X(i)$$

The generalized likelihood ratio becomes

$$\Lambda_g(X) = \frac{\max_m f(X/m, H_1)}{f(X/H_0)} = \frac{\displaystyle\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(X(i) - \frac{1}{N}\sum_{j=1}^{N} X(j)\big)^2}{2\sigma^2}\right)}{\displaystyle\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{X(i)^2}{2\sigma^2}\right)}$$

The likelihood ratio test is to decide $H_1$ if $\Lambda_g(X) \ge \eta$, where $\eta$ is a threshold value, and to decide $H_0$ otherwise.
Hypothesis Testing for Stochastic Processes:
In some applications, as in brain-computer interfaces (BCI), we receive a signal $X(t)$ and we need to decide whether the signal represents a forward command, a backward command, or others. Usually we have a training set of data for each hypothesis. If the signal is stationary, one option is to expand the training signals using the Karhunen-Loeve expansion, which converts the signals into a set of random variables (see the section on the Karhunen-Loeve expansion). When a new set of data arrives, we substitute its values into the likelihood ratio formula and decide to which group the new set of data belongs.

In this analysis, we assume that we have a zero-mean stochastic process. A stochastic process under hypothesis $H_j$, $X^j(t)$, is expanded in terms of orthonormal basis functions $\phi_i^j(t)$ as:

$$X^j(t) = \sum_{i=1}^{\infty} \beta_i^j\, \phi_i^j(t), \qquad 0 \le t \le T$$
where

$$\beta_i^j = \int_0^T \phi_i^j(t)\, X^j(t)\,dt, \qquad \int_0^T \phi_i^j(t)\, \phi_k^j(t)\,dt = \delta_{ik}$$
The basis functions $\phi_i^j(t)$ are chosen such that the random coefficients $\beta_i^j$ are uncorrelated, viz.:

$$E\big[\beta_i^j \beta_k^j\big] = \lambda_i^j\, \delta_{ik}$$
For the process $X^j(t)$, define the covariance function $K^j(t, u)$ as:

$$K^j(t, u) = E\big[X^j(t)\, X^j(u)\big]$$

The covariance function, for uncorrelated $\beta_i^j$, satisfies the Fredholm integral equation of the second kind:

$$\int_0^T K^j(t, u)\, \phi_i^j(u)\,du = \lambda_i^j\, \phi_i^j(t)$$

For a stationary process, $K^j(t, u) = E\big[X^j(t) X^j(u)\big] = K^j(t - u)$, and it is much easier to find the orthogonal basis.
Assuming that $\beta_i^j$ is a normal random variable, and knowing that it has zero mean and variance $\lambda_i^j$, its pdf is given as:

$$f(\beta_i^j) = \frac{1}{\sqrt{2\pi \lambda_i^j}}\exp\left(-\frac{(\beta_i^j)^2}{2\lambda_i^j}\right)$$
Define $\beta^j = [\beta_1^j, \ldots, \beta_N^j]^T$. The joint pdf of the observations, under $H_j$, is given as:

$$f(\beta^j / H_j) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi \lambda_i^j}}\exp\left(-\frac{(\beta_i^j)^2}{2\lambda_i^j}\right)$$
When a new set of data $X(t)$ arrives and we need to know to which hypothesis it belongs, we simply compute the different values

$$\beta_i^j = \int_0^T \phi_i^j(t)\, X(t)\,dt$$

for the different hypotheses $H_j$. We then calculate, for each $j$, the joint pdf

$$f(\beta^j / H_j) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi \lambda_i^j}}\exp\left(-\frac{(\beta_i^j)^2}{2\lambda_i^j}\right)$$

The hypothesis for which this joint pdf is maximum is declared true.
Least Square Estimates
Suppose that we have two random variables $X$ and $Y$ that are related to each other with joint pdf $f_{X,Y}(x, y) = f_{X/Y}(x/y)\, f_Y(y)$. When $Y = y$, the random variable $X = x$, where $y$ and $x$ are deterministic values. Our estimate of $X$, $\hat{x}$, is some (possibly nonlinear) function of $y$, $h(y)$, i.e., $\hat{x} = h(y)$, and the error in the estimate is $\tilde{x} = x - \hat{x} = x - h(y)$. The mean square error (m.s.e.) is thus given as:

$$m.s.e. = \int\!\!\int \big(x - h(y)\big)^2 f_{X,Y}(x, y)\,dx\,dy = \int f_Y(y)\,dy \int \big(x - h(y)\big)^2 f_{X/Y}(x/y)\,dx$$

By conditioning on $Y$, the term $(x - h(y))^2$ becomes deterministic in $y$ while $x$ is still random, and we are able to use ordinary (Riemann) calculus to find the minimum with respect to $h(y)$. Had we conditioned on $X$ instead, we would not be able to minimize with respect to $h(y)$ using Riemann calculus, and we would need another calculus that deals with random quantities.
We need to find the $h(y)$ that minimizes the m.s.e. This is done by solving

$$\min_{h(y)} \int \big(x - h(y)\big)^2 f_{X/Y}(x/y)\,dx$$

Setting the derivative with respect to $h(y)$ to zero:

$$\frac{\partial\, m.s.e.}{\partial h(y)} = \frac{\partial}{\partial h(y)}\int \big(x - h(y)\big)^2 f_{X/Y}(x/y)\,dx = 0$$

which yields

$$-2\int x\, f_{X/Y}(x/y)\,dx + 2 h(y)\int f_{X/Y}(x/y)\,dx = 0$$

And since $\int h(y)\, f_{X/Y}(x/y)\,dx = h(y)\int f_{X/Y}(x/y)\,dx = h(y)$, then

$$h(y) = \int x\, f_{X/Y}(x/y)\,dx = E[X / Y = y]$$
If we have two random variables $Y_1$ and $Y_2$ and we need to find an estimate of $X$ based on observations of these two random variables, i.e., $\hat{x} = h(y_1, y_2)$, then as before, minimizing the m.s.e. gives:

$$h(y_1, y_2) = \int x\, f_{X/Y_1, Y_2}(x/y_1, y_2)\,dx = E[X / Y_1, Y_2]$$

For $n$ observed values of the $n$ random variables $Y_1, Y_2, \ldots, Y_n$ we have:

$$h(y_1, y_2, \ldots, y_n) = \int x\, f_{X/Y_1, Y_2, \ldots, Y_n}(x/y_1, \ldots, y_n)\,dx = E[X / Y_1, Y_2, \ldots, Y_n]$$

For a stochastic process $Y$, sampling at intervals and passing to the limit,

$$f(X/Y) = \lim_{n \to \infty} f(X/Y_1, Y_2, \ldots, Y_n)$$

and the minimum-m.s.e. estimate is again the conditional expectation, $h(y) = \int x\, f_{X/Y}(x/y)\,dx = E[X/Y]$.
Linear Least Squares Estimates [Kailath; 2000]
In many useful applications, obtaining the conditional expectation is very difficult. Thus, one has to resort to suboptimal approaches for the estimation. A common approach is the linear least squares estimate, where the estimated value $\hat{X}$ of a random variable $X$ is linearly related to another random variable $Y$. Specifically,

$$\hat{X} = hY + g$$

where $h$ and $g$ are unknown but deterministic values, chosen to minimize the mean square error:

$$m.s.e. = E\big[(X - hY - g)^2\big] = E[X^2] + h^2 E[Y^2] + g^2 - 2h\,E[XY] - 2g\,E[X] + 2gh\,E[Y]$$

Minimizing the m.s.e. with respect to $g$ and $h$ yields:

$$\hat{g} = E[X] - \hat{h}\,E[Y] = m_X - \hat{h}\,m_Y$$

$$\hat{h}\,E[Y^2] = E[XY] - \hat{g}\,E[Y] = E[XY] - \hat{g}\,m_Y$$
The above two equations have the matrix format:

$$\begin{bmatrix} 1 & m_Y \\ m_Y & E[Y^2] \end{bmatrix} \begin{bmatrix} \hat{g} \\ \hat{h} \end{bmatrix} = \begin{bmatrix} m_X \\ E[XY] \end{bmatrix}$$

Since $\sigma_{XY} = E[(X - m_X)(Y - m_Y)] = E[XY] - m_X m_Y$, then $E[XY] = \sigma_{XY} + m_X m_Y$. Similarly, since $\sigma_Y^2 = E[(Y - m_Y)^2] = E[Y^2] - m_Y^2$, then $E[Y^2] = \sigma_Y^2 + m_Y^2$. Substituting, we get:

$$\begin{bmatrix} 1 & m_Y \\ m_Y & \sigma_Y^2 + m_Y^2 \end{bmatrix} \begin{bmatrix} \hat{g} \\ \hat{h} \end{bmatrix} = \begin{bmatrix} m_X \\ \sigma_{XY} + m_X m_Y \end{bmatrix}$$

Inverting the matrix (its determinant is $\sigma_Y^2$), we get:

$$\begin{bmatrix} \hat{g} \\ \hat{h} \end{bmatrix} = \frac{1}{\sigma_Y^2}\begin{bmatrix} \sigma_Y^2 + m_Y^2 & -m_Y \\ -m_Y & 1 \end{bmatrix} \begin{bmatrix} m_X \\ \sigma_{XY} + m_X m_Y \end{bmatrix}$$

Solving, we get:

$$\hat{h} = \frac{1}{\sigma_Y^2}\big(-m_Y m_X + \sigma_{XY} + m_X m_Y\big) = \frac{\sigma_{XY}}{\sigma_Y^2}$$

and

$$\hat{g} = \frac{1}{\sigma_Y^2}\Big((\sigma_Y^2 + m_Y^2)\,m_X - m_Y(\sigma_{XY} + m_X m_Y)\Big) = m_X - \frac{\sigma_{XY}}{\sigma_Y^2}\,m_Y$$

Thus,

$$\hat{X} = m_X + \frac{\sigma_{XY}}{\sigma_Y^2}\big(Y - m_Y\big)$$

The corresponding m.s.e. is:

$$m.s.e. = \sigma_X^2 - \frac{\sigma_{XY}^2}{\sigma_Y^2}$$
Under general conditions, we can interchange expectation and differentiation. Thus, we can apply the derivative operator inside the expectation as follows:

$$0 = \frac{\partial\, m.s.e.}{\partial g} = E\left[\frac{\partial}{\partial g}(X - hY - g)^2\right] = -2\,E[X - hY - g]$$

which yields $E[X] = h\,E[Y] + g$, i.e., $\hat{g} = E[X] - \hat{h}\,E[Y] = m_X - \hat{h}\,m_Y$.

Similarly,

$$0 = \frac{\partial\, m.s.e.}{\partial h} = E\left[\frac{\partial}{\partial h}(X - hY - g)^2\right] = -2\,E\big[Y(X - hY - g)\big]$$

which yields $E[XY] = h\,E[Y^2] + g\,E[Y]$, i.e., $\hat{h}\,E[Y^2] = E[XY] - \hat{g}\,E[Y] = E[XY] - \hat{g}\,m_Y$.
For a vector of observations $Y = [Y_1, Y_2, \ldots, Y_n]^T$ and a vector of coefficients $h = [h_1, h_2, \ldots, h_n]^T$, a scalar $X$ is estimated as:

$$\hat{X} = h^T Y + g$$

The least-m.s.e. solution is derived as before:

$$\hat{X} = m_X + R_{XY}\, R_{YY}^{-1}\big(Y - m_Y\big)$$

where $R_{XY} = E\big[(X - m_X)(Y - m_Y)^T\big]$ and $R_{YY} = E\big[(Y - m_Y)(Y - m_Y)^T\big]$. The corresponding m.s.e. is given as:

$$m.s.e. = R_{XX} - R_{XY}\, R_{YY}^{-1}\, R_{YX}$$
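A sketch of the vector formula with the moments replaced by sample moments; the joint distribution of $(X, Y_1, Y_2)$ is an arbitrary illustrative choice, with $X$ linear in $Y$ plus independent noise, so that $h$ should approach $[2, -0.5]^T$ and the m.s.e. should approach the noise variance:

```python
import numpy as np

# Linear least squares: X_hat = m_X + R_XY R_YY^{-1} (Y - m_Y),
# with the moments estimated from samples.
rng = np.random.default_rng(7)
n = 200_000
Y = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], n)
X = 2.0 * Y[:, 0] - 0.5 * Y[:, 1] + rng.normal(0.0, 0.5, n)

m_X, m_Y = X.mean(), Y.mean(axis=0)
R_YY = np.cov(Y.T)                          # E[(Y-mY)(Y-mY)^T]
R_XY = np.array([np.cov(X, Y[:, j])[0, 1] for j in range(2)])

h = np.linalg.solve(R_YY, R_XY)             # R_YY^{-1} R_YX
X_hat = m_X + (Y - m_Y) @ h
mse = np.mean((X - X_hat) ** 2)             # approaches noise var 0.25
print(h, mse)
```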
Assume that we have an observation period $[a, b]$ where we measure a scalar stochastic process $Y(t)$. We need to find the linear least squares estimate $\hat{X}$ of the random variable $X$ based on this observation. In this analysis we shall assume zero mean for all random variables involved. As before,

$$\hat{X} = \int_a^b h(\tau)\, Y(\tau)\,d\tau$$

The filter $h(t)$ is obtained as before from:

$$R_{XY}(t) = \int_a^b h(\tau)\, R_{YY}(t, \tau)\,d\tau, \qquad t \in [a, b]$$

where $R_{YY}(t, \tau) = E[Y(t)Y(\tau)]$, assuming zero mean for $Y(t)$. The m.s.e. is now given as:

$$m.s.e. = \sigma_X^2 - \int_a^b\!\!\int_a^b h(t)\, h(v)\, R_{YY}(t, v)\,dt\,dv$$
Geometric Interpretation of Random Variables:
We shall assume that all the random variables we are dealing with are zero mean; this only facilitates the analysis, and the results, with minor changes, remain valid for nonzero-mean random variables.

We can think of a random variable $X$ as a vector in some abstract space with inner product defined as:

$$\langle X, Y \rangle = E[XY]$$

For stochastic processes $X(t)$ and $Y(t)$ defined on the interval $[a, b]$, the inner product is defined as:

$$\langle X, Y \rangle = E\left[\int_a^b X(\tau)\, Y(\tau)\,d\tau\right]$$

Thus, for $\hat{X} = \sum_i h_i Y_i$ we need to find the $h_i$ such that the error $X - \hat{X}$ is orthogonal to the observation space made from the observations $Y_i$, i.e.,

$$X - \sum_i h_i Y_i \perp Y_j$$

where $\perp$ means orthogonal.
In terms of the inner product,

$$X - \sum_i h_i Y_i \perp Y_j \qquad \text{means} \qquad \left\langle X - \sum_i h_i Y_i,\; Y_j \right\rangle = E\left[\Big(X - \sum_i h_i Y_i\Big) Y_j\right] = 0$$

i.e.

$$\langle X, Y_j \rangle = \sum_i h_i\, \langle Y_i, Y_j \rangle$$
If the observations are a stochastic process $Y(t)$, then the estimate of $X$ is obtained as:

$$\hat{X} = \int_a^b h(\tau)\, Y(\tau)\,d\tau$$

Again, the error is orthogonal to the observations. Thus,

$$X - \hat{X} \perp Y(t), \qquad \text{i.e.} \qquad \left\langle X - \int_a^b h(\tau)\, Y(\tau)\,d\tau,\; Y(t) \right\rangle = 0$$

which means

$$E[X\, Y(t)] = E\left[Y(t)\int_a^b h(\tau)\, Y(\tau)\,d\tau\right] = \int_a^b h(\tau)\, E[Y(\tau)Y(t)]\,d\tau$$

In terms of correlations, the above expression is:

$$R_{XY}(t) = \int_a^b h(\tau)\, R_{YY}(t, \tau)\,d\tau$$

This is exactly the same result obtained before.
The Multivariate Case:
For the estimation of a vector of random variables $X = [X_1, X_2, \ldots, X_n]^T$ based on the observations of $m$ stochastic processes $Y(t) = [Y_1(t), Y_2(t), \ldots, Y_m(t)]^T$, we follow the same route as before, but now $H(\tau)$ is a matrix that satisfies the equation:

$$R_{XY}(t) = \int_a^b H(\tau)\, R_{YY}(t, \tau)\,d\tau$$

and the linear least squares estimate $\hat{X}$ satisfies the equation:

$$\hat{X} = \int_a^b H(\tau)\, Y(\tau)\,d\tau$$
Gram-Schmidt Orthogonalization:
Assume that we have a set of random variables $Y_1, \ldots, Y_M$ that could be correlated. We need to find an orthogonal basis for the space spanned by these random variables. We call these basis elements $\varepsilon_1, \ldots, \varepsilon_M$, and they are found from the observations $Y_1, \ldots, Y_M$. The basic idea is to select $\varepsilon_1 = Y_1$. We then take $Y_2$ and decompose it into a component along $\varepsilon_1$ plus an error term $\varepsilon_2$ that is orthogonal to $\varepsilon_1$, i.e., $\langle \varepsilon_1, \varepsilon_2 \rangle = E[\varepsilon_1 \varepsilon_2] = 0$. We then move to $Y_3$ and decompose it into components along $\varepsilon_1$ and $\varepsilon_2$ plus an error term $\varepsilon_3$ such that $\langle \varepsilon_1, \varepsilon_3 \rangle = E[\varepsilon_1 \varepsilon_3] = 0$ and $\langle \varepsilon_2, \varepsilon_3 \rangle = E[\varepsilon_2 \varepsilon_3] = 0$. We repeat this process $M$ times until we get $M$ new basis elements that are orthogonal to each other. Specifically,

$$Y_2 = h_{21}\varepsilon_1 + \varepsilon_2, \qquad \hat{Y}_{2/1} = h_{21}\varepsilon_1, \qquad \text{i.e.} \qquad \varepsilon_2 = Y_2 - \hat{Y}_{2/1} = Y_2 - h_{21}\varepsilon_1$$

where $\hat{Y}_{2/1}$ is the estimate of $Y_2$ given the observation $Y_1$, and $h_{21}$ is an unknown to be estimated in what follows. Since the error is orthogonal to the observations,

$$\varepsilon_2 = Y_2 - h_{21}Y_1 \perp Y_1, \qquad \text{i.e.} \qquad E[\varepsilon_2 \varepsilon_1] = E\big[(Y_2 - h_{21}Y_1)\,Y_1\big] = 0$$

Using linear least squares estimation and $\varepsilon_1 = Y_1$, we get:

$$h_{21} = E[Y_2 Y_1]\,/\,E[Y_1^2] = E[Y_2 \varepsilon_1]\,E[\varepsilon_1^2]^{-1}$$

Thus,

$$\varepsilon_2 = Y_2 - E[Y_2 \varepsilon_1]\,E[\varepsilon_1^2]^{-1}\varepsilon_1$$
For $Y_3$ we find an estimate $\hat{Y}_3 = \hat{Y}_{3/2}$ as a linear combination of $\varepsilon_1$ and $\varepsilon_2$, viz.:

$$Y_3 = h_{31}\varepsilon_1 + h_{32}\varepsilon_2 + \varepsilon_3, \qquad \hat{Y}_3 = h_{31}\varepsilon_1 + h_{32}\varepsilon_2, \qquad \text{i.e.} \qquad \varepsilon_3 = Y_3 - h_{31}\varepsilon_1 - h_{32}\varepsilon_2$$

Since the error is orthogonal to the observations,

$$\varepsilon_3 \perp \varepsilon_1, \varepsilon_2, \qquad \text{i.e.} \qquad E\big[(Y_3 - h_{31}\varepsilon_1 - h_{32}\varepsilon_2)\,\varepsilon_1\big] = 0, \qquad E\big[(Y_3 - h_{31}\varepsilon_1 - h_{32}\varepsilon_2)\,\varepsilon_2\big] = 0$$

Using linear least squares estimation, we get:

$$h_{31} = E[Y_3 \varepsilon_1]\,E[\varepsilon_1^2]^{-1}, \qquad h_{32} = E[Y_3 \varepsilon_2]\,E[\varepsilon_2^2]^{-1}$$

Thus,

$$\varepsilon_3 = Y_3 - E[Y_3 \varepsilon_1]\,E[\varepsilon_1^2]^{-1}\varepsilon_1 - E[Y_3 \varepsilon_2]\,E[\varepsilon_2^2]^{-1}\varepsilon_2$$
In general we get:

$$\varepsilon_n = Y_n - \sum_{i=1}^{n-1} E[Y_n \varepsilon_i]\,E[\varepsilon_i^2]^{-1}\varepsilon_i, \qquad 2 \le n \le M$$

Notice that

$$\hat{Y}_n = \hat{Y}_{n/n-1} = \sum_{i=1}^{n-1} E[Y_n \varepsilon_i]\,E[\varepsilon_i^2]^{-1}\varepsilon_i, \qquad \text{i.e.} \qquad \varepsilon_n = Y_n - \hat{Y}_{n/n-1}$$
In matrix notation, one can represent the observations in terms of the orthogonal basis as:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_M \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ h_{21} & 1 & 0 & \cdots & 0 \\ h_{31} & h_{32} & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ h_{M1} & h_{M2} & \cdots & h_{M,M-1} & 1 \end{bmatrix} \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_M \end{bmatrix}$$

Define the vectors $Y_n = [Y_1, Y_2, \ldots, Y_n]^T$ and $\varepsilon_n = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T$; then

$$Y_n = H_n \varepsilon_n \qquad \text{and} \qquad \varepsilon_n = H_n^{-1} Y_n$$
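A sketch of this Gram-Schmidt procedure on zero-mean random variables, with the expectations replaced by sample averages over many realizations (the mixing matrix generating the correlated $Y$'s is arbitrary):

```python
import numpy as np

# Gram-Schmidt on random variables: eps_1 = Y_1 and
# eps_n = Y_n - sum_i E[Y_n eps_i] E[eps_i^2]^{-1} eps_i,
# with E[.] approximated by sample averages over n realizations.
rng = np.random.default_rng(8)
n = 100_000
Z = rng.normal(size=(3, n))
mix = np.array([[1.0, 0.0, 0.0],
                [0.7, 1.0, 0.0],
                [0.2, -0.4, 1.0]])
Y = mix @ Z                          # 3 correlated zero-mean RVs

eps = []
for k in range(3):
    e = Y[k].copy()
    for ei in eps:
        h = np.mean(Y[k] * ei) / np.mean(ei * ei)   # E[Y_n eps_i]/E[eps_i^2]
        e = e - h * ei
    eps.append(e)

# the new basis is (sample-)uncorrelated: off-diagonals of C are ~ 0
C = np.cov(np.array(eps))
print(np.round(C, 3))
```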
Writing the estimate compactly,

$$\hat{Y}_n = \hat{Y}_{n/n-1} = \sum_{i=1}^{n-1} E[Y_n \varepsilon_i]\,E[\varepsilon_i^2]^{-1}\varepsilon_i = E\big[Y_n \varepsilon_{n-1}^T\big]\Big(E\big[\varepsilon_{n-1}\varepsilon_{n-1}^T\big]\Big)^{-1}\varepsilon_{n-1}$$

Substituting $\varepsilon_{n-1} = H_{n-1}^{-1} Y_{n-1}$, we get:

$$\hat{Y}_{n/n-1} = E\big[Y_n Y_{n-1}^T (H_{n-1}^{-1})^T\big]\Big(H_{n-1}^{-1}\, E\big[Y_{n-1}Y_{n-1}^T\big]\,(H_{n-1}^{-1})^T\Big)^{-1} H_{n-1}^{-1}\, Y_{n-1}$$

$$= E\big[Y_n Y_{n-1}^T\big](H_{n-1}^{-1})^T\,(H_{n-1}^T)\Big(E\big[Y_{n-1}Y_{n-1}^T\big]\Big)^{-1} H_{n-1}\, H_{n-1}^{-1}\, Y_{n-1} = E\big[Y_n Y_{n-1}^T\big]\Big(E\big[Y_{n-1}Y_{n-1}^T\big]\Big)^{-1} Y_{n-1}$$

and

$$\varepsilon_n = Y_n - \hat{Y}_{n/n-1} = Y_n - E\big[Y_n Y_{n-1}^T\big]\Big(E\big[Y_{n-1}Y_{n-1}^T\big]\Big)^{-1} Y_{n-1}$$

Thus, we were able to get the orthogonal basis $\varepsilon_n$ in terms of the observations $Y_n = [Y_1, Y_2, \ldots, Y_n]^T$.
Discrete Time Recursive Estimation:
We have a random variable $X$ that we need to estimate based on observations $Y_0, Y_1, \ldots$ In this approach we recursively estimate $X$ and update the estimate as new data comes in. Assume that we have the linear least squares estimate $\hat{X}(Y_0, Y_1, \ldots, Y_{k-1}) \equiv \hat{X}_{/k-1}$ based on the observations $Y_0, Y_1, \ldots, Y_{k-1}$. We now have a new observation $Y_k$, and we need to update our estimate and find $\hat{X}_{/k}$ in terms of $\hat{X}_{/k-1}$. In updating the estimate, we only use the new information in the new data. This new information is called the innovation, and it is obtained as follows:

Define the innovation $\varepsilon_k = Y_k - \hat{Y}_{k/k-1}$, with $\varepsilon_0 = Y_0 - \hat{Y}_{0/-1} = Y_0$, where $\hat{Y}_{k/k-1}$ is the estimate of $Y_k$ given the previous observations $Y_0, Y_1, \ldots, Y_{k-1}$. Clearly $\varepsilon_k$ is uncorrelated with the previous random variables $Y_0, Y_1, \ldots, Y_{k-1}$, and $\varepsilon_k$ is regarded as the new information in the random variable $Y_k$. This suggests that:

$$\hat{X}_{/k} = \hat{X}_{/k-1} + \text{linear least squares estimate of } X \text{ given only } \varepsilon_k$$

As before, the linear least squares estimate of $X$ given only $\varepsilon_k$ is $E\big[X\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$. Collecting terms, we get:

$$\hat{X}_{/k} = \hat{X}_{/k-1} + E\big[X\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$$
We could also derive the same result starting from the innovation sequence $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_k$, which are uncorrelated with each other and form a basis for the space spanned by $Y_0, Y_1, \ldots, Y_k$. Thus, $\hat{X}_{/k}$ is the linear least squares estimate of $X$ given $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_k$:

$$\hat{X}_{/k} = \sum_{i=0}^{k} E\big[X\varepsilon_i^T\big]\big(E[\varepsilon_i \varepsilon_i^T]\big)^{-1}\varepsilon_i = \sum_{i=0}^{k-1} E\big[X\varepsilon_i^T\big]\big(E[\varepsilon_i \varepsilon_i^T]\big)^{-1}\varepsilon_i + E\big[X\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$$

$$= \hat{X}_{/k-1} + E\big[X\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$$

This result was obtained before.
Instead of estimating just a single random variable $X$, we may need to estimate the values of a stochastic process $X_l$ given the observations of the stochastic process $Y_0, Y_1, \ldots, Y_k$. We use the above equation to get:

$$\hat{X}_{l/k} = \sum_{i=0}^{k} E\big[X_l\varepsilon_i^T\big]\big(E[\varepsilon_i \varepsilon_i^T]\big)^{-1}\varepsilon_i = \sum_{i=0}^{k-1} E\big[X_l\varepsilon_i^T\big]\big(E[\varepsilon_i \varepsilon_i^T]\big)^{-1}\varepsilon_i + E\big[X_l\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$$

Thus we are interested in $\hat{X}_{l/k}$, the linear least squares estimate of $X_l$ given $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_k$:

$$\hat{X}_{l/k} = \hat{X}_{l/k-1} + E\big[X_l\varepsilon_k^T\big]\big(E[\varepsilon_k \varepsilon_k^T]\big)^{-1}\varepsilon_k$$
Notice that $l$ could be greater than, less than, or equal to $k$. We need to find the term $E[X_l\varepsilon_k^T]$, and this could come from the observation equation or elsewhere. For example, it is not uncommon to have the observation equation:

$$Y_k = H_k X_k + v_k$$

where $v_k$ is additive white Gaussian noise and $H_k$ is a matrix with proper dimensions. In order to find $E[X_l\varepsilon_k^T]$, we use

$$\varepsilon_k = Y_k - \hat{Y}_{k/k-1} = H_k X_k + v_k - H_k\hat{X}_{k/k-1} = H_k\big(X_k - \hat{X}_{k/k-1}\big) + v_k$$

and thus

$$E\big[\varepsilon_k\varepsilon_k^T\big] = E\Big[\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\Big] = H_k P_{k/k-1} H_k^T + E\big[v_k v_k^T\big]$$

Also,

$$E\big[\hat{X}_{k/k-1}\varepsilon_k^T\big] = E\big[\hat{X}_{k/k-1}(X_k - \hat{X}_{k/k-1})^T\big]H_k^T + E\big[\hat{X}_{k/k-1}v_k^T\big] = 0$$

where the error covariance is

$$P_{k/k-1} = E\Big[\big(X_k - \hat{X}_{k/k-1}\big)\big(X_k - \hat{X}_{k/k-1}\big)^T\Big]$$

Notice that since $\hat{X}_{0/-1} = 0$, then $P_{0/-1} = E\big[(X_0 - \hat{X}_{0/-1})(X_0 - \hat{X}_{0/-1})^T\big] = E[X_0 X_0^T]$.
For $l = k$, we have:

$$\hat{X}_{k/k} = \sum_{i=0}^{k} E\big[X_k\varepsilon_i^T\big]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i, \qquad \hat{X}_{k/k-1} = \sum_{i=0}^{k-1} E\big[X_k\varepsilon_i^T\big]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$$

$$E\big[X_k\varepsilon_k^T\big] = E\Big[X_k\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\Big]$$

Substituting $X_k = \hat{X}_{k/k-1} + (X_k - \hat{X}_{k/k-1})$, we get:

$$E\big[X_k\varepsilon_k^T\big] = E\Big[\big(X_k - \hat{X}_{k/k-1}\big)\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\Big] + E\Big[\hat{X}_{k/k-1}\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\Big]$$

$$= P_{k/k-1}H_k^T + E\Big[\hat{X}_{k/k-1}\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\Big]$$

Notice that $\hat{X}_{k/k-1}$ is, by assumption, independent of $v_k$ and, by derivation, orthogonal to the error $X_k - \hat{X}_{k/k-1}$, i.e., $E\big[\hat{X}_{k/k-1}\big(H_k(X_k - \hat{X}_{k/k-1}) + v_k\big)^T\big] = 0$. Thus,

$$E\big[X_k\varepsilon_k^T\big] = P_{k/k-1}H_k^T \equiv K_k$$
We now need a recursive relation for the error covariance. We know that:

$$\hat{X}_{k/k} = \hat{X}_{k/k-1} + E\big[X_k\varepsilon_k^T\big]\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}\varepsilon_k = \hat{X}_{k/k-1} + K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}\varepsilon_k, \qquad \hat{X}_{0/-1} = 0$$

$$\hat{X}_{k/k} = \hat{X}_{k/k-1} + P_{k/k-1}H_k^T\Big(H_k P_{k/k-1}H_k^T + E[v_k v_k^T]\Big)^{-1}\big(Y_k - H_k\hat{X}_{k/k-1}\big)$$

The above is the updating equation of the estimate when new data $Y_k$ arrives. We also need an updating equation for the error covariance $P_{k/k}$ when new data arrives. We know that:

$$E\big[X_k\varepsilon_k^T\big] = P_{k/k-1}H_k^T = K_k, \qquad E\big[\varepsilon_k\varepsilon_k^T\big] = H_k P_{k/k-1}H_k^T + E\big[v_k v_k^T\big]$$

Since

$$P_{k/k} = E\Big[\big(X_k - \hat{X}_{k/k}\big)\big(X_k - \hat{X}_{k/k}\big)^T\Big]$$

we substitute $\hat{X}_{k/k} = \hat{X}_{k/k-1} + K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}\varepsilon_k$ and get:
$$P_{k/k} = E\Big[\big(X_k - \hat{X}_{k/k-1} - K_k(E[\varepsilon_k\varepsilon_k^T])^{-1}\varepsilon_k\big)\big(X_k - \hat{X}_{k/k-1} - K_k(E[\varepsilon_k\varepsilon_k^T])^{-1}\varepsilon_k\big)^T\Big]$$

Expanding, and using $E\big[(X_k - \hat{X}_{k/k-1})\varepsilon_k^T\big] = E[X_k\varepsilon_k^T] = K_k$ (since $\hat{X}_{k/k-1}$ is orthogonal to $\varepsilon_k$),

$$P_{k/k} = P_{k/k-1} - K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}K_k^T - K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}K_k^T + K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}E[\varepsilon_k\varepsilon_k^T]\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}K_k^T$$

$$= P_{k/k-1} - K_k\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}K_k^T$$

Substituting $E[\varepsilon_k\varepsilon_k^T] = H_k P_{k/k-1}H_k^T + E[v_k v_k^T]$, we get:

$$P_{k/k} = P_{k/k-1} - K_k\Big(H_k P_{k/k-1}H_k^T + E[v_k v_k^T]\Big)^{-1}K_k^T, \qquad K_k = P_{k/k-1}H_k^T, \qquad P_{0/-1} = E[X_0 X_0^T]$$

This is the desired result. Thus, we were able to find the updated estimate $\hat{X}_{k/k}$ and its updated covariance $P_{k/k}$. We now summarize the estimation steps:
(1) $Y_k = H_k X_k + v_k$ (linear observation model)

(2) $\varepsilon_k = Y_k - \hat{Y}_{k/k-1}$

(3) $\hat{Y}_{k/k-1} = E\big[Y_k Y_{k-1}^T\big]\big(E[Y_{k-1}Y_{k-1}^T]\big)^{-1}Y_{k-1}$, where $Y_{k-1} = [Y_0, Y_1, \ldots, Y_{k-1}]^T$

(4) $\hat{X}_{l/k} = \sum_{i=0}^{k} E\big[X_l\varepsilon_i^T\big]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$

(5) $\hat{X}_{k/k-1} = \sum_{i=0}^{k-1} E\big[X_k\varepsilon_i^T\big]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$

(6) $\hat{X}_{k+1/k} = \sum_{i=0}^{k} E\big[X_{k+1}\varepsilon_i^T\big]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$ — we still need a recursive equation.

(7) $P_{k/k-1} = E\big[(X_k - \hat{X}_{k/k-1})(X_k - \hat{X}_{k/k-1})^T\big]$ — we still need a recursive equation.
(8) $P_{k/k} = P_{k/k-1} - K_k\big(H_k P_{k/k-1}H_k^T + E[v_k v_k^T]\big)^{-1}K_k^T$, with $K_k = P_{k/k-1}H_k^T$ and $P_{0/-1} = E[X_0 X_0^T]$

(9) $\hat{X}_{k/k} = \hat{X}_{k/k-1} + P_{k/k-1}H_k^T\big(H_k P_{k/k-1}H_k^T + E[v_k v_k^T]\big)^{-1}\big(Y_k - H_k\hat{X}_{k/k-1}\big)$, with $\hat{X}_{0/-1} = 0$

Thus, for every step, we first find $\hat{X}_{k/k-1}$ and its covariance $P_{k/k-1}$. We then update the estimates to get $\hat{X}_{k/k}$ and its covariance $P_{k/k}$. To have a recursive relation, we need to find a predictive equation, $\hat{X}_{k+1/k}$, for $X_{k+1}$ and its covariance $P_{k+1/k}$. This could be obtained if we know the system dynamics.
Assume that the system dynamics are linear, with the equation:

$$X_{k+1} = \Phi_k X_k + w_k$$

where $w_k$ is zero-mean white Gaussian noise, independent of all other noises, with covariance $Q_k$. $X_k$ is zero mean with covariance $\Pi_k = E[X_k X_k^T]$, which has the recursive relation:

$$\Pi_{k+1} = \Phi_k \Pi_k \Phi_k^T + Q_k$$
The linear least square estimate given observations uptil time k kkX /1
ˆ
, is thus,
kkkkkkkkkkk XwXX ////1
ˆˆˆˆ
Substituting the equation for $\hat X_{k/k}$, we get:
$$\hat X_{k+1/k} = \Phi_k\hat X_{k/k-1} + \Phi_kP_{k/k-1}H_k^T\big(H_kP_{k/k-1}H_k^T + E[v_kv_k^T]\big)^{-1}\big(Y_k - H_k\hat X_{k/k-1}\big)$$
Remember that $\varepsilon_k = Y_k - H_k\hat X_{k/k-1}$ has zero mean and covariance $H_kP_{k/k-1}H_k^T + E[v_kv_k^T]$. Since all the terms involved are zero-mean Gaussian and $\hat X_{k/k-1}$ is independent of $\varepsilon_k$, the covariance $\Sigma_{k+1/k} = E[\hat X_{k+1/k}\hat X_{k+1/k}^T]$ of $\hat X_{k+1/k}$ is obtained as:
$$\Sigma_{k+1/k} = \Phi_k\Sigma_{k/k-1}\Phi_k^T + \Phi_kP_{k/k-1}H_k^T\big(H_kP_{k/k-1}H_k^T + E[v_kv_k^T]\big)^{-1}H_kP_{k/k-1}\Phi_k^T,\qquad \Sigma_{0/-1} = 0$$
Notice that the covariance of $X_{k+1}$, $\Sigma_{k+1}$, is the sum of the covariance of $\hat X_{k+1/k}$, $\Sigma_{k+1/k}$, and the covariance of the estimation error, $P_{k+1/k}$. This is because the error in the estimate is orthogonal to the observations, and hence to the estimate. Thus, we have $\Sigma_{k+1} = \Sigma_{k+1/k} + P_{k+1/k}$.
Using the fact that $w_k$ is independent of $X_k$ and older values, we get:
$$P_{k+1/k} = E\big[(X_{k+1}-\hat X_{k+1/k})(X_{k+1}-\hat X_{k+1/k})^T\big] = E\big[(\Phi_kX_k + w_k - \Phi_k\hat X_{k/k})(\Phi_kX_k + w_k - \Phi_k\hat X_{k/k})^T\big]$$
$$= E\big[\big(\Phi_k(X_k-\hat X_{k/k}) + w_k\big)\big(\Phi_k(X_k-\hat X_{k/k}) + w_k\big)^T\big]$$
$$= \Phi_kE\big[(X_k-\hat X_{k/k})(X_k-\hat X_{k/k})^T\big]\Phi_k^T + E[w_kw_k^T] + \Phi_kE\big[(X_k-\hat X_{k/k})w_k^T\big] + E\big[w_k(X_k-\hat X_{k/k})^T\big]\Phi_k^T$$
And since $E\big[w_k(X_k-\hat X_{k/k})^T\big] = E[w_k]E\big[(X_k-\hat X_{k/k})^T\big] = 0$, because $w_k$ is independent of $X_k$, by assumption, and independent of $Y_k, Y_{k-1}, \dots$, by assumption, we get:
$$P_{k+1/k} = \Phi_kP_{k/k}\Phi_k^T + Q_k,\qquad P_{0/-1} = \Sigma_0$$
This is the covariance of the predicted estimate $\hat X_{k+1/k}$.
We now summarize our findings so far:
(1) $Y_k = H_kX_k + v_k$ (linear observation model)
(2) $X_{k+1} = \Phi_kX_k + w_k$ (linear system dynamics model)
(3) $\varepsilon_k = Y_k - \hat Y_{k/k-1}$
(4) $\hat Y_{k/k-1} = E[Y_k\mathcal{Y}_{k-1}^T]\big(E[\mathcal{Y}_{k-1}\mathcal{Y}_{k-1}^T]\big)^{-1}\mathcal{Y}_{k-1}$, with $\mathcal{Y}_{k-1} = [Y_1^T, Y_2^T, \dots, Y_{k-1}^T]^T$
(5) $\hat X_{l/k} = \sum_{i=0}^{k}E[X_l\varepsilon_i^T]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$
(6) $\hat X_{k/k-1} = \sum_{i=0}^{k-1}E[X_k\varepsilon_i^T]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i$
(7) $\hat X_{k+1/k} = \sum_{i=0}^{k}E[X_{k+1}\varepsilon_i^T]\big(E[\varepsilon_i\varepsilon_i^T]\big)^{-1}\varepsilon_i = \Phi_k\hat X_{k/k}$, or $\hat X_{k/k-1} = \Phi_{k-1}\hat X_{k-1/k-1}$; this is the prediction step for the states
(8) $P_{k/k-1} = E\big[(X_k-\hat X_{k/k-1})(X_k-\hat X_{k/k-1})^T\big]$; $P_{k+1/k} = \Phi_kP_{k/k}\Phi_k^T + E[w_kw_k^T]$, or $P_{k/k-1} = \Phi_{k-1}P_{k-1/k-1}\Phi_{k-1}^T + E[w_{k-1}w_{k-1}^T]$, with $P_{0/-1} = \Sigma_0$; this is the prediction step for the covariance of the state estimate
(9) $P_{k/k} = P_{k/k-1} - K_k\big(H_kP_{k/k-1}H_k^T + E[v_kv_k^T]\big)K_k^T$, $K_k = P_{k/k-1}H_k^T\big(E[\varepsilon_k\varepsilon_k^T]\big)^{-1}$, $P_{0/-1} = E[X_0X_0^T]$; update of the covariance matrix of the state estimate
(10) $\hat X_{k/k} = \hat X_{k/k-1} + P_{k/k-1}H_k^T\big(H_kP_{k/k-1}H_k^T + E[v_kv_k^T]\big)^{-1}\big(Y_k - H_k\hat X_{k/k-1}\big)$; update of the state estimate
So far we have obtained an estimate, $\hat X_{k/k}$, of the stochastic process $X(k) = X_k$, and its covariance matrix $P_{k/k}$, based on the train of observations Y(0), Y(1), ..., Y(k). The above procedure was developed by R. Kalman. We now summarize the estimation steps, the Kalman filter equations, in vector format:
(1) $X_{k+1} = \Phi_kX_k + w_k$ (linear system dynamics model), $w_k \sim N(0, Q_k)$, white Gaussian noise
(2) $Y_k = H_kX_k + v_k$ (linear observation model), $v_k \sim N(0, R_k)$, white Gaussian noise
(3) Prediction step:
$$\hat X_{k/k-1} = \Phi_{k-1}\hat X_{k-1/k-1},\qquad P_{k/k-1} = \Phi_{k-1}P_{k-1/k-1}\Phi_{k-1}^T + Q_{k-1}$$
(4) Updating step:
$$\varepsilon_k = Y_k - \hat Y_{k/k-1} = Y_k - H_k\hat X_{k/k-1}$$
$$S_k = R_k + H_kP_{k/k-1}H_k^T$$
$$K_k = P_{k/k-1}H_k^TS_k^{-1}$$
$$\hat X_{k/k} = \hat X_{k/k-1} + K_k\big(Y_k - H_k\hat X_{k/k-1}\big)$$
$$P_{k/k} = (I - K_kH_k)P_{k/k-1}(I - K_kH_k)^T + K_kR_kK_k^T$$
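As a concrete illustration, one predict/update cycle of the equations above can be sketched in NumPy. This is a minimal sketch in the notation of steps (3) and (4); the scalar random-walk model used to exercise it is an illustrative assumption, not from the text.

```python
import numpy as np

def kalman_step(x_pred, P_pred, y, H, R, Phi, Q):
    """One update/predict cycle in the notation of the text.

    x_pred, P_pred : predicted state X_hat_{k/k-1} and covariance P_{k/k-1}
    Returns X_hat_{k/k}, P_{k/k} and the prediction for step k+1.
    """
    eps = y - H @ x_pred                      # innovation eps_k
    S = R + H @ P_pred @ H.T                  # innovation covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain K_k
    x_upd = x_pred + K @ eps                  # X_hat_{k/k}
    I = np.eye(P_pred.shape[0])
    # Joseph form of the covariance update, as in step (4) above
    P_upd = (I - K @ H) @ P_pred @ (I - K @ H).T + K @ R @ K.T
    x_next = Phi @ x_upd                      # X_hat_{k+1/k}
    P_next = Phi @ P_upd @ Phi.T + Q          # P_{k+1/k}
    return x_upd, P_upd, x_next, P_next
```

Running this step in a loop on simulated data, the filtered mean-square error should fall well below the raw measurement error.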
Karhunen-Loeve Expansion
Karhunen-Loeve Expansion; the Scalar Case [Van Trees; 1968]:
A stochastic process $X(t)$ is expanded in terms of an orthonormal basis $\phi_i(t)$ as:
$$X(t) = \sum_{i=1}^{\infty}\xi_i\phi_i(t),\qquad 0\le t\le T$$
where
$$\xi_i = \int_0^T \phi_i(t)X(t)\,dt,\qquad \int_0^T \phi_i(t)\phi_j(t)\,dt = \delta_{ij}$$
The basis functions $\phi_i(t)$ are chosen such that the random coefficients $\xi_i$ are uncorrelated; viz.:
If $E[\xi_i] = m_i$, then $E[\xi_i\xi_j] - m_im_j = \lambda_i\delta_{ij}$.
For the process $X(t)$ with mean $m(t)$, define the covariance function $K(t,u)$ as:
$$K(t,u) = E\big[(X(t)-m(t))(X(u)-m(u))\big]$$
The covariance function, for uncorrelated $\xi_i$, satisfies the Fredholm integral equation of the second kind:
$$\int_0^T K(t,u)\phi_i(u)\,du = \lambda_i\phi_i(t),\qquad 0\le t\le T$$
and can be expanded in terms of the orthonormal basis as:
$$K(t,u) = \sum_{i=1}^{\infty}\lambda_i\phi_i(t)\phi_i(u),\qquad 0\le t,u\le T$$
It is this equation that we use to find the orthonormal basis.
Proof of the Fredholm integral equation of the second kind:
Assuming a zero-mean stochastic process, and thus zero-mean random variables $\xi_i$, we use the equation $E[\xi_i\xi_j] = \lambda_i\delta_{ij}$.
Now
$$E[\xi_i\xi_j] = E\Big[\int_0^T\phi_i(t)X(t)\,dt\int_0^T\phi_j(u)X(u)\,du\Big]$$
Exchanging expectation and integration we get:
$$E[\xi_i\xi_j] = \int_0^T\phi_i(t)\int_0^T E[X(t)X(u)]\,\phi_j(u)\,du\,dt = \int_0^T\phi_i(t)\int_0^T K(t,u)\phi_j(u)\,du\,dt$$
A necessary and sufficient condition on the right-hand side of the equation is:
$$\int_0^T K(t,u)\phi_j(u)\,du = \lambda_j\phi_j(t)$$
In this case we get:
$$E[\xi_i\xi_j] = \int_0^T\phi_i(t)\lambda_j\phi_j(t)\,dt = \lambda_j\int_0^T\phi_i(t)\phi_j(t)\,dt = \lambda_j\delta_{ij}$$
which is the left-hand side of the equation.
Example: Wiener process. The Wiener process is a zero-mean process with covariance
$$K(t,u) = \sigma^2\min(t,u) = \begin{cases}\sigma^2u, & u\le t\\ \sigma^2t, & t\le u\end{cases}$$
Thus,
$$\lambda_i\phi_i(t) = \int_0^T K(t,u)\phi_i(u)\,du = \sigma^2\int_0^t u\,\phi_i(u)\,du + \sigma^2t\int_t^T\phi_i(u)\,du$$
Taking the derivative w.r.t. $t$ of both sides we get:
$$\lambda_i\frac{d\phi_i(t)}{dt} = \sigma^2\int_t^T\phi_i(u)\,du$$
Taking the derivative one more time we get:
$$\lambda_i\frac{d^2\phi_i(t)}{dt^2} = -\sigma^2\phi_i(t)$$
This yields the solution:
$$\lambda_n = \frac{\sigma^2T^2}{\big(n-\tfrac12\big)^2\pi^2},\qquad \phi_n(t) = \sqrt{\frac{2}{T}}\,\sin\Big(\big(n-\tfrac12\big)\frac{\pi t}{T}\Big)$$
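These eigenpairs can be checked numerically by discretizing the Fredholm equation into a matrix eigenvalue problem (a sketch; the grid size and the values of σ and T are arbitrary choices):

```python
import numpy as np

sigma, T, N = 1.0, 1.0, 400
t = (np.arange(N) + 0.5) * T / N            # midpoint grid on [0, T]
K = sigma**2 * np.minimum.outer(t, t)       # K(t,u) = sigma^2 min(t,u)
# Discretized Fredholm equation: (K dt) phi = lambda phi
lam, V = np.linalg.eigh(K * (T / N))
lam = lam[::-1]                             # sort eigenvalues, largest first
lam_theory = (sigma * T / ((np.arange(1, 4) - 0.5) * np.pi)) ** 2
```

The leading eigenvalues of the discretized kernel approach $\sigma^2T^2/((n-\tfrac12)^2\pi^2)$, and the leading eigenvector matches $\sin(\pi t/2T)$ up to normalization.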
Example: Stationary process. Assume that the stochastic process $X(t)$ is zero mean and stationary with correlation $R(\tau) = E[X(t)X(t+\tau)] = Pe^{-k|\tau|}$ and spectrum
$$S(w) = \frac{N(w^2)}{D(w^2)} = \frac{2kP}{w^2+k^2}$$
Thus, $K(t,u) = R(t-u)$. The Fredholm integral equation becomes:
$$\lambda_i\phi_i(t) = \int_{-T}^{T}K(t,u)\phi_i(u)\,du = \int_{-T}^{T}R(t-u)\phi_i(u)\,du = P\int_{-T}^{t}e^{-k(t-u)}\phi_i(u)\,du + P\int_t^{T}e^{-k(u-t)}\phi_i(u)\,du$$
Differentiating w.r.t. $t$ we get:
$$\lambda_i\frac{d\phi_i(t)}{dt} = -kP\int_{-T}^{t}e^{-k(t-u)}\phi_i(u)\,du + kP\int_t^{T}e^{-k(u-t)}\phi_i(u)\,du$$
Differentiating again w.r.t. $t$ we get:
$$\lambda_i\frac{d^2\phi_i(t)}{dt^2} = -2kP\phi_i(t) + k^2P\int_{-T}^{T}e^{-k|t-u|}\phi_i(u)\,du = -2kP\phi_i(t) + k^2\lambda_i\phi_i(t)$$
which has a solution:
$$\phi_i(t) = c_1e^{jb_it} + c_2e^{-jb_it},\qquad b_i^2 = \frac{2kP}{\lambda_i} - k^2,\qquad \lambda_i = \frac{2kP}{b_i^2+k^2}$$
After some manipulations we end up with the expressions:
$$\phi_i(t) = \begin{cases}\dfrac{\cos(b_it)}{\Big(T + \dfrac{\sin(2b_iT)}{2b_i}\Big)^{1/2}}, & i\ \text{is odd}\\[3ex] \dfrac{\sin(b_it)}{\Big(T - \dfrac{\sin(2b_iT)}{2b_i}\Big)^{1/2}}, & i\ \text{is even}\end{cases}$$
Karhunen-Loeve expansion for a stationary process:
For the spectrum given by $S(w) = \dfrac{N(w^2)}{D(w^2)}$, where the numerator order is $q$ and the denominator order is $p$ with $q<p$, and assuming that the data are available for a long observation time, $t\in(-\infty,\infty)$, the Fredholm integral equation has the form:
$$\lambda_i\phi_i(t) = \int_{-\infty}^{\infty}R(t-u)\phi_i(u)\,du = R(t)*\phi_i(t),\qquad t\in(-\infty,\infty)$$
where $*$ is the convolution operator.
Using the Fourier transform, one can obtain a solution to the above equation as:
$$\lambda_i\Phi_i(w) = S(w)\Phi_i(w) = \frac{N(w^2)}{D(w^2)}\Phi_i(w)$$
or
$$0 = \big[\lambda_iD(w^2) - N(w^2)\big]\Phi_i(w)$$
For each $\lambda_i$, there are $p$ homogeneous solutions corresponding to the roots of $\lambda_iD(w^2) - N(w^2)$. We denote these solutions $\psi_l(t,\lambda_i),\ l=1,\dots,p$.
Thus,
$$\phi_i(t) = \sum_{l=1}^{p}c_l\psi_l(t,\lambda_i)$$
Substituting in the Fredholm equation, to find the $c_l$ and $\lambda_i$, we get:
$$\lambda_i\sum_{l=1}^{p}c_l\psi_l(t,\lambda_i) = \int_{-\infty}^{\infty}R(t-u)\sum_{l=1}^{p}c_l\psi_l(u,\lambda_i)\,du,\qquad t\in(-\infty,\infty)$$
or
$$\lambda_i\psi_l(t,\lambda_i) = \int_{-\infty}^{\infty}R(t-u)\psi_l(u,\lambda_i)\,du,\qquad l=1,\dots,p,\quad t\in(-\infty,\infty)$$
Example: In the above we used $S(w) = \dfrac{N(w^2)}{D(w^2)} = \dfrac{2kP}{w^2+k^2}$; we need to solve
$$0 = \lambda_iD(w^2) - N(w^2) = \lambda_i(w^2+k^2) - 2kP$$
The roots are located at $\lambda_iw^2 + \lambda_ik^2 - 2kP = 0$, i.e.
$$w^2 = \frac{2kP - \lambda_ik^2}{\lambda_i} = \frac{2kP}{\lambda_i} - k^2$$
Thus, the two homogeneous solutions $\psi_l(t,\lambda_i),\ l=1,2$, are given by $\psi_1(t,\lambda_i) = e^{jwt}$ and $\psi_2(t,\lambda_i) = e^{-jwt}$, and we get
$$\phi_i(t) = c_1e^{jwt} + c_2e^{-jwt}$$
Karhunen-Loeve Expansion; the Vector Case:
A stochastic vector process $X(t) = [X_1(t),\dots,X_N(t)]^T$ is expanded in terms of orthonormal basis vectors $\phi_i(t) = [\phi_{i1}(t),\dots,\phi_{iN}(t)]^T$ as:
$$X(t) = \sum_{i=1}^{\infty}\xi_i\phi_i(t),\qquad 0\le t\le T$$
where
$$\xi_i = \int_0^T\phi_i^T(t)X(t)\,dt,\qquad \int_0^T\phi_i^T(t)\phi_j(t)\,dt = \delta_{ij}$$
The basis vectors $\phi_i(t)$ are chosen such that the random coefficients $\xi_i$ are uncorrelated; viz.:
If $E[\xi_i] = m_i$, then $E[\xi_i\xi_j] - m_im_j = \lambda_i\delta_{ij}$.
For the process $X(t)$ with mean $m(t)$, define the covariance matrix $K(t,u)$ as:
$$K(t,u) = E\big[(X(t)-m(t))(X(u)-m(u))^T\big]$$
The covariance matrix, for uncorrelated $\xi_i$, satisfies the Fredholm integral equation:
$$\int_0^T K(t,u)\phi_i(u)\,du = \lambda_i\phi_i(t),\qquad 0\le t\le T$$
and can be expanded in terms of the orthonormal basis as:
$$K(t,u) = \sum_{i=1}^{\infty}\lambda_i\phi_i(t)\phi_i^T(u),\qquad 0\le t,u\le T$$
It is this equation that we use to find the orthonormal basis.
The Wiener Filter
We shall develop the Wiener filter for stationary signals. Nonstationary signals could, in many
cases, be reduced to stationary signals by focusing on a small window of the signal. In this
window, the statistical properties do not change much i.e. we have a stationary signal.
Assume that we receive/observe a signal $y(n)$ that is correlated with another signal $X(n)$. We use $y(n)$ to find an estimate, $\hat X(n)$, of $X(n)$ according to the following equation:
$$\hat X(n) = \sum_{i=0}^{I}h(i)y(n-i)$$
Define $e(n) = X(n) - \hat X(n)$ and $\xi(n) = E[e^2(n)] = E\big[(X(n)-\hat X(n))^2\big]$, where $h(0), h(1), \dots, h(I)$ are the unknown filter parameters. In order to find the filter parameters, we minimize the expected value of the squared error w.r.t. the unknowns:
$$0 = \frac{\partial\xi(n)}{\partial h(l)} = -2E\big[(X(n)-\hat X(n))y(n-l)\big] = -2E\Big[\Big(X(n) - \sum_{i=0}^{I}h(i)y(n-i)\Big)y(n-l)\Big]$$
This yields:
$$E[X(n)y(n-l)] = \sum_{i=0}^{I}h(i)E[y(n-i)y(n-l)],\qquad l=0,1,\dots,I$$
i.e.
$$R_{Xy}(l) = \sum_{i=0}^{I}h(i)R_{yy}(l-i),\qquad l=0,1,\dots,I$$
These are a set of $(I+1)$ equations in the filter coefficients $h(0), h(1), \dots, h(I)$.
We usually have the observations $y(n)$, which give us an estimate of $R_{yy}(i)$. We still need a relation between $y(n)$ and $X(n)$ in order to get an estimate of $R_{Xy}(l)$.
In some situations, as in noise cancelling, we also have the signal $X(n)$. We can then get a numerical estimate for $R_{Xy}(l)$ as:
$$\hat R_{Xy}(l) = \frac{1}{N-l+1}\sum_{n=l}^{N}X(n)y(n-l)$$
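In code, such a sample estimate is a one-liner (a sketch; normalizing by the number of valid products is one common convention):

```python
import numpy as np

def cross_corr(X, y, l):
    """Sample estimate of R_Xy(l) = E[X(n) y(n-l)] for l >= 0."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return np.mean(X[l:] * y[:len(y) - l]) if l > 0 else np.mean(X * y)
```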
Assume that $y(n)$ is a linear observation of the process $X(n)$; viz.:
$$y(n) = cX(n) + v(n)$$
where $v(n)$ is additive zero-mean Gaussian noise of variance $\sigma_v^2$, independent of $X(n)$.
In this case, in order to get $R_{Xy}(l)$, we multiply by $y(n-l)$ and take expectations as follows:
$l=0$:
$$R_{yy}(0) = E[y(n)y(n)] = cE[y(n)X(n)] + E[y(n)v(n)] = cR_{Xy}(0) + E\big[(cX(n)+v(n))v(n)\big] = cR_{Xy}(0) + \sigma_v^2$$
Thus,
$$R_{Xy}(0) = \frac{1}{c}\big(R_{yy}(0) - \sigma_v^2\big)$$
where we used $E[X(n)v(n)] = E[X(n)]E[v(n)] = 0$, because $X(n)$ and $v(n)$ are independent and $E[v(n)] = 0$.
$l=1$:
$$R_{yy}(1) = E[y(n)y(n-1)] = cE[y(n-1)X(n)] + E[y(n-1)v(n)] = cR_{Xy}(1)$$
i.e. $R_{Xy}(1) = \frac{1}{c}R_{yy}(1)$, and in general
$$R_{Xy}(l) = \frac{1}{c}R_{yy}(l),\qquad l\ge 1$$
The observations are a convolution of the state:
Assume that $y(n)$ is a linear observation of the process $X(n)$; viz.:
$$y(n) = c_0X(n) + c_1X(n-1) + v(n)$$
where $v(n)$ is additive zero-mean Gaussian noise of variance $\sigma_v^2$, independent of $X(n)$.
In this case, in order to get $R_{Xy}(l)$, we multiply by $y(n-l)$ and take expectations as follows:
$l=0$:
$$R_{yy}(0) = c_0E[y(n)X(n)] + c_1E[y(n)X(n-1)] + E[y(n)v(n)] = c_0R_{Xy}(0) + c_1R_{Xy}(1) + \sigma_v^2$$
Similarly
$l=1$:
$$R_{yy}(1) = c_0E[y(n-1)X(n)] + c_1E[y(n-1)X(n-1)] + E[y(n-1)v(n)] = c_0R_{Xy}(1) + c_1R_{Xy}(0)$$
In matrix format we have:
$$\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\end{bmatrix} = \begin{bmatrix}c_0 & c_1\\ c_1 & c_0\end{bmatrix}\begin{bmatrix}R_{Xy}(0)\\ R_{Xy}(1)\end{bmatrix}$$
Thus,
$$\begin{bmatrix}R_{Xy}(0)\\ R_{Xy}(1)\end{bmatrix} = \begin{bmatrix}c_0 & c_1\\ c_1 & c_0\end{bmatrix}^{-1}\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\end{bmatrix}$$
For
$$y(n) = \sum_{k=0}^{2}c_kX(n-k) + v(n)$$
we have, for $l=0$:
$$R_{yy}(0) = c_0E[y(n)X(n)] + c_1E[y(n)X(n-1)] + c_2E[y(n)X(n-2)] + E[y(n)v(n)] = c_0R_{Xy}(0) + c_1R_{Xy}(1) + c_2R_{Xy}(2) + \sigma_v^2$$
Similarly, for $l=1$:
$$R_{yy}(1) = c_0R_{Xy}(1) + c_1R_{Xy}(0) + c_2R_{Xy}(1) = c_1R_{Xy}(0) + (c_0+c_2)R_{Xy}(1)$$
and for $l=2$:
$$R_{yy}(2) = c_0R_{Xy}(2) + c_1R_{Xy}(1) + c_2R_{Xy}(0)$$
In matrix format we have:
$$\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\\ R_{yy}(2)\end{bmatrix} = \begin{bmatrix}c_0 & c_1 & c_2\\ c_1 & c_0+c_2 & 0\\ c_2 & c_1 & c_0\end{bmatrix}\begin{bmatrix}R_{Xy}(0)\\ R_{Xy}(1)\\ R_{Xy}(2)\end{bmatrix}$$
Thus,
$$\begin{bmatrix}R_{Xy}(0)\\ R_{Xy}(1)\\ R_{Xy}(2)\end{bmatrix} = \begin{bmatrix}c_0 & c_1 & c_2\\ c_1 & c_0+c_2 & 0\\ c_2 & c_1 & c_0\end{bmatrix}^{-1}\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\\ R_{yy}(2)\end{bmatrix}$$
Example: Assume that we have a first-order filter, i.e. $I=1$, with $h(0)$ and $h(1)$ unknown. We also have the observations $y(n)$ related to the signal $X(n)$ by $y(n) = cX(n) + v(n)$. Thus,
$$R_{Xy}(0) = \frac{1}{c}\big(R_{yy}(0)-\sigma_v^2\big),\qquad R_{Xy}(1) = \frac{1}{c}R_{yy}(1)$$
$$\hat X(n) = h(0)y(n) + h(1)y(n-1)$$
$$0 = \frac{\partial\xi(n)}{\partial h(0)} = -2E\Big[\Big(X(n) - \sum_{i=0}^{1}h(i)y(n-i)\Big)y(n)\Big]$$
This yields:
$$E[X(n)y(n)] = E\big[(h(0)y(n) + h(1)y(n-1))y(n)\big] = h(0)E[y(n)y(n)] + h(1)E[y(n-1)y(n)]$$
i.e. $R_{Xy}(0) = h(0)R_{yy}(0) + h(1)R_{yy}(1)$, or
$$\frac{1}{c}\big(R_{yy}(0)-\sigma_v^2\big) = h(0)R_{yy}(0) + h(1)R_{yy}(1)$$
Similarly
$$0 = \frac{\partial\xi(n)}{\partial h(1)} = -2E\Big[\Big(X(n) - \sum_{i=0}^{1}h(i)y(n-i)\Big)y(n-1)\Big]$$
This yields:
$$E[X(n)y(n-1)] = h(0)E[y(n)y(n-1)] + h(1)E[y(n-1)y(n-1)]$$
i.e. $R_{Xy}(1) = h(0)R_{yy}(1) + h(1)R_{yy}(0)$, or
$$\frac{1}{c}R_{yy}(1) = h(0)R_{yy}(1) + h(1)R_{yy}(0)$$
In matrix format we have:
$$\frac{1}{c}\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\end{bmatrix} = \begin{bmatrix}R_{yy}(0) & R_{yy}(1)\\ R_{yy}(1) & R_{yy}(0)\end{bmatrix}\begin{bmatrix}h(0)\\ h(1)\end{bmatrix}$$
Inverting the matrix we get:
$$\begin{bmatrix}h(0)\\ h(1)\end{bmatrix} = \frac{1}{c}\begin{bmatrix}R_{yy}(0) & R_{yy}(1)\\ R_{yy}(1) & R_{yy}(0)\end{bmatrix}^{-1}\begin{bmatrix}R_{yy}(0)-\sigma_v^2\\ R_{yy}(1)\end{bmatrix}$$
This is the desired result.
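The relations just derived can be exercised numerically (a sketch; the AR(1) choice for X(n) and the values of c, σ_v, and N are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
c, sig_v, N = 2.0, 0.5, 100_000
# A concrete stationary signal X(n): an AR(1) process (an assumption)
v = rng.normal(size=N)
X = np.zeros(N)
for n in range(1, N):
    X[n] = 0.9 * X[n - 1] + v[n]
y = c * X + sig_v * rng.normal(size=N)      # observation y(n) = c X(n) + v(n)

Ryy0 = np.mean(y * y)
Ryy1 = np.mean(y[1:] * y[:-1])
# R_Xy from the derived relations, then solve the 2x2 normal equations
RXy = np.array([Ryy0 - sig_v**2, Ryy1]) / c
h = np.linalg.solve(np.array([[Ryy0, Ryy1], [Ryy1, Ryy0]]), RXy)
X_hat = h[0] * y[1:] + h[1] * y[:-1]        # X_hat(n) = h(0)y(n) + h(1)y(n-1)
```

The two-tap Wiener estimate should beat the naive estimate y(n)/c in mean-square error.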
Adaptive Filters
Adaptive Frequency Estimation:
Assume that the stochastic process $X(t)$ is the output of a linear filter $H(z)$ driven by white Gaussian noise. Assume further that the filter is an AR process with real coefficients $a_k$, i.e. it has only poles. In this case the filter is given by:
$$H(e^{jw}) = \frac{1}{1 + a_1e^{-jw} + \dots + a_Me^{-jMw}} = \frac{1}{1 + \sum_{k=1}^{M}a_ke^{-jkw}}$$
For a zero-mean Gaussian process $v(n)$ with variance $\sigma_v^2$, the output spectrum $S_{AR}(w)$ becomes:
$$S_{AR}(w) = \sigma_v^2\big|H(e^{jw})\big|^2 = \frac{\sigma_v^2}{\Big|1 + \sum_{k=1}^{M}a_ke^{-jkw}\Big|^2}$$
$$X(n) = -\sum_{k=1}^{M}a_kX(n-k) + v(n)$$
If we assume adaptive parameters, i.e. we have $\hat a_k(n)$, the spectrum also becomes adaptive, $S_{AR}(w,n)$, and is given by:
$$S_{AR}(w,n) = \frac{\sigma_v^2}{\Big|1 + \sum_{k=1}^{M}\hat a_k(n)e^{-jkw}\Big|^2}$$
The parameters are updated using, e.g., the LMS algorithm as follows:
$$\hat a_k(n+1) = \hat a_k(n) - \mu\,e(n)X(n-k)$$
$$e(n) = X(n) - \hat X(n) = X(n) + \sum_{k=1}^{M}\hat a_k(n)X(n-k)$$
Example: Single sinusoid with random frequency. Assume that the received signal consists of a single sinusoid whose unknown frequency behaves according to an OU process. The sinusoid $X(n)$ is modeled as an AR(2) process. The initial conditions $X(-2)$, $X(-1)$, $X(0)$ determine the phase, and the values of the coefficients determine the frequency.
$$X(n) = -a_1X(n-1) - a_2X(n-2) + v(n)$$
which has the transfer function:
$$\frac{X(z)}{v(z)} = \frac{z^2}{z^2 + a_1z + a_2} = \frac{z^2}{\big(z-(\alpha+j\beta)\big)\big(z-(\alpha-j\beta)\big)} = \frac{z^2}{z^2 - 2\alpha z + \alpha^2 + \beta^2}$$
i.e.
$$\alpha = -a_1/2,\qquad \beta = \sqrt{a_2 - \frac{a_1^2}{4}}$$
$$f = \frac{\arctan(\beta/\alpha)}{2\pi\Delta}$$
where $\Delta$ is the sampling interval. Notice that the arctan operation yields the phase between $-\pi$ and $\pi$.
The frequency $f(t)$ is modeled as an Ornstein-Uhlenbeck (OU) process;
$$df(t) = -\beta_f f(t)\,dt + \sigma_f\,dW(t)$$
where $W(t)$ is the Wiener process.
We use the LMS algorithm to find an estimate of the frequency. We first estimate the changing coefficients according to the equations:
$$\hat a_k(n+1) = \hat a_k(n) - \mu\,e(n)X(n-k)$$
$$e(n) = X(n) - \hat X(n) = X(n) + \sum_{k=1}^{M}\hat a_k(n)X(n-k)$$
The frequency is estimated as:
$$\hat f(n) = \frac{\arctan\big(\hat\beta(n)/\hat\alpha(n)\big)}{2\pi\Delta}$$
where $\hat\alpha(n) = -\hat a_1(n)/2$ and $\hat\beta(n) = \sqrt{\hat a_2(n) - \hat a_1^2(n)/4}$.
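The frequency tracker can be sketched as follows (illustrative: the true frequency is held constant rather than driven by an OU process, and the step size μ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
f0, dt, N, mu = 0.1, 1.0, 6000, 0.05       # true frequency in cycles/sample
n = np.arange(N)
X = np.cos(2 * np.pi * f0 * n * dt) + 0.01 * rng.normal(size=N)

a = np.zeros(2)                            # a_hat_1(n), a_hat_2(n)
for k in range(2, N):
    # model: X(n) = -a1 X(n-1) - a2 X(n-2) + v(n)
    e = X[k] + a[0] * X[k - 1] + a[1] * X[k - 2]
    a -= mu * e * np.array([X[k - 1], X[k - 2]])   # LMS update

alpha = -a[0] / 2.0
beta = np.sqrt(max(a[1] - (a[0] / 2.0) ** 2, 0.0))
f_hat = np.arctan2(beta, alpha) / (2 * np.pi * dt)
```

For a clean sinusoid the coefficients converge to $a_1 = -2\cos(2\pi f_0\Delta)$, $a_2 = 1$, so $\hat f$ recovers $f_0$.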
Adaptive Noise Cancelling to Remove Sinusoidal Interference:
Assume that the signal $X(n)$ is modeled as:
$$X(n) = S(n) + A_0\cos(nw_0 + \phi_0)$$
where $A_0\cos(nw_0+\phi_0)$ is the sinusoidal interference and $S(n)$ is the desired unknown signal.
We also have another reference signal, $y(n)$, that has the same frequency as the interference but differs in amplitude and phase; viz.:
$$y(n) = A\cos(nw_0 + \phi),\qquad A\ne A_0,\ \phi\ne\phi_0$$
The estimated signal, $\hat X(n)$, and the error, $e(n)$, are modeled as:
$$\hat X(n) = \sum_{i=0}^{M-1}\hat a_i(n)y(n-i) = \sum_{i=0}^{M-1}\hat a_i(n)A\cos\big((n-i)w_0 + \phi\big)$$
$$e(n) = X(n) - \hat X(n) = S(n) + A_0\cos(nw_0+\phi_0) - \sum_{i=0}^{M-1}\hat a_i(n)A\cos\big((n-i)w_0 + \phi\big)$$
The Wiener filter weights, $\hat a_i(n)$, are updated through the LMS algorithm as:
$$\hat a_i(n+1) = \hat a_i(n) + \mu\,e(n)y(n-i)$$
Notice that the signal $X(n)$ is made of two parts: (1) $e(n)$, which is not correlated with $y(n)$, and (2) $\hat X(n)$, which is correlated with $y(n)$. The adaptive filter output converges to the correlated part, and thus the remainder is the desired signal.
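A minimal simulation of this canceller (the amplitudes, phases, frequency, and step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, w0, mu, M = 20_000, 0.3, 0.01, 2
n = np.arange(N)
S = 0.5 * rng.normal(size=N)                  # desired (unknown) signal
X = S + 2.0 * np.cos(w0 * n + 0.7)            # signal + sinusoidal interference
y = 1.3 * np.cos(w0 * n + 1.9)                # reference: same frequency only

a = np.zeros(M)
e = np.zeros(N)
for k in range(M, N):
    yv = y[k - np.arange(M)]                  # y(n), y(n-1)
    e[k] = X[k] - a @ yv                      # e(n) -> S(n) after convergence
    a += mu * e[k] * yv                       # LMS weight update
```

Two taps suffice to match the interference's amplitude and phase at a single frequency, so $e(n)$ converges to the desired $S(n)$.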
Adaptive Line Enhancement (ALE):
In some situations only one signal, $X(n)$, is available. This signal is made of two parts: (1) a slowly varying signal $X_1(n)$ with slowly decaying correlation, and (2) a rapidly varying signal $X_2(n)$, independent of $X_1(n)$, that may be nonstationary or have short-duration correlation. Define another signal $y(n) = X(n-\Delta)$, where $\Delta$ is a delay. We choose the delay such that $y(n)$ is highly correlated with the slowly varying part of $X(n)$ but weakly correlated with the rapidly varying part. Thus, $X(n)$ has two components: (1) $\hat X(n)\approx X_1(n)$, which is highly correlated with $y(n)$, and (2) $e(n)\approx X_2(n)$, which is orthogonal to or independent of $y(n)$.
Let $X(n) = X_1(n) + X_2(n)$; then
$$E[X(n)X(n-\Delta)] = E\big[(X_1(n)+X_2(n))(X_1(n-\Delta)+X_2(n-\Delta))\big]$$
$$= E[X_1(n)X_1(n-\Delta)] + E[X_1(n)X_2(n-\Delta)] + E[X_2(n)X_1(n-\Delta)] + E[X_2(n)X_2(n-\Delta)]$$
$$= E[X_1(n)X_1(n-\Delta)] = R_{11}(\Delta)$$
since the cross terms vanish by independence and $E[X_2(n)X_2(n-\Delta)]\approx 0$ for the chosen delay. Similarly, with $y(n) = X(n-\Delta) = X_1(n-\Delta) + X_2(n-\Delta)$:
$$E[y(n)X(n)] = E[X_1(n-\Delta)X_1(n)] = R_{11}(\Delta)$$
Define
$$\hat X(n) = h(0)y(n) + h(1)y(n-1) = h(0)X(n-\Delta) + h(1)X(n-\Delta-1)$$
$$e(n) = X(n) - \hat X(n) = X(n) - h(0)X(n-\Delta) - h(1)X(n-\Delta-1)$$
i.e. $X(n) = e(n) + \hat X(n) = e(n) + h(0)X(n-\Delta) + h(1)X(n-\Delta-1)$.
$$\xi(n) = E[e^2(n)] = E\big[\big(X(n) - h(0)X(n-\Delta) - h(1)X(n-\Delta-1)\big)^2\big]$$
To find the unknown coefficients, $h(0)$ and $h(1)$, we minimize $\xi(n)$ and equate the derivatives to zero:
$$0 = \frac{\partial\xi(n)}{\partial h(0)} = -2E[e(n)X(n-\Delta)] = -2E\big[\big(X(n) - h(0)X(n-\Delta) - h(1)X(n-\Delta-1)\big)X(n-\Delta)\big]$$
i.e.
$$E[X(n)X(n-\Delta)] = h(0)E[X^2(n-\Delta)] + h(1)E[X(n-\Delta)X(n-\Delta-1)]$$
Similarly
$$0 = \frac{\partial\xi(n)}{\partial h(1)} = -2E[e(n)X(n-\Delta-1)]$$
i.e.
$$E[X(n)X(n-\Delta-1)] = h(0)E[X(n-\Delta)X(n-\Delta-1)] + h(1)E[X^2(n-\Delta-1)]$$
These are two equations in the two unknowns $h(0)$ and $h(1)$.
Since
$$X(n) = e(n) + \hat X(n) = e(n) + h(0)y(n) + h(1)y(n-1)$$
the signal $X(n)$ has two parts: (1) $e(n)$, which is uncorrelated with $y(n)$, and (2) $\hat X(n)$, which is highly correlated with $y(n)$. In our case, $\hat X(n)$ is an estimate of $X_1(n)$ and $e(n)$ is an estimate of $X_2(n)$.
Example: Suppose the signal is a product of stationary and nonstationary components, i.e. $Z(n) = Z_1(n)Z_2(n)$. Taking the log, assuming nonnegative quantities, we get $\log Z(n) = \log Z_1(n) + \log Z_2(n)$. Define $X(n) = \log Z(n)$, $X_1(n) = \log Z_1(n)$, $X_2(n) = \log Z_2(n)$, and we follow the same steps as before.
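A numerical sketch of the ALE normal equations above, with a sinusoid standing in for the slowly varying part and white noise for the rapidly varying part (the delay Δ and all signal parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 50_000, 126                      # delay decorrelates X2; X1 stays correlated
n = np.arange(N)
X1 = np.cos(0.05 * n)                   # slowly varying, long correlation
X2 = rng.normal(size=N)                 # rapidly varying, white
X = X1 + X2
y = np.concatenate([np.zeros(D), X[:-D]])   # y(n) = X(n - D)

# Two normal equations in h(0), h(1), built from sample correlations
Ryy0 = np.mean(y * y)
Ryy1 = np.mean(y[1:] * y[:-1])
b = np.array([np.mean(X * y), np.mean(X[1:] * y[:-1])])   # E[X(n)y(n)], E[X(n)y(n-1)]
h = np.linalg.solve(np.array([[Ryy0, Ryy1], [Ryy1, Ryy0]]), b)
X1_hat = h[0] * y + h[1] * np.concatenate([[0.0], y[:-1]])
```

$\hat X_1(n)$ tracks the narrowband component; its mean-square error against $X_1(n)$ should beat simply ignoring the reference.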
Blind Deconvolution
Assume that we have one source of signals, e.g. the aortic pressure $P_A(n)$. We measure the pressure at the femoral artery, $P_F(n)$, and the iliac artery, $P_I(n)$, through the unknown filters $F_F(n)$ and $F_I(n)$:
$$P_F(n) = F_F(n)*P_A(n)$$
$$P_I(n) = F_I(n)*P_A(n)$$
where $*$ is the convolution operation.
Convolving the first equation with $F_I(n)$ and the second with $F_F(n)$, we get:
$$F_I(n)*P_F(n) = F_I(n)*F_F(n)*P_A(n)$$
$$F_F(n)*P_I(n) = F_F(n)*F_I(n)*P_A(n)$$
Notice that the right-hand sides of both equations are equal. Thus,
$$F_I(n)*P_F(n) = F_F(n)*P_I(n) = F_I(n)*F_F(n)*P_A(n)$$
The two pressures $P_F(n)$ and $P_I(n)$ are highly correlated. If we use a Wiener filter as a correlation canceller, the output is an exact replica of the other signal, i.e. we get an estimate of $P_I(n)$ if we pass $P_F(n)$ through the Wiener filter.
Instead, if we use two Wiener filters $H_F(n)$ and $H_I(n)$ such that the error between the two outputs is zero, then the output of each filter is $P_A(n)$, i.e. $H_F(n)$ is actually the inverse of $F_F(n)$ and $H_I(n)$ is the inverse of $F_I(n)$. Thus,
$$P_A(n) = H_F(n)*P_F(n) = \sum_{i=0}^{I_F}H_F(i)P_F(n-i)$$
$$P_A(n) = H_I(n)*P_I(n) = \sum_{j=0}^{J_I}H_I(j)P_I(n-j)$$
The error is:
$$e(n) = P_A(n) - P_A(n) = 0 = \sum_{i=0}^{I_F}H_F(i)P_F(n-i) - \sum_{j=0}^{J_I}H_I(j)P_I(n-j)$$
Define
$$\xi(n) = E[e^2(n)] = E\Big[\Big(\sum_{i=0}^{I_F}H_F(i)P_F(n-i) - \sum_{j=0}^{J_I}H_I(j)P_I(n-j)\Big)^2\Big]$$
To find the coefficients of the two Wiener filters, we minimize $\xi(n)$ w.r.t. the unknowns as follows:
$$0 = \frac{\partial\xi(n)}{\partial H_F(k)} \propto E\Big[\Big(\sum_{i=0}^{I_F}H_F(i)P_F(n-i) - \sum_{j=0}^{J_I}H_I(j)P_I(n-j)\Big)P_F(n-k)\Big],\qquad k=0,\dots,I_F$$
$$0 = \frac{\partial\xi(n)}{\partial H_I(l)} \propto E\Big[\Big(\sum_{i=0}^{I_F}H_F(i)P_F(n-i) - \sum_{j=0}^{J_I}H_I(j)P_I(n-j)\Big)P_I(n-l)\Big],\qquad l=0,\dots,J_I$$
This yields only $I_F+J_I+1$ independent equations in the $I_F+J_I+2$ unknowns. Thus, another equation is needed to find a unique solution.
Example: Assume that we have first-order filters. Thus,
$$P_A(n) = H_F(n)*P_F(n) = \sum_{i=0}^{1}H_F(i)P_F(n-i) = H_F(0)P_F(n) + H_F(1)P_F(n-1)$$
$$P_A(n) = H_I(n)*P_I(n) = \sum_{j=0}^{1}H_I(j)P_I(n-j) = H_I(0)P_I(n) + H_I(1)P_I(n-1)$$
and
$$e(n) = 0 = \sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)$$
$$\xi(n) = E[e^2(n)] = E\Big[\Big(\sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)\Big)^2\Big]$$
To find the coefficients of the two Wiener filters, we minimize $\xi(n)$ w.r.t. the unknowns as follows.
For $k=0$:
$$0 = \frac{\partial\xi(n)}{\partial H_F(0)} \propto E\Big[\Big(\sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)\Big)P_F(n)\Big]$$
which yields
$$H_F(0)R_{FF}(0) + H_F(1)R_{FF}(1) - H_I(0)R_{FI}(0) - H_I(1)R_{FI}(1) = 0$$
For $k=1$:
$$0 = \frac{\partial\xi(n)}{\partial H_F(1)} \propto E\Big[\Big(\sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)\Big)P_F(n-1)\Big]$$
which yields
$$H_F(0)R_{FF}(1) + H_F(1)R_{FF}(0) - H_I(0)R_{IF}(1) - H_I(1)R_{FI}(0) = 0$$
Similarly, for the iliac artery we get:
For $l=0$:
$$0 = \frac{\partial\xi(n)}{\partial H_I(0)} \propto E\Big[\Big(\sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)\Big)P_I(n)\Big]$$
which yields
$$H_F(0)R_{FI}(0) + H_F(1)R_{IF}(1) - H_I(0)R_{II}(0) - H_I(1)R_{II}(1) = 0$$
For $l=1$:
$$0 = \frac{\partial\xi(n)}{\partial H_I(1)} \propto E\Big[\Big(\sum_{i=0}^{1}H_F(i)P_F(n-i) - \sum_{j=0}^{1}H_I(j)P_I(n-j)\Big)P_I(n-1)\Big]$$
which yields
$$H_F(0)R_{FI}(1) + H_F(1)R_{FI}(0) - H_I(0)R_{II}(1) - H_I(1)R_{II}(0) = 0$$
Collecting terms and putting the above equations into matrix format, with $R_{FI}(l) = E[P_F(n)P_I(n-l)] = R_{IF}(-l)$, we get:
$$\begin{bmatrix} R_{FF}(0) & R_{FF}(1) & -R_{FI}(0) & -R_{FI}(1)\\ R_{FF}(1) & R_{FF}(0) & -R_{IF}(1) & -R_{FI}(0)\\ R_{FI}(0) & R_{IF}(1) & -R_{II}(0) & -R_{II}(1)\\ R_{FI}(1) & R_{FI}(0) & -R_{II}(1) & -R_{II}(0) \end{bmatrix}\begin{bmatrix}H_F(0)\\ H_F(1)\\ H_I(0)\\ H_I(1)\end{bmatrix} = 0$$
These are three independent equations in four unknowns. One should be able to get another independent equation to find a unique solution; otherwise, we use one of the unknowns as a numeraire. Assume that the numeraire is $H_F(0)$. In this case the above system reduces to:
$$\begin{bmatrix} R_{FF}(0) & -R_{IF}(1) & -R_{FI}(0)\\ R_{IF}(1) & -R_{II}(0) & -R_{II}(1)\\ R_{FI}(0) & -R_{II}(1) & -R_{II}(0) \end{bmatrix}\begin{bmatrix}H_F(1)\\ H_I(0)\\ H_I(1)\end{bmatrix} = -H_F(0)\begin{bmatrix}R_{FF}(1)\\ R_{FI}(0)\\ R_{FI}(1)\end{bmatrix}$$
This yields the estimates of the unknown parameters as a function of the value of $H_F(0)$.
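The reduced system can be exercised numerically (a sketch under illustrative assumptions: a white source P_A and arbitrarily chosen first-order mixing filters F_F and F_I). Note that with first-order FIR filters a zero-error solution is H_F ∝ F_I and H_I ∝ F_F, since F_I*F_F*P_A = F_F*F_I*P_A; the check below recovers exactly that.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000
PA = rng.normal(size=N)                      # common source (white, an assumption)
FF = np.array([1.0, 0.5])                    # assumed mixing filter F_F
FI = np.array([1.0, -0.3])                   # assumed mixing filter F_I
PF = np.convolve(PA, FF)[:N]
PI = np.convolve(PA, FI)[:N]

def R(u, v, l):
    """Sample estimate of E[u(n) v(n-l)]; negative l swaps the roles."""
    if l < 0:
        return R(v, u, -l)
    return np.mean(u[l:] * v[:len(v) - l]) if l else np.mean(u * v)

# Reduced normal equations with the numeraire HF(0) = 1 (cf. the matrix above)
A = np.array([
    [R(PF, PF, 0), -R(PI, PF, 1), -R(PF, PI, 0)],
    [R(PI, PF, 1), -R(PI, PI, 0), -R(PI, PI, 1)],
    [R(PF, PI, 0), -R(PI, PI, 1), -R(PI, PI, 0)],
])
rhs = -np.array([R(PF, PF, 1), R(PF, PI, 0), R(PF, PI, 1)])
HF1, HI0, HI1 = np.linalg.solve(A, rhs)
```

With $H_F(0)=1$ and $F_I(0)=1$, the solution should come out as $H_F = F_I$ and $H_I = F_F$ up to sampling error.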
Blind Deconvolution for more than One Source:
In many applications there is more than one source generating signals, e.g. the mother's EKG and the fetus's EKG. There are usually more receivers than sources. We shall focus on the case of two sources, $u_1(n)$ and $u_2(n)$, and three receivers, $y_1(n)$, $y_2(n)$, and $y_3(n)$, and explain how to use the Wiener filter to find the desired sources; later on we shall generalize the analysis. Throughout, it is assumed that the sources are independent, stationary signals and that the transmission media/modulation are linear time-invariant filters. Specifically,
$$\begin{bmatrix}y_1(z)\\ y_2(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{11}(z) & F_{12}(z)\\ F_{21}(z) & F_{22}(z)\\ F_{31}(z) & F_{32}(z)\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
which can be split into three equations as:
$$\begin{bmatrix}y_1(z)\\ y_2(z)\end{bmatrix} = \begin{bmatrix}F_{11}(z) & F_{12}(z)\\ F_{21}(z) & F_{22}(z)\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix},\qquad \begin{bmatrix}y_1(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{11}(z) & F_{12}(z)\\ F_{31}(z) & F_{32}(z)\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
and
$$\begin{bmatrix}y_2(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{21}(z) & F_{22}(z)\\ F_{31}(z) & F_{32}(z)\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
Inverting the square matrices (assuming invertibility) we get:
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{11}(z)F_{22}(z)-F_{12}(z)F_{21}(z)}\begin{bmatrix}F_{22}(z) & -F_{12}(z)\\ -F_{21}(z) & F_{11}(z)\end{bmatrix}\begin{bmatrix}y_1(z)\\ y_2(z)\end{bmatrix}$$
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{11}(z)F_{32}(z)-F_{12}(z)F_{31}(z)}\begin{bmatrix}F_{32}(z) & -F_{12}(z)\\ -F_{31}(z) & F_{11}(z)\end{bmatrix}\begin{bmatrix}y_1(z)\\ y_3(z)\end{bmatrix}$$
and
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{21}(z)F_{32}(z)-F_{22}(z)F_{31}(z)}\begin{bmatrix}F_{32}(z) & -F_{22}(z)\\ -F_{31}(z) & F_{21}(z)\end{bmatrix}\begin{bmatrix}y_2(z)\\ y_3(z)\end{bmatrix}$$
For the source signal $u_2(z)$, we have the equation:
$$\frac{F_{11}(z)y_2(z)-F_{21}(z)y_1(z)}{F_{11}(z)F_{22}(z)-F_{12}(z)F_{21}(z)} = \frac{F_{11}(z)y_3(z)-F_{31}(z)y_1(z)}{F_{11}(z)F_{32}(z)-F_{12}(z)F_{31}(z)} = \frac{F_{21}(z)y_3(z)-F_{31}(z)y_2(z)}{F_{21}(z)F_{32}(z)-F_{22}(z)F_{31}(z)} = u_2(z)$$
The above is actually a set of three equations in the unknown filter coefficients. To see this, take the first two members of the equation, i.e.:
$$\frac{F_{11}(z)y_2(z)-F_{21}(z)y_1(z)}{F_{11}(z)F_{22}(z)-F_{12}(z)F_{21}(z)} = \frac{F_{11}(z)y_3(z)-F_{31}(z)y_1(z)}{F_{11}(z)F_{32}(z)-F_{12}(z)F_{31}(z)} = u_2(z)$$
Rearranging, we get:
$$\big(F_{11}(z)F_{32}(z)-F_{12}(z)F_{31}(z)\big)\big(F_{11}(z)y_2(z)-F_{21}(z)y_1(z)\big) = \big(F_{11}(z)F_{22}(z)-F_{12}(z)F_{21}(z)\big)\big(F_{11}(z)y_3(z)-F_{31}(z)y_1(z)\big)$$
The above equation represents a linear relation between the observations, delayed observations, and the unknown coefficients of the mixing FIR filters. Solving this equation, using for example regression analysis, estimates for the filters can be found. Once the filter coefficients are found, we use them to find an estimate for the source signal $u_2(z)$. The same approach can be applied for $u_1(z)$.
Instead, we use the Wiener filter approach, where we assume that the signals are stationary. Each observed signal has two Wiener filters; the parameters of each are estimated by minimizing a squared-error criterion. Specifically, for $y_i(n)$ we have the two Wiener filters $H_{i1}(n)$ and $H_{i2}(n)$. They are related to the data through the equations:
$$u_2(z) = \frac{F_{11}(z)y_2(z)-F_{21}(z)y_1(z)}{F_{11}(z)F_{22}(z)-F_{12}(z)F_{21}(z)} = \frac{F_{11}(z)y_3(z)-F_{31}(z)y_1(z)}{F_{11}(z)F_{32}(z)-F_{12}(z)F_{31}(z)} = \frac{F_{21}(z)y_3(z)-F_{31}(z)y_2(z)}{F_{21}(z)F_{32}(z)-F_{22}(z)F_{31}(z)}$$
In terms of the Wiener filters, each member of the equality above is an estimate of $u_2(z)$, and we get the three error equations:
$$e_1(z) = \big[H_{21}(z)y_2(z) - H_{11}(z)y_1(z)\big] - \big[H_{31}(z)y_3(z) - H_{12}(z)y_1(z)\big]$$
$$e_2(z) = \big[H_{21}(z)y_2(z) - H_{11}(z)y_1(z)\big] - \big[H_{32}(z)y_3(z) - H_{22}(z)y_2(z)\big]$$
and
$$e_3(z) = \big[H_{31}(z)y_3(z) - H_{12}(z)y_1(z)\big] - \big[H_{32}(z)y_3(z) - H_{22}(z)y_2(z)\big]$$
For the first equation we have:
$$e_1(n) = H_{21}(n)*y_2(n) - H_{11}(n)*y_1(n) - H_{31}(n)*y_3(n) + H_{12}(n)*y_1(n)$$
and
$$\xi_1(n) = E[e_1^2(n)] = E\big[\big(H_{21}(n)*y_2(n) - H_{11}(n)*y_1(n) - H_{31}(n)*y_3(n) + H_{12}(n)*y_1(n)\big)^2\big]$$
Minimizing $\xi_1(n)$ with respect to the unknown coefficients of the filters, we get an estimate of these coefficients as a function of the correlations between the observed signals $y_1(n)$, $y_2(n)$, and $y_3(n)$.
In a similar manner we obtain the equations for $e_2(n)$ and $e_3(n)$.
In vector notation we have:
$$\begin{bmatrix}e_1(n)\\ e_2(n)\\ e_3(n)\end{bmatrix} = \begin{bmatrix}H_{12}(n)-H_{11}(n) & H_{21}(n) & -H_{31}(n)\\ -H_{11}(n) & H_{21}(n)+H_{22}(n) & -H_{32}(n)\\ -H_{12}(n) & H_{22}(n) & H_{31}(n)-H_{32}(n)\end{bmatrix} * \begin{bmatrix}y_1(n)\\ y_2(n)\\ y_3(n)\end{bmatrix}$$
or $e(n) = H(n)*y(n)$.
Example: Assume all the Wiener filters are first order, i.e. $H_{ij}(z) = H_{ij}(0) + H_{ij}(1)z^{-1}$, and
$$\xi_1(n) = E[e_1^2(n)] = E\Big[\Big(\big(H_{12}(0)-H_{11}(0)\big)y_1(n) + \big(H_{12}(1)-H_{11}(1)\big)y_1(n-1) + H_{21}(0)y_2(n) + H_{21}(1)y_2(n-1) - H_{31}(0)y_3(n) - H_{31}(1)y_3(n-1)\Big)^2\Big]$$
Minimizing $\xi_1(n)$ w.r.t. the unknown coefficients we get:
$$0 = \frac{\partial\xi_1(n)}{\partial H_{11}(0)} \propto E[e_1(n)y_1(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{11}(1)} \propto E[e_1(n)y_1(n-1)]$$
$$0 = \frac{\partial\xi_1(n)}{\partial H_{12}(0)} \propto E[e_1(n)y_1(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{12}(1)} \propto E[e_1(n)y_1(n-1)]$$
Notice that the $H_{11}$ and $H_{12}$ equations are identical; thus only one pair is useful.
$$0 = \frac{\partial\xi_1(n)}{\partial H_{21}(0)} \propto E[e_1(n)y_2(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{21}(1)} \propto E[e_1(n)y_2(n-1)]$$
$$0 = \frac{\partial\xi_1(n)}{\partial H_{31}(0)} \propto E[e_1(n)y_3(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{31}(1)} \propto E[e_1(n)y_3(n-1)]$$
The above are a total of 6 equations.
In a similar way, minimizing $\xi_2(n)$ w.r.t. the coefficients of $H_{22}$ and $H_{32}$ in
$$e_2(z) = \big[H_{21}(z)y_2(z) - H_{11}(z)y_1(z)\big] - \big[H_{32}(z)y_3(z) - H_{22}(z)y_2(z)\big]$$
we get:
$$0 \propto E[e_2(n)y_2(n)],\qquad 0 \propto E[e_2(n)y_2(n-1)],\qquad 0 \propto E[e_2(n)y_3(n)],\qquad 0 \propto E[e_2(n)y_3(n-1)]$$
The above are a total of 4 equations.
Minimizing $\xi_3(n)$ w.r.t. the coefficients of $H_{31}$ in
$$e_3(z) = \big[H_{31}(z)y_3(z) - H_{12}(z)y_1(z)\big] - \big[H_{32}(z)y_3(z) - H_{22}(z)y_2(z)\big]$$
we get:
$$0 \propto E[e_3(n)y_3(n)],\qquad 0 \propto E[e_3(n)y_3(n-1)]$$
The above are a total of 2 equations. Thus, we have a total of 12 equations in 12 unknowns. Unfortunately, these are not independent equations; actually we only have 11 independent equations. Thus, we need to assume a value for one of the unknowns and calculate the other values of the Wiener filters in terms of this quantity, the same as we did with the single source and two measurements.
In vector notation, the normal equations take the homogeneous block form
$$\mathbf{R}\,\mathbf{h} = 0,\qquad \mathbf{h} = \big[H_{11}(0), H_{11}(1), H_{12}(0), H_{12}(1), H_{21}(0), H_{21}(1), H_{22}(0), H_{22}(1), H_{31}(0), H_{31}(1), H_{32}(0), H_{32}(1)\big]^T$$
where the $12\times12$ matrix $\mathbf{R}$ is built from the auto- and cross-correlations $R_{y_iy_j}(l)$, $l=0,\pm1$, of the three observations, with zero entries wherever a filter does not appear in the corresponding error equation.
Once we have estimates for the Wiener filters, we go back and substitute to get an estimate for the unknown input source signal $u_2(n)$. The same process can be repeated for the other signal $u_1(n)$.
Example: A simpler example is to assume that all the filters involved are constant values:
$$\begin{bmatrix}y_1(z)\\ y_2(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{11} & F_{12}\\ F_{21} & F_{22}\\ F_{31} & F_{32}\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
which can be split into three equations as:
$$\begin{bmatrix}y_1(z)\\ y_2(z)\end{bmatrix} = \begin{bmatrix}F_{11} & F_{12}\\ F_{21} & F_{22}\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix},\qquad \begin{bmatrix}y_1(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{11} & F_{12}\\ F_{31} & F_{32}\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
and
$$\begin{bmatrix}y_2(z)\\ y_3(z)\end{bmatrix} = \begin{bmatrix}F_{21} & F_{22}\\ F_{31} & F_{32}\end{bmatrix}\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix}$$
Inverting the square matrices (assuming invertibility) we get:
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{11}F_{22}-F_{12}F_{21}}\begin{bmatrix}F_{22} & -F_{12}\\ -F_{21} & F_{11}\end{bmatrix}\begin{bmatrix}y_1(z)\\ y_2(z)\end{bmatrix}$$
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{11}F_{32}-F_{12}F_{31}}\begin{bmatrix}F_{32} & -F_{12}\\ -F_{31} & F_{11}\end{bmatrix}\begin{bmatrix}y_1(z)\\ y_3(z)\end{bmatrix}$$
and
$$\begin{bmatrix}u_1(z)\\ u_2(z)\end{bmatrix} = \frac{1}{F_{21}F_{32}-F_{22}F_{31}}\begin{bmatrix}F_{32} & -F_{22}\\ -F_{31} & F_{21}\end{bmatrix}\begin{bmatrix}y_2(z)\\ y_3(z)\end{bmatrix}$$
For the source signal $u_1(z)$, we have the equation:
$$u_1(z) = \frac{F_{22}y_1(z)-F_{12}y_2(z)}{F_{11}F_{22}-F_{12}F_{21}} = \frac{F_{32}y_1(z)-F_{12}y_3(z)}{F_{11}F_{32}-F_{12}F_{31}} = \frac{F_{32}y_2(z)-F_{22}y_3(z)}{F_{21}F_{32}-F_{22}F_{31}}$$
For the source signal $u_2(z)$, we have the equation:
$$u_2(z) = \frac{F_{11}y_2(z)-F_{21}y_1(z)}{F_{11}F_{22}-F_{12}F_{21}} = \frac{F_{11}y_3(z)-F_{31}y_1(z)}{F_{11}F_{32}-F_{12}F_{31}} = \frac{F_{21}y_3(z)-F_{31}y_2(z)}{F_{21}F_{32}-F_{22}F_{31}}$$
The above is actually a set of three equations in the unknown filter coefficients. To see this, take the first two members, i.e.:
$$\frac{F_{11}y_2(z)-F_{21}y_1(z)}{F_{11}F_{22}-F_{12}F_{21}} = \frac{F_{11}y_3(z)-F_{31}y_1(z)}{F_{11}F_{32}-F_{12}F_{31}} = u_2(z)$$
Rearranging, we get:
$$\big(F_{11}F_{32}-F_{12}F_{31}\big)\big(F_{11}y_2(z)-F_{21}y_1(z)\big) = \big(F_{11}F_{22}-F_{12}F_{21}\big)\big(F_{11}y_3(z)-F_{31}y_1(z)\big)$$
The above equation represents a linear relation between the observations and the unknown coefficients of the mixing filters. Solving it, using for example regression analysis, estimates for the filters can be found; once the filter coefficients are found, we use them to find an estimate for the source signal $u_2(z)$. The same approach can be applied for $u_1(z)$.
Instead, we use the Wiener filter approach, where we assume that the signals are stationary. Each observed signal has two Wiener filters, whose parameters are estimated by minimizing a squared-error criterion. Specifically, for $y_i(n)$ we have the two constant Wiener filters $H_{i1}$ and $H_{i2}$. They are related to the data through the equations above. In terms of the Wiener filters we get the three equations:
$$e_1(z) = \big[H_{21}y_2(z) - H_{11}y_1(z)\big] - \big[H_{31}y_3(z) - H_{12}y_1(z)\big]$$
$$e_2(z) = \big[H_{21}y_2(z) - H_{11}y_1(z)\big] - \big[H_{32}y_3(z) - H_{22}y_2(z)\big]$$
and
$$e_3(z) = \big[H_{31}y_3(z) - H_{12}y_1(z)\big] - \big[H_{32}y_3(z) - H_{22}y_2(z)\big]$$
where
$$H_{11} = \frac{F_{21}}{F_{11}F_{22}-F_{12}F_{21}},\qquad H_{12} = \frac{F_{31}}{F_{11}F_{32}-F_{12}F_{31}}$$
$$H_{21} = \frac{F_{11}}{F_{11}F_{22}-F_{12}F_{21}},\qquad H_{22} = \frac{F_{31}}{F_{21}F_{32}-F_{22}F_{31}}$$
and
$$H_{31} = \frac{F_{11}}{F_{11}F_{32}-F_{12}F_{31}},\qquad H_{32} = \frac{F_{21}}{F_{21}F_{32}-F_{22}F_{31}}$$
If one is able to get estimates for the Wiener filter coefficients $H_{ij}$, we can find the values of the mixing coefficients $F_{ij}$. This yields an estimate for both inputs $u_1(n)$ and $u_2(n)$.
We now move ahead and find the Wiener filter coefficients. For the first error equation $e_1(n)$ we have:
$$e_1(n) = H_{21}y_2(n) - H_{11}y_1(n) - H_{31}y_3(n) + H_{12}y_1(n)$$
and
$$\xi_1(n) = E[e_1^2(n)]$$
Minimizing $\xi_1(n)$ with respect to the unknown coefficients of the filters, we get an estimate of these coefficients as a function of the correlations between the observed signals $y_1(n)$, $y_2(n)$, and $y_3(n)$.
In a similar manner we obtain the equations for $e_2(n)$ and $e_3(n)$.
In vector notations we have:
$$\begin{bmatrix}e_1(n)\\ e_2(n)\\ e_3(n)\end{bmatrix} = \begin{bmatrix}H_{12}-H_{11} & H_{21} & -H_{31}\\ -H_{11} & H_{21}+H_{22} & -H_{32}\\ -H_{12} & H_{22} & H_{31}-H_{32}\end{bmatrix}\begin{bmatrix}y_1(n)\\ y_2(n)\\ y_3(n)\end{bmatrix}$$
or $e(n) = Hy(n)$.
Minimizing $\xi_1(n) = E[e_1^2(n)]$ w.r.t. the unknown coefficients we get:
$$0 = \frac{\partial\xi_1(n)}{\partial H_{11}} \propto E[e_1(n)y_1(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{12}} \propto E[e_1(n)y_1(n)]$$
Notice that the above two equations are identical; thus only one of them is useful.
$$0 = \frac{\partial\xi_1(n)}{\partial H_{21}} \propto E[e_1(n)y_2(n)],\qquad 0 = \frac{\partial\xi_1(n)}{\partial H_{31}} \propto E[e_1(n)y_3(n)]$$
The above are a total of 3 equations.
In a similar way, minimizing $\xi_2(n)$ w.r.t. the unknown coefficients we get:
$$0 = \frac{\partial\xi_2(n)}{\partial H_{22}} \propto E[e_2(n)y_2(n)],\qquad 0 = \frac{\partial\xi_2(n)}{\partial H_{32}} \propto E[e_2(n)y_3(n)]$$
The above are a total of 2 equations.
Minimizing $\xi_3(n)$ w.r.t. the remaining unknown coefficient we get:
$$0 = \frac{\partial\xi_3(n)}{\partial H_{31}} \propto E[e_3(n)y_3(n)]$$
The above is a total of 1 equation. Thus, we have a total of 6 equations in 6 unknowns. Unfortunately, these are not independent equations; actually we only have 5 independent equations. Thus, we need to assume a value for one of the unknowns and calculate the other values of the Wiener filters in terms of this quantity, the same as we did with the single source and two measurements.
In vector notations we have:
With $R_{ij} \equiv R_{y_iy_j}(0)$, the normal equations are:
$$\begin{bmatrix} -R_{11} & R_{11} & R_{12} & 0 & -R_{13} & 0\\ -R_{12} & R_{12} & R_{22} & 0 & -R_{23} & 0\\ -R_{13} & R_{13} & R_{23} & 0 & -R_{33} & 0\\ -R_{12} & 0 & R_{22} & R_{22} & 0 & -R_{23}\\ -R_{13} & 0 & R_{23} & R_{23} & 0 & -R_{33}\\ 0 & -R_{13} & 0 & R_{23} & R_{33} & -R_{33} \end{bmatrix}\begin{bmatrix}H_{11}\\ H_{12}\\ H_{21}\\ H_{22}\\ H_{31}\\ H_{32}\end{bmatrix} = 0$$
Once we have estimates for the Wiener filters, we go back and substitute to get an estimate for the unknown input source signal $u_2(n)$; the same process can be repeated for the other signal $u_1(n)$. Instead, we use the estimated values of $H_{ij}$ to get estimates for $F_{ij}$ and consequently estimates for the inputs $u_1(n)$ and $u_2(n)$. Notice that the above matrix equation does not have a unique solution; we can always take one of the unknowns as a numeraire. For example, if we take $H_{11}$ as the numeraire, the above equation becomes:
$$\begin{bmatrix} R_{12} & R_{22} & 0 & -R_{23} & 0\\ R_{13} & R_{23} & 0 & -R_{33} & 0\\ 0 & R_{22} & R_{22} & 0 & -R_{23}\\ 0 & R_{23} & R_{23} & 0 & -R_{33}\\ -R_{13} & 0 & R_{23} & R_{33} & -R_{33} \end{bmatrix}\begin{bmatrix}H_{12}\\ H_{21}\\ H_{22}\\ H_{31}\\ H_{32}\end{bmatrix} = H_{11}\begin{bmatrix}R_{12}\\ R_{13}\\ R_{12}\\ R_{13}\\ 0\end{bmatrix}$$
Taking the inverse of the matrix we get:
$$\begin{bmatrix}H_{12}\\ H_{21}\\ H_{22}\\ H_{31}\\ H_{32}\end{bmatrix} = H_{11}\begin{bmatrix} R_{12} & R_{22} & 0 & -R_{23} & 0\\ R_{13} & R_{23} & 0 & -R_{33} & 0\\ 0 & R_{22} & R_{22} & 0 & -R_{23}\\ 0 & R_{23} & R_{23} & 0 & -R_{33}\\ -R_{13} & 0 & R_{23} & R_{33} & -R_{33} \end{bmatrix}^{-1}\begin{bmatrix}R_{12}\\ R_{13}\\ R_{12}\\ R_{13}\\ 0\end{bmatrix}$$
These are the estimated coefficients $H_{ij}$ as a function of the numeraire $H_{11}$.
If one of the source signals, $u_1(n)$ or $u_2(n)$, is much stronger than the other, the estimate of the
stronger signal is usually very good. In this case, we use noise-cancelling techniques to get
the other signal. Specifically, assume that we have found a good (high-SNR) estimate of $u_1(n)$; we
could use $y_1(n) = F_{11}u_1(n) + F_{12}u_2(n)$ as the source signal and use the Wiener filter, as a correlation
canceller, to find the component in $y_1(n)$ that is correlated with $u_1(n)$. Effectively we are getting a
new and more accurate estimate of $F_{11}$. Similarly, we could use $y_2(n) = F_{21}u_1(n) + F_{22}u_2(n)$ and
$y_3(n) = F_{31}u_1(n) + F_{32}u_2(n)$ to get better estimates of $F_{21}$ and $F_{31}$. We then repeat the above
process again to find estimates for $H_{ij}$, but now only $F_{12}$, $F_{22}$, and $F_{32}$ are unknown.
Assume that we receive/observe a signal $u_1(n)$ that is correlated with another signal $y_1(n)$. We
use $u_1(n)$ to find an estimate, $\hat{y}_1(n)$, of $y_1(n)$ according to the following equation:

$$\hat{y}_1(n) = h_1(0)\,u_1(n)$$

Define $e(n) = y_1(n) - \hat{y}_1(n)$ and $\xi(n) = E[e^2(n)] = E[(y_1(n) - \hat{y}_1(n))^2]$,
where $h_1(0)$ is the unknown filter parameter. In order to find the filter parameter, we
minimize the expected value of the squared error w.r.t. the unknown:

$$0 = \frac{\partial \xi(n)}{\partial h_1(0)} = -2E\left[\left(y_1(n) - \hat{y}_1(n)\right)u_1(n)\right] = -2E\left[\left(y_1(n) - h_1(0)u_1(n)\right)u_1(n)\right]$$

This yields:

$$E[y_1(n)u_1(n)] = E[h_1(0)u_1^2(n)] = h_1(0)E[u_1^2(n)]$$

i.e. $R_{y_1u_1}(0) = h_1(0)R_{u_1u_1}(0)$ and

$$h_1(0) = \frac{R_{y_1u_1}(0)}{R_{u_1u_1}(0)}$$

$h_1(0)$ is the revised, more accurate estimate of $F_{11}$. The same process could be repeated for the
other unknowns, and we get:

$$h_2(0) = F_{21} = \frac{R_{y_2u_1}(0)}{R_{u_1u_1}(0)}, \qquad h_3(0) = F_{31} = \frac{R_{y_3u_1}(0)}{R_{u_1u_1}(0)}$$
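A minimal numerical sketch of this single-tap estimate, with hypothetical data in which the true coupling gain (playing the role of $F_{11}$) is 0.8:

```python
import numpy as np

# Single-tap Wiener estimate: h1(0) = R_{y1 u1}(0) / R_{u1 u1}(0),
# with sample averages standing in for the expectations.
rng = np.random.default_rng(1)
u1 = rng.standard_normal(10000)                    # reference signal u1(n)
y1 = 0.8 * u1 + 0.2 * rng.standard_normal(10000)   # observation correlated with u1

Ryu = np.mean(y1 * u1)   # sample estimate of R_{y1 u1}(0)
Ruu = np.mean(u1 * u1)   # sample estimate of R_{u1 u1}(0)
h1 = Ryu / Ruu           # recovers the coupling gain, here close to 0.8
```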
We then use the equations for the estimated $H_{ij}$ to get the revised values for the rest of the $F_{ij}$
according to:

$$H_{11} = \frac{F_{21}}{F_{11}F_{22} - F_{12}F_{21}}, \qquad H_{12} = \frac{F_{31}}{F_{11}F_{32} - F_{12}F_{31}}$$

$$H_{21} = \frac{F_{11}}{F_{11}F_{22} - F_{12}F_{21}}, \qquad H_{22} = \frac{F_{31}}{F_{21}F_{32} - F_{22}F_{31}}$$

and

$$H_{31} = \frac{F_{11}}{F_{11}F_{32} - F_{12}F_{31}}, \qquad H_{32} = \frac{F_{21}}{F_{21}F_{32} - F_{22}F_{31}}$$

Finally, we use the improved estimates of $F_{ij}$ to get improved estimates of $u_2(n)$.
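These relations can be checked numerically. The sketch below assumes a hypothetical 3×2 mixing matrix F; the determinant denominators arise from inverting the 2×2 system formed by each pair of noise-free measurements, and each pair-wise combination then recovers $u_2$ exactly:

```python
import numpy as np

# Hypothetical 3x2 mixing matrix F (assumed values for illustration).
F = np.array([[1.0, 0.5],
              [0.3, 1.2],
              [0.8, -0.4]])
d12 = F[0, 0]*F[1, 1] - F[0, 1]*F[1, 0]   # F11 F22 - F12 F21
d13 = F[0, 0]*F[2, 1] - F[0, 1]*F[2, 0]   # F11 F32 - F12 F31
d23 = F[1, 0]*F[2, 1] - F[1, 1]*F[2, 0]   # F21 F32 - F22 F31

H11, H21 = F[1, 0]/d12, F[0, 0]/d12       # from the (y1, y2) pair
H12, H31 = F[2, 0]/d13, F[0, 0]/d13       # from the (y1, y3) pair
H22, H32 = F[2, 0]/d23, F[1, 0]/d23       # from the (y2, y3) pair

u = np.array([0.7, -1.3])                 # one source sample (u1, u2)
y = F @ u                                 # noise-free measurements
u2_from_12 = H21*y[1] - H11*y[0]          # ≈ -1.3 (= u2)
u2_from_13 = H31*y[2] - H12*y[0]          # ≈ -1.3 (= u2)
u2_from_23 = H32*y[2] - H22*y[1]          # ≈ -1.3 (= u2)
```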
The Kalman Filter
In some situations, one has a difference equation describing the evolution of the M×1 state vector
or signal $X(n)$, together with an N×1 vector $y(n)$ of measurements of $X(n)$. In vector
format we have:

$$y(n) = c(n)X(n) + v(n)$$
$$X(n) = T(n)X(n-1) + \varepsilon(n)$$

where $v(n)$ is Gaussian with zero mean and covariance $\Sigma_v$, and $\varepsilon(n)$ is independent of $v(n)$ and is
Gaussian with zero mean and covariance $\Sigma_\varepsilon$.
Once a model is put into state-space form, the Kalman filter can be used to estimate the state
vector by filtering. The Kalman filter will provide estimates of the unobserved variable $X(n)$.
The purpose of filtering is to update our knowledge of the state vector as soon as a new
observation $y(n)$ becomes available. The Kalman filter can therefore be described as an algorithm
for estimating the unobserved components at time n based on the information available at that date.
The estimates of any other desired parameters, including the so-called hyperparameters $\Sigma_v$ and
$\Sigma_\varepsilon$, can be obtained by the Maximum Likelihood Estimation (MLE) algorithm as adapted by
[Shumway and Stoffer; 1982]. Estimating the states through the Kalman filter involves three
steps: the initial state, the predict step, and the update step.
Initial state: $X(0/0)$, $P(0/0)$
Predict states:

$$X(n/n-1) = T(n)X(n-1/n-1)$$
$$P(n/n-1) = T(n)P(n-1/n-1)T(n)^T + \Sigma_\varepsilon$$

where $X(n/n-1)$ is the estimate of $X(n)$ given observations up till time "n-1", i.e. $y(1),\ldots,y(n-1)$, and
$P(n/n-1)$ is the covariance of the estimate.

Update states:

$$X(n/n) = X(n/n-1) + K(n)\left[y(n) - c(n)X(n/n-1)\right]$$
$$K(n) = P(n/n-1)c(n)^T\left[c(n)P(n/n-1)c(n)^T + \Sigma_v\right]^{-1}$$
$$P(n/n) = \left[I - K(n)c(n)\right]P(n/n-1)$$
$X(0/0)$ and $P(0/0)$ are the initial state vector and covariance matrix, respectively. The
covariance matrix $P(0/0)$ reflects the uncertainty in $X(0/0)$. If $X(0/0)$ and the
covariance matrix $P(0/0)$ are not given a priori, $X(0/0)$ is assumed to be zero and we assume large
numbers for the diagonal elements of the matrix $P(0/0)$.
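The three steps above can be sketched as a minimal numpy implementation of the recursions, exercised on a hypothetical constant-state tracking problem (all numerical values are assumptions for illustration):

```python
import numpy as np

# Minimal Kalman filter sketch: T, c are the state-transition and measurement
# matrices; Sv, Se are the measurement- and process-noise covariances.
def kalman_filter(y, T, c, Sv, Se, x0, P0):
    x, P = x0, P0
    estimates = []
    for yn in y:
        # Predict: X(n/n-1) and P(n/n-1)
        x_pred = T @ x
        P_pred = T @ P @ T.T + Se
        # Update: gain K(n), then X(n/n) and P(n/n)
        K = P_pred @ c.T @ np.linalg.inv(c @ P_pred @ c.T + Sv)
        x = x_pred + K @ (yn - c @ x_pred)
        P = (np.eye(len(x)) - K @ c) @ P_pred
        estimates.append(x.copy())
    return np.array(estimates)

# Toy usage: track a constant scalar state (true value 5.0) observed in noise,
# starting from X(0/0) = 0 with large diagonal P(0/0).
rng = np.random.default_rng(0)
y = 5.0 + 0.1 * rng.standard_normal((50, 1))
est = kalman_filter(y, T=np.eye(1), c=np.eye(1),
                    Sv=0.01 * np.eye(1), Se=1e-6 * np.eye(1),
                    x0=np.zeros(1), P0=100.0 * np.eye(1))
```

With a large initial covariance the filter quickly forgets the arbitrary zero initial state and converges toward the true value.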
Principal Component Analysis (PCA)
Assume that the data is given in matrix format, e.g. EEG channels, a two-dimensional image, etc.
We represent this data as the m×n matrix X, where each row represents, for example, a single
EEG channel. Before any calculations, we subtract from each row its mean value. PCA is a
method to express this data as a linear combination of the data basis vectors. Specifically, let X
and Y be m×n matrices related by a linear transformation P, where X is the original recorded data set
and Y is a re-representation of that data set:

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \qquad
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}, \qquad Y = PX$$

Also let us define the m×m matrix

$$P = \begin{bmatrix} p_1 \\ \vdots \\ p_m \end{bmatrix}$$

where the $p_i$ are the 1×m rows of P. The equation $PX = Y$ represents a change of basis, i.e. the rows of P, $p_1,\ldots,p_m$, are a set of new basis
vectors.
Define the covariance matrix of X, after removing the mean from each row, as $\Sigma_{XX}$, and the
covariance matrix of Y as $\Sigma_{YY}$:

$$\Sigma_{XX} = \frac{1}{n-1}XX^T, \qquad
\Sigma_{YY} = \frac{1}{n-1}YY^T = \frac{1}{n-1}(PX)(PX)^T = \frac{1}{n-1}PXX^TP^T = P\,\Sigma_{XX}P^T$$
We know that the symmetric matrix $XX^T$ can be diagonalized by an orthogonal matrix of its
eigenvectors, i.e.

$$XX^T = EDE^T$$

where D is a diagonal matrix and E is the matrix of eigenvectors of $XX^T$ arranged as
columns, with $E^T = E^{-1}$. The matrix $XX^T$ has r orthonormal eigenvectors, each of dimension
m×1, where r is the rank of the matrix. The rank r of $XX^T$ is usually less than m. We select the
matrix P to be the matrix whose rows $p_i$ are the eigenvectors of $XX^T$. By this selection,
$P = E^T$ and $PP^T = E^TE = I$, i.e. $P^{-1} = P^T$. Thus,

$$XX^T = EDE^T = P^TDP$$

and

$$\Sigma_{YY} = \frac{1}{n-1}YY^T = \frac{1}{n-1}(PX)(PX)^T = \frac{1}{n-1}PXX^TP^T
= \frac{1}{n-1}PP^TDPP^T = \frac{1}{n-1}D$$

It is evident that the choice $P = E^T$ results in a diagonal $\Sigma_{YY}$. This was the goal of PCA.
We can summarize the results of PCA in the matrices P and $\Sigma_{YY}$ as follows: (1) the principal
components of X are the eigenvectors of $XX^T$, i.e. the rows of P; (2) the ith diagonal value of
$\Sigma_{YY}$ is the variance of X along $p_i$. One benefit of PCA is that we can examine the variances of
$\Sigma_{YY}$ associated with the principal components. Often one finds that large variances are associated
with the first k < m principal components, followed by a precipitous drop-off. One can then conclude that
the most interesting dynamics occur only in the first k dimensions.
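The procedure above can be summarized in a short sketch (numpy; rows of X are variables, columns are observations, with the same 1/(n−1) normalization as in the covariance definition):

```python
import numpy as np

# PCA sketch following the derivation above: the principal components are the
# eigenvectors of the (row-centered) covariance matrix, and Y = P X is the
# re-representation of the data in that basis.
def pca(X):
    Xc = X - X.mean(axis=1, keepdims=True)   # remove each row's mean
    cov = Xc @ Xc.T / (Xc.shape[1] - 1)      # m x m covariance matrix
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # largest variance first
    P = evecs[:, order].T                    # rows p_i = principal components
    return P, evals[order], P @ Xc           # P, variances along p_i, Y
```

Examining the returned variances shows how many components carry most of the energy, which is how the k < m truncation mentioned above is chosen.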
Example: We are given two 1×10 vectors $x_1$ and $x_2$ as follows:

$x_1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]$, $x_2 = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]$.

The average of $x_1$ is estimated as $E[x_1] = 1.81$ and for $x_2$ we have $E[x_2] = 1.91$. We subtract the means and get
new zero-mean data sets. The covariance matrix $E[(x - E[x])(x - E[x])^T]$, with the 1/(n−1) normalization used above, is calculated as

$$\Sigma = \begin{bmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{bmatrix}$$

The eigenvalues are $\lambda_1 = 1.284$ and $\lambda_2 = 0.049$, with the corresponding eigenvectors
$p_1 = [0.6778,\ 0.735]$ and $p_2 = [0.735,\ -0.6778]$. Notice that the eigenvectors are normalized to have unit length.

It is clear that there is one big eigenvalue and the other is small. Thus, one could remove the
subspace associated with the small eigenvalue and retain only the subspace associated with the big eigenvalue.

Using the equation $Y = PX$ with the zero-mean data, we get the transformed matrix:

$$Y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} 0.6778 & 0.735 \\ 0.735 & -0.6778 \end{bmatrix}
\begin{bmatrix}
0.69 & -1.31 & 0.39 & 0.09 & 1.29 & 0.49 & 0.19 & -0.81 & -0.31 & -0.71 \\
0.49 & -1.21 & 0.99 & 0.29 & 1.09 & 0.79 & -0.31 & -0.81 & -0.31 & -1.01
\end{bmatrix}$$

$$= \begin{bmatrix}
0.83 & -1.78 & 0.99 & 0.27 & 1.68 & 0.91 & -0.10 & -1.14 & -0.44 & -1.22 \\
0.18 & -0.14 & -0.38 & -0.13 & 0.21 & -0.18 & 0.35 & -0.05 & -0.02 & 0.16
\end{bmatrix}$$
Notice that the values of the vector $y_1$ are much larger in magnitude than the values of the vector $y_2$. This is
in agreement with the fact that the eigenvalue corresponding to $p_1$ is much larger than the
eigenvalue corresponding to $p_2$. Thus, we could retain the vector $y_1$ as a representative of the
data and ignore the vector $y_2$.
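The example's numbers can be reproduced directly (numpy sketch; np.cov uses the same 1/(n−1) normalization as the covariance definition above):

```python
import numpy as np

# The two 1x10 data vectors from the example.
x1 = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
x2 = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

m1, m2 = x1.mean(), x2.mean()        # ≈ 1.81 and 1.91
C = np.cov(np.vstack([x1, x2]))      # 2x2 covariance matrix (1/(n-1) norm)
evals, evecs = np.linalg.eigh(C)     # eigenvalues ascending: ≈ [0.049, 1.284]
```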
Linearization: