This document discusses using hidden Markov models (HMMs) for unsupervised learning in hyperspectral image classification. It proposes an HMM-based probability density function classifier that models hyperspectral data using a reduced feature space. The approach uses an unsupervised learning scheme for maximum likelihood parameter estimation, combining both model selection and estimation. This HMM method can accurately model and synthesize approximate observations of true hyperspectral data in a reduced feature space without relying on supervised learning.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. It covers two clustering algorithms, the k-means and k-median algorithms. Clustering is illustrated through the unsupervised learning of patterns and clusters that may exist in a given database, and it is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
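To make the two algorithms concrete, here is a minimal numpy sketch of Lloyd-style iteration; swapping the mean update for a coordinate-wise median gives the k-median variant. The data and k are illustrative, not from the paper.

```python
import numpy as np

def kmeans(X, k, iters=100, use_median=False, seed=0):
    """Lloyd-style iteration; use_median=True gives a k-median variant."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        agg = np.median if use_median else np.mean
        new_centers = np.array([
            agg(X[labels == j], axis=0) if (labels == j).any() else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.random.default_rng(1).random((200, 2))   # toy data
labels, centers = kmeans(X, k=3)
```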
An Efficient Clustering Method for Aggregation on Data Fragments (IJMER)
Clustering is an important step in data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results into a single high-quality clustering. Existing clustering aggregation algorithms are applied directly to the data points and become inefficient when the number of points is large. This project defines an efficient approach to clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase efficiency, clustering aggregation is performed directly on data fragments under two measures, a comparison measure and normalized mutual information, and enhanced versions of three clustering aggregation algorithms (Agglomerative, Furthest, and Local Search) are described, reducing computational complexity while increasing accuracy.
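As a small illustration of one of the agreement measures named above, normalized mutual information between two clusterings can be computed with scikit-learn; the labels here are toy values, not the paper's data.

```python
from sklearn.metrics import normalized_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]   # clustering A over six fragments (toy)
labels_b = [1, 1, 0, 0, 2, 2]   # clustering B: same partition, renamed
print(normalized_mutual_info_score(labels_a, labels_b))  # 1.0: identical partitions
```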
In this video from the 2015 HPC User Forum in Broomfield, Barry Bolding from Cray presents: HPC + D + A = HPDA?
"The flexible, multi-use Cray Urika-XA extreme analytics platform addresses perhaps the most critical obstacle in data analytics today — limitation. Analytics problems are getting more varied and complex but the available solution technologies have significant constraints. Traditional analytics appliances lock you into a single approach and building a custom solution in-house is so difficult and time consuming that the business value derived from analytics fails to materialize. In contrast, the Urika-XA platform is open, high performing and cost effective, serving a wide range of analytics tools with varying computing demands in a single environment. Pre-integrated with the Hadoop and Spark frameworks, the Urika-XA system combines the benefits of a turnkey analytics appliance with a flexible, open platform that you can modify for future analytics workloads. This single-platform consolidation of workloads reduces your analytics footprint and total cost of ownership."
Learn more: http://www.cray.com/products/analytics/urika-xa
Watch the video presentation: http://wp.me/p3RLEV-3yR
Sign up for our insideBIGDATA Newsletter: http://insidebigdata.com/newsletter
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has performance comparable, in terms of topic coherence, to LDA implemented in the MapReduce framework.
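The MapReduce decomposition follows the usual pattern for EM-style algorithms: mappers run the per-document E-step and emit sufficient statistics, and reducers sum them for the global M-step. A minimal, framework-free sketch of that pattern (the statistics are placeholders, not the actual CTM variational updates):

```python
from collections import defaultdict

def map_estep(doc):
    # E-step per document: emit (topic, sufficient statistic) pairs.
    for topic, stat in doc["topic_stats"].items():
        yield topic, stat

def reduce_mstep(pairs):
    # M-step: sum the per-document statistics across the collection.
    totals = defaultdict(float)
    for topic, stat in pairs:
        totals[topic] += stat
    return dict(totals)

docs = [{"topic_stats": {0: 1.5, 1: 0.5}}, {"topic_stats": {1: 2.0}}]
print(reduce_mstep(p for d in docs for p in map_estep(d)))  # {0: 1.5, 1: 2.5}
```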
A Combined Approach for Feature Subset Selection and Size Reduction for High ... (IJERA Editor)
Selection of relevant features from a given feature set is one of the important issues in data mining as well as classification. In general, a dataset may contain many features, but not all of them are important for a particular analysis or decision, because features may share common information or be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation or because of incomplete information about the observed system. In both cases, the data will contain features that only increase the processing burden and may ultimately degrade the outcome of the analysis. For these reasons, methods are required to detect and remove such features; hence, in this paper we present an efficient approach that not only removes unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves classifier performance.
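For the information-theoretic step, the per-feature information gain is H(Y) − H(Y|X). A minimal sketch for discrete features (the paper's exact discretization is not specified here):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    # IG(Y; X) = H(Y) - H(Y|X) for a discrete feature x and labels y
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 0])
print(information_gain(x, y))
```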
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER (ijnlc)
This document presents an adaptive log file parser that uses semantics and hidden Markov models. It first clusters log file lines based on semantics to limit unstructured text. It then builds a hidden Markov model to represent parsing patterns, with log entries as states and extracted values as emissions. When applied to a new system, it adapts the model's transition and emission probabilities to fit the new data. The approach achieves over 99.99% accuracy when trained on one system and applied to another with slightly different log patterns.
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements are divided into two types: discriminative words and nondiscriminative words. Only discriminative words are useful for grouping documents; the involvement of nondiscriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for the document similarity analysis. The system improves scalability by using labels and concept relations in the dimensionality reduction process.
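The key property exploited by the DPM model above, that the number of clusters need not be fixed in advance, comes from the Dirichlet Process prior. A truncated stick-breaking sketch of DP mixture weights (illustrative only, not the paper's variational inference):

```python
import numpy as np

def stick_breaking(alpha, truncation, seed=0):
    # Draw truncated DP mixture weights: break a unit stick repeatedly;
    # concentration alpha controls how many clusters get noticeable mass.
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

weights = stick_breaking(alpha=2.0, truncation=20)
print(weights.round(3), weights.sum())   # weights sum to just under 1
```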
Extended PSO algorithm for improvement problems k-means clustering algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. The purpose of clustering is to group similar data together, so that instances are most similar to the others within their cluster and most different from instances in other clusters. In this paper we focus on partitional k-means clustering, which, due to its ease of implementation and high-speed performance on large data sets, is still very popular among clustering algorithms after 30 years. To address the problem of k-means getting trapped in local optima, we propose an extended PSO algorithm named ECPSO. Our new algorithm is able to escape local optima and produces the problem's optimal answer with high probability. The experimental results show that the proposed algorithm outperforms other clustering algorithms, especially on two indices: clustering precision and clustering quality.
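For reference, a minimal canonical PSO loop is sketched below (inertia plus cognitive and social terms); ECPSO's specific extensions are not reproduced here. Plugging in the k-means sum-of-squared-errors as the objective gives PSO-assisted clustering.

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))   # positions
    v = np.zeros_like(x)                             # velocities
    pbest, pval = x.copy(), np.array([objective(p) for p in x])
    gbest = pbest[pval.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + cognitive (personal best) + social (global best) terms
        v = 0.72 * v + 1.49 * r1 * (pbest - x) + 1.49 * r2 * (gbest - x)
        x = x + v
        val = np.array([objective(p) for p in x])
        improved = val < pval
        pbest[improved], pval[improved] = x[improved], val[improved]
        gbest = pbest[pval.argmin()]
    return gbest, pval.min()

best, score = pso(lambda p: ((p - 0.3) ** 2).sum(), dim=4)
```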
Efficient Image Retrieval by Multi-view Alignment Technique with Non Negative... (RSIS International)
One of the biggest challenges today is searching for images in a large database. Image search commonly relies on a technique termed hashing, and many hashing techniques already exist for retrieving images from a large databank. Hashing can be performed on images by considering a high-dimensional descriptor of the image, but existing hashing techniques use a single-dimensional descriptor, so the probability distribution of the image search does not perform as expected. A further drawback is that giving the query input in text format limits image search in two ways: first, keywords are limited, and second, human annotation is ambiguous and incomplete.
To overcome these drawbacks, a new technique named Multiview Alignment Hashing has been proposed, which preserves the high-dimensional feature descriptor data as well as the probability distribution of the images in the database. Along with the multiview feature descriptor, another technique is used: Nonnegative Matrix Factorization (NMF). NMF is a popular technique in data mining in which data is clustered by considering only nonnegative matrix values.
This document summarizes research on improving image classification results using neural networks. It compares common image classification methods like support vector machines (SVM) and K-nearest neighbors (KNN). It then evaluates the performance of multilayer perceptron (MLP) neural networks and radial basis function (RBF) neural networks on image classification. The document tests various configurations of MLP and RBF networks on a dataset containing 2310 images across 7 classes. It finds that a MLP network with two hidden layers of 10 neurons each achieves the best results, with an average accuracy of 98.84%. This is significantly higher than the 84.47% average accuracy of RBF networks and outperforms KNN classification as well. The research concludes that neural
ROBUST TEXT DETECTION AND EXTRACTION IN NATURAL SCENE IMAGES USING CONDITIONA... (ijiert bestjournal)
In natural scene images, text detection is an important task used in many content-based image analyses. A maximally stable extremal region (MSER) based method is used for scene text detection. This MSER-based method includes the stages of character candidate extraction, text candidate construction, text candidate elimination, and text candidate classification. The main limitation of this method is detecting highly blurred text in low-resolution natural scene images, and the current technology does not focus on any text extraction method. In the proposed system, a Conditional Random Field (CRF) model is used to assign each candidate component to one of two classes (text and non-text) by considering both unary component properties and binary contextual component relationships. For this purpose we use connected component analysis. The proposed system also performs text extraction using OCR.
Multiview Alignment Hashing for Efficient Image Search (1crore projects)
The document describes a new method called Multiview Alignment Hashing (MAH) for efficient image search. MAH aims to fuse multiple image feature representations while preserving the high-dimensional joint probability distribution of the data and obtaining orthogonal bases. It does this by formulating an objective function that is optimized using an alternate optimization procedure. This finds low-dimensional matrix factorizations via a technique called Regularized Kernel Nonnegative Matrix Factorization. After optimization, binary hash codes are obtained for images that can be used for efficient similarity search. The method is evaluated on several image datasets and is shown to outperform other state-of-the-art multiview hashing techniques.
Latent Semantic Word Sense Disambiguation Using Global Co-Occurrence Information (csandit)
The document describes a novel word sense disambiguation method using global co-occurrence information and non-negative matrix factorization. It proposes using global co-occurrence frequencies between words in a large corpus rather than local context windows to address data sparseness issues. An experiment compares the method to two baselines and shows it achieves the highest and most stable precision for word sense disambiguation.
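As a sketch of the factorization step only (not the paper's full disambiguation pipeline), scikit-learn's NMF can decompose a global word-by-context co-occurrence matrix into latent dimensions; the matrix here is random stand-in data.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
C = rng.random((30, 80))            # stand-in word-by-context co-occurrence counts
model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(C)          # word loadings on latent dimensions
H = model.components_               # latent-dimension-by-context loadings
```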
Clustering of high-dimensional data, which arises in almost all fields these days, is becoming a very tedious process. The key disadvantage of high-dimensional data is the curse of dimensionality: as the magnitude of a dataset grows, the data points become sparse and density drops, making the data difficult to cluster and reducing the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data. In particular, some methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information. In this paper, we propose a kernel approach for semi-supervised clustering and present in detail two special cases of this kernel approach.
A survey on methods and applications of meta-learning with GNNs (Shreya Goyal)
This survey paper provides a comprehensive review of works that combine graph neural networks (GNNs) and meta-learning, along with a summary of the methods and applications in each category. The application of meta-learning to GNNs is a growing and exciting field; many graph problems will benefit immensely from the combination of the two approaches.
This document describes a novel graph embedding procedure based on simplicial complexes for graph classification tasks. Simplicial complexes are mathematical objects that can capture multi-way relationships in data beyond pairwise relationships. The proposed approach uses simplicial complexes to extract meaningful substructures from graphs, clusters these substructures to form an alphabet, and then embeds each graph as a symbolic histogram over the alphabet. This moves the problem into a metric space where standard machine learning algorithms can be applied. The approach is tested on 30 graph classification benchmarks and two protein analysis applications to demonstrate its effectiveness.
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp... (IRJET Journal)
This document proposes a new oversampling technique called Centroid Oversampling to address imbalanced class distributions in customer churn prediction problems. It summarizes existing oversampling methods like SMOTE and introduces Centroid Oversampling, which generates synthetic samples by calculating the centroid of the three nearest data points rather than oversampling outliers. Experimental results on three telecom datasets show Centroid Oversampling achieves better accuracy, recall, and F-measure than SMOTE when used with a KNN classifier, particularly on datasets with high imbalance.
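A minimal sketch of the centroid idea as described (synthetic minority samples as centroids of the three nearest minority neighbours); details such as outlier handling are omitted and all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def centroid_oversample(X_minority, n_new, seed=0):
    rng = np.random.default_rng(seed)
    # k=4 neighbours: each point's nearest-neighbour list includes itself
    nn = NearestNeighbors(n_neighbors=4).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)
    picks = rng.choice(len(X_minority), size=n_new)
    # synthetic sample = centroid of the three nearest minority neighbours
    return np.array([X_minority[idx[i, 1:4]].mean(axis=0) for i in picks])

X_min = np.random.default_rng(1).random((25, 2))
print(centroid_oversample(X_min, n_new=5).shape)   # (5, 2)
```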
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex... (CSCJournals)
Biclusters are required for analyzing gene expression patterns: genes are compared across the rows of the expression profiles, and sample expression profiles are compared across the columns of the gene expression matrix. In the process of biclustering we need to cluster both genes and samples. The algorithm presented in this paper is based on a two-way clustering approach in which genes and samples are clustered using parallel fuzzy c-means clustering with the Message Passing Interface; we call it MFCM. MFCM is applied to cluster genes and samples so as to maximize the membership function values over the data set. It is a parallelized rework of a fuzzy two-way clustering algorithm for microarray gene expression data [9], undertaken to study the efficiency and parallelization improvement of the algorithm. The algorithm uses a gene entropy measure to filter the clustered data and find biclusters. The method is able to obtain highly correlated biclusters of the gene expression dataset.
This presentation summarizes the main content of Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712.
The paper was accepted in December 2017 by the Casualty Actuarial Society.
1) The document discusses mining data streams using an improved version of McDiarmid's bound. It aims to enhance the bounds obtained by McDiarmid's tree algorithm and improve processing efficiency.
2) Traditional data mining techniques cannot be directly applied to data streams due to their continuous, rapid arrival. The document proposes using Gaussian approximations to McDiarmid's bounds to reduce the size of training samples needed for split criteria selection.
3) It describes Hoeffding's inequality, which is commonly used but not sufficient for data streams. The document argues that McDiarmid's inequality, used appropriately, provides a more efficient technique for high-speed, time-changing data streams.
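For context, the Hoeffding bound referenced above gives, for n i.i.d. samples of a variable with range R, an ε such that the sample mean is within ε of the true mean with probability 1 − δ; McDiarmid-type bounds generalize this to functions of the samples, such as split criteria. A quick calculation:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# e.g. information gain bounded by log2(#classes) = 1 for two classes
print(hoeffding_epsilon(value_range=1.0, delta=1e-7, n=10_000))
```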
Kernel based similarity estimation and real time tracking of moving (IAEME Publication)
This document discusses kernel-based mean shift algorithm for real-time object tracking. It presents the following:
1) The algorithm uses kernel density estimation to calculate the similarity between a target model and candidate windows, using the Bhattacharyya coefficient. 2) It can successfully track objects moving uniformly at slow speeds but struggles with fast or non-uniform motion, or changes in scale. 3) The algorithm was tested on video streams and could track objects moving slowly but failed for fast or irregular motion. Adaptive target windows are needed to handle changes in scale.
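The similarity score at the heart of the tracker is the Bhattacharyya coefficient between the target and candidate colour histograms; a minimal numpy version with toy histograms:

```python
import numpy as np

def bhattacharyya(p, q):
    # coefficient in [0, 1]; 1 means identical normalized histograms
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(p * q).sum()

target = np.array([10.0, 30.0, 60.0])     # toy histogram of the target model
candidate = np.array([12.0, 25.0, 63.0])  # histogram of a candidate window
print(bhattacharyya(target, candidate))
```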
The document provides a literature review of different clustering techniques. It begins by defining clustering and its applications. It then categorizes and describes several clustering methods including hierarchical (BIRCH, CURE, CHAMELEON), partitioning (k-means, k-medoids), density-based (DBSCAN, OPTICS, DENCLUE), grid-based (CLIQUE, STING, MAFIA), and model-based (RBMN, SOM) methods. For each method, it discusses the algorithm, advantages, disadvantages and time complexity. The document aims to provide an overview of various clustering techniques for classification and comparison.
Improved wolf algorithm on document images detection using optimum mean techn... (journalBEEI)
Detecting text from handwriting in historical documents provides high-level features for the challenging problem of handwriting recognition. Such handwriting often contains noise, faint or incomplete strokes, strokes with gaps, and competing lines when embedded in a table or form, making it unsuitable for local line-following algorithms or associated binarization schemes. In this paper, a method based on an optimum threshold value, named the Optimum Mean method, is presented. The Wolf method fails to detect thin text in non-uniform input images; the proposed method overcomes this problem by deriving a maximum threshold value using the optimum mean. Based on the calculations, the proposed method obtained a higher F-measure (74.53) and PSNR (14.77) and the lowest NRM (0.11) compared to the Wolf method. In conclusion, the proposed method successfully and effectively solves the Wolf method's problem by producing a high-quality output image.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me... (IJAEMSJORNAL)
The current trend in industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. But the challenge with huge data sets is the high dimensionality associated with them. Sometimes, in data analytics applications, large amounts of data produce worse performance, and since most data mining algorithms are implemented column-wise, too many columns restrict performance and make processing slower. Therefore, dimensionality reduction is an important step in data analysis. Dimensionality reduction is a technique that converts high-dimensional data into a much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction; each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self-Organising Map, a neural network technique, are presented in this paper. In addition, a hierarchical clustering approach is applied to the reduced data set. A case study of air quality measurement is used to evaluate the performance of the proposed techniques.
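A compact sketch of the reduce-then-cluster pipeline described above, using scikit-learn (random stand-in data and illustrative dimensions; the SOM branch is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(500, 40))  # stand-in measurement matrix
Z = PCA(n_components=5).fit_transform(X)             # keep the top-variance axes
labels = AgglomerativeClustering(n_clusters=4).fit_predict(Z)
print(labels[:10])
```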
1. The document discusses using statistical learning methods like Gaussian mixture models (GMM) and dynamic component allocation (DCA) for hyperspectral image classification.
2. GMM represents pixel spectra as mixtures of Gaussian distributions. DCA is an algorithm that dynamically adds and removes Gaussian components to better characterize the data during training.
3. The document outlines how DCA works, including merging similar Gaussian modes, splitting modes with high kurtosis, and pruning insignificant modes. These techniques aim to learn the appropriate number of mixture components from the data.
During the past decade, the size of 3D seismic data volumes and the number of seismic attributes have increased
to the extent that it is difficult, if not impossible, for interpreters to examine every seismic line and time
slice. To address this problem, several seismic facies classification algorithms including k-means, self-organizing
maps, generative topographic mapping, support vector machines, Gaussian mixture models, and artificial neural
networks have been successfully used to extract features of geologic interest from multiple volumes. Although
well documented in the literature, the terminology and complexity of these algorithms may bewilder the average
seismic interpreter, and few papers have applied these competing methods to the same data volume. We have
reviewed six commonly used algorithms and applied them to a single 3D seismic data volume acquired over the
Canterbury Basin, offshore New Zealand, where one of the main objectives was to differentiate the architectural
elements of a turbidite system. Not surprisingly, the most important parameter in this analysis was the choice of
the correct input attributes, which in turn depended on careful pattern recognition by the interpreter. We found
that supervised learning methods provided accurate estimates of the desired seismic facies, whereas unsupervised
learning methods also highlighted features that might otherwise be overlooked.
Receiver deghosting method to mitigate F-K transform artifacts: A non-windo... (Pioneer Natural Resources)
In this study, we implemented and tested a new processing-based broadband solution for mitigating F-K transform artifacts in receiver deghosting in a marine environment. The F-K transform has traditionally been used for flat-cable (constant depth) deghosting and is often tailored to meet slanted (variable depth) cable criteria. Recently, the use of a τ-p domain deterministic deghost operator has become more prominent for slant cable deghosting. Irrespective of the type of transform or deghost operator used, a windowed process is essential due to the time- and offset-varying character of the ghost. This use of a windowed process usually results in poor reconstruction of deghosted signals and artifacts beyond the control of the transform(s) itself. The windowing in time and offset produces edge effects which can be clearly seen in the difference plots. Our method, using a non-windowing approach, demonstrates a better representation of the deghosted signals without the artifacts caused by the boundaries of the windows. This method has also been well tested for both flat and slant cable receiver deghosting workflows on synthetic and field data examples.
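For orientation, the F-K spectrum of a time-offset gather is simply a 2D Fourier transform over time and offset; a toy numpy computation (the deghost operator itself is not reproduced here):

```python
import numpy as np

gather = np.random.default_rng(0).normal(size=(512, 96))  # toy t-x panel
fk = np.fft.fftshift(np.fft.fft2(gather))                 # F-K spectrum
amplitude = np.abs(fk)                                    # |F-K| for display
```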
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ... (Pioneer Natural Resources)
This document discusses optimizing the rate allocation of hyperspectral images compressed using JPEG2000. It presents a mixed model for bit allocation that combines high and low bit rate models. This mixed model and an optimal rate allocation approach based on minimizing mean squared error under a rate constraint provide lower reconstruction errors than traditional approaches. Computational tests on hyperspectral data show the discrete wavelet transform allows for faster processing and less memory usage compared to the Karhunen-Loeve transform.
Directional Analysis and Filtering for Dust Storm detection in NOAA-AVHRR Ima... (Pioneer Natural Resources)
This document proposes techniques for detecting dust storms and determining their direction of transport using NOAA-AVHRR satellite imagery. It introduces a two-part approach using image processing algorithms: 1) A visualization technique uses filters, edge detectors and classification to locate dust sources. 2) An automation technique performs power spectrum analysis to detect dust storm direction by analyzing texture orientation in image blocks. The goal is to automatically detect and track dust storms for applications like hazard monitoring.
New exploration challenges and current research demand 3D gravity modeling with 3D geological interpretations. In the near future, multi-parameter and multi-dimensional interpretations representing the observed and expected in situ geological, geophysical, and petrophysical data will be used for joint multi-parameter, multi-dimensional inversions. We present an initial 3D gravity model of Osage County in northeastern Oklahoma, where there is a greater than 40 mGal, 100 km diameter semi-circular gravity anomaly that cannot be effectively removed by traditional gravity processing techniques.
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi... (Pioneer Natural Resources)
Conventional reservoirs benefit from a long scientific history that correlates successful plays to seismic measurements through depositional, tectonic, and diagenetic models. Unconventional reservoirs are less well understood but benefit from significantly denser well control, allowing us to establish statistical rather than model-based correlations between seismic data, geology, and successful completion strategies. One of the more commonly encountered correlation techniques is computer-assisted pattern recognition. Pattern recognition techniques have found their niche in a plethora of applications, ranging from flagging suspicious credit card purchase patterns to rewarding repeat online buying patterns. Classifying a given seismic response as having a "good" or "bad" pattern requires a distance metric. Distance metric "learning" uses past experience (well performance) as training data to develop a distance metric. Alternative distance metrics have demonstrated significant value in the identification and classification of repeated or anomalous behaviors in public health, security, and marketing. In this paper we examine the value of three of these alternative distance metrics over 3D seismic attributes for the identification of sweet spots in a Barnett Shale play.
We illustrate unsupervised and supervised learning algorithms that accurately classify lithological variations in 3D seismic data. We demonstrate blind source separation techniques, such as principal component analysis (PCA) and noise-adjusted principal components, in conjunction with Kohonen self-organizing maps to produce superior unsupervised classification maps.
Further, we utilize training in the PCA space for maximum likelihood (ML) supervised classification. Results demonstrate that ML supervised classification produces an improved classification of the facies in a 3D seismic dataset from the Anadarko Basin in central Oklahoma.
The document proposes an anomaly detection scheme for hyperspectral images based on a non-Gaussian mixture model using a Student's t-distribution. It estimates the background probability density function using a Bayesian approach that models each pixel as a mixture of Student's t distributions. The anomaly detection strategy then applies a generalized likelihood ratio test. Experimental results on real hyperspectral data show the proposed Bayesian Student's t mixture model can reliably estimate the background distribution and effectively detect anomalous objects, outperforming a Gaussian mixture model approach.
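As a simplified, one-dimensional stand-in for the detection rule (not the paper's multivariate mixture or full GLRT), the negative log-likelihood of a pixel under a background Student's-t model can serve as an anomaly score:

```python
import numpy as np
from scipy.stats import t

background = t(df=4, loc=0.0, scale=1.0)    # toy fitted background model
pixels = np.array([0.1, -0.4, 6.0])
scores = -background.logpdf(pixels)          # larger = more anomalous
print(scores)
```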
This document presents a novel fuzzy k-nearest neighbor equality (FK-NNE) algorithm for classifying masses in mammograms as benign or malignant. The algorithm assigns membership values to different classes based on distances to k-nearest neighbors. It achieved 94.46% sensitivity, 96.81% specificity, and 96.52% accuracy, outperforming k-nearest neighbors, fuzzy k-nearest neighbors, and k-nearest neighbor equality algorithms. The algorithm considers relative importance of neighbors and assigns partial membership to classes, addressing issues with insufficient knowledge faced by other techniques. Experimental results demonstrated FK-NNE had the best performance with an area under the ROC curve of 0.9734, indicating high diagnostic accuracy.
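The membership assignment in fuzzy k-NN weights each neighbour's class (crisp labels here) by inverse distance; a minimal sketch of the standard decision rule, which FK-NNE extends:

```python
import numpy as np

def fuzzy_knn_memberships(dists, neighbor_labels, classes, m=2.0):
    # weight each of the k neighbours by 1 / d^(2/(m-1)), then normalize
    w = 1.0 / np.maximum(dists, 1e-12) ** (2.0 / (m - 1.0))
    memberships = np.array([(w * (neighbor_labels == c)).sum() for c in classes])
    return memberships / w.sum()

dists = np.array([0.2, 0.5, 0.9])   # distances to the 3 nearest neighbours
labels = np.array([1, 0, 1])        # their classes (0 = benign, 1 = malignant)
print(fuzzy_knn_memberships(dists, labels, classes=[0, 1]))
```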
IRJET-Multimodal Image Classification through Band and K-Means ClusteringIRJET Journal
This document proposes a bilayer graph-based learning framework for multimodal image classification using limited labeled pixels. It constructs a simple graph in the first layer where each vertex is a pixel and edge weights encode pixel similarity. Unsupervised learning estimates grouping relations among pixels. These relations form a hypergraph in the second layer, on which semisupervised learning classifies pixels to address challenges of complex relationships and limited labels in multimodal images. The framework effectively exploits the underlying data structure.
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERkevig
We aim to model an adaptive log file parser. As the content of log files often evolves over time, we established a dynamic statistical model which learns and adapts processing and parsing rules. First, we limit the amount of unstructured text by clustering log file lines based on their semantics. Next, we take only the most relevant cluster into account and focus on those frequent patterns which lead to the desired output table, similar to Vaarandi [10]. Furthermore, we transform the found frequent patterns and the output specifying the parsed table into a Hidden Markov Model (HMM). We use this HMM as a specific yet flexible representation of a pattern for log file parsing to maintain high-quality output. After training our model on one system type and applying it to a different system with slightly different log file patterns, we achieve an accuracy of over 99.99%.
Predicting electricity consumption using hidden parametersIJLT EMAS
This paper presents a data mining technique to forecast the power demand of a geographical region based on meteorological conditions. The forecasting technique is implemented with a Hidden Markov Model. The values of factors such as temperature, humidity, and public holidays, on which electricity consumption depends, together with the daily consumption values, compose the data. Data mining operations are performed on this historical data to build a forecasting model capable of predicting daily consumption given the meteorological parameters. The steps of the knowledge discovery process are implemented: the data is preprocessed and fed to the HMM for training, and the trained HMM is then used to predict the electricity demand for the given meteorological conditions.
This document describes a new method for training Gaussian mixture classifiers for hyperspectral image classification. The method uses dynamic pruning, splitting, and merging of Gaussian mixture kernels to automatically determine the appropriate number of components during training. This "structural learning" approach is employed to model and classify hyperspectral imagery data. Experimental results on AVIRIS hyperspectral data sets suggest this approach is a potential alternative to traditional Gaussian mixture modeling and classification using expectation-maximization.
QUANTUM CLUSTERING-BASED FEATURE SUBSET SELECTION FOR MAMMOGRAPHIC I...ijcsit
In this paper, we present an algorithm for feature selection. This algorithm, labeled QC-FS (Quantum Clustering for Feature Selection), performs the selection in two steps. First, the original feature space is partitioned into groups of similar features using the Quantum Clustering algorithm. Then a representative for each cluster is selected, using similarity measures such as the correlation coefficient (CC) and mutual information (MI). The feature which maximizes this information is chosen by the algorithm.
Implementation of Fuzzy Logic for the High-Resolution Remote Sensing Images w...IOSR Journals
This document describes an implementation of fuzzy logic for high-resolution remote sensing image classification with improved accuracy. It discusses using an object-based approach with fuzzy rules to classify urban land covers in a satellite image. The approach involves image segmentation using k-means clustering or ISODATA clustering. Features are then extracted from the image objects and fuzzy logic is applied to classify the objects based on membership functions. The method was tested on different sensor and resolution images in MATLAB and showed improved classification accuracy over other techniques, achieving lower entropy in results. Future work planned includes designing an unsupervised classification model combining k-means clustering and fuzzy-based object orientation.
This PhD research proposal discusses using Bayesian inference methods for multi-target tracking in big data settings. The researcher proposes developing new stochastic MCMC algorithms that can scale to billions of data points by using small subsets of data in each iteration. This would make Bayesian methods computationally feasible for big data. The proposal outlines reviewing relevant literature, developing the theoretical foundations, and empirically validating new algorithms like sequential Monte Carlo on real-world problems to analyze text and user preferences at large scale.
A Survey on: Hyper Spectral Image Segmentation and Classification Using FODPSOrahulmonikasharma
The spatial analysis of an image sensed and captured from a satellite provides less accurate information about a remote location; hence analyzing the spectral content becomes essential. Hyperspectral images are one class of remotely sensed images, and they are superior to multispectral images in providing spectral information. Target detection is a significant requirement in many areas such as military and agriculture. This paper gives the analysis of hyperspectral image segmentation using the fuzzy C-means (FCM) clustering technique with the FODPSO classifier algorithm. A 2D adaptive log filter is proposed to denoise the sensed and captured hyperspectral image in order to remove speckle noise.
Oversampling technique in student performance classification from engineering...IJECEIAES
This document discusses various oversampling techniques for dealing with imbalanced data in student performance classification. It compares SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN oversampling combined with MLP, gradient boosting, AdaBoost, and random forest classifiers. The results show that Borderline-SMOTE gave the best performance for predicting the minority (low performance) class according to several evaluation metrics. SVMSMOTE also performed well overall, particularly for recall, F1-measure, and AUC. Gradient boosting provided high and consistent precision, recall, F1-measure, and AUC across the different oversampling methods.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
A cluster is a group of objects which are similar to each other within the cluster and dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of the distance between two objects or clusters; two or more objects belong to the same cluster only if they are close to each other under that distance measure. The major objective of clustering is to discover collections of comparable objects based on a similarity metric. Fuzzy Possibilistic C-Means (FPCM) is an effective clustering algorithm for unlabeled data that produces both membership and typicality values during the clustering process. In this work, the efficiency of the FPCM approach is enhanced by using penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differs from conventional clustering techniques by imposing a possibilistic reasoning strategy on fuzzy clustering with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approaches is evaluated on University of California, Irvine (UCI) machine learning repository datasets such as Iris, Wine, Lung Cancer and Lymphography. The parameters used for the evaluation are clustering accuracy, mean squared error (MSE), execution time and convergence behavior.
Hybrid features selection method using random forest and meerkat clan algorithmTELKOMNIKA JOURNAL
In the majority of gene expression investigations, selecting relevant genes for sample classification is a frequent challenge, with researchers attempting to discover the minimum feasible number of genes while still achieving excellent predictive performance. Various gene selection methods employ univariate (gene-by-gene) relevance rankings and arbitrary thresholds for selecting the number of genes; they are only applicable to two-class problems and use ranking criteria unrelated to the classification algorithm. A modified random forest (MRF) algorithm based on the meerkat clan algorithm (MCA) is provided in this work.
MCA is a swarm intelligence algorithm, while random forest is one of the most significant machine learning approaches based on decision trees. MCA is used to choose features for the RF algorithm. Feature selection is critical in information systems, databases, and other applications. The proposed algorithm was applied to three different databases, where the experimental results for accuracy and time demonstrated the superiority of the proposed algorithm over the original algorithm.
GeoAI: A Model-Agnostic Meta-Ensemble Zero-Shot Learning Method for Hyperspec...Konstantinos Demertzis
The document discusses a new meta-ensemble zero-shot learning method called MAME-ZsL for hyperspectral image analysis and classification. MAME-ZsL overcomes the difficulties of traditional deep learning methods that require large labeled datasets and long training times. It reduces computational costs, avoids overfitting, and achieves high classification accuracy even when testing classes were not present during training. The method is a novel optimization-based meta-ensemble architecture that facilitates learning representations from limited labeled examples to enable one-shot and zero-shot learning.
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...ijaia
The document compares the prediction accuracies of ridge regression, lasso regression, and elastic net regularization methods on 13 datasets. It finds that the prediction accuracy depends heavily on the nature of the datasets. When using cross-validation, the lasso method worked better than ridge regression and elastic net on two datasets. When using BIC scoring, BIC produced better predictions than cross-validation for 6 datasets, especially favoring ridge regression on highly correlated datasets. Overall, ridge regression, lasso, and elastic net tended to perform similarly except on two datasets where ridge regression outperformed the others.
A Neural Network Approach to Identify Hyperspectral Image Content IJECEIAES
Hyperspectral imaging is a technique that produces very high dimensional data with hundreds of channels. Hyperspectral images (HSIs) deliver complete knowledge of the imaged scene, so applying a classification algorithm is a very important tool for practical uses. HSIs always have a large number of correlated and redundant features, which decreases classification accuracy; moreover, the feature redundancy adds computational burden without contributing any beneficial information to the classification. In this study, an unsupervised Band Selection Algorithm (BSA) is considered with Linear Projection (LP) that depends upon metric-band similarities. Afterwards, the Monogenetic Binary Feature (MBF) is considered to perform texture analysis of the HSI, where three operational components represent the monogenetic signal: phase, amplitude and orientation. In the post-processing classification stage, a feature-mapping function can provide important information, which helps to adopt a Kernel-based Neural Network (KNN) to optimize generalization ability. Moreover, an alternative multiclass application can be adopted through the KNN by considering multiple output nodes instead of a single output node.
This document presents a new model called EQUIRS (Explicitly Query Understanding Information Retrieval System) based on Hidden Markov Models (HMM) to improve natural language processing for text query information retrieval. The proposed EQUIRS system is compared to previous fuzzy clustering methods. Experimental results on a dataset of 900 files across 5 categories show that EQUIRS has higher accuracy than fuzzy clustering, as measured by precision, recall, F-measure, though it has longer training and searching times. The document concludes that EQUIRS is an effective approach for information retrieval based on HMM.
Hydraulic fracturing stimulation designs are moving towards tighter spaced clusters, longer stage length, and more proppant volumes. However, effectively evaluating the hydraulic fracturing stimulation efficiency remains a challenge. Distributed fiber optic sensing, which includes DAS and DTS, can continuously monitor the hydraulic fracturing stimulation downhole and be compared with other monitoring technology such as microseismic.
The DAS and DTS data, when integrated with the microseismic, highlight processes relevant to the completion design and allow for a better understanding and interpretation of each dataset.
This paper outlines a workflow to improve processing and interpretation of DAS and DTS data. In addition,
an estimate of the slurry distribution can be made. These methods will be demonstrated for a horizontal
Wolfcamp well in the Permian Basin. Here we compare key aspects of the microseismic, DAS, and DTS
results in several fracture stages to understand the downhole geomechanical processes. In order to interpret the DTS data, a thermal model is developed (using DTS data) to simulate the temperature behavior after pumping has ceased. A slurry distribution is obtained by matching the simulated temperature with the measured temperature from DTS. In addition, the DAS signal is studied in the frequency domain to identify the dominant frequencies, which are mostly related to fluid flow, and to reduce the background noise. This time-frequency analysis enhances the ability to monitor and optimize well treatments.
After reducing the background noise, the acoustic intensity is correlated to the slurry distribution. The fluid
distribution data from DAS and DTS are compared with the microseismic and near-field strain to better understand the completion processes. We utilized fiber-optic microseismic to better understand the completion and to compare it with conventional microseismic.
Finally, we highlight the dynamics of strain and microseismic signature as fluid moves from an offset well
completion into the prior stimulated fiber well to better understand the reservoir and far field effects of the
completion.
Interpretation Special-Section: Insights into digital oilfield data using ar...Pioneer Natural Resources
Invitation to contribute to our special issue titled “Insights to the Digital Oil Field data using Artificial Intelligence & Big Data Analytics”. The scope of this special issue is to further bridge the gap between the geophysical interpretation and well planning (drilling, completions and production).
The technology communities both in industry and academia are utilizing advanced signal processing / machine learning algorithms, high compute / Big Data architectures at scale to develop these practical solutions. Your contributions to advanced algorithms in signal processing/machine learning, subsurface imaging and interpretation will be a good fit for this issue.
We hope you will find this participation both rewarding and worthwhile for our industry and to the Interpretation community in general.
Dear Colleagues,
A call for papers has been announced for another machine learning special issue of the SEG/AAPG journal Interpretation, focusing on seismic data analysis.
We look forward to your contribution.
Vikram Jayaram
Special Section Editor
Interpretation
An approach to offer management: maximizing sales with fare products and anci...Pioneer Natural Resources
With the growth in ancillary sales, an area of increasing importance for airlines is the concept of offer management, which entails the creation of dynamic, custom, personalized offers consisting of a flight itinerary and ancillary products offered by an airline. This practice-oriented, overview paper provides an end-to-end, future-oriented framework for determining the composition of optimal base fare and ancillary bundles by customer trip purpose segment followed by 1:1 personalization to maximize total sales. Our focus in this paper is primarily on the proposed offer management framework and its sub-components.
The document discusses using discrete wavelet transform (DWT) and principal component analysis (PCA) as decorrelating transforms for hyperspectral image classification under JPEG2000 compression. It compares the classification performance of DWT and PCA when applying lossless compression and two JPEG2000 scalability options: color and quality. Color scalability decompresses a subset of bands, while quality scalability assigns more bits to important bands. The DWT provides similar classification to PCA but is faster and does not require additional files. Reordering bands by variance before color decompression improved DWT classification results compared to using the initial DWT band order.
The document discusses optimizing rate allocation for compressing hyperspectral images using JPEG2000. It proposes using the discrete wavelet transform instead of the Karhunen-Loeve transform for decorrelation due to lower computational complexity. A mixed model is used for rate distortion optimal bit allocation instead of experimentally obtained rate distortion curves. Comparisons show the mixed model approach results in lower mean squared error than traditional bit allocation schemes, while having lower implementation complexity than prior methods.
Detection and Classification in Hyperspectral Images using Rate Distortion an...Pioneer Natural Resources
This document summarizes an experiment that compares two methods of bit allocation for compressing hyperspectral imagery using JPEG2000: 1) the traditional high bit rate quantizer approach and 2) the rate distortion optimal (RDO) approach. The experiment shows that both methods perform well at relatively low bit rates, achieving over 96% classification accuracy. However, at very low bit rates, the RDO approach outperforms the high bit rate quantizer approach, achieving 90% accuracy at 0.0375 bpppb compared to less than 90% for the high bit rate method. The RDO approach also achieves lower mean squared error than the high bit rate quantizer approach.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Securing your Kubernetes cluster: a step-by-step guide to success!KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users of mobile devices want to take full advantage of the features
available on those devices, but many features that provide convenience and capability also sacrifice security. This best practices guide outlines steps users can take to better protect their personal devices and information.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party and will share these foundational concepts to build on:
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
mixture model from the multivariate HSI data using an HMM. This HMM approach estimates the proportion of each HSI class present in a finite mixture model by incorporating both the estimation step and model selection in a single algorithm. The model selection step that was previously introduced in [7] automatically assigns mixture components for a GM. Our technique utilizes a reduced-dimensional feature space to model and synthesize approximate observations of the true HSI data. To establish the relevance of finite mixture models for HSI, consider a random variable X. A finite mixture model decomposes a PDF $f(x)$ into a sum of K class PDFs. A general density function $f(x)$ is considered semiparametric, since it may be decomposed into K components. Let $f_k(x)$ denote the $k$th class PDF. The finite mixture model with K components expands as

$$f(x) = \sum_{k=1}^{K} a_k f_k(x), \qquad (1)$$

where $a_k$ denotes the proportion of the $k$th class. The proportion $a_k$ may be interpreted as the prior probability of observing a sample from class k. Furthermore, the prior probabilities $a_k$ for each distribution must be nonnegative and sum to one, or

$$a_k \geq 0 \quad \text{for } k = 1, \cdots, K, \qquad (2)$$

$$\sum_{k=1}^{K} a_k = 1. \qquad (3)$$
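To make Eqs. (1)-(3) concrete, the following minimal Python sketch evaluates a two-class mixture density and verifies the constraints on the priors; the Gaussian class PDFs and their parameters are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class mixture: priors a_k must be nonnegative and sum to one
a = np.array([0.3, 0.7])
assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)   # Eqs. (2) and (3)

# Class-conditional PDFs f_k(x); two illustrative Gaussians (not from the paper)
f = [norm(loc=-1.0, scale=0.5), norm(loc=2.0, scale=1.0)]

def mixture_pdf(x):
    """Evaluate f(x) = sum_k a_k f_k(x), Eq. (1)."""
    return sum(a_k * f_k.pdf(x) for a_k, f_k in zip(a, f))

print(mixture_pdf(np.linspace(-4.0, 6.0, 5)))
```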
Since the underlying probability densities of the mixture are initially unknown, one must estimate the densities from samples of each class iteratively. Thus, we formally extend the PDF-based classification approach to the analysis of HSI data (dependent data). In our approach we adopt a stationary Markovian model, a powerful stochastic model that can closely approximate many naturally occurring phenomena; one famous example is the approximation of human speech [8]. While a very powerful stochastic model, a single HMM cannot easily act as a good classifier between a wide variety of signal classes. Instead, it is best to design HMMs specifically for each signal type and feature type.
The rest of the paper is organized as follows. Section II mentions examples of previous work in the literature related to our experiments. Section III reviews the basics of HMM formulation, mixture learning, and density estimation. Section IV briefly describes the minimum noise fraction (MNF) transform. Section V reports experimental results, and Section VI ends the paper with some concluding remarks.
2. EARLIER WORK
Few investigations have introduced HMMs in HSI processing in recent times. In Du et al. [9], hidden Markov model information divergence (HMMID) was introduced as a discriminatory measure among target spectra, and comparisons were made to deterministic distance metrics such as the spectral angle mapper (SAM) and minimum Euclidean distance (MED). More recently, Bali et al. [10] addressed the problem of joint segmentation of hyperspectral images in the Bayesian framework. This approach is based on an HMM of the images with common segmentation, or equivalently with common hidden classification label variables, modeled by a Potts Markov random field. In a related work, Li et al. [11] proposed a two-dimensional HMM for image classification that provides a structured way to incorporate context information into classification. All the above-mentioned approaches fall under the domain of supervised techniques. The modest development of unsupervised classification techniques in the HSI regime has been the primary source of motivation for the proposed work.
Multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM) [12]. Normally, a GM in the form of the PDF for $z \in \mathbb{R}^P$ is given by

$$p(z) = \sum_{i=1}^{L} \alpha_i N(z, \mu_i, \Sigma_i),$$
Figure 1. An illustration of image and feature space representation.
where

$$N(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \, e^{-\frac{1}{2}(z-\mu_i)' \Sigma_i^{-1} (z-\mu_i)}.$$

Here L is the number of mixture components and P the number of spectral channels (bands). The GM parameters are denoted by $\lambda = \{\alpha_i, \mu_i, \Sigma_i\}$. The parameters of the GM are estimated using maximum likelihood by means of the EM algorithm. In [7] we show the structural learning of a GM that is employed to model and classify HSI data. This methodology utilizes a fast and automatic assignment of mixture components to model PDFs. Later on, we employ the same mechanism to estimate parameters and further model the state PDFs of an HMM.
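A minimal sketch of evaluating the GM density $p(z)$ above; the parameters $\lambda = \{\alpha_i, \mu_i, \Sigma_i\}$ below (L = 3 components, P = 2 bands) are hypothetical values chosen only to make the snippet runnable.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical GM parameters lambda = {alpha_i, mu_i, Sigma_i}, L = 3, P = 2
alphas = np.array([0.5, 0.3, 0.2])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([-2.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]

def gm_pdf(z):
    """p(z) = sum_i alpha_i N(z; mu_i, Sigma_i)."""
    return sum(a * multivariate_normal.pdf(z, mean=m, cov=S)
               for a, m, S in zip(alphas, mus, Sigmas))

print(gm_pdf(np.array([1.0, 0.5])))
```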
Consider data consisting of K samples of dimension P; it is not necessary, or even desirable, to group all the data together into a single KP-dimensional sample. In the simplest case, all K samples are independent and we may regard them as samples of the same random variable. For most practical cases, they are not independent. The Markovian principle assumes consecutive samples are statistically independent when conditioned on knowing the samples that preceded them. This leads to the elegant solution of an HMM, which employs a set of M PDFs of dimension P. The HMM regards each of the K samples as having originated from one of the M possible states, and there is a distinct probability that the underlying model “jumps” from one state to another. In our approach, the HMM uses GMs to model each state PDF [5]. We have focused on an unsupervised learning algorithm for ML parameter estimation, which in turn is used as a reduced-dimensional PDF-based classifier.
3. UNSUPERVISED LEARNING OF MARKOV SOURCES
In this section, we present the general formulation of an HMM and the re-estimation of HMM parameters, observation PDFs, and GM parameters. Following the notational approach of Rabiner [8], consider T observation times. At each time $1 \leq t \leq T$, there is a discrete state variable $q_t$ which takes one of N values $q_t \in \{S_1, S_2, \cdots, S_N\}$. According to the Markovian assumption, the probability distribution of $q_{t+1}$ depends only on the value of $q_t$. This is described compactly by a state transition probability matrix A whose elements $a_{ij}$ represent the probability that $q_{t+1}$ equals $S_j$ given that $q_t$ equals $S_i$. The initial state probabilities are denoted $\pi_i$, the probability that $q_1$ equals $S_i$. It is a hidden Markov model because the states $q_t$ are hidden from view; that is, we cannot observe them. But we can observe the random data $O_t$, which is generated according to a PDF dependent on the state at time t, as illustrated in Figure 2. We denote the PDF of $O_t$ under state j as $b_j(O_t)$. The complete set of model parameters that define the HMM is $\Lambda = \{\pi_j, a_{ij}, b_j\}$.
The EM algorithm, known in this context as the Baum-Welch algorithm, calculates new estimates $\bar{\Lambda}$ given an observation sequence $O = O_1, O_2, O_3, \cdots, O_T$ and a previous estimate of $\Lambda$. The algorithm is composed of two parts: the forward/backward procedure and the re-estimation of parameters.
Using Gaussian Mixtures for $b_j(O_t)$
We model the PDFs $b_j(O_t)$ as GMs:
Figure 2. A hidden Markov model. The observer makes observations whose PDF depends on the hidden state.
$$b_j(O) = \sum_{m=1}^{M} c_{jm} N(O, \mu_{jm}, U_{jm}), \qquad 1 \leq j \leq N,$$

where

$$N(O, \mu_{jm}, U_{jm}) = \frac{1}{(2\pi)^{P/2} |U_{jm}|^{1/2}} \, e^{-\frac{1}{2}(O-\mu_{jm})' U_{jm}^{-1} (O-\mu_{jm})}$$

and P is the dimension of O. We will refer to these GM parameters collectively as $b_j = \{c_{jm}, \mu_{jm}, U_{jm}\}$.
Forward/Backward Procedure
We wish to compute the probability of the observation sequence $O = O_1, O_2, \cdots, O_T$ given the model $\Lambda = \{\pi_j, a_{ij}, b_j\}$. The forward procedure for $p(O|\Lambda)$ is

• Initialization:
$$\alpha_1(i) = \pi_i b_i(O_1), \qquad 1 \leq i \leq N$$

• Induction:
$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(O_{t+1}), \qquad 1 \leq t \leq T-1, \quad 1 \leq j \leq N$$

• Termination:
$$p(O|\Lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

The backward procedure is

• Initialization:
$$\beta_T(i) = 1, \qquad 1 \leq i \leq N$$

• Induction:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), \qquad t = T-1, T-2, \cdots, 1, \quad 1 \leq i \leq N$$
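A compact NumPy sketch of the forward/backward procedure exactly as written above; B[t, j] stores $b_j(O_t)$. No scaling is applied, so for long sequences a scaled or log-domain variant would be needed to avoid numerical underflow.

```python
import numpy as np

def forward(pi, A, B):
    """Forward procedure. B[t, j] = b_j(O_t); returns alpha and p(O | Lambda)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                       # initialization
    for t in range(T - 1):                     # induction
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]
    return alpha, alpha[-1].sum()              # termination: sum_i alpha_T(i)

def backward(A, B):
    """Backward procedure; beta_T(i) = 1 by definition."""
    T, N = B.shape
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):             # t = T-1, ..., 1 (0-based here)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta
```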
Re-estimation of HMM parameters
The re-estimation procedure calculates new estimates of $\Lambda$ given the observation sequence $O = O_1, O_2, O_3, \cdots, O_T$. We first define

$$\xi_t(i,j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}$$

and

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j).$$

The updated state priors are

$$\bar{\pi}_i = \gamma_1(i).$$

The updated state transition matrix is

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.$$
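The quantities $\xi_t(i,j)$ and $\gamma_t(i)$ and the resulting $\pi$ and A updates can be sketched as follows; array shapes follow the forward/backward sketch above.

```python
import numpy as np

def reestimate_pi_A(alpha, beta, A, B):
    """Compute xi_t(i, j) and gamma_t(i), then the pi and A updates above."""
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()                # normalize over all (i, j)
    gamma = xi.sum(axis=2)                     # gamma_t(i) = sum_j xi_t(i, j)
    pi_new = gamma[0]                          # pi_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
    return pi_new, A_new
```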
Re-estimation of Observation PDFs [13]
In order to update the observation PDFs, it is necessary to maximize

$$Q_j = \sum_{t=1}^{T} w_{tj} \log b_j(O_t)$$

over the PDF $b_j$, where

$$w_{t,j} = \frac{\alpha_t(j) \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}.$$

This is a “weighted” maximum likelihood (ML) procedure, since if $w_{tj} = c_j$ the results are strict ML estimates. The weights $w_{tj}$ are interpreted as the probability that the Markov chain is in state j at time t.
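Computing the weights $w_{t,j}$ from the forward and backward variables is a one-line operation; a possible sketch:

```python
import numpy as np

def state_weights(alpha, beta):
    """w_{t,j} = alpha_t(j) beta_t(j) / sum_i alpha_t(i) beta_t(i)."""
    num = alpha * beta
    return num / num.sum(axis=1, keepdims=True)
```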
Re-estimation of Gaussian Mixture Parameters
In our experiments, $b_j(O)$ are modeled as GMs by simply determining the weighted ML estimates of the GM parameters. This would require iterating to convergence at each step. A more global approach is possible if the mixture component assignments are regarded as “missing data” [13]. The result is that the quantity

$$Q_j = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m) \log b_j(O_t)$$

is maximized, where

$$\gamma_t(j,m) = w_{t,j} \, \frac{c_{jm} N(O_t, \mu_{jm}, U_{jm})}{\sum_{k=1}^{M} c_{jk} N(O_t, \mu_{jk}, U_{jk})}.$$

Here, the weights $\gamma_t(j,m)$ are interpreted as the probability that the Markov chain is in state j and the observation is from mixture component m at time t. The resulting update equations for $c_{jm}$, $\mu_{jm}$, and $U_{jm}$ are computed as follows:

$$\hat{c}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m)}{\sum_{t=1}^{T} \sum_{l=1}^{M} \gamma_t(j,l)}.$$

The above expression is similar to the re-estimation of a GM [5]. This means that the algorithms designed for GMs are applicable to updating the state PDFs of the HMM. Therefore,

$$\hat{\mu}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m) \, O_t}{\sum_{t=1}^{T} \gamma_t(j,m)},$$

$$\hat{U}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m) (O_t - \hat{\mu}_{jm})(O_t - \hat{\mu}_{jm})'}{\sum_{t=1}^{T} \gamma_t(j,m)}.$$
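A sketch of these weighted update equations for one state's GM parameters, assuming the responsibilities $\gamma_t(j,m)$ have already been computed as above:

```python
import numpy as np

def update_state_gm(O, gamma_jm):
    """Weighted ML updates for one state's GM parameters.

    O        : (T, P) observation sequence
    gamma_jm : (T, M) responsibilities gamma_t(j, m) for this state j
               (assumed precomputed from w_{t,j} and the current GM)
    """
    w = gamma_jm.sum(axis=0)                           # sum_t gamma_t(j, m)
    c_hat = w / gamma_jm.sum()                         # c_jm update
    mu_hat = (gamma_jm.T @ O) / w[:, None]             # mu_jm update
    U_hat = []
    for m in range(gamma_jm.shape[1]):                 # U_jm update
        d = O - mu_hat[m]
        U_hat.append((gamma_jm[:, m, None] * d).T @ d / w[m])
    return c_hat, mu_hat, np.array(U_hat)
```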
4. MINIMUM NOISE FRACTION TRANSFORM
Before we begin the section on experiments, we define the minimum noise fraction (MNF) transform, since we use it to obtain a 2D feature plot of the true data as shown in Figure 3 (right). The MNF transformation is a highly useful spectral processing tool in HSI analysis [14]. It is used to determine the inherent dimensionality of image data, to segregate noise in the data, and to reduce the computational requirements for subsequent processing. This transform is essentially two cascaded principal components transformations. The first transformation, based on an estimated noise covariance matrix, decorrelates and rescales the noise in the data; this first step results in transformed data in which the noise has unit variance and no band-to-band correlations. The second step is a standard principal components transformation of the noise-whitened data. For the purposes of further spectral processing, the inherent dimensionality of the data is determined by examination of the final eigenvalues and the associated images. The data space can be divided into two parts: one part associated with large eigenvalues and coherent eigenimages, and a complementary part with near-unity eigenvalues and noise-dominated images. By using only the coherent portions, the noise is separated from the data; thus the image bands are ranked by signal-to-noise ratio (SNR).
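The two cascaded transformations can be sketched as follows. The noise covariance estimate is assumed to be supplied; estimating it (for example, from differences of spatially adjacent pixels) is a common heuristic not detailed in this paper.

```python
import numpy as np

def mnf(X, noise_cov):
    """Minimal MNF sketch: noise whitening followed by PCA.

    X         : (n_pixels, P) mean-centered spectra
    noise_cov : (P, P) estimated noise covariance (estimation method assumed)
    """
    # Step 1: whiten the noise (unit variance, no band-to-band correlation)
    evals, evecs = np.linalg.eigh(noise_cov)
    W = evecs / np.sqrt(evals)                 # noise-whitening matrix
    Xw = X @ W
    # Step 2: standard PCA of the noise-whitened data
    pvals, pvecs = np.linalg.eigh(np.cov(Xw, rowvar=False))
    order = np.argsort(pvals)[::-1]            # components ranked by SNR
    return Xw @ pvecs[:, order]
```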
5. EXPERIMENTS
The remote sensing data sets used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral bands with wavelengths corresponding to 0.4-2.5 μm. AVIRIS is flown all across the US, Canada, and Europe. Figure 3 shows the data sets used in our experiments, which belong to an Indian Pines scene in northwest Indiana. The spatial bands of this scene are of size 169 × 169 pixels. Since HSI imagery is highly correlated in the spectral direction, the MNF rotation is an obvious choice for decorrelating the bands. This also yields a 2D “scatter” plot of the first two MNF components of the data, as shown in Figure 3. The scatter plots used in the paper are similar to a PDF marginalized onto a 2D plane; marginalization is an easy way to visualize the state PDFs of an HMM. To illustrate this visualization scheme, let $z = [z_1, z_2, z_3, z_4]$. For example, to visualize on the $(z_2, z_4)$ plane, we would compute

$$p(z_2, z_4) = \int_{z_1} \int_{z_3} p(z_1, z_2, z_3, z_4) \, dz_1 \, dz_3.$$
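For GM state PDFs this marginalization is analytic: the marginal of each Gaussian component is obtained by keeping the corresponding sub-vector of its mean and sub-block of its covariance. A sketch (indices in `keep` are zero-based, so (1, 3) selects $z_2$ and $z_4$):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gm_marginal_pdf(z24, alphas, mus, Sigmas, keep=(1, 3)):
    """Marginal p(z2, z4) of a GM over z = [z1, z2, z3, z4].

    Each Gaussian component is marginalized by keeping the matching
    sub-mean and sub-covariance block.
    """
    idx = np.asarray(keep)
    return sum(a * multivariate_normal.pdf(z24, mean=m[idx],
                                           cov=S[np.ix_(idx, idx)])
               for a, m, S in zip(alphas, mus, Sigmas))
```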
Figure 3. (Left) Composite image of the Indian Pines scene. (Right) Composite image of the MNF-transformed scene.
This utility is very useful when visualizing high-dimensional PDFs. On the basis of an initial analysis with ISODATA and K-means unsupervised classifiers, it was found that the scene consisted of 3 prominent mixture classes. Therefore, we begin the training by considering a tri-state (corresponding to the 3 mixture classes identified) uniform state transition matrix A and prior probability π to initialize the HMM parameters. The PDF of the feature vector in each state is approximated by Gaussian mixtures. The automatic learning and initialization of the Gaussian mixtures are explicitly dealt with in our earlier work [7]. The algorithm outputs the total log likelihood at each iteration.
Training an HMM is an iterative process that seeks to maximize the probability that the HMM accounts for the example sequences. However, there is a chance of running into a “local maximum” problem: the model, though converged to some locally optimal choice of parameters, is not guaranteed to be the best possible model. In an attempt to avoid this pitfall, we use a simulated annealing procedure alongside training. This step is performed by expanding the covariance matrices of the PDF estimates and by pushing the state transition matrix and prior state probabilities closer to “uniform”. We attempt to escape a “bad” stationary point by re-running the above sequence until a better one is found. The PDF plots of the three state PDFs after convergence are shown in Figures 5, 6 and 7. In our experiments (both the modeling and synthesis stages) we use the Viterbi algorithm [8] to estimate the most likely state sequence. A few outliers are also observed in one or more state PDFs. Now that we have mixtures modeled by their corresponding state PDFs, we would like to test the model by generating synthetic observations. In Figure 8 we synthesize 100 observations. We clearly notice that the synthetic observations closely approximate the true data observations. This result is also exemplified in Figure 9, where we compare the true states of the data with the estimated states of the synthetic observations. Similarly, in Figures 10 and 11 we show instances that compare 300 and 600 synthetic observations to the true data. These comparisons show that the underlying mixture densities were adequately modeled using the HMM.
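Synthesis from the trained model can be sketched as follows: draw a state sequence from π and A, then draw each observation from the corresponding state's GM. This is a generic HMM sampling routine under those assumptions, not the paper's exact code.

```python
import numpy as np

def synthesize(pi, A, state_gms, T, seed=0):
    """Draw T synthetic observations from a trained HMM with GM state PDFs.

    state_gms[j] = (c, mus, covs): weights, means, covariances of state j's GM.
    """
    rng = np.random.default_rng(seed)
    N = len(pi)
    q = rng.choice(N, p=pi)                    # initial state ~ pi
    states, obs = [], []
    for _ in range(T):
        c, mus, covs = state_gms[q]
        m = rng.choice(len(c), p=c)            # pick a mixture component
        obs.append(rng.multivariate_normal(mus[m], covs[m]))
        states.append(q)
        q = rng.choice(N, p=A[q])              # "jump" to the next state via A
    return np.array(states), np.array(obs)
```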
6. CONCLUSIONS
In this paper, we proposed the use of a hidden Markov model that uses structural learning to approximate underlying mixture densities. Algorithm tests were carried out using real hyperspectral data consisting of a scene from Indian Pines in northwest Indiana. In our experiments, we utilized only the first two components of the MNF-transformed bands to ensure feature learning in a reduced representation of the data. We show that mixture learning for multivariate Gaussians is very similar to learning HMM parameters. In fact, unsupervised learning of the GM parameters for each class is seamlessly integrated to model the state PDFs of an HMM in a single algorithm. This technique could be applied to any type of parametric mixture model that utilizes the EM algorithm. Our experiments show that the proposed method models and synthesizes well the observations of the HSI data in a reduced-dimensional feature space. This technique can be considered a new paradigm of reduced-dimensional classifiers for processing HSI data.
Figure 4. (Left) 2D scatter plot of MNF Band 1 vs. Band 2. (Right) 2D histogram of MNF Bands 1 and 2.
Figure 5. (Top) 2D scatter plot of true data. (Bottom) PDF of State 1 after convergence.
Figure 6. (Top) 2D scatter plot of true data. (Bottom) PDF of State 2 after convergence.
Figure 7. (Top) 2D scatter plot of true data. (Bottom) PDF of State 3 after convergence.
Figure 8. Comparison of true data vs. 100 synthetic observations.
Figure 9. Comparison of true states vs. estimated states from synthetic observations.
Figure 10. Comparison of true data vs. 300 synthetic observations.
Figure 11. Comparison of true data vs. 600 synthetic observations.
ACKNOWLEDGMENTS
We would like to thank the Department of Geological Sciences at UTEP for providing access to the ENVI software, and LARS, Purdue University, for making the HSI data [15] available. This work was supported by a NASA Earth System Science (ESS) doctoral fellowship at the University of Texas at El Paso.
REFERENCES
[1] Schott, J. R., [Remote Sensing: The Image Chain Approach], Oxford University Press.
[2] Hughes, G. F., “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory 14, 55–63 (1968).
[3] Shaw, G. and Manolakis, D., “Signal processing for hyperspectral image exploitation,” IEEE Signal Processing Magazine 19, 12–16 (2002).
[4] Keshava, N., “Distance metrics & band selection in hyperspectral processing with applications to material identification and spectral libraries,” IEEE Transactions on Geoscience and Remote Sensing 42, No. 7, 1552–1565 (July 2004).
[5] McLachlan, G. and Peel, D., [Finite Mixture Models], Wiley Series in Probability and Statistics, New York, NY, second ed. (2000).
[6] Figueiredo, M. A. T. and Jain, A. K., “Unsupervised learning of finite mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 381–396 (2002).
[7] Jayaram, V. and Usevitch, B., “Dynamic mixing kernels in Gaussian mixture classifier for hyperspectral classification,” in [Mathematics of Data/Image Pattern Recognition, Compression, and Encryption with Applications XI, Proceedings of the SPIE], 70750L–70750L–8 (2008).
[8] Rabiner, L. R., “A tutorial on hidden Markov models and selected applications in speech recognition,” in [Proceedings of the IEEE], 257–286 (1989).
[9] Du, Q. and Chang, C.-I., “A hidden Markov model approach to spectral analysis for hyperspectral imagery,” Optical Engineering 40, No. 10, 2277–2284 (2001).
[10] Bali, N. and Mohammad-Djafari, A., “Bayesian approach with hidden Markov modeling and mean field approximation for hyperspectral data analysis,” IEEE Transactions on Image Processing 17, No. 2, 217–225 (2008).
[11] Li, J., Najmi, A., and Gray, R. M., “Image classification by a two-dimensional hidden Markov model,” IEEE Transactions on Signal Processing 48, 517–533 (2000).
[12] Marden, D. B. and Manolakis, D. G., “Modeling hyperspectral imaging data,” in [Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery IX, Proceedings of the SPIE], 253–262 (2003).
[13] Juang, B. H., “Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains,” in [AT&T Technical Journal], 1235–1249 (1985).
[14] Green, A. A., Berman, M., Switzer, P., and Craig, M. D., “A transformation for ordering multispectral data in terms of image quality with implications for noise removal,” IEEE Transactions on Geoscience and Remote Sensing 26, 65–74 (1988).
[15] Landgrebe, D., “AVIRIS derived Northwest Indiana's Indian Pines 1992 hyperspectral dataset,” http://dynamo.ecn.purdue.edu/~biehl/MultiSpec/documentation.html.