A MODIFIED BAUM-WELCH ALGORITHM FOR HIDDEN MARKOV MODELS WITH MULTIPLE OBSERVATION SPACES
Dr. Paul M. Baggenstoss
Naval Undersea Warfare Center
Newport, RI 02841
401-832-8240
p.m.baggenstoss@ieee.org
ABSTRACT

In this paper, a new algorithm based on the Baum-Welch algorithm for estimating the parameters of a hidden Markov model (HMM) is presented. It allows each state to be observed using a different set of features rather than relying on a common feature set. Each feature set is chosen to be a sufficient statistic for discrimination of the given state from a common "white-noise" state. Comparison of likelihood values is possible through the use of likelihood ratios. The new algorithm is the same in theory as the algorithm based on a common feature set, but without the necessity of estimating high-dimensional probability density functions (PDF's). A simulated data example is provided showing superior performance over the conventional HMM.

1. INTRODUCTION

Consider a hidden Markov model (HMM) for a process with N states numbered S_1 through S_N. Let the raw data be denoted X[t], for time steps t = 1, 2, ..., T. The parameters of the HMM, denoted \Lambda, comprise the state transition matrix A = \{a_{ij}\}, the state prior probabilities u_j, and the state observation densities b_j(X), where i and j range from 1 to N. These parameters can be estimated from training data using the Baum-Welch algorithm [1], [2]. But, because X[t] is often of high dimension, it may be necessary to reduce the raw data to a set of features z[t] = T(X[t]). We then define a new HMM with the same A and u_j, but with observations z[t], t = 1, 2, ..., T, and the state densities b_j(z) (we allow the argument of the density functions to imply the identity of the function; thus b_j(X) and b_j(z) are distinct). This is the approach used in speech processing today, where z[t] is usually a set of cepstral coefficients.

If z[t] is of low dimension, it is practical to apply probability density function (PDF) estimation methods such as Gaussian mixtures to estimate the state observation densities. Such PDF estimation methods tend to give poor results above dimensions of about 5 to 10 [3] unless the features are exceptionally well-behaved (close to independent or multivariate Gaussian). In human speech, it is doubtful that 5 to 10 features can capture all the relevant information in the data. Traditionally, the choices have been to (1) use a smaller and insufficient feature set, (2) use more features and suffer PDF estimation errors, or (3) apply methods of dimensionality reduction [4]. Such methods include linear subspace analysis [5], projection pursuit [6], or simply assuming the features are independent (a factorable PDF). All these methods involve assumptions that do not hold in general.

We now present a new method based on sufficient statistics that is completely general and is based on the class-specific classifier [7]. Central to the classical Baum-Welch algorithm [1] is the ability to calculate \gamma_t(j), the probability that state j is true at time t given the entire observation sequence X[1], ..., X[T]. These probabilities depend on the forward and backward probabilities, which in turn depend on the state likelihood functions b_j(X). It is possible to re-formulate the problem to compute \gamma_t(j) using low-dimensional PDF's. Consider an additional "common" state S_0 for which the raw data is pure independent Gaussian noise. Using the likelihood ratios

    \frac{b_j(X)}{b_0(X)}, \qquad j = 1, \ldots, N,    (1)

in place of the likelihood functions b_j(X) in the Baum-Welch algorithm causes only a scaling change, since the denominator is independent of j. Each of the ratios in (1) may be thought of as an optimal binary test or detector that can be re-written in terms of a state-dependent sufficient statistic, denoted z_j, j = 1, 2, ..., N. Thus, we have

    \frac{b_j(X)}{b_0(X)} = \frac{b_j(z_j)}{b_0(z_j)},    (2)

which follows from the sufficiency of z_j for state S_j vs. state S_0. Note that we use the same symbol b_j(.) to represent the density of X and of z_j; however, they are distinct (the argument of the function should make it clear which function is implied). Since these tests do not involve the other states, they are simpler, and z_j can be individually lower in dimension than z. Note that this requires prior knowledge about each state, i.e., that a given state is best distinguished from S_0 by a given statistic.¹ This prior knowledge is often available, but not used. It only requires some knowledge about the signal characteristics in each state, the same knowledge necessary for choosing features. Keep in mind that each state does not need a distinct set of statistics.

---
This work was supported by the Office of Naval Research (ONR-321US).
¹It is not necessary to know which state is true at any given time.
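To make (2) concrete, here is a minimal numerical sketch (our illustration, not from the paper): state S_j shifts the mean of n iid unit-variance samples while S_0 is zero-mean white noise, so the sample sum is a sufficient statistic, and the full-data log likelihood ratio coincides with the log ratio of the statistic's densities. The values n = 16 and mu = 0.5 are arbitrary toy choices.

```python
# Numerical check of the sufficiency identity (2) for a toy problem
# (illustration only): S_j puts mean mu on each of n iid unit-variance
# samples; S_0 is zero-mean unit-variance white noise. The sample sum
# z = sum(X) is a sufficient statistic for S_j vs. S_0.
import numpy as np

rng = np.random.default_rng(0)
n, mu = 16, 0.5                    # arbitrary toy parameters
X = rng.normal(mu, 1.0, size=n)    # one raw-data segment

def log_gauss(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

# Left side of (2): full-data log likelihood ratio log b_j(X) - log b_0(X).
llr_full = np.sum(log_gauss(X, mu, 1.0) - log_gauss(X, 0.0, 1.0))

# Right side of (2): the same ratio via the statistic z = sum(X),
# which is N(n*mu, n) under S_j and N(0, n) under S_0.
z = X.sum()
llr_stat = log_gauss(z, n * mu, n) - log_gauss(z, 0.0, n)

print(llr_full, llr_stat)          # agree to rounding error
```

The high-dimensional PDF of X never has to be estimated; only the low-dimensional densities of z enter the ratio.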
The class-specific HMM is an extension of the conventional HMM. If each state has the same set of statistics, the class-specific HMM specializes to the conventional HMM. When designing a class-specific HMM, one could allocate a subset of states to each class of data. Each subset of states would require only the statistics appropriate for that class of data. In fact, an entirely different feature-extraction processing, tailored to that type of signal, can be used. This can solve the problem that occurs in speech processing when short-duration plosives are forced to be analyzed using long analysis windows more appropriate for vowels. When compared to the standard feature-based HMM using a common feature set z, the densities in the numerators may be estimated using fewer training samples. Or, for a fixed number of training samples, more accuracy and robustness can be obtained. In addition, experience has shown that difficulties encountered in initialization of the Baum-Welch algorithm are greatly diminished when this prior knowledge is used.

A few words about the denominator densities b_0(z_j) are necessary. These densities may be approximated from synthetic white noise or derived analytically. For data that significantly departs from S_0, all denominators can tend to zero. In this case, the denominator densities b_0(z_j) must be solved for analytically or with an approximation valid in the far tails. The task of finding such solutions is helped by the fact that S_0 is characterized by pure iid Gaussian noise. In the computer simulations, we use exact formulas or approximations valid in the tails.
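The following sketch (an illustration under our own assumptions, not the paper's code) shows the synthetic-white-noise option for a simple statistic, z = x_1 + x_2, whose density under S_0 is exactly N(0, 2), and why such an estimate cannot be trusted in the far tails:

```python
# Sketch: approximating a denominator density b_0(z) from synthetic
# white noise, for the example statistic z = x_1 + x_2. Under S_0
# (iid N(0,1) samples) z is exactly N(0, 2), so the Monte Carlo
# estimate can be checked against the analytic answer.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(size=(200_000, 2))   # only the samples entering z are needed
z = draws[:, 0] + draws[:, 1]

# Histogram estimate of b_0(z), and the exact N(0, 2) density.
est, edges = np.histogram(z, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
exact = np.exp(-centers ** 2 / 4.0) / np.sqrt(4.0 * np.pi)

print(np.max(np.abs(est - exact)))      # small error near the mode

# In the far tails the histogram bins are empty, so the estimate fails
# exactly where data that departs from S_0 will land; there an analytic
# or tail-valid approximation of b_0(z) is required.
```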
2. MATHEMATICAL RESULTS

Details of the derivation are to be published [8]. Here, we provide only the algorithm.

2.1. The Class-Specific HMM

2.1.1. The class-specific forward procedure

1. Initialization (1 \le j \le N):

    \alpha_1(j) = u_j \, \frac{b_j(z_j[1])}{b_0(z_j[1])}    (3)

2. Induction (for 1 \le t \le T-1, 1 \le j \le N):

    \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] \frac{b_j(z_j[t+1])}{b_0(z_j[t+1])}    (4)

3. Termination:

    \frac{p(X[1],\ldots,X[T] \mid \Lambda)}{p(X[1],\ldots,X[T] \mid H_0)} = \sum_{j=1}^{N} \alpha_T(j),    (5)

where H_0 is the condition that state S_0 is true at every t.

2.1.2. The class-specific backward procedure

1. Initialization (1 \le i \le N):

    \beta_T(i) = 1    (6)

2. Induction (for t = T-1, \ldots, 1, 1 \le i \le N):

    \beta_t(i) = \sum_{j=1}^{N} a_{ij} \, \frac{b_j(z_j[t+1])}{b_0(z_j[t+1])} \, \beta_{t+1}(j)    (7)

2.1.3. HMM reestimation formulas

Define \gamma_t(j) as p(q_t = j \mid X[1], \ldots, X[T]). We have

    \gamma_t(j) = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}.    (8)

Let

    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, \dfrac{b_j(z_j[t+1])}{b_0(z_j[t+1])}\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}.    (9)

The updated state priors are \bar{u}_i = \gamma_1(i). The updated state transition matrix is

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.    (10)

2.1.4. Gaussian mixture reestimation formulas

We assume the following Gaussian mixture representation for each b_j(z_j):

    b_j(z_j) = \sum_{k=1}^{M} c_{jk} \, \mathcal{N}(z_j; \mu_{jk}, U_{jk}), \qquad 1 \le j \le N,    (11)

where

    \mathcal{N}(z; \mu, U) = (2\pi)^{-P_j/2}\, |U|^{-1/2}\, \exp\{ -\tfrac{1}{2} (z-\mu)' U^{-1} (z-\mu) \}

and P_j is the dimension of z_j. We may let M be independent of j as long as M is sufficiently large. Let

    \gamma_t(j,k) = \gamma_t(j) \, \frac{c_{jk}\, \mathcal{N}(z_j[t]; \mu_{jk}, U_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(z_j[t]; \mu_{jm}, U_{jm})}    (12)

and

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad
    \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, z_j[t]}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad
    \bar{U}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (z_j[t]-\bar{\mu}_{jk})(z_j[t]-\bar{\mu}_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}.    (13)
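The recursions above vectorize directly. The sketch below (ours, using randomly generated stand-in likelihoods rather than real features) implements (3), (4), (7), and (8), and verifies the scaling argument of the Introduction: dividing the state likelihoods at each time step by the common denominator b_0 leaves \gamma_t(j) unchanged.

```python
# Class-specific forward-backward sketch: L[t, j] plays the role of the
# likelihood ratio b_j(z_j[t]) / b_0(z_j[t]).
import numpy as np

def forward_backward(L, A, u):
    T, N = L.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = u * L[0]                               # initialization (3)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * L[t + 1]      # induction (4)
    # alpha[-1].sum() is the termination value (5).
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (L[t + 1] * beta[t + 1])        # induction (7)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # posterior (8)

rng = np.random.default_rng(2)
T, N = 50, 6
A = rng.dirichlet(np.ones(N), size=N)     # random row-stochastic matrix
u = np.full(N, 1.0 / N)
B = rng.uniform(0.1, 1.0, size=(T, N))    # stand-in state likelihoods b_j
b0 = rng.uniform(0.1, 1.0, size=(T, 1))   # stand-in common-state likelihoods b_0

print(np.allclose(forward_backward(B, A, u),          # conventional gammas
                  forward_backward(B / b0, A, u)))    # class-specific gammas: True
```

A production implementation would also rescale \alpha_t and \beta_t at each step to avoid underflow, exactly as in the conventional algorithm [1].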
2.2. Discussion

2.2.1. Relationship to conventional algorithm

The class-specific forward procedure terminates having computed the likelihood ratio (5), whereas the conventional algorithm computes only the numerator. The class-specific Baum-Welch algorithm maximizes (5) over \Lambda, which is equivalent to maximizing the numerator only. It is comforting to know that \gamma_t(j), \xi_t(i,j), \bar{u}_i, and \bar{a}_{ij} will be identical to those estimated by the conventional approach if (2) is true. There is no correspondence, however, between the Gaussian mixture parameters of the two methods except that the densities they approximate obey (2).

3. COMPUTER SIMULATION

3.1. Simulation details

To test the Class-Specific HMM, we created a synthetic Markov model with known sufficient statistics for each state. Comparison with the conventional approach was made by combining all the class-specific features into one common feature set. In this way, neither method has an unfair advantage with respect to features. The following state transition matrix and state priors were used:

    A = \begin{bmatrix}
    0.7 & 0.3 & 0   & 0   & 0   & 0   \\
    0   & 0.7 & 0.3 & 0   & 0   & 0   \\
    0   & 0   & 0.7 & 0.3 & 0   & 0   \\
    0   & 0   & 0   & 0.7 & 0.3 & 0   \\
    0   & 0   & 0   & 0   & 0.7 & 0.3 \\
    0.1 & 0   & 0   & 0   & 0   & 0.9
    \end{bmatrix}

and u_j = 1/6 for all j. For each t, a state is chosen randomly according to the Markov model. Then, an N-sample segment of a time series is created from the statistical model corresponding to the chosen state (N = 256). In our notation, we have X[t] = {x_1[t], x_2[t], ..., x_N[t]}. Features are then computed from each segment and used as an observation vector of an HMM. The common state, S_0, is iid Gaussian noise of zero mean and unit variance, denoted N(0,1). For each model, a sufficient statistic is chosen for distinguishing the signal from H_0. These are denoted z_1 through z_6. Table 1 lists each signal description, the sufficient statistic, and the distribution under state S_0. The criteria for selecting these synthetic signals were (a) the signals are easy to describe and produce, (b) a sufficient or approximately sufficient statistic was known for distinguishing each state from S_0, (c) these statistics had a known density under state S_0, and (d) the signals and statistics were diverse and statistically dependent.

State 1. Description: An impulse of duration 2 samples occurring on samples 1 and 2 of the time-series, with additive Gaussian noise of variance 1. Model: x_t = 2(\delta[t-1] + \delta[t-2]) + n_t, where n_t is distributed N(0,1). Sufficient statistic: z_1 = x_1 + x_2. log-PDF: \log b_0(z_1) = -0.5 \log(4\pi) - 0.25 z_1^2.

State 2. Description: Same as state 1, but on samples 2 and 3. Model: x_t = 2(\delta[t-2] + \delta[t-3]) + n_t, where n_t is distributed N(0,1). Sufficient statistic: z_2 = x_2 + x_3. log-PDF: \log b_0(z_2) = -0.5 \log(4\pi) - 0.25 z_2^2.

State 3. Description: Signal type 3 is iid Gaussian noise of zero mean and variance 1.7. Model: x_t = n_t, where n_t is distributed N(0,1.7). Sufficient statistic: z_3 = \log \{ \sum_{i=1}^{N} x_i^2 \}. log-PDF: \log b_0(z_3) = -\log \Gamma(N/2) - (N/2) \log(2) + (N/2) z_3 - \exp(z_3)/2.

States 4 and 5. Description: Signal types 4 and 5 are closely-spaced sinewaves in Gaussian noise. Model: x_t = n_t + a \sin(\omega_k t + \phi), k = 4, 5, where a = 0.4, \omega_4 = 0.100, and \omega_5 = 0.101 radians per sample. Sufficient statistic: z_k = \log \{ | \sum_{t=1}^{N} x_t e^{-j \omega_k t} |^2 \}, k = 4, 5. log-PDF: \log b_0(z_k) = -\log N + z_k - \exp(z_k)/N, k = 4, 5.

State 6. Description: Signal type 6 is autoregressive noise of variance 1. Model: x_t = [-a_1 x_{t-1} - a_2 x_{t-2} + n_t] \alpha, where n_t is distributed N(0,1), a_1 = -0.75, a_2 = 0.78, and \alpha = 0.5675 (\alpha is chosen such that the variance of x_t is 1). Approximate sufficient statistic: the first and second lags of the normalized autocorrelation estimates computed circularly, z_6 = \{\hat{r}_1, \hat{r}_2\}, where \hat{r}_k = \hat{c}_k / \hat{c}_0 and \hat{c}_k = (1/N) \sum_{i=1}^{N} x_i x_{(i-k) \bmod N}. log-PDF: details provided in [8].

Table 1: Description of each signal type, sufficient statistic (SS), and distribution of the SS under state S_0.

To illustrate the strength of the Class-Specific algorithm, it was compared with two classical implementations. The classical implementations used a 7-dimensional feature vector consisting of all features used by the Class-Specific (CS) approach. Two variations of the classical approach were created by assuming the Gaussian mixtures had (1) a general covariance matrix ("classical" or CL approach) or (2) a diagonal structure, which assumes all the features are statistically independent ("independence assumption" or IA approach). We will refer to these three methods by the abbreviations CS, CL, and IA.

The multiple-record implementation of the HMM described in Section V.B of Rabiner [1] (page 273) was used. All training and testing data were organized into records with a constant length of 99 data segments per record (T_r = 99). Each segment consisted of N = 256 time samples. Features were computed on each segment (a segment is associated with a time-step or observation of the HMM). In the experiment, algorithm performance was determined as a function of R, the number of training records. We used R = 1, 2, 5, 10, 20, 40, 80, 160, and 320 records. To measure algorithm performance of all implementations, the appropriate version of the Baum-Welch algorithm was first run to convergence on the available training data. Next, the Viterbi algorithm² was used to determine the most likely state sequence on a separate pool of 640 records of testing data. The number of state errors divided by the number of opportunities (640 times 99) was used as a performance metric. For more accurate determination of algorithm performance, sixteen independent trials were made at each value of R.

---
²The algorithm in Rabiner [1], pp. 263-264, appropriately modified for the CS method by substituting likelihood ratios b_j(z_j[t])/b_0(z_j[t]) in place of b_j(X[t]).
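For concreteness, here is a sketch of how a segment of each signal type in Table 1 and its sufficient statistic could be generated (our reconstruction from the table's formulas; the phase \phi is drawn uniformly at random and the AR recursion starts from zero initial conditions, neither of which the paper specifies):

```python
# Generate one N-sample segment per signal type of Table 1 and compute
# its sufficient statistic (reconstruction for illustration).
import numpy as np

rng = np.random.default_rng(3)
N = 256
t = np.arange(1, N + 1)               # sample index t = 1..N

def segment(state):
    n = rng.normal(size=N)            # unit-variance Gaussian noise
    if state in (1, 2):               # 2-sample impulse plus noise
        x = n.copy()
        start = state - 1             # samples 1-2 (state 1) or 2-3 (state 2)
        x[start:start + 2] += 2.0
        return x
    if state == 3:                    # iid noise of variance 1.7
        return np.sqrt(1.7) * n
    if state in (4, 5):               # sinewave in noise
        w = 0.100 if state == 4 else 0.101
        phi = rng.uniform(0.0, 2.0 * np.pi)   # assumed random phase
        return n + 0.4 * np.sin(w * t + phi)
    if state == 6:                    # AR(2) noise scaled to unit variance
        a1, a2, alpha = -0.75, 0.78, 0.5675
        x = np.zeros(N)               # assumed zero initial conditions
        for i in range(N):
            x[i] = alpha * (-a1 * (x[i - 1] if i >= 1 else 0.0)
                            - a2 * (x[i - 2] if i >= 2 else 0.0) + n[i])
        return x

def statistic(state, x):
    if state in (1, 2):               # z_1 = x_1 + x_2, z_2 = x_2 + x_3
        return x[state - 1] + x[state]
    if state == 3:                    # z_3 = log of the sum of squares
        return np.log(np.sum(x ** 2))
    if state in (4, 5):               # z_k = log |DFT at omega_k|^2
        w = 0.100 if state == 4 else 0.101
        return np.log(np.abs(np.sum(x * np.exp(-1j * w * t))) ** 2)
    if state == 6:                    # normalized circular autocorrelations
        c = [np.mean(x * np.roll(x, k)) for k in (0, 1, 2)]
        return np.array([c[1] / c[0], c[2] / c[0]])

for s in range(1, 7):
    print(s, statistic(s, segment(s)))
```

Stacking z_1 through z_6 (with the 2-dimensional z_6) gives the 7-dimensional common feature vector used by the CL and IA baselines.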
3.1.1. Catastrophic errors and "assisted" training

The CL and IA approaches encountered severe problems with initialization. More often than not, a poor initialization resulted in the Baum-Welch algorithm finding the incorrect stationary point. This resulted in catastrophic errors. This appeared to be caused by an inability to distinguish two or more states. The number of catastrophic errors decreased as the number of training records increased. The CS method did not encounter any catastrophic errors. To study only the PDF estimation issue apart from any initialization issues, it was decided to "assist" the CL and IA approaches by providing a good initialization point. This initialization point was found by training separately on labeled data from each of the N states, then using these PDF's as the state PDF's with the known u and A. In short, the algorithm was initialized with the true parameter values. This "good" parameter set was used as the starting point for all trials of the CL and IA approaches. Doing this provides somewhat of a "lower bound" on error performance. The CS results are unassisted.

3.2. Main results

A plot of median error probability for the CL, CS, and IA methods is provided in Figure 1.

[Figure 1: Probability of state classification error as a function of the number of training records (R) for three methods; median performance of 16 trials. The CL and IA methods are assisted by providing a "good" initialization point.]
The IA approach has a higher limiting error rate, due to the built-in assumption of independence, which is not valid. It has, however, nearly identical convergence behavior to the CS method. This is expected, since the dimension of the PDF estimates is similar for the two approaches.

4. CONCLUSIONS

A class-specific implementation of the Baum-Welch algorithm has been developed and verified to work in principle. A computer simulation used a 6-state HMM with six synthetic signal models. The class-specific feature dimensions were either 1 or 2. When performance was measured in terms of state errors in the most likely state sequence, the class-specific method greatly outperformed the classic HMM approach, which operated on a combined 7-dimensional feature set. The improved performance was due to the lower dimension of the feature spaces, made possible by knowledge of the appropriate feature set for each state. It also outperformed a method that assumed the features were statistically independent. In this case, the improved performance comes from the fact that the class-specific approach does not compromise theoretical performance when reducing dimension.
5. REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, February 1989.
[2] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.
[3] D. W. Scott, Multivariate Density Estimation. Wiley, 1992.
[4] S. Aeberhard, D. Coomans, and O. de Vel, "Comparative analysis of statistical pattern recognition methods in high dimensional settings," Pattern Recognition, vol. 27, no. 8, pp. 1065-1077, 1994.
[5] H. Watanabe and S. Katagiri, "Discriminative subspace method for minimum error pattern recognition," in Proc. 1995 IEEE Workshop on Neural Networks for Signal Processing, pp. 77-86, 1995.
[6] P. J. Huber, "Projection pursuit," Annals of Statistics, vol. 13, no. 2, pp. 435-475, 1985.
[7] P. M. Baggenstoss, "Class-specific features in classification," IEEE Trans. Signal Processing, December 1999.
[8] P. M. Baggenstoss, "A modified Baum-Welch algorithm for hidden Markov models with multiple observation spaces," IEEE Trans. Speech and Audio, to appear 2000.