A MODIFIED BAUM-WELCH ALGORITHM FOR HIDDEN MARKOV
MODELS WITH MULTIPLE OBSERVATION SPACES

Dr. Paul M. Baggenstoss

Naval Undersea Warfare Center
Newport RI, 02841
401-832-8240
p.m.baggenstoss@ieee.org


ABSTRACT

In this paper, a new algorithm based on the Baum-Welch algorithm for estimating the parameters of a hidden Markov model (HMM) is presented. It allows each state to be observed using a different set of features rather than relying on a common feature set. Each feature set is chosen to be a sufficient statistic for discrimination of the given state from a common "white-noise" state. Comparison of likelihood values is possible through the use of likelihood ratios. The new algorithm is the same in theory as the algorithm based on a common feature set, but without the necessity of estimating high-dimensional probability density functions (PDF's). A simulated data example is provided showing superior performance over the conventional HMM.

1. INTRODUCTION

Consider a hidden Markov model (HMM) for a process with N states numbered S_1 through S_N. Let the raw data be denoted X[t], for time steps t = 1, 2, ..., T. The parameters of the HMM, denoted \Lambda, comprise the state transition matrix A = \{a_{ij}\}, the state prior probabilities u_j, and the state observation densities b_j(X), where i and j range from 1 to N. These parameters can be estimated from training data using the Baum-Welch algorithm [1], [2]. But, because X[t] is often of high dimension, it may be necessary to reduce the raw data to a set of features z[t] = T(X[t]). We then define a new HMM with the same A and u_j, but with observations z[t], t = 1, 2, ..., T, and the state densities b_j(z) (we allow the argument of the density functions to imply the identity of the function, thus b_j(X) and b_j(z) are distinct). This is the approach used in speech processing today, where z[t] are usually a set of cepstral coefficients. If z[t] is of low dimension, it is practical to apply probability density function (PDF) estimation methods such as Gaussian mixtures to estimate the state observation densities. Such PDF estimation methods tend to give poor results above dimensions of about 5 to 10 [3] unless the features are exceptionally well-behaved (close to independent or multivariate Gaussian). In human speech, it is doubtful that 5 to 10 features can capture all the relevant information in the data. Traditionally, the choices have been (1) use a smaller and insufficient feature set, (2) use more features and suffer PDF estimation errors, or (3) apply methods of dimensionality reduction [4]. Such methods include linear subspace analysis [5], projection pursuit [6], or simply assuming the features are independent (a factorable PDF). All these methods involve assumptions that do not hold in general. We now present a new method based on sufficient statistics that is completely general and is based on the class-specific classifier [7].

Central to the classical Baum-Welch algorithm [1] is the ability to calculate \gamma_t(j), the probability that state j is true at time t given the entire observation sequence X[1], ..., X[T]. These probabilities depend on the forward and backward probabilities, which in turn depend on the state likelihood functions b_j(X). It is possible to re-formulate the problem to compute \gamma_t(j) using low-dimensional PDF's. Consider an additional "common" state S_0 for which the raw data is pure independent Gaussian noise. Using the likelihood ratios

    \frac{b_j(X)}{b_0(X)}, \qquad j = 1, \ldots, N,    (1)

in place of the likelihood functions b_j(X) in the Baum-Welch algorithm causes only a scaling change, since the denominator is independent of j. Each of the ratios in (1) may be thought of as an optimal binary test or detector that can be re-written in terms of a state-dependent sufficient statistic, denoted z_j, j = 1, 2, ..., N. Thus, we have

    \frac{b_j(X)}{b_0(X)} = \frac{b_j(z_j)}{b_0(z_j)},    (2)

which follows from the sufficiency of z_j for state S_j vs. state S_0. Note that we use the same symbol b_j(.) to represent the density of X and z_j; however, they are distinct (the argument of the function should make it clear which function is implied). Since these tests do not involve the other states, they are simpler, and z_j can be individually lower in dimension than z. Note that this requires prior knowledge about each state, i.e., that a given state is best distinguished from S_0 by a given statistic¹. This prior knowledge is often available, but not used. It only requires some knowledge about the signal characteristics in each state, the same knowledge necessary for choosing features.

*This work was supported by the Office of Naval Research (ONR-321US).
¹It is not necessary to know which state is true at any given time.
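As a quick numerical check of the scaling argument above (the toy transition matrix and likelihood values below are ours, not from the paper), one can verify that dividing every state likelihood by a common per-step factor b_0 leaves the state posteriors \gamma_t(j) unchanged:

    import numpy as np

    def gammas(b, A, u):
        """Forward-backward state posteriors for per-step likelihoods b (T, N)."""
        T, N = b.shape
        alpha = np.zeros((T, N))
        beta = np.ones((T, N))
        alpha[0] = u * b[0]
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * b[t + 1]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (b[t + 1] * beta[t + 1])
        g = alpha * beta
        return g / g.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(1)
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    u = np.array([0.5, 0.5])
    b = rng.uniform(0.1, 1.0, size=(5, 2))    # toy state likelihoods b_j
    b0 = rng.uniform(0.1, 1.0, size=(5, 1))   # common-state likelihood b_0
    assert np.allclose(gammas(b, A, u), gammas(b / b0, A, u))

The assertion holds because the common factor multiplies \alpha_t and \beta_t by constants that cancel in the normalization.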
Keep in mind that each state does not need a distinct set of statistics. The class-specific HMM is an extension of the conventional HMM: if each state has the same set of statistics, the class-specific HMM specializes to the conventional HMM. When designing a class-specific HMM, one could allocate a subset of states to each class of data. Each subset of states would require only the statistics appropriate for that class of data. In fact, an entirely different feature-extraction processing, tailored to that type of signal, can be used. This can solve the problem that occurs in speech processing when short-duration plosives are forced to be analyzed using long analysis windows that are more appropriate for vowels. When compared to the standard feature-based HMM using a common feature set z, the densities of the numerators may be estimated using fewer training samples. Or, for a fixed number of training samples, more accuracy and robustness can be obtained. In addition, experience has shown that difficulties encountered in initialization of the Baum-Welch algorithm are greatly diminished when this prior knowledge is used.

A few words about the denominator densities b_0(z_j) are necessary. These densities may be approximated from synthetic white noise or derived analytically. For data that significantly departs from S_0, all denominators can tend to zero. In this case, the denominator densities b_0(z_j) must be solved for analytically or with an approximation valid in the far tails. The task of finding solutions valid in the tails is helped by the fact that S_0 is characterized by pure iid Gaussian noise. In the computer simulations, we use exact formulas or approximations valid in the tails.

2. MATHEMATICAL RESULTS

Details of the derivation are to be published [8]. Here, we provide only the algorithm.

2.1. The Class-Specific HMM

2.1.1. The class-specific forward procedure

1. Initialization:

    \alpha_1(j) = u_j \, \frac{b_j(z_j[1])}{b_0(z_j[1])}, \quad 1 \le j \le N.    (3)

2. Induction (for 1 \le t \le T - 1, 1 \le j \le N):

    \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] \frac{b_j(z_j[t+1])}{b_0(z_j[t+1])}.    (4)

3. Termination:

    \frac{p(X[1], \ldots, X[T] \mid \Lambda)}{p(X[1], \ldots, X[T] \mid H_0)} = \sum_{j=1}^{N} \alpha_T(j),    (5)

where H_0 is the condition that state S_0 is true at every t.
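A minimal NumPy sketch of the forward procedure as reconstructed above; the function name and array layout are our own conventions, and the per-state log-densities are assumed to be computed elsewhere. In practice, scaling as in Rabiner [1] would be needed for long sequences.

    import numpy as np

    def class_specific_forward(log_bj, log_b0, A, u):
        """Class-specific forward procedure, Eqs. (3)-(5).

        log_bj : (T, N) array of log b_j(z_j[t]) for each state j
        log_b0 : (T, N) array of log b_0(z_j[t]) under the common state S0
        A      : (N, N) state transition matrix {a_ij}
        u      : (N,)   state prior probabilities u_j
        Returns alpha (T, N) and the terminal likelihood ratio of Eq. (5).
        """
        T, N = log_bj.shape
        ratio = np.exp(log_bj - log_b0)      # b_j(z_j[t]) / b_0(z_j[t])
        alpha = np.zeros((T, N))
        alpha[0] = u * ratio[0]              # Eq. (3): initialization
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * ratio[t + 1]   # Eq. (4): induction
        return alpha, alpha[-1].sum()        # Eq. (5): termination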




2.1.2. The class-specific backward procedure

1. Initialization:

    \beta_T(i) = 1.    (6)

2. Induction (for t = T - 1, \ldots, 1, 1 \le i \le N):

    \beta_t(i) = \sum_{j=1}^{N} a_{ij} \, \frac{b_j(z_j[t+1])}{b_0(z_j[t+1])} \, \beta_{t+1}(j).    (7)

2.1.3. HMM reestimation formulas

Define \gamma_t(j) as p(q_t = j \mid X[1], \ldots, X[T]). We have

    \gamma_t(j) = \frac{\alpha_t(j) \, \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i)}.    (8)

Let

    \xi_t(i,j) = \frac{\alpha_t(i) \, a_{ij} \, \frac{b_j(z_j[t+1])}{b_0(z_j[t+1])} \, \beta_{t+1}(j)}{\sum_{m=1}^{N} \alpha_t(m) \, \beta_t(m)}.    (9)

The updated state priors are \bar{u}_i = \gamma_1(i). The updated state transition matrix is

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.    (10)
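Continuing the sketch, a backward pass and reestimation step implementing Eqs. (6)-(10) under the same assumed array layout (names ours):

    import numpy as np

    def class_specific_reestimate(alpha, ratio, A):
        """Backward pass and reestimation, Eqs. (6)-(10).

        alpha : (T, N) forward variables from class_specific_forward
        ratio : (T, N) likelihood ratios b_j(z_j[t]) / b_0(z_j[t])
        A     : (N, N) current transition matrix
        Returns updated priors u_bar, transition matrix A_bar, and gamma.
        """
        T, N = alpha.shape
        beta = np.ones((T, N))                            # Eq. (6)
        for t in range(T - 2, -1, -1):                    # Eq. (7)
            beta[t] = A @ (ratio[t + 1] * beta[t + 1])
        norm = (alpha * beta).sum(axis=1, keepdims=True)
        gamma = alpha * beta / norm                       # Eq. (8)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (ratio[1:] * beta[1:])[:, None, :]) / norm[:-1, :, None]  # Eq. (9)
        u_bar = gamma[0]                                  # updated priors
        A_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]        # Eq. (10)
        return u_bar, A_bar, gamma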




2.1.4. Gaussian mixture reestimation formulas

We assume the following Gaussian mixture representation for each b_j(z_j):

    b_j(z_j) = \sum_{k=1}^{M} c_{jk} \, \mathcal{N}(z_j, \mu_{jk}, U_{jk}), \quad 1 \le j \le N,    (11)

where

    \mathcal{N}(z, \mu, U) \triangleq (2\pi)^{-P_j/2} \, |U|^{-1/2} \, e^{-\frac{1}{2}(z-\mu)' U^{-1} (z-\mu)},

and P_j is the dimension of z_j. We may let M be independent of j as long as M is sufficiently large. Let

    \gamma_t(j,k) = \gamma_t(j) \, \frac{c_{jk} \, \mathcal{N}(z_j[t], \mu_{jk}, U_{jk})}{\sum_{m=1}^{M} c_{jm} \, \mathcal{N}(z_j[t], \mu_{jm}, U_{jm})}    (12)

and

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \gamma_t(j)}, \quad \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k) \, z_j[t]}{\sum_{t=1}^{T} \gamma_t(j,k)}, \quad \bar{U}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k) \, (z_j[t]-\bar{\mu}_{jk})(z_j[t]-\bar{\mu}_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}.    (13)
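Assuming Eqs. (12)-(13) take the standard mixture-reestimation form given above, one EM update of the mixture for a single state j might look like the following sketch (all names ours; gamma_j is a column of \gamma_t(j) from the previous step):

    import numpy as np
    from scipy.stats import multivariate_normal

    def reestimate_mixture(z, gamma_j, c, mu, U):
        """One EM update of the Gaussian mixture for a single state j.

        z       : (T, P) observed statistics z_j[t]
        gamma_j : (T,)   state posteriors gamma_t(j)
        c, mu, U: mixture weights (M,), means (M, P), covariances (M, P, P)
        """
        M = len(c)
        # Eq. (12): joint state/mixture posteriors gamma_t(j, k)
        pk = np.stack([c[k] * multivariate_normal.pdf(z, mu[k], U[k])
                       for k in range(M)], axis=1)
        g = gamma_j[:, None] * pk / pk.sum(axis=1, keepdims=True)
        w = g.sum(axis=0)
        c_new = w / gamma_j.sum()                        # Eq. (13): weights
        mu_new = (g.T @ z) / w[:, None]                  # Eq. (13): means
        U_new = np.stack([((z - mu_new[k]).T * g[:, k]) @ (z - mu_new[k]) / w[k]
                          for k in range(M)])            # Eq. (13): covariances
        return c_new, mu_new, U_new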
2.2. Discussion

2.2.1. Relationship to the conventional algorithm

The class-specific forward procedure terminates having computed the likelihood ratio (5), whereas the conventional algorithm computes only the numerator. The class-specific Baum-Welch algorithm maximizes (5) over \Lambda, which is equivalent to maximizing the numerator only. It is comforting to know that \gamma_t(j), \xi_t(i,j), \bar{u}_i, and \bar{a}_{ij} will be identical to those estimated by the conventional approach if (2) is true. There is no correspondence, however, between the Gaussian mixture parameters of the two methods except that the densities they approximate obey (2).

3. COMPUTER SIMULATION

3.1. Simulation details

To test the class-specific HMM, we created a synthetic Markov model with known sufficient statistics for each state. Comparisons with the conventional approach were made by combining all the class-specific features into one common feature set. In this way, neither method has an unfair advantage with respect to features. The following state transition matrix and state priors were used:

    A = \begin{pmatrix}
        0.7 & 0.3 & 0   & 0   & 0   & 0   \\
        0   & 0.7 & 0.3 & 0   & 0   & 0   \\
        0   & 0   & 0.7 & 0.3 & 0   & 0   \\
        0   & 0   & 0   & 0.7 & 0.3 & 0   \\
        0   & 0   & 0   & 0   & 0.7 & 0.3 \\
        0.1 & 0   & 0   & 0   & 0   & 0.9
        \end{pmatrix}

and u_j = 1/6 for all j. For each t, a state is chosen randomly according to the Markov model. Then, an N-sample segment of a time series is created from the statistical model corresponding to the chosen state (N = 256). In our notation, we have X[t] = \{x_1[t], x_2[t], \ldots, x_N[t]\}. Features are then computed from each segment and used as an observation vector of an HMM. The common state, S_0, is iid Gaussian noise of zero mean and unit variance, denoted N(0, I). For each model, a sufficient statistic is chosen for distinguishing the signal from H_0. These are denoted z_1 through z_6. Table 1 lists each signal description, the sufficient statistic, and the distribution under state S_0. The criteria for selecting these synthetic signals were (a) the signals are easy to describe and produce, (b) a sufficient or approximately sufficient statistic was known for distinguishing each state from S_0, (c) these statistics had a known density under state S_0, and (d) the signals and statistics were diverse and statistically dependent.
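To make the simulation setup concrete, here is a hedged sketch of the data generation: a state sequence is drawn from u and A, and a 256-sample segment is synthesized per step. Only the models for states 1, 3, and 6 from Table 1 below are shown, and all function names are ours.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0.7, 0.3, 0,   0,   0,   0  ],
                  [0,   0.7, 0.3, 0,   0,   0  ],
                  [0,   0,   0.7, 0.3, 0,   0  ],
                  [0,   0,   0,   0.7, 0.3, 0  ],
                  [0,   0,   0,   0,   0.7, 0.3],
                  [0.1, 0,   0,   0,   0,   0.9]])
    u = np.full(6, 1 / 6)

    def sample_states(T):
        """Draw a state sequence q[1..T] from priors u and transitions A."""
        q = [rng.choice(6, p=u)]
        for _ in range(T - 1):
            q.append(rng.choice(6, p=A[q[-1]]))
        return np.array(q)

    def make_segment(state, N=256):
        """Synthesize one N-sample segment for a few of the Table 1 models."""
        n = rng.standard_normal(N)
        if state == 0:                      # state 1: impulse on samples 1-2
            x = n.copy()
            x[:2] += 2.0
            return x
        if state == 2:                      # state 3: iid N(0, 1.7)
            return np.sqrt(1.7) * n
        if state == 5:                      # state 6: scaled AR(2) noise
            a1, a2, alpha = -0.75, 0.78, 0.5675
            x = np.zeros(N)                 # zero initial conditions
            for t in range(N):
                x[t] = alpha * (-a1 * x[t - 1] - a2 * x[t - 2] + n[t])
            return x
        raise NotImplementedError("remaining states follow the same pattern")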
Table 1: Description of each signal type, sufficient statistics (SS), and distribution of the SS under state S_0.

State 1
Description: An impulse of duration 2 samples occurring on samples 1 and 2 of the time series, with additive Gaussian noise of variance 1.
Model: x_t = 2(\delta[t-1] + \delta[t-2]) + n_t, where n_t is distributed N(0, 1).
Sufficient statistic: z_1 = x_1 + x_2.
log-PDF: \log b_0(z_1) = -0.5 \log(4\pi) - 0.25 z_1^2.

State 2
Description: Same as state 1, but on samples 2 and 3.
Model: x_t = 2(\delta[t-2] + \delta[t-3]) + n_t, where n_t is distributed N(0, 1).
Sufficient statistic: z_2 = x_2 + x_3.
log-PDF: \log b_0(z_2) = -0.5 \log(4\pi) - 0.25 z_2^2.

State 3
Description: Signal type 3 is iid Gaussian noise of zero mean and variance 1.7.
Model: x_t = n_t, where n_t is distributed N(0, 1.7).
Sufficient statistic: z_3 = \log \{ \sum_{i=1}^{N} x_i^2 \}.
log-PDF: \log b_0(z_3) = -\log \Gamma(N/2) - (N/2) \log(2) + (N/2) z_3 - \exp(z_3)/2.

States 4 and 5
Description: Signal types 4 and 5 are closely-spaced sinewaves in Gaussian noise.
Model: x_t = n_t + a \sin(\omega_k t + \phi), k = 4, 5, where a = 0.4, \omega_4 = 0.100, and \omega_5 = 0.101 radians per sample.
Sufficient statistic: z_k = \log \{ | \sum_{t=1}^{N} x_t e^{-j\omega_k t} |^2 \}, k = 4, 5.
log-PDF: \log b_0(z_k) = -\log N - \exp(z_k)/N + z_k, k = 4, 5.

State 6
Description: Signal type 6 is autoregressive noise of variance 1.
Model: x_t = [-a_1 x_{t-1} - a_2 x_{t-2} + n_t] \, \alpha, where n_t is distributed N(0, 1), a_1 = -0.75, a_2 = 0.78, and \alpha = 0.5675 (\alpha is chosen such that the variance of x_t is 1).
Approximate sufficient statistic: the first and second lags of the normalized autocorrelation estimates, computed circularly: z_6 = \{\hat{r}_1, \hat{r}_2\}, where \hat{r}_k = \hat{c}_k / \hat{c}_0 and \hat{c}_k = \frac{1}{N} \sum_{i=1}^{N} x_i \, x_{(i-k) \bmod N}.
log-PDF: Details provided in [8].
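The statistics in Table 1 are inexpensive to compute. A sketch for z_1, z_3, z_4/z_5, and z_6, together with the closed-form log b_0 densities for the first two (function names are ours):

    import numpy as np
    from scipy.special import gammaln

    def stats_from_segment(x):
        """Compute the Table 1 sufficient statistics from one segment x."""
        N = len(x)
        z1 = x[0] + x[1]                                    # state 1: x_1 + x_2
        z3 = np.log(np.sum(x ** 2))                         # state 3: log energy
        z45 = [np.log(np.abs(np.sum(x * np.exp(-1j * w * np.arange(1, N + 1)))) ** 2)
               for w in (0.100, 0.101)]                     # states 4 and 5
        c = [np.mean(x * np.roll(x, k)) for k in range(3)]  # circular autocorrelation
        z6 = (c[1] / c[0], c[2] / c[0])                     # state 6
        return z1, z3, z45, z6

    def log_b0_z1(z1):
        """log b_0(z_1): z_1 is N(0, 2) under S_0."""
        return -0.5 * np.log(4 * np.pi) - 0.25 * z1 ** 2

    def log_b0_z3(z3, N=256):
        """log b_0(z_3): exp(z_3) is chi-squared with N dof under S_0."""
        return -gammaln(N / 2) - (N / 2) * np.log(2) + (N / 2) * z3 - np.exp(z3) / 2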
To illustrate the strength of the class-specific algorithm, it was compared with two classical implementations. The classical implementations used a 7-dimensional feature vector consisting of all features used by the class-specific (CS) approach. Two variations of the classical approach were created by assuming the Gaussian mixtures had (1) a general covariance matrix ("classical" or CL approach) or (2) a diagonal structure which assumes all the features are statistically independent ("independence assumption" or IA approach). We will refer to these three methods by the abbreviations CS, CL, and IA.

The multiple-record implementation of the HMM described in Section V.B of Rabiner [1] (page 273) was used. All training and testing data was organized into records with a constant length of 99 data segments per record (T_r = 99). Each segment consisted of N = 256 time samples. Features were computed on each segment (a segment is associated with a time-step or observation of the HMM). In the experiment, algorithm performance was determined as a function of R, the number of training records. In the experiments, we used R = 1, 2, 5, 10, 20, 40, 80, 160, and 320 records. To measure algorithm performance of all implementations, the appropriate version of the Baum-Welch algorithm was first run to convergence on the available training data. Next, the Viterbi algorithm² was used to determine the most likely state sequence on a separate pool of 640 records of testing data.

²The algorithm in Rabiner [1], pp. 263-264, appropriately modified for the CS method by substituting the likelihood ratios b_j(z_j[t])/b_0(z_j[t]) in place of b_j(X[t]).
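The substitution described in the footnote carries over directly to the Viterbi recursion; a minimal log-domain sketch under our own naming conventions:

    import numpy as np

    def class_specific_viterbi(log_ratio, A, u):
        """Most likely state sequence with b_j replaced by b_j / b_0.

        log_ratio : (T, N) array of log[b_j(z_j[t]) / b_0(z_j[t])]
        """
        T, N = log_ratio.shape
        logA = np.log(A + 1e-300)          # guard against log(0)
        delta = np.log(u) + log_ratio[0]
        psi = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + logA   # cand[i, j] = delta_i + log a_ij
            psi[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + log_ratio[t]
        q = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):      # backtrack
            q.append(int(psi[t][q[-1]]))
        return q[::-1]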
The number of state errors divided by the number of opportunities (640 times 99) was used as a performance metric. For more accurate determination of algorithm performance, sixteen independent trials were made at each value of R.

3.1.1. Catastrophic errors and "assisted" training

The CL and IA approaches encountered severe problems with initialization. More often than not, a poor initialization resulted in the Baum-Welch algorithm finding the incorrect stationary point. This resulted in catastrophic errors, and appeared to be caused by an inability to distinguish two or more states. The number of catastrophic errors decreased as the number of training records increased. The CS method did not encounter any catastrophic errors. To study only the PDF estimation issue apart from any initialization issues, it was decided to "assist" the CL and IA approaches by providing a good initialization point. This initialization point was found by training separately on labeled data from each of the N states, then using these PDF's as the state PDF's with the known u and A. In short, the algorithm was initialized with the true parameter values. This "good" parameter set was used as the starting point for all trials of the CL and IA approaches. Doing this provides somewhat of a "lower bound" on error performance. The CS results are unassisted.

[Figure 1: Probability of state classification error as a function of the number of training records (R) for three methods (median performance of 16 trials). The CL and IA methods are assisted by providing a "good" initialization point.]

3.2. Main results

A plot of median error probability for the CL, CS, and IA methods is provided in Figure 1. The IA approach has a higher limiting error rate, due to the built-in assumption of independence, which is not valid. It has, however, nearly identical convergence behavior to the CS method. This is expected, since the dimension of the PDF estimates is similar for the two approaches.

4. CONCLUSIONS

A class-specific implementation of the Baum-Welch algorithm has been developed and verified to work in principle. A computer simulation used a 6-state HMM with six synthetic signal models. The class-specific feature dimensions were either 1 or 2. When the performance was measured in terms of state errors in the most likely state sequence, the class-specific method greatly outperformed the classic HMM approach, which operated on a combined 7-dimensional feature set. The improved performance was due to the lower dimension of the feature spaces made possible by knowledge of the appropriate feature set for each state. It also outperformed a method that assumed the features were statistically independent. In this case, the improved performance comes from the fact that the class-specific approach does not compromise theoretical performance when reducing dimension.

5. REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, February 1989.
[2] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.
[3] D. W. Scott, Multivariate Density Estimation. Wiley, 1992.
[4] S. Aeberhard, D. Coomans, and O. de Vel, "Comparative analysis of statistical pattern recognition methods in high dimensional settings," Pattern Recognition, vol. 27, no. 8, pp. 1065-1077, 1994.
[5] H. Watanabe and S. Katagiri, "Discriminative subspace method for minimum error pattern recognition," in Proc. 1995 IEEE Workshop on Neural Networks for Signal Processing, pp. 77-86, 1995.
[6] P. J. Huber, "Projection pursuit," Annals of Statistics, vol. 13, no. 2, pp. 435-475, 1985.
[7] P. M. Baggenstoss, "Class-specific features in classification," IEEE Trans. Signal Processing, December 1999.
[8] P. M. Baggenstoss, "A modified Baum-Welch algorithm for hidden Markov models with multiple observation spaces," IEEE Trans. Speech and Audio, to appear 2000.

More Related Content

What's hot

[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded ImaginationDeep Learning JP
 
Murpy's Machine Learning:14. Kernel
Murpy's Machine Learning:14. KernelMurpy's Machine Learning:14. Kernel
Murpy's Machine Learning:14. KernelJungkyu Lee
 
Introduction to Bootstrap and elements of Markov Chains
Introduction to Bootstrap and elements of Markov ChainsIntroduction to Bootstrap and elements of Markov Chains
Introduction to Bootstrap and elements of Markov ChainsUniversity of Salerno
 
ThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsJungkyu Lee
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural NetworkMasahiro Suzuki
 
Master Thesis on the Mathematial Analysis of Neural Networks
Master Thesis on the Mathematial Analysis of Neural NetworksMaster Thesis on the Mathematial Analysis of Neural Networks
Master Thesis on the Mathematial Analysis of Neural NetworksAlina Leidinger
 
Rabbit challenge 5_dnn3
Rabbit challenge 5_dnn3Rabbit challenge 5_dnn3
Rabbit challenge 5_dnn3TOMMYLINK1
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Varad Meru
 
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)tanafuyu
 
Exponential-plus-Constant Fitting based on Fourier Analysis
Exponential-plus-Constant Fitting based on Fourier AnalysisExponential-plus-Constant Fitting based on Fourier Analysis
Exponential-plus-Constant Fitting based on Fourier AnalysisMatthieu Hodgkinson
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copyGuy Lebanon
 
International Journal of Engineering Research and Development (IJERD)
 International Journal of Engineering Research and Development (IJERD) International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Fractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityFractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityHector Zenil
 

What's hot (19)

[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination
 
Astaño 4
Astaño 4Astaño 4
Astaño 4
 
Murpy's Machine Learning:14. Kernel
Murpy's Machine Learning:14. KernelMurpy's Machine Learning:14. Kernel
Murpy's Machine Learning:14. Kernel
 
Introduction to Bootstrap and elements of Markov Chains
Introduction to Bootstrap and elements of Markov ChainsIntroduction to Bootstrap and elements of Markov Chains
Introduction to Bootstrap and elements of Markov Chains
 
ThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensions
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
Quantum Deep Learning
Quantum Deep LearningQuantum Deep Learning
Quantum Deep Learning
 
Master Thesis on the Mathematial Analysis of Neural Networks
Master Thesis on the Mathematial Analysis of Neural NetworksMaster Thesis on the Mathematial Analysis of Neural Networks
Master Thesis on the Mathematial Analysis of Neural Networks
 
Rabbit challenge 5_dnn3
Rabbit challenge 5_dnn3Rabbit challenge 5_dnn3
Rabbit challenge 5_dnn3
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
 
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)
Quantum Minimax Theorem in Statistical Decision Theory (RIMS2014)
 
Exponential-plus-Constant Fitting based on Fourier Analysis
Exponential-plus-Constant Fitting based on Fourier AnalysisExponential-plus-Constant Fitting based on Fourier Analysis
Exponential-plus-Constant Fitting based on Fourier Analysis
 
ssc_icml13
ssc_icml13ssc_icml13
ssc_icml13
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copy
 
International Journal of Engineering Research and Development (IJERD)
 International Journal of Engineering Research and Development (IJERD) International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
1 6
1 61 6
1 6
 
lcr
lcrlcr
lcr
 
Trondheim, LGM2012
Trondheim, LGM2012Trondheim, LGM2012
Trondheim, LGM2012
 
Fractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityFractal dimension versus Computational Complexity
Fractal dimension versus Computational Complexity
 

Viewers also liked

Viewers also liked (20)

Draftall
DraftallDraftall
Draftall
 
Amaia lopez escolar activity 1
Amaia lopez escolar activity 1Amaia lopez escolar activity 1
Amaia lopez escolar activity 1
 
Create an Assignment in MBC
Create an Assignment in MBCCreate an Assignment in MBC
Create an Assignment in MBC
 
PRESENTOLOGY .....power your presentation
PRESENTOLOGY .....power your presentationPRESENTOLOGY .....power your presentation
PRESENTOLOGY .....power your presentation
 
Scandinavian ad networks slutkund
Scandinavian ad networks slutkundScandinavian ad networks slutkund
Scandinavian ad networks slutkund
 
Alumnos
AlumnosAlumnos
Alumnos
 
Personal History
Personal HistoryPersonal History
Personal History
 
Delivering Clear Insight through the Use of Earth Imagery within GIS
Delivering Clear Insight through the Use of Earth Imagery within GIS Delivering Clear Insight through the Use of Earth Imagery within GIS
Delivering Clear Insight through the Use of Earth Imagery within GIS
 
71071733[1]
71071733[1]71071733[1]
71071733[1]
 
Powerpoint fet
Powerpoint fetPowerpoint fet
Powerpoint fet
 
Intro to press regulation
Intro to press regulationIntro to press regulation
Intro to press regulation
 
5. multiple ich. case report
5. multiple ich. case report5. multiple ich. case report
5. multiple ich. case report
 
Promotion webinar recording
Promotion webinar recordingPromotion webinar recording
Promotion webinar recording
 
インタビュープロジェクト
インタビュープロジェクトインタビュープロジェクト
インタビュープロジェクト
 
Candidate c film_industry
Candidate c film_industryCandidate c film_industry
Candidate c film_industry
 
Conexión Startup
Conexión StartupConexión Startup
Conexión Startup
 
Lokalno blago u globalnom prostoru, Breda Karun
Lokalno blago u globalnom prostoru, Breda KarunLokalno blago u globalnom prostoru, Breda Karun
Lokalno blago u globalnom prostoru, Breda Karun
 
Informe sucursal
Informe sucursalInforme sucursal
Informe sucursal
 
Daftar Hadir & Nilai Evaluasi PAI 2011.2012
Daftar Hadir & Nilai Evaluasi PAI 2011.2012Daftar Hadir & Nilai Evaluasi PAI 2011.2012
Daftar Hadir & Nilai Evaluasi PAI 2011.2012
 
Daftar hadir&nilai evaluasi pai 1314
Daftar hadir&nilai evaluasi pai 1314Daftar hadir&nilai evaluasi pai 1314
Daftar hadir&nilai evaluasi pai 1314
 

Similar to A Modified HMM Algorithm for Multiple Observation Spaces

Fuzzy c-Means Clustering Algorithms
Fuzzy c-Means Clustering AlgorithmsFuzzy c-Means Clustering Algorithms
Fuzzy c-Means Clustering AlgorithmsJustin Cletus
 
20140530.journal club
20140530.journal club20140530.journal club
20140530.journal clubHayaru SHOUNO
 
Algorithms Behind Term Structure Models II Hull-White Model
Algorithms Behind Term Structure Models II  Hull-White ModelAlgorithms Behind Term Structure Models II  Hull-White Model
Algorithms Behind Term Structure Models II Hull-White ModelJeff Brooks
 
An efficient approach to wavelet image Denoising
An efficient approach to wavelet image DenoisingAn efficient approach to wavelet image Denoising
An efficient approach to wavelet image Denoisingijcsit
 
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920Karl Rudeen
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzerbutest
 
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...ijaia
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksSteve Nouri
 
nonlinear_rmt.pdf
nonlinear_rmt.pdfnonlinear_rmt.pdf
nonlinear_rmt.pdfGieTe
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...Joonhyung Lee
 
An Image representation using Compressive Sensing and Arithmetic Coding
An Image representation using Compressive Sensing and Arithmetic Coding   An Image representation using Compressive Sensing and Arithmetic Coding
An Image representation using Compressive Sensing and Arithmetic Coding IJCERT
 
Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural NetworksESCOM
 
Statistical Analysis and Model Validation of Gompertz Model on different Real...
Statistical Analysis and Model Validation of Gompertz Model on different Real...Statistical Analysis and Model Validation of Gompertz Model on different Real...
Statistical Analysis and Model Validation of Gompertz Model on different Real...Editor Jacotech
 

Similar to A Modified HMM Algorithm for Multiple Observation Spaces (20)

Dycops2019
Dycops2019 Dycops2019
Dycops2019
 
Fuzzy c-Means Clustering Algorithms
Fuzzy c-Means Clustering AlgorithmsFuzzy c-Means Clustering Algorithms
Fuzzy c-Means Clustering Algorithms
 
20140530.journal club
20140530.journal club20140530.journal club
20140530.journal club
 
Algorithms Behind Term Structure Models II Hull-White Model
Algorithms Behind Term Structure Models II  Hull-White ModelAlgorithms Behind Term Structure Models II  Hull-White Model
Algorithms Behind Term Structure Models II Hull-White Model
 
panel regression.pptx
panel regression.pptxpanel regression.pptx
panel regression.pptx
 
An efficient approach to wavelet image Denoising
An efficient approach to wavelet image DenoisingAn efficient approach to wavelet image Denoising
An efficient approach to wavelet image Denoising
 
kcde
kcdekcde
kcde
 
PhD defense talk slides
PhD  defense talk slidesPhD  defense talk slides
PhD defense talk slides
 
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
 
Project Paper
Project PaperProject Paper
Project Paper
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...
RESEARCH ON FUZZY C- CLUSTERING RECURSIVE GENETIC ALGORITHM BASED ON CLOUD CO...
 
W33123127
W33123127W33123127
W33123127
 
Wavelet
WaveletWavelet
Wavelet
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networks
 
nonlinear_rmt.pdf
nonlinear_rmt.pdfnonlinear_rmt.pdf
nonlinear_rmt.pdf
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
 
An Image representation using Compressive Sensing and Arithmetic Coding
An Image representation using Compressive Sensing and Arithmetic Coding   An Image representation using Compressive Sensing and Arithmetic Coding
An Image representation using Compressive Sensing and Arithmetic Coding
 
Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural Networks
 
Statistical Analysis and Model Validation of Gompertz Model on different Real...
Statistical Analysis and Model Validation of Gompertz Model on different Real...Statistical Analysis and Model Validation of Gompertz Model on different Real...
Statistical Analysis and Model Validation of Gompertz Model on different Real...
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

A Modified HMM Algorithm for Multiple Observation Spaces

  • 1. A MODIFIED BAUM-WELCH ALGORITHM FOR HIDDEN MARKOV MODELS WITH MULTIPLE OBSERVATION SPACES Dr. Paul M. Baggenstoss Naval Undersea Warfare Center Newport RI, 02841 401-832-8240 p.m. baggenstoss@ieee.org ABSTRACT been (1) use a smaller and insufficient features set, (2) use more features and suffer PDF estimation errors, or (3) apply In this paper, a new algorithm based on the Baum-Welch methods of dimensionality reduction [4]. Such methods in- algorithm for estimating the parameters of a hidden Markov clude linear subspace analysis [5], projection pursuit [6], or model (HMM) is presented. It allows each state to be ob- simply assuming the features are independent (a factorable served using a different set of features rather than relying PDF). All these methods involve assumptions that do not on a common feature set. Each feature set is chosen to hold in general. We now present a new method based on be a sufficient statistic for discrimination of the given state sufficient statistics that is completely general and is based from a common “white-noise” state. Comparison of likeli- on the class-specific classifier [7]. Central to the classical hood values is possible through the use of likelihood ratios. Baum-Welch algorithm [l] is the ability to calculate rt(j), The new algorithm is the same in theory as the algorithm the probability that state j is true at time t given the entire based on a common feature set, but without the necessity of observation sequence X[1],. . . , X[T]. These probabilities estimating high-dimensional probability density functions depend on the forward and backward probabilities, which in (PDF’s). A simulated data example is provided showing turn depend on the state likelihood functions b3(X). It is superior performance over the conventional HMM. possible to re-formulate the problem to compute rt(j)using low-dimensional PDF’s. Consider an additional “common” 1 INTRODUCTION . state SOfor which the raw data is pure independent Gaus- sian noise. Using the likelihood ratios Consider a hidden Markov model (HMM) for a process with N states numbered SIthrough S N . Let the raw data be denoted X[t], for time steps t = 1 , 2 , . . . , T . The parame- ters of the HNM, denoted A, comprise the state transition in place of the likelihood functions b,(X) in the Baum- matrix A = {a,,}, the state prior probabilities U , , and the Welch algorithm, causes only a scaling change since the state observation densities b3 (X), where i and j range from denominator is independent of j . Each of the ratios in (1) 1 to N . These parameters can be estimated from training may be thought of as an optinial binary test or detector data using the Baum-Welch algorithm [l], [Z]. But, be- that can be re-written in terms of a state-dependent suffi- cause X[t] is often of high dimension, it may be necessary cient statistic, denoted z J , j = 1 , 2 , . . . , N . Thus, we have t o reduce the raw data to a set of features ~ [ t = T(X[t]). ] We then define a new HMM with the same A and U , , but with observations ~ [ t ] t = 1 , 2 , . . . , T and the state densi- , ties b, (2) (we allow the argument of the density functions to imply the identity of the function, thus b, ( X ) and b, ( z ) are which follows from the sufficiency of z, for state S, vs. state distinct). This is the approach used in speech processing SO.Note that we use the same symbol b, (.) to represent the today where z[t] are usually a set of cepstral coefficients. 
density of X and z,, however they are distinct (the argu- If ~ [ t is of low dimension, it is practical to apply proba- ] ment of the function should make it clear which function is bility density function (PDF) estimation methods such as implied). Since these tests do not involve the other states, Gaussian Mixtures to estimate the state observation densi- they are simpler, and z, can be individually lower in dimen- ties. Such PDF estimatimation methods tend to give poor sion than z. Note that this requires przor knowledge about results above dimensions above about 5 t o 10 [3] unless the each state, i.e., that a given state is best distinguished from features are exceptionally well-behaved (are close to inde- So by a given statistic’. This przor knowledge is often avail- pendent or multivariate Gaussian). In human speech, it is able, but not used. It only requires some knowledge about doubtful that 5 to 10 features can capture all the relevant the signal characteristics in each state, the same knowledge information in the data. Traditionally, the choices have necessary for choosing features. Keep in mind that each This work was supported by Office of Naval Research (ONR- ‘ I t is not neccessary to know which state is true at any given 321US) time 0-7803-6293-4/00/$10.00 0 2 0 0 0 EEE. 7 17
  • 2. state does not need a distinct set of statistics. The class- 2.1.2. T h e class-specific backward procedure specific HMM is an extension of the conventional HMM. If 1. Initialization: each state has the same set of statistics, the class-specific HMM specializes to the conventional HMM. When design- PT(i) = 1 (6) ing a class-specific HMM, one could allocate a subset of 2. Induction (for t = T - 1 , . . . l , 1 5 i 5 N): states to each class of data. Each subset of states would N require only the statistics apropriate for that class of data. In fact, an entirely different processing for feature extrac- PtW = C%i b.(z.[t+l ) bt,(z:[t+l]l)Pt+l(i). (7) i=l tion, tailored to that type of signal can be used. This can solve the problem that occurs in speech processing when 2.1.3. H M M Reestimation formulas short-duration plosives are forced to be analyzed using long analysis windows which are more appropriate for vowels. Define rt(j) p ( & = jlX[l], . . . ,X [ T ] ) .We have as When compared to the standard feature-based HMM using a common feature set z, the density of the numerators may be estimated using fewer training samples. Or, for a fixed number of training samples, more accuracy and robustness can be obtained. In addition, experience has shown that, Let difficulties encountered in initialization of the Baum-Welch algorithm are greatly diminished when this przor knowledge is used. A few words about the denominator densities bo(z,) is necessary. These densities may be approximated from syn- thetic white noise or derived analytically. For data that significantly departs from SO,all denominators can tend to The updated state priors are Ci = -yl(i).The updated state zero. In this case, the denominator densities bo(z,) must transition matrix is be solved for analytically or with an approximation valid in the far tails. The task of finding solutions valid in the tails is helped by the fact that SO is characterized by pure zzd Gaussian noise. In the computer simulations, we use exact formulas or approximations valid in the tails. 2.1.4. Gaussian mixture reestzmatzon formulas 2. MATHEMATICAL RESULTS We assume the following Gaussian mixture representation for each b, ( 2 , ) : Details of the derivation are t o be published [8]. Here, we provide only the algorithm. M b ~ ( ~= ) 3 c3k N(Zj,pJk,U~k), 1 5j 5 N. (11) k=l 2.1. The Class-Specific HMM where 2. I . 1. T h e class-specific forward procedure n/(z,,p, ) 2 U (2.1r)-P-’/21UI-’/2e{-~(s-p)’u-l(s-p)} 7 1. Initialization: and P, is the dimension of 2,. We may let A be indepen- 4 dent of j as long as M is sufficiently large. Let 2. Induction ( for 1 5 t 5 T - 1, 15j 6 N): 3. Termination: and where Ho is the condition t h a t state SO is true at every t. t=l 718
  • 3. 2.2. Discussion State 1 Description: An impulse of duration 2 samples oc- 2.2.1. Relationship t o conventional algorithm curring on samples 1 and 2 of the time-series with ad- The class-specific forward procedure terminates having com- ditive Gaussian noise of variance 1. Model: xt = puted the likelihood ratio (5), whereas the conventional al- 2(S[t - 1 + 1 S [ t - 21) + nt, where nt is distributed gorithm computes only the numerator. The class-specific N(0,l). Sufficient statistic: z1 = z1 +z2. log-PDF: Baum-Welch algorithm maximizes (5) over A, which is equiv- logbo(zl) = -0.5 l o g ( 4 ~ - 0.252:. ) alent to maximizing the numerator only. I t is comforting State 2 Description: Same as state 1, but on samples 2 a n d to know that - y t ( j ) , & ( z , j ) , i i i , and &j will be identical to those estimated by the conventional approach if (2) is true. + + 3. Model: xt = 2(6[t - 21 S [ t - 31) nt, where nt is distributed N ( 0 , l ) . Sufficient statistic: zz = z2 +x3. There is no correspondence, however, between the Gaus- log-PDF: logbo(z2) = -0.510g(4~) - 0.252;. sian mixture parameters of the two methods except that State 3 the densities they approximate obey (2). Uescription: Signal type 3 is iid Gaussian noise of zero mean and variance 1.7. Model: xt = nt, where 3. COMPUTER SIMULATION nt is distributed N(0,1.7). Sufficient statistic: z3 = log {Eryl xi}. log-PDF: bgbo(Z3) = -logr(N/2) - 3.1. Simulation details + (N/2) log(2) (N/2) z3 - exp(z3)/2. To test the Class-Specific HMM, we created a synthetic State 4 and 5 Markov model with known sufficient statistics for each state. Description: Signal types 4 and 5 are closely-spaced Comparisons with the conventional approach was made by sinewawes in Gaussian noise. Model: xt = nt + combining all the class-specific features into one common + asin(wkt 4), k = 4,5, where a = 0.4, w4 = 0.100, and w5 = 0.101 radians per sample. Sufficient statistic: feature set. In this way, neither method has an unfair ad- vantage with respect to features. The following state tran- zk = log {IC,”=, t e - j w k t 1’1 k = 4,s. log-PDF: I x sition matrix and state priors were used: log bO(Zk) = - log N - + Zk k = 4,s. 0.7 0.3 0 0 0 0 State 6 0 0.7 0.3 0 0 0 Uescription: Signal type 6 is autoregressive noise of A= 0 0 0.7 0.3 0 0 variance 1. Model: xt = [--alzt-l - a2xt-2 + nt] cy, 0 0 0 0.7 0.3 0 where nt is distributed N(0,1), a1 = -0.75, a2 = 0.78, 0 0 0 0 0.7 0.3 and cy = 0.5675 ( a is chosen such that the variance of xt 0.1 0 0 0 0 0.9 is 1). Approximate sufficient statistic: The first and second lags of the normalized autocorrelation estimates and U ] = 1/6, for all j . For each t , a state is chosen ran- computed circularly. z6 = {?1,?2} where ?k = fk/?O, domly according to the Markov model. Then, a N-sample segment of a time series is created from the statistical model and f k = & E,”=, z ; z ( i - k ) mod N . log-PDF: Details provided in 181. corresponding to the chosen state ( N = 256). In our nota- tion, we have X [ t ]= { z ~ [ t ] , ’ [ t. ] .,,x,v[~]}.Features are ~ . then computed from each segment and used as an obser- Table I: Description of each signal type, sufficient statistics vation vector of an HhIM. The common state, SO,is iid (SS), and distribution of the SS under state SO. Gaussian noise of zero mean and unit variance, denoted N(0,I). For each model, a sufficient statistic is chosen for distinguishing the signal from Ho. These are denoted z1 The multiple record implementation of the HMM de- through z6. 
Table 1 lists each signal description, the sufficient statistic, and the distribution under state S_0. The criteria for selecting these synthetic signals were: (a) the signals are easy to describe and produce; (b) a sufficient or approximately sufficient statistic was known for distinguishing each state from S_0; (c) these statistics had a known density under state S_0; and (d) the signals and statistics were diverse and statistically dependent.

To illustrate the strength of the Class-Specific algorithm, it was compared with two classical implementations. The classical implementations used a 7-dimensional feature vector consisting of all features used by the Class-Specific (CS) approach. Two variations of the classical approach were created by assuming the Gaussian mixtures had (1) a general covariance matrix (the "classical" or CL approach) or (2) a diagonal structure, which assumes all the features are statistically independent (the "independence assumption" or IA approach). We will refer to these three methods by the abbreviations CS, CL, and IA.

The multiple-record implementation of the HMM described in Section V.B of Rabiner [1] (page 273) was used. All training and testing data were organized into records with a constant length of 99 data segments per record (T_r = 99). Each segment consisted of N = 256 time samples. Features were computed on each segment (a segment is associated with a time step, or observation, of the HMM). In the experiment, algorithm performance was determined as a function of R, the number of training records; we used R = 1, 2, 5, 10, 20, 40, 80, 160, and 320 records. To measure the performance of all implementations, the appropriate version of the Baum-Welch algorithm was first run to convergence on the available training data. Next, the Viterbi algorithm of Rabiner [1] (pp. 263-264), appropriately modified for the CS method by substituting the likelihood ratios b_j(z_j[t])/b_0(z_j[t]) in place of b_j(X[t]), was used to determine the most likely state sequence on a separate pool of 640 records of testing data. The number of state errors divided by the number of opportunities (640 times 99) was used as the performance metric.
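The substitution just described is the only change the CS method requires in the Viterbi algorithm. A minimal log-domain sketch follows (our illustration, under the same assumption as before: a precomputed N x T `ratio` array).

    import numpy as np

    def class_specific_viterbi(u, A, ratio):
        # Most likely state sequence, with log-likelihood ratios
        # log[b_j(z_j[t]) / b_0(z_j[t])] in place of log b_j(X[t]).
        N, T = ratio.shape
        logA = np.log(A + 1e-300)           # guard against log(0)
        logr = np.log(ratio + 1e-300)
        psi = np.zeros((N, T), dtype=int)   # backpointers
        delta = np.log(u + 1e-300) + logr[:, 0]
        for t in range(1, T):
            trans = delta[:, None] + logA   # trans[i, j] = delta(i) + log a_ij
            psi[:, t] = trans.argmax(axis=0)
            delta = trans.max(axis=0) + logr[:, t]
        path = np.empty(T, dtype=int)
        path[T - 1] = int(delta.argmax())
        for t in range(T - 1, 0, -1):       # backtrack
            path[t - 1] = psi[path[t], t]
        return path

When the sufficiency condition (2) holds, the denominators correspond to the path-independent quantity b_0(X[t]), so the maximizing sequence is the same one the conventional algorithm would find if it could evaluate the full-dimensional b_j(X[t]).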
For a more accurate determination of algorithm performance, sixteen independent trials were made at each value of R.

3.1.1. Catastrophic errors and "assisted" training

The CL and IA approaches encountered severe problems with initialization. More often than not, a poor initialization resulted in the Baum-Welch algorithm finding the incorrect stationary point. This resulted in catastrophic errors, and appeared to be caused by an inability to distinguish two or more states. The number of catastrophic errors decreased as the number of training records increased. The CS method did not encounter any catastrophic errors. To study only the PDF estimation issue, apart from any initialization issues, it was decided to "assist" the CL and IA approaches by providing a good initialization point. This initialization point was found by training separately on labeled data from each of the N states, then using these PDF's as the state PDF's together with the known u and A. In short, the algorithm was initialized with the true parameter values. This "good" parameter set was used as the starting point for all trials of the CL and IA approaches. Doing this provides somewhat of a "lower bound" on error performance. The CS results are unassisted.

3.2. Main results

A plot of the median error probability for the CL, CS, and IA methods is provided in Figure 1. The IA approach has a higher limiting error rate, due to the built-in assumption of independence, which is not valid. It has, however, nearly identical convergence behavior to the CS method. This is expected, since the dimension of the PDF estimates is similar for the two approaches.

[Figure 1: probability of state classification error versus the number of training records (R), on a logarithmic R axis; curves shown for Classical (assisted), Independence Assumption (assisted), and Class-Specific. Caption: Probability of state classification error as a function of the number of training records (R) for three methods; median performance of 16 trials. The CL and IA methods are assisted by providing a "good" initialization point.]
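To make the "assisted" initialization of Section 3.1.1 concrete, one possible realization is sketched below (our illustration; it assumes scikit-learn's GaussianMixture and a hypothetical container `labeled_feats` holding the labeled feature vectors for each state, and the mixture order M is our choice). Fitting with covariance_type="full" corresponds to the CL variant, and "diag" to IA.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def assisted_init(labeled_feats, u, A, M=3, covariance_type="full"):
        # Fit each state's Gaussian mixture on labeled data for that state,
        # then pair the fitted PDFs with the known priors u and transitions A.
        # labeled_feats[j]: (T_j, P) array of feature vectors from state j.
        mixtures = [
            GaussianMixture(n_components=M, covariance_type=covariance_type).fit(f)
            for f in labeled_feats
        ]
        return {"priors": np.asarray(u), "transitions": np.asarray(A),
                "state_pdfs": mixtures}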
4. CONCLUSIONS

A class-specific implementation of the Baum-Welch algorithm has been developed and verified to work in principle. A computer simulation used a 6-state HMM with six synthetic signal models. The class-specific feature dimensions were either 1 or 2. When performance was measured in terms of state errors in the most likely state sequence, the class-specific method greatly outperformed the classic HMM approach, which operated on a combined 7-dimensional feature set. The improved performance was due to the lower dimension of the feature spaces, made possible by knowledge of the appropriate feature set for each state. It also outperformed a method that assumed the features were statistically independent; in this case, the improved performance comes from the fact that the class-specific approach does not compromise theoretical performance when reducing dimension.

5. REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, February 1989.

[2] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[3] D. W. Scott, Multivariate Density Estimation. New York: Wiley, 1992.

[4] S. Aeberhard, D. Coomans, and O. de Vel, "Comparative analysis of statistical pattern recognition methods in high dimensional settings," Pattern Recognition, vol. 27, no. 8, pp. 1065-1077, 1994.

[5] H. Watanabe and S. Katagiri, "Discriminative subspace method for minimum error pattern recognition," in Proc. 1995 IEEE Workshop on Neural Networks for Signal Processing, pp. 77-86, 1995.

[6] P. J. Huber, "Projection pursuit," Annals of Statistics, vol. 13, no. 2, pp. 435-475, 1985.

[7] P. M. Baggenstoss, "Class-specific features in classification," IEEE Trans. Signal Processing, December 1999.

[8] P. M. Baggenstoss, "A modified Baum-Welch algorithm for hidden Markov models with multiple observation spaces," IEEE Trans. Speech and Audio Processing, to appear, 2000.