Improvement of noisy speech recognition using a proportional alignment
decoding algorithm in the training phase
Wei-Wen Hung
Department of Electrical Engineering
Ming Chi Institute of Technology
Taishan, Taiwan, 243 ROC
E-mail : wwhung@ccsun.mit.edu.tw
FAX : 886-02-2903-6852
Tel. : 886-02-2906-0379
and
Hsiao-Chuan Wang
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, Taiwan, 30043 ROC
E-mail : hcwang@ee.nthu.edu.tw
FAX : 886-03-571-5971
Tel. : 886-03-574-2587
Corresponding author: Hsiao-Chuan Wang
Abstract
Modeling the state duration of hidden Markov models (HMMs) can effectively improve the accuracy of decoding the state sequence of an utterance and thereby improve speech recognition accuracy. However, when a speech signal is contaminated by ambient noise, the decoded state sequence may be distorted: it may stay in some states too long or too short, even with the help of state duration models. This paper presents a proportional alignment decoding (PAD) algorithm for re-training hidden Markov models. A task of multi-speaker isolated Mandarin digit recognition was conducted to demonstrate the effectiveness and robustness of the PAD-based variable duration hidden Markov model (VDHMM/PAD). Experimental results show that the discriminativity of VDHMM/PAD in noisy environments is significantly enhanced, and that the proposed method outperforms the widely used state duration modeling methods based on Poisson, gamma, Gaussian, bounded and non-parametric probability density functions.
This research has been partially sponsored by the National Science Council, Taiwan,
ROC, under contract number NSC-85-2221-E-007-005.
1. Introduction
Hidden Markov model (HMM) is a well-known and widely used statistical approach to speech recognition. This method provides a powerful framework for modeling time-varying speech signals. One of the advantages of the HMM is that it characterizes speech as a parametric stochastic process whose parameters can be optimized by the expectation-maximization (EM) algorithm. In addition, the quality of an HMM can be significantly improved by incorporating state duration information (Rabiner, 1989). In a conventional hidden Markov model, the probability of staying in state $i$ for $d$ frames is modeled by $p_i(d) = (a_{ii})^{d-1}\,(1-a_{ii})$, where $a_{ii}$ is the self-transition probability of state $i$ and $(1-a_{ii})$ the probability of leaving state $i$. This inherent temporal characteristic implies that the state duration in a conventional HMM is exponentially distributed, which does not adequately model the temporal structures of the different acoustic regions of a speech signal (Juang et al., 1985; Rabiner et al., 1985; Rabiner et al., 1988). To cope with this deficiency, several modeling methods for state duration and word duration have been proposed.
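As a worked illustration of this point, the following minimal Python sketch (ours, not from the paper; the value of $a_{ii}$ is hypothetical) computes the implicit duration pdf $p_i(d) = (a_{ii})^{d-1}(1-a_{ii})$ and shows that it decays monotonically, assigning its largest probability to a duration of a single frame regardless of the typical duration of the acoustic region:

```python
def implicit_duration_pdf(a_ii: float, d: int) -> float:
    """Probability that a conventional HMM stays in a state with
    self-transition probability a_ii for exactly d frames (d >= 1)."""
    return (a_ii ** (d - 1)) * (1.0 - a_ii)

a_ii = 0.8  # hypothetical self-transition probability
print([round(implicit_duration_pdf(a_ii, d), 3) for d in range(1, 9)])
# -> [0.2, 0.16, 0.128, 0.102, 0.082, 0.066, 0.052, 0.042]: geometric decay
```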
Bonafonte et al. (1996) used a Markov chain to model the occupancy of the HMM states, with the parameters of the Markov chain estimated directly from the duration data. To reduce the insertion error rate in connected digit recognition, Power (1996) proposed an expanded-state duration model in which each state is expanded into multiple sub-states, each sharing the original state observation probability density function (pdf). Moreover, Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not effective enough; a state duration constrained maximum likelihood (SDML) training scheme (Laurila, 1997) was therefore presented to gradually tighten the duration constraints of a hidden Markov model. Duration modeling is not only applied at the state level, but can also be extended to the word level.
Burshtein (1995) used explicit models of state and word durations to reduce the string error rate in a connected digit recognition task. In general, no matter what kind of duration modeling mechanism is employed, the probability density functions used for modeling state duration distributions can be roughly classified into two categories (Gu et al., 1991): non-parametric and parametric methods. In the non-parametric method, the distribution of state duration is estimated directly from the training data, so a more accurate duration distribution can be obtained for each state of a word model. However, this approach needs a large amount of training utterances to reach a desired degree of accuracy, and it also requires a considerable amount of memory for storing all the duration distributions. In the parametric method, on the other hand, a specific probability density function, such as the Poisson (Russell et al., 1985; Russell et al., 1987), gamma (Levinson, 1986; Burshtein, 1995), Gaussian (Rabiner, 1989; Burshtein, 1995) or bounded density function (Gu et al., 1991; Kim et al., 1994; Vaseghi, 1995; Power, 1996; Laurila, 1997), is used to model the state duration distribution explicitly, so that only a few parameters are required to specify the distribution completely. The parametric approach, however, has an obvious drawback: the assumed probability density function may not always fit the real duration distribution of each state in a hidden Markov model.
Most research on modeling duration distributions has dealt with minimizing the recognition errors attributed to unrealistic duration models, without taking ambient noise into account. How to make a duration model more robust to noise contamination remains an open problem. In this paper, we focus on the robustness of state duration modeling in noisy environments and neglect word duration modeling, since state duration modeling is the major contributor to the improvement of recognition rate (Burshtein, 1995).
In Section 2, some methods of state duration modeling are reviewed. In Section 3, a series of experiments is conducted to compare those methods, and the behavior of the various duration models under noise contamination is investigated. In Section 4, based on the results of the previous section, we propose a new method that combines a proportional alignment decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov model. The resulting model is a variable duration hidden Markov model, denoted VDHMM/PAD. Its state duration distributions are shown to be more robust in noisy environments than those of the other methods. An experiment on multi-speaker isolated Mandarin digit recognition is conducted in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a conclusion is given in Section 6.
2. Overview of state duration modeling methods
When the statistics of state duration are incorporated into both the training and recognition phases of a conventional hidden Markov model, the result is a variable duration hidden Markov model (VDHMM) (Levinson, 1986; Rabiner, 1989). In a VDHMM, the likelihood function is defined in terms of modified forward and backward likelihoods. Let $O = o_1 o_2 \ldots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,j)$ are defined as (Levinson, 1986; Rabiner, 1989; Hung et al., 1997)

$$\alpha_t(w,j) = p(o_1 o_2 \ldots o_t,\; q_t = j \mid \lambda(w)) = \sum_{d}\;\sum_{\substack{i=1 \\ i \neq j}}^{S_w} \alpha_{t-d}(w,i)\, a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \qquad (1)$$
and

$$\beta_t(w,i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i,\; \lambda(w)) = \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d} a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t+\tau})\; \beta_{t+d}(w,j), \qquad (2)$$
where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the state occupied at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the observation distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the duration pdf of the $j$-th state of word model $\lambda(w)$ for a duration of $d$ frames. Then, given a variable duration hidden Markov model $\lambda(w)$, the likelihood of an observation sequence $O$ can be modeled as
$$p(O \mid \lambda(w)) = \sum_{i=1}^{S_w}\; \sum_{\substack{j=1 \\ j \neq i}}^{S_w}\; \sum_{d=1}^{D(w,j)} \alpha_{t-d}(w,i)\, a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau})\; \beta_t(w,j), \qquad (3)$$
where $D(w,j)$ denotes the maximum allowable duration of the $j$-th state of word model $\lambda(w)$. Based on the above definitions, the derivation of the re-estimation formulas for the variable duration HMM is formally identical to that for the conventional HMM (Levinson, 1986; Rabiner, 1989). For a left-to-right variable duration HMM without jumps, the maximum likelihood $p(O \mid \lambda(w))$ can be computed efficiently by a three-dimensional (time, state, duration) Viterbi decoding algorithm derived from Gu et al. (1991), which can be summarized as follows:

for $d = 1$,

$$\psi_t(w,j,1) = \max_{\tilde d}\{\psi_{t-1}(w,j-1,\tilde d) + \log[p_{w,j-1}(\tilde d)]\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)], \qquad (4)$$

for $d \ge 2$,

$$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)], \qquad (5)$$

and

$$p(O \mid \lambda(w)) = \max_{d=1}^{T}\{\psi_T(w,S_w,d) + \log[p_{w,S_w}(d)]\}, \qquad (6)$$

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state $1$ to state $j-1$ along a state sequence of $(t-d)$ frames producing the observations $o_1 o_2 \ldots o_{t-d}$, and then staying in state $j$ and producing the observations $o_{t-d+1} \ldots o_t$ in that state. From the above description, it is clear that successful modeling of the state duration distributions will promote the performance of an HMM-based speech recognizer.
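The following NumPy sketch (ours; the array layout, the maximum-duration truncation `d_max`, and the assumption that the utterance starts in state 1 are implementation choices not spelled out in the paper) outlines Eqs. (4)-(6) for a left-to-right model without jumps, with all quantities in the log domain:

```python
import numpy as np

def duration_viterbi(log_b, log_a, log_p, d_max):
    """Three-dimensional (time, state, duration) Viterbi, Eqs. (4)-(6).

    log_b : (T, S) log observation likelihoods, log_b[t, j] = log b_j(o_t)
    log_a : (S,)   log transition into state j from j-1 (log_a[0] unused)
    log_p : (S, d_max + 1) log duration pdfs, log_p[j, d] = log p_j(d)
    Returns max_d { psi_T(S, d) + log p_S(d) }, i.e. Eq. (6).
    """
    T, S = log_b.shape
    # psi[j, d]: best log-likelihood ending at the current frame while
    # having occupied state j for exactly d frames
    psi = np.full((S, d_max + 1), -np.inf)
    psi[0, 1] = log_b[0, 0]                 # start in state 1 at t = 1
    for t in range(1, T):
        new = np.full((S, d_max + 1), -np.inf)
        for j in range(S):
            # Eq. (5): stay in state j for one more frame
            new[j, 2:] = psi[j, 1:d_max] + log_b[t, j]
            if j > 0:
                # Eq. (4): enter state j, closing the duration of state j-1
                best = np.max(psi[j - 1, 1:] + log_p[j - 1, 1:])
                new[j, 1] = best + log_a[j] + log_b[t, j]
        psi = new
    return np.max(psi[S - 1, 1:] + log_p[S - 1, 1:])   # Eq. (6)
```

Compared with the standard Viterbi recursion, the extra duration axis multiplies the state space by `d_max`, which is why the duration is bounded by $D(w,j)$ in Eq. (3).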
In general, the modeling methods for state duration can be classified into two categories, i.e.,
non-parametric and parametric modeling methods.
2.1 Non-parametric state duration modeling method
In non-parametric approaches (Juang et al., 1985; Rabiner et al., 1985; Rabiner et al., 1988; Anastasakos et al., 1995; Hung et al., 1997), the probabilities $p_{w,j}(d)$ describing the state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for word model $\lambda(w)$, and $N_w$ the total number of training utterances of word $w$. Then the probabilities $p_{w,j}(d)$ can be estimated by

$$p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d_{w,j,t})}{N_w} \quad \text{for } d \ge 1, \qquad (7)$$

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function defined as

$$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d = d_{w,j,t}, \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$
In this non-parametric approach, the accuracy of the duration model depends on the amount of training data. When sufficient training data are available, this modeling method can closely approximate the temporal characteristics of each state in a hidden Markov model; however, the large number of parameters to be stored is one of its drawbacks. A non-parametric approach to isolated Mandarin digit recognition proposed by Hung et al. (1997) showed that the recognition rates were significantly improved over the conventional HMM under white noise: the recognition rate rose from 48.8% for the baseline HMM to 62.0% for the non-parametric approach when the signal was contaminated with white noise at an SNR of 20 dB.
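A minimal sketch of the counting procedure of Eqs. (7)-(8) (ours; the example durations are hypothetical):

```python
from collections import Counter

def nonparametric_duration_pdf(durations, d_max):
    """Estimate p_{w,j}(d) of Eq. (7) by directly counting the decoded
    durations d_{w,j,t} of one state over the N_w training utterances."""
    counts = Counter(durations)            # Theta of Eq. (8) as a tally
    n_w = len(durations)
    return [0.0] + [counts.get(d, 0) / n_w for d in range(1, d_max + 1)]

pdf = nonparametric_duration_pdf([3, 4, 4, 5, 4, 3, 6, 4], d_max=8)
print(pdf[3], pdf[4])   # -> 0.25 0.5
```

Every distinct duration value of every state must be stored, which is the memory cost noted above.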
2.2 Parametric state duration modeling methods
In parametric approaches, specific probability density functions are used to model the distribution of state duration explicitly. The parametric approach has the advantage that only a few parameters are required to specify the probability density function completely; thus, compared with the non-parametric approaches, its memory requirement is significantly reduced. One drawback of parametric duration modeling is that the assumed probability density function may not always match the actual duration distribution of each state in a hidden Markov model. Several probability density functions, including the Poisson, gamma, bounded and Gaussian duration densities, have been proposed to model the distribution of state duration. Detailed formulations of these duration modeling methods are described below.
2.2.1 Poisson distribution for state duration
To characterize the duration property more effectively, Russell et al. (1985, 1987) replaced the self-transition probability of the conventional HMM by a Poisson duration density function, so that there is no self-transition from a state back to itself. This is the so-called hidden semi-Markov model (HSMM). The hidden semi-Markov model with Poisson distributed state duration is thought to have several advantages. First, the Poisson probability density function is a plausible model for state duration. Second, only one parameter, the state duration mean, is needed to specify the distribution. Third, maximum likelihood estimation of the state duration mean can be accomplished by methods analogous to the standard Baum-Welch re-estimation process.

When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$$p_{w,j}(d) = \frac{(\bar d_{w,j} - 1)^{\,d-1}}{(d-1)!}\; e^{-(\bar d_{w,j} - 1)} \quad \text{for } d \ge 1, \qquad (9)$$

where $\bar d_{w,j}$ denotes the duration mean of the $j$-th state of word model $\lambda(w)$. For comparison, the hidden Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM) with Poisson distributed state duration were applied to a speaker-dependent isolated word recognition task (Russell et al., 1985). Experimental results for the third set of recordings showed that the error rate of the HSMM was 11.8% and 6.3% lower than those of the HMM and DTW, respectively.
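A small sketch of the Poisson duration pdf of Eq. (9) (ours; following the shifted parameterization above, the duration is one plus a Poisson variable with rate $\bar d_{w,j} - 1$, so that $d = 0$ has zero probability and the mean duration equals $\bar d_{w,j}$):

```python
import math

def poisson_duration_pdf(d: int, mean_dur: float) -> float:
    """Shifted Poisson state duration pdf of Eq. (9), d >= 1."""
    lam = mean_dur - 1.0                       # Poisson rate
    return math.exp(-lam) * lam ** (d - 1) / math.factorial(d - 1)

# hypothetical state with a mean duration of 5 frames
print(sum(poisson_duration_pdf(d, 5.0) for d in range(1, 60)))       # ~1.0
print(sum(d * poisson_duration_pdf(d, 5.0) for d in range(1, 60)))   # ~5.0
```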
2.2.2 Gamma distribution for state duration
Levinson (1986) first used a family of gamma probability density functions to characterize the distribution of state duration, forming a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution was considered ideally suited to the specification of a duration density function, since it assigns zero probability to negative duration lengths and only two parameters, the state duration mean and variance, are required to specify its distribution. Moreover, Burshtein (1995) proposed a modified Viterbi decoding algorithm that incorporates both state and word duration models for connected digit string recognition. In this approach, a duration penalty based on a gamma density function is applied at each frame transition. The modified Viterbi decoding algorithm was shown to have essentially the same computational requirements as the conventional Viterbi algorithm. The experimental results showed that, compared with the baseline HMM, the modified Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to 2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string length. The gamma duration density function can be formulated as

$$p_{w,j}(d) = \frac{\xi_{w,j}^{\,\gamma_{w,j}}}{\Gamma(\gamma_{w,j})}\; d^{\,\gamma_{w,j}-1}\; e^{-\xi_{w,j}\, d} \quad \text{for } d \ge 1, \qquad (10)$$

with

$$\gamma_{w,j} = \frac{\bar d_{w,j} \cdot \bar d_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar d_{w,j}}{\nabla_{w,j}}, \qquad (11)$$

where $\bar d_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state of word model $\lambda(w)$, respectively, and $\Gamma(z)$ is the gamma function defined by

$$\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\, dx \quad \text{for } z > 0. \qquad (12)$$
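A sketch of the gamma duration pdf of Eqs. (10)-(11) (ours), parameterized directly by the state duration mean and variance:

```python
import math

def gamma_duration_pdf(d: float, mean_dur: float, var_dur: float) -> float:
    """Gamma state duration pdf of Eqs. (10)-(11)."""
    shape = mean_dur ** 2 / var_dur   # gamma_{w,j} of Eq. (11)
    rate = mean_dur / var_dur         # xi_{w,j} of Eq. (11)
    return (rate ** shape / math.gamma(shape)) * d ** (shape - 1) * math.exp(-rate * d)

# hypothetical state: mean 5 frames, variance 4 frames^2
print(sum(gamma_duration_pdf(d, 5.0, 4.0) for d in range(1, 60)))  # ~1 (coarse sum)
```

Unlike the Poisson model, the variance can be set independently of the mean, which is part of why the gamma fit is reported to be slightly better in Section 3.2.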
2.2.3 Bounded state duration
Owing to their continuous probability density functions, both the Poisson and gamma models operate well even when a relatively small number of training utterances is available. In some situations, however, the decoded duration of a state may still be too long or too short. To avoid such unexpected durations and minimize erroneous matches between testing utterances and reference models, Gu et al. (1991) proposed a hidden Markov model with bounded state duration, in which the allowable state duration is constrained by boundaries: the duration of each state is simply bounded by lower and upper bounds in the recognition phase. The probability density function for the bounded state duration is modeled by

$$p_{w,j}(d) = \begin{cases} \dfrac{1}{D^{upper}_{w,j} - D^{lower}_{w,j} + 1}, & \text{if } D^{lower}_{w,j} \le d \le D^{upper}_{w,j}, \\[4pt] 0, & \text{otherwise,} \end{cases} \qquad (13)$$

where $D^{lower}_{w,j}$ and $D^{upper}_{w,j}$ are the lower and upper bounds of the state duration for state $j$ of word model $\lambda(w)$, estimated by

$$D^{lower}_{w,j} = \min_{t=1}^{N_w}\{d_{w,j,t}\} \qquad (14)$$

and

$$D^{upper}_{w,j} = \max_{t=1}^{N_w}\{d_{w,j,t}\}. \qquad (15)$$
A series of experiments using all 408 highly confusable first-tone Mandarin syllables (Gu et al., 1991) was conducted to evaluate the effectiveness of the HMM with bounded state duration (BSD). In the discrete case, the recognition rate of the HMM with BSD is 78.5%, which is 9.0%, 6.3% and 1.9% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma distributed state duration, respectively. In the continuous case, the recognition rate of the HMM with BSD is 88.3%, which is 6.3%, 5.9% and 3.1% higher than those of the same three models. Similar applications of bounded state duration distributions to speech recognition can be found in Kim et al. (1994), Vaseghi (1995) and Power (1996). In those works, the minimum and maximum durations of each state were estimated in the training phase, and these loose state duration constraints were then applied in the final recognition phase. To tighten the duration constraints, Laurila (1997) employed the bounded state duration model in both the training and recognition phases to achieve higher consistency of the state duration constraints.
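A sketch of the bounded duration model of Eqs. (13)-(15) (ours; the duration list is hypothetical):

```python
def bounded_duration_pdf(durations):
    """Estimate the bounds of Eqs. (14)-(15) from the decoded training
    durations of one state and return the uniform pdf of Eq. (13)."""
    lower, upper = min(durations), max(durations)   # D_lower, D_upper
    prob = 1.0 / (upper - lower + 1)
    return lambda d: prob if lower <= d <= upper else 0.0

p = bounded_duration_pdf([3, 4, 4, 5, 4, 3, 6, 4])
print(p(2), p(4), p(7))   # -> 0.0 0.25 0.0
```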
2.2.4 Gaussian distribution for state duration
A parametric approach using a Gaussian probability density function for modeling the state duration distributions was suggested by Rabiner (1989). Moreover, Burshtein (1995) also claimed that the Gaussian pdf provides a good approximation for word duration. By modeling word duration with a Gaussian pdf, the string error rate was further reduced from 2.86% to 2.78% for the case of unknown string length, and from 1.60% to 1.59% for the case of known string length, compared with the baseline HMM. The Gaussian duration density function can be formulated as

$$p_{w,j}(d) = \frac{1}{\sqrt{2\pi \nabla_{w,j}}}\; \exp\!\left\{-\frac{(d - \bar d_{w,j})^2}{2\,\nabla_{w,j}}\right\}. \qquad (16)$$
3. Comparison of state duration modeling methods
3.1 Databases and experimental conditions
A task of multi-speaker isolated Mandarin digit recognition was used to compare the state duration modeling methods described above. The database for the experiments was provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin digits in each of three sessions, for a total of 3000 utterances recorded at a sampling rate of 8 kHz. Each frame contained 256 samples, with 128 samples of overlap, and was multiplied by a 256-point Hamming window. Pre-silence and post-silence of 0.1 to 0.5 seconds were included. Each digit was modeled as a left-to-right HMM of 7 to 9 states, including the pre-silence and post-silence states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature vector consisted of 12 LPC-derived cepstral coefficients, 12 delta cepstral coefficients and one delta log-energy.

The NOISEX-92 noise database (Varga et al., 1992) was used to generate the noisy speech. In our study, three kinds of noise, namely white noise, F16 cockpit noise and babble noise, were added directly to the clean speech in the time domain to simulate noise-contaminated speech. When noise was added to the clean speech, the signal-to-noise ratio (SNR) was defined by

$$\mathrm{SNR} = 10 \cdot \log_{10}\!\left(\frac{E_s}{E_n}\right), \qquad (17)$$

where $E_s$ is the total energy of the clean speech and $E_n$ the energy of the added noise over the entire speech portion. The F16 cockpit noise was recorded at the co-pilot's seat of a two-seat F16 traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of the babble noise was 100 people speaking in a canteen, in which individual voices were slightly audible.
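A minimal sketch of this noise-mixing step (ours; the paper simply adds NOISEX-92 noise in the time domain, and the gain derivation below follows Eq. (17)):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale a noise segment so that Eq. (17) yields the requested SNR,
    then add it to the clean signal."""
    noise = noise[: len(speech)]                # align segment lengths
    e_s = np.sum(speech.astype(float) ** 2)     # E_s: clean speech energy
    e_n = np.sum(noise.astype(float) ** 2)      # E_n: noise energy
    # choose gain g such that 10*log10(e_s / (g**2 * e_n)) == snr_db
    g = np.sqrt(e_s / (e_n * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise
```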
The subsequent experiments examine the following issues: (1) the effectiveness of state duration modeling methods, (2) the incorporation of state duration modeling into the training phase, and (3) the robustness of state duration modeling methods in noisy environments.
3.2 Effectiveness of state duration modeling methods
The first two sessions of collected utterances in the database were used to train an initial set of word models with the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional HMM-based word model (denoted the 'Baseline' HMM) had been established for each isolated Mandarin digit, the training utterances were time-aligned with their corresponding word models. Using the standard Viterbi decoding algorithm, each utterance was re-decoded into a state sequence, from which the number of frames spent in every state is known. From these decoded state durations, the distribution of state duration for each state of a word model can be found. This distribution serves as the non-parametric model of state duration and is denoted HMM/Npar. Fig. 1 shows the duration distributions of the seven states of the HMM/Npar for the isolated Mandarin digit '4'. The state duration distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in Fig. 1 for comparison, denoted HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD, respectively. The third session of collected utterances was used as clean testing data for evaluating the effectiveness of the various state duration modeling methods. In the recognition phase, a testing utterance is decoded into a state sequence using the standard Viterbi decoding algorithm for the 'Baseline' HMM, and using the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6), for the other state duration modeling methods. The resulting recognition rates are shown in Table I.
( Fig. 1 and Table I about here )
Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. The distribution of state duration differs from state to state and cannot be confined to a single type of probability density function; no single pdf fits the statistical characteristics of all the states of a word model. Furthermore, HMM/Gam and HMM/Gau are more capable than HMM/Pois and HMM/BSD of modeling the state duration distributions represented by HMM/Npar, with the gamma function slightly better than the Gaussian. This result is consistent with the conclusion of Burshtein (1995) that the gamma function provides high quality approximations for state duration and word duration. For HMM/BSD, the lower and upper bounds of state duration can prevent any state from occupying too many or too few frames; however, the state duration distribution within the allowable range is treated as a uniform distribution, which cannot approximate the actual distribution of state duration well. This does affect the performance, as shown in Table I. From the experimental results in Table I, the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher recognition rates than the baseline HMM, and the recognition rate of HMM/Gam is superior to those of the other methods. We conclude that a good state duration model can improve the recognition accuracy.
3.3 Incorporation of state duration modeling in training phase
When the statistics of state duration are considered only in the recognition phase and not in the training phase, the state duration constraints are quite loose (Laurila, 1997). To solve this inconsistency, a variable duration hidden Markov model (VDHMM) (Levinson, 1986; Rabiner, 1989; Laurila, 1997), which incorporates the state duration statistics into both the training and recognition phases of a word model, has been proposed to seek further improvement in recognition accuracy. The duration distribution of each state of a word model is obtained as follows:

Step 1. The segmental k-means algorithm and the standard Viterbi decoding method are used to train an initial set of word models.

Step 2. The duration statistics of each state of a word model are estimated and modeled by non-parametric or parametric methods.

Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded into a maximum likelihood state sequence.

Step 4. According to those maximum likelihood state sequences, the statistics of each state are re-calculated and the parameters of the underlying state duration model are revised. Steps 3 and 4 are iterated several times to produce the final set of word models.
Fig. 2 shows the duration distributions of the seven states of those VDHMMs for the isolated Mandarin digit '4' using the various state duration modeling methods. The variable duration HMMs with non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively. The clean speech recognition rates based on the variable duration HMMs are shown in Table II. Comparing Fig. 1 and Fig. 2 reveals that tighter duration constraints make the fluctuation of some state duration distributions of the HMM/Npar more pronounced; this can be seen in the 4-th, 5-th and 6-th states of word model '4'. In addition, the duration distributions of some states (e.g., the 3-rd and 7-th states) become more concentrated and sharper. Table I and Table II show that, whether non-parametric or parametric approaches are employed, the VDHMM methods outperform the corresponding HMM methods. Since there are two confusion sets in Mandarin digit speech ('1' vs. '7' and '6' vs. '9'), the recognition rate can hardly be further improved on clean speech for this task. Even though the improvement is small, it does demonstrate the effectiveness of applying state duration models in both the training and recognition phases.

( Fig. 2 and Table II about here )
3.4 Robustness of state duration modeling methods
When a speech recognition system is deployed in a noisy environment, the background noise causes a mismatch of statistical characteristics between the testing speech and the reference models. Owing to this environmental mismatch, some states with very high likelihood scores may dominate the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence, with state durations that are too long or too short, may be obtained even when a state duration modeling method is employed. This phenomenon causes a drastic degradation of the recognition rate. In this subsection, a series of experiments evaluates the robustness of the various state duration modeling methods in noisy environments.

In these experiments, the first two sessions of collected utterances in the database were used to train a set of word models. To generate noisy speech, noise at specific SNR values was added to the clean testing data, i.e., the third session of the database. The distorted utterances were then evaluated on their corresponding word models and decoded into state sequences. From those most likely state sequences, the state duration distributions under additive white noise can be found. Fig. 3 through Fig. 6 plot the duration distributions of the 5-th and 6-th states of the isolated Mandarin digit '4' under white noise. In addition, the recognition rates under white noise, F16 cockpit noise and babble noise for the various HMMs and VDHMMs are presented in Table III and Table IV.
( Fig. 3 - Fig. 6, Table III - Table IV about here )
The results in Table III and Table IV confirm that properly employing a duration model does improve the recognition accuracy in noisy environments, and that further improvement is obtained by using a variable duration hidden Markov model. The relative performance of the HMMs and VDHMMs in the different noisy environments is similar to that listed in Table I and Table II for clean speech recognition. It is worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD) modeling method are higher than those of the models based on the other parametric duration modeling methods. One explanation is that the BSD method is more effective than the other parametric methods at inhibiting a state from occupying too many or too few speech frames. From Fig. 3 through Fig. 6, we can also see that additive white noise distorts the duration distribution of each state of a word model. As the background becomes noisier, the duration distribution of the 5-th state of Mandarin digit '4' gradually shifts to the left, while that of the 6-th state shifts to the right. In particular, when the signal-to-noise ratio is very low, e.g., 0 dB, the duration density functions of some states become extremely concentrated at unexpected duration lengths, even with the help of the state duration modeling methods. This implies that the underlying duration density functions of those modeling methods are not robust enough to noise contamination. For some state duration modeling methods, the probability density functions are relatively smooth over the range of allowable duration lengths; this reduces the discriminativity among duration lengths in noisy environments and results in erroneous state sequences. Moreover, owing to the parametric nature, i.e., the widespread range of the state duration distribution, it is quite possible for a state to stay too long or too short in the decoded state sequence. From the above discussion, we conclude that: (1) the non-parametric duration modeling method can accurately specify the state duration distribution of each state in a hidden Markov model; (2) the duration modeling method must be applied in both the training and recognition phases so that the state duration constraints of the two phases are consistent; (3) a sharper state duration pdf may enhance the discriminativity among the allowable duration lengths; and (4) a narrow range of the state duration distribution can efficiently prevent a decoded state from being too long or too short.
4. Implementation of the VDHMM/PAD
In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997), combined with the statistics of state durations, is proposed for re-training a conventional hidden Markov model, resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding in the intermediate stage of training a word model. It produces a new set of state duration statistics in which the distribution of state duration becomes sharper and more concentrated, in line with the conclusions of the previous section. It is also worth noting that the PAD method is not used in the recognition phase. The detailed implementation of VDHMM/PAD is described below.
4.1 Formulation of the proportional alignment decoding algorithm
Consider the training of a word model $\lambda(w)$ that belongs to a set of $M$ word models. The parameter set of the word model is represented as $\lambda(w) = \{\mu_w, \Sigma_w, \mathrm{P}_w, \mathrm{A}_w, \mathrm{B}_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \le j \le S_w$ denote the mean vectors and covariance matrices of the states of word model $\lambda(w)$, and $\mathrm{P}_w = \{p_{w,j}(d)\}$, $\mathrm{A}_w = \{a_{w,ij}\}$ and $\mathrm{B}_w = \{b_{w,j}(\cdot)\}$ for $1 \le j \le S_w$ represent the probability density functions of the state durations, state transitions and state outputs, respectively. The duration pdf $p_{w,j}(d)$ is modeled by the non-parametric duration modeling method. Let $X(w) = \{X_t(w),\, 1 \le t \le N_w\}$ be the set of feature vector sequences extracted from all the training utterances of word $w$, where $X_t(w) = x^w_{t,1} x^w_{t,2} \cdots x^w_{t,K^w_t}$ denotes the feature vector sequence of the $t$-th training utterance, which has $K^w_t$ frames. In a continuous-density HMM, the output probability density $b_{w,j}(x^w_{t,k})$ is characterized by a Gaussian function:

$$b_{w,j}(x^w_{t,k}) = (2\pi)^{-D/2}\, |\Sigma_{w,j}|^{-1/2} \exp\!\left\{-\tfrac{1}{2}\,(x^w_{t,k}-\mu_{w,j})^{T}\, \Sigma_{w,j}^{-1}\,(x^w_{t,k}-\mu_{w,j})\right\}, \qquad (18)$$

where $D$ is the dimension of the feature vector $x^w_{t,k}$.

Based on the set of word models $\lambda = \{\lambda(w),\, 1 \le w \le M\}$ and the standard Viterbi decoding algorithm, the $t$-th training utterance of word $w$, $X_t(w)$, is decoded into a state sequence $q_{w,t} = q_{w,t,1}\, q_{w,t,2} \cdots q_{w,t,K^w_t}$. Let $d_{w,j,t}$ denote the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for word model $\lambda(w)$. Then the state duration mean $\bar d_{w,j}$ of state $j$ of word model $\lambda(w)$ is

$$\bar d_{w,j} = \frac{1}{N_w} \sum_{t=1}^{N_w} d_{w,j,t} \quad \text{for } 1 \le j \le S_w. \qquad (19)$$

Moreover, the word duration mean $\bar d_w$, defined as the accumulation of all the state duration means of word model $\lambda(w)$, is

$$\bar d_w = \sum_{j=1}^{S_w} \bar d_{w,j}. \qquad (20)$$

Then the state duration ratio of the $j$-th state of word model $\lambda(w)$ is

$$\Re^w_j = \frac{\bar d_{w,j}}{\bar d_w} \quad \text{for } 1 \le j \le S_w. \qquad (21)$$

Once $\Re^w_j$ is obtained for all states of every word model, the proportional alignment decoding procedure proceeds in a simple way: each training utterance of word $w$ is re-decoded into a new state sequence

$$\tilde q_{w,t} = \tilde q_{w,t,1}\, \tilde q_{w,t,2} \cdots \tilde q_{w,t,K^w_t}, \quad 1 \le w \le M,\; 1 \le t \le N_w. \qquad (22)$$

Specifically, if the $t$-th training utterance of word $w$ has a duration of $K^w_t$ frames, it is segmented into $S_w$ states according to the rule

$$x^w_{t,k} \in \Omega_v(w) \;\text{ and }\; \tilde q_{w,t,k} = v \quad \text{iff} \quad k \in \Big[\Big(\sum_{j=1}^{v-1} \Re^w_j\Big) K^w_t + 1,\; \Big(\sum_{j=1}^{v} \Re^w_j\Big) K^w_t\Big], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w),\, 1 \le v \le S_w\}$ and $\Omega_v(w)$ is the set of collected vectors belonging to state $v$ of word model $\lambda(w)$.
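The segmentation rule of Eq. (23) amounts to handing each state a share of the frames proportional to its duration ratio. A minimal sketch (ours; the rounding of the boundaries to whole frames is an implementation choice, and the ratios in the example are hypothetical):

```python
import numpy as np

def pad_segment(num_frames: int, ratios):
    """Proportional alignment decoding of Eq. (23): segment a K-frame
    utterance into S_w states, giving state v the fraction R_v of frames.

    ratios : state duration ratios R_1 .. R_{S_w} of Eq. (21), summing to 1.
    Returns a 1-based state label for every frame k = 1 .. num_frames.
    """
    cum = np.cumsum(ratios)                          # partial sums of ratios
    bounds = np.rint(cum * num_frames).astype(int)   # right boundary per state
    bounds[-1] = num_frames                          # guard against rounding drift
    labels, start = [], 0
    for v, end in enumerate(bounds, start=1):
        labels.extend([v] * (end - start))
        start = end
    return labels

print(pad_segment(10, [0.2, 0.3, 0.3, 0.2]))
# -> [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
```

Note that the segmentation depends only on the utterance length and the duration ratios, not on the acoustics of the individual frames; this is what narrows the resulting state duration distributions.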
4.2 Training procedure of VDHMM/PAD
The training procedure works as follows.

Step 1. Obtain initial word models.
Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from the training utterances of word $w$ are used to train an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \le w \le M$.

Step 2. Decode training utterances and update word models.
(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance:

$$q^{p-1}_{w,t} = \arg\max_{q_{w,t}}\{p(X_t(w) \mid q_{w,t}, \lambda^{p-1}(w)) \cdot p(q_{w,t} \mid \lambda^{p-1}(w))\}, \quad 1 \le w \le M,\; 1 \le t \le N_w. \qquad (24)$$

(2) The decoded state sequence is denoted $q^{p-1}_{w,t} = q^{p-1}_{w,t,1}\, q^{p-1}_{w,t,2} \cdots q^{p-1}_{w,t,K^w_t}$.
(3) Let $\Omega^{p-1}(w) = \{\Omega^{p-1}_j(w),\, 1 \le j \le S_w\}$, where $\Omega^{p-1}_j(w)$ is the set of vectors of state $j$ of word model $\lambda^{p-1}(w)$. A feature vector $x^w_{t,k}$ of the $k$-th frame of utterance $t$ belongs to $\Omega^{p-1}_j(w)$ if its corresponding state belongs to state $j$ of model $\lambda^{p-1}(w)$. The duration of state $j$ is then the number of vectors of utterance $t$ belonging to $\Omega^{p-1}_j(w)$, and the duration set is expressed as $d^{p-1}_t(w) = \{d^{p-1}_{w,j,t},\, 1 \le j \le S_w\}$.

Step 3. Align state sequences using the PAD method.
(1) From the duration set $d^{p-1}_t(w)$, find the state duration means $\bar d^{\,p-1}_{w,j}$, the word duration mean $\bar d^{\,p-1}_w$ and the state duration ratios $\Re^{w,p-1}_j$ of each state of word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).
(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states using Eq. (23), giving new state sequences $q^p_{w,t} = q^p_{w,t,1}\, q^p_{w,t,2} \cdots q^p_{w,t,K^w_t}$.
(3) Rearrange the sets of vectors collected in each state such that $x^w_{t,k} \in \Omega^p_j(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^p(w)$. The new duration of state $j$ in utterance $t$, $d^p_{w,j,t}$, is obtained.
(4) Use the duration set $d^p_t(w) = \{d^p_{w,j,t},\, 1 \le j \le S_w\}$ and the following equation to calculate the distribution of state duration:

$$p^p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d^p_{w,j,t})}{N_w} \quad \text{for } d \ge 1. \qquad (25)$$

(5) Use $\Omega^p(w) = \{\Omega^p_j(w),\, 1 \le j \le S_w\}$ to find the parameter set $\{\mu^p_w, \Sigma^p_w, \mathrm{A}^p_w, \mathrm{B}^p_w\}$ of word model $\lambda^p(w)$.

Step 4. Re-train the word models.
(1) Calculate the accumulated log-likelihood of $X(w)$ by

$$\Delta^p(w) \equiv \sum_{t=1}^{N_w} \log p[X_t(w) \mid \lambda^p(w)] = \sum_{t=1}^{N_w} \big\{\log p(X_t(w) \mid q^p_{w,t}, \lambda^p(w)) + \log p(q^p_{w,t} \mid \lambda^p(w))\big\}, \qquad (26)$$

where

$$p(X_t(w) \mid q^p_{w,t}, \lambda^p(w)) = \prod_{k=1}^{K^w_t} b_{w,\, q^p_{w,t,k}}(x^w_{t,k}) \qquad (27)$$

and

$$p(q^p_{w,t} \mid \lambda^p(w)) = \prod_{k=1}^{K^w_t - 1} a_{w,\, q^p_{w,t,k}\, q^p_{w,t,k+1}}. \qquad (28)$$

(2) Based on the word model $\lambda^p(w)$, use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{p+1}_{w,t} = q^{p+1}_{w,t,1}\, q^{p+1}_{w,t,2} \cdots q^{p+1}_{w,t,K^w_t}$ for the $t$-th training utterance.
(3) Collect the vectors such that $x^w_{t,k} \in \Omega^{p+1}_j(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^{p+1}(w)$.
(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.
(5) Update the accumulated log-likelihood of $X(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log p[X_t(w) \mid \lambda^{p+1}(w)], \qquad (29)$$

where the likelihood $p[X_t(w) \mid \lambda^{p+1}(w)]$ can be evaluated efficiently using Eqs. (4)-(6).
(6) Convergence test.
IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^p(w)}{\Delta^p(w)} > \Delta_{th}, \qquad (30)$$

THEN $p+1 \to p$ and repeat Steps 4.(2)-4.(6);
ELSE $\lambda^{p+1}(w) \to \lambda_{VDHMM/PAD}(w)$.
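The outer iteration of Step 4 with the convergence test of Eq. (30) can be sketched as follows (ours; `train_step` stands for sub-steps 4.(2)-4.(4), which re-decode with the three-dimensional Viterbi algorithm and re-estimate the parameters, `loglik` accumulates the log-likelihood of all training utterances, and taking the absolute value of the denominator is our guard for negative log-likelihoods):

```python
def retrain_until_converged(model, train_step, loglik, threshold=1e-3):
    """Iterate Steps 4.(2)-4.(6) until the relative improvement of the
    accumulated log-likelihood falls below a preset threshold (Eq. (30))."""
    delta_prev = loglik(model)                 # Delta^p(w) of Eq. (26)
    while True:
        model = train_step(model)              # one re-decode/re-estimate pass
        delta = loglik(model)                  # Delta^{p+1}(w) of Eq. (29)
        if (delta - delta_prev) / abs(delta_prev) <= threshold:
            return model                       # lambda_{VDHMM/PAD}(w)
        delta_prev = delta
```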
4.3 Recognition procedure of VDHMM/PAD
Consider a testing utterance $Y$ with $T_y$ frames, $Y = y_1 y_2 \cdots y_{T_y}$, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based upon the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.
Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $Y$ based on the word model $\lambda_{VDHMM/PAD}(w)$.
Step 3. Calculate the likelihood score of $Y$ for the word model $\lambda_{VDHMM/PAD}(w)$ using Eqs. (4)-(6), i.e.,

$$p[Y \mid \lambda_{VDHMM/PAD}(w)] = p[Y \mid q^{**}, \lambda_{VDHMM/PAD}(w)] \cdot p[q^{**} \mid \lambda_{VDHMM/PAD}(w)]. \qquad (31)$$

Step 4. $w + 1 \to w$. IF $w \le M$, THEN repeat Step 2 to Step 4; ELSE go to Step 5.
Step 5. Select the word whose likelihood score is highest, i.e.,

$$w^{*} = \arg\max_{w}\{p[Y \mid \lambda_{VDHMM/PAD}(w)]\}. \qquad (32)$$
5. Experiments and discussion
In this section, the procedure of Section 3.2 is used to find the distribution of state duration in the VDHMM/PAD. To demonstrate the behavior of the state duration distributions of VDHMM/PAD under white noise, the experiments of Section 3.4 are also repeated here. Fig. 7 and Fig. 8 show the state duration distributions of the seven states of the VDHMM/PAD for the isolated Mandarin digit '4' and the distortion of those distributions due to white noise contamination. The recognition rates of VDHMM/PAD under white noise, F16 cockpit noise and babble noise are listed in Table V. Furthermore, for comparison, the experimental results of Tables I-V are plotted in Fig. 9. From these results we observe the following facts:
(1) Distribution of state duration
Comparing Fig. 7 with Fig. 1 and Fig. 2, we can see that for the conventional HMM employing the various state duration modeling methods, the distribution of state duration is relatively smooth and widespread. By incorporating state duration statistics into the training phase, the variable duration HMMs make the duration distributions of some states more concentrated and sharper, which results in a higher recognition rate. In Fig. 7, for most of the states (e.g., the 2-nd, 4-th, 5-th and 6-th states) the allowable ranges of state duration modeled by VDHMM/PAD become still more concentrated, and the shapes of the state duration distributions are sharper than those of the HMMs and VDHMMs. In addition, compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in the VDHMM/PAD is more severe. This fluctuation, which also occurs in the duration distributions of the 2-nd, 4-th and 6-th states of VDHMM/Npar, is considered helpful for enhancing the discriminativity in recognizing noisy speech.
(2) Robustness to noise contamination
When the speech signal is contaminated by white noise, the state duration distributions shown in Fig. 3 through Fig. 6 are distorted. In particular, at SNR = 0 dB the duration distributions are severely distorted and become extremely concentrated at unexpected duration lengths. Taking Fig. 3 and Fig. 4 as examples, for some models (e.g., HMM/Npar, VDHMM/BSD) the duration distribution of the 5-th state concentrates excessively at a duration of 3 frames at SNR = 0 dB, while for others (e.g., HMM/Gam, VDHMM/Gau) it concentrates at a duration of one frame. Moreover, the maximum probability of the 5-th state duration increases dramatically from about 0.2-0.3 up to 0.8-1.0. In contrast to the state duration distributions of Fig. 3 through Fig. 6, Fig. 8 shows that even under white noise, the original ranges of state duration in the VDHMM/PAD remain almost unchanged and the duration distributions are less distorted by the ambient noise. When the SNR is reduced to 0 dB, the maximum probability of the 5-th state duration increases only from 0.25 to 0.45. This implies that the VDHMM/PAD is more effective than the other duration modeling methods at preventing the state duration distribution from concentrating extremely at a specific duration length.
(3) Performance of noisy speech recognition
The recognition rates listed in Table V and the performance curves in Fig. 9 show that the VDHMM/PAD outperforms the HMMs and VDHMMs employing the other duration modeling methods in noisy environments. The improvement is evident at medium SNR (10 to 15 dB) for white noise, and at low SNR (0 to 5 dB) for F16 cockpit noise and babble noise. In particular, when the distortion due to ambient noise is serious, as with white noise, the improvement in recognition rate is pronounced. The superiority of VDHMM/PAD over the other hidden Markov models we discussed is essentially due to its distinctive state duration distributions: the sharper and more concentrated duration distributions, together with the relatively more fluctuated duration density functions, give the VDHMM/PAD better discriminativity and modeling capability in noisy environments. It is noted, however, that the VDHMM/PAD performs slightly worse than the other hidden Markov models in the clean condition. The reason is as follows. The PAD method proportionally segments each training utterance into states, and this segmentation mechanism narrows the allowable ranges of some state duration distributions. A property of the VDHMM/PAD is therefore that it efficiently prevents any state from lasting too long or too short, which yields performance benefits in noisy environments; however, it also causes a duration mismatch between clean testing speech and the reference models, which makes the recognition performance of VDHMM/PAD degrade slightly in the clean condition compared with the other hidden Markov models.
( Fig. 7 - Fig. 9, Table V about here )
6. Conclusion
In this paper, we first demonstrated the distribution of state duration in a conventional HMM and compared the effectiveness of several widely used state duration modeling methods in noisy environments. Based on the weaknesses of the methods we evaluated, a proportional alignment decoding (PAD) algorithm, combined with the statistics of state duration, was then proposed for the training phase to re-train a conventional hidden Markov model and produce a new variable duration hidden Markov model (VDHMM/PAD). The PAD method makes the distribution of state duration sharper, more fluctuated and more concentrated, and thus improves the model's discriminativity among allowable duration lengths under ambient noise. Experimental results have demonstrated the robustness of VDHMM/PAD for noisy speech recognition: the proposed method provides better recognition rates than the conventional HMM and the other duration modeling methods in various noisy environments.
Acknowledgement
The authors would like to thank Dr. Lee Lee-Min of the Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his programming expertise and for many fruitful discussions.
References
Anastasakos, A., Schwartz, R. & Shu, H. (1995). Duration modeling in large vocabulary speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with expanded HMM applied to speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 548-551.

Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-utterance speech recognition using hidden Markov models with bounded state durations. IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech recognition. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang, B. H. & Rabiner, L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1639-1641, September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for isolated word recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, pp. 4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1234, July-August.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition using hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5-8.

Varga, A., Steeneken, H. J. M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, vol. 41, pp. 31-41.

Zeljkovic, I. (1996). Decoding optimal state sequences with smooth state likelihoods. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132.
Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

method            baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
recognition rate  97.2      97.6      97.5     97.4     97.2      96.8

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

method            baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
recognition rate  97.2      97.6        97.6       97.5       97.4        97.1
Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
clean   97.2      97.6      97.5     97.4     97.2      96.8
20 dB   48.8      62.0      60.9     60.4     59.6      57.0
15 dB   30.8      42.8      41.1     40.5     40.2      38.5
10 dB   19.2      26.8      25.4     24.7     25.3      23.6
5 dB    11.2      20.8      20.1     19.4     19.7      19.3
0 dB    10.0      17.6      16.4     16.0     16.0      17.6

(b) F16 cockpit noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB   92.0      95.2      93.8     93.5     93.2      92.8
15 dB   79.6      85.5      83.6     81.7     80.8      80.1
10 dB   67.6      74.7      73.2     72.8     72.5      71.6
5 dB    44.0      54.3      53.7     52.8     53.4      52.2
0 dB    15.2      25.6      23.5     22.5     22.8      22.3

(c) Babble noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB   94.8      95.9      95.6     95.4     95.2      94.9
15 dB   88.0      92.2      91.1     90.3     89.7      88.2
10 dB   75.2      80.4      79.3     76.9     77.8      75.6
5 dB    58.4      70.4      68.9     65.8     66.1      63.7
0 dB    33.2      42.8      41.4     38.6     39.3      38.5
Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
clean   97.2      97.6        97.6       97.5       97.4        97.1
20 dB   48.8      67.6        64.8       63.9       61.6        59.4
15 dB   30.8      49.2        46.8       45.9       43.6        42.1
10 dB   19.2      31.2        29.0       27.4       28.4        26.9
5 dB    11.2      24.0        22.8       21.7       22.0        20.8
0 dB    10.0      18.4        17.3       17.1       17.2        18.5

(b) F16 cockpit noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB   92.0      96.0        94.4       94.3       94.0        93.6
15 dB   79.6      86.4        84.1       82.3       81.4        80.9
10 dB   67.6      76.3        74.5       73.9       73.8        72.5
5 dB    44.0      55.3        54.8       53.5       54.2        53.0
0 dB    15.2      28.2        26.3       24.9       25.5        24.5

(c) Babble noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB   94.8      96.4        96.2       95.8       95.6        95.3
15 dB   88.0      93.5        91.8       90.9       90.6        89.4
10 dB   75.2      82.4        80.8       79.3       80.1        77.2
5 dB    58.4      71.5        69.7       66.1       67.3        65.2
0 dB    33.2      45.2        43.6       40.9       42.1        40.6

Table V. Noisy speech recognition rates (%) for VDHMM/PAD.

noise type         clean  20 dB  15 dB  10 dB  5 dB  0 dB
white noise        96.8   72.4   60.0   44.0   29.6  24.8
F16 cockpit noise  96.8   95.2   87.3   79.9   60.2  35.1
babble noise       96.8   95.9   94.4   84.7   76.4  52.1
sub-states, each sharing the original state observation probability density function (pdf). Moreover, K. Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not effective enough. Therefore, a state duration constrained maximum likelihood (SDML) training scheme (Laurila, 1997) was presented to gradually tighten the duration constraints in a hidden Markov model. Duration modeling techniques are not only applied at the state level, but can also be extended to the word level.
David Burshtein (Burshtein, 1995) used explicit models of state and word durations to reduce the string error rate in a connected digit recognition task.

In general, no matter what kind of duration modeling mechanism is employed, the probability density functions used to model state duration distributions can be roughly classified into two categories (Gu et al., 1991): non-parametric and parametric methods. In the non-parametric method, the distribution of state duration is estimated directly from the training data. Thus, we can obtain a more accurate duration distribution for each state in a word model. However, this approach needs a large number of training utterances in order to reach a desired degree of accuracy, and it also requires a considerable amount of memory for storing all the duration distributions. In the parametric method, on the other hand, specific probability density functions, such as Poisson (Russell et al., 1985 & Russell et al., 1987), gamma (Levinson, 1986 & Burshtein, 1995), Gaussian (Rabiner, 1989 & Burshtein, 1995) and bounded density functions (Gu et al., 1991, Kim et al., 1994, Vaseghi, 1995, Power, 1996 & Laurila, 1997), are used to model the state duration distributions explicitly, so that only a few parameters are required to completely specify each distribution. The parametric approach has an intuitive drawback: the assumed probability density function may not always fit the real duration distribution of each state in a hidden Markov model. Moreover, most research on modeling duration distributions has dealt with minimizing recognition errors attributed to unrealistic duration models while ignoring ambient noise. How to make a duration model more robust to noise contamination is still an open problem.

In this paper, we focus our attention on the robustness of state duration modeling in noisy environments and neglect the modeling of word duration. This is due to the fact (Burshtein, 1995) that state duration modeling is the major contributor to the improvement of recognition rate. In Section 2, some methods of state duration modeling are reviewed. Then, a series of experiments is conducted in Section 3 to compare those methods.
The behaviors of various duration models under the influence of noise contamination are also investigated there. In Section 4, based on the results obtained in the previous section, we propose a new method that combines a proportional alignment decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov model. This is the so-called variable duration hidden Markov model, denoted VDHMM/PAD. The state duration distributions of VDHMM/PAD prove to be more robust than those of other methods in noisy environments. An experiment on multi-speaker isolated Mandarin digit recognition is reported in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a conclusion is given in Section 6.

2. Overview of state duration modeling methods

When the statistics of state duration are incorporated into both the training and recognition phases of a conventional hidden Markov model, the result is a variable duration hidden Markov model (VDHMM) (Levinson, 1986 & Rabiner, 1989). In a VDHMM, the likelihood function is defined in terms of a modified forward likelihood and backward likelihood. Let $O = o_1 o_2 \ldots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,i)$ are defined as (Levinson, 1986, Rabiner, 1989 & Hung et al., 1997)

$$\alpha_t(w,j) = p(o_1 o_2 \ldots o_t, q_t(w) = j \mid \lambda(w)) = \sum_{d} \sum_{\substack{i=1 \\ i \neq j}}^{S_w} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \qquad (1)$$

and

$$\beta_t(w,i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t(w) = i, \lambda(w)) = \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d} a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t+\tau}) \cdot \beta_{t+d}(w,j), \qquad (2)$$
where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the present state at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the symbol distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the $j$-th state duration pdf of word model $\lambda(w)$ for a duration of $d$ frames. Then, given a variable duration hidden Markov model $\lambda(w)$, the likelihood function of an observation sequence $O$ can be modeled as

$$p(O \mid \lambda(w)) = \sum_{i=1}^{S_w} \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d=1}^{D(w,j)} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \cdot \beta_t(w,j), \qquad (3)$$

where $D(w,j)$ indicates the allowable maximum duration length within the $j$-th state of word model $\lambda(w)$. Based on the above definition, the derivation of the re-estimation formulas for the variable duration HMM is formally identical to that for the conventional HMM (Levinson, 1986 & Rabiner, 1989). For a left-to-right variable duration HMM without jumps, the maximum likelihood $p(O \mid \lambda(w))$ can be efficiently calculated by a three-dimensional (time, state, duration) Viterbi decoding algorithm, derived from the literature of Gu et al. (Gu et al., 1991), which can be summarized as follows: for $d = 1$,

$$\psi_t(w,j,1) = \max_{\tilde d}\{\psi_{t-1}(w, j-1, \tilde d) + \log[p_{w,j-1}(\tilde d)]\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)], \qquad (4)$$

for $d \ge 2$,

$$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)], \qquad (5)$$

and

$$p(O \mid \lambda(w)) = \max_{d}\{\psi_T(w, S_w, d) + \log[p_{w,S_w}(d)]\}, \qquad (6)$$

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state 1 to state $j-1$ along a state sequence of $(t-d)$ frames producing the observations $o_1 o_2 \ldots o_{t-d}$, and then staying at state $j$ while producing the observations $o_{t-d+1} \ldots o_t$.
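As an illustration, the following is a minimal sketch of this explicit-duration Viterbi recursion for a strict left-to-right model, assuming log-domain observation scores and duration probabilities are precomputed; the array names `log_b` and `log_p_dur` are illustrative, and the transition terms $\log a_{w,(j-1)j}$ are folded away (they are constant once self-loops are replaced by duration pdfs).

```python
import numpy as np

def duration_viterbi(log_b, log_p_dur, max_dur):
    """Three-dimensional (time, state, duration) Viterbi, Eqs. (4)-(6).

    log_b:     (S, T) array with log_b[j, t] = log b_{w,j}(o_{t+1})
    log_p_dur: (S, max_dur + 1) array with log_p_dur[j, d] = log p_{w,j}(d)
    Returns the maximum log-likelihood of the utterance given the word model.
    """
    S, T = log_b.shape
    # Prefix sums so the emission score of any contiguous segment is O(1).
    cum = np.concatenate([np.zeros((S, 1)), np.cumsum(log_b, axis=1)], axis=1)
    # delta[j, t]: best score with states 1..j completed after t frames.
    delta = np.full((S + 1, T + 1), -np.inf)
    delta[0, 0] = 0.0
    for j in range(1, S + 1):
        for t in range(1, T + 1):
            for d in range(1, min(max_dur, t) + 1):
                seg = cum[j - 1, t] - cum[j - 1, t - d]  # frames t-d+1..t in state j
                cand = delta[j - 1, t - d] + log_p_dur[j - 1, d] + seg
                if cand > delta[j, t]:
                    delta[j, t] = cand
    return delta[S, T]  # all S_w states traversed, all T frames emitted
```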
From the above description, we can see that successful modeling of the state duration distributions promotes the performance of an HMM-based speech recognizer. In general, the modeling methods for state duration can be classified into two categories: non-parametric and parametric.

2.1 Non-parametric state duration modeling method

In non-parametric approaches (Juang et al., 1985, Rabiner et al., 1985, Rabiner et al., 1988, Anastasakos et al., 1995 & Hung et al., 1997), the probabilities $p_{w,j}(d)$ describing the state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$, and $N_w$ be the total number of training utterances of the word $w$. Then the probabilities $p_{w,j}(d)$ can be estimated by

$$p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d_{w,j,t})}{N_w} \quad \text{for } d \ge 1, \qquad (7)$$

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function defined as

$$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d_{w,j,t} = d, \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

In this non-parametric approach, the accuracy of the duration model depends on the amount of training data.
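In code, Eqs. (7)-(8) amount to a relative-frequency histogram over the decoded durations of one state. A minimal sketch (the cap `max_dur` is an assumption added here to keep the array finite):

```python
import numpy as np

def estimate_duration_pmf(decoded_durations, max_dur):
    """Non-parametric duration model of Eqs. (7)-(8) for a single state.

    decoded_durations: the values d_{w,j,t} over the N_w training utterances.
    Returns p[d] for d = 1..max_dur; index 0 is unused so p[d] = p_{w,j}(d).
    """
    p = np.zeros(max_dur + 1)
    for d in decoded_durations:
        p[min(d, max_dur)] += 1.0   # Theta_d tally, clipped at max_dur
    return p / len(decoded_durations)   # divide by N_w
```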
When the amount of training data is sufficient, this modeling method can approximate the temporal characteristics of each state in a hidden Markov model well. However, the large number of parameters to be stored is one of its drawbacks. A non-parametric approach to isolated Mandarin digit recognition proposed by Hung et al. (Hung et al., 1997) showed that recognition rates were significantly improved over the conventional HMM under the influence of white noise: from 48.8% for the baseline HMM to 62.0% for the non-parametric approach when the signal is contaminated with white noise at an SNR of 20 dB.

2.2 Parametric state duration modeling methods

In parametric approaches, specific probability density functions are used to model the distribution of state duration explicitly. The parametric approach has the advantage that only a few parameters are required to completely specify the probability density function. Thus, compared with non-parametric approaches, the required memory space can be significantly reduced. One drawback of parametric duration modeling is that the assumed probability density function may not always match the actual duration distribution of each state in a hidden Markov model. Several probability density functions, including Poisson, gamma, bounded and Gaussian duration densities, have been proposed to model the distribution of state duration. Detailed formulations of these duration modeling methods are described below.

2.2.1 Poisson distribution for state duration

To characterize the duration property more effectively, M. J. Russell (Russell et al., 1985 & Russell et al., 1987) replaced the self-transition probability in the conventional HMM by a Poisson duration density function, so that there is no self-transition from a state back to itself. This is the so-called hidden semi-Markov model (HSMM).
The hidden semi-Markov model with Poisson-distributed state duration is thought to have several advantages. First, the Poisson probability density function is a plausible model for state duration. Second, only one parameter, the state duration mean, is needed to specify the distribution of state duration. Third, maximum likelihood estimation of the state duration mean can be accomplished by methods analogous to the standard Baum-Welch re-estimation process. When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$$p_{w,j}(d) = \frac{(\bar d_{w,j})^{d-1}}{(d-1)!} \cdot e^{-\bar d_{w,j}} \quad \text{for } d \ge 1, \qquad (9)$$

where $\bar d_{w,j}$ denotes the duration mean of the $j$-th state in the word model $\lambda(w)$. For comparison, the hidden Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM) with Poisson-distributed state duration were applied to a speaker-dependent isolated word recognition task (Russell et al., 1985). Experimental results for the third set of recordings showed that the error rate of the HSMM is 11.8% and 6.3% lower than those of the HMM and DTW, respectively.
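For reference, a direct transcription of Eq. (9), a Poisson pmf shifted to the support $d \ge 1$:

```python
import math

def poisson_duration_pmf(d, mean_dur):
    """Shifted Poisson state-duration pmf of Eq. (9); mean_dur is \\bar d_{w,j}."""
    if d < 1:
        return 0.0
    return (mean_dur ** (d - 1)) * math.exp(-mean_dur) / math.factorial(d - 1)
```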
2.2.2 Gamma distribution for state duration

Levinson (Levinson, 1986) first used a family of gamma probability density functions to characterize the distribution of state duration, forming a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution was considered ideally suited to specifying the duration density, since it assigns zero probability to negative duration lengths and only two parameters, the state duration mean and variance, are required to specify its distribution. Moreover, David Burshtein (Burshtein, 1995) proposed a modified Viterbi decoding algorithm that incorporates both state and word duration models for connected digit string recognition. In this approach, a duration penalty based on the gamma density function is applied at each frame transition. The modified Viterbi decoding algorithm was shown to have essentially the same computational requirements as the conventional Viterbi algorithm. The experimental results showed that the modified Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to 2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string length, compared with the baseline HMM. The gamma duration density function can be formulated as

$$p_{w,j}(d) = \frac{\xi_{w,j}^{\gamma_{w,j}} \cdot d^{\gamma_{w,j}-1} \cdot e^{-\xi_{w,j} \cdot d}}{\Gamma(\gamma_{w,j})} \quad \text{for } d \ge 1 \qquad (10)$$

with

$$\gamma_{w,j} = \frac{\bar d_{w,j} \cdot \bar d_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar d_{w,j}}{\nabla_{w,j}}, \qquad (11)$$

where $\bar d_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state in the word model $\lambda(w)$, respectively, and $\Gamma(z)$ is the gamma function defined by

$$\Gamma(z) = \int_0^\infty x^{z-1} \cdot e^{-x}\, dx \quad \text{for } z > 0. \qquad (12)$$
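Eq. (11) is simply moment matching: the shape $\gamma_{w,j}$ and rate $\xi_{w,j}$ are chosen so the gamma density reproduces the empirical duration mean and variance. A minimal sketch:

```python
import math

def gamma_duration_pdf(d, mean_dur, var_dur):
    """Gamma state-duration density of Eqs. (10)-(12) via moment matching."""
    shape = mean_dur * mean_dur / var_dur   # gamma_{w,j} = mean^2 / variance
    rate = mean_dur / var_dur               # xi_{w,j}    = mean / variance
    if d < 1:
        return 0.0
    return (rate ** shape) * (d ** (shape - 1)) * math.exp(-rate * d) / math.gamma(shape)
```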
2.2.3 Bounded state duration

Because they are continuous probability density functions, both the Poisson and gamma functions have the advantage of behaving well when only a relatively small number of training utterances is available. However, in some situations there is a possibility that the duration of some states will be too long or too short. To avoid such unexpected durations and minimize erroneous matches between testing utterances and reference models, H. Y. Gu et al. (Gu et al., 1991) proposed a hidden Markov model with bounded state duration, in which the allowable state duration is simply constrained in the recognition phase by lower and upper bounds. The probability density function for bounded state duration is modeled by

$$p_{w,j}(d) = \begin{cases} \dfrac{1}{D^{upper}_{w,j} - D^{lower}_{w,j} + 1}, & \text{if } D^{lower}_{w,j} \le d \le D^{upper}_{w,j}, \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (13)$$

where $D^{lower}_{w,j}$ and $D^{upper}_{w,j}$ are the lower and upper bounds of the state duration for state $j$ of the word model $\lambda(w)$, estimated by

$$D^{lower}_{w,j} = \min_{t=1}^{N_w} \{d_{w,j,t}\} \qquad (14)$$

and

$$D^{upper}_{w,j} = \max_{t=1}^{N_w} \{d_{w,j,t}\}. \qquad (15)$$

A series of experiments using all 408 highly confusable first-tone Mandarin syllables (Gu et al., 1991) was conducted to evaluate the effectiveness of the HMM with bounded state duration (BSD). In the discrete case, the recognition rate of the HMM with BSD is 78.5%, which is 9.0%, 6.3% and 1.9% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma-distributed state duration, respectively. In the continuous case, the recognition rate of the HMM with BSD is 88.3%, which is 6.3%, 5.9% and 3.1% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma-distributed state duration, respectively. Similar applications of bounded state duration to speech recognition can be found in Kim et al. (Kim et al., 1994), Vaseghi (Vaseghi, 1995) and Power (Power, 1996). In these, the minimum and maximum durations for each state were estimated in the training phase, and those loose state duration constraints were then used in the recognition phase. To tighten the constraints, K. Laurila (Laurila, 1997) employed the bounded state duration model in both the training and recognition phases to achieve higher consistency in state duration constraints.
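A minimal sketch of Eqs. (13)-(15), fitting the bounds from decoded durations and returning the resulting uniform pmf:

```python
def fit_bounded_duration(durations):
    """Bounded state-duration model: uniform over the observed [min, max]
    duration range of a state (Eqs. (13)-(15)), zero elsewhere."""
    lower, upper = min(durations), max(durations)   # Eqs. (14) and (15)
    def pmf(d):
        return 1.0 / (upper - lower + 1) if lower <= d <= upper else 0.0
    return lower, upper, pmf
```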
2.2.4 Gaussian distribution for state duration

A parametric approach using a Gaussian probability density function to model the state duration distributions was suggested by Rabiner (Rabiner, 1989). Moreover, David Burshtein (Burshtein, 1995) claimed that the Gaussian pdf also provides a good approximation for word duration. By modeling word duration with a Gaussian pdf, the string error rate was further reduced from 2.86% to 2.78% for the case of unknown string length, and from 1.60% to 1.59% for the case of known string length, compared with the baseline HMM. The Gaussian duration density function can be formulated as

$$p_{w,j}(d) = \frac{1}{\sqrt{2\pi \cdot \nabla_{w,j}}} \cdot \exp\Big\{-\frac{(d - \bar d_{w,j})^2}{2 \cdot \nabla_{w,j}}\Big\}. \qquad (16)$$
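For completeness, a one-line transcription of Eq. (16):

```python
import math

def gaussian_duration_pdf(d, mean_dur, var_dur):
    """Gaussian state-duration density of Eq. (16)."""
    return math.exp(-(d - mean_dur) ** 2 / (2.0 * var_dur)) / math.sqrt(2.0 * math.pi * var_dur)
```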
3. Comparison of state duration modeling methods

3.1 Databases and experimental conditions

A task of multi-speaker isolated Mandarin digit recognition was conducted to compare the state duration modeling methods described above. The database for the experiments was provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin digits in each of three sessions, for a total of 3000 utterances recorded at a sampling rate of 8 kHz. Each frame, which contained 256 samples with 128 samples of overlap, was multiplied by a 256-point Hamming window. Pre-silence and post-silence of 0.1 ~ 0.5 seconds were included. Each digit was modeled as a left-to-right HMM of 7 ~ 9 states, including the pre-silence and post-silence states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature vector was composed of 12th-order LPC-derived cepstral coefficients, 12th-order delta cepstral coefficients and one delta log-energy.

The NOISEX-92 noise database (Varga et al., 1992) was used for generating the noisy speech. In our study, three kinds of noise, namely white noise, F16 cockpit noise and babble noise, were added directly to the clean speech in the time domain to simulate speech contaminated by noise. When noise was added to the clean speech, the signal-to-noise ratio (SNR) was defined by the following equation:

$$\mathrm{SNR} = 10 \cdot \log\Big(\frac{E_s}{E_n}\Big), \qquad (17)$$

where $E_s$ is the total energy of the clean speech and $E_n$ is the energy of the added noise over the entire speech portion. The F16 cockpit noise was recorded at the co-pilot's seat of a two-seat F16 traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of the babble noise was 100 people speaking in a canteen, with individual voices slightly audible. The subsequent experiments examine the following problems: (1) the effectiveness of state duration modeling methods, (2) the incorporation of state duration modeling in the training phase, and (3) the robustness of state duration modeling methods in noisy environments.
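In practice, mixing at a target SNR means scaling the noise so that Eq. (17) holds before adding it to the clean waveform. A minimal sketch, assuming `speech` and `noise` are 1-D float arrays of equal length:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(Es/En) = snr_db (Eq. (17)), then add."""
    e_s = np.sum(speech ** 2)                 # total clean-speech energy E_s
    e_n = np.sum(noise ** 2)                  # noise energy E_n before scaling
    scale = np.sqrt(e_s / (e_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```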
3.2 Effectiveness of state duration modeling methods

The first two sessions of collected utterances in the database were used to train an initial set of word models using the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional HMM-based word model (denoted the 'Baseline' HMM) was established for each isolated Mandarin digit, the training utterances were time-aligned with their corresponding word models. Using the standard Viterbi decoding algorithm, we can re-decode each utterance into a state sequence, from which the number of frames spent in every state is known. Based on these decoded state durations, we can find the distribution of state duration for each state in a word model. This distribution can be treated as the non-parametric model of state duration and is denoted HMM/Npar. In Fig. 1, we show the duration distributions of the seven states in the HMM/Npar for isolated Mandarin digit '4'. The state duration distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in Fig. 1 for comparison, denoted HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD, respectively.

The third session of collected utterances was used as a clean version of the testing data for evaluating the effectiveness of the various state duration modeling methods. In the recognition phase, a testing utterance is decoded into a state sequence using the standard Viterbi decoding algorithm for the 'Baseline' HMM method, and using the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6), for the other state duration modeling methods. The resulting recognition rates for the various state duration modeling methods are shown in Table I.

( Fig. 1 and Table I about here )

Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. We find that the distribution of state duration differs from state to state and cannot be confined to a single type of probability density function; no single probability density function can fit the statistical characteristics of all the states in a word model. Furthermore, we also find that HMM/Gam and HMM/Gau are more capable than HMM/Pois and HMM/BSD of modeling the state duration distributions represented by HMM/Npar; in particular, the gamma function is slightly better than the Gaussian function. This result is consistent with the conclusion of David Burshtein (Burshtein, 1995) that the gamma function provides high-quality approximations for state duration and word duration. For HMM/BSD, the lower and upper bounds of state duration can prevent any state from occupying too many or too few frames. However, the state duration distribution within the allowable range is treated as a uniform distribution, which cannot approximate the actual distribution of state duration well. This does affect the performance, as shown in Table I. From the experimental results in Table I, we find that the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher recognition rates than the baseline HMM, and that the recognition rate of HMM/Gam is superior to those of the other methods. We conclude that a good modeling method for state duration can improve the recognition accuracy.
3.3 Incorporation of state duration modeling in training phase

When the statistics of state duration are considered only in the recognition phase but not in the training phase, the result is quite loose state duration constraints (Laurila, 1997). To solve this inconsistency problem, a variable duration hidden Markov model (VDHMM) (Levinson, 1986, Rabiner, 1989 & Laurila, 1997), which incorporates state duration statistics into both the training and recognition phases of a word model, has been proposed to seek further improvement in recognition accuracy. The duration distribution of each state in a word model can be obtained as follows (a sketch of this loop is given after the list):

Step 1. The segmental k-means algorithm and the standard Viterbi decoding method are used to train an initial set of word models.

Step 2. The duration statistics for each state in a word model are estimated and modeled by non-parametric or parametric methods.

Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded into a maximum likelihood state sequence.

Step 4. According to those maximum likelihood state sequences, the statistics of each state are re-calculated and the parameters of the underlying state duration model are revised.

Steps 3 and 4 are iterated several times to produce the final set of desired word models.
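One possible shape of this training loop, with the procedures of Steps 1-4 passed in as callables; `init_models`, `fit_durations`, `decode` and `reestimate` are stand-ins for the corresponding operations in the text, not definitions from the paper:

```python
def train_vdhmm(utterances, init_models, fit_durations, decode, reestimate, n_iter=3):
    """Iterative VDHMM training following Steps 1-4."""
    model = init_models(utterances)                        # Step 1
    dur_model = fit_durations(model, utterances, None)     # Step 2
    for _ in range(n_iter):                                # iterate Steps 3-4
        seqs = [decode(model, dur_model, u) for u in utterances]   # Step 3
        dur_model = fit_durations(model, utterances, seqs)         # Step 4
        model = reestimate(model, utterances, seqs)
    return model, dur_model
```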
In Fig. 2, we show the duration distributions of the seven states of those VDHMMs for isolated Mandarin digit '4' using the various state duration modeling methods. The variable duration HMMs with non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively. Moreover, the clean speech recognition rates based on the variable duration HMMs are shown in Table II. Comparing Fig. 1 and Fig. 2 reveals that the tighter duration constraints make the fluctuations of some state duration distributions in the HMM/Npar more obvious; this phenomenon can be found in the 4th, 5th and 6th states of word model '4'. In addition, the duration distributions of some states (e.g., the 3rd and 7th states) become more concentrated and sharper. Table I and Table II show that, whether employing non-parametric or parametric approaches, the VDHMM methods are better than the corresponding HMM methods. Since there are two confusion sets in Mandarin digit speech ("1" vs. "7" and "6" vs. "9"), the recognition rate can hardly be further improved on clean speech for this specific task. Even though the improvement is small, it does demonstrate the effectiveness of applying state duration models in both the training and recognition phases.

( Fig. 2 and Table II about here )

3.4 Robustness of state duration modeling methods

When a speech recognition system is deployed in a noisy environment, the background noise causes a mismatch of statistical characteristics between the testing speech and the reference models. Due to this environmental mismatch, it is quite possible that some state with very high likelihood scores will dominate the result of the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence with state durations that are too long or too short may be obtained even if a state duration modeling method is employed. This phenomenon causes a drastic degradation of the recognition rate of a speech recognizer. In this subsection, a series of experiments is conducted to evaluate the robustness of the various methods of modeling state duration in noisy environments.

In our experiments, the first two sessions of collected utterances in the database were used to train a set of word models. To generate noisy speech, noise at specific SNR values was added to the clean testing data, i.e., the third session in the database. The distorted utterances were then evaluated on their corresponding word models and decoded into state sequences. From those most likely state sequences, we can find the state duration distributions under the influence of additive white noise.
In Fig. 3 through Fig. 6, the duration distributions of the 5th and 6th states of isolated Mandarin digit '4' under the influence of white noise are plotted. In addition, the recognition rates under the influence of white noise, F16 cockpit noise and babble noise for the various HMMs and VDHMMs are presented in Table III and Table IV.

( Fig. 3 - Fig. 6, Table III - Table IV about here )

The results in Table III and Table IV confirm that properly employing a duration model does improve the recognition accuracy in noisy environments. Above all, further improvement can be obtained by using a variable duration hidden Markov model. The performances of the HMMs and VDHMMs in the different noisy environments are similar to the results listed in Table I and Table II for clean speech recognition. It is worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD) modeling method are higher than those of the models based on the other parametric duration modeling methods. One explanation is that the BSD method is more effective than the other parametric modeling methods at inhibiting a state from occupying too many or too few speech frames.

From Fig. 3 through Fig. 6, we can also see that additive white noise distorts the duration distribution of each state in a word model. As the background environment becomes noisier, the duration distribution of the 5th state of Mandarin digit "4" gradually shifts to the left, while that of the 6th state shifts to the right. In particular, when the signal-to-noise ratio is very low, e.g., 0 dB, the duration density functions of some states become extremely concentrated at unexpected duration lengths even with the help of state duration modeling methods. This implies that the underlying duration density functions of those modeling methods are not robust enough to noise contamination. For some state duration modeling methods, the probability density functions are relatively smooth over the range of allowable duration lengths. This reduces the discriminativity of duration lengths in a noisy environment and results in erroneous state sequences.
Moreover, due to their parametric nature, i.e., the widespread range of the state duration distribution, it is quite possible for a state to stay too long or too short when decoding a state sequence. From the above discussion, we conclude that:

(1) The non-parametric duration modeling method can accurately specify the state duration distribution of each state in a hidden Markov model.
(2) The duration modeling method must be applied in both the training and recognition phases so that the state duration constraints in these two phases are consistent.
(3) A sharper pdf of state duration may enhance the discriminativity of the allowable duration lengths.
(4) A narrow distribution range of state duration can efficiently prevent a decoded state from being too long or too short.

4. Implementation of the VDHMM/PAD

In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997), combined with the statistics of state durations, is proposed to re-train a conventional hidden Markov model, resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding in the intermediate stage of training a word model. It produces a new set of state duration statistics in which the distribution of state duration becomes sharper and more concentrated, which meets the conclusions of the previous section. It is also worth noting that the PAD method is not used in the recognition phase. The detailed implementation of VDHMM/PAD is described as follows.

4.1 Formulation of the proportional alignment decoding algorithm

Consider the training of a word model $\lambda(w)$ that belongs to the set of $M$ word models. The parameter set of the word model $\lambda(w)$ is represented as $\lambda(w) = \{\mu_w, \Sigma_w, \mathrm{P}_w, \mathrm{A}_w, \mathrm{B}_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \le j \le S_w$ denote the mean vector and covariance matrix of the $j$-th state in the word model $\lambda(w)$, respectively.
  • 19. 19 j-th state in the word model λ ( )w , respectively. Ρw w jp d= { ( )}, , Αw w ija= { }, and Βw w jb O= { ( )}, for 1 ≤ ≤j Sw represent the probability density functions of state durations, state transitions and state outputs for the word model λ ( )w , respectively. It is noted that the probability density function p dw j, ( ) is modeled by the non-parametric duration modeling method. Let Χ Χ( ) { ( ), }w w t Nt w= ≤ ≤1 be a set of feature vector sequences extracted from all the training utterances for the word model λ ( )w . Here, Χt w( ) denotes the feature vector sequence of the t-th training utterance which has Kt w frames, This feature vector sequence can be expressed as Χt t w t w t K w w x x x t w( ) , , , = ⋅⋅⋅1 2 . Then, in a continuous-density HMM, the output probability density function, b xw j t k w , ,( ) , can be characterized by a Gaussian function defined as follows : b xw j t k w w j D , , ,( ) ( )= ⋅ ⋅ − − 2 2 1 2 π Σ exp{ ( ) ( ) ( )}, , , , ,− ⋅ − ⋅ ⋅ −−1 2 1 x xt k w w j T w j t k w w jµ µΣ , (18) where D is the dimension of feature vector xt k w , . Based on the set of word models λ λ= ≤ ≤{ ( ), }w w M1 and the standard Viterbi decoding algorithm, we can decode the t-th training utterance of word w, X wt ( ) , into a state sequence q q q qw t w t w t w t K t w , , , , , , , = ⋅ ⋅ ⋅1 2 . Assume dw j t, , denotes the duration of state j in the maximum likelihood state sequence of the t-th training utterance for the word model λ ( )w . Then, the state duration mean, d w j, , of state j in the word model λ ( )w is formulated as d N dw j w w j t t Nw , , ,= = ∑ 1 1 for 1 ≤ ≤j Sw . (19) Moreover, The word duration mean d w defined as the accumulation of all the state duration means in the word model λ ( )w can also be expressed as
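Eq. (18) is typically evaluated in the log domain to avoid numerical underflow when many frame probabilities are multiplied together, as in Eq. (27) below. The following is a minimal sketch of that computation; the function name and the use of NumPy are our own illustrative choices, not part of the original system.

```python
import numpy as np

def log_gaussian(x, mu, cov):
    """Log of the Gaussian output density in Eq. (18) for one frame.

    x, mu : (D,) feature vector and state mean.
    cov   : (D, D) state covariance matrix (assumed positive definite).
    """
    D = x.shape[0]
    diff = x - mu
    # log|Sigma| via a Cholesky factor is numerically safer than np.linalg.det.
    chol = np.linalg.cholesky(cov)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    # Mahalanobis term without forming the explicit inverse of Sigma.
    mahal = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + mahal)
```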
Based on the set of word models $\lambda = \{\lambda(w),\ 1 \le w \le M\}$ and the standard Viterbi decoding algorithm, we can decode the $t$-th training utterance of word $w$, $X_t(w)$, into a state sequence $q_{w,t} = q_{w,t,1}\,q_{w,t,2} \cdots q_{w,t,K_t^w}$. Let $d_{w,j,t}$ denote the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$. Then the state duration mean $\bar{d}_{w,j}$ of state $j$ in the word model $\lambda(w)$ is formulated as

$$\bar{d}_{w,j} = \frac{1}{N_w} \sum_{t=1}^{N_w} d_{w,j,t}, \qquad 1 \le j \le S_w. \qquad (19)$$

Moreover, the word duration mean $\bar{d}_w$, defined as the accumulation of all the state duration means in the word model $\lambda(w)$, can be expressed as

$$\bar{d}_w = \sum_{j=1}^{S_w} \bar{d}_{w,j}. \qquad (20)$$

Then the state duration ratio of the $j$-th state to the whole word model $\lambda(w)$ can be calculated by

$$\Re_j^w = \frac{\bar{d}_{w,j}}{\bar{d}_w}, \qquad 1 \le j \le S_w. \qquad (21)$$

Once we obtain $\Re_j^w$ for all states in every word model, the proportional alignment decoding procedure proceeds in a simple way: each training utterance of word $w$ is re-decoded into a new state sequence $\tilde{q}_{w,t}$, where

$$\tilde{q}_{w,t} = \tilde{q}_{w,t,1}\,\tilde{q}_{w,t,2} \cdots \tilde{q}_{w,t,K_t^w}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (22)$$

For example, if the $t$-th training utterance of word $w$ has a duration of $K_t^w$ frames, we segment this training utterance into $S_w$ states according to the following rule:

$$x_{t,k}^w \in \Omega_v(w)\ \text{and}\ \tilde{q}_{w,t,k} = v \quad \text{iff} \quad k \in \Big[\Big(\sum_{j=1}^{v-1} \Re_j^w\Big) K_t^w + 1,\ \Big(\sum_{j=1}^{v} \Re_j^w\Big) K_t^w\Big], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w),\ 1 \le v \le S_w\}$ and $\Omega_v(w)$ is the set of feature vectors collected for state $v$ in the word model $\lambda(w)$.
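To make the formulation concrete, the sketch below computes the duration ratios of Eqs. (19)-(21) from a table of decoded state durations and then applies the proportional segmentation rule of Eqs. (22)-(23) to one utterance. It is a minimal illustration under our own assumptions (NumPy arrays and rounded frame boundaries; the paper does not spell out a rounding convention), not the authors' exact implementation.

```python
import numpy as np

def duration_ratios(durations):
    """Eqs. (19)-(21): state duration ratios from decoded durations.

    durations : (N_w, S_w) array; durations[t, j] is the duration of state j
                in the decoded state sequence of training utterance t.
    """
    state_means = durations.mean(axis=0)   # Eq. (19): state duration means
    word_mean = state_means.sum()          # Eq. (20): word duration mean
    return state_means / word_mean         # Eq. (21): ratios R_j^w

def proportional_alignment(num_frames, ratios):
    """Eqs. (22)-(23): re-decode an utterance of num_frames frames into a
    state sequence by proportional segmentation.  With this simple rounding,
    a state with a very small ratio may receive no frames."""
    bounds = np.rint(np.cumsum(ratios) * num_frames).astype(int)
    bounds[-1] = num_frames                # ensure every frame is assigned
    states = np.empty(num_frames, dtype=int)
    start = 0
    for v, end in enumerate(bounds):       # state indices v = 0 .. S_w - 1
        states[start:end] = v
        start = end
    return states

# e.g. ratios = duration_ratios(durs); seq = proportional_alignment(120, ratios)
```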
4.2 Training procedure of VDHMM/PAD

The training procedure works as follows.

Step 1. Obtain initial word models.
Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from the training utterances of word $w$ are used to train an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \le w \le M$.

Step 2. Decode training utterances and update word models.
(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance, such that

$$q_{w,t}^{p-1} = \arg\max_{q}\{p(X_t(w)\,|\,q, \lambda^{p-1}(w)) \cdot p(q\,|\,\lambda^{p-1}(w))\}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (24)$$

(2) The decoded state sequence is denoted as $q_{w,t}^{p-1} = q_{w,t,1}^{p-1}\,q_{w,t,2}^{p-1} \cdots q_{w,t,K_t^w}^{p-1}$.
(3) Let $\Omega^{p-1}(w) = \{\Omega_j^{p-1}(w),\ 1 \le j \le S_w\}$, where $\Omega_j^{p-1}(w)$ is the set of vectors of state $j$ in the word model $\lambda^{p-1}(w)$. The feature vector $x_{t,k}^w$ of the $k$-th frame in utterance $t$ belongs to $\Omega_j^{p-1}(w)$ if its corresponding state is state $j$ in the model $\lambda^{p-1}(w)$. The duration of state $j$ is then the number of vectors in utterance $t$ belonging to $\Omega_j^{p-1}(w)$, and the duration set is expressed as $d_t^{p-1}(w) = \{d_{w,j,t}^{p-1},\ 1 \le j \le S_w\}$.

Step 3. Align state sequences using the PAD method.
(1) Based on the duration set $d_t^{p-1}(w)$, we can find the state duration mean $\bar{d}_{w,j}^{p-1}$, the word duration mean $\bar{d}_w^{p-1}$ and the state duration ratio $\Re_j^{w,p-1}$ for each state in the word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).
(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states using Eq. (23). Thus we can find the new state sequences $q_{w,t}^{p} = q_{w,t,1}^{p}\,q_{w,t,2}^{p} \cdots q_{w,t,K_t^w}^{p}$.
(3) Rearrange the sets of vectors collected in each state such that $x_{t,k}^w \in \Omega_j^p(w)$ if its corresponding state is state $j$ of the model $\lambda^p(w)$. The new duration of state $j$ in utterance $t$, $d_{w,j,t}^p$, is thereby obtained.
(4) Use the duration set $d_t^p(w) = \{d_{w,j,t}^p,\ 1 \le j \le S_w\}$ and the following equation to calculate the distribution of state duration (a short sketch of this histogram estimate is given after this step):

$$p_{w,j}^p(d) = \frac{1}{N_w} \sum_{t=1}^{N_w} \Theta_d(d_{w,j,t}^p), \qquad d \ge 1, \qquad (25)$$

where $\Theta_d(\cdot)$ equals one when its argument equals $d$ and zero otherwise.
(5) Use $\Omega^p(w) = \{\Omega_j^p(w),\ 1 \le j \le S_w\}$ to find the parameter set $\{\mu_w^p, \Sigma_w^p, A_w^p, B_w^p\}$ of the word model $\lambda^p(w)$.
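The histogram estimate of Eq. (25) used in Step 3.(4) can be sketched as follows; the array layout and the optional cap on the maximum duration are our own conveniences, not specified by the paper.

```python
import numpy as np

def duration_histogram(durations, max_d=None):
    """Eq. (25): non-parametric duration pdf of one state.

    durations : (N_w,) integer durations d^p_{w,j,t} of state j over all
                training utterances of word w.
    Returns p[d] for d = 0 .. max_d, where p[d] = (1/N_w) * #{t : d_t = d}.
    """
    durations = np.asarray(durations, dtype=int)
    if max_d is None:
        max_d = durations.max()
    counts = np.bincount(durations, minlength=max_d + 1)
    return counts / durations.size

# e.g. p = duration_histogram([3, 4, 4, 5, 6]); p[4] == 0.4
```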
Step 4. Re-train the word models.
(1) Calculate the accumulated log-likelihood of $X(w)$ by (a small sketch of this computation follows this step)

$$\Delta^p(w) \equiv \sum_{t=1}^{N_w} \log p(X_t(w)\,|\,\lambda^p(w)) = \sum_{t=1}^{N_w} \{\log p(X_t(w)\,|\,q_{w,t}^p, \lambda^p(w)) + \log p(q_{w,t}^p\,|\,\lambda^p(w))\}, \qquad (26)$$

where

$$p(X_t(w)\,|\,q_{w,t}^p, \lambda^p(w)) = \prod_{k=1}^{K_t^w} b_{w,q_{t,k}^p}(x_{t,k}^w) \qquad (27)$$

and

$$p(q_{w,t}^p\,|\,\lambda^p(w)) = \prod_{k=1}^{K_t^w - 1} a_{w,q_{t,k}^p,q_{t,k+1}^p}. \qquad (28)$$

(2) Based on the word model $\lambda^p(w)$, we use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q_{w,t}^{p+1} = q_{w,t,1}^{p+1}\,q_{w,t,2}^{p+1} \cdots q_{w,t,K_t^w}^{p+1}$ for the $t$-th training utterance.
(3) Collect the vectors such that $x_{t,k}^w \in \Omega_j^{p+1}(w)$ if its corresponding state is state $j$ of the model $\lambda^{p+1}(w)$.
(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.
(5) Update the accumulated log-likelihood of $X(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log p(X_t(w)\,|\,\lambda^{p+1}(w)), \qquad (29)$$

where the likelihood function $p(X_t(w)\,|\,\lambda^{p+1}(w))$ can be evaluated efficiently by using Eqs. (4)-(6).
(6) Convergence testing. IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^p(w)}{\Delta^p(w)} > \Delta_{th}, \qquad (30)$$

THEN set $p + 1 \to p$ and repeat Steps 4.(2)-4.(6); ELSE set $\lambda^{p+1}(w) \to \lambda_{VDHMM/PAD}(w)$.
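As a concrete reading of Eqs. (26)-(28) used in Step 4.(1), the sketch below accumulates the log-likelihood of one utterance along a given state path. The array layout is our own assumption.

```python
import numpy as np

def path_log_likelihood(log_b, log_a, states):
    """One term of Eq. (26): log-likelihood of an utterance along a state path.

    log_b  : (K, S) array; log_b[k, j] = log b_{w,j}(x_{t,k}) per Eq. (18).
    log_a  : (S, S) array of log state transition probabilities.
    states : (K,) decoded state sequence q_{w,t}.
    """
    K = states.shape[0]
    emit = log_b[np.arange(K), states].sum()       # log of Eq. (27)
    trans = log_a[states[:-1], states[1:]].sum()   # log of Eq. (28)
    return emit + trans
```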
4.3 Recognition procedure of VDHMM/PAD

Consider a testing utterance $Y = y_1 y_2 \cdots y_{T_y}$ with $T_y$ frames, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based on the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.
Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $Y$ based on the word model $\lambda_{VDHMM/PAD}(w)$.
Step 3. Calculate the likelihood score of $Y$ for the word model $\lambda_{VDHMM/PAD}(w)$ by using Eqs. (4)-(6), i.e.,

$$p(Y\,|\,\lambda_{VDHMM/PAD}(w)) = p(Y\,|\,q^{**}, \lambda_{VDHMM/PAD}(w)) \cdot p(q^{**}\,|\,\lambda_{VDHMM/PAD}(w)). \qquad (31)$$

Step 4. Set $w + 1 \to w$. IF $w \le M$, THEN repeat Steps 2 to 4; ELSE go to Step 5.
Step 5. Select the word whose likelihood score is highest (a compact sketch of this loop is given after this step), i.e.,

$$w^* = \arg\max_{w}\{p(Y\,|\,\lambda_{VDHMM/PAD}(w))\}. \qquad (32)$$
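The following is a minimal sketch of the recognition loop in Steps 1-5; score_word is our placeholder for the three-dimensional Viterbi scoring of Eqs. (31) and (4)-(6), not a function from the paper.

```python
def recognize(utterance, word_models, score_word):
    """Steps 1-5: pick the word model with the highest likelihood score.

    word_models : list of VDHMM/PAD word models, one per vocabulary word.
    score_word  : callable(utterance, model) -> log-likelihood score,
                  standing in for Eqs. (31) and (4)-(6).
    """
    scores = [score_word(utterance, m) for m in word_models]  # Steps 2-4
    return max(range(len(scores)), key=lambda w: scores[w])   # Eq. (32)
```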
5. Experiments and discussion

In this section, the same procedure described in Section 3.2 is used to find the distribution of state duration in the VDHMM/PAD. Moreover, to demonstrate the behavior of the state duration distributions of the VDHMM/PAD under the influence of white noise, the same experiments conducted in Section 3.4 are repeated here. Fig. 7 and Fig. 8 show the state duration distributions of the seven states in the VDHMM/PAD for the isolated Mandarin digit '4' and the distorted state duration distributions caused by white noise contamination. The recognition rates of the VDHMM/PAD under the influence of white noise, F16 cockpit noise and babble noise are listed in Table V. Furthermore, for comparison, the experimental results listed in Tables I-V are plotted together in Fig. 9. From these experimental results we observe the following facts:

(1) Distribution of state duration

Comparing Fig. 7 with Fig. 1 and Fig. 2, we find that for a conventional HMM employing the various state duration modeling methods, the distribution of state duration is relatively smooth and widespread. By incorporating state duration statistics into the training phase, the variable duration HMMs make the duration distributions of some states more concentrated and sharper, which results in a higher recognition rate. In Fig. 7, we can observe that for most states (e.g., the 2-nd, 4-th, 5-th and 6-th states) the allowable ranges of state duration modeled by the VDHMM/PAD become more concentrated, and the shapes of the state duration distributions are sharper than those of the HMMs and VDHMMs. In addition, compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in the VDHMM/PAD is more pronounced. This fluctuation also occurs in the duration distributions of the 2-nd, 4-th and 6-th states of the VDHMM/Npar and is considered helpful for enhancing discriminability in recognizing noisy speech.

(2) Robustness to noise contamination

When the speech signal is contaminated by white noise, the state duration distributions shown in Figs. 3 through 6 are distorted. In particular, at SNR = 0 dB the duration distributions are severely distorted and concentrate extremely at some unexpected duration lengths. Taking Fig. 3 and Fig. 4 as examples, we find that for some models (e.g., HMM/Npar, VDHMM/BSD) the duration distribution of the 5-th state concentrates excessively at a duration of 3 frames at SNR = 0 dB, while for the others (e.g., HMM/Gam, VDHMM/Gau) it concentrates at a duration of one frame. Moreover, the maximum probability of the 5-th state duration increases dramatically from about 0.2~0.3 up to 0.8~1.0. In contrast to the state duration distributions in Figs. 3 through 6, we observe from Fig. 8 that even under white noise, the original ranges of state duration in the VDHMM/PAD remain almost unchanged and the duration distributions are far less distorted by the ambient noise. When the SNR is reduced to 0 dB, the maximum probability of the 5-th state duration increases only from 0.25 to 0.45. This implies that the VDHMM/PAD is more effective than the other duration modeling methods at preventing the state duration distribution from concentrating extremely at a specific duration length.
(3) Performance of noisy speech recognition

The recognition rates listed in Table V and the performance curves in Fig. 9 show that the VDHMM/PAD outperforms the HMMs and VDHMMs employing the other duration modeling methods in noisy environments. The improvement is evident at medium SNR (10 to 15 dB) in the case of white noise and at low SNR (0 to 5 dB) in the cases of F16 cockpit noise and babble noise. In particular, when the distortion due to ambient noise is severe, as it is for white noise, the improvement in recognition rate is substantial. The superiority of the VDHMM/PAD over the other hidden Markov models discussed here is essentially due to its distinctive state duration distributions. It is evident that the sharper and more concentrated duration distributions, together with the more fluctuated duration density functions, give the VDHMM/PAD better discriminability and modeling capability in noisy environments. On the other hand, the VDHMM/PAD performs slightly worse than the other hidden Markov models in the clean condition. The reason can be explained as follows. The PAD method proportionally segments each training utterance into states, and this segmentation mechanism narrows the allowable ranges of some state duration distributions. A property of the VDHMM/PAD is therefore that it can efficiently prevent any state from occupying too many or too few frames, which yields performance gains in noisy environments. However, the same mechanism also causes a duration mismatch between clean testing speech and the reference models, which degrades the recognition performance of the VDHMM/PAD slightly in the clean condition compared with the other hidden Markov models.

( Fig. 7 - Fig. 9, Table V about here )

6. Conclusion

In this paper, we first demonstrated the distribution of state duration in a conventional HMM and compared the effectiveness and performance of several widely used state duration modeling methods in noisy environments.
Based on the weaknesses of the modeling methods we evaluated, a proportional alignment decoding (PAD) algorithm combined with the statistics of state duration was then proposed for the training phase, re-training a conventional hidden Markov model into a new variable duration hidden Markov model (VDHMM/PAD). The PAD method makes the distribution of state duration sharper, more fluctuated and more concentrated, and thus improves the model's discriminability among allowable duration lengths under the influence of ambient noise. Experimental results have demonstrated the robustness of the VDHMM/PAD in noisy speech recognition. The proposed method provides better recognition rates than the conventional HMM and the other duration modeling methods in various noisy environments.

Acknowledgement

The authors would like to thank Dr. Lee Lee-Min of Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his programming experience and for many fruitful discussions.

References

Anastasakos, A., Schwartz, R. & Shu, H. (1995). Duration modeling in large vocabulary speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with expanded HMM applied to speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 548-551.

Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-utterance speech recognition using hidden Markov models with bounded state durations. IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech recognition. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang, B. H. & Rabiner, L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1639-1641, September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for isolated word recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, pp. 4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1234, July-August.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition using hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5-8.

Varga, A., Steeneken, H. J. M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, vol. 41, pp. 31-41.

Zeljkovic, I. (1996). Decoding optimal state sequence with smooth state likelihoods. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132.
Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

method             baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
recognition rate       97.2      97.6     97.5     97.4      97.2     96.8

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

method             baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
recognition rate       97.2        97.6       97.6       97.5        97.4       97.1

Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods. (a) White noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
clean        97.2      97.6     97.5     97.4      97.2     96.8
20 dB        48.8      62.0     60.9     60.4      59.6     57.0
15 dB        30.8      42.8     41.1     40.5      40.2     38.5
10 dB        19.2      26.8     25.4     24.7      25.3     23.6
5 dB         11.2      20.8     20.1     19.4      19.7     19.3
0 dB         10.0      17.6     16.4     16.0      16.0     17.6

Table III. (b) F16 cockpit noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB        92.0      95.2     93.8     93.5      93.2     92.8
15 dB        79.6      85.5     83.6     81.7      80.8     80.1
10 dB        67.6      74.7     73.2     72.8      72.5     71.6
5 dB         44.0      54.3     53.7     52.8      53.4     52.2
0 dB         15.2      25.6     23.5     22.5      22.8     22.3
Table III. (c) Babble noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB        94.8      95.9     95.6     95.4      95.2     94.9
15 dB        88.0      92.2     91.1     90.3      89.7     88.2
10 dB        75.2      80.4     79.3     76.9      77.8     75.6
5 dB         58.4      70.4     68.9     65.8      66.1     63.7
0 dB         33.2      42.8     41.4     38.6      39.3     38.5

Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods. (a) White noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
clean        97.2        97.6       97.6       97.5        97.4       97.1
20 dB        48.8        67.6       64.8       63.9        61.6       59.4
15 dB        30.8        49.2       46.8       45.9        43.6       42.1
10 dB        19.2        31.2       29.0       27.4        28.4       26.9
5 dB         11.2        24.0       22.8       21.7        22.0       20.8
0 dB         10.0        18.4       17.3       17.1        17.2       18.5

Table IV. (b) F16 cockpit noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB        92.0        96.0       94.4       94.3        94.0       93.6
15 dB        79.6        86.4       84.1       82.3        81.4       80.9
10 dB        67.6        76.3       74.5       73.9        73.8       72.5
5 dB         44.0        55.3       54.8       53.5        54.2       53.0
0 dB         15.2        28.2       26.3       24.9        25.5       24.5
Table IV. (c) Babble noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB        94.8        96.4       96.2       95.8        95.6       95.3
15 dB        88.0        93.5       91.8       90.9        90.6       89.4
10 dB        75.2        82.4       80.8       79.3        80.1       77.2
5 dB         58.4        71.5       69.7       66.1        67.3       65.2
0 dB         33.2        45.2       43.6       40.9        42.1       40.6

Table V. Noisy speech recognition rates (%) for the VDHMM/PAD.

noise type          clean  20 dB  15 dB  10 dB  5 dB  0 dB
white noise          96.8   72.4   60.0   44.0  29.6  24.8
F16 cockpit noise    96.8   95.2   87.3   79.9  60.2  35.1
babble noise         96.8   95.9   94.4   84.7  76.4  52.1