Improvement of noisy speech recognition using a proportional alignment
decoding algorithm in the training phase
Wei-Wen Hung
Department of Electrical Engineering
Ming Chi Institute of Technology
Taishan, Taiwan, 243 ROC
E-mail : wwhung@ccsun.mit.edu.tw
FAX : 886-02-2903-6852
Tel. : 886-02-2906-0379
and
Hsiao-Chuan Wang
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, Taiwan, 30043 ROC
E-mail : hcwang@ee.nthu.edu.tw
FAX : 886-03-571-5971
Tel. : 886-03-574-2587
Corresponding author: Hsiao-Chuan Wang
Abstract
Modeling the state duration of hidden Markov models (HMMs) can effectively improve the accuracy of decoding the state sequence of an utterance and thereby improve speech recognition accuracy. However, when a speech signal is contaminated by ambient noise, the decoded state sequence may be distorted: it may stay in some states too long or too short, even with the help of state duration models. This paper presents a proportional alignment decoding (PAD) algorithm for re-training hidden Markov models. A task of multi-speaker isolated Mandarin digit recognition was conducted to demonstrate the effectiveness and robustness of the PAD-based variable duration hidden Markov model (VDHMM/PAD). Experimental results show that the discriminativity of VDHMM/PAD in noisy environments is significantly enhanced, and that the proposed method outperforms the widely used state duration modeling methods based on Poisson, gamma, Gaussian, bounded and non-parametric probability density functions.
This research has been partially sponsored by the National Science Council, Taiwan,
ROC, under contract number NSC-85-2221-E-007-005.
1. Introduction
Hidden Markov model (HMM) is a well-known and widely used statistical approach to speech recognition. This method provides a powerful framework for modeling time-varying speech signals. One of the advantages of the HMM is that it characterizes speech as a parametric stochastic process whose parameters can be optimized by the expectation-maximization (EM) algorithm. In addition, the quality of an HMM can be significantly improved by incorporating state duration information (Rabiner, 1989). In a conventional hidden Markov model, the probability of staying in state $i$ for $d$ frames is modeled by $p_i(d) = (a_{ii})^{d-1}\,(1-a_{ii})$, where $a_{ii}$ is the self-transition probability of state $i$ and $(1-a_{ii})$ the probability of leaving state $i$. This inherent temporal characteristic implies that the state duration in a conventional HMM is exponentially distributed, which does not adequately model the temporal structures of the different acoustic regions of a speech signal (Juang et al., 1985; Rabiner et al., 1985; Rabiner et al., 1988). To cope with this deficiency, several modeling methods for state duration and word duration have been proposed.
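As a worked illustration of this point, the following minimal Python sketch (ours, not from the paper; the value of $a_{ii}$ is hypothetical) computes the implicit duration pdf $p_i(d) = (a_{ii})^{d-1}(1-a_{ii})$ and shows that it decays monotonically, assigning its largest probability to a duration of a single frame regardless of the typical duration of the acoustic region:

```python
def implicit_duration_pdf(a_ii: float, d: int) -> float:
    """Probability that a conventional HMM stays in a state with
    self-transition probability a_ii for exactly d frames (d >= 1)."""
    return (a_ii ** (d - 1)) * (1.0 - a_ii)

a_ii = 0.8  # hypothetical self-transition probability
print([round(implicit_duration_pdf(a_ii, d), 3) for d in range(1, 9)])
# -> [0.2, 0.16, 0.128, 0.102, 0.082, 0.066, 0.052, 0.042]: geometric decay
```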
Bonafonte et al. (1996) used a Markov chain to model the occupancy of the HMM states, with the parameters of the Markov chain estimated directly from the duration data. To reduce the insertion error rate in connected digit recognition, Power (1996) proposed an expanded-state duration model in which each state is expanded into multiple sub-states, each sharing the original state observation probability density function (pdf). Moreover, Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not effective enough; a state duration constrained maximum likelihood (SDML) training scheme (Laurila, 1997) was therefore presented to gradually tighten the duration constraints of a hidden Markov model. Duration modeling is not only applied at the state level, but can also be extended to the word level.
Burshtein (1995) used explicit models of state and word durations to reduce the string error rate in a connected digit recognition task. In general, no matter what kind of duration modeling mechanism is employed, the probability density functions used for modeling state duration distributions can be roughly classified into two categories (Gu et al., 1991): non-parametric and parametric methods. In the non-parametric method, the distribution of state duration is estimated directly from the training data, so a more accurate duration distribution can be obtained for each state of a word model. However, this approach needs a large amount of training utterances to reach a desired degree of accuracy, and it also requires a considerable amount of memory for storing all the duration distributions. In the parametric method, on the other hand, a specific probability density function, such as the Poisson (Russell et al., 1985; Russell et al., 1987), gamma (Levinson, 1986; Burshtein, 1995), Gaussian (Rabiner, 1989; Burshtein, 1995) or bounded density function (Gu et al., 1991; Kim et al., 1994; Vaseghi, 1995; Power, 1996; Laurila, 1997), is used to model the state duration distribution explicitly, so that only a few parameters are required to specify the distribution completely. The parametric approach, however, has an obvious drawback: the assumed probability density function may not always fit the real duration distribution of each state in a hidden Markov model.
Most research on modeling duration distributions has dealt with minimizing the recognition errors attributed to unrealistic duration models, without taking ambient noise into account. How to make a duration model more robust to noise contamination remains an open problem. In this paper, we focus on the robustness of state duration modeling in noisy environments and neglect word duration modeling, since state duration modeling is the major contributor to the improvement of recognition rate (Burshtein, 1995).
In Section 2, some methods of state duration modeling are reviewed. In Section 3, a series of experiments is conducted to compare those methods, and the behavior of the various duration models under noise contamination is investigated. In Section 4, based on the results of the previous section, we propose a new method that combines a proportional alignment decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov model. The resulting model is a variable duration hidden Markov model, denoted VDHMM/PAD. Its state duration distributions are shown to be more robust in noisy environments than those of the other methods. An experiment on multi-speaker isolated Mandarin digit recognition is conducted in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a conclusion is given in Section 6.
2. Overview of state duration modeling methods
When the statistics of state duration are incorporated into both the training and recognition phases of a conventional hidden Markov model, the result is a variable duration hidden Markov model (VDHMM) (Levinson, 1986; Rabiner, 1989). In a VDHMM, the likelihood function is defined in terms of modified forward and backward likelihoods. Let $O = o_1 o_2 \ldots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,j)$ are defined as (Levinson, 1986; Rabiner, 1989; Hung et al., 1997)

$$\alpha_t(w,j) = p(o_1 o_2 \ldots o_t,\; q_t = j \mid \lambda(w)) = \sum_{d}\;\sum_{\substack{i=1 \\ i \neq j}}^{S_w} \alpha_{t-d}(w,i)\, a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \qquad (1)$$
and

$$\beta_t(w,i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i,\; \lambda(w)) = \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d} a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t+\tau})\; \beta_{t+d}(w,j), \qquad (2)$$
where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the state occupied at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the observation distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the duration pdf of the $j$-th state of word model $\lambda(w)$ for a duration of $d$ frames. Then, given a variable duration hidden Markov model $\lambda(w)$, the likelihood of an observation sequence $O$ can be modeled as
$$p(O \mid \lambda(w)) = \sum_{i=1}^{S_w}\; \sum_{\substack{j=1 \\ j \neq i}}^{S_w}\; \sum_{d=1}^{D(w,j)} \alpha_{t-d}(w,i)\, a_{w,ij}\, p_{w,j}(d) \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau})\; \beta_t(w,j), \qquad (3)$$
where $D(w,j)$ denotes the maximum allowable duration of the $j$-th state of word model $\lambda(w)$. Based on the above definitions, the derivation of the re-estimation formulas for the variable duration HMM is formally identical to that for the conventional HMM (Levinson, 1986; Rabiner, 1989). For a left-to-right variable duration HMM without jumps, the maximum likelihood $p(O \mid \lambda(w))$ can be computed efficiently by a three-dimensional (time, state, duration) Viterbi decoding algorithm derived from Gu et al. (1991), which can be summarized as follows:

for $d = 1$,

$$\psi_t(w,j,1) = \max_{\tilde d}\{\psi_{t-1}(w,j-1,\tilde d) + \log[p_{w,j-1}(\tilde d)]\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)], \qquad (4)$$

for $d \ge 2$,

$$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)], \qquad (5)$$

and

$$p(O \mid \lambda(w)) = \max_{d=1}^{T}\{\psi_T(w,S_w,d) + \log[p_{w,S_w}(d)]\}, \qquad (6)$$

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state $1$ to state $j-1$ along a state sequence of $(t-d)$ frames producing the observations $o_1 o_2 \ldots o_{t-d}$, and then staying in state $j$ and producing the observations $o_{t-d+1} \ldots o_t$ in that state. From the above description, it is clear that successful modeling of the state duration distributions will promote the performance of an HMM-based speech recognizer.
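The following NumPy sketch (ours; the array layout, the maximum-duration truncation `d_max`, and the assumption that the utterance starts in state 1 are implementation choices not spelled out in the paper) outlines Eqs. (4)-(6) for a left-to-right model without jumps, with all quantities in the log domain:

```python
import numpy as np

def duration_viterbi(log_b, log_a, log_p, d_max):
    """Three-dimensional (time, state, duration) Viterbi, Eqs. (4)-(6).

    log_b : (T, S) log observation likelihoods, log_b[t, j] = log b_j(o_t)
    log_a : (S,)   log transition into state j from j-1 (log_a[0] unused)
    log_p : (S, d_max + 1) log duration pdfs, log_p[j, d] = log p_j(d)
    Returns max_d { psi_T(S, d) + log p_S(d) }, i.e. Eq. (6).
    """
    T, S = log_b.shape
    # psi[j, d]: best log-likelihood ending at the current frame while
    # having occupied state j for exactly d frames
    psi = np.full((S, d_max + 1), -np.inf)
    psi[0, 1] = log_b[0, 0]                 # start in state 1 at t = 1
    for t in range(1, T):
        new = np.full((S, d_max + 1), -np.inf)
        for j in range(S):
            # Eq. (5): stay in state j for one more frame
            new[j, 2:] = psi[j, 1:d_max] + log_b[t, j]
            if j > 0:
                # Eq. (4): enter state j, closing the duration of state j-1
                best = np.max(psi[j - 1, 1:] + log_p[j - 1, 1:])
                new[j, 1] = best + log_a[j] + log_b[t, j]
        psi = new
    return np.max(psi[S - 1, 1:] + log_p[S - 1, 1:])   # Eq. (6)
```

Compared with the standard Viterbi recursion, the extra duration axis multiplies the state space by `d_max`, which is why the duration is bounded by $D(w,j)$ in Eq. (3).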
In general, the modeling methods for state duration can be classified into two categories, i.e.,
non-parametric and parametric modeling methods.
2.1 Non-parametric state duration modeling method
In non-parametric approaches (Juang et al., 1985; Rabiner et al., 1985; Rabiner et al., 1988; Anastasakos et al., 1995; Hung et al., 1997), the probabilities $p_{w,j}(d)$ describing the state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for word model $\lambda(w)$, and $N_w$ the total number of training utterances of word $w$. Then the probabilities $p_{w,j}(d)$ can be estimated by

$$p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d_{w,j,t})}{N_w} \quad \text{for } d \ge 1, \qquad (7)$$

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function defined as

$$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d = d_{w,j,t}, \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$
In this non-parametric approach, the accuracy of the duration model depends on the amount of training data. When sufficient training data are available, this modeling method can closely approximate the temporal characteristics of each state in a hidden Markov model; however, the large number of parameters to be stored is one of its drawbacks. A non-parametric approach to isolated Mandarin digit recognition proposed by Hung et al. (1997) showed that the recognition rates were significantly improved over the conventional HMM under white noise: the recognition rate rose from 48.8% for the baseline HMM to 62.0% for the non-parametric approach when the signal was contaminated with white noise at an SNR of 20 dB.
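A minimal sketch of the counting procedure of Eqs. (7)-(8) (ours; the example durations are hypothetical):

```python
from collections import Counter

def nonparametric_duration_pdf(durations, d_max):
    """Estimate p_{w,j}(d) of Eq. (7) by directly counting the decoded
    durations d_{w,j,t} of one state over the N_w training utterances."""
    counts = Counter(durations)            # Theta of Eq. (8) as a tally
    n_w = len(durations)
    return [0.0] + [counts.get(d, 0) / n_w for d in range(1, d_max + 1)]

pdf = nonparametric_duration_pdf([3, 4, 4, 5, 4, 3, 6, 4], d_max=8)
print(pdf[3], pdf[4])   # -> 0.25 0.5
```

Every distinct duration value of every state must be stored, which is the memory cost noted above.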
2.2 Parametric state duration modeling methods
In parametric approaches, specific probability density functions are used to model the distribution of state duration explicitly. The parametric approach has the advantage that only a few parameters are required to specify the probability density function completely; thus, compared with the non-parametric approaches, its memory requirement is significantly reduced. One drawback of parametric duration modeling is that the assumed probability density function may not always match the actual duration distribution of each state in a hidden Markov model. Several probability density functions, including the Poisson, gamma, bounded and Gaussian duration densities, have been proposed to model the distribution of state duration. Detailed formulations of these duration modeling methods are described below.
2.2.1 Poisson distribution for state duration
To characterize the duration property more effectively, Russell et al. (1985, 1987) replaced the self-transition probability of the conventional HMM by a Poisson duration density function, so that there is no self-transition from a state back to itself. This is the so-called hidden semi-Markov model (HSMM). The hidden semi-Markov model with Poisson distributed state duration is thought to have several advantages. First, the Poisson probability density function is a plausible model for state duration. Second, only one parameter, the state duration mean, is needed to specify the distribution. Third, maximum likelihood estimation of the state duration mean can be accomplished by methods analogous to the standard Baum-Welch re-estimation process.

When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$$p_{w,j}(d) = \frac{(\bar d_{w,j} - 1)^{\,d-1}}{(d-1)!}\; e^{-(\bar d_{w,j} - 1)} \quad \text{for } d \ge 1, \qquad (9)$$

where $\bar d_{w,j}$ denotes the duration mean of the $j$-th state of word model $\lambda(w)$. For comparison, the hidden Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM) with Poisson distributed state duration were applied to a speaker-dependent isolated word recognition task (Russell et al., 1985). Experimental results for the third set of recordings showed that the error rate of the HSMM was 11.8% and 6.3% lower than those of the HMM and DTW, respectively.
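A small sketch of the Poisson duration pdf of Eq. (9) (ours; following the shifted parameterization above, the duration is one plus a Poisson variable with rate $\bar d_{w,j} - 1$, so that $d = 0$ has zero probability and the mean duration equals $\bar d_{w,j}$):

```python
import math

def poisson_duration_pdf(d: int, mean_dur: float) -> float:
    """Shifted Poisson state duration pdf of Eq. (9), d >= 1."""
    lam = mean_dur - 1.0                       # Poisson rate
    return math.exp(-lam) * lam ** (d - 1) / math.factorial(d - 1)

# hypothetical state with a mean duration of 5 frames
print(sum(poisson_duration_pdf(d, 5.0) for d in range(1, 60)))       # ~1.0
print(sum(d * poisson_duration_pdf(d, 5.0) for d in range(1, 60)))   # ~5.0
```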
2.2.2 Gamma distribution for state duration
Levinson (1986) first used a family of gamma probability density functions to characterize the distribution of state duration, forming a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution was considered ideally suited to the specification of a duration density function, since it assigns zero probability to negative duration lengths and only two parameters, the state duration mean and variance, are required to specify its distribution. Moreover, Burshtein (1995) proposed a modified Viterbi decoding algorithm that incorporates both state and word duration models for connected digit string recognition. In this approach, a duration penalty based on a gamma density function is applied at each frame transition. The modified Viterbi decoding algorithm was shown to have essentially the same computational requirements as the conventional Viterbi algorithm. The experimental results showed that, compared with the baseline HMM, the modified Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to 2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string length. The gamma duration density function can be formulated as

$$p_{w,j}(d) = \frac{\xi_{w,j}^{\,\gamma_{w,j}}}{\Gamma(\gamma_{w,j})}\; d^{\,\gamma_{w,j}-1}\; e^{-\xi_{w,j}\, d} \quad \text{for } d \ge 1, \qquad (10)$$

with

$$\gamma_{w,j} = \frac{\bar d_{w,j} \cdot \bar d_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar d_{w,j}}{\nabla_{w,j}}, \qquad (11)$$

where $\bar d_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state of word model $\lambda(w)$, respectively, and $\Gamma(z)$ is the gamma function defined by

$$\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\, dx \quad \text{for } z > 0. \qquad (12)$$
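A sketch of the gamma duration pdf of Eqs. (10)-(11) (ours), parameterized directly by the state duration mean and variance:

```python
import math

def gamma_duration_pdf(d: float, mean_dur: float, var_dur: float) -> float:
    """Gamma state duration pdf of Eqs. (10)-(11)."""
    shape = mean_dur ** 2 / var_dur   # gamma_{w,j} of Eq. (11)
    rate = mean_dur / var_dur         # xi_{w,j} of Eq. (11)
    return (rate ** shape / math.gamma(shape)) * d ** (shape - 1) * math.exp(-rate * d)

# hypothetical state: mean 5 frames, variance 4 frames^2
print(sum(gamma_duration_pdf(d, 5.0, 4.0) for d in range(1, 60)))  # ~1 (coarse sum)
```

Unlike the Poisson model, the variance can be set independently of the mean, which is part of why the gamma fit is reported to be slightly better in Section 3.2.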
2.2.3 Bounded state duration
Owing to their continuous probability density functions, both the Poisson and gamma models operate well even when a relatively small number of training utterances is available. In some situations, however, the decoded duration of a state may still be too long or too short. To avoid such unexpected durations and minimize erroneous matches between testing utterances and reference models, Gu et al. (1991) proposed a hidden Markov model with bounded state duration, in which the allowable state duration is constrained by boundaries: the duration of each state is simply bounded by lower and upper bounds in the recognition phase. The probability density function for the bounded state duration is modeled by

$$p_{w,j}(d) = \begin{cases} \dfrac{1}{D^{upper}_{w,j} - D^{lower}_{w,j} + 1}, & \text{if } D^{lower}_{w,j} \le d \le D^{upper}_{w,j}, \\[4pt] 0, & \text{otherwise,} \end{cases} \qquad (13)$$

where $D^{lower}_{w,j}$ and $D^{upper}_{w,j}$ are the lower and upper bounds of the state duration for state $j$ of word model $\lambda(w)$, estimated by

$$D^{lower}_{w,j} = \min_{t=1}^{N_w}\{d_{w,j,t}\} \qquad (14)$$

and

$$D^{upper}_{w,j} = \max_{t=1}^{N_w}\{d_{w,j,t}\}. \qquad (15)$$
A series of experiments using all 408 highly confusable first-tone Mandarin syllables (Gu et al., 1991) was conducted to evaluate the effectiveness of the HMM with bounded state duration (BSD). In the discrete case, the recognition rate of the HMM with BSD is 78.5%, which is 9.0%, 6.3% and 1.9% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma distributed state duration, respectively. In the continuous case, the recognition rate of the HMM with BSD is 88.3%, which is 6.3%, 5.9% and 3.1% higher than those of the same three models. Similar applications of bounded state duration distributions to speech recognition can be found in Kim et al. (1994), Vaseghi (1995) and Power (1996). In those works, the minimum and maximum durations of each state were estimated in the training phase, and these loose state duration constraints were then applied in the final recognition phase. To tighten the duration constraints, Laurila (1997) employed the bounded state duration model in both the training and recognition phases to achieve higher consistency of the state duration constraints.
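A sketch of the bounded duration model of Eqs. (13)-(15) (ours; the duration list is hypothetical):

```python
def bounded_duration_pdf(durations):
    """Estimate the bounds of Eqs. (14)-(15) from the decoded training
    durations of one state and return the uniform pdf of Eq. (13)."""
    lower, upper = min(durations), max(durations)   # D_lower, D_upper
    prob = 1.0 / (upper - lower + 1)
    return lambda d: prob if lower <= d <= upper else 0.0

p = bounded_duration_pdf([3, 4, 4, 5, 4, 3, 6, 4])
print(p(2), p(4), p(7))   # -> 0.0 0.25 0.0
```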
2.2.4 Gaussian distribution for state duration
A parametric approach using a Gaussian probability density function for modeling the state duration distributions was suggested by Rabiner (1989). Moreover, Burshtein (1995) also claimed that the Gaussian pdf provides a good approximation for word duration. By modeling word duration with a Gaussian pdf, the string error rate was further reduced from 2.86% to 2.78% for the case of unknown string length, and from 1.60% to 1.59% for the case of known string length, compared with the baseline HMM. The Gaussian duration density function can be formulated as

$$p_{w,j}(d) = \frac{1}{\sqrt{2\pi \nabla_{w,j}}}\; \exp\!\left\{-\frac{(d - \bar d_{w,j})^2}{2\,\nabla_{w,j}}\right\}. \qquad (16)$$
3. Comparison of state duration modeling methods
3.1 Databases and experimental conditions
A task of multi-speaker isolated Mandarin digit recognition was used to compare the state duration modeling methods described above. The database for the experiments was provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin digits in each of three sessions, for a total of 3000 utterances recorded at a sampling rate of 8 kHz. Each frame contained 256 samples, with 128 samples of overlap, and was multiplied by a 256-point Hamming window. Pre-silence and post-silence of 0.1 to 0.5 seconds were included. Each digit was modeled as a left-to-right HMM of 7 to 9 states, including the pre-silence and post-silence states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature vector consisted of 12 LPC-derived cepstral coefficients, 12 delta cepstral coefficients and one delta log-energy.

The NOISEX-92 noise database (Varga et al., 1992) was used to generate the noisy speech. In our study, three kinds of noise, namely white noise, F16 cockpit noise and babble noise, were added directly to the clean speech in the time domain to simulate noise-contaminated speech. When noise was added to the clean speech, the signal-to-noise ratio (SNR) was defined by

$$\mathrm{SNR} = 10 \cdot \log_{10}\!\left(\frac{E_s}{E_n}\right), \qquad (17)$$

where $E_s$ is the total energy of the clean speech and $E_n$ the energy of the added noise over the entire speech portion. The F16 cockpit noise was recorded at the co-pilot's seat of a two-seat F16 traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of the babble noise was 100 people speaking in a canteen, in which individual voices were slightly audible.
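A minimal sketch of this noise-mixing step (ours; the paper simply adds NOISEX-92 noise in the time domain, and the gain derivation below follows Eq. (17)):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale a noise segment so that Eq. (17) yields the requested SNR,
    then add it to the clean signal."""
    noise = noise[: len(speech)]                # align segment lengths
    e_s = np.sum(speech.astype(float) ** 2)     # E_s: clean speech energy
    e_n = np.sum(noise.astype(float) ** 2)      # E_n: noise energy
    # choose gain g such that 10*log10(e_s / (g**2 * e_n)) == snr_db
    g = np.sqrt(e_s / (e_n * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise
```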
The subsequent experiments examine the following issues: (1) the effectiveness of state duration modeling methods, (2) the incorporation of state duration modeling into the training phase, and (3) the robustness of state duration modeling methods in noisy environments.
3.2 Effectiveness of state duration modeling methods
The first two sessions of collected utterances in the database were used to train an initial set of word models with the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional HMM-based word model (denoted the 'Baseline' HMM) had been established for each isolated Mandarin digit, the training utterances were time-aligned with their corresponding word models. Using the standard Viterbi decoding algorithm, each utterance was re-decoded into a state sequence, from which the number of frames spent in every state is known. From these decoded state durations, the distribution of state duration for each state of a word model can be found. This distribution serves as the non-parametric model of state duration and is denoted HMM/Npar. Fig. 1 shows the duration distributions of the seven states of the HMM/Npar for the isolated Mandarin digit '4'. The state duration distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in Fig. 1 for comparison, denoted HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD, respectively. The third session of collected utterances was used as clean testing data for evaluating the effectiveness of the various state duration modeling methods. In the recognition phase, a testing utterance is decoded into a state sequence using the standard Viterbi decoding algorithm for the 'Baseline' HMM, and using the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6), for the other state duration modeling methods. The resulting recognition rates are shown in Table I.
( Fig. 1 and Table I about here )
Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. The distribution of state duration differs from state to state and cannot be confined to a single type of probability density function; no single pdf fits the statistical characteristics of all the states of a word model. Furthermore, HMM/Gam and HMM/Gau are more capable than HMM/Pois and HMM/BSD of modeling the state duration distributions represented by HMM/Npar, with the gamma function slightly better than the Gaussian. This result is consistent with the conclusion of Burshtein (1995) that the gamma function provides high quality approximations for state duration and word duration. For HMM/BSD, the lower and upper bounds of state duration can prevent any state from occupying too many or too few frames; however, the state duration distribution within the allowable range is treated as a uniform distribution, which cannot approximate the actual distribution of state duration well. This does affect the performance, as shown in Table I. From the experimental results in Table I, the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher recognition rates than the baseline HMM, and the recognition rate of HMM/Gam is superior to those of the other methods. We conclude that a good state duration model can improve the recognition accuracy.
3.3 Incorporation of state duration modeling in training phase
When the statistics of state duration are considered only in the recognition phase and not in the training phase, the state duration constraints are quite loose (Laurila, 1997). To solve this inconsistency, a variable duration hidden Markov model (VDHMM) (Levinson, 1986; Rabiner, 1989; Laurila, 1997), which incorporates the state duration statistics into both the training and recognition phases of a word model, has been proposed to seek further improvement in recognition accuracy. The duration distribution of each state of a word model is obtained as follows:

Step 1. The segmental k-means algorithm and the standard Viterbi decoding method are used to train an initial set of word models.

Step 2. The duration statistics of each state of a word model are estimated and modeled by non-parametric or parametric methods.

Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded into a maximum likelihood state sequence.

Step 4. According to those maximum likelihood state sequences, the statistics of each state are re-calculated and the parameters of the underlying state duration model are revised. Steps 3 and 4 are iterated several times to produce the final set of word models.
Fig. 2 shows the duration distributions of the seven states of those VDHMMs for the isolated Mandarin digit '4' using the various state duration modeling methods. The variable duration HMMs with non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively. The clean speech recognition rates based on the variable duration HMMs are shown in Table II. Comparing Fig. 1 and Fig. 2 reveals that tighter duration constraints make the fluctuation of some state duration distributions of the HMM/Npar more pronounced; this can be seen in the 4-th, 5-th and 6-th states of word model '4'. In addition, the duration distributions of some states (e.g., the 3-rd and 7-th states) become more concentrated and sharper. Table I and Table II show that, whether non-parametric or parametric approaches are employed, the VDHMM methods outperform the corresponding HMM methods. Since there are two confusion sets in Mandarin digit speech ('1' vs. '7' and '6' vs. '9'), the recognition rate can hardly be further improved on clean speech for this task. Even though the improvement is small, it does demonstrate the effectiveness of applying state duration models in both the training and recognition phases.

( Fig. 2 and Table II about here )
3.4 Robustness of state duration modeling methods
When a speech recognition system is deployed in a noisy environment, the background noise causes a mismatch of statistical characteristics between the testing speech and the reference models. Owing to this environmental mismatch, some states with very high likelihood scores may dominate the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence, with state durations that are too long or too short, may be obtained even when a state duration modeling method is employed. This phenomenon causes a drastic degradation of the recognition rate. In this subsection, a series of experiments evaluates the robustness of the various state duration modeling methods in noisy environments.

In these experiments, the first two sessions of collected utterances in the database were used to train a set of word models. To generate noisy speech, noise at specific SNR values was added to the clean testing data, i.e., the third session of the database. The distorted utterances were then evaluated on their corresponding word models and decoded into state sequences. From those most likely state sequences, the state duration distributions under additive white noise can be found. Fig. 3 through Fig. 6 plot the duration distributions of the 5-th and 6-th states of the isolated Mandarin digit '4' under white noise. In addition, the recognition rates under white noise, F16 cockpit noise and babble noise for the various HMMs and VDHMMs are presented in Table III and Table IV.
( Fig. 3 - Fig. 6, Table III - Table IV about here )
The results in Table III and Table IV confirm that properly employing a duration model does improve the recognition accuracy in noisy environments, and that further improvement is obtained by using a variable duration hidden Markov model. The relative performance of the HMMs and VDHMMs in the different noisy environments is similar to that listed in Table I and Table II for clean speech recognition. It is worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD) modeling method are higher than those of the models based on the other parametric duration modeling methods. One explanation is that the BSD method is more effective than the other parametric methods at inhibiting a state from occupying too many or too few speech frames. From Fig. 3 through Fig. 6, we can also see that additive white noise distorts the duration distribution of each state of a word model. As the background becomes noisier, the duration distribution of the 5-th state of Mandarin digit '4' gradually shifts to the left, while that of the 6-th state shifts to the right. In particular, when the signal-to-noise ratio is very low, e.g., 0 dB, the duration density functions of some states become extremely concentrated at unexpected duration lengths, even with the help of the state duration modeling methods. This implies that the underlying duration density functions of those modeling methods are not robust enough to noise contamination. For some state duration modeling methods, the probability density functions are relatively smooth over the range of allowable duration lengths; this reduces the discriminativity among duration lengths in noisy environments and results in erroneous state sequences. Moreover, owing to the parametric nature, i.e., the widespread range of the state duration distribution, it is quite possible for a state to stay too long or too short in the decoded state sequence. From the above discussion, we conclude that: (1) the non-parametric duration modeling method can accurately specify the state duration distribution of each state in a hidden Markov model; (2) the duration modeling method must be applied in both the training and recognition phases so that the state duration constraints of the two phases are consistent; (3) a sharper state duration pdf may enhance the discriminativity among the allowable duration lengths; and (4) a narrow range of the state duration distribution can efficiently prevent a decoded state from being too long or too short.
4. Implementation of the VDHMM/PAD
In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997), combined with the statistics of state durations, is proposed for re-training a conventional hidden Markov model, resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding in the intermediate stage of training a word model. It produces a new set of state duration statistics in which the distribution of state duration becomes sharper and more concentrated, in line with the conclusions of the previous section. It is also worth noting that the PAD method is not used in the recognition phase. The detailed implementation of VDHMM/PAD is described below.
4.1 Formulation of the proportional alignment decoding algorithm
Consider the training of a word model $\lambda(w)$ that belongs to a set of $M$ word models. The parameter set of the word model is represented as $\lambda(w) = \{\mu_w, \Sigma_w, \mathrm{P}_w, \mathrm{A}_w, \mathrm{B}_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \le j \le S_w$ denote the mean vectors and covariance matrices of the states of word model $\lambda(w)$, and $\mathrm{P}_w = \{p_{w,j}(d)\}$, $\mathrm{A}_w = \{a_{w,ij}\}$ and $\mathrm{B}_w = \{b_{w,j}(\cdot)\}$ for $1 \le j \le S_w$ represent the probability density functions of the state durations, state transitions and state outputs, respectively. The duration pdf $p_{w,j}(d)$ is modeled by the non-parametric duration modeling method. Let $X(w) = \{X_t(w),\, 1 \le t \le N_w\}$ be the set of feature vector sequences extracted from all the training utterances of word $w$, where $X_t(w) = x^w_{t,1} x^w_{t,2} \cdots x^w_{t,K^w_t}$ denotes the feature vector sequence of the $t$-th training utterance, which has $K^w_t$ frames. In a continuous-density HMM, the output probability density $b_{w,j}(x^w_{t,k})$ is characterized by a Gaussian function:

$$b_{w,j}(x^w_{t,k}) = (2\pi)^{-D/2}\, |\Sigma_{w,j}|^{-1/2} \exp\!\left\{-\tfrac{1}{2}\,(x^w_{t,k}-\mu_{w,j})^{T}\, \Sigma_{w,j}^{-1}\,(x^w_{t,k}-\mu_{w,j})\right\}, \qquad (18)$$

where $D$ is the dimension of the feature vector $x^w_{t,k}$.

Based on the set of word models $\lambda = \{\lambda(w),\, 1 \le w \le M\}$ and the standard Viterbi decoding algorithm, the $t$-th training utterance of word $w$, $X_t(w)$, is decoded into a state sequence $q_{w,t} = q_{w,t,1}\, q_{w,t,2} \cdots q_{w,t,K^w_t}$. Let $d_{w,j,t}$ denote the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for word model $\lambda(w)$. Then the state duration mean $\bar d_{w,j}$ of state $j$ of word model $\lambda(w)$ is

$$\bar d_{w,j} = \frac{1}{N_w} \sum_{t=1}^{N_w} d_{w,j,t} \quad \text{for } 1 \le j \le S_w. \qquad (19)$$

Moreover, the word duration mean $\bar d_w$, defined as the accumulation of all the state duration means of word model $\lambda(w)$, is

$$\bar d_w = \sum_{j=1}^{S_w} \bar d_{w,j}. \qquad (20)$$

Then the state duration ratio of the $j$-th state of word model $\lambda(w)$ is

$$\Re^w_j = \frac{\bar d_{w,j}}{\bar d_w} \quad \text{for } 1 \le j \le S_w. \qquad (21)$$

Once $\Re^w_j$ is obtained for all states of every word model, the proportional alignment decoding procedure proceeds in a simple way: each training utterance of word $w$ is re-decoded into a new state sequence

$$\tilde q_{w,t} = \tilde q_{w,t,1}\, \tilde q_{w,t,2} \cdots \tilde q_{w,t,K^w_t}, \quad 1 \le w \le M,\; 1 \le t \le N_w. \qquad (22)$$

Specifically, if the $t$-th training utterance of word $w$ has a duration of $K^w_t$ frames, it is segmented into $S_w$ states according to the rule

$$x^w_{t,k} \in \Omega_v(w) \;\text{ and }\; \tilde q_{w,t,k} = v \quad \text{iff} \quad k \in \Big[\Big(\sum_{j=1}^{v-1} \Re^w_j\Big) K^w_t + 1,\; \Big(\sum_{j=1}^{v} \Re^w_j\Big) K^w_t\Big], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w),\, 1 \le v \le S_w\}$ and $\Omega_v(w)$ is the set of collected vectors belonging to state $v$ of word model $\lambda(w)$.
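The segmentation rule of Eq. (23) amounts to handing each state a share of the frames proportional to its duration ratio. A minimal sketch (ours; the rounding of the boundaries to whole frames is an implementation choice, and the ratios in the example are hypothetical):

```python
import numpy as np

def pad_segment(num_frames: int, ratios):
    """Proportional alignment decoding of Eq. (23): segment a K-frame
    utterance into S_w states, giving state v the fraction R_v of frames.

    ratios : state duration ratios R_1 .. R_{S_w} of Eq. (21), summing to 1.
    Returns a 1-based state label for every frame k = 1 .. num_frames.
    """
    cum = np.cumsum(ratios)                          # partial sums of ratios
    bounds = np.rint(cum * num_frames).astype(int)   # right boundary per state
    bounds[-1] = num_frames                          # guard against rounding drift
    labels, start = [], 0
    for v, end in enumerate(bounds, start=1):
        labels.extend([v] * (end - start))
        start = end
    return labels

print(pad_segment(10, [0.2, 0.3, 0.3, 0.2]))
# -> [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
```

Note that the segmentation depends only on the utterance length and the duration ratios, not on the acoustics of the individual frames; this is what narrows the resulting state duration distributions.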
4.2 Training procedure of VDHMM/PAD
The training procedure works as follows.

Step 1. Obtain initial word models.
Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from the training utterances of word $w$ are used to train an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \le w \le M$.

Step 2. Decode training utterances and update word models.
(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance:

$$q^{p-1}_{w,t} = \arg\max_{q_{w,t}}\{p(X_t(w) \mid q_{w,t}, \lambda^{p-1}(w)) \cdot p(q_{w,t} \mid \lambda^{p-1}(w))\}, \quad 1 \le w \le M,\; 1 \le t \le N_w. \qquad (24)$$

(2) The decoded state sequence is denoted $q^{p-1}_{w,t} = q^{p-1}_{w,t,1}\, q^{p-1}_{w,t,2} \cdots q^{p-1}_{w,t,K^w_t}$.
(3) Let $\Omega^{p-1}(w) = \{\Omega^{p-1}_j(w),\, 1 \le j \le S_w\}$, where $\Omega^{p-1}_j(w)$ is the set of vectors of state $j$ of word model $\lambda^{p-1}(w)$. A feature vector $x^w_{t,k}$ of the $k$-th frame of utterance $t$ belongs to $\Omega^{p-1}_j(w)$ if its corresponding state belongs to state $j$ of model $\lambda^{p-1}(w)$. The duration of state $j$ is then the number of vectors of utterance $t$ belonging to $\Omega^{p-1}_j(w)$, and the duration set is expressed as $d^{p-1}_t(w) = \{d^{p-1}_{w,j,t},\, 1 \le j \le S_w\}$.

Step 3. Align state sequences using the PAD method.
(1) From the duration set $d^{p-1}_t(w)$, find the state duration means $\bar d^{\,p-1}_{w,j}$, the word duration mean $\bar d^{\,p-1}_w$ and the state duration ratios $\Re^{w,p-1}_j$ of each state of word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).
(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states using Eq. (23), giving new state sequences $q^p_{w,t} = q^p_{w,t,1}\, q^p_{w,t,2} \cdots q^p_{w,t,K^w_t}$.
(3) Rearrange the sets of vectors collected in each state such that $x^w_{t,k} \in \Omega^p_j(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^p(w)$. The new duration of state $j$ in utterance $t$, $d^p_{w,j,t}$, is obtained.
(4) Use the duration set $d^p_t(w) = \{d^p_{w,j,t},\, 1 \le j \le S_w\}$ and the following equation to calculate the distribution of state duration:

$$p^p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d^p_{w,j,t})}{N_w} \quad \text{for } d \ge 1. \qquad (25)$$

(5) Use $\Omega^p(w) = \{\Omega^p_j(w),\, 1 \le j \le S_w\}$ to find the parameter set $\{\mu^p_w, \Sigma^p_w, \mathrm{A}^p_w, \mathrm{B}^p_w\}$ of word model $\lambda^p(w)$.

Step 4. Re-train the word models.
(1) Calculate the accumulated log-likelihood of $X(w)$ by

$$\Delta^p(w) \equiv \sum_{t=1}^{N_w} \log p[X_t(w) \mid \lambda^p(w)] = \sum_{t=1}^{N_w} \big\{\log p(X_t(w) \mid q^p_{w,t}, \lambda^p(w)) + \log p(q^p_{w,t} \mid \lambda^p(w))\big\}, \qquad (26)$$

where

$$p(X_t(w) \mid q^p_{w,t}, \lambda^p(w)) = \prod_{k=1}^{K^w_t} b_{w,\, q^p_{w,t,k}}(x^w_{t,k}) \qquad (27)$$

and

$$p(q^p_{w,t} \mid \lambda^p(w)) = \prod_{k=1}^{K^w_t - 1} a_{w,\, q^p_{w,t,k}\, q^p_{w,t,k+1}}. \qquad (28)$$

(2) Based on the word model $\lambda^p(w)$, use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{p+1}_{w,t} = q^{p+1}_{w,t,1}\, q^{p+1}_{w,t,2} \cdots q^{p+1}_{w,t,K^w_t}$ for the $t$-th training utterance.
(3) Collect the vectors such that $x^w_{t,k} \in \Omega^{p+1}_j(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^{p+1}(w)$.
(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.
(5) Update the accumulated log-likelihood of $X(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log p[X_t(w) \mid \lambda^{p+1}(w)], \qquad (29)$$

where the likelihood $p[X_t(w) \mid \lambda^{p+1}(w)]$ can be evaluated efficiently using Eqs. (4)-(6).
(6) Convergence test.
IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^p(w)}{\Delta^p(w)} > \Delta_{th}, \qquad (30)$$

THEN $p+1 \to p$ and repeat Steps 4.(2)-4.(6);
ELSE $\lambda^{p+1}(w) \to \lambda_{VDHMM/PAD}(w)$.
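The outer iteration of Step 4 with the convergence test of Eq. (30) can be sketched as follows (ours; `train_step` stands for sub-steps 4.(2)-4.(4), which re-decode with the three-dimensional Viterbi algorithm and re-estimate the parameters, `loglik` accumulates the log-likelihood of all training utterances, and taking the absolute value of the denominator is our guard for negative log-likelihoods):

```python
def retrain_until_converged(model, train_step, loglik, threshold=1e-3):
    """Iterate Steps 4.(2)-4.(6) until the relative improvement of the
    accumulated log-likelihood falls below a preset threshold (Eq. (30))."""
    delta_prev = loglik(model)                 # Delta^p(w) of Eq. (26)
    while True:
        model = train_step(model)              # one re-decode/re-estimate pass
        delta = loglik(model)                  # Delta^{p+1}(w) of Eq. (29)
        if (delta - delta_prev) / abs(delta_prev) <= threshold:
            return model                       # lambda_{VDHMM/PAD}(w)
        delta_prev = delta
```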
4.3 Recognition procedure of VDHMM/PAD
Consider a testing utterance $Y$ with $T_y$ frames, $Y = y_1 y_2 \cdots y_{T_y}$, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based upon the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.
Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $Y$ based on the word model $\lambda_{VDHMM/PAD}(w)$.
Step 3. Calculate the likelihood score of $Y$ for the word model $\lambda_{VDHMM/PAD}(w)$ using Eqs. (4)-(6), i.e.,

$$p[Y \mid \lambda_{VDHMM/PAD}(w)] = p[Y \mid q^{**}, \lambda_{VDHMM/PAD}(w)] \cdot p[q^{**} \mid \lambda_{VDHMM/PAD}(w)]. \qquad (31)$$

Step 4. $w + 1 \to w$. IF $w \le M$, THEN repeat Step 2 to Step 4; ELSE go to Step 5.
Step 5. Select the word whose likelihood score is highest, i.e.,

$$w^{*} = \arg\max_{w}\{p[Y \mid \lambda_{VDHMM/PAD}(w)]\}. \qquad (32)$$
5. Experiments and discussion
In this section, the procedure of Section 3.2 is used to find the distribution of state duration in the VDHMM/PAD. To demonstrate the behavior of the state duration distributions of VDHMM/PAD under white noise, the experiments of Section 3.4 are also repeated here. Fig. 7 and Fig. 8 show the state duration distributions of the seven states of the VDHMM/PAD for the isolated Mandarin digit '4' and the distortion of those distributions due to white noise contamination. The recognition rates of VDHMM/PAD under white noise, F16 cockpit noise and babble noise are listed in Table V. Furthermore, for comparison, the experimental results of Tables I-V are plotted in Fig. 9. From these results we observe the following facts:
(1) Distribution of state duration
Comparing Fig. 7 with Fig. 1 and Fig. 2, we can see that for the conventional HMM employing the various state duration modeling methods, the distribution of state duration is relatively smooth and widespread. By incorporating state duration statistics into the training phase, the variable duration HMMs make the duration distributions of some states more concentrated and sharper, which results in a higher recognition rate. In Fig. 7, for most of the states (e.g., the 2-nd, 4-th, 5-th and 6-th states) the allowable ranges of state duration modeled by VDHMM/PAD become still more concentrated, and the shapes of the state duration distributions are sharper than those of the HMMs and VDHMMs. In addition, compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in the VDHMM/PAD is more severe. This fluctuation, which also occurs in the duration distributions of the 2-nd, 4-th and 6-th states of VDHMM/Npar, is considered helpful for enhancing the discriminativity in recognizing noisy speech.
(2) Robustness to noise contamination
When the speech signal is contaminated by white noise, the state duration distributions shown in Fig. 3 through Fig. 6 are distorted. In particular, at SNR = 0 dB the duration distributions are severely distorted and become extremely concentrated at unexpected duration lengths. Taking Fig. 3 and Fig. 4 as examples, for some models (e.g., HMM/Npar, VDHMM/BSD) the duration distribution of the 5-th state concentrates excessively at a duration of 3 frames at SNR = 0 dB, while for others (e.g., HMM/Gam, VDHMM/Gau) it concentrates at a duration of one frame. Moreover, the maximum probability of the 5-th state duration increases dramatically from about 0.2-0.3 up to 0.8-1.0. In contrast to the state duration distributions of Fig. 3 through Fig. 6, Fig. 8 shows that even under white noise, the original ranges of state duration in the VDHMM/PAD remain almost unchanged and the duration distributions are less distorted by the ambient noise. When the SNR is reduced to 0 dB, the maximum probability of the 5-th state duration increases only from 0.25 to 0.45. This implies that the VDHMM/PAD is more effective than the other duration modeling methods at preventing the state duration distribution from concentrating extremely at a specific duration length.
(3) Performance of noisy speech recognition
The recognition rates listed in Table V and the performance curves in Fig. 9 show that the VDHMM/PAD outperforms the HMMs and VDHMMs employing the other duration modeling methods in noisy environments. The improvement is evident at medium SNR (10 to 15 dB) for white noise, and at low SNR (0 to 5 dB) for F16 cockpit noise and babble noise. In particular, when the distortion due to ambient noise is serious, as with white noise, the improvement in recognition rate is pronounced. The superiority of VDHMM/PAD over the other hidden Markov models we discussed is essentially due to its distinctive state duration distributions: the sharper and more concentrated duration distributions, together with the relatively more fluctuated duration density functions, give the VDHMM/PAD better discriminativity and modeling capability in noisy environments. It is noted, however, that the VDHMM/PAD performs slightly worse than the other hidden Markov models in the clean condition. The reason is as follows. The PAD method proportionally segments each training utterance into states, and this segmentation mechanism narrows the allowable ranges of some state duration distributions. A property of the VDHMM/PAD is therefore that it efficiently prevents any state from lasting too long or too short, which yields performance benefits in noisy environments; however, it also causes a duration mismatch between clean testing speech and the reference models, which makes the recognition performance of VDHMM/PAD degrade slightly in the clean condition compared with the other hidden Markov models.
( Fig. 7 - Fig. 9, Table V about here )
6. Conclusion
In this paper, we first demonstrated the distribution of state duration in a conventional HMM and compared the effectiveness of several widely used state duration modeling methods in noisy environments. Based on the weaknesses of the methods we evaluated, a proportional alignment decoding (PAD) algorithm, combined with the statistics of state duration, was then proposed for the training phase to re-train a conventional hidden Markov model and produce a new variable duration hidden Markov model (VDHMM/PAD). The PAD method makes the distribution of state duration sharper, more fluctuated and more concentrated, and thus improves the model's discriminativity among allowable duration lengths under ambient noise. Experimental results have demonstrated the robustness of VDHMM/PAD for noisy speech recognition: the proposed method provides better recognition rates than the conventional HMM and the other duration modeling methods in various noisy environments.
Acknowledgement
The authors would like to thank Dr. Lee Lee-Min of the Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his programming expertise and for many fruitful discussions.
References
Anastasakos, A., Schwartz, R. & Shu, H. (1995). Duration modeling in large vocabulary speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with expanded HMM applied to speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 548-551.

Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-utterance speech recognition using hidden Markov models with bounded state durations. IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech recognition. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang, B. H. & Rabiner, L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1639-1641, September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for isolated word recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, pp. 4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1234, July-August.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition using hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5-8.

Varga, A., Steeneken, H. J. M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, vol. 41, pp. 31-41.

Zeljkovic, I. (1996). Decoding optimal state sequences with smooth state likelihoods. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132.
Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

method            baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
recognition rate  97.2      97.6      97.5     97.4     97.2      96.8

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

method            baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
recognition rate  97.2      97.6        97.6       97.5       97.4        97.1
Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
clean   97.2      97.6      97.5     97.4     97.2      96.8
20 dB   48.8      62.0      60.9     60.4     59.6      57.0
15 dB   30.8      42.8      41.1     40.5     40.2      38.5
10 dB   19.2      26.8      25.4     24.7     25.3      23.6
5 dB    11.2      20.8      20.1     19.4     19.7      19.3
0 dB    10.0      17.6      16.4     16.0     16.0      17.6

(b) F16 cockpit noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB   92.0      95.2      93.8     93.5     93.2      92.8
15 dB   79.6      85.5      83.6     81.7     80.8      80.1
10 dB   67.6      74.7      73.2     72.8     72.5      71.6
5 dB    44.0      54.3      53.7     52.8     53.4      52.2
0 dB    15.2      25.6      23.5     22.5     22.8      22.3

(c) Babble noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB   94.8      95.9      95.6     95.4     95.2      94.9
15 dB   88.0      92.2      91.1     90.3     89.7      88.2
10 dB   75.2      80.4      79.3     76.9     77.8      75.6
5 dB    58.4      70.4      68.9     65.8     66.1      63.7
0 dB    33.2      42.8      41.4     38.6     39.3      38.5
Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
clean   97.2      97.6        97.6       97.5       97.4        97.1
20 dB   48.8      67.6        64.8       63.9       61.6        59.4
15 dB   30.8      49.2        46.8       45.9       43.6        42.1
10 dB   19.2      31.2        29.0       27.4       28.4        26.9
5 dB    11.2      24.0        22.8       21.7       22.0        20.8
0 dB    10.0      18.4        17.3       17.1       17.2        18.5

(b) F16 cockpit noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB   92.0      96.0        94.4       94.3       94.0        93.6
15 dB   79.6      86.4        84.1       82.3       81.4        80.9
10 dB   67.6      76.3        74.5       73.9       73.8        72.5
5 dB    44.0      55.3        54.8       53.5       54.2        53.0
0 dB    15.2      28.2        26.3       24.9       25.5        24.5

(c) Babble noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB   94.8      96.4        96.2       95.8       95.6        95.3
15 dB   88.0      93.5        91.8       90.9       90.6        89.4
10 dB   75.2      82.4        80.8       79.3       80.1        77.2
5 dB    58.4      71.5        69.7       66.1       67.3        65.2
0 dB    33.2      45.2        43.6       40.9       42.1        40.6

Table V. Noisy speech recognition rates (%) for VDHMM/PAD.

noise type         clean  20 dB  15 dB  10 dB  5 dB  0 dB
white noise        96.8   72.4   60.0   44.0   29.6  24.8
F16 cockpit noise  96.8   95.2   87.3   79.9   60.2  35.1
babble noise       96.8   95.9   94.4   84.7   76.4  52.1
sub-states, each sharing the original state observation probability density function (pdf). Moreover, K. Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not effective enough. Therefore, a state duration constrained maximum likelihood (SDML) training scheme (Laurila, 1997) was presented to gradually tighten the duration constraints in a hidden Markov model. Duration modeling techniques are not only applied at the state level, but can also be extended to the word level.
David Burshtein (Burshtein, 1995) used explicit models of state and word durations to reduce the string error rate in a connected digit recognition task.

In general, no matter what kind of duration modeling mechanism is employed, the probability density functions used to model state duration distributions can be roughly classified into two categories (Gu et al., 1991): non-parametric and parametric methods. In the non-parametric method, the distribution of state duration is estimated directly from the training data. Thus, we can obtain a more accurate duration distribution for each state in a word model. However, this approach needs a large number of training utterances in order to reach a desired degree of accuracy, and it also requires a considerable amount of memory for storing all the duration distributions. In the parametric method, on the other hand, specific probability density functions, such as Poisson (Russell et al., 1985 & Russell et al., 1987), gamma (Levinson, 1986 & Burshtein, 1995), Gaussian (Rabiner, 1989 & Burshtein, 1995) and bounded density functions (Gu et al., 1991, Kim et al., 1994, Vaseghi, 1995, Power, 1996 & Laurila, 1997), are used to model the state duration distributions explicitly, so that only a few parameters are required to completely specify each distribution. The parametric approach has an intuitive drawback: the assumed probability density function may not always fit the real duration distribution of each state in a hidden Markov model. Moreover, most research on modeling duration distributions has dealt with minimizing recognition errors attributed to unrealistic duration models while ignoring ambient noise. How to make a duration model more robust to noise contamination is still an open problem.

In this paper, we focus our attention on the robustness of state duration modeling in noisy environments and neglect the modeling of word duration. This is due to the fact (Burshtein, 1995) that state duration modeling is the major contributor to the improvement of recognition rate. In Section 2, some methods of state duration modeling are reviewed. Then, a series of experiments is conducted in Section 3 to compare those methods.
The behaviors of various duration models under the influence of noise contamination are also investigated there. In Section 4, based on the results obtained in the previous section, we propose a new method that combines a proportional alignment decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov model. This is the so-called variable duration hidden Markov model, denoted VDHMM/PAD. The state duration distributions of VDHMM/PAD prove to be more robust than those of other methods in noisy environments. An experiment on multi-speaker isolated Mandarin digit recognition is reported in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a conclusion is given in Section 6.

2. Overview of state duration modeling methods

When the statistics of state duration are incorporated into both the training and recognition phases of a conventional hidden Markov model, the result is a variable duration hidden Markov model (VDHMM) (Levinson, 1986 & Rabiner, 1989). In a VDHMM, the likelihood function is defined in terms of a modified forward likelihood and backward likelihood. Let $O = o_1 o_2 \ldots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,i)$ are defined as (Levinson, 1986, Rabiner, 1989 & Hung et al., 1997)

$$\alpha_t(w,j) = p(o_1 o_2 \ldots o_t, q_t(w) = j \mid \lambda(w)) = \sum_{d} \sum_{\substack{i=1 \\ i \neq j}}^{S_w} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \qquad (1)$$

and

$$\beta_t(w,i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t(w) = i, \lambda(w)) = \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d} a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t+\tau}) \cdot \beta_{t+d}(w,j), \qquad (2)$$
where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the present state at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the symbol distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the $j$-th state duration pdf of word model $\lambda(w)$ for a duration of $d$ frames. Then, given a variable duration hidden Markov model $\lambda(w)$, the likelihood function of an observation sequence $O$ can be modeled as

$$p(O \mid \lambda(w)) = \sum_{i=1}^{S_w} \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d=1}^{D(w,j)} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t-d+\tau}) \cdot \beta_t(w,j), \qquad (3)$$

where $D(w,j)$ indicates the allowable maximum duration length within the $j$-th state of word model $\lambda(w)$. Based on the above definition, the derivation of the re-estimation formulas for the variable duration HMM is formally identical to that for the conventional HMM (Levinson, 1986 & Rabiner, 1989). For a left-to-right variable duration HMM without jumps, the maximum likelihood $p(O \mid \lambda(w))$ can be efficiently calculated by a three-dimensional (time, state, duration) Viterbi decoding algorithm, derived from the literature of Gu et al. (Gu et al., 1991), which can be summarized as follows: for $d = 1$,

$$\psi_t(w,j,1) = \max_{\tilde d}\{\psi_{t-1}(w, j-1, \tilde d) + \log[p_{w,j-1}(\tilde d)]\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)], \qquad (4)$$

for $d \ge 2$,

$$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)], \qquad (5)$$

and

$$p(O \mid \lambda(w)) = \max_{d}\{\psi_T(w, S_w, d) + \log[p_{w,S_w}(d)]\}, \qquad (6)$$

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state 1 to state $j-1$ along a state sequence of $(t-d)$ frames producing the observations $o_1 o_2 \ldots o_{t-d}$, and then staying at state $j$ while producing the observations $o_{t-d+1} \ldots o_t$.
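As an illustration, the following is a minimal sketch of this explicit-duration Viterbi recursion for a strict left-to-right model, assuming log-domain observation scores and duration probabilities are precomputed; the array names `log_b` and `log_p_dur` are illustrative, and the transition terms $\log a_{w,(j-1)j}$ are folded away (they are constant once self-loops are replaced by duration pdfs).

```python
import numpy as np

def duration_viterbi(log_b, log_p_dur, max_dur):
    """Three-dimensional (time, state, duration) Viterbi, Eqs. (4)-(6).

    log_b:     (S, T) array with log_b[j, t] = log b_{w,j}(o_{t+1})
    log_p_dur: (S, max_dur + 1) array with log_p_dur[j, d] = log p_{w,j}(d)
    Returns the maximum log-likelihood of the utterance given the word model.
    """
    S, T = log_b.shape
    # Prefix sums so the emission score of any contiguous segment is O(1).
    cum = np.concatenate([np.zeros((S, 1)), np.cumsum(log_b, axis=1)], axis=1)
    # delta[j, t]: best score with states 1..j completed after t frames.
    delta = np.full((S + 1, T + 1), -np.inf)
    delta[0, 0] = 0.0
    for j in range(1, S + 1):
        for t in range(1, T + 1):
            for d in range(1, min(max_dur, t) + 1):
                seg = cum[j - 1, t] - cum[j - 1, t - d]  # frames t-d+1..t in state j
                cand = delta[j - 1, t - d] + log_p_dur[j - 1, d] + seg
                if cand > delta[j, t]:
                    delta[j, t] = cand
    return delta[S, T]  # all S_w states traversed, all T frames emitted
```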
From the above description, we can see that successful modeling of the state duration distributions promotes the performance of an HMM-based speech recognizer. In general, the modeling methods for state duration can be classified into two categories: non-parametric and parametric.

2.1 Non-parametric state duration modeling method

In non-parametric approaches (Juang et al., 1985, Rabiner et al., 1985, Rabiner et al., 1988, Anastasakos et al., 1995 & Hung et al., 1997), the probabilities $p_{w,j}(d)$ describing the state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$, and $N_w$ be the total number of training utterances of the word $w$. Then the probabilities $p_{w,j}(d)$ can be estimated by

$$p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d_{w,j,t})}{N_w} \quad \text{for } d \ge 1, \qquad (7)$$

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function defined as

$$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d_{w,j,t} = d, \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

In this non-parametric approach, the accuracy of the duration model depends on the amount of training data.
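In code, Eqs. (7)-(8) amount to a relative-frequency histogram over the decoded durations of one state. A minimal sketch (the cap `max_dur` is an assumption added here to keep the array finite):

```python
import numpy as np

def estimate_duration_pmf(decoded_durations, max_dur):
    """Non-parametric duration model of Eqs. (7)-(8) for a single state.

    decoded_durations: the values d_{w,j,t} over the N_w training utterances.
    Returns p[d] for d = 1..max_dur; index 0 is unused so p[d] = p_{w,j}(d).
    """
    p = np.zeros(max_dur + 1)
    for d in decoded_durations:
        p[min(d, max_dur)] += 1.0   # Theta_d tally, clipped at max_dur
    return p / len(decoded_durations)   # divide by N_w
```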
When the amount of training data is sufficient, this modeling method can approximate the temporal characteristics of each state in a hidden Markov model well. However, the large number of parameters to be stored is one of its drawbacks. A non-parametric approach to isolated Mandarin digit recognition proposed by Hung et al. (Hung et al., 1997) showed that recognition rates were significantly improved over the conventional HMM under the influence of white noise: from 48.8% for the baseline HMM to 62.0% for the non-parametric approach when the signal is contaminated with white noise at an SNR of 20 dB.

2.2 Parametric state duration modeling methods

In parametric approaches, specific probability density functions are used to model the distribution of state duration explicitly. The parametric approach has the advantage that only a few parameters are required to completely specify the probability density function. Thus, compared with non-parametric approaches, the required memory space can be significantly reduced. One drawback of parametric duration modeling is that the assumed probability density function may not always match the actual duration distribution of each state in a hidden Markov model. Several probability density functions, including Poisson, gamma, bounded and Gaussian duration densities, have been proposed to model the distribution of state duration. Detailed formulations of these duration modeling methods are described below.

2.2.1 Poisson distribution for state duration

To characterize the duration property more effectively, M. J. Russell (Russell et al., 1985 & Russell et al., 1987) replaced the self-transition probability in the conventional HMM by a Poisson duration density function, so that there is no self-transition from a state back to itself. This is the so-called hidden semi-Markov model (HSMM).
The hidden semi-Markov model with Poisson-distributed state duration is thought to have several advantages. First, the Poisson probability density function is a plausible model for state duration. Second, only one parameter, the state duration mean, is needed to specify the distribution of state duration. Third, maximum likelihood estimation of the state duration mean can be accomplished by methods analogous to the standard Baum-Welch re-estimation process. When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$$p_{w,j}(d) = \frac{(\bar d_{w,j})^{d-1}}{(d-1)!} \cdot e^{-\bar d_{w,j}} \quad \text{for } d \ge 1, \qquad (9)$$

where $\bar d_{w,j}$ denotes the duration mean of the $j$-th state in the word model $\lambda(w)$. For comparison, the hidden Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM) with Poisson-distributed state duration were applied to a speaker-dependent isolated word recognition task (Russell et al., 1985). Experimental results for the third set of recordings showed that the error rate of the HSMM is 11.8% and 6.3% lower than those of the HMM and DTW, respectively.
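For reference, a direct transcription of Eq. (9), a Poisson pmf shifted to the support $d \ge 1$:

```python
import math

def poisson_duration_pmf(d, mean_dur):
    """Shifted Poisson state-duration pmf of Eq. (9); mean_dur is \\bar d_{w,j}."""
    if d < 1:
        return 0.0
    return (mean_dur ** (d - 1)) * math.exp(-mean_dur) / math.factorial(d - 1)
```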
2.2.2 Gamma distribution for state duration

Levinson (Levinson, 1986) first used a family of gamma probability density functions to characterize the distribution of state duration, forming a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution was considered ideally suited to specifying the duration density, since it assigns zero probability to negative duration lengths and only two parameters, the state duration mean and variance, are required to specify its distribution. Moreover, David Burshtein (Burshtein, 1995) proposed a modified Viterbi decoding algorithm that incorporates both state and word duration models for connected digit string recognition. In this approach, a duration penalty based on the gamma density function is applied at each frame transition. The modified Viterbi decoding algorithm was shown to have essentially the same computational requirements as the conventional Viterbi algorithm. The experimental results showed that the modified Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to 2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string length, compared with the baseline HMM. The gamma duration density function can be formulated as

$$p_{w,j}(d) = \frac{\xi_{w,j}^{\gamma_{w,j}} \cdot d^{\gamma_{w,j}-1} \cdot e^{-\xi_{w,j} \cdot d}}{\Gamma(\gamma_{w,j})} \quad \text{for } d \ge 1 \qquad (10)$$

with

$$\gamma_{w,j} = \frac{\bar d_{w,j} \cdot \bar d_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar d_{w,j}}{\nabla_{w,j}}, \qquad (11)$$

where $\bar d_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state in the word model $\lambda(w)$, respectively, and $\Gamma(z)$ is the gamma function defined by

$$\Gamma(z) = \int_0^\infty x^{z-1} \cdot e^{-x}\, dx \quad \text{for } z > 0. \qquad (12)$$
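Eq. (11) is simply moment matching: the shape $\gamma_{w,j}$ and rate $\xi_{w,j}$ are chosen so the gamma density reproduces the empirical duration mean and variance. A minimal sketch:

```python
import math

def gamma_duration_pdf(d, mean_dur, var_dur):
    """Gamma state-duration density of Eqs. (10)-(12) via moment matching."""
    shape = mean_dur * mean_dur / var_dur   # gamma_{w,j} = mean^2 / variance
    rate = mean_dur / var_dur               # xi_{w,j}    = mean / variance
    if d < 1:
        return 0.0
    return (rate ** shape) * (d ** (shape - 1)) * math.exp(-rate * d) / math.gamma(shape)
```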
2.2.3 Bounded state duration

Because they are continuous probability density functions, both the Poisson and gamma functions have the advantage of behaving well when only a relatively small number of training utterances is available. However, in some situations there is a possibility that the duration of some states will be too long or too short. To avoid such unexpected durations and minimize erroneous matches between testing utterances and reference models, H. Y. Gu et al. (Gu et al., 1991) proposed a hidden Markov model with bounded state duration, in which the allowable state duration is simply constrained in the recognition phase by lower and upper bounds. The probability density function for bounded state duration is modeled by

$$p_{w,j}(d) = \begin{cases} \dfrac{1}{D^{upper}_{w,j} - D^{lower}_{w,j} + 1}, & \text{if } D^{lower}_{w,j} \le d \le D^{upper}_{w,j}, \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (13)$$

where $D^{lower}_{w,j}$ and $D^{upper}_{w,j}$ are the lower and upper bounds of the state duration for state $j$ of the word model $\lambda(w)$, estimated by

$$D^{lower}_{w,j} = \min_{t=1}^{N_w} \{d_{w,j,t}\} \qquad (14)$$

and

$$D^{upper}_{w,j} = \max_{t=1}^{N_w} \{d_{w,j,t}\}. \qquad (15)$$

A series of experiments using all 408 highly confusable first-tone Mandarin syllables (Gu et al., 1991) was conducted to evaluate the effectiveness of the HMM with bounded state duration (BSD). In the discrete case, the recognition rate of the HMM with BSD is 78.5%, which is 9.0%, 6.3% and 1.9% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma-distributed state duration, respectively. In the continuous case, the recognition rate of the HMM with BSD is 88.3%, which is 6.3%, 5.9% and 3.1% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma-distributed state duration, respectively. Similar applications of bounded state duration to speech recognition can be found in Kim et al. (Kim et al., 1994), Vaseghi (Vaseghi, 1995) and Power (Power, 1996). In these, the minimum and maximum durations for each state were estimated in the training phase, and those loose state duration constraints were then used in the recognition phase. To tighten the constraints, K. Laurila (Laurila, 1997) employed the bounded state duration model in both the training and recognition phases to achieve higher consistency in state duration constraints.
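A minimal sketch of Eqs. (13)-(15), fitting the bounds from decoded durations and returning the resulting uniform pmf:

```python
def fit_bounded_duration(durations):
    """Bounded state-duration model: uniform over the observed [min, max]
    duration range of a state (Eqs. (13)-(15)), zero elsewhere."""
    lower, upper = min(durations), max(durations)   # Eqs. (14) and (15)
    def pmf(d):
        return 1.0 / (upper - lower + 1) if lower <= d <= upper else 0.0
    return lower, upper, pmf
```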
2.2.4 Gaussian distribution for state duration

A parametric approach using a Gaussian probability density function to model the state duration distributions was suggested by Rabiner (Rabiner, 1989). Moreover, David Burshtein (Burshtein, 1995) claimed that the Gaussian pdf also provides a good approximation for word duration. By modeling word duration with a Gaussian pdf, the string error rate was further reduced from 2.86% to 2.78% for the case of unknown string length, and from 1.60% to 1.59% for the case of known string length, compared with the baseline HMM. The Gaussian duration density function can be formulated as

$$p_{w,j}(d) = \frac{1}{\sqrt{2\pi \cdot \nabla_{w,j}}} \cdot \exp\Big\{-\frac{(d - \bar d_{w,j})^2}{2 \cdot \nabla_{w,j}}\Big\}. \qquad (16)$$
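For completeness, a one-line transcription of Eq. (16):

```python
import math

def gaussian_duration_pdf(d, mean_dur, var_dur):
    """Gaussian state-duration density of Eq. (16)."""
    return math.exp(-(d - mean_dur) ** 2 / (2.0 * var_dur)) / math.sqrt(2.0 * math.pi * var_dur)
```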
3. Comparison of state duration modeling methods

3.1 Databases and experimental conditions

A task of multi-speaker isolated Mandarin digit recognition was conducted to compare the state duration modeling methods described above. The database for the experiments was provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin digits in each of three sessions, for a total of 3000 utterances recorded at a sampling rate of 8 kHz. Each frame, which contained 256 samples with 128 samples of overlap, was multiplied by a 256-point Hamming window. Pre-silence and post-silence of 0.1 ~ 0.5 seconds were included. Each digit was modeled as a left-to-right HMM of 7 ~ 9 states, including the pre-silence and post-silence states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature vector was composed of 12th-order LPC-derived cepstral coefficients, 12th-order delta cepstral coefficients and one delta log-energy.

The NOISEX-92 noise database (Varga et al., 1992) was used for generating the noisy speech. In our study, three kinds of noise, namely white noise, F16 cockpit noise and babble noise, were added directly to the clean speech in the time domain to simulate speech contaminated by noise. When noise was added to the clean speech, the signal-to-noise ratio (SNR) was defined by the following equation:

$$\mathrm{SNR} = 10 \cdot \log\Big(\frac{E_s}{E_n}\Big), \qquad (17)$$

where $E_s$ is the total energy of the clean speech and $E_n$ is the energy of the added noise over the entire speech portion. The F16 cockpit noise was recorded at the co-pilot's seat of a two-seat F16 traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of the babble noise was 100 people speaking in a canteen, with individual voices slightly audible. The subsequent experiments examine the following problems: (1) the effectiveness of state duration modeling methods, (2) the incorporation of state duration modeling in the training phase, and (3) the robustness of state duration modeling methods in noisy environments.
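In practice, mixing at a target SNR means scaling the noise so that Eq. (17) holds before adding it to the clean waveform. A minimal sketch, assuming `speech` and `noise` are 1-D float arrays of equal length:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(Es/En) = snr_db (Eq. (17)), then add."""
    e_s = np.sum(speech ** 2)                 # total clean-speech energy E_s
    e_n = np.sum(noise ** 2)                  # noise energy E_n before scaling
    scale = np.sqrt(e_s / (e_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```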
3.2 Effectiveness of state duration modeling methods

The first two sessions of collected utterances in the database were used to train an initial set of word models using the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional HMM-based word model (denoted the 'Baseline' HMM) was established for each isolated Mandarin digit, the training utterances were time-aligned with their corresponding word models. Using the standard Viterbi decoding algorithm, we can re-decode each utterance into a state sequence, from which the number of frames spent in every state is known. Based on these decoded state durations, we can find the distribution of state duration for each state in a word model. This distribution can be treated as the non-parametric model of state duration and is denoted HMM/Npar. In Fig. 1, we show the duration distributions of the seven states in the HMM/Npar for isolated Mandarin digit '4'. The state duration distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in Fig. 1 for comparison, denoted HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD, respectively.

The third session of collected utterances was used as a clean version of the testing data for evaluating the effectiveness of the various state duration modeling methods. In the recognition phase, a testing utterance is decoded into a state sequence using the standard Viterbi decoding algorithm for the 'Baseline' HMM method, and using the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6), for the other state duration modeling methods. The resulting recognition rates for the various state duration modeling methods are shown in Table I.

( Fig. 1 and Table I about here )

Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. We find that the distribution of state duration differs from state to state and cannot be confined to a single type of probability density function; no single probability density function can fit the statistical characteristics of all the states in a word model. Furthermore, we also find that HMM/Gam and HMM/Gau are more capable than HMM/Pois and HMM/BSD of modeling the state duration distributions represented by HMM/Npar; in particular, the gamma function is slightly better than the Gaussian function. This result is consistent with the conclusion of David Burshtein (Burshtein, 1995) that the gamma function provides high-quality approximations for state duration and word duration. For HMM/BSD, the lower and upper bounds of state duration can prevent any state from occupying too many or too few frames. However, the state duration distribution within the allowable range is treated as a uniform distribution, which cannot approximate the actual distribution of state duration well. This does affect the performance, as shown in Table I. From the experimental results in Table I, we find that the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher recognition rates than the baseline HMM, and that the recognition rate of HMM/Gam is superior to those of the other methods. We conclude that a good modeling method for state duration can improve the recognition accuracy.
3.3 Incorporation of state duration modeling in training phase

When the statistics of state duration are considered only in the recognition phase but not in the training phase, the result is quite loose state duration constraints (Laurila, 1997). To solve this inconsistency problem, a variable duration hidden Markov model (VDHMM) (Levinson, 1986, Rabiner, 1989 & Laurila, 1997), which incorporates state duration statistics into both the training and recognition phases of a word model, has been proposed to seek further improvement in recognition accuracy. The duration distribution of each state in a word model can be obtained as follows (a sketch of this loop is given after the list):

Step 1. The segmental k-means algorithm and the standard Viterbi decoding method are used to train an initial set of word models.

Step 2. The duration statistics for each state in a word model are estimated and modeled by non-parametric or parametric methods.

Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded into a maximum likelihood state sequence.

Step 4. According to those maximum likelihood state sequences, the statistics of each state are re-calculated and the parameters of the underlying state duration model are revised.

Steps 3 and 4 are iterated several times to produce the final set of desired word models.
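One possible shape of this training loop, with the procedures of Steps 1-4 passed in as callables; `init_models`, `fit_durations`, `decode` and `reestimate` are stand-ins for the corresponding operations in the text, not definitions from the paper:

```python
def train_vdhmm(utterances, init_models, fit_durations, decode, reestimate, n_iter=3):
    """Iterative VDHMM training following Steps 1-4."""
    model = init_models(utterances)                        # Step 1
    dur_model = fit_durations(model, utterances, None)     # Step 2
    for _ in range(n_iter):                                # iterate Steps 3-4
        seqs = [decode(model, dur_model, u) for u in utterances]   # Step 3
        dur_model = fit_durations(model, utterances, seqs)         # Step 4
        model = reestimate(model, utterances, seqs)
    return model, dur_model
```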
In Fig. 2, we show the duration distributions of the seven states of those VDHMMs for isolated Mandarin digit '4' using the various state duration modeling methods. The variable duration HMMs with non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively. Moreover, the clean speech recognition rates based on the variable duration HMMs are shown in Table II. Comparing Fig. 1 and Fig. 2 reveals that the tighter duration constraints make the fluctuations of some state duration distributions in the HMM/Npar more obvious; this phenomenon can be found in the 4th, 5th and 6th states of word model '4'. In addition, the duration distributions of some states (e.g., the 3rd and 7th states) become more concentrated and sharper. Table I and Table II show that, whether employing non-parametric or parametric approaches, the VDHMM methods are better than the corresponding HMM methods. Since there are two confusion sets in Mandarin digit speech ("1" vs. "7" and "6" vs. "9"), the recognition rate can hardly be further improved on clean speech for this specific task. Even though the improvement is small, it does demonstrate the effectiveness of applying state duration models in both the training and recognition phases.

( Fig. 2 and Table II about here )

3.4 Robustness of state duration modeling methods

When a speech recognition system is deployed in a noisy environment, the background noise causes a mismatch of statistical characteristics between the testing speech and the reference models. Due to this environmental mismatch, it is quite possible that some state with very high likelihood scores will dominate the result of the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence with state durations that are too long or too short may be obtained even if a state duration modeling method is employed. This phenomenon causes a drastic degradation of the recognition rate of a speech recognizer. In this subsection, a series of experiments is conducted to evaluate the robustness of the various methods of modeling state duration in noisy environments.

In our experiments, the first two sessions of collected utterances in the database were used to train a set of word models. To generate noisy speech, noise at specific SNR values was added to the clean testing data, i.e., the third session in the database. The distorted utterances were then evaluated on their corresponding word models and decoded into state sequences. From those most likely state sequences, we can find the state duration distributions under the influence of additive white noise.
In Fig. 3 through Fig. 6, the duration distributions of the 5th and 6th states of isolated Mandarin digit '4' under the influence of white noise are plotted. In addition, the recognition rates under the influence of white noise, F16 cockpit noise and babble noise for the various HMMs and VDHMMs are presented in Table III and Table IV.

( Fig. 3 - Fig. 6, Table III - Table IV about here )

The results in Table III and Table IV confirm that properly employing a duration model does improve the recognition accuracy in noisy environments. Above all, further improvement can be obtained by using a variable duration hidden Markov model. The performances of the HMMs and VDHMMs in the different noisy environments are similar to the results listed in Table I and Table II for clean speech recognition. It is worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD) modeling method are higher than those of the models based on the other parametric duration modeling methods. One explanation is that the BSD method is more effective than the other parametric modeling methods at inhibiting a state from occupying too many or too few speech frames.

From Fig. 3 through Fig. 6, we can also see that additive white noise distorts the duration distribution of each state in a word model. As the background environment becomes noisier, the duration distribution of the 5th state of Mandarin digit "4" gradually shifts to the left, while that of the 6th state shifts to the right. In particular, when the signal-to-noise ratio is very low, e.g., 0 dB, the duration density functions of some states become extremely concentrated at unexpected duration lengths even with the help of state duration modeling methods. This implies that the underlying duration density functions of those modeling methods are not robust enough to noise contamination. For some state duration modeling methods, the probability density functions are relatively smooth over the range of allowable duration lengths. This reduces the discriminativity of duration lengths in a noisy environment and results in erroneous state sequences.
Moreover, due to their parametric nature, i.e., the widespread range of the state duration distribution, it is quite possible for a state to stay too long or too short when decoding a state sequence. From the above discussion, we conclude that:

(1) The non-parametric duration modeling method can accurately specify the state duration distribution of each state in a hidden Markov model.
(2) The duration modeling method must be applied in both the training and recognition phases so that the state duration constraints in these two phases are consistent.
(3) A sharper pdf of state duration may enhance the discriminativity of the allowable duration lengths.
(4) A narrow distribution range of state duration can efficiently prevent a decoded state from being too long or too short.

4. Implementation of the VDHMM/PAD

In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997), combined with the statistics of state durations, is proposed to re-train a conventional hidden Markov model, resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding in the intermediate stage of training a word model. It produces a new set of state duration statistics in which the distribution of state duration becomes sharper and more concentrated, which meets the conclusions of the previous section. It is also worth noting that the PAD method is not used in the recognition phase. The detailed implementation of VDHMM/PAD is described as follows.

4.1 Formulation of the proportional alignment decoding algorithm

Consider the training of a word model $\lambda(w)$ that belongs to the set of $M$ word models. The parameter set of the word model $\lambda(w)$ is represented as $\lambda(w) = \{\mu_w, \Sigma_w, \mathrm{P}_w, \mathrm{A}_w, \mathrm{B}_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \le j \le S_w$ denote the mean vector and covariance matrix of the $j$-th state in the word model $\lambda(w)$, respectively.
  • 19. 19 j-th state in the word model λ ( )w , respectively. Ρw w jp d= { ( )}, , Αw w ija= { }, and Βw w jb O= { ( )}, for 1 ≤ ≤j Sw represent the probability density functions of state durations, state transitions and state outputs for the word model λ ( )w , respectively. It is noted that the probability density function p dw j, ( ) is modeled by the non-parametric duration modeling method. Let Χ Χ( ) { ( ), }w w t Nt w= ≤ ≤1 be a set of feature vector sequences extracted from all the training utterances for the word model λ ( )w . Here, Χt w( ) denotes the feature vector sequence of the t-th training utterance which has Kt w frames, This feature vector sequence can be expressed as Χt t w t w t K w w x x x t w( ) , , , = ⋅⋅⋅1 2 . Then, in a continuous-density HMM, the output probability density function, b xw j t k w , ,( ) , can be characterized by a Gaussian function defined as follows : b xw j t k w w j D , , ,( ) ( )= ⋅ ⋅ − − 2 2 1 2 π Σ exp{ ( ) ( ) ( )}, , , , ,− ⋅ − ⋅ ⋅ −−1 2 1 x xt k w w j T w j t k w w jµ µΣ , (18) where D is the dimension of feature vector xt k w , . Based on the set of word models λ λ= ≤ ≤{ ( ), }w w M1 and the standard Viterbi decoding algorithm, we can decode the t-th training utterance of word w, X wt ( ) , into a state sequence q q q qw t w t w t w t K t w , , , , , , , = ⋅ ⋅ ⋅1 2 . Assume dw j t, , denotes the duration of state j in the maximum likelihood state sequence of the t-th training utterance for the word model λ ( )w . Then, the state duration mean, d w j, , of state j in the word model λ ( )w is formulated as d N dw j w w j t t Nw , , ,= = ∑ 1 1 for 1 ≤ ≤j Sw . (19) Moreover, The word duration mean d w defined as the accumulation of all the state duration means in the word model λ ( )w can also be expressed as
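Eq. (18) is typically evaluated in the log domain to avoid numerical underflow when many frame probabilities are multiplied together, as in Eq. (27) below. The following is a minimal sketch of that computation; the function name and the use of NumPy are our own illustrative choices, not part of the original system.

```python
import numpy as np

def log_gaussian(x, mu, cov):
    """Log of the Gaussian output density in Eq. (18) for one frame.

    x, mu : (D,) feature vector and state mean.
    cov   : (D, D) state covariance matrix (assumed positive definite).
    """
    D = x.shape[0]
    diff = x - mu
    # log|Sigma| via a Cholesky factor is numerically safer than np.linalg.det.
    chol = np.linalg.cholesky(cov)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    # Mahalanobis term without forming the explicit inverse of Sigma.
    mahal = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + mahal)
```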
Based on the set of word models $\lambda = \{\lambda(w),\ 1 \le w \le M\}$ and the standard Viterbi decoding algorithm, we can decode the $t$-th training utterance of word $w$, $X_t(w)$, into a state sequence $q_{w,t} = q_{w,t,1}\,q_{w,t,2} \cdots q_{w,t,K_t^w}$. Let $d_{w,j,t}$ denote the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$. Then the state duration mean $\bar{d}_{w,j}$ of state $j$ in the word model $\lambda(w)$ is formulated as

$$\bar{d}_{w,j} = \frac{1}{N_w} \sum_{t=1}^{N_w} d_{w,j,t}, \qquad 1 \le j \le S_w. \qquad (19)$$

Moreover, the word duration mean $\bar{d}_w$, defined as the accumulation of all the state duration means in the word model $\lambda(w)$, can be expressed as

$$\bar{d}_w = \sum_{j=1}^{S_w} \bar{d}_{w,j}. \qquad (20)$$

Then the state duration ratio of the $j$-th state to the whole word model $\lambda(w)$ can be calculated by

$$\Re_j^w = \frac{\bar{d}_{w,j}}{\bar{d}_w}, \qquad 1 \le j \le S_w. \qquad (21)$$

Once we obtain $\Re_j^w$ for all states in every word model, the proportional alignment decoding procedure proceeds in a simple way: each training utterance of word $w$ is re-decoded into a new state sequence $\tilde{q}_{w,t}$, where

$$\tilde{q}_{w,t} = \tilde{q}_{w,t,1}\,\tilde{q}_{w,t,2} \cdots \tilde{q}_{w,t,K_t^w}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (22)$$

For example, if the $t$-th training utterance of word $w$ has a duration of $K_t^w$ frames, we segment this training utterance into $S_w$ states according to the following rule:

$$x_{t,k}^w \in \Omega_v(w)\ \text{and}\ \tilde{q}_{w,t,k} = v \quad \text{iff} \quad k \in \Big[\Big(\sum_{j=1}^{v-1} \Re_j^w\Big) K_t^w + 1,\ \Big(\sum_{j=1}^{v} \Re_j^w\Big) K_t^w\Big], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w),\ 1 \le v \le S_w\}$ and $\Omega_v(w)$ is the set of feature vectors collected for state $v$ in the word model $\lambda(w)$.
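To make the formulation concrete, the sketch below computes the duration ratios of Eqs. (19)-(21) from a table of decoded state durations and then applies the proportional segmentation rule of Eqs. (22)-(23) to one utterance. It is a minimal illustration under our own assumptions (NumPy arrays and rounded frame boundaries; the paper does not spell out a rounding convention), not the authors' exact implementation.

```python
import numpy as np

def duration_ratios(durations):
    """Eqs. (19)-(21): state duration ratios from decoded durations.

    durations : (N_w, S_w) array; durations[t, j] is the duration of state j
                in the decoded state sequence of training utterance t.
    """
    state_means = durations.mean(axis=0)   # Eq. (19): state duration means
    word_mean = state_means.sum()          # Eq. (20): word duration mean
    return state_means / word_mean         # Eq. (21): ratios R_j^w

def proportional_alignment(num_frames, ratios):
    """Eqs. (22)-(23): re-decode an utterance of num_frames frames into a
    state sequence by proportional segmentation.  With this simple rounding,
    a state with a very small ratio may receive no frames."""
    bounds = np.rint(np.cumsum(ratios) * num_frames).astype(int)
    bounds[-1] = num_frames                # ensure every frame is assigned
    states = np.empty(num_frames, dtype=int)
    start = 0
    for v, end in enumerate(bounds):       # state indices v = 0 .. S_w - 1
        states[start:end] = v
        start = end
    return states

# e.g. ratios = duration_ratios(durs); seq = proportional_alignment(120, ratios)
```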
4.2 Training procedure of VDHMM/PAD

The training procedure works as follows.

Step 1. Obtain initial word models.
Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from the training utterances of word $w$ are used to train an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \le w \le M$.

Step 2. Decode training utterances and update word models.
(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance, such that

$$q_{w,t}^{p-1} = \arg\max_{q}\{p(X_t(w)\,|\,q, \lambda^{p-1}(w)) \cdot p(q\,|\,\lambda^{p-1}(w))\}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (24)$$

(2) The decoded state sequence is denoted as $q_{w,t}^{p-1} = q_{w,t,1}^{p-1}\,q_{w,t,2}^{p-1} \cdots q_{w,t,K_t^w}^{p-1}$.
(3) Let $\Omega^{p-1}(w) = \{\Omega_j^{p-1}(w),\ 1 \le j \le S_w\}$, where $\Omega_j^{p-1}(w)$ is the set of vectors of state $j$ in the word model $\lambda^{p-1}(w)$. The feature vector $x_{t,k}^w$ of the $k$-th frame in utterance $t$ belongs to $\Omega_j^{p-1}(w)$ if its corresponding state is state $j$ in the model $\lambda^{p-1}(w)$. The duration of state $j$ is then the number of vectors in utterance $t$ belonging to $\Omega_j^{p-1}(w)$, and the duration set is expressed as $d_t^{p-1}(w) = \{d_{w,j,t}^{p-1},\ 1 \le j \le S_w\}$.

Step 3. Align state sequences using the PAD method.
(1) Based on the duration set $d_t^{p-1}(w)$, we can find the state duration mean $\bar{d}_{w,j}^{p-1}$, the word duration mean $\bar{d}_w^{p-1}$ and the state duration ratio $\Re_j^{w,p-1}$ for each state in the word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).
(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states using Eq. (23). Thus we can find the new state sequences $q_{w,t}^{p} = q_{w,t,1}^{p}\,q_{w,t,2}^{p} \cdots q_{w,t,K_t^w}^{p}$.
(3) Rearrange the sets of vectors collected in each state such that $x_{t,k}^w \in \Omega_j^p(w)$ if its corresponding state is state $j$ of the model $\lambda^p(w)$. The new duration of state $j$ in utterance $t$, $d_{w,j,t}^p$, is thereby obtained.
(4) Use the duration set $d_t^p(w) = \{d_{w,j,t}^p,\ 1 \le j \le S_w\}$ and the following equation to calculate the distribution of state duration (a short sketch of this histogram estimate is given after this step):

$$p_{w,j}^p(d) = \frac{1}{N_w} \sum_{t=1}^{N_w} \Theta_d(d_{w,j,t}^p), \qquad d \ge 1, \qquad (25)$$

where $\Theta_d(\cdot)$ equals one when its argument equals $d$ and zero otherwise.
(5) Use $\Omega^p(w) = \{\Omega_j^p(w),\ 1 \le j \le S_w\}$ to find the parameter set $\{\mu_w^p, \Sigma_w^p, A_w^p, B_w^p\}$ of the word model $\lambda^p(w)$.
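The histogram estimate of Eq. (25) used in Step 3.(4) can be sketched as follows; the array layout and the optional cap on the maximum duration are our own conveniences, not specified by the paper.

```python
import numpy as np

def duration_histogram(durations, max_d=None):
    """Eq. (25): non-parametric duration pdf of one state.

    durations : (N_w,) integer durations d^p_{w,j,t} of state j over all
                training utterances of word w.
    Returns p[d] for d = 0 .. max_d, where p[d] = (1/N_w) * #{t : d_t = d}.
    """
    durations = np.asarray(durations, dtype=int)
    if max_d is None:
        max_d = durations.max()
    counts = np.bincount(durations, minlength=max_d + 1)
    return counts / durations.size

# e.g. p = duration_histogram([3, 4, 4, 5, 6]); p[4] == 0.4
```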
Step 4. Re-train the word models.
(1) Calculate the accumulated log-likelihood of $X(w)$ by (a small sketch of this computation follows this step)

$$\Delta^p(w) \equiv \sum_{t=1}^{N_w} \log p(X_t(w)\,|\,\lambda^p(w)) = \sum_{t=1}^{N_w} \{\log p(X_t(w)\,|\,q_{w,t}^p, \lambda^p(w)) + \log p(q_{w,t}^p\,|\,\lambda^p(w))\}, \qquad (26)$$

where

$$p(X_t(w)\,|\,q_{w,t}^p, \lambda^p(w)) = \prod_{k=1}^{K_t^w} b_{w,q_{t,k}^p}(x_{t,k}^w) \qquad (27)$$

and

$$p(q_{w,t}^p\,|\,\lambda^p(w)) = \prod_{k=1}^{K_t^w - 1} a_{w,q_{t,k}^p,q_{t,k+1}^p}. \qquad (28)$$

(2) Based on the word model $\lambda^p(w)$, we use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q_{w,t}^{p+1} = q_{w,t,1}^{p+1}\,q_{w,t,2}^{p+1} \cdots q_{w,t,K_t^w}^{p+1}$ for the $t$-th training utterance.
(3) Collect the vectors such that $x_{t,k}^w \in \Omega_j^{p+1}(w)$ if its corresponding state is state $j$ of the model $\lambda^{p+1}(w)$.
(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.
(5) Update the accumulated log-likelihood of $X(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log p(X_t(w)\,|\,\lambda^{p+1}(w)), \qquad (29)$$

where the likelihood function $p(X_t(w)\,|\,\lambda^{p+1}(w))$ can be evaluated efficiently by using Eqs. (4)-(6).
(6) Convergence testing. IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^p(w)}{\Delta^p(w)} > \Delta_{th}, \qquad (30)$$

THEN set $p + 1 \to p$ and repeat Steps 4.(2)-4.(6); ELSE set $\lambda^{p+1}(w) \to \lambda_{VDHMM/PAD}(w)$.
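As a concrete reading of Eqs. (26)-(28) used in Step 4.(1), the sketch below accumulates the log-likelihood of one utterance along a given state path. The array layout is our own assumption.

```python
import numpy as np

def path_log_likelihood(log_b, log_a, states):
    """One term of Eq. (26): log-likelihood of an utterance along a state path.

    log_b  : (K, S) array; log_b[k, j] = log b_{w,j}(x_{t,k}) per Eq. (18).
    log_a  : (S, S) array of log state transition probabilities.
    states : (K,) decoded state sequence q_{w,t}.
    """
    K = states.shape[0]
    emit = log_b[np.arange(K), states].sum()       # log of Eq. (27)
    trans = log_a[states[:-1], states[1:]].sum()   # log of Eq. (28)
    return emit + trans
```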
4.3 Recognition procedure of VDHMM/PAD

Consider a testing utterance $Y = y_1 y_2 \cdots y_{T_y}$ with $T_y$ frames, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based on the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.
Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $Y$ based on the word model $\lambda_{VDHMM/PAD}(w)$.
Step 3. Calculate the likelihood score of $Y$ for the word model $\lambda_{VDHMM/PAD}(w)$ by using Eqs. (4)-(6), i.e.,

$$p(Y\,|\,\lambda_{VDHMM/PAD}(w)) = p(Y\,|\,q^{**}, \lambda_{VDHMM/PAD}(w)) \cdot p(q^{**}\,|\,\lambda_{VDHMM/PAD}(w)). \qquad (31)$$

Step 4. Set $w + 1 \to w$. IF $w \le M$, THEN repeat Steps 2 to 4; ELSE go to Step 5.
Step 5. Select the word whose likelihood score is highest (a compact sketch of this loop is given after this step), i.e.,

$$w^* = \arg\max_{w}\{p(Y\,|\,\lambda_{VDHMM/PAD}(w))\}. \qquad (32)$$
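The following is a minimal sketch of the recognition loop in Steps 1-5; score_word is our placeholder for the three-dimensional Viterbi scoring of Eqs. (31) and (4)-(6), not a function from the paper.

```python
def recognize(utterance, word_models, score_word):
    """Steps 1-5: pick the word model with the highest likelihood score.

    word_models : list of VDHMM/PAD word models, one per vocabulary word.
    score_word  : callable(utterance, model) -> log-likelihood score,
                  standing in for Eqs. (31) and (4)-(6).
    """
    scores = [score_word(utterance, m) for m in word_models]  # Steps 2-4
    return max(range(len(scores)), key=lambda w: scores[w])   # Eq. (32)
```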
5. Experiments and discussion

In this section, the same procedure described in Section 3.2 is used to find the distribution of state duration in the VDHMM/PAD. Moreover, to demonstrate the behavior of the state duration distributions of the VDHMM/PAD under the influence of white noise, the same experiments conducted in Section 3.4 are repeated here. Fig. 7 and Fig. 8 show the state duration distributions of the seven states in the VDHMM/PAD for the isolated Mandarin digit '4' and the distorted state duration distributions caused by white noise contamination. The recognition rates of the VDHMM/PAD under the influence of white noise, F16 cockpit noise and babble noise are listed in Table V. Furthermore, for comparison, the experimental results listed in Tables I-V are plotted together in Fig. 9. From these experimental results we observe the following facts:

(1) Distribution of state duration

Comparing Fig. 7 with Fig. 1 and Fig. 2, we find that for a conventional HMM employing the various state duration modeling methods, the distribution of state duration is relatively smooth and widespread. By incorporating state duration statistics into the training phase, the variable duration HMMs make the duration distributions of some states more concentrated and sharper, which results in a higher recognition rate. In Fig. 7, we can observe that for most states (e.g., the 2-nd, 4-th, 5-th and 6-th states) the allowable ranges of state duration modeled by the VDHMM/PAD become more concentrated, and the shapes of the state duration distributions are sharper than those of the HMMs and VDHMMs. In addition, compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in the VDHMM/PAD is more pronounced. This fluctuation also occurs in the duration distributions of the 2-nd, 4-th and 6-th states of the VDHMM/Npar and is considered helpful for enhancing discriminability in recognizing noisy speech.

(2) Robustness to noise contamination

When the speech signal is contaminated by white noise, the state duration distributions shown in Figs. 3 through 6 are distorted. In particular, at SNR = 0 dB the duration distributions are severely distorted and concentrate extremely at some unexpected duration lengths. Taking Fig. 3 and Fig. 4 as examples, we find that for some models (e.g., HMM/Npar, VDHMM/BSD) the duration distribution of the 5-th state concentrates excessively at a duration of 3 frames at SNR = 0 dB, while for the others (e.g., HMM/Gam, VDHMM/Gau) it concentrates at a duration of one frame. Moreover, the maximum probability of the 5-th state duration increases dramatically from about 0.2~0.3 up to 0.8~1.0. In contrast to the state duration distributions in Figs. 3 through 6, we observe from Fig. 8 that even under white noise, the original ranges of state duration in the VDHMM/PAD remain almost unchanged and the duration distributions are far less distorted by the ambient noise. When the SNR is reduced to 0 dB, the maximum probability of the 5-th state duration increases only from 0.25 to 0.45. This implies that the VDHMM/PAD is more effective than the other duration modeling methods at preventing the state duration distribution from concentrating extremely at a specific duration length.
(3) Performance of noisy speech recognition

The recognition rates listed in Table V and the performance curves in Fig. 9 show that the VDHMM/PAD outperforms the HMMs and VDHMMs employing the other duration modeling methods in noisy environments. The improvement is evident at medium SNR (10 to 15 dB) in the case of white noise and at low SNR (0 to 5 dB) in the cases of F16 cockpit noise and babble noise. In particular, when the distortion due to ambient noise is severe, as it is for white noise, the improvement in recognition rate is substantial. The superiority of the VDHMM/PAD over the other hidden Markov models discussed here is essentially due to its distinctive state duration distributions. It is evident that the sharper and more concentrated duration distributions, together with the more fluctuated duration density functions, give the VDHMM/PAD better discriminability and modeling capability in noisy environments. On the other hand, the VDHMM/PAD performs slightly worse than the other hidden Markov models in the clean condition. The reason can be explained as follows. The PAD method proportionally segments each training utterance into states, and this segmentation mechanism narrows the allowable ranges of some state duration distributions. A property of the VDHMM/PAD is therefore that it can efficiently prevent any state from occupying too many or too few frames, which yields performance gains in noisy environments. However, the same mechanism also causes a duration mismatch between clean testing speech and the reference models, which degrades the recognition performance of the VDHMM/PAD slightly in the clean condition compared with the other hidden Markov models.

( Fig. 7 - Fig. 9, Table V about here )

6. Conclusion

In this paper, we first demonstrated the distribution of state duration in a conventional HMM and compared the effectiveness and performance of several widely used state duration modeling methods in noisy environments.
Based on the weaknesses of the modeling methods we evaluated, a proportional alignment decoding (PAD) algorithm combined with the statistics of state duration was then proposed for the training phase, re-training a conventional hidden Markov model into a new variable duration hidden Markov model (VDHMM/PAD). The PAD method makes the distribution of state duration sharper, more fluctuated and more concentrated, and thus improves the model's discriminability among allowable duration lengths under the influence of ambient noise. Experimental results have demonstrated the robustness of the VDHMM/PAD in noisy speech recognition. The proposed method provides better recognition rates than the conventional HMM and the other duration modeling methods in various noisy environments.

Acknowledgement

The authors would like to thank Dr. Lee Lee-Min of Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his programming experience and for many fruitful discussions.

References

Anastasakos, A., Schwartz, R. & Shu, H. (1995). Duration modeling in large vocabulary speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with expanded HMM applied to speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 548-551.

Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-utterance speech recognition using hidden Markov models with bounded state durations. IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech recognition. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang, B. H. & Rabiner, L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1639-1641, September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for isolated word recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, pp. 4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1234, July-August.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition using hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5-8.

Varga, A., Steeneken, H. J. M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, vol. 41, pp. 31-41.

Zeljkovic, I. (1996). Decoding optimal state sequence with smooth state likelihoods. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132.
Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

method             baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
recognition rate       97.2      97.6     97.5     97.4      97.2     96.8

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

method             baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
recognition rate       97.2        97.6       97.6       97.5        97.4       97.1

Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods. (a) White noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
clean        97.2      97.6     97.5     97.4      97.2     96.8
20 dB        48.8      62.0     60.9     60.4      59.6     57.0
15 dB        30.8      42.8     41.1     40.5      40.2     38.5
10 dB        19.2      26.8     25.4     24.7      25.3     23.6
5 dB         11.2      20.8     20.1     19.4      19.7     19.3
0 dB         10.0      17.6     16.4     16.0      16.0     17.6

Table III. (b) F16 cockpit noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB        92.0      95.2     93.8     93.5      93.2     92.8
15 dB        79.6      85.5     83.6     81.7      80.8     80.1
10 dB        67.6      74.7     73.2     72.8      72.5     71.6
5 dB         44.0      54.3     53.7     52.8      53.4     52.2
0 dB         15.2      25.6     23.5     22.5      22.8     22.3
Table III. (c) Babble noise.

SNR      baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB        94.8      95.9     95.6     95.4      95.2     94.9
15 dB        88.0      92.2     91.1     90.3      89.7     88.2
10 dB        75.2      80.4     79.3     76.9      77.8     75.6
5 dB         58.4      70.4     68.9     65.8      66.1     63.7
0 dB         33.2      42.8     41.4     38.6      39.3     38.5

Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods. (a) White noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
clean        97.2        97.6       97.6       97.5        97.4       97.1
20 dB        48.8        67.6       64.8       63.9        61.6       59.4
15 dB        30.8        49.2       46.8       45.9        43.6       42.1
10 dB        19.2        31.2       29.0       27.4        28.4       26.9
5 dB         11.2        24.0       22.8       21.7        22.0       20.8
0 dB         10.0        18.4       17.3       17.1        17.2       18.5

Table IV. (b) F16 cockpit noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB        92.0        96.0       94.4       94.3        94.0       93.6
15 dB        79.6        86.4       84.1       82.3        81.4       80.9
10 dB        67.6        76.3       74.5       73.9        73.8       72.5
5 dB         44.0        55.3       54.8       53.5        54.2       53.0
0 dB         15.2        28.2       26.3       24.9        25.5       24.5
Table IV. (c) Babble noise.

SNR      baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB        94.8        96.4       96.2       95.8        95.6       95.3
15 dB        88.0        93.5       91.8       90.9        90.6       89.4
10 dB        75.2        82.4       80.8       79.3        80.1       77.2
5 dB         58.4        71.5       69.7       66.1        67.3       65.2
0 dB         33.2        45.2       43.6       40.9        42.1       40.6

Table V. Noisy speech recognition rates (%) for the VDHMM/PAD.

noise type          clean  20 dB  15 dB  10 dB  5 dB  0 dB
white noise          96.8   72.4   60.0   44.0  29.6  24.8
F16 cockpit noise    96.8   95.2   87.3   79.9  60.2  35.1
babble noise         96.8   95.9   94.4   84.7  76.4  52.1