Improvement of noisy speech recognition using a proportional alignment decoding algorithm in the training phase

Wei-Wen Hung
Department of Electrical Engineering, Ming Chi Institute of Technology, Taishan, Taiwan, 243 ROC
E-mail: wwhung@ccsun.mit.edu.tw, Fax: 886-02-2903-6852, Tel.: 886-02-2906-0379

and

Hsiao-Chuan Wang
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 30043 ROC
E-mail: hcwang@ee.nthu.edu.tw, Fax: 886-03-571-5971, Tel.: 886-03-574-2587

Corresponding author: Hsiao-Chuan Wang
Abstract

Modeling the state duration of HMMs can effectively improve the accuracy of decoding the state sequence of an utterance and thus improve speech recognition accuracy. However, when a speech signal is contaminated by ambient noise, the decoded state sequence may be distorted: it may stay in some states too long or too short even with the help of state duration models. This paper presents a proportional alignment decoding (PAD) algorithm for re-training hidden Markov models (HMMs). A task of multi-speaker isolated Mandarin digit recognition was conducted to demonstrate the effectiveness and robustness of the PAD-based variable duration hidden Markov model (VDHMM/PAD) method. Experimental results show that the discriminability of VDHMM/PAD in noisy environments is significantly enhanced. Moreover, the proposed method outperforms widely used state duration modeling methods based on Poisson, gamma, Gaussian, bounded and non-parametric probability density functions.

This research has been partially sponsored by the National Science Council, Taiwan, ROC, under contract number NSC-85-2221-E-007-005.
1. Introduction

Hidden Markov model (HMM) is a well-known and widely used statistical approach to speech recognition. This method provides a powerful framework for modeling time-varying speech signals. One of the advantages of HMM is that it enables us to characterize speech signals as a parametric stochastic process, and the parameters of this stochastic process can be optimized by the estimation-maximization (EM) algorithm. In addition, the quality of an HMM can be significantly improved by incorporating the information of state duration (Rabiner, 1989). In a conventional hidden Markov model, the probability of staying in state $i$ for $d$ frames is modeled by

$p_i(d) = (a_{ii})^{d-1} \cdot (1 - a_{ii})$,

where $a_{ii}$ is the state transition probability from state $i$ to itself and $(1 - a_{ii})$ is the probability of transiting from state $i$ to other states. This inherent temporal characteristic implies that the state duration in a conventional HMM is exponentially distributed, which does not adequately model the temporal structures of different acoustic regions in a speech signal (Juang et al., 1985, Rabiner et al., 1985 & Rabiner et al., 1988). To cope with this deficiency, several modeling methods for state duration and word duration have been proposed. A. Bonafonte et al. (Bonafonte et al., 1996) used a Markov chain to model the occupancy of the HMM states, with the parameters of the Markov chain estimated directly from the duration data. To reduce the insertion error rate in connected digit recognition, K. Power proposed an expanded-state duration model (Power, 1996), in which each individual state was expanded into multiple sub-states, each sharing the original state observation probability density function (pdf). Moreover, K. Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not effective enough.
Therefore, a state duration constrained maximum likelihood (SDML) training scheme (Laurila, 1997) was presented to gradually tighten the duration constraints in a hidden Markov model. Duration modeling techniques are not only applied at the state level, but can also be extended to the word level.
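The exponential (geometric) state-duration behavior of a conventional HMM noted above can be illustrated with a few lines of code. This is a minimal sketch, not from the paper; the self-transition probability value is an arbitrary assumption for illustration:

```python
# Duration distribution implied by an HMM self-transition probability a_ii:
# p(d) = a_ii^(d-1) * (1 - a_ii), i.e. staying d-1 times, then leaving.

def implicit_duration_pdf(a_ii, d):
    """Probability of occupying a state for exactly d frames."""
    return (a_ii ** (d - 1)) * (1.0 - a_ii)

a_ii = 0.8  # illustrative value, not estimated from any data
pdf = [implicit_duration_pdf(a_ii, d) for d in range(1, 11)]
# The mass decays geometrically with d: long stays are exponentially
# unlikely, regardless of the true temporal structure of the acoustic region.
print([round(p, 4) for p in pdf])
```

The monotone decay of this pdf is exactly why it cannot fit duration histograms that peak at some intermediate length, which motivates the explicit duration models reviewed below.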
David Burshtein (Burshtein, 1995) used explicit models of state and word durations to reduce the string error rate for a connected digit recognition task. In general, no matter what kind of duration modeling mechanism is employed, the probability density functions for modeling state duration distributions can be roughly classified into two categories (Gu et al., 1991): non-parametric and parametric methods. For the non-parametric method, the distribution of state duration is directly estimated from the training data. Thus, we can obtain a more accurate duration distribution for each state in a word model. However, this approach needs a large amount of training utterances to reach a desired degree of accuracy, and it also requires a considerable amount of memory space for the storage of all the duration distributions. On the other hand, for the parametric method, specific probability density functions, such as Poisson (Russell et al., 1985 & Russell et al., 1987), gamma (Levinson, 1986 & Burshtein, 1995), Gaussian (Rabiner, 1989 & Burshtein, 1995) and bounded density functions (Gu et al., 1991, Kim et al., 1994, Vaseghi, 1995, Power, 1996 & Laurila, 1997), are used to model the state duration distributions explicitly, so that only a few parameters are required to completely specify each distribution. There are, however, drawbacks to the parametric approach. One is that the assumed probability density function may not always fit the real duration distribution of each state in a hidden Markov model. Moreover, most research on modeling duration distributions has dealt with minimizing the recognition errors attributable to unrealistic duration modeling, while ignoring ambient noise. How to make a duration model more robust to noise contamination is still a problem to be solved.
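The direct-counting idea behind the non-parametric method can be sketched as follows. This is a toy illustration, not the paper's code (the formal definition is given later in Eqs. (7)-(8)); the decoded state sequences below are made-up data:

```python
# Non-parametric state-duration estimation: for each decoded state sequence,
# measure how many consecutive frames each state occupies, then normalize the
# counts into a relative-frequency histogram.
from collections import Counter

def duration_histogram(state_sequences, state):
    """Relative frequency of each duration length d for `state`,
    pooled over all decoded state sequences."""
    counts = Counter()
    for seq in state_sequences:
        run = 0
        for s in seq:
            if s == state:
                run += 1
            elif run > 0:
                counts[run] += 1  # a run of `state` just ended
                run = 0
        if run > 0:
            counts[run] += 1      # run that reaches the end of the utterance
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}

# toy decoded sequences for a 3-state left-to-right model
seqs = [[0, 0, 1, 1, 1, 2], [0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 2]]
print(duration_histogram(seqs, 1))
```

Note the trade-off the text describes: each state needs enough observed runs for the histogram to be reliable, and one histogram per state per word model must be stored.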
In this paper, we focus our attention on the robustness of state duration modeling in noisy environments and neglect the modeling of word duration. This is due to the fact (Burshtein, 1995) that state duration modeling is the major contributor to the improvement of recognition rate. In Section 2, some methods of state duration modeling are reviewed. Then, a series of experiments
were conducted to compare those methods in Section 3, where the behaviors of various duration models under the influence of noise contamination are also investigated. In Section 4, based on the results obtained in the previous section, we propose a new method that combines a proportional alignment decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov model. The result is a variable duration hidden Markov model, denoted VDHMM/PAD. The state duration distributions of VDHMM/PAD prove to be more robust than those of other methods in noisy environments. An experiment on multi-speaker isolated Mandarin digit recognition is conducted in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a conclusion is given in Section 6.

2. Overview of state duration modeling methods

When the statistics of state duration are incorporated into both the training and recognition phases of a conventional hidden Markov model, the result is a variable duration hidden Markov model (VDHMM) (Levinson, 1986 & Rabiner, 1989). In a VDHMM, the likelihood function is defined in terms of modified forward and backward likelihoods. Let $O = o_1 o_2 \ldots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,j)$ are defined as (Levinson, 1986, Rabiner, 1989 & Hung et al., 1997)

$\alpha_t(w,j) = p(o_1 o_2 \ldots o_t, q_t = j \mid \lambda(w)) = \sum_{d} \sum_{i=1, i \neq j}^{S_w} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=t-d+1}^{t} b_{w,j}(o_\tau)$  (1)

and

$\beta_t(w,i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda(w))$
$= \sum_{j=1, j \neq i}^{S_w} \sum_{d} a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=1}^{d} b_{w,j}(o_{t+\tau}) \cdot \beta_{t+d}(w,j)$,  (2)

where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the present state at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the symbol distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the $j$-th state duration pdf of word model $\lambda(w)$ with duration length of $d$ frames. Then, given a variable duration hidden Markov model $\lambda(w)$, the likelihood function of an observation sequence $O$ can be modeled as

$p(O \mid \lambda(w)) = \sum_{i=1}^{S_w} \sum_{j=1, j \neq i}^{S_w} \sum_{d=1}^{D(w,j)} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=t-d+1}^{t} b_{w,j}(o_\tau) \cdot \beta_t(w,j)$,  (3)

where $D(w,j)$ indicates the allowable maximum duration length within the $j$-th state of word model $\lambda(w)$. Based on the above definitions, the derivation of re-estimation formulas for the variable duration HMM is formally identical to that for the conventional HMM (Levinson, 1986 & Rabiner, 1989). For a left-to-right variable duration HMM without jumps, the maximum likelihood $p(O \mid \lambda(w))$ can be efficiently calculated by a three-dimensional (time, state, duration) Viterbi decoding algorithm, derived from the literature proposed by Gu et al. (Gu et al., 1991), which can be summarized as follows:

for $d = 1$,
$\psi_t(w,j,1) = \max_{\tilde d}\{\psi_{t-1}(w,j-1,\tilde d) + \log[p_{w,j-1}(\tilde d)]\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)]$,  (4)

for $d \geq 2$,
$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)]$,  (5)

and
$p(O \mid \lambda(w)) = \max_{d}\{\psi_T(w,S_w,d) + \log[p_{w,S_w}(d)]\}$,  (6)

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state 1 to state $j-1$ along a state sequence of duration length $(t-d)$ frames and producing the observations $o_1 o_2 \ldots o_{t-d}$, and then staying at state $j$ and producing the observations $o_{t-d+1} \ldots o_t$ at that state. From the above description, we can see that successful modeling of state duration distributions promotes the performance of an HMM-based speech recognizer. In general, the modeling methods for state duration can be classified into two categories, i.e., non-parametric and parametric modeling methods.

2.1 Non-parametric state duration modeling method

In non-parametric approaches (Juang et al., 1985, Rabiner et al., 1985, Rabiner et al., 1988, Anastasakos et al., 1995 & Hung et al., 1997), the probabilities $p_{w,j}(d)$ used for describing state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for word model $\lambda(w)$, and $N_w$ be the total number of training utterances of word $w$. Then the probabilities $p_{w,j}(d)$ can be estimated by

$p_{w,j}(d) = \frac{\sum_{t=1}^{N_w} \Theta_d(d_{w,j,t})}{N_w}$ for $d \geq 1$,  (7)

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function defined as

$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d_{w,j,t} = d, \\ 0, & \text{otherwise}. \end{cases}$  (8)

In this non-parametric approach, the accuracy of the duration model depends on the amount of training
data. When the amount of training data is sufficient, this modeling method can well approximate the temporal characteristic of each state in a hidden Markov model. However, the large number of parameters to be stored is one of its drawbacks. A non-parametric approach for isolated Mandarin digit recognition proposed by Hung et al. (Hung et al., 1997) showed that the recognition rates were significantly improved in comparison with those of the conventional HMM under the influence of white noise: from 48.8% for the baseline HMM to 62.0% for the non-parametric approach when the signal is contaminated with white noise at an SNR of 20 dB.

2.2 Parametric state duration modeling methods

In parametric approaches, specific probability density functions are used to model the distribution of state duration explicitly. The parametric approach has the advantage that only a few parameters are required to completely specify the probability density function. Thus, compared with non-parametric approaches, the required memory space can be significantly reduced. One drawback of parametric duration modeling is that the assumed probability density function may not always match the actual duration distribution of each state in a hidden Markov model. Several probability density functions, including Poisson, gamma, bounded and Gaussian duration density functions, have been proposed to model the distribution of state duration. Detailed formulations of these duration modeling methods are described as follows.

2.2.1 Poisson distribution for state duration

To characterize the duration property more effectively, M. J. Russell (Russell et al., 1985 & Russell et al., 1987) replaced the self-transition probability in a conventional HMM by a Poisson duration density function so that there was no self-transition from a state back to itself. This is the so-called hidden
semi-Markov model (HSMM). The hidden semi-Markov model with Poisson distributed state duration is thought to have several advantages. First, the Poisson probability density function represents a plausible model for state duration. Second, only one parameter, the state duration mean, is needed to specify the distribution of state duration. Third, maximum likelihood estimation of the state duration mean can be accomplished by methods analogous to the standard Baum-Welch re-estimation process. When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$p_{w,j}(d) = \frac{(\bar d_{w,j})^{d-1} \cdot e^{-\bar d_{w,j}}}{(d-1)!}$ for $d \geq 1$,  (9)

where $\bar d_{w,j}$ denotes the duration mean of the $j$-th state in word model $\lambda(w)$. For comparison, a hidden Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM) with Poisson distributed state duration were applied to a speaker-dependent isolated word recognition task (Russell et al., 1985). Experimental results for the third set of recordings showed that the error rate of the HSMM is 11.8% and 6.3% lower than those of HMM and DTW, respectively.

2.2.2 Gamma distribution for state duration

In the literature proposed by Levinson (Levinson, 1986), the author first used a family of gamma probability density functions to characterize the distribution of state duration, forming a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution was considered ideally suited to the specification of a duration density function since it assigns zero probability to negative duration lengths, and only two parameters, the state duration mean and variance, are required to specify its distribution. Moreover, David Burshtein (Burshtein, 1995) proposed a modified Viterbi decoding algorithm that incorporates both state and word duration models for connected digit string recognition. In
this approach, a duration penalty based on a gamma density function is applied at each frame transition. The modified Viterbi decoding algorithm was shown to have essentially the same computational requirements as the conventional Viterbi algorithm. The experimental results showed that the modified Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to 2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string length, as compared with the baseline HMM. The gamma duration density function can be formulated as

$p_{w,j}(d) = \frac{\xi_{w,j}^{\gamma_{w,j}} \cdot d^{\gamma_{w,j}-1} \cdot e^{-\xi_{w,j} \cdot d}}{\Gamma(\gamma_{w,j})}$ for $d \geq 1$  (10)

with

$\gamma_{w,j} = \frac{\bar d_{w,j} \cdot \bar d_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar d_{w,j}}{\nabla_{w,j}}$,  (11)

where $\bar d_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state in word model $\lambda(w)$, respectively, and $\Gamma(z)$ is the gamma function defined by

$\Gamma(z) = \int_0^\infty x^{z-1} \cdot e^{-x} \, dx$ for $z > 0$.  (12)

2.2.3 Bounded state duration

Due to the characteristics of continuous probability density functions, both Poisson and gamma functions have the advantage of operating well when a relatively small number of training utterances is available. However, in some situations there exists the possibility that the duration length of some states will be too long or too short. To avoid such unexpected durations and minimize erroneous matches between testing utterances and reference models, H. Y. Gu et al. (Gu et al., 1991) proposed a hidden Markov model with bounded state duration, in which the allowable state duration is constrained by boundaries. The duration length of each state in this approach is simply bounded by lower and upper bounds in the
recognition phase. The probability density function for bounded state duration is modeled by

$p_{w,j}(d) = \begin{cases} \dfrac{1}{D_{w,j}^{upper} - D_{w,j}^{lower} + 1}, & \text{if } D_{w,j}^{lower} \leq d \leq D_{w,j}^{upper}, \\ 0, & \text{otherwise}, \end{cases}$  (13)

where $D_{w,j}^{lower}$ and $D_{w,j}^{upper}$ are the lower and upper bounds of the state duration for state $j$ of word model $\lambda(w)$, and can be estimated by

$D_{w,j}^{lower} = \min_{t=1}^{N_w} \{ d_{w,j,t} \}$  (14)

and

$D_{w,j}^{upper} = \max_{t=1}^{N_w} \{ d_{w,j,t} \}$.  (15)

A series of experiments using all 408 highly confusable first-tone Mandarin syllables (Gu et al., 1991) was conducted to evaluate the effectiveness of the HMM with bounded state duration (BSD). In the discrete case, the recognition rate of the HMM with BSD is 78.5%, which is 9.0%, 6.3% and 1.9% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma distributed state duration, respectively. In the continuous case, the recognition rate of the HMM with BSD is 88.3%, which is 6.3%, 5.9% and 3.1% higher than those of the conventional HMM, the HMM with Poisson and the HMM with gamma distributed state duration, respectively. Similar applications of bounded state duration distributions for speech recognition can be found in the literature by Kim et al. (Kim et al., 1994), Vaseghi (Vaseghi, 1995) and Power (Power, 1996). The minimum and maximum durations for each state were estimated in the training phase, and those loose state duration constraints were then used in the final recognition phase. To tighten these duration constraints, K. Laurila (Laurila, 1997) employed a bounded state duration model in both the training and recognition phases to achieve higher consistency in state duration constraints.

2.2.4 Gaussian distribution for state duration
A parametric approach using a Gaussian probability density function for modeling state duration distributions was suggested by Rabiner (Rabiner, 1989). Moreover, David Burshtein (Burshtein, 1995) also claimed that a Gaussian pdf provides a good approximation for word duration. By modeling word duration using a Gaussian pdf, the string error rate was further reduced from 2.86% to 2.78% for the case of unknown string length, and from 1.60% to 1.59% for the case of known string length, as compared with the baseline HMM. The Gaussian duration density function can be formulated as

$p_{w,j}(d) = \frac{1}{\sqrt{2 \pi \cdot \nabla_{w,j}}} \cdot \exp\left\{ -\frac{(d - \bar d_{w,j})^2}{2 \cdot \nabla_{w,j}} \right\}$.  (16)

3. Comparison of state duration modeling methods

3.1 Databases and experimental conditions

A task of multi-speaker isolated Mandarin digit recognition was conducted for the comparison of the state duration modeling methods described above. The database for the experiments was provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin digits in each of three sessions, yielding a total of 3000 utterances recorded at a sampling rate of 8 kHz. Each frame, which contained 256 samples with 128 samples of overlap, was multiplied by a 256-point Hamming window. Pre-silence and post-silence of 0.1 ~ 0.5 seconds were included. Each digit was modeled as a left-to-right HMM of 7 ~ 9 states, including the pre-silence and post-silence states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature vector was composed of 12-order LPC-derived cepstral coefficients, 12-order delta cepstral coefficients and one delta log-energy. The NOISEX-92 noise database (Varga et al., 1992) was used for generating the noisy speech. In our study, three kinds of noise, including white noise, F16 cockpit noise and babble noise, were directly
added to the clean speech in the time domain to simulate speech contaminated by noise. When noise was added to the clean speech, the signal-to-noise ratio (SNR) was defined by the following equation:

$\mathrm{SNR} = 10 \cdot \log\left(\frac{E_s}{E_n}\right)$,  (17)

where $E_s$ is the total energy of the clean speech and $E_n$ is the energy of the added noise over the entire speech portion. The F16 cockpit noise was recorded at the co-pilot's seat in a two-seat F16 traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of the babble noise was 100 people speaking in a canteen, in which individual voices were slightly audible. The subsequent experiments were conducted to examine the following problems: (1) the effectiveness of state duration modeling methods, (2) the incorporation of state duration modeling in the training phase, and (3) the robustness of state duration modeling methods in noisy environments.

3.2 Effectiveness of state duration modeling methods

The first two sessions of collected utterances in the database were used to train an initial set of word models using the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional HMM-based word model (denoted 'Baseline' HMM) was established for each isolated Mandarin digit, the training utterances were time-aligned with their corresponding word models. Using the standard Viterbi decoding algorithm, we can re-decode each utterance into a state sequence, from which the number of frames spent in every state is known. Based on these decoded state durations, we can find the distribution of state duration for each state in a word model. This distribution can be treated as the non-parametric model of state duration and is denoted HMM/Npar. In Fig. 1, we show the duration distributions of the seven states in the HMM/Npar for isolated Mandarin digit '4'. The state duration distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in
Fig. 1 for comparison, denoted HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD, respectively. The third session of collected utterances was used as a clean version of testing data for evaluating the effectiveness of the various state duration modeling methods. In the recognition phase, a testing utterance is decoded into a state sequence using the standard Viterbi decoding algorithm for the 'Baseline' HMM method, and using the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6), for the other state duration modeling methods. The resulting recognition rates for the various state duration modeling methods are shown in Table I. ( Fig. 1 and Table I about here ) Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. We can find that the distribution of state duration differs from state to state and cannot be confined to a single type of probability density function; no single probability density function can fit the statistical characteristics of all the states in a word model. Furthermore, we can also find that HMM/Gam and HMM/Gau are more capable than HMM/Pois and HMM/BSD of modeling the state duration distributions represented by HMM/Npar. In particular, the gamma function is slightly better than the Gaussian function. This result is consistent with the conclusion given by David Burshtein (Burshtein, 1995), which stated that the gamma function can provide high-quality approximations for state duration and word duration. For HMM/BSD, the lower and upper bounds of state duration can prevent any state from occupying too many or too few frames. However, the state duration distribution within the range of allowable durations is treated as a uniform distribution, which cannot well approximate the actual distribution of state duration. This fact does affect the performance, as shown in Table I.
From the experimental results shown in Table I, we can find that the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher recognition rates than the baseline HMM, and that the recognition rate of HMM/Gam is superior to those of the other methods. We conclude that a good modeling method for state duration can improve the
recognition accuracy.

3.3 Incorporation of state duration modeling in the training phase

When the statistics of state duration are considered only in the recognition phase but not in the training phase, the result is quite loose state duration constraints (Laurila, 1997). To solve this inconsistency problem, a variable duration hidden Markov model (VDHMM) (Levinson, 1986, Rabiner, 1989 & Laurila, 1997), which incorporates state duration statistics into both the training and recognition phases of a word model, has been proposed to seek further improvement in recognition accuracy. The duration distribution of each state in a word model can be obtained as follows:

Step 1. The segmental k-means algorithm and standard Viterbi decoding method are used to train an initial set of word models.
Step 2. The duration statistics for each state in a word model are estimated and modeled by non-parametric or parametric methods.
Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded into a maximum likelihood state sequence.
Step 4. According to those maximum likelihood state sequences, the statistics of each state are re-calculated and the parameters of the underlying state duration model are revised.

Step 3 and Step 4 are iterated several times to produce a final set of desired word models. In Fig. 2, we show the duration distributions of the seven states in the VDHMMs for isolated Mandarin digit '4' using various state duration modeling methods. The variable duration HMMs with non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively. Moreover, the clean speech recognition rates based on variable duration HMMs are also shown in
Table II. Comparing Fig. 1 and Fig. 2 reveals that tighter duration constraints make the fluctuations of some state duration distributions in the HMM/Npar more obvious. This phenomenon can be found in the 4-th, 5-th and 6-th states of word model '4'. In addition, the duration distributions of some states (e.g., the 3-rd and 7-th states) become more concentrated and sharper. Table I and Table II show that, whether employing non-parametric or parametric approaches, the VDHMM methods are better than the corresponding HMM methods. Since there are two confusable sets in Mandarin digit speech ("1" vs. "7" and "6" vs. "9"), the recognition rate can hardly be further improved on clean speech for this specific task. Even though the improvement is small, it does demonstrate the effectiveness of applying state duration models in both the training and recognition phases. ( Fig. 2 and Table II about here )

3.4 Robustness of state duration modeling methods

When a speech recognition system is deployed in a noisy environment, the background noise will cause a mismatch of statistical characteristics between the testing speech and the reference models. Due to this environmental mismatch, it is very possible that some state with very high likelihood scores will dominate the result of the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence whose state durations are too long or too short may be obtained even if a state duration modeling method is employed. This phenomenon causes drastic degradation of the recognition rate of a speech recognizer. In this subsection, a series of experiments was conducted to evaluate the robustness of various methods for modeling state duration in noisy environments. In our experiments, the first two sessions of collected utterances in the database were used to train a set of word models.
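The duration-constrained decoding of Eqs. (4)-(6), used throughout these experiments to obtain maximum likelihood state sequences, can be sketched for a left-to-right model without jumps. This is a minimal sketch, not the authors' implementation; the array names `log_b`, `log_p` and `log_a` are illustrative assumptions:

```python
# Three-dimensional (time, state, duration) Viterbi decoding, Eqs. (4)-(6).
# log_b[j][t] : frame log-likelihood log b_{w,j}(o_t)
# log_p[j][d] : log state-duration probability log p_{w,j}(d), d = 1..Dmax
# log_a[i][j] : log state-transition probability log a_{w,ij}
NEG = -1e30  # stand-in for log(0)

def duration_viterbi(log_b, log_p, log_a, T, S, Dmax):
    # psi[t][j][d]: best log-likelihood of being in state j at time t,
    # having occupied state j for exactly d frames so far.
    psi = [[[NEG] * (Dmax + 1) for _ in range(S)] for _ in range(T)]
    psi[0][0][1] = log_b[0][0]  # decoding starts in the first state
    for t in range(1, T):
        for j in range(S):
            # d == 1: enter state j from state j-1, closing its duration (Eq. (4))
            if j > 0:
                best = max(psi[t - 1][j - 1][d] + log_p[j - 1][d]
                           for d in range(1, Dmax + 1))
                psi[t][j][1] = best + log_a[j - 1][j] + log_b[j][t]
            # d >= 2: stay in state j for one more frame (Eq. (5))
            for d in range(2, Dmax + 1):
                psi[t][j][d] = psi[t - 1][j][d - 1] + log_b[j][t]
    # termination in the last state, scoring its final duration (Eq. (6))
    return max(psi[T - 1][S - 1][d] + log_p[S - 1][d]
               for d in range(1, Dmax + 1))
```

Keeping a duration index alongside time and state is what lets the duration pdf penalize implausibly short or long stays during decoding, at the cost of a factor-Dmax increase over the standard two-dimensional Viterbi trellis.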
To generate noisy speech, noise at specific SNR values was added to the clean testing data, i.e., the third session of the database. The distorted utterances were then evaluated against their corresponding word models and decoded into state sequences. Thus, from those maximum likelihood state
sequences, we can find the state duration distributions under the influence of additive white noise. Fig. 3 through Fig. 6 plot the duration distributions of the 5th and 6th states of the isolated Mandarin digit '4' under the influence of white noise. In addition, the recognition rates under white noise, F16 cockpit noise and babble noise for the various HMMs and VDHMMs are presented in Table III and Table IV.

( Fig. 3 - Fig. 6, Table III - Table IV about here )

The results in Table III and Table IV confirm that properly employing a duration model does improve recognition accuracy in noisy environments. Above all, further improvement is obtained by using a variable duration hidden Markov model. The relative performance of the HMMs and VDHMMs in the different noisy environments mirrors the clean speech results listed in Table I and Table II. It is worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD) modeling method are higher than those of the other parametric duration modeling methods. One explanation is that the BSD method is more effective than the other parametric methods at preventing a state from occupying too many or too few speech frames. From Fig. 3 through Fig. 6 we can also see that additive white noise distorts the duration distribution of each state in a word model. As the background becomes noisier, the duration distribution of the 5th state of Mandarin digit "4" gradually shifts to the left while that of the 6th state shifts to the right. In particular, at very low signal-to-noise ratios, e.g., 0 dB, the duration density functions of some states become extremely concentrated at unexpected duration lengths even with the help of state duration modeling methods.
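The state duration distributions examined here are empirical histograms of how long each decoded state sequence remains in each state. A minimal sketch of that measurement, assuming the decoded alignments are available as per-frame lists of state indices (the function names are hypothetical):

```python
from collections import Counter
from itertools import groupby

def state_run_lengths(state_seq):
    """Length of each consecutive run in a per-frame state sequence,
    returned as (state, duration) pairs."""
    return [(s, sum(1 for _ in run)) for s, run in groupby(state_seq)]

def duration_histograms(decoded_seqs, num_states):
    """Non-parametric duration distribution p_j(d) for each state j,
    estimated by counting run lengths over a set of decoded sequences."""
    counts = [Counter() for _ in range(num_states)]
    for seq in decoded_seqs:
        for state, dur in state_run_lengths(seq):
            counts[state][dur] += 1
    # Normalize each state's counts into a probability distribution.
    hists = []
    for c in counts:
        total = sum(c.values())
        hists.append({d: n / total for d, n in c.items()} if total else {})
    return hists
```

Running the same counting step on clean and on noise-contaminated alignments is what exposes the leftward/rightward drift of the distributions described above.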
This implies that the underlying duration density functions of those modeling methods are not sufficiently robust to noise contamination. For some state duration modeling methods, the probability density function is relatively smooth over the range of allowable duration lengths; this reduces the discriminability among duration lengths in noisy environments and leads to erroneous state sequences. Moreover, owing to the parametric nature of these models, i.e., the widespread range of the state duration distribution, it is quite possible for a state to last too long or too short in the decoded state sequence.

From the above discussion, we conclude that:
(1) The non-parametric duration modeling method can accurately specify the duration distribution of each state in a hidden Markov model.
(2) The duration modeling method must be applied in both the training and recognition phases so that the state duration constraints in the two phases are consistent.
(3) A sharper state duration pdf may enhance the discriminability of the allowable duration lengths.
(4) A narrow state duration distribution efficiently prevents a decoded state from lasting too long or too short.

4. Implementation of the VDHMM/PAD

In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997) combined with the statistics of state durations is proposed to re-train a conventional hidden Markov model, resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding in the intermediate stage of training a word model. It produces a new set of state duration statistics in which the distribution of state duration becomes sharper and more concentrated, in accordance with the conclusions of the previous section. It is also worth noting that the PAD method is not used in the recognition phase. The detailed implementation of the VDHMM/PAD is described as follows.

4.1 Formulation of the proportional alignment decoding algorithm

Consider the training of a word model $\lambda(w)$ that belongs to a set of $M$ word models. The parameter set of the word model $\lambda(w)$ is represented as $\lambda(w) = \{\mu_w, \Sigma_w, P_w, A_w, B_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \le j \le S_w$ denote the mean vector and covariance matrix of the
$j$-th state in the word model $\lambda(w)$, respectively. $P_w = \{p_{w,j}(d)\}$, $A_w = \{a_{w,ij}\}$ and $B_w = \{b_{w,j}(O)\}$ for $1 \le j \le S_w$ represent the probability density functions of the state durations, state transitions and state outputs of the word model $\lambda(w)$, respectively. Note that the probability density function $p_{w,j}(d)$ is modeled by the non-parametric duration modeling method.

Let $X(w) = \{X_t(w),\ 1 \le t \le N_w\}$ be the set of feature vector sequences extracted from all the training utterances for the word model $\lambda(w)$. Here $X_t(w)$ denotes the feature vector sequence of the $t$-th training utterance, which has $K_t^w$ frames and can be expressed as $X_t(w) = x_{t,1}^w x_{t,2}^w \cdots x_{t,K_t^w}^w$. In a continuous-density HMM, the output probability density function $b_{w,j}(x_{t,k}^w)$ is characterized by a Gaussian function defined as follows:

$$b_{w,j}(x_{t,k}^w) = (2\pi)^{-D/2}\,|\Sigma_{w,j}|^{-1/2} \exp\{-\tfrac{1}{2}(x_{t,k}^w-\mu_{w,j})^T \Sigma_{w,j}^{-1}(x_{t,k}^w-\mu_{w,j})\}, \qquad (18)$$

where $D$ is the dimension of the feature vector $x_{t,k}^w$. Based on the set of word models $\lambda = \{\lambda(w),\ 1 \le w \le M\}$ and the standard Viterbi decoding algorithm, we can decode the $t$-th training utterance of word $w$, $X_t(w)$, into a state sequence $q_{w,t} = q_{w,t,1} q_{w,t,2} \cdots q_{w,t,K_t^w}$. Let $d_{w,j,t}$ denote the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$. The state duration mean $\bar{d}_{w,j}$ of state $j$ in the word model $\lambda(w)$ is then formulated as

$$\bar{d}_{w,j} = \frac{1}{N_w}\sum_{t=1}^{N_w} d_{w,j,t} \quad \text{for } 1 \le j \le S_w. \qquad (19)$$

Moreover, the word duration mean $\bar{d}_w$, defined as the sum of all the state duration means in the word model $\lambda(w)$, can also be expressed as
$$\bar{d}_w = \sum_{j=1}^{S_w} \bar{d}_{w,j}. \qquad (20)$$

The state duration ratio of the $j$-th state to all the states in the word model $\lambda(w)$ can then be calculated as

$$\Re_j^w = \frac{\bar{d}_{w,j}}{\bar{d}_w} \quad \text{for } 1 \le j \le S_w. \qquad (21)$$

Once $\Re_j^w$ is obtained for every state of every word model, the proportional alignment decoding procedure proceeds in a simple way: each training utterance of word $w$ is re-decoded into a new state sequence

$$\tilde{q}_{w,t} = \tilde{q}_{w,t,1}\,\tilde{q}_{w,t,2}\cdots \tilde{q}_{w,t,K_t^w}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (22)$$

For example, if the $t$-th training utterance of word $w$ lasts $K_t^w$ frames, we segment it into $S_w$ states according to the following rule:

$$x_{t,k}^w \in \Omega_v(w) \ \text{and}\ \tilde{q}_{w,t,k} = v \quad \text{iff} \quad k \in \Big[\Big(\sum_{j=1}^{v-1}\Re_j^w\Big) K_t^w + 1,\ \Big(\sum_{j=1}^{v}\Re_j^w\Big) K_t^w\Big], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w),\ 1 \le v \le S_w\}$ and $\Omega_v(w)$ is the set of collected vectors belonging to state $v$ in the word model $\lambda(w)$.

4.2 Training procedure of VDHMM/PAD

The training procedure works as follows.

Step 1. Obtain initial word models.
Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from the training utterances of word $w$ are used to train
an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \le w \le M$.

Step 2. Decode training utterances and update word models.
(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance, such that

$$q_{w,t}^{p-1} = \arg\max_{q_{w,t}}\{p(X_t(w)\,|\,q_{w,t},\lambda^{p-1}(w)) \cdot p(q_{w,t}\,|\,\lambda^{p-1}(w))\}, \qquad 1 \le w \le M,\ 1 \le t \le N_w. \qquad (24)$$

(2) The decoded state sequence is denoted as $q_{w,t}^{p-1} = q_{w,t,1}^{p-1} q_{w,t,2}^{p-1} \cdots q_{w,t,K_t^w}^{p-1}$.
(3) Let $\Omega^{p-1}(w) = \{\Omega_j^{p-1}(w),\ 1 \le j \le S_w\}$, where $\Omega_j^{p-1}(w)$ is the set of vectors of state $j$ in the word model $\lambda^{p-1}(w)$. The feature vector of the $k$-th frame in utterance $t$, $x_{t,k}^w$, belongs to $\Omega_j^{p-1}(w)$ if its corresponding state is state $j$ of model $\lambda^{p-1}(w)$. The duration of state $j$ is then the number of vectors of utterance $t$ belonging to $\Omega_j^{p-1}(w)$, and the duration set is expressed as $d_t^{p-1}(w) = \{d_{w,j,t}^{p-1},\ 1 \le j \le S_w\}$.

Step 3. Align state sequences using the PAD method.
(1) From the duration set $d_t^{p-1}(w)$, we can find the state duration mean $\bar{d}_{w,j}^{p-1}$, the word duration mean $\bar{d}_w^{p-1}$ and the state duration ratio $\Re_j^{w,p-1}$ for each state in the word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).
(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states using Eq. (23). Thus we find the new state sequences $q_{w,t}^p = q_{w,t,1}^p q_{w,t,2}^p \cdots q_{w,t,K_t^w}^p$.
(3) Rearrange the sets of vectors collected in each state such that $x_{t,k}^w \in \Omega_j^p(w)$ if its corresponding state is state $j$ of model $\lambda^p(w)$. The new duration of state $j$ in utterance $t$, $d_{w,j,t}^p$, is thereby obtained.
(4) Use the duration set $d_t^p(w) = \{d_{w,j,t}^p,\ 1 \le j \le S_w\}$ to calculate the distribution of state duration:

$$p_{w,j}^p(d) = \frac{1}{N_w}\sum_{t=1}^{N_w} \Theta(d, d_{w,j,t}^p), \qquad d \ge 1, \qquad (25)$$

where $\Theta(d, d') = 1$ if $d = d'$ and $0$ otherwise.
(5) Use $\Omega^p(w) = \{\Omega_j^p(w),\ 1 \le j \le S_w\}$ to find the parameter set $\{\mu_w^p, \Sigma_w^p, A_w^p, B_w^p\}$ of the word model $\lambda^p(w)$.

Step 4. Re-train the word models.
(1) Calculate the accumulated log-likelihood of $X(w)$ by

$$\Delta^p(w) \equiv \sum_{t=1}^{N_w} \log p[X_t(w)\,|\,\lambda^p(w)] = \sum_{t=1}^{N_w}\{\log p(X_t(w)\,|\,q_{w,t}^p,\lambda^p(w)) + \log p(q_{w,t}^p\,|\,\lambda^p(w))\}, \qquad (26)$$

where

$$p(X_t(w)\,|\,q_{w,t}^p,\lambda^p(w)) = \prod_{k=1}^{K_t^w} b_{w,q_{t,k}^p}(x_{t,k}^w) \qquad (27)$$

and

$$p(q_{w,t}^p\,|\,\lambda^p(w)) = \prod_{k=1}^{K_t^w-1} a_{w,q_{t,k}^p,q_{t,k+1}^p}. \qquad (28)$$

(2) Based on the word model $\lambda^p(w)$, use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q_{w,t}^{p+1} = q_{w,t,1}^{p+1} q_{w,t,2}^{p+1} \cdots q_{w,t,K_t^w}^{p+1}$ for the $t$-th training utterance.
(3) Collect the vectors such that $x_{t,k}^w \in \Omega_j^{p+1}(w)$ if its corresponding state is state $j$ of model $\lambda^{p+1}(w)$.
(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.
(5) Update the accumulated log-likelihood of $X(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log p[X_t(w)\,|\,\lambda^{p+1}(w)], \qquad (29)$$

where the likelihood function $p[X_t(w)\,|\,\lambda^{p+1}(w)]$ can be evaluated efficiently using Eqs. (4)-(6).
(6) Convergence test. IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^p(w)}{\Delta^p(w)} > \Delta_{th}, \qquad (30)$$

THEN set $p+1 \to p$ and repeat Steps 4.(2)-4.(6); ELSE set $\lambda^{p+1}(w) \to \lambda_{VDHMM/PAD}(w)$.

4.3 Recognition procedure of VDHMM/PAD

Consider a testing utterance $Y = y_1 y_2 \cdots y_{T_y}$ with $T_y$ frames, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based on the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.
Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $Y$ based on the word model $\lambda_{VDHMM/PAD}(w)$.
Step 3. Calculate the likelihood score of $Y$ for the word model $\lambda_{VDHMM/PAD}(w)$ using Eqs. (4)-(6), i.e.,

$$p[Y\,|\,\lambda_{VDHMM/PAD}(w)] = p[Y\,|\,q^{**},\lambda_{VDHMM/PAD}(w)] \cdot p[q^{**}\,|\,\lambda_{VDHMM/PAD}(w)]. \qquad (31)$$
Step 4. Set $w+1 \to w$. IF $w \le M$, THEN repeat Steps 2-4; ELSE go to Step 5.
Step 5. Select the word whose likelihood score is highest, i.e.,

$$w^* = \arg\max_w \{p[Y\,|\,\lambda_{VDHMM/PAD}(w)]\}. \qquad (32)$$

5. Experiments and discussion

In this section, the same procedure described in Section 3.2 is used to find the distribution of state duration in the VDHMM/PAD. Moreover, to demonstrate the behavior of the state duration distributions of the VDHMM/PAD under the influence of white noise, the experiments of Section 3.4 are repeated here. Fig. 7 and Fig. 8 show the state duration distributions of the seven states in the VDHMM/PAD for the isolated Mandarin digit '4' and the distorted state duration distributions due to white noise contamination. The recognition rates of the VDHMM/PAD under white noise, F16 cockpit noise and babble noise are listed in Table V. Furthermore, for comparison, the experimental results listed in Tables I-V are plotted in Fig. 9. From those experimental results we observe the following facts:

(1) Distribution of state duration
Comparing Fig. 7 with Fig. 1 and Fig. 2, we find that for a conventional HMM employing the various state duration modeling methods, the distributions of state duration are relatively smooth and widespread. By incorporating state duration statistics into the training phase, the variable duration HMMs make the duration distributions of some states more concentrated and sharper, which results in higher recognition rates. In Fig. 7, we observe that for most states (e.g., the 2nd, 4th, 5th and 6th states) the allowable ranges of state duration modeled by the VDHMM/PAD become more concentrated. The
shapes of the state duration distributions are sharper than those of the HMMs and VDHMMs. In addition, compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in the VDHMM/PAD is more pronounced. This fluctuation, which also occurs in the duration distributions of the 2nd, 4th and 6th states of the VDHMM/Npar, is considered helpful for enhancing discriminability in recognizing noisy speech.

(2) Robustness to noise contamination
When the speech signal is contaminated by white noise, the state duration distributions shown in Fig. 3 through Fig. 6 are affected and distorted. In particular, at SNR = 0 dB the duration distributions are severely distorted and extremely concentrated at unexpected duration lengths. Taking Fig. 3 and Fig. 4 as examples, for some models (e.g., HMM/Npar, VDHMM/BSD) the duration distribution of the 5th state concentrates excessively at a duration of 3 frames at SNR = 0 dB, while for others (e.g., HMM/Gam, VDHMM/Gau) it concentrates at a duration of one frame. Moreover, the maximum probability of the 5th state duration increases dramatically from about 0.2~0.3 up to 0.8~1.0. In contrast to the state duration distributions shown in Fig. 3 through Fig. 6, Fig. 8 shows that even under the influence of white noise the original ranges of state duration in the VDHMM/PAD remain almost unchanged, and the duration distributions are less distorted by the ambient noise. When the SNR is reduced to 0 dB, the maximum probability of the 5th state duration increases only from 0.25 to 0.45. This implies that the VDHMM/PAD is more effective than the other duration modeling methods at preventing the state duration distribution from concentrating extremely at a specific duration length.

(3) Performance of noisy speech recognition
The recognition rates listed in Table V and the performances shown in Fig.
9 tell us that the VDHMM/PAD outperforms the HMMs and VDHMMs employing the other duration modeling methods in noisy environments. The improvement is most obvious at medium SNRs (10 to 15 dB) for white noise and at low SNRs (0 to 5 dB) for F16 cockpit noise and babble noise. In particular, when the distortion due to ambient noise is serious, as with white noise, the improvement in recognition rate is substantial. The superiority of the VDHMM/PAD over the other hidden Markov models discussed here is essentially due to its state duration distributions: the sharper, more concentrated duration distributions and the more fluctuated duration density functions give the VDHMM/PAD better discriminability and modeling capability in noisy environments.

It should be noted, however, that the VDHMM/PAD performs slightly worse than the other hidden Markov models in the clean condition. The reason can be explained as follows. The PAD method proportionally segments each training utterance into states, and this segmentation mechanism narrows the allowable ranges of some state duration distributions. Thus, the VDHMM/PAD can efficiently prevent any state from lasting too long or too short, which yields performance benefits in noisy environments. However, the same mechanism also causes a duration mismatch between clean testing speech and the reference models, which makes the recognition performance of the VDHMM/PAD degrade slightly in clean conditions compared with the other hidden Markov models.

( Fig. 7 - Fig. 9, Table V about here )

6. Conclusion

In this paper, we first examined the distribution of state duration in a conventional HMM and compared the effectiveness and performance of several widely used state duration modeling methods in noisy environments. Based on the weaknesses of the modeling methods we evaluated, a proportional alignment decoding (PAD) algorithm combined with the statistics of state duration is then proposed in the
training phase to re-train a conventional hidden Markov model and produce a new variable duration hidden Markov model (VDHMM/PAD). The PAD method makes the distribution of state duration sharper, more fluctuated and more concentrated, and thus improves the model's discriminability among allowable duration lengths under the influence of ambient noise. Experimental results have demonstrated the robustness of the VDHMM/PAD in noisy speech recognition. The proposed method provides better recognition rates than the conventional HMM and the other duration modeling methods in various noisy environments.

Acknowledgement

The authors would like to thank Dr. Lee Lee-Min of Mingchi Institute of Technology, Taipei, Taiwan, for sharing his valuable programming experience and for many fruitful discussions.

References

Anastasakos, A., Schwartz, R. & Shu, H. (1995). Duration modeling in large vocabulary speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with expanded HMM applied to speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 548-551.
Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-utterance speech recognition using hidden Markov models with bounded state durations. IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech recognition. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang, B. H. & Rabiner, L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1639-1641, September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for isolated word recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proceedings of the International Conference on Spoken Language Processing, pp. 1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of the
International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, January, pp. 4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1234, July-August.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for connected word recognition. AT&T Technical Journal, vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition using hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5-8.

Varga, A., Steeneken, H. J. M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, vol. 41, pp.
31-41.

Zeljkovic, I. (1996). Decoding optimal state sequence with smooth state likelihoods. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132.
Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

method            baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
recognition rate    97.2      97.6      97.5     97.4     97.2      96.8

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

method            baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
recognition rate    97.2      97.6        97.6       97.5       97.4        97.1

Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
clean     97.2      97.6      97.5     97.4     97.2      96.8
20 dB     48.8      62.0      60.9     60.4     59.6      57.0
15 dB     30.8      42.8      41.1     40.5     40.2      38.5
10 dB     19.2      26.8      25.4     24.7     25.3      23.6
 5 dB     11.2      20.8      20.1     19.4     19.7      19.3
 0 dB     10.0      17.6      16.4     16.0     16.0      17.6

Table III. (b) F16 cockpit noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB     92.0      95.2      93.8     93.5     93.2      92.8
15 dB     79.6      85.5      83.6     81.7     80.8      80.1
10 dB     67.6      74.7      73.2     72.8     72.5      71.6
 5 dB     44.0      54.3      53.7     52.8     53.4      52.2
 0 dB     15.2      25.6      23.5     22.5     22.8      22.3
Table III. (c) Babble noise.

SNR     baseline  HMM/Npar  HMM/Gam  HMM/Gau  HMM/Pois  HMM/BSD
20 dB     94.8      95.9      95.6     95.4     95.2      94.9
15 dB     88.0      92.2      91.1     90.3     89.7      88.2
10 dB     75.2      80.4      79.3     76.9     77.8      75.6
 5 dB     58.4      70.4      68.9     65.8     66.1      63.7
 0 dB     33.2      42.8      41.4     38.6     39.3      38.5

Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods. (a) White noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
clean     97.2      97.6        97.6       97.5       97.4        97.1
20 dB     48.8      67.6        64.8       63.9       61.6        59.4
15 dB     30.8      49.2        46.8       45.9       43.6        42.1
10 dB     19.2      31.2        29.0       27.4       28.4        26.9
 5 dB     11.2      24.0        22.8       21.7       22.0        20.8
 0 dB     10.0      18.4        17.3       17.1       17.2        18.5

Table IV. (b) F16 cockpit noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB     92.0      96.0        94.4       94.3       94.0        93.6
15 dB     79.6      86.4        84.1       82.3       81.4        80.9
10 dB     67.6      76.3        74.5       73.9       73.8        72.5
 5 dB     44.0      55.3        54.8       53.5       54.2        53.0
 0 dB     15.2      28.2        26.3       24.9       25.5        24.5
Table IV. (c) Babble noise.

SNR     baseline  VDHMM/Npar  VDHMM/Gam  VDHMM/Gau  VDHMM/Pois  VDHMM/BSD
20 dB     94.8      96.4        96.2       95.8       95.6        95.3
15 dB     88.0      93.5        91.8       90.9       90.6        89.4
10 dB     75.2      82.4        80.8       79.3       80.1        77.2
 5 dB     58.4      71.5        69.7       66.1       67.3        65.2
 0 dB     33.2      45.2        43.6       40.9       42.1        40.6

Table V. Noisy speech recognition rates (%) for the VDHMM/PAD.

noise type          clean  20 dB  15 dB  10 dB  5 dB  0 dB
white noise          96.8   72.4   60.0   44.0  29.6  24.8
F16 cockpit noise    96.8   95.2   87.3   79.9  60.2  35.1
babble noise         96.8   95.9   94.4   84.7  76.4  52.1
