Smoothing Hidden Markov Models by Using an
Adaptive Signal Limiter for Noisy Speech Recognition
Wei-Wen Hung
Department of Electrical Engineering
Ming Chi Institute of Technology
Taishan, 243, Taiwan, Republic of China
E-mail : wwhung@ccsun.mit.edu.tw
FAX : 886-02-2903-6852; Tel. : 886-02-2906-0379
and
Hsiao-Chuan Wang
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, 30043, Taiwan, Republic of China
E-mail : hcwang@ee.nthu.edu.tw
FAX : 886-03-571-5971; Tel. : 886-03-574-2587
Paper No. : 1033. (second review)
Corresponding Author : Hsiao-Chuan Wang
Key Words : hidden Markov model (HMM), hard limiter, adaptive signal limiter
(ASL), autocorrelation function, arcsin transformation.
Smoothing hidden Markov models by using
an adaptive signal limiter for noisy speech recognition
Wei-Wen Hung and Hsiao-Chuan Wang
Department of Electrical Engineering, National Tsing Hua University
Hsinchu, 30043, Taiwan, Republic of China
Abstract. When a speech recognition system is deployed in the real world, environmental interference creates a mismatch between the noisy speech signals and the reference models, causing serious degradation in recognition accuracy. To deal with this environmental mismatch, a family of signal limiters has been successfully applied to a template-based DTW recognizer to reduce the variability of speech features in noisy conditions. Although simulation results indicate that heavy smoothing can effectively reduce the variability of speech features at low signal-to-noise ratios (SNRs), it also causes a loss of information in the speech features. We therefore suggest that the smoothing factor of a signal limiter should be related to the SNR and adapted on a frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and dynamic spectral features of the reference models and the test speech. After smoothing the spectral features, the smoothed covariance matrices of the reference models are obtained by means of maximum likelihood (ML) estimation. A multispeaker isolated Mandarin digit recognition task was conducted to evaluate the effectiveness and robustness of the proposed method. Experimental results indicate that the adaptive signal limiter achieves significant improvement in noisy conditions and is more robust than the hard limiter over a wide range of SNR values.
Key words. hidden Markov model (HMM), hard limiter, adaptive signal limiter (ASL), autocorrelation
function, arcsin transformation.
This research has been partially sponsored by the National Science Council, Taiwan, ROC, under contract
number NSC-88-2614-E-007-002.
LIST OF FIGURES AND TABLES
Fig. 1 Block diagram for implementing a speech recognizer with adaptive signal limiter.
Fig. 2 The various LPC log magnitude spectra of utterance ‘1’ in clean condition.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 3 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 4 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Fig. 5 The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise.
(a) Comparison of average log likelihoods without signal limiter.
(b) Comparison of average log likelihoods with hard limiter.
(c) Comparison of average log likelihoods with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 6 The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise.
(a) Comparison of average log likelihoods without signal limiter.
(b) Comparison of average log likelihoods with hard limiter.
(c) Comparison of average log likelihoods with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Table 1. Comparison of digit recognition rates (%) for white noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Table 2. Comparison of digit recognition rates (%) for factory noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Table 3. Comparison of digit recognition rates (%) for F16 noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)
Table 4. Comparison of computation costs based on a Pentium II 266 MHz personal computer.
1. Introduction
When a speech recognition system trained in a well-defined environment is used in real-world applications, the acoustic mismatch between the training and testing environments severely degrades its recognition accuracy. This acoustic mismatch is mainly caused by a wide variety of distortion sources, such as ambient additive noise, channel effects and the speaker's Lombard effect. During the past several decades, researchers have focused their attention on this mismatch problem and have tried to narrow the mismatch gap. Many algorithms have been proposed and successfully applied to robust speech recognition. Generally speaking, the methods for handling noisy speech recognition can be roughly classified into the following approaches (Sankar and Lee, 1996). The first approach tries to minimize the distance measures between reference models and testing signals by adaptively adjusting speech signals in the feature space. For example, Mansour and Juang (Mansour and Juang, 1989) found that the norm of a cepstral vector shrinks under noise contamination. Therefore, they used a first-order equalization method to adapt the cepstral means of the reference models so that the shrinkage of the speech features can be adequately compensated. Likewise, Carlson and Clement (Carlson and Clement, 1994) proposed a weighted projection measure (WPM) for recognition of noisy speech in the framework of the continuous density hidden Markov model (CDHMM). In addition, the norm shrinkage of cepstral means also leads to a reduction of the HMM covariance matrices. Thus, Chien et al. (Chien, 1997a; Chien et al., 1997b) proposed a variance-adapted and mean-compensated likelihood measure (VA-MCLM) to adapt the mean vector and covariance matrix simultaneously.
The second approach estimates a transformation function in the model space for transforming the reference models into the testing environment, so that the environmental mismatch gap can be effectively reduced. In the literature, a number of techniques compensate the ambient noise effect in the model space. Among them, one of the most promising techniques is the so-called parallel model combination (PMC). In the PMC algorithm, Varga and Moore (Varga and Moore, 1992a) adapted the statistics of the reference models to the testing conditions by optimally combining the reference models and a noise model in the linear spectral domain. In the following years, several related works were reported that improve the performance of the PMC method. Flores and Young (Flores and Young, 1992) integrated spectral subtraction (SS) with the PMC method to seek further improvement in recognition accuracy. In addition, Gales and Young (Gales and Young, 1995) extended the PMC scheme to include the effect of convolutional noise.
In the third approach, a more robust feature representation is developed in the signal space so that the speech features are invariant or less susceptible to environmental variations. Along this line, Lee and Lin (Lee and Lin, 1993) developed a family of signal limiters as a preprocessor to smooth speech signals. When a speech signal is passed through a signal limiter with zero smoothing factor (i.e., a hard limiter), the hard-limiting operation preserves the sign of the input speech signal and ignores its magnitude. Thus, the hard-limited speech signal is affected by ambient noise only when the signal-to-noise ratio (SNR) is relatively low. This smoothing process has been shown to be effective in reducing the variability of feature vectors in a noisy environment and makes them less affected by ambient noise over a wide range of SNR values. Experimental results for recognition of a 39-word alpha-digit vocabulary also demonstrate that an equivalent gain of 5-7 dB in SNR can be achieved for a template-based DTW recognizer.
However, from the experimental results reported by Lee and Lin (Lee and Lin, 1993), we can also observe that the recognition accuracy with a hard limiter becomes worse for clean speech. This phenomenon may be explained as follows. In an utterance, the amplitudes of unvoiced segments are generally much lower than those of voiced segments. Heavy smoothing can reduce the feature variability of speech segments with low SNR, but it also causes the loss of important information embedded in the clean segments and the segments with high SNR. Therefore, a signal limiter with a fixed smoothing factor might not work well for all segments of a speech utterance. We suggest that the smoothing factor of a signal limiter should be related to the SNR value and adapted on a frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and dynamic spectral features of the hidden Markov models (HMMs) and the testing speech signals. In addition, in order to properly reflect the variation of the model covariance caused by applying the signal limiting operation to the state statistics of the word models, the adaptation of the covariance matrix is also performed in the sense of maximum likelihood (ML) estimation.
The layout of this paper is as follows. In the subsequent section, we describe the detailed formulation of the proposed adaptive signal limiter and its extension to the framework of a continuous density hidden Markov model. In Section 3, we investigate the behavior of the LPC spectra of a speech utterance and its signal-limited version under the influence of various ambient noises. In addition, a series of experiments was conducted to compare the discriminability of different signal limiters in various noisy conditions. Experiments on recognition of multispeaker isolated Mandarin digits are presented in Section 4 to evaluate the effectiveness and robustness of the proposed method in the presence of ambient noise. Finally, a conclusion is drawn in Section 5.
2. Smoothing hidden Markov models by using an adaptive signal limiter
In this section, we describe the detailed formulation of the proposed adaptive signal limiter (ASL) and
its extension to the framework of an HMM-based speech recognizer.
2.1 Representation of the underlying hidden Markov models
Conventionally, for a continuous density hidden Markov model (CDHMM), the output likelihood measure of the t-th frame of the testing utterance Y = \{ y_t = [c_t, d_t], 1 \le t \le T_y \}, based on the statistics of the i-th state of word model \Lambda(w) = \{ \Lambda_{w,i} = (\mu_{w,i}, \Sigma_{w,i}), 1 \le i \le S_w \}, can be characterized by a multivariate Gaussian probability density function (pdf) and formulated as

p(y_t \mid \Lambda_{w,i}) = (2\pi)^{-p} \, |\Sigma_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2} (y_t - \mu_{w,i})^T \, \Sigma_{w,i}^{-1} \, (y_t - \mu_{w,i}) \right],   (1)

where \mu_{w,i} = [c_{w,i}, d_{w,i}] denotes the mean vector of the i-th state of word model \Lambda(w) and consists of the p-order cepstral vector c_{w,i} and the p-order delta cepstral vector d_{w,i}. \Sigma_{w,i} denotes the covariance matrix of the i-th state of word model \Lambda(w) and is simplified to a diagonal matrix, i.e., \Sigma_{w,i} = \mathrm{diag}[\sigma_{w,i}^2(1), \sigma_{w,i}^2(2), \ldots, \sigma_{w,i}^2(2p)]. However, in order to adequately reflect the variation of the dynamic spectral features caused by applying a signal limiting operation to the instantaneous spectral features, the representation of the state statistics in a conventional hidden Markov model is modified slightly. In our approach, the mean vector \mu_{w,i} = [c_{w,i}, d_{w,i}] of the i-th state of the word model \Lambda(w) is indirectly represented by the normalized autocorrelation vectors of a five-frame context window (Lee and Wang, 1995), that is, [r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2}], where r_{w,i,j} = [r_{w,i,j}(1), \ldots, r_{w,i,j}(p)]^T, j = 0 denotes the instantaneous frame, j = -1, -2 the left context frames and j = 1, 2 the right context frames. The estimation of these normalized autocorrelation vectors in a five-frame context window proceeds as follows. First, a conventional hidden Markov model is trained for each word by means of the segmental k-means algorithm. Then, based upon the obtained word models, each frame in the training utterances is labeled with its decoded state identity by using the Viterbi decoding algorithm. The instantaneous, left-context and right-context autocorrelation vectors corresponding to the same state identity are collected and averaged to obtain the indirect representation of the underlying hidden Markov models. For example, the normalized autocorrelation vectors of the i-th state of the word model \Lambda(w) can be formulated as

[r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2}] = \frac{1}{N_s} \sum_{u,t} [r^u_{w,t-2}, r^u_{w,t-1}, r^u_{w,t}, r^u_{w,t+1}, r^u_{w,t+2}],   (2)

where r^u_{w,t} represents the normalized autocorrelation vector of the t-th frame of the u-th training utterance of word w. The above summation includes all N_s frames which are labeled with state identity i of word model \Lambda(w).
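To make Eq. (2) concrete, the following Python sketch (hypothetical helper and variable names; framing and Viterbi state labeling are assumed to be done elsewhere) accumulates the five-frame context windows of normalized autocorrelation vectors over all training frames decoded into a given state and averages them:

```python
import numpy as np

def normalized_autocorrelation(frame, p):
    """Normalized autocorrelation lags r(1..p)/r(0) of one windowed frame."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])
    return r[1:] / r[0]

def state_context_autocorr(labeled_utterances, state_id, p):
    """Eq. (2): average the five-frame context windows of all frames
    labeled with `state_id`.  `labeled_utterances` is a hypothetical list
    of (frames, state_labels) pairs, one pair per training utterance."""
    acc = np.zeros((5, p))          # slots for context offsets j = -2..2
    n_s = 0                         # N_s: number of matching frames
    for frames, labels in labeled_utterances:
        for t in range(2, len(frames) - 2):   # need a full context window
            if labels[t] != state_id:
                continue
            for idx, j in enumerate(range(-2, 3)):
                acc[idx] += normalized_autocorrelation(frames[t + j], p)
            n_s += 1
    return acc / max(n_s, 1)        # rows are r_{w,i,-2} .. r_{w,i,2}
```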
Based upon this indirect representation, the analysis equations of the linear predictive coding (LPC) model can be expressed in matrix form as

R_{w,i,j} \cdot a_{w,i,j} = r_{w,i,j},  for j = -2, \ldots, 2,   (3)

where R_{w,i,j} is an autocorrelation matrix of the form

R_{w,i,j} = \begin{bmatrix} r_{w,i,j}(0) & r_{w,i,j}(1) & \cdots & r_{w,i,j}(p-1) \\ r_{w,i,j}(1) & r_{w,i,j}(0) & \cdots & r_{w,i,j}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{w,i,j}(p-1) & r_{w,i,j}(p-2) & \cdots & r_{w,i,j}(0) \end{bmatrix}.   (4)

Since the autocorrelation matrix is Toeplitz, symmetric and positive definite, the LPC coefficient vector a_{w,i,j} = [a_{w,i,j}(1) \; a_{w,i,j}(2) \; \cdots \; a_{w,i,j}(p)]^T can be solved efficiently by the Levinson-Durbin recursion (Rabiner and Juang, 1993). Once the LPC coefficient vector of Eq. (3) is obtained, the corresponding cepstral vector c_{w,i,j} can be recursively calculated by using the LPC-to-cepstral coefficient conversion formula

c_{w,i,j}(m) = a_{w,i,j}(m) + \sum_{k=1}^{m-1} \frac{k}{m} \, c_{w,i,j}(k) \, a_{w,i,j}(m-k),  1 \le m \le p.   (5)

Finally, the cepstral vector of the instantaneous frame, i.e., c_{w,i,j} for j = 0, is used as the mean vector c_{w,i} of the i-th state of word model \Lambda(w). In addition, the corresponding delta cepstral vector d_{w,i} can be calculated by using the following equation:

d_{w,i} = \frac{\sum_{j=-2}^{2} j \cdot c_{w,i,j}}{\sum_{j=-2}^{2} j^2}.   (6)
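A minimal sketch of Eqs. (3)-(6): the Levinson-Durbin recursion solves the Toeplitz system of Eqs. (3)-(4), the recursion of Eq. (5) converts the LPC coefficients to cepstra, and Eq. (6) forms the delta cepstrum from the five context cepstral vectors. The input `r` holds the normalized autocorrelation lags r(0..p), so r[0] = 1.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve R a = r (Eqs. 3-4) for the LPC coefficients a(1..p)."""
    a = np.zeros(p + 1)
    err = r[0]
    for m in range(1, p + 1):
        k = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / err  # reflection coeff.
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]
        a, err = a_new, err * (1.0 - k * k)
    return a[1:]                                          # a(1)..a(p)

def lpc_to_cepstrum(a, p):
    """Eq. (5): c(m) = a(m) + sum_{k=1}^{m-1} (k/m) c(k) a(m-k), 1 <= m <= p."""
    c = np.zeros(p + 1)                                   # c[0] unused
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - k - 1]
                              for k in range(1, m))
    return c[1:]

def delta_cepstrum(context_cepstra):
    """Eq. (6): linear regression over the five context cepstra, j = -2..2."""
    js = np.arange(-2, 3)
    return js @ np.asarray(context_cepstra) / np.sum(js ** 2)
```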
2.2 Formulation of the adaptive signal limiter
For recognition of noisy speech, it has been observed that employing a signal limiter to smooth a speech signal in the time domain leads to significant performance improvement. The basic theory of a signal limiter can be roughly described as follows (Lee and Lin, 1993). When a signal x is passed through a signal limiter, the signal limiting operation is equivalent to performing a nonlinear transformation on the input signal, so that the corresponding output signal y can be characterized by an error function of the form

y = s(x) = \frac{K}{\sqrt{2\pi\sigma^2}} \int_0^x \exp\!\left( -\frac{t^2}{2\sigma^2} \right) dt,   (7)

where K is a scaling constant and \sigma^2 is a tunable factor for adjusting the degree of smoothing of the signal limiting operation.
In light of this pronounced smoothing property, a signal limiter can readily be extended to the processing of speech signals in a noisy environment. Consider an input speech signal x, approximated by a zero-mean, stationary Gaussian process with variance \sigma_x^2, whose density function is

g(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\!\left( -\frac{x^2}{2\sigma_x^2} \right).   (8)

Then the output y of the signal limiter has the density function (see Appendix A)

h(y) = h(s(x)) = \frac{\sqrt{\delta}}{K} \exp\!\left[ -\frac{(\delta - 1) \, x^2}{2 \, \delta \, \sigma_x^2} \right],   (9)

where x = s^{-1}(y), and \delta denotes the smoothing factor of the signal limiter, defined as \delta = \sigma^2 / \sigma_x^2. The larger the value of \delta, the smaller the value of the output signal y. When the smoothing factor \delta approaches 0, the corresponding signal limiter reduces to a hard limiter of the form

y = f(x) = \begin{cases} K/2 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -K/2 & \text{if } x < 0 \end{cases}.   (10)
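As a side note, the integral of Eq. (7) has the closed form (K/2)·erf(x/(√2·σ)), so the limiter and its hard-limiting limit of Eq. (10) can be sketched as follows (our reading of the reconstructed equation, not code from the paper):

```python
import numpy as np
from scipy.special import erf

def signal_limiter(x, sigma, K=1.0):
    """Eq. (7): y = (K/2) * erf(x / (sqrt(2)*sigma)).
    Small sigma approaches a hard limiter; large sigma is nearly linear."""
    return 0.5 * K * erf(x / (np.sqrt(2.0) * sigma))

def hard_limiter(x, K=1.0):
    """Eq. (10), the sigma -> 0 limit: output is +-K/2 by the sign of x."""
    return 0.5 * K * np.sign(x)
```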
A signal limiting operation can also be interpreted as an arcsin transformation in the autocorrelation domain. Assume that the autocorrelation functions of the input speech signal x and its signal-limited output y are denoted r_x(\tau) and r_y(\tau), respectively. Then the normalized autocorrelation function of the signal-limited output y can be formulated as (Lee and Lin, 1993)

\bar{r}_y(\tau) \equiv \frac{r_y(\tau)}{r_y(0)} = \frac{\sin^{-1}\!\left[ \bar{r}_x(\tau) / (1+\delta) \right]}{\sin^{-1}\!\left[ 1 / (1+\delta) \right]},   (11)

where \bar{r}_x(\tau) \equiv r_x(\tau) / r_x(0) is the normalized autocorrelation function of the input speech signal x. By properly adjusting the smoothing factor \delta, various degrees of smoothing can be obtained. When \delta approaches infinity, the normalized autocorrelation function \bar{r}_x(\tau) of the input speech signal is almost equal to the normalized autocorrelation function \bar{r}_y(\tau) of the corresponding signal-limited output. Furthermore, in the case of \delta = 0, the normalized autocorrelation function of the signal-limited output \bar{r}_y(\tau) reduces to the following equation (see Appendix B):

\bar{r}_y(\tau) = \frac{2}{\pi} \cdot \sin^{-1}\!\left[ \bar{r}_x(\tau) \right].   (12)
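Eq. (11) can be transcribed directly; setting delta = 0 reproduces the hard-limiter arcsine law of Eq. (12), since arcsin(r)/arcsin(1) = (2/π)·arcsin(r):

```python
import numpy as np

def arcsin_smooth(r_x, delta):
    """Eq. (11): normalized autocorrelation of the signal-limited output.
    r_x holds normalized autocorrelation lags (|r_x| <= 1); delta >= 0."""
    r_x = np.asarray(r_x, dtype=float)
    return np.arcsin(r_x / (1.0 + delta)) / np.arcsin(1.0 / (1.0 + delta))
```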
In the work of Lee and Lin (Lee and Lin, 1993), a hard limiter was used as a pre-processor to reduce the variability of feature vectors in noisy conditions; that is, a pre-determined smoothing factor was used throughout the speech signal. However, it is known that the segments of clean speech with less energy are influenced most by ambient noise and thus require heavy smoothing. As for the clean segments and the segments with high SNR, excessive smoothing not only destroys their distinct features but also reduces the discriminability of the speech features in a noisy environment. Therefore, we propose an adaptive signal limiter (ASL) in which the smoothing factor \delta is related to the SNR and adapted on a frame-by-frame basis. In the proposed adaptive signal limiter, the smoothing factor \delta is empirically formulated as

\delta(SNR) = \begin{cases} \delta_{min} & \text{if } SNR < SNR_{LB} \\ \dfrac{\delta_{max} - \delta_{min}}{SNR_{UB} - SNR_{LB}} \cdot (SNR - SNR_{LB}) + \delta_{min} & \text{if } SNR_{LB} \le SNR \le SNR_{UB} \\ \delta_{max} & \text{if } SNR > SNR_{UB} \end{cases}   (13)

and

SNR \equiv 10 \cdot \log_{10}\!\left( \frac{E_s}{E_n} \right),   (14)

where \delta_{min}, \delta_{max}, SNR_{LB} and SNR_{UB} are tuning constants, E_s is the frame energy of the clean speech signal and E_n is the noise energy. In the subsequent experiments, the arcsin transformation of Eqs. (11)-(14) is used to compute the normalized autocorrelation of the signal-limited signal, rather than directly applying the nonlinear operation of Eq. (7) to the input signal. This is because the underlying hidden Markov models are indirectly represented by LPC-based spectral features, and the LPC spectral features can be efficiently calculated from the autocorrelation function by means of Eq. (5). Moreover, compared with the signal limiting operation of Eq. (7), the arcsin transformation requires less computation.
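Eqs. (13)-(14) amount to a clipped linear ramp in the dB domain; a minimal sketch, using the white-noise settings of the experiments (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB) as illustrative defaults:

```python
def smoothing_factor(snr_db, d_min=0.0, d_max=1.0, snr_lb=20.0, snr_ub=30.0):
    """Eq. (13): frame-dependent smoothing factor.  A low-SNR frame gets
    d_min (heavy smoothing); a high-SNR frame gets d_max (light smoothing)."""
    if snr_db < snr_lb:
        return d_min
    if snr_db > snr_ub:
        return d_max
    return (d_max - d_min) * (snr_db - snr_lb) / (snr_ub - snr_lb) + d_min
```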
2.3 Adaptations of dynamic spectral feature and covariance matrix
When a signal limiting operation is performed on the autocorrelation function of a speech signal, it not only smooths the instantaneous spectral vectors but also reduces the corresponding dynamic spectral features and model covariance matrices. Therefore, in order to achieve higher consistency, adaptation of a model's dynamic spectral features and covariance matrices is necessary. This adaptation procedure proceeds as follows. When the t-th frame y_t of a testing utterance Y is evaluated on the state \Lambda_{w,i}, the cepstral vectors c_{t,j} of its context frames y_{t,j}, for -2 \le j \le 2, are first transformed to give the corresponding normalized autocorrelation vectors r_{t,j}. Then those normalized autocorrelation vectors r_{t,j} = [r_{t,j}(1), \ldots, r_{t,j}(p)]^T are processed by the following arcsin transformation:

\tilde{r}_{t,j}(\tau) = \frac{\sin^{-1}\!\left[ r_{t,j}(\tau) / (1 + \delta(SNR_{t,j})) \right]}{\sin^{-1}\!\left[ 1 / (1 + \delta(SNR_{t,j})) \right]}  for -2 \le j \le 2 and 1 \le \tau \le p.   (15)

In the above equation, the variable SNR_{t,j} is determined by

SNR_{t,j} = 10 \cdot \log_{10}\!\left( \frac{E_{t+j} - E_n}{E_n} \right),   (16)

where E_t is the t-th frame energy in the testing utterance Y, and E_n is the noise energy, which can be roughly estimated by selecting the lowest frame energy in the testing utterance Y, i.e., E_n = \min\{E_1, E_2, \ldots, E_{T_y}\}. Once the smoothed autocorrelation vectors \tilde{r}_{t,j}, for -2 \le j \le 2, are obtained, the smoothed testing cepstral vector \tilde{c}_{t,j} of \tilde{y}_{t,j} can be calculated by means of the LPC-to-cepstrum conversion formula. Moreover, the corresponding smoothed testing delta cepstral vector \tilde{d}_t can also be obtained from

\tilde{d}_t = \frac{\sum_{j=-2}^{2} j \cdot \tilde{c}_{t,j}}{\sum_{j=-2}^{2} j^2},   (17)

and thus the smoothed testing feature vector is taken as \tilde{y}_t = [\tilde{c}_t, \tilde{d}_t] = [\tilde{c}_{t,0}, \tilde{d}_t].
Similarly, in order to avoid introducing a mismatch between the testing speech signals and the reference models, the mean vector of state \Lambda_{w,i} should also be smoothed by using Eq. (11) with the same smoothing factor, giving its smoothed version \tilde{\mu}_{w,i} = [\tilde{c}_{w,i}, \tilde{d}_{w,i}]. On the other hand, by substituting \tilde{\mu}_{w,i} = [\tilde{c}_{w,i}, \tilde{d}_{w,i}] and \tilde{y}_t = [\tilde{c}_t, \tilde{d}_t] into Eq. (1), we obtain

\tilde{p}(\tilde{y}_t \mid (\tilde{\mu}_{w,i}, \Sigma_{w,i})) = (2\pi)^{-p} \, |\Sigma_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2} (\tilde{y}_t - \tilde{\mu}_{w,i})^T \, \Sigma_{w,i}^{-1} \, (\tilde{y}_t - \tilde{\mu}_{w,i}) \right].   (18)

By differentiating the logarithm of Eq. (18) with respect to \Sigma_{w,i} and setting the result to zero, we can obtain the optimal smoothed covariance matrix \tilde{\Sigma}_{w,i} which maximizes the likelihood function in Eq. (18), that is (see Appendix C),

\tilde{\Sigma}_{w,i} = \frac{1}{2p} \left\{ \sum_{m=1}^{p} \frac{\left[ \tilde{c}_t(m) - \tilde{c}_{w,i}(m) \right]^2}{\sigma_{w,i}^2(m)} + \sum_{m=1}^{p} \frac{\left[ \tilde{d}_t(m) - \tilde{d}_{w,i}(m) \right]^2}{\sigma_{w,i}^2(p+m)} \right\} \cdot \Sigma_{w,i}.   (19)

Finally, the resulting smoothed output likelihood measure can be rewritten as

\tilde{p}(\tilde{y}_t \mid \tilde{\Lambda}_{w,i}) = (2\pi)^{-p} \, |\tilde{\Sigma}_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2} (\tilde{y}_t - \tilde{\mu}_{w,i})^T \, \tilde{\Sigma}_{w,i}^{-1} \, (\tilde{y}_t - \tilde{\mu}_{w,i}) \right].   (20)
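Because Σ_{w,i} is diagonal, Eq. (19) is simply a data-dependent rescaling of the state covariance by the normalized quadratic form Q/(2p). A sketch of Eqs. (19)-(20) in the log domain, with the diagonal covariance stored as a variance vector of length 2p and ỹ_t = [c̃_t, d̃_t]:

```python
import numpy as np

def smoothed_covariance(y_t, mu, var):
    """Eq. (19): scale the diagonal covariance `var` (length 2p) by
    Q/(2p), where Q is the variance-normalized squared residual."""
    e = np.asarray(y_t) - np.asarray(mu)
    q = np.sum(e * e / var)
    return (q / len(var)) * np.asarray(var)

def smoothed_log_likelihood(y_t, mu, var):
    """Log of Eq. (20), evaluated with the ML-smoothed covariance."""
    var_s = smoothed_covariance(y_t, mu, var)
    e = np.asarray(y_t) - np.asarray(mu)
    return (-0.5 * len(var_s) * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(var_s))
            - 0.5 * np.sum(e * e / var_s))
```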
2.4 Implementation of a speech recognizer with adaptive signal limiter
The overall system diagram for implementing an HMM-based speech recognizer with adaptive signal limiter is depicted in Fig. 1. In the training phase, we first train a set of word models by using the segmental k-means algorithm and the Viterbi decoding method (Juang and Rabiner, 1990). The state statistics of each word model are indirectly represented by the normalized autocorrelation vectors of a five-frame context window. When a testing utterance Y is to be recognized, we first use Eqs. (15) and (16) to estimate the frame-dependent smoothing factor and perform the arcsin transformation on the normalized autocorrelation vectors r_{t,j}. Once the arcsin-transformed vectors \tilde{r}_{t,j} are obtained, we can solve for the smoothed cepstral vector \tilde{c}_{t,j} and its delta cepstral vector by the LPC-to-cepstrum conversion formula and Eq. (17). The same smoothing factor is also used to smooth the state statistics of the word models. Once the smoothed autocorrelation vectors \tilde{r}_{w,i,j} are obtained, the smoothed cepstral vectors \tilde{c}_{w,i,j} can likewise be calculated by means of the LPC-to-cepstrum conversion formula, and the corresponding smoothed delta cepstral vector \tilde{d}_{w,i} and covariance matrix \tilde{\Sigma}_{w,i} can be solved by using Eqs. (6) and (19). Finally, by substituting \tilde{y}_t, \tilde{\mu}_{w,i} and \tilde{\Sigma}_{w,i} into Eq. (20), we obtain the smoothed output likelihoods.
(Figure 1 is about here.)
3. Effectiveness and robustness of the adaptive signal limiter
3.1 Database and experimental conditions
A multispeaker (50 male and 50 female speakers) isolated Mandarin digit recognition task (Lee and Wang, 1994) was conducted to demonstrate the effectiveness and robustness of the proposed adaptive signal limiter. The digit database consists of three sessions of data collection; in each session, every speaker uttered a set of 10 Mandarin digits. Speech signals are sampled at 8 kHz. Each frame contains 256 samples with 128 samples overlapped, and is multiplied by a 256-point Hamming window. Endpoints are not detected, so each utterance still contains about 0.1~0.5 seconds of pre-silence and post-silence. Each digit is modeled as a left-to-right HMM without jumps, in which the output of each state is a 2-mixture Gaussian distribution of feature vectors. Each word model contains seven to nine states, including pre-silence and post-silence states. The feature vector is indirectly represented by the 12th-order normalized autocorrelation vectors of a five-frame context window. This representation can then be transformed into a 12th-order cepstral vector and a 12th-order delta cepstral vector. Moreover, the NOISEX-92 noise database (Varga et al., 1992b) was used for generating noisy speech. The subsequent experiments examine the following problems: (1) the influence of signal limiters on the LPC spectra of clean speech, (2) the influence of signal limiters on the LPC spectra of noisy speech, and (3) the effects of signal limiters on speech discriminability in a noisy environment.
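For reference, the front-end configuration just described (8 kHz sampling, 256-sample frames with 128-sample overlap, 256-point Hamming window) corresponds to a framing step along these lines (a sketch, not the authors' code):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a signal into overlapping frames and apply a Hamming window,
    matching the Section 3.1 setup (256 samples, 128-sample shift)."""
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([window * x[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])
```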
3.2 Influence of signal limiters on LPC spectra of clean speech
A sample utterance of Mandarin digit ‘1’ uttered by a male speaker is used to demonstrate the influence of signal limiters on the LPC spectra of clean speech. The 12th-order LPC spectrum analysis is performed on a 32 ms window with a 16 ms frame shift. To observe the spectral variation in the frequency domain, we plotted the LPC spectra of 15 consecutive frames extracted from the middle portion of the sample utterance. Figure 2 shows the log LPC spectra of the sample utterance ‘1’ without a signal limiter, with a hard limiter and with the adaptive signal limiter. From this figure, we can observe that the formants of utterance ‘1’ occur at about 200 Hz, 1950 Hz, 3100 Hz and 3350 Hz. After applying a signal limiter, parts of the original spectra become smoother and their formant peaks are broadened. In particular, with a hard limiter, the second, third and fourth formants are severely suppressed. Since the location and spacing of the formant frequencies are highly correlated with the shape of the vocal tract, this suppression will reduce the discriminability of speech utterances and lead to misrecognition. On the other hand, the spectral shape with the adaptive signal limiter is almost unaffected. This is mainly because, in clean conditions, the adaptive signal limiter employs a larger smoothing factor, which keeps the arcsin-transformed autocorrelation function almost unchanged.
(Figure 2 is about here.)
3.3 Influence of signal limiters on LPC spectra of noisy speech
In this subsection, we explore the influence of signal limiters on the LPC spectra of noisy speech. This is shown in Fig. 3 and Fig. 4, where we plot the LPC spectra of the same utterance shown in Fig. 2 with 20 dB additive white Gaussian noise and factory noise, respectively. When white noise is added to clean speech, an abnormal formant peak gradually appears in the LPC spectra of the distorted utterance ‘1’ at about 1125~1625 Hz, as shown in Fig. 3(a). This phenomenon also occurs when factory noise is added to clean speech; in that case, the abnormal formant peak occurs at about 1000~1375 Hz. However, compared with the baseline case, the spectral distortion in the LPC spectra with a signal limiter is less pronounced. This property verifies the robustness of signal limiters in a noisy environment. In addition, a comparison of Fig. 3 and Fig. 4 with Fig. 2 shows that excessively smoothing the autocorrelation function suppresses some of the formant peaks and loses important information about the shape of the vocal tract. Instead of using a fixed smoothing factor, an adaptive signal limiter that adjusts the degree of smoothing can not only effectively reduce the variability of speech features, but also preserve more of the useful spectral information embedded in a speech signal.
(Figure 3 and Figure 4 are about here.)
3.4 Effects of signal limiters on speech discriminability in a noisy environment
In this subsection, we evaluate the robustness of signal limiters in noisy conditions. First, the first two sessions of the database were used to train a set of word models by means of the segmental k-means algorithm. To generate noisy speech, white Gaussian noise and factory noise were separately added to the 100 utterances of Mandarin digit ‘1’ in the third session. Those distorted utterances were then evaluated on the 10 word models to obtain maximum log likelihoods. For each word model, the average log likelihood was found by averaging the accumulated log likelihoods corresponding to that word model. In Fig. 5 and Fig. 6, we plot the average log likelihoods of utterance ‘1’ as a function of SNR for white Gaussian noise and factory noise, respectively. When the underlying environment becomes noisy, i.e., below an SNR threshold, utterance ‘1’ is easily misrecognized as utterance ‘7’. For white noise, the SNR thresholds occur at about 20 dB, 15 dB and 7 dB for the cases without a signal limiter, with a hard limiter and with the adaptive signal limiter, respectively. Similarly, for factory noise, the SNR thresholds occur at about 15 dB, 10 dB and 3 dB, respectively. These experimental results reveal that, for recognition of utterance ‘1’ in noisy conditions, the adaptive signal limiter achieves an equivalent SNR gain of about 12~13 dB over the baseline and about 7~8 dB over the hard limiter.
(Figure 5 and Figure 6 are about here.)
4. Experimental results and discussion
In this section, a multispeaker (50 male and 50 female speakers) recognition task for isolated Mandarin digits (Lee and Wang, 1994) was conducted to demonstrate the merits of the proposed method. The experimental setup and underlying database have been described in subsection 3.1. In the experiments, a conventional hidden Markov model without any signal limiter is referred to as the baseline system. Ambient noises, including white Gaussian noise, F16 noise and factory noise, were separately added to clean speech at predetermined SNRs of 20, 15, 10, 5 and 0 dB to generate various noisy speech signals. Moreover, the parameters used in the proposed adaptive signal limiter under different noisy conditions were determined empirically as follows. First, the smoothing factor \delta was initially set to 0 and increased in increments of \Delta\delta = 0.1 while SNR_{LB} and SNR_{UB} were kept constant. It was observed that when the smoothing factor \delta exceeds 1, the smoothing operation has little effect on the digit recognition rates; this also holds for different settings of the parameters SNR_{LB} and SNR_{UB}. Therefore, the maximum value of the smoothing factor is well approximated by \delta_{max} = 1.0, which was employed throughout all experiments. Similarly, we chose the SNR lower bound from the interval 0~30 dB and the SNR upper bound from the interval 20~50 dB, in increments of 5 dB, to test which set of SNR parameters achieves the best digit recognition accuracy; the grid search is sketched below.
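The tuning procedure described above amounts to a coarse grid search; a hedged sketch, where `recognition_rate` is a hypothetical evaluation function standing in for a full train-and-test run:

```python
def tune_asl(recognition_rate, d_min=0.0, d_max=1.0):
    """Grid search over (SNR_LB, SNR_UB): lower bound 0..30 dB and upper
    bound 20..50 dB in 5 dB steps, with delta_min/delta_max fixed after
    the preliminary sweep of the smoothing factor."""
    best_rate, best_params = float("-inf"), None
    for snr_lb in range(0, 35, 5):
        for snr_ub in range(max(20, snr_lb + 5), 55, 5):
            rate = recognition_rate(d_min, d_max, snr_lb, snr_ub)
            if rate > best_rate:
                best_rate, best_params = rate, (snr_lb, snr_ub)
    return best_rate, best_params
```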
In Table 1, we assess the recognition accuracy of the baseline, parallel model combination (PMC), the baseline with hard limiter and the baseline with adaptive signal limiter for recognition of noisy speech under the influence of white noise. From the experimental results, we find that the baseline with hard limiter improves the recognition accuracy at low SNR but performs worse at high SNR and in clean conditions. This is mainly because oversmoothing the autocorrelation function severely distorts important spectral information embedded in the original speech signals. On the other hand, the improvement achieved by the proposed adaptive signal limiter is remarkable, thanks to the adaptive adjustment of the smoothing factor, and the adaptive signal limiter further outperforms the hard limiter. This means that using larger smoothing factors in clean conditions and at high SNR is as important as using smaller smoothing factors at low SNR.
(Table 1 is about here.)
Moreover, we also find that the PMC method is superior to the proposed adaptive signal limiter in recognition accuracy. This superiority is mainly because the PMC method decomposes the concurrent processes of speech and background noise, so that the environmental mismatch can be effectively reduced by optimally combining those two processes in the linear spectral domain. In contrast, the environmental mismatch is not compensated during the signal limiting operation. The proposed adaptive signal limiter can be considered a weighting function that neglects the speech segments with low SNR by heavily smoothing their features in the autocorrelation domain. This smoothing operation not only reduces feature variability in noisy conditions but also inevitably deteriorates some characteristics of the speech features. Therefore, it is intuitive that the PMC method achieves better recognition accuracy than the proposed method. However, these comparisons do not indicate that the proposed method is useless for noisy speech recognition. For segments with low SNR (e.g., distorted unvoiced segments), the adaptive signal limiter appears to be more effective than the PMC method in some noisy conditions. This implies that model adaptation is useful at high and medium SNRs, while feature smoothing is more feasible at low SNR. As described by Lee and Lin (Lee and Lin, 1993), a signal limiter can be combined with other noise-robust speech recognition techniques to obtain additional performance improvement. Therefore, it is expected that properly integrating the adaptive signal limiter with other noise-robust techniques, such as the WPM and PMC methods, could yield further improvement in recognition accuracy.
Likewise, comparisons of the different methods in the presence of factory noise and F16 noise are given in Table 2 and Table 3, respectively. We can observe that the proposed method consistently achieves remarkable improvement in recognition accuracy. This result verifies the effectiveness and robustness of the adaptive signal limiter for speech recognition in white noise as well as in colored noise. As far as computation time is concerned, the adaptive signal limiter requires less computation time than the PMC method; the reduction in CPU time is about 25%. Details of the CPU time for the different methods are shown in Table 4.
(Table 2, Table 3, and Table 4 are about here.)
5. Conclusion
In this paper, we explore the influence of a hard limiter on the LPC spectra of clean and noisy speech. It is found that excessive smoothing in the autocorrelation domain of a speech signal suppresses some of the formant peaks and reduces the discriminability of speech features in noisy conditions. To overcome this weakness of the hard limiter, an adaptive signal limiter is proposed to improve robustness. In our approach, the degree of smoothing of the signal limiter is related to the SNR value and adaptively determined on a frame-by-frame basis; that is, the smaller the SNR value of a speech frame, the smaller the smoothing factor of the signal limiter. Experimental results verify that the adaptive signal limiter outperforms the hard limiter at various SNRs. This improvement is mainly because the adaptive signal limiter not only reduces feature variability at low SNR, but also preserves important information carried in the speech segments with high SNR.
Acknowledgement
The authors would like to thank Dr. Lee-Min Lee of Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his experience in implementing the representation of hidden Markov models with a five-frame context window.
References
Carlson, B. A., Clement, M. A., 1994. A projection-based likelihood measure for speech recognition in noise. IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 97-102.
Chien, J. T., 1997a. Speech recognition under telephone environments. Ph.D. Thesis, Department of Electrical Engineering, National Tsing Hua University, Taiwan, R.O.C.
Chien, J. T., Lee, L. M., Wang, H. C., 1997b. Extended studies on projection-based likelihood measure for noisy speech recognition. In revision for IEEE Trans. on Speech and Audio Processing.
Flores, J. A. N., Young, S. J., 1992. Continuous speech recognition in noise using spectral subtraction and HMM adaptation. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco, Vol. 1, pp. 409-412.
Gales, M. J. F., Young, S. J., 1995. Robust speech recognition in additive and convolutional noise using parallel model combination. Computer Speech and Language, Vol. 4, pp. 352-359.
Juang, B. H., Rabiner, L. R., 1990. The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. on Acoustics, Speech, Signal Processing, 38(9): 1639-1641, September.
Lee, C. H., Lin, C. H., 1993. On the use of a family of signal limiters for recognition of noisy speech. Speech Communication, Vol. 12, pp. 383-392.
Lee, L. M., Wang, H. C., 1994. A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proc. of Int. Conf. Spoken Language Processing (ICSLP), Yokohama, Japan, pp. 1011-1014.
Lee, L. M., Wang, H. C., 1995. Representation of hidden Markov model for noise adaptive speech recognition. Electronics Letters, Vol. 31, No. 8, pp. 616-617.
Mansour, D., Juang, B. H., 1989. A family of distortion measures based upon projection operation for robust speech recognition. IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 37, pp. 1659-1671.
Rabiner, L., Juang, B. H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, New Jersey, pp. 112-117.
Sankar, A., Lee, C. H., 1996. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. on Speech and Audio Processing, Vol. 4, pp. 190-202.
Varga, A. P., Moore, R. K., 1992a. Hidden Markov model decomposition of speech and noise. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco, pp. 845-848.
Varga, A. P., Steeneken, H. J. M., Tomlinson, M., Jones, D., 1992b. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.
Table 1. Comparison of digit recognition rates (%) for white noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Methods \ SNRs     clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    80.2    65.7    48.8    25.6    10.6
PMC                98.7    92.2    84.6    72.7    59.3    47.1
hard limiter       90.6    76.8    68.5    55.8    35.9    21.4
adaptive limiter   95.2    85.1    76.4    68.1    58.5    49.7

Table 2. Comparison of digit recognition rates (%) for factory noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Methods \ SNRs     clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    91.2    81.4    65.9    46.9    25.4
PMC                98.7    95.0    91.8    82.3    73.2    52.5
hard limiter       90.6    86.3    80.2    71.3    57.5    30.0
adaptive limiter   94.9    91.9    87.8    77.7    69.2    53.3

Table 3. Comparison of digit recognition rates (%) for F16 noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)

Methods \ SNRs     clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    91.1    78.9    65.2    43.9    21.0
PMC                98.7    95.9    92.5    87.4    68.1    44.5
hard limiter       90.6    84.9    77.1    67.8    54.6    29.4
adaptive limiter   95.1    91.4    85.3    78.7    61.9    42.2

Table 4. Comparison of computation costs based on a Pentium II 266 MHz personal computer.

Methods                        baseline   PMC     hard limiter   adaptive limiter
CPU time for recognition (s)   0.203      4.038   0.291          2.981
Fig. 1. Block diagram for implementing a speech recognizer with adaptive signal limiter.
[Block diagram: in the training path, training utterances → segmental k-means → word models; the autocorrelation vectors of the models' context windows pass through the arcsin transform (model adaptation, smoothing factor δ), then autocorr.→LPC and LPC→cepstrum conversions, after which the smoothed delta cepstrum and covariance matrix are found. In the testing path, testing utterances → autocorrelation vectors of a context window → estimate smoothing factor δ → arcsin transform → autocorr.→LPC and LPC→cepstrum → find smoothed delta cepstrum → speech recognizer → recognition results.]

Fig. 2. The various LPC log magnitude spectra of utterance ‘1’ in clean condition.
(a) LPC log magnitude spectra without signal limiter. [Surface plot "baseline-clean": magnitude (dB) vs. frequency (Hz) and frame index.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot "hard limiter-clean": magnitude (dB) vs. frequency (Hz) and frame index.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot "adaptive limiter-clean": magnitude (dB) vs. frequency (Hz) and frame index.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 3. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise.
(a) LPC log magnitude spectra without signal limiter. [Surface plot "baseline-white20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot "hard limiter-white20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot "adaptive limiter-white20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 4. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise.
(a) LPC log magnitude spectra without signal limiter. [Surface plot "baseline-factory20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot "hard limiter-factory20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot "adaptive limiter-factory20dB": magnitude (dB) vs. frequency (Hz) and frame index.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Fig. 5. The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise.
(a) Comparison of average log likelihoods without signal limiter. [Plot "word '1' in white noise using baseline system": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(b) Comparison of average log likelihoods with hard limiter. [Plot "word '1' in white noise using hard limiter": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(c) Comparison of average log likelihoods with adaptive signal limiter. [Plot "word '1' in white noise using adaptive signal limiter": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 6. The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise.
(a) Comparison of average log likelihoods without signal limiter. [Plot "word '1' in factory noise using baseline system": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(b) Comparison of average log likelihoods with hard limiter. [Plot "word '1' in factory noise using hard limiter": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(c) Comparison of average log likelihoods with adaptive signal limiter. [Plot "word '1' in factory noise using adaptive signal limiter": log likelihoods vs. SNR (0 dB to clean) for models 0-9.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

129966864160453838[1]

  • 3. 3 LIST OF FIGURES AND TABLES Fig. 1 Block diagram for implementing a speech recognizer with adaptive signal limiter. Fig. 2 The various LPC log magnitude spectra of utterance ‘1’ in clean condition. (a) LPC log magnitude spectra without signal limiter. (b) LPC log magnitude spectra with hard limiter. (c) LPC log magnitude spectra with adaptive signal limiter. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 20 , SNR dBUB = 30 .) Fig. 3 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise. (a) LPC log magnitude spectra without signal limiter. (b) LPC log magnitude spectra with hard limiter. (c) LPC log magnitude spectra with adaptive signal limiter. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 20 , SNR dBUB = 30 .) Fig. 4 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise. (a) LPC log magnitude spectra without signal limiter. (b) LPC log magnitude spectra with hard limiter. (c) LPC log magnitude spectra with adaptive signal limiter. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 10 , SNR dBUB = 40 .) Fig. 5 The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise. (a) Comparison of average log likelihoods without signal limiter. (b) Comparison of average log likelihoods with hard limiter. (c) Comparison of average log likelihoods with adaptive signal limiter. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 20 , SNR dBUB = 30 .) Fig. 6 The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise. (a) Comparison of average log likelihoods without signal limiter. (b) Comparison of average log likelihoods with hard limiter. (c) Comparison of average log likelihoods with adaptive signal limiter. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 10 , SNR dBUB = 40 .) Table 1. Comparison of digit recognition rates (%) for white noise. (δmin .= 0 0 , δmax .= 1 0 , SNR dBLB = 20 , SNR dBUB = 30 .) Table 2. Comparison of digit recognition rates (%) for factory noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Table 3. Comparison of digit recognition rates (%) for F16 noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)
Table 4. Comparison of computation costs based on a Pentium II 266 MHz personal computer.
1. Introduction

When a speech recognition system trained in a well-defined environment is used in real-world applications, the acoustic mismatch between the training and testing environments severely degrades its recognition accuracy. This mismatch is caused mainly by a wide variety of distortion sources, such as ambient additive noise, channel effects and the speaker's Lombard effect. Over the past several decades, researchers have focused on this mismatch problem and tried to narrow the mismatch gap, and many algorithms have been proposed and successfully applied to robust speech recognition. Generally speaking, the methods for handling noisy speech recognition can be roughly classified into the following approaches (Sankar and Lee, 1996).

The first approach tries to minimize the distance measures between reference models and testing signals by adaptively adjusting the speech signals in the feature space. For example, Mansour and Juang (1989) found that the norm of a cepstral vector shrinks under noise contamination. They therefore used a first-order equalization method to adapt the cepstral means of the reference models so that the shrinkage of the speech features can be adequately compensated. Likewise, Carlson and Clement (1994) proposed a weighted projection measure (WPM) for the recognition of noisy speech in the framework of a continuous density hidden Markov model (CDHMM). In addition, the norm shrinkage of the cepstral means also leads to a reduction of the HMM covariance matrices. Thus, Chien et al. (Chien, 1997a; Chien et al., 1997b) proposed a variance-adapted and mean-compensated likelihood measure (VA-MCLM) to adapt the mean vector and covariance matrix simultaneously.

The second approach estimates a transformation function in the model space for transforming the reference models into the testing environment, so that the environmental mismatch gap can be effectively reduced. A number of techniques in the literature compensate for the ambient noise effect in the model space. Among them, one of the most promising is the so-called parallel model combination (PMC). In the PMC algorithm, Varga and Moore (1992a) adapted the statistics of the reference
models to meet the testing conditions by optimally combining the reference models and a noise model in the linear spectral domain. In the following years, several related works were reported that improve the performance of the PMC method. Flores and Young (1992) integrated the spectral subtraction (SS) and PMC methods to seek further improvement in recognition accuracy. In addition, Gales and Young (1995) extended the PMC scheme to include the effect of convolutional noise.

In the third approach, a more robust feature representation is developed in the signal space so that the speech features are invariant, or at least less susceptible, to environmental variations. Following this approach, Lee and Lin (1993) developed a family of signal limiters as a preprocessor to smooth speech signals. When a speech signal is passed through a signal limiter with zero smoothing factor (i.e., a hard limiter), the hard-limiting operation preserves the sign of the input speech signal and ignores its magnitude. Thus, the hard-limited speech signal is affected by ambient noise only when the signal-to-noise ratio (SNR) is relatively low. This smoothing process has been shown to be effective for reducing the variability of feature vectors in a noisy environment, making them less affected by ambient noise over a wide range of SNR values. Experimental results for recognition of a 39-word alpha-digit vocabulary also demonstrate that an equivalent gain of 5-7 dB in SNR can be achieved for a template-based DTW recognizer. However, from the experimental results reported by Lee and Lin (1993), we can also observe that the recognition accuracy with a hard limiter becomes worse for clean speech. This phenomenon may be explained as follows. For an utterance, the amplitudes of unvoiced segments are generally much lower than those of voiced segments. Heavy smoothing can reduce the feature variability of speech segments with low SNR, but it also causes the loss of some important information embedded in clean segments and in segments with high SNR. Therefore, a signal limiter with a fixed smoothing factor might not work well for all segments of a speech utterance.
We suggest that the smoothing factor of a signal limiter should be related to the SNR value and adapted on a frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and dynamic spectral features of hidden Markov models (HMMs) and testing speech signals. In addition, in order to properly reflect the variation of the model covariance caused by applying the signal limiting operation to the state statistics of the word models, the covariance matrix is also adapted in the sense of maximum likelihood (ML) estimation.

The layout of this paper is as follows. In the next section, we describe the detailed formulation of the proposed adaptive signal limiter and its extension to the framework of a continuous density hidden Markov model. In Section 3, we investigate the behavior of the LPC spectra of a speech utterance and its signal-limited version under the influence of various ambient noises, and a series of experiments is conducted to compare the discriminability of different signal limiters in various noisy conditions. Experiments on the recognition of multispeaker isolated Mandarin digits are reported in Section 4 to evaluate the effectiveness and robustness of the proposed method in the presence of ambient noise. Finally, a conclusion is drawn in Section 5.

2. Smoothing hidden Markov models by using an adaptive signal limiter

In this section, we describe the detailed formulation of the proposed adaptive signal limiter (ASL) and its extension to the framework of an HMM-based speech recognizer.

2.1 Representation of the underlying hidden Markov models

Conventionally, for a continuous density hidden Markov model (CDHMM), the output likelihood of the $t$-th frame of the testing utterance $Y = \{\,y_t = [c_t, d_t],\ 1 \le t \le T_y\,\}$, evaluated on the statistics of the $i$-th state of the word model $\Lambda(w) = \{\,\Lambda_{w,i} = (\mu_{w,i}, \Sigma_{w,i}),\ 1 \le i \le S_w\,\}$, can be
characterized by a multivariate Gaussian probability density function (pdf) and formulated as

$$p(y_t \mid \Lambda_{w,i}) = (2\pi)^{-p}\,\bigl|\Sigma_{w,i}\bigr|^{-\frac{1}{2}} \exp\Bigl[-\tfrac{1}{2}\,(y_t-\mu_{w,i})^{T}\,\Sigma_{w,i}^{-1}\,(y_t-\mu_{w,i})\Bigr], \qquad (1)$$

where $\mu_{w,i} = [c_{w,i}, d_{w,i}]$ denotes the mean vector of the $i$-th state of the word model $\Lambda(w)$ and consists of the $p$-order cepstral vector $c_{w,i}$ and the $p$-order delta cepstral vector $d_{w,i}$. $\Sigma_{w,i}$ denotes the covariance matrix of the $i$-th state of the word model $\Lambda(w)$ and is simplified to a diagonal matrix, i.e., $\Sigma_{w,i} = \mathrm{diag}\,[\sigma_{w,i}^{2}(1)\ \sigma_{w,i}^{2}(2)\ \cdots\ \sigma_{w,i}^{2}(2p)]$.

However, in order to adequately reflect the variation of the dynamic spectral features caused by applying a signal limiting operation to the instantaneous spectral features, the representation of the state statistics in a conventional hidden Markov model is modified slightly. In our approach, the mean vector $\mu_{w,i} = [c_{w,i}, d_{w,i}]$ of the $i$-th state of the word model $\Lambda(w)$ is indirectly represented by the normalized autocorrelation vectors of a five-frame context window (Lee and Wang, 1995), that is, $[\,r_{w,i,-2},\ r_{w,i,-1},\ r_{w,i,0},\ r_{w,i,1},\ r_{w,i,2}\,]$, where $r_{w,i,j} = [\,r_{w,i,j}(1), \cdots, r_{w,i,j}(p)\,]^{T}$, $j = 0$ denotes the instantaneous frame, $j = -1, -2$ the left context frames and $j = 1, 2$ the right context frames. The estimation of these normalized autocorrelation vectors in a five-frame context window proceeds as follows. First, a conventional hidden Markov model is trained for each word by means of the segmental k-means algorithm. Then, based on the obtained word models, each frame in the training utterances is labeled with its decoded state identity by using the Viterbi decoding algorithm. The instantaneous, left-context and right-context autocorrelation vectors corresponding to the same state identity are collected and averaged to obtain the indirect representation of the underlying hidden Markov models. For example, the normalized autocorrelation vectors of the $i$-th state of the word model $\Lambda(w)$ can be formulated as

$$[\,r_{w,i,-2},\ r_{w,i,-1},\ r_{w,i,0},\ r_{w,i,1},\ r_{w,i,2}\,] = \frac{1}{N_s}\sum_{u,t}\,[\,r_{w,t-2}^{u},\ r_{w,t-1}^{u},\ r_{w,t}^{u},\ r_{w,t+1}^{u},\ r_{w,t+2}^{u}\,], \qquad (2)$$

where $r_{w,t}^{u}$ represents the normalized autocorrelation vector of the $t$-th frame of the $u$-th training utterance of word $w$, and the summation runs over all $N_s$ frames labeled with the state identity $i$ of the word model $\Lambda(w)$.
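As a concrete illustration of the state-level averaging in Eq. (2), the following Python sketch groups the normalized autocorrelation vectors of five-frame context windows by their Viterbi state labels and averages them. It is a minimal sketch under our own assumptions about the data layout (array shapes, dropping frames whose context window would run past the utterance boundary, and every state receiving at least one frame); it is not the authors' implementation.

```python
import numpy as np

def state_autocorr_representation(frames, labels, n_states, p=12):
    """Average the five-frame context autocorrelation vectors over all
    training frames labeled with each state (Eq. 2).
    frames[u]: (T_u, p+1) array of normalized autocorrelations r(0..p)
    labels[u]: length-T_u array of Viterbi-decoded state identities."""
    sums = np.zeros((n_states, 5, p + 1))
    counts = np.zeros(n_states)
    for r, lab in zip(frames, labels):
        for t in range(2, len(lab) - 2):     # keep only full context windows
            i = lab[t]
            sums[i] += r[t - 2:t + 3]        # context frames j = -2, ..., 2
            counts[i] += 1
    return sums / counts[:, None, None]      # rows: [r_{w,i,-2}, ..., r_{w,i,2}]
```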
Based on this indirect representation, the analysis equations of the linear predictive coding (LPC) model can be expressed in matrix form as

$$R_{w,i,j} \cdot a_{w,i,j} = r_{w,i,j}, \qquad j = -2, \cdots, 2, \qquad (3)$$

where $R_{w,i,j}$ is an autocorrelation matrix of the form

$$R_{w,i,j} = \begin{bmatrix} r_{w,i,j}(0) & r_{w,i,j}(1) & \cdots & r_{w,i,j}(p-1) \\ r_{w,i,j}(1) & r_{w,i,j}(0) & \cdots & r_{w,i,j}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{w,i,j}(p-1) & r_{w,i,j}(p-2) & \cdots & r_{w,i,j}(0) \end{bmatrix}. \qquad (4)$$

Since the autocorrelation matrix is Toeplitz, symmetric and positive definite, the LPC coefficient vector $a_{w,i,j} = [\,a_{w,i,j}(1)\ a_{w,i,j}(2)\ \cdots\ a_{w,i,j}(p)\,]^{T}$ can be solved efficiently by the Levinson-Durbin recursion (Rabiner and Juang, 1993). Once the LPC coefficient vector of Eq. (3) is obtained, the corresponding cepstral vector $c_{w,i,j}$ can be calculated recursively by using the LPC-to-cepstral coefficient conversion formula

$$c_{w,i,j}(m) = a_{w,i,j}(m) + \sum_{k=1}^{m-1} \frac{k}{m}\, c_{w,i,j}(k)\, a_{w,i,j}(m-k), \qquad 1 \le m \le p. \qquad (5)$$

Finally, the cepstral vector of the instantaneous frame, i.e., $c_{w,i,j}$ for $j = 0$, is used as the mean cepstral vector $c_{w,i}$ of the $i$-th state of the word model $\Lambda(w)$. In addition, the corresponding delta cepstral vector $d_{w,i}$ can be calculated by using the following equation:

$$d_{w,i} = \frac{\sum_{j=-2}^{2} j \cdot c_{w,i,j}}{\sum_{j=-2}^{2} j^{2}}. \qquad (6)$$
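To make the chain from Eq. (3) to Eq. (6) concrete, the sketch below solves the Toeplitz system by the Levinson-Durbin recursion, applies the LPC-to-cepstrum conversion of Eq. (5), and forms the delta cepstrum of Eq. (6). It is a minimal illustration rather than the authors' code; in particular, the predictor sign convention and the absence of guards against a singular autocorrelation matrix are our own simplifications.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve R a = r (Eq. 3) for the p-order LPC coefficients,
    exploiting the Toeplitz structure of Eq. (4).  r: lags 0..p."""
    a = np.zeros(p + 1)
    e = r[0]                                            # prediction error
    for m in range(1, p + 1):
        k = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / e  # reflection coefficient
        a_prev = a.copy()
        a[m] = k
        a[1:m] = a_prev[1:m] - k * a_prev[m - 1:0:-1]
        e *= (1.0 - k * k)
    return a[1:]                                        # a(1), ..., a(p)

def lpc_to_cepstrum(a, p):
    """LPC-to-cepstrum conversion of Eq. (5):
    c(m) = a(m) + sum_{k=1}^{m-1} (k/m) c(k) a(m-k)."""
    c = np.zeros(p + 1)                                 # c[0] is unused
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - k - 1] for k in range(1, m))
    return c[1:]

def delta_cepstrum(c_context):
    """Delta cepstrum of Eq. (6) over a five-frame context window.
    c_context: cepstral vectors for j = -2, -1, 0, 1, 2."""
    js = np.arange(-2, 3)
    return sum(j * c for j, c in zip(js, c_context)) / np.sum(js ** 2)
```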
2.2 Formulation of the adaptive signal limiter

For recognition of noisy speech, it has been observed that employing a signal limiter to smooth a speech signal in the time domain leads to significant performance improvement. The basic theory of a signal limiter can be roughly described as follows (Lee and Lin, 1993). When a signal $x$ is passed through a signal limiter, the signal limiting operation is equivalent to performing a nonlinear transformation on the input signal, so that the corresponding output signal $y$ can be characterized by an error function of the form

$$y = s(x) = \frac{2\sqrt{2}\,K}{\sqrt{2\pi\sigma^{2}}} \int_{0}^{x} \exp\!\left(-\frac{t^{2}}{2\sigma^{2}}\right) dt, \qquad (7)$$

where $K$ is a scaling constant and $\sigma^{2}$ is a tunable factor that adjusts the degree of smoothing of the signal limiting operation. In light of this pronounced smoothing property, a signal limiter can readily be extended to the processing of speech signals in a noisy environment. Consider an input speech signal $x$, approximated by a zero-mean, stationary Gaussian process with variance $\sigma_x^{2}$ and density function

$$g(x) = \frac{1}{\sqrt{2\pi\sigma_x^{2}}} \exp\!\left(-\frac{x^{2}}{2\sigma_x^{2}}\right). \qquad (8)$$

Then the output $y$ of the signal limiter has the density function (see Appendix A)

$$h(y) = h(s(x)) = \frac{\sqrt{\delta}}{2\sqrt{2}\,K} \exp\!\left[-\frac{(\delta-1)\,x^{2}}{2\delta\,\sigma_x^{2}}\right], \qquad (9)$$

where $x = s^{-1}(y)$ and $\delta$ denotes the smoothing factor of the signal limiter, defined as $\delta = \sigma^{2}/\sigma_x^{2}$. The larger the value of $\delta$, the smaller the value of the output signal $y$. When the smoothing factor $\delta$ approaches 0, the corresponding signal limiter becomes a hard limiter of the form

$$y = f(x) = \begin{cases} \sqrt{2}\,K & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -\sqrt{2}\,K & \text{if } x < 0. \end{cases} \qquad (10)$$
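Since the integral in Eq. (7) is an error function, the limiter family can be written compactly with erf: under the form of Eq. (7) as reconstructed above, $s(x) = \sqrt{2}\,K\,\mathrm{erf}\!\bigl(x/(\sqrt{2}\sigma)\bigr)$, and the $\delta \to 0$ limit recovers the hard limiter of Eq. (10). The following sketch illustrates both; it is a minimal illustration under that identity, not a reference implementation.

```python
import numpy as np
from scipy.special import erf

def soft_limiter(x, sigma, K=1.0):
    """Signal limiter of Eq. (7) via the erf identity:
    (2*sqrt(2)*K / sqrt(2*pi*sigma^2)) * int_0^x exp(-t^2/(2*sigma^2)) dt
      = sqrt(2) * K * erf(x / (sqrt(2) * sigma))."""
    return np.sqrt(2.0) * K * erf(x / (np.sqrt(2.0) * sigma))

def hard_limiter(x, K=1.0):
    """Hard limiter of Eq. (10): only the sign of x survives (delta -> 0)."""
    return np.sqrt(2.0) * K * np.sign(x)
```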
A signal limiting operation can also be interpreted as an arcsin transformation in the autocorrelation domain. Assume that the autocorrelation functions of the input speech signal $x$ and its signal-limited output $y$ are denoted by $r_x(\tau)$ and $r_y(\tau)$, respectively. Then the normalized autocorrelation function of the signal-limited output $y$ can be formulated as (Lee and Lin, 1993)

$$\bar r_y(\tau) \equiv \frac{r_y(\tau)}{r_y(0)} = \frac{\sin^{-1}\!\bigl[\bar r_x(\tau)/(1+\delta)\bigr]}{\sin^{-1}\!\bigl[1/(1+\delta)\bigr]}, \qquad (11)$$

where $\bar r_x(\tau) \equiv r_x(\tau)/r_x(0)$ is the normalized autocorrelation function of the input speech signal $x$. By properly adjusting the smoothing factor $\delta$, various degrees of smoothing can be obtained. When $\delta$ approaches infinity, the normalized autocorrelation function of the input speech signal $\bar r_x(\tau)$ is almost equal to that of the corresponding signal-limited output $\bar r_y(\tau)$. Furthermore, in the case of $\delta = 0$, the normalized autocorrelation function of the signal-limited output $\bar r_y(\tau)$ reduces to the following equation (see Appendix B):

$$\bar r_y(\tau) = \frac{2}{\pi}\,\sin^{-1}\!\bigl[\bar r_x(\tau)\bigr]. \qquad (12)$$

In the work of Lee and Lin (1993), a hard limiter was used as a pre-processor to reduce the variability of feature vectors in noisy conditions; that is, a predetermined smoothing factor is used throughout a speech signal. However, it is known that the segments of clean speech with less energy are influenced most by ambient noise and thus require heavy smoothing, whereas for clean segments and segments with high SNR, excessive smoothing not only destroys their distinct features but also reduces the discriminability of speech features in a noisy environment. Therefore, we propose an adaptive signal limiter (ASL) in which the smoothing factor $\delta$ is related to the SNR and adapted on a frame-by-frame basis. In the proposed adaptive signal limiter, the smoothing factor $\delta$ is empirically formulated as

$$\delta(SNR) = \begin{cases} \delta_{\min} & \text{if } SNR < SNR_{LB} \\[4pt] \dfrac{\delta_{\max}-\delta_{\min}}{SNR_{UB}-SNR_{LB}}\,(SNR - SNR_{LB}) + \delta_{\min} & \text{if } SNR_{LB} \le SNR \le SNR_{UB} \\[4pt] \delta_{\max} & \text{if } SNR > SNR_{UB} \end{cases} \qquad (13)$$

and

$$SNR \equiv 10 \cdot \log_{10}\!\left(\frac{E_s}{E_n}\right), \qquad (14)$$

where $\delta_{\min}$, $\delta_{\max}$, $SNR_{LB}$ and $SNR_{UB}$ are tuning constants, $E_s$ is the frame energy of the clean speech signal and $E_n$ is the noise energy. In the subsequent experiments, the arcsin transformation shown in Eqs. (11)-(14) is used to compute the normalized autocorrelation of a signal-limited signal, rather than directly applying the nonlinear operation of Eq. (7) to the input signal. This is because the underlying hidden Markov models are indirectly represented by LPC-based spectral features, which can be efficiently calculated from the autocorrelation function by means of Eq. (5). Moreover, compared with the signal limiting operation of Eq. (7), the arcsin transformation requires less computation.
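A minimal Python sketch of the frame-adaptive smoothing of Eqs. (11) and (13) follows. The default tuning constants mirror those used for the white-noise experiments (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB); the function names are our own.

```python
import numpy as np

def smoothing_factor(snr_db, d_min=0.0, d_max=1.0, snr_lb=20.0, snr_ub=30.0):
    """Frame-dependent smoothing factor of Eq. (13): a linear ramp in SNR,
    clipped to [d_min, d_max]."""
    if snr_db < snr_lb:
        return d_min
    if snr_db > snr_ub:
        return d_max
    return (d_max - d_min) / (snr_ub - snr_lb) * (snr_db - snr_lb) + d_min

def arcsin_smooth(r_norm, delta):
    """Arcsin transformation of Eq. (11) on a normalized autocorrelation;
    delta = 0 gives the hard-limiter law (2/pi)*arcsin(r) of Eq. (12)."""
    return np.arcsin(r_norm / (1.0 + delta)) / np.arcsin(1.0 / (1.0 + delta))
```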
2.3 Adaptations of dynamic spectral features and covariance matrix

When a signal limiting operation is performed on the autocorrelation function of a speech signal, it not only smooths the instantaneous spectral vectors but also reduces the corresponding dynamic spectral features and model covariance matrices. Therefore, in order to achieve higher consistency, the model's dynamic spectral features and covariance matrices must be adapted as well. The adaptation procedure proceeds as follows. When the $t$-th frame $y_t$ of a testing utterance $Y$ is evaluated on the state $\Lambda_{w,i}$, the cepstral vectors $c_{t,j}$ of its context frames $y_{t,j}$, $-2 \le j \le 2$, are first transformed back into the corresponding normalized autocorrelation vectors $r_{t,j}$. Then these normalized autocorrelation vectors $r_{t,j} = [\,r_{t,j}(1), \cdots, r_{t,j}(p)\,]^{T}$ are processed by the following arcsin transformation:

$$\tilde r_{t,j}(\tau) = \frac{\sin^{-1}\!\left[\dfrac{r_{t,j}(\tau)}{1+\delta(SNR_{t,j})}\right]}{\sin^{-1}\!\left[\dfrac{1}{1+\delta(SNR_{t,j})}\right]}, \qquad -2 \le j \le 2,\ \ 1 \le \tau \le p. \qquad (15)$$

In the above equation, the variable $SNR_{t,j}$ is determined by

$$SNR_{t,j} = 10 \cdot \log_{10}\!\left(\frac{E_{t,j} - E_n}{E_n}\right), \qquad (16)$$

where $E_{t,j}$ is the energy of frame $y_{t,j}$ in the testing utterance $Y$, and $E_n$ is the noise energy, which can be roughly estimated as the lowest frame energy in the testing utterance $Y$, i.e., $E_n = \min\{E_1, E_2, \cdots, E_{T_y}\}$. Once the smoothed autocorrelation vectors $\tilde r_{t,j}$, $-2 \le j \le 2$, are obtained, the smoothed testing cepstral vectors $\tilde c_{t,j}$ can be calculated by means of the LPC-to-cepstrum conversion formula. The corresponding smoothed testing delta cepstral vector $\tilde d_t$ is then given by

$$\tilde d_t = \frac{\sum_{j=-2}^{2} j \cdot \tilde c_{t,j}}{\sum_{j=-2}^{2} j^{2}}, \qquad (17)$$

and the smoothed testing feature vector is taken as $\tilde y_t = [\tilde c_t, \tilde d_t] = [\tilde c_{t,0}, \tilde d_t]$.
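Pulling Eqs. (13)-(17) together, the following sketch smooths the five normalized context autocorrelation vectors of one test frame and rebuilds the smoothed cepstral and delta cepstral features, reusing the helper functions sketched earlier. It is illustrative only; the small epsilon guard against non-positive energies in Eq. (16) is our own addition.

```python
import numpy as np

def smooth_test_frame(r_context, energies, e_noise, p=12):
    """Smooth one test frame per Eqs. (15)-(17).
    r_context: five normalized autocorrelation vectors (lags 0..p, r[0] = 1)
               for context frames j = -2, ..., 2
    energies:  the corresponding five frame energies
    e_noise:   noise energy estimate (minimum frame energy of the utterance)."""
    ceps = []
    for r, e in zip(r_context, energies):
        snr = 10.0 * np.log10(max(e - e_noise, 1e-10) / e_noise)  # Eq. (16)
        delta = smoothing_factor(snr)                             # Eq. (13)
        r_s = r.copy()
        r_s[1:] = arcsin_smooth(r[1:], delta)                     # Eq. (15)
        a = levinson_durbin(r_s, p)                               # Eq. (3)
        ceps.append(lpc_to_cepstrum(a, p))                        # Eq. (5)
    return ceps[2], delta_cepstrum(ceps)   # smoothed c_{t,0} and d_t (Eq. 17)
```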
Similarly, in order to avoid introducing a mismatch between the testing speech signals and the reference models, the mean vector of state $\Lambda_{w,i}$ should also be smoothed by using Eq. (11) with the same smoothing factor, yielding its smoothed version $\tilde\mu_{w,i} = [\tilde c_{w,i}, \tilde d_{w,i}]$. On the other hand, by substituting $\tilde\mu_{w,i} = [\tilde c_{w,i}, \tilde d_{w,i}]$ and $\tilde y_t = [\tilde c_t, \tilde d_t]$ into Eq. (1), we obtain

$$\tilde p\bigl(\tilde y_t \mid (\tilde\mu_{w,i}, \Sigma_{w,i})\bigr) = (2\pi)^{-p}\,\bigl|\Sigma_{w,i}\bigr|^{-\frac{1}{2}} \exp\Bigl[-\tfrac{1}{2}\,(\tilde y_t-\tilde\mu_{w,i})^{T}\,\Sigma_{w,i}^{-1}\,(\tilde y_t-\tilde\mu_{w,i})\Bigr]. \qquad (18)$$

By taking the derivative of the logarithm of Eq. (18) with respect to $\Sigma_{w,i}$ and setting the result to zero, we obtain the optimal smoothed covariance matrix $\tilde\Sigma_{w,i}$ that maximizes the likelihood function in Eq. (18) (see Appendix C):

$$\tilde\Sigma_{w,i} = \frac{1}{2p}\left[\,\sum_{m=1}^{p}\frac{\bigl(\tilde c_t(m)-\tilde c_{w,i}(m)\bigr)^{2}}{\sigma_{w,i}^{2}(m)} + \sum_{m=1}^{p}\frac{\bigl(\tilde d_t(m)-\tilde d_{w,i}(m)\bigr)^{2}}{\sigma_{w,i}^{2}(m+p)}\,\right]\cdot\Sigma_{w,i}. \qquad (19)$$

Finally, the resulting smoothed output likelihood can be rewritten as

$$\tilde p(\tilde y_t \mid \tilde\Lambda_{w,i}) = (2\pi)^{-p}\,\bigl|\tilde\Sigma_{w,i}\bigr|^{-\frac{1}{2}} \exp\Bigl[-\tfrac{1}{2}\,(\tilde y_t-\tilde\mu_{w,i})^{T}\,\tilde\Sigma_{w,i}^{-1}\,(\tilde y_t-\tilde\mu_{w,i})\Bigr]. \qquad (20)$$
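The covariance rescaling of Eq. (19) amounts to multiplying the original diagonal variances by the average Mahalanobis distance between the smoothed frame and the smoothed state mean; Eq. (20) then scores the frame under the rescaled Gaussian. The following log-domain sketch makes this explicit (a minimal illustration; the variable names are ours):

```python
import numpy as np

def smoothed_state_loglike(y_s, mu_s, var):
    """Score a smoothed frame per Eqs. (19)-(20).
    y_s:  smoothed feature vector [cepstra, delta cepstra], length 2p
    mu_s: smoothed state mean, length 2p
    var:  original diagonal variances sigma^2(1..2p) of the state."""
    d2 = (y_s - mu_s) ** 2
    beta = np.mean(d2 / var)        # Eq. (19): (1/2p) * Mahalanobis distance
    var_s = beta * var              # smoothed diagonal covariance
    return -0.5 * (len(var) * np.log(2.0 * np.pi)   # Eq. (20) in log form
                   + np.sum(np.log(var_s))
                   + np.sum(d2 / var_s))
```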
2.4 Implementation of a speech recognizer with adaptive signal limiter

The overall system diagram for implementing an HMM-based speech recognizer with the adaptive signal limiter is depicted in Fig. 1. In the training phase, we first train a set of word models by using the segmental k-means algorithm and the Viterbi decoding method (Juang and Rabiner, 1990), and the state statistics of each word model are indirectly represented by the normalized autocorrelation vectors of a five-frame context window. When a testing utterance $Y$ is to be recognized, we first use Eq. (15) and Eq. (16) to estimate the frame-dependent smoothing factor and perform the arcsin transformation on the normalized autocorrelation vectors $r_{t,j}$. Once the arcsin-transformed vectors $\tilde r_{t,j}$ are obtained, we solve for the smoothed cepstral vector $\tilde c_{t,j}$ and its delta cepstral vector by the LPC-to-cepstrum conversion formula and Eq. (17). Moreover, the same smoothing factor is also used to smooth the state statistics of the word models. Once the smoothed autocorrelation vectors $\tilde r_{w,i,j}$ are obtained, the smoothed cepstral vectors $\tilde c_{w,i,j}$ can likewise be calculated by means of the LPC-to-cepstrum conversion formula, and the corresponding smoothed delta cepstral vector $\tilde d_{w,i}$ and covariance matrix $\tilde\Sigma_{w,i}$ can be solved by using Eq. (6) and Eq. (19). Finally, by substituting $\tilde y_t$, $\tilde\mu_{w,i}$ and $\tilde\Sigma_{w,i}$ into Eq. (20), we obtain the smoothed output likelihoods.

(Figure 1 is about here.)

3. Effectiveness and robustness of the adaptive signal limiter

3.1 Database and experimental conditions

A multispeaker (50 male and 50 female speakers) isolated Mandarin digit recognition task (Lee and Wang, 1994) was conducted to demonstrate the effectiveness and robustness of the proposed adaptive signal limiter. The digit database contains three sessions of data collection; in each session, every speaker uttered a set of 10 Mandarin digits. Speech signals are sampled at 8 kHz. Each frame contains 256 samples with 128 samples of overlap and is multiplied by a 256-point Hamming window. Endpoints are not detected, so each utterance still contains about 0.1-0.5 seconds of pre-silence and post-silence. Each digit is modeled as a left-to-right HMM without jumps, in which the output of each state is a 2-mixture Gaussian distribution of feature vectors. Each word model contains seven to nine states, including pre-silence and post-silence states. The feature vector is indirectly represented by the 12-order normalized autocorrelation vectors of a five-frame context window; this representation can then be transformed into a 12-order cepstral vector and a 12-order delta cepstral vector. Moreover, the NOISEX-92 noise database (Varga et al., 1992b) was used for generating noisy speech. The subsequent experiments examine the following questions: (1) the influence of signal limiters on the LPC spectra of clean speech, (2) the influence of signal limiters on the LPC spectra of noisy speech, and (3) the effects of signal limiters on speech discriminability in a noisy environment.
3.2 Influence of signal limiters on LPC spectra of clean speech

A sample utterance of Mandarin digit ‘1’ uttered by a male speaker is used to demonstrate the influence of signal limiters on the LPC spectra of clean speech. The 12-order LPC spectrum analysis is performed on a 32 ms window with a 16 ms frame shift. To observe the spectral variation in the frequency domain, we plotted the LPC spectra of 15 consecutive frames extracted from the middle portion of the sample utterance. Figure 2 shows the log LPC spectra of the sample utterance ‘1’ without a signal limiter, with the hard limiter and with the adaptive signal limiter. From this figure, we can observe that the formants of utterance ‘1’ occur at about 200 Hz, 1950 Hz, 3100 Hz and 3350 Hz. After applying a signal limiter, parts of the original spectra become smoother and their formant peaks are broadened. In particular, with the hard limiter, the second, third and fourth formants are severely suppressed. Since the locations and spacing of the formant frequencies are highly correlated with the shape of the vocal tract, this suppression reduces the discriminability of speech utterances and leads to misrecognition. On the other hand, the spectral shape with the adaptive signal limiter is almost unaffected. This is mainly because the adaptive signal limiter, by employing a larger smoothing factor in clean conditions, keeps the arcsin-transformed autocorrelation function almost unchanged.

(Figure 2 is about here.)

3.3 Influence of signal limiters on LPC spectra of noisy speech

In this subsection, we explore the influence of signal limiters on the LPC spectra of noisy speech. This is shown in Fig. 3 and Fig. 4, where we plot the LPC spectra of the same utterance shown in Fig. 2 with 20 dB additive white Gaussian noise and factory noise, respectively. When white noise is added to clean speech, an abnormal formant peak gradually appears in the LPC spectra of the distorted utterance ‘1’ at about 1125 Hz to 1625 Hz, as shown in Fig. 3(a). This phenomenon also occurs in the case of
adding factory noise to clean speech, where the abnormal formant peak occurs at about 1000 Hz to 1375 Hz. However, compared with the baseline case, the spectral distortion in the LPC spectra with a signal limiter is less pronounced. This property verifies the robustness of signal limiters in a noisy environment. In addition, a comparison of Fig. 3 and Fig. 4 with Fig. 2 shows that excessively smoothing the autocorrelation function suppresses parts of the formant peaks and loses some important information about the shape of the vocal tract. Instead of using a fixed smoothing factor, an adaptive signal limiter that adaptively adjusts the degree of smoothing can not only effectively reduce the variability of speech features, but also preserve more of the useful spectral information embedded in a speech signal.

(Figure 3 and Figure 4 are about here.)

3.4 Effects of signal limiters on speech discriminability in a noisy environment

In this subsection, we evaluate the robustness of signal limiters in noisy conditions. First, the first two sessions of the database were used to train a set of word models by using the segmental k-means algorithm. To generate noisy speech, white Gaussian noise and factory noise were separately added to the 100 utterances of Mandarin digit ‘1’ in the third session. The distorted utterances were then evaluated on the 10 word models to obtain the maximum log likelihoods. For each word model, the average log likelihood was obtained by averaging the accumulated log likelihoods corresponding to that word model. In Fig. 5 and Fig. 6, we plot the average log likelihoods of utterance ‘1’ as a function of SNR for white Gaussian noise and factory noise, respectively. When the underlying environment becomes noisy, i.e., below an SNR threshold, utterance ‘1’ is easily misrecognized as utterance ‘7’. For white noise, the SNR thresholds occur at about 20 dB, 15 dB and 7 dB for the cases without a signal limiter, with the hard limiter and with the adaptive signal limiter, respectively. Similarly, for factory noise, the SNR thresholds occur at about 15 dB, 10 dB and 3 dB for the cases without a signal
limiter, with the hard limiter and with the adaptive signal limiter, respectively. These experimental results reveal that, for recognition of utterance ‘1’ in noisy conditions, the adaptive signal limiter achieves an equivalent gain of about 12-13 dB in SNR over the baseline and about 7-8 dB over the hard limiter.

(Figure 5 and Figure 6 are about here.)

4. Experimental results and discussion

In this section, a multispeaker (50 males and 50 females) recognition task on isolated Mandarin digits (Lee and Wang, 1994) was conducted to demonstrate the merits of the proposed method. The experimental setup and underlying database are described in subsection 3.1. In our experiments, a conventional hidden Markov model without any signal limiter is referred to as the baseline system. The ambient noises, including white Gaussian noise, F16 noise and factory noise, were separately added to clean speech at predetermined SNRs of 20, 15, 10, 5 and 0 dB to generate various noisy speech signals. Moreover, the parameters of the proposed adaptive signal limiter under different noisy conditions were determined empirically as follows. First, the smoothing factor δ is initially set to 0 and increased in increments of Δδ = 0.1 while SNR_LB and SNR_UB are kept constant. It is observed that once the smoothing factor exceeds 1, the smoothing operation has little further effect on the digit recognition rates; this also holds for different sets of the parameters SNR_LB and SNR_UB. Therefore, the maximum smoothing factor is well approximated by setting δ_max = 1.0, which is employed throughout all experiments. Similarly, we chose the SNR lower bound from the interval 0-30 dB and the SNR upper bound from the interval 20-50 dB, in increments of 5 dB, to test which set of SNR parameters achieves better digit recognition accuracy. In Table 1, we assess the recognition accuracy of the baseline, parallel model combination (PMC),
the baseline with the hard limiter and the baseline with the adaptive signal limiter for recognition of noisy speech under the influence of white noise. From the experimental results, we find that the baseline with the hard limiter improves the recognition accuracy at low SNR but performs worse at high SNR and in clean conditions. This is mainly because oversmoothing the autocorrelation function severely distorts some important spectral information embedded in the original speech signals. On the other hand, the improvement of the proposed adaptive signal limiter is remarkable, thanks to the adaptive adjustment of the smoothing factor, and the adaptive signal limiter further outperforms the hard limiter. This means that using larger smoothing factors in clean conditions and at high SNR is as important as using smaller smoothing factors at low SNR.

(Table 1 is about here.)

Moreover, we also find that the PMC method is superior to the proposed adaptive signal limiter in recognition accuracy. This superiority arises mainly because the PMC method decomposes the concurrent processes of speech and background noise, so that the environmental mismatch can be effectively reduced by optimally combining those two processes in the linear spectral domain. In contrast, the environmental mismatch is not compensated during the signal limiting operation. The proposed adaptive signal limiter can be considered a weighting function that de-emphasizes speech segments with low SNR by heavily smoothing their features in the autocorrelation domain. This smoothing operation not only reduces feature variability in noisy conditions but also inevitably deteriorates parts of the characteristics of the speech features. Therefore, it is intuitive that the PMC method yields better recognition accuracy than the proposed method. However, these comparisons do not indicate that the proposed method is useless for noisy speech recognition. For segments with low SNR (e.g., distorted unvoiced segments), the adaptive signal limiter appears to be more effective than the PMC method in some noisy conditions. This implies that model adaptation is useful at high and medium SNRs, while feature smoothing is more feasible at low SNR. As described in the paper by C. H. Lee and
C. H. Lin (Lee and Lin, 1993), a signal limiter can be combined with other noise-robust speech recognition techniques to obtain additional performance improvements. Therefore, it is expected that properly integrating the adaptive signal limiter with other noise-robust speech recognition techniques, such as the WPM and PMC methods, could yield further improvements in recognition accuracy. Likewise, comparisons of the different methods in the presence of factory noise and F16 noise are given in Table 2 and Table 3, respectively. We can observe that the proposed method consistently achieves remarkable improvements in recognition accuracy. This result verifies the effectiveness and robustness of the adaptive signal limiter for speech recognition in white noise as well as in colored noise. As far as computation time is concerned, the adaptive signal limiter requires less computation than the PMC method; the reduction in CPU time is about 25%. Details of the CPU time for the different methods are shown in Table 4.

(Table 2, Table 3, and Table 4 are about here.)

5. Conclusion

In this paper, we explore the influence of a hard limiter on the LPC spectra of clean and noisy speech. It is found that excessive smoothing in the autocorrelation domain of a speech signal suppresses parts of the formant peaks and reduces the discriminability of speech features in noisy conditions. Based on this weakness of the hard limiter, an adaptive signal limiter is proposed to improve its robustness. In our approach, the smoothing degree of the signal limiter is related to the SNR value and adaptively determined on a frame-by-frame basis; that is, the smaller the SNR value of a speech frame, the smaller the smoothing factor of the signal limiter. Experimental results verify that the adaptive signal limiter outperforms the hard limiter at various SNRs. This improvement is mainly because the adaptive signal limiter not only reduces feature variability at low SNR, but also preserves important information borne in the speech segments with high SNR.
Acknowledgement
The authors would like to thank Dr. Lee-Min Lee of Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his experience in implementing the new representation of hidden Markov models with a five-frame context window.

References
Carlson, B. A., Clement, M. A., 1994. A projection-based likelihood measure for speech recognition in noise. IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 97-102.
Chien, J. T., 1997a. Speech recognition under telephone environments. Ph.D. Thesis, Department of Electrical Engineering, National Tsing Hua University, Taiwan, R.O.C.
Chien, J. T., Lee, L. M., Wang, H. C., 1997b. Extended studies on projection-based likelihood measure for noisy speech recognition. Revised for IEEE Trans. on Speech and Audio Processing.
Flores, J. A. N., Young, S. J., 1992. Continuous speech recognition in noise using spectral subtraction and HMM adaptation. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco, Vol. 1, pp. 409-412.
Gales, M. J. F., Young, S. J., 1995. Robust speech recognition in additive and convolutional noise using parallel model combination. Computer Speech and Language, Vol. 4, pp. 352-359.
Juang, B. H., Rabiner, L. R., 1990. The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. on Acoustics, Speech, and Signal Processing, 38(9), pp. 1639-1641.
Lee, C. H., Lin, C. H., 1993. On the use of a family of signal limiters for recognition of noisy speech. Speech Communication, Vol. 12, pp. 383-392.
Lee, L. M., Wang, H. C., 1994. A study on adaptation of cepstral and delta cepstral coefficients for
noisy speech recognition. Proc. Int. Conf. on Spoken Language Processing (ICSLP), Yokohama, Japan, pp. 1011-1014.
Lee, L. M., Wang, H. C., 1995. Representation of hidden Markov model for noise adaptive speech recognition. Electronics Letters, Vol. 31, No. 8, pp. 616-617.
Mansour, D., Juang, B. H., 1989. A family of distortion measures based upon projection operation for robust speech recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 37, pp. 1659-1671.
Rabiner, L., Juang, B. H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, New Jersey, pp. 112-117.
Sankar, A., Lee, C. H., 1996. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. on Speech and Audio Processing, Vol. 4, pp. 190-202.
Varga, A. P., Moore, R. K., 1992a. Hidden Markov model decomposition of speech and noise. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco, pp. 845-848.
Varga, A. P., Steeneken, H. J. M., Tomlinson, M., Jones, D., 1992b. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.
Table 1. Comparison of digit recognition rates (%) for white noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Method           | clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB
-----------------|-------|-------|-------|-------|------|-----
baseline         |  98.9 |  80.2 |  65.7 |  48.8 | 25.6 | 10.6
PMC              |  98.7 |  92.2 |  84.6 |  72.7 | 59.3 | 47.1
hard limiter     |  90.6 |  76.8 |  68.5 |  55.8 | 35.9 | 21.4
adaptive limiter |  95.2 |  85.1 |  76.4 |  68.1 | 58.5 | 49.7

Table 2. Comparison of digit recognition rates (%) for factory noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Method           | clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB
-----------------|-------|-------|-------|-------|------|-----
baseline         |  98.9 |  91.2 |  81.4 |  65.9 | 46.9 | 25.4
PMC              |  98.7 |  95.0 |  91.8 |  82.3 | 73.2 | 52.5
hard limiter     |  90.6 |  86.3 |  80.2 |  71.3 | 57.5 | 30.0
adaptive limiter |  94.9 |  91.9 |  87.8 |  77.7 | 69.2 | 53.3

Table 3. Comparison of digit recognition rates (%) for F16 noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)

Method           | clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB
-----------------|-------|-------|-------|-------|------|-----
baseline         |  98.9 |  91.1 |  78.9 |  65.2 | 43.9 | 21.0
PMC              |  98.7 |  95.9 |  92.5 |  87.4 | 68.1 | 44.5
hard limiter     |  90.6 |  84.9 |  77.1 |  67.8 | 54.6 | 29.4
adaptive limiter |  95.1 |  91.4 |  85.3 |  78.7 | 61.9 | 42.2

Table 4. Comparison of computation costs based on a Pentium II 266 MHz personal computer.

Method                     | baseline | PMC   | hard limiter | adaptive limiter
---------------------------|----------|-------|--------------|-----------------
Recognition CPU time (sec) |    0.203 | 4.038 |        0.291 |            2.981
Fig. 1. Block diagram for implementing a speech recognizer with adaptive signal limiter.
[Diagram: training utterances → autocorrelation vectors of context windows → segmental k-means → word models; the word-model statistics pass through the arcsin transform (model adaptation, using the smoothing factor δ), autocorr.→LPC and LPC→cepstrum conversion, and computation of the smoothed delta cepstrum and covariance matrix. Testing utterances → autocorrelation vectors of a context window → estimate smoothing factor δ → arcsin transform → autocorr.→LPC → LPC→cepstrum → smoothed delta cepstrum. Both paths feed the speech recognizer, which outputs the recognition results.]

Fig. 2. The various LPC log magnitude spectra of utterance ‘1’ in clean condition.
(a) LPC log magnitude spectra without signal limiter. [Surface plot of magnitude (dB) vs. frequency (Hz) and frame index: baseline-clean.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot: hard limiter-clean.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot: adaptive limiter-clean.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 3. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise.
(a) LPC log magnitude spectra without signal limiter. [Surface plot: baseline-white20dB.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot: hard limiter-white20dB.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot: adaptive limiter-white20dB.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 4. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise.
(a) LPC log magnitude spectra without signal limiter. [Surface plot: baseline-factory20dB.]
(b) LPC log magnitude spectra with hard limiter. [Surface plot: hard limiter-factory20dB.]
(c) LPC log magnitude spectra with adaptive signal limiter. [Surface plot: adaptive limiter-factory20dB.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Fig. 5. The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise.
(a) Comparison of average log likelihoods without signal limiter. [Line plot of log likelihoods vs. SNR (0 dB through clean) for models 0-9.]
(b) Comparison of average log likelihoods with hard limiter. [Line plot for models 0-9.]
(c) Comparison of average log likelihoods with adaptive signal limiter. [Line plot for models 0-9.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 6. The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise.
(a) Comparison of average log likelihoods without signal limiter. [Line plot of log likelihoods vs. SNR (0 dB through clean) for models 0-9.]
(b) Comparison of average log likelihoods with hard limiter. [Line plot for models 0-9.]
(c) Comparison of average log likelihoods with adaptive signal limiter. [Line plot for models 0-9.]
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)