A Novel Parallel Model Method for Noise
Speech Recognition
ZHANG Mingxin^{1,2}, CHEN Guoping^{1,2}, NI Hong^1, ZHANG Dongbin^1
1 (Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China)
2 (Graduate School of the Chinese Academy of Sciences, Beijing 100039, China)
Abstract ─ In noise robust speech recognition, the parallel model combination (PMC)
method is suitable for non-stationary environmental noise, and theoretically the
performance of the combined model can approach that of a model matched to the noisy
environment, so PMC is an important and active research direction in noise robust speech
recognition. In this paper, a new feature, MFCC_FWD_BWD, based on forward and backward
difference dynamic parameters, is presented to make PMC simpler and more direct. On this
basis, a novel parallel sub-state hidden Markov model (PSSHMM) is also presented for PMC,
whose topology differs from that of the standard hidden Markov model (HMM): in PSSHMM,
each state has parallel sub-states with transitions among them. In experiments, PSSHMM
with the MFCC_FWD_BWD feature achieves good results under every noise type and SNR;
its robustness is excellent especially for non-stationary noise.
Key words ─ Parallel model, Speech recognition, Noise robustness, PMC
1 Introduction
The recognition rate of LVCSR (large vocabulary continuous speech recognition) systems has
reached a fairly high level in laboratory environments. However, when such a system works in a
noisy environment, its performance degrades seriously. This degradation has greatly hindered the
application of LVCSR. Therefore, robust speech recognition in noisy environments is becoming
increasingly important.
Current research on noise robust speech recognition mainly focuses on three aspects.
First, robust feature representations are used, such as relative spectral processing (RASTA) [1],
perceptual linear prediction (PLP) and cepstral mean normalization (CMN). Second, some
approaches modify the testing speech features to make them better match the conditions of the
pre-trained recognition model; methods based on spectral subtraction [2] and speech enhancement
belong to this aspect. Third, compensation is performed on the pre-trained model to match the
noisy background; such model-based compensation schemes include parallel model combination
(PMC) [3][4], maximum likelihood linear regression (MLLR), etc. Because PMC is suitable for
non-stationary noise, and the performance of the combined model, without retraining, can
approximate that of a matched model trained on noisy speech from the corresponding
environment, it has attracted great attention. PMC is the subject of this paper.
In this paper, the basic PMC method is introduced first. Then the new feature, named
MFCC_FWD_BWD, whose dynamic parameters are based on forward and backward differences,
is described. Next, the PSSHMM is presented as the model for PMC noise robust speech
recognition, and the model parameter combination algorithm is explained. Finally, the
evaluation and conclusion are given.
2 Basic PMC Method
When an LVCSR system works in an additive noise environment, the matched model is the
model retrained with noisy speech, either sampled in that environment or obtained by adding the
noise to pure speech in the time-domain waveform. Such a model gives the best performance
under that noise environment. However, retraining the matched model online for an arbitrary
environment is impractical because of its great computation cost. Fortunately, the PMC method
needs no retraining. Its premise is that the pure speech model contains enough information about
the speech features and the noise model contains enough information about the noise features, so
the speech and noise models can be combined to match the noisy background [4].
In this paper, the speech model is the standard HMM. The noise model is a
single-Gaussian-component, full-state-transition model obtained by clustering noise feature
vectors. Unlike the HMM speech model, it has no starting or ending state.
In order to describe the effects of the noise on the clean speech, a series of assumptions is
required, as shown in the following [3]:
1) speech and noise are independent;
2) speech and noise are additive in the time domain;
3) a model with a single Gaussian component or multiple Gaussian components contains
sufficient information to represent the distribution of the observation feature vector in the
cepstral or log-spectral domain;
4) the frame/state alignment used to generate the speech models from the clean speech data is
not altered by the addition of noise.
Under the above assumptions, speech and noise are treated as additive in the power spectral
domain [5]. As the feature used in recognition is usually in the cepstral domain, the model
parameters of speech and noise must be transformed to the spectral domain; after model
combination, the combined model is transformed back to the cepstral domain. The procedure is
shown in Fig.1.
Fig.1 Parallel model combination procedure: cepstral-domain parameters are transformed to the
linear spectral domain by \mathbf{C}^{-1} and \exp\{\cdot\}, combined by PMC, then transformed
back by \log\{\cdot\} and \mathbf{C}
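As a concrete illustration of the transform chain in Fig.1, the following minimal numpy sketch applies the chain to one pair of static cepstral mean vectors. It assumes a square DCT matrix (number of cepstra equal to the number of filter-bank channels) so that C is invertible; the function names are ours, not from the paper.

```python
import numpy as np

def dct_matrix(n):
    """Square DCT-II matrix C (illustrative: #cepstra equals #channels)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.sqrt(2.0 / n) * np.cos(np.pi * i * (j + 0.5) / n)

def pmc_combine_static(mu_speech_cep, mu_noise_cep, C):
    """Fig.1 chain for one static mean vector: cepstral --C^-1--> log-spectral
    --exp--> linear spectral, add speech and noise, then --log--> --C--> back."""
    C_inv = np.linalg.inv(C)
    linear = np.exp(C_inv @ mu_speech_cep) + np.exp(C_inv @ mu_noise_cep)
    return C @ np.log(linear)
```

Combining a speech mean with itself shifts every log-spectral channel by log 2, which is a quick sanity check of the chain.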
3 Feature Vector Construction Method for PMC
PMC method requires that the feature vector used in recognition can regenerate the raw
parameter vectors that will be used for combination in power spectral domain [3]. Especially for
dynamic feature parameters in the feature vector, this requirement is much more important. For
this reason, we present a novel feature vector MFCC_FWD_BWD. Its static part is the same with
MFCC_D_A, while its dynamic part uses the forward and backward difference parameters to take
3. 3
place of the difference and accelerate parameters. The MFCC_FWD_BWD feature is constructed
as following:
\mathbf{O}_{NFVec}(\tau) = [\,\mathbf{O}^c(\tau)^T \;\; \Delta\mathbf{O}^c_{Fw}(\tau)^T \;\; \Delta\mathbf{O}^c_{Bw}(\tau)^T\,]^T    (1)

where \Delta\mathbf{O}^c_{Fw}(\tau) = \mathbf{O}^c(\tau + w_{FB}) - \mathbf{O}^c(\tau) is the forward difference part and
\Delta\mathbf{O}^c_{Bw}(\tau) = \mathbf{O}^c(\tau) - \mathbf{O}^c(\tau - w_{FB}) is the backward difference part. In matrix form,

\mathbf{O}^c_{NFVec}(\tau)
= \begin{bmatrix} \mathbf{O}^c(\tau) \\ \Delta\mathbf{O}^c_{Fw}(\tau) \\ \Delta\mathbf{O}^c_{Bw}(\tau) \end{bmatrix}
= \begin{bmatrix} \mathbf{0} & \mathbf{I} & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} & \mathbf{I} \\ -\mathbf{I} & \mathbf{I} & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{O}^c(\tau - w_{FB}) \\ \mathbf{O}^c(\tau) \\ \mathbf{O}^c(\tau + w_{FB}) \end{bmatrix}
= \mathbf{A}_N \, \mathbf{O}^c_{NTVec}(\tau).    (2)
However, MFCC_D_A is constructed as

\mathbf{O}^c_{FVec}(\tau)
= \begin{bmatrix} \mathbf{O}^c(\tau) \\ \Delta\mathbf{O}^c(\tau) \\ \Delta^2\mathbf{O}^c(\tau) \end{bmatrix}
= \begin{bmatrix} \mathbf{0} & \mathbf{0} & \mathbf{I} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} & \mathbf{0} & \mathbf{I} & \mathbf{0} \\ \mathbf{I} & \mathbf{0} & -2\mathbf{I} & \mathbf{0} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{O}^c(\tau - 2w) \\ \mathbf{O}^c(\tau - w) \\ \mathbf{O}^c(\tau) \\ \mathbf{O}^c(\tau + w) \\ \mathbf{O}^c(\tau + 2w) \end{bmatrix}
= \mathbf{A} \, \mathbf{O}^c_{TVec}(\tau).    (3)
Comparing formulas (2) and (3), we can see that the MFCC_FWD_BWD construction matrix
\mathbf{A}_N is square and invertible, so the static time series \mathbf{O}^c_{NTVec}(\tau) can be
recovered from the feature vector \mathbf{O}^c_{NFVec}(\tau). In contrast, the MFCC_D_A
construction matrix \mathbf{A} is not square and therefore not invertible, so
\mathbf{O}^c_{TVec}(\tau) cannot be recovered from \mathbf{O}^c_{FVec}(\tau). An invertible
construction matrix is necessary for PMC.
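The invertibility argument can be checked numerically. The sketch below builds the block matrices A_N of formula (2) and A of formula (3) with numpy; d = 13 static cepstra is an illustrative choice.

```python
import numpy as np

d = 13                                 # static cepstral dimension (illustrative)
I, Z = np.eye(d), np.zeros((d, d))

# A_N of formula (2): square (3d x 3d), hence invertible.
A_N = np.block([[Z,  I, Z],
                [Z, -I, I],
                [-I, I, Z]])

# A of formula (3): (3d x 5d), not square, hence not invertible.
A = np.block([[Z, Z,    I, Z, Z],
              [Z, -I,   Z, I, Z],
              [I, Z, -2*I, Z, I]])

A_N_inv = np.linalg.inv(A_N)           # succeeds: the static series is recoverable
```

The round trip A_N_inv (A_N x) = x is what makes step 1) of the combination algorithm in Section 5 possible.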
4 Parallel Sub-State HMM for PMC
4.1 Speech model and noise model used in PMC
In the system of this paper, the clean speech model is the commonly used standard HMM,
a finite state machine well suited to describing the speech generation procedure. The topology of
the HMM is shown in Fig.2. An HMM may be characterized by the following three important
parameters:
N_S, the number of states in the model;
\mathbf{T}_S, the state-transition probability matrix;
b_j(\mathbf{o}_t), j = 2, \ldots, N_S - 1, the output observation probability distribution.
In the model, both the starting state and the ending state are non-emitting states, used for
connecting HMM models. Here b_j(\mathbf{o}_t) is often described by single or multiple Gaussian
components, i.e.

b_j(\mathbf{o}_t) = \sum_{m=1}^{M_s} c_{jm} N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}), \quad M_s \ge 1

(for convenience of explanation, we let M_s = 1 in the following, so
b_j(\mathbf{o}_t) = N(\mathbf{o}_t; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)).
The noise model in this paper is defined as a full-state-transition model, whose topology is
shown in Fig.3. It is composed of several states, whose parameters are obtained by clustering
background noise features. The noise model can also be characterized by three important
parameters:
N_{Noi}, the number of states in the noise model;
\mathbf{T}_{Noi}, the full-state-transition probability matrix of the noise model;
b_k(\mathbf{o}_t), k = 1, \ldots, N_{Noi}, the output observation probability distribution,
where b_k(\mathbf{o}_t) is described by a single Gaussian component, i.e.
b_k(\mathbf{o}_t) = N(\mathbf{o}_t; \tilde{\boldsymbol{\mu}}_k, \tilde{\boldsymbol{\Sigma}}_k).
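As an illustration of how such noise states might be derived, the following sketch obtains one single-Gaussian state per cluster by k-means clustering of background noise feature vectors. The paper does not specify the clustering algorithm, so k-means is our assumption, and the function names are ours.

```python
import numpy as np

def cluster_noise_states(feats, n_states=3, n_iter=20, seed=0):
    """Cluster noise feature frames into n_states groups and return one
    diagonal-Gaussian (mean, variance) pair per group (k-means sketch)."""
    rng = np.random.default_rng(seed)
    means = feats[rng.choice(len(feats), n_states, replace=False)].copy()
    for _ in range(n_iter):
        # assign each frame to the nearest state mean
        labels = np.argmin(((feats[:, None] - means[None]) ** 2).sum(-1), axis=1)
        for k in range(n_states):
            pts = feats[labels == k]
            if len(pts):                  # keep the old mean if a cluster empties
                means[k] = pts.mean(0)
    labels = np.argmin(((feats[:, None] - means[None]) ** 2).sum(-1), axis=1)
    variances = np.stack([feats[labels == k].var(0) if np.any(labels == k)
                          else np.ones(feats.shape[1]) for k in range(n_states)])
    return means, variances
```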
Fig.2 HMM topology
Fig.3 Noise model topology
4.2 PSSHMM
PMC combines the clean speech model and noise model to achieve the matched model. In
this subsection, the presented PSSHMM used for the combined matched model is described in
detail. The topology of PSSHMM is shown in Fig.4. PSSHMM is a complex HMM model, while
each state has several parallel sub-states. These sub-states are generated by combining the
corresponding clean speech state and each noise state. In PSSHMM, there are two kinds of
transition. One is the transition of the global HMM, as shown in Fig.5; the other is the transition
among the sub-states, which obey the noise states transition matrix. Seen from the time
synchronous expanded states series, we can find that the sub-states are arranged parallel and at
each time point only one sub-state can emit an observation, as shown in Fig.6. It also can be seen
that the transition among the sub-states exists between the previous and posterior time
synchronous states.
Fig.4 PSSHMM topology
Fig.5 PSSHMM is a complex HMM
Fig.6 PSSHMM time-synchronous expanded state series
The PSSHMM can be described by the following five parameters:
N_{pm}, the number of states in the model;
\mathbf{T}_{pm}, the state-transition probability matrix;
D_{sub}, the number of sub-states in each model state;
\mathbf{T}_{sub}, the sub-state-transition probability matrix;
b_{jk}(\mathbf{o}_t), j = 2, \ldots, N_{pm} - 1, k = 1, \ldots, D_{sub}, the output observation
probability distribution of each sub-state.
Here b_{jk}(\mathbf{o}_t) is described by a Gaussian component, i.e.
b_{jk}(\mathbf{o}_t) = N(\mathbf{o}_t; \hat{\boldsymbol{\mu}}_{jk}, \hat{\boldsymbol{\Sigma}}_{jk}),
where \hat{\boldsymbol{\mu}}_{jk} and \hat{\boldsymbol{\Sigma}}_{jk} are obtained by the parallel
model combination algorithm discussed in the following section.
It should be emphasized that the output probability of a parallel model state is related to
those of its sub-states. The relation can be described by

b_j(\mathbf{o}_t) = \max_k \{\, b_{jk}(\mathbf{o}_t) \cdot P(k \mid k_{t-1}) \,\},

which directly affects recognition decoding, where k_{t-1} is the previous optimal sub-state label
and P(k \mid k_{t-1}) is the sub-state-transition probability, i.e.
P(k \mid k_{t-1}) = \mathbf{T}_{sub}[k_{t-1}, k].
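A minimal sketch of this decoding-time computation, done in the log domain for numerical stability and assuming diagonal-covariance Gaussians for the sub-state output distributions (function names are ours):

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def pss_state_logprob(o_t, sub_means, sub_vars, log_T_sub, k_prev):
    """log b_j(o_t) = max_k [ log b_jk(o_t) + log T_sub[k_prev, k] ];
    returns the best score and the winning sub-state label k."""
    scores = np.array([log_gauss(o_t, sub_means[k], sub_vars[k])
                       + log_T_sub[k_prev, k]
                       for k in range(len(sub_means))])
    k_best = int(np.argmax(scores))
    return scores[k_best], k_best
```

The winning label k_best is carried forward as k_{t-1} for the next frame, so the sub-state transition penalty chains through the time-synchronous series of Fig.6.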
5 Parallel Model Parameter Combination Algorithm
Using the MFCC_FWD_BWD feature, we combine the clean speech model and the noise model
to obtain the parallel model that matches the noisy environment. In this paper the log-add
algorithm [4] is used for model parameter combination. The log-add algorithm combines only the
means and does not combine the variances. It is assumed that a clean speech model state is
described by Gaussian components with means
[\boldsymbol{\mu}^c, \Delta\boldsymbol{\mu}^c_{Fw}, \Delta\boldsymbol{\mu}^c_{Bw}] and covariances
[\boldsymbol{\Sigma}^c, \Delta\boldsymbol{\Sigma}^c_{Fw}, \Delta\boldsymbol{\Sigma}^c_{Bw}],
and a noise model state by
[\tilde{\boldsymbol{\mu}}^c, \Delta\tilde{\boldsymbol{\mu}}^c_{Fw}, \Delta\tilde{\boldsymbol{\mu}}^c_{Bw}] and
[\tilde{\boldsymbol{\Sigma}}^c, \Delta\tilde{\boldsymbol{\Sigma}}^c_{Fw}, \Delta\tilde{\boldsymbol{\Sigma}}^c_{Bw}].
The parameter combination steps are:
1) Transform the clean speech model mean parameters from MFCC_FWD_BWD to static time
series parameters in the cepstral domain, i.e.

[\boldsymbol{\mu}^{cT}_{\tau-w_{FB}} \; \boldsymbol{\mu}^{cT}_{\tau} \; \boldsymbol{\mu}^{cT}_{\tau+w_{FB}}]^T = \mathbf{A}_N^{-1} [\boldsymbol{\mu}^{cT} \; \Delta\boldsymbol{\mu}^{cT}_{Fw} \; \Delta\boldsymbol{\mu}^{cT}_{Bw}]^T    (4)

The same holds for the noise model:

[\tilde{\boldsymbol{\mu}}^{cT}_{\tau-w_{FB}} \; \tilde{\boldsymbol{\mu}}^{cT}_{\tau} \; \tilde{\boldsymbol{\mu}}^{cT}_{\tau+w_{FB}}]^T = \mathbf{A}_N^{-1} [\tilde{\boldsymbol{\mu}}^{cT} \; \Delta\tilde{\boldsymbol{\mu}}^{cT}_{Fw} \; \Delta\tilde{\boldsymbol{\mu}}^{cT}_{Bw}]^T    (5)

2) Using the IDCT, transform the static time series parameters from the cepstral domain to the
log-spectral domain, i.e.

[\boldsymbol{\mu}^{lT}_{\tau-w_{FB}} \; \boldsymbol{\mu}^{lT}_{\tau} \; \boldsymbol{\mu}^{lT}_{\tau+w_{FB}}]^T = [(\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau-w_{FB}})^T \; (\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau})^T \; (\mathbf{C}^{-1}\boldsymbol{\mu}^{c}_{\tau+w_{FB}})^T]^T    (6)

[\tilde{\boldsymbol{\mu}}^{lT}_{\tau-w_{FB}} \; \tilde{\boldsymbol{\mu}}^{lT}_{\tau} \; \tilde{\boldsymbol{\mu}}^{lT}_{\tau+w_{FB}}]^T = [(\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau-w_{FB}})^T \; (\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau})^T \; (\mathbf{C}^{-1}\tilde{\boldsymbol{\mu}}^{c}_{\tau+w_{FB}})^T]^T    (7)

3) Combine the parameters of the clean speech model and the noise model using the log-add
algorithm, i.e.

\hat{\boldsymbol{\mu}}^{l}_{\tau} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau}\}\}    (8)

\hat{\boldsymbol{\mu}}^{l}_{\tau+w_{FB}} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau+w_{FB}}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau+w_{FB}}\}\}    (9)

\hat{\boldsymbol{\mu}}^{l}_{\tau-w_{FB}} = \log\{\exp\{\boldsymbol{\mu}^{l}_{\tau-w_{FB}}\} + \exp\{\tilde{\boldsymbol{\mu}}^{l}_{\tau-w_{FB}}\}\}    (10)

4) Using the DCT, transform the combined model parameters from the log-spectral domain back
to the cepstral domain, i.e.

[\hat{\boldsymbol{\mu}}^{cT}_{\tau-w_{FB}} \; \hat{\boldsymbol{\mu}}^{cT}_{\tau} \; \hat{\boldsymbol{\mu}}^{cT}_{\tau+w_{FB}}]^T = [(\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau-w_{FB}})^T \; (\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau})^T \; (\mathbf{C}\hat{\boldsymbol{\mu}}^{l}_{\tau+w_{FB}})^T]^T    (11)

5) Transform the static time series combined model parameters back to MFCC_FWD_BWD, i.e.

[\hat{\boldsymbol{\mu}}^{cT} \; \Delta\hat{\boldsymbol{\mu}}^{cT}_{Fw} \; \Delta\hat{\boldsymbol{\mu}}^{cT}_{Bw}]^T = \mathbf{A}_N [\hat{\boldsymbol{\mu}}^{cT}_{\tau-w_{FB}} \; \hat{\boldsymbol{\mu}}^{cT}_{\tau} \; \hat{\boldsymbol{\mu}}^{cT}_{\tau+w_{FB}}]^T    (12)

Thus, the sub-state output observation probability components of the combined parallel model
states, b_{jk}(\mathbf{o}_t), j = 2, \ldots, N_{pm} - 1, k = 1, \ldots, D_{sub}, can be calculated by
combining each clean speech model state b_j(\mathbf{o}_t), j = 1, \ldots, N_S, with each noise
model state b_k(\mathbf{o}_t), k = 1, \ldots, N_{Noi}, using the above log-add algorithm.
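The five steps can be sketched end-to-end in numpy for the mean vector of one state pair. The square DCT matrix and d = 13 static cepstra are illustrative assumptions, and the function names are ours:

```python
import numpy as np

def dct_matrix(n):
    """Square DCT-II matrix C (illustrative: #cepstra equals #channels)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.sqrt(2.0 / n) * np.cos(np.pi * i * (j + 0.5) / n)

def construction_matrix(d):
    """A_N of formula (2)."""
    I, Z = np.eye(d), np.zeros((d, d))
    return np.block([[Z, I, Z], [Z, -I, I], [-I, I, Z]])

def combine_log_add(mu_speech, mu_noise, d=13):
    """Combine one clean-speech state mean with one noise state mean; both are
    stacked MFCC_FWD_BWD vectors of length 3*d (means only, as in log-add)."""
    A_N, C = construction_matrix(d), dct_matrix(d)
    A_inv, C_inv = np.linalg.inv(A_N), np.linalg.inv(C)
    s = (A_inv @ mu_speech).reshape(3, d)   # 1) static series, eq (4)
    n = (A_inv @ mu_noise).reshape(3, d)    #    eq (5)
    comb = np.stack([C @ np.log(np.exp(C_inv @ s[i]) + np.exp(C_inv @ n[i]))
                     for i in range(3)])    # 2)-4), eqs (6)-(11)
    return A_N @ comb.reshape(-1)           # 5), eq (12)
```

Running this for every (speech state, noise state) pair yields the sub-state means of the PSSHMM.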
6 Evaluation
Our experiments are based on the HTK3.0 [6] speech recognition platform, which we improved
and modified for PMC. The acoustic models are context-dependent single-Gaussian-component
models. The clean speech database was the Mandarin 863 Speech Database. We selected 9515
sentences from 16 female speakers as the training set and 400 sentences from 4 speakers outside
the training set as the testing set. The noise database is NOISEX92, from which we selected four
representative noise types: babble, f16, machinegun and white. The four kinds of noise were
added to the clean speech at certain ratios to form noisy test speech at four SNRs: 30 dB, 20 dB,
10 dB and 0 dB. All feature vectors have 39 dimensions.
In the experiments, we first tested recognition performance on clean speech using the
MFCC_D_A and MFCC_FWD_BWD features. As shown in Table 1, the word recognition
accuracies of the two features are almost the same, with a difference of less than 0.5% absolute.
Thus the new feature MFCC_FWD_BWD simplifies the parameter combination procedure with
only a slight decrease in recognition rate.
Table 1. Accuracy comparison of clean speech (MFCC_D_A vs MFCC_FWD_BWD)
Feature kind Acc (%)
MFCC_D_A 75.08
MFCC_FWD_BWD 74.65
In the noise robust experiments, the baseline system uses the MFCC_D_A feature and the
standard HMM, while the PMC testing system uses the MFCC_FWD_BWD feature and the
PSSHMM. In PMC, the noise model has 3 states. For comparison with the new method, we also
tested the spectral subtraction method, which is widely used in noise robust speech recognition.
The evaluation results are given in Tables 2-4.
Table 2. Baseline system performance
Baseline system (MFCC_D_A & HMM)
Acc (%)   babble   f16     machinegun   white   Avg.
30 dB     66.52    69.10   68.46        55.77   64.96
20 dB     47.80    51.34   61.47        18.92   44.88
10 dB     5.70     10.41   48.34        5.34    17.45
0 dB      0.36     0.00    33.46        0.00    8.46
Avg.      30.10    32.71   52.93        20.01   33.94
Table 3. Spectral subtraction method performance
Spectral subtraction (MFCC_D_A & HMM)
Acc (%)   babble   f16     machinegun   white   Avg.
30 dB     65.67    69.10   66.70        65.83   66.83
20 dB     51.62    57.42   58.47        34.97   50.62
10 dB     13.69    24.44   37.31        5.46    20.23
0 dB      0.00     1.67    11.62        2.11    3.85
Avg.      32.75    38.16   43.53        27.09   35.38
Table 4. PMC method performance
PMC (MFCC_FWD_BWD & PSSHMM)
Acc (%)   babble   f16     machinegun   white   Avg.
30 dB     76.08    74.63   75.08        67.87   73.42
20 dB     69.65    63.78   73.67        50.05   64.29
10 dB     46.65    31.63   71.47        21.66   42.85
0 dB      12.46    5.80    65.63        5.68    22.39
Avg.      51.21    43.96   71.46        36.32   50.74
From Table 2, it can be seen that the recognition rate of the baseline system, without any robust
processing, decreases sharply as the SNR descends. In Table 3, with the spectral subtraction
feature processing method, the average recognition rate increases by 4.2% relative to the baseline
system. Table 4 shows the performance of the PMC method using the MFCC_FWD_BWD
feature and the PSSHMM. It is clear that this method achieves excellent noise robustness: its
recognition rate is far higher than those of the baseline system and the spectral subtraction
method, with relative increases of 49.5% and 43.4% respectively.
Comparing the recognition rates on machinegun noisy speech in Tables 2-4, we find that the
spectral subtraction method fails to achieve noise robustness; on the contrary, its recognition rate
decreases by 17.8% relative to the baseline system. The PMC method, however, shows excellent
robustness, achieving a relative increase of 35.0% over the baseline system.
7 Conclusion
In this paper, the PMC method using the presented MFCC_FWD_BWD feature and the
PSSHMM achieves excellent noise robust performance under every noise type and every SNR
level. Its recognition rate improves by 49.5% relative to the baseline system and by 43.4%
relative to the spectral subtraction method. Especially for machinegun noise, where spectral
subtraction brings no improvement, PMC stands out with a 35.0% relative increase over the
baseline system. Our planned future work is further research on the model parameter combination
algorithm to improve recognition performance.
References
[1] B. E. D. Kingsbury, N. Morgan, Recognizing Reverberant Speech with RASTA-PLP, ICASSP-97, pp.
1259-1262, Munich, Germany, 1997.
[2] Randy Gomez, Akinobu Lee, Hiroshi Saruwatari, et al., Robust Speech Recognition with Spectral Subtraction
in Low SNR, ICSLP-04, pp. 2077-2080, Jeju Island, Korea, 2004.
[3] Mark J. F. Gales, Steve Young, Robust Continuous Speech Recognition Using Parallel Model Combination,
IEEE Trans. Speech and Audio Processing, vol. 4, pp. 352-359, 1996.
[4] Jeih-weih Hung, Jia-lin Shen, Lin-shan Lee, New Approaches for Domain Transformation and Parameter
Combination for Improved Accuracy in Parallel Model Combination (PMC) Techniques, IEEE Trans. Speech
and Audio Processing, vol. 9, pp. 842-855, 2001.
[5] Febe de Wet, Johan de Veth, Lou Boves, et al., Additive Background Noise as a Source of Non-linear
Mismatch in the Cepstral and Log-energy Domain, Computer Speech and Language, vol. 19, pp. 31-54, 2005.
[6] Steve Young, Dan Kershaw, Julian Odell, et al., The HTK Book (for HTK v3.0), Cambridge University, 2000.