
On the Use of Weighted Filter Bank Analysis for
the Derivation of Robust MFCCs
Wei-Wen Hung
(Member, IEEE)
Department of Electrical Engineering
Ming Chi Institute of Technology
84 Gungjuan Road, Taishan, Taipei, Taiwan, 24306, Republic of China
E-mail :wwhung@ccsun.mit.edu.tw
FAX : 886-02-2906-1780; Tel. : 886-02-2906-0379
and
Hsiao-Chuan Wang
(Senior Member, IEEE)
(Associate Editor of IEEE Transactions on Speech and Audio Processing)
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, 30043, Taiwan, Republic of China
E-mail : hcwang@ee.nthu.edu.tw
FAX : 886-03-571-5971; Tel. : 886-03-574-2587
EDICS number : SPL.SA.1.6 Speech Recognition
Re : SPL-2145
Corresponding Author : Wei-Wen Hung

Abstract – In this paper, we discuss the use of weighted filter bank analysis (WFBA) to increase the
discriminating ability of mel frequency cepstral coefficients (MFCCs). The WFBA emphasizes the peak
structure of the log filter bank energies (LFBEs) obtained from filter bank analysis while attenuating the
components with lower energy in a simple, direct and effective way. Experimental results for recognition
of continuous Mandarin telephone speech indicate that the WFBA-based cepstral features are more
robust than those derived by the standard filter bank analysis and by some widely used cepstral
liftering and frequency filtering schemes, in both channel-distorted and noisy conditions.
Index Terms – Weighted filter bank analysis (WFBA), log filter bank energy (LFBE), mel frequency
cepstral coefficient (MFCC).
This research has been partially sponsored by the National Science Council, Taiwan, ROC, under
contract number NSC-89-2614-E-007-002.

LIST OF FIGURES AND TABLES
Fig. 1. Block diagram for the derivation of MFCCs based on the weighted filter bank analysis.
Fig. 2. F-ratio curves of mel frequency cepstral coefficients based on various schemes.
(A) For the 12-order cepstral coefficients.
(B) For the 12-order delta cepstral coefficients.
Fig. 3. Relationships between fuzzy factors and syllable recognition rates under different conditions.
Table I. COMPARISONS OF SYLLABLE RECOGNITION RATES FOR VARIOUS SCHEMES
UNDER DIFFERENT CONDITIONS.

I. INTRODUCTION
Filter bank analysis (FBA) is one of the most widely used spectral analysis techniques in speech
applications. It typically uses a bank of highly overlapping band-pass filters, which roughly
approximates the frequency response of the basilar membrane in the cochlea, to cover the frequency
range of interest in a speech signal. The outputs of these band-pass filters can essentially be treated
as a short-time spectral envelope. This measured spectral envelope is prone to statistical variation
caused by speaker characteristics, background noise, channel effects and limitations of the underlying
speech analysis model, which can make spectral comparisons unreliable. To suppress these undesired
variations and to obtain a more reliable distance measure, a cepstral liftering (CL) scheme [1] has been
developed to account for the
sensitivity of cepstral coefficients. In this regard, the applied weights $L(m)$ used in the liftering
process take advantage of the statistical characteristics of cepstral coefficients, and the resulting
liftered distance measure is given by
$$d\big(C^{(CL)}, \tilde{C}^{(CL)}\big) = \sum_{m=1}^{L} \big[c_m^{(CL)} - \tilde{c}_m^{(CL)}\big]^2 = \sum_{m=1}^{L} \big[L(m) \cdot c_m - L(m) \cdot \tilde{c}_m\big]^2, \tag{1}$$
where $C^{(CL)} = [c_1^{(CL)}, c_2^{(CL)}, \cdots, c_L^{(CL)}]$ and
$\tilde{C}^{(CL)} = [\tilde{c}_1^{(CL)}, \tilde{c}_2^{(CL)}, \cdots, \tilde{c}_L^{(CL)}]$ are two liftered cepstral
vectors. Various types of weighting functions including linear, sinusoidal, exponential, band-pass and
ramp lifters have been introduced in the literature.
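As an illustration of the liftered distance in (1), the following Python sketch computes it for two short cepstral vectors. The sinusoidal lifter $L(m) = 1 + (L/2) \sin(\pi m / L)$ is a commonly used form that is assumed here, since the expression itself is not given in this paper; the function names and the sample vectors are likewise made up for the example.

```python
import math

def sinusoidal_lifter(num_coeffs):
    """Assumed lifter weights L(m) = 1 + (L/2) * sin(pi * m / L), m = 1..L."""
    big_l = num_coeffs
    return [1.0 + (big_l / 2.0) * math.sin(math.pi * m / big_l)
            for m in range(1, big_l + 1)]

def liftered_distance(c, c_tilde, lifter):
    """Squared Euclidean distance between two liftered cepstral vectors, Eq. (1)."""
    if not (len(c) == len(c_tilde) == len(lifter)):
        raise ValueError("all inputs must have the same length")
    return sum((w * a - w * b) ** 2 for w, a, b in zip(lifter, c, c_tilde))

# Toy cepstral vectors (illustrative numbers, not taken from the experiments).
c = [1.0, -0.5, 0.25, 0.1]
c_tilde = [0.8, -0.4, 0.30, 0.0]
lifter = sinusoidal_lifter(len(c))
d = liftered_distance(c, c_tilde, lifter)
print(d)
```

Because the same weight multiplies both vectors, each term equals $L(m)^2 (c_m - \tilde{c}_m)^2$, so the lifter simply rescales the contribution of each quefrency index to the distance.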
Besides the cepstral liftering scheme, Battle et al. [2] proposed an alternative way to improve the
robustness of FBA-based speech features by filtering the frequency sequence of log filter bank energies
(LFBEs). This frequency filtering (FF) scheme not only approximately equalizes the variances of the
cepstral coefficients up to a certain quefrency index, but also decorrelates the log filter bank energies
to some extent. The filtering can be accomplished by passing the sequence of log filter bank energies
through a finite impulse response (FIR) filter of the form
$$H(z) = \sum_{i} h_i \cdot z^{-i}. \tag{2}$$
Although the aforementioned cepstral liftering and frequency filtering schemes have been widely used to
enhance the robustness of cepstral features, there is still a need to investigate new approaches that
achieve better performance. In the following, we introduce a weighted filter bank analysis (WFBA)
scheme that yields a set of discriminating cepstral features in a simple, direct and effective way while
maintaining a relatively low computational cost.
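As an illustration of the frequency filtering in (2), the following Python sketch applies an FIR filter along the filter-bank index of one frame of LFBEs. It uses the two-tap filter $H(z) = z - z^{-1}$ that Section III adopts for the FF baseline. The zero padding outside the band range, the function name and the sample values are assumptions made for this example and are not specified in the paper.

```python
def fir_over_frequency(lfbe, taps):
    """Filter a sequence s(0..Q-1) over the band index: y(n) = sum_i h_i * s(n - i).

    `taps` maps the delay i to the coefficient h_i of H(z) = sum_i h_i * z^{-i}.
    Samples outside 0..Q-1 are treated as zero (an assumed boundary rule).
    """
    q = len(lfbe)
    out = []
    for n in range(q):
        acc = 0.0
        for delay, h in taps.items():
            k = n - delay
            if 0 <= k < q:
                acc += h * lfbe[k]
        out.append(acc)
    return out

# H(z) = z - z^{-1}: h_{-1} = +1 (one-band advance), h_{+1} = -1 (one-band delay).
FF_TAPS = {-1: 1.0, 1: -1.0}

lfbe = [2.0, 3.5, 5.0, 4.0, 1.5]  # illustrative log filter bank energies
filtered = fir_over_frequency(lfbe, FF_TAPS)
print(filtered)  # [3.5, 3.0, 0.5, -3.5, -4.0]
```

With this filter each output is the difference between the two neighboring bands, $s(n+1) - s(n-1)$, which acts as a slope estimate of the log spectrum along frequency.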
II. WEIGHTED FILTER BANK ANALYSIS SCHEME
Assuming that $x(n)$ represents a frame of a speech signal that has been pre-emphasized and
Hamming-windowed, the derivation of conventional mel frequency cepstral coefficients (MFCCs)
proceeds as follows. First, the speech frame $x(n)$, where $1 \le n \le N$, is transformed from the time
domain into the frequency domain by applying an $N$-point short-time Fourier transform (STFT), and the
resulting power spectrum $|X(k)|^2$ can be formulated as
$$|X(k)|^2 = \left| \sum_{n=1}^{N} x(n) \cdot \exp\left(-j \cdot \frac{2 \cdot \pi \cdot k \cdot n}{N}\right) \right|^2, \tag{3}$$
where $1 \le k \le N$. Once the power spectrum $|X(k)|^2$ is obtained, we can calculate the filter bank
energy $e(i)$ passing through the $i$-th mel-scaled critical band-pass filter $\psi_i(k)$ by
$$e(i) = \sum_{k=1}^{N} |X(k)|^2 \cdot \psi_i(k), \tag{4}$$

where $1 \le i \le Q$ and $Q$ is the number of mel-scaled triangular band-pass filters. Finally, a discrete
cosine transform (DCT) is applied to the frequency sequence of log filter bank energies
$\{\log[e(i)],\ 1 \le i \le Q\}$. Thus, the mel frequency cepstral coefficients $c_m$ can be expressed as
$$c_m = \sum_{i=1}^{Q} \log[e(i)] \cdot \cos\left(m \cdot \frac{2 \cdot i - 1}{2 \cdot Q} \cdot \pi\right), \tag{5}$$
where $1 \le m \le L$ and $L$ is the desired number of cepstral features.
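The three steps in (3) to (5) can be sketched in plain Python as follows. This is a minimal illustration and not the front end used in the experiments: the mel-scale formula, the placement of the triangular filters, the small floor inside the logarithm, and all sizes (`N`, `Q`, `L_CEPS`, `FS`) are assumptions chosen for the example.

```python
import cmath
import math

def power_spectrum(frame):
    """|X(k)|^2 of Eq. (3) via a direct DFT (0-based indices; the shift only changes phase)."""
    n_pts = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_pts)
                    for n, x in enumerate(frame))) ** 2
            for k in range(n_pts)]

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)  # one common mel formula (assumed)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_pts, sample_rate):
    """Triangular filters psi_i(k), equally spaced on the mel scale (assumed design)."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # Q + 2 edge frequencies expressed as fractional DFT bin positions.
    edges = [mel_to_hz(mel_max * j / (num_filters + 1)) * n_pts / sample_rate
             for j in range(num_filters + 2)]
    bank = []
    for i in range(1, num_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = [0.0] * n_pts  # bins above N/2 keep zero weight
        for k in range(n_pts // 2 + 1):
            if lo < k <= mid:
                row[k] = (k - lo) / (mid - lo)
            elif mid < k < hi:
                row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def filter_bank_energies(spec, bank):
    """e(i) of Eq. (4): power spectrum weighted by each filter psi_i(k)."""
    return [sum(p * w for p, w in zip(spec, row)) for row in bank]

def dct_of_log_energies(log_e, num_ceps):
    """c_m of Eq. (5) for m = 1..L."""
    q = len(log_e)
    return [sum(v * math.cos(m * (2 * i - 1) * math.pi / (2 * q))
                for i, v in enumerate(log_e, start=1))
            for m in range(1, num_ceps + 1)]

# Illustrative sizes: 64-point frame, 8 filters, 6 cepstra, 8 kHz sampling.
N, Q, L_CEPS, FS = 64, 8, 6, 8000
# A Hamming-windowed 1 kHz tone stands in for a pre-processed speech frame.
frame = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         * math.sin(2 * math.pi * 1000 * n / FS) for n in range(N)]
bank = mel_filter_bank(Q, N, FS)
energies = filter_bank_energies(power_spectrum(frame), bank)
# The floor guards against log(0); it is a safeguard added for this sketch.
log_e = [math.log(max(e, 1e-12)) for e in energies]
ceps = dct_of_log_energies(log_e, L_CEPS)
print(ceps)
```

Since the index $m$ starts at 1, a constant offset added to all log energies leaves every $c_m$ unchanged: the cosine terms in (5) sum to zero over $i$.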
From the above description, we can see that distortion of the speech signal causes considerable
variation in the spectrum, and hence in the cepstral features derived from it, which degrades
recognition performance. However, it is also known that more noise can be perceptually tolerated in
the spectral formant regions than in the spectral valleys. Our goal is therefore to emphasize the
high-energy parts of the log filter bank energies so that the cepstral features become less susceptible
to environmental interference. In our approach, shown in Fig. 1, the log filter bank energies are
multiplied by a set of weighting factors before the discrete cosine transform is performed, that is [3]
$$c_m^{(WFBA)} = \sum_{i=1}^{Q} w(i) \cdot \log[e(i)] \cdot \cos\left(m \cdot \frac{2 \cdot i - 1}{2 \cdot Q} \cdot \pi\right). \tag{6}$$
In this study, we investigate the effects of the following two types of weighting functions.
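Type 1.
$$w(i) = \frac{\beta_i}{\sum_{j=1}^{Q} \beta_j} \quad \text{and} \quad \beta_i = \sum_{r=1}^{Q} \left( \frac{\log[e(i)+1.0]}{\log[e(r)+1.0]} \right)^{\frac{1}{F-1}}. \tag{7}$$
Type 2.
$$w(i) = \frac{\log[e(i)+1.0]}{\sum_{j=1}^{Q} \log[e(j)+1.0]}. \tag{8}$$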
For the first type of weighting function, a fuzzy membership function is used to determine the weights. By
properly adjusting the fuzzy factor $F$, we can achieve various degrees of fuzziness for the WFBA
scheme. When the fuzzy factor $F$ tends to 1.0 and $e(i)$ is the maximum energy, the weights
approach $w(i) = 1.0$ and $w(j) = 0.0$ for $j \ne i$. On the other hand, as $F \to \infty$, all the
weights become equal to $1/Q$. In the second type, the weights are directly
proportional to the log energy of each critical band. It does not require a priori determination
of the fuzzy factor and therefore needs less computation. We will refer to the cepstral features calculated
by the WFBA scheme using the Type 1 and Type 2 weighting functions as “FWFBA” and “DWFBA”,
respectively.
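The two weighting functions and the weighted DCT in (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the filter bank energies are made-up numbers, the function names are hypothetical, the Type 1 code handles only $F > 1$, and energies are assumed strictly positive so that every logarithm is positive.

```python
import math

def fuzzy_weights(energies, fuzzy_factor):
    """Type 1 (FWFBA) weights, Eq. (7). Only F > 1 is handled in this sketch."""
    if fuzzy_factor <= 1.0:
        raise ValueError("this sketch only handles F > 1")
    a = [math.log(e + 1.0) for e in energies]
    p = 1.0 / (fuzzy_factor - 1.0)
    # beta_i sums the ratio log[e(i)+1] / log[e(r)+1], raised to 1/(F-1), over r.
    # For F very close to 1 the power can overflow; such values are not handled here.
    beta = [sum((a_i / a_r) ** p for a_r in a) for a_i in a]
    total = sum(beta)
    return [b / total for b in beta]

def direct_weights(energies):
    """Type 2 (DWFBA) weights, Eq. (8): proportional to log[e(i) + 1.0]."""
    a = [math.log(e + 1.0) for e in energies]
    total = sum(a)
    return [v / total for v in a]

def wfba_cepstra(energies, weights, num_ceps):
    """Weighted cepstra of Eq. (6) as printed; Fig. 1 draws log[e(i) + 1.0] instead."""
    q = len(energies)
    return [sum(w * math.log(e) * math.cos(m * (2 * i - 1) * math.pi / (2 * q))
                for i, (w, e) in enumerate(zip(weights, energies), start=1))
            for m in range(1, num_ceps + 1)]

# Illustrative filter bank energies with a single spectral peak in band 3.
energies = [120.0, 900.0, 4500.0, 800.0, 60.0, 15.0]

w_sharp = fuzzy_weights(energies, 1.05)    # F near 1: weight concentrates on the peak
w_flat = fuzzy_weights(energies, 1000.0)   # large F: weights approach 1/Q
w_direct = direct_weights(energies)

print([round(w, 3) for w in w_sharp])
print([round(w, 3) for w in w_flat])
print([round(w, 3) for w in w_direct])
print(wfba_cepstra(energies, w_direct, 4))
```

The first two printed weight vectors illustrate the limiting behavior described above: for $F$ close to 1 nearly all of the weight falls on the band with the largest energy, while for large $F$ the weights flatten toward $1/Q$.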
III. EXPERIMENTS AND DISCUSSIONS
The MAT (Mandarin Across Taiwan) speech database [3] was used to evaluate the presented
schemes. The database, provided by the Computational Linguistic Society of R.O.C., was collected over
the public telephone network, and each Mandarin word comprises 1 to 23 Mandarin syllables. From the
MAT database, we chose 8320 phonetically balanced Mandarin words (37784 syllables) spoken by 81
males and 79 females to train the right-context-dependent sub-syllable HMMs of 410 Mandarin syllables.
Each syllable model contains six to seven states, in which the output observation distribution is
characterized by a 4-mixture Gaussian density function with diagonal covariance matrices. In the testing
phase, the evaluated schemes were applied to a 500-utterance (4754 syllables) recognition task in which
the testing utterances, spoken by 15 males and 15 females, were selected from a different set of the MAT
database. The feature vector was composed of 12-order mel frequency cepstral coefficients and their
first-order time derivatives. To simulate various noisy conditions, the 500 testing utterances were
corrupted by additive white Gaussian noise (AWGN) at signal-to-noise ratios (SNRs) of 10 dB, 20
dB and 30 dB. In addition, a sinusoidal lifter [1] and an FIR filter of the form $H(z) = z - z^{-1}$ [2] were
used in the experiments for comparison and are abbreviated as SL and FF, respectively.
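Corrupting a signal with AWGN at a prescribed SNR can be sketched as follows. The paper does not state how the SNR was measured, so this example assumes the SNR is defined from the average power over the whole utterance; the function names, the seeded random generator and the synthetic tone are illustrative only.

```python
import math
import random

def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) is about snr_db."""
    noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

rng = random.Random(0)  # fixed seed so the example is repeatable
# A one-second synthetic tone stands in for a clean test utterance.
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy = add_awgn(clean, 10.0, rng)

# Check the SNR that was actually realized by this noise draw.
noise = [y - x for x, y in zip(clean, noisy)]
measured_snr_db = 10.0 * math.log10(mean_power(clean) / mean_power(noise))
print(measured_snr_db)
```

The realized SNR differs slightly from the target because the noise power of a finite random draw only approximates $\sigma^2$.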
To evaluate the discriminating abilities of the speech features obtained with the various schemes, we
treated each state of every syllable model as a separate speech class and used the F-ratio measure [4] to
test the class separability in the feature space. The F-ratio measure takes into account the variance of
the means and the mean of the variances among classes. It has been confirmed that good class
separability, indicated by a large F-ratio, gives high recognition accuracy. Fig. 2 shows the F-ratio curves
of the 12-order mel frequency cepstral coefficients and their first-order time derivatives derived by
applying the various schemes. From these curves, we can see that lower quefrency coefficients generally
have higher F-ratios and should therefore offer better class separation. In addition, the WFBA scheme
consistently achieves higher F-ratios than the other schemes across the cepstral coefficients.
In particular, the FWFBA is superior to the DWFBA, at the price of a higher computational cost.
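The F-ratio for a single feature dimension can be sketched as the variance of the class means divided by the mean of the within-class variances. The use of population variances, the function name and the toy data below are assumptions for illustration; the exact normalization in [4] may differ.

```python
from statistics import fmean, pvariance

def f_ratio(classes):
    """F-ratio of one feature: variance of class means / mean of class variances.

    `classes` is a list of lists, one inner list of scalar feature values per class.
    """
    class_means = [fmean(values) for values in classes]
    between = pvariance(class_means)                      # spread of the class means
    within = fmean(pvariance(values) for values in classes)  # average spread inside classes
    return between / within

# Illustrative data: three tight, well-separated classes versus three overlapping ones.
separated = [[0.9, 1.0, 1.1], [2.9, 3.0, 3.1], [4.9, 5.0, 5.1]]
overlapping = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [1.0, 2.0, 3.0]]

print(f_ratio(separated))
print(f_ratio(overlapping))
```

The first data set yields a far larger F-ratio than the second, in line with the interpretation that a larger F-ratio indicates better class separability for that coefficient.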
For the recognition of continuous Mandarin telephone speech, we evaluated these schemes in
terms of syllable recognition rate (S.R.R.). Two kinds of environmental conditions, channel
distortion and noise corruption, were investigated to see whether the WFBA scheme achieves better
syllable recognition rates than the other evaluated schemes. In
the channel-compensated condition, the widely used cepstral mean subtraction (CMS) [5] was employed
to cancel the embedded channel effect. Fig. 3 illustrates the relationship between the fuzzy
factor and the syllable recognition rate under different conditions. The syllable recognition
rate initially increases with the fuzzy factor $F$, attains a maximum and then decreases as the
fuzzy factor increases further. The optimal value of the fuzzy factor is related to the SNR: the lower
the SNR of the additive white Gaussian noise, the smaller the optimal value of the fuzzy factor.
Moreover, further improvement in syllable recognition rate can be obtained by
integrating the WFBA with the CMS. As shown in Table I, the WFBA technique outperforms the SL and
FF schemes and exhibits consistent improvements for the
channel-distorted, channel-compensated and various noisy conditions. As far as computational cost is
concerned, the complexity of the DWFBA is lower than that of the FWFBA. Finally,
it is worth noting that the optimal value of the fuzzy factor depends heavily on the SNR and is still
not easily derived. In this study, the optimal values of the fuzzy factor under the various conditions were
determined in a time-consuming manner, by selecting some specific values and their neighbors and
comparing the corresponding syllable recognition rates.
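Cepstral mean subtraction can be sketched as removing the per-utterance mean of each cepstral dimension from every frame. The sketch below assumes the mean is taken over all frames of one utterance; the function name, the toy frames and the offset vector are illustrative and not taken from the experiments.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the utterance-level mean of each cepstral dimension from every frame."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]

# Three toy frames of two cepstral coefficients each.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
# A fixed channel is modeled as a constant additive offset in the cepstral domain.
channel_offset = [0.5, -1.5]
distorted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

cms_clean = cepstral_mean_subtraction(clean)
cms_distorted = cepstral_mean_subtraction(distorted)
print(cms_clean)
print(cms_distorted)
```

Both outputs coincide, which illustrates why CMS cancels a time-invariant channel: a convolutional channel becomes an additive constant in the cepstral domain, and subtracting the mean removes that constant.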
IV. CONCLUSIONS
In this paper, a weighted filter bank analysis scheme that emphasizes the peak structure of the log filter
bank energies is proposed for the derivation of robust cepstral features. Two kinds of weighting functions
employed in the WFBA are investigated. The experiments show that, with a properly adjusted fuzzy
factor, the FWFBA enhances the discriminating ability of cepstral features more than the
conventional FBA scheme and the other two widely used schemes, i.e., cepstral liftering and frequency
filtering. As an alternative to the FWFBA, the DWFBA offers a simpler form for weighting the
LFBEs at a much lower computational cost while maintaining comparable recognition accuracy. In addition,
it is shown that the WFBA is effective for noisy speech recognition and can be combined with
environment compensation techniques, such as the CMS, to achieve higher recognition rates if necessary.

REFERENCES
[1] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of band-pass liftering in speech recognition,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 7, pp. 947-954, July 1987.
[2] E. Battle, C. Nadeu, and J. A. R. Fonollosa, “Feature decorrelation methods in speech recognition: A
comparative study,” Proceedings of International Conference on Spoken Language Processing, pp.
951-954, 1998.
[3] W. W. Hung and H. C. Wang, “A fuzzy approach for equalization of the cepstral variances,”
Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 3,
SP-P7, pp. 1611-1614, Istanbul, June 2000.
[4] S. Nicholson, B. Milner, and S. Cox, “Evaluating feature set performance using the F-ratio and
J-measures,” Proceedings of European Conference on Speech Communication and Technology,
vol. 1, pp. 413-416, Greece, September 1997.
[5] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust.,
Speech, Signal Processing, vol. ASSP-29, pp. 254-272, 1981.

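Fig. 1. Block diagram for the derivation of MFCCs based on the weighted filter bank analysis. Processing
chain: pre-emphasis and Hamming windowing of $x(n)$; |STFT| and squaring to obtain $|X(k)|^2$;
multiplication by the filters $\psi_1(k), \ldots, \psi_Q(k)$ to obtain $e(1), \ldots, e(Q)$; the logarithm
$\log[e(i)+1.0]$; multiplication by the weights $w(1), \ldots, w(Q)$; DCT to obtain $c_m^{(WFBA)}$.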

Fig. 2 (A). F-ratio curves for the 12-order cepstral coefficients. Horizontal axis: cepstral coefficients
C1 to C12; vertical axis: F-ratio measures (0.25 to 2.75). Curves: MFCC, FWFBA (F=1.9), DWFBA, FF, SL.
Fig. 2 (B). F-ratio curves for the 12-order delta cepstral coefficients. Horizontal axis: delta cepstral
coefficients C13 to C24; vertical axis: F-ratio measures (0.2 to 2.7). Curves: MFCC, FWFBA (F=1.9),
DWFBA, FF, SL.

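Fig. 3. Relationships between fuzzy factors and syllable recognition rates under different conditions.
Horizontal axis: fuzzy factors (less than 1.0, 1.001, 1.01, 1.1, 1.3, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.2, 10,
100, 1000); vertical axis: syllable recognition rates (%), 0 to 50. Curves: NO CMS + NO AWGN,
NO CMS + 30 dB AWGN, NO CMS + 20 dB AWGN, NO CMS + 10 dB AWGN, CMS + NO AWGN.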

Table I. COMPARISONS OF SYLLABLE RECOGNITION RATES (S.R.Rs, %) FOR VARIOUS SCHEMES
UNDER DIFFERENT CONDITIONS.

Schemes                      NO CMS      CMS         NO CMS       NO CMS       NO CMS
                             NO AWGN     NO AWGN     30 dB AWGN   20 dB AWGN   10 dB AWGN
MFCC                         40.79       44.90       36.16        24.41        8.01
Sinusoidal Lifter (SL)       40.32       45.01       37.42        26.34        9.78
Frequency Filtering (FF)     41.49       46.73       38.94        28.85        10.46
DWFBA                        42.95       48.81       39.98        30.25        11.32
FWFBA (F = 1000.0)           40.98       44.78       37.03        24.01        7.59
FWFBA (F = 2.0)              43.21       48.36       40.66        30.51        11.95
FWFBA (F = 1.001)            7.18        10.53       2.89         0.0          0.0
FWFBA (F < 1.0)              0.0         0.0         0.0          0.0          0.0
FWFBA (F = optimal value)    43.8        49.27       41.21        32.15        14.81
                             (F = 1.9)   (F = 2.2)   (F = 1.9)    (F = 1.8)    (F = 1.6)
