129966863931865940[1]


  1. On the Use of Weighted Filter Bank Analysis for the Derivation of Robust MFCCs

     Wei-Wen Hung (Member, IEEE), Department of Electrical Engineering, Ming Chi Institute of Technology, 84 Gungjuan Road, Taishan, Taipei, Taiwan, 24306, Republic of China. E-mail: wwhung@ccsun.mit.edu.tw; FAX: 886-02-2906-1780; Tel.: 886-02-2906-0379

     Hsiao-Chuan Wang (Senior Member, IEEE; Associate Editor of IEEE Transactions on Speech and Audio Processing), Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30043, Taiwan, Republic of China. E-mail: hcwang@ee.nthu.edu.tw; FAX: 886-03-571-5971; Tel.: 886-03-574-2587

     EDICS number: SPL.SA.1.6 Speech Recognition. Re: SPL-2145. Corresponding author: Wei-Wen Hung
  2. On the Use of Weighted Filter Bank Analysis for the Derivation of Robust MFCCs

     ∗ Wei-Wen Hung and # Hsiao-Chuan Wang
     ∗ Department of Electrical Engineering, Ming Chi Institute of Technology (Member, IEEE)
     # Department of Electrical Engineering, National Tsing Hua University (Senior Member, IEEE) (Associate Editor of IEEE Transactions on Speech and Audio Processing)

     Abstract – In this paper, we discuss the use of weighted filter bank analysis (WFBA) to increase the discriminating ability of mel frequency cepstral coefficients (MFCCs). The WFBA emphasizes the peak structure of the log filter bank energies (LFBEs) obtained from filter bank analysis while attenuating the components with lower energy in a simple, direct and effective way. Experimental results for recognition of continuous Mandarin telephone speech indicate that, in both channel-distorted and noisy conditions, the WFBA-based cepstral features are more robust than those derived with standard filter bank analysis and with some widely used cepstral liftering and frequency filtering schemes.

     Index Terms – Weighted filter bank analysis (WFBA), log filter bank energy (LFBE), mel frequency cepstral coefficient (MFCC).

     This research has been partially sponsored by the National Science Council, Taiwan, ROC, under contract number NSC-89-2614-E-007-002.
  3. LIST OF FIGURES AND TABLES

     Fig. 1. Block diagram for the derivation of MFCCs based on the weighted filter bank analysis.
     Fig. 2. F-ratio curves of mel frequency cepstral coefficients based on various schemes. (A) For the 12-order cepstral coefficients. (B) For the 12-order delta cepstral coefficients.
     Fig. 3. Relationships between fuzzy factors and syllable recognition rates under different conditions.
     Table I. COMPARISONS OF SYLLABLE RECOGNITION RATES FOR VARIOUS SCHEMES UNDER DIFFERENT CONDITIONS.
  4. I. INTRODUCTION

     Filter bank analysis (FBA) is one of the most extensively employed spectral analysis techniques and is used in many kinds of speech applications. This approach typically uses a bank of highly overlapped band-pass filters, which roughly approximates the frequency response of the basilar membrane in the cochlea, to cover the frequency range of interest in a speech signal. The outputs of those band-pass filters can essentially be treated as a short-time spectral envelope. This measured spectral envelope is prone to statistical variation due to speaker characteristics, background noise, channel effects, limitations of the underlying speech analysis model, etc., which may make spectral comparisons unreliable. To suppress those undesired variations and to obtain a more reliable distance measure, a cepstral liftering (CL) scheme [1] has been developed to account for the sensitivity of cepstral coefficients. The weights $L(m)$ used in the liftering process take advantage of the statistical characteristics of the cepstral coefficients, and the resulting liftered distance measure is given by

     $$ d\big(C^{(CL)}, \tilde{C}^{(CL)}\big) = \sum_{m=1}^{L} \big[ c_m^{(CL)} - \tilde{c}_m^{(CL)} \big]^2 = \sum_{m=1}^{L} \big[ L(m) \cdot c_m - L(m) \cdot \tilde{c}_m \big]^2, \tag{1} $$

     where $C^{(CL)} = [c_1^{(CL)}, c_2^{(CL)}, \ldots, c_L^{(CL)}]$ and $\tilde{C}^{(CL)} = [\tilde{c}_1^{(CL)}, \tilde{c}_2^{(CL)}, \ldots, \tilde{c}_L^{(CL)}]$ are two liftered cepstral vectors. Various types of weighting functions, including linear, sinusoidal, exponential, band-pass and ramp lifters, have been introduced in the literature. Besides the cepstral liftering scheme, Battle et al. [2] proposed an alternative that improves the robustness of FBA-based speech features by filtering the frequency sequence of log filter bank energies (LFBEs). The frequency filtering (FF) scheme not only approximately equalizes the variances of cepstral coefficients up to a certain quefrency index, but also decorrelates the log filter bank energies to some
  5. extent. This filtering process can be accomplished by passing the sequence of log filter bank energies through a finite impulse response (FIR) filter of the form

     $$ H(z) = \sum_{i} h_i \cdot z^{-i}. \tag{2} $$

     Although the aforementioned cepstral liftering and frequency filtering schemes have been widely used to enhance the robustness of cepstral features, there is still a need to investigate new approaches that achieve better performance. In the following, we introduce a new weighted filter bank analysis (WFBA) scheme, which yields a set of discriminating cepstral features in a simple, direct and effective way while maintaining a relatively low computation cost.

     II. WEIGHTED FILTER BANK ANALYSIS SCHEME

     Assuming that $x(n)$ represents a frame of a speech signal that has been pre-emphasized and Hamming-windowed, the derivation of conventional mel frequency cepstral coefficients (MFCCs) proceeds as follows. First, the speech frame $x(n)$, where $1 \le n \le N$, is transformed from the time domain into the frequency domain by applying an $N$-point short-time Fourier transform (STFT), and the resulting power spectrum $|X(k)|^2$ can be formulated as

     $$ |X(k)|^2 = \left| \sum_{n=1}^{N} x(n) \cdot \exp\!\left( -j \cdot \frac{2\pi \cdot n \cdot k}{N} \right) \right|^2, \tag{3} $$

     where $1 \le k \le N$. Once the power spectrum $|X(k)|^2$ is obtained, we can calculate the filter bank energy $e(i)$ passing through the $i$-th mel-scaled critical band-pass filter $\psi_i(k)$ by

     $$ e(i) = \sum_{k=1}^{N} |X(k)|^2 \cdot \psi_i(k), \tag{4} $$
  6. where $1 \le i \le Q$ and $Q$ is the number of mel-scaled triangular band-pass filters. Finally, a discrete cosine transform (DCT) is applied to the frequency sequence of log filter bank energies $\{ \log[e(i)],\ 1 \le i \le Q \}$. Thus, the mel frequency cepstral coefficients $c_m$ can be expressed as

     $$ c_m = \sum_{i=1}^{Q} \log[e(i)] \cdot \cos\!\left( m \cdot \frac{2i-1}{2} \cdot \frac{\pi}{Q} \right), \tag{5} $$

     where $1 \le m \le L$ and $L$ is the desired number of cepstral features. From the above description, we can see that a distorted speech signal causes considerable spectral variations, which result in performance degradation. However, it is also known that more noise can be perceptually tolerated in the spectral formant regions than in the spectral valleys. Therefore, our goal is to emphasize the high-energy parts of the log filter bank energies so that the cepstral features become less susceptible to environmental interference. In our approach, shown in Fig. 1, the log filter bank energies are multiplied by a set of weighting factors prior to performing the discrete cosine transform, that is [3],

     $$ c_m^{(WFBA)} = \sum_{i=1}^{Q} w(i) \cdot \log[e(i)] \cdot \cos\!\left( m \cdot \frac{2i-1}{2} \cdot \frac{\pi}{Q} \right). \tag{6} $$

     In this study, we investigate the effects of the following two types of weighting functions.

     Type 1.

     $$ w(i) = \frac{\beta_i}{\sum_{j=1}^{Q} \beta_j} \quad \text{and} \quad \beta_i = \sum_{r=1}^{Q} \left( \frac{\log[e(i) + 1.0]}{\log[e(r) + 1.0]} \right)^{\frac{1}{F-1}}. \tag{7} $$

     Type 2.

     $$ w(i) = \frac{\log[e(i) + 1.0]}{\sum_{j=1}^{Q} \log[e(j) + 1.0]}. \tag{8} $$

     For the first type of weighting function, a fuzzy membership function is used to determine the weights. By properly adjusting the fuzzy factor $F$, we can achieve various extents of fuzziness for the WFBA
  7. scheme. When the fuzzy factor $F$ tends to 1.0 and $e(i)$ is the maximum energy, the weights are distributed with $w(i) = 1.0$ and $w(j) = 0.0$ for $j \ne i$. On the other hand, as $F \to \infty$, all the weights become equal and are set to $1/Q$. In the second type, the weighting terms are directly proportional to the log energy of each critical band. In addition, it does not require a priori determination of the fuzzy factor and therefore needs less computation. We will refer to the cepstral features calculated by the WFBA scheme using the Type 1 and Type 2 weighting functions as "FWFBA" and "DWFBA", respectively.

     III. EXPERIMENTS AND DISCUSSIONS

     The MAT (Mandarin Across Taiwan) speech database [3] was used to evaluate the presented schemes. The database, provided by the Computational Linguistic Society of R.O.C., was collected over the public telephone network, and each Mandarin word comprises 1 to 23 Mandarin syllables. From the MAT database, we chose 8320 phonetically balanced Mandarin words (37784 syllables) spoken by 81 males and 79 females to train the right-context-dependent sub-syllable HMMs of 410 Mandarin syllables. Each syllable model contains six to seven states, in which the output observation distribution is characterized by a 4-mixture Gaussian density function with diagonal covariance matrix. In the testing phase, the evaluated schemes were applied to a 500-utterance (4754 syllables) recognition task in which the testing utterances, spoken by 15 males and 15 females, were selected from a different set of the MAT database. The feature vector was composed of 12-order mel frequency cepstral coefficients and their first-order time derivatives. To simulate various noisy conditions, the 500 testing utterances were corrupted by additive white Gaussian noise (AWGN) at signal-to-noise ratios (SNRs) of 10 dB, 20 dB and 30 dB. In addition, a sinusoidal lifter [1] and an FIR filter of the form $H(z) = z - z^{-1}$ [2] were
  8. used in the experiments for comparative purposes and are abbreviated as SL and FF, respectively. To evaluate the discriminating abilities of the speech features produced by the various schemes, we treated each state from all the syllable models as a separate speech class and used the F-ratio measure [4] to test the class separability in the feature space. The F-ratio measure takes into account the variance of means and the mean of variances among classes. It has been confirmed that good class separability, indicated by a large F-ratio, gives high recognition accuracy. Fig. 2 shows the F-ratio curves of the 12-order mel frequency cepstral coefficients and their first-order time derivatives derived by applying the various schemes. From these curves, we can see that lower quefrency coefficients generally have higher F-ratios and should therefore offer better class separation. In addition, the WFBA scheme consistently achieves higher F-ratios than the other schemes across the cepstral coefficients. In particular, the FWFBA is superior to the DWFBA, at the price of a higher computation cost.

     For recognition of continuous Mandarin telephone speech, we evaluated these schemes in terms of syllable recognition rate (S.R.R.). Two kinds of environmental conditions, channel distortion and noise corruption, were investigated to see whether the WFBA scheme can achieve better syllable recognition rates than the other evaluated schemes. In the channel-compensated condition, the widely used cepstral mean subtraction (CMS) [5] was employed to cancel the embedded channel effect. Fig. 3 illustrates the relationships between the fuzzy factors and the syllable recognition rates under different conditions. It shows that the syllable recognition rate initially increases with the fuzzy factor $F$, attains a maximum value and then decreases as the fuzzy factor increases further. The optimal value of the fuzzy factor depends on the SNR: the lower the SNR of the additive white Gaussian noise, the smaller the optimal fuzzy factor. Moreover, we also find that further improvement in syllable recognition rate can be obtained by
  9. integrating the WFBA with the CMS. As shown in Table I, the WFBA technique also outperforms the SL and FF schemes and exhibits consistent improvements for the channel-distorted, channel-compensated and various noisy conditions. As far as computation cost is concerned, the computational complexity of the DWFBA is lower than that of the FWFBA. Finally, it is worth noting that the optimal value of the fuzzy factor depends strongly on the SNR and is not easily derived. In this study, the optimal values of the fuzzy factor under the various conditions were determined in a time-consuming manner, by selecting some specific values and their neighbors and comparing the corresponding syllable recognition rates.

     IV. CONCLUSIONS

     In this paper, a weighted filter bank analysis scheme that emphasizes the peak structure of the log filter bank energies is proposed for the derivation of robust cepstral features. Two kinds of weighting functions employed in the WFBA are investigated. The experiments show that, with a properly adjusted fuzzy factor, the FWFBA is better able to enhance the discriminating ability of cepstral features than the conventional FBA scheme and the other two widely used schemes, i.e., cepstral liftering and frequency filtering. Alternatively, the DWFBA offers a simpler form for weighting the LFBEs, with much less computation cost and comparable recognition accuracy. In addition, it is shown that the WFBA is effective for noisy speech recognition and can be combined with environment compensation techniques, such as the CMS, to achieve higher recognition rates if necessary.
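     The two weighting functions of Eqs. (7) and (8) and the weighted DCT of Eq. (6) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the log energies use the `log[e(i) + 1.0]` form drawn in Fig. 1, and the Type 1 weights are computed from the algebraically simplified form `w(i) = a_i^p / sum_j a_j^p` with `a_i = log[e(i) + 1.0]` and `p = 1/(F - 1)`, which assumes `F > 1` and at least one positive energy.

```python
import math


def floored_log_energies(energies):
    """log[e(i) + 1.0], the offset form used in Eqs. (7)-(8) and Fig. 1."""
    return [math.log(e + 1.0) for e in energies]


def fwfba_weights(energies, fuzzy_factor):
    """Type 1 (fuzzy) weights of Eq. (7); assumes fuzzy_factor > 1."""
    if fuzzy_factor <= 1.0:
        raise ValueError("fuzzy_factor must be greater than 1.0")
    logs = floored_log_energies(energies)
    power = 1.0 / (fuzzy_factor - 1.0)
    peak = max(logs)
    # beta_i / sum_j beta_j simplifies to logs[i]**power / sum_j logs[j]**power.
    # Dividing by the peak keeps every base in [0, 1], so large powers cannot overflow.
    scaled = [(value / peak) ** power for value in logs]
    total = sum(scaled)
    return [s / total for s in scaled]


def dwfba_weights(energies):
    """Type 2 (direct) weights of Eq. (8): proportional to the log energies."""
    logs = floored_log_energies(energies)
    total = sum(logs)
    return [value / total for value in logs]


def wfba_cepstra(energies, weights, num_ceps):
    """Eq. (6): DCT of the weighted log energies for m = 1..num_ceps."""
    logs = floored_log_energies(energies)
    q = len(energies)
    return [
        sum(
            w * value * math.cos(m * (2 * i - 1) / 2.0 * math.pi / q)
            for i, (w, value) in enumerate(zip(weights, logs), start=1)
        )
        for m in range(1, num_ceps + 1)
    ]


# Illustrative filter bank energies (made-up values, not data from the paper).
example_energies = [1.0, 9.0, 3.0, 0.5]
sharp = fwfba_weights(example_energies, 1.001)   # nearly all weight on the peak
flat = fwfba_weights(example_energies, 1000.0)   # nearly uniform, about 1/Q each
cepstra = wfba_cepstra(example_energies, dwfba_weights(example_energies), 2)
```

     The two calls to `fwfba_weights` reproduce the limiting behaviour described in Section II: a fuzzy factor close to 1.0 concentrates the weight on the band with the largest energy, while a very large fuzzy factor spreads it almost evenly over the `Q` bands.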
  10. REFERENCES

      [1] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, "On the use of band-pass liftering in speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 7, pp. 947-954, July 1987.
      [2] E. Battle, C. Nadeu, and J. A. R. Fonollosa, "Feature decorrelation methods in speech recognition: A comparative study," Proceedings of International Conference on Spoken Language Processing, pp. 951-954, 1998.
      [3] W. W. Hung and H. C. Wang, "A fuzzy approach for equalization of the cepstral variances," Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 3, SP-P7, pp. 1611-1614, Istanbul, June 2000.
      [4] S. Nicholson, B. Milner, and S. Cox, "Evaluating feature set performance using the F-ratio and J-measures," Proceedings of European Conference on Speech Communication and Technology, vol. 1, pp. 413-416, Greece, September 1997.
      [5] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 254-272, 1981.
  11. Fig. 1. Block diagram for the derivation of MFCCs based on the weighted filter bank analysis. [Diagram not reproduced. Signal flow: $x(n)$, pre-emphasis and Hamming windowing, $|\mathrm{STFT}|^2$ giving $|X(k)|^2$, mel filters $\psi_1(k), \ldots, \psi_Q(k)$ giving $e(1), \ldots, e(Q)$, $\log[e(i) + 1.0]$, multiplication by the weights $w(1), \ldots, w(Q)$, DCT, $c_m^{(WFBA)}$.]
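      The front end of Fig. 1, from the speech frame to the filter bank energies of Eqs. (3) and (4), can be sketched as follows. This is a hedged illustration only: the paper does not give the pre-emphasis coefficient, the mel formula, the filter edges or the sampling rate, so the value 0.97, the `2595 * log10(1 + f/700)` mel mapping, the equally mel-spaced triangular filters and the 8 kHz rate below are all assumptions, and every function name is hypothetical. The power spectrum is taken one-sided (bins 0 to N/2) and the sum index starts at 0 rather than 1, which changes only the phase and not `|X(k)|^2`.

```python
import cmath
import math


def preemphasize_and_window(samples, alpha=0.97):
    """Pre-emphasis followed by a Hamming window (alpha is an assumed value)."""
    n = len(samples)
    emphasized = [samples[0]] + [
        samples[t] - alpha * samples[t - 1] for t in range(1, n)
    ]
    return [
        value * (0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)))
        for t, value in enumerate(emphasized)
    ]


def power_spectrum(frame):
    """|X(k)|^2 for k = 0..N/2 by a direct DFT, as in Eq. (3)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * t * k / n) for t, x in enumerate(frame)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters psi_i(k) with edges equally spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [
        mel_to_hz(top_mel * j / (num_filters + 1)) for j in range(num_filters + 2)
    ]
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for i in range(1, num_filters + 1):
        low, centre, high = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        row = []
        for k in range(num_bins):
            freq = k * bin_hz
            if low < freq <= centre:
                row.append((freq - low) / (centre - low))
            elif centre < freq < high:
                row.append((high - freq) / (high - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def filter_bank_energies(power, bank):
    """Eq. (4): e(i) = sum_k |X(k)|^2 * psi_i(k)."""
    return [sum(p * w for p, w in zip(power, row)) for row in bank]


# Example: a 1 kHz tone sampled at an assumed 8 kHz, 64-sample frame, Q = 8 filters.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(64)]
power = power_spectrum(preemphasize_and_window(tone))
energies = filter_bank_energies(power, mel_filter_bank(8, len(power), sample_rate))
```

      The resulting `energies` list plays the role of `e(1), ..., e(Q)` in Fig. 1, that is, the quantities that are passed through `log[e(i) + 1.0]`, weighted and transformed by the DCT.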
  12. Fig. 2. F-ratio curves of mel frequency cepstral coefficients based on various schemes. (A) 12-order cepstral coefficients, C1 to C12. (B) 12-order delta cepstral coefficients, C13 to C24. [Plots not reproduced. Horizontal axis: cepstral coefficient index; vertical axis: F-ratio measure; curves: MFCC, FWFBA (F = 1.9), DWFBA, FF, SL.]
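      Section III describes the F-ratio plotted in Fig. 2 as combining the variance of the class means with the mean of the class variances. The sketch below assumes the common form of that measure, the ratio of the two, computed for one scalar feature at a time with population variances; the exact definition used in [4] may differ in normalization, and the function name and sample values are hypothetical.

```python
from statistics import fmean, pvariance


def f_ratio(classes):
    """Variance of the class means divided by the mean within-class variance.

    `classes` is a list of lists; each inner list holds the values of one
    feature (for example one cepstral coefficient) for one class.
    """
    class_means = [fmean(values) for values in classes]
    between = pvariance(class_means)
    within = fmean([pvariance(values) for values in classes])
    return between / within


# Well separated classes give a large ratio, overlapping classes a small one.
separated = f_ratio([[0.0, 2.0], [10.0, 12.0]])
overlapping = f_ratio([[0.0, 10.0], [2.0, 12.0]])
```

      In the setting of the paper, each HMM state would be one class and the measure would be evaluated separately for each of the 24 feature dimensions.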
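  13. Fig. 3. Relationships between fuzzy factors and syllable recognition rates under different conditions. [Plot not reproduced. Horizontal axis: fuzzy factor, with marked values less than 1.0, 1.001, 1.01, 1.1, 1.3, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.2, 10, 100 and 1000; vertical axis: syllable recognition rate (%), 0 to 50; curves: NO CMS + NO AWGN, NO CMS + 30 dB AWGN, NO CMS + 20 dB AWGN, NO CMS + 10 dB AWGN, CMS + NO AWGN.]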
  14. Table I. Syllable recognition rates, S.R.R. (%), for various schemes under different conditions.

      Scheme                      NO CMS,         CMS,            NO CMS,         NO CMS,         NO CMS,
                                  NO AWGN         NO AWGN         30 dB AWGN      20 dB AWGN      10 dB AWGN
      MFCC                        40.79           44.90           36.16           24.41           8.01
      Sinusoidal Lifter (SL)      40.32           45.01           37.42           26.34           9.78
      Frequency Filtering (FF)    41.49           46.73           38.94           28.85           10.46
      DWFBA                       42.95           48.81           39.98           30.25           11.32
      FWFBA (F = 1000.0)          40.98           44.78           37.03           24.01           7.59
      FWFBA (F = 2.0)             43.21           48.36           40.66           30.51           11.95
      FWFBA (F = 1.001)           7.18            10.53           2.89            0.0             0.0
      FWFBA (F < 1.0)             0.0             0.0             0.0             0.0             0.0
      FWFBA (F = optimal value)   43.8 (F = 1.9)  49.27 (F = 2.2) 41.21 (F = 1.9) 32.15 (F = 1.8) 14.81 (F = 1.6)
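      The noisy conditions in Table I were obtained by corrupting the test utterances with additive white Gaussian noise at 30, 20 and 10 dB SNR. A minimal sketch of that kind of corruption is given below. It assumes that the signal power is measured over the whole utterance, which the paper does not state, and the function name, parameters and test signal are hypothetical.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = rng or random.Random(0)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Example: corrupt a synthetic tone at the three SNRs used in Table I.
clean = [math.sin(0.05 * t) for t in range(20000)]
noisy_versions = {snr: add_awgn(clean, snr) for snr in (30.0, 20.0, 10.0)}
```

      Passing an explicit `random.Random` instance makes the corruption reproducible, which helps when the same noisy utterances have to be reused across all evaluated schemes.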
