
3rd International Conference on Electrical & Computer Engineering
ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh

SPEAKER IDENTIFICATION USING MEL FREQUENCY CEPSTRAL COEFFICIENTS

Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani, Md. Saifur Rahman
Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000
E-mail: saif672@yahoo.com

ABSTRACT

This paper presents a security system based on speaker identification. Mel frequency cepstral coefficients (MFCCs) have been used for feature extraction, and a vector quantization technique is used to minimize the amount of data to be handled.

1. INTRODUCTION

Speech is one of the natural forms of communication, and recent developments have made it possible to use it in security systems. In speaker identification, the task is to use a speech sample to select the identity of the person who produced the speech from among a population of speakers. In speaker verification, the task is to use a speech sample to test whether a person who claims to have produced the speech has in fact done so [1]. This technique makes it possible to use a speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

2. PRINCIPLES OF SPEAKER RECOGNITION

Speaker recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying [1]. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. Every speaker recognition technology, whether for identification or verification, and whether text-independent or text-dependent, has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching [2,3].

3. SPEECH FEATURE EXTRACTION

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate). The speech signal is a slowly time-varying signal (it is called quasi-stationary): when examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over long periods of time (on the order of 0.2 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it is the feature used in this paper. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filter: linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the mel frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. A normal speech waveform may vary from time to time depending on the physical condition of the speaker's vocal cords, and MFCCs are less susceptible to these variations than the speech waveforms themselves [1,4].
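To make the short-time analysis concrete, the following sketch splits a waveform into short overlapping frames of the kind assumed above. It is a minimal illustration in Python (the paper's own system was written in Matlab); the 30 ms frame length and 22050 Hz rate come from the paper, while the 10 ms hop and the function name are illustrative choices.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=30, hop_ms=10):
    """Split a 1-D waveform into short overlapping frames.

    Speech is quasi-stationary, so each ~30 ms frame can be
    treated as stationary for short-time spectral analysis.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: one second of a 22050 Hz signal -> 98 frames of 661 samples.
frames = frame_signal(np.random.randn(22050), 22050)
print(frames.shape)
```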
3.1 The MFCC processor

A block diagram of the structure of an MFCC processor is given in Figure 1. The speech input is recorded at a sampling rate of 22050 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion process.

Figure 1: Block diagram of the MFCC processor (continuous speech → frame blocking → windowing → FFT → mel-frequency wrapping → cepstrum).

3.2 Mel-frequency wrapping

The speech signal consists of tones with different frequencies. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the mel scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following formula to compute the mels for a given frequency f in Hz [5]:

mel(f) = 2595 * log10(1 + f/700) .......... (1)

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. Each filter has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval.
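As an illustration of equation (1) and the triangular filter bank built on it, here is a minimal Python sketch. The number of filters and the FFT size are assumed values for the example, not parameters reported by the authors.

```python
import numpy as np

def hz_to_mel(f):
    # Equation (1): linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, sample_rate=22050):
    """Triangular bandpass filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):      # rising edge of the triangle
            fbank[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):     # falling edge of the triangle
            fbank[i - 1, b] = (right - b) / max(right - center, 1)
    return fbank

fbank = mel_filterbank()
print(fbank.shape)  # (20, 257): one row per mel filter
```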
3.3 Cepstrum

In the final step, the log mel spectrum has to be converted back to time. The result is called the mel frequency cepstral coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCCs may be calculated from the log mel spectrum S̃_k using this equation [3,5]:

c̃_n = Σ_{k=1}^{K} (log S̃_k) cos[n(k − 1/2)π/K],  n = 1, 2, …, K .......... (2)

The number of mel cepstrum coefficients, K, is typically chosen as 20. The first component, c̃_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information. By applying the procedure described above to each speech frame of about 30 ms (with overlap), a set of mel-frequency cepstrum coefficients is computed. This set of coefficients is called an acoustic vector, and each input utterance is thus transformed into a sequence of such vectors [4]. The next section describes how these acoustic vectors can be used to represent and recognize the voice characteristic of a speaker.
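The sketch below strings the front end of Figure 1 together for one frame: windowing, FFT magnitude, mel-frequency wrapping, log, and the DCT of equation (2). It reuses the mel_filterbank helper from the previous sketch and is illustrative rather than the authors' Matlab implementation; the FFT size is an assumed value.

```python
import numpy as np

def mfcc_frame(frame, fbank, n_coeffs=20):
    """Compute the acoustic vector (MFCCs, eq. 2) for one speech frame."""
    windowed = frame * np.hamming(len(frame))        # windowing
    spectrum = np.abs(np.fft.rfft(windowed, n=512))  # FFT magnitude
    mel_energies = fbank @ spectrum                  # mel-frequency wrapping
    log_mel = np.log(mel_energies + 1e-10)           # log mel spectrum S~_k
    K = len(log_mel)
    n = np.arange(1, n_coeffs + 1)                   # start at n=1: c~_0 excluded
    k = np.arange(1, K + 1)
    # DCT of equation (2): c~_n = sum_k (log S~_k) cos[n (k - 1/2) pi / K]
    basis = np.cos(np.outer(n, k - 0.5) * np.pi / K)
    return basis @ log_mel

# One 30 ms frame at 22050 Hz (661 samples, cropped to the 512-point FFT).
acoustic_vector = mfcc_frame(np.random.randn(661), mel_filterbank())
print(acoustic_vector.shape)  # (20,)
```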
4. FEATURE MATCHING

The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). The VQ approach has been used here for its ease of implementation and high accuracy.

4.1 Vector quantization

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding [6]. It is a fixed-to-fixed length algorithm and may be thought of as an approximator. Figure 2 shows an example of a two-dimensional VQ.

Figure 2: An example of a two-dimensional VQ.

Here, every pair of numbers falling in a particular region is approximated by the star associated with that region. In Figure 2, the stars are called codevectors and the regions defined by the borders are called encoding regions. The set of all codevectors is called the codebook, and the set of all encoding regions is called the partition of the space [6].

4.2 LBG design algorithm

The LBG VQ design algorithm, proposed by Y. Linde, A. Buzo and R. Gray, is an iterative algorithm which alternately solves the two optimality criteria [7]. The algorithm requires an initial codebook, which is obtained by the splitting method: an initial codevector is set as the average of the entire training sequence and is then split into two. The iterative algorithm is run with these two vectors as the initial codebook; the final two codevectors are split into four, and the process is repeated until the desired number of codevectors is obtained. The algorithm is summarized in the flowchart of Figure 3 and sketched in code below.

Figure 3: Flowchart of the VQ-LBG algorithm (find centroid; split each centroid, m = 2m; cluster vectors; find centroids; compute distortion D; iterate until the relative change in D falls below ε; stop when m = M).
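The following Python sketch follows the splitting procedure of Figure 3. The perturbation factor and the stopping threshold ε are illustrative parameters, and the code is a sketch of the LBG idea rather than the authors' implementation.

```python
import numpy as np

def lbg(vectors, n_codevectors, eps=0.01):
    """Train a VQ codebook by the LBG splitting method.

    vectors: (N, D) array of training acoustic vectors.
    n_codevectors: desired codebook size M (a power of two).
    """
    codebook = vectors.mean(axis=0, keepdims=True)   # initial codevector
    while len(codebook) < n_codevectors:
        # Split each centroid into two perturbed copies (m = 2m).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev_distortion = np.inf
        while True:
            # Nearest-neighbor condition: assign vectors to closest codevector.
            dists = np.linalg.norm(
                vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = dists.argmin(axis=1)
            distortion = dists[np.arange(len(vectors)), nearest].mean()
            # Centroid condition: recompute each codevector as its cluster mean.
            for i in range(len(codebook)):
                members = vectors[nearest == i]
                if len(members) > 0:
                    codebook[i] = members.mean(axis=0)
            # Stop when the relative drop in distortion falls below eps.
            if (prev_distortion - distortion) / distortion < eps:
                break
            prev_distortion = distortion
    return codebook

codebook = lbg(np.random.randn(500, 20), 16)
print(codebook.shape)  # (16, 20)
```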
In Figure 4, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2.

Figure 4: Conceptual diagram illustrating the VQ process.

In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resultant codewords (centroids) are shown in Figure 4 by circles and triangles at the centers of the corresponding blocks for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified, as sketched below. Figure 5 shows the use of different numbers of centroids for the same data field.

Figure 5: Pictorial view of codebooks with 5 and 15 centroids, respectively.
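The recognition rule above amounts to choosing the codebook with minimum total distortion. Here is a small sketch reusing the lbg trainer from the previous listing; the speaker names and the random stand-in training data are hypothetical.

```python
import numpy as np

def total_distortion(vectors, codebook):
    """Total VQ distortion: sum of distances to the nearest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def identify(utterance_vectors, codebooks):
    """Return the speaker whose codebook gives the smallest total distortion."""
    return min(codebooks,
               key=lambda spk: total_distortion(utterance_vectors, codebooks[spk]))

# Training phase: one codebook per enrolled speaker (random stand-in data).
codebooks = {spk: lbg(np.random.randn(300, 20) + i, 16)
             for i, spk in enumerate(["speaker1", "speaker2"])}
# Recognition phase: quantize an unknown utterance with every codebook.
print(identify(np.random.randn(50, 20) + 1, codebooks))
```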
5. RESULTS

The system has been implemented in Matlab 6.1 on a Windows XP platform. The results of the study are presented in Table 1 and Table 2. The speech database consists of 21 speakers: 13 male and 8 female. Here, the identification rate is defined as the ratio of the number of speakers identified to the total number of speakers tested.

Table 1: Identification rate (in %) for different windows [using linear scale]

Codebook size   Triangular   Rectangular   Hamming
1               66.67        38.95         57.14
2               85.7         42.85         85.7
4               90.47        57.14         90.47
8               95.24        57.14         95.24
16              100          80.95         100
32              100          80.95         100
64              100          85.7          100

Table 2: Identification rate (in %) for different windows [using mel scale]

Codebook size   Triangular   Rectangular   Hamming
1               57.14        57.14         57.14
2               85.7         66.67         85.7
4               90.47        76.19         100
8               95.24        80.95         100
16              100          85.7          100
32              100          90.47         100
64              100          95.24         100

Table 1 shows the identification rate when a triangular, rectangular, or Hamming window is used for framing on a linear frequency scale. The table clearly shows that the identification rate for each of the three windows increases with codebook size, and at a codebook size of 16 the rate is 100% for both the triangular and Hamming windows. In Table 2, the same windows are used along with a mel scale instead of a linear scale. Here, too, the identification rate increases with codebook size; in this case, a 100% identification rate is obtained with a codebook size of only 4 when the Hamming window is used.

6. CONCLUSION

The MFCC technique has been applied for speaker identification, and VQ is used to minimize the amount of extracted feature data. The study reveals that the identification rate of the system increases as the number of centroids increases. It has been found that the combination of the mel frequency scale and the Hamming window gives the best performance. The results also suggest that, to obtain a satisfactory identification rate, the number of centroids has to be increased as the number of speakers increases. The study shows that the linear scale can also achieve a reasonable identification rate if a comparatively high number of centroids is used; however, the recognition rate using a linear scale would be much lower if the number of speakers increased. The mel scale is also less vulnerable to changes in a speaker's vocal cords over time.

The present study is ongoing and may include the following further work. HMM may be used to improve the efficiency and precision of segmentation, to deal with crosstalk, laughter, and uncharacteristic speech sounds. A more effective normalization algorithm could be adopted for the extracted parametric representations of the acoustic signal, which would improve the identification rate further. Finally, a combination of features (MFCC, LPC, LPCC, formants, etc.) may be used to implement a robust parametric representation for speaker identification.

REFERENCES

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] Zhong-Xuan Yuan, Bo-Ling Xu, and Chong-Zhi Yu, "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999.
[3] F. Soong, E. Rosenberg, B. Juang, and L. Rabiner, "A Vector Quantization Approach to Speaker Recognition", AT&T Technical Journal, Vol. 66, March/April 1987, pp. 14-26.
[4] Comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/
[5] J. R. Deller Jr., J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals, 2nd ed., IEEE Press, New York, 2000.
[6] R. M. Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984.
[7] Y. Linde, A. Buzo, and R. Gray, "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.