Applying Various DSP-Related Techniques for Robust Recognition of Adult and Child Speakers

R. Atachiants, C. Bendermacher, J. Claessen, E. Lesser, S. Karami
January 21, 2009
Faculty of Humanities and Sciences, Maastricht University

Abstract
This paper approaches speaker recognition in a new way. A speaker recognition system has been realized that works on adult and child speakers, both male and female. Furthermore, the system employs text-dependent and text-independent algorithms, which makes robust speaker recognition possible in many applications. Single-speaker classification is achieved with age/sex pre-classification and is implemented using classic text-dependent techniques, as well as a novel technique for text-independent recognition. This new research uses Evolutionary Stable Strategies to model human speech and allows speaker recognition by analyzing just one vowel.

1 Introduction
In the past few years, privacy has become more important to people all over the world. A major factor is the rise of the Internet: private details of a person's life have become easier to access and to copy. Money is frequently stolen over the Internet, and copying cards and the information belonging to them is easier than ever and still occurs very often. A reliable voice recognition system, used in combination with a password, could improve the security of our money.

Handling these problems starts with modeling human speech. If an algorithm can recognize speech on its own, no human is needed to check a recording for speech. Several questions arise: 'How can an algorithm know that there is speech?', 'How does an algorithm estimate the noise?', 'How does an algorithm classify speech, and when should it do so?', and 'How can an algorithm notice that there are multiple speakers?'. These questions lead to an overall problem definition: 'How to identify one or more speakers?'

To address this problem, the paper starts with detecting speech; this is the subject of section 2. Speech is detected using an endpoint detection algorithm, which locates speech in the signal, and noise reduction, which uses three kinds of filtering: (a) Finite Impulse Response (FIR) filtering, (b) wavelets, and (c) spectral subtraction. Combining these three techniques, the program obtains a signal that satisfies the requirements of the classification algorithms: classifying the speaker alone or in a conversation. First the speaker has to be recognized when talking alone; this is discussed in section 3. Speaker recognition is done using (a) discrete word selection, (b) Mel-Frequency Cepstral Coefficients and Vector Quantization, (c) age/sex classification, (d) a voice model, and (e) a check for contradictions that leads to the conclusion whether a person is speaking or not. In the last part, multiple speakers are identified and classified using the methods Framed Multi-Speaker Classification and Harmonic Matching Classifier. After these sections there is a short discussion of the techniques, followed by the conclusions.

2 Speech detection
The very first step in identifying the speaker is detecting speech. This means that the part of the signal that contains speech has to be separated from the noise. Two algorithms are used to detect speech: endpoint detection, described in subsection 2.2, and noise reduction, described in subsection 2.3.

2.1 Architectural overview
If a signal contains little noise, the endpoint detection algorithm can effectively determine whether the signal contains speech. However, if there is much noise, noise reduction has to be applied to the signal first. To estimate the noise level of a signal, the spectral subtraction algorithm is used. This estimate is then compared to the whole signal, resulting in the signal-to-noise ratio (SNR). If needed, one of three noise reduction techniques (FIR, spectral subtraction and wavelets) is selected, based on the weighted SNR of each denoised signal. FIR is preferred over wavelets and spectral subtraction, while the use of wavelets is preferred over spectral subtraction. When endpoint detection is then applied to the selected denoised signal, speech can be detected accurately. See figure 1 for a schematic overview.
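To make this selection step concrete, the sketch below (Python) picks the denoised signal with the highest weighted SNR. The denoiser functions, the weight values and the snr_db helper are illustrative assumptions, not the authors' implementation.

import numpy as np

def snr_db(signal, noise_estimate):
    """Signal-to-noise ratio in dB from a signal and a noise estimate."""
    p_sig = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise_estimate, dtype=float) ** 2) + 1e-12
    return 10.0 * np.log10(p_sig / p_noise)

def select_denoised(signal, noise_estimate, denoisers, weights):
    """Pick the denoised signal with the highest weighted SNR.

    `denoisers` maps a name to a denoising function; `weights` encodes the
    stated preference FIR > Wavelets > Spectral Subtraction."""
    best_name, best_score, best_out = None, -np.inf, signal
    for name, fn in denoisers.items():
        out = fn(signal)
        score = weights[name] * snr_db(out, noise_estimate)
        if score > best_score:
            best_name, best_score, best_out = name, score, out
    return best_name, best_out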
Figure 1: Architectural overview of speech detection

2.2 Endpoint detection
The endpoint detection algorithm filters the noise from the beginning and the end of the signal and detects the beginning and the end of speech. If these two points are the same, the signal contains no speech and consists of noise only. It is assumed that the first 100 ms of the signal contain no speech. From this part of the signal, the energy and the zero-crossing rate (ZCR) of the noise can be calculated. Next, the lower threshold (ITL) and upper threshold (ITU) are calculated as follows:

I1 = 0.03 * (maxEnergy - avgEnergy) + avgEnergy
I2 = 4 * avgEnergy
ITL = MIN(I1, I2)
ITU = 5 * ITL

To determine the starting point (N1) and the end point (N2) of the speech, the ITL and ITU are considered. When the energy of the signal crosses the ITL for the first time, this point is saved. If the energy then drops below the ITL again, it was a false alarm. However, when it also crosses the ITU, speech was found and the saved point is taken as N1; see figure 2(a). For N2 a similar procedure is followed, just the other way around. Finally, N1 and N2 can be determined more precisely by looking at the ZCR. More exactly, a closer look is taken at the ZCRs of the 250 ms before N1. A high ZCR in that interval is an indication that there is speech and N1 needs to be reconsidered; see figure 2(b). Similarly, N2 can be determined more accurately.

Figure 2: (a) Determining N1 and N2. (b) Redetermining N1 and N2.
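A minimal sketch of these computations, assuming magnitude-sum frame energies and that avgEnergy is taken from the first 100 ms of noise; the helper names and the frame handling are illustrative rather than the authors' code.

import numpy as np

def frame_energy_zcr(x, frame_len):
    """Short-time energy and zero-crossing count per non-overlapping frame."""
    n_frames = len(x) // frame_len
    frames = np.asarray(x[:n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)
    energy = np.sum(np.abs(frames), axis=1)                     # magnitude-sum energy
    zcr = np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

def endpoint_thresholds(energy, noise_avg):
    """ITL/ITU from section 2.2 (noise_avg comes from the first 100 ms)."""
    i1 = 0.03 * (np.max(energy) - noise_avg) + noise_avg
    i2 = 4.0 * noise_avg
    itl = min(i1, i2)
    itu = 5.0 * itl
    return itl, itu

def find_n1(energy, itl, itu):
    """First frame whose energy rises through ITL and then reaches ITU."""
    candidate = None
    for i, e in enumerate(energy):
        if e >= itl and candidate is None:
            candidate = i          # possible start of speech
        elif e < itl:
            candidate = None       # false alarm, reset
        if candidate is not None and e >= itu:
            return candidate       # confirmed start point N1
    return None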
2.3 Noise reduction
Noise reduction is the other algorithm used for speech detection. It can be done with the help of FIR filtering, spectral subtraction and wavelets; these topics are described in subsections 2.3.1, 2.3.2 and 2.3.3 respectively.

2.3.1 FIR filtering
In signal processing, two types of filters are used. As the name suggests, the impulse response of the FIR filter is finite. The other type's impulse response is normally not finite because of its feedback structure. The FIR filter is used extensively in this project to remove white Gaussian noise (WGN) from the signals. The frequencies of the WGN lie mainly in the low-frequency band of the spectrum. A first-order high-pass (FIR) filter has been applied to strengthen the amplitude of the high frequencies. This is done by decreasing the amplitude of the low frequencies by up to 20 dB, so the speech becomes stronger and the noise is reduced. For filtering, the transfer function in the z-domain is used:

H(z) = (z - α) / z    (1)

The standard form of a transfer function is H(z) = Y(z)/X(z), so Y(z) = z - α and X(z) = z. The working of this formula is shown in figure 3. A transfer function whose poles all lie inside the unit circle of the z-plane is always stable; therefore α lies between -1 and 1. To obtain a decrease as large as possible with a first-order filter, α is set to 0.95.

Figure 3: FIR filter

Figure 4 shows the pole-zero plot: setting X(z) = z = 0 gives the single pole at the origin (where Y(z) = -0.95), while the zero lies at z = α = 0.95. Since the pole lies inside the unit circle, the FIR filter is stable.

Figure 4: z-plane

To determine the frequency response of a discrete-time (FIR) filter, the transfer function is evaluated at z = e^{jωT}. The transfer function used in this paper for FIR filtering then becomes formula (2):

H(e^{jωT}) = 1 - 0.95 * e^{-jωT}    (2)

This is one way of filtering WGN. In the section about wavelets, 2.3.3, another approach to filtering WGN is explained.
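A sketch of this first-order high-pass filter with SciPy; the coefficient follows equation (2), while the response check at the end is only illustrative.

import numpy as np
from scipy.signal import lfilter, freqz

ALPHA = 0.95  # zero location from the text; the single pole sits at z = 0

def fir_highpass(x, alpha=ALPHA):
    """First-order high-pass FIR: y[n] = x[n] - alpha * x[n-1],
    i.e. H(z) = 1 - alpha * z^{-1}, attenuating low frequencies."""
    return lfilter([1.0, -alpha], [1.0], x)

# Frequency response |H(e^{jw})|: magnitude near DC is about 1 - alpha,
# i.e. strongly attenuated, and close to 1 + alpha near the Nyquist frequency.
w, h = freqz([1.0, -ALPHA], [1.0], worN=512)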
2.3.2 Spectral subtraction
Spectral subtraction is a more advanced form of noise reduction. It is used for signals that contain non-Gaussian (artificial) noise. After framing and Hamming windowing (DSP), endpoint detection is applied to every frame to separate the noise frames from the frames with speech. From the noise frames a noise estimate of the signal is made. After applying the Discrete Fourier Transform (DFT) to the windowed signal, the noise estimate is simply subtracted from the signal to obtain the denoised frames. Moreover, the noise estimate is used later to calculate the SNR (see section 2.1). Finally the inverse DFT is taken and the frames are reassembled to obtain the denoised signal. A schematic overview of the whole process is given in figure 5.

Figure 5: Spectral subtraction
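A simplified sketch of the block diagram of figure 5, using non-overlapping Hamming-windowed frames and magnitude subtraction floored at zero; the frame length and the way noise frames are supplied are assumptions.

import numpy as np

def spectral_subtraction(x, noise_frames, frame_len=256):
    """Magnitude spectral subtraction over non-overlapping windowed frames."""
    win = np.hamming(frame_len)
    # Average magnitude spectrum of the noise-only frames
    noise_mag = np.mean([np.abs(np.fft.rfft(f * win)) for f in noise_frames], axis=0)
    out = np.zeros(len(x), dtype=float)
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = np.asarray(x[start:start + frame_len], dtype=float) * win
        spec = np.fft.rfft(frame)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # subtract, floor at 0
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len)
        out[start:start + frame_len] = clean              # reassemble the frames
    return out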
2.3.3 Wavelets
As already suggested in the section about FIR filtering, wavelets are used to filter white Gaussian noise from the signal. This way of filtering starts with the original signal and a mother wavelet. The mother wavelet can be one of many available mother wavelets; in this paper one of the Daubechies wavelets is used, the Daubechies 3, which is recommended by Matlab, the program used to create the wavelet filter.

The next step in filtering is the decomposition of the original signal. By fitting the mother wavelet to the signal at the smallest scale, the filter produces what is called the first wavelet detail and a remainder called the first approximation. Then the timescale of the mother wavelet is doubled and again fitted to the first approximation. This results in a second wavelet detail and a second remainder, the second approximation. Doubling the timescale of the mother wavelet is also known as dilation. Dilation and splitting the remainders into a new detail and approximation part (figure 6) is continued until the mother wavelet has been dilated to such an extent that it covers the entire range of the signal. [9]

Figure 6: Signal decomposition

There are two ways of thresholding: soft and hard thresholding. With hard thresholding, the signal below a certain threshold is set to zero. Soft thresholding is more complicated: it subtracts the value of the threshold from the values of the signal that are above that threshold, while the values below the threshold are again set to zero. [10] In Matlab this is integrated in the functions ddencmp and wdencmp. The function ddencmp determines a threshold and the way of thresholding from the sound sample; the function wdencmp uses this threshold value and soft/hard thresholding to create a de-noised signal. Using these two functions, Matlab generates a denoised signal by itself.
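A rough PyWavelets analogue of that Matlab ddencmp/wdencmp pipeline (db3 decomposition, soft thresholding); the universal-threshold rule used here is a common default and not necessarily the threshold ddencmp would choose.

import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db3", level=None):
    """Soft-threshold wavelet denoising with a Daubechies-3 mother wavelet."""
    x = np.asarray(x, dtype=float)
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]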
3 Speaker classification
The speaker classification algorithms described in this paper work best on discrete words or small signals. First, the Discrete Word Selection (DWS) algorithm is applied to cut out the part of the signal containing the most vowel components. Next, the Age/Sex Classification (ASC) algorithm tries to classify the signal in order to reduce computation by limiting the database samples that have to be processed. Then text-dependent (T-D) speaker detection techniques, such as Dynamic Time Warping (DTW) and Vector Quantization (VQ), and a text-independent (T-I) technique, the Voice Model algorithm, are run. The results are checked for contradictions; if a contradiction is detected, the ASC bias is discarded and the T-D and T-I algorithms are computed again. If a speaker is detected, the system proceeds to the classification of multiple speakers, using two different techniques in parallel: Framed Multi-Speaker Classification and the Harmonic Matching Classifier. The results of both are combined to achieve the best result. See figure 7 for a schematic overview.

Figure 7: Architectural overview of speaker recognition

3.1 Discrete word selection
Discrete word selection is used for two reasons. First, the techniques used in the system are mainly valid for discrete speech processing and not so much for the processing of continuous speech. This means that the best results will be achieved when working with only one isolated group of words; working with discrete speech also improves the performance of the system. The second reason for using discrete word selection is as a help for the Age/Sex Classification (ASC) block. The ASC block uses physical properties of the human vocal tract to classify speech.

The algorithm for discrete word selection is based on the V/C/P (Vowel / Consonant / Pause) classification algorithm. This algorithm is text-independent and composed of four blocks, see figure 8. In the first block the main features are extracted; in the second block the signal is framed and classified for the first time. Next, the noise level is estimated and the frames are classified again with an updated noise level parameter.

In order to distinguish a consonant, the V/C/P algorithm proposes the use of zero-crossing rate features and a threshold (ZCR_dyna). If the ZCR is larger than the threshold, the frame is classified as a consonant. If the frame cannot be classified this way, the energy of that frame is checked: if the energy is smaller than the overall noise level, the frame is classified as a pause; if the energy is larger, the frame is classified as a vowel. The result of the V/C/P classification of an example speech clip is shown in figure 9.

Figure 8: V/C/P classification algorithm blocks.

Figure 9: V/C/P classification of an example speech clip (o: consonant, +: pause, *: vowel). Image from Microsoft Research Asia. [12]

The complete discrete word selection algorithm is implemented as follows:

1. The audio input is segmented into non-overlapping frames of 10 ms, from which energy and ZCR features are extracted.
2. The energy curve is smoothed using FIR filtering.
3. The Mean_Energy and Std_Energy of the energy curve are calculated to estimate the background noise energy level, and the threshold of the ZCR (ZCR_dyna), as:
   NoiseLevel = Mean_Energy - 0.75 * Std_Energy
   ZCR_dyna = Mean_ZCR + 0.5 * Std_ZCR
4. Frames are coarsely classified as V/C/P using the following rules, where FrameType denotes the type of each frame (see the sketch after this list):
   If ZCR > ZCR_dyna then FrameType = Consonant
   Else if Energy < NoiseLevel then FrameType = Pause
   Else FrameType = Vowel
5. The NoiseLevel is updated as the weighted average energy of the frames at each vowel boundary and the background segments.
6. The frames are re-classified using the rules of step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. A vowel will be split, based on its energy, if its duration is too long.
7. After the classification has terminated, the word with the highest number of V-frames is selected.
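The coarse rules of step 4 can be sketched as follows; the energy and ZCR vectors are assumed to come from the 10 ms framing of step 1.

import numpy as np

def classify_frames(energy, zcr):
    """Coarse V/C/P labelling of 10 ms frames using the rules of step 4."""
    noise_level = np.mean(energy) - 0.75 * np.std(energy)
    zcr_dyna = np.mean(zcr) + 0.5 * np.std(zcr)
    labels = []
    for e, z in zip(energy, zcr):
        if z > zcr_dyna:
            labels.append("C")      # consonant: high zero-crossing rate
        elif e < noise_level:
            labels.append("P")      # pause: energy below the noise level
        else:
            labels.append("V")      # vowel: voiced, high energy
    return labels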
3.2 MFCC and vector quantization
Mel-frequency cepstral coefficients (MFCCs) and vector quantization (VQ) are used to construct a set of highly representative feature vectors from a speech fragment. These vectors are used to achieve speaker classification.

Frequencies below 1 kHz contain the most relevant information for speech; hence human hearing emphasizes these frequencies. To imitate this, frequencies can be mapped to the Mel frequency scale (Mel scale). The Mel scale is linear up to 1 kHz, while for higher frequencies it is logarithmic, thus emphasizing the lower frequencies. After converting to the Mel scale, the MFCCs can be found using the Discrete Cosine Transform. In this paper 13 MFCCs are obtained from each frame of the speech signal.

Since a speech fragment is generally divided into many frames, this results in a large set of data. Therefore VQ, implemented as proposed in [7], is used to compress these data points into a small set of feature vectors (codevectors). In the case of speech fragments, the set of codevectors is a representation of the speaker; such a representation is called a codebook. Here VQ is used to compress each set of MFCCs to 4 codevectors. In the training phase a codebook is generated for every known speaker, and these codebooks are saved in the database. When identifying a speaker from a new speech fragment, the MFCCs of the fragment are compared to each codebook in the database, as can be seen in figure 10. The distance between an MFCC vector and the closest codevector is called its distortion. The codebook with the smallest total distortion over all MFCCs is identified as the speaker.

Figure 10: Matching MFCCs to a codebook
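A sketch of the codebook matching described above; the shapes (13 MFCCs per frame, 4 codevectors per codebook) follow the text, while the dictionary-based speaker database is an assumption.

import numpy as np

def total_distortion(mfccs, codebook):
    """Sum of distances from each MFCC vector to its nearest codevector."""
    # mfccs: (n_frames, 13), codebook: (4, 13)
    d = np.linalg.norm(mfccs[:, None, :] - codebook[None, :, :], axis=2)
    return float(np.sum(np.min(d, axis=1)))

def identify_speaker(mfccs, codebooks):
    """Codebook (speaker) with the smallest total distortion wins."""
    return min(codebooks, key=lambda name: total_distortion(mfccs, codebooks[name]))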
3.3 Dynamic time warping
Dynamic Time Warping (DTW) is a generic algorithm used to compare two signals. In order to find the similarity between two such sequences, or as a preprocessing step before averaging them, the time axis of one (or both) sequences must be warped to achieve a better alignment, see figure 11.

Figure 11: Two sequences of data with an overall similar shape that are not aligned on the time axis. [11]

In order to compare two speech signals, the system applies DTW to the 13 Mel-frequency cepstral coefficients (MFCCs) from the Mel scale and compares them to the database samples. To find a warping path $W = w_1, w_2, \ldots, w_k, \ldots, w_K$, with $\max(m,n) \le K < m+n-1$, for two sequences of MFCC data, a few steps are required (a dynamic-programming sketch follows this list):

1. Calculate the distance cost matrix (in this paper the Euclidean distance was used to compute the cost).
2. Compute the path, starting from a corner of the cost matrix and processing adjacent cells. This path can be found very efficiently using dynamic programming. [11]
3. Select only the path which minimizes the warping cost:
   $$DTW(Q, C) = \min\left\{\frac{\sum_{k=1}^{K} w_k}{K}\right\}$$
4. Repeat the path calculation for each MFCC feature and compute a difference from each path.
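A dynamic-programming sketch of the cost-matrix computation for one MFCC coefficient track; the absolute difference stands in for the Euclidean distance in this one-dimensional case.

import numpy as np

def dtw_cost(q, c):
    """Classic dynamic-programming DTW on two 1-D feature sequences."""
    m, n = len(q), len(c)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # extend the cheapest adjacent cell of the cost matrix
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]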
3.4 Age/sex classification
The ASC block is based on physical properties of speech and the vocal tract and pre-classifies the input into one of the following four categories: male adult, female adult, male child, female child. This pre-classification helps the classification algorithms of the system to classify the speaker more accurately. The total length of the vocal tract L can be calculated from the first harmonic of a sound exiting the closed tube:

L = c / (4F)    (3)

where c is the speed of sound and F the fundamental frequency. Once the length of the vocal tract has been calculated, it is straightforward to classify the length according to age and sex. The general assumptions are that an adult has a longer vocal tract than a child and that a male has a longer vocal tract than a female [1]. For easier implementation of the classifier, it was chosen to work with the vocal tract length instead of directly with the fundamental frequencies.

Based on [2] the ASC algorithm has been developed and implemented; it uses LPC to extract the first formant from the signal. Classification is then based on heuristic methods, where the length intervals for adult female and child male are divided into sub-bands, allowing the system to distinguish between these categories.

Implementation-wise it is important to note that ASC is only carried out if the number of samples in the database of the system is larger than the number of classes of speakers. This is done to prevent the pre-classification block (ASC) from acting as a classification algorithm itself and hence disabling the classification blocks.
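A sketch of the vocal tract length estimate and a heuristic banding; the numeric cut-offs are illustrative guesses, not the sub-band boundaries used by the authors.

def vocal_tract_length(f_hz, c=343.0):
    """Closed-tube estimate from equation (3): L = c / (4 * F), in metres."""
    return c / (4.0 * f_hz)

def coarse_age_sex(length_m):
    """Illustrative banding: longer tracts suggest adults, and male tracts
    tend to be longer than female ones."""
    if length_m > 0.165:
        return "Adult Male"
    elif length_m > 0.145:
        return "Adult Female"
    elif length_m > 0.125:
        return "Child Male"
    else:
        return "Child Female"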
3.5 Voice model
Human speech is produced by expelling air from the lungs into the vocal tract, where the 'air signal' is 'modeled' into the desired utterance by the glottis, the tongue and the lips, amongst others. Thus, a speech signal can be seen as a signal evolving over time which is formed by certain invasions. In this research, it is proposed to use Evolutionary Stable Strategies (ESS), originating from the field of game theory, to model human speech and to accurately recognize speakers on a text-independent basis. Appendix A gives a detailed overview of how this theory is developed; here the general implementation of the algorithm is discussed. A solution is sought for the following two research problems:

1. Find an algorithm that, given an utterance of human speech, determines a fitness matrix, appropriate strategies and invasions so that the speech utterance is correctly defined by the resulting evolution of the population of the game.
2. Employ the result of goal 1 to achieve speaker recognition, text-independent if possible.

Since the filtering effects of the separate speech organs can hardly be distinguished, a lossless concatenated tube model (n-tube model [3][4]) is assumed for modeling the vocal tract instead. The n-tube model also allows sequential modeling of the speech utterance and thus solves the problem of parallel effects that occur in the vocal tract. In essence, the algorithm proceeds as follows:

1. Determine the number of tubes in the model and their respective equations.
2. Start filling out the fitness matrix:
   (a) Initially it contains the value 2 in position (1,1).
   (b) Determine the equation of the signal after applying the first filter.
   (c) Determine the elements of the next column of the fitness matrix.
   (d) Determine the correct invasion parameters so that the current signal becomes the desired signal as determined in (b).
   (e) Repeat steps (b) to (d) until the desired utterance is modeled (until all tubes have been passed).
3. Store the values from step 2 in a database format that includes the elements of the fitness matrix as well as the strategy information and the invasions.

To analyze the feasibility of this algorithm, steps (c) and (d) have to be examined more closely. Clearly (c) and (d) are mutually dependent, since the outcome of an invasion depends on the offspring parameters. Furthermore, it has to be determined what strategy to play in general and when to invade. Finally, an ESS that simplifies the entire process has to be incorporated. Assume that at every iteration it is decided to carry out a pure invasion; that is, at time step x+e the type of column x invades the existing population, or more concretely, at that point in time the game is played with strategy (0,1), where the 1 is for the type of column x. In that case, the elements of column x have to be chosen such that filling them out in equations (A.4) and (A.5) yields the correct population graph. Using an ESS helps determine at what exact time steps to carry out pure invasions, since the evolution of the population is then predetermined and thus known. It is desirable that playing (1,0), where the 1 is for the first element of the first column, is an ESS; therefore, all other elements in the fitness matrix must be smaller than 2.

To tackle the second research goal, it is important to know that the equations of the filters partially depend on the physical model of the speaker. The question is thus how to extract these parameters from the speech utterance so that the equations for the filters can be established.
3.6 Contradictions
Since three algorithms are employed in the single-speaker classification stage, their respective outcomes have to be checked for consistency. A list of contradictions allows the system to detect inconsistencies as well as indications of multiple speakers.

[Table: combinations of classifier outcomes and the resulting decisions (table not reproduced in this transcript).]

In this table, T-D denotes the text-dependent algorithms, while T-I denotes the text-independent algorithm. The system contains two text-dependent algorithms and one text-independent algorithm; the binary value for T-D is defined by the logical AND of T-D1 and T-D2.

4 Multiple speaker detection
In order to successfully classify multiple speakers in a speech clip, two use-cases should be analyzed. There are two main types:

1. Non-overlapping speech, where two or more speakers speak in different time frames (for example a dialogue).
2. Overlapping speech, where two or more speakers speak both in separate time frames and in the same time frames (for example a debate).

In this section a technique for each of these use-cases is discussed: Framed Multi-Speaker Classification for non-overlapping speech and the Harmonic Matching Classifier for overlapping speech. These two techniques are executed in parallel in the system and their results are combined in order to detect as many speakers as possible.

4.1 Framed multi speaker classification
The Framed Multi-Speaker (FMS) classification algorithm is used in the system to detect and classify multiple speakers in a speech signal; to do this, the whole signal is processed. The algorithm is used on dialogues and other non-overlapping speech clips. It uses the single-speaker classification techniques to detect each speaker.

Figure 12: FMS classification stages.

The algorithm works in three stages, as shown in figure 12 (a brief sketch follows this list):

1. FMS starts by erasing the pauses in the signal and uses this to frame the signal;
2. It loops over the frames and classifies each frame using the classification techniques discussed in the previous section. The text-dependent speaker classification as well as the text-independent classification algorithms are used, and a check for contradictions is done to classify the single speaker, as shown in figure 13;
3. Finally, FMS checks the results and extracts only the distinct speakers.

Figure 13: FMS classification, per-frame classification block.
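A compact sketch of stages 2 and 3, assuming stage 1 has already framed the signal and that classify_single wraps the single-speaker chain of section 3, returning a speaker label or None when no speaker is found.

def framed_multi_speaker(frames, classify_single):
    """FMS sketch: classify each frame with the single-speaker chain (stage 2),
    then keep only the distinct labels as the detected speakers (stage 3)."""
    labels = [classify_single(f) for f in frames]
    return sorted(set(l for l in labels if l is not None))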
4.2 Harmonic matching classifier
In order to enable the system to recognize speakers in multi-speaker speech fragments with overlapping speech, the Harmonic Matching Classifier (HMC) is used. The HMC was introduced by Radfar et al. in [5] and separates Unvoiced-Voiced (U-V) frames from Voiced-Voiced (V-V) frames in mixed speech.

[Table: the kind of speech uttered by each speaker in the different frame categories (table not reproduced in this transcript).]

U-V frames are useful in speaker recognition of mixed speech, since in such a frame the features of the voiced speaker dominate. Hence, it is possible to recognize a speaker for every such frame. However, before U-V frames can be separated from V-V frames, the U-U frames first have to be removed from the signal. To achieve this, an algorithm proposed by Bachu et al. [6] is employed, which uses energy and ZCR calculations to distinguish unvoiced frames from voiced frames; this unvoiced/voiced classification is based on heuristic methods.

HMC recognizes U-V frames by fitting a harmonic model, given by (1) below, to a mixed analysis frame and then evaluating the introduced error (2) against a threshold (3). This process is repeated for all frames of the mixed signal.

1. $H_{model} = \sum_{l=1}^{L(\omega_i)} A^2_{l\omega_i}\, W^2(\omega - l\omega_i)$
2. $e_t = \min_{\omega_i} \left|\, |X^t_{mix}(\omega)|^2 - H_{model} \,\right|$
3. $\sigma = \mathrm{mean}\left(\{e_t\}_{t=1}^{T}\right)$

where $\omega_i$ is the fundamental frequency and $W(\omega)$ is a window applied to the spectrum. The $X$ component of (2) denotes the spectrum of the t-th mixed signal frame.

After the U-V frames have been extracted from the mixed speech signal, they are passed to the Vector Quantization (VQ) block of the system, where every frame is matched against the relevant database and two speakers are finally recognized. The system is currently limited to recognizing at most 2 speakers from a mixed signal, which is an obvious consequence of the limitations of the methods used, especially the harmonic model fitting.
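A heavily simplified sketch of the U-V selection; the harmonic_template_for_f0 function (building H_model for a candidate fundamental frequency) is assumed to be given, and treating frames with a small error e_t as U-V is an inference from the text rather than a stated rule.

import numpy as np

def uv_frame_error(mix_power_spec, f0_candidates, harmonic_template_for_f0):
    """e_t: smallest total mismatch between a frame's power spectrum and a
    harmonic model built for each candidate fundamental frequency."""
    errors = [np.sum(np.abs(mix_power_spec - harmonic_template_for_f0(f0)))
              for f0 in f0_candidates]
    return min(errors)

def select_uv_frames(frame_errors):
    """Threshold at sigma, the mean error over all frames; frames that fit the
    harmonic model well (small e_t) are treated here as U-V frames."""
    sigma = float(np.mean(frame_errors))
    return [t for t, e in enumerate(frame_errors) if e < sigma]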
5 Test and Results
The output of the program when comparing the exact same speech file with an existing one; everything is classified perfectly:

START PHASE 1
Starting Endpoint Detection...
.ITL: 1.2448
..ITU: 6.2242
..IZCT: 220.5
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Trying to classify a different sound file (same person, same text). Once again, everything is classified and there are no contradictions:

START PHASE 1
Starting Endpoint Detection...
.ITL: 0.66714
.ITU: 3.3357
.IZCT: 120
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Saving file...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Classifying a poor quality sound file: VM and VQ classify it correctly, but DTW fails. The contradictions are verified and the final result is assigned correctly:

START PHASE 1
Starting Endpoint Detection...
.ITL: 0.2542
.ITU: 1.271
.IZCT: 120
.EnergyTotal: 387 elements
.RatesTotal: 387 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Child Female'
Final result: 'Child Female'

Classifying a poor quality sound file, this time DTW and VM classify it correctly, but VQ fails. The contradictions are verified and the final result is assigned correctly:
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.3699
.ITU: 6.8496
.IZCT: 120
.EnergyTotal: 421 elements
.RatesTotal: 421 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Skipping DWS and loading existing one...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Adult Male'
Final result: 'Adult Male'

6 Discussion
The developed system incorporates classical as well as novel techniques and is a combination of scientifically proven and heuristic methods. The techniques used for speech detection and noise reduction are well known and widely used in speech processing applications. The addition of spectral subtraction to this stage of the system is a novel touch that improves the accuracy of further steps.

In the processing and single-speaker classification stage, various DSP-related techniques have been combined with new research. Discrete Word Selection and Age/Sex Classification both rely on existing methods, but are used in an entirely new fashion in our implementation. Digital signal processing, which incorporates windowing, framing and frequency analysis (MFCC), on the other hand, consists of classical supporting techniques used to prepare the signal for further processing, as is customary in this kind of system. Working with pre-classification is very useful for larger databases and provides the user of the system with information about the speaker even if the system can find no match. Needless to say, the system relies heavily on the physical model of speech and the vocal tract to accomplish this for adult and child, male and female speakers.

For the actual classification, three algorithms have been selected that fit the requirements of the system best. An originally planned implementation of Extended Dynamic Time Warping (EDTW), however, had to be reduced to the simple Dynamic Time Warping implementation due to a lack of time. Extended Dynamic Time Warping applies dimensionality reduction algorithms like Principal Component Analysis before searching for a cost path, which would have improved the performance of the system.

The new research that the system incorporates, namely single-speaker, text-independent classification using Evolutionary Stable Strategies, is a very interesting technique that needs further development and testing before its actual use can be proven. Multi-speaker classification also uses a novel heuristic method (Framed Multi-Speaker Classification) for the recognition of multiple speakers in non-overlapping speech. The Harmonic Matching Classifier is a combination and adaptation of existing methods and is used for recognition in overlapping speech, which is a novelty in its own right that is not easily achieved.

Between several stages of the system, a considerable amount of logic has been incorporated to assure accurate processing of intermediate results. The most striking example of this logic is probably the technique employed to detect multiple speakers in a speech signal: it is implemented via the logical decoding of the results of the multiple classification algorithms. For this method to be accurate, a reasonable amount of input is necessary; the more classification algorithms the system has, the better the result will be. Hence, incorporating EDTW and possibly other classification algorithms, in addition to the existing ones, will prove useful for the switch to multi-speaker recognition, which is currently partially a task for the user to carry out manually.

7 Conclusion
In this paper several techniques to classify and detect single or multiple speakers are discussed. Used in conjunction and properly, these techniques help to identify one or more speakers. Tests of the system have shown that many existing algorithms have different purposes and can only classify a speaker if several conditions are met (for instance, text-dependent algorithms). Thus, to achieve the best results for the speaker classification problem, the algorithms should work together and their outputs should be checked for contradictions.
References

[1] Stevens, K.N., Acoustic Phonetics, MIT Press, 1998 (ISBN 0262692503).
[2] Kamran, M. and Bruce, I.C., Robust Formant Tracking for Continuous Speech with Speaker Variability, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 2, 2006.
[3] Fant, G., Acoustic Theory of Speech Production, Mouton, The Hague, 1960.
[4] Flanagan, J.L., Speech Analysis, Synthesis and Perception, Springer Verlag, Berlin, Heidelberg, 1972.
[5] Radfar, M.H., Sayadiyan, A. and Dansereau, R.M., A Generalized Approach for Model-Based Speaker-Dependent Single Channel Speech Separation, Iranian Journal of Science and Technology, Transaction B, Engineering, Vol. 31, No. B3, pp. 361-375, 2007.
[6] Bachu, R.G., Kopparthi, S., Adapa, B. and Barkana, B.D., Separation of Voiced and Unvoiced using Zero-Crossing Rate and Energy of the Speech Signal, American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.
[7] Hasan, Md. R., Jamil, M., Rabbani, Md. G. and Rahman, Md. S., Speaker Identification Using Mel Frequency Cepstral Coefficients, 2004.
[9] Goring, D., Orthogonal Wavelet Decomposition, 2006. Available: http://www.tideman.co.nz/Salalah/OrthWaveDecomp.html. Last accessed 21 January 2009.
[10] van Fleet, P.J., Discrete Wavelet Transformation, John Wiley & Sons, New Jersey, 2008, pp. 317-350.
[11] Keogh, E.J. and Pazzani, M.J., Derivative Dynamic Time Warping, 2000.
[12] Wang, D., Lu, L. and Zhang, H.-J., Speech Segmentation Without Speech Recognition, Microsoft Research Asia.
Appendix A: Using Evolutionary Stable Strategies to Model Human Speech

Let the air signal be called the signal s; it can then be modeled by an evolutionary game with a corresponding fitness matrix (the matrices of this appendix are not reproduced in this transcript). This matrix can be extended to contain the effects of the speech modeling, where g, t, l are the deformation signals of the glottis, the tongue and the lips, respectively. The question marks in the extended matrix represent the amount of deformation one signal evokes in another. This value obviously depends on the utterance, which leads to a first conclusion.

Conclusion 1: Evolutionary games can only be used to model discrete speech utterances. Practically, this means that this technique will be used to model isolated vowels and consonants.

To clarify the above, consider an evolutionary game consisting of a population of two types, i and j, with a given fitness matrix (not a bimatrix, since only player 1 gets offspring). Now plot the evolution of the population over time for the following strategies (or strategy pairs; player 1 and player 2 use the same strategy in each of the following cases). Note that for this game it is assumed that all possible relations occur during one generation (one element of the population has multiple inter- and intra-type relationships, where applicable). Also, no distinction is made between male and female elements; in fact, all elements are genderless.
Applying strategy (1,0) means that the entire population consists exclusively of type i. Since the offspring is equal to 2, the population will never grow beyond its initial size, namely 2. Strategy (0,1) yields a similar case, where the entire population consists exclusively of type j. However, the offspring size here is 4, hence the population will grow over time. The number of relationships that can (and will) occur at a certain point $t_x$ in time is

$$\sum_{n=1}^{P(t_{x-1})-1} n = 1 + 2 + 3 + \dots + (P(t_{x-1}) - 1),$$

which counts all possible combinations, except an element with itself and reversed combinations. This number of relationships can be calculated using the identity

$$1 + 2 + 3 + \dots + n = \frac{n(n+1)}{2},$$

which then yields equation (A.2). Finally, the population when using strategy $(\tfrac{1}{2}, \tfrac{1}{2})$ consists of 50% type i and 50% type j. Equation (A.3) is an extension of equation (A.2) that includes all possible relationships. The term $-2\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}$ cannot be simplified, because it originates from the identity mentioned above and hence a standard simplification would yield a wrong result. In this specific case equation (A.3) can be reduced to (A.3.1):

$$P(t_x) = \mathrm{offspring}(i,i)\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4} + \mathrm{offspring}(i,j)\left[\frac{(P(t_{x-1}) - 1)\,P(t_{x-1})}{2} - 2\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}\right] + \mathrm{offspring}(j,j)\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}$$

since $\tfrac{1}{2}\mathrm{offspring}(i,j) + \tfrac{1}{2}\mathrm{offspring}(j,i) = \mathrm{offspring}(i,j) = \mathrm{offspring}(j,i)$. Equation (A.3.1) can then be further reduced to (A.3.2):

$$P(t_x) = \mathrm{offspring}(j,i)\,\frac{(P(t_{x-1}) - 1)\,P(t_{x-1})}{2},$$

which equals equation (A.2), since in this case $\mathrm{offspring}(i,j) = \mathrm{offspring}(j,i) = \frac{\mathrm{offspring}(i,i) + \mathrm{offspring}(j,j)}{2}$.

Let us now consider the effect that an invasion would have on the population graph. As it happens, the pure strategy pair ((0,1),(0,1)) examined previously is an Evolutionary Stable Strategy (ESS), because (a) it is a Nash equilibrium and (b) i scores better against j than against itself. (Note that if dominated actions are removed from this game, only strategy (0,1) remains.)
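A small sketch of this growth rule under a pure strategy, where each generation is produced by all unordered pairs of the previous one.

def population_step(p, offspring):
    """One generation under a pure strategy: every unordered pair of the p
    current members produces `offspring` offspring (equation A.2)."""
    pairs = p * (p - 1) // 2          # 1 + 2 + ... + (p - 1)
    return offspring * pairs

# Strategy (1,0), offspring 2: the population stays at 2 (2 -> 2 -> 2 ...).
# Strategy (0,1), offspring 4: the population grows (2 -> 4 -> 24 -> 1104 ...).
p = 2
for _ in range(3):
    p = population_step(p, 4)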
Consider the same strategies (pairs) again, but now with a pure invasion at some moment in time. The general population function is given by equations (A.1), (A.2) and (A.3) respectively until $t_3$, and by (A.4) and (A.5), as detailed below, thereafter:

$$P(t_x) = \mathrm{offspring}(i,i)\,\frac{\bigl(F_i(t_{x-1})P(t_{x-1}) - 1\bigr)\,F_i(t_{x-1})P(t_{x-1})}{2} + \left(\frac{\mathrm{offspring}(i,j)}{2} + \frac{\mathrm{offspring}(j,i)}{2}\right)\left[\frac{(P(t_{x-1}) - 1)\,P(t_{x-1})}{2} - \frac{\bigl(F_i(t_{x-1})P(t_{x-1}) - 1\bigr)\,F_i(t_{x-1})P(t_{x-1})}{2} - \frac{\bigl(F_j(t_{x-1})P(t_{x-1}) - 1\bigr)\,F_j(t_{x-1})P(t_{x-1})}{2}\right] + \mathrm{offspring}(j,j)\,\frac{\bigl(F_j(t_{x-1})P(t_{x-1}) - 1\bigr)\,F_j(t_{x-1})P(t_{x-1})}{2} \quad \text{(A.4)}$$

with

$$A = \sum_{y=i,j}\frac{\bigl(F_y(t_{x-1})P(t_{x-1}) - 1\bigr)\,F_y(t_{x-1})P(t_{x-1})}{2}, \qquad B = F_{type}(t_{x-1})\,P(t_{x-1}), \qquad C = \frac{\mathrm{offspring}(i,j)}{2} + \frac{\mathrm{offspring}(j,i)}{2},$$

$$F_{type}(t_x) = \frac{\mathrm{offspring}(type,type)\,\frac{(B-1)\,B}{2} + C\,\dfrac{\frac{(P(t_{x-1})-1)\,P(t_{x-1})}{2} - A}{2}}{P(t_x)}, \qquad x = 1 \ldots \infty. \quad \text{(A.5)}$$

The general deformation function is defined by:

$$D(t_x) = \begin{cases} 0 & x < inv \\ P_{inv}(t_x) - P_{no\text{-}inv}(t_x) & x \ge inv \end{cases} \qquad x = 1 \ldots \infty.$$

Equation (A.4) consists of three components: the first calculates the number of possible combinations (and, after multiplication with the offspring factor, the offspring) of type i, the second covers the mixed combinations and the third the combinations of type j. Equation (A.5) is a function called from equation (A.4) and calculates the fraction (the ratio) of a certain type at a given moment in time. This is achieved by summing the offspring of the respective type and half of the mixed offspring, and dividing this sum by the population number. As can be seen from the 'Type Ratios' graphs, only in the case of an ESS does the evolution of the population restore and stabilize over time.
