Contributed Paper
Manuscript received October 15, 2009. 0098-3063/09/$20.00 © 2009 IEEE

A Voice Trigger System using Keyword and Speaker Recognition for Mobile Devices

Hyeopwoo Lee, Sukmoon Chang, Member, IEEE, Dongsuk Yook, Member, IEEE, and Yongserk Kim

Abstract — Voice activity detection plays an important role in an efficient voice interface between humans and mobile devices, since it can be used as a trigger to activate the automatic speech recognition module of a mobile device. If the input speech signal can be recognized as a predefined magic word coming from a legitimate user, it can be utilized as a trigger. In this paper, we propose a voice trigger system using a keyword-dependent speaker recognition technique. The voice trigger must be able to perform keyword recognition, as well as speaker recognition, without using computationally demanding speech recognizers, to properly trigger a mobile device with low computational power consumption. We propose a template based method and a hidden Markov model (HMM) based method for the voice trigger to solve this problem. Experiments using a Korean word corpus show that the template based method performed 4.1 times faster than the HMM based method. However, the HMM based method reduced the recognition error by 27.8% relative to the template based method. The proposed methods are complementary and can be used selectively depending on the device of interest.1

Index Terms — Voice trigger, keyword recognition, speaker recognition, dynamic time warping, vector quantization, Gaussian mixture model, hidden Markov model.

I. INTRODUCTION

The burgeoning of handheld mobile devices and home appliances in recent decades has provided us with unprecedented convenience in various daily activities and communications. However, the use of these devices requires a high level of user attention, as well as dexterity, to activate and operate them.
For example, many communication devices are equipped with multiple buttons and/or touch-sensitive screens. A user must manipulate the small buttons or touch-sensitive screens to activate and operate such devices. While seemingly trivial, this method is often difficult to use, requiring the user's careful attention, especially when the user is simultaneously carrying out another activity, such as driving a car. Moreover, the dexterity required to manipulate such user interfaces prevents users who may have little or no motor control capability from unassisted use of the devices [1][2].

This issue has been alleviated to some degree by the use of speech recognition systems for the hands-free activation and operation of the devices. When a voice signal is detected, the speech recognizer is triggered to process the signal. The detection of voice signals can be performed by traditional voice activity detection (VAD) methods [3]. This approach, however, raises several concerns.

1 This work was supported by the Korea Research Foundation (KRF) grant funded by the Korea government (MEST) (No. 2009-0077392). It was also supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2009-C1090-0902-0007). Hyeopwoo Lee and Dongsuk Yook (corresponding author) are with the Speech Information Processing Laboratory, Department of Computer and Communication Engineering, Korea University, Seoul, 136-701, Republic of Korea. They would like to thank Samsung Electronics for their cooperation. Sukmoon Chang is with Pennsylvania State University, Middletown, PA 17057, USA. Yongserk Kim is with the Acoustic Technology Center, Samsung Electronics Co., Ltd., Suwon, 443-742, Republic of Korea.
Note that the devices are expected to be used in real-world environments where much of the speech around them is not directed to them. Although traditional VAD methods are effective for isolating voice signals from noise, they cannot differentiate the voice signals of a legitimate user from others. This causes the speech recognizer to be activated frequently to perform unnecessary tasks [4]. It is desirable to prevent such frequent activation on small mobile devices with a limited power supply. The system should be activated only when a voice signal from a legitimate user is detected [5]. Furthermore, due to their computational cost, full-fledged speech recognizers are unsuited to devices with limited computing power. In summary, to work effectively on a device with a limited power supply and computing power, the voice trigger system must have a small computational cost and be able to detect only the keywords uttered by legitimate users, without a fully featured speech recognizer.

Fig. 1. A voice trigger system for mobile devices. The magic word represents the registered keyword uttered by the authorized speaker.

The voice trigger system can be viewed as a keyword-dependent speaker verification problem, as shown in Fig. 1. To properly trigger a mobile device using voice signals, the system must recognize the registered keywords (magic words), as well as the speaker of the voice signals, from a short
utterance of about one second, without a full-fledged speech recognizer. When an unregistered keyword is uttered or the voice of an impostor is detected, the voice trigger system should reject the signal. Thus, a voice trigger system may consist of two components, i.e., keyword recognition and speaker recognition [6].

Hidden Markov models (HMMs) have been widely used to register words for keyword recognition [7][8]. However, to register the keywords using HMMs, the system must be provided with the voice signals of the keywords along with their labels. That is, the user must not only speak the keywords but also register them using an input device, such as a keyboard, preventing the voice trigger system from being a truly hands-free system.

Methods based on the Gaussian mixture model-universal background model (GMM-UBM) [9], as well as support vector machines [10], have been widely used for speaker recognition. Although these methods were shown to produce good performance in the NIST (National Institute of Standards and Technology) speaker recognition evaluation [11], they have relatively high computational costs and are text-independent. Vector quantization based methods have been developed to reduce the computational costs of the speaker recognition task [12][13]. These methods explicitly create the speaker model as a codebook through a vector quantization procedure. They produce relatively good performance with low computational cost. However, since they lack a background model, and thus a normalization process, their performance degrades rapidly when used in a condition different from that of the training data collection, e.g., different microphones and environments.

These approaches cannot be reliably used as a voice trigger in small devices, due to the high computational costs of the GMM-UBM based methods and the performance degradation of the vector quantization based methods.
In this paper, we propose new methods that address these issues: a template based method and an HMM based method. The template based method is proposed to overcome the performance degradation of the vector quantization method in different environments. This is achieved by adding a background model built with the vector quantization method, enabling the normalization of the voice and noise signals. The HMM based method is proposed to adapt the GMM-UBM based method to small devices with limited computational power. In the proposed HMM based method, registration can be made with only the voice signals of the keywords, without requiring the labels of the uttered keywords. The proposed voice trigger system consists of two steps, as shown in Fig. 2. The keyword recognition step requires two acoustic models: the keyword model and the garbage model. The speaker recognition step requires the speaker model and the background model. We introduce the proposed method in two phases, i.e., the registration phase and the verification phase, since the models are generated during the keyword and user registration process and are then used for keyword and user verification.

The remainder of this paper is organized as follows. The template based method and the HMM based method are introduced in Sections II and III, respectively. Section IV gives experimental results. Section V concludes the paper.

II. TEMPLATE BASED METHOD

The template based method is a simple way of performing keyword and speaker recognition. The well-known pattern matching algorithm termed dynamic time warping (DTW) [14] may be used as a voice trigger. The DTW algorithm does not require any specific knowledge other than the feature vectors of the registration speech data. The DTW method measures the distance between the input and the registered data and shows relatively good performance when used in the same condition as the one under which the registration data were collected.
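As a sketch of this matching step, the DTW distance between an input feature sequence and a registered template can be computed as follows. The Euclidean local cost and the standard symmetric step pattern are assumptions made here for illustration, not details taken from the paper:

```python
# Minimal DTW distance between two feature-vector sequences.
# Local cost: Euclidean distance; step pattern: match/insert/delete.
import numpy as np

def dtw_distance(x, y):
    """x: (T1, D) array, y: (T2, D) array -> accumulated DTW cost."""
    t1, t2 = len(x), len(y)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local distance
            # extend the cheapest of the three allowed predecessors
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[t1, t2]
```

An input sequence identical to the template yields a distance of zero, and the distance grows as the sequences diverge, which is the property the verification scores below rely on.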
However, the performance of the DTW method degrades when the testing condition changes; for example, when different types of microphones are used in different environments. The main cause of the performance degradation is the lack of models that can be used to normalize voice data and noise. To overcome this weakness, the proposed method generates the acoustic models as codebooks, using a vector quantization scheme, for the normalization.

Fig. 2. Block diagram of a voice trigger system comprising keyword recognition and speaker recognition modules.

IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 2009

A. Registration Phase

The four models in Fig. 2, generated during the keyword and user registration phase, are:

• Garbage model: The garbage model is generated as a codebook by the k-means clustering algorithm [15] using a
large amount of speech data in advance, as shown in Fig. 3 (top row). The garbage model represents all words and describes the general acoustic space.

• Keyword model: The keyword model represents the registered keyword. The model is generated using the codebook of the garbage model and the registration data, as shown in Fig. 3 (bottom row). For each feature vector of the registration data, the best matching codeword is selected from the codebook, and a new vector sequence is generated by replacing the feature vectors with the selected codewords.

• Speaker model: The speaker model represents the acoustic space of the registered speaker. The model uses the feature vectors of the registration data of the user directly, without generating any specific model.

• Background model: The background model represents the voice data of all speakers other than the registered speaker. We assume that each codeword in the codebook roughly corresponds to a context-dependent sub-word unit. Under this assumption, since the keyword model generated above has some speaker-independent properties, the background model simply reuses the keyword model.

B. Verification Phase

To verify the registered keyword and the user, we first calculate the DTW score of the input voice data against the models (keyword and speaker):

S_m = DTW(x_{1..T}, m),  (1)

where x_{1..T} represents the input voice data and m represents one of the acoustic models except the garbage model. The score of the garbage model is calculated as the sum, over all input feature vectors, of the distance between each vector and its minimum-distance codeword.

We apply a two-step procedure that performs the keyword and speaker recognition in sequence to determine whether the input voice data is the registered keyword spoken by the registered user:

S_keyword − S_garbage > θ_1,  (2)

S_speaker − S_background > θ_2.  (3)

The input voice data is determined to be the registered keyword when the score difference between the keyword model and the garbage model exceeds a threshold θ_1. Similarly, if the score difference between the speaker model and the background model exceeds a threshold θ_2, the system determines the input voice data to be the voice of the registered speaker.

Another decision method is motivated by the fact that the background model simply uses the same model as the keyword model. Thus, by adding (2) and (3), we obtain a simpler one-step verification procedure:

S_speaker − S_garbage > θ_3,  (4)

where θ_3 (= θ_1 + θ_2) is a threshold. Although the one-step procedure is faster than the two-step procedure, it can only be used when it is safe to assume that the model scores have similar distributions. Otherwise, the one-step procedure will most likely fail. For example, if S_keyword − S_garbage is relatively large but S_speaker − S_background is not, it is possible that (4) is satisfied but (3) is not. As we will examine in more detail in Section IV, these two procedures may be used with consideration of the tradeoff between time and accuracy.

III. HMM BASED METHOD

Although the template based method uses the four acoustic models to address the performance degradation issue of the conventional vector quantization method, the problem cannot be completely overcome. We propose a voice trigger system based on HMMs to obtain more reliable performance.

A. Registration Phase

The four acoustic models used in the HMM based method are generated as follows:

• Garbage model: The garbage model is represented using a GMM, a well-known approach in speaker recognition tasks:

Σ_{k=1}^{M} w_k N(x; μ_k, Σ_k),  (5)

where M is the number of Gaussian probability density functions (PDFs) in the mixture, w_k is the weight of the k-th component Gaussian PDF, x is the input feature vector, and N(x; μ_k, Σ_k) represents a Gaussian PDF with mean vector μ_k and covariance matrix Σ_k.
The GMM is trained in advance on a large amount of speech data using the expectation-maximization (EM) algorithm.

Fig. 3. The garbage model (codebook) and the keyword model generation procedure in the template based method. The feature vectors of the keyword voice data are replaced by the best matching codewords in the codebook.
• Keyword model: We propose a pseudo phoneme keyword HMM generation algorithm that produces a speaker-independent keyword HMM, so that a keyword can be registered without any transcription. If the registration data consists of a single utterance, the algorithm works as follows (see also Fig. 4):

Step 1: For each input vector x_t (t = 1 … T), calculate the log likelihood S_{k,t} with each Gaussian of the GMM in the garbage model:

S_{k,t} = log w_k N(x_t; μ_k, Σ_k).  (6)

Step 2: Select the top N Gaussian PDFs with the largest S_{k,t} and build a Gaussian index table G_{g,t} (g = 1 … N and t = 1 … T), which contains the top N Gaussian PDF indices for each feature vector. Fig. 4 shows an example of a Gaussian index table with N = 3 and T = 6.

Step 3: Cluster the columns of the table based on the Gaussian with the largest S_{k,t} in each column and on the distance between adjacent columns (i.e., adjacent times). The Bhattacharyya distance, which measures the similarity of two Gaussian PDFs, can be used to cluster the adjacent columns of the Gaussian index table [16].

Step 4: For each cluster, select the top N Gaussians with the largest S_{k,t} from the cluster and assign them to a state of the keyword HMM. The corresponding Gaussian weights are also assigned to the state and normalized.

Step 5: The states generated in Step 4 are concatenated to form a left-to-right HMM with self transitions.

If the registration data consists of more than one utterance, the pseudo phoneme keyword HMM generation method is modified as follows. For each utterance, we follow Steps 1 through 3 of the single utterance case to build a clustered Gaussian index table. Once the tables are built for all utterances, we find the median number of clusters amongst the tables. Then, by adjusting the clustering threshold in Step 3, we re-cluster the tables so that the number of clusters in each table equals the median number of clusters found previously. If a table cannot be re-clustered to have the median number of clusters, we simply ignore the table.
We then select the top N Gaussians with the largest S_{k,t} from the clusters with the same cluster index across the tables and assign them to a state of the keyword HMM. Finally, the Gaussian weights and the transition probabilities are assigned in the same way as in Steps 4 and 5 of the single utterance case. Note that the clustering of the feature vectors is performed based on their time index. That is, the pseudo phoneme keyword HMM generation algorithm incorporates the time information of the keyword voice data into the model. Assuming that each Gaussian in the original GMM roughly models a context-dependent sub-word unit, the keyword model generated in this way has some speaker-independent properties.

• Speaker model: As mentioned previously, the speaker model represents the acoustic model of a registered speaker, whilst the keyword model is speaker-independent. We build the speaker model by adapting the keyword model using the registration data from the speaker. The adaptation is performed using the maximum a posteriori (MAP) based method [17]. We adapt only the means of the component Gaussians, since our experimental results show that good performance is achieved using the mean adaptation alone.

• Background model: As in the template based method, the background model simply uses the keyword model, since we assume that the keyword model generated above has some speaker-independent characteristics.

B. Verification Phase

To verify the registered keyword and user, we first calculate the log likelihood S_m:

S_m = log p(x_{1..T}; m),  (7)

where x_{1..T} represents the input voice data and m represents each of the four acoustic models. S_m can be calculated using the forward algorithm [18]. The verification procedures are the same as those in the template based case.

Fig. 4. Pseudo phoneme keyword HMM and speaker HMM generation procedure. In the Gaussian index table in Step 3, the columns with the same color represent one cluster.
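To make Steps 1–3 concrete, the following sketch builds the per-frame Gaussian log scores of (6), the top-N Gaussian index table, and a time clustering of adjacent frames. Two simplifying assumptions are made here: the GMM has diagonal covariances, and adjacent frames are merged whenever they share the same best-scoring Gaussian, standing in for the paper's Bhattacharyya-distance criterion:

```python
# Sketch of Steps 1-3 of the pseudo phoneme keyword HMM generation.
import numpy as np

def log_gauss(x, mu, var):
    # log N(x; mu, diag(var)) for one frame and one diagonal Gaussian
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def gaussian_index_table(frames, weights, means, variances, n_best):
    # Step 1: S_{k,t} = log w_k N(x_t; mu_k, Sigma_k) for all k, t
    scores = np.array([[np.log(w) + log_gauss(x, m, v)
                        for w, m, v in zip(weights, means, variances)]
                       for x in frames])            # shape (T, M)
    # Step 2: top-N Gaussian indices per frame (one table column per t)
    table = np.argsort(-scores, axis=1)[:, :n_best]
    return scores, table

def cluster_frames(table):
    # Step 3 (simplified): one cluster per run of adjacent frames
    # whose best-scoring Gaussian is the same; each run becomes a state.
    best = table[:, 0]
    states, start = [], 0
    for t in range(1, len(best) + 1):
        if t == len(best) or best[t] != best[start]:
            states.append((start, t))  # frames [start, t) form a state
            start = t
    return states
```

Each returned frame range would then supply the top-N Gaussians (and normalized weights) for one state of the left-to-right keyword HMM, as in Steps 4 and 5.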
IV. EXPERIMENTS

We performed experiments using a Korean word corpus to evaluate the performance of the proposed methods. The corpus consists of six words spoken by 30 people (23 males and 7 females), each repeating each word ten times. The words contain 4–11 phonemes. We used half of the corpus to train the models and the remaining half as test data. That is, out of the ten repetitions of each word spoken by each person, we used five for keyword and user registration. The remaining five were used for testing, along with other words spoken by the same person, as well as words spoken by other people. Table I shows the composition of a test set. Each test set contains five acceptance trials (i.e., the same keyword spoken by the same person as in the training data) and fifteen rejection trials (i.e., different keywords and/or a different person). The rejection trial data were selected randomly from the corpus. We repeated this experiment 180 (= 30 × 6) times.

TABLE I
COMPOSITION OF EACH TEST SET
Index | Keyword | User | Label
1-5 (acceptance trials) | Same | Same | SK-SU
6-10 (rejection trials) | Same | Different | SK-DU
11-15 (rejection trials) | Different | Same | DK-SU
16-20 (rejection trials) | Different | Different | DK-DU

The voice data were partitioned into a sequence of 25 millisecond frames with a 10 millisecond advance. A Hamming window was applied to each frame. Twelve-dimensional mel-frequency cepstral coefficients (MFCCs), log energy, and their first and second order time derivatives were used as the feature vector. The equal error rate (EER), which is the error rate measured when the false alarm rate and the false rejection rate are equal, was used as the performance measure.

The garbage model for the template based method (i.e., the codebook) and the garbage model for the HMM based method (i.e., the GMM) were trained on the Korean Standard 2001 corpus, a phonetically rich collection of 16433 words spoken by 200 people. We varied the size of the GMM to analyze the effect of the number of Gaussians: 1024, 2048, 4096, and 8192 Gaussians. The codebooks were trained with the k-means algorithm; each codebook contained 2048 codewords.

Fig. 5 shows the experimental results for various values of the clustering threshold in Step 3 of Section III-A, which controls the number of HMM states in the pseudo phoneme keyword HMM generation algorithm. There were 2048 Gaussians in the GMM. The results of the template based method using 2048 codewords are also shown in the figure as a reference. In this figure, five training data sets were used to generate the keyword and speaker models. The verification decision was made by (4). On average, there were 27 to 38 states in the keyword HMMs generated from the 2048-Gaussian GMM in Fig. 5. Ten Gaussians were selected (i.e., N = 10) during Steps 2 and 4 of the keyword model generation process. As the clustering threshold increases, meaning that the clustering criterion relaxes, the verification performance degrades. That is, as the clustering criterion relaxes, the number of clusters in the Gaussian index table becomes smaller; in turn, the small number of states in the HMM degrades the keyword and speaker representation power of the model. On average, a threshold of 4.5 showed the best performance. We use this value in the remaining experiments.

Fig. 5. Performance comparison of the template based method with 2048 codewords and the HMM based method with 2048 GMM Gaussians.

Fig. 6. Comparison of the execution time of the template based method and the HMM based method.

The execution time of the two methods, run on a 500 MHz CPU ultra mobile personal computer (UMPC), is shown in Fig. 6. The verification decision was made by (4). With the template based method, it takes roughly real-time to register a keyword and a user, and much less than real-time for recognition. In contrast, the HMM based method takes almost 4.5 times real-time for registration and two times real-time for recognition. The template based method is thus much faster than the HMM based method. However, the HMM based method
reduced the verification error by 25.5% compared to the template based method.

Fig. 7 shows the effect of the number of Gaussians in a state of the HMM. As the number of Gaussians increased, the performance improved. However, beyond 15 Gaussians, the performance can degrade due to excessive representation power. Thus, the remaining experiments were conducted with 15 Gaussians per HMM state to acquire the best result.

The relationship among the EER, the execution time, and the number of Gaussians is shown in Fig. 8. Although the number of Gaussians is an important performance factor, the computation times of registration and recognition are also critical factors for small mobile devices, such as cellular phones. The computation time increases almost linearly with the number of Gaussians in the GMM. The case using 8192 Gaussians achieved the highest recognition accuracy. Considering the computation time along with the EER, however, 2048 Gaussians achieved a better overall result than 8192 Gaussians.

The next experiments were performed with different numbers of registration utterances for generating the keyword model and adapting the speaker model, in order to show the effect of the amount of registration data. Fig. 9 shows the results. As expected, a larger amount of training data achieves better performance than a smaller amount. Thus, the amount of registration data is an important issue for digital devices that use a voice trigger.

Note that, until now, only the results of the one-step decision procedure have been presented. Table II analyzes the error rates contributed by each configuration of the test set. The configuration labels used in the table were explained in Table I. As shown in Table II, errors in keyword recognition, i.e., 'DK-SU' and 'DK-DU', contributed a large portion of the overall error rate.
Since the decision procedure in (4) consists of only one step, its classification ability may be degraded, especially when the four DTW scores have large discrepancies. Therefore, the one-step decision procedure should only be used when the four DTW scores can be assumed to come from similar distributions, and after careful consideration of the tradeoff between execution time and accuracy.

Finally, to compensate for the error in keyword verification, the two-step decision procedure in (2) and (3) was tested using the HMM based method with the 2048-Gaussian GMM, as well as the template based method with 2048 codewords.

TABLE II
ERROR PORTION
Label | SK-SU | SK-DU | DK-SU | DK-DU
Error (%) | 25.8 | 35.5 | 23.3 | 15.4

Fig. 7. EER as a function of the number of Gaussians for the HMM based method.

Fig. 8. Performance of the registration and recognition time for various sizes of the GMM in the HMM based method. The circular marks represent the EER, the diamond marks the real-time factor of the registration process, and the triangular marks the real-time factor of the recognition process.

Fig. 9. Performance as a function of the number of registration words for the 2048-Gaussian GMM.
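For illustration, the one-step and two-step decision rules in (2)–(4) can be sketched as below. The threshold values in the usage example are placeholders, not the empirically determined ones:

```python
# One-step vs. two-step verification decisions over the model scores.

def two_step_accept(s_keyword, s_garbage, s_speaker, s_background,
                    theta1, theta2):
    # (2) keyword check, then (3) speaker check; both must pass
    return (s_keyword - s_garbage > theta1 and
            s_speaker - s_background > theta2)

def one_step_accept(s_speaker, s_garbage, theta3):
    # (4): since the background model reuses the keyword model,
    # adding (2) and (3) collapses them to a single comparison
    return s_speaker - s_garbage > theta3
```

With placeholder thresholds theta1 = theta2 = 2 (so theta3 = 4), scores S_garbage = 0, S_keyword = S_background = 5, S_speaker = 6 satisfy (4) but fail (3): the one-step rule accepts while the two-step rule rejects, illustrating the failure mode described above.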
The two-step decision procedure requires all four models to be evaluated, rather than only two models as in the one-step decision scheme. The two thresholds, θ_1 and θ_2, were determined empirically so that the two-step decision procedure is applied when the miss ratio of the keyword recognition is less than 1.0%. Table III shows the EER results of both approaches. The two-step decision procedure decreased the EER by 2.0% in the HMM based method and by 3.6% in the template based method, compared to the one-step decision scheme. In addition, the HMM based method decreased the recognition error rate by 27.8% compared to the template based method (from 13.2% to 9.5%). Thus, the HMM based method can be used as a more reliable voice trigger for home appliances and digital devices, while the template based method provides a fast voice trigger at the expense of accuracy.

TABLE III
PERFORMANCE COMPARISON OF THE DECISION METHODS IN THE TEMPLATE BASED AND HMM BASED METHODS
System | One step (%) | Two steps (%) | Improvement (%)
Template based | 13.7 | 13.2 | 3.6
HMM based | 9.7 | 9.5 | 2.0

V. CONCLUSION

This paper proposed voice trigger systems that use keyword and speaker recognition techniques to provide a hands-free interface to small mobile devices. The proposed methods, the template based method and the HMM based method, do not require a speech recognizer to register keywords and users. Unlike the traditional GMM speaker recognition method, they also utilize the temporal constraints of the voice signals. These methods generate the required models and make a verification decision. The experiments using a Korean word corpus show the effectiveness of the proposed methods. The two proposed methods are complementary: although the template based method is faster than the HMM based method, the performance of the HMM based method is much better. Therefore, the proposed voice trigger systems for keyword and speaker verification can be used selectively, taking into account the tradeoff between speed and accuracy, depending on the device of interest.

REFERENCES
[1] M. Matsuda, T. Nonaka, and T. Hase, "AV control method using natural language understanding," IEEE Trans. Consum. Electron., vol. 52, no. 3, pp. 990-997, 2006.
[2] H.-C. Huang, T.-C. Lin, and Y.-M. Huang, "A smart universal remote control based on audio-visual device virtualization," IEEE Trans. Consum. Electron., vol. 55, no. 1, pp. 172-178, 2009.
[3] A. Davis, S. Nordholm, and R. Togneri, "Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold," IEEE Trans. Speech Audio Process., vol. 14, no. 2, pp. 412-424, 2006.
[4] H. Chung and I. Chung, "Memory efficient and fast speech recognition system for low-resource mobile devices," IEEE Trans. Consum. Electron., vol. 52, no. 3, pp. 792-796, 2006.
[5] M. Ji, S. Kim, H. Kim, and H.-S. Yoon, "Text-independent speaker identification using soft channel selection in home robot environment," IEEE Trans. Consum. Electron., vol. 54, no. 1, pp. 140-144, 2008.
[6] Y. R. Oh, J. S. Yoon, M. Kim, and H. K. Kim, "A name recognition based call-and-come service for home robots," IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 247-253, 2008.
[7] J. Wilpon, L. Rabiner, C. Lee, and E. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 11, pp. 1870-1878, 1990.
[8] E. Lleida and R. Rose, "Utterance verification in continuous speech recognition: decoding and training procedures," IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 126-139, 2000.
[9] D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19-41, 2000.
[10] W. Campbell, J. Campbell, and D. Reynolds, "Support vector machines for speaker and language recognition," Computer Speech and Language, vol. 20, pp. 210-229, 2006.
[11] G. Doddington, M. Przybocki, A. Martin, and D. Reynolds, "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Comm., vol. 31, pp. 225-254, 2000.
[12] T. Kinnunen, E. Karpov, and P. Franti, "Real-time speaker identification and verification," IEEE Trans. Audio, Speech, Language Process., vol. 14, pp. 277-288, 2006.
[13] V. Hautamaki, T. Kinnunen, I. Karkkainen, J. Saastamoinen, M. Tuononen, and P. Franti, "Maximum a posteriori adaptation of the centroid model for speaker verification," IEEE Signal Process. Lett., vol. 15, pp. 162-165, 2008.
[14] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 1, pp. 43-49, 1978.
[15] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Comm., vol. COM-28, no. 1, pp. 84-95, 1980.
[16] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99-110, 1943.
[17] J. Gauvain and C. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291-298, 1994.
[18] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.

Hyeopwoo Lee received the B.S. and M.S. degrees in computer and communication engineering from Korea University, Seoul, Korea, in 2006 and 2008, respectively. He is currently in the Ph.D. program at the Speech Information Processing Laboratory at Korea University. His research interests are speech and speaker recognition.

Sukmoon Chang received the M.S. degree in computer science from Indiana University, Indiana, USA, in 1995 and the Ph.D. degree in computer science from Rutgers University, New Jersey, USA, in 2002. He worked on image and signal processing at the Center for Computational Biomedicine Imaging and Modeling, Rutgers University, from 2002 to 2004. He is a professor of Computer Science in the School of Science, Engineering, and Technology, Pennsylvania State University. His research interests include image and signal processing and machine learning. Dr. Chang is a member of IEEE.

Dongsuk Yook received the B.S. and M.S. degrees in computer science from Korea University, Korea, in 1990 and 1993, respectively, and the Ph.D. degree in computer science from Rutgers University, New Jersey, USA, in 1999. He worked on speech recognition at the IBM T.J. Watson Research Center, New York, USA, from 1999 to 2001. He is a professor in the Department of Computer and Communication Engineering, Korea University, Korea. His research interests include machine learning and speech processing. Dr. Yook is a member of IEEE.
Yongserk Kim received the B.S. degree in electronics engineering from Sungkyunkwan University, Korea, in 1983. He has been working on audio processing and telecommunications since 1983. He was awarded an honorary Ph.D. degree from Samsung Electronics in 2002. Currently, he is a vice president and the director of the Acoustic Technology Center, Samsung Electronics Co., Ltd.