ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering, Volume 1, Issue 4, June 2012

SPEAKER RECOGNITION IN NOISY ENVIRONMENT

Mr. Mohammed Imdad N (1), Dr. Shameem Akhtar N (1), Prof. Mohammad Imran Akhtar (2)
(1) Computer Science and Engineering Department, KBN College of Engineering, Gulbarga, India
(2) Electronics and Communication Department, AITM, Bhatkal

Abstract--- This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by noise. It describes a method that combines multi-condition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Such a technique is useful because it eases the problem of recognizing voice in noise; moreover, since the user is not required to remember a login password, there is no chance of the password being stolen.

Index Terms--- Cepstrum, missing-feature method, multi-condition model training, vector quantization

I. INTRODUCTION

Spoken language is the most natural way for humans to communicate information. The speech signal conveys several types of information. From the speech production point of view, the speech signal conveys linguistic information (e.g., message and language) and speaker information (e.g., emotional, regional, and physiological characteristics). From the speech perception point of view, it also conveys information about the environment in which the speech was produced and transmitted. Even though this wide range of information is encoded in a complex form in the speech signal, humans can easily decode most of it. Speech technology has found wide application in areas such as automatic dictation, voice command control, and audio archive indexing and retrieval.

The speech signal conveys several levels of information. Primarily, it conveys the words or message being spoken, but on a secondary level it also conveys information about the identity of the speaker. The area of speaker recognition is concerned with extracting the identity of the person speaking an utterance. As speech interaction with computers becomes more pervasive in activities such as telephone transactions and information retrieval from speech databases, the utility of automatically recognizing a speaker based on his or her vocal characteristics increases.

II. WORKING OF A SPEAKER RECOGNITION SYSTEM

Like most pattern recognition problems, a speaker recognition system can be partitioned into two modules: feature extraction and classification. The classification module has two components: pattern matching and decision. The feature extraction module estimates a set of features from the speech signal that represent some speaker-specific information. This speaker-specific information is the result of complex transformations occurring at different levels of speech production: semantic, phonologic, phonetic, and acoustic.

Speaker recognition covers two fields: speaker identification (SI) and speaker verification (SV). In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. There are two tasks: text-dependent and text-independent speaker identification. In text-dependent identification, the spoken phrase is known to the system, whereas in the text-independent case the spoken phrase is unknown. Success in both identification tasks depends on extracting and modeling the speaker-dependent characteristics of the speech signal, which can effectively distinguish one talker from another.

Figure 1: Generic speaker recognition system

The pattern matching module is responsible for comparing the estimated features to the speaker models. There are many types of pattern matching methods and corresponding models used in speaker recognition [13].

All Rights Reserved © 2012 IJARCSEE
Some of the methods include hidden Markov models (HMM), dynamic time warping (DTW), and vector quantization (VQ).

III. SPEAKER RECOGNITION PRINCIPLES

Depending on the application, the general area of speaker recognition can be divided into three specific tasks: identification, detection/verification, and segmentation and clustering. The goal of the speaker identification task is to determine which speaker out of a group of known speakers produced the input voice sample. There are two modes of operation related to the set of known voices: closed-set mode and open-set mode. In the closed-set mode, the system assumes that the voice to be identified must come from the set of known voices; otherwise, the system is in open-set mode. Closed-set speaker identification can be considered a multiple-class classification problem. In open-set mode, speakers that do not belong to the set of known voices are referred to as impostors. This task can be used for forensic applications; e.g., speech evidence can be used to recognize the perpetrator's identity among several known suspects.

In speaker verification, the goal is to determine whether a person is who he or she claims to be according to his or her voice sample. This task is also known as voice verification or authentication, speaker authentication, talker verification or authentication, and speaker detection. Speaker segmentation and clustering techniques are used in multiple-speaker recognition scenarios. In many speech recognition applications, it is often assumed that the speech from a particular individual is available for processing. When this is not the case, and the speech from the desired speaker is intermixed with that of other speakers, it is desirable to segregate the speech into segments from the individual speakers before the recognition process commences. The goal of this task is therefore to divide the input audio into homogeneous segments and then label them with speaker identities. Recently, this task has received more attention due to the increased inclusion of multiple-speaker audio, such as recorded news shows or meetings, in commonly used web searches and consumer electronic devices. Speaker segmentation and clustering is one way to index audio archives so as to make retrieval easier.

According to the constraints placed on the speech used to train and test the system, automatic speaker recognition can be further classified into text-dependent and text-independent tasks.

IV. SPEECH FEATURE EXTRACTION

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Figure 2: An example of a speech signal.

A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it will be used in this project.

The technique used for speech feature extraction makes use of MFCCs, which are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

V. Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 16000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion.

Figure 3: MFCC processor.
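The MFCC pipeline outlined above (short-time framing, windowing, FFT, a mel-scaled triangular filterbank that is roughly linear below 1000 Hz and logarithmic above, log compression, and a discrete cosine transform) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the frame length, hop size, filter count, and function names are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Frame -> Hamming window -> |FFT|^2 -> mel filterbank -> log -> DCT-II."""
    # Short-time framing: 400/160 samples = 25 ms windows every 10 ms at 16 kHz.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters equally spaced on the mel scale: approximately linear
    # below 1000 Hz and logarithmic above, as described in the text.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filterbank energies yields the cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energy @ basis.T
```

The 25 ms window with a 10 ms shift sits inside the 5-100 ms quasi-stationary range mentioned in Section IV; each row of the returned matrix is the cepstral feature vector for one frame.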
VI. Vector Quantization

Vector quantization (VQ) is a feature matching technique used in speaker recognition. The VQ approach is used here due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. Figure 4 shows a conceptual diagram illustrating this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown.

Figure 4: Conceptual diagram illustrating vector quantization codebook formation.

One speaker can be discriminated from another based on the location of the centroids. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resulting codewords (centroids) are shown in the figure by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors.

VII. SPEAKER MODELLING

Speaker modelling deals with designing speaker models for the recognition of voice. It mainly consists of two phases, training and testing, and both phases depend on feature extraction and parameter matching.

Let ø0 denote the training data set, containing clean speech data, for speaker S, and let p(X | S, ø0) represent the likelihood function of frame feature vector X associated with speaker S trained on data set ø0. In this paper, we assume that each frame vector X consists of N subband features: X = (x1, x2, ..., xN), where xn represents the feature for the nth subband. We obtain these by dividing the whole speech frequency band into N subbands and then calculating the feature coefficients for each subband independently of the other subbands. The subband feature framework has been used in speech recognition for isolating local frequency-band corruption and preventing it from spreading into the features of the other bands.

The proposed approach for modeling noise includes two steps. The first step is to generate multiple copies of the training set ø0 by introducing corruption of different characteristics into ø0. Primarily, we could add white noise at various signal-to-noise ratios (SNRs) to the clean training data to simulate the corruption. Assume that this leads to augmented training sets ø0, ø1, ..., øL, where øl denotes the lth training set derived from ø0 with the inclusion of a certain noise condition. Then a new likelihood function for the test frame vector can be formed by combining the likelihood functions trained on the individual training sets:

p(X | S) = Σ_{l=0}^{L} p(X | S, øl) P(øl | S)    (1)

where p(X | S, øl) is the likelihood function of frame vector X trained on set øl, and P(øl | S) is the prior probability of the occurrence of noise condition øl for speaker S. Equation (1) is a multicondition model. A recognition system based on (1) should have improved robustness to the noise conditions seen in the training sets øl, as compared to a system based on p(X | S, ø0).

The second step of the new approach is to make (1) robust to noise conditions not fully matched by the training sets øl, without assuming extra noise information. One way to do this is to ignore the heavily mismatched subbands and focus the score only on the matching subbands. Let X = (x1, x2, ..., xN) be a test frame vector and let Xl ⊂ X be the subset containing all the subband features corrupted as in noise condition øl. Then, using Xl in place of X as the test vector for each training noise condition, (1) can be redefined as

p(X | S) = Σ_{l=0}^{L} p(Xl | S, øl) P(øl | S)    (2)

where p(Xl | S, øl) is the marginal likelihood of the matching feature subset Xl, derived from p(X | S, øl) with the mismatched subband features ignored, to improve mismatch robustness between the test frame X and the training noise condition øl.

VIII. SPEAKER VERIFICATION

Speaker verification is the process of automatically verifying who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
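The codebook training and matching procedure of Section VI can be sketched as follows. This is a minimal NumPy illustration of LBG-style codeword splitting followed by k-means refinement, not the authors' code; the codebook size, perturbation factor, and iteration count are assumed values.

```python
import numpy as np

def lbg_codebook(vectors, M=8, eps=0.01, n_iter=20):
    """LBG-style training: start from the global centroid, repeatedly split
    every codeword into a perturbed pair, then refine the doubled codebook
    with k-means iterations until M codewords are obtained."""
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < M:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Assign each training vector to its nearest codeword.
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Move each codeword to the centroid of its cluster.
            for k in range(len(codebook)):
                members = vectors[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

def vq_distortion(vectors, codebook):
    """Average distance from each vector to its closest codeword."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Identification then picks the enrolled speaker whose codebook yields the smallest total distortion on the test utterance, e.g. `min(codebooks, key=lambda s: vq_distortion(test_vectors, codebooks[s]))`.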
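Equations (1) and (2) of Section VII can likewise be sketched with simple diagonal-Gaussian subband models. The Gaussian form of p(X | S, øl), and the way the matching subband subsets are supplied, are illustrative assumptions; the paper does not specify the underlying density model here.

```python
import numpy as np

def subband_likelihood(x, mean, var):
    """Diagonal-Gaussian likelihood of a vector of subband features."""
    return float(np.prod(np.exp(-0.5 * (x - mean) ** 2 / var) /
                         np.sqrt(2.0 * np.pi * var)))

def multicondition_score(x, models, priors, subsets):
    """Score one test frame x against one speaker's noise-condition models.

    models[l]  = (mean_l, var_l): model trained on the l-th training set.
    priors[l]  = P(condition l | speaker), the weights in Eq. (1).
    subsets[l] = indices of the subbands treated as matching condition l;
                 with every subband selected this is the multicondition
                 model of Eq. (1), and with mismatched subbands dropped it
                 is the missing-feature (marginalized) form of Eq. (2).
    """
    score = 0.0
    for l, (mean, var) in enumerate(models):
        idx = np.asarray(subsets[l])
        score += priors[l] * subband_likelihood(x[idx], mean[idx], var[idx])
    return score
```

For a diagonal Gaussian, ignoring a subband is exactly marginalization: the per-subband factors are independent, so dropping an index from `subsets[l]` removes that factor from the product.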
Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker produced a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature pattern matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure of identifying the unknown speaker by comparing the features extracted from his or her voice input with the ones from a set of known speakers.

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operation or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples.

IX. RESULTS

The experiment was conducted using three voice signals from each person under different levels of environmental noise. After passing the input speech through the microphone, feature vector transformation of the input voice took place for the purposes of testing and training. Snapshots of the experiment running and of the decision making for speaker identification and verification are shown below.

Snapshot 1. Four push buttons are present, named Add, Remove, Recognize, and Exit. The Add push button adds a voice to the database; similarly, the Remove push button removes a voice from the database.

Snapshot 2. An example of adding a voice named IMRAN1 using the top push button; this adds the voice sample of the respective user. After this, click the Record File push button.

Snapshot 3. A record-voice-signal prompt is displayed, asking permission to record the voice of the concerned user. Click the Yes push button to record the voice.

Snapshot 4. After recording the voice, a prompt for playing the voice signal is displayed. Click the Yes push button to play back the recorded voice.
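The speaker-specific verification threshold mentioned above can be sketched as follows, using the VQ-distortion score of Section VI. The mean-plus-k-standard-deviations rule for setting the threshold is an assumed heuristic for illustration, not taken from the paper.

```python
import numpy as np

def enroll_threshold(train_vectors, codebook, k=3.0):
    """Speaker-specific threshold from enrollment data: mean per-vector VQ
    distortion plus k standard deviations (assumed heuristic)."""
    d = np.linalg.norm(train_vectors[:, None, :] - codebook[None, :, :], axis=2)
    per_vec = d.min(axis=1)
    return per_vec.mean() + k * per_vec.std()

def verify(test_vectors, claimed_codebook, threshold):
    """Accept the identity claim if the average distortion of the test
    utterance against the claimed speaker's codebook is below threshold."""
    d = np.linalg.norm(test_vectors[:, None, :] - claimed_codebook[None, :, :], axis=2)
    distortion = d.min(axis=1).mean()
    return distortion <= threshold, distortion
```

Unlike identification, which compares against every enrolled codebook, verification scores only the claimed speaker's codebook and makes a binary accept/reject decision against that speaker's threshold.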
Snapshot 5. The time graph of the voice signal and the spectrum of the noise signal appear as two separate figures, showing the speech signal varying with time and, in the other figure, the noise added to it.

Snapshot 6. Click the Recognize push button to recognize the speaker and compare the frequency template of the speaker in the database with the present input speech signal.

Snapshot 7. A record-voice-signal prompt containing a Speak Now push button will appear. Click the Yes push button to record your voice for further comparison.

Snapshot 8. A playing-voice-signal prompt will appear. Click the Yes push button to play back your recorded voice.

Snapshot 9. Two separate figures then appear: the time graph of the voice signal and its spectrum.

Snapshot 10. A figure showing the match between the calculated codebook and the best-matching stored codebook (MFCC) appears.
Snapshot 11. Shows whose voice the input matches and the time taken, in seconds, to match the two voices.

Snapshot 12. Shows the decision that the input voice is not present in the database, and hence the speaker is not recognized.

X. CONCLUSION

Speaker recognition can be used to verify one's identity when the interface favors the use of a telephone or microphone. With proper expectations, planning and education, speaker verification has already proven to be a natural yet very secure solution for verifying one's identity. Voice analysis technology has been around for years, but applying it used to be very difficult. Now its benefits can be obtained without the complexity and overhead of managing gigabytes of voice reference data, dealing with advanced speech technology, and worrying about the legal issues involved.

1. This technique is used for speaker recognition and to identify the user from his or her voice.
2. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, and security control for confidential information areas.

REFERENCES

[1] J. Picone, Fundamentals of Speech Recognition, a short course, Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, Mississippi State University.
[2] M. H. Hayes, Digital Signal Processing, Schaum's Outline Series.
[3] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Processing, vol. 2, pp. 639-643, Oct. 1994.
[4] R. Mammone, X. Zhang and R. P. Ramachandran, "Robust speaker recognition - a feature-based approach," IEEE Signal Processing Magazine, pp. 58-71, Sep. 1996.
[5] H. A. Murthy, F. Beaufays, L. P. Heck and M. Weintraub, "Robust text-independent speaker identification over telephone channels," IEEE Trans. Speech Audio Processing, vol. 7, pp. 554-568, Sep. 1999.
[6] L. F. Lamel and J. L. Gauvain, "Speaker verification over the telephone," Speech Commun., vol. 31, pp. 141-154, 2000.
[7] G. R. Doddington et al., "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Commun., vol. 31, pp. 225-254, 2000.
[8] Y. Kao, P. Rajashekaran and J. Baras, "Free-text speaker identification over long distance telephone channel using phonetic segmentation," in Proc. IEEE ICASSP, 1992, pp. II-177-II-180.

Mohammed Imdad N received the B.E. in Electronics and Communication from VTU, Belgaum. He is presently pursuing the M.Tech. in Computer Science and Engineering from VTU, Belgaum, and is working on a project to develop speaker recognition in a noisy environment for his PG thesis under the guidance of Dr. Shameem Akhtar N.

Dr. Shameem Akhtar N received the B.E. in Computer Science and Engineering from Gulbarga University, the M.Tech. in Computer Science and Engineering from VTU, Belgaum, and the Ph.D. in digital image processing from Gitam University. She has more than 10 years of experience in teaching and research and is a life member of the Indian Society for Technical Education. She is an Assistant Professor in the Department of Computer Science and Engineering at KBN College of Engineering.

Mohammad Imran Akhtar received the B.E. in Information Technology from MG University, Kottayam, and the M.Tech. in Digital Communication and Networking from UBDT College of Engineering. He is an Assistant Professor in the Electronics and Communication Department of AITM, Bhatkal. His main research interests include speech processing and image processing.