ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 4, June 2012

SPEAKER RECOGNITION IN NOISY ENVIRONMENT

Mr. Mohammed Imdad N1, Dr. Shameem Akhtar N1, Prof. Mohammad Imran Akhtar2
1 Computer Science and Engineering Department, KBN College of Engineering, Gulbarga, India
2 Electronics and Communication Department, AITM, Bhatkal
Abstract--- This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by noise. It describes a method that combines multi-condition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Such a technique is useful because it addresses the problem of recognizing a voice in noise, and it is practical to deploy because the user is not required to remember a login password, so there is no password to steal.

Index Terms— Cepstrum, Missing-feature method, Multi-condition model training, Vector quantization

I. INTRODUCTION

Spoken language is the most natural way used by humans to communicate information. The speech signal conveys several types of information. From the speech production point of view, the speech signal conveys linguistic information (e.g., message and language) and speaker information (e.g., emotional, regional, and physiological characteristics). From the speech perception point of view, it also conveys information about the environment in which the speech was produced and transmitted. Even though this wide range of information is encoded in a complex form into the speech signal, humans can easily decode most of it. Speech technology has found wide application in areas such as automatic dictation, voice command control, and audio archive indexing and retrieval.

Speaker recognition comprises two fields: speaker identification (SI) and speaker verification (SV). In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. There are two tasks: text-dependent and text-independent speaker identification. In text-dependent identification, the spoken phrase is known to the system, whereas in the text-independent case, the spoken phrase is unknown. Success in both identification tasks depends on extracting and modeling the speaker-dependent characteristics of the speech signal, which can effectively distinguish one talker from another.

The speech signal conveys several levels of information. Primarily, it conveys the words or message being spoken, but on a secondary level it also conveys information about the identity of the speaker. The area of speaker recognition is concerned with extracting the identity of the person speaking an utterance. As speech interaction with computers becomes more pervasive in activities such as telephone transactions and information retrieval from speech databases, the utility of automatically recognizing a speaker from his or her vocal characteristics increases.

II. WORKING OF A SPEAKER RECOGNITION SYSTEM

Like most pattern recognition problems, a speaker recognition system can be partitioned into two modules: feature extraction and classification. The classification module has two components: pattern matching and decision. The feature extraction module estimates a set of features from the speech signal that represent some speaker-specific information. The speaker-specific information is the result of complex transformations occurring at different levels of speech production: semantic, phonologic, phonetic, and acoustic.

Figure 1: Generic speaker recognition system.

The pattern matching module is responsible for comparing the estimated features to the speaker models. There are many types of pattern matching methods and corresponding models used in speaker recognition [13]. Some of the methods include hidden Markov models (HMM), dynamic time warping (DTW), and vector quantization (VQ).
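As a minimal sketch of this two-module decomposition, the fragment below organizes a recognition pass as feature extraction followed by pattern matching and a decision. The function bodies and the toy models are placeholder assumptions for illustration, not the paper's implementation.

    import numpy as np

    def extract_features(signal):
        """Placeholder front end: one crude feature value per 10 ms frame."""
        frames = signal[: len(signal) // 160 * 160].reshape(-1, 160)
        return frames.std(axis=1, keepdims=True)  # stand-in for real features

    def recognize(signal, models):
        """Pattern matching: score features against each speaker model;
        decision: return the best-scoring speaker."""
        feats = extract_features(signal)
        scores = {name: float(np.mean((feats - ref) ** 2))
                  for name, ref in models.items()}
        return min(scores, key=scores.get)

    models = {"speaker1": 0.2, "speaker2": 0.8}  # toy reference models
    print(recognize(np.random.rand(16000), models))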
III. SPEAKER RECOGNITION PRINCIPLES

Depending on the application, the general area of speaker recognition can be divided into three specific tasks: identification, detection/verification, and segmentation and clustering. The goal of the speaker identification task is to determine which speaker out of a group of known speakers produced the input voice sample. There are two modes of operation related to the set of known voices: closed-set mode and open-set mode.

In the closed-set mode, the system assumes that the to-be-determined voice must come from the set of known voices; otherwise, the system is in open-set mode. Closed-set speaker identification can be considered a multiple-class classification problem. In open-set mode, speakers that do not belong to the set of known voices are referred to as impostors. This task can be used in forensic applications; e.g., speech evidence can be used to recognize the perpetrator's identity among several known suspects.

In speaker verification, the goal is to determine whether a person is who he or she claims to be according to his or her voice sample. This task is also known as voice verification or authentication, speaker authentication, talker verification or authentication, and speaker detection.

Speaker segmentation and clustering techniques are used in multiple-speaker recognition scenarios. In many speech recognition applications it is assumed that speech from a single individual is available for processing. When this is not the case, and the speech of the desired speaker is intermixed with that of other speakers, the audio must be segregated into per-speaker segments before the recognition process commences. The goal of this task is therefore to divide the input audio into homogeneous segments and then label them by speaker identity. Recently, this task has received more attention due to the increased presence of multiple-speaker audio, such as recorded news shows or meetings, in commonly used web searches and consumer electronic devices. Speaker segmentation and clustering is one way to index audio archives so as to make retrieval easier.

According to the constraints placed on the speech used to train and test the system, automatic speaker recognition can be further classified into text-dependent and text-independent tasks.

IV. SPEECH FEATURE EXTRACTION

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
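As a concrete sketch of such short-time analysis, the fragment below cuts a signal into overlapping windowed frames. The 25 ms frame and 10 ms hop are common choices assumed here for illustration; the paper does not specify its own analysis parameters.

    import numpy as np

    def frame_signal(x, fs=16000, frame_ms=25, hop_ms=10):
        """Split x into overlapping frames and apply a Hamming window."""
        flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
        n = 1 + max(0, (len(x) - flen) // hop)
        frames = np.stack([x[i * hop : i * hop + flen] for i in range(n)])
        return frames * np.hamming(flen)

    frames = frame_signal(np.random.randn(16000))  # 1 s of audio -> 98 frames
    print(frames.shape)                            # (98, 400)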
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it is used in this project.

Figure 2: An example of a speech signal.

The speech feature extraction technique used here is based on MFCCs, which exploit the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.
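The mapping usually used for this scale is the standard formula mel(f) = 2595 log10(1 + f/700); the paper does not state it explicitly, so it is supplied here as background. The short sketch below shows that 1000 Hz maps to about 1000 mel and that higher frequencies are compressed:

    import numpy as np

    def hz_to_mel(f_hz):
        """Standard mel-scale mapping: roughly linear below 1 kHz, logarithmic above."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    for f in (500, 1000, 2000, 4000):
        print(f"{f} Hz -> {hz_to_mel(f):.0f} mel")  # 607, 1000, 1521, 2146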
V. Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 16000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion.

Figure 3: MFCC processor.
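As one way to realize the processing chain of Figure 3, the fragment below computes 13 MFCCs per analysis frame. The librosa library and the file name "utterance.wav" are assumptions for illustration; the paper does not specify its implementation.

    import librosa

    # Load speech at 16 kHz and compute 13 mel-frequency cepstrum coefficients
    # per frame; the result has shape (13, number_of_frames).
    y, sr = librosa.load("utterance.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)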
VI. Vector Quantization

Vector quantization (VQ) is a feature matching technique used in speaker recognition. The VQ approach is used here due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. Figure 4 shows a conceptual diagram illustrating this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown.

Figure 4: Conceptual diagram illustrating vector quantization codebook formation.

One speaker can be discriminated from another based on the location of the centroids. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 4 as black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook, and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

After the enrollment session, the acoustic vectors extracted from a speaker's input speech provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors.
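The sketch below illustrates codebook training and the VQ-distortion decision rule described above, using SciPy's k-means as a stand-in for the LBG algorithm. The synthetic training data, the codebook size M = 16, and the 13-dimensional feature vectors (matching the MFCC count) are illustrative assumptions.

    import numpy as np
    from scipy.cluster.vq import kmeans, vq

    def train_codebook(train_vectors, M=16):
        """Cluster a speaker's training vectors into an M-codeword codebook."""
        codebook, _ = kmeans(train_vectors, M)
        return codebook

    def total_distortion(test_vectors, codebook):
        """Sum of distances from each test vector to its nearest codeword."""
        _, dists = vq(test_vectors, codebook)
        return dists.sum()

    def identify(test_vectors, codebooks):
        """Pick the speaker whose codebook gives the smallest total distortion."""
        return min(codebooks, key=lambda s: total_distortion(test_vectors, codebooks[s]))

    rng = np.random.default_rng(0)
    codebooks = {"speaker1": train_codebook(rng.normal(0, 1, (500, 13))),
                 "speaker2": train_codebook(rng.normal(2, 1, (500, 13)))}
    print(identify(rng.normal(2, 1, (50, 13)), codebooks))  # -> speaker2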
VII. SPEAKER MODELLING

This stage deals with designing the speaker models used for voice recognition. It consists of two phases, training and testing, and both phases depend on feature extraction and parameter matching.

Let ø0 denote the training data set containing clean speech data for speaker S, and let p(X | S, ø0) represent the likelihood function of frame feature vector X associated with speaker S, trained on data set ø0. In this paper, we assume that each frame vector X consists of N subband features, X = (x1, x2, ..., xN), where xn represents the feature for the nth subband. We obtain X by dividing the whole speech frequency band into N subbands and then calculating the feature coefficients for each subband independently of the other subbands. The subband feature framework has been used in speech recognition to keep local frequency-band corruption from spreading into the features of the other bands.

The proposed approach for modeling noise includes two steps. The first step is to generate multiple copies of the training set ø0 by introducing corruption of different characteristics into ø0. For example, we could add white noise at various signal-to-noise ratios (SNRs) to the clean training data to simulate the corruption. Assume that this leads to augmented training sets ø0, ø1, ..., øL, where øl denotes the lth training set derived from ø0 with the inclusion of a certain noise condition. Then, a new likelihood function for the test frame vector can be formed by combining the likelihood functions trained on the individual training sets:

p(X | S) = Σ_{l=0}^{L} p(X | S, øl) P(øl | S)        (1)

where p(X | S, øl) is the likelihood function of frame vector X trained on set øl, and P(øl | S) is the prior probability of the occurrence of noise condition øl for speaker S. Equation (1) is a multicondition model. A recognition system based on (1) should have improved robustness to the noise conditions seen in the training sets øl, as compared to a system based on p(X | S, ø0).
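A minimal numerical sketch of (1): the per-condition likelihoods p(X | S, øl) are represented by stand-in Gaussians (one per condition: clean plus two simulated SNR levels) and the priors P(øl | S) are taken as uniform. Both are assumptions for illustration, not trained models.

    import numpy as np
    from scipy.stats import norm

    # Stand-in likelihood functions p(X | S, phi_l) for one speaker S.
    condition_models = [norm(0.0, 1.0), norm(0.0, 1.5), norm(0.0, 2.0)]
    priors = np.full(len(condition_models), 1.0 / len(condition_models))  # P(phi_l | S)

    def multicondition_likelihood(x):
        """Equation (1): p(X | S) = sum_l p(X | S, phi_l) * P(phi_l | S)."""
        return sum(prior * model.pdf(x)
                   for model, prior in zip(condition_models, priors))

    print(multicondition_likelihood(0.5))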
The second step of the new approach is to make (1) robust to noise conditions not fully matched by the training sets øl, without assuming extra noise information. One way to do this is to ignore the heavily mismatched subbands and base the score only on the matching subbands. Let X = (x1, x2, ..., xN) be a test frame vector and let Xl ⊆ X be the subset containing all the subband features whose corruption matches noise condition øl. Then, using Xl in place of X as the test vector for each training noise condition, (1) can be redefined as

p(X | S) = Σ_{l=0}^{L} p(Xl | S, øl) P(øl | S)        (2)

where p(Xl | S, øl) is the marginal likelihood of the matching feature subset Xl, derived from p(X | S, øl) with the mismatched subband features ignored, thereby improving robustness to mismatch between the test frame X and the training noise condition øl.
VIII. SPEAKER VERIFICATION

Speaker verification is the process of automatically verifying who is speaking on the basis of individual information included in the speech waves. This technique makes it possible to use the speaker's voice to verify his or her identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker produced a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature pattern matching.

Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure of identifying the unknown speaker by comparing the features extracted from his or her voice input with those from a set of known speakers.

All speaker recognition systems serve two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operation or testing phase. In the training phase, each registered speaker has to provide samples of his or her speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples.
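A sketch of the verification decision against such a speaker-specific threshold, reusing the VQ-distortion score from Section VI. The threshold value would in practice be tuned on the training samples; everything named here is an illustrative assumption.

    from scipy.cluster.vq import vq

    def verify(test_vectors, claimed_codebook, threshold):
        """Accept the identity claim if the average VQ distortion of the test
        vectors against the claimed speaker's codebook is below the threshold."""
        _, dists = vq(test_vectors, claimed_codebook)
        score = float(dists.mean())
        return score <= threshold, score

    # usage (hypothetical): accepted, score = verify(feats, codebooks["speaker1"], 1.2)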
IX. RESULTS

The experiment was conducted using three voice signals from each person, recorded under different levels of environmental noise. After the input speech was captured through a microphone, the input voice was transformed into feature vectors for the purposes of training and testing. Snapshots of the experiment running, and of the decisions made for speaker identification and verification, are shown below.

Snapshot 1. The main window has four push buttons, named Add, Remove, Recognize, and Exit. The Add push button adds a voice to the database; similarly, the Remove push button removes a voice from the database.

Snapshot 2. An example of adding a voice named IMRAN1 using the top push button, which adds the voice sample of the respective user. After this, click the Record File push button.

Snapshot 3. A Record Voice Signal prompt is displayed, asking permission to record the voice of the concerned user. Click the Yes push button to record the voice.

Snapshot 4. After the voice has been recorded, a Playing Voice Signal prompt is displayed. Click the Yes push button to play back the recorded voice.
Snapshot 5. The time graph of the voice signal and the spectrum of the noise signal then appear as two separate figures, one showing the speech signal varying with time and the other the noise added to it.

Snapshot 6. Click the Recognize push button to recognize the speaker and compare the frequency templates of the speakers in the database with the present input speech signal.

Snapshot 7. A Record Voice Signal prompt containing a Speak Now push button will appear. Click the Yes push button to record your voice for further comparison.

Snapshot 8. A Playing Voice Signal prompt will appear. Click the Yes push button to play back your recorded voice.

Snapshot 9. Two separate figures then appear: the time graph of the voice signal and its spectrum.

Snapshot 10. The figure shows the match between the calculated codebook (MFCC) and the best-matching stored codebook.
Snapshot 11. Shows whose voice the input matches and the time taken, in seconds, to match the two voices.

Snapshot 12. Shows the decision that the input voice is not present in the database, so the speaker is not recognized.

X. CONCLUSION

Speaker recognition can be used to verify a person's identity when the interface favors the use of a telephone or microphone. With proper expectations, planning, and education, speaker verification has already proven to be a natural yet very secure way to verify identity. Voice analysis technology has been around for years, but applying it used to be tougher than rocket science. Now its benefits can be obtained without the complexity and overhead of managing gigabytes of voice reference data, dealing with advanced speech technology, and worrying about the legal issues involved.

1. This technique is used for speaker recognition, identifying the user from his or her speech.
2. This technique makes it possible to use the speaker's voice to verify identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, and security control for confidential information areas.

REFERENCES

[1] J. Picone, "Fundamentals of Speech Recognition: A Short Course," Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, Mississippi State University.
[2] M. H. Hayes, Digital Signal Processing, Schaum's Outline Series.
[3] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Processing, vol. 2, pp. 639-643, Oct. 1994.
[4] R. Mammone, X. Zhang and R. P. Ramachandran, "Robust speaker recognition - a feature-based approach," IEEE Signal Processing Magazine, pp. 58-71, Sep. 1996.
[5] H. A. Murthy, F. Beaufays, L. P. Heck and M. Weintraub, "Robust text-independent speaker identification over telephone channels," IEEE Trans. Speech Audio Processing, vol. 7, pp. 554-568, Sep. 1999.
[6] L. F. Lamel and J. L. Gauvain, "Speaker verification over the telephone," Speech Commun., vol. 31, pp. 141-154, 2000.
[7] G. R. Doddington, et al., "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Commun., vol. 31, pp. 225-254, 2000.
[8] Y. Kao, P. Rajashekaran and J. Baras, "Free-text speaker identification over long distance telephone channel using phonetic segmentation," in Proc. IEEE ICASSP, 1992, pp. II-177-II-180.

Mohammed Imdad N received the B.E. in Electronics and Communication from VTU, Belgaum. He is presently pursuing the M.Tech in Computer Science and Engineering from VTU, Belgaum, and is working on speaker recognition in noisy environments for his PG thesis under the guidance of Dr. Shameem Akhtar N.
Dr. Shameem Akhtar N received the B.E. in Computer Science and Engineering from Gulbarga University, the M.Tech in Computer Science and Engineering from VTU, Belgaum, and the Ph.D. in digital image processing from Gitam University. She has more than 10 years of experience in teaching and research. She is a life member of the Indian Society for Technical Education. She is an Assistant Professor in the Department of Computer Science and Engineering at KBN College of Engineering.

Mohammad Imran Akhtar received the B.E. in Information Technology from MG University, Kottayam. He completed the M.Tech in Digital Communication and Networking at UBDT College of Engineering. He is an Assistant Professor in the Electronics and Communication department of AITM, Bhatkal. His main research interests include speech processing and image processing.