The document summarizes techniques for text-independent speaker recognition from audio signals. It discusses the principles of automatic speaker recognition, including identification and verification. The key steps are voice recording, feature extraction using MFCC, building a reference model for each speaker, matching input features against the models, and making an identification or verification decision. Feature extraction involves framing the audio, windowing, the FFT, mapping frequencies to the mel scale, and taking the DCT to produce cepstral coefficients.
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a model pretrained by Google for state-of-the-art NLP tasks.
BERT has the ability to take into account the syntactic and semantic meaning of text.
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA JOURNAL)
This paper proposes the combined methods of the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected values of the feature vectors of Indonesian syllables. This research aims to find the best properties, in effectiveness and efficiency, for performing feature extraction of each syllable sound, to be applied in speech recognition systems. This proposed approach, which builds on the previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain by using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, and some recommendations on the most effective and efficient WT to use in performing syllable sound recognition.
BERT - Part 1 Learning Notes of Senthil Kumar (Senthil Kumar M)
In this part 1 presentation, I have attempted to provide a '30,000 feet view' of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model in NLP, with high-level technical explanations. I have attempted to collate useful information about BERT from various useful sources.
Improving the Efficiency of Spectral Subtraction Method by Combining it with ... (IJORCS)
In the field of speech signal processing, the spectral subtraction method (SSM) has been successfully used to suppress noise that is added acoustically. SSM does reduce the noise to a satisfactory level, but musical noise is a major drawback of this method. To implement the spectral subtraction method, transformation of the speech signal from the time domain to the frequency domain is required. On the other hand, the wavelet transform displays another aspect of the speech signal. In this paper we apply a new approach in which SSM is cascaded with a wavelet thresholding technique (WTT) to improve the quality of the speech signal by removing the problem of musical noise to a great extent. Results of the proposed system have been simulated in MATLAB.
An Optimized Transform for ECG Signal Compression (IDES Editor)
A significant feature of the coming digital era is the exponential increase in digital data obtained from various signals, especially biomedical signals such as the electrocardiogram (ECG), electroencephalogram (EEG), electromyogram (EMG), etc. How to transmit or store these signals efficiently becomes the most important issue. A digital compression technique is often used to solve this problem. This paper proposes a comparative study of transform-based approaches for ECG signal compression. An adaptive threshold is applied to the transformed coefficients. The algorithm is tested on 10 different records from the MIT-BIH arrhythmia database and obtains a percentage root mean difference of around 0.528% to 0.584% for compression ratios of 18.963:1 to 23.011:1 for the DWT. Among the DFT, DCT and DWT techniques, the DWT has proven to be very efficient for ECG signal coding. Further improvement in the CR is possible with efficient entropy coding.
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the pronunciation-to-pronunciation phonetic distance, together with a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step to estimate the critical distance parameter using transcribed data, and a second step that uses this critical distance criterion to classify the input utterances into pronunciation variants and OOV words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. A confusion matrix and the precision, recall and accuracy performance metrics are used for the performance evaluation. Experimental results show significant performance improvement over existing classifiers.
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES (acijjournal)
In this paper we report an experiment carried out on a recently collected speaker recognition database, namely the Arunachali Language Speech Database (ALS-DB), to make a comparative study of the performance of acoustic and prosodic features for the speaker verification task. The speech database consists of speech data recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue. The collected database is evaluated using a Gaussian mixture model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency Cepstral Coefficients (MFCC) along with its derivatives. The performance of the system has been evaluated for the acoustic and prosodic features individually as well as in combination. It has been observed that the acoustic feature, when considered individually, provides better performance than the prosodic features. However, if the prosodic features are combined with the acoustic feature, the system outperforms both systems in which the features are considered individually. There is nearly a 5% improvement in recognition accuracy with respect to the system where acoustic features are considered individually, and nearly a 20% improvement with respect to the system where only prosodic features are considered.
This is my presentation at a journal club. It is based on the article "Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners". You can find all the references in the slides at the end. I review very basic techniques in noise reduction, and how these techniques are implemented with deep neural networks.
Voice recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
This document describes how to build a simple, yet complete and representative automatic speaker recognition system. Such a speaker recognition system has potential in many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line to verify their identity. By checking the voice characteristics of the input utterance, using an automatic speaker recognition system similar to the one that we will describe, the system is able to add an extra level of security.
Realization and design of a pilot assist decision making system based on spee... (csandit)
A system based on speech recognition is proposed for pilot assist decision-making. It is based on a HIL aircraft simulation platform and uses the microcontroller SPCE061A as the central processor to achieve better reliability and higher cost-effectiveness. Technologies of LPCC (linear predictive cepstral coding) and DTW (Dynamic Time Warping) are applied for isolated-word speech recognition to gain a smaller amount of calculation and better real-time performance. Besides, we adopt PWM (Pulse Width Modulation) regulation technology to effectively regulate each control surface by speech, and thus to assist the pilot in making decisions. By trial and error, it is proved that we have a satisfactory accuracy rate of speech recognition and control effect. More importantly, our paper provides a creative idea for intelligent human-computer interaction and applications of speech recognition in the field of aviation control. Our system is also very easy to extend and apply.
Design and implementation of different audio restoration techniques for audio... (eSAT Journals)
Abstract
Audio signals are corrupted by many types of distortions. Major audio distortions are categorized into globalized and localized distortions. Localized distortions include clipping and clicks, where only certain samples are affected, and globalized distortions include broadband noise, where the complete bandwidth is consumed by noise. Audio restoration is a technique for recovering the audio signal from these distortions. In this paper, audio restoration techniques for removing clipping, clicks and broadband noise are put forward. Recent approaches to the audio restoration problem are based on sparse representation algorithms. Clipping distortion is addressed within a sparse representation framework: it is treated as an inverse problem, where the distorted samples are estimated from the surrounding undistorted samples, embedded in a frame-based scheme, and reconstructed by using an overlap-add method in conjunction with the OMP algorithm and a Gabor/DCT dictionary for modelling audio signals. Broadband denoising is done by using spectral subtraction, and click removal is done by using an adaptive filter method as the first step. Performance measures are based on perception, average SNR calculation and defined parameter variations. This paper also targets the software and hardware implementation of the restoration methods using the TMS320C6713 DSK kit with the help of tools, mainly MATLAB and Code Composer Studio.
Key Words: Audio Distortions, OMP algorithm, Gabor/DCT dictionary, TMS320C6713 DSK
Speaker Recognition System using MFCC and Vector Quantization Approach (ijsrd.com)
This paper presents an approach to speaker recognition using frequency spectral information with the Mel frequency to improve speech feature representation in a Vector Quantization codebook based recognition approach. The Mel frequency approach extracts the features of the speech signal to get the training and testing vectors. The VQ codebook approach uses the training vectors to form clusters and recognizes accurately with the help of the LBG algorithm.
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas... (ijceronline)
Speech Recognition Systems (SRS) have been implemented on various processors, including digital signal processors (DSPs) and field programmable gate arrays (FPGAs), and their performance has been reported in the literature. The fundamental purpose of speech is communication, i.e., the transmission of messages. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. The recognition of speech requires feature extraction and classification. Systems that use speech as input require a microcontroller to carry out the desired actions. In this paper, the Cypress Programmable System on Chip (PSoC) has been studied and used for the implementation of an SRS. Of all the available PSoCs, the PSoC5, containing an ARM Cortex-M3 as its CPU, is used. The noise is first removed from the speech signals using LogMMSE filtering. These signals are then sent to the PSoC5, where the speech is recognized and the desired actions are performed.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique, performing gender recognition before speaker recognition so as to improve the accuracy and timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
Isolated words recognition using MFCC, LPC and neural network (eSAT Journals)
Abstract: Automatic speech recognition is an important topic of speech processing. This paper presents the use of an Artificial Neural Network (ANN) for isolated word recognition. Pre-processing is done, and voiced speech is detected based on energy and zero crossing rate (ZCR). The approach used in speech recognition is Mel Frequency Cepstral Coefficients (MFCC) and combined features of both MFCC and Linear Predictive Coding (LPC). Back-propagation is used as the classifier. The recognition accuracy increases when the combined features of both LPC and MFCC are used, compared to the MFCC-only approach, with a neural network as the classifier. Keywords: Pre-processing, Mel Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC), Artificial Neural Network (ANN).
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T... (IJCSEA Journal)
Speech is the most natural way of information exchange. It provides an efficient means of man-machine communication using speech interfacing. Speech interfacing involves speech synthesis and speech recognition. Speech recognition allows a computer to identify the words that a person speaks into a microphone or telephone. The two main components normally used in speech recognition are a signal processing component at the front end and a pattern matching component at the back end. In this paper, a setup that uses Mel frequency cepstral coefficients at the front end and artificial neural networks at the back end has been developed to perform experiments for analyzing speech recognition performance. Various experiments have been performed by varying the number of layers and the type of network transfer function, which helps in deciding the network architecture to be used for acoustic modelling at the back end.
Suppression of noise in noisy speech signals is required in many speech enhancement applications, such as signal recording and transmission from one place to another. In this paper a novel single-line noise cancellation system is proposed using a derivative of the normalized least mean square algorithm. The proposed system has two phases. The first phase is the generation of a secondary reference signal from the incoming primary signal itself, at the initial silence period and at pauses between words, which is essential when an adaptive filter is used as a noise canceller. The second phase is noise cancellation using the proposed modified error data normalized step size (EDNSS) algorithm. The performance of the proposed algorithm is compared with the normalized least mean square (NLMS) algorithm and the original EDNSS algorithm using the standard IEEE sentence (SP23) of the Noizeus database, with different types of real-world noise at different signal-to-noise ratio (SNR) levels. The outputs of the proposed, NLMS and EDNSS algorithms are measured with output SNR, excess mean square error (EMSE) and misadjustment (M). The results clearly illustrate that the proposed algorithm gives improved results over the conventional NLMS and EDNSS algorithms. The speed of convergence is also maintained, the same as the conventional NLMS algorithm.
A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model (IDES Editor)
In this paper, we address the speaker-independent recognition of the Chinese number speeches 0-9 based on HMMs. Our former results for inside and outside testing achieved 92.5% and 76.79% respectively. To further improve the performance, two important features of speech, the MFCC and the cluster number of vector quantification, are unified together and evaluated over various values. The best performance achieves 96.2% and 83.1% with MFCC number = 20 and VQ clustering number = 64.
GENDER RECOGNITION SYSTEM USING SPEECH SIGNAL (IJCSEIT Journal)
In this paper, a system developed for speech encoding, analysis, synthesis and gender identification is presented. A typical gender recognition system can be divided into a front-end system and a back-end system. The task of the front-end system is to extract the gender-related information from a speech signal and represent it by a set of vectors called features. Features like power spectral density and the frequency at maximum power carry speaker information. The features are extracted using the Fast Fourier Transform (FFT) algorithm. The task of the back-end system (also called the classifier) is to create a gender model to recognize the gender from his/her speech signal in the recognition phase. This paper also presents the digital processing of speech signals (pronounced "A" and "B") taken from 10 persons, 5 of them male and the rest female. The power spectrum estimate of the signal is examined, and the frequency at maximum power of the English phonemes is extracted from the estimated power spectrum. The system uses a threshold technique as the identification tool. The recognition accuracy of this system is 80% on average.
1. “Development of Some Techniques for Text-Independent Speaker Recognition from Audio Signals”
By Bidhan Barai
Under the guidance of Dr. Nibaran Das and Dr. Subhadip Basu, Assistant Professors of Computer Science & Engineering, Jadavpur University, Kolkata – 700 032
3. Introduction
● Speaker recognition is the identification of a person from characteristics of voices (voice biometrics). It is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said).
● In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification.
4. Types of Speaker Identification
● Text-Dependent: If the text must be the same for enrollment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique.
● Text-Independent: Text-independent systems are most often used for speaker identification, as they require very little if any cooperation by the speaker. In this case the text during enrollment and test is different.
5. Types of Speaker Identification
● Closed-Set: It is assumed that the speaker is in the database. In closed-set identification, the audio of the test speaker is compared against all the available speaker models and the speaker ID of the model with the closest match is returned. The result is the best-matched speaker.
● Open-Set: The speaker may not be in the database. Open-set identification may be viewed as a combination of closed-set identification and speaker verification. The result can be a speaker or a no-match result.
6. Principles of Automatic Speaker Recognition
● Speaker recognition can be classified into identification and verification.
● Speaker identification is the process of determining which registered speaker provides a given utterance.
● Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.
● The following figures show the basic structures of speaker identification and verification systems. The system that we will describe is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
7. Principles of Automatic Speaker Recognition ... Contd.
Figure 1: Block diagram of a speaker recognition system.
8. Principles of Automatic Speaker Recognition ... Contd.
● Speaker recognition:
Figure 2: Block diagram of speaker identification. The input speech passes through feature extraction; the features are scored for similarity against the reference model of each speaker (#1, ..., #N), and maximum selection yields the identification result (speaker ID).
9. Principles of Automatic Speaker Recognition ... Contd.
● Speaker verification:
Figure 3: Block diagram of speaker verification. Features extracted from the input speech are scored for similarity against the reference model of the claimed speaker (speaker ID #M); the score is compared against a threshold to give the verification result (accept/reject).
10. Principles of Automatic Speaker Recognition ... Contd.
● All speaker recognition systems operate in two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operational or testing phase.
● In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples.
● In the testing phase, the input speech is matched with the stored reference model(s) and a recognition decision is made.
12. Step 1: Voice Recording
● The speech input is typically recorded at a sampling rate above 10000 Hz (10 kHz).
● This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans.
● This sampling rate (10 kHz) is determined by the Nyquist sampling theorem.
13. Step 2: Speech Feature Extraction
● The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis. This is often referred to as the signal-processing front end.
● The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken.
● Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
14. Speech Feature Extraction...Contd
Examples of speech signals: A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Group Delay Features (GDF) and others. MFCC is perhaps the best known and most popular, and will be described in this project.
Figure 4 and Figure 5: example speech signals.
15. Speech Feature Extraction...Contd
● Mel-Frequency Cepstrum Coefficients processor: A block diagram of the structure of an MFCC processor is given in Figure 6.
16. Speech Feature Extraction...Contd
● Steps of extracting Feature from Speech Signal:
1> Pre-emphasis
2> Frame Blocking
3> Windowing
4> Fast Fourier Transform (FFT)
5> Mel-frequency Wrapping
6> Cepstrum: Logarithmic Compression and Discrete Cosine Transform (DCT)
17. Speech Feature Extraction...Contd
● Pre-emphasis: In speech processing, the original signal usually has too much low-frequency energy, and processing the signal to emphasize the high-frequency energy is necessary. To perform pre-emphasis, we choose some value α between 0.9 and 1. Then each value in the signal is re-evaluated using the formula
y[n] = x[n] − α x[n−1], where 0.9 < α < 1.
This is a first-order high-pass filter.
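As an illustration (not part of the original slides), a minimal NumPy sketch of this first-order high-pass filter could look as follows; alpha = 0.97 is an assumed, commonly used value within the 0.9-1 range above:

import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    return np.append(x[0], x[1:] - alpha * x[:-1])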
19. Speech Feature Extraction...Contd
● Frame Blocking: The input speech signal is segmented into frames of 20~30 ms with an optional overlap of 1/3~1/2 of the frame size. Usually the frame size (in terms of sample points) is equal to a power of two in order to facilitate the use of the FFT. If this is not the case, we need to zero-pad to the nearest power-of-two length.
● Windowing: Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame. If the signal in a frame is denoted by s(n), n = 0, ..., N−1, then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming window defined by
w(n, α) = (1 − α) − α cos(2πn/(N−1)), 0 ≤ n ≤ N−1
Different values of α correspond to different curves for the Hamming window, shown next.
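A small sketch of frame blocking plus Hamming windowing; the frame length of 256 samples (about 25.6 ms at the 10 kHz rate above, and a power of two as suggested) and the 50% overlap are illustrative assumptions:

import numpy as np

def frame_and_window(x, frame_len=256, hop=128):
    # Split the signal into overlapping frames
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Hamming window with alpha = 0.46: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * w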
22. Speech Feature Extraction...Contd
● Fast Fourier Transform (FFT): The Discrete Fourier Transform (DFT) of a discrete-time signal x(nT) is given by
X(k) = ∑_{n=0}^{N−1} x[n] e^{−j(2π/N)nk}, k = 0, 1, ..., N−1
where x(nT) = x[n].
23. Speech Feature Extraction...Contd
● If we let e^{−j(2π/N)} = W_N, then
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}
Figure 10: a sampled signal (amplitude vs. sample index) and its frequency-domain magnitude (vs. normalised frequency).
24. Speech Feature Extraction...Contd
● x[n] = x[0], x[1], ..., x[N−1]
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}, 0 ≤ k ≤ N−1   [1]
Let us divide the sequence x[n] into even and odd sequences:
x[2n] = x[0], x[2], ..., x[N−2]
x[2n+1] = x[1], x[3], ..., x[N−1]
25. Speech Feature Extraction...Contd
● Equation 1 can be rewritten as:
X(k) = ∑_{n=0}^{N/2−1} x[2n] W_N^{2nk} + ∑_{n=0}^{N/2−1} x[2n+1] W_N^{(2n+1)k}   [2]
Since:
W_N^{2nk} = e^{−j(2π/N)2nk} = e^{−j(2π/(N/2))nk} = W_{N/2}^{nk}  and  W_N^{(2n+1)k} = W_N^{k} · W_{N/2}^{nk}
Then:
X(k) = ∑_{n=0}^{N/2−1} x[2n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x[2n+1] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)
26. Speech Feature Extraction...Contd
● The result is that an N-point DFT can be divided into two N/2-point DFTs:
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}, 0 ≤ k ≤ N−1   (N-point DFT)
● where Y(k) and Z(k) are the two N/2-point DFTs operating on the even and odd samples respectively:
X(k) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)   (two N/2-point DFTs)
27. Speech Feature Extraction...Contd
● The periodicity and symmetry of W can be exploited to simplify the DFT further:
X(k) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk}
X(k + N/2) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{n(k+N/2)} + W_N^{k+N/2} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{n(k+N/2)}   [3]
Symmetry: W_N^{k+N/2} = e^{−j(2π/N)k} e^{−j(2π/N)(N/2)} = e^{−j(2π/N)k} e^{−jπ} = −e^{−j(2π/N)k} = −W_N^{k}
Periodicity: W_{N/2}^{k+N/2} = e^{−j(2π/(N/2))k} e^{−j(2π/(N/2))(N/2)} = e^{−j(2π/(N/2))k} = W_{N/2}^{k}
28. Speech Feature Extraction...Contd
● Finally, by exploiting the symmetry and periodicity, Equation 3 can be written as:
X(k + N/2) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} − W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk} = Y(k) − W_N^{k} Z(k)   [4]
● Hence the complete equations for finding the FFT are:
X(k) = Y(k) + W_N^{k} Z(k), k = 0, ..., N/2 − 1
X(k + N/2) = Y(k) − W_N^{k} Z(k), k = 0, ..., N/2 − 1
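These two butterfly equations translate directly into a recursive radix-2 FFT. The sketch below is an illustration, not the slides' own code; it requires the input length to be a power of two (which is why frames are zero-padded to a power-of-two length above) and can be checked against numpy.fft.fft:

import numpy as np

def fft_radix2(x):
    # Implements X(k) = Y(k) + W_N^k Z(k) and X(k + N/2) = Y(k) - W_N^k Z(k)
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    Y = fft_radix2(x[0::2])                          # N/2-point DFT of even samples
    Z = fft_radix2(x[1::2])                          # N/2-point DFT of odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors W_N^k
    return np.concatenate([Y + W * Z, Y - W * Z])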
30. Speech Feature Extraction...Contd
● Mel-frequency Wrapping: Psychophysical studies have shown that human perception of the frequency content of sounds does not follow a linear scale. That research has led to the concept of subjective frequency, i.e., the perceived frequency of a sound, defined as follows. For each sound with an actual frequency f, measured in Hz, a subjective frequency is measured on a scale called the "mel scale". The mel frequency can be approximated by
Mel(f) = 2595 log10(1 + f/700)
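The mapping and its inverse are one-liners; this small sketch (function names are mine, for illustration) is used again inline when the filter bank is built below. As a check, hz_to_mel(1000) evaluates to roughly 1000, consistent with the scale being roughly linear up to 1000 Hz:

import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    # Inverse of the mapping above
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)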
32. Speech Feature Extraction...Contd
● In the mel-frequency scale, there is linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
● Triangular Filter Bank: The human ear acts essentially like a bank of overlapping band-pass filters, and human perception is based on the mel scale. Thus, the approach to simulating human perception is to build a filter bank with bandwidths given by the mel scale, pass the magnitudes of the spectra through these filters, and obtain the mel-frequency spectrum.
33. Speech Feature Extraction...Contd
● Equally spaced mel values.
● We define a triangular filter bank with M filters (m = 1, 2, ..., M), where H_m[k] is the magnitude (frequency response) of the m-th filter, given by:
H_m(k) = 0                                  for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))     for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))     for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                  for k > f(m+1)
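A sketch of one common way to realize this filter bank, mapping the edge frequencies f(m−1), f(m), f(m+1), equally spaced on the mel scale, onto FFT bins; the parameter values (26 filters, 512-point FFT, 10 kHz sampling) are illustrative assumptions, not values from the slides:

import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=10000):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)    # Mel(f)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse
    # n_filters + 2 edge points, equally spaced on the mel scale up to Nyquist
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of triangle m
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of triangle m
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H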
35. Speech Feature Extraction...Contd
● Given the FFT of the input signal x[n]:
X[k] = ∑_{n=0}^{N−1} x[n] e^{−j2πnk/N}, 0 ≤ k ≤ N
● The values of the FFT are weighted by the triangular filters. The result is called the mel-frequency power spectrum, which is defined as:
S[m] = ∑_{k=1}^{N} |X_a[k]|² H_m[k], 0 < m ≤ M
where |X_a[k]|² is called the power spectrum.
36. Speech Feature Extraction...Contd
● Schematic diagram of the filter-bank energies.
● Finally, a discrete cosine transform (DCT) of the logarithm of S[m] is computed to form the MFCCs as:
mfcc[i] = ∑_{m=1}^{M} log(S[m]) cos[i (m − 1/2) π/M], i = 1, 2, ..., L
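Putting the pieces together for a single windowed frame: power spectrum, triangular filtering, log, then the DCT formula above. This is a sketch under the same illustrative assumptions; H is a filter bank such as the one from the previous sketch, and L = 13 coefficients is an assumed common choice:

import numpy as np

def mfcc_frame(frame, H, n_coeffs=13):
    spectrum = np.fft.rfft(frame, n=(H.shape[1] - 1) * 2)
    power = np.abs(spectrum) ** 2                  # |X_a[k]|^2, the power spectrum
    S = H @ power                                  # S[m] = sum_k |X_a[k]|^2 H_m[k]
    M = len(S)
    i = np.arange(1, n_coeffs + 1)[:, None]        # i = 1, ..., L
    m = np.arange(1, M + 1)[None, :]
    dct_basis = np.cos(i * (m - 0.5) * np.pi / M)  # cos[i (m - 1/2) pi / M]
    return dct_basis @ np.log(S + 1e-10)           # small epsilon guards log(0)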
38. GMM
● A mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution.
● A Gaussian is the characteristic symmetric "bell curve".
39. GMM...Contd
● Mathematical description of a GMM:
p(x) = ∑_{i=1}^{n} w_i p_i(x)
where p(x) is the mixed density function, w_i is the mixture weight (mixture coefficient), and p_i(x) is the component density function.
42. GMM...Contd
● Hence the component density function is:
p_i(x) = N(x | μ_i, Σ_i)
● The description of the GMM becomes
p(x) = ∑_{i=1}^{n} w_i N(x | μ_i, Σ_i)
where the μ_i are the means and the Σ_i are the covariance matrices of the individual components (probability density functions).
(Figure: five Gaussian components G1, ..., G5 with mixture weights w1, ..., w5.)
43. GMM...Contd
● The Gaussian (normal) density function, in which each of the mixture components is a Gaussian distribution with its own mean and variance parameters, is the most common mixture distribution. The feature vectors follow the Gaussian distribution; hence X is distributed normally:
X ∼ N(x | μ, Σ)   (multivariate normal distribution)
where μ is the mean and Σ is the covariance matrix.
44. GMM...Contd
● The GMM for a speaker is denoted by
λ = {w_i, μ_i, Σ_i}, where i = 1, 2, ..., M
Here a speaker is represented by a mixture of M Gaussian components.
● The Gaussian mixture density is
p(x⃗ | λ) = ∑_{i=1}^{M} w_i p_i(x⃗)
where x⃗ is a D-dimensional random vector (variable).
45. GMM...Contd
● The component density is given by
p_i(x⃗) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp{ −(1/2) (x⃗ − μ_i)^T Σ_i^{−1} (x⃗ − μ_i) }
● The schematic diagram of the GMM of a speaker is given below.
(Figure: the component densities p_1(·), ..., p_M(·), with parameters (μ_1, Σ_1), ..., (μ_M, Σ_M), are weighted by w_1, ..., w_M and summed to produce p(x⃗ | λ).)
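A direct sketch of evaluating log p(x⃗ | λ) for a diagonal-covariance GMM; the parameter arrays (weights of shape (M,), means and variances of shape (M, D)) are hypothetical placeholders, not values from the slides:

import numpy as np

def gmm_log_density(x, weights, means, variances):
    # log of each component N(x | mu_i, Sigma_i), with diagonal Sigma_i
    D = x.shape[-1]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # log-sum-exp over the M components, for numerical stability
    c = np.max(log_comp)
    return c + np.log(np.sum(np.exp(log_comp - c)))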
46. Model Parameter Estimation
● To create a GMM we are required to find the numerical values of the model parameters w_i, μ_i and Σ_i.
● To obtain an optimum model representing each speaker we need to calculate a good estimate of the GMM parameters. A very efficient method for doing so is the maximum-likelihood estimation (MLE) approach. For speaker identification, each speaker is represented by a GMM and is referred to by his/her model. In this regard the EM algorithm is a very useful tool for finding the optimum model parameters by the MLE approach.
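In practice the EM fitting can be delegated to a library. Below is an enrollment-phase sketch using scikit-learn's GaussianMixture, whose fit() runs EM for maximum-likelihood estimation; speaker_features (a mapping from speaker ID to an (n_frames, D) array of feature vectors) and M = 32 components are assumptions for illustration:

from sklearn.mixture import GaussianMixture

def train_speaker_models(speaker_features, n_components=32):
    # One GMM (lambda = {w_i, mu_i, Sigma_i}) per registered speaker
    models = {}
    for speaker_id, X in speaker_features.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker_id] = gmm.fit(X)   # EM runs inside fit()
    return models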
47. Step 4: Pattern Matching: Classification
● In this stage, a series of input vectors is compared, and a decision is made as to which of the speakers in the set is the most likely to have spoken the test data. The input to the classification system is denoted as
x⃗ = {x_1, x_2, x_3, ..., x_T}
● Using the models of each speaker and the unknown vectors, the fitness values are calculated with the help of the posterior probability. We classify the vectors to the speaker whose model gives the maximum fitness value.
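Under the frame-independence assumption above, the total log-likelihood of the test vectors x_1, ..., x_T is the sum of the per-frame log-likelihoods, and the identified speaker is the arg-max over the speaker models. A sketch continuing the hypothetical train_speaker_models() above:

def identify_speaker(models, X_test):
    # Sum per-frame log-likelihoods under each speaker's GMM, pick the maximum
    scores = {speaker_id: gmm.score_samples(X_test).sum()
              for speaker_id, gmm in models.items()}
    return max(scores, key=scores.get)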
48. Conclusion...Contd
● Modification can be done in the following
cases:
1> Feature Extraction
2> MFCC Feature
3> Filter Bank
4> Modeling Techniques
5> Pattern Matching
49. Conclusion...Contd
● Feature Extraction: In the MFCC feature the phase information is not taken into account; only the magnitude is considered. So, by using phase information along with the MFCC feature, new feature vectors can be derived.
● Pattern Matching: In the pattern matching step it is assumed that the feature vectors of the unknown speaker are independent. With this assumption the posterior probability is calculated. But we can use some orthogonal transformation to transform the set of vectors into a new set of orthogonal vectors. Hence, after the transformation the vectors become independent, and then we can proceed as before.
50. References
● [1] Molau, S., Pitz, M., Schlüter, R. & Ney, H. (2001), Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum, IEEE International Conference on Acoustics, Speech and Signal Processing, Germany, 2001: 73-76.
● [2] Huang, X., Acero, A. & Hon, H. (2001), Spoken Language Processing - A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.
● [3] Homayoon Beigi (2011), Fundamentals of Speaker Recognition, Springer.
● [4] Daniel J. Mashao, Marshalleno Skosan, Combining classifier decisions for robust speaker identification, Elsevier, 2006.
● [5] W.M. Campbell, J.P. Campbell, D.A. Reynolds, E. Singer, P.A. Torres-Carrasquillo, Support vector machines for speaker and language recognition, Elsevier, 2006.
● [6] Seiichi Nakagawa, Kouhei Asakawa, Longbiao Wang, Speaker Recognition by Combining MFCC and Phase Information, INTERSPEECH 2007.
● [7] Nilsson, M. & Ejnarsson, M., Speech Recognition Using Hidden Markov Model: Performance Evaluation in Noisy Environment, Blekinge Institute of Technology, Sweden, 2002.
51. References...Contd
● [8] Stevens, S. S. & Volkman, J. (1940), The Relation of the Pitch to Frequency, Journal of Psychology, 1940(53): 329.
● [9] A. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Systems Video Technol., vol. 14, no. 1, pp. 4-20, 2004.
● [10] D. Reynolds, "An overview of automatic speaker recognition technology," in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing (ICASSP), 2002, vol. 4, pp. 4072-4075.
● [11] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoustics Speech Signal Process., vol. 29, no. 2, pp. 254-272, 1981.
● [12] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, 1995.
● [13] D. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17, no. 1-2, pp. 91-108, 1995.
52. References...Contd
● [14] Man-Wai Mak, Wei Rao, Utterance partitioning with acoustic vector resampling for GMM-SVM speaker verification, Elsevier, 2011.
● [15] Md. Sahidullah, Goutam Saha, Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition, Elsevier, 2011.
● [16] Qi Li and Yan Huang, An Auditory-Based Feature Extraction Algorithm for Robust Speaker Identification Under Mismatched Conditions, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, August 2011.
● [17] Alfredo Maesa, Fabio Garzia, Michele Scarpiniti, Roberto Cusani, Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models, Journal of Information Security, 2012.
● [18] Ming Li, Kyu J. Han, Shrikanth Narayanan, Automatic speaker age and gender recognition using acoustic and prosodic level information fusion, Elsevier, 2013.