EXPLORATION OF SPEAKER MODELLING AND SCORE
NORMALIZATION METHOD FOR DEVELOPMENT OF A
VOICE PASSWORD BASED SPEAKER VERIFICATION
SYSTEM
NAME: AJAY KUMAR PASWAN
M.TECH 2nd yr (ECE Dept.), ROLL NO: 1229011
Under the guidance of Dr. G. Pradhan
NIT PATNA (ECE Dept.)
OUTLINE
 Introduction
 Literature review on speaker verification system
 Summary of literature review
 Motivation for present work
 Baseline speaker verification system
 Proposed speaker verification system
 Summary and Contribution
 Future scope
INTRODUCTION
 Speaker verification: the process of verifying the identity claim of a person from
his/her voice
 To improve the security level, recent technology has turned towards biometric
features over non-biometric features
 With the emergence of mobile technology a person can remotely access the
system, so remote monitoring is possible
 Speaker verification can be divided into
 Text-independent
 Text-dependent
 Voice password
 Text-independent systems perform worse than text-dependent ones due to the
additional phonetic variability between training and testing speech
 Text-independent systems require more data for training and testing
BRIEF HISTORY
 Research in the field of speaker recognition was initially
carried out in the 1950s at Bell Laboratories using isolated digits
[1].
 From 1960 to 1990 most of the research was focused on extraction of
speaker-specific information from the speech data, and on the
development of text-dependent speaker verification systems.
 In 1990-2005 speaker recognition methods shifted from
template-based pattern matching to statistical modeling.
Different statistical modeling methods like GMM and GMM-
UBM were proposed.
 In 2005-2014 most of the research was focused on
compensation of mismatches and development of practical
authentication systems. Different compensation methods like
JFA, i-vectors, LDA, WCCN and PLDA were proposed.
1. K. H. Davis, et. al., “Automatic recognition of spoken
digits,” J.A.S.A., 24 (6), pp. 637-642, 1952.
MODULAR REPRESENTATION OF VOICE PASSWORD
BASED SPEAKER VERIFICATION SYSTEM
Fig: Voice password speaker verification system
(Training path: speech -> pre-processing -> feature extraction -> model building -> reference model.
Testing path: speech + identity claim -> pre-processing -> feature extraction -> comparison with
the reference model -> decision logic -> accept/reject.)
PREPROCESSING
 Preprocessing is an important step in a speaker verification
system. This also called voice activity detection (VAD).
 VAD separates speech region from non-speech regions[2-3]
 It is very difficult to implement a VAD algorithm which works
consistently for different type of data
 VAD algorithms can be classified in two groups
 Feature based approach
 Statistical model based approach
 Each of the VAD method have its own merits and demerits
depending on accuracy, complexity etc.
 Due to simplicity most of the speaker verification systems use
signal energy for VAD.
2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85,
no. 9, pp. 1437–1462, September 1997.
3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
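As noted above, most systems use a simple energy threshold for VAD. A minimal sketch of such an energy-based VAD (the frame length, threshold ratio, and toy signal below are illustrative assumptions, not the thesis settings):

```python
import numpy as np

def energy_vad(signal, frame_len=320, threshold_ratio=0.1):
    """Energy-based VAD sketch: keep frames whose energy exceeds
    a fraction of the average frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    threshold = threshold_ratio * energies.mean()
    return energies > threshold  # boolean mask of speech frames

# toy signal: silence, a louder "speech" burst, then silence again
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(3200),
                      0.5 * rng.standard_normal(3200),
                      0.01 * rng.standard_normal(3200)])
mask = energy_vad(sig)
```

The mask marks only the high-energy middle region as speech; real recordings need care with low-energy consonants and non-stationary noise, which is why the slide calls consistent VAD difficult.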
FEATURE EXTRACTION
 The speech signal along with speaker information contains
many other redundant information like recording sensor,
channel, environment etc.
 The speaker specific information in the speech signal[2]
 Unique speech production system
 Physiological
 Behavioral aspects
 Feature extraction module transforms speech to a set of
feature vectors of reduce dimensions
 To enhance speaker specific information
 Suppress redundant information[2-4]
4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S.
Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A
tutorial on text-independent speaker verification,” EURASIP Journal on
Applied Signal Processing, vol. 4, pp. 430–451, 2004.
 An ideal feature
 Robust to environmental and recording conditions
 Contains less intra-speaker variability
 More inter-speaker variability
 Most state-of-the-art speaker verification systems use
Mel-frequency cepstral coefficients (MFCC) appended with their
first and second order derivatives as the feature vectors
 Easy to extract
 Provide the best performance compared to other features
 MFCC mostly contains information about the
resonance structure of the vocal tract system
STEPS FOR MFCC COMPUTATION
Windowing of signal using Hamming window
DFT spectrum:
 The discrete Fourier transform is calculated for each
windowed frame using the DFT equation:
X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N},  0 ≤ k ≤ N − 1
Mel-Spectrum
 The mel-spectrum can be calculated by passing the
Fourier transform of the signal through a mel filter
bank; the mel filter bank is a set of band-pass filters
 The mel frequency is related to the linear frequency
as
f_mel = 2595 log10(1 + f/700)
Discrete cosine transform (DCT):
 The discrete cosine transform converts the log mel-spectrum to
cepstral coefficients
 Unlike spectral features, which are highly correlated, cepstral features
produce a more decorrelated, compact representation.
 The DCT converts the K log filter-bank spectral values {log(S_k)}, k = 1…K, into L cepstral
coefficients:
C_n = Σ_{k=1}^{K} log(S_k) cos(n(k − 1/2)π/K),  n = 1, 2, 3, …, L
Typically L = 13 MFCC coefficients are calculated per frame,
forming the feature vector of that frame.
 The cepstral coefficients are static features; they contain
information about a particular frame only, so to capture the dynamics of the
signal the first and second derivatives of the cepstral coefficients are computed.
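The pipeline above (Hamming window -> DFT -> mel filter bank -> log -> DCT) can be sketched end to end. The sampling rate, frame sizes, and filter count below are illustrative choices, not the system's actual configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=160, hop=80, n_filters=20, n_ceps=13):
    """Minimal MFCC sketch following the slide's steps."""
    n_fft = frame_len
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    # triangular mel filter bank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))          # DFT spectrum
        mel_spec = np.log(fbank @ (mag ** 2) + 1e-10)    # log mel-spectrum
        # DCT: C_n = sum_k log(S_k) cos(n (k - 1/2) pi / K)
        n = np.arange(1, n_ceps + 1)[:, None]
        k = np.arange(1, n_filters + 1)[None, :]
        ceps = (mel_spec * np.cos(n * (k - 0.5) * np.pi / n_filters)).sum(axis=1)
        feats.append(ceps)
    return np.array(feats)

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(800) / 8000))
```

Appending first and second differences of these vectors over time would give the 39-dimensional MFCC+Δ+ΔΔ features used by the baseline.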
SPEAKER MODELING
 Speaker modeling captures the statistical information present in the
feature vectors; it enhances the speaker information and
suppresses the redundant information
 For text-independent speaker verification, the speaker
modeling techniques used are vector quantization (VQ),
the Gaussian mixture model (GMM)[5], the GMM-universal
background model (GMM-UBM)[6], artificial neural
networks (ANN) and support vector machines (SVM)
 The Gaussian mixture model is the most widely used for
speaker verification systems
5. D. A. Reynolds, “Speaker identification and verification using
Gaussian mixture speaker models,” Speech Communication, vol.
17, pp. 91–108, March 1995.
6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker
verification using adapted Gaussian mixture models,” Digital Signal
Processing, vol. 10, pp. 19–41, January 2000.
 The Gaussian mixture model assumes the feature vectors follow a
mixture of Gaussian distributions, characterized by mean vectors,
covariance matrices and weights
 Data unseen in training which appears in the test
data will trigger a low score
 Though the GMM is quite powerful, it needs a large amount of training
data to properly estimate the model parameters
 The GMM has a powerful and versatile parameter
estimation algorithm available: expectation-maximization (EM).
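As a concrete illustration of EM-based GMM fitting and of the low score triggered by unseen data, a sketch using scikit-learn's `GaussianMixture` on synthetic two-dimensional "features" (the data and model sizes here are arbitrary choices, not the thesis configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # fits a GMM via EM

rng = np.random.default_rng(0)
# toy "feature vectors" drawn from two clusters, standing in for MFCCs
feats = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(5.0, 1.0, size=(200, 2))])

# diagonal-covariance GMM, as commonly used in speaker verification
gmm = GaussianMixture(n_components=2, covariance_type='diag',
                      random_state=0).fit(feats)

# average per-frame log-likelihood: high for matched data,
# low for data unseen in training
ll_match = gmm.score(feats)
ll_unseen = gmm.score(rng.normal(20.0, 1.0, size=(50, 2)))
```

The unseen cluster at mean 20 scores far below the training data, which is exactly the behavior exploited when verifying a claim against a speaker's model.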
Pattern comparison
 In the testing phase, test feature vectors are compared
with the claimed speaker's model to measure the similarity between
training and testing speech
 A different similarity measure is used for each
modeling method
 Euclidean distance [8] for VQ; log-likelihood
score (LLS)[7] and log-likelihood score ratio (LLSR)
for GMM-UBM.
7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995
8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and
transitional spectral information in speaker recognition,” IEEE Trans.
Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June
1988
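The log-likelihood score ratio of the GMM-UBM framework can be illustrated with single Gaussians standing in for the mixtures; all means, variances, and test data below are toy values for illustration:

```python
import numpy as np

def gauss_loglik(x, mean, var):
    """Frame-wise log-likelihood under a diagonal Gaussian (a
    one-component stand-in for a GMM, to keep the sketch short)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)

rng = np.random.default_rng(1)
test = rng.normal(2.0, 1.0, size=(100, 3))   # test feature vectors

# claimed-speaker model vs. universal background model (toy parameters)
spk_mean, spk_var = np.full(3, 2.0), np.ones(3)
ubm_mean, ubm_var = np.zeros(3), np.full(3, 4.0)

# log-likelihood score ratio used for the accept/reject decision
llsr = gauss_loglik(test, spk_mean, spk_var).mean() \
     - gauss_loglik(test, ubm_mean, ubm_var).mean()
accept = llsr > 0.0
```

Because the test frames match the claimed model better than the background model, the ratio is positive and the claim is accepted.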
PERFORMANCE MEASURE:
 A perfect SV system should accept all true claims and reject
all false claims
 Depending on the variability between the training and
testing speech, some true claims may be rejected and some
false claims may be accepted
 Therefore speaker verification performance is
measured in terms of the false rejection rate (FRR) and false
acceptance rate (FAR), and more meaningfully in terms of the equal
error rate (EER)[9].
 To improve the visualization of the SV
performance, the detection error tradeoff (DET) curve is
used as a performance measure
9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S.
Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial
on text-independent speaker verification,” EURASIP Journal on Applied
Signal Processing, vol. 4, pp. 430–451, 2004
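FRR, FAR, and the EER can be computed from sets of genuine and impostor scores by sweeping a decision threshold. A brute-force sketch on synthetic, invented score distributions:

```python
import numpy as np

def eer(true_scores, false_scores):
    """Equal error rate sketch: sweep a threshold and report the point
    where the false rejection rate meets the false acceptance rate."""
    thresholds = np.sort(np.concatenate([true_scores, false_scores]))
    best = (1.0, 0.0)
    for t in thresholds:
        frr = np.mean(true_scores < t)    # true claims rejected
        far = np.mean(false_scores >= t)  # false claims accepted
        if abs(frr - far) < best[0]:
            best = (abs(frr - far), (frr + far) / 2)
    return best[1]

rng = np.random.default_rng(2)
genuine = rng.normal(2.0, 1.0, 1000)    # scores of true claims
impostor = rng.normal(-2.0, 1.0, 1000)  # scores of false claims
err = eer(genuine, impostor)
```

Plotting FRR against FAR over all thresholds (on a normal-deviate scale) gives the DET curves used throughout the result slides.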
SUMMARY ON REVIEW
 The performance of a speaker verification system mostly
depends upon the quality of the speech signal
 The performance of the system degrades significantly
under mismatched conditions
 The phonetic variability between training and testing
speech is another major source of mismatch
 Speaker verification in text-dependent mode is
performed with the DTW algorithm and HMM
 GMM is useful for modeling the system in text-
independent mode
MOTIVATION FOR PRESENT WORK
 In most applications the speech signal used is of short
duration, around 3-5 s, but speaker
verification systems provide poor performance for
short-duration speech signals
 This degradation of performance is due to phonetic
variability between training and testing speech data
 The phonetic variability may be reduced by
artificially generating multiple utterances, or by taking
features around glottal closure instants (GCI)
 Most SV systems perform score normalization
using cohort-centric normalization. Speaker-
centric score normalization may provide better
results.
OBJECTIVE OF THIS THESIS WORK
 To develop a voice password based speaker
verification system
 To study the impact of text mismatch on the
performance of a voice password based speaker
verification system
 To develop a voice password based speaker
verification system in text-independent mode
 To explore methods to model speaker information in
limited-data conditions
 To study and explore the advantages of speaker-
centric score normalization
DATABASE COLLECTION
 Total database = 100 speakers (85 male, 15 female)
 Number of repetitions for training = 3 sessions
 Number of repetitions for testing = 5 sessions
 File naming format = 8765538857_NAMCF
BASELINE SPEAKER VERIFICATION SYSTEM
 For the baseline speaker verification system the following
parameters are used
 VAD threshold is taken as 0.1 of the average energy
 The baseline uses MFCC appended with first and second
order derivatives, i.e. delta (Δ) and delta-delta (ΔΔ), for
feature extraction
 Feature vector: 39-dimensional feature vectors, with a
20 ms frame size and 2 ms shift.
 Modeling: GMM
 GMM size: 8, 16, 32, 64.
3.6 EXPERIMENTAL RESULT

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            17.52  42.26  24.74  43.92  28.8   44.32  38.14  46.39
Name          39.17  17.52  41.23  20.61  43.29  27.83  39.17  45.36

Table: Baseline result.
Fig: Baseline DET plot (miss probability vs. false alarm probability, in %).
GENERATION OF MULTIPLE UTTERANCES BY ADDING
WHITE NOISE TO TRAINING SPEECH
 Motivation
 White noise covers the entire spectrum of the speech signal
 Addition of white noise will reduce phonetic variability as
it covers the entire spectrum
 Features are calculated from white-noise-added speech for
training and from clean speech for testing
 Models are built from white-noise-added training speech, while the
test data is clean
 White noise SNRs used: [-10, -5, 0, +5, +10, +20] dB
 VAD is applied on the clean speech to obtain reference indices
 The reference indices are used to find the speech regions
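Generating the noisy training copies amounts to scaling white Gaussian noise to a target SNR relative to the speech power before adding it. A sketch (the sinusoidal "clean speech" is a stand-in for a real utterance):

```python
import numpy as np

def add_white_noise(speech, snr_db, rng=None):
    """Add white Gaussian noise scaled to a target SNR (in dB)
    relative to the speech power."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
# one noisy training copy per SNR in the slide's list
noisy_versions = [add_white_noise(clean, snr, np.random.default_rng(s))
                  for s, snr in enumerate([-10, -5, 0, 5, 10, 20])]
```

Each element of `noisy_versions` serves as an additional training utterance at the corresponding SNR.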
EXPERIMENTAL RESULT

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            14.4   35.05  18.55  37.11  22.86  40.20  28.86  44.32
Name          35.05  12.37  35.05  14.40  39.17  19.58  40.20  26.80

Table: Result with white-noise-added training.
Fig: DET plot comparing the baseline with white-noise-added training.
 By adding white noise the phonetic variability of the training data is reduced;
hence the performance improves over the baseline.
MAXIMUM A POSTERIORI (MAP) ADAPTATION
METHOD:
 For modeling a speaker with a Gaussian mixture model, it is
necessary that sufficient training data be available to
build the speaker model
 Another method is available: maximum a
posteriori (MAP) adaptation of a background model
trained on the speech data of several other speakers
 It may be useful for estimating statistical models from
short-duration speech data
 Maximum a posteriori (MAP) adaptation takes the prior
information of an existing model and changes its
parameters according to the new training data
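The mean-update rule of MAP adaptation can be sketched directly. The relevance factor, toy frames, and hard responsibilities below are illustrative; a full GMM-UBM implementation also adapts weights and variances and uses soft EM responsibilities:

```python
import numpy as np

def map_adapt_means(ubm_means, frames, responsibilities, r=16.0):
    """MAP adaptation of GMM component means (Reynolds-style sketch):
    new_mean = alpha * data_mean + (1 - alpha) * ubm_mean,
    with alpha = n_k / (n_k + r) set by the relevance factor r."""
    n_k = responsibilities.sum(axis=0)                     # soft counts per component
    data_means = (responsibilities.T @ frames) / n_k[:, None]
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * data_means + (1 - alpha) * ubm_means

# toy example: 2 components, 1-D features, hard responsibilities
ubm_means = np.array([[0.0], [5.0]])
frames = np.array([[1.0], [1.2], [0.8], [5.5]])
resp = np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=float)
adapted = map_adapt_means(ubm_means, frames, resp)
```

Components that see little adaptation data (small `n_k`) stay close to the background-model prior, which is what makes MAP adaptation attractive for short-duration training speech.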
EXPERIMENTAL RESULT:

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            14.43  36.08  11.34  34.02  15.46  36.08  20.61  37.11
Name          34.02  12.37  34.02  14.43  40.20  15.46  40.20  24.74

Table: MAP adaptation of clean data on the noise model.
Fig: DET plot comparing the above two models with MAP adaptation.
RESIDUAL MFCC FROM GCI
Computation of the residual phase through linear
prediction analysis
 The speech signal is produced as the convolution of the
excitation source and the vocal tract system
 The speaker verification system requires speaker-
specific information
 The features around glottal closure instants (GCI)
are more speaker specific[10]
10. B. Yegnanarayana and P. Satyanarayana Murty, “Enhancement of
reverberant speech using LP residual signal,” IEEE Trans. Speech Audio
Process., vol. 14, pp. 774-784, May 2006.
ZERO FREQUENCY FILTERING (ZFF) METHOD
 The ZFF method is most useful for evaluating
various prosodic parameters
 It is the best available method to calculate
expressive parameters for various emotions
 The features around GCI can be computed using
ZFF[11]
 Periodically located epochs in the voiced speech signal
represent the glottal closure instants
11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech
signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no.
8, pp. 1602–1614, Nov. 2008
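A rough sketch of the ZFF idea behind [11]: pass the differenced signal through zero-frequency resonators (approximated here by running sums), remove the slowly growing trend by local-mean subtraction, and take positive-going zero crossings as epochs. The window length and the sinusoidal test signal are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def zff_epochs(speech, fs, win_ms=10):
    """Zero-frequency-filtering sketch: integrate the differenced
    signal, subtract a local mean twice to remove the trend, and
    return negative-to-positive zero crossings as epoch locations."""
    x = np.diff(speech, prepend=speech[0])   # difference the signal
    y = np.cumsum(np.cumsum(x))              # resonators at 0 Hz (running sums)
    w = int(fs * win_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(2):                       # local-mean trend removal
        y = y - np.convolve(y, kernel, mode='same')
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

fs = 8000
t = np.arange(fs) / fs
epochs = zff_epochs(np.sin(2 * np.pi * 100 * t), fs)  # 100 Hz "voicing"
```

For a 100 Hz periodic signal the detected epochs fall roughly one pitch period (fs/100 = 80 samples) apart, which is how the GCI anchor points for the residual MFCC are obtained.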
EXPERIMENTAL RESULT FOR RESIDUAL
MFCC FROM GCI

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            20.6   35.05  19.58  38.14  25.77  39.17  30.92  42.26
Name          36.08  24.74  36.08  25.77  40.20  31.95  46.39  35.05

Table: Result of residual MFCC from GCI.
DET CURVE FOR COMPARISON OF DIFFERENT
PROPOSED METHODS WITH THE BASELINE METHOD
Fig: DET curves for the baseline GMM, white-noise-added GMM, MAP adaptation,
and residual MFCC around GCI.
SCORE NORMALIZATION
 The speech data used for model development
and testing varies between speakers
 For the same speaker, the quality and quantity of test
data vary between trials, so the verification
score varies between trials
 Compensation of these variabilities at the score
level is commonly known as score normalization
 Score normalization helps to reduce the
degradation and mismatch effects that are not
compensated at the feature and model levels
 It also transforms scores from different trials into a
similar range so that a common speaker-
independent verification threshold can be used
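One common speaker-centric scheme, the z-norm, normalizes each raw trial score by impostor-score statistics estimated per claimed model. A toy sketch (the score distributions are invented for illustration):

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-norm sketch: normalize a trial score by the mean and standard
    deviation of the claimed model's scores against impostor speech."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

rng = np.random.default_rng(3)
# per-model impostor score distributions differ, so a common raw
# threshold fails; z-norm maps both to a comparable range
imp_a = rng.normal(-4.0, 2.0, 200)   # impostor scores against model A
imp_b = rng.normal(-1.0, 0.5, 200)   # impostor scores against model B
za = z_norm(1.0, imp_a)
zb = z_norm(1.0, imp_b)
```

After normalization the same raw score of 1.0 is expressed in standard deviations above each model's impostor distribution, so one speaker-independent threshold can serve all models.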
BASELINE WITH SCORE NORMALIZATION:
EXPERIMENTAL RESULT

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    16.49                 14.43
name train, vp test      38.14                 37.11
vp train, vp test        19.58                 22.68
vp train, name test      40.20                 40.20

Table: Experimental result of the GMM baseline with score normalization.
Fig: DET plot of the baseline, initial and after score normalization.
GMM WITH WHITE-NOISE-ADDED TRAINING

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    14.43                 9.27
name train, vp test      34.02                 31.95
vp train, vp test        15.46                 12.37
vp train, name test      36.08                 35.05

Table: Experimental result of GMM with white-noise-added training.
Fig: DET plot for GMM with white noise, initial and after score normalization.
MAP ADAPTATION OF CLEAN DATA ON NOISY
MODEL

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    12.37                 8.24
name train, vp test      35.05                 32.98
vp train, vp test        12.37                 10.30
vp train, name test      35.05                 35.05

Table: Result of MAP adaptation of noisy train on clean test, initial and after score normalization.
Fig: DET plot for MAP adaptation of noisy train, initial and after score normalization.
RESIDUAL FEATURE FROM 3 MS AROUND GCI

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    37.11                 31.95
name train, vp test      44.32                 40.20
vp train, vp test        39.17                 38.14
vp train, name test      44.32                 42.26

Table: Result of residual MFCC around 3 ms of GCI.
Fig: DET plot for residual MFCC around GCI, initial and after score normalization.
 The residual MFCC captures features around
glottal closure instants (GCI)
 Residual MFCC is specific to the speaker only and
does not contain information about the utterances
 It may provide better results when combined with the above
techniques
DET PLOT FOR COMPARISON OF DIFFERENT
MODELING TECHNIQUES
Fig: Comparison of different modeling techniques (baseline GMM, white noise
added, MAP adaptation, residual around GCI, and score normalization of MAP
adaptation).
SUMMARY OF DIFFERENT MODELING
TECHNIQUES
 The baseline best score is 17.52.
 By using white noise the best result improved
from 17.52 to 14.4 for vp and from 17.52 to 12.37 for
name
 The result is further improved using MAP
adaptation, from 14.4 to 12.37 for vp and from 12.37
to 11.34 for name
 By using the score normalization technique the score is
reduced to 8.24 for vp and 9.27 for name.
CONTRIBUTION:
 A database is collected for future research
 A method is proposed to model limited data by
generating multiple utterances of speech through
the addition of controlled white noise to clean
speech
 The performance of speaker-centric score
normalization under limited-data conditions is
addressed
FUTURE SCOPE:
 Extraction of features to reduce the impact of
phonetic variability
 Different residual or behavioral features may be
extracted in addition to MFCC for speaker
verification
 In this project we considered the GMM modeling
technique; in future work many other techniques
may be used, like JFA, i-vectors, etc.
REFERENCES
 1. K. H. Davis, et. al., “Automatic recognition of spoken digits,”
J.A.S.A., 24 (6), pp. 637-642, 1952.
 2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol.
85, no. 9, pp. 1437–1462, September 1997.
 3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
 4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau,
S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A
tutorial on text-independent speaker verification,” EURASIP Journal on
Applied Signal Processing, vol. 4, pp. 430–451, 2004.
 5. D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, pp. 91–108,
March 1995.
 6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing, vol.
10, pp. 19–41, January 2000.
 7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans.
on speech and audio processing, vol. 3, no. 1, pp. 72–83, January
1995
 8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous
and transitional spectral information in speaker recognition,” IEEE
Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp.
871–879, June 1988
 9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I.
Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D.
A. Reynolds, “A tutorial on text-independent speaker verification,”
EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451,
2004
 10. B. Yegnanarayana and P. Satyanarayana Murty, “Enhancement
of reverberant speech using LP residual signal,” IEEE Trans. Speech
Audio Process., vol. 14, pp. 774-784, May 2006.
 11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from
speech signals,” IEEE Trans. Audio, Speech and Language
Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008
Thank you

SPEAKER VERIFICATION

  • 1.
    EXPLORATION OF SPEAKERMODELLING AND SCORE NORMALIZATION METHOD FOR DEVELEPMENT OF A VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM Under guidance of NAME-AJAY KUMAR PASWAN Dr. G. Pradhan M.TECH 2 𝑛𝑑 yr (ECE Dept.) NIT PATNA (ECE Dept.) ROLL NO-1229011
  • 2.
    OUTLINE  Introduction  Literaturereview on speaker verification system  Summary of literature review  Motivation for present work  Baseline speaker verification system  Proposed speaker verification system  Summary and Contribution  Future scope
  • 3.
    INTRODUCTION  Speaker verification:A process of verifying identity claim of a person from his/her voice  To improve the security level, recent technology turned towards using biometric features over non-biometric features  With the emergence of mobile technology a person can remotely access the system, so, remote monitoring is possible  Speaker verification can be divided into  Text-independent  Text-dependent  Voice password  Text- independent system has lesser performance than text-dependent due to additional phonetic variability between training and testing speech  Text- independent system requires more data for training and testing
  • 4.
    BRIEF HISTORY  Researchin the field of speaker recognition was initially carried out in 1950s in Bell laboratories using isolated digits [1].  1960- 1990 most of the research was focused on extraction of speaker specific information from the speech data, and development of text dependent speaker verification system.  In 1990-2005 the speaker recognition method shifted from template based pattern matching to statistical modeling. Different statistical modeling method like GMM and GMM- UBM are proposed.  2005- 2014 most of the research was focused on compensation of mismatches and development of practical authentication systems. Different compensation methods like JAFA, i-vectors and LDA, WCCN, PLDA are proposed. 1. K. H. Davis, et. al., “Automatic recognition of spoken digits,” J.A.S.A., 24 (6), pp. 637-642, 1952.
  • 5.
    MODULAR REPRESENTATION OFVOICE PASS WORD BASED SPEAKER VERIFICATION SYSTEM Fig: Voice password speaker verification system Training Reference model Speech Identity claim Testing Speech R Accept/reject Pre- processing Feature extraction Model Building Pre- processing Feature extraction comparison Decision logic
  • 6.
    PREPROCESSING  Preprocessing isan important step in a speaker verification system. This also called voice activity detection (VAD).  VAD separates speech region from non-speech regions[2-3]  It is very difficult to implement a VAD algorithm which works consistently for different type of data  VAD algorithms can be classified in two groups  Feature based approach  Statistical model based approach  Each of the VAD method have its own merits and demerits depending on accuracy, complexity etc.  Due to simplicity most of the speaker verification systems use signal energy for VAD. 2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, September 1997. 3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
  • 7.
    FEATURE EXTRACTION  Thespeech signal along with speaker information contains many other redundant information like recording sensor, channel, environment etc.  The speaker specific information in the speech signal[2]  Unique speech production system  Physiological  Behavioral aspects  Feature extraction module transforms speech to a set of feature vectors of reduce dimensions  To enhance speaker specific information  Suppress redundant information[2-4] 4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004.
  • 8.
     An idealfeature  Robust to environmental and recording condition  Contains less intra-speaker variability  More inter-speaker variability  Most of the state-of-the-art speaker verification systems use Mel-frequency Cepstral Coefficient (MFCC) appended to it’s first and second order derivative as the feature vectors  Easy to extract  Provides best performance compared to other features  MFCC mostly contains information about the resonance structure of the vocal tract system
  • 9.
    STEPS FOR MFCCCOMPUTATION Windowing of signal using Hamming window DFT spectrum:  Discrete Fourier transform is calculated for each window frame by following DFT equation. X (k) = 𝑛=0 𝑛=N 𝑥(𝑛)𝑒− −𝑖2𝜋𝑘𝑛 𝑁 ; 0 ≤ 𝑘 ≤ 𝑁 − 1 Mel-Spectrum  Mel-Spectrum can be calculated by passing the Fourier transform of the signal through mel-filter bank, mel-bank filter is a set of band pass filter  The mel-frequency related to the linear frequency as fmel = 2595log10 1+ 𝑓 700
  • 10.
    Discrete cosine transform(DCT):  Discrete cosine transform convert mel–spectrum on log scale to cepstral coefficients  Unlike spectral feature which are highly correlated , cepstral features produce a more decorrelated , compact representation.  DCT of k log filter bank , spectral values, {log(Sk)}K k=1 , into L cepstral coefficient Cn = 𝑘=1 𝐾 log(Sk) cos 𝑛 𝑘 − 1 2 𝜋 𝐾 n = 1 ,2 ,3 , …… L Typically L = 13 MFCC coefficient are calculated per frame , which is feature vector of that frame.  The cepstral coefficient are usually static feature, they contain the information about a particular frame only, so to get dynamics of the signal first and second derivative of cepstral coefficient is computed.
  • 11.
    SPEAKER MODELING  Speakermodels the statistical information present in the feature vectors it enhances the speaker information and suppress the redundant information  For text independent speaker verification speaker modeling technique used is vector quantization(VQ), Gaussian mixer model(GMM)[5], GMM-universal back ground model(GMM-UBM)[6], Artificial neural network(ANN)and support vector machine(SVM)  The Gaussian mixer model is most widely used for speaker verification systems 5. D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, March 1995. 6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, January 2000.
  • 12.
     Gaussian modelassumes the feature vectors follow a Gaussian distribution, characterized by mean vectors, covariance matrix and weights  The data unseen in the training which appear in the test data will trigger a low score  Though GMM is quit powerful but it need large training data to properly estimate model parameter  GMM is available powerful and versatile parameter estimation algorithm, expectation-maximization.
  • 13.
    Pattern comparison  Testingphase test feature vectors are compared with claimed model to get similarity between training and testing speech  Different similarity measure is done for used modeling method  Euclidean distance [8] for VQ , log likelihood score(LLS)[7] and log likelihood score ratio(LLSR) for GMM-UBM. 7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995 8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June 1988
  • 14.
    PERFORMANCE MEASURE:  Aperfect SV system should accept all true claim and reject all the false claims  Depending on the variability between the training and testing speech some true claim may be rejected and some false claim may be accepted  Therefore the speaker verification performance is measured in term of false rejection rate (FRR) and false acceptance rate (FAR), more meaningfully in term of equal error rate(EER)[9].  In order to improve the visualization of the SV performance, the detection error tradeoff(DTF) curve is used to performance measure 9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004
  • 15.
    SUMMARY ON REVIEW The performance of speaker verification system is mostly depend upon quality of speech signal  The performance of the system degraded significantly under mismatched conditions  The phonetic variability between training and testing speech is another major source of mismatch  Speaker verification for text-dependent mode is performed with DTW algorithm, HMM  GMM is useful for modeling of system in text- independent mode
  • 16.
    MOTIVATION FOR PRESENTWORK  Most of the application where speech signal of short duration used around 3-5ms, but Speaker verification system provide poor performance for short duration speech signal  This degradation of performance is due to phonetic variability between training and testing speech data  The phonetic variability may be reduced by artificially generating multiple utterance, taking feature around Glottal closure instants(GCI)  Most of the SV system develop score normalization using on cohort centric normalization. The speaker centric score normalization may provide better result.
  • 17.
    OBJECTIVE OF THISTHESIS WORK  To develop voice password based speaker verification  To study impact of text-mismatch on the performance of voice password based speaker verification system  Develop a voice password based speaker verification system in text-independent mode  Explore method to model speaker information in limited data condition  Study and Explore the advantages of speaker centric score normalization
  • 18.
    DATABASE COLLECTION  Totaldatabase collection = 100 speaker Male speaker ,85 and Female speaker ,15  Number of repetition for train= 3 session  Number of repetition for test=5 session  Format of file naming = 8765538857_NAMCF
  • 19.
    BASELINE SPEAKER VERIFICATIONSYSTEM  For Baseline speaker verification the following parameter are used  VAD threshold is taken 0.1 of average energy  Baseline uses MFCC appended with first and second order derivative , i.e. delta(Δ) and delta delta(ΔΔ) for feature extraction  Feature vector: It uses 39 dimension feature vector and 20ms frame size with shift 2ms.  Modeling: GMM  GMM size: 8, 16, 32, 64.
  • 20.
    3.6 EXPERIMENTAL RESULT train test GMMsize 8 16 32 64 Vp Name Vp name Vp Name Vp Name Vp 17.52 42.26 24.74 43.92 28.8 44.32 38.14 46.39 name 39.17 17.52 41.23 20.61 43.29 27.83 39.17 45.36 Table: Baseline result.
Fig: Baseline DET plot (miss probability vs. false alarm probability, in %); curve shown: baseline GMM.
GENERATION OF MULTIPLE UTTERANCES BY ADDING WHITE NOISE TO TRAINING SPEECH
 Motivation
 White noise covers the entire spectrum of the speech signal
 Adding white noise therefore reduces phonetic variability
 Features are computed from white-noise-added speech for training and from clean speech for testing
 White noise levels used: [-10, -5, 0, +5, +10, +20] dB
 VAD is applied to the clean speech to obtain reference indices
 The reference indices are used to locate the speech regions
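The noise-addition step above can be sketched in NumPy as below, assuming the levels in the list are signal-to-noise ratios in dB; `add_white_noise` is an illustrative name, and the sine tone stands in for a clean utterance.

```python
import numpy as np

def add_white_noise(speech, snr_db, seed=0):
    """Add white Gaussian noise scaled so that the noisy signal has the
    requested signal-to-noise ratio (in dB) relative to the clean speech."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    signal_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so 10*log10(signal_power / (scale**2 * noise_power)) == snr_db
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 200 * t)        # 1 s stand-in for a clean utterance
noisy = {snr: add_white_noise(speech, snr) for snr in (-10, -5, 0, 5, 10, 20)}
```

Each SNR level yields one extra training version of the utterance, which is how a single recording is turned into multiple utterances.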
EXPERIMENTAL RESULT

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 14.40  35.05    | 18.55  37.11     | 22.86  40.20     | 28.86  44.32
Name         | 35.05  12.37    | 35.05  14.40     | 39.17  19.58     | 40.20  26.80

Table: Results with white-noise-added training.
Fig: Comparison of baseline with white-noise-added training (DET plot: miss probability vs. false alarm probability, in %).
 Adding white noise reduces the phonetic variability of the training data; hence the performance improves over the baseline.
MAXIMUM A POSTERIORI (MAP) ADAPTATION METHOD
 Gaussian mixture modeling of a speaker requires sufficient training data to build the speaker model
 An alternative is maximum a posteriori (MAP) adaptation of a background model trained on the speech data of several other speakers
 This makes it possible to estimate a statistical model even from short-duration speech data
 MAP adaptation takes the prior information of the existing model and updates its parameters according to the new training data
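A minimal NumPy sketch of mean-only MAP adaptation in the classic GMM-UBM style (Reynolds et al. [6]): responsibilities from the background model weight each frame, and each mean is interpolated between its data estimate and the prior. The diagonal-covariance UBM here is synthetic, and the relevance factor of 16 is a common default, not a value from the thesis.

```python
import numpy as np

def responsibilities(x, means, variances, weights):
    """Posterior probability of each diagonal-covariance component per frame."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_lik = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = log_lik + np.log(weights)
    log_post -= log_post.max(axis=1, keepdims=True)               # stabilize exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def map_adapt_means(means, variances, weights, data, relevance=16.0):
    """Mean-only MAP adaptation: new_mean_k = a_k*E_k[x] + (1-a_k)*mean_k,
    with a_k = n_k / (n_k + relevance) from the soft counts n_k."""
    post = responsibilities(data, means, variances, weights)      # (T, K)
    n_k = post.sum(axis=0)
    e_x = (post.T @ data) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * e_x + (1.0 - alpha) * means

rng = np.random.default_rng(0)
K, D = 8, 13
means = rng.standard_normal((K, D))          # synthetic stand-in for a UBM
variances = np.ones((K, D))
weights = np.full(K, 1.0 / K)
adapted = map_adapt_means(means, variances, weights, rng.standard_normal((50, D)))
```

Components that see little adaptation data (small n_k) stay close to the prior means, which is why MAP is well suited to the limited-data condition discussed above.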
EXPERIMENTAL RESULT

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 14.43  36.08    | 11.34  34.02     | 15.46  36.08     | 20.61  37.11
Name         | 34.02  12.37    | 34.02  14.43     | 40.20  15.46     | 40.20  24.74

Table: MAP adaptation of clean data on the noise model.
Fig: Comparison of the above two models with MAP adaptation (DET plot: miss probability vs. false alarm probability, in %).
RESIDUAL MFCC FROM GCI
Computation of the residual phase through linear prediction analysis
 The speech signal is the convolution of the excitation source and the vocal tract system
 A speaker verification system requires speaker-specific information
 Features around glottal closure instants (GCI) are more speaker specific [10]
10. B. Yegnanarayana and P. Satyanarayana Murty, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 14, pp. 774-784, May 2006.
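The LP residual that these features are built on can be sketched as below: solve the Yule-Walker normal equations of the autocorrelation method for the predictor coefficients, then keep the prediction error. A NumPy sketch; the prediction order of 10 and the perfectly predictable demo signal are illustrative choices, not values from the thesis.

```python
import numpy as np

def lp_residual(signal, order=10):
    """LP residual: the error left after predicting each sample
    from `order` past samples (autocorrelation / Yule-Walker method)."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Normal equations: sum_k a_k r(|i-k|) = r(i), i = 1..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Prediction from past samples only: s_hat[n] = sum_k a[k] * s[n-1-k]
    pred = np.convolve(signal, a)[:len(signal)]
    s_hat = np.concatenate(([0.0], pred[:-1]))
    return signal - s_hat

# A perfectly predictable exponential decay: the residual is near zero after n=0
signal = 0.5 ** np.arange(200)
residual = lp_residual(signal, order=1)
```

For real speech the residual is not near zero: it retains the excitation-source (glottal) component after the vocal tract contribution is predicted away, which is what makes it useful around GCI.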
ZERO FREQUENCY FILTERING (ZFF) METHOD
 The ZFF method is very useful for estimating various prosodic parameters
 It is among the best available methods for computing expressive parameters across emotions
 The features around GCI can be computed using ZFF [11]
 Periodically located epochs in the voiced speech signal represent the glottal closure instants
11. K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.
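A rough NumPy sketch of the ZFF idea in [11], under stated simplifications: difference the signal, pass it through a cascade of two zero-frequency (0 Hz) resonators, remove the local trend by repeated moving-average subtraction, and take negative-to-positive zero crossings of the result as epoch (GCI) candidates. The 10 ms trend window and the sine demo are assumptions for illustration.

```python
import numpy as np

def zff_epochs(speech, fs, trend_win_s=0.010):
    """Zero-frequency filtering sketch: epochs appear as negative-to-positive
    zero crossings of the trend-removed, doubly 0-Hz-resonated signal."""
    x = np.diff(speech, prepend=speech[0])        # remove slowly varying bias
    y = x.astype(float)
    for _ in range(2):                            # two cascaded 0-Hz resonators
        out = np.zeros(len(y))
        for n in range(len(y)):
            out[n] = y[n]
            if n >= 1:
                out[n] += 2.0 * out[n - 1]        # y[n] = x[n] + 2y[n-1] - y[n-2]
            if n >= 2:
                out[n] -= out[n - 2]
        y = out
    win = max(3, int(trend_win_s * fs) | 1)       # odd-length averaging window
    kernel = np.ones(win) / win
    for _ in range(3):                            # repeated local-mean removal
        y = y - np.convolve(y, kernel, mode="same")
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

fs, f0 = 8000, 100
t = np.arange(int(0.5 * fs)) / fs
epochs = zff_epochs(np.sin(2 * np.pi * f0 * t), fs)   # roughly one epoch per period
```

On real voiced speech these crossings mark the GCI locations, around which the residual MFCC features above are computed.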
EXPERIMENTAL RESULT FOR RESIDUAL MFCC FROM GCI

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 20.60  35.05    | 19.58  38.14     | 25.77  39.17     | 30.92  42.26
Name         | 36.08  24.74    | 36.08  25.77     | 40.20  31.95     | 46.39  35.05

Table: Results of residual MFCC from GCI.
DET CURVES COMPARING THE PROPOSED METHODS WITH THE BASELINE
Fig: DET plot (miss probability vs. false alarm probability, in %); curves: baseline GMM, white-noise-added GMM, MAP adaptation, residual around GCI.
SCORE NORMALIZATION
 The speech data used for model development and testing varies between speakers
 For the same speaker, the quality and quantity of test data vary between trials, so the verification score varies between trials
 Compensating for these variabilities at the score level is commonly known as score normalization
 Score normalization helps reduce degradation and mismatch effects that are not compensated at the feature and model levels
 It also transforms scores from different trials into a similar range so that a common speaker-independent verification threshold can be used
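As a concrete illustration, the cohort-centric Z-norm, one widely used score-normalization scheme, shifts and scales each raw trial score by impostor-cohort statistics estimated for the claimed model; a speaker-centric variant would instead estimate the statistics from the speaker's own enrollment trials. A minimal sketch with toy numbers (not thesis scores):

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Z-norm: normalize a trial score by the mean and standard deviation
    of the claimed model's scores against an impostor cohort."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
    return (raw_score - mu) / (sigma + 1e-10)     # small epsilon avoids /0

cohort = np.array([-1.2, -0.9, -1.1, -0.8, -1.0])  # toy impostor-trial scores
normalized = z_norm(0.5, cohort)                   # target trial stands out clearly
```

After normalization, scores from different models and trials sit on a comparable scale, which is what allows a single speaker-independent threshold.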
BASELINE WITH SCORE NORMALIZATION: EXPERIMENTAL RESULT

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 16.49               | 14.43
name train, vp test   | 38.14               | 37.11
vp train, vp test     | 19.58               | 22.68
vp train, name test   | 40.20               | 40.20

Table: Baseline GMM with score normalization.
Fig: DET plot of the baseline, initial and after score normalization (miss probability vs. false alarm probability, in %).
GMM WITH WHITE-NOISE-ADDED TRAINING

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 14.43               | 9.27
name train, vp test   | 34.02               | 31.95
vp train, vp test     | 15.46               | 12.37
vp train, name test   | 36.08               | 35.05

Table: Results of GMM with white-noise-added training.
Fig: DET plot for GMM with white-noise-added training, initial and after score normalization (miss probability vs. false alarm probability, in %).
MAP ADAPTATION OF CLEAN DATA ON THE NOISY MODEL

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 12.37               | 8.24
name train, vp test   | 35.05               | 32.98
vp train, vp test     | 12.37               | 10.30
vp train, name test   | 35.05               | 35.05

Table: Results of MAP adaptation, initial and after score normalization.
Fig: DET plot for MAP adaptation, initial and after score normalization (miss probability vs. false alarm probability, in %).
RESIDUAL FEATURES FROM 3 MS AROUND GCI

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 37.11               | 31.95
name train, vp test   | 44.32               | 40.20
vp train, vp test     | 39.17               | 38.14
vp train, name test   | 44.32               | 42.26

Table: Results of residual MFCC around 3 ms of GCI.
Fig: DET plot for residual MFCC around GCI, initial and after score normalization (miss probability vs. false alarm probability, in %).
 The residual MFCC computes features around glottal closure instants (GCI)
 Residual MFCC is specific to the speaker only and does not carry information about the utterance
 It may provide better results when combined with the above techniques
DET PLOT FOR COMPARISON OF DIFFERENT MODELING TECHNIQUES
Fig: Comparison of different modeling techniques (miss probability vs. false alarm probability, in %); curves: baseline GMM, white-noise-added, MAP adaptation, residual around GCI, score normalization of MAP adaptation.
SUMMARY OF DIFFERENT MODELING TECHNIQUES
 The best baseline score is 17.52
 Adding white noise improves the best result from 17.52 to 14.40 for vp and from 17.52 to 12.37 for name
 MAP adaptation further improves the result from 14.40 to 12.37 for vp and from 12.37 to 11.34 for name
 Score normalization reduces the score to as low as 8.24 for vp and 9.27 for name
CONTRIBUTION
 A database is collected for future research
 A method is proposed to model limited data by generating multiple utterances of speech through controlled white noise addition to clean speech
 The performance of speaker-centric score normalization under limited data conditions is addressed
FUTURE SCOPE
 Extraction of features to reduce the impact of phonetic variability
 Different residual and behavioral features may be extracted in addition to MFCC for speaker verification
 This work used the GMM modeling technique; future work may use other techniques such as JFA and i-vectors
REFERENCES
1. K. H. Davis, et al., "Automatic recognition of spoken digits," J.A.S.A., vol. 24, no. 6, pp. 637-642, 1952.
2. J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, Sep. 1997.
3. D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
4. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.
5. D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, Mar. 1995.
6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, Jan. 2000.
7. D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
8. F. K. Soong and A. E. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871-879, Jun. 1988.
9. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.
10. B. Yegnanarayana and P. Satyanarayana Murty, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 14, pp. 774-784, May 2006.
11. K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.