EXPLORATION OF SPEAKER MODELLING AND SCORE
NORMALIZATION METHOD FOR DEVELOPMENT OF A
VOICE PASSWORD BASED SPEAKER VERIFICATION
SYSTEM
NAME: AJAY KUMAR PASWAN
M.TECH 2nd yr (ECE Dept.), ROLL NO: 1229011
Under the guidance of Dr. G. Pradhan
NIT PATNA (ECE Dept.)
OUTLINE
 Introduction
 Literature review on speaker verification system
 Summary of literature review
 Motivation for present work
 Baseline speaker verification system
 Proposed speaker verification system
 Summary and Contribution
 Future scope
INTRODUCTION
 Speaker verification: the process of verifying the identity claim of a person from
his/her voice
 To improve the security level, recent technology has turned towards biometric
features over non-biometric features
 With the emergence of mobile technology a person can remotely access the
system, so remote monitoring is possible
 Speaker verification can be divided into
 Text-independent
 Text-dependent
 Voice password
 Text-independent systems perform worse than text-dependent ones due to the
additional phonetic variability between training and testing speech
 Text-independent systems require more data for training and testing
BRIEF HISTORY
 Research in the field of speaker recognition was initially
carried out in the 1950s at Bell Laboratories using isolated digits
[1].
 From 1960 to 1990 most of the research was focused on extraction of
speaker-specific information from the speech data, and on the
development of text-dependent speaker verification systems.
 In 1990-2005 speaker recognition methods shifted from
template-based pattern matching to statistical modeling.
Different statistical modeling methods like GMM and GMM-
UBM were proposed.
 In 2005-2014 most of the research was focused on
compensation of mismatches and development of practical
authentication systems. Different compensation methods like
JFA, i-vectors, LDA, WCCN and PLDA were proposed.
1. K. H. Davis, et. al., “Automatic recognition of spoken
digits,” J.A.S.A., 24 (6), pp. 637-642, 1952.
MODULAR REPRESENTATION OF VOICE PASSWORD
BASED SPEAKER VERIFICATION SYSTEM
Fig: Voice password speaker verification system
(Training path: speech -> pre-processing -> feature extraction -> model building -> reference model.
Testing path: speech + identity claim -> pre-processing -> feature extraction -> comparison with
the reference model -> decision logic -> accept/reject.)
PREPROCESSING
 Preprocessing is an important step in a speaker verification
system. This also called voice activity detection (VAD).
 VAD separates speech region from non-speech regions[2-3]
 It is very difficult to implement a VAD algorithm which works
consistently for different type of data
 VAD algorithms can be classified in two groups
 Feature based approach
 Statistical model based approach
 Each of the VAD method have its own merits and demerits
depending on accuracy, complexity etc.
 Due to simplicity most of the speaker verification systems use
signal energy for VAD.
2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85,
no. 9, pp. 1437–1462, September 1997.
3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
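As noted above, most systems use a simple energy threshold for VAD. A minimal sketch of such an energy-based VAD (the frame length, threshold ratio, and toy signal below are illustrative assumptions, not the thesis settings):

```python
import numpy as np

def energy_vad(signal, frame_len=320, threshold_ratio=0.1):
    """Energy-based VAD sketch: keep frames whose energy exceeds
    a fraction of the average frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    threshold = threshold_ratio * energies.mean()
    return energies > threshold  # boolean mask of speech frames

# toy signal: silence, a louder "speech" burst, then silence again
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(3200),
                      0.5 * rng.standard_normal(3200),
                      0.01 * rng.standard_normal(3200)])
mask = energy_vad(sig)
```

The mask marks only the high-energy middle region as speech; real recordings need care with low-energy consonants and non-stationary noise, which is why the slide calls consistent VAD difficult.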
FEATURE EXTRACTION
 The speech signal along with speaker information contains
many other redundant information like recording sensor,
channel, environment etc.
 The speaker specific information in the speech signal[2]
 Unique speech production system
 Physiological
 Behavioral aspects
 Feature extraction module transforms speech to a set of
feature vectors of reduce dimensions
 To enhance speaker specific information
 Suppress redundant information[2-4]
4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S.
Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A
tutorial on text-independent speaker verification,” EURASIP Journal on
Applied Signal Processing, vol. 4, pp. 430–451, 2004.
 An ideal feature
 Robust to environmental and recording conditions
 Contains less intra-speaker variability
 More inter-speaker variability
 Most state-of-the-art speaker verification systems use
Mel-frequency cepstral coefficients (MFCC) appended with their
first and second order derivatives as the feature vectors
 Easy to extract
 Provide the best performance compared to other features
 MFCC mostly contains information about the
resonance structure of the vocal tract system
STEPS FOR MFCC COMPUTATION
Windowing of signal using Hamming window
DFT spectrum:
 The discrete Fourier transform is calculated for each
windowed frame using the DFT equation:
X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N},  0 ≤ k ≤ N − 1
Mel-Spectrum
 The mel-spectrum can be calculated by passing the
Fourier transform of the signal through a mel filter
bank; the mel filter bank is a set of band-pass filters
 The mel frequency is related to the linear frequency
as
f_mel = 2595 log10(1 + f/700)
Discrete cosine transform (DCT):
 The discrete cosine transform converts the log mel-spectrum to
cepstral coefficients
 Unlike spectral features, which are highly correlated, cepstral features
produce a more decorrelated, compact representation.
 The DCT converts the K log filter-bank spectral values {log(S_k)}, k = 1…K, into L cepstral
coefficients:
C_n = Σ_{k=1}^{K} log(S_k) cos(n(k − 1/2)π/K),  n = 1, 2, 3, …, L
Typically L = 13 MFCC coefficients are calculated per frame,
forming the feature vector of that frame.
 The cepstral coefficients are static features; they contain
information about a particular frame only, so to capture the dynamics of the
signal the first and second derivatives of the cepstral coefficients are computed.
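The pipeline above (Hamming window -> DFT -> mel filter bank -> log -> DCT) can be sketched end to end. The sampling rate, frame sizes, and filter count below are illustrative choices, not the system's actual configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=160, hop=80, n_filters=20, n_ceps=13):
    """Minimal MFCC sketch following the slide's steps."""
    n_fft = frame_len
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    # triangular mel filter bank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))          # DFT spectrum
        mel_spec = np.log(fbank @ (mag ** 2) + 1e-10)    # log mel-spectrum
        # DCT: C_n = sum_k log(S_k) cos(n (k - 1/2) pi / K)
        n = np.arange(1, n_ceps + 1)[:, None]
        k = np.arange(1, n_filters + 1)[None, :]
        ceps = (mel_spec * np.cos(n * (k - 0.5) * np.pi / n_filters)).sum(axis=1)
        feats.append(ceps)
    return np.array(feats)

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(800) / 8000))
```

Appending first and second differences of these vectors over time would give the 39-dimensional MFCC+Δ+ΔΔ features used by the baseline.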
SPEAKER MODELING
 Speaker modeling captures the statistical information present in the
feature vectors; it enhances the speaker information and
suppresses the redundant information
 For text-independent speaker verification, the speaker
modeling techniques used are vector quantization (VQ),
the Gaussian mixture model (GMM)[5], the GMM-universal
background model (GMM-UBM)[6], artificial neural
networks (ANN) and support vector machines (SVM)
 The Gaussian mixture model is the most widely used for
speaker verification systems
5. D. A. Reynolds, “Speaker identification and verification using
Gaussian mixture speaker models,” Speech Communication, vol.
17, pp. 91–108, March 1995.
6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker
verification using adapted Gaussian mixture models,” Digital Signal
Processing, vol. 10, pp. 19–41, January 2000.
 The Gaussian mixture model assumes the feature vectors follow a
mixture of Gaussian distributions, characterized by mean vectors,
covariance matrices and weights
 Data unseen in training which appears in the test
data will trigger a low score
 Though the GMM is quite powerful, it needs a large amount of training
data to properly estimate the model parameters
 The GMM has a powerful and versatile parameter
estimation algorithm available: expectation-maximization (EM).
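As a concrete illustration of EM-based GMM fitting and of the low score triggered by unseen data, a sketch using scikit-learn's `GaussianMixture` on synthetic two-dimensional "features" (the data and model sizes here are arbitrary choices, not the thesis configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # fits a GMM via EM

rng = np.random.default_rng(0)
# toy "feature vectors" drawn from two clusters, standing in for MFCCs
feats = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(5.0, 1.0, size=(200, 2))])

# diagonal-covariance GMM, as commonly used in speaker verification
gmm = GaussianMixture(n_components=2, covariance_type='diag',
                      random_state=0).fit(feats)

# average per-frame log-likelihood: high for matched data,
# low for data unseen in training
ll_match = gmm.score(feats)
ll_unseen = gmm.score(rng.normal(20.0, 1.0, size=(50, 2)))
```

The unseen cluster at mean 20 scores far below the training data, which is exactly the behavior exploited when verifying a claim against a speaker's model.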
Pattern comparison
 In the testing phase, test feature vectors are compared
with the claimed speaker's model to measure the similarity between
training and testing speech
 A different similarity measure is used for each
modeling method
 Euclidean distance [8] for VQ; log-likelihood
score (LLS)[7] and log-likelihood score ratio (LLSR)
for GMM-UBM.
7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995
8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and
transitional spectral information in speaker recognition,” IEEE Trans.
Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June
1988
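The log-likelihood score ratio of the GMM-UBM framework can be illustrated with single Gaussians standing in for the mixtures; all means, variances, and test data below are toy values for illustration:

```python
import numpy as np

def gauss_loglik(x, mean, var):
    """Frame-wise log-likelihood under a diagonal Gaussian (a
    one-component stand-in for a GMM, to keep the sketch short)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)

rng = np.random.default_rng(1)
test = rng.normal(2.0, 1.0, size=(100, 3))   # test feature vectors

# claimed-speaker model vs. universal background model (toy parameters)
spk_mean, spk_var = np.full(3, 2.0), np.ones(3)
ubm_mean, ubm_var = np.zeros(3), np.full(3, 4.0)

# log-likelihood score ratio used for the accept/reject decision
llsr = gauss_loglik(test, spk_mean, spk_var).mean() \
     - gauss_loglik(test, ubm_mean, ubm_var).mean()
accept = llsr > 0.0
```

Because the test frames match the claimed model better than the background model, the ratio is positive and the claim is accepted.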
PERFORMANCE MEASURE:
 A perfect SV system should accept all true claims and reject
all false claims
 Depending on the variability between the training and
testing speech, some true claims may be rejected and some
false claims may be accepted
 Therefore speaker verification performance is
measured in terms of the false rejection rate (FRR) and false
acceptance rate (FAR), and more meaningfully in terms of the equal
error rate (EER)[9].
 To improve the visualization of the SV
performance, the detection error tradeoff (DET) curve is
used as a performance measure
9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S.
Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial
on text-independent speaker verification,” EURASIP Journal on Applied
Signal Processing, vol. 4, pp. 430–451, 2004
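FRR, FAR, and the EER can be computed from sets of genuine and impostor scores by sweeping a decision threshold. A brute-force sketch on synthetic, invented score distributions:

```python
import numpy as np

def eer(true_scores, false_scores):
    """Equal error rate sketch: sweep a threshold and report the point
    where the false rejection rate meets the false acceptance rate."""
    thresholds = np.sort(np.concatenate([true_scores, false_scores]))
    best = (1.0, 0.0)
    for t in thresholds:
        frr = np.mean(true_scores < t)    # true claims rejected
        far = np.mean(false_scores >= t)  # false claims accepted
        if abs(frr - far) < best[0]:
            best = (abs(frr - far), (frr + far) / 2)
    return best[1]

rng = np.random.default_rng(2)
genuine = rng.normal(2.0, 1.0, 1000)    # scores of true claims
impostor = rng.normal(-2.0, 1.0, 1000)  # scores of false claims
err = eer(genuine, impostor)
```

Plotting FRR against FAR over all thresholds (on a normal-deviate scale) gives the DET curves used throughout the result slides.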
SUMMARY ON REVIEW
 The performance of a speaker verification system mostly
depends upon the quality of the speech signal
 The performance of the system degrades significantly
under mismatched conditions
 The phonetic variability between training and testing
speech is another major source of mismatch
 Speaker verification in text-dependent mode is
performed with the DTW algorithm and HMM
 GMM is useful for modeling the system in text-
independent mode
MOTIVATION FOR PRESENT WORK
 In most applications the speech signal used is of short
duration, around 3-5 s, but speaker
verification systems provide poor performance for
short-duration speech signals
 This degradation of performance is due to phonetic
variability between training and testing speech data
 The phonetic variability may be reduced by
artificially generating multiple utterances, or by taking
features around glottal closure instants (GCI)
 Most SV systems perform score normalization
using cohort-centric normalization. Speaker-
centric score normalization may provide better
results.
OBJECTIVE OF THIS THESIS WORK
 To develop a voice password based speaker
verification system
 To study the impact of text mismatch on the
performance of a voice password based speaker
verification system
 To develop a voice password based speaker
verification system in text-independent mode
 To explore methods to model speaker information in
limited-data conditions
 To study and explore the advantages of speaker-
centric score normalization
DATABASE COLLECTION
 Total database = 100 speakers (85 male, 15 female)
 Number of repetitions for training = 3 sessions
 Number of repetitions for testing = 5 sessions
 File naming format = 8765538857_NAMCF
BASELINE SPEAKER VERIFICATION SYSTEM
 For the baseline speaker verification system the following
parameters are used
 VAD threshold is taken as 0.1 of the average energy
 The baseline uses MFCC appended with first and second
order derivatives, i.e. delta (Δ) and delta-delta (ΔΔ), for
feature extraction
 Feature vector: 39-dimensional feature vectors, with a
20 ms frame size and 2 ms shift.
 Modeling: GMM
 GMM size: 8, 16, 32, 64.
3.6 EXPERIMENTAL RESULT

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            17.52  42.26  24.74  43.92  28.8   44.32  38.14  46.39
Name          39.17  17.52  41.23  20.61  43.29  27.83  39.17  45.36

Table: Baseline result.
Fig: Baseline DET plot (miss probability vs. false alarm probability, in %).
GENERATION OF MULTIPLE UTTERANCES BY ADDING
WHITE NOISE TO TRAINING SPEECH
 Motivation
 White noise covers the entire spectrum of the speech signal
 Addition of white noise will reduce phonetic variability as
it covers the entire spectrum
 Features are calculated from white-noise-added speech for
training and from clean speech for testing
 Models are built from white-noise-added training speech, while the
test data is clean
 White noise SNRs used: [-10, -5, 0, +5, +10, +20] dB
 VAD is applied on the clean speech to obtain reference indices
 The reference indices are used to find the speech regions
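Generating the noisy training copies amounts to scaling white Gaussian noise to a target SNR relative to the speech power before adding it. A sketch (the sinusoidal "clean speech" is a stand-in for a real utterance):

```python
import numpy as np

def add_white_noise(speech, snr_db, rng=None):
    """Add white Gaussian noise scaled to a target SNR (in dB)
    relative to the speech power."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
# one noisy training copy per SNR in the slide's list
noisy_versions = [add_white_noise(clean, snr, np.random.default_rng(s))
                  for s, snr in enumerate([-10, -5, 0, 5, 10, 20])]
```

Each element of `noisy_versions` serves as an additional training utterance at the corresponding SNR.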
EXPERIMENTAL RESULT

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            14.4   35.05  18.55  37.11  22.86  40.20  28.86  44.32
Name          35.05  12.37  35.05  14.40  39.17  19.58  40.20  26.80

Table: Result with white-noise-added training.
Fig: DET plot comparing the baseline with white-noise-added training.
 By adding white noise the phonetic variability of the training data is reduced;
hence the performance improves over the baseline.
MAXIMUM A POSTERIORI (MAP) ADAPTATION
METHOD:
 For modeling a speaker with a Gaussian mixture model, it is
necessary that sufficient training data be available to
build the speaker model
 Another method is available: maximum a
posteriori (MAP) adaptation of a background model
trained on the speech data of several other speakers
 It may be useful for estimating statistical models from
short-duration speech data
 Maximum a posteriori (MAP) adaptation takes the prior
information of an existing model and changes its
parameters according to the new training data
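The mean-update rule of MAP adaptation can be sketched directly. The relevance factor, toy frames, and hard responsibilities below are illustrative; a full GMM-UBM implementation also adapts weights and variances and uses soft EM responsibilities:

```python
import numpy as np

def map_adapt_means(ubm_means, frames, responsibilities, r=16.0):
    """MAP adaptation of GMM component means (Reynolds-style sketch):
    new_mean = alpha * data_mean + (1 - alpha) * ubm_mean,
    with alpha = n_k / (n_k + r) set by the relevance factor r."""
    n_k = responsibilities.sum(axis=0)                     # soft counts per component
    data_means = (responsibilities.T @ frames) / n_k[:, None]
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * data_means + (1 - alpha) * ubm_means

# toy example: 2 components, 1-D features, hard responsibilities
ubm_means = np.array([[0.0], [5.0]])
frames = np.array([[1.0], [1.2], [0.8], [5.5]])
resp = np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=float)
adapted = map_adapt_means(ubm_means, frames, resp)
```

Components that see little adaptation data (small `n_k`) stay close to the background-model prior, which is what makes MAP adaptation attractive for short-duration training speech.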
EXPERIMENTAL RESULT:

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            14.43  36.08  11.34  34.02  15.46  36.08  20.61  37.11
Name          34.02  12.37  34.02  14.43  40.20  15.46  40.20  24.74

Table: MAP adaptation of clean data on the noise model.
Fig: DET plot comparing the above two models with MAP adaptation.
RESIDUAL MFCC FROM GCI
Computation of the residual phase through linear
prediction analysis
 The speech signal is produced as the convolution of the
excitation source and the vocal tract system
 The speaker verification system requires speaker-
specific information
 The features around glottal closure instants (GCI)
are more speaker specific[10]
10. B. Yegnanarayana and P. Satyanarayana Murty, “Enhancement of
reverberant speech using LP residual signal,” IEEE Trans. Speech Audio
Process., vol. 14, pp. 774-784, May 2006.
ZERO FREQUENCY FILTERING (ZFF) METHOD
 The ZFF method is most useful for evaluating
various prosodic parameters
 It is the best available method to calculate
expressive parameters for various emotions
 The features around GCI can be computed using
ZFF[11]
 Periodically located epochs in the voiced speech signal
represent the glottal closure instants
11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech
signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no.
8, pp. 1602–1614, Nov. 2008
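A rough sketch of the ZFF idea behind [11]: pass the differenced signal through zero-frequency resonators (approximated here by running sums), remove the slowly growing trend by local-mean subtraction, and take positive-going zero crossings as epochs. The window length and the sinusoidal test signal are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def zff_epochs(speech, fs, win_ms=10):
    """Zero-frequency-filtering sketch: integrate the differenced
    signal, subtract a local mean twice to remove the trend, and
    return negative-to-positive zero crossings as epoch locations."""
    x = np.diff(speech, prepend=speech[0])   # difference the signal
    y = np.cumsum(np.cumsum(x))              # resonators at 0 Hz (running sums)
    w = int(fs * win_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(2):                       # local-mean trend removal
        y = y - np.convolve(y, kernel, mode='same')
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

fs = 8000
t = np.arange(fs) / fs
epochs = zff_epochs(np.sin(2 * np.pi * 100 * t), fs)  # 100 Hz "voicing"
```

For a 100 Hz periodic signal the detected epochs fall roughly one pitch period (fs/100 = 80 samples) apart, which is how the GCI anchor points for the residual MFCC are obtained.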
EXPERIMENTAL RESULT FOR RESIDUAL
MFCC FROM GCI

                            GMM size
                8             16            32            64
train \ test  Vp     Name   Vp     Name   Vp     Name   Vp     Name
Vp            20.6   35.05  19.58  38.14  25.77  39.17  30.92  42.26
Name          36.08  24.74  36.08  25.77  40.20  31.95  46.39  35.05

Table: Result of residual MFCC from GCI.
DET CURVE FOR COMPARISON OF DIFFERENT
PROPOSED METHODS WITH THE BASELINE METHOD
Fig: DET curves for the baseline GMM, white-noise-added GMM, MAP adaptation,
and residual MFCC around GCI.
SCORE NORMALIZATION
 The speech data used for model development
and testing varies between speakers
 For the same speaker, the quality and quantity of test
data vary between trials, so the verification
score varies between trials
 Compensation of these variabilities at the score
level is commonly known as score normalization
 Score normalization helps to reduce the
degradation and mismatch effects that are not
compensated at the feature and model levels
 It also transforms scores from different trials into a
similar range so that a common speaker-
independent verification threshold can be used
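One common speaker-centric scheme, the z-norm, normalizes each raw trial score by impostor-score statistics estimated per claimed model. A toy sketch (the score distributions are invented for illustration):

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-norm sketch: normalize a trial score by the mean and standard
    deviation of the claimed model's scores against impostor speech."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

rng = np.random.default_rng(3)
# per-model impostor score distributions differ, so a common raw
# threshold fails; z-norm maps both to a comparable range
imp_a = rng.normal(-4.0, 2.0, 200)   # impostor scores against model A
imp_b = rng.normal(-1.0, 0.5, 200)   # impostor scores against model B
za = z_norm(1.0, imp_a)
zb = z_norm(1.0, imp_b)
```

After normalization the same raw score of 1.0 is expressed in standard deviations above each model's impostor distribution, so one speaker-independent threshold can serve all models.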
BASELINE WITH SCORE NORMALIZATION:
EXPERIMENTAL RESULT

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    16.49                 14.43
name train, vp test      38.14                 37.11
vp train, vp test        19.58                 22.68
vp train, name test      40.20                 40.20

Table: Experimental result of the GMM baseline with score normalization.
Fig: DET plot of the baseline, initial and after score normalization.
GMM WITH WHITE-NOISE-ADDED TRAINING

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    14.43                 9.27
name train, vp test      34.02                 31.95
vp train, vp test        15.46                 12.37
vp train, name test      36.08                 35.05

Table: Experimental result of GMM with white-noise-added training.
Fig: DET plot for GMM with white noise, initial and after score normalization.
MAP ADAPTATION OF CLEAN DATA ON NOISY
MODEL

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    12.37                 8.24
name train, vp test      35.05                 32.98
vp train, vp test        12.37                 10.30
vp train, name test      35.05                 35.05

Table: Result of MAP adaptation of noisy train on clean test, initial and after score normalization.
Fig: DET plot for MAP adaptation of noisy train, initial and after score normalization.
RESIDUAL FEATURE FROM 3 MS AROUND GCI

Train and test speech    Session 1 (initial)   After score normalization
name train, name test    37.11                 31.95
name train, vp test      44.32                 40.20
vp train, vp test        39.17                 38.14
vp train, name test      44.32                 42.26

Table: Result of residual MFCC around 3 ms of GCI.
Fig: DET plot for residual MFCC around GCI, initial and after score normalization.
 The residual MFCC captures features around
glottal closure instants (GCI)
 Residual MFCC is specific to the speaker only and
does not contain information about the utterances
 It may provide better results when combined with the above
techniques
DET PLOT FOR COMPARISON OF DIFFERENT
MODELING TECHNIQUES
Fig: Comparison of different modeling techniques (baseline GMM, white noise
added, MAP adaptation, residual around GCI, and score normalization of MAP
adaptation).
SUMMARY OF DIFFERENT MODELING
TECHNIQUES
 The baseline best score is 17.52.
 By using white noise the best result improved
from 17.52 to 14.4 for vp and from 17.52 to 12.37 for
name
 The result is further improved using MAP
adaptation, from 14.4 to 12.37 for vp and from 12.37
to 11.34 for name
 By using the score normalization technique the score is
reduced to 8.24 for vp and 9.27 for name.
CONTRIBUTION:
 A database is collected for future research
 A method is proposed to model limited data by
generating multiple utterances of speech through
the addition of controlled white noise to clean
speech
 The performance of speaker-centric score
normalization under limited-data conditions is
addressed
FUTURE SCOPE:
 Extraction of features to reduce the impact of
phonetic variability
 Different residual or behavioral features may be
extracted in addition to MFCC for speaker
verification
 In this project we considered the GMM modeling
technique; in future work many other techniques
may be used, like JFA, i-vectors, etc.
REFERENCES
 1. K. H. Davis, et. al., “Automatic recognition of spoken digits,”
J.A.S.A., 24 (6), pp. 637-642, 1952.
 2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol.
85, no. 9, pp. 1437–1462, September 1997.
 3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans. on
speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
 4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau,
S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A
tutorial on text-independent speaker verification,” EURASIP Journal on
Applied Signal Processing, vol. 4, pp. 430–451, 2004.
 5. D. A. Reynolds, “Speaker identification and verification using Gaussian
mixture speaker models,” Speech Communication, vol. 17, pp. 91–108,
March 1995.
 6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing, vol.
10, pp. 19–41, January 2000.
 7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker
identification using Gaussian mixture speaker models,” IEEE Trans.
on speech and audio processing, vol. 3, no. 1, pp. 72–83, January
1995
 8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous
and transitional spectral information in speaker recognition,” IEEE
Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp.
871–879, June 1988
 9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I.
Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D.
A. Reynolds, “A tutorial on text-independent speaker verification,”
EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451,
2004
 10. B. Yegnanarayana and P. Satyanarayana Murty, “Enhancement
of reverberant speech using LP residual signal,” IEEE Trans. Speech
Audio Process., vol. 14, pp. 774-784, May 2006.
 11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from
speech signals,” IEEE Trans. Audio, Speech and Language
Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008
Thank you

SPEAKER VERIFICATION

  • 1.
    EXPLORATION OF SPEAKERMODELLING AND SCORE NORMALIZATION METHOD FOR DEVELEPMENT OF A VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM Under guidance of NAME-AJAY KUMAR PASWAN Dr. G. Pradhan M.TECH 2 𝑛𝑑 yr (ECE Dept.) NIT PATNA (ECE Dept.) ROLL NO-1229011
  • 2.
    OUTLINE  Introduction  Literaturereview on speaker verification system  Summary of literature review  Motivation for present work  Baseline speaker verification system  Proposed speaker verification system  Summary and Contribution  Future scope
  • 3.
    INTRODUCTION  Speaker verification:A process of verifying identity claim of a person from his/her voice  To improve the security level, recent technology turned towards using biometric features over non-biometric features  With the emergence of mobile technology a person can remotely access the system, so, remote monitoring is possible  Speaker verification can be divided into  Text-independent  Text-dependent  Voice password  Text- independent system has lesser performance than text-dependent due to additional phonetic variability between training and testing speech  Text- independent system requires more data for training and testing
  • 4.
    BRIEF HISTORY  Researchin the field of speaker recognition was initially carried out in 1950s in Bell laboratories using isolated digits [1].  1960- 1990 most of the research was focused on extraction of speaker specific information from the speech data, and development of text dependent speaker verification system.  In 1990-2005 the speaker recognition method shifted from template based pattern matching to statistical modeling. Different statistical modeling method like GMM and GMM- UBM are proposed.  2005- 2014 most of the research was focused on compensation of mismatches and development of practical authentication systems. Different compensation methods like JAFA, i-vectors and LDA, WCCN, PLDA are proposed. 1. K. H. Davis, et. al., “Automatic recognition of spoken digits,” J.A.S.A., 24 (6), pp. 637-642, 1952.
  • 5.
    MODULAR REPRESENTATION OFVOICE PASS WORD BASED SPEAKER VERIFICATION SYSTEM Fig: Voice password speaker verification system Training Reference model Speech Identity claim Testing Speech R Accept/reject Pre- processing Feature extraction Model Building Pre- processing Feature extraction comparison Decision logic
  • 6.
    PREPROCESSING  Preprocessing isan important step in a speaker verification system. This also called voice activity detection (VAD).  VAD separates speech region from non-speech regions[2-3]  It is very difficult to implement a VAD algorithm which works consistently for different type of data  VAD algorithms can be classified in two groups  Feature based approach  Statistical model based approach  Each of the VAD method have its own merits and demerits depending on accuracy, complexity etc.  Due to simplicity most of the speaker verification systems use signal energy for VAD. 2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, September 1997. 3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.
  • 7.
    FEATURE EXTRACTION  Thespeech signal along with speaker information contains many other redundant information like recording sensor, channel, environment etc.  The speaker specific information in the speech signal[2]  Unique speech production system  Physiological  Behavioral aspects  Feature extraction module transforms speech to a set of feature vectors of reduce dimensions  To enhance speaker specific information  Suppress redundant information[2-4] 4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004.
  • 8.
     An idealfeature  Robust to environmental and recording condition  Contains less intra-speaker variability  More inter-speaker variability  Most of the state-of-the-art speaker verification systems use Mel-frequency Cepstral Coefficient (MFCC) appended to it’s first and second order derivative as the feature vectors  Easy to extract  Provides best performance compared to other features  MFCC mostly contains information about the resonance structure of the vocal tract system
  • 9.
    STEPS FOR MFCCCOMPUTATION Windowing of signal using Hamming window DFT spectrum:  Discrete Fourier transform is calculated for each window frame by following DFT equation. X (k) = 𝑛=0 𝑛=N 𝑥(𝑛)𝑒− −𝑖2𝜋𝑘𝑛 𝑁 ; 0 ≤ 𝑘 ≤ 𝑁 − 1 Mel-Spectrum  Mel-Spectrum can be calculated by passing the Fourier transform of the signal through mel-filter bank, mel-bank filter is a set of band pass filter  The mel-frequency related to the linear frequency as fmel = 2595log10 1+ 𝑓 700
  • 10.
    Discrete cosine transform(DCT):  Discrete cosine transform convert mel–spectrum on log scale to cepstral coefficients  Unlike spectral feature which are highly correlated , cepstral features produce a more decorrelated , compact representation.  DCT of k log filter bank , spectral values, {log(Sk)}K k=1 , into L cepstral coefficient Cn = 𝑘=1 𝐾 log(Sk) cos 𝑛 𝑘 − 1 2 𝜋 𝐾 n = 1 ,2 ,3 , …… L Typically L = 13 MFCC coefficient are calculated per frame , which is feature vector of that frame.  The cepstral coefficient are usually static feature, they contain the information about a particular frame only, so to get dynamics of the signal first and second derivative of cepstral coefficient is computed.
  • 11.
    SPEAKER MODELING  Speakermodels the statistical information present in the feature vectors it enhances the speaker information and suppress the redundant information  For text independent speaker verification speaker modeling technique used is vector quantization(VQ), Gaussian mixer model(GMM)[5], GMM-universal back ground model(GMM-UBM)[6], Artificial neural network(ANN)and support vector machine(SVM)  The Gaussian mixer model is most widely used for speaker verification systems 5. D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, March 1995. 6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, January 2000.
  • 12.
     Gaussian modelassumes the feature vectors follow a Gaussian distribution, characterized by mean vectors, covariance matrix and weights  The data unseen in the training which appear in the test data will trigger a low score  Though GMM is quit powerful but it need large training data to properly estimate model parameter  GMM is available powerful and versatile parameter estimation algorithm, expectation-maximization.
  • 13.
    Pattern comparison  Testingphase test feature vectors are compared with claimed model to get similarity between training and testing speech  Different similarity measure is done for used modeling method  Euclidean distance [8] for VQ , log likelihood score(LLS)[7] and log likelihood score ratio(LLSR) for GMM-UBM. 7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995 8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June 1988
  • 14.
    PERFORMANCE MEASURE:  Aperfect SV system should accept all true claim and reject all the false claims  Depending on the variability between the training and testing speech some true claim may be rejected and some false claim may be accepted  Therefore the speaker verification performance is measured in term of false rejection rate (FRR) and false acceptance rate (FAR), more meaningfully in term of equal error rate(EER)[9].  In order to improve the visualization of the SV performance, the detection error tradeoff(DTF) curve is used to performance measure 9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004
  • 15.
    SUMMARY ON REVIEW The performance of speaker verification system is mostly depend upon quality of speech signal  The performance of the system degraded significantly under mismatched conditions  The phonetic variability between training and testing speech is another major source of mismatch  Speaker verification for text-dependent mode is performed with DTW algorithm, HMM  GMM is useful for modeling of system in text- independent mode
  • 16.
    MOTIVATION FOR PRESENTWORK  Most of the application where speech signal of short duration used around 3-5ms, but Speaker verification system provide poor performance for short duration speech signal  This degradation of performance is due to phonetic variability between training and testing speech data  The phonetic variability may be reduced by artificially generating multiple utterance, taking feature around Glottal closure instants(GCI)  Most of the SV system develop score normalization using on cohort centric normalization. The speaker centric score normalization may provide better result.
  • 17.
    OBJECTIVE OF THISTHESIS WORK  To develop voice password based speaker verification  To study impact of text-mismatch on the performance of voice password based speaker verification system  Develop a voice password based speaker verification system in text-independent mode  Explore method to model speaker information in limited data condition  Study and Explore the advantages of speaker centric score normalization
  • 18.
    DATABASE COLLECTION  Totaldatabase collection = 100 speaker Male speaker ,85 and Female speaker ,15  Number of repetition for train= 3 session  Number of repetition for test=5 session  Format of file naming = 8765538857_NAMCF
  • 19.
    BASELINE SPEAKER VERIFICATIONSYSTEM  For Baseline speaker verification the following parameter are used  VAD threshold is taken 0.1 of average energy  Baseline uses MFCC appended with first and second order derivative , i.e. delta(Δ) and delta delta(ΔΔ) for feature extraction  Feature vector: It uses 39 dimension feature vector and 20ms frame size with shift 2ms.  Modeling: GMM  GMM size: 8, 16, 32, 64.
  • 20.
    3.6 EXPERIMENTAL RESULT train test GMMsize 8 16 32 64 Vp Name Vp name Vp Name Vp Name Vp 17.52 42.26 24.74 43.92 28.8 44.32 38.14 46.39 name 39.17 17.52 41.23 20.61 43.29 27.83 39.17 45.36 Table: Baseline result.
Fig: Baseline DET plot (miss probability vs. false alarm probability, in %); curve shown: baseline GMM.
GENERATION OF MULTIPLE UTTERANCES BY ADDING WHITE NOISE TO TRAINING SPEECH
 Motivation
 White noise covers the entire spectrum of the speech signal
 Adding white noise therefore reduces phonetic variability
 Features are computed from white-noise-added speech for training and from clean speech for testing
 White noise levels used: [-10, -5, 0, +5, +10, +20] dB
 VAD is applied to the clean speech to obtain reference indices
 The reference indices are used to locate the speech regions
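The noise-addition step above can be sketched in NumPy as below, assuming the levels in the list are signal-to-noise ratios in dB; `add_white_noise` is an illustrative name, and the sine tone stands in for a clean utterance.

```python
import numpy as np

def add_white_noise(speech, snr_db, seed=0):
    """Add white Gaussian noise scaled so that the noisy signal has the
    requested signal-to-noise ratio (in dB) relative to the clean speech."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    signal_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so 10*log10(signal_power / (scale**2 * noise_power)) == snr_db
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 200 * t)        # 1 s stand-in for a clean utterance
noisy = {snr: add_white_noise(speech, snr) for snr in (-10, -5, 0, 5, 10, 20)}
```

Each SNR level yields one extra training version of the utterance, which is how a single recording is turned into multiple utterances.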
EXPERIMENTAL RESULT

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 14.40  35.05    | 18.55  37.11     | 22.86  40.20     | 28.86  44.32
Name         | 35.05  12.37    | 35.05  14.40     | 39.17  19.58     | 40.20  26.80

Table: Results with white-noise-added training.
Fig: Comparison of baseline with white-noise-added training (DET plot: miss probability vs. false alarm probability, in %).
 Adding white noise reduces the phonetic variability of the training data; hence the performance improves over the baseline.
MAXIMUM A POSTERIORI (MAP) ADAPTATION METHOD
 Gaussian mixture modeling of a speaker requires sufficient training data to build the speaker model
 An alternative is maximum a posteriori (MAP) adaptation of a background model trained on the speech data of several other speakers
 This makes it possible to estimate a statistical model even from short-duration speech data
 MAP adaptation takes the prior information of the existing model and updates its parameters according to the new training data
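A minimal NumPy sketch of mean-only MAP adaptation in the classic GMM-UBM style (Reynolds et al. [6]): responsibilities from the background model weight each frame, and each mean is interpolated between its data estimate and the prior. The diagonal-covariance UBM here is synthetic, and the relevance factor of 16 is a common default, not a value from the thesis.

```python
import numpy as np

def responsibilities(x, means, variances, weights):
    """Posterior probability of each diagonal-covariance component per frame."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_lik = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = log_lik + np.log(weights)
    log_post -= log_post.max(axis=1, keepdims=True)               # stabilize exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def map_adapt_means(means, variances, weights, data, relevance=16.0):
    """Mean-only MAP adaptation: new_mean_k = a_k*E_k[x] + (1-a_k)*mean_k,
    with a_k = n_k / (n_k + relevance) from the soft counts n_k."""
    post = responsibilities(data, means, variances, weights)      # (T, K)
    n_k = post.sum(axis=0)
    e_x = (post.T @ data) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * e_x + (1.0 - alpha) * means

rng = np.random.default_rng(0)
K, D = 8, 13
means = rng.standard_normal((K, D))          # synthetic stand-in for a UBM
variances = np.ones((K, D))
weights = np.full(K, 1.0 / K)
adapted = map_adapt_means(means, variances, weights, rng.standard_normal((50, D)))
```

Components that see little adaptation data (small n_k) stay close to the prior means, which is why MAP is well suited to the limited-data condition discussed above.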
EXPERIMENTAL RESULT

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 14.43  36.08    | 11.34  34.02     | 15.46  36.08     | 20.61  37.11
Name         | 34.02  12.37    | 34.02  14.43     | 40.20  15.46     | 40.20  24.74

Table: MAP adaptation of clean data on the noise model.
Fig: Comparison of the above two models with MAP adaptation (DET plot: miss probability vs. false alarm probability, in %).
RESIDUAL MFCC FROM GCI
Computation of the residual phase through linear prediction analysis
 The speech signal is the convolution of the excitation source and the vocal tract system
 A speaker verification system requires speaker-specific information
 Features around glottal closure instants (GCI) are more speaker specific [10]
10. B. Yegnanarayana and P. Satyanarayana Murty, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 14, pp. 774-784, May 2006.
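The LP residual that these features are built on can be sketched as below: solve the Yule-Walker normal equations of the autocorrelation method for the predictor coefficients, then keep the prediction error. A NumPy sketch; the prediction order of 10 and the perfectly predictable demo signal are illustrative choices, not values from the thesis.

```python
import numpy as np

def lp_residual(signal, order=10):
    """LP residual: the error left after predicting each sample
    from `order` past samples (autocorrelation / Yule-Walker method)."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Normal equations: sum_k a_k r(|i-k|) = r(i), i = 1..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Prediction from past samples only: s_hat[n] = sum_k a[k] * s[n-1-k]
    pred = np.convolve(signal, a)[:len(signal)]
    s_hat = np.concatenate(([0.0], pred[:-1]))
    return signal - s_hat

# A perfectly predictable exponential decay: the residual is near zero after n=0
signal = 0.5 ** np.arange(200)
residual = lp_residual(signal, order=1)
```

For real speech the residual is not near zero: it retains the excitation-source (glottal) component after the vocal tract contribution is predicted away, which is what makes it useful around GCI.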
ZERO FREQUENCY FILTERING (ZFF) METHOD
 The ZFF method is very useful for estimating various prosodic parameters
 It is among the best available methods for computing expressive parameters across emotions
 The features around GCI can be computed using ZFF [11]
 Periodically located epochs in the voiced speech signal represent the glottal closure instants
11. K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.
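A rough NumPy sketch of the ZFF idea in [11], under stated simplifications: difference the signal, pass it through a cascade of two zero-frequency (0 Hz) resonators, remove the local trend by repeated moving-average subtraction, and take negative-to-positive zero crossings of the result as epoch (GCI) candidates. The 10 ms trend window and the sine demo are assumptions for illustration.

```python
import numpy as np

def zff_epochs(speech, fs, trend_win_s=0.010):
    """Zero-frequency filtering sketch: epochs appear as negative-to-positive
    zero crossings of the trend-removed, doubly 0-Hz-resonated signal."""
    x = np.diff(speech, prepend=speech[0])        # remove slowly varying bias
    y = x.astype(float)
    for _ in range(2):                            # two cascaded 0-Hz resonators
        out = np.zeros(len(y))
        for n in range(len(y)):
            out[n] = y[n]
            if n >= 1:
                out[n] += 2.0 * out[n - 1]        # y[n] = x[n] + 2y[n-1] - y[n-2]
            if n >= 2:
                out[n] -= out[n - 2]
        y = out
    win = max(3, int(trend_win_s * fs) | 1)       # odd-length averaging window
    kernel = np.ones(win) / win
    for _ in range(3):                            # repeated local-mean removal
        y = y - np.convolve(y, kernel, mode="same")
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

fs, f0 = 8000, 100
t = np.arange(int(0.5 * fs)) / fs
epochs = zff_epochs(np.sin(2 * np.pi * f0 * t), fs)   # roughly one epoch per period
```

On real voiced speech these crossings mark the GCI locations, around which the residual MFCC features above are computed.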
EXPERIMENTAL RESULT FOR RESIDUAL MFCC FROM GCI

Train \ Test | GMM 8: Vp  Name | GMM 16: Vp  Name | GMM 32: Vp  Name | GMM 64: Vp  Name
Vp           | 20.60  35.05    | 19.58  38.14     | 25.77  39.17     | 30.92  42.26
Name         | 36.08  24.74    | 36.08  25.77     | 40.20  31.95     | 46.39  35.05

Table: Results of residual MFCC from GCI.
DET CURVES COMPARING THE PROPOSED METHODS WITH THE BASELINE
Fig: DET plot (miss probability vs. false alarm probability, in %); curves: baseline GMM, white-noise-added GMM, MAP adaptation, residual around GCI.
SCORE NORMALIZATION
 The speech data used for model development and testing varies between speakers
 For the same speaker, the quality and quantity of test data vary between trials, so the verification score varies between trials
 Compensating for these variabilities at the score level is commonly known as score normalization
 Score normalization helps reduce degradation and mismatch effects that are not compensated at the feature and model levels
 It also transforms scores from different trials into a similar range so that a common speaker-independent verification threshold can be used
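As a concrete illustration, the cohort-centric Z-norm, one widely used score-normalization scheme, shifts and scales each raw trial score by impostor-cohort statistics estimated for the claimed model; a speaker-centric variant would instead estimate the statistics from the speaker's own enrollment trials. A minimal sketch with toy numbers (not thesis scores):

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Z-norm: normalize a trial score by the mean and standard deviation
    of the claimed model's scores against an impostor cohort."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
    return (raw_score - mu) / (sigma + 1e-10)     # small epsilon avoids /0

cohort = np.array([-1.2, -0.9, -1.1, -0.8, -1.0])  # toy impostor-trial scores
normalized = z_norm(0.5, cohort)                   # target trial stands out clearly
```

After normalization, scores from different models and trials sit on a comparable scale, which is what allows a single speaker-independent threshold.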
BASELINE WITH SCORE NORMALIZATION: EXPERIMENTAL RESULT

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 16.49               | 14.43
name train, vp test   | 38.14               | 37.11
vp train, vp test     | 19.58               | 22.68
vp train, name test   | 40.20               | 40.20

Table: Baseline GMM with score normalization.
Fig: DET plot of the baseline, initial and after score normalization (miss probability vs. false alarm probability, in %).
GMM WITH WHITE-NOISE-ADDED TRAINING

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 14.43               | 9.27
name train, vp test   | 34.02               | 31.95
vp train, vp test     | 15.46               | 12.37
vp train, name test   | 36.08               | 35.05

Table: Results of GMM with white-noise-added training.
Fig: DET plot for GMM with white-noise-added training, initial and after score normalization (miss probability vs. false alarm probability, in %).
MAP ADAPTATION OF CLEAN DATA ON THE NOISY MODEL

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 12.37               | 8.24
name train, vp test   | 35.05               | 32.98
vp train, vp test     | 12.37               | 10.30
vp train, name test   | 35.05               | 35.05

Table: Results of MAP adaptation, initial and after score normalization.
Fig: DET plot for MAP adaptation, initial and after score normalization (miss probability vs. false alarm probability, in %).
RESIDUAL FEATURES FROM 3 MS AROUND GCI

Train / Test          | Session 1 (initial) | After score normalization
name train, name test | 37.11               | 31.95
name train, vp test   | 44.32               | 40.20
vp train, vp test     | 39.17               | 38.14
vp train, name test   | 44.32               | 42.26

Table: Results of residual MFCC around 3 ms of GCI.
Fig: DET plot for residual MFCC around GCI, initial and after score normalization (miss probability vs. false alarm probability, in %).
 The residual MFCC computes features around glottal closure instants (GCI)
 Residual MFCC is specific to the speaker only and does not carry information about the utterance
 It may provide better results when combined with the above techniques
DET PLOT FOR COMPARISON OF DIFFERENT MODELING TECHNIQUES
Fig: Comparison of different modeling techniques (miss probability vs. false alarm probability, in %); curves: baseline GMM, white-noise-added, MAP adaptation, residual around GCI, score normalization of MAP adaptation.
SUMMARY OF DIFFERENT MODELING TECHNIQUES
 The best baseline score is 17.52
 Adding white noise improves the best result from 17.52 to 14.40 for vp and from 17.52 to 12.37 for name
 MAP adaptation further improves the result from 14.40 to 12.37 for vp and from 12.37 to 11.34 for name
 Score normalization reduces the score to as low as 8.24 for vp and 9.27 for name
CONTRIBUTION
 A database is collected for future research
 A method is proposed to model limited data by generating multiple utterances of speech through controlled white noise addition to clean speech
 The performance of speaker-centric score normalization under limited data conditions is addressed
FUTURE SCOPE
 Extraction of features to reduce the impact of phonetic variability
 Different residual and behavioral features may be extracted in addition to MFCC for speaker verification
 This work used the GMM modeling technique; future work may use other techniques such as JFA and i-vectors
REFERENCES
1. K. H. Davis, et al., "Automatic recognition of spoken digits," J.A.S.A., vol. 24, no. 6, pp. 637-642, 1952.
2. J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, Sep. 1997.
3. D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
4. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.
5. D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, Mar. 1995.
6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, Jan. 2000.
7. D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
8. F. K. Soong and A. E. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871-879, Jun. 1988.
9. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.
10. B. Yegnanarayana and P. Satyanarayana Murty, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 14, pp. 774-784, May 2006.
11. K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.