Under the guidance of
Dr. G. Pradhan
NIT Patna (ECE Dept.)
Presented by:
Kamlesh Kalvaniya (1104080)
Niranjan Kumar (1104087)
Piyush Kumar (1104091)
B.Tech 4th year (ECE Dept.)
6/24/2015 N.I.T. PATNA ECE, DEPTT. 1
1. Introduction
2. Baseline speaker verification system
3. Future Plan
Speaker recognition is the computing task of validating the identity claim of a person from his/her voice.
Applications:
Authentication
Forensic testing
Security systems
ATM security key
Personalized user interfaces
Multi-speaker tracking
Surveillance
Identification vs. verification
Phases of Speaker Verification
• Enrollment session, or training phase
• Operating session, or testing phase
Training & Testing Phase

Training: Speech → Pre-processing → Feature extraction → Model building → Reference model
Testing: Speech + Identity claim → Pre-processing → Feature extraction → Comparison against the reference model → Decision logic → Accept/Reject
Preprocessing
Preprocessing is an important step in a speaker verification system. It is also called voice activity detection (VAD).
VAD separates speech regions from non-speech regions [2-3].
It is very difficult to implement a VAD algorithm that works consistently for different types of data.
VAD algorithms can be classified into two groups:
 Feature-based approaches
 Statistical-model-based approaches
Each VAD method has its own merits and demerits depending on accuracy, complexity, etc.
Due to its simplicity, most speaker verification systems use signal energy for VAD.
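The energy-based VAD described above can be sketched in a few lines of numpy. The 0.6 × average-energy threshold is the one used in the baseline system later in this deck; the frame sizes and the function name are our assumptions, not from the slides.

```python
import numpy as np

def energy_vad(signal, frame_len=200, frame_shift=80, ratio=0.6):
    """Energy-based VAD: mark a frame as speech when its energy
    exceeds `ratio` times the average frame energy."""
    n = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n)])
    energy = np.sum(frames ** 2, axis=1)      # per-frame signal energy
    return energy > ratio * energy.mean()     # True = speech frame
```

Such a simple rule illustrates why energy-based VAD is popular (cheap, no training) and also why it struggles on varied data: the threshold adapts only to the average level of the whole recording.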
Feature Extraction
Along with speaker information, the speech signal contains much redundant information, such as the recording sensor, channel, and environment.
Speaker-specific information in the speech signal [2]:
 Unique speech production system
 Physiological aspects
 Behavioral aspects
The feature extraction module transforms speech into a set of feature vectors of reduced dimension:
 To enhance speaker-specific information
 To suppress redundant information
Selection of Features
A good feature should:
• Be robust against noise and distortion
• Occur frequently and naturally in speech
• Be easy to measure from the speech signal
• Be difficult to impersonate/mimic
• Not be affected by the speaker’s health or long-term variations in voice
Types of Features
Feature Extraction Techniques
A wide range of approaches may be used to parametrically represent the speech signal for speaker recognition:
 Linear Predictive Coding
 Linear Predictive Cepstral Coefficients
 Mel-Frequency Cepstral Coefficients
 Perceptual Linear Prediction
 Neural Predictive Coding
Most state-of-the-art speaker verification systems use Mel-Frequency Cepstral Coefficients (MFCC) appended with their first- and second-order derivatives as the feature vectors:
 Easy to extract
 Provide the best performance compared to other features
 MFCCs mostly contain information about the resonance structure of the vocal tract system
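The first- and second-order derivatives mentioned above are the standard delta and delta-delta features. A hedged numpy sketch of the usual regression-based computation follows; the window size N and the function name are our choices, not from the slides.

```python
import numpy as np

def add_deltas(ceps, N=2):
    """Append delta and delta-delta to a (T, 13) MFCC matrix -> (T, 39).
    delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2)."""
    T = len(ceps)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(ceps, ((N, N), (0, 0)), mode='edge')   # repeat edge frames
    delta = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
                for n in range(1, N + 1)) / denom
    padded_d = np.pad(delta, ((N, N), (0, 0)), mode='edge')
    delta2 = sum(n * (padded_d[N + n:N + n + T] - padded_d[N - n:N - n + T])
                 for n in range(1, N + 1)) / denom
    return np.hstack([ceps, delta, delta2])
```

For the 13-dimensional MFCCs used later in this deck, this yields the familiar 39-dimensional feature vector per frame.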
1. Analog-to-digital conversion
2. Pre-emphasis
3. Framing & windowing
4. Fast Fourier transform
5. Mel-scale warping (filterbank)
6. MFCC (DCT of the log filterbank energies)
MFCC
Step 1: Analog-to-digital conversion: the speech signal is transformed to digital form by sampling it at a given frequency.
Step 2: Pre-emphasis: the energy present in the high frequencies (important for speech) is boosted.
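The pre-emphasis step is typically a first-order high-pass filter; a one-line sketch follows. The coefficient 0.97 is a common default, not stated on the slide.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high-frequency energy
    return np.append(x[0], x[1:] - alpha * x[:-1])
```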
Step 3: Framing: the signal is divided into frames of a given size.
MFCC FRAMING
Frame size: 25 ms; frame shift: 10 ms.
MFCC WINDOWING
• The next step is to window each individual frame to minimize the signal discontinuities at the beginning and end of the frame.
• The idea is to minimize spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame.
• We have used the Hamming window.
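The framing and windowing steps above (25 ms frames, 10 ms shift, Hamming window) can be sketched as follows; the 8 kHz sampling rate is an assumption consistent with telephone speech, not stated on the slides.

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=25, shift_ms=10):
    """Split the signal into 25 ms frames with a 10 ms shift and
    taper each frame with a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)     # 200 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)         # 80 samples
    n = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n)])
    return frames * np.hamming(frame_len)     # taper edges toward zero
```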
MEL FILTERBANK
MFCC: DCT
The log mel-filterbank energies are decorrelated using the discrete cosine transform (DCT) to obtain the cepstral coefficients.
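The mel-filterbank and DCT stages can be sketched as below. The filter count, FFT size, and sampling rate are our assumptions; the slides only name the steps.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=8000):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_frame(frame, n_fft=512, n_filters=26, n_ceps=13, fs=8000):
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft       # power spectrum
    log_e = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-10)
    # DCT-II decorrelates the log filterbank energies -> cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return basis @ log_e
```

Keeping only the first 13 coefficients matches the 13-dimensional MFCC feature vector used in the baseline system later in this deck.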
Speaker Modelling
• Vector Quantization
• Gaussian Mixture Model
• Gaussian Mixture Model-UBM
• Hidden Markov Model
• Artificial Neural Networks
• Support Vector Machines
• I-Vector
 The Gaussian model assumes the feature vectors follow a Gaussian distribution, characterized by mean vectors, covariance matrices, and weights.
 Data unseen in training that appears in the test data will trigger a low score.
A speaker model captures the statistical information present in the feature vectors; it enhances the speaker-specific information and suppresses the redundant information.
Gaussian Mixture Model
 A Gaussian mixture density is defined as
p(x|ʎ) = ∑ᵢ wᵢ gᵢ(x),  i = 1, …, M
 A Gaussian function for D dimensions is defined as
gᵢ(x) = (2π)^(−D/2) |∑ᵢ|^(−1/2) exp( −½ (x − µᵢ)ᵀ ∑ᵢ⁻¹ (x − µᵢ) )   (unimodal Gaussian)
where
ʎ = {wᵢ, µᵢ, ∑ᵢ}
wᵢ = weight; µᵢ = mean; ∑ᵢ = covariance of the i-th component
M = number of mixture components (8, 16, 32, or 64)
One GMM is trained per speaker (356 speaker models).
 For a sequence of T training vectors X = {x₁, x₂, …, x_T}, the GMM likelihood can be defined as
p(X|ʎ) = ∏ₜ p(xₜ|ʎ),  t = 1, …, T
 For estimation of the speaker-specific GMM, the expectation-maximization (EM) algorithm is used.
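A minimal diagonal-covariance EM sketch of the GMM training described above follows. The initialization scheme and iteration count are our choices; a real system would use a toolkit implementation.

```python
import numpy as np

def train_gmm(X, M=8, iters=20, seed=0):
    """Fit a diagonal-covariance GMM to the rows of X with EM.
    Returns weights w (M,), means mu (M, D), variances var (M, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, M, replace=False)]          # init means from data points
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    w = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities gamma[n, i] = P(component i | x_n)
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(axis=2)
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        Nk = gamma.sum(axis=0) + 1e-10
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```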
ʎtarget : X (MFCC features of the test utterance) is from the hypothesized speaker S
ʎUBM : X (MFCC features of the test utterance) is not from the hypothesized speaker S
 The likelihood ratio test is given by
LR(X) = P(X|ʎtarget) / P(X|ʎUBM)
 The probability of the alternative hypothesis:
P(X|ʎUBM) = F( P(X|ʎ₁), P(X|ʎ₂), …, P(X|ʎM) )
F(·) is a function, such as the average or maximum, of the likelihood values of the background speaker set P(X|ʎᵢ).
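The likelihood-ratio test above is usually computed in the log domain. A sketch follows, representing each diagonal-covariance GMM as a (weights, means, variances) tuple; the parameter layout and function names are our assumptions.

```python
import numpy as np

def avg_loglik(X, w, mu, var):
    """Average per-frame log-likelihood of frames X under a diagonal GMM."""
    log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                     + np.log(2 * np.pi * var)).sum(axis=2) + np.log(w))
    m = log_p.max(axis=1)                                  # log-sum-exp trick
    return float(np.mean(m + np.log(np.exp(log_p - m[:, None]).sum(axis=1))))

def llr_score(X, target, background):
    """log LR(X) = log p(X | lambda_target) - log p(X | lambda_UBM)."""
    return avg_loglik(X, *target) - avg_loglik(X, *background)
```

A positive score favors the target hypothesis; the decision threshold is tuned on development data.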
 Score Normalisation
s′ = (s − µI) / σI
where
s = original score = log(LR(X))
µI = estimated mean of s
σI = estimated standard deviation of s
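The normalisation above is the standard Z-norm; a short sketch follows, assuming the mean and standard deviation are estimated from impostor scores obtained by scoring against the background speaker set.

```python
import numpy as np

def znorm(score, impostor_scores):
    """Z-normalisation: center and scale a raw log-LR score using the
    mean and standard deviation of impostor (background) scores."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (score - mu) / sigma
```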
PERFORMANCE EVALUATION
 NIST has conducted a speaker recognition benchmarking activity on an annual basis since 1997.
 NIST has provided speech files as development data.
NIST 2003 data:
Test speech files: 2559
Train speech files: 356
UBM female speech files: 251
UBM male speech files: 251
For the baseline speaker verification system, the following parameters are used:
 VAD: energy-based VAD (threshold = 0.6 × average energy)
 Feature vector: 13-dimensional MFCCs appended with delta and delta-delta coefficients
 Modeling: GMM
 GMM size: 8, 16, 32, 64
 Comparison: log-likelihood score
DET plot for test 15 sec and train 15 sec
DET plot for test full and train 15 sec
DET plot for test 15 sec and train full
DET plot for test full and train full
Comparison of training data models with Equal Error Rate (%)

Gaussian size | Test 15s / Train 15s | Test Full / Train 15s | Test 15s / Train Full | Test Full / Train Full
8             | 34.90                | 34.24                 | 33.18                 | 27.70
16            | 33.05                | 32.28                 | 30.50                 | 25.67
32            | 32.46                | 32.94                 | 28.78                 | 23.67
64            | 32.82                | 33.06                 | 27.42                 | 22.05
Conclusion
 Performance is more sensitive to the amount of training data than to the amount of test data.
Future Plan
 Synthetically generating training and testing speech
from limited speech data.
 Validating the results on state-of-the-art i-vector
based speaker verification system.
Thank you

SPEAKER RECOGNITION UNDER LIMITED DATA CONDITION
