Project Members
ASHOK SHARMA PAUDEL (066/BEX/405)
DEEPESH LEKHAK (066/BEX/414)
KESHAV BASHYAL (066/BEX/418)
SUSHMA SHRESTHA (066/BEX/444)
TEXT-INDEPENDENT SPEAKER
RECOGNITION SYSTEM
OVERVIEW OF PRESENTATION
1. Introduction
2. Objective
3. System Architecture
4. Methodology
5. Results and Analysis
6. Application area
7. Limitations
8. Problem Faced
9. Conclusion
1. INTRODUCTION
 Speech - a universal method of
communication.
 Information carried by the speech signal:
1. High-level characteristics - syntax, dialect, style, overall
meaning of a spoken message.
2. Low-level characteristics - pitch and phonemic spectra,
associated much more with the physiology of the vocal tract.
1. INTRODUCTION(2)
1. INTRODUCTION(3)
 Speech is a diverse field with many
applications.
[Diagram: a speech signal feeds three tasks - Speech Recognition
(output: words, e.g. "How are you?"), Language Recognition (output:
language name, e.g. English), and Speaker Recognition (output:
speaker name, e.g. "Deepesh").]
1. INTRODUCTION (4)
What is Speaker Recognition?
 Recognition of who is speaking based on
characteristics of their speech signal.
 Two modes: text-independent and text-dependent
 Speaker Identification: Determines which
registered speaker has spoken.
 Speaker Verification: Accept or reject a
claimed identity of a speaker.
1. INTRODUCTION (5)
Biometric: a human-generated signal or
attribute for authenticating a person's
identity.
Why voice?
– A natural signal to produce
– The only biometric that allows users to authenticate
remotely
– Does not require a specialized input device, so
implementation cost is low
– Ubiquitous: telephones and microphone-equipped PCs
1. INTRODUCTION(6)
• Voice biometrics combined with other forms of security:
– Something you have - a badge
– Something you know - a password
– Something you are - voice
[Diagram: the overlap of Have, Know, and Are gives the
strongest security.]
Why text-independent speaker recognition?
- Independent of the text: easy to access, cannot be
forgotten or misplaced
- Independent of language; acceptable to users
2. OBJECTIVE
The main goal of the project is to design and
implement a text-independent speaker
recognition system on FPGA.
The specific goals can be summarized as:
1. To learn about digital signal processing and FPGA.
2. To implement and analyze the system in MATLAB.
3. To design and implement the system on FPGA.
3. SYSTEM ARCHITECTURE
Input audio
-> Conditioning
-> Analog to Digital Conversion
-> Double Data Rate SDRAM Storage
-> Pre-emphasis
-> Framing and Windowing
-> Fast Fourier Transform
-> Mel-Spectrum
-> Mel-Frequency Cepstral Coefficients
-> Universal Asynchronous Receiver Transmitter
4. METHODOLOGY
[Diagram: input signal (training or testing data)
-> Feature extraction
-> Feature matching
-> Threshold decision
-> Output]
4.1. System Implementation on MATLAB
4.1.1. Voice Capturing and Storage
- Input through a microphone, saved in .wav
format
- Sound format: 22050 Hz, 16-bit
PCM, mono channel
4.1.2. Pre-Processing
1) Silence removal
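The slides do not name the silence-removal method, so here is a minimal energy-threshold sketch (in Python for illustration; the project itself uses MATLAB). Both the frame length and the relative threshold are assumptions:

```python
import numpy as np

def remove_silence(signal, frame_len=256, rel_threshold=0.05):
    """Energy-based silence removal (illustrative assumption): drop
    frames whose short-time energy is below a fraction of the
    maximum frame energy."""
    n = len(signal) // frame_len
    frames = np.asarray(signal[:n * frame_len], dtype=float).reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    keep = energy > rel_threshold * energy.max()
    return frames[keep].ravel()
```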
4.1.2. Pre-Processing(2)
1) Silence removal 2) Pre-emphasis
s'[n] = s[n] - a·s[n-1] ...... [1]
[1] Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using MFCC
and Support Vector Machine
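The pre-emphasis filter above can be sketched directly (Python for illustration; the project uses MATLAB). The coefficient value a = 0.95 is an assumption, since the slides do not state the value used:

```python
import numpy as np

def pre_emphasis(s, a=0.95):
    """First-order high-pass filter s'[n] = s[n] - a*s[n-1].
    a is typically 0.9-0.97 (assumed value here)."""
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    out[0] = s[0]                  # first sample has no predecessor
    out[1:] = s[1:] - a * s[:-1]
    return out
```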
4.1.2. Pre-Processing(3)
1) Silence removal 2) Pre-emphasis 3) Framing
• Overlapping frames - frame blocks of 23.22 ms (512 samples
per frame) with 50% overlap
4.1.2. Pre-Processing(4)
1) Silence removal 2) Pre-emphasis 3) Framing 4) Windowing
x[n] = s'[n] · w[n-m], n = m, m+1, ..., m+N-1
where the window w[n] is defined for n = 0, 1, 2, ..., N-1 [2]
[2] Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using
MFCC and Support Vector Machine
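Framing and windowing together can be sketched as follows (Python, illustrative). The Hamming window is an assumption; the slides only give x[n] = s'[n]·w[n-m]:

```python
import numpy as np

def frame_and_window(signal, frame_len=512, overlap=0.5):
    """Split the signal into 50%-overlapping frames of 512 samples
    and apply a window (Hamming assumed here)."""
    hop = int(frame_len * (1 - overlap))               # 256 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * win
                     for i in range(n_frames)])
```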
4.1.3. Feature Extraction using MFCC
MFCC: Mel-Frequency Cepstral Coefficients
 Perceptual approach: filters modeled on
human perception of speech are applied to
the sample frames to extract the features of speech.
Steps for calculating MFCC
1. Discrete Fourier Transform using the FFT, and the
power spectrum |X[k]|² of the signal
4.1.3. Feature Extraction using MFCC(2)
2. Mel scaling
Mel scale: linear up to 1 kHz and logarithmic above 1 kHz.
Mapping the powers of the spectrum onto the mel scale
using a mel filter bank gives the mel spectral coefficients G[k].
Filter bank: triangular overlapping windows
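A sketch of the mel filter bank described above (Python, illustrative). The number of filters (26) is a common default, not a value stated on the slides; the sample rate matches the 22050 Hz capture format and the FFT size matches the 512-sample frames:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel mapping: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=22050):
    """Triangular overlapping filters spaced evenly on the mel scale."""
    m_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(m_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                  # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

Multiplying the power spectrum by this matrix yields the mel spectral coefficients G[k].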
4.1.3. Feature Extraction using MFCC(3)
3. The log of the mel spectral coefficients is taken:
log(G[k]).
4. A Discrete Cosine Transform (DCT) yields the mel-cepstrum
c[q].
(Source: Shi-Huang Chen and Yu-Ren Luo, Speaker Verification
Using MFCC and Support Vector Machine)
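Steps 3 and 4 can be sketched together (Python, illustrative). The DCT-II basis is the form commonly used for MFCCs, and keeping 13 coefficients is an assumed default, not a value from the slides:

```python
import numpy as np

def mel_cepstrum(mel_energies, n_coeffs=13):
    """Log of the mel spectral coefficients G[k], then a DCT-II
    to obtain the mel-cepstrum c[q]."""
    logG = np.log(mel_energies)
    M = len(logG)
    q = np.arange(n_coeffs)[:, None]
    k = np.arange(M)[None, :]
    # DCT-II basis: c[q] = sum_k log(G[k]) * cos(pi*q*(2k+1)/(2M))
    basis = np.cos(np.pi * q * (2 * k + 1) / (2 * M))
    return basis @ logG
```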
4.1.3. Feature Extraction using MFCC(4)
[Diagram: continuous speech -> Frame Blocking -> frames ->
Windowing -> FFT -> spectrum -> Mel-frequency Wrapping ->
mel spectrum -> Cepstrum -> mel cepstrum]
4.1.4. Feature Matching using GMM
Gaussian Mixture Model
Parametric probability
density function
Based on soft clustering
technique
Mixture of Gaussian
components
4.1.4. Feature Matching using GMM(2)
•GMM Training
4.1.4. Feature Matching using GMM(3)
The GMM modeling process consists of two
steps:
1. Initialization:
Initial values of the mean, covariance & weight are
assigned.
2. Expectation Maximization (EM):
Values of the mean, covariance & weight are
calculated adaptively by finding the maximum
likelihood of the parameters.
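The two steps above can be sketched as a minimal EM loop for a diagonal-covariance GMM (Python, illustrative only; the initialization scheme, diagonal covariances, and iteration count are all assumptions, not the project's exact routine):

```python
import numpy as np

def gmm_em(X, n_components=2, n_iter=50):
    """Minimal EM for a diagonal-covariance GMM.
    Initialization assigns starting means, variances and weights;
    EM then re-estimates them to maximize the likelihood."""
    n, d = X.shape
    # Initialization: spread starting means across the data set
    mu = X[np.linspace(0, n - 1, n_components).astype(int)].copy()
    var = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    w = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(axis=2)
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of weight, mean, variance
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```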
4.1.5. Identification & Verification
For speaker identification, the speaker model with the
maximum a posteriori probability within a group of
S speakers is selected.
For verification, a threshold on the log-
likelihood probability of the speaker has been set on
an adaptive basis.
[Diagram: Feature Extraction -> Feature Matching -> Decision:
accept if score > threshold, reject if score < threshold.]
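The two decision rules reduce to a few lines (Python, illustrative; equal speaker priors are assumed so that maximum posterior reduces to maximum log-likelihood):

```python
import numpy as np

def identify(log_likelihoods):
    """Speaker identification: pick the model with the maximum
    (log) posterior among the S registered speakers."""
    return int(np.argmax(log_likelihoods))

def verify(log_likelihood, threshold):
    """Speaker verification: accept the claimed identity iff the
    log-likelihood exceeds the threshold."""
    return log_likelihood > threshold
```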
4.2. System Implementation on FPGA
Mic -> Pre-amplification -> DC offset shifter ->
Analog-to-digital conversion -> Temporary buffer ->
Framing and windowing -> Fast Fourier Transform ->
Mel spectrum -> Log -> Discrete Cosine Transform ->
MFCC -> (UART) -> Computer (MATLAB)
4.2. System Implementation on FPGA(2)
 Sound Capture and Level Shifter
• The audio is captured using a condenser
microphone and amplified using an op-amp.
• The DC offset of the input audio signal is shifted to 1.65
volts.
 Analog-to-digital conversion and digital-to-
analog conversion
• The Spartan 3E FPGA board has an ADC module with SPI
operation.
• 14-bit sample values are obtained from the ADC at
a rate of 25000 samples per second.
4.2. System Implementation on FPGA(3)
 Double Data Rate SDRAM
- ADC samples are stored in DDR SDRAM
temporarily before further processing.
- Burst mode 4 with burst length 2 (i.e., 64
bits) is written to the SDRAM per access.
- The Wishbone communication protocol is
used for communication with the DDR SDRAM.
4.2. System Implementation on FPGA(4)
 Framing and windowing
 ADC samples stored in the DDR are pre-
emphasized.
 50% overlapped frames with a frame
length of 512 samples are used.
 Fast Fourier Transform
 A 512-point radix-2 Fast Fourier Transform is
performed using a Xilinx LogiCORE IP core.
4.2. System Implementation on FPGA(5)
[Figure: FFT timing diagram]
4.2. System Implementation on FPGA(6)
 Mel-Spectrum
 Spectrum (linear scale) => mel spectrum
 Log calculation
 Natural log computed using look-up tables
 Input data: 24-bit
Output: 12-bit
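A rough sketch of the look-up-table log idea (Python, illustrative): precompute ln(x) over a grid of the 24-bit input range and quantize the result to 12 bits. The table size and the output scaling are assumptions; the board's actual LUT layout is not given on the slides:

```python
import numpy as np

LUT_BITS = 10                     # table index width (assumption)
MAX_IN = (1 << 24) - 1            # 24-bit input range, per the slides
# Precomputed table of natural logs over the input range
LUT = np.log(np.linspace(1, MAX_IN, 1 << LUT_BITS))

def lut_log(x):
    """Approximate ln(x) for a 24-bit input via table lookup,
    scaled so ln(MAX_IN) maps to the 12-bit maximum 4095."""
    idx = min((x * ((1 << LUT_BITS) - 1)) // MAX_IN, (1 << LUT_BITS) - 1)
    return int(round(LUT[idx] / np.log(MAX_IN) * 4095))
```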
4.2. System Implementation on FPGA(7)
Discrete Cosine Transform (DCT)
 DCT core from opencores.org
 Input: 1 bit
Output: 16-bit parallel
Universal Asynchronous Receiver
Transmitter (UART)
 Baud rate of 19.2 kbps
 Each MFCC (32 bits) is divided into four
8-bit components.
 Implemented on an unused jumper pin,
using the UART protocol via CDC.
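The 32-bit-to-four-bytes packing for the UART link can be sketched as follows (Python, illustrative; big-endian byte order is an assumption, as the slides only state that each 32-bit coefficient is sent as four 8-bit components):

```python
import struct

def mfcc_to_uart_bytes(coeff):
    """Split one signed 32-bit MFCC value into four 8-bit components
    for UART transmission (big-endian assumed)."""
    return list(struct.pack(">i", coeff))

def uart_bytes_to_mfcc(b):
    """Reassemble the four received bytes back into the int32 value."""
    return struct.unpack(">i", bytes(b))[0]
```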
4.3. Further processing in MATLAB
MFCCs are received in MATLAB in int32
format.
Training phase: MFCC feature vectors =>
Gaussian Mixture Model
Testing phase: MFCC feature vectors =>
posterior probability (recognition)
5. RESULTS AND ANALYSIS
5.1. Output in MATLAB
 Training data: 31 speakers (male - 20, female - 11)
 Training data length = 10-30 seconds
 Testing data length = 1-10 seconds
 No. of MFCCs = 8-20
 Up to 99% recognition when
training data length = 30 seconds,
testing data length = 10 seconds,
No. of MFCCs = 20
5.1. Output in MATLAB(2)
Amount of        Model order   Duration of Testing Speech
Training Speech      (M)       1 second   5 seconds   10 seconds
10 seconds            8         51.3%      75.5%       82.9%
                     13         60.3%      83.5%       88.4%
                     20         64.7%      85.1%       90.4%
20 seconds            8         67.3%      86.3%       93.6%
                     13         75.1%      95.1%       97.3%
                     20         78.3%      95.4%       97.4%
30 seconds            8         71.7%      95.5%       97.5%
                     13         79.2%      97.8%       98.5%
                     20         84.1%      98.1%       99.1%
 The largest increase in performance occurs when the training
data increases from 10 to 20 sec; increasing to 30 sec
improves performance only slightly.
 At most 30 sec of speech is needed to maintain high
performance.
 There is an abrupt change in performance when the testing
speech duration increases from 1 to 5 seconds, but only a
slight increase when it goes from 5
seconds to 10 seconds.
 Using more training data improves the performance.
5.1. Output in MATLAB(3)
 77% of unknown female voices are matched with a
female voice; 85% of unknown male voices are matched
with a male voice.
 During the experiments, 4 languages (English,
Nepali, Hindi, and German) gave correct speaker
recognition regardless of the spoken text and
language.
5.1. Output in MATLAB(4)
 Total Error Rate (TER) = FAR + FRR
 The threshold for speaker verification was
calculated empirically using the FAR and FRR.
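The empirical threshold selection can be sketched as follows (Python, illustrative; the candidate grid and the min-TER criterion are assumptions about how "empirically" was done):

```python
import numpy as np

def choose_threshold(genuine_scores, impostor_scores, candidates):
    """For each candidate threshold t, compute
    FRR = fraction of genuine scores rejected (score <= t) and
    FAR = fraction of impostor scores accepted (score > t),
    then pick the t minimizing TER = FAR + FRR."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best_t, best_ter = None, np.inf
    for t in candidates:
        frr = np.mean(genuine <= t)
        far = np.mean(impostor > t)
        if far + frr < best_ter:
            best_ter, best_t = far + frr, t
    return best_t, best_ter
```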
5.1. Output in MATLAB(5)
5.2. Output Analysis in FPGA
The recognition rate is less than that of the software
implementation.
Overall resource utilization in the FPGA:
i. RAMs: 7
ii. ROMs: 3
iii. Multipliers: 15
iv. Adders/Subtractors: 18
v. Counters: 9
vi. Registers: 132
vii. Comparators: 20
viii. Multiplexers: 238
5.2. Output Analysis in FPGA (2)
Device Utilization Summary
Logic utilization                               Used   Available  Utilization
Number of Slice Flip-Flops                      8225     9312        88%
Number of 4-input LUTs                          8734     9312        93%
Number of occupied Slices                       2355     4656        54%
Number of Slices containing only related logic  1325     1325       100%
Number of Slices containing unrelated logic        0     1325         0%
Total Number of 4-input LUTs                    8903     9312        94%
Number of bonded IOBs                            215      232        94%
Number of RAMB16s                                  7       20        35%
Number of BUFGMUXs                                 2       24         8%
Number of MULT18X18SIOs                           15       20        75%
Average Fanout of Non-Clock Nets                 272
6. APPLICATIONS
Security
• Forensics for voice sample matching
• Transaction authentication
• Toll fraud prevention
Information and physical facilities
• Telephone credit card purchases
• Remote time and attendance logging
• Information retrieval
• Audio indexing
• Voice dialing and voice mail
Monitoring
• Access control
• Access to confidential information areas
• Computer and data networks
• Remote access of computers
7. LIMITATIONS
 The duration of the speech signal limits the
performance.
 Intrusion based on voice imitation
cannot be detected.
 The optimal model order is not known.
 The silence removal process is not efficient.
8. PROBLEM FACED
 Limited resources on the Spartan 3E.
 Lack of sufficient block RAM & ROM memory.
 Synchronization problems between different
modules/components.
9. CONCLUSION
The system has been implemented using
MFCC for feature extraction and GMM to
model the speakers.
The performance of the software
implementation of the system is very good.
The implementation on the FPGA is not yet
satisfactory.
Noise reduction algorithms could be used to
improve the performance of the system.
THANK YOU