Project Members
ASHOK SHARMA PAUDEL (066/BEX/405)
DEEPESH LEKHAK (066/BEX/414)
KESHAV BASHYAL (066/BEX/418)
SUSHMA SHRESTHA (066/BEX/444)
TEXT-INDEPENDENT SPEAKER
RECOGNITION SYSTEM
OVERVIEW OF PRESENTATION
1. Introduction
2. Objective
3. System Architecture
4. Methodology
5. Results and Analysis
6. Application area
7. Limitations
8. Problem Faced
9. Conclusion
1. INTRODUCTION
 Speech - a universal method of
communication.
 Information carried by the speech signal:
1. High-level characteristics - syntax, dialect, style, overall
meaning of a spoken message.
2. Low-level characteristics - pitch and phonemic spectra,
associated much more with the physiology of the vocal tract.
1. INTRODUCTION(2)
1. INTRODUCTION(3)
 Speech is a diverse field with many
applications.
[Diagram: a speech signal feeds three tasks - Speech Recognition
(output: words, e.g. "How are you?"), Language Recognition (output:
language name, e.g. English), and Speaker Recognition (output:
speaker name, e.g. "Deepesh").]
1. INTRODUCTION (4)
What is Speaker Recognition?
 Recognition of who is speaking based on
characteristics of their speech signal.
 Two modes: text-independent and text-dependent
 Speaker Identification: Determines which
registered speaker has spoken.
 Speaker Verification: Accept or reject a
claimed identity of a speaker.
1. INTRODUCTION (5)
Biometric: a human-generated signal or
attribute for authenticating a person's
identity.
Why voice?
– A natural signal to produce
– The only biometric that allows users to authenticate
remotely
– Does not require a specialized input device, so
implementation cost is low
– Ubiquitous: telephones and microphone-equipped PCs
1. INTRODUCTION(6)
• Voice biometrics combined with other forms of security:
– Something you have - a badge
– Something you know - a password
– Something you are - voice
[Diagram: the overlap of Have, Know, and Are gives the
strongest security.]
Why text-independent speaker recognition?
- Independent of the text: easy to access, cannot be
forgotten or misplaced
- Independent of language; acceptable to users
2. OBJECTIVE
The main goal of the project is to design and
implement a text-independent speaker
recognition system on FPGA.
The specific goals can be summarized as:
1. To learn about digital signal processing and FPGA.
2. To implement and analyze the system in MATLAB.
3. To design and implement the system on FPGA.
3. SYSTEM ARCHITECTURE
Input audio
-> Conditioning
-> Analog to Digital Conversion
-> Double Data Rate SDRAM Storage
-> Pre-emphasis
-> Framing and Windowing
-> Fast Fourier Transform
-> Mel-Spectrum
-> Mel-Frequency Cepstral Coefficients
-> Universal Asynchronous Receiver Transmitter
4. METHODOLOGY
[Diagram: input signal (training or testing data)
-> Feature extraction
-> Feature matching
-> Threshold decision
-> Output]
4.1. System Implementation on MATLAB
4.1.1. Voice Capturing and Storage
- Input through a microphone, saved in .wav
format
- Sound format: 22050 Hz, 16-bit
PCM, mono channel
4.1.2. Pre-Processing
1) Silence removal
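The slides do not name the silence-removal method, so here is a minimal energy-threshold sketch (in Python for illustration; the project itself uses MATLAB). Both the frame length and the relative threshold are assumptions:

```python
import numpy as np

def remove_silence(signal, frame_len=256, rel_threshold=0.05):
    """Energy-based silence removal (illustrative assumption): drop
    frames whose short-time energy is below a fraction of the
    maximum frame energy."""
    n = len(signal) // frame_len
    frames = np.asarray(signal[:n * frame_len], dtype=float).reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    keep = energy > rel_threshold * energy.max()
    return frames[keep].ravel()
```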
4.1.2. Pre-Processing(2)
1) Silence removal 2) Pre-emphasis
s'[n] = s[n] - a·s[n-1] ...... [1]
[1] Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using MFCC
and Support Vector Machine
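The pre-emphasis filter above can be sketched directly (Python for illustration; the project uses MATLAB). The coefficient value a = 0.95 is an assumption, since the slides do not state the value used:

```python
import numpy as np

def pre_emphasis(s, a=0.95):
    """First-order high-pass filter s'[n] = s[n] - a*s[n-1].
    a is typically 0.9-0.97 (assumed value here)."""
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    out[0] = s[0]                  # first sample has no predecessor
    out[1:] = s[1:] - a * s[:-1]
    return out
```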
4.1.2. Pre-Processing(3)
1) Silence removal 2) Pre-emphasis 3) Framing
• Overlapping frames - frame blocks of 23.22 ms (512 samples
per frame) with 50% overlap
4.1.2. Pre-Processing(4)
1) Silence removal 2) Pre-emphasis 3) Framing 4) Windowing
x[n] = s'[n] · w[n-m], n = m, m+1, ..., m+N-1
where the window w[n] is defined for n = 0, 1, 2, ..., N-1 [2]
[2] Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using
MFCC and Support Vector Machine
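Framing and windowing together can be sketched as follows (Python, illustrative). The Hamming window is an assumption; the slides only give x[n] = s'[n]·w[n-m]:

```python
import numpy as np

def frame_and_window(signal, frame_len=512, overlap=0.5):
    """Split the signal into 50%-overlapping frames of 512 samples
    and apply a window (Hamming assumed here)."""
    hop = int(frame_len * (1 - overlap))               # 256 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * win
                     for i in range(n_frames)])
```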
4.1.3. Feature Extraction using MFCC
MFCC: Mel-Frequency Cepstral Coefficients
 Perceptual approach: filters modeled on
human perception of speech are applied to
the sample frames to extract the features of speech.
Steps for calculating MFCC
1. Discrete Fourier Transform using the FFT, and the
power spectrum |X[k]|² of the signal
4.1.3. Feature Extraction using MFCC(2)
2. Mel scaling
Mel scale: linear up to 1 kHz and logarithmic above 1 kHz.
Mapping the powers of the spectrum onto the mel scale
using a mel filter bank gives the mel spectral coefficients G[k].
Filter bank: triangular overlapping windows
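A sketch of the mel filter bank described above (Python, illustrative). The number of filters (26) is a common default, not a value stated on the slides; the sample rate matches the 22050 Hz capture format and the FFT size matches the 512-sample frames:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel mapping: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=22050):
    """Triangular overlapping filters spaced evenly on the mel scale."""
    m_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(m_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                  # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

Multiplying the power spectrum by this matrix yields the mel spectral coefficients G[k].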
4.1.3. Feature Extraction using MFCC(3)
3. The log of the mel spectral coefficients is taken:
log(G[k]).
4. A Discrete Cosine Transform (DCT) yields the mel-cepstrum
c[q].
(Source: Shi-Huang Chen and Yu-Ren Luo, Speaker Verification
Using MFCC and Support Vector Machine)
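Steps 3 and 4 can be sketched together (Python, illustrative). The DCT-II basis is the form commonly used for MFCCs, and keeping 13 coefficients is an assumed default, not a value from the slides:

```python
import numpy as np

def mel_cepstrum(mel_energies, n_coeffs=13):
    """Log of the mel spectral coefficients G[k], then a DCT-II
    to obtain the mel-cepstrum c[q]."""
    logG = np.log(mel_energies)
    M = len(logG)
    q = np.arange(n_coeffs)[:, None]
    k = np.arange(M)[None, :]
    # DCT-II basis: c[q] = sum_k log(G[k]) * cos(pi*q*(2k+1)/(2M))
    basis = np.cos(np.pi * q * (2 * k + 1) / (2 * M))
    return basis @ logG
```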
4.1.3. Feature Extraction using MFCC(4)
[Diagram: continuous speech -> Frame Blocking -> frames ->
Windowing -> FFT -> spectrum -> Mel-frequency Wrapping ->
mel spectrum -> Cepstrum -> mel cepstrum]
4.1.4. Feature Matching using GMM
Gaussian Mixture Model
Parametric probability
density function
Based on soft clustering
technique
Mixture of Gaussian
components
4.1.4. Feature Matching using GMM(2)
•GMM Training
4.1.4. Feature Matching using GMM(3)
The GMM modeling process consists of two
steps:
1. Initialization:
Initial values of the mean, covariance & weight are
assigned.
2. Expectation Maximization (EM):
Values of the mean, covariance & weight are
calculated adaptively by finding the maximum
likelihood of the parameters.
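The two steps above can be sketched as a minimal EM loop for a diagonal-covariance GMM (Python, illustrative only; the initialization scheme, diagonal covariances, and iteration count are all assumptions, not the project's exact routine):

```python
import numpy as np

def gmm_em(X, n_components=2, n_iter=50):
    """Minimal EM for a diagonal-covariance GMM.
    Initialization assigns starting means, variances and weights;
    EM then re-estimates them to maximize the likelihood."""
    n, d = X.shape
    # Initialization: spread starting means across the data set
    mu = X[np.linspace(0, n - 1, n_components).astype(int)].copy()
    var = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    w = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(axis=2)
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of weight, mean, variance
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```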
4.1.5. Identification & Verification
For speaker identification, the speaker model with the
maximum a posteriori probability within a group of
S speakers is selected.
For verification, a threshold on the log-
likelihood probability of the speaker has been set on
an adaptive basis.
[Diagram: Feature Extraction -> Feature Matching -> Decision:
accept if score > threshold, reject if score < threshold.]
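The two decision rules reduce to a few lines (Python, illustrative; equal speaker priors are assumed so that maximum posterior reduces to maximum log-likelihood):

```python
import numpy as np

def identify(log_likelihoods):
    """Speaker identification: pick the model with the maximum
    (log) posterior among the S registered speakers."""
    return int(np.argmax(log_likelihoods))

def verify(log_likelihood, threshold):
    """Speaker verification: accept the claimed identity iff the
    log-likelihood exceeds the threshold."""
    return log_likelihood > threshold
```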
4.2. System Implementation on FPGA
Mic -> Pre-amplification -> DC offset shifter ->
Analog-to-digital conversion -> Temporary buffer ->
Framing and windowing -> Fast Fourier Transform ->
Mel spectrum -> Log -> Discrete Cosine Transform ->
MFCC -> (UART) -> Computer (MATLAB)
4.2. System Implementation on FPGA(2)
 Sound Capture and Level Shifter
• The audio is captured using a condenser
microphone and amplified using an op-amp.
• The DC offset of the input audio signal is shifted to 1.65
volts.
 Analog-to-digital conversion and digital-to-
analog conversion
• The Spartan 3E FPGA board has an ADC module with SPI
operation.
• 14-bit sample values are obtained from the ADC at
a rate of 25000 samples per second.
4.2. System Implementation on FPGA(3)
 Double Data Rate SDRAM
- ADC samples are stored in DDR SDRAM
temporarily before further processing.
- Burst mode 4 with burst length 2 (i.e., 64
bits) is written to the SDRAM per access.
- The Wishbone communication protocol is
used for communication with the DDR SDRAM.
4.2. System Implementation on FPGA(4)
 Framing and windowing
 ADC samples stored in the DDR are pre-
emphasized.
 50% overlapped frames with a frame
length of 512 samples are used.
 Fast Fourier Transform
 A 512-point radix-2 Fast Fourier Transform is
performed using a Xilinx LogiCORE IP core.
4.2. System Implementation on FPGA(5)
[Figure: FFT timing diagram]
4.2. System Implementation on FPGA(6)
 Mel-Spectrum
 Spectrum (linear scale) => mel spectrum
 Log calculation
 Natural log computed using look-up tables
 Input data: 24-bit
Output: 12-bit
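A rough sketch of the look-up-table log idea (Python, illustrative): precompute ln(x) over a grid of the 24-bit input range and quantize the result to 12 bits. The table size and the output scaling are assumptions; the board's actual LUT layout is not given on the slides:

```python
import numpy as np

LUT_BITS = 10                     # table index width (assumption)
MAX_IN = (1 << 24) - 1            # 24-bit input range, per the slides
# Precomputed table of natural logs over the input range
LUT = np.log(np.linspace(1, MAX_IN, 1 << LUT_BITS))

def lut_log(x):
    """Approximate ln(x) for a 24-bit input via table lookup,
    scaled so ln(MAX_IN) maps to the 12-bit maximum 4095."""
    idx = min((x * ((1 << LUT_BITS) - 1)) // MAX_IN, (1 << LUT_BITS) - 1)
    return int(round(LUT[idx] / np.log(MAX_IN) * 4095))
```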
4.2. System Implementation on FPGA(7)
Discrete Cosine Transform (DCT)
 DCT core from opencores.org
 Input: 1 bit
Output: 16-bit parallel
Universal Asynchronous Receiver
Transmitter (UART)
 Baud rate of 19.2 kbps
 Each MFCC (32 bits) is divided into four
8-bit components.
 Implemented on an unused jumper pin,
using the UART protocol via CDC.
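The 32-bit-to-four-bytes packing for the UART link can be sketched as follows (Python, illustrative; big-endian byte order is an assumption, as the slides only state that each 32-bit coefficient is sent as four 8-bit components):

```python
import struct

def mfcc_to_uart_bytes(coeff):
    """Split one signed 32-bit MFCC value into four 8-bit components
    for UART transmission (big-endian assumed)."""
    return list(struct.pack(">i", coeff))

def uart_bytes_to_mfcc(b):
    """Reassemble the four received bytes back into the int32 value."""
    return struct.unpack(">i", bytes(b))[0]
```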
4.3. Further processing in MATLAB
MFCCs are received in MATLAB in int32
format.
Training phase: MFCC feature vectors =>
Gaussian Mixture Model
Testing phase: MFCC feature vectors =>
posterior probability (recognition)
5. RESULTS AND ANALYSIS
5.1. Output in MATLAB
 Training data: 31 speakers (male - 20, female - 11)
 Training data length = 10-30 seconds
 Testing data length = 1-10 seconds
 No. of MFCCs = 8-20
 Up to 99% recognition when
training data length = 30 seconds,
testing data length = 10 seconds,
No. of MFCCs = 20
5.1. Output in MATLAB(2)
Amount of        Model order   Duration of Testing Speech
Training Speech      (M)       1 second   5 seconds   10 seconds
10 seconds            8         51.3%      75.5%       82.9%
                     13         60.3%      83.5%       88.4%
                     20         64.7%      85.1%       90.4%
20 seconds            8         67.3%      86.3%       93.6%
                     13         75.1%      95.1%       97.3%
                     20         78.3%      95.4%       97.4%
30 seconds            8         71.7%      95.5%       97.5%
                     13         79.2%      97.8%       98.5%
                     20         84.1%      98.1%       99.1%
 The largest increase in performance occurs when the training
data increases from 10 to 20 sec; increasing to 30 sec
improves performance only slightly.
 At most 30 sec of speech is needed to maintain high
performance.
 There is an abrupt change in performance when the testing
speech duration increases from 1 to 5 seconds, but only a
slight increase when it goes from 5
seconds to 10 seconds.
 Using more training data improves the performance.
5.1. Output in MATLAB(3)
 77% of unknown female voices are matched with a
female voice; 85% of unknown male voices are matched
with a male voice.
 During the experiments, 4 languages (English,
Nepali, Hindi, and German) gave correct speaker
recognition regardless of the spoken text and
language.
5.1. Output in MATLAB(4)
 Total Error Rate (TER) = FAR + FRR
 The threshold for speaker verification was
calculated empirically using the FAR and FRR.
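The empirical threshold selection can be sketched as follows (Python, illustrative; the candidate grid and the min-TER criterion are assumptions about how "empirically" was done):

```python
import numpy as np

def choose_threshold(genuine_scores, impostor_scores, candidates):
    """For each candidate threshold t, compute
    FRR = fraction of genuine scores rejected (score <= t) and
    FAR = fraction of impostor scores accepted (score > t),
    then pick the t minimizing TER = FAR + FRR."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best_t, best_ter = None, np.inf
    for t in candidates:
        frr = np.mean(genuine <= t)
        far = np.mean(impostor > t)
        if far + frr < best_ter:
            best_ter, best_t = far + frr, t
    return best_t, best_ter
```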
5.1. Output in MATLAB(5)
5.2. Output Analysis in FPGA
The recognition rate is less than that of the software
implementation.
Overall resource utilization in the FPGA:
i. RAMs: 7
ii. ROMs: 3
iii. Multipliers: 15
iv. Adders/Subtractors: 18
v. Counters: 9
vi. Registers: 132
vii. Comparators: 20
viii. Multiplexers: 238
5.2. Output Analysis in FPGA (2)
Device Utilization Summary
Logic utilization                               Used   Available  Utilization
Number of Slice Flip-Flops                      8225     9312        88%
Number of 4-input LUTs                          8734     9312        93%
Number of occupied Slices                       2355     4656        54%
Number of Slices containing only related logic  1325     1325       100%
Number of Slices containing unrelated logic        0     1325         0%
Total Number of 4-input LUTs                    8903     9312        94%
Number of bonded IOBs                            215      232        94%
Number of RAMB16s                                  7       20        35%
Number of BUFGMUXs                                 2       24         8%
Number of MULT18X18SIOs                           15       20        75%
Average Fanout of Non-Clock Nets                 272
6. APPLICATIONS
Security
• Forensics for voice sample matching
• Transaction authentication
• Toll fraud prevention
Information and physical facilities
• Telephone credit card purchases
• Remote time and attendance logging
• Information retrieval
• Audio indexing
• Voice dialing and voice mail
Monitoring
• Access control
• Access to confidential information areas
• Computer and data networks
• Remote access of computers
7. LIMITATIONS
 The duration of the speech signal limits the
performance.
 Intrusion based on voice imitation
cannot be detected.
 The optimal model order is not known.
 The silence removal process is not efficient.
8. PROBLEM FACED
 Limited resources on the Spartan 3E.
 Lack of sufficient block RAM & ROM memory.
 Synchronization problems between different
modules/components.
9. CONCLUSION
The system has been implemented using
MFCC for feature extraction and GMM to
model the speakers.
The performance of the software
implementation of the system is very good.
The implementation on the FPGA is not yet
satisfactory.
Noise reduction algorithms could be used to
improve the performance of the system.
THANK YOU