Recognising a person using voice – Automatic speaker recognition and AI
Bhusan Chettri gives an overview of the technology behind voice authentication using computers.
So, what is Automatic Speaker Recognition?
Automatic Speaker Recognition is the task of recognizing humans through their voice using a computer. It generally comprises two tasks: speaker identification and speaker verification. Speaker identification involves finding the correct person from a given pool of known speakers or voices. A speaker identification system usually comprises a set of N speakers who are already registered in the system, and only these N speakers have access to the system. Speaker verification, on the other hand, involves verifying whether a person is who he/she claims to be, using a sample of their voice.
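To make the distinction concrete, here is a minimal sketch in Python. The enrolled models, the utterance representation, and the score function are all hypothetical placeholders rather than any particular system: identification searches for the best match among N enrolled speakers, while verification makes a binary decision about a single claimed identity.

```python
import numpy as np

def score(model, utterance):
    # Placeholder similarity score; a real system would use GMM
    # log-likelihoods or embedding similarity (see later sections).
    return float(np.dot(model, utterance))

def identify(enrolled_models, utterance):
    # Identification: return the best-matching speaker among the N enrolled.
    return max(enrolled_models, key=lambda name: score(enrolled_models[name], utterance))

def verify(claimed_model, utterance, threshold):
    # Verification: accept or reject a single claimed identity.
    return score(claimed_model, utterance) >= threshold
```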
These systems are further classified into two categories depending on the level of user cooperation: (1) text-dependent and (2) text-independent. In a text-dependent application, the system has prior knowledge of the spoken text and therefore expects the same utterance at test time (the deployment phase). For example, a pass-phrase such as "My voice is my password" is used both during speaker enrollment (registration) and during deployment (when the system is running). In text-independent systems, by contrast, there is no prior knowledge of the lexical content, and these systems are therefore much more complex than text-dependent ones.
So how do speaker verification algorithms work? How are they trained and deployed?
Bhusan Chettri says: well, to build automatic speaker recognition systems, the first thing we need is data: a large amount of speech collected from hundreds or thousands of speakers across varied acoustic conditions. Since a picture speaks louder than a thousand words, the block diagram shown below summarises a typical speaker verification system. It consists of a speaker enrollment phase (Fig. a) and a speaker verification phase (Fig. b). The role of the feature extraction module is to transform the raw speech signal into a representation (features) that retains speaker-specific attributes useful to the downstream components in building speaker models. The enrollment phase comprises offline and online modes of model building. During the offline mode, background models are trained on features computed from a large speech collection representing a diverse population of speakers. The online mode comprises building a target speaker model using features computed from the target speaker's speech. Training the target speaker model from scratch is usually avoided, because learning reliable model parameters requires a sufficiently large amount of speech data, which is usually not available for every individual speaker. To overcome this, the parameters of a pretrained background model representing the speaker population are adapted using the speaker's data, yielding a reliable speaker model estimate. During the speaker verification phase, for a given test speech utterance, the claimed speaker's model and the background model (representing the world of all other possible speakers) are used to derive a confidence score. The decision logic module then makes a binary decision: it either accepts the claimed identity as a genuine speaker or rejects it as an impostor, based on some decision threshold.
(a) Speaker enrollment phase. The goal here is to build speaker-specific models by adapting a background model trained on a large speech database.
(b) Speaker verification phase. For a given speech utterance, the system obtains a verification score and decides whether to accept or reject the claimed identity.
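The scoring and decision steps just described are often implemented as a log-likelihood ratio between the claimed speaker's model and the background model. The sketch below assumes both are scikit-learn GaussianMixture models trained on frame-level features such as MFCCs; the threshold is an application-dependent operating point, not a universal constant.

```python
from sklearn.mixture import GaussianMixture

def llr_score(features, speaker_gmm, ubm):
    # Average per-frame log-likelihood ratio: how much better the
    # claimed speaker's model explains the utterance than the
    # background model does. features: (n_frames, n_dims) array.
    return speaker_gmm.score(features) - ubm.score(features)

def decide(features, speaker_gmm, ubm, threshold=0.0):
    # Decision logic: accept the claimed identity if the score
    # exceeds the operating threshold, otherwise reject as impostor.
    return llr_score(features, speaker_gmm, ubm) >= threshold
```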
How has the state of the art changed, driven by big data and AI?
Bhusan Chettri explains that there has been a big paradigm shift in the way we build these systems. To bring clarity, Dr. Bhusan Chettri summarises recent advances in the state of the art in two broad categories: (1) traditional approaches and (2) deep learning (and big data) approaches.
Traditional methods. By traditional methods he refers to approaches driven by the Gaussian mixture model - universal background model (GMM-UBM), which were adopted in the ASV literature until deep learning techniques became popular in the field. Mel-frequency cepstral coefficients (MFCCs) were the popular frame-level feature representation used in speaker verification. From short-term MFCC feature vectors, utterance-level features such as i-vectors are often derived, and these have shown state-of-the-art performance in speaker verification. Background models such as the universal background model (UBM) and the total variability (T) matrix are learned in an offline phase using a large collection of speech data. The UBM and T matrix are then used to compute i-vector representations (an i-vector is simply a fixed-length vector representing a variable-length speech utterance). The training process involves learning model (target or background) parameters from training data. As for modelling techniques, vector quantization (VQ) was one of the earliest approaches used to represent a speaker, after which Gaussian mixture models (GMMs), an extension of VQ methods, and support vector machines became popular methods for speaker modelling. The traditional approach also includes training an i-vector extractor (GMM-UBM, T-matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA) backend for scoring.
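As an illustration of the classic GMM-UBM recipe described above, the sketch below trains a UBM on pooled MFCCs from many speakers and then derives a target speaker model by mean-only MAP adaptation. It is a simplified sketch, not a production implementation: it assumes librosa and scikit-learn, diagonal covariances, adaptation of the means only, and an arbitrary relevance factor of 16.

```python
import copy
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=20):
    # Frame-level MFCCs, returned as an (n_frames, n_mfcc) array.
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_ubm(feature_list, n_components=64):
    # Offline phase: fit the background model on features pooled
    # from a large, diverse population of speakers.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return ubm.fit(np.vstack(feature_list))

def map_adapt_means(ubm, features, relevance=16.0):
    # Online phase: adapt only the UBM means toward the target
    # speaker's enrollment data (mean-only MAP adaptation).
    speaker = copy.deepcopy(ubm)
    post = ubm.predict_proba(features)          # frame posteriors, (T, C)
    n_k = post.sum(axis=0)                      # soft frame counts per component
    f_k = post.T @ features                     # first-order statistics, (C, D)
    alpha = (n_k / (n_k + relevance))[:, None]  # per-component adaptation weight
    e_k = f_k / np.maximum(n_k, 1e-8)[:, None]  # posterior-weighted data means
    speaker.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
    return speaker
```

A test utterance is then scored against the adapted speaker model and the UBM with the log-likelihood ratio shown earlier.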
Deep learning methods. In deep learning based approaches to ASV, features are often learned in a data-driven manner, directly from the raw speech signal or from intermediate speech representations such as filter bank energies. Handcrafted features, for example MFCCs, are also often used as input to train deep neural network (DNN) based ASV systems, and features learned by DNNs are often used to build traditional ASV systems. Researchers have used the output of the penultimate layer of a pre-trained DNN as features to train a traditional i-vector PLDA setup (replacing i-vectors with DNN features). Extracting bottleneck features (the output of a hidden layer with a relatively small number of units) from a DNN to train a GMM-UBM system, which uses the log-likelihood ratio for scoring, is also common. Utterance-level discriminative features, the so-called embeddings extracted from pre-trained DNNs, have recently become popular and demonstrate good results. End-to-end modelling approaches, in which feature learning and model training are jointly optimised from the raw speech input, have also been extensively studied in speaker verification and show promising results. A wide range of neural architectures has been studied for speaker verification, including feed-forward neural networks (commonly referred to as deep neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural networks, and attention models. Training background models in deep learning approaches can be thought of as a pretraining phase in which network parameters are trained on a large dataset. Speaker models are then derived by adapting the pretrained model parameters using speaker-specific data, much as a traditional GMM-UBM system operates.
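To illustrate embedding-based scoring, the sketch below uses a toy PyTorch encoder as a stand-in for a pre-trained speaker-embedding network (a real system would load trained weights, for example an x-vector style model). Frame-level activations are average-pooled over time into a fixed-length utterance embedding, and verification compares the enrollment and test embeddings by cosine similarity against a threshold; the architecture and threshold here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    # Stand-in for a pre-trained speaker encoder; untrained and
    # purely illustrative of the architecture pattern.
    def __init__(self, n_feats=20, emb_dim=128):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_feats, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embed = nn.Linear(256, emb_dim)

    def forward(self, feats):          # feats: (batch, n_feats, n_frames)
        h = self.frame_layers(feats)   # frame-level activations
        pooled = h.mean(dim=2)         # temporal average pooling
        return self.embed(pooled)      # fixed-length utterance embedding

def cosine_verify(encoder, enrol_feats, test_feats, threshold=0.5):
    # Score a trial by cosine similarity between the two embeddings.
    with torch.no_grad():
        score = F.cosine_similarity(encoder(enrol_feats), encoder(test_feats))
    return score.item(), score.item() >= threshold
```

In practice, raw cosine scores are usually calibrated, or replaced by a PLDA backend, before a decision threshold is applied.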
So, Dr. Bhusan Chettri, tell us: where are these technologies being used? What are their applications?
These systems can be used across a wide range of domains, such as (a) access control: voice-based access control systems; (b) banking: authenticating a transaction; and (c) personalisation: locking and unlocking mobile devices or a vehicle door (engine start/stop) for a specific user.
Are they safe and secure? Are they prone to manipulation once deployed?
Bhusan Chettri further explains that although current algorithms, aided by big data, have shown remarkable state-of-the-art results, these systems are not 100% secure. They are prone to spoofing attacks, in which an attacker manipulates a voice to sound like a registered user in order to gain illegitimate access to the system. The ASV community has recently promoted a significant amount of research in this direction.