Recognising a person using voice – Automatic speaker recognition and AI
Bhusan Chettri gives an overview of the technology behind computer-based voice authentication.
So, what is Automatic Speaker Recognition?
Automatic speaker recognition is the task of recognising people from their voice using a
computer. It generally comprises two tasks: speaker identification and speaker verification.
Speaker identification involves finding the correct person from a given pool of known
speakers or voices. A speaker identification system usually has a set of N speakers who are already
registered in the system, and only these N speakers can have access to it. Speaker verification,
on the other hand, involves verifying whether a person is who they claim to be using their voice
sample.
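The difference between the two tasks can be sketched in a few lines. This is only an illustration under stated assumptions: each voice is represented by a small made-up vector, cosine similarity stands in for the real scoring function, and the names `identify`, `verify`, `enrolled` and the threshold of 0.8 are hypothetical, not taken from any system described here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(test_vec, enrolled):
    """Identification: pick the best-matching speaker among the N enrolled ones."""
    return max(enrolled, key=lambda name: cosine(test_vec, enrolled[name]))

def verify(test_vec, claimed, enrolled, threshold=0.8):
    """Verification: accept only if the score for the claimed identity passes a threshold."""
    return cosine(test_vec, enrolled[claimed]) >= threshold

# Toy voice representations (assumed values, for illustration only).
enrolled = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}
test_vec = [0.9, 0.1, 0.0]

best_match = identify(test_vec, enrolled)           # one-of-N decision
alice_claim = verify(test_vec, "alice", enrolled)   # yes/no decision
bob_claim = verify(test_vec, "bob", enrolled)
```

Identification always returns one of the enrolled speakers, whereas verification answers a yes/no question about a single claimed identity.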
These systems are further classified into two categories depending on the level of user cooperation: (1)
text-dependent and (2) text-independent. In a text-dependent application, the system has prior knowledge of
the spoken text and therefore expects the same utterance at test time (the deployment phase). For
example, a pass-phrase such as "My voice is my password" is used both during speaker enrollment
(registration) and during deployment (when the system is running). In contrast, text-independent
systems have no prior knowledge of the lexical content, and are therefore much
more complex than text-dependent ones.
So how does a speaker verification algorithm work? How are these systems trained and deployed?
Bhusan Chettri says: well, to build automatic speaker recognition systems, the first thing we need is
data: a large amount of speech collected from hundreds and thousands of speakers across
varied acoustic conditions. It helps to have pictures illustrating the methodology, as a picture is
worth a thousand words. The block diagram shown below summarises a typical speaker verification
system. It consists of a speaker enrollment phase (Fig a) and a speaker verification phase (Fig b). The role of
the feature extraction module is to transform the raw speech signal into a representation (features) that
retains speaker-specific attributes useful to the downstream components that build speaker models. The
enrollment phase comprises offline and online modes of building models. In the offline mode,
background models are trained on features computed from a large speech collection representing a
diverse population of speakers. The online mode builds a target speaker model using
features computed from the target speaker's speech. Training the target speaker model from scratch
is usually avoided, because learning reliable model parameters requires a sufficiently large amount of speech
data, which is usually not available for every individual speaker. To overcome this, the parameters of a
pretrained background model representing the speaker population are adapted using the speaker's data,
yielding a reliable estimate of the speaker model. During the speaker verification phase, for a given test speech
utterance, the claimed speaker's model and the background model (representing the world of all other
possible speakers) are used to derive a confidence score. The decision logic module then makes a binary
decision: it either accepts the claimed identity as a genuine speaker or rejects it as an impostor, based on
a decision threshold.
(a) Speaker enrollment phase. The goal here is to build speaker specific models by adapting a
background model which is trained on a large speech database.
(b) Speaker verification phase. For a given speech utterance the system obtains a verification score and
makes a decision whether to accept or reject the claimed identity.
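The scoring and decision steps of the verification phase can be sketched as follows. This is a minimal illustration, assuming one-dimensional features, Gaussian mixture models for both the claimed speaker and the background, an average log-likelihood ratio as the confidence score, and a threshold of 0.0. The model parameters and the function names (`avg_loglik`, `decide`) are made up for the example.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def avg_loglik(frames, gmm):
    """Average per-frame log-likelihood under a GMM given as (weights, means, variances)."""
    weights, means, variances = gmm
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)

def decide(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Return (score, accepted): log-likelihood ratio and the binary decision."""
    score = avg_loglik(frames, speaker_gmm) - avg_loglik(frames, background_gmm)
    return score, score >= threshold

# Assumed toy models: the speaker model differs from the background in one mean.
background = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
speaker = ([0.5, 0.5], [-1.0, 2.0], [1.0, 1.0])

genuine_score, genuine_ok = decide([1.9, 2.1, 2.0], speaker, background)
impostor_score, impostor_ok = decide([-1.0, -0.9, -1.1], speaker, background)
```

In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections; the value used here is only a placeholder.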
How has the state of the art changed, driven by big data and AI?
Bhusan Chettri explains that there has been a big paradigm shift in the way these systems are built. To
clarify this, Dr. Bhusan Chettri summarises the recent advances in the state of the art in two
broad categories: (1) traditional approaches and (2) deep learning (and big data) approaches.
Traditional methods. By traditional methods he refers to approaches built around a Gaussian mixture model
- universal background model (GMM-UBM), which dominated the automatic speaker verification (ASV)
literature until deep learning techniques became popular in the field. Mel-frequency cepstral coefficients
(MFCCs) were popular frame-level feature representations used in speaker verification. From short-term MFCC
feature vectors, utterance-level features such as i-vectors are often derived, and these have shown
state-of-the-art performance in speaker verification. Background models such as the universal background model
(UBM) and the total variability (T) matrix are learned in an offline phase using a large collection of speech
data. The UBM and T matrix are then used to compute i-vector representations (an i-vector is simply a
fixed-length vector representing a variable-length speech utterance). The training process involves learning
model (target or background) parameters from training data. As for modelling techniques, vector quantization
(VQ) was one of the earliest approaches used to represent a speaker, after which Gaussian mixture models
(GMMs), an extension of VQ methods, and support vector machines became popular methods for
speaker modelling. The traditional approach also includes training an i-vector extractor (GMM-UBM and
T matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA) backend for scoring.
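Adapting a background model to a target speaker, as in a GMM-UBM system, can be sketched as below. The text does not say which adaptation rule is used; this example assumes mean-only maximum a posteriori (MAP) adaptation with a relevance factor, which is one common choice. Features are one-dimensional, and the UBM parameters, the relevance factor of 16 and the function name `map_adapt_means` are illustrative assumptions.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (assumed adaptation rule)."""
    k_total = len(means)
    counts = [0.0] * k_total   # soft count of frames per component
    sums = [0.0] * k_total     # posterior-weighted sum of frames per component
    for x in frames:
        logp = [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        peak = max(logp)
        post = [math.exp(lp - peak) for lp in logp]
        norm = sum(post)
        for k in range(k_total):
            gamma = post[k] / norm  # posterior of component k for this frame
            counts[k] += gamma
            sums[k] += gamma * x
    adapted = []
    for k in range(k_total):
        # More speaker data assigned to a component pulls its mean further from the UBM.
        alpha = counts[k] / (counts[k] + relevance)
        data_mean = sums[k] / counts[k] if counts[k] > 0 else means[k]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[k])
    return adapted

# Assumed toy UBM with two components, and 50 enrollment frames near 2.0.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
enrollment_frames = [2.0] * 50
speaker_means = map_adapt_means(enrollment_frames, ubm_weights, ubm_means, ubm_vars)
```

Components that see little enrollment data stay close to the background model, which is why adaptation gives a usable speaker model even from a small amount of speech.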
Deep learning methods. In deep learning based approaches for ASV, features are often learned in a
data-driven manner directly from the raw speech signal or from an intermediate speech representation
such as filter bank energies. Handcrafted features, for example MFCCs, are also often used as input to train
deep neural network (DNN) based ASV systems. Conversely, features learned by DNNs are often used to build
traditional ASV systems. Researchers have used the output of the penultimate layer of a pre-trained
DNN as features to train a traditional i-vector PLDA setup (replacing i-vectors with DNN features).
Another common approach is to extract bottleneck features (the output of a hidden layer with a relatively
small number of units) from a DNN and use them to train a GMM-UBM system that scores with the
log-likelihood ratio. Utterance-level discriminative features, so-called embeddings, extracted from
pre-trained DNNs have become popular recently, demonstrating good results. End-to-end modelling approaches
have also been studied extensively in speaker verification, showing promising results. In this setting,
feature learning and model training are jointly optimised from the raw speech input. A wide range of neural
architectures has been studied for speaker verification, including feed-forward neural networks (commonly
referred to as deep neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural
networks, and attention models. Training background models in deep learning approaches can be thought
of as a pretraining phase in which network parameters are trained on a large dataset. Speaker models are
then derived by adapting the pretrained model parameters using speaker-specific data, much as
a traditional GMM-UBM system operates.
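One way to turn frame-level network outputs into a fixed-length, utterance-level embedding is to pool statistics over time. The sketch below assumes statistics pooling (concatenating the per-dimension mean and standard deviation), which is one widely used option rather than the method of any particular system mentioned here. The frame values and the function name `stats_pool` are hypothetical.

```python
import math

def stats_pool(frames):
    """Concatenate per-dimension mean and standard deviation over all frames."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
            for d in range(dim)]
    return means + stds

# Two utterances of different length (assumed 2-D frame-level features).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0], [3.5, 4.0], [4.5, 5.0]]

short_emb = stats_pool(short_utt)
long_emb = stats_pool(long_utt)
```

Both utterances map to vectors of the same length regardless of duration, which is what allows a fixed scoring backend to compare them.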
So, Dr. Bhusan Chettri, where is this technology being used? What are its applications?
These systems can be used across a wide range of domains, such as (a) access control: voice-based access
control systems; (b) banking applications, for authenticating a transaction; and (c) personalisation: in mobile
devices, or locking/unlocking a vehicle door (or starting/stopping the engine) for a specific user.
Are they safe and secure? Are they prone to any manipulation when they are deployed?
Bhusan Chettri further explains that although current advances in algorithms, aided by big
data, have shown remarkable state-of-the-art results, these systems are not 100% secure. They are prone
to spoofing attacks, in which an attacker manipulates a voice to sound like a registered user in order to gain
illegitimate access to the system. The ASV community has recently been devoting a significant amount of
research to this direction.

More Related Content

Similar to An overview of speaker recognition by Bhusan Chettri.pdf

Speaker identification under noisy conditions using hybrid convolutional neur...
Speaker identification under noisy conditions using hybrid convolutional neur...Speaker identification under noisy conditions using hybrid convolutional neur...
Speaker identification under noisy conditions using hybrid convolutional neur...IAESIJAI
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in androidAnshuli Mittal
 
IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in androidAnshuli Mittal
 
A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...IJECEIAES
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
 
B tech project_report
B tech project_reportB tech project_report
B tech project_reportabhiuaikey
 
IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED
 
A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...TELKOMNIKA JOURNAL
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...IOSR Journals
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...IOSR Journals
 
A Robust Speaker Identification System
A Robust Speaker Identification SystemA Robust Speaker Identification System
A Robust Speaker Identification Systemijtsrd
 
NYAI #5 - Fun With Neural Nets by Jason Yosinski
NYAI #5 - Fun With Neural Nets by Jason YosinskiNYAI #5 - Fun With Neural Nets by Jason Yosinski
NYAI #5 - Fun With Neural Nets by Jason YosinskiRizwan Habib
 
International journal of signal and image processing issues vol 2015 - no 1...
International journal of signal and image processing issues   vol 2015 - no 1...International journal of signal and image processing issues   vol 2015 - no 1...
International journal of signal and image processing issues vol 2015 - no 1...sophiabelthome
 
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Ahmed Ayman
 
Comparative Study of Different Techniques in Speaker Recognition: Review
Comparative Study of Different Techniques in Speaker Recognition: ReviewComparative Study of Different Techniques in Speaker Recognition: Review
Comparative Study of Different Techniques in Speaker Recognition: ReviewIJAEMSJORNAL
 
AI for voice recognition.pptx
AI for voice recognition.pptxAI for voice recognition.pptx
AI for voice recognition.pptxJhalakDashora
 

Similar to An overview of speaker recognition by Bhusan Chettri.pdf (20)

Speaker identification under noisy conditions using hybrid convolutional neur...
Speaker identification under noisy conditions using hybrid convolutional neur...Speaker identification under noisy conditions using hybrid convolutional neur...
Speaker identification under noisy conditions using hybrid convolutional neur...
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
 
Ijetcas14 426
Ijetcas14 426Ijetcas14 426
Ijetcas14 426
 
IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED-V2I2P5
IJSRED-V2I2P5
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
 
A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...A novel automatic voice recognition system based on text-independent in a noi...
A novel automatic voice recognition system based on text-independent in a noi...
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
 
B tech project_report
B tech project_reportB tech project_report
B tech project_report
 
IJSRED-V2I2P5
IJSRED-V2I2P5IJSRED-V2I2P5
IJSRED-V2I2P5
 
A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...
 
Speaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract FeaturesSpeaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract Features
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...
 
A Robust Speaker Identification System
A Robust Speaker Identification SystemA Robust Speaker Identification System
A Robust Speaker Identification System
 
NYAI #5 - Fun With Neural Nets by Jason Yosinski
NYAI #5 - Fun With Neural Nets by Jason YosinskiNYAI #5 - Fun With Neural Nets by Jason Yosinski
NYAI #5 - Fun With Neural Nets by Jason Yosinski
 
International journal of signal and image processing issues vol 2015 - no 1...
International journal of signal and image processing issues   vol 2015 - no 1...International journal of signal and image processing issues   vol 2015 - no 1...
International journal of signal and image processing issues vol 2015 - no 1...
 
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
 
Comparative Study of Different Techniques in Speaker Recognition: Review
Comparative Study of Different Techniques in Speaker Recognition: ReviewComparative Study of Different Techniques in Speaker Recognition: Review
Comparative Study of Different Techniques in Speaker Recognition: Review
 
50120140502007
5012014050200750120140502007
50120140502007
 
AI for voice recognition.pptx
AI for voice recognition.pptxAI for voice recognition.pptx
AI for voice recognition.pptx
 

Recently uploaded

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaRTTS
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 

Recently uploaded (20)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

An overview of speaker recognition by Bhusan Chettri.pdf

  • 1. Recognising a person using voice – Automatic speaker recognition and AI Bhusan Chettri gives an overview of the technology behind Voice Authentication using computer So, what is Automatic Speaker Recognition? Automatic Speaker Recognition is the task of recognizing humans through their voice by using a computer. Automatic Speaker Recognition generally comprises of two tasks: Speaker identification and Speaker verification. Speaker identification involves finding the correct person from a given pool of known speakers or voices. A speaker identification usually comprises of a set of N speakers who are already registered in the system and these N speakers can only have access to the system. Speaker verification on the other hand involves verifying whether a person is who he/she claims to be using their voice sample. These systems are further classified into two categories depending upon the level of user cooperation: (1) Text dependent (2) Text independent. In text dependent application, the system has prior knowledge of the spoken text and therefore expects same utterance during test time (or deployment phase). For example, pass-phrase such as "My voice is my password" will be used both during speaker enrollment (registration) and during deployment (when the system is running). On the contrary, in text independent systems there is no prior knowledge about the lexical contents, and therefore these systems are much more complex than text dependent ones. So how does the speaker verification algorithm work? How are they trained and deployed? Bhusan Chettri says: well, in order to build automatic speaker recognition systems first thing we need is data. Big amount of speech data collected from hundreds and thousands of speakers spoken across varied acoustic conditions. It would be nice to have pictures illustrating the methodology as pictures speak louder than thousand words. The block diagram shown below summarises a typical speaker verification system. 
It consists of speaker enrollment phase (Fig a) and speaker verification phase (Fig b). The role of a feature extraction module is to transform the raw speech signal into some representation (features) that retains speaker specific attributes useful to the downstream components in building speaker models. The enrollment phase comprises offline and online modes of building models. During the offline mode, background models are trained on features computed from a large speech collection representing a diverse population of speakers. The online phase comprises building a target speaker model using features computed from target speaker’s speech. Usually, training the target speaker model from scratch is avoided because learning reliable model parameters requires a sufficiently large amount of speech
  • 2. data,which is usually not available for every individual speaker. To overcome this, the parameters of a pretrained background model representing the speaker population are adapted using the speaker data yielding a reliable speaker model estimate. During the speaker verification phase, for a given test speech utterance, a claimed speaker’s model and the background model (representing the world of all other possible speakers) is used to derive a confidence score. The decision logic module then makes a binary decision: it either accepts the claimed identity as a genuine speaker or rejects it as an impostor based on some decision threshold. (a) Speaker enrollment phase. The goal here is to build speaker specific models by adapting a background model which is trained on a large speech database. (b) Speaker verification phase. For a given speech utterance the system obtains a verification score and makes a decision whether to accept or reject the claimed identity.
  • 3. How has the state-of-the-art changed and driven by big-data and AI? Bhusan Chettri explains that there has been a big paradigm shift in the way we build these systems. To bring clarity on this, Dr. Bhusan Chettri summarises the recent advancement in state-of-the-art in two broad categories. (1) Traditional approaches (2) Deep learning (and Big data) approaches. Traditional methods. By traditional methods he refers to approaches driven by a Gaussian mixture model - universal background model (GMM-UBM) that were adopted in the ASV literature until deep learning techniques became popular in the field. Mel-frequency cepstral coefficients (MFCCs) were popular frame- level feature representations used in speaker verification. Using short-term MFCC feature vectors, utterance level features such as i-vectors are often derived which have shown state-of-the-art performance in speaker verification. The background models such as the Universal back-ground model (UBM) and total variability (T) matrix are learned in an offline phase using a large collection of speech data. The UBM and T matrix are used in computing i-vector (this is just a fixed length vector representing a variable-length speech utterance) representations. The training process involves learning model (target or background) parameters from training data. As for modelling techniques, vector quantization (VQ) was one of the earliest approaches used to represent a speaker, after which Gaussian mixture models (GMMs), an extension to VQ methods, and Support vector machines became popular methods for speaker modelling. The traditional approach also includes training an i-vector extractor (GMM-UBM, T- matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA) backend for scoring. Deep learning methods. 
In deep learning based approaches for ASV, features are often learned in a data-driven manner directly from the raw speech signal or from intermediate speech representations such as filter bank energies. Handcrafted features, for example MFCCs, are also often used as input to train deep neural network (DNN) based ASV systems. Conversely, features learned by DNNs are often used to build traditional ASV systems. Researchers have used the output of the penultimate layer of a pretrained DNN as features to train a traditional i-vector PLDA setup (replacing i-vectors with DNN features). It is also common to extract bottleneck features (the output of a hidden layer with a relatively small number of units) from a DNN and use them to train a GMM-UBM system that scores with the log-likelihood ratio. Utterance-level discriminative features, so-called embeddings, extracted from pretrained DNNs have recently become popular and have demonstrated good results. End-to-end modelling approaches have also been studied extensively in speaker verification, showing promising results. In this setting, feature learning and model training are jointly optimised from the raw speech input. A wide range of neural architectures has been studied for speaker verification, including feed-forward neural networks (commonly referred to as deep neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural networks, and attention models. Training background models in deep learning approaches can be thought of as a pretraining phase in which network parameters are trained on a large dataset. Speaker models are then derived by adapting the pretrained model parameters using speaker-specific data, in much the same way as a traditional GMM-UBM system operates.
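To make the idea of utterance-level embeddings concrete, the sketch below scores a test embedding against a speaker model built by averaging enrollment embeddings. The article does not say which scoring backend is used with embeddings; cosine similarity is assumed here as one common choice. The 3-D vectors, the `verify` helper and the 0.7 threshold are hypothetical, and real embeddings would come from a pretrained DNN rather than being typed in by hand.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrollment_embeddings, test_embedding, threshold=0.7):
    """Return (score, accepted) for a claimed identity.

    The speaker model is assumed to be the mean of the enrollment embeddings.
    """
    speaker_model = mean_vector(enrollment_embeddings)
    score = cosine_similarity(speaker_model, test_embedding)
    return score, score >= threshold

# Hypothetical embeddings standing in for DNN outputs.
enrollment = [[0.9, 0.1, 0.2], [1.0, 0.0, 0.3], [0.8, 0.2, 0.1]]
same_speaker_test = [0.95, 0.05, 0.25]
other_speaker_test = [0.1, 0.9, -0.4]

same_score, same_accepted = verify(enrollment, same_speaker_test)
other_score, other_accepted = verify(enrollment, other_speaker_test)
print(same_accepted, other_accepted)
```

Averaging the enrollment embeddings is the simplest way to form a speaker model; a PLDA backend, as mentioned for the traditional pipeline, could replace the cosine score without changing the accept/reject logic.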
  • 4. So, Dr. Bhusan Chettri, tell us: where is this technology being used? What are its applications?
It can be used across a wide range of domains, such as (a) access control: voice-based access control systems; (b) banking: authenticating a transaction; (c) personalisation: in mobile devices, or in vehicles that lock/unlock the doors (or start/stop the engine) for a specific user.
Are they safe and secure? Are they prone to any manipulation when they are deployed?
Bhusan Chettri further explains that although current advances in algorithms, aided by big data, have shown remarkable state-of-the-art results, these systems are not 100% secure. They are prone to spoofing attacks, in which an attacker manipulates a voice to sound like a registered user in order to gain illegitimate access to the system. The ASV community has recently been promoting a significant amount of research in this direction.
References
[1] Bhusan Chettri scholar and personal website.
[2] M. Sahidullah et al. Introduction to Voice Presentation Attack Detection and Recent Advances, 2019.
[3] Bhusan Chettri. Voice biometric system security: Design and analysis of countermeasures for replay attacks. PhD thesis, Queen Mary University of London, August 2020.
[4] ASVspoof: The automatic speaker verification spoofing and countermeasures challenge website.