In this article, Bhusan Chettri provides an overview of voice authentication systems based on automatic speaker verification technology. He gives background on both traditional approaches to modelling speakers and current deep learning based approaches, and briefly introduces how these systems can be manipulated.
Recognising a person using voice – Automatic speaker recognition and AI
Bhusan Chettri gives an overview of the technology behind voice authentication using computers.
So, what is Automatic Speaker Recognition?
Automatic Speaker Recognition is the task of recognising humans through their voice using a
computer. It generally comprises two tasks: speaker identification and speaker verification.
Speaker identification involves finding the correct person from a given pool of known speakers or
voices. A speaker identification system usually has a set of N speakers who are already
registered in the system, and only these N speakers are granted access. Speaker verification,
on the other hand, involves verifying whether a person is who he/she claims to be from a voice
sample.
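The distinction between the two tasks can be sketched in a few lines of Python. This is a toy illustration only: the similarity scores below are made-up numbers standing in for whatever a real system's scoring backend would produce for each enrolled speaker.

```python
def identify(scores):
    """Speaker identification: pick the best-matching enrolled speaker
    out of the pool of N registered voices."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold):
    """Speaker verification: accept or reject a single claimed identity
    by comparing its score against a decision threshold."""
    return scores[claimed_id] >= threshold

# Hypothetical similarity scores for three enrolled speakers.
scores = {"alice": 0.82, "bob": 0.41, "carol": 0.57}

print(identify(scores))                      # best match among all N speakers
print(verify(scores, "bob", threshold=0.5))  # 1-vs-1 decision for one claim
```

Identification is an N-way search; verification is a binary accept/reject decision about one claimed identity.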
These systems are further classified into two categories depending upon the level of user cooperation: (1)
text dependent and (2) text independent. In a text dependent application, the system has prior knowledge of
the spoken text and therefore expects the same utterance at test time (the deployment phase). For
example, a pass-phrase such as "My voice is my password" is used both during speaker enrollment
(registration) and during deployment (when the system is running). In text independent
systems, by contrast, there is no prior knowledge of the lexical content, and these systems are therefore much
more complex than text dependent ones.
So how does a speaker verification system work? How is it trained and deployed?
Bhusan Chettri says: well, to build automatic speaker recognition systems, the first thing we need is
data — large amounts of speech collected from hundreds or thousands of speakers across varied
acoustic conditions. Since pictures speak louder than a thousand words, the block diagram shown below
summarises a typical speaker verification system. It consists of a speaker enrollment phase (Fig a) and a
speaker verification phase (Fig b). The role of the feature extraction module is to transform the raw
speech signal into a representation (features) that retains speaker-specific attributes useful to the
downstream components in building speaker models. The enrollment phase comprises offline and online
modes of building models. During the offline mode, background models are trained on features computed
from a large speech collection representing a diverse population of speakers. The online phase builds a
target speaker model using features computed from the target speaker's speech. Training the target
speaker model from scratch is usually avoided, because learning reliable model parameters requires a
sufficiently large amount of speech data, which is not available for every individual speaker. To overcome
this, the parameters of a pretrained background model representing the speaker population are adapted
using the speaker's data, yielding a reliable speaker model estimate. During the speaker verification
phase, for a given test speech utterance, the claimed speaker's model and the background model
(representing the world of all other possible speakers) are used to derive a confidence score. The
decision logic module then makes a binary decision: based on a decision threshold, it either accepts the
claimed identity as a genuine speaker or rejects it as an impostor.
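The verification-phase scoring described above can be sketched with numpy. This is a minimal, self-contained toy — single-component diagonal-covariance "GMMs" with hand-picked parameters and synthetic 2-D "features" stand in for real models trained on speech — but the decision logic (log-likelihood ratio between the claimed speaker's model and the background model, compared against a threshold) is the classic GMM-UBM recipe.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (K,); means, variances: (K, D)."""
    diff = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    log_comp = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_mix = log_comp + np.log(weights)                       # (T, K)
    m = log_mix.max(axis=1, keepdims=True)                     # log-sum-exp
    return float(np.mean(m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))))

def verify(frames, speaker_model, background_model, threshold=0.0):
    """Score = log-likelihood ratio between the claimed speaker's model
    and the background model; accept iff it exceeds the threshold."""
    llr = gmm_loglik(frames, *speaker_model) - gmm_loglik(frames, *background_model)
    return llr, llr >= threshold

# Toy models: the target speaker's features cluster around (1, 1); the
# background model is broad and centred at the origin.
speaker = (np.array([1.0]), np.array([[1.0, 1.0]]), np.array([[0.1, 0.1]]))
ubm = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))

rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.3, size=(50, 2))   # frames from the target speaker
llr, accepted = verify(genuine, speaker, ubm)
```

A genuine trial yields a positive log-likelihood ratio and is accepted; frames that match the background better than the claimed speaker give a negative ratio and are rejected.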
(a) Speaker enrollment phase. The goal here is to build speaker-specific models by adapting a
background model trained on a large speech database.
(b) Speaker verification phase. For a given speech utterance the system obtains a verification score and
decides whether to accept or reject the claimed identity.
How has the state of the art changed, driven by big data and AI?
Bhusan Chettri explains that there has been a big paradigm shift in the way these systems are built. To
bring clarity to this, Dr. Bhusan Chettri summarises recent advances in the state of the art in two
broad categories: (1) traditional approaches and (2) deep learning (and big data) approaches.
Traditional methods. By traditional methods he refers to approaches driven by the Gaussian mixture model
– universal background model (GMM-UBM) that were adopted in the ASV literature until deep learning
techniques became popular in the field. Mel-frequency cepstral coefficients (MFCCs) were the popular frame-
level feature representation used in speaker verification. From short-term MFCC feature vectors,
utterance-level features such as i-vectors are often derived, which have shown state-of-the-art
performance in speaker verification. Background models such as the universal background model
(UBM) and the total variability (T) matrix are learned in an offline phase using a large collection of speech
data. The UBM and T matrix are used to compute i-vector representations (fixed-length vectors
representing variable-length speech utterances). The training process involves learning model (target
or background) parameters from training data. As for modelling techniques, vector quantization (VQ) was
one of the earliest approaches used to represent a speaker, after which Gaussian mixture models
(GMMs), an extension of VQ methods, and support vector machines became popular methods for
speaker modelling. The traditional approach also includes training an i-vector extractor (GMM-UBM, T-
matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA) backend for scoring.
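The adaptation step at the heart of the GMM-UBM approach — deriving a target speaker model from the background model rather than training from scratch — can be sketched as classic relevance-MAP adaptation of the UBM component means. This is a schematic with toy 2-D data and a tiny two-component UBM, not a production recipe; real systems adapt UBMs with hundreds or thousands of components trained on real MFCCs.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, frames, relevance=16.0):
    """Relevance-MAP adaptation of UBM component means (means only).
    Each adapted mean interpolates between the UBM prior mean and the
    posterior-weighted average of the enrollment frames; components that
    see more enrollment data move further from the prior."""
    # Responsibility of each UBM component for each frame -> (T, K)
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_comp = -0.5 * (np.sum(diff**2 / ubm_vars, axis=2)
                       + np.sum(np.log(2 * np.pi * ubm_vars), axis=1))
    log_post = log_comp + np.log(ubm_weights)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n_k = post.sum(axis=0)                                    # soft counts (K,)
    ex_k = post.T @ frames / np.maximum(n_k[:, None], 1e-10)  # data means per component
    alpha = (n_k / (n_k + relevance))[:, None]                # adaptation coefficients
    return alpha * ex_k + (1.0 - alpha) * ubm_means

# Toy UBM with two components; the enrollment speech clusters near (1, 1),
# so only the first component's mean should move noticeably.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))

rng = np.random.default_rng(0)
enrollment = rng.normal(1.0, 0.2, size=(200, 2))
speaker_means = map_adapt_means(ubm_weights, ubm_means, ubm_vars, enrollment)
```

Components with little enrollment data stay close to the UBM prior, which is exactly why adaptation yields reliable speaker models from small amounts of speech.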
Deep learning methods. In deep learning based approaches to ASV, features are often learned in a data-
driven manner directly from the raw speech signal or from intermediate speech representations
such as filter bank energies. Handcrafted features, for example MFCCs, are also often used as input to train
deep neural network (DNN) based ASV systems, and features learned by DNNs are often used to build
traditional ASV systems. Researchers have used the output of the penultimate layer of a pre-trained
DNN as features to train a traditional i-vector PLDA setup (replacing i-vectors with DNN features).
Extracting bottleneck features (the output of a hidden layer with a relatively small number of units) from a
DNN to train a GMM-UBM system that uses the log-likelihood ratio for scoring is also common.
Utterance-level discriminative features, so-called embeddings, extracted from pre-trained DNNs have
recently become popular, demonstrating good results. End-to-end modelling approaches, in which
feature learning and model training are jointly optimised from the raw speech input, have also been
extensively studied in speaker verification and show promising results. A wide range of neural architectures
has been studied for speaker verification, including feed-forward neural networks (commonly
referred to as deep neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural
networks, and attention models. Training background models in deep learning approaches can be thought
of as a pretraining phase in which network parameters are trained on a large dataset. Speaker models are
then derived by adapting the pretrained model parameters using speaker-specific data, much as a
traditional GMM-UBM system operates.
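The embedding idea can be made concrete with a schematic numpy sketch: a frame-level nonlinearity followed by temporal average pooling maps a variable-length utterance to a fixed-length embedding, and two utterances are compared by cosine similarity. Everything here is a placeholder — the weights are random rather than trained, and the "frames" are synthetic — whereas a real system would use a trained x-vector-style network; only the shape of the computation is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Untrained placeholder weights: 20-dim frames -> 64 hidden -> 32-dim embedding.
W1 = rng.normal(size=(20, 64))
W2 = rng.normal(size=(64, 32))

def embed(frames):
    """Map a (T, 20) frame sequence to a fixed 32-dim utterance embedding:
    frame-wise nonlinearity, temporal average pooling, linear projection,
    then length normalisation."""
    h = np.tanh(frames @ W1)      # frame-level hidden activations (T, 64)
    pooled = h.mean(axis=0)       # utterance-level statistic (64,)
    e = pooled @ W2               # embedding (32,)
    return e / np.linalg.norm(e)

def cosine_score(e1, e2):
    """Cosine similarity; embeddings are already unit-norm."""
    return float(e1 @ e2)

# Synthetic "speakers": each speaker's frames cluster around its own offset.
spk_a_utt1 = rng.normal(2.0, 0.1, size=(100, 20))
spk_a_utt2 = rng.normal(2.0, 0.1, size=(100, 20))
spk_b_utt = rng.normal(-2.0, 0.1, size=(100, 20))

same = cosine_score(embed(spk_a_utt1), embed(spk_a_utt2))
diff = cosine_score(embed(spk_a_utt1), embed(spk_b_utt))
```

Two utterances from the same synthetic "speaker" score higher than utterances from different ones, which is the property a trained embedding network is optimised to deliver on real speech.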
So, Dr. Bhusan Chettri, where is this technology being used? What are its applications?
It can be used across a wide range of domains, such as (a) access control: voice-based access
control systems; (b) banking: authenticating a transaction; and (c) personalisation: in mobile
devices, or locking/unlocking a vehicle door (engine start/stop) for a specific user.
Are they safe and secure? Are they prone to manipulation once deployed?
Bhusan Chettri further explains that although current algorithms, with the aid of big data, have
shown remarkable state-of-the-art results, these systems are not 100% secure. They are prone
to spoofing attacks, where an attacker manipulates a voice to sound like a registered user and gain
illegitimate access to the system. A significant amount of research along this direction has recently
been promoted by the ASV community.