1) Voice biometrics refers to technology that recognises individuals by their unique vocal characteristics. It is commonly used in automatic speaker verification (ASV) systems.
2) ASV systems have two phases: a training phase, where a universal voice template is built from diverse data and adapted to create speaker-specific models, and a verification phase, where a test voice is matched against templates to verify identity.
3) While voice biometrics provides advantages such as password-free access, there are growing security concerns because artificially generated voices can spoof these systems; researchers are working on countermeasures to address this.
Bhusan Chettri explains how your unique VOICE can be used for automatic authentication, and the challenges to the security of voice authentication systems

· Dr. Bhusan Chettri, a Ph.D. graduate from Queen Mary University of London (QMUL), explains the science behind voice biometrics and the different types of voice biometric systems

· Spoofing attacks on voice biometric systems: a growing concern regarding their security
Voice biometrics, in simple words, refers to technology used to automatically recognise a person from a sample of their voice. Every person possesses a unique vocal apparatus, and therefore the characteristics of an individual's voice are distinct. This is one of the key reasons voice has been widely adopted as a means of person authentication across the globe. In this article, Dr Bhusan Chettri explains the basics of voice biometrics and discusses the growing concern about its security against fake voices generated using computers and AI technology.
Voice biometric systems are commonly referred to as automatic speaker verification (ASV) systems. Two key steps are required to build such a system using a computer.

Training phase: this involves building a universal voice template, a.k.a. a speaker template (or model), using large amounts of voice samples collected from people of different cultural backgrounds and ethnicities, and from different regions across the world. The more data collected under diverse environmental conditions from a large speaker population, the better the universal template, because with such diverse data the template can capture and represent the general voice patterns of speakers across the world. The voice template (also referred to as a speaker model) is simply a large table (or matrix) of numbers learned during training, such that each number in the table represents some meaningful information about the speaker which the computer understands but which is hard for humans to interpret. As illustrated in Figure 1, top row, this step is often called the offline training phase.
Figure 1. Training phase. The goal here is to build speaker-specific models by adapting a background model which is trained on a large speech database.
Here, the feature extraction step simply gathers relevant information from the speech samples of speakers, and these features are used to build the voice template. The training step then applies a computer algorithm to the extracted features to learn patterns across different voices. As a result, this step produces the so-called background model, which is nothing but the universal speaker template representing the whole speaker population. The next key step in the training phase is building a speaker-specific model, or voice template, for a designated speaker, making use of the universal speaker template. One interesting point to note is that this step, also called speaker (or voice) registration, does not require a huge amount of voice samples from the specific target speaker; indeed, it would be impractical to collect thousands of hours of speech from a single speaker. This is why a universal speaker template is created first and then adapted to build each speaker-specific template. What this means is that, using a small fragment of voice samples (usually 5-10 seconds, or a minute of speech), the large table (the universal voice template) is adjusted to represent the specific speaker. It should also be noted that this speaker registration often happens on the fly. For example, in a voice-based banking application, the system often asks users to speak a certain phrase, such as "my voice is my password", a couple of times. What is happening here is that the universal voice template is being adjusted to suit the user's voice pattern. Once this is successful, a voice template (model) for the specific user has been created.
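The adaptation step described above can be sketched in code. This is a minimal illustration rather than a production recipe: it assumes the universal template is just a single mean vector of speech features, whereas real ASV systems adapt richer models (for example, Gaussian mixtures or neural embeddings); the function name and the relevance factor are illustrative.

```python
import numpy as np

def adapt_template(universal_mean, enrol_features, relevance=16.0):
    # Relevance-MAP style update: nudge the universal template toward
    # the statistics of the speaker's short enrolment sample. With few
    # enrolment frames the result stays close to the universal template;
    # with many frames it moves toward the speaker's own average.
    n = len(enrol_features)
    speaker_mean = enrol_features.mean(axis=0)
    alpha = n / (n + relevance)          # data-vs-prior weight
    return alpha * speaker_mean + (1.0 - alpha) * universal_mean

# Toy universal template (3-dimensional feature space).
ubm_mean = np.zeros(3)
# A few seconds of enrolment speech -> a handful of feature frames.
enrol = np.array([[1.0, 2.0, 0.5],
                  [1.2, 1.8, 0.7],
                  [0.8, 2.2, 0.6]])
speaker_model = adapt_template(ubm_mean, enrol, relevance=3.0)
print(speaker_model)  # lies between the universal mean and the enrolment mean
```

With three enrolment frames and `relevance=3.0`, the adapted template sits exactly halfway between the universal mean and the enrolment mean, mirroring how a short "my voice is my password" recording is enough to specialise the large pre-trained template.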
Verification phase: the second step in voice biometrics is the speaker verification phase. Here, the system accepts a test speech sample as input and extracts relevant features from it. The system then matches this new voice against the voice template of the claimed speaker (which was created during the training phase), producing a number (score) that indicates the degree of match observed. It also uses the universal voice template to score the new voice. Finally, the score difference between the speaker's voice template and the universal voice template (called the log-likelihood ratio in ASV terminology) is used as the final score to decide whether to accept or reject the claimed identity. A higher score difference usually corresponds to a higher probability that the new voice sample belongs to the claimed identity. This process is illustrated in Figure 2.
Figure 2. Speaker verification phase. For a given speech utterance the system obtains a verification score and makes a decision whether to accept or reject the claimed identity.
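The log-likelihood-ratio decision can be illustrated with a toy example. Here each template is reduced to a single Gaussian mean with unit covariance, so the ratio becomes a simple difference of squared distances; the names and numbers below are hypothetical, not taken from any real system.

```python
import numpy as np

def log_likelihood(frames, mean):
    # Per-frame Gaussian log-likelihood with unit covariance;
    # the normalisation constant cancels in the ratio, so it is dropped.
    return -0.5 * np.sum((frames - mean) ** 2, axis=1)

def verify(test_frames, speaker_mean, ubm_mean, threshold=0.0):
    # Average log-likelihood ratio between the claimed speaker's template
    # and the universal template; accept if it clears the threshold.
    llr = np.mean(log_likelihood(test_frames, speaker_mean)
                  - log_likelihood(test_frames, ubm_mean))
    return llr, llr > threshold

speaker_mean = np.array([1.0, 2.0])    # claimed speaker's template
ubm_mean = np.array([0.0, 0.0])        # universal template
genuine = np.array([[1.1, 1.9], [0.9, 2.1]])      # near the speaker
impostor = np.array([[-0.2, 0.1], [0.1, -0.1]])   # near the background

print(verify(genuine, speaker_mean, ubm_mean))   # high LLR -> accept
print(verify(impostor, speaker_mean, ubm_mean))  # low LLR -> reject
```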
Types of ASV systems. Depending upon the level of user cooperation, ASV systems are often classified into two types: text-dependent and text-independent systems. In text-dependent applications, the system has prior knowledge of the text being spoken and therefore expects the same utterance when the biometric system is accessed by the user. An example of this scenario is banking applications. On the contrary, in text-independent systems there are no such restrictions: users can speak any phrase during registration and while accessing the system. An example of this is forensic applications, where users may not cooperate by speaking the phrase they are asked to during interrogations.
Bhusan Chettri further elucidated: now, one interesting question that might pop up in the reader's mind concerns the usage of this technology. Where is it used? What are its applications?
Applications

ASV systems can be used in a wide range of applications across different domains.

1. Access control: controlling access to electronic devices and other facilities using voice.
2. Speaker diarization: identifying who spoke when.
3. Forensic applications: matching voice templates with pre-recorded voices of criminals.
4. Retrieval of customer information in call centres using voice indexing.
5. Surveillance applications.
Advantages

There are many advantages to using this technology. One interesting advantage is that with voice biometrics, users no longer have to worry about remembering long, complex passwords. By simply speaking the unlock phrase (for example, "my voice is my password"), users can access the application (for example, a banking app or personalised digital accessories).
Common errors in ASV
Like any other computer systems (or machine learning models) ASV systems can make
mistakes while it is up and running. There are two types of common errors it can make:
false acceptance and false rejection. False acceptance means that the system has falsely
accepted an unknown (or unregistered) speaker. False rejection is an error which refers to
a situation where the system rejects the true speaker. This may happen in cases for
example where a user attempts to access the voice biometrics in very noisy conditions
(with severe background noises), and therefore the system becomes incon
fi
dent in
recognising the speaker’s voice
.
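The two error types can be made concrete in a few lines of code: given trial scores and ground-truth labels, count impostor scores that clear the decision threshold (false acceptances) and genuine scores that do not (false rejections). The scores and threshold below are invented for illustration.

```python
def error_rates(scores, labels, threshold):
    # labels: True for genuine-speaker trials, False for impostor trials.
    genuine = [s for s, g in zip(scores, labels) if g]
    impostors = [s for s, g in zip(scores, labels) if not g]
    far = sum(s > threshold for s in impostors) / len(impostors)  # false acceptance rate
    frr = sum(s <= threshold for s in genuine) / len(genuine)     # false rejection rate
    return far, frr

scores = [2.1, 1.7, 0.4, -0.3, -1.2, 0.9]
labels = [True, True, True, False, False, False]
far, frr = error_rates(scores, labels, threshold=0.5)
print(far, frr)  # one impostor accepted, one genuine trial rejected
```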
How good is voice biometrics? Evaluation metrics

"To decide whether the trained biometric system is good or not, an evaluation metric is required. The commonly used metric in ASV is the Equal Error Rate (EER). The EER corresponds to the operating point where the false acceptance and false rejection rates are the same. For this to happen, the decision threshold for accepting or rejecting a speaker is carefully adjusted during training (and this adjustment varies across application domains)," Bhusan explained. Researchers and ASV system developers aim to minimise these error rates: the lower the EER, the better the ASV system.
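Finding the EER can be sketched as a sweep over candidate thresholds, keeping the point where the two error rates come closest. This is a simple finite-data approximation (real evaluations interpolate the error curves); the scores below are made up.

```python
def equal_error_rate(scores, labels):
    # Sweep thresholds over the observed scores and return the operating
    # point where false acceptance and false rejection are closest.
    genuine = [s for s, g in zip(scores, labels) if g]
    impostors = [s for s, g in zip(scores, labels) if not g]
    best_gap, best_eer, best_thr = None, None, None
    for t in sorted(set(scores)):
        far = sum(s > t for s in impostors) / len(impostors)
        frr = sum(s <= t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, t
    return best_eer, best_thr

scores = [2.5, 1.8, 0.2, 0.7, -0.9, -1.6]  # higher = more speaker-like
labels = [True, True, True, False, False, False]
eer, thr = equal_error_rate(scores, labels)
print(eer, thr)
```

At the returned threshold, exactly one impostor is accepted and one genuine trial is rejected, so both error rates meet at one in three, which is the EER on this toy data.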
Security of voice biometrics: a growing concern

One of the key problems with the use of voice biometric applications is the growing concern about their security. With recent advancements in technology, there are commercial applications (available online) capable of producing voices that sound as natural as if spoken by a real human. For human ears it is very difficult to detect whether a voice was created using computer algorithms. Therefore, fraudsters and attackers aim to launch spoofing attacks on voice biometrics in order to gain illegitimate access to someone else's voice biometric accounts (say, a banking application, with an aim to steal money). However, researchers like Andrew Ng, Bhusan Chettri, Alexis Conneau, Edward Chang, Demis Hassabis and more in the speech community have also been working hard on the design and development of spoofing countermeasures, with an aim to safeguard voice biometrics from fraudulent access. The next article, a follow-up to this one, will explain more about spoofing attacks on voice biometrics and the mechanisms and algorithms used to counter such attacks.