In this article Dr. Bhusan Chettri provides an overview of how voice authentication systems can be compromised through spoofing attacks. He adds: "A spoofing attack refers to the process of making an unauthorised attempt to break into someone else's authentication system, either by using synthetic voices produced through AI technology, by performing mimicry, or by simply replaying pre-recorded voice samples of the target user."
Voice authentication systems: are they secure? Can AI be used to fool them?
Bhusan Chettri explains how voice authentication systems can be fooled using AI and how they
can be protected
Although today’s speaker verification systems, driven by deep learning and big data, show superior performance in verifying a speaker, they are not secure: they are prone to spoofing attacks. In this article Dr. Bhusan Chettri gives an overview of the technology used for spoofing a voice authentication system that uses automatic speaker verification (ASV) technology.
Spoofing attacks in ASV: an overview by Dr Bhusan Chettri
A spoofing attack (or presentation attack) involves illegitimate access to the personal data of a targeted user. These attacks are performed on a biometric system to provoke an increase in its false acceptance rate. The security threats posed by such attacks are now well acknowledged within the speech community. As identified in the ISO/IEC 30107-1 standard, a biometric system can potentially be attacked at nine different points; Fig. 1 provides a summary. The first two attack points are of specific interest, as they are particularly vulnerable in terms of enabling an adversary to inject spoofed biometric data. These two points are commonly referred to as physical access (PA) and logical access (LA) attacks. As illustrated in the figure, PA attacks involve a presentation attack at the sensor (the microphone, in the case of ASV), while LA attacks involve injecting modified biometric samples so as to bypass the sensor. Text-to-speech and voice conversion techniques are used to produce artificial speech that bypasses an ASV system; both are examples of LA attacks. Mimicry and playing back speech recordings (replay), on the other hand, are examples of PA attacks.
Figure 1: Possible locations [ISO/IEC, 2016] to attack an ASV system. 1: microphone point; 2: transmission point; 3: override feature extractor; 4: modify features; 5: override classifier; 6: modify speaker database; 7: modify biometric reference; 8: modify score; 9: override decision.
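The nine attack points and the PA/LA grouping discussed above can be captured in a small lookup table. The sketch below is illustrative only: the labels are paraphrased from the figure caption, and the PA/LA tags follow the text's classification of points 1 and 2.

```python
# Sketch: the nine ISO/IEC 30107-1 attack points from Figure 1. Points 1
# and 2 are tagged physical access (PA) and logical access (LA) as in the
# text; the remaining points attack the system's internals.
ATTACK_POINTS = {
    1: ("microphone point", "PA"),    # presentation attack at the sensor
    2: ("transmission point", "LA"),  # inject samples, bypassing the sensor
    3: ("override feature extractor", None),
    4: ("modify features", None),
    5: ("override classifier", None),
    6: ("modify speaker database", None),
    7: ("modify biometric reference", None),
    8: ("modify score", None),
    9: ("override decision", None),
}

def attack_class(point):
    """Return 'PA', 'LA', or None for a given attack point number."""
    return ATTACK_POINTS[point][1]

print(attack_class(1), attack_class(2))  # PA LA
```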
Below, Bhusan Chettri provides a brief summary of the four spoofing methods used to fool an ASV system.
1. Mimicry (or Impersonation)
This form of attack involves an attacker modifying their voice characteristics to sound like a target speaker: the attacker aims to transform their lexical and prosodic properties to sound as close as possible to the target. This form of attack can therefore be highly effective when the attacker’s voice is already similar to the target speaker’s, since less effort is required to adjust it than when the two voices differ substantially. In other words, the success of mimicry attacks depends on the quality of the impersonated voice, suggesting that professional impersonators may be better at mimicking a target speaker’s voice than inexperienced ones. Research has shown that successful attackers were able to shift their F0 (fundamental frequency), and sometimes their formants, close to those of the target speaker.
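The F0 mentioned above is routinely estimated from short speech frames; a minimal autocorrelation-based sketch (plain Python, with a toy sine wave standing in for a voiced speech frame) might look like this:

```python
import math

def estimate_f0(signal, sample_rate, f0_min=70.0, f0_max=400.0):
    """Estimate fundamental frequency by finding the autocorrelation peak
    within the plausible pitch lag range (a simple textbook method, not a
    production pitch tracker)."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(signal[i] * signal[i - lag] for i in range(lag, len(signal)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag

sr = 8000
# Toy "voiced frame": a pure 150 Hz tone, 100 ms long.
frame = [math.sin(2 * math.pi * 150 * n / sr) for n in range(800)]
print(estimate_f0(frame, sr))  # close to 150 Hz
```

A real mimicry analysis would run such an estimator frame by frame over the attacker's and target's utterances and compare the resulting pitch contours.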
2. Speech synthesis
Speech synthesis, or text-to-speech (TTS), is a method for generating speech from a given text input that sounds as natural and intelligible as possible. It has a wide range of applications, including spoken dialogue systems, speech-to-speech translation, assisting people with vocal disorders, and automatic e-book reading, to name a few. Text analysis and speech waveform generation are the two main components of a typical TTS system. The text analysis component analyses the input text and produces a sequence of phonemes defining the linguistic specification of the text. Using these phonemes, the speech waveform generation module produces the speech waveform. In end-to-end deep learning frameworks, however, speech waveforms are generated directly from the input text.
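The two-stage structure described above can be sketched in a few lines. Everything here is a toy placeholder: the lexicon and per-phoneme tones are invented for illustration, not a real grapheme-to-phoneme model or vocoder; only the pipeline shape (text → phonemes → waveform) mirrors the text.

```python
import math

# Toy lexicon and "acoustic model": each phoneme maps to a sine-tone pitch.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONE_PITCH = {"HH": 120.0, "AH": 140.0, "L": 160.0, "OW": 180.0,
               "W": 130.0, "ER": 150.0, "D": 110.0}

def text_analysis(text):
    """Stage 1: map each word to its phoneme sequence via the toy lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, []))
    return phonemes

def waveform_generation(phonemes, sr=8000, dur=0.05):
    """Stage 2: render each phoneme as a short sine tone (placeholder for a
    real vocoder or neural waveform generator)."""
    samples = []
    for ph in phonemes:
        f = PHONE_PITCH[ph]
        samples += [math.sin(2 * math.pi * f * n / sr)
                    for n in range(int(sr * dur))]
    return samples

phones = text_analysis("hello world")
audio = waveform_generation(phones)
print(len(phones), len(audio))  # 8 phonemes -> 8 * 400 = 3200 samples
```

An end-to-end system, as noted above, would collapse both stages into one learned model from characters to samples.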
3. Voice conversion
Voice conversion (VC) aims at converting the voice of one speaker into that of another. In the context of ASV spoofing, the source voice is the attacker’s, which is converted into that of a target speaker to fool an ASV system. Typical VC systems operate directly on the speech signals of the source and target speakers, using a parallel corpus of the two speakers (speaking the same utterances) on which a transformation function is learned to convert the attacker’s acoustic parameters into those of the target speaker. Applications of VC technology include producing natural-sounding voices for people with speech disabilities and voice dubbing in the entertainment industry, to name a few.
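One of the simplest transformation functions of the kind described above is a mean-variance transform of a scalar acoustic parameter (such as F0) learned from parallel data. The sketch below uses invented toy numbers and is only a baseline illustration, not the method of any particular VC system.

```python
from statistics import mean, stdev

def fit_mean_var_transform(source_vals, target_vals):
    """Learn y = mu_t + (sigma_t / sigma_s) * (x - mu_s): a simple baseline
    mapping a scalar acoustic parameter (e.g. F0) from a source (attacker)
    speaker's distribution to a target speaker's distribution."""
    mu_s, sd_s = mean(source_vals), stdev(source_vals)
    mu_t, sd_t = mean(target_vals), stdev(target_vals)
    return lambda x: mu_t + (sd_t / sd_s) * (x - mu_s)

# Toy parallel data: per-frame F0 values (Hz) for the same utterance.
attacker_f0 = [110, 115, 120, 125, 130]
target_f0 = [200, 210, 220, 230, 240]
convert = fit_mean_var_transform(attacker_f0, target_f0)
print(convert(120.0))  # the attacker's mean F0 maps to the target's mean: 220.0
```

Full VC systems learn much richer mappings (e.g. over spectral envelopes), but the idea is the same: a function fitted on parallel data that pushes the source speaker's parameters toward the target's.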
4. Replay attacks
A replay spoofing attack involves playing back recorded speech samples of a target (enrolled) speaker to bypass an ASV system. This type of attack requires physical transmission of the spoofed speech through the system microphone, shown as point 1 in Fig. 1. Replay is the simplest form of spoofing attack: it can be implemented with a smartphone and requires no specific expertise in speech processing or machine learning. Bonafide (genuine) speech is speech spoken by a target speaker during enrollment (or the verification phase) and acquired by the ASV system’s microphone. Replayed speech, on the other hand, is the signal obtained by playing back pre-recorded bonafide speech, which is then acquired by the system’s microphone. The acoustic environments in which the bonafide speech and the replayed speech are acquired can be the same, in situations where an attacker manages to launch the attack from the same physical space. In practice, however, the acoustic space is usually different (e.g. a different closed room or office with no background noise), as an attacker would not want to risk getting caught while launching such an attack. Therefore, the factors of interest in detecting replay attacks are the changes and noise induced in the bonafide speech by the loudspeaker of the playback device, the recording device, and the acoustic environment in which the replay attack is mounted.
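The replay channel just described (loudspeaker, then room, then recording noise) is often modelled as a chain of convolutions. A minimal sketch, with invented toy impulse responses standing in for measured device and room responses:

```python
import random

def convolve(x, h):
    """Linear convolution: pass signal x through a channel with impulse
    response h (output length is len(x) + len(h) - 1)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def simulate_replay(bonafide, loudspeaker_ir, room_ir, noise_level=0.001):
    """Sketch of the replay channel: bonafide speech passes through the
    playback loudspeaker, then the acoustic environment, and picks up a
    little recording noise. The impulse responses are toy placeholders."""
    y = convolve(convolve(bonafide, loudspeaker_ir), room_ir)
    return [s + random.uniform(-noise_level, noise_level) for s in y]

bonafide = [0.0, 1.0, 0.5, -0.5, 0.0]
replayed = simulate_replay(bonafide, [1.0, 0.3], [1.0, 0.0, 0.2])
print(len(bonafide), len(replayed))  # the channel smears the signal in time
```

It is exactly these convolutive and additive distortions that replay countermeasures try to pick up on.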
It is therefore very important to secure these systems against manipulation. For this, spoofing countermeasure solutions are often integrated within the verification pipeline, and voice spoofing countermeasures are currently an active research topic within the speech community. In the next article, Dr Bhusan Chettri will discuss how AI and big data can be used to design anti-spoofing solutions that protect voice authentication systems from spoofing attacks.
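One common way to integrate a countermeasure into the verification pipeline is a cascade: the countermeasure (CM) screens the input first, and only inputs it deems bonafide reach the speaker comparison. The decision logic below is a hedged sketch; the score conventions and thresholds are placeholders that would be tuned on development data, not values from any particular system.

```python
def tandem_decision(cm_score, asv_score, cm_threshold=0.5, asv_threshold=0.5):
    """Sketch of a cascaded (tandem) pipeline: higher scores mean 'more
    bonafide' for the CM and 'more likely the claimed speaker' for ASV.
    The input is accepted only if both stages pass."""
    if cm_score < cm_threshold:      # CM flags the input as spoofed
        return "reject (spoof)"
    if asv_score < asv_threshold:    # genuine speech, but wrong speaker
        return "reject (impostor)"
    return "accept"

print(tandem_decision(0.9, 0.8))  # accept
print(tandem_decision(0.2, 0.9))  # reject (spoof)
print(tandem_decision(0.9, 0.1))  # reject (impostor)
```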
References
[1] Bhusan Chettri, scholar and personal website.
[2] M. Sahidullah et al. Introduction to Voice Presentation Attack Detection and Recent Advances, 2019.
[3] Bhusan Chettri. Voice biometric system security: Design and analysis of countermeasures for replay attacks. PhD thesis, Queen Mary University of London, August 2020.
[4] ASVspoof: The automatic speaker verification spoofing and countermeasures challenge website.