Voice Impersonation Using Generative Adversarial Networks review

Voice Impersonation using Generative
Adversarial Networks
Yang Gao, Rita Singh, Bhiksha Raj
Electrical and Computer Engineering Department, Carnegie Mellon
University.
Arxiv Date: 19, Feb. 2018.
Conference: ICASSP 2018
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
25, Sep. 2019.

2021-01-09
Overview of the paper
• In voice impersonation, the resultant voice must convincingly convey
the impression of having been naturally produced by the target speaker,
mimicking not only the pitch and other perceivable signal qualities, but
also the style of the target speaker
• In this paper, they propose a novel neural-network based speech quality
and style mimicry framework for the synthesis of impersonated voices
– Framework: built upon a fast and accurate GAN model
• Generating a synthetic spectrogram from which the time-domain signal
is reconstructed using the Griffin-Lim method
• Given spectrographic representations of source and target speaker’s
voices, the model learns to mimic the target speaker’s voice quality and
style, regardless of the linguistic content of either’s voice.

2021-01-09
Overview of the paper
• Summarize
– This paper, given X is one of gender’s speech, given Y the other’s
speech
– Goal is change X voice to Y voice, regardless of the linguistic
content of either’s voice
– They use GAN, however, their model is more close to DiscoGAN
• They find some shortcomings of the existing DiscoGAN model and modified
them to make VoiceGAN

2021-01-09
Related Works: Generative Adversarial
Networks
• The original GAN model comprises a generator 𝐺(𝑧) and discrimina-
tor 𝐷(𝑥)
• The generator 𝐺 takes as input a random variable 𝑧 drawn from
some probability distribution function 𝑃𝑧, and produces an output
vector 𝑥 𝑧

2021-01-09
Related Works: GAN
• Discriminator D() attempts to discriminate between sample 𝑥~𝑃𝑥
that are drawn from 𝑃𝑥, the true (but unknown) distribution we aim
to model, and samples produced by the Generator 𝐺
• Let T represent the event that a vector 𝑥 was drawn from 𝑃𝑥, the
discriminator attemps to compute the a 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟𝑖 probability of
𝐷 𝑥 = 𝑃(𝑇|𝑥)

2021-01-09
Related Works: GAN
• 𝑚𝑖𝑛
𝐺
𝑚𝑎𝑥
𝐷
𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃𝑥
[𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧
[𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )]
– 𝑚𝑎𝑥
𝐷
𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥)
[𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))]
– 𝑚𝑖𝑛
𝐺
𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )]
• Appendix
– 𝐸 𝑥~𝑃 𝑥
--> x is sampled from real data
– 𝐸𝑧~𝑃𝑧
--> z is sampled from fake data(Noise=z)
– D(x) --> probability of D(real)
– 1 − 𝐷 𝑥 𝑧  probability of D(fake)
– [log 𝐷(𝑥)] --> likelihood of D(real)
– [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake)
Recognize real images better
Recognize generated images better
Optimize G that can fool the discriminator the most

2021-01-09
Related Works: GAN
• 𝑚𝑖𝑛
𝐺
𝑚𝑎𝑥
𝐷
𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃 𝑥
[𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )]
– 𝑚𝑎𝑥
𝐷
𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥)
[𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))]
– 𝑚𝑖𝑛
𝐺
𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )]
• Appendix
– 𝐸 𝑥~𝑃𝑥
--> x is sampled from real data
– 𝐸𝑧~𝑃𝑧
--> z is sampled from fake data(Noise=z)
– D(x) --> probability of D(real)
– 1 − 𝐷 𝑥 𝑧  probability of D(fake)
– [log 𝐷(𝑥)] --> likelihood of D(real)
– [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake)
• To get the better result of GAN,
– Generator: 𝐷 𝑥 𝑧  should minimize
– Discriminator: 𝐷 𝑥  should maximizing
• Conclusion
– G and D is adversarial
– We often define GAN as a minimax game with G wants to minimize V while D wants to maximize it
Recognize real images better
Recognize generated images better
Optimize G that can fool the discriminator the most

2021-01-09
Related Works: Style transfer by GAN
• Input data instance (usually an image) 𝑥A drawn from a distribution
𝑃A is 𝑡𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑑 to an instance 𝑥AB by a generator (more aptly
called a “transformer”), 𝐺AB
• The aim of the transformer is to convert 𝑥A into the style of the
variable 𝑥B which natively occurs with the distribution 𝑃B

2021-01-09
Related Works: Style transfer by GAN
• The discriminator 𝐷B attempts to distinguish between genuine draws of 𝑥B from
𝑃B and instances 𝑥 𝐴𝐵 obtained by transforming draws of 𝑥 𝐴 from 𝑃A
• Style transfer optimizations is achieved as follows:
• 𝐿 𝐺 = 𝐸 𝑥 𝐴~𝑃 𝐴
log 1 − 𝐷 𝐵 𝑥 𝐴𝐵
• 𝐿 𝐷 = −𝐸 𝑥 𝐵~𝑃 𝐵
log 𝐷 𝐵 𝑥 𝐵 − 𝐸 𝑥 𝐴~𝑃 𝐴
[log(1 − 𝐷 𝐵 𝑥 𝐴𝐵 ]
• The generator 𝐺 is updated by minimizing the “generator loss” 𝐿 𝐺, while the
discriminator 𝐷 is updated to minimize the “discriminator loss” 𝐿 𝐷

2021-01-09
Related Works: DiscoGAN
• DiscoGAN is a symmetric model which attempts to transform two
categories of data, 𝐴 and 𝐵, into each other
• DiscoGAN Includes 2 Generator
– 𝐺AB: draw 𝑥A from 𝑃A of 𝐴 into 𝑥AB = 𝐺AB 𝑥A
– 𝐺BA: draw 𝑥B from 𝑃B of 𝐴 into 𝑥BA = 𝐺BA 𝑥B
– Inverse relationship with each other.
• The goal of 𝐺AB is that the product of 𝐺AB(𝑥AB) cannot be distinguished
from the distribution 𝑃B of 𝐵

2021-01-09
• 𝐺AB and 𝐺BA must be inverses of each other to the extent possible
• For any 𝑥A from 𝐴,
– 𝑥 𝐴𝐵𝐴 = 𝐺 𝐵𝐴(𝐺 𝐴𝐵(𝑥 𝐴))
– must be close to the original 𝑥 𝐴
• For any 𝑥 𝐵 from 𝐵,
– 𝑥 𝐵𝐴𝐵 = 𝐺 𝐴𝐵(𝐺 𝐵𝐴(𝑥 𝐵))
– must be close to the original 𝑥 𝐵

2021-01-09
• It also includes two discriminators, 𝐷A and 𝐷B
• 𝐷A attempts to discriminate between draws from 𝑃A and draws from
𝑃B that have been transformed by 𝐺BA
• 𝐷B performs the analogous operations for draws from 𝑃B
• The 𝐺 and 𝐷 must all be jointly trained.
• DiscoGAN is a symmetric model which attempts to transform two
categories of data, 𝐴 and 𝐵, into each other

2021-01-09
• 𝐿 𝐺 𝐴
= −𝐸 𝑥 𝐵~𝑃 𝐵
[𝑙𝑜𝑔𝐷𝐴(𝐺 𝐵𝐴(𝑥 𝐵))]
• 𝐿 𝐺 𝐵
= −𝐸 𝑥 𝐴~𝑃 𝐴
[𝑙𝑜𝑔𝐷 𝐵(𝐺 𝐴𝐵(𝑥 𝐴))]
• This requirement is encoded through two reconstruction losses 𝐿 𝐶𝑂𝑁𝑆𝑇A
and
𝐿 𝐶𝑂𝑁𝑆𝑇B
– 𝐿 𝐶𝑂𝑁𝑆𝑇A
= 𝑑(𝐺 𝐵𝐴(𝐺 𝐴𝐵(𝑥 𝐴)), 𝑥 𝐴)
– 𝐿 𝐶𝑂𝑁𝑆𝑇B
= 𝑑(𝐺AB(𝐺BA(𝑥B)), 𝑥 𝐵)
• In DiscoGAN, 𝑑 is MSE
𝐺 𝐵𝐴(𝐺 𝐴𝐵(𝑥 𝐴))
𝐺 𝐴𝐵(𝐺 𝐵𝐴(𝑥 𝐵))
𝐿 𝐺 𝐴
𝐿 𝐺 𝐵

2021-01-09
• The 2 generator losses are defined as
– 𝐿 𝐺𝐴𝑁 𝐴𝐵
= 𝐿 𝐺 𝐵
+ 𝐿 𝐶𝑂𝑁𝑆𝑇 𝐴
– 𝐿 𝐺𝐴𝑁 𝐵𝐴
= 𝐿 𝐺 𝐴
+ 𝐿 𝐶𝑂𝑁𝑆𝑇 𝐵
• 𝐿 𝐺 = 𝐿 𝐺 𝐴𝐵
+ 𝐿 𝐺 𝐵𝐴
 𝐿 𝐺 𝐵
+ 𝐿 𝐺 𝐴
• 2 discriminator losses are defined as
– 𝐿 𝐷 𝐴
= −𝐸 𝑥 𝐴~𝑃 𝐴
𝑙𝑜𝑔 𝐷𝐴 𝑥 𝐴 − 𝐸 𝑥 𝐵~𝑃 𝐵
[𝑙𝑜𝑔(1 − 𝐷𝐴 𝐺 𝐵𝐴(𝑥 𝐵 ))]
– 𝐿 𝐷 𝐵
= −𝐸 𝑥 𝐵~𝑃 𝐵
𝑙𝑜𝑔 𝐷 𝐵 𝑥 𝐵 − 𝐸 𝑥 𝐴~𝑃 𝐴
[𝑙𝑜𝑔(1 − 𝐷 𝐵 𝐺 𝐴𝐵(𝑥 𝐴 ))]
• 𝐿 𝐷 = −𝐸 𝑥 𝐴~𝑃 𝐴
𝑙𝑜𝑔 𝐷𝐴 𝑥 𝐴 − 𝐸 𝑥 𝐵~𝑃 𝐵
[𝑙𝑜𝑔(1 − 𝐷𝐴 𝐺 𝐵𝐴(𝑥 𝐵 ))] −
𝐸 𝑥 𝐵~𝑃 𝐵
𝑙𝑜𝑔 𝐷 𝐵 𝑥 𝐵 − 𝐸 𝑥 𝐴~𝑃 𝐴
[𝑙𝑜𝑔(1 − 𝐷 𝐵 𝐺 𝐴𝐵(𝑥 𝐴 ))]  𝐿 𝐷A
+ 𝐿 𝐷B
𝐿 𝐺𝐴𝑁 𝐴𝐵
𝐿 𝐺𝐴𝑁 𝐵𝐴

2021-01-09
Proposed Model: VoiceGAN
• DiscoGAN was originally designed to transform style in images
• In order to apply the model to speech, first, convert it to an invertible,
picture-like representation, namely a spectrogram
• They propose VoiceGAN which incorporated all these modifications
– Original DiscoGAN was designed to operate on images of fixed size. For it to work with
inherently variable-sized speech signal, this constraint must be relaxed in its new design
– It is important to ensure that the linguistic information in the speech signal is not lost
– Their objective is to modify specific aspects of the speech, e.g. style, so they add extra
components to their model to achieve this

2021-01-09
VoiceGAN
• DiscoGAN reconstruction loss
– 𝐿 𝐶𝑂𝑁𝑆𝑇A
= 𝑑(𝐺 𝐵𝐴(𝐺 𝐴𝐵(𝑥 𝐴)), 𝑥 𝐴) = 𝑑 𝑥 𝐴𝐵𝐴, 𝑥 𝐴
– 𝐿 𝐶𝑂𝑁𝑆𝑇 𝐵
= 𝑑(𝐺 𝐴𝐵(𝐺 𝐵𝐴(𝑥 𝐵)), 𝑥 𝐵) = 𝑑 𝑥 𝐵𝐴𝐵, 𝑥 𝐵
• VoiceGAN reconstruction loss
– 𝐿 𝐶𝑂𝑁𝑆𝑇 𝐴
= 𝛼𝑑 𝑥 𝐴𝐵𝐴, 𝑥 𝐴 + 𝛽𝑑(𝑥 𝐴𝐵, 𝑥 𝐴)
= 𝛼𝑑 𝑥 𝐵𝐴𝐵, 𝑥 𝐵 + 𝛽𝑑(𝑥 𝐵𝐴, 𝑥 𝐵)
– For retain the linguistic information

2021-01-09
VoiceGAN
• VoiceGAN reconstruction loss
– 𝐿 𝐶𝑂𝑁𝑆𝑇 𝐴
= 𝛼𝑑 𝑥 𝐴𝐵𝐴, 𝑥 𝐴 + 𝛽𝑑(𝑥 𝐴𝐵, 𝑥 𝐴)
= 𝛼𝑑 𝑥 𝐵𝐴𝐵, 𝑥 𝐵 + 𝛽𝑑(𝑥 𝐵𝐴, 𝑥 𝐵)
– For retain the linguistic information
• 𝑑(𝑥 𝐴𝐵, 𝑥 𝐴)
– This loss attemps to retain the structure of 𝑥 𝐴 even after it
has been converted to 𝑥 𝐴𝐵
• 𝛼, 𝛽
– Accurate reconversion and retention of linguistic
information after conversion
– In this paper, they not open this parameter. Just “Careful
choice of 𝛼, 𝛽 ensures both”

2021-01-09
VoiceGAN
• Same as DiscoGAN generator
• Their proposed discriminator
– Adaptive pooling layer is added after CNN layers and
before the fully connected layer
– It includes channel-wise pooling
– This converts any variable-sized feature map into a vector
of a fixed number of dimensions

2021-01-09
Style Embedding Model (𝐷𝑆)
• They add Style Discriminator with speech label
– 𝐿 𝐷 𝑆𝑇𝑌𝐿𝐸−𝐴
= 𝑑 𝐷𝑆 𝑥 𝐴 , 𝑙𝑎𝑏𝑒𝑙 𝐴 + 𝑑 𝐷𝑆 𝑥 𝐴𝐵 , 𝑙𝑎𝑏𝑒𝑙 𝐵 + 𝑑(𝐷𝑆 𝑥 𝐴𝐵𝐴 , 𝑙𝑎𝑏𝑒𝑙 𝐴)
– 𝐿 𝐷 𝑆𝑇𝑌𝐿𝐸−𝐵
= 𝑑 𝐷𝑆 𝑥 𝐵 , 𝑙𝑎𝑏𝑒𝑙 𝐵 + 𝑑 𝐷𝑆 𝑥 𝐵𝐴 , 𝑙𝑎𝑏𝑒𝑙 𝐴 + 𝑑(𝐷𝑆 𝑥 𝐵𝐴𝐵 , 𝑙𝑎𝑏𝑒𝑙 𝐵)
• 𝐿 𝐷 𝑆𝑇𝑌𝐿𝐸
= 𝐿 𝐷 𝑆𝑇𝑌𝐿𝐸−𝐴
+ 𝐿 𝐷 𝑆𝑇𝑌𝐿𝐸−𝐵

2021-01-09
Final loss
• DiscoGAN Final loss
– Generator: 𝐿 𝐺 𝐵
+ 𝐿 𝐺 𝐴
– Discriminator: 𝐿 𝐷A
+ 𝐿 𝐷B
• VoiceGAN Final loss
– Generator: 𝐿 𝐺 𝐵
+ 𝐿 𝐺 𝐴
– Discriminator: 𝐿 𝐷A
+ 𝐿 𝐷B
+ 𝐿 𝐷STYLE

2021-01-09
Experiment - Dataset
• They use TIDIGITS dataset
– 326 speakers: 111 men, 114 women, 50 boys, 51 girls
– Each speaker reads 77 digit sentences
– Sampling rate: 16kHz
– Style: Gender
– Utterances are consist of counting numbers
– Using Spectrogram (maybe mel-scale filter bank spectrogram)

2021-01-09
Model Architecture
• Generator
– 6-layer CNN encoder and a 6-layer transposed CNN decoder
• Discriminator
– 7-layer CNN with adaptive pooling
• Employ BN and leaky ReLU activations in both networks (similar to DiscoGAN)
• Number of filters in each layer is an increasing power of 2 (32, 64, 128)

2021-01-09
Results
• https://nbviewer.jupyter.org/github/Yolanda-Gao/Spectrogram-
GAN/blob/master/VoiceGAN%20result.ipynb?flush_cache=true

Voice Impersonation Using Generative Adversarial Networks review

Recommended

Recommended

More Related Content

Similar to Voice Impersonation Using Generative Adversarial Networks review

Similar to Voice Impersonation Using Generative Adversarial Networks review (20)

More from June-Woo Kim

More from June-Woo Kim (7)

Recently uploaded

Recently uploaded (20)

Voice Impersonation Using Generative Adversarial Networks review

Editor's Notes