Voice Impersonation using Generative
Adversarial Networks
Yang Gao, Rita Singh, Bhiksha Raj
Electrical and Computer Engineering Department, Carnegie Mellon
University.
arXiv date: Feb. 19, 2018.
Conference: ICASSP 2018
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
25, Sep. 2019.
2021-01-09
Overview of the paper
• In voice impersonation, the resultant voice must convincingly convey
the impression of having been naturally produced by the target speaker,
mimicking not only the pitch and other perceivable signal qualities, but
also the style of the target speaker
• In this paper, they propose a novel neural-network based speech quality
and style mimicry framework for the synthesis of impersonated voices
– Framework: built upon a fast and accurate GAN model
• Generating a synthetic spectrogram from which the time-domain signal
is reconstructed using the Griffin-Lim method
• Given spectrographic representations of source and target speaker’s
voices, the model learns to mimic the target speaker’s voice quality and
style, regardless of the linguistic content of either’s voice.
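The Griffin-Lim step above can be sketched in plain numpy. This is a toy illustration of the algorithm's idea (alternate between the time domain and the frequency domain, keeping the given magnitudes and re-estimating phase), not the paper's implementation; the window and hop sizes are arbitrary choices.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    # Hann-windowed short-time Fourier transform, shape (frames, n_fft//2 + 1).
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=256, hop=64):
    # Least-squares inverse: windowed overlap-add, normalized by the window power.
    win = np.hanning(n_fft)
    n = hop * (spec.shape[0] - 1) + n_fft
    out, norm = np.zeros(n), np.zeros(n)
    for k, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[k * hop:k * hop + n_fft] += frame * win
        norm[k * hop:k * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-9)

def griffin_lim(mag, n_iter=50, n_fft=256, hop=64):
    # Start from random phase; each iteration keeps the target magnitudes
    # and replaces the phase with that of the current time-domain estimate.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

# Recover a 440 Hz test tone from its magnitude spectrogram alone.
t = np.arange(4096) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(stft(x))
y = griffin_lim(mag)
```

The recovered signal matches the target spectrogram closely even though all phase information was discarded, which is why a GAN only needs to produce a plausible magnitude spectrogram.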
Overview of the paper
• Summary
– Let X be speech from a speaker of one gender, and Y speech from a speaker of the other
– The goal is to convert the X voice into the Y voice, regardless of the linguistic content of either voice
– They use a GAN; their model is closest to DiscoGAN
• They identify shortcomings of the existing DiscoGAN model and modify it to create VoiceGAN
Related Works: Generative Adversarial
Networks
• The original GAN model comprises a generator G(z) and a discriminator D(x)
• The generator G takes as input a random variable z drawn from some probability distribution P_z, and produces an output vector x_z
Related Works: GAN
• The discriminator D() attempts to discriminate between samples x ~ P_x drawn from P_x, the true (but unknown) distribution we aim to model, and samples produced by the generator G
• Let T represent the event that a vector x was drawn from P_x; the discriminator attempts to compute the a posteriori probability D(x) = P(T|x)
Related Works: GAN
• min_G max_D V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]
– max_D V(D) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]
→ recognize real images and generated images better
– min_G V(G) = E_{z~P_z(z)}[log(1 − D(G(z)))]
→ optimize G so that it fools the discriminator the most
• Appendix
– E_{x~P_x}: x is sampled from the real data
– E_{z~P_z}: z is sampled from the noise distribution (the "fake" input z)
– D(x): probability D assigns to a real sample being real
– 1 − D(G(z)): probability D assigns to a generated sample being fake
– log D(x): log-likelihood of D on real data
– log(1 − D(G(z))): log-likelihood of D on generated data
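The value function V(D, G) can be computed directly on a toy 1-D problem. This is an illustrative numpy sketch with a hand-picked, fixed discriminator (not a trained one); it shows that a generator whose samples match the real data drives V down, which is exactly what min over G asks for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup: real data x ~ N(0, 1); the "generator" maps noise z to G(z).
x_real = rng.normal(0.0, 1.0, size=10000)
z = rng.normal(0.0, 1.0, size=10000)

def value_fn(d, x_real, x_fake):
    # V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    eps = 1e-9
    return np.mean(np.log(d(x_real) + eps)) + np.mean(np.log(1.0 - d(x_fake) + eps))

# Two hypothetical generators: one far from the data, one matching it exactly.
g_bad = lambda z: z + 4.0   # fake samples land far from the real mode
g_good = lambda z: z        # fake samples are indistinguishable from real ones

# A fixed discriminator that scores samples near the real mode as "real".
d = lambda x: 1.0 / (1.0 + np.exp(np.abs(x) - 2.0))

v_bad = value_fn(d, x_real, g_bad(z))
v_good = value_fn(d, x_real, g_good(z))
```

Since the generator minimizes V, the distribution-matching generator achieves the lower value: its fakes get high D scores, making the log(1 − D) term strongly negative.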
Related Works: GAN
• min_G max_D V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]
– max_D V(D) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]
– min_G V(G) = E_{z~P_z(z)}[log(1 − D(G(z)))]
• To train the GAN well,
– Generator: should make D(G(z)) large, i.e. minimize log(1 − D(G(z)))
– Discriminator: should make D(x) large on real data and D(G(z)) small, i.e. maximize V
• Conclusion
– G and D are adversarial
– We often define a GAN as a minimax game in which G wants to minimize V while D wants to maximize it
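The alternating minimax training can be demonstrated end-to-end on a 1-D toy problem. This sketch is an assumption-laden illustration (scalar shift generator, logistic discriminator, hand-derived gradients), not the paper's training procedure:

```python
import numpy as np

# 1-D toy: real data ~ N(3, 1); generator G(z) = z + theta with z ~ N(0, 1).
# D(x) = sigmoid(w*x + b). Alternate gradient steps: D ascends V, G descends
# the non-saturating generator loss -log D(G(z)).
rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

theta, w, b = 0.0, 0.1, 0.0
lr = 0.05
history = []
for _ in range(3000):
    x_real = rng.normal(3.0, 1.0, 256)
    x_fake = rng.normal(0.0, 1.0, 256) + theta

    # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
    p_real = sigmoid(w * x_real + b)
    p_fake = sigmoid(w * x_fake + b)
    w -= lr * (np.mean((p_real - 1.0) * x_real) + np.mean(p_fake * x_fake))
    b -= lr * (np.mean(p_real - 1.0) + np.mean(p_fake))

    # Generator step: minimize -log D(fake); gradient w.r.t. theta.
    p_fake = sigmoid(w * x_fake + b)
    theta -= lr * np.mean((p_fake - 1.0) * w)
    history.append(theta)

# The generator's shift theta oscillates around, and averages near, the real mean 3.
theta_avg = np.mean(history[-500:])
```

The discriminator keeps pushing fake samples toward the real distribution; once the two match, its gradient signal vanishes and training settles near the equilibrium.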
Related Works: Style transfer by GAN
• An input data instance (usually an image) x_A drawn from a distribution P_A is transformed into an instance x_AB by a generator (more aptly called a "transformer") G_AB
• The aim of the transformer is to convert x_A into the style of the variable x_B, which natively occurs with the distribution P_B
Related Works: Style transfer by GAN
• The discriminator D_B attempts to distinguish between genuine draws of x_B from P_B and instances x_AB obtained by transforming draws of x_A from P_A
• Style transfer is achieved by optimizing the following losses:
– L_G = E_{x_A~P_A}[log(1 − D_B(x_AB))]
– L_D = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(x_AB))]
• The generator G is updated by minimizing the "generator loss" L_G, while the discriminator D is updated to minimize the "discriminator loss" L_D
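Both losses can be evaluated numerically with a stand-in discriminator. The logits below are hypothetical placeholders for D_B's internal scores; the point is only how L_G and L_D respond as the transformer improves:

```python
import numpy as np

# Hypothetical discriminator D_B: maps a logit to a probability "looks like B".
d_b = lambda s: 1.0 / (1.0 + np.exp(-s))
eps = 1e-9

rng = np.random.default_rng(0)
s_b = rng.normal(2.0, 1.0, 1000)    # logits for genuine draws of x_B (high)
s_ab = rng.normal(-1.0, 1.0, 1000)  # logits for transformed x_AB (not convincing yet)

# L_G = E[log(1 - D_B(x_AB))];  L_D = -E[log D_B(x_B)] - E[log(1 - D_B(x_AB))]
l_g = np.mean(np.log(1.0 - d_b(s_ab) + eps))
l_d = -np.mean(np.log(d_b(s_b) + eps)) - np.mean(np.log(1.0 - d_b(s_ab) + eps))

# As the transfer improves, D_B(x_AB) rises and the generator loss drops.
s_ab_better = rng.normal(1.0, 1.0, 1000)
l_g_better = np.mean(np.log(1.0 - d_b(s_ab_better) + eps))
```

Note the opposing signs on the shared term: improving the transformer lowers L_G but raises L_D, which is the adversarial tension in one line of algebra.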
Related Works: DiscoGAN
• DiscoGAN is a symmetric model which attempts to transform two categories of data, A and B, into each other
• DiscoGAN includes two generators
– G_AB: transforms a draw x_A from P_A of A into x_AB = G_AB(x_A)
– G_BA: transforms a draw x_B from P_B of B into x_BA = G_BA(x_B)
– The two have an inverse relationship with each other
• The goal of G_AB is that its output x_AB = G_AB(x_A) cannot be distinguished from draws from the distribution P_B of B
Related Works: DiscoGAN
• G_AB and G_BA must be inverses of each other to the extent possible
• For any x_A from A,
– x_ABA = G_BA(G_AB(x_A))
– must be close to the original x_A
• For any x_B from B,
– x_BAB = G_AB(G_BA(x_B))
– must be close to the original x_B
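The cycle-consistency idea can be checked with a pair of toy linear transformers (hypothetical stand-ins for the learned networks): when G_BA inverts G_AB, the round trip reconstructs the input exactly, and any mismatch shows up as a positive reconstruction distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=100)

# Hypothetical transformers: here G_BA is the exact inverse of G_AB.
g_ab = lambda x: 2.0 * x + 1.0
g_ba = lambda y: (y - 1.0) / 2.0

mse = lambda u, v: np.mean((u - v) ** 2)

# Cycle reconstruction x_ABA = G_BA(G_AB(x_A)) recovers x_A.
l_const_a = mse(g_ba(g_ab(x_a)), x_a)

# A transformer pair that is NOT inverse incurs a positive reconstruction loss.
g_ba_bad = lambda y: y / 2.0
l_const_a_bad = mse(g_ba_bad(g_ab(x_a)), x_a)
```

Minimizing this distance during training is what pushes the two generators toward being mutual inverses.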
Related Works: DiscoGAN
• It also includes two discriminators, D_A and D_B
• D_A attempts to discriminate between draws from P_A and draws from P_B that have been transformed by G_BA
• D_B performs the analogous operation for draws from P_B
• The generators and discriminators must all be jointly trained
Related Works: DiscoGAN
• L_{G_A} = −E_{x_B~P_B}[log D_A(G_BA(x_B))]
• L_{G_B} = −E_{x_A~P_A}[log D_B(G_AB(x_A))]
• The inverse requirement is encoded through two reconstruction losses, L_{CONST_A} and L_{CONST_B}
– L_{CONST_A} = d(G_BA(G_AB(x_A)), x_A)
– L_{CONST_B} = d(G_AB(G_BA(x_B)), x_B)
• In DiscoGAN, d is the MSE
Related Works: DiscoGAN
• The two generator losses are defined as
– L_{GAN_AB} = L_{G_B} + L_{CONST_A}
– L_{GAN_BA} = L_{G_A} + L_{CONST_B}
• L_G = L_{GAN_AB} + L_{GAN_BA} = L_{G_B} + L_{CONST_A} + L_{G_A} + L_{CONST_B}
• The two discriminator losses are defined as
– L_{D_A} = −E_{x_A~P_A}[log D_A(x_A)] − E_{x_B~P_B}[log(1 − D_A(G_BA(x_B)))]
– L_{D_B} = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(G_AB(x_A)))]
• L_D = L_{D_A} + L_{D_B}
Proposed Model: VoiceGAN
• DiscoGAN was originally designed to transfer style in images
• To apply the model to speech, the speech must first be converted to an invertible, picture-like representation, namely a spectrogram
• They propose VoiceGAN, which incorporates the following modifications
– The original DiscoGAN was designed to operate on images of fixed size; for it to work with inherently variable-sized speech signals, this constraint must be relaxed
– It is important to ensure that the linguistic information in the speech signal is not lost
– Their objective is to modify specific aspects of the speech, e.g. style, so they add extra components to the model to achieve this
VoiceGAN
• DiscoGAN reconstruction loss
– L_{CONST_A} = d(G_BA(G_AB(x_A)), x_A) = d(x_ABA, x_A)
– L_{CONST_B} = d(G_AB(G_BA(x_B)), x_B) = d(x_BAB, x_B)
• VoiceGAN reconstruction loss
– L_{CONST_A} = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_{CONST_B} = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
– The extra term helps retain the linguistic information
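The effect of the extra β term can be seen on toy "spectrograms". The weights below are hypothetical (the paper does not report α and β); the sketch shows that a style transfer which scrambles the content of x_A is penalized even when the cycle reconstruction is perfect:

```python
import numpy as np

rng = np.random.default_rng(1)
x_a = rng.normal(size=(64, 80))   # toy "spectrogram": time x frequency

mse = lambda u, v: np.mean((u - v) ** 2)
alpha, beta = 1.0, 0.1            # hypothetical weights, for illustration only

def voicegan_const_a(x_a, x_ab, x_aba):
    # L_CONST_A = alpha * d(x_ABA, x_A) + beta * d(x_AB, x_A)
    return alpha * mse(x_aba, x_a) + beta * mse(x_ab, x_a)

# A transfer that preserves structure (small offset) vs. one that scrambles it.
x_ab_keep = x_a + 0.1
x_ab_scramble = rng.permutation(x_a.ravel()).reshape(x_a.shape)
x_aba = x_a.copy()                # assume a perfect cycle reconstruction in both cases

l_keep = voicegan_const_a(x_a, x_ab_keep, x_aba)
l_scramble = voicegan_const_a(x_a, x_ab_scramble, x_aba)
```

With β = 0 (plain DiscoGAN) both transfers would score identically here, which is exactly the failure mode the added term guards against.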
VoiceGAN
• VoiceGAN reconstruction loss
– L_{CONST_A} = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_{CONST_B} = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
• d(x_AB, x_A)
– This loss attempts to retain the structure of x_A even after it has been converted to x_AB
• α, β
– Trade off accurate reconversion against retention of linguistic information after conversion
– The paper does not report the chosen values, stating only that "careful choice of α, β ensures both"
VoiceGAN
• Generator: same as the DiscoGAN generator
• Their proposed discriminator
– An adaptive pooling layer is added after the CNN layers and before the fully connected layer
– It includes channel-wise pooling
– This converts any variable-sized feature map into a vector with a fixed number of dimensions
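Adaptive pooling along the time axis can be sketched in numpy (a simplified 1-D stand-in for the layer described above, with an arbitrary output length): whatever the utterance length, the feature map is averaged into a fixed number of bins.

```python
import numpy as np

def adaptive_avg_pool_1d(feat, out_len):
    """Average-pool a (channels, time) feature map to (channels, out_len),
    regardless of the input's time length."""
    c, t = feat.shape
    # Bin edges split the time axis into out_len roughly equal segments.
    edges = np.linspace(0, t, out_len + 1).astype(int)
    return np.stack(
        [feat[:, edges[i]:max(edges[i + 1], edges[i] + 1)].mean(axis=1)
         for i in range(out_len)],
        axis=1,
    )

# Feature maps from utterances of different lengths map to the same shape,
# so a fixed-size fully connected layer can follow.
rng = np.random.default_rng(0)
short = rng.normal(size=(128, 37))
long_ = rng.normal(size=(128, 412))
pooled_short = adaptive_avg_pool_1d(short, 8)
pooled_long = adaptive_avg_pool_1d(long_, 8)
```

This is the property that lets the discriminator accept inherently variable-length spectrograms without cropping or padding them to a fixed size.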
Style Embedding Model (𝐷𝑆)
• They add a style discriminator D_S trained with style labels
– L_{DSTYLE−A} = d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)
– L_{DSTYLE−B} = d(D_S(x_B), label_B) + d(D_S(x_BA), label_A) + d(D_S(x_BAB), label_B)
• L_{DSTYLE} = L_{DSTYLE−A} + L_{DSTYLE−B}
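The three terms of L_{DSTYLE−A} can be made concrete with a toy style classifier. The scalar "pitch-like" feature and the classifier below are hypothetical illustrations; the sketch shows that a transfer which fails to change the style (x_AB still scores like A) is heavily penalized:

```python
import numpy as np

# Toy style discriminator D_S: returns P(style = B) from a scalar feature.
d_s = lambda x: 1.0 / (1.0 + np.exp(-(x - 0.5) * 8.0))
# Binary cross-entropy as the distance d between prediction and style label.
bce = lambda p, label: -(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))

label_a, label_b = 0.0, 1.0
x_a, x_ab, x_aba = 0.1, 0.9, 0.1   # hypothetical features: A-style low, B-style high

# L_DSTYLE-A = d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)
l_style_a = bce(d_s(x_a), label_a) + bce(d_s(x_ab), label_b) + bce(d_s(x_aba), label_a)

# A failed transfer, where x_AB still sounds like A, incurs a much larger loss.
l_style_fail = bce(d_s(x_a), label_a) + bce(d_s(x_a), label_b) + bce(d_s(x_aba), label_a)
```

The middle term is the one doing the stylistic work: it requires the transformed sample to carry the target speaker's style label, while the outer terms anchor the original and the cycle reconstruction to the source label.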
Final loss
• DiscoGAN final loss
– Generator: L_{G_B} + L_{CONST_A} + L_{G_A} + L_{CONST_B}
– Discriminator: L_{D_A} + L_{D_B}
• VoiceGAN final loss
– Generator: L_{G_B} + L_{CONST_A} + L_{G_A} + L_{CONST_B}
– Discriminator: L_{D_A} + L_{D_B} + L_{DSTYLE}
Experiment - Dataset
• They use the TIDIGITS dataset
– 326 speakers: 111 men, 114 women, 50 boys, 51 girls
– Each speaker reads 77 digit sentences
– Sampling rate: 16 kHz
– Style attribute: gender
– Utterances consist of spoken digit sequences (counting numbers)
– Input representation: spectrogram (possibly a mel-scale filter-bank spectrogram)
Model Architecture
• Generator
– 6-layer CNN encoder and a 6-layer transposed-CNN decoder
• Discriminator
– 7-layer CNN with adaptive pooling
• Both networks employ batch normalization and leaky ReLU activations (similar to DiscoGAN)
• The number of filters in each layer is an increasing power of 2 (32, 64, 128)
Results
• https://nbviewer.jupyter.org/github/Yolanda-Gao/Spectrogram-
GAN/blob/master/VoiceGAN%20result.ipynb?flush_cache=true
Thanks…
More Related Content

Similar to Voice Impersonation Using Generative Adversarial Networks review

Revisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object DetectorRevisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object DetectorSungchul Kim
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For RailsKoichi ITO
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentElectronic Arts / DICE
 
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural CorrespondenceSemi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural CorrespondenceRylan Cottrell
 
P2P Bug Tracking with SD
P2P Bug Tracking with SDP2P Bug Tracking with SD
P2P Bug Tracking with SDJesse Vincent
 
Least Squares Fitting
Least Squares Fitting Least Squares Fitting
Least Squares Fitting MANREET SOHAL
 
Time cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo MethodTime cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo MethodMohammad Lemar ZALMAİ
 
Shortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRoutingShortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRoutingDaniel Kastl
 
Multinomial distribution
Multinomial distributionMultinomial distribution
Multinomial distributionNadeem Uddin
 
テンプレート管理ツール r3
テンプレート管理ツール r3テンプレート管理ツール r3
テンプレート管理ツール r3Ippei Ogiwara
 
Graphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two VariablesGraphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two VariablesCABILACedricLyoidJ
 
2.8 Function Operations and Composition
2.8 Function Operations and Composition2.8 Function Operations and Composition
2.8 Function Operations and Compositionsmiller5
 
Geometric Algebra 2: Applications
Geometric Algebra 2: ApplicationsGeometric Algebra 2: Applications
Geometric Algebra 2: ApplicationsVitor Pamplona
 
Conjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and PreconditioningConjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and PreconditioningFahad B. Mostafa
 
Game Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience ResearchGame Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience ResearchLennart Nacke
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffTaeoh Kim
 

Similar to Voice Impersonation Using Generative Adversarial Networks review (20)

Revisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object DetectorRevisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object Detector
 
The derivatives module03
The derivatives module03The derivatives module03
The derivatives module03
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For Rails
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
 
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural CorrespondenceSemi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
 
P2P Bug Tracking with SD
P2P Bug Tracking with SDP2P Bug Tracking with SD
P2P Bug Tracking with SD
 
Genome Browser
Genome BrowserGenome Browser
Genome Browser
 
Least Squares Fitting
Least Squares Fitting Least Squares Fitting
Least Squares Fitting
 
Time cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo MethodTime cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo Method
 
Shortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRoutingShortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRouting
 
Multinomial distribution
Multinomial distributionMultinomial distribution
Multinomial distribution
 
テンプレート管理ツール r3
テンプレート管理ツール r3テンプレート管理ツール r3
テンプレート管理ツール r3
 
Graphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two VariablesGraphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two Variables
 
Revisited
RevisitedRevisited
Revisited
 
2.8 Function Operations and Composition
2.8 Function Operations and Composition2.8 Function Operations and Composition
2.8 Function Operations and Composition
 
Geometric Algebra 2: Applications
Geometric Algebra 2: ApplicationsGeometric Algebra 2: Applications
Geometric Algebra 2: Applications
 
Conjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and PreconditioningConjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and Preconditioning
 
Game Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience ResearchGame Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience Research
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion Tradeoff
 
Paint.net
Paint.netPaint.net
Paint.net
 

More from June-Woo Kim

Monotonic Multihead Attention review
Monotonic Multihead Attention reviewMonotonic Multihead Attention review
Monotonic Multihead Attention reviewJune-Woo Kim
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewJune-Woo Kim
 
ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain June-Woo Kim
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN reviewJune-Woo Kim
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment reviewJune-Woo Kim
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron reviewJune-Woo Kim
 

More from June-Woo Kim (7)

Conformer review
Conformer reviewConformer review
Conformer review
 
Monotonic Multihead Attention review
Monotonic Multihead Attention reviewMonotonic Multihead Attention review
Monotonic Multihead Attention review
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
 
ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
 

Recently uploaded

Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdfKamal Acharya
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashidFaiyazSheikh
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfssuser5c9d4b1
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Studentskannan348865
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfJNTUA
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfragupathi90
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisDr.Costas Sachpazis
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfSkNahidulIslamShrabo
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfEr.Sonali Nasikkar
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New HorizonMorshed Ahmed Rahath
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docxrahulmanepalli02
 
Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfKira Dess
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationEmaan Sharma
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Ramkumar k
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxKarpagam Institute of Teechnology
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalSwarnaSLcse
 
Dynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxDynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxMustafa Ahmed
 

Recently uploaded (20)

Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded Systems
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Students
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdf
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdf
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdf
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & Modernization
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference Modal
 
Dynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxDynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptx
 

Voice Impersonation Using Generative Adversarial Networks review

  • 1. Voice Impersonation using Generative Adversarial Networks Yang Gao, Rita Singh, Bhiksha Raj Electrical and Computer Engineering Department, Carnegie Mellon University. Arxiv Date: 19, Feb. 2018. Conference: ICASSP 2018 Presented by: June-Woo Kim Artificial Brain Research Lab., School of Sensor and Display, Kyungpook National University 25, Sep. 2019.
  • 2. 2021-01-09 Overview of the paper • In voice impersonation, the resultant voice must convincingly convey the impression of having been naturally produced by the target speaker, mimicking not only the pitch and other perceivable signal qualities, but also the style of the target speaker • In this paper, they propose a novel neural-network based speech quality and style mimicry framework for the synthesis of impersonated voices – Framework: built upon a fast and accurate GAN model • Generating a synthetic spectrogram from which the time-domain signal is reconstructed using the Griffin-Lim method • Given spectrographic representations of source and target speaker’s voices, the model learns to mimic the target speaker’s voice quality and style, regardless of the linguistic content of either’s voice.
  • 3. 2021-01-09 Overview of the paper • Summarize – This paper, given X is one of gender’s speech, given Y the other’s speech – Goal is change X voice to Y voice, regardless of the linguistic content of either’s voice – They use GAN, however, their model is more close to DiscoGAN • They find some shortcomings of the existing DiscoGAN model and modified them to make VoiceGAN
  • 4. 2021-01-09 Related Works: Generative Adversarial Networks • The original GAN model comprises a generator 𝐺(𝑧) and discrimina- tor 𝐷(𝑥) • The generator 𝐺 takes as input a random variable 𝑧 drawn from some probability distribution function 𝑃𝑧, and produces an output vector 𝑥 𝑧
  • 5. 2021-01-09 Related Works: GAN • Discriminator D() attempts to discriminate between sample 𝑥~𝑃𝑥 that are drawn from 𝑃𝑥, the true (but unknown) distribution we aim to model, and samples produced by the Generator 𝐺 • Let T represent the event that a vector 𝑥 was drawn from 𝑃𝑥, the discriminator attemps to compute the a 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟𝑖 probability of 𝐷 𝑥 = 𝑃(𝑇|𝑥)
  • 6. 2021-01-09 Related Works: GAN • 𝑚𝑖𝑛 𝐺 𝑚𝑎𝑥 𝐷 𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃𝑥 [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )] – 𝑚𝑎𝑥 𝐷 𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥) [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))] – 𝑚𝑖𝑛 𝐺 𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )] • Appendix – 𝐸 𝑥~𝑃 𝑥 --> x is sampled from real data – 𝐸𝑧~𝑃𝑧 --> z is sampled from fake data(Noise=z) – D(x) --> probability of D(real) – 1 − 𝐷 𝑥 𝑧  probability of D(fake) – [log 𝐷(𝑥)] --> likelihood of D(real) – [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake) Recognize real images better Recognize generated images better Optimize G that can fool the discriminator the most
  • 7. 2021-01-09 Related Works: GAN • 𝑚𝑖𝑛 𝐺 𝑚𝑎𝑥 𝐷 𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃 𝑥 [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )] – 𝑚𝑎𝑥 𝐷 𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥) [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))] – 𝑚𝑖𝑛 𝐺 𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )] • Appendix – 𝐸 𝑥~𝑃𝑥 --> x is sampled from real data – 𝐸𝑧~𝑃𝑧 --> z is sampled from fake data(Noise=z) – D(x) --> probability of D(real) – 1 − 𝐷 𝑥 𝑧  probability of D(fake) – [log 𝐷(𝑥)] --> likelihood of D(real) – [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake) • To get the better result of GAN, – Generator: 𝐷 𝑥 𝑧  should minimize – Discriminator: 𝐷 𝑥  should maximizing • Conclusion – G and D is adversarial – We often define GAN as a minimax game with G wants to minimize V while D wants to maximize it Recognize real images better Recognize generated images better Optimize G that can fool the discriminator the most
  • 8. 2021-01-09 Related Works: Style transfer by GAN • Input data instance (usually an image) 𝑥A drawn from a distribution 𝑃A is 𝑡𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑑 to an instance 𝑥AB by a generator (more aptly called a “transformer”), 𝐺AB • The aim of the transformer is to convert 𝑥A into the style of the variable 𝑥B which natively occurs with the distribution 𝑃B
  • 9. 2021-01-09 Related Works: Style transfer by GAN • The discriminator 𝐷B attempts to distinguish between genuine draws of 𝑥B from 𝑃B and instances 𝑥 𝐴𝐵 obtained by transforming draws of 𝑥 𝐴 from 𝑃A • Style transfer optimizations is achieved as follows: • 𝐿 𝐺 = 𝐸 𝑥 𝐴~𝑃 𝐴 log 1 − 𝐷 𝐵 𝑥 𝐴𝐵 • 𝐿 𝐷 = −𝐸 𝑥 𝐵~𝑃 𝐵 log 𝐷 𝐵 𝑥 𝐵 − 𝐸 𝑥 𝐴~𝑃 𝐴 [log(1 − 𝐷 𝐵 𝑥 𝐴𝐵 ] • The generator 𝐺 is updated by minimizing the “generator loss” 𝐿 𝐺, while the discriminator 𝐷 is updated to minimize the “discriminator loss” 𝐿 𝐷
  • 10. 2021-01-09 Related Works: DiscoGAN • DiscoGAN is a symmetric model which attempts to transform two categories of data, 𝐴 and 𝐵, into each other • DiscoGAN Includes 2 Generator – 𝐺AB: draw 𝑥A from 𝑃A of 𝐴 into 𝑥AB = 𝐺AB 𝑥A – 𝐺BA: draw 𝑥B from 𝑃B of 𝐴 into 𝑥BA = 𝐺BA 𝑥B – Inverse relationship with each other. • The goal of 𝐺AB is that the product of 𝐺AB(𝑥AB) cannot be distinguished from the distribution 𝑃B of 𝐵
• 11. Related Works: DiscoGAN
• G_AB and G_BA must be inverses of each other to the extent possible
• For any x_A from A,
– x_ABA = G_BA(G_AB(x_A))
– must be close to the original x_A
• For any x_B from B,
– x_BAB = G_AB(G_BA(x_B))
– must be close to the original x_B
• 12. Related Works: DiscoGAN
• It also includes two discriminators, D_A and D_B
• D_A attempts to discriminate between draws from P_A and draws from P_B that have been transformed by G_BA
• D_B performs the analogous operation for draws from P_B
• The generators and discriminators must all be jointly trained
• 13. Related Works: DiscoGAN
• L_{G_A} = −E_{x_B~P_B}[log D_A(G_BA(x_B))]
• L_{G_B} = −E_{x_A~P_A}[log D_B(G_AB(x_A))]
• The inverse requirement is encoded through two reconstruction losses, L_CONST_A and L_CONST_B
– L_CONST_A = d(G_BA(G_AB(x_A)), x_A)
– L_CONST_B = d(G_AB(G_BA(x_B)), x_B)
• In DiscoGAN, d is the MSE
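The reconstruction loss with d = MSE can be demonstrated with toy "generators". The two lambdas below are stand-ins for the real networks, chosen so that G_BA exactly inverts G_AB; everything here is illustrative:

```python
import numpy as np

def mse(a, b):
    """d(a, b): mean squared error, the distance DiscoGAN uses."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

# Toy stand-ins for the generators: any callables mapping arrays to arrays.
G_AB = lambda x: x + 1.0   # pretend transform A -> B
G_BA = lambda x: x - 1.0   # its exact inverse, B -> A

x_A = np.array([0.0, 2.0, 4.0])
L_CONST_A = mse(G_BA(G_AB(x_A)), x_A)   # 0.0 when G_BA perfectly inverts G_AB
```

The loss vanishes exactly when the round trip A → B → A reproduces the input, which is the behaviour L_CONST_A is meant to enforce.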
• 14. Related Works: DiscoGAN
• The 2 generator losses are defined as
– L_GAN_AB = L_{G_B} + L_CONST_A
– L_GAN_BA = L_{G_A} + L_CONST_B
• L_G = L_GAN_AB + L_GAN_BA = L_{G_B} + L_CONST_A + L_{G_A} + L_CONST_B
• The 2 discriminator losses are defined as
– L_{D_A} = −E_{x_A~P_A}[log D_A(x_A)] − E_{x_B~P_B}[log(1 − D_A(G_BA(x_B)))]
– L_{D_B} = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(G_AB(x_A)))]
• L_D = L_{D_A} + L_{D_B}
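How the per-direction components assemble into the two totals can be written down mechanically. A sketch with a hypothetical helper name, taking the component losses as plain numbers:

```python
def discogan_totals(L_G_A, L_G_B, L_CONST_A, L_CONST_B, L_D_A, L_D_B):
    """Assemble DiscoGAN's total generator and discriminator losses
    from their per-direction components."""
    L_GAN_AB = L_G_B + L_CONST_A      # A -> B direction
    L_GAN_BA = L_G_A + L_CONST_B      # B -> A direction
    L_G = L_GAN_AB + L_GAN_BA         # total generator loss
    L_D = L_D_A + L_D_B               # total discriminator loss
    return L_G, L_D
```

In training, L_G is minimized by the two generators and L_D by the two discriminators, in alternation.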
• 15. Proposed Model: VoiceGAN
• DiscoGAN was originally designed to transform style in images
• In order to apply the model to speech, the speech is first converted to an invertible, picture-like representation, namely a spectrogram
• They propose VoiceGAN, which incorporates the following modifications
– The original DiscoGAN was designed to operate on images of fixed size. For it to work with the inherently variable-sized speech signal, this constraint must be relaxed in the new design
– It is important to ensure that the linguistic information in the speech signal is not lost
– Their objective is to modify specific aspects of the speech, e.g. style, so they add extra components to the model to achieve this
• 16. VoiceGAN
• DiscoGAN reconstruction loss
– L_CONST_A = d(G_BA(G_AB(x_A)), x_A) = d(x_ABA, x_A)
– L_CONST_B = d(G_AB(G_BA(x_B)), x_B) = d(x_BAB, x_B)
• VoiceGAN reconstruction loss
– L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
– The extra term retains the linguistic information
• 17. VoiceGAN
• VoiceGAN reconstruction loss
– L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
– The extra term retains the linguistic information
• d(x_AB, x_A)
– This loss attempts to retain the structure of x_A even after it has been converted to x_AB
• α, β
– Trade off accurate reconversion against retention of linguistic information after conversion
– The paper does not publish these values, saying only that "careful choice of α, β ensures both"
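The weighted VoiceGAN reconstruction loss is a small extension of the DiscoGAN one. A sketch with d = MSE; the α, β values below are placeholders, since the paper does not publish its choice:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def voicegan_const_loss(x_ABA, x_AB, x_A, alpha=1.0, beta=0.5):
    """L_CONST_A = alpha * d(x_ABA, x_A) + beta * d(x_AB, x_A).

    The second term penalizes x_AB for drifting from x_A's structure,
    which is what preserves the linguistic content across the transform.
    alpha/beta here are illustrative defaults, not the paper's values.
    """
    return alpha * mse(x_ABA, x_A) + beta * mse(x_AB, x_A)
```

With a perfect round trip (x_ABA = x_A) only the β term contributes, so the loss directly measures how far the transformed spectrogram has moved from the source's structure.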
• 18. VoiceGAN
• The generator is the same as DiscoGAN's
• Their proposed discriminator
– An adaptive pooling layer is added after the CNN layers and before the fully connected layer
– It includes channel-wise pooling
– This converts any variable-sized feature map into a vector with a fixed number of dimensions
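The key idea of the adaptive pooling step can be shown in NumPy: however long the (time-varying) feature map is, it is averaged into a fixed number of bins per channel. This is a simplified 1-D sketch (in a deep-learning framework one would reach for an adaptive-pooling layer); the binning scheme is an assumption for illustration:

```python
import numpy as np

def adaptive_avg_pool_1d(feat, out_len):
    """Pool a (channels, time) feature map of ANY length down to
    (channels, out_len) by averaging over roughly equal time bins.
    A fixed-size output lets a fully connected layer follow
    variable-length spectrogram features."""
    C, T = feat.shape
    edges = np.linspace(0, T, out_len + 1).round().astype(int)
    return np.stack([[feat[c, edges[i]:edges[i + 1]].mean()
                      for i in range(out_len)] for c in range(C)])
```

Inputs of length 6, 10, or 100 frames all produce the same output shape, which is exactly the constraint the fully connected layer needs.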
• 19. Style Embedding Model (D_S)
• They add a style discriminator trained with speech style labels
– L_DSTYLE-A = d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)
– L_DSTYLE-B = d(D_S(x_B), label_B) + d(D_S(x_BA), label_A) + d(D_S(x_BAB), label_B)
• L_DSTYLE = L_DSTYLE-A + L_DSTYLE-B
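The three-term style loss above pairs each point on the A → B → A round trip with the style label it should carry. A sketch where D_S is any callable producing a style prediction and d defaults to squared error; all names are illustrative:

```python
import numpy as np

def style_loss_A(Ds, x_A, x_AB, x_ABA, label_A, label_B, d=None):
    """L_DSTYLE-A = d(Ds(x_A), label_A) + d(Ds(x_AB), label_B) + d(Ds(x_ABA), label_A).

    x_A should carry style A, the transformed x_AB should carry style B,
    and the reconstruction x_ABA should carry style A again.
    """
    if d is None:
        d = lambda p, y: float(np.mean((np.asarray(p, dtype=float)
                                        - np.asarray(y, dtype=float)) ** 2))
    return d(Ds(x_A), label_A) + d(Ds(x_AB), label_B) + d(Ds(x_ABA), label_A)
```

The loss is zero exactly when every point on the round trip is classified as the intended style, which is what forces G_AB to actually move the style (e.g. gender) and not just the signal.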
• 20. Final loss
• DiscoGAN final loss
– Generator: L_{G_B} + L_CONST_A + L_{G_A} + L_CONST_B
– Discriminator: L_{D_A} + L_{D_B}
• VoiceGAN final loss
– Generator: L_{G_B} + L_CONST_A + L_{G_A} + L_CONST_B
– Discriminator: L_{D_A} + L_{D_B} + L_DSTYLE
• 21. Experiment – Dataset
• They use the TIDIGITS dataset
– 326 speakers: 111 men, 114 women, 50 boys, 51 girls
– Each speaker reads 77 digit sentences
– Sampling rate: 16 kHz
– Style attribute: gender
– Utterances consist of spoken digit sequences
– Features: spectrograms (possibly mel-scale filter-bank spectrograms)
• 22. Model Architecture
• Generator
– 6-layer CNN encoder and a 6-layer transposed-CNN decoder
• Discriminator
– 7-layer CNN with adaptive pooling
• Both networks employ batch normalization and leaky ReLU activations (similar to DiscoGAN)
• The number of filters in each layer is an increasing power of 2 (32, 64, 128, ...)
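The "increasing power of 2" filter schedule can be generated mechanically. A tiny sketch; the slide only lists 32, 64, 128, so how the progression continues through all six layers is an assumption:

```python
def filter_schedule(n_layers, base=32):
    """Per-layer filter counts as increasing powers of two: 32, 64, 128, ...
    The continuation beyond 128 is an assumption; the slide lists only
    the first three values."""
    return [base * 2 ** i for i in range(n_layers)]
```

For a 3-layer stack this reproduces exactly the counts the slide gives.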
