Voice Impersonation using Generative
Adversarial Networks
Yang Gao, Rita Singh, Bhiksha Raj
Electrical and Computer Engineering Department, Carnegie Mellon
University.
arXiv date: Feb. 19, 2018.
Conference: ICASSP 2018
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
25, Sep. 2019.
2021-01-09
Overview of the paper
• In voice impersonation, the resultant voice must convincingly convey
the impression of having been naturally produced by the target speaker,
mimicking not only the pitch and other perceivable signal qualities, but
also the style of the target speaker
• In this paper, they propose a novel neural-network based speech quality
and style mimicry framework for the synthesis of impersonated voices
– Framework: built upon a fast and accurate GAN model
• Generating a synthetic spectrogram from which the time-domain signal
is reconstructed using the Griffin-Lim method
• Given spectrographic representations of the source and target
speakers’ voices, the model learns to mimic the target speaker’s voice
quality and style, regardless of the linguistic content of either voice.
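The Griffin-Lim reconstruction step mentioned above can be sketched with NumPy and SciPy's STFT utilities. This is a minimal, illustrative version; the window length, overlap, iteration count, and test tone are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256, noverlap=192, seed=0):
    """Recover a time-domain signal from a magnitude spectrogram by
    iteratively re-estimating the phase (Griffin-Lim)."""
    rng = np.random.default_rng(seed)
    # start from a random phase estimate
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        # keep the re-analysed phase, discard its magnitude
        phase = np.exp(1j * np.angle(spec[:, :mag.shape[1]]))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# round-trip demo on a pure tone (length divisible by the hop size)
t = np.arange(4096) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
_, _, S = stft(tone, nperseg=256, noverlap=192)
recon = griffin_lim(np.abs(S))

# how close the reconstruction's magnitude spectrogram is to the target
_, _, S_rec = stft(recon, nperseg=256, noverlap=192)
rel_err = np.linalg.norm(np.abs(S_rec) - np.abs(S)) / np.linalg.norm(np.abs(S))
```

Only the magnitude is matched; the phase (and hence the exact waveform) is an estimate, which is why Griffin-Lim audio can sound slightly "phasey".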
Overview of the paper
• Summary
– Given X, speech from a speaker of one gender, and Y, speech from a
speaker of the other
– The goal is to convert the voice of X into the voice of Y, regardless
of the linguistic content of either
– They use a GAN; their model is closest to DiscoGAN
• They identify shortcomings of the existing DiscoGAN model and modify
it to build VoiceGAN
Related Works: Generative Adversarial
Networks
• The original GAN model comprises a generator G(z) and a discriminator D(x)
• The generator G takes as input a random variable z drawn from
some probability distribution P_z, and produces an output vector x_z
Related Works: GAN
• The discriminator D(·) attempts to discriminate between samples x ~ P_x
drawn from P_x, the true (but unknown) distribution we aim to model,
and samples produced by the generator G
• Letting T represent the event that a vector x was drawn from P_x, the
discriminator attempts to compute the a posteriori probability
D(x) = P(T|x)
Related Works: GAN
• min_G max_D V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 − D(x_z))]
– max_D V(D) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]
→ recognize real and generated samples better
– min_G V(G) = E_{z~P_z(z)}[log(1 − D(G(z)))]
→ optimize G to fool the discriminator the most
• Appendix
– E_{x~P_x} → x is sampled from the real data distribution
– E_{z~P_z} → z is sampled from the noise distribution (z = noise)
– D(x) → the probability D assigns to a real sample
– 1 − D(x_z) → the probability D assigns to a fake sample
– log D(x) → log-likelihood of D on real samples
– log(1 − D(x_z)) → log-likelihood of D on fake samples
Related Works: GAN
• min_G max_D V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 − D(x_z))]
• To train the GAN well,
– Generator: should drive D(x_z) toward 1, i.e. minimize log(1 − D(x_z))
– Discriminator: should maximize D(x) on real samples and minimize
D(x_z) on generated ones
• Conclusion
– G and D are adversarial
– GAN is often defined as a minimax game in which G wants to minimize V
while D wants to maximize it
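The minimax value function above can be evaluated numerically. Below is a NumPy sketch with a hand-picked toy discriminator and generator (the distributions and functions are illustrative assumptions, not from the paper): when the discriminator separates real from fake well, both expectation terms sit close to 0 from below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: real data x ~ N(0, 1); noise z ~ U(-1, 1); an (untrained)
# generator G(z) = 2z + 3 whose outputs sit far from the real data.
x_real = rng.normal(0.0, 1.0, size=10_000)
z = rng.uniform(-1.0, 1.0, size=10_000)
x_fake = 2.0 * z + 3.0  # x_z = G(z)

def D(x):
    """A hand-picked logistic discriminator: close to 1 near the real
    data (around 0), close to 0 near the fake data (around 3)."""
    return 1.0 / (1.0 + np.exp(2.0 * (x - 1.5)))

# V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]
v = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(x_fake)))
```

Both log terms are of probabilities in (0, 1), so V is always negative; a discriminator that cannot tell real from fake (D ≡ 0.5) would give V = −2 log 2 ≈ −1.386, the equilibrium value of the game.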
Related Works: Style transfer by GAN
• An input data instance (usually an image) x_A, drawn from a distribution
P_A, is transformed into an instance x_AB by a generator (more aptly
called a “transformer”) G_AB
• The aim of the transformer is to convert x_A into the style of the
variable x_B, which natively occurs with the distribution P_B
Related Works: Style transfer by GAN
• The discriminator D_B attempts to distinguish between genuine draws of x_B
from P_B and instances x_AB obtained by transforming draws of x_A from P_A
• Style transfer optimization is achieved as follows:
– L_G = E_{x_A~P_A}[log(1 − D_B(x_AB))]
– L_D = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(x_AB))]
• The generator G is updated by minimizing the “generator loss” L_G, while the
discriminator D is updated by minimizing the “discriminator loss” L_D
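The two style-transfer losses can be computed directly from discriminator outputs. A NumPy sketch (the toy probability values are illustrative, not measured):

```python
import numpy as np

def transfer_losses(d_real_b, d_fake_ab):
    """Style-transfer GAN losses given the discriminator's outputs on
    genuine x_B samples and on transformed x_AB samples."""
    l_g = np.mean(np.log(1.0 - d_fake_ab))                          # generator loss
    l_d = -np.mean(np.log(d_real_b)) - np.mean(np.log(1.0 - d_fake_ab))
    return l_g, l_d

# A discriminator that already spots the fakes: L_D is small, while L_G
# is near 0 (its worst value), so G still has pressure to improve.
l_g, l_d = transfer_losses(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
```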
Related Works: DiscoGAN
• DiscoGAN is a symmetric model which attempts to transform two
categories of data, 𝐴 and 𝐵, into each other
• DiscoGAN includes two generators
– G_AB: transforms x_A, drawn from P_A of A, into x_AB = G_AB(x_A)
– G_BA: transforms x_B, drawn from P_B of B, into x_BA = G_BA(x_B)
– The two are intended to be inverses of each other
• The goal of G_AB is that its output G_AB(x_A) cannot be distinguished
from draws from the distribution P_B of B
Related Works: DiscoGAN
• 𝐺AB and 𝐺BA must be inverses of each other to the extent possible
• For any x_A from A,
– x_ABA = G_BA(G_AB(x_A))
– must be close to the original x_A
• For any x_B from B,
– x_BAB = G_AB(G_BA(x_B))
– must be close to the original x_B
Related Works: DiscoGAN
• It also includes two discriminators, 𝐷A and 𝐷B
• 𝐷A attempts to discriminate between draws from 𝑃A and draws from
𝑃B that have been transformed by 𝐺BA
• 𝐷B performs the analogous operations for draws from 𝑃B
• The generators and discriminators must all be jointly trained
Related Works: DiscoGAN
• L_G_A = −E_{x_B~P_B}[log D_A(G_BA(x_B))]
• L_G_B = −E_{x_A~P_A}[log D_B(G_AB(x_A))]
• The inverse requirement is encoded through two reconstruction losses,
L_CONST_A and L_CONST_B
– L_CONST_A = d(G_BA(G_AB(x_A)), x_A)
– L_CONST_B = d(G_AB(G_BA(x_B)), x_B)
• In DiscoGAN, d is the MSE
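The cycle-consistency idea behind L_CONST can be illustrated with toy one-dimensional "generators" (stand-ins for the real CNNs, purely for illustration): when G_AB and G_BA are exact inverses, both reconstruction losses vanish.

```python
import numpy as np

# Toy linear generators: G_AB doubles, G_BA halves, so they are
# exact inverses of each other.
g_ab = lambda x: 2.0 * x
g_ba = lambda x: 0.5 * x

mse = lambda a, b: np.mean((a - b) ** 2)  # DiscoGAN's choice of d(., .)

x_a = np.array([1.0, 2.0, 3.0])
x_b = np.array([4.0, 6.0])

l_const_a = mse(g_ba(g_ab(x_a)), x_a)  # d(x_ABA, x_A)
l_const_b = mse(g_ab(g_ba(x_b)), x_b)  # d(x_BAB, x_B)
# both are zero here; imperfect inverses would make them positive
```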
Related Works: DiscoGAN
• The two generator losses are defined as
– L_GAN_AB = L_G_B + L_CONST_A
– L_GAN_BA = L_G_A + L_CONST_B
• L_G = L_GAN_AB + L_GAN_BA = L_G_B + L_CONST_A + L_G_A + L_CONST_B
• The two discriminator losses are defined as
– L_D_A = −E_{x_A~P_A}[log D_A(x_A)] − E_{x_B~P_B}[log(1 − D_A(G_BA(x_B)))]
– L_D_B = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(G_AB(x_A)))]
• L_D = L_D_A + L_D_B
Proposed Model: VoiceGAN
• DiscoGAN was originally designed to transform style in images
• To apply the model to speech, they first convert the speech to an
invertible, picture-like representation, namely a spectrogram
• They propose VoiceGAN, which incorporates the following modifications
– The original DiscoGAN was designed to operate on images of fixed size. To
work with inherently variable-sized speech signals, this constraint must be
relaxed in the new design
– It is important to ensure that the linguistic information in the speech
signal is not lost
– Their objective is to modify specific aspects of the speech, e.g. style,
so they add extra components to the model to achieve this
VoiceGAN
• DiscoGAN reconstruction loss
– L_CONST_A = d(G_BA(G_AB(x_A)), x_A) = d(x_ABA, x_A)
– L_CONST_B = d(G_AB(G_BA(x_B)), x_B) = d(x_BAB, x_B)
• VoiceGAN reconstruction loss
– L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
– The extra β term helps retain the linguistic information
VoiceGAN
• VoiceGAN reconstruction loss
– L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
– L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
– Added to retain the linguistic information
• d(x_AB, x_A)
– This loss attempts to retain the structure of x_A even after it
has been converted to x_AB
• α, β
– Balance accurate reconversion against retention of linguistic
information after conversion
– The paper does not disclose these values, saying only that “careful
choice of α, β ensures both”
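The VoiceGAN reconstruction loss for domain A can be sketched as follows; the α = β = 0.5 values and the toy arrays are placeholders, since the paper does not publish its choice:

```python
import numpy as np

mse = lambda a, b: np.mean((a - b) ** 2)  # the distance d(., .)

def voicegan_const_a(x_aba, x_ab, x_a, alpha=0.5, beta=0.5):
    """L_CONST_A = alpha * d(x_ABA, x_A) + beta * d(x_AB, x_A).
    The beta term ties the converted x_AB back to x_A so the
    linguistic structure survives conversion."""
    return alpha * mse(x_aba, x_a) + beta * mse(x_ab, x_a)

x_a   = np.array([1.0, 2.0])
x_ab  = np.array([1.1, 2.1])   # converted, slightly off from x_a
x_aba = np.array([1.0, 2.0])   # perfect round trip
loss = voicegan_const_a(x_aba, x_ab, x_a)  # only the beta term is nonzero
```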
VoiceGAN
• The generator is the same as DiscoGAN’s
• Their proposed discriminator
– An adaptive pooling layer is added after the CNN layers and
before the fully connected layer
– It performs channel-wise pooling
– This converts any variable-sized feature map into a vector
with a fixed number of dimensions
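The adaptive pooling idea can be sketched in NumPy: whatever the time extent of the feature map, the output has a fixed size, so the fully connected layer that follows always sees the same input dimensionality. This is a simplified 1-D-over-time version, not the paper's implementation.

```python
import numpy as np

def adaptive_avg_pool(fmap, out_w=4):
    """Channel-wise adaptive average pooling: maps a (channels, width)
    feature map of ANY width to (channels, out_w) by averaging over
    roughly equal-sized bins along the time axis."""
    c, w = fmap.shape
    edges = np.linspace(0, w, out_w + 1).astype(int)  # bin boundaries
    return np.stack([fmap[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

# a short and a long utterance both yield an (8, 4) feature map
short = adaptive_avg_pool(np.ones((8, 10)))
long_ = adaptive_avg_pool(np.ones((8, 1000)))
```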
Style Embedding Model (𝐷𝑆)
• They add a style discriminator D_S trained with style labels
– L_D_STYLE-A = d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)
– L_D_STYLE-B = d(D_S(x_B), label_B) + d(D_S(x_BA), label_A) + d(D_S(x_BAB), label_B)
• L_D_STYLE = L_D_STYLE-A + L_D_STYLE-B
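The style discriminator loss for domain A can be sketched as follows, taking d(·,·) as MSE against a one-hot style label (an assumption; the slide only writes d(D_S(x), label)):

```python
import numpy as np

mse = lambda a, b: np.mean((a - b) ** 2)  # assumed form of d(., .)

# one-hot labels for the two styles (e.g. the two genders)
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def style_loss_a(ds_xa, ds_xab, ds_xaba):
    """L_D_STYLE-A: x_A and the round-trip x_ABA should carry style A,
    while the converted x_AB should carry style B."""
    return (mse(ds_xa, label_a)
            + mse(ds_xab, label_b)
            + mse(ds_xaba, label_a))

# perfect style transfer and reconstruction -> zero loss
loss = style_loss_a(label_a, label_b, label_a)
```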
Final loss
• DiscoGAN final loss
– Generator: L_G_B + L_CONST_A + L_G_A + L_CONST_B
– Discriminator: L_D_A + L_D_B
• VoiceGAN final loss
– Generator: L_G_B + L_CONST_A + L_G_A + L_CONST_B
– Discriminator: L_D_A + L_D_B + L_D_STYLE
Experiment - Dataset
• They use the TIDIGITS dataset
– 326 speakers: 111 men, 114 women, 50 boys, 51 girls
– Each speaker reads 77 digit sequences
– Sampling rate: 16 kHz
– Style attribute: gender
– Utterances consist of spoken digit strings
– Input representation: spectrograms (possibly mel-scale filter-bank
spectrograms)
Model Architecture
• Generator
– A 6-layer CNN encoder and a 6-layer transposed-CNN decoder
• Discriminator
– A 7-layer CNN with adaptive pooling
• Both networks employ batch normalization and leaky ReLU activations
(similar to DiscoGAN)
• The number of filters in each layer is an increasing power of 2
(32, 64, 128)
Results
• https://nbviewer.jupyter.org/github/Yolanda-Gao/Spectrogram-GAN/blob/master/VoiceGAN%20result.ipynb?flush_cache=true
Thanks…
More Related Content

Similar to Voice Impersonation Using Generative Adversarial Networks review

Revisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object DetectorRevisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object Detector
Sungchul Kim
 
The derivatives module03
The derivatives module03The derivatives module03
The derivatives module03
REYEMMANUELILUMBA
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For Rails
Koichi ITO
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
Electronic Arts / DICE
 
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural CorrespondenceSemi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
Rylan Cottrell
 
P2P Bug Tracking with SD
P2P Bug Tracking with SDP2P Bug Tracking with SD
P2P Bug Tracking with SD
Jesse Vincent
 
Genome Browser
Genome BrowserGenome Browser
Genome Browser
Hong ChangBum
 
Least Squares Fitting
Least Squares Fitting Least Squares Fitting
Least Squares Fitting
MANREET SOHAL
 
Time cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo MethodTime cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo Method
Mohammad Lemar ZALMAİ
 
Shortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRoutingShortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRouting
Daniel Kastl
 
Multinomial distribution
Multinomial distributionMultinomial distribution
Multinomial distribution
Nadeem Uddin
 
テンプレート管理ツール r3
テンプレート管理ツール r3テンプレート管理ツール r3
テンプレート管理ツール r3
Ippei Ogiwara
 
Graphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two VariablesGraphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two Variables
CABILACedricLyoidJ
 
Revisited
RevisitedRevisited
Revisited
Shunsaku Kudo
 
2.8 Function Operations and Composition
2.8 Function Operations and Composition2.8 Function Operations and Composition
2.8 Function Operations and Composition
smiller5
 
Geometric Algebra 2: Applications
Geometric Algebra 2: ApplicationsGeometric Algebra 2: Applications
Geometric Algebra 2: Applications
Vitor Pamplona
 
Conjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and PreconditioningConjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and Preconditioning
Fahad B. Mostafa
 
Game Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience ResearchGame Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience Research
Lennart Nacke
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion Tradeoff
Taeoh Kim
 
Paint.net
Paint.netPaint.net
Paint.net
ОШ ХРШ
 

Similar to Voice Impersonation Using Generative Adversarial Networks review (20)

Revisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object DetectorRevisiting the Sibling Head in Object Detector
Revisiting the Sibling Head in Object Detector
 
The derivatives module03
The derivatives module03The derivatives module03
The derivatives module03
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For Rails
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
 
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural CorrespondenceSemi-automating Small-Scale Source Code Reuse via Structural Correspondence
Semi-automating Small-Scale Source Code Reuse via Structural Correspondence
 
P2P Bug Tracking with SD
P2P Bug Tracking with SDP2P Bug Tracking with SD
P2P Bug Tracking with SD
 
Genome Browser
Genome BrowserGenome Browser
Genome Browser
 
Least Squares Fitting
Least Squares Fitting Least Squares Fitting
Least Squares Fitting
 
Time cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo MethodTime cost trade off optimization using harmony search and Monte-Carlo Method
Time cost trade off optimization using harmony search and Monte-Carlo Method
 
Shortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRoutingShortest Path Search in Real Road Networks with pgRouting
Shortest Path Search in Real Road Networks with pgRouting
 
Multinomial distribution
Multinomial distributionMultinomial distribution
Multinomial distribution
 
テンプレート管理ツール r3
テンプレート管理ツール r3テンプレート管理ツール r3
テンプレート管理ツール r3
 
Graphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two VariablesGraphs of Linear Equations in Two Variables
Graphs of Linear Equations in Two Variables
 
Revisited
RevisitedRevisited
Revisited
 
2.8 Function Operations and Composition
2.8 Function Operations and Composition2.8 Function Operations and Composition
2.8 Function Operations and Composition
 
Geometric Algebra 2: Applications
Geometric Algebra 2: ApplicationsGeometric Algebra 2: Applications
Geometric Algebra 2: Applications
 
Conjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and PreconditioningConjugate Gradient for Normal Equations and Preconditioning
Conjugate Gradient for Normal Equations and Preconditioning
 
Game Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience ResearchGame Metrics and Biometrics: The Future of Player Experience Research
Game Metrics and Biometrics: The Future of Player Experience Research
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion Tradeoff
 
Paint.net
Paint.netPaint.net
Paint.net
 

More from June-Woo Kim

Conformer review
Conformer reviewConformer review
Conformer review
June-Woo Kim
 
Monotonic Multihead Attention review
Monotonic Multihead Attention reviewMonotonic Multihead Attention review
Monotonic Multihead Attention review
June-Woo Kim
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
June-Woo Kim
 
ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain
June-Woo Kim
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
June-Woo Kim
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
June-Woo Kim
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
June-Woo Kim
 

More from June-Woo Kim (7)

Conformer review
Conformer reviewConformer review
Conformer review
 
Monotonic Multihead Attention review
Monotonic Multihead Attention reviewMonotonic Multihead Attention review
Monotonic Multihead Attention review
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
 
ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain ICLR 2 papers review in signal processing domain
ICLR 2 papers review in signal processing domain
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
 

Recently uploaded

TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
NazakatAliKhoso2
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
sachin chaurasia
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 

Recently uploaded (20)

TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 

Voice Impersonation Using Generative Adversarial Networks review

  • 1. Voice Impersonation using Generative Adversarial Networks Yang Gao, Rita Singh, Bhiksha Raj Electrical and Computer Engineering Department, Carnegie Mellon University. Arxiv Date: 19, Feb. 2018. Conference: ICASSP 2018 Presented by: June-Woo Kim Artificial Brain Research Lab., School of Sensor and Display, Kyungpook National University 25, Sep. 2019.
  • 2. 2021-01-09 Overview of the paper • In voice impersonation, the resultant voice must convincingly convey the impression of having been naturally produced by the target speaker, mimicking not only the pitch and other perceivable signal qualities, but also the style of the target speaker • In this paper, they propose a novel neural-network based speech quality and style mimicry framework for the synthesis of impersonated voices – Framework: built upon a fast and accurate GAN model • Generating a synthetic spectrogram from which the time-domain signal is reconstructed using the Griffin-Lim method • Given spectrographic representations of source and target speaker’s voices, the model learns to mimic the target speaker’s voice quality and style, regardless of the linguistic content of either’s voice.
  • 3. 2021-01-09 Overview of the paper • Summarize – This paper, given X is one of gender’s speech, given Y the other’s speech – Goal is change X voice to Y voice, regardless of the linguistic content of either’s voice – They use GAN, however, their model is more close to DiscoGAN • They find some shortcomings of the existing DiscoGAN model and modified them to make VoiceGAN
  • 4. 2021-01-09 Related Works: Generative Adversarial Networks • The original GAN model comprises a generator 𝐺(𝑧) and discrimina- tor 𝐷(𝑥) • The generator 𝐺 takes as input a random variable 𝑧 drawn from some probability distribution function 𝑃𝑧, and produces an output vector 𝑥 𝑧
  • 5. 2021-01-09 Related Works: GAN • Discriminator D() attempts to discriminate between sample 𝑥~𝑃𝑥 that are drawn from 𝑃𝑥, the true (but unknown) distribution we aim to model, and samples produced by the Generator 𝐺 • Let T represent the event that a vector 𝑥 was drawn from 𝑃𝑥, the discriminator attemps to compute the a 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟𝑖 probability of 𝐷 𝑥 = 𝑃(𝑇|𝑥)
  • 6. 2021-01-09 Related Works: GAN • 𝑚𝑖𝑛 𝐺 𝑚𝑎𝑥 𝐷 𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃𝑥 [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )] – 𝑚𝑎𝑥 𝐷 𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥) [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))] – 𝑚𝑖𝑛 𝐺 𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )] • Appendix – 𝐸 𝑥~𝑃 𝑥 --> x is sampled from real data – 𝐸𝑧~𝑃𝑧 --> z is sampled from fake data(Noise=z) – D(x) --> probability of D(real) – 1 − 𝐷 𝑥 𝑧  probability of D(fake) – [log 𝐷(𝑥)] --> likelihood of D(real) – [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake) Recognize real images better Recognize generated images better Optimize G that can fool the discriminator the most
  • 7. 2021-01-09 Related Works: GAN • 𝑚𝑖𝑛 𝐺 𝑚𝑎𝑥 𝐷 𝑉 𝐷, 𝐺 = 𝐸 𝑥~𝑃 𝑥 [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷 𝑥 𝑧 )] – 𝑚𝑎𝑥 𝐷 𝑉 𝐷 = 𝐸 𝑥~𝑃 𝑑𝑎𝑡𝑎(𝑥) [𝑙𝑜𝑔 𝐷(𝑥)] + 𝐸𝑧~𝑃𝑧 [𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))] – 𝑚𝑖𝑛 𝐺 𝑉(𝐺) = 𝐸𝑧~𝑃𝑧(𝑧)[𝑙𝑜𝑔(1 − 𝐷 𝐺(𝑧) )] • Appendix – 𝐸 𝑥~𝑃𝑥 --> x is sampled from real data – 𝐸𝑧~𝑃𝑧 --> z is sampled from fake data(Noise=z) – D(x) --> probability of D(real) – 1 − 𝐷 𝑥 𝑧  probability of D(fake) – [log 𝐷(𝑥)] --> likelihood of D(real) – [log(1 − 𝐷 𝑥 𝑧 )] --> likelihood of D(fake) • To get the better result of GAN, – Generator: 𝐷 𝑥 𝑧  should minimize – Discriminator: 𝐷 𝑥  should maximizing • Conclusion – G and D is adversarial – We often define GAN as a minimax game with G wants to minimize V while D wants to maximize it Recognize real images better Recognize generated images better Optimize G that can fool the discriminator the most
  • 8. 2021-01-09 Related Works: Style transfer by GAN • Input data instance (usually an image) 𝑥A drawn from a distribution 𝑃A is 𝑡𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑑 to an instance 𝑥AB by a generator (more aptly called a “transformer”), 𝐺AB • The aim of the transformer is to convert 𝑥A into the style of the variable 𝑥B which natively occurs with the distribution 𝑃B
Related Works: Style transfer by GAN
• The discriminator D_B attempts to distinguish between genuine draws of x_B from P_B and instances x_AB obtained by transforming draws of x_A from P_A
• Style transfer optimization is achieved as follows:
  – L_G = E_{x_A~P_A}[log(1 − D_B(x_AB))]
  – L_D = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(x_AB))]
• The generator G is updated by minimizing the "generator loss" L_G, while the discriminator D is updated to minimize the "discriminator loss" L_D
Related Works: DiscoGAN
• DiscoGAN is a symmetric model which attempts to transform two categories of data, A and B, into each other
• DiscoGAN includes two generators
  – G_AB: transforms a draw x_A from the distribution P_A of A into x_AB = G_AB(x_A)
  – G_BA: transforms a draw x_B from the distribution P_B of B into x_BA = G_BA(x_B)
  – They are in an inverse relationship with each other
• The goal of G_AB is that its output x_AB = G_AB(x_A) cannot be distinguished from draws from the distribution P_B of B
Related Works: DiscoGAN
• G_AB and G_BA must be inverses of each other to the extent possible
• For any x_A from A,
  – x_ABA = G_BA(G_AB(x_A))
  – must be close to the original x_A
• For any x_B from B,
  – x_BAB = G_AB(G_BA(x_B))
  – must be close to the original x_B
Related Works: DiscoGAN
• It also includes two discriminators, D_A and D_B
• D_A attempts to discriminate between draws from P_A and draws from P_B that have been transformed by G_BA
• D_B performs the analogous operation for draws from P_B
• The generators and discriminators must all be jointly trained
Related Works: DiscoGAN
• Adversarial (generator) losses:
  – L_G_A = −E_{x_B~P_B}[log D_A(G_BA(x_B))]
  – L_G_B = −E_{x_A~P_A}[log D_B(G_AB(x_A))]
• The inverse-mapping requirement is encoded through two reconstruction losses, L_CONST_A and L_CONST_B:
  – L_CONST_A = d(G_BA(G_AB(x_A)), x_A)
  – L_CONST_B = d(G_AB(G_BA(x_B)), x_B)
• In DiscoGAN, d is the MSE
Related Works: DiscoGAN
• The two generator losses are defined as
  – L_GAN_AB = L_G_B + L_CONST_A
  – L_GAN_BA = L_G_A + L_CONST_B
• Total generator loss: L_G = L_GAN_AB + L_GAN_BA = L_G_B + L_CONST_A + L_G_A + L_CONST_B
• The two discriminator losses are defined as
  – L_D_A = −E_{x_A~P_A}[log D_A(x_A)] − E_{x_B~P_B}[log(1 − D_A(G_BA(x_B)))]
  – L_D_B = −E_{x_B~P_B}[log D_B(x_B)] − E_{x_A~P_A}[log(1 − D_B(G_AB(x_A)))]
• Total discriminator loss: L_D = L_D_A + L_D_B
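The loss bookkeeping above can be sketched end to end. A toy numpy version (an illustration, not the paper's networks: the "generators" are simple invertible scalar maps, the "discriminators" are sigmoids, and d is the MSE as in DiscoGAN):

```python
import numpy as np

mse = lambda a, b: np.mean((a - b) ** 2)          # d(.,.) in DiscoGAN

# Toy "generators": exact inverses, so the reconstruction losses vanish.
G_AB = lambda x: 2.0 * x + 1.0
G_BA = lambda x: (x - 1.0) / 2.0

x_A = np.array([0.5, -0.3, 1.2])
x_B = np.array([2.0, 0.4, -1.0])

# Cycle reconstructions and reconstruction losses
x_ABA = G_BA(G_AB(x_A))
x_BAB = G_AB(G_BA(x_B))
L_CONST_A = mse(x_ABA, x_A)
L_CONST_B = mse(x_BAB, x_B)

# Toy "discriminators" mapping a sample to a score in (0, 1)
D_A = lambda x: 1.0 / (1.0 + np.exp(-x))
D_B = lambda x: 1.0 / (1.0 + np.exp(-x))

# Adversarial generator losses
L_G_A = -np.mean(np.log(D_A(G_BA(x_B))))
L_G_B = -np.mean(np.log(D_B(G_AB(x_A))))

L_G = L_G_B + L_CONST_A + L_G_A + L_CONST_B       # total generator loss
L_D_A = -np.mean(np.log(D_A(x_A))) - np.mean(np.log(1 - D_A(G_BA(x_B))))
L_D_B = -np.mean(np.log(D_B(x_B))) - np.mean(np.log(1 - D_B(G_AB(x_A))))
L_D = L_D_A + L_D_B                               # total discriminator loss

print(L_CONST_A, L_CONST_B)  # both ~0: the toy generators invert each other
```

Because the toy G_AB and G_BA are exact inverses, only the adversarial terms contribute; in the real model the cycle is learned, so L_CONST_A and L_CONST_B stay non-zero and must be traded off against the adversarial terms.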
Proposed Model: VoiceGAN
• DiscoGAN was originally designed to transfer style in images
• To apply the model to speech, the signal is first converted to an invertible, picture-like representation, namely a spectrogram
• VoiceGAN incorporates the following modifications:
  – The original DiscoGAN was designed to operate on images of fixed size. For it to work with the inherently variable-sized speech signal, this constraint must be relaxed in the new design
  – It is important to ensure that the linguistic information in the speech signal is not lost
  – The objective is to modify specific aspects of the speech, e.g. style, so extra components are added to the model to achieve this
VoiceGAN
• DiscoGAN reconstruction losses:
  – L_CONST_A = d(G_BA(G_AB(x_A)), x_A) = d(x_ABA, x_A)
  – L_CONST_B = d(G_AB(G_BA(x_B)), x_B) = d(x_BAB, x_B)
• VoiceGAN reconstruction losses:
  – L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
  – L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
  – The extra term retains the linguistic information
VoiceGAN
• VoiceGAN reconstruction losses:
  – L_CONST_A = α·d(x_ABA, x_A) + β·d(x_AB, x_A)
  – L_CONST_B = α·d(x_BAB, x_B) + β·d(x_BA, x_B)
• d(x_AB, x_A)
  – This loss attempts to retain the structure of x_A even after it has been converted to x_AB
• α, β
  – Trade off accurate reconversion against retention of linguistic information after conversion
  – The paper does not report the values of these hyperparameters; it only states that "careful choice of α, β ensures both"
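The modified reconstruction loss can be written out directly. A numpy sketch (the α and β values here are made up for the example, since the paper does not report them; d is the MSE as in DiscoGAN):

```python
import numpy as np

def voicegan_const_loss(x_A, x_AB, x_ABA, alpha=1.0, beta=0.5):
    """L_CONST_A = alpha*d(x_ABA, x_A) + beta*d(x_AB, x_A), with d = MSE.

    The beta term is VoiceGAN's addition: it keeps the style-converted
    spectrogram x_AB structurally close to x_A (linguistic content),
    while the alpha term enforces DiscoGAN-style cycle reconstruction.
    """
    d = lambda a, b: np.mean((a - b) ** 2)
    return alpha * d(x_ABA, x_A) + beta * d(x_AB, x_A)

x_A   = np.ones((4, 6))          # stand-in spectrogram of speaker A
x_AB  = x_A + 0.1                # style-converted, structure mostly kept
x_ABA = x_A + 0.01               # near-perfect cycle reconstruction
print(voicegan_const_loss(x_A, x_AB, x_ABA))   # ~0.0051
```

Setting beta = 0 recovers the plain DiscoGAN reconstruction loss; increasing beta penalizes conversions that drift too far from the source spectrogram's structure.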
VoiceGAN
• The generator is the same as in DiscoGAN
• The proposed discriminator:
  – An adaptive pooling layer is added after the CNN layers and before the fully connected layer
  – It performs channel-wise pooling
  – This converts any variable-sized feature map into a vector with a fixed number of dimensions
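The adaptive pooling step can be sketched in numpy (an illustration of the idea, not the authors' code): whatever the time-width of the convolutional feature map, channel-wise average pooling yields a fixed-length vector.

```python
import numpy as np

def adaptive_channel_pool(feature_map):
    """Average each channel's (freq x time) map down to one number.

    feature_map: array of shape (channels, freq, time), where `time`
    varies with utterance length. Output: a fixed (channels,) vector,
    so the following fully connected layer sees a constant input size.
    """
    return feature_map.mean(axis=(1, 2))

short = np.random.rand(128, 16, 40)    # feature map of a short utterance
long_ = np.random.rand(128, 16, 200)   # feature map of a long utterance
print(adaptive_channel_pool(short).shape, adaptive_channel_pool(long_).shape)
# (128,) (128,): same dimensionality regardless of input width
```

This is what relaxes DiscoGAN's fixed-input-size constraint on the discriminator side: only the pooled channel vector, not the raw feature map, reaches the fully connected layer.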
Style Embedding Model (D_S)
• A style discriminator D_S is added, trained with the speech style labels:
  – L_DSTYLE-A = d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)
  – L_DSTYLE-B = d(D_S(x_B), label_B) + d(D_S(x_BA), label_A) + d(D_S(x_BAB), label_B)
• L_DSTYLE = L_DSTYLE-A + L_DSTYLE-B
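The style loss can be sketched as well. A numpy illustration (not the paper's network: the toy D_S outputs a probability of "style B", labels are 0 for A and 1 for B, and d is taken as the MSE here, one plausible choice since the slide writes it only as d(·,·)):

```python
import numpy as np

d = lambda pred, label: np.mean((pred - label) ** 2)  # one plausible d(.,.)

def style_loss_A(D_S, x_A, x_AB, x_ABA, label_A=0.0, label_B=1.0):
    """L_DSTYLE-A: x_A and the cycle x_ABA must score as style A,
    while the converted x_AB must score as style B."""
    return d(D_S(x_A), label_A) + d(D_S(x_AB), label_B) + d(D_S(x_ABA), label_A)

# Toy style scorer: the mean level of the "spectrogram" as a style proxy.
D_S = lambda x: np.clip(x.mean(), 0.0, 1.0)

x_A   = np.zeros((4, 6))         # style A exemplar
x_AB  = np.ones((4, 6))          # perfectly converted to style B
x_ABA = np.zeros((4, 6))         # perfectly reconstructed back to A
print(style_loss_A(D_S, x_A, x_AB, x_ABA))   # 0.0: every term matches its label
```

Note the middle term: unlike an ordinary discriminator, D_S demands that the converted sample x_AB carry the target style label, which is what pushes the generator to transfer style rather than merely reconstruct.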
Final loss
• DiscoGAN final losses
  – Generator: L_G_B + L_CONST_A + L_G_A + L_CONST_B
  – Discriminator: L_D_A + L_D_B
• VoiceGAN final losses
  – Generator: L_G_B + L_CONST_A + L_G_A + L_CONST_B
  – Discriminator: L_D_A + L_D_B + L_DSTYLE
Experiment - Dataset
• They use the TIDIGITS dataset
  – 326 speakers: 111 men, 114 women, 50 boys, 51 girls
  – Each speaker reads 77 digit sequences
  – Sampling rate: 16 kHz
  – Style attribute: gender
  – Utterances consist of spoken digit strings
  – Spectrograms are used as input (possibly mel-scale filter-bank spectrograms)
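A spectrogram front end along these lines can be sketched in numpy (a generic Hann-windowed STFT magnitude, not the authors' exact preprocessing; frame and FFT sizes are illustrative):

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT (numpy only)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (n_fft//2 + 1, n_frames)

sr = 16000                        # TIDIGITS sampling rate
t = np.arange(sr) / sr            # 1 s of audio
x = np.sin(2 * np.pi * 440 * t)   # 440 Hz test tone
S = spectrogram(x)
print(S.shape)   # (257, n_frames): fixed frequency axis, variable time axis
```

The variable time axis is exactly why the discriminator needs the adaptive pooling described earlier; after conversion, the paper reconstructs the waveform from the generated spectrogram with the Griffin-Lim method.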
Model Architecture
• Generator
  – A 6-layer CNN encoder and a 6-layer transposed-CNN decoder
• Discriminator
  – A 7-layer CNN with adaptive pooling
• Both networks employ batch normalization and leaky ReLU activations (similar to DiscoGAN)
• The number of filters in each layer is an increasing power of 2 (32, 64, 128, ...)
