Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-To-Speech
Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 WedAM1-8-2 (SS04)
Overview: Multi-Speaker Neural Text-To-Speech
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ Multi-speaker neural TTS [Fan+15][Hojo+18]
– A single Deep Neural Network (DNN) generates multiple speakers' voices
• Speaker embedding: conditional input to control speaker ID
➢ Voice cloning (e.g., [Arik+18])
– TTS of an unseen speaker's voice from a small amount of data
[Figure: text and a spkr. emb. are fed to the multi-spkr. neural TTS model, which outputs speech]
Research Outline
➢ Conventional algorithm: GAN*-based training
– High-quality TTS by adversarial training of discriminator & generator
– Poor generalization performance in voice cloning
• TTS model cannot observe unseen speakers' voices in training...
➢ Proposed algorithm: Multi-task adversarial training
– Primary task: GAN-based multi-speaker neural TTS training
• Objective: feature reconstruction loss + adversarial loss
– Secondary task: improving (pseudo) unseen speaker's TTS quality
• Objective: loss to generate realistic voices of unseen speakers
➢ Results: High-quality voice cloning by our algorithm!
*GAN: Generative Adversarial Network [Goodfellow+14]
Baseline 2: GANSpeech [Yang+21]
➢ Overview: TTS model (generator) vs. JCU* discriminator
– TTS model generates speech features from text & spkr. emb.
– JCU discriminator classifies synth. / nat. from two kinds of inputs
• Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb.
[Figure: TTS model 𝐺(⋅) maps text & spkr. emb. to synth. speech; the JCU discriminator applies shared layers 𝐷S(⋅) to synth. or nat. speech, then an unconditional branch 𝐷U(⋅) and a conditional branch 𝐷C(⋅) given the spkr. emb., each outputting 0 (synth.) or 1 (nat.)]
*JCU: Joint Conditional & Unconditional [Zhang+18]
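The JCU structure above can be sketched as a shared trunk feeding two scoring branches. This is a minimal numpy sketch; the layer sizes, tanh nonlinearity, and weight names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def jcu_discriminator(feat, spk_emb, W_s, W_u, W_c):
    """Toy JCU discriminator: shared layers D_S, then an unconditional
    branch D_U (speech features only) and a conditional branch D_C
    (shared features concatenated with the speaker embedding)."""
    h = np.tanh(feat @ W_s)                              # D_S: shared feature extractor
    score_u = float(h @ W_u)                             # D_U: score w/o spkr. emb.
    score_c = float(np.concatenate([h, spk_emb]) @ W_c)  # D_C: score w/ spkr. emb.
    return score_u, score_c

# toy dimensions (illustrative only)
feat = rng.normal(size=16)   # e.g. one frame of speech features
spk = rng.normal(size=4)     # speaker embedding
W_s = rng.normal(size=(16, 8))
W_u = rng.normal(size=8)
W_c = rng.normal(size=12)
u, c = jcu_discriminator(feat, spk, W_s, W_u, W_c)
```

Note the design point encoded here: only the conditional branch sees the speaker embedding, so 𝐷U judges general speech quality while 𝐷C judges speaker identity.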
GANSpeech Algorithm: JCU Discriminator Update
➢ Objective: Discriminating synth. (0) / nat. (1) correctly
– 𝐷S extracts shared features of nat. / synth. speech
– 𝐷U learns general characteristics of speech
– 𝐷C captures spkr.-specific characteristics of speech
[Figure: same GANSpeech architecture as above; the disc. loss is computed from the 𝐷U(⋅) and 𝐷C(⋅) outputs, with targets 0 for synth. and 1 for nat. speech]
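The discriminator objective above can be written as a sketch, assuming a least-squares GAN criterion applied to both JCU branches (the exact criterion and weighting in GANSpeech may differ):

```python
import numpy as np

def jcu_disc_loss(d_nat, d_syn):
    """Discriminator loss over the JCU branches: natural speech scores
    are pushed toward 1, synthetic toward 0, for both the unconditional
    and the conditional branch. d_nat / d_syn are lists of per-branch
    score arrays [D_U output, D_C output]."""
    loss = 0.0
    for s_nat, s_syn in zip(d_nat, d_syn):
        loss += np.mean((s_nat - 1.0) ** 2)  # nat. -> 1
        loss += np.mean(np.square(s_syn))    # synth. -> 0
    return loss / len(d_nat)
```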
GANSpeech Algorithm: TTS Model Update
➢ Objective: Generating speech & Deceiving JCU discriminator
– Speech reconst. loss makes TTS model generate speech features
• Phoneme duration, F0, energy, mel-spectrogram
– Adv. loss causes JCU discriminator to misclassify synth. as nat.
[Figure: TTS model update; the speech reconst. loss compares synth. and nat. speech features, while the adv. loss pushes the 𝐷U(⋅) and 𝐷C(⋅) outputs on synth. speech toward 1 (nat.)]
➢ GAN = Distribution matching betw. nat. and synth. data
– High-quality TTS for seen spkrs. included in training corpus
– No guarantee to generalize TTS for unseen spkrs.
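The TTS model update above combines the two terms; a minimal sketch, assuming plain MSE for the feature reconstruction losses and a least-squares adversarial term (the loss weight `lambda_adv` is an assumption, not a value from the paper):

```python
import numpy as np

def tts_model_loss(pred_feats, nat_feats, d_syn_scores, lambda_adv=1.0):
    """GANSpeech-style generator objective:
    - speech reconst. loss on predicted features (duration, F0,
      energy, mel-spectrogram), here collapsed into one MSE term
    - adv. loss pushing the JCU discriminator's scores on synth.
      speech toward 1, i.e. 'classify my output as natural'."""
    recon = np.mean((pred_feats - nat_feats) ** 2)
    adv = np.mean([np.mean((s - 1.0) ** 2) for s in d_syn_scores])
    return recon + lambda_adv * adv
```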
Overview of Proposed Method
➢ Motivation: Diversifying spkr. variation during training to
– Widen spkr. emb. distribution that TTS model can cover
– Improve robustness of TTS model towards unseen speakers
➢ Idea: Adversarially Constrained Autoencoder Interpolation [Berthelot+19]
– Architecture: Autoencoder w/ feature interpolation + Critic
• Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data)
• Autoencoder makes critic output 𝛼 = 0 for interpolated data
[Figure: feature interpolation with coefficient 𝛼 between two latent vectors in the autoencoder]
We introduce this idea to GAN-based multi-speaker TTS
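The ACAI idea above reduces to two small pieces: a convex interpolation in latent space, and a critic target of 0 for pure data and 𝛼 for mixtures. A minimal sketch (function names are illustrative):

```python
import numpy as np

def interpolate(z_a, z_b, alpha):
    """ACAI-style latent interpolation with coefficient alpha;
    alpha = 0 recovers the pure latent z_a."""
    return (1.0 - alpha) * z_a + alpha * z_b

def critic_target(alpha, is_pure):
    """The critic is trained to predict alpha for interpolated
    inputs and 0 for pure (real) data."""
    return 0.0 if is_pure else alpha
```

The autoencoder is then trained so the critic outputs 0 even on interpolated data, pushing interpolants to look like pure data.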
Multi-Task Adversarial Training Algorithm
➢ Overview: GANSpeech + ACAI-derived regularization
– Encoder = spkr. encoder (fixed parameters during training)
– Decoder = TTS model (i.e., generator in GAN)
– Critic 𝐶 = additional branch in the Multi-Task (MT) discriminator
[Figure: GANSpeech architecture extended with mixed/pure spkr. emb. produced by the spkr. enc. & interp. module; the MT discriminator adds a critic branch 𝐶(⋅) that outputs 𝛼 for mixed and 0 for pure spkr. emb.]
Proposed Algorithm: MT Discriminator Update
[Figure: MT discriminator update; disc. loss on 𝐷U(⋅)/𝐷C(⋅) (0: synth., 1: nat.) plus critic loss on 𝐶(⋅) (target 𝛼 for mixed, 0 for pure spkr. emb.)]
➢ Objective: Discriminating synth./nat. & mixed/pure
– Synth. speech samples are generated from mixed / pure spkr. emb.
• Coefficient: 𝛼 ~ 𝑈(0.0, 0.5); spkr. pairs: shuffled within mini-batch
– Criterion for critic training: MSE betw. predicted / correct 𝛼
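The mixing and critic criterion above can be sketched directly: 𝛼 ~ 𝑈(0.0, 0.5), pairs formed by shuffling the mini-batch, and MSE between predicted and true 𝛼. Batch-handling details beyond the slide are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixed_embeddings(spk_embs):
    """Mix speaker embeddings within a mini-batch:
    alpha ~ U(0, 0.5) per example, partner chosen by shuffling the
    batch. Returns the mixed embeddings and the true alphas
    (the critic's regression targets; pure embeddings get target 0)."""
    n = len(spk_embs)
    alpha = rng.uniform(0.0, 0.5, size=(n, 1))
    perm = rng.permutation(n)  # spkr. pairs shuffled within mini-batch
    mixed = (1.0 - alpha) * spk_embs + alpha * spk_embs[perm]
    return mixed, alpha

def critic_loss(alpha_pred, alpha_true):
    """Criterion for critic training: MSE betw. predicted / correct alpha."""
    return np.mean((alpha_pred - alpha_true) ** 2)
```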
Proposed Algorithm: TTS Model Update
➢ Objective: GANSpeech objective + ACAI loss
– ACAI loss makes the critic output 0 for synth. speech of mixed spkrs.
• Regularization on TTS for (pseudo) unseen spkrs.
– Inference-time computation is unchanged from GANSpeech
[Figure: TTS model update in the proposed algorithm; speech reconst. loss + adv. loss (synth. speech pushed toward 1: nat.) + ACAI loss (critic output on mixed-spkr. synth. speech pushed toward 0) computed through the MT discriminator]
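The full TTS model objective above then adds one regularization term to the GANSpeech objective; a minimal sketch (the squared-error form of the ACAI term and the weight `lambda_acai` are assumptions):

```python
import numpy as np

def acai_loss(critic_alpha_mixed):
    """ACAI regularization for the TTS model update: push the critic's
    alpha prediction on speech synthesized from *mixed* spkr. emb.
    toward 0, i.e. make pseudo-unseen-speaker speech indistinguishable
    from a pure speaker's voice."""
    return np.mean(np.square(critic_alpha_mixed))

def proposed_tts_loss(recon_loss, adv_loss, critic_alpha_mixed, lambda_acai=1.0):
    """GANSpeech objective + ACAI loss. The critic is only used during
    training, so inference cost matches plain GANSpeech."""
    return recon_loss + adv_loss + lambda_acai * acai_loss(critic_alpha_mixed)
```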