Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-To-Speech
Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 WedAM1-8-2 (SS04)
Overview: Multi-Speaker Neural Text-To-Speech
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ Multi-speaker neural TTS [Fan+15][Hojo+18]
– A single Deep Neural Network (DNN) generates multiple speakers' voices
• Speaker embedding: conditional input to control speaker ID
➢ Voice cloning (e.g., [Arik+18])
– TTS of an unseen speaker's voice from a small amount of data
[Figure: text and a spkr. emb. are fed to the multi-spkr. neural TTS model, which outputs speech]
Research Outline
➢ Conventional algorithm: GAN*-based training
– High-quality TTS by adversarial training of discriminator & generator
– Poor generalization performance in voice cloning
• TTS model cannot observe unseen speakers' voices in training...
➢ Proposed algorithm: Multi-task adversarial training
– Primary task: GAN-based multi-speaker neural TTS training
• Objective: feature reconstruction loss + adversarial loss
– Secondary task: improving (pseudo) unseen speaker's TTS quality
• Objective: loss to generate realistic voices of unseen speakers
➢ Results: High-quality voice cloning by our algorithm!
*GAN: Generative Adversarial Network [Goodfellow+14]
Baseline 2: GANSpeech [Yang+21]
➢ Overview: TTS model (generator) vs. JCU* discriminator
– TTS model generates speech features from text & spkr. emb.
– JCU discriminator classifies synth. / nat. from two kinds of inputs
• Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb.
[Figure: TTS model 𝐺(⋅) maps text & spkr. emb. to synth. speech; the JCU discriminator applies shared layers 𝐷S(⋅) to synth. or nat. speech, then an unconditional branch 𝐷U(⋅) and a conditional branch 𝐷C(⋅) given the spkr. emb., each outputting 0 (synth.) or 1 (nat.)]
*JCU: Joint Conditional & Unconditional [Zhang+18]
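The JCU structure above can be sketched as a shared trunk feeding two scoring branches. This is a minimal numpy sketch; the layer sizes, tanh nonlinearity, and weight names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def jcu_discriminator(feat, spk_emb, W_s, W_u, W_c):
    """Toy JCU discriminator: shared layers D_S, then an unconditional
    branch D_U (speech features only) and a conditional branch D_C
    (shared features concatenated with the speaker embedding)."""
    h = np.tanh(feat @ W_s)                              # D_S: shared feature extractor
    score_u = float(h @ W_u)                             # D_U: score w/o spkr. emb.
    score_c = float(np.concatenate([h, spk_emb]) @ W_c)  # D_C: score w/ spkr. emb.
    return score_u, score_c

# toy dimensions (illustrative only)
feat = rng.normal(size=16)   # e.g. one frame of speech features
spk = rng.normal(size=4)     # speaker embedding
W_s = rng.normal(size=(16, 8))
W_u = rng.normal(size=8)
W_c = rng.normal(size=12)
u, c = jcu_discriminator(feat, spk, W_s, W_u, W_c)
```

Note the design point encoded here: only the conditional branch sees the speaker embedding, so 𝐷U judges general speech quality while 𝐷C judges speaker identity.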
GANSpeech Algorithm: JCU Discriminator Update
➢ Objective: Discriminating synth. (0) / nat. (1) correctly
– 𝐷S extracts shared features of nat. / synth. speech
– 𝐷U learns general characteristics of speech
– 𝐷C captures spkr.-specific characteristics of speech
[Figure: same GANSpeech architecture as above; the disc. loss is computed from the 𝐷U(⋅) and 𝐷C(⋅) outputs, with targets 0 for synth. and 1 for nat. speech]
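The discriminator objective above can be written as a sketch, assuming a least-squares GAN criterion applied to both JCU branches (the exact criterion and weighting in GANSpeech may differ):

```python
import numpy as np

def jcu_disc_loss(d_nat, d_syn):
    """Discriminator loss over the JCU branches: natural speech scores
    are pushed toward 1, synthetic toward 0, for both the unconditional
    and the conditional branch. d_nat / d_syn are lists of per-branch
    score arrays [D_U output, D_C output]."""
    loss = 0.0
    for s_nat, s_syn in zip(d_nat, d_syn):
        loss += np.mean((s_nat - 1.0) ** 2)  # nat. -> 1
        loss += np.mean(np.square(s_syn))    # synth. -> 0
    return loss / len(d_nat)
```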
GANSpeech Algorithm: TTS Model Update
➢ Objective: Generating speech & Deceiving JCU discriminator
– Speech reconst. loss makes TTS model generate speech features
• Phoneme duration, F0, energy, mel-spectrogram
– Adv. loss causes JCU discriminator to misclassify synth. as nat.
[Figure: TTS model update; the speech reconst. loss compares synth. and nat. speech features, while the adv. loss pushes the 𝐷U(⋅) and 𝐷C(⋅) outputs on synth. speech toward 1 (nat.)]
➢ GAN = Distribution matching betw. nat. and synth. data
– High-quality TTS for seen spkrs. included in training corpus
– No guarantee to generalize TTS for unseen spkrs.
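The TTS model update above combines the two terms; a minimal sketch, assuming plain MSE for the feature reconstruction losses and a least-squares adversarial term (the loss weight `lambda_adv` is an assumption, not a value from the paper):

```python
import numpy as np

def tts_model_loss(pred_feats, nat_feats, d_syn_scores, lambda_adv=1.0):
    """GANSpeech-style generator objective:
    - speech reconst. loss on predicted features (duration, F0,
      energy, mel-spectrogram), here collapsed into one MSE term
    - adv. loss pushing the JCU discriminator's scores on synth.
      speech toward 1, i.e. 'classify my output as natural'."""
    recon = np.mean((pred_feats - nat_feats) ** 2)
    adv = np.mean([np.mean((s - 1.0) ** 2) for s in d_syn_scores])
    return recon + lambda_adv * adv
```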
Overview of Proposed Method
➢ Motivation: Diversifying spkr. variation during training to
– Widen spkr. emb. distribution that TTS model can cover
– Improve robustness of TTS model towards unseen speakers
➢ Idea: Adversarially Constrained Autoencoder Interpolation [Berthelot+19]
– Architecture: Autoencoder w/ feature interpolation + Critic
• Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data)
• Autoencoder makes critic output 𝛼 = 0 for interpolated data
[Figure: feature interpolation with coefficient 𝛼 between two latent vectors in the autoencoder]
We introduce this idea to GAN-based multi-speaker TTS
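The ACAI idea above reduces to two small pieces: a convex interpolation in latent space, and a critic target of 0 for pure data and 𝛼 for mixtures. A minimal sketch (function names are illustrative):

```python
import numpy as np

def interpolate(z_a, z_b, alpha):
    """ACAI-style latent interpolation with coefficient alpha;
    alpha = 0 recovers the pure latent z_a."""
    return (1.0 - alpha) * z_a + alpha * z_b

def critic_target(alpha, is_pure):
    """The critic is trained to predict alpha for interpolated
    inputs and 0 for pure (real) data."""
    return 0.0 if is_pure else alpha
```

The autoencoder is then trained so the critic outputs 0 even on interpolated data, pushing interpolants to look like pure data.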
Multi-Task Adversarial Training Algorithm
➢ Overview: GANSpeech + ACAI-derived regularization
– Encoder = spkr. encoder (fixed parameters during training)
– Decoder = TTS model (i.e., generator in GAN)
– Critic 𝐶 = additional branch in the Multi-Task (MT) discriminator
[Figure: GANSpeech architecture extended with mixed/pure spkr. emb. produced by the spkr. enc. & interp. module; the MT discriminator adds a critic branch 𝐶(⋅) that outputs 𝛼 for mixed and 0 for pure spkr. emb.]
Proposed Algorithm: MT Discriminator Update
[Figure: MT discriminator update; disc. loss on 𝐷U(⋅)/𝐷C(⋅) (0: synth., 1: nat.) plus critic loss on 𝐶(⋅) (target 𝛼 for mixed, 0 for pure spkr. emb.)]
➢ Objective: Discriminating synth./nat. & mixed/pure
– Synth. speech samples are generated from mixed / pure spkr. emb.
• Coefficient: 𝛼 ~ 𝑈(0.0, 0.5); spkr. pairs: shuffled within mini-batch
– Criterion for critic training: MSE betw. predicted / correct 𝛼
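The mixing and critic criterion above can be sketched directly: 𝛼 ~ 𝑈(0.0, 0.5), pairs formed by shuffling the mini-batch, and MSE between predicted and true 𝛼. Batch-handling details beyond the slide are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixed_embeddings(spk_embs):
    """Mix speaker embeddings within a mini-batch:
    alpha ~ U(0, 0.5) per example, partner chosen by shuffling the
    batch. Returns the mixed embeddings and the true alphas
    (the critic's regression targets; pure embeddings get target 0)."""
    n = len(spk_embs)
    alpha = rng.uniform(0.0, 0.5, size=(n, 1))
    perm = rng.permutation(n)  # spkr. pairs shuffled within mini-batch
    mixed = (1.0 - alpha) * spk_embs + alpha * spk_embs[perm]
    return mixed, alpha

def critic_loss(alpha_pred, alpha_true):
    """Criterion for critic training: MSE betw. predicted / correct alpha."""
    return np.mean((alpha_pred - alpha_true) ** 2)
```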
Proposed Algorithm: TTS Model Update
➢ Objective: GANSpeech objective + ACAI loss
– ACAI loss makes the critic output 0 for synth. speech of mixed spkrs.
• Regularization on TTS for (pseudo) unseen spkrs.
– Inference-time computation is unchanged from GANSpeech
[Figure: TTS model update in the proposed algorithm; speech reconst. loss + adv. loss (synth. speech pushed toward 1: nat.) + ACAI loss (critic output on mixed-spkr. synth. speech pushed toward 0) computed through the MT discriminator]
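The full TTS model objective above then adds one regularization term to the GANSpeech objective; a minimal sketch (the squared-error form of the ACAI term and the weight `lambda_acai` are assumptions):

```python
import numpy as np

def acai_loss(critic_alpha_mixed):
    """ACAI regularization for the TTS model update: push the critic's
    alpha prediction on speech synthesized from *mixed* spkr. emb.
    toward 0, i.e. make pseudo-unseen-speaker speech indistinguishable
    from a pure speaker's voice."""
    return np.mean(np.square(critic_alpha_mixed))

def proposed_tts_loss(recon_loss, adv_loss, critic_alpha_mixed, lambda_acai=1.0):
    """GANSpeech objective + ACAI loss. The critic is only used during
    training, so inference cost matches plain GANSpeech."""
    return recon_loss + adv_loss + lambda_acai * acai_loss(critic_alpha_mixed)
```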