Multi-Task Adversarial Training Algorithm for
Multi-Speaker Neural Text-To-Speech
Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 WedAM1-8-2 (SS04)
Overview: Multi-Speaker Neural Text-To-Speech
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ Multi-speaker neural TTS [Fan+15][Hojo+18]
– A single Deep Neural Network (DNN) generates multiple speakers' voices
• Speaker embedding: conditional input to control speaker identity
➢ Voice cloning (e.g., [Arik+18])
– TTS of an unseen speaker's voice with a small amount of data
[Figure: text and a spkr. emb. are input to the multi-spkr. neural TTS model, which outputs speech]
Research Outline
➢ Conventional algorithm: GAN*-based training
– High-quality TTS by adversarial training of discriminator & generator
– Poor generalization performance in voice cloning
• TTS model cannot observe unseen speakers' voices in training...
➢ Proposed algorithm: Multi-task adversarial training
– Primary task: GAN-based multi-speaker neural TTS training
• Objective: feature reconstruction loss + adversarial loss
– Secondary task: improving TTS quality for (pseudo) unseen speakers
• Objective: loss encouraging realistic voices of unseen speakers
➢ Results: High-quality voice cloning by our algorithm!
*GAN: Generative Adversarial Network [Goodfellow+14]
Baseline 1: Multi-Speaker FastSpeech 2 (FS2) [Ren+21]
➢ Transfer-learning-based multi-speaker neural TTS [Jia+18]
– 1. Pretrain spkr. encoder w/ spkr. verification task (e.g., GE2E* loss)
– 2. Train FS2-based TTS model w/ pretrained spkr. encoder
*GE2E: Generalized End-to-End [Wan+18]
[Figure: the spkr. encoder extracts the spkr. emb. from reference speech (fixed during TTS training); the FS2 variance adaptor adds variance information of speech]
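To make the two-stage recipe concrete, here is a minimal PyTorch-style sketch of step 2: the GE2E-pretrained speaker encoder is frozen, and its embedding conditions the FS2 encoder output. All class and attribute names (MultiSpeakerFS2, fs2.encoder, fs2.decode) are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiSpeakerFS2(nn.Module):
    """Sketch: FastSpeech 2 conditioned on a pretrained, frozen spkr. encoder."""
    def __init__(self, spk_encoder: nn.Module, fs2: nn.Module,
                 emb_dim: int = 256, hid_dim: int = 256):
        super().__init__()
        self.spk_encoder = spk_encoder
        for p in self.spk_encoder.parameters():
            p.requires_grad = False              # fixed during TTS training
        self.fs2 = fs2
        self.spk_proj = nn.Linear(emb_dim, hid_dim)

    def forward(self, phonemes: torch.Tensor, ref_mel: torch.Tensor):
        with torch.no_grad():                    # spkr. encoder is not updated
            spk_emb = self.spk_encoder(ref_mel)  # (B, 256), GE2E-pretrained
        h = self.fs2.encoder(phonemes)           # (B, T, hid_dim)
        h = h + self.spk_proj(spk_emb).unsqueeze(1)   # broadcast over time
        return self.fs2.decode(h)                # duration, F0, energy, mel
```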
Baseline 2: GANSpeech [Yang+21]
➢ Overview: TTS model (generator) vs. JCU* discriminator
– TTS model generates speech features from text & spkr. emb.
– JCU discriminator classifies synth. / nat. from two kinds of inputs
• Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb.
[Figure: the TTS model 𝐺(⋅) generates synth. speech from text & spkr. emb.; the JCU discriminator applies shared layers 𝐷S(⋅), then 𝐷U(⋅) (unconditional) and 𝐷C(⋅) (conditional), to classify synth. (0) / nat. (1)]
*JCU: Joint Conditional & Unconditional [Zhang+18]
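The JCU structure above can be sketched as follows, assuming mel-spectrogram inputs; the convolutional trunk and layer sizes are illustrative choices, not taken from the GANSpeech paper.

```python
import torch
import torch.nn as nn

class JCUDiscriminator(nn.Module):
    """Shared trunk D_S with unconditional (D_U) and conditional (D_C) heads."""
    def __init__(self, mel_dim: int = 80, spk_dim: int = 256, hid: int = 256):
        super().__init__()
        self.d_s = nn.Sequential(                 # shared feature extractor
            nn.Conv1d(mel_dim, hid, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(hid, hid, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.d_u = nn.Conv1d(hid, 1, 3, padding=1)            # w/o spkr. emb.
        self.d_c = nn.Conv1d(hid + spk_dim, 1, 3, padding=1)  # w/ spkr. emb.

    def forward(self, mel: torch.Tensor, spk_emb: torch.Tensor):
        # mel: (B, mel_dim, T), spk_emb: (B, spk_dim)
        h = self.d_s(mel)
        u = self.d_u(h)                           # unconditional frame scores
        s = spk_emb.unsqueeze(-1).expand(-1, -1, h.size(-1))
        c = self.d_c(torch.cat([h, s], dim=1))    # conditional frame scores
        return u, c
```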
GANSpeech Algorithm: JCU Discriminator Update
➢ Objective: Discriminating synth. (0) / nat. (1) correctly
– 𝐷S extracts shared features of nat. / synth. speech
– 𝐷U learns general characteristics of speech
– 𝐷C captures spkr.-specific characteristics of speech
[Figure: JCU discriminator update — disc. loss trains 𝐷U and 𝐷C to output 0 for synth. and 1 for nat. speech]
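Assuming a least-squares GAN formulation (with the 0/1 targets shown on the slide), a sketch of the discriminator update looks like this:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, mel_nat, mel_syn, spk_emb):
    """LSGAN-style JCU loss (assumed): nat. -> 1, synth. -> 0 on both heads."""
    u_nat, c_nat = disc(mel_nat, spk_emb)
    u_syn, c_syn = disc(mel_syn.detach(), spk_emb)   # no gradient into G
    real = (F.mse_loss(u_nat, torch.ones_like(u_nat)) +
            F.mse_loss(c_nat, torch.ones_like(c_nat)))
    fake = (F.mse_loss(u_syn, torch.zeros_like(u_syn)) +
            F.mse_loss(c_syn, torch.zeros_like(c_syn)))
    return real + fake
```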
GANSpeech Algorithm: TTS Model Update
➢ Objective: Generating speech & Deceiving JCU discriminator
– Speech reconst. loss makes TTS model generate speech features
• Phoneme duration, F0, energy, mel-spectrogram
– Adv. loss causes JCU discriminator to misclassify synth. as nat.
[Figure: TTS model update — speech reconst. loss on generated features; adv. loss pushes 𝐷U and 𝐷C to output 1 (nat.) for synth. speech]
➢ GAN = Distribution matching betw. nat. and synth. data
– High-quality TTS for seen spkrs. included in training corpus
– No guarantee to generalize TTS for unseen spkrs.
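The corresponding TTS-model update, under the same least-squares assumption; recon_loss stands in for the FS2 losses on phoneme duration, F0, energy, and mel-spectrogram.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc, mel_syn, spk_emb, recon_loss, adv_w: float = 1.0):
    """Reconstruction + adversarial loss: push both heads to output 1 (nat.)."""
    u_syn, c_syn = disc(mel_syn, spk_emb)
    adv = (F.mse_loss(u_syn, torch.ones_like(u_syn)) +
           F.mse_loss(c_syn, torch.ones_like(c_syn)))
    return recon_loss + adv_w * adv
```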
➢ Proposed Method: Multi-Speaker Neural TTS Based on Multi-Task Adversarial Training
Overview of Proposed Method
➢ Motivation: Diversifying spkr. variation during training to
– Widen spkr. emb. distribution that TTS model can cover
– Improve robustness of TTS model towards unseen speakers
➢ Idea: Adversarially Constrained Autoencoder Interpolation [Berthelot+19]
– Architecture: Autoencoder w/ feature interpolation + Critic
• Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data)
• Autoencoder makes critic output 𝛼 = 0 for interpolated data
[Figure: ACAI — two inputs are encoded, the latents are interpolated with coefficient 𝛼 and decoded; the critic tries to predict 𝛼 (0 for pure data)]
We introduce this idea to GAN-based multi-speaker TTS
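A minimal sketch of the ACAI idea in isolation, following [Berthelot+19]; enc, dec, and critic are placeholder modules, and the full ACAI objective has additional regularization terms omitted here.

```python
import torch
import torch.nn.functional as F

def acai_step(enc, dec, critic, x1, x2):
    """Interpolate two latents with alpha ~ U(0, 0.5); critic regresses alpha."""
    alpha = 0.5 * torch.rand(x1.size(0), 1)          # alpha in [0, 0.5]
    z = alpha * enc(x1) + (1 - alpha) * enc(x2)      # latent interpolation
    x_mix = dec(z)
    critic_loss = F.mse_loss(critic(x_mix), alpha)   # critic: predict alpha
    ae_reg = critic(x_mix).pow(2).mean()             # AE: make critic output 0
    return critic_loss, ae_reg
```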
Multi-Task Adversarial Training Algorithm
➢ Overview: GANSpeech + ACAI-derived regularization
– Encoder = spkr. encoder (fixed parameters during training)
– Decoder = TTS model (i.e., generator in GAN)
– Critic 𝐶 = additional branch in the Multi-Task (MT) discriminator (sketched below)
[Figure: the TTS model 𝐺(⋅) takes text and a mixed/pure spkr. emb. from the spkr. encoder & interpolation; the MT discriminator adds a critic branch 𝐶(⋅) to the JCU heads 𝐷S, 𝐷U, 𝐷C, outputting 𝛼 for mixed and 0 for pure embeddings]
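Extending the JCUDiscriminator sketch above with the critic branch 𝐶; again, the pooling head is an illustrative choice, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MTDiscriminator(JCUDiscriminator):
    """JCU discriminator plus a critic branch C(.) that regresses alpha."""
    def __init__(self, mel_dim: int = 80, spk_dim: int = 256, hid: int = 256):
        super().__init__(mel_dim, spk_dim, hid)
        self.critic = nn.Sequential(
            nn.Conv1d(hid, hid, 3, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hid, 1),
        )

    def forward(self, mel: torch.Tensor, spk_emb: torch.Tensor):
        h = self.d_s(mel)                         # shared trunk
        u = self.d_u(h)
        s = spk_emb.unsqueeze(-1).expand(-1, -1, h.size(-1))
        c = self.d_c(torch.cat([h, s], dim=1))
        a = self.critic(h)                        # predicted mixing coeff. alpha
        return u, c, a
```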
Proposed Algorithm: MT Discriminator Update
[Figure: MT discriminator update — disc. loss on 𝐷U/𝐷C outputs (0: synth., 1: nat.) and critic loss on 𝐶(⋅) output (𝛼: mixed, 0: pure)]
➢ Objective: Discriminating synth./nat. & mixed/pure
– Synth. speech samples are generated from mixed / pure spkr. emb.
• Coefficient: 𝛼 ~ 𝑈(0.0, 0.5); spkr. pairs shuffled within each mini-batch
– Criterion for critic training: MSE betw. predicted / correct 𝛼 (see the sketch below)
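A sketch of the mixing step and the critic criterion described above; torch.randperm stands in for the in-batch speaker shuffling, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mix_speaker_embeddings(spk_emb: torch.Tensor):
    """Mix each spkr. emb. with another one drawn from the same mini-batch."""
    alpha = 0.5 * torch.rand(spk_emb.size(0), 1)        # alpha ~ U(0.0, 0.5)
    partner = spk_emb[torch.randperm(spk_emb.size(0))]  # shuffled spkr. pairs
    mixed = (1 - alpha) * spk_emb + alpha * partner
    return mixed, alpha

def critic_loss(mt_disc, mel_mixed, emb_mixed, mel_pure, emb_pure, alpha):
    """MSE betw. predicted and correct alpha: alpha for mixed, 0 for pure."""
    *_, a_mix = mt_disc(mel_mixed.detach(), emb_mixed)
    *_, a_pure = mt_disc(mel_pure.detach(), emb_pure)
    return (F.mse_loss(a_mix, alpha) +
            F.mse_loss(a_pure, torch.zeros_like(a_pure)))
```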
Proposed Algorithm: TTS Model Update
➢ Objective: GANSpeech objective + ACAI loss
– ACAI loss makes the critic output 0 for synth. speech of mixed spkrs.
• Regularization of TTS for (pseudo) unseen spkrs.
– Computation time for inference does not change from GANSpeech
[Figure: TTS model update — speech reconst. loss + adv. loss (𝐷U/𝐷C target 1: nat.) + ACAI loss (critic target 0: mixed); the combined objective is sketched below]
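Putting the pieces together, a sketch of the combined TTS-model objective under the assumptions above; adv_w and acai_w are illustrative loss weights, not values from the paper.

```python
import torch
import torch.nn.functional as F

def tts_update_loss(mt_disc, mel_syn, emb_mixed, recon_loss,
                    adv_w: float = 1.0, acai_w: float = 1.0):
    """GANSpeech objective + ACAI loss (critic -> 0 for mixed-spkr. synth.)."""
    u, c, a = mt_disc(mel_syn, emb_mixed)
    adv = (F.mse_loss(u, torch.ones_like(u)) +    # deceive D_U and D_C
           F.mse_loss(c, torch.ones_like(c)))
    acai = a.pow(2).mean()                        # push predicted alpha to 0
    return recon_loss + adv_w * adv + acai_w * acai
```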
➢ Experimental Evaluations
Experimental Conditions
Corpus (speaker encoder): CSJ [Maekawa03] (947 males & 470 females, 660 h)
Corpus (TTS): "parallel100" subset of JVS [Takamichi+20] (49 males & 51 females, 22 h, 100 sent./spkr.)
Feature dimensions: Mel-spectrogram: 80; Spkr. emb.: 256
Data split: Train/Validation/Test = 0.8/0.1/0.1; Seen spkrs.: 96, Unseen spkrs.: 4 (2 males & 2 females)
Vocoder (for 22,050 Hz): "generator_universal_model" of HiFi-GAN [Kong+20] (included in ming024's GitHub repository)
Compared methods: FS2 = Multi-spkr. FastSpeech 2 [Ren+21]; GAN = GANSpeech [Yang+21]; Ours = Multi-task adv. training
Subjective Evaluation & Results
➢ Criterion: quality of synth. speech (Mean Opinion Score tests)
– Naturalness (MOS) & spkr. similarity (Degradation MOS)
➢ Results w/ 95% confidence intervals (50 listeners/test, 15 samples/listener)

                  TTS for seen spkrs.        Voice cloning
                  Naturalness  Similarity    Naturalness  Similarity
FS2               3.13±0.12    3.57±0.12     3.13±0.12    2.38±0.12
GAN               3.52±0.12    3.79±0.12     3.38±0.12    2.40±0.12
Ours              3.55±0.12    3.87±0.12     3.50±0.12    2.48±0.12

➢ Findings
– GAN-based methods significantly improve quality of TTS for seen spkrs.
– Our MT algorithm reduces the quality degradation in voice cloning (TTS for unseen spkrs.)
– A large gap in spkr. similarity still remains betw. TTS for seen spkrs. & voice cloning
Speech Samples (Voice Cloning)
[Audio samples: Ground-truth / FS2 / GAN / Ours for jvs078 (male), jvs005 (male), jvs060 (female), jvs010 (female)]
Other samples are available online!
Summary
➢ Purpose
– Improving performance of multi-spkr. neural TTS for voice cloning
➢ Proposed method
– Multi-task adversarial training (GANSpeech + ACAI regularization)
➢ Results of our method
– 1) improves naturalness & spkr. similarity over GANSpeech
– 2) still leaves room for improvement in spkr. similarity for voice cloning
➢ Future work
– Introducing a sophisticated speaker generation framework [Stanton+22]
– Extending our method to multi-lingual TTS
Thank you for your attention!