Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-To-Speech
Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 WedAM1-8-2 (SS04)
Overview: Multi-Speaker Neural Text-To-Speech
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ Multi-speaker Neural TTS [Fan+15][Hojo+18]
– Single Deep Neural Network (DNN) to generate multi-speakers' voices
• Speaker embedding: conditional input to control speaker ID
➢ Voice cloning (e.g., [Arik+18])
– TTS of an unseen speaker's voice with a small amount of data
[Figure: Text + spkr. emb. → multi-spkr. neural TTS model → Speech]
Research Outline
➢ Conventional algorithm: GAN*-based training
– High-quality TTS by adversarial training of discriminator & generator
– Poor generalization performance in voice cloning
• TTS model cannot observe unseen speakers' voices in training...
➢ Proposed algorithm: Multi-task adversarial training
– Primary task: GAN-based multi-speaker neural TTS training
• Objective: feature reconstruction loss + adversarial loss
– Secondary task: improving (pseudo) unseen speaker's TTS quality
• Objective: loss to generate realistic voices of unseen speakers
➢ Results: High-quality voice cloning by our algorithm!
*GAN: Generative Adversarial Network [Goodfellow+14]
Baseline 1: Multi-Speaker FastSpeech 2 (FS2) [Ren+21]
➢ Transfer-learning-based multi-speaker neural TTS [Jia+18]
– 1. Pretrain spkr. encoder w/ spkr. verification task (e.g., GE2E* loss)
– 2. Train FS2-based TTS model w/ pretrained spkr. encoder
*GE2E: Generalized End-2-End [Wan+18]
[Figure: the spkr. encoder extracts the spkr. emb. from reference speech (fixed during TTS training); the variance adaptor adds variance information of speech]
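A minimal sketch of this two-step recipe in PyTorch-style code, using tiny hypothetical stand-ins (SpeakerEncoder, TinyFS2) for the GE2E-pretrained spkr. encoder and the FS2-based TTS model; the real architectures and training loops are far larger:

import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):                       # stand-in for the GE2E-pretrained encoder
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)
    def forward(self, mel):                            # mel: [B, T, 80] reference speech
        _, (h, _) = self.lstm(mel)
        return nn.functional.normalize(h[-1], dim=-1)  # [B, 256] spkr. emb.

class TinyFS2(nn.Module):                              # stand-in for multi-spkr. FastSpeech 2
    def __init__(self, n_phonemes=64, emb_dim=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.decoder = nn.Linear(emb_dim, n_mels)
    def forward(self, phonemes, spk_emb):              # phonemes: [B, L] phoneme IDs
        h = self.phoneme_emb(phonemes) + spk_emb.unsqueeze(1)   # condition every frame on spkr. emb.
        return self.decoder(h)                         # [B, L, 80] mel-spectrogram

spk_encoder = SpeakerEncoder().eval()                  # step 1: pretrained w/ GE2E loss, then frozen
tts_model = TinyFS2()                                  # step 2: trained while spk_encoder stays fixed
with torch.no_grad():
    spk_emb = spk_encoder(torch.randn(2, 120, 80))     # spkr. emb. extracted from reference speech
mel_pred = tts_model(torch.randint(0, 64, (2, 30)), spk_emb)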
Baseline 2: GANSpeech [Yang+21]
➢ Overview: TTS model (generator) vs. JCU* discriminator
– TTS model generates speech features from text & spkr. emb.
– JCU discriminator classifies synth. / nat. from two kinds of inputs
• Unconditional: w/o spkr. emb. & Conditional: w/ spkr. emb.
[Figure: the TTS model 𝐺(⋅) synthesizes speech from text & spkr. emb.; the JCU discriminator passes synth. or nat. speech through shared layers 𝐷S(⋅) into an unconditional head 𝐷U(⋅) (w/o spkr. emb.) and a conditional head 𝐷C(⋅) (w/ spkr. emb.), each outputting 0: Synth. / 1: Nat.]
*JCU: Joint Conditional & Unconditional [Zhang+18]
GANSpeech Algorithm: JCU Discriminator Update
➢ Objective: Discriminating synth. (0) / nat. (1) correctly
– 𝐷S extracts shared features of nat. / synth. speech
– 𝐷U learns general characteristics of speech
– 𝐷C captures spkr.-specific characteristics of speech
[Figure: same architecture as above; the disc. loss trains both heads of the JCU discriminator to output 0 for synth. and 1 for nat. speech]
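A minimal sketch of this discriminator update, assuming mel inputs of shape [B, T, 80], toy layer sizes, and a least-squares objective; the actual GANSpeech discriminator architecture and loss weighting differ:

import torch
import torch.nn as nn

class JCUDiscriminator(nn.Module):
    """Shared layers D_S feed an unconditional head D_U and a conditional head D_C."""
    def __init__(self, n_mels=80, emb_dim=256, hidden=128):
        super().__init__()
        self.d_s = nn.Sequential(nn.Linear(n_mels, hidden), nn.LeakyReLU(0.2))
        self.d_u = nn.Linear(hidden, 1)                       # unconditional: w/o spkr. emb.
        self.d_c = nn.Linear(hidden + emb_dim, 1)             # conditional: w/ spkr. emb.
    def forward(self, mel, spk_emb):                          # mel: [B, T, 80], spk_emb: [B, 256]
        h = self.d_s(mel)
        e = spk_emb.unsqueeze(1).expand(-1, h.size(1), -1)
        return self.d_u(h), self.d_c(torch.cat([h, e], dim=-1))

def discriminator_loss(disc, mel_nat, mel_synth, spk_emb):
    """Push both heads toward 1 for natural and 0 for synthetic speech."""
    mse = nn.functional.mse_loss
    u_nat, c_nat = disc(mel_nat, spk_emb)
    u_syn, c_syn = disc(mel_synth.detach(), spk_emb)          # detach: no gradient into the TTS model
    return (mse(u_nat, torch.ones_like(u_nat)) + mse(c_nat, torch.ones_like(c_nat)) +
            mse(u_syn, torch.zeros_like(u_syn)) + mse(c_syn, torch.zeros_like(c_syn)))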
GANSpeech Algorithm: TTS Model Update
➢ Objective: Generating speech & Deceiving JCU discriminator
– Speech reconst. loss makes TTS model generate speech features
• Phoneme duration, F0, energy, mel-spectrogram
– Adv. loss causes JCU discriminator to misclassify synth. as nat.
[Figure: the speech reconst. loss is computed on the TTS model's outputs, and the adv. loss asks 𝐷U(⋅) & 𝐷C(⋅) to label synth. speech as 1 (Nat.)]
➢ GAN = Distribution matching betw. nat. and synth. data
– High-quality TTS for seen spkrs. included in training corpus
– No guarantee to generalize TTS for unseen spkrs.
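A matching sketch of this TTS-model update, assuming the hypothetical JCUDiscriminator above and dict-valued predictions/targets; GANSpeech also scales the adversarial term and adds a feature-matching loss, which are omitted here:

import torch

def tts_model_loss(disc, pred, target, spk_emb, lambda_adv=1.0):
    """Speech reconst. loss over FS2 features + adv. loss asking both heads to answer 'natural' (1)."""
    mse = torch.nn.functional.mse_loss
    reconst = (mse(pred["mel"], target["mel"]) +
               mse(pred["duration"], target["duration"]) +
               mse(pred["f0"], target["f0"]) +
               mse(pred["energy"], target["energy"]))
    u_syn, c_syn = disc(pred["mel"], spk_emb)      # no detach: gradients flow back into the TTS model
    adv = mse(u_syn, torch.ones_like(u_syn)) + mse(c_syn, torch.ones_like(c_syn))
    return reconst + lambda_adv * adv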
➢ Proposed Method: Multi-Speaker Neural TTS based on Multi-Task Adversarial Training
Overview of Proposed Method
➢ Motivation: Diversifying spkr. variation during training to
– Widen spkr. emb. distribution that TTS model can cover
– Improve robustness of TTS model towards unseen speakers
➢ Idea: Adversarially Constrained Autoencoder Interpolation [Berthelot+19]
– Architecture: Autoencoder w/ feature interpolation + Critic
• Critic estimates 𝛼 from given input (𝛼 = 0 if input is pure data)
• Autoencoder makes critic output 𝛼 = 0 for interpolated data
[Figure: ACAI autoencoder with feature interpolation weighted by 𝛼]
We introduce this idea to GAN-based multi-speaker TTS
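A minimal sketch of the ACAI losses from [Berthelot+19], assuming vector latents and a critic that returns one scalar per example; the critic's extra regularizer on real data is omitted:

import torch

def acai_losses(encoder, decoder, critic, x1, x2):
    """Interpolate two latents with alpha, decode, and let a critic guess alpha from the output."""
    alpha = 0.5 * torch.rand(x1.size(0), 1)                      # alpha in [0, 0.5)
    z_mix = alpha * encoder(x1) + (1.0 - alpha) * encoder(x2)    # feature interpolation w/ alpha
    x_mix = decoder(z_mix)
    critic_loss = torch.nn.functional.mse_loss(critic(x_mix.detach()), alpha)  # critic predicts alpha
    ae_reg = critic(x_mix).pow(2).mean()                         # autoencoder wants critic to output 0
    return critic_loss, ae_reg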
Multi-Task Adversarial Training Algorithm
➢ Overview: GANSpeech + ACAI-derived regularization
– Encoder = spkr. encoder (fixed parameters during training)
– Decoder = TTS model (i.e., Generator in GAN)
– Critic 𝐶 = additional branch in Multi-Task (MT) discriminator
[Figure: the spkr. enc. & interp. produce a mixed or pure spkr. emb.; the TTS model 𝐺(⋅) synthesizes speech from text & spkr. emb.; the MT discriminator keeps the JCU branches 𝐷S(⋅), 𝐷U(⋅), 𝐷C(⋅) (0: Synth. / 1: Nat.) and adds a critic 𝐶(⋅) (𝛼: Mixed / 0: Pure)]
Proposed Algorithm: MT Discriminator Update
[Figure: the disc. loss updates 𝐷U(⋅) & 𝐷C(⋅) to output 0: Synth. / 1: Nat., and the critic loss updates 𝐶(⋅) to output 𝛼 for mixed and 0 for pure spkr. emb.]
➢ Objective: Discriminating synth./nat. & mixed/pure
– Synth. speech samples are generated from mixed / pure spkr. emb.
• Coefficient: 𝛼 ~ 𝑈(0.0, 0.5), spkr. pairs: shuffled w/n mini-batch
– Criterion for critic training: MSE betw. predicted / correct 𝛼
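A minimal sketch of these two ingredients (the interpolation convention, tensor shapes, and a per-utterance critic output are assumptions):

import torch

def mix_speaker_embeddings(spk_emb):
    """Create pseudo-unseen spkrs. by shuffling pairs w/n the mini-batch and interpolating embeddings."""
    alpha = 0.5 * torch.rand(spk_emb.size(0), 1)                 # alpha ~ U(0.0, 0.5)
    perm = torch.randperm(spk_emb.size(0))                       # spkr. pairs shuffled w/n mini-batch
    return (1.0 - alpha) * spk_emb + alpha * spk_emb[perm], alpha

def critic_loss(critic, mel_synth_mixed, mel_synth_pure, alpha):
    """MSE betw. predicted and correct alpha; the target is 0 for speech generated from pure spkr. emb."""
    mse = torch.nn.functional.mse_loss
    return (mse(critic(mel_synth_mixed.detach()), alpha) +
            mse(critic(mel_synth_pure.detach()), torch.zeros_like(alpha)))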
Proposed Algorithm: TTS Update
➢ Objective: GANSpeech objective + ACAI loss
– ACAI loss makes critic output 0 for synth. speech of mixed spkrs.
• Regularization on TTS for (pseudo) unseen spkrs.
– Computation time for inference does not change from GANSpeech
[Figure: the TTS model is updated with the speech reconst. loss, the adv. loss (𝐷U(⋅) & 𝐷C(⋅) should label synth. speech as 1: Nat.), and the ACAI loss (𝐶(⋅) should output 0 for mixed-spkr. speech)]
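A minimal sketch of how the ACAI term could sit on top of the GANSpeech objective; lambda_acai is a hypothetical weight:

def tts_update_loss(critic, mel_synth_mixed, ganspeech_loss, lambda_acai=1.0):
    """Proposed TTS objective: GANSpeech losses + ACAI loss pushing the critic toward 0 for mixed-spkr. speech."""
    acai_loss = critic(mel_synth_mixed).pow(2).mean()   # regularizes TTS for (pseudo) unseen spkrs.
    return ganspeech_loss + lambda_acai * acai_loss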
➢ Experimental Evaluations
Experimental Conditions
Corpus (speaker encoder): CSJ [Maekawa03] (947 males & 470 females, 660 h)
Corpus (TTS): "parallel100" subset of JVS [Takamichi+20] (49 males & 51 females, 22 h, 100 sent./spkr.)
Feature dimensions: Mel-spectrogram: 80, Spkr. emb.: 256
Data split: Train/Validation/Test = 0.8/0.1/0.1; Seen spkrs.: 96, Unseen spkrs.: 4 (2 males & 2 females)
Vocoder (for 22,050 Hz): "generator_universal_model" of HiFi-GAN [Kong+20] (included in ming024's GitHub repository)
Compared methods: FS2 = Multi-spkr. FastSpeech 2 [Ren+21]; GAN = GANSpeech [Yang+21]; Ours = Multi-task adv. training
Subjective Evaluation & Results
➢ Criterion: quality of synth. speech (Mean Opinion Score tests)
– Naturalness (MOS) & spkr. similarity (Degradation MOS)
➢ Results w/ 95% confidence intervals (50 listeners/test, 15 samples/listener)

                TTS for seen spkrs.            Voice cloning
                Naturalness   Similarity       Naturalness   Similarity
FS2             3.13±0.12     3.57±0.12        3.13±0.12     2.38±0.12
GAN             3.52±0.12     3.79±0.12        3.38±0.12     2.40±0.12
Ours            3.55±0.12     3.87±0.12        3.50±0.12     2.48±0.12
➢ Findings:
– GAN-based methods significantly improve the quality of TTS for seen spkrs.
– Our MT algorithm overcomes the quality degradation in voice cloning (TTS for unseen spkrs.)
– There is still a large gap in spkr. similarity betw. TTS for seen spkrs. & voice cloning
Speech Samples (Voice Cloning)
[Audio sample table: ground-truth, FS2, GAN, and Ours for jvs078 (male), jvs005 (male), jvs060 (female), jvs010 (female)]
Other samples are available online!
Summary
➢ Purpose
– Improving performance of multi-spkr. neural TTS for voice cloning
➢ Proposed method
– Multi-task adversarial training (GANSpeech + ACAI regularization)
➢ Results of our method
– 1) improves naturalness & spkr. similarity compared with GANSpeech
– 2) still has room for improvement in spkr. similarity
➢ Future work
– Introducing sophisticated speaker generation framework [Stanton+22]
– Extending our method to multi-lingual TTS
Thank you for your attention!
