SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Junior,
Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti
1. Introduction
1.1 Motivation
– Recently, normalizing flows have been successfully applied in the TTS field: the flow-based models FlowTron (Valle et
al., 2020) and Glow-TTS (Kim et al., 2020) achieved state-of-the-art results. Despite this, current zero-shot multi-speaker
TTS models are still heavily based on the Tacotron 2 model.
1.2 Highlights
– As far as we know, this is the first work to explore flow-based models in a zero-shot multi-speaker TTS scenario.
– We show that fine-tuning a GAN-based vocoder with the Mel-spectrograms predicted by the TTS model for the training
speakers can significantly improve speech similarity and quality for new speakers.
– Our approach achieves promising results using only 11 speakers for training.
2. Methodology: Proposed Method and Dataset
2.1 Speaker Encoder
– Stack of 3 LSTM layers with a linear output layer.
– Trained using the Angular Prototypical loss function with approximately 25k speakers.
– Training datasets: LibriSpeech, VoxCeleb V1 and V2, the English version of Common Voice, and VCTK.
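
The loss named above can be sketched in plain Python. This is a minimal illustration of the Angular Prototypical loss as it is usually described (per-speaker centroid from all but one utterance, the remaining utterance as query, softmax over scaled cosine similarities). The scale `w` and bias `b` are learnable in practice; the fixed values here are illustrative assumptions, and none of this is taken from the poster's training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch[j] holds M >= 2 utterance embeddings for speaker j.

    w and b are fixed illustrative constants here (assumption);
    in a real speaker encoder they would be trained.
    """
    # Query: last utterance of each speaker; centroid: mean of the others.
    queries = [utts[-1] for utts in batch]
    centroids = []
    for utts in batch:
        support = utts[:-1]
        dim = len(support[0])
        centroids.append(
            [sum(e[d] for e in support) / len(support) for d in range(dim)]
        )
    total = 0.0
    for j, query in enumerate(queries):
        logits = [w * cosine(query, c) + b for c in centroids]
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_norm - logits[j]  # cross-entropy for the true speaker
    return total / len(batch)
```

The loss is small when each query is closest to its own speaker's centroid and grows when queries fall nearer to another speaker's centroid.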
2.2 Vocoder: HiFi-GAN V2
− VCTK dataset for training and validation.
− Fine-tuning with Mel-spectrograms predicted by TTS models
(HiFi-GAN-FT).
2.3 SC-GlowTTS Model: Glow-TTS based
− Phonemes instead of graphemes as input.
− Explore 3 different encoders:
  ▪ The original transformer based encoder;
  ▪ Residual convolutional based;
  ▪ Gated convolutional based.
− External speaker embeddings conditioned in:
  ▪ Affine coupling layers in all decoder blocks;
  ▪ Duration predictor input.
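
As a rough illustration of speaker conditioning in an affine coupling layer, the sketch below keeps one half of the input unchanged and uses it, together with the speaker embedding, to scale and shift the other half. The `toy_conditioner` function is a hypothetical stand-in for the real coupling network; the sketch assumes an even-length input and only demonstrates that the transform is invertible for a given embedding.

```python
import math

def toy_conditioner(x_a, speaker_emb):
    """Hypothetical stand-in for the coupling network.

    Returns (log_scale, shift), one value per element of x_a.
    """
    ctx = sum(x_a) + sum(speaker_emb)
    log_s = [math.tanh(0.1 * ctx + 0.01 * i) for i in range(len(x_a))]
    shift = [0.5 * ctx - i for i in range(len(x_a))]
    return log_s, shift

def coupling_forward(x, speaker_emb):
    """Affine coupling: x_a passes through, x_b is scaled and shifted."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, shift = toy_conditioner(x_a, speaker_emb)
    y_b = [xb * math.exp(ls) + t for xb, ls, t in zip(x_b, log_s, shift)]
    log_det = sum(log_s)  # log-determinant of the Jacobian
    return x_a + y_b, log_det

def coupling_inverse(y, speaker_emb):
    """Exact inverse of coupling_forward for the same speaker embedding."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, shift = toy_conditioner(y_a, speaker_emb)
    x_b = [(yb - t) * math.exp(-ls) for yb, ls, t in zip(y_b, log_s, shift)]
    return y_a + x_b
```

Because the untouched half and the embedding are both available at inversion time, the same scale and shift can be recomputed, which is what makes the layer invertible while still depending on the speaker.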
2.4 Dataset: VCTK
− Training: composed of 97 speakers.
− Development: composed of samples from the 97 training speakers.
− Test: composed of 11 speakers not present in the training set.
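
A split of this kind (test speakers never seen in training, development samples drawn from the training speakers) could be sketched as follows. How the held-out speakers were actually chosen is not stated here, so the random selection, the `dev_fraction` value and the function name are assumptions for illustration only.

```python
import random

def split_by_speaker(utterances, n_test_speakers=11, dev_fraction=0.05, seed=0):
    """utterances: list of (speaker_id, utterance_id) pairs.

    Test speakers are held out entirely; dev samples come from the
    remaining (training) speakers. dev_fraction is a hypothetical value.
    """
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _ in utterances})
    test_speakers = set(rng.sample(speakers, n_test_speakers))
    train, dev, test = [], [], []
    for spk, utt in utterances:
        if spk in test_speakers:
            test.append((spk, utt))
        elif rng.random() < dev_fraction:
            dev.append((spk, utt))
        else:
            train.append((spk, utt))
    return train, dev, test
```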
Figure (architecture diagram; block labels only): Input Text, Phonemizer, Encoder, Duration Predictor, Conv Projection, Speaker Embedding, Alignment Generation, Ceil, Flow-Based Decoder (Squeeze, ActNorm, Invertible 1x1 Conv, Affine Coupling Layer, UnSqueeze; x 12), Predicted Mel spectrogram, HiFi-GAN, Waveform.
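
The Duration Predictor, Ceil and Alignment Generation blocks can be read as the following inference-time step: each phoneme-level encoder output is repeated for a whole number of Mel frames. The sketch assumes, as in Glow-TTS, that the predictor outputs log-durations; `length_scale` and the function name are illustrative.

```python
import math

def expand_by_duration(encoder_frames, log_durations, length_scale=1.0):
    """Repeat each encoder output ceil(exp(log_d) * length_scale) times.

    Assumes the duration predictor works in the log domain (assumption
    based on Glow-TTS); length_scale > 1 slows the speech down.
    """
    expanded = []
    for frame, log_d in zip(encoder_frames, log_durations):
        n_frames = math.ceil(math.exp(log_d) * length_scale)
        expanded.extend([frame] * n_frames)
    return expanded
```

For example, predicted durations of 1.2, 2.5 and 0.4 frames are rounded up to 2, 3 and 1 frames, so three phonemes expand to six decoder time steps.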
3. Experiments: Setup and Results
3.1 Proposed Experiments
1. Tacotron 2 baseline following Jia et al. (2018) and Cooper et al. (2020);
2. SC-GlowTTS with transformer based encoder;
3. SC-GlowTTS with residual convolutional based encoder;
4. SC-GlowTTS with gated convolutional based encoder.
3.2 Experiments Setup
– All experiments were implemented in Coqui TTS:
github.com/coqui-ai/TTS
– Coqui TTS is an open source TTS framework. Contributions are welcome.
– Audio samples and checkpoints of all experiments are available on:
github.com/Edresson/SC-GlowTTS
3.3 Results
Table 1. Real Time Factor (RTF), MOS and Sim-MOS with 95% confidence intervals, and the SECS for all our experiments.
Experiment - Model                 Vocoder       RTF (CPU - GPU)   SECS      MOS             Sim-MOS
Ground Truth                       –             –                 0.9236    4.12 ± 0.06     4.127 ± 0.06
Attentron ZS (Choi et al., 2020)   WaveRNN       –                 (0.731)   (3.86 ± 0.05)   (3.30 ± 0.06)
1 - Tacotron 2                     HiFi-GAN      0.5782 - 0.2485   0.7589    3.57 ± 0.08     3.867 ± 0.08
1 - Tacotron 2                     HiFi-GAN-FT   -                 0.7791    3.74 ± 0.08     3.951 ± 0.07
2 - SC-GlowTTS-Trans               HiFi-GAN      0.3612 - 0.1557   0.7641    3.65 ± 0.07     3.905 ± 0.07
2 - SC-GlowTTS-Trans               HiFi-GAN-FT   -                 0.8046    3.78 ± 0.07     3.999 ± 0.07
3 - SC-GlowTTS-Res                 HiFi-GAN      0.3597 - 0.1545   0.7440    3.45 ± 0.09     3.828 ± 0.08
3 - SC-GlowTTS-Res                 HiFi-GAN-FT   -                 0.7969    3.70 ± 0.07     3.916 ± 0.07
4 - SC-GlowTTS-Gated               HiFi-GAN      0.3474 - 0.1437   0.7432    3.55 ± 0.08     3.852 ± 0.08
4 - SC-GlowTTS-Gated               HiFi-GAN-FT   -                 0.7849    3.82 ± 0.07     3.952 ± 0.07
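
Two of the objective metrics in Table 1 are simple to compute. The sketch below assumes the usual definitions: SECS is the cosine similarity between speaker-encoder embeddings of a reference and a synthesized utterance, and RTF is synthesis time divided by the duration of the generated audio (values below 1 mean faster than real time). Function names are illustrative.

```python
import math

def secs(emb_ref, emb_syn):
    """Speaker Encoder Cosine Similarity between two embeddings, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(emb_ref, emb_syn))
    norm_ref = math.sqrt(sum(a * a for a in emb_ref))
    norm_syn = math.sqrt(sum(b * b for b in emb_syn))
    return dot / (norm_ref * norm_syn)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    return synthesis_seconds / audio_seconds
```

Under this definition, an RTF of 0.3612 corresponds to roughly 3.6 s of compute for 10 s of speech.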
4. SC-GlowTTS performance with few speakers
– To emulate a scenario with few speakers we selected 11 speakers from the training subset of the VCTK dataset.
– We first trained the SC-GlowTTS-Trans model on the single-speaker LJ Speech dataset, then continued training on this
dataset of 11 speakers and calculated the metrics on the test set.
– The model achieved a similarity MOS of 3.93±0.08 and a MOS of 3.71±0.07. These results are comparable to those achieved
by the Tacotron 2 baseline trained with 98 speakers which achieved a similarity MOS of 3.95±0.07 and a MOS of 3.74±0.08.
– We believe that this is an important step forward, especially for zero-shot multi-speaker TTS in
low-resource languages.
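
One informal way to read "comparable" in the figures above is to check whether the reported 95% confidence intervals overlap. The sketch below does only that, using the numbers quoted in this section; overlapping intervals are a rough indication, not a formal significance test.

```python
def interval(mean, half_width):
    """Turn 'mean ± half_width' into a (low, high) pair."""
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Values quoted above: few-speaker SC-GlowTTS-Trans vs. Tacotron 2 baseline.
sim_overlap = intervals_overlap(interval(3.93, 0.08), interval(3.95, 0.07))
mos_overlap = intervals_overlap(interval(3.71, 0.07), interval(3.74, 0.08))
```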

Poster SCGlowTTS Interspeech 2021
