Poster SCGlowTTS Interspeech 2021

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Junior,
Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti
1. Introduction
1.1 Motivation
– Recently, normalizing flows have been successfully applied in the TTS field: the flow-based models FlowTron (Valle et
al., 2020) and Glow-TTS (Kim et al., 2020) achieved state-of-the-art results. Despite this, current zero-shot multi-speaker
TTS models are still heavily based on the Tacotron 2 model.
1.2 Highlights
– As far as we know, this is the first work to explore flow-based models in a zero-shot multi-speaker TTS scenario.
– We show that fine-tuning a GAN-based vocoder on the Mel-spectrograms predicted by the TTS model for the training
speakers can significantly improve speech similarity and quality for new speakers.
– Our approach achieves promising results using only 11 speakers for training.
2. Methodology: Proposed Method and Dataset
2.1 Speaker Encoder
– Stack of 3 LSTM layers with a linear output layer.
– Trained using the Angular Prototypical loss function with approximately 25k speakers.
– Training datasets: LibriSpeech, VoxCeleb V1 and V2, the English version of Common Voice, and VCTK.
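The Angular Prototypical loss named above can be illustrated with a minimal sketch. It assumes the usual formulation: for each speaker in a batch, the last utterance is the query, the mean of the remaining utterances is the speaker centroid, and a softmax cross-entropy is taken over scaled cosine similarities between queries and centroids. The function names and the fixed scale `w` and bias `b` are illustrative assumptions (in practice they are typically learned), and this is not the Coqui TTS implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_prototypical_loss(embeddings, w=10.0, b=-5.0):
    """embeddings[j][i] is the embedding of utterance i of speaker j.

    w and b are assumed fixed here purely for illustration.
    """
    queries = [spk[-1] for spk in embeddings]
    centroids = []
    for spk in embeddings:
        support = spk[:-1]
        dim = len(support[0])
        centroids.append(
            [sum(u[d] for u in support) / len(support) for d in range(dim)]
        )

    total = 0.0
    for j, query in enumerate(queries):
        logits = [w * cosine(query, c) + b for c in centroids]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the query's own speaker as the target class.
        total += log_sum - logits[j]
    return total / len(queries)
```

The loss is small when each query lies closest to its own speaker's centroid and grows when queries fall nearer to other speakers' centroids.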
2.2 Vocoder: HiFi-GAN V2
− VCTK dataset for training and validation.
− Fine-tuning with Mel-spectrograms predicted by TTS models
(HiFi-GAN-FT).
2.3 SC-GlowTTS Model: Glow-TTS based
− Phonemes instead of graphemes as input.
− Explore 3 different encoders:
The original transformer based encoder;
Residual convolutional based;
Gated convolutional based.
− External speaker embeddings conditioned in:
Affine coupling layers in all decoder blocks;
Duration predictor input.
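To illustrate how a speaker embedding can condition an affine coupling layer, here is a minimal sketch in plain Python. It assumes the standard coupling form: one half of the input passes through unchanged, and the other half is scaled and shifted by values computed from the unchanged half together with the speaker embedding. `toy_conditioner` is a hypothetical stand-in for the real coupling network, chosen only so that the example runs without a deep learning library.

```python
import math

def toy_conditioner(xa, g):
    """Hypothetical stand-in for the coupling network.

    Maps the untouched half xa and the speaker embedding g to a
    per-channel (log_scale, shift) pair.
    """
    ctx = sum(xa) + sum(g)
    log_s = [math.tanh(0.1 * ctx + 0.01 * i) for i in range(len(xa))]
    shift = [0.5 * ctx - i for i in range(len(xa))]
    return log_s, shift

def coupling_forward(x, g):
    """Transform the second half of x (even length), conditioned on g."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, shift = toy_conditioner(xa, g)
    yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_s, shift)]
    log_det = sum(log_s)  # log-determinant of the Jacobian
    return xa + yb, log_det

def coupling_inverse(y, g):
    """Exactly undo coupling_forward for the same speaker embedding g."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, shift = toy_conditioner(ya, g)
    xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_s, shift)]
    return ya + xb
```

Because the first half is left untouched, the inverse can recompute the same scale and shift, so the layer stays invertible however complex the conditioning network is. Changing `g` changes the transformation, which is how speaker identity can steer the decoder.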
2.4 Dataset: VCTK
− Training: composed of 97 speakers.
− Development: composed of samples from the 97 training speakers.
− Test: composed of 11 speakers not present in the training set.
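A speaker-disjoint split of this kind can be sketched as follows. The poster does not say how the test speakers were chosen, so the random selection, the `dev_fraction` parameter, and the `(speaker_id, utterance_id)` data layout are all assumptions made for illustration.

```python
import random

def split_by_speaker(utterances, n_test_speakers, dev_fraction=0.1, seed=0):
    """Split (speaker_id, utterance_id) pairs into train, dev and test.

    Test speakers never appear in train or dev; dev is drawn from the
    utterances of the remaining (seen) speakers.
    """
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _ in utterances})
    test_speakers = set(rng.sample(speakers, n_test_speakers))

    test = [u for u in utterances if u[0] in test_speakers]
    seen = [u for u in utterances if u[0] not in test_speakers]

    rng.shuffle(seen)
    n_dev = int(len(seen) * dev_fraction)
    return seen[n_dev:], seen[:n_dev], test
```

Holding out whole speakers, rather than individual utterances, is what makes the test set a zero-shot evaluation.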
Figure: SC-GlowTTS architecture. Blocks shown: Input Text, Phonemizer, Encoder, Duration Predictor, Conv Projection, Speaker Embedding, Alignment Generation, Ceil, Flow-Based Decoder (Squeeze, ActNorm, Invertible 1x1 Conv, Affine Coupling Layer, UnSqueeze; x 12), Predicted Mel spectrogram, HiFi-GAN, Waveform.
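The architecture includes an alignment generation step with a ceiling operation. The sketch below shows one way such a step can work at inference time, assuming the Glow-TTS convention that the duration predictor outputs log-durations: each duration is exponentiated, rounded up to a whole number of frames, and the matching encoder output is repeated that many times. The function name and the `length_scale` parameter are illustrative.

```python
import math

def expand_by_duration(encoder_frames, log_durations, length_scale=1.0):
    """Repeat each encoder frame for ceil(exp(log_duration) * length_scale) steps."""
    durations = [
        math.ceil(math.exp(ld) * length_scale) for ld in log_durations
    ]
    expanded = []
    for frame, d in zip(encoder_frames, durations):
        expanded.extend([frame] * d)
    return expanded, durations
```

Rounding up means every input token receives at least one output frame, and increasing `length_scale` slows the synthesized speech.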
3. Experiments: Setup and Results
3.1 Proposed Experiments
1. Tacotron 2 baseline following Jia et al. (2018) and Cooper et al. (2020);
2. SC-GlowTTS with transformer based encoder;
3. SC-GlowTTS with residual convolutional based encoder;
4. SC-GlowTTS with gated convolutional based encoder.
3.2 Experiments Setup
– All experiments were implemented in Coqui TTS:
github.com/coqui-ai/TTS
– Coqui TTS is an open source TTS framework. Contributions are welcome.
– Audio samples and checkpoints of all experiments are available on:
github.com/Edresson/SC-GlowTTS
3.3 Results
Table 1. Real Time Factor, MOS and Sim-MOS with 95% confidence intervals and the SECS for all our experiments.
Experiment - Model                  Vocoder       RTF (CPU - GPU)    SECS       MOS              Sim-MOS
Ground Truth                        –             –                  0.9236     4.12 ± 0.06      4.127 ± 0.06
Attentron ZS (Choi et al., 2020)    WaveRNN       –                  (0.731)    (3.86 ± 0.05)    (3.30 ± 0.06)
1 - Tacotron 2                      HiFi-GAN      0.5782 - 0.2485    0.7589     3.57 ± 0.08      3.867 ± 0.08
1 - Tacotron 2                      HiFi-GAN-FT   -                  0.7791     3.74 ± 0.08      3.951 ± 0.07
2 - SC-GlowTTS-Trans                HiFi-GAN      0.3612 - 0.1557    0.7641     3.65 ± 0.07      3.905 ± 0.07
2 - SC-GlowTTS-Trans                HiFi-GAN-FT   -                  0.8046     3.78 ± 0.07      3.999 ± 0.07
3 - SC-GlowTTS-Res                  HiFi-GAN      0.3597 - 0.1545    0.7440     3.45 ± 0.09      3.828 ± 0.08
3 - SC-GlowTTS-Res                  HiFi-GAN-FT   -                  0.7969     3.70 ± 0.07      3.916 ± 0.07
4 - SC-GlowTTS-Gated                HiFi-GAN      0.3474 - 0.1437    0.7432     3.55 ± 0.08      3.852 ± 0.08
4 - SC-GlowTTS-Gated                HiFi-GAN-FT   -                  0.7849     3.82 ± 0.07      3.952 ± 0.07
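Two of the objective metrics in Table 1 can be sketched briefly. The poster does not spell them out, so this assumes that SECS is the cosine similarity between speaker-encoder embeddings of a reference utterance and a synthesized one, and that the Real Time Factor (RTF) is synthesis time divided by the duration of the generated audio. Function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def secs(reference_embedding, synthesized_embedding):
    """Speaker similarity score between two speaker-encoder embeddings."""
    return cosine_similarity(reference_embedding, synthesized_embedding)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Values below 1 mean synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds
```

Under these definitions a higher SECS indicates a closer match to the target speaker, and a lower RTF indicates faster synthesis.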
4. SC-GlowTTS performance with few speakers
– To emulate a scenario with few speakers, we selected 11 speakers from the training subset of the VCTK dataset.
– We first trained the SC-GlowTTS-Trans model on the single-speaker LJ Speech dataset, then continued training on this
11-speaker dataset and computed the metrics on the test set.
– The model achieved a similarity MOS of 3.93±0.08 and a MOS of 3.71±0.07. These results are comparable to those of
the Tacotron 2 baseline trained with 98 speakers, which achieved a similarity MOS of 3.95±0.07 and a MOS of 3.74±0.08.
– We believe that this is an important step forward, especially for zero-shot multi-speaker TTS in
low-resource languages.
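The ± values attached to the MOS and Sim-MOS scores are 95% confidence intervals. The poster does not state how they were computed; a common approach, sketched here as an assumption, uses the normal approximation: the half-width is 1.96 times the standard error of the mean rating.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean rating, half-width of the confidence interval).

    Uses the normal approximation with the sample standard deviation;
    z=1.96 corresponds to a 95% interval.
    """
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * standard_error
```

Two systems whose intervals overlap substantially, such as 3.93±0.08 and 3.95±0.07, cannot be confidently ranked from these scores alone.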
