ESPnet-TTS: Unified, Reproducible,
and Integratable Open Source
End-to-End Text-to-Speech Toolkit
Tomoki Hayashi (@kan-bayashi)1,2,
Ryuichi Yamamoto3, Katsuki Inoue4,
Takenori Yoshimura1,2, Shinji Watanabe5,
Tomoki Toda1, Kazuya Takeda1, Yu Zhang6, Xu Tan7
1Nagoya University, 2Human Dataware lab. Co., Ltd.,
3LINE Corp., 4Okayama University, 5Johns Hopkins University,
6Google AI, 7Microsoft Research
Background
- The era of end-to-end text-to-speech (E2E-TTS)
- Various advantages of E2E-TTS
  - Requires no language-dependent expert knowledge
  - Requires no alignment between text and speech
- More and more new research ideas
  - Style control / multi-speaker / multi-lingual / etc.
ICASSP 2020: "ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit"
[Figure: input text ("Hello, world!") is converted to speech by a Text2Mel neural network followed by a Mel2Wav neural network]
We definitely need to accelerate the research
and provide comparable baselines!
We introduce ESPnet-TTS,
a new open-source toolkit for E2E-TTS
What is ESPnet-TTS?
- Open-source E2E-TTS toolkit
  - Apache 2.0 license / PyTorch as the main network engine
- Developed for the research community
  - Easy to reproduce state-of-the-art models
  - Can be used as a baseline to check performance
ESPnet-TTS functions (an extension of ESPnet)
1. Support for various Text2Mel models
   - Includes autoregressive (AR), non-AR, and multi-speaker models
2. Support for various Mel2Wav models
   - Includes both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
   - 10+ recipes covering En, Jp, Zh, and more
   - Pretrained models provided for all recipes
   - Integratable with ASR functions
Supported Text2Mel models

[Figure: the Text2Mel part of the pipeline, which converts input text into a mel spectrogram]

- Tacotron 2 [Shen+, 2018] (autoregressive): CNN+BLSTM encoder, and an attention-based LSTM decoder with a prenet and postnet that predicts the next output frame by frame.
- Transformer-TTS [Li+, 2018] (autoregressive): Transformer encoder and decoder with positional encodings, encoder/decoder prenets, and a postnet.
- FastSpeech [Ren+, 2019] (non-autoregressive): Transformer encoder and decoder with positional encodings, plus a duration predictor and a length regulator that expands the encoder outputs to the output sequence length.
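FastSpeech's length regulator is simple enough to sketch: each encoder frame is repeated according to its predicted integer duration so the expanded sequence matches the mel-spectrogram length. A minimal NumPy illustration (the function name is ours, not ESPnet's actual API):

```python
import numpy as np

def length_regulator(encoder_outputs, durations):
    """Expand encoder outputs (T_in, D) by per-token integer durations so the
    result has sum(durations) frames, as in FastSpeech [Ren+, 2019]."""
    return np.repeat(encoder_outputs, durations, axis=0)

# Three encoder frames with durations 2, 1, 3 -> six output frames.
enc = np.arange(6.0).reshape(3, 2)
out = length_regulator(enc, np.array([2, 1, 3]))
print(out.shape)  # (6, 2)
```

Because the whole output length is known up front, the decoder can then generate all frames in parallel instead of one step at a time.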
Multi-speaker extension (1)

- Extension with a pretrained speaker embedding
  - Use x-vectors [Snyder+, 2018] trained on the VoxCeleb corpus

[Figure: Multi-speaker Tacotron 2 [Jia+, 2018] extends Tacotron 2 [Shen+, 2018] by adding or concatenating an x-vector, computed from reference audio by a pretrained extractor, to the encoder outputs]
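The "Add / Concat" step above can be sketched in a framework-agnostic way: the utterance-level x-vector is broadcast over time, then either summed with or concatenated to the encoder outputs. A NumPy stand-in (the function name is illustrative, not ESPnet's actual API):

```python
import numpy as np

def integrate_spk_embedding(encoder_outputs, spk_emb, method="concat"):
    """Combine encoder outputs (B, T, D) with speaker embeddings (B, S)."""
    B, T, _ = encoder_outputs.shape
    # Broadcast the utterance-level embedding over every encoder time step.
    expanded = np.broadcast_to(spk_emb[:, None, :], (B, T, spk_emb.shape[1]))
    if method == "concat":
        # (B, T, D + S); a projection back to D would typically follow.
        return np.concatenate([encoder_outputs, expanded], axis=-1)
    if method == "add":
        # Requires S == D so the embedding can be summed elementwise.
        return encoder_outputs + expanded
    raise ValueError(f"unknown method: {method}")
```

The slides apply the same integration point to the other Text2Mel models as well.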
Multi-speaker extension (2)

- Extension with a pretrained speaker embedding
  - Apply the same idea to the other models (experimental)

[Figure: Multi-speaker Transformer-TTS and multi-speaker FastSpeech, each with the pretrained x-vector extracted from reference audio added to or concatenated with the encoder outputs]
Supported Mel2Wav models

[Figure: the Mel2Wav part of the pipeline, which converts a mel spectrogram into a speech waveform]

- WaveNet [Oord+, 2016] (autoregressive): deep causal dilated CNN that samples the next waveform value from a posterior conditioned on the previous waveform and the upsampled mel spectrogram; both mixture-of-logistics (MoL) and softmax outputs are supported.
- Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive): deep dilated CNN that converts a noise sequence into a waveform, conditioned on the upsampled mel spectrogram.
- MelGAN [Kumar+, 2019] (non-autoregressive): upsampling deep CNN that converts the mel spectrogram directly into a waveform.
- Combinations of these GAN-based models are also supported.
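The "upsample network" in each vocoder brings the frame-rate mel spectrogram up to the audio sample rate so one conditioning vector is available per waveform sample. The simplest possible version, nearest-neighbor repetition by the hop size, can be sketched as follows (real vocoders typically use learned transposed convolutions; this is only an assumption-level illustration):

```python
import numpy as np

def upsample_conditioning(mel, hop_size):
    """Repeat each mel frame (T, n_mels) hop_size times,
    yielding (T * hop_size, n_mels) conditioning vectors."""
    return np.repeat(mel, hop_size, axis=0)

# 10 frames of an 80-dim mel spectrogram, hop size 256 samples.
mel = np.zeros((10, 80))
print(upsample_conditioning(mel, 256).shape)  # (2560, 80)
```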
Please check Ryuichi Yamamoto's Parallel WaveGAN [Yamamoto+, 2020] presentation at this ICASSP.
Other remarkable functions
- Dynamic batch size to maximize GPU utilization
  - Changes the batch size dynamically according to sequence length
- Gradient accumulation
  - Pseudo-increases the batch size even with a single GPU
- Guided attention loss [Tachibana+, 2017]
  - Constrains the attention weights to be diagonal
- Attention constraint decoding [Ping+, 2017]
  - Stable decoding for long input sentences
- Forward attention [Zhang+, 2018]
  - Attention mechanism with causal regularization
- CBHG [Wang+, 2017]
  - Upsamples the frequency resolution
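Guided attention loss penalizes attention mass far from the diagonal with the weight W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2g^2)) from [Tachibana+, 2017]. A small NumPy sketch (ESPnet's real implementation additionally handles batching and padding masks):

```python
import numpy as np

def guided_attention_weight(T_in, T_out, g=0.2):
    """Penalty matrix W (T_out, T_in): zero on the diagonal, near one far from it."""
    n = np.arange(T_in) / T_in    # normalized input positions
    t = np.arange(T_out) / T_out  # normalized output positions
    return 1.0 - np.exp(-((n[None, :] - t[:, None]) ** 2) / (2 * g * g))

def guided_attention_loss(att, g=0.2):
    """Mean of attention weights (T_out, T_in) multiplied by the penalty matrix."""
    W = guided_attention_weight(att.shape[1], att.shape[0], g)
    return float(np.mean(att * W))
```

A perfectly diagonal attention matrix incurs zero loss, while off-diagonal attention is penalized, which in practice speeds up attention learning for Tacotron-style models.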
ESPnet-TTS recipes
Unified, reproducible recipe
- All-in-one Kaldi-style recipes
  - Include all procedures needed to reproduce the results
  - Have a unified design for both ASR and TTS recipes
The same data format for ASR and TTS recipes
ASR and TTS recipes can be converted to each other
Supported recipes
- Supports 10+ recipes covering 10 languages

Corpus name | Lang | Recipe type
Arctic | En | Adaptation
Blizzard 2017 | En | Single
CSMSC | Zh | Single
JNAS | Jp | Multi
JVS | Jp | Adaptation
JSUT | Jp | Single
LibriTTS | En | Multi
LJSpeech | En | Single
M-AILABS | En, De, Fr, Es, Pl, Uk, Ru | Single
TWEB | En | Single
VAIS1000 | Vi | Single

We provide pretrained models for all recipes
Integration with ASR
- ASR-based evaluation for TTS
  - Automatically checks for deletions or repetitions of words
- Advanced recipes combining TTS with ASR
  - ASR-TTS cycle-consistency training [Karthick+, 2019]
  - Semi-supervised ASR-TTS training [Karita+, 2019]
  - Non-parallel voice conversion
    - Cascaded ASR + TTS system
    - VCC2020 baseline system (http://www.vc-challenge.org/)

We can combine TTS with ASR
for development and for new research ideas
※ Not merged yet
ESPnet-TTS performance
Experimental condition
- Evaluation with the LJSpeech dataset
  - #training 12,600 / #validation 250 / #evaluation 250
- Comparison methods (input type, [attention type])
  - Tacotron 2 (Char, Forward)
  - Tacotron 2 (Char, Location)
  - Transformer (Char)
  - Transformer (Phoneme)
  - FastSpeech (Char)※1
  - FastSpeech (Phoneme)※1
- Comparison with other toolkits
  - CSTR/Merlin: conventional TTS + WORLD [Morise+, 2016]
  - NVIDIA/tacotron2: pretrained※2 Tacotron 2 + WaveGlow
  - Mozilla/TTS: pretrained※2 Tacotron 2 + WaveRNN

The same MoL-WaveNet trained with natural features is used as the vocoder
※1 We did not use knowledge distillation
※2 Data split is different; the evaluation samples might be included in the training data
Objective evaluation (CER)
- Character error rate (CER)
  - ASR model: Transformer trained on LibriSpeech

Method | Sub [%] | Del [%] | Ins [%] | CER [%]
Tacotron 2 (Char, Forward) | 0.4 | 1.0 | 3.6※ | 5.0
Tacotron 2 (Char, Location) | 0.5 | 1.2 | 0.3 | 2.1
Transformer (Char) | 0.6 | 1.7 | 0.5 | 2.8
Transformer (Phoneme) | 0.5 | 1.8 | 0.5 | 2.8
FastSpeech (Char) | 0.3 | 0.9 | 0.3 | 1.6
FastSpeech (Phoneme) | 0.4 | 1.3 | 0.4 | 2.1
Groundtruth (Raw) | 0.3 | 0.7 | 0.3 | 1.3

※ Only one sample failed to stop generation
Tacotron 2 is more robust than Transformer-TTS
FastSpeech is the most robust, thanks to its non-AR architecture
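The ASR-based metric above is character-level edit distance: CER = (S + D + I) / N_ref. A self-contained sketch (ESPnet uses its own scoring tools; this is only an illustration of the computation):

```python
def char_error_rate(ref, hyp):
    """CER = (substitutions + deletions + insertions) / len(ref),
    computed with character-level Levenshtein distance."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all reference characters
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all hypothesis characters
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(char_error_rate("hello world", "hello word"))  # one deletion over 11 chars
```

Running an ASR model over the synthesized audio and scoring its transcript against the input text flags deletions and repetitions automatically.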
Objective evaluation (RTF)
- Real-time factor (RTF) of Char-based models
  - Speed is calculated for the Text2Mel part only
  - GPU: Titan V / CPU: Xeon Gold 6154 3 GHz x 16 threads

Method | RTF on CPU | RTF on GPU
Tacotron 2 (Forward) | 0.216 ± 0.016 | 0.104 ± 0.006
Tacotron 2 (Location) | 0.225 ± 0.016 | 0.094 ± 0.009
Transformer | 0.851 ± 0.076 | 0.634 ± 0.025
FastSpeech | 0.015 ± 0.005 | 0.003 ± 0.004
Tacotron 2 is faster than Transformer-TTS
FastSpeech is much faster than real time, thanks to its non-AR architecture
(For reference) RTF of non-AR Mel2Wav models:

Method | RTF on CPU | RTF on GPU
Parallel WaveGAN | 0.734 | 0.016
MelGAN | 0.137 | 0.002
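RTF in these tables is wall-clock synthesis time divided by the duration of the generated audio; values below 1.0 mean faster than real time. A minimal helper (names are ours, for illustration only):

```python
def real_time_factor(elapsed_sec, num_samples, sample_rate):
    """RTF = synthesis time / duration of the generated audio.
    RTF < 1.0 means the system runs faster than real time."""
    return elapsed_sec / (num_samples / sample_rate)

# Producing one second of 22,050 Hz audio in 0.5 s gives RTF 0.5.
print(real_time_factor(0.5, 22050, 22050))  # 0.5
```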
Subjective evaluation (MOS)
- Mean opinion score (MOS) on naturalness
  - #subjects = 101, via Amazon Mechanical Turk

Method | MOS (± 95% CI)
Tacotron 2 (Char, Forward) | 4.14 ± 0.06
Tacotron 2 (Char, Location) | 4.20 ± 0.06
Transformer (Char) | 4.17 ± 0.06
Transformer (Phoneme) | 4.25 ± 0.06
CSTR/Merlin | 2.69 ± 0.09
NVIDIA/tacotron2※ | 4.21 ± 0.06
Mozilla/TTS※ | 3.91 ± 0.07
Groundtruth (Raw) | 4.46 ± 0.05

Please check the samples via the QR code!
Tacotron 2 and Transformer-TTS have almost the same performance
Our best model achieves performance comparable to the state of the art
※ The evaluation samples might be included in the training data
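The ± values above are 95% confidence intervals of the mean. With n ratings they can be computed from the sample standard deviation under the normal approximation (z ≈ 1.96); a sketch:

```python
import math

def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) of a z-based confidence interval for MOS ratings."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # unbiased sample variance
    half_width = z * math.sqrt(var / n)  # standard error scaled by z
    return mean, half_width
```

With roughly a hundred raters per sample, as in this evaluation, intervals on the order of ±0.06 are plausible for 5-point ratings.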
Demonstration
- Demo notebooks on Google Colab
  1. E2E-TTS real-time demonstration: https://bit.ly/2Vex0Iw
     - Generate your favorite sentence in En, Jp, Zh!
  2. E2E-TTS recipe tutorial: https://bit.ly/3bhv0ow
     - Learn the TTS recipe flow online!
Closing
- Conclusion
  - Introduced the open-source toolkit ESPnet-TTS
    - Developed for the research community
    - Makes E2E-TTS more user-friendly
    - Accelerates research in this field
  - Provides various Text2Mel and Mel2Wav models
  - Provides reproducible recipes covering various languages
  - Achieved performance comparable to the state of the art

We always welcome
your feature requests and pull requests!

 
Recent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP ApproachesRecent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP Approaches
 


ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

  • 1. ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Tomoki Hayashi (@kan-bayashi)1,2, Ryuichi Yamamoto3, Katsuki Inoue4, Takenori Yoshimura1,2, Shinji Watanabe5, Tomoki Toda1, Kazuya Takeda1, Yu Zhang6, Xu Tan7
1Nagoya University, 2Human Dataware lab. Co., Ltd., 3LINE Corp., 4Okayama University, 5Johns Hopkins University, 6Google AI, 7Microsoft Research
  • 2. Background
p The era of End-to-End Text-to-Speech (E2E-TTS)
p Various advantages of E2E-TTS
n Requires no language-dependent expert knowledge
n Requires no alignment between text and speech
p More and more new research ideas
n Style control / Multi-speaker / Multi-lingual / etc.
[Figure: text ("Hello, world!") → Text2Mel → Mel2Wav neural network → speech]
We definitely need to accelerate the research and prepare a comparable baseline!
  • 3. Background
[Same slide as above, with the takeaway:]
We introduce ESPnet-TTS, a new open-source toolkit for E2E-TTS
  • 4. What is ESPnet-TTS?
p Open-source E2E-TTS toolkit
n Apache 2.0 license / PyTorch as the main network engine
p Developed for the researcher community
n Easy to reproduce state-of-the-art models
n Can be used as a baseline to check performance
1. Support of various Text2Mel models
n Includes autoregressive (AR), non-AR, and multi-speaker models
2. Support of various Mel2Wav models
n Includes both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
n Support 10+ recipes including En, Jp, Zh, and more
n Provide pretrained models for all recipes
n Integratable with ASR functions (extension of ESPnet)
  • 5. ESPnet-TTS functions
  • 6. Supported Text2Mel models
[Figure: text ("Hello, world!") → Text2Mel → Mel2Wav neural network → speech]
  • 7. Supported Text2Mel models
[Figure: architectures of the supported Text2Mel models]
n Tacotron 2 [Shen+, 2018] (autoregressive): CNN+BLSTM encoder, attention, and LSTM decoder with prenet and postnet
n Transformer-TTS [Li+, 2018] (autoregressive): Transformer encoder and decoder with positional encodings, encoder/decoder prenets, and postnet
n FastSpeech [Ren+, 2019] (non-autoregressive): Transformer encoder and decoder with a duration predictor and a length regulator
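FastSpeech's length regulator, listed above, is what makes the model non-autoregressive: it upsamples the encoder outputs to frame level by repeating each token representation according to its predicted duration. A minimal sketch of the idea in plain Python (ESPnet's actual implementation operates on batched PyTorch tensors; the function name here is illustrative):

```python
def length_regulator(encoder_out, durations):
    """Expand encoder outputs to frame level by repeating each
    token representation d times (the FastSpeech length regulator)."""
    assert len(encoder_out) == len(durations)
    expanded = []
    for frame, d in zip(encoder_out, durations):
        expanded.extend([frame] * d)  # repeat the frame d times
    return expanded

# e.g. 3 input tokens with durations [2, 1, 3] -> 6 output frames
frames = length_regulator(["a", "b", "c"], [2, 1, 3])
# frames == ["a", "a", "b", "c", "c", "c"]
```

Because the total output length is known up front (the sum of durations), all mel frames can be generated in parallel instead of one step at a time.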
  • 8. Multi-speaker extension (1)
p Extension with a pretrained speaker embedding
n Use X-vectors [Snyder+, 2018] trained on the VoxCeleb corpus
[Figure: Multi-speaker Tacotron 2 [Jia+, 2018] — a pretrained X-vector extractor embeds a reference audio, and the embedding is added to or concatenated with the encoder outputs]
  • 9. Multi-speaker extension (2)
p Extension with a pretrained speaker embedding
n Apply the same idea to the other models
[Figure: Multi-speaker Transformer-TTS and multi-speaker FastSpeech (experimental), each conditioned on a pretrained X-vector extracted from a reference audio]
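The add/concat conditioning shown in these slides can be sketched as follows. This is an illustrative simplification (the function name and list-based representation are assumptions; in practice the embedding is typically projected to the encoder dimension before being added):

```python
def integrate_spk_embedding(encoder_out, spk_emb, mode="add"):
    """Condition encoder outputs on a single speaker embedding
    (e.g. an X-vector) by element-wise addition or concatenation."""
    if mode == "add":
        # requires matching dimensions; toolkits usually project first
        return [[h + s for h, s in zip(frame, spk_emb)]
                for frame in encoder_out]
    elif mode == "concat":
        # append the speaker embedding to every encoder frame
        return [list(frame) + list(spk_emb) for frame in encoder_out]
    raise ValueError(f"unknown mode: {mode}")
```

Because the extractor is pretrained and frozen, the same Text2Mel training procedure works unchanged; only the encoder output is modified.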
  • 10. Supported Mel2Wav models
[Figure: text ("Hello, world!") → Text2Mel → Mel2Wav neural network → speech]
  • 11. Supported Mel2Wav models
[Figure: architectures of the supported Mel2Wav models]
n WaveNet [Oord+, 2016] (autoregressive): deep causal dilated CNN with an upsampling network; supports both Mixture of Logistics (MoL) and softmax outputs
n Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive): deep dilated CNN that transforms a noise sequence, with an upsampling network
n MelGAN [Kumar+, 2019] (non-autoregressive): upsampling deep CNN
n Combinations of these GAN-based models are also supported
  • 12. Supported Mel2Wav models
[Same slide as above, highlighting Parallel WaveGAN [Yamamoto+, 2020]]
Please check Ryuichi's presentation at this ICASSP.
  • 13. Other remarkable functions
p Dynamic batch size to maximize GPU utilization
n Changes the batch size dynamically according to sequence length
p Gradient accumulation
n Pseudo-increases the batch size even with a single GPU
p Guided attention loss [Tachibana+, 2017]
n Constrains the attention weights to be diagonal
p Attention constraint decoding [Ping+, 2017]
n Stable decoding even for long input sentences
p Forward attention [Zhang+, 2018]
n Attention mechanism with causal regularization
p CBHG [Wang+, 2017]
n Upsamples the frequency resolution
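For guided attention loss [Tachibana+, 2017], the attention matrix A (T decoder steps × N encoder steps) is penalized element-wise by weights that grow with distance from the diagonal, W(t, n) = 1 − exp(−(n/N − t/T)² / (2g²)); the loss is the mean of W ⊙ A. A minimal sketch (the width g = 0.4 is an assumed default, and the list-based representation is for illustration only):

```python
import math

def guided_attention_matrix(T, N, g=0.4):
    """Penalty weights: ~0 near the diagonal, approaching 1 far from it."""
    return [[1.0 - math.exp(-((n / N - t / T) ** 2) / (2.0 * g * g))
             for n in range(N)] for t in range(T)]

def guided_attention_loss(att, g=0.4):
    """Mean of W * A over all (t, n); small when attention is diagonal."""
    T, N = len(att), len(att[0])
    W = guided_attention_matrix(T, N, g)
    return sum(W[t][n] * att[t][n]
               for t in range(T) for n in range(N)) / (T * N)
```

Adding this term to the training objective pushes the attention toward the monotonic, near-diagonal alignments expected between text and speech, which speeds up attention learning considerably.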
  • 14. ESPnet-TTS recipes
  • 15. Unified, reproducible recipe
p All-in-one Kaldi-style recipe
n Includes all procedures needed to reproduce the results
n Has a unified design for both ASR and TTS recipes
The same data format is used for ASR and TTS recipes
  • 16. Unified, reproducible recipe
[Same slide as above, with the takeaway:]
ASR and TTS recipes can be converted to each other
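The shared data format is the Kaldi data directory: a few plain-text maps such as `text`, `wav.scp`, and `utt2spk`, identical for ASR and TTS recipes (which is what makes the recipes interconvertible). A toy sketch of writing one; the helper function, directory name, and utterance IDs are made up for illustration:

```python
import os

def write_kaldi_data_dir(path, utts):
    """Write a minimal Kaldi-style data directory.
    utts: list of (utt_id, spk_id, wav_path, transcript)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "text"), "w") as t, \
         open(os.path.join(path, "wav.scp"), "w") as w, \
         open(os.path.join(path, "utt2spk"), "w") as u:
        for utt_id, spk_id, wav, trans in sorted(utts):
            t.write(f"{utt_id} {trans}\n")   # utt_id -> transcript
            w.write(f"{utt_id} {wav}\n")     # utt_id -> audio path
            u.write(f"{utt_id} {spk_id}\n")  # utt_id -> speaker

write_kaldi_data_dir("data/train_demo",
                     [("utt1", "spk1", "wavs/utt1.wav", "hello world")])
```

For TTS the `text` file serves as the model input and the audio as the target, i.e. the reverse of the ASR direction, with no change to the on-disk format.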
  • 17. Supported recipes
p Support 10+ recipes covering 10 languages

Corpus name   | Lang                       | Recipe type
Arctic        | En                         | Adaptation
Blizzard 2017 | En                         | Single
CSMSC         | Zh                         | Single
JNAS          | Jp                         | Multi
JVS           | Jp                         | Adaptation
JSUT          | Jp                         | Single
LibriTTS      | En                         | Multi
LJSpeech      | En                         | Single
M-AILABS      | En, De, Fr, Es, Pl, Uk, Ru | Single
TWEB          | En                         | Single
VAIS1000      | Vi                         | Single

We provide pretrained models for all recipes
  • 18. Integration with ASR
p ASR-based evaluation for TTS
n Automatically checks for deleted or repeated words
p Advanced recipes combining TTS with ASR
n ASR-TTS cycle-consistency training [Karthick+, 2019]
n Semi-supervised ASR-TTS training [Karita+, 2019]
n Non-parallel voice conversion (not merged yet)
l Cascaded ASR + TTS system
l VCC2020 baseline system (http://www.vc-challenge.org/)
We can combine TTS with ASR for development and new research ideas
  • 19. ESPnet-TTS performance
  • 20. Experimental condition
p Evaluation on the LJSpeech dataset
n #Training 12,600 / #validation 250 / #evaluation 250
p Compared models (input type, [attention type])
n Tacotron 2 (Char, Forward) / Tacotron 2 (Char, Location)
n Transformer (Char) / Transformer (Phoneme)
n FastSpeech (Char)※1 / FastSpeech (Phoneme)※1
p Compared toolkits
n CSTR/Merlin: conventional TTS + WORLD [Morise+, 2016]
n NVIDIA/tacotron2: pretrained※2 Tacotron 2 + WaveGlow
n Mozilla/TTS: pretrained※2 Tacotron 2 + WaveRNN
The same MoL-WaveNet trained with natural features is used
※1 We did not use knowledge distillation
※2 Data split is different; the evaluation samples might be included in the training data
  • 21. Objective evaluation (CER)
p Character error rate (CER)
n ASR model: Transformer trained on LibriSpeech

Method                      | Sub [%] | Del [%] | Ins [%] | CER [%]
Tacotron 2 (Char, Forward)  | 0.4     | 1.0     | 3.6※    | 5.0
Tacotron 2 (Char, Location) | 0.5     | 1.2     | 0.3     | 2.1
Transformer (Char)          | 0.6     | 1.7     | 0.5     | 2.8
Transformer (Phoneme)       | 0.5     | 1.8     | 0.5     | 2.8
FastSpeech (Char)           | 0.3     | 0.9     | 0.3     | 1.6
FastSpeech (Phoneme)        | 0.4     | 1.3     | 0.4     | 2.1
Groundtruth (Raw)           | 0.3     | 0.7     | 0.3     | 1.3

※Only one sample failed to stop generation
  • 22. Objective evaluation (CER)
[Same CER table as above]
Tacotron 2 is more robust than Transformer-TTS
  • 23. Objective evaluation (CER)
[Same CER table as above]
FastSpeech is the most robust thanks to its non-AR architecture
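The CER in the table decomposes into substitutions, deletions, and insertions from a Levenshtein alignment of the ASR hypothesis against the reference text, with CER = (S + D + I) / N. A minimal character-level sketch of that decomposition (ESPnet's recipes use sclite-style word/character scoring; this is only illustrative):

```python
def cer_counts(ref, hyp):
    """Return (sub, del, ins) counts from a character-level
    Levenshtein alignment of hypothesis `hyp` against reference `ref`."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, inss) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)              # all deletions
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)              # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (c, s, d, n)          # match
            else:
                best = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, d + 1, n)  # deletion
            c, s, d, n = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, d, n + 1)  # insertion
            dp[i][j] = best
    _, s, d, n = dp[R][H]
    return s, d, n

sub, dele, ins = cer_counts("hello world", "helo worlds")
cer = (sub + dele + ins) / len("hello world")  # CER = (S + D + I) / N
```

High deletion counts flag dropped words, and high insertion counts flag repetitions or a failure to stop generation, which is exactly why ASR-based scoring works as an automatic robustness check for TTS.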
  • 24. Objective evaluation (RTF)
p Real-time factor (RTF) of Char-based models
n Measured for the Text2Mel part only
n GPU: Titan V / CPU: Xeon Gold 6154 3 GHz × 16 threads

Method                | RTF on CPU    | RTF on GPU
Tacotron 2 (Forward)  | 0.216 ± 0.016 | 0.104 ± 0.006
Tacotron 2 (Location) | 0.225 ± 0.016 | 0.094 ± 0.009
Transformer           | 0.851 ± 0.076 | 0.634 ± 0.025
FastSpeech            | 0.015 ± 0.005 | 0.003 ± 0.004
  • 25. Objective evaluation (RTF)
[Same RTF table as above]
Tacotron 2 is faster than Transformer-TTS
  • 26. Objective evaluation (RTF)
[Same RTF table as above]
FastSpeech is much faster than real time thanks to its non-AR architecture
  • 27. Objective evaluation (RTF)
[Same Text2Mel RTF table as above]
p (For reference) RTF of non-AR Mel2Wav models

Method           | RTF on CPU | RTF on GPU
Parallel WaveGAN | 0.734      | 0.016
MelGAN           | 0.137      | 0.002
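RTF here is wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A sketch of how such a measurement might look (the helper function and the toy "synthesizer" are illustrative, not the paper's setup):

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """RTF = wall-clock synthesis time / duration of generated audio."""
    start = time.perf_counter()
    waveform = synthesize(text)          # returns a list of samples
    elapsed = time.perf_counter() - start
    duration = len(waveform) / sample_rate
    return elapsed / duration

# toy "synthesizer" producing 1 second of silence at 22,050 Hz
rtf = real_time_factor(lambda t: [0.0] * 22050, "hello", 22050)
# rtf << 1 here, since generation is much faster than real time
```

In practice the measurement is averaged over many evaluation sentences (hence the ± values in the tables), and Text2Mel and Mel2Wav stages can be timed separately, as in the slides above.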
  • 28. Subjective evaluation (MOS)
p Mean opinion score (MOS) on naturalness
n #subjects = 101 @ Amazon Mechanical Turk

Method                      | MOS (± 95% CI)
Tacotron 2 (Char, Forward)  | 4.14 ± 0.06
Tacotron 2 (Char, Location) | 4.20 ± 0.06
Transformer (Char)          | 4.17 ± 0.06
Transformer (Phoneme)       | 4.25 ± 0.06
CSTR/Merlin                 | 2.69 ± 0.09
NVIDIA/tacotron2※           | 4.21 ± 0.06
Mozilla/TTS※                | 3.91 ± 0.07
Groundtruth (Raw)           | 4.46 ± 0.05

Please check the samples via the QR code!
  • 29. Subjective evaluation (MOS)
[Same MOS table as above]
Tacotron 2 and Transformer-TTS have almost the same performance
  • 30. Subjective evaluation (MOS)
[Same MOS table as above]
Our best model achieves performance comparable to the state of the art
※ The evaluation samples might be included in the training data
  • 31. Subjective evaluation (MOS)
[Same MOS table as above]
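The ± values in the MOS table are 95% confidence intervals of the mean rating; under a normal approximation the half-width is 1.96·s/√n. The paper does not spell out its CI computation, so the following is a generic sketch with toy ratings, not the study's data:

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with an approximate 95% confidence interval
    (normal approximation: half-width = z * sample stdev / sqrt(n))."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

mean, ci = mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5])
# report as e.g. "4.25 ± 0.49"
```

With 101 raters and 5-point ratings, the √n term shrinks the interval to roughly ±0.06, which is why the differences between the top systems in the table are hard to separate.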
  • 32. Demonstration
p Demo notebooks on Google Colab
1. E2E-TTS real-time demonstration: https://bit.ly/2Vex0Iw
n You can generate your favorite sentence in En, Jp, Zh!
2. E2E-TTS recipe tutorial: https://bit.ly/3bhv0ow
n You can learn the TTS recipe flow online!
  • 33. Closing
p Conclusion
n Introduced the open-source toolkit ESPnet-TTS
l Developed for the research community
l Makes E2E-TTS more user-friendly
l Accelerates the research in this field
n Provides various Text2Mel and Mel2Wav models
n Provides reproducible recipes covering various languages
n Achieved performance comparable to the SoTA
We always welcome your feature requests and pull requests!