Tutorial on end-to-end text-to-speech synthesis
Part 2 – Tacotron and related end-to-end systems
安田裕介 (Yusuke YASUDA)
National Institute of Informatics
Software suite at NII for end-to-end speech synthesis, Part 2
安田裕介 (Yusuke Yasuda)
National Institute of Informatics
~ A tutorial on Tacotron and WaveNet ~
About me
- Name: 安田裕介 (Yusuke Yasuda)
- Status:
- A PhD student at Sokendai / NII
- A programmer at work
- Former geologist
- Bachelor's and Master's degrees
- GitHub account: TanUkkii007
Table of contents
1. Text-to-speech architecture
2. End-to-end model
3. Example architectures
a. Char2Wav
b. Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation
TTS architecture: traditional pipeline
- Typical pipeline architecture
for statistical parametric
speech synthesis
- Consists of task-specific
models
- linguistic model
- alignment (duration) model
- acoustic model
- vocoder
TTS architecture: End-to-end model
- End-to-end model directly
converts text to waveform
- End-to-end model does not
require intermediate feature
extraction
- Pipeline models accumulate
errors across predicted
features
- End-to-end model’s internal
blocks are jointly optimized
[1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
End-to-end model: Encoder-Decoder with Attention
- Building blocks
- Encoder
- Attention mechanism
- Decoder
- Neural vocoder
Conventional end-to-end models may not include a waveform generator, but some
recent fully end-to-end models contain a neural waveform generator, e.g. ClariNet [1].
End-to-end model: Decoding with attention
mechanism
Time step 1.
- Assign alignment
probabilities to encoded
inputs
- Alignment scores can be
derived from attention
layer’s output and
encoded inputs
End-to-end model: Decoding with attention
mechanism
Time step 1.
- Calculate context vector
- Context vector is the sum
of encoded inputs
weighted by alignment
probabilities
End-to-end model: Decoding with attention
mechanism
Time step 1.
- Decoder layer predicts an
output from the context
vector
End-to-end model: Decoding with attention
mechanism
Time step 2.
- The previous output is
fed back to the next time step
- Assign alignment
probabilities to encoded
inputs
End-to-end model: Decoding with attention
mechanism
Time step 2.
- Calculate context vector
End-to-end model: Decoding with attention
mechanism
Time step 2.
- The decoder layer predicts an
output from the context
vector
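The decoding steps above can be sketched as a loop (a minimal NumPy illustration with dot-product scoring and a linear decoder, both hypothetical simplifications; real models use recurrent decoders and richer attention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(encoded, W_dec, n_steps):
    """Illustrative attention-based decoding loop."""
    d = encoded.shape[1]
    prev_output = np.zeros(d)            # "go" frame for time step 1
    outputs, alignments = [], []
    for _ in range(n_steps):
        # 1. Assign alignment probabilities to the encoded inputs.
        scores = encoded @ prev_output
        alignment = softmax(scores)
        # 2. Context vector: sum of encoded inputs weighted by the alignment.
        context = alignment @ encoded
        # 3. The decoder layer predicts an output from the context vector;
        #    the output is fed back at the next time step.
        prev_output = W_dec @ context
        outputs.append(prev_output)
        alignments.append(alignment)
    return np.stack(outputs), np.stack(alignments)
```

Each row of the returned alignment matrix is one time step's probability distribution over the encoded inputs, which is exactly what the alignment visualizations on the following slides plot.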
End-to-end model: Alignment visualization
(Figure: attention alignment, ground-truth mel spectrogram, and predicted mel spectrogram.)
Example: Char2Wav
Char2Wav is one of the earliest works on end-to-end TTS. Its simple sequence-to-
sequence architecture with attention provides a good proof of concept and a starting
line for end-to-end TTS. Char2Wav also uses advanced components such as a neural
vocoder and multi-speaker embedding.
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron
Courville, Yoshua Bengio:
Char2Wav: End-to-End Speech Synthesis. ICLR 2017
https://openreview.net/forum?id=B1VWyySKx
Char2Wav: Architecture
Encoder: bidirectional GRU RNN
Decoder: GRU RNN
Attention: GMM attention
Neural vocoder: SampleRNN
Figure is from Sotelo et al. (2017)
Char2Wav: Source and target choice
- Input: character or phoneme sequence
- Output: vocoder parameters
- Objective: mean square error or negative GMM log likelihood loss
Char2Wav: GMM attention
- Proposed by Graves
(2013)
- Location-based attention
- Alignment is based on input location
- Alignment does not depend on input content
[3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
Char2Wav: GMM attention
- Monotonic progress
- The mean value μ
always increases as time
progresses
- Robust to long sequence
prediction
(Figure: visualization of GMM attention alignment from Tacotron predicting vocoder parameters, r=2, at epoch 0 and epoch 130.)
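One step of Graves-style GMM attention can be sketched as follows (a simplified NumPy sketch with assumed shapes; in the real model the unnormalized mixture parameters are predicted from the decoder state):

```python
import numpy as np

def gmm_attention(kappa_prev, w_hat, beta_hat, kappa_hat, n_inputs):
    """One step of Graves (2013) GMM attention over input positions."""
    w = np.exp(w_hat)                        # mixture weights
    beta = np.exp(beta_hat)                  # inverse widths
    kappa = kappa_prev + np.exp(kappa_hat)   # means: exp(.) > 0, so kappa only
                                             # moves forward -> monotonic progress
    pos = np.arange(n_inputs)[:, None]       # input positions 0..T-1
    # Location-based weights: a mixture of Gaussians over positions,
    # independent of input content (unnormalized, as in Graves 2013).
    phi = (w * np.exp(-beta * (kappa - pos) ** 2)).sum(axis=1)
    return phi, kappa
```

Because the mean κ can only increase, the attention window marches left to right, which is why GMM attention is robust on long sequences.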
Char2Wav: Limitations
- Target features: Char2Wav uses vocoder parameters as the target. Vocoder
parameters are a challenging target for a sequence-to-sequence system.
- e.g. vocoder parameters for 10 s of speech require 2,000 iterations to predict
- Architecture: As the Tacotron paper demonstrated, a vanilla sequence-to-sequence
architecture gives poor alignment.
These limitations were solved by Tacotron.
Example: Tacotron
Tacotron is the most influential method among modern architectures because it
introduced several cutting-edge techniques.
Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis,
Rob Clark, Rif A. Saurous:
Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
https://arxiv.org/abs/1703.10135
Tacotron: Architecture
- CBHG encoder
- Encoder & decoder pre-net
- Reduction factor
- Post-net
- Additive attention
Figure is from Wang et al. (2017)
Tacotron: Source and target choice
- Input: character sequence
- Output: mel and linear spectrograms (50 ms frame length, 12.5 ms frame shift)
- About 2.5× shorter total sequence than that of vocoder parameters
- Objective: L1 loss
Tacotron: Decoder pre-net
- The most important architectural
advancement over vanilla seq2seq
- Two feed-forward layers, each with ReLU and dropout
- Crucial for alignment learning
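The pre-net structure might be sketched like this (layer sizes follow the paper's 256/128 units; the weight names and NumPy framing are illustrative, not the actual TensorFlow code):

```python
import numpy as np

def prenet(x, W1, b1, W2, b2, drop_rate=0.5, rng=None):
    """Tacotron-style pre-net: (FC + ReLU + dropout) applied twice.
    The dropout here is what makes attention alignment learnable."""
    rng = rng or np.random.default_rng()
    h = np.maximum(0.0, x @ W1 + b1)                           # FC-256 + ReLU
    h *= (rng.random(h.shape) > drop_rate) / (1 - drop_rate)   # inverted dropout
    h = np.maximum(0.0, h @ W2 + b2)                           # FC-128 + ReLU
    h *= (rng.random(h.shape) > drop_rate) / (1 - drop_rate)
    return h
```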
Tacotron: Reduction factor
- Predicts multiple frames at a time
- Reduction factor r: the number of
frames to predict at one step
- Large r →
- Fewer decoder iterations
- Easier alignment learning
- Less training time
- Less inference time
- Poorer quality
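The mechanics can be illustrated by frame grouping (a hypothetical NumPy sketch): with r frames per step, a T-frame spectrogram needs only T/r decoder iterations.

```python
import numpy as np

def group_frames(frames, r):
    """Group r consecutive frames so the decoder emits them in one step.

    frames: (n_frames, n_mels) -> returns (n_frames // r, r * n_mels)
    """
    n_frames, n_mels = frames.shape
    assert n_frames % r == 0, "pad the spectrogram to a multiple of r first"
    return frames.reshape(n_frames // r, r * n_mels)
```

For example, an 800-frame mel spectrogram with r=2 takes 400 decoder steps instead of 800; the inverse reshape recovers the individual frames at synthesis time.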
Tacotron: Additive attention
- Proposed by Bahdanau et al. (2014) [5]
- Content-based attention
- Distance between source and target is learned by an FFN
- No structural constraint
(Figure: alignment at epochs 0, 1, 3, 7, 10, 50, and 100.)
[5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
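Additive attention scores each encoder output against the decoder state through a small feed-forward network; a minimal NumPy sketch with assumed shapes:

```python
import numpy as np

def additive_attention(encoded, query, W_k, W_q, v):
    """score(i) = v^T tanh(W_k h_i + W_q s): purely content-based,
    with no structural constraint on where the alignment may move."""
    scores = np.tanh(encoded @ W_k + query @ W_q) @ v   # (T,)
    e = np.exp(scores - scores.max())
    alignment = e / e.sum()                             # softmax over inputs
    context = alignment @ encoded                       # weighted sum
    return context, alignment
```

The lack of any location constraint is what lets early training produce the fragmented alignments seen in the epoch plots above.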
Tacotron: Limitations
- Training time is too long: > 2M steps (mini-batches) to converge
- Waveform generation: Griffin-Lim is handy but hurts waveform quality.
- Stop condition: Tacotron predicts a fixed-length spectrogram, which is inefficient
at both training and inference time.
Example: Tacotron2
Tacotron2 is a remarkable method that achieved human-level quality of synthesized
speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for
high-quality waveform generation.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu:
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783
https://arxiv.org/abs/1712.05884
Tacotron2: Architecture
- CNN layers instead of CBHG at
encoder and post-net
- Stop token prediction
- Post-net improves predicted mel
spectrogram
- Location sensitive attention
- No reduction factor
Figure is from Shen et al. (2018)
[7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
Tacotron2: Trends of architecture changes
1. Larger model size
- character embedding: 256 → 512
- encoder: 256 → 512
- decoder pre-net:
(256, 128) → (256, 256)
- decoder: 256 → 1024
- attention: 256 → 128
- post-net: 256 → 512
2. More regularization:
- The encoder applies dropout at every
layer. The original Tacotron applies
dropout only in the encoder pre-net.
- Zoneout regularization for the encoder
bidirectional LSTM and decoder LSTM
- The post-net applies dropout at every layer
- L2 regularization
The CBHG module is replaced by CNN layers at the encoder and
post-net, probably because regularization is harder to apply to it.
A large model needs heavy regularization to prevent overfitting.
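Zoneout [7] regularizes an RNN by randomly keeping some hidden units at their previous value instead of updating them; a sketch (the function name and NumPy framing are illustrative):

```python
import numpy as np

def zoneout(h_prev, h_new, rate, rng, training=True):
    """Randomly preserve hidden activations from the previous time step."""
    if not training:
        # At test time, use the expected update instead of sampling.
        return rate * h_prev + (1.0 - rate) * h_new
    keep_prev = rng.random(h_new.shape) < rate   # units that "zone out"
    return np.where(keep_prev, h_prev, h_new)
```

Unlike dropout, zoned-out units carry their previous state forward rather than being zeroed, which suits recurrent dynamics.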
Tacotron2: Source and target choice
- Input: character sequence
- Output: mel spectrogram
- Objective: mean square error
Tacotron2: Waveform synthesis with WaveNet
- Upsampling rate: 200
- Mel vs. linear spectrogram: no difference
- Human-level quality was achieved by a WaveNet trained with ground-truth-
aligned predicted spectrograms
[8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
Tacotron2: Location sensitive attention
- Proposed by Chorowski et al.,
(2015)
- Utilizes both input content and
input location
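Location-sensitive attention extends additive attention with features computed from the previous alignment; a minimal single-filter NumPy sketch (real models convolve with a bank of filters, and all shapes here are assumptions):

```python
import numpy as np

def location_sensitive_attention(encoded, query, prev_align,
                                 kernel, W_k, W_q, w_f, v):
    """Score uses input content (encoded, query) AND input location
    (features convolved from the previous time step's alignment)."""
    loc = np.convolve(prev_align, kernel, mode="same")   # location features (T,)
    scores = np.tanh(encoded @ W_k + query @ W_q
                     + np.outer(loc, w_f)) @ v           # (T,)
    e = np.exp(scores - scores.max())
    alignment = e / e.sum()
    context = alignment @ encoded
    return context, alignment
```

Feeding back the previous alignment biases the model toward moving forward consistently, combining the strengths of content-based and location-based attention.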
Tacotron2: Limitations
- WaveNet training in Tacotron2:
● WaveNet is trained with ground-truth-aligned **predicted** mel spectrograms.
● Tacotron2's errors are corrected by WaveNet.
● However, this is unproductive: a separate WaveNet is needed for each different Tacotron2.
- WaveNet training in the normal condition:
● WaveNet is trained with ground-truth spectrograms.
● However, its MOS score is still inferior to that of human speech.
Our work: Tacotron for Japanese language
We extended Tacotron to the Japanese language.
● To model Japanese pitch accents, we introduced accentual type embedding
in addition to phoneme embedding.
● To handle long-term dependencies in accents, we use additional components.
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi:
Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent
language. CoRR abs/1810.11960 (2018)
https://arxiv.org/abs/1810.11960
Japanese Tacotron: Architecture
- Accentual type
embedding
- Forward attention
- Self-attention
- Dual source
attention
- Vocoder parameter
target
Japanese Tacotron: Source and target choice
- Source: phoneme and accentual type sequence
- Target:
- mel spectrogram
- vocoder parameters
- Objective:
- L1 loss (mel spectrogram, mel generalized cepstrum)
- softmax cross entropy loss (discretized F0)
[10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
Japanese Tacotron: Forward attention
- Proposed by Zhang et al. (2018) [10]
- Enforces left-to-right progress
- Fast alignment learning
(Figure: alignment at epochs 0, 1, 3, 7, 10, and 50.)
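Forward attention post-processes ordinary attention weights with a recursion that only lets probability mass stay in place or move one step to the right; a NumPy sketch of one step:

```python
import numpy as np

def forward_attention_step(alpha_prev, attn_weights):
    """alpha(i) ∝ (alpha_prev(i) + alpha_prev(i-1)) * y(i)  (Zhang et al., 2018)."""
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))  # mass arriving from the left
    alpha = (alpha_prev + shifted) * attn_weights
    return alpha / alpha.sum()
```

Positions the mass has not yet reached receive no probability and it can never jump backward, which is why alignment converges so quickly in the epoch plots above.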
Japanese Tacotron: Limitations
- Speech quality is still worse than that of a traditional pipeline system
- Limited input features
- Configuration is based on Tacotron, not Tacotron2
Summary

system | Char2Wav | Tacotron | VoiceLoop [11] | DeepVoice3 [12] | Tacotron2 | Transformer [13] | Japanese Tacotron
network type | RNN | RNN | memory buffer | CNN | RNN | self-attention | RNN
input | character/phoneme | character | phoneme | character/phoneme | character | phoneme | phoneme
seq2seq output | vocoder | mel | vocoder | mel | mel | mel | mel/vocoder
post-net output | - | linear | - | linear/vocoder | mel | mel | -
attention mechanism | GMM | additive | GMM | dot-product | location sensitive | dot-product | forward
waveform synthesis | SampleRNN | Griffin-Lim | WORLD | Griffin-Lim/WORLD/WaveNet | WaveNet | WaveNet | WaveNet
Japanese Tacotron: Implementation
URL: https://github.com/nii-yamagishilab/self-attention-tacotron
Framework: Tensorflow, Python
Supported datasets: LJSpeech, VCTK (more coming in the future)
Features:
- Tacotron, Tacotron2, Japanese Tacotron model
- Combination choices for encoder, decoder, attention mechanism
- Forced alignment
- Mel spectrogram, vocoder parameter for target
- Compatible with our WaveNet implementation
Japanese Tacotron: Experimental workflow
(Figure: experimental workflow covering training & validation metrics, validation results, and test results.)
Japanese Tacotron: Audio samples
(Audio samples: NAT, TACMNP, TACMAP, SATMAP, SATMNP, SATWAP.)
Bibliography
- [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-
Speech. CoRR abs/1807.07281 (2018)
- [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville,
Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
- [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850
(2013)
- [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis,
Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017:
4006-4010
- [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly
Learning to Align and Translate. CoRR abs/1409.0473 (2014)
Bibliography
- [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang,
Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis,
Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
ICASSP 2018: 4779-4783
- [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan
Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal:
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305
(2016)
- [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-
Based Models for Speech Recognition. NIPS 2015: 577-585
- [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron
text-to-speech synthesis systems with self-attention for pitch accent language. CoRR
abs/1810.11960 (2018)
Bibliography
- [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence
Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
- [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and
Synthesis via a Phonological Loop. ICLR 2018
- [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang,
Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence
Learning. ICLR 2018
- [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality
TTS with Transformer. CoRR abs/1809.08895 (2018)
Acknowledgement
This work was partially supported by JST CREST Grant Number JPMJCR18A6,
Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687,
18H04120, 18H04112, 18KT0051), Japan.
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
Yamagishi Laboratory, National Institute of Informatics, Japan
 

More from Yamagishi Laboratory, National Institute of Informatics, Japan (15)

DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
 
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
 
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
 
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
 
Generalization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction NetworksGeneralization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction Networks
 
Estimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasureEstimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasure
 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio SynthesisText-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
 
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
 
Neural Waveform Modeling
Neural Waveform Modeling Neural Waveform Modeling
Neural Waveform Modeling
 
Neural source-filter waveform model
Neural source-filter waveform modelNeural source-filter waveform model
Neural source-filter waveform model
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 

Recently uploaded

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 

Recently uploaded (20)

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tacotron and related end-to-end systems

  • 1. Tutorial on end-to-end text-to-speech synthesis Part 2 – Tacotron and related end-to-end systems 安田裕介 (Yusuke YASUDA) National Institute of Informatics 1
  • 2. Software from NII for end-to-end speech synthesis, Part 2 ~ A tutorial on Tacotron and WaveNet ~ 安田裕介 (Yusuke Yasuda), National Institute of Informatics 2
  • 3. About me - Name: 安田裕介 (Yusuke Yasuda) - Status: - A PhD student at Sokendai / NII - A programmer at work - Former geologist - bachelor's and master's degrees - GitHub account: TanUkkii007 3
  • 4. Table of contents 1. Text-to-speech architecture 2. End-to-end model 3. Example architectures a. Char2Wav b. Tacotron c. Tacotron2 4. Our work: Japanese Tacotron 5. Implementation 4
  • 5. TTS architecture: traditional pipeline - Typical pipeline architecture for statistical parametric speech synthesis - Consists of task-specific models - linguistic model - alignment (duration) model - acoustic model - vocoder 5
  • 6. TTS architecture: End-to-end model - End-to-end model directly converts text to waveform - End-to-end model does not require intermediate feature extraction - Pipeline models accumulate errors across predicted features - End-to-end model’s internal blocks are jointly optimized 6
  • 7. [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018) End-to-end model: Encoder-Decoder with Attention - Building blocks - Encoder - Attention mechanism - Decoder - Neural vocoder Conventional end-to-end models may not include a waveform generator, but some recent full end-to-end models contain a neural waveform generator, e.g. ClariNet [1]. 7
  • 8. End-to-end model: Decoding with attention mechanism 8 Time step 1. - Assign alignment probabilities to encoded inputs - Alignment scores can be derived from attention layer’s output and encoded inputs
  • 9. End-to-end model: Decoding with attention mechanism 9 Time step 1. - Calculate context vector - Context vector is the sum of encoded inputs weighted by alignment probabilities
  • 10. End-to-end model: Decoding with attention mechanism 10 Time step 1. - Decoder layer predicts an output from the context vector
  • 11. End-to-end model: Decoding with attention mechanism 11 Time step 2. - The previous output is fed back to the next time step - Assign alignment probabilities to encoded inputs
  • 12. End-to-end model: Decoding with attention mechanism 12 Time step 2. - Calculate context vector
  • 13. End-to-end model: Decoding with attention mechanism 13 Time step 2. - Decoder layer predicts an output from the context vector
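The per-step recipe described on slides 8–13 (score the encoded inputs, normalize into alignment probabilities, take the weighted sum as the context vector, predict from it) can be sketched in NumPy. The projection `W` and all vectors below are random stand-ins, not the actual model layers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(encoded, query, W):
    """One decoder time step of content-based attention (illustrative)."""
    # 1. score each encoded input against the decoder state
    scores = encoded @ (W @ query)      # (N,)
    # 2. normalize scores into alignment probabilities
    alignment = softmax(scores)         # (N,), sums to 1 over inputs
    # 3. context vector = alignment-weighted sum of encoded inputs
    context = alignment @ encoded       # (D,)
    return alignment, context

rng = np.random.default_rng(0)
encoded = rng.standard_normal((6, 4))   # 6 encoded input vectors, dim 4
query = rng.standard_normal(4)          # decoder state at this step
W = rng.standard_normal((4, 4))         # stand-in learned projection
alignment, context = attention_step(encoded, query, W)
```

At the next time step the decoder's prediction is fed back, a new `query` is formed, and the same three steps repeat.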
  • 14. End-to-end model: Alignment visualization alignment ground truth mel spectrogram predicted mel spectrogram 14
  • 15. Example: Char2Wav Char2Wav is one of the earliest works on end-to-end TTS. Its simple sequence-to-sequence architecture with attention serves as a good proof of concept and starting point for end-to-end TTS. Char2Wav also uses advanced blocks such as a neural vocoder and multi-speaker embeddings. Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 https://openreview.net/forum?id=B1VWyySKx 15
  • 16. Char2Wav: Architecture Encoder: bidirectional GRU RNN Decoder: GRU RNN Attention: GMM attention Neural vocoder: SampleRNN Figure is from Sotelo et al. (2017) 16
  • 17. Char2Wav: Source and target choice - Input: character or phoneme sequence - Output: vocoder parameters - Objective: mean square error or negative GMM log likelihood loss 17
  • 18. Char2Wav: GMM attention - Proposed by Graves (2013) - Location-based attention - Alignment is based on input location - Alignment does not depend on input content 18 [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) [3]
  • 19. Char2Wav: GMM attention - Monotonic progress - The mean value μ always increases as time progresses - Robust to long sequence prediction epoch 0 epoch 130 Visualization of GMM attention alignment from Tacotron predicting vocoder parameters (r=2). 19
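A minimal sketch of Graves-style GMM attention, showing why the progress is monotonic: each attention weight over input positions is a mixture of Gaussian windows, and the window means are updated by adding exp(·), which is always positive. All parameter values here are stand-ins for what the network would predict:

```python
import numpy as np

def gmm_attention_weights(kappa, beta, alpha, n_inputs):
    """Weight for input position u: a mixture of K Gaussian windows
    centred at kappa (illustrative, single batch)."""
    u = np.arange(n_inputs)[None, :]                            # (1, N) positions
    phi = alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)
    return phi.sum(axis=0)                                      # (N,) unnormalized weights

# Monotonic progress: the means only move forward, because each decoder
# step adds exp(delta) > 0 (delta is a network output in the real model;
# a fixed stand-in here).
kappa = np.zeros(2)
means_history = []
for step in range(3):
    delta = np.array([-0.5, -1.0])
    kappa = kappa + np.exp(delta)          # strictly increasing means
    means_history.append(kappa.copy())
weights = gmm_attention_weights(kappa, beta=np.ones(2), alpha=np.ones(2), n_inputs=10)
```

This structural constraint (no content term, forward-only means) is what makes GMM attention robust on long sequences.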
  • 20. Char2Wav: Limitations - Target features: Char2Wav uses vocoder parameters as a target. Vocoder parameters are a challenging target for a sequence-to-sequence system. - e.g. vocoder parameters for 10 s of speech require 2,000 iterations to predict - Architecture: As the Tacotron paper demonstrated, a vanilla sequence-to-sequence architecture gives poor alignment. These limitations were solved by Tacotron. 20
  • 21. Example: Tacotron Tacotron is the most influential method among modern architectures because of several cutting-edge techniques. Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 https://arxiv.org/abs/1703.10135 21
  • 22. Tacotron: Architecture - CBHG encoder - Encoder & decoder pre-net - Reduction factor - Post-net - Additive attention Figure is from Wang et al. (2017) 22
  • 23. Tacotron: Source and target choice - Input: character sequence - Output: mel and linear spectrogram (50ms frame length, 12.5ms frame shift) - x2.5 shorter total sequence than that of vocoder parameters - Objective: L1 loss 23
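The "x2.5 shorter" figure follows from the frame shifts. Assuming a 5 ms shift for vocoder parameters (consistent with the ~2,000 iterations for 10 s of speech mentioned on the Char2Wav limitations slide; the 5 ms value itself is not stated):

```python
# Sequence lengths for 10 s of speech at the two frame shifts.
speech_ms = 10 * 1000
mel_frames = speech_ms / 12.5      # 12.5 ms frame shift
vocoder_frames = speech_ms / 5.0   # assumed 5 ms shift
ratio = vocoder_frames / mel_frames
print(mel_frames, vocoder_frames, ratio)  # 800.0 2000.0 2.5
```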
  • 24. Tacotron: Decoder pre-net - The most important architectural advancement over vanilla seq2seq - 2 FFN + ReLU + dropout - Crucial for alignment learning 24
  • 25. Tacotron: Reduction factor - Predicts multiple frames at a time - Reduction factor r: the number of frames to predict at one step - Large r → - Fewer iterations - Easier alignment learning - Less training time - Less inference time - But poorer quality 25
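The effect of r on the decoder iteration count is simple arithmetic; taking 800 frames as an example (e.g. 10 s of speech at a 12.5 ms shift):

```python
import math

frames = 800  # total mel frames to generate
# number of decoder steps needed for each reduction factor r
steps = {r: math.ceil(frames / r) for r in (1, 2, 5)}
print(steps)  # {1: 800, 2: 400, 5: 160}
```

Fewer steps means fewer feedback loops for errors to accumulate in and less wall-clock time, at the cost of asking the decoder to predict more per step.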
  • 26. [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) - Proposed by Bahdanau et al. (2014) - Content-based attention - Distance between source and target is learned by an FFN - No structural constraint Tacotron: Additive attention epoch 0 epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 epoch 100 26 [5]
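A sketch of the Bahdanau-style additive score, e_i = vᵀ tanh(W s + V h_i): a small feed-forward network compares the decoder state s with each encoder output h_i. All weights below are random stand-ins. Note there is no location term, which is why the alignment is structurally unconstrained:

```python
import numpy as np

def additive_attention(encoded, state, W, V, v):
    """Additive (Bahdanau) attention, illustrative: e_i = v^T tanh(W s + V h_i)."""
    scores = np.tanh(state @ W.T + encoded @ V.T) @ v   # (N,) raw scores
    e = np.exp(scores - scores.max())
    return e / e.sum()                                  # softmax -> alignment

rng = np.random.default_rng(1)
N, D, A = 5, 4, 8                       # inputs, feature dim, attention dim
align = additive_attention(rng.standard_normal((N, D)),   # encoder outputs h
                           rng.standard_normal(D),        # decoder state s
                           rng.standard_normal((A, D)),   # W
                           rng.standard_normal((A, D)),   # V
                           rng.standard_normal(A))        # v
```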
  • 27. Tacotron: Limitations - Training time is too long: > 2M steps (mini batches) to converge - Waveform generation: Griffin-Lim is handy but hurts the waveform quality. - Stop condition: Tacotron predicts fixed-length spectrogram, which is inefficient both at training and inference time. 27
  • 28. Example: Tacotron2 Tacotron2 is a surprising method that achieved human-level quality of synthesized speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for high-quality waveform generation. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783 https://arxiv.org/abs/1712.05884 28
  • 29. Tacotron2: Architecture - CNN layers instead of CBHG at encoder and post-net - Stop token prediction - Post-net improves predicted mel spectrogram - Location sensitive attention - No reduction factor Figure is from Shen et al. (2018) 29
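The stop-token idea can be sketched as a decoding loop that halts once the predicted stop probability crosses a threshold, instead of always running to a fixed maximum length (the logits below are stand-ins, not model outputs):

```python
import math

def decode_until_stop(stop_logits, max_steps, threshold=0.5):
    """Tacotron2-style stop condition (illustrative)."""
    for t in range(max_steps):
        p_stop = 1.0 / (1.0 + math.exp(-stop_logits[t]))  # sigmoid of stop logit
        if p_stop > threshold:
            return t + 1          # number of decoder steps actually run
    return max_steps              # fell back to the fixed maximum

# stand-in logits: the model becomes confident of "stop" at step 4
steps_run = decode_until_stop([-5.0, -4.0, -2.0, 3.0, 4.0], max_steps=5)
print(steps_run)  # 4
```

This removes the fixed-length-spectrogram inefficiency listed among Tacotron's limitations.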
  • 30. [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) Tacotron2: Trends of architecture changes 1. Larger model size - character embedding: 256 → 512 - encoder: 256 → 512 - decoder pre-net: (256, 128) → (256, 256) - decoder: 256 → 1024 - attention: 256 → 128 - post-net: 256 → 512 2. More regularization: - Encoder applies dropout at every layer. The original Tacotron applies dropout only in the encoder pre-net. - Zoneout regularization for the encoder bidirectional LSTM and decoder LSTM - Post-net applies dropout at every layer - L2 regularization The CBHG module is replaced by CNN layers at the encoder and post-net, probably due to limited regularization applicability. A large model needs heavy regularization to prevent overfitting. 30 [7]
  • 31. Tacotron2: Source and target choice - Input: character sequence - Output: mel spectrogram - Objective: mean square error 31
  • 32. Tacotron2: Waveform synthesis with WaveNet - Upsampling rate: 200 - Mel vs linear: no difference - Human-level quality was achieved by WaveNet trained with ground-truth-aligned predicted spectrograms 32
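The upsampling rate ties the mel frame shift to the audio sample rate: it is the number of waveform samples each conditioning frame must cover. Assuming 16 kHz audio and a 12.5 ms frame shift (neither value is stated on this slide), the arithmetic gives the quoted 200:

```python
# samples per mel frame = sample rate * frame shift (assumed values)
sample_rate_hz = 16000
frame_shift_s = 0.0125
upsampling_rate = round(sample_rate_hz * frame_shift_s)
print(upsampling_rate)  # 200
```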
  • 33. [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585 Tacotron2: Location sensitive attention - Proposed by Chorowski et al. (2015) - Utilizes both input content and input location 33 [8]
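A sketch of how the "content plus location" combination works: the additive score gains a third term computed by convolving the previous step's alignment, so the model knows where it attended last time. Weights and the convolution filter below are random or fixed stand-ins:

```python
import numpy as np

def location_sensitive_scores(encoded, state, prev_align, conv_filter, W, V, U, v):
    """Location-sensitive attention, illustrative:
    e_i = v^T tanh(W s + V h_i + U f_i), with f = conv(prev_align)."""
    loc = np.convolve(prev_align, conv_filter, mode="same")[:, None]  # (N, 1) location features
    scores = np.tanh(state @ W.T + encoded @ V.T + loc @ U.T) @ v     # (N,)
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(2)
N, D, A = 6, 4, 8
prev_align = np.eye(N)[1]               # previous step attended position 1
align = location_sensitive_scores(
    rng.standard_normal((N, D)),        # encoder outputs h
    rng.standard_normal(D),             # decoder state s
    prev_align,
    np.ones(3) / 3,                     # stand-in smoothing filter
    rng.standard_normal((A, D)),        # W
    rng.standard_normal((A, D)),        # V
    rng.standard_normal((A, 1)),        # U (maps 1 location channel)
    rng.standard_normal(A))             # v
```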
  • 34. Tacotron2: Limitations - WaveNet training in Tacotron2: ● WaveNet is trained with ground-truth-aligned **predicted** mel spectrograms. ● Tacotron2's errors are corrected by WaveNet. ● However, this is unproductive: one WaveNet must be trained for each different Tacotron2. - WaveNet training under the normal condition: ● WaveNet is trained with ground-truth spectrograms. ● However, its MOS is still inferior to that of human speech. 34
  • 35. Our work: Tacotron for the Japanese language We extended Tacotron for the Japanese language. ● To model Japanese pitch accents, we introduced accentual-type embedding as well as phoneme embedding ● To handle long-term dependencies in accents, we use additional components Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) https://arxiv.org/abs/1810.11960 35
  • 36. Japanese Tacotron: Architecture - Accentual type embedding - Forward attention - Self-attention - Dual source attention - Vocoder parameter target 36
  • 37. Japanese Tacotron: Source and target choice - Source: phoneme and accentual type sequence - Target: - mel spectrogram - vocoder parameters - Objective: - L1 loss (mel spectrogram, mel generalized cepstrum) - softmax cross entropy loss (discretized F0) 37
  • 38. [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 Japanese Tacotron: Forward attention - Proposed by Zhang et al. (2018) - Enforces left-to-right progress - Fast alignment learning epoch 0 epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 38 [10]
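The core of forward attention is a simple recursion over the raw alignment: at each step the retained probability mass can only stay in place or move one position to the right, which enforces left-to-right progress. A minimal sketch (omitting the optional transition agent of Zhang et al. 2018):

```python
import numpy as np

def forward_attention(prev_forward, align):
    """One step of forward attention: mass at position i comes only from
    positions i and i-1 of the previous step, gated by the raw alignment."""
    shifted = np.concatenate(([0.0], prev_forward[:-1]))  # mass moving right by one
    a = (prev_forward + shifted) * align
    return a / a.sum()                                    # renormalize

# Start with all mass on the first input. Even though the raw alignment
# puts weight on later positions, forward attention lets mass advance at
# most one step per decoder iteration.
fwd = np.array([1.0, 0.0, 0.0, 0.0])
raw = np.array([0.1, 0.6, 0.2, 0.1])   # stand-in raw alignment
fwd = forward_attention(fwd, raw)
```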
  • 39. Japanese Tacotron: Limitations - Speech quality is still worse than that of a traditional pipeline system - Input feature limitations - Configuration is based on Tacotron, not Tacotron2 39
  • 40. Summary 40
      system            | network type   | input             | seq2seq output | post-net output | attention          | waveform synthesis
      Char2Wav          | RNN            | character/phoneme | vocoder        | -               | GMM                | SampleRNN
      Tacotron          | RNN            | character         | mel            | linear          | additive           | Griffin-Lim
      VoiceLoop [11]    | memory buffer  | phoneme           | vocoder        | -               | GMM                | WORLD
      DeepVoice3 [12]   | CNN            | character/phoneme | mel            | linear/vocoder  | dot-product        | Griffin-Lim/WORLD/WaveNet
      Tacotron2         | RNN            | character         | mel            | mel             | location sensitive | WaveNet
      Transformer [13]  | self-attention | phoneme           | mel            | mel             | dot-product        | WaveNet
      Japanese Tacotron | RNN            | phoneme           | mel/vocoder    | -               | forward            | WaveNet
  • 41. Japanese Tacotron: Implementation URL: https://github.com/nii-yamagishilab/self-attention-tacotron Framework: TensorFlow, Python Supported datasets: LJSpeech, VCTK (more coming in the future) Features: 41 - Tacotron, Tacotron2, Japanese Tacotron models - Combination choices for encoder, decoder, attention mechanism - Forced alignment - Mel spectrogram or vocoder parameters as target - Compatible with our WaveNet implementation
  • 42. Japanese Tacotron: Experimental workflow 42 training & validation metrics validation result test result
  • 43. Japanese Tacotron: Audio samples 43 NAT TACMNP TACMAP SATMAP SATMNP SATWAP
  • 44. Bibliography - [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018) - [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 - [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) - [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 - [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) 44
  • 45. Bibliography - [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783 - [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) - [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585 - [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) 45
  • 46. Bibliography - [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 - [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR 2018 - [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. ICLR 2018 - [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality TTS with Transformer. CoRR abs/1809.08895 (2018) 46
  • 47. Acknowledgement This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan. 47