SlideShare a Scribd company logo
1 of 45
Download to read offline
エンドツーエンド音声合成
に向けたNIIにおける
ソフトウェア群 Part 2
安田裕介
National Institute of Informatics
~ TacotronとWaveNetのチュートリアル ~
1
About me
- Name: 安田裕介
- Status:
- A PhD student at Sokendai / NII
- A programmer at work
- Former geologist
- bachelor, master
- Github account: TanUkkii007
2
Table of contents
1. Text-to-speech architecture
2. End-to-end model
3. Example architectures
a. Char2Wav
b. Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation
3
TTS architecture: traditional pipeline
- Typical pipeline architecture
for statistical parametric
speech synthesis
- Consists of task-specific
models
- linguistic model
- alignment (duration) model
- acoustic model
- vocoder
4
TTS architecture: End-to-end model
- End-to-end model directly
converts text to waveform
- End-to-end model does not
require intermediate feature
extraction
- Pipeline models accumulate
errors across predicted
features
- End-to-end model’s internal
blocks are jointly optimized
5
[1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
End-to-end model: Encoder-Decoder with Attention
- Building blocks
- Encoder
- Attention mechanism
- Decoder
- Neural vocoder
Conventional end-to-end models may not
include a waveform generator, but some
recent full end-to-end models contain a
neural waveform generator, e.g. ClariNet[1].
6
End-to-end model: Decoding with attention
mechanism
7
Time step 1.
- Assign alignment
probabilities to encoded
inputs
- Alignment scores can be
derived from attention
layer’s output and
encoded inputs
End-to-end model: Decoding with attention
mechanism
8
Time step 1.
- Calculate context vector
- Context vector is the sum
of encoded inputs
weighted by alignment
probabilities
End-to-end model: Decoding with attention
mechanism
9
Time step 1.
- Decoder layer predicts an
output from the context
vector
End-to-end model: Decoding with attention
mechanism
10
Time step 2.
- The previous output is
fed back to next time step
- Assign alignment
probabilities to encoded
inputs
End-to-end model: Decoding with attention
mechanism
11
Time step 2.
- Calculate context vector
End-to-end model: Decoding with attention
mechanism
12
Time step 2.
- Decoder layer predict an
output from the context
vector
End-to-end model: Alignment visualization
alignment
ground truth mel spectrogram
predicted mel spectrogram
13
Example: Char2Wav
Char2Wav is one of the earliest work on end-to-end TTS. Its simple sequence-to-
sequence architecture with attention gives a good proof-of-concept for end-to-end
TTS as a start line. Char2Wav also uses advanced blocks like a neural vocoder
and the multi-speaker embedding.
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron
Courville, Yoshua Bengio:
Char2Wav: End-to-End Speech Synthesis. ICLR 2017
https://openreview.net/forum?id=B1VWyySKx
14
Char2Wav: Architecture
Encoder: bidirectional GRU RNN
Decoder: GRU RNN
Attention: GMM attention
Neural vocoder: SampleRNN
Figure is from Sotelo et al. (2017)
15
Char2Wav: Source and target choice
- Input: character or phoneme sequence
- Output: vocoder parameters
- Objective: mean square error or negative GMM log likelihood loss
16
Char2Wav: GMM attention
- Proposed by Graves
(2013)
- Location based attention
- Alignment is based on
input location
- Alignment does not
depend on input content
17
[3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
[3]
Char2Wav: GMM attention
- Monotonic progress
- The mean value μ
always increases as time
progresses
- Robust to long sequence
prediction
epoch 0 epoch 130
Visualization of GMM attention
alignment from Tacotron predicting
vocoder parameters. r=2.
18
Char2Wav: Limitations
- Target features: Char2Wav uses vocoder parameters as a target. Vocoder
parameters are challenging target for sequence-to-sequence system.
- e.g. vocoder parameters for 10s speech requires 2000 iteration to predict
- Architecture: As Tacotron paper demonstrated, a vanila sequence-to-sequence
architecture gives poor alignment.
These limitations were solved by Tacotron.
19
Example: Tacotron
Tacotron is the most influential method among modern architectures because of
several cutting-edge techniques.
Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis,
Rob Clark, Rif A. Saurous:
Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
https://arxiv.org/abs/1703.10135
20
Tacotron: Architecture
- CBHG encoder
- Encoder & decoder pre-net
- Reduction factor
- Post-net
- Additive attention
Figure is from Wang et al. (2017)
21
Tacotron: Source and target choice
- Input: character sequence
- Output: mel and linear spectrogram (50ms frame length, 12.5ms frame shift)
- x2.5 shorter total sequence than that of vocoder parameters
- Objective: L1 loss
22
Tacotron: Decoder pre-net
- The most important architecture
advancement from vanila seq2seq
- 2 FFN + ReLU + dropout
- Crucial for alignment learning
23
Tacotron: Reduction factor
- Predicts multiple frames at a time
- Reduction factor r: the number of
frames to predict at one step
- Large r →
- Small number of iteration
- Easy alignment learning
- Less training time
- Less inference time
- Poor quality
24
[5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
- Proposed by Bahdanau et al. (2014)
- Content based attention
- Distance between source and target is
learned by FFN
- No structure constraint
Tacotron: Additive attention
epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 epoch 100epoch 0
25
[5]
Tacotron: Limitations
- Training time is too long: > 2M steps (mini batches) to converge
- Waveform generation: Griffin-Lim is handy but hurts the waveform quality.
- Stop condition: Tacotron predicts fixed-length spectrogram, which is inefficient
both at training and inference time.
26
Example: Tacotron2
Tacotron2 is a surprising method that achieved human level quality of synthesized
speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for
high-quality waveform generation.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
Chen, Yu Zhang, Yuxuan Wang, RJ-Skerrv Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu:
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. ICASSP 2018: 4779-
4783
https://arxiv.org/abs/1712.05884
27
Tacotron2: Architecture
- CNN layers instead of CBHG at
encoder and post-net
- Stop token prediction
- Post-net improves predicted mel
spectrogram
- Location sensitive attention
- No reduction factor
Figure is from Shen et al. (2018)
28
[7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle,
Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
Tacotron2: Trends of architecture changes
1. Larger model size
- character embedding: 256 → 512
- encoder: 256 → 512
- decoder pre-net:
(256, 128) → (256, 256)
- decoder: 256 → 1024
- attention: 256 → 128
- post-net: 256 → 512
2. More regularizations:
- Encoder applies dropout at every
layer. The original Tacotron applies
dropout in pre-net only at encoder.
- Zoneout regularization for encoder
bidirectional LSTM and decoder LSTM
- Post-net applies dropout at every layer
- L2 regularization
CBHG module is replaced by CNN layers at encoder and
postnet probably due to limited regularization applicability.
Large model needs heavy regularization to prevent overfitting
29
[7]
Tacotron2: Source and target choice
- Input: character sequence
- Output: mel spectrogram
- Objective: mean square error
30
Tacotron2: Waveform synthesis with WaveNet
- Upsampling rate: 200
- Mel vs Linear: no difference
- Human-level quality was achieved by WaveNet trained with ground truth
alignmed predicted spectrogram
31
[8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
Tacotron2: Location sensitive attention
- Proposed by Chorowski et al.,
(2015)
- Utilizes both input content and
input location
32
[8]
Tacotron2: Limitations
- WaveNet training in Tacotron2:
● WaveNet is trained with ground truth-aligned **predicted** mel spectrogram.
● Tacotron2's errors are corrected by WaveNet.
● However, this is unproductive: one WaveNet for each different Tacotron2
- WaveNet training in normal condition
● WaveNet is trained with ground truth spectrogram
● However, its MOS score still is still inferior to the human speech.
33
Our work: Tacotron for Japanese language
We extended Tacotron for Japanese language.
● To model Japanese pitch accents, we introduced accentual type embedding
as well as phoneme embedding
● To handle long term dependencies in accents: we use additional components
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi:
Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent
language. CoRR abs/1810.11960 (2018)
https://arxiv.org/abs/1810.11960
34
Japanese Tacotron: Architecture
- Accentual type
embedding
- Forward attention
- Self-attention
- Dual source
attention
- Vocoder parameter
target
35
Japanese Tacotron: Source and target choice
- Source: phoneme and accentual type sequence
- Target:
- mel spectrogram
- vocoder parameters
- Objective:
- L1 loss (mel spectrogram, mel generalized cepstrum)
- softmax cross entropy loss (discretized F0)
36
[10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018:
4789-4793
Japanese Tacotron: Forward attention
- Proposed by Zhang et al. (2018)
- Precluding left-to-right progress
- Fast alignment learning
epoch 1 epoch 3
epoch 7 epoch 10 epoch 50
epoch 0
37
[10]
Japanese Tacotron: Limitations
- Speech quality is still worse than that of traditional pipeline system
- Input feature limitation
- Configuration is based on Tacotron, not Tacotron2
38
Summary
39
Char2Wav Tacotron VoiceLoop
[11]
DeepVoice3
[12]
Tacotron
2
Transformer
[13]
Japanese
Tacotron
network
type
RNN RNN memory
buffer
CNN RNN self-attention RNN
input character/
phoneme
character phoneme character/
phoneme
character phoneme phoneme
seq2seq
output
vocoder mel vocoder mel mel mel mel/
vocoder
post-net
output
- linear - linear/
vocoder
mel mel -
attention
mechanism
GMM additive GMM dot-product location
sensitive
dot-product forward
waveform
synthesis
SampleRNN Griffin-Lim WORLD Griffin-Lim/
WORLD/
WaveNet
WaveNet WaveNet WaveNet
Japanese Tacotron: Implementation
URL: https://github.com/nii-yamagishilab/self-attention-tacotron
Framework: Tensorflow, Python
Supported datasets: LJSpeech, VCTK, (...coming more in the future)
Features:
40
- Tacotron, Tacotron2, Japanese Tacotron model
- Combination choices for encoder, decoder, attention mechanism
- Force alignment
- Mel spectrogram, vocoder parameter for target
- Compatible with our WaveNet implementation
Japanese Tacotron: Experimental workflow
41
training & validation
metrics
validation result test result
Bibliography
- [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-
Speech. CoRR abs/1807.07281 (2018)
- [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville,
Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
- [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850
(2013)
- [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis,
Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017:
4006-4010
- [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly
Learning to Align and Translate. CoRR abs/1409.0473 (2014)
42
Bibliography
- [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang,
Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ-Skerrv Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis,
Yonghui Wu: Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions.
ICASSP 2018: 4779-4783
- [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan
Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal:
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305
(2016)
- [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-
Based Models for Speech Recognition. NIPS 2015: 577-585
- [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron
text-to-speech synthesis systems with self-attention for pitch accent language. CoRR
abs/1810.11960 (2018)
43
Bibliography
- [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence
Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
- [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and
Synthesis via a Phonological Loop. ICLR 2018
- [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang,
Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence
Learning. ICLR 2018
- [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality
TTS with Transformer. CoRR abs/1809.08895 (2018)
44
Acknowledgement
This work was partially supported by JST CREST Grant Number JPMJCR18A6,
Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687,
18H04120, 18H04112, 18KT0051), Japan.
45

More Related Content

What's hot

Digital Signal Processinf (DSP) Course Outline
Digital Signal Processinf (DSP) Course OutlineDigital Signal Processinf (DSP) Course Outline
Digital Signal Processinf (DSP) Course OutlineMohammad Sohai Khan Niazi
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...NU_I_TODALAB
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionNU_I_TODALAB
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC techniqueChun Hao Wang
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...a3labdsp
 
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...Tomoki Hayashi
 
Generating sentences from a continuous space
Generating sentences from a continuous spaceGenerating sentences from a continuous space
Generating sentences from a continuous spaceShuhei Iitsuka
 
Structured Interactive Scores with formal semantics
Structured Interactive Scores with formal semanticsStructured Interactive Scores with formal semantics
Structured Interactive Scores with formal semanticsMauricio Toro-Bermudez, PhD
 
Concurrency in Python
Concurrency in PythonConcurrency in Python
Concurrency in Pythonkonryd
 
AnupVMathur
AnupVMathurAnupVMathur
AnupVMathuranupmath
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Universitat Politècnica de Catalunya
 
Master Thesis of Computer Engineering: OpenTranslator
Master Thesis of Computer Engineering: OpenTranslatorMaster Thesis of Computer Engineering: OpenTranslator
Master Thesis of Computer Engineering: OpenTranslatorGiuseppe D'Onofrio
 
Slide aansw
Slide aanswSlide aansw
Slide aanswedge7
 
20110319 parameterized algorithms_fomin_lecture01-02
20110319 parameterized algorithms_fomin_lecture01-0220110319 parameterized algorithms_fomin_lecture01-02
20110319 parameterized algorithms_fomin_lecture01-02Computer Science Club
 
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...InVID Project
 
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...Alpen-Adria-Universität
 

What's hot (20)

Digital Signal Processinf (DSP) Course Outline
Digital Signal Processinf (DSP) Course OutlineDigital Signal Processinf (DSP) Course Outline
Digital Signal Processinf (DSP) Course Outline
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC technique
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
 
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
 
Generating sentences from a continuous space
Generating sentences from a continuous spaceGenerating sentences from a continuous space
Generating sentences from a continuous space
 
gio's tesi
gio's tesigio's tesi
gio's tesi
 
Structured Interactive Scores with formal semantics
Structured Interactive Scores with formal semanticsStructured Interactive Scores with formal semantics
Structured Interactive Scores with formal semantics
 
Concurrency in Python
Concurrency in PythonConcurrency in Python
Concurrency in Python
 
AnupVMathur
AnupVMathurAnupVMathur
AnupVMathur
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
 
Master Thesis of Computer Engineering: OpenTranslator
Master Thesis of Computer Engineering: OpenTranslatorMaster Thesis of Computer Engineering: OpenTranslator
Master Thesis of Computer Engineering: OpenTranslator
 
Slide11 icc2015
Slide11 icc2015Slide11 icc2015
Slide11 icc2015
 
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
 
Slide aansw
Slide aanswSlide aansw
Slide aansw
 
20110319 parameterized algorithms_fomin_lecture01-02
20110319 parameterized algorithms_fomin_lecture01-0220110319 parameterized algorithms_fomin_lecture01-02
20110319 parameterized algorithms_fomin_lecture01-02
 
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
 
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...
Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Ne...
 

Similar to エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~

FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
Lenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesisLenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesisProvectus
 
On the realization of non linear pseudo-noise generator for various signal pr...
On the realization of non linear pseudo-noise generator for various signal pr...On the realization of non linear pseudo-noise generator for various signal pr...
On the realization of non linear pseudo-noise generator for various signal pr...Alexander Decker
 
MANOJ_H_RAO_Resume
MANOJ_H_RAO_ResumeMANOJ_H_RAO_Resume
MANOJ_H_RAO_ResumeManoj Rao
 
Hamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar DecoderHamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar DecoderRSIS International
 
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo Method
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo MethodIRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo Method
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo MethodIRJET Journal
 
A new channel coding technique to approach the channel capacity
A new channel coding technique to approach the channel capacityA new channel coding technique to approach the channel capacity
A new channel coding technique to approach the channel capacityijwmn
 
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...ijngnjournal
 
An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...Videoguy
 
Resume of Gaurang Rathod, Embedded Software Developer
Resume of Gaurang Rathod, Embedded Software DeveloperResume of Gaurang Rathod, Embedded Software Developer
Resume of Gaurang Rathod, Embedded Software DeveloperGaurang Rathod
 
Study of the operational SNR while constructing polar codes
Study of the operational SNR while constructing polar codes Study of the operational SNR while constructing polar codes
Study of the operational SNR while constructing polar codes IJECEIAES
 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationIRJET Journal
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementNAVER Engineering
 
Speech compression using loosy predictive coding (lpc)
Speech compression using loosy predictive coding (lpc)Speech compression using loosy predictive coding (lpc)
Speech compression using loosy predictive coding (lpc)Harshal Ladhe
 
Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfssuser849b73
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...kevig
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...kevig
 
Hao hsiang ma resume
Hao hsiang ma resumeHao hsiang ma resume
Hao hsiang ma resumeEliot Ma
 

Similar to エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~ (20)

FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Lenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesisLenar Gabdrakhmanov (Provectus): Speech synthesis
Lenar Gabdrakhmanov (Provectus): Speech synthesis
 
On the realization of non linear pseudo-noise generator for various signal pr...
On the realization of non linear pseudo-noise generator for various signal pr...On the realization of non linear pseudo-noise generator for various signal pr...
On the realization of non linear pseudo-noise generator for various signal pr...
 
MANOJ_H_RAO_Resume
MANOJ_H_RAO_ResumeMANOJ_H_RAO_Resume
MANOJ_H_RAO_Resume
 
Hamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar DecoderHamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar Decoder
 
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo Method
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo MethodIRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo Method
IRJET- Synchronization Scheme of MIMO-OFDM using Monte Carlo Method
 
A new channel coding technique to approach the channel capacity
A new channel coding technique to approach the channel capacityA new channel coding technique to approach the channel capacity
A new channel coding technique to approach the channel capacity
 
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...
Implementation of Pipelined Architecture for Physical Downlink Channels of 3G...
 
An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...
 
Resume of Gaurang Rathod, Embedded Software Developer
Resume of Gaurang Rathod, Embedded Software DeveloperResume of Gaurang Rathod, Embedded Software Developer
Resume of Gaurang Rathod, Embedded Software Developer
 
Study of the operational SNR while constructing polar codes
Study of the operational SNR while constructing polar codes Study of the operational SNR while constructing polar codes
Study of the operational SNR while constructing polar codes
 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translation
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
Speech compression using loosy predictive coding (lpc)
Speech compression using loosy predictive coding (lpc)Speech compression using loosy predictive coding (lpc)
Speech compression using loosy predictive coding (lpc)
 
Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
 
Kailash(13EC35032)_mtp.pptx
Kailash(13EC35032)_mtp.pptxKailash(13EC35032)_mtp.pptx
Kailash(13EC35032)_mtp.pptx
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
 
Hao hsiang ma resume
Hao hsiang ma resumeHao hsiang ma resume
Hao hsiang ma resume
 
RESUME_
RESUME_RESUME_
RESUME_
 

More from Yamagishi Laboratory, National Institute of Informatics, Japan

More from Yamagishi Laboratory, National Institute of Informatics, Japan (12)

DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
 
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
 
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
 
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
 
Generalization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction NetworksGeneralization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction Networks
 
Estimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasureEstimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasure
 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio SynthesisText-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
 
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
 

Recently uploaded

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~

  • 1. エンドツーエンド音声合成 に向けたNIIにおける ソフトウェア群 Part 2 安田裕介 National Institute of Informatics ~ TacotronとWaveNetのチュートリアル ~ 1
  • 2. About me - Name: 安田裕介 - Status: - A PhD student at Sokendai / NII - A programmer at work - Former geologist - bachelor, master - Github account: TanUkkii007 2
  • 3. Table of contents 1. Text-to-speech architecture 2. End-to-end model 3. Example architectures a. Char2Wav b. Tacotron c. Tacotron2 4. Our work: Japanese Tacotron 5. Implementation 3
  • 4. TTS architecture: traditional pipeline - Typical pipeline architecture for statistical parametric speech synthesis - Consists of task-specific models - linguistic model - alignment (duration) model - acoustic model - vocoder 4
  • 5. TTS architecture: End-to-end model - End-to-end model directly converts text to waveform - End-to-end model does not require intermediate feature extraction - Pipeline models accumulate errors across predicted features - End-to-end model’s internal blocks are jointly optimized 5
  • 6. [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018) End-to-end model: Encoder-Decoder with Attention - Building blocks - Encoder - Attention mechanism - Decoder - Neural vocoder Conventional end-to-end models may not include a waveform generator, but some recent full end-to-end models contain a neural waveform generator, e.g. ClariNet[1]. 6
  • 7. End-to-end model: Decoding with attention mechanism 7 Time step 1. - Assign alignment probabilities to encoded inputs - Alignment scores can be derived from attention layer’s output and encoded inputs
  • 8. End-to-end model: Decoding with attention mechanism 8 Time step 1. - Calculate context vector - Context vector is the sum of encoded inputs weighted by alignment probabilities
  • 9. End-to-end model: Decoding with attention mechanism 9 Time step 1. - Decoder layer predicts an output from the context vector
  • 10. End-to-end model: Decoding with attention mechanism 10 Time step 2. - The previous output is fed back to next time step - Assign alignment probabilities to encoded inputs
  • 11. End-to-end model: Decoding with attention mechanism 11 Time step 2. - Calculate context vector
  • 12. End-to-end model: Decoding with attention mechanism 12 Time step 2. - Decoder layer predict an output from the context vector
  • 13. End-to-end model: Alignment visualization alignment ground truth mel spectrogram predicted mel spectrogram 13
  • 14. Example: Char2Wav Char2Wav is one of the earliest work on end-to-end TTS. Its simple sequence-to- sequence architecture with attention gives a good proof-of-concept for end-to-end TTS as a start line. Char2Wav also uses advanced blocks like a neural vocoder and the multi-speaker embedding. Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 https://openreview.net/forum?id=B1VWyySKx 14
  • 15. Char2Wav: Architecture Encoder: bidirectional GRU RNN Decoder: GRU RNN Attention: GMM attention Neural vocoder: SampleRNN Figure is from Sotelo et al. (2017) 15
  • 16. Char2Wav: Source and target choice - Input: character or phoneme sequence - Output: vocoder parameters - Objective: mean square error or negative GMM log likelihood loss 16
  • 17. Char2Wav: GMM attention - Proposed by Graves (2013) - Location based attention - Alignment is based on input location - Alignment does not depend on input content 17 [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) [3]
  • 18. Char2Wav: GMM attention - Monotonic progress - The mean value μ always increases as time progresses - Robust to long sequence prediction epoch 0 epoch 130 Visualization of GMM attention alignment from Tacotron predicting vocoder parameters. r=2. 18
  • 19. Char2Wav: Limitations - Target features: Char2Wav uses vocoder parameters as a target. Vocoder parameters are challenging target for sequence-to-sequence system. - e.g. vocoder parameters for 10s speech requires 2000 iteration to predict - Architecture: As Tacotron paper demonstrated, a vanila sequence-to-sequence architecture gives poor alignment. These limitations were solved by Tacotron. 19
  • 20. Example: Tacotron Tacotron is the most influential method among modern architectures because of several cutting-edge techniques. Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 https://arxiv.org/abs/1703.10135 20
  • 21. Tacotron: Architecture - CBHG encoder - Encoder & decoder pre-net - Reduction factor - Post-net - Additive attention Figure is from Wang et al. (2017) 21
  • 22. Tacotron: Source and target choice - Input: character sequence - Output: mel and linear spectrogram (50ms frame length, 12.5ms frame shift) - x2.5 shorter total sequence than that of vocoder parameters - Objective: L1 loss 22
  • 23. Tacotron: Decoder pre-net - The most important architecture advancement from vanila seq2seq - 2 FFN + ReLU + dropout - Crucial for alignment learning 23
  • 24. Tacotron: Reduction factor - Predicts multiple frames at a time - Reduction factor r: the number of frames to predict at one step - Large r → - Small number of iteration - Easy alignment learning - Less training time - Less inference time - Poor quality 24
  • 25. [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) - Proposed by Bahdanau et al. (2014) - Content based attention - Distance between source and target is learned by FFN - No structure constraint Tacotron: Additive attention epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 epoch 100epoch 0 25 [5]
  • 26. Tacotron: Limitations - Training time is too long: > 2M steps (mini batches) to converge - Waveform generation: Griffin-Lim is handy but hurts the waveform quality. - Stop condition: Tacotron predicts fixed-length spectrogram, which is inefficient both at training and inference time. 26
  • 27. Example: Tacotron2 Tacotron2 is a surprising method that achieved human level quality of synthesized speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for high-quality waveform generation. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ-Skerrv Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. ICASSP 2018: 4779- 4783 https://arxiv.org/abs/1712.05884 27
  • 28. Tacotron2: Architecture - CNN layers instead of CBHG at encoder and post-net - Stop token prediction - Post-net improves predicted mel spectrogram - Location sensitive attention - No reduction factor Figure is from Shen et al. (2018) 28
  • 29. [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) Tacotron2: Trends of architecture changes 1. Larger model size - character embedding: 256 → 512 - encoder: 256 → 512 - decoder pre-net: (256, 128) → (256, 256) - decoder: 256 → 1024 - attention: 256 → 128 - post-net: 256 → 512 2. More regularizations: - Encoder applies dropout at every layer. The original Tacotron applies dropout in pre-net only at encoder. - Zoneout regularization for encoder bidirectional LSTM and decoder LSTM - Post-net applies dropout at every layer - L2 regularization CBHG module is replaced by CNN layers at encoder and postnet probably due to limited regularization applicability. Large model needs heavy regularization to prevent overfitting 29 [7]
  • 30. Tacotron2: Source and target choice - Input: character sequence - Output: mel spectrogram - Objective: mean square error 30
  • 31. Tacotron2: Waveform synthesis with WaveNet - Upsampling rate: 200 - Mel vs Linear: no difference - Human-level quality was achieved by WaveNet trained with ground truth alignmed predicted spectrogram 31
  • 32. [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585 Tacotron2: Location sensitive attention - Proposed by Chorowski et al., (2015) - Utilizes both input content and input location 32 [8]
  • 33. Tacotron2: Limitations - WaveNet training in Tacotron2: ● WaveNet is trained with ground truth-aligned **predicted** mel spectrogram. ● Tacotron2's errors are corrected by WaveNet. ● However, this is unproductive: one WaveNet for each different Tacotron2 - WaveNet training in normal condition ● WaveNet is trained with ground truth spectrogram ● However, its MOS score still is still inferior to the human speech. 33
  • 34. Our work: Tacotron for Japanese language We extended Tacotron for Japanese language. ● To model Japanese pitch accents, we introduced accentual type embedding as well as phoneme embedding ● To handle long term dependencies in accents: we use additional components Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) https://arxiv.org/abs/1810.11960 34
  • 35. Japanese Tacotron: Architecture - Accentual type embedding - Forward attention - Self-attention - Dual source attention - Vocoder parameter target 35
  • 36. Japanese Tacotron: Source and target choice - Source: phoneme and accentual type sequence - Target: - mel spectrogram - vocoder parameters - Objective: - L1 loss (mel spectrogram, mel generalized cepstrum) - softmax cross entropy loss (discretized F0) 36
  • 37. [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 Japanese Tacotron: Forward attention - Proposed by Zhang et al. (2018) - Precluding left-to-right progress - Fast alignment learning epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 epoch 0 37 [10]
  • 38. Japanese Tacotron: Limitations - Speech quality is still worse than that of traditional pipeline system - Input feature limitation - Configuration is based on Tacotron, not Tacotron2 38
  • 39. Summary 39 Char2Wav Tacotron VoiceLoop [11] DeepVoice3 [12] Tacotron 2 Transformer [13] Japanese Tacotron network type RNN RNN memory buffer CNN RNN self-attention RNN input character/ phoneme character phoneme character/ phoneme character phoneme phoneme seq2seq output vocoder mel vocoder mel mel mel mel/ vocoder post-net output - linear - linear/ vocoder mel mel - attention mechanism GMM additive GMM dot-product location sensitive dot-product forward waveform synthesis SampleRNN Griffin-Lim WORLD Griffin-Lim/ WORLD/ WaveNet WaveNet WaveNet WaveNet
  • 40. Japanese Tacotron: Implementation URL: https://github.com/nii-yamagishilab/self-attention-tacotron Framework: Tensorflow, Python Supported datasets: LJSpeech, VCTK, (...coming more in the future) Features: 40 - Tacotron, Tacotron2, Japanese Tacotron model - Combination choices for encoder, decoder, attention mechanism - Force alignment - Mel spectrogram, vocoder parameter for target - Compatible with our WaveNet implementation
  • 41. Japanese Tacotron: Experimental workflow 41 training & validation metrics validation result test result
  • 42. Bibliography - [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to- Speech. CoRR abs/1807.07281 (2018) - [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 - [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) - [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 - [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) 42
  • 43. Bibliography - [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ-Skerrv Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. ICASSP 2018: 4779-4783 - [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) - [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention- Based Models for Speech Recognition. NIPS 2015: 577-585 - [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) 43
  • 44. Bibliography - [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 - [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR 2018 - [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. ICLR 2018 - [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality TTS with Transformer. CoRR abs/1809.08895 (2018) 44
  • 45. Acknowledgement This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan. 45