
Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders

Oral presentation at Interspeech 2019, 17 Sept. 2019

Real-time neural text-to-speech
with sequence-to-sequence acoustic model
and WaveGlow or single Gaussian WaveRNN vocoders
Takuma Okamoto1, Tomoki Toda2,1, Yoshinori Shiga1 and Hisashi Kawai1
1National Institute of Information and Communications Technology (NICT), Japan
2Nagoya University, Japan
1
Introduction!
Problems and purpose!
Sequence-to-sequence acoustic model with full-context label input!
Real-time neural vocoders!
WaveGlow vocoder
Proposed single Gaussian WaveRNN vocoder
Experiments!
Alternative sequence-to-sequence acoustic model (NOT included in the proceedings)!
Conclusions
Outline
2
High-fidelity text-to-speech (TTS) systems!
WaveNet outperformed conventional TTS systems in 2016 -> End-to-end neural TTS
Tacotron 2 (+ WaveNet vocoder) J. Shen et al., ICASSP 2018
Text (English) -> [Tacotron 2] -> mel-spectrogram -> [WaveNet vocoder] -> speech waveform
Jointly optimizing text analysis, duration and acoustic models with a single neural network
No text analysis, no phoneme alignment, and no fundamental frequency analysis
Problem
NOT directly applicable to pitch accent languages
Tacotron for pitch accent language (Japanese) Y. Yasuda et al., ICASSP 2019
Phoneme and accentual type sequence input (instead of character sequence)
Conventional pipeline model with full-context label input > sequence-to-sequence acoustic model
Introduction
Realizing high-fidelity synthesis comparable to human speech!!
3
Problems in real-time neural TTS systems!
Results of sequence-to-sequence acoustic model for pitch accent language
Full-context label input > phoneme and accentual type sequence
Many investigations for end-to-end TTS
Introducing an autoregressive (AR) WaveNet vocoder -> CANNOT realize real-time synthesis
Parallel WaveNet with linguistic feature input
High-quality real-time TTS but complicated teacher-student training with additional loss functions required
Purpose: Developing real-time neural TTS for pitch accent languages!
Sequence-to-sequence acoustic model with full-context label input based on Tacotron structure
Jointly optimizing phoneme duration and acoustic models
Real-time neural vocoders without complicated teacher-student training
WaveGlow vocoder
Proposed single Gaussian WaveRNN vocoder
Problems and purpose
4
Sequence-to-sequence acoustic model with full-context label input based on Tacotron structure!
Input: full-context label vector (phoneme level sequence)
Removing the two past and two future phoneme contexts, which the bidirectional LSTM structure can capture instead (478 dims -> 130 dims)
1 x 1 convolution layer instead of embedding layer
Sequence-to-sequence acoustic model
[Figure: Tacotron-2-based architecture with replaced components -- input text is processed by a text analyzer into full-context label vectors; a 1 × 1 conv layer (replacing the embedding layer), 3 conv layers and bidirectional LSTM layers encode the sequence; location-sensitive attention, 2 LSTM layers with a 2-layer pre-net, linear projections for the mel-spectrogram and stop token, and a 5-conv-layer post-net produce the mel-spectrogram, which a neural vocoder converts into the speech waveform]
5
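To make the replaced input block concrete, here is a minimal PyTorch sketch (not the authors' implementation; layer sizes, channel counts, and class names are assumptions for illustration): a phoneme-level sequence of 130-dimensional full-context label vectors is projected by a 1 × 1 convolution in place of the character embedding, then passed through 3 conv layers and a bidirectional LSTM as in Tacotron 2's encoder.

```python
# Minimal sketch of the modified encoder front-end (dimensions assumed).
import torch
import torch.nn as nn

class FullContextEncoder(nn.Module):
    def __init__(self, label_dim=130, channels=512):
        super().__init__()
        # 1 x 1 convolution replaces the character embedding lookup,
        # since the input is already a real-valued full-context label vector.
        self.label_proj = nn.Conv1d(label_dim, channels, kernel_size=1)
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=5, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        self.blstm = nn.LSTM(channels, channels // 2,
                             batch_first=True, bidirectional=True)

    def forward(self, labels):                         # labels: (batch, phonemes, 130)
        x = self.label_proj(labels.transpose(1, 2))    # (batch, channels, phonemes)
        x = self.convs(x).transpose(1, 2)              # (batch, phonemes, channels)
        outputs, _ = self.blstm(x)                     # encoder states for attention
        return outputs

# Example: a batch of 2 utterances, 40 phonemes each
enc = FullContextEncoder()
states = enc(torch.randn(2, 40, 130))                  # -> (2, 40, 512)
```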
Generative flow-based model!
Image generative model: Glow + raw audio generative model: WaveNet
Training stage: speech waveform + acoustic feature -> white noise
Synthesis stage: white noise + acoustic feature -> speech waveform
Investigated WaveGlow vocoder!
Acoustic feature: mel-spectrogram (80 dims)
Training time
About 1 month using 4 GPUs (NVIDIA V100)
Inference time as real time factor (RTF)
0.1: using a GPU (NVIDIA V100)
4.0: using CPUs (Intel Xeon Gold 6148)
WaveGlow
R. Prenger et al., ICASSP 2019
Directly training real-time parallel generative model without teacher-student training
[Figure: WaveGlow flow -- the ground-truth waveform x is squeezed into vectors and passed through 12 steps of an invertible 1 × 1 convolution W_k and an affine coupling layer, in which a WaveNet conditioned on the upsampled acoustic feature h predicts log s_j and t_j for the affine transform of x_b, mapping x to white noise z]
6
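As a rough illustration of one WaveGlow flow step, the sketch below implements an affine coupling layer with an exactly invertible transform; the tiny convolutional network standing in for WaveGlow's WaveNet-like transform, and all channel sizes, are assumptions, not the published model.

```python
# Minimal sketch of one WaveGlow-style affine coupling step (toy "WN" network).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, half_channels=4, mel_channels=80, hidden=64):
        super().__init__()
        # Stand-in for the WaveNet-like transform: predicts log s and t
        # for x_b from x_a and the upsampled mel-spectrogram.
        self.wn = nn.Sequential(
            nn.Conv1d(half_channels + mel_channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2 * half_channels, 3, padding=1),
        )

    def forward(self, x, mel):                  # training: waveform -> noise
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.wn(torch.cat([x_a, mel], dim=1)).chunk(2, dim=1)
        z_b = x_b * torch.exp(log_s) + t        # affine transform of x_b
        return torch.cat([x_a, z_b], dim=1), log_s.sum()   # log|det| term for the loss

    def inverse(self, z, mel):                  # synthesis: noise -> waveform
        z_a, z_b = z.chunk(2, dim=1)
        log_s, t = self.wn(torch.cat([z_a, mel], dim=1)).chunk(2, dim=1)
        x_b = (z_b - t) * torch.exp(-log_s)     # exact inverse, computed in parallel
        return torch.cat([z_a, x_b], dim=1)

layer = AffineCoupling()
x = torch.randn(1, 8, 100)                      # 8 squeezed audio channels, 100 frames
mel = torch.randn(1, 80, 100)                   # upsampled mel-spectrogram
z, logdet = layer(x, mel)
x_rec = layer.inverse(z, mel)                   # recovers x up to numerical precision
```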

WaveRNN! N. Kalchbrenner et al., ICML 2018
Early investigation for real-time synthesis using a CPU
Sparse WaveRNN
Real-time inference with a mobile CPU
Dual-softmax
16 bit linear PCM is split into coarse and fine 8 bits -> two samplings are required to synthesize one audio sample
Proposed single Gaussian WaveRNN!
Predicting mean and standard deviation of next sample
Continuous values can be predicted
Initially proposed in ClariNet (W. Ping et al., ICLR 2019)
Applied to FFTNet (T. Okamoto et al., ICASSP 2019)
Only one sampling is sufficient to synthesize one audio sample
[Figure: (a) WaveRNN with dual-softmax -- the upsampled acoustic feature h (37 or 80 dims) and the past coarse/fine 8-bit samples ct−1, ft−1 (and current coarse ct) feed a masked GRU (1024 units) with softmax output layers for ct and ft; (b) proposed SG-WaveRNN -- the upsampled acoustic feature h and the ground-truth waveform xt−1 feed a GRU (1024 units) whose output layers predict µt and log σt]
WaveRNN vocoders for CPU inference
7
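The sketch below illustrates, under assumed layer sizes and a toy single-step layout (not the authors' network), why the single Gaussian output needs only one sampling per audio sample: the GRU step predicts µt and log σt and draws the next continuous sample directly, whereas the dual-softmax model requires two categorical draws (coarse, then fine 8 bits) per sample.

```python
# Toy sketch of one SG-WaveRNN generation step (sizes and layout assumed).
import torch
import torch.nn as nn

class SGWaveRNNStep(nn.Module):
    def __init__(self, feat_dim=80, hidden=1024):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim + 1, hidden)   # conditioning feature + previous sample
        self.out = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, feat_t, x_prev, h):
        h = self.gru(torch.cat([feat_t, x_prev], dim=-1), h)
        mu, log_sigma = self.out(h).chunk(2, dim=-1)
        x_t = mu + torch.exp(log_sigma) * torch.randn_like(mu)   # one Gaussian sample per step
        return x_t, h

step = SGWaveRNNStep()
h = torch.zeros(1, 1024)
x = torch.zeros(1, 1)
for t in range(5):                                    # generate 5 toy samples
    feat_t = torch.randn(1, 80)                       # upsampled acoustic feature at time t
    x, h = step(feat_t, x, h)
```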
Noise shaping method considering auditory perception!
Improving synthesis quality by reducing spectral distortion due to prediction error
Implemented by MLSA filter with averaged mel-cepstra
Effective for categorical and single Gaussian WaveNet and FFTNet vocoders T. Okamoto et al., SLT 2018, ICASSP 2019
Noise shaping for neural vocoders K. Tachibana et al., ICASSP 2018
[Figure: (a) training stage -- acoustic features are extracted from the speech corpus, a time-invariant noise shaping filter is calculated from averaged mel-cepstra, and the residual signal is generated from the speech signal by filtering, then quantized and used to train WaveNet / FFTNet; (b) synthesis stage -- WaveNet / FFTNet generates the residual signal from acoustic features, which is dequantized and inverse-filtered into the reconstructed speech signal; spectra of the source, residual, and reconstructed signals are illustrated]
Investigating impact for WaveGlow and WaveRNN vocoders
8
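A minimal sketch of the time-invariant noise shaping idea follows; it approximates the filter in the STFT domain with an average spectral envelope, rather than the MLSA filter with averaged mel-cepstra used in the paper, and all function names and parameters are illustrative assumptions.

```python
# STFT-domain approximation of time-invariant noise shaping (settings assumed).
import numpy as np
import librosa

N_FFT, HOP = 1024, 256

def average_envelope(corpus_waves):
    """Time-invariant envelope: average log-magnitude spectrum over the corpus."""
    logs = []
    for x in corpus_waves:
        S = np.abs(librosa.stft(x, n_fft=N_FFT, hop_length=HOP)) + 1e-8
        logs.append(np.log(S).mean(axis=1))        # average over frames
    return np.exp(np.mean(logs, axis=0))           # shape: (N_FFT // 2 + 1,)

def to_residual(x, env):
    """Training stage: flatten the speech with the average envelope -> residual target."""
    S = librosa.stft(x, n_fft=N_FFT, hop_length=HOP)
    return librosa.istft(S / env[:, None], hop_length=HOP, length=len(x))

def from_residual(r, env):
    """Synthesis stage: restore the envelope, so prediction noise follows the speech spectrum."""
    S = librosa.stft(r, n_fft=N_FFT, hop_length=HOP)
    return librosa.istft(S * env[:, None], hop_length=HOP, length=len(r))
```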
Speech corpus!
Japanese female corpus: about 22 h (test set: 20 utterances)
Sampling frequency: 24 kHz
Sequence-to-sequence acoustic model (introducing Tacotron 2's setting)!
Input: full-context label vector (130 dims)
Neural vocoders (w/wo noise shaping)!
Single Gaussian AR WaveNet
Vanilla WaveRNN with dual softmax
Proposed single Gaussian WaveRNN
WaveGlow
Acoustic features!
Simple acoustic features (SAF): fundamental frequency + mel-cepstra (37 dims)
Mel-spectrograms (MELSPC): 80 dims
Experimental conditions
9
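For reference, an 80-dimensional mel-spectrogram at 24 kHz could be extracted roughly as below; the FFT size, hop length, and file name are assumptions, not the paper's exact analysis settings.

```python
# Illustrative mel-spectrogram extraction (frame settings assumed, not from the paper).
import librosa
import numpy as np

wav, sr = librosa.load("utterance.wav", sr=24000)        # resample to 24 kHz
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80  # 80-dim MELSPC features
)
log_mel = np.log(np.maximum(mel, 1e-10))                  # log compression for the vocoder
print(log_mel.shape)                                      # (80, n_frames)
```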
Subjective evaluation!
Listening subjects: 15 Japanese native speakers
18 conditions x 20 utterances = 360 sentences per subject
Results!
Vanilla and single Gaussian WaveRNNs require noise shaping
Noise shaping is NOT effective for WaveGlow
Neural TTS systems with sequence-to-sequence acoustic model and neural vocoders can realize higher quality synthesis than STRAIGHT vocoder with analysis-synthesis condition
[Figure: MOS results on a 5-point scale for AR SG-WaveNet, WaveRNN, SG-WaveRNN, and WaveGlow with SAF or MELSPC features, with and without noise shaping (NS), in analysis-synthesis and TTS conditions, compared with STRAIGHT and original speech]
MOS results and demo
10
Evaluation condition!
Using a GPU (NVIDIA V100)
Simple PyTorch implementation
Results!
Sequence-to-sequence acoustic model + WaveGlow can realize real-time neural TTS with an RTF of 0.16
Single Gaussian WaveRNN can synthesize about twice as fast as vanilla WaveRNN
Results of real-time factor (RTF)
Real-time high fidelity neural TTS for Japanese can be realized
11
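As a small sketch of how a real-time factor can be computed (total synthesis time divided by the duration of the synthesized audio), assuming a hypothetical `synthesize` function standing in for the acoustic model + vocoder pipeline:

```python
# Minimal RTF measurement sketch; `synthesize` (text -> waveform samples at 24 kHz)
# is a placeholder for the acoustic model + vocoder pipeline, not the authors' code.
import time

def real_time_factor(synthesize, texts, sample_rate=24000):
    total_synth_time, total_audio_time = 0.0, 0.0
    for text in texts:
        start = time.perf_counter()
        waveform = synthesize(text)                     # array of audio samples
        total_synth_time += time.perf_counter() - start
        total_audio_time += len(waveform) / sample_rate
    return total_synth_time / total_audio_time          # RTF < 1 means faster than real time
```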
Real-time neural TTS with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders!
Sequence-to-sequence acoustic model with full-context label input
WaveGlow and proposed single Gaussian WaveRNN vocoders
Realizing real-time high-fidelity neural TTS using sequence-to-sequence acoustic model and WaveGlow vocoder with a real time factor of 0.16
Future work!
Implementing real-time inference with a CPU (such as sparse WaveRNN and LPCNet)
Comparing sequence-to-sequence acoustic model with conventional pipeline TTS models
T. Okamoto, T. Toda, Y. Shiga and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," IEEE ASRU 2019@Singapore, Dec. 2019 (to appear)
Conclusions
12