
Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders


Oral presentation at Interspeech 2019, 17 September 2019



  1. Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
     Takuma Okamoto (NICT, Japan), Tomoki Toda (Nagoya University / NICT, Japan), Yoshinori Shiga (NICT, Japan) and Hisashi Kawai (NICT, Japan)
  2. Outline
     - Introduction
     - Problems and purpose
     - Sequence-to-sequence acoustic model with full-context label input
     - Real-time neural vocoders: WaveGlow vocoder and proposed single Gaussian WaveRNN vocoder
     - Experiments
     - Alternative sequence-to-sequence acoustic model (not included in the proceedings)
     - Conclusions
  3. Introduction
     - High-fidelity text-to-speech (TTS) systems: WaveNet outperformed conventional TTS systems in 2016, leading to end-to-end neural TTS.
     - Tacotron 2 (+ WaveNet vocoder) [J. Shen et al., ICASSP 2018]:
       text (English) -> [Tacotron 2] -> mel-spectrogram -> [WaveNet vocoder] -> speech waveform
       - Jointly optimizes text analysis, duration and acoustic models with a single neural network: no text analysis, no phoneme alignment and no fundamental frequency analysis.
       - Problem: NOT directly applicable to pitch accent languages.
     - Tacotron for a pitch accent language (Japanese) [Y. Yasuda et al., ICASSP 2019]:
       - Phoneme and accentual type sequence input (instead of character sequence).
       - Conventional pipeline model with full-context label input > sequence-to-sequence acoustic model.
     - Goal: realizing high-fidelity synthesis comparable to human speech.
  4. Problems and purpose
     - Problems in real-time neural TTS systems:
       - Results of sequence-to-sequence acoustic models for a pitch accent language: full-context label input > phoneme and accentual type sequence.
       - Many end-to-end TTS investigations introduce an autoregressive (AR) WaveNet vocoder, which CANNOT realize real-time synthesis.
       - Parallel WaveNet with linguistic feature input achieves high-quality real-time TTS, but requires complicated teacher-student training with additional loss functions.
     - Purpose: developing real-time neural TTS for pitch accent languages.
       - Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure, jointly optimizing phoneme duration and acoustic models.
       - Real-time neural vocoders without complicated teacher-student training: WaveGlow vocoder and proposed single Gaussian WaveRNN vocoder.
  5. Sequence-to-sequence acoustic model
     - Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure.
     - Input: full-context label vector (phoneme-level sequence).
       - Reducing past and future 2 contexts based on the bidirectional LSTM structure (478 dims -> 130 dims).
       - A 1 x 1 convolution layer instead of an embedding layer.
     [Figure: Tacotron 2-style architecture — input text -> text analyzer -> full-context label vector -> 1 x 1 conv -> 3 conv layers + bidirectional LSTM layers -> location-sensitive attention -> 2-layer pre-net + 2 LSTM layers -> linear projections for mel-spectrogram and stop token -> 5-conv-layer post-net -> neural vocoder -> speech waveform; the 1 x 1 conv and full-context label input are the replaced components.]
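The role of the 1 x 1 convolution can be seen in a few lines of code: over a sequence, a 1 x 1 convolution is simply the same linear projection applied to every frame, so it accepts continuous/binary full-context label vectors directly, whereas an embedding layer would need a single integer token id per phoneme. A minimal pure-Python sketch (the tiny dimensions are illustrative, not the slide's 130-dim labels):

```python
def conv1x1(labels, weight, bias):
    """Apply a 1 x 1 convolution over a phoneme-level sequence.

    labels: list of label vectors (each in_dim floats)
    weight: out_dim x in_dim matrix, bias: out_dim floats
    Each frame is projected independently: out[t] = W @ labels[t] + b.
    """
    return [[b + sum(w * x for w, x in zip(row, vec))
             for row, b in zip(weight, bias)]
            for vec in labels]
```

In a framework such as PyTorch this corresponds to a `Conv1d` with `kernel_size=1`, which is why it can replace the embedding lookup without changing the rest of the encoder.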
  6. WaveGlow [R. Prenger et al., ICASSP 2019]
     - Generative flow-based model: image generative model Glow + raw audio generative model WaveNet.
       - Training stage: speech waveform + acoustic feature -> white noise.
       - Synthesis stage: white noise + acoustic feature -> speech waveform.
       - Directly trains a real-time parallel generative model without teacher-student training.
     - Investigated WaveGlow vocoder:
       - Acoustic feature: mel-spectrogram (80 dims).
       - Training time: about 1 month using 4 GPUs (NVIDIA V100).
       - Inference time as real-time factor (RTF): 0.1 using a GPU (NVIDIA V100); 4.0 using CPUs (Intel Xeon Gold 6148).
     [Figure: WaveGlow flow — ground-truth waveform x squeezed to vectors, then 12 blocks of invertible 1 x 1 convolution W_k and affine coupling layers; a WaveNet conditioned on the upsampled acoustic feature h maps x_a to (log s_j, t_j) for the affine transform of x_b, producing latent z.]
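The reason the same network can map waveform to noise at training time and noise to waveform at synthesis time is that each affine coupling layer is exactly invertible. A minimal sketch of one coupling step, where `predict` stands in for the conditioned WaveNet that maps the unchanged half x_a (plus acoustic features) to (log s, t) — any deterministic function works, which is all invertibility requires:

```python
import math

def coupling_forward(xa, xb, predict):
    """One WaveGlow-style affine coupling step: xb' = xb * s + t."""
    log_s, t = predict(xa)
    xb_out = [x * math.exp(ls) + ti for x, ls, ti in zip(xb, log_s, t)]
    return xa, xb_out

def coupling_inverse(xa, xb_out, predict):
    """Exact inverse used at synthesis: xb = (xb' - t) / s."""
    log_s, t = predict(xa)
    xb = [(y - ti) * math.exp(-ls) for y, ls, ti in zip(xb_out, log_s, t)]
    return xa, xb
```

Because x_a passes through unchanged, the synthesis pass can recompute the same (log s, t) and undo the transform exactly; the invertible 1 x 1 convolutions between coupling layers shuffle which half is transformed next.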
  7. WaveRNN vocoders for CPU inference
     - WaveRNN [N. Kalchbrenner et al., ICML 2018]: early investigation of real-time synthesis using a CPU.
       - Sparse WaveRNN: real-time inference with a mobile CPU.
       - Dual-softmax: 16-bit linear PCM is split into coarse and fine 8 bits, so two samplings are required to synthesize one audio sample.
     - Proposed single Gaussian WaveRNN:
       - Predicts the mean and standard deviation of the next sample, so continuous values can be predicted.
       - Initially proposed in ClariNet [W. Ping et al., ICLR 2019] and applied to FFTNet [T. Okamoto et al., ICASSP 2019].
       - Only one sampling is sufficient to synthesize one audio sample.
     [Figure: (a) WaveRNN with dual-softmax — acoustic feature h (37 or 80 dims) upsampled and concatenated with the past coarse/fine 8-bit samples c_{t-1}, f_{t-1} and current coarse 8-bit c_t, fed to a masked GRU (1024 units) whose split outputs drive softmaxes for c_t and f_t; (b) proposed SG-WaveRNN — a GRU (1024 units) over h and the past sample x_{t-1} predicts mu_t and log sigma_t.]
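The per-sample cost difference between the two output layers can be sketched directly. With a single Gaussian output, one Gaussian draw yields the next continuous sample; with the dual-softmax output, two categorical draws (coarse then fine 8 bits, the second conditioned on the first) are combined into one 16-bit sample. A minimal sketch (the combination formula for 16-bit linear PCM is an illustrative assumption):

```python
import math
import random

def sg_wavernn_sample(mu, log_sigma, rng):
    """Single Gaussian output: x_t = mu_t + sigma_t * eps, eps ~ N(0, 1).
    One draw per audio sample."""
    return mu + math.exp(log_sigma) * rng.gauss(0.0, 1.0)

def dual_softmax_combine(coarse, fine):
    """Dual-softmax output: combine the two 8-bit categorical draws
    (coarse, fine in 0..255) into one 16-bit sample in [-1, 1)."""
    return (coarse * 256 + fine) / 32768.0 - 1.0
```

Halving the number of samplings (and the associated network evaluations) per audio sample is the source of the roughly 2x synthesis speed-up reported later in the talk.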
  8. Noise shaping for neural vocoders [K. Tachibana et al., ICASSP 2018]
     - Noise shaping method considering auditory perception:
       - Improves synthesis quality by reducing spectral distortion due to prediction error.
       - Implemented by an MLSA filter with averaged mel-cepstra.
       - Effective for categorical and single Gaussian WaveNet and FFTNet vocoders [T. Okamoto et al., SLT 2018, ICASSP 2019].
     - This work investigates its impact on the WaveGlow and WaveRNN vocoders.
     [Figure: (a) Training stage — a time-invariant noise shaping filter is computed from acoustic features of the speech corpus; the speech signal is filtered into a residual signal that is quantized and used to train WaveNet/FFTNet. (b) Synthesis stage — the generated residual signal is dequantized and passed through the inverse (time-invariant noise weighting) filter to reconstruct the speech signal.]
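The core of the scheme is a filter/inverse-filter pair: the vocoder is trained on a spectrally flattened residual, and the inverse filter at synthesis time shapes the vocoder's prediction noise like the speech spectral envelope, where the ear tolerates it better. A minimal sketch with a first-order filter standing in for the MLSA filter built from averaged mel-cepstra (the real filter order and coefficients are much richer):

```python
def whiten(speech, a):
    """Training stage: filter the speech signal into a residual,
    r[n] = s[n] - a * s[n-1] (illustrative first-order whitening)."""
    residual, prev = [], 0.0
    for s in speech:
        residual.append(s - a * prev)
        prev = s
    return residual

def reconstruct(residual, a):
    """Synthesis stage: apply the exact inverse filter to the
    vocoder's generated residual, s[n] = r[n] + a * s[n-1]."""
    speech, prev = [], 0.0
    for r in residual:
        prev = r + a * prev
        speech.append(prev)
    return speech
```

Because the filter is time-invariant, it can be computed once from corpus-averaged statistics and applied cheaply outside the neural network.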
  9. Experimental conditions
     - Speech corpus: Japanese female corpus, about 22 h (test set: 20 utterances); sampling frequency: 24 kHz.
     - Sequence-to-sequence acoustic model (introducing Tacotron 2's settings): input is a full-context label vector (130 dims).
     - Neural vocoders (with/without noise shaping): single Gaussian AR WaveNet; vanilla WaveRNN with dual softmax; proposed single Gaussian WaveRNN; WaveGlow.
     - Acoustic features: simple acoustic features (SAF) — fundamental frequency + mel-cepstra (37 dims); mel-spectrograms (MELSPC) — 80 dims.
  10. MOS results and demo
     - Subjective evaluation: 15 Japanese native listeners; 18 conditions x 20 utterances = 360 sentences per subject.
     - Results:
       - Vanilla and single Gaussian WaveRNNs require noise shaping.
       - Noise shaping is NOT effective for WaveGlow.
       - Neural TTS systems with a sequence-to-sequence acoustic model and neural vocoders can realize higher-quality synthesis than the STRAIGHT vocoder under the analysis-synthesis condition.
     [Figure: 5-point MOS scores for AR SG-WaveNet, WaveRNN, SG-WaveRNN and WaveGlow with SAF or MELSPC features, with and without noise shaping (NS), in analysis-synthesis and TTS conditions, compared with STRAIGHT and original speech.]
  11. Results of real-time factor (RTF)
     - Evaluation condition: a GPU (NVIDIA V100) with a simple PyTorch implementation.
     - Results:
       - The sequence-to-sequence acoustic model + WaveGlow realizes real-time neural TTS with an RTF of 0.16.
       - The single Gaussian WaveRNN synthesizes about twice as fast as the vanilla WaveRNN.
     - Real-time, high-fidelity neural TTS for Japanese can be realized.
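For readers unfamiliar with the metric, the real-time factor is simply wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1.0 means the system runs faster than real time. A small sketch at the talk's 24 kHz sampling frequency (the example numbers are illustrative):

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means faster-than-real-time synthesis."""
    return synthesis_seconds / (num_samples / sample_rate)
```

For example, generating 10 s of audio (240,000 samples at 24 kHz) in 1.6 s of wall-clock time gives an RTF of 0.16, the figure reported above for the WaveGlow system.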
  12. Conclusions
     - Real-time neural TTS with a sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders:
       - Sequence-to-sequence acoustic model with full-context label input.
       - WaveGlow and proposed single Gaussian WaveRNN vocoders.
       - Real-time, high-fidelity neural TTS realized with the sequence-to-sequence acoustic model and WaveGlow vocoder at a real-time factor of 0.16.
     - Future work:
       - Implementing real-time inference with a CPU (such as sparse WaveRNN and LPCNet).
       - Comparing the sequence-to-sequence acoustic model with conventional pipeline TTS models: T. Okamoto, T. Toda, Y. Shiga and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," IEEE ASRU 2019, Singapore, Dec. 2019 (to appear).