
Speech Synthesis: WaveNet (D4L3 Deep Learning for Speech and Language UPC 2017)

https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in deep learning methods for speech and language. Recurrent Neural Networks (RNNs) will be presented and analyzed in detail to understand the potential of these state-of-the-art tools for time-series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis, or question answering. Hands-on sessions will provide development skills so that attendees can become competent with contemporary data-analytics tools.



  1. [course site] Day 4 Lecture 3. Speech Synthesis: WaveNet. Antonio Bonafonte
  2. deepmind.com/blog/wavenet-generative-model-raw-audio/ (September 2016)
  3. Deep architectures… but not deep (yet). Text-to-Speech: textual features → spectrum of speech (many coefficients). [Pipeline diagram: TXT → designed feature extraction → “hand-crafted” features ft1, ft2, ft3 → regression module → s1, s2, s3 → waveform generation]
  4. Text-to-Speech using WaveNet. [Pipeline diagram: TXT → designed feature extraction → “hand-crafted” features ft1, ft2, ft3 → WaveNet]
  5. Introduction ● Based on PixelCNN ● Generative model operating directly on raw audio samples ● Objective: a factorised joint probability (see the equation below) ● Stack of convolutional networks ● Output: a categorical distribution via softmax ● Hyperparameters & overfitting controlled on a validation set
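The factorised joint probability referred to on slide 5 is the autoregressive chain rule over waveform samples, as written in the WaveNet paper; each conditional is modelled by the convolutional stack and a 256-way softmax:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```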
  6. High-resolution signal and long-term dependencies
  7. Autoregressive model. DPCM decoder: the next sample is (almost) reconstructed from a linear causal convolution of past samples (see the sketch below).
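To make the DPCM analogy concrete, here is a minimal linear causal predictor in NumPy. The two-tap coefficients are not from the slides; they are the textbook predictor 2·cos(ω)·x[n−1] − x[n−2], which is exact for a pure tone:

```python
import numpy as np

def predict_next(past, coeffs):
    """Linear causal prediction: weighted sum of the most recent samples."""
    return float(np.dot(coeffs, past[-len(coeffs):][::-1]))

# 440 Hz tone sampled at 16 kHz
signal = np.sin(2 * np.pi * 440 * np.arange(64) / 16000)
omega = 2 * np.pi * 440 / 16000
coeffs = np.array([2 * np.cos(omega), -1.0])  # ideal 2-tap predictor for a sinusoid

print(predict_next(signal[:-1], coeffs))  # predicted sample
print(signal[-1])                         # actual sample (matches closely)
```

WaveNet replaces this fixed linear filter with a deep nonlinear network over a much longer causal window.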
  8. Dilated causal convolutions. Stacked dilated convolutions, e.g. dilations 1, 2, 4, …, 512, repeated three times. Receptive field: 1024 × 3 samples → ≈192 ms at 16 kHz (see the check below).
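A quick sanity check of the receptive-field arithmetic, assuming filter width 2 as in the paper (the slide's 1024 × 3 is a slight rounding of the exact count):

```python
# Dilation schedule from slide 8: 1, 2, 4, ..., 512, repeated three times.
dilations = [2 ** i for i in range(10)] * 3

# With filter width 2, each layer extends the receptive field by its dilation.
receptive_field = 1 + sum(dilations)
print(receptive_field)                  # 3070 samples (slide rounds to 1024 * 3)
print(1000 * receptive_field / 16000)   # ~192 ms at 16 kHz
```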
  9. Dilated causal convolutions. In training, all convolutions can be computed in parallel, since the ground-truth past samples are already known.
  10. Dilated causal convolutions. In generation, predictions are sequential (~2 min of computation per second of audio; see the sketch below).
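The generation bottleneck on slide 10 is easiest to see as a loop: each new sample feeds back as input to the next prediction, so the ~16,000 network evaluations per second of audio cannot be parallelised. A toy sketch, where the uniform `dummy_wavenet` stands in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
RECEPTIVE_FIELD = 3070  # from the receptive-field check above

def dummy_wavenet(context):
    """Stand-in for the network: a categorical distribution over the
    256 mu-law values (uniform here, purely for illustration)."""
    return np.full(256, 1.0 / 256)

generated = [128] * RECEPTIVE_FIELD          # mu-law "silence" as seed context
for _ in range(16000):                       # one second of audio at 16 kHz
    probs = dummy_wavenet(generated[-RECEPTIVE_FIELD:])
    generated.append(int(rng.choice(256, p=probs)))  # sample, then feed back
```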
  11. Modeling the pdf ● Not MSE ● Not Mixture Density Networks (MDN) ● Instead, a categorical distribution with a softmax (i.e., a classification problem)
  12. Modeling the pdf. “A softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values)” (van den Oord et al., 2016). The signal is represented using the µ-law: 16 bits → 8 bits (256 categories).
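The µ-law companding used here follows the paper: f(x) = sign(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255 and x ∈ (−1, 1). A minimal NumPy sketch of the encode/decode pair:

```python
import numpy as np

MU = 255  # mu-law parameter: 256 quantisation levels

def mu_law_encode(x):
    """Compand x in [-1, 1] and quantise to integer categories 0..255."""
    companded = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((companded + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(q):
    """Map categories 0..255 back to a waveform in [-1, 1]."""
    companded = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(companded) * ((1 + MU) ** np.abs(companded) - 1) / MU

x = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
print(mu_law_encode(x))                  # [  0  52 128 203 255]
print(mu_law_decode(mu_law_encode(x)))   # close to x; coarser near -1 and 1
```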
  13. Gated Activation Units & Residual Learning (equation below)
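The gated activation unit named on slide 13 is, in the paper's notation (∗ a dilated convolution, ⊙ element-wise multiplication, σ the sigmoid, k the layer index, and f/g the filter and gate):

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x}\right) \odot \sigma\left(W_{g,k} * \mathbf{x}\right)
```

Each layer's output is added back to its input (the residual connection), which helps train the very deep stack.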
  14. Architecture
  15. Conditional WaveNet. Results are shown conditioning on an extra input h: ● Speaker ID ● Music genre, instrument ● TTS: linguistic features + F0 (a duration model is needed to switch the conditioning from phoneme to phoneme).
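Conditioning enters through the same gated units. For a global condition h (e.g. speaker ID) the paper adds a learned projection of h inside both nonlinearities; for local conditioning (e.g. linguistic features), the Vᵀh terms become 1×1 convolutions over an upsampled version of the conditioning time series:

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)
```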
  16. Results
  17. Results. Listen for yourself!
  18. Discussion ● WaveNet: a deep generative model of audio samples ● Convolutional nets: faster to train than RNNs ● Outperforms the best existing TTS systems ● Autoregressive model: generation is sequential. “GANs were designed to be able to generate all of x in parallel, yielding greater generation speed.” (Ian Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks)
  19. deepmind.com/blog/wavenet-generative-model-raw-audio/ (September 2016)
