DataScienceLab2017: Lightning talk

DataScience Lab, May 13, 2017
Recent deep learning approaches for speech generation
Дмитрий Белевцов (Tech Lead at IBDI)
Over the past six months, several important models based on deep neural networks have appeared that can successfully synthesize human speech at the level of individual samples. This has made it possible to sidestep many shortcomings of classical spectral approaches. In this talk I will give a brief overview of the architectures of the most popular networks, such as WaveNet and SampleRNN.
All materials are available at: http://datascience.in.ua/report2017

Published in: Technology

  1. Sample-based generative models for speech synthesis. Дмитро Бєлєвцов @ IBDI
  6. Frame-based business
     1. Split the waveform into overlapping frames
     2. Extract spectral features from each frame
     3. Model the distribution of these parameters
     4. Generate parameters
     5. Convert parameters back to the waveform
  8. Frame-based business
     Pros:
     ● 100x lower time resolution
     ● Phase-invariant
     ● Naturally motivated
     ● Highly compressed
     ● Separated from pitch
     Cons:
     ● Highly compressed
     ● Synthesis introduces unnaturalness
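
The first two steps of the frame-based pipeline (splitting the waveform into overlapping frames and extracting spectral features) can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the function names, the 64-sample frame, the 32-sample hop, the Hann window, and the 1 kHz test tone are all assumptions made for the example, and a naive DFT stands in for whatever spectral features a real system would use.

```python
import cmath
import math


def split_into_frames(signal, frame_len, hop):
    """Step 1: split a sequence into overlapping frames (hop < frame_len)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Step 2: Hann-window a frame and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


# Toy input (assumed values): a 1 kHz sine sampled at 16 kHz for 50 ms.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]

frames = split_into_frames(signal, frame_len=64, hop=32)
features = [magnitude_spectrum(f) for f in frames]

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
```

Taking magnitudes discards the phase of each frame, which is one way to read the "phase-invariant" point above. Each frame of 64 samples is reduced to 33 numbers emitted once per hop, so the feature sequence has a much lower time resolution than the waveform.
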
  9. WaveNet
  18. WaveNet
      ● Deep
      ● Residual
      ● Convolutional
      ● Sample-based
      ● Probabilistic
      ● Conditional
      ● Generative
      ● Auto-regressive
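
The "probabilistic", "conditional", and "auto-regressive" properties can be read together as one factorization of the waveform distribution, which is how the WaveNet paper states it. The symbols are not on the slides and are introduced here for illustration: $x_t$ is the $t$-th audio sample, $T$ is the number of samples, and $h$ is the conditioning input (for example, linguistic features).

```latex
p(x_1, \dots, x_T \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, h)
```

Each factor is a distribution over a single sample, so generation draws one sample at a time and feeds it back as input for the next step.
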
  19. How does it work? Dilated causal convolutions
  20. How does it work?
  21. Trained like a CNN
  22. Generates like an RNN (with limited memory)
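
A dilated causal convolution can be sketched in a few lines. The example below is a toy under stated assumptions, not WaveNet itself: the function names, the kernel size of 2, the all-ones weights, and the dilation schedule 1, 2, 4, 8 are chosen only to make the behaviour easy to see, and there are no gates, residual connections, or learned parameters.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with a dilation factor.

    weights[0] multiplies the current sample and weights[k] the sample
    k * dilation steps in the past, so output[t] never depends on the future.
    Positions before the start of the signal are treated as zero.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed toy stack: four layers with doubling dilations.
dilations = [1, 2, 4, 8]

# Unit impulse at t = 8 in an otherwise silent 32-sample signal.
x = [0.0] * 32
x[8] = 1.0

h = x
for d in dilations:
    h = causal_dilated_conv(h, weights=[1.0, 1.0], dilation=d)

# Count how many output positions the impulse reaches.
span = sum(1 for v in h if v != 0.0)
```

The impulse influences outputs 8 through 23 and nothing before index 8, which shows both causality and the limited memory: with these settings an output sees only the last 16 inputs. During training every output position can be computed in one pass because the whole input is known, while generation has to produce samples one at a time.
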
  24. So how is it?
      Pros:
      ● Direct waveform generation
      ● State-of-the-art timbre quality
      ● CNN-like training
      Cons:
      ● Slow generation (40x slower than realtime on commodity CPU) *
      ● Sensitive to local conditioning
      ● Large memory footprint
      ● Hard to interpret
      ● Missing details
  25. Top layer activation
  26. SampleRNN
  29. SampleRNN
      Pros:
      ● Direct waveform generation
      ● Great long-range dependency modelling
      ● Reference implementation available
      ● Clear distinction between slow and fast time scales
      Cons:
      ● Training an RNN on very long sequences can be tricky
      ● ???
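
The "slow and fast time scales" point can be illustrated by listing when each tier of a hierarchical model takes a step. This is only a sketch of the clocking idea, with no neural network in it: the function name, the three-tier layout, and the frame sizes 16 and 4 are hypothetical values picked for the example, not settings taken from the talk.

```python
def tier_clock(num_samples, frame_sizes):
    """For each tier, list the sample indices at which it takes a step.

    frame_sizes are ordered slowest first; a tier with frame size fs steps
    once per fs samples. The final, sample-level tier steps at every index.
    """
    clocks = [list(range(0, num_samples, fs)) for fs in frame_sizes]
    clocks.append(list(range(num_samples)))
    return clocks


# Hypothetical three-tier setup: frame sizes 16 and 4, then single samples.
clocks = tier_clock(num_samples=32, frame_sizes=[16, 4])
steps_per_tier = [len(c) for c in clocks]
```

Over 32 samples the slowest tier steps twice, the middle tier eight times, and the sample-level tier 32 times, so the slow tiers cover a long stretch of audio in few recurrent steps while the fast tier handles per-sample detail.
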
  30. Papers to check out
      ● WaveNet: A Generative Model for Raw Audio (van den Oord et al. 2016)
      ● Fast Wavenet Generation Algorithm (Paine et al. 2016)
      ● Deep Voice: Real-time Neural Text-to-Speech (Arik et al. 2017)
      ● A Neural Parametric Singing Synthesizer (Blaauw et al. 2017)
      ● SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (Mehri et al. 2017)
      ● Char2Wav: End-to-End Speech Synthesis (Sotelo et al. 2017)
  31. Thanks!
