
Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)

https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.



  1. 1. [course site] Day 3 Lecture 5 Parametric Speech Synthesis Antonio Bonafonte
  2. 2. 2 Main TTS Technologies. Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units. Speech data: 2-10 hours from a professional speaker, carefully segmented and annotated.
  3. 3. 3 Concatenative H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  4. 4. Statistical Speech Synthesis H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  5. 5. 5 Main TTS Technologies. Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units. Statistical Parametric Speech Synthesis: represent the speech waveform using parameters (e.g., every 5 ms); use a statistical generative model; reconstruct the waveform from the generated parameters. Hybrid Systems: concatenative speech synthesis that selects the best units according to a statistical parametric system.
  6. 6. 6 Deep architectures … but not deep (yet). Text to Speech: textual features → spectrum of speech (many coefficients). [Pipeline diagram: TXT → designed feature extraction → "hand-crafted" features (ft1, ft2, ft3) → regression module → speech parameters (s1, s2, s3) → wavegen]
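The pipeline on this slide can be sketched as three chained stages. This is a toy illustration with placeholder function bodies (the names `extract_textual_features`, `regression_module` and `wavegen` follow the slide's diagram, but the bodies are stand-ins, not a real front end, model, or vocoder):

```python
# Minimal sketch of the classic SPSS pipeline; all function bodies
# here are toy placeholders, not a real front end or vocoder.

def extract_textual_features(text):
    # Hand-crafted front end: text -> one feature vector per unit.
    # Toy stand-in: a dummy 3-dim vector per word.
    return [[len(w), i, 0.0] for i, w in enumerate(text.split())]

def regression_module(features):
    # Statistical model: textual features -> acoustic parameters.
    # Toy stand-in: collapse each feature vector to one value.
    return [[float(sum(f))] for f in features]

def wavegen(acoustic_params):
    # Vocoder: acoustic parameters -> waveform samples (toy: flatten).
    return [v for frame in acoustic_params for v in frame]

text = "hello parametric speech"
samples = wavegen(regression_module(extract_textual_features(text)))
print(len(samples))
```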
  7. 7. 7 Textual features (x). From text to phonemes: pronunciation, including disambiguation (e.g.: "Jan. 26"). From phonemes to phonemes+ (enriched with linguistic features).
  8. 8. 8 Textual features (x) ● {preceding, succeeding} two phonemes ● Position of current phoneme in current syllable ● # of phonemes at {preceding, current, succeeding} syllable ● {accent, stress} of {preceding, current, succeeding} syllable ● Position of current syllable in current word ● # of {preceding, succeeding} {stressed, accented} syllables in phrase ● # of syllables {from previous, to next} {stressed, accented} syllable ● Guess at part of speech of {preceding, current, succeeding} word ● # of syllables in {preceding, current, succeeding} word ● Position of current word in current phrase ● # of {preceding, succeeding} content words in current phrase ● # of words {from previous, to next} content word ● # of syllables in {preceding, current, succeeding} phrase H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
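A few of the positional features listed above can be sketched as follows. The utterance structure and the feature subset are simplified assumptions for illustration, not the full feature set:

```python
# Toy sketch of assembling per-phoneme linguistic features like those
# above: neighbouring phonemes plus position within the syllable.

def phoneme_features(phones, idx, syllable_bounds):
    """Features for phones[idx]; syllable_bounds: (start, end) pairs."""
    prev2 = phones[max(0, idx - 2):idx]       # preceding two phonemes
    next2 = phones[idx + 1:idx + 3]           # succeeding two phonemes
    # position of the current phoneme inside its syllable
    for start, end in syllable_bounds:
        if start <= idx < end:
            pos_in_syl = idx - start
            syl_len = end - start
            break
    return {"prev": prev2, "next": next2,
            "pos_in_syllable": pos_in_syl, "syllable_len": syl_len}

phones = ["h", "e", "l", "ou"]        # "hello" with a toy phone set
syllables = [(0, 2), (2, 4)]          # [h e] [l ou]
feats = phoneme_features(phones, 2, syllables)
print(feats)
```

Real front ends emit many more such counts (see the slide), typically one-hot or normalized into a single numeric vector per phoneme.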
  9. 9. Statistical Speech Synthesis H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  10. 10. 10 Speech features (y)
  11. 11. 11 Speech features (y)
  12. 12. 12 Speech features (y)
  13. 13. 13 Speech features (y) ● Rate: ~5 ms frame shift (200 Hz) ● Spectral features (envelope) ● Excitation features (fundamental frequency, pitch) ● Representation that allows reconstruction: vocoders (STRAIGHT, Ahocoder, ...)
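The framing arithmetic behind a 5 ms analysis rate can be sketched as below. Real systems use a vocoder (STRAIGHT, Ahocoder, ...) to get the spectral envelope and F0; this toy example only shows the 200 Hz framing and a plain FFT log-magnitude spectrum, with an assumed 16 kHz sampling rate and 25 ms window:

```python
import numpy as np

fs = 16000                      # sampling rate (Hz), assumed
hop = int(0.005 * fs)           # 5 ms hop -> 80 samples (200 Hz rate)
win = 400                       # 25 ms analysis window

t = np.arange(fs)               # 1 second of a 100 Hz toy tone
x = np.sin(2 * np.pi * 100 * t / fs)

frames = []
for start in range(0, len(x) - win, hop):
    seg = x[start:start + win] * np.hanning(win)
    spec = np.abs(np.fft.rfft(seg))          # magnitude spectrum
    frames.append(np.log(spec + 1e-8))       # log-spectral features

frames = np.stack(frames)
print(frames.shape)             # ~200 frames per second of speech
```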
  14. 14. 14 Regression. [Pipeline diagram: TXT → designed feature extraction → ft1, ft2, ft3 → regression module → s1, s2, s3 → wavegen]
  15. 15. 15 Phoneme rate vs. frame rate H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
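The mismatch between phoneme rate and frame rate is usually resolved by repeating each phoneme's linguistic vector for its duration in frames, adding the frame's relative position within the phoneme as an extra input. A toy sketch (feature values and durations are assumptions):

```python
# Map phoneme-rate features to frame-rate inputs: repeat each phoneme's
# vector for its duration and append position-within-phoneme.

def upsample(phone_feats, durations):
    frames = []
    for feat, d in zip(phone_feats, durations):
        for k in range(d):
            frames.append(feat + [k / d])   # relative frame position
    return frames

phone_feats = [[1.0, 0.0], [0.0, 1.0]]   # two phonemes, toy features
durations = [3, 2]                        # durations in 5 ms frames
frames = upsample(phone_feats, durations)
print(len(frames), frames[0])
```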
  16. 16. 16 Duration Modeling H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
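A duration model regresses (log-)duration in frames from phoneme-level linguistic features. The sketch below uses synthetic data and closed-form linear least squares as a stand-in for the trained DNN/LSTM regressor the slides describe:

```python
import numpy as np

# Toy duration model: linguistic features -> log-duration in frames.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                 # linguistic features
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.0, 0.3])
y = X @ w_true + 0.01 * rng.normal(size=100)  # log-duration targets

# closed-form linear least squares as a stand-in for a trained network
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w_hat
print(float(np.mean((pred - y) ** 2)))        # small residual error
```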
  17. 17. 17 Acoustic Modeling
  18. 18. 18 Acoustic Modeling: DNN H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  19. 19. 19 Acoustic Modeling: DNN
  20. 20. 20 Regression using DNN (problem) H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  21. 21. 21 Mixture density network (MDN) H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  22. 22. 22 Mixture density network (MDN) H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
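The MDN idea: instead of emitting one mean per output, the network emits mixture weights, means and variances, and is trained with the mixture negative log-likelihood. A numpy sketch of that loss, with fixed toy parameters in place of real network outputs:

```python
import numpy as np

def mdn_nll(y, log_pi, mu, log_sigma):
    """NLL of scalar targets y under a Gaussian mixture per sample.

    log_pi, mu, log_sigma: shape (n, K) outputs of the network."""
    sigma = np.exp(log_sigma)
    # log N(y | mu_k, sigma_k) for each component
    log_norm = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    # log-sum-exp over components, weighted by mixture priors
    a = log_pi + log_norm
    m = a.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return -log_lik.mean()

y = np.array([0.0, 1.0])
log_pi = np.log(np.full((2, 2), 0.5))     # two equal components
mu = np.array([[0.0, 1.0], [0.0, 1.0]])
log_sigma = np.zeros((2, 2))              # unit variances
print(round(float(mdn_nll(y, log_pi, mu, log_sigma)), 3))
```

Minimizing this loss lets the model keep multimodal output distributions instead of averaging them away, which is the problem the previous slide raises for plain MSE regression.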
  23. 23. 23 Recurrent Networks: LSTM H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
  24. 24. 24 Recurrent Networks: LSTM. Boxplot of mean opinion scores. SPSS: models built using HTS. US: Ogmios unit-selection system. LSTM: duration and acoustic LSTM models. LSTM-pf: post-filtered. Ahocoded: analysis/synthesis (vocoder effect). Natural: human recording. Source: [MSc Pascual]
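One step of the LSTM cell used by the duration and acoustic models can be written out in a few lines. Weights here are random toy values; gate ordering and shapes are one common convention, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step; W: (4H, D), U: (4H, H), b: (4H,). Gate order i, f, o, g."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g          # gated update of the cell state
    h_new = o * np.tanh(c_new)     # emit the hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 5, 4                        # toy input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for t in range(3):                 # run a few frames through the cell
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape)
```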
  25. 25. 25 Multi-speaker
  26. 26. 26 Multi-speaker. Boxplot of subjective preference: -2 = multi-output system preferred; +2 = speaker-dependent system preferred.
  27. 27. 27 Multi-speaker
  28. 28. 28 Adaptation to new speaker
  29. 29. 29 Speaker interpolation Interpolation: 0 1 Interpolation: 0.25 0.75 Interpolation: 0.5 0.5 Interpolation: 0.75 0.25 Interpolation 1 0
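Speaker interpolation amounts to blending the speaker-dependent parameters (e.g. per-speaker output layers) with weights that sum to one, as in the 0 / 0.25 / 0.5 / 0.75 / 1 grid above. A toy sketch with assumed parameter vectors:

```python
import numpy as np

def interpolate(params_a, params_b, alpha):
    """alpha = 1.0 -> pure speaker A, alpha = 0.0 -> pure speaker B."""
    return alpha * params_a + (1.0 - alpha) * params_b

spk_a = np.array([1.0, 2.0, 3.0])       # toy speaker-A parameters
spk_b = np.array([3.0, 2.0, 1.0])       # toy speaker-B parameters
print(interpolate(spk_a, spk_b, 0.75))  # mostly speaker A
```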
  30. 30. Conclusions 30 ● Quality of DL-based SPSS is much better than conventional SPSS ● Concatenative hybrid systems also benefit from deep learning (e.g. Apple Siri) ● Some research on adding better linguistic features for expressivity (e.g. sentiment analysis features) ● A vocoder is still used in most systems, which degrades quality … (but … more on WaveNet tomorrow)
  31. 31. References 31. Statistical parametric speech synthesis: from HMM to LSTM-RNN. Heiga Zen, Google. rtthss2015.talp.cat/ (slides, video, ... and references). Deep Learning Applied to Speech Synthesis, 2016. Santiago Pascual, MSc Thesis, UPC. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]
