
Ph.D defence (Shinnosuke Takamichi)

Slides of my Ph.D defence (2015/12/22)


  1. Ph.D defense, 12/22/2015 © Shinnosuke Takamichi, Nara Institute of Science and Technology. Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis (Japanese title: 高音質な統計的パラメトリック音声合成のための音響モデリング法と音声パラメータ生成法).
  2. Research target: speech.
  3. Speech synthesis and its benefits. Speech synthesis is a method to synthesize speech by computer: Text-To-Speech (TTS) [Sagisaka et al., 1988] and Voice Conversion (VC) [Stylianou et al., 1998]. What is required? Flexible control of the voice beyond the ability of any one human, and speech generation whose quality matches a human's.
  4. Statistical parametric speech synthesis [Zen et al., 2009]: statistical modeling of the relationship between input and output, with better flexibility than unit selection synthesis [Iwahashi et al., 1993]. HMM-based TTS and GMM-based VC* [Tokuda et al., 2013] [Toda et al., 2007] give mathematical support for this flexibility and admit techniques from other research areas. But... (*HMM: Hidden Markov Model, GMM: Gaussian Mixture Model.)
  5. Natural speech vs. synthetic speech in speech quality: natural speech spoken by a human against synthetic speech from HMM-based TTS and GMM-based VC. Why the gap?
  6. Problem definition and rest of this talk. Pipeline: text analysis → acoustic modeling → speech parameter generation → waveform synthesis, with speech analysis supplying training features. Quality is lost to parameterization error (in analysis/synthesis), insufficient modeling, and over-smoothing. Approaches in this thesis: modeling of individual speech segments (Chapter 3) and the modulation spectrum for over-smoothing (Chapters 4 and 5); Chapter 2 covers the background.
  7. [Section: Chapter 2] Speech synthesis: analysis, modeling, generation, and synthesis.
  8. Two approaches to speech synthesis. Unit selection synthesis [Iwahashi et al., 1993]: high quality but low flexibility; segments are selected from a pre-recorded speech database. Statistical parametric speech synthesis [Zen et al., 2009]: high flexibility but low quality; text analysis → acoustic modeling → parameter generation → waveform synthesis.
  9. Text/speech analysis and waveform synthesis. Text analysis (e.g., [Sagisaka et al., 1990]) decomposes a sentence (e.g., あらゆる現実を…) into accent phrases, phonemes, and low/high accent labels. Speech analysis (e.g., [Kawahara et al., 1999]) takes the Fourier transform and power of the waveform: the spectral envelope gives the spectral parameters, and the fine periodic structure gives the pitch (F0).
  10. Acoustic modeling in HMM-based TTS [Zen et al., 2007]: ML training of the HMM parameter set 𝝀, 𝝀 = argmax 𝑃(𝒀|𝑿, 𝝀), where 𝑿 are the context labels from text analysis (e.g., sil-h+e, h-e+l, e-l+o for "Hello") and 𝒀 are the speech features over time from speech analysis. Similar contexts (e-l+o, a-l+o, o-l+o) share one context-tied Gaussian distribution 𝑁(⋅; 𝝁, 𝚺).
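To make the tied-Gaussian training concrete, here is a minimal numpy sketch (an illustration, not the thesis implementation; the function name and toy data are my own): the ML estimate of one context-tied Gaussian is just the sample mean and covariance of all frames pooled from the contexts sharing that tie.

```python
import numpy as np

def fit_tied_gaussian(frames):
    """ML estimate of N(.; mu, Sigma) from all frames pooled across the
    contexts that share this tied state (e.g. e-l+o, a-l+o, o-l+o)."""
    mu = frames.mean(axis=0)
    centered = frames - mu
    sigma = centered.T @ centered / len(frames)  # ML (biased) covariance
    return mu, sigma

rng = np.random.default_rng(0)
pooled = rng.normal(size=(200, 3))   # toy pooled feature frames, shape (T, D)
mu, sigma = fit_tied_gaussian(pooled)
```

Averaging over the pooled contexts is exactly what makes the tied model robust but over-smoothed, which is the degradation Chapter 3 targets.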
  11. Acoustic modeling in GMM-based VC [Stylianou et al., 1998]: ML training of the GMM parameter set 𝝀, 𝝀 = argmax 𝑃(𝒀𝑡, 𝑿𝑡|𝝀), where [𝑿𝑡, 𝒀𝑡] is the joint vector of source and target speech features at time 𝑡, modeled by Gaussian components 𝑁(⋅; 𝝁, 𝚺).
  12. Probability to generate features in HMM-based TTS [Tokuda et al., 2000]: for the state sequence 𝒒 obtained by text analysis of the input (e.g., "Hello"), the synthetic speech features 𝒀 are generated with 𝑃(𝒀|𝑿, 𝒒, 𝝀) = 𝑁(𝒀; 𝑬𝒒, 𝑫𝒒), where 𝑬𝒒 stacks the per-frame mean vectors 𝝁1, …, 𝝁𝑇 and 𝑫𝒒⁻¹ stacks the precision matrices 𝜮1⁻¹, …, 𝜮𝑇⁻¹.
  13. Probability to generate features in GMM-based VC [Toda et al., 2007]: analogously, for the mixture sequence 𝒒 given the source features 𝑿, 𝑃(𝒀|𝑿, 𝒒, 𝝀) = 𝑁(𝒀; 𝑬𝒒, 𝑫𝒒), with 𝑬𝒒 the stacked mean vectors 𝝁1, …, 𝝁𝑇 and 𝑫𝒒⁻¹ the stacked precision matrices 𝜮1⁻¹, …, 𝜮𝑇⁻¹.
  14. Speech parameter generation [Tokuda et al., 2000]: ML generation of the synthetic speech parameters 𝒚𝒒, 𝒚𝒒 = argmax 𝑃(𝒀|𝑿, 𝒒, 𝝀) = argmax 𝑃(𝒚, Δ𝒚|𝑿, 𝒒, 𝝀), where 𝒚 is the static feature sequence and Δ𝒚 its temporal delta. The generation is computationally efficient: it is solved in a closed form.
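The closed form behind this slide can be sketched in numpy. Assumptions of the sketch (mine, for brevity): one 1-D stream, diagonal covariances, and the delta window Δy_t = (y_{t+1} − y_{t−1})/2. Stacking statics and deltas as O = W y with stacked means 𝑬 and precision 𝑷, the ML trajectory is y_q = (WᵀPW)⁻¹WᵀPE.

```python
import numpy as np

def mlpg_1d(mu_static, mu_delta, var_static, var_delta):
    """Closed-form ML trajectory for one 1-D stream: O = [static; delta],
    O = W y, with delta_t = (y_{t+1} - y_{t-1}) / 2 at interior frames."""
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)                     # static rows: identity
    for t in range(T):                    # delta rows: finite difference
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5
    P = np.diag(1.0 / np.concatenate([var_static, var_delta]))  # precision
    E = np.concatenate([mu_static, mu_delta])
    # y_q = argmax N(W y; E, P^-1) = (W' P W)^-1 W' P E
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ E)

# deltas pulled toward zero smooth the fluctuating static means
y = mlpg_1d(np.array([0., 4., 0., 4.]), np.zeros(4), np.ones(4), np.ones(4))
```

With very large delta variances the delta constraint vanishes and y_q reduces to the static means; tight delta variances smooth the trajectory, which is one source of the over-smoothing discussed later.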
  15. [Section: Chapter 3] Statistical sample-based speech synthesis.
  16. Quality degradation by acoustic modeling. The context-tied Gaussian 𝑁(⋅; 𝝁, 𝚺) in HMM-based TTS averages across input features (e-l+o, a-l+o, o-l+o): it is robust to unseen contexts, but it loses the information of individual speech parameters. Proposed approach: model individual speech parameters while keeping robustness, and select one model during parameter generation, which alleviates the quality degradation caused by averaging.
  17. Acoustic modeling of the proposed method: from the tied model to the rich-context GMM (R-GMM). Rich context models [Yan et al., 2009] update each mean while tying the covariance, giving less-averaged models that keep robustness. The R-GMM gathers them with equal mixture weights, so it has the same form as the conventional tied model.
  18. Speech parameter generation from R-GMMs: ML generation of the synthetic speech parameters 𝒚𝒒 by iterative generation with explicit model selection*, 𝒚𝒒 = argmax 𝑃(𝒚, Δ𝒚|𝒎, 𝑿) 𝑃(𝒎|𝒚, Δ𝒚, 𝑿), where 𝒎 selects one rich context model per frame, moving from the tied model to the R-GMM. (*𝝀, the HMM/GMM parameter set, is omitted.)
  19. Discussion. Initialization of the parameter generation (Sec. 3.5): uses speech parameters from over-trained statistics, so the initialization avoids averaging and the parameter generation alleviates the over-training. Comparison to unit selection synthesis (Sec. 2.2): the model selection corresponds to waveform segment selection, integrating unit selection into statistical modeling. Comparison to conventional hybrid methods: voice-control methods (e.g., [Yamagishi et al., 2007]) remain applicable, giving better flexibility than [Yan et al., 2009] [Ling et al., 2007] (Sec. 2.8).
  20. Subjective evaluation (preference test on speech quality), with 95% confidence intervals. Systems: H/G = HMM/GMM (tied model), R = R-GMM, T = target (R using the reference). HMM-based TTS compares spectrum/F0 combinations (H/H, H/R, R/H, R/R, and T); GMM-based VC compares G, R, and T. [Bar charts of preference scores omitted.]
  21. [Section: Chapter 4] Modulation spectrum-based post-filter.
  22. Over-smoothing in parameter generation: natural speech parameter trajectories fluctuate over time, while the synthetic trajectories produced by acoustic modeling and speech parameter generation are over-smoothed.
  23. Revisiting speech parameter generation (Sec. 2.6) [Tokuda et al., 2000]: ML generation of the synthetic speech parameters, 𝒚𝒒 = argmax 𝑃(𝒚, Δ𝒚|𝑿), where 𝑿 are the input features (𝝀, the HMM/GMM parameter set, is omitted). The generated ("HMM") spectral trajectory is smoother than the natural one.
  24. Global Variance (GV) and parameter generation with GV [Toda et al., 2007]: ML generation with a GV constraint, 𝒚𝒒 = argmax 𝑃(𝒚, Δ𝒚|𝑿) 𝑃(𝒗(𝒚))^𝜔, where 𝒗(𝒚) is the GV (the 2nd moment of the trajectory) and 𝜔 is the weight of the GV term. The "HMM+GV" trajectory fluctuates more, closer to the natural one.
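The GV itself is simple to compute; a hedged numpy sketch (function name and toy trajectory are illustrative):

```python
import numpy as np

def global_variance(y):
    """GV of a (T, D) trajectory: per-dimension variance taken over the
    whole utterance (the slide's '2nd moment'), one scalar per dimension."""
    return y.var(axis=0)

traj = np.array([[0., 1.], [2., 1.], [0., 1.], [2., 1.]])
gv = global_variance(traj)   # fluctuating dim -> 1.0, flat dim -> 0.0
```

An over-smoothed trajectory has a smaller GV than a natural one, which is what the GV constraint penalizes.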
  25. Something is still different between them... → What is it?
  26. Modulation Spectrum (MS) definition. The MS is the power spectrum of the parameter sequence (DFT and power), a vector over modulation frequencies, whereas the GV (2nd moment) is a scalar. The MS represents temporal fluctuation [Atlas et al., 2003], serves as a segment feature in speech recognition [Thomas et al., 2009], and captures speech intelligibility [Drullman et al., 1994]. (DFT: Discrete Fourier Transform.)
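A numpy sketch of the MS as defined on the slide (DFT and power of the parameter sequence; the FFT length and the toy sinusoids are my choices):

```python
import numpy as np

def modulation_spectrum(y, n_fft=128):
    """MS of one parameter dimension: power spectrum of the mean-removed
    trajectory. A vector over modulation frequencies, unlike the scalar GV."""
    y = y - y.mean()
    return np.abs(np.fft.rfft(y, n=n_fft)) ** 2

t = np.arange(100)
slow = modulation_spectrum(np.sin(2 * np.pi * t / 50))  # slow fluctuation
fast = modulation_spectrum(np.sin(2 * np.pi * t / 5))   # fast fluctuation
```

The fast-fluctuating trajectory concentrates its MS at a higher modulation frequency; this frequency-resolved view is exactly the information the scalar GV throws away.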
  27. Example of the MS: plotted over modulation frequency, the natural MS lies above both "HMM" and "HMM+GV". Speech quality will be improved by filling this gap!
  28. Post-filtering process: post-filtering in the MS domain by linear conversion (interpolation) using two Gaussian distributions. Training: estimate MS statistics (Gaussians) from the training-data speech parameters and from the parameters generated by the HMMs. Synthesis: filter the generated parameters with these statistics.
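The linear conversion between the two Gaussians can be sketched as a per-frequency mean/variance mapping with an interpolation weight (a sketch under assumed notation; the argument names, the weight k, and the toy statistics are not from the thesis):

```python
import numpy as np

def ms_postfilter(s_gen, mu_gen, sig_gen, mu_nat, sig_nat, k=1.0):
    """Map a generated MS toward the natural-MS Gaussian by matching the
    two Gaussians' means/variances per modulation frequency, then
    interpolate with weight k in [0, 1] (k = 0 leaves the MS unchanged)."""
    converted = mu_nat + (sig_nat / sig_gen) * (s_gen - mu_gen)
    return (1.0 - k) * s_gen + k * converted

# toy per-frequency statistics (illustrative numbers only)
mu_gen, sig_gen = np.array([1.0, 2.0]), np.array([0.5, 0.5])
mu_nat, sig_nat = np.array([3.0, 4.0]), np.array([1.0, 1.0])
unchanged = ms_postfilter(mu_gen, mu_gen, sig_gen, mu_nat, sig_nat, k=0.0)
mapped = ms_postfilter(mu_gen, mu_gen, sig_gen, mu_nat, sig_nat, k=1.0)
```

In a full system the conversion would typically act on log-MS values and the filtered MS would then be converted back to a parameter trajectory; this sketch shows only the per-frequency linear conversion step.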
  29. Filtered speech parameter sequence: the post-filtered trajectory ("HMM → post-filter") fluctuates like the natural one, unlike "HMM" and "HMM+GV". The post-filtering generates fluctuating speech parameters!
  30. Discussion 1: what is the MS? The GV is the temporal power of the speech parameter trajectory; the Fourier transform decomposes this power over modulation frequencies 1, 2, …, 𝐷s, and the MS is the power at each frequency. The sum of the MS over all modulation frequencies equals the GV.
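The claim "sum of MSs = GV" is Parseval's theorem; a quick numeric check (the 1/T² normalization follows from the unnormalized DFT convention assumed here):

```python
import numpy as np

# Numeric check: with an unnormalized DFT, Parseval gives
#   sum_k |DFT(y - mean)|_k^2 = T * sum_t (y_t - mean)^2 = T^2 * GV,
# so summing the MS over all modulation frequencies recovers the GV.
rng = np.random.default_rng(1)
y = rng.normal(size=128)
T = len(y)
gv = y.var()                                  # GV: temporal power (scalar)
ms = np.abs(np.fft.fft(y - y.mean())) ** 2    # MS: power per frequency
assert np.allclose(ms.sum() / T ** 2, gv)
```

So matching the full MS is a strictly stronger constraint than matching the GV, which only fixes this sum.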
  31. Discussion 2. Why a post-filter? It is independent of the original speech synthesis process, giving high portability together with high quality. Further applications: the spectrum, F0 (a non-continuous parameter), duration, and a segment-level filter for faster processing. Advantage over the conventional post-filters [Eyben et al., 2014] [Yoshimura et al., 1999]: automatic design and tuning.
  32. Subjective evaluation (preference test on speech quality): preference scores for the spectrum in HMM-based TTS (HMM, HMM+GV, post-filtering) and in GMM-based VC (GMM, GMM+GV, post-filtering). [Bar charts omitted.]
  33. [Section: Chapter 5] Speech synthesis integrating the modulation spectrum.
  34. Problems of the MS-based post-filter. It is an external process for MS emphasis, so it can over-emphasize while ignoring the speech synthesis criteria, and it is difficult to utilize the flexibility that HMM/GMMs have. Approach: joint optimization using HMM/GMMs and the MS, integrating the MS statistics as one of the acoustic models: speech parameter generation with the MS (high quality) and acoustic model training with the MS (high quality and fast).
  35. Speech parameter generation considering the MS: ML generation with an MS constraint, 𝒚𝒒 = argmax 𝑃(𝒚, Δ𝒚|𝑿) 𝑃(𝒔(𝒚))^𝜔, where 𝒔(𝒚) is the MS (the power spectrum, a quadratic function of 𝒚) and 𝜔 is the weight of the MS term, with 𝑃(𝒚, Δ𝒚|𝑿) = 𝑁([𝒚, Δ𝒚]; 𝑬𝒒, 𝑫𝒒) and 𝑃(𝒔(𝒚)) = 𝑁(𝒔(𝒚); 𝝁s, 𝚺s).
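Because 𝒔(𝒚) is quadratic in 𝒚, the MS term makes the objective non-Gaussian, so generation becomes iterative. A heavily simplified gradient-ascent sketch (my simplifications, not the thesis algorithm: one 1-D stream, diagonal covariances, the MS taken as the plain DFT power of 𝒚, a fixed step size, and toy targets):

```python
import numpy as np

def generate_with_ms(E, d_var, mu_s, s_var, w=0.05, lr=1e-3, n_iter=200):
    """Maximize log N(y; E, diag(d_var)) + w log N(s(y); mu_s, diag(s_var))
    by gradient ascent, with s(y) = |DFT(y)|^2 (quadratic in y)."""
    y = E.copy()                          # initialize from the basic solution
    T = len(y)
    for _ in range(n_iter):
        Y = np.fft.fft(y)
        c = (np.abs(Y) ** 2 - mu_s) / s_var              # weighted MS residual
        grad_ms = 2.0 * T * np.real(np.fft.ifft(c * Y))  # d(MS penalty)/dy
        y = y + lr * (-(y - E) / d_var - w * grad_ms)    # ascend the objective
    return y

T = 16
base = np.sin(2 * np.pi * np.arange(T) / T)   # fluctuating "natural" shape
E = 0.2 * base                                # over-smoothed mean trajectory
mu_s = np.abs(np.fft.fft(base)) ** 2          # target (natural-like) MS
y = generate_with_ms(E, np.ones(T), mu_s, np.ones(T))
```

Starting from the over-smoothed means, the MS term pushes the trajectory toward the target fluctuation; the thesis's criterion instead uses the full 𝑬𝒒, 𝑫𝒒 statistics and the trained MS Gaussian.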
  36. Discussion (comparison to the MS-based post-filter). Initialization: basic ML generation ("HMM") followed by the MS-based post-filter, so the initialization performs a partial optimization and the subsequent iterations perform the joint optimization, yielding "HMM+MS".
  37. Effect on the MS: over modulation frequency, "HMM+MS" fills the gap to the natural MS left by "HMM" and "HMM+GV". The proposed generation algorithm fills the gap!
  38. Effect on the GV: across the indices of the speech parameters, "HMM+MS" recovers the natural log GV without explicitly considering the GV!
  39. Subjective evaluation (preference test on speech quality): HMM-based TTS (HMM+GV vs. HMM+MS) and GMM-based VC (GMM+GV vs. GMM+MS). +GV: parameter generation with GV (Sec. 2.9); +MS: parameter generation with MS. [Bar charts omitted.]
  40. Problems of the generation algorithm, and MS-constrained training. Speech parameter generation considering the MS is an iterative process at synthesis time, hence computationally inefficient. Acoustic model training constrained by the MS instead trains the HMMs/GMMs 𝝀 to generate parameters 𝒚𝒒 with a natural MS: 𝝀 = argmax 𝑃(𝒚|𝑿) 𝑃(𝒔(𝒚))^𝜔, where 𝑃(𝒚|𝑿) = 𝑁(𝒚; 𝒚𝒒, 𝜮) is the trajectory likelihood (Sec. 2.8), minimizing the difference between 𝒚 and 𝒚𝒒, and 𝑃(𝒔(𝒚)) = 𝑁(𝒔(𝒚); 𝒔(𝒚𝒒), 𝜮s) is the MS likelihood, minimizing the difference between 𝒔(𝒚) and 𝒔(𝒚𝒒).
  41. Trained HMM parameters: compared with the basic training (Sec. 2.4-5) and the trajectory training (Sec. 2.8), the MS-constrained training updates the HMM/GMM parameters so that they generate fluctuating delta features.
  42. Discussion. Computational efficiency in parameter generation: with MS-constrained training, the basic generation algorithm (Sec. 2.6) can be used without the MS, so synthesis is not only high-quality but also computationally efficient. Which gives better quality, the proposed parameter generation or the proposed training? The structure of the HMMs/GMMs limits how well training can recover the MS, so the parameter generation considering the MS is better. Summary of the three methods: post-filter: best portability (no dependency on the models), better quality, better computation time (120 ms); parameter generation: better portability, best quality (optimization at synthesis time), worse computation time (about 1 min); training: worse portability, better quality, best computation time (5 ms).
  43. Subjective evaluation (preference test on speech quality) for HMM-based TTS and GMM-based VC, comparing HMM/GMM (basic training, Sec. 2.4-5), TRJ (trajectory HMM training, Sec. 2.8), GV (GV-constrained training, Sec. 2.9), and MS-TRJ (MS-constrained trajectory training). [Bar charts omitted.]
  44. Conclusion
  45. Conclusion. Problem addressed in this thesis: quality degradation of synthetic speech, caused by parameterization error, insufficient modeling, and over-smoothing. Chapter 3 (statistical sample-based speech synthesis): addresses the insufficiency in the acoustic modeling by modeling individual speech parameters with rich context models. Chapters 4 and 5 (approaches using the Modulation Spectrum, MS): address the over-smoothing in parameter generation with (1) the MS-based post-filter (high portability), (2) parameter generation with the MS (highest quality), and (3) MS-constrained training (computationally efficient generation).
  46. Future work. Improvements of rich context modeling: quality degrades even when the best models are selected (Sec. A.5). Theoretical analysis of the MS: why does the MS improve speech quality? The MS for DNN-based speech synthesis: more flexible structures for integrating the MS. GPU implementation of the proposed methods: rich-context-model selection and parameter generation with the MS.
