
Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input



In many situations, such as TV narration and speech-based creative work, you may want to control the prosody or pronunciation of synthetic speech. This method lets you control synthetic speech using your own voice.



  1. Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input. Yuri Nishigaki, Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST). MLSLP2015 in Aizu Univ., 09/19/2015. ©2015 Shinnosuke TAKAMICHI
  2. Speech-based creative activities and HMM-based speech synthesis. Activities: singing voice, speech, advertisement, live concert, narration, video avatar, voice actor, … Next? Useful method: HMM-based speech synthesis [Tokuda et al., 2013.] [Figure: the text “Synthesize!” is converted to synthetic speech parameters and then to speech.]
  3. Manual control of synthetic speech. Regression: Multi-Regression HMM [Nose et al., 2007.] (figure labels: laugh, sad). Manually manipulating HMM parameters. Both are very useful, but it is difficult for users to control the speech the way they want.
  4. Motivation of this study. Functions we want: (1) the original capability of HMM-based TTS; (2) speech-based control that is intuitive and makes synthetic speech mimic the prosody of input speech. Our work: speech synthesis having both functions. Similar to VOCALISTENER for singing voice control. [Figure: system diagram with the labels “Synthesize.” and “MR-HMM etc.”]
  5. Overview of the proposed system (only text is input): original HMM-based speech synthesis. Input text → text analysis → parameter generation (synthesis HMM) → waveform generation → synthetic speech.
  6. Overview of the proposed system (text & speech are input). Modules: text analysis (input text), speech analysis (input speech), duration extraction (alignment HMM and synthesis HMM), parameter generation (synthesis HMM), F0 modification, waveform generation → synthetic speech.
  7. Duration extraction module. Inputs: features of input speech and context of input text. HMM alignment (alignment HMM) yields the duration of the input speech; duration generation (synthesis HMM) yields the state duration of synthetic speech, which is passed to parameter generation.
  8. Alignment accuracy & duration unit. How to build alignment HMMs suitable for input speech? → Use pre-recorded speech uttered by users: large amounts → user-dependent HMMs; small amounts → HMMs adapted from the original alignment HMMs. How to map the input speech duration to synthetic speech? The alignment and synthesis HMM states represent different speech segments. Which duration unit is better: HMM-state, phone, or mora level?
  9. Speech parameter generation module. Inputs: context of input text and the state duration from duration extraction. Parameter generation (synthesis HMM) outputs the spectrum of synthetic speech (to waveform generation) and the F0 generated from HMMs (to F0 modification).
  10. F0 modification module. Inputs: features of input speech and the F0 generated from HMMs (from parameter generation). F0 conversion followed by U/V region modification yields the F0 of synthetic speech, which is passed to waveform generation.
  11. F0 conversion & unvoiced/voiced (U/V) modification. F0 conversion (linear conversion) fits the F0 range of the input speech to the reference generated from HMMs. U/V modification (spline interpolation) fits the U/V regions of the input speech to the reference. [Figure: F0 over time for the reference generated from HMMs, the input speech, the F0-converted contour, and the U/V-modified contour.]
  13. Experimental Setup
      Content: Value/Setting
      User: 4 Japanese speakers (2 male & 2 female)
      Target speaker: 1 Japanese female speaker
      Training data of synthesis HMMs: 450 phoneme-balanced sentences, 16 kHz-sampled, 5 ms shift, reading style
      Evaluation data: 53 phoneme-balanced sentences
      Speech features: 25-dim. mel-cepstrum, log F0, 5-band aperiodicity
      Speech analyzer: STRAIGHT [Kawahara et al., 1999.]
      Text analyzer: Open JTalk
      Acoustic model: 5-state HSMM [Zen et al., 2007.]
      Evaluations: (1) duration unit & alignment HMM adaptation; (2) synthesis HMM adaptation; (3) effect of U/V modification
  14. Evaluation 1: duration unit & alignment HMM adaptation. 3 duration units: state-, phoneme-, and mora-level duration. 4 HMMs using different amounts of pre-recorded speech: 0 … target-speaker-dependent HMMs (= synthesis HMM); 1 … HMMs adapted using 1 utterance uttered by the user; 56 … HMMs adapted using 56 utterances; 450 … user-dependent HMMs. Evaluation: MOS test on naturalness of synthetic speech; DMOS test on prosody mimicking ability of synthetic speech (input speech is presented as the reference).
  15. Result 1: duration unit & alignment HMM adaptation. [Figure: MOS on naturalness and DMOS on prosody mimicking ability (scale 1 to 5) for state-, phone-, and mora-level durations with 0, 1, 56, and 450 utterances; two comparisons are marked “No significant diff.”] We can confirm that (1) adaptation is effective and (2) phoneme-level duration is relatively robust.
  16. Experiment 2: Effectiveness of U/V modification in naturalness. [Figure: preference score on naturalness [%] (0 to 100), without vs. with modification, and U/V modification ratio [%] (0 to 20), U→V vs. V→U modification, for Spkr1 to Spkr4.] U/V modification can improve naturalness, especially when many U frames of the input speech are fixed.
  17. Conclusion. Two functions to control synthetic speech: the original function of HMM-based TTS (MR-HMM or manual control) and speech-based control (intuitive for users). Two main modules of our system: mimic duration (copy the duration of input speech to synthetic speech) and mimic F0 patterns (copy the dynamic F0 pattern of input speech to synthetic speech). Future work: HMM selection using text & speech.
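
Slide 8 asks which unit should carry the input speech duration over to the synthesis HMM states. The sketch below illustrates the phone-level case only. It assumes, and the slides do not state this, that each phone's total frame count from the alignment is redistributed over that phone's synthesis states in proportion to the durations the synthesis HMM would have generated. The function names and the rounding rule are hypothetical.

```python
def rescale_state_durations(synth_state_durs, target_total):
    """Scale one phone's synthesis state durations to sum to target_total frames.

    synth_state_durs: frame counts per state as generated by the synthesis HMM.
    target_total: integer frame count of the same phone in the input speech.
    Assumption: proportional scaling; the slides do not specify the rule.
    """
    total = sum(synth_state_durs)
    if total <= 0 or target_total < 0:
        raise ValueError("durations must be positive and target non-negative")
    exact = [d * target_total / total for d in synth_state_durs]
    frames = [int(x) for x in exact]  # round down first
    # Give the frames lost to rounding to the states with the largest remainders.
    leftover = target_total - sum(frames)
    order = sorted(range(len(exact)),
                   key=lambda k: exact[k] - frames[k], reverse=True)
    for k in order[:leftover]:
        frames[k] += 1
    return frames  # note: a state may receive 0 frames for very short phones


def map_phone_durations(input_phone_frames, synth_state_durs_per_phone):
    """Apply the phone-level mapping to every phone of an utterance."""
    return [rescale_state_durations(states, frames)
            for states, frames in zip(synth_state_durs_per_phone,
                                      input_phone_frames)]


if __name__ == "__main__":
    # Two phones, five states each (5-state HSMM as in the experimental setup).
    print(map_phone_durations([30, 10],
                              [[2, 4, 6, 4, 4], [1, 1, 1, 1, 1]]))
```

A state-level mapping would skip the rescaling and copy aligned state durations directly, while a mora-level mapping would apply the same rescaling over all states of a mora.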
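
The two F0 steps on slides 10 and 11 can be illustrated with a rough sketch. It rests on several assumptions that are not in the slides: both F0 tracks are frame-aligned lists of equal length with 0 marking unvoiced frames, the linear conversion matches mean and spread in the log-F0 domain, and plain linear interpolation stands in for the spline interpolation named on the slide. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def convert_f0(input_f0, ref_f0):
    """Linearly map voiced input log-F0 onto the reference's mean and spread."""
    in_log = [math.log(f) for f in input_f0 if f > 0]
    ref_log = [math.log(f) for f in ref_f0 if f > 0]
    if not in_log or not ref_log:
        return list(input_f0)  # nothing voiced to convert
    in_mu, in_sd = mean(in_log), pstdev(in_log)
    ref_mu, ref_sd = mean(ref_log), pstdev(ref_log)
    scale = ref_sd / in_sd if in_sd > 0 else 1.0
    return [math.exp((math.log(f) - in_mu) * scale + ref_mu) if f > 0 else 0.0
            for f in input_f0]


def _interpolate(i, voiced, f0):
    """Linear interpolation at frame i (stand-in for the slide's spline)."""
    left = [j for j in voiced if j < i]
    right = [j for j in voiced if j > i]
    if not left and not right:
        raise ValueError("no voiced frames to interpolate from")
    if not left:
        return f0[right[0]]
    if not right:
        return f0[left[-1]]
    a, b = left[-1], right[0]
    return f0[a] + (i - a) / (b - a) * (f0[b] - f0[a])


def modify_uv(f0, ref_f0):
    """Make the voiced/unvoiced pattern of f0 follow the reference."""
    voiced = [i for i, f in enumerate(f0) if f > 0]
    out = []
    for i, (f, r) in enumerate(zip(f0, ref_f0)):
        if r <= 0:
            out.append(0.0)                          # V->U: drop the frame's F0
        elif f > 0:
            out.append(f)                            # already voiced: keep
        else:
            out.append(_interpolate(i, voiced, f0))  # U->V: fill the gap
    return out


if __name__ == "__main__":
    reference = [200.0, 220.0, 240.0, 0.0]
    user = [100.0, 0.0, 130.0, 120.0]
    print(modify_uv(convert_f0(user, reference), reference))
```

Swapping `_interpolate` for a cubic spline fitted to the voiced frames would bring the sketch closer to the slide's description.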