
Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input

In many situations, such as TV narration and speech-based creative activities, you may want to control the prosody or pronunciation of synthetic speech. This method allows you to control synthetic speech using your own voice.


  1. Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input. Yuri Nishigaki, Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST). MLSLP 2015, Aizu Univ., 09/19/2015. ©2015 Shinnosuke Takamichi.
  2. Speech-based creative activities and HMM-based speech synthesis. Speech-based creative activities span singing voice and speech: advertisements, live concerts, narration, and, perhaps next, video avatars and voice acting. A useful method for these is HMM-based speech synthesis [Tokuda et al., 2013], which generates synthetic speech parameters from input text.
  3. Manual control of synthetic speech. Existing approaches let a user control style (e.g., "laugh" or "sad") through regression with the multi-regression HMM (MR-HMM) [Nose et al., 2007] or by manually manipulating HMM parameters. These approaches are useful, but it is difficult for users to control the synthetic speech exactly as they want.
  4. Motivation of this study. Functions we want: (1) the original capability of HMM-based TTS, and (2) speech-based control, which is intuitive and makes synthetic speech mimic the prosody of the input speech. Our work is a speech synthesis system having both functions, similar to VocaListener for singing-voice control.
  5. Overview of the proposed system (only text is input). This is the original HMM-based speech synthesis pipeline: input text → text analysis → parameter generation with the synthesis HMM → waveform generation → synthetic speech.
  6. Overview of the proposed system (text & speech are input). The input text goes through text analysis and the input speech through speech analysis. An alignment HMM drives duration extraction, and an F0 modification step adjusts the F0; parameter generation with the synthesis HMM uses these results, followed by waveform generation to produce the synthetic speech.
  7. Duration extraction module. The alignment HMM aligns the features of the input speech against the context of the input text, yielding the duration of the input speech. Duration generation then converts this into the state durations of the synthetic speech, which are passed to parameter generation with the synthesis HMM.
  8. Alignment accuracy & duration unit. How can we build alignment HMMs suited to the input speech? → Use pre-recorded speech uttered by the user: with large amounts, train user-dependent HMMs; with small amounts, adapt the original alignment HMMs. How should the input-speech duration be mapped to the synthetic speech? The alignment and synthesis HMM states represent different speech segments, so which duration unit is better: HMM-state, phoneme, or mora level?
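The duration-mapping question above can be illustrated with a small sketch for the phoneme-level unit: rescale the synthesis HMM's state durations within each phoneme so that the phoneme total matches the duration extracted from the user's speech. The function name and the proportional-rescaling rule are illustrative assumptions, not necessarily the authors' exact procedure.

```python
# Hypothetical sketch: map phoneme-level durations extracted from the
# input speech onto the state durations of the synthesis HMM.

def map_phoneme_durations(input_phone_durs, synth_state_durs):
    """For each phoneme, rescale the synthesis-HMM state durations so
    their sum matches the duration extracted from the input speech.

    input_phone_durs: per-phoneme frame counts from HMM alignment of
                      the input speech.
    synth_state_durs: per-phoneme lists of state durations generated
                      by the synthesis HMM.
    """
    mapped = []
    for target, states in zip(input_phone_durs, synth_state_durs):
        total = sum(states)
        # Proportionally stretch/shrink each state's duration.
        scaled = [target * d / total for d in states]
        # Round to integer frames while preserving the phoneme total.
        rounded, acc = [], 0.0
        for s in scaled:
            acc += s
            rounded.append(int(round(acc)) - sum(rounded))
        mapped.append(rounded)
    return mapped

# Example: a 3-state phoneme generated as [4, 6, 2] frames (12 total),
# while the user's speech spent 18 frames on that phoneme.
print(map_phoneme_durations([18], [[4, 6, 2]]))  # -> [[6, 9, 3]]
```

State-level mapping would instead copy each aligned state's duration directly, which slide 15 suggests is less robust than the phoneme-level unit when alignment is imperfect.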
  9. Speech parameter generation module. The synthesis HMM generates the spectrum of the synthetic speech from the context of the input text, using the state durations supplied by the duration extraction module. The F0 generated from the HMMs is passed to the F0 modification module, and the result goes to waveform generation.
  10. F0 modification module. The F0 generated from the HMMs and the features of the input speech enter F0 conversion, followed by U/V (unvoiced/voiced) region modification; the output is the F0 of the synthetic speech, which is sent to waveform generation.
  11. F0 conversion & unvoiced/voiced modification. F0 conversion fixes the F0 range of the input speech to fit the reference generated from the HMMs (linear conversion). U/V modification fixes the U/V regions of the input speech to fit the reference (spline interpolation). [Figure: F0 contours over time for the input speech, its F0-converted and U/V-modified versions, and the reference generated from the HMMs.]
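The two fixing steps on this slide can be sketched as follows. The linear conversion matches the mean and variance of the voiced input log-F0 to the HMM-generated reference; for the U/V step, the slide specifies spline interpolation, but this dependency-free sketch substitutes simple linear interpolation via `np.interp`. All function names are illustrative assumptions.

```python
# Hypothetical sketch of F0 conversion and U/V region modification.
# Unvoiced frames are represented by 0 in the log-F0 arrays.
import numpy as np

def convert_f0(input_lf0, ref_lf0):
    """Linearly shift/scale voiced input log-F0 to the reference range."""
    src = input_lf0[input_lf0 > 0]
    ref = ref_lf0[ref_lf0 > 0]
    return np.where(
        input_lf0 > 0,
        (input_lf0 - src.mean()) / src.std() * ref.std() + ref.mean(),
        0.0,  # keep unvoiced frames at 0
    )

def fix_uv_regions(lf0, ref_lf0):
    """Make the U/V decision follow the reference: frames the reference
    calls voiced but the input left unvoiced are filled by interpolating
    neighbouring voiced values; frames the reference calls unvoiced are
    zeroed. (The paper uses spline interpolation; linear here.)"""
    voiced = lf0 > 0
    idx = np.arange(len(lf0))
    filled = np.interp(idx, idx[voiced], lf0[voiced])
    return np.where(ref_lf0 > 0, filled, 0.0)
```

Running the conversion before the U/V fix mirrors the module order on slide 10: first the range is matched, then the voicing decision is reconciled with the reference.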
  12. EXPERIMENTAL EVALUATION
  13. Experimental setup. User: 4 Japanese speakers (2 male & 2 female). Target speaker: 1 Japanese female speaker. Training data of synthesis HMMs: 450 phoneme-balanced sentences, 16 kHz sampling, 5 ms shift, reading style. Evaluation data: 53 phoneme-balanced sentences. Speech features: 25-dim. mel-cepstrum, log F0, 5-band aperiodicity. Speech analyzer: STRAIGHT [Kawahara et al., 1999]. Text analyzer: Open JTalk. Acoustic model: 5-state HSMM [Zen et al., 2007]. Evaluations: (1) duration unit & alignment HMM adaptation, (2) synthesis HMM adaptation, (3) effect of U/V modification.
  14. Evaluation 1: duration unit & alignment HMM adaptation. 3 duration units: state-, phoneme-, and mora-level duration. 4 alignment HMMs using different amounts of pre-recorded speech: 0 = target-speaker-dependent HMMs (= synthesis HMMs); 1 = HMMs adapted using 1 utterance by the user; 56 = HMMs adapted using 56 utterances; 450 = user-dependent HMMs. Evaluation: a MOS test on the naturalness of the synthetic speech, and a DMOS test on its prosody-mimicking ability, with the input speech presented as the reference.
  15. Result 1: duration unit & alignment HMM adaptation. [Figure: MOS on naturalness and DMOS on prosody-mimicking ability (5-point scales) for 0/1/56/450 utterances and state/phoneme/mora duration units; some condition pairs show no significant difference.] We can confirm that (1) adaptation is effective, and (2) phoneme-level duration is relatively robust.
  16. Experiment 2: effectiveness of U/V modification in naturalness. [Figure: preference scores on naturalness (without vs. with modification) and U/V modification ratios (U→V vs. V→U) for speakers 1–4.] U/V modification can improve naturalness, especially when many unvoiced frames of the input speech are fixed.
  17. Conclusion. 2 functions to control synthetic speech: the original function of HMM-based TTS (MR-HMM or manual control) and speech-based control, which is intuitive for users. 2 main modules of our system: one mimics duration (copying the duration of the input speech to the synthetic speech), and the other mimics F0 patterns (copying the dynamic F0 pattern of the input speech). Future work: HMM selection using text & speech.
