In many situations, such as TV narration and speech-based creative work, you may want to control the prosody or pronunciation of synthetic speech. This method allows you to control synthetic speech using your own voice.
The document describes the NAIST Text-to-Speech system developed for the Blizzard Challenge 2015. The system uses an HMM-based approach with 4 main modules: text processing, speech processing, training, and synthesis. New functions include parameter trajectory smoothing using modulation spectrum analysis in the speech processing module and incorporating modulation spectrum in the synthesis module. Evaluation results show the system ranked highly in naturalness and intelligibility for the Marathi language.
This document discusses approaches to improve the quality of statistical parametric speech synthesis. It proposes modeling individual speech segments using rich context Gaussian mixture models and integrating modulation spectrum constraints into the parameter generation process. Subjective evaluations found these approaches improved speech quality over hidden Markov model-based synthesis and Gaussian mixture model-based voice conversion.
APSIPA2017: Trajectory smoothing for vocoder-free speech synthesis (Shinnosuke Takamichi)
This document discusses using modulation spectrum-based trajectory smoothing for DNN-based speech synthesis using FFT spectra. It proposes smoothing the trajectory of FFT spectral features by removing higher modulation frequency components that are difficult for statistical models to predict and negligible for speech perception. Experiments show this approach improves the training accuracy of acoustic models, as measured by lower mean squared error between natural and synthetic FFT spectra, without significantly degrading synthetic speech quality. The best results were obtained with a 30Hz low-pass filter cutoff modulation frequency.
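The smoothing step lends itself to a short illustration. Below is a minimal sketch of the idea, assuming a 5 ms frame shift (a 200 Hz frame rate) and the 30 Hz cutoff mentioned above; the function name and parameter values are illustrative, not taken from the paper.

```python
# Sketch of modulation-spectrum-based trajectory smoothing: low-pass
# filter each feature dimension along the time axis. Assumes a 5 ms
# frame shift (200 Hz frame rate) and a 30 Hz cutoff; names are
# illustrative, not from the paper.
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_trajectory(features, frame_rate_hz=200.0, cutoff_hz=30.0, order=4):
    """Low-pass filter a (num_frames, num_dims) feature trajectory.

    Removes modulation-frequency components above `cutoff_hz`, which
    statistical models predict poorly and which matter little perceptually.
    """
    nyquist = frame_rate_hz / 2.0
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forward and backward, so the smoothed
    # trajectory has no phase delay relative to the audio frames.
    return filtfilt(b, a, features, axis=0)

# Example: smooth a random "FFT spectrum" trajectory of 500 frames x 257 bins.
traj = np.random.randn(500, 257)
smoothed = smooth_trajectory(traj)
```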
This document provides an overview of BERT (Bidirectional Encoder Representations from Transformers) and how it works. It discusses BERT's architecture, which uses a Transformer encoder with no explicit decoder. BERT is pretrained using two tasks: masked language modeling and next sentence prediction. During fine-tuning, the pretrained BERT model is adapted to downstream NLP tasks through an additional output layer. The document outlines BERT's code implementation and provides examples of importing pretrained BERT models and fine-tuning them on various tasks.
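As a rough illustration of that fine-tuning recipe, here is a minimal sketch using the Hugging Face transformers library; the library choice, model name, and toy data are assumptions, not details from the summarized slides.

```python
# Minimal fine-tuning sketch with the Hugging Face `transformers`
# library (the summarized slides may use different tooling): a new
# classification head is added on top of the pretrained encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # output layer is randomly initialized

texts = ["a great movie", "a terrible movie"]   # toy dataset
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few fine-tuning steps on the toy batch
    outputs = model(**batch, labels=labels)  # loss = cross-entropy over labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```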
Speaker recognition systems aim to automatically identify or verify a speaker's identity based on characteristics of their voice. There are two main types: speaker identification determines which registered speaker is speaking, while speaker verification accepts or rejects a speaker's claimed identity. All systems contain modules for feature extraction and feature matching. Feature extraction represents the voice signal with parameters like MFCCs that can distinguish speakers. Feature matching compares extracted features from an unknown voice to known speaker models. The document describes the process of MFCC feature extraction in detail, including framing the speech signal, windowing frames, taking the FFT, mapping to the mel scale, and finally the DCT to produce MFCC coefficients.
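Those extraction steps map directly onto code. The following from-scratch sketch follows the described pipeline (framing, windowing, FFT, mel filterbank, log, DCT); all constants are common defaults rather than values from the document.

```python
# From-scratch sketch of the MFCC pipeline described above. Constants
# are typical defaults (16 kHz audio, 25 ms frames, 10 ms hop), not
# values taken from the document.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13):
    # 1) Framing: slice the signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)      # 2) windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # 3) FFT power spectrum

    # 4) Mel filterbank: triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 5) Log filterbank energies, then DCT to decorrelate -> MFCCs.
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]

coeffs = mfcc(np.random.randn(16000))  # 1 s of noise -> (n_frames, 13)
```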
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes... (IJERA Editor)
Marathi is one of the oldest languages in India. This research paper describes the development of a Marathi Text-to-Speech (TTS) system. In Marathi TTS the input is Marathi text in Unicode, and the voices are sampled from real recorded speech. The objective of a text-to-speech system is to convert an arbitrary text into its corresponding spoken waveform. Speech synthesis is the process of building machinery that can generate human-like speech from any text input, imitating human speakers. Text processing and speech generation are the two main components of a text-to-speech system. To build a natural-sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units. Generation of the sequence of phonetic units for a given standard word is referred to as a letter-to-phoneme or text-to-phoneme rule. The complexity of these rules and their derivation depends upon the nature of the language. The quality of a speech synthesizer is judged by its closeness to the natural human voice and its understandability. In this research paper we describe an approach to build a Marathi TTS system using the concatenative synthesis method with the syllable as the basic unit of concatenation.
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION (cscpconf)
This paper introduces an advanced, efficient approach for rule-based English to Bengali (E2B) machine translation (MT), in which Penn Treebank part-of-speech (PoS) tags and an HMM (Hidden Markov Model) tagger are used. A fuzzy if-then-rule approach is used to select the lemma from the rule-based knowledge. The proposed E2B-MT has been tested through F-score measurement, and the accuracy is more than eighty percent.
BERT is a language representation model that was pre-trained using two unsupervised prediction tasks: masked language modeling and next sentence prediction. It uses a multi-layer bidirectional Transformer encoder based on the original Transformer architecture. BERT achieved state-of-the-art results on a wide range of natural language processing tasks including question answering and language inference. Extensive experiments showed that both pre-training tasks, as well as a large amount of pre-training data and steps, were important for BERT to achieve its strong performance.
Limited Data Speaker Verification: Fusion of Features (IJECEIAES)
The present work demonstrates an experimental evaluation of speaker verification for different speech feature extraction techniques under the constraint of limited data (less than 15 seconds). State-of-the-art speaker verification techniques provide good performance for sufficient data (greater than 1 minute), and it is a challenging task to develop techniques which perform well under limited data conditions. In this work, features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Delta (Δ), Delta-Delta (ΔΔ), Linear Prediction Residual (LPR), and Linear Prediction Residual Phase (LPRP) are considered. The performance of individual features is studied, and combinations of these features are attempted for better verification performance. A comparative study is made between the Gaussian mixture model (GMM) and the GMM-universal background model (GMM-UBM) through experimental evaluation. The experiments are conducted using the NIST-2003 database. The experimental results show that the combination of features provides better performance than the individual features. Further, GMM-UBM modeling gives a reduced equal error rate (EER) compared to GMM.
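For a sense of how GMM-based verification scoring works, here is a deliberately simplified sketch in scikit-learn. A real GMM-UBM system MAP-adapts the UBM to each enrolled speaker; this version trains the speaker model directly, and the random arrays stand in for MFCC-derived features.

```python
# Simplified verification-score sketch with scikit-learn. A full
# GMM-UBM system MAP-adapts the UBM to each speaker; here the speaker
# model is trained directly, which keeps the likelihood-ratio idea
# visible. The arrays are stand-ins for MFCC(+delta) frame features.
import numpy as np
from sklearn.mixture import GaussianMixture

background = np.random.randn(5000, 39)  # pooled features from many speakers
enroll = np.random.randn(300, 39)       # limited enrollment data (claimed speaker)
test = np.random.randn(200, 39)         # test utterance features

ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(background)
spk = GaussianMixture(n_components=16, covariance_type="diag").fit(enroll)

# Average log-likelihood ratio; accept if it exceeds a threshold tuned
# on development data (the operating point where FAR = FRR is the EER).
llr = spk.score(test) - ubm.score(test)
print("accept" if llr > 0.0 else "reject")
```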
A Marathi Hidden-Markov Model Based Speech Synthesis System (iosrjce)
IOSR Journal of VLSI and Signal Processing (IOSRJVSP) is a double-blind, peer-reviewed international journal that publishes articles contributing new results in all areas of VLSI design and signal processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI design and signal processing concepts and to establish new collaborations in these areas.
The design and realization of microelectronic systems using VLSI/ULSI technologies requires close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.
BERT: Bidirectional Encoder Representations from Transformers (Liangqun Lu)
BERT was developed by Google AI Language and released in October 2018. It has achieved the best performance on many NLP tasks, so if you are interested in NLP, studying BERT is a good way to go.
This document provides an overview of hidden Markov models (HMMs) and their application in large vocabulary continuous speech recognition (LVCSR) systems. It describes the basic architecture of an HMM-based speech recognizer, including components like feature extraction, acoustic models, a pronunciation dictionary, language model, and decoder. It then discusses various refinements that are needed to achieve state-of-the-art performance, such as feature transformations, more complex HMM output distributions, discriminative training methods, adaptation techniques, and multi-pass recognition architectures.
The document proposes a new optimization algorithm called the Generalized Baum-Welch (GBW) algorithm for discriminative training on hidden Markov models. GBW is based on Lagrange relaxation of a transformed optimization problem. The Baum-Welch algorithm for maximum likelihood estimation of HMM parameters and the extended Baum-Welch algorithm for discriminative training are both special cases of GBW. The performance of GBW and EBW are compared for a Farsi large vocabulary continuous speech recognition task.
Paper introduction: "Translating into Morphologically Rich Languages with Synthetic Phrases", by Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer (EMNLP 2013).
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese (journalBEEI)
Recently, with the development and deployment of voicebots that help to minimize personnel at call centers, text-to-speech (TTS) systems supporting English and Chinese have attracted the attention of researchers and corporations worldwide. However, there are very few published works on TTS for Vietnamese. Thus, this paper presents in detail the first Tacotron-2-based TTS application for Vietnamese, which utilizes the publicly available FPT open speech dataset (FOSD) containing approximately 30 hours of labeled audio files together with their transcripts. The dataset was made available by FPT Corporation under an open access license. A new cleaner was developed to support the Vietnamese language rather than English, which is provided by default in the Mozilla TTS source code. After 225,000 training steps, the generated speech had mean opinion scores (MOS) well above the average value of 2.50, centering around 3.00 for both clearness and naturalness in a crowd-sourced survey.
This paper proposes a voice morphing system for people who have undergone laryngectomy, the surgical removal of all or part of the larynx (the voice box), typically performed in cases of laryngeal cancer. A basic method of achieving voice morphing is to extract the source speaker's vocal coefficients and convert them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMMs) for mapping the coefficients from source to target. However, the conventional GMM-based mapping approach results in over-smoothing of the converted voice. We therefore propose a method for efficient voice morphing and conversion based on GMMs that overcomes the over-smoothing effects of the traditional method. It uses glottal waveform separation and prediction of excitations, and the results show that not only is over-smoothing eliminated but the transformed vocal tract parameters also match the target. Moreover, the synthesized speech thus obtained is found to be of sufficiently high quality. Voice morphing based on this GMM approach is critically evaluated using various subjective and objective evaluation parameters, and an application of voice morphing for laryngectomees that deploys this approach is recommended.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a target speaker. It works by analyzing the source speech into an excitation signal and filter components, then resynthesizing it with the pitch and vocal characteristics of the target speaker. The key steps are detecting the pitches of the source and target speakers, scaling the source pitch to match the target, then resynthesizing the source speech using the target's vocal filter characteristics and the pitch-scaled excitation signal. Voice morphing was developed in 1999 and has applications in text-to-speech, dubbing, voice disguising, and public announcement systems.
This presentation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.
This document discusses speech user interfaces (SUI), which allow users to control computers using voice commands. It outlines the need for SUI to provide hands-free access for various users, including opportunities for illiterate populations. The document then covers an overview of speech recognition techniques like MFCC and HMM for feature extraction. It describes implementing an SUI, including recording speech, training models, and recognizing commands. Example applications are voice dialing, assistants, and accessibility tools. The document concludes by noting future areas like language learning and medical dictation, as well as challenges like vocabulary size and noise interference.
This document discusses various methods for analyzing speech signals using Matlab, including fundamental frequency estimation in both the frequency and time domains, and formant frequency estimation using linear predictive coding. Code examples are provided for estimating fundamental frequency from the peak in a signal's cepstrum and autocorrelation function, and for using LPC to find the best IIR filter for a speech segment and plot the filter's frequency response to estimate formant frequencies.
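A Python analogue of those analyses might look like the sketch below: fundamental frequency from the autocorrelation peak and formant candidates from LPC roots. Parameter values are typical choices, not the ones in the document.

```python
# Python analogue of the Matlab analyses described above: F0 from the
# autocorrelation peak, and formant estimates from the roots of an LPC
# polynomial. Parameter values are typical choices, not the document's.
import numpy as np
import librosa

def f0_autocorr(frame, sr, fmin=50.0, fmax=400.0):
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])   # strongest peak in the plausible lag range
    return sr / lag

def formants_lpc(frame, sr, order=12):
    a = librosa.lpc(frame.astype(float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one of each conjugate pair
    freqs = sorted(f for f in np.angle(roots) * sr / (2 * np.pi) if f > 90)
    return freqs[:4]                  # lowest resonances approximate the formants

sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(f0_autocorr(frame, sr))         # close to 120 Hz
```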
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
1) The document proposes a training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. It trains acoustic models through an iterative process of updating the models and anti-spoofing discriminator.
2) The algorithm aims to improve speech quality by compensating for differences between natural and generated speech parameter distributions using adversarial training.
3) Evaluation results show the algorithm improves speech quality over conventional training, while also training the models to effectively deceive the anti-spoofing system. The quality gains are robust against hyperparameter settings.
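The iterative recipe in (1)-(2) can be sketched in a few lines. The PyTorch fragment below is shape-only pseudocode made runnable: the network sizes, loss weight, and placeholder tensors are all assumptions, not the paper's actual configuration.

```python
# Shape-only PyTorch sketch of the adversarial recipe summarized above:
# the acoustic model minimizes a generation loss plus a term that drives
# the anti-spoofing discriminator to label its outputs as natural. All
# module sizes and tensors are illustrative placeholders.
import torch
import torch.nn as nn

acoustic_model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 60))
discriminator = nn.Sequential(nn.Linear(60, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(acoustic_model.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()
w = 0.1  # weight of the adversarial term (a tunable hyperparameter)

for step in range(1000):
    ling = torch.randn(32, 300)    # linguistic features (placeholder)
    natural = torch.randn(32, 60)  # natural speech parameters (placeholder)

    # 1) Update the anti-spoofing discriminator: natural vs. generated.
    generated = acoustic_model(ling).detach()
    d_loss = bce(discriminator(natural), torch.ones(32, 1)) + \
             bce(discriminator(generated), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update the acoustic model: fit the natural parameters while
    #    pushing the discriminator to call its outputs "natural".
    generated = acoustic_model(ling)
    g_loss = nn.functional.mse_loss(generated, natural) + \
             w * bce(discriminator(generated), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```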
June 24, 2017: ICASSP 2017 paper reading session (Kanto edition) at the University of Tokyo.
AASP-L3: Deep Learning for Source Separation and Enhancement I
Slides for the part presented by Daichi Kitamura, project assistant professor at the University of Tokyo.
These slides introduce papers that I did not author, so please refrain from redistributing them. For details not covered in these slides, please refer to the papers in question.
Evaluation of Hidden Markov Model based Marathi Text-To-Speech Synthesis System (IJERA Editor)
The objective of this paper is to evaluate the quality of an HMM-based Marathi TTS system. The main advantage of the HMM technique is that it allows variation in the voice easily, and the output speech produced by this method better conveys emotion, style, and intonation. Naturalness and intelligibility are the two important parameters for deciding the quality of synthetic speech. Depending on the parameters specified, the synthetic speech results fall into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The results are obtained using CT, DRT, and MOS tests.
Voice morphing is a technique for modifying a source speaker's speech so that it sounds as if it were spoken by a target speaker. It enables speech patterns to be cloned: an accurate copy of a person's voice can be made to say anything in the voice of someone else.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a target speaker. It works by extracting the pitch and formant information from both voices and using dynamic time warping to align the pitches. The frames are then converted back to a waveform to create a synthesized voice with the target speaker's characteristics. Potential applications include text-to-speech systems, special effects, and diminishing ethnic barriers, though it has limitations from normalization problems and requires extensive sound libraries. The future scope is to create a more powerful and flexible morphing tool with increased user interaction.
What can GAN and GMMN do for augmented speech communication? (Shinnosuke Takamichi)
1) The document discusses how generative adversarial networks (GANs) and generative moment matching networks (GMMNs) can improve augmented speech communication.
2) GANs and GMMNs have been used to improve the quality of text-to-speech synthesis and allow for random sampling of speech while preserving quality.
3) GMMNs are particularly useful for generating random speech due to their ability to explicitly model moments in speech and their easier optimization compared to GANs.
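Point (3) is easiest to see from the GMMN training objective, the maximum mean discrepancy (MMD), which matches generated and natural samples under a kernel without any discriminator. Below is a minimal sketch with an arbitrary kernel width and a linear stand-in generator.

```python
# Sketch of the maximum mean discrepancy (MMD) loss that a GMMN
# minimizes: it matches the statistics of generated and natural samples
# under a Gaussian kernel and needs no discriminator. Illustrative only.
import torch

def mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between sample sets x and y ((n, d) tensors)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

natural = torch.randn(64, 60)          # natural speech parameters (placeholder)
noise = torch.randn(64, 16)            # random inputs enable random sampling
generator = torch.nn.Linear(16, 60)    # stand-in for the GMMN generator
loss = mmd(generator(noise), natural)  # minimized w.r.t. generator parameters
loss.backward()
```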
Speech recognition technology allows users to communicate through spoken commands. It works by converting acoustic speech signals captured by a microphone into text. There are two main types of speech models - speaker independent models that can recognize many people, and speaker dependent models customized for a single person. The speech recognition process involves an audio input being digitized, then broken down into phonemes which are statistically modeled and matched to words in a grammar according to a dictionary to output recognized text.
Behzad Ghorbani presented research on unsupervised cross-lingual speaker adaptation for text-to-speech synthesis. The goal was to personalize speech-to-speech translation by adapting synthesized speech output to the user's voice using speech recognition. Three studies on unsupervised and cross-lingual adaptation approaches were discussed: 1) Finnish-English using decision tree construction, 2) Chinese-English comparing supervised and unsupervised schemes, and 3) English-Japanese using unsupervised adaptation and evaluation of synthetic speech quality.
Performance Calculation of Speech Synthesis Methods for Hindi language (iosrjce)
The document compares the performance of two speech synthesis methods - unit selection and hidden Markov model (HMM) - for Hindi language. It finds that unit selection results in higher quality synthesized speech than HMM based on both subjective and objective quality measurements. Subjective measurements using mean opinion scores show unit selection receives higher average ratings. Objective measurements of mean square error and peak signal-to-noise ratio also indicate unit selection introduces less distortion compared to the original speech samples.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques, highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling human articulator behavior. Second-generation techniques incorporate concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and periods of the harmonics, so the value of the fundamental can be changed while keeping the same basic spectral envelope. In addition, third-generation techniques include Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains a parametric model and produces high-quality speech. Finally, unit selection operates by selecting the best sequence of units from a large speech database that matches the specification.
This document describes a factored statistical machine translation system from English to Tamil that incorporates Tamil morphology. The system first reorders and factors the English text, then uses morphological analysis and generation tools for Tamil to further factorize the text. This addresses challenges of translating between languages with different morphological structures and word orders. The system was shown to improve over a baseline SMT system for English to Tamil translation by integrating linguistic information like lemmas and morphological features.
This document discusses homomorphic speech processing and techniques for speech enhancement. It provides an overview of modeling speech production as the excitation of a linear time-invariant system. Homomorphic filtering is introduced as a way to deconvolve speech into excitation and system response using logarithmic transformations. The complex cepstrum is discussed as a representation of speech that can be used to estimate pitch, voicing and formant frequencies. Homomorphic vocoding is described as a speech coding technique that quantizes the low-time part of the cepstrum at regular intervals to encode speech. Common techniques for speech enhancement like spectral subtraction and adaptive noise cancellation are also mentioned.
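The cepstral part of that story is compact enough to sketch. The code below computes a real cepstrum, reads the pitch off the high-quefrency peak, and recovers a smoothed spectral envelope by low-time liftering; the signal and lifter lengths are illustrative.

```python
# Sketch of homomorphic deconvolution via the real cepstrum: the log
# spectrum separates the slowly varying vocal-tract response (low-time
# cepstrum) from the excitation (a high-time peak at the pitch period).
import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.fft(frame * np.hamming(len(frame)))
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)).real

sr = 16000
t = np.arange(1024) / sr
frame = (np.sin(2 * np.pi * 125 * t) > 0.99).astype(float)  # crude pulse train
ceps = real_cepstrum(frame)

# Pitch: largest cepstral peak in the plausible quefrency range.
lo, hi = int(sr / 400), int(sr / 50)        # 400 Hz down to 50 Hz
pitch_hz = sr / (lo + np.argmax(ceps[lo:hi]))

# Vocal tract: keep only the low-time cepstrum ("liftering"), then
# transform back to get a smoothed log-magnitude spectral envelope.
lifter = np.zeros_like(ceps)
lifter[:30] = 1.0
lifter[-29:] = 1.0                          # mirrored half, keeps symmetry
envelope = np.fft.fft(ceps * lifter).real
```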
Hindi digits recognition system on speech data collected in different natural... (csandit)
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs for each speaker, since the data was collected in their respective homes: vehicle horn noise in some road-facing rooms, internal background noise such as opening doors in others, and silence in the rest. All these recordings are used for training the acoustic model, which is built on audio data from 8 speakers. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
This document provides an overview of speech recognition including:
- The topics that will be covered such as speech production, why speech recognition is difficult, and applications.
- How speech is produced through the lungs, larynx, and vocal tract and modified into different sounds.
- The main components of a speech recognition system including sound sampling, conversion to frequencies, and matching to a phoneme database.
- Some of the challenges in speech recognition including variations between speakers and dependence on neighboring sounds.
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals. HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material was segmented manually and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
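As a concrete reference for the alignment step, here is a minimal DTW implementation with a Mahalanobis-style frame distance. It simplifies to a diagonal covariance estimated from the data, which is an assumption rather than the paper's exact setup.

```python
# Minimal DTW sketch for aligning two feature sequences, using a
# Mahalanobis frame distance as in the study above (simplified here to
# a diagonal covariance estimated from the data).
import numpy as np

def dtw(x, y, inv_std):
    """Align (n, d) and (m, d) sequences; return the path and total distance."""
    n, m = len(x), len(y)
    # Mahalanobis distance with a diagonal covariance = weighted Euclidean.
    cost = np.sqrt((((x[:, None, :] - y[None, :, :]) * inv_std) ** 2).sum(-1))
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # Backtrack the optimal warping path from the end to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: acc[p])
    return path[::-1], acc[n, m]

a, b = np.random.randn(40, 13), np.random.randn(55, 13)
inv_std = 1.0 / np.vstack([a, b]).std(axis=0)
path, dist = dtw(a, b, inv_std)
```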
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals. HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material was segmented manually and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons at an enemy. Dynamic time warping (DTW) is widely used for aligning two given multidimensional sequences; it finds an optimal match between them. The distance between the aligned sequences should be smaller than between the unaligned sequences, so the improvement in alignment may be estimated from the corresponding distances. This technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than 20% reduction in the average Mahalanobis distances.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons at an enemy. Dynamic time warping (DTW) is widely used for aligning two given multidimensional sequences; it finds an optimal match between them. The distance between the aligned sequences should be smaller than between the unaligned sequences, so the improvement in alignment may be estimated from the corresponding distances. This technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than 20% reduction in the average Mahalanobis distances.
This document presents an overview of voice morphing technology. It discusses that voice morphing is a technique to modify a source speaker's voice to sound like a target speaker. It describes the need for voice morphing in applications like text-to-speech, public address systems, and for special effects. The technical process involves extracting spectral and pitch information from both voices and using algorithms like dynamic time warping and signal re-estimation to morph the source voice into the target voice. Some applications discussed are for altering evidence in courts or creating fake orders in military conflicts.
IRJET- Designing and Creating Punjabi Speech Synthesis System using Hidden Ma... (IRJET Journal)
This document describes the design of a Punjabi speech synthesis system using Hidden Markov Models. It discusses collecting Punjabi text from various domains to build a speech corpus. Features are extracted from the text and stored in a database. The system has offline and online phases, where the database is created offline and text-to-speech conversion occurs online. Hidden Markov Models are used for statistical parametric speech synthesis, modeling acoustic features like fundamental frequency, duration, and spectrum. The system breaks text into phonetic units like phonemes and diphones to generate waveforms for natural-sounding synthesized speech.
This paper describes a morphing concept in which we convert the voice of any person into the pre-analyzed or pre-recorded voice of an animal. As the user speaks, his pitch, timbre, vibrato, and articulation can be modified to resemble those of a pre-recorded and pre-analyzed animal voice. The technique is based on SMS. Using this concept, many entertaining applications can be developed for mobile devices, personal computers, and other platforms.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a different target speaker. The process involves preprocessing the speech signal, analyzing the pitch and envelope, morphing through warping and interpolation, and re-estimating the signal. To morph voices between a male and female speaker, the pitch of the male speaker is shifted to match that of the female speaker by time-stretching the residue signal and adjusting the LPC coefficients. Potential applications include using popular speakers for public announcements, and effects in films, but limitations include difficulties in voice detection and updating systems for new languages.
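The pitch-scaling step can be approximated with off-the-shelf tools. The sketch below estimates the median pitch of source and target and shifts the source by the corresponding number of semitones using librosa; the tone signals are stand-ins for real recordings, and a real morphing system would also transform the spectral envelope.

```python
# Toy sketch of the pitch-scaling step in voice morphing: estimate the
# median pitch of source and target, then shift the source by the
# corresponding number of semitones. The sine tones are stand-ins for
# real recordings; a full system would also map the vocal-tract filter.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
source = np.sin(2 * np.pi * 110 * t)   # placeholder "source speaker"
target = np.sin(2 * np.pi * 220 * t)   # placeholder "target speaker"

def median_f0(y, sr):
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    return np.median(f0)

# Semitone offset that maps the source pitch range onto the target's.
n_steps = 12 * np.log2(median_f0(target, sr) / median_f0(source, sr))
morphed = librosa.effects.pitch_shift(source, sr=sr, n_steps=n_steps)
```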
Similar to Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input (20)
The document describes a real-time DNN voice conversion system with feedback to acquire character traits. It proposes a method to provide real-time feedback of the converted voice to the speaker to encourage speech modification (prosody and emphasis) towards the target speaker's character. Subjective evaluations from the first-person (user) perspective and third-person perspective found that the system improved the reproduction of the target speaker's character, especially for inexperienced users. Providing only pitch feedback was already quite effective.
2. /17
Speech-based creative activities and HMM-based speech synthesis
[Figure: speech-based creative activities, spanning singing voice (advertisements, live concerts) and speech (narration, video avatars, voice acting, ...), asking what comes next. A useful method: HMM-based speech synthesis [Tokuda et al., 2013], which converts input text into synthetic speech parameters and then into speech, e.g. on the command "Synthesize!".]
3. /17
Manual control of synthetic speech
[Figure: two existing approaches with the user in the loop: the multi-regression HMM [Nose et al., 2007], which controls styles such as "laugh" and "sad" by regression, and manual manipulation of HMM parameters.]
These approaches are very useful, but it is difficult to control synthetic speech exactly as the user wants.
4. /17
Motivation of this study
Functions we want
– The original capability of HMM-based TTS
– Speech-based control
• Intuitive to control
• Makes synthetic speech mimic the prosody of the input speech
Our work
– Speech synthesis having both functions
– Similar to VOCALISTENER for singing voice control
[Figure: the user says "Synthesize." and the system synthesizes speech, either via the MR-HMM and related methods or via the proposed speech-based control.]
5. /17
Overview of the proposed system (only text is input)
[Diagram: input text → text analysis → parameter generation with the synthesis HMM → waveform generation → synthetic speech. This is the original HMM-based speech synthesis pipeline.]
6. /17
Overview of the proposed system (text & speech are input)
[Diagram: the input speech goes through speech analysis, alignment with the alignment HMM, and duration extraction, while the input text goes through text analysis. Parameter generation with the synthesis HMM uses the extracted durations; the generated F0 is then adjusted by F0 modification based on the input speech, and waveform generation produces the synthetic speech.]
8. /17
Alignment accuracy & duration unit
How to build alignment HMMs suitable for the input speech?
– Use pre-recorded speech uttered by the user:
– Large amounts → user-dependent HMMs
– Small amounts → HMMs adapted from the original alignment HMMs
How to map the input speech duration to the synthetic speech?
– Alignment and synthesis HMM states represent different speech segments.
– Which is better: HMM-state, phone, or mora-level duration units?
9. /17
Speech parameter generation module
[Diagram: the context of the input text is fed, together with the state durations obtained by duration extraction, into parameter generation with the synthesis HMM. The module outputs the spectrum of the synthetic speech and the F0 generated from the HMMs, which go on to F0 modification and waveform generation.]
10. /17
F0 modification module
[Diagram: F0 features of the input speech and the F0 generated from the HMMs enter F0 conversion and then U/V (unvoiced/voiced) region modification; the resulting F0 of the synthetic speech is passed, together with the generated parameters, to waveform generation.]
11. /17
F0 conversion & unvoiced/voiced modification
[Figure: F0 trajectories over time, comparing the reference generated from the HMMs, the input speech, the F0-converted trajectory (linear conversion), and the U/V-modified trajectory (spline interpolation).]
F0 conversion adjusts the F0 range of the input speech to fit the reference.
U/V modification adjusts the U/V regions of the input speech to fit the reference.
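A rough sketch of these two operations, under the assumption that the input and reference F0 sequences are already time-aligned and expressed as log F0; all names are illustrative.

```python
# Sketch of the two operations on this slide: a linear transform maps
# the log-F0 statistics of the input speech onto the HMM-generated
# reference, and spline interpolation fills frames whose voicing must
# flip from unvoiced to voiced. Sequences are assumed time-aligned.
import numpy as np
from scipy.interpolate import CubicSpline

def convert_f0(lf0_in, lf0_ref, voiced_in, voiced_ref):
    out = np.full_like(lf0_ref, np.nan)
    m_in, s_in = lf0_in[voiced_in].mean(), lf0_in[voiced_in].std()
    m_ref, s_ref = lf0_ref[voiced_ref].mean(), lf0_ref[voiced_ref].std()
    # Linear conversion: shift and scale the input range onto the reference.
    out[voiced_in] = (lf0_in[voiced_in] - m_in) * (s_ref / s_in) + m_ref
    # U/V modification: frames voiced in the reference but unvoiced in
    # the input get F0 values by spline interpolation over voiced frames.
    known = np.flatnonzero(voiced_in)
    spline = CubicSpline(known, out[known])
    need = voiced_ref & ~voiced_in
    out[need] = spline(np.flatnonzero(need))
    out[~voiced_ref] = np.nan  # frames unvoiced in the reference stay unvoiced
    return out
```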
13. /17
Experimental setup
User: 4 Japanese speakers (2 male & 2 female)
Target speaker: 1 Japanese female speaker
Training data of synthesis HMMs: 450 phoneme-balanced sentences, 16 kHz sampling, 5 ms shift, reading style
Evaluation data: 53 phoneme-balanced sentences
Speech features: 25-dim. mel-cepstrum, log F0, 5-band aperiodicity
Speech analyzer: STRAIGHT [Kawahara et al., 1999]
Text analyzer: Open JTalk
Acoustic model: 5-state HSMM [Zen et al., 2007]
Evaluations:
1. duration unit & alignment HMM adaptation
2. synthesis HMM adaptation
3. effect of U/V modification
14. /17
Evaluation 1: duration unit & alignment HMM adaptation
3 duration units
– State / phoneme / mora-level duration
4 alignment HMMs using different amounts of pre-recorded speech
– 0 … target-speaker-dependent HMMs (= synthesis HMMs)
– 1 … HMMs adapted using 1 utterance from the user
– 56 … HMMs adapted using 56 utterances
– 450 … user-dependent HMMs
Evaluation
– MOS test on the naturalness of synthetic speech
– DMOS test on the prosody-mimicking ability of synthetic speech
• Input speech is presented as the reference.
15. /17
Result 1: duration unit & alignment HMM adaptation
[Figure: MOS on naturalness and DMOS on prosody-mimicking ability (5-point scale) versus the amount of pre-recorded speech (0, 1, 56, and 450 utterances), for state-, phone-, and mora-level duration units; the differences among duration units are not significant.]
We can confirm that (1) adaptation is effective, and (2) the phoneme-level duration unit is relatively robust.
16. /17
Experiment 2: effectiveness of U/V modification in naturalness
[Figure: left, preference scores on naturalness [%] with and without U/V modification for Spkr1 through Spkr4; right, U/V modification ratios [%] (U→V and V→U) for the same speakers.]
U/V modification can improve naturalness, especially when many unvoiced frames of the input speech are fixed.
17. /17
Conclusion
2 functions to control synthetic speech
– The original function of HMM-based TTS
• MR-HMM or manual control
– Speech-based control
• Intuitive for users
2 main modules of our system
– Mimic duration
• Copy the duration of the input speech to the synthetic speech.
– Mimic F0 patterns
• Copy the dynamic F0 pattern of the input speech to the synthetic speech.
Future work
– HMM selection using text & speech