The NAIST Text-to-Speech System for Blizzard Challenge 2015

2015©Shinnosuke TAKAMICHI 09/11/2015
shinnosuke-t@is.naist.jp
Shinnosuke Takamichi,
Kazuhiro Kobayashi, Kou Tanaka,
Tomoki Toda, Satoshi Nakamura
(NAIST, Japan)
Blizzard Challenge 2015

Blizzard Challenge 2015
 Languages
– Bengali, Hindi, Malayalam, Marathi, Tamil, & Telugu
– + English
 Provided data
– UTF-8-encoded text & 16 kHz-sampled speech waveform
– → We need to develop both natural language processing (front-end) and
speech waveform generation (back-end).
 2 tasks
– Mono-lingual task (IH1) … 6 Indian languages
– Multi-lingual task (IH2) … Indian languages + English

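Overview of our TTS system
[Figure: system overview. Training: text and speech from the provided database go through text processing and speech processing, yielding context labels and speech features that are used to build the HSMM & MS database. Synthesis: context labels of the input text and the HSMM & MS database are used to generate synthetic speech.]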
 Our system
– HMM-based TTS with 4 main modules
– No external data is used in any module
 New functions
– Parameter trajectory smoothing in the speech processing module
– Modulation Spectrum (MS) in the synthesis module

Text processing module
[Figure: Text → text analysis → context generation → (discrete) context labels]
 Bengali, Hindi, Tamil, & Telugu
Festvox ver. 2.7 recipes [Black et al., 2001.]
 Marathi
Festvox recipe for Hindi
 Malayalam
Rule-based analysis [Nair et al., 2013.] … Stress is not extracted.
 Same contexts for all languages
Phoneme, syllable, & stress
Vowel/consonant, articulator & U/V
Position of phoneme, syllable, & word
The number of phonemes, syllables, & words
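The context categories listed above can be pictured as one record per phoneme. The following is only a minimal sketch: the class name `PhonemeContext`, its field names, the articulation value, and the `|`-separated label string are all hypothetical and are not the label format used by the system or by HTS.

```python
from dataclasses import dataclass


@dataclass
class PhonemeContext:
    """Hypothetical container for the context categories listed on the slide."""
    phoneme: str
    syllable: str
    stressed: bool                 # stays False where stress is not extracted
    is_vowel: bool
    articulation: str              # made-up category name, e.g. "bilabial"
    voiced: bool                   # U/V
    phoneme_pos_in_syllable: int   # positions are 1-based
    syllable_pos_in_word: int
    word_pos_in_utterance: int
    n_phonemes_in_syllable: int
    n_syllables_in_word: int
    n_words_in_utterance: int

    def to_label(self) -> str:
        """Serialize to a made-up label string (not the HTS label format)."""
        fields = [
            self.phoneme,
            self.syllable,
            f"stress={int(self.stressed)}",
            "V" if self.is_vowel else "C",
            self.articulation,
            "voiced" if self.voiced else "unvoiced",
            f"ph={self.phoneme_pos_in_syllable}/{self.n_phonemes_in_syllable}",
            f"syl={self.syllable_pos_in_word}/{self.n_syllables_in_word}",
            f"wd={self.word_pos_in_utterance}/{self.n_words_in_utterance}",
        ]
        return "|".join(fields)


# First phoneme of a two-phoneme syllable, in the second of four words.
example = PhonemeContext("m", "ma", False, False, "bilabial", True, 1, 1, 2, 2, 3, 4)
label = example.to_label()
```

Because the same fields are filled for every language, a language without stress extraction simply leaves `stressed` at `False`.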

Speech processing module
[Figure: speech analysis pipeline with three branches.
 Spectrum extraction → trajectory smoothing → 61-dim. mel-cepstrum
 F0 extraction → U/V symbol; continuous F0 extraction → trajectory smoothing → continuous F0
 Aperiodicity extraction → band averaging → 5-band aperiodicity]
*STRAIGHT [Kawahara et al.], WORLD [Morise et al.]

Motivation of parameter trajectory smoothing
 Motivation
– Remove temporal fluctuations that are difficult to model with HMMs
 Examples
– Fluctuating sequence vs. Smooth sequence
[Figure: fluctuating and smooth sequences, each drawn with mean ± variance]

Modulation spectrum analysis for parameter trajectory smoothing
 Modulation spectrum [Takamichi et al., 2014 & 2015.]
– Power spectra of the temporal parameter sequence
– An extension of Global Variance (GV) [Toda et al., 2007.]
[Figure: mel-cepstrum sequence → FFT & power → modulation spectrum plotted against modulation frequency. Annotations: "easy to model with HMMs", "difficult to model with HMMs", "dominant in speech perception".]
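Since the modulation spectrum is defined here as the power spectrum of a temporal parameter sequence, it can be computed with a plain FFT over the time axis. The sketch below assumes a frame rate of 200 frames per second (5 ms shift), which the slides do not state; the function name `modulation_spectrum` is likewise hypothetical.

```python
import numpy as np


def modulation_spectrum(seq, n_fft=None, frame_rate_hz=200.0):
    """Power spectrum of a parameter trajectory along the time axis.

    seq: array of shape (T,) or (T, D), one row per frame.
    Returns (freqs, ms): modulation frequencies in Hz and the power
    spectrum of shape (n_fft // 2 + 1, D).
    frame_rate_hz is an assumption; the slides do not give the frame shift.
    """
    seq = np.asarray(seq, dtype=float)
    if seq.ndim == 1:
        seq = seq[:, None]
    if n_fft is None:
        n_fft = seq.shape[0]
    spectrum = np.fft.rfft(seq, n=n_fft, axis=0)   # FFT over time, per dimension
    ms = np.abs(spectrum) ** 2                     # power
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate_hz)
    return freqs, ms


# A trajectory that oscillates 10 times per second peaks at 10 Hz.
frame_rate_hz = 200.0
t = np.arange(200) / frame_rate_hz
trajectory = np.sin(2 * np.pi * 10.0 * t)
freqs, ms = modulation_spectrum(trajectory, frame_rate_hz=frame_rate_hz)
peak_hz = freqs[np.argmax(ms[:, 0])]
```

Each column of a multi-dimensional sequence (for example, each mel-cepstral coefficient) gets its own modulation spectrum.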

Parameter trajectory smoothing (= High modulation freq. removal)
A 50 Hz-cutoff low-pass filter (LPF) is applied to the extracted parameters to remove high modulation frequencies.
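A later slide notes that the smoothing uses only a Butterworth LPF, so the step can be sketched with SciPy. This is not the authors' code: the filter order (4), the frame rate (200 frames per second), and the use of zero-phase filtering via `filtfilt` are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def smooth_trajectory(params, cutoff_hz=50.0, frame_rate_hz=200.0, order=4):
    """Low-pass filter each parameter dimension along the time axis.

    params: array of shape (T,) or (T, D), one row per frame.
    order and frame_rate_hz are assumed values, not taken from the slides.
    """
    params = np.asarray(params, dtype=float)
    b, a = butter(order, cutoff_hz, btype="low", fs=frame_rate_hz)
    # filtfilt runs the filter forward and backward, so no delay is added.
    return filtfilt(b, a, params, axis=0)


# A slow 5 Hz component plus an 80 Hz fluctuation; smoothing keeps the former.
frame_rate_hz = 200.0
t = np.arange(400) / frame_rate_hz
slow = np.sin(2 * np.pi * 5.0 * t)
noisy = slow + 0.5 * np.sin(2 * np.pi * 80.0 * t)
smoothed = smooth_trajectory(noisy, frame_rate_hz=frame_rate_hz)
```

Zero-phase filtering is chosen here so that the smoothed trajectory stays time-aligned with the context labels; a causal filter would shift it by a few frames.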

Training module
[Figure: mel-cepstrum, cont. F0, U/V symbol, and aperiodicity → HSMM database and MS database]
 HSMM training
– ML training of context-dependent HSMM [Yoshimura et al., 1999.]
– MDL-based clustering [Shinoda et al., 2000.]
 MS model training
– Mean-normalized MS [Takamichi et al., 2014.]
– ML training of Gaussian distribution
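The MS model training above can be sketched as fitting one Gaussian per modulation-frequency bin. The code below is a rough illustration under several assumptions that the slides do not confirm: "mean-normalized" is read as subtracting each utterance's temporal mean before the FFT, the MS is taken in the log domain, the covariance is diagonal, and a fixed FFT length of 256 is used. All function names are hypothetical.

```python
import numpy as np


def mean_normalized_ms(seq, n_fft):
    """Log modulation spectrum of one utterance after removing its temporal mean.

    Sequences shorter than n_fft are zero-padded; longer ones are truncated.
    """
    seq = np.asarray(seq, dtype=float)
    centered = seq - seq.mean(axis=0, keepdims=True)
    power = np.abs(np.fft.rfft(centered, n=n_fft, axis=0)) ** 2
    return np.log(power + 1e-10)   # small floor avoids log(0) at the DC bin


def train_ms_gaussian(sequences, n_fft=256):
    """ML estimate of a diagonal Gaussian over utterance-level MS features."""
    feats = np.stack([mean_normalized_ms(s, n_fft) for s in sequences])
    mean = feats.mean(axis=0)
    var = feats.var(axis=0)        # ML variance: divides by N, not N - 1
    return mean, var


# Toy data standing in for five utterances of 3-dimensional parameters.
rng = np.random.default_rng(0)
sequences = [rng.standard_normal((100, 3)) for _ in range(5)]
ms_mean, ms_var = train_ms_gaussian(sequences, n_fft=256)
```

After mean normalization the DC bin carries almost no power, so the model describes only the fluctuation around the utterance mean.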

Synthesis module
[Figure: synthesis pipeline. Inputs: context labels of input text, HSMM database, MS database. Blocks: spectrum generation w/ MS, cont. F0 generation, aperiodicity generation, U/V symbol generation, MS-based post-filter, smoothing in silence*, MLSA filter. Output: synthetic speech.]
* For reducing unnatural power in silence

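Speech parameter generation algorithm considering MS
[Figure: generated parameters w/ ~50 Hz MS vs. w/o MS]
$$\hat{\boldsymbol{y}} = \operatorname*{argmax}_{\boldsymbol{y}} \; P_{\mathrm{HMM}}(\boldsymbol{W}\boldsymbol{y}) \, P_{\mathrm{MS}}(\boldsymbol{s}(\boldsymbol{y}))^{\omega}$$
$\boldsymbol{y}$: speech parameters, $\boldsymbol{W}$: delta window, $\boldsymbol{s}(\boldsymbol{y})$: MS of $\boldsymbol{y}$, $\omega$: weight of the MS term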
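Taking the logarithm of the criterion on this slide makes the role of the weight explicit and hints at why generation is iterative. The following is a sketch, assuming ω acts as an exponent on the MS likelihood and that a plain gradient-ascent update with a hypothetical step size α is used; the slides do not give the actual update rule.

```latex
\begin{align}
L(\boldsymbol{y}) &= \log P_{\mathrm{HMM}}(\boldsymbol{W}\boldsymbol{y})
                   + \omega \log P_{\mathrm{MS}}(\boldsymbol{s}(\boldsymbol{y})) \\
\boldsymbol{y}^{(i+1)} &= \boldsymbol{y}^{(i)}
                   + \alpha \left. \frac{\partial L}{\partial \boldsymbol{y}} \right|_{\boldsymbol{y} = \boldsymbol{y}^{(i)}}
\end{align}
```

With ω = 0 the second term vanishes and the criterion reduces to generation from the HMM likelihood alone. For ω > 0 the MS term is a nonlinear function of the parameters, so the maximum generally has to be approached step by step.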

Speech samples
Language w/o MS w/ MS
Bengali
Hindi
Malayalam
Marathi
Tamil
Telugu

Evaluation of synthesizer
 Evaluation
– Naturalness: 5-point MOS score
– Intelligibility: WER of listening tests
– Similarity: 5-point DMOS score
 Result shown in this talk
– Naturalness: mean of MOS score of RD task in Marathi
– Intelligibility: mean of WER in Marathi
– Similarity: mean of DMOS score of RD task in Marathi
– + rank of these scores in all languages
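For reference, WER is conventionally the word-level edit distance between what the listener typed and the reference sentence, divided by the number of reference words. The sketch below implements that standard definition; the slides do not say how the challenge organizers scored the listening tests, so treat the function `word_error_rate` as an illustration only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the words of ref seen so far
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The mean WER reported for a system is then the average of this value over all transcribed test sentences.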

5-point MOS score on naturalness
Our place for RD task (our place / #-of-systems)
Language   Bengali  Marathi  Hindi   Tamil   Malayalam  Telugu
Place      6 / 10   2 / 9    4 / 10  6 / 10  2 / 10     2 / 10
Results in Marathi

WER for intelligibility
Our place (our place / #-of-systems)
Language   Bengali  Marathi  Hindi   Tamil   Malayalam  Telugu
Place      5 / 10   1 / 9    7 / 10  8 / 10  4 / 10     4 / 10
Results in Marathi

5-point DMOS score on similarity
Our place for RD task (our place / #-of-systems)
Language   Bengali  Marathi  Hindi   Tamil   Malayalam  Telugu
Place      5 / 10   5 / 9    4 / 10  5 / 10  8 / 10     5 / 10
Results in Marathi

Strengths and weaknesses
 Good!
– Naturalness of synthetic speech
– Intelligibility of synthetic speech (in Marathi)
– Small footprint (10 ~ 20 MB)
– Fast training (~ 10 hours for 1 system)
 Weak…
– Similarity of synthetic speech
– Slow synthesis (3 minutes for 1 sentence)
• Because generation considering MS is iterative.
• Possible remedies: early stopping of the iteration; parallelization of the generation algorithm

Is our system open source?
 Text processing
– Text analyzer … Yes (Festvox) except Malayalam
– Context generator … Yes (my GitHub*1)
 Speech processing
– Speech analyzer … Yes (STRAIGHT & WORLD)
– Spectral smoothing … No, but it uses only a Butterworth LPF.
 Training
– HSMM & MS model training … Yes (HTS & SPTK)
 Synthesis
– Generation w/ MS … No, but post-filter is available (HTS).
*1: search “shinnosuke takamichi”

Conclusion
 Our challenge
– Mono-lingual task (IH1) for Indian languages
 Our TTS synthesizer
– HMM-based TTS with 4 main modules
– Parameter trajectory smoothing in the speech processing module
– Modulation spectrum in the synthesis module
 Future work
– Combine with statistical sample-based method [Takamichi et al., 2014.]