Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable Model
Tomoki Koriyama¹,², Takao Kobayashi¹
¹ Tokyo Institute of Technology, Japan   ² Currently with The University of Tokyo, Japan
Abstract
•Prosody labeling is important for TTS but laborious
•Use deep Gaussian process (DGP), a Bayesian deep model, to
represent prosodic context labels as latent variables
•Propose semi-supervised modeling for partially-annotated data, in
which the latent variables are used in place of annotated prosody
•Perform experiments in which only around 10% of the training data is fully annotated
Conclusions & Future Work
•The proposed semi-supervised modeling with DGP
– Gave scores comparable to the case where all training data was fully annotated
– Outperformed the case using the data without accent information
•Future work
– Use diverse speech data including low-resource languages
– Compare with other generative models, e.g., Bayesian NNs, VAEs, flow-based models
Background
•Constructing TTS requires manual annotation of prosody labels, which takes much time and effort
End-to-end approach [Wang et al., 2017][Sotelo et al., 2017]
•End-to-end TTS is language-dependent
•Japanese TTS still requires prosodic context labels [Yasuda et al., 2019]
[Figure: Semi-supervised prosody modeling using DGP-LVM. (a) Fully-annotated data: acoustic features are predicted from the accent-independent context together with the manually annotated accent-dependent context, which is passed through an encode function of accent contexts. (b) Partially-annotated data: a latent variable serves as the accent information representation in place of the annotation. A common function maps both kinds of input to acoustic features.]
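To make the idea in the figure concrete, the following is a minimal NumPy sketch of how the shared regression input could be assembled for the two kinds of data: the annotated accent-dependent context goes through an encoding function, while a low-dimensional latent variable takes its place when no annotation is available. The array names, the linear encoder, and the concatenation scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_INDEP = 477   # accent-independent context dims (from the poster)
D_ACCENT = 137  # accent-dependent context dims (from the poster)
D_LATENT = 3    # latent-space dims (from the poster)

# Hypothetical linear "encode function of accent contexts": it maps the
# annotated accent context into the same low-dimensional space as the
# latent variable, so one common function can serve both kinds of data.
W_enc = rng.normal(size=(D_ACCENT, D_LATENT))

def encode_accent(accent_context):
    """Encode an annotated accent-dependent context into the latent space."""
    return accent_context @ W_enc

def make_input(indep_context, accent_context=None, latent=None):
    """Build the input of the shared regression function.

    Fully-annotated data  : accent_context is given, latent is None.
    Partially-annotated   : accent_context is None, a variational latent
                            variable is used in its place.
    """
    if accent_context is not None:
        accent_repr = encode_accent(accent_context)
    else:
        accent_repr = latent
    return np.concatenate([indep_context, accent_repr], axis=-1)

# One fully-annotated and one partially-annotated example frame.
x_full = make_input(rng.normal(size=D_INDEP), accent_context=rng.normal(size=D_ACCENT))
x_part = make_input(rng.normal(size=D_INDEP), latent=np.zeros(D_LATENT))
print(x_full.shape, x_part.shape)  # both (480,): the same common function applies
```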
[Figure: Example of generated F0 contours, F0 [Hz] (150-400) vs. time [s] (0-3), for (a) FULL, (b) LABELED, (c) W/O ACCENT, and (d) PROPOSED.]
[Figure: Inference in (a) GP regression, (b) GPLVM, (c) DGP regression, and (d) DGP-LVM, illustrated with the phrase "ha shi ga" as input.]
Purpose
– Incorporate DGP with LVM into prosody modeling
– Apply latent representation to semi-supervised learning
Problems in Japanese pitch accent
•Word meanings depend on accent
•Accent is not purely lexical; it varies with speakers and contexts
GP, GPLVM, Deep Gaussian process
Latent variable approach
•Gaussian process latent variable model (GPLVM) can represent unannotated prosody information as latent variables [Moungsri et al., 2016]
•A single-layer GP lacks expressiveness in modeling
•Deep Gaussian process (DGP) [Damianou&Lawrence, 2013]
– Deep model with stacked Bayesian kernel regressions
– Outperformed 1-layer GP and DNN in TTS [Koriyama&Kobayashi, 2019]
•The posteriors of the functions and latent variables are inferred simultaneously [Titsias&Lawrence, 2009][Damianou&Lawrence, 2013] (a toy GP regression sketch is given below)
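For reference, the sketch below shows plain GP regression with a squared-exponential kernel, which is the building block that GPLVM and DGP extend. The poster's actual model uses an ArcCos kernel, inducing points, and variational inference, so this is only a toy illustration on synthetic 1-D data.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    """Exact GP regression posterior mean and variance at test inputs X_star."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

# Toy 1-D regression problem.
X = np.linspace(0, 1, 10)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * np.random.default_rng(0).normal(size=10)
mean, var = gp_posterior(X, y, np.linspace(0, 1, 50)[:, None])

# GPLVM: the inputs X themselves are treated as latent variables and inferred
# (e.g., by maximizing the marginal likelihood or a variational bound).
# DGP: several such GP layers are stacked, h1 = f1(x), y = f2(h1), giving a
# deeper and more expressive Bayesian regression model.
```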
[Figure: Japanese pitch accent examples. Different low/high pitch patterns of "ha shi ga" mean "edge is", "bridge is", or "chopsticks are"; the pitch pattern of "a ta ma" ("head") differs between Speaker 1 and Speaker 2.]
Semi-supervised learning of prosody using DGP-LVM
•Use both fully-annotated and partially-annotated data
•The partially-annotated data does not include accent information
•Infer the posteriors of the functions and latent variables by variational Bayes (a schematic bound is given below)
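Concretely, inferring the posteriors by variational Bayes amounts to maximizing an evidence lower bound that sums a supervised term over fully-annotated utterances and a latent-variable term over partially-annotated ones. The bound below is a schematic sketch in our own notation (y: acoustic features, x: accent-independent context, c: annotated accent-dependent context, z: latent accent variable, u: inducing variables); the exact bound in the paper additionally involves inducing variables and propagated uncertainty in every DGP layer.

```latex
\mathcal{L} =
   \sum_{n \in \mathcal{D}_{\text{full}}}
     \mathbb{E}_{q(f)}\bigl[\log p\bigl(\mathbf{y}_n \mid f(\mathbf{x}_n, \mathbf{c}_n)\bigr)\bigr]
 + \sum_{m \in \mathcal{D}_{\text{part}}}
     \mathbb{E}_{q(f)\,q(\mathbf{z}_m)}\bigl[\log p\bigl(\mathbf{y}_m \mid f(\mathbf{x}_m, \mathbf{z}_m)\bigr)\bigr]
 - \sum_{m \in \mathcal{D}_{\text{part}}} \mathrm{KL}\bigl(q(\mathbf{z}_m) \,\|\, p(\mathbf{z}_m)\bigr)
 - \mathrm{KL}\bigl(q(\mathbf{u}) \,\|\, p(\mathbf{u})\bigr)
```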
Experiments
Acoustic feature distortions
Method        MCD [dB]   RMSE of log F0 [cent]
FULL          4.79       167
LABELED       5.54       228
W/O ACCENT    4.75       207
PROPOSED      4.76       178
Experimental conditions
Database:             Japanese speech data of a female speaker in the XIMERA corpus [Kawai et al., 2004]
Train / Valid / Test: 1533 (119 min) / 60 / 60 utterances
                      – 99 fully-annotated and 1434 partially-annotated training utterances
Input features:       accent-dependent / accent-independent context: 137 / 477 dims
Acoustic features:    40-dim mel-cepstrum, log F0, 5-band aperiodicity, and their Δ+Δ²
Model architecture:   5 layers, ArcCos kernel [Cho&Saul, 2009] (sketched below), 1024 inducing points, latent space: 3 dims, hidden layers: 32 dims
Model training:       optimizer: Adam, learning rate: 0.01
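The model architecture row above names the ArcCos kernel of [Cho&Saul, 2009]. Below is a small NumPy sketch of the order-1 arc-cosine kernel as defined in that paper; the kernel order (0 vs. 1), the numerical epsilon, and the toy data are our assumptions, since the poster does not state them.

```python
import numpy as np

def arccos_kernel_order1(X1, X2, eps=1e-12):
    """Order-1 arc-cosine kernel (Cho & Saul, 2009):
    k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(t) + (pi - t) * cos(t)),
    where t = arccos( x.y / (||x|| ||y||) ).
    """
    n1 = np.linalg.norm(X1, axis=1)[:, None]
    n2 = np.linalg.norm(X2, axis=1)[None, :]
    cos_t = np.clip(X1 @ X2.T / (n1 * n2 + eps), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (n1 * n2 / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))

# Gram matrix for a few random context vectors.
X = np.random.default_rng(0).normal(size=(5, 8))
K = arccos_kernel_order1(X, X)
print(K.shape)  # (5, 5), symmetric positive semi-definite
```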
Methods (# of utterances for each method)
Method        Fully-annotated data      Partially-annotated data
              (w/ accent info.)         (w/o accent info.)
FULL          1533                      –
LABELED       99                        –
W/O ACCENT    –                         1533
PROPOSED      99                        1434

Subjective evaluation
Example: generated F0 contours (figure shown above)