A Study of Sparse Approximation of Gram Matrices
for GMMN-based Speech Synthesis

Background
‣ Statistical speech synthesis
•Models the relationship between input context and output acoustic
features
- In general, synthetic speech sounds the same every time
if the sentence is the same
- This differs from real human communication
‣ Sampling-based speech synthesis [Takamichi et al., 2017]
•Models the relationship between input context and 
the distribution of output acoustic features
•Samples speech parameters from the distribution
•Uses a generative moment matching network (GMMN) as the model

Generative moment matching network (GMMN)
‣ Generative model based on DNN
•Predicts samples of the output distribution from a noise vector
•Uses conditional maximum mean discrepancy (CMMD) as the cost
function
•Applications
- i-vector for speaker verification [Shiota et al., 2018]
- singing voice for double-tracking [Tamaru et al., 2019]
•Advantage
- Sampling is easy and does not require assuming a parametric p.d.f.
- Min-max optimization is not required

Purpose
‣ Computational complexity problem
•CMMD is computationally infeasible for a large amount of data
- $O(N^3)$, where $N$ is the number of training data points
•Conventional method
- Partitions the data into randomly selected minibatches
- Calculates CMMD for each minibatch
‣ Purpose of this study
•Review approximation methods for CMMD, which is used as the
cost function of GMMN
•Evaluate the naturalness and diversity of the generated synthetic speech

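Maximum Mean Discrepancy [Gretton et al., 2012]
The distance between two distributions is defined as
the distance between their mean embeddings in an RKHS
$$\mathrm{MMD}^2 = \|\mu - \tilde{\mu}\|^2, \quad \mu = \mathbb{E}[\phi(y)], \quad \tilde{\mu} = \mathbb{E}[\phi(\tilde{y})]$$
[Figure: samples $\{y_i\}$ from $P(Y)$ and $\{\tilde{y}_i\}$ from $P(\tilde{Y})$ are mapped by $\phi$ into the RKHS, where their means are $\mu$ and $\tilde{\mu}$]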
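
The MMD definition above can be estimated from samples by replacing the RKHS means with averages over Gram matrices. The following is a minimal NumPy sketch, not the authors' implementation: the RBF kernel, the bandwidth `gamma`, the biased (V-statistic) estimator, and the toy Gaussian data are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(a, b, gamma=0.5):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(y, y_tilde, gamma=0.5):
    """Biased estimate of MMD^2 = ||mu - mu_tilde||^2 via the kernel trick."""
    return (rbf_gram(y, y, gamma).mean()
            + rbf_gram(y_tilde, y_tilde, gamma).mean()
            - 2.0 * rbf_gram(y, y_tilde, gamma).mean())

# Toy data (hypothetical): two sample sets from the same distribution,
# and one from a shifted distribution.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=(200, 2))
same = rng.normal(0.0, 1.0, size=(200, 2))
shifted = rng.normal(3.0, 1.0, size=(200, 2))

mmd_same = mmd2(y, same)        # small: the distributions match
mmd_shifted = mmd2(y, shifted)  # larger: the distributions differ
```

In this sketch the estimate is close to zero for matching distributions and grows when the second distribution is shifted.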

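Conditional MMD (CMMD) [Ren et al., 2012]
The distance between two conditional distributions is defined as
the distance between linear operators on RKHSs
$$\mathrm{CMMD}^2 = \|C - \tilde{C}\|^2, \quad \mu = C\psi(x), \quad \tilde{\mu} = \tilde{C}\psi(x)$$
[Figure: the input $x$ is mapped to $\psi(x)$ in an RKHS; samples $\{y_i\}$ from $P(Y|x)$ and $\{\tilde{y}_i\}$ from $P(\tilde{Y}|x)$ are mapped by $\phi$ into an RKHS, where their conditional means are $\mu$ and $\tilde{\mu}$]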

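Conditional MMD (CMMD)
$$\mathrm{CMMD}^2 = \|C - \tilde{C}\|^2$$
- The linear operators $C$, $\tilde{C}$ are estimated by kernel regression
- The kernel trick is used
$$\mathrm{CMMD}^2 = \mathrm{Tr}\left[\left(K_{Y,Y} + K_{\tilde{Y},\tilde{Y}} - 2K_{Y,\tilde{Y}}\right)(H + \lambda I)^{-1} H (H + \lambda I)^{-1}\right]$$
- $K_{Y,Y}$, $K_{\tilde{Y},\tilde{Y}}$, $K_{Y,\tilde{Y}}$: Gram matrices for output
- $H$: Gram matrix for input
The distance between two conditional distributions is calculated from
the kernel functions of the input features and output features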
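
The trace formula above translates almost directly into code. This is a minimal NumPy sketch under assumptions not stated on the slide: an RBF kernel for both input and output, a regularizer `lam = 0.1`, and toy one-dimensional data. It uses an explicit $N \times N$ inverse for readability.

```python
import numpy as np

def rbf_gram(a, b, gamma=0.5):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def cmmd2(x, y, y_tilde, lam=0.1, gamma=0.5):
    """Tr[(K_YY + K_Y~Y~ - 2 K_YY~)(H + lam I)^-1 H (H + lam I)^-1]."""
    n = x.shape[0]
    h = rbf_gram(x, x, gamma)                      # Gram matrix for input
    k = (rbf_gram(y, y, gamma)                     # Gram matrices for output
         + rbf_gram(y_tilde, y_tilde, gamma)
         - 2.0 * rbf_gram(y, y_tilde, gamma))
    g = np.linalg.inv(h + lam * np.eye(n))         # O(N^3) inverse
    return np.trace(k @ g @ h @ g)

# Toy data (hypothetical): y depends on x through sin(x) plus noise.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(x) + 0.1 * rng.normal(size=(n, 1))
y_same = np.sin(x) + 0.1 * rng.normal(size=(n, 1))        # same conditional
y_diff = np.sin(x) + 2.0 + 0.1 * rng.normal(size=(n, 1))  # shifted conditional

cmmd_same = cmmd2(x, y, y_same)
cmmd_diff = cmmd2(x, y, y_diff)
```

The value is smaller when the two sample sets follow the same conditional distribution than when one of them is shifted.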

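Generative Moment Matching Network (GMMN)
[Ren et al., 2012]
Predicts samples of conditional distributions
using a DNN, which is trained with the CMMD cost function
- $\{y_i\}$: training data points
- $\{n_i;\ n_i \sim \mathcal{N}(0, I)\}$: noise
[Figure: the DNN (GMMN) takes $x$ and the noise $\{n_i\}$ and outputs $\{\tilde{y}_i\}$; the CMMD between $\{y_i\}$ and $\{\tilde{y}_i\}$ is backpropagated to the DNN]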

GMMN-Based Speech Synthesis
Uses two DNNs: one trained with the MSE criterion, and a GMMN trained with
the CMMD criterion that predicts the residual of the acoustic features
[Figure: block diagram with the following components: context, DNN with MSE criterion, bottleneck feature, acoustic feature, random value, GMMN for sampling, two Gram matrices, and CMMD]

Problem of GMMN-based speech synthesis
$$\mathrm{CMMD}^2 = \mathrm{Tr}\left[\left(K_{Y,Y} + K_{\tilde{Y},\tilde{Y}} - 2K_{Y,\tilde{Y}}\right)(H + \lambda I)^{-1} H (H + \lambda I)^{-1}\right]$$
1. Calculation of Gram matrices: $O(N^2)$
2. Calculation of inverse matrix: $O(N^3)$
‣ Impossible to use CMMD directly for speech synthesis,
because $N$ is large in speech synthesis
‣ Unable to train a model by minibatch-based optimization

Local Approximation (Conventional Method)
‣ CMMD is calculated for each partitioned minibatch
$$\mathrm{CMMD}^2 = \mathrm{Tr}\left[\left(K_{Y,Y} + K_{\tilde{Y},\tilde{Y}} - 2K_{Y,\tilde{Y}}\right)(H + \lambda I)^{-1} H (H + \lambda I)^{-1}\right]$$
‣ This method can be regarded as a block-diagonal approximation
•Blocks are determined by the minibatches
‣ Computational complexity for each minibatch: $O(B^3)$
•$B$: minibatch size
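
The block-diagonal view can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' code: the RBF kernel, `lam`, the toy data, and the choice to sum the per-minibatch values are assumptions. Summing per-minibatch CMMD values gives the same result as the full trace formula with the input Gram matrix $H$ replaced by its block-diagonal part, because the resulting weight matrix is itself block-diagonal.

```python
import numpy as np

def rbf_gram(a, b, gamma=0.5):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def cmmd2(x, y, y_tilde, lam=0.1):
    """CMMD^2 trace formula on one set of points (O(B^3) for B points)."""
    n = x.shape[0]
    h = rbf_gram(x, x)
    k = rbf_gram(y, y) + rbf_gram(y_tilde, y_tilde) - 2.0 * rbf_gram(y, y_tilde)
    g = np.linalg.inv(h + lam * np.eye(n))
    return np.trace(k @ g @ h @ g)

def local_cmmd2(x, y, y_tilde, batches, lam=0.1):
    """Sum of CMMD^2 over minibatches (block-diagonal approximation)."""
    return sum(cmmd2(x[idx], y[idx], y_tilde[idx], lam) for idx in batches)

# Toy data (hypothetical) and a random partition into minibatches of size B.
rng = np.random.default_rng(0)
n, batch_size = 120, 40
x = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(x) + 0.1 * rng.normal(size=(n, 1))
y_tilde = np.sin(x) + 0.1 * rng.normal(size=(n, 1))
perm = rng.permutation(n)
batches = [perm[s:s + batch_size] for s in range(0, n, batch_size)]

approx = local_cmmd2(x, y, y_tilde, batches)
```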

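Random Fourier Features (RFF) [Rahimi & Recht, 2008]
The kernel function is approximated by the inner product of a finite
number of basis functions to obtain a low-rank Gram matrix
RBF kernel:
$$k_{\mathrm{RBF}}(x, x') = \exp\left(-\|x - x'\|^2 / 2\right)$$
RBF kernel approximated with RFF:
$$k_{\mathrm{RBF}}(x, x') \approx \frac{2}{M} \sum_{r=1}^{M} \cos(x^\top \omega_r + b_r)\cos(x'^\top \omega_r + b_r)$$
where $\omega_r \sim \mathcal{N}(0, I)$ and $b_r \sim \mathrm{Uniform}[0, 2\pi]$
[Figure: example Gram matrices on a color scale from -1.0 to 1.0: exact RBF Gram matrix with rank N=1000, and RFF-approximated Gram matrix with rank M=100]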
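
A minimal NumPy sketch of the RFF approximation follows. It assumes the standard Rahimi & Recht construction for the unit-bandwidth RBF kernel: frequencies drawn from $\mathcal{N}(0, I)$, phases drawn uniformly from $[0, 2\pi]$, and features scaled by $\sqrt{2/M}$. The sizes `d`, `m`, and the toy inputs are illustrative choices, not values from the slides.

```python
import numpy as np

def rff_features(x, omega, b):
    """Map x (n, d) to M random Fourier features, scaled by sqrt(2 / M)."""
    m = omega.shape[1]
    return np.sqrt(2.0 / m) * np.cos(x @ omega + b)

rng = np.random.default_rng(0)
d, m = 3, 2000                                # input dim, number of features
omega = rng.normal(size=(d, m))               # omega_r ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=m)     # b_r ~ Uniform[0, 2*pi]

x = rng.normal(size=(50, d))
z = rff_features(x, omega, b)                 # (50, M) feature matrix
approx = z @ z.T                              # Gram matrix of rank at most M

# Exact RBF Gram matrix exp(-||x - x'||^2 / 2) for comparison.
sq = np.sum((x[:, None, :] - x[None, :, :])**2, axis=2)
exact = np.exp(-0.5 * sq)

max_err = np.max(np.abs(approx - exact))      # shrinks as M grows
```

The Gram matrix is never needed explicitly: keeping only the feature matrix `z` is what makes the low-rank structure useful.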

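RFF-based Approximation
‣ Approximates the Gram matrix of the input features by RFF
$$\mathrm{CMMD}^2 = \mathrm{Tr}\left[\left(K_{Y,Y} + K_{\tilde{Y},\tilde{Y}} - 2K_{Y,\tilde{Y}}\right)(H + \lambda I)^{-1} H (H + \lambda I)^{-1}\right]$$
•The factors $(H + \lambda I)^{-1}$ are computed from the low-rank approximation of $H$
‣ Computational complexity can be reduced by the matrix inversion
formula
‣ Computational complexity for each minibatch: $O(BM^2)$
•$B$: minibatch size, $M$: RFF dimensions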
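
One way to exploit the low-rank structure is sketched below, assuming $H \approx ZZ^\top$ with a $B \times M$ feature matrix $Z$. The sketch uses the push-through form of the matrix inversion formula, $(ZZ^\top + \lambda I_B)^{-1} Z = Z (Z^\top Z + \lambda I_M)^{-1}$, so only an $M \times M$ system is solved and forming $Z^\top Z$ costs $O(BM^2)$. The random `z` and `k` are placeholders for the RFF features and the output Gram-matrix term; this is not necessarily the exact procedure used by the authors.

```python
import numpy as np

def weight_direct(z, lam):
    """(H + lam I)^-1 H (H + lam I)^-1 with H = Z Z^T, via a B x B inverse."""
    b = z.shape[0]
    h = z @ z.T
    g = np.linalg.inv(h + lam * np.eye(b))
    return g @ h @ g

def weight_factor(z, lam):
    """A = Z (Z^T Z + lam I_M)^-1, so that the weight matrix equals A A^T."""
    m = z.shape[1]
    small = z.T @ z + lam * np.eye(m)          # M x M, costs O(B M^2)
    return np.linalg.solve(small, z.T).T       # small is symmetric

rng = np.random.default_rng(0)
batch, m, lam = 200, 20, 0.1
z = rng.normal(size=(batch, m)) / np.sqrt(m)   # placeholder for RFF features
c = rng.normal(size=(batch, 5))
k = c @ c.T                                    # placeholder for the K term

w_direct = weight_direct(z, lam)
a = weight_factor(z, lam)

trace_direct = np.trace(k @ w_direct)
trace_fast = np.sum(a * (k @ a))               # Tr[K A A^T] = Tr[A^T K A]
```

Both routes give the same trace; the second never forms or inverts a $B \times B$ matrix built from $H$.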

Clustering for Minibatch Selection
‣ The conventional method chooses minibatches randomly
•Gram matrices tend to be sparse
- E.g., /a/ and /s/ are distant, so the kernel function value between them is almost zero
•A sparse matrix is redundant
‣ Collect similar contexts and use each cluster as a minibatch
•Perform K-means clustering (K=2) on bottleneck features
•Partition top-down until the cluster size becomes sufficiently small
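
The top-down partition can be sketched as recursive 2-means. This is a NumPy-only illustration with assumed details: the initialization, the number of K-means iterations, the fallback for degenerate splits, and the random 32-dimensional features standing in for bottleneck features are all hypothetical.

```python
import numpy as np

def two_means(feats, rng, n_iter=20):
    """Plain K-means with K=2; returns a boolean mask for the second cluster."""
    centers = feats[rng.choice(len(feats), size=2, replace=False)]
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(n_iter):
        dist = ((feats[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return labels == 1

def cluster_minibatches(feats, max_size, rng):
    """Split top-down with K=2 until every cluster has at most max_size points."""
    stack = [np.arange(len(feats))]
    batches = []
    while stack:
        idx = stack.pop()
        if len(idx) <= max_size:
            batches.append(idx)
            continue
        mask = two_means(feats[idx], rng)
        if mask.all() or not mask.any():
            # Degenerate split: fall back to halving so the loop terminates.
            mask = np.arange(len(idx)) < len(idx) // 2
        stack.append(idx[mask])
        stack.append(idx[~mask])
    return batches

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))      # stand-in for bottleneck features
batches = cluster_minibatches(feats, max_size=100, rng=rng)
sizes = [len(idx) for idx in batches]
```

Each returned index array can then be used as one minibatch, so that points with similar features share a Gram matrix block.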

Experimental Conditions
Database: 1 female speaker, 203 sentences (ATR B-set subsets a & j; REPEAT, included in the JSUT corpus). Each sentence was repeated 5 times.
Training data: 5 x 150 utterances (ATR-a and REPEAT)
Development set: 5 x 26 utterances (ATR-j27 to j53)
Test data: 27 utterances (ATR-j01 to j26); 5 samples are generated
Acoustic features: 0-39th mel-cepstrum, log F0, and 5-band aperiodicity with their delta and delta-delta, and VUV

Network configurations
Dimensions
- bottleneck feature: 32
- noise vector: 3
- hidden unit: 2014
# of hidden layers
- DNN with MSE criterion: 7
- GMMN: 3
Max minibatch size: 10000
RFF dimensions: 1024

Methods
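‣ MSE
•No sampling; uses only the DNN with MSE criterion
‣ VOC
•Vocoder speech of 5 different recordings
‣ Approximation methods
•LOCAL-RAND, LOCAL-CLST, RFF-RAND, RFF-CLST
- LOCAL / RFF: local approximation or RFF-based approximation
- RAND / CLST: random or clustering-based minibatch selection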

Subjective Evaluation: Naturalness
[Figure: naturalness scores (1: too bad, 5: very good) with 95% confidence intervals for MSE, LOCAL-RAND, LOCAL-CLST, RFF-RAND, RFF-CLST, and VOC; significant differences marked at p<0.01]

Subjective Evaluation: Diversity
[Figure: diversity scores (1: completely equivalent, 5: very different) with 95% confidence intervals for MSE, LOCAL-RAND, LOCAL-CLST, RFF-RAND, RFF-CLST, and VOC; significant differences marked at p<0.05 and p<0.001]
• Participants listened to two samples generated using different random inputs
• They rated how different the two samples were on a 5-point scale

Variance of Sampled Speech Parameters
The diversity score increased with the variance of phone
duration

Method       0-th mel-cepstrum   1-st mel-cepstrum   log F0 [cent]   phone duration [ms]   Diversity MOS
LOCAL-RAND   0.023               0.012               15.8            2.46                  1.61
LOCAL-CLST   0.053               0.022               18.2            3.50                  1.71
RFF-RAND     0.021               0.007               1.5             3.77                  1.73
RFF-CLST     0.049               0.027               14.0            5.47                  1.94

Conclusions
‣ Examined the approximation methods to reduce
computational complexity of GMMN-based speech
synthesis
•Local approximation / Low rank approximation (RFF)
•Minibatch selection using clustering
‣ RFF and clustering-based minibatch selection improved diversity
‣ Future work
•Employ sequence-level modeling
•Use more data
•Investigate evaluation methods for sampling-based TTS
