A TRAINING METHOD USING DNN-GUIDED LAYERWISE PRETRAINING FOR DEEP GAUSSIAN PROCESSES
1. A TRAINING METHOD USING DNN-GUIDED LAYERWISE PRETRAINING FOR DEEP GAUSSIAN PROCESSES
Tomoki Koriyama, Takao Kobayashi
Tokyo Institute of Technology, Yokohama, Japan
May 14, 2019
2. Abstract
‣ Although the deep Gaussian process (DGP) is a powerful regression model, its training is not easy
‣ Propose two-stage pretraining, which helps DGP training
•DNN and layer-wise GP pretraining
‣ Use speech synthesis databases
•600K data points, hundreds of input and output features
‣ Avoid training failures for deeper models
3. Background: Deep Neural Network
‣ Deep neural network
•Stacked functions of linear transformation and nonlinear activation
•Expressiveness enhanced by deep architecture
•Many techniques for training
- Batch normalization, dropout, ResNet, etc.
•Scalability for large training data
- O(N) computational complexity
‣ Disadvantage
•Point estimate
- No prior on weight matrix
- Overfitting problem
[Figure: DNN as stacked transformations: h1 = σ(W1 x), h2 = σ(W2 h1), y = W3 h2]
4. Background: Gaussian process regression
‣ Gaussian process regression (GPR)
•Nonparametric regression
- Utilize raw data points directly for prediction
•Probabilistic model
- Optimize hyper-parameters considering model complexity
•Scalability for large data with sparse approximation
- Stochastic variational inference [Hensman et al., 2013]
‣ Disadvantage
•Performance depends on kernel function
•Choosing an appropriate kernel is difficult
5. Deep Gaussian process (DGP) [Damianou et al., 2013]
‣ Stacked Gaussian process regression
•Compared with DNN
- Probabilistic Bayesian model
•Compared with GPR
- Expressiveness enhanced by deep architecture
- Lower layers can be regarded as automatic kernel tuning
-> Overcomes the limitation of the kernel function
‣ Scalable for large data
[Salimbeni et al., 2017]
‣ In a TTS task, DGP outperformed DNN
[Koriyama et al., 2019]
[Figure: DGP as stacked GPRs: x → p(h1|x) → p(h2|x) → p(y|x)]
6. Purpose
‣ Problem of DGP
•Training fails if the initial parameters are poor
- Due to repeated Monte Carlo sampling
•Very few studies on training techniques for DGPs
7. Gaussian process in machine learning
Assume that the latent function is sampled from a Gaussian process, and predict the posterior of the function:
y = f(x) + ε   (x: input, y: output, ε: noise, f: latent function)
f ∼ 𝒢𝒫(m(x), k(x, x′; θ))   (m: mean function, k: kernel function, θ: kernel parameter)
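As a concrete illustration of this model, the following is a minimal NumPy sketch that draws a latent function from a GP prior and adds observation noise; the RBF kernel, its parameters, and the zero mean function are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch: sample f ~ GP(m, k) on a grid of inputs and observe y = f(x) + eps.
# The RBF kernel, lengthscale 0.1, noise std 0.1, and zero mean are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)[:, None]                      # inputs
K = np.exp(-0.5 * (x - x.T) ** 2 / 0.1 ** 2)                # k(x, x'; theta), RBF
f = rng.multivariate_normal(np.zeros(len(x)),               # f ~ GP with m(x) = 0
                            K + 1e-8 * np.eye(len(x)))
y = f + 0.1 * rng.standard_normal(len(x))                   # y = f(x) + eps
```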
8. GPR using stochastic variational inference (SVI) [Hensman et al., 2013]
‣ Predictive posterior distribution
q(f(x)) = SVGP(f(x); m(·), k(·, ·; θ), x, Z, q(u)) = 𝒩(f(x); μ, σ²)
μ = m(x) + a⊤(m − m(Z))
σ² = k(x, x; θ) − a⊤[K(Z, Z; θ) − S] a
a = K(Z, Z; θ)⁻¹ K(Z, x; θ)
‣ Target function (ELBO) to maximize
ℒ = ∑_{i=1}^{N} 𝔼_{q(f(x_i))}[log p(y_i | f(x_i))] − KL(q(u) ∥ p(u; Z, θ))
(first term: model fitness, second term: penalty)
•Available for big data by stochastic optimization
Parameters: Z: inducing inputs; q(u) = 𝒩(u; m, S): variational distribution of inducing outputs; θ: kernel parameter. K(·, ·): Gram matrix
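To make the formulas above concrete, below is a minimal NumPy sketch of the predictive posterior for a single test input. Only the expressions for a, μ, and σ² follow the slide; the RBF kernel and the helper names (rbf_kernel, svgp_posterior) are assumptions for illustration.

```python
# Sketch of the SVGP predictive posterior q(f(x)) = N(mu, sigma^2).
# Only the formulas for a, mu, sigma^2 come from the slide; the RBF kernel
# and the function names are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def svgp_posterior(x, Z, m_u, S, mean_fn=lambda X: np.zeros(len(X))):
    """x: (D,) test input, Z: (M, D) inducing inputs, q(u) = N(m_u, S)."""
    Kzz = rbf_kernel(Z, Z) + 1e-6 * np.eye(len(Z))          # K(Z, Z; theta)
    Kzx = rbf_kernel(Z, x[None, :])                         # K(Z, x; theta)
    a = np.linalg.solve(Kzz, Kzx)                           # a = K(Z,Z)^-1 K(Z,x)
    mu = mean_fn(x[None, :])[0] + a.T @ (m_u - mean_fn(Z))  # m(x) + a^T (m - m(Z))
    sigma2 = rbf_kernel(x[None, :], x[None, :])[0, 0] \
             - (a.T @ (Kzz - S) @ a)[0, 0]                  # k(x,x) - a^T [Kzz - S] a
    return mu.item(), sigma2
```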
9. Training of DGP based on SVI [Salimbeni et al., 2017]
‣ Perform stochastic optimization in a similar manner to single-layer GPR
•Target ELBO function is calculated by Monte Carlo sampling
ℒ = ∑_{i=1}^{N} (1/S) ∑_{s=1}^{S} 𝔼_{q(f(x_i^s))}[log p(y_i | f(x_i^s))] − ∑_{ℓ=1}^{L} KL(q(U^ℓ) ∥ p(U^ℓ; Z^ℓ, θ^ℓ))
‣ Problem
•Repeated sampling in the deep model causes gradient vanishing
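A schematic sketch of this doubly stochastic estimate is given below: hidden values are re-sampled layer by layer, the log-likelihood is averaged over S samples, and the KL penalties are summed over layers. The placeholder layer_posterior, the Gaussian likelihood, and the precomputed kl_terms are assumptions; for simplicity the final-layer expectation is also approximated by sampling here.

```python
# Schematic sketch of the DGP ELBO with Monte Carlo sampling through layers.
# layer_posterior() stands in for each layer's SVGP posterior; its form,
# the Gaussian likelihood, and the precomputed KL terms are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def layer_posterior(h):
    # Placeholder per-layer predictive mean and variance (illustrative only).
    return 0.9 * h, 0.1 * np.ones_like(h)

def dgp_elbo(x, y, num_layers=3, num_samples=5, noise_var=0.05, kl_terms=None):
    kl = np.zeros(num_layers) if kl_terms is None else kl_terms
    log_lik = 0.0
    for _ in range(num_samples):              # s = 1..S Monte Carlo samples
        h = x
        for _ in range(num_layers):           # propagate one sample layer by layer
            mu, var = layer_posterior(h)
            h = mu + np.sqrt(var) * rng.standard_normal(h.shape)
        # log p(y | f(x^s)) under a Gaussian likelihood
        log_lik += -0.5 * np.sum((y - h) ** 2 / noise_var
                                 + np.log(2.0 * np.pi * noise_var))
    return log_lik / num_samples - kl.sum()   # model fitness - summed KL penalty
```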
10. Conventional method: mean function
[Salimbeni et al., 2017]
A non-zero mean function is used to reduce gradient vanishing, and GPR is used to predict the residual
[Figure: the mean function carries the input forward at each layer (copy when dimensions match, dimension reduction by PCA otherwise), and GPR predicts the residual at each hidden layer and the output]
Designing the mean function is difficult if we use a complicated architecture
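As a rough illustration of this conventional design, the sketch below implements the linear mean function described on the slide (copy when dimensions match, PCA projection otherwise); the class layout and names are assumptions, and the PCA matrix would be computed from the training inputs.

```python
# Sketch of the conventional non-zero (linear) mean function: the layer mean
# copies the input when dimensions match and applies a fixed PCA projection
# otherwise; the GPR then only has to model the residual. Names are assumptions.
import numpy as np

class LinearMeanFunction:
    def __init__(self, in_dim, out_dim, pca_matrix=None):
        if in_dim == out_dim:
            self.W = np.eye(in_dim)   # "Copy": identity mean
        else:
            self.W = pca_matrix       # "Dimension reduction by PCA": (out_dim, in_dim)

    def __call__(self, H):
        return H @ self.W.T           # layer output mean; GPR predicts the residual
```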
11. Pre-training 1: DNN training
Replace the functions of the GPs with perceptron blocks and train the resulting DNN to obtain hidden-layer values
[Figure: stack of perceptron blocks in place of the GPR layers]
h_{ℓ+1} = BatchNorm(V · ReLU(W h_ℓ + b))
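A minimal PyTorch sketch of this perceptron block is shown below; the layer sizes follow the experimental settings on a later slide (1024 hidden units, 32-dimensional hidden layers), while the module layout itself is an assumption.

```python
# Sketch of the perceptron block h_{l+1} = BatchNorm(V * ReLU(W h_l + b)).
# Hidden units (1024) and hidden-layer dim (32) follow the experimental
# conditions slide; the PyTorch module layout is an assumption.
import torch
import torch.nn as nn

class PerceptronBlock(nn.Module):
    def __init__(self, in_dim, hidden_dim=1024, out_dim=32):
        super().__init__()
        self.affine = nn.Linear(in_dim, hidden_dim)                # W h_l + b
        self.project = nn.Linear(hidden_dim, out_dim, bias=False)  # V (.)
        self.bn = nn.BatchNorm1d(out_dim)                          # BatchNorm(.)

    def forward(self, h):
        return self.bn(self.project(torch.relu(self.affine(h))))
```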
12. Pre-training 2: layer-wise GP training
Train layer-wise GPRs that represent the relationships between hidden layers
[Figure: one GPR trained between each pair of adjacent hidden layers of the pretrained perceptron blocks]
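Schematically, the layer-wise pretraining pairs each layer's values with the next layer's values and fits one GPR per pair, as in the sketch below; fit_svgp is a stand-in for any SVGP trainer, and its name and signature are assumptions.

```python
# Sketch of layer-wise GP pretraining: fit one GPR between each pair of
# adjacent layer values produced by the pretrained DNN (h_0 = x, h_L = y).
# fit_svgp() is a placeholder for an SVGP trainer; its signature is an assumption.
def layerwise_gp_pretrain(hidden_values, fit_svgp, num_epochs=10):
    """hidden_values = [x, h_1, ..., h_{L-1}, y] from the pretrained DNN."""
    layer_gps = []
    for h_in, h_out in zip(hidden_values[:-1], hidden_values[1:]):
        layer_gps.append(fit_svgp(h_in, h_out, epochs=num_epochs))  # GPR: h_in -> h_out
    return layer_gps
```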
13. Initialize parameter of DGP
Use the pretrained layer-wise GPR parameters as the initial parameters of each layer of the DGP
[Figure: each layer-wise GPR initializes the corresponding GPR layer of the DGP]
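The initialization step then just copies the pretrained parameters layer by layer, roughly as below; the attribute names (Z, q_mu, q_sqrt, kernel_params) are assumptions about how a DGP layer stores its variational and kernel parameters.

```python
# Sketch of initializing each DGP layer from the corresponding layer-wise GPR.
# Attribute names are assumptions; they correspond to the inducing inputs Z,
# q(u) = N(m, S), and the kernel parameters theta from the SVI slide.
def init_dgp_from_layerwise_gps(dgp_layers, layer_gps):
    for layer, gp in zip(dgp_layers, layer_gps):
        layer.Z = gp.Z                          # inducing inputs
        layer.q_mu = gp.q_mu                    # variational mean m
        layer.q_sqrt = gp.q_sqrt                # factor of variational covariance S
        layer.kernel_params = gp.kernel_params  # kernel parameters theta
    return dgp_layers
```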
14. Experimental conditions: database
‣ English speech synthesis DB
•Database: CMU ARCTIC, 1 female (SLT)
•Training data: 597K frames (49 min.)
•Test data: 66 sentences
•Input feature: 721-dim. linguistic feature vector
‣ Japanese speech synthesis DB
•Database: XIMERA [Kawai et al., 2004], 1 female (F009)
•Training data: 1.39M frames (119 min.)
•Test data: 60 sentences
•Input feature: 574-dim. linguistic feature vector
‣ Common
•Output feature: 139-dim. acoustic feature vector
•Evaluation measure: Mel-cepstral distance (MCD)
15. Experimental conditions: model configurations
‣ DGP
•Hidden layer dim.: 32
•Kernel function: ArcCos [Cho & Saul, 2009] / RBF
•# of inducing points: 1024
‣ Perceptron block for DNN
•Hidden units: 1024
•Activation: ReLU
•Dropout rate: 20%
16. Methods
‣ PRE10
•Proposed method using 10 epochs of layer-wise GP training
‣ PRE1
•Proposed method using only 1 epoch of layer-wise GP training
‣ MEAN
•Conventional method using non-zero mean function
‣ RAND
•Random values were used as the initial inducing inputs and outputs
‣ DNN
•DNN was used instead of DGP
17. Effect of # of layers
The training of DGPs with 7–10 layers and random initial parameters (RAND) failed, while the proposed method worked well
MCDs [dB] as a function of the number of layers:
# of layers  RAND   PRE10
1            5.11   5.08
2            4.72   4.70
3            4.65   4.63
4            4.65   4.59
5            4.65   4.62
6            4.65   4.63
7            10.08  4.60
8            10.08  4.63
9            10.08  4.65
10           10.08  4.62
– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
18. Epoch-by-epoch distortions
Proposed PRE10 and PRE1 gave smaller distortions than
conventional MEAN in early epochs
– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
– # of layers: 6
[Figure: MCD [dB] as a function of training epoch for RAND, MEAN, DNN, PRE1, and PRE10]
19. Conclusions
‣ Proposed two-stage pretraining for DGP training
•Pretraining 1: DNN
- Determine hidden layer values
•Pretraining 2: layer-wise GPR
- Obtain initial GP parameters using hidden layer values
‣ The proposed pretraining made training stable
even for deep (7–10-layer) models
‣ Future work
•Apply the proposed method to more complicated architectures other than feed-forward models