A TRAINING METHOD USING DNN-GUIDED LAYERWISE PRETRAINING FOR DEEP GAUSSIAN PROCESSES
1. A TRAINING METHOD USING DNN-GUIDED LAYERWISE PRETRAINING FOR DEEP GAUSSIAN PROCESSES
Tomoki Koriyama, Takao Kobayashi
Tokyo Institute of Technology, Yokohama, Japan
May 14, 2019
2. Abstract
‣ Although the deep Gaussian process (DGP) is a powerful regression model, its training is not easy
‣ Propose two-stage pretraining, which helps DGP training
•DNN and layer-wise GP pretraining
‣ Use speech synthesis databases
•600K data points, hundreds of input and output features
‣ Avoid training failures for deeper models
3. Background: Deep Neural Network
‣ Deep neural network
•Stacked functions of linear transformation and nonlinear activation
•Expressiveness enhanced by deep architecture
•Many techniques for training
- Batch normalization, dropout, ResNet, etc.
•Scalability for large training data
- O(N) computational complexity
‣ Disadvantage
•Point estimate
- No prior on weight matrix
- Overfitting problem
[Figure: DNN as stacked transformations: h1 = σ(W1 x), h2 = σ(W2 h1), y = W3 h2]
4. Background: Gaussian process regression
‣ Gaussian process regression (GPR)
•Nonparametric regression
- Utilize raw data points directly for prediction
•Probabilistic model
- Optimize hyper-parameters considering model complexity
•Scalability for large data with sparse approximation
- Stochastic variational inference [Hensman et al., 2013]
‣ Disadvantage
•Performance depends on kernel function
•Choosing an appropriate kernel is difficult
5. Deep Gaussian process (DGP) [Damianou et al., 2013]
‣ Stacked Gaussian process regression
•Compared with DNN
- Probabilistic Bayesian model
•Compared with GPR
- Expressiveness enhanced by deep architecture
- Lower layers can be regarded as automatic kernel tuning
-> Overcomes the limitation of the kernel function
‣ Scalable for large data
[Salimbeni et al., 2017]
‣ In a TTS task, DGP outperformed DNN
[Koriyama et al., 2019]
[Figure: DGP as stacked GPRs: x → p(h1|x) → p(h2|x) → p(y|x)]
6. Purpose
‣ Problem of DGP
•Training fails if the initial parameters are poor
- Due to repeated Monte Carlo sampling
•Very few studies on training techniques for DGPs
7. Gaussian process in machine learning
Assume that the latent function is sampled from a Gaussian process, and predict the posterior of the function:
y = f(x) + ε   (x: input, y: output, ε: noise, f: latent function)
f ∼ 𝒢𝒫(m(x), k(x, x′; θ))   (m: mean function, k: kernel function, θ: kernel parameter)
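As a concrete illustration of this model, the following is a minimal NumPy sketch that draws a latent function from a GP prior and adds observation noise; the RBF kernel, its parameters, and the zero mean function are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch: sample f ~ GP(m, k) on a grid of inputs and observe y = f(x) + eps.
# The RBF kernel, lengthscale 0.1, noise std 0.1, and zero mean are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)[:, None]                      # inputs
K = np.exp(-0.5 * (x - x.T) ** 2 / 0.1 ** 2)                # k(x, x'; theta), RBF
f = rng.multivariate_normal(np.zeros(len(x)),               # f ~ GP with m(x) = 0
                            K + 1e-8 * np.eye(len(x)))
y = f + 0.1 * rng.standard_normal(len(x))                   # y = f(x) + eps
```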
8. GPR using stochastic variational inference (SVI) [Hensman et al., 2013]
‣ Predictive posterior distribution
q(f(x)) = SVGP(f(x); m(·), k(·, ·; θ), x, Z, q(u)) = 𝒩(f(x); μ, σ²)
μ = m(x) + a⊤(m − m(Z))
σ² = k(x, x; θ) − a⊤[K(Z, Z; θ) − S] a
a = K(Z, Z; θ)⁻¹ K(Z, x; θ)
‣ Target function (ELBO) to maximize
ℒ = ∑_{i=1}^{N} 𝔼_{q(f(x_i))}[log p(y_i | f(x_i))] − KL(q(u) ∥ p(u; Z, θ))
(first term: model fitness, second term: penalty)
•Available for big data by stochastic optimization
Parameters: Z: inducing inputs; q(u) = 𝒩(u; m, S): variational distribution of inducing outputs; θ: kernel parameter. K(·, ·): Gram matrix
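To make the formulas above concrete, below is a minimal NumPy sketch of the predictive posterior for a single test input. Only the expressions for a, μ, and σ² follow the slide; the RBF kernel and the helper names (rbf_kernel, svgp_posterior) are assumptions for illustration.

```python
# Sketch of the SVGP predictive posterior q(f(x)) = N(mu, sigma^2).
# Only the formulas for a, mu, sigma^2 come from the slide; the RBF kernel
# and the function names are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def svgp_posterior(x, Z, m_u, S, mean_fn=lambda X: np.zeros(len(X))):
    """x: (D,) test input, Z: (M, D) inducing inputs, q(u) = N(m_u, S)."""
    Kzz = rbf_kernel(Z, Z) + 1e-6 * np.eye(len(Z))          # K(Z, Z; theta)
    Kzx = rbf_kernel(Z, x[None, :])                         # K(Z, x; theta)
    a = np.linalg.solve(Kzz, Kzx)                           # a = K(Z,Z)^-1 K(Z,x)
    mu = mean_fn(x[None, :])[0] + a.T @ (m_u - mean_fn(Z))  # m(x) + a^T (m - m(Z))
    sigma2 = rbf_kernel(x[None, :], x[None, :])[0, 0] \
             - (a.T @ (Kzz - S) @ a)[0, 0]                  # k(x,x) - a^T [Kzz - S] a
    return mu.item(), sigma2
```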
9. Training of DGP based on SVI [Salimbeni et al., 2017]
‣ Perform stochastic optimization in a similar manner to single-layer GPR
•Target ELBO function is calculated by Monte Carlo sampling
ℒ = ∑_{i=1}^{N} (1/S) ∑_{s=1}^{S} 𝔼_{q(f(x_i^s))}[log p(y_i | f(x_i^s))] − ∑_{ℓ=1}^{L} KL(q(U^ℓ) ∥ p(U^ℓ; Z^ℓ, θ^ℓ))
‣ Problem
•Repeated sampling in the deep model causes gradient vanishing
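A schematic sketch of this doubly stochastic estimate is given below: hidden values are re-sampled layer by layer, the log-likelihood is averaged over S samples, and the KL penalties are summed over layers. The placeholder layer_posterior, the Gaussian likelihood, and the precomputed kl_terms are assumptions; for simplicity the final-layer expectation is also approximated by sampling here.

```python
# Schematic sketch of the DGP ELBO with Monte Carlo sampling through layers.
# layer_posterior() stands in for each layer's SVGP posterior; its form,
# the Gaussian likelihood, and the precomputed KL terms are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def layer_posterior(h):
    # Placeholder per-layer predictive mean and variance (illustrative only).
    return 0.9 * h, 0.1 * np.ones_like(h)

def dgp_elbo(x, y, num_layers=3, num_samples=5, noise_var=0.05, kl_terms=None):
    kl = np.zeros(num_layers) if kl_terms is None else kl_terms
    log_lik = 0.0
    for _ in range(num_samples):              # s = 1..S Monte Carlo samples
        h = x
        for _ in range(num_layers):           # propagate one sample layer by layer
            mu, var = layer_posterior(h)
            h = mu + np.sqrt(var) * rng.standard_normal(h.shape)
        # log p(y | f(x^s)) under a Gaussian likelihood
        log_lik += -0.5 * np.sum((y - h) ** 2 / noise_var
                                 + np.log(2.0 * np.pi * noise_var))
    return log_lik / num_samples - kl.sum()   # model fitness - summed KL penalty
```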
10. Conventional method: mean function
[Salimbeni et al., 2017]
A non-zero mean function is used to reduce gradient vanishing, and GPR is used to predict the residual
[Figure: the mean function carries the input forward at each layer (copy when dimensions match, dimension reduction by PCA otherwise), and GPR predicts the residual at each hidden layer and the output]
Designing the mean function is difficult if we use a complicated architecture
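As a rough illustration of this conventional design, the sketch below implements the linear mean function described on the slide (copy when dimensions match, PCA projection otherwise); the class layout and names are assumptions, and the PCA matrix would be computed from the training inputs.

```python
# Sketch of the conventional non-zero (linear) mean function: the layer mean
# copies the input when dimensions match and applies a fixed PCA projection
# otherwise; the GPR then only has to model the residual. Names are assumptions.
import numpy as np

class LinearMeanFunction:
    def __init__(self, in_dim, out_dim, pca_matrix=None):
        if in_dim == out_dim:
            self.W = np.eye(in_dim)   # "Copy": identity mean
        else:
            self.W = pca_matrix       # "Dimension reduction by PCA": (out_dim, in_dim)

    def __call__(self, H):
        return H @ self.W.T           # layer output mean; GPR predicts the residual
```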
11. Pre-training 1: DNN training
Replace the functions of the GPs with perceptron blocks and train the resulting DNN to obtain hidden-layer values
[Figure: stack of perceptron blocks in place of the GPR layers]
h_{ℓ+1} = BatchNorm(V · ReLU(W h_ℓ + b))
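A minimal PyTorch sketch of this perceptron block is shown below; the layer sizes follow the experimental settings on a later slide (1024 hidden units, 32-dimensional hidden layers), while the module layout itself is an assumption.

```python
# Sketch of the perceptron block h_{l+1} = BatchNorm(V * ReLU(W h_l + b)).
# Hidden units (1024) and hidden-layer dim (32) follow the experimental
# conditions slide; the PyTorch module layout is an assumption.
import torch
import torch.nn as nn

class PerceptronBlock(nn.Module):
    def __init__(self, in_dim, hidden_dim=1024, out_dim=32):
        super().__init__()
        self.affine = nn.Linear(in_dim, hidden_dim)                # W h_l + b
        self.project = nn.Linear(hidden_dim, out_dim, bias=False)  # V (.)
        self.bn = nn.BatchNorm1d(out_dim)                          # BatchNorm(.)

    def forward(self, h):
        return self.bn(self.project(torch.relu(self.affine(h))))
```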
12. Pre-training 2: layer-wise GP training
Train layer-wise GPRs that represent the relationships between hidden layers
[Figure: one GPR trained between each pair of adjacent hidden layers of the pretrained perceptron blocks]
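Schematically, the layer-wise pretraining pairs each layer's values with the next layer's values and fits one GPR per pair, as in the sketch below; fit_svgp is a stand-in for any SVGP trainer, and its name and signature are assumptions.

```python
# Sketch of layer-wise GP pretraining: fit one GPR between each pair of
# adjacent layer values produced by the pretrained DNN (h_0 = x, h_L = y).
# fit_svgp() is a placeholder for an SVGP trainer; its signature is an assumption.
def layerwise_gp_pretrain(hidden_values, fit_svgp, num_epochs=10):
    """hidden_values = [x, h_1, ..., h_{L-1}, y] from the pretrained DNN."""
    layer_gps = []
    for h_in, h_out in zip(hidden_values[:-1], hidden_values[1:]):
        layer_gps.append(fit_svgp(h_in, h_out, epochs=num_epochs))  # GPR: h_in -> h_out
    return layer_gps
```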
13. Initialize parameter of DGP
Use the pretrained layer-wise GPR parameters as the initial parameters of each layer of the DGP
[Figure: each layer-wise GPR initializes the corresponding GPR layer of the DGP]
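The initialization step then just copies the pretrained parameters layer by layer, roughly as below; the attribute names (Z, q_mu, q_sqrt, kernel_params) are assumptions about how a DGP layer stores its variational and kernel parameters.

```python
# Sketch of initializing each DGP layer from the corresponding layer-wise GPR.
# Attribute names are assumptions; they correspond to the inducing inputs Z,
# q(u) = N(m, S), and the kernel parameters theta from the SVI slide.
def init_dgp_from_layerwise_gps(dgp_layers, layer_gps):
    for layer, gp in zip(dgp_layers, layer_gps):
        layer.Z = gp.Z                          # inducing inputs
        layer.q_mu = gp.q_mu                    # variational mean m
        layer.q_sqrt = gp.q_sqrt                # factor of variational covariance S
        layer.kernel_params = gp.kernel_params  # kernel parameters theta
    return dgp_layers
```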
14. Experimental conditions: database
‣ English speech synthesis DB
•Database: CMU ARCTIC, 1 female (SLT)
•Training data: 597K frames (49 min.)
•Test data: 66 sentences
•Input feature: 721-dim. linguistic feature vector
‣ Japanese speech synthesis DB
•Database: XIMERA [Kawai et al., 2004], 1 female (F009)
•Training data: 1.39M frames (119 min.)
•Test data: 60 sentences
•Input feature: 574-dim. linguistic feature vector
‣ Common
•Output feature: 139-dim. acoustic feature vector
•Evaluation measure: Mel-cepstral distance (MCD)
15. Experimental conditions: model configurations
‣ DGP
•Hidden layer dim.: 32
•Kernel function: ArcCos [Cho & Saul, 2009] / RBF
•# of inducing points: 1024
‣ Perceptron block for DNN
•Hidden units: 1024
•Activation: ReLU
•Dropout rate: 20%
16. Methods
‣ PRE10
•Proposed method using 10 epochs of layer-wise GP training
‣ PRE1
•Proposed method using only 1 epoch of layer-wise GP training
‣ MEAN
•Conventional method using non-zero mean function
‣ RAND
•Random values were used as the initial inducing inputs and outputs
‣ DNN
•DNN was used instead of DGP
17. Effect of # of layers
The training of DGPs with 7–10 layers and random initial parameters (RAND) failed, while the proposed method worked well
MCDs [dB] as a function of the number of layers:
# of layers  RAND   PRE10
1            5.11   5.08
2            4.72   4.70
3            4.65   4.63
4            4.65   4.59
5            4.65   4.62
6            4.65   4.63
7            10.08  4.60
8            10.08  4.63
9            10.08  4.65
10           10.08  4.62
– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
18. Epoch-by-epoch distortions
Proposed PRE10 and PRE1 gave smaller distortions than
conventional MEAN in early epochs
– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
– # of layers: 6
[Figure: MCD [dB] as a function of training epoch for RAND, MEAN, DNN, PRE1, and PRE10]
19. Conclusions
‣ Proposed two-stage pretraining for DGP training
•Pretraining 1: DNN
- Determine hidden layer values
•Pretraining 2: layer-wise GPR
- Obtain initial GP parameters using hidden layer values
‣ The proposed pretraining made training stable
even for deep (7–10-layer) models
‣ Future work
•Apply the proposed method to more complicated architectures other than feed-forward models