A TRAINING METHOD USING DNN-GUIDED LAYERWISE PRETRAINING FOR DEEP GAUSSIAN PROCESSES
Tomoki Koriyama, Takao Kobayashi
Tokyo Institute of Technology, Yokohama, Japan
May 14, 2019
Abstract
‣ Although the deep Gaussian process (DGP) is a powerful regression model, its training is not easy
‣ We propose a two-stage pretraining method that helps DGP training
•DNN pretraining followed by layer-wise GP pretraining
‣ Evaluated on speech synthesis databases
•~600K data points, hundreds of input and output features
‣ The method avoids training failures for deeper models
Background: Deep Neural Network
‣ Deep neural network
•Stacked functions of linear transformation and nonlinear activation
•Expressiveness enhanced by deep architecture
•Many techniques for training
- Batch normalization, dropout, ResNet, etc.
•Scalability for large training data
- O(N) computational complexity
‣ Disadvantage
•Point estimate
- No prior on weight matrix
- Overfitting problem
[Figure: a three-layer DNN with hidden layers h1 = σ(W1 x), h2 = σ(W2 h1), and output y = W3 h2]
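To make the stacked computation above concrete, here is a minimal NumPy sketch of the depicted forward pass; the layer sizes and the choice of ReLU as the activation σ are illustrative assumptions, not values from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Illustrative sizes (not from the slides): 8-dim input, 16-dim hidden layers, 4-dim output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(16, 16))
W3 = rng.normal(size=(4, 16))

x = rng.normal(size=8)
h1 = relu(W1 @ x)   # h1 = sigma(W1 x), here with sigma = ReLU
h2 = relu(W2 @ h1)  # h2 = sigma(W2 h1)
y = W3 @ h2         # point estimate of the output; no prior over W1..W3
```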
Background: Gaussian process regression
‣ Gaussian process regression (GPR)
•Nonparametric regression
- Utilize raw data points directly for prediction
•Probabilistic model
- Optimize hyper-parameters considering model complexity
•Scalability for large data with sparse approximation
- Stochastic variational inference [Hensman et al., 2013]
‣ Disadvantage
•Performance depends on the kernel function
•Choosing an appropriate kernel is difficult
Deep Gaussian process (DGP) [Damianou et al., 2013]
‣ Stacked Gaussian process regression
•Compared with DNN
- Probabilistic Bayesian model
•Compared with GPR
- Expressiveness enhanced by deep architecture
- Lower layer can be regarded as automatic kernel tuning
-> Overcome the limitation of kernel function
‣ Scalable for large data [Salimbeni et al., 2017]
‣ In a TTS task, DGP outperformed DNN [Koriyama et al., 2019]
[Figure: a stack of three GPRs mapping the input x through p(h1|x) and p(h2|x) to the output p(y|x)]
Purpose
‣ Problem of DGP
•Training fails if initial parameters are bad
- Due to repeated Monte Carlo sampling
•Very few studies have addressed training techniques for DGPs
Gaussian process in machine learning
Assume that the latent function is sampled from a Gaussian process, and predict the posterior of the function:

y = f(x) + ϵ,  f ∼ 𝒢𝒫(m(x), k(x, x′; θ))

where x is the input, y the output, ϵ the noise, f the latent function, m(⋅) the mean function, and k(⋅, ⋅; θ) the kernel function with kernel parameter θ.
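To make the posterior prediction concrete, a minimal NumPy sketch of exact GP regression follows; the RBF kernel, zero mean function, noise variance, and toy data are illustrative assumptions.

```python
import numpy as np

def rbf(XA, XB, lengthscale=1.0, variance=1.0):
    """k(x, x'; theta) with theta = (lengthscale, variance)."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Toy 1-D data; zero mean function m(x) = 0 and noise variance 0.01 are assumptions
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=20)

Kxx = rbf(X, X) + 0.01 * np.eye(len(X))   # K(X, X) + noise * I
Xs = np.linspace(0.0, 1.0, 100)[:, None]  # test inputs
Ksx = rbf(Xs, X)

# Posterior mean and variance of the latent function at the test inputs
mu = Ksx @ np.linalg.solve(Kxx, y)
var = rbf(Xs, Xs).diagonal() - np.einsum('ij,ji->i', Ksx, np.linalg.solve(Kxx, Ksx.T))
```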
GPR using stochastic variational inference (SVI) [Hensman et al., 2013]

Parameters:
Z : inducing inputs
q(u) = 𝒩(u; m, S) : variational distribution of inducing outputs
θ : kernel parameter
K(⋅, ⋅) : Gram matrix

‣ Predictive posterior distribution

q(f(x)) = SVGP(f(x); m(⋅), k(⋅, ⋅; θ), x, Z, q(u)) = 𝒩(f(x); μ, σ²)
μ = m(x) + a⊤(m − m(Z))
σ² = k(x, x; θ) − a⊤[K(Z, Z; θ) − S] a
a = K(Z, Z; θ)⁻¹ K(Z, x; θ)

‣ Target function (ELBO) to maximize

ℒ = ∑_{i=1}^{N} 𝔼_{q(f(xᵢ))}[log p(yᵢ | f(xᵢ))] − KL(q(u) ∥ p(u; Z, θ))
(first term: model fitness; second term: penalty)

•Available for big data by stochastic optimization
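Read directly off these formulas, a minimal NumPy sketch of the predictive mean and variance follows; svgp_predict, kern, and mean_fn are illustrative names (kern could be, e.g., the rbf function sketched above), not a real library API.

```python
import numpy as np

def svgp_predict(x, Z, m_u, S, kern, mean_fn):
    """q(f(x)) = N(mu, var) from the slide's formulas (a sketch).

    x: (D,) test input; Z: (M, D) inducing inputs; m_u: (M,) variational mean;
    S: (M, M) variational covariance; kern(A, B) -> Gram matrix; mean_fn(A) -> (len(A),).
    """
    Kzz = kern(Z, Z) + 1e-6 * np.eye(len(Z))  # jitter for numerical stability
    Kzx = kern(Z, x[None, :])[:, 0]           # K(Z, x)
    a = np.linalg.solve(Kzz, Kzx)             # a = K(Z, Z)^{-1} K(Z, x)
    mu = mean_fn(x[None, :])[0] + a @ (m_u - mean_fn(Z))
    var = kern(x[None, :], x[None, :])[0, 0] - a @ (Kzz - S) @ a
    return mu, var
```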
Training of DGP based on SVI [Salimbeni et al., 2017]

‣ Perform stochastic optimization in a similar manner to single-layer GPR
•The target ELBO function is calculated by Monte Carlo sampling
‣ Problem
•Repeated sampling through the deep model causes gradient vanishing

ℒ = ∑_{i=1}^{N} (1/S) ∑_{s=1}^{S} 𝔼_{q(f(xᵢˢ))}[log p(yᵢ | f(xᵢˢ))] − ∑_{ℓ=1}^{L} KL(q(Uℓ) ∥ p(Uℓ; Zℓ, θℓ))

[Figure: the ELBO is estimated by sampling each layer's output in turn through the stack]
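A schematic NumPy sketch of this Monte Carlo ELBO estimate follows; the layer interface (.predict, .kl), the Gaussian likelihood, and the noise variance are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dgp_elbo_mc(x_batch, y_batch, layers, S=5, noise=0.01, seed=0):
    """Monte Carlo estimate of the DGP ELBO above (schematic).

    layers: objects with .predict(h) -> (mean, var) and .kl() -> float;
    this interface is hypothetical, standing in for a real SVGP layer.
    """
    rng = np.random.default_rng(seed)
    lik = 0.0
    for _ in range(S):
        h = x_batch
        for layer in layers:  # propagate one sample through the whole stack
            mean, var = layer.predict(h)
            h = mean + np.sqrt(var) * rng.normal(size=mean.shape)
        # Gaussian log-likelihood of the targets under the final-layer sample
        lik += -0.5 * np.sum((y_batch - h) ** 2 / noise + np.log(2 * np.pi * noise))
    return lik / S - sum(layer.kl() for layer in layers)
```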
Conventional method: mean function [Salimbeni et al., 2017]

A non-zero mean function is used to reduce gradient vanishing, and GPR is used for residual prediction.

[Figure: the input x is copied (or reduced by PCA) to form each hidden layer's mean function; each layer's GPR predicts the residual, and the final GPR predicts the output y]

Designing the mean function is difficult for more complicated architectures.
Pre-training 1: DNN training

Replace the GP functions with perceptron blocks and train the resulting DNN to obtain hidden-layer values:

hℓ+1 = BatchNorm(V ⋅ ReLU(W hℓ + b))
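A minimal PyTorch sketch of one such perceptron block, using the hidden size (1024) and dropout rate (20%) from the model-configuration slide; the dropout placement and the 32-dim layer widths are assumptions.

```python
import torch
import torch.nn as nn

class PerceptronBlock(nn.Module):
    """h_{l+1} = BatchNorm(V . ReLU(W h_l + b)); a sketch, not the authors' code."""
    def __init__(self, dim_in=32, hidden=1024, dim_out=32):
        super().__init__()
        self.W = nn.Linear(dim_in, hidden)               # computes W h + b
        self.V = nn.Linear(hidden, dim_out, bias=False)  # the slide's V has no bias term
        self.bn = nn.BatchNorm1d(dim_out)
        self.drop = nn.Dropout(0.2)  # 20% dropout per the config slide; placement assumed
        self.act = nn.ReLU()

    def forward(self, h):
        return self.bn(self.V(self.drop(self.act(self.W(h)))))

# Stack blocks and record hidden-layer values for pre-training 2
blocks = nn.ModuleList([PerceptronBlock() for _ in range(3)])
h = torch.randn(8, 32)
hiddens = [h]
for block in blocks:
    h = block(h)
    hiddens.append(h)  # (h_l, h_{l+1}) pairs become the layer-wise GPR data
```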
Pre-training 2: layer-wise GP training

Train layer-wise GPRs that represent the relationships between adjacent hidden layers.

[Figure: each GPR is fitted to the input-output pairs of the corresponding perceptron block]
Initialize DGP parameters

Use the pretrained layer-wise GPR parameters as the initial parameters of each DGP layer.

[Figure: the pretrained per-layer GPRs are stacked to form the initial DGP]
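Putting the two stages together, a schematic Python sketch of the whole procedure follows; train_dnn, fit_svgp, make_dgp, and init_from are hypothetical helpers standing in for a real SVGP/DGP implementation.

```python
def pretrain_and_init_dgp(X, Y, n_layers, train_dnn, fit_svgp, make_dgp):
    """Two-stage pretraining, schematically (all helpers are hypothetical).

    Stage 1: train a DNN of perceptron blocks; record hidden-layer values.
    Stage 2: fit one layer-wise SVGP per pair of adjacent hidden layers.
    Finally copy each SVGP's parameters (Z, q(u), theta) into the DGP layers.
    """
    dnn, hidden = train_dnn(X, Y, n_layers)  # hidden[0] = X, ..., hidden[-1] = output side
    layer_gps = [fit_svgp(hidden[l], hidden[l + 1], epochs=10)  # 10 epochs, as in PRE10
                 for l in range(n_layers)]
    dgp = make_dgp(n_layers)
    for dgp_layer, gp in zip(dgp.layers, layer_gps):
        dgp_layer.init_from(gp)  # inducing inputs Z, variational q(u), kernel params
    return dgp
```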
Experimental conditions: database

                     English speech synthesis DB    Japanese speech synthesis DB
Database             CMU ARCTIC, 1 female (SLT)     XIMERA [Kawai et al., 2004], 1 female (F009)
Training data        597K frames (49 min.)          1.39M frames (119 min.)
Test data            66 sentences                   60 sentences
Input feature        721-dim. linguistic vector     574-dim. linguistic vector
Output feature       139-dim. acoustic feature vector (both databases)
Evaluation measure   Mel-cepstral distance (MCD)
Experimental conditions: model configurations

DGP
- Hidden layer dim.: 32
- Kernel function: ArcCos [Cho & Saul, 2009] / RBF
- # of inducing points: 1024

Perceptron block for DNN
- Hidden units: 1024
- Activation: ReLU
- Dropout rate: 20%
Methods
‣ PRE10
•Proposed method with 10 epochs of layer-wise GP training
‣ PRE1
•Proposed method with only 1 epoch of layer-wise GP training
‣ MEAN
•Conventional method using a non-zero mean function
‣ RAND
•Random values used as initial inducing inputs and outputs
‣ DNN
•DNN used instead of DGP
Effect of # of layers

Training of DGPs with 7–10 layers from random parameters (RAND) failed, while the proposed method worked well.

MCDs [dB] as a function of the number of layers:

# of layers   RAND    PRE10
1             5.11    5.08
2             4.72    4.70
3             4.65    4.63
4             4.65    4.59
5             4.65    4.62
6             4.65    4.63
7             10.08   4.60
8             10.08   4.63
9             10.08   4.65
10            10.08   4.62

– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
Epoch-by-epoch distortions

The proposed PRE10 and PRE1 gave smaller distortions than the conventional MEAN in early epochs.

– Database: CMU ARCTIC (English)
– Kernel: ArcCos kernel
– # of layers: 6

[Figure: MCD [dB] versus training epoch for RAND, MEAN, DNN, PRE1, and PRE10]
Conclusions
‣ Proposed two-stage pretraining for DGP training
•Pretraining 1: DNN
- Determine hidden-layer values
•Pretraining 2: layer-wise GPR
- Obtain initial GP parameters using the hidden-layer values
‣ The proposed pretraining made training stable even for deep (7–10-layer) models
‣ Future work
•Apply the proposed method to more complicated architectures beyond feed-forward models
