Training deep neural networks has long been a difficult task. Recently, diverse approaches have been presented to tackle these difficulties, showing that deep models improve on the performance of shallow ones in areas such as signal processing, classification, and segmentation, across signal types, e.g. video, audio, and images. One of the most important methods is greedy layer-wise unsupervised pre-training followed by a fine-tuning phase. Despite the advantages of this procedure, it does not fit scenarios where real-time learning is needed, such as the adaptation of some time-series models. This paper proposes coupling both phases into one by modifying the loss function to mix the unsupervised and supervised parts together. Benchmark experiments with the MNIST database demonstrate the viability of the idea for simple image tasks, and experiments with time-series forecasting encourage incorporating it into on-line learning approaches. The interest of this method in time-series forecasting is motivated by the study of predictive models for domotic (home-automation) houses with intelligent control systems.
Integration of Unsupervised and Supervised Criteria for DNNs Training
1. Integration of Unsupervised and Supervised Criteria for DNNs Training
International Conference on Artificial Neural Networks (ICANN)
Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo
Department of Physical, Mathematical and Computer Sciences
Universidad CEU Cardenal Herrera
September 7th, 2016
2. Outline
1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results
   MNIST
   SML2010 temperature forecasting
5. Conclusions and Future Work
4. Motivation
Greedy layer-wise unsupervised pre-training is successful at training logistic MLPs. It has two training stages:
1. Pre-training with unsupervised data (SAEs or RBMs)
2. Fine-tuning parameters with supervised data
Very useful when large amounts of unsupervised data are available
But…
It is a greedy approach
Not valid for on-line learning scenarios
Less useful with small data sets
5. Motivation
Goals
Train a supervised model, layer-wise conditioned by an unsupervised loss:
Improving gradient flow
Learning better features
Every layer's parameters should be:
Useful for the global supervised task
Able to reconstruct their input (auto-encoders)
6. Motivation
Related work
Is Joint Training Better for Deep Auto-Encoders?, Y. Zhou et al. (2015), arXiv preprint (supervision only in a fine-tuning stage)
Preliminary work by P. Vincent et al. (2010), Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
Deep Learning via Semi-Supervised Embedding, Weston et al. (2008), ICML paper
8. Method description
How to do it
Risk = Supervised Loss + Σ_layer( Unsup. Loss )

R(θ, D) = (1/|D|) ∑_{(x,y)∈D} [ λ0 Ls(F(x; θ), y) + ∑_{k=1}^{H} λk U(k) ] + ϵ Ω(θ)

U(k) = Lu(Ak(h(k−1); θ), h(k−1)) for 1 ≤ k ≤ H, with λk ≥ 0

F(x; θ) is the MLP model
Ak(h(k−1); θ) is a denoising AE model
H is the number of hidden layers
h(0) = x
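A minimal code sketch may make the risk concrete. The slides do not specify an implementation, so the following PyTorch fragment is purely illustrative: it assumes sigmoid activations, masking noise for the denoising AEs, and binary cross-entropy as Lu; the class and function names (JointMLP, joint_risk) are made up here, and the ϵΩ(θ) regularizer is left to the optimizer's weight_decay.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMLP(nn.Module):
    """MLP whose hidden layers each carry a denoising auto-encoder loss U(k)."""
    def __init__(self, sizes, noise=0.2):
        super().__init__()
        # sizes = [inputs, hidden_1, ..., hidden_H, outputs]
        self.enc = nn.ModuleList(nn.Linear(a, b)
                                 for a, b in zip(sizes[:-2], sizes[1:-1]))
        self.dec = nn.ModuleList(nn.Linear(b, a)   # A_k reconstructs h(k-1)
                                 for a, b in zip(sizes[:-2], sizes[1:-1]))
        self.out = nn.Linear(sizes[-2], sizes[-1])
        self.noise = noise

    def forward(self, x):
        h, unsup = x, []
        for enc, dec in zip(self.enc, self.dec):
            # Denoising AE at layer k: reconstruct h(k-1) from a corrupted copy.
            corrupted = h * (torch.rand_like(h) > self.noise).float()
            recon = torch.sigmoid(dec(torch.sigmoid(enc(corrupted))))
            unsup.append(F.binary_cross_entropy(recon, h.detach()))   # U(k)
            h = torch.sigmoid(enc(h))        # clean activation feeds forward
        return self.out(h), unsup            # logits for Ls, list of U(k)

def joint_risk(logits, y, unsup, lambdas):
    """R = lambda_0 * Ls + sum_k lambda_k * U(k); lambdas has length H + 1."""
    risk = lambdas[0] * F.cross_entropy(logits, y)
    for lam, u in zip(lambdas[1:], unsup):
        risk = risk + lam * u
    return risk
```

A single optimizer step on joint_risk then updates all layers at once, which is what makes the method one-stage.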
9. Method description
How to do it
Risk = Supervised Loss + Σ_layer( Unsup. Loss )

R(θ, D) = (1/|D|) ∑_{(x,y)∈D} [ λ0 Ls(F(x; θ), y) + ∑_{k=1}^{H} λk U(k) ] + ϵ Ω(θ)

The λ vector mixes all the components
It should be updated every iteration:
Starting focused on the unsupervised criteria
Ending focused on the supervised criterion
12. λ Update Policies I
A λ update policy indicates how to change the λ vector at every iteration
The supervised part (λ0) can be fixed to 1
The unsupervised part should be important during the first iterations:
Losing focus as training proceeds
Becoming insignificant at the end
A greedy exponential decay (GED) suffices:
λ0(t) = 1 ;  λk(t) = Λ γ^t
with constants Λ > 0 and γ ∈ [0, 1]
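As a sketch, the GED policy is a one-liner per iteration (function name illustrative; the γ = 0.999 default matches the value explored in the MNIST experiments):

```python
def ged_lambdas(t, H, Lambda=1.0, gamma=0.999):
    """Greedy exponential decay: lambda_0(t) = 1;
    lambda_k(t) = Lambda * gamma**t for each of the H hidden layers."""
    return [1.0] + [Lambda * gamma ** t] * H

# With Lambda = 1 and gamma = 0.999 the unsupervised weight falls below
# 0.01 after roughly 4600 iterations, so late training is mostly supervised.
```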
13. λ Update Policies II
Exponential decay is the simplest approach, but other policies are possible:
Ratio between loss functions (see the sketch after this list)
Ratio between gradients at each layer
A combination of them
…
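The slide leaves these alternatives open. As a purely hypothetical illustration, not taken from the paper, a loss-ratio policy might rescale each λk so that every U(k) contributes a fixed fraction of the current supervised loss:

```python
def loss_ratio_lambdas(sup_loss, unsup_losses, scale=0.1, eps=1e-8):
    """Hypothetical policy (all names illustrative): choose lambda_k so that
    lambda_k * U(k) equals `scale` times the current supervised loss."""
    return [1.0] + [scale * sup_loss / (u + eps) for u in unsup_losses]
```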
15. Experiments and Results (MNIST) I
Benchmark with the MNIST dataset
Logistic activation functions, softmax output
Cross-entropy for supervised and unsupervised losses
Classification error as evaluation measure
Effect of MLP topology and of the initial value Λ of the λk weights
Sensitivity study of the exponential decay term γ
Comparison with other models from the literature
16. Experiments and Results (MNIST) II
Test error (%) with 95% confidence intervals:

Data set   SAE-3        SDAE-3       GED-3
MNIST      1.40±0.23    1.28±0.22    1.22±0.22
basic      3.46±0.16    2.84±0.15    2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010)
19. Experiments and Results (MNIST) V
[Figure: first-layer filters (16 of 2048 units), comparing only-supervised training, γ = 0.999, and γ = 1.000]
20. Experiments and Results (SML2010) I
SML2010 UCI data set: indoor temperature forecasting
Logistic hidden activation functions, linear output
48 inputs (12 hours) and 12 outputs (3 hours), i.e. 15-minute samples (a windowing sketch follows below)
Mean Square Error (MSE) as supervised loss
Cross-entropy as unsupervised losses
Mean Absolute Error (MAE) for evaluation
Compared MLPs with/without unsupervised losses
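The 48-input/12-output setup implies 15-minute samples. The slides do not detail the preprocessing, but a sliding-window construction along these lines is the natural assumption:

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Slice a 15-min-sampled temperature series into
    (12 h of history, next 3 h to predict) pairs."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        Y.append(series[i + n_in : i + n_in + n_out])
    return np.asarray(X), np.asarray(Y)
```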
21. Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements were marked in red in the original slide.
Λ = 0: DNN trained only with the supervised loss.
Λ = 1: DNN trained with both supervised and unsupervised losses.
22. Experiments and Results (SML2010) III
Test results for the 3-layer model with 64 neurons per hidden layer:
MAE 0.1274 when Λ = 0
MAE 0.1177 when Λ = 1
With Λ = 1 it was possible to train DNNs of up to 10 layers with 64 hidden units per layer
MAE in the range [0.1274, 0.1331]
Λ = 0: DNN trained only with the supervised loss.
Λ = 1: DNN trained with both supervised and unsupervised losses.
24. Conclusions
One-stage training of deep models combining supervised and unsupervised loss functions
Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
The approach is successful at training deep MLPs with logistic activations
Decaying the unsupervised losses during training is crucial
Time-series results encourage further research on this idea in on-line learning scenarios
25. Future Work
Better filters and models? Further research needed
Study the effect of using ReLU activations
Study alternatives to the exponential decay of the unsupervised losses: dynamic adaptation