Training deep neural networks has long been a difficult task. Recently, diverse approaches have been presented to tackle these difficulties, showing that deep models improve on the performance of shallow ones in areas such as signal processing, classification, and segmentation, across signal types, e.g. video, audio, and images. One of the most important methods is greedy layer-wise unsupervised pre-training followed by a fine-tuning phase. Despite the advantages of this procedure, it does not fit scenarios where real-time learning is needed, such as the adaptation of some time-series models. This paper proposes coupling both phases into one by modifying the loss function to mix the unsupervised and supervised parts together. Benchmark experiments with the MNIST database demonstrate the viability of the idea for simple image tasks, and experiments with time-series forecasting encourage incorporating it into on-line learning approaches. The interest of this method in time-series forecasting is motivated by the study of predictive models for domotic (home-automation) houses with intelligent control systems.
Integration of Unsupervised and Supervised Criteria for DNNs Training
1. Integration of Unsupervised and Supervised Criteria for DNNs Training
International Conference on Artificial Neural Networks (ICANN)
Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo
Department of Physical, Mathematical and Computer Sciences
Universidad CEU Cardenal Herrera
September 7th, 2016
2. Outline
1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results
   MNIST
   SML2010 temperature forecasting
5. Conclusions and Future Work
4. Motivation
Greedy layer-wise unsupervised pre-training is successful at training logistic MLPs. It has two training stages:
1. Pre-training with unsupervised data (SAEs or RBMs)
2. Fine-tuning parameters with supervised data
Very useful when large amounts of unsupervised data are available
But…
It is a greedy approach
Not valid for on-line learning scenarios
Less useful with small data sets
5. Motivation
Goals
Train a supervised model, layer-wise conditioned by an unsupervised loss:
Improving gradient flow
Learning better features
Every layer's parameters should be:
Useful for the global supervised task
Able to reconstruct their input (auto-encoders)
6. Motivation
Related work
Is Joint Training Better for Deep Auto-Encoders?, Y. Zhou et al. (2015), arXiv preprint (supervision only in a fine-tuning stage)
Preliminary work by P. Vincent et al. (2010), Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
Deep Learning via Semi-Supervised Embedding, Weston et al. (2008), ICML paper
8. Method description
How to do it
Risk = Supervised Loss + Σ_layer( Unsup. Loss )

R(θ, D) = (1/|D|) ∑_{(x,y)∈D} [ λ0 Ls(F(x; θ), y) + ∑_{k=1}^{H} λk U(k) ] + ϵ Ω(θ)

U(k) = Lu(Ak(h(k−1); θ), h(k−1)) for 1 ≤ k ≤ H, with λk ≥ 0

F(x; θ) is the MLP model
Ak(h(k−1); θ) is a denoising AE model
H is the number of hidden layers
h(0) = x
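A minimal code sketch may make the risk concrete. The slides do not specify an implementation, so the following PyTorch fragment is purely illustrative: it assumes sigmoid activations, masking noise for the denoising AEs, and binary cross-entropy as Lu; the class and function names (JointMLP, joint_risk) are made up here, and the ϵΩ(θ) regularizer is left to the optimizer's weight_decay.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMLP(nn.Module):
    """MLP whose hidden layers each carry a denoising auto-encoder loss U(k)."""
    def __init__(self, sizes, noise=0.2):
        super().__init__()
        # sizes = [inputs, hidden_1, ..., hidden_H, outputs]
        self.enc = nn.ModuleList(nn.Linear(a, b)
                                 for a, b in zip(sizes[:-2], sizes[1:-1]))
        self.dec = nn.ModuleList(nn.Linear(b, a)   # A_k reconstructs h(k-1)
                                 for a, b in zip(sizes[:-2], sizes[1:-1]))
        self.out = nn.Linear(sizes[-2], sizes[-1])
        self.noise = noise

    def forward(self, x):
        h, unsup = x, []
        for enc, dec in zip(self.enc, self.dec):
            # Denoising AE at layer k: reconstruct h(k-1) from a corrupted copy.
            corrupted = h * (torch.rand_like(h) > self.noise).float()
            recon = torch.sigmoid(dec(torch.sigmoid(enc(corrupted))))
            unsup.append(F.binary_cross_entropy(recon, h.detach()))   # U(k)
            h = torch.sigmoid(enc(h))        # clean activation feeds forward
        return self.out(h), unsup            # logits for Ls, list of U(k)

def joint_risk(logits, y, unsup, lambdas):
    """R = lambda_0 * Ls + sum_k lambda_k * U(k); lambdas has length H + 1."""
    risk = lambdas[0] * F.cross_entropy(logits, y)
    for lam, u in zip(lambdas[1:], unsup):
        risk = risk + lam * u
    return risk
```

A single optimizer step on joint_risk then updates all layers at once, which is what makes the method one-stage.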
9. Method description
How to do it
Risk = Supervised Loss + Σ_layer( Unsup. Loss )

R(θ, D) = (1/|D|) ∑_{(x,y)∈D} [ λ0 Ls(F(x; θ), y) + ∑_{k=1}^{H} λk U(k) ] + ϵ Ω(θ)

The λ vector mixes all the components
It should be updated every iteration:
Starting focused on the unsupervised criteria
Ending focused on the supervised criterion
12. λ Update Policies I
A λ update policy indicates how to change the λ vector at every iteration
The supervised part (λ0) can be fixed to 1
The unsupervised part should be important during the first iterations:
Losing focus as training proceeds
Becoming insignificant at the end
A greedy exponential decay (GED) suffices:
λ0(t) = 1 ;  λk(t) = Λ γ^t
with constants Λ > 0 and γ ∈ [0, 1]
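As a sketch, the GED policy is a one-liner per iteration (function name illustrative; the γ = 0.999 default matches the value explored in the MNIST experiments):

```python
def ged_lambdas(t, H, Lambda=1.0, gamma=0.999):
    """Greedy exponential decay: lambda_0(t) = 1;
    lambda_k(t) = Lambda * gamma**t for each of the H hidden layers."""
    return [1.0] + [Lambda * gamma ** t] * H

# With Lambda = 1 and gamma = 0.999 the unsupervised weight falls below
# 0.01 after roughly 4600 iterations, so late training is mostly supervised.
```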
13. λ Update Policies II
Exponential decay is the simplest approach, but other policies are possible:
Ratio between loss functions (see the sketch after this list)
Ratio between gradients at each layer
A combination of them
…
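The slide leaves these alternatives open. As a purely hypothetical illustration, not taken from the paper, a loss-ratio policy might rescale each λk so that every U(k) contributes a fixed fraction of the current supervised loss:

```python
def loss_ratio_lambdas(sup_loss, unsup_losses, scale=0.1, eps=1e-8):
    """Hypothetical policy (all names illustrative): choose lambda_k so that
    lambda_k * U(k) equals `scale` times the current supervised loss."""
    return [1.0] + [scale * sup_loss / (u + eps) for u in unsup_losses]
```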
15. Experiments and Results (MNIST) I
Benchmark with the MNIST dataset
Logistic activation functions, softmax output
Cross-entropy for supervised and unsupervised losses
Classification error as evaluation measure
Effect of MLP topology and of the initial value Λ of the λk weights
Sensitivity study of the exponential decay term γ
Comparison with other models from the literature
16. Experiments and Results (MNIST) II
Test error (%) with 95% confidence intervals:

Data set   SAE-3        SDAE-3       GED-3
MNIST      1.40±0.23    1.28±0.22    1.22±0.22
basic      3.46±0.16    2.84±0.15    2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010)
19. Experiments and Results (MNIST) V
[Figure: first-layer filters (16 of 2048 units), comparing only-supervised training, γ = 0.999, and γ = 1.000]
20. Experiments and Results (SML2010) I
SML2010 UCI data set: indoor temperature forecasting
Logistic hidden activation functions, linear output
48 inputs (12 hours) and 12 outputs (3 hours), i.e. 15-minute samples (a windowing sketch follows below)
Mean Square Error (MSE) as supervised loss
Cross-entropy as unsupervised losses
Mean Absolute Error (MAE) for evaluation
Compared MLPs with/without unsupervised losses
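The 48-input/12-output setup implies 15-minute samples. The slides do not detail the preprocessing, but a sliding-window construction along these lines is the natural assumption:

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Slice a 15-min-sampled temperature series into
    (12 h of history, next 3 h to predict) pairs."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        Y.append(series[i + n_in : i + n_in + n_out])
    return np.asarray(X), np.asarray(Y)
```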
21. Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements were marked in red in the original slide.
Λ = 0: DNN trained only with the supervised loss.
Λ = 1: DNN trained with both supervised and unsupervised losses.
22. Experiments and Results (SML2010) III
Test results for the 3-layer model with 64 neurons per hidden layer:
MAE 0.1274 when Λ = 0
MAE 0.1177 when Λ = 1
With Λ = 1 it was possible to train DNNs of up to 10 layers with 64 hidden units per layer
MAE in the range [0.1274, 0.1331]
Λ = 0: DNN trained only with the supervised loss.
Λ = 1: DNN trained with both supervised and unsupervised losses.
24. Conclusions
One-stage training of deep models combining supervised and unsupervised loss functions
Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
The approach is successful at training deep MLPs with logistic activations
Decaying the unsupervised losses during training is crucial
Time-series results encourage further research on this idea in on-line learning scenarios
25. Future Work
Better filters and models? Further research needed
Study the effect of using ReLU activations
Study alternatives to the exponential decay of the unsupervised losses: dynamic adaptation