Scalable Gradient-Based Tuning of
Continuous Regularization Hyperparameters
ICML 2016
1 citation
Jelena Luketina
Aalto University, Finland
Mathias Berglund
Aalto University, Finland
Klaus Greff
The Swiss AI Lab, Switzerland
Tapani Raiko
Aalto University, Finland
Today I’m going to present a paper called “Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters.” It was published at the most recent ICML and so far has one citation. From the title, you can expect that this paper aims to speed up the hyperparameter selection procedure for neural networks.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Let me give a motivating example to show why speeding up the hyperparameter selection procedure matters. Suppose you have designed a neural network model. For example, for the task of image captioning, you are given an image and have to output a sentence describing it, so you might try a model like this one. After deciding on a model, the next step is to evaluate how well it performs.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
However, there are many hyperparameters for training:
optimizer: learning rate, gradient clipping, weight norm
regularizer: l1/l2 penalty strength, dropout rate, … etc.
So you start training your model, but there are many hyperparameters to decide. For example, we will train the network with an iterative procedure like SGD, so you have to decide the learning rate; and most of the time for RNNs we use additional techniques like gradient clipping or weight-norm constraints to prevent the error from suddenly shooting up, so you also have to decide the clipping threshold. And if you observe that the network is overfitting the data, you might want to use a regularizer, such as adding l1 or l2 penalty terms to the objective function, or changing the dropout rate of your model, and so on.
So, to evaluate whether it is a good model, there are many hyperparameters that determine whether training succeeds, and together they form an exponentially large number of combinations to try.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Training
Validation
Testing
So, you decide by first dividing the dataset into three sets
Unfortunately, there is currently no good algorithm for finding hyperparameters; mostly we use a brute-force method like grid search, or come up with candidate hyperparameters ourselves. Either way, we pick the best combination using a held-out validation procedure. It goes like this: first, we divide the dataset into three sets — a training set, a validation set, and a test set.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Training
Validation
Testing
Then, train on the training set multiple times, each time with a different setting
optimizer: learning rate, gradient clipping, weight norm
regularizer: l1/l2 penalty strength, dropout rate, … etc.
We use the training set to train the network multiple times, each time under a different hyperparameter setting.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Training
Validation
Testing
and pick the best-performing one on the validation set
Then, train on the training set multiple times, each time with a different setting
optimizer: learning rate, gradient clipping, weight norm
regularizer: l1/l2 penalty strength, dropout rate, … etc.
At the end of each training round, the model is evaluated on the validation set, and the combination with the best validation performance gives our best hyperparameters.
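To make the procedure concrete, here is a minimal, runnable sketch of this trial-and-error loop using scikit-learn’s Ridge on synthetic data; the model, data, and candidate grid are illustrative assumptions, not the paper’s setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=300)

# Divide the dataset into the three sets: training / validation / testing.
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

best_err, best_alpha = np.inf, None
for alpha in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:  # one full training per setting
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_err = np.mean((model.predict(X_va) - y_va) ** 2)
    if val_err < best_err:                   # keep the best on validation
        best_err, best_alpha = val_err, alpha

# Only the winning setting is evaluated once on the held-out test set.
final = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_err = np.mean((final.predict(X_te) - y_te) ** 2)
print(f"best alpha={best_alpha}, test MSE={test_err:.3f}")
```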
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Training
Validation
Testing
However, this trial-and-error is time-consuming
optimizer: learning rate, gradient clipping, weight norm
regularizer: l1/l2 penalty strength, dropout rate, … etc.
But this hyperparameter selection procedure requires multiple full training rounds; it is very inefficient and time-consuming. It only works well when you have enough experience or good intuitions to keep the search space small.
Motivation
After designing a NN model,
you’d like to see how good it performs
[Diagram: image-captioning model — img-256 + cur_wd → Embedding-100 → RNN-512 → next_wd (e.g., “a” → “cat”)]
Training
Validation
Testing
Wouldn’t it be nice if the NN could learn hyperparameters by itself?
optimizer: learning rate, gradient clipping, weight norm
regularizer: l1/l2 penalty strength, dropout rate, … etc.
So, wouldn’t it be nice if we could make the network learn those training hyperparameters by itself?
Problem Formulation
● Suppose an objective function $\tilde{C}(\theta, \lambda) = C(\theta) + \Omega(\theta, \lambda)$, where $\theta$ is the model parameter and $\lambda$ is the regularization hyperparameter
● How to learn $\lambda$ as well as $\theta$ during training?
Let’s define the problem formally. Suppose our objective function is C, and we add a regularization term Omega that penalizes the model parameters theta, with the penalty strength controlled by lambda.
Normally, we only learn theta and keep lambda fixed during a training round. So the problem is: how can we learn both of them simultaneously?
You might wonder, since there are so many kinds of hyperparameters, why this paper only targets regularization hyperparameters. It is because, for regularization hyperparameters, you can write their form down explicitly in the objective function, so they are the easiest ones to handle.
Algorithm
[Diagram: two alternating phases — phase 1: model parameter update, regularization hyperparameter fixed; phase 2: model parameter fixed, regularization hyperparameter update]
● Take turns to descend model parameter and
regularization hyperparameter
Their method is intuitively straightforward: the model parameters and the hyperparameters take turns descending. During the model-parameter descent phase, the regularization hyperparameter is kept fixed; during the hyperparameter descent phase, the model parameters are fixed. One thing to note is that the two descent phases use different objectives.
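As a hedged illustration of this alternating scheme, here is a toy sketch on ridge regression, where both gradients can be written analytically. The data, step sizes, and the single L2 hyperparameter lam are my own assumptions for the sketch, not the paper’s experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 10)), rng.normal(size=50)
X_va, y_va = rng.normal(size=(50, 10)), rng.normal(size=50)

theta, lam = np.zeros(10), 0.1       # model parameter and L2 hyperparameter
eta, eta_lam = 0.01, 1e-3            # the two step sizes (assumed values)

def grad_train(th, lam):
    # gradient of C1(th, lam) = ||X th - y||^2 / n + lam * ||th||^2 w.r.t. th
    return 2 * X_tr.T @ (X_tr @ th - y_tr) / len(y_tr) + 2 * lam * th

def grad_val(th):
    # gradient of the unregularized validation cost C2 w.r.t. th
    return 2 * X_va.T @ (X_va @ th - y_va) / len(y_va)

for _ in range(500):
    theta_prev = theta
    theta = theta - eta * grad_train(theta, lam)   # phase 1: lam held fixed
    # phase 2: theta held fixed; the chain rule gives
    # dC2/dlam = grad_val(theta_new) . d(theta_new)/d(lam),
    # and d(theta_new)/d(lam) = -eta * 2 * theta_prev for the L2 penalty.
    hypergrad = grad_val(theta) @ (-2 * eta * theta_prev)
    lam = max(lam - eta_lam * hypergrad, 0.0)      # keep the penalty non-negative

print(f"learned lam = {lam:.4f}")
```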
Algorithm
● training objective
This is the normal training objective for descending the model parameters: the regularized cost measured on the training set.
Algorithm
● training objective
● validation objective
But in the hyperparameter descent phase, we measure performance on the validation set, and there is no need to add the regularization terms, since the model parameters are fixed there. So the validation objective is an unregularized objective.
Algorithm
● training objective
● validation objective
● model parameter update
The model is trained by SGD. The model parameter update is the same as usual: take the gradient of the training objective with respect to theta, then descend with step size eta.
Algorithm
● training objective
● validation objective
● model parameter update
● regularization hyperparameter update
After each model parameter update, we update the regularization hyperparameter, taking the gradient of the validation objective with respect to lambda.
Algorithm
● training objective
● validation objective
● model parameter update
● regularization hyperparameter update
Since there are no explicit penalty terms in the validation objective, lambda indeed only appears through theta, via the model parameter update.
Algorithm
● training objective
● validation objective
● model parameter update
● regularization hyperparameter update
We use the chain rule to expand the gradient: first take the gradient of the validation objective with respect to theta, then multiply by the gradient of theta with respect to lambda.
Algorithm
● training objective
● validation objective
● model parameter update
● regularization hyperparameter update
Lambda appears only in the gradient term of the model parameter update, so the formula can be simplified further.
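The update equations on these slides were images and did not survive extraction, so here is a sketch of the formulas as reconstructed from the notes above; the symbols $\eta$ and $\eta_\lambda$ for the two step sizes are my own notation.

```latex
% T1-T2 updates as reconstructed from the surrounding notes (sketch).
% C_1: regularized training objective, C_2: unregularized validation objective.
C_1(\theta,\lambda) = C(\theta;\mathcal{D}_{\mathrm{train}}) + \Omega(\theta,\lambda),
\qquad
C_2(\theta) = C(\theta;\mathcal{D}_{\mathrm{valid}})

% Model parameter update (lambda fixed):
\theta_{t+1} = \theta_t - \eta\,\nabla_\theta C_1(\theta_t,\lambda_t)

% Hyperparameter update, expanded by the chain rule; since lambda enters
% only through the parameter step, the last, simplified form follows:
\nabla_\lambda C_2(\theta_{t+1})
  = \Big(\frac{\partial \theta_{t+1}}{\partial \lambda}\Big)^{\!\top}
    \nabla_\theta C_2(\theta_{t+1})
  = -\eta\,\big(\nabla_\lambda \nabla_\theta \Omega(\theta_t,\lambda_t)\big)^{\!\top}
    \nabla_\theta C_2(\theta_{t+1})

\lambda_{t+1} = \lambda_t - \eta_\lambda\,\nabla_\lambda C_2(\theta_{t+1})
```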
Experiment (1/4)
● Regularization
– noise injection
● to the input, by Gaussian noise with tunable std
● to each hidden layer, by Gaussian noise with tunable std
– L2 penalty
● on each hidden layer, with tunable strength
In their experiments they use two kinds of regularization. One is noise injection: Gaussian noise added to the input and to each hidden layer. You might wonder why adding noise to the input introduces a regularization term in the objective function. You can think of it this way: if you define the original objective function as the one measured on the clean input, then the penalty term is the difference between the cost measured on the noisy input and the cost on the clean input.
The second regularizer in their experiments is the L2 norm penalty, applied separately to the parameters of each layer.
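To illustrate the note above, here is a small runnable sketch (my own toy, not the paper’s setup) showing that for a quadratic loss the implicit penalty from input noise approaches sigma^2 * ||w||^2, an L2-like term.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10)          # noiseless linear targets (toy assumption)
w = rng.normal(size=10)              # some fixed model parameters
sigma = 0.3                          # input-noise std: the tuned hyperparameter

def cost(inputs):
    return np.mean((inputs @ w - y) ** 2)

# Penalty term = (expected cost on noisy input) - (cost on clean input).
noisy_costs = [cost(X + rng.normal(scale=sigma, size=X.shape)) for _ in range(200)]
implicit_penalty = np.mean(noisy_costs) - cost(X)

# For a quadratic loss this approaches sigma^2 * ||w||^2, an L2-like penalty.
print(implicit_penalty, sigma**2 * (w @ w))
```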
Experiment (1/4)
● Hyperparameter trajectory
The first figure shows the training trajectories of the hyperparameters, drawn over a background colored by test negative log-likelihood (larger is better). The star-shaped markers represent the starting values of the regularization hyperparameters before training, and the square markers represent the end points after training. You can see that, with their method, the regularization hyperparameters gradually move toward better values.
Experiment (2/4)
● fixed hyperparameter
● T1-T2
In the next figure, they compare their method, called T1-T2, to normal training by plotting the training histories; T1 refers to the training objective and T2 to the validation objective. You can see that the model trained normally overfits the dataset and only descends to a validation error of about 0.2, while the model trained with their method overfits less and descends to a lower validation error of about 0.1.
Experiment (3/4)
● T1-T2 as meta-algorithm
They want to answer the question: in normal training we assume the hyperparameters are fixed, so does their method, which changes the hyperparameters along the way, have any negative effect on final performance? They run an experiment by first applying their T1-T2 method, then using the values found by T1-T2 to rerun a fixed-hyperparameter experiment. This experiment is run on CIFAR-10; the x-axis is the test error of their method, and the y-axis is the test error of the rerun. The answer is yes, there is a slight negative effect, since the test error of the rerun is better. So they conclude that their method can be used as a meta-algorithm for finding good hyperparameters for normal training.
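The meta-algorithm usage amounts to a two-stage recipe; a minimal sketch, assuming hypothetical helpers t1_t2_train, train_fixed, and evaluate standing in for the procedures described above:

```python
# Sketch only: t1_t2_train, train_fixed, and evaluate are hypothetical helpers.
lam_final, _ = t1_t2_train(train_set, valid_set, lam_init=0.5)  # stage 1: tune lambda online
model = train_fixed(train_set, lam=lam_final)   # stage 2: ordinary run, lambda held fixed
test_error = evaluate(model, test_set)
```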
Experiment (4/4)
● does T1-T2 overfit the validation set?
Next, since the validation set is now much more directly involved in training, they want to know whether performance measured on the validation set is still indicative of generalization performance. They plot validation error versus test error, and the two are linearly correlated. Although there seems to be slight overfitting on the SVHN dataset, validation error remains a good indicator of final performance.
Conclusion
● Introduces a novel and significant topic, and the method makes intuitive sense