2. Hyperparameters vs Model Parameters
● Learning rate
● Momentum, or the hyperparameters of the Adam optimization algorithm
● Number of layers
● Number of hidden units
● Mini-batch size
● Activation function
● Number of epochs
● ...
3. Effect of Learning Rate
1. How fast the algorithm learns
2. Whether the cost function is minimized or not
Source: Understanding Learning Rate in Machine Learning
9. Why use a learning rate schedule?
● Too small a learning rate and your neural network may not learn at all
● Too large a learning rate and you may overshoot areas of low loss (or even
overfit from the start of training)
➔ Find a set of reasonably “good” weights early in training with a larger learning rate.
➔ Tune these weights later in training to find more optimal weights with a smaller learning rate (see the sketch below).
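As a concrete illustration of this two-phase idea, a minimal step-decay schedule with a tf.keras LearningRateScheduler callback (the initial rate, drop factor, and drop interval are illustrative assumptions, not values from the slides):

import math
import tensorflow as tf

def step_decay(epoch, lr):
    # Larger LR early to find a good region, then halve it every 10 epochs
    # so later epochs fine-tune the weights with smaller steps.
    initial_lr, drop, epochs_per_drop = 1e-2, 0.5, 10
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))

schedule_cb = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
# model.fit(x_train, y_train, epochs=50, callbacks=[schedule_cb])  # `model` and data are assumed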
17. Let the LR vary cyclically between boundary values, and estimate reasonable bounds for those boundaries.
18. Claims & Proposal
● We don’t know what the optimal initial learning rate is.
● Monotonically decreasing our learning rate may lead to our network getting
“stuck” in plateaus of the loss landscape.
● Define a minimum learning rate
● Define a maximum learning rate
● Allow the learning rate to oscillate cyclically between the two bounds
19. Escaping from Saddle Points
[Figure: loss surfaces contrasting a saddle point with a convex function, their critical points, and the gradient-descent update rule.]
Source: Escaping from Saddle Points
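For reference, the plain gradient-descent update rule the figure refers to is w ← w − η·∇L(w); near a saddle point the gradient is close to zero in every direction, so a fixed small η makes progress very slow, which is one motivation for letting the learning rate rise periodically.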
21. CLR - Policies
● batch size: number of training examples per weight update
● batches / iterations: number of weight updates per epoch (total training examples / batch size)
● cycle: the number of iterations it takes for the LR to go lower → upper → lower
● step size: number of iterations in a half cycle (see the sketch below the link)
https://github.com/bckenstler/CLR
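A minimal sketch of the triangular policy using exactly these terms (the base_lr, max_lr, and step_size values are illustrative assumptions; the repository above provides a complete Keras callback):

import numpy as np

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    # step_size = iterations in a half cycle, so one full cycle is 2 * step_size iterations.
    cycle = np.floor(1 + iteration / (2 * step_size))
    # x is 1 at the lower bound and 0 at the peak of the current cycle.
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)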
29.
● Recommended minimum: the LR at which the loss decreases the fastest (steepest negative slope of the loss-vs-LR curve)
● Recommended maximum: 10 times smaller (one order of magnitude lower) than the LR at which the loss is at its minimum (if the loss is lowest at 0.1, a good value to start with is 0.01)
Source: The Learning Rate Finder Technique: How Reliable Is It?
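A minimal sketch of such a learning rate range test as a tf.keras callback (start_lr, end_lr, num_steps, and the divergence threshold are illustrative assumptions):

import numpy as np
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    # Sweep the LR exponentially over one short run and record (lr, loss) pairs;
    # the min/max bounds above are then read off the resulting loss-vs-LR curve.
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=500):
        super().__init__()
        self.lrs, self.history = np.geomspace(start_lr, end_lr, num_steps), []

    def on_train_batch_begin(self, batch, logs=None):
        idx = min(len(self.history), len(self.lrs) - 1)
        self.current_lr = float(self.lrs[idx])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, self.current_lr)

    def on_train_batch_end(self, batch, logs=None):
        loss = logs["loss"]
        self.history.append((self.current_lr, loss))
        # Stop early once the loss clearly diverges.
        if loss > 4 * min(l for _, l in self.history):
            self.model.stop_training = True

Running model.fit for one epoch with this callback and plotting self.history gives the loss-vs-LR curve from which the two bounds are read.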
30. Reminder
● Use the same initial weights for the LR finder and the subsequent model training.
● Simply keep a copy of the model weights and reset them afterwards, so that they are “as they were” before the learning rate finder was run (see the sketch below).
● Never assume that the learning rates found are the best for any model initialization ❌
● Setting a narrower range than what is recommended is safer and can reduce the risk of divergence due to very high learning rates.
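A minimal sketch of that reset, assuming a compiled tf.keras model (model, x_train, y_train are placeholders) and the LRFinder sketch from the previous slide:

initial_weights = model.get_weights()        # copy of the weights before the sweep

lr_finder = LRFinder()
model.fit(x_train, y_train, epochs=1, callbacks=[lr_finder])

model.set_weights(initial_weights)           # weights are "as they were" again
# ...now train for real, with the same initialization the finder saw.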
31.
● min: where the loss decreases the fastest
● max: slightly narrower than one order of magnitude below the LR at the loss minimum
● Higher batch size → higher learning rate
Source: The Learning Rate Finder Technique: How Reliable Is It?
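One common way to act on the last point is a linear-scaling heuristic (an assumption on my part, not stated on the slide): scale the LR in proportion to the batch size.

def scaled_lr(batch_size, base_lr=1e-3, base_batch=64):
    # Doubling the batch size roughly doubles the LR you can afford.
    return base_lr * batch_size / base_batch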
35. Learning Rate (One Cycle Policy)
[Figure: the LR rises from 0.08 to 0.8 and back over one cycle; the fastai modification uses cosine descent instead of a linear one.]
The maximum should be the value picked with a learning rate finder procedure.
Source: Finding Good Learning Rate and The One Cycle Policy.
39. fastai source: fine_tune and fit_one_cycle

@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    # Phase 1: train only the head; pct_start=0.99 makes almost the whole phase a warm-up.
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    # Phase 2: train the whole network with discriminative LRs
    # (earliest layers get base_lr/lr_mult, the head gets base_lr).
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)

@log_args(but_as=Learner.fit)
def fit_one_cycle(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=None,
                  moms=None, cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    # LR: cosine warm-up from lr_max/div to lr_max over the first pct_start of training,
    # then cosine annealing down to lr_max/div_final.
    # Momentum moves in the opposite direction (high -> low -> high).
    scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
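A hedged usage sketch of the two methods above, assuming a fastai v2 vision setup (dls is your DataLoaders; the suggested-LR attribute names returned by lr_find differ between fastai versions):

from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.lr_find()                      # LR range test to pick a sensible base_lr
learn.fine_tune(3, base_lr=2e-3)     # 1 frozen epoch, then 3 unfrozen epochs with discriminative LRs
learn.fit_one_cycle(5, lr_max=2e-3)  # or drive the 1cycle policy directly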
40. References
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch)
● Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 👍
● Understanding Learning Rate in Machine Learning
● Learning Rate Schedules in Deep Learning
● Setting the learning rate of your neural network
● Exploring Super-Convergence 👍
● The Learning Rate Finder Technique: How Reliable Is It?
41. References - One Cycle
● One-cycle learning rate schedulers (Kaggle)
● Finding Good Learning Rate and The One Cycle Policy. 👍
● The 1cycle policy (fastbook author)
● Understanding Fastai's fit_one_cycle method 👍
42. Colab
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch) 💎
● TensorFlow Addons Optimizers: CyclicalLearningRate 👍
43. Further Reading
● Cyclical Learning Rates for Training Neural Networks (Smith, 2015)
● Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Smith & Topin, 2017)
● A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay (Smith, 2018)
● SGDR: Stochastic Gradient Descent with Warm Restarts (2016)
● Snapshot Ensembles: Train 1, get M for free (2017)
● A brief history of learning rate schedulers and adaptive optimizers 💎
● Faster Deep Learning Training with PyTorch – a 2021 Guide 💎