2. Hyperparameters vs Model Parameters
● Learning rate
● Momentum, or the hyperparameters of the Adam optimization algorithm
● Number of layers
● Number of hidden units
● Mini-batch size
● Activation function
● Number of epochs
● ...
3. Effect of Learning Rate
1. How fast the algorithm learns
2. Whether the cost function is minimized or not
Source: Understanding Learning Rate in Machine Learning
9. Why use a learning rate schedule?
● Too small a learning rate and your neural network may not learn at all
● Too large a learning rate and you may overshoot areas of low loss (or even
overfit from the start of training)
➔ Find a set of reasonably “good” weights early in training with a larger learning rate.
➔ Tune these weights later in training to find more optimal weights with a smaller learning rate (see the sketch below).
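As a concrete illustration of this two-phase idea, a minimal step-decay schedule with a tf.keras LearningRateScheduler callback (the initial rate, drop factor, and drop interval are illustrative assumptions, not values from the slides):

import math
import tensorflow as tf

def step_decay(epoch, lr):
    # Larger LR early to find a good region, then halve it every 10 epochs
    # so later epochs fine-tune the weights with smaller steps.
    initial_lr, drop, epochs_per_drop = 1e-2, 0.5, 10
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))

schedule_cb = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
# model.fit(x_train, y_train, epochs=50, callbacks=[schedule_cb])  # `model` and data are assumed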
17. Let the LR vary cyclically between boundary values, and estimate reasonable bounds for those boundaries.
18. Claims & Proposal
● We don’t know what the optimal initial learning rate is.
● Monotonically decreasing our learning rate may lead to our network getting
“stuck” in plateaus of the loss landscape.
● Define a minimum learning rate
● Define a maximum learning rate
● Allow the learning rate to oscillate cyclically between the two bounds
19. Escaping from Saddle Points
[Figure: loss surfaces contrasting a saddle point with a convex function, their critical points, and the gradient-descent update rule.]
Source: Escaping from Saddle Points
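For reference, the plain gradient-descent update rule the figure refers to is w ← w − η·∇L(w); near a saddle point the gradient is close to zero in every direction, so a fixed small η makes progress very slow, which is one motivation for letting the learning rate rise periodically.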
21. CLR - Policies
● batch size: number of training examples per weight update
● batches / iterations: number of weight updates per epoch (total training examples / batch size)
● cycle: the number of iterations it takes for the LR to go lower → upper → lower
● step size: number of iterations in a half cycle (see the sketch below the link)
https://github.com/bckenstler/CLR
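A minimal sketch of the triangular policy using exactly these terms (the base_lr, max_lr, and step_size values are illustrative assumptions; the repository above provides a complete Keras callback):

import numpy as np

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    # step_size = iterations in a half cycle, so one full cycle is 2 * step_size iterations.
    cycle = np.floor(1 + iteration / (2 * step_size))
    # x is 1 at the lower bound and 0 at the peak of the current cycle.
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)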
29.
● Recommended minimum: the LR at which the loss decreases the fastest (steepest negative slope of the loss-vs-LR curve)
● Recommended maximum: 10 times smaller (one order of magnitude lower) than the LR at which the loss is at its minimum (if the loss is lowest at 0.1, a good value to start with is 0.01)
Source: The Learning Rate Finder Technique: How Reliable Is It?
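A minimal sketch of such a learning rate range test as a tf.keras callback (start_lr, end_lr, num_steps, and the divergence threshold are illustrative assumptions):

import numpy as np
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    # Sweep the LR exponentially over one short run and record (lr, loss) pairs;
    # the min/max bounds above are then read off the resulting loss-vs-LR curve.
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=500):
        super().__init__()
        self.lrs, self.history = np.geomspace(start_lr, end_lr, num_steps), []

    def on_train_batch_begin(self, batch, logs=None):
        idx = min(len(self.history), len(self.lrs) - 1)
        self.current_lr = float(self.lrs[idx])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, self.current_lr)

    def on_train_batch_end(self, batch, logs=None):
        loss = logs["loss"]
        self.history.append((self.current_lr, loss))
        # Stop early once the loss clearly diverges.
        if loss > 4 * min(l for _, l in self.history):
            self.model.stop_training = True

Running model.fit for one epoch with this callback and plotting self.history gives the loss-vs-LR curve from which the two bounds are read.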
30. Reminder
● Use the same initial weights for the LR finder and the subsequent model training.
● Simply keep a copy of the model weights and reset them afterwards, so that they are “as they were” before the learning rate finder was run (see the sketch below).
● Never assume that the learning rates found are the best for any model initialization ❌
● Setting a narrower range than what is recommended is safer and can reduce the risk of divergence due to very high learning rates.
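A minimal sketch of that reset, assuming a compiled tf.keras model (model, x_train, y_train are placeholders) and the LRFinder sketch from the previous slide:

initial_weights = model.get_weights()        # copy of the weights before the sweep

lr_finder = LRFinder()
model.fit(x_train, y_train, epochs=1, callbacks=[lr_finder])

model.set_weights(initial_weights)           # weights are "as they were" again
# ...now train for real, with the same initialization the finder saw.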
31.
● min: where the loss decreases the fastest
● max: slightly narrower than one order of magnitude below the LR at the loss minimum
● Higher batch size → higher learning rate
Source: The Learning Rate Finder Technique: How Reliable Is It?
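One common way to act on the last point is a linear-scaling heuristic (an assumption on my part, not stated on the slide): scale the LR in proportion to the batch size.

def scaled_lr(batch_size, base_lr=1e-3, base_batch=64):
    # Doubling the batch size roughly doubles the LR you can afford.
    return base_lr * batch_size / base_batch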
35. Learning Rate (One Cycle Policy)
[Figure: the LR rises from 0.08 to 0.8 and back over one cycle; the fastai modification uses cosine descent instead of a linear one.]
The maximum should be the value picked with a learning rate finder procedure.
Source: Finding Good Learning Rate and The One Cycle Policy.
39. fastai source: fine_tune and fit_one_cycle

@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    # Phase 1: train only the head; pct_start=0.99 makes almost the whole phase a warm-up.
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    # Phase 2: train the whole network with discriminative LRs
    # (earliest layers get base_lr/lr_mult, the head gets base_lr).
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)

@log_args(but_as=Learner.fit)
def fit_one_cycle(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=None,
                  moms=None, cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    # LR: cosine warm-up from lr_max/div to lr_max over the first pct_start of training,
    # then cosine annealing down to lr_max/div_final.
    # Momentum moves in the opposite direction (high -> low -> high).
    scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
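A hedged usage sketch of the two methods above, assuming a fastai v2 vision setup (dls is your DataLoaders; the suggested-LR attribute names returned by lr_find differ between fastai versions):

from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.lr_find()                      # LR range test to pick a sensible base_lr
learn.fine_tune(3, base_lr=2e-3)     # 1 frozen epoch, then 3 unfrozen epochs with discriminative LRs
learn.fit_one_cycle(5, lr_max=2e-3)  # or drive the 1cycle policy directly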
40. References
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch)
● Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 👍
● Understanding Learning Rate in Machine Learning
● Learning Rate Schedules in Deep Learning
● Setting the learning rate of your neural network
● Exploring Super-Convergence 👍
● The Learning Rate Finder Technique: How Reliable Is It?
41. References - One Cycle
● One-cycle learning rate schedulers (Kaggle)
● Finding Good Learning Rate and The One Cycle Policy. 👍
● The 1cycle policy (fastbook author)
● Understanding Fastai's fit_one_cycle method 👍
42. Colab
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch) 💎
● TensorFlow Addons Optimizers: CyclicalLearningRate 👍
43. Further Reading
● Cyclical Learning Rates for Training Neural Networks (Smith, 2015)
● Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Smith & Topin, 2017)
● A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay (Smith, 2018)
● SGDR: Stochastic Gradient Descent with Warm Restarts (2016)
● Snapshot Ensembles: Train 1, get M for free (2017)
● A brief history of learning rate schedulers and adaptive optimizers 💎
● Faster Deep Learning Training with PyTorch – a 2021 Guide 💎