Tuning Learning Rate
Taka Wang
20210727
Hyperparameters vs Model Parameters
● Learning rate
● Momentum, or the hyperparameters of the Adam optimization algorithm
● Number of layers
● Number of hidden units
● Mini-batch size
● Activation function
● Number of epochs
● ...
2
1. How fast the algorithm learns
2. Whether the cost function is minimized or not
Effect of Learning Rate
3
Source: Understanding Learning Rate in Machine Learning
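A toy 1-D gradient-descent sketch to make both effects concrete (the learning rates and step count are illustrative only):

# Minimize f(w) = w^2 with plain gradient descent; the gradient is 2w.
def descend(lr, steps=20, w=5.0):
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(descend(0.01))  # too small: after 20 steps w is still far from 0
print(descend(0.3))   # reasonable: converges close to 0
print(descend(1.1))   # too large: |w| grows every step, the loss diverges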
4
Source: Setting the learning rate of your neural network.
5
Source: Understanding Fastai's fit_one_cycle method
Adjust Learning Rate During Training
● Adaptive Learning Rate Methods (AdaGrad, Adam, etc.)
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
6
Source: How do we decide the optimizer used for training?
Learning Rate Schedule
8
9
Why use learning rate schedule?
● Too small a learning rate and your neural network may not learn at all
● Too large a learning rate and you may overshoot areas of low loss (or even overfit from the start of training)
➔ Find a set of reasonably “good” weights early in the training process with a larger learning rate.
➔ Tune these weights later in the process to find more optimal weights with a smaller learning rate.
10
Learning Rate Schedule
● Time-based decay
● Linear decay
● Step decay (Piecewise Constant
Decay)
● Polynomial decay
● Exponential decay
Two Methods:
● Built-in Schedules
● Custom Callbacks (every batch)
Keras Example
import tensorflow as tf
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32')/255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])
model.fit(x_train, y_train, epochs=10, verbose=0, callbacks=[])
11
12
Time-based decay (InverseTimeDecay)
13
lr_fn = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=1.0,
    decay_rate=0.5
)  # lr = initial_learning_rate / (1 + decay_rate * step / decay_steps)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(x_train, y_train, epochs=5)
Source: Learning Rate Schedules in Deep Learning
Step Decay
14
import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

class StepDecay:
    def __init__(self, initAlpha=0.01, factor=0.25, dropEvery=10):
        self.initAlpha = initAlpha
        self.factor = factor
        self.dropEvery = dropEvery

    def __call__(self, epoch):
        # compute the learning rate for the current epoch
        exp = np.floor((1 + epoch) / self.dropEvery)
        alpha = self.initAlpha * (self.factor ** exp)
        return float(alpha)  # learning rate

schedule = StepDecay(initAlpha=0.01, factor=0.25, dropEvery=10)
cb = [LearningRateScheduler(schedule)]
model.fit(x_train, y_train, epochs=10, callbacks=cb)
Linear Decay & Polynomial Decay
15
Learning rate is decayed to zero over a fixed number of epochs.
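A minimal Keras sketch of linear decay using the built-in PolynomialDecay schedule (power=1.0 makes it linear; steps_per_epoch is a placeholder for your own value):

lr_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1,
    decay_steps=10 * steps_per_epoch,  # decay to zero over e.g. 10 epochs
    end_learning_rate=0.0,
    power=1.0  # 1.0 = linear decay; larger values give polynomial decay
)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
              loss='categorical_crossentropy', metrics=['accuracy'])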
Cyclical Learning Rate
16
17
Let the LR vary cyclically between boundary values
Estimate reasonable bounds
Estimate reasonable
bounds
Claims & Proposal
● We don’t know what the optimal initial learning rate is.
● Monotonically decreasing our learning rate may lead to our network getting
“stuck” in plateaus of the loss landscape.
18
● Define a minimum learning rate
● Define a maximum learning rate
● Allow the learning rate to oscillate cyclically between the two bounds
Source: Escaping from Saddle Points
(Figure labels: saddle point, convex function, critical point, update rule)
Loss Landscape
20
model architecture & dataset
Source: VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS
21
CLR - Policies
● batch size: number of training examples per weight update
● batch or iteration: number of weight updates per epoch (total training examples / batch size)
● cycle: number of iterations for the LR to go lower -> upper -> lower
● step size: number of iterations in a half cycle (see the sketch below)
https://github.com/bckenstler/CLR
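From these definitions the triangular policy can be written in a few lines; a sketch following the formula given in the bckenstler/CLR README (iteration counts from 0):

import numpy as np

def triangular_lr(iteration, step_size, base_lr, max_lr):
    cycle = np.floor(1 + iteration / (2 * step_size))      # which cycle we are in
    x = np.abs(iteration / step_size - 2 * cycle + 1)      # position within the cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)  # linear ramp up, then down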
Implementations
22
from tensorflow.keras.optimizers import SGD
from clr_callback import CyclicLR  # CyclicLR callback from https://github.com/bckenstler/CLR

opt = SGD(lr=config.MIN_LR, momentum=0.9)
model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# initialize the cyclical learning rate callback
clr = CyclicLR(
mode="triangular",
base_lr=config.MIN_LR,
max_lr=config.MAX_LR,
step_size= config.STEP_SIZE * (trainX.shape[0] // config.BATCH_SIZE),
)
model.fit(
...,
steps_per_epoch=trainX.shape[0] // config.BATCH_SIZE,
epochs=config.NUM_EPOCHS,
callbacks=[clr])
MIN_LR = 1e-7
MAX_LR = 1e-2
BATCH_SIZE = 64
STEP_SIZE = 8  # 4 or 8
CLR_METHOD = "triangular"
NUM_EPOCHS = 96
https://github.com/bckenstler/CLR
TensorFlow Addons Optimizers
23
!pip install -q -U tensorflow_addons
import tensorflow_addons as tfa
...
steps_per_epoch = len(x_train) // BATCH_SIZE
clr = tfa.optimizers.CyclicalLearningRate(
initial_learning_rate=INIT_LR,
maximal_learning_rate=MAX_LR,
scale_fn=lambda x: 1/(2.**(x-1)),
step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.SGD(clr)
clr_model = tf.keras.models.load_model("initial_model")
clr_history = train_model(clr_model, optimizer=optimizer)
#no_clr_history = train_model(standard_model, optimizer="sgd")
BATCH_SIZE = 64
EPOCHS = 10
INIT_LR = 1e-4
MAX_LR = 1e-2
Experiment Results - Triangular
24
Experiment Results - Triangular2
25
LR Finder (Range Test)
26
Automatic learning rate finder algorithm
28
Learning Rate Increase After Every Mini-Batch
3~5 epochs
29
● Recommended minimum: the learning rate where the loss decreases the fastest (steepest negative gradient)
● Recommended maximum: one order of magnitude (10×) lower than the learning rate at which the loss is lowest
(e.g. if the loss bottoms out at LR = 0.1, a good maximum to start with is 0.01)
Source: The Learning Rate Finder Technique: How Reliable Is It?
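A minimal sketch of the range test as a Keras callback, assuming a standard tf.keras optimizer; start_lr, end_lr, and num_steps are illustrative, and in practice you plot losses against lrs to pick the bounds:

import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    # Multiply the LR by a constant factor after every mini-batch and record the loss.
    def __init__(self, start_lr=1e-7, end_lr=10.0, num_steps=1000):
        super().__init__()
        self.start_lr = start_lr
        self.factor = (end_lr / start_lr) ** (1.0 / num_steps)
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)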
30
Reminder
● Use the same initial weights for the LR finder and the subsequent model training.
● Keep a copy of the model weights and reset them afterwards, so they are “as they were” before the
learning rate finder ran (see the sketch below).
● Never assume that the found learning rates are the best for any model initialization ❌
● Setting a narrower range than what is recommended is safer and reduces the risk of divergence due to
very high learning rates.
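A sketch of the reset step in Keras (lr_finder here is the range-test callback, whichever implementation you use):

initial_weights = model.get_weights()                         # snapshot before the range test
model.fit(x_train, y_train, epochs=3, callbacks=[lr_finder])  # run the LR finder
model.set_weights(initial_weights)                            # restore, then train for real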
31
● min: where the loss decreases the fastest
● max: a bit narrower than one order of magnitude below the loss minimum
● Larger batch size → higher learning rate
Source: The Learning Rate Finder Technique: How Reliable Is It?
Summary
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
32
One Cycle Policy
33
34
Learning rate
Batch Size
Momentum
Weight Decay
Learning Rate
fastai Modification (cosine descent)
0.08~0.8
The maximum should be the value picked with
a learning rate finder procedure.
Source: Finding Good Learning Rate and The One Cycle Policy.
Cyclic Momentum
36
fastai Modification (cosine ascent)
Source: Finding Good Learning Rate and The One Cycle Policy.
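Putting the two pieces together, a sketch of the 1cycle shape as a plain function; the defaults mirror fastai's (div=25, div_final=1e5, pct_start=0.25, momentum between 0.95 and 0.85), t is the current iteration and total the number of iterations in the run:

import math

def cos_interp(start, end, frac):
    # cosine interpolation from start to end as frac goes 0 -> 1
    return start + (end - start) * (1 - math.cos(math.pi * frac)) / 2

def one_cycle(t, total, lr_max, div=25.0, div_final=1e5, pct_start=0.25,
              mom_max=0.95, mom_min=0.85):
    warmup = pct_start * total
    if t < warmup:  # LR ramps up while momentum ramps down
        frac = t / warmup
        return cos_interp(lr_max / div, lr_max, frac), cos_interp(mom_max, mom_min, frac)
    frac = (t - warmup) / (total - warmup)  # LR anneals down while momentum goes back up
    return cos_interp(lr_max, lr_max / div_final, frac), cos_interp(mom_min, mom_max, frac)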
Weight Decay Matters
(weight decay 1e-3 vs 1e-5)
Example of super-convergence
38
Source: Understanding Fastai's fit_one_cycle method
@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)

@log_args(but_as=Learner.fit)
def fit_one_cycle(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=None,
                  moms=None, cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
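For reference, a typical call sequence in fastai (dls and resnet34 stand in for your own DataLoaders and architecture):

from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.lr_find()                       # range test; suggests a value for lr_max
learn.fit_one_cycle(5, lr_max=1e-3)   # 5 epochs with the 1cycle policy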
References
● 1. Keras learning rate schedules and decay (PyImageSearch)
● 2. Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● 3. Keras Learning Rate Finder (PyImageSearch)
● Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 👍
● Understanding Learning Rate in Machine Learning
● Learning Rate Schedules in Deep Learning
● Setting the learning rate of your neural network
● Exploring Super-Convergence 👍
● The Learning Rate Finder Technique: How Reliable Is It?
40
References - One Cycle
● One-cycle learning rate schedulers (Kaggle)
● Finding Good Learning Rate and The One Cycle Policy. 👍
● The 1cycle policy (fastbook author)
● Understanding Fastai's fit_one_cycle method 👍
41
Colab
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch) 💎
● TensorFlow Addons Optimizers: CyclicalLearningRate 👍
42
Further Reading
● Cyclical Learning Rates for Training Neural Networks (Leslie, 2015)
● Super-Convergence: Very Fast Training of Neural Networks Using Large Learning
Rates (Leslie et al., 2017)
● A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate,
batch size, momentum, and weight decay (Leslie, 2018)
● SGDR: Stochastic Gradient Descent with Warm Restarts (2016)
● Snapshot Ensembles: Train 1, get M for free (2017)
● A brief history of learning rate schedulers and adaptive optimizers 💎
● Faster Deep Learning Training with PyTorch – a 2021 Guide 💎
43