LESLIE SMITH’S PAPERS
FOR DL JOURNAL CLUB
DISCIPLINED APPROACH PAPER
• A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size,
Momentum, and Weight Decay
• There is no Part 2
• https://arxiv.org/abs/1803.09820
• Collection of empirical observations spread out through the paper
CONVERGENCE / TEST-VAL LOSS
• Observe box in top-left corner of Figure 1(a)
• Shows training loss (red) decreasing and validation loss
(blue) decreasing then increasing.
• Region to the left of the validation loss minimum indicates
underfitting.
• Region to the right of the validation loss minimum indicates
overfitting.
• Reaching the flat part of the test/validation loss curve
(its minimum) is the goal of hyperparameter tuning.
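As a rough illustration of reading the curve described above, the following sketch labels a run from its per-epoch validation losses. The function name `diagnose_fit`, the tolerance `tol`, and the three labels are assumptions made for this example, not something defined in the paper; it is a crude heuristic, not a substitute for looking at the plot.

```python
def diagnose_fit(val_losses, tol=1e-3):
    """Return (epoch of lowest validation loss, rough label for the run).

    Heuristic (an assumption for illustration only):
      - minimum at the last epoch  -> loss is still falling: "underfitting"
      - loss rose by more than tol -> past the minimum:      "overfitting"
      - otherwise                  -> flat after the minimum: "plateau"
    """
    best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i])
    best, last = val_losses[best_epoch], val_losses[-1]
    if best_epoch == len(val_losses) - 1:
        status = "underfitting"
    elif last - best > tol:
        status = "overfitting"
    else:
        status = "plateau"
    return best_epoch, status
```

For example, `diagnose_fit([1.0, 0.6, 0.5, 0.7])` points at epoch 2 as the minimum and labels the run as overfitting, because the loss climbs again afterwards.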
UNDERFITTING
• Underfitting is indicated by continuously decreasing
test loss rather than horizontal plateau (Fig 3(a)).
• Steepness of test loss curve indicates how well the
model is learning (Fig 3(b)).
OVERFITTING
• Increasing Learning Rate moves the model from underfitting
to overfitting.
• Blue curve (Fig 4a) shows steepest fall – indication that this
will produce better final accuracy.
• Yellow curve (Fig 4a) shows overfitting with LR > 0.006.
• More overfitting examples – blue curves in bottom figs.
• Blue curve (Fig 4b) shows underfitting.
• Red curve (Fig 4b) shows overfitting.
CYCLIC LEARNING RATE (CLR)
• Motivation: Underfitting if LR too low, overfitting if too high; requires grid search
• CLR
• Specify upper and lower bound for LR
• Specify step size == number of iterations or epochs used for each step
• Cycle consists of 2 steps – first step LR increases linearly from min to max, second step LR decreases linearly
from max to min.
• Other variants tried but no significant benefit observed.
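The two-step cycle described above can be sketched as a small function of the iteration counter. This is a minimal sketch, assuming the step size is given in iterations; the name `triangular_lr` and its parameter names are chosen for this example.

```python
def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at `iteration` for a triangular cyclical schedule.

    One cycle spans 2 * step_size iterations: the LR rises linearly from
    base_lr to max_lr over the first step, then falls back over the second.
    """
    cycle_pos = iteration % (2 * step_size)  # position inside the current cycle
    # Distance from the peak, scaled to [0, 1]: 1 at the cycle ends, 0 at the peak.
    distance_from_peak = abs(cycle_pos / step_size - 1.0)
    return base_lr + (max_lr - base_lr) * (1.0 - distance_from_peak)
```

With `base_lr=0.001`, `max_lr=0.006` and `step_size=2000`, the LR starts at 0.001, peaks at 0.006 at iteration 2000, and is back at 0.001 at iteration 4000.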
CLR – CHOOSE MAX AND MIN LR
• LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to
decrease)
• LR lower bound, one of:
• Factor of 3 or 4 less than upper bound.
• Factor of 10 or 20 less than upper bound if only 1 cycle is used.
• Find experimentally using short test of ~1000 iterations, pick largest that allows convergence.
• Step size – if the LR is too high, training becomes unstable; increase the step size to increase the
difference between the max and min LR bounds.
SUPER CONVERGENCE
• Super convergence – some networks remain stable under
high LR, so can be trained very quickly with CLR with high
upper bound.
• Fig 5a shows super convergence (orange curve) training
faster to higher accuracy using large LR than blue curve.
• 1-cycle policy – one cycle that is shorter than the total number
of iterations/epochs, then the remaining iterations with LR
lowered by several orders of magnitude.
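A minimal sketch of the 1-cycle shape described above, under stated assumptions: the cycle takes 90% of training (`cycle_frac`), and the tail decays linearly from the minimum LR to 1/100 of it (`final_div`). Both defaults, the linear tail, and the function name are illustrative choices, not values taken from the paper.

```python
def one_cycle_lr(iteration, total_iters, min_lr, max_lr,
                 cycle_frac=0.9, final_div=100.0):
    """Learning rate for a 1-cycle schedule (illustrative assumptions only).

    Phase 1: linear ramp min_lr -> max_lr over the first half of the cycle.
    Phase 2: linear ramp max_lr -> min_lr over the second half.
    Phase 3: remaining iterations decay linearly to min_lr / final_div.
    """
    cycle_len = round(total_iters * cycle_frac)
    half = max(1, cycle_len // 2)
    if iteration < half:
        return min_lr + (max_lr - min_lr) * iteration / half
    if iteration < cycle_len:
        return max_lr - (max_lr - min_lr) * (iteration - half) / (cycle_len - half)
    tail_len = max(1, total_iters - cycle_len)
    progress = min(1.0, (iteration - cycle_len) / tail_len)
    final_lr = min_lr / final_div
    return min_lr + (final_lr - min_lr) * progress
```

With `total_iters=1000`, `min_lr=0.01` and `max_lr=0.1`, the LR peaks at iteration 450, returns to 0.01 at iteration 900, and then falls toward 0.0001 over the last 100 iterations.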
REGULARIZATION
• Many forms of regularization
• Large Learning Rate
• Small batch size
• Weight decay (aka L2 regularization)
• Dropout
• Need to balance different regularizers for each dataset and architecture.
• Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning
needs to be balanced with lower WD.
• General guidance: reducing other forms of regularization and training with a high LR makes training efficient.
BATCH SIZE
• Larger batch sizes permit larger LR using 1cycle schedule.
• Larger batch size may increase training time, so tradeoff
required.
• Tradeoff – use batch size so number of epochs is optimum
for data/model.
• Batch size limited by GPU memory.
• Fig 6a shows validation accuracy for different batch sizes.
Larger batch sizes better but effect tapers off (BS=1024
blue curve very close to BS=512 red curve).
(CYCLIC) MOMENTUM
• Set momentum as large as possible without causing instability.
• Constant LR => use large constant momentum (0.9 – 0.99)
• Cyclic LR => decrease cyclic momentum as cyclic LR increases
during early to middle part of training (0.95 – 0.85).
• Fig 8a – blue curve is constant momentum, red curve is
decreasing CM and yellow curve is increasing CM (with
increasing CLR).
• These observations also carry over to deep networks (Fig 8b).
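The "momentum goes down while LR goes up" rule above can be sketched as the mirror image of a triangular LR cycle. This is a minimal sketch, assuming the momentum cycle uses the same step size (in iterations) as the LR cycle; the function name is hypothetical, and the 0.95 and 0.85 defaults are the range quoted on the slide.

```python
def cyclic_momentum(iteration, step_size, max_mom=0.95, min_mom=0.85):
    """Momentum that moves opposite to a triangular cyclical learning rate.

    Momentum starts at max_mom, falls linearly to min_mom while the LR
    rises (first step), then climbs back to max_mom while the LR falls.
    """
    cycle_pos = iteration % (2 * step_size)
    # 1 at the cycle ends (LR at its minimum), 0 at the LR peak.
    distance_from_peak = abs(cycle_pos / step_size - 1.0)
    return min_mom + (max_mom - min_mom) * distance_from_peak
```

In a training loop the returned value would be written into the optimizer's momentum setting each iteration, alongside the LR for that iteration.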
WEIGHT DECAY
• Cyclical WD not useful, should remain constant throughout
training.
• Value should be found by grid search (ok with early
termination).
• Fig 9a shows loss plots for different values of WD (with LR=5e-3,
mom=0.95).
• Fig 9b shows equivalent accuracy plots.
CYCLIC LEARNING RATE PAPER
• Cyclical Learning Rates for Training Neural Networks
• https://arxiv.org/abs/1506.01186
• Describes CLR in depth and describes results of training common networks with CLR.
CYCLIC LEARNING RATE
• Successor to
• Learning rate schedules – varying LR exponentially over training.
• Adaptive Learning Rates (RMSProp, ADAM, etc) – change LR
based on values of gradients.
• Based on observation that increasing LR has short-term
negative effect but long-term positive effect.
• Let LR vary between range of values.
• Triangular LR (Fig 2) is usually good enough but other variants
also possible.
• Accuracy plot (Fig 1) shows CLR (red curve) is better compared
to Exponential LR.
ESTIMATING CLR PARAMETERS
• Step size
• Step size = 2 to 10 × number of iterations per epoch
• Number of training iterations per epoch = number of training records /
batch size
• Upper and lower bounds for LR
• Run model for few epochs with some bounds (1e-4 to 2e-1 for
example)
• Upper bound == where accuracy stops increasing, becomes ragged, or
falls (~ 6e-3).
• Lower bound
• Either 1/3 or 1/4 of upper bound (~ 2e-3)
• Point at which accuracy starts to increase (~ 1e-3)
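The arithmetic behind these rules of thumb can be worked through as follows. The helper names are hypothetical, and the dataset size (50,000 records) and batch size (100) in the usage note are illustrative numbers, not taken from the slides. Rounding the iterations per epoch up assumes the last partial batch is kept.

```python
import math

def clr_step_size(num_train_records, batch_size, epochs_per_step=2):
    """Step size in iterations: epochs_per_step (2 to 10) * iterations per epoch."""
    # Assumption: a final partial batch still counts as one iteration.
    iters_per_epoch = math.ceil(num_train_records / batch_size)
    return epochs_per_step * iters_per_epoch

def lower_bound_from_upper(upper_lr, factor=3):
    """Lower LR bound as a fraction (1/3 or 1/4) of the upper bound."""
    return upper_lr / factor
```

With 50,000 training records and a batch size of 100 there are 500 iterations per epoch, so the step size falls between `clr_step_size(50000, 100, 2)` = 1000 and `clr_step_size(50000, 100, 10)` = 5000 iterations. An upper bound of 6e-3 with `factor=3` gives a lower bound of about 2e-3, matching the value quoted above.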
LR FINDER USAGE
• LR Finder – first available in Fast.AI library.
• Upper bound – between 1e-3 and 1e-2 (10^-3 and 10^-2) where loss is
decreasing fastest.
• Can also be found using lr.plot_loss_change() – minimum point (here 1e-2).
• Lower bound is about 1-2 orders of magnitude lower.
• LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder
• LR Finder (Pytorch) -- https://github.com/davidtvs/pytorch-lr-finder
• Keras example -- https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb
• Fast.AI example -- https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
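The "where loss is decreasing fastest" rule from the LR Finder slide can be sketched in plain Python, independent of any of the libraries linked above. The function name and the sample numbers in the usage note are made up for illustration; real LR finders typically smooth the loss curve first, which this sketch omits.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the LR at the end of the segment where loss falls fastest.

    The slope is measured against log10(LR), since LR range tests sweep the
    learning rate on a log scale. Returns None if the loss never decreases.
    """
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        d_log_lr = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / d_log_lr
        if slope < best_slope:  # more negative means a faster decrease
            best_slope, best_lr = slope, lrs[i]
    return best_lr
```

For a sweep over `[1e-5, 1e-4, 1e-3, 1e-2, 1e-1]` with losses `[2.30, 2.28, 2.00, 1.20, 3.50]`, the steepest drop is between 1e-3 and 1e-2, so the function returns 1e-2 as the candidate upper bound.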


Leslie Smith's Papers discussion for DL Journal Club

  • 1. LESLIE SMITH’S PAPERS FOR DL JOURNAL CLUB
  • 2. DISCIPLINED APPROACH PAPER • A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay • There is no Part 2 • https://arxiv.org/abs/1803.09820 • Collection of empirical observations spread out through the paper
  • 3. CONVERGENCE / TEST-VAL LOSS • Observe box in top-left corner of Figure 1(a) • Shows training loss (red) decreasing and validation loss (blue) decreasing then increasing. • Plot to left of validation loss minima indicates underfitting • Plot to right of validation loss minima indicates overfitting. • Achieving the horizontal part of test/validation loss (minima) is goal of hyperparameter tuning.
  • 4. UNDERFITTING • Underfitting is indicated by continuously decreasing test loss rather than horizontal plateau (Fig 3(a)). • Steepness of test loss curve indicates how well the model is learning (Fig 3(b)).
  • 5. OVERFITTING • Increasing Learning Rate moves the model from underfitting to overfitting. • Blue curve (Fig 4a) shows steepest fall – indication that this will produce better final accuracy. • Yellow curve (Fig 4a) shows overfitting with LR > 0.006. • More overfitting examples – blue curves in bottom figs. • Blue curve (Fig 4b) shows underfitting. • Red curve (Fig 4b) shows overfitting.
  • 6. CYCLIC LEARNING RATE (CLR) • Motivation: Underfitting if LR too low, overfitting if too high; requires grid search • CLR • Specify upper and lower bound for LR • Specify step size == number of iterations or epochs used for each step • Cycle consists of 2 steps – first step LR increases linearly from min to max, second step LR decreases linearly from max to min. • Other variants tried but no significant benefit observed.
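A minimal sketch of the two-step cycle described on slide 6, assuming the step size is counted in iterations. The function name `triangular_lr` is hypothetical, and the example bounds (2e-3 and 6e-3) are borrowed from the approximate values quoted on slide 15.

```python
def triangular_lr(iteration, step_size, min_lr, max_lr):
    """Learning rate at a given iteration of a triangular cyclic schedule.

    One cycle is 2 * step_size iterations: the LR rises linearly from
    min_lr to max_lr during the first step, then falls linearly back
    to min_lr during the second step.
    """
    cycle_pos = iteration % (2 * step_size)
    if cycle_pos <= step_size:
        frac = cycle_pos / step_size        # rising half: 0 -> 1
    else:
        frac = 2 - cycle_pos / step_size    # falling half: 1 -> 0
    return min_lr + (max_lr - min_lr) * frac


# Illustrative usage: sample the schedule at quarter-cycle points.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, step_size=100, min_lr=2e-3, max_lr=6e-3))
```

In practice the returned value would be assigned to the optimizer's learning rate before each update; that wiring depends on the framework and is left out here.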
  • 7. CLR – CHOOSE MAX AND MIN LR • LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to decrease) • LR lower bound, one of: • Factor of 3 or 4 less than upper bound. • Factor of 10 or 20 less than upper bound if only 1 cycle is used. • Find experimentally using short test of ~1000 iterations, pick largest that allows convergence. • Step size – if LR too high, training becomes unstable, increase step size to increase difference between max and min LR bounds.
  • 8. SUPER CONVERGENCE • Super convergence – some networks remain stable under high LR, so they can be trained very quickly using CLR with a high upper bound. • Fig 5a shows super convergence (orange curve) training faster and to higher accuracy with a large LR than the blue curve. • 1-cycle policy – one cycle that is shorter than the total number of iterations/epochs, followed by the remaining iterations with LR lowered by several orders of magnitude.
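The 1-cycle policy on slide 8 could be sketched roughly as follows. The slide only says that the remaining iterations use an LR several orders of magnitude lower, so the linear decay in the tail is an assumption, as are the function name `one_cycle_lr`, its parameters, and the example values.

```python
def one_cycle_lr(iteration, total_iters, cycle_iters, min_lr, max_lr, final_lr):
    """Learning rate under a simple 1-cycle schedule.

    The single cycle (linear rise, then linear fall) spans cycle_iters
    iterations, which must be fewer than total_iters. The remaining
    iterations decay linearly from min_lr to final_lr (an assumption;
    the slides do not specify the shape of the tail).
    """
    if cycle_iters >= total_iters:
        raise ValueError("the cycle must be shorter than the whole run")
    half = cycle_iters / 2
    if iteration <= half:
        # First half of the cycle: min_lr -> max_lr.
        return min_lr + (max_lr - min_lr) * iteration / half
    if iteration <= cycle_iters:
        # Second half of the cycle: max_lr -> min_lr.
        return max_lr - (max_lr - min_lr) * (iteration - half) / half
    # Tail: min_lr -> final_lr over the remaining iterations.
    tail_frac = (iteration - cycle_iters) / (total_iters - cycle_iters)
    return min_lr + (final_lr - min_lr) * tail_frac


# Illustrative usage with made-up values: a 900-iteration cycle in a
# 1000-iteration run, ending two orders of magnitude below min_lr.
for it in (0, 450, 900, 1000):
    print(it, one_cycle_lr(it, 1000, 900, min_lr=0.08, max_lr=0.8, final_lr=0.0008))
```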
  • 9. REGULARIZATION • Many forms of regularization • Large Learning Rate • Small batch size • Weight decay (aka L2 regularization) • Dropout • Need to balance different regularizers for each dataset and architecture. • Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning needs to be balanced with lower WD. • General guidance: reducing other forms of regularization and training with a high LR makes training efficient.
  • 10. BATCH SIZE • Larger batch sizes permit larger LR using 1cycle schedule. • Larger batch size may increase training time, so tradeoff required. • Tradeoff – use batch size so number of epochs is optimum for data/model. • Batch size limited by GPU memory. • Fig 6a shows validation accuracy for different batch sizes. Larger batch sizes better but effect tapers off (BS=1024 blue curve very close to BS=512 red curve).
  • 11. (CYCLIC) MOMENTUM • Set momentum as large as possible without causing instability. • Constant LR => use large constant momentum (0.9 – 0.99) • Cyclic LR => decrease cyclic momentum as cyclic LR increases during early to middle part of training (0.95 – 0.85). • Fig 8a – blue curve is constant momentum, red curve is decreasing CM and yellow curve is increasing CM (with increasing CLR). • These observations also carry over to deep networks (Fig 8b).
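One way to read the cyclic momentum advice on slide 11 is to drive momentum with the same triangular position as the LR, but in the opposite direction. This is a sketch under that assumption: the function name `cyclic_lr_and_momentum` is hypothetical, the momentum range 0.95 to 0.85 comes from the slide, and the LR bounds are illustrative.

```python
def cyclic_lr_and_momentum(iteration, step_size, min_lr, max_lr,
                           max_mom=0.95, min_mom=0.85):
    """Return (lr, momentum) for one iteration of a triangular cycle.

    Momentum moves opposite to the learning rate: it falls from max_mom
    to min_mom while the LR rises, and climbs back while the LR falls.
    """
    cycle_pos = iteration % (2 * step_size)
    if cycle_pos <= step_size:
        frac = cycle_pos / step_size        # LR rising, momentum falling
    else:
        frac = 2 - cycle_pos / step_size    # LR falling, momentum rising
    lr = min_lr + (max_lr - min_lr) * frac
    momentum = max_mom - (max_mom - min_mom) * frac
    return lr, momentum


# Illustrative usage: start, peak, and end of one cycle.
for it in (0, 100, 200):
    print(it, cyclic_lr_and_momentum(it, step_size=100, min_lr=2e-3, max_lr=6e-3))
```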
  • 12. WEIGHT DECAY • Cyclical WD not useful, should remain constant throughout training. • Value should be found by grid search (ok with early termination). • Fig 9a shows loss plots for different values of WD (with LR=5e-3, mom=0.95). • Fig 9b shows equivalent accuracy plots.
  • 13. CYCLIC LEARNING RATE PAPER • Cyclical Learning Rates for Training Neural Networks • https://arxiv.org/abs/1506.01186 • Describes CLR in depth and describes results of training common networks with CLR.
  • 14. CYCLIC LEARNING RATE • Successor to: • Learning rate schedules – varying LR exponentially over training. • Adaptive Learning Rates (RMSProp, ADAM, etc) – change LR based on values of gradients. • Based on observation that increasing LR has short-term negative effect but long-term positive effect. • Let LR vary within a range of values. • Triangular LR (Fig 2) is usually good enough but other variants also possible. • Accuracy plot (Fig 1) shows CLR (red curve) reaching better accuracy than Exponential LR.
  • 15. ESTIMATING CLR PARAMETERS • Step size • Step size = 2 to 10 × number of iterations per epoch • Number of training iterations per epoch = number of training records / batch size • Upper and lower bounds for LR • Run model for a few epochs while sweeping LR over a range (1e-4 to 2e-1 for example) • Upper bound == where accuracy stops increasing, becomes ragged, or falls (~ 6e-3). • Lower bound, one of: • Either 1/3 or 1/4 of upper bound (~ 2e-3) • Point at which accuracy starts to increase (~ 1e-3)
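The arithmetic on slide 15 can be worked through in a few lines. This is a sketch only: the helper names are hypothetical, rounding the iterations per epoch up (to count a final partial batch) is an assumption, and the dataset size and batch size in the usage are illustrative values rather than figures from the slides.

```python
import math


def clr_step_size(num_train_records, batch_size, epochs_per_step=2):
    """Step size in iterations: epochs_per_step (2 to 10) times the
    number of training iterations per epoch."""
    if not 2 <= epochs_per_step <= 10:
        raise ValueError("the slides suggest a multiplier between 2 and 10")
    # Assumption: a final partial batch counts as one iteration.
    iters_per_epoch = math.ceil(num_train_records / batch_size)
    return epochs_per_step * iters_per_epoch


def lower_bound_from_upper(upper_lr, factor=3):
    """Lower LR bound as a fixed fraction (1/3 or 1/4) of the upper bound."""
    return upper_lr / factor


# Illustrative usage: 50,000 records with batch size 100 gives 500
# iterations per epoch, so the step size ranges from 1000 to 5000.
print(clr_step_size(50_000, 100, epochs_per_step=2))
print(clr_step_size(50_000, 100, epochs_per_step=10))
# An upper bound of 6e-3 divided by 3 gives roughly the 2e-3 on the slide.
print(lower_bound_from_upper(6e-3, factor=3))
```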
  • 16. LR FINDER USAGE • LR Finder – first available in Fast.AI library. • Upper bound – between 1e-3 and 1e-2 (10^-3 and 10^-2) where loss is decreasing fastest. • Can also be found using lr.plot_loss_change() – minimum point (here 1e-2). • Lower bound is about 1-2 orders of magnitude lower. • LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder • LR Finder (Pytorch) -- https://github.com/davidtvs/pytorch-lr-finder • Keras example -- https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb • Fast.AI example -- https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
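The idea behind an LR finder, as slide 16 uses it, can be sketched without any framework: sweep the LR exponentially, record the loss at each step, and pick the LR where the loss fell fastest. The code below is not the API of the Fast.AI, Keras, or PyTorch finders linked above. Both function names are hypothetical, the training loop that would produce the losses is omitted, and the loss values in the usage are made up for illustration, not measured.

```python
def lr_range_schedule(start_lr, end_lr, num_iters):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    if num_iters < 2:
        raise ValueError("need at least two iterations for a sweep")
    ratio = (end_lr / start_lr) ** (1 / (num_iters - 1))
    return [start_lr * ratio ** i for i in range(num_iters)]


def steepest_descent_lr(lrs, losses):
    """Return the LR at which the loss dropped most from the previous step."""
    if len(lrs) != len(losses) or len(losses) < 2:
        raise ValueError("need matching lists with at least two entries")
    changes = [losses[i] - losses[i - 1] for i in range(1, len(losses))]
    steepest = min(range(len(changes)), key=changes.__getitem__)
    return lrs[steepest + 1]


# Illustrative usage with made-up losses (one per LR in the sweep).
lrs = lr_range_schedule(1e-4, 1e-1, 4)    # about 1e-4, 1e-3, 1e-2, 1e-1
made_up_losses = [2.30, 2.25, 1.60, 3.10]
# Picks the third LR (about 1e-2) for these made-up losses.
print(steepest_descent_lr(lrs, made_up_losses))
```

Real finders typically smooth the loss curve before looking at its slope, since raw per-batch losses are noisy; that step is left out to keep the sketch short.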