
# Competition Winning Learning Rates

Leslie Smith, Senior Research Scientist, US Naval Research Laboratory


1. Competition Winning Learning Rates. Leslie N. Smith, Naval Center for Applied Research in Artificial Intelligence, US Naval Research Laboratory, Washington, DC 20375. leslie.smith@nrl.navy.mil; Phone: (202) 767-9532. MLConf 2018, November 14, 2018. UNCLASSIFIED
2. Outline
   - My story of enlightenment about learning rates (LR)
   - My first steps: cyclical learning rates (CLR)
     - What are cyclical learning rates? Why do they matter?
   - A new LR schedule and super-convergence
     - Fast training of networks with large learning rates
   - Competition winning learning rates
     - Stanford's DAWNBench competition
     - Kaggle's iMaterialist Challenge
   - Enlightenment is a never-ending story
     - Is weight decay more important than LR?
3. Deep Learning Basics (Background)
   - Uses a neural network composed of many "hidden" layers l; each layer contains trainable weights W_l, biases b_l, and a non-linear function σ:
     - y_l = F(y_{l-1}) = σ(W_l y_{l-1} + b_l)
     - y_L = σ(W_L σ(W_{L-1} σ(... W_1 x + b_1 ...)))
   - Image x is the input; the output y_L is compared to the label y, which defines the loss ||y_L − y||
   - [Figure: forward pass from input x through weight layers to output classes (e.g., speed limit 20, yield, pedestrian crossing), with the loss back-propagated; gradients can vanish as they flow backward]
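The layered forward pass on this slide can be sketched in plain Python. This is an illustrative toy, not the deck's code; tanh stands in for the slide's unspecified non-linearity σ:

```python
import math

def sigma(v):
    # Non-linearity; tanh stands in for the slide's unspecified sigma
    return [math.tanh(x) for x in v]

def layer(W, b, y_prev):
    # One layer: y_l = sigma(W_l @ y_{l-1} + b_l)
    return sigma([sum(w * y for w, y in zip(row, y_prev)) + b_i
                  for row, b_i in zip(W, b)])

def forward(weights, biases, x):
    # Compose layers: y_L = sigma(W_L sigma(W_{L-1} ... sigma(W_1 x + b_1) ...))
    y = x
    for W, b in zip(weights, biases):
        y = layer(W, b, y)
    return y
```

With a loss such as ||y_L − y||, back-propagation differentiates this composition with respect to each W_l and b_l.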
4. What are learning rates?
   - The learning rate ε is the step size in stochastic gradient descent's (SGD) back-propagation: w_{t+1} = w_t − ε ∇L(θ, x)
   - It has long been known that the learning rate (LR) is the most important hyper-parameter to tune
     - Too large: training diverges
     - Too small: training is slow and reaches a sub-optimal solution
   - How to find an optimal LR?
     - Grid or random search, which is time consuming and inaccurate
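The too-large/too-small behavior is easy to see on a toy quadratic loss f(w) = w², whose gradient is 2w (a minimal sketch; the specific learning rates below are illustrative):

```python
def gd_quadratic(lr, steps=50, w=1.0):
    # Plain gradient descent on f(w) = w^2 (gradient 2w):
    # w_{t+1} = w_t - lr * grad
    for _ in range(steps):
        w = w - lr * (2.0 * w)
    return abs(w)
```

Each step multiplies w by (1 − 2·lr), so lr = 0.1 drives |w| rapidly toward the optimum at zero, lr = 1.1 makes |w| blow up, and lr = 0.001 leaves it barely changed after 50 steps.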
5. What is CLR? And who cares?
   - Cyclical learning rates (CLR): a learning rate schedule that varies between min and max values
   - LR range test: one stepsize of increasing LR
     - A quick and easy way to find an optimal learning rate
     - The peak defines max_lr
     - The optimal LR is a bit less than max_lr
     - min_lr ≈ max_lr / 3
   - [Figure: triangular CLR schedule oscillating between min_lr and max_lr]
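The triangular CLR schedule can be sketched as a pure function of the iteration count (following the formulation in the CLR paper; variable names are illustrative):

```python
def triangular_clr(it, stepsize, min_lr, max_lr):
    # Triangular CLR: the LR climbs linearly from min_lr to max_lr over
    # `stepsize` iterations, descends back over the next `stepsize`,
    # and then the cycle repeats.
    cycle = it // (2 * stepsize)
    x = abs(it / stepsize - 2 * cycle - 1)   # 1 at cycle ends, 0 at the peak
    return min_lr + (max_lr - min_lr) * (1 - x)
```

The LR range test is then just one increasing half-cycle: run `triangular_clr(it, stepsize, min_lr, max_lr)` for `it` from 0 to `stepsize` while recording accuracy, and read max_lr off the peak.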
6. Super-convergence
   - What if there's no peak in the LR range test?
     - Implies the ability to train at very large learning rates
   - Super-convergence
     - Start with a small LR
     - Grow to a large learning rate maximum
   - 1cycle learning rate schedule
     - One CLR cycle, ending with a smaller LR than the min
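A piecewise-linear sketch of the 1cycle idea: warm up from min_lr to max_lr, ramp back down, then anneal below the cycle's minimum. The `div`/`final_div` factors and phase split here are illustrative assumptions, not fastai's exact defaults:

```python
def one_cycle_lr(step, total_steps, max_lr, div=10.0, final_div=100.0, pct=0.45):
    # One up-down CLR cycle, then a final phase that anneals the LR
    # below the cycle's minimum ("a smaller LR than the min").
    min_lr = max_lr / div          # start (and end) of the main cycle
    final_lr = max_lr / final_div  # smaller than min_lr
    up = int(total_steps * pct)    # length of each ramp
    if step < up:                               # ramp up
        return min_lr + (max_lr - min_lr) * step / up
    if step < 2 * up:                           # ramp down
        return max_lr - (max_lr - min_lr) * (step - up) / up
    t = (step - 2 * up) / (total_steps - 2 * up)
    return min_lr - (min_lr - final_lr) * t     # anneal below min_lr
```

With max_lr found by an LR range test, super-convergence amounts to making max_lr very large and letting this schedule spend most of training at high LR.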
9. ImageNet super-convergence
   - [Figure: super-convergence training results on ImageNet]
10. Outline (repeat of slide 2)
11. Competition Winning Learning Rates
   - DAWNBench challenge
     - "Howard explains that in order to create an algorithm for solving CIFAR, Fast.AI's group turned to a relatively unknown technique known as 'super convergence.' This wasn't developed by a well-funded tech company or published in a big journal, but was created and self-published by a single engineer named Leslie Smith working at the Naval Research Laboratory." (The Verge, 5/7/2018, article about the fast.ai team's 1st-place win)
12. Competition Winning Learning Rates
   - Kaggle iMaterialist Challenge (Fashion)
     - "For training I used Adam initially but I switched to the 1cycle policy with SGD very early on. You can read more about this training regime in a paper by Leslie Smith and you can find details on how to use it by Sylvain Gugger, the author of the implementation in the fastai library here." (Radek Osmulski, 1st-place winner)
13. Relevant publications
   - Smith, Leslie N. "Cyclical learning rates for training neural networks." In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464-472. IEEE, 2017.
   - Smith, Leslie N., and Nicholay Topin. "Super-convergence: Very fast training of residual networks using large learning rates." arXiv preprint arXiv:1708.07120 (2017).
   - Smith, Leslie N. "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay." arXiv preprint arXiv:1803.09820 (2018).
   - There's a large batch of literature on large batch training
14. Outline (repeat of slide 2)
15. What is weight decay?
   - Weight decay (WD) is L2 regularization: L(θ, x) = ||f(θ, x) − y||² + ½ λ ||w||²
   - Taking the gradient gives the SGD update:
     - w_{t+1} = w_t − ε ∇L(θ, x) − ε λ w_t = (1 − ε λ) w_t − ε ∇L(θ, x)
   - The "effective weight decay" is a combination of the WD coefficient λ and the learning rate ε
   - Which term does the learning rate schedule impact more?
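The rearranged update on this slide, w_{t+1} = (1 − ελ) w_t − ε ∇L, can be sketched directly (an illustrative toy, with the loss gradient supplied by the caller):

```python
def sgd_step_wd(w, grad, lr, wd):
    # w_{t+1} = (1 - lr*wd) * w_t - lr * grad
    # Each step first shrinks the weights by the "effective weight decay"
    # lr * wd, then applies the loss-gradient step.
    return [(1.0 - lr * wd) * wi - lr * gi for wi, gi in zip(w, grad)]
```

Because the shrink factor is the product ε·λ, halving the learning rate while keeping λ fixed also halves the per-step decay, which is why the LR schedule and WD cannot be reasoned about independently.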
16. What is the optimal WD?
   - Decreasing weight decay has a much greater effect than decreasing the learning rate
   - The maximum weight decay, maxWD, can be found in a similar way as maxLR (i.e., a WD range test)
   - Use a large WD early in training and let it decay
   - [Figure: WD range test curve with maxWD marked]
17. Dynamic weight decay
   - Hyper-parameter relationship: (LR × WD) / (TBS × (1 − α)) ≈ constant, where LR = learning rate, WD = weight decay coefficient, TBS = total batch size, and α = momentum
     - Large TBS and smaller values of LR and α permit a larger maxWD
   - Large batch training can be improved by a large WD if WD decays during training
   - [Figure: accuracy comparison across settings (83.7%, 82.2%, 84%, 83.5%)]
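Rearranging the slide's relationship gives a rule of thumb for rescaling WD when the other hyper-parameters change (a sketch under the slide's approximation; the reference values in the assertions below are made-up illustrations, not tuned settings):

```python
def rescaled_wd(wd_ref, lr_ref, tbs_ref, mom_ref, lr, tbs, mom):
    # Hold (LR * WD) / (TBS * (1 - momentum)) constant and solve for
    # the new WD given new values of LR, batch size, and momentum.
    const = (lr_ref * wd_ref) / (tbs_ref * (1.0 - mom_ref))
    return const * tbs * (1.0 - mom) / lr
```

Doubling the batch size doubles the suggested WD, while doubling the LR halves it, consistent with the bullet that large TBS and smaller LR and α permit a larger maxWD.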
18. Conclusions (for now)
   - Takeaways
     - Enlightenment is a never-ending story
     - A new diet plan for your network: decay the weight decay in large batch training; set WD large in the early phase of training and zero near the end
     - Hyper-parameters are tightly coupled and must be tuned together: (LR × WD) / (TBS × (1 − α)) ≈ constant
19. The End. Questions and comments?