Deep Learning - Optimization Basic


This slide deck covers the basic principles of optimization and summarizes several optimization algorithms, including SGD, Momentum, Nesterov momentum, AdaGrad, RMSProp, AdaDelta, RMSProp with Nesterov momentum, and Adam. Presentation: https://www.youtube.com/watch?v=KdOmIODy3eA Reference: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016


Deep Learning
Optimization
Rookie’s Seminar, Jan 2018
Jaehyun Jun
Biointelligence Laboratory
Interdisciplinary Program of Neuro Science, Seoul National University
http://bi.snu.ac.kr

Contents
8.1 Pure Optimization
8.2 Challenges
8.3 Basic Algorithms
SGD, Momentum, Nesterov momentum
8.5 Adaptive Learning Rates
AdaGrad, RMSProp, AdaDelta,
RMSProp with Nesterov momentum, Adam
8.4 & 8.7 Strategies
Reference: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016
© 2017, SNU Biointelligence Lab., http://bi.snu.ac.kr 2

Pure Optimization
 In pure optimization, minimizing J is a goal in and of itself.
 In learning, we do not know p_data, so minimizing the true risk directly is intractable
-> we minimize the empirical risk on the training set instead
-> empirical risk minimization is prone to overfitting
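For reference, the two objectives contrasted on this slide can be written as follows. The notation follows the cited textbook rather than the slide itself: L is a per-example loss, f(x; θ) is the model, and m is the number of training examples.

```latex
% Expected risk under the true data-generating distribution (unknown in practice)
J^{*}(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, y) \sim p_{\text{data}}} L\big(f(\boldsymbol{x}; \boldsymbol{\theta}), y\big)

% Empirical risk: the same expectation taken over the training set
\mathbb{E}_{(\boldsymbol{x}, y) \sim \hat{p}_{\text{data}}} L\big(f(\boldsymbol{x}; \boldsymbol{\theta}), y\big) = \frac{1}{m} \sum_{i=1}^{m} L\big(f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}), y^{(i)}\big)
```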

Differs from Pure Optimization
 Types of optimization algorithms
 batch (deterministic) algorithm: uses the entire training set
 stochastic (online) algorithm: uses a single example at a time
 minibatch (minibatch stochastic) algorithm: uses more than one example but fewer than all
 select examples at random -> avoids a biased gradient estimate
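A minimal sketch of random minibatch selection, assuming the training set fits in a Python list. The helper name `minibatches` and its parameters are illustrative and do not come from the slides.

```python
import random

def minibatches(dataset, batch_size, seed=0):
    """Yield minibatches that together cover a shuffled copy of the dataset once."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)  # random order avoids bias from the storage order
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = list(range(10))
batches = list(minibatches(data, batch_size=4))
```

With `batch_size=1` this reduces to the stochastic (online) case, and with `batch_size=len(dataset)` to the batch (deterministic) case.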

Challenges
 Ill-Conditioning
 Second-order Taylor series expansion
 Local minima
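The second-order Taylor expansion listed under ill-conditioning, written for a gradient step of size ε from the point x^(0). The symbols g (gradient) and H (Hessian) follow the cited textbook and are not taken from the slide.

```latex
f(\boldsymbol{x}^{(0)} - \epsilon \boldsymbol{g}) \approx f(\boldsymbol{x}^{(0)}) - \epsilon\, \boldsymbol{g}^{\top} \boldsymbol{g} + \tfrac{1}{2} \epsilon^{2}\, \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}
```

Ill-conditioning becomes a problem when the curvature term \tfrac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g} exceeds \epsilon \boldsymbol{g}^{\top} \boldsymbol{g}, because the step then increases the cost.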

Challenges
 Cliffs and exploding gradients
 Long-term dependencies
-> vanishing and exploding gradient problem
 Poor correspondence between local and global structure
 local descent along the gradient may never reach a global solution.

Basic Algorithms
 Stochastic Gradient Descent (SGD)
 Obtain an unbiased estimate of the gradient by averaging the gradient over a minibatch of m examples drawn i.i.d. from the data-generating distribution
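A minimal sketch of the SGD update on a toy problem, assuming a one-parameter model y = w * x and squared error. The function name, data and hyperparameters are hypothetical. Each batch is sampled without replacement from a small fixed training set, which only approximates i.i.d. draws from the data-generating distribution.

```python
import random

def sgd_fit_slope(data, lr=0.05, batch_size=4, steps=200, seed=0):
    """Fit y = w * x by minibatch SGD on the squared error 0.5 * (w*x - y)**2."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Draw a minibatch of m examples at random from the training set.
        batch = rng.sample(data, batch_size)
        # Average the per-example gradients to estimate the full gradient.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        # Move against the estimated gradient.
        w -= lr * grad
    return w

# Hypothetical toy data generated from y = 3x.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
w = sgd_fit_slope(data)
```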

Basic Algorithms
 Momentum
 The method of momentum is designed to address two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient
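A minimal sketch of the momentum update, assuming an ill-conditioned quadratic bowl as the objective. The function names, the test function and the hyperparameters (`lr` for the learning rate, `alpha` for the momentum coefficient) are illustrative choices, not values from the slides.

```python
def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2), an ill-conditioned bowl."""
    x, y = theta
    return [x, 25.0 * y]

def momentum_descent(theta, lr=0.02, alpha=0.9, steps=300):
    v = [0.0] * len(theta)  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        g = grad(theta)
        # v <- alpha * v - lr * g
        v = [alpha * vi - lr * gi for vi, gi in zip(v, g)]
        # theta <- theta + v
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

theta = momentum_descent([5.0, 5.0])
```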

Basic Algorithms
 Nesterov momentum
 Unlike standard momentum, Nesterov momentum evaluates the gradient after the current velocity is applied
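A minimal sketch of Nesterov momentum under the same assumptions as a plain momentum loop: a toy quadratic objective and hypothetical names and hyperparameters. The only change from standard momentum is the look-ahead point at which the gradient is evaluated.

```python
def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def nesterov_descent(theta, lr=0.02, alpha=0.9, steps=300):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Apply the current velocity first to get an interim (look-ahead) point.
        lookahead = [ti + alpha * vi for ti, vi in zip(theta, v)]
        # Evaluate the gradient at the interim point, not at theta.
        g = grad(lookahead)
        v = [alpha * vi - lr * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

theta = nesterov_descent([5.0, 5.0])
```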

Adaptive Learning Rates
 AdaGrad
 AdaGrad scales each parameter's learning rate inversely to the square root of its accumulated squared gradients, so parameters tied to rare features get larger updates and parameters tied to common features get smaller ones.
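A minimal sketch of the AdaGrad update, assuming a toy quadratic objective. The names `lr` and `delta` (a small constant for numerical stability) and all values are illustrative.

```python
import math

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def adagrad(theta, lr=0.5, delta=1e-7, steps=500):
    r = [0.0] * len(theta)  # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        r = [ri + gi * gi for ri, gi in zip(r, g)]
        # Each parameter gets its own effective rate lr / (delta + sqrt(r)).
        theta = [ti - lr * gi / (delta + math.sqrt(ri))
                 for ti, gi, ri in zip(theta, g, r)]
    return theta

theta = adagrad([5.0, 5.0])
```

Because each coordinate is divided by its own accumulated gradient magnitude, the two coordinates follow almost the same trajectory in this example despite the 25x difference in curvature.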

Adaptive Learning Rates
 RMSProp
 Modifies AdaGrad to perform better in the non-convex setting
 Uses an exponentially decaying average of squared gradients to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl
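A minimal sketch of the RMSProp update, assuming a toy quadratic objective. The names `rho` (decay rate) and `delta` (stability constant) and all values are illustrative.

```python
import math

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def rmsprop(theta, lr=0.01, rho=0.9, delta=1e-6, steps=1000):
    r = [0.0] * len(theta)  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        # Old history is forgotten at rate rho instead of accumulating forever.
        r = [rho * ri + (1 - rho) * gi * gi for ri, gi in zip(r, g)]
        theta = [ti - lr * gi / math.sqrt(delta + ri)
                 for ti, gi, ri in zip(theta, g, r)]
    return theta

theta = rmsprop([1.0, -1.0])
```

With a constant learning rate the iterates do not settle exactly at the minimum; they stay within a small neighborhood whose size is on the order of `lr`.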

Adaptive Learning Rates
 RMSProp with Nesterov Momentum

Adaptive Learning Rates
 AdaDelta
 Rescales each update by the ratio of the RMS of past updates to the RMS of past gradients, which roughly approximates a second-order (Newton-like) step instead of a purely first-order one and removes the need for a global learning rate.
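A minimal sketch of the AdaDelta update as commonly formulated (the slides give no formulas, so this follows the usual statement of the method and should be read as an assumption). The names `rho` and `eps` and the toy objective are illustrative.

```python
import math

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 25.0 * y * y)

def adadelta(theta, rho=0.95, eps=1e-6, steps=200):
    avg_g2 = [0.0] * len(theta)   # decaying average of squared gradients
    avg_dx2 = [0.0] * len(theta)  # decaying average of squared updates
    for _ in range(steps):
        g = grad(theta)
        avg_g2 = [rho * a + (1 - rho) * gi * gi for a, gi in zip(avg_g2, g)]
        # Step = -(RMS of past updates / RMS of gradients) * gradient.
        dx = [-math.sqrt(d + eps) / math.sqrt(a + eps) * gi
              for d, a, gi in zip(avg_dx2, avg_g2, g)]
        avg_dx2 = [rho * d + (1 - rho) * di * di for d, di in zip(avg_dx2, dx)]
        theta = [ti + di for ti, di in zip(theta, dx)]
    return theta

start = [1.0, -1.0]
theta = adadelta(start)
```

There is no global learning rate in this formulation. Early steps are tiny because the update accumulator starts at zero, so the sketch only demonstrates that the loss goes down, not fast convergence.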

Adaptive Learning Rates
 Adam (RMSProp + momentum, with bias-corrected moment estimates)
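A minimal sketch of the Adam update, assuming a toy quadratic objective. The defaults shown (`beta1=0.9`, `beta2=0.999`) are commonly used values, and `delta` is a small stability constant; the function names and the objective are illustrative.

```python
import math

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]

def adam(theta, lr=0.01, beta1=0.9, beta2=0.999, delta=1e-8, steps=2000):
    m = [0.0] * len(theta)  # first moment: momentum-like average of gradients
    v = [0.0] * len(theta)  # second moment: RMSProp-like average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - lr * mh / (math.sqrt(vh) + delta)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

theta = adam([1.0, -1.0])
```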

Optimizers
Ref: http://shuuki4.github.io/deep%20learning/2016/05/20/Gradient-Descent-Algorithm-Overview.html

Strategies
 Initialization
 Key point: break the symmetry between units
 random initialization
 large initial weights: stronger symmetry-breaking effect, but risk of exploding gradients
 random orthogonal matrices (Gram-Schmidt orthogonalization)
 Sparse initialization -> imposes a strong prior
 initialize from an unsupervised model or from a model trained on a different task
(Glorot and Bengio, 2010)
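A minimal sketch of the normalized initialization attributed to Glorot and Bengio (2010), assuming a fully connected layer stored as a list of rows. The function name and the layer sizes are hypothetical.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    rng = random.Random(seed)
    # The limit shrinks as the layer grows, which keeps initial weights
    # large enough to break symmetry without making them excessively large.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = glorot_uniform(fan_in=4, fan_out=3)
limit = math.sqrt(6.0 / (4 + 3))
```

Because every weight is drawn independently, units that share the same inputs start with different parameters, which is what breaks the symmetry.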

Strategies
 Batch Normalization
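 Motivation: it is hard to choose an appropriate learning rate
 Adaptive reparameterization
 When many layers are updated at once, the second-order term of the update can be very small or very large depending on the weights w_i
 Batch normalization transform: $\mathrm{BN}(\boldsymbol{h}, \gamma, \beta) = \beta + \gamma \dfrac{\boldsymbol{h} - E(\boldsymbol{h})}{\sqrt{\mathrm{Var}(\boldsymbol{h}) + \epsilon}}$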
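A minimal sketch of the batch normalization transform for a single unit, assuming `h` holds that unit's activations across one minibatch. The function name and default values are illustrative; in training, `gamma` and `beta` would be learned parameters.

```python
import math

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's minibatch activations, then scale and shift."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    # eps keeps the denominator away from zero when the variance is tiny.
    return [beta + gamma * (x - mean) / math.sqrt(var + eps) for x in h]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
```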

Strategies
 Batch Normalization
 Advantages
 reduces the problem of coordinating updates across many layers
 mitigates the exploding or vanishing gradient problem
 allows higher learning rates
 reduces the strong dependence on initialization
 acts as a regularization method
