SlideShare a Scribd company logo
1 of 23
Download to read offline
[course site]
Optimizing deep
networks
Day 3 Lecture 3
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Convex optimization
A function is convex if for all α ∈ [0,1]:
Examples
● Quadratics
● 2-norms
Properties
● Local minimum is global minimum
x
f(x)
Tangent
line
2
Gradient descent
E.g. linear regression
Need to optimize L
Gradient descent
w
L
Tangent lineLoss
function
wt
wt+1
3
Stochastic gradient descent
● Computing the gradient for the full dataset at each step is slow
○ Especially if the dataset is large
● Note:
○ For many loss functions we care about, the gradient is the average over losses on individual
examples
● Idea:
○ Pick a single random training example
○ Estimate a (noisy) loss on this single training example (the stochastic gradient)
○ Compute gradient wrt. this loss
○ Take a step of gradient descent using the estimated loss
4
Non-convex optimization
Objective function in deep networks is
non-convex
● May be many local minima
● Plateaus: flat regions
● Saddle points
Q: Why does SGD seem to work so well
for optimizing these complex non-convex
functions??
x
f(x)
5
Local minima
Q: Why doesn’t SGD get stuck at local
minima?
A: It does.
But:
● Theory and experiments suggest that for
high dimensional deep models, value of loss
function at most local minima is close to
value of loss function at global minimum.
Most local minima are good local minima!
Choromanska et al. The loss surfaces of multilayer networks, AISTATS 2015 http://arxiv.org/abs/1412.0233
Value of local minima found by running SGD for 200
iterations on a simplified version of MNIST from different
initial starting points. As number of parameters increases,
local minima tend to cluster more tightly.
6
Saddle points
Q: Are there many saddle points in
high-dimensional loss functions?
A: Local minima dominate in low dimensions, but
saddle points dominate in high dimensions.
Why?
Eigenvalues of the Hessian matrix
Intuition
Random matrix theory: P(eigenvalue > 0) ~ 0.5
At a critical point (zero grad) N dimensions means
we need N positive eigenvalues to be local min.
As N grows it becomes exponentially unlikely to
randomly pick all eigenvalues to be positive or
negative, and therefore most critical points are
saddle points.
Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
http://arxiv.org/abs/1406.2572
7
Saddle points
Q: Does SGD get stuck at saddle points?
A: No, not really
Gradient descent is initially attracted to saddle points,
but unless it hits the critical point exactly, it will be
repelled when close.
Hitting critical point exactly is unlikely: estimated
gradient of loss is stochastic
Warning: Newton’s method works poorly for neural
nets as it is attracted to saddle points
SGD tends to oscillate between slowly approaching a saddle
point and quickly escaping from it
8
Plateaus
Regions of the weight space where loss
function is mostly flat (small gradients).
Can sometimes be avoided using:
● Careful initialization
● Non-saturating transfer functions
● Dynamic gradient scaling
● Network design
● Loss function design
9
Learning rates and initialization
10
Choosing the learning rate
For most first order optimization methods, we
need to choose a learning rate (aka step size)
● Too large: overshoots local minimum, loss increases
● Too small: makes very slow progress, can get stuck
● Good learning rate: makes steady progress toward
local minimum
Usually want a higher learning rate at the start
and a lower one later on.
Common strategy in practice:
● Start off with a high LR (like 0.1 - 0.001),
● Run for several epochs (1 - 10)
● Decrease LR by multiplying a constant factor (0.1 - 0.5) w
L
Loss
w
t
α too large
Good α
α too
small
11
Weight initialization
Need to pick a starting point for gradient
descent: an initial set of weights
Zero is a very bad idea!
● Zero is a critical point
● Error signal will not propagate
● Gradients will be zero: no progress
Constant value also bad idea:
● Need to break symmetry
Use small random values:
● E.g. zero mean Gaussian noise with constant
variance
Ideally we want inputs to activation functions
(e.g. sigmoid, tanh, ReLU) to be mostly in the
linear area to allow larger gradients to
propagate and converge faster.
0
tanh
Small
gradient
Large
gradient
bad good
12
Batch normalization
As learning progresses, the distribution of layer
inputs changes due to parameter updates.
This can result in most inputs being in the
nonlinear regime of the activation function and
slow down learning.
Batch normalization is a technique to reduce this
effect.
Works by re-normalizing layer inputs to have
zero mean and unit standard deviation with
respect to running batch estimates.
Also adds a learnable scale and bias term to
allow the network to still use the nonlinearity.
Usually allows much higher learning rates!
conv/fc
ReLU
Batch Normalization
no bias!
Ioffe and Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, JMRL 2015
https://arxiv.org/abs/1502.03167
13
First-order optimization algorithms
(SGD bells and whistles)
14
Vanilla mini-batch SGD
Evaluated on a mini-batch
15
Momentum
2x memory for parameters!
16
Nesterov accelerated gradient (NAG)
Approximate what the parameters will be on the next time step by using the
current velocity.
Update the velocity using gradient where we predict we will be, instead of where
we are now.
What we expect the
parameters to be based on
momentum aloneNesterov, Y. (1983). A method for unconstrained convex minimization problem
with the rate of convergence o(1/k2).
17
Adagrad
Adapts the learning rate for each of the parameters based on sizes of previous
updates.
● Scales updates to be larger for parameters that are updated less
● Scales updates to be smaller for parameters that are updated more
Store sum of squares of gradients so far in diagonal of matrix Gt
Gradient of loss at
timestep i
Update rule:
Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMRL 2011
18
RMSProp
Modification of Adagrad to address aggressively decaying learning rate.
Instead of storing sum of squares of gradient over all time steps so far, use a
decayed moving average of sum of squares of gradients
Update rule:
Geoff Hinton, Unpublished
19
Adam
Combines momentum and RMSProp
Keep decaying average of both first-order moment of gradient (momentum) and
second-order moment (like RMSProp)
Update rule:
First-order:
Second-order:
3x memory!
Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
20
Images credit: Alec Radford.
21
Images credit: Alec Radford.
22
Summary
Non-convex optimization means local minima and saddle points
In high dimensions, there are many more saddle points than local optima
Saddle points attract, but usually SGD can escape
Choosing a good learning rate is critical
Weight initialization is key to ensuring gradients propagate nicely (also batch
normalization)
Several SGD extensions that can help improve convergence
23

More Related Content

What's hot

Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
Nonlinear dimension reduction
Nonlinear dimension reductionNonlinear dimension reduction
Nonlinear dimension reductionYan Xu
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning ratesMLconf
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15MLconf
 
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...Universitat Politècnica de Catalunya
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsSungchul Kim
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandrySri Ambati
 
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...Koh Takeuchi
 
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...Universitat Politècnica de Catalunya
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupYan Xu
 
Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Universitat Politècnica de Catalunya
 

What's hot (20)

Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Deep Learning for Computer Vision: Backward Propagation (UPC 2016)
Deep Learning for Computer Vision: Backward Propagation (UPC 2016)Deep Learning for Computer Vision: Backward Propagation (UPC 2016)
Deep Learning for Computer Vision: Backward Propagation (UPC 2016)
 
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
Deep Learning for Computer Vision: Deep Networks (UPC 2016)Deep Learning for Computer Vision: Deep Networks (UPC 2016)
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Deep Learning for Computer Vision: Visualization (UPC 2016)
Deep Learning for Computer Vision: Visualization (UPC 2016)Deep Learning for Computer Vision: Visualization (UPC 2016)
Deep Learning for Computer Vision: Visualization (UPC 2016)
 
Nonlinear dimension reduction
Nonlinear dimension reductionNonlinear dimension reduction
Nonlinear dimension reduction
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
Deep Learning for Computer Vision: Unsupervised Learning (UPC 2016)
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning rates
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
 
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark Landry
 
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
 
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
Transfer Learning and Domain Adaptation (D2L3 2017 UPC Deep Learning for Comp...
 
Deep Learning for Computer Vision: Segmentation (UPC 2016)
Deep Learning for Computer Vision: Segmentation (UPC 2016)Deep Learning for Computer Vision: Segmentation (UPC 2016)
Deep Learning for Computer Vision: Segmentation (UPC 2016)
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML Meetup
 
Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...
 

Similar to Deep Learning for Computer Vision: Optimization (UPC 2016)

Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep LearningShajun Nisha
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...Taiji Suzuki
 
Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep LearningSourya Dey
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
 
Deep Learning Tutorial
Deep Learning Tutorial Deep Learning Tutorial
Deep Learning Tutorial Ligeng Zhu
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptxHadrian7
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryAhmed Yousry
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Wuhyun Rico Shin
 
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...Taiji Suzuki
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning conceptsJoe li
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptxkumarkaushal17
 
Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient BackpropagationAakash Chotrani
 
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...Maninda Edirisooriya
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
Machine learning using matlab.pdf
Machine learning using matlab.pdfMachine learning using matlab.pdf
Machine learning using matlab.pdfppvijith
 
08 distributed optimization
08 distributed optimization08 distributed optimization
08 distributed optimizationMarco Quartulli
 

Similar to Deep Learning for Computer Vision: Optimization (UPC 2016) (20)

Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep Learning
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
 
Techniques in Deep Learning
Techniques in Deep LearningTechniques in Deep Learning
Techniques in Deep Learning
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
 
Deep Learning Tutorial
Deep Learning Tutorial Deep Learning Tutorial
Deep Learning Tutorial
 
Ai saturdays presentation
Ai saturdays presentationAi saturdays presentation
Ai saturdays presentation
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.Paper review: Learned Optimizers that Scale and Generalize.
Paper review: Learned Optimizers that Scale and Generalize.
 
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning concepts
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx
 
Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient Backpropagation
 
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...
Lecture 5 - Gradient Descent, a lecture in subject module Statistical & Machi...
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
Machine learning using matlab.pdf
Machine learning using matlab.pdfMachine learning using matlab.pdf
Machine learning using matlab.pdf
 
08 distributed optimization
08 distributed optimization08 distributed optimization
08 distributed optimization
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 

Recently uploaded

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制vexqp
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Recently uploaded (20)

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Deep Learning for Computer Vision: Optimization (UPC 2016)

  • 1. [course site] Optimizing deep networks Day 3 Lecture 3 Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University
  • 2. Convex optimization A function is convex if for all α ∈ [0,1]: Examples ● Quadratics ● 2-norms Properties ● Local minimum is global minimum x f(x) Tangent line 2
  • 3. Gradient descent E.g. linear regression Need to optimize L Gradient descent w L Tangent lineLoss function wt wt+1 3
  • 4. Stochastic gradient descent ● Computing the gradient for the full dataset at each step is slow ○ Especially if the dataset is large ● Note: ○ For many loss functions we care about, the gradient is the average over losses on individual examples ● Idea: ○ Pick a single random training example ○ Estimate a (noisy) loss on this single training example (the stochastic gradient) ○ Compute gradient wrt. this loss ○ Take a step of gradient descent using the estimated loss 4
  • 5. Non-convex optimization Objective function in deep networks is non-convex ● May be many local minima ● Plateaus: flat regions ● Saddle points Q: Why does SGD seem to work so well for optimizing these complex non-convex functions?? x f(x) 5
  • 6. Local minima Q: Why doesn’t SGD get stuck at local minima? A: It does. But: ● Theory and experiments suggest that for high dimensional deep models, value of loss function at most local minima is close to value of loss function at global minimum. Most local minima are good local minima! Choromanska et al. The loss surfaces of multilayer networks, AISTATS 2015 http://arxiv.org/abs/1412.0233 Value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As number of parameters increases, local minima tend to cluster more tightly. 6
  • 7. Saddle points Q: Are there many saddle points in high-dimensional loss functions? A: Local minima dominate in low dimensions, but saddle points dominate in high dimensions. Why? Eigenvalues of the Hessian matrix Intuition Random matrix theory: P(eigenvalue > 0) ~ 0.5 At a critical point (zero grad) N dimensions means we need N positive eigenvalues to be local min. As N grows it becomes exponentially unlikely to randomly pick all eigenvalues to be positive or negative, and therefore most critical points are saddle points. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014 http://arxiv.org/abs/1406.2572 7
  • 8. Saddle points Q: Does SGD get stuck at saddle points? A: No, not really Gradient descent is initially attracted to saddle points, but unless it hits the critical point exactly, it will be repelled when close. Hitting critical point exactly is unlikely: estimated gradient of loss is stochastic Warning: Newton’s method works poorly for neural nets as it is attracted to saddle points SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it 8
  • 9. Plateaus Regions of the weight space where loss function is mostly flat (small gradients). Can sometimes be avoided using: ● Careful initialization ● Non-saturating transfer functions ● Dynamic gradient scaling ● Network design ● Loss function design 9
  • 10. Learning rates and initialization 10
  • 11. Choosing the learning rate For most first order optimization methods, we need to choose a learning rate (aka step size) ● Too large: overshoots local minimum, loss increases ● Too small: makes very slow progress, can get stuck ● Good learning rate: makes steady progress toward local minimum Usually want a higher learning rate at the start and a lower one later on. Common strategy in practice: ● Start off with a high LR (like 0.1 - 0.001), ● Run for several epochs (1 - 10) ● Decrease LR by multiplying a constant factor (0.1 - 0.5) w L Loss w t α too large Good α α too small 11
  • 12. Weight initialization Need to pick a starting point for gradient descent: an initial set of weights Zero is a very bad idea! ● Zero is a critical point ● Error signal will not propagate ● Gradients will be zero: no progress Constant value also bad idea: ● Need to break symmetry Use small random values: ● E.g. zero mean Gaussian noise with constant variance Ideally we want inputs to activation functions (e.g. sigmoid, tanh, ReLU) to be mostly in the linear area to allow larger gradients to propagate and converge faster. 0 tanh Small gradient Large gradient bad good 12
  • 13. Batch normalization As learning progresses, the distribution of layer inputs changes due to parameter updates. This can result in most inputs being in the nonlinear regime of the activation function and slow down learning. Batch normalization is a technique to reduce this effect. Works by re-normalizing layer inputs to have zero mean and unit standard deviation with respect to running batch estimates. Also adds a learnable scale and bias term to allow the network to still use the nonlinearity. Usually allows much higher learning rates! conv/fc ReLU Batch Normalization no bias! Ioffe and Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, JMRL 2015 https://arxiv.org/abs/1502.03167 13
  • 15. Vanilla mini-batch SGD Evaluated on a mini-batch 15
  • 16. Momentum 2x memory for parameters! 16
  • 17. Nesterov accelerated gradient (NAG) Approximate what the parameters will be on the next time step by using the current velocity. Update the velocity using gradient where we predict we will be, instead of where we are now. What we expect the parameters to be based on momentum aloneNesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). 17
  • 18. Adagrad Adapts the learning rate for each of the parameters based on sizes of previous updates. ● Scales updates to be larger for parameters that are updated less ● Scales updates to be smaller for parameters that are updated more Store sum of squares of gradients so far in diagonal of matrix Gt Gradient of loss at timestep i Update rule: Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMRL 2011 18
  • 19. RMSProp Modification of Adagrad to address aggressively decaying learning rate. Instead of storing sum of squares of gradient over all time steps so far, use a decayed moving average of sum of squares of gradients Update rule: Geoff Hinton, Unpublished 19
  • 20. Adam Combines momentum and RMSProp Keep decaying average of both first-order moment of gradient (momentum) and second-order moment (like RMSProp) Update rule: First-order: Second-order: 3x memory! Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015 20
  • 21. Images credit: Alec Radford. 21
  • 22. Images credit: Alec Radford. 22
  • 23. Summary Non-convex optimization means local minima and saddle points In high dimensions, there are many more saddle points than local optima Saddle points attract, but usually SGD can escape Choosing a good learning rate is critical Weight initialization is key to ensuring gradients propagate nicely (also batch normalization) Several SGD extensions that can help improve convergence 23