UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1
Lecture 3: Deeper into Deep Learning and Optimizations
Deep Learning @ UvA
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Previous lecture
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Lecture overview
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4
Deeper into
Neural Networks &
Deep Neural Nets
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5
A Neural/Deep Network in a nutshell
1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
2. Learning by minimizing the empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$
3. Optimizing with Gradient Descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6
SGD vs GD
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7
Backpropagation again
o Step 1. Compute forward propagations for all layers recursively
$a_l = h_l(x_l)$ and $x_{l+1} = a_l$
o Step 2. Once done with forward propagation, follow the reverse path.
◦ Start from the last layer and for each new layer compute the gradients
◦ Cache computations when possible to avoid redundant operations
o Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train, where
$\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
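To make the recursion concrete, here is a minimal NumPy sketch of the modular forward/backward pattern described above; the Linear and ReLU classes, shapes and initialization are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

class Linear:
    """a_l = x_l theta; caches x_l on the forward pass for the backward pass."""
    def __init__(self, d_in, d_out):
        self.theta = 0.01 * np.random.randn(d_in, d_out)
        self.grad_theta = np.zeros_like(self.theta)

    def forward(self, x):
        self.x = x                              # cache the module input
        return x @ self.theta

    def backward(self, grad_a):                 # grad_a = dL/da_l
        self.grad_theta = self.x.T @ grad_a     # dL/dtheta_l
        return grad_a @ self.theta.T            # dL/dx_l, passed to layer l-1

class ReLU:
    def forward(self, x):
        self.x = x
        return np.maximum(0.0, x)

    def backward(self, grad_a):
        return grad_a * (self.x > 0)

def forward(layers, x):
    # Step 1: x_{l+1} = a_l = h_l(x_l), computed recursively
    for layer in layers:
        x = layer.forward(x)
    return x

def backward(layers, grad_loss):
    # Step 2: follow the reverse path, reusing the cached forward computations
    for layer in reversed(layers):
        grad_loss = layer.backward(grad_loss)
```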
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8
o Often loss surfaces are
◦ non-quadratic
◦ highly non-convex
◦ very high-dimensional
o Datasets are typically really large to compute complete gradients
o No real guarantee that
◦ the final solution will be good
◦ we converge fast to final solution
◦ or that there will be convergence
Still, backpropagation can be slow
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9
o Stochastically sample "mini-batches" from the dataset $D$
◦ A mini-batch $B_j$ can contain as few as 1 sample
o Much faster than Gradient Descent
o Results are often better
o Also suitable for datasets that change over time
o The variance of the gradients increases as the batch size decreases
Stochastic Gradient Descent (SGD)
$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{|B_j|} \sum_{i \in B_j} \nabla_\theta \mathcal{L}_i$, with $B_j = \mathrm{sample}(D)$
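A minimal sketch of the update above; `grad_fn` is a hypothetical function that returns the batch-averaged gradient $\nabla_\theta \mathcal{L}$ for a mini-batch, so the $\eta_t/|B_j|$ factor is already folded in.

```python
import numpy as np

def sgd_epoch(theta, X, Y, grad_fn, lr=0.01, batch_size=128):
    """One epoch of mini-batch SGD: theta <- theta - lr * mean_i grad L_i."""
    idx = np.random.permutation(len(X))             # B_j = sample(D)
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grad = grad_fn(theta, X[batch], Y[batch])   # batch-averaged gradient
        theta = theta - lr * grad
    return theta
```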
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10
SGD is often better
[Figure: a loss surface with the current solution; the full GD gradient steps towards the best GD solution (a nearby local optimum), while a noisy SGD gradient can reach a better SGD solution.]
• No guarantee that this is what is always going to happen.
• But the noisy SGD gradients can sometimes help escape local optima.
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11
SGD is often better
o (A bit) Noisy gradients act as regularization
o Gradient Descent uses the complete gradients
o Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them
◦ As if all training samples were the "absolute representative" of the input distribution
◦ And as if test data will be no different from training data
◦ Suitable for traditional optimization problems: "find the optimal route"
◦ But for ML we cannot make this assumption → test data are always different
o Stochastic gradients → sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12
SGD is faster
[Figure over three slides: the same dataset replicated 10×. The full-batch gradient points in the same direction as before, but now costs 10× more to compute: what is our gradient now?]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15
o Of course, in real situations data do not literally replicate
o However, after a sizeable amount of data there are clusters of data points that are similar
o Hence, the mini-batch gradient is approximately right
o And approximately right is great; in many cases it is actually even better
SGD is faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16
o Often datasets are not "rigid"
o Imagine Instagram
◦ Let's assume 1 million new images are uploaded per week and
we want to build a "cool picture" classifier
◦ Should "cool pictures" from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are
averaged over the many more "past" samples
◦ The past "over-dominates"
o A properly implemented SGD can track changes much
better and give better models
◦ [LeCun2002]
SGD for dynamically changing datasets
[Figure: example pictures popular today, in 2014, and in 2010]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer "hard examples"
◦ Don't overdo it though :P, beware of outliers
o In practice, split your dataset into mini-batches
◦ Each mini-batch is as class-divergent and rich as possible
◦ At each new epoch → to be safe, new batches with new, randomly shuffled examples
Shuffling examples
[Figure: the dataset is re-shuffled into different mini-batches at epoch t and at epoch t+1]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18
o Conditions of convergence are well understood
o Acceleration techniques can be applied
◦ Second order (Hessian based) optimizations are possible
◦ Measuring not only the gradients, but also the curvature of the loss surface
o Simpler theoretical analysis of weight dynamics and convergence rates
Advantages of Gradient Descent batch learning
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19
o SGD is preferred over Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets Gradient Descent is not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ Hyper-parameter, trial & error
◦ Usually between 32 and 256 samples
In practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20
Data preprocessing &
normalization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21
o Center the data to be roughly 0
◦ Activation functions are usually "centered" around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias on the gradient direction → might slow down learning
Data pre-processing
[Figure: ReLU, tanh(x) and σ(x) activation curves]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22
o Scale input variables to have similar diagonal covariances $c_i = \sum_j (x_i^{(j)})^2$
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
Data pre-processing
[Example: $x = (x_1, x_2, x_3)^T$, $\theta = (\theta_1, \theta_2, \theta_3)^T$, $a = \tanh(\theta^T x)$. If $x_1, x_2, x_3$ have very different covariances, the generated gradients $\frac{d\mathcal{L}}{d\theta}\big|_{x_1,x_2,x_3}$ are also very different, and the gradient update $\theta^{(t+1)} = \theta^{(t)} - \eta_t \left(d\mathcal{L}/d\theta_1,\, d\mathcal{L}/d\theta_2,\, d\mathcal{L}/d\theta_3\right)^T$ becomes harder.]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23
o Input variables should be as decorrelated as possible
◦ Input variables are "more independent"
◦ The network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case
◦ extreme correlation (linear dependency) might cause problems [CAUTION]
Data pre-processing
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24
o Input variables (roughly) follow a Gaussian distribution
o In practice:
◦ compute the mean and standard deviation from the training set
◦ then subtract the mean from the training samples
◦ then divide the result by the standard deviation
Normalization: $N(\mu, \sigma^2) \to N(0, 1)$
[Figure: the data $x$, the centered data $x - \mu$, and the standardized data $(x - \mu)/\sigma$]
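A minimal sketch of this recipe, assuming the data sits in NumPy arrays with one sample per row; the statistics come from the training set only and are then reused for the test set.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    mu = X_train.mean(axis=0)           # per-dimension mean from the training set
    sigma = X_train.std(axis=0) + eps   # per-dimension std; eps avoids division by 0
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```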
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25
o Instead of "per-dimension" → all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one $(\mu, \sigma^2)$ instead of one per input variable
◦ Or the per-color-channel pixel mean/variance: $(\mu_{red}, \sigma^2_{red}), (\mu_{green}, \sigma^2_{green}), (\mu_{blue}, \sigma^2_{blue})$
Normalization $N(\mu, \sigma^2) \to N(0, 1)$: making things faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26
o When input dimensions have similar ranges ...
o ... and with the right non-linearity ...
o ... centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same ranges
o Just make sure images have mean 0 ($\mu = 0$)
Even simpler: Centering the input
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27
o If $C$ is the covariance matrix of your dataset, compute its
eigenvalues and eigenvectors with SVD:
$U, \Sigma, V^T = \mathrm{svd}(C)$
o Decorrelate (PCA) the dataset by
$X_{rot} = U^T X$
◦ Keep a subset of eigenvectors $U' = [u_1, \dots, u_q]$ to reduce data dimensions
o Scale by the square root of the eigenvalues to whiten the data
$X_{wht} = X_{rot} / \sqrt{\Sigma}$
o Not used much with Convolutional Neural Nets
◦ The zero-mean normalization is more important
PCA Whitening
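A minimal NumPy sketch of PCA whitening, under the usual convention that rows are samples (so the rotation is `X @ U` rather than $U^T X$); the small `eps` is a numerical-stability choice of this sketch, not part of the slide.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)              # zero-center first
    C = np.cov(X, rowvar=False)         # covariance matrix of the dataset
    U, S, _ = np.linalg.svd(C)          # U: eigenvectors, S: eigenvalues
    X_rot = X @ U                       # decorrelated (PCA-ed) data
    X_wht = X_rot / np.sqrt(S + eps)    # scale by sqrt of eigenvalues to whiten
    return X_wht
```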
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28
Example
Images taken from A. Karpathy course website: http://cs231n.github.io/neural-networks-2/
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29
Data augmentation [Krizhevsky2012]
[Figure: an original image and augmented versions: flip, random crop, contrast change, tint]
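A minimal sketch of two of the augmentations above (horizontal flip and random crop) for an (H, W, C) image array; the crop size is an illustrative parameter.

```python
import numpy as np

def augment(img, crop=24):
    if np.random.rand() < 0.5:              # random horizontal flip
        img = img[:, ::-1, :]
    h, w, _ = img.shape                     # random crop of size crop x crop
    y = np.random.randint(0, h - crop + 1)
    x = np.random.randint(0, w - crop + 1)
    return img[y:y + crop, x:x + crop, :]
```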
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30
o Weights change → the distribution of the layer inputs changes per round
◦ Internal covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize $x_l$ to $N(0, 1)$ and then rescale
Batch normalization [Ioffe2015]
[Figure: the layer-l input distribution $x_l$ drifts between backpropagation steps (t), (t+0.5) and (t+1); batch normalization keeps it in a stable regime with respect to the loss $\mathcal{L}$.]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31
Batch normalization - Intuitively
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32
Batch normalization – The algorithm
o $\mu_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$ [compute mini-batch mean]
o $\sigma^2_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ [compute mini-batch variance]
o $\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \varepsilon}}$ [normalize input]
o $y_i \leftarrow \gamma \hat{x}_i + \beta$ [scale and shift input; $\gamma$, $\beta$ are trainable parameters]
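A minimal sketch of the forward pass of the algorithm above for an (m, d) mini-batch; the running statistics needed at test time are left out to keep the sketch short.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize input
    return gamma * x_hat + beta              # scale and shift (trainable gamma, beta)
```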
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise: possibly exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near-optimal "regime"
o Better model regularization
◦ Neuron activations are not deterministic; they depend on the batch
◦ The model cannot be overconfident
Batch normalization - Benefits
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34
Regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L})) + \lambda\Omega(\theta)$
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
Regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36
o Most important (or most popular) regularization
$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L})) + \frac{\lambda}{2}\sum_l \|\theta_l\|^2$
o The ℓ2-regularization can pass inside the gradient descent update rule:
$\theta^{(t+1)} = \theta^{(t)} - \eta_t\left(\nabla_\theta\mathcal{L} + \lambda\theta_l\right) \Rightarrow \theta^{(t+1)} = (1 - \lambda\eta_t)\,\theta^{(t)} - \eta_t\nabla_\theta\mathcal{L}$
o $\lambda$ is usually about $10^{-1}$ to $10^{-2}$
ℓ2-regularization
"Weight decay", because the weights get smaller
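A minimal sketch of the rewritten update rule, showing why ℓ2-regularization acts as weight decay; `theta` and `grad` are assumed to be NumPy arrays or scalars.

```python
def sgd_step_l2(theta, grad, lr, lam):
    # theta <- (1 - lr * lam) * theta - lr * grad   ("weight decay")
    return (1.0 - lr * lam) * theta - lr * grad
```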
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37
o ℓ1-regularization is also one of the most important techniques
$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L})) + \frac{\lambda}{2}\sum_l |\theta_l|$
o ℓ1-regularization also passes inside the gradient descent update rule:
$\theta^{(t+1)} = \theta^{(t)} - \lambda\eta_t\,\frac{\theta^{(t)}}{|\theta^{(t)}|} - \eta_t\nabla_\theta\mathcal{L}$, where $\frac{\theta^{(t)}}{|\theta^{(t)}|}$ is the sign function
o ℓ1-regularization → sparse weights
◦ $\lambda \nearrow$ → more weights become 0
ℓ1-regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38
o To tackle overfitting, another popular technique is early stopping
o Monitor performance on a separate validation set
o Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing
◦ This quite likely means the network is starting to overfit
Early stopping
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39
o During training, set activations randomly to 0
◦ Neurons are sampled at random from a Bernoulli distribution with $p = 0.5$
o At test time all neurons are used
◦ Neuron activations are reweighted by $p$ (see the sketch below)
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No "free-rider" neurons that rely on others
◦ Every neuron becomes more robust
◦ Significantly decreases overfitting
◦ Significantly improves training speed
Dropout [Srivastava2014]
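A minimal sketch of the dropout forward pass as described above: Bernoulli-masked activations during training and reweighting by $p$ at test time (the formulation on the slide; "inverted" dropout would instead rescale during training).

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    if train:
        mask = np.random.rand(*a.shape) < p   # keep each neuron with probability p
        return a * mask                       # dropped activations become 0
    return a * p                              # test time: reweight activations by p
```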
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
Dropout
[Figure over several slides: the original model, then the "thinned" networks sampled at epoch 1 and at epoch 2]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45
Architectural details
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46
o Straightforward sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ E.g. tanh, which returns $a(x{=}0) = 0$
◦ Recommended sigmoid: $a = h(x) = 1.7159\,\tanh\!\left(\frac{2}{3}x\right)$
o You can add a linear term to avoid flat areas:
$a = h(x) = \tanh(x) + \beta x$
Sigmoid-like activation functions
[Figure: $\tanh(x)$ and $\tanh(x) + 0.5x$]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47
o RBF: $a = h(x) = \sum_j u_j \exp\!\left(-\beta_j \|x - w_j\|^2\right)$
o Sigmoid: $a = h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
o Sigmoids can cover the full feature space
o RBF’s are much more local in the feature space
◦ Can be faster to train but with a more limited range
◦ Can give better set of basis functions
◦ Preferred in lower dimensional spaces
RBFs vs “Sigmoids”
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48
o Activation function: $a = h(x) = \max(0, x)$
o Gradient w.r.t. the input:
$\frac{\partial a}{\partial x} = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$
o Very popular in computer vision and speech recognition
o Much faster computation of activations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die"; too high a learning rate aggravates the problem
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49
ReLU convergence rate
[Figure: training error vs. epochs for a ReLU network and a tanh network; the ReLU network converges much faster]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50
o Soft approximation (softplus): $a = h(x) = \ln(1 + e^x)$
o Noisy ReLU: $a = h(x) = \max(0, x + \varepsilon)$, $\varepsilon \sim N(0, \sigma(x))$
o Leaky ReLU: $a = h(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.01x, & \text{otherwise} \end{cases}$
o Parametric ReLU: $a = h(x) = \begin{cases} x, & \text{if } x > 0 \\ \beta x, & \text{otherwise} \end{cases}$ (the parameter $\beta$ is trainable)
Other ReLUs
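Minimal sketches of the activations above; the softplus uses a numerically stable rewrite of $\ln(1 + e^x)$, and the PReLU slope $\beta$ is assumed to be passed in as a (trainable) parameter.

```python
import numpy as np

def softplus(x):
    # ln(1 + e^x), written stably as max(x, 0) + ln(1 + e^{-|x|})
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def prelu(x, beta):
    return np.where(x > 0, x, beta * x)      # beta is trainable
```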
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
Architectural hyper-parameters
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52
o Dataset-dependent hyper-parameters
o Tip: Start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important, use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
Number of neurons, number of hidden layers
[Figure: generalization vs. model complexity (number of neurons)]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53
Learning rate
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54
o The right learning rate $\eta_t$ is very important for fast convergence
◦ Too strong → gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ Make the learning rate of (shared) weights proportional to the square root of the number of connections sharing that weight
o Adaptive learning rates are also possible, based on the errors observed
◦ [Sompolinsky1995]
Learning rate
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate (e.g. to $\eta_t/T$) every $T$ epochs
o Inverse decay: $\eta_t = \frac{\eta_0}{1 + \varepsilon t}$
o Exponential decay: $\eta_t = \eta_0 e^{-\varepsilon t}$
o Step decay is often preferred
◦ simple, intuitive, works well, and only a single extra hyper-parameter $T$ ($T$ = 2, 10)
Learning rate schedules
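A minimal sketch of the step and exponential decay schedules above; $T$, the decay factor and $\varepsilon$ are hyper-parameters, and the factor 0.5 is only an illustrative choice.

```python
import numpy as np

def step_decay(lr0, epoch, T=10, factor=0.5):
    # shrink the learning rate by `factor` every T epochs
    return lr0 * factor ** (epoch // T)

def exp_decay(lr0, t, eps=0.01):
    # eta_t = eta_0 * exp(-eps * t)
    return lr0 * np.exp(-eps * t)
```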
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56
o Try several log-spaced values ($10^{-1}, 10^{-2}, 10^{-3}, \dots$) on a smaller set
◦ Then, you can narrow it down from there around where you get the lowest error
o You can decrease the learning rate every 10 (or some other value) full
training set epochs
◦ Although this highly depends on your data
Learning rate in practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57
Weight initialization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58
o There are a few contradictory requirements
o Weights need to be small enough
◦ around the origin ($\mathbf{0}$) for symmetric functions (tanh, sigmoid)
◦ When training starts, it is better to stimulate the activation functions near their linear regime
◦ larger gradients → faster training
o Weights need to be large enough
◦ Otherwise the signal is too weak for any serious learning
Weight initialization
[Figure: tanh and sigmoid curves with their linear regime, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 59
o Weights must be initialized to preserve the variance of the activations during
the forward and backward computations
◦ Especially for deep learning
◦ All neurons should operate in their full capacity
Question: Why similar input/output variance?
Answer: Because the output of one module is the input to another
o Good practice: initialize weights to be asymmetric
◦ Don't give the same values to all weights (like all $\mathbf{0}$)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ the non-linearities
◦ the data normalization
Weight initialization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62
o For tanh, initialize weights from the range $\left[-\sqrt{\frac{6}{d_{l-1}+d_l}},\; \sqrt{\frac{6}{d_{l-1}+d_l}}\right]$
◦ $d_{l-1}$ is the number of input variables to the tanh layer and $d_l$ is the number of output variables
o For a sigmoid: $\left[-4\sqrt{\frac{6}{d_{l-1}+d_l}},\; 4\sqrt{\frac{6}{d_{l-1}+d_l}}\right]$
One way of initializing sigmoid-like neurons
[Figure: the linear regime of the activation, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63
o For $a = \theta x$ the variance is
$Var(a) = E[x]^2 Var(\theta) + E[\theta]^2 Var(x) + Var(x)\,Var(\theta)$
o Since $E[x] = E[\theta] = 0$:
$Var(a) = Var(x)\,Var(\theta) \approx d \cdot Var(x_i)\,Var(\theta_i)$
o For $Var(a) = Var(x) \Rightarrow Var(\theta_i) = \frac{1}{d}$
o Draw random weights from $\theta \sim N(0, 1/d)$, where $d$ is the number of neurons in the input
Xavier initialization [Glorot2010]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64
o Unlike sigmoids, ReLUs ground to 0 the linear activations half the time
o Double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from $w \sim N(0, 2/d)$, where $d$ is the number of neurons in the input
[He2015] initialization for ReLUs
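A minimal sketch of both initializations for a fully connected layer; here $d$ is taken to be the fan-in $d_{l-1}$, following the variance derivation above.

```python
import numpy as np

def xavier_init(d_in, d_out):
    # Var(theta_i) = 1/d keeps the activation variance roughly constant
    return np.random.randn(d_in, d_out) * np.sqrt(1.0 / d_in)

def he_init(d_in, d_out):
    # double the variance to compensate for ReLU zeroing half the activations
    return np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)
```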
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65
Loss functions
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66
o Our samples contain only one class
◦ There is only one correct answer per sample
o Negative log-likelihood (cross entropy) + Softmax:
$\mathcal{L}(\theta; x, y) = -\sum_{c=1}^{C} y_c \log a_L^c$, for all classes $c = 1, \dots, C$
o Hierarchical softmax when $C$ is very large
o Hinge loss (aka SVM loss):
$\mathcal{L}(\theta; x, y) = \sum_{c=1, c \ne y}^{C} \max(0, a_L^c - a_L^y + 1)$
o Squared hinge loss
Multi-class classification
"Is it a cat? Is it a horse? …"
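A minimal sketch of the cross-entropy + softmax loss for a batch of logits and one-hot labels, using the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def softmax_cross_entropy(logits, y_onehot):
    z = logits - logits.max(axis=1, keepdims=True)                 # stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log softmax
    return -(y_onehot * log_probs).sum(axis=1).mean()              # -sum_c y_c log a_L^c
```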
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67
o Each sample can have many correct answers
o Hinge loss and the like
◦ Sigmoids would also work
o Each output neuron is independent
◦ “Does this contain a car, yes or no?“
◦ “Does this contain a person, yes or no?“
◦ “Does this contain a motorbike, yes or no?“
◦ “Does this contain a horse, yes or no?“
o Instead of “Is this a car, motorbike or person?”
◦ $p(\text{car}|x) = 0.55$, $p(\text{m/bike}|x) = 0.25$, $p(\text{person}|x) = 0.15$, $p(\text{horse}|x) = 0.05$
◦ $p(\text{car}|x) + p(\text{m/bike}|x) + p(\text{person}|x) + p(\text{horse}|x) = 1.0$
Multi-class, multi-label classification
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68
o The good old Euclidean loss:
$\mathcal{L}(\theta; x, y) = \frac{1}{2}\,\|y - a_L\|_2^2$
o Or an RBF on top of the Euclidean loss:
$\mathcal{L}(\theta; x, y) = \sum_j u_j \exp\!\left(-\beta_j (y - a_L)^2\right)$
o Or the ℓ1 distance:
$\mathcal{L}(\theta; x, y) = \sum_j |y_j - a_L^j|$
Regression
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69
Even better
optimizations
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70
o Don't let the gradients switch direction all the time
o Maintain "momentum" from the previous parameter updates
o More robust gradients and learning → faster convergence
o Nice "physics"-based interpretation
◦ Instead of updating the position of the "ball", we
update the velocity, which updates the position
Momentum
$u_\theta^{(t+1)} = \gamma\, u_\theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}, \qquad \theta^{(t+1)} = \theta^{(t)} + u_\theta^{(t+1)}$
[Figure: on the loss surface, the gradient + momentum path is smoother than the plain gradient path]
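A minimal sketch of one momentum step in the velocity form above; $\gamma = 0.9$ is only a common default.

```python
def momentum_step(theta, velocity, grad, lr, gamma=0.9):
    velocity = gamma * velocity - lr * grad   # update the velocity ...
    theta = theta + velocity                  # ... which updates the position
    return theta, velocity
```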
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71
o Use the future gradient instead of
the current gradient
o Better theoretical convergence
o Generally works better with
Convolutional Neural Networks
Nesterov Momentum [Sutskever2013]
$u_\theta^{(t+1)} = \gamma\, u_\theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}\!\left(\theta^{(t)} + \gamma\, u_\theta^{(t)}\right), \qquad \theta^{(t+1)} = \theta^{(t)} + u_\theta^{(t+1)}$
[Figure: plain gradient + momentum vs. Nesterov momentum, which uses the look-ahead gradient from the next step]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72
o Normally all weights are updated with the same "aggressiveness"
◦ Often some parameters could enjoy more "teaching"
◦ While others are already about there
o Adapt the learning per parameter:
$\theta^{(t+1)} = \theta^{(t)} - H_\mathcal{L}^{-1}\, \eta_t \nabla_\theta \mathcal{L}$
o $H_\mathcal{L}$ is the Hessian matrix of $\mathcal{L}$, i.e. the second-order derivatives:
$H_\mathcal{L}^{ij} = \frac{\partial^2 \mathcal{L}}{\partial\theta_i\,\partial\theta_j}$
Second order optimization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian, e.g. with the L-BFGS algorithm
◦ Keeps a memory of past gradients to approximate the inverse Hessian
o L-BFGS works alright with (full-batch) Gradient Descent. What about SGD?
o In practice, SGD with some good momentum works just fine
Second order optimization methods in practice
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
Other per-parameter adaptive optimizations
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75
o Schedule:
$m_j = \sum_\tau \left(\nabla_\theta \mathcal{L}_j\right)^2 \;\Longrightarrow\; \theta^{(t+1)} = \theta^{(t)} - \eta_t\, \frac{\nabla_\theta \mathcal{L}}{\sqrt{m} + \varepsilon}$
◦ $\varepsilon$ is a small number to avoid division by 0
◦ Gradients become gradually smaller and smaller
Adagrad [Duchi2011]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76
o Schedule:
$m_j = \alpha \sum_{\tau=1}^{t-1} \left(\nabla^{(\tau)}_\theta \mathcal{L}_j\right)^2 + (1 - \alpha)\left(\nabla^{(t)}_\theta \mathcal{L}_j\right)^2$
$\theta^{(t+1)} = \theta^{(t)} - \eta_t\, \frac{\nabla_\theta \mathcal{L}}{\sqrt{m} + \varepsilon}$
o Moving average of the squared gradients (compared to Adagrad); $\alpha$ is a decay hyper-parameter
o Square rooting boosts small values while suppressing large values
o Large gradients, e.g. a too "noisy" loss surface
◦ Updates are tamed
o Small gradients, e.g. stuck in a flat loss-surface ravine
◦ Updates become more aggressive
RMSprop
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77
o One of the most popular learning algorithms
$m_j = \sum_\tau \left(\nabla_\theta \mathcal{L}_j\right)^2$
$\theta^{(t+0.5)} = \beta_1 \theta^{(t)} + (1 - \beta_1)\nabla_\theta \mathcal{L}$
$v^{(t+0.5)} = \beta_2 v^{(t)} + (1 - \beta_2)\, m$
$\theta^{(t+1)} = \theta^{(t)} - \eta_t\, \frac{\theta^{(t+0.5)}}{\sqrt{v^{(t+0.5)}} + \varepsilon}$
o Similar to RMSprop, but with momentum
o Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$
Adam [Kingma2014]
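A minimal sketch of one Adam step. It follows the published algorithm [Kingma2014], which also includes the bias-correction terms that the slide's shorthand leaves out; `t` is the 1-based step counter.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad              # first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad ** 2         # second moment (RMSprop-like)
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```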
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78
Visual overview
Picture credit: Alec Radford
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79
o Learning to learn by gradient descent by gradient descent
◦ [Andrychowicz2016]
o $\theta^{(t+1)} = \theta^{(t)} + g_t\!\left(\nabla_\theta \mathcal{L}, \varphi\right)$
o $g_t$ is an "optimizer" with its own parameters $\varphi$
◦ Implemented as a recurrent network
Learning –not computing– the gradients
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80
Good practice
o Preprocess the data to at least have 0 mean
o Initialize weights based on the activation functions
◦ For ReLU: Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81
Babysitting
Deep Nets
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82
o Always check your gradients if they are not computed automatically
o Check that in the first round you get a loss consistent with random guessing
o Check the network with a few samples
◦ Turn off regularization. You should predictably overfit and reach a loss of 0
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the loss curves between the training and validation sets
◦ There should be a gap, but not too large a one
Babysitting Deep Nets
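A minimal sketch of a numerical gradient check with central differences; `loss_fn` is a hypothetical function of the (flattened) parameters, and the result is compared against your analytic gradients.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps
        grad.flat[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    return grad
```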
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84
o http://www.deeplearningbook.org/
◦ Part II: Chapter 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas, Learning to learn by gradient descent by gradient
descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002]
Reading material & references
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85
Next lecture
o What are the Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?

More Related Content

Similar to lecture3.pdf

NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxDrKBManwade
 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxssuserd23711
 
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Giacomo Boracchi
 
Deep learning architectures
Deep learning architecturesDeep learning architectures
Deep learning architecturesJoe li
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning ratesMLconf
 
Learning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxLearning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxamrita chaturvedi
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowYi-Fan Liou
 
Cheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksCheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksSteve Nouri
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsSungchul Kim
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierNeha Kulkarni
 
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Giacomo Boracchi
 
Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersMadhumita Tamhane
 
Lecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsLecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsMostafa El-Hosseini
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfssuseradaf5f
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 

Similar to lecture3.pdf (20)

NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptx
 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptx
 
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
 
Deep learning architectures
Deep learning architecturesDeep learning architectures
Deep learning architectures
 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning rates
 
Learning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxLearning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptx
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with Tensorflow
 
Cheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksCheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricks
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
 
Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parameters
 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdf
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
Lecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsLecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_models
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdf
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 

More from Tigabu Yaya

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdfTigabu Yaya
 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfTigabu Yaya
 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfTigabu Yaya
 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfTigabu Yaya
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdfTigabu Yaya
 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.pptTigabu Yaya
 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.pptTigabu Yaya
 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfTigabu Yaya
 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdfTigabu Yaya
 
nabil-201-chap-03.ppt
nabil-201-chap-03.pptnabil-201-chap-03.ppt
nabil-201-chap-03.pptTigabu Yaya
 
Mat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptMat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptTigabu Yaya
 

More from Tigabu Yaya (20)

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf
 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdf
 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdf
 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdf
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf
 
Regression.pptx
Regression.pptxRegression.pptx
Regression.pptx
 
lecture6.pdf
lecture6.pdflecture6.pdf
lecture6.pdf
 
lecture5.pdf
lecture5.pdflecture5.pdf
lecture5.pdf
 
Chap 4.ppt
Chap 4.pptChap 4.ppt
Chap 4.ppt
 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.ppt
 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.ppt
 
nnfl.0620.pptx
nnfl.0620.pptxnnfl.0620.pptx
nnfl.0620.pptx
 
L20.ppt
L20.pptL20.ppt
L20.ppt
 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdf
 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdf
 
nabil-201-chap-03.ppt
nabil-201-chap-03.pptnabil-201-chap-03.ppt
nabil-201-chap-03.ppt
 
Mat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptMat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.ppt
 

Recently uploaded

Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 

Recently uploaded (20)

Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 

is going to always happen
• But the noisy SGD gradients can sometimes help in escaping local optima
(Figure: loss surface)
  • 11. SGD is often better
o (A bit) noisy gradients act as regularization
o Gradient Descent → complete gradients
o Complete gradients fit the (arbitrary) data we have optimally, not the distribution that generates them
◦ As if all training samples were the "absolute representative" of the input distribution
◦ As if test data will be no different from training data
◦ Suitable for traditional optimization problems: "find the optimal route"
◦ But in ML we cannot make this assumption → test data are always different
o Stochastic gradients → the sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
  • 12–14. SGD is faster
(Figures: the full-batch gradient of a toy dataset, then the same dataset replicated 10×, asking "What is our gradient now?" — the gradient is the same, but the full-batch computation costs 10× more.)
  • 15. SGD is faster
o Of course, in real situations data do not literally replicate
o However, after a sizeable amount of data there are clusters of samples that are similar
o Hence, the (mini-batch) gradient is approximately right
o Approximately right is great — in many cases it is actually even better

  • 16. SGD for dynamically changing datasets
o Often datasets are not "rigid"
o Imagine Instagram
◦ Assume 1 million new images are uploaded per week and we want to build a "cool picture" classifier
◦ Should "cool pictures" from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as the results are averaged over the many more "past" samples
◦ The past "over-dominates"
o A properly implemented SGD can track the changes much better and give better models [LeCun2002]
(Figure: images popular today, in 2014, and in 2010)
  • 17. Shuffling examples
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise the gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer "hard examples"
◦ Don't overdo it though — beware of outliers
o In practice, split your dataset into mini-batches (see the sketch below)
◦ Each mini-batch should be as class-divergent and rich as possible
◦ At every new epoch → to be safe, new batches and newly, randomly shuffled examples
(Figure: the dataset shuffled differently at epoch t and epoch t+1)
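As a rough illustration of the per-epoch reshuffling above, a minimal NumPy sketch (the array names, batch size and toy data are illustrative, not from the slides):

import numpy as np

def minibatches(X, y, batch_size=64, rng=np.random.default_rng(0)):
    """Yield randomly shuffled mini-batches; each call draws a fresh permutation."""
    idx = rng.permutation(len(X))              # new random order each call (i.e., each epoch)
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# usage: two epochs over toy data, each with a fresh shuffle
X, y = np.random.randn(256, 10), np.random.randint(0, 3, 256)
for epoch in range(2):
    for xb, yb in minibatches(X, y):
        pass                                   # compute the loss on (xb, yb) and take an SGD step here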
  • 18. Advantages of Gradient Descent (batch) learning
o The conditions of convergence are well understood
o Acceleration techniques can be applied
◦ Second-order (Hessian-based) optimization is possible
◦ Measuring not only the gradients, but also the curvature of the loss surface
o Simpler theoretical analysis of weight dynamics and convergence rates

  • 19. In practice
o SGD is preferred over Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets full Gradient Descent is not even realistic
o The solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ A hyper-parameter, trial & error
◦ Usually between 32 and 256 samples
(A minimal SGD loop is sketched below.)
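To make the SGD update rule concrete, a tiny mini-batch loop for least-squares linear regression — a sketch only; the toy data, batch size of 32, learning rate and epoch count are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # toy inputs
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)   # toy targets

theta, eta, batch_size = np.zeros(5), 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        err = X[b] @ theta - y[b]                  # residuals on the mini-batch
        grad = X[b].T @ err / len(b)               # mini-batch gradient of 0.5 * mean(err^2)
        theta -= eta * grad                        # SGD step with the averaged mini-batch gradient
print(theta)                                       # roughly recovers true_theta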
  • 20. Section: Data preprocessing & normalization
(Recurring recap panel: 1. the neural network, 2. learning by minimizing the empirical error, 3. optimizing with Gradient Descent-based methods)

  • 21. Data pre-processing
o Center the data to be roughly 0
◦ Activation functions are usually "centered" around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias on the gradient direction → might slow down learning
(Figure: ReLU, tanh(x) and σ(x) activation curves)

  • 22. Data pre-processing
o Scale the input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
(Example from the slide: x = [x_1, x_2, x_3]^T, θ = [θ_1, θ_2, θ_3]^T, a = tanh(θ^T x). If x_1, x_2, x_3 have very different covariances, the generated gradients dℒ/dθ at x_1, x_2, x_3 are very different, and the update θ^(t+1) = θ^(t) − η_t [dℒ/dθ_1, dℒ/dθ_2, dℒ/dθ_3]^T becomes harder.)

  • 23. Data pre-processing
o Input variables should be as decorrelated as possible
◦ Input variables are "more independent"
◦ The network is forced to find non-trivial correlations between the inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are correlated by definition (sequences)
o [CAUTION] Extreme case
◦ Extreme correlation (linear dependency) might cause problems

  • 24. Normalization: N(μ, σ²) → N(0, 1)
o Assume the input variables follow a (roughly) Gaussian distribution
o In practice (see the sketch below):
◦ Compute the mean and standard deviation on the training set
◦ Subtract the mean from the training samples
◦ Divide the result by the standard deviation
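A minimal NumPy sketch of the per-dimension standardization described on slide 24, with the statistics computed on the training set only and reused for test data (variable names are illustrative):

import numpy as np

X_train = np.random.randn(1000, 20) * 5.0 + 3.0   # toy training data
X_test  = np.random.randn(200, 20) * 5.0 + 3.0

mu    = X_train.mean(axis=0)            # per-dimension mean from the training set
sigma = X_train.std(axis=0) + 1e-8      # per-dimension std (epsilon avoids division by zero)

X_train_norm = (X_train - mu) / sigma   # roughly N(0, 1) per dimension
X_test_norm  = (X_test  - mu) / sigma   # reuse the *training* statistics at test time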
  • 25. Making things faster
o Instead of "per dimension" → normalize all input dimensions simultaneously
o If the dimensions have similar value ranges (e.g. pixels in natural images)
◦ Compute one (μ, σ²) instead of as many as there are input variables
◦ Or the per-color-channel pixel average/variance: (μ_red, σ²_red), (μ_green, σ²_green), (μ_blue, σ²_blue)

  • 26. Even simpler: centering the input
o When the input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same ranges
o Just make sure the images have mean 0 (μ = 0)

  • 27. PCA & whitening
o If C is the covariance matrix of your dataset, compute its eigenvalues and eigenvectors with SVD: U, Σ, V^T = svd(C)
o Decorrelate (PCA) the dataset by X_rot = U^T X
◦ Keep a subset of the eigenvectors U' = [u_1, …, u_q] to reduce the data dimensionality
o Scale by the square root of the eigenvalues to whiten the data: X_wht = X_rot / √Σ
o Not used much with Convolutional Neural Nets
◦ The zero-mean normalization is more important
(A code sketch follows below.)
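A rough NumPy sketch of the PCA/whitening recipe of slide 27. Here samples are rows, so the rotation UᵀX becomes X @ U; the toy data and the choice of 10 kept components are illustrative:

import numpy as np

X = np.random.randn(500, 32)            # toy data: 500 samples, 32 dimensions
X -= X.mean(axis=0)                     # zero-center first

C = X.T @ X / X.shape[0]                # covariance matrix
U, S, Vt = np.linalg.svd(C)             # eigenvectors in U, eigenvalues in S

X_rot = X @ U                           # decorrelated (PCA-rotated) data
X_red = X @ U[:, :10]                   # optional: keep only the top-10 components
X_wht = X_rot / np.sqrt(S + 1e-5)       # whitened data (epsilon for numerical stability)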
  • 28. Example
(Images taken from A. Karpathy's course website: http://cs231n.github.io/neural-networks-2/)

  • 29. Data augmentation [Krizhevsky2012]
(Figure: an original image and its augmented versions — flip, random crop, contrast, tint)

  • 30. Batch normalization [Ioffe2015]
o As the weights change, the distribution of the layer inputs changes per round
◦ Covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize x_l to N(0, 1) and rescale
(Figure: the layer-l input distribution drifting between steps t, t+0.5 and t+1 during backpropagation, with and without batch normalization)

  • 31. Batch normalization — intuitively
(Figure: the same drifting input distributions, stabilized by batch normalization)

  • 32. Batch normalization — the algorithm
o μ_B ← (1/m) Σ_{i=1..m} x_i [compute the mini-batch mean]
o σ²_B ← (1/m) Σ_{i=1..m} (x_i − μ_B)² [compute the mini-batch variance]
o x̂_i ← (x_i − μ_B) / √(σ²_B + ε) [normalize the input]
o y_i ← γ x̂_i + β [scale and shift the input]
o γ and β are trainable parameters
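A compact NumPy sketch of the training-time batch-normalization forward pass above; the running-average bookkeeping used at test time is omitted, and the toy shapes are illustrative:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Returns the scaled-and-shifted normalized activations."""
    mu  = x.mean(axis=0)                    # mini-batch mean, per feature
    var = x.var(axis=0)                     # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to roughly N(0, 1)
    return gamma * x_hat + beta             # learnable scale (gamma) and shift (beta)

x = np.random.randn(64, 100) * 3 + 2        # toy pre-activations
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # ~0 and ~1 per feature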
  • 33. Batch normalization — benefits
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise possibly exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near-optimal "regime"
o Better model regularization
◦ Neuron activations are not deterministic; they depend on the batch
◦ The model cannot be overconfident

  • 34. Section: Regularization
(Recurring recap panel)

  • 35. Regularization
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting:
θ* ← arg min_θ Σ_(x,y)⊆(X,Y) ℓ(y, a_L(x; θ_1,…,L)) + λΩ(θ)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout

  • 36. ℓ2-regularization
o The most important (or most popular) regularization:
θ* ← arg min_θ Σ_(x,y)⊆(X,Y) ℒ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l ‖θ_l‖²
o The ℓ2 term can be folded into the gradient descent update rule:
θ^(t+1) = θ^(t) − η_t (∇_θℒ + λθ^(t)) ⟹ θ^(t+1) = (1 − λη_t) θ^(t) − η_t ∇_θℒ
o Also called "weight decay", because the weights get smaller
o λ is usually about 10⁻¹ – 10⁻²

  • 37. ℓ1-regularization
o ℓ1-regularization is one of the most important techniques:
θ* ← arg min_θ Σ_(x,y)⊆(X,Y) ℒ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l |θ_l|
o ℓ1-regularization also passes inside the gradient descent update rule:
θ^(t+1) = θ^(t) − λη_t θ^(t)/|θ^(t)| − η_t ∇_θℒ, where θ^(t)/|θ^(t)| is the sign function
o ℓ1-regularization → sparse weights
◦ λ ↗ → more weights become exactly 0
(Both update rules are sketched in code below.)
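A small sketch of the two regularized update rules above applied to a parameter vector (the learning rate and λ are arbitrary illustrative values):

import numpy as np

def sgd_step_l2(theta, grad, lr=0.1, lam=0.01):
    # weight decay form: theta <- (1 - lam*lr) * theta - lr * grad
    return (1.0 - lam * lr) * theta - lr * grad

def sgd_step_l1(theta, grad, lr=0.1, lam=0.01):
    # sign form: theta <- theta - lam*lr*sign(theta) - lr*grad  (pushes weights toward exactly 0)
    return theta - lam * lr * np.sign(theta) - lr * grad

theta = np.array([0.5, -0.3, 0.001])
grad  = np.array([0.2, -0.1, 0.0])
print(sgd_step_l2(theta, grad))
print(sgd_step_l1(theta, grad))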
  • 38. Early stopping
o Another popular technique to tackle overfitting
o Monitor the performance on a separate validation set
o Training decreases the training error, and the validation error as well (although usually at a slower rate)
o Stop when the validation error starts increasing (a sketch follows below)
◦ This quite likely means the network has started to overfit
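A sketch of the early-stopping rule with a simple patience counter; train_one_epoch and validation_error are hypothetical callables standing in for a real training loop, and the dummy error curve only serves the demo:

def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=100, patience=5):
    """Stop once the validation error has not improved for `patience` consecutive epochs."""
    best_err, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        err = validation_error(epoch)
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0   # checkpoint the best model here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                          # validation error keeps rising -> stop
    return best_epoch, best_err

# dummy demo: the validation error falls until epoch 10, then rises again
errors = [1.0 / (e + 1) if e <= 10 else 0.09 + 0.01 * (e - 10) for e in range(100)]
print(train_with_early_stopping(lambda e: None, lambda e: errors[e]))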
  • 39. Dropout [Srivastava2014]
o During training, set activations randomly to 0
◦ Neurons are sampled at random from a Bernoulli distribution with p = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by p
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No "free-rider" neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
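A minimal NumPy sketch of the dropout behaviour described above, in the classic formulation that rescales by p at test time (many libraries instead use "inverted dropout", which rescales during training):

import numpy as np

def dropout(a, p=0.5, train=True, rng=np.random.default_rng(0)):
    """a: layer activations. Keep each neuron with probability p."""
    if train:
        mask = rng.random(a.shape) < p     # Bernoulli(p) keep-mask
        return a * mask                    # dropped neurons output 0
    return a * p                           # test time: use all neurons, reweight by p

a = np.random.randn(4, 8)
print(dropout(a, train=True))
print(dropout(a, train=False))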
  • 40–44. Dropout
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
(Figures: the original model and the different thinned sub-networks sampled at epochs 1 and 2)
  • 45. Section: Architectural details
(Recurring recap panel)

  • 46. Sigmoid-like activation functions
o Plain sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ e.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: a = h(x) = 1.7159 · tanh((2/3)·x)
o You can add a linear term to avoid flat areas: a = h(x) = tanh(x) + βx
(Figure: tanh(x) vs. tanh(x) + 0.5x)

  • 47. RBFs vs. "sigmoids"
o RBF: a = h(x) = Σ_j u_j exp(−β_j ‖x − w_j‖²)
o Sigmoid: a = h(x) = σ(x) = 1 / (1 + e^(−x))
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train, but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower-dimensional spaces

  • 48. Rectified Linear Unit (ReLU) module [Krizhevsky2012]
o Activation function: a = h(x) = max(0, x)
o Gradient wrt the input: ∂a/∂x = 0 if x ≤ 0, 1 if x > 0
o Very popular in computer vision and speech recognition
o Much faster computations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die"; lower learning rates mitigate the problem

  • 49. ReLU convergence rate
(Figure: training error vs. epochs — ReLU converging faster than tanh)

  • 50. Other ReLUs (sketched in code below)
o Soft approximation (softplus): a = h(x) = ln(1 + e^x)
o Noisy ReLU: a = h(x) = max(0, x + ε), ε ~ N(0, σ(x))
o Leaky ReLU: a = h(x) = x if x > 0, 0.01x otherwise
o Parametric ReLU: a = h(x) = x if x > 0, βx otherwise (the parameter β is trainable)
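A short NumPy sketch of the ReLU variants listed on slide 50 (the 0.01 leak follows the slide; the β value is just an example, since in practice it would be learned):

import numpy as np

def relu(x):            return np.maximum(0.0, x)
def softplus(x):        return np.log1p(np.exp(x))           # smooth approximation of ReLU
def leaky_relu(x):      return np.where(x > 0, x, 0.01 * x)
def prelu(x, beta=0.1): return np.where(x > 0, x, beta * x)  # beta would be a trainable parameter

def relu_grad(x):       return (x > 0).astype(x.dtype)       # derivative: 0 for x <= 0, 1 for x > 0

x = np.linspace(-3, 3, 7)
print(relu(x), relu_grad(x))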
  • 51. Architectural hyper-parameters
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization

  • 52. Number of neurons, number of hidden layers
o These hyper-parameters are dataset dependent
o Tip: start small → increase the complexity gradually
◦ e.g. start with 2–3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important; use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
(Figure: generalization vs. model complexity, i.e. the number of neurons)

  • 53. Section: Learning rate
(Recurring recap panel)

  • 54. Learning rate
o The right learning rate η_t is very important for fast convergence
◦ Too strong → the gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ The learning rate of (shared) weights should be proportional to the square root of the number of connections sharing the weight
o Adaptive learning rates are also possible, based on the observed errors [Sompolinsky1995]

  • 55. Learning rate schedules (sketched in code below)
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate (e.g. divide it by a constant factor) every T epochs
o Inverse decay: η_t = η_0 / (1 + εt)
o Exponential decay: η_t = η_0 e^(−εt)
o Step decay is often preferred
◦ Simple, intuitive, works well, and only a single extra hyper-parameter T (T = 2, 10)
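A small sketch of the decaying schedules of slide 55 (η₀, ε, the step factor and the period are illustrative values):

import math

def step_decay(eta0, epoch, period=10, factor=0.5):
    return eta0 * (factor ** (epoch // period))   # divide the rate every `period` epochs

def inverse_decay(eta0, epoch, eps=0.1):
    return eta0 / (1.0 + eps * epoch)             # eta_t = eta_0 / (1 + eps*t)

def exponential_decay(eta0, epoch, eps=0.05):
    return eta0 * math.exp(-eps * epoch)          # eta_t = eta_0 * exp(-eps*t)

for epoch in (0, 10, 20, 50):
    print(epoch, step_decay(0.1, epoch), inverse_decay(0.1, epoch), exponential_decay(0.1, epoch))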
  • 56. Learning rate in practice
o Try several log-spaced values (10⁻¹, 10⁻², 10⁻³, …) on a smaller set
◦ Then narrow it down around the value that gives the lowest error
o You can decrease the learning rate every 10 (or some other number of) full training-set epochs
◦ Although this highly depends on your data

  • 57. Section: Weight initialization
(Recurring recap panel)

  • 58. Weight initialization
o There are a few contradictory requirements
o Weights need to be small enough
◦ Around the origin (0) for symmetric functions (tanh, sigmoid)
◦ When training starts, it is better to stimulate the activation functions near their linear regime
◦ Larger gradients → faster training
o Weights need to be large enough
◦ Otherwise the signal is too weak for any serious learning
(Figure: the linear, large-gradient regime of a sigmoid-like activation around 0)

  • 59–61. Weight initialization
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially important for deep learning
◦ All neurons should operate in their full capacity
o Question: why similar input/output variance? Answer: because the output of one module is the input to another
o Good practice: initialize the weights asymmetrically
◦ Don't give the same value to all weights (like all 0)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ the non-linearities
◦ the data normalization

  • 62. One way of initializing sigmoid-like neurons
o For tanh, initialize the weights uniformly from [−√6/√(d_{l−1} + d_l), +√6/√(d_{l−1} + d_l)]
◦ d_{l−1} is the number of input variables of the tanh layer and d_l the number of output variables
o For a sigmoid: [−4·√6/√(d_{l−1} + d_l), +4·√6/√(d_{l−1} + d_l)]

  • 63. Xavier initialization [Glorot2010]
o For a = θx the variance is Var(a) = E[x]² Var(θ) + E[θ]² Var(x) + Var(x) Var(θ)
o Since E[x] = E[θ] = 0: Var(a) = Var(x) Var(θ) ≈ d · Var(x_i) Var(θ_i)
o Requiring Var(a) = Var(x) ⇒ Var(θ_i) = 1/d
o Draw random weights from θ ~ N(0, 1/d), where d is the number of input neurons

  • 64. [He2015] initialization for ReLUs
o Unlike sigmoids, ReLUs ground half of the linear activations to 0
o Double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from w ~ N(0, 2/d), where d is the number of input neurons
(The three recipes are sketched in code below.)
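A minimal NumPy sketch of the three initialization recipes above, where fan_in and fan_out are the layer's input and output sizes:

import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))             # tanh range from slide 62
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))   # N(0, 1/d)

def he_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))   # N(0, 2/d), for ReLUs

W = he_normal(256, 128)
print(W.std())   # roughly sqrt(2/256)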
  • 65. Section: Loss functions
(Recurring recap panel)

  • 66. Multi-class classification ("Is it a cat? Is it a horse? …")
o Each sample contains only one class
◦ There is only one correct answer per sample
o Negative log-likelihood (cross-entropy) + softmax:
ℒ(θ; x, y) = −Σ_{c=1..C} y_c log a_L^c, over all classes c = 1, …, C
o Hierarchical softmax when C is very large
o Hinge loss (aka SVM loss):
ℒ(θ; x, y) = Σ_{c=1..C, c≠y} max(0, a_L^c − a_L^y + 1)
o Squared hinge loss
(Both losses are sketched in code below.)
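A small NumPy sketch of the two losses of slide 66 for a single sample with raw scores and a true class index (the scores are illustrative):

import numpy as np

def softmax_cross_entropy(scores, y):
    """Negative log-likelihood of the true class under a softmax over the scores."""
    z = scores - scores.max()                  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[y])

def hinge_loss(scores, y, margin=1.0):
    """Multi-class hinge (SVM) loss: sum over c != y of max(0, s_c - s_y + margin)."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0
    return margins.sum()

scores = np.array([2.0, -1.0, 0.5])            # e.g. cat / horse / dog scores
print(softmax_cross_entropy(scores, y=0), hinge_loss(scores, y=0))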
  • 67. Multi-class, multi-label classification
o Each sample can have many correct answers
o Hinge loss and the likes work
◦ Sigmoid outputs would also work
o Each output neuron is independent
◦ "Does this contain a car, yes or no?"
◦ "Does this contain a person, yes or no?"
◦ "Does this contain a motorbike, yes or no?"
◦ "Does this contain a horse, yes or no?"
o Instead of the single question "Is this a car, motorbike or person?", where
◦ p(car|x) = 0.55, p(m/bike|x) = 0.25, p(person|x) = 0.15, p(horse|x) = 0.05
◦ p(car|x) + p(m/bike|x) + p(person|x) + p(horse|x) = 1.0

  • 68. Regression
o The good old Euclidean loss: ℒ(θ; x, y) = ½ ‖y − a_L‖²₂
o Or an RBF on top of the Euclidean loss: ℒ(θ; x, y) = Σ_j u_j exp(−β_j (y − a_L)²)
o Or the ℓ1 distance: ℒ(θ; x, y) = Σ_j |y_j − a_L^j|

  • 69. Section: Even better optimizations
(Recurring recap panel)

  • 70. Momentum
o Don't switch gradient direction all the time
o Maintain "momentum" from the previous updates
o More robust gradients and learning → faster convergence
o Nice physics-based interpretation
◦ Instead of updating the position of the "ball", we update its velocity, which in turn updates the position
o Update rule: u_θ^(t) = γ u_θ^(t−1) − η_t ∇_θℒ, θ^(t+1) = θ^(t) + u_θ^(t)
(Figure: loss surface comparing the plain gradient step with the gradient + momentum step)

  • 71. Nesterov momentum [Sutskever2013]
o Use the "look-ahead" gradient of the next step instead of the current gradient
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
o Update rule: u_θ^(t) = γ u_θ^(t−1) − η_t ∇_θℒ(θ^(t) + γ u_θ^(t−1)), θ^(t+1) = θ^(t) + u_θ^(t)
(Figure: momentum step vs. the look-ahead gradient of Nesterov momentum; both updates are sketched in code below.)
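A sketch of plain and Nesterov momentum following the update rules above, run on a simple quadratic loss (γ, η and the toy gradient are illustrative):

import numpy as np

def grad(theta):
    # gradient of the toy quadratic loss theta_1^2 + 0.25*theta_2^2
    return np.array([2.0, 0.5]) * theta

def momentum_sgd(theta, steps=100, lr=0.1, gamma=0.9, nesterov=False):
    u = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta + gamma * u if nesterov else theta
        u = gamma * u - lr * grad(lookahead)   # update the velocity
        theta = theta + u                      # the velocity updates the position
    return theta

theta0 = np.array([5.0, -3.0])
print(momentum_sgd(theta0), momentum_sgd(theta0, nesterov=True))  # both approach the minimum at (0, 0)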
  • 72. Second-order optimization
o Normally all weights are updated with the same "aggressiveness"
◦ Often some parameters could enjoy more "teaching"
◦ While others are already about there
o Adapt the learning per parameter: θ^(t+1) = θ^(t) − η_t H_ℒ^(−1) ∇_θℒ
o H_ℒ is the Hessian matrix of ℒ, i.e. the second-order derivatives: [H_ℒ]_ij = ∂²ℒ / ∂θ_i ∂θ_j

  • 73. Second-order optimization methods in practice
o Inverting the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian, e.g. with the L-BFGS algorithm
◦ Keeps a memory of past gradients to approximate the inverse Hessian
o L-BFGS works alright with full-batch Gradient Descent — what about SGD?
o In practice, SGD with some good momentum works just fine

  • 74. Other per-parameter adaptive optimizations
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]

  • 75. Adagrad [Duchi2011]
o Schedule
◦ m_j = Σ_τ (∇_θℒ_j)² ⟹ θ^(t+1) = θ^(t) − η_t ∇_θℒ / (√m + ε)
◦ ε is a small number to avoid division by 0
◦ The effective updates become gradually smaller and smaller

  • 76. RMSprop
o Schedule
◦ m_j^(t) = α m_j^(t−1) + (1 − α)(∇_θ^(t)ℒ_j)², where α is a decay hyper-parameter
◦ θ^(t+1) = θ^(t) − η_t ∇_θℒ / (√m + ε)
o A moving average of the squared gradients, compared to Adagrad's full sum
o Large gradients, e.g. on a too "noisy" loss surface
◦ The updates are tamed
o Small gradients, e.g. stuck in a flat ravine of the loss surface
◦ The updates become more aggressive
o Square-rooting boosts small values while suppressing large ones

  • 77. Adam [Kingma2014]
o One of the most popular learning algorithms
o Keeps a moving average of the gradients (momentum) and of the squared gradients (as in RMSprop):
m^(t) = β_1 m^(t−1) + (1 − β_1) ∇_θℒ
v^(t) = β_2 v^(t−1) + (1 − β_2) (∇_θℒ)²
θ^(t+1) = θ^(t) − η_t m^(t) / (√v^(t) + ε)
o Similar to RMSprop, but with momentum
o Recommended values: β_1 = 0.9, β_2 = 0.999, ε = 10⁻⁸
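A compact sketch of the Adam update above on the same kind of toy quadratic loss; the bias-correction step from the original paper is included here even though the slide omits it, and the learning rate is an illustrative choice:

import numpy as np

def grad(theta):                                 # gradient of a toy quadratic loss
    return np.array([2.0, 0.5]) * theta

def adam(theta, steps=500, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(theta)                     # moving average of the gradient
    v = np.zeros_like(theta)                     # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)                # bias correction (part of the published algorithm)
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(np.array([5.0, -3.0])))               # approaches the minimum at (0, 0)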
  • 78. Visual overview
(Figure comparing the trajectories of the different optimizers on a loss surface; picture credit: Alec Radford)

  • 79. Learning — not computing — the gradients
o Learning to learn by gradient descent by gradient descent [Andrychowicz2016]
o θ^(t+1) = θ^(t) + g_t(∇_θℒ, φ)
o g_t is an "optimizer" with its own parameters φ
◦ Implemented as a recurrent network

  • 80. Good practice
o Preprocess the data to at least have 0 mean
o Initialize the weights based on the activation functions
◦ For ReLUs, use Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization

  • 81. Section: Babysitting Deep Nets
(Recurring recap panel)

  • 82. Babysitting Deep Nets
o Always check your gradients if they are not computed automatically (a numerical check is sketched below)
o Check that in the first round you get a random-chance loss
o Check the network with only a few samples
◦ Turn off regularization: you should predictably overfit and reach a 0 loss
◦ Turn on regularization: the loss should increase
o Have a separate validation set
◦ Compare the curves of the training and validation sets
◦ There should be a gap, but not too large a one
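One way to implement the gradient check mentioned above: a central-difference numerical gradient compared against the analytic gradient of a toy loss (the loss and tolerance are illustrative):

import numpy as np

def numerical_grad(loss_fn, theta, h=1e-5):
    """Central-difference estimate of d(loss)/d(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * h)
    return g

loss_fn = lambda th: 0.5 * np.sum(th ** 2)      # toy loss whose gradient is simply theta
theta = np.random.randn(5)
analytic = theta
numeric = numerical_grad(loss_fn, theta)
rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
print(rel_err.max())                            # should be tiny (e.g. < 1e-6)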
  • 83. Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices

  • 84. Reading material & references
o http://www.deeplearningbook.org/ — Part II: Chapters 7, 8
o [Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016
o [He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
o [Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
o [Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
o [Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014
o [Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
o [Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
o [Krizhevsky2012] Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
o [Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
o [Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
o [LeCun2002]

  • 85. Next lecture
o What are Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?