Lecture 3: Deeper into Deep Learning and Optimizations
Deep Learning @ UvA
Previous lecture
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Lecture overview
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Deeper into Neural Networks & Deep Neural Nets
A Neural/Deep Network in a nutshell
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
SGD vs GD
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Backpropagation again
o Step 1. Compute the forward propagation for all layers recursively
◦ a_l = h_l(x_l) and x_{l+1} = a_l
o Step 2. Once done with the forward propagation, follow the reverse path
◦ Start from the last layer and for each new layer compute the gradients
◦ Cache computations when possible to avoid redundant operations
◦ ∂ℒ/∂a_l = (∂a_{l+1}/∂x_{l+1})^T · ∂ℒ/∂a_{l+1}
◦ ∂ℒ/∂θ_l = ∂a_l/∂θ_l · (∂ℒ/∂a_l)^T
o Step 3. Use the gradients ∂ℒ/∂θ_l with Stochastic Gradient Descent to train (a minimal sketch follows below)
Still, backpropagation can be slow
o Often loss surfaces are
◦ non-quadratic
◦ highly non-convex
◦ very high-dimensional
o Datasets are typically too large to compute complete gradients
o No real guarantee that
◦ the final solution will be good
◦ we converge fast to the final solution
◦ or that there will be convergence at all
Stochastic Gradient Descent (SGD)
o Stochastically sample "mini-batches" B_j = sample(D) from the dataset D
◦ A mini-batch B_j can contain even just 1 sample
o Much faster than Gradient Descent
o Results are often better
o Also suitable for datasets that change over time
o The variance of the gradients increases when the batch size decreases
o Update rule (sketched in code below): θ^(t+1) = θ^(t) − (η_t / |B_j|) Σ_{i∈B_j} ∇_θℒ_i
SGD is often better
[Figure: a loss surface showing the current solution, the full GD gradient and the new GD solution it leads to, a noisy SGD gradient, and the best solutions reachable by GD and by SGD.]
o No guarantee that this is what is always going to happen
o But the noisy SGD gradients can sometimes help escape local optima
SGD is often better
o (A bit) noisy gradients act as regularization
o Gradient Descent → complete gradients
o Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them
◦ As if all training samples were the "absolute representatives" of the input distribution
◦ As if test data will be no different from the training data
◦ Suitable for traditional optimization problems: "find the optimal route"
◦ But in ML we cannot make this assumption → test data are always different
o Stochastic gradients → the sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
SGD is faster
[Figure: a toy dataset and its gradient; then the same dataset replicated 10x: "What is our gradient now?"]
o Of course, in real situations data do not replicate exactly
o However, after a sizeable amount of data there are clusters of data that are similar
o Hence, the (mini-batch) gradient is approximately alright
o Approximately alright is great; in many cases it is actually even better
SGD for dynamically changing datasets
o Often datasets are not "rigid"
o Imagine Instagram
◦ Let's assume 1 million new images are uploaded per week and we want to build a "cool picture" classifier
◦ Should "cool pictures" from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are averaged over the many more "past" samples
◦ The past "over-dominates"
o A properly implemented SGD can track changes much better and give better models [LeCun2002]
[Figure: examples of pictures popular today, in 2014, and in 2010.]
Shuffling examples
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer "hard examples"
◦ Don't overdo it though; beware of outliers
o In practice, split your dataset into mini-batches (see the sketch below)
◦ Each mini-batch is as class-divergent and rich as possible
◦ New epoch → to be safe, new batches and newly, randomly shuffled examples
[Figure: a dataset shuffled differently at epoch t and at epoch t+1.]
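A small NumPy sketch of both ideas: reshuffle at every epoch, and (one assumed way, not the course's exact recipe) draw a class-divergent mini-batch by sampling a few examples per class.

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffled_batches(X, y, batch_size):
    """New epoch -> newly shuffled examples, hence new mini-batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

def class_divergent_batch(X, y, per_class):
    """Sample per_class examples from every class (assumes each class has at least that many)."""
    picks = np.concatenate([rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
                            for c in np.unique(y)])
    rng.shuffle(picks)
    return X[picks], y[picks]

X, y = np.random.randn(120, 8), np.repeat(np.arange(4), 30)   # toy data, 4 classes
for Xb, yb in shuffled_batches(X, y, batch_size=16):
    pass                                                      # ... one training step per mini-batch
Xb, yb = class_divergent_batch(X, y, per_class=4)             # 16 samples, 4 from each class
```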
Advantages of Gradient Descent batch learning
o Conditions of convergence are well understood
o Acceleration techniques can be applied
◦ Second-order (Hessian-based) optimizations are possible
◦ Measuring not only gradients, but also curvatures of the loss surface
o Simpler theoretical analysis on weight dynamics and convergence rates
In practice
o SGD is preferred over Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets Gradient Descent is not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ A hyper-parameter, trial & error
◦ Usually between 32 and 256 samples
Data preprocessing & normalization
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Data pre-processing
o Center the data to be roughly 0
◦ Activation functions are usually "centered" around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias on the gradient direction → might slow down learning
◦ ReLU ☺, tanh(x) ☺, σ(x) ☹ (not zero-centered)
Data pre-processing
o Scale input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
o Example: x = [x_1, x_2, x_3]^T, θ = [θ_1, θ_2, θ_3]^T, a = tanh(θ^T x)
◦ x_1, x_2, x_3 with much different covariances generate much different gradients dℒ/dθ_1, dℒ/dθ_2, dℒ/dθ_3
◦ The gradient update θ^(t+1) = θ^(t) − η_t [dℒ/dθ_1, dℒ/dθ_2, dℒ/dθ_3]^T is then harder
Data pre-processing
o Input variables should be as decorrelated as possible
◦ Input variables are "more independent"
◦ The network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case
◦ Extreme correlation (linear dependency) might cause problems [CAUTION]
Data pre-processing: normalization N(μ, σ²) → N(0, 1)
o Make the input variables follow a Gaussian distribution, roughly
o In practice (see the sketch below):
◦ Compute the mean and standard deviation on the training set
◦ Subtract the mean from the training samples: x → x − μ
◦ Divide the result by the standard deviation: (x − μ) / σ
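A minimal sketch of the recipe in NumPy, on toy data standing in for real features; note that the statistics are computed on the training set only and then reused on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(1000, 20))   # toy training features
X_test = rng.normal(5.0, 3.0, size=(200, 20))     # toy test features

mu = X_train.mean(axis=0)                  # mean from the training set
sigma = X_train.std(axis=0) + 1e-8         # std from the training set (epsilon is our addition)

X_train_norm = (X_train - mu) / sigma      # roughly N(0, 1) per input dimension
X_test_norm = (X_test - mu) / sigma        # test data reuse the *training* statistics
```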
Making things faster: N(μ, σ²) → N(0, 1)
o Instead of "per dimension" → all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one μ, σ² instead of as many as the input variables
◦ Or the per-color-channel pixel average/variance: μ_red, σ²_red, μ_green, σ²_green, μ_blue, σ²_blue
Even simpler: centering the input
o When input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same ranges
o Just make sure the images have mean 0 (μ = 0)
PCA whitening
o If C is the covariance matrix of your dataset, compute its eigenvalues and eigenvectors with SVD: U, Σ, V^T = svd(C)
o Decorrelate (PCA) the dataset: X_rot = U^T X
◦ Keep a subset of eigenvectors U' = [u_1, …, u_q] to reduce the data dimensionality
o Scale by the square root of the eigenvalues to whiten the data: X_wht = X_rot / √Σ (sketched below)
o Not used much with Convolutional Neural Nets
◦ The zero-mean normalization is more important
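A short NumPy sketch of PCA whitening on toy correlated data; the small epsilon for numerical stability is our addition, and since samples are rows here the rotation is written X @ U rather than Uᵀ X.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ A          # toy correlated data, rows = samples
X = X - X.mean(axis=0)                     # zero-mean first

C = np.cov(X, rowvar=False)                # covariance matrix of the dataset
U, S, Vt = np.linalg.svd(C)                # eigenvectors in U, eigenvalues in S

X_rot = X @ U                              # decorrelated (PCA-ed) data
X_red = X @ U[:, :2]                       # optional: keep a subset of eigenvectors
X_wht = X_rot / np.sqrt(S + 1e-5)          # whitened data (epsilon for stability)
```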
Example
Images taken from A. Karpathy course website: http://cs231n.github.io/neural-networks-2/
Data augmentation [Krizhevsky2012]
[Figure: an original image and augmented versions of it: flip, random crop, contrast change, tint.]
Batch normalization [Ioffe2015]
o Weights change → the distribution of the layer inputs changes per round
◦ Covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize x_l to N(0, 1) and then rescale
[Figure: the input distribution of layer l drifts between iterations (t), (t+0.5) and (t+1) under backpropagation; batch normalization keeps it stable.]
Batch normalization – intuitively
[Figure: the same illustration; without batch normalization the layer-l input distribution shifts from iteration to iteration, with batch normalization it stays normalized.]
Batch normalization – the algorithm (sketched in code below)
o μ_B ← (1/m) Σ_{i=1}^m x_i  [compute the mini-batch mean]
o σ²_B ← (1/m) Σ_{i=1}^m (x_i − μ_B)²  [compute the mini-batch variance]
o x̂_i ← (x_i − μ_B) / √(σ²_B + ε)  [normalize the input]
o y_i ← γ x̂_i + β  [scale and shift the input; γ and β are trainable parameters]
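A minimal NumPy sketch of the training-time forward pass above (per-feature statistics over the mini-batch); at test time one would instead plug in running averages of the mean and variance collected during training, as in [Ioffe2015].

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: trainable per-feature scale and shift."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

x = np.random.randn(64, 100) * 3.0 + 7.0     # toy layer inputs, far from N(0, 1)
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                     # roughly 0 and 1
```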
Batch normalization – benefits
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise: possibly exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near-optimal "regime"
o Better model regularization
◦ Neuron activations are not deterministic; they depend on the batch
◦ The model cannot be overconfident
Regularization
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Regularization
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
◦ θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + λΩ(θ)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
ℓ2-regularization
o The most important (or most popular) regularization
◦ θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L})) + (λ/2) Σ_l ‖θ_l‖²
o The ℓ2-regularization term passes inside the gradient descent update rule
◦ θ^(t+1) = θ^(t) − η_t (∇_θℒ + λθ_l) ⟹ θ^(t+1) = (1 − λη_t) θ^(t) − η_t ∇_θℒ
◦ "Weight decay", because the weights get smaller at every step
o λ is usually about 10^−1, 10^−2
ℓ1-regularization
o ℓ1-regularization is one of the most important techniques
◦ θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L})) + (λ/2) Σ_l |θ_l|
o The ℓ1 term also passes inside the gradient descent update rule, via the sign of the weights
◦ θ^(t+1) = θ^(t) − λη_t θ^(t)/|θ^(t)| − η_t ∇_θℒ
o ℓ1-regularization → sparse weights (a sketch of both regularized updates follows below)
◦ λ ↗ → more weights become 0
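A minimal sketch of how both penalties enter a gradient step (NumPy; `grad` stands for the data term ∇_θℒ and is a placeholder):

```python
import numpy as np

def regularized_step(theta, grad, lr, l2=0.0, l1=0.0):
    """One SGD step with optional l2 (weight decay) and l1 (sparsity) terms added to the gradient."""
    grad = grad + l2 * theta            # gradient of (l2/2) * ||theta||^2
    grad = grad + l1 * np.sign(theta)   # gradient of l1 * |theta| (the sign function)
    return theta - lr * grad

theta = np.array([0.5, -0.2, 1.5])
theta = regularized_step(theta, grad=np.zeros(3), lr=0.1, l2=0.1)   # pure weight decay:
print(theta)                                                        # (1 - lr*l2) * theta
```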
Early stopping
o Another popular technique to tackle overfitting
o Monitor performance on a separate validation set
o Training the network will decrease the training error as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing (see the sketch below)
◦ This quite likely means the network is starting to overfit
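A minimal early-stopping loop; `train_one_epoch` and `validation_loss` are hypothetical stand-ins for your own training and evaluation code.

```python
import copy

def train_one_epoch(params):        # stand-in for one real training epoch (assumption)
    return params

def validation_loss(params):        # stand-in for evaluating on the held-out validation set
    return 1.0

params, max_epochs, patience = {}, 100, 5
best_val, best_params, bad_epochs = float("inf"), None, 0
for epoch in range(max_epochs):
    params = train_one_epoch(params)
    val = validation_loss(params)
    if val < best_val:                                    # validation error still decreasing
        best_val, best_params, bad_epochs = val, copy.deepcopy(params), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # it started increasing -> likely overfitting
            break
```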
Dropout [Srivastava2014]
o During training, set activations randomly to 0
◦ Neurons are sampled at random from a Bernoulli distribution with p = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by p
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No "free-rider" neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
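A sketch of dropout exactly as described above (drop at training time, reweight by p at test time); note that many libraries implement the equivalent "inverted" variant, which divides by p during training so that test time needs no rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, train=True):
    if train:
        mask = rng.random(a.shape) < p     # keep each neuron with probability p (Bernoulli)
        return a * mask                    # dropped activations become 0
    return a * p                           # test time: all neurons used, reweighted by p

h = np.random.randn(4, 8)                  # toy activations of some hidden layer
h_train = dropout(h, p=0.5, train=True)
h_test = dropout(h, p=0.5, train=False)
```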
Dropout
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
[Figure: the original model, and the different thinned sub-networks sampled at epoch 1 and at epoch 2.]
Architectural details
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Sigmoid-like activation functions
o Straightforward sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ e.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: a = h(x) = 1.7159 tanh(2x/3)
o You can add a linear term to avoid flat areas: a = h(x) = tanh(x) + βx
[Figure: tanh(x) vs. tanh(x) + 0.5x.]
RBFs vs "sigmoids"
o RBF: a = h(x) = Σ_j u_j exp(−β_j ‖x − w_j‖²)
o Sigmoid: a = h(x) = σ(x) = 1 / (1 + e^−x)
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train, but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower-dimensional spaces
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
o Activation function: a = h(x) = max(0, x)
o Gradient wrt the input: ∂a/∂x = 0 if x ≤ 0, 1 if x > 0
o Very popular in computer vision and speech recognition
o Much faster computations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die"; lower learning rates mitigate the problem
ReLU convergence rate
[Figure: training error vs. epochs for a ReLU network and a tanh network; the ReLU network converges considerably faster.]
Other ReLUs (sketched in code below)
o Soft approximation (softplus): a = h(x) = ln(1 + e^x)
o Noisy ReLU: a = h(x) = max(0, x + ε), ε ~ N(0, σ(x))
o Leaky ReLU: a = h(x) = x if x > 0, 0.01x otherwise
o Parametric ReLU: a = h(x) = x if x > 0, βx otherwise (the parameter β is trainable)
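A compact NumPy sketch of these activations; the noisy-ReLU noise scale follows the σ(x) written on the slide (interpreted here as the logistic sigmoid, an assumption).

```python
import numpy as np

def relu(x):            return np.maximum(0.0, x)
def softplus(x):        return np.log1p(np.exp(x))            # ln(1 + e^x)
def leaky_relu(x):      return np.where(x > 0, x, 0.01 * x)
def prelu(x, beta):     return np.where(x > 0, x, beta * x)   # beta is trainable

def noisy_relu(x, rng):
    sigma = 1.0 / (1.0 + np.exp(-x))                          # sigma(x) as the noise scale
    return np.maximum(0.0, x + rng.normal(0.0, sigma))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), leaky_relu(x), prelu(x, beta=0.2), sep="\n")
```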
Architectural hyper-parameters
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
Number of neurons, number of hidden layers
o These hyper-parameters are dataset-dependent
o Tip: start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important; use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
[Figure: generalization as a function of model complexity (number of neurons).]
Learning rate
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Learning rate
o The right learning rate η_t is very important for fast convergence
◦ Too large → gradients overshoot and bounce
◦ Too small → too small gradient steps → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ The learning rate of shared weights should be proportional to the square root of the number of connections sharing the weight
o Adaptive learning rates are also possible, based on the errors observed [Sompolinsky1995]
Learning rate schedules
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate by a constant factor every T epochs
o Inverse decay: η_t = η_0 / (1 + εt)
o Exponential decay: η_t = η_0 e^(−εt)
o Step decay is often preferred (sketched in code below)
◦ Simple, intuitive, works well, and only a single extra hyper-parameter T (T = 2, 10)
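A small sketch of these schedules as plain functions of the epoch; η_0, ε and the step-decay factor are illustrative values, not prescribed by the slides.

```python
import numpy as np

def step_decay(lr0, epoch, T=10, factor=0.5):
    """Multiply the learning rate by `factor` every T epochs."""
    return lr0 * factor ** (epoch // T)

def inverse_decay(lr0, t, eps=0.01):
    return lr0 / (1.0 + eps * t)              # eta_t = eta_0 / (1 + eps * t)

def exponential_decay(lr0, t, eps=0.01):
    return lr0 * np.exp(-eps * t)             # eta_t = eta_0 * exp(-eps * t)

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch), inverse_decay(0.1, epoch), exponential_decay(0.1, epoch))
```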
Learning rate in practice
o Try several log-spaced values 10^−1, 10^−2, 10^−3, … on a smaller set
◦ Then narrow it down from there, around where you get the lowest error
o You can decrease the learning rate every 10 (or some other number of) full training-set epochs
◦ Although this highly depends on your data
Weight initialization
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Weight initialization
o There are a few contradictory requirements
o Weights need to be small enough
◦ Around the origin (0) for symmetric functions (tanh, sigmoid)
◦ When training starts, this stimulates the activation functions near their linear regime
◦ Larger gradients → faster training
o Weights need to be large enough
◦ Otherwise the signal is too weak for any serious learning
[Figure: tanh and sigmoid curves; around 0 they are in their linear regime, where gradients are large.]
Weight initialization
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons should operate in their full capacity
◦ Question: Why similar input/output variance? Answer: Because the output of one module is the input to another
o Good practice: initialize the weights to be asymmetric
◦ Don't give the same value to all weights (like all 0)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, the initialization depends on
◦ the non-linearities
◦ the data normalization
One way of initializing sigmoid-like neurons
o For tanh, initialize the weights uniformly from [−√(6/(d_{l−1}+d_l)), +√(6/(d_{l−1}+d_l))]
◦ d_{l−1} is the number of input variables to the tanh layer and d_l is the number of output variables
o For a sigmoid, use [−4·√(6/(d_{l−1}+d_l)), +4·√(6/(d_{l−1}+d_l))]
Xavier initialization [Glorot2010]
o For a = θx the variance is Var(a) = E[x]² Var(θ) + E[θ]² Var(x) + Var(x) Var(θ)
o Since E[x] = E[θ] = 0: Var(a) = Var(x) Var(θ) ≈ d · Var(x_i) Var(θ_i)
o For Var(a) = Var(x) ⇒ Var(θ_i) = 1/d
o Draw random weights from θ ~ N(0, 1/d), where d is the number of neurons in the input
[He2015] initialization for ReLUs
o Unlike sigmoids, ReLUs ground the linear activations to 0 half of the time
o Double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from w ~ N(0, 2/d), where d is the number of neurons in the input (see the sketch below)
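A small sketch of the Gaussian Xavier and He initializations, plus the uniform tanh range from the previous slide; laying the weight matrices out as (outputs, inputs) is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(d_in, d_out):
    """Var(theta_i) = 1/d_in, theta ~ N(0, 1/d_in)  [Glorot2010]."""
    return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))

def he_init(d_in, d_out):
    """Doubled variance for ReLUs, w ~ N(0, 2/d_in)  [He2015]."""
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

def tanh_uniform_init(d_in, d_out):
    """Uniform in +-sqrt(6/(d_in + d_out)) for tanh layers."""
    limit = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-limit, limit, size=(d_out, d_in))

W = he_init(784, 256)     # e.g. a ReLU layer with 784 inputs and 256 outputs
```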
Loss functions
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Multi-class classification ("Is it a cat? Is it a horse? …")
o Each sample contains only one class
◦ There is only one correct answer per sample
o Negative log-likelihood (cross-entropy) + softmax
◦ ℒ(θ; x, y) = −Σ_{c=1}^C y_c log a_L^c, over all classes c = 1, …, C
◦ Hierarchical softmax when C is very large
o Hinge loss (aka SVM loss)
◦ ℒ(θ; x, y) = Σ_{c≠y} max(0, a_L^c − a_L^y + 1)
o Squared hinge loss
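A minimal NumPy sketch of the two losses for a single sample with class scores a_L and an integer label y:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, y):
    """Negative log-likelihood of the correct class y (an integer index)."""
    return -np.log(softmax(scores)[y])

def hinge_loss(scores, y, margin=1.0):
    """Sum over c != y of max(0, a_c - a_y + margin)."""
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0                         # exclude the correct class
    return m.sum()

scores = np.array([2.0, -1.0, 0.5])    # a_L for C = 3 classes
print(cross_entropy(scores, y=0), hinge_loss(scores, y=0))
```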
Multi-class, multi-label classification
o Each sample can have many correct answers
o Hinge loss and the like
◦ Sigmoid outputs would also work
o Each output neuron is independent
◦ "Does this contain a car, yes or no?"
◦ "Does this contain a person, yes or no?"
◦ "Does this contain a motorbike, yes or no?"
◦ "Does this contain a horse, yes or no?"
o Instead of "Is this a car, motorbike or person?"
◦ p(car|x) = 0.55, p(m/bike|x) = 0.25, p(person|x) = 0.15, p(horse|x) = 0.05
◦ p(car|x) + p(m/bike|x) + p(person|x) + p(horse|x) = 1.0
Regression
o The good old Euclidean loss: ℒ(θ; x, y) = ½ ‖y − a_L‖₂²
o Or an RBF on top of the Euclidean loss: ℒ(θ; x, y) = Σ_j u_j exp(−β_j (y − a_L)²)
o Or the ℓ1 distance: ℒ(θ; x, y) = Σ_j |y_j − a_L^j|
Even better optimizations
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Momentum
o Don't switch gradient directions all the time
o Maintain "momentum" from the previous parameter updates
o More robust gradients and learning → faster convergence
o Nice "physics"-based interpretation
◦ Instead of updating the position of the "ball" directly, we update the velocity, which then updates the position
o Update rule: u_θ^(t+1) = γ u_θ^(t) − η_t ∇_θℒ, then θ^(t+1) = θ^(t) + u_θ^(t+1)
[Figure: on a loss surface, the plain gradient direction vs. the gradient + momentum direction.]
Nesterov Momentum [Sutskever2013]
o Use the future gradient instead of the current gradient
◦ Look-ahead gradient from the next step: evaluate ∇_θℒ at θ^(t) + γ u_θ^(t)
◦ u_θ^(t+1) = γ u_θ^(t) − η_t ∇_θℒ(θ^(t) + γ u_θ^(t)), then θ^(t+1) = θ^(t) + u_θ^(t+1)
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
[Figure: momentum vs. Nesterov momentum; the gradient is computed after the momentum step instead of before it.]
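A minimal sketch of both updates; `grad_fn` is a placeholder returning ∇_θℒ, and the toy loss 0.5‖θ‖² is used only to show the loop running.

```python
import numpy as np

def momentum_step(theta, u, grad_fn, lr=0.1, gamma=0.9, nesterov=False):
    """One (Nesterov) momentum update: the velocity u is updated, and u updates the position."""
    point = theta + gamma * u if nesterov else theta   # look-ahead gradient for Nesterov
    u = gamma * u - lr * grad_fn(point)
    return theta + u, u

theta, u = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, u = momentum_step(theta, u, grad_fn=lambda t: t, nesterov=True)   # grad of 0.5*||t||^2
print(theta)    # approaches the minimum at 0
```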
Second-order optimization
o Normally all weights are updated with the same "aggressiveness"
◦ Often some parameters could enjoy more "teaching"
◦ While others are already about there
o Adapt the learning per parameter: θ^(t+1) = θ^(t) − H_ℒ^(−1) η_t ∇_θℒ
o H_ℒ is the Hessian matrix of ℒ, its second-order derivatives: (H_ℒ)_ij = ∂²ℒ / (∂θ_i ∂θ_j)
Second-order optimization methods in practice
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian, e.g. with the L-BFGS algorithm
◦ It keeps a memory of gradients to approximate the inverse Hessian
o L-BFGS works alright with Gradient Descent. What about SGD?
o In practice, SGD with some good momentum works just fine
Other per-parameter adaptive optimizations
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
Adagrad [Duchi2011]
o Schedule
◦ m_j = Σ_τ (∇_θℒ_j)²  (accumulated squared gradients)
◦ θ^(t+1) = θ^(t) − η_t ∇_θℒ / (√m + ε)
◦ ε is a small number to avoid division by 0
◦ The effective updates become gradually smaller and smaller
RMSprop
o Schedule: a moving average of the squared gradients, with decay hyper-parameter α
◦ m_j^(t) = α m_j^(t−1) + (1 − α) (∇_θℒ_j^(t))²
◦ θ^(t+1) = θ^(t) − η_t ∇_θℒ / (√m + ε)
o Compared to Adagrad, the accumulator is a moving average instead of a growing sum
o Large gradients, e.g. a too "noisy" loss surface
◦ The updates are tamed
o Small gradients, e.g. stuck in a flat loss-surface ravine
◦ The updates become more aggressive
o Square-rooting boosts small values while it suppresses large values
Adam [Kingma2014]
o One of the most popular learning algorithms
◦ m^(t) = β_1 m^(t−1) + (1 − β_1) ∇_θℒ  (moving average of the gradient)
◦ v^(t) = β_2 v^(t−1) + (1 − β_2) (∇_θℒ)²  (moving average of the squared gradient)
◦ θ^(t+1) = θ^(t) − η_t m^(t) / (√(v^(t)) + ε)
o Similar to RMSprop, but with momentum (the full algorithm also bias-corrects m and v)
o Recommended values: β_1 = 0.9, β_2 = 0.999, ε = 10^−8
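A minimal NumPy sketch of the Adam update with the paper's bias correction; the toy quadratic loss is there only to exercise the loop.

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries the moment estimates m, v and the step counter t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad           # moving average of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2      # moving average of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])                 # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

theta = np.array([5.0, -3.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(200):
    theta = adam_step(theta, grad=theta, state=state)   # gradient of 0.5*||theta||^2 is theta
print(theta)                                            # moves towards the minimum at 0
```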
Visual overview
Picture credit: Alec Radford
Learning – not computing – the gradients
o Learning to learn by gradient descent by gradient descent [Andrychowicz2016]
o θ^(t+1) = θ^(t) + g_t(∇_θℒ, φ)
o g_t is an "optimizer" with its own parameters φ
◦ Implemented as a recurrent network
Good practice
o Preprocess the data to at least have 0 mean
o Initialize the weights based on the activation functions
◦ For ReLU, use Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
Babysitting Deep Nets
1. The Neural Network: a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
2. Learning by minimizing empirical error: θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L}))
3. Optimizing with Gradient Descent-based methods: θ^(t+1) = θ^(t) − η_t ∇_θℒ
Babysitting Deep Nets
o Always check your gradients if they are not computed automatically (see the sketch below)
o Check that in the first round you get a random-chance loss
o Check the network with a few samples
◦ Turn off regularization. You should predictably overfit and reach a loss of 0
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the curves between the training and validation sets
◦ There should be a gap, but not a too large one
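A small numerical gradient check (central differences) that you can run against any analytically computed gradient; the toy loss here is 0.5‖θ‖², whose gradient is θ itself.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        grad[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return grad

theta = np.random.randn(5)
analytic = theta                                                   # d/dtheta of 0.5*||theta||^2
numeric = numerical_gradient(lambda t: 0.5 * np.sum(t ** 2), theta)
rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
print(rel_err.max())    # should be tiny (e.g. < 1e-6) if the gradients match
```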
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Reading material & references
o http://www.deeplearningbook.org/
◦ Part II: Chapters 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002]
Next lecture
o What are the Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?
More Related Content

Similar to lecture3.pdf

NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxDrKBManwade
ย 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxssuserd23711
ย 
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Giacomo Boracchi
ย 
Deep learning architectures
Deep learning architecturesDeep learning architectures
Deep learning architecturesJoe li
ย 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning ratesMLconf
ย 
Learning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxLearning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxamrita chaturvedi
ย 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balรกzs Hidasi
ย 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
ย 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowYi-Fan Liou
ย 
Cheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksCheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksSteve Nouri
ย 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsSungchul Kim
ย 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierNeha Kulkarni
ย 
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Giacomo Boracchi
ย 
Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersMadhumita Tamhane
ย 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdfShiwani Gupta
ย 
Lecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsLecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsMostafa El-Hosseini
ย 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfssuseradaf5f
ย 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
ย 

Similar to lecture3.pdf (20)

NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptx
ย 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptx
ย 
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
Learning In Nonstationary Environments: Perspectives And Applications. Part2:...
ย 
Deep learning architectures
Deep learning architecturesDeep learning architectures
Deep learning architectures
ย 
Competition winning learning rates
Competition winning learning ratesCompetition winning learning rates
Competition winning learning rates
ย 
Learning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptxLearning algorithm including gradient descent.pptx
Learning algorithm including gradient descent.pptx
ย 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
ย 
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
ย 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
ย 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with Tensorflow
ย 
Cheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricksCheatsheet deep-learning-tips-tricks
Cheatsheet deep-learning-tips-tricks
ย 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
ย 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
ย 
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
Learning In Nonstationary Environments: Perspectives And Applications. Part1:...
ย 
Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parameters
ย 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdf
ย 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
ย 
Lecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_modelsLecture 19 chapter_4_regularized_linear_models
Lecture 19 chapter_4_regularized_linear_models
ย 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdf
ย 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
ย 

More from Tigabu Yaya

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
ย 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdfTigabu Yaya
ย 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfTigabu Yaya
ย 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfTigabu Yaya
ย 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfTigabu Yaya
ย 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
ย 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
ย 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdfTigabu Yaya
ย 
Regression.pptx
Regression.pptxRegression.pptx
Regression.pptxTigabu Yaya
ย 
lecture6.pdf
lecture6.pdflecture6.pdf
lecture6.pdfTigabu Yaya
ย 
lecture5.pdf
lecture5.pdflecture5.pdf
lecture5.pdfTigabu Yaya
ย 
Chap 4.ppt
Chap 4.pptChap 4.ppt
Chap 4.pptTigabu Yaya
ย 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.pptTigabu Yaya
ย 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.pptTigabu Yaya
ย 
nnfl.0620.pptx
nnfl.0620.pptxnnfl.0620.pptx
nnfl.0620.pptxTigabu Yaya
ย 
L20.ppt
L20.pptL20.ppt
L20.pptTigabu Yaya
ย 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfTigabu Yaya
ย 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdfTigabu Yaya
ย 
nabil-201-chap-03.ppt
nabil-201-chap-03.pptnabil-201-chap-03.ppt
nabil-201-chap-03.pptTigabu Yaya
ย 
Mat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptMat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptTigabu Yaya
ย 

More from Tigabu Yaya (20)

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
ย 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf
ย 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdf
ย 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdf
ย 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdf
ย 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
ย 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
ย 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf
ย 
Regression.pptx
Regression.pptxRegression.pptx
Regression.pptx
ย 
lecture6.pdf
lecture6.pdflecture6.pdf
lecture6.pdf
ย 
lecture5.pdf
lecture5.pdflecture5.pdf
lecture5.pdf
ย 
Chap 4.ppt
Chap 4.pptChap 4.ppt
Chap 4.ppt
ย 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.ppt
ย 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.ppt
ย 
nnfl.0620.pptx
nnfl.0620.pptxnnfl.0620.pptx
nnfl.0620.pptx
ย 
L20.ppt
L20.pptL20.ppt
L20.ppt
ย 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdf
ย 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdf
ย 
nabil-201-chap-03.ppt
nabil-201-chap-03.pptnabil-201-chap-03.ppt
nabil-201-chap-03.ppt
ย 
Mat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.pptMat 223_Ch3-Determinants.ppt
Mat 223_Ch3-Determinants.ppt
ย 

Recently uploaded

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
ย 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
ย 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
ย 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
ย 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
ย 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
ย 
โ€œOh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
โ€œOh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...โ€œOh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
โ€œOh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
ย 
o Variance of gradients increases when batch size decreases
Stochastic Gradient Descend (SGD)
θ(t+1) = θ(t) − (η_t / |B_j|) Σ_{i ∈ B_j} ∇_θ ℒ_i,   where B_j = sample(D)
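To make the update concrete, here is a minimal NumPy sketch of the mini-batch SGD loop; the gradient function grad_fn(theta, X_batch, y_batch) is an illustrative placeholder for whatever backpropagation returns, not something defined in the slides:

import numpy as np

def sgd(theta, X, y, grad_fn, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch SGD: theta <- theta - lr/|B| * sum_i grad L_i."""
    n = X.shape[0]
    for epoch in range(epochs):
        idx = np.random.permutation(n)             # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # B_j = sample(D)
            g = grad_fn(theta, X[batch], y[batch])  # gradient summed over B_j
            theta = theta - lr / len(batch) * g     # averaged gradient step
    return theta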
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10
SGD is often better
[Figure: loss surface showing the current solution, the full GD gradient and the new GD solution, a noisy SGD gradient, and the best GD and best SGD solutions]
o No guarantee that this is what is always going to happen
o But the noisy SGD gradients can sometimes help in escaping local optima
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11
SGD is often better
o (A bit) Noisy gradients act as regularization
o Gradient Descend → complete gradients
o Complete gradients optimally fit the (arbitrary) data we have, not the distribution that generates them
◦ As if all training samples were the "absolute representative" of the input distribution
◦ As if test data will be no different than training data
◦ Suitable for traditional optimization problems: "find the optimal route"
◦ But for ML we cannot make this assumption → test data are always different
o Stochastic gradients → sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12-14
SGD is faster
[Figures: a dataset and its gradient, then the same dataset replicated 10x. "What is our gradient now?" The full-batch gradient is unchanged, yet costs 10x more to compute]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15
SGD is faster
o Of course, in real situations data do not replicate
o However, after a sizeable amount of data there are clusters of data that are similar
o Hence, the gradient is approximately alright
o Approximately alright is great; in many cases it is actually even better
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16
SGD for dynamically changed datasets
o Often datasets are not "rigid"
o Imagine Instagram
◦ Let's assume 1 million new images are uploaded per week and we want to build a "cool picture" classifier
◦ Should "cool pictures" from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are averaged over the many more "past" samples
◦ The past "over-dominates"
o A properly implemented SGD can track changes much better and give better models
◦ [LeCun2002]
[Figure: example pictures popular today, popular in 2014 and popular in 2010]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17
Shuffling examples
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer "hard examples"
◦ Don't overdo it though, beware of outliers
o In practice, split your dataset into mini-batches
◦ Each mini-batch should be as class-divergent and rich as possible
◦ At each new epoch, to be safe, create new batches from newly, randomly shuffled examples
[Figure: the same dataset shuffled differently at epoch t and at epoch t+1]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18
Advantages of Gradient Descend batch learning
o Conditions of convergence are well understood
o Acceleration techniques can be applied
◦ Second order (Hessian based) optimizations are possible
◦ Measuring not only gradients, but also curvatures of the loss surface
o Simpler theoretical analysis on weight dynamics and convergence rates
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19
In practice
o SGD is preferred over Gradient Descend
o Training is orders of magnitude faster
◦ On real datasets Gradient Descend is not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ Hyper-parameter, trial & error
◦ Usually between 32-256 samples
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20
Data preprocessing & normalization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21
Data pre-processing
o Center data to be roughly 0
◦ Activation functions are usually "centered" around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias on the gradient direction → might slow down learning
[Figure: ReLU, tanh(x) and σ(x) activations, annotated with whether they are centered around 0]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22
Data pre-processing
o Scale input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
o Example: x = [x1, x2, x3]^T, θ = [θ1, θ2, θ3]^T, a = tanh(θ^T x)
◦ If x1, x2, x3 have much different covariances, the generated gradients dℒ/dθ_i are also much different per dimension
◦ The gradient update θ(t+1) = θ(t) − η_t [dℒ/dθ1, dℒ/dθ2, dℒ/dθ3]^T becomes harder
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23
Data pre-processing
o Input variables should be as decorrelated as possible
◦ Input variables are "more independent"
◦ The network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case [CAUTION]
◦ Extreme correlation (linear dependency) might cause problems
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24
Normalization: N(μ, σ²) → N(0, 1)
o Input variables follow a Gaussian distribution (roughly)
o In practice:
◦ Compute the mean and standard deviation from the training set
◦ Then subtract the mean from the training samples
◦ Then divide the result by the standard deviation
[Figure: the data x, after mean subtraction (x − μ), and after dividing by the standard deviation ((x − μ)/σ)]
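A minimal sketch of this recipe; the function name and the small constant added to the standard deviation are illustrative choices:

import numpy as np

def standardize(X_train, X_test):
    """Normalize features to roughly N(0, 1) using training-set statistics only."""
    mu = X_train.mean(axis=0)            # per-dimension mean from the training set
    sigma = X_train.std(axis=0) + 1e-8   # per-dimension std (avoid division by 0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma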
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25
Making things faster
o Instead of "per-dimension" → all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one (μ, σ²) instead of as many as the input variables
◦ Or the per color channel pixel average/variance: (μ_red, σ²_red), (μ_green, σ²_green), (μ_blue, σ²_blue)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26
Even simpler: Centering the input
o When input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same ranges
o Just make sure images have mean 0 (μ = 0)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27
PCA Whitening
o If C is the covariance matrix of your dataset, compute eigenvalues and eigenvectors with SVD: U, Σ, V^T = svd(C)
o Decorrelate (PCA) the dataset by X_rot = U^T X
◦ Keep a subset of eigenvectors U' = [u1, …, u_q] to reduce the data dimensions
o Scale by the square root of the eigenvalues to whiten the data: X_wht = X_rot / √Σ
o Not used much with Convolutional Neural Nets
◦ The zero mean normalization is more important
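A minimal NumPy sketch of the rotation and whitening steps above, assuming the data matrix stores one sample per row (the slide uses the column convention X_rot = U^T X):

import numpy as np

def pca_whiten(X, eps=1e-5):
    """PCA-rotate and whiten a data matrix X of shape (n_samples, d)."""
    X = X - X.mean(axis=0)               # zero-center first
    C = np.cov(X, rowvar=False)          # d x d covariance matrix
    U, S, _ = np.linalg.svd(C)           # eigenvectors U, eigenvalues S
    X_rot = X @ U                        # decorrelated (PCA-ed) data
    X_wht = X_rot / np.sqrt(S + eps)     # divide by sqrt of the eigenvalues
    return X_rot, X_wht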
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28
Example
[Figure: original, decorrelated and whitened data. Images taken from A. Karpathy's course website: http://cs231n.github.io/neural-networks-2/]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29
Data augmentation [Krizhevsky2012]
[Figure: an original image and augmented versions: flip, random crop, contrast, tint]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30
Batch normalization [Ioffe2015]
o Weights change → the distribution of the layer inputs changes per round
◦ (Internal) covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize x_l to N(0, 1) and rescale
[Figure: the input distribution of layer l drifting at steps (t), (t+0.5), (t+1) during backpropagation, with and without batch normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31
Batch normalization - Intuitively
[Figure: the same layer-l input distributions at (t), (t+0.5), (t+1), kept stable by batch normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32
Batch normalization – The algorithm
o μ_B ← (1/m) Σ_{i=1..m} x_i   [compute the mini-batch mean]
o σ²_B ← (1/m) Σ_{i=1..m} (x_i − μ_B)²   [compute the mini-batch variance]
o x̂_i ← (x_i − μ_B) / √(σ²_B + ε)   [normalize the input]
o y_i ← γ x̂_i + β   [scale and shift the input; γ and β are trainable parameters]
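A minimal NumPy sketch of the forward pass of this algorithm (training-time statistics only; the running averages used at test time are omitted):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (batch, features): normalize, then scale and shift."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized input
    return gamma * x_hat + beta              # y = gamma * x_hat + beta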
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33
Batch normalization - Benefits
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise maybe exploding or vanishing gradients, or getting stuck at local minima
o Neurons get activated in a near optimal "regime"
o Better model regularization
◦ Neuron activations are not deterministic, they depend on the batch
◦ The model cannot be overconfident
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34
Regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35
Regularization
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually, the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L)) + λΩ(θ)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36
ℓ2-regularization
o The most important (or most popular) regularization
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l ||θ_l||²
o The ℓ2-regularization can pass inside the gradient descend update rule
θ(t+1) = θ(t) − η_t (∇_θℒ + λθ_l)  ⟹  θ(t+1) = (1 − λη_t) θ(t) − η_t ∇_θℒ
◦ Called "weight decay", because the weights get smaller
o λ is usually about 10^−1, 10^−2
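A small sketch showing that the two forms of the update above are the same step (function name and default values are illustrative):

import numpy as np

def l2_sgd_step(theta, grad, lr=0.1, lam=0.01):
    """One SGD step with l2-regularization, written in both equivalent forms."""
    step_a = theta - lr * (grad + lam * theta)    # gradient of loss + (lam/2)||theta||^2
    step_b = (1 - lr * lam) * theta - lr * grad   # the "weight decay" form
    assert np.allclose(step_a, step_b)            # both forms give the same update
    return step_b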
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37
ℓ1-regularization
o ℓ1-regularization is one of the most important techniques
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l |θ_l|
o Also ℓ1-regularization passes inside the gradient descend update rule
θ(t+1) = θ(t) − λη_t sign(θ(t)) − η_t ∇_θℒ,   where sign(θ(t)) = θ(t)/|θ(t)|
o ℓ1-regularization → sparse weights
◦ λ ↗ → more weights become 0
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38
Early stopping
o To tackle overfitting another popular technique is early stopping
o Monitor the performance on a separate validation set
o Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing
◦ This quite likely means the network is starting to overfit
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39
Dropout [Srivastava2014]
o During training, set activations randomly to 0
◦ Neurons are sampled at random from a Bernoulli distribution with p = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by p
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No "free-rider" neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
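A minimal sketch of this scheme (dropping at training time, reweighting by p at test time); note that the commonly used "inverted dropout" variant instead divides by p during training so no rescaling is needed at test time:

import numpy as np

def dropout_forward(a, p=0.5, train=True):
    """Dropout on an activation array a with keep probability p."""
    if train:
        mask = np.random.rand(*a.shape) < p   # keep each neuron with probability p
        return a * mask                       # dropped neurons output 0
    return a * p                              # test time: reweight activations by p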
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40-44
Dropout
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
[Figures: the original model and the different sub-networks sampled at epoch 1 and epoch 2]
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45
Architectural details
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46
Sigmoid-like activation functions
o Straightforward sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ E.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: a = h(x) = 1.7159 · tanh(2x/3)
o You can add a linear term to avoid flat areas: a = h(x) = tanh(x) + βx
[Figure: tanh(x) compared to tanh(x) + 0.5x]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47
RBFs vs "Sigmoids"
o RBF: a = h(x) = Σ_j u_j exp(−β_j ||x − w_j||²)
o Sigmoid: a = h(x) = σ(x) = 1 / (1 + e^−x)
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train, but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower dimensional spaces
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
o Activation function: a = h(x) = max(0, x)
o Gradient wrt the input: ∂a/∂x = 0 if x ≤ 0, and 1 if x > 0
o Very popular in computer vision and speech recognition
o Much faster computations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die"; lower learning rates mitigate the problem
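A minimal sketch of a ReLU module with the forward and backward passes implied by the gradient above (the class layout is illustrative):

import numpy as np

class ReLU:
    """ReLU module: a = max(0, x); da/dx = 1 for x > 0, else 0."""
    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return np.maximum(0, x)

    def backward(self, grad_out):
        return grad_out * (self.x > 0)  # pass the gradient only where x > 0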
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49
ReLU convergence rate
[Figure: training error vs. epochs; the ReLU network converges faster than the tanh network]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50
Other ReLUs
o Soft approximation (softplus): a = h(x) = ln(1 + e^x)
o Noisy ReLU: a = h(x) = max(0, x + ε), ε ~ N(0, σ(x))
o Leaky ReLU: a = h(x) = x if x > 0, 0.01x otherwise
o Parametric ReLU: a = h(x) = x if x > 0, βx otherwise (parameter β is trainable)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51
Architectural hyper-parameters
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52
Number of neurons, number of hidden layers
o Dataset dependent hyper-parameters
o Tip: Start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important, use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
[Figure: generalization vs. model complexity (number of neurons)]
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53
Learning rate
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54
Learning rate
o The right learning rate η_t is very important for fast convergence
◦ Too strong → gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ Learning rate of (shared) weights proportional to the square root of the number of shared weight connections
o Adaptive learning rates are also possible, based on the errors observed
◦ [Sompolinsky1995]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55
Learning rate schedules
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate every T epochs
o Inverse decay: η_t = η_0 / (1 + εt)
o Exponential decay: η_t = η_0 e^(−εt)
o Often step decay is preferred
◦ Simple, intuitive, works well, and only a single extra hyper-parameter T (T = 2, 10)
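A minimal sketch of these schedules; the step-decay factor of 0.1 is an illustrative choice, since the slide only fixes the period T:

import numpy as np

def lr_schedule(eta0, t, kind="step", T=10, eps=0.01):
    """Return the learning rate for epoch t under a few common schedules."""
    if kind == "constant":
        return eta0
    if kind == "step":                       # drop by a factor every T epochs
        return eta0 * (0.1 ** (t // T))
    if kind == "inverse":                    # eta_t = eta0 / (1 + eps * t)
        return eta0 / (1 + eps * t)
    if kind == "exponential":                # eta_t = eta0 * exp(-eps * t)
        return eta0 * np.exp(-eps * t)
    raise ValueError(kind)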
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56
Learning rate in practice
o Try several log-spaced values 10^−1, 10^−2, 10^−3, … on a smaller set
◦ Then you can narrow it down from there, around where you get the lowest error
o You can decrease the learning rate every 10 (or some other value) full training set epochs
◦ Although this highly depends on your data
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57
Weight initialization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58
Weight initialization
o There are a few contradictory requirements
o Weights need to be small enough
◦ Around the origin (0) for symmetric functions (tanh, sigmoid)
◦ When training starts, it is better to stimulate the activation functions near their linear regime
◦ Larger gradients → faster training
o Weights need to be large enough
◦ Otherwise the signal is too weak for any serious learning
[Figure: the linear regime of a sigmoid-like activation, where the gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 59-61
Weight initialization
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons should operate in their full capacity
o Question: Why similar input/output variance?
◦ Answer: Because the output of one module is the input to another
o Good practice: initialize weights to be asymmetric
◦ Don't give the same value to all weights (like all 0)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ non-linearities
◦ data normalization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62
One way of initializing sigmoid-like neurons
o For tanh, initialize the weights uniformly from [−√(6/(d_(l−1) + d_l)), +√(6/(d_(l−1) + d_l))]
◦ d_(l−1) is the number of input variables to the tanh layer and d_l is the number of output variables
o For a sigmoid, use [−4·√(6/(d_(l−1) + d_l)), +4·√(6/(d_(l−1) + d_l))]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63
Xavier initialization [Glorot2010]
o For a = θx the variance is Var(a) = E[x]² Var(θ) + E[θ]² Var(x) + Var(x) Var(θ)
o Since E[x] = E[θ] = 0:
Var(a) = Var(x) Var(θ) ≈ d · Var(x_i) Var(θ_i)
o For Var(a) = Var(x) ⟹ Var(θ_i) = 1/d
o Draw random weights from θ ~ N(0, 1/d), where d is the number of neurons in the input
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64
[He2015] initialization for ReLUs
o Unlike sigmoids, ReLUs ground the linear activations to 0 half of the time
o Double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from w ~ N(0, 2/d), where d is the number of neurons in the input
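A minimal sketch of the two initializations above (the uniform range from slide 62 and the He normal from this slide); the function names are illustrative:

import numpy as np

def xavier_uniform(d_in, d_out):
    """Glorot/Xavier uniform initialization for tanh-like layers."""
    limit = np.sqrt(6.0 / (d_in + d_out))
    return np.random.uniform(-limit, limit, size=(d_in, d_out))

def he_normal(d_in, d_out):
    """He initialization for ReLU layers: Var(w) = 2 / d_in."""
    return np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)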
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65
Loss functions
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66
Multi-class classification ("Is it a cat? Is it a horse? …")
o Our samples contain only one class
◦ There is only one correct answer per sample
o Negative log-likelihood (cross entropy) + softmax
ℒ(θ; x, y) = − Σ_{c=1..C} y_c log a_L^c, for all classes c = 1, …, C
o Hierarchical softmax when C is very large
o Hinge loss (aka SVM loss)
ℒ(θ; x, y) = Σ_{c=1..C, c≠y} max(0, a_L^c − a_L^y + 1)
o Squared hinge loss
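A minimal NumPy sketch of the softmax + negative log-likelihood loss above for integer class labels (the max-subtraction is a standard numerical-stability trick, not shown on the slide):

import numpy as np

def softmax_cross_entropy(logits, y):
    """Mean negative log-likelihood of the correct class after a softmax.

    logits: (batch, C) raw scores; y: (batch,) integer class labels.
    """
    z = logits - logits.max(axis=1, keepdims=True)                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))    # log softmax
    return -log_probs[np.arange(len(y)), y].mean()                  # -log p(correct class)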
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67
Multi-class, multi-label classification
o Each sample can have many correct answers
o Hinge loss and the likes
◦ Sigmoids would also work
o Each output neuron is independent
◦ "Does this contain a car, yes or no?"
◦ "Does this contain a person, yes or no?"
◦ "Does this contain a motorbike, yes or no?"
◦ "Does this contain a horse, yes or no?"
o Instead of "Is this a car, motorbike or person?"
◦ p(car|x) = 0.55, p(m/bike|x) = 0.25, p(person|x) = 0.15, p(horse|x) = 0.05
◦ p(car|x) + p(m/bike|x) + p(person|x) + p(horse|x) = 1.0
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68
Regression
o The good old Euclidean loss: ℒ(θ; x, y) = (1/2) ||y − a_L||²₂
o Or an RBF on top of the Euclidean loss: ℒ(θ; x, y) = Σ_j u_j exp(−β_j (y − a_L)²)
o Or the ℓ1 distance: ℒ(θ; x, y) = Σ_j |y_j − a_L^j|
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69
Even better optimizations
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70
Momentum
o Don't switch gradients all the time
o Maintain "momentum" from the previous updates
o More robust gradients and learning → faster convergence
o Nice "physics"-based interpretation
◦ Instead of updating the position of the "ball", we update the velocity, which updates the position
u_θ^(t+1) = γ u_θ^(t) − η_t ∇_θℒ
θ(t+1) = θ(t) + u_θ^(t+1)
[Figure: on a loss surface, the plain gradient direction vs. the gradient + momentum direction]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71
Nesterov Momentum [Sutskever2013]
o Use the "future" gradient instead of the current gradient
u_θ^(t+1) = γ u_θ^(t) − η_t ∇_θℒ(θ(t) + γ u_θ^(t))   [look-ahead gradient from the next step]
θ(t+1) = θ(t) + u_θ^(t+1)
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
[Figure: gradient + momentum compared to gradient + Nesterov momentum, where the gradient is evaluated at the look-ahead point]
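A minimal sketch covering both the plain and the Nesterov momentum updates above; grad_fn stands for whatever returns ∇_θℒ at a given point and is an illustrative assumption:

import numpy as np

def momentum_step(theta, u, grad_fn, lr=0.01, gamma=0.9, nesterov=False):
    """One (Nesterov) momentum step; grad_fn(theta) returns the gradient of the loss."""
    if nesterov:
        g = grad_fn(theta + gamma * u)   # look-ahead gradient from the next step
    else:
        g = grad_fn(theta)               # plain momentum: gradient at the current point
    u = gamma * u - lr * g               # update the "velocity"
    return theta + u, u                  # the velocity updates the position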
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72
Second order optimization
o Normally all weights are updated with the same "aggressiveness"
◦ Often some parameters could enjoy more "teaching"
◦ While others are already about there
o Adapt the learning per parameter
θ(t+1) = θ(t) − η_t (H_ℒ)^(−1) ∇_θℒ
o H_ℒ is the Hessian matrix of ℒ, i.e. the second-order derivatives: (H_ℒ)_ij = ∂²ℒ / (∂θ_i ∂θ_j)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73
Second order optimization methods in practice
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian, e.g. with the L-BFGS algorithm
◦ Keeps a memory of gradients to approximate the inverse Hessian
o L-BFGS works alright with Gradient Descend. What about SGD?
o In practice, SGD with some good momentum works just fine
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74
Other per-parameter adaptive optimizations
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75
Adagrad [Duchi2011]
o Schedule
◦ m_j = Σ_τ (∇_θℒ_j)²  ⟹  θ(t+1) = θ(t) − η_t ∇_θℒ / (√m + ε)
◦ ε is a small number to avoid division by 0
◦ Gradients become gradually smaller and smaller
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76
RMSprop
o Schedule
◦ m^(t) = α m^(t−1) + (1 − α) (∇_θℒ)²   [α is the decay hyper-parameter]
◦ θ(t+1) = θ(t) − η_t ∇_θℒ / (√m^(t) + ε)
o A moving average of the squared gradients
◦ Compared to Adagrad
o Large gradients, e.g. a too "noisy" loss surface
◦ Updates are tamed
o Small gradients, e.g. stuck in a flat loss surface ravine
◦ Updates become more aggressive
o Square rooting boosts small values while suppressing large values
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77
Adam [Kingma2014]
o One of the most popular learning algorithms
m^(t+1) = β1 m^(t) + (1 − β1) ∇_θℒ
v^(t+1) = β2 v^(t) + (1 − β2) (∇_θℒ)²
θ(t+1) = θ(t) − η_t m^(t+1) / (√(v^(t+1)) + ε)
o Similar to RMSprop, but with momentum
o Recommended values: β1 = 0.9, β2 = 0.999, ε = 10^−8
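A minimal sketch of one Adam step; the bias-correction terms are not shown on the slide but are part of [Kingma2014], and the state-tuple layout is an illustrative choice. The state is initialized as (np.zeros_like(theta), np.zeros_like(theta), 0):

import numpy as np

def adam_step(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: RMSprop-style scaling plus momentum, with bias correction."""
    m, v, t = state                      # first moment, second moment, step count
    t += 1
    m = b1 * m + (1 - b1) * grad         # moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction, as in [Kingma2014]
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)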
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78
Visual overview
[Figure: trajectories of the different optimizers on example loss surfaces. Picture credit: Alec Radford]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79
Learning – not computing – the gradients
o Learning to learn by gradient descent by gradient descent
◦ [Andrychowicz2016]
o θ(t+1) = θ(t) + g_t(∇_θℒ, φ)
o g_t is an "optimizer" with its own parameters φ
◦ Implemented as a recurrent network
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80
Good practice
o Preprocess the data to at least have 0 mean
o Initialize weights based on the activation functions
◦ For ReLU use Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81
Babysitting Deep Nets
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82
Babysitting Deep Nets
o Always check your gradients if they are not computed automatically
o Check that in the first round you get a random loss
o Check the network with a few samples
◦ Turn off regularization. You should predictably overfit and reach a 0 loss
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the curves between the training and validation sets
◦ There should be a gap, but not too large
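For the first point, a minimal sketch of a centered finite-difference check of an analytic gradient (function and argument names are illustrative):

import numpy as np

def grad_check(loss_fn, theta, analytic_grad, h=1e-5):
    """Compare an analytic gradient against centered finite differences."""
    num_grad = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        old = theta[i]
        theta[i] = old + h; lp = loss_fn(theta)   # L(theta + h)
        theta[i] = old - h; lm = loss_fn(theta)   # L(theta - h)
        theta[i] = old                            # restore the parameter
        num_grad[i] = (lp - lm) / (2 * h)         # centered difference
        it.iternext()
    rel_err = np.abs(num_grad - analytic_grad) / (
        np.abs(num_grad) + np.abs(analytic_grad) + 1e-12)
    return rel_err.max()                          # should be very small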
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84
Reading material & references
o http://www.deeplearningbook.org/
◦ Part II: Chapters 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002]
UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85
Next lecture
o What are Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?