1. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1
Lecture 3: Deeper into Deep Learning and Optimizations
Deep Learning @ UvA
2. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Previous lecture
3. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Lecture overview
4. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4
Deeper into Neural Networks & Deep Neural Nets
5. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5
A Neural/Deep Network in a nutshell
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
6. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6
SGD vs GD
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
7. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7
Backpropagation again
o Step 1. Compute the forward propagation for all layers recursively
a_l = h_l(x_l) and x_{l+1} = a_l
o Step 2. Once done with the forward propagation, follow the reverse path.
◦ Start from the last layer and for each new layer compute the gradients
◦ Cache computations when possible to avoid redundant operations
∂ℓ/∂a_l = (∂a_{l+1}/∂x_{l+1})^T · ∂ℓ/∂a_{l+1}
∂ℓ/∂θ_l = ∂a_l/∂θ_l · (∂ℓ/∂a_l)^T
o Step 3. Use the gradients ∂ℓ/∂θ_l with Stochastic Gradient Descent to train
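To make the three steps concrete, here is a minimal numpy sketch (my own illustration, not the course's reference code) of the same recursion for a tiny two-layer network with a tanh module and a squared-error loss; the shapes and the 0.1 weight scaling are arbitrary choices.

```python
# Minimal modular forward/backward sketch for a 2-layer network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)            # input x_1
y = rng.standard_normal(2)            # regression target
W1 = 0.1 * rng.standard_normal((3, 4))
W2 = 0.1 * rng.standard_normal((2, 3))

# Step 1: forward propagation, layer by layer (a_l = h_l(x_l), x_{l+1} = a_l)
a1 = np.tanh(W1 @ x)                  # h_1
a2 = W2 @ a1                          # h_2 (linear output layer)
loss = 0.5 * np.sum((y - a2) ** 2)

# Step 2: reverse path, propagating dl/da_l and reusing cached forward results
dl_da2 = a2 - y                               # dl/da_2 for the squared-error loss
dl_dW2 = np.outer(dl_da2, a1)                 # dl/dtheta_2 uses the cached a_1
dl_da1 = W2.T @ dl_da2                        # dl/da_1 = (da_2/dx_2)^T · dl/da_2
dl_dW1 = np.outer(dl_da1 * (1 - a1 ** 2), x)  # tanh' = 1 - tanh^2

# Step 3: use the gradients with (stochastic) gradient descent
eta = 0.1
W1 -= eta * dl_dW1
W2 -= eta * dl_dW2
```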
8. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8
o Often loss surfaces are
◦ non-quadratic
◦ highly non-convex
◦ very high-dimensional
o Datasets are typically too large to compute complete gradients
o No real guarantee that
◦ the final solution will be good
◦ we converge fast to the final solution
◦ or that there will be convergence at all
Still, backpropagation can be slow
9. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9
o Stochastically sample "mini-batches" B_j from the dataset D
◦ A mini-batch B_j can contain even just 1 sample
o Much faster than Gradient Descent
o Results are often better
o Also suitable for datasets that change over time
o Variance of gradients increases when the batch size decreases
Stochastic Gradient Descent (SGD)
θ^(t+1) = θ^(t) − (η_t / |B_j|) Σ_{i ∈ B_j} ∇_θ ℓ_i,   where B_j = sample(D)
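A minimal sketch of the mini-batch loop behind this update; the `grad` function, the toy linear-regression data, and the specific learning rate are my own illustrative choices, not values from the slides.

```python
# Mini-batch SGD sketch: sample B_j, average per-sample gradients, take a step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))            # toy dataset D
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(1000)

def grad(theta, Xb, Yb):
    """Average gradient of the squared error 0.5*(x·theta - y)^2 over the batch."""
    return Xb.T @ (Xb @ theta - Yb) / len(Yb)

theta = np.zeros(5)
eta, batch_size = 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]   # B_j = sample(D)
        theta -= eta * grad(theta, X[batch], Y[batch])
```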
10. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10
SGD is often better
[Figure: loss surface showing the current solution, the full GD gradient and the new GD solution, a noisy SGD gradient, and the best GD vs. best SGD solutions]
• No guarantee that this is what is always going to happen.
• But the noisy SGD gradients can sometimes help escape local optima.
11. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11
SGD is often better
o (A bit) Noisy gradients act as regularization
o Gradient Descent → complete gradients
o Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them
◦ All training samples are treated as the "absolute representative" of the input distribution
◦ As if test data will be no different from training data
◦ Suitable for traditional optimization problems: "find the optimal route"
◦ But for ML we cannot make this assumption → test data are always different
o Stochastic gradients → the sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
12. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12
SGD is faster
[Figure: a dataset and its full gradient]
13. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 13
SGD is faster
[Figure: the same dataset replicated 10x. What is our gradient now?]
14. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 14
SGD is faster
[Figure: the replicated dataset and its gradient. What is our gradient now?]
15. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15
o Of course, in real situations data do not replicate
o However, after a sizeable amount of data there are clusters of data that are similar
o Hence, the gradient is approximately alright
o Approximately alright is great; in many cases it is actually even better
SGD is faster
16. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16
o Often datasets are not "rigid"
o Imagine Instagram
◦ Let's assume 1 million new images are uploaded per week and we want to build a "cool picture" classifier
◦ Should "cool pictures" from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are averaged over the many more "past" samples
◦ The past "over-dominates"
o A properly implemented SGD can track changes much better and give better models
◦ [LeCun2002]
SGD for dynamically changing datasets
[Figure: example images popular today, popular in 2014, and popular in 2010]
17. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer "hard examples"
◦ Don't overdo it though :P, beware of outliers
o In practice, split your dataset into mini-batches
◦ Each mini-batch is as class-divergent and rich as possible
◦ New epoch → to be safe, new batches & new, randomly shuffled examples
Shuffling examples
[Figure: a dataset shuffled into different mini-batches at epoch t and at epoch t+1]
18. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18
o Conditions of convergence are well understood
o Acceleration techniques can be applied
◦ Second order (Hessian based) optimizations are possible
◦ Measuring not only gradients, but also curvatures of the loss surface
o Simpler theoretical analysis on weight dynamics and convergence rates
Advantages of Gradient Descent batch learning
19. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19
o SGD is preferred to Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets Gradient Descent is often not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ Hyper-parameter, trial & error
◦ Usually between 32-256 samples
In practice
20. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20
Data preprocessing &
normalization
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
21. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21
o Center data to be roughly 0
◦ Activation functions usually "centered" around 0
◦ Convergence usually faster
◦ Otherwise bias on the gradient direction → might slow down learning
Data pre-processing
[Figure: ReLU, tanh(x), and σ(x) activation functions]
22. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22
o Scale input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
Data pre-processing
Example: x = [x_1, x_2, x_3]^T with much different covariances, θ = [θ_1, θ_2, θ_3]^T, a = tanh(θ^T x)
Generated gradients dℓ/dθ_i over (x_1, x_2, x_3): much different per dimension
Gradient update harder: θ^(t+1) = θ^(t) − η_t [∂ℓ/∂θ_1, ∂ℓ/∂θ_2, ∂ℓ/∂θ_3]^T
23. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23
o Input variables should be as decorrelated as possible
◦ Input variables are "more independent"
◦ Network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → Better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case
◦ Extreme correlation (linear dependency) might cause problems [CAUTION]
Data pre-processing
24. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24
o Input variables follow a Gaussian distribution (roughly)
o In practice:
◦ From the training set compute the mean and standard deviation
◦ Then subtract the mean from the training samples
◦ Then divide the result by the standard deviation
Normalization: N(μ, σ²) → N(0, 1)
[Figure: data distribution for x, then x − μ, then (x − μ)/σ]
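A minimal numpy sketch of this recipe; the statistics come from the training set and are reused on validation/test data, and the per-channel variant for images (next slide) is shown in a comment. The epsilon guard is my own addition.

```python
# Standardize each input dimension using statistics from the training set only.
import numpy as np

def fit_standardizer(X_train):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8     # small epsilon to avoid division by zero
    return mu, sigma

def standardize(X, mu, sigma):
    return (X - mu) / sigma                # N(mu, sigma^2) -> roughly N(0, 1)

# For natural images one mu/sigma per color channel is often enough, e.g. for
# X_train with shape (N, H, W, 3): mu = X_train.mean(axis=(0, 1, 2)).
X_train = np.random.default_rng(0).normal(5.0, 3.0, size=(100, 10))
mu, sigma = fit_standardizer(X_train)
X_train_n = standardize(X_train, mu, sigma)   # apply the same mu, sigma to val/test data
```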
25. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25
o Instead of "per-dimension" → all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one μ, σ² instead of as many as the input variables
◦ Or the per color channel pixel average/variance
(μ_red, σ²_red), (μ_green, σ²_green), (μ_blue, σ²_blue)
N(μ, σ²) → N(0, 1) – Making things faster
26. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26
o When input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same ranges
o Just make sure images have mean 0 (μ = 0)
Even simpler: Centering the input
27. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27
o If C is the covariance matrix of your dataset, compute eigenvalues and eigenvectors with SVD
U, Σ, V^T = svd(C)
o Decorrelate (PCA-ed) the dataset by
X_rot = X U
◦ Use a subset of eigenvectors U′ = [u_1, …, u_k] to reduce the data dimensions
o Scale by the square root of the eigenvalues to whiten the data
X_wht = X_rot / √Σ
o Not used much with Convolutional Neural Nets
◦ The zero mean normalization is more important
PCA Whitening
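A minimal numpy sketch of these steps (the toy correlated data and the small epsilon guarding near-zero eigenvalues are my own additions; the recipe itself mirrors the Karpathy notes cited on the next slide):

```python
# PCA whitening sketch, following the U, Sigma from the data covariance as above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                              [1.0, 1.0, 0.0],
                                              [0.0, 0.5, 0.2]])   # correlated toy data
X = X - X.mean(axis=0)                 # zero-mean first
C = X.T @ X / X.shape[0]               # covariance matrix
U, S, Vt = np.linalg.svd(C)            # C is symmetric, so U holds the eigenvectors
X_rot = X @ U                          # decorrelate (PCA); use U[:, :k] to also reduce dims
X_wht = X_rot / np.sqrt(S + 1e-5)      # whiten; epsilon guards near-zero eigenvalues
```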
28. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28
Example
Images taken from A. Karpathy course website: http://cs231n.github.io/neural-networks-2/
29. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29
Data augmentation [Krizhevsky2012]
[Figure: an original image and its flip, random crop, contrast, and tint augmentations]
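A minimal sketch of the flip and random-crop augmentations on a single image (my own illustration; the 24-pixel crop size is an arbitrary choice, and the contrast/tint jitter is only mentioned in a comment):

```python
# Simple augmentation sketch for one image: horizontal flip and a random crop.
# Contrast/tint augmentation would similarly perturb pixel values; omitted here.
import numpy as np

def augment(img, rng, crop=24):
    """img: (H, W, C) array. Returns a randomly flipped and cropped view."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # horizontal flip
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]  # random crop to crop x crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = augment(image, rng)
```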
30. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30
o Weights change → the distribution of the layer inputs changes per round
◦ Covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize x_l to N(0, 1) and rescale
Batch normalization [Ioffe2015]
[Figure: the layer l input distribution x_l at (t), (t+0.5), and (t+1); backpropagation shifts it, batch normalization re-normalizes it]
31. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31
Batch normalization - Intuitively
[Figure: the layer l input distribution x_l at (t), (t+0.5), and (t+1), with and without batch normalization]
32. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32
Batch normalization – The algorithm
o μ_B ← (1/m) Σ_{i=1}^m x_i [compute the mini-batch mean]
o σ²_B ← (1/m) Σ_{i=1}^m (x_i − μ_B)² [compute the mini-batch variance]
o x̂_i ← (x_i − μ_B) / √(σ²_B + ε) [normalize the input]
o ŷ_i ← γ x̂_i + β [scale and shift the input]
γ and β are trainable parameters
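A minimal numpy sketch of the four steps in training mode (the test-time running-average detail in the comment comes from the Ioffe2015 paper rather than from this slide):

```python
# Batch-normalization forward pass (training mode) for a batch of activations.
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) trainable scale and shift."""
    mu = x.mean(axis=0)                         # mini-batch mean
    var = x.var(axis=0)                         # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalize
    return gamma * x_hat + beta                 # scale and shift

# At test time one would instead use running averages of mu and var collected
# during training, so a single sample can be normalized deterministically.
x = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```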
33. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise maybe exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near optimal "regime"
o Better model regularization
◦ Neuron activations are not deterministic, they depend on the batch
◦ Model cannot be overconfident
Batch normalization - Benefits
34. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34
Regularization
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
35. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually, the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L)) + λΩ(θ)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
Regularization
36. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36
o Most important (or most popular) regularization
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l ||θ_l||²
o The ℓ2-regularization can pass inside the gradient descent update rule
θ^(t+1) = θ^(t) − η_t (∇_θ ℓ + λθ^(t)) ⟹
θ^(t+1) = (1 − λη_t) θ^(t) − η_t ∇_θ ℓ
o λ is usually about 10^−1, 10^−2
ℓ2-regularization
"Weight decay", because the weights get smaller
37. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37
o ℓ1-regularization is one of the most important techniques
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L)) + (λ/2) Σ_l |θ_l|
o Also ℓ1-regularization passes inside the gradient descent update rule
θ^(t+1) = θ^(t) − λη_t θ^(t)/|θ^(t)| − η_t ∇_θ ℓ
where θ^(t)/|θ^(t)| is the sign function
o ℓ1-regularization → sparse weights
◦ λ increases → more weights become 0
ℓ1-regularization
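A minimal sketch of how the two penalties enter the SGD step, using the update rules above; the `grad_loss` argument is a stand-in for the data-loss gradient and the numbers are arbitrary, neither is defined on the slides.

```python
# SGD step with l2 (weight decay) or l1 regularization folded into the update.
import numpy as np

def sgd_step(theta, grad_loss, eta, lam, penalty="l2"):
    """theta: parameter vector; grad_loss: gradient of the data loss at theta."""
    if penalty == "l2":
        # theta <- (1 - lam*eta) * theta - eta * grad   ("weight decay")
        return (1.0 - lam * eta) * theta - eta * grad_loss
    if penalty == "l1":
        # theta <- theta - lam*eta*sign(theta) - eta*grad   (pushes weights to 0)
        return theta - lam * eta * np.sign(theta) - eta * grad_loss
    return theta - eta * grad_loss

theta = np.array([0.5, -0.2, 1.0])
theta = sgd_step(theta, grad_loss=np.array([0.1, 0.0, -0.3]), eta=0.1, lam=0.01)
```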
38. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38
o To tackle overfitting, another popular technique is early stopping
o Monitor performance on a separate validation set
o Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing
◦ This quite likely means the network starts to overfit
Early stopping
39. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39
o During training, set activations randomly to 0
◦ Neurons sampled at random from a Bernoulli distribution with p = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by p
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No "free-rider" neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
Dropout [Srivastava2014]
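A minimal sketch of a dropout layer. Note this uses the "inverted" variant that rescales by 1/p during training, whereas the slide describes reweighting activations by p at test time; the two are equivalent in expectation, and the choice here is mine, not the slide's.

```python
# Dropout forward pass sketch (inverted dropout: rescale survivors by 1/p).
import numpy as np

def dropout_forward(a, p=0.5, train=True, rng=None):
    """a: layer activations; p: probability of keeping a neuron (Bernoulli)."""
    if not train:
        return a                        # all neurons used at test time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) < p      # sample which neurons stay active
    return a * mask / p                 # zero out the rest, rescale the survivors

a = np.ones(8)
print(dropout_forward(a, p=0.5))        # roughly half the entries zeroed
```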
40. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
Dropout
[Figures, slides 40-44: the original model and the different sub-networks sampled at epoch 1 and at epoch 2]
45. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45
Architectural details
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
46. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46
o Straightforward sigmoids not a very good idea
o Symmetric sigmoids converge faster
◦ E.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: a = h(x) = 1.7159 tanh(2x/3)
o You can add a linear term to avoid flat areas
a = h(x) = tanh(x) + βx
Sigmoid-like activation functions
[Figure: plots of tanh(x) and tanh(x) + 0.5x]
47. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47
o RBF: a = h(x) = Σ_j u_j exp(−β_j ||x − w_j||²)
o Sigmoid: a = h(x) = σ(x) = 1 / (1 + e^(−x))
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower dimensional spaces
RBFs vs "Sigmoids"
48. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48
o Activation function a = h(x) = max(0, x)
o Gradient wrt the input
∂a/∂x = 0 if x ≤ 0, and 1 if x > 0
o Very popular in computer vision and speech recognition
o Much faster computations, gradients
◦ No vanishing or exploding problems, only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die". Lower learning rates mitigate the problem
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
49. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49
ReLU convergence rate
[Figure: training error over epochs for ReLU vs. tanh; ReLU converges considerably faster]
50. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50
o Soft approximation (softplus): a = h(x) = ln(1 + e^x)
o Noisy ReLU: a = h(x) = max(0, x + ε), ε ~ N(0, σ(x))
o Leaky ReLU: a = h(x) = x if x > 0, 0.01x otherwise
o Parametric ReLU: a = h(x) = x if x > 0, βx otherwise (parameter β is trainable)
Other ReLUs
51. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
Architectural hyper-parameters
52. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52
o Dataset-dependent hyper-parameters
o Tip: Start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important, use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
Number of neurons, number of hidden layers
[Figure: generalization vs. model complexity (number of neurons)]
53. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53
Learning rate
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
54. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54
o The right learning rate η_t is very important for fast convergence
◦ Too strong → gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others not
o Rule of thumb
◦ Learning rate of (shared) weights proportional to the square root of the number of connections sharing the weight
o Adaptive learning rates are also possible, based on the errors observed
◦ [Sompolinsky1995]
Learning rate
55. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease (e.g. η_t/2 or η_t/10) every T number of epochs
o Inverse decay: η_t = η_0 / (1 + εt)
o Exponential decay: η_t = η_0 e^(−εt)
o Often step decay is preferred
◦ Simple, intuitive, works well, and only a single extra hyper-parameter (the divisor, e.g. 2 or 10)
Learning rate schedules
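Minimal sketches of the schedules above; the drop factor 10, the 30-epoch period, and ε = 0.001 are arbitrary example values of my own, not recommendations from the slides.

```python
# Learning-rate schedule sketches matching the formulas above.
import numpy as np

def step_decay(eta0, epoch, drop=10, every=30):
    """Divide the learning rate by `drop` every `every` epochs."""
    return eta0 / drop ** (epoch // every)

def inverse_decay(eta0, t, eps=0.001):
    return eta0 / (1.0 + eps * t)          # eta_t = eta_0 / (1 + eps*t)

def exponential_decay(eta0, t, eps=0.001):
    return eta0 * np.exp(-eps * t)         # eta_t = eta_0 * exp(-eps*t)

for epoch in range(0, 100, 30):
    print(epoch, step_decay(0.1, epoch))   # 0.1, 0.01, 0.001, ...
```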
56. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56
o Try several log-spaced values 10^−1, 10^−2, 10^−3, … on a smaller set
◦ Then you can narrow it down from there, around where you get the lowest error
o You can decrease the learning rate every 10 (or some other value) full training set epochs
◦ Although this highly depends on your data
Learning rate in practice
57. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57
Weight initialization
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
58. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58
o There are a few contradictory requirements
o Weights need to be small enough
◦ Around the origin (0) for symmetric functions (tanh, sigmoid)
◦ When training starts, it is better to stimulate activation functions near their linear regime
◦ Larger gradients → faster training
o Weights need to be large enough
◦ Otherwise the signal is too weak for any serious learning
Weight initialization
[Figure: tanh and sigmoid curves, annotated with their linear regime where gradients are large]
59. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 59
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons operate in their full capacity
Question: Why similar input/output variance?
o Good practice: initialize weights to be asymmetric
◦ Don't give the same value to all weights (like all 0)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ non-linearities
◦ data normalization
Weight initialization
60. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 60
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons operate in their full capacity
Question: Why similar input/output variance?
Answer: Because the output of one module is the input to another
o Good practice: initialize weights to be asymmetric
◦ Don't give the same value to all weights (like all 0)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ non-linearities
◦ data normalization
Weight initialization
62. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62
o For tanh, initialize weights from [−√(6/(n_{l−1} + n_l)), √(6/(n_{l−1} + n_l))]
◦ n_{l−1} is the number of input variables to the tanh layer and n_l is the number of output variables
o For a sigmoid, use [−4·√(6/(n_{l−1} + n_l)), 4·√(6/(n_{l−1} + n_l))]
One way of initializing sigmoid-like neurons
[Figure: the activation curve's linear regime, where gradients are large]
63. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63
o For a = wx the variance is
Var(a) = E[x]² Var(w) + E[w]² Var(x) + Var(x) Var(w)
o Since E[x] = E[w] = 0
Var(a) = Var(x) Var(w) = d · Var(x_i) Var(w_i)
o For Var(a) = Var(x) ⟹ Var(w_i) = 1/d
o Draw random weights from
w ~ N(0, 1/d)
where d is the number of neurons in the input
Xavier initialization [Glorot2010]
64. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64
o Unlike sigmoids, ReLUs ground the linear activations to 0 half the time
o Double the weight variance
◦ Compensates for the zero flat-area → input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from
w ~ N(0, 2/d)
where d is the number of neurons in the input
[He2015] initialization for ReLUs
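A minimal sketch of the two Gaussian initializations just described (the layer sizes are arbitrary; the uniform-range variant for tanh/sigmoid from slide 62 would use ±√(6/(n_{l−1}+n_l)) bounds instead):

```python
# Weight-initialization sketches for a layer with d_in inputs (fan-in).
import numpy as np

def xavier_init(d_in, d_out, rng):
    """Glorot2010-style: keep Var(a) = Var(x) by using variance 1/d_in."""
    return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))

def he_init(d_in, d_out, rng):
    """He2015-style for ReLUs: double the variance to 2/d_in."""
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

rng = np.random.default_rng(0)
W_tanh = xavier_init(256, 128, rng)   # e.g. a tanh layer
W_relu = he_init(256, 128, rng)       # e.g. a ReLU layer
```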
65. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65
Loss functions
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
66. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66
o Our samples contain only one class
◦ There is only one correct answer per sample
o Negative log-likelihood (cross entropy) + Softmax
ℓ(θ; x, y) = − Σ_{c=1}^C y_c log a_L^c, for all classes c = 1, …, C
o Hierarchical softmax when C is very large
o Hinge loss (aka SVM loss)
ℓ(θ; x, y) = Σ_{c=1, c≠y}^C max(0, a_L^c − a_L^y + 1)
o Squared hinge loss
Multi-class classification
Is it a cat? Is it a horse? …
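A minimal sketch of the softmax + negative log-likelihood combination for a single sample (the max-shift and the 1e-12 guard are standard numerical-stability tricks that I added; they are not part of the slide):

```python
# Softmax + cross-entropy sketch for one sample with a one-hot label.
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, y_onehot):
    """l = -sum_c y_c * log a_L^c, with a_L = softmax(scores)."""
    a = softmax(scores)
    return -np.sum(y_onehot * np.log(a + 1e-12))

scores = np.array([2.0, -1.0, 0.3])      # network outputs for C = 3 classes
y = np.array([1.0, 0.0, 0.0])            # the sample's single correct class
print(cross_entropy(scores, y))
```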
67. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67
o Each sample can have many correct answers
o Hinge loss and the likes
◦ Sigmoids would also work
o Each output neuron is independent
◦ "Does this contain a car, yes or no?"
◦ "Does this contain a person, yes or no?"
◦ "Does this contain a motorbike, yes or no?"
◦ "Does this contain a horse, yes or no?"
o Instead of "Is this a car, motorbike or person?"
◦ p(car|x) = 0.55, p(m/bike|x) = 0.25, p(person|x) = 0.15, p(horse|x) = 0.05
◦ p(car|x) + p(m/bike|x) + p(person|x) + p(horse|x) = 1.0
Multi-class, multi-label classification
68. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68
o The good old Euclidean loss
ℓ(θ; x, y) = (1/2) ||y − a_L||²_2
o Or an RBF on top of the Euclidean loss
ℓ(θ; x, y) = Σ_j u_j exp(−β_j (y − a_L)²)
o Or the ℓ1 distance
ℓ(θ; x, y) = Σ_j |y_j − a_L^j|
Regression
69. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69
Even better
optimizations
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
70. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70
o Don't switch gradients all the time
o Maintain "momentum" from the previous parameter updates
o More robust gradients and learning → faster convergence
o Nice "physics"-based interpretation
◦ Instead of updating the position of the "ball", we update the velocity, which updates the position
Momentum
θ^(t+1) = θ^(t) + u^(t)
u^(t) = γ u^(t−1) − η_t ∇_θ ℓ
[Figure: a loss surface with the plain gradient step and the gradient + momentum step]
71. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71
o Use the future gradient instead of the current gradient
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
Nesterov Momentum [Sutskever2013]
θ^(t+1) = θ^(t) + u^(t)
u^(t) = γ u^(t−1) − η_t ∇_θ ℓ, with the look-ahead gradient computed at θ^(t) + γ u^(t−1) instead of at θ^(t)
[Figure: the momentum vector, the plain gradient + momentum step, and the look-ahead gradient from the next step giving the gradient + Nesterov momentum step]
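Minimal sketches of the two updates; the `grad` function is a stand-in for the loss gradient and the toy quadratic loss, step size, and γ = 0.9 are my own illustrative choices, not values given on the slides.

```python
# Momentum and Nesterov-momentum update sketches.
import numpy as np

def momentum_step(theta, u, grad, eta=0.01, gamma=0.9):
    u = gamma * u - eta * grad(theta)              # update the velocity
    return theta + u, u                            # the velocity updates the position

def nesterov_step(theta, u, grad, eta=0.01, gamma=0.9):
    u = gamma * u - eta * grad(theta + gamma * u)  # gradient at the look-ahead point
    return theta + u, u

grad = lambda th: 2.0 * th                         # toy quadratic loss ||theta||^2
theta, u = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, u = nesterov_step(theta, u, grad)       # converges towards 0
```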
72. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72
o Normally all weights are updated with the same "aggressiveness"
◦ Often some parameters could enjoy more "teaching"
◦ While others are already about there
o Adapt the learning per parameter
θ^(t+1) = θ^(t) − H_ℓ^(−1) η_t ∇_θ ℓ
o H_ℓ is the Hessian matrix of ℓ: second-order derivatives
(H_ℓ)_ij = ∂²ℓ / (∂θ_i ∂θ_j)
Second order optimization
73. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian, e.g. with the L-BFGS algorithm
◦ Keeps a memory of gradients to approximate the inverse Hessian
o L-BFGS works alright with Gradient Descent. What about SGD?
o In practice SGD with some good momentum works just fine
Second order optimization methods in practice
74. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
Other per-parameter adaptive optimizations
75. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75
o Schedule
◦ r_j = Σ_t (∇_θ ℓ_j)² ⟹ θ^(t+1) = θ^(t) − η_t ∇_θ ℓ / (√r + ε)
◦ ε is a small number to avoid division by 0
◦ Gradients become gradually smaller and smaller
Adagrad [Duchi2011]
76. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76
o Schedule
◦ r_j = α Σ_{τ=1}^{t−1} (∇_θ^(τ) ℓ_j)² + (1 − α) (∇_θ^(t) ℓ_j)², where α is the decay hyper-parameter ⟹
◦ θ^(t+1) = θ^(t) − η_t ∇_θ ℓ / (√r + ε)
o Moving average of the squared gradients
◦ Compared to Adagrad
o Large gradients, e.g. a too "noisy" loss surface
◦ Updates are tamed
o Small gradients, e.g. stuck in a flat loss surface ravine
◦ Updates become more aggressive
RMSprop
Square rooting boosts small values while suppressing large values
77. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77
o One of the most popular learning algorithms
r = (∇_θ ℓ)²
m^(t+0.5) = β_1 m^(t) + (1 − β_1) ∇_θ ℓ
v^(t+0.5) = β_2 v^(t) + (1 − β_2) r
θ^(t+1) = θ^(t) − η_t m^(t+0.5) / (√(v^(t+0.5)) + ε)
o Similar to RMSprop, but with momentum
o Recommended values: β_1 = 0.9, β_2 = 0.999, ε = 10^−8
Adam [Kingma2014]
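A minimal sketch of the update above. The bias-correction step from the original Kingma2014 paper is deliberately left out because the slide's formulas do not show it, and the toy quadratic loss is my own example.

```python
# Adam update sketch with the recommended beta1/beta2/epsilon values.
import numpy as np

def adam_step(theta, m, v, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    r = grad ** 2                                 # squared gradient
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * r             # second moment (RMSprop-like)
    theta = theta - eta * m / (np.sqrt(v) + eps)  # per-parameter adaptive step
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    grad = 2.0 * theta                            # toy quadratic loss ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad)
```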
78. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78
Visual overview
Picture credit: Alec Radford
79. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79
o Learning to learn by gradient descent by gradient descent
◦ [Andrychowicz2016]
o θ^(t+1) = θ^(t) + g_t(∇_θ ℓ, φ)
o g_t is an "optimizer" with its own parameters φ
◦ Implemented as a recurrent network
Learning, not computing, the gradients
80. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80
Good practice
o Preprocess the data to at least have 0 mean
o Initialize weights based on the activation functions
◦ For ReLU use Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
81. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81
Babysitting
Deep Nets
a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1) …, θ_{L−1}), θ_L)
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
θ^(t+1) = θ^(t) − η_t ∇_θ ℓ
1. The Neural Network
2. Learning by minimizing empirical error
3. Optimizing with Gradient Descent based methods
82. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82
o Always check your gradients if they are not computed automatically
o Check that in the first round you get a random loss
o Check the network with a few samples
◦ Turn off regularization. You should predictably overfit and reach a 0 loss
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the training and validation loss curves
◦ There should be a gap, but not too large a one
Babysitting Deep Nets
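A minimal sketch of the gradient check implied by the first bullet, using centered finite differences on a toy loss (the function names and the 1e-5 step size are my own illustrative choices):

```python
# Numerical gradient check: compare an analytic gradient against finite differences.
import numpy as np

def loss(theta):
    return np.sum(theta ** 2)            # toy loss with known gradient 2*theta

def grad(theta):
    return 2.0 * theta                   # "analytic" gradient to be verified

def numerical_grad(f, theta, h=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2.0 * h)   # centered difference
    return g

theta = np.array([0.3, -1.2, 2.0])
rel_err = np.abs(grad(theta) - numerical_grad(loss, theta)) / (np.abs(grad(theta)) + 1e-12)
print(rel_err)                           # should be very small, e.g. < 1e-6
```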
83. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
84. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84
o http://www.deeplearningbook.org/
◦ Part II: Chapters 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas, Learning to learn by gradient descent by gradient
descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002]
Reading material & references
85. UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85
Next lecture
o What are Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?