Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Lecture 7:
Training Neural Networks
1
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Administrative: A2
A2 is out, due Monday May 2nd, 11:59pm
2
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Linear score function: f = Wx
2-layer Neural Network: f = W2 max(0, W1 x)
[Diagram: input x (3072) → hidden h (100) → scores s (10), with weight matrices W1 and W2]
Neural Networks
3
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Convolutional Neural Networks
4
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Convolutional Layer
[Diagram: a 32x32x3 image convolved with a 5x5x3 filter - slid over all spatial locations - produces a 28x28x1 activation map]
5
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Convolutional Layer
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
We stack these up to get a “new image” of size 28x28x6!
[Diagram: 32x32x3 input → Convolution Layer → six 28x28 activation maps]
6
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
7
Where we are now...
CNN Architectures
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Landscape image is CC0 1.0 public domain
Walking man image is CC0 1.0 public domain
Learning network parameters through optimization
8
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Where we are now...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
9
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
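To make the recap concrete, here is a minimal numpy sketch of the mini-batch SGD loop above; the toy linear model, fake data, and hyperparameter values are placeholders for illustration, not code from the lecture.

import numpy as np

# Toy data and a linear softmax classifier, just to make the four-step loop concrete.
np.random.seed(0)
X = np.random.randn(1000, 3072)            # fake flattened images
y = np.random.randint(0, 10, size=1000)    # fake labels
W = 0.01 * np.random.randn(3072, 10)
learning_rate, batch_size = 1e-3, 64

for it in range(100):
    # 1. Sample a batch of data
    idx = np.random.choice(len(X), batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop it through the graph, get loss (softmax)
    scores = xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()
    # 3. Backprop to calculate the gradient dL/dW
    dscores = probs
    dscores[np.arange(batch_size), yb] -= 1
    dW = xb.T.dot(dscores) / batch_size
    # 4. Update the parameters using the gradient
    W -= learning_rate * dW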
Today: Training Neural Networks
10
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
11
1. One time set up: activation functions, preprocessing,
weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process,
parameter updates, hyperparameter optimization
3. Evaluation: model ensembles, test-time
augmentation, transfer learning
Overview
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
12
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
13
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Sigmoid
tanh
ReLU
Leaky ReLU
Maxout
ELU
14
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
15
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
16
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
17
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
What happens when x = -10?
18
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
What happens when x = -10?
19
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
20
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
21
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
22
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Why is this a problem?
Once the sigmoid saturates, its local gradient is (nearly) zero, so all the
gradients flowing back will be zero and the weights will never change
sigmoid
gate
x
23
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
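A quick numerical check of the saturation problem (a sketch, not lecture code): the local gradient of the sigmoid, sigma(x) * (1 - sigma(x)), is essentially zero at x = -10 and x = 10, and peaks at only 0.25 at x = 0.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)   # d(sigma)/dx
    print(f"x = {x:6.1f}  sigma(x) = {s:.5f}  local grad = {local_grad:.2e}")

# At x = +/-10 the local gradient is ~4.5e-05, so whatever upstream gradient
# is multiplied by it during backprop is effectively "killed".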
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
2. Sigmoid outputs are not
zero-centered
24
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
25
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
26
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
27
We know that local gradient of sigmoid is always positive
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
We know that local gradient of sigmoid is always positive
We are assuming x is always positive
28
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
We know that local gradient of sigmoid is always positive
We are assuming x is always positive
So!! The sign of the gradient for all w_i is the same as the sign of the upstream scalar gradient!
29
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
[Diagram: hypothetical optimal w vector; allowed gradient update directions lie only in the all-positive and all-negative quadrants, so updates follow a zig-zag path toward it]
30
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
31
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
(For a single element! Minibatches help)
[Diagram: hypothetical optimal w vector; allowed gradient update directions lie only in the all-positive and all-negative quadrants, so updates follow a zig-zag path toward it]
31
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
2. Sigmoid outputs are not
zero-centered
3. exp() is a bit compute expensive
32
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
tanh(x)
- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(
[LeCun et al., 1991]
33
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
ReLU
(Rectified Linear Unit)
[Krizhevsky et al., 2012]
- Computes f(x) = max(0,x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
34
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
ReLU
(Rectified Linear Unit)
- Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
35
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
ReLU
(Rectified Linear Unit)
- Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
- An annoyance:
hint: what is the gradient when x < 0?
36
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
ReLU
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
37
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
DATA CLOUD
active ReLU
dead ReLU
will never activate
=> never update
38
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
DATA CLOUD
active ReLU
dead ReLU
will never activate
=> never update
=> people like to initialize
ReLU neurons with slightly
positive biases (e.g. 0.01)
39
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Leaky ReLU
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
[Maas et al., 2013]
[He et al., 2015]
40
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Leaky ReLU
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
Parametric Rectifier (PReLU)
[Maas et al., 2013]
[He et al., 2015]
41
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Exponential Linear Units (ELU)
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime
compared with Leaky ReLU
adds some robustness to noise
- Computation requires exp()
[Clevert et al., 2015]
42
(Alpha default = 1)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Activation Functions
Scaled Exponential Linear Units (SELU)
- Scaled version of ELU that
works better for deep networks
- “Self-normalizing” property;
- Can train deep SELU networks
without BatchNorm
[Klambauer et al., NeurIPS 2017]
43
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Maxout “Neuron”
- Does not have the basic form of dot product ->
nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
Problem: doubles the number of parameters/neuron :(
[Goodfellow et al., 2013]
44
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
TLDR: In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU / SELU
  to squeeze out some marginal gains
- Don’t use sigmoid or tanh
45
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Data Preprocessing
46
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Data Preprocessing
(Assume X [NxD] is data matrix,
each example in a row)
47
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
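In code, the zero-centering and normalization steps the slide refers to look roughly like this (a minimal numpy sketch; X is assumed to be the [N x D] data matrix with one example per row).

import numpy as np

X = 5.0 * np.random.randn(500, 3072) + 2.0     # pretend data, one example per row

mean = np.mean(X, axis=0)                      # per-dimension mean (D numbers)
std = np.std(X, axis=0)                        # per-dimension std  (D numbers)

X_centered = X - mean                          # zero-center each dimension
X_normalized = X_centered / std                # then scale to unit std per dimension
# The same training-set mean and std are reused to preprocess val/test data.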
Remember: Consider what happens when
the input to a neuron is always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)
[Diagram: hypothetical optimal w vector; allowed gradient update directions lie only in the all-positive and all-negative quadrants, so updates follow a zig-zag path toward it]
48
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
(Assume X [NxD] is data matrix, each example in a row)
Data Preprocessing
49
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Data Preprocessing
In practice, you may also see PCA and Whitening of the data
(data has diagonal
covariance matrix)
(covariance matrix is the
identity matrix)
50
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Data Preprocessing
Before normalization: classification loss
very sensitive to changes in weight matrix;
hard to optimize
After normalization: less sensitive to small
changes in weights; easier to optimize
51
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
TLDR: In practice for Images: center only
- Subtract the mean image (e.g. AlexNet)
(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)
- Subtract per-channel mean and
divide by per-channel std (e.g. ResNet)
(mean and std along each channel = 2 x 3 numbers)
e.g. consider CIFAR-10 example with [32,32,3] images
Not common
to do PCA or
whitening
52
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
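For images this reduces to per-channel statistics; a sketch for a CIFAR-10-shaped batch (the array shapes and variable names here are my own, chosen for illustration).

import numpy as np

train = np.random.randint(0, 256, size=(50000, 32, 32, 3)).astype(np.float32)

mean_image = train.mean(axis=0)            # [32,32,3] mean image (AlexNet-style)
channel_mean = train.mean(axis=(0, 1, 2))  # 3 numbers, one per channel (VGG-style)
channel_std = train.std(axis=(0, 1, 2))    # 3 numbers, one per channel

train_prep = (train - channel_mean) / channel_std   # ResNet-style preprocessing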
Weight Initialization
53
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
- Q: what happens when W=constant init is used?
(A: every neuron computes the same output and receives the same gradient, so the neurons never differentiate - symmetry is never broken)
54
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
55
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
Works ~okay for small networks, but problems with
deeper networks.
56
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
57
What will happen to the activations for the last layer?
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
All activations tend to zero
for deeper network layers
Q: What do the gradients
dL/dW look like?
58
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
All activations tend to zero
for deeper network layers
Q: What do the gradients
dL/dW look like?
A: All zero, no learning =(
59
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
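The experiment behind these plots can be reproduced in a few lines of numpy (a sketch under the slide's setup: tanh layers of width 4096, weights drawn from a zero-mean Gaussian with std 0.01; batch size and seed are arbitrary).

import numpy as np

np.random.seed(0)
dims = [4096] * 7                     # input plus 6 hidden layers
x = np.random.randn(16, dims[0])      # a small batch of fake inputs

for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # "small random numbers" init
    x = np.tanh(x.dot(W))
    print(f"activation std: {x.std():.6f}")  # shrinks toward 0 layer by layer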
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
60
What will happen to the activations for the last layer?
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
All activations saturate
Q: What do the gradients
look like?
61
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
All activations saturate
Q: What do the gradients
look like?
A: Local gradients all zero,
no learning =(
62
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: “Xavier” Initialization
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
63
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels

Derivation: let y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din)   [substituting the value of y]
       = Din Var(x_i w_i)                             [assume all x_i, w_i are iid]
       = Din Var(x_i) Var(w_i)                        [assume all x_i, w_i are zero mean]

So, Var(y) = Var(x_i) only when Var(w_i) = 1/Din
72
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: What about ReLU?
Change from tanh to ReLU
73
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: What about ReLU?
Xavier assumes zero
centered activation function
Activations collapse to zero
again, no learning =(
Change from tanh to ReLU
74
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
“Just right”: Activations are
nicely scaled for all layers!
75
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
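Swapping the init scale is a one-line change; a sketch comparing the cases discussed above (the helper name, network width, depth, and batch size are arbitrary choices, not lecture code).

import numpy as np

np.random.seed(0)

def final_activation_std(init_std_fn, nonlinearity, depth=6, width=1024):
    x = np.random.randn(16, width)
    for _ in range(depth):
        W = np.random.randn(width, width) * init_std_fn(width)
        x = nonlinearity(x.dot(W))
    return x.std()

tanh = np.tanh
relu = lambda a: np.maximum(0, a)

print("tanh + Xavier  :", final_activation_std(lambda Din: 1.0 / np.sqrt(Din), tanh))  # stays ~O(1)
print("ReLU + Xavier  :", final_activation_std(lambda Din: 1.0 / np.sqrt(Din), relu))  # shrinks
print("ReLU + Kaiming :", final_activation_std(lambda Din: np.sqrt(2.0 / Din), relu))  # stays ~O(1)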
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks
by Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et
al., 2015
Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
All you need is a good init, Mishkin and Matas, 2015
Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
76
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Training vs. Testing Error
77
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
78
Beyond Training Error
Better optimization algorithms
help reduce training loss
But we really care about error on
new data - how to reduce the gap?
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
79
Early Stopping: Always do this
[Plots: training loss vs. iteration; train and val accuracy vs. iteration - stop training where val accuracy starts to drop]
Stop training the model when accuracy on the validation set decreases
Or train for a long time, but always keep track of the model snapshot
that worked best on val
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
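A sketch of the bookkeeping this implies; train_one_epoch and evaluate are stand-ins for your own training and validation code, and patience is an assumed hyperparameter, not something specified in the lecture.

import copy

def fit(model, train_one_epoch, evaluate, max_epochs=200, patience=10):
    """Early stopping: keep the snapshot that worked best on val, and stop
    once val accuracy hasn't improved for `patience` epochs."""
    best_acc, best_model, since_best = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # your training code (stand-in)
        val_acc = evaluate(model)              # your validation code (stand-in)
        if val_acc > best_acc:
            best_acc, best_model, since_best = val_acc, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:
                break                          # stop training here
    return best_model, best_acc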
80
1. Train multiple independent models
2. At test time average their results
(Take average of predicted probability distributions, then choose argmax)
Enjoy 2% extra performance
Model Ensembles
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
81
How to improve single-model performance?
Regularization
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Regularization: Add term to loss
82
Total loss = data loss + lambda R(W)
In common use:
L2 regularization: R(W) = sum_k sum_l W_{k,l}^2 (weight decay)
L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = sum_k sum_l (beta W_{k,l}^2 + |W_{k,l}|)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
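In code the added term is just a function of the weights; a minimal sketch for L2 and L1 (the weight shape, data loss value, and regularization strength are placeholders).

import numpy as np

W = 0.01 * np.random.randn(3072, 10)
data_loss = 1.27            # pretend average of the per-example losses L_i
reg = 1e-4                  # regularization strength lambda

l2_penalty = reg * np.sum(W * W)          # L2 / weight decay
l1_penalty = reg * np.sum(np.abs(W))      # L1
loss = data_loss + l2_penalty             # total loss = data loss + lambda * R(W)

dW_from_reg = 2 * reg * W                 # the L2 term's contribution to dL/dW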
83
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
84
Regularization: Dropout Example forward
pass with a
3-layer network
using dropout
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
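The slide's example is code we can only sketch here; something along these lines, with p = 0.5 and hypothetical layer sizes, captures the idea of dropping hidden activations during the forward pass.

import numpy as np

p = 0.5                                   # probability of keeping a unit
D, H = 3072, 4096
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, H), np.zeros(H)
W3, b3 = 0.01 * np.random.randn(H, 10), np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, x.dot(W1) + b1)
    U1 = np.random.rand(*H1.shape) < p    # first dropout mask
    H1 = H1 * U1                          # drop!
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = np.random.rand(*H2.shape) < p    # second dropout mask
    H2 = H2 * U2                          # drop!
    out = H2.dot(W3) + b3
    return out                            # (backward pass and update omitted)

scores = train_step(np.random.randn(4, D))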
85
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation;
Prevents co-adaptation of features
[Diagram: a cat score computed from features - has an ear, has a tail, is furry, has claws, mischievous look - with some features randomly dropped (X)]
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
86
Regularization: Dropout
How can this possibly be a good idea?
Another interpretation:
Dropout is training a large ensemble of
models (that share parameters).
Each binary mask is one model
An FC layer with 4096 units has
2^4096 ~ 10^1233 possible masks!
Only ~ 10^82 atoms in the universe...
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
87
Dropout: Test time
Dropout makes our output random!
y = f_W(x, z), with input x (image), output y (label), and random dropout mask z
Want to “average out” the randomness at test-time:
y = f(x) = E_z[ f(x, z) ] = ∫ p(z) f(x, z) dz
But this integral seems hard …
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
88
Dropout: Test time
Want to approximate the integral.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1 x + w2 y
During training (dropping each input with probability 0.5) we have:
E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0·y) + 1/4 (0·x + w2 y) + 1/4 (0·x + 0·y)
     = 1/2 (w1 x + w2 y)
At test time, multiply by the dropout probability: p (w1 x + w2 y) then matches the expected training-time output
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
92
Dropout: Test time
At test time all neurons are active always
=> We must scale the activations so that for each neuron:
output at test time = expected output at training time
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
93
Dropout Summary
drop in train time
scale at test time
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
94
More common: “Inverted dropout”
test time is unchanged!
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
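A sketch of the inverted-dropout variant: scale the mask by 1/p while training so the test-time forward pass needs no extra scaling (p here is the keep probability; the layer size and names are illustrative).

import numpy as np

p = 0.5                                          # keep probability
W1 = 0.01 * np.random.randn(3072, 4096)

def forward_train(x):
    H1 = np.maximum(0, x.dot(W1))
    U1 = (np.random.rand(*H1.shape) < p) / p     # mask already scaled by 1/p
    return H1 * U1

def forward_test(x):
    return np.maximum(0, x.dot(W1))              # unchanged: no scaling needed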
95
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
96
Regularization: A common pattern
Training: Add some kind
of randomness
Testing: Average out randomness
(sometimes approximate)
Example: Batch
Normalization
Training:
Normalize using
stats from random
minibatches
Testing: Use fixed
stats to normalize
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
97
Load image
and label
“cat”
CNN
Compute
loss
Regularization: Data Augmentation
This image by Nikita is
licensed under CC-BY 2.0
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
98
Regularization: Data Augmentation
Load image
and label
“cat”
CNN
Compute
loss
Transform image
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
99
Data Augmentation
Horizontal Flips
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
100
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
101
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops
ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
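A sketch of the ResNet-style training-time crop/scale sampling described above; the function name and the use of Pillow for resizing are my own choices, not code from the lecture.

import numpy as np
from PIL import Image   # Pillow, assumed available, used only for resize/crop

def random_resized_crop(img, crop=224, scale_range=(256, 480)):
    """Resize so the short side is a random L in [256, 480],
    then take a random crop x crop patch."""
    L = np.random.randint(scale_range[0], scale_range[1] + 1)
    w, h = img.size
    if w < h:
        new_w, new_h = L, int(round(h * L / w))
    else:
        new_w, new_h = int(round(w * L / h)), L
    img = img.resize((new_w, new_h))
    x0 = np.random.randint(0, new_w - crop + 1)
    y0 = np.random.randint(0, new_h - crop + 1)
    return img.crop((x0, y0, x0 + crop, y0 + crop))

patch = random_resized_crop(Image.new("RGB", (640, 427)))   # dummy black image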
102
Data Augmentation
Color Jitter
Simple: Randomize
contrast and brightness
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
103
Data Augmentation
Color Jitter
Simple: Randomize
contrast and brightness
More Complex:
1. Apply PCA to all [R, G, B]
pixels in training set
2. Sample a “color offset”
along principal component
directions
3. Add offset to all pixels of a
training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
104
Data Augmentation
Get creative for your problem!
Examples of data augmentations:
- translation
- rotation
- stretching
- shearing,
- lens distortions, … (go crazy)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
105
Automatic Data Augmentation
Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
106
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
107
Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
108
Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions
Graham, “Fractional Max Pooling”, arXiv 2014
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
109
Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
110
Regularization: Cutout
Training: Set random image regions to zero
Testing: Use full image
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
DeVries and Taylor, “Improved Regularization of
Convolutional Neural Networks with Cutout”, arXiv 2017
Works very well for small datasets like CIFAR,
less common for large datasets like ImageNet
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
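A minimal sketch of cutout on an image array; the square size, placement rule, and names are assumptions for illustration rather than the paper's exact recipe.

import numpy as np

def cutout(img, size=8):
    """Zero out a random size x size square; img is an HxWxC float array."""
    out = img.copy()
    h, w = out.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    out[y:y + size, x:x + size, :] = 0.0
    return out

aug = cutout(np.random.rand(32, 32, 3), size=8)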
111
Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
Randomly blend the pixels
of pairs of training images,
e.g. 40% cat, 60% dog
CNN
Target label:
cat: 0.4
dog: 0.6
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
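A sketch of mixup on a single pair of examples, with the blend weight drawn from a Beta distribution as in the paper; alpha, the image shapes, and the class indices are placeholder choices.

import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=1.0):
    """Blend two images and their one-hot labels with the same lambda."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2     # e.g. cat: 0.4, dog: 0.6 when lam = 0.4
    return x, y

cat_img, dog_img = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
cat_label = np.eye(10)[3]               # pretend class 3 = cat
dog_label = np.eye(10)[5]               # pretend class 5 = dog
x_mix, y_mix = mixup_pair(cat_img, cat_label, dog_img, dog_label)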
112
Regularization - In practice
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup
- Consider dropout for large
fully-connected layers
- Batch normalization and data
augmentation almost always a
good idea
- Try cutout and mixup especially
for small classification datasets
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
113
Choosing Hyperparameters
(without tons of GPUs)
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
114
Choosing Hyperparameters
Step 1: Check initial loss
Turn off weight decay, sanity check loss at initialization
e.g. log(C) for softmax with C classes
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
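A sanity-check sketch: with weight decay off and tiny random weights, the initial softmax loss should come out close to log(C). The data here is random and only the expected loss value matters.

import numpy as np

C, N, D = 10, 128, 3072
W = 1e-4 * np.random.randn(D, C)        # tiny weights -> near-uniform class probabilities
X = np.random.randn(N, D)
y = np.random.randint(0, C, size=N)

scores = X.dot(W)
scores -= scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()
print(loss, "should be close to", np.log(C))   # log(10) ~ 2.303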
115
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of
training data (~5-10 minibatches); fiddle with architecture,
learning rate, weight initialization
Loss not going down? LR too low, bad initialization
Loss explodes to Inf or NaN? LR too high, bad initialization
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
116
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training
data, turn on small weight decay, find a learning rate that
makes the loss drop significantly within ~100 iterations
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
117
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around
what worked from Step 3, train a few models for ~1-5 epochs.
Good weight decay to try: 1e-4, 1e-5, 0
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
118
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Pick best models from Step 4, train them for longer (~10-20
epochs) without learning rate decay
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
119
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
120
[Plot: train and val accuracy vs. time, both still rising]
Accuracy still going up: you need to train longer
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
121
[Plot: train accuracy far above val accuracy]
Huge train / val gap means overfitting! Increase regularization, get more data
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
122
[Plot: train and val accuracy nearly overlapping]
No gap between train / val means underfitting: train longer, use a bigger model
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
123
Losses may be noisy, use a
scatter plot and also plot moving
average to see trends better
Look at learning curves!
[Plots: training loss (left); train / val accuracy (right)]
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
124
Cross-validation
We develop "command
centers" to visualize all our
models training with different
hyperparameters
check out Weights & Biases
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
125
You can plot all your loss curves for different hyperparameters on a single plot
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
126
Don't look at accuracy or loss curves for too long!
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
127
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Step 7: GOTO step 5
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
128
Random Search vs. Grid Search
[Diagram: grid layout vs. random layout over an important parameter (x-axis) and an unimportant parameter (y-axis); random search covers more distinct values of the important parameter.
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017]
Random Search for
Hyper-Parameter Optimization
Bergstra and Bengio, 2012
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
129
Summary
- Improve your training error:
- Optimizers
- Learning rate schedules
- Improve your test error:
- Regularization
- Choosing Hyperparameters
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
Summary
We looked in detail at:
- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier/He init)
- Batch Normalization (use this!)
- Transfer learning (use this if you can!)
TLDRs
130
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022
131
Next time: Visualizing and Understanding

More Related Content

Similar to PQS_7_ruohan.pdf

Visual Object Tracking: Action-Decision Networks (ADNets)
Visual Object Tracking: Action-Decision Networks (ADNets)Visual Object Tracking: Action-Decision Networks (ADNets)
Visual Object Tracking: Action-Decision Networks (ADNets)
Yehya Abouelnaga
 
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Introduction to Artificial Neural Networks - PART IV.pdf
Introduction to Artificial Neural Networks - PART IV.pdfIntroduction to Artificial Neural Networks - PART IV.pdf
Introduction to Artificial Neural Networks - PART IV.pdf
SasiKala592103
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
YUNG-KUEI CHEN
 
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
Jordan Open Source Association
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & Future
Rouyun Pan
 
8. Recursion.pptx
8. Recursion.pptx8. Recursion.pptx
8. Recursion.pptx
Abid523408
 
Dr. Geoffrey J. Gordon: What can machine learning do for open education?
Dr. Geoffrey J. Gordon: What can machine learning do for open education?Dr. Geoffrey J. Gordon: What can machine learning do for open education?
Dr. Geoffrey J. Gordon: What can machine learning do for open education?
The Open Education Consortium
 
lecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxlecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptx
anjithaba
 

Similar to PQS_7_ruohan.pdf (9)

Visual Object Tracking: Action-Decision Networks (ADNets)
Visual Object Tracking: Action-Decision Networks (ADNets)Visual Object Tracking: Action-Decision Networks (ADNets)
Visual Object Tracking: Action-Decision Networks (ADNets)
 
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
 
Introduction to Artificial Neural Networks - PART IV.pdf
Introduction to Artificial Neural Networks - PART IV.pdfIntroduction to Artificial Neural Networks - PART IV.pdf
Introduction to Artificial Neural Networks - PART IV.pdf
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
 
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & Future
 
8. Recursion.pptx
8. Recursion.pptx8. Recursion.pptx
8. Recursion.pptx
 
Dr. Geoffrey J. Gordon: What can machine learning do for open education?
Dr. Geoffrey J. Gordon: What can machine learning do for open education?Dr. Geoffrey J. Gordon: What can machine learning do for open education?
Dr. Geoffrey J. Gordon: What can machine learning do for open education?
 
lecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxlecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptx
 

More from Kuan-Tsae Huang

TWCC22_PPT_v3_KL.pdf quantum computer, quantum computer , quantum computer ,...
TWCC22_PPT_v3_KL.pdf  quantum computer, quantum computer , quantum computer ,...TWCC22_PPT_v3_KL.pdf  quantum computer, quantum computer , quantum computer ,...
TWCC22_PPT_v3_KL.pdf quantum computer, quantum computer , quantum computer ,...
Kuan-Tsae Huang
 
0305吳家樂lecture qva lecture , letcutre , my talk.pdf
0305吳家樂lecture qva lecture , letcutre , my talk.pdf0305吳家樂lecture qva lecture , letcutre , my talk.pdf
0305吳家樂lecture qva lecture , letcutre , my talk.pdf
Kuan-Tsae Huang
 
lecture_16_jiajun.pdf
lecture_16_jiajun.pdflecture_16_jiajun.pdf
lecture_16_jiajun.pdf
Kuan-Tsae Huang
 
MIT-Lec-11- RSA.pdf
MIT-Lec-11- RSA.pdfMIT-Lec-11- RSA.pdf
MIT-Lec-11- RSA.pdf
Kuan-Tsae Huang
 
discussion_3_project.pdf
discussion_3_project.pdfdiscussion_3_project.pdf
discussion_3_project.pdf
Kuan-Tsae Huang
 
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
Kuan-Tsae Huang
 
半導體實驗室.pptx
半導體實驗室.pptx半導體實驗室.pptx
半導體實驗室.pptx
Kuan-Tsae Huang
 
B 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptxB 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptx
Kuan-Tsae Huang
 
20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf
Kuan-Tsae Huang
 
電動拖板車100208.pptx
電動拖板車100208.pptx電動拖板車100208.pptx
電動拖板車100208.pptx
Kuan-Tsae Huang
 
電動拖板車100121.pptx
電動拖板車100121.pptx電動拖板車100121.pptx
電動拖板車100121.pptx
Kuan-Tsae Huang
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdf
Kuan-Tsae Huang
 
FIN330 Chapter 20.pptx
FIN330 Chapter 20.pptxFIN330 Chapter 20.pptx
FIN330 Chapter 20.pptx
Kuan-Tsae Huang
 
FIN330 Chapter 22.pptx
FIN330 Chapter 22.pptxFIN330 Chapter 22.pptx
FIN330 Chapter 22.pptx
Kuan-Tsae Huang
 
FIN330 Chapter 01.pptx
FIN330 Chapter 01.pptxFIN330 Chapter 01.pptx
FIN330 Chapter 01.pptx
Kuan-Tsae Huang
 
Thoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdfThoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdf
Kuan-Tsae Huang
 
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
Kuan-Tsae Huang
 
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Kuan-Tsae Huang
 
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
Kuan-Tsae Huang
 
Capture and storage—a compendium of canada’s participation, 2006
Capture and storage—a compendium of canada’s participation, 2006Capture and storage—a compendium of canada’s participation, 2006
Capture and storage—a compendium of canada’s participation, 2006
Kuan-Tsae Huang
 

More from Kuan-Tsae Huang (20)

TWCC22_PPT_v3_KL.pdf quantum computer, quantum computer , quantum computer ,...
TWCC22_PPT_v3_KL.pdf  quantum computer, quantum computer , quantum computer ,...TWCC22_PPT_v3_KL.pdf  quantum computer, quantum computer , quantum computer ,...
TWCC22_PPT_v3_KL.pdf quantum computer, quantum computer , quantum computer ,...
 
0305吳家樂lecture qva lecture , letcutre , my talk.pdf
0305吳家樂lecture qva lecture , letcutre , my talk.pdf0305吳家樂lecture qva lecture , letcutre , my talk.pdf
0305吳家樂lecture qva lecture , letcutre , my talk.pdf
 
lecture_16_jiajun.pdf
lecture_16_jiajun.pdflecture_16_jiajun.pdf
lecture_16_jiajun.pdf
 
MIT-Lec-11- RSA.pdf
MIT-Lec-11- RSA.pdfMIT-Lec-11- RSA.pdf
MIT-Lec-11- RSA.pdf
 
discussion_3_project.pdf
discussion_3_project.pdfdiscussion_3_project.pdf
discussion_3_project.pdf
 
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
經濟部科研成果價值創造計畫-簡報格式111.05.30.pptx
 
半導體實驗室.pptx
半導體實驗室.pptx半導體實驗室.pptx
半導體實驗室.pptx
 
B 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptxB 每產生1度氫電排放多少.pptx
B 每產生1度氫電排放多少.pptx
 
20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf20230418 GPT Impact on Accounting & Finance -Reg.pdf
20230418 GPT Impact on Accounting & Finance -Reg.pdf
 
電動拖板車100208.pptx
電動拖板車100208.pptx電動拖板車100208.pptx
電動拖板車100208.pptx
 
電動拖板車100121.pptx
電動拖板車100121.pptx電動拖板車100121.pptx
電動拖板車100121.pptx
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdf
 
FIN330 Chapter 20.pptx
FIN330 Chapter 20.pptxFIN330 Chapter 20.pptx
FIN330 Chapter 20.pptx
 
FIN330 Chapter 22.pptx
FIN330 Chapter 22.pptxFIN330 Chapter 22.pptx
FIN330 Chapter 22.pptx
 
FIN330 Chapter 01.pptx
FIN330 Chapter 01.pptxFIN330 Chapter 01.pptx
FIN330 Chapter 01.pptx
 
Thoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdfThoughtworks Q1 2022 Investor Presentation.pdf
Thoughtworks Q1 2022 Investor Presentation.pdf
 
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx20220828-How-do-you-build-an-effic.9658238.ppt.pptx
20220828-How-do-you-build-an-effic.9658238.ppt.pptx
 
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
Amine solvent development for carbon dioxide capture by yang du, 2016 (doctor...
 
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
A novel approach to carbon dioxide capture and storage by brett p. spigarelli...
 
Capture and storage—a compendium of canada’s participation, 2006
Capture and storage—a compendium of canada’s participation, 2006Capture and storage—a compendium of canada’s participation, 2006
Capture and storage—a compendium of canada’s participation, 2006
 

Recently uploaded

Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
pvpriya2
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
Bituminous road construction project based learning report
Bituminous road construction project based learning reportBituminous road construction project based learning report
Bituminous road construction project based learning report
CE19KaushlendraKumar
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
Atif Razi
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
Power Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptxPower Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptx
Poornima D
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
AnasAhmadNoor
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
mahaffeycheryld
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
ijaia
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
Indrajeet sahu
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Massimo Talia
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
PreethaV16
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
Kamal Acharya
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
Shiny Christobel
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
aryanpankaj78
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
wafawafa52
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 

Recently uploaded (20)

Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Bituminous road construction project based learning report
Bituminous road construction project based learning reportBituminous road construction project based learning report
Bituminous road construction project based learning report
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
Power Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptxPower Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptx
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 
Levelised Cost of Hydrogen (LCOH) Calculator Manual
Levelised Cost of Hydrogen  (LCOH) Calculator ManualLevelised Cost of Hydrogen  (LCOH) Calculator Manual
Levelised Cost of Hydrogen (LCOH) Calculator Manual
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
Supermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdfSupermarket Management System Project Report.pdf
Supermarket Management System Project Report.pdf
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
 

PQS_7_ruohan.pdf

  • 1. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Lecture 7: Training Neural Networks 1
  • 2. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Administrative: A2 A2 is out, due Monday May 2nd, 11:59pm 2
  • 3. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Linear score function: 2-layer Neural Network x h W1 s W2 3072 100 10 Neural Networks 3
  • 4. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Convolutional Neural Networks 4
  • 5. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Convolutional Layer 32 32 3 32x32x3 image 5x5x3 filter convolve (slide) over all spatial locations activation map 1 28 28 5
  • 6. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Convolutional Layer 32 32 3 Convolution Layer activation maps 6 28 28 For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps: We stack these up to get a “new image” of size 28x28x6! 6
  • 7. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 7 Where we are now... CNN Architectures
  • 8. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Landscape image is CC0 1.0 public domain Walking man image is CC0 1.0 public domain Learning network parameters through optimization 8
  • 9. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Where we are now... Mini-batch SGD Loop: 1. Sample a batch of data 2. Forward prop it through the graph (network), get loss 3. Backprop to calculate the gradients 4. Update the parameters using the gradient 9
  • 10. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Today: Training Neural Networks 10
  • 11. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 11 1. One time set up: activation functions, preprocessing, weight initialization, regularization, gradient checking 2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization 3. Evaluation: model ensembles, test-time augmentation, transfer learning Overview
  • 12. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions 12
  • 13. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions 13
  • 14. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Sigmoid tanh ReLU Leaky ReLU Maxout ELU 14
  • 15. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Sigmoid - Squashes numbers to range [0,1] - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron 15
  • 16. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Sigmoid - Squashes numbers to range [0,1] - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron 3 problems: 1. Saturated neurons “kill” the gradients 16
  • 17. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x 17
  • 18. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x What happens when x = -10? 18
  • 19. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x What happens when x = -10? 19
  • 20. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x What happens when x = -10? What happens when x = 0? 20
  • 21. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x What happens when x = -10? What happens when x = 0? What happens when x = 10? 21
  • 22. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 sigmoid gate x What happens when x = -10? What happens when x = 0? What happens when x = 10? 22
  • 23. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Why is this a problem? When the sigmoid saturates, the local gradient is ~0, so all the gradients flowing back will be zero and the weights will never change sigmoid gate x 23
  • 24. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Sigmoid - Squashes numbers to range [0,1] - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron 3 problems: 1. Saturated neurons “kill” the gradients 2. Sigmoid outputs are not zero-centered 24
  • 25. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? 25
  • 26. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? 26
  • 27. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? 27 We know that local gradient of sigmoid is always positive
  • 28. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? We know that local gradient of sigmoid is always positive We are assuming x is always positive 28
  • 29. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? We know that local gradient of sigmoid is always positive We are assuming x is always positive So!! Sign of gradient for all w_i is the same as the sign of upstream scalar gradient! 29
  • 30. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (figure: allowed gradient update directions span only two quadrants, so updates zig-zag toward the hypothetical optimal w vector) 30
  • 31. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 31 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help) (figure: allowed gradient update directions span only two quadrants, so updates zig-zag toward the hypothetical optimal w vector) 31
  • 32. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Sigmoid - Squashes numbers to range [0,1] - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron 3 problems: 1. Saturated neurons “kill” the gradients 2. Sigmoid outputs are not zero-centered 3. exp() is a bit compute expensive 32
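Not part of the original slides: a minimal numpy sketch of problems 1 and 2 above, showing how the sigmoid's local gradient vanishes for large |x| and how its outputs are never zero-centered (the sample inputs are arbitrary).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-10.0, 0.0, 10.0])
    s = sigmoid(x)
    local_grad = s * (1.0 - s)   # d(sigmoid)/dx

    print(s)           # ~[4.5e-05, 0.5, 1.0]   -> outputs all positive, not zero-centered
    print(local_grad)  # ~[4.5e-05, 0.25, 4.5e-05] -> gradient "killed" when |x| is large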
  • 33. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions tanh(x) - Squashes numbers to range [-1,1] - zero centered (nice) - still kills gradients when saturated :( [LeCun et al., 1991] 33
  • 34. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions - Computes f(x) = max(0,x) - Does not saturate (in +region) - Very computationally efficient - Converges much faster than sigmoid/tanh in practice (e.g. 6x) ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] 34
  • 35. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions ReLU (Rectified Linear Unit) - Computes f(x) = max(0,x) - Does not saturate (in +region) - Very computationally efficient - Converges much faster than sigmoid/tanh in practice (e.g. 6x) - Not zero-centered output 35
  • 36. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions ReLU (Rectified Linear Unit) - Computes f(x) = max(0,x) - Does not saturate (in +region) - Very computationally efficient - Converges much faster than sigmoid/tanh in practice (e.g. 6x) - Not zero-centered output - An annoyance: hint: what is the gradient when x < 0? 36
  • 37. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 ReLU gate x What happens when x = -10? What happens when x = 0? What happens when x = 10? 37
  • 38. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 DATA CLOUD active ReLU dead ReLU will never activate => never update 38
  • 39. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 DATA CLOUD active ReLU dead ReLU will never activate => never update => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01) 39
  • 40. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Leaky ReLU - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - will not “die”. [Maas et al., 2013] [He et al., 2015] 40
  • 41. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Leaky ReLU - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - will not “die”. Parametric Rectifier (PReLU) [Maas et al., 2013] [He et al., 2015] 41
  • 42. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Exponential Linear Units (ELU) - All benefits of ReLU - Closer to zero mean outputs - Negative saturation regime compared with Leaky ReLU adds some robustness to noise - Computation requires exp() [Clevert et al., 2015] 42 (Alpha default = 1)
  • 43. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Activation Functions Scaled Exponential Linear Units (SELU) - Scaled version of ELU that works better for deep networks - “Self-normalizing” property - Can train deep SELU networks without BatchNorm [Klambauer et al., NeurIPS 2017] 43
  • 44. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Maxout “Neuron” - Does not have the basic form of dot product -> nonlinearity - Generalizes ReLU and Leaky ReLU - Linear Regime! Does not saturate! Does not die! Problem: doubles the number of parameters/neuron :( [Goodfellow et al., 2013] 44
  • 45. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 TLDR: In practice: - Use ReLU. Be careful with your learning rates - Try out Leaky ReLU / Maxout / ELU / SELU - To squeeze out some marginal gains - Don’t use sigmoid or tanh 45
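Not part of the original slides: numpy sketches of the non-saturating activations recommended above; the alpha values are common defaults, not prescribed by the lecture.

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)                # small negative slope, never fully "dies"

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))  # saturates smoothly for x < 0

    x = np.linspace(-3, 3, 7)
    print(relu(x))
    print(leaky_relu(x))
    print(elu(x))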
  • 46. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Data Preprocessing 46
  • 47. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Data Preprocessing (Assume X [NxD] is data matrix, each example in a row) 47
  • 48. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Remember: Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!) (figure: allowed gradient update directions span only two quadrants, so updates zig-zag toward the hypothetical optimal w vector) 48
  • 49. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 (Assume X [NxD] is data matrix, each example in a row) Data Preprocessing 49
  • 50. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Data Preprocessing In practice, you may also see PCA and Whitening of the data (data has diagonal covariance matrix) (covariance matrix is the identity matrix) 50
  • 51. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Data Preprocessing Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize After normalization: less sensitive to small changes in weights; easier to optimize 51
  • 52. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 TLDR: In practice for Images: center only - Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array) - Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers) - Subtract per-channel mean and Divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers) e.g. consider CIFAR-10 example with [32,32,3] images Not common to do PCA or whitening 52
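Not part of the original slides: a sketch of ResNet-style per-channel normalization for a CIFAR-10-shaped batch; the random data and shapes here are placeholders.

    import numpy as np

    # Fake training set shaped like CIFAR-10: N x 32 x 32 x 3, values in [0, 255].
    X_train = np.random.uniform(0, 255, size=(1000, 32, 32, 3)).astype(np.float32)

    # Per-channel statistics (3 numbers each), computed on the training data only.
    mean = X_train.mean(axis=(0, 1, 2))   # shape (3,)
    std = X_train.std(axis=(0, 1, 2))     # shape (3,)

    # Subtract per-channel mean and divide by per-channel std; reuse the same
    # mean/std to preprocess the val and test data.
    X_train = (X_train - mean) / std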
  • 53. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization 53
  • 54. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 - Q: what happens when W=constant init is used? 54
  • 55. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 - First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation) 55
  • 56. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 - First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation) Works ~okay for small networks, but problems with deeper networks. 56
  • 57. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Forward pass for a 6-layer net with hidden size 4096 57 What will happen to the activations for the last layer?
  • 58. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Forward pass for a 6-layer net with hidden size 4096 All activations tend to zero for deeper network layers Q: What do the gradients dL/dW look like? 58
  • 59. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Forward pass for a 6-layer net with hidden size 4096 All activations tend to zero for deeper network layers Q: What do the gradients dL/dW look like? A: All zero, no learning =( 59
  • 60. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Increase std of initial weights from 0.01 to 0.05 60 What will happen to the activations for the last layer?
  • 61. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Increase std of initial weights from 0.01 to 0.05 All activations saturate Q: What do the gradients look like? 61
  • 62. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Activation statistics Increase std of initial weights from 0.01 to 0.05 All activations saturate Q: What do the gradients look like? A: Local gradients all zero, no learning =( 62
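Not part of the original slides: a rough numpy reproduction of the activation-statistics experiment above (6 tanh layers, hidden size 4096, zero-mean gaussian weights); the batch size of 16 is an arbitrary choice.

    import numpy as np

    dims = [4096] * 7                      # input + 6 hidden layers
    x = np.random.randn(16, dims[0])

    for std in [0.01, 0.05]:
        h = x
        for Din, Dout in zip(dims[:-1], dims[1:]):
            W = np.random.randn(Din, Dout) * std
            h = np.tanh(h.dot(W))
        # std = 0.01: activations collapse toward 0 (gradients vanish);
        # std = 0.05: pre-activations are large, so tanh saturates near +/-1.
        print(std, h.mean(), h.std())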
  • 63. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: “Xavier” Initialization “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 63
  • 64. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Just right”: Activations are nicely scaled for all layers! “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization 64
  • 65. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization 65
  • 66. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 66
  • 67. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 67 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
  • 68. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 68 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din) We want: Var(y) = Var(x_i)
  • 69. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din) [substituting value of y] Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 69 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din) We want: Var(y) = Var(x_i)
  • 70. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din) = Din Var(x_i w_i) [Assume all x_i, w_i are iid] Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 70 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din) We want: Var(y) = Var(x_i)
  • 71. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din) = Din Var(x_i w_i) = Din Var(x_i) Var(w_i) [Assume all x_i, w_i are zero mean] Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 71 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din) We want: Var(y) = Var(x_i)
  • 72. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010 Weight Initialization: “Xavier” Initialization Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din) = Din Var(x_i w_i) = Din Var(x_i) Var(w_i) [Assume all x_i, w_i are iid] Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels 72 Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din) We want: Var(y) = Var(x_i) So, Var(y) = Var(x_i) only when Var(w_i) = 1/Din
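Not part of the original slides: a one-layer sketch of the conclusion above, i.e. drawing W with std = 1/sqrt(Din) keeps the pre-activation variance equal to the input variance (the layer sizes and batch size are arbitrary).

    import numpy as np

    Din, Dout = 4096, 4096
    x = np.random.randn(16, Din)

    W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": Var(w) = 1/Din
    y = x.dot(W)                                    # linear part, before the nonlinearity

    print(x.std(), y.std())   # roughly equal, so activations stay nicely scaled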
  • 73. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: What about ReLU? Change from tanh to ReLU 73
  • 74. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: What about ReLU? Xavier assumes zero centered activation function Activations collapse to zero again, no learning =( Change from tanh to ReLU 74
  • 75. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Weight Initialization: Kaiming / MSRA Initialization ReLU correction: std = sqrt(2 / Din) He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015 “Just right”: Activations are nicely scaled for all layers! 75
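Not part of the original slides: the same deep-net sketch with ReLU layers and Kaiming/MSRA scaling; the extra factor of 2 compensates for ReLU zeroing roughly half of the activations.

    import numpy as np

    dims = [4096] * 7
    x = np.random.randn(16, dims[0])
    h = x
    for Din, Dout in zip(dims[:-1], dims[1:]):
        W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # std = sqrt(2 / Din)
        h = np.maximum(0, h.dot(W))                           # ReLU
    print(x.std(), h.std())   # activation scale stays roughly constant across layers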
  • 76. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Proper initialization is an active area of research… Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013 Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014 Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015 Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015 All you need is a good init, Mishkin and Matas, 2015 Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019 The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019 76
  • 77. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Training vs. Testing Error 77
  • 78. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 78 Beyond Training Error Better optimization algorithms help reduce training loss But we really care about error on new data - how to reduce the gap?
  • 79. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 79 Early Stopping: Always do this (figure: training loss vs. iteration, and train/val accuracy vs. iteration, with “stop training here” marked where val accuracy peaks) Stop training the model when accuracy on the validation set decreases Or train for a long time, but always keep track of the model snapshot that worked best on val
  • 80. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 80 1. Train multiple independent models 2. At test time average their results (Take average of predicted probability distributions, then choose argmax) Enjoy 2% extra performance Model Ensembles
  • 81. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 81 How to improve single-model performance? Regularization
  • 82. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Regularization: Add term to loss 82 In common use: L2 regularization L1 regularization Elastic net (L1 + L2) (Weight decay)
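Not part of the original slides: a sketch of L2 regularization (weight decay) added to a loss and its gradient; the regularization strength 1e-4 and the weight shape are arbitrary example values.

    import numpy as np

    reg = 1e-4                              # regularization strength (hyperparameter)
    W = 0.01 * np.random.randn(3072, 10)

    def loss_with_l2(data_loss, W):
        return data_loss + reg * np.sum(W * W)   # add L2 penalty to the data loss

    # The penalty contributes 2 * reg * W to dL/dW, pulling weights toward zero.
    dW_reg = 2 * reg * W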
  • 83. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 83 Regularization: Dropout In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
  • 84. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 84 Regularization: Dropout Example forward pass with a 3-layer network using dropout
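The transcript omits the code shown on this slide; the sketch below follows the pattern described (3-layer network, drop probability 0.5), with made-up shapes and biases omitted for brevity.

    import numpy as np

    p = 0.5   # probability of keeping a unit (with 0.5 this equals the drop probability)

    def train_step(X, W1, W2, W3):
        H1 = np.maximum(0, X.dot(W1))
        U1 = np.random.rand(*H1.shape) < p   # first dropout mask
        H1 *= U1                             # drop!
        H2 = np.maximum(0, H1.dot(W2))
        U2 = np.random.rand(*H2.shape) < p   # second dropout mask
        H2 *= U2                             # drop!
        return H2.dot(W3)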
  • 85. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 85 Regularization: Dropout How can this possibly be a good idea? Forces the network to have a redundant representation; Prevents co-adaptation of features has an ear has a tail is furry has claws mischievous look cat score X X X
  • 86. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 86 Regularization: Dropout How can this possibly be a good idea? Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! Only ~ 10^82 atoms in the universe...
  • 87. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 87 Dropout: Test time Dropout makes our output random! Output (label) Input (image) Random mask Want to “average out” the randomness at test-time But this integral seems hard …
  • 88. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 88 Dropout: Test time Want to approximate the integral Consider a single neuron. a x y w1 w2
  • 89. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 89 Dropout: Test time Want to approximate the integral Consider a single neuron. At test time we have: a x y w1 w2
  • 90. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 90 Dropout: Test time Want to approximate the integral Consider a single neuron. At test time we have: During training we have: a x y w1 w2
  • 91. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 91 Dropout: Test time Want to approximate the integral Consider a single neuron with inputs x, y and weights w_1, w_2. At test time we have: a = w_1 x + w_2 y During training (with drop probability 0.5) we have: E[a] = 1/4 (w_1 x + w_2 y) + 1/4 (w_1 x + 0) + 1/4 (0 + w_2 y) + 1/4 (0) = 1/2 (w_1 x + w_2 y) At test time, multiply by dropout probability
  • 92. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 92 Dropout: Test time At test time all neurons are active always => We must scale the activations so that for each neuron: output at test time = expected output at training time
  • 93. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 93 Dropout Summary drop in train time scale at test time
  • 94. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 94 More common: “Inverted dropout” test time is unchanged!
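Not the slide's exact code: a sketch of inverted dropout, where the rescaling by 1/p happens at train time so the expected activation matches test time and the test-time forward pass is unchanged.

    import numpy as np

    p = 0.5   # keep probability

    def dropout_train(H):
        U = (np.random.rand(*H.shape) < p) / p   # mask already rescaled by 1/p
        return H * U                             # expected value matches test time

    def dropout_test(H):
        return H                                 # test time is unchanged!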
  • 95. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 95 Regularization: A common pattern Training: Add some kind of randomness Testing: Average out randomness (sometimes approximate)
  • 96. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 96 Regularization: A common pattern Training: Add some kind of randomness Testing: Average out randomness (sometimes approximate) Example: Batch Normalization Training: Normalize using stats from random minibatches Testing: Use fixed stats to normalize
  • 97. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 97 Load image and label “cat” CNN Compute loss Regularization: Data Augmentation This image by Nikita is licensed under CC-BY 2.0
  • 98. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 98 Regularization: Data Augmentation Load image and label “cat” CNN Compute loss Transform image
  • 99. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 99 Data Augmentation Horizontal Flips
  • 100. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 100 Data Augmentation Random crops and scales Training: sample random crops / scales ResNet: 1. Pick random L in range [256, 480] 2. Resize training image, short side = L 3. Sample random 224 x 224 patch
  • 101. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 101 Data Augmentation Random crops and scales Training: sample random crops / scales ResNet: 1. Pick random L in range [256, 480] 2. Resize training image, short side = L 3. Sample random 224 x 224 patch Testing: average a fixed set of crops ResNet: 1. Resize image at 5 scales: {224, 256, 384, 480, 640} 2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
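Not part of the original slides: a sketch of the train-time crop/flip recipe described above; the resize to a random short side in [256, 480] is left as a comment since it needs an image library, and the input shape is a placeholder.

    import numpy as np

    def augment(img, crop=224):
        # 1. (Omitted) resize the image so its short side is a random L in [256, 480].
        # 2. Sample a random crop x crop patch.
        H, W, _ = img.shape
        top = np.random.randint(0, H - crop + 1)
        left = np.random.randint(0, W - crop + 1)
        patch = img[top:top + crop, left:left + crop]
        # 3. Random horizontal flip.
        if np.random.rand() < 0.5:
            patch = patch[:, ::-1]
        return patch

    print(augment(np.zeros((300, 400, 3))).shape)   # (224, 224, 3)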
  • 102. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 102 Data Augmentation Color Jitter Simple: Randomize contrast and brightness
  • 103. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 103 Data Augmentation Color Jitter Simple: Randomize contrast and brightness More Complex: 1. Apply PCA to all [R, G, B] pixels in training set 2. Sample a “color offset” along principal component directions 3. Add offset to all pixels of a training image (As seen in [Krizhevsky et al. 2012], ResNet, etc)
  • 104. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 104 Data Augmentation Get creative for your problem! Examples of data augmentations: - translation - rotation - stretching - shearing, - lens distortions, … (go crazy)
  • 105. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 105 Automatic Data Augmentation Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019
  • 106. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 106 Regularization: A common pattern Training: Add random noise Testing: Marginalize over the noise Examples: Dropout Batch Normalization Data Augmentation
  • 107. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 107 Regularization: DropConnect Training: Drop connections between neurons (set weights to 0) Testing: Use all the connections Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013 Examples: Dropout Batch Normalization Data Augmentation DropConnect
  • 108. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 108 Regularization: Fractional Pooling Training: Use randomized pooling regions Testing: Average predictions from several regions Graham, “Fractional Max Pooling”, arXiv 2014 Examples: Dropout Batch Normalization Data Augmentation DropConnect Fractional Max Pooling
  • 109. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 109 Regularization: Stochastic Depth Training: Skip some layers in the network Testing: Use all the layers Examples: Dropout Batch Normalization Data Augmentation DropConnect Fractional Max Pooling Stochastic Depth Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
  • 110. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 110 Regularization: Cutout Training: Set random image regions to zero Testing: Use full image Examples: Dropout Batch Normalization Data Augmentation DropConnect Fractional Max Pooling Stochastic Depth Cutout / Random Crop DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017 Works very well for small datasets like CIFAR, less common for large datasets like ImageNet
  • 111. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 111 Regularization: Mixup Training: Train on random blends of images Testing: Use original images Examples: Dropout Batch Normalization Data Augmentation DropConnect Fractional Max Pooling Stochastic Depth Cutout / Random Crop Mixup Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018 Randomly blend the pixels of pairs of training images, e.g. 40% cat, 60% dog CNN Target label: cat: 0.4 dog: 0.6
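Not part of the original slides: a sketch of mixup for one pair of examples; following Zhang et al., the blend weight is drawn from a Beta(alpha, alpha) distribution (alpha = 1.0 here is an arbitrary choice, and labels are assumed one-hot).

    import numpy as np

    def mixup(x1, y1, x2, y2, alpha=1.0):
        lam = np.random.beta(alpha, alpha)   # e.g. 0.4
        x = lam * x1 + (1 - lam) * x2        # e.g. 40% cat pixels + 60% dog pixels
        y = lam * y1 + (1 - lam) * y2        # target: cat 0.4, dog 0.6
        return x, y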
  • 112. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 112 Regularization - In practice Training: Add random noise Testing: Marginalize over the noise Examples: Dropout Batch Normalization Data Augmentation DropConnect Fractional Max Pooling Stochastic Depth Cutout / Random Crop Mixup - Consider dropout for large fully-connected layers - Batch normalization and data augmentation almost always a good idea - Try cutout and mixup especially for small classification datasets
  • 113. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 113 Choosing Hyperparameters (without tons of GPUs)
  • 114. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 114 Choosing Hyperparameters Step 1: Check initial loss Turn off weight decay, sanity check loss at initialization e.g. log(C) for softmax with C classes
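Not part of the original slides: a sketch of the Step 1 sanity check, confirming that a softmax classifier with near-zero random weights gives a loss of about log(C); the class count and batch size are arbitrary.

    import numpy as np

    C, N = 10, 64
    scores = 0.01 * np.random.randn(N, C)     # nearly uniform scores at initialization
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    y = np.random.randint(C, size=N)
    loss = -np.log(probs[np.arange(N), y]).mean()
    print(loss, np.log(C))                    # both ~2.3 for C = 10 classes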
  • 115. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 115 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization Loss not going down? LR too low, bad initialization Loss explodes to Inf or NaN? LR too high, bad initialization
  • 116. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 116 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Step 3: Find LR that makes loss go down Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within ~100 iterations Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
  • 117. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 117 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Step 3: Find LR that makes loss go down Step 4: Coarse grid, train for ~1-5 epochs Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for ~1-5 epochs. Good weight decay to try: 1e-4, 1e-5, 0
  • 118. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 118 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Step 3: Find LR that makes loss go down Step 4: Coarse grid, train for ~1-5 epochs Step 5: Refine grid, train longer Pick best models from Step 4, train them for longer (~10-20 epochs) without learning rate decay
  • 119. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 119 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Step 3: Find LR that makes loss go down Step 4: Coarse grid, train for ~1-5 epochs Step 5: Refine grid, train longer Step 6: Look at loss and accuracy curves
  • 120. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 120 Accuracy time Train Accuracy still going up, you need to train longer Val
  • 121. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 121 Accuracy time Train Huge train / val gap means overfitting! Increase regularization, get more data Val
  • 122. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 122 Accuracy time Train No gap between train / val means underfitting: train longer, use a bigger model Val
  • 123. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 123 Losses may be noisy, use a scatter plot and also plot moving average to see trends better Look at learning curves! Training Loss Train / Val Accuracy
  • 124. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 124 Cross-validation We develop "command centers" to visualize all our models training with different hyperparameters; check out Weights & Biases
  • 125. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 125 You can plot all your loss curves for different hyperparameters on a single plot
  • 126. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 126 Don't look at accuracy or loss curves for too long!
  • 127. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 127 Choosing Hyperparameters Step 1: Check initial loss Step 2: Overfit a small sample Step 3: Find LR that makes loss go down Step 4: Coarse grid, train for ~1-5 epochs Step 5: Refine grid, train longer Step 6: Look at loss and accuracy curves Step 7: GOTO step 5
  • 128. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 128 Random Search vs. Grid Search Important Parameter Important Parameter Unimportant Parameter Unimportant Parameter Grid Layout Random Layout Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017 Random Search for Hyper-Parameter Optimization Bergstra and Bengio, 2012
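Not part of the original slides: a sketch of random hyperparameter search, sampling learning rate and weight decay log-uniformly instead of on a fixed grid so the important hyperparameter gets many distinct values; the ranges are illustrative, roughly following the values suggested in Steps 3-4.

    import numpy as np

    def sample_hparams():
        lr = 10 ** np.random.uniform(-4, -1)    # learning rate in [1e-4, 1e-1]
        reg = 10 ** np.random.uniform(-5, -4)   # weight decay in [1e-5, 1e-4]
        return lr, reg

    for trial in range(5):
        lr, reg = sample_hparams()
        print(trial, lr, reg)   # train a short run for each sample, keep the best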
  • 129. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 129 Summary - Improve your training error: - Optimizers - Learning rate schedules - Improve your test error: - Regularization - Choosing Hyperparameters
  • 130. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 Summary We looked in detail at: - Activation Functions (use ReLU) - Data Preprocessing (images: subtract mean) - Weight Initialization (use Xavier/He init) - Batch Normalization (use this!) - Transfer learning (use this if you can!) TLDRs 130
  • 131. Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 7 - April 19, 2022 131 Next time: Visualizing and Understanding