This document summarizes key points from a lecture on training neural networks. It discusses activation functions such as ReLU, Leaky ReLU, and ELU and their advantages over sigmoid and tanh. It covers data preprocessing (subtracting the mean, dividing by the standard deviation, and, rarely, whitening) and weight initialization. Mini-batch stochastic gradient descent is introduced as the optimization algorithm: sample a batch, compute the loss via a forward pass, backpropagate to get gradients, and update the parameters. Regularization (dropout, data augmentation), hyperparameter selection, model ensembles, test-time augmentation, and transfer learning are also covered.
1. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Lecture 7:
Training Neural Networks
1
2. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Administrative: assignments
A2 is out, due Monday May 8th, 11:59pm
[Important] Q6 is moved from A2 to A3, please re-download the
assignment if you have the version before the change
We are working on grading A1 by the end of the week
2
3. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Linear score function: f = Wx
2-layer Neural Network: s = W2 max(0, W1 x)
[Diagram: x (3072-dim input) → W1 → h (100 hidden units) → W2 → s (10 class scores)]
Neural Networks
3
4. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Convolutional Neural Networks
4
5. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Convolutional Layer
[Diagram: convolve (slide) a 5x5x3 filter over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map]
5
6. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Convolutional Layer
[Diagram: Convolution Layer applied to a 32x32x3 image]
For example, if we had 6 5x5 filters, we'll get 6 separate 28x28 activation maps.
We stack these up to get a “new image” of size 28x28x6!
6
7. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
7
Where we are now...
CNN Architectures
8. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Landscape image is CC0 1.0 public domain
Walking man image is CC0 1.0 public domain
Learning network parameters through optimization
8
9. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Where we are now...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
9
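To make the loop above concrete, here is a minimal NumPy sketch of mini-batch SGD for a toy linear softmax classifier; the data, shapes, and hyperparameters are placeholders, not the course's assignment code.

```python
import numpy as np

# Toy stand-ins for a dataset and a linear softmax classifier; loss_and_grad
# plays the role of "forward prop + backprop" for whatever network is trained.
def loss_and_grad(W, X_batch, y_batch):
    scores = X_batch @ W                                   # forward pass
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    N = X_batch.shape[0]
    loss = -np.log(probs[np.arange(N), y_batch]).mean()    # softmax loss
    dscores = probs
    dscores[np.arange(N), y_batch] -= 1
    dW = X_batch.T @ dscores / N                           # backprop
    return loss, dW

X = np.random.randn(5000, 3072)            # placeholder data (CIFAR-10-sized vectors)
y = np.random.randint(0, 10, 5000)
W = 0.01 * np.random.randn(3072, 10)
lr, batch_size = 1e-3, 256

for step in range(100):
    idx = np.random.choice(len(X), batch_size)     # 1. sample a batch of data
    loss, dW = loss_and_grad(W, X[idx], y[idx])    # 2. forward prop  3. backprop
    W -= lr * dW                                   # 4. update the parameters
```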
10. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Today: Training Neural Networks
10
11. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
11
1. One time set up: activation functions, preprocessing,
weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process,
parameter updates, hyperparameter optimization
3. Evaluation: model ensembles, test-time
augmentation, transfer learning
Overview
15. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
15
16. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
16
17. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
17
18. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
What happens when x = -10?
18
19. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
What happens when x = -10?
19
20. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
20
21. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
21
22. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
sigmoid
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
22
23. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Why is this a problem?
When the sigmoid saturates, its local gradient is (nearly) zero, so all the gradients flowing back will be zero and the weights will never change.
sigmoid
gate
x
23
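A quick numerical check of the saturation problem (a small sketch, not from the slides): the sigmoid's local gradient sigmoid(x) * (1 - sigmoid(x)) is essentially zero at x = ±10, so almost no gradient flows back through a saturated gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)      # derivative of sigmoid at x
    print(f"x = {x:+5.1f}   sigmoid = {s:.5f}   local gradient = {local_grad:.6f}")

# x = -10 or +10: local gradient ~ 0.000045, so the upstream gradient is "killed"
# x = 0: local gradient = 0.25, the largest the sigmoid's derivative ever gets
```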
24. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
2. Sigmoid outputs are not
zero-centered
24
25. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
25
26. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
26
27. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
27
We know that local gradient of sigmoid is always positive
28. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
We know that local gradient of sigmoid is always positive
We are assuming x is always positive
28
29. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
We know that local gradient of sigmoid is always positive
We are assuming x is always positive
So!! The sign of the gradient for every wi is the same as the sign of the upstream scalar gradient!
29
30. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
[Diagram: the allowed gradient update directions span only two quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path]
30
31. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
31
Consider what happens when the input to a neuron is
always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
(For a single element! Minibatches help)
[Diagram: the allowed gradient update directions span only two quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path]
31
32. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they
have nice interpretation as a
saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the
gradients
2. Sigmoid outputs are not
zero-centered
3. exp() is a bit compute expensive
32
33. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
tanh(x)
- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(
[LeCun et al., 1991]
33
34. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions - Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
ReLU
(Rectified Linear Unit)
[Krizhevsky et al., 2012]
34
35. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
ReLU
(Rectified Linear Unit)
- Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
35
36. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
ReLU
(Rectified Linear Unit)
- Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
- An annoyance:
hint: what is the gradient when x < 0?
36
37. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
ReLU
gate
x
What happens when x = -10?
What happens when x = 0?
What happens when x = 10?
37
38. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
[Diagram: data cloud with an active ReLU (its active half-plane overlaps the data) and a dead ReLU (its active half-plane misses the data)]
A dead ReLU will never activate => never update
38
39. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
[Diagram: data cloud with an active ReLU and a dead ReLU]
A dead ReLU will never activate => never update
=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
39
40. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Leaky ReLU
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
[Maas et al., 2013]
[He et al., 2015]
40
41. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Leaky ReLU
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
Parametric Rectifier (PReLU)
[Maas et al., 2013]
[He et al., 2015]
41
42. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Exponential Linear Units (ELU)
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime
compared with Leaky ReLU
adds some robustness to noise
- Computation requires exp()
[Clevert et al., 2015]
42
(Alpha default = 1)
43. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Activation Functions
Scaled Exponential Linear Units (SELU)
- Scaled version of ELU that
works better for deep networks
- “Self-normalizing” property;
- Can train deep SELU networks
without BatchNorm
[Klambauer et al., NeurIPS 2017]
43
44. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Maxout “Neuron”
- Does not have the basic form of dot product ->
nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
Problem: doubles the number of parameters/neuron :(
[Goodfellow et al., 2013]
44
45. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
TLDR: In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains
- Don’t use sigmoid or tanh
45
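For reference, a small NumPy sketch of the recommended alternatives; the default slopes (0.01 for Leaky ReLU, alpha = 1 for ELU) are commonly used values, not values fixed by the lecture.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))        # [ 0.    0.    0.  1. 10.]  negatives clipped to zero: possible "dead" units
print(leaky_relu(x))  # [-0.1  -0.01  0.  1. 10.]  small negative slope, so the unit never fully dies
print(elu(x))         # [-1.00 -0.63  0.  1. 10.]  (approx.) smooth saturation toward -alpha for x << 0
```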
46. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Data Preprocessing
46
47. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Data Preprocessing
(Assume X [NxD] is data matrix,
each example in a row)
47
48. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Remember: Consider what happens when
the input to a neuron is always positive...
What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)
[Diagram: the allowed gradient update directions span only two quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path]
48
49. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
(Assume X [NxD] is data matrix, each example in a row)
Data Preprocessing
49
50. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Data Preprocessing
In practice, you may also see PCA and Whitening of the data
- Decorrelated data (after PCA): data has a diagonal covariance matrix
- Whitened data: covariance matrix is the identity matrix
50
51. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Data Preprocessing
Before normalization: classification loss
very sensitive to changes in weight matrix;
hard to optimize
After normalization: less sensitive to small
changes in weights; easier to optimize
51
52. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
TLDR: In practice for images: center only
e.g. consider the CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
Not common to do PCA or whitening
52
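A minimal sketch of the ResNet-style option (subtract per-channel mean, divide by per-channel std), assuming images stored as an [N, 32, 32, 3] float array; the variable names are illustrative.

```python
import numpy as np

# X_train: hypothetical array of CIFAR-10-like images with shape [N, 32, 32, 3]
X_train = np.random.rand(50000, 32, 32, 3).astype(np.float32)

mean = X_train.mean(axis=(0, 1, 2))   # per-channel mean: 3 numbers
std = X_train.std(axis=(0, 1, 2))     # per-channel std: 3 numbers

X_train_norm = (X_train - mean) / std
# Reuse the *training* statistics for validation/test data:
# X_test_norm = (X_test - mean) / std
```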
54. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization
- Q: what happens when W = constant init is used?
54
55. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
55
56. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
Works ~okay for small networks, but problems with
deeper networks.
56
57. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
57
What will happen to the activations for the last layer?
58. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
All activations tend to zero
for deeper network layers
Q: What do the gradients
dL/dW look like?
58
59. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096
All activations tend to zero
for deeper network layers
Q: What do the gradients
dL/dW look like?
A: All zero, no learning =(
59
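The experiment on these slides can be reproduced with a few lines of NumPy; this sketch assumes a tanh network with hidden size 4096 and weights drawn from a zero-mean Gaussian with std 0.01, as described above.

```python
import numpy as np

hidden_size, num_layers = 4096, 6
x = np.random.randn(16, hidden_size)           # a small batch of random inputs
for layer in range(num_layers):
    W = 0.01 * np.random.randn(hidden_size, hidden_size)
    x = np.tanh(x @ W)
    print(f"layer {layer + 1}: activation std = {x.std():.6f}")

# The std shrinks toward zero with depth; since dL/dW involves the (near-zero)
# activations, the gradients vanish and the network stops learning.
```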
60. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
60
What will happen to the activations for the last layer?
61. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
All activations saturate
Q: What do the gradients
look like?
61
62. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05
All activations saturate
Q: What do the gradients
look like?
A: Local gradients all zero,
no learning =(
62
63. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: “Xavier” Initialization
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
63
64. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Just right”: Activations are
nicely scaled for all layers!
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
64
65. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Just right”: Activations are
nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
65
66. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
66
67. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
67
68. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
We want: Var(y) = Var(xi)
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
68
69. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
We want: Var(y) = Var(xi)
Var(y) = Var(x1w1 + x2w2 + ... + xDin wDin)   [substituting value of y]
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
69
70. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
We want: Var(y) = Var(xi)
Var(y) = Var(x1w1 + x2w2 + ... + xDin wDin)
       = Din Var(xi wi)   [Assume all xi, wi are iid]
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
70
71. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
We want: Var(y) = Var(xi)
Var(y) = Var(x1w1 + x2w2 + ... + xDin wDin)
       = Din Var(xi wi)
       = Din Var(xi) Var(wi)   [Assume all xi, wi are zero mean]
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
71
72. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
“Xavier” initialization:
std = 1/sqrt(Din)
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
Weight Initialization: “Xavier” Initialization
Let: y = x1w1 + x2w2 + ... + xDin wDin
Assume: Var(x1) = Var(x2) = ... = Var(xDin)
We want: Var(y) = Var(xi)
Var(y) = Var(x1w1 + x2w2 + ... + xDin wDin)
       = Din Var(xi wi)   [Assume all xi, wi are iid]
       = Din Var(xi) Var(wi)   [Assume all xi, wi are zero mean]
So, Var(y) = Var(xi) only when Var(wi) = 1/Din
“Just right”: Activations are nicely scaled for all layers!
For conv layers, Din is filter_size^2 * input_channels
72
73. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: What about ReLU?
Change from tanh to ReLU
73
74. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: What about ReLU?
Xavier assumes zero
centered activation function
Activations collapse to zero
again, no learning =(
Change from tanh to ReLU
74
75. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
“Just right”: Activations are
nicely scaled for all layers!
75
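A small sketch contrasting the two initializations just discussed; the dimensions and the ReLU stack are illustrative, not the exact plots from the slides.

```python
import numpy as np

def xavier_init(Din, Dout):
    # std = 1 / sqrt(Din), derived above for zero-centered activations like tanh
    return np.random.randn(Din, Dout) / np.sqrt(Din)

def kaiming_init(Din, Dout):
    # std = sqrt(2 / Din); the factor 2 compensates for ReLU zeroing half its inputs
    return np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)

x = np.random.randn(16, 4096)
for layer in range(6):
    x = np.maximum(0, x @ kaiming_init(4096, 4096))               # ReLU layers
    print(f"layer {layer + 1}: activation std = {x.std():.3f}")   # stays roughly constant with depth
```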
76. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks
by Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
All you need is a good init, Mishkin and Matas, 2015
Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
76
77. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Training vs. Testing Error
77
78. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
78
Beyond Training Error
Better optimization algorithms
help reduce training loss
But we really care about error on
new data - how to reduce the gap?
79. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
79
Early Stopping: Always do this
[Plots: training loss vs. iteration; train and val accuracy vs. iteration. Stop training where val accuracy stops improving.]
Stop training the model when accuracy on the validation set decreases
Or train for a long time, but always keep track of the model snapshot
that worked best on val
80. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
80
1. Train multiple independent models
2. At test time average their results
(Take average of predicted probability distributions, then choose argmax)
Enjoy 2% extra performance
Model Ensembles
81. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
81
How to improve single-model performance?
Regularization
82. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Regularization: Add term to loss
82
In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)
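As a sketch (not framework code), adding an L2 penalty to a loss and its gradient looks like this; `reg` is the regularization strength and the helper name is hypothetical.

```python
import numpy as np

def add_l2_regularization(data_loss, dW_data, W, reg=1e-4):
    """Add an L2 penalty reg * sum(W*W) to a data loss and its gradient w.r.t. W."""
    loss = data_loss + reg * np.sum(W * W)
    dW = dW_data + 2 * reg * W       # gradient of the penalty term
    return loss, dW

# In most frameworks this shows up as the optimizer's "weight decay" setting.
```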
83. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
83
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
84. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
84
Regularization: Dropout
Example forward pass with a 3-layer network using dropout (see the sketch below)
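The forward-pass code this slide refers to is not reproduced in this extraction; below is a minimal NumPy sketch of a 3-layer dropout forward pass in that spirit. The weights W1..W3 and biases b1..b3 are assumed to be initialized elsewhere, and p is the keep probability.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active; higher p = less dropout

def train_step(X, W1, b1, W2, b2, W3, b3):
    H1 = np.maximum(0, X @ W1 + b1)
    U1 = np.random.rand(*H1.shape) < p    # first dropout mask
    H1 = H1 * U1                          # drop!
    H2 = np.maximum(0, H1 @ W2 + b2)
    U2 = np.random.rand(*H2.shape) < p    # second dropout mask
    H2 = H2 * U2                          # drop!
    out = H2 @ W3 + b3
    return out
```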
85. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
85
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation;
Prevents co-adaptation of features
[Diagram: features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look” feed into the cat score; dropout randomly crosses some of them out]
86. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
86
Regularization: Dropout
How can this possibly be a good idea?
Another interpretation:
Dropout is training a large ensemble of
models (that share parameters).
Each binary mask is one model
An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks!
Only ~ 10^82 atoms in the universe...
87. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
87
Dropout: Test time
Dropout makes our output random!
y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask
Want to “average out” the randomness at test-time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
But this integral seems hard …
88. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
88
Dropout: Test time
Want to approximate
the integral
Consider a single neuron.
a
x y
w1 w2
89. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
89
Dropout: Test time
Want to approximate
the integral
Consider a single neuron.
At test time we have:
a
x y
w1 w2
90. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
90
Dropout: Test time
Want to approximate
the integral
Consider a single neuron.
At test time we have:
During training we have:
a
x y
w1 w2
91. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
91
Dropout: Test time
Want to approximate
the integral
Consider a single neuron with inputs x, y, weights w1, w2, and output a.
At test time we have: a = w1x + w2y
During training (dropping each input with probability 0.5) we have:
E[a] = 1/4 (w1x + w2y) + 1/4 (w1x + 0y) + 1/4 (0x + w2y) + 1/4 (0x + 0y) = 1/2 (w1x + w2y)
At test time, multiply by dropout probability
92. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
92
Dropout: Test time
At test time all neurons are active always
=> We must scale the activations so that for each neuron:
output at test time = expected output at training time
93. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
93
Dropout Summary
drop in train time
scale at test time
94. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
94
More common: “Inverted dropout”
test time is unchanged!
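A sketch of inverted dropout under the same assumptions as before (weights defined elsewhere): the mask is applied and scaled by 1/p during training, so the test-time forward pass needs no change.

```python
import numpy as np

p = 0.5  # keep probability

def train_step(X, W1, b1, W2, b2, W3, b3):
    H1 = np.maximum(0, X @ W1 + b1)
    H1 = H1 * (np.random.rand(*H1.shape) < p) / p   # drop AND scale by 1/p at train time...
    H2 = np.maximum(0, H1 @ W2 + b2)
    H2 = H2 * (np.random.rand(*H2.shape) < p) / p
    return H2 @ W3 + b3

def predict(X, W1, b1, W2, b2, W3, b3):
    H1 = np.maximum(0, X @ W1 + b1)                 # ...so the test-time pass is unchanged
    H2 = np.maximum(0, H1 @ W2 + b2)
    return H2 @ W3 + b3
```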
95. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
95
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
96. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
96
Regularization: A common pattern
Training: Add some kind
of randomness
Testing: Average out randomness
(sometimes approximate)
Example: Batch
Normalization
Training:
Normalize using
stats from random
minibatches
Testing: Use fixed
stats to normalize
97. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
97
Load image
and label
“cat”
CNN
Compute
loss
Regularization: Data Augmentation
This image by Nikita is
licensed under CC-BY 2.0
98. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
98
Regularization: Data Augmentation
Load image
and label
“cat”
CNN
Compute
loss
Transform image
99. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
99
Data Augmentation
Horizontal Flips
100. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
100
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
101. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
101
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops
ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
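A sketch of the training-time recipe above, assuming HxWx3 uint8 images and using Pillow for the resize; the helper name `resize_short_side` is illustrative.

```python
import numpy as np
from PIL import Image

def resize_short_side(img, L):
    """Resize an HxWx3 uint8 image so its shorter side equals L (bilinear)."""
    h, w = img.shape[:2]
    scale = L / min(h, w)
    new_size = (int(round(w * scale)), int(round(h * scale)))  # PIL expects (W, H)
    return np.asarray(Image.fromarray(img).resize(new_size, Image.BILINEAR))

def augment(img, out_size=224):
    L = np.random.randint(256, 481)               # 1. pick random L in [256, 480]
    img = resize_short_side(img, L)               # 2. resize so the short side = L
    H, W = img.shape[:2]
    y = np.random.randint(0, H - out_size + 1)    # 3. sample a random 224 x 224 patch
    x = np.random.randint(0, W - out_size + 1)
    crop = img[y:y + out_size, x:x + out_size]
    if np.random.rand() < 0.5:                    # plus a random horizontal flip
        crop = crop[:, ::-1]
    return crop
```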
102. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
102
Data Augmentation
Color Jitter
Simple: Randomize
contrast and brightness
103. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
103
Data Augmentation
Color Jitter
Simple: Randomize
contrast and brightness
More Complex:
1. Apply PCA to all [R, G, B]
pixels in training set
2. Sample a “color offset”
along principal component
directions
3. Add offset to all pixels of a
training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc)
104. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
104
Data Augmentation
Get creative for your problem!
Examples of data augmentations:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
105. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
105
Automatic Data Augmentation
Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019
106. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
106
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
107. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
107
Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
108. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
108
Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions
Graham, “Fractional Max Pooling”, arXiv 2014
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
109. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
109
Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
110. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
110
Regularization: Cutout
Training: Set random image regions to zero
Testing: Use full image
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
DeVries and Taylor, “Improved Regularization of
Convolutional Neural Networks with Cutout”, arXiv 2017
Works very well for small datasets like CIFAR,
less common for large datasets like ImageNet
111. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
111
Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
Randomly blend the pixels
of pairs of training images,
e.g. 40% cat, 60% dog
CNN
Target label:
cat: 0.4
dog: 0.6
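A minimal sketch of mixup on a single pair of examples; `alpha` controls the Beta distribution the blend weight is drawn from (alpha = 1 gives uniform blends), and the integer labels are converted to a soft target like the cat 0.4 / dog 0.6 example above.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, num_classes=10):
    """Blend one pair of training examples; y1 and y2 are integer class labels."""
    lam = np.random.beta(alpha, alpha)     # blend weight, e.g. 0.4
    x = lam * x1 + (1 - lam) * x2          # pixel-wise blend of the two images
    target = np.zeros(num_classes)
    target[y1] += lam                      # e.g. cat: 0.4
    target[y2] += 1 - lam                  # e.g. dog: 0.6
    return x, target
```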
112. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
112
Regularization - In practice
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup
- Consider dropout for large
fully-connected layers
- Batch normalization and data
augmentation almost always a
good idea
- Try cutout and mixup especially
for small classification datasets
113. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
113
Choosing Hyperparameters
(without tons of GPUs)
114. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
114
Choosing Hyperparameters
Step 1: Check initial loss
Turn off weight decay, sanity check loss at initialization
e.g. log(C) for softmax with C classes
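For example, for a 10-class softmax classifier the expected initial loss is about log(10):

```python
import numpy as np

C = 10                 # e.g. CIFAR-10
print(np.log(C))       # ~2.3026: expected softmax loss when every class gets probability ~1/C
```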
115. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
115
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of
training data (~5-10 minibatches); fiddle with architecture,
learning rate, weight initialization
Loss not going down? LR too low, bad initialization
Loss explodes to Inf or NaN? LR too high, bad initialization
116. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
116
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training
data, turn on small weight decay, find a learning rate that
makes the loss drop significantly within ~100 iterations
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
117. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
117
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around
what worked from Step 3, train a few models for ~1-5 epochs.
Good weight decay to try: 1e-4, 1e-5, 0
118. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
118
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Pick best models from Step 4, train them for longer (~10-20
epochs) without learning rate decay
119. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
119
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
120. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
120
Accuracy
time
Train
Accuracy still going up, you
need to train longer
Val
121. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
121
Accuracy
time
Train
Huge train / val gap means
overfitting! Increase regularization,
get more data
Val
122. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
122
Accuracy
time
Train
No gap between train / val means
underfitting: train longer, use a
bigger model
Val
123. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
123
Losses may be noisy, use a
scatter plot and also plot moving
average to see trends better
Look at learning curves!
Training Loss Train / Val Accuracy
124. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
124
Cross-validation
We develop "command
centers" to visualize all our
models training with different
hyperparameters
check out the Weights & Biases tool
125. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
125
You can plot all your loss curves for different hyperparameters on a single plot
126. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
126
Don't look at accuracy or loss curves for too long!
127. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
127
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Step 7: GOTO step 5
128. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
128
Random Search vs. Grid Search
[Figure: grid layout vs. random layout of trials over an important parameter (x-axis) and an unimportant parameter (y-axis); with the same budget, random search tries many more distinct values of the important parameter]
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017
Random Search for
Hyper-Parameter Optimization
Bergstra and Bengio, 2012
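A sketch of random search in practice: sample learning rate and weight decay log-uniformly for each trial. The ranges and the training routine are placeholders.

```python
import numpy as np

def sample_hparams():
    lr = 10 ** np.random.uniform(-4, -1)             # learning rate in [1e-4, 1e-1]
    weight_decay = 10 ** np.random.uniform(-6, -2)   # weight decay in [1e-6, 1e-2]
    return lr, weight_decay

for trial in range(10):
    lr, wd = sample_hparams()
    # val_acc = train_for_a_few_epochs(lr, wd)       # hypothetical training routine
    print(f"trial {trial}: lr = {lr:.2e}, weight decay = {wd:.2e}")
```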
129. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Transfer learning
129
130. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
You need a lot of data if you want to train/use CNNs?
130
131. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Transfer Learning with CNNs
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-1000
1. Train on Imagenet
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
131
132. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Transfer Learning with CNNs
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-1000
1. Train on Imagenet
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-C
2. Small Dataset (C classes)
Freeze these
Reinitialize
this and train
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
132
133. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Transfer Learning with CNNs
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-1000
1. Train on Imagenet
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-C
2. Small Dataset (C classes)
Freeze these
Reinitialize
this and train
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for
Generic Visual Recognition”, ICML 2014
Finetuned from AlexNet
133
134. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Transfer Learning with CNNs
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-1000
1. Train on Imagenet
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-C
2. Small Dataset (C classes)
Freeze these
Reinitialize
this and train
Image
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
FC-4096
FC-4096
FC-C
3. Bigger dataset
Freeze these
Train these
With bigger
dataset, train
more layers
Lower learning rate
when finetuning;
1/10 of original LR
is good starting
point
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
134
135. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?
1. Find a very large dataset that has
similar data, train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained
models so you don’t need to train your own
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
135
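A short PyTorch sketch of case 2 above (small dataset with C classes) using a pretrained model from torchvision's model zoo; the exact `weights=` argument depends on your torchvision version, and C is your dataset's class count.

```python
import torch.nn as nn
import torch.optim as optim
import torchvision

C = 10  # number of classes in your small dataset

# 1. Start from a model pretrained on ImageNet (argument name/value may differ by torchvision version).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# 2. Freeze the pretrained backbone and reinitialize only the last layer.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, C)

optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
# 3. With a bigger dataset, unfreeze more layers and fine-tune them at ~1/10 of the original LR.
```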
136. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
136
Summary
- Improve your training error:
- Optimizers
- Learning rate schedules
- Improve your test error:
- Regularization
- Choosing Hyperparameters
137. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
Summary
We looked in detail at:
- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier/Kaiming init)
- Batch Normalization (use this!)
- Transfer learning (use this if you can!)
TLDRs
137
138. Fei-Fei Li, Yunzhu Li, Ruohan Gao Lecture 7 - April 25, 2023
138
Next time: Recurrent Neural Networks