L03-1
Sze and Emer
6.5930/1
Hardware Architectures for Deep Learning
Popular DNN Models
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 13, 2023
L03-2
Sze and Emer
Goals of Today’s Lecture
• Last lecture covered the building blocks of CNNs; this lecture
describes how we put these blocks together to form a CNN.
• Overview of various well-known CNN models
– CNN ‘models’ are also referred to as ‘network architectures’; however, we prefer to
use the term ‘model’ in this class to avoid overloading the term ‘architecture’
• We group the CNN models into two categories
– High Accuracy CNN Models: Designed to maximize accuracy to compete in the
ImageNet Challenge
– Efficient CNN Models: Designed to reduce the number of weights and
operations (specifically MACs) while maintaining accuracy
February 13, 2023
L03-3
Sze and Emer
High Accuracy CNN Models
L03-4
Sze and Emer
Popular CNNs
• LeNet (1998)
• AlexNet (2012)
• OverFeat (2013)
• VGGNet (2014)
• GoogLeNet (2014)
• ResNet (2015)
[Figure: Top-5 error (%) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2012-2015, versus human-level performance, for AlexNet, Clarifai, OverFeat, VGGNet, GoogLeNet, and ResNet]
[Russakovsky, IJCV 2015]
February 13, 2023
L03-5
Sze and Emer
MNIST
http://yann.lecun.com/exdb/mnist/
Digit Classification
28x28 pixels (B&W)
10 Classes
60,000 Training
10,000 Testing
February 13, 2023
L03-6
Sze and Emer
LeNet-5
[LeCun, Proceedings of the IEEE, 1998]
CONV Layers: 2
Fully Connected Layers: 2
Weights: 60k
MACs: 341k
Sigmoid used for non-linearity
Digit Classification!
(MNIST Dataset)
[Figure: six 5x5 filters → 2x2 average pooling → sixteen 5x5 filters → 2x2 average pooling]
February 13, 2023
L03-7
Sze and Emer
LeNet-5
http://yann.lecun.com/exdb/lenet/
February 13, 2023
L03-8
Sze and Emer
ImageNet
http://www.image-net.org/challenges/LSVRC/
Image Classification
~256x256 pixels (color)
1000 Classes
1.3M Training
100,000 Testing (50,000 Validation)
Image Source: http://karpathy.github.io/
For the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the accuracy of the classification task is reported based on top-1 and top-5 error.
February 13, 2023
L03-9
Sze and Emer
AlexNet
CONV Layers: 5
Fully Connected Layers: 3
Weights: 61M
MACs: 724M
ReLU used for non-linearity
ILSVRC12 Winner
Uses Local Response Normalization (LRN)
[Figure: input image and the output feature maps (ofmap1 to ofmap8) produced by each AlexNet layer]
[Krizhevsky, NeurIPS 2012]
February 13, 2023
L03-10
Sze and Emer
AlexNet
CONV Layers: 5
Fully Connected Layers: 3
Weights: 61M
MACs: 724M
ReLU used for non-linearity
ILSVRC12 Winner
Uses Local Response Normalization (LRN)
Per-layer filters:
L1: 96 filters (11x11x3)
L2: 256 filters (5x5x48)
L3: 384 filters (3x3x256)
L4: 384 filters (3x3x192)
L5: 256 filters (3x3x192)
L6: 4096 filters (13x13x256)
L7: 4096 (4096x1)
L8: 1000 (4096x1)
[Krizhevsky, NeurIPS 2012]
February 13, 2023
L03-11
Sze and Emer
Large Sizes with Varying Shapes
Layer | Filter Size (RxS) | # Filters (M) | # Channels (C) | Stride
1 | 11x11 | 96 | 3 | 4
2 | 5x5 | 256 | 48 | 1
3 | 3x3 | 384 | 256 | 1
4 | 3x3 | 384 | 192 | 1
5 | 3x3 | 256 | 192 | 1
AlexNet Convolutional Layer Configurations
Layer 1: 34k params, 105M MACs; Layer 2: 307k params, 224M MACs; Layer 3: 885k params, 150M MACs
[Krizhevsky, NeurIPS 2012]
February 13, 2023
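To tie the table to the parameter and MAC counts above, here is a minimal Python sketch; the 55x55, 27x27, and 13x13 output fmap sizes are assumptions taken from the original AlexNet paper rather than from the table.

```python
# Costs of a CONV layer: weights = R*S*C*M, MACs = weights * P*Q
# (one multiply-accumulate per weight per output position)
def conv_costs(R, S, C, M, P, Q):
    weights = R * S * C * M
    macs = weights * P * Q
    return weights, macs

# AlexNet layers 1-3 from the table; output sizes P x Q assumed to be
# 55x55, 27x27, and 13x13 (values from the original AlexNet paper)
layers = {"Layer 1": (11, 11, 3, 96, 55, 55),
          "Layer 2": (5, 5, 48, 256, 27, 27),
          "Layer 3": (3, 3, 256, 384, 13, 13)}
for name, dims in layers.items():
    w, m = conv_costs(*dims)
    print(f"{name}: {w/1e3:.0f}k weights, {m/1e6:.0f}M MACs")
# Layer 1: 35k weights, 105M MACs   (slide rounds to 34k)
# Layer 2: 307k weights, 224M MACs
# Layer 3: 885k weights, 150M MACs
```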
L03-12
Sze and Emer
AlexNet
CONV Layers: 5
Fully Connected Layers: 3
Weights: 61M
MACs: 724M
ReLU used for non-linearity [Krizhevsky, NeurIPS 2012]
ILSVRC12 Winner
Uses Local Response Normalization (LRN)
[Figure: AlexNet pipeline: 224x224 input image → Conv (11x11) + Non-Linearity + Norm (LRN) + Max Pooling → Conv (5x5) + Non-Linearity + Norm (LRN) + Max Pooling → Conv (3x3) + Non-Linearity → Conv (3x3) + Non-Linearity → Conv (3x3) + Non-Linearity + Max Pooling → Fully Connected + Non-Linearity → Fully Connected + Non-Linearity → Fully Connected → 1000 scores]
# of weights per layer (L1-L8): 34k, 307k, 885k, 664k, 442k, 37.7M, 16.8M, 4.1M
February 13, 2023
L03-13
Sze and Emer
VGG-16
CONV Layers: 13
Fully Connected Layers: 3
Weights: 138M
MACs: 15.5G [Simonyan, ICLR 2015]
Image Source: http://www.cs.toronto.edu/~frossard/post/vgg16/
Also, 19-layer version
More Layers → Deeper!
February 13, 2023
L03-14
Sze and Emer
Stacked Filters
• Deeper network means more weights
• Use stack of smaller filters (3x3) to cover the same
receptive field with fewer filter weights
Example: [Figure: a 5x5 filter applied over a 5x5 receptive field of the input produces the output value 31 at the center of that field]
February 13, 2023
L03-15
Sze and Emer
Stacked Filters
• Deeper network means more weights
• Use stack of smaller filters (3x3) to cover the same
receptive field with fewer filter weights
Example: [Figure: the first 3x3 filter (values 0 1 0 / 1 1 1 / 0 1 0) is applied to the same input to produce an intermediate feature map]
February 13, 2023
L03-19
Sze and Emer
VGGNet: Stacked Filters
• Deeper network means more weights
• Use stack of smaller filters (3x3) to cover the same receptive field
with fewer filter weights
• Non-linear activation inserted between each filter
[Figure: applying 3x3 filter1 and then 3x3 filter2 over the same region reproduces the output value 31 at the center, matching the single 5x5 filter]
Example: 5x5 filter (25 weights) → two 3x3 filters (18 weights)
February 13, 2023
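A minimal sketch of that weight count, generalized to a layer with C input channels and M filters (the function names and the C = M choice below are illustrative):

```python
# Weights needed to cover a 5x5 receptive field:
#   one 5x5 layer          -> 5*5*C*M
#   two stacked 3x3 layers -> 3*3*C*M + 3*3*M*M  (second layer sees M channels)
def weights_5x5(C, M):
    return 5 * 5 * C * M

def weights_stacked_3x3(C, M):
    return 3 * 3 * C * M + 3 * 3 * M * M

print(weights_5x5(1, 1), weights_stacked_3x3(1, 1))            # 25 18, as on the slide
print(weights_5x5(256, 256) / weights_stacked_3x3(256, 256))   # ~1.39x fewer when C = M
```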
L03-20
Sze and Emer
GoogLeNet/Inception (v1)
CONV Layers: 21 (depth), 57 (total)
Fully Connected Layers: 1
Weights: 7.0M
MACs: 1.43G [Szegedy, CVPR 2015]
Also, v2, v3 and v4
ILSVRC14 Winner
9 Inception Blocks*
3 CONV layers, 1 FC layer (reduced from 3)
Auxiliary Classifiers (help with training, not used during inference)
*referred to as inception module in textbook
February 13, 2023
L03-21
Sze and Emer
GoogLeNet/Inception (v1)
parallel* filters of different sizes have the effect of processing the image at different scales
1x1 ‘bottleneck’ to
reduce number of
weights and
multiplications
Inception
Block
CONV Layers: 21 (depth), 57 (total)
Fully Connected Layers: 1
Weights: 7.0M
MACs: 1.43G
Also, v2, v3 and v4
ILSVRC14 Winner
[Szegedy, CVPR 2015]
*also referred to as “multi-branch” and
“split-transform-merge”
February 13, 2023
L03-22
Sze and Emer
1x1 Bottleneck
Modified image from source:
Stanford cs231n
[Lin, Network in Network, ICLR 2014]
Use 1x1 filter to capture cross-channel correlation, but not spatial correlation.
Can be used to reduce the number of channels in next layer (compress).
(Filter dimensions for bottleneck: R=1, S=1, C > M)
[Figure: filter1 (1x1x64) applied across a 56x56 input fmap with 64 channels produces one 56x56 output channel]
February 13, 2023
L03-23
Sze and Emer
1x1 Bottleneck
Modified image from source:
Stanford cs231n
[Figure: filter2 (1x1x64) produces the second 56x56 output channel]
Use 1x1 filter to capture cross-channel correlation, but not spatial correlation.
Can be used to reduce the number of channels in next layer (compress).
(Filter dimensions for bottleneck: R=1, S=1, C > M)
[Lin, Network in Network, ICLR 2014]
February 13, 2023
L03-24
Sze and Emer
1x1 Bottleneck
Modified image from source:
Stanford cs231n
[Figure: applying 32 such 1x1x64 filters produces a 56x56x32 output, compressing 64 channels to 32]
Use 1x1 filter to capture cross-channel correlation, but not spatial correlation.
Can be used to reduce the number of channels in next layer (compress).
(Filter dimensions for bottleneck: R=1, S=1, C > M)
[Lin, Network in Network, ICLR 2014]
February 13, 2023
L03-25
Sze and Emer
GoogLeNet:1x1 Bottleneck
1x1 ‘bottleneck’ to
reduce number of
weights and
multiplications
Inception
Block
[Szegedy, CVPR 2015]
Apply 1x1 bottleneck before ‘large’ convolution filters.
Reduce weights such that entire CNN can be trained on one GPU.
Number of multiplications reduced from 854M → 358M
February 13, 2023
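A minimal sketch of why the 1x1 bottleneck cuts multiplications; the layer shape below (28x28 output, 192 input channels, 16 bottleneck channels, 32 5x5 filters) is illustrative and is not the exact GoogLeNet accounting behind the 854M → 358M numbers:

```python
# MACs of a CONV layer = R*S*C*M*P*Q (P x Q output positions, M output channels)
def conv_macs(R, S, C, M, P, Q):
    return R * S * C * M * P * Q

P = Q = 28       # output fmap size (illustrative)
C_in, C_mid, M_out = 192, 16, 32

direct = conv_macs(5, 5, C_in, M_out, P, Q)           # 5x5 directly on 192 channels
bottleneck = (conv_macs(1, 1, C_in, C_mid, P, Q) +    # 1x1 'bottleneck' compresses to 16 channels
              conv_macs(5, 5, C_mid, M_out, P, Q))    # 5x5 now sees only 16 channels
print(f"{direct/1e6:.0f}M -> {bottleneck/1e6:.0f}M MACs")   # 120M -> 12M
```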
L03-26
Sze and Emer
Reduce Cost of FC Layers
The first FC layer accounts for a significant portion of the weights: 38M of 61M for AlexNet
[Figure: AlexNet pipeline (as on slide L03-12) annotated with per-layer costs]
# of MACs per layer (L1-L8): 105M, 224M, 150M, 112M, 75M, 38M, 17M, 4M
# of weights per layer (L1-L8): 34k, 307k, 885k, 664k, 442k, 37.7M (the ~38M first FC layer), 16.8M, 4.1M
[Krizhevsky, NeurIPS 2012]
February 13, 2023
L03-27
Sze and Emer
Global Pooling
GoogLeNet uses global pooling to reduce number of FC layers from three to one
[Lin, ICLR 2014]
Use global pooling to reduce the size of the input to the first FC layer, and thus the size of the FC layer itself
[Figure: Step 1: Global Pooling reduces the HxWxC input fmap to a 1x1xC fmap; Step 2: the FC layer (filters of size 1x1xC) then produces the 1x1 output fmaps]
Size of FC layer: HxWxCxM → 1x1xCxM
February 13, 2023
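A minimal sketch of that size reduction using AlexNet-like numbers (the 6x6x256 fmap entering the first FC layer and the 4096 outputs are taken from AlexNet; GoogLeNet's own dimensions differ):

```python
H, W, C = 6, 6, 256   # fmap entering the first FC layer (AlexNet-like)
M = 4096              # FC layer outputs

fc_weights = H * W * C * M       # FC directly on the H x W x C fmap
fc_weights_gp = 1 * 1 * C * M    # FC after global pooling to 1 x 1 x C
print(f"{fc_weights/1e6:.1f}M -> {fc_weights_gp/1e6:.1f}M weights "
      f"({fc_weights // fc_weights_gp}x fewer)")
# 37.7M -> 1.0M weights (36x fewer)
```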
L03-28
Sze and Emer
ResNet
Image Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
Go Deeper!
ILSVRC15 Winner
(better than human level accuracy!)
February 13, 2023
L03-29
Sze and Emer
ResNet: Training
Training and validation error increase with more layers; this is due to the vanishing gradient problem, not overfitting.
Introduce the short cut block to address this!
Without shortcut With shortcut
Thin curves denote training error, and bold curves denote validation error.
[He, CVPR 2016]
February 13, 2023
L03-30
Sze and Emer
ResNet: Short Cut Block
Helps address the vanishing gradient challenge for
training very deep networks
[Figure: ResNet-34: 1 CONV layer, 16 short cut blocks, 1 FC layer]
[Figure: short cut block: the input x passes through 3x3 CONV → ReLU → 3x3 CONV to produce F(x); a skip connection (also referred to as a highway) adds the identity x, and the sum passes through ReLU, so H(x) = F(x) + x and the block learns the residual F(x) = H(x) - x]
[He, CVPR 2016]
February 13, 2023
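A minimal PyTorch-style sketch of the short cut block (simplified: batch normalization is omitted and the input and output channel counts are assumed equal):

```python
import torch
import torch.nn as nn

class ShortCutBlock(nn.Module):
    """H(x) = F(x) + x, where F is two 3x3 convolutions (simplified basic block)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))   # F(x): the learned residual
        return self.relu(f + x)                    # skip connection adds the identity

x = torch.randn(1, 64, 56, 56)
print(ShortCutBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```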
L03-31
Sze and Emer
ResNet: Bottleneck
Apply 1x1 bottleneck to reduce computation and size
Also makes network deeper (ResNet-34 → ResNet-50)
compress: C > M; expand: C < M
[He, CVPR 2016]
February 13, 2023
L03-32
Sze and Emer
ResNet-50
CONV Layers: 49
Fully Connected Layers: 1
Weights: 25.5M
MACs: 3.9G
Also, 34-, 152-, and 1202-layer versions
ILSVRC15 Winner
[Figure: ResNet-50: 1 CONV layer, 16 short cut blocks, 1 FC layer; short cut block detail shown]
[He, CVPR 2016]
February 13, 2023
L03-33
Sze and Emer
Summary of Popular CNNs
Metrics | LeNet-5 | AlexNet | VGG-16 | GoogLeNet (v1) | ResNet-50
Top-5 error (%) | n/a | 16.4 | 7.4 | 6.7 | 5.3
Input Size | 28x28 | 227x227 | 224x224 | 224x224 | 224x224
# of CONV Layers | 2 | 5 | 13 | 21 (depth) | 49
Filter Sizes | 5 | 3, 5, 11 | 3 | 1, 3, 5, 7 | 1, 3, 7
# of Channels | 1, 6 | 3 - 256 | 3 - 512 | 3 - 1024 | 3 - 2048
# of Filters | 6, 16 | 96 - 384 | 64 - 512 | 64 - 384 | 64 - 2048
Stride | 1 | 1, 4 | 1 | 1, 2 | 1, 2
# of CONV Weights | 2.6k | 2.3M | 14.7M | 6.0M | 23.5M
# of CONV MACs | 283k | 666M | 15.3G | 1.43G | 3.86G
# of FC layers | 2 | 3 | 3 | 1 | 1
# of FC Weights | 58k | 58.6M | 124M | 1M | 2M
# of FC MACs | 58k | 58.6M | 124M | 1M | 2M
Total Weights | 60k | 61M | 138M | 7M | 25.5M
Total MACs | 341k | 724M | 15.5G | 1.43G | 3.9G
CONV Layers increasingly important!
February 13, 2023
L03-34
Sze and Emer
Summary of Popular CNNs
• AlexNet
– First CNN Winner of ILSVRC
– Uses LRN (deprecated after this)
• VGG-16
– Goes Deeper (16+ layers)
– Uses only 3x3 filters (stack for larger filters)
• GoogLeNet (v1)
– Reduces weights with Inception and uses Global Pooling so that only one FC layer is needed
– Inception Block: 1x1 and parallel connections
– Batch Normalization
• ResNet
– Goes Deeper (24+ layers)
– Short cut Block: Skip connections
February 13, 2023
L03-35
Sze and Emer
DenseNet
[Huang, CVPR 2017]
Feature maps are concatenated rather than added.
Break into blocks to limit depth and thus size of combined feature map.
More Skip Connections!
Connections come not only from the previous layer but from many past layers, to strengthen feature map propagation and feature reuse.
Dense
Block
Transition layers
February 13, 2023
L03-36
Sze and Emer
DenseNet
Note: 1 MAC = 2 FLOPs
Higher accuracy (lower top-1 error) than ResNet with fewer weights and multiplications
[Huang, CVPR 2017]
February 13, 2023
L03-37
Sze and Emer
Wide ResNet
Increase width (# of filters) rather than depth of network
• 50-layer wide ResNet outperforms 152-layer original ResNet
• Increasing width instead of depth is also more parallel-friendly
Image Source: Stanford cs231n
[Zagoruyko, BMVC 2016]
February 13, 2023
L03-38
Sze and Emer
Squeeze and Excitation
[Figure: 1) Global Pooling (squeeze) summarizes the HxWxC input fmap into a 1x1xC vector; 2) Multiple FC Layers (excitation) turn that vector into 1x1xC dynamic weights; 3) a depth-wise convolution applies these weights to scale each channel of the input fmap, producing the HxWxC output fmap]
Depth-wise convolution with dynamic weights, where the weights change based on the input feature map.
• Squeeze: Summarize each channel of the input feature map with global pooling
• Excitation: Determine weights using FC layers to increase attention on certain channels of the input feature map
[Hu, CVPR 2018]
Attention (input → dynamic weights)
Used by SENet
ILSVRC 2017 Winner
February 13, 2023
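A minimal PyTorch-style sketch of a squeeze-and-excitation block (the reduction ratio of 16 and the exact layer choices are illustrative assumptions, not the precise SENet configuration):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # excitation FC 1
        self.fc2 = nn.Linear(channels // reduction, channels)   # excitation FC 2

    def forward(self, x):                    # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))               # squeeze: global average pooling -> (N, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # dynamic per-channel weights
        return x * w.view(n, c, 1, 1)        # excite: scale each channel of the input fmap

x = torch.randn(1, 64, 56, 56)
print(SqueezeExcite(64)(x).shape)            # torch.Size([1, 64, 56, 56])
```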
L03-39
Sze and Emer
Convolution versus Attention Mechanism
• Convolution
– Only models dependencies between spatial neighbors
– Uses a sparsely connected layer over spatial neighbors; no support for dependencies outside the spatial dimensions of the filter (R x S)
• Attention
– “Allows modeling of [global] dependencies without regard to their distance” [Vaswani,
NeurIPS 2017]
– However, fully connected layer too expensive; develop mechanism to bias “the allocation of
available computational resources towards the most informative components of a signal”
[Hu, CVPR 2018]
• The Transformer is a type of DNN that is built entirely using the attention mechanism
[Vaswani, NeurIPS 2017] (Next Lecture)
February 13, 2023
L03-40
Sze and Emer
Efficient CNN Models
L03-41
Sze and Emer
Accuracy vs. Weight & OPs
[Bianco, IEEE Access 2018]
February 13, 2023
L03-42
Sze and Emer
Manual Network Design
• Reduce Spatial Size (R, S)
– stacked filters
• Reduce Channels (C)
– 1x1 convolution, grouped convolution
• Reduce Filters (M)
– feature map reuse across layers
[Figure: convolution layer shape parameters: M filters of size RxSxC applied to N input fmaps of size HxWxC produce N output fmaps of size PxQxM]
February 13, 2023
L03-43
Sze and Emer
Reduce Spatial Size (R, S): Stacked Small Filters
[Figure: decompose a 5x5 filter into two 3x3 filters applied sequentially (VGG), or into a 5x1 filter followed by a 1x5 filter applied sequentially (separable filters, GoogLeNet/Inception v3)]
Replace a large filter with a series of smaller filters (reduces degrees of freedom)
February 13, 2023
L03-44
Sze and Emer
Example: Inception V3
Go deeper (v1: 22 layers → v3: 40+ layers) by reducing the number
of weights per filter using filter decomposition
~3.5% higher accuracy than v1
[Szegedy, CVPR 2016]
5x5 filter → 3x3 filters; 3x3 filter → 3x1 and 1x3 filters
Separable filters
February 13, 2023
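A minimal sketch of the per-channel weight counts behind these decompositions:

```python
# Weights per filter position, per input channel
print(5 * 5)          # 25: one 5x5 filter
print(3 * 3 + 3 * 3)  # 18: two stacked 3x3 filters (5x5 -> 3x3, 3x3)
print(3 * 3)          #  9: one 3x3 filter
print(3 * 1 + 1 * 3)  #  6: separable 3x1 then 1x3 filters (3x3 -> 3x1, 1x3)
```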
L03-45
Sze and Emer
Reduce Channels (C): 1x1 Convolution
[Figure: 1x1 bottleneck as used in GoogLeNet (compress, C > M) and in ResNet (compress, C > M, then expand, C < M)]
• Use 1x1 (bottleneck) filter to capture cross-channel correlation, but not spatial correlation
• Reduce the number of channels in next layer (compress), where C > M
February 13, 2023
L03-46
Sze and Emer
Example: SqueezeNet
[Iandola, ICLR 2017]
Fire Block
Reduce number of weights by reducing number of input
channels by “squeezing” with 1x1
50x fewer weights than AlexNet (no accuracy loss)
However, 1.2x more operations than AlexNet*
*for SqueezeNetv1.0
(reduce operations by 2x in
SqueezeNetv1.1)
February 13, 2023
L03-47
Sze and Emer
Reduce Channels (C): Grouped Convolutions
Grouped convolutions reduce the number of weights and multiplications at the
cost of not sharing information between groups
• Divide filters into groups (G) operating on subset of channels.
• Each group has M/G filters and processes C/G channels.
[Figure: grouped convolution with G=2: the M=4 filters are split into two groups of 2; each filter has C/2 channels (RxSxC/2) and operates on half of the channels of the HxWxC input fmap, together producing the PxQx4 output fmap (in this example, N=1 and M=4)]
Example for G=2: Each filter requires 2x fewer weights and MACs (C → C/2)
February 13, 2023
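A minimal PyTorch-style sketch of the G=2 case (the channel counts are illustrative); the weight tensor of the grouped layer has C/2 input channels per filter:

```python
import torch.nn as nn

C, M = 64, 128   # input channels, number of filters (illustrative)

full = nn.Conv2d(C, M, kernel_size=3, padding=1, bias=False)               # G = 1
grouped = nn.Conv2d(C, M, kernel_size=3, padding=1, groups=2, bias=False)  # G = 2

print(full.weight.shape)      # torch.Size([128, 64, 3, 3])
print(grouped.weight.shape)   # torch.Size([128, 32, 3, 3]) -> 2x fewer weights and MACs
```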
L03-48
Sze and Emer
Reduce Channels (C): Grouped Convolutions
Two ways of mixing information from groups
Shuffle Operation
(Mix in multiple steps)
ShuffleNet
fmap 0
layer 1
fmap 1
layer 2
fmap 2
Pointwise (1x1) Convolution
(Mix in one step)
MobileNet
Also referred to as depth-wise separable:
Decouple the cross-channel correlations and spatial correlations in the feature maps of the CNN
[Figure: an RxSx1 filter applied per channel (spatial correlation), followed by M pointwise 1x1xC filters (cross-channel correlation)]
February 13, 2023
L03-49
Sze and Emer
Depth-wise Convolutions
The extreme case of Grouped Convolutions is Depth-wise Convolutions, where the number of groups (G) equals the number of channels (C) (i.e., one input channel per group)
[Figure: depth-wise convolution: Group 1 through Group C each apply one RxS filter to a single channel of the HxWxC input fmap, producing one PxQ output channel]
Typically, M=C
(but does not have to be)
February 13, 2023
L03-50
Sze and Emer
Example: MobileNets
[Howard, arXiv 2017]
Depth-wise filter decomposition: an RxSxC depth-wise convolution followed by a pointwise (1x1xC) convolution with M filters
Reduction in MACs:
HWCRSM / HWC(RS+M) = RSM / (RS+M)
February 13, 2023
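A minimal sketch evaluating the reduction factor above (the layer shape is illustrative):

```python
# MACs for a standard convolution vs. a depth-wise separable one
def standard_macs(H, W, C, R, S, M):
    return H * W * C * R * S * M

def separable_macs(H, W, C, R, S, M):
    depthwise = H * W * C * R * S    # one RxS filter per input channel
    pointwise = H * W * C * M        # pointwise (1x1) convolution mixes channels
    return depthwise + pointwise

H = W = 112; C = 64; R = S = 3; M = 128   # illustrative MobileNet-like layer
print(standard_macs(H, W, C, R, S, M) / separable_macs(H, W, C, R, S, M))
# ~8.4x fewer MACs; in general the reduction is R*S*M / (R*S + M)
```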
L03-51
Sze and Emer
MobileNets: Comparison
Comparison with other CNN Models
[Image source: Github]
[Howard, arXiv 2017]
February 13, 2023
L03-52
Sze and Emer
Example: Xception
• An Inception block based on depth-wise separable convolutions
• Claims to learn richer features with similar number of weights as Inception V3 (i.e., more
efficient use of weights)
– Similar performance on ImageNet; 4.3% better on larger dataset (JFT)
– However, 1.5x more operations required than Inception V3
[Chollet, CVPR 2017]
Spatial correlation
Cross-channel correlation
February 13, 2023
L03-53
Sze and Emer
Example: ResNeXt
ResNet ResNeXt
[Xie, CVPR 2017]
Used by ILSVRC 2017 Winner SENet
Inspired by Inception’s “split-transform-merge”
Increase number of convolution groups (G) (referred to as cardinality in the paper)
instead of depth and width of network
February 13, 2023
L03-54
Sze and Emer
Example: ResNeXt
Improved accuracy vs. ‘complexity’ tradeoff compared to
other ResNet based models
[Xie, CVPR 2017]
February 13, 2023
L03-55
Sze and Emer
Shuffle Operation
[Figure: shuffle operation: after a grouped convolution, the output channels (1, 2, 3, 4) are reordered to (1, 3, 2, 4) so that the next layer's Group 1 and Group 2 each receive channels from both previous groups]
February 13, 2023
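A minimal NumPy sketch of the shuffle operation (reshape into G groups, transpose, flatten back); the tiny 4-channel fmap is illustrative:

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (C, H, W) feature map; interleave channels across groups."""
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)   # split channels into G groups
    x = x.transpose(1, 0, 2, 3)                # swap the group and within-group axes
    return x.reshape(c, h, w)                  # flatten back: channels are now interleaved

x = np.arange(4).reshape(4, 1, 1)              # channels [0, 1, 2, 3] from two groups
print(channel_shuffle(x, 2).ravel())           # [0 2 1 3] -> next layer's groups see both
```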
L03-56
Sze and Emer
Example: ShuffleNet
Shuffle order such that channels are not isolated across groups
(up to 4% increase in accuracy)
[Zhang, CVPR 2018]
No interaction between
channels from different groups
Shuffling allows interaction between
channels from different groups
February 13, 2023
L03-57
Sze and Emer
AlexNet: Grouped Convolutions
Split into 2
Groups
Split into 2
Groups
AlexNet uses grouped convolutions to train on two separate GPUs
(Drawback: correlation between channels of different groups is not used)
Mix
Information
(3x3 CONV)
Mix
Information
(FC)
February 13, 2023
L03-58
Sze and Emer
Reduce Filters (M): Feature Map Reuse
[Figure: filters across layers L1, L2, L3: a layer computes only K new output channels and reuses M-K channels from previously processed layers, giving an output fmap with M channels]
Reuse (M-K) channels in feature maps from
previously processed layers
[Huang, CVPR 2017]
DenseNet reuses feature maps from multiple layers
February 13, 2023
L03-59
Sze and Emer
Neural Architecture Search (NAS)
3x3? 5x5?
128 Filters?
Pool? CONV?
Rather than handcrafting the model, automatically search for it
February 13, 2023
L03-60
Sze and Emer
Neural Architecture Search (NAS)
• Three main components:
– Search Space (what is the set of all samples)
– Optimization Algorithm (where to sample)
– Performance Evaluation (how to evaluate samples)
Key Metrics: Achievable DNN accuracy and required search time
[Figure: NAS loop: the Optimization Algorithm picks the next location to sample from the Search Space; the sampled network goes to Performance Evaluation, whose evaluation result feeds back to the Optimization Algorithm; the best network found is returned as the Final Network]
February 13, 2023
L03-61
Sze and Emer
Evaluate NAS Search Time
time_nas = num_samples × time_sample
time_nas ∝ size_search_space × (num_alg_tuning / efficiency_alg) × (time_eval + time_train)
(1) Shrink the search space
(2) Improve the optimization algorithm
(3) Simplify the performance evaluation
Goal: Improve the efficiency of NAS in the three main components
[Figure: Search Space, Optimization Algorithm, Performance Evaluation]
February 13, 2023
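A minimal sketch of how the three levers enter the search-time estimate above (all numbers are illustrative placeholders, not measured values):

```python
# time_nas ~ size_search_space * (num_alg_tuning / efficiency_alg) * (time_eval + time_train)
def nas_time(search_space, alg_tuning, alg_efficiency, t_eval, t_train):
    return search_space * (alg_tuning / alg_efficiency) * (t_eval + t_train)

baseline = nas_time(1e4, 10, 1.0, t_eval=0.1, t_train=2.0)   # illustrative GPU-hours
improved = nas_time(1e3, 10, 5.0, t_eval=0.1, t_train=0.5)   # smaller space, better algorithm,
                                                             # simpler (proxy) evaluation
print(f"{baseline:.0f} vs {improved:.0f} ({baseline/improved:.0f}x faster)")  # 210000 vs 1200 (175x)
```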
L03-62
Sze and Emer
(1) Shrink the Search Space
• Trade the breadth of models for
search speed
• May limit the performance that can be
achieved
• Use domain knowledge from manual
network design to help guide the
reduction of the search space
February 13, 2023
[Figure: within the model universe, shrinking the search space reduces the number of samples needed but may exclude the optimal model]
L03-63
Sze and Emer
(1) Shrink the Search Space
• Search space = layer operations + connections between layers
February 13, 2023
• Identity
• 1x3 then 3x1 convolution
• 1x7 then 7x1 convolution
• 3x3 dilated convolution
• 1x1 convolution
• 3x3 convolution
• 3x3 separable convolution
• 5x5 separable convolution
• 3x3 average pooling
• 3x3 max pooling
• 5x5 max pooling
• 7x7 max pooling
Common layer operations
[Zoph, CVPR 2018]
L03-64
Sze and Emer
(1) Shrink the Search Space
• Search space = layer operations + connections between layers
February 13, 2023
Image Source: [Zoph, CVPR 2018]
Smaller Search Space
L03-65
Sze and Emer
(2) Improve Optimization Algorithm
February 13, 2023
Random, Gradient Descent, Coordinate Descent, Reinforcement Learning, Bayesian, Evolutionary
L03-66
Sze and Emer
(3) Simplify the Performance Evaluation
• NAS needs only the rank of the performance values
• Method 1: approximate accuracy
February 13, 2023
Proxy Task (e.g., smaller resolution, simpler tasks), Early Termination (stop training earlier), Accuracy Prediction (extrapolate accuracy from early iterations)
L03-67
Sze and Emer
(3) Simplify the Performance Evaluation
• NAS needs only the rank of the performance values
• Method 2: approximate weights
February 13, 2023
Copy Weights (reuse weights from other similar networks), Estimate Weights (infer the weights from the previous feature maps)
L03-68
Sze and Emer
(3) Simplify the Performance Evaluation
• NAS needs only the rank of the performance values
• Method 3: approximate metrics (e.g., latency, energy)
February 13, 2023
Proxy Metric (use an easy-to-compute metric, e.g., # MACs, to approximate the target, e.g., latency), Look-Up Table (use table lookup)
L03-69
Sze and Emer
Design Considerations for NAS
• The components may not be chosen individually
– Some optimization algorithms limit the search space
– Type of performance metric may limit the selection of the optimization
algorithms
• Commonly overlooked properties
– The complexity of implementation
– The ease of tuning hyperparameters of the optimization
– The probability of convergence to a good architecture
February 13, 2023
L03-70
Sze and Emer
Example: NASNet
• Search Space: Build model from popular layers
• Identity
• 1x3 then 3x1 convolution
• 1x7 then 7x1 convolution
• 3x3 dilated convolution
• 1x1 convolution
• 3x3 convolution
• 3x3 separable convolution
• 5x5 separable convolution
• 3x3 average pooling
• 3x3 max pooling
• 5x5 max pooling
• 7x7 max pooling
[Zoph, CVPR 2018]
February 13, 2023
L03-71
Sze and Emer
NASNet: Learned Convolutional Cells
[Zoph, CVPR 2018]
February 13, 2023
L03-72
Sze and Emer
NASNet: Comparison with Existing Networks
Learned models have improved accuracy vs. ‘complexity’ tradeoff
compared to handcrafted models
[Zoph, CVPR 2018]
February 13, 2023
L03-73
Sze and Emer
EfficientNet
[Tan, ICML 2019]
Uniformly scale all dimensions (depth, width, and resolution), since there is an interplay between the different dimensions.
Use NAS to search for a baseline model and then scale it up.
February 13, 2023
L03-74
Sze and Emer
Summary
• Approaches used to improve accuracy by popular CNN models in the ImageNet
Challenge
– Go deeper (i.e., more layers)
– Stack smaller filters and apply 1x1 bottlenecks to reduce number of weights such that the
deeper models can fit into a GPU (faster training)
– Use multiple connections across layers (e.g., parallel and short cut)
• Efficient models aim to reduce number of weights and number of operations
– Most use some form of filter decomposition (spatial, depth and channel)
– Note: Number of weights and operations does not directly map to storage, speed and
power/energy. Depends on hardware!
• Filter shapes vary across layers and models
– Need flexible hardware!
February 13, 2023
L03-75
Sze and Emer
Warning!
• These works often use number of weights and operations to
measure “complexity”
• Number of weights provides an indication of storage cost for
inference
• However later in the course, we will see that
– Number of operations doesn’t directly translate to latency/throughput
– Number of weights and operations doesn’t directly translate to power/energy
consumption
• Understanding the underlying hardware is important for evaluating
the impact of these “efficient” CNN models
February 13, 2023
L03-76
Sze and Emer
References
• Book: Chapter 2 & 9
– https://doi.org/10.1007/978-3-031-01766-7
• Other Works Cited in Lecture (increase accuracy)
– LeNet: LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proc. IEEE 1998.
– AlexNet: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NeurIPS.
2012.
– VGGNet: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." ICLR 2015.
– Network in Network: Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014
– GoogleNet: Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern
recognition. CVPR 2015.
– ResNet: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern
recognition. CVPR 2016.
– DenseNet: Huang, Gao, et al. "Densely connected convolutional networks." CVPR 2017
– Wide ResNet: Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." BMVC 2016.
– ResNeXt: Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." CVPR 2017
– SENets: Hu, Jie et al., “Squeeze-and-Excitation Networks,” CVPR 2018
– NFNet: Brock, Andrew, et al., “High-Performance Large-Scale Image Recognition Without Normalization,” arXiv 2021
February 13, 2023
L03-77
Sze and Emer
References
• Other Works Cited in Lecture (increase efficiency)
– InceptionV3: Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." CVPR 2016.
– SqueezeNet: Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model
size." ICLR 2017.
– Xception: Chollet, François. "Xception: Deep Learning with Depthwise Separable Convolutions." CVPR 2017
– MobileNet: Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications."
arXiv preprint arXiv:1704.04861 (2017).
– MobileNetv2: Sandler, Mark et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” CVPR 2018
– MobileNetv3: Howard, Andrew et al., “Searching for MobileNetV3,” ICCV 2019
– ShuffleNet: Zhang, Xiangyu, et al. "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices."
CVPR 2018
– Learning Network Architecture: Zoph, Barret, et al. "Learning Transferable Architectures for Scalable Image Recognition."
CVPR 2018
• Other Works Cited in Lecture (Increase accuracy and efficiency)
– EfficientNet: Tan, Mingxing, et al. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” ICML 2019
February 13, 2023

More Related Content

Similar to L03.pdf

The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...Evaluation the affects of mimo based rayleigh network cascaded with unstable ...
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...eSAT Publishing House
 
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...VLSICS Design
 
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...VLSICS Design
 
Erca energy efficient routing and reclustering
Erca energy efficient routing and reclusteringErca energy efficient routing and reclustering
Erca energy efficient routing and reclusteringaciijournal
 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Austin Benson
 
A 3-D printed lightweight x-band waveguide filter based
A 3-D printed lightweight x-band waveguide filter basedA 3-D printed lightweight x-band waveguide filter based
A 3-D printed lightweight x-band waveguide filter basedKailash Bhati
 
Multiuser detection new
Multiuser detection newMultiuser detection new
Multiuser detection newNebiye Slmn
 
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BIST
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BISTA REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BIST
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BISTjedt_journal
 
Multilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrrMultilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrreSAT Journals
 
Multilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrrMultilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrreSAT Publishing House
 
High performance novel dual stack gating technique
High performance novel dual stack gating techniqueHigh performance novel dual stack gating technique
High performance novel dual stack gating techniqueeSAT Publishing House
 
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIPMODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIPVLSICS Design
 
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...ijsc
 
Hardware Complexity of Microprocessor Design According to Moore's Law
Hardware Complexity of Microprocessor Design According to Moore's LawHardware Complexity of Microprocessor Design According to Moore's Law
Hardware Complexity of Microprocessor Design According to Moore's Lawcsandit
 
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONS
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONSMICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONS
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONSjmicro
 
Simulation and performance analysis of blast
Simulation and performance analysis of blastSimulation and performance analysis of blast
Simulation and performance analysis of blasteSAT Publishing House
 
Performance Analysis of Rake Receivers in IR–UWB System
Performance Analysis of Rake Receivers in IR–UWB System Performance Analysis of Rake Receivers in IR–UWB System
Performance Analysis of Rake Receivers in IR–UWB System IOSR Journals
 

Similar to L03.pdf (20)

The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...Evaluation the affects of mimo based rayleigh network cascaded with unstable ...
Evaluation the affects of mimo based rayleigh network cascaded with unstable ...
 
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
 
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
CONCURRENT TERNARY GALOIS-BASED COMPUTATION USING NANO-APEX MULTIPLEXING NIBS...
 
Erca energy efficient routing and reclustering
Erca energy efficient routing and reclusteringErca energy efficient routing and reclustering
Erca energy efficient routing and reclustering
 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)
 
H010223640
H010223640H010223640
H010223640
 
A 3-D printed lightweight x-band waveguide filter based
A 3-D printed lightweight x-band waveguide filter basedA 3-D printed lightweight x-band waveguide filter based
A 3-D printed lightweight x-band waveguide filter based
 
Multiuser detection new
Multiuser detection newMultiuser detection new
Multiuser detection new
 
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BIST
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BISTA REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BIST
A REVIEW OF LOW POWERAND AREA EFFICIENT FSM BASED LFSR FOR LOGIC BIST
 
Multilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrrMultilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrr
 
Multilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrrMultilayered low pass microstrip filter using csrr
Multilayered low pass microstrip filter using csrr
 
High performance novel dual stack gating technique
High performance novel dual stack gating techniqueHigh performance novel dual stack gating technique
High performance novel dual stack gating technique
 
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIPMODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
 
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...
Comparative Analysis of Route Information Based Enhanced Divide and Rule Stra...
 
Hardware Complexity of Microprocessor Design According to Moore's Law
Hardware Complexity of Microprocessor Design According to Moore's LawHardware Complexity of Microprocessor Design According to Moore's Law
Hardware Complexity of Microprocessor Design According to Moore's Law
 
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONS
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONSMICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONS
MICROSTRIP COUPLED LINE FILTER DESIGN FOR ULTRA WIDEBAND APPLICATIONS
 
routing algo n
routing algo                                nrouting algo                                n
routing algo n
 
Simulation and performance analysis of blast
Simulation and performance analysis of blastSimulation and performance analysis of blast
Simulation and performance analysis of blast
 
Performance Analysis of Rake Receivers in IR–UWB System
Performance Analysis of Rake Receivers in IR–UWB System Performance Analysis of Rake Receivers in IR–UWB System
Performance Analysis of Rake Receivers in IR–UWB System
 

More from TRNHONGLINHBCHCM (20)

L06-handout.pdf
L06-handout.pdfL06-handout.pdf
L06-handout.pdf
 
L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
 
L09-handout.pdf
L09-handout.pdfL09-handout.pdf
L09-handout.pdf
 
L06.pdf
L06.pdfL06.pdf
L06.pdf
 
L07.pdf
L07.pdfL07.pdf
L07.pdf
 
L08-handout.pdf
L08-handout.pdfL08-handout.pdf
L08-handout.pdf
 
L05.pdf
L05.pdfL05.pdf
L05.pdf
 
L11.pdf
L11.pdfL11.pdf
L11.pdf
 
L09.pdf
L09.pdfL09.pdf
L09.pdf
 
L04.pdf
L04.pdfL04.pdf
L04.pdf
 
L12.pdf
L12.pdfL12.pdf
L12.pdf
 
L01.pdf
L01.pdfL01.pdf
L01.pdf
 
Projects.pdf
Projects.pdfProjects.pdf
Projects.pdf
 
L10.pdf
L10.pdfL10.pdf
L10.pdf
 
L08.pdf
L08.pdfL08.pdf
L08.pdf
 
PaperReview.pdf
PaperReview.pdfPaperReview.pdf
PaperReview.pdf
 
L13.pdf
L13.pdfL13.pdf
L13.pdf
 
L07-handout.pdf
L07-handout.pdfL07-handout.pdf
L07-handout.pdf
 
L10.pdf
L10.pdfL10.pdf
L10.pdf
 
L10-handout.pdf
L10-handout.pdfL10-handout.pdf
L10-handout.pdf
 

Recently uploaded

Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdfKamal Acharya
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Servicemeghakumariji156
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilVinayVitekari
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiessarkmank1
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 

Recently uploaded (20)

FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 

L03.pdf

  • 1. L03-1 Sze and Emer 6.5930/1 Hardware Architectures for Deep Learning Popular DNN Models Joel Emer and Vivienne Sze Massachusetts Institute of Technology Electrical Engineering & Computer Science February 13, 2023
  • 2. L03-2 Sze and Emer Goals of Today’s Lecture • Last lecture covered the building blocks of CNNs; this lecture describes how we put these blocks together to form a CNN. • Overview of various well-known CNN models – CNN ‘models’ are also referred to as ‘network architectures’; however, we prefer to use the term ‘model’ in this class to avoid overloading the term ‘architecture’ • We group the CNN models into two categories – High Accuracy CNN Models: Designed to maximize accuracy to compete in the ImageNet Challenge – Efficient CNN Models: Designed to reduce the number of weights and operations (specifically MACs) while maintaining accuracy February 13, 2023
  • 3. L03-3 Sze and Emer High Accuracy CNN Models
  • 4. L03-4 Sze and Emer Popular CNNs • LeNet (1998) • AlexNet (2012) • OverFeat (2013) • VGGNet (2014) • GoogleNet (2014) • ResNet (2015) 0 2 4 6 8 10 12 14 16 18 2012 2013 2014 2015 Human Accuracy (Top 5 error) [Russakovsky, IJCV 2015] AlexNet OverFeat GoogLeNet ResNet Clarifai VGGNet ImageNet Large Scale Visual Recognition Challenge (ILSVRC) February 13, 2023
  • 5. L03-5 Sze and Emer MNIST http://yann.lecun.com/exdb/mnist/ Digit Classification 28x28 pixels (B&W) 10 Classes 60,000 Training 10,000 Testing February 13, 2023
  • 6. L03-6 Sze and Emer LeNet-5 [Lecun, Proceedings of the IEEE, 1998] CONV Layers: 2 Fully Connected Layers: 2 Weights: 60k MACs: 341k Sigmoid used for non-linearity Digit Classification! (MNIST Dataset) 2x2 average pooling sixteen 5x5 filters 2x2 average pooling six 5x5 filters February 13, 2023
  • 8. L03-8 Sze and Emer ImageNet http://www.image-net.org/challenges/LSVRC/ Image Classification ~256x256 pixels (color) 1000 Classes 1.3M Training 100,000 Testing (50,000 Validation) Image Source: http://karpathy.github.io/ For ImageNet Large Scale Visual Recognition Challenge (ILSVRC) accuracy of classification task reported based on top-1 and top-5 error February 13, 2023
  • 9. L03-9 Sze and Emer AlexNet CONV Layers: 5 Fully Connected Layers: 3 Weights: 61M MACs: 724M ReLU used for non-linearity ILSCVR12 Winner Uses Local Response Normalization (LRN) input ofmap1 ofmap2 ofmap3 ofmap4 ofmap5 ofmap6 ofmap7 ofmap8 [Krizhevsky, NeurIPS 2012] February 13, 2023
  • 10. L03-10 Sze and Emer AlexNet CONV Layers: 5 Fully Connected Layers: 3 Weights: 61M MACs: 724M ReLU used for non-linearity ILSCVR12 Winner Uses Local Response Normalization (LRN) L1 96 filters (11x11x3) L2 256 filters (5x5x48) L3 384 filters (3x3x256) L4 384 filters (3x3x192) L5 384 filters (3x3x192) L6 4096 filters (13x13x256) L7 4096 (4096x1) L8 1000 (4096x1) [Krizhevsky, NeurIPS 2012] February 13, 2023
  • 11. L03-11 Sze and Emer Large Sizes with Varying Shapes Layer Filter Size (RxS) # Filters (M) # Channels (C) Stride 1 11x11 96 3 4 2 5x5 256 48 1 3 3x3 384 256 1 4 3x3 384 192 1 5 3x3 256 192 1 AlexNet Convolutional Layer Configurations 34k Params 307k Params 885k Params Layer 1 Layer 2 Layer 3 105M MACs 224M MACs 150M MACs [Krizhevsky, NeurIPS 2012] February 13, 2023
  • 12. L03-12 Sze and Emer AlexNet CONV Layers: 5 Fully Connected Layers: 3 Weights: 61M MACs: 724M ReLU used for non-linearity [Krizhevsky, NeurIPS 2012] ILSCVR12 Winner Uses Local Response Normalization (LRN) L1 L2 L3 L4 L5 L6 L7 1000 scores 224x224 Input Image Conv (11x11) Non-Linearity Norm (LRN) Max Pool Conv (5x5) Non-Linearity Norm (LRN) Max Pooling Conv (3x3) Non-Linearity Conv (3x3) Non-Linearity Conv (3x3) Non-Linearity Max Pooling Fully Connect Non-Linearity Fully Connect Non-Linearity Fully Connect 34k 307k 885k 664k 442k 37.7M 16.8M 4.1M # of weights February 13, 2023
  • 13. L03-13 Sze and Emer VGG-16 CONV Layers: 13 Fully Connected Layers: 3 Weights: 138M MACs: 15.5G [Simonyan, ICLR 2015] Image Source: http://www.cs.toronto.edu/~frossard/post/vgg16/ Also, 19-layer version More Layers à Deeper! February 13, 2023
  • 14. L03-14 Sze and Emer Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights 3 0 4 2 0 1 0 0 0 1 2 3 2 0 0 1 2 2 2 0 3 5 0 1 0 1 3 0 0 1 2 2 1 0 1 0 0 1 0 3 1 0 5 2 0 3 0 5 8 5x5 filter Example 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 31 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 February 13, 2023
  • 15. L03-15 Sze and Emer Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 0 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 7 8 8 0 0 0 0 5 6 7 3 0 0 1 6 5 7 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 3x3 filter1 Example 0 1 0 1 1 1 0 1 0 filter (3x3) February 13, 2023
  • 16. L03-16 Sze and Emer Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 0 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 7 8 8 0 0 0 0 5 6 7 3 0 0 1 6 5 7 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 3x3 filter1 Example 0 1 0 1 1 1 0 1 0 filter (3x3) February 13, 2023
  • 17. L03-17 Sze and Emer Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 0 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 7 8 8 0 0 0 0 5 6 7 3 0 0 1 6 5 7 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 3x3 filter1 Example 0 1 0 1 1 1 0 1 0 filter (3x3) February 13, 2023
  • 18. L03-18 Sze and Emer Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 0 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 7 8 8 0 0 0 0 5 6 7 3 0 0 1 6 5 7 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 3x3 filter1 Example 0 1 0 1 1 1 0 1 0 filter (3x3) February 13, 2023
  • 19. L03-19 Sze and Emer VGGNet: Stacked Filters • Deeper network means more weights • Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights • Non-linear activation inserted between each filter 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 0 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 7 8 8 0 0 0 0 5 6 7 3 0 0 1 6 5 7 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 3x3 filter2 0 0 0 0 0 0 0 0 0 1 2 3 2 0 0 1 2 2 2 0 0 0 0 1 31 1 3 0 0 1 2 2 1 0 0 0 0 1 0 3 1 0 0 0 0 0 0 0 0 Example: 5x5 filter (25 weights) à two 3x3 filters (18 weights) 3x3 filter1 February 13, 2023
  • 20. L03-20 Sze and Emer GoogLeNet/Inception (v1) CONV Layers: 21 (depth), 57 (total) Fully Connected Layers: 1 Weights: 7.0M MACs: 1.43G [Szegedy, CVPR 2015] Also, v2, v3 and v4 ILSVRC14 Winner 9 Inception Blocks* 3 CONV layers 1 FC layer (reduced from 3) Auxiliary Classifiers (helps with training, not used during inference) *referred to as inception module in textbook February 13, 2023
  • 21. L03-21 Sze and Emer GoogLeNet/Inception (v1) parallel* filters of different size have the effect of processing image at different scales 1x1 ‘bottleneck’ to reduce number of weights and multiplications Inception Block CONV Layers: 21 (depth), 57 (total) Fully Connected Layers: 1 Weights: 7.0M MACs: 1.43G Also, v2, v3 and v4 ILSVRC14 Winner [Szegedy, CVPR 2015] *also referred to as “multi-branch” and “split-transform-merge” February 13, 2023
  • 22. L03-22 Sze and Emer 1x1 Bottleneck Modified image from source: Stanford cs231n [Lin, Network in Network, ICLR 2014] Use 1x1 filter to capture cross-channel correlation, but not spatial correlation. Can be used to reduce the number of channels in next layer (compress). (Filter dimensions for bottleneck: R=1, S=1, C > M) 1 56 56 filter1 (1x1x64) February 13, 2023
  • 23. L03-23 Sze and Emer 1x1 Bottleneck Modified image from source: Stanford cs231n filter2 (1x1x64) 2 56 56 Use 1x1 filter to capture cross-channel correlation, but not spatial correlation. Can be used to reduce the number of channels in next layer (compress). (Filter dimensions for bottleneck: R=1, S=1, C > M) [Lin, Network in Network, ICLR 2014] February 13, 2023
  • 24. L03-24 Sze and Emer 1x1 Bottleneck Modified image from source: Stanford cs231n 32 56 56 Use 1x1 filter to capture cross-channel correlation, but not spatial correlation. Can be used to reduce the number of channels in next layer (compress). (Filter dimensions for bottleneck: R=1, S=1, C > M) [Lin, Network in Network, ICLR 2014] February 13, 2023
  • 25. L03-25 Sze and Emer GoogLeNet:1x1 Bottleneck 1x1 ‘bottleneck’ to reduce number of weights and multiplications Inception Block [Szegedy, CVPR 2015] Apply 1x1 bottleneck before ‘large’ convolution filters. Reduce weights such that entire CNN can be trained on one GPU. Number of multiplications reduced from 854M à 358M February 13, 2023
  • 26. L03-26 Sze and Emer Reduce Cost of FC Layers First FC layer accounts for a significant portion of weights (38M of 61M for AlexNet) Per-layer # of MACs (L1–L8): 105M, 224M, 150M, 112M, 75M, 38M, 17M, 4M Per-layer # of weights (L1–L8): 34k, 307k, 885k, 664k, 442k, 37.7M, 16.8M, 4.1M Layer stack: 224x224 Input Image → Conv (11x11), Non-Linearity, Norm (LRN), Max Pool → Conv (5x5), Non-Linearity, Norm (LRN), Max Pooling → Conv (3x3), Non-Linearity → Conv (3x3), Non-Linearity → Conv (3x3), Non-Linearity, Max Pooling → Fully Connect, Non-Linearity → Fully Connect, Non-Linearity → Fully Connect → 1000 scores [Krizhevsky, NeurIPS 2012] February 13, 2023
  • 27. L03-27 Sze and Emer Global Pooling GoogLeNet uses global pooling to reduce number of FC layers from three to one [Lin, ICLR 2014] Use Global Pooling to reduce size of input to the first FC layer and the FC layer itself Step 1: Global Pooling (HxWxC input fmap → 1x1xC fmap) Step 2: FC Layer (1x1xC filters) Size of FC layer: HxWxCxM → 1x1xCxM February 13, 2023
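A minimal sketch of the two steps, assuming PyTorch; the H=7, W=7, C=1024, M=1000 shapes are illustrative assumptions, not taken from the slide:

```python
import torch
import torch.nn as nn

H, W, C, M = 7, 7, 1024, 1000        # hypothetical last-layer shapes
x = torch.randn(1, C, H, W)

gap = nn.AdaptiveAvgPool2d(1)        # Step 1: global pooling -> 1x1xC
fc = nn.Linear(C, M)                 # Step 2: FC layer with only C*M weights

scores = fc(gap(x).flatten(1))
print(scores.shape)                  # torch.Size([1, 1000])
# FC weights: C*M ~= 1M instead of H*W*C*M ~= 50M without global pooling
```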
  • 28. L03-28 Sze and Emer ResNet Image Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf Go Deeper! ILSVRC15 Winner (better than human level accuracy!) February 13, 2023
  • 29. L03-29 Sze and Emer ResNet: Training Training and validation error increase with more layers; this is due to the vanishing gradient problem, not overfitting. Introduce short cut block to address this! Without shortcut With shortcut Thin curves denote training error, and bold curves denote validation error. [He, CVPR 2016] February 13, 2023
  • 30. L03-30 Sze and Emer ResNet: Short Cut Block Helps address the vanishing gradient challenge for training very deep networks 1 CONV layer 1 FC layer 16 Short Cut Blocks ResNet-34 3x3 CONV ReLU ReLU 3x3 CONV + x F(x) H(x) = F(x) + x Identity x Learns Residual F(x) = H(x) - x Skip Connection (also referred to as highway) [He, CVPR 2016] February 13, 2023
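A minimal sketch of the short cut block, assuming PyTorch (batch normalization and dimension-matching projections are omitted for brevity; the channel count is an assumption): the block learns the residual F(x), and the skip connection adds x back so the output is H(x) = F(x) + x.

```python
import torch
import torch.nn as nn

class ShortcutBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x): two 3x3 CONV layers
        return self.relu(f + x)                   # H(x) = F(x) + x (skip connection)

y = ShortcutBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # same shape as the input
```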
  • 31. L03-31 Sze and Emer ResNet: Bottleneck Apply 1x1 bottleneck to reduce computation and size Also makes network deeper (ResNet-34 → ResNet-50) compress C > M expand C < M [He, CVPR 2016] February 13, 2023
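Sketch of the bottleneck branch, assuming PyTorch (the 256/64 channel split is illustrative): compress with a 1x1 convolution, apply the 3x3 on the reduced channels, expand back with a 1x1, and add the skip connection as in the previous sketch.

```python
import torch
import torch.nn as nn

def bottleneck_branch(c_in=256, c_mid=64):
    # F(x) for the bottleneck block; the skip connection adds x as before
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False),              # 1x1 compress: 256 -> 64 (C > M)
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False),  # 3x3 on the reduced channels
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_in, 1, bias=False),              # 1x1 expand: 64 -> 256 (C < M)
    )

x = torch.randn(1, 256, 56, 56)
y = bottleneck_branch()(x) + x      # H(x) = F(x) + x
print(y.shape)
```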
  • 32. L03-32 Sze and Emer ResNet-50 CONV Layers: 49 Fully Connected Layers: 1 Weights: 25.5M MACs: 3.9G Also, 34-, 152-, and 1202-layer versions ILSVRC15 Winner 1 CONV layer 1 FC layer 16 Short Cut Blocks ResNet-50 Short Cut Block [He, CVPR 2016] February 13, 2023
  • 33. L03-33 Sze and Emer Summary of Popular CNNs
Metrics            | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50
Top-5 error        | n/a     | 16.4     | 7.4      | 6.7            | 5.3
Input Size         | 28x28   | 227x227  | 224x224  | 224x224        | 224x224
CONV layers:
  # of CONV Layers | 2       | 5        | 13       | 21 (depth)     | 49
  Filter Sizes     | 5       | 3, 5, 11 | 3        | 1, 3, 5, 7     | 1, 3, 7
  # of Channels    | 1, 6    | 3 - 256  | 3 - 512  | 3 - 1024       | 3 - 2048
  # of Filters     | 6, 16   | 96 - 384 | 64 - 512 | 64 - 384       | 64 - 2048
  Stride           | 1       | 1, 4     | 1        | 1, 2           | 1, 2
  # of Weights     | 2.6k    | 2.3M     | 14.7M    | 6.0M           | 23.5M
  # of MACs        | 283k    | 666M     | 15.3G    | 1.43G          | 3.86G
FC layers:
  # of FC Layers   | 2       | 3        | 3        | 1              | 1
  # of Weights     | 58k     | 58.6M    | 124M     | 1M             | 2M
  # of MACs        | 58k     | 58.6M    | 124M     | 1M             | 2M
Total Weights      | 60k     | 61M      | 138M     | 7M             | 25.5M
Total MACs         | 341k    | 724M     | 15.5G    | 1.43G          | 3.9G
CONV Layers increasingly important! February 13, 2023
  • 34. L03-34 Sze and Emer Summary of Popular CNNs • AlexNet – First CNN Winner of ILSVRC – Uses LRN (deprecated after this) • VGG-16 – Goes Deeper (16+ layers) – Uses only 3x3 filters (stack for larger filters) • GoogLeNet (v1) – Reduces weights with Inception and uses Global Pooling so that only one FC layer is needed – Inception Block: 1x1 and parallel connections – Batch Normalization • ResNet – Goes Deeper (24+ layers) – Short cut Block: Skip connections February 13, 2023
  • 35. L03-35 Sze and Emer DenseNet [Huang, CVPR 2017] Feature maps are concatenated rather than added. Break into blocks to limit depth and thus size of combined feature map. More Skip Connections! Connections not only from previous layer, but many past layers to strengthen feature map propagation and feature reuse. Dense Block Transition layers February 13, 2023
  • 36. L03-36 Sze and Emer DenseNet Note: 1 MAC = 2 FLOPS Higher accuracy than ResNet with fewer weights and multiplications [Huang, CVPR 2017] [Plots: Top-1 error vs. # of weights and vs. # of FLOPs] February 13, 2023
  • 37. L03-37 Sze and Emer Wide ResNet Increase width (# of filters) rather than depth of network • 50-layer wide ResNet outperforms 152-layer original ResNet • Increasing width instead of depth is also more parallel-friendly Image Source: Stanford cs231n [Zagoruyko, BMVC 2016] February 13, 2023
  • 38. L03-38 Sze and Emer Squeeze and Excitation [Figure: input fmap (HxWxC) → 1) Global Pooling (1x1xC) → 2) Multiple FC Layers → dynamic weights → 3) Depth-wise Convolution → output fmap (HxWxC)] Depth-wise convolution with dynamic weights, where the weights change based on the input feature map. • Squeeze: Summarize each channel of the input feature map with global pooling • Excitation: Determine weights using FC layers to increase attention on certain channels of the input feature map [Hu, CVPR 2018] Attention (input → dynamic weights) Used by SENet ILSVRC 2017 Winner February 13, 2023
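A minimal sketch of a squeeze-and-excitation block, assuming PyTorch (the channel count and reduction ratio are illustrative assumptions): global pooling "squeezes" each channel to a scalar, two FC layers produce per-channel attention weights, and the input fmap is rescaled channel-wise by these dynamic weights.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                  # squeeze: global pooling -> (N, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excitation: per-channel weights
        return x * w.view(n, c, 1, 1)                           # rescale each channel of the fmap

y = SEBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)
```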
  • 39. L03-39 Sze and Emer Convolution versus Attention Mechanism • Convolution – Only models dependencies between spatial neighbors – Uses a sparsely connected layer over spatial neighbors; no support for dependencies outside the spatial dimensions of the filter (R x S) • Attention – “Allows modeling of [global] dependencies without regard to their distance” [Vaswani, NeurIPS 2017] – However, a fully connected layer is too expensive; develop a mechanism to bias “the allocation of available computational resources towards the most informative components of a signal” [Hu, CVPR 2018] • Transformer is a type of DNN that is built entirely using the Attention Mechanism [Vaswani, NeurIPS 2017] (Next Lecture) February 13, 2023
  • 41. L03-41 Sze and Emer Accuracy vs. Weight & OPs [Bianco, IEEE Access 2018] February 13, 2023
  • 42. L03-42 Sze and Emer Manual Network Design • Reduce Spatial Size (R, S) – stacked filters • Reduce Channels (C) – 1x1 convolution, grouped convolution • Reduce Filters (M) – feature map reuse across layers [Figure: M filters (RxSxC), N input fmaps (HxWxC), N output fmaps (PxQxM)] February 13, 2023
  • 43. L03-43 Sze and Emer Reduce Spatial Size (R, S): Stacked Small Filters Replace a large filter with a series of smaller filters (reduces degrees of freedom) VGG: 5x5 filter decomposed into two 3x3 filters, applied sequentially GoogleNet/Inception v3 (separable filters): 5x5 filter decomposed into a 5x1 filter and a 1x5 filter, applied sequentially February 13, 2023
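A minimal sketch of the spatially separable decomposition, assuming PyTorch and an illustrative channel count: a 5x1 followed by a 1x5 convolution covers the same 5x5 receptive field with 10 instead of 25 weights per filter plane.

```python
import torch.nn as nn

C = 64  # illustrative channel count
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(5, 1), padding=(2, 0), bias=False),  # 5x1 filters
    nn.Conv2d(C, C, kernel_size=(1, 5), padding=(0, 2), bias=False),  # 1x5 filters
)
full_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

print(sum(p.numel() for p in separable.parameters()),  # 10*C*C weights
      sum(p.numel() for p in full_5x5.parameters()))   # 25*C*C weights
```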
  • 44. L03-44 Sze and Emer Example: Inception V3 Go deeper (v1: 22 layers → v3: 40+ layers) by reducing the number of weights per filter using filter decomposition ~3.5% higher accuracy than v1 [Szegedy, CVPR 2016] 5x5 filter → 3x3 filters 3x3 filter → 3x1 and 1x3 filters Separable filters February 13, 2023
  • 45. L03-45 Sze and Emer Reduce Channels (C): 1x1 Convolution • Use 1x1 (bottleneck) filter to capture cross-channel correlation, but not spatial correlation • Reduce the number of channels in next layer (compress), where C > M [Figure: GoogLeNet uses a 1x1 compress (C > M); ResNet uses a 1x1 compress followed by a 1x1 expand (C < M)] February 13, 2023
  • 46. L03-46 Sze and Emer Example: SqueezeNet [Iandola, ICLR 2017] Fire Block Reduce number of weights by reducing number of input channels by “squeezing” with 1x1 50x fewer weights than AlexNet (no accuracy loss) However, 1.2x more operations than AlexNet* *for SqueezeNet v1.0 (operations reduced by 2x in SqueezeNet v1.1) February 13, 2023
  • 47. L03-47 Sze and Emer Reduce Channels (C): Grouped Convolutions Grouped convolutions reduce the number of weights and multiplications at the cost of not sharing information between groups • Divide filters into groups (G) operating on a subset of channels. • Each group has M/G filters and processes C/G channels. Example for G=2: Each filter requires 2x fewer weights and MACs (C → C/2) In this example, N=1 & M=4 [Figure: filters split into Group 1 and Group 2, each with RxSx(C/2) filters operating on half of the input fmap channels to produce PxQ output fmaps] February 13, 2023
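A sketch of the weight savings, assuming PyTorch (the channel counts are illustrative assumptions): with G=2, each filter sees only C/2 input channels, so the layer needs 2x fewer weights and MACs.

```python
import torch.nn as nn

C, M = 64, 128
full = nn.Conv2d(C, M, 3, padding=1, bias=False)              # R*S*C*M weights
grouped = nn.Conv2d(C, M, 3, padding=1, groups=2, bias=False)  # R*S*(C/2)*M weights

print(sum(p.numel() for p in full.parameters()),     # 73728
      sum(p.numel() for p in grouped.parameters()))  # 36864 -> 2x fewer
```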
  • 48. L03-48 Sze and Emer Reduce Channels (C): Grouped Convolutions Two ways of mixing information from groups: Shuffle Operation (mix in multiple steps), used by ShuffleNet [Figure: fmap 0 → layer 1 → fmap 1 → layer 2 → fmap 2] Pointwise (1x1) Convolution (mix in one step), used by MobileNet Also referred to as depth-wise separable: Decouple the cross-channel correlations and spatial correlations in the feature maps of the CNN [Figure: an RxSxC filter decomposed into an RxSx1 depth-wise filter plus a 1x1xC point-wise filter, for each of the M filters] February 13, 2023
  • 49. L03-49 Sze and Emer Depth-wise Convolutions The extreme case of Grouped Convolutions is Depth-wise Convolutions, where the number of groups (G) equals the number of channels (C) (i.e., one input channel per group) Typically, M=C (but does not have to be) [Figure: filter1 (RxSx1) applied to channel 1 of the input fmap produces output fmap1; filterC applied to channel C produces output fmapC] February 13, 2023
  • 50. L03-50 Sze and Emer Example: MobileNets [Howard, arXiv 2017] Depth-wise filter decomposition: an RxSxC filter is replaced by a depth-wise RxS convolution (one filter per channel) followed by a pointwise (1x1xC) convolution for each of the M output channels Reduction in MACs: (HWCRSM) / (HWC(RS + M)) = RSM / (RS + M) February 13, 2023
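A sketch of the depth-wise separable decomposition, assuming PyTorch (channel counts are illustrative assumptions): a depth-wise 3x3 convolution (groups = C, one filter per channel) followed by a point-wise 1x1 convolution that mixes channels.

```python
import torch
import torch.nn as nn

C, M = 64, 128
depthwise = nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False)  # R*S*C weights (576)
pointwise = nn.Conv2d(C, M, 1, bias=False)                       # C*M weights (8192)
separable = nn.Sequential(depthwise, pointwise)

standard = nn.Conv2d(C, M, 3, padding=1, bias=False)             # R*S*C*M weights (73728)

x = torch.randn(1, C, 56, 56)
assert separable(x).shape == standard(x).shape
# MAC ratio per output position: R*S*C*M / (C*(R*S + M)) = R*S*M / (R*S + M)
```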
  • 51. L03-51 Sze and Emer MobileNets: Comparison Comparison with other CNN Models [Image source: Github] [Howard, arXiv 2017] February 13, 2023
  • 52. L03-52 Sze and Emer Example: Xception • An Inception block based on depth-wise separable convolutions • Claims to learn richer features with similar number of weights as Inception V3 (i.e., more efficient use of weights) – Similar performance on ImageNet; 4.3% better on larger dataset (JFT) – However, 1.5x more operations required than Inception V3 [Chollet, CVPR 2017] Spatial correlation Cross-channel correlation February 13, 2023
  • 53. L03-53 Sze and Emer Example: ResNeXt ResNet ResNeXt [Xie, CVPR 2017] Used by ILSVRC 2017 Winner SENet Inspired by Inception’s “split-transform-merge” Increase number of convolution groups (G) (referred to as cardinality in the paper) instead of depth and width of network February 13, 2023
  • 54. L03-54 Sze and Emer Example: ResNeXt Improved accuracy vs. ‘complexity’ tradeoff compared to other ResNet based models [Xie, CVPR 2017] February 13, 2023
  • 55. L03-55 Sze and Emer Shuffle Operation [Figure: a grouped convolution (Group 1, Group 2) produces an output fmap with channels 1–4; the shuffle operation reorders these output channels so that the next grouped convolution receives channels originating from both groups] February 13, 2023
  • 56. L03-56 Sze and Emer Example: ShuffleNet Shuffle order such that channels are not isolated across groups (up to 4% increase in accuracy) [Zhang, CVPR 2018] Without shuffling, there is no interaction between channels from different groups; shuffling allows interaction between channels from different groups February 13, 2023
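A minimal sketch of a channel shuffle, assuming PyTorch (a common way to implement it; the shapes are illustrative): reshape the channels into (groups, C/groups), transpose, and flatten so that channels from different groups are interleaved before the next grouped convolution.

```python
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave across groups
    return x.view(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)  # channels 0..7, two groups of 4
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```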
  • 57. L03-57 Sze and Emer AlexNet: Grouped Convolutions Split into 2 Groups Split into 2 Groups AlexNet uses grouped convolutions to train on two separate GPUs (Drawback: correlation between channels of different groups is not used) Mix Information (3x3 CONV) Mix Information (FC) February 13, 2023
  • 58. L03-58 Sze and Emer Reduce Filters (M): Feature Map Reuse Reuse (M-K) channels in feature maps from previously processed layers, so each layer only needs K filters to compute K new channels [Huang, CVPR 2017] DenseNet reuses feature maps from multiple layers [Figure: layers L1–L3 with RxSxC filters; the output fmap with M channels is formed from K newly computed channels plus M-K reused channels] February 13, 2023
  • 59. L03-59 Sze and Emer Neural Architecture Search (NAS) 3x3? 5x5? 128 Filters? Pool? CONV? Rather than handcrafting the model, automatically search for it February 13, 2023
  • 60. L03-60 Sze and Emer Neural Architecture Search (NAS) • Three main components: – Search Space (what is the set of all samples) – Optimization Algorithm (where to sample) – Performance Evaluation (how to evaluate samples) Key Metrics: Achievable DNN accuracy and required search time Search Space Performance Evaluation Evaluation Result Optimization Algorithm Next Location to Sample Sampled Network Final Network February 13, 2023
  • 61. L03-61 Sze and Emer Evaluate NAS Search Time time_nas = num_samples × time_sample; time_nas ∝ (size_search_space × num_alg_tuning / efficiency_alg) × (time_eval + time_train) (1) Shrink the search space (2) Improve the optimization algorithm (3) Simplify the performance evaluation Goal: Improve the efficiency of NAS in the three main components Search Space Performance Evaluation Optimization Algorithm February 13, 2023
  • 62. L03-62 Sze and Emer (1) Shrink the Search Space • Trade the breadth of models for search speed • May limit the performance that can be achieved • Use domain knowledge from manual network design to help guide the reduction of the search space [Figure: shrinking the search space within the model universe reduces the number of samples needed, but the optimal model may fall outside the shrunk search space] February 13, 2023
  • 63. L03-63 Sze and Emer (1) Shrink the Search Space • Search space = layer operations + connections between layers Common layer operations: • Identity • 1x3 then 3x1 convolution • 1x7 then 7x1 convolution • 3x3 dilated convolution • 1x1 convolution • 3x3 convolution • 3x3 separable convolution • 5x5 separable convolution • 3x3 average pooling • 3x3 max pooling • 5x5 max pooling • 7x7 max pooling [Zoph, CVPR 2018] February 13, 2023
  • 64. L03-64 Sze and Emer (1) Shrink the Search Space • Search space = layer operations + connections between layers February 13, 2023 Image Source: [Zoph, CVPR 2018] Smaller Search Space
  • 65. L03-65 Sze and Emer (2) Improve Optimization Algorithm Example optimization algorithms: Random, Gradient Descent, Coordinate Descent, Reinforcement Learning, Bayesian, Evolutionary February 13, 2023
  • 66. L03-66 Sze and Emer (3) Simplify the Performance Evaluation • NAS needs only the rank of the performance values • Method 1: approximate accuracy – Proxy Task (e.g., smaller resolution, simpler tasks) – Early Termination (stop training earlier) – Accuracy Prediction (extrapolate or predict accuracy from early iterations) February 13, 2023
  • 67. L03-67 Sze and Emer (3) Simplify the Performance Evaluation • NAS needs only the rank of the performance values • Method 2: approximate weights – Copy Weights (reuse weights from other similar networks) – Estimate Weights (infer the weights from the previous feature maps) February 13, 2023
  • 68. L03-68 Sze and Emer (3) Simplify the Performance Evaluation • NAS needs only the rank of the performance values • Method 3: approximate metrics (e.g., latency, energy) – Proxy Metric (use an easy-to-compute metric, e.g., # MACs, to approximate the target metric, e.g., latency) – Look-Up Table (use table lookup) February 13, 2023
  • 69. L03-69 Sze and Emer Design Considerations for NAS • The components may not be chosen individually – Some optimization algorithms limit the search space – Type of performance metric may limit the selection of the optimization algorithms • Commonly overlooked properties – The complexity of implementation – The ease of tuning hyperparameters of the optimization – The probability of convergence to a good architecture February 13, 2023
  • 70. L03-70 Sze and Emer Example: NASNet • Search Space: Build model from popular layers • Identity • 1x3 then 3x1 convolution • 1x7 then 7x1 convolution • 3x3 dilated convolution • 1x1 convolution • 3x3 convolution • 3x3 separable convolution • 5x5 separable convolution • 3x3 average pooling • 3x3 max pooling • 5x5 max pooling • 7x7 max pooling [Zoph, CVPR 2018] February 13, 2023
  • 71. L03-71 Sze and Emer NASNet: Learned Convolutional Cells [Zoph, CVPR 2018] February 13, 2023
  • 72. L03-72 Sze and Emer NASNet: Comparison with Existing Networks Learned models have improved accuracy vs. ‘complexity’ tradeoff compared to handcrafted models [Zoph, CVPR 2018] February 13, 2023
  • 73. L03-73 Sze and Emer EfficientNet [Tan, ICML 2019] Uniformly scale all dimensions (depth, width, and resolution) since there is an interplay between the different dimensions. Use NAS to search for a baseline model and then scale it up. February 13, 2023
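A minimal sketch of the compound scaling idea (the per-dimension coefficients below are assumptions included for illustration, as reported for the baseline model in the paper): depth, width, and input resolution are scaled together by a single coefficient phi rather than scaling one dimension at a time.

```python
# Compound scaling: one coefficient phi scales depth, width, and resolution together.
alpha, beta, gamma = 1.2, 1.1, 1.15  # per-dimension factors (assumed for illustration)
phi = 2                              # overall compound coefficient

depth_mult = alpha ** phi            # multiplier on the number of layers
width_mult = beta ** phi             # multiplier on the number of filters per layer
res_mult = gamma ** phi              # multiplier on the input resolution
print(round(depth_mult, 2), round(width_mult, 2), round(res_mult, 2))  # 1.44 1.21 1.32
```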
  • 74. L03-74 Sze and Emer Summary • Approaches used to improve accuracy by popular CNN models in the ImageNet Challenge – Go deeper (i.e., more layers) – Stack smaller filters and apply 1x1 bottlenecks to reduce number of weights such that the deeper models can fit into a GPU (faster training) – Use multiple connections across layers (e.g., parallel and short cut) • Efficient models aim to reduce number of weights and number of operations – Most use some form of filter decomposition (spatial, depth and channel) – Note: Number of weights and operations does not directly map to storage, speed and power/energy. Depends on hardware! • Filter shapes vary across layers and models – Need flexible hardware! February 13, 2023
  • 75. L03-75 Sze and Emer Warning! • These works often use number of weights and operations to measure “complexity” • Number of weights provides an indication of storage cost for inference • However later in the course, we will see that – Number of operations doesn’t directly translate to latency/throughput – Number of weights and operations doesn’t directly translate to power/energy consumption • Understanding the underlying hardware is important for evaluating the impact of these “efficient” CNN models February 13, 2023
  • 76. L03-76 Sze and Emer References • Book: Chapter 2 & 9 – https://doi.org/10.1007/978-3-031-01766-7 • Other Works Cited in Lecture (increase accuracy) – LeNet: LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proc. IEEE 1998. – AlexNet: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NeurIPS. 2012. – VGGNet: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." ICLR 2015. – Network in Network: Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014 – GoogleNet: Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR 2015. – ResNet: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR 2016. – DenseNet: Huang, Gao, et al. "Densely connected convolutional networks." CVPR 2017 – Wide ResNet: Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." BMVC 2017. – ResNext: Xie, Saining, et al. "Aggregated residual transformations for deep neural networks.” CVPR 2017 – SENets: Hu, Jie et al., “Squeeze-and-Excitation Networks,” CVPR 2018 – NFNet: Brock, Andrew, et al., “High-Performance Large-Scale Image Recognition Without Normalization,” arXiv 2021 February 13, 2023
  • 77. L03-77 Sze and Emer References • Other Works Cited in Lecture (increase efficiency) – InceptionV3: Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." CVPR 2016. – SqueezeNet: Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size." ICLR 2017. – Xception: Chollet, François. "Xception: Deep Learning with Depthwise Separable Convolutions." CVPR 2017 – MobileNet: Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017). – MobileNetv2: Sandler, Mark et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” CVPR 2018 – MobileNetv3: Howard, Andrew et al., “Searching for MobileNetV3,” ICCV 2019 – ShuffleNet: Zhang, Xiangyu, et al. "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices." CVPR 2018 – Learning Network Architecture: Zoph, Barret, et al. "Learning Transferable Architectures for Scalable Image Recognition." CVPR 2018 • Other Works Cited in Lecture (Increase accuracy and efficiency) – EfficientNet: Tan, Mingxing, et al. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” ICML 2019 February 13, 2023