The document presents the deep residual learning framework proposed in the paper "Deep Residual Learning for Image Recognition". The framework aims to make it easier to optimize extremely deep convolutional neural networks. It does this by introducing "skip connections" that allow layers to learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. This addresses issues like vanishing gradients in very deep networks. The authors demonstrate that residual networks are easier to optimize and can gain accuracy from increased depth, outperforming standard networks.
1. Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Presented by – Sanjay Saha, School of Computing, NUS
CS6240 – Multimedia Analysis – Sem 2 AY2019/20
2. Objective | Problem Statement
3. Motivation
Performance of plain networks degrades in deeper architectures
Image source: paper
4. Main Idea
• Skip connections / shortcuts
• Trying to avoid:
  • "Vanishing gradients"
  • "Long training times"
Image source: Wikipedia
5. Contributions | Problem Statement
• These extremely deep residual nets are easy to optimize, but the counterpart "plain" nets (that simply stack layers) exhibit higher training error when the depth increases.
• These deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
A residual learning framework to ease the training of networks that
are substantially deeper than those used previously.
[Diagram: performance vs. depth]
7. Literature Review
• Partial solutions for vanishing gradients:
  • Batch Normalization – normalizes layer activations over each mini-batch.
  • Smart initialization of weights – for example, Xavier initialization.
  • Training portions of the network individually.
• Highway Networks
  • Feature residual connections of the form
    y = H(x) · sigmoid(Wx + b) + x · (1 − sigmoid(Wx + b))
  • Data-dependent gated shortcuts with parameters.
  • When gates are "closed", the layers become "non-residual". (A minimal gating sketch follows below.)
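A minimal PyTorch sketch of the highway gating above; the module and layer names (HighwayLayer, transform, gate) are illustrative, not from the Highway Networks paper:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: y = H(x) * T(x) + x * (1 - T(x)), where T is a sigmoid gate."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # produces H(x)
        self.gate = nn.Linear(dim, dim)        # produces Wx + b for the gate

    def forward(self, x):
        h = torch.relu(self.transform(x))      # candidate transformation H(x)
        t = torch.sigmoid(self.gate(x))        # gate T(x) in (0, 1)
        return h * t + x * (1 - t)             # gate "closed" (t near 0) => y is roughly x
```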
8. ResNet | Design | Architecture
9. Plain Block
a^[l] → [linear → relu] → a^[l+1] → [linear → relu] → a^[l+2]

z^[l+1] = W^[l+1] a^[l] + b^[l+1]   ("linear")
a^[l+1] = g(z^[l+1])   ("relu")
z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]   ("output")
a^[l+2] = g(z^[l+2])   ("relu on output")
Image source: deeplearning.ai
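A minimal PyTorch sketch of the plain block above, written with fully-connected layers to mirror the notation (the paper's blocks use convolutions); names are illustrative:

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Two stacked layers, no shortcut: a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2])."""
    def __init__(self, dim):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)  # W[l+1], b[l+1]
        self.layer2 = nn.Linear(dim, dim)  # W[l+2], b[l+2]

    def forward(self, a):
        z1 = self.layer1(a)        # "linear"
        a1 = torch.relu(z1)        # "relu"
        z2 = self.layer2(a1)       # "output"
        return torch.relu(z2)      # "relu on output"
```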
10. Residual Block
a^[l] → [linear → relu] → a^[l+1] → [linear] → (+ a^[l]) → relu → a^[l+2]

z^[l+1] = W^[l+1] a^[l] + b^[l+1]   ("linear")
a^[l+1] = g(z^[l+1])   ("relu")
z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]   ("output")
a^[l+2] = g(z^[l+2] + a^[l])   ("relu on output plus input")
Image source: deeplearning.ai
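The same two layers with the identity shortcut added before the final ReLU, matching a^[l+2] = g(z^[l+2] + a^[l]); again a sketch with illustrative names:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked layers with an identity shortcut: a[l+2] = g(z[l+2] + a[l])."""
    def __init__(self, dim):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)

    def forward(self, a):
        z1 = self.layer1(a)
        a1 = torch.relu(z1)
        z2 = self.layer2(a1)
        return torch.relu(z2 + a)   # the shortcut adds the block input before the final ReLU
```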
11. Skip Connections
• The shortcut skips one or more layers and carries the input forward unchanged.
• The skipped layers are referred to as the residual part of the network.
• The block's input is added directly to the block's output, so the two usually have the same dimensions.
• When the dimensions differ, a projection maps the input to the output space.
• Identity shortcuts add no extra training parameters; a projection shortcut adds only a small number. (See the sketch below.)
Image source: towardsdatascience.com
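A sketch of both shortcut options in a convolutional block, assuming PyTorch; when the input and output shapes match, the shortcut is the identity, otherwise a 1x1 projection (which does add a few parameters) maps the input to the output space:

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Residual block with an identity shortcut, or a 1x1 projection when shapes change."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity when shapes match; otherwise project with a 1x1 conv.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))
```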
13. ResNet Architecture
Image source: paper
Stacked Residual Blocks
14. ResNet Architecture
Image source: paper
• 3x3 conv layers throughout
• The number of filters doubles whenever stride-2 convolutions down-sample the feature maps
• Average pooling after the last conv layer
• FC layer to output classes
(A minimal end-to-end sketch follows below.)
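A minimal end-to-end sketch, assuming the ConvResidualBlock from the earlier sketch; the stage sizes and layer counts here are illustrative, not the paper's exact ResNet-18/34/50 configurations:

```python
import torch
import torch.nn as nn

def make_stage(block, in_channels, out_channels, num_blocks, stride):
    """First block may down-sample (stride 2) and double the filters; the rest keep the shape."""
    layers = [block(in_channels, out_channels, stride)]
    layers += [block(out_channels, out_channels, 1) for _ in range(num_blocks - 1)]
    return nn.Sequential(*layers)

class TinyResNet(nn.Module):
    """Very small ResNet-style network; pass a residual block class such as ConvResidualBlock."""
    def __init__(self, block, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
                                  nn.BatchNorm2d(64), nn.ReLU(),
                                  nn.MaxPool2d(3, stride=2, padding=1))
        self.stage1 = make_stage(block, 64, 64, 2, stride=1)
        self.stage2 = make_stage(block, 64, 128, 2, stride=2)   # 2x filters, stride-2 down-sampling
        self.stage3 = make_stage(block, 128, 256, 2, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)                     # average pool after the last conv
        self.fc = nn.Linear(256, num_classes)                   # FC layer to output classes

    def forward(self, x):
        x = self.stage3(self.stage2(self.stage1(self.stem(x))))
        return self.fc(torch.flatten(self.pool(x), 1))
```

Usage would be along the lines of `model = TinyResNet(ConvResidualBlock)` followed by `logits = model(torch.randn(1, 3, 224, 224))`.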
17. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64
Image source: paper
18. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64 → 3x3 conv on the 64 feature maps only
Image source: paper
19. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64 → 3x3 conv on the 64 feature maps only → 1x1 conv with 256 filters → 28x28x256
BOTTLENECK (a code sketch follows below)
Image source: paper
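A minimal PyTorch sketch of the bottleneck block just described (1x1 reduce, 3x3, 1x1 restore, identity shortcut); channel sizes match the 256 → 64 → 64 → 256 example, and names are illustrative:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore) with an identity shortcut, e.g. 256 -> 64 -> 64 -> 256."""
    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_channels, 1, bias=False)               # 1x1, 64 filters
        self.conv = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)  # 3x3 on 64 maps
        self.restore = nn.Conv2d(mid_channels, channels, 1, bias=False)              # 1x1, 256 filters
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.reduce(x)))
        out = torch.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        return torch.relu(out + x)   # identity shortcut: 28x28x256 in, 28x28x256 out
```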
21. Benefits of Bottleneck
• Less training time for deeper networks, by keeping the time complexity similar to a two-layer 3x3 conv block.
• Hence, it allows the number of layers to increase.
• And the model converges faster; the 152-layer ResNet still has lower complexity (11.3 billion FLOPs) than the VGG-16/19 nets (15.3/19.6 billion FLOPs).
(A worked weight count follows below.)
Image source: paper
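A rough, back-of-the-envelope weight count (ignoring biases and batch norm) showing why the three-layer bottleneck stays close in cost to two plain 3x3 layers; the 256 → 64 → 64 → 256 sizes follow the example above:

```python
# Two plain 3x3 conv layers at 256 channels:
plain = 2 * (3 * 3 * 256 * 256)   # 1,179,648 weights per block

# Bottleneck: 1x1 (256->64) + 3x3 (64->64) + 1x1 (64->256):
bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)   # 69,632 weights

print(plain, bottleneck)   # three layers for far fewer weights than two plain layers
```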
22. Summary – Advantages of ResNet over Plain Networks
• A deeper plain network tends to perform poorly because of vanishing and exploding gradients.
• In such cases, ResNets stop improving rather than degrade:
  a^[l+2] = g(z^[l+2] + a^[l]) = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l])
• If a layer is not "useful", L2 regularization will bring its parameters very close to zero, resulting in a^[l+2] = g(a^[l]) = a^[l] (when using ReLU, since a^[l] ≥ 0).
• In theory a ResNet has the same representational power as the corresponding plain network, but in practice, because of the above, convergence is much faster.
• Identity shortcuts introduce no additional training parameters or complexity. (A small numerical check follows below.)
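A small numerical check of the claim above, as a sketch: with the residual branch's weights driven to zero (as L2 regularization would do for an unneeded layer), the block reduces to an identity mapping for non-negative (post-ReLU) inputs:

```python
import torch
import torch.nn as nn

dim = 8
layer1, layer2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
for layer in (layer1, layer2):
    nn.init.zeros_(layer.weight)   # simulate weights regularized to zero
    nn.init.zeros_(layer.bias)

a = torch.rand(4, dim)                 # non-negative activations, as after a previous ReLU
z2 = layer2(torch.relu(layer1(a)))     # residual branch output is exactly zero here
out = torch.relu(z2 + a)               # the shortcut carries the input through
print(torch.allclose(out, a))          # True: the block behaves as an identity mapping
```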
24. Results
• ILSVRC 2015 classification winner (3.6% top-5 error) – better than "human performance"!
Error rates (%) of ensembles. The top-5 error is on the
test set of ImageNet and reported by the test server
25. Results
Error rates (%, 10-crop testing) on ImageNet
validation set
Error rates (%) of single-model results on
the ImageNet validation set
26. Plain vs. ResNet
Image source: paper
27. Plain vs. Deeper ResNet
Image source: paper
28. Conclusion | Future Trends
29. Conclusion
โขEasy to optimize deep neural networks.
โขGuaranteed Accuracy gain with deeper layers.
โขAddressed: Vanishing Gradient and Longer
Training duration.
33. Future Trends
โข Identity Mappings in Deep Residual Networks suggests to pass the
input directly to the final residual layer, hence allowing the network
to easily learn to pass the input as identity mapping both in forward
and backward passes. (He et. al. 2016)
โข Using the Batch Normalization as pre-activation improves the
regularization
โข Reduce Learning Time with Random Layer Drops
โข ResNeXt: Aggregated Residual Transformations for Deep Neural
Networks. (Xie et. al. 2016)
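A minimal sketch of the pre-activation ordering suggested in He et al. (2016): Batch Normalization and ReLU come before each convolution, and the shortcut stays a clean identity path; names here are illustrative:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice, plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))   # BN and ReLU act as pre-activation
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x                               # clean identity path, no ReLU after the addition
```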
34. Questions?
Presented by – Sanjay Saha (sanjaysaha@u.nus.edu), School of Computing, NUS