The document presents the deep residual learning framework proposed in the paper "Deep Residual Learning for Image Recognition". The framework aims to make it easier to optimize extremely deep convolutional neural networks. It does this by introducing "skip connections" that allow layers to learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. This addresses issues like vanishing gradients in very deep networks. The authors demonstrate that residual networks are easier to optimize and can gain accuracy from increased depth, outperforming standard networks.
1. Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Presented by – Sanjay Saha, School of Computing, NUS
CS6240 – Multimedia Analysis – Sem 2 AY2019/20
2. Objective | Problem Statement
3. Motivation
Performance of plain networks degrades in deeper architectures
Image source: paper
4. Main Idea
• Skip connections / shortcuts
• Trying to avoid:
  • "Vanishing gradients"
  • "Long training times"
Image source: Wikipedia
5. Contributions | Problem Statement
• These extremely deep residual nets are easy to optimize, but the counterpart "plain" nets (that simply stack layers) exhibit higher training error when the depth increases.
• These deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
A residual learning framework to ease the training of networks that
are substantially deeper than those used previously.
[Diagram: performance vs. depth]
7. Literature Review
• Partial solutions for vanishing gradients:
  • Batch Normalization – normalizes layer activations over each mini-batch.
  • Smart initialization of weights – for example, Xavier initialization.
  • Training portions of the network individually.
• Highway Networks
  • Feature residual connections of the form
    y = H(x) · sigmoid(Wx + b) + x · (1 − sigmoid(Wx + b))
  • Data-dependent gated shortcuts with parameters.
  • When gates are "closed", the layers become "non-residual". (A minimal gating sketch follows below.)
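A minimal PyTorch sketch of the highway gating above; the module and layer names (HighwayLayer, transform, gate) are illustrative, not from the Highway Networks paper:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: y = H(x) * T(x) + x * (1 - T(x)), where T is a sigmoid gate."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # produces H(x)
        self.gate = nn.Linear(dim, dim)        # produces Wx + b for the gate

    def forward(self, x):
        h = torch.relu(self.transform(x))      # candidate transformation H(x)
        t = torch.sigmoid(self.gate(x))        # gate T(x) in (0, 1)
        return h * t + x * (1 - t)             # gate "closed" (t near 0) => y is roughly x
```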
8. ResNet | Design | Architecture
9. Plain Block
a^[l] → [linear → relu] → a^[l+1] → [linear → relu] → a^[l+2]

z^[l+1] = W^[l+1] a^[l] + b^[l+1]   ("linear")
a^[l+1] = g(z^[l+1])   ("relu")
z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]   ("output")
a^[l+2] = g(z^[l+2])   ("relu on output")
Image source: deeplearning.ai
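A minimal PyTorch sketch of the plain block above, written with fully-connected layers to mirror the notation (the paper's blocks use convolutions); names are illustrative:

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Two stacked layers, no shortcut: a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2])."""
    def __init__(self, dim):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)  # W[l+1], b[l+1]
        self.layer2 = nn.Linear(dim, dim)  # W[l+2], b[l+2]

    def forward(self, a):
        z1 = self.layer1(a)        # "linear"
        a1 = torch.relu(z1)        # "relu"
        z2 = self.layer2(a1)       # "output"
        return torch.relu(z2)      # "relu on output"
```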
10. Residual Block
a^[l] → [linear → relu] → a^[l+1] → [linear] → (+ a^[l]) → relu → a^[l+2]

z^[l+1] = W^[l+1] a^[l] + b^[l+1]   ("linear")
a^[l+1] = g(z^[l+1])   ("relu")
z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]   ("output")
a^[l+2] = g(z^[l+2] + a^[l])   ("relu on output plus input")
Image source: deeplearning.ai
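The same two layers with the identity shortcut added before the final ReLU, matching a^[l+2] = g(z^[l+2] + a^[l]); again a sketch with illustrative names:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked layers with an identity shortcut: a[l+2] = g(z[l+2] + a[l])."""
    def __init__(self, dim):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)

    def forward(self, a):
        z1 = self.layer1(a)
        a1 = torch.relu(z1)
        z2 = self.layer2(a1)
        return torch.relu(z2 + a)   # the shortcut adds the block input before the final ReLU
```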
11. Skip Connections
• The shortcut skips one or more layers and carries the input forward unchanged.
• The skipped layers are referred to as the residual part of the network.
• The block's input is added directly to the block's output, so the two usually have the same dimensions.
• When the dimensions differ, a projection maps the input to the output space.
• Identity shortcuts add no extra training parameters; a projection shortcut adds only a small number. (See the sketch below.)
Image source: towardsdatascience.com
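A sketch of both shortcut options in a convolutional block, assuming PyTorch; when the input and output shapes match, the shortcut is the identity, otherwise a 1x1 projection (which does add a few parameters) maps the input to the output space:

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Residual block with an identity shortcut, or a 1x1 projection when shapes change."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity when shapes match; otherwise project with a 1x1 conv.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))
```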
13. ResNet Architecture
Image source: paper
Stacked Residual Blocks
14. ResNet Architecture
Image source: paper
• 3x3 conv layers throughout
• The number of filters doubles whenever stride-2 convolutions down-sample the feature maps
• Average pooling after the last conv layer
• FC layer to output classes
(A minimal end-to-end sketch follows below.)
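A minimal end-to-end sketch, assuming the ConvResidualBlock from the earlier sketch; the stage sizes and layer counts here are illustrative, not the paper's exact ResNet-18/34/50 configurations:

```python
import torch
import torch.nn as nn

def make_stage(block, in_channels, out_channels, num_blocks, stride):
    """First block may down-sample (stride 2) and double the filters; the rest keep the shape."""
    layers = [block(in_channels, out_channels, stride)]
    layers += [block(out_channels, out_channels, 1) for _ in range(num_blocks - 1)]
    return nn.Sequential(*layers)

class TinyResNet(nn.Module):
    """Very small ResNet-style network; pass a residual block class such as ConvResidualBlock."""
    def __init__(self, block, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
                                  nn.BatchNorm2d(64), nn.ReLU(),
                                  nn.MaxPool2d(3, stride=2, padding=1))
        self.stage1 = make_stage(block, 64, 64, 2, stride=1)
        self.stage2 = make_stage(block, 64, 128, 2, stride=2)   # 2x filters, stride-2 down-sampling
        self.stage3 = make_stage(block, 128, 256, 2, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)                     # average pool after the last conv
        self.fc = nn.Linear(256, num_classes)                   # FC layer to output classes

    def forward(self, x):
        x = self.stage3(self.stage2(self.stage1(self.stem(x))))
        return self.fc(torch.flatten(self.pool(x), 1))
```

Usage would be along the lines of `model = TinyResNet(ConvResidualBlock)` followed by `logits = model(torch.randn(1, 3, 224, 224))`.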
17. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64
Image source: paper
18. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64 → 3x3 conv on the 64 feature maps only
Image source: paper
19. ResNet Architecture
Input: 28x28x256 → 1x1 conv with 64 filters → 28x28x64 → 3x3 conv on the 64 feature maps only → 1x1 conv with 256 filters → 28x28x256
BOTTLENECK (a code sketch follows below)
Image source: paper
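A minimal PyTorch sketch of the bottleneck block just described (1x1 reduce, 3x3, 1x1 restore, identity shortcut); channel sizes match the 256 → 64 → 64 → 256 example, and names are illustrative:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore) with an identity shortcut, e.g. 256 -> 64 -> 64 -> 256."""
    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_channels, 1, bias=False)               # 1x1, 64 filters
        self.conv = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)  # 3x3 on 64 maps
        self.restore = nn.Conv2d(mid_channels, channels, 1, bias=False)              # 1x1, 256 filters
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.reduce(x)))
        out = torch.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        return torch.relu(out + x)   # identity shortcut: 28x28x256 in, 28x28x256 out
```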
21. Benefits of Bottleneck
• Less training time for deeper networks, by keeping the time complexity similar to a two-layer 3x3 conv block.
• Hence, it allows the number of layers to increase.
• And the model converges faster; the 152-layer ResNet still has lower complexity (11.3 billion FLOPs) than the VGG-16/19 nets (15.3/19.6 billion FLOPs).
(A worked weight count follows below.)
Image source: paper
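A rough, back-of-the-envelope weight count (ignoring biases and batch norm) showing why the three-layer bottleneck stays close in cost to two plain 3x3 layers; the 256 → 64 → 64 → 256 sizes follow the example above:

```python
# Two plain 3x3 conv layers at 256 channels:
plain = 2 * (3 * 3 * 256 * 256)   # 1,179,648 weights per block

# Bottleneck: 1x1 (256->64) + 3x3 (64->64) + 1x1 (64->256):
bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)   # 69,632 weights

print(plain, bottleneck)   # three layers for far fewer weights than two plain layers
```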
22. Summary – Advantages of ResNet over Plain Networks
• A deeper plain network tends to perform poorly because of vanishing and exploding gradients.
• In such cases, ResNets stop improving rather than degrade:
  a^[l+2] = g(z^[l+2] + a^[l]) = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l])
• If a layer is not "useful", L2 regularization will bring its parameters very close to zero, resulting in a^[l+2] = g(a^[l]) = a^[l] (when using ReLU, since a^[l] ≥ 0).
• In theory a ResNet has the same representational power as the corresponding plain network, but in practice, because of the above, convergence is much faster.
• Identity shortcuts introduce no additional training parameters or complexity. (A small numerical check follows below.)
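A small numerical check of the claim above, as a sketch: with the residual branch's weights driven to zero (as L2 regularization would do for an unneeded layer), the block reduces to an identity mapping for non-negative (post-ReLU) inputs:

```python
import torch
import torch.nn as nn

dim = 8
layer1, layer2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
for layer in (layer1, layer2):
    nn.init.zeros_(layer.weight)   # simulate weights regularized to zero
    nn.init.zeros_(layer.bias)

a = torch.rand(4, dim)                 # non-negative activations, as after a previous ReLU
z2 = layer2(torch.relu(layer1(a)))     # residual branch output is exactly zero here
out = torch.relu(z2 + a)               # the shortcut carries the input through
print(torch.allclose(out, a))          # True: the block behaves as an identity mapping
```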
24. Results
• ILSVRC 2015 classification winner (3.6% top-5 error) – better than "human performance"!
Error rates (%) of ensembles. The top-5 error is on the
test set of ImageNet and reported by the test server
25. Results
Error rates (%, 10-crop testing) on ImageNet
validation set
Error rates (%) of single-model results on
the ImageNet validation set
26. Plain vs. ResNet
Image source: paper
27. Plain vs. Deeper ResNet
Image source: paper
28. Conclusion | Future Trends
29. Conclusion
โขEasy to optimize deep neural networks.
โขGuaranteed Accuracy gain with deeper layers.
โขAddressed: Vanishing Gradient and Longer
Training duration.
33. Future Trends
โข Identity Mappings in Deep Residual Networks suggests to pass the
input directly to the final residual layer, hence allowing the network
to easily learn to pass the input as identity mapping both in forward
and backward passes. (He et. al. 2016)
โข Using the Batch Normalization as pre-activation improves the
regularization
โข Reduce Learning Time with Random Layer Drops
โข ResNeXt: Aggregated Residual Transformations for Deep Neural
Networks. (Xie et. al. 2016)
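A minimal sketch of the pre-activation ordering suggested in He et al. (2016): Batch Normalization and ReLU come before each convolution, and the shortcut stays a clean identity path; names here are illustrative:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice, plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))   # BN and ReLU act as pre-activation
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x                               # clean identity path, no ReLU after the addition
```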
34. Questions?
Presented by – Sanjay Saha (sanjaysaha@u.nus.edu), School of Computing, NUS