CNN Attention-based Networks
+ Attention, CNN Review
TensorFlow-KR PR-163, Taeoh Kim
MVP Lab, Yonsei Univ.
Contents
• Attention, Self-Attention in NLP
• CNN Review
• CNN Attention Networks for Recognition
• CNN Attention Networks for Other Vision Tasks
Review 1: Attention
Neural Networks
(Figure: a generic neural network pipeline — Input → Feature / Hidden State / Representation → Prediction)
Attention
(Figure: a Query attends over a set of {Key, Value} pairs to produce the Output)

Weighted sum: y = \sum_i w_i x_i

Attention as a weighted sum: y = \sum_i f(\mathrm{Query}, \mathrm{Key}_i) \times \mathrm{Value}_i
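To make the weighted-sum view concrete, here is a minimal NumPy sketch (not from the slides; the dot-product scoring function f is an assumption) of attention as a softmax-weighted sum of values:

import numpy as np

def attention(query, keys, values):
    """Weighted sum of values, with weights from a softmax over query-key scores."""
    scores = keys @ query                    # f(Query, Key_i): dot-product score per input
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # y = sum_i w_i * Value_i

rng = np.random.default_rng(0)
q = rng.normal(size=8)                       # the query (the "blue state" in the figures below)
K = rng.normal(size=(4, 8))                  # one key per input element
V = rng.normal(size=(4, 8))                  # one value per input element
print(attention(q, K, V).shape)              # (8,)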
Fully Connected Neural Network
(Figure: the Input → Feature / Hidden State / Representation pipeline, with a separate weight w_1, w_2, ..., w_n from every input to the blue unit)

Fully Connected NN: represents the blue unit as a weighted sum of all inputs, with no constraint on which inputs contribute.
Convolutional Neural Network
(Figure: the same pipeline — the blue unit (Q) is connected only to the value inputs (V) around its position, with weights w_1, w_2, w_3)

Convolutional NN: represents the blue unit as a weighted sum of inputs, restricted to a local window around the current position.
Attention
(Figure: the same pipeline — every input carries a (Key, Value) pair, and the blue state attends over all of them with weights w_1, ..., w_n)

Attention: represents the blue state as a weighted sum of all inputs, with weights computed from the blue state and the inputs:

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
ex) Machine Translation (PR-055)
(Figure: NMT — the decoder state for the next Korean word (Korean_1, Korean_2, ...) attends over the English encoder states, each a (Key, Value) pair, with weights w_1, ..., w_n)

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
ex) Image Captioning
(Figure: image captioning — the decoder state for the next caption word (Caption_1, Caption_2, ...) attends over image feature locations, each a (Key, Value) pair, with weights w_1, ..., w_n)

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
Self-Attention
(Figure: the same pipeline — each input carries a (Key, Value) pair and attends over all inputs, including itself, with weights w_1, ..., w_n)

Self-Attention: represents the blue unit as a weighted sum of all inputs, with weights computed from the input itself and the other inputs:

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Transformer (Enc) (PR-049, PR-161)
(Figure: Transformer encoder — English tokens self-attend over all English tokens to produce the English features)

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Transformer (Dec)
(Figure: Transformer decoder — Korean_{n+1} is predicted from the already-generated Korean_{1:n} via self-attention, followed by attention over the English features, which serve as the (Key, Value) pairs)

Self-attention over previous outputs: w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
Encoder-decoder attention over English features: w_n = \mathrm{softmax}\big(f(K_2, E_n)\big)
CNN Self-Attention for Image = Representation
(Figure: the same self-attention pattern applied across image positions for image recognition)

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Non-local Neural Networks (CVPR18, PR-083)
y_i = \frac{1}{\sum_j \exp\big(\theta(x_i)^{T}\,\phi(x_j)\big)} \sum_j \exp\big(\theta(x_i)^{T}\,\phi(x_j)\big)\, W_g\, x_j

(Figure: non-local block — 1x1 convs produce Query, Key, and Value from the H x W x C feature map; after reshaping to HW x C, Query x Key^T gives an HW x HW affinity matrix, which is softmax-normalized and multiplied with the Value to give HW x C, then reshaped back to H x W x C and added (+) to the input)
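The block above maps to a few reshapes and matrix multiplications. Below is a minimal PyTorch sketch (my own illustrative version, not the authors' code; the class name and the C/2 embedding width are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal (embedded-Gaussian) non-local block sketch."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2                # reduced embedding width
        self.theta = nn.Conv2d(channels, inner, 1)    # Query
        self.phi   = nn.Conv2d(channels, inner, 1)    # Key
        self.g     = nn.Conv2d(channels, inner, 1)    # Value (W_g)
        self.out   = nn.Conv2d(inner, channels, 1)    # project back to C

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.phi(x).flatten(2)                    # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)      # B x HW x C'
        attn = F.softmax(q @ k, dim=-1)               # B x HW x HW affinity
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection

x = torch.randn(2, 64, 14, 14)
print(NonLocalBlock(64)(x).shape)                     # torch.Size([2, 64, 14, 14])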
CNN Simplified-Attention for Image = Recalibration
(Figure: simplified attention — instead of a full weighted sum, each unit is rescaled by a single gate computed from the inputs; example task: image recognition)

w_2 = \mathrm{sigmoid}\big(f(K_2, K_n)\big)
CNN Simplified-Attention for Image = Recalibration
ex) Squeeze-and-Excitation Networks (CVPR18)
Summary
Attention       | Query          | Structure    | Objective      | Examples
Attention       | Current states | Recurrent    | Representation | NMT, Captioning, VQA
Self-Attention  | Input itself   | Feed-forward | Representation | Transformer, Non-local NN
Self-Attention  | Input itself   | Feed-forward | Recalibration  | SE-Net, RAN, CBAM
Review 2: CNN Networks
CNN Review
(Timeline of CNN architectures, each labeled "publication date / citation count", with a "Today" marker at the end:
AlexNet 2012 / 39646; VGG 2014.09 / 22554; GoogleNet 2014.09 / 13233; ResNet 2015.12 / 21871; Inception-2,3 2015.12 / 3752; Inception-4 / Inception-ResNet 2016.02 / 2152; ResNetv2 2016.03 / 1926; WideResNet 2016.05 / 1063; DenseNet 2016.08 / 3591; Xception 2016.10 / 847; ResNeXt 2016.11 / 913; MobileNet 2017.04 / 1553; Residual Attention Net 2017.04 / 257; NASNet 2017.07 / 417; ShuffleNet 2017.07 / 407; SE-Net 2017.09 / 724; Non-local Network 2017.11 / 240; CBAM 2018.07 / 54; GCNet 2019.04 / -)
Plain Networks
(Timeline highlight: AlexNet (ILSVRC12) and VGG (ILSVRC14))
• Plain networks using max-pooling for downsampling
• Lower performance; large parameter and operation counts
ResNet
(Timeline highlight: ResNet (ILSVRC15))
• Deeper networks enabled by skip connections
(Figure: a stack of Res Blocks)
ResNet ResBlock
(Figure: bottleneck ResBlock — 1x1 conv (c → c/4), 3x3 conv (c/4 → c/4), 1x1 conv (c/4 → c), each with BatchNorm and ReLU, repeated xN with a skip connection)
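For reference, a minimal PyTorch sketch of this bottleneck ResBlock (illustrative only; it follows the c → c/4 → c/4 → c channel pattern in the figure and the post-activation ordering of the original ResNet):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block sketch: 1x1 -> 3x3 -> 1x1 plus a skip connection."""
    def __init__(self, c):
        super().__init__()
        mid = c // 4
        self.body = nn.Sequential(
            nn.Conv2d(c, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1, bias=False), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity skip connection

x = torch.randn(2, 256, 56, 56)
print(Bottleneck(256)(x).shape)              # torch.Size([2, 256, 56, 56])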
ResNet Variants
(Timeline highlight: ResNetv2)
• Pre-activation ResNet: BatchNorm and ReLU are applied before each conv in the 1x1 (c → c/4), 3x3 (c/4), 1x1 (c/4 → c) bottleneck
ResNet Variants
(Timeline highlight: WideResNet)
• Wider-channel ResNet: a block of two 3x3 convs, with channels widened from c to c x k
ResNet with Cardinality
(Timeline highlight: ResNeXt, Xception, MobileNet, ShuffleNet)
• Cardinality = group convolution
• PR-034: Xception
• PR-044: MobileNet
• PR-054: ShuffleNet / ResNeXt
• These works modify the convolution operator itself
ResNeXt (C=32)
(Figure: ResNeXt block with cardinality 32 — the 256-channel input is split into 32 parallel branches, each 1x1 (256 → 4), 3x3 (4 → 4), 1x1 (4 → 256); the branch outputs are summed (+) and added to the input)
ResNeXt (C=32) with Group Conv.
(Figure: the same block expressed with group convolution — 1x1 (256 → 128), a 3x3 group conv with 32 groups of 4 channels, concatenation back to 128 channels, then 1x1 (128 → 256))
ResNeXt (C=32)
(Figure: same block, annotated — each branch ends in a (1x1x4) projection and the 32 branch outputs are summed)
ResNeXt (C=32) with Group Conv.
(Figure: same block in group-conv form, annotated — concatenating the groups and applying one 1x1x128 conv performs that same summation)
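A minimal PyTorch sketch of the grouped-convolution form of this block (my own; the 256 / 128 / 32 numbers follow the figure), where groups=32 implements the 32 parallel branches in a single conv:

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck sketch: cardinality expressed as a grouped 3x3 conv."""
    def __init__(self, c=256, width=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # one grouped conv = 32 parallel 3x3 convs on 4-channel slices, concatenated
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, c, 1, bias=False), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

x = torch.randn(2, 256, 56, 56)
print(ResNeXtBlock()(x).shape)               # torch.Size([2, 256, 56, 56])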
Xception (G=Channel)
(Figure: the same structure with groups = channels — 1x1 (256 → 128), a depthwise 3x3 conv (one group of size 1 per channel), concatenation, then 1x1 (128 → 256))
MobileNet / ShuffleNet
• MobileNet: a lightweight Xception-style network
• Xception: 1x1 conv → 3x3 depthwise conv
• MobileNet: 3x3 depthwise conv → 1x1 conv
• ShuffleNet: a lightweight ResNeXt with channel shuffle; a sketch of the depthwise-separable idea follows below
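Both designs build on the depthwise-separable convolution. A minimal PyTorch sketch (illustrative, not MobileNet's published code) using MobileNet's ordering of a 3x3 depthwise conv followed by a 1x1 pointwise conv:

import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """MobileNet-style unit sketch: 3x3 depthwise conv, then 1x1 pointwise conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c_in), nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))      # spatial filtering per channel
        return self.relu(self.bn2(self.pointwise(x)))   # cross-channel mixing

x = torch.randn(2, 64, 32, 32)
print(DepthwiseSeparable(64, 128)(x).shape)             # torch.Size([2, 128, 32, 32])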
DenseNet
(Timeline highlight: DenseNet)
• DenseNet: each layer concatenates the outputs of all previous layers (PR-028)
(Figure: a stack of Dense Blocks)
DenseBlock
(Figure: inside a Dense Block — each 1x1 + 3x3 conv layer takes the concatenation of all previous feature maps as its input)
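A minimal PyTorch sketch of this dense connectivity (my own; the growth rate, bottleneck width, and layer count are assumptions):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block sketch: every layer sees the concatenation of all earlier outputs."""
    def __init__(self, c_in, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        c = c_in
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, 4 * growth_rate, 1, bias=False),                        # 1x1 bottleneck
                nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
                nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),   # 3x3 conv
            ))
            c += growth_rate                              # input width grows with each concat

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # concat all previous outputs
        return torch.cat(feats, dim=1)

x = torch.randn(2, 64, 32, 32)
print(DenseBlock(64)(x).shape)                            # torch.Size([2, 192, 32, 32])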
Inception / NASNet
(Timeline highlight: GoogleNet (ILSVRC14), Inception-2,3, Inception-4 / Inception-ResNet, NASNet)
• Heavily engineered networks
• PR-034: Inception
• PR-069: NASNet
CNN Performances
https://github.com/CeLuigi/models-comparison.pytorch
CNN Review
Category    | Networks                                    | Pros                          | Cons
Plain       | AlexNet, VGG                                | Simple, good transfer         | Low performance
ResNet      | ResNet                                      | Simple                        |
Cardinality | ResNeXt / Xception / MobileNet / ShuffleNet | Cost-efficient, + performance | Group conv
DenseNet    | DenseNet                                    | Cost-efficient, + performance | Memory I/O
Engineering | Inception, NASNet                           | SoTA                          | Complex
CNN Review (with attention modules added)

Category         | Networks            | Pros                  | Cons
Attention Module | SE-Net, CBAM, GCNet | Simple, + performance |
CNN Review
CNN x Attention
CNN Attention-Networks
(The same timeline, now highlighting the attention-based networks covered next: Residual Attention Net 2017.04, SE-Net 2017.09 (ILSVRC17 winner), Non-local Network 2017.11, CBAM 2018.07, GCNet 2019.04)
Spatial Transformer Networks (NIPS15, PR-011)
• Recalibration (with a learned spatial transform)
Residual Attention Network (CVPR17)
Residual Attention Network (CVPR17)
(Figure: the original ResNet bottleneck block — Conv(1x1) → Conv(3x3) → Conv(1x1) on c channels, plus a residual addition)
• Original ResNet (bottleneck block)
Residual Attention Network (CVPR17)
(Figure: the same bottleneck, but its output is multiplied (x) by a sigmoid mask produced by a separate mask-generation network before the residual addition)
• Recalibrates both spatial positions and channels with a full 3D mask (a sketch follows below)
• Despite the name, the attention path itself is not residual
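A minimal sketch of the gating pattern described above (my own heavy simplification: the paper's mask branch is a multi-scale bottom-up/top-down network, reduced here to two convs just to show the sigmoid mask multiplying the trunk branch):

import torch
import torch.nn as nn

class MaskedBottleneck(nn.Module):
    """Sketch of mask-based recalibration: a sigmoid mask gates the trunk branch."""
    def __init__(self, c):
        super().__init__()
        mid = c // 4
        self.trunk = nn.Sequential(                      # ordinary bottleneck branch
            nn.Conv2d(c, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1),
        )
        self.mask = nn.Sequential(                       # simplified stand-in for the mask network
            nn.Conv2d(c, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1), nn.Sigmoid(),          # per-position, per-channel mask in [0, 1]
        )

    def forward(self, x):
        return x + self.trunk(x) * self.mask(x)          # gate the trunk, then add the identity

x = torch.randn(2, 64, 28, 28)
print(MaskedBottleneck(64)(x).shape)                     # torch.Size([2, 64, 28, 28])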
Residual Attention Network (CVPR17)
• Results: interpretable features
Squeeze-and-Excitation Networks (CVPR18)
(Figure: SE block — after the bottleneck's Conv(1x1) → Conv(3x3) → Conv(1x1), a global AvgPool squeezes the c-channel map to a vector; two 1x1 convs (c → c/16 → c) with a ReLU between them and a Sigmoid at the end produce per-channel weights that multiply (x) the features before the residual addition (+))
• Recalibrates channels only
Squeeze-and-Excitation Networks (CVPR18)
(Figure: same block — the global AvgPool collapses each channel's spatial map to a single value: a per-channel global spatial context, the Squeeze step)
Squeeze-and-Excitation Networks (CVPR18)
(Figure: same block — the 1x1 convs on the pooled vector model inter-channel relationships: the Excitation step)
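A minimal PyTorch sketch of the SE block (my own; the reduction ratio of 16 follows the c/16 in the figure):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: global pooling -> per-channel gating weights."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),   # Excitation: model
            nn.Linear(c // reduction, c), nn.Sigmoid(),            # inter-channel relations
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # Squeeze: global average pool -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)    # per-channel weights in [0, 1]
        return x * w                       # recalibrate channels only

x = torch.randn(2, 64, 28, 28)
print(SEBlock(64)(x).shape)                # torch.Size([2, 64, 28, 28])

In the network this module sits inside the ResBlock, rescaling the block's output just before the residual addition.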
Squeeze-and-Excitation Networks (CVPR18)
Squeeze-and-Excitation Networks (CVPR18)
Squeeze-and-Excitation Networks (CVPR18)
• Channel recalibration statistics
Bottleneck Attention Module (BAM, BMVC18)
(Figure: BAM — alongside the ResNet bottleneck, a channel-attention branch (global AvgPool followed by 1x1 convs with reduction ratio r) and a spatial-attention branch (1x1 reduction, 3x3 dilated convs, 1x1) are computed in parallel, summed (+), passed through a sigmoid, and used to rescale (x) the feature map, which is then added (+) back)
Convolutional Block Attention Module (CBAM, ECCV18)
(Figure: CBAM — channel attention first: global AvgPool and MaxPool descriptors pass through a shared 1x1 MLP (c → c/16 → c), are summed and sigmoid-gated to reweight channels; then spatial attention: channel-wise AvgPool and MaxPool maps are concatenated and passed through a 7x7 conv and sigmoid to reweight positions; applied sequentially inside the ResBlock)
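A minimal PyTorch sketch of the two CBAM sub-modules applied sequentially (my own; reduction 16 and the 7x7 spatial kernel follow the figure):

import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM sketch: channel attention followed by spatial attention."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for avg- and max-pooled descriptors
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True), nn.Linear(c // reduction, c),
        )
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)   # 7x7 conv over the [avg; max] maps

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: pooled descriptors -> shared MLP -> sigmoid gate
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: channel-wise avg and max maps -> 7x7 conv -> sigmoid gate
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 64, 28, 28)
print(CBAM(64)(x).shape)                               # torch.Size([2, 64, 28, 28])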
BAM / CBAM Results
RAN / SE / BAM / CBAM Comparison
Network       | Module Position              | Attention
RAN (CVPR17)  | Modified network structure   | Channel x Spatial (3D)
SE (CVPR18)   | In the ResBlock              | Channel
BAM (BMVC18)  | Before the stride-2 ResBlock | Channel + Spatial, parallel
CBAM (ECCV18) | In the ResBlock              | Channel + Spatial, sequential
Non-local Networks
• Represents spatial context only (representation, not recalibration)
(Figure: the non-local block again — Query / Key / Value via 1x1 convs, HW x HW softmax affinity, weighted sum over positions, residual addition back to H x W x C)
Global-Context Attention Networks
(Figure: simplified non-local / global-context block — a 1x1 conv produces a 1 x HW softmax attention map (Key) shared by every position; it pools the HW x C features (Value) into a single C-dimensional global context, which is transformed and added (+) back to all H x W x C positions)
Global-Context Attention Networks
(Figure: same block — the softmax-weighted pooling over positions is the spatial representation step)
Global-Context Attention Networks
(Figure: same block — the spatial representation is followed by a channel-attention transform on the pooled context)
Global-Context Attention Networks
Query-independent representation → recalibration
Global-Context Attention Networks
NLNet:            spatial weighted sum, computed per pixel (H x W attention maps)
Simplified NLNet: spatial weighted sum, shared across pixels (scalar weights)
SENet:            spatial aggregation (global avg pool) → used for channel recalibration
GCNet:            spatial weighted sum → used for channel recalibration
Global-Context Attention Networks
Query-independent representation → recalibration
"GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond"
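A minimal PyTorch sketch of the global-context block as summarized above (my own illustrative version; the reduction ratio and the LayerNorm in the transform are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCBlock(nn.Module):
    """Global-context block sketch: query-independent spatial pooling + channel transform."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(c, 1, 1)                  # Key: one attention logit per position
        self.transform = nn.Sequential(                 # channel transform on the pooled context
            nn.Conv2d(c, c // reduction, 1), nn.LayerNorm([c // reduction, 1, 1]),
            nn.ReLU(inplace=True), nn.Conv2d(c // reduction, c, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        w_attn = F.softmax(self.attn(x).view(b, 1, h * w), dim=-1)         # 1 x HW, shared by all queries
        context = torch.bmm(x.view(b, c, h * w), w_attn.transpose(1, 2))   # B x C x 1 global context
        context = context.view(b, c, 1, 1)
        return x + self.transform(context)              # broadcast-add recalibration term

x = torch.randn(2, 64, 28, 28)
print(GCBlock(64)(x).shape)                             # torch.Size([2, 64, 28, 28])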
Global-Context Attention Networks
Summary
Network             | Attention                     | Spatial Modeling
RAN (CVPR17)        | Channel x Spatial (3D)        | Network
SE (CVPR18)         | Channel                       | Avg pool
BAM (BMVC18)        | Channel + Spatial, parallel   | Avg pool
CBAM (ECCV18)       | Channel + Spatial, sequential | Avg pool + max pool
NLNet (CVPR18)      | Spatial (representation)      | Non-local representation
GCNet (preprint 19) | Channel                       | Non-local representation
CNN x Attention: Other Vision Tasks
Self-Attention GAN
Style Transfer (CVPR19)
"Arbitrary Style Transfer with Style-Attentional Networks"
PSANet (ECCV18) / Context Encoding (CVPR18) / OCNet (2018)
Dual Attention Network (CVPR19)
Criss-Cross Non-local Attention Networks (2019)
Semantic Segmentation
Network                   | Cityscapes mIoU | Structure
DenseASPP (CVPR18)        | 80.6            | DenseNet
PSANet (ECCV18)           | 80.1            | Spatial attention
Context Encoding (CVPR18) | -               | Channel attention
CCNet (arXiv19)           | 81.4            | Fast NL-Net
DANet (CVPR19)            | 81.5            | NL-Net (spatial + channel)
OCNet (arXiv18)           | 81.7            | NL-Net + PSP
Non-local in SISR
Single Image Super-Resolution
Network       | Set5 PSNR (x2 / x4) | Structure
RDN (CVPR18)  | 38.24 / 32.47       | DenseNet
RNAN (ICLR19) | 38.17 / 32.49       | NL-Net
RCAN (ECCV18) | 38.27 / 32.63       | Channel attention
SAN (CVPR19)  | 38.31 / 32.64       | Channel attention + NL-Net
Conclusion
• Attention (Recurrent) vs Self-Attention (Feed-Forward)
• Representation vs Recalibration
• Channel Attention: Simple
• Spatial Attention: Global Information
Thank You
Q&A?

CNN Attention Networks