CNN Attention-based Networks
+ Attention, CNN Review
TensorFlow-KR PR-163, Taeoh Kim
MVP Lab, Yonsei Univ.
Contents
• Attention, Self-Attention in NLP
• CNN Review
• CNN Attention Networks for Recognition
• CNN Attention Networks for Other Vision Tasks
Review 1: Attention
Neural Networks
(Figure: a generic neural network pipeline — Input → Feature / Hidden State / Representation → Prediction)
Attention
(Figure: a Query attends over a set of {Key, Value} pairs to produce the Output)

Weighted sum: y = \sum_i w_i x_i

Attention as a weighted sum: y = \sum_i f(\mathrm{Query}, \mathrm{Key}_i) \times \mathrm{Value}_i
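To make the weighted-sum view concrete, here is a minimal NumPy sketch (not from the slides; the dot-product scoring function f is an assumption) of attention as a softmax-weighted sum of values:

import numpy as np

def attention(query, keys, values):
    """Weighted sum of values, with weights from a softmax over query-key scores."""
    scores = keys @ query                    # f(Query, Key_i): dot-product score per input
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # y = sum_i w_i * Value_i

rng = np.random.default_rng(0)
q = rng.normal(size=8)                       # the query (the "blue state" in the figures below)
K = rng.normal(size=(4, 8))                  # one key per input element
V = rng.normal(size=(4, 8))                  # one value per input element
print(attention(q, K, V).shape)              # (8,)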
Fully Connected Neural Network
(Figure: the Input → Feature / Hidden State / Representation pipeline, with a separate weight w_1, w_2, ..., w_n from every input to the blue unit)

Fully Connected NN: represents the blue unit as a weighted sum of all inputs, with no constraint on which inputs contribute.
Convolutional Neural Network
(Figure: the same pipeline — the blue unit (Q) is connected only to the value inputs (V) around its position, with weights w_1, w_2, w_3)

Convolutional NN: represents the blue unit as a weighted sum of inputs, restricted to a local window around the current position.
Attention
(Figure: the same pipeline — every input carries a (Key, Value) pair, and the blue state attends over all of them with weights w_1, ..., w_n)

Attention: represents the blue state as a weighted sum of all inputs, with weights computed from the blue state and the inputs:

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
ex) Machine Translation (PR-055)
(Figure: NMT — the decoder state for the next Korean word (Korean_1, Korean_2, ...) attends over the English encoder states, each a (Key, Value) pair, with weights w_1, ..., w_n)

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
ex) Image Captioning
(Figure: image captioning — the decoder state for the next caption word (Caption_1, Caption_2, ...) attends over image feature locations, each a (Key, Value) pair, with weights w_1, ..., w_n)

w_n = \mathrm{softmax}\big(f(\mathrm{BlueState}, K_n)\big)
Self-Attention
(Figure: the same pipeline — each input carries a (Key, Value) pair and attends over all inputs, including itself, with weights w_1, ..., w_n)

Self-Attention: represents the blue unit as a weighted sum of all inputs, with weights computed from the input itself and the other inputs:

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Transformer (Enc) (PR-049, PR-161)
(Figure: Transformer encoder — English tokens self-attend over all English tokens to produce the English features)

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Transformer (Dec)
(Figure: Transformer decoder — Korean_{n+1} is predicted from the already-generated Korean_{1:n} via self-attention, followed by attention over the English features, which serve as the (Key, Value) pairs)

Self-attention over previous outputs: w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
Encoder-decoder attention over English features: w_n = \mathrm{softmax}\big(f(K_2, E_n)\big)
CNN Self-Attention for Image = Representation
(Figure: the same self-attention pattern applied across image positions for image recognition)

w_n = \mathrm{softmax}\big(f(K_2, K_n)\big)
ex) Non-local Neural Networks (CVPR18, PR-083)
y_i = \frac{1}{\sum_j \exp\big(\theta(x_i)^{T}\,\phi(x_j)\big)} \sum_j \exp\big(\theta(x_i)^{T}\,\phi(x_j)\big)\, W_g\, x_j

(Figure: non-local block — 1x1 convs produce Query, Key, and Value from the H x W x C feature map; after reshaping to HW x C, Query x Key^T gives an HW x HW affinity matrix, which is softmax-normalized and multiplied with the Value to give HW x C, then reshaped back to H x W x C and added (+) to the input)
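The block above maps to a few reshapes and matrix multiplications. Below is a minimal PyTorch sketch (my own illustrative version, not the authors' code; the class name and the C/2 embedding width are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal (embedded-Gaussian) non-local block sketch."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2                # reduced embedding width
        self.theta = nn.Conv2d(channels, inner, 1)    # Query
        self.phi   = nn.Conv2d(channels, inner, 1)    # Key
        self.g     = nn.Conv2d(channels, inner, 1)    # Value (W_g)
        self.out   = nn.Conv2d(inner, channels, 1)    # project back to C

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.phi(x).flatten(2)                    # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)      # B x HW x C'
        attn = F.softmax(q @ k, dim=-1)               # B x HW x HW affinity
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection

x = torch.randn(2, 64, 14, 14)
print(NonLocalBlock(64)(x).shape)                     # torch.Size([2, 64, 14, 14])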
CNN Simplified-Attention for Image = Recalibration
(Figure: simplified attention — instead of a full weighted sum, each unit is rescaled by a single gate computed from the inputs; example task: image recognition)

w_2 = \mathrm{sigmoid}\big(f(K_2, K_n)\big)
CNN Simplified-Attention for Image = Recalibration
ex) Squeeze-and-Excitation Networks (CVPR18)
Summary
Attention       | Query          | Structure    | Objective      | Examples
Attention       | Current states | Recurrent    | Representation | NMT, Captioning, VQA
Self-Attention  | Input itself   | Feed-forward | Representation | Transformer, Non-local NN
Self-Attention  | Input itself   | Feed-forward | Recalibration  | SE-Net, RAN, CBAM
Review 2: CNN Networks
CNN Review
(Timeline of CNN architectures, each labeled "publication date / citation count", with a "Today" marker at the end:
AlexNet 2012 / 39646; VGG 2014.09 / 22554; GoogleNet 2014.09 / 13233; ResNet 2015.12 / 21871; Inception-2,3 2015.12 / 3752; Inception-4 / Inception-ResNet 2016.02 / 2152; ResNetv2 2016.03 / 1926; WideResNet 2016.05 / 1063; DenseNet 2016.08 / 3591; Xception 2016.10 / 847; ResNeXt 2016.11 / 913; MobileNet 2017.04 / 1553; Residual Attention Net 2017.04 / 257; NASNet 2017.07 / 417; ShuffleNet 2017.07 / 407; SE-Net 2017.09 / 724; Non-local Network 2017.11 / 240; CBAM 2018.07 / 54; GCNet 2019.04 / -)
Plain Networks
(Timeline highlight: AlexNet (ILSVRC12) and VGG (ILSVRC14))
• Plain networks using max-pooling for downsampling
• Lower performance; large parameter and operation counts
ResNet
(Timeline highlight: ResNet (ILSVRC15))
• Deeper networks enabled by skip connections
(Figure: a stack of Res Blocks)
ResNet ResBlock
(Figure: bottleneck ResBlock — 1x1 conv (c → c/4), 3x3 conv (c/4 → c/4), 1x1 conv (c/4 → c), each with BatchNorm and ReLU, repeated xN with a skip connection)
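For reference, a minimal PyTorch sketch of this bottleneck ResBlock (illustrative only; it follows the c → c/4 → c/4 → c channel pattern in the figure and the post-activation ordering of the original ResNet):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block sketch: 1x1 -> 3x3 -> 1x1 plus a skip connection."""
    def __init__(self, c):
        super().__init__()
        mid = c // 4
        self.body = nn.Sequential(
            nn.Conv2d(c, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1, bias=False), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity skip connection

x = torch.randn(2, 256, 56, 56)
print(Bottleneck(256)(x).shape)              # torch.Size([2, 256, 56, 56])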
ResNet Variants
(Timeline highlight: ResNetv2)
• Pre-activation ResNet: BatchNorm and ReLU are applied before each conv in the 1x1 (c → c/4), 3x3 (c/4), 1x1 (c/4 → c) bottleneck
ResNet Variants
(Timeline highlight: WideResNet)
• Wider-channel ResNet: a block of two 3x3 convs, with channels widened from c to c x k
ResNet with Cardinality
(Timeline highlight: ResNeXt, Xception, MobileNet, ShuffleNet)
• Cardinality = group convolution
• PR-034: Xception
• PR-044: MobileNet
• PR-054: ShuffleNet / ResNeXt
• These works modify the convolution operator itself
ResNeXt (C=32)
(Figure: ResNeXt block with cardinality 32 — the 256-channel input is split into 32 parallel branches, each 1x1 (256 → 4), 3x3 (4 → 4), 1x1 (4 → 256); the branch outputs are summed (+) and added to the input)
ResNeXt (C=32) with Group Conv.
(Figure: the same block expressed with group convolution — 1x1 (256 → 128), a 3x3 group conv with 32 groups of 4 channels, concatenation back to 128 channels, then 1x1 (128 → 256))
ResNeXt (C=32)
(Figure: same block, annotated — each branch ends in a (1x1x4) projection and the 32 branch outputs are summed)
ResNeXt (C=32) with Group Conv.
(Figure: same block in group-conv form, annotated — concatenating the groups and applying one 1x1x128 conv performs that same summation)
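A minimal PyTorch sketch of the grouped-convolution form of this block (my own; the 256 / 128 / 32 numbers follow the figure), where groups=32 implements the 32 parallel branches in a single conv:

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck sketch: cardinality expressed as a grouped 3x3 conv."""
    def __init__(self, c=256, width=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # one grouped conv = 32 parallel 3x3 convs on 4-channel slices, concatenated
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, c, 1, bias=False), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

x = torch.randn(2, 256, 56, 56)
print(ResNeXtBlock()(x).shape)               # torch.Size([2, 256, 56, 56])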
Xception (G=Channel)
(Figure: the same structure with groups = channels — 1x1 (256 → 128), a depthwise 3x3 conv (one group of size 1 per channel), concatenation, then 1x1 (128 → 256))
MobileNet / ShuffleNet
• MobileNet: a lightweight Xception-style network
• Xception: 1x1 conv → 3x3 depthwise conv
• MobileNet: 3x3 depthwise conv → 1x1 conv
• ShuffleNet: a lightweight ResNeXt with channel shuffle; a sketch of the depthwise-separable idea follows below
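Both designs build on the depthwise-separable convolution. A minimal PyTorch sketch (illustrative, not MobileNet's published code) using MobileNet's ordering of a 3x3 depthwise conv followed by a 1x1 pointwise conv:

import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """MobileNet-style unit sketch: 3x3 depthwise conv, then 1x1 pointwise conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c_in), nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))      # spatial filtering per channel
        return self.relu(self.bn2(self.pointwise(x)))   # cross-channel mixing

x = torch.randn(2, 64, 32, 32)
print(DepthwiseSeparable(64, 128)(x).shape)             # torch.Size([2, 128, 32, 32])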
DenseNet
(Timeline highlight: DenseNet)
• DenseNet: each layer concatenates the outputs of all previous layers (PR-028)
(Figure: a stack of Dense Blocks)
DenseBlock
(Figure: inside a Dense Block — each 1x1 + 3x3 conv layer takes the concatenation of all previous feature maps as its input)
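A minimal PyTorch sketch of this dense connectivity (my own; the growth rate, bottleneck width, and layer count are assumptions):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block sketch: every layer sees the concatenation of all earlier outputs."""
    def __init__(self, c_in, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        c = c_in
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, 4 * growth_rate, 1, bias=False),                        # 1x1 bottleneck
                nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
                nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),   # 3x3 conv
            ))
            c += growth_rate                              # input width grows with each concat

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # concat all previous outputs
        return torch.cat(feats, dim=1)

x = torch.randn(2, 64, 32, 32)
print(DenseBlock(64)(x).shape)                            # torch.Size([2, 192, 32, 32])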
Inception / NASNet
(Timeline highlight: GoogleNet (ILSVRC14), Inception-2,3, Inception-4 / Inception-ResNet, NASNet)
• Heavily engineered networks
• PR-034: Inception
• PR-069: NASNet
CNN Performances
https://github.com/CeLuigi/models-comparison.pytorch
CNN Review
Category    | Networks                                    | Pros                          | Cons
Plain       | AlexNet, VGG                                | Simple, good transfer         | Low performance
ResNet      | ResNet                                      | Simple                        |
Cardinality | ResNeXt / Xception / MobileNet / ShuffleNet | Cost-efficient, + performance | Group conv
DenseNet    | DenseNet                                    | Cost-efficient, + performance | Memory I/O
Engineering | Inception, NASNet                           | SoTA                          | Complex
CNN Review (with attention modules added)

Category         | Networks            | Pros                  | Cons
Attention Module | SE-Net, CBAM, GCNet | Simple, + performance |
CNN Review
CNN x Attention
CNN Attention-Networks
(The same timeline, now highlighting the attention-based networks covered next: Residual Attention Net 2017.04, SE-Net 2017.09 (ILSVRC17 winner), Non-local Network 2017.11, CBAM 2018.07, GCNet 2019.04)
Spatial Transformer Networks (NIPS15, PR-011)
• Recalibration (with a learned spatial transform)
Residual Attention Network (CVPR17)
Residual Attention Network (CVPR17)
(Figure: the original ResNet bottleneck block — Conv(1x1) → Conv(3x3) → Conv(1x1) on c channels, plus a residual addition)
• Original ResNet (bottleneck block)
Residual Attention Network (CVPR17)
(Figure: the same bottleneck, but its output is multiplied (x) by a sigmoid mask produced by a separate mask-generation network before the residual addition)
• Recalibrates both spatial positions and channels with a full 3D mask (a sketch follows below)
• Despite the name, the attention path itself is not residual
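A minimal sketch of the gating pattern described above (my own heavy simplification: the paper's mask branch is a multi-scale bottom-up/top-down network, reduced here to two convs just to show the sigmoid mask multiplying the trunk branch):

import torch
import torch.nn as nn

class MaskedBottleneck(nn.Module):
    """Sketch of mask-based recalibration: a sigmoid mask gates the trunk branch."""
    def __init__(self, c):
        super().__init__()
        mid = c // 4
        self.trunk = nn.Sequential(                      # ordinary bottleneck branch
            nn.Conv2d(c, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1),
        )
        self.mask = nn.Sequential(                       # simplified stand-in for the mask network
            nn.Conv2d(c, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, c, 1), nn.Sigmoid(),          # per-position, per-channel mask in [0, 1]
        )

    def forward(self, x):
        return x + self.trunk(x) * self.mask(x)          # gate the trunk, then add the identity

x = torch.randn(2, 64, 28, 28)
print(MaskedBottleneck(64)(x).shape)                     # torch.Size([2, 64, 28, 28])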
Residual Attention Network (CVPR17)
• Results: interpretable features
Squeeze-and-Excitation Networks (CVPR18)
(Figure: SE block — after the bottleneck's Conv(1x1) → Conv(3x3) → Conv(1x1), a global AvgPool squeezes the c-channel map to a vector; two 1x1 convs (c → c/16 → c) with a ReLU between them and a Sigmoid at the end produce per-channel weights that multiply (x) the features before the residual addition (+))
• Recalibrates channels only
Squeeze-and-Excitation Networks (CVPR18)
(Figure: same block — the global AvgPool collapses each channel's spatial map to a single value: a per-channel global spatial context, the Squeeze step)
Squeeze-and-Excitation Networks (CVPR18)
(Figure: same block — the 1x1 convs on the pooled vector model inter-channel relationships: the Excitation step)
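A minimal PyTorch sketch of the SE block (my own; the reduction ratio of 16 follows the c/16 in the figure):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: global pooling -> per-channel gating weights."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),   # Excitation: model
            nn.Linear(c // reduction, c), nn.Sigmoid(),            # inter-channel relations
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # Squeeze: global average pool -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)    # per-channel weights in [0, 1]
        return x * w                       # recalibrate channels only

x = torch.randn(2, 64, 28, 28)
print(SEBlock(64)(x).shape)                # torch.Size([2, 64, 28, 28])

In the network this module sits inside the ResBlock, rescaling the block's output just before the residual addition.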
Squeeze-and-Excitation Networks (CVPR18)
Squeeze-and-Excitation Networks (CVPR18)
Squeeze-and-Excitation Networks (CVPR18)
• Channel recalibration statistics
Bottleneck Attention Module (BAM, BMVC18)
(Figure: BAM — alongside the ResNet bottleneck, a channel-attention branch (global AvgPool followed by 1x1 convs with reduction ratio r) and a spatial-attention branch (1x1 reduction, 3x3 dilated convs, 1x1) are computed in parallel, summed (+), passed through a sigmoid, and used to rescale (x) the feature map, which is then added (+) back)
Convolutional Block Attention Module (CBAM, ECCV18)
(Figure: CBAM — channel attention first: global AvgPool and MaxPool descriptors pass through a shared 1x1 MLP (c → c/16 → c), are summed and sigmoid-gated to reweight channels; then spatial attention: channel-wise AvgPool and MaxPool maps are concatenated and passed through a 7x7 conv and sigmoid to reweight positions; applied sequentially inside the ResBlock)
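A minimal PyTorch sketch of the two CBAM sub-modules applied sequentially (my own; reduction 16 and the 7x7 spatial kernel follow the figure):

import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM sketch: channel attention followed by spatial attention."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for avg- and max-pooled descriptors
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True), nn.Linear(c // reduction, c),
        )
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)   # 7x7 conv over the [avg; max] maps

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: pooled descriptors -> shared MLP -> sigmoid gate
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: channel-wise avg and max maps -> 7x7 conv -> sigmoid gate
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 64, 28, 28)
print(CBAM(64)(x).shape)                               # torch.Size([2, 64, 28, 28])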
BAM / CBAM Results
RAN / SE / BAM / CBAM Comparison
Network       | Module Position              | Attention
RAN (CVPR17)  | Modified network structure   | Channel x Spatial (3D)
SE (CVPR18)   | In the ResBlock              | Channel
BAM (BMVC18)  | Before the stride-2 ResBlock | Channel + Spatial, parallel
CBAM (ECCV18) | In the ResBlock              | Channel + Spatial, sequential
Non-local Networks
• Represents spatial context only (representation, not recalibration)
(Figure: the non-local block again — Query / Key / Value via 1x1 convs, HW x HW softmax affinity, weighted sum over positions, residual addition back to H x W x C)
Global-Context Attention Networks
(Figure: simplified non-local / global-context block — a 1x1 conv produces a 1 x HW softmax attention map (Key) shared by every position; it pools the HW x C features (Value) into a single C-dimensional global context, which is transformed and added (+) back to all H x W x C positions)
Global-Context Attention Networks
(Figure: same block — the softmax-weighted pooling over positions is the spatial representation step)
Global-Context Attention Networks
(Figure: same block — the spatial representation is followed by a channel-attention transform on the pooled context)
Global-Context Attention Networks
Query-independent representation → recalibration
Global-Context Attention Networks
NLNet:            spatial weighted sum, computed per pixel (H x W attention maps)
Simplified NLNet: spatial weighted sum, shared across pixels (scalar weights)
SENet:            spatial aggregation (global avg pool) → used for channel recalibration
GCNet:            spatial weighted sum → used for channel recalibration
Global-Context Attention Networks
Query-independent representation → recalibration
"GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond"
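A minimal PyTorch sketch of the global-context block as summarized above (my own illustrative version; the reduction ratio and the LayerNorm in the transform are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCBlock(nn.Module):
    """Global-context block sketch: query-independent spatial pooling + channel transform."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(c, 1, 1)                  # Key: one attention logit per position
        self.transform = nn.Sequential(                 # channel transform on the pooled context
            nn.Conv2d(c, c // reduction, 1), nn.LayerNorm([c // reduction, 1, 1]),
            nn.ReLU(inplace=True), nn.Conv2d(c // reduction, c, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        w_attn = F.softmax(self.attn(x).view(b, 1, h * w), dim=-1)         # 1 x HW, shared by all queries
        context = torch.bmm(x.view(b, c, h * w), w_attn.transpose(1, 2))   # B x C x 1 global context
        context = context.view(b, c, 1, 1)
        return x + self.transform(context)              # broadcast-add recalibration term

x = torch.randn(2, 64, 28, 28)
print(GCBlock(64)(x).shape)                             # torch.Size([2, 64, 28, 28])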
Global-Context Attention Networks
Summary
Network             | Attention                     | Spatial Modeling
RAN (CVPR17)        | Channel x Spatial (3D)        | Network
SE (CVPR18)         | Channel                       | Avg pool
BAM (BMVC18)        | Channel + Spatial, parallel   | Avg pool
CBAM (ECCV18)       | Channel + Spatial, sequential | Avg pool + max pool
NLNet (CVPR18)      | Spatial (representation)      | Non-local representation
GCNet (preprint 19) | Channel                       | Non-local representation
CNN x Attention: Other Vision Tasks
Self-Attention GAN
Style Transfer (CVPR19)
"Arbitrary Style Transfer with Style-Attentional Networks"
PSANet (ECCV18) / Context Encoding (CVPR18) / OCNet (2018)
Dual Attention Network (CVPR19)
Criss-Cross Non-local Attention Networks (2019)
Semantic Segmentation
Network                   | Cityscapes mIoU | Structure
DenseASPP (CVPR18)        | 80.6            | DenseNet
PSANet (ECCV18)           | 80.1            | Spatial attention
Context Encoding (CVPR18) | -               | Channel attention
CCNet (arXiv19)           | 81.4            | Fast NL-Net
DANet (CVPR19)            | 81.5            | NL-Net (spatial + channel)
OCNet (arXiv18)           | 81.7            | NL-Net + PSP
Non-local in SISR
Single Image Super-Resolution
Network       | Set5 PSNR (x2 / x4) | Structure
RDN (CVPR18)  | 38.24 / 32.47       | DenseNet
RNAN (ICLR19) | 38.17 / 32.49       | NL-Net
RCAN (ECCV18) | 38.27 / 32.63       | Channel attention
SAN (CVPR19)  | 38.31 / 32.64       | Channel attention + NL-Net
Conclusion
• Attention (Recurrent) vs Self-Attention (Feed-Forward)
• Representation vs Recalibration
• Channel Attention: Simple
• Spatial Attention: Global Information
Thank You
Q&A?

CNN Attention Networks