This document presents ResNeSt, a split-attention network that divides feature maps into groups and applies channel-wise attention across those groups. It outperforms ResNet variants on image classification, object detection, instance segmentation, and semantic segmentation while maintaining the same computational efficiency. The slides introduce ResNeSt's Split-Attention block and its training strategies, including large mini-batch training, data augmentation, and regularization. Evaluation shows ResNeSt achieves state-of-the-art accuracy on ImageNet and downstream tasks using less computation than NAS-derived models.
2. Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
3. ResNeSt
Introduction – Proposal
• While image classification models have continued to advance, most downstream applications still employ ResNet as the backbone network.
• NAS-derived models are usually not optimized for training efficiency or memory usage.
• Recent image classification networks have focused more on group or depth-wise convolution, but these methods do not transfer well to other tasks because they lack cross-channel relationships.
[Figures: Depth-wise Convolution; Neural Architecture Search]
4. ResNeSt
Introduction – Contributions
• Explores a simple architectural modification of ResNet.
→ Requires no additional computation and is easy to adopt as a backbone for other vision tasks.
• Sets large-scale benchmarks on image classification and transfer learning applications.
→ Tested on image classification, object detection, instance segmentation, and semantic segmentation.
• ResNeSt outperforms all existing ResNet variants with the same computational efficiency, and even achieves better speed-accuracy trade-offs than SOTA NAS-derived models.
5. Related Work
Multi-path and Feature-Map Attention
• Multi-path representation has shown success in GoogLeNet.
• ResNeXt adopted group convolution in the ResNet bottleneck block, which converts the multi-path structure into a unified operation.
• SE-Net introduced a channel-attention mechanism.
• SK-Net brings feature-map attention across two network branches.
[Figures: Group Convolution; Inception Block; Squeeze-and-Excitation Block]
6. Related Work
• ResNeSt generalizes the channel-wise attention into a feature-map group representation.
[Figure: Split-Attention block]
7. Methods and Experiments
Split-Attention Networks
• Features are divided into several groups
- Cardinality hyperparameter: K
- Radix hyperparameter: R
- Total number of feature groups: G = RK
• Element-wise summation across multiple splits → feature-map groups with the same cardinality index but different radix indices are fused together
• Global contextual information with embedded channel-wise statistics can be gathered with global average pooling (GAP)
• Two consecutive FC layers are added to predict the attention weights for each split (see the sketch below)
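Since the slide describes the block only verbally, a minimal PyTorch sketch of the Split-Attention operation may help. This is not the official ResNeSt implementation: the layer names, the reduction factor of 4, and the exact tensor layout are assumptions, and the official code additionally uses a sigmoid instead of softmax when radix = 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Minimal Split-Attention sketch: radix R splits per cardinal group K.

    Assumes `channels` is divisible by radix * cardinality.
    """

    def __init__(self, channels, radix=2, cardinality=1, reduction=4):
        super().__init__()
        self.radix = radix
        inter_channels = max(channels * radix // reduction, 32)
        # One grouped 3x3 conv produces all G = R*K feature-map groups at once.
        self.conv = nn.Conv2d(channels, channels * radix, kernel_size=3,
                              padding=1, groups=cardinality * radix, bias=False)
        self.bn = nn.BatchNorm2d(channels * radix)
        # Two consecutive FC layers (as 1x1 convs) predict per-split weights.
        self.fc1 = nn.Conv2d(channels, inter_channels, 1, groups=cardinality)
        self.fc2 = nn.Conv2d(inter_channels, channels * radix, 1,
                             groups=cardinality)

    def forward(self, x):
        b = x.size(0)
        x = F.relu(self.bn(self.conv(x)))                  # (B, R*C, H, W)
        splits = x.view(b, self.radix, -1, *x.shape[2:])   # (B, R, C, H, W)
        fused = splits.sum(dim=1)                          # element-wise sum over splits
        gap = F.adaptive_avg_pool2d(fused, 1)              # global average pooling
        att = self.fc2(F.relu(self.fc1(gap)))              # (B, R*C, 1, 1)
        att = F.softmax(att.view(b, self.radix, -1, 1, 1), dim=1)  # softmax over radix
        return (att * splits).sum(dim=1)                   # weighted fusion, (B, C, H, W)
```

With radix = 2 and cardinality = 1, this corresponds to the 2s1x64d configuration discussed in the ablation studies later in the slides.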
8. Methods and Experiments
Network Tweaks
• Average Downsampling
→ Strided convolution with zero padding is suboptimal for preserving spatial information. Instead of using strided convolution at the transitioning block, use an average pooling layer.
• Tweaks from ResNet-D (see the sketch below)
→ The first 7x7 convolutional layer is replaced with three consecutive 3x3 layers, which have the same receptive field size at a similar computational cost.
→ A 2x2 average pooling layer is added to the shortcut connection prior to the 1x1 convolutional layer for the transitioning blocks.
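A brief sketch of both tweaks, assuming the stem widths commonly used in ResNet-D style implementations (the exact channel counts are assumptions, not taken from the slides):

```python
import torch.nn as nn

def deep_stem(in_ch=3, stem_width=32):
    """ResNet-D stem: three consecutive 3x3 convs replacing the single 7x7 conv,
    keeping the same receptive field at a similar computational cost."""
    return nn.Sequential(
        nn.Conv2d(in_ch, stem_width, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(stem_width), nn.ReLU(inplace=True),
        nn.Conv2d(stem_width, stem_width, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_width), nn.ReLU(inplace=True),
        nn.Conv2d(stem_width, stem_width * 2, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_width * 2), nn.ReLU(inplace=True),
    )

def downsample_shortcut(in_ch, out_ch, stride=2):
    """Shortcut of a transitioning block: average pooling before the 1x1 conv,
    instead of a strided 1x1 conv that discards spatial information."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=stride, ceil_mode=True),
        nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```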
9. Methods and Experiments
Training Strategy
• Large Mini-batch Distributed Training
→ Used cosine learning-rate scheduling, and linearly scaled up the initial learning rate based on the mini-batch size B: η = (B / 256) × 0.1
• Label Smoothing
• Auto Augmentation
→ First introduces 16 different types of image transformations, from which 24 different combinations (policies) are built. One of the 24 policies is randomly chosen and applied to each sample image during training.
• Mixup Training
→ Weighted combinations of random image pairs (and their labels) from the training data (see the sketch below).
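A minimal sketch of two of these ingredients: the linear learning-rate scaling rule stated on the slide, and mixup. The Beta-distribution parameter alpha and the two-term loss combination are standard mixup conventions, assumed here rather than taken from the slides.

```python
import numpy as np
import torch

def scaled_base_lr(batch_size):
    """Linear scaling rule from the slide: eta = (B / 256) * 0.1."""
    return batch_size / 256 * 0.1

def mixup_batch(images, labels, alpha=0.2):
    """Mixup sketch: blend a batch with a shuffled copy of itself.

    The training loss is combined the same way:
    lam * loss(pred, labels) + (1 - lam) * loss(pred, labels[perm]).
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam
```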
10. Methods and Experiments
Training Strategy
• Large Crop Size
→ EfficientNet demonstrated that increasing the input image size for a deeper and wider network can yield a better accuracy-vs-FLOPS trade-off.
→ Used different crop sizes for the input image: 224 and 256.
• Regularization
→ Dropout with probability 0.2 is applied.
→ DropBlock layers are also applied to the convolutional layers in the last two stages of the network; DropBlock is more effective than dropout for regularizing convolutional layers specifically (see the sketch below).
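Since the slide only names DropBlock, a minimal functional sketch may clarify how it differs from dropout: instead of zeroing independent activations, it zeroes contiguous spatial blocks. The seed-probability (gamma) formula follows the DropBlock paper; block_size = 7 is an assumed default, not a value from the slides.

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7, training=True):
    """Zero out contiguous block_size x block_size regions of a feature map."""
    if not training or drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # Seed probability chosen so the expected dropped fraction equals drop_prob.
    gamma = (drop_prob / block_size ** 2) * (h * w) / (
        (h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand(x.size(0), 1, h, w, device=x.device) < gamma).float()
    # Grow each seed into a full block via max pooling, then invert to a keep-mask.
    block_mask = 1.0 - F.max_pool2d(seeds, kernel_size=block_size,
                                    stride=1, padding=block_size // 2)
    # Rescale surviving activations to preserve the expected magnitude.
    return x * block_mask * block_mask.numel() / block_mask.sum().clamp(min=1.0)
```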
11. Methods and Experiments
Main Results – Image Classification
12. Methods and Experiments
Main Results – Image Classification
* ResNeSt-200: 256 × 256, ResNeSt-269: 320 × 320
* Bicubic upsampling is employed for input sizes greater than 256.
* Results show that depth-wise convolution is not optimized for inference speed.
13. Methods and Experiments
Main Results – Ablation Studies
* Increasing the radix from 0 to 4 continuously improved top-1 accuracy, while also increasing latency and memory usage.
* Finally employed the 2s1x64d setting (radix 2, cardinality 1, width 64) as a good trade-off between speed and accuracy.
14. Methods and Experiments
Main Results – Object Detection
* Tested on the MS-COCO validation set.
15. Methods and Experiments
Main Results – Instance Segmentation
16. Methods and Experiments
Main Results – Semantic Segmentation
17. Conclusion
• The ResNeSt architecture proposes a novel Split-Attention block that universally improves the learned feature representations to boost performance.
• On downstream tasks, simply switching the backbone network to ResNeSt yields substantially better results.
• Depth-wise convolution is not optimal for training and inference efficiency on GPUs.
• Model accuracy gets saturated on ImageNet with a fixed input image size.
• Increasing the input image size yields a better accuracy-vs-FLOPS trade-off.