This document presents ResNeSt, a split-attention network that divides feature maps into groups and applies channel-wise attention across those groups. It outperforms ResNet variants on image classification, object detection, instance segmentation, and semantic segmentation while maintaining the same computational efficiency. The slides introduce ResNeSt's Split-Attention block and its training strategies, including large mini-batch training, data augmentation, and regularization. Evaluation shows ResNeSt achieves state-of-the-art accuracy on ImageNet and downstream tasks using less computation than NAS-derived models.
2. Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
3. ResNeSt
Introduction – Proposal
• While image classification models have continued to advance, most downstream applications still employ ResNet as the backbone network.
• NAS-derived models are usually not optimized for training efficiency or memory usage.
• Recent image classification networks have focused more on group or depth-wise convolution, but these methods do not transfer well to other tasks because they lack cross-channel relationships.
[Figures: Depth-wise Convolution; Neural Architecture Search]
4. ResNeSt
Introduction – Contributions
• Explores a simple architectural modification of ResNet.
→ Requires no additional computation and is easy to adopt as a backbone for other vision tasks.
• Sets large-scale benchmarks on image classification and transfer learning applications.
→ Tested on image classification, object detection, instance segmentation, and semantic segmentation.
• ResNeSt outperforms all existing ResNet variants with the same computational efficiency, and even achieves better speed-accuracy trade-offs than SOTA NAS-derived models.
5. Related Work
Multi-path and Feature-Map Attention
• Multi-path representation has shown success in GoogLeNet.
• ResNeXt adopted group convolution in the ResNet bottleneck block, which converts the multi-path structure into a unified operation.
• SE-Net introduced a channel-attention mechanism.
• SK-Net brings feature-map attention across two network branches.
[Figures: Group Convolution; Inception Block; Squeeze-and-Excitation Block]
6. Related Work
• ResNeSt generalizes the channel-wise attention into a feature-map group representation.
[Figure: Split-Attention block]
7. Methods and Experiments
Split-Attention Networks
• Features are divided into several groups
- Cardinality hyperparameter: K
- Radix hyperparameter: R
- Total number of feature groups: G = RK
• Element-wise summation across multiple splits → feature-map groups with the same cardinality index but different radix indices are fused together
• Global contextual information with embedded channel-wise statistics can be gathered with global average pooling (GAP)
• Two consecutive FC layers are added to predict the attention weights for each split (see the sketch below)
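Since the slide describes the block only verbally, a minimal PyTorch sketch of the Split-Attention operation may help. This is not the official ResNeSt implementation: the layer names, the reduction factor of 4, and the exact tensor layout are assumptions, and the official code additionally uses a sigmoid instead of softmax when radix = 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Minimal Split-Attention sketch: radix R splits per cardinal group K.

    Assumes `channels` is divisible by radix * cardinality.
    """

    def __init__(self, channels, radix=2, cardinality=1, reduction=4):
        super().__init__()
        self.radix = radix
        inter_channels = max(channels * radix // reduction, 32)
        # One grouped 3x3 conv produces all G = R*K feature-map groups at once.
        self.conv = nn.Conv2d(channels, channels * radix, kernel_size=3,
                              padding=1, groups=cardinality * radix, bias=False)
        self.bn = nn.BatchNorm2d(channels * radix)
        # Two consecutive FC layers (as 1x1 convs) predict per-split weights.
        self.fc1 = nn.Conv2d(channels, inter_channels, 1, groups=cardinality)
        self.fc2 = nn.Conv2d(inter_channels, channels * radix, 1,
                             groups=cardinality)

    def forward(self, x):
        b = x.size(0)
        x = F.relu(self.bn(self.conv(x)))                  # (B, R*C, H, W)
        splits = x.view(b, self.radix, -1, *x.shape[2:])   # (B, R, C, H, W)
        fused = splits.sum(dim=1)                          # element-wise sum over splits
        gap = F.adaptive_avg_pool2d(fused, 1)              # global average pooling
        att = self.fc2(F.relu(self.fc1(gap)))              # (B, R*C, 1, 1)
        att = F.softmax(att.view(b, self.radix, -1, 1, 1), dim=1)  # softmax over radix
        return (att * splits).sum(dim=1)                   # weighted fusion, (B, C, H, W)
```

With radix = 2 and cardinality = 1, this corresponds to the 2s1x64d configuration discussed in the ablation studies later in the slides.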
8. Methods and Experiments
Network Tweaks
• Average Downsampling
→ Strided convolution with zero padding is suboptimal for preserving spatial information. Instead of using strided convolution at the transitioning block, use an average pooling layer.
• Tweaks from ResNet-D (see the sketch below)
→ The first 7x7 convolutional layer is replaced with three consecutive 3x3 layers, which have the same receptive field size at a similar computational cost.
→ A 2x2 average pooling layer is added to the shortcut connection prior to the 1x1 convolutional layer for the transitioning blocks.
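A brief sketch of both tweaks, assuming the stem widths commonly used in ResNet-D style implementations (the exact channel counts are assumptions, not taken from the slides):

```python
import torch.nn as nn

def deep_stem(in_ch=3, stem_width=32):
    """ResNet-D stem: three consecutive 3x3 convs replacing the single 7x7 conv,
    keeping the same receptive field at a similar computational cost."""
    return nn.Sequential(
        nn.Conv2d(in_ch, stem_width, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(stem_width), nn.ReLU(inplace=True),
        nn.Conv2d(stem_width, stem_width, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_width), nn.ReLU(inplace=True),
        nn.Conv2d(stem_width, stem_width * 2, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_width * 2), nn.ReLU(inplace=True),
    )

def downsample_shortcut(in_ch, out_ch, stride=2):
    """Shortcut of a transitioning block: average pooling before the 1x1 conv,
    instead of a strided 1x1 conv that discards spatial information."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=stride, ceil_mode=True),
        nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```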
9. Methods and Experiments
Training Strategy
• Large Mini-batch Distributed Training
→ Used cosine learning-rate scheduling, and linearly scaled up the initial learning rate based on the mini-batch size B: η = (B / 256) × 0.1
• Label Smoothing
• Auto Augmentation
→ First introduces 16 different types of image transformations, from which 24 different combinations (policies) are built. One of the 24 policies is randomly chosen and applied to each sample image during training.
• Mixup Training
→ Weighted combinations of random image pairs (and their labels) from the training data (see the sketch below).
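A minimal sketch of two of these ingredients: the linear learning-rate scaling rule stated on the slide, and mixup. The Beta-distribution parameter alpha and the two-term loss combination are standard mixup conventions, assumed here rather than taken from the slides.

```python
import numpy as np
import torch

def scaled_base_lr(batch_size):
    """Linear scaling rule from the slide: eta = (B / 256) * 0.1."""
    return batch_size / 256 * 0.1

def mixup_batch(images, labels, alpha=0.2):
    """Mixup sketch: blend a batch with a shuffled copy of itself.

    The training loss is combined the same way:
    lam * loss(pred, labels) + (1 - lam) * loss(pred, labels[perm]).
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam
```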
10. Methods and Experiments
Training Strategy
• Large Crop Size
→ EfficientNet demonstrated that increasing the input image size for a deeper and wider network can yield a better accuracy-vs-FLOPS trade-off.
→ Used different crop sizes for the input image: 224 and 256.
• Regularization
→ Dropout with probability 0.2 is applied.
→ DropBlock layers are also applied to the convolutional layers in the last two stages of the network; DropBlock is more effective than dropout for regularizing convolutional layers specifically (see the sketch below).
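Since the slide only names DropBlock, a minimal functional sketch may clarify how it differs from dropout: instead of zeroing independent activations, it zeroes contiguous spatial blocks. The seed-probability (gamma) formula follows the DropBlock paper; block_size = 7 is an assumed default, not a value from the slides.

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7, training=True):
    """Zero out contiguous block_size x block_size regions of a feature map."""
    if not training or drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # Seed probability chosen so the expected dropped fraction equals drop_prob.
    gamma = (drop_prob / block_size ** 2) * (h * w) / (
        (h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand(x.size(0), 1, h, w, device=x.device) < gamma).float()
    # Grow each seed into a full block via max pooling, then invert to a keep-mask.
    block_mask = 1.0 - F.max_pool2d(seeds, kernel_size=block_size,
                                    stride=1, padding=block_size // 2)
    # Rescale surviving activations to preserve the expected magnitude.
    return x * block_mask * block_mask.numel() / block_mask.sum().clamp(min=1.0)
```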
11. Methods and Experiments
Main Results – Image Classification
12. Methods and Experiments
Main Results – Image Classification
* ResNeSt-200: 256 × 256, ResNeSt-269: 320 × 320
* Bicubic upsampling is employed for input sizes greater than 256.
* Results show that depth-wise convolution is not optimized for inference speed.
13. Methods and Experiments
Main Results – Ablation Studies
* Increasing the radix from 0 to 4 continuously improved top-1 accuracy, while also increasing latency and memory usage.
* Finally employed the 2s1x64d setting (radix 2, cardinality 1, width 64) as a good trade-off between speed and accuracy.
14. Methods and Experiments
Main Results – Object Detection
* Tested on the MS-COCO validation set.
15. Methods and Experiments
Main Results – Instance Segmentation
16. Methods and Experiments
Main Results – Semantic Segmentation
17. Conclusion
• The ResNeSt architecture proposes a novel Split-Attention block that universally improves the learned feature representations to boost performance.
• On downstream tasks, simply switching the backbone network to ResNeSt yields substantially better results.
• Depth-wise convolution is not optimal for training and inference efficiency on GPUs.
• Model accuracy gets saturated on ImageNet with a fixed input image size.
• Increasing the input image size yields a better accuracy-vs-FLOPS trade-off.