A ConvNet for the 2020s
Zhuang Liu et al.
Sushant Gautam
MSIISE
Department of Electronics and
Computer Engineering
Institute of Engineering,
Thapathali Campus
13 March 2022
Introduction: ConvNets
• LeNet / AlexNet / ResNets
• Have several built-in inductive biases
• well suited to a wide variety of computer vision applications
• Translation equivariance
• Shared computation: inherently efficient in a sliding-window setting
• A fundamental building block of visual recognition systems
Transformers in NLP: BERT
• Neural network design for NLP:
• Transformers replaced RNNs
• and became the dominant backbone architecture
Imagine. . .
• But how do we tokenize an image?
Introduction: Vision Transformers
• Vision Transformers (ViT) in 2020:
• completely altered the landscape of network architecture design.
• two streams surprisingly converged
• Transformers outperform ResNets by a significant margin
• but only on image classification
Vision Transformer: Patchify
(Figure: patchify illustration, shown for a 3x3 and an 8x8 image.)
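To make the "patchify" idea concrete, here is a minimal sketch (sizes are illustrative and not tied to the figure above): an image is tokenized by cutting it into non-overlapping patches and flattening each patch into a vector.

```python
import torch

# Toy example: split an 8x8 single-channel image into non-overlapping 4x4
# patches ("tokens"), the way ViT-style models tokenize images.
patch = 4
img = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)  # (B, C, H, W)

tokens = (
    img.unfold(2, patch, patch)        # split H: (B, C, H/p, W, p)
       .unfold(3, patch, patch)        # split W: (B, C, H/p, W/p, p, p)
       .contiguous()
       .view(1, 1, -1, patch * patch)  # (B, C, num_tokens, p*p)
       .squeeze(1)                     # (B, num_tokens, p*p)
)
print(tokens.shape)  # torch.Size([1, 4, 16]): 4 tokens, each a flattened 4x4 patch
```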
Vision Transformer: ViT
(Figure slides: the ViT pipeline, with token embeddings of length C = 4 in the illustration.)
Segmentation: each token is a pixel.
ViTs were challenging
• ViT’s global attention design:
• quadratic complexity with respect to the input size
• Hierarchical Transformers
• take a hybrid approach to bridge this gap
• for example, the sliding-window strategy (e.g., attention within local windows)
• which makes them behave more similarly to ConvNets
Swin Transformer
• a milestone work in this direction
• demonstrated Transformers as a generic vision backbone
• achieved SOTA results across a range of computer vision tasks
• beyond image classification.
• successful and rapid adoption revealed:
• the essence of convolution is not becoming irrelevant; rather, it remains much desired and has never faded
Swin Transformer
• Still relies on patches
• But instead of committing to a single patch size,
• it starts with smaller patches and merges them in the deeper transformer layers
• Similar in spirit to UNet and to convolutional hierarchies
Swin Transformer
• Shifted-window Transformer
• Self-attention within non-overlapping windows
• Hierarchical feature maps
Back to the topic: ConvNeXt
Objectives:
• bridge the gap between the pre-ViT and post-ViT ConvNets
• test the limits of what a pure ConvNet can achieve.
• Exploration: How do design decisions in Transformers impact
ConvNets’ performance?
• Result: a family of pure ConvNets: ConvNeXt.
Modernizing a ConvNet: a Roadmap
• Recipe: two ResNet model sizes in terms of GFLOPs
  ◦ ResNet-50 / Swin-T: ~4.5×10⁹ FLOPs
  ◦ ResNet-200 / Swin-B: ~15.0×10⁹ FLOPs
• Network Modernization
1. ViT training techniques for ResNet
2. ResNeXt
3. Inverted bottleneck
4. Large kernel size
5. Various layer-wise micro designs
ViT Training Techniques
• Training epochs: 300 epochs (from the original 90 epochs for ResNets)
• Optimizer: AdamW (Weight Decay)
• Data augmentation
  ◦ Mixup
  ◦ CutMix
  ◦ RandAugment
  ◦ Random Erasing
• Regularization
  ◦ Stochastic Depth
  ◦ Label Smoothing
• This increased the performance of the ResNet-50 model from 76.1% to 78.8% (+2.7%)
Note: label smoothing keeps the model from becoming over-confident.
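As a rough sketch of how these ingredients map onto standard PyTorch pieces (hyperparameter values and the stand-in model are illustrative, not the authors' exact recipe; CutMix, RandAugment, Random Erasing, and stochastic depth would typically come from a library such as timm):

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be ResNet-50.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1000))

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing

def mixup(x, y, num_classes=1000, alpha=0.8):
    """Blend random pairs of images and their one-hot labels (Mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    y_onehot = nn.functional.one_hot(y, num_classes).float()
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
x_mix, y_mix = mixup(x, y)

optimizer.zero_grad()
loss = criterion(model(x_mix), y_mix)  # soft targets work with CrossEntropyLoss (PyTorch >= 1.10)
loss.backward()
optimizer.step()
```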
Changing Stage Compute Ratio
• Swin-T’s stage compute ratio is 1:1:3:1; for the larger Swin-B, it is 1:1:9:1
• Number of conv blocks per stage in ResNet-50: (3, 4, 6, 3)
• ConvNeXt changes this to (3, 3, 9, 3),
• which also aligns the FLOPs with Swin-T
• This improves the model accuracy from 78.8% to 79.4%.
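A toy sketch of what the stage-ratio change means structurally (the block type and channel widths below are placeholders; only the per-stage depths matter here):

```python
import torch.nn as nn

def make_stages(depths, dims=(96, 192, 384, 768)):
    """Stack `depths[i]` placeholder blocks in stage i (Swin-T-like widths)."""
    return nn.ModuleList(
        nn.Sequential(*[nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(d)])
        for d, dim in zip(depths, dims)
    )

resnet50_stages = make_stages((3, 4, 6, 3))  # original ResNet-50 block counts
convnext_stages = make_stages((3, 3, 9, 3))  # adjusted toward Swin-T's 1:1:3:1 ratio
print([len(s) for s in convnext_stages])     # [3, 3, 9, 3]
```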
Changing Stem to “Patchify”
• Stem cell of a standard ResNet: a 7x7 convolution layer
• followed by a max pool, which results in 4x downsampling
• Stem cell of ViT: a more aggressive “patchify” strategy
• corresponding to a large kernel size (14 or 16) and non-overlapping convolution
• Swin Transformer uses a similar “patchify” layer
• but with a smaller patch size of 4, to accommodate the architecture’s multi-stage design
• ConvNeXt replaces the ResNet-style stem cell
• with a patchify layer implemented as a 4x4, stride-4 convolution
• The accuracy has changed from 79.4% to 79.5%.
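A sketch contrasting the two stems (the channel widths follow standard ResNet-50 and Swin-T/ConvNeXt-T choices; this is illustrative rather than the authors' code):

```python
import torch
import torch.nn as nn

# ResNet-style stem: overlapping 7x7 conv + max pool, together 4x downsampling.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single non-overlapping 4x4, stride-4 convolution.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```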
ResNeXt-ify
• ResNeXt has a better FLOPs/accuracy trade-off than ResNet.
• The core component is grouped convolution.
• Guiding principle: “use more groups, expand width”.
• ConvNeXt uses depthwise convolution, a special case of grouped
convolution where the number of groups equals the number of
channels.
• ConvNeXt increases the network width to the same number of
channels as Swin-T’s (from 64 to 96)
• This brings the network performance to 80.5% with increased
FLOPs (5.3G)
Depthwise Convolution (illustration)
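A small sketch of the distinction (96 channels matches Swin-T's first-stage width; the depthwise kernel is 3x3 at this step and is enlarged to 7x7 in a later step):

```python
import torch.nn as nn

dim = 96

# Dense convolution: every output channel mixes all input channels.
dense_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

# Grouped convolution with groups == channels, i.e. depthwise convolution:
# each channel gets its own spatial filter, with no cross-channel mixing.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# Channel mixing is then handled separately by cheap 1x1 (pointwise) convs,
# loosely analogous to the per-location channel mixing of a Transformer MLP.
pointwise_conv = nn.Conv2d(dim, dim, kernel_size=1)

print(sum(p.numel() for p in dense_conv.parameters()))      # 83040
print(sum(p.numel() for p in depthwise_conv.parameters()))  # 960
```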
Inverted Bottleneck
• An important design in every Transformer block:
• the hidden dimension of the MLP block is 4 times wider than the input
• Connected to the inverted-bottleneck design with an expansion ratio of 4 in ConvNets,
• popularized by MobileNetV2 and since adopted in several advanced ConvNets
• Costly: increased FLOPs for the depthwise convolution layer,
• but a significant FLOPs reduction in the downsampling residual blocks
• Reduces the whole-network FLOPs to 4.6G
• Interestingly, performance improves slightly (80.5% to 80.6%)
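A sketch of the two layouts, with normalization and activations omitted and widths following the 4x expansion described above:

```python
import torch.nn as nn

dim = 96

# Classic ResNet bottleneck: wide -> narrow -> wide.
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),         # 384 -> 96
    nn.Conv2d(dim, dim, kernel_size=3, padding=1),  # 3x3 on the narrow width
    nn.Conv2d(dim, 4 * dim, kernel_size=1),         # 96 -> 384
)

# Inverted bottleneck (Transformer-MLP-like): narrow -> wide -> narrow.
inverted = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                  # 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),   # depthwise on the wide width (the costly part)
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                  # 384 -> 96
)
```

The next step moves the depthwise conv to the top of the block so that it runs on the narrow 96-channel width instead of the wide 384-channel one.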
Moving Up Depthwise Conv Layer
• Move the depthwise conv layer up, as in Transformers,
• where the MSA block is placed before the MLP layers
• As ConvNeXt has an inverted bottleneck block, this is a natural design choice:
  ◦ the complex/inefficient modules (MSA, large-kernel conv) then have fewer channels, while the efficient, dense 1x1 layers do the heavy lifting
• This intermediate step reduces the FLOPs to 4.1G, with a temporary performance degradation to 79.9%
Note: a self-attention operation can be viewed as a dynamic, lightweight convolution, i.e., a form of depthwise separable convolution.
Increasing the Kernel Size
• Swin Transformers reintroduced the local window to the self-attention block; the window size is at least 7x7, significantly larger than the ResNe(X)t kernel size of 3x3.
• The authors experimented with several kernel sizes: 3, 5, 7, 9, and 11. The network’s performance increases from 79.9% (3x3) to 80.6% (7x7), while the network’s FLOPs stay roughly the same.
• A ResNet-200-regime model does not show further gains when the kernel size is increased beyond 7x7.
Replacing ReLU with GELU
• One discrepancy between NLP and vision architectures is the
specifics of which activation functions to use.
• Gaussian Error Linear Unit, or GELU
• a smoother variant of ReLU,
• utilized in the most advanced Transformers,
• such as Google’s BERT, OpenAI’s GPT-2, and ViTs
• ReLU can be substituted with
GELU in ConvNeXt too.
• But the accuracy stays
unchanged (80.6%).
Fewer Activation Functions
• Transformers have fewer activation functions than ResNet
• There is only one activation function present in the MLP block.
• In comparison, it is common practice to append an activation function
to each convolutional layer, including the 1x1 convs.
• Replicating the Transformer block:
• eliminate all GELU layers from the residual block except for one between the two 1x1 layers
• This process improves the result by 0.7% to 81.3%, practically
matching the performance of Swin-T.
Fewer Normalization Layers
• Transformer blocks usually have fewer normalization layers as well.
• ConvNeXt removes two BatchNorm (BN) layers, leaving only one BN
layer before the conv 1x1 layers.
• This further boosts the performance to 81.4%, already surpassing
Swin-T’s result.
Substituting BN with LN
• BatchNorm is an essential component in ConvNets as it improves the
convergence and reduces overfitting.
• The simpler Layer Normalization (LN) has been used in Transformers,
• resulting in good performance across different application scenarios
• Directly substituting LN for BN in the original ResNet will result in
suboptimal performance.
• With all modifications in network architecture and training techniques,
the authors revisit the impact of using LN in place of BN.
• The ConvNeXt model has no difficulty training with LN; in fact, performance is slightly better, reaching an accuracy of 81.5%
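One practical detail behind this swap: nn.LayerNorm normalizes over the trailing dimensions, whereas ConvNet feature maps are (N, C, H, W), so a channels-first variant is needed. A minimal sketch (the official implementation differs in details):

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of (N, C, H, W) tensors."""
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return x * self.weight[:, None, None] + self.bias[:, None, None]

x = torch.randn(2, 96, 56, 56)
print(LayerNorm2d(96)(x).shape)  # torch.Size([2, 96, 56, 56])
```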
Separate Downsampling Layers
• In ResNet, the spatial downsampling is achieved by the residual block
at the start of each stage, using 3x3 conv with stride 2 (and 1x1 conv
with stride 2 at the shortcut connection).
• In Swin Transformers, a separate downsampling layer is added
between stages.
• ConvNeXt uses separate 2x2, stride-2 conv layers for spatial downsampling;
• adding normalization layers wherever the spatial resolution changes helps stabilize training
• This improves the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%
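A sketch of such a separate downsampling layer, assuming the channels-first LayerNorm2d sketch from earlier is in scope (layer choices are illustrative):

```python
import torch.nn as nn

def downsample_layer(dim_in, dim_out):
    """Normalization + non-overlapping 2x2, stride-2 conv, placed between stages."""
    return nn.Sequential(
        LayerNorm2d(dim_in),                                  # channels-first LN sketch from above
        nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2),  # halves H and W
    )

ds = downsample_layer(96, 192)  # e.g. between a 96-channel and a 192-channel stage
```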
ConvNeXt Block Design
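Putting the steps above together, a sketch of the resulting block, with layer scale and stochastic depth omitted (the 1x1 convs are expressed as Linear layers in channels-last layout, which is equivalent):

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 reduce,
    wrapped in a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied in channels-last layout below
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv == Linear on the channel dim
        self.act = nn.GELU()                    # the block's single activation
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)                        # the block's single normalization
        x = self.pwconv2(self.act(self.pwconv1(x)))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlockSketch(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```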
Modernizing a Standard ConvNet Results
Closing Remarks
• ConvNeXt, a pure ConvNet, can outperform the Swin Transformer on ImageNet-1K classification in this compute regime.
• It is also worth noting that none of the design options discussed thus
far is novel—they have all been researched separately, but not
collectively, over the last decade.
• ConvNeXt model has approximately the same FLOPs, #params.,
throughput, and memory use as the Swin Transformer, but does not
require specialized modules such as shifted window attention or
relative position biases.
Results – ImageNet-1K
Results – ImageNet-22K
• A widely held view is that vision Transformers have fewer inductive biases and can therefore perform better than ConvNets when pre-trained at larger scale.
• Properly designed ConvNets are not inferior to vision Transformers when pre-trained on large datasets: ConvNeXts still perform on par with or better than similarly sized Swin Transformers, with slightly higher throughput.
Comments on Twitter about the Results
Isotropic ConvNeXt vs. ViT
• Examines whether the ConvNeXt block design generalizes to ViT-style isotropic architectures, which have no downsampling layers and keep the same feature resolution (e.g., 14x14) at all depths.
• ConvNeXt performs generally on par with ViT, showing that the ConvNeXt block design is competitive when used in non-hierarchical models.
Results – Object Detection and Segmentation
on COCO, Semantic Segmentation on ADE20K
Results – Robustness Evaluation of ConvNeXt
Remarks on Model Efficiency[1]
• Under similar FLOPs, models with depthwise convolutions are known to be slower
and consume more memory than ConvNets with only dense convolutions.
• It is natural to ask whether the design of ConvNeXt renders it practically inefficient.
• As demonstrated throughout the paper, the inference throughputs of ConvNeXts are comparable to or exceed those of Swin Transformers. This holds both for classification and for other tasks requiring higher-resolution inputs.
Remarks on Model Efficiency[2]
• Furthermore, training ConvNeXts requires less memory than training Swin Transformers: training Cascade Mask R-CNN with a ConvNeXt-B backbone consumes 17.4 GB of peak memory at a per-GPU batch size of 2, while the reference number for Swin-B is 18.5 GB.
• It is worth noting that this improved efficiency is a result of the ConvNet inductive bias and is not directly related to the self-attention mechanism in vision Transformers.
Conclusions[1]
• In the 2020s, vision Transformers, particularly hierarchical ones such as Swin
Transformers, began to overtake ConvNets as the favored choice for generic
vision backbones.
• The widely held belief is that vision Transformers are more accurate,
efficient, and scalable than ConvNets.
• ConvNeXt, a pure ConvNet model, can compete favorably with state-of-the-art hierarchical vision Transformers across multiple computer vision benchmarks, while retaining the simplicity and efficiency of standard ConvNets.
Conclusions[2]
• In some ways, these observations are surprising, even though the ConvNeXt model itself is not completely new: many of its design choices have been examined separately over the last decade, but not collectively.
• The authors hope that the new results reported in this study will
challenge several widely held views and prompt people to rethink the
importance of convolution in computer vision.
Thank you
