Signal Processing and Machine Learning
Gandhi Jugal
IDDP-August-2020
CSIR-CEERI
EfficientNet: Rethinking Model Scaling for CNNs
• Scaling up ConvNets is widely used to achieve better accuracy.
• ResNet can be scaled from ResNet-18 to ResNet-200 by using more layers.
• GPipe achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline
model to four times its size.
• The most common way is to scale up a ConvNet by its depth,
width, or image resolution.
• In previous works, it is common to scale only one of the three dimensions.
• Though it is possible to scale two or three dimensions arbitrarily,
arbitrary scaling requires tedious manual tuning and still often yields
sub-optimal accuracy and efficiency.
• Deep ConvNets are often over-parameterized.
• Model compression is a common way to reduce model size by trading
accuracy for efficiency.
• It is also common to handcraft efficient mobile-size ConvNets, such as
SqueezeNets, MobileNets, and ShuffleNets.
• Recently, Neural Architecture Search has become increasingly popular
for designing efficient mobile-size ConvNets, such as MnasNet.
• However, it is unclear how to apply these techniques to larger
models, which have a much larger design space and much more
expensive tuning costs.
• There are many ways to scale a ConvNet for different resource
constraints:
• ResNet can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by
adjusting network depth (#layers).
• WideResNet and MobileNets can be scaled by network width (#channels).
• It is also well-recognized that a bigger input image size helps accuracy,
at the cost of more FLOPS.
• Network depth and width are both important for ConvNets, but it
still remains an open question how to effectively scale a ConvNet
to achieve better efficiency and accuracy.
[Figure: a baseline model can be scaled by increasing width, adding
more layers (depth), or increasing input resolution, individually or combined.]
• Depth (#layers): deeper ConvNets can capture richer and more
complex features, and generalize well to new tasks.
• Width (#channels): wider networks tend to capture more
fine-grained features and are easier to train, but they can have
difficulty capturing higher-level features.
• Resolution (input image size): with higher-resolution inputs, a ConvNet
can potentially capture more fine-grained patterns.
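As a rough illustration, the sketch below applies per-dimension scaling coefficients to a baseline configuration. This is a minimal sketch only: the baseline numbers and the helper function are hypothetical, not from the paper.

```python
import math

# Hypothetical baseline ConvNet configuration (illustrative values only).
BASE_DEPTH = 18        # number of layers
BASE_WIDTH = 64        # channels in the first stage
BASE_RESOLUTION = 224  # input image size

def scale(depth_coeff, width_coeff, resolution_coeff):
    """Scale the baseline along depth, width, and resolution."""
    return (
        int(math.ceil(BASE_DEPTH * depth_coeff)),
        int(math.ceil(BASE_WIDTH * width_coeff)),
        int(math.ceil(BASE_RESOLUTION * resolution_coeff)),
    )

print(scale(2.0, 1.0, 1.0))  # depth-only scaling:      (36, 64, 224)
print(scale(1.0, 2.0, 1.0))  # width-only scaling:      (18, 128, 224)
print(scale(1.0, 1.0, 1.5))  # resolution-only scaling: (18, 64, 336)
```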
• Scaling Dimensions – Depth
• The intuition is that a deeper ConvNet can capture richer and more
complex features, and generalize well on new tasks.
• However, the accuracy gain of very deep networks diminishes:
for example, ResNet-1000 has similar accuracy to ResNet-101 even
though it has many more layers.
[Plot: ImageNet top-1 accuracy vs. FLOPS (floating-point operations) for different depth coefficients.]
• Scaling Dimensions – Width
• Scaling network width is commonly used for small-size models.
• As discussed previously, wider networks tend to be able to capture more fine-grained
features and are easier to train.
• However, extremely wide but shallow networks tend to have difficulties in capturing
higher-level features.
• The accuracy also quickly saturates as networks become much wider (larger w).
• Scaling Dimensions – Resolution
• With higher-resolution input images, a ConvNet can potentially capture more
fine-grained patterns.
• Starting from 224x224 in early ConvNets, modern ConvNets tend to use 299x299
or 331x331 for better accuracy. Recently, GPipe achieved state-of-the-art
ImageNet accuracy with 480x480 resolution.
• Higher resolutions improve accuracy, but the accuracy gain diminishes at very
high resolutions.
• In all three cases, the accuracy gain quickly saturates after reaching about 80%
top-1 accuracy, demonstrating the limitation of single-dimension scaling.
(Baseline: EfficientNet-B0)
[Figure: ImageNet top-1 accuracy vs. FLOPS when scaling width, depth, and resolution individually, starting from a 224x224 baseline.]
• Compound Scaling
• Intuitively, the compound scaling method makes sense because if the input image is bigger,
then the network needs more layers to increase the receptive field and more channels to
capture more fine-grained patterns on the bigger image.
• If we only scale network width w without changing depth (d=1.0) and resolution (r=1.0),
the accuracy saturates quickly.
• With deeper (d=2.0) and higher-resolution (r=2.0) networks, width scaling achieves
much better accuracy for the same FLOPS cost.
• Compound scaling scales all three dimensions with a single coefficient φ:
depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ
subject to α ∙ β² ∙ γ² ≈ 2, α ≥ 1, β ≥ 1, γ ≥ 1
• α, β, γ are constants that can be determined by a small grid search;
φ is a user-specified coefficient that controls how many more
resources are available for model scaling.
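A minimal sketch of this rule in Python. The α, β, γ values are the ones the paper reports from its grid search on the B0 baseline; the released B1-B7 models round the resulting coefficients, so treat the outputs as approximate.

```python
# Compound scaling: d = alpha**phi, w = beta**phi, r = gamma**phi,
# with alpha * beta**2 * gamma**2 ~= 2, so FLOPS grow by roughly 2**phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched for EfficientNet-B0

def compound_coefficients(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_coefficients(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# Sanity check: FLOPS scale with d * w**2 * r**2, so each increment of phi
# should roughly double the total FLOPS.
print(ALPHA * BETA**2 * GAMMA**2)  # ~1.92, close to 2
```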
[Figure: EfficientDet compound scaling, annotated on the architecture diagram]
• Backbone network: EfficientNet-B0 to B6
• Input resolution: R_input = 512 + φ ∙ 128
• BiFPN depth (#layers): D_bifpn = 3 + φ
• Prediction-head width (#channels): W_pred = W_bifpn
• Backbone network: use the same width/depth scaling coefficients as
EfficientNet-B0 to B6.
• BiFPN network (bi-directional feature pyramid network): grow the BiFPN width
W_bifpn (#channels) exponentially, but increase the depth D_bifpn (#layers)
linearly, since depth must be rounded to a small integer:
W_bifpn = 64 ∙ 1.35^φ,  D_bifpn = 3 + φ
• Box/class prediction network: fix the width to be the same as the BiFPN
(i.e., W_pred = W_bifpn), but increase the depth (#layers) linearly:
D_box = D_class = 3 + ⌊φ/3⌋
• Input image resolution: since feature levels 3-7 are used in the BiFPN,
the input resolution must be divisible by 2^7 = 128:
R_input = 512 + φ ∙ 128
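Putting these formulas together, a sketch only: the released EfficientDet configurations round W_bifpn to hardware-friendly channel counts (and the largest models deviate from the formula), so exact values differ slightly.

```python
import math

def efficientdet_config(phi):
    """Compute the EfficientDet compound-scaling configuration for a given phi."""
    w_bifpn = int(round(64 * (1.35 ** phi)))  # BiFPN width, grows exponentially
    d_bifpn = 3 + phi                         # BiFPN depth, grows linearly
    d_head = 3 + math.floor(phi / 3)          # box/class head depth
    r_input = 512 + phi * 128                 # input size, a multiple of 128
    return dict(w_bifpn=w_bifpn, d_bifpn=d_bifpn,
                d_head=d_head, r_input=r_input)

for phi in range(7):  # roughly EfficientDet-D0 .. D6
    print(phi, efficientdet_config(phi))
```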
The EfficientNet baseline architecture (B0), designed using Neural Architecture Search
ImageNet Results for EfficientNet
• 84.4% top-1 and 97.1% top-5 accuracy (EfficientNet-B7)
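Pretrained EfficientNets are easy to try. A minimal sketch, assuming TensorFlow/Keras (which ships EfficientNet under tf.keras.applications; ImageNet weights download on first use):

```python
import numpy as np
import tensorflow as tf

# EfficientNet-B0 pretrained on ImageNet; expects 224x224 RGB inputs in
# [0, 255] (normalization is built into the Keras implementation).
model = tf.keras.applications.EfficientNetB0(weights="imagenet")

dummy = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
probs = model.predict(dummy)
print(tf.keras.applications.efficientnet.decode_predictions(probs, top=5))
```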
EfficientNet performance results on transfer learning datasets: the
scaled EfficientNet models achieve new state-of-the-art accuracy on
5 out of 8 datasets, with 9.6x fewer parameters on average.
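A minimal transfer-learning sketch in the same spirit, again assuming TensorFlow/Keras; the number of classes and the commented-out datasets are placeholders, not from the source.

```python
import tensorflow as tf

num_classes = 102  # placeholder, e.g. a flowers-style classification dataset

# Reuse the ImageNet-pretrained features and train only a new classifier head.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone for the first training phase

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # hypothetical datasets
```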
• Conclusion
• A weighted bi-directional feature pyramid network (BiFPN) and a customized
compound scaling method are proposed.
• Together they improve both accuracy and efficiency.
• The resulting EfficientDet is also up to 3.2x faster on GPUs and 8.1x faster on CPUs.
References
• https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
• https://www.youtube.com/watch?v=3svIm5UC94I&t=11s
• https://arxiv.org/abs/1905.11946
• https://heartbeat.fritz.ai/reviewing-efficientnet-increasing-the-accuracy-and-robustness-of-cnns-6aaf411fc81d