EfficientDet is a single-shot object detector that achieves state-of-the-art accuracy with high efficiency. It introduces a Bi-directional Feature Pyramid Network (BiFPN) that fuses multi-scale features with weighted connections. EfficientDet also uses a compound scaling method to jointly scale the backbone network, BiFPN, prediction network, and input resolution. Experiments on COCO show that EfficientDet outperforms prior models across different resource constraints in terms of accuracy and efficiency for object detection and semantic segmentation tasks.
4. Introduction
• Recent detectors have the trade-off between accuracy and efficiency
• Most previous works only focus on a specific or a small range of resource requirements
• This points make hard to apply the recent detection models on industry field
• “Is it possible to build a scalable detection architecture with both higher
accuracy and better efficiency across a wide spectrum of resource constraints?”
Motivation
01
4
5. Introduction
Challenge 1. Efficient multi-scale feature fusion
01
5
• Feature fusion : The method for combining feature maps
→ Normal feature fusion methods don’t care about feature resolution.
Challenge 2 : Model scaling
• Model scaling : The method for up-scaling the model architecture
→ Limitation of up-scaling by considering one factor
Input-image up-scaling
Network up-scaling
02
7. Related work
Multi-scale feature representation
01
Conv
Conv
Conv
Conv
Up scaling
Up scaling
Up scaling
1x1 Conv
1x1 Conv
1x1 Conv
1x1 Conv
Prediction
Prediction
Prediction
Prediction
Backbone
Feature
pyramid
𝒑𝟒𝒐𝒖𝒕
𝒑𝟑𝒐𝒖𝒕
𝒑𝟐𝒐𝒖𝒕
𝒑𝟏𝒐𝒖𝒕
𝒑𝟒
𝒑𝟑
𝒑𝟐
𝒑𝟏
7
• For considering multi-scale object
Area
Prediction
layer
8. Related work
Model scaling
02
• EfficientNet (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Mingxing Tan et al, ICML 2019)
• Jointly Scale up the depth, width, resolution (Compound scaling)
8
𝑓
𝑓
𝑓
𝑓
𝐷𝑒𝑝𝑡ℎ
𝐼𝑛𝑝𝑢𝑡 𝑟𝑒𝑠𝑜𝑙𝑢𝑡𝑖𝑜𝑛
12. BiFPN
Weighted Feature Fusion
02
• The difference of Resolution between Inputs → Different degrees of contribution to output
• Gave each input feature a weight to learn the contribution of the input feature.
𝑶𝒖𝒕𝒑𝒖𝒕
𝒇𝒆𝒂𝒕𝒖𝒓𝒆
𝑾𝒆𝒊𝒈𝒉𝒕𝒊 𝑰𝒏𝒑𝒖𝒕
𝒇𝒆𝒂𝒕𝒖𝒓𝒆𝒊
𝑺𝒐𝒇𝒕𝒎𝒂𝒙 − 𝒃𝒂𝒔𝒆𝒅 𝒇𝒖𝒔𝒊𝒐𝒏 𝑭𝒂𝒔𝒕 𝒏𝒐𝒓𝒎𝒂𝒍𝒊𝒛𝒆𝒅 𝒇𝒖𝒔𝒊𝒐𝒏
(30% Speed Gain in GPU)
12
14. EfficientDet
Compound scaling
02
• Previous works mostly scale up baseline network or using larger image inputs, stacking
more FPN layers
• New compound scaling method jointly scale up all dimensions of backbone network, BiFPN
network, prediction network and resolution of input.
Backbone network
02-1
• Reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6
BiFPN network
02-2
• Perform grid search for finding best factor value on a list of values {1.2, 1.25, 1.3, 1.35, 1.4, 1.45}
𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐ℎ𝑎𝑛𝑛𝑒𝑙 𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑙𝑎𝑦𝑒𝑟
14
15. EfficientDet
Prediction network
02-3
• The width of network is same as BiFPN network's width
𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑙𝑎𝑦𝑒𝑟
Input image resolution
02-4
Overall scaling output
02-5
15
17. Experiments
Experiment configuration
01
• Dataset : COCO 2017 datasets with 118K images
• Optimizer : SGD with momentum 0.9 and weight decay 4e-5
• Learning Rate : 0 to 0.16 (First epoch), annealed down using cosine decay rule (0~0.16 𝑟𝑒𝑝𝑒𝑎𝑡)
• Batch normalization is used after every convolution layer
• Every convolution layer is depth-wise conv layer
• Activation function : Swish (𝑥 ∗ 𝑆𝑖𝑔𝑚𝑜𝑖𝑑(𝛽𝑥))
• Augmentation : Multi-resolution cropping / scaling / flipping
17
18. Experiments
Loss function
02
• Using Focal-loss for detection
• Class imbalanced problem is most effected by easy negative samples
• Training by focusing on hard samples
• If 𝑝𝑡 is almost 1 → − 1 − 0.999 𝑟
𝑙𝑜𝑔 𝑝𝑡 ≈ 0
• Else → − 1 − 0.001 𝑟 𝑙𝑜𝑔(𝑝𝑡) ≈ ∞
𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑜𝑓
𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛
18
22. Ablation study
Disentangling Backbone and BiFPN
01
• The Backbone network and multi-feature network of EfficientDet achieves higher AP and
Efficiency than prior networks
22
23. Ablation study
BiFPN Cross Scale Connection
02
• For the fair comparison, FPN and PANet are repeated multiple times and change the conv.
• BiFPN achieves the best accuracy with fewer parameters and FLOPs
23
24. Ablation study
Softmax vs Fast Normalized fusion
03
• Fast normalized fusion approach achieves similar accuracy as the softmax-based method
• Figure 5 illustrates the learned weights for three feature fusion nodes
24
25. Ablation study
Compound Scaling
04
• EfficientDet jointly scale up the network’s backbone, BiFPN, prediction net, input resolution
• The proposed method achieves the best accuracy than other scaling method
25
26. Conclusion
Propose the weight bidirectional feature network and customized compound scaling
method, in order to improve accuracy and efficiency
01
EfficientDet achieves better accuracy and efficiency than the prior art across a wide
spectrum of resource constrains
02
EfficientDet achieves SOTA accuracy with much fewer parameters and FLOPs in object
detection and semantic segmentation
03
26