This is the 183rd paper review of the TensorFlow-KR paper reading group PR12 (12PR).
The paper covered this time is MixNet from Google Brain. Depthwise convolution is widely used in efficiency-oriented CNNs, and this paper proposes mixing multiple depthwise convolution kernel sizes to improve both accuracy and efficiency. Please see the video for details.
Paper link: https://arxiv.org/abs/1907.09595
Presentation video: https://youtu.be/252YxqpHzsg
1. MixNet:
Mixed Depthwise Convolutional Kernels
Mingxing Tan, et al., “MixNet: Mixed Depthwise Convolutional Kernels”, BMVC 2019
28th July, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
2. Introduction
• A recent trend in ConvNet design is to improve both accuracy and
efficiency.
• Following this trend, depthwise convolutions are becoming
increasingly popular in modern ConvNets,
such as MobileNets, ShuffleNets, NASNets, AmoebaNet, MnasNet, and
EfficientNet.
3. Introduction
• Although conventional practice is to simply use 3x3 kernels, recent
research results have shown that larger kernel sizes such as 5x5 and
7x7 can potentially improve model accuracy and efficiency.
• In this paper, the authors revisit the fundamental question:
Do larger kernels always achieve higher accuracy?
• Larger kernels tend to capture high-resolution patterns with more
details, at the cost of more parameters and computations.
• But do they always improve accuracy?
5. Introduction
• In the extreme case where the kernel size is equal to the input resolution, a
ConvNet simply becomes a fully-connected network, which is known
to be inferior.
• We need both large kernels to capture high-resolution patterns and
small kernels to capture low-resolution patterns for better model
accuracy and efficiency.
6. Related Work
• Efficient ConvNets
In recent years, significant efforts have been spent on improving ConvNet
efficiency.
In particular, depthwise convolution has been increasingly popular in all
mobile-size ConvNets.
Unlike regular convolution, depthwise convolution applies a separate
convolutional kernel to each channel, thus reducing parameter count and
computational cost.
7. Related Work
• Multi-Scale Networks and Features
There are multi-branch ConvNets, such as Inceptions, Inception-ResNet,
ResNeXt, and NASNet.
By using multiple branches in each layer, these ConvNets are able to utilize
different operations in a single layer.
Similarly, there is also much prior work on combining multi-scale feature
maps from different layers, such as DenseNet and feature pyramid networks.
These prior works mostly focus on changing the macro-architecture of neural
networks in order to utilize different convolution ops.
8. Related Work
• Neural Architecture Search
Recently, neural architecture search has achieved better performance than
hand-crafted models by automating the design process and learning better
design choices.
When a new operation appears, it is added to the search space in NAS.
9. Regular(Normal) Convolution
w, h, c : width, height, and channels of the input feature map
k : width and height of the convolution filters
n : the number of convolution filters (channels of the output feature map)
[Diagram: n filters of size k x k x c slide over the w x h x c input feature map, producing an output feature map with n channels.]
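As a quick sanity check of these definitions, the parameter and multiply-add counts of a regular convolution can be written out directly (a minimal sketch of my own, assuming stride 1 and same padding):

```python
# Parameter and multiply-add count of a regular convolution, using the
# symbols defined on this slide (w, h, c: input width/height/channels,
# k: kernel size, n: number of filters). Stride 1 and same padding assumed.
def regular_conv_cost(w, h, c, k, n):
    params = k * k * c * n          # every filter sees all c input channels
    mults = w * h * k * k * c * n   # one k*k*c dot product per output pixel per filter
    return params, mults

# Example: 112x112x32 input, 3x3 kernels, 64 filters
print(regular_conv_cost(112, 112, 32, 3, 64))  # (18432, 231211008)
```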
13. Depthwise Convolution
• Same as a group convolution with g = c and n = m x c (m: depth multiplier)
[Diagram: each of the c input channels is convolved with its own set of m filters of size k x k, and the per-channel outputs are stacked into an output feature map with m x c channels.]
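A minimal PyTorch sketch (my own illustration, not from the slides) of depthwise convolution expressed as a group convolution with g = c and n = m x c, compared against a regular convolution of the same shape:

```python
import torch
import torch.nn as nn

c, m, k = 32, 1, 3  # input channels, depth multiplier, kernel size

# Depthwise convolution as a group convolution: g = c groups, n = m * c filters
depthwise = nn.Conv2d(c, m * c, kernel_size=k, padding=k // 2, groups=c)
# Regular convolution with the same input/output channels, for comparison
regular = nn.Conv2d(c, m * c, kernel_size=k, padding=k // 2)

x = torch.randn(1, c, 112, 112)
print(depthwise(x).shape)                      # torch.Size([1, 32, 112, 112])

n_params = lambda mod: sum(p.numel() for p in mod.parameters())
print(n_params(depthwise), n_params(regular))  # 320 vs 9248 (incl. biases)
```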
16. MDConv
• MDConv partitions channels into groups and applies a different kernel
size to each group.
• The input tensor is partitioned into g groups of virtual tensors X^1, ..., X^g,
where all virtual tensors have the same spatial height h and width w, and
their channel sizes satisfy c1 + c2 + ... + cg = c.
• Output is calculated by applying a depthwise convolution with kernel size kt
to the t-th group and concatenating the per-group results along the channel
dimension:
Y^t = DepthwiseConv_kt(X^t) for t = 1, ..., g, and Out = Concat(Y^1, ..., Y^g)
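Below is a minimal PyTorch sketch of MDConv following this definition; the module name, the equal channel split, and the default kernel sizes are my own illustrative choices, not the authors' released code:

```python
import torch
import torch.nn as nn

class MDConv(nn.Module):
    """Mixed depthwise convolution: one depthwise conv per channel group,
    each group with its own kernel size, outputs concatenated on channels."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        g = len(kernel_sizes)
        # Equal partition; any remainder goes to the first group.
        splits = [channels // g] * g
        splits[0] += channels - sum(splits)
        self.splits = splits
        self.convs = nn.ModuleList([
            nn.Conv2d(c_t, c_t, k, padding=k // 2, groups=c_t)
            for c_t, k in zip(splits, kernel_sizes)
        ])

    def forward(self, x):
        xs = torch.split(x, self.splits, dim=1)   # virtual tensors X^1..X^g
        ys = [conv(x_t) for conv, x_t in zip(self.convs, xs)]
        return torch.cat(ys, dim=1)               # Out = Concat(Y^1, ..., Y^g)

x = torch.randn(1, 32, 56, 56)
print(MDConv(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```

With stride-1 depthwise convolutions and same padding per group, the output keeps the input's spatial size and channel count, so MDConv can drop in where a vanilla depthwise convolution was used.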
19. Design Choices
• Group Size g
In the extreme case of g = 1, MDConv becomes equivalent to a vanilla
depthwise convolution.
g = 4 is generally a safe choice for MobileNets, but with the help of neural
architecture search, models can further benefit from a variety of group sizes
from 1 to 5.
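To make the g = 1 case concrete: a single group covering all channels with one 3x3 kernel is exactly a vanilla depthwise convolution (minimal sketch, my own illustration):

```python
import torch.nn as nn

c = 32
# g = 1: one group over all channels with a single 3x3 kernel,
# i.e. a plain 3x3 depthwise convolution
single_group = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)
```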
20. Kernel Size Per Group
• In theory, each group can have an arbitrary kernel size.
• The authors restrict kernel sizes to always start from 3x3 and monotonically
increase by 2 per group.
• In other words, group i always has kernel size (2i+1) x (2i+1).
A 4-group MDConv always uses kernel sizes {3x3, 5x5, 7x7, 9x9}
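As a quick check of this rule (an illustrative helper, not from the paper):

```python
# Kernel size of group i (1-indexed): starts at 3x3 and grows by 2 per group
def kernel_sizes(num_groups):
    return [2 * i + 1 for i in range(1, num_groups + 1)]

print(kernel_sizes(4))  # [3, 5, 7, 9]
```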
21. Channel Size Per Group
• Two channel partition methods
Equal partition: each group has the same number of filters.
For a 4-group MDConv with total filter size 32, the channels are divided into (8, 8, 8, 8).
Exponential partition: the i-th group has about a 2^-i portion of the total
channels.
For a 4-group MDConv with total filter size 32, the channels are divided into (16, 8, 4, 4).
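A small sketch of the two partition schemes (my own helper functions, reproducing the (8, 8, 8, 8) and (16, 8, 4, 4) examples above):

```python
# Two ways to split `channels` filters across `g` groups, as described above.
def equal_partition(channels, g):
    sizes = [channels // g] * g
    sizes[0] += channels - sum(sizes)   # hand any remainder to the first group
    return sizes

def exponential_partition(channels, g):
    # i-th group gets roughly a 2^-i portion; the last group absorbs the rest
    sizes = [channels // (2 ** i) for i in range(1, g)]
    sizes.append(channels - sum(sizes))
    return sizes

print(equal_partition(32, 4))        # [8, 8, 8, 8]
print(exponential_partition(32, 4))  # [16, 8, 4, 4]
```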
22. Dilated Convolution
• Since large kernels need more parameters and computations, an
alternative is to use dilated convolution.
• However, dilated convolutions usually yield lower accuracy than larger
kernels.
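For comparison, a dilated 3x3 depthwise kernel covers the same 5x5 receptive field as a true 5x5 kernel with fewer parameters (minimal PyTorch sketch, my own illustration):

```python
import torch.nn as nn

c = 32
# 3x3 kernel with dilation 2: 5x5 receptive field, but only 3*3*c weights
dilated = nn.Conv2d(c, c, kernel_size=3, padding=2, dilation=2, groups=c)
# True 5x5 depthwise kernel: 5*5*c weights
large = nn.Conv2d(c, c, kernel_size=5, padding=2, groups=c)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dilated), n_params(large))  # 320 vs 832 (incl. biases)
```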
25. Ablation Study
• MDConv for Single Layer
For most layers the accuracy doesn't change much, but for certain layers
with stride 2, a larger kernel can significantly improve the accuracy.
27. MixNets
• To further demonstrate the effectiveness of MDConv, the authors
leverage recent progress in neural architecture search to develop a
new family of MDConv-based models, named MixNets.
• Similar to recent neural architecture search approaches, the authors
directly search on the ImageNet training set, and then pick a few top-
performing models from the search to verify their accuracy on the ImageNet
validation set and on transfer learning datasets.
28. MixNet Architecture
• Small kernels are more common in the early stages to save
computational cost, while large kernels are more common in the later
stages for better accuracy.
• The bigger MixNet-M tends to use more large kernels and more
layers to pursue higher accuracy, at the cost of more parameters
and FLOPs.
32. Conclusion
• The authors revisit the impact of kernel size for depthwise convolution,
and identify that traditional depthwise convolution suffers from the
limitation of a single kernel size.
• They propose MDConv, which mixes multiple kernels in a single op.
• MDConv is a simple drop-in replacement for vanilla depthwise
convolution and improves accuracy and efficiency.
• They further develop a new family of MixNets using NAS techniques,
and MixNets achieve significantly better accuracy and efficiency than
the latest mobile ConvNets.