This is the 183rd paper review of the TensorFlow-KR paper reading group PR12 (12PR).
The paper covered this time is MixNet from Google Brain. Depthwise convolution is widely used in efficiency-oriented CNNs, and this paper proposes mixing multiple depthwise convolution kernel sizes to improve both accuracy and efficiency. Please see the video for details.
Paper link: https://arxiv.org/abs/1907.09595
Presentation video: https://youtu.be/252YxqpHzsg
1. MixNet:
Mixed Depthwise Convolutional Kernels
Mingxing Tan, et al., “MixNet: Mixed Depthwise Convolutional Kernels”, BMVC 2019
28th July, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
2. Introduction
• A recent trend in ConvNet design is to improve both accuracy and
efficiency.
• Following this trend, depthwise convolutions are becoming
increasingly popular in modern ConvNets,
such as MobileNets, ShuffleNets, NASNets, AmoebaNet, MnasNet, and
EfficientNet.
3. Introduction
• Although conventional practice is to simply use 3x3 kernels, recent
research results have shown that larger kernel sizes such as 5x5 and
7x7 can potentially improve model accuracy and efficiency.
• In this paper, the authors revisit the fundamental question:
Do larger kernels always achieve higher accuracy?
• Larger kernels tend to capture high-resolution patterns with more
details, at the cost of more parameters and computations.
• But do they always improve accuracy?
5. Introduction
• In the extreme case where the kernel size is equal to the input resolution, a
ConvNet simply becomes a fully-connected network, which is known
to be inferior.
• We need both large kernels to capture high-resolution patterns and
small kernels to capture low-resolution patterns for better model
accuracy and efficiency.
6. Related Work
• Efficient ConvNets
In recent years, significant efforts have been spent on improving ConvNet
efficiency.
In particular, depthwise convolution has been increasingly popular in all
mobile-size ConvNets.
Unlike regular convolution, depthwise convolution applies a separate
convolutional kernel to each channel, thus reducing parameter count and
computational cost.
7. Related Work
• Multi-Scale Networks and Features
There are multi-branch ConvNets, such as Inceptions, Inception-ResNet,
ResNeXt, and NASNet.
By using multiple branches in each layer, these ConvNets are able to utilize
different operations in a single layer.
Similarly, there is also much prior work on combining multi-scale feature
maps from different layers, such as DenseNet and feature pyramid networks.
These prior works mostly focus on changing the macro-architecture of neural
networks in order to utilize different convolution ops.
8. Related Work
• Neural Architecture Search
Recently, neural architecture search has achieved better performance than
hand-crafted models by automating the design process and learning better
design choices.
When a new operation appears, it is added to the search space in NAS.
9. Regular(Normal) Convolution
w, h, c : width, height, and channels of the input feature map
k : width and height of the convolution filters
n : the number of convolution filters (channels of the output feature map)
[Diagram: n filters of size k x k x c slide over the w x h x c input feature map, producing an output feature map with n channels.]
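As a quick sanity check of these definitions, the parameter and multiply-add counts of a regular convolution can be written out directly (a minimal sketch of my own, assuming stride 1 and same padding):

```python
# Parameter and multiply-add count of a regular convolution, using the
# symbols defined on this slide (w, h, c: input width/height/channels,
# k: kernel size, n: number of filters). Stride 1 and same padding assumed.
def regular_conv_cost(w, h, c, k, n):
    params = k * k * c * n          # every filter sees all c input channels
    mults = w * h * k * k * c * n   # one k*k*c dot product per output pixel per filter
    return params, mults

# Example: 112x112x32 input, 3x3 kernels, 64 filters
print(regular_conv_cost(112, 112, 32, 3, 64))  # (18432, 231211008)
```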
13. Depthwise Convolution
• Same as a group convolution with g = c and n = m x c (m: depth multiplier)
[Diagram: each of the c input channels is convolved with its own set of m filters of size k x k, and the per-channel outputs are stacked into an output feature map with m x c channels.]
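A minimal PyTorch sketch (my own illustration, not from the slides) of depthwise convolution expressed as a group convolution with g = c and n = m x c, compared against a regular convolution of the same shape:

```python
import torch
import torch.nn as nn

c, m, k = 32, 1, 3  # input channels, depth multiplier, kernel size

# Depthwise convolution as a group convolution: g = c groups, n = m * c filters
depthwise = nn.Conv2d(c, m * c, kernel_size=k, padding=k // 2, groups=c)
# Regular convolution with the same input/output channels, for comparison
regular = nn.Conv2d(c, m * c, kernel_size=k, padding=k // 2)

x = torch.randn(1, c, 112, 112)
print(depthwise(x).shape)                      # torch.Size([1, 32, 112, 112])

n_params = lambda mod: sum(p.numel() for p in mod.parameters())
print(n_params(depthwise), n_params(regular))  # 320 vs 9248 (incl. biases)
```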
16. MDConv
• MDConv partitions channels into groups and applies a different kernel
size to each group.
• The input tensor is partitioned into g groups of virtual tensors X^1, ..., X^g,
where all virtual tensors have the same spatial height h and width w, and
their channel sizes satisfy c1 + c2 + ... + cg = c.
• Output is calculated by applying a depthwise convolution with kernel size kt
to the t-th group and concatenating the per-group results along the channel
dimension:
Y^t = DepthwiseConv_kt(X^t) for t = 1, ..., g, and Out = Concat(Y^1, ..., Y^g)
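Below is a minimal PyTorch sketch of MDConv following this definition; the module name, the equal channel split, and the default kernel sizes are my own illustrative choices, not the authors' released code:

```python
import torch
import torch.nn as nn

class MDConv(nn.Module):
    """Mixed depthwise convolution: one depthwise conv per channel group,
    each group with its own kernel size, outputs concatenated on channels."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        g = len(kernel_sizes)
        # Equal partition; any remainder goes to the first group.
        splits = [channels // g] * g
        splits[0] += channels - sum(splits)
        self.splits = splits
        self.convs = nn.ModuleList([
            nn.Conv2d(c_t, c_t, k, padding=k // 2, groups=c_t)
            for c_t, k in zip(splits, kernel_sizes)
        ])

    def forward(self, x):
        xs = torch.split(x, self.splits, dim=1)   # virtual tensors X^1..X^g
        ys = [conv(x_t) for conv, x_t in zip(self.convs, xs)]
        return torch.cat(ys, dim=1)               # Out = Concat(Y^1, ..., Y^g)

x = torch.randn(1, 32, 56, 56)
print(MDConv(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```

With stride-1 depthwise convolutions and same padding per group, the output keeps the input's spatial size and channel count, so MDConv can drop in where a vanilla depthwise convolution was used.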
19. Design Choices
• Group Size g
In the extreme case of g = 1, MDConv becomes equivalent to a vanilla
depthwise convolution.
g = 4 is generally a safe choice for MobileNets, but with the help of neural
architecture search, models can further benefit from a variety of group sizes
from 1 to 5.
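To make the g = 1 case concrete: a single group covering all channels with one 3x3 kernel is exactly a vanilla depthwise convolution (minimal sketch, my own illustration):

```python
import torch.nn as nn

c = 32
# g = 1: one group over all channels with a single 3x3 kernel,
# i.e. a plain 3x3 depthwise convolution
single_group = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)
```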
20. Kernel Size Per Group
• In theory, each group can have an arbitrary kernel size.
• The authors restrict kernel sizes to always start from 3x3 and monotonically
increase by 2 per group.
• In other words, group i always has kernel size (2i+1) x (2i+1).
A 4-group MDConv always uses kernel sizes {3x3, 5x5, 7x7, 9x9}
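As a quick check of this rule (an illustrative helper, not from the paper):

```python
# Kernel size of group i (1-indexed): starts at 3x3 and grows by 2 per group
def kernel_sizes(num_groups):
    return [2 * i + 1 for i in range(1, num_groups + 1)]

print(kernel_sizes(4))  # [3, 5, 7, 9]
```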
21. Channel Size Per Group
• Two channel partition methods
Equal partition: each group has the same number of filters.
For a 4-group MDConv with total filter size 32, the channels are divided into (8, 8, 8, 8).
Exponential partition: the i-th group has about a 2^-i portion of the total
channels.
For a 4-group MDConv with total filter size 32, the channels are divided into (16, 8, 4, 4).
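A small sketch of the two partition schemes (my own helper functions, reproducing the (8, 8, 8, 8) and (16, 8, 4, 4) examples above):

```python
# Two ways to split `channels` filters across `g` groups, as described above.
def equal_partition(channels, g):
    sizes = [channels // g] * g
    sizes[0] += channels - sum(sizes)   # hand any remainder to the first group
    return sizes

def exponential_partition(channels, g):
    # i-th group gets roughly a 2^-i portion; the last group absorbs the rest
    sizes = [channels // (2 ** i) for i in range(1, g)]
    sizes.append(channels - sum(sizes))
    return sizes

print(equal_partition(32, 4))        # [8, 8, 8, 8]
print(exponential_partition(32, 4))  # [16, 8, 4, 4]
```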
22. Dilated Convolution
• Since large kernels need more parameters and computations, an
alternative is to use dilated convolution.
• However, dilated convolutions usually yield lower accuracy than larger
kernels.
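For comparison, a dilated 3x3 depthwise kernel covers the same 5x5 receptive field as a true 5x5 kernel with fewer parameters (minimal PyTorch sketch, my own illustration):

```python
import torch.nn as nn

c = 32
# 3x3 kernel with dilation 2: 5x5 receptive field, but only 3*3*c weights
dilated = nn.Conv2d(c, c, kernel_size=3, padding=2, dilation=2, groups=c)
# True 5x5 depthwise kernel: 5*5*c weights
large = nn.Conv2d(c, c, kernel_size=5, padding=2, groups=c)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dilated), n_params(large))  # 320 vs 832 (incl. biases)
```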
25. Ablation Study
• MDConv for Single Layer
For most layers the accuracy doesn't change much, but for certain layers
with stride 2, a larger kernel can significantly improve the accuracy.
27. MixNets
• To further demonstrate the effectiveness of MDConv, the authors
leverage recent progress in neural architecture search to develop a
new family of MDConv-based models, named MixNets.
• Similar to recent neural architecture search approaches, the authors
directly search on the ImageNet training set, and then pick a few top-
performing models from the search to verify their accuracy on the ImageNet
validation set and on transfer learning datasets.
28. MixNet Architecture
• Small kernels are more common in the early stages to save
computational cost, while large kernels are more common in the later
stages for better accuracy.
• The bigger MixNet-M tends to use more large kernels and more
layers to pursue higher accuracy, at the cost of more parameters
and FLOPs.
32. Conclusion
• The authors revisit the impact of kernel size for depthwise convolution,
and identify that traditional depthwise convolution suffers from the
limitation of a single kernel size.
• They propose MDConv, which mixes multiple kernels in a single op.
• MDConv is a simple drop-in replacement for vanilla depthwise
convolution and improves accuracy and efficiency.
• They further develop a new family of MixNets using NAS techniques,
and MixNets achieve significantly better accuracy and efficiency than
the latest mobile ConvNets.