Slides for paper reading at the Vietnam AI Community in Japan
Explanation of MobileNet V2: Inverted Residuals and Linear Bottlenecks, a paper in CVPR 2018
1. Pham Quang Khang
2018/8/18 Paper Reading Fest 20180819 1
MobileNet V2: Inverted Residuals
and Linear Bottlenecks
Mark Sandler et al. CVPR 2018
2. Agenda
1. Motivation of research
2. Key components of MobileNet V2
a. Depthwise Separable Convolutions
b. Linear bottlenecks and inverted residual
c. Effect of linear bottlenecks and inverted residual
3. Architecture of MobileNet V2
4. Experiments and results
2018/8/18 Paper Reading Fest 20180819 2
5. Evolution of ImageNet models
■ 2012: AlexNet, the major debut for the power of CNNs
– 5 conv layers: 48, 128, 192, 192, 128 filters per GPU
– FC layers: 2048, 2048 per GPU
■ 2014: VGG-19, the power of very deep networks
– Conv layers: 16 layers of 3×3 convs
– FC layers: 4096, 4096
■ 2015: ResNet, a very, very deep network
– 152 layers of residual blocks with convs of various sizes
– No large FC layers
■ 2014 – 2016: Inception -> Inception v4, Inception + ResNet
■ Xception (CVPR 2017)
■ MobileNet, ShuffleNet => it is time for architectures that can fit on mobile devices
2018/8/18 Paper Reading Fest 20180819 5
6. Computation power requirements
■ Previous architectures required massive amounts of memory and computational power
■ To run image classification or detection on mobile devices, it is a must to create lighter models with sufficient accuracy
Model                 | ImageNet Accuracy | Million Mult-Adds | Million Parameters
MobileNetV2           | 72.0%             | 300               | 3.4
MobileNetV1           | 70.6%             | 569               | 4.2
GoogleNet (Inception) | 69.8%             | 1550              | 6.8
VGG 16                | 71.5%             | 15300             | 138
Andrew G. Howard et al. 2017
Mark Sandler et al. 2018
8. Depthwise Separable Conv
■ Conventional conv: transforms a DF × DF × M input (spatial size DF, M channels) into a DF × DF × N output, using a DK × DK × M × N kernel
– Cost to compute one point of the output: DK × DK × M
– Cost to compute the whole output: DK × DK × M × DF × DF × N
■ Conv = filtering + combination
■ New way: split into 2 steps, filtering then combination
– Depthwise conv (filtering): use DK × DK × 1 kernels, one per channel, to first get an intermediate DF × DF × M output
Cost: DK × DK × M × DF × DF
– Pointwise conv (combination): use a 1 × 1 × M × N kernel to combine the channels of the intermediate output into the final DF × DF × N output
Cost: M × DF × DF × N
– Total cost: DF × DF × M × (DK × DK + N)
– With DK = 3, the cost drops by a factor of roughly 8 to 9
Andrew G. Howard et al. 2017
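The cost savings above can be checked with a small calculation. Below is a minimal Python sketch; the function names and example sizes are mine, not from the slides:

```python
# Multiply-add cost of a conventional convolution: a DK x DK x M
# kernel applied at each of the DF x DF output positions, for each
# of the N output channels.
def conv_cost(df, m, n, dk):
    return dk * dk * m * df * df * n

# Depthwise (DK x DK x 1 per channel) followed by pointwise (1 x 1 x M x N).
def depthwise_separable_cost(df, m, n, dk):
    depthwise = dk * dk * m * df * df
    pointwise = m * df * df * n
    return depthwise + pointwise  # = df * df * m * (dk * dk + n)

# Example: DF = 112, M = 32, N = 256, DK = 3
ratio = conv_cost(112, 32, 256, 3) / depthwise_separable_cost(112, 32, 256, 3)
print(ratio)  # close to, but below, DK * DK = 9
```

The ratio simplifies to DK²·N / (DK² + N), which approaches DK² = 9 as N grows, matching the "around 9 times" claim.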
9. ReLU and information loss
■ Manifold of interest: each activation tensor of dims h_i × w_i × d_i can be treated as h_i × w_i pixels with d_i dimensions
■ If the manifold of interest can be embedded in a low-dimensional subspace, reducing the dimension of the layer would not cause information loss
■ Not so with a non-linear transformation like ReLU:
– If the manifold of interest remains non-zero volume after the ReLU transformation, ReLU acts on it as a linear transformation
– ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space
=> Use linear bottleneck layers
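This claim can be illustrated with a quick experiment in the spirit of the paper's Figure 1. The setup below, including the dimensions and the linear read-out used to measure information loss, is my own simplification, not the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))  # points on a 2-D "manifold of interest"

def relu_recon_error(n_dims):
    """Embed 2-D points into n_dims channels, apply ReLU, then
    measure how well a linear map can recover the original points."""
    B = rng.standard_normal((2, n_dims))   # random embedding matrix
    y = np.maximum(x @ B, 0.0)             # ReLU in the embedded space
    W, *_ = np.linalg.lstsq(y, x, rcond=None)
    return float(np.mean((y @ W - x) ** 2))

# ReLU in a barely-larger space destroys information; with many
# channels the 2-D manifold survives almost intact.
low, high = relu_recon_error(3), relu_recon_error(30)
print(low > high)  # True: fewer channels, more information lost
```

This mirrors the slide's point: ReLU is safe when the manifold sits in a low-dimensional subspace of a much higher-dimensional activation space, which is why the projection back down should stay linear.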
10. Inverted Residuals and Linear Bottlenecks
■ Residual connections: improve the ability of the gradient to propagate across layers
■ Inverted residuals: the shortcuts connect the bottlenecks rather than the expanded layers, which is considerably more memory efficient
Kaiming He et al. 2015
11. Unit block of MobileNet V2
■ Combines depthwise separable convolutions, linear bottlenecks and the inverted residual block
■ Computational cost per block: h × w × d × t(d′ + k² + d)
■ With this, the input and output dimensions can be relatively small
Input              | Operator                   | Output
h × w × d          | 1×1 conv2d, ReLU6          | h × w × (td)
h × w × td         | 3×3 dwise, stride s, ReLU6 | (h/s) × (w/s) × (td)
(h/s) × (w/s) × td | linear 1×1 conv2d          | (h/s) × (w/s) × d′
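The per-block cost formula can be verified by summing the three layers in the table. A minimal sketch (the function name and example sizes are mine), assuming stride 1:

```python
# Multiply-adds of one inverted residual block at stride 1,
# summed over its three layers (expansion, depthwise, projection).
def block_madds(h, w, d, t, d_out, k=3):
    expand = h * w * d * (t * d)         # 1x1 conv2d: d -> td channels
    depthwise = h * w * (t * d) * k * k  # kxk dwise on td channels
    project = h * w * (t * d) * d_out    # linear 1x1 conv2d: td -> d'
    return expand + depthwise + project

# Matches the closed form h * w * d * t * (d + k^2 + d_out):
h, w, d, t, d_out = 14, 14, 64, 6, 64
assert block_madds(h, w, d, t, d_out) == h * w * d * t * (d + 3 * 3 + d_out)
```

Factoring t·d out of all three terms gives exactly the h × w × d × t(d′ + k² + d) expression on the slide.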
12. Inverted residual bottleneck for memory saving
■ Transformation function: F(x) = [A ∘ N ∘ B](x)
– A: linear transformation (expansion): R^(s×s×k) → R^(s×s×n)
– N: ReLU6 ∘ dwise ∘ ReLU6: R^(s×s×n) → R^(s′×s′×n)
– B: linear transformation (projection): R^(s′×s′×n) → R^(s′×s′×k′)
■ Memory needed: |s²k| + |s′²k′| + O(max(s², s′²))
■ If the expansion layer can be separated into t tensors (whose concatenation makes up the original tensor):
F(x) = Σᵢ₌₁ᵗ (Aᵢ ∘ N ∘ Bᵢ)(x)
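The memory bound can be made concrete with a small calculation. This is a sketch under my own simplifying assumptions: memory counts activations only, and the expanded n-channel intermediate is computed in n/t-channel chunks:

```python
# Inference memory (activation values) for one inverted residual block:
# both bottleneck tensors are held in full, while the expanded
# n-channel intermediate only needs one of its t chunks at a time.
def block_memory(s, k, s2, k2, n, t):
    bottlenecks = s * s * k + s2 * s2 * k2        # |s^2 k| + |s'^2 k'|
    intermediate = max(s * s, s2 * s2) * (n // t)
    return bottlenecks + intermediate

# A 14x14x96 -> 14x14x96 block with expansion factor 6 (n = 576):
full = block_memory(14, 96, 14, 96, 576, 1)       # whole intermediate at once
chunked = block_memory(14, 96, 14, 96, 576, 576)  # one channel at a time
print(full, chunked)
```

Because the expanded layer is never materialized in full, memory is dominated by the small bottleneck tensors, which is the point of inverting the residual connection.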
14. Architecture of the model
■ Each line is a sequence of 1 or more identical layers, repeated n times
■ c: number of output channels
■ The first layer of each sequence has stride s; all others use stride 1
■ All spatial convs use 3×3 kernels
■ t: expansion factor of the bottleneck layers
■ Input resolution: 96 to 224
■ A width multiplier can be used for thinner models
Input        | Operator    | t | c    | n | s
224² × 3     | conv2d      | - | 32   | 1 | 2
112² × 32    | bottleneck  | 1 | 16   | 1 | 1
112² × 16    | bottleneck  | 6 | 24   | 2 | 2
56² × 24     | bottleneck  | 6 | 32   | 3 | 2
28² × 32     | bottleneck  | 6 | 64   | 4 | 2
14² × 64     | bottleneck  | 6 | 96   | 3 | 1
14² × 96     | bottleneck  | 6 | 160  | 3 | 2
7² × 160     | bottleneck  | 6 | 320  | 1 | 1
7² × 320     | conv2d 1×1  | - | 1280 | 1 | 1
7² × 1280    | avgpool 7×7 | - | -    | 1 | -
1 × 1 × 1280 | conv2d 1×1  | - | k    | - | -
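The stride column determines the spatial resolution at each stage, since only the first layer of each sequence downsamples. A quick Python trace of the table (the tuple layout is my own encoding of it):

```python
# (expansion t, output channels c, repeats n, first-layer stride s)
# for the bottleneck sequences in the architecture table.
schedule = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]

def trace(input_res=224):
    res = input_res // 2  # initial conv2d, stride 2, 32 channels
    channels = 32
    for t, c, n, s in schedule:
        res //= s         # only the first of the n repeats strides
        channels = c
    return res, channels

print(trace())  # (7, 320), before the 1x1 conv to 1280 and 7x7 avgpool
```

Running the same trace with a smaller input (e.g. 96) shows why the table's strides work for the whole 96 to 224 resolution range: the final feature map simply shrinks to 3 × 3.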
17. ImageNet Classification
■ TensorFlow
■ RMSProp with decay and momentum of 0.9
■ Batch normalization after every layer
■ Weight decay of 0.00004
■ Initial learning rate 0.045
■ Learning rate decay of 0.98 per epoch
■ 16 GPUs
■ Batch size 96
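The learning-rate schedule above is a simple exponential decay. A one-line sketch, assuming (as listed) the 0.98 factor is applied once per epoch:

```python
def learning_rate(epoch, base_lr=0.045, decay=0.98):
    # exponential decay: multiply by 0.98 after every epoch
    return base_lr * decay ** epoch

print(learning_rate(0))   # 0.045
print(learning_rate(10))  # roughly 0.037 after ten epochs
```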
18. Comparison between models for mobile (ImageNet)
■ MobileNet, ShuffleNet, NasNet
■ MobileNetV2 with different input resolutions vs NasNet, MobileNetV1, ShuffleNet

Model             | ImageNet Accuracy | Million Mult-Adds | Million Parameters
MobileNetV1       | 70.6%             | 575               | 4.2
ShuffleNet (1.5)  | 71.5%             | 292               | 3.4
ShuffleNet (×2)   | 73.7%             | 524               | 5.4
NasNet-A          | 74.0%             | 564               | 5.3
MobileNetV2       | 72.0%             | 300               | 3.4
MobileNetV2 (1.4) | 74.7%             | 585               | 6.9
19. Object detection
■ Use MobileNet V2 as a feature extractor for object detection with a modified version of the Single Shot Detector (SSD) on the COCO dataset
■ Compared with YOLOv2 and the original SSD
■ SSDLite: replace all normal convs with separable convs in the SSD prediction layers
■ MNetV2 + SSDLite runs on a Pixel 1 phone
Liu et al. 2016
Model            | mAP (Avg. Precision) | Params (Million) | MAdds | CPU
SSD300           | 23.2                 | 36.1             | 35.2B | -
SSD512           | 26.8                 | 36.1             | 99.5B | -
YOLOv2           | 21.6                 | 50.7             | 17.5B | -
MNetV1 + SSDLite | 22.2                 | 5.1              | 1.3B  | 270ms
MNetV2 + SSDLite | 22.1                 | 4.3              | 0.8B  | 200ms
20. Thank you for listening. Time for Q&A