ShuffleNet:
An Extremely Efficient Convolutional Neural
Network for Mobile Devices
10th December, 2017
PR12 Paper Review
Jinwon Lee
Samsung Electronics
Xiangyu Zhang, et al. “ShuffleNet: an extremely efficient
convolutional neural network for mobile devices.”, arXiv:1707.01083
Before We Start…
• Xception
 PR-034 : reviewed by Jaejun Yoo
 https://youtu.be/V0dLhyg5_Dw
• MobileNet
 PR-044 : reviewed by Jinwon Lee
 https://youtu.be/7UoOFKcyIvM
Standard Convolution
Figures from http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
Inception v2 & v3
• Factorization of convolution filters
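To make "factorization of convolution filters" concrete, here is a minimal PyTorch sketch (my own illustration, not from the slides; the channel counts are arbitrary example values) of the two Inception-style factorizations: a 5x5 convolution replaced by two stacked 3x3 convolutions, and a 3x3 convolution replaced by a 1x3 followed by a 3x1.

```python
import torch.nn as nn

# Illustrative factorizations; channel counts are arbitrary example values.
c_in, c_out = 64, 64

conv5x5 = nn.Conv2d(c_in, c_out, kernel_size=5, padding=2)
stacked = nn.Sequential(                        # two 3x3 convs span the same 5x5 receptive field
    nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
    nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
)

conv3x3 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
asymmetric = nn.Sequential(                     # 1x3 then 3x1 spans a 3x3 receptive field
    nn.Conv2d(c_in, c_out, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(c_out, c_out, kernel_size=(3, 1), padding=(1, 0)),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(conv5x5), n_params(stacked))     # the factored versions use fewer parameters
print(n_params(conv3x3), n_params(asymmetric))
```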
Xception
Slides from Jaejun Yoo’s PR-034: Xception
Depthwise Separable Convolution
• Depthwise Convolution + Pointwise Convolution (1x1 convolution); see the sketch below
Depthwise convolution Pointwise convolution
Figures from http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
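A minimal PyTorch sketch of the depthwise separable convolution above (my own illustration, not the MobileNet code; channel counts and the input size are arbitrary example values):

```python
import torch
import torch.nn as nn

c_in, c_out = 32, 64

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

depthwise_separable = nn.Sequential(
    # Depthwise: one 3x3 filter per input channel (groups == in_channels)
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),
    # Pointwise: a 1x1 convolution mixes information across channels
    nn.Conv2d(c_in, c_out, kernel_size=1),
)

x = torch.randn(1, c_in, 56, 56)
print(standard(x).shape, depthwise_separable(x).shape)    # both (1, 64, 56, 56)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard), n_params(depthwise_separable))  # the separable version is far cheaper
```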
Depthwise Separable Convolutions
MobileNet Architecture
Toward More Efficient Network Architecture
• In MobileNet, the 1x1 (pointwise) convolutions account for most of the computation
• Can we reduce more?
A Secret of AlexNet
Grouped Convolution!
Grouped Convolution of AlexNet
• The primary motivation for filter groups in AlexNet was to allow training the network across two Nvidia GTX 580 GPUs with 1.5GB of memory each
• AlexNet without filter groups is not only less efficient (in both parameters and compute), but also slightly less accurate!
Figure from https://blog.yani.io/filter-group-tutorial/
Main Ideas of ShuffleNet
• (Use depthwise separable convolution)
• Grouped convolution on 1x1 convolution layers – pointwise group
convolution
• Channel shuffle operation after pointwise group convolution
Grouped Convolution
Figure from https://blog.yani.io/filter-group-tutorial/
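A pointwise group convolution is just a 1x1 convolution whose filters each see only their own 1/g slice of the input channels, so parameters and FLOPs drop by roughly a factor of g. A minimal PyTorch sketch (channel counts and g are example values, not from the paper):

```python
import torch.nn as nn

c_in, c_out, g = 240, 240, 3    # example values only

dense_1x1   = nn.Conv2d(c_in, c_out, kernel_size=1)
grouped_1x1 = nn.Conv2d(c_in, c_out, kernel_size=1, groups=g)  # pointwise group convolution

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dense_1x1), n_params(grouped_1x1))  # roughly a g-fold reduction
```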
ResNeXt
• Equivalent building blocks of ResNeXt (cardinality = 32)
1x1 Grouped Convolution with Channel Shuffling
• If multiple group convolutions are stacked together, there is one side effect (shown in (a))
 Outputs from a certain channel are derived from only a small fraction of the input channels
• If we allow group convolution to obtain input data from different groups, the input and output channels will be fully related.
Channel Shuffle Operation
• Suppose a convolutional layer with g groups whose output has g x n channels: first reshape the output channel dimension into (g, n), then transpose and flatten it back as the input of the next layer (see the sketch below)
• The channel shuffle operation is also differentiable
TensorFlow implementation from https://github.com/MG2033/ShuffleNet
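The referenced repository is a TensorFlow implementation; below is an equivalent minimal sketch in PyTorch of the reshape / transpose / flatten trick described above (the function name and the toy example are mine).

```python
import torch

def channel_shuffle(x, groups):
    """Reshape g*n channels to (g, n), transpose to (n, g), flatten back."""
    batch, channels, height, width = x.size()
    n = channels // groups
    x = x.view(batch, groups, n, height, width)    # (B, g, n, H, W)
    x = x.transpose(1, 2).contiguous()             # (B, n, g, H, W)
    return x.view(batch, channels, height, width)  # back to (B, g*n, H, W)

# Toy example: 6 channels, 2 groups -> order 0,1,2,3,4,5 becomes 0,3,1,4,2,5
x = torch.arange(6.).view(1, 6, 1, 1)
print(channel_shuffle(x, 2).flatten())
```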
ShuffleNet Units
ShuffleNet Units
• From (a), replace the first 1x1 layer with pointwise group convolution
followed by a channel shuffle operation
• ReLU is not applied to 3x3 DWConv
• For the case where ShuffleNet is applied with stride, simply make two modifications (both unit variants are sketched after this list)
 Add 3x3 average pooling on the shortcut path
 Replace element-wise addition with channel concatenation to enlarge
channel dimension with little extra computation
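Putting the pieces together, here is a hedged PyTorch sketch of the two unit variants described above (stride 1 with an identity shortcut and element-wise addition; stride 2 with a 3x3 average-pooled shortcut and channel concatenation). It reuses the channel_shuffle function sketched earlier; the 1/4 bottleneck width and BN placement follow common practice, and details such as the first stage (where group convolution is not applied to the initial 1x1 layer) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleNetUnit(nn.Module):
    def __init__(self, in_channels, out_channels, groups=3, stride=1):
        super().__init__()
        self.stride = stride
        self.groups = groups
        # With stride 2 the shortcut is concatenated, so the residual branch
        # only needs to produce the remaining channels.
        branch_out = out_channels - in_channels if stride == 2 else out_channels
        mid = out_channels // 4                    # bottleneck channels
        self.gconv1 = nn.Conv2d(in_channels, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                groups=mid, bias=False)   # 3x3 depthwise
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, branch_out, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(branch_out)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))     # pointwise group conv + ReLU
        out = channel_shuffle(out, self.groups)    # shuffle channels across groups
        out = self.bn2(self.dwconv(out))           # depthwise conv, no ReLU
        out = self.bn3(self.gconv2(out))           # second pointwise group conv
        if self.stride == 2:
            shortcut = F.avg_pool2d(x, 3, stride=2, padding=1)
            return F.relu(torch.cat([shortcut, out], dim=1))  # concat enlarges channels
        return F.relu(x + out)                     # element-wise add for stride 1
```

For example, ShuffleNetUnit(240, 480, groups=3, stride=2) maps a (1, 240, 28, 28) input to (1, 480, 14, 14), with the extra 240 channels coming from the pooled shortcut.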
Complexity
• For example, given the input size c x h x w and bottleneck channels m, the number of operations is (a worked comparison follows below):
 ResNet: hw(2cm + 9m²)
 ResNeXt: hw(2cm + 9m²/g)
 ShuffleNet: hw(2cm/g + 9m)
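A quick numeric check of the formulas above; the values of h, w, c, m, and g are arbitrary example numbers, not taken from the paper.

```python
# FLOPs of one bottleneck unit under each design, using the formulas above.
h, w, c, m, g = 28, 28, 240, 60, 3   # example sizes only

resnet     = h * w * (2 * c * m + 9 * m ** 2)      # two dense 1x1s + dense 3x3
resnext    = h * w * (2 * c * m + 9 * m ** 2 / g)  # grouped 3x3
shufflenet = h * w * (2 * c * m / g + 9 * m)       # grouped 1x1s + depthwise 3x3

print(int(resnet), int(resnext), int(shufflenet))  # ShuffleNet's unit is by far the cheapest
```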
ShuffleNet Architecture
Experimental Results
• To customize the network to a desired complexity, a scale factor s is applied to the number of channels
 ShuffleNet s× means scaling the number of filters in ShuffleNet 1× by s, so the overall complexity will be roughly s² times that of ShuffleNet 1× (convolution cost scales with the product of input and output channel counts)
• Grouped convolutions (g > 1) consistently perform better than the counterparts without pointwise group convolutions (g = 1); smaller models tend to benefit more from groups
Experimental Results
• It is clear that channel shuffle consistently boosts classification
scores for different settings
Experimental Results
• ShuffleNet models outperform most others by a significant margin
under different complexities
• Interestingly, they found an empirical relationship between feature
map channels and classification accuracy
 Under a complexity of 38 MFLOPs, the Stage 4 output channels of the VGG-like, ResNet, ResNeXt, Xception-like, and ShuffleNet models are 50, 192, 192, 288, and 576 respectively
• PVANet has a 29.7% classification error with 557 MFLOPs, while the ShuffleNet 2× model (g=3) achieves 26.3% with 524 MFLOPs
Experimental Results
• ShuffleNet models are clearly superior to MobileNet across all complexities, even though the ShuffleNet architecture is specially designed for small models (< 150 MFLOPs)
• Results show that the shallower model is still significantly better than the
corresponding MobileNet, which implies that the effectiveness of ShuffleNet
mainly results from its efficient structure, not the depth.
Experimental Results
• Results show that with similar accuracy ShuffleNet is much more
efficient than others.
Experimental Results
• Due to memory access and other overheads, every 4× theoretical complexity reduction usually results in a ~2.6× actual speedup; compared with AlexNet, ShuffleNet achieves a ~13× actual speedup (the theoretical speedup is 18×)