SqueezeNext:
Hardware-Aware Neural Network Design
+
Amir Gholami, et al., “SqueezeNext: Hardware-Aware Neural Network Design”, CVPR 2018
Forrest N. Iandola, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, ICLR 2017
Alexander Wong, et al., “NetScore: Towards Universal Metrics for Large-scale Performance Analysis of Deep Neural Networks for Practical On-Device Edge Usage”, arXiv:1806.05512
24th February, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
SqueezeNet:
AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
NetScore:
Towards Universal Metrics for Large-scale Performance Analysis of Deep
Neural Networks for Practical On-Device Edge Usage
Related Papers in PR12
• MobileNet
  – PR-044: https://youtu.be/7UoOFKcyIvM
• MobileNetV2
  – PR-108: https://youtu.be/mT5Y-Zumbbw
• ShuffleNet
  – PR-054: https://youtu.be/pNuBdj53Hbc
• ShuffleNetV2
  – PR-120: https://youtu.be/lrU6uXiJ_9Y
CNN Benchmark from “NetScore”
NetScore
Introduction
• Much of the focus in the design of deep neural networks has been on
improving accuracy, leading to more powerful yet highly complex
network architectures.
• But, they are difficult to deploy in practical scenarios, particularly on
edge devices such as mobile and other consumer devices.
• The design of deep neural networks that strike a balance between accuracy and complexity has therefore become a very active area of research.
Information Density
• Information density is one of the most widely cited metrics in the research literature for assessing the performance of DNNs, and it accounts for both accuracy and architectural complexity.
  – D(N): information density
  – a(N): accuracy
  – p(N): the number of parameters
• The information density metric does not account for the fact that the
architecture complexity does not necessarily reflect the
computational requirements for performing network inference.
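As defined in the NetScore paper, information density is simply accuracy divided by parameter count:

D(N) = \frac{a(N)}{p(N)}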
• NetScore is designed specifically to provide a quantitative assessment of the balance between accuracy, computational complexity, and network architecture complexity of a DNN.
  – Ω(N): NetScore
  – a(N): accuracy (top-1 accuracy on the ILSVRC 2012 dataset)
  – p(N): the number of parameters in the network
  – m(N): the number of multiply-accumulate (MAC) operations during inference
  – α = 2, β = 0.5, γ = 0.5
Architectural and computational complexity are both very important factors. However, the most important metric remains accuracy, given that networks with unreasonably low accuracy are not useful in practical scenarios regardless of size and speed.
NetScore uses logarithmic scaling to account for the large dynamic range, inspired by the decibel scale in signal processing.
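Reconstructed from the definitions above, the NetScore formula is:

\Omega(N) = 20 \log \left( \frac{a(N)^{\alpha}}{p(N)^{\beta} \, m(N)^{\gamma}} \right)

with α = 2, β = 0.5, γ = 0.5 as listed above, so accuracy is weighted most heavily while parameter count and MAC count are penalized equally.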
SqueezeNet
SqueezeNet
• Architectural Design Strategies
1. Replace 3x3 filters with 1x1 filters
2. Decrease the number of input channels to 3x3 filters
   The total quantity of parameters in a 3x3 conv layer is (number of input channels) x (number of filters) x (3x3) — see the worked example below.
3. Downsample late in the network so that convolution layers have large activation maps
   Large activation maps (due to delayed downsampling) can lead to higher classification accuracy.
• Strategies 1 and 2 are about judiciously decreasing the quantity of
parameters in a CNN while attempting to preserve accuracy.
• Strategy 3 is about maximizing accuracy on a limited budget of
parameters.
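A quick back-of-the-envelope check of strategies 1 and 2 using the parameter formula above (the layer sizes here are purely illustrative):

```python
# Parameter count of a conv layer: in_channels * num_filters * kH * kW (ignoring biases).
def conv_params(in_ch, num_filters, k):
    return in_ch * num_filters * k * k

print(conv_params(128, 128, 3))  # 147,456 params with 3x3 filters
print(conv_params(128, 128, 1))  # Strategy 1: 1x1 filters -> 16,384 (9x fewer)
print(conv_params(32, 128, 3))   # Strategy 2: squeeze inputs to 32 channels -> 36,864
```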
The Fire Module
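A minimal PyTorch-style sketch of the Fire module from the SqueezeNet paper: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand layers whose outputs are concatenated (the channel sizes below follow SqueezeNet's fire2, but are only illustrative):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: fire2 in SqueezeNet uses 16 squeeze filters and 64+64 expand filters.
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))  # -> (1, 128, 55, 55)
```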
Macroarchitectural View
SqueezeNet Architecture
CNN Microarchitecture Metaparameters
CNN Macroarchitecture Design Space Exploration
Best performance
Results
Network Pruning & Deep Compression
PR-072: Deep Compression by Taeoh Kim
https://youtu.be/9mFZmpIbMDs
The Impact of SqueezeNet
• SqueezeDet & SqueezeSeg
SqueezeNext
Motivation
• A general trend of neural network design has been to find larger and
deeper models to get better accuracy without considering the
memory or power budget.
• However, the increase in transistor speed due to semiconductor process improvements has slowed dramatically, and it seems unlikely that mobile processors will meet these computational requirements on a limited power budget.
Contributions
• Use a more aggressive channel reduction by incorporating a two-stage squeeze module.
• Use separable 3x3 convolutions to further reduce the model size, and
remove the additional 1x1 branch after the squeeze module.
• Use an element-wise addition skip connection similar to that of the ResNet architecture.
• Optimize the baseline SqueezeNext architecture by simulating its
performance on a multi-processor embedded system.
Design – Low Rank Filters
• Decompose the K x K convolutions into two separable convolutions
of size 1 x K and K x 1
• This effectively reduces the number of parameters from K² to 2K, and also increases the depth of the network.
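A minimal sketch of this low-rank decomposition (channel and kernel sizes are illustrative):

```python
import torch.nn as nn

C, K = 64, 3  # channels and kernel size (illustrative)

# Standard KxK convolution: C*C*K*K weights.
full = nn.Conv2d(C, C, kernel_size=K, padding=K // 2)

# Low-rank decomposition into 1xK followed by Kx1: 2*C*C*K weights,
# i.e. K^2 -> 2K per channel pair, at the cost of one extra layer of depth.
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, K), padding=(0, K // 2)),
    nn.Conv2d(C, C, kernel_size=(K, 1), padding=(K // 2, 0)),
)

count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(count(full), count(separable))  # 36,928 vs 24,704 (including biases)
```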
Design – Bottleneck Module
• Use a variation of the bottleneck approach with a two-stage squeeze layer.
• Use two bottleneck modules, each reducing the channel size by a factor of 2, followed by two separable convolutions.
• Also incorporate a final 1 x 1 expansion module, which further reduces the number of output channels for the separable convolutions.
Design – Fully Connected Layers
• In the case of AlexNet, the majority of the network parameters are in
Fully Connected layers, accounting for 96% of the total model size.
• SqueezeNext incorporates a final bottleneck layer to reduce the
input channel size to the last fully connected layer, which
considerably reduces the total number of model parameters.
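A back-of-the-envelope check of that 96% figure using AlexNet's standard layer dimensions (weights only, biases ignored):

```python
# AlexNet weight counts: conv1..conv5 followed by fc6..fc8 (single-tower dimensions).
conv = 3*96*11*11 + 96*256*5*5 + 256*384*3*3 + 384*384*3*3 + 384*256*3*3
fc   = 256*6*6*4096 + 4096*4096 + 4096*1000
print(fc / (conv + fc))  # ~0.94 here, ~0.96 with the original grouped convolutions
```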
Comparison of Building Blocks
SqueezeNext Block
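As a rough sketch of the SqueezeNext block described above (the 2x reduction factors and BatchNorm placement are my assumptions; stride handling and the projection shortcut used when dimensions change are omitted):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, padding=0):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, padding=padding),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class SqueezeNextBlock(nn.Module):
    """Two-stage 1x1 squeeze -> 1x3 -> 3x1 -> 1x1 expand, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_relu(channels, channels // 2, 1),            # squeeze stage 1
            conv_bn_relu(channels // 2, channels // 4, 1),       # squeeze stage 2
            conv_bn_relu(channels // 4, channels // 2, (1, 3), padding=(0, 1)),
            conv_bn_relu(channels // 2, channels // 2, (3, 1), padding=(1, 0)),
            conv_bn_relu(channels // 2, channels, 1),            # 1x1 expansion
        )

    def forward(self, x):
        return self.body(x) + x  # element-wise addition skip connection

out = SqueezeNextBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```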
Block Arrangement in 1.0-SqNxt-23
Breakdown of the 1.0-SqNxt-23 architecture
[Table: block counts per section – 6, 6, 8, 1; ¹ for skip connection]
Hardware Platform
• Weight Stationary & Output Stationary
• The x and y loops form the innermost loop in the WS data flow, whereas the c, i, and j loops form the innermost loop in the OS data flow.
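A hedged pseudocode sketch of the two loop orders, reading x, y as output-map coordinates, c as the input channel, and i, j as the filter window (the PE array and memory hierarchy are abstracted away):

```python
# Weight-stationary (WS): each weight W[k][c][i][j] is held in a PE while the
# x, y output loops run innermost, so the resident weight is reused across the map.
def conv_ws(X, W, Y):
    for k in range(len(W)):
        for c in range(len(W[0])):
            for i in range(len(W[0][0])):
                for j in range(len(W[0][0][0])):
                    w = W[k][c][i][j]                  # stays resident in the PE
                    for y in range(len(Y[0])):         # innermost: output rows
                        for x in range(len(Y[0][0])):  # innermost: output cols
                            Y[k][y][x] += w * X[c][y + i][x + j]

# Output-stationary (OS): each partial sum Y[k][y][x] is held in a PE while the
# c, i, j loops run innermost, accumulating into the resident output element.
def conv_os(X, W, Y):
    for k in range(len(W)):
        for y in range(len(Y[0])):
            for x in range(len(Y[0][0])):
                acc = 0                                # stays resident in the PE
                for c in range(len(W[0])):             # innermost: input channels,
                    for i in range(len(W[0][0])):      # filter rows,
                        for j in range(len(W[0][0][0])):  # and filter cols
                            acc += W[k][c][i][j] * X[c][y + i][x + j]
                Y[k][y][x] += acc
```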
Hardware Simulation Setup
• 16x16 or 8x8 array of PEs.
• A 128KB or 32KB global buffer and a
DMA controller to transfer data between
DRAM and the buffer.
• A PE has a 16-bit integer multiply-and-accumulate (MAC) unit and a local register file.
• The performance estimator computes
the number of clock cycles required to
process each layer and sums all the
results.
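This is not the authors' simulator, but a toy estimator in the same spirit: it divides each layer's MAC count by the number of PEs (assuming one MAC per PE per cycle and perfect utilization) and sums the results; the real estimator also accounts for data movement between DRAM, the global buffer, and the PEs.

```python
import math

def conv_macs(in_ch, out_ch, k_h, k_w, out_h, out_w):
    # Multiply-accumulate operations for one convolution layer.
    return in_ch * out_ch * k_h * k_w * out_h * out_w

def estimate_cycles(layers, num_pes=16 * 16):
    # Idealized: one MAC per PE per cycle, memory traffic ignored.
    return sum(math.ceil(conv_macs(*layer) / num_pes) for layer in layers)

# Hypothetical two-layer network: (in_ch, out_ch, kH, kW, outH, outW)
layers = [(3, 64, 5, 5, 112, 112), (64, 32, 1, 1, 112, 112)]
print(estimate_cycles(layers, 16 * 16))  # 16x16 PE array
print(estimate_cycles(layers, 8 * 8))    # 8x8 PE array
```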
Classification Performance Results
• The 23-module architecture exceeds AlexNet’s performance by a 2% margin with an 84x smaller number of parameters.
• The version with twice the width and 44 modules (2.0-SqNxt-44) is able to match VGG-19’s performance with a 31x smaller number of parameters.
Hardware Performance Results
SqueezeNext v2~v5
• In the 1.0-SqNxt-23, the first 7 x 7 convolutional layer accounts for 26% of the total inference time.
• Therefore, the first optimization the authors make is replacing this 7 x 7 layer with a 5 x 5 convolution, producing the 1.0-SqNxt-23-v2 model.
• Note the significant drop in efficiency for the layers in the first module. The reason for this drop is that the initial layers have a very small number of channels that must be applied to a large input activation map.
• In the v3/v4 variants, the authors reduce the number of blocks in the first module by 2/4, respectively, and instead add them to the second module. In the v5 variant, the authors reduce the blocks of the first two modules and instead increase the blocks in the third module.
Results
1.0-SqNxt-23v5
Results
Further Discussion
• What are we trying to get by reducing the number of computations
and the number of parameters?
• In many cases it will be speed or low energy.
• Then, can a small number of computations and fewer parameters guarantee speed or lower energy?
Speed and the Number of Computations
From ShuffleNetV2
Energy/Power Efficiency and the Number of
Parameters
Slide Credit: “How to Estimate the Energy Consumption of DNNs” by Tien-Ju Yang (MIT)
Slide Credit: Movidius @ Hot Chips 2016
Key Insights of Energy Consumption
Slide Credit: “How to Estimate the Energy Consumption of DNNs” by Tien-Ju Yang (MIT)
Thank you
