SqueezeNext:
Hardware-Aware Neural Network Design
+
Amir Gholami, et al., “SqueezeNext: Hardware-Aware Neural Network Design”, CVPR 2018
Forrest N. Iandola, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, ICLR 2017
Alexander Wong, et al., “NetScore: Towards Universal Metrics for Large-scale Performance Analysis of Deep Neural Networks for Practical On-Device Edge Usage”, arXiv:1806.05512
24th February, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
SqueezeNet:
AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
NetScore:
Towards Universal Metrics for Large-scale Performance Analysis of Deep
Neural Networks for Practical On-Device Edge Usage
Related Papers in PR12
• MobileNet
  – PR-044: https://youtu.be/7UoOFKcyIvM
• MobileNetV2
  – PR-108: https://youtu.be/mT5Y-Zumbbw
• ShuffleNet
  – PR-054: https://youtu.be/pNuBdj53Hbc
• ShuffleNetV2
  – PR-120: https://youtu.be/lrU6uXiJ_9Y
CNN Benchmark from “NetScore”
NetScore
Introduction
• Much of the focus in the design of deep neural networks has been on
improving accuracy, leading to more powerful yet highly complex
network architectures.
• But, they are difficult to deploy in practical scenarios, particularly on
edge devices such as mobile and other consumer devices.
• The design of deep neural networks that strike a balance between accuracy and complexity has therefore become a very active area of research.
Information Density
• Information density is one of the most widely cited metrics in the research literature for assessing the performance of DNNs, and it accounts for both accuracy and architectural complexity.
  – D(N): information density
  – a(N): accuracy
  – p(N): the number of parameters
• The information density metric does not account for the fact that the
architecture complexity does not necessarily reflect the
computational requirements for performing network inference.
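As defined in the NetScore paper, information density is simply accuracy divided by parameter count:

D(N) = \frac{a(N)}{p(N)}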
• NetScore is designed specifically to provide a quantitative assessment of the balance between accuracy, computational complexity, and network architecture complexity of a DNN.
  – Ω(N): NetScore
  – a(N): accuracy (top-1 accuracy on the ILSVRC 2012 dataset)
  – p(N): the number of parameters in the network
  – m(N): the number of multiply-accumulate (MAC) operations during inference
  – α = 2, β = 0.5, γ = 0.5
Architectural and computational complexity are both very important factors. However, the most important metric remains accuracy, given that networks with unreasonably low accuracy are not useful in practical scenarios regardless of size and speed.
NetScore uses logarithmic scaling to account for the large dynamic range, inspired by the decibel scale in signal processing.
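Reconstructed from the definitions above, the NetScore formula is:

\Omega(N) = 20 \log \left( \frac{a(N)^{\alpha}}{p(N)^{\beta} \, m(N)^{\gamma}} \right)

with α = 2, β = 0.5, γ = 0.5 as listed above, so accuracy is weighted most heavily while parameter count and MAC count are penalized equally.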
SqueezeNet
SqueezeNet
• Architectural Design Strategies
1. Replace 3x3 filters with 1x1 filters
2. Decrease the number of input channels to 3x3 filters
   The total quantity of parameters in a 3x3 conv layer is (number of input channels) x (number of filters) x (3x3) — see the worked example below.
3. Downsample late in the network so that convolution layers have large activation maps
   Large activation maps (due to delayed downsampling) can lead to higher classification accuracy.
• Strategies 1 and 2 are about judiciously decreasing the quantity of
parameters in a CNN while attempting to preserve accuracy.
• Strategy 3 is about maximizing accuracy on a limited budget of
parameters.
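A quick back-of-the-envelope check of strategies 1 and 2 using the parameter formula above (the layer sizes here are purely illustrative):

```python
# Parameter count of a conv layer: in_channels * num_filters * kH * kW (ignoring biases).
def conv_params(in_ch, num_filters, k):
    return in_ch * num_filters * k * k

print(conv_params(128, 128, 3))  # 147,456 params with 3x3 filters
print(conv_params(128, 128, 1))  # Strategy 1: 1x1 filters -> 16,384 (9x fewer)
print(conv_params(32, 128, 3))   # Strategy 2: squeeze inputs to 32 channels -> 36,864
```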
The Fire Module
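A minimal PyTorch-style sketch of the Fire module from the SqueezeNet paper: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand layers whose outputs are concatenated (the channel sizes below follow SqueezeNet's fire2, but are only illustrative):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: fire2 in SqueezeNet uses 16 squeeze filters and 64+64 expand filters.
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))  # -> (1, 128, 55, 55)
```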
Macroarchitectural View
SqueezeNet Architecture
CNN Microarchitecture Metaparameters
CNN Macroarchitecture Design Space Exploration
Best performance
Results
Network Pruning & Deep Compression
PR-072: Deep Compression by Taeoh Kim
https://youtu.be/9mFZmpIbMDs
The Impact of SqueezeNet
• SqueezeDet & SqueezeSeg
SqueezeNext
Motivation
• A general trend of neural network design has been to find larger and
deeper models to get better accuracy without considering the
memory or power budget.
• However, the increase in transistor speed due to semiconductor process improvements has slowed dramatically, and it seems unlikely that mobile processors will meet these computational requirements on a limited power budget.
Contributions
• Use a more aggressive channel reduction by incorporating a two-stage squeeze module.
• Use separable 3x3 convolutions to further reduce the model size, and
remove the additional 1x1 branch after the squeeze module.
• Use an element-wise addition skip connection similar to that of the ResNet architecture.
• Optimize the baseline SqueezeNext architecture by simulating its
performance on a multi-processor embedded system.
Design – Low Rank Filters
• Decompose the K x K convolutions into two separable convolutions
of size 1 x K and K x 1
• This effectively reduces the number of parameters from K² to 2K, and also increases the depth of the network.
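A minimal sketch of this low-rank decomposition (channel and kernel sizes are illustrative):

```python
import torch.nn as nn

C, K = 64, 3  # channels and kernel size (illustrative)

# Standard KxK convolution: C*C*K*K weights.
full = nn.Conv2d(C, C, kernel_size=K, padding=K // 2)

# Low-rank decomposition into 1xK followed by Kx1: 2*C*C*K weights,
# i.e. K^2 -> 2K per channel pair, at the cost of one extra layer of depth.
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, K), padding=(0, K // 2)),
    nn.Conv2d(C, C, kernel_size=(K, 1), padding=(K // 2, 0)),
)

count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(count(full), count(separable))  # 36,928 vs 24,704 (including biases)
```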
Design – Bottleneck Module
• Use a variation of the bottleneck approach with a two-stage squeeze layer.
• Use two bottleneck modules, each reducing the channel size by a factor of 2, followed by two separable convolutions.
• Also incorporate a final 1 x 1 expansion module, which further reduces the number of output channels for the separable convolutions.
Design – Fully Connected Layers
• In the case of AlexNet, the majority of the network parameters are in
Fully Connected layers, accounting for 96% of the total model size.
• SqueezeNext incorporates a final bottleneck layer to reduce the
input channel size to the last fully connected layer, which
considerably reduces the total number of model parameters.
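A back-of-the-envelope check of that 96% figure using AlexNet's standard layer dimensions (weights only, biases ignored):

```python
# AlexNet weight counts: conv1..conv5 followed by fc6..fc8 (single-tower dimensions).
conv = 3*96*11*11 + 96*256*5*5 + 256*384*3*3 + 384*384*3*3 + 384*256*3*3
fc   = 256*6*6*4096 + 4096*4096 + 4096*1000
print(fc / (conv + fc))  # ~0.94 here, ~0.96 with the original grouped convolutions
```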
Comparison of Building Blocks
SqueezeNext Block
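As a rough sketch of the SqueezeNext block described above (the 2x reduction factors and BatchNorm placement are my assumptions; stride handling and the projection shortcut used when dimensions change are omitted):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, padding=0):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, padding=padding),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class SqueezeNextBlock(nn.Module):
    """Two-stage 1x1 squeeze -> 1x3 -> 3x1 -> 1x1 expand, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_relu(channels, channels // 2, 1),            # squeeze stage 1
            conv_bn_relu(channels // 2, channels // 4, 1),       # squeeze stage 2
            conv_bn_relu(channels // 4, channels // 2, (1, 3), padding=(0, 1)),
            conv_bn_relu(channels // 2, channels // 2, (3, 1), padding=(1, 0)),
            conv_bn_relu(channels // 2, channels, 1),            # 1x1 expansion
        )

    def forward(self, x):
        return self.body(x) + x  # element-wise addition skip connection

out = SqueezeNextBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```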
Block Arrangement in 1.0-SqNxt-23
Breakdown of the 1.0-SqNxt-23 architecture
[Table: block counts per section – 6, 6, 8, 1; ¹ for skip connection]
Hardware Platform
• Weight Stationary & Output Stationary
• The x and y loops form the innermost loop in the WS data flow, whereas the c, i, and j loops form the innermost loop in the OS data flow.
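A hedged pseudocode sketch of the two loop orders, reading x, y as output-map coordinates, c as the input channel, and i, j as the filter window (the PE array and memory hierarchy are abstracted away):

```python
# Weight-stationary (WS): each weight W[k][c][i][j] is held in a PE while the
# x, y output loops run innermost, so the resident weight is reused across the map.
def conv_ws(X, W, Y):
    for k in range(len(W)):
        for c in range(len(W[0])):
            for i in range(len(W[0][0])):
                for j in range(len(W[0][0][0])):
                    w = W[k][c][i][j]                  # stays resident in the PE
                    for y in range(len(Y[0])):         # innermost: output rows
                        for x in range(len(Y[0][0])):  # innermost: output cols
                            Y[k][y][x] += w * X[c][y + i][x + j]

# Output-stationary (OS): each partial sum Y[k][y][x] is held in a PE while the
# c, i, j loops run innermost, accumulating into the resident output element.
def conv_os(X, W, Y):
    for k in range(len(W)):
        for y in range(len(Y[0])):
            for x in range(len(Y[0][0])):
                acc = 0                                # stays resident in the PE
                for c in range(len(W[0])):             # innermost: input channels,
                    for i in range(len(W[0][0])):      # filter rows,
                        for j in range(len(W[0][0][0])):  # and filter cols
                            acc += W[k][c][i][j] * X[c][y + i][x + j]
                Y[k][y][x] += acc
```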
Hardware Simulation Setup
• 16x16 or 8x8 array of PEs.
• A 128KB or 32KB global buffer and a
DMA controller to transfer data between
DRAM and the buffer.
• A PE has a 16-bit integer multiply-and-accumulate (MAC) unit and a local register file.
• The performance estimator computes
the number of clock cycles required to
process each layer and sums all the
results.
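This is not the authors' simulator, but a toy estimator in the same spirit: it divides each layer's MAC count by the number of PEs (assuming one MAC per PE per cycle and perfect utilization) and sums the results; the real estimator also accounts for data movement between DRAM, the global buffer, and the PEs.

```python
import math

def conv_macs(in_ch, out_ch, k_h, k_w, out_h, out_w):
    # Multiply-accumulate operations for one convolution layer.
    return in_ch * out_ch * k_h * k_w * out_h * out_w

def estimate_cycles(layers, num_pes=16 * 16):
    # Idealized: one MAC per PE per cycle, memory traffic ignored.
    return sum(math.ceil(conv_macs(*layer) / num_pes) for layer in layers)

# Hypothetical two-layer network: (in_ch, out_ch, kH, kW, outH, outW)
layers = [(3, 64, 5, 5, 112, 112), (64, 32, 1, 1, 112, 112)]
print(estimate_cycles(layers, 16 * 16))  # 16x16 PE array
print(estimate_cycles(layers, 8 * 8))    # 8x8 PE array
```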
Classification Performance Results
• The 23-module architecture exceeds AlexNet’s performance by a 2% margin with an 84x smaller number of parameters.
• The version with twice the width and 44 modules (2.0-SqNxt-44) is able to match VGG-19’s performance with a 31x smaller number of parameters.
Hardware Performance Results
SqueezeNext v2~v5
• In the 1.0-SqNxt-23, the first 7 x 7 convolutional layer accounts for 26% of the total inference time.
• Therefore, the first optimization the authors make is replacing this 7 x 7 layer with a 5 x 5 convolution, producing the 1.0-SqNxt-23-v2 model.
• Note the significant drop in efficiency for the layers in the first module. The reason for this drop is that the initial layers have a very small number of channels that must be applied to a large input activation map.
• In the v3/v4 variants, the authors reduce the number of blocks in the first module by 2/4, respectively, and instead add them to the second module. In the v5 variant, the authors reduce the blocks of the first two modules and instead increase the blocks in the third module.
Results
1.0-SqNxt-23v5
Results
Further Discussion
• What are we trying to get by reducing the number of computations
and the number of parameters?
• In many cases it will be speed or low energy.
• Then, can a small number of computations and fewer parameters guarantee speed or lower energy?
Speed and the Number of Computations
From ShuffleNetV2
Energy/Power Efficiency and the Number of
Parameters
Slide Credit: “How to Estimate the Energy Consumption of DNNs” by Tien-Ju Yang (MIT)
Slide Credit: Movidius @ Hot Chips 2016
Key Insights of Energy Consumption
Slide Credit: “How to Estimate the Energy Consumption of DNNs” by Tien-Ju Yang (MIT)
Thank you
