2. NTHU-CS VLSI/CAD LAB
A Comprehensive Survey on Hardware-
Aware Neural Architecture Search
Submitted to Proceedings of IEEE
References are listed in the footnote of each slide
2
3. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
3
4. NTHU-CS VLSI/CAD LAB
Long ASIC/NN design iteration
The ASIC-NN design iteration is manual and not a turn-key solution
4
5. NTHU-CS VLSI/CAD LAB
Long ASIC/NN design iteration
The ASIC-NN design iteration is manual and not a turn-key solution,
even when using Neural Architecture Search
5
Ref: Lin et al., “MCUNet: Tiny Deep Learning on IoT Devices”
6. NTHU-CS VLSI/CAD LAB
Hardware-aware auto NN architecture
design
Goal: find the best-accuracy NN architecture under HW constraints
6
7. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
7
9. NTHU-CS VLSI/CAD LAB
Single Target
Search architectures for a single specific HW
Single Config: best accuracy w/ HW constraints
E.g., the single-configuration goal is one model with the best accuracy under HW constraints
Multiple Config: best accuracy, best latency
E.g., the multiple-configuration goal is several models: the best-accuracy model and the best-latency model
Multiple Targets
Search architectures considering multiple HWs simultaneously
9
10. NTHU-CS VLSI/CAD LAB
Architecture search space
The architecture search space is the set of candidate architectures that contains the solutions
Hyperparameter search
#channel, stride, kernel size
10
Ref: Ma et al., “Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA”
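As a rough illustration of how quickly a layer-wise hyperparameter space grows, here is a minimal Python sketch; the channel/kernel/stride values are assumptions for illustration, not taken from the referenced paper.

```python
from itertools import product

# Hypothetical per-layer hyperparameter choices (illustrative values only).
CHANNELS = [16, 32, 64]
KERNEL_SIZES = [3, 5, 7]
STRIDES = [1, 2]

def enumerate_layer_choices():
    """Enumerate every (channels, kernel, stride) configuration for one layer."""
    return list(product(CHANNELS, KERNEL_SIZES, STRIDES))

choices = enumerate_layer_choices()
print(len(choices))        # 3 * 3 * 2 = 18 configurations per layer
print(len(choices) ** 10)  # a 10-layer network already has 18^10 candidates
```

Even this tiny space is far too large to enumerate for deep networks, which is why search strategies are needed.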
11. NTHU-CS VLSI/CAD LAB
Architecture search space
The architecture search space is the set of candidate architectures that contains the solutions
Whole architecture
Layer-wise, Cell-based, Hierarchical
11
12. NTHU-CS VLSI/CAD LAB
Hardware search space
Parameter-based, Template-based
The search space is formalized as a set of different parameter configurations that fit
the HW design
12
Ref: Jiang et al., “Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search”
13. NTHU-CS VLSI/CAD LAB
Hardware search space
Parameter-based, Template-based
The search space is defined as a set of pre-configured HW designs
Categories: server, mobile, tiny
13
Ref: Jiang et al., “Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks”
14. NTHU-CS VLSI/CAD LAB
Goal: sample architecture candidates from a
search space
Evolutionary, Reinforcement, Gradient-based,
Bayesian Optimization, Random Search
Hybrid
Hybridize different search strategies to speed up the search or improve the exploration/exploitation trade-off
Evolutionary w/ RL or Bayesian or Gradient-
based
14
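A minimal sketch of the simplest strategy above, random search under a HW constraint; the search space, latency model, and accuracy proxy below are all toy assumptions, not real measurements:

```python
import random

random.seed(0)

# Hypothetical search space; a real HW-NAS flow would train/evaluate each
# sampled architecture and measure its HW cost on the target device.
SPACE = {"channels": [16, 32, 64], "kernel": [3, 5], "depth": [8, 12, 16]}
LATENCY_BUDGET_MS = 50.0

def sample_arch():
    return {k: random.choice(v) for k, v in SPACE.items()}

def latency_ms(arch):  # toy analytical proxy, not a real cost model
    return 0.02 * arch["channels"] * arch["kernel"] * arch["depth"]

def accuracy(arch):    # toy proxy: bigger models score higher
    return 0.5 + 0.001 * arch["channels"] + 0.01 * arch["depth"]

def random_search(n_trials=200):
    """Keep the best-accuracy architecture that fits the latency budget."""
    best, best_acc = None, -1.0
    for _ in range(n_trials):
        arch = sample_arch()
        if latency_ms(arch) > LATENCY_BUDGET_MS:
            continue                      # reject constraint violators
        acc = accuracy(arch)
        if acc > best_acc:
            best, best_acc = arch, acc
    return best, best_acc

best, best_acc = random_search()
print(best, best_acc)
```

Evolutionary, RL, gradient-based, and Bayesian strategies replace the uniform sampling here with guided sampling, which is what makes them more sample-efficient.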
15. NTHU-CS VLSI/CAD LAB
Single
Two-stage
Find the best accuracy model, then optimize HW cost
Constrained Optimization
Find the best accuracy model under a specific constraint
Multiple
Scalarization
Optimize the model with a weighted sum of the accuracy & HW metrics
NSGA-II
Find the set of non-dominated (Pareto-optimal) solutions: those for which no other
solution is at least as good in all objectives and strictly better in one
15
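The two multi-objective formulations above can be sketched in a few lines; the weights and the (accuracy, latency) points are illustrative assumptions:

```python
# Two ways to score candidates in multi-objective HW-NAS.

def scalarized_score(acc, latency_ms, alpha=1.0, beta=0.01):
    """Weighted sum of accuracy and HW cost; alpha/beta are assumed weights."""
    return alpha * acc - beta * latency_ms

def dominates(a, b):
    """Pareto dominance for points (accuracy, latency_ms):
    a is no worse in both objectives and strictly better in at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    """Keep points dominated by no other (what NSGA-II's first front holds)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(0.72, 40.0), (0.70, 20.0), (0.68, 45.0), (0.75, 80.0)]
front = pareto_front(pts)
print(front)  # (0.68, 45.0) is dominated by (0.72, 40.0) and drops out
```

Scalarization returns one solution per weight setting, while the Pareto front keeps the whole accuracy/latency trade-off curve.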
16. NTHU-CS VLSI/CAD LAB
Gumbel Softmax
Use a temperature-controlled softmax, which is differentiable, to approximate the one-hot
argmax
Estimated Continuous Function
REINFORCE
16
Ref: Jang et al., “Categorical Reparameterization with Gumbel-Softmax”
17. NTHU-CS VLSI/CAD LAB
Gumbel Softmax
Estimated Continuous Function
Use a continuous variable as the activation probability of non-continuous choices,
as used in gated operations
REINFORCE
Use reinforcement learning to learn the policy, and sample the choice from
the policy
17
(Figure: sampled operation probabilities from the policy, e.g., 0.7 / 0.15 / 0.05 / 0.1)
Ref: Cai et al., “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware”
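A minimal NumPy sketch of the Gumbel-Softmax trick described above (seeded for reproducibility; the logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature):
    """Differentiable approximation of sampling a one-hot argmax:
    add Gumbel(0, 1) noise to the logits, then apply a
    temperature-scaled softmax."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

logits = np.array([1.0, 2.0, 0.5, 0.1])
soft = gumbel_softmax(logits, temperature=1.0)
hard = gumbel_softmax(logits, temperature=0.01)  # near one-hot as tau -> 0
print(soft.round(3), hard.round(3))
```

At high temperature the output stays smooth (and thus differentiable for training); as the temperature anneals toward zero it approaches a discrete one-hot choice.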
18. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
The sampled model is executed on the hardware target while searching
18
Ref: Lin et al., “MCUNet: Tiny Deep Learning on IoT Devices”
19. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
A lookup table is created beforehand and filled with each operator's latency on the target hardware.
Once the search starts, the overall cost is computed by summing entries from the lookup table
19
Ref: Wu et al., “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search”
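A minimal sketch of the LUT approach, assuming a hypothetical per-operator latency table (the operator keys and numbers are made up for illustration, not taken from the referenced paper):

```python
# Hypothetical per-operator latency LUT (ms), pre-measured on the target HW.
LATENCY_LUT_MS = {
    ("conv3x3", 32): 1.8,
    ("conv5x5", 32): 3.1,
    ("conv3x3", 64): 3.5,
    ("skip", 0): 0.0,
}

def model_latency_ms(ops):
    """Sum the pre-profiled latency of each sampled operator."""
    return sum(LATENCY_LUT_MS[op] for op in ops)

sampled = [("conv3x3", 32), ("conv5x5", 32), ("skip", 0), ("conv3x3", 64)]
print(model_latency_ms(sampled))  # 1.8 + 3.1 + 0.0 + 3.5 = 8.4 ms
```

The one-time profiling cost is amortized over the whole search, so each candidate's latency estimate becomes a cheap table sum instead of an on-device measurement.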
20. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
Compute a rough estimate using the processing time, the stall time, and the
starting time
20
Ref: Marchisio et al., “NASCaps: A Framework for Neural Architecture Search to Optimize the Accuracy and Hardware Efficiency of Convolutional Capsule Networks”
(Excerpt from the referenced paper:)
The memory footprint is computed as the sum of the number of weights of each layer. The cost models are built for each operation in a modular (i.e., bottom-up) way. First, the weights must be loaded onto the PE array, then reused as long as they need to be multiplied by other inputs. Afterward, the next group of weights is loaded, until all the computations of the layer are done (see Eqs. 2-4). The model has been validated by comparing the results with the hardware implementation of the CapsAcc [15] accelerator. The adopted model parameters are the following:
• w_load_cycles: number of clock cycles required to load the weights onto the PE array
• w_loads: number of groups of weights loaded onto the PE array
• cycles(l): number of cycles required to execute the layer l
• ma: number of memory accesses
• en_mem: energy consumption of a single memory access
• pwr_PEA: power consumption of the PE array

w_load_cycles = 16                                       (2)
w_loads = ceil(weights / (16 · min(16, sums_per_out)))   (3)
cycles(l) = w_load_cycles · w_loads + data_per_weight    (4)

The overall latency is then computed as the sum of the contributions of the layers (Eq. 5):

latency = Σ_{l ∈ L} cycles(l) · T                        (5)

In Eq. 6, the number of memory accesses is computed by distinguishing whether the operation is a convolutional layer or not. This distinction is implemented by analyzing the value of data_per_weight, which is greater than 1 for convolutional layers.
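The per-layer cycle count and overall latency of Eqs. 2-5 can be sketched as follows; the layer parameter values and clock period are illustrative assumptions:

```python
from math import ceil

W_LOAD_CYCLES = 16  # Eq. 2: cycles to load one weight group onto the PE array

def layer_cycles(weights, sums_per_out, data_per_weight):
    """Eqs. 3-4: cycle count for one layer of the analytical model."""
    w_loads = ceil(weights / (16 * min(16, sums_per_out)))
    return W_LOAD_CYCLES * w_loads + data_per_weight

def total_latency(layers, clock_period):
    """Eq. 5: latency = sum over layers of cycles(l) * T."""
    return sum(layer_cycles(*l) for l in layers) * clock_period

# Illustrative layers: (weights, sums_per_out, data_per_weight).
layers = [(4608, 16, 196), (18432, 16, 49)]
print(total_latency(layers, clock_period=5e-9))  # seconds at a 200 MHz clock
```

Because the model is purely arithmetic, each candidate's latency can be evaluated in microseconds of host time, at the cost of the accuracy lost versus real measurement.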
21. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
Build a ML model to predict the cost using architecture and dataset features.
21
Ref: Cai et al., “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware”
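A minimal sketch of a learned cost predictor: fit a linear model on (architecture features → measured latency) pairs. The features and measurements below are made-up illustrative data; real predictors are often MLPs, GBDTs, or graph neural networks rather than linear models.

```python
import numpy as np

# Toy training data: architecture features (depth, width, #params in M)
# paired with "measured" latency (ms). Illustrative values only.
X = np.array([
    [8, 16, 1.2],
    [12, 32, 3.4],
    [16, 32, 4.6],
    [16, 64, 9.1],
    [20, 64, 11.8],
], dtype=float)
y = np.array([12.0, 31.0, 40.0, 78.0, 95.0])

# Fit a linear predictor (with bias term) by least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_latency(features):
    """Predict latency (ms) for a new architecture's feature vector."""
    return float(np.append(features, 1.0) @ coef)

print(predict_latency([12, 32, 3.4]))  # close to the measured 31.0 ms
```

Once trained on a few hundred profiled architectures, such a predictor replaces both on-device measurement and table lookups during the search.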
23. NTHU-CS VLSI/CAD LAB
Speed up
Early Stop, Hot Start (warm up), Proxy Datasets,
Accuracy Prediction
Quantization & Pruning
Auto mix precision, Auto pruning
Security & Reliability
Adversarial attack
23
24. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
24
25. NTHU-CS VLSI/CAD LAB
Lack of reproducibility
Due to differing search spaces, varied training methods, and the significant
computational resources required, reproducing HW-NAS results is difficult.
For NAS
NAS-Bench-101, NAS-Bench-201, NATS-Bench,
NAS-Bench-1shot1, NAS-Bench-NLP, NAS-
Bench-301
For HWNAS
HW-NAS-Bench
25
26. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions - HWNAS Applications
26
27. NTHU-CS VLSI/CAD LAB
An auto model refinement tool for model-
accelerator integration
Given a pretrained model & a target HW, find a refined model that satisfies the
constraints on the target HW
An HW-SW co-op model deployment tool
for FPGA
Given a pretrained model & a target FPGA, find a refined model & a target
accelerator HDL that satisfy the constraints on the target FPGA
Similar to MCUNet, but with a wider integration
scope
27