2. NTHU-CS VLSI/CAD LAB
A Comprehensive Survey on Hardware-
Aware Neural Architecture Search
Submitted to Proceedings of IEEE
References are listed in the footnote of each slide
2
3. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
3
4. NTHU-CS VLSI/CAD LAB
Long ASIC/NN design iteration
The ASIC-NN design iteration is manual and not a turn-key solution
4
5. NTHU-CS VLSI/CAD LAB
Long ASIC/NN design iteration
The ASIC-NN design iteration is manual and not a turn-key solution,
even when using Neural Architecture Search
5
Ref: Lin et al., “MCUNet: Tiny Deep Learning on IoT Devices”
6. NTHU-CS VLSI/CAD LAB
Hardware-aware auto NN architecture
design
Goal: find the best-accuracy NN architecture under HW constraints
6
7. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
7
9. NTHU-CS VLSI/CAD LAB
Single Target
Search architectures for a single specific HW
Single Config: best accuracy w/ HW constraints
E.g., the single-configuration goal is one model with the best accuracy under HW constraints
Multiple Config: best accuracy, best latency
E.g., the multiple-configuration goal is several models: the best-accuracy model and the best-latency model
Multiple Targets
Search architectures considering multiple HWs simultaneously
9
10. NTHU-CS VLSI/CAD LAB
Architecture search space
The architecture search space is the set of candidate architectures that contains the solutions
Hyperparameter search
#channel, stride, kernel size
10
Ref: Ma et al., “Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA”
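As a rough illustration of how quickly a layer-wise hyperparameter space grows, here is a minimal Python sketch; the channel/kernel/stride values are assumptions for illustration, not taken from the referenced paper.

```python
from itertools import product

# Hypothetical per-layer hyperparameter choices (illustrative values only).
CHANNELS = [16, 32, 64]
KERNEL_SIZES = [3, 5, 7]
STRIDES = [1, 2]

def enumerate_layer_choices():
    """Enumerate every (channels, kernel, stride) configuration for one layer."""
    return list(product(CHANNELS, KERNEL_SIZES, STRIDES))

choices = enumerate_layer_choices()
print(len(choices))        # 3 * 3 * 2 = 18 configurations per layer
print(len(choices) ** 10)  # a 10-layer network already has 18^10 candidates
```

Even this tiny space is far too large to enumerate for deep networks, which is why search strategies are needed.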
11. NTHU-CS VLSI/CAD LAB
Architecture search space
The architecture search space is the set of candidate architectures that contains the solutions
Whole architecture
Layer-wise, Cell-based, Hierarchical
11
12. NTHU-CS VLSI/CAD LAB
Hardware search space
Parameter-based, Template-based
The search space is formalized as a set of different parameter configurations that fit
the HW design
12
Ref: Jiang et al., “Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search”
13. NTHU-CS VLSI/CAD LAB
Hardware search space
Parameter-based, Template-based
The search space is defined as a set of pre-configured HW designs
Categories: server, mobile, tiny
13
Ref: Jiang et al., “Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks”
14. NTHU-CS VLSI/CAD LAB
Goal: sample architecture candidates from a
search space
Evolutionary, Reinforcement, Gradient-based,
Bayesian Optimization, Random Search
Hybrid
Hybridize different search strategies to speed up the search or improve the exploration/exploitation trade-off
Evolutionary w/ RL or Bayesian or Gradient-
based
14
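A minimal sketch of the simplest strategy above, random search under a HW constraint; the search space, latency model, and accuracy proxy below are all toy assumptions, not real measurements:

```python
import random

random.seed(0)

# Hypothetical search space; a real HW-NAS flow would train/evaluate each
# sampled architecture and measure its HW cost on the target device.
SPACE = {"channels": [16, 32, 64], "kernel": [3, 5], "depth": [8, 12, 16]}
LATENCY_BUDGET_MS = 50.0

def sample_arch():
    return {k: random.choice(v) for k, v in SPACE.items()}

def latency_ms(arch):  # toy analytical proxy, not a real cost model
    return 0.02 * arch["channels"] * arch["kernel"] * arch["depth"]

def accuracy(arch):    # toy proxy: bigger models score higher
    return 0.5 + 0.001 * arch["channels"] + 0.01 * arch["depth"]

def random_search(n_trials=200):
    """Keep the best-accuracy architecture that fits the latency budget."""
    best, best_acc = None, -1.0
    for _ in range(n_trials):
        arch = sample_arch()
        if latency_ms(arch) > LATENCY_BUDGET_MS:
            continue                      # reject constraint violators
        acc = accuracy(arch)
        if acc > best_acc:
            best, best_acc = arch, acc
    return best, best_acc

best, best_acc = random_search()
print(best, best_acc)
```

Evolutionary, RL, gradient-based, and Bayesian strategies replace the uniform sampling here with guided sampling, which is what makes them more sample-efficient.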
15. NTHU-CS VLSI/CAD LAB
Single
Two-stage
Find the best accuracy model, then optimize HW cost
Constrained Optimization
Find the best accuracy model under a specific constraint
Multiple
Scalarization
Optimize the model with a weighted sum of the accuracy & HW metrics
NSGA-II
Find the set of non-dominated (Pareto-optimal) solutions: those for which no other
solution is at least as good in all objectives and strictly better in one
15
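The two multi-objective formulations above can be sketched in a few lines; the weights and the (accuracy, latency) points are illustrative assumptions:

```python
# Two ways to score candidates in multi-objective HW-NAS.

def scalarized_score(acc, latency_ms, alpha=1.0, beta=0.01):
    """Weighted sum of accuracy and HW cost; alpha/beta are assumed weights."""
    return alpha * acc - beta * latency_ms

def dominates(a, b):
    """Pareto dominance for points (accuracy, latency_ms):
    a is no worse in both objectives and strictly better in at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    """Keep points dominated by no other (what NSGA-II's first front holds)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(0.72, 40.0), (0.70, 20.0), (0.68, 45.0), (0.75, 80.0)]
front = pareto_front(pts)
print(front)  # (0.68, 45.0) is dominated by (0.72, 40.0) and drops out
```

Scalarization returns one solution per weight setting, while the Pareto front keeps the whole accuracy/latency trade-off curve.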
16. NTHU-CS VLSI/CAD LAB
Gumbel Softmax
Use a temperature-controlled softmax, which is differentiable, to approximate the one-hot
argmax
Estimated Continuous Function
REINFORCE
16
Ref: Jang et al., “Categorical Reparameterization with Gumbel-Softmax”
17. NTHU-CS VLSI/CAD LAB
Gumbel Softmax
Estimated Continuous Function
Use a continuous variable as the activation probability of non-continuous choices,
as used in gated operations
REINFORCE
Use reinforcement learning to learn the policy, and sample the choice from
the policy
17
(Figure: sampled operation probabilities from the policy, e.g., 0.7 / 0.15 / 0.05 / 0.1)
Ref: Cai et al., “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware”
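A minimal NumPy sketch of the Gumbel-Softmax trick described above (seeded for reproducibility; the logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature):
    """Differentiable approximation of sampling a one-hot argmax:
    add Gumbel(0, 1) noise to the logits, then apply a
    temperature-scaled softmax."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

logits = np.array([1.0, 2.0, 0.5, 0.1])
soft = gumbel_softmax(logits, temperature=1.0)
hard = gumbel_softmax(logits, temperature=0.01)  # near one-hot as tau -> 0
print(soft.round(3), hard.round(3))
```

At high temperature the output stays smooth (and thus differentiable for training); as the temperature anneals toward zero it approaches a discrete one-hot choice.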
18. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
The sampled model is executed on the hardware target while searching
18
Ref: Lin et al., “MCUNet: Tiny Deep Learning on IoT Devices”
19. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
A lookup table is created beforehand and filled with each operator's latency on the target hardware.
Once the search starts, the overall cost is computed by summing entries from the lookup table
19
Ref: Wu et al., “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search”
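A minimal sketch of the LUT approach, assuming a hypothetical per-operator latency table (the operator keys and numbers are made up for illustration, not taken from the referenced paper):

```python
# Hypothetical per-operator latency LUT (ms), pre-measured on the target HW.
LATENCY_LUT_MS = {
    ("conv3x3", 32): 1.8,
    ("conv5x5", 32): 3.1,
    ("conv3x3", 64): 3.5,
    ("skip", 0): 0.0,
}

def model_latency_ms(ops):
    """Sum the pre-profiled latency of each sampled operator."""
    return sum(LATENCY_LUT_MS[op] for op in ops)

sampled = [("conv3x3", 32), ("conv5x5", 32), ("skip", 0), ("conv3x3", 64)]
print(model_latency_ms(sampled))  # 1.8 + 3.1 + 0.0 + 3.5 = 8.4 ms
```

The one-time profiling cost is amortized over the whole search, so each candidate's latency estimate becomes a cheap table sum instead of an on-device measurement.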
20. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
Compute a rough estimate using the processing time, the stall time, and the
starting time
20
Ref: Marchisio et al., “NASCaps: A Framework for Neural Architecture Search to Optimize the Accuracy and Hardware Efficiency of Convolutional Capsule Networks”
(Excerpt from the referenced paper:)
The memory footprint is computed as the sum of the number of weights of each layer. The cost models are built for each operation in a modular (i.e., bottom-up) way. First, the weights must be loaded onto the PE array, then reused as long as they need to be multiplied by other inputs. Afterward, the next group of weights is loaded, until all the computations of the layer are done (see Eqs. 2-4). The model has been validated by comparing the results with the hardware implementation of the CapsAcc [15] accelerator. The adopted model parameters are the following:
• w_load_cycles: number of clock cycles required to load the weights onto the PE array
• w_loads: number of groups of weights loaded onto the PE array
• cycles(l): number of cycles required to execute the layer l
• ma: number of memory accesses
• en_mem: energy consumption of a single memory access
• pwr_PEA: power consumption of the PE array

w_load_cycles = 16                                       (2)
w_loads = ceil(weights / (16 · min(16, sums_per_out)))   (3)
cycles(l) = w_load_cycles · w_loads + data_per_weight    (4)

The overall latency is then computed as the sum of the contributions of the layers (Eq. 5):

latency = Σ_{l ∈ L} cycles(l) · T                        (5)

In Eq. 6, the number of memory accesses is computed by distinguishing whether the operation is a convolutional layer or not. This distinction is implemented by analyzing the value of data_per_weight, which is greater than 1 for convolutional layers.
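The per-layer cycle count and overall latency of Eqs. 2-5 can be sketched as follows; the layer parameter values and clock period are illustrative assumptions:

```python
from math import ceil

W_LOAD_CYCLES = 16  # Eq. 2: cycles to load one weight group onto the PE array

def layer_cycles(weights, sums_per_out, data_per_weight):
    """Eqs. 3-4: cycle count for one layer of the analytical model."""
    w_loads = ceil(weights / (16 * min(16, sums_per_out)))
    return W_LOAD_CYCLES * w_loads + data_per_weight

def total_latency(layers, clock_period):
    """Eq. 5: latency = sum over layers of cycles(l) * T."""
    return sum(layer_cycles(*l) for l in layers) * clock_period

# Illustrative layers: (weights, sums_per_out, data_per_weight).
layers = [(4608, 16, 196), (18432, 16, 49)]
print(total_latency(layers, clock_period=5e-9))  # seconds at a 200 MHz clock
```

Because the model is purely arithmetic, each candidate's latency can be evaluated in microseconds of host time, at the cost of the accuracy lost versus real measurement.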
21. NTHU-CS VLSI/CAD LAB
Method
Real-time measure, LUT, Analytical estimation,
Prediction model
Build a ML model to predict the cost using architecture and dataset features.
21
Ref: Cai et al., “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware”
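A minimal sketch of a learned cost predictor: fit a linear model on (architecture features → measured latency) pairs. The features and measurements below are made-up illustrative data; real predictors are often MLPs, GBDTs, or graph neural networks rather than linear models.

```python
import numpy as np

# Toy training data: architecture features (depth, width, #params in M)
# paired with "measured" latency (ms). Illustrative values only.
X = np.array([
    [8, 16, 1.2],
    [12, 32, 3.4],
    [16, 32, 4.6],
    [16, 64, 9.1],
    [20, 64, 11.8],
], dtype=float)
y = np.array([12.0, 31.0, 40.0, 78.0, 95.0])

# Fit a linear predictor (with bias term) by least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_latency(features):
    """Predict latency (ms) for a new architecture's feature vector."""
    return float(np.append(features, 1.0) @ coef)

print(predict_latency([12, 32, 3.4]))  # close to the measured 31.0 ms
```

Once trained on a few hundred profiled architectures, such a predictor replaces both on-device measurement and table lookups during the search.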
23. NTHU-CS VLSI/CAD LAB
Speed up
Early Stop, Hot Start (warm up), Proxy Datasets,
Accuracy Prediction
Quantization & Pruning
Auto mix precision, Auto pruning
Security & Reliability
Adversarial attack
23
24. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions
24
25. NTHU-CS VLSI/CAD LAB
Lack of reproducibility
Due to differing search spaces, varied training methods, and the significant
computational resources required, reproducing HW-NAS results is difficult.
For NAS
NAS-Bench-101, NAS-Bench-201, NATS-Bench,
NAS-Bench-1shot1, NAS-Bench-NLP, NAS-
Bench-301
For HWNAS
HW-NAS-Bench
25
26. NTHU-CS VLSI/CAD LAB
HW-aware NAS
Issues & Related Works in HW-NAS
Goal
NAS
HW Cost
Benchmarks
Discussions - HWNAS Applications
26
27. NTHU-CS VLSI/CAD LAB
An auto model refinement tool for model-
accelerator integration
Given a pretrained model & a target HW, find a refined model that satisfies the
constraints on the target HW
An HW-SW co-op model deployment tool
for FPGA
Given a pretrained model & a target FPGA, find a refined model & a target
accelerator HDL that satisfy the constraints on the target FPGA
Similar to MCUNet, but with a wider integration
scope
27