"Enabling Automated Design of Computationally Efficient Deep Neural Networks," a Presentation from UC Berkeley

© 2019 Bichen Wu
Enabling Automated Design of Computationally Efficient Deep Neural Networks
Bichen Wu
UC Berkeley
May 2019
bichen@berkeley.edu

Neural networks for embedded vision

Need for Embedded Vision
• Privacy concern
• Latency constraint
• Availability, reliability and cost of data transmission
Applications: augmented reality, biometric identification, autonomous driving, Internet-of-things

Computation Complexity of Neural Networks
VGG16 [1] model:
- Parameter size: 552 MB
- Memory: 93 MB/image
- Computation: 15.8 GOPs/image
Hardware budgets:
- DGX-1: 170 TOPS, 3.2 kW, 128 GB memory
- TitanX: 11 TOPS, 223 W, 12 GB memory
- Smartphones: 800 MOPs, 3 W, 2-4 GB memory
- Embedded devices: 100s of MHz, <5 W, <1 GB memory
[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Goal: Accurate AND Efficient Neural Networks
• Embedded computer vision requires accurate AND efficient neural networks
• Accuracy: essential for many applications, including security cameras and autonomous driving
• Efficiency: real-time inference speed on embedded processors with limited compute and power budgets

Designing accurate and efficient neural networks is challenging.

Intractable Design Space
• The design space of deep neural nets is huge!
• VGG16 [1] has 16 layers
• Design choices for each layer:
– kernel size = {1, 3, 5}
– channel size = {32, 64, 128, 256, 512}
• Search space = (3 x 5)^16 ≈ 7e18
[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
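
The search-space estimate on this slide can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative and do not come from the talk.

```python
# Per-layer design choices as listed on the slide.
kernel_sizes = [1, 3, 5]
channel_sizes = [32, 64, 128, 256, 512]
num_layers = 16  # VGG16

# Each layer independently picks one kernel size and one channel size.
choices_per_layer = len(kernel_sizes) * len(channel_sizes)  # 3 x 5 = 15
search_space_size = choices_per_layer ** num_layers

print(f"{search_space_size:.2e}")  # 6.57e+18, i.e. roughly 7e18
```

Even this restricted space is far too large to enumerate, which motivates a search method that does not train every candidate.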

Conditional Optimality
Target devices: GPUs, iPhones, Android phones, low-end phones, wearables, IoT
• Ideally, we should design different neural networks for different devices/tasks/computation budgets
• In reality, due to the cost of designing and training neural networks, we can only afford to design one and deploy it to all conditions

Inconsistent Efficiency Metrics
• Previous works focus on reducing parameter size or MACs (the number of multiply-accumulate operations)
• However, a lower MAC count does not necessarily mean lower latency
– Dilated convolution is slower due to its more complicated memory access pattern
– NASNet-A has slightly fewer MACs than MobileNetV1, but its latency is 1.6x higher
[Figures: Dilated Convolution [1], NASNet [2]]
[1] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).
[2] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv preprint arXiv:1707.07012 (2017).

Rethinking the flow for neural network design.

Using Off-the-shelf Models
• Dealing with hardware constraints:
– Model is too big/small
– Can’t support 1x1, 3x3, or 5x5 convolutions
– Too slow with XXX operators
– Can’t support residual connections
– ReLU must follow convolutions
– Fixed input size
Manual Design
• Manual design:
– Can only afford a few iterations

(Previous) Neural Architecture Search
• Search-based neural architecture search
– Computationally expensive: [1] takes 450 GPUs for 4-5 days
[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).

DNAS: Differentiable Neural Architecture Search
Desirable features
• A general framework to support arbitrary design spaces
• Optimize for actual efficiency metrics (such as latency)
• Reasonable search cost
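
To illustrate why a differentiable formulation lets a search optimize latency directly, the sketch below relaxes the choice among candidate operators into a softmax over architecture parameters and runs gradient descent on the expected latency. It is a generic illustration, not the exact procedure from the talk: the names (`theta`, `op_latency_ms`) and the latency values are hypothetical, and a real search would combine this term with a task loss such as cross-entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_latency(theta, op_latency_ms):
    """Expected latency when operator i is chosen with probability softmax(theta)[i]."""
    probs = softmax(theta)
    return sum(p * lat for p, lat in zip(probs, op_latency_ms))

def expected_latency_grad(theta, op_latency_ms):
    """Analytic gradient: d E[lat] / d theta_i = p_i * (lat_i - E[lat])."""
    probs = softmax(theta)
    mean = sum(p * lat for p, lat in zip(probs, op_latency_ms))
    return [p * (lat - mean) for p, lat in zip(probs, op_latency_ms)]

# Hypothetical measured latencies (ms) of three candidate operators on a target device.
op_latency_ms = [1.0, 2.5, 4.0]
theta = [0.0, 0.0, 0.0]  # architecture parameters, one per candidate

# Plain gradient descent on the latency term only.
for _ in range(200):
    grad = expected_latency_grad(theta, op_latency_ms)
    theta = [t - 0.5 * g for t, g in zip(theta, grad)]

best = max(range(len(theta)), key=lambda i: theta[i])
print(best, round(expected_latency(theta, op_latency_ms), 3))
```

With only the latency term, the search collapses onto the fastest operator; the accuracy term is what keeps slower but more expressive operators in play.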

Using DNAS to search for mixed precision quantization strategy

Mixed Precision Quantization
• Quantizing different layers of a ConvNet to different precisions
• Candidate operators are convolutions with quantized weights and activations

Weight quantization on ImageNet dataset
Model             ResNet18 reference   DNAS (ours)   TTQ [1]   ADMM [2]
Precision         full                 mixed         2-bit     3-bit
Accuracy          69.60%               69.58%        66.60%    68.0%
Compression rate  1.0x                 21.1x         16.0x     10.7x
• 21.1x smaller model size, 0.02% accuracy loss, 2.98% better than TTQ, 1.58% better than ADMM
• Block-wise precision:
Block ID    B1  B2  B3  B4  B5  B6  B7  B8  B9
Bit-width   2   3   0   2   4   2   3   2   1
(Bit-width 0: the entire block is skipped)
[1] Zhu, Chenzhuo, et al. "Trained ternary quantization." arXiv preprint arXiv:1612.01064 (2016).
[2] Leng, Cong, et al. "Extremely low bit neural network: Squeeze the last bit out with admm." arXiv preprint arXiv:1707.09870 (2017).
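
The compression rates for the uniform-precision baselines are consistent with a simple ratio of 32-bit full precision to the quantized bit-width. The sketch below reproduces them and shows how a mixed-precision rate would be computed. The per-block parameter counts are not given on the slide, so `hypothetical_params` is a placeholder and its result is not expected to match the reported 21.1x.

```python
FULL_PRECISION_BITS = 32

def uniform_compression_rate(bits):
    """Model-size compression when every weight uses the same bit-width."""
    return FULL_PRECISION_BITS / bits

def mixed_compression_rate(param_counts, bit_widths):
    """Model-size compression when each block has its own bit-width.

    A bit-width of 0 means the block is skipped and stores no weights.
    """
    full_bits = sum(n * FULL_PRECISION_BITS for n in param_counts)
    quantized_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return full_bits / quantized_bits

print(round(uniform_compression_rate(2), 1))  # 16.0 (TTQ column)
print(round(uniform_compression_rate(3), 1))  # 10.7 (ADMM column)

# Block-wise bit-widths from the slide (B1..B9).
block_bits = [2, 3, 0, 2, 4, 2, 3, 2, 1]
# ASSUMPTION: equal-sized blocks, purely for illustration. Real ResNet18 blocks
# differ in size, which is why this does not reproduce the reported rate.
hypothetical_params = [1] * len(block_bits)
print(round(mixed_compression_rate(hypothetical_params, block_bits), 1))  # 15.2
```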

Weight & activation quantization on ImageNet dataset
Model             ResNet18 reference   DNAS (ours)   PACT [1]   QIP [2]   GroupNet [3]
Precision         full                 mixed         w4a4       w4a4      w1a2g5
Accuracy          69.60%               68.65%        69.20%     69.30%    67.60%
Compression rate  1.0x                 103.5x        64x        64x       102.4x
• Compression rate computed as: (32 x 32) / (weight-bit x activation-bit)
• 103.5x reduction of computational cost, <1% accuracy drop
• Search finished in 24 hours on 8 GPUs
[1] Choi, Jungwook, et al. "PACT: Parameterized Clipping Activation for Quantized Neural Networks." arXiv preprint arXiv:1805.06085 (2018).
[2] Jung, Sangil, et al. "Joint training of low-precision neural network with quantization interval parameters." arXiv preprint arXiv:1808.05779 (2018).
[3] Zhuang, Bohan, Chunhua Shen, and Ian Reid. "Training Compact Neural Networks with Binary Weights and Low Precision Activations." arXiv preprint arXiv:1808.02631 (2018).
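
Reading the compression rate as the full-precision bit product (32 x 32) divided by the quantized bit product reproduces the baseline entries in the table. The sketch below is illustrative; the function name is hypothetical, and treating the "g5" in w1a2g5 as a factor of 5 groups is an assumption that happens to match the 102.4x entry.

```python
def compute_compression_rate(weight_bits, activation_bits, groups=1):
    """Computational-cost compression relative to 32-bit weights and activations.

    `groups` is an ASSUMED multiplier for group-based methods (e.g. "g5").
    """
    return (32 * 32) / (weight_bits * activation_bits * groups)

print(compute_compression_rate(4, 4))            # 64.0  (PACT, QIP: w4a4)
print(compute_compression_rate(1, 2, groups=5))  # 102.4 (GroupNet: w1a2g5)
```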

Using DNAS to search for efficient neural network architectures

Efficient Architecture Search
Candidate block structure (tensor shapes from input to output):
– Input: H x W x Cin
– 1x1 (group) Conv, ReLU: output H x W x (e x Cin)
– K x K DWConv, ReLU: output (H/s) x (W/s) x (e x Cin)
– 1x1 (group) Conv: output (H/s) x (W/s) x Cout
– + (residual addition)
Candidate modules with different hyper-parameters:
• Kernel size: 3, 5
• Expansion rate: 1, 3, 6
• Skip: no-operation
• Each “layer” of a network can have different modules
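
The tensor shapes on this slide determine the MAC count of each candidate block. The sketch below is a rough estimate under stated assumptions: group count 1 for the 1x1 convolutions, padding chosen so the spatial size is exactly H/s x W/s, and a hypothetical layer configuration that does not come from the talk.

```python
def block_shapes_and_macs(h, w, c_in, c_out, expansion, kernel, stride):
    """Return intermediate shapes and the total MACs of one candidate block.

    Block: 1x1 conv (expand) -> K x K depthwise conv (stride s) -> 1x1 conv (project).
    ASSUMES groups=1 and "same"-style padding; bias and ReLU costs are ignored.
    """
    c_mid = expansion * c_in
    h_out, w_out = h // stride, w // stride
    macs_expand = h * w * c_in * c_mid                      # 1x1 conv on H x W
    macs_depthwise = h_out * w_out * c_mid * kernel * kernel  # one filter per channel
    macs_project = h_out * w_out * c_mid * c_out            # 1x1 conv on the strided map
    shapes = [
        (h, w, c_in),
        (h, w, c_mid),
        (h_out, w_out, c_mid),
        (h_out, w_out, c_out),
    ]
    return shapes, macs_expand + macs_depthwise + macs_project

# Candidate hyper-parameters from the slide (the "skip" candidate costs 0 MACs).
candidates = [(k, e) for k in (3, 5) for e in (1, 3, 6)]

# Hypothetical layer: 56 x 56 x 24 input, 32 output channels, stride 2.
for kernel, expansion in candidates:
    _, macs = block_shapes_and_macs(56, 56, 24, 32, expansion, kernel, 2)
    print(f"kernel={kernel} expansion={expansion} MACs={macs / 1e6:.2f}M")
```

As the earlier slides point out, MACs are only a proxy; two candidates with similar MAC counts can still differ in measured latency on a given device.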

FBNet vs. MobileNet & MNasNet
• FBNet-B has the same accuracy as MobileNetV2 [1], but 1.5x lower latency
• The smallest FBNet achieves 4.7% higher accuracy than MobileNetV2 (50.2% vs. 45.5%); its latency is only 2.9 ms (345 frames per second) on a Samsung Galaxy S8 phone
• The search cost of DNAS is 8 GPUs x 24 hours, 421x smaller than that of MnasNet [2], an efficient ConvNet discovered by reinforcement learning
Model                   Search cost (GPU hours)   # MACs (M)   Latency (ms)   ImageNet top-1 acc (%)
MobileNetV2-0.35-69     -                         11           3.8            45.50
FBNet-0.35-96 (ours)    216                       12.9         2.9            50.20
MobileNetV2-1.0         -                         300          21.7           72.0
MnasNet-65              91,000                    270          -              73.0
FBNet-A (ours)          216                       249          19.8           73.0
MobileNetV2-1.3         -                         509          33.8           74.4
MnasNet                 91,000                    317          23.7           74.0
FBNet-B (ours)          216                       295          23.1           74.1
MobileNetV2-1.4         -                         585          37.4           74.7
MnasNet-92              91,000                    388          -              74.8
FBNet-C (ours)          216                       375          28.1           74.9
[1] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[2] Tan, Mingxing, et al. "Mnasnet: Platform-aware neural architecture search for mobile." arXiv preprint arXiv:1807.11626 (2018).
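
Several of the headline ratios on this slide follow directly from the table. The sketch below recomputes them; the variable names are illustrative, and pairing FBNet-B with MobileNetV2-1.3 for the latency comparison is an assumption based on their similar accuracy.

```python
# Search cost (GPU hours) from the table.
mnasnet_search_gpu_hours = 91_000
fbnet_search_gpu_hours = 216
search_cost_ratio = mnasnet_search_gpu_hours / fbnet_search_gpu_hours
print(round(search_cost_ratio))  # 421

# Throughput of the smallest FBNet from its per-image latency.
smallest_fbnet_latency_ms = 2.9
frames_per_second = 1000 / smallest_fbnet_latency_ms
print(round(frames_per_second))  # 345

# Latency ratio at similar accuracy.
# ASSUMPTION: the "1.5x" claim compares FBNet-B (23.1 ms) with MobileNetV2-1.3 (33.8 ms).
latency_ratio = 33.8 / 23.1
print(round(latency_ratio, 2))  # 1.46
```

Note that 8 GPUs x 24 hours is 192 GPU hours, slightly below the 216 GPU hours listed in the table; the 421x figure corresponds to the table value.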

FBNet vs. MobileNet & MNasNet
[Figure: ImageNet top-1 accuracy vs. latency; longer latency is worse. Data points:]
• MobileNetV2 [1]: Acc: 71.8%, lat: 21.7 ms
• MobileNetV2-1.3 [1]: Acc: 74.4%, lat: 33.8 ms
• MobileNetV2-1.4 [1]: Acc: 74.7%, lat: 37.4 ms
• MnasNet [2]: Acc: 74.0%, lat: 23.7 ms
• DNASNet-A (ours): Acc: 73.0%, lat: 19.8 ms
• DNASNet-B (ours): Acc: 74.1%, lat: 23.1 ms
• DNASNet-C (ours): Acc: 74.9%, lat: 28.1 ms
* Estimated from the paper description
[2] Tan, Mingxing, et al. "Mnasnet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).

FBNet Compared with Other NAS
[Figure: ImageNet top-1 accuracy (y-axis, higher is better) vs. MACs (x-axis, more is worse). Mark size indicates search cost; circles indicate unknown search cost. Data points:]
• NAS [1]: Acc: 74.0%, MACs: 564M, search cost: 48,000 GPU-hrs
• PNAS [2]: Acc: 74.2%, MACs: 588M, search cost*: 6,000 GPU-hrs
• DARTS [3]: Acc: 73.1%, MACs: 595M, search cost: 288 GPU-hrs
• MobileNetV2 [4]: Acc: 71.8%, MACs: 300M
• AMC [5]: Acc: 70.8%, MACs: 150M
• MnasNet [6]: Acc: 74.0%, MACs: 317M, search cost*: 91,000 GPU-hrs
• FBNet (ours): Acc: 74.1%, MACs: 295M, search cost: 216 GPU-hrs
* Estimated from the paper description
[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).
[2] Liu, Chenxi, et al. "Progressive neural architecture search." arXiv:1712.00559 (2017).
[3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search." arXiv:1806.09055 (2018).
[4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[5] He, Yihui, et al. "Amc: Automl for model compression and acceleration on mobile devices." ECCV 2018.
[6] Tan, Mingxing, et al. "Mnasnet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).

Result: FBNet for different target devices
• Apple A11
– Big: 2 ARMv8 @ 2.5 GHz
– Little: 4 ARMv8 @ 1.4 GHz
– Vectorization: 4-wide 32-bit MAC
– LPDDR4x memory (30 GB/s)
– GPU + Neural Processing Engine
• Snapdragon 835
– Big: 4 ARMv8 @ 2.4 GHz
– Little: 4 ARMv8 @ 1.9 GHz
– Vectorization: 4-wide 32-bit MAC
– LPDDR4x memory (30 GB/s)
– Adreno 540 GPU
[Figure: FBNet latency on target devices (iPhone X and Samsung S8), comparing the target model for iPhone X with the target model for Samsung S8; annotated with a 1.4x speedup]
• Under a similar accuracy constraint (73.27% vs 73.20%), the FBNet optimized for iPhone X achieves a 1.4x speedup over the Samsung-optimized model

Result: FBNet for different target devices
• DNAS automatically adopts operators with low latency on the targeted devices

DNAS summary
• General search space: the search space of each layer can contain arbitrary operators. This allows us to apply DNAS to support different target processors.
• Extremely fast: the search typically takes 8 GPUs for 24 hours to finish. In comparison, to find models with similar performance, MnasNet requires 421x more computing resources.
• State-of-the-art performance:
– Mixed precision quantization: 21x model size reduction or 104x computational cost reduction, almost no accuracy loss
– Efficient architecture search: same accuracy, 1.5x faster, 2.4x fewer MACs
• Optimize for actual latency on targeted devices:
– Up to 1.4x speedup compared to non-targeted neural architectures

References
Papers:
• Wu, Bichen, et al. "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search." arXiv preprint arXiv:1812.03443 (2018).
• Wu, Bichen, et al. "Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search." arXiv preprint arXiv:1812.00090 (2018).
FBNet models:
• Will be open-sourced soon!
Questions and feedback:
• Contact me via email: bichen@berkeley.edu
