Accelerating Deep Learning Inference
on Mobile Systems
Darian Frajberg
Carlo Bernaschina
Christian Marone
Piero Fraternali
June 27, 2019
Introduction

Typical implementations of Deep Learning (DL) models focus on
the maximization of accuracy for a given task.
The architectures designed to achieve this objective have become
significantly deeper and more complex over time.

[Chart: Top-5 error (%)]
Introduction

Artificial Intelligence (AI) on the edge is key to
enhancing smart devices that rely on
operations with real-time constraints.
Despite the rapid growth of computational
power in embedded systems, such as
smartphones, wearable devices, drones and
FPGAs, deploying highly complex and
considerably large DL models remains
challenging.
Introduction
Cloud-offloading issues:
• Cost
• Availability
• Coverage
• Latency
• Privacy
Related work
• Compression techniques.
– Quantization
– Pruning
– Knowledge distillation
– Tensor decomposition
• Optimized model architectures.
– SqueezeNet
– MobileNet v1
– MobileNet v2
– MnasNet
• Hardware acceleration.
– Neural Networks API
– OpenGL
– Vulkan
– Metal
Related work
• Heterogeneous computing scheduling.
– Mobile GPU
– Custom implementations with access to hardware
primitives
• Mobile Deep Learning frameworks.
– TensorFlow Lite
– Caffe2
– CoreML
Limitations
1. Hardware acceleration primitives are not yet fully
standardized and stable, and remain tightly dependent
on SoC vendors.
2. Retraining or modifying the architecture of ready-to-use
models can be extremely time-consuming.
3. Post-training compression of already small models
can degrade accuracy.
Use case
PeakLens is a real-world mobile app that combines Augmented
Reality and Computer Vision (CV) for the identification of mountain
peaks.
It processes sensor readings and camera frames in real time by
using an efficient on-board Deep Learning-powered CV module.
400k+ installs on Android
Requirements
1. Focus on execution. It should be possible to train a model using tools already known
to the developer. The framework should focus just on execution concerns, without the
need of re-training.
2. Minimum dependencies. It should be possible to execute an optimized model
independently of the Operating System, hardware platform or model storage format.
3. Easy embedding. It should be possible to embed the framework and optimized models
into existing applications easily, without the need of ad-hoc integration procedures.
4. End-to-end optimization. Optimization should be applied as early as possible and
span the model life-cycle (generation, compilation, initialization, configuration,
execution).
5. Offline support. Computation should occur only on-board the embedded system,
without the need of a network connection for work off-loading.
6. No accuracy loss. The acceleration for constrained devices should not reduce
accuracy w.r.t. the execution on a high-performance infrastructure.
The PolimiDL Framework
PolimiDL is an open source framework for
accelerating DL inference on mobile and embedded
systems, which was started when no efficient
off-the-shelf edge solutions were available.
Implementation is generic and aims at supporting
devices with limited power and heterogeneous
architectures.
The PolimiDL Framework
• Generation-time optimizations.
– Layers fusion.
Consecutive in-place layers with identical filter size
can be fused into a single layer, thus reducing the
number of iterations over the cells of an input matrix
(see the sketch below).
Examples:
• Bias + ReLU = Bias_ReLU
• Batch_Normalization + ReLU6 =
BatchNormalization_ReLU6
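A minimal C++ sketch of the idea, assuming a channel-major (CHW) activation layout (hypothetical function, not PolimiDL's actual API): the bias addition and the ReLU activation are applied in a single pass over the buffer, so the data is traversed once instead of twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fused Bias_ReLU: data[i] = max(data[i] + bias[channel], 0), computed in place.
void bias_relu_fused(std::vector<float>& data,
                     const std::vector<float>& bias) {
    const std::size_t channels = bias.size();
    const std::size_t plane = data.size() / channels;  // values per channel (CHW layout assumed)
    for (std::size_t c = 0; c < channels; ++c) {
        const float b = bias[c];
        float* p = data.data() + c * plane;
        for (std::size_t i = 0; i < plane; ++i)
            p[i] = std::max(p[i] + b, 0.0f);  // bias + activation in one iteration
    }
}
```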
The PolimiDL Framework
• Generation-time optimizations.
– Weights fusion.
In layers that apply functions with constant terms composed of multiple
weights, those terms can be pre-computed and encoded as single constant
weights, thus reducing run-time operations and potential temporary memory
allocation (see the sketch below).
Example:
• Batch Normalization (BN)
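Concretely, BN computes y = γ·(x − μ)/√(σ² + ε) + β with per-channel constants, so the whole expression can be folded offline into a single scale a = γ/√(σ² + ε) and shift b = β − a·μ. A minimal sketch of such folding (hypothetical helper, not PolimiDL's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fold Batch Normalization parameters into per-channel scale/shift pairs,
// so that at run-time BN reduces to y = a * x + b.
struct ScaleShift {
    std::vector<float> a, b;
};

ScaleShift fold_batch_norm(const std::vector<float>& gamma,
                           const std::vector<float>& beta,
                           const std::vector<float>& mean,
                           const std::vector<float>& var,
                           float eps) {
    ScaleShift out;
    out.a.resize(gamma.size());
    out.b.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.a[c] = gamma[c] / std::sqrt(var[c] + eps);  // pre-computed scale
        out.b[c] = beta[c] - mean[c] * out.a[c];        // pre-computed shift
    }
    return out;
}
```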
The PolimiDL Framework
• Generation-time optimizations.
– Weights rearrangement.
Weights associated with predefined convolutional layer types are
stored in an order such that Eigen's GEMM matrix operations
do not require any memory reshaping at run-time (see the sketch below).
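The effect can be illustrated with Eigen (assumed shapes and storage format, not PolimiDL's actual layout): if the generator serializes the weights already in the column-major order Eigen expects, an Eigen::Map can wrap the raw buffers directly and the GEMM runs without any copy or reshape.

```cpp
#include <Eigen/Dense>

// Convolution expressed as a GEMM over pre-arranged buffers.
// Assumed shapes: weights is (out_channels x k), patches is (k x n_positions),
// both already stored in Eigen's default column-major order.
Eigen::MatrixXf conv_as_gemm(const float* weights, const float* patches,
                             int out_channels, int k, int n_positions) {
    Eigen::Map<const Eigen::MatrixXf> W(weights, out_channels, k);
    Eigen::Map<const Eigen::MatrixXf> X(patches, k, n_positions);
    return W * X;  // GEMM directly on the mapped memory, no reshaping at run-time
}
```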
The PolimiDL Framework
• Compile-time optimizations.
– Fixed network architecture.
The architecture of a model is fixed at compile-time,
which enables the compiler to perform per-layer
optimizations.
[Diagram: the model is compiled into a shared object (.so); each layer has a fixed input and output]
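A minimal sketch of what a compile-time-fixed architecture enables (hypothetical types, not PolimiDL's actual code): with layer dimensions as template parameters, buffer sizes and loop bounds are compile-time constants, so the compiler can unroll and vectorize each layer individually.

```cpp
#include <array>
#include <cstddef>

// A layer whose dimensions are known at compile time: buffers are fixed-size
// arrays and loop bounds are constants the compiler can optimize per layer.
template <int In, int Out>
struct FullyConnected {
    std::array<float, static_cast<std::size_t>(In) * Out> weights;
    std::array<float, Out> bias;

    void run(const std::array<float, In>& x, std::array<float, Out>& y) const {
        for (int o = 0; o < Out; ++o) {
            float acc = bias[o];
            for (int i = 0; i < In; ++i)
                acc += weights[static_cast<std::size_t>(o) * In + i] * x[i];
            y[o] = acc;
        }
    }
};

// The concrete architecture is fixed when the shared object (.so) is built:
using HiddenLayer = FullyConnected<128, 64>;
```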
The PolimiDL Framework
• Compile-time optimizations.
– Shared memory allocation & “tick-tock” piping.
The memory required by a model can be reduced and
allocated efficiently by exploiting spatial locality and
swapping the input and output buffers of subsequent
layers (see the sketch below).
[Diagram: layer input and output buffers swapped between consecutive layers, plus a temporary data area]
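A simplified sketch of the tick-tock scheme (not PolimiDL's actual scheduler): two buffers are allocated once, each layer reads from one and writes into the other, and the roles are swapped after every layer.

```cpp
#include <utility>
#include <vector>

struct Layer {
    virtual void run(const float* in, float* out) const = 0;
    virtual ~Layer() = default;
};

// Returns a pointer to the buffer holding the final output.
const float* run_network(const std::vector<const Layer*>& layers,
                         std::vector<float>& buf_a, std::vector<float>& buf_b) {
    float* in  = buf_a.data();  // both buffers sized for the largest activation
    float* out = buf_b.data();
    for (const Layer* layer : layers) {
        layer->run(in, out);
        std::swap(in, out);     // tick-tock: the output becomes the next input
    }
    return in;                  // after the last swap, `in` holds the result
}
```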
The PolimiDL Framework
• Initialization-time optimizations.
– Memory pre-allocation.
Memory requirements can be further reduced by fusing the three
buffers (layer input, layer output, temporary data) into a single one.
During initialization, each layer is queried for its memory
size requirements (see the sketch below).
[Diagram: a single pre-allocated buffer holding the layer input, layer output, and temporary data regions]
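A sketch of this initialization step under an assumed interface (the names below are illustrative, not PolimiDL's API): every layer reports its buffer needs, and a single arena sized for the worst case is allocated once instead of allocating per layer at run-time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct MemoryRequirements {
    std::size_t input_size;   // floats needed for the layer's input
    std::size_t output_size;  // floats needed for the layer's output
    std::size_t temp_size;    // scratch space used while computing
};

// Allocate one buffer large enough for the most demanding layer.
std::vector<float> preallocate(const std::vector<MemoryRequirements>& reqs) {
    std::size_t worst = 0;
    for (const auto& r : reqs)
        worst = std::max(worst, r.input_size + r.output_size + r.temp_size);
    return std::vector<float>(worst);  // single buffer shared by all layers
}
```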
The PolimiDL Framework
• Initialization-time optimizations.
– Small tasks for low memory consumption.
The work of certain layers is divided into smaller
tasks that can be executed independently; instead of
unrolling the whole input at once, only a fixed amount of
temporary memory is required (see the sketch below).
[Diagram: input partitioned into a grid of independent tasks T0–T24]
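A simplified sketch of the splitting (illustrative only): rather than unrolling the whole input at once, rows are processed in fixed-size chunks, so the scratch buffer keeps the same size regardless of the input resolution.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Invoke `task(first_row, end_row)` once per chunk of rows.
void for_each_task(std::size_t total_rows, std::size_t rows_per_task,
                   const std::function<void(std::size_t, std::size_t)>& task) {
    for (std::size_t row = 0; row < total_rows; row += rows_per_task) {
        const std::size_t end = std::min(row + rows_per_task, total_rows);
        task(row, end);  // each call needs scratch memory for one chunk only
    }
}
```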
The PolimiDL Framework
• Configuration-time optimizations.
– Scheduling optimization.
The optimal size for a scheduled task may vary
depending on the specific layer, the underlying
architecture, or even on the input size for Fully
Convolutional Neural Networks.
The task size can be:
• Set to a default value.
• Inferred by executing a profiling routine (see the sketch below).
• Loaded from the results of previous profiling runs.
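A sketch of such a profiling routine (hypothetical names, not PolimiDL's actual configuration code): candidate task sizes are timed on the target device and the fastest one is kept, so later runs can load it instead of falling back to the default.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// `run_with_task_size(size)` is assumed to execute one full inference pass.
template <typename RunFn>
std::size_t profile_task_size(const std::vector<std::size_t>& candidates,
                              RunFn run_with_task_size) {
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t size : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        run_with_task_size(size);
        const auto elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = size;  // remember the fastest configuration
        }
    }
    return best;  // persist this value to skip profiling on later launches
}
```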
The PolimiDL Framework
• Run-time optimizations.
– Dynamic workload scheduling.
Dynamic multi-threaded scheduling of tasks adapts well
to different contexts, such as ARM big.LITTLE
architectures, and allows the cores to be exploited more
effectively (see the sketch below).
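A simplified sketch of dynamic scheduling (illustrative, not PolimiDL's actual thread pool): worker threads pull task indices from a shared atomic counter, so faster cores, such as the big cluster of a big.LITTLE SoC, naturally end up processing more tasks than slower ones.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run `task(i)` for i in [0, n_tasks) using dynamic work distribution.
void run_tasks(std::size_t n_tasks, unsigned n_threads,
               const std::function<void(std::size_t)>& task) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < n_tasks;
                 i = next.fetch_add(1)) {
                task(i);  // big cores simply grab more work than LITTLE ones
            }
        });
    }
    for (auto& worker : workers) worker.join();
}
```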
The PolimiDL Framework
Layers coverage

Layer name                                           In place   Temp. memory   Schedulable
Convolution                                          X          √              √
Depthwise convolution                                X          √              √
Pointwise convolution (out_channels <= in_channels)  √          √              √
Pointwise convolution (out_channels > in_channels)   X          X              √
Max Pooling                                          X          √              X
Average Pooling                                      X          √              √
Batch normalization                                  √          X              √
Bias                                                 √          X              X
ReLU/ReLU6                                           √          X              X
Evaluation
Compare inference execution
time of PolimiDL and
TensorFlow Lite.
Execute benchmarks over:
– Multiple models
– Multiple devices with
heterogeneous architectures
Experimental setup
Models
Model               Task                   Input size     Parameters   Mult-Adds
PeakLens original   Image Segmentation     320 x 240 x 3  429K         2G
PeakLens optimized  Image Segmentation     320 x 240 x 3  21K          198M
MobileNet v1        Object Classification  224 x 224 x 3  4.24M        569M
Experimental setup
Devices

Device            Android version  Chipset                            CPU                                                                      RAM
Asus ZenFone 2    5.0              Z2560                              Intel Atom, 2 cores, 1.6 GHz (4 threads)                                 2 GB
Google Pixel      9.0              Qualcomm Snapdragon 821 (MSM8996)  2 cores, 2.15 GHz Kryo + 2 cores, 1.6 GHz Kryo (4 threads)               4 GB
LG G5 SE          7.0              Qualcomm Snapdragon 652 (MSM8976)  4 cores, 1.8 GHz Cortex-A72 + 4 cores, 1.2 GHz Cortex-A53 (8 threads)    3 GB
LG Nexus 5X       8.1              Qualcomm Snapdragon 808 (MSM899)   4 cores, 1.44 GHz Cortex-A53 + 2 cores, 1.82 GHz Cortex-A57 (6 threads)  2 GB
Motorola Nexus 6  7.0              Qualcomm Snapdragon 805            4 cores, 2.7 GHz Krait (4 threads)                                       3 GB
One Plus 6T       9.0              Qualcomm SDM845                    4 cores, 2.8 GHz Kryo 385 + 4 cores, 1.8 GHz Kryo 385 (8 threads)        6 GB
Experimental results
PeakLens original

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    1672.67                1138.00 (-31.96%)
Google Pixel      255.33                 171.00 (-33.03%)
LG G5 SE          290.00                 209.00 (-27.93%)
LG Nexus 5X       370.33                 342.33 (-7.56%)
Motorola Nexus 6  505.33                 215.67 (-57.32%)
One Plus 6T       144.33                 91.00 (-36.95%)
Average                                  (-32.46%)
Experimental results
PeakLens optimized

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    807.67                 179.33 (-77.80%)
Google Pixel      95.00                  35.33 (-62.81%)
LG G5 SE          138.33                 68.00 (-50.84%)
LG Nexus 5X       193.00                 80.33 (-58.38%)
Motorola Nexus 6  225.67                 66.00 (-70.75%)
One Plus 6T       68.67                  22.67 (-66.99%)
Average                                  (-64.59%)
Experimental results
MobileNet v1

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    775.33                 377.33 (-51.33%)
Google Pixel      82.33                  82.67 (+0.40%)
LG G5 SE          274.67                 259.00 (-5.70%)
LG Nexus 5X       225.00                 234.33 (+4.15%)
Motorola Nexus 6  298.33                 176.00 (-41.01%)
One Plus 6T       56.67                  51.67 (-8.82%)
Average                                  (-17.05%)
Conclusions

Concept
– Open source framework for accelerating Deep Learning
inference on mobile and embedded systems, which has
proved competitive w.r.t. TensorFlow Lite.
Future work
– Extended support for more layers, quantization and
conversion from more DL frameworks.
– Extended evaluation with more configurations, metrics
and devices.
Thanks For Your
Attention!
Accelerating Deep Learning
Inference on Mobile Systems
Darian Frajberg
Carlo Bernaschina
Christian Marone
Piero Fraternali
https://github.com/darianfrajberg/polimidl
darian.frajberg@polimi.it


Editor's Notes

  • Related work (compression, architectures, hardware acceleration): Compression techniques target large-scale architectures and aim at reducing the number of parameters and floating-point operations (FLOPs), possibly tolerating small accuracy drops in favor of execution acceleration and optimization of computational resources, storage, memory occupation and energy consumption. Lightweight architectures with compact layers pursue the design of an optimized network topology, yielding small, fast and accurate models suitable for resource-constrained devices. Hardware acceleration (HA) is the use of dedicated hardware to complement general-purpose CPUs and perform computationally intensive work more efficiently, e.g. by favoring specific operations and data-parallel computation.
  • Related work (scheduling, mobile frameworks): Heterogeneous computing scheduling comprises the design of strategies to efficiently coordinate and distribute the workload among processors of different types. Frameworks for the execution of DL models on mobile and embedded systems pursue optimized deployment on devices with limited resources, by managing memory allocation efficiently and exploiting the available hardware resources at best.
  • The PolimiDL Framework: Optimized execution requires managing memory allocation efficiently, to avoid overloading, and exploiting the available hardware resources for acceleration, which is not trivial given the non-standardized access to such resources.
  • Experimental setup: Evaluation exploits hardware with limited resources and models with small-size architectures that achieve a good trade-off between accuracy and latency. Three models with diverse characteristics, listed in the Models table, are evaluated.