Accelerating Deep Learning Inference
on Mobile Systems
Darian Frajberg
Carlo Bernaschina
Christian Marone
Piero Fraternali
June 27, 2019
Introduction

Typical implementations of Deep Learning (DL) models focus on
the maximization of accuracy for a given task.
The architectures designed to achieve this objective have become
significantly deeper and more complex over time.

[Chart: Top-5 error (%)]
Introduction

Artificial Intelligence (AI) on the edge is key to
enhancing smart devices that rely on
operations with real-time constraints.
Despite the rapid growth of computational
power in embedded systems, such as
smartphones, wearable devices, drones and
FPGAs, deploying highly complex and
considerably large DL models remains
challenging.
Introduction
Cloud-offloading issues:
• Cost
• Availability
• Coverage
• Latency
• Privacy
Related work
• Compression techniques.
– Quantization
– Pruning
– Knowledge distillation
– Tensor decomposition
• Optimized model architectures.
– SqueezeNet
– MobileNet v1
– MobileNet v2
– MnasNet
• Hardware acceleration.
– Neural Networks API
– OpenGL
– Vulkan
– Metal
Related work
• Heterogeneous computing scheduling.
– Mobile GPU
– Custom implementations with access to hardware
primitives
• Mobile Deep Learning frameworks.
– TensorFlow Lite
– Caffe2
– CoreML
Limitations
1. Hardware acceleration primitives are not yet fully
standardized and stable, and remain tightly dependent
on SoC vendors.
2. Retraining or modifying the architecture of ready-to-use
models can be extremely time-consuming.
3. Post-training compression of already small models
can degrade accuracy.
Use case
PeakLens is a real-world mobile app that combines Augmented
Reality and Computer Vision (CV) for the identification of mountain
peaks.
It processes sensor readings and camera frames in real time by
using an efficient on-board Deep Learning-powered CV module.
400k+ installs on Android
Requirements
1. Focus on execution. It should be possible to train a model using tools already known
to the developer. The framework should focus just on execution concerns, without the
need of re-training.
2. Minimum dependencies. It should be possible to execute an optimized model
independently of the Operating System, hardware platform or model storage format.
3. Easy embedding. It should be possible to embed the framework and optimized models
into existing applications easily, without the need of ad-hoc integration procedures.
4. End-to-end optimization. Optimization should be applied as early as possible and
span the model life-cycle (generation, compilation, initialization, configuration,
execution).
5. Offline support. Computation should occur only on-board the embedded system,
without the need of a network connection for work off-loading.
6. No accuracy loss. The acceleration for constrained devices should not reduce
accuracy w.r.t. the execution on a high-performance infrastructure.
The PolimiDL Framework
PolimiDL is an open source framework for
accelerating DL inference on mobile and embedded
systems, which was started when no efficient
off-the-shelf edge solutions were available.
Implementation is generic and aims at supporting
devices with limited power and heterogeneous
architectures.
The PolimiDL Framework
• Generation-time optimizations.
– Layers fusion.
Consecutive in-place layers with identical filter size
can be fused into a single layer, thus reducing the
number of iterations over the cells of an input matrix
(see the sketch below).
Examples:
• Bias + ReLU = Bias_ReLU
• Batch_Normalization + ReLU6 =
BatchNormalization_ReLU6
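A minimal C++ sketch of the idea, assuming a channel-major (CHW) activation layout (hypothetical function, not PolimiDL's actual API): the bias addition and the ReLU activation are applied in a single pass over the buffer, so the data is traversed once instead of twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fused Bias_ReLU: data[i] = max(data[i] + bias[channel], 0), computed in place.
void bias_relu_fused(std::vector<float>& data,
                     const std::vector<float>& bias) {
    const std::size_t channels = bias.size();
    const std::size_t plane = data.size() / channels;  // values per channel (CHW layout assumed)
    for (std::size_t c = 0; c < channels; ++c) {
        const float b = bias[c];
        float* p = data.data() + c * plane;
        for (std::size_t i = 0; i < plane; ++i)
            p[i] = std::max(p[i] + b, 0.0f);  // bias + activation in one iteration
    }
}
```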
The PolimiDL Framework
• Generation-time optimizations.
– Weights fusion.
In layers that apply functions with constant terms composed of multiple
weights, those terms can be pre-computed and encoded as single constant
weights, thus reducing run-time operations and potential temporary memory
allocation (see the sketch below).
Example:
• Batch Normalization (BN)
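Concretely, BN computes y = γ·(x − μ)/√(σ² + ε) + β with per-channel constants, so the whole expression can be folded offline into a single scale a = γ/√(σ² + ε) and shift b = β − a·μ. A minimal sketch of such folding (hypothetical helper, not PolimiDL's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fold Batch Normalization parameters into per-channel scale/shift pairs,
// so that at run-time BN reduces to y = a * x + b.
struct ScaleShift {
    std::vector<float> a, b;
};

ScaleShift fold_batch_norm(const std::vector<float>& gamma,
                           const std::vector<float>& beta,
                           const std::vector<float>& mean,
                           const std::vector<float>& var,
                           float eps) {
    ScaleShift out;
    out.a.resize(gamma.size());
    out.b.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.a[c] = gamma[c] / std::sqrt(var[c] + eps);  // pre-computed scale
        out.b[c] = beta[c] - mean[c] * out.a[c];        // pre-computed shift
    }
    return out;
}
```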
The PolimiDL Framework
• Generation-time optimizations.
– Weights rearrangement.
Weights associated with predefined convolutional layer types are
stored in an order such that Eigen's GEMM matrix operations
do not require any memory reshaping at run-time (see the sketch below).
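The effect can be illustrated with Eigen (assumed shapes and storage format, not PolimiDL's actual layout): if the generator serializes the weights already in the column-major order Eigen expects, an Eigen::Map can wrap the raw buffers directly and the GEMM runs without any copy or reshape.

```cpp
#include <Eigen/Dense>

// Convolution expressed as a GEMM over pre-arranged buffers.
// Assumed shapes: weights is (out_channels x k), patches is (k x n_positions),
// both already stored in Eigen's default column-major order.
Eigen::MatrixXf conv_as_gemm(const float* weights, const float* patches,
                             int out_channels, int k, int n_positions) {
    Eigen::Map<const Eigen::MatrixXf> W(weights, out_channels, k);
    Eigen::Map<const Eigen::MatrixXf> X(patches, k, n_positions);
    return W * X;  // GEMM directly on the mapped memory, no reshaping at run-time
}
```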
The PolimiDL Framework
• Compile-time optimizations.
– Fixed network architecture.
The architecture of a model is fixed at compile-time,
which enables the compiler to perform per-layer
optimizations.
[Diagram: the model is compiled into a shared object (.so); each layer has a fixed input and output]
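A minimal sketch of what a compile-time-fixed architecture enables (hypothetical types, not PolimiDL's actual code): with layer dimensions as template parameters, buffer sizes and loop bounds are compile-time constants, so the compiler can unroll and vectorize each layer individually.

```cpp
#include <array>
#include <cstddef>

// A layer whose dimensions are known at compile time: buffers are fixed-size
// arrays and loop bounds are constants the compiler can optimize per layer.
template <int In, int Out>
struct FullyConnected {
    std::array<float, static_cast<std::size_t>(In) * Out> weights;
    std::array<float, Out> bias;

    void run(const std::array<float, In>& x, std::array<float, Out>& y) const {
        for (int o = 0; o < Out; ++o) {
            float acc = bias[o];
            for (int i = 0; i < In; ++i)
                acc += weights[static_cast<std::size_t>(o) * In + i] * x[i];
            y[o] = acc;
        }
    }
};

// The concrete architecture is fixed when the shared object (.so) is built:
using HiddenLayer = FullyConnected<128, 64>;
```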
The PolimiDL Framework
• Compile-time optimizations.
– Shared memory allocation & “tick-tock” piping.
The memory required by a model can be reduced and
allocated efficiently by exploiting spatial locality and
swapping the input and output buffers of subsequent
layers (see the sketch below).
[Diagram: layer input and output buffers swapped between consecutive layers, plus a temporary data area]
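A simplified sketch of the tick-tock scheme (not PolimiDL's actual scheduler): two buffers are allocated once, each layer reads from one and writes into the other, and the roles are swapped after every layer.

```cpp
#include <utility>
#include <vector>

struct Layer {
    virtual void run(const float* in, float* out) const = 0;
    virtual ~Layer() = default;
};

// Returns a pointer to the buffer holding the final output.
const float* run_network(const std::vector<const Layer*>& layers,
                         std::vector<float>& buf_a, std::vector<float>& buf_b) {
    float* in  = buf_a.data();  // both buffers sized for the largest activation
    float* out = buf_b.data();
    for (const Layer* layer : layers) {
        layer->run(in, out);
        std::swap(in, out);     // tick-tock: the output becomes the next input
    }
    return in;                  // after the last swap, `in` holds the result
}
```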
The PolimiDL Framework
• Initialization-time optimizations.
– Memory pre-allocation.
Memory requirements can be further reduced by fusing the three
buffers (layer input, layer output, temporary data) into a single one.
During initialization, each layer is queried for its memory
size requirements (see the sketch below).
[Diagram: a single pre-allocated buffer holding the layer input, layer output, and temporary data regions]
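A sketch of this initialization step under an assumed interface (the names below are illustrative, not PolimiDL's API): every layer reports its buffer needs, and a single arena sized for the worst case is allocated once instead of allocating per layer at run-time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct MemoryRequirements {
    std::size_t input_size;   // floats needed for the layer's input
    std::size_t output_size;  // floats needed for the layer's output
    std::size_t temp_size;    // scratch space used while computing
};

// Allocate one buffer large enough for the most demanding layer.
std::vector<float> preallocate(const std::vector<MemoryRequirements>& reqs) {
    std::size_t worst = 0;
    for (const auto& r : reqs)
        worst = std::max(worst, r.input_size + r.output_size + r.temp_size);
    return std::vector<float>(worst);  // single buffer shared by all layers
}
```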
The PolimiDL Framework
• Initialization-time optimizations.
– Small tasks for low memory consumption.
The work of certain layers is divided into smaller
tasks that can be executed independently; instead of
unrolling the whole input at once, only a fixed amount of
temporary memory is required (see the sketch below).
[Diagram: input partitioned into a grid of independent tasks T0–T24]
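A simplified sketch of the splitting (illustrative only): rather than unrolling the whole input at once, rows are processed in fixed-size chunks, so the scratch buffer keeps the same size regardless of the input resolution.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Invoke `task(first_row, end_row)` once per chunk of rows.
void for_each_task(std::size_t total_rows, std::size_t rows_per_task,
                   const std::function<void(std::size_t, std::size_t)>& task) {
    for (std::size_t row = 0; row < total_rows; row += rows_per_task) {
        const std::size_t end = std::min(row + rows_per_task, total_rows);
        task(row, end);  // each call needs scratch memory for one chunk only
    }
}
```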
The PolimiDL Framework
• Configuration-time optimizations.
– Scheduling optimization.
The optimal size for a scheduled task may vary
depending on the specific layer, the underlying
architecture, or even on the input size for Fully
Convolutional Neural Networks.
The task size can be:
• Set to a default value.
• Inferred by executing a profiling routine (see the sketch below).
• Loaded from the results of previous profiling runs.
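A sketch of such a profiling routine (hypothetical names, not PolimiDL's actual configuration code): candidate task sizes are timed on the target device and the fastest one is kept, so later runs can load it instead of falling back to the default.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// `run_with_task_size(size)` is assumed to execute one full inference pass.
template <typename RunFn>
std::size_t profile_task_size(const std::vector<std::size_t>& candidates,
                              RunFn run_with_task_size) {
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t size : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        run_with_task_size(size);
        const auto elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = size;  // remember the fastest configuration
        }
    }
    return best;  // persist this value to skip profiling on later launches
}
```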
The PolimiDL Framework
• Run-time optimizations.
– Dynamic workload scheduling.
Dynamic multi-threaded scheduling of tasks adapts well
to different contexts, such as ARM big.LITTLE
architectures, and allows the cores to be exploited more
effectively (see the sketch below).
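A simplified sketch of dynamic scheduling (illustrative, not PolimiDL's actual thread pool): worker threads pull task indices from a shared atomic counter, so faster cores, such as the big cluster of a big.LITTLE SoC, naturally end up processing more tasks than slower ones.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run `task(i)` for i in [0, n_tasks) using dynamic work distribution.
void run_tasks(std::size_t n_tasks, unsigned n_threads,
               const std::function<void(std::size_t)>& task) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < n_tasks;
                 i = next.fetch_add(1)) {
                task(i);  // big cores simply grab more work than LITTLE ones
            }
        });
    }
    for (auto& worker : workers) worker.join();
}
```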
The PolimiDL Framework
Layers coverage

Layer name                                           In place   Temp. memory   Schedulable
Convolution                                          X          √              √
Depthwise convolution                                X          √              √
Pointwise convolution (out_channels <= in_channels)  √          √              √
Pointwise convolution (out_channels > in_channels)   X          X              √
Max Pooling                                          X          √              X
Average Pooling                                      X          √              √
Batch normalization                                  √          X              √
Bias                                                 √          X              X
ReLU/ReLU6                                           √          X              X
Evaluation
Compare inference execution
time of PolimiDL and
TensorFlow Lite.
Execute benchmarks over:
– Multiple models
– Multiple devices with
heterogeneous architectures
Experimental setup
Models
Model               Task                   Input size     Parameters   Mult-Adds
PeakLens original   Image Segmentation     320 x 240 x 3  429K         2G
PeakLens optimized  Image Segmentation     320 x 240 x 3  21K          198M
MobileNet v1        Object Classification  224 x 224 x 3  4.24M        569M
Experimental setup
Devices

Device            Android version  Chipset                            CPU                                                                      RAM
Asus ZenFone 2    5.0              Z2560                              Intel Atom, 2 cores, 1.6 GHz (4 threads)                                 2 GB
Google Pixel      9.0              Qualcomm Snapdragon 821 (MSM8996)  2 cores, 2.15 GHz Kryo + 2 cores, 1.6 GHz Kryo (4 threads)               4 GB
LG G5 SE          7.0              Qualcomm Snapdragon 652 (MSM8976)  4 cores, 1.8 GHz Cortex-A72 + 4 cores, 1.2 GHz Cortex-A53 (8 threads)    3 GB
LG Nexus 5X       8.1              Qualcomm Snapdragon 808 (MSM899)   4 cores, 1.44 GHz Cortex-A53 + 2 cores, 1.82 GHz Cortex-A57 (6 threads)  2 GB
Motorola Nexus 6  7.0              Qualcomm Snapdragon 805            4 cores, 2.7 GHz Krait (4 threads)                                       3 GB
One Plus 6T       9.0              Qualcomm SDM845                    4 cores, 2.8 GHz Kryo 385 + 4 cores, 1.8 GHz Kryo 385 (8 threads)        6 GB
Experimental results
PeakLens original

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    1672.67                1138.00 (-31.96%)
Google Pixel      255.33                 171.00 (-33.03%)
LG G5 SE          290.00                 209.00 (-27.93%)
LG Nexus 5X       370.33                 342.33 (-7.56%)
Motorola Nexus 6  505.33                 215.67 (-57.32%)
One Plus 6T       144.33                 91.00 (-36.95%)
Average                                  (-32.46%)
Experimental results
PeakLens optimized

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    807.67                 179.33 (-77.80%)
Google Pixel      95.00                  35.33 (-62.81%)
LG G5 SE          138.33                 68.00 (-50.84%)
LG Nexus 5X       193.00                 80.33 (-58.38%)
Motorola Nexus 6  225.67                 66.00 (-70.75%)
One Plus 6T       68.67                  22.67 (-66.99%)
Average                                  (-64.59%)
Experimental results
MobileNet v1

Device            TensorFlow Lite (ms)   PolimiDL (ms)
Asus Zenfone 2    775.33                 377.33 (-51.33%)
Google Pixel      82.33                  82.67 (+0.40%)
LG G5 SE          274.67                 259.00 (-5.70%)
LG Nexus 5X       225.00                 234.33 (+4.15%)
Motorola Nexus 6  298.33                 176.00 (-41.01%)
One Plus 6T       56.67                  51.67 (-8.82%)
Average                                  (-17.05%)
Conclusions

Concept
– Open source framework for accelerating Deep Learning
inference on mobile and embedded systems, which has
proved competitive w.r.t. TensorFlow Lite.
Future work
– Extended support for more layers, quantization and
conversion from more DL frameworks.
– Extended evaluation with more configurations, metrics
and devices.
Thanks For Your
Attention!
Accelerating Deep Learning
Inference on Mobile Systems
Darian Frajberg
Carlo Bernaschina
Christian Marone
Piero Fraternali
https://github.com/darianfrajberg/polimidl
darian.frajberg@polimi.it


Editor's Notes

  • Related work (compression, architectures, hardware acceleration): Compression techniques target large-scale architectures and aim at reducing the number of parameters and floating-point operations (FLOPs), possibly tolerating small accuracy drops in favor of execution acceleration and optimization of computational resources, storage, memory occupation and energy consumption. Lightweight architectures with compact layers pursue the design of an optimized network topology, yielding small, fast and accurate models suitable for resource-constrained devices. Hardware acceleration (HA) is the use of dedicated hardware to complement general-purpose CPUs and perform computationally intensive work more efficiently, e.g. by favoring specific operations and data-parallel computation.
  • Related work (scheduling, mobile frameworks): Heterogeneous computing scheduling comprises the design of strategies to efficiently coordinate and distribute the workload among processors of different types. Frameworks for the execution of DL models on mobile and embedded systems pursue optimized deployment on devices with limited resources, by managing memory allocation efficiently and exploiting the available hardware resources at best.
  • The PolimiDL Framework: Optimized execution requires managing memory allocation efficiently, to avoid overloading, and exploiting the available hardware resources for acceleration, which is not trivial given the non-standardized access to such resources.
  • Experimental setup: Evaluation exploits hardware with limited resources and models with small-size architectures that achieve a good trade-off between accuracy and latency. Three models with diverse characteristics, listed in the Models table, are evaluated.