Accelerating Deep Learning Inference
on Mobile Systems
Darian Frajberg
Carlo Bernaschina
Christian Marone
Piero Fraternali
June 27, 2019
Introduction
Typical implementations of Deep Learning (DL) models focus on
the maximization of accuracy for a given task.
Architectures to achieve such an objective have become
significantly deeper and more complex over time.
[Figure: Top-5 error (%)]
Artificial Intelligence (AI) on the edge is key to
enhancing smart devices that rely on
operations with real-time constraints.
Despite the rapid growth of computational
power in embedded systems, such as
smartphones, wearable devices, drones and
FPGAs, the deployment of large and highly
complex DL models remains challenging.
Cloud-offloading issues:
• Cost
• Availability
• Coverage
• Latency
• Privacy
Related work
• Compression techniques.
– Quantization
– Pruning
– Knowledge distillation
– Tensor decomposition
• Optimized model architectures.
– SqueezeNet
– MobileNet v1
– MobileNet v2
– MnasNet
• Hardware acceleration.
– Neural Networks API
– OpenGL
– Vulkan
– Metal
• Heterogeneous computing scheduling.
– Mobile GPU
– Custom implementations with access to hardware
primitives
• Mobile Deep Learning frameworks.
– TensorFlow Lite
– Caffe2
– CoreML
Limitations
1. Hardware acceleration primitives are still not
completely standardized or stable, and depend
tightly on SoC vendors.
2. Retraining or modifying the architecture of ready-to-use
models can be extremely time-consuming.
3. Post-training compression of already small
models can degrade accuracy.
Use case
PeakLens is a real-world mobile app that combines Augmented
Reality and Computer Vision (CV) for the identification of mountain
peaks.
It processes sensor readings and camera frames in real time by
using an efficient on-board Deep Learning-powered CV module.
400k+ installs on Android
Requirements
1. Focus on execution. It should be possible to train a model using tools already known
to the developer. The framework should focus just on execution concerns, without the
need of re-training.
2. Minimum dependencies. It should be possible to execute an optimized model
independently of the Operating System, hardware platform or model storage format.
3. Easy embedding. It should be possible to embed the framework and optimized models
into existing applications easily, without the need of ad-hoc integration procedures.
4. End-to-end optimization. Optimization should be applied as early as possible and
span the model life-cycle (generation, compilation, initialization, configuration,
execution).
5. Offline support. Computation should occur only on-board the embedded system,
without the need of a network connection for work off-loading.
6. No accuracy loss. The acceleration for constrained devices should not reduce
accuracy w.r.t. execution on a high-performance infrastructure.
The PolimiDL Framework
PolimiDL is an open source framework for
accelerating DL inference on mobile and embedded
systems, which was started when no efficient
off-the-shelf edge solutions were available.
Implementation is generic and aims at supporting
devices with limited power and heterogeneous
architectures.
• Generation-time optimizations.
– Layers fusion.
Consecutive in-place layers with identical filter size
can be fused into one single layer, thus reducing the
number of iterations over the cells of an input matrix.
Examples:
• Bias + ReLU = Bias_ReLU
• Batch_Normalization + ReLU6 = BatchNormalization_ReLU6
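The Bias + ReLU example can be sketched as follows. This is a minimal illustration, not PolimiDL's actual implementation: the function names are hypothetical and, for brevity, a single scalar bias is assumed instead of a per-channel one. The point is only that the fused layer visits each cell once instead of twice.

```python
def bias(values, b):
    """Standalone Bias layer: one in-place pass over the data."""
    for i in range(len(values)):
        values[i] += b
    return values


def relu(values):
    """Standalone ReLU layer: a second in-place pass over the data."""
    for i in range(len(values)):
        values[i] = max(values[i], 0.0)
    return values


def bias_relu(values, b):
    """Fused Bias_ReLU layer: both operations applied in a single pass."""
    for i in range(len(values)):
        values[i] = max(values[i] + b, 0.0)
    return values


data = [-2.0, -0.5, 0.0, 1.5]
unfused = relu(bias(list(data), 0.5))   # two iterations over the cells
fused = bias_relu(list(data), 0.5)      # one iteration over the cells
print(fused == unfused)
```

Both paths produce the same values; the fused version simply halves the number of traversals of the buffer.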
• Generation-time optimizations.
– Weights fusion.
Layers applying functions with constant terms comprising multiple
weights can be pre-computed and encoded as unique constant weights,
thus reducing operations at run-time and potential temporary memory
allocation.
Example:
• Batch Normalization (BN)
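For the Batch Normalization example, a minimal sketch of the idea, assuming the standard inference-time BN formula y = gamma * (x - mean) / sqrt(variance + eps) + beta. The helper names and the eps default are illustrative assumptions, not taken from PolimiDL. The four learned terms are folded offline into two constants, so run-time work is one multiply and one add per value.

```python
import math


def fold_batch_norm(gamma, beta, mean, variance, eps=1e-5):
    """Pre-compute the two constants that replace the four BN weights."""
    scale = gamma / math.sqrt(variance + eps)
    shift = beta - mean * scale
    return scale, shift


def batch_norm_reference(x, gamma, beta, mean, variance, eps=1e-5):
    """Unfused BN as usually written, evaluated entirely at run-time."""
    return gamma * (x - mean) / math.sqrt(variance + eps) + beta


# Generation time: fold once per channel.
scale, shift = fold_batch_norm(gamma=1.5, beta=0.2, mean=0.4, variance=2.0)

# Run time: a single multiply-add per value.
for x in (-1.0, 0.0, 3.25):
    fused = x * scale + shift
    reference = batch_norm_reference(x, 1.5, 0.2, 0.4, 2.0)
    print(math.isclose(fused, reference, rel_tol=1e-9, abs_tol=1e-12))
```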
• Generation-time optimizations.
– Weights rearrangement.
Weights associated with predefined Convolutional layer types are
stored in an order such that Eigen’s GEMM matrix operations
do not require any memory reshaping at run-time.
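A rough sketch of what such a rearrangement could look like. The layouts here are assumptions chosen for illustration (weights exported as height x width x in_channels x out_channels, and a GEMM that expects one row per output channel); the slides do not state which layouts PolimiDL actually uses. The reordering is done once, offline, so nothing has to be reshaped at run-time.

```python
def rearrange_weights(weights_hwio):
    """Reorder nested [kh][kw][in][out] weights into a matrix with one row
    per output channel and kh*kw*in columns (hypothetical target layout)."""
    kh = len(weights_hwio)
    kw = len(weights_hwio[0])
    cin = len(weights_hwio[0][0])
    cout = len(weights_hwio[0][0][0])
    matrix = []
    for o in range(cout):
        row = [
            weights_hwio[y][x][c][o]
            for y in range(kh)
            for x in range(kw)
            for c in range(cin)
        ]
        matrix.append(row)
    return matrix


# A 1 x 2 kernel with 2 input channels and 2 output channels.
weights = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
for row in rearrange_weights(weights):
    print(row)
```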
• Compile-time optimizations.
– Fixed network architecture.
The architecture of a model is fixed at compile-time,
which enables the compiler to perform per-layer
optimizations.
[Figure: .SO file; Layer input / Layer output buffers]
• Compile-time optimizations.
– Shared memory allocation & “tick-tock” piping.
The memory required by a model can be reduced and
allocated efficiently by exploiting spatial locality and
inverting the input and output buffers of subsequent
layers.
[Figure: Temporary data]
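The “tick-tock” idea can be sketched as two buffers whose roles are swapped after every layer. This is a simplified illustration under stated assumptions: all layers keep the same data size, each layer is a hypothetical function `layer(src, dst)`, and in-place layers (which would not need a swap) are not modelled.

```python
def double(src, dst):
    """Hypothetical layer: writes 2 * input into the output buffer."""
    for i, value in enumerate(src):
        dst[i] = 2.0 * value


def add_one(src, dst):
    """Hypothetical layer: writes input + 1 into the output buffer."""
    for i, value in enumerate(src):
        dst[i] = value + 1.0


def run_pipeline(layers, input_values):
    """Run all layers using only two buffers, swapped after each layer."""
    tick = list(input_values)
    tock = [0.0] * len(input_values)
    src, dst = tick, tock
    for layer in layers:
        layer(src, dst)
        # The output of this layer becomes the input of the next one.
        src, dst = dst, src
    return src


print(run_pipeline([double, add_one, double], [1.0, 2.0]))
```

However many layers the model has, only two data buffers are ever allocated.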
• Initialization-time optimizations.
– Memory pre-allocation.
Memory requirements can be reduced by fusing the 3
buffers into a single one. During initialization, each
layer is queried about its memory size requirements.
[Figure: Layer input, Layer output and Temporary data buffers]
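One way to picture the pre-allocation step is shown below. The `LayerSpec` type, its fields and the region layout are assumptions made for this sketch, not PolimiDL's API: each layer reports its sizes at initialization, and a single allocation is then split into two data regions (for tick-tock piping) plus one temporary region, each sized for the most demanding layer.

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    """Hypothetical answer a layer gives when queried for its memory needs."""
    name: str
    input_size: int
    output_size: int
    temp_size: int


def plan_memory(layers):
    """Return the total buffer size and the offset of each region."""
    # Each data region must be able to hold any layer's input or output.
    data_size = max(max(l.input_size, l.output_size) for l in layers)
    # The temporary region is shared, so the largest request is enough.
    temp_size = max(l.temp_size for l in layers)
    offsets = {"tick": 0, "tock": data_size, "temp": 2 * data_size}
    return 2 * data_size + temp_size, offsets


layers = [
    LayerSpec("conv", 100, 80, 30),
    LayerSpec("pointwise", 80, 120, 0),
    LayerSpec("pool", 120, 10, 50),
]
total, offsets = plan_memory(layers)
print(total, offsets)
```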
• Initialization-time optimizations.
– Small tasks for low memory consumption.
The operation of certain layers is divided into smaller
tasks that can be executed independently, thus not
performing a complete input unroll, but maintaining a
fixed required size for the temporary memory.
[Figure: layer operation divided into a 5 x 5 grid of tasks, T0 to T24]
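A small sketch of the idea, using a 1-D convolution as a stand-in for a real layer (an assumption made to keep the example short; the function name and signature are hypothetical). Instead of unrolling every input window at once, each task unrolls only `task_size` windows into a temporary buffer whose size is fixed and reused across tasks.

```python
def conv1d_tasks(signal, kernel, task_size):
    """Valid 1-D convolution (no kernel flip) computed task by task."""
    k = len(kernel)
    out_len = len(signal) - k + 1
    output = [0.0] * out_len
    # Fixed-size temporary memory: room for task_size unrolled windows only.
    temp = [0.0] * (task_size * k)
    for start in range(0, out_len, task_size):
        count = min(task_size, out_len - start)
        # Unroll just the windows that belong to this task.
        for row in range(count):
            for j in range(k):
                temp[row * k + j] = signal[start + row + j]
        # Multiply each unrolled window by the kernel.
        for row in range(count):
            output[start + row] = sum(temp[row * k + j] * kernel[j] for j in range(k))
    return output


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
# Same result whether the work is split into small tasks or done in one go.
print(conv1d_tasks(signal, kernel, task_size=2) == conv1d_tasks(signal, kernel, task_size=3))
```

A complete unroll would need `out_len * k` temporary values; here the requirement is `task_size * k`, independent of the input size.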
• Configuration-time optimizations.
– Scheduling optimization.
The optimal size for a scheduled task may vary
depending on the specific layer, the underlying
architecture, or even on the input size for Fully
Convolutional Neural Networks.
The size can be:
• Set to a default value.
• Inferred by executing a profiling routine.
• Loaded from previous profiling routine executions.
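The three options above could be combined roughly as follows. This is a hedged sketch: `profile_task_size`, `choose_task_size`, the cache dictionary and the candidate sizes are all hypothetical names invented for illustration, and the timed workload is a placeholder rather than a real layer.

```python
import time


def profile_task_size(run_layer, candidates, repeats=3):
    """Time the layer with each candidate task size and keep the fastest."""
    best_size, best_time = None, float("inf")
    for size in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            run_layer(size)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    return best_size


def choose_task_size(layer_key, run_layer, candidates, cache, default=None):
    """Pick a task size: cached result, fresh profiling, or default value."""
    if layer_key in cache:
        # Loaded from a previous profiling routine execution.
        return cache[layer_key]
    if run_layer is None:
        # No profiling possible: fall back to the default value.
        return default
    cache[layer_key] = profile_task_size(run_layer, candidates)
    return cache[layer_key]


cache = {}
placeholder_layer = lambda task_size: sum(range(1000))  # stand-in workload
size = choose_task_size("conv1", placeholder_layer, [1, 2, 4], cache)
print(size in (1, 2, 4), choose_task_size("conv1", None, [], cache) == size)
```

Keying the cache per layer (and, for Fully Convolutional networks, per input size) would let results be reused across runs on the same device.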
• Run-time optimizations.
– Dynamic workload scheduling.
Dynamic multithreaded scheduling of tasks can adapt
well to different contexts such as ARM big.LITTLE
architecture and allows cores to be better exploited.
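A minimal sketch of dynamic scheduling with a shared work queue. It illustrates the policy only and is not PolimiDL's scheduler: the function name is hypothetical, and Python threads do not give true CPU parallelism for this kind of work. Because each worker takes a new task as soon as it finishes the previous one, faster cores naturally end up processing more tasks than slower ones.

```python
import queue
import threading


def run_tasks_dynamically(tasks, num_workers):
    """Execute independent tasks; workers pull the next task when idle."""
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# 25 independent tasks shared among 4 workers.
tasks = [lambda i=i: i * i for i in range(25)]
print(run_tasks_dynamically(tasks, num_workers=4) == [i * i for i in range(25)])
```

A static split would instead assign a fixed share of tasks to each core up front, leaving fast cores idle while slow ones finish.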
Layers coverage
Layer name | In place | Temp. memory | Schedulable
Convolution | X | √ | √
Depthwise convolution | X | √ | √
Pointwise convolution (out_channels <= in_channels) | √ | √ | √
Pointwise convolution (out_channels > in_channels) | X | X | √
Max Pooling | X | √ | X
Average Pooling | X | √ | √
Batch normalization | √ | X | √
Bias | √ | X | X
ReLU/ReLU6 | √ | X | X
Evaluation
Compare inference execution
time of PolimiDL and
TensorFlow Lite.
Execute benchmark over:
– Multiple models
– Multiple devices with
heterogeneous architectures
Experimental setup
Models
Model | Task | Input size | Parameters | Mult-Adds
PeakLens original | Image Segmentation | 320 x 240 x 3 | 429K | 2G
PeakLens optimized | Image Segmentation | 320 x 240 x 3 | 21K | 198M
MobileNet v1 | Object Classification | 224 x 224 x 3 | 4.24M | 569M
Devices
Device | Android version | Chipset | CPU | RAM
Asus ZenFone 2 | 5.0 | Z2560 Intel Atom | 2-cores 1.6 GHz (4 threads) | 2 GB
Google Pixel | 9.0 | MSM8996 Qualcomm Snapdragon 821 | 2-cores 2.15 GHz Kryo + 2-cores 1.6 GHz Kryo (4 threads) | 4 GB
LG G5 SE | 7.0 | MSM8976 Qualcomm Snapdragon 652 | 4-cores 1.8 GHz Cortex-A72 + 4-cores 1.2 GHz Cortex-A53 (8 threads) | 3 GB
LG Nexus 5X | 8.1 | MSM899 Qualcomm Snapdragon 808 | 4-cores 1.44 GHz Cortex-A53 + 2-cores 1.82 GHz Cortex-A57 (6 threads) | 2 GB
Motorola Nexus 6 | 7.0 | Qualcomm Snapdragon 805 | 4-cores 2.7 GHz Krait (4 threads) | 3 GB
One Plus 6T | 9.0 | SDM845 Qualcomm | 4-cores 2.8 GHz Kryo 385 + 4-cores 1.8 GHz Kryo 385 (8 threads) | 6 GB
Experimental results
PeakLens original
Device | TensorFlow Lite (ms) | PolimiDL (ms)
Asus ZenFone 2 | 1672.67 | 1138.00 (-31.96%)
Google Pixel | 255.33 | 171.00 (-33.03%)
LG G5 SE | 290.00 | 209.00 (-27.93%)
LG Nexus 5X | 370.33 | 342.33 (-7.56%)
Motorola Nexus 6 | 505.33 | 215.67 (-57.32%)
One Plus 6T | 144.33 | 91.00 (-36.95%)
Average | | (-32.46%)
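The percentages in parentheses are the relative change in execution time of PolimiDL with respect to TensorFlow Lite, and the last row is the mean of the per-device percentages. A short sketch of that arithmetic for the PeakLens original results follows; the function name is illustrative, and because the slide's timings are already rounded, a recomputed percentage may differ from the printed one in the last digit.

```python
def relative_change(baseline_ms, candidate_ms):
    """Relative change of candidate vs. baseline, in percent (negative = faster)."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# (TensorFlow Lite ms, PolimiDL ms, percentage reported on the slide)
peaklens_original = {
    "Asus ZenFone 2": (1672.67, 1138.00, -31.96),
    "Google Pixel": (255.33, 171.00, -33.03),
    "LG G5 SE": (290.00, 209.00, -27.93),
    "LG Nexus 5X": (370.33, 342.33, -7.56),
    "Motorola Nexus 6": (505.33, 215.67, -57.32),
    "One Plus 6T": (144.33, 91.00, -36.95),
}

for device, (tflite_ms, polimidl_ms, reported) in peaklens_original.items():
    recomputed = relative_change(tflite_ms, polimidl_ms)
    print(f"{device}: recomputed {recomputed:.2f}%, reported {reported:.2f}%")

# The reported average is the mean of the six reported percentages.
reported_values = [row[2] for row in peaklens_original.values()]
average = sum(reported_values) / len(reported_values)
print(f"Average: {average:.2f}%")
```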
PeakLens optimized
Device | TensorFlow Lite (ms) | PolimiDL (ms)
Asus ZenFone 2 | 807.67 | 179.33 (-77.80%)
Google Pixel | 95.00 | 35.33 (-62.81%)
LG G5 SE | 138.33 | 68.00 (-50.84%)
LG Nexus 5X | 193.00 | 80.33 (-58.38%)
Motorola Nexus 6 | 225.67 | 66.00 (-70.75%)
One Plus 6T | 68.67 | 22.67 (-66.99%)
Average | | (-64.59%)
MobileNet v1
Device | TensorFlow Lite (ms) | PolimiDL (ms)
Asus ZenFone 2 | 775.33 | 377.33 (-51.33%)
Google Pixel | 82.33 | 82.67 (+0.40%)
LG G5 SE | 274.67 | 259.00 (-5.70%)
LG Nexus 5X | 225.00 | 234.33 (+4.15%)
Motorola Nexus 6 | 298.33 | 176.00 (-41.01%)
One Plus 6T | 56.67 | 51.67 (-8.82%)
Average | | (-17.05%)
Conclusions
Concept
– Open source framework for accelerating Deep Learning
inference on mobile and embedded systems, which has
proved competitive w.r.t. TensorFlow Lite.
Future work
– Extended support for more layers, quantization and
conversion from more DL frameworks.
– Extended evaluation with more configurations, metrics
and devices.
Thanks For Your Attention!
https://github.com/darianfrajberg/polimidl
darian.frajberg@polimi.it

 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 

Accelerating Deep Learning Inference on Mobile Systems

  • 1. Accelerating Deep Learning Inference on Mobile Systems. Darian Frajberg, Carlo Bernaschina, Christian Marone, Piero Fraternali. June 27, 2019
  • 2. Introduction. Typical implementations of Deep Learning (DL) models focus on the maximization of accuracy for a given task. Architectures to achieve such an objective have become significantly deeper and more complex over time. [Figure: Top-5 error (%)]
  • 3. Introduction. Artificial Intelligence (AI) on the edge is of great importance for enhancing smart devices that rely on operations with real-time constraints. Despite the rapid growth of computational power in embedded systems, such as smartphones, wearable devices, drones and FPGAs, the deployment of highly complex and considerably large DL models remains challenging.
  • 4. Introduction. Cloud-offloading issues: cost, availability, coverage, latency, privacy.
  • 5. Related work
      • Compression techniques: quantization, pruning, knowledge distillation, tensor decomposition.
      • Optimized model architectures: SqueezeNet, MobileNet v1, MobileNet v2, MnasNet.
      • Hardware acceleration: Neural Networks API, OpenGL, Vulkan, Metal.
  • 6. Related work
      • Heterogeneous computing scheduling: mobile GPU, custom implementations with access to hardware primitives.
      • Mobile Deep Learning frameworks: TensorFlow Lite, Caffe2, CoreML.
  • 7. Limitations
      1. Hardware acceleration primitives are still not completely standardized and stable, and are tightly dependent on SoC vendors.
      2. Retraining or modifying the architecture of ready-to-use models can be extremely time-consuming.
      3. Post-training compression of already small models can degrade accuracy.
  • 8. Use case. PeakLens is a real-world mobile app that combines Augmented Reality and Computer Vision (CV) for the identification of mountain peaks. It processes sensor readings and camera frames in real time by using an efficient on-board Deep Learning-powered CV module. 400k+ installs on Android.
  • 9. Requirements
      1. Focus on execution. It should be possible to train a model using tools already known to the developer. The framework should focus only on execution concerns, without the need for retraining.
      2. Minimum dependencies. It should be possible to execute an optimized model independently of the Operating System, hardware platform or model storage format.
      3. Easy embedding. It should be possible to embed the framework and optimized models into existing applications easily, without the need for ad-hoc integration procedures.
      4. End-to-end optimization. Optimization should be applied as early as possible and span the model life-cycle (generation, compilation, initialization, configuration, execution).
      5. Offline support. Computation should occur only on board the embedded system, without the need for a network connection for work offloading.
      6. No accuracy loss. The acceleration for constrained devices should not reduce accuracy w.r.t. execution on a high-performance infrastructure.
  • 10. The PolimiDL Framework. PolimiDL is an open source framework for accelerating DL inference on mobile and embedded systems, which was started when no efficient off-the-shelf edge solutions were available. The implementation is generic and aims at supporting devices with limited power and heterogeneous architectures.
  • 12. The PolimiDL Framework: generation-time optimizations
      – Layers fusion. Consecutive in-place layers with identical filter size can be fused into a single layer, thus reducing the number of iterations over the cells of an input matrix. Examples: Bias + ReLU = Bias_ReLU; Batch_Normalization + ReLU6 = BatchNormalization_ReLU6.
  • 13. The PolimiDL Framework: generation-time optimizations
      – Weights fusion. Layers applying functions with constant terms comprising multiple weights can be pre-computed and encoded as unique constant weights, thus reducing operations at run-time and potential temporary memory allocation. Example: Batch Normalization (BN).
  • 14. The PolimiDL Framework: generation-time optimizations
      – Weights rearrangement. Weights associated with predefined Convolutional layer types are stored in an order such that Eigen's GEMM matrix operations do not require any memory reshaping at run-time.
  • 16. The PolimiDL Framework: compile-time optimizations
      – Fixed network architecture. The architecture of a model is fixed at compile-time, which enables the compiler to perform per-layer optimizations. [Figure label: .SO]
  • 17. The PolimiDL Framework: compile-time optimizations
      – Shared memory allocation & "tick-tock" piping. The memory required by a model can be reduced and allocated efficiently by exploiting spatial locality and inverting the input and output buffers of subsequent layers. [Figure labels: Layer input, Layer output, Temporary data]
  • 19. The PolimiDL Framework: initialization-time optimizations
      – Memory pre-allocation. Memory requirements can be reduced by fusing the 3 buffers (layer input, layer output, temporary data) into a single one. During initialization, each layer is queried about its memory size requirements.
  • 20. The PolimiDL Framework: initialization-time optimizations
      – Small tasks for low memory consumption. The operation of certain layers is divided into smaller tasks that can be executed independently, thus not performing a complete input unroll, but maintaining a fixed required size for the temporary memory. [Figure: tasks T0 to T24]
  • 22. The PolimiDL Framework: configuration-time optimizations
      – Scheduling optimization. The optimal size for a scheduled task may vary depending on the specific layer, the underlying architecture, or even on the input size for Fully Convolutional Neural Networks. The size can be:
          • Set to a default value.
          • Inferred by executing a profiling routine.
          • Loaded from previous profiling routine executions.
  • 24. The PolimiDL Framework: run-time optimizations
      – Dynamic workload scheduling. Dynamic multithreaded scheduling of tasks can adapt well to different contexts, such as the ARM big.LITTLE architecture, and allows cores to be better exploited.
  • 25. The PolimiDL Framework: layers coverage (√ = yes, X = no)
      Layer name | In place | Temp. memory | Schedulable
      Convolution | X | √ | √
      Depthwise convolution | X | √ | √
      Pointwise convolution (out_channels <= in_channels) | √ | √ | √
      Pointwise convolution (out_channels > in_channels) | X | X | √
      Max Pooling | X | √ | X
      Average Pooling | X | √ | √
      Batch normalization | √ | X | √
      Bias | √ | X | X
      ReLU/ReLU6 | √ | X | X
  • 26. Evaluation. Compare the inference execution time of PolimiDL and TensorFlow Lite. Execute the benchmark over:
      – Multiple models
      – Multiple devices with heterogeneous architectures
  • 27. Experimental setup: models
      Model | Task | Input size | Parameters | Mult-Adds
      PeakLens original | Image Segmentation | 320 x 240 x 3 | 429K | 2G
      PeakLens optimized | Image Segmentation | 320 x 240 x 3 | 21K | 198M
      MobileNet v1 | Object Classification | 224 x 224 x 3 | 4.24M | 569M
  • 28. Experimental setup: devices
      Device | Android V. | Chipset | CPU | RAM
      Asus ZenFone 2 | 5.0 | Z2560 Intel Atom | 2 cores 1.6 GHz (4 threads) | 2 GB
      Google Pixel | 9.0 | MSM8996 Qualcomm Snapdragon 821 | 2 cores 2.15 GHz Kryo + 2 cores 1.6 GHz Kryo (4 threads) | 4 GB
      LG G5 SE | 7.0 | MSM8976 Qualcomm Snapdragon 652 | 4 cores 1.8 GHz Cortex-A72 + 4 cores 1.2 GHz Cortex-A53 (8 threads) | 3 GB
      LG Nexus 5X | 8.1 | MSM899 Qualcomm Snapdragon 808 | 4 cores 1.44 GHz Cortex-A53 + 2 cores 1.82 GHz Cortex-A57 (6 threads) | 2 GB
      Motorola Nexus 6 | 7.0 | Qualcomm Snapdragon 805 | 4 cores 2.7 GHz Krait (4 threads) | 3 GB
      One Plus 6T | 9.0 | SDM845 Qualcomm | 4 cores 2.8 GHz Kryo 385 + 4 cores 1.8 GHz Kryo 385 (8 threads) | 6 GB
  • 29. Experimental results: PeakLens original
      Device | TensorFlow Lite (ms) | PolimiDL (ms)
      Asus ZenFone 2 | 1672.67 | 1138.00 (-31.96%)
      Google Pixel | 255.33 | 171.00 (-33.03%)
      LG G5 SE | 290.00 | 209.00 (-27.93%)
      LG Nexus 5X | 370.33 | 342.33 (-7.56%)
      Motorola Nexus 6 | 505.33 | 215.67 (-57.32%)
      One Plus 6T | 144.33 | 91.00 (-36.95%)
      Average | | (-32.46%)
  • 30. Experimental results: PeakLens optimized
      Device | TensorFlow Lite (ms) | PolimiDL (ms)
      Asus ZenFone 2 | 807.67 | 179.33 (-77.80%)
      Google Pixel | 95.00 | 35.33 (-62.81%)
      LG G5 SE | 138.33 | 68.00 (-50.84%)
      LG Nexus 5X | 193.00 | 80.33 (-58.38%)
      Motorola Nexus 6 | 225.67 | 66.00 (-70.75%)
      One Plus 6T | 68.67 | 22.67 (-66.99%)
      Average | | (-64.59%)
  • 31. Experimental results: MobileNet v1
      Device | TensorFlow Lite (ms) | PolimiDL (ms)
      Asus ZenFone 2 | 775.33 | 377.33 (-51.33%)
      Google Pixel | 82.33 | 82.67 (+0.40%)
      LG G5 SE | 274.67 | 259.00 (-5.70%)
      LG Nexus 5X | 225.00 | 234.33 (+4.15%)
      Motorola Nexus 6 | 298.33 | 176.00 (-41.01%)
      One Plus 6T | 56.67 | 51.67 (-8.82%)
      Average | | (-17.05%)
  • 32. Conclusions
      – Concept. Open source framework for accelerating Deep Learning inference on mobile and embedded systems, which has proved competitive w.r.t. TensorFlow Lite.
      – Future work. Extended support for more layers, quantization and conversion from more DL frameworks. Extended evaluation with more configurations, metrics and devices.
  • 33. Thanks For Your Attention! Accelerating Deep Learning Inference on Mobile Systems. Darian Frajberg, Carlo Bernaschina, Christian Marone, Piero Fraternali. https://github.com/darianfrajberg/polimidl darian.frajberg@polimi.it
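
The weights fusion named on slide 13 can be illustrated with Batch Normalization. The sketch below is an illustration only, not PolimiDL's code: the function names, the per-channel list layout and the `eps` default are assumptions. It relies on the standard inference-time BN formula, y = gamma * (x - mean) / sqrt(variance + eps) + beta, which collapses to y = scale * x + shift once the constants are pre-computed.

```python
import math


def fuse_batch_norm(gamma, beta, mean, variance, eps=1e-5):
    """Pre-compute one (scale, shift) pair per channel.

    Hypothetical helper: folds the four BN constants into two, so that
    BN(x) == scale * x + shift at inference time.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, variance)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift


def batch_norm_reference(x, gamma, beta, mean, variance, eps=1e-5):
    """Unfused BN: subtract, divide by a square root, multiply, add."""
    return [
        g * (xi - m) / math.sqrt(v + eps) + b
        for xi, g, b, m, v in zip(x, gamma, beta, mean, variance)
    ]


def batch_norm_fused(x, scale, shift):
    """Fused BN: a single multiply-add per value."""
    return [s * xi + t for xi, s, t in zip(x, scale, shift)]


# Example constants for two channels (made-up values).
gamma, beta = [1.5, 0.5], [0.1, -0.2]
mean, variance = [0.3, 1.0], [4.0, 0.25]
scale, shift = fuse_batch_norm(gamma, beta, mean, variance)
fused_output = batch_norm_fused([2.0, -1.0], scale, shift)
```

The fusion is done once, before execution, so the run-time loop no longer computes square roots or divisions; both versions should agree up to floating-point rounding.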

Editor's Notes

  1. Compression techniques target large-scale architectures and aim at reducing the number of parameters and floating point operations (FLOPs), possibly tolerating small accuracy drops in favor of execution acceleration and optimization of computational resources, storage, memory occupation and energy consumption. Lightweight architectures with compact layers pursue the design of an optimized network topology, yielding small, fast and accurate models, suitable for resource-constrained devices. Hardware acceleration (HA) is the use of dedicated hardware to complement general-purpose CPUs and perform computationally intensive work more efficiently, e.g. by favoring specific operations and data-parallel computation.
  2. Heterogeneous computing scheduling comprises the design of strategies to efficiently coordinate and distribute the workload among processors of different types. Frameworks for the execution of DL models on mobile and embedded systems pursue optimized deployment on devices with limited resources, by managing memory allocation efficiently and exploiting the available hardware resources as fully as possible.
  3. Optimized execution requires managing memory allocation efficiently, to avoid overloading, and exploiting the available hardware resources for acceleration, which is not trivial given the non-standardized access to such resources.
  4. The evaluation uses hardware with limited resources and models with a small-size architecture that achieve a good trade-off between accuracy and latency. Three models with diverse characteristics, listed in Table 2, are evaluated.
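
The efficient memory management mentioned in note 3 relates to the "tick-tock" piping of slide 17, where subsequent layers swap their input and output buffers. The sketch below is a simplified illustration under stated assumptions, not the framework's implementation: the layer signature `(src, dst, n) -> n_out`, the `in_place` flag and all function names are hypothetical. It shows how a whole chain of layers can run with only two pre-allocated buffers.

```python
def scale_by_two(src, dst, n):
    """Out-of-place layer: writes 2 * src[i] into dst."""
    for i in range(n):
        dst[i] = 2.0 * src[i]
    return n


def relu(src, dst, n):
    """Element-wise ReLU; safe to run in place (src is dst)."""
    for i in range(n):
        dst[i] = max(0.0, src[i])
    return n


def pairwise_sum(src, dst, n):
    """Out-of-place layer that halves the length by summing adjacent pairs."""
    out_n = n // 2
    for i in range(out_n):
        dst[i] = src[2 * i] + src[2 * i + 1]
    return out_n


def run_tick_tock(layers, inputs, buffer_size):
    """Run (layer, in_place) pairs using two buffers allocated once."""
    assert len(inputs) <= buffer_size
    tick = [0.0] * buffer_size
    tock = [0.0] * buffer_size
    n = len(inputs)
    tick[:n] = inputs
    src, dst = tick, tock
    for layer, in_place in layers:
        if in_place:
            # In-place layers read and write the same buffer: no swap.
            n = layer(src, src, n)
        else:
            # The output of this layer becomes the input of the next one.
            n = layer(src, dst, n)
            src, dst = dst, src
    return src[:n]


pipeline = [(scale_by_two, False), (relu, True), (pairwise_sum, False)]
result = run_tick_tock(pipeline, [1.0, -2.0, 3.0, -4.0], buffer_size=4)
```

With this layout the memory footprint depends on the largest layer input or output rather than on the number of layers, which is the property the slide attributes to inverting the buffers.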