Moldable pipelines for CNNs on heterogeneous edge devices

The LEGaTO project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the grant agreement No 780681. www.legato-project.eu
Pirah Noor Soomro, Chalmers University of Technology
A framework for running CNNs efficiently on heterogeneous edge devices that contain different types of compute resources.
• We implement a brief, guided online training phase to find a near-optimal configuration for a balanced pipeline.
• We design a simple, programmer-friendly interface and use the information it provides to generate a high-throughput, balanced CNN pipeline.
Motivation
• Modern edge devices contain heterogeneous core configurations on a single chip.
• Existing DNN libraries do not provide heterogeneity-aware implementations of CNNs targeting edge devices.
• Existing solutions [1,2] for CNN pipelines on edge devices require offline training followed by an exhaustive DSE (Domain Search space Exploration).
Background
Edge Devices: Nvidia Jetson TX2
4 energy-efficient cores, 2 high-performance cores
Methodology
References
1. Wang, Siqi, et al. "High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2019).
2. Lu, Zongqing, et al. "Modeling the resource requirements of convolutional neural networks on mobile devices." Proceedings of the 25th ACM International Conference on Multimedia. 2017.
Conclusion
• A balanced VGG pipeline increases throughput by 22% compared to the baseline.
• Computational hints provide a good seed for exploring near-optimal configurations.
• Our approach combines offline partitioning with online molding (changing the number of cores) of pipeline stages to generate a balanced pipeline.
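As an illustration of what molding could look like, the following is a minimal Python sketch, not the authors' algorithm. It assumes identical cores, perfect scaling inside a stage, and a known amount of work per stage; the names `mold`, `work` and `min_cores` are hypothetical. It greedily moves one core at a time from the fastest stage to the slowest stage for as long as the bottleneck stage time improves. In an online setting, measured stage times would take the place of the `w / c` estimate.

```python
def mold(work, cores, min_cores=1):
    """Greedy molding sketch: shift one core at a time from the fastest
    stage to the slowest stage while the bottleneck time improves.

    work  -- estimated work per pipeline stage (arbitrary units)
    cores -- initial number of cores per stage
    Assumes identical cores and perfect scaling within a stage.
    """
    cores = list(cores)
    while True:
        # Estimated time of each stage under the current core assignment.
        times = [w / c for w, c in zip(work, cores)]
        slow = max(range(len(times)), key=times.__getitem__)
        # Stages that can give up a core without dropping below min_cores.
        donors = [i for i in range(len(cores))
                  if i != slow and cores[i] > min_cores]
        if not donors:
            return cores
        fast = min(donors, key=times.__getitem__)
        trial = cores.copy()
        trial[slow] += 1
        trial[fast] -= 1
        new_bottleneck = max(w / c for w, c in zip(work, trial))
        if new_bottleneck >= times[slow]:
            return cores  # no further improvement possible
        cores = trial


# A heavy first stage attracts cores from the two light stages.
print(mold([8, 2, 2], [2, 2, 2]))  # [4, 1, 1]
```

On a heterogeneous device such as the TX2, cores of different types are not interchangeable, so a real implementation would also have to account for per-core-type speed.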
Convolutional Neural Networks
• Consecutive layers of computationally intensive convolutional kernels.
• Each layer has a different computational complexity, represented by its input descriptors.
• The figure on the right shows the VGG-16 CNN.
• Widely used for classification on streaming input data.
• A pipelined implementation is favored for streaming input.
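To make the idea of per-layer complexity concrete, here is a small Python sketch that estimates the multiply-accumulate (MAC) count of a convolutional layer from a descriptor. The poster does not specify the descriptor format or the cost formula, so the `ConvDescriptor` class, its fields, and the MAC formula (stride 1, output spatial size equal to input size) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvDescriptor:
    """Hypothetical input descriptor for one convolutional layer."""
    height: int        # input feature-map height
    width: int         # input feature-map width
    in_channels: int   # number of input channels
    out_channels: int  # number of filters
    kernel: int = 3    # side length of the square kernel

    def macs(self) -> int:
        # Assumes stride 1 and padding that keeps the output the same
        # spatial size as the input, as in the standard VGG-16 layers.
        return (self.height * self.width * self.in_channels
                * self.out_channels * self.kernel * self.kernel)


# First two and one late convolution of a standard VGG-16 (224x224x3 input).
conv1_1 = ConvDescriptor(224, 224, 3, 64)
conv1_2 = ConvDescriptor(224, 224, 64, 64)
conv5_1 = ConvDescriptor(14, 14, 512, 512)

print(conv1_1.macs())  # 86704128
print(conv1_2.macs())  # 1849688064
print(conv5_1.macs())  # 462422016
```

Even within one network the estimated cost differs by more than an order of magnitude between layers, which is why splitting layers evenly by count does not give evenly loaded stages.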
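[Flowchart, labels recovered from damaged text: Network description; Generate computational hints from description; Generate pipeline stages; Run DSE using hints as seed; Pipeline balanced? (No / Yes); Process …; Performance degraded? (No / Yes, interference detected)]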
[Figure 1 plot: vertical axis 15 to 33; horizontal axis configurations 1 to 4]
[Figure 2 plot: vertical axis computational intensity, 0 to 180000; horizontal axis configurations 1 to 4; series PS1 to PS6]
Experiments and observations
1) 3PS: 2D-2A57-2A57 [7-7-7]
2) 6PS: 1D-1D-1A57-1A57-1A57-1A57 [4-4-4-4-3-2]
3) 3PS: 2D-2A57-2A57 [7-4-10]
4) 2PS: 4A57-2D [13-8]
Figure 4. Timeline (horizontal axis: time) of a 4-stage pipeline on a 20-core machine. The training phase tries various pipeline configurations to select the best one for a balanced pipeline.
VGG pipelines on TX2
Four different pipeline configurations are tested on TX2.
• Figure 1 shows that configuration 3 is the fastest.
• Figure 2 supports this observation: configuration 3 has the most balanced distribution of computation among its 3 pipeline stages.
• Figure 3 presents a timeline view of the pipelines. Configurations 1, 2 and 4 are imbalanced, while configuration 3 yields a comparatively balanced pipeline.
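The balance comparison can be approximated with a short Python sketch that sums an estimated per-layer cost over each pipeline stage. The layer splits are the four bracketed configurations listed above (read here as layers per stage, which sum to 21 in every case). The per-layer costs are an assumption: rounded millions of MACs for a standard VGG-16 with a 224x224x3 input, with pooling layers counted as zero. They are not the values plotted in Figure 2, and the function names are hypothetical.

```python
from itertools import accumulate

# Assumed per-layer cost (millions of MACs) for the 21 VGG-16 layers:
# 13 convolutions, 5 max-pool layers (counted as 0), 3 fully connected.
LAYER_COST = [87, 1850, 0, 925, 1850, 0, 925, 1850, 1850, 0,
              925, 1850, 1850, 0, 462, 462, 462, 0, 103, 17, 4]


def stage_loads(layers_per_stage, costs=LAYER_COST):
    """Sum the layer costs of each stage for a contiguous split."""
    if sum(layers_per_stage) != len(costs):
        raise ValueError("split must cover every layer exactly once")
    bounds = [0, *accumulate(layers_per_stage)]
    return [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]


def imbalance(loads):
    """Heaviest stage relative to the mean stage load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


configs = {1: [7, 7, 7], 2: [4, 4, 4, 4, 3, 2], 3: [7, 4, 10], 4: [13, 8]}
for number, split in configs.items():
    loads = stage_loads(split)
    print(number, loads, round(imbalance(loads), 2))
```

Under these assumed costs, configuration 3 has the lowest imbalance of the four, which agrees with the observation above. The estimate ignores the speed difference between Denver and A57 cores and the number of cores per stage, so it is a starting hint rather than a prediction of stage time.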
[Diagram: TX2 core layout. Cores C0 to C3 share one L2 cache, and cores C4 and C5 share another; each core has its own L1I and L1D caches.]
Network description in template language
main(){
    …
    Conv1 = CONV(ip, op, weights);
    Conv2 = CONV(Conv1, op, weights);
    ….
    network.add(Conv1);
    network.add(Conv2);
    …
    network.execute();
}
TX2 cores: 4 A57s, 2 Denvers
Figure 1. Execution time (s) per input for 4 different VGG pipeline configurations (lower is better). The baseline is a data-parallel implementation of VGG-16 on TX2.
Figure 2. Distribution of computational load among pipeline stages (PS). The numbers are derived from the network input descriptors.
Figure 3. Timeline of VGG pipelines. Configuration 1, for example, reads as a 3-stage pipeline whose first stage is scheduled on 2 Denver cores, the second stage on 2 A57 cores and the third on the other 2 A57 cores. Configuration 3 is the most balanced of the four configurations.
[Diagram: kernel-level parallelism on the A57 and Denver cores (C0 to C5). Layers are executed one after another.]
[Diagram: 2-stage pipeline on TX2 (cores C0 to C5). One stage runs layers 1-10 and the other runs layers 11-21, each on successive inputs (Input 1, Input 2, Input 3).]
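The 2-stage pipeline in the diagram can be sketched in Python with one worker thread per stage and queues between stages. This is a structural illustration only: the stage functions are placeholders for "layers 1-10" and "layers 11-21", the name `run_pipeline` is hypothetical, and the sketch does not pin threads to A57 or Denver cores or use several cores per stage, as the framework on the poster does.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def run_pipeline(stages, inputs):
    """Run `inputs` through `stages`, one worker thread per stage."""
    # queues[i] feeds stage i; the last queue collects the results.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is _STOP:
                q_out.put(_STOP)  # forward shutdown to the next stage
                return
            q_out.put(fn(item))

    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(_STOP)

    results = []
    while (item := queues[-1].get()) is not _STOP:
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stage functions standing in for the two groups of layers.
layers_1_to_10 = lambda x: x + 1
layers_11_to_21 = lambda x: x * 2

print(run_pipeline([layers_1_to_10, layers_11_to_21], [1, 2, 3]))  # [4, 6, 8]
```

While the second stage works on Input 1, the first stage can already start Input 2, which is the overlap the diagram shows. Throughput is then bounded by the slowest stage, which is why balancing the stages matters.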
[Diagram: VGG-16 layers, in order: Conv 64 ×2, Maxpool, Conv 128 ×2, Maxpool, Conv 256 ×3, Maxpool, Conv 512 ×3, Maxpool, Conv 512 ×3, Maxpool, FC ×3 (21 layers)]
