For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/efficiently-map-ai-and-vision-applications-onto-multi-core-ai-processors-using-cevas-parallel-processing-framework-a-presentation-from-ceva/
Rami Drucker, Machine Learning Software Architect at CEVA, presents the “Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework” tutorial at the May 2023 Embedded Vision Summit.
Next-generation AI and computer vision applications for autonomous vehicles, cameras, drones and robots require higher-than-ever computing power. Often, the most efficient way to deliver high performance (especially in cost- and power-constrained applications) is to use multi-core processors. But developers must then map their applications onto the multiple cores in an efficient manner, which can be difficult. To address this challenge and streamline application development, CEVA has introduced the Architecture Planner tool as a new element in CEVA’s comprehensive AI SDK.
In this talk, Drucker shows how the Architecture Planner tool analyzes the network model and the processor configuration (number of cores, memory sizes), then automatically maps the workload onto the multiple cores in an efficient manner. He explains key techniques used by the tool, including symmetrical and asymmetrical multi-processing, partition by sub-graphs, batch partitioning and pipeline partitioning.
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework,” a Presentation from CEVA
1. Efficiently Map AI and Vision Applications onto Multi-Core AI Processors Using CEVA’s Parallel Processing Framework
Rami Drucker
SW Architect
CEVA
2. Executive Summary
• Next-generation AI and CV applications require higher-than-ever computing power.
• Edge devices use multi-core processors to deliver high performance.
• However, developers must efficiently map their applications onto the multiple cores, which can be difficult.
• CEVA has introduced the Architecture Planner tool as a new element of CDNN, CEVA’s comprehensive AI SDK.
• In this talk, we’ll show how the Architecture Planner tool analyzes the network model and maps the workload onto the multiple cores in an efficient manner.
• We’ll explain key techniques used by the tool, including symmetrical and asymmetrical multi-processing paradigms.
3. NeuPro-M – A Family of AI Processors
A Full System Solution
[Block diagram: the NeuPro-M common subsystem contains the NeuPro-M master controller (optimized for performance, control and code size), data and weights compression, an AXI matrix, host and real-time interfaces, functional safety and security logic, and core shared memory, serving engines #1 through #8. Each NeuPro-M engine (AI core) contains a mixed-precision neural engine with 4K MACs, a complementary activation unit, an unstructured sparsity engine (data/weights), a Winograd transform engine, a matrix decomposition engine, a vector processor unit, an engine local controller, and engine-local shared memory.]
5. CDNN Toolchain Workflow
[Workflow diagram, four stages:
1. Neural network training – training framework plus image database, with an optimizer tool.
2. Architecture Planner tool – system planner, L1 and L2 analyzers, graph manager, quantizer (Winograd, mixed precision, BW level) and precision wizard, driven by user configuration (L1/L2 MSS size, DDR max BW, DDR latency) and the network data-flow resolution. Outputs: BW to DDR, L1-L2 BW, cycle count, overall performance, precision, layer performance.
3. CDNN offline optimization – produces a customized fixed-point network plus weights from the input/hidden/output graph.
4. Real-time multi-network inferencing – parallel processing of multiple networks running over several engines/cores; sample input image captioned “Young girl with long hair riding a bicycle, with a cat in a front basket”.]
FEATURES
• Graph compiler: memory optimizations, BW reduction, code size reduction
• Precision: scale per channel, scale per layer, scale per module
• CDNN-Invite: node API, custom device API
• Open-source front-end
• QAT: quantization-aware training
• Pruning: structured, semi-structured, unstructured
• HW-aware network optimization: model size reduction
6. CDNN Multi-Engine Features
• CDNN supports the following parallel processing paradigms to leverage the compute power of multi-engine devices and maximize overall system throughput:
• Partitioning by tiles: different tiles, shared weights. This is efficient in the first layers, where input maps are large.
• Partitioning by output maps: same tile, different weights.
• Partitioning by sub-graphs: different sub-graphs are assigned to engines.
• Pipeline partitioning: a sequence of nodes is assigned to each engine.
• Batch partitioning: same network, different input image per engine.
• Partitioning by networks: a different network per engine.
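As a rough illustration of two of these paradigms, here is a minimal Python sketch of batch partitioning and partitioning by networks. The engine indices, frame names, and network names are hypothetical; this is not the CDNN API, just the assignment logic the bullets describe.

```python
# Hypothetical sketch: distributing work across engines under two of the
# paradigms above. Engines are modeled as plain integer indices.

def batch_partition(frames, num_engines):
    """Batch partitioning: same network on every engine, each engine
    receives a different input frame (round-robin)."""
    assignment = {e: [] for e in range(num_engines)}
    for i, frame in enumerate(frames):
        assignment[i % num_engines].append(frame)
    return assignment

def network_partition(networks, num_engines):
    """Partitioning by networks: a different network per engine,
    wrapping around when there are more networks than engines."""
    return {e: [n for i, n in enumerate(networks) if i % num_engines == e]
            for e in range(num_engines)}

print(batch_partition(["f0", "f1", "f2", "f3", "f4"], 2))
# {0: ['f0', 'f2', 'f4'], 1: ['f1', 'f3']}
print(network_partition(["detect", "segment", "lane"], 2))
# {0: ['detect', 'lane'], 1: ['segment']}
```

Both helpers only decide placement; the actual dispatch to engines is handled by the runtime.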
7. Quad-Engine Core – Segmentation Network Data Flow
[Diagram: a NeuPro-M AI processor with four NPM engines (#0–#3), each paired with its own L1 memory and connected over AXI to an L2 shared MSS and to external memory / an ISP real-time client. The input image is partitioned along X and Y into four tiles with overlap areas at the borders; the engines process the tiles in parallel to produce the segmentation output.]
8. Parallel Processing Paradigms
NeuPro-M SDK – adaptive computing methods:
[Diagram, four paradigms mapped onto NeuPro-M engines #0–#3:
• Model by model: a different network per engine (object detection, image processing, lane detection, free space segmentation).
• Frame by frame: the same object detection network on every engine, each processing a different frame.
• Layer by layer: the layers of one neural network pipelined from input to output across the engines.
• Tile by tile: one free space segmentation network with the input split into four tiles, one per engine.]
10. CDNN Multi-Engine Offline Algorithm
[Example directed graph: nodes a through u, each annotated with its cost in cycles (e.g. a = 4, b = 7, c = 5, n = 10, t = 19, u = 1).]
• Given graph G = (V, E)
• Given input and output nodes (a, u)
• Given a cost function f(v) for each node
• We have multiple paths from a to u
• Parallel execution is appropriate in this case
• We will attempt to parallelize the model across 3 DSPs
11. CDNN Multi-Engine Offline Algorithm
[The same example graph, with nodes arranged in levels 0–6; scheduling begins by assigning node a.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4
12. CDNN Offline Algorithm – Layer 1
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
13. CDNN Offline Algorithm – Layer 2
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
                                                f      13     17
14. CDNN Offline Algorithm – Layer 3
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
                        m      20     24        k      17     21
                                                h      21     25
15. CDNN Offline Algorithm – Layer 4
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
                                                o      25     29
16. CDNN Offline Algorithm – Layer 5
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
                        t      27     46        o      25     29
                                                s      29     41
17. CDNN Offline Algorithm – Layer 6
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
u      46     47        t      27     46        o      25     29
                                                s      29     41
18. CDNN Offline Algorithm – Finish
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
-----  30     46        t      27     46        o      25     29
u      46     47                                s      29     41

Total inference time for 3 DSPs: 47
Single-DSP inference time is the sum of all node costs: 4+5+7+10+9*4+3+13+13+19+1 = 111
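The walkthrough above can be approximated by a greedy list scheduler: each node becomes ready when its predecessors finish, and is placed on the DSP that can start it earliest. The sketch below is a simplified stand-in for CEVA's offline algorithm (which, as later slides note, also uses backtracking); the small diamond-shaped graph and node costs in the usage example are hypothetical, not the a–u graph from the slides.

```python
# A minimal sketch of greedy list scheduling across DSPs, in the spirit
# of the layer-by-layer walkthrough above. Not CEVA's actual algorithm;
# the example graph and costs below are hypothetical.

def schedule(graph, cost, num_dsps):
    """graph: node -> list of successors; cost: node -> cycles.
    Place each ready node on the DSP that can start it earliest.
    Returns (makespan, plan) with plan entries (node, dsp, start, finish)."""
    preds = {v: set() for v in cost}
    for u, succs in graph.items():
        for v in succs:
            preds[v].add(u)
    finish, free, plan = {}, [0] * num_dsps, []
    while len(finish) < len(cost):
        # nodes whose predecessors have all been scheduled
        ready = [v for v in sorted(cost) if v not in finish
                 and preds[v] <= set(finish)]
        # take the ready node whose inputs become available earliest
        v = min(ready, key=lambda n: max((finish[p] for p in preds[n]), default=0))
        avail = max((finish[p] for p in preds[v]), default=0)
        d = min(range(num_dsps), key=lambda i: max(free[i], avail))
        start = max(free[d], avail)
        finish[v] = free[d] = start + cost[v]
        plan.append((v, f"DSP{d + 1}", start, finish[v]))
    return max(finish.values()), plan

# Tiny diamond graph: a feeds b and c, which both feed d.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 4, "b": 5, "c": 3, "d": 2}
makespan, plan = schedule(graph, cost, 2)
print(makespan)  # 11 on 2 DSPs, vs 4+5+3+2 = 14 on a single DSP
```

Just as on the slides, the dependency structure caps the speedup: node d cannot start before both b and c finish, so some DSP idle time is unavoidable.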
19. Example: Inception_v3
• The algorithm performs load balancing across sub-graphs to minimize DSP idle time
• Parallel execution is appropriate in this case
• We will attempt to parallelize the model across 1, 2, 3, and 4 DSPs and compare the results
• The algorithm implements backtracking traversal of the graph to limit the computational complexity of the process
20. Results: Inception_v3
• The backtracking algorithm produced the best results
• The bar chart compares the performance in cycles of a single DSP vs. two, three, and four DSPs
[Bar chart – execution cycles vs. number of DSP units: 1 DSP: 3,568,884; 2 DSPs: 2,166,036; 3 DSPs: 2,013,499; 4 DSPs: 2,001,637]
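The reported cycle counts imply speedups that flatten beyond two DSPs, which a few lines of Python make explicit (the cycle numbers are from the chart above; the computation is ours):

```python
# Inception_v3 execution cycles by DSP count, as reported on the slide.
cycles = {1: 3_568_884, 2: 2_166_036, 3: 2_013_499, 4: 2_001_637}

# Speedup relative to a single DSP.
speedup = {n: round(cycles[1] / c, 2) for n, c in cycles.items()}
print(speedup)  # {1: 1.0, 2: 1.65, 3: 1.77, 4: 1.78}
```

The jump from 3 to 4 DSPs buys almost nothing, suggesting the sub-graph parallelism in this network is largely exhausted by two to three engines.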
21. CDNN Multi-Engine Pipeline Execution
[Diagram: pipeline stages on DSP1, DSP2, and DSP3 over time.]
• Many neural networks comprise a long thread of nodes that are computed one after the other
• Parallelization is difficult because there are no nodes that can run in parallel
• A pipeline approach is adequate in this case
• We will attempt to parallelize the model across 3 DSPs
• To minimize idle time, the algorithm needs to balance the load across DSPs
22. Example: Mobilenet_v2
• Given graph G = (V, E)
• The resulting frame rate equals 1/CYCLE_COUNT of the slowest section of the pipeline
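That frame-rate rule is easy to sketch: in steady state a pipeline completes one frame per interval of its slowest section. The per-stage cycle counts below are hypothetical, not Mobilenet_v2 measurements:

```python
# Steady-state pipeline throughput: a new frame completes every
# max(stage_cycles) cycles, regardless of the faster stages.

def pipeline_cycles_per_frame(stage_cycles):
    """Cycles between completed frames for a full pipeline."""
    return max(stage_cycles)

stages = [40_000, 55_000, 48_000]  # hypothetical per-DSP section costs
print(pipeline_cycles_per_frame(stages))  # 55000
# Frames per second at a given clock is clock_hz / max(stage_cycles):
print(round(1e9 / pipeline_cycles_per_frame(stages)))  # 18182 at 1 GHz
```

This is why the algorithm balances load across DSPs: shaving cycles off the slowest section raises throughput, while speeding up any other section does not.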
23. Results: Mobilenet_v2
• The pipeline algorithm produced the best result with 4 DSPs
[Bar chart – execution cycles vs. number of DSP units: 1 DSP: 165,376; 2 DSPs: 84,436; 3 DSPs: 64,468; 4 DSPs: 43,659]
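In contrast with the Inception_v3 sub-graph results, these pipeline numbers scale almost linearly (the cycle counts are from the chart above; the speedup calculation is ours):

```python
# Mobilenet_v2 execution cycles by DSP count, as reported on the slide.
cycles = {1: 165_376, 2: 84_436, 3: 64_468, 4: 43_659}

# Speedup relative to a single DSP: close to linear through 4 DSPs.
speedup = {n: round(cycles[1] / c, 2) for n, c in cycles.items()}
print(speedup)  # {1: 1.0, 2: 1.96, 3: 2.57, 4: 3.79}
```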
24. Summary and Conclusions
• We introduced CEVA NeuPro-M, a multi-engine processor family for AI and CV applications.
• These cores are complemented by CDNN for multi-engine, a highly optimized graph compiler and runtime framework.
• We presented the results of the sub-graph and pipeline algorithms, developed at CEVA to distribute the network inference workload across multiple engines.
• Surprisingly, even for Inception_v3, the pipeline approach outperformed the backtracking technique.
• Further testing would need to be carried out to confirm these initial results.