For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/efficiently-map-ai-and-vision-applications-onto-multi-core-ai-processors-using-cevas-parallel-processing-framework-a-presentation-from-ceva/
Rami Drucker, Machine Learning Software Architect at CEVA, presents the “Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework” tutorial at the May 2023 Embedded Vision Summit.
Next-generation AI and computer vision applications for autonomous vehicles, cameras, drones and robots require higher-than-ever computing power. Often, the most efficient way to deliver high performance (especially in cost- and power-constrained applications) is to use multi-core processors. But developers must then map their applications onto the multiple cores in an efficient manner, which can be difficult. To address this challenge and streamline application development, CEVA has introduced the Architecture Planner tool as a new element in CEVA’s comprehensive AI SDK.
In this talk, Drucker shows how the Architecture Planner tool analyzes the network model and the processor configuration (number of cores, memory sizes), then automatically maps the workload onto the multiple cores in an efficient manner. He explains key techniques used by the tool, including symmetrical and asymmetrical multi-processing, partition by sub-graphs, batch partitioning and pipeline partitioning.
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework,” a Presentation from CEVA
1. Efficiently Map AI and Vision Applications onto Multi-Core AI Processors Using CEVA’s Parallel Processing Framework
Rami Drucker
SW Architect
CEVA
2. Executive Summary
• Next-generation AI and CV applications require higher-than-ever computing power.
• Edge devices use multi-core processors to deliver high performance.
• However, developers must efficiently map their applications onto the multiple cores, which can be difficult.
• CEVA has introduced the Architecture Planner tool as a new element of CDNN, CEVA’s comprehensive AI SDK.
• In this talk, we’ll show how the Architecture Planner tool analyzes the network model and maps the workload onto the multiple cores in an efficient manner.
• We’ll explain key techniques used by the tool, including symmetrical and asymmetrical multi-processing paradigms.
3. NeuPro-M – A Family of AI Processors
A Full System Solution
[Block diagram: the NeuPro-M common subsystem contains the NeuPro-M master controller (optimized for performance, control and code size), data and weights compression, an AXI matrix, host and real-time interfaces, functional safety and security logic, and core shared memory, serving engines #1 through #8. Each NeuPro-M engine (AI core) contains a mixed-precision neural engine with 4K MACs, a complementary activation unit, an unstructured sparsity engine (data/weights), a Winograd transform engine, a matrix decomposition engine, a vector processor unit, an engine local controller, and engine-local shared memory.]
5. CDNN Toolchain Workflow
[Workflow diagram, four stages:
1. Neural network training – training framework plus image database, with an optimizer tool.
2. Architecture Planner tool – system planner, L1 and L2 analyzers, graph manager, quantizer (Winograd, mixed precision, BW level) and precision wizard, driven by user configuration (L1/L2 MSS size, DDR max BW, DDR latency) and the network data-flow resolution. Outputs: BW to DDR, L1-L2 BW, cycle count, overall performance, precision, layer performance.
3. CDNN offline optimization – produces a customized fixed-point network plus weights from the input/hidden/output graph.
4. Real-time multi-network inferencing – parallel processing of multiple networks running over several engines/cores; sample input image captioned “Young girl with long hair riding a bicycle, with a cat in a front basket”.]
FEATURES
• Graph compiler: memory optimizations, BW reduction, code size reduction
• Precision: scale per channel, scale per layer, scale per module
• CDNN-Invite: node API, custom device API
• Open-source front-end
• QAT: quantization-aware training
• Pruning: structured, semi-structured, unstructured
• HW-aware network optimization: model size reduction
6. CDNN Multi-Engine Features
• CDNN supports the following parallel processing paradigms to leverage the compute power of multi-engine devices and maximize overall system throughput:
• Partitioning by tiles: different tiles, shared weights. This is efficient in the first layers, where input maps are large.
• Partitioning by output maps: same tile, different weights.
• Partitioning by sub-graphs: different sub-graphs are assigned to engines.
• Pipeline partitioning: a sequence of nodes is assigned to each engine.
• Batch partitioning: same network, different input image per engine.
• Partitioning by networks: a different network per engine.
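As a rough illustration of two of these paradigms, here is a minimal Python sketch of batch partitioning and partitioning by networks. The engine indices, frame names, and network names are hypothetical; this is not the CDNN API, just the assignment logic the bullets describe.

```python
# Hypothetical sketch: distributing work across engines under two of the
# paradigms above. Engines are modeled as plain integer indices.

def batch_partition(frames, num_engines):
    """Batch partitioning: same network on every engine, each engine
    receives a different input frame (round-robin)."""
    assignment = {e: [] for e in range(num_engines)}
    for i, frame in enumerate(frames):
        assignment[i % num_engines].append(frame)
    return assignment

def network_partition(networks, num_engines):
    """Partitioning by networks: a different network per engine,
    wrapping around when there are more networks than engines."""
    return {e: [n for i, n in enumerate(networks) if i % num_engines == e]
            for e in range(num_engines)}

print(batch_partition(["f0", "f1", "f2", "f3", "f4"], 2))
# {0: ['f0', 'f2', 'f4'], 1: ['f1', 'f3']}
print(network_partition(["detect", "segment", "lane"], 2))
# {0: ['detect', 'lane'], 1: ['segment']}
```

Both helpers only decide placement; the actual dispatch to engines is handled by the runtime.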
7. Quad-Engine Core – Segmentation Network Data Flow
[Diagram: a NeuPro-M AI processor with four NPM engines (#0–#3), each paired with its own L1 memory and connected over AXI to an L2 shared MSS and to external memory / an ISP real-time client. The input image is partitioned along X and Y into four tiles with overlap areas at the borders; the engines process the tiles in parallel to produce the segmentation output.]
8. Parallel Processing Paradigms
NeuPro-M SDK – adaptive computing methods:
[Diagram, four paradigms mapped onto NeuPro-M engines #0–#3:
• Model by model: a different network per engine (object detection, image processing, lane detection, free space segmentation).
• Frame by frame: the same object detection network on every engine, each processing a different frame.
• Layer by layer: the layers of one neural network pipelined from input to output across the engines.
• Tile by tile: one free space segmentation network with the input split into four tiles, one per engine.]
10. CDNN Multi-Engine Offline Algorithm
[Example directed graph: nodes a through u, each annotated with its cost in cycles (e.g. a = 4, b = 7, c = 5, n = 10, t = 19, u = 1).]
• Given graph G = (V, E)
• Given input and output nodes (a, u)
• Given a cost function f(v) for each node
• We have multiple paths from a to u
• Parallel execution is appropriate in this case
• We will attempt to parallelize the model across 3 DSPs
11. CDNN Multi-Engine Offline Algorithm
[The same example graph, with nodes arranged in levels 0–6; scheduling begins by assigning node a.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4
12. CDNN Offline Algorithm – Layer 1
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
13. CDNN Offline Algorithm – Layer 2
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
                                                f      13     17
14. CDNN Offline Algorithm – Layer 3
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
                        m      20     24        k      17     21
                                                h      21     25
15. CDNN Offline Algorithm – Layer 4
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
                                                o      25     29
16. CDNN Offline Algorithm – Layer 5
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
                        t      27     46        o      25     29
                                                s      29     41
17. CDNN Offline Algorithm – Layer 6
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
u      46     47        t      27     46        o      25     29
                                                s      29     41
18. CDNN Offline Algorithm – Finish
[The same example graph, levels 0–6.]

Dsp1   Start  Finish    Dsp2   Start  Finish    Dsp3   Start  Finish
a      0      4         -----  0      4         -----  0      4
n      4      14        b      4      11        c      4      9
e      14     18        g      11     14        d      9      13
j      18     22        l      14     20        f      13     17
p      22     26        m      20     24        k      17     21
q      26     30        r      24     27        h      21     25
-----  30     46        t      27     46        o      25     29
u      46     47                                s      29     41

Total inference time for 3 DSPs: 47
Single-DSP inference time is the sum of all node costs: 4+5+7+10+9*4+3+13+13+19+1 = 111
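The walkthrough above can be approximated by a greedy list scheduler: each node becomes ready when its predecessors finish, and is placed on the DSP that can start it earliest. The sketch below is a simplified stand-in for CEVA's offline algorithm (which, as later slides note, also uses backtracking); the small diamond-shaped graph and node costs in the usage example are hypothetical, not the a–u graph from the slides.

```python
# A minimal sketch of greedy list scheduling across DSPs, in the spirit
# of the layer-by-layer walkthrough above. Not CEVA's actual algorithm;
# the example graph and costs below are hypothetical.

def schedule(graph, cost, num_dsps):
    """graph: node -> list of successors; cost: node -> cycles.
    Place each ready node on the DSP that can start it earliest.
    Returns (makespan, plan) with plan entries (node, dsp, start, finish)."""
    preds = {v: set() for v in cost}
    for u, succs in graph.items():
        for v in succs:
            preds[v].add(u)
    finish, free, plan = {}, [0] * num_dsps, []
    while len(finish) < len(cost):
        # nodes whose predecessors have all been scheduled
        ready = [v for v in sorted(cost) if v not in finish
                 and preds[v] <= set(finish)]
        # take the ready node whose inputs become available earliest
        v = min(ready, key=lambda n: max((finish[p] for p in preds[n]), default=0))
        avail = max((finish[p] for p in preds[v]), default=0)
        d = min(range(num_dsps), key=lambda i: max(free[i], avail))
        start = max(free[d], avail)
        finish[v] = free[d] = start + cost[v]
        plan.append((v, f"DSP{d + 1}", start, finish[v]))
    return max(finish.values()), plan

# Tiny diamond graph: a feeds b and c, which both feed d.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 4, "b": 5, "c": 3, "d": 2}
makespan, plan = schedule(graph, cost, 2)
print(makespan)  # 11 on 2 DSPs, vs 4+5+3+2 = 14 on a single DSP
```

Just as on the slides, the dependency structure caps the speedup: node d cannot start before both b and c finish, so some DSP idle time is unavoidable.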
19. Example: Inception_v3
• The algorithm performs load balancing across sub-graphs to minimize DSP idle time
• Parallel execution is appropriate in this case
• We will attempt to parallelize the model across 1, 2, 3, and 4 DSPs and compare the results
• The algorithm implements backtracking traversal of the graph to limit the computational complexity of the process
20. Results: Inception_v3
• The backtracking algorithm produced the best results
• The bar chart compares the performance in cycles of a single DSP vs. two, three, and four DSPs
[Bar chart – execution cycles vs. number of DSP units: 1 DSP: 3,568,884; 2 DSPs: 2,166,036; 3 DSPs: 2,013,499; 4 DSPs: 2,001,637]
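The reported cycle counts imply speedups that flatten beyond two DSPs, which a few lines of Python make explicit (the cycle numbers are from the chart above; the computation is ours):

```python
# Inception_v3 execution cycles by DSP count, as reported on the slide.
cycles = {1: 3_568_884, 2: 2_166_036, 3: 2_013_499, 4: 2_001_637}

# Speedup relative to a single DSP.
speedup = {n: round(cycles[1] / c, 2) for n, c in cycles.items()}
print(speedup)  # {1: 1.0, 2: 1.65, 3: 1.77, 4: 1.78}
```

The jump from 3 to 4 DSPs buys almost nothing, suggesting the sub-graph parallelism in this network is largely exhausted by two to three engines.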
21. CDNN Multi-Engine Pipeline Execution
[Diagram: pipeline stages on DSP1, DSP2, and DSP3 over time.]
• Many neural networks comprise a long thread of nodes that are computed one after the other
• Parallelization is difficult because there are no nodes that can run in parallel
• A pipeline approach is adequate in this case
• We will attempt to parallelize the model across 3 DSPs
• To minimize idle time, the algorithm needs to balance the load across DSPs
22. Example: Mobilenet_v2
• Given graph G = (V, E)
• The resulting frame rate equals 1/CYCLE_COUNT of the slowest section of the pipeline
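That frame-rate rule is easy to sketch: in steady state a pipeline completes one frame per interval of its slowest section. The per-stage cycle counts below are hypothetical, not Mobilenet_v2 measurements:

```python
# Steady-state pipeline throughput: a new frame completes every
# max(stage_cycles) cycles, regardless of the faster stages.

def pipeline_cycles_per_frame(stage_cycles):
    """Cycles between completed frames for a full pipeline."""
    return max(stage_cycles)

stages = [40_000, 55_000, 48_000]  # hypothetical per-DSP section costs
print(pipeline_cycles_per_frame(stages))  # 55000
# Frames per second at a given clock is clock_hz / max(stage_cycles):
print(round(1e9 / pipeline_cycles_per_frame(stages)))  # 18182 at 1 GHz
```

This is why the algorithm balances load across DSPs: shaving cycles off the slowest section raises throughput, while speeding up any other section does not.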
23. Results: Mobilenet_v2
• The pipeline algorithm produced the best result with 4 DSPs
[Bar chart – execution cycles vs. number of DSP units: 1 DSP: 165,376; 2 DSPs: 84,436; 3 DSPs: 64,468; 4 DSPs: 43,659]
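In contrast with the Inception_v3 sub-graph results, these pipeline numbers scale almost linearly (the cycle counts are from the chart above; the speedup calculation is ours):

```python
# Mobilenet_v2 execution cycles by DSP count, as reported on the slide.
cycles = {1: 165_376, 2: 84_436, 3: 64_468, 4: 43_659}

# Speedup relative to a single DSP: close to linear through 4 DSPs.
speedup = {n: round(cycles[1] / c, 2) for n, c in cycles.items()}
print(speedup)  # {1: 1.0, 2: 1.96, 3: 2.57, 4: 3.79}
```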
24. Summary and Conclusions
• We introduced CEVA NeuPro-M, a multi-engine processor family for AI and CV applications.
• These cores are complemented by CDNN for multi-engine, a highly optimized graph compiler and runtime framework.
• We presented the results of the sub-graph and pipeline algorithms, developed at CEVA to distribute the network inference workload across multiple engines.
• Surprisingly, even for Inception_v3, the pipeline approach outperformed the backtracking technique.
• Further testing would need to be carried out to confirm these initial results.