December 8-10 | Virtual Event
An Open Flow for DNNs
on Ultra-Low-Power RISC-V Cores
Dr. Francesco Conti
Assistant Professor
University of Bologna
#RISCVSUMMIT
Real-World DNNs at Extreme-Edge?
preserve privacy
improve latency
enhance energy efficiency
training
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
How to bridge the algorithm <-> device divide?
training
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
1. Exploit hardware’s strengths
training
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
2. Keep compute units fed with data from on-chip memories; minimize off-chip access
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
3. Produce DNNs that are tuned to embedded hardware (e.g., int8)
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
PULP-NN
Parallel ULP
Neural Network library
NEMO
NEural Minimization
for pytOrch
DORY
Deployment Oriented to memoRY
Open-Source flow for DNN
Deployment on PULP devices
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
Real-World DNNs at Extreme-Edge?
PULP-NN
Parallel ULP
Neural Network library
NEMO
NEural Minimization
for pytOrch
DORY
Deployment Oriented to memoRY
THIS TALK
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
PULP-NN: optimized back-end
PULP-NN
Parallel ULP
Neural Network library
GreenWaves GAP8 Architecture
GreenWaves GAP8 is a RISC-V ultra-low-power processor based on the open-source PULP (Parallel Ultra-Low-Power) paradigm
- a fast microcontroller with autonomous I/O capability, support for off-chip memory (HyperRAM DRAM / Flash)
- RV32IMC in-order, 4-pipeline-stage core (RI5CY, now OpenHW Group CV32E40P)
GreenWaves GAP8 Architecture
GreenWaves GAP8 is a RISC-V ultra-low-power processor based on the open-source PULP (Parallel Ultra-Low-Power) paradigm
- a programmable cluster of RI5CY cores sharing memory at L1
- high-bandwidth, low-latency L1 scratchpad + DMA for manual memory management
- xPULPv2 ISA extension for enhanced DSP capabilities: hardware loops, address post-increment, SIMD scalar product
GreenWaves GAP8 Architecture
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Load 16 weights (8-bit): 4 out chan, 4 in chan (address post-increment)
[Figure: MatMul view of the layer: reduction dimension F x F x Kin, 4x2 output tile (out chan x rows), 69% utilization]
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Load 8 pixels: 2 rows, 4 in chan (address post-increment)
[Figure: MatMul view of the layer: reduction dimension F x F x Kin, 4x2 output tile (out chan x rows), 69% utilization]
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Compute 32 MACs over 8 accumulators (SIMD dot-product instructions)
[Figure: MatMul view of the layer: reduction dimension F x F x Kin, 4x2 output tile (out chan x rows), 69% utilization]
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Loop over in chan, filter size
[Figure: MatMul view of the layer: reduction dimension F x F x Kin, 4x2 output tile (out chan x rows), 69% utilization]
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Parallelize over 8 cores (column dimension)
[Figure: MatMul view: each of the 8 cores works on a different column of the 4x2 tiles (out chan x rows)]
PULP-NN: optimized computational back-end
Target int8 execution of CONV, FC, ... primitives
1) maximize data reuse in register file 2) improve kernel regularity 3) exploit parallelism
lp.setup
p.lw w0, 4(W0!)
p.lw w1, 4(W1!)
p.lw w2, 4(W2!)
p.lw w3, 4(W3!)
p.lw x0, 4(X0!)
p.lw x1, 4(X1!)
pv.sdotsp.b acc1, w0, x0
pv.sdotsp.b acc2, w0, x1
pv.sdotsp.b acc3, w1, x0
pv.sdotsp.b acc4, w1, x1
pv.sdotsp.b acc5, w2, x0
pv.sdotsp.b acc6, w2, x1
pv.sdotsp.b acc7, w3, x0
pv.sdotsp.b acc8, w3, x1
end
PULP-NN [Garofalo 19] https://arxiv.org/abs/1908.11263
Peak Performance (8 cores)
15.5 MAC/cycle
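To make the reuse pattern easier to follow, here is a plain-C sketch of the same 4x2 register tiling. It is illustrative only: dotp4() is a hypothetical stand-in for the pv.sdotsp.b SIMD instruction, and the real PULP-NN kernel relies on the xPULPv2 features shown above (hardware loops, post-incrementing loads, SIMD dot products) rather than portable C.

/* Illustrative plain-C version of the 4x2 register-tiled inner loop sketched in
 * the assembly above. dotp4() is a hypothetical helper standing in for the
 * xPULPv2 pv.sdotsp.b instruction (4 int8 x int8 products accumulated into an
 * int32 accumulator). */
#include <stdint.h>

static inline int32_t dotp4(const int8_t *a, const int8_t *b, int32_t acc) {
    for (int i = 0; i < 4; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Accumulate one output tile of 4 output channels x 2 output pixels.
 * w[0..3]: weight streams for the 4 output channels; x[0..1]: activation streams
 * for the 2 output pixels; n is the reduction length F*F*Kin (a multiple of 4). */
void matmul_4x2(const int8_t *w[4], const int8_t *x[2], int32_t acc[4][2], int n) {
    for (int k = 0; k < n; k += 4) {          /* loop over in chan, filter size     */
        for (int oc = 0; oc < 4; oc++)        /* 4 weight words are each reused ... */
            for (int px = 0; px < 2; px++)    /* ... against 2 activation words     */
                acc[oc][px] = dotp4(&w[oc][k], &x[px][k], acc[oc][px]);
    }
}

Each iteration of the k loop mirrors the assembly body above: 6 word loads (4 weight words and 2 activation words) feed 8 dot-product instructions, i.e. 32 MACs in roughly 14 instructions per core. Spread over 8 cores, and once loop and data-marshalling overheads are accounted for, this is consistent with the 15.5 MAC/cycle peak reported on the slide.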
DORY: Tiling & Code Generation
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
DORY
Deployment Oriented to memoRY
DORY: the tensor tiling problem
1.0-MobileNet-128
(59% top-1 accuracy on ImageNet)
DORY: the tensor tiling problem
L3 / L2 tiling
64 MB / 512 kB
small memory
big memory
DORY: the tensor tiling problem
L3 / L2 tiling
64 MB / 512 kB
L2 / L1 tiling
512 kB / 64 kB
small memory
big memory
DORY: Tiling as optimization problem
How to define the sizes of the various tiles? → abstracted as a Constraint Programming problem
cost = max [ Size(W_tile) + Size(x_tile) + Size(y_tile) ]
s.t. Size(W_tile) + Size(x_tile) + Size(y_tile) < SizeSmallMem   (MEMORY ✓)
[Flow diagram: NEMO (PyTorch, low-precision evaluation, dataset loader) → Integer DNN → Google OR-Tools → Integer DNN + tile sizes]
Basic idea: maximize tile size while fitting the overall memory size constraint
DORY: Tiling as optimization problem
How to define the sizes of the various tiles? → abstracted as a Constraint Programming problem
cost = max [ Size(W_tile) + Size(x_tile) + Size(y_tile) ]
s.t. Size(W_tile) + Size(x_tile) + Size(y_tile) < SizeSmallMem   (MEMORY ✓)
s.t. { y_tile(ch_out) = W_tile(ch_out), … }   (GEOMETRY ✓)
[Flow diagram: NEMO (PyTorch, low-precision evaluation, dataset loader) → Integer DNN → Google OR-Tools → Integer DNN + tile sizes]
Sizes of the tiles for in tensors / out tensors / weights are tied by geometric constraints typical of a given layer
DORY: Tiling as optimization problem
How to define the sizes of the various tiles? → abstracted as a Constraint Programming problem
cost = max [ Size(W_tile) + Size(x_tile) + Size(y_tile) ]
s.t. Size(W_tile) + Size(x_tile) + Size(y_tile) < SizeSmallMem   (MEMORY ✓)
s.t. { y_tile(ch_out) = W_tile(ch_out), … }   (GEOMETRY ✓)
cost′ = cost + ( y_tile(ch_out) divisible by 4 ) + …   (EFF. HEURISTICS ✓)
[Flow diagram: NEMO (PyTorch, low-precision evaluation, dataset loader) → Integer DNN → Google OR-Tools → Integer DNN + tile sizes]
Performance is maximized by configurations that use the PULP-NN primitives more efficiently (e.g., full parallelism)
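As a concrete illustration of the formulation above, the sketch below enumerates candidate tile sizes for one hypothetical convolutional layer under the same memory, geometry and divisibility constraints. It is only a brute-force stand-in: DORY itself encodes these constraints as a Constraint Programming model and hands it to the Google OR-Tools solver, and all shapes and the 64 kB budget here are example values.

/* Brute-force stand-in for DORY's constraint-programming tile-size selection.
 * Example layer: 3x3 convolution, stride 1, int8 data; width is kept untiled
 * for simplicity and only output rows and output channels are tiled. */
#include <stdio.h>

int main(void) {
    const int L1_BUDGET = 64 * 1024;      /* bytes available in the small (L1) memory */
    const int F = 3, K_IN = 32;           /* filter size and input channels           */
    const int OUT_H = 64, OUT_W = 64;     /* full output feature-map geometry         */
    const int K_OUT = 64;                 /* output channels                           */

    int best_cost = -1, best_h = 0, best_ko = 0;
    /* GEOMETRY: weight and output tiles share ch_out; the input tile must cover
     * the receptive field of the output rows it produces.                       */
    for (int h_tile = 1; h_tile <= OUT_H; h_tile++) {
        /* HEURISTIC (enforced here as a hard constraint for simplicity): ch_out
         * tile divisible by 4, so the 4x2 PULP-NN inner kernel stays efficient. */
        for (int ko_tile = 4; ko_tile <= K_OUT; ko_tile += 4) {
            int x_size = (h_tile + F - 1) * OUT_W * K_IN;   /* input tile (int8)   */
            int w_size = ko_tile * K_IN * F * F;            /* weight tile (int8)  */
            int y_size = h_tile * OUT_W * ko_tile;          /* output tile (int8)  */
            int cost   = x_size + w_size + y_size;          /* objective: maximize */
            if (cost < L1_BUDGET && cost > best_cost) {     /* MEMORY constraint   */
                best_cost = cost; best_h = h_tile; best_ko = ko_tile;
            }
        }
    }
    printf("tile: %d output rows x %d output channels, footprint %d bytes\n",
           best_h, best_ko, best_cost);
    return 0;
}

In the real tool the divisibility preference enters the cost function as a bonus term rather than a hard constraint, so layers whose geometry cannot satisfy it still receive a feasible tiling.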
DORY: Tile data movement
[Figure, built up over five slides: L2 memory holds the full input feature map I, the output feature map O and the filter weights W; L1 memory holds two buffers (L1 buffer 1 / L1 buffer 2), each with an x TILE, a y TILE and a W TILE, so that transfers and compute can be double-buffered. A timeline ("CONVOLUTIONAL PIPELINE", tiles t0 t1 t2 t3 … tn) with rows for DMA ch. 0-1, DMA ch. 2 and Cluster computation shows the In. copy, Convol. kernel and Out. copy phases of successive tiles overlapping, so that each tile's transfers are hidden behind the previous tile's computation.]
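The pipeline above can be read as the following double-buffered tile loop, roughly the structure of the layer code DORY generates. It is a sketch only: dma_in(), dma_out(), dma_wait() and conv_kernel() are hypothetical placeholders (implemented here as plain memcpy/no-op stubs so the snippet is self-contained), standing in for the asynchronous GAP8 cluster-DMA calls and the PULP-NN primitive, and the tile sizes are made-up example values.

#include <string.h>

#define N_TILES      8
#define X_TILE_BYTES 4096
#define W_TILE_BYTES 2048
#define Y_TILE_BYTES 4096

/* Host-side stand-ins: on GAP8 these would be asynchronous cluster-DMA transfers
 * and an 8-core PULP-NN kernel call. */
static void dma_in (void *l1, const char *l2, int tile, int bytes) { memcpy(l1, l2 + (long)tile * bytes, bytes); }
static void dma_out(char *l2, const void *l1, int tile, int bytes) { memcpy(l2 + (long)tile * bytes, l1, bytes); }
static void dma_wait(void) { /* wait for all outstanding transfers */ }
static void conv_kernel(const void *x, const void *w, void *y) { (void)x; (void)w; (void)y; }

void layer_tiled(const char *x_l2, const char *w_l2, char *y_l2) {
    static char x_l1[2][X_TILE_BYTES], w_l1[2][W_TILE_BYTES], y_l1[2][Y_TILE_BYTES];

    dma_in(x_l1[0], x_l2, 0, X_TILE_BYTES);           /* prologue: fetch tile 0        */
    dma_in(w_l1[0], w_l2, 0, W_TILE_BYTES);

    for (int t = 0; t < N_TILES; t++) {
        int cur = t & 1, nxt = (t + 1) & 1;
        if (t + 1 < N_TILES) {                        /* prefetch tile t+1             */
            dma_in(x_l1[nxt], x_l2, t + 1, X_TILE_BYTES);
            dma_in(w_l1[nxt], w_l2, t + 1, W_TILE_BYTES);
        }
        dma_wait();                                   /* tile t is now in L1           */
        conv_kernel(x_l1[cur], w_l1[cur], y_l1[cur]); /* cluster computation           */
        dma_out(y_l2, y_l1[cur], t, Y_TILE_BYTES);    /* write tile t back during t+1  */
    }
    dma_wait();                                       /* drain the last output copy    */
}

Because the two L1 buffers alternate between being filled and being consumed, the input and weight copies for tile t+1 and the output copy for tile t both run in the shadow of tile t's computation, which is why the speaker notes describe the transfer overheads as almost completely hidden.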
DORY: code generation example
Integer Network + tile sizes
Code Generation
from templates
Network-level C code
• L3/L2 transfer boilerplate
• double buffering for weights
• calls to layer-level code
Layer-level C code
• L2/L1 transfer boilerplate
• calls to PULP-NN backend library
100 MHz, 4 fps, 12.5 mJ/frame
260 MHz, 10 fps, 25 mJ/frame
1.0-MobileNet-128
(59% top-1 accuracy on ImageNet)
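At the network level, the generated code applies the same idea one memory level up: weights for the next layer are streamed from L3 (HyperRAM) into an L2 double buffer while the current layer runs its tiled L2/L1 routine. The sketch below only shows this structure; all identifiers are illustrative, not the names DORY actually emits, and the L3 transfer and layer calls are stubbed out so the snippet stays self-contained.

#include <string.h>

#define N_LAYERS         4
#define MAX_W_L2_BYTES   65536
#define MAX_ACT_L2_BYTES 65536

static char w_l2[2][MAX_W_L2_BYTES];      /* double-buffered layer weights in L2 */
static char act_l2[2][MAX_ACT_L2_BYTES];  /* ping-pong activation buffers in L2  */

/* Stand-ins for the L3 (HyperRAM) DMA and the per-layer tiled L2/L1 routine. */
static void l3_to_l2(char *dst, int layer) { memset(dst, 0, MAX_W_L2_BYTES); (void)layer; }
static void l3_wait(void) { }
static void layer_run(int layer, const char *x, const char *w, char *y) { (void)layer; (void)x; (void)w; (void)y; }

void network_run(void) {
    l3_to_l2(w_l2[0], 0);                          /* prologue: weights of layer 0      */
    for (int i = 0; i < N_LAYERS; i++) {
        if (i + 1 < N_LAYERS)
            l3_to_l2(w_l2[(i + 1) & 1], i + 1);    /* prefetch the next layer's weights */
        layer_run(i, act_l2[i & 1], w_l2[i & 1], act_l2[(i + 1) & 1]);  /* tiled layer   */
        l3_wait();                                 /* make sure the prefetch has landed */
    }
}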
An example of our flow in use!
https://www.youtube.com/watch?v=xBd2nAFNuWY
Credits: Nicky Zimmermann1, Hanna Müller2, Jerome Guzzi1, Alessandro Giusti1, Daniele Palossi1,2
1IDSIA Lugano, 2ETH Zürich
training
quantization & pruning
graph optimization
memory-aware deployment
optimized DNN primitives
optimized HW & architecture
specification & dataset selection
PULP-NN
Parallel ULP
Neural Network library
NEMO
NEural Minimization
for pytOrch
DORY
Deployment Oriented to memoRY
Where can I get it?
Our deployment flow is fully open-source – as is our full hardware platform.
PULP-NN library: https://github.com/pulp-platform/pulp-nn
DORY deployment tool: https://github.com/pulp-platform/dory
NEMO network minimization tool: https://github.com/pulp-platform/nemo
December 8-10 | Virtual Event
Thanks for your attention!
Credits: Angelo Garofalo, Alessio Burrello, Nazareno Bruschi, Manuele Rusci, Giuseppe Tagliavini, Davide Rossi, Luca Benini & the PULP team


Editor's Notes

  • #2 Good evening. My name is Francesco Conti, I am an Assistant Professor of Electronics at the University of Bologna, and I am here to talk about our open flow for Deep Neural Network Deployment on ultra-low-power risc-v cores.
  • #3 The deployment of Deep Neural Networks in devices at the Extreme Edge of the IoT is seen as a key enabling technology for advanced Artificial Intelligence applications. Compared to cloud-based execution, local AI using realistically complex models can result in stronger privacy and security, improved latency and better energy efficiency. These features would make applications such as smart sensors, personal healthcare devices and autonomous nanorobots much more powerful than they are today.
  • #4 However, there is one big obstacle: how to bridge the divide between algorithms developed and trained on high-performance GPUs and the extreme constraints of devices with little onboard memory? And how to extract performance from specialized, but hard-to-program, hardware architectures?
  • #5 Proceeding from the bottom up, a reasonable strategy starts from exploiting the hardware’s strengths, focusing on maximizing the utilization of compute units: build libraries for commonly used DNN primitives, such as convolutional layers, that are designed to achieve high performance and energy efficiency on data that is stored _directly in the device_
  • #6 Extreme-edge devices have tiny memories: being very fast at computation is not enough. Deploying realistically sized DNNs at the edge requires scheduling computation in such a way that data can be manipulated on these small memories, minimizing off-chip access and maximizing the usage of optimized primitives at the same time.
  • #7 To produce DNNs that are well-tuned to embedded hardware, a key step is to use quantized data for weights and activations. Quantization reduces the memory footprint and, at the same time, enables using faster arithmetic primitives in the lower layers of the deployment stack. So far, 8 bits have been selected as the “industry standard” to combine good accuracy with low footprint and high energy efficiency.
  • #8 Within the PULP project, we have designed a fully open source, vertically integrated flow putting together all of these steps, from quantization-aware training down to optimized execution on DNN-optimized hardware.
  • #9 In this talk, I will focus on the lowest layers of the deployment stack: PULP-NN, our PULP-optimized Neural Network library; and DORY, our tool for Memory-Oriented Deployment of DNNs.
  • #10 Let us start by focusing on the PULP architecture and the optimized PULP-NN libraries running on top of it.
  • #11 The open-source PULP project is a tight cooperation between the University of Bologna and ETH Zurich, focusing on the design of extreme-edge devices based on Parallel Ultra-Low-Power Clusters of RISC-V cores, for which AI is one of the primary application targets. In this talk, I’m going to focus on a specific commercial embodiment of PULP: GreenWaves GAP-8. GAP-8 is an advanced microcontroller with autonomous I/O capability, a 512-kilobyte on-chip SRAM and support for external HyperRAM memory. It employs a RISC-V 32-bit in-order core with 4 pipeline stages based on the RISCY design from ETH Zurich.
  • #12 The main feature distinguishing GAP-8 from other microcontrollers is the fact that it embeds a powerful 8-core programmable accelerator: a cluster of 8 RISCY cores to accelerate parallel workloads. The cluster is centered around a 64-kilobyte high-bandwidth L1 scratchpad shared by all cores, and manually managed by means of a cluster DMA. Each cluster core is enhanced with DSP extensions such as hardware loops, address post-increment and scalar product instructions; hardware-accelerated synchronization primitives enable exploiting parallelism at an ultra-fine grain.
  • #13 To fully exploit the GAP8 architecture, a DNN application must be able to mine all performance gains from the cluster while intelligently using the scarce amount of on-chip memory and scheduling data movement between the off-chip DRAM, the on-chip Level-2 memory and the ultra-fast (but small) Level-1 scratchpad.
  • #14 Our optimized library for DNN primitives, PULP-NN, takes care of the first task: maximize performance for data that is already inside the fast L1 scratchpad while disregarding higher layers of the memory hierarchy. To do so, PULP-NN focuses on primitives running on 8-bit data and aims at maximizing data reuse in the register file, improving kernel regularity, and exploiting data parallelism. Here we see the innermost loop of the PULP-NN library, which works similarly for Convolutional, Fully Connected and other similar layers. In the innermost loop, each RISCY core exploits load instructions with address post-increment to fetch 16 bytes of weights – a matrix of 4 output channels and 4 input channels.
  • #15 The same is repeated to load 8 bytes of pixels – from 2 rows in the input feature map, and across 4 input channels. That makes a total of 6 loads per iteration of the innermost loop.
  • #16 With this data, RISCY can exploit its 8-bit dot-product instructions to perform 32 multiply-accumulate operations over 8 cycles, producing a total of 8 accumulators (4 channels across 2 spatial rows).
  • #17 This operation can be iterated over the input channels and convolutional filters, to perform a full accumulation of output pixels.
  • #18 The workload is divided among multiple cores by having each core work on a different input column.
  • #19 Using this strategy requires some initial data marshalling to reorganize inputs, but thanks to its regular and tight inner loop, it can achieve a peak performance of up to 15.5 MACs per cycle on 8 cores.
  • #20 But achieving a high peak performance with data in L1 is not nearly enough to achieve high end-to-end performance. Therefore, we designed DORY – Deployment Oriented to memoRY – a tool that divides the computation into tiles that fit on-chip memory while organizing off-chip access in the most efficient way possible.
  • #21 To understand how tiling works, let’s have a look at the architecture of a GAP-8 system, which here is rotated 90 degrees for convenience. The full representation of a realistically sized network such as a MobileNet cannot be stored in on-chip memory, as it requires at least tens of megabytes for weights. This kind of data has always to reside in the external DRAM or Flash memory.
  • #22 Tensor tiling means that we copy inside the on-chip SRAM only a small “block”, or tile, of the tensor data that we need. The GAP-8 on-chip Level-2 SRAM has 512 kilobytes of space, therefore we can often fit all the weights for a layer, but sometimes we need to select only a tile of the weights and activations. For each tile, we copy in the tile that is going to be consumed, we project it onto an output tile, and we copy the latter back to the Level-3 DRAM. Then we repeat the operation.
  • #23 As the GAP-8 Level-1 scratchpad in the cluster is manually managed, we have to repeat a similar operation between the L2 and L1 memories. However, in this case the L1 size is only 64 kB – therefore our job is significantly harder. How to find tile sizes that make sense, minimize transfer overhead and at the same time are capable of exploiting the performance of PULP-NN primitives?
  • #24 In DORY, we abstract this general problem as a Constraint Programming problem, and we use the open-source Google OR-Tools to solve it. The starting point is the idea of maximizing the size of the three tiles that we need: input activations, weights, and output activations – subject to the constraint of the total space that we can allocate in the smaller memory.
  • #25 The sizes of the various tiles, like those of the full tensors, are not independent of one another: they are tied by geometric constraints typical of a given kind of layer (convolutional, depthwise, fully connected). These are also imposed as additional constraints.
  • #26 Finally, we acknowledge that the PULP-NN primitives are faster with some tiles than with others. For example, they are faster with tiles exposing enough parallelism to keep the 8 cores busy. These considerations are taken into account by expanding the cost function with heuristic components that help automatically select the tile combinations that perform best on hardware.
  • #27 DORY uses the output of the constraint programming problem to generate code that moves data tiles between the various levels of the memory hierarchy. Data transfers are organized in a fully triple-buffered fashion so that, in most cases, transfer overheads are completely hidden and do not hinder performance.
  • #32 Putting it all together, DORY can be used to take an integer-only DNN and generate low-level code that runs directly on GAP-8, including all data transfers. On a MobileNet run on 128x128 inputs, which achieves 59% top-1 accuracy on ImageNet, we achieve real-time processing at 10 fps, spending less than 25 millijoules per frame.
  • #33 To make the case more convincing, I attached a small video of our full flow being used for a robotic application, prepared by the team of Alessandro Giusti at IDSIA Lugano. The nano-drone, which hosts a GAP-8 processor, is running a neural network that tries to keep its position aligned with the front of a human’s face.
  • #34 Where can you get this? Our deployment flow, as well as our full hardware platform, is entirely open-source. You can find everything on GitHub at these links, inside the pulp-platform group
  • #35 Thank you very much for your attention. Let me also thank the people who did all the hard work: Angelo Garofalo, Alessio Burrello and the rest of the awesome ”DNN on PULP” team.