Klessydra t - designing vector coprocessors for multi-threaded edge-computing cores

Information Classification: General
December 8-10 | Virtual Event
Klessydra-T: Designing Vector Coprocessors for
Multi-Threaded Edge-Computing Cores
Mauro Olivieri
Professor
Sapienza University of Rome
#RISCVSUMMIT

DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
Francesco Lannutti, collaborator @ Synopsys
Marcello Barbirotta, PhD candidate
Mauro Olivieri, Associate Professor
Francesco Menichelli, Assistant Professor
Antonio Mastrandrea, Research Fellow
Abdallah Cheikh, Research Fellow
Luigi Blasi, PhD candidate @ DSI GmbH
Francesco Vigli, PhD candidate @ ELT SpA
Stefano Sordillo, PhD candidate

OUTLINE
INTRODUCTION & MOTIVATION
THE KLESSYDRA-T ARCHITECTURE
• Interleaved Multi-Threading baseline
• Parameterized vector acceleration schemes
• Klessydra vector intrinsic functions
BENCHMARK WORKLOADS
• Convolution, Matmul, FFT
• Homogeneous and composite workload
RESULTS
• Cycle count and absolute execution time
• Maximum clock frequency and hardware resource utilization
• Energy efficiency
CONCLUSIONS

19/04/2021 Page 4
APPLICATION CONTEXT AND MOTIVATION
▪ Recognized drivers push towards (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design
▪ HW design challenges of extreme edge computing devices:
• Local energy budget
• Cost & size
• Computing power
▪ General setting:
• Possibly taking advantage of inherently multi-threaded application routines
• Inevitability of hardware acceleration support

THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
▪ PULPino feat. Klessydra S0 core: starting point
• M mode v1.10
• RV32I user ISA
• single hart
▪ PULPino feat. Klessydra T0 cores
• M mode v1.10
• RV32I user ISA
• Atomic ext. (partial)
• multiple PC & CSR
• multiple interleaved harts
▪ PULPino feat. Klessydra F0 cores: “space-qualified” core
• T0 microarchitecture
• + configurable HW/SW fault-tolerance support
▪ PULPino feat. Klessydra T1 cores: “edge computing” core
• extends T0 microarchitecture
• RV32IM
• + configurable multiple scratchpad memories
• + configurable vector unit
• extended ISA

THE KLESSYDRA IMT MICROARCHITECTURE
▪ Baseline Klessydra T03 core features:
• Thread context switch at each clock cycle
• In-order, single-issue instruction execution
• Feed-forward pipeline structure (no hardware support for pipeline hazard handling)
• Bare-metal execution (RISC-V M mode)
▪ The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
[Figure: Klessydra T03 pipeline block diagram. Stages: Fetch, Decode, Execute, WB. Other blocks: per-hart PC and CSR, Regfile, harc Updater, Debug, Program memory, Data memory. Harts a, b, and c occupy successive pipeline stages.]
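
The per-cycle thread switch listed above can be pictured with a small C model. This is only an illustrative sketch, not the core's RTL: it assumes three harts issued in strict round-robin order, and the `ImtFrontEnd` type and `imt_fetch_cycle` function are hypothetical names (the `harc` field borrows the label seen in the block diagram). Under this assumption, consecutive pipeline slots hold instructions from different harts, which is presumably what lets a feed-forward pipeline omit hazard-handling logic.

```c
#include <assert.h>

#define NUM_HARTS 3   /* assumption: three interleaved harts (a, b, c) */

/* Hypothetical front-end state: one program counter per hart plus a
 * counter selecting which hart fetches in the current cycle. */
typedef struct {
    unsigned pc[NUM_HARTS];
    unsigned harc;
} ImtFrontEnd;

/* Model one clock cycle: the selected hart fetches one instruction,
 * then the selection moves to the next hart.
 * Returns the index of the hart that fetched. */
unsigned imt_fetch_cycle(ImtFrontEnd *fe)
{
    assert(fe != 0 && fe->harc < NUM_HARTS);
    unsigned hart = fe->harc;
    fe->pc[hart] += 4;                    /* RV32I base instructions are 4 bytes */
    fe->harc = (hart + 1) % NUM_HARTS;    /* thread context switch every cycle */
    return hart;
}
```

Calling `imt_fetch_cycle` four times from a zeroed state selects harts 0, 1, 2, 0 in turn, so the same hart fetches again only after `NUM_HARTS` cycles.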

THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
[Figure: Klessydra T13 block diagrams. The T03 pipeline (Fetch, Decode, Execute, WB, per-hart PC and CSR, Regfile, harc Updater, Debug, Program memory, Data memory) is extended so that the Execute stage contains the scalar EXEC unit, the LSU, and the MFU. The MFU (Add, Sub, Shft, Mul, Accum, Relu, with Input Mapping, Output Mapping, and Data reorder) reaches the banked scratchpad memories SPM0 ... SPMN-1 (Bank0 ... BankN, bank interleaving) through the SPMI. Replication factors shown: x F, x D, x N. Signals shown: MAU_req, MAU_busy. Blocks shown: DSP Initialization (Accl Init), Control / Mapping, Accl Exec.]
Klessydra T13 core features
▪ Multiple units in the execution stage
• scalar execution unit (EXEC)
• vector-oriented multi-purpose functional unit (MFU) with scratchpad memory (SPM) support
• load/store unit (LSU)
▪ Possible concurrent execution of instructions of different types

HARDWARE ACCELERATION PARAMETRIC SCHEMES
The parametric coprocessor architecture in T13 cores, comprising the MFU and the SPMIs, can be configured at synthesis time through the following parameters:
• the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
• the number of MFUs F
• the SPM bank capacity B
• the number of SPMs N
• the number of SPMIs M
• the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
▪ M=1, F=1, D=1: SISD
▪ M=1, F=1, D=2,4,8: Pure SIMD
▪ M=3, F=3, D=1: Symmetric MIMD
▪ M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
▪ M=3, F=1, D=1: Heterogeneous MIMD
▪ M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
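
The mapping from parameter values to scheme names listed above can be written as a small lookup. The sketch below is illustrative only: the `CoprocConfig` struct, the `Scheme` enum, and `classify_scheme` are hypothetical names, not part of the Klessydra sources, and any combination not listed on the slide is reported as unlisted rather than guessed.

```c
#include <assert.h>

typedef enum {
    SCHEME_SISD,
    SCHEME_PURE_SIMD,
    SCHEME_SYM_MIMD,
    SCHEME_SYM_MIMD_SIMD,
    SCHEME_HET_MIMD,
    SCHEME_HET_MIMD_SIMD,
    SCHEME_UNLISTED        /* combination not shown on the slide */
} Scheme;

typedef struct {
    int M;  /* number of SPMIs */
    int F;  /* number of MFUs */
    int D;  /* number of parallel lanes in the MFU (DLP degree) */
} CoprocConfig;

/* Classify a configuration using only the six cases listed on the slide. */
Scheme classify_scheme(CoprocConfig c)
{
    assert(c.M >= 1 && c.F >= 1 && c.D >= 1);
    int simd = (c.D == 2 || c.D == 4 || c.D == 8);
    if (!simd && c.D != 1)
        return SCHEME_UNLISTED;
    if (c.M == 1 && c.F == 1)
        return simd ? SCHEME_PURE_SIMD : SCHEME_SISD;
    if (c.M == 3 && c.F == 3)
        return simd ? SCHEME_SYM_MIMD_SIMD : SCHEME_SYM_MIMD;
    if (c.M == 3 && c.F == 1)
        return simd ? SCHEME_HET_MIMD_SIMD : SCHEME_HET_MIMD;
    return SCHEME_UNLISTED;
}
```

For example, `classify_scheme((CoprocConfig){3, 1, 4})` yields `SCHEME_HET_MIMD_SIMD`: three SPMIs share a single MFU with four lanes.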

KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
Assembly syntax ((r) denotes memory addressing via register r)    Short description
kmemld    (rd),(rs1),(rs2)    load vector into scratchpad region
kmemstr   (rd),(rs1),(rs2)    store vector into main memory
kaddv     (rd),(rs1),(rs2)    add vectors in scratchpad region
ksubv     (rd),(rs1),(rs2)    subtract vectors in scratchpad region
kvmul     (rd),(rs1),(rs2)    multiply vectors in scratchpad region
kvred     (rd),(rs1),(rs2)    reduce vector by addition
kdotp     (rd),(rs1),(rs2)    vector dot product into register
ksvaddsc  (rd),(rs1),(rs2)    add vector + scalar into scratchpad
ksvaddrf  (rd),(rs1),rs2      add vector + scalar into register
ksvmulsc  (rd),(rs1),(rs2)    multiply vector by scalar into scratchpad
ksvmulrf  (rd),(rs1),rs2      multiply vector by scalar into register
kdotpps   (rd),(rs1),(rs2)    vector dot product and post scaling
ksrlv     (rd),(rs1),rs2      vector logic shift within scratchpad
ksrav     (rd),(rs1),rs2      vector arithmetic shift within scratchpad
krelu     (rd),(rs1)          vector ReLU within scratchpad
kvslt     (rd),(rs1),(rs2)    compare vectors and create mask vector
ksvslt    (rd),(rs1),rs2      compare vector-scalar and create mask
kvcp      (rd),(rs1)          copy vector within scratchpad region
The instructions supported by the coprocessor subsystem are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V GCC compiler toolchain.
CSR_MVSIZE(Row_size); // set vector length
for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
    k_element = 0;
    for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
        for (column_offset = 0; column_offset < kernel_size; column_offset++) {
            FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
            ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
            ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed-point alignment
            OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
            kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
        }
    }
}
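
To make the inner loop body easier to follow, the three intrinsics it uses can be modelled in plain scalar C. This is a hedged reference sketch, not the toolchain's implementation: the element-wise semantics are inferred from the short descriptions in the instruction table, the `ref_` function names are hypothetical, and the vector length is passed as an explicit `len` argument instead of being taken from `CSR_MVSIZE`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed meaning of ksvmulsc: multiply each vector element by one scalar
 * read from memory. Inputs are assumed small enough that each product
 * fits in 32 bits. */
void ref_ksvmulsc(int32_t *dst, const int32_t *src, const int32_t *scalar, size_t len)
{
    assert(dst && src && scalar);
    for (size_t i = 0; i < len; i++)
        dst[i] = (int32_t)((int64_t)src[i] * (int64_t)*scalar);
}

/* Assumed meaning of ksrav: arithmetic (sign-preserving) right shift of
 * each element, written so that negative values are handled portably. */
void ref_ksrav(int32_t *dst, const int32_t *src, unsigned shift, size_t len)
{
    assert(dst && src && shift < 32);
    for (size_t i = 0; i < len; i++)
        dst[i] = (src[i] < 0) ? ~((~src[i]) >> shift) : (src[i] >> shift);
}

/* Assumed meaning of kaddv: element-wise vector addition. */
void ref_kaddv(int32_t *dst, const int32_t *a, const int32_t *b, size_t len)
{
    assert(dst && a && b);
    for (size_t i = 0; i < len; i++)
        dst[i] = a[i] + b[i];
}
```

One iteration of the innermost loop then corresponds to `ref_ksvmulsc`, `ref_ksrav`, and `ref_kaddv` applied in sequence: a feature-map row is scaled by one kernel coefficient, realigned for fixed point, and accumulated into the output row.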

BENCHMARK WORKLOADS AND EVALUATION SETUP
▪ 2D convolution
• 32-bit data elements in fixed-point representation
• 3x3 filter size
• matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
• additional analysis of filter sizes larger than 3x3 on 32x32 matrices
▪ FFT
• 256 complex samples
▪ Matmul
• Square matrices of 64x64 elements
• Homogeneous workload (3 harts running the same program)
• Composite workload (3 harts running different programs)
ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
• Average total cycle count per hart
• Maximum clock frequency
• Absolute execution time
• Hardware resource utilization
• Average energy per algorithmic operation

SUMMARY OF PERFORMANCE RESULTS
▪ 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
▪ 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core

MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization and maximum frequency)
Core / configuration    DLP  FF     LUT    B-RAM  DSP  LUT-RAM  Max freq (MHz)
T13 SISD                1    2488   6982   6      11   264      144.4
T13 SIMD                2    2627   8400   6      15   264      146.0
T13 SIMD                4    3301   11366  6      23   264      137.2
T13 SIMD                8    4800   17331  12     39   264      137.7
T13 Sym. MIMD           1    3512   10458  18     19   264      148.2
T13 Sym. MIMD + SIMD    2    4712   15943  18     31   264      131.7
T13 Sym. MIMD + SIMD    4    6753   25089  18     55   264      120.0
T13 Sym. MIMD + SIMD    8    10854  43419  36     103  264      105.1
T13 Het. MIMD           1    3012   10182  18     11   264      117.2
T13 Het. MIMD + SIMD    2    3871   15577  18     15   264      128.9
T13 Het. MIMD + SIMD    4    5015   23282  18     23   264      122.0
T13 Het. MIMD + SIMD    8    7325   42944  36     39   264      108.6
Klessydra T03           -    1418   4281   0      7    176      221.1
RI5CY                   -    2527   7674   0      6    0        91.4
ZeroRiscy               -    1933   5275   0      1    0        117.2

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL
                             Homogeneous workload                                      Composite workload
Core / configuration    DLP  Conv 4x4  Conv 8x8  Conv 16x16  Conv 32x32  FFT 256  MatMul 64x64  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                1    1105      3060      9727        34201       33033    728187        66043       80874    476771
T13 SIMD                2    895       2245      6261        20374       25647    602458        21976       60019    645705
T13 SIMD                4    824       1768      4607        13444       22812    543164        16850       29144    431773
T13 SIMD                8    824       1613      3692        10069       21555    484436        11324       22482    414420
T13 Sym. MIMD           1    626       1493      3887        13536       18726    462066        20953       17824    292564
T13 Sym. MIMD + SIMD    2    629       1190      3123        8681        16827    378748        16144       15839    222370
T13 Sym. MIMD + SIMD    4    560       1190      2543        7148        15993    328962        15868       14942    182580
T13 Sym. MIMD + SIMD    8    560       1152      2543        6006        15726    316270        15581       14613    168031
T13 Het. MIMD           1    663       1521      4153        13565       22839    556463        27155       37111    265567
T13 Het. MIMD + SIMD    2    638       1274      3280        9167        18468    425978        15973       24611    251201
T13 Het. MIMD + SIMD    4    573       1213      2688        7473        16887    360863        16042       19175    181290
T13 Het. MIMD + SIMD    8    573       1079      2580        6285        17604    328178        13921       17298    187877
Klessydra T03           -    1819      5737      20714       79230       47256    2679304       138959      46733    2775779
RI5CY                   -    1377      4247      15088       57020       37344    1360854       81534       37350    1369572
ZeroRiscy               -    2510      8111      29583       113793      61158    4006241       197010      61163    4043376

• The clock speed exhibited the sharpest drops as the DLP grew larger.
• In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
• Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.

• Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
• Large matrix convolutions and MatMul benefit more from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.

ABSOLUTE EXECUTION TIME SPEED-UP
• Assuming maximum clock frequency for each core
• ZeroRiscy core taken as common reference
• In pure SIMD configurations, the speed-up grows linearly with the DLP
• Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
• The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over ZeroRiscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
• Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
• The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy
[Figure: bar chart of absolute execution time speed-up over ZeroRiscy (vertical axis 0 to 18). Configurations: SISD DLP 1; pure SIMD DLP 2, 4, 8; Sym. MIMD DLP 1; Sym. MIMD+SIMD DLP 2, 4, 8; Het. MIMD DLP 1; Het. MIMD+SIMD DLP 2, 4, 8; Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
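
The speed-up figures on this slide combine cycle count with maximum clock frequency. A minimal sketch of that arithmetic follows, assuming execution time is simply cycle count divided by clock frequency; the function names `exec_time_us` and `speedup_over_ref` are hypothetical.

```c
#include <assert.h>

/* Execution time in microseconds: cycles divided by frequency in MHz. */
double exec_time_us(double cycles, double fmax_mhz)
{
    assert(cycles >= 0.0 && fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}

/* Speed-up of a core over a reference core, each at its own maximum frequency. */
double speedup_over_ref(double cycles, double fmax_mhz,
                        double ref_cycles, double ref_fmax_mhz)
{
    return exec_time_us(ref_cycles, ref_fmax_mhz) / exec_time_us(cycles, fmax_mhz);
}
```

Using the Conv 32x32 homogeneous-workload entries of the results table, `speedup_over_ref(6006, 105.1, 113793, 117.2)` compares the T13 Sym. MIMD + SIMD configuration with DLP 8 against ZeroRiscy and evaluates to roughly 17, in line with the 17X figure quoted for that kernel.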

ENERGY EFFICIENCY
• The result of this analysis is expressed as energy per algorithmic operation for the FPGA soft-core implementations, normalized to ZeroRiscy, taken as reference.
• The most energy-efficient designs were the T13 symmetric MIMD configurations
• The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD
• The pure SIMD schemes resulted in a larger energy consumption than other schemes, due to the impossibility of efficiently exploiting TLP.
[Figure: bar chart of energy per algorithmic operation normalized to ZeroRiscy (vertical axis 0 to 1.3). Configurations: SISD DLP 1; pure SIMD DLP 2, 4, 8; Sym. MIMD DLP 1; Sym. MIMD+SIMD DLP 2, 4, 8; Het. MIMD DLP 1; Het. MIMD+SIMD DLP 2, 4, 8; Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]

LARGER CONVOLUTION FILTERS
Cycle count (x1000), execution time T (us), and energy E (uJ) per filter size
                      Filter (5x5)            Filter (7x7)            Filter (9x9)            Filter (11x11)
Core             DLP  Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)
T13 SIMD         2    52.7    362     50.6    101.2   694     97.1    165.8   1136    159.1   246.5   1689    236.6
T13 SIMD         8    24.6    179     34.4    46.1    335     64.5    74.7    543     104.7   110.6   803     154.8
T13 Sym MIMD     2    19.5    148     26.9    35.8    272     49.4    57.4    436     79.2    84.4    641     116.5
T13 Sym MIMD     8    11.8    113     28.9    19.2    183     46.9    29.8    284     72.7    42.9    408     104.7
T13 Het MIMD     2    20.5    159     28.3    37.5    291     51.8    60.2    467     83.1    88.5    687     122.1
T03 (no accel.)  -    247     1120    215.5   514.8   2328    447.9   881.2   3985    766.6   1369.1  6191    1191.1
RI5CY            -    180     1971    252.0   385.3   4218    539.4   662.5   7252    927.5   1000.2  10949   1400.3
ZeroRiscy        -    318.9   2721    226.4   674.5   5754    478.9   1129.7  9637    802.1   1697.8  14482   1205.4
• The matrix being convolved is 32x32 elements
• The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching 35X speed-up over the ZeroRiscy reference
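
The trend stated above can be checked directly from the T (us) and E (uJ) columns of the table. The sketch below copies the T13 Sym MIMD DLP 8 and ZeroRiscy rows into arrays; the array and function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FILTERS 4   /* 5x5, 7x7, 9x9, 11x11 */

/* T (us) values copied from the table: T13 Sym MIMD DLP 8 and ZeroRiscy. */
const double T13_SYM_MIMD_DLP8_US[NUM_FILTERS] = {113.0, 183.0, 284.0, 408.0};
const double ZERORISCY_US[NUM_FILTERS]         = {2721.0, 5754.0, 9637.0, 14482.0};

/* E (uJ) values for the 11x11 filter, same two rows. */
const double T13_SYM_MIMD_DLP8_E11_UJ = 104.7;
const double ZERORISCY_E11_UJ         = 1205.4;

/* Per-filter speed-up of a core over the reference, from execution times. */
void filter_speedups(const double *core_us, const double *ref_us,
                     double *out, size_t n)
{
    assert(core_us && ref_us && out);
    for (size_t i = 0; i < n; i++)
        out[i] = ref_us[i] / core_us[i];
}

/* Energy saving relative to the reference, in percent. */
double energy_saving_percent(double core_uj, double ref_uj)
{
    assert(ref_uj > 0.0);
    return 100.0 * (1.0 - core_uj / ref_uj);
}
```

With these rows, the speed-up over ZeroRiscy rises with every filter size, from about 24X for the 5x5 filter to about 35X for 11x11, and the 11x11 energy is roughly 91% lower than the reference.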

CONCLUSIONS
▪ The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
• >15X absolute time speed-up, -85% energy per operation.
▪ Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
• 2X-3X speed-up.
▪ Fully symmetric MIMD and heterogeneous MIMD give very similar results
• functional unit contention has less impact than SPM contention.
• coprocessor contention can be effectively mitigated by functional unit heterogeneity
▪ Pure DLP acceleration always gives inferior results compared to a balanced TLP/DLP acceleration.
• The IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
▪ In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
• Simplified hardware structure philosophy

Thank you for joining
Contribute to the RISC-V conversation on social!
#RISCVSUMMIT #KLESSYDRA @mauro_olivieri_
https://github.com/klessydra
Mauro.Olivieri@uniroma1.it
