RISC-V Zce Extension
Ibrahim Abu Kharmeh
Huawei Bristol
19 April 2021
Introduction
● RISC-V is an open-source ISA designed at the University of
California, Berkeley.
● The ISA is designed to target a wide range of applications, from
HPC to embedded:
○ RISC-V is a variable-length ISA where an instruction can be any
multiple of 16 bits in length.
○ RISC-V ISA contains some vacant encoding space that third parties
are allowed to use to design their extensions.
● RISC-V application code size is considerably worse than that of
alternative commercial ISAs.
Factors to consider:
● Compilers: GCC, Clang, others
● ABI: UABI, EABI
● ISA: RV32, RV64
● Language: C, C++, Rust
● Workload: IOT, ML
● Toolchain options: -Ox, -msave-restore, -ffunction-sections,
-fdata-sections, --gc-sections
● Short immediate fields
Possible side effects:
• Performance
• Power
• Implementation complexity
Analysis Script
● Input: ELF files
● Call objdump
● Parse the objdump output: symbol tables, save the disassembly
● Construct the instructions record
● Very limited CFG recovery
● Perform the given optimisations sequentially
● Report results either as a CSV or to STDIO
● BFD and ctypes
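A minimal sketch of the "parse objdump" step, assuming the simplified `address: encoding mnemonic operands` layout used in the listings on these slides. The name `parse_line` and the record fields are hypothetical, and real objdump output (tabs, symbol annotations, comments) would need more handling than this.

```python
import re

# Matches lines such as "e084be: 001f8317 auipc t1,0x1f8" (simplified layout).
LINE = re.compile(r"^\s*([0-9a-f]+):\s+([0-9a-f]+)\s+(\S+)\s*(.*)$")

def parse_line(line):
    """Turn one disassembly line into an instruction record, or None."""
    m = LINE.match(line)
    if m is None:
        return None  # e.g. a "<symbol>:" header line
    addr, enc, mnemonic, ops = m.groups()
    return {
        "addr": int(addr, 16),
        "size": len(enc) // 2,  # two hex digits per byte of encoding
        "mnemonic": mnemonic,
        "operands": [o for o in ops.strip().split(",") if o],
    }

if __name__ == "__main__":
    print(parse_line("e084be: 001f8317 auipc t1,0x1f8"))
    print(parse_line("e084c6: 86b2 mv a3,a2"))
```

Deriving the size from the width of the encoding field keeps 16-bit and 32-bit instructions apart without decoding opcodes.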
Finding optimisation opportunities
● Most new instruction proposals do one of the following:
○ Fuse a common instruction sequence into a single instruction.
○ Convert a single normal instruction into a compressed one.
● We can identify new optimisation opportunities with one of the following two
methods:
○ Track how instruction results get used in other instructions.
○ Track how instruction operands get generated.
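The first tracking method can be sketched as a producer/consumer count over straight-line code. This is an illustrative sketch only: the function name `producer_consumer_pairs` and the `(mnemonic, dest, sources)` tuple layout are assumptions, and it ignores control flow entirely.

```python
from collections import Counter

def producer_consumer_pairs(instrs):
    """Count (producer mnemonic, consumer mnemonic) pairs.

    instrs: list of (mnemonic, dest_register_or_None, [source_registers]).
    """
    last_writer = {}   # register -> mnemonic of the instruction that last wrote it
    pairs = Counter()
    for mnemonic, dest, sources in instrs:
        for reg in sources:
            if reg in last_writer:
                pairs[(last_writer[reg], mnemonic)] += 1
        if dest is not None:
            last_writer[dest] = mnemonic
    return pairs

if __name__ == "__main__":
    code = [
        ("li", "a5", []),
        ("mul", "a0", ["a0", "a5"]),
        ("lui", "a5", []),
        ("addi", "a5", ["a5"]),
        ("add", "a0", ["a0", "a5"]),
        ("lw", "a0", ["a0"]),
    ]
    for pair, n in producer_consumer_pairs(code).most_common():
        print(pair, n)
```

Frequent pairs in such a count are candidates for fusing into a single instruction.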
A couple of issues:
● When tracking instructions, what do we do when we reach a possible
change of control flow?
○ Unconditional calls to outside the current function: we save the
tracking buffer.
○ Conditional calls and branch targets: stop tracking and remove the
tracking chain record, or do we keep the existing instructions we
tracked?
● Function names cannot be used as a unique key!
UABI Calling Convention
Benchmark Suite
● RV64: Top 30 Debian, V8
● RV32: Opus, L3C, Embench, Coremark, Testfloat, Huawei IOT Code, Zephyr
● Covers: audio codec, collection of benchmarks, FP tests, RTOS / IOT,
CPP test / generic RV64
Zce Extension
TBLJAL
[Figure: TBLJAL table. MTBLJALVEC.base; entries Addr 0, Addr 1, ..., Addr N, ..., Addr 255, each Xlen bits wide, pointing to Code 0, Code 1, ..., Code N, ..., Code 255.]
Rationale:
Function calls and jumps to fixed labels typically take 32-bit or
64-bit instruction sequences.
Proposed Solution:
• Create a table of X entries
• Store jump addresses in the table
• Separate entries in the table using the lower two bits, depending
on the link register (x0, x1 and x5)
• Create a new compressed instruction that jumps to addresses in the
new jump table
TBLJAL (Example)
<vsprintf>:
#64-bit AUIPC/JALR sequence
e084be: 001f8317 auipc t1,0x1f8
e084c2: 18a302e7 jalr t0,394(t1)
e084c6: 86b2 mv a3,a2
e084c8: 862e mv a2,a1
e084ca: 800005b7 lui a1,0x80000
e084ce: fff5c593 not a1,a1
#32-bit JAL
e084d2: f61ff0ef jal ra,e08432
#64-bit AUIPC/JALR sequence
e084d6: 001f8317 auipc t1,0x1f8
e084da: 19630067 jr 406(t1)
With TBLJAL:
00e084be <vsprintf>:
e084be: xxxx tbljal #x
e084c6: 86b2 mv a3,a2
e084c8: 862e mv a2,a1
e084ca: 800005b7 lui a1,0x80000
e084ce: fff5c593 not a1,a1
e084d2: xxxx tbljal #y
e084da: xxxx tbljal #z
TBLJAL Analysis
● Get all function calls and count the number of times each is used.
● Go through the entries and eliminate all entries that won't gain
from substitution: (JAL, J) < 3.
● Change the weight of JALR and JR entries to be 3*Count.
● Get the most common X entries.
● Replace the entries in the instructions record.
● Calculate the new instruction record size.
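The selection steps above can be sketched roughly as follows. This is not the actual analysis script: the function name `select_table_entries` and the `(target, kind)` input format are hypothetical. The saving estimate assumes a 16-bit TBLJAL replacing either a 32-bit JAL/J (2 bytes saved) or a 64-bit AUIPC/JALR sequence (6 bytes saved, hence the weight of 3), and it does not count the table storage itself.

```python
from collections import Counter

def select_table_entries(calls, table_size):
    """Pick jump-table entries and estimate bytes saved in the code.

    calls: iterable of (target, kind) with kind in {"jal", "j", "jalr", "jr"}.
    """
    short = Counter()  # 32-bit JAL / J uses per target
    long_ = Counter()  # 64-bit AUIPC + JALR / JR uses per target
    for target, kind in calls:
        (long_ if kind in ("jalr", "jr") else short)[target] += 1

    weights = {}
    for target, count in short.items():
        if count >= 3:  # (JAL, J) < 3 would not gain from substitution
            weights[target] = count
    for target, count in long_.items():
        weights[target] = weights.get(target, 0) + 3 * count

    # Most heavily weighted targets first; ties broken by name for determinism.
    chosen = sorted(weights, key=lambda t: (-weights[t], t))[:table_size]
    saved_bytes = sum(2 * weights[t] for t in chosen)
    return chosen, saved_bytes

if __name__ == "__main__":
    calls = [("f", "jal")] * 3 + [("g", "jal")] * 2 + [("h", "jalr")]
    print(select_table_entries(calls, table_size=2))
```

Running the sketch for a range of `table_size` values gives the kind of table-size-versus-saving curve used to choose X.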
Determining the value of X
[Figure: Table Size vs Saving (IOT_Application). x-axis: table size, 0 to 700; y-axis: saving, 0.000% to 12.000%.]
PUSH POP POPRET
<bt_rand>:
20405458: 1141 addi sp,sp,-16
2040545a: c04a sw s2,0(sp)
2040545c: 70000937 lui s2,0x70000
20405460: 62090613 addi a2,s2,1568
20405464: c422 sw s0,8(sp)
20405466: c226 sw s1,4(sp)
20405468: c606 sw ra,12(sp)
2040546a: 842a mv s0,a0
2040546c: 84ae mv s1,a1
<function body>
20405494: 4501 li a0,0
20405496: 40b2 lw ra,12(sp)
20405498: 4422 lw s0,8(sp)
2040549a: 4492 lw s1,4(sp)
2040549c: 4902 lw s2,0(sp)
2040549e: 0141 addi sp,sp,16
204054a0: 8082 ret
With PUSH / POPRET:
20405458 <bt_rand>:
20405458: <16-bit> push {ra,s0-s2},{a0-a1},-16
2040545c: 70000937 lui s2,0x70000
20405460: 62090613 addi a2,s2,1568
<function body>
20405496: <16-bit> popret {ra,s0-s2},{0} 16
Rationale:
Very often in function prologues and epilogues, we need to save
and restore multiple registers to and from the stack.
Proposed Solution:
Instead of using multiple sw / lw instructions, we can introduce
a single instruction that performs all of them.
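As a rough illustration of how the prologue saving could be estimated, the sketch below looks for a stack adjustment followed by `sw` stores of `ra` or `s` registers and assumes they collapse into one 16-bit push. The name `prologue_saving` and the `(size_bytes, mnemonic, operands)` tuple layout are hypothetical; argument moves such as `mv s0,a0` are ignored, and the caller is assumed to pass only the prologue window.

```python
import re

# A store of ra or an s-register to an sp-relative slot, e.g. "s2,0(sp)".
SAVE_OPERANDS = re.compile(r"^(ra|s\d+),\d+\(sp\)$")

def prologue_saving(instrs, push_bytes=2):
    """Estimate bytes saved if the prologue became one push instruction.

    instrs: list of (size_bytes, mnemonic, operands) at a function start.
    """
    if not instrs:
        return 0
    size, mnemonic, operands = instrs[0]
    if mnemonic != "addi" or not operands.startswith("sp,sp,-"):
        return 0  # no stack adjustment, nothing to fold
    removed = size
    for size, mnemonic, operands in instrs[1:]:
        if mnemonic == "sw" and SAVE_OPERANDS.match(operands):
            removed += size
    return removed - push_bytes

if __name__ == "__main__":
    bt_rand = [
        (2, "addi", "sp,sp,-16"),
        (2, "sw", "s2,0(sp)"),
        (4, "lui", "s2,0x70000"),
        (4, "addi", "a2,s2,1568"),
        (2, "sw", "s0,8(sp)"),
        (2, "sw", "s1,4(sp)"),
        (2, "sw", "ra,12(sp)"),
    ]
    print(prologue_saving(bt_rand))
```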
MULIADD, MULI and ADDIADD
uint32 get_element(uint8 index) {
    return array_base[index].element1.element2;
}
The code above gets compiled into the following assembly code:
02002a96 <get_element>:
02002a96 47d1 li a5,20
02002a98 02f50533 mul a0,a0,a5
02002a9c 010057b7 lui a5,0x1005
02002aa0 74478793 addi a5,a5,1860
02002aa4 953e add a0,a0,a5
02002aa6 4548 lw a0,12(a0)
02002aa8 8082 ret
Rationale:
Indexing arrays of structures in C often requires 3 instructions:
• Load immediate to get the element size
• Multiplication by the index to get the location of the required
element
• Addition to the base address of the array
Proposed Solution:
• Create a new instruction (MULIADD) to fuse the 3 instructions into
a single instruction.
• Similarly, we can fuse li and mul to create MULI, and addi and add
to create ADDIADD.
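A hedged sketch of how MULI candidates might be counted for a given immediate width: it looks for an `li` immediately followed by a `mul` that reads the loaded register, and checks that the constant fits. The name `count_muli`, the `(mnemonic, operands)` layout and the signed-immediate assumption are all illustrative; the sketch does not check that the `li` destination is dead afterwards.

```python
def count_muli(instrs, imm_bits=12):
    """Count li+mul pairs whose constant fits a signed imm_bits-wide field.

    instrs: list of (mnemonic, [operands]) in program order.
    """
    lo, hi = -(1 << (imm_bits - 1)), (1 << (imm_bits - 1)) - 1
    count = 0
    for (m1, ops1), (m2, ops2) in zip(instrs, instrs[1:]):
        if m1 == "li" and m2 == "mul" and ops1[0] in ops2[1:]:
            if lo <= int(ops1[1], 0) <= hi:
                count += 1
    return count

if __name__ == "__main__":
    get_element = [
        ("li", ["a5", "20"]),
        ("mul", ["a0", "a0", "a5"]),
        ("lui", ["a5", "0x1005"]),
        ("addi", ["a5", "a5", "1860"]),
        ("add", ["a0", "a0", "a5"]),
    ]
    for bits in (5, 6, 12):
        print(bits, count_muli(get_element, imm_bits=bits))
```

Sweeping `imm_bits` in this way is one possible way to evaluate how the saving depends on the immediate length.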
Immediate Length Evaluation
[Figure: MULI Immediate Length Evaluation. x-axis: immediate length, 1 to 12; y-axis: 0.00% to 1.00%. Series: GCC10_audiocodec_fixed_LC3plus, GCC10_audiocodec_fixed_opus_demo, GCC10_audiocodec_float_LC3plus, GCC10_audiocodec_float_opus_demo, GCC10_coremark, GCC10_embench_aha-mont64, GCC10_embench_matmult-int, GCC10_embench_minver, GCC10_embench_picojpeg, GCC10_embench_st, GCC10_embench_ud, GCC10_fpmark_atan-1M, GCC10_fpmark_inner-product-mid-10k, GCC10_huawei_iot_application, GCC10_huawei_iot_protocol, GCC10_zephyr_central, GCC10_zephyr_peripheral.]
Results
Filename                      Instruction record size  C.TBLJAL  PUSHPOP  MULI   ADDIADD  MULIADD
GCC10_huawei_iot_protocol     960678                   7.00%     2.86%    0.34%  0.10%    0.30%
GCC10_huawei_iot_application  338824                   9.36%     4.14%    0.18%  0.11%    0.18%
GCC10_zephyr_peripheral       76246                    6.58%     0.17%    0.07%  0.05%    0.07%
GCC10_embench_cubic           45632                    2.59%     4.69%    0.00%  0.40%    0.00%
GCC10_zephyr_central          43434                    5.83%     0.21%    0.08%  0.12%    0.11%
GCC10_embench_nsichneu        15208                    0.00%     1.42%    0.00%  0.00%    0.00%
GCC10_embench_wikisort        13776                    1.32%     4.60%    0.00%  0.10%    0.00%
GCC10_embench_st              10228                    1.29%     6.06%    0.04%  0.00%    0.00%
GCC10_embench_nbody           10026                    1.38%     5.52%    0.00%  0.00%    0.00%
GCC10_embench_minver          8944                     1.27%     5.73%    0.07%  0.00%    0.09%
GCC10_embench_picojpeg        8164                     2.40%     3.28%    0.93%  0.59%    0.59%
GCC10_embench_qrduino         6314                     0.25%     3.83%    0.03%  0.10%    0.06%
GCC10_embench_nettle-sha256   6120                     0.07%     6.25%    0.07%  0.03%    0.10%
GCC10_embench_statemate       4312                     0.05%     4.82%    0.00%  0.00%    0.00%
GCC10_embench_ud              3522                     0.80%     8.52%    0.11%  0.06%    0.00%
GCC10_embench_nettle-aes      3290                     0.49%     10.64%   0.06%  0.00%    0.00%
GCC10_embench_slre            2672                     0.07%     8.38%    0.00%  0.22%    0.00%
GCC10_embench_sglib-combined  2542                     0.31%     8.74%    0.00%  0.00%    0.00%
GCC10_embench_huffbench       1888                     0.00%     10.81%   0.00%  0.21%    0.00%
GCC10_embench_edn             1696                     0.00%     14.97%   1.06%  0.00%    0.00%
GCC10_embench_aha-mont64      1204                     0.00%     17.11%   0.00%  0.00%    0.00%
GCC10_embench_matmult-int     652                      0.00%     32.21%   0.61%  0.00%    0.00%
GCC10_embench_crc32           388                      0.52%     51.54%   0.00%  0.52%    0.00%
Average                                                6.93%     3.17%    0.26%  0.11%    0.23%
Snapshot of complete results !
Bonus!
● Estimated by searching for double shifts or andi 255 for ZEXT.B.
● Estimated by searching for stack adjustments and sw after or lw
before.
● Pseudo instruction fitting (dst==src and reg range).
● Normal mul fitting for encoding, and li followed by mul.
● Get all long addresses, hash from objdump, add to normalised list,
create a sliding window trying to maximise benefit.
● Normal instructions fitting for the compressed encoding.
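The ZEXT.B estimate could look something like the sketch below, which counts `andi rd,rs,255` and an adjacent `slli`/`srli` pair by XLEN-8 bits on the same register. The name `count_zext_b` and the `(mnemonic, operands)` layout are assumptions for illustration; only adjacent shift pairs are recognised.

```python
def count_zext_b(instrs, xlen=32):
    """Count byte zero-extension idioms in a list of (mnemonic, [operands])."""
    shift = xlen - 8  # shifting left then right by this amount keeps the low byte
    count = 0
    for i, (mnemonic, ops) in enumerate(instrs):
        if mnemonic == "andi" and int(ops[2], 0) == 255:
            count += 1
        elif mnemonic == "slli" and int(ops[2], 0) == shift and i + 1 < len(instrs):
            nxt, nops = instrs[i + 1]
            # The logical right shift must read and write the register just shifted.
            if (nxt == "srli" and int(nops[2], 0) == shift
                    and nops[0] == ops[0] and nops[1] == ops[0]):
                count += 1
    return count

if __name__ == "__main__":
    sample = [
        ("andi", ["a0", "a0", "255"]),
        ("slli", ["a1", "a2", "0x18"]),
        ("srli", ["a1", "a1", "0x18"]),
        ("slli", ["a3", "a3", "2"]),
    ]
    print(count_zext_b(sample))
```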
Questions?
