SlideShare a Scribd company logo
1 of 32
Download to read offline
Vector Codegen in the RISC-V
Backend
Luke Lau
11th October 2023
Agenda
● Overview of RISC-V Vector ISA
● Modelling semantics in LLVM IR
● Vector codegen and lowering
● Middle-end vectorization passes
Arm NEON
add v0.s, v1.s, v2.s
%v = add <4 x i32> %x, %y
0 1 2 3
128 bits
4 x i32
● Scalable vector registers
○ 128-1024 bits in size (128-2048 in SVE2)
● Code is vector length agnostic (VLA)
add z0.s, z1.s, z2.s
Arm SVE
0 1 2 3
128 bits
4 x i32
µArch A
0 1 2 3
256 bits
8 x i32
µArch B
4 5 6 7
add z0.s, z1.s, z2.s
Arm SVE
0 1 2 3
Min 128 bits
n x 4 x i32 4 5 6 7 8 9 10 11 12 13
%v = add <???> %x, %y
add z0.s, z1.s, z2.s
%v = add <vscale x 4 x i32> %x, %y
● vscale: constant unknown at compile time
● On SVE, vscale = register size in bits / 128
Scalable vectors
0 1 2 3
Min 128 bits
vscale x 4 x i32 4 5 6 7 8 9 10 11 12 13
● Standard RISC-V extension that provides vector capabilities
● Ratified in 2021
● Adds 32 VLEN-sized vector registers v0-v31
○ 32 <= VLEN <= 65,536
○ Minimum supported VLEN in LLVM is 64 bits
○ vscale = VLEN / 64
RISC-V “V” Vector Extension (RVV)
vadd.vv v0, v1, v2
%v = add <vscale x 2 x i32> %x, %y
RISC-V “V” Vector Extension (RVV)
add z0.s, z1.s, z2.s
vsetvli t0, zero, e32, m1, ta, ma
● Element size SEW is encoded in VTYPE register
● Set VTYPE with vsetvli/vsetivli/vsetvl
add z0.s, z1.s, z2.s
Where do we encode the element type?
vadd.vv v1, v2, v3
vsetivli t0, 5, e32, m1, ta, ma
● Can control the number of elements to
operate on with the VL register
● Configured with vset[i]vli
Vector length VL
0 1 2 3 4 5 6 7
VL=5
VLEN=256
vadd.vv v1, v2, v3, v0.t
vsetivli t0, 5, e32, m1, ta, ma
● Can predicate operations by using a mask
register
● Always has to be in v0
Mask
vadd.vv v1, v2, v3, v0.t
vsetivli t0, 5, e32, m1, ta, ma
● Controlled by length multiplier LMUL, stored
in VTYPE
● At LMUL=1, can operate on VLEN / SEW
elements
Register grouping LMUL
v0 v1 v2 v3
LMUL=1
v4
…
vadd.vv v2, v4, v6, v0.t
vsetivli t0, 5, e32, m2, ta, ma
● Controlled by length multiplier LMUL, stored
in VTYPE
● At LMUL=1, can operate on VLEN / SEW
elements
● At LMUL=2, can operate on VLEN * 2 / SEW
elements
Register grouping LMUL
v0 v2
v0 v1 v2 v3
LMUL=1
LMUL=2
v4
…
v4
How can we express RVV semantics in LLVM IR?
RVV specific intrinsics
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(
<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x,
<vscale x 2 x i32> %y, <vscale x 2 x i1> %mask,
i64 %avl, i64 %policy
)
%v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(
<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl
)
Vector predication (VP) intrinsics
%v = add <vscale x 2 x i32> %x, %y
Plain LLVM IR
How can we express RVV semantics in LLVM IR?
RVV specific intrinsics
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(
<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x,
<vscale x 2 x i32> %y, <vscale x 2 x i1> %mask,
i64 %avl, i64 %policy
)
%v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(
<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl
)
Vector predication (VP) intrinsics
%v = add <vscale x 2 x i32> %x, %y
Plain LLVM IR
How do we control register grouping?
%v = add <vscale x 2 x i32> %x, %y vsetvli t0, zero, e32, m1, tu, mu
vadd.vv v0, v1, v2
%v = add <vscale x 4 x i32> %x, %y vsetvli t0, zero, e32, m2, tu, mu
vadd.vv v0, v1, v2
%v = add <vscale x 8 x i32> %x, %y vsetvli t0, zero, e32, m4, tu, mu
vadd.vv v0, v1, v2
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
MCInst
???
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
LLVM IR
SelectionDAG
MCInst
???
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
SelectionDAG
MCInst
%v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr,
%mask:vr, %avl:gpr, 5, 0
MachineInstr
Instruction selection
???
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
SelectionDAG
MCInst
MachineInstr
Instruction selection
VSETVLI %avl:gpr, e32, m1, 0, implicit-def $vl, implicit-def $vtype
%v:vr = VADD_VV %0:vr(tied-def 0), %x:vr, %y:vr, $v0,
implicit $vl, implicit $vtype
RISCVInsertVSETVLI
%v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr,
%mask:vr, %avl:gpr, 5, 0
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli zero, a0, e32, m1, tu, mu
vle32.v v1, (a1)
vsetvli zero, a0, e32, m1, tu, mu
vadd.vi v1, v1, 1
vsetvli zero, a0, e32, m1, tu, mu
vse32.v v1, (a1)
RISCVInsertVSETVLI
PseudoVLE32_V_M1 ...
PseudoVADD_VI_M1 ...
PseudoVSE32_V_M1 ...
vsetvli zero, a0, e32, m1, tu, mu
vle32.v v1, (a1)
vadd.vi v1, v1, 1
vse32.v v1, (a1)
RISCVInsertVSETVLI
PseudoVLE32_V_M1 ...
PseudoVADD_VI_M1 ...
PseudoVSE32_V_M1 ...
SLP Vectorization
● Vectorizes straight line code to vectorized code
● Generates fixed-length vectors:
○ Insert into scalable container type that’s guaranteed to fit
○ Legalize to RISCVISD::*_VL node
○ Pass vector length to VL operand
○ Then gets selected as a vector pseudo
%p1 = getelementptr i32, ptr %dest, i64 4
store i32 1, ptr %p1
%p2 = getelementptr i32, ptr %dest, i64 5
store i32 1, ptr %p2
%p1 = getelementptr i32, ptr %dest, i64 4
store <2 x i32> <i32 1, i32 1>, ptr %p
SLP Vectorization
● Some things in RVV are more expensive than other targets!
li a1, 1
sw a1, 16(a0)
sw a1, 20(a0)
addi a0, a0, 16
vsetivli zero, 2, e32, mf2, ta, ma
vmv.v.i v8, 1
vse32.v v8, (a0)
SLP Vectorization
li a1, 1
sw a1, 16(a0)
sw a1, 20(a0)
; SVE
movi v0.2s, #1
str d0, [x0, #16]
● Some things in RVV are more expensive than other targets!
○ No addressing modes in vector load/stores
○ Overhead of vsetvli not costed for
○ Inserts and extracts can be expensive on larger LMULs
○ Which means constant materialization of vectors can be expensive
● Work done to improve cost model
● Recently enabled by default in LLVM 17 now that it’s overall profitable
addi a0, a0, 16
vsetivli zero, 2, e32, mf2, ta, ma
vmv.v.i v8, 1
vse32.v v8, (a0)
Loop Vectorization
f:
blez a1, .exit
li a2, 8
bgeu a1, a2, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
andi a2, a1, -8
vsetivli zero, 8, e32, m2, ta, ma
mv a3, a2
mv a4, a0
.vector.body:
vle32.v v8, (a4)
vadd.vi v8, v8, 1
vse32.v v8, (a4)
addi a3, a3, -8
addi a4, a4, 32
bnez a3, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addi a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
for (int i = 0; i < n; i++)
x[i]++;
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but
need to account for register pressure
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but
need to account for register pressure
● By default scalar epilogue is emitted…
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
li a2, 0
csrr a5, vlenb
srli a3, a5, 1
neg a4, a3
add a6, a3, a1
addi a6, a6, -1
and a4, a6, a4
slli a5, a5, 1
vsetvli a6, zero, e64, m4, ta, ma
vid.v v8
.vector.body:
vsetvli zero, zero, e64, m4, ta, ma
vsaddu.vx v12, v8, a2
vmsltu.vx v0, v12, a1
vle32.v v12, (a0), v0.t
vsetvli zero, zero, e32, m2, ta, ma
vadd.vi v12, v12, 1
vse32.v v12, (a0), v0.t
add a2, a2, a3
add a0, a0, a5
bne a4, a2, .vector.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but need
to account for register pressure
● By default scalar epilogue is emitted…
○ But predicated tail folding can be enabled
with -prefer-predicate-over-epilogue
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
.vector.body:
vsetvli t0, a1, e32, m2, ta, ma
vle32.v v12, (a0)
vadd.vi v12, v12, 1
vse32.v v12, (a0)
sub a1, a1, t0
add a0, a0, t0
bnez a1, .vector.body
.exit:
ret
for (int i = 0; i < n; i++)
x[i]++;
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but need
to account for register pressure
● By default scalar epilogue is emitted…
○ But predicated tail folding can be enabled
with -prefer-predicate-over-epilogue
○ Work is underway to perform tail folding
via VL (D99750)
■ Via VP intrinsics
What next?
● Teach loop vectorizer to dynamically select an LMUL
● Round out vector predication support
○ Lift InstCombine/DAGCombine to handle VP intrinsics
● Handle scalable interleaved accesses with more than 2 groups
○ Take advantage of segmented loads/stores: vlseg/vsseg
● Data dependent loop exits
○ Take advantage of fault-only-first loads: vle32ff.v
● Use VL to allow SLP to vectorize non-power-of-2 VFs
Vector Codegen in the RISC-V Backend

More Related Content

Similar to Vector Codegen in the RISC-V Backend

Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216Linaro
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...Positive Hack Days
 
Дмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформыДмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформыDevGAMM Conference
 
RSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESRSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESacijjournal
 
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdfSantoshDeshmukh36
 
Code vectorization for mobile devices
Code vectorization for mobile devicesCode vectorization for mobile devices
Code vectorization for mobile devicesSt1X
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Daniel Lemire
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errorsPVS-Studio
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXLLinaro
 
PVS-Studio Meets Octave
PVS-Studio Meets Octave PVS-Studio Meets Octave
PVS-Studio Meets Octave PVS-Studio
 
Improving Android Performance at Mobiconf 2014
Improving Android Performance at Mobiconf 2014Improving Android Performance at Mobiconf 2014
Improving Android Performance at Mobiconf 2014Raimon Ràfols
 
Wannier90: Band Structures, Tips and Tricks
Wannier90: Band Structures, Tips and TricksWannier90: Band Structures, Tips and Tricks
Wannier90: Band Structures, Tips and TricksJonathan Skelton
 

Similar to Vector Codegen in the RISC-V Backend (20)

Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
 
Дмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформыДмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформы
 
RSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESRSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENES
 
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
 
Code vectorization for mobile devices
Code vectorization for mobile devicesCode vectorization for mobile devices
Code vectorization for mobile devices
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)
 
Verilog
VerilogVerilog
Verilog
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errors
 
LLVM Overview
LLVM OverviewLLVM Overview
LLVM Overview
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
 
PVS-Studio Meets Octave
PVS-Studio Meets Octave PVS-Studio Meets Octave
PVS-Studio Meets Octave
 
Improving Android Performance at Mobiconf 2014
Improving Android Performance at Mobiconf 2014Improving Android Performance at Mobiconf 2014
Improving Android Performance at Mobiconf 2014
 
Wannier90: Band Structures, Tips and Tricks
Wannier90: Band Structures, Tips and TricksWannier90: Band Structures, Tips and Tricks
Wannier90: Band Structures, Tips and Tricks
 
numdoc
numdocnumdoc
numdoc
 
201707 SER332 Lecture 15
201707 SER332 Lecture 15  201707 SER332 Lecture 15
201707 SER332 Lecture 15
 
node-rpi-ws281x
node-rpi-ws281xnode-rpi-ws281x
node-rpi-ws281x
 
Certification
CertificationCertification
Certification
 

More from Igalia

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEIgalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesIgalia
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceIgalia
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfIgalia
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JITIgalia
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!Igalia
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerIgalia
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in MesaIgalia
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIgalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera LinuxIgalia
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVMIgalia
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsIgalia
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesIgalia
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSIgalia
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webIgalia
 
Replacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersReplacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersIgalia
 
I'm not an AMD expert, but...
I'm not an AMD expert, but...I'm not an AMD expert, but...
I'm not an AMD expert, but...Igalia
 
Status of Vulkan on Raspberry
Status of Vulkan on RaspberryStatus of Vulkan on Raspberry
Status of Vulkan on RaspberryIgalia
 

More from Igalia (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPE
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded Devices
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to Maintenance
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdf
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JIT
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
 
Replacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersReplacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shaders
 
I'm not an AMD expert, but...
I'm not an AMD expert, but...I'm not an AMD expert, but...
I'm not an AMD expert, but...
 
Status of Vulkan on Raspberry
Status of Vulkan on RaspberryStatus of Vulkan on Raspberry
Status of Vulkan on Raspberry
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Vector Codegen in the RISC-V Backend

  • 1. Vector Codegen in the RISC-V Backend Luke Lau 11th October 2023
  • 2. Agenda ● Overview of RISC-V Vector ISA ● Modelling semantics in LLVM IR ● Vector codegen and lowering ● Middle-end vectorization passes
  • 3. Arm NEON add v0.s, v1.s, v2.s %v = add <4 x i32> %x, %y 0 1 2 3 128 bits 4 x i32
  • 4. ● Scalable vector registers ○ 128-1024 bits in size (128-2048 in SVE2) ● Code is vector length agnostic (VLA) add z0.s, z1.s, z2.s Arm SVE 0 1 2 3 128 bits 4 x i32 µArch A 0 1 2 3 256 bits 8 x i32 µArch B 4 5 6 7
  • 5. add z0.s, z1.s, z2.s Arm SVE 0 1 2 3 Min 128 bits n x 4 x i32 4 5 6 7 8 9 10 11 12 13 %v = add <???> %x, %y
  • 6. add z0.s, z1.s, z2.s %v = add <vscale x 4 x i32> %x, %y ● vscale: constant unknown at compile time ● On SVE, vscale = register size in bits / 128 Scalable vectors 0 1 2 3 Min 128 bits vscale x 4 x i32 4 5 6 7 8 9 10 11 12 13
  • 7. ● Standard RISC-V extension that provides vector capabilities ● Ratified in 2021 ● Adds 32 VLEN-sized vector registers v0-v31 ○ 32 <= VLEN <= 65,536 ○ Minimum supported VLEN in LLVM is 64 bits ○ vscale = VLEN / 64 RISC-V “V” Vector Extension (RVV)
  • 8. vadd.vv v0, v1, v2 %v = add <vscale x 2 x i32> %x, %y RISC-V “V” Vector Extension (RVV) add z0.s, z1.s, z2.s vsetvli t0, zero, e32, m1, ta, ma ● Element size SEW is encoded in VTYPE register ● Set VTYPE with vsetvli/vsetivli/vsetvl add z0.s, z1.s, z2.s Where do we encode the element type?
  • 9. vadd.vv v1, v2, v3 vsetivli t0, 5, e32, m1, ta, ma ● Can control the number of elements to operate on with the VL register ● Configured with vset[i]vli Vector length VL 0 1 2 3 4 5 6 7 VL=5 VLEN=256
  • 10. vadd.vv v1, v2, v3, v0.t vsetivli t0, 5, e32, m1, ta, ma ● Can predicate operations by using a mask register ● Always has to be in v0 Mask
  • 11. vadd.vv v1, v2, v3, v0.t vsetivli t0, 5, e32, m1, ta, ma ● Controlled by length multiplier LMUL, stored in VTYPE ● At LMUL=1, can operate on VLEN / SEW elements Register grouping LMUL v0 v1 v2 v3 LMUL=1 v4 …
  • 12. vadd.vv v2, v4, v6, v0.t vsetivli t0, 5, e32, m2, ta, ma ● Controlled by length multiplier LMUL, stored in VTYPE ● At LMUL=1, can operate on VLEN / SEW elements ● At LMUL=2, can operate on VLEN * 2 / SEW elements Register grouping LMUL v0 v2 v0 v1 v2 v3 LMUL=1 LMUL=2 v4 … v4
  • 13. How can we express RVV semantics in LLVM IR? RVV specific intrinsics %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32( <vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i64 %avl, i64 %policy ) %v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32( <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl ) Vector predication (VP) intrinsics %v = add <vscale x 2 x i32> %x, %y Plain LLVM IR
  • 14. How can we express RVV semantics in LLVM IR? RVV specific intrinsics %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32( <vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i64 %avl, i64 %policy ) %v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32( <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl ) Vector predication (VP) intrinsics %v = add <vscale x 2 x i32> %x, %y Plain LLVM IR
  • 15. How do we control register grouping? %v = add <vscale x 2 x i32> %x, %y vsetvli t0, zero, e32, m1, tu, mu vadd.vv v0, v1, v2 %v = add <vscale x 4 x i32> %x, %y vsetvli t0, zero, e32, m2, tu, mu vadd.vv v0, v1, v2 %v = add <vscale x 8 x i32> %x, %y vsetvli t0, zero, e32, m4, tu, mu vadd.vv v0, v1, v2
  • 16. %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...) vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR MCInst ???
  • 17. %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...) vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 LLVM IR SelectionDAG MCInst ???
  • 18. vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR SelectionDAG MCInst %v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr, %mask:vr, %avl:gpr, 5, 0 MachineInstr Instruction selection ??? v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
  • 19. vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR SelectionDAG MCInst MachineInstr Instruction selection VSETVLI %avl:gpr, e32, m1, 0, implicit-def $vl, implicit-def $vtype %v:vr = VADD_VV %0:vr(tied-def 0), %x:vr, %y:vr, $v0, implicit $vl, implicit $vtype RISCVInsertVSETVLI %v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr, %mask:vr, %avl:gpr, 5, 0 v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
  • 20. vsetvli zero, a0, e32, m1, tu, mu vle32.v v1, (a1) vsetvli zero, a0, e32, m1, tu, mu vadd.vi v1, v1, 1 vsetvli zero, a0, e32, m1, tu, mu vse32.v v1, (a1) RISCVInsertVSETVLI PseudoVLE32_V_M1 ... PseudoVADD_VI_M1 ... PseudoVSE32_V_M1 ...
  • 21. vsetvli zero, a0, e32, m1, tu, mu vle32.v v1, (a1) vadd.vi v1, v1, 1 vse32.v v1, (a1) RISCVInsertVSETVLI PseudoVLE32_V_M1 ... PseudoVADD_VI_M1 ... PseudoVSE32_V_M1 ...
  • 22. SLP Vectorization ● Vectorizes straight line code to vectorized code ● Generates fixed-length vectors: ○ Insert into scalable container type that’s guaranteed to fit ○ Legalize to RISCVISD::*_VL node ○ Pass vector length to VL operand ○ Then gets selected as a vector pseudo %p1 = getelementptr i32, ptr %dest, i64 4 store i32 1, ptr %p1 %p2 = getelementptr i32, ptr %dest, i64 5 store i32 1, ptr %p2 %p1 = getelementptr i32, ptr %dest, i64 4 store <2 x i32> <i32 1, i32 1>, ptr %p
  • 23. SLP Vectorization ● Some things in RVV are more expensive than other targets! li a1, 1 sw a1, 16(a0) sw a1, 20(a0) addi a0, a0, 16 vsetivli zero, 2, e32, mf2, ta, ma vmv.v.i v8, 1 vse32.v v8, (a0)
  • 24. SLP Vectorization li a1, 1 sw a1, 16(a0) sw a1, 20(a0) ; SVE movi v0.2s, #1 str d0, [x0, #16] ● Some things in RVV are more expensive than other targets! ○ No addressing modes in vector load/stores ○ Overhead of vsetvli not costed for ○ Inserts and extracts can be expensive on larger LMULs ○ Which means constant materialization of vectors can be expensive ● Work done to improve cost model ● Recently enabled by default in LLVM 17 now that it’s overall profitable addi a0, a0, 16 vsetivli zero, 2, e32, mf2, ta, ma vmv.v.i v8, 1 vse32.v v8, (a0)
  • 25. Loop Vectorization f: blez a1, .exit li a2, 8 bgeu a1, a2, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: andi a2, a1, -8 vsetivli zero, 8, e32, m2, ta, ma mv a3, a2 mv a4, a0 .vector.body: vle32.v v8, (a4) vadd.vi v8, v8, 1 vse32.v v8, (a4) addi a3, a3, -8 addi a4, a4, 32 bnez a3, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addi a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported for (int i = 0; i < n; i++) x[i]++;
  • 26. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret for (int i = 0; i < n; i++) x[i]++; ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default
  • 27. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure for (int i = 0; i < n; i++) x[i]++;
  • 28. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… for (int i = 0; i < n; i++) x[i]++;
  • 29. Loop Vectorization f: blez a1, .exit li a2, 0 csrr a5, vlenb srli a3, a5, 1 neg a4, a3 add a6, a3, a1 addi a6, a6, -1 and a4, a6, a4 slli a5, a5, 1 vsetvli a6, zero, e64, m4, ta, ma vid.v v8 .vector.body: vsetvli zero, zero, e64, m4, ta, ma vsaddu.vx v12, v8, a2 vmsltu.vx v0, v12, a1 vle32.v v12, (a0), v0.t vsetvli zero, zero, e32, m2, ta, ma vadd.vi v12, v12, 1 vse32.v v12, (a0), v0.t add a2, a2, a3 add a0, a0, a5 bne a4, a2, .vector.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… ○ But predicated tail folding can be enabled with -prefer-predicate-over-epilogue for (int i = 0; i < n; i++) x[i]++;
  • 30. Loop Vectorization f: blez a1, .exit .vector.body: vsetvli t0, a1, e32, m2, ta, ma vle32.v v12, (a0) vadd.vi v12, v12, 1 vse32.v v12, (a0) sub a1, a1, t0 add a0, a0, t0 bnez a1, .vector.body .exit: ret for (int i = 0; i < n; i++) x[i]++; ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… ○ But predicated tail folding can be enabled with -prefer-predicate-over-epilogue ○ Work is underway to perform tail folding via VL (D99750) ■ Via VP intrinsics
  • 31. What next? ● Teach loop vectorizer to dynamically select an LMUL ● Round out vector predication support ○ Lift InstCombine/DAGCombine to handle VP intrinsics ● Handle scalable interleaved accesses with more than 2 groups ○ Take advantage of segmented loads/stores: vlseg/vsseg ● Data dependent loop exits ○ Take advantage of fault-only-first loads: vle32ff.v ● Use VL to allow SLP to vectorize non-power-of-2 VFs