SlideShare a Scribd company logo
Vector Codegen in the RISC-V
Backend
Luke Lau
11th October 2023
Agenda
● Overview of RISC-V Vector ISA
● Modelling semantics in LLVM IR
● Vector codegen and lowering
● Middle-end vectorization passes
Arm NEON
add v0.s, v1.s, v2.s
%v = add <4 x i32> %x, %y
0 1 2 3
128 bits
4 x i32
● Scalable vector registers
○ 128-1024 bits in size (128-2048 in SVE2)
● Code is vector length agnostic (VLA)
add z0.s, z1.s, z2.s
Arm SVE
0 1 2 3
128 bits
4 x i32
µArch A
0 1 2 3
256 bits
8 x i32
µArch B
4 5 6 7
add z0.s, z1.s, z2.s
Arm SVE
0 1 2 3
Min 128 bits
n x 4 x i32 4 5 6 7 8 9 10 11 12 13
%v = add <???> %x, %y
add z0.s, z1.s, z2.s
%v = add <vscale x 4 x i32> %x, %y
● vscale: constant unknown at compile time
● On SVE, vscale = register size in bits / 128
Scalable vectors
0 1 2 3
Min 128 bits
vscale x 4 x i32 4 5 6 7 8 9 10 11 12 13
● Standard RISC-V extension that provides vector capabilities
● Ratified in 2021
● Adds 32 VLEN-sized vector registers v0-v31
○ 32 <= VLEN <= 65,536
○ Minimum supported VLEN in LLVM is 64 bits
○ vscale = VLEN / 64
RISC-V “V” Vector Extension (RVV)
vadd.vv v0, v1, v2
%v = add <vscale x 2 x i32> %x, %y
RISC-V “V” Vector Extension (RVV)
add z0.s, z1.s, z2.s
vsetvli t0, zero, e32, m1, ta, ma
● Element size SEW is encoded in VTYPE register
● Set VTYPE with vsetvli/vsetivli/vsetvl
add z0.s, z1.s, z2.s
Where do we encode the element type?
vadd.vv v1, v2, v3
vsetivli t0, 5, e32, m1, ta, ma
● Can control the number of elements to
operate on with the VL register
● Configured with vset[i]vli
Vector length VL
0 1 2 3 4 5 6 7
VL=5
VLEN=256
vadd.vv v1, v2, v3, v0.t
vsetivli t0, 5, e32, m1, ta, ma
● Can predicate operations by using a mask
register
● Always has to be in v0
Mask
vadd.vv v1, v2, v3, v0.t
vsetivli t0, 5, e32, m1, ta, ma
● Controlled by length multiplier LMUL, stored
in VTYPE
● At LMUL=1, can operate on VLEN / SEW
elements
Register grouping LMUL
v0 v1 v2 v3
LMUL=1
v4
…
vadd.vv v2, v4, v6, v0.t
vsetivli t0, 5, e32, m2, ta, ma
● Controlled by length multiplier LMUL, stored
in VTYPE
● At LMUL=1, can operate on VLEN / SEW
elements
● At LMUL=2, can operate on VLEN * 2 / SEW
elements
Register grouping LMUL
v0 v2
v0 v1 v2 v3
LMUL=1
LMUL=2
v4
…
v4
How can we express RVV semantics in LLVM IR?
RVV specific intrinsics
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(
<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x,
<vscale x 2 x i32> %y, <vscale x 2 x i1> %mask,
i64 %avl, i64 %policy
)
%v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(
<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl
)
Vector predication (VP) intrinsics
%v = add <vscale x 2 x i32> %x, %y
Plain LLVM IR
How can we express RVV semantics in LLVM IR?
RVV specific intrinsics
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(
<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x,
<vscale x 2 x i32> %y, <vscale x 2 x i1> %mask,
i64 %avl, i64 %policy
)
%v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(
<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl
)
Vector predication (VP) intrinsics
%v = add <vscale x 2 x i32> %x, %y
Plain LLVM IR
How do we control register grouping?
%v = add <vscale x 2 x i32> %x, %y vsetvli t0, zero, e32, m1, tu, mu
vadd.vv v0, v1, v2
%v = add <vscale x 4 x i32> %x, %y vsetvli t0, zero, e32, m2, tu, mu
vadd.vv v0, v1, v2
%v = add <vscale x 8 x i32> %x, %y vsetvli t0, zero, e32, m4, tu, mu
vadd.vv v0, v1, v2
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
MCInst
???
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
LLVM IR
SelectionDAG
MCInst
???
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
SelectionDAG
MCInst
%v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr,
%mask:vr, %avl:gpr, 5, 0
MachineInstr
Instruction selection
???
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli t0, a0, e32, m1, ta, ma
vadd.vv v1, v2, v3, v0.t
LLVM IR
SelectionDAG
MCInst
MachineInstr
Instruction selection
VSETVLI %avl:gpr, e32, m1, 0, implicit-def $vl, implicit-def $vtype
%v:vr = VADD_VV %0:vr(tied-def 0), %x:vr, %y:vr, $v0,
implicit $vl, implicit $vtype
RISCVInsertVSETVLI
%v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr,
%mask:vr, %avl:gpr, 5, 0
v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0
%v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
vsetvli zero, a0, e32, m1, tu, mu
vle32.v v1, (a1)
vsetvli zero, a0, e32, m1, tu, mu
vadd.vi v1, v1, 1
vsetvli zero, a0, e32, m1, tu, mu
vse32.v v1, (a1)
RISCVInsertVSETVLI
PseudoVLE32_V_M1 ...
PseudoVADD_VI_M1 ...
PseudoVSE32_V_M1 ...
vsetvli zero, a0, e32, m1, tu, mu
vle32.v v1, (a1)
vadd.vi v1, v1, 1
vse32.v v1, (a1)
RISCVInsertVSETVLI
PseudoVLE32_V_M1 ...
PseudoVADD_VI_M1 ...
PseudoVSE32_V_M1 ...
SLP Vectorization
● Vectorizes straight line code to vectorized code
● Generates fixed-length vectors:
○ Insert into scalable container type that’s guaranteed to fit
○ Legalize to RISCVISD::*_VL node
○ Pass vector length to VL operand
○ Then gets selected as a vector pseudo
%p1 = getelementptr i32, ptr %dest, i64 4
store i32 1, ptr %p1
%p2 = getelementptr i32, ptr %dest, i64 5
store i32 1, ptr %p2
%p1 = getelementptr i32, ptr %dest, i64 4
store <2 x i32> <i32 1, i32 1>, ptr %p
SLP Vectorization
● Some things in RVV are more expensive than other targets!
li a1, 1
sw a1, 16(a0)
sw a1, 20(a0)
addi a0, a0, 16
vsetivli zero, 2, e32, mf2, ta, ma
vmv.v.i v8, 1
vse32.v v8, (a0)
SLP Vectorization
li a1, 1
sw a1, 16(a0)
sw a1, 20(a0)
; SVE
movi v0.2s, #1
str d0, [x0, #16]
● Some things in RVV are more expensive than other targets!
○ No addressing modes in vector load/stores
○ Overhead of vsetvli not costed for
○ Inserts and extracts can be expensive on larger LMULs
○ Which means constant materialization of vectors can be expensive
● Work done to improve cost model
● Recently enabled by default in LLVM 17 now that it’s overall profitable
addi a0, a0, 16
vsetivli zero, 2, e32, mf2, ta, ma
vmv.v.i v8, 1
vse32.v v8, (a0)
Loop Vectorization
f:
blez a1, .exit
li a2, 8
bgeu a1, a2, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
andi a2, a1, -8
vsetivli zero, 8, e32, m2, ta, ma
mv a3, a2
mv a4, a0
.vector.body:
vle32.v v8, (a4)
vadd.vi v8, v8, 1
vse32.v v8, (a4)
addi a3, a3, -8
addi a4, a4, 32
bnez a3, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addi a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
for (int i = 0; i < n; i++)
x[i]++;
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but
need to account for register pressure
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
csrr a4, vlenb
srli a3, a4, 1
bgeu a1, a3, .vector.preheader
li a2, 0
j .scalar.preheader
.vector.preheader:
srli a2, a4, 3
slli a5, a2, 2
slli a2, a2, 32
sub a2, a2, a5
and a2, a2, a1
slli a4, a4, 1
vsetvli a5, zero, e32, m2, ta, ma
mv a5, a2
mv a6, a0
.vector.body:
vl2re32.v v8, (a6)
vadd.vi v8, v8, 1
vs2r.v v8, (a6)
sub a5, a5, a3
add a6, a6, a4
bnez a5, .vector.body
beq a2, a1, .exit
.scalar.preheader:
slli a3, a2, 2
add a0, a0, a3
sub a1, a1, a2
.scalar.body:
lw a2, 0(a0)
addiw a2, a2, 1
sw a2, 0(a0)
addi a1, a1, -1
addi a0, a0, 4
bnez a1, .scalar.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but
need to account for register pressure
● By default scalar epilogue is emitted…
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
li a2, 0
csrr a5, vlenb
srli a3, a5, 1
neg a4, a3
add a6, a3, a1
addi a6, a6, -1
and a4, a6, a4
slli a5, a5, 1
vsetvli a6, zero, e64, m4, ta, ma
vid.v v8
.vector.body:
vsetvli zero, zero, e64, m4, ta, ma
vsaddu.vx v12, v8, a2
vmsltu.vx v0, v12, a1
vle32.v v12, (a0), v0.t
vsetvli zero, zero, e32, m2, ta, ma
vadd.vi v12, v12, 1
vse32.v v12, (a0), v0.t
add a2, a2, a3
add a0, a0, a5
bne a4, a2, .vector.body
.exit:
ret
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but need
to account for register pressure
● By default scalar epilogue is emitted…
○ But predicated tail folding can be enabled
with -prefer-predicate-over-epilogue
for (int i = 0; i < n; i++)
x[i]++;
Loop Vectorization
f:
blez a1, .exit
.vector.body:
vsetvli t0, a1, e32, m2, ta, ma
vle32.v v12, (a0)
vadd.vi v12, v12, 1
vse32.v v12, (a0)
sub a1, a1, t0
add a0, a0, t0
bnez a1, .vector.body
.exit:
ret
for (int i = 0; i < n; i++)
x[i]++;
● Vectorizing with fixed-length vectors is
supported
● Scalable vectorization is enabled by default
● Uses LMUL=2 by default
○ Could be smarter and increase it, but need
to account for register pressure
● By default scalar epilogue is emitted…
○ But predicated tail folding can be enabled
with -prefer-predicate-over-epilogue
○ Work is underway to perform tail folding
via VL (D99750)
■ Via VP intrinsics
What next?
● Teach loop vectorizer to dynamically select an LMUL
● Round out vector predication support
○ Lift InstCombine/DAGCombine to handle VP intrinsics
● Handle scalable interleaved accesses with more than 2 groups
○ Take advantage of segmented loads/stores: vlseg/vsseg
● Data dependent loop exits
○ Take advantage of fault-only-first loads: vle32ff.v
● Use VL to allow SLP to vectorize non-power-of-2 VFs
Vector Codegen in the RISC-V Backend

More Related Content

What's hot

F#入門 ~関数プログラミングとは何か~
F#入門 ~関数プログラミングとは何か~F#入門 ~関数プログラミングとは何か~
F#入門 ~関数プログラミングとは何か~
Nobuhisa Koizumi
 
よくわかるHopscotch hashing
よくわかるHopscotch hashingよくわかるHopscotch hashing
よくわかるHopscotch hashingKumazaki Hiroki
 
Joykoly all ict mcq pdf (www.onlinebcs.com)
Joykoly all ict mcq pdf (www.onlinebcs.com)Joykoly all ict mcq pdf (www.onlinebcs.com)
Joykoly all ict mcq pdf (www.onlinebcs.com)
Itmona
 
楕円曲線と暗号
楕円曲線と暗号楕円曲線と暗号
楕円曲線と暗号
MITSUNARI Shigeo
 
AvailabilityZoneとHostAggregate
AvailabilityZoneとHostAggregateAvailabilityZoneとHostAggregate
AvailabilityZoneとHostAggregate
Hiroki Ishikawa
 
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とは
分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とは
Cloudera Japan
 
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
confluent
 
Proximal Policy Optimization
Proximal Policy OptimizationProximal Policy Optimization
Proximal Policy Optimization
ShubhaManikarnike
 
TLA+についての話
TLA+についての話TLA+についての話
TLA+についての話
takahashi takahashi
 
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
Amazon Web Services
 
[Public] gerrit concepts and workflows
[Public] gerrit   concepts and workflows[Public] gerrit   concepts and workflows
[Public] gerrit concepts and workflows
Yanbin Kong
 
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ScyllaDB
 
オセロの終盤ソルバーを100倍以上高速化した話
オセロの終盤ソルバーを100倍以上高速化した話オセロの終盤ソルバーを100倍以上高速化した話
オセロの終盤ソルバーを100倍以上高速化した話
京大 マイコンクラブ
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search Performance
Lucidworks (Archived)
 
Introduction to Git and Github
Introduction to Git and Github Introduction to Git and Github
Introduction to Git and Github
Max Claus Nunes
 
動的計画法
動的計画法動的計画法
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
MITSUNARI Shigeo
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
Lucidworks
 
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
Rob Emanuele
 

What's hot (20)

F#入門 ~関数プログラミングとは何か~
F#入門 ~関数プログラミングとは何か~F#入門 ~関数プログラミングとは何か~
F#入門 ~関数プログラミングとは何か~
 
よくわかるHopscotch hashing
よくわかるHopscotch hashingよくわかるHopscotch hashing
よくわかるHopscotch hashing
 
Joykoly all ict mcq pdf (www.onlinebcs.com)
Joykoly all ict mcq pdf (www.onlinebcs.com)Joykoly all ict mcq pdf (www.onlinebcs.com)
Joykoly all ict mcq pdf (www.onlinebcs.com)
 
楕円曲線と暗号
楕円曲線と暗号楕円曲線と暗号
楕円曲線と暗号
 
AvailabilityZoneとHostAggregate
AvailabilityZoneとHostAggregateAvailabilityZoneとHostAggregate
AvailabilityZoneとHostAggregate
 
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とは
分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とは
 
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
 
Proximal Policy Optimization
Proximal Policy OptimizationProximal Policy Optimization
Proximal Policy Optimization
 
TLA+についての話
TLA+についての話TLA+についての話
TLA+についての話
 
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
A story of Netflix and AB Testing in the User Interface using DynamoDB - DAT3...
 
[Public] gerrit concepts and workflows
[Public] gerrit   concepts and workflows[Public] gerrit   concepts and workflows
[Public] gerrit concepts and workflows
 
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
 
オセロの終盤ソルバーを100倍以上高速化した話
オセロの終盤ソルバーを100倍以上高速化した話オセロの終盤ソルバーを100倍以上高速化した話
オセロの終盤ソルバーを100倍以上高速化した話
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search Performance
 
TaPL9
TaPL9TaPL9
TaPL9
 
Introduction to Git and Github
Introduction to Git and Github Introduction to Git and Github
Introduction to Git and Github
 
動的計画法
動的計画法動的計画法
動的計画法
 
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
準同型暗号の実装とMontgomery, Karatsuba, FFT の性能
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
2021 Dask Summit - Using STAC to catalog SpatioTemporal datasets
 

Similar to Vector Codegen in the RISC-V Backend

Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
AMD Developer Central
 
NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)
Igalia
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
Roberto Agostino Vitillo
 
Ecad &amp;vlsi lab 18
Ecad &amp;vlsi lab 18Ecad &amp;vlsi lab 18
Ecad &amp;vlsi lab 18
Shekar Midde
 
Experiments with C++11
Experiments with C++11Experiments with C++11
Experiments with C++11
Andreas Bærentzen
 
Adobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGALAdobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGAL
Daniel Freeman
 
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Igalia
 
Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216
Linaro
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
Positive Hack Days
 
Дмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформыДмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформы
DevGAMM Conference
 
RSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESRSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENES
acijjournal
 
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
SantoshDeshmukh36
 
Code vectorization for mobile devices
Code vectorization for mobile devicesCode vectorization for mobile devices
Code vectorization for mobile devices
St1X
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)
Daniel Lemire
 
Verilog
VerilogVerilog
Verilog
abkvlsi
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
AMD Developer Central
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errors
PVS-Studio
 
LLVM Overview
LLVM OverviewLLVM Overview
LLVM Overview
Constantin Lungu
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
Linaro
 

Similar to Vector Codegen in the RISC-V Backend (20)

Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Ecad &amp;vlsi lab 18
Ecad &amp;vlsi lab 18Ecad &amp;vlsi lab 18
Ecad &amp;vlsi lab 18
 
Experiments with C++11
Experiments with C++11Experiments with C++11
Experiments with C++11
 
Adobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGALAdobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGAL
 
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
 
Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216Automatic Vectorization in ART (Android RunTime) - SFO17-216
Automatic Vectorization in ART (Android RunTime) - SFO17-216
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
 
Дмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформыДмитрий Вовк: Векторизация кода под мобильные платформы
Дмитрий Вовк: Векторизация кода под мобильные платформы
 
RSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESRSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENES
 
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
3CA949D5-0399-46AE-8A97-F01C8599B2DA.pdf
 
Code vectorization for mobile devices
Code vectorization for mobile devicesCode vectorization for mobile devices
Code vectorization for mobile devices
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)
 
Verilog
VerilogVerilog
Verilog
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errors
 
LLVM Overview
LLVM OverviewLLVM Overview
LLVM Overview
 
SFO15-500: VIXL
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
 

More from Igalia

The journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland supportThe journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland support
Igalia
 
Sustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web EcosystemSustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web Ecosystem
Igalia
 
Status of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKitStatus of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKit
Igalia
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Igalia
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPE
Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Igalia
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded Devices
Igalia
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to Maintenance
Igalia
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdf
Igalia
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JIT
Igalia
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!
Igalia
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Igalia
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
Igalia
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
Igalia
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
Igalia
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
Igalia
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
Igalia
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Igalia
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
Igalia
 

More from Igalia (20)

The journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland supportThe journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland support
 
Sustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web EcosystemSustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web Ecosystem
 
Status of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKitStatus of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKit
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPE
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded Devices
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to Maintenance
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdf
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JIT
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
 

Recently uploaded

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 

Recently uploaded (20)

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 

Vector Codegen in the RISC-V Backend

  • 1. Vector Codegen in the RISC-V Backend Luke Lau 11th October 2023
  • 2. Agenda ● Overview of RISC-V Vector ISA ● Modelling semantics in LLVM IR ● Vector codegen and lowering ● Middle-end vectorization passes
  • 3. Arm NEON add v0.s, v1.s, v2.s %v = add <4 x i32> %x, %y 0 1 2 3 128 bits 4 x i32
  • 4. ● Scalable vector registers ○ 128-1024 bits in size (128-2048 in SVE2) ● Code is vector length agnostic (VLA) add z0.s, z1.s, z2.s Arm SVE 0 1 2 3 128 bits 4 x i32 µArch A 0 1 2 3 256 bits 8 x i32 µArch B 4 5 6 7
  • 5. add z0.s, z1.s, z2.s Arm SVE 0 1 2 3 Min 128 bits n x 4 x i32 4 5 6 7 8 9 10 11 12 13 %v = add <???> %x, %y
  • 6. add z0.s, z1.s, z2.s %v = add <vscale x 4 x i32> %x, %y ● vscale: constant unknown at compile time ● On SVE, vscale = register size in bits / 128 Scalable vectors 0 1 2 3 Min 128 bits vscale x 4 x i32 4 5 6 7 8 9 10 11 12 13
  • 7. ● Standard RISC-V extension that provides vector capabilities ● Ratified in 2021 ● Adds 32 VLEN-sized vector registers v0-v31 ○ 32 <= VLEN <= 65,536 ○ Minimum supported VLEN in LLVM is 64 bits ○ vscale = VLEN / 64 RISC-V “V” Vector Extension (RVV)
  • 8. vadd.vv v0, v1, v2 %v = add <vscale x 2 x i32> %x, %y RISC-V “V” Vector Extension (RVV) add z0.s, z1.s, z2.s vsetvli t0, zero, e32, m1, ta, ma ● Element size SEW is encoded in VTYPE register ● Set VTYPE with vsetvli/vsetivli/vsetvl add z0.s, z1.s, z2.s Where do we encode the element type?
  • 9. vadd.vv v1, v2, v3 vsetivli t0, 5, e32, m1, ta, ma ● Can control the number of elements to operate on with the VL register ● Configured with vset[i]vli Vector length VL 0 1 2 3 4 5 6 7 VL=5 VLEN=256
  • 10. vadd.vv v1, v2, v3, v0.t vsetivli t0, 5, e32, m1, ta, ma ● Can predicate operations by using a mask register ● Always has to be in v0 Mask
  • 11. vadd.vv v1, v2, v3, v0.t vsetivli t0, 5, e32, m1, ta, ma ● Controlled by length multiplier LMUL, stored in VTYPE ● At LMUL=1, can operate on VLEN / SEW elements Register grouping LMUL v0 v1 v2 v3 LMUL=1 v4 …
  • 12. vadd.vv v2, v4, v6, v0.t vsetivli t0, 5, e32, m2, ta, ma ● Controlled by length multiplier LMUL, stored in VTYPE ● At LMUL=1, can operate on VLEN / SEW elements ● At LMUL=2, can operate on VLEN * 2 / SEW elements Register grouping LMUL v0 v2 v0 v1 v2 v3 LMUL=1 LMUL=2 v4 … v4
  • 13. How can we express RVV semantics in LLVM IR? RVV specific intrinsics %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32( <vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i64 %avl, i64 %policy ) %v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32( <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl ) Vector predication (VP) intrinsics %v = add <vscale x 2 x i32> %x, %y Plain LLVM IR
  • 14. How can we express RVV semantics in LLVM IR? RVV specific intrinsics %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32( <vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i64 %avl, i64 %policy ) %v = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32( <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %mask, i32 %evl ) Vector predication (VP) intrinsics %v = add <vscale x 2 x i32> %x, %y Plain LLVM IR
  • 15. How do we control register grouping? %v = add <vscale x 2 x i32> %x, %y vsetvli t0, zero, e32, m1, tu, mu vadd.vv v0, v1, v2 %v = add <vscale x 4 x i32> %x, %y vsetvli t0, zero, e32, m2, tu, mu vadd.vv v0, v1, v2 %v = add <vscale x 8 x i32> %x, %y vsetvli t0, zero, e32, m4, tu, mu vadd.vv v0, v1, v2
  • 16. %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...) vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR MCInst ???
  • 17. %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...) vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 LLVM IR SelectionDAG MCInst ???
  • 18. vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR SelectionDAG MCInst %v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr, %mask:vr, %avl:gpr, 5, 0 MachineInstr Instruction selection ??? v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
  • 19. vsetvli t0, a0, e32, m1, ta, ma vadd.vv v1, v2, v3, v0.t LLVM IR SelectionDAG MCInst MachineInstr Instruction selection VSETVLI %avl:gpr, e32, m1, 0, implicit-def $vl, implicit-def $vtype %v:vr = VADD_VV %0:vr(tied-def 0), %x:vr, %y:vr, $v0, implicit $vl, implicit $vtype RISCVInsertVSETVLI %v:vr = PseudoVADD_VV_M1_MASK %passthru:vr, %x:vr, %y:vr, %mask:vr, %avl:gpr, 5, 0 v:nxv2i32 = llvm.riscv.vadd x:nxv2i32, y:nxv2i32, mask:nxv2i1, vl:i32, 0 %v = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32(...)
  • 20. vsetvli zero, a0, e32, m1, tu, mu vle32.v v1, (a1) vsetvli zero, a0, e32, m1, tu, mu vadd.vi v1, v1, 1 vsetvli zero, a0, e32, m1, tu, mu vse32.v v1, (a1) RISCVInsertVSETVLI PseudoVLE32_V_M1 ... PseudoVADD_VI_M1 ... PseudoVSE32_V_M1 ...
  • 21. vsetvli zero, a0, e32, m1, tu, mu vle32.v v1, (a1) vadd.vi v1, v1, 1 vse32.v v1, (a1) RISCVInsertVSETVLI PseudoVLE32_V_M1 ... PseudoVADD_VI_M1 ... PseudoVSE32_V_M1 ...
  • 22. SLP Vectorization ● Vectorizes straight line code to vectorized code ● Generates fixed-length vectors: ○ Insert into scalable container type that’s guaranteed to fit ○ Legalize to RISCVISD::*_VL node ○ Pass vector length to VL operand ○ Then gets selected as a vector pseudo %p1 = getelementptr i32, ptr %dest, i64 4 store i32 1, ptr %p1 %p2 = getelementptr i32, ptr %dest, i64 5 store i32 1, ptr %p2 %p1 = getelementptr i32, ptr %dest, i64 4 store <2 x i32> <i32 1, i32 1>, ptr %p
  • 23. SLP Vectorization ● Some things in RVV are more expensive than other targets! li a1, 1 sw a1, 16(a0) sw a1, 20(a0) addi a0, a0, 16 vsetivli zero, 2, e32, mf2, ta, ma vmv.v.i v8, 1 vse32.v v8, (a0)
  • 24. SLP Vectorization li a1, 1 sw a1, 16(a0) sw a1, 20(a0) ; SVE movi v0.2s, #1 str d0, [x0, #16] ● Some things in RVV are more expensive than other targets! ○ No addressing modes in vector load/stores ○ Overhead of vsetvli not costed for ○ Inserts and extracts can be expensive on larger LMULs ○ Which means constant materialization of vectors can be expensive ● Work done to improve cost model ● Recently enabled by default in LLVM 17 now that it’s overall profitable addi a0, a0, 16 vsetivli zero, 2, e32, mf2, ta, ma vmv.v.i v8, 1 vse32.v v8, (a0)
  • 25. Loop Vectorization f: blez a1, .exit li a2, 8 bgeu a1, a2, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: andi a2, a1, -8 vsetivli zero, 8, e32, m2, ta, ma mv a3, a2 mv a4, a0 .vector.body: vle32.v v8, (a4) vadd.vi v8, v8, 1 vse32.v v8, (a4) addi a3, a3, -8 addi a4, a4, 32 bnez a3, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addi a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported for (int i = 0; i < n; i++) x[i]++;
  • 26. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret for (int i = 0; i < n; i++) x[i]++; ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default
  • 27. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure for (int i = 0; i < n; i++) x[i]++;
  • 28. Loop Vectorization f: blez a1, .exit csrr a4, vlenb srli a3, a4, 1 bgeu a1, a3, .vector.preheader li a2, 0 j .scalar.preheader .vector.preheader: srli a2, a4, 3 slli a5, a2, 2 slli a2, a2, 32 sub a2, a2, a5 and a2, a2, a1 slli a4, a4, 1 vsetvli a5, zero, e32, m2, ta, ma mv a5, a2 mv a6, a0 .vector.body: vl2re32.v v8, (a6) vadd.vi v8, v8, 1 vs2r.v v8, (a6) sub a5, a5, a3 add a6, a6, a4 bnez a5, .vector.body beq a2, a1, .exit .scalar.preheader: slli a3, a2, 2 add a0, a0, a3 sub a1, a1, a2 .scalar.body: lw a2, 0(a0) addiw a2, a2, 1 sw a2, 0(a0) addi a1, a1, -1 addi a0, a0, 4 bnez a1, .scalar.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… for (int i = 0; i < n; i++) x[i]++;
  • 29. Loop Vectorization f: blez a1, .exit li a2, 0 csrr a5, vlenb srli a3, a5, 1 neg a4, a3 add a6, a3, a1 addi a6, a6, -1 and a4, a6, a4 slli a5, a5, 1 vsetvli a6, zero, e64, m4, ta, ma vid.v v8 .vector.body: vsetvli zero, zero, e64, m4, ta, ma vsaddu.vx v12, v8, a2 vmsltu.vx v0, v12, a1 vle32.v v12, (a0), v0.t vsetvli zero, zero, e32, m2, ta, ma vadd.vi v12, v12, 1 vse32.v v12, (a0), v0.t add a2, a2, a3 add a0, a0, a5 bne a4, a2, .vector.body .exit: ret ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… ○ But predicated tail folding can be enabled with -prefer-predicate-over-epilogue for (int i = 0; i < n; i++) x[i]++;
  • 30. Loop Vectorization f: blez a1, .exit .vector.body: vsetvli t0, a1, e32, m2, ta, ma vle32.v v12, (a0) vadd.vi v12, v12, 1 vse32.v v12, (a0) sub a1, a1, t0 add a0, a0, t0 bnez a1, .vector.body .exit: ret for (int i = 0; i < n; i++) x[i]++; ● Vectorizing with fixed-length vectors is supported ● Scalable vectorization is enabled by default ● Uses LMUL=2 by default ○ Could be smarter and increase it, but need to account for register pressure ● By default scalar epilogue is emitted… ○ But predicated tail folding can be enabled with -prefer-predicate-over-epilogue ○ Work is underway to perform tail folding via VL (D99750) ■ Via VP intrinsics
  • 31. What next? ● Teach loop vectorizer to dynamically select an LMUL ● Round out vector predication support ○ Lift InstCombine/DAGCombine to handle VP intrinsics ● Handle scalable interleaved accesses with more than 2 groups ○ Take advantage of segmented loads/stores: vlseg/vsseg ● Data dependent loop exits ○ Take advantage of fault-only-first loads: vle32ff.v ● Use VL to allow SLP to vectorize non-power-of-2 VFs