Andes RISC-V vector extension demystified-tutorial

RISC-V Vector
Extension Demystified
December 10, 2020
Thang Tran, Ph.D.
Principal Engineer

2
Copyright© 2020 Andes Technology Corp.
Andes Corporate Overview
• Core R&D from AMD, DEC, Intel,
MIPS, nVidia, and Sun
Silicon Valley Tie
15-Year CPU IP Company
• IPO in 2017; HQ in Taiwan
• AndeStar™ V1-V3, V5 (RISC-V)
>1 Bn Annual Run Rate
of Andes-Embedded SoC
• ~300 customers in TW, CN, US,
EU, JP, KR
Founding Premier Member
of RISC-V International
• Chairing Task Groups
• Contributing to GNU, LLVM,
uBoot, glibc, Linux, etc.

3
V5 Adoptions: From MCU to Datacenters
• Edge to Cloud:
− ADAS
− AIoT
− Blockchain
− FPGA
− MCU
− Multimedia
− Security
− Wireless (BT/WiFi)
• 1 to 1000+ core
• 40nm to 7nm
• Many in AI
− Datacenter AI accelerators
− SSD: enterprise (& consumer)
− 5G macro/small cells
5G Macro

4
Agenda
• Andes overview
• Vector technology background
– SIMD/vector concept
– Vector processor basic
• RISC-V V extension ISA
– Basic
– CSR
– Memory operations
– Compute instruction format
• Sample codes
– Matrix multiplication
– Loads with RVV versions 0.8 and 1.0
• AndesCore™ NX27V introduction
• Summary

5
• ISA: Instruction Set Architecture
• GOPS: Giga Operations Per Second
• GFLOPS: Giga Floating-Point OPS
• XRF: Integer register file
• FRF: Floating-point register file
• VRF: Vector register file
• SIMD: Single Instruction Multiple Data
• MMX: Multi Media Extension
• SSE: Streaming SIMD Extension
• AVX: Advanced Vector Extension
• Configurable: parameters are fixed at build time, e.g. cache size
• Extensible: instructions can be added to the ISA, including custom instructions added by the customer
• Standard extension: the reserved codes in the ISA for special purposes, e.g. FP, DSP, …
• Programmable: parameters can be dynamically changed in the program
• ACE: Andes Custom Extension
• CSR: Control and Status Register
• SEW: Element Width (8-64)
• ELEN: Largest Element Width (32 or 64)
• XLEN: Scalar register length in bits (64)
• FLEN: FP register length in bits (16-64)
• VLEN: Vector register length in bits (128-512)
• LMUL: Register grouping multiple (1/8-8)
• EMUL: Effective LMUL
• VLMAX/MVL: Vector Length Max
• AVL/VL: Application Vector Length
Terminology

6
RISC-V & RVV Background
• An open processor architecture started by UC Berkeley
– Compact, modular, extensible
• RISC-V International (formerly RISC-V Foundation)
– RISC-V Foundation formed in 2015 to govern its growth
– 750+ members including industry and research institute/university
• RISC-V Vector Extension
– Defined in a Foundation Task Group
– Vector instruction set with scalable vector
– Scalable data sizes with 2x data expansion arithmetic
– 300+ vector instructions, including load/store, integer, fixed-point, and floating-point operations

7
RISC-V V Spec Status
• Version 1.0 (draft) was published in July 2020
• Note from the spec:
“This is a draft of the stable proposal for the vector extension specification to
be used for implementation and evaluation. Once the draft label is removed,
version 1.0 is intended to be sent out for public review as part of the RISC-V
International ratification process. Version 1.0 is also considered stable enough
to begin developing toolchains, functional simulators, and initial
implementations, including in upstream software projects, and is not expected
to have major functionality changes except if serious issues are discovered
during ratification. Once ratified, the spec will be given version 2.0”.
• Version 1.0 is targeted for the end of 2020, ready for ratification

8
Vector technology
background

9
Flynn’s Classification
• Instruction Stream
− Sequence of instructions read from memory
• Data Stream
− Operations performed on the data in the processor

                                 Number of Data Streams
                                 Single     Multiple
Number of             Single     SISD       SIMD
Instruction Streams   Multiple   MISD       MIMD

• SIMD is the most interesting class; it is the basis for:
− MMX, SSE, SSE2-5, AVX
− Altivec
− Neon
[Figure labels: Microprocessors, DSP, Graphic, Vector]

10
Why SIMD?
• Single 32-bit Microprocessor
– One 32-bit element per instruction
• 512-bit SIMD processor
– 16 32-bit elements per instruction
– 32 16-bit elements per instruction
– 64 8-bit elements per instruction. If the SIMD processor clock frequency is
at 1 GHz, then the performance is 64 GOPS
• The wider the SIMD, the higher the performance
• SIMD is the basis for the Vector Processor
– N operations per instruction where the elements in the operations are
independent from each other
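The element counts and the 64 GOPS figure above follow from simple division. A minimal Python sketch, assuming one SIMD instruction issues per cycle (the function names are illustrative, not from the slides):

```python
def elements_per_instruction(simd_bits: int, element_bits: int) -> int:
    """Number of independent elements one SIMD instruction processes."""
    if simd_bits % element_bits != 0:
        raise ValueError("element width must divide the SIMD width")
    return simd_bits // element_bits


def peak_gops(simd_bits: int, element_bits: int, clock_ghz: float) -> float:
    """Peak giga-operations per second, assuming one SIMD instruction per cycle."""
    return elements_per_instruction(simd_bits, element_bits) * clock_ghz


if __name__ == "__main__":
    # 512-bit SIMD at 1 GHz, for 32-, 16-, and 8-bit elements
    for width in (32, 16, 8):
        print(width, elements_per_instruction(512, width), peak_gops(512, width, 1.0))
```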

11
Parallel Processing
Instruction stream:
LD VR <-- A[3:0]
ADD VR <-- VR, 1
MUL VR <-- VR, 2
ST A[3:0] <-- VR

ARRAY PROCESSOR (same ops at same time, different ops at same space)
Time  PE0   PE1   PE2   PE3
0     LD0   LD1   LD2   LD3
1     ADD0  ADD1  ADD2  ADD3
2     MUL0  MUL1  MUL2  MUL3
3     ST0   ST1   ST2   ST3

VECTOR PROCESSOR (different ops at same time, same ops at same space)
Time  LD    ADD   MUL   ST
0     LD0
1     LD1   ADD0
2     LD2   ADD1  MUL0
3     LD3   ADD2  MUL1  ST0
4           ADD3  MUL2  ST1
5                 MUL3  ST2
6                       ST3
Multiple loads are handled by hardware
• Time-space duality:
– Array processor: Instruction operates on multiple data elements at the same
time – multi-thread, multi-core
– Vector processor: Instruction operates on multiple data elements in
consecutive time steps – pipeline of SIMD execution

12
Intel MMX, SSE, SSE2, AVX
• MMX (1997) adds 57 new integer instructions, 64-bit registers
– 8 of 8-bit, 4 of 16-bit, 2 of 32-bit, or 1 of 64-bit (reuses the 8 FP registers, so FP and MMX code cannot be mixed)
– Number of elements is defined in the opcode; only consecutive memory load/store is allowed
– Move, Add, Subtract, Shift, Logical, Mult, MACC, Compare, Pack/Unpack
• SSE (1999) adds 70 new instructions for single-precision FP, 4 of 32-bit, 128-bit registers
• SSE2 (2001) adds 144 new instructions for double-precision FP (2 of 64-bit) and integer operations on the 128-bit registers
• SSE3 (2004), SSE4 (2006), SSE5 (2007) add more new instructions
• AVX (2008) adds new instructions for 256-bit registers
• AVX2 (2013) adds more new instructions
• AVX-512 (2015) adds new instructions for 512-bit registers
• Issue:
– Each register length has a new set of instructions
– Many instructions for the 512-bit SIMD

13
Vector Extension Processor Overview
[Figure: RISC-V CPU with scalar (r0-r31), FP (f0-f31), and vector (v0-v31) register files plus vector metadata (VLDS, VLEN). Labels: a vector element; one lane (usually fewer lanes than vector elements, so lanes/VFUs are time-shared); a Vector Functional Unit (VFU), where multiple VFUs per lane can operate in parallel]
Parallelism achieved by:
• Multiple Lanes
• Multiple VFUs per Lane (e.g., L/S, +, x)
• Individual VFUs are pipelined
VLEN >= SIMD
Newell, R., “RISC-V Vector Extension Proposal Snapshot,” Microsemi, Microchip Technology Inc., 2018

14
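Vector Instruction Chaining
[Figure: the VLSU connects memory to the vector register file; the MULT and ADD units read and write the vector register file, with chain paths between the units]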
• Independent functional units are needed
– Partial data of an instruction can be forwarded
and operations are scheduled to be executed in
parallel
– Data chaining can happen only if the
instructions operate on multiple SIMD
widths
• VLEN=SIMD width, single instruction/cycle
– The operation is on multiple elements but it is
basically pipelining, not chaining
– The instructions must be set for a vector length that is greater than the SIMD width in order for chaining to happen

15
Chaining – Vector Unit Structure
• Four lanes, 4 functional units to fetch/execute independent data
16 elements, 4 lanes:
- Elements 0-3
- Elements 4-7
- Elements 8-11
- Elements 12-15
Functional Unit

16
Cycles 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
0 vld Ia0 Ia1 Ia2 Ia3 Ia4 Ia5 Ia6 Ia7
1 vadd la8 la9 la10 la11 la12 la13 la14 la15 Ib0 Ib1 Ib2 Ib3 Ib4 Ib5 Ib6 Ib7
2 vmult la16 la17 la18 la19 la20 la21 la22 la23 lb8 lb9 lb10 lb11 lb12 lb13 lb14 lb15 Ic0 Ic1 Ic2 Ic3 Ic4 Ic5 Ic6 Ic7
3 add la24 la25 la26 la27 la28 la29 la30 la31 lb16 lb17 lb18 lb19 lb20 lb21 lb22 lb23 lc8 lc9 lc10 lc11 lc12 lc13 lc14 lc15
4 vld Ie0 Ie1 Ie2 Ie3 Ie4 Ie5 Ie6 Ie7 lb24 lb25 lb26 lb27 lb28 lb29 lb30 lb31 lc16 lc17 lc18 lc19 lc20 lc21 lc22 lc23
5 vadd le8 le9 le10 le11 le12 le13 le14 le15 If0 If1 If2 If3 If4 If5 If6 If7 lc24 lc25 lc26 lc27 lc28 lc29 lc30 lc31
6 vmult le16 le17 le18 le19 le20 le21 le22 le23 lf8 lf9 lf10 lf11 lf12 lf13 lf14 lf15 Ig0 Ig1 Ig2 Ig3 Ig4 Ig5 Ig6 Ig7
7 add le24 le25 le26 le27 le28 le29 le30 le31 lf16 lf17 lf18 lf19 lf20 lf21 lf22 lf23 lg8 lg9 lg10 lg11 lg12 lg13 lg14 lg15
8 lf24 lf25 lf26 lf27 lf28 lf29 lf30 lf31 lg16 lg17 lg18 lg19 lg20 lg21 lg22 lg23
9 lg24 lg25 lg26 lg27 lg28 lg29 lg30 lg31
8 lanes, 8 LS units 8 lanes, 8 vector MUL units 8 lanes, 8 vector ADD units
Chaining of 3 vector operations
Chaining
• 32 elements, 8 lanes = 8 units to fetch/execute 8 independent elements
– vld: fetch data 32 elements in 4 clock cycles, 8 elements per cycle
– Vector add/mult: each vector instruction passes through the functional units
4 times

17
Vectorize Loop Iterations
• X and Y are vectors, a is scalar, n is a large number
• For N=64, dynamic instructions
– Processor = 64x9 + 2 = 578
– Vector = 6 (5 vectors and 1 scalar), operating on 64 elements
[Figure: C code, processor/SIMD code, and vector code side by side, with the RAW dependencies between the vector and scalar instructions marked]
From: Patterson, D., CS252, UCB
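The dynamic instruction counts quoted above can be checked with a few lines of Python. The constants 9, 2, and 6 come from the slide; the names are illustrative assumptions.

```python
def scalar_dynamic_instructions(n: int, per_iteration: int = 9, setup: int = 2) -> int:
    """Dynamic instruction count of the scalar loop: a fixed setup cost plus
    per_iteration instructions for each of the n elements."""
    return n * per_iteration + setup


# The vector version executes 5 vector instructions and 1 scalar instruction
# for all 64 elements (figure taken from the slide).
VECTOR_DYNAMIC_INSTRUCTIONS = 6

if __name__ == "__main__":
    scalar = scalar_dynamic_instructions(64)
    print(scalar, VECTOR_DYNAMIC_INSTRUCTIONS, round(scalar / VECTOR_DYNAMIC_INSTRUCTIONS, 1))
```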

18
Advantages of Vector Architecture
• From Spec92fp*
– Static Instructions: scalar/vector = 4.5X
– Dynamic Instructions: scalar/vector = 23.8X
– Operations: scalar/vector = 1.4X (the extra operations are loop overheads)
• Most efficient way for data parallel processing
– High computation throughput (operate on multiple elements at the same time)
– Requires high memory bandwidth rather than low latency, because data operations are chained. The data memory is interleaved across multiple banks for higher memory bandwidth
• Orthogonal to microprocessor architecture
– Scalar or superscalar microprocessor with vector unit, e.g. Cray X1, Alpha Tarantula,
Andes NX27V
– Parallel vector processors, e.g. NEC Earth Simulator, Broadcom Calisto
*Quintana, F., et.al., “An ISA comparison between Superscalar and Vector Processors,” U. of Las Palmas de Gran Canaria

19
Example: Cray 1
• Scalar and vector mode
• 8 64-element vector registers
– SEW = 64-bit
– VLEN = 64x64 = 4096-bit
• 8 64-bit scalar registers
• 8 24-bit address registers
• 16 memory banks
– Bank latency = 11 cycles
– Sustain 16 parallel accesses to
different banks
“Cray-1 Computer System Hardware Reference
Manual,”1977, Cray Research, Inc.

20
Strip Mining
• When the size of the vector > VLMAX
– size MOD(VLMAX) = the number of elements in the first iteration
– floor(size/VLMAX) = the number of remaining iterations, each of VLMAX elements
• Example: size = 200, VLMAX = 64
– 200 MOD(64) = 8 elements in the first iteration
– floor(200/64) = 3 iterations of 64 elements
• Strip mining is assumed to be done in software, as a loop over VLMAX-sized pieces
– vl or the vector mask register is used for the first iteration
[Figure: the 1st iteration handles the odd-size MOD(VLMAX) piece; the 2nd, 3rd, and 4th iterations handle VLMAX elements each]
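The strip-mining split can be sketched in Python as follows. This is a minimal model of the odd-piece-first scheme described above, not Andes code; the inner loop of `vector_add_scalar` stands in for one vector instruction executed with vl set to the chunk length.

```python
def strip_mine(n: int, vlmax: int) -> list:
    """Split n elements into (start, length) chunks: the odd-size
    n MOD vlmax piece first, then full pieces of vlmax elements."""
    chunks = []
    start = 0
    first = n % vlmax
    if first:
        chunks.append((0, first))
        start = first
    while start < n:
        chunks.append((start, vlmax))
        start += vlmax
    return chunks


def vector_add_scalar(data: list, scalar: int, vlmax: int = 64) -> list:
    """Add a scalar to every element, one strip-mined chunk at a time."""
    out = list(data)
    for start, vl in strip_mine(len(data), vlmax):
        # Models one vector instruction operating on vl elements.
        for i in range(start, start + vl):
            out[i] = data[i] + scalar
    return out


if __name__ == "__main__":
    print(strip_mine(200, 64))
```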

21
Technical Challenges of Vector Architecture
• Complexity of vector register file (VRF) with VLEN=512-bit
– Handling read and write ports across multiple lanes is much more expensive in the VRF than in scalar or FP register files
– Performance is related to the number of read and write ports
– The number of functional units is also limited by the number of read/write ports. Stalling write-back to the VRF means additional 512-bit buffers and back pressure on previous pipeline stages
• Microprocessor methods of handling out-of-order execution and retirement
become expensive for vector processor
– Register renaming adds several 512-bit vector registers
– The Re-Order Buffer needs a 512-bit retire queue for vector registers
• Precise exceptions and low interrupt latency are expensive to implement
– In AI or embedded market, the precise exception is less important
• Vector loads/stores are expensive, especially for stride and index
• Andes NX27V vector processor simplifies many of the above issues

22
Applications of Vector Processing
• Scientific computations (astrophysics, atmospheric, ocean modeling, bioinformatics, physics,
biomolecular simulation, chemistry, fluid dynamics, material sciences, quantum chemistry)
• Engineering computations (computer vision and image, data mining and data intensive
computing, CAD/CAM engineering analysis, military applications, VLSI design)
• Multimedia Processing (compress., graphics, audio synth, image proc.)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
• Artificial Intelligence
• Lossy Compression (JPEG, MPEG video and audio)
• Lossless Compression (Zero removal, RLE, Differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems/Networking (memcpy, memset, parity, checksum)
• Databases (hash/join, data mining, image/video serving)
• Language run-time support (stdlib, garbage collection)
Source: Rochester Institute of Technology, “Introduction to Vector Processing,” Shaaban, EECC722

23
Vectorizing Media Processing
• Kernel: vector length
− Matrix transpose/multiply: # vertices at once
− DCT (video, communication): image width
− FFT (audio): 256-1024
− Motion estimation (video): image width, iw/16
− Gamma correction (video): image width
− Haar transform (media mining): image width
− Median filter (image processing): image width
− Separable convolution (img. proc.): image width
(http://www.research.ibm.com/people/p/pradeep/tutor.html)

24
RISC-V V Extension ISA
Basic

25
Vector Register ISA
• Vector-Register ISA Definition:
− All vector operations are between vector registers (except for load and store). This is in contrast
with Vector-Memory ISA where all vector operations are on data memory
− Vector load-store unit fetches and stores the memory data to/from vector registers
• Basic requirements
− Each vector data register holds N M-bit values
▪ SEW – width of a single vector element in bits, M-bit, power of 2
▪ VLEN – number of bits in a vector register, N*M-bit, power of 2, VLEN >= ELEN
▪ VLMAX – number of vector elements, N, in the vector register, VLEN/SEW
▪ Elements are independently executed (mostly)
− Vector control registers: the vector mask, VMASK (predication: marks the elements valid for execution), is held in v0
− Vector instructions
▪ Arithmetic, logical, shift, multiply, divide, mask, permute, floating-point functional units
▪ Vector load/store unit: load/store with stride, load/store with index, segment load/store

26
Vector Processing ISA Basic
[Figure: the standard scalar RISC-V ISA and the standard floating-point RISC-V ISA sit beside the vector extension]
• Vector ALU: ALU, Mult, Macc, Mask, Permute, Divide
− Data types: s.int, u.int, fp; element widths: 8, 16, 32, 64
− Operand forms: .vv, .vx, .vi; masked or unmasked
− Saturate, Overflow
• Vector Memory: load, store
− Data types: s.int, u.int; element widths: 8, 16, 32, 64
− Addressing: constant, indexed, segment; masked or unmasked
• Vector registers: 32x(128b, 256b, 512b)
Andes Vector processor: VLEN=512b, SEW=8,16,32,64b, elements=64,32,16,8
These numbers are used for the rest of the presentation

27
Notations
• Register references
– vs1, vs2, vd – source and destination vector registers, VRF
– rs1, rs2, rd – source and destination integer registers, XRF
– fs1, fs2, fd – source and destination floating point registers, FRF
• Register numbers
– v0, v1, …, v30, v31 – vector register numbers (addresses, indices)
– x0, x1, …, x30, x31 – integer register numbers (addresses, indices)
– f0, f1, …, f30, f31 – floating-point register numbers (addresses, indices)
– x0 is not a real register: as a source it reads as 0, and as a destination there is no write back
– Vector mask and carry bits use v0

28
Vector Processor Overview - Block Diagram
[Figure: 32 vector registers (0-31), each holding MVL elements, with the Vector Length and Vector Mask registers; scalar registers R0-R31; a vector load/store unit connected to multi-bank memory; functional units: FP MAC unit, FP misc unit, FP divide unit, Integer ALU, Integer MAC unit, Integer divide unit, Vector Mask Unit, Vector Permute Unit]

29
Vector Processor Microarchitecture Features
[Figure: the block diagram of the previous slide, annotated with the features below]
• Vector functional units
• Many types of memories can be implemented
• VRF read/write ports control
• Data dependency is handled by scoreboard
• Execution queues for out-of-order execution

30
CSR

31
Vector CSR Registers
Address  Privilege  Name    Description
0x008    URW        vstart  Vector start position
0x009    URW        vxsat   Fixed-Point Saturate Flag
0x00A    URW        vxrm    Fixed-Point Rounding Mode
0x00F    URW        vcsr    Vector control and status register
0xC20    URO*       vl      Vector length
0xC21    URO*       vtype   Vector data type register
0xC22    URO        vlenb   VLEN/8 (vector register length in bytes) – configuration register
*written by vsetvl(i) instruction, not by CSR access instructions

32
CSR – Vector Type Register (vtype)
• vlmul[2:0] – grouping of vector registers, 2x, 4x, 8x, or fraction of vector register, 1/2, 1/4, 1/8
– 000: no grouping, 32 vector registers, VLMAX = VLEN/SEW
– 001: 16 vector registers, effective VLEN is 2X, twice number of elements
– 010: 8 vector registers, effective VLEN is 4X, quadruple number of elements
– 011: 4 vector registers, effective VLEN is 8X, octuple number of elements
– 100: Reserved
– 101: Fractional vector register, effective VLEN is 1/8, one-eighth number of elements
– 110: Fractional vector register, effective VLEN is 1/4, quarter number of elements
– 111: Fractional vector register, effective VLEN is 1/2, half number of elements
• vsew[2:0] – vector standard element width, VLMAX = VLEN/VSEW
– 000 – 111: 8, 16, 32, 64, 128, 256, 512, 1024
• vta – vector tail agnostic: selects whether tail elements are left undisturbed or may hold unknown data (implementation dependent)
• vma – vector mask agnostic: selects whether masked-off elements are left undisturbed or may hold unknown data (implementation dependent)
• vill – set when the vsetvl(i) instruction is given an illegal value; any subsequent vector instruction is then an illegal instruction
[Figure: vtype bit layout over bits 31-0, with fields vill, vma, vta, vlmul, vsew, and reserved bits]
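The vlmul and vsew encodings listed above determine VLMAX. A minimal Python sketch, assuming VLMAX = LMUL * VLEN / SEW and using the encoding tables from this slide (the function names are illustrative):

```python
from fractions import Fraction

# vlmul[2:0] encodings from the slide; 0b100 is reserved.
VLMUL = {
    0b000: Fraction(1), 0b001: Fraction(2), 0b010: Fraction(4), 0b011: Fraction(8),
    0b101: Fraction(1, 8), 0b110: Fraction(1, 4), 0b111: Fraction(1, 2),
}


def sew_bits(vsew: int) -> int:
    """vsew[2:0] = 000..111 selects SEW = 8, 16, ..., 1024 bits."""
    if not 0 <= vsew <= 7:
        raise ValueError("vsew is a 3-bit field")
    return 8 << vsew


def vlmax(vlen_bits: int, vsew: int, vlmul: int) -> int:
    """Maximum number of elements per register group."""
    if vlmul not in VLMUL:
        raise ValueError("reserved vlmul encoding")
    return int(VLMUL[vlmul] * vlen_bits / sew_bits(vsew))


if __name__ == "__main__":
    print(vlmax(512, 0b010, 0b000))  # e32, m1 on a 512-bit VLEN
```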

33
LMUL – Registers Grouping
[Figure: valid register numbers for each LMUL]
LMUL=1: v0, v1, v2, v3, …, v28, v29, v30, v31
LMUL=2: v0, v2, …, v28, v30
LMUL=4: v0, v4, …, v24, v28
LMUL=8: v0, v8, v16, v24
• LMUL=2: instructions that use odd-numbered registers are illegal instructions
• LMUL=4: vector register numbers must be multiples of 4, else the instruction is illegal
• LMUL=8: only v0, v8, v16, and v24 are valid vector registers
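The register-number rule for grouping reduces to an alignment check. A minimal Python sketch for integer LMUL values (the helper names are illustrative assumptions):

```python
def is_legal_group_register(reg: int, lmul: int) -> bool:
    """A register group must start at a register number that is a multiple of LMUL."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("integer LMUL must be 1, 2, 4, or 8")
    return 0 <= reg < 32 and reg % lmul == 0


def group_members(reg: int, lmul: int) -> list:
    """Registers covered by the group that starts at reg."""
    if not is_legal_group_register(reg, lmul):
        raise ValueError("illegal register number for this LMUL")
    return list(range(reg, reg + lmul))


if __name__ == "__main__":
    print([r for r in range(32) if is_legal_group_register(r, 8)])
```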

34
LMUL – Fractional Registers
[Figure: with elements 31-0 in a register, LMUL=1 uses the whole register, while LMUL=1/2, 1/4, and 1/8 use progressively smaller fractions of it]
• LMUL=1/2: load data into half of a vector register, then use widening (2x) operations to expand the data to the whole register
• LMUL=1/4: use quad-widening (4x) instructions to expand the data to the whole register
• LMUL=1/8: use 8x extension instructions to expand the data to the whole register

35
Vector Register Programming Model
[Figure: scalar registers R0-R31, FP registers F0-F31, and vector registers V0-V31, each vector register holding elements [VLMAX-1] … [2] [1] [0]; the vector length register and VLEN are marked]
• 32 vector registers – 5-bit register number
• VLEN = the width of each vector register in bits, a power of 2
• Version 0.7: the FPU cannot share the vector registers

36
Vector Register 32-bit Elements
[Figure: the same register model with VLEN=512b, SEW=32b: element # 15 … 0 in each vector register]

37
[Figure: the same register model with VLEN=512b, SEW=16b: element # 31 … 0 in each vector register]

38
[Figure: the same register model with VLEN=512b, SEW=8b: element # 63 … 0 in each vector register]

39
Element Numbering, LMUL=2
Elements held by the two registers of a group (VLEN=512b):
SEW     Vm*n      Vm*n+1
SEW=8   63 … 0    127 … 64
SEW=16  31 … 0    63 … 32
SEW=32  15 … 0    31 … 16
SEW=64  7 … 0     15 … 8
• VLEN=SLEN is now mandatory in the RVV version 1.0
– Straightforward numbering, incremented sequentially
– The numbering is important for mask and permutation operations

40
Vector Mask – Conditional or Predicate Execution
• v0 is used for vector mask, single mask for all LMUL values
– 1 mask bit for each element
• 0 – mask (mask-off), no write to destination vector elements, cannot have exception
• 1 – unmask
– The mask is enabled for each vector instruction with inst[25]=0, inst[25]=1 is unmask
• vop.vi v8, v16, 3, v0.t // mask is enabled, element i writes back only if v0.mask[i]=1
• vop.vi v8, v16, 3 // unmasked, all elements write result data back to VRF
• Note: new vector mask load/store instructions may be added
[Figure: layout of the v0 vector mask register for SEW=8, 16, 32, and 64 and for LMUL=2, 4, and 8; SEW=64 uses mask bits 7-0, narrower SEW uses more mask bits, and the remaining bits are unused]

41
LMUL=8 and Chaining Example
Cycles: 1 2 3 4 5 6 7 8 9 10
vmul v0, v8, v16 X1 X2
v1, v9, v17 X1 X2
……… … … … … …
v7, v15, v23 X1 X2
vadd v24, v0, v8 X1
v25, v1, v9 X1
……… … … … … … … …
Vector length: VLEN * LMUL
• LMUL=1 → 512 bits
• LMUL=8 → 4096 bits
Register groups: V0~V7 group 1, V8~V15 group 2, V16~V23 group 3, V24~V31 group 4
Xn: execution
LMUL Chaining

42
CSR - Vector Length Register (vl)
• vl is the number of elements for vector operations
– Updated by the vsetvl(i) instruction along with the vtype CSR
• vl is set to VLMAX if the requested value is greater than VLMAX
• The default vl is VLMAX
• If x0 is used to set vl, then vl is set to VLMAX
– vl can be reduced to the faulting element by a fault-only-first load instruction
– If vl=0, no elements are updated in vector operations
[Figure: with 32 elements, vl=VLMAX covers elements 31-0 and vl=21 covers elements 20-0; register layout: reserved upper bits, vl[VLMAX*8-1:0]]

43
CSR Vector Instruction – vsetvli/vsetvl
• Two vector CSRs are updated by single instruction
– vsetvli rd, rs1, vtypei // AVL (application vector length) = x[rs1], vtype = vtypei, rd = new vl
– vsetvl rd, rs1, rs2 // AVL = x[rs1], vtype = x[rs2], rd = new vl
• CSR updates:
– csr.vtype = vtypei or rs2
• e8, e16, e32, e64, … setting the SEW size
• {m1}, m2, m4, m8, mf2, mf4, mf8 – setting the LMUL, default is m1 if m value is not specified
• {tu}, ta – setting tail elements, default is tu if ta is not specified
• {mu}, ma – setting masked-off elements, default is mu if ma is not specified
• csr.vl = x[rs1]
– rd=x0, rs1=x0: no update for csr.vl
– rd!=x0, rs1=x0: vl=VLMAX, x[rd]=vl
– rs1!=x0: vl=x[rs1], x[rd]=vl
– x[rs1] > VLMAX: vl=VLMAX, x[rd]=vl
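The four vl-update cases above can be written as one small function. This is a simplified Python model of the rules as stated on the slide; the RVV specification gives implementations some freedom in choosing vl when AVL lies between VLMAX and 2*VLMAX, which this sketch ignores. The parameter names are illustrative.

```python
def vsetvli_new_vl(avl: int, rd_is_x0: bool, rs1_is_x0: bool,
                   current_vl: int, vlmax: int) -> int:
    """Return the new vl according to the slide's rules.

    avl is the value of x[rs1]; it is ignored when rs1 is x0.
    """
    if rs1_is_x0:
        # rd = rs1 = x0 keeps vl; rd != x0 with rs1 = x0 selects VLMAX.
        return current_vl if rd_is_x0 else vlmax
    # rs1 != x0: request AVL elements, clamped to VLMAX.
    return min(avl, vlmax)


if __name__ == "__main__":
    # VLEN=512b, e8, m1 gives VLMAX=64.
    print(vsetvli_new_vl(200, False, False, current_vl=0, vlmax=64))
```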

44
Sample Codes for LMUL and SEW Manipulation
• VLEN=512b, extend 16 elements from e8 to e32
vsetvli t0, a0, e8, mf4 // set SEW=8b, LMUL=1/4, VLMAX=64/4=16 elements
vle8.v v8, (a2) // load 16 8b elements
vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=16 (VLMAX)
vsext.vf4 v12, v8 // sign-extend 16 8b elements to 32b; vsext.vf4 reads SEW/4 sources, so SEW must already be 32b
vadd.vx v8, v12, t1 // 32-bit vector add for 16 32b elements
…
• Notes:
− Writes and reads of CSRs are normally serializing events, but in the VPU speculative values of LMUL, SEW, and vl are used for subsequent instructions until the values are changed again
sew=8b
lmul=1/4
sew=32b
lmul=1
vl=16
sew=8

45
Optimized codes for Load and Signed Extension
vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=512/32=16 elements
vle8.v v8, (a2) // load 16 8b elements (EMUL=1/4); vle16.v or vle32.v also loads 16 elements
vsext.vf4 v12, v8 // sign-extend 16 8b elements to 32b elements
vadd.vx v8, v12, t1 // 32-bit vector add for 16 32b elements
…
• Notes:
− In the earlier version 0.8 of the RVV ISA, the vle8.v and vsext.vf4 instructions were a single instruction (the ACE load instruction can still do this load):
vlb.v v8, (a2) // load 8b elements and sign-extend to 32b elements, still 16 elements

46
Sample Code for Large Data Set
vsetvli t0, zero, e32, m8 // set SEW=32b, LMUL=8, vl=512b*8/32b=128 elements
vle8.v v6, (a2) // load 128 8b elements into 2 vector registers, v6-v7
vsext.vf4 v0, v6 // sign-extend 128 8b elements to 32b elements in v0-v7
vadd.vx v0, v0, t1 // 32-bit vector add for 128 elements
…
• Notes:
– vle8.v: loads data into 2 vector registers (EMUL=2); the register number must be even, else it is an illegal instruction
– vsext.vf4: the source operands overlap the destination operands
• If the source operand is v0, v2, or v4, it is an illegal instruction
• v6 and v7 are legal because these vector registers are read before they are written with the result data

47
CSR - Vector Byte Length
• Vector register length in bytes, vlenb = VLEN/8
– Design time constant, vlenb = 512/8 = 64
– Read-only CSR, constant value, hardwired
– A configuration CSR
[Figure: vlenb register layout over bits 31-0: vlenb = 64, upper bits reserved]

48
CSR – Vector Start Index Register (vstart)
[Figure: vstart register layout over bits 31-0: vstart[VLMAX*8-1:0], upper bits reserved]
• Initialized to zero
• vstart is the index of the element that caused the trap on a vector instruction
• On return from the trap, vector execution restarts from the vstart element, and vstart is reset to zero after the vector instruction executes
• vstart >= vl means that no element operation is performed and vstart is reset to zero
For the Andes vector processor, only load/store instructions can cause traps, and interrupts are taken at instruction boundaries. A non-zero value of vstart for any other vector instruction causes an illegal instruction exception.

49
Valid Vector Range
Behavior for destination element i:
0 <= i < vstart: Unchanged
vstart <= i < vl: Updated if v0.mask[i] is enabled; unchanged if not, except under the mask-agnostic (vma) setting
vl <= i < VLMAX: Unchanged, except under the tail-agnostic (vta) setting
[Figure: elements 31-0 with vstart, the masked elements, vl-1, and VLMAX-1 marked; the valid elements for operation lie between vstart and vl-1]
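The three element ranges can be modeled in a few lines of Python. This sketch covers only the undisturbed policy (tail and masked-off elements keep their old values); the agnostic settings are not modeled, and the function name is an illustrative assumption.

```python
def apply_vector_op(dest: list, src: list, op, vstart: int, vl: int, mask=None) -> list:
    """Update dest[i] = op(src[i]) for vstart <= i < vl where the mask bit is set.

    Elements below vstart, at or above vl, or with a clear mask bit are unchanged.
    mask=None models an unmasked instruction.
    """
    out = list(dest)
    for i in range(vstart, min(vl, len(dest))):
        if mask is None or mask[i]:
            out[i] = op(src[i])
    return out


if __name__ == "__main__":
    result = apply_vector_op([0] * 8, list(range(8)), lambda x: x + 10,
                             vstart=2, vl=6, mask=[1, 0, 1, 0, 1, 0, 1, 0])
    print(result)
```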

50
CSR – Vector Execution Fixed Point Register
• vxsat – Vector Fixed-Point Saturation Flag Register – single bit to indicate that a saturated value was written to the destination vector register
• vxrm – Vector Fixed-Point Rounding Mode Register:
− 00: rnu – round to nearest up (add +0.5 to LSB)
− 01: rne – round to nearest even
− 10: rdn – round down (truncate)
− 11: rod – round to odd (OR bits into LSB, “jam”)
• vcsr – merging of vxsat and vxrm into a single CSR, explicitly reset by software
[Figure: register layouts over bits 31-0: vcsr holds the sat bit and the vxrm field, vxsat holds the sat bit, and vxrm holds the rounding mode field; the remaining bits are reserved]
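The four rounding modes can be illustrated with an unsigned right shift by d bits that adds a rounding increment r. A minimal Python sketch following the increment definitions in the RVV specification (the function name and string mode labels are illustrative):

```python
def roundoff_unsigned(v: int, d: int, mode: str) -> int:
    """Shift v right by d bits and apply the vxrm rounding mode.

    mode is one of "rnu", "rne", "rdn", "rod".
    """
    if d == 0:
        return v

    def bit(k: int) -> int:
        return (v >> k) & 1

    def low_nonzero(width: int) -> bool:
        # True if any of the low `width` bits of v is set.
        return (v & ((1 << width) - 1)) != 0

    if mode == "rnu":      # round to nearest, ties up
        r = bit(d - 1)
    elif mode == "rne":    # round to nearest, ties to even
        r = bit(d - 1) & int(low_nonzero(d - 1) or bit(d) == 1)
    elif mode == "rdn":    # truncate
        r = 0
    elif mode == "rod":    # jam: force the LSB to 1 if any bit was shifted out
        r = int(bit(d) == 0 and low_nonzero(d))
    else:
        raise ValueError("unknown rounding mode")
    return (v >> d) + r


if __name__ == "__main__":
    # 10 / 4 = 2.5 under each mode
    print([roundoff_unsigned(10, 2, m) for m in ("rnu", "rne", "rdn", "rod")])
```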

51
CSR – Vector Execution Floating-Point Register
• The vector processor uses the fflags and frm from the FPU
– fflags[4:0]: floating-point accrued exception flags
– frm[2:0]: floating-point rounding modes

52
Memory Operations

53
Memory Operations
• Load/store operations move groups of data between the VRF and memory
– Andes implements unaligned addressing, no restriction
• Four basic types of addressing
– Unit stride: fetch consecutive elements in contiguous memory
– Non-unit or constant stride: elements are separated in memory by a constant stride; the stride is the distance between 2 consecutive elements of a vector in memory
– Indexed (gather-scatter): elements are randomly offset from a base address
– Segment: move multiple contiguous fields in memory to and from consecutively
numbered vector registers – can be unit-stride, constant stride, or index
• One TLB lookup per group of loads/stores (Andes)
– Limit the memory access to 1 TLB page – 1 translation
– 4KB, memory access fault if any element address crosses the page boundary
• Atomic load/store is not implemented

54
Vector Load Data Alignment
[Figure: an unaligned vector load into vn is assembled from two 32-byte packages (bytes 31-0 each): extract 7 from the first 32B package and extract 25 from the second 32B package]
Andes processor handles all unaligned addresses

55
Instruction Format – Load/Store
Fields from bit 31 down to bit 0:
VL*  unit-stride: nf | mew | mop | vm | lumop | rs1 | width | vd  | 0000111
VLS* strided:     nf | mew | mop | vm | rs2   | rs1 | width | vd  | 0000111
VLX* indexed:     nf | mew | mop | vm | vs2   | rs1 | width | vd  | 0000111
VS*  unit-stride: nf | mew | mop | vm | sumop | rs1 | width | vs3 | 0100111
VSS* strided:     nf | mew | mop | vm | rs2   | rs1 | width | vs3 | 0100111
VSX* indexed:     nf | mew | mop | vm | vs2   | rs1 | width | vs3 | 0100111
nf: segment, multiple registers

mew-width[2:0]  Type                         Name   Width (bits)
x001            FP load/store (FL/FS)        H      16
x010            FP load/store (FL/FS)        W      32
x011            FP load/store (FL/FS)        D      64
x100            FP load/store (FL/FS)        Q      128
0000            Vector load/store (VLx/VSx)  E8     8
0101            Vector load/store (VLx/VSx)  E16    16
0110            Vector load/store (VLx/VSx)  E32    32
0111            Vector load/store (VLx/VSx)  E64    64
1000            Vector load/store (VLx/VSx)  E128   128
1101            Vector load/store (VLx/VSx)  E256   256
1110            Vector load/store (VLx/VSx)  E512   512
1111            Vector load/store (VLx/VSx)  E1024  1024
EEW=MEW

lumop/sumop[4:0]
00000   unit-stride
01000   unit-stride, whole register
10000   load, fault-only-first
others  reserved

mop[1:0] function (the figure labels the rows by load/store)
00 unit-stride
01 index_unordered
10 constant-stride
11 index_ordered
11 index
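The mew/width table can be read as a small decoder. A minimal Python sketch of that lookup, using only the encodings listed on this slide (the function name and the "scalar-fp"/"vector" labels are illustrative assumptions):

```python
# width[2:0] values used by scalar FP loads/stores (mew is "don't care").
FP_WIDTH = {0b001: 16, 0b010: 32, 0b011: 64, 0b100: 128}

# (mew, width[2:0]) values used by vector loads/stores.
VECTOR_EEW = {
    (0, 0b000): 8, (0, 0b101): 16, (0, 0b110): 32, (0, 0b111): 64,
    (1, 0b000): 128, (1, 0b101): 256, (1, 0b110): 512, (1, 0b111): 1024,
}


def decode_width(mew: int, width: int) -> tuple:
    """Return (kind, bits) for a load/store mew/width encoding."""
    if width in FP_WIDTH:
        return ("scalar-fp", FP_WIDTH[width])
    if (mew, width) in VECTOR_EEW:
        return ("vector", VECTOR_EEW[(mew, width)])
    raise ValueError("encoding not listed in the table")


if __name__ == "__main__":
    print(decode_width(0, 0b110))
```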

56
Vector load/store, different EMUL
• EMUL is the effective LMUL
• SEW=16, LMUL=1, VLEN=512b
– elements=512/16=32, all vector instructions operate on 32 elements
vle16.v v1, (t0) # Load a vector of sew=16, lmul=1, load to v1
vle32.v v2, (t1) # Load a vector of sew=32, emul=2, load to v2 and v3
vwadd.wv v4, v2, v1 # v4/v5 ← v2/v3 + sign-extend(v1), emul=2
vse32.v v4, (t1) # Store a vector of sew=32, emul=2, v4 and v5 store to memory
vle8.v v1, (t1) # Load a vector of sew=8, emul=1/2, load to the first half of v1
vle64.v v4, (t0) # Load a vector of sew=64, emul=4, load to v4, v5, v6, v7
vle128.v v8, (t0) # Load a vector of sew=128, emul=8, load to v8, v9, v10, …, v15
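The EMUL values in the comments above follow from EMUL = (EEW / SEW) * LMUL. A minimal Python sketch of that calculation, with SEW=16 and LMUL=1 as on this slide (the function names are illustrative):

```python
from fractions import Fraction


def emul(eew: int, sew: int, lmul) -> Fraction:
    """Effective LMUL of a load/store with element width eew."""
    value = Fraction(eew, sew) * Fraction(lmul)
    if not Fraction(1, 8) <= value <= 8:
        raise ValueError("EMUL outside the 1/8 to 8 range")
    return value


def registers_used(base: int, emul_value: Fraction) -> list:
    """Registers touched by the access; a fractional EMUL uses part of one register."""
    count = max(1, int(emul_value))
    return list(range(base, base + count))


if __name__ == "__main__":
    for eew in (8, 16, 32, 64, 128):
        print(eew, emul(eew, 16, 1))
```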

57
Load with and without Extension
• SEW=32b, LMUL=8, VLEN=512b, (512b/32b)*8 = 128 elements
• RVV version 1.0 – vle8.v v6, (t0) & vsext.vf4 v0, v6
− Load 128 elements of 8b
▪ 512b/8b = 64 elements per register, so the load writes 2 vector registers, v6-7
− Sign extension of 128 elements from 8b to 32b
▪ Read 8b elements from 2 vector registers, v6-7
▪ Write 32b elements to 8 vector registers, v0-7
• RVV version 0.8 – vlb.v v0, (t0) – custom Andes instruction
− Load 128 elements of 8b and sign-extend to 32b
▪ Write data into 8 vector registers, v0-7

58
Unit-Stride Load/Store Variants
• Load Fault-only-First: an exception is taken only if the first element faults
– If a later element faults, vl is set to the faulting element and no exception is taken
– Used for the exit condition of a loop
– Example: the unit-stride load fetches to the end of the TLB page; all subsequent vector operations then work on the valid elements that were fetched. This is the reverse of strip mining, for an unknown number of application elements
• Load/store of whole register:
– vtype & vl are ignored, SEW=8b, uses the NF bits for multiple vector registers
– vm=1 (inst[25]), else illegal instruction
– vstart >= VLEN/8, no operation
– Used for fill/spill, context switch, return from interrupt, …

59
Example: strncpy, Fault-only-First Load
strncpy:
mv a3, a0 // Copy dst
loop:
vsetvli x0, a2, e8, m8, ta, ma // Vectors of bytes, VLEN=512b
vle8ff.v v8, (a1) // Fault-only-First load to 8 vector registers
vmseq.vi v1, v8, 0 // Flag zero bytes
csrr t1, vl // Get number of bytes fetched
Note: ta is used to write zero to elements after the faulted elements (vl=faulted
elements)

60
Stride Gather/Scatter Memory Access
• Unit stride: contiguous memory access
• Constant stride: gather/scatter (stride can be negative or zero)
– Supports overlapping stores to memory due to stride=0 or stride<SEW
[Figure: four rows of memory locations 0-17, showing which locations elements 0-3 of a vector access with unit stride, stride=2, stride=8 (stride gather/scatter), and stride=0]
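The address pattern of a constant-stride access can be sketched as follows. The function name and the byte-addressed model are assumptions made for illustration; the stride is a signed byte count, so zero and negative values work the same way.

```python
def strided_addresses(base, stride_bytes, vl):
    """Byte address of each element i: base + i * stride."""
    return [base + i * stride_bytes for i in range(vl)]
```

With a stride of 0, every element maps to the same address, which is why stores with stride=0 overlap in memory.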

61
Index Gather/Scatter Memory Access
Note:
- The indices are absolute and not accumulative
- The indices are not in-order
[Figure: memory locations 0-16 starting at the base address &data[0]]
Gather: load index vector vs2 = 7 12 8 2
Scatter: store index vector vs2 = 5 8 0 15, store data vd/vs3 = elements 0 1 2 3
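A minimal Python sketch of the indexed gather and scatter, using the index vectors from the figure. As an assumption for readability, the indices here select whole memory cells, as drawn in the figure; in the ISA they are byte offsets added to the base address. The function names are illustrative.

```python
def indexed_gather(mem, index):
    """vd[i] = mem[index[i]]; indices are absolute and may be in any order."""
    return [mem[i] for i in index]


def indexed_scatter(mem, index, data):
    """mem[index[i]] = data[i]; returns the updated memory."""
    out = list(mem)
    for i, value in zip(index, data):
        out[i] = value
    return out
```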

62
Segment Unit Stride Load/Store
• Multiple vector registers load/store, NF*EMUL <= 8
• With NF=4, equivalent to 4 stride load/store micro-ops
• Fault-only-first is allowed
[Figure: memory locations 0-11 loaded into vector registers V0-V3, elements [0] [1] [2] … [VLMAX-1], for a unit load/store with NF=4, LMUL=1; labels Load 0, Load 1, Load 2. Caption: segment stride load/store with NF=4 is four stride loads/stores]
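The de-interleaving done by a unit-stride segment load can be sketched in Python. This is an illustrative model only: memory is a list of elements and `segment_load` is an invented name. Field f of segment s goes to element s of register f.

```python
def segment_load(mem, nf, vl, base=0):
    """Return nf register images; register f holds field f of every segment."""
    regs = [[] for _ in range(nf)]
    for seg in range(vl):
        for field in range(nf):
            regs[field].append(mem[base + seg * nf + field])
    return regs
```

For the 12 memory locations in the figure and NF=4, each of V0-V3 receives every fourth location, which is the same result as four loads with a stride of four elements.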

63
Segment Stride Load/Store, LMUL=2
• Zero and negative strides are allowed
[Figure: memory locations 0-11 loaded into vector registers V0-V3, elements [0] [1] [2] … [VLMAX-1], for a unit load/store with NF=2, LMUL=2; labels Load 0, Load 1, Load 2]
First load/store, stride=2, LMUL=2
Second stride load/store, base address=1
Stride=0: write [0] into all elements of v0 & v1, and [1] into all elements of v2 & v3

64
Segment Index Load/Store
– Overlapping of store data to memory
– NF index load/store micro-ops; the index is incremented by SEW for each micro-op
Segment index load/store with NF=4 is four index loads/stores
[Figure: memory locations 0-16 starting at the base address &data[0]; index vector = 7 12 8 2]
Memory locations accessed for each vector register, elements [0] [1] [2] … [VLMAX-1]:
V0 7 12 8 2
V1 8 13 9 3
V2 9 14 10 4
V3 10 15 11 5

65
Compute Instruction Format

66
Vector Arithmetic Instruction Overview
• vop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] op vs1[i]
• vop.vx vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] op x[rs1]
• vop.vi vd, vs2, imm, vm // vector-imm, vd[i]=vs2[i] op imm
• vfop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] fop vs1[i]
• vfop.vf vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] fop f[rs1]

67
Compute Instruction Format
OPI – integer operation
OPF – floating-point operation
OPM – mask operation
Format   [31:26]  [25]  [24:20]  [19:15]  [14:12]  [11:7]  [6:0]    Operands
OPIVV    funct6   m     vs2      vs1      000      vd      1010111  vector-vector
OPFVV    funct6   m     vs2      vs1      001      vd/rd   1010111  vector-vector
OPMVV    funct6   m     vs2      vs1      010      vd/rd   1010111  vector-vector
OPIVI    funct6   m     vs2      simm5    011      vd      1010111  vector-immed
OPIVX    funct6   m     vs2      rs1      100      vd      1010111  vector-scalar
OPFVF    funct6   m     vs2      rs1      101      vd      1010111  vector-scalar(F)
OPMVX    funct6   m     vs2      rs1      110      vd/rd   1010111  vector-scalar
vsetvli  [31]=0, [30:20]=zimm[10:0]  rs1  111  rd  1010111  scalar-immed
vsetvl   [31]=1, [30:25]=000000, [24:20]=rs2  rs1  111  rd  1010111  scalar-scalar
Funct6 is used to encode all vector instructions; most of the time, the
same funct6 code is used for all 3 forms of a vector instruction (vector,
scalar, immediate)
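The field layout can be checked with a small encoder. This is a sketch for illustration; the function name `encode_opv` is invented here, and the only instruction-specific fact it relies on is that vadd uses funct6=000000.

```python
OP_V = 0b1010111  # major opcode shared by all the formats above


def encode_opv(funct6, vm, vs2, vs1, funct3, vd):
    """Pack funct6 | vm | vs2 | vs1/rs1/simm5 | funct3 | vd/rd | opcode."""
    fields = ((funct6, 6), (vm, 1), (vs2, 5), (vs1, 5), (funct3, 3), (vd, 5))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        word = (word << width) | value
    return (word << 7) | OP_V
```

For example, an unmasked `vadd.vv v1, v2, v3` (OPIVV, funct3=000, vm=1) packs to 0x022180D7.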

68
Vector Widening Instructions
• Widening: double width
− vwop.v{v,x} vd, vs2, vs1/rs1 // 2*SEW = sign/zero(SEW) op sign/zero(SEW)
− vwop.w{v,x} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op sign/zero(SEW)
− vfwop.v{v,f} vd, vs2, vs1/rs1 // 2*SEW = SEW op SEW
− vfwop.w{v,f} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op SEW (vfwadd & vfwsub)
• Widening operations
− Integer ALU: the source operands are zero/sign-extended before the operation
− Integer MUL: operates on SEW-bit operands, then returns 2*SEW
− Floating-point: operates on SEW-bit operands, then returns 2*SEW
− Same number of elements rule as defined in vtype:
o Double width → writing back to twice the number of vector registers

69
Vector Widening Instructions – Extension
• Vector integer extension
− v{z,s}ext.vf{2,4,8} vd, vs2 // Zero/sign extend from SEW/2, SEW/4, or SEW/8 to SEW
− The result has the same number of elements as set by vtype
− Example: SEW=32b, then only v{z,s}ext.vf{2,4} are legal instructions
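The legality rule in the example follows from the source width SEW/factor having to be at least 8 bits. A minimal Python sketch, with invented helper names, of that rule and of the sign extension itself:

```python
def legal_ext_factors(sew):
    """Extension factors whose source element width sew/factor is >= 8 bits."""
    return [f for f in (2, 4, 8) if sew // f >= 8]


def sign_extend(value, from_bits):
    """Interpret the low from_bits of value as a two's-complement number."""
    value &= (1 << from_bits) - 1
    sign = 1 << (from_bits - 1)
    return (value ^ sign) - sign


def vsext(src, factor, sew):
    """Model of vsext.vf<factor>: same element count, each widened to sew bits."""
    if factor not in legal_ext_factors(sew):
        raise ValueError("illegal extension factor for this SEW")
    return [sign_extend(x, sew // factor) for x in src]
```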

70
Shifting of Source Operands for Double-Width
• The data are extended and folded into the source data from VRF before
sending to execution
– Cycle 1: operate on full-width data and write back to v2; shift data from the left half and extend it to full width
– Cycle 2: operate on full-width data and write back to v3, pipelined in after the right half
[Figure: widening add v2=v1+v0 at 2*SEW. Sources v1 (src0) and v0 (src1) hold elements E15-E0; stage 1 extends the elements, stage 2 adds them and writes the results to v2 and v3]

71
Vector Narrow Instructions
• Narrowing: from double-width
– vnop.w{v,x,i} vd, vs2, vs1/rs1 // SEW = 2*SEW op SEW
– Logical and arithmetic right shift operations
Example of operations without changing vtype, LMUL=1, SEW=16b, 32 elements in
all vector instruction operations:
vle16.v v1, (t0) // Load a full vector register to v1
vle32.v v2, (t1) // Load a vector of sew=32b, emul=2, load to v2 and v3
vwadd.wv v0, v2, v1 // v0/v1 ← v2/v3 + sign-extend(v1), emul=2
vnsra.wx v7, v0, t2 // v7 ← v0/v1 shifted right by t2 and narrowed to 16b
vse16.v v7, (t0) // Store v7 to memory
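The widening add and narrowing shift in this example can be modelled per element in Python. This is a sketch with invented function names; values are plain Python integers and `sext` wraps them to the stated width.

```python
def sext(value, bits):
    """Wrap value to a signed two's-complement number of the given width."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def vwadd_wv(wide, narrow, sew=16):
    """2*SEW result = 2*SEW source + sign-extended SEW source."""
    return [sext(w + sext(n, sew), 2 * sew) for w, n in zip(wide, narrow)]


def vnsra_wx(wide, shamt, sew=16):
    """Arithmetic right shift of 2*SEW sources, result truncated to SEW bits."""
    shamt &= 2 * sew - 1  # only the low log2(2*SEW) bits of the shift amount are used
    return [sext(sext(w, 2 * sew) >> shamt, sew) for w in wide]
```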

72
Shifting of Source Operands for Narrow-Width
• The data are extended and folded into the source data from VRF before
sending to execution
– Cycle 1: operate on full-width data
– Cycle 2: Shift result data to write back to right-half of v0
[Figure: narrowing shift v0=v2(shf)v1. Source v1 (src0) and the double-width sources v2 and v3 feed four shifters (s); the 16b results are written to elements 0-7 of v0, with the second half pipelined in after the right half]

73
Compute Instructions

74
Vector Integer Arithmetic Instructions
Vector Integer Add, Subtract, Reverse Subtract (including Double Width)
Vector Integer Extension
Vector Integer Add-with-Carry / Subtract-with-Borrow
Vector Bitwise Logical Instructions
Vector Bit Shift Instructions (including Narrow Width)
Vector Integer Comparison Instructions, write 1/0 to destination register
Vector Integer Min/Max Instructions, write min/max to destination register
Vector Integer Multiply (including Double Width)
Vector Integer Divide and Remainder
Vector Integer Multiply-Add/Sub, Multiply-Accumulate (including Double Width)
Vector Integer Merge Instructions (vector – vs1, scalar – rs1, imm)
Vector Integer Move Instructions (vector – vs1, scalar – rs1, imm)

75
Vector Arithmetic with Carry Instructions
vadc.vvm vd, vs2, vs1, vm=0 // vd[i] = vs2[i] + vs1[i] + v0[i].mask
vmadc.vvm vd, vs2, vs1, vm=0 // vd[i].mask = carry_out(vs2[i] + vs1[i] + v0[i].mask)
• Vector instructions are oriented to implementation:
− Single vector destination register
− Single write port per vector instruction
• The carry-in is in v0, with the same layout as a mask
• Two vector instructions are needed to complete the add with carry
− vadc: destination and source registers cannot be overlapped
− vadc: destination register cannot be v0
− vmadc: destination register can be v0 or any vector register
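The split into a sum instruction and a carry-out instruction can be sketched in Python. The model below assumes unsigned element values in range for the chosen SEW; the two-limb helper is an invented example of how the pair chains into a multi-precision add.

```python
def vadc(vs2, vs1, carry_in, sew=8):
    """Sum with carry-in, truncated to SEW bits."""
    mask = (1 << sew) - 1
    return [(a + b + c) & mask for a, b, c in zip(vs2, vs1, carry_in)]


def vmadc(vs2, vs1, carry_in, sew=8):
    """Carry-out of the same addition, one mask bit per element."""
    return [(a + b + c) >> sew for a, b, c in zip(vs2, vs1, carry_in)]


def add_two_limbs(a_lo, a_hi, b_lo, b_hi, sew=8):
    """Add 2*SEW-bit numbers stored as separate low and high limb vectors."""
    zero = [0] * len(a_lo)
    carry = vmadc(a_lo, b_lo, zero, sew)  # carry-out of the low limbs
    lo = vadc(a_lo, b_lo, zero, sew)
    hi = vadc(a_hi, b_hi, carry, sew)     # the carry is consumed as a mask
    return lo, hi
```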

76
Vector MAC Instructions
vmadd.vv vd, vs1, vs2, vm // vd[i] = vs2[i] + (vs1[i] * vd[i])
vmadd.vx vd, rs1, vs2, vm // vd[i] = vs2[i] + (x[rs1] * vd[i])
v(w)macc.vv vd, vs1, vs2, vm // vd[i] = vd[i] + (vs1[i] * vs2[i])
v(w)macc.vx vd, rs1, vs2, vm // vd[i] = vd[i] + (x[rs1] * vs2[i])
The subtracting forms vnmsub and vnmsac have the same format; only the macc form has widening variants (vwmacc)

77
Vector Merge Instructions
vmerge.vxm vd, vs2, rs1, vm=0 // vd[i] = v0.mask[i] ? x[rs1] : vs2[i]
vmerge.vvm vd, vs2, vs1, vm=0 // vd[i] = v0.mask[i] ? vs1[i] : vs2[i]
vmerge.vim vd, vs2, imm, vm=0 // vd[i] = v0.mask[i] ? imm : vs2[i]
• Each element uses v0.mask[i] to select the input element

78
Vector Integer Move Instructions
vmv.v.{v,x,i} vd, {vs1, rs1, imm}, vm=1
• Same encoding as with Vector Merge Instructions with vm=1 and vs2=v0
• {x,i} splat a scalar register or immediate to all elements
• vs1=vd is NOP
• Other vs2 values are reserved

79
Vector Fixed-Point Arithmetic Instructions
Vector Saturating Add and Subtract (set CSR vxsat bit)
Vector Averaging Add and Subtract (overflow is ignored)
– Add/subtract, then shift right by 1, rounding according to the CSR vxrm field
Vector Fractional Multiply with Rounding and Saturation
Vector Scaling Shift Instructions
Vector Narrowing Clip Instructions

80
Vector Fractional Multiply with Rounding and Saturation
vsmul.v{v,x} vd, vs2, {vs1,rs1}, vm
– Double-width multiplication vs2[i]*{vs1[i], rs1}
– Shift right by SEW-1
– Roundoff_signed according to the CSR vxrm field
– Saturate the result to SEW bits, set CSR vxsat bit if saturated
– vd[i] = clip(roundoff_signed(vs2[i]*vs1[i], SEW-1))
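The steps above can be sketched in Python. The rounding increments follow the four vxrm modes (0 = round-to-nearest-up, 1 = round-to-nearest-even, 2 = round-down, 3 = round-to-odd); the function names and the returned `vxsat` flag are illustrative choices for this sketch.

```python
def roundoff_signed(v, d, vxrm):
    """Shift v right by d bits, adding the rounding increment chosen by vxrm."""
    if d == 0:
        return v
    lsb = (v >> d) & 1                    # bit v[d]
    below = v & ((1 << d) - 1)            # bits v[d-1:0]
    half = (v >> (d - 1)) & 1             # bit v[d-1]
    rest = below & ((1 << (d - 1)) - 1)   # bits v[d-2:0]
    if vxrm == 0:      # rnu
        r = half
    elif vxrm == 1:    # rne
        r = half & int(rest != 0 or lsb == 1)
    elif vxrm == 2:    # rdn (truncate)
        r = 0
    else:              # rod
        r = int(lsb == 0 and below != 0)
    return (v >> d) + r


def vsmul(vs2, vs1, sew=8, vxrm=0):
    """Fractional multiply; returns (results, vxsat)."""
    lo, hi = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    out, vxsat = [], 0
    for a, b in zip(vs2, vs1):
        r = roundoff_signed(a * b, sew - 1, vxrm)  # double-width product, then shift
        if r > hi:
            r, vxsat = hi, 1
        elif r < lo:
            r, vxsat = lo, 1
        out.append(r)
    return out, vxsat
```

The only product that saturates is the most negative value multiplied by itself, for example -128 * -128 at SEW=8.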

81
Vector Scale-Shift and Clip Instructions
vssrl.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right logical
– vd[i] = round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm})
vssra.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right arithmetic
– vd[i] = round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm})
vnclip.w{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, signed narrowing clip
– vd[i] = clip(round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm}))
vnclipu.w{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, unsigned narrowing clip
– vd[i] = clip(round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm}))

82
Vector Floating-Point Arithmetic Instructions
Vector FP Add, Subtract, Reverse Subtract (including Double Width)
Vector FP Multiply (including Double Width)
Vector FP Divide, Reverse Divide
Vector FP Multiply-Add/Sub, Multiply-Accumulate (including Double Width)
Vector FP Square-Root, Reciprocal Square-Root Estimate, Reciprocal Estimate
Vector FP Comparison Instructions, write 1/0 to destination register
Vector FP Min/Max Instructions, write min/max to destination register
Vector FP Merge, Move Instructions (only fp.scalar – f[rs1])
Vector FP Sign-Injection Instructions
Vector FP Classify Instructions
Vector FP/Integer Type-Convert (including Double Width and Narrowing)

83
Vector Reduction Instructions
Vector Integer Reduction Instructions
Reduction sum (including Double Width)
Reduction min/max (including signed and unsigned)
Reduction bitwise AND, OR, XOR
Vector FP Reduction Instructions
Reduction ordered sum (including Double Width)
((((vs1[0] + vs2[0]) + vs2[1]) + vs2[2]) + vs2[3]) + …
Reduction unordered sum (including Double Width)
Reduction min/max
vd[0] = op( vs1[0] , vs2[*] ) where * are all active elements (mask enabled)
The w form writes the double-width result

84
Vector Reduction Sum
• Reduction of VLEN/SEW=256-bit/16-bit=16 elements
• LMUL>1 can be pipelined
[Figure: adder tree for v3=v3[0]+red(v1). The 16 elements E15-E0 (SEW=16b) of a 256b vector register, drawn as four 64-bit groups, are summed in stages of 8, 4, 2, and 1 adders, followed by a final add]

85
Vector Mask Instructions
Vector Mask-Register Logical Instructions
AND, NAND, AND-NOT, OR, NOR, OR-NOT, XOR, XNOR
Vector mask population count vpopc
vfirst find-first-set mask bit
vmsbf.m set-before-first mask bit
vmsif.m set-including-first mask bit
vmsof.m set-only-first mask bit
Vector Iota Instruction
Vector Element Index Instruction

86
Vector Mask Logical
• Mask logical instructions, vd = vs2 (logical-op) vs1
– vmmv.m = vmand.mm vd, vs, vs // copy mask register
– vmclr.m = vmxor.mm vd, vd, vd // clear mask register
– vmset.m = vmxnor.mm vd, vd, vd // set mask register
– vmnot.m = vmnand.mm vd, vs, vs // invert mask bits
• vl and vstart are used as with normal vector operations
• Single vector register operation, LMUL is ignored, bit operations, each bit is
for an element

87
Vector Mask Instructions
[VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0]
vs2 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register
rd 14 vpopc - x[rd] = sum_i( vs2[i] && v0[i] )
rd 3 vfirst - element index of first mask bit
vd 0 … 0 0 0 0 0 0 1 1 1 vmsbf - set before first mask bit
vd 0 … 0 0 0 0 0 1 1 1 1 vmsif - set include first mask bit
vd 0 … 0 0 0 0 0 1 0 0 0 vmsof - set only first mask bit
vd n … 3 3 2 1 1 0 0 0 0 viota - vd[i] = sum of preceding mask bits
vd m*i … 8 7 6 5 4 3 2 1 0 vid - mask index, vs2=v0
vstart=0, except for vid
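The rows of this table can be reproduced with a short Python model. The lists below are in element order (element 0 first), the reverse of how the slide prints them, and all elements are assumed active (unmasked); the function names simply mirror the instruction names.

```python
def vpopc(vs2):
    """Count of set mask bits."""
    return sum(vs2)


def vfirst(vs2):
    """Index of the first set mask bit, or -1 if none is set."""
    return vs2.index(1) if 1 in vs2 else -1


def vmsbf(vs2):
    """Set-before-first: 1 for every element before the first set bit."""
    f = vfirst(vs2)
    n = len(vs2) if f < 0 else f
    return [1 if i < n else 0 for i in range(len(vs2))]


def vmsif(vs2):
    """Set-including-first: as vmsbf, plus the first set bit itself."""
    f = vfirst(vs2)
    n = len(vs2) if f < 0 else f + 1
    return [1 if i < n else 0 for i in range(len(vs2))]


def vmsof(vs2):
    """Set-only-first: 1 only at the first set bit."""
    f = vfirst(vs2)
    return [1 if i == f else 0 for i in range(len(vs2))]


def viota(vs2):
    """vd[i] = number of set mask bits in elements 0..i-1."""
    out, count = [], 0
    for bit in vs2:
        out.append(count)
        count += bit
    return out
```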

88
Vector Permutation Instructions
Integer Scalar Move Instructions – between vector element 0 and x[rd/rs1]
Floating-Point Scalar Move Instructions – between vector element 0 and f[rd/rs1]
Vector Slide Instructions
Vector Register Gather Instructions
Vector Compress Instruction
Whole Vector Register Move
– Move instructions with decoded number of vector registers to copy, ignoring vtype LMUL

89
Vector Slide Instructions
• vslideup/down – move elements up and down a vector
− vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i]
− vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i]
− vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i]
− vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1]
− vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1]
− vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm]
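A minimal Python sketch of the slide semantics listed above, assuming vl equals VLMAX (the list length) and no masking. The function names mirror the instructions; the scalar argument stands in for x[rs1].

```python
def vslideup(vd, vs2, offset):
    """vd[i+offset] = vs2[i]; elements below offset keep their old value."""
    out = list(vd)
    for i in range(offset, len(vd)):
        out[i] = vs2[i - offset]
    return out


def vslidedown(vs2, offset):
    """vd[i] = vs2[i+offset]; reads past the end of the source return 0."""
    n = len(vs2)
    return [vs2[i + offset] if i + offset < n else 0 for i in range(n)]


def vslide1up(vs2, scalar):
    """vd[0] = scalar, vd[i+1] = vs2[i]."""
    return [scalar] + list(vs2[:-1])


def vslide1down(vs2, scalar):
    """vd[i] = vs2[i+1], vd[vl-1] = scalar."""
    return list(vs2[1:]) + [scalar]
```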

90
Vector Slideup Instruction
• The source and destination registers must not overlap, because the destination register can
be written before the source register is read (RAW)
− vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i]
− vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i]
− vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i]
− The slide1up scalar comes from XRF or FRF (x[rs1], f[rs1])
− All lower elements are not modified
[Figure: source register group v7-v4 and destination register group v11-v8, 32 elements each; x marks the slide1up scalar in the first result row and unchanged elements in the second]
Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vslide1up v8, v4 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x
vslideup v8, v4, 10 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x x x x x x x x x x

91
Vector Slidedown Instruction
• Overlapping source and destination registers is okay; the source register is
read before the destination register is written
− vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1]
− vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm]
− vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1]
− The slide1down scalar comes from XRF or FRF (x[rs1], f[rs1])
− All the upper elements are filled with zero
[Figure: source register group v7-v4, 32 elements; the upper destination elements are written with 0]
Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vslide1down v4, v4 0 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vslidedown v8, v4, 9 0 0 0 0 0 0 0 0 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9

92
Vector Register Gather Instructions
• vrgather – read elements from a source vector with index register to write
destination vector
– vrgather.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]]
– vrgatherei16.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]], vs1 is 16b
– vrgather.vx vd, vs2, rs1, vm // vd[i] = vs2[rs1]
– vrgather.vi vd, vs2, uimm, vm // vd[i] = vs2[uimm]
– If index >= VLMAX, then vd[i] = 0
– For SEW=8, vs1 can address only 256 elements
– Can be used in combination with vector unit-stride load/store instead of index load/store
element 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vs2 e15 e14 e13 e12 e11 e10 e9 e8 e7 e6 e5 e4 e3 e2 e1 e0
vs1 10 9 2 3 11 12 15 0 0 1 4 8 1 0 7 5
vd e10 e9 e2 e3 e11 e12 e15 e0 e0 e1 e4 e8 e1 e0 e7 e5
The element number in vd is the same as the index in vs1
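The table can be reproduced with a one-line Python model. As in the other sketches, lists are in element order (element 0 first), so the slide's rows appear reversed; the function name mirrors the instruction and indices are treated as unsigned.

```python
def vrgather(vs2, vs1):
    """vd[i] = vs2[vs1[i]]; an index at or beyond VLMAX yields 0."""
    vlmax = len(vs2)
    return [vs2[idx] if idx < vlmax else 0 for idx in vs1]
```

Because any element can be read any number of times, the same instruction can permute, replicate, or reverse a vector, depending only on the index vector.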

93
Vector Compress Instructions
• vcompress – pack the elements of the source vector that are selected by the mask into
contiguous elements at the start of the destination register
− vcompress.vm vd, vs2, vs1
// vd = elements of vs2 whose mask bit in vs1 is set, packed from element 0
− vm=1 (inst[25]), else illegal instruction
[VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0]
vs1 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register
vs2 ex … e8 e7 e6 e5 e4 e3 e2 e1 e0
vd … … … … … … … … e6 e5 e3 vcompress
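A short Python sketch of the compress operation, using the mask from the table in element order (element 0 first). In this sketch the destination elements beyond the packed ones are left unchanged, which is one permitted treatment of tail elements; the function name mirrors the instruction.

```python
def vcompress(vd, vs2, vs1):
    """Pack the elements of vs2 whose mask bit in vs1 is set, starting at vd[0]."""
    packed = [x for x, m in zip(vs2, vs1) if m]
    return packed + list(vd[len(packed):])
```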

94
Sample Codes

95
# void matmulFloat_v(
# size_t n, // a0: A columns, B rows
# size_t m, // a1: A rows
# size_t p, // a2: B columns
# const float *A, // a3: A is m x n matrix
# const float *B, // a4: B is n x p matrix
# float *C // a5: C is m x p matrix
# ) {
# int i, j, k;
#
# for (i = 0; i < m; i++) {
# for (j = 0; j < p; j++) {
# float c = 0;
# for (k = 0; k < n; k++) {
# c += A[i*n+k] * B[k*p+j];
# }
# C[i*p+j] = c;
# }
# }
# }
# Specialized example: n = 3, m = 2, p = 3, repeated for COUNT sets of matrices
# void matmulFloat_v(
# size_t n, // a0: A columns, B rows (3)
# size_t m, // a1: A rows (2)
# size_t p, // a2: B columns (3)
# const float *A, // a3: A is 2x3 matrix
# const float *B, // a4: B is 3x3 matrix
# float *C // a5: C is 2x3 matrix
# ) {
# int h, i, j, k;
# for (h = 0; h < COUNT; h++) {
# for (i = 0; i < 2; i++) {
# for (j = 0; j < 3; j++) {
# float c = 0;
# for (k = 0; k < 3; k++) {
# c += A[i*n+k] * B[k*p+j];
# }
# C[i*p+j] = c;
# }
# }
# }
# }
Matrix Multiplication – Issue #495 in RVV Extension Work Group
A (2x3) x B (3x3) = C (2x3)
A: a0 a1 a2    B: b0 b1 b2    C: c0 c1 c2
   a3 a4 a5       b3 b4 b5       c3 c4 c5
                  b6 b7 b8

96
C A B (each row gives the element indices of one product: c[C] += a[A]*b[B])
0 0 0
0 1 3
0 2 6
1 0 1
1 1 4
1 2 7
2 0 2
2 1 5
2 2 8
3 3 0
3 4 3
3 5 6
4 3 1
4 4 4
4 5 7
5 3 2
5 4 5
5 5 8
Matrix Multiplication – Scalar Coding Style
c0=a0*b0
c1=a0*b1
c2=a0*b2
c3=a3*b0
c4=a3*b1
c5=a3*b2
c0+=a1*b3
c1+=a1*b4
c2+=a1*b5
c3+=a4*b3
c4+=a4*b4
c5+=a4*b5
c0+=a2*b6
c1+=a2*b7
c2+=a2*b8
c3+=a5*b6
c4+=a5*b7
c5+=a5*b8
vfmul.vv v16, v10, v0 // c0 = a0*b0
vfmacc.vv v16, v11, v3 // c0 += a1*b3
Vector register assignment (matrix element → vector register):
A: a0 a1 a2 → v10 v11 v12, a3 a4 a5 → v13 v14 v15
B: b0 b1 b2 → v0 v1 v2, b3 b4 b5 → v3 v4 v5, b6 b7 b8 → v6 v7 v8
C: c0 c1 c2 → v16 v17 v18, c3 c4 c5 → v19 v20 v21
Each vector register holds 16 elements; 16 sets of matrices are operated on in parallel
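The scalar coding style maps directly onto lanes: each "vector register" below is a Python list with one lane per matrix set, so one vfmul or vfmacc advances all sets at once. This is a sketch with invented function names, meant only to show that the 6 multiplies and 12 multiply-accumulates give the same result as the scalar loop for every set.

```python
def matmul_2x3_3x3(a, b):
    """Scalar reference for one set: a has 6 elements, b has 9, result has 6."""
    return [sum(a[i * 3 + k] * b[k * 3 + j] for k in range(3))
            for i in range(2) for j in range(3)]


def batched_matmul(A, B):
    """A: 6 registers, B: 9 registers; each register holds one lane per set."""
    lanes = len(A[0])
    C = []
    for i in range(2):
        for j in range(3):
            # vfmul.vv: first product of the row/column pair
            acc = [A[i * 3][l] * B[j][l] for l in range(lanes)]
            for k in (1, 2):
                # vfmacc.vv: accumulate the remaining two products
                acc = [acc[l] + A[i * 3 + k][l] * B[k * 3 + j][l]
                       for l in range(lanes)]
            C.append(acc)
    return C
```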

97
Vector Scalar-Style Coding
VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements
• Single-precision floating-point data
• 16 sets of matrices can be processed in parallel
• Vector multiplication and MACC (6 vfmul and 12 vfmacc)
Assuming the data are contiguous in memory and prefetched into the L1
data cache (data cache hit); the stride is the distance between the matrices
• 6 stride loads for 16 A matrices (2x3=6), stride=4Bx6=24B
• 9 stride loads for 16 B matrices (3x3=9), stride=4Bx9=36B
• 6 stride stores for 16 C matrices (2x3=6), stride=4Bx6=24B

98
Stride Loads & Stores for Scalar-Style Code
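vlse32.v v10, (x2), x5 # Aref 0, stride x5 = 24B
addi x2, x2, 4 # next A element; the loads to v11-v15 follow the same pattern
addi x2, x2, 4
addi x2, x2, 4
addi x2, x2, 4
addi x2, x2, 4
addi x2, x2, 364 # 16*24B - 5*4B: advance to the next 16 A matrices
vlse32.v v0, (x3), x6 # Bref 0, stride x6 = 36B
addi x3, x3, 4 # next B element; the loads to v1-v8 follow the same pattern
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 4
addi x3, x3, 544 # 16*36B - 8*4B: advance to the next 16 B matrices
vsse32.v v16, (x4), x5 # Cref 0, stride x5 = 24B
addi x4, x4, 4 # next C element; the stores from v17-v21 follow the same pattern
addi x4, x4, 4
addi x4, x4, 4
addi x4, x4, 4
addi x4, x4, 4
addi x4, x4, 364 # 16*24B - 5*4B: advance to the next 16 C matrices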

99
Performance of Scalar-Style Codes
• Load 6 Arefs = 12*6 = 72 cycles // access 12 32B cache lines
• Load 9 Brefs = 17*9 = 153 cycles // access 17 32B cache lines
• Store 6 Crefs = 12*6 = 72 cycles // access 12 32B cache lines
• Vector FP multiplications = 18 cycles // hidden by the store
• Total cycles per loop = 297 for 16 sets of matrices
• Total cycles per set of matrices = 19 cycles
• Number of vector registers used = 6+9+6 = 21 registers, efficiency = 66%
• Number of static instructions in the loop = 62 instructions
• Average instructions per set of matrices = 4 instructions
• It is 16X the performance of scalar code

100
Matrix Multiplication – Unit Load Coding Style
• VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements
– LMUL=4, 16x4 = 64 elements
– LMUL=8, 16x8 = 128 elements
• Number of Cref or Aref per matrix = 6 elements
− 10 sets of matrices are 60 elements which fit into VRF with LMUL=4
• Number of Bref per matrix = 9 elements
− 10 sets of matrices are 90 elements which fit into VRF with LMUL=8

101
Unit Load of B elements
• Unit load of 10 consecutive B matrices (10x9=90 elements)
− vrgather to set up 10 sets of 6 B elements = 2, 1, 0, 2, 1, 0 (yellow) in v16
− vslideup v16 to v20
− vrgather to set up 10 sets of 6 B elements = 5, 4, 3, 5, 4, 3 (blue) in v16
− vrgather to set up 10 sets of 6 B elements = 8, 7, 6, 8, 7, 6 (white) in v24
vsetvli t1, x5, e32, m8 # vl=90, LMUL=8
vle32.v v8, (x3) # unit load - Bref, 90 elements
addi x3, x3, 360
vadd.vi v0, v0, -6
vrgather.vv v16, v8, v0 # 60 elements of first Bref in v16
vadd.vi v0, v0, 3
vslideup.vi v16, v16, 64 # move first Bref to v20
vrgather.vv v16, v8, v0 # 60 elements of second Bref in v16
vadd.vi v0, v0, 3
vrgather.vv v24, v8, v0 # 60 elements of third Bref in v24

102
Unit Load of A elements
• Unit load of 10 consecutive A matrices (10x6=60 elements)
− vrgather to set up 10 sets of 6 A elements = 3, 3, 3, 0, 0, 0 (yellow) in v16
− vrgather to set up 10 sets of 6 A elements = 4, 4, 4, 1, 1, 1 (blue)
− vrgather to set up 10 sets of 6 A elements = 5, 5, 5, 2, 2, 2 (white)
• Multiply and accumulate, then unit store the C elements to memory
vle32.v v8, (x2) # unit load - Aref, 60 elements
addi x2, x2, 240
vrgather.vv v28, v8, v4 # set up 60 elements of first Aref
vadd.vi v4, v4, 1
vrgather.vv v12, v8, v4 # set up 60 elements of second Aref
vadd.vi v4, v4, 1
vrgather.vv v28, v8, v4 # set up 60 elements of third Aref
vadd.vi v4, v4, -2 # restore the A index vector
vse32.v v20, (x4) # unit store - Cref, 60 elements
addi x4, x4, 240
bne

103
VRF
b8 31 a5
b7 30 a4
… 29 …
… 28 …
b3 27 a2
b2 26 a1
b1 25 a0
b0 24 a5
b8 23 a4
b7 22 a3
b6 21 a2
b5 20 a1
b4 19 a0
b3 18
b2 17
b1 16
b0 15 c5
b8 14 c4
b7 13 …
b6 12 …
b5 11 c2
b4 10 c1
b3 9 c0
b2 8 c5
b1 7 c4
b0 6 c3
5 c2
4 c1
3 c0
2
1
0
[Figure labels: vrgather and vslideup with the index for B elements; vrgather with the index for A elements]
Graphical View of Unit Load Codes

104
Performance of Unit Load Codes
• Load 60 Aref and 90 Bref elements, store 60 Cref elements = 28 cycles of 32B cache lines
o Vrgather for Aref = 8 cycles + 4 + 4 = 16 cycles
o Vrgather/vslideup for Bref = 13 cycles + 8 + 13 + 13 = 47 cycles
• Vector FP multiplications = 12 cycles
• Total cycles per loop = 103 for 10 sets of matrices
• Total cycles per set of matrices = 10 cycles (19 cycles for straight code)
• Number of static instructions in the loop = 27 instructions
• Average instructions per set of matrices = 2.7 instructions (4 instructions for straight code)
• Number of elements used in VLMAX, 60/64 and 90/128, efficiency is about 90%
• Unit loads/stores are much more efficient memory operations than stride loads: they access 28
instead of 41 cache lines. The complexity of stride load/store is shifted to the vrgather
instructions
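The cycle totals for the two coding styles can be recomputed from the cache-line counts. The sketch below assumes one cycle per 32B cache line touched, line-aligned base addresses, and takes the vrgather/vslideup and FP cycle counts (16, 47, 12) directly from the slide; the function names are invented for the illustration.

```python
import math

CACHE_LINE = 32  # bytes per cache line
ELEM = 4         # bytes per single-precision element


def strided_lines(stride_bytes, n_elems=16):
    """Cache lines spanned by one strided access of n_elems elements."""
    span = (n_elems - 1) * stride_bytes + ELEM
    return math.ceil(span / CACHE_LINE)


def unit_lines(n_elems):
    """Cache lines touched by one unit-stride access of n_elems elements."""
    return math.ceil(n_elems * ELEM / CACHE_LINE)


def scalar_style_cycles():
    """6 A loads + 9 B loads + 6 C stores; FP cycles are hidden by the stores."""
    return 6 * strided_lines(24) + 9 * strided_lines(36) + 6 * strided_lines(24)


def unit_load_cycles(gather_a=16, gather_b=47, fp=12):
    """Unit loads of 60 A and 90 B elements, unit store of 60 C elements."""
    memory = unit_lines(60) + unit_lines(90) + unit_lines(60)
    return memory + gather_a + gather_b + fp
```

Dividing by the number of matrix sets per loop (16 and 10) gives roughly 18.6 and 10.3 cycles per set, which the slides round to 19 and 10.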

105
Larger Matrices – Scalar Style Coding
• Same VLEN, SEW=32b, VLMAX=16
– A matrix 6x3, 18 elements – requires 18 vector registers
– B matrix 3x6, 18 elements – requires 18 vector registers
– C matrix 6x6, 36 elements – requires 36 vector registers
– Compute 1 row of A and C elements at a time, 6 iterations to finish 16 sets of matrices
Vector register assignment (matrix element → vector register):
B: b0-b5 → v0-v5, b6-b11 → v6-v11, b12-b17 → v12-v17
A (one row): a0 a1 a2 → v18 v19 v20
C (one row): c0-c5 → v21-v26

106
Larger Matrices – Unit Load Codes
• VLEN=512b, SEW=32b, VLMAX=16 elements
− LMUL=2, 32 elements, 2 vector registers are used for each 18 elements of Aref and Bref
− LMUL=4, 64 elements, 4 vector registers are used for 36 elements of Cref
− 1 set of matrices per loop iteration
VRF
a17 31
a16 30
… 29
… 28
a8 27
a7 26
a6 25
a5 24
a4 23
a3 22
a2 21
a1 20
a0 19
18
b17 17
b16 16
… 15 c35
… 14 c34
b8 13 …
b7 12 …
b6 11 c8
b5 10 c7
b4 9 c6
b3 8 c5
b2 7 c4
b1 6 c3
b0 5 c2
4 c1
3 c0
2
1
0
[Figure labels: vrgather with the index for B elements; vrgather with the index for A elements; vfmul and vfmacc]

107
Performances of Larger Matrices
                                         Scalar style   Unit load
Number of static instructions per loop   252            30
Total number of cycles                   1259           22*
Number of cycles per set of matrices     79             22
Efficiency of register usage             81%            56%**
*Vector loads/stores are pipelined between iterations
**6x4 or 6x5 matrices are more efficient. 6x3 has 18 active elements with VLMAX=32
Unit loads/stores are much more efficient memory operations than stride loads

108
RVV Work Group, Issue #362, Codes Based on RVV 0.8
• Performance requirement: 16B of 8b INT every clock cycle
• Loop of 8 instructions; load element = 8b INT, extension to 32b; operation on FP 32b
vsetvli t0, zero, e32, m8 // VLEN=512b, LMUL=8, SEW=32b, VLEN*LMUL=4096b
vlb.v v0, (t0)
addi t0, t0, imm
vadd.vx v0, v0, t1
flw f0, 0(t2)
addi t2, t2, imm
vfcvt.f.x.v v0, v0
vfmacc.vf v8, f0, v0
bne t0, t3, imm

109
Vector Processor Pipeline Example
VLEN=64B
Time 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
vadd v0, v0, t1 va va va
vfcvt v0, v0 fc fc fc fc fc fc
vfcvt v4, v4 fc fc fc fc fc
vfcvt v5, v5 fc fc fc fc
vfmacc v8, v0, f0 fm fm fm fm fm fm fm fm fm
vfmacc v9, v1, f0 fm fm fm fm fm fm fm fm
vfmacc v10, v2, f0 fm fm fm fm fm fm fm
vfmacc v11, v3, f0 fm fm fm fm fm fm
vlb v0, t0 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7
lfw f0, t2 f0 f0 f0
8 cycles
8 cycles 8 cycles
chaining
chaining

110
Vector Processor Performance
• Number of instructions and micro-ops in 8 cycles
– 3 vector arithmetic instructions = 24 micro-ops (vadd, vfcvt, vfmacc)
– 1 FP load
– 3 scalar instructions + 1 vector load to data cache
– Performance at 1 GHz = 64 GOPS or 48 GFLOPS for SEW=32b
– Performance at 1 GHz = 128 GOPS or 96 GFLOPS for SEW=16b
• Unroll the loop twice in order to pipeline vector load instructions
– All 32 vector registers are used
– Performance improvement can be achieved with VLEN=1024b

111
RVV Work Group, Issue #362, Codes Based on RVV 0.9
• Loop of 9 instructions; load element = 8b INT; vector extension to 32b; operation on FP 32b
vle8.v v6, (t0)
addi t0, t0, imm
vsext.vf4 v0, v6
vadd.vx v0, v0, t1
flw f0, 0(t2)
addi t2, t2, imm
vfcvt.f.x.v v0, v0
vfmacc.vf v8, f0, v0
bne t0, t3, imm
• An extra instruction in a tight loop can degrade the performance by 12.5%
• Andes provides a custom load instruction in order to achieve the required performance
• In addition, the vsext.vf4 instruction needs an extra write port since all the write ports are
used

112
AndesCore™ NX27V Introduction

113
Andes NX27V Design Objectives
• Vector Processor is implemented as part of the RISC-V CPU
• Configurability: necessary for IP to attract a wide range of customers, 128b to 512b
• Scalability: necessary for next generation
• Low power and area: necessary for embedded market
• Performance: out-of-order execution and design, 2*VLEN = double performance
• Simplicity: necessary for time to market, reducing verification time
• Modular Design: decoupling for independent design, verification, timing, and
clock gating

114
Overview of NX27V
• AndeStar V5 architecture:
– RV64GCN+ Andes V5 Extensions
– RV Vector extension (RVV)
• 5-stage pipeline, single-issue
– Optional branch prediction
• I/D caches
– Caches: 8KB to 64KB
– I$/D$ prefetch
– HW unaligned load/store accesses
– 16 non-blocking outstanding data accesses
• Wide data paths to feed VPU:
– Cached and uncached RVV load/stores
– Streaming Ports for ACE loads/stores

115
NX27V Pipeline
[Figure: NX27V pipeline. Stages: Fetch, Decode, Execute, Memory, Retire. Blocks: IFU, GPR, Integer Execution Unit, Data Cache, Exception Handling; VPU fed with V/F/D instructions and scalar data through the VIQ; ACE pipeline (Execute) with command and data paths to a custom coprocessor/memory controller; Streaming Ports, custom memory, and AXI]

116
NX27V Performance Gain
Functions                            Speedup (1)
F32 basic mathematical functions     19X
RGB CNN functions                    18X
Depthwise CNN functions              23X
Pointwise CNN functions              21X
Relu CNN functions                   69X
F32 filtering functions              19X
Q7 filtering functions               39X
F32 32x32x32 matrix multiplication   57X
(1) Compared to pure C scalar code compiled with high optimization; both vector
and scalar code ran on the NX27V FPGA with 512-bit VLEN, 256-bit bus.

117
Andes NX27V vs. Competition
Andes NX27V CAxx CMxx
Architecture RVV/Andes VPU Popular SIMD Hxxx
Vector registers 32 32 8
Vector Length Up to 512b 128b 128b
SIMD width Up to 512b/cycle 128b/cycle 64b/cycle
LMUL Yes None None
Chaining Yes Not applicable Yes
Custom extension ACE No No
Streaming Ports Yes No No

118
Core/System Performance Comparison
CPU A64FX* NX27V**
Vector ISA ARM SVE RISC-V RVV
Technology (nm) 7 7
Core sustained perf 16b (GFLOPS) ~56 96
Core Peak Perf 16b (GOPS) 230 320
# of cores 48 48**
System (TFLOPS) ~2.7 ~4.6
Memory BW (GB/s) 1024 3072
*Fujitsu presentation at Hot Chips 2018.
**Assume the same number of cores for comparison. 512-bit SIMD, 512-bit bus.

119
NX27V Floorplan
• Process: 7nm
• Frequency: 1GHz (worst case)
• Size: ~0.3mm²

120
Development Tools
Standard tools in AndeSight™ IDE:
– AndeSim: Near-cycle accurate simulator
– Compiler (intrinsic functions and inline assembly)
– Assembler
– Debugger and ICE
– Computation library
Advanced tools:
– AI compiler support through LLVM
– AndesClarity pipeline visualizer and analyzer
o Pipeline view of instructions and functional units
o Resource view corresponding to instruction usage
o Stall bubbles with different colors for data dependency

121
Clarity: NX27V Pipeline View

122
Summary
• Andes NX27V vector processor
– The world's first commercial RISC-V vector processor; VLEN = 512b/256b are
available
– The 512-bit NX27V vector processor has been shipped to a lead customer
– High performance in HPC and AI applications
– Flexible VPU configurations to enable a wide range of applications
– AndesClarity for performance optimization
– Work with customers on Andes Custom Extension to provide proprietary
hardware acceleration and market competitiveness
• Andes Technology will continue to expand vector product families to
meet the requirements of edge to cloud applications

Thank you!
#RISCVSUMMIT @risc_v
