SlideShare a Scribd company logo
1 of 124
Download to read offline
RISC-V Vector
Extension Demystified
December 10, 2020
Thang Tran, Ph.D.
Principal Engineer
2
Copyright© 2020 Andes Technology Corp.
Andes Corporate Overview
• Core R&D from AMD, DEC, Intel,
MIPS, nVidia, and Sun
Silicon Valley Tie
15-Year CPU IP Company
• IPO in 2017; HQ in Taiwan
• AndeStar™ V1-V3, V5 (RISC-V)
>1 Bn Annual Run Rate
of Andes-Embedded SoC
• ~300 customers in TW, CN, US,
EU, JP, KR
Founding Premier Member
of RISC-V International
• Chairing Task Groups
• Contributing to GNU, LLVM,
uBoot, glibc, Linux, etc.
3
Copyright© 2020 Andes Technology Corp.
V5 Adoptions: From MCU to Datacenters
• Edge to Cloud:
− ADAS
− AIoT
− Blockchain
− FPGA
− MCU
− Multimedia
− Security
− Wireless (BT/WiFi)
• 1 to 1000+ core
• 40nm to 7nm
• Many in AI
− Datacenter AI accelerators
− SSD: enterprise (& consumer)
− 5G macro/small cells
5G Macro
4
Copyright© 2020 Andes Technology Corp.
Agenda
• Andes overview
• Vector technology background
– SIMD/vector concept
– Vector processor basic
• RISC-V V extension ISA
– Basic
– CSR
– Memory operations
– Compute instruction format
• Sample codes
– Matrix multiplication
– Loads with RVV versions 0.8 and 1.0
• AndesCore™ NX27V introduction
• Summary
5
Copyright© 2020 Andes Technology Corp.
• ISA: Instruction Set Architecture
• GOPS: Giga Operations Per Second
• GFLOPS: Giga Floating-Point OPS
• XRF: Integer register file
• FRF: Floating-point register file
• VRF: Vector register file
• SIMD: Single Instruction Multiple Data
• MMX: Multi Media Extension
• SSE: Streaming SIMD Extension
• AVX: Advanced Vector Extension
• Configurable: parameters are fixed at built time, i.e. cache size
• Extensible: added instructions to ISA includes custom instructions to be added by customer
• Standard extension: the reserved codes in the ISA for special purposes, i.e. FP, DSP, …
• Programmable: parameters can be dynamically changed in the program
• ACE: Andes Custom Extension
• CSR: Control and Status Register
• SEW: Element Width (8-64)
• ELEN: Largest Element Width (32 or 64)
• XLEN: Scalar register length in bits (64)
• FLEN: FP register length in bits (16-64)
• VLEN: Vector register length in bits (128-512)
• LMUL: Register grouping multiple (1/8-8)
• EMUL: Effective LMUL
• VLMAX/MVL: Vector Length Max
• AVL/VL: Application Vector Length
Terminology
6
Copyright© 2020 Andes Technology Corp.
RISC-V & RVV Background
• An open processor architecture started by UC Berkeley
– Compact, modular, extensible
• RISC-V International (formerly RISC-V Foundation)
– RISC-V Foundation formed in 2015 to govern its growth
– 750+ members including industry and research institute/university
• RISC-V Vector Extension
– Defined in a Foundation Task Group
– Vector instruction set with scalable vector
– Scalable data sizes with 2x data expansion arithmetic
– Over 300+ vector instructions, including load/store, integer, fixed-
point/floating-point operations
7
Copyright© 2020 Andes Technology Corp.
RISC-V V Spec Status
• Version 1.0 (draft) was published in July 2020
• Note from the spec:
“This is a draft of the stable proposal for the vector extension specification to
be used for implementation and evaluation. Once the draft label is removed,
version 1.0 is intended to be sent out for public review as part of the RISC-V
International ratification process. Version 1.0 is also considered stable enough
to begin developing toolchains, functional simulators, and initial
implementations, including in upstream software projects, and is not expected
to have major functionality changes except if serious issues are discovered
during ratification. Once ratified, the spec will be given version 2.0”.
• Version 1.0 is targeted for end of 2020 and ready for
ratifying
RISC-V V Spec Status
8
Copyright© 2020 Andes Technology Corp.
Vector technology
background
9
Copyright© 2020 Andes Technology Corp.
• Instruction Stream
− Sequence of Instructions read from memory
• Data Stream
− Operations performed on the data in the processor
• SIMD is the most interesting which is the basic for:
− MMX, SSE, SSE2-5, AVX
− Altivec
− Neon
Single Multiple
Single SISD SIMD
Multiple MISD MIMD
Number of
Instruction Streams
Number of Data Streams
Flynn’s Classification
Microprocessors
DSP
Graphic
Vector
10
Copyright© 2020 Andes Technology Corp.
Why SIMD?
• Single 32-bit Microprocessor
– One 32-bit element per instruction
• 512-bit SIMD processor
– 16 32-bit elements per instruction
– 32 16-bit elements per instruction
– 64 8-bit elements per instruction. If the SIMD processor clock frequency is
at 1 GHz, then the performance is 64 GOPS
• Increasing the SIMD size, the higher the performance
• SIMD is the basic for Vector Processor
– N operations per instruction where the elements in the operations are
independent from each other
11
Copyright© 2020 Andes Technology Corp.
Instruction stream
LD VR <-- A[3:0] LD0 LD1 LD2 LD3 LD0
ADD VR <-- VR, 1 ADD0 ADD1 ADD2 ADD3 LD1 ADD0
MUL VR <-- VR, 2 MUL0 MUL1 MUL2 MUL3 LD2 ADD1 MUL0
ST A[3:0] <-- VR ST0 ST1 ST2 ST3 LD3 ADD2 MUL1 ST0
ADD3 MUL2 ST1
Time MUL3 ST2
ST3
Different ops at same time
Space
VECTOR PROCESSOR
LD ADD MUL ST
Same ops at same space
PE0 PE1 PE3
PE2
Space
ARRAY PROCESSOR
Same ops at same time
Different ops at same space
Multiple loads are handling by hardware
Parallel Processing
• Time-space duality:
– Array processor: Instruction operates on multiple data elements at the same
time – multi-thread, multi-core
– Vector processor: Instruction operates on multiple data elements in
consecutive time steps – pipeline of SIMD execution
12
Copyright© 2020 Andes Technology Corp.
Intel MMX, SSE, SSE2, AVX
• MMX (1997) adds 57 new integer instructions, 64-bit registers
– 8 of 8-bit, 4 of 16-bit, 2 of 32-bit, or 1 of 64-bit (reuse 8FP registers – FP and MMX cannot mix)
– Number of elements is defined in the opcode, only consecutive memory load/store is allowed
– Move, Add, Subtract, Shift, Logical, Mult, MACC, Compare, Pack/Unpack
• SSE (1999) adds 70 new instructions for single precision FP, 2 of 32-bit
• SSE2 (2001) adds 144 new instructions for FP, 4 of 32-bit, 128-bit registers
• SSE3 (2004), SSE4 (2006), SSE5 (2007) add more new instructions
• AVX (2008) adds new instructions for 256-bit registers
• AVX2 (2013) adds more new instructions
• AVX-512 (2015) adds new instructions for 512-bit registers
• Issue:
– Each register length has a new set of instructions
– Many instructions for the 512-bit SIMD
13
Copyright© 2020 Andes Technology Corp.
v31
f31
r31
v31
f31
r31
One Lane.
Usually less
Lanes than
Vector
Elements
(Lanes/VFUs
timeshared)
A Vector
Functional Unit
(VFU).
Multiple VFUs
per Lane can
operate in
parallel
A Vector
Element
Parallelism achieved by:
• Multiple Lanes
• Multiple VFUs per Lane (e.g., L/S, +, x)
• Individual VFUs are pipelined
RISC-V CPU
Vector
Meta Data
VLDS
VLEN
Vector Extension Processor Overview
VLEN >= SIMD
Newell, R., “RISC-V Vector Extension Proposal Snapshot,” Microsemi, Microchip Technology Inc., 2018
14
Copyright© 2020 Andes Technology Corp.
VLSU
Vector Register File
Memory
MULT
ADD
chain chain
Vector Instruction Chaining
• Independent functional units are needed
– Partial data of an instruction can be forwarded
and operations are scheduled to be executed in
parallel
– Data chaining can happen only if the
instructions operate on multiple SIMD
widths
• VLEN=SIMD width, single instruction/cycle
– The operation is on multiple elements but it is
basically pipelining, not chaining
– The instructions must be set for a vector length
that is greater then the SIMD width in order for
chaining to happen
15
Copyright© 2020 Andes Technology Corp.
Chaining – Vector Unit Structure
• Four lanes, 4 functional units to fetch/execute independent data
16 elements, 4 lanes:
- Elements 0-3
- Elements 4-7
- Elements 8-11
- Elements 12-15
Functional Unit
16
Copyright© 2020 Andes Technology Corp.
Cycles 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
0 vld Ia0 Ia1 Ia2 Ia3 Ia4 Ia5 Ia6 Ia7
1 vadd la8 la9 la10 la11 la12 la13 la14 la15 Ib0 Ib1 Ib2 Ib3 Ib4 Ib5 Ib6 Ib7
2 vmult la16 la17 la18 la19 la20 la21 la22 la23 lb8 lb9 lb10 lb11 lb12 lb13 lb14 lb15 Ic0 Ic1 Ic2 Ic3 Ic4 Ic5 Ic6 Ic7
3 add la24 la25 la26 la27 la28 la29 la30 la31 lb16 lb17 lb18 lb19 lb20 lb21 lb22 lb23 lc8 lc9 lc10 lc11 lc12 lc13 lc14 lc15
4 vld Ie0 Ie1 Ie2 Ie3 Ie4 Ie5 Ie6 Ie7 lb24 lb25 lb26 lb27 lb28 lb29 lb30 lb31 lc16 lc17 lc18 lc19 lc20 lc21 lc22 lc23
5 vadd le8 le9 le10 le11 le12 le13 le14 le15 If0 If1 If2 If3 If4 If5 If6 If7 lc24 lc25 lc26 lc27 lc28 lc29 lc30 lc31
6 vmult le16 le17 le18 le19 le20 le12 le22 le23 lf8 lf9 lf10 lf11 lf12 lf13 lf14 lf15 Ig0 Ig1 Ig2 Ig3 Ig4 Ig5 Ig6 Ig7
7 add le24 le25 le26 le27 le28 le29 le30 le31 lf16 lf17 lf18 lf19 lf20 lf21 lf22 lf23 lg8 lg9 lg10 lg11 lg12 lg13 lg14 lg15
8 lf24 lf25 lf26 lf27 lf28 lf29 lf30 lf31 lg16 lg17 lg18 lg19 lg20 lg21 lg22 lg23
9 lg24 lg25 lg26 lg27 lg28 lg29 lg30 lg31
8 lanes, 8 LS units 8 lanes, 8 vector MUL units 8 lanes, 8 vector ADD units
Chaining of 3 vector operations
Chaining
• 32 elements, 8 lanes = 8 units to fetch/execute 8 independent elements
– vld: fetch data 32 elements in 4 clock cycles, 8 elements per cycle
– Vector add/mult: each vector instruction passes through the functional units
4 times
17
Copyright© 2020 Andes Technology Corp.
RAW Dependencies
vectors
scalar
Vectorize Loop Iterations
• X and Y are vectors, a is scalar, n is a large number
• For N=64, dynamic instructions
– Processor = 64x9 + 2 = 578:
– Vector = 6 (5 vectors and 1 scalar)
Processor/SIMD Code
C Code
Vector Code
Operate on
64 elements
From: Patterson, D., CS252, UCB
18
Copyright© 2020 Andes Technology Corp.
Advantages of Vector Architecture
• From Spec92fp*
– Static Instructions: scalar/vector = 4.5X
– Dynamic Instructions: scalar/vector = 23.8X
– Operations: scalar/vector = 1.4X (the extra operations are loop overheads)
• Most efficient way for data parallel processing
– High computation throughput (operate on multiple elements at the same time)
– Require high memory bandwidth, not low latency, data operation is chaining. The data
memory is interleaving multiple banks for higher memory bandwidth
• Orthogonal to microprocessor architecture
– Scalar or superscalar microprocessor with vector unit, e.g. Cray X1, Alpha Tarantula,
Andes NX27V
– Parallel vector processors, e.g. NEC Earth Simulator, Broadcom Calisto
*Quintana, F., et.al., “An ISA comparison between Superscalar and Vector Processors,” U. of Las Palmas de Gran Canaria
19
Copyright© 2020 Andes Technology Corp.
Example: Cray 1
• Scalar and vector mode
• 8 64-element vector registers
– SEW = 64-bit
– VLEN = 64x64 = 4096-bit
• 8 64-bit scalar registers
• 8 24-bit address registers
• 16 memory banks
– Bank latency = 11 cycles
– Sustain 16 parallel accesses to
different banks
“Cray-1 Computer System Hardware Reference
Manual,”1977, Cray Research, Inc.
20
Copyright© 2020 Andes Technology Corp.
3rd Iteration
VLMAX elements
4th Iteration
VLMAX elements
1st Iteration - odd size
MOD(VLMAX) piece
2nd Iteration
VLMAX elements
Strip Mining
• When the size of vector > VLMAX
– size MOD(VLMAX) = the number of elements in the
first iteration
– n/VLMAX = number of iterations for each VLMAX time
• Example size = 200
– 200 MOD(64) = 8 elements in first iteration
– 200/64 = 3 iterations of 64 elements
• Strip Mining is assumed to be done in software
similar to loop operation for VLMAX
– vl or vector mask register is used for the first iteration
21
Copyright© 2020 Andes Technology Corp.
Technical Challenges of Vector Architecture
• Complexity of vector register file (VRF) with VLEN=512-bit
– Complexity of handling read and write ports with multiple lanes are much more expensive in VRF in
comparison to scalar or FP register files
– Performance is related to the number of read and write ports
– The number of functional units are also limited by the number of read/write ports. Stalling of
writing back to VRF means additional 512-bit buffers and back pressure to previous pipeline stages
• Microprocessor methods of handling out-of-order execution and retirement
become expensive for vector processor
– Register renaming added several vector registers of 512-bit
– Re-Order Buffer has a retire queue of 512-bit for vector registers
• Expensive to implement precise exception and interrupt latency
– In AI or embedded market, the precise exception is less important
• Vector loads/stores are expensive, especially for stride and index
• Andes NX27V vector processor simplifies many of the above issues
22
Copyright© 2020 Andes Technology Corp.
Applications of Vector Processing
• Scientific computations (astrophysics, atmospheric, ocean modeling, bioinformatics, physics,
biomolecular simulation, chemistry, fluid dynamics, material sciences, quantum chemistry)
• Engineering computations (computer vision and image, data mining and data intensive
computing, CAD/CAM engineering analysis, military applications, vlsi design)
• Multimedia Processing (compress., graphics, audio synth, image proc.)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
• Artificial Intelligent
• Lossy Compression (JPEG, MPEG video and audio)
• Lossless Compression (Zero removal, RLE, Differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems/Networking (memcpy, memset, parity, checksum)
• Databases (hash/join, data mining, image/video serving)
• Language run-time support (stdlib, garbage collection)
Source: Rochester Institute of Technology, “Introduction to Vector Processing,” Shaaban, EECC722
23
Copyright© 2020 Andes Technology Corp.
Vectorizing Media Processing
• Kernel
− Matrix transpose/multiply
− DCT (video, communication)
− FFT (audio)
− Motion estimation (video)
− Gamma correction (video)
− Haar transform (media mining)
− Median filter (image processing)
− Separable convolution (img. proc.)
• Vector length
− # vertices at once
− image width
− 256-1024
− image width, iw/16
− image width
− image width
− image width
− image width
(http://www.research.ibm.com/people/p/pradeep/tutor.html)
24
Copyright© 2020 Andes Technology Corp.
RISC-V V Extension ISA
Basic
25
Copyright© 2020 Andes Technology Corp.
Vector Register ISA
• Vector-Register ISA Definition:
− All vector operations are between vector registers (except for load and store). This is in contrast
with Vector-Memory ISA where all vector operations are on data memory
− Vector load-store unit fetches and stores the memory data to/from vector registers
• Basic requirements
− Each vector data register hold N M-bit values
 SEW – single vector element in bits, M-bit, power of 2
 VLEN – number of bits in a vector register, N*M-bit, power of 2, VLEN > ELEN
 VLMAX – number of vector elements, N, in the vector register, VLEN/SEW
 Elements are independently executed (mostly)
− Vector control registers: VMASK (mask – predication – valid element for execution) is v0
− Vector instructions
 Arithmetic, logical, shift, multiply, divide, mask, permute, floating-point functional units
 Vector load/store unit: load/store with stride, load/store with index, segment load/store
26
Copyright© 2020 Andes Technology Corp.
Vector Processing ISA Basic
Scalar
Scalar FP
ALU
Mult s.int 8 .vv
Macc u.int 16 .vx masked
Mask fp 32 .vi unmasked
Permute 64
Divide
4 8
load s.int 8 16 masked
store u.int 16 32 unmasked
32 64 segment
64
Vector registers
Standard scalar RISC-V ISA
Saturate
Overflow
constant
indexed
32x(128b, 256b, 512b)
Standard floating-point RISC-V ISA
Vector ALU
Vector Memory
Andes Vector processor: VLEN=512b, SEW=8,16,32,64b, elements=64,32,16,8
These numbers are used for the rest of the presentation
27
Copyright© 2020 Andes Technology Corp.
Notations
• Register references
– vs1, vs2, vd – source and destination vector registers, VRF
– rs1, rs2, rd – source and destination integer registers, XRF
– fs1, fs2, fd – source and destination floating point registers, FRF
• Register numbers
– v0, v1, …, v30, v31 – vector register numbers (addresses, indices)
– x0, x1, …, x30, x31 – integer register numbers (addresses, indices)
– f0, f1, …, f30, f31 – floating-point register numbers (addresses, indices)
– x0 is not real register; source register data = 0, destination register = no
write back
– Vector mask and carry bits use v0
28
Copyright© 2020 Andes Technology Corp.
FP MAC unit
0 MVL Element FP misc unit
1 MVL Element
2 MVL Element FP divide unit
3 MVL Element
… … Integer ALU
… …
… … Integer MAC unit
… … R0
30 MVL Element R1 Integer divide unit
31 MVL Element R2
R3 Vector Mask Unit
…
Vector Length … Vector Permute Unit
Vector Mask R28
R29
R30
R31
Memory
multi banks
Vector Load/Store
Vector
Registers
Scalar
Registers
Vector Processor Overview - Block Diagram
29
Copyright© 2020 Andes Technology Corp.
FP MAC unit
0 MVL Element FP misc unit
1 MVL Element
2 MVL Element FP divide unit
3 MVL Element
… … Integer ALU
… …
… … Integer MAC unit
… … R0
30 MVL Element R1 Integer divide unit
31 MVL Element R2
R3 Vector Mask Unit
…
Vector Length … Vector Permute Unit
Vector Mask R28
R29
R30
R31
Memory
multi banks
Vector Load/Store
Vector
Registers
Scalar
Registers
vector functional units
Many types of memories can be implemented
VRF read/write
ports control
Data dependency
is handled by
scoreboard
Execution queues for out-of-order execution
Vector Processor Microarchitecture Features
30
Copyright© 2020 Andes Technology Corp.
RISC-V V Extension ISA
CSR
31
Copyright© 2020 Andes Technology Corp.
Vector CSR Registers
Address Privilege Name Description
0x008 URW vstart Vector start position
0x009 URW vxsat Fixed-Point Saturate Flag
0x00A URW vxrm Fixed-Point Rounding Mode
0x00F URW vcsr Vector control and status register
0xC20 URO* vl Vector length
0xC21 URO* vtype Vector data type register
0xC22 URO vlenb VLEN/8 (vector register length in bytes) – configuration register
*written by vsetvl(i) instruction, not by CSR access instructions
32
Copyright© 2020 Andes Technology Corp.
CSR – Vector Type Register (vtype)
• vlmul[2:0] – grouping of vector registers, 2x, 4x, 8x, or fraction of vector register, 1/2, 1/4, 1/8
– 000: no grouping, 32 vector registers, VLMAX = VLEN/SEW
– 001: 16 vector registers, effective VLEN is 2X, twice number of elements
– 010: 8 vector registers, effective VLEN is 4X, quadruple number of elements
– 011: 4 vector registers, effective VLEN is 8X, octuple number of elements
– 100: Reserved
– 101: Fractional vector register, effective VLEN is 1/8, one-eighth number of elements
– 110: Fractional vector register, effective VLEN is 1/4, quarter number of elements
– 111: Fractional vector register, effective VLEN is 1/2, half number of elements
• vsew[2:0] – vector standard element width, VLMAX = VLEN/VSEW
– 000 – 111: 8, 16, 32, 64, 128, 256, 512, 1024
• vta – vector tail agnostic, implementation dependent, should the data is unknown or undisturbed
• vma – vector mask agnostic, implementation dependent, should the data is unknown or undisturbed
• vill – illegal value in the vsetvl(i) instruction, any subsequent vector instruction is illegal instruction
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vill
vma vta
vlmul
vsew
reserved
33
Copyright© 2020 Andes Technology Corp.
LMUL=1 LMUL=4
v31 v28
v30 v24
v29 …
v28 v4
… LMUL=2 v0
… v30
… v28
… … LMUL=8
v3 … v24
v2 … v16
v1 v2 v8
v0 v0 v0
LMUL – Registers Grouping
• LMUL=2, instructions with odd register are illegal instructions
• LMUL=4, vector registers are incremented by 4, else illegal instructions
• LMUL=8, only v0, v8, v16, and v24 are valid vector registers
34
Copyright© 2020 Andes Technology Corp.
elements 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
LMUL=1
LMUL=1/2
LMUL=1/4
LMUL=1/8
LMUL – Fractional Registers
• LMUL=1/2, load data into half vector register, then perform vector double operations to expand data
to the whole register
• LMUL=1/4, vector quad instructions to expand data to the whole register
• LMUL=1/8, using vector extension of 8x to expand to the whole register
35
Copyright© 2020 Andes Technology Corp.
R31 F31 V31
. . .
. . .
. . .
R0 F0 V0
[VLMAX-1] … [2] [1] [0]
FP
Register Vector Registers
Vector Length Register VLEN
Scalar
Register
Vector Register Programming Model
• 32 vector registers – 5-bit
• VLEN = The width of each vector register[i] in power of 2
• Version 0.7: FPU cannot shares the vector registers
36
Copyright© 2020 Andes Technology Corp.
R31 F31 V31
. . .
. . .
. . .
R0 F0 V0
[VLMAX-1] … [2] [1] [0]
FP
Register Vector Registers
Vector Length Register VLEN
Scalar
Register
15 14 … … 2 1 0 element #
VLEN=512b, SEW=32b
Vector Register 32-bit Elements
37
Copyright© 2020 Andes Technology Corp.
R31 F31 V31
. . .
. . .
. . .
R0 F0 V0
[VLMAX-1] … [2] [1] [0]
FP
Register Vector Registers
Vector Length Register VLEN
Scalar
Register
31 30 … … 2 1 0 element #
VLEN=512b, SEW=16b
Vector Register 16-bit Elements
38
Copyright© 2020 Andes Technology Corp.
R31 F31 V31
. . .
. . .
. . .
R0 F0 V0
[VLMAX-1] … [2] [1] [0]
FP
Register Vector Registers
Vector Length Register VLEN
Scalar
Register
63 62 … … 2 1 0 element #
VLEN=512b, SEW=8b
Vector Register 8-bit Elements
39
Copyright© 2020 Andes Technology Corp.
Vm*n 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Vm*n+1 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64
Vm*n
Vm*n+1
Vm*n
Vm*n+1
Vm*n
Vm*n+1
SEW=8
SEW=16
SEW=32
SEW=64
8
15 14 13 12 11 10 9
0
1
2
3
7 6 5 4
30 29 28 27 26
31 25
61 60 59 58 57
24 23
63 62 56 55
15 14 13 12 11 10 9 8
31 30 29 28 27 26 25 24
21 20 19 18 17 16
54 53 52 51 50 49 48
22 5 4 3 2 1 0
47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
15 14 6
13 12 11 10 9 8 7
7 6 5 4 3 2 1 0
23 22 21 20 19 18 17 16
Element Numbering, LMUL=2
• VLEN=SLEN is now mandatory in the RVV version 1.0
– Straight forward numbering, increment sequentially
– The numbering is important for mask and permutation operations
40
Copyright© 2020 Andes Technology Corp.
Vector Mask – Conditional or Predicate Execution
• v0 is used for vector mask, single mask for all LMUL values
– 1 mask bit for each element
• 0 – mask (mask-off), no write to destination vector elements, cannot have exception
• 1 – unmask
– The mask is enabled for each vector instruction with inst[25]=0, inst[25]=1 is unmask
• vop.v v8, v16, 3, v0.t // mask is enable, each element writes back with v0_mask[i]=1
• vop.v v8, v16, 3 // unmask, all elements write result data back to VRF
• Note: new vector mask load/store maybe added
SEW=8
SEW=16
SEW=32
7 6 5 4 3 2 1 0 SEW=64
maskbitsforv7 maskbitsforv6 maskbitsforv5 maskbitsforv4 maskbitsforv3
maskv7 maskv6
v0vectormaskregister
maskbitsforv2
maskv5 maskv4
maskbitsforv1 maskbitsforv0
maskv3 maskv2 maskv1 maskv0
mv7 mv6 mv5 mv4 mv3 mv2 mv1 mv0
Unused
Unused
Unused
LMUL=8
LMUL=4 LMUL=2
41
Copyright© 2020 Andes Technology Corp.
LMUL=8 and Chaining Example
Cycles: 1 2 3 4 5 6 7 8 9 10
vmul v0, v8, v16 X1 X2
v1, v9, v17 X1 X2
……… … … … … …
v7, v15, v23 X1 X2
vadd v24, v0, v8 X1
v25, v1, v9 X1
……… … … … … … … …
Vector length: VLEN * LMUL
• LMUL=1  512 bits
• LMUL=8  4096 bits
V0~V7 group 1
V8~V15 group 2
V16~V23 group 3
V24~V31 group 4
Xn: execution LMUL Chaining
42
Copyright© 2020 Andes Technology Corp.
elements 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vl=VLMAX
vl=21
CSR - Vector Length Register (vl)
• vl is the number of elements for vector operations
– Updated by vsetvl(i) instruction along with the vtype CSR
• vl is set to VLMAX if the set value is greater than VLMAX
• The default vl is VLMAX
• If x0 is used to set vl, then vl is set to VLMAX
– Force to an element by fault-only-first load instruction
– vl=0, then no elements are updated in vector operations
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
reserved vl[VLMAX*8-1:0]
43
Copyright© 2020 Andes Technology Corp.
CSR Vector Instruction – vsetvli/vsetvl
• Two vector CSRs are updated by single instruction
– vsetvli rd, rs1, vtypei // vl=rs1 (which is also AVL, application vector length), rd=AVL
– vsetvl rd, rs1, rs2 // rd = avl
• CSR updates:
– csr.vtype = vtypei or rs2
• e8, e16, e32, e64, … setting the SEW size
• {m1}, m2, m4, m8, mf2, mf4, mf8 – setting the LMUL, default is m1 if m value is not specified
• {tu}, ta – setting tail elements, default is tu if ta is not specified
• {mu}, ma – setting masked-off elements, default is mu if ma is not specified
• csr.vl = rs1
– rd=rs1=x0 no update for csr.vl
– rd!=x0, rs1=x0 vl=VLMAX, x[rd]=vl
– rs1!=x0 vl=x[rs1], x[rd]=vl
– x[rs1] > VLMAX vl=VLMAX, x[rd]=vl
44
Copyright© 2020 Andes Technology Corp.
Sample Codes for LMUL and SEW Manipulation
• VLEN=512b, e8 to e32 elements for 16 elements
vsetvli t0, a0, e8, mf4 // set SEW=8b, LMUL=1/4, VLMAX=64/4=16 elements
vle8.v v8, (a2) // load 16 8b elements
vsext.vf4 v8, v8 // Signed extended 16 8b elements to 32b elements
vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=16 (VLMAX)
vadd.vx v8, v8, t1 // 32-bit vector add for 16 32b elements
…
• Notes:
− Write and read to CSRs are normally serial events but for VPU a speculative value of LMUL, SEW, and vl
are used for subsequent instructions until the values are changed again
sew=8b
lmul=1/4
sew=32b
lmul=1
vl=16
sew=8
45
Copyright© 2020 Andes Technology Corp.
Optimized codes for Load and Signed Extension
• VLEN=512b, e8 to e32 elements for 16 elements
vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=512/32=16 elements
vle8.v v8, (a2) // load 16 8b elements, vle16.v or vle32.v is always 16 elements
vsext.vf4 v8, v8 // Signed extended 16 8b elements to 32b elements
vadd.vx v8, v8, t1 // 32-bit vector add for 16 32b elements
…
• Notes:
− In earlier version, 0.8, of RVV ISA, the vle8.v and vsext.vf4 instructions are replace by a single instruction,
(the ACE load instruction can still do this load):
vlb.v v8, (a2), // load 8b elements and signed extended to 32b elements, still 16 elements
46
Copyright© 2020 Andes Technology Corp.
Sample Code for Large Data Set
• VLEN=512b, e8 to e32 elements for 128 elements
vsetvli t0, zero, e32, m8 // set SEW=32b, LMUL=8, vl=512b*8/32b=128 elements
vle8.v v6, (a2) // load 128 8b elements into 2 vector registers, v6-v7
vsext.vf4 v0, v6 // Signed extended 128 8b elements to 32b elements
vadd.vx v0, v0, t1 // 32-bit vector add for 128 elements
…
• Notes:
– vle8.v: load data into 2 vector registers, even register number or illegal instruction
– vsext.vf4: the source operands are overlapped with destination operands
• If source operand is v0, v2, v4, then it is illegal instruction
• v6 and v7 are legal instructions because the vector registers are read before
written by the result data
47
Copyright© 2020 Andes Technology Corp.
CSR - Vector Byte Length
• Vector register length in bytes, vlenb = VLEN/8
– Design time constant, vlenb = 512/8 = 64
– Read-only CSR, constant value, hardwired
– A configuration CSR
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vlenb = 64
reserved
48
Copyright© 2020 Andes Technology Corp.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vstart[VLMAX*8-1:0]
reserved
CSR – Vector Start Index Register (vstart)
• Initialize to zero
• vstart is the index of element that causes the trap on vector instruction
• When return from trap, the vector execution restarts from the vstart
element and reset the vstart after vector instruction execution
• vstart > vl, meaning that no element operation and vstart is reset to zero
For Andes vector processor, only load/store instruction can cause trap and
interrupt are taken at instruction boundary. A non-zero value for vstart for
any other vector instruction will cause illegal instruction.
49
Copyright© 2020 Andes Technology Corp.
Valid Vector Range
Behavior for destination element i
0 < i < vstart Unchanged
vstart <= i < vl Updated if v0.mask[i] enabled, unchanged if not
mta exception
vl <= i < VLMAX Unchanged, vta exception
VLMAX-1 vl-1 Masked elements vstart
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Elements
valid elements for operation
50
Copyright© 2020 Andes Technology Corp.
• vxsat – Vector Fixed-Point Saturation Flag Register – single bit to indicate that
saturated value was written to destination vector register
• vxrm – Vector Fixed-Point Rounding Mode Register:
− 00: rnu – round to nearest up (add +0.5 to LSB)
− 01: rne – round to nearest even
− 10: rdn – round down (truncate)
− 11: rod – round to odd (OR bits into LSB, “jam”)
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sat
vxrm
reserved
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sat
reserved
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vxrm
reserved
• vcsr – merging of vxsat and vxrm into single CSR, explicitly reset by software
CSR – Vector Execution Fixed Point Register
51
Copyright© 2020 Andes Technology Corp.
CSR – Vector Execution Floating-Point Register
• The vector processor uses the fflags and frm from the FPU
– fflags[4:0]: floating-point accrued exception flags
– frm[2:0]: floating-point rounding modes
52
Copyright© 2020 Andes Technology Corp.
RISC-V V Extension ISA
Memory Operations
53
Copyright© 2020 Andes Technology Corp.
Memory Operations
• Load/store operations move groups of data between the VRF and memory
– Andes implements unaligned addressing, no restriction
• Four basic types of addressing
– Unit stride: fetch consecutive elements in contiguous memory
– Non-unit or constant stride: elements are in stride memory; stride is the distance
between 2 elements of a vector in memory
– Indexed (gather-scatter): elements are randomly offset from a base address
– Segment: move multiple contiguous fields in memory to and from consecutively
numbered vector registers – can be unit-stride, constant stride, or index
• One TLB lookup per group of loads/stores (Andes)
– Limit the memory access to 1 TLB page – 1 translation
– 4KB, memory access fault if any element address crosses the page boundary
• Atomic load/store is not implemented
54
Copyright© 2020 Andes Technology Corp.
Extract 7 from first 32B
Extract 25 from second 32B
Vector Load Data Alignment
Byte 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vn 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Firstpackage
Secondpackage
Andes processor handles all unaligned addresses
55
Copyright© 2020 Andes Technology Corp.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
VL* unit-stride mw vm
VLS* strided mw vm
VLX* indexed mw vm
VS* unit-stride mw vm
VSS* strided mw vm
VSX* indexed mw vm
mop
mop
mop
mop
mop
0100111
rs2 rs1 width vs3 0100111
0000111
rs1 vd
width
0000111
0100111
vs3
lumpop
rs1 width vd
nf vs2 rs1 width vs3
rs1
nf
nf
nf
vs2
nf sumpop rs1 width
nf
mop
rs2
width vd 0000111
mw-width[2:0]
E1024
FL/FS
H
W
D
Q
Vector - VLx/VSx
E8
E16
E32
E64
E128
E256
E512
1110
1111
16
32
64
128
8
16
32
64
128
256
512
1024
x001
x010
x011
x100
0000
0101
0110
0111
1000
1101
00000
01000
10000
others
unit-stride
unit-stride, whole register
load, fault-only-first
reserved
lumop/sumop[4:0]
EEW=MEW
Instruction Format – Load/Store
Segment,
multiple
registers
FP load/store
Vector
load/store
mop[1:0] function
00 unit-stride
01 index_unordered
10 constant-stride
11 index_ordered
11 index
store
ls
store
load
ls
56
Copyright© 2020 Andes Technology Corp.
Vector load/store, different EMUL
• EMUL is the effective LMUL
• SEW=16, LMUL=1, VLEN=512b
– elements=512/16=32, all vector instructions operate on 32 elements
vle16.v v1, (t0) # Load a vector of sew=16, lmul=1, load to v1
vle32.v v2, (t1) # Load a vector of sew=32, emul=2, load to v2 and v3
vwadd.wv v4, v2, v1 # v4/v5 ← v2/v3 + sign-extend(v1), emul=2
vse32.v v4, (t1) # Store a vector of sew=32, emul=2, v4 and v5 store to memory
vle8.v v1, (t1) # Load a vector of sew=8, emul=1/2, load to elements first half of v1
vle64.v v4, (t0) # Load a vector of sew=64, emul=4, load to v4, v5, v6, v7
vle128.v v8, (t0) # Load a vector of sew=64, emul=8, load to v8, v9, v10, …, v15
57
Copyright© 2020 Andes Technology Corp.
Load with and without Extension
• SEW=32b, LMUL=8, VLEN=512b, (512b/32b)*8 = 128 elements
• RVV version 1.0 – vle8.v v6, (t0) & vsext.vf4 v0, v6
− Load 128 elements of 8b
 512b/8b = 64 elements, write to 2 vector registers, v6-7
− Signed extension of 128 elements from 8b to 32b
 Read 8b elements from 2 vector registers, v6-7
 Write 32b elements to 8 vector registers, v0-7
• RVV version 0.8 – vlb.v v0, (t0) – custom Andes instruction
− Load 128 elements of 8b and signed extended to 32b
 Write data into 8 vector registers, v0-7
58
Copyright© 2020 Andes Technology Corp.
Unit-Stride Load/Store Variances
• Load Fault-only-First: exception only for first element
– Set vl=faulted element, no exception
– Use for exit condition of loop
– Example: the unit load fetches to the end of the TLB page, all subsequent
vector operations will be for the valid elements that were fetched. This is the
reversed of the stripped mining for unknown number of application elements
• Load/store of whole register:
– vtype & vl are ignored, SEW=8b, uses NF bits for multiple vector registers
– vm=1 (inst[25]), else illegal instruction
– Vstart >= VLEN/8, no operation
– Use for fill/spill, context switch, return from interrupt, …
59
Copyright© 2020 Andes Technology Corp.
Example: strncpy, Fault-only-First Load
strncpy:
mv a3, a0 // Copy dst
loop:
vsetvli x0, a2, e8, m8, ta, ma // Vectors of bytes, VLEN=512b
vle8ff.v v8, (a1) // Fault-only-First load to 8 vector registers
vmseq.vi v1, v8, 0 // Flag zero bytes
csrr t1, vl // Get number of bytes fetched
Note: ta is used to write zero to elements after the faulted elements (vl=faulted
elements)
60
Copyright© 2020 Andes Technology Corp.
Stride Gather/Scatter Memory Access
• Unit stride continuous memory access
• Constant stride: gather scatter (stride can be negative or zero)
– Support store memory overlapped due to stride=0 or stride<SEW
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Unit
0 1 2 3 unit stride
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 1 2 3 stride=2
Stride 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Gather
Scatter 0 1 2 3 stride=8
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 1 2 3 stride=0
61
Copyright© 2020 Andes Technology Corp.
Note:
- The indices are absolute and not accumulative
- The indices are not in-order
Index Gather/Scatter Memory Access
Base Address 7 12 8 2 vs2
Gather 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Base Address 5 8 0 15 vs2 0 1 2 3 vd/vs3
Scatter 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
&data[0]
&data[0]
Load Index vector
Store Index vector
62
Copyright© 2020 Andes Technology Corp.
Segment Unit Stride Load/Store
• Multiple vector registers load/store, NF*EMUL <= 8
• Equivalent to 4 stride load/store micro-ops
• Fault-only-first is allowed
Segment stride load/store with
NF=4 is four stride loads/stores
0 1 2 3 4 5 6 7 8 9 10 11 Memory
V31
.
.
.
.
V3
V2
V1
V0
[0] [1] [2] … [VLMAX-1]
Vector Registers
Load 2
Load 1
Load 0
Unit load/store with NF=4, LMUL=1
63
Copyright© 2020 Andes Technology Corp.
0 1 2 3 4 5 6 7 8 9 10 11 Memory
V31
.
.
.
.
V3
V2
V1
V0
[0] [1] [2] … [VLMAX-1]
Unit load/store with NF=2, LMUL=2
Load 0 Load 1 Load 2
Vector Registers
Segment Stride Load/Store, LMUL=2
• Multiple vector registers load/store, NF*EMUL <= 8
• Zero and negative strides are allowed
First load/store, stride=2, LMUL=2
Second stride load/store, base address=1
Stride=0, write [0] into all elements of v0 & v1, [1] into all elements of v2 & v3
64
Copyright© 2020 Andes Technology Corp.
Segment Index Load/Store
• Multiple vector registers load/store, NF*EMUL <= 8
– Overlapping of store data to memory
– NF number of index load/store micro-ops, the index is incremented by SEW
Segment index load/store with
NF=4 is four index loads/stores
Base Address 7 12 8 2
Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
V31
.
.
.
.
V3 10 15 11 5
V2 9 14 10 4
V1 8 13 9 3
V0 7 12 8 2
[0] [1] [2] … [VLMAX-1]
&data[0] Index vector
Vector Registers
65
Copyright© 2020 Andes Technology Corp.
RISC-V V Extension ISA
Compute Instruction
Format
66
Copyright© 2020 Andes Technology Corp.
Vector Arithmetic Instruction Overview
• vop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] op vs1[i]
• vop.vx vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] op x[rs1]
• vop.vi vd, vs2, imm, vm // vector-imm, vd[i]=vs2[i] op imm
• vfop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] fop vs1[i]
• vfop.vx vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] fop f[rs1]
67
Copyright© 2020 Andes Technology Corp.
Compute Instruction Format
OPI – integer operation
OPF – floating-point operation
OFM – mask operation
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
OPIVV vector-vector m
OPFVV vector-vector m
OPMVV vector-vector m
OPIVI vector-immed m
OPIVX vector-scalar m
OPFVF vector-scalar(F) m
OPMVX vector-scalar m
vsetvli scalar-immed 0
vsetvl scalar-scalar 1
funct6
rs1 101
funct6
vs2 vs1 001
funct6
funct6
111
vd
funct6
rd 1010111
1010111
1010111
vs2 vs1 010
1010111
vs2 rs1 100 vd
1010111
vs2 vs1 000 vd 1010111
vd/rd 1010111
vd/rd 1010111
vs2 simm5 011 vd
000000
zimm[10:0]
vs2 rs1 110 vd/rd
rs1 111 rd
funct6
funct6
vs2
1010111
rs2 rs1
Funct6 is used to encode all vector
instructions, most of the time, the
same funct6 code is used for all 3
forms for vector instructions (vector,
scalar, immediate)
68
Copyright© 2020 Andes Technology Corp.
Vector Widening Instructions
• Widening: double width
− vwop.v{v,x} vd, vs2, vs1/rs1 // 2*SEW = sign/zero(SEW) op sign/zero(SEW)
− vwop.w{v,x} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op sign/zero(SEW)
− vfwop.v{v,f} vd, vs2, vs1/rs1 // 2*SEW = SEW op SEW
− vfwop.w{v,f} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op SEW (vfwadd & vfwsub)
• Widening operations
− Integer ALU: unsigned/signed extended the source operands before the operation
− Integer MUL: operate on SEW-bit, then return 2*SEW
− Floating-point: operate on SEW-bit, then return 2*SEW
− Same number of elements rule as defined in vtype:
o Double width  writing back to twice the number of vector registers
69
Copyright© 2020 Andes Technology Corp.
Vector Widening Instructions – Extension
• Vector integer extension
− V{z,s}ext.vf{2,4,8} vd, vs2 // Zero/sign extend to SEW
− The result is the same number of elements set by vtype
− Example: SEW=32b, then only V{z,s}ext.vf{2,4} are legal instructions
70
Copyright© 2020 Andes Technology Corp.
Shifting of Source Operands for Double-Width
• The data are extended and folded into the source data from VRF before
sending to execution
– Cycle 1: operate on full-width data and write back to v2
shift data from left-half and extension to full width
– Cycle 2: operate on full-width data and write back to v3
Pipeline in after the right half to write to v3 16b 16b 16b 16b 16b 16b 16b 16b
v1 E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0
v0 E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0
src0 - v1
src1 - v0
Stage 2
v2=v1+v0
2*SEW
v2
v3
Elem 2
Stage 1
Extension
Elem 7 Elem 6 Elem 5
Elem 15 Elem 14 Elem 13 Elem 12 Elem 11 Elem 10 Elem 9
Elem 1
Elem 7 Elem 6
Elem 7 Elem 6 Elem 5 Elem 4 Elem 3 Elem 2 Elem 1 Elem 0
Elem 1 Elem 0
Elem 5 Elem 4 Elem 3 Elem 2 Elem 0
Elem 8
Elem 4 Elem 3
+ + + +
+ + + +
71
Copyright© 2020 Andes Technology Corp.
Vector Narrow Instructions
• Narrowing: from double-width
– vnop.w{v,x,i} vd, vs2, vs1/rs1 // SEW = 2*SEW op SEW
– Logical and arithmetic right shift operations
Example of operations without changing vtype, LMUL=1, SEW=16b, 32 elements in
all vector instruction operations:
vle16.v v1, (t0) // Load to v1 full vector registers
vle32.v v2, (t1) // Load a vector of sew=32b, emul=2, load to v2 and v3
vwadd.wv v0, v2, v1 // v0/v1 ← v2/v3 + sign-extend(v1), emul=2
vnsra.wx v7, v0, t2 // v7 ← v0/v1 (shift right by t2)
vse16.v v7, (t0) // Store v7 to memory
72
Copyright© 2020 Andes Technology Corp.
Shifting of Source Operands for Narrow-Width
• The data are extended and folded into the source data from VRF before
sending to execution
– Cycle 1: operate on full-width data
– Cycle 2: Shift result data to write back to right-half of v0
Pipeline in after the right half to write to v3
v1
src0 - v1
v3
v2
v0 Elem 7 Elem 6 Elem 5 Elem 4 Elem 3 Elem 2 Elem 1 Elem 0
Elem 1
Elem 3
Elem 0
Elem 2 Elem 1 Elem 0
Elem 7 Elem 6
Elem 0
Elem 1
Elem 3
Elem 3
16b 16b 16b 16b
Elem 7 Elem 6 Elem 5
v0=v2(shf)v1
Elem 4
Elem 5 Elem 4 Elem 3 Elem 2
s
s
s
s
73
Copyright© 2020 Andes Technology Corp.
RISC-V V Extension ISA
Compute Instructions
74
Copyright© 2020 Andes Technology Corp.
Vector Integer Arithmetic Instructions
Vector Integer Add, Subtract, Reverse Subtract (including Double Width)
Vector Integer Extension
Vector Integer Add-with-Carry / Subtract-with-Borrow
Vector Bitwise Logical Instructions
Vector Bit Shift Instructions (including Narrow Width)
Vector Integer Comparison Instructions, write 1/0 to destination register
Vector Integer Min/Max Instructions, write min/max to destination register
Vector Integer Multiply (including Double Width)
Vector Integer Divide and Remain
Vector Integer Multiply-Add/Sub, Multiply-Accumulate (including Double Width)
Vector Integer Merge Instructions (vector – vs1, scalar – rs1, imm)
Vector Integer Move Instructions (vector – vs1, scalar – rs1, imm)
75
Copyright© 2020 Andes Technology Corp.
Vector Arithmetic with Carry Instructions
vadc.vv vd, vs2, vs1, vm=0 // vd[i] = vs2[i] + vs1[i] + v0[i].mask
vmadc.vvm vd, vs2, vs1, vm=0 // vd[i].mask = carry_out(vs2[i] + vs1[i] + v0[i].mask)
• Vector instructions oriented to implementation :
− Single vector destination register
− Single write port per vector instruction
• The carry-in is in v0 same as mask layout
• Two vector instructions to complete the add with carry
− vadc: destination and source registers cannot be overlapped
− vadc: destination register cannot be v0
− vmadc: destination register can be v0 or any vector register
76
Copyright© 2020 Andes Technology Corp.
Vector MAC Instructions
v(w)madd.vv vd, vs2, vs1, vm // vd[i] = vs2[i] + (vs1[i] * vd[i])
v(w)madd.vx vd, vs2, rs1, vm // vd[i] = vs2[i] + (rs1 * vd[i])
v(w)macc.vv vd, vs2, vs1, vm // vd[i] = vd[i] + (vs1[i] * vs2[i])
v(w)macc.vx vd, vs2, rs1, vm // vd[i] = vd[i] + (rs1 * vs2[i])
v(w)msub and v(w)msac have the same format
77
Copyright© 2020 Andes Technology Corp.
Vector Merge Instructions
vmerge.vx vd, vs2, rs1, vm=0 // vd[i] = vm[i] ? rs1 : vs2[i]
vmerge.vv vd, vs2, vs1, vm=0 // vd[i] = vm[i] ? vs1[i] : vs2[i]
vmerge.vi vd, vs2, imm, vm=0 // vd[i] = vm[i] ? imm : vs2[i]
• Each element uses vm[i] to select the input element
78
Copyright© 2020 Andes Technology Corp.
Vector Integer Move Instructions
vmv.v.{v,x,i} vd, {vs1, rs1, imm}, vm=1
• Same encoding as with Vector Merge Instructions with vm=1 and vs2=v0
• {x,i} splat a scalar register or immediate to all elements
• vs1=vd is NOP
• Other vs2 values are reserved
79
Copyright© 2020 Andes Technology Corp.
Vector Fixed-Point Arithmetic Instructions
Vector Saturating Add and Subtract (set CSR vxsat bit)
Vector Averaging Add and Subtract (overflow is ignored)
– Add/subtract, then shift right by 1, round off according to CSR vxrm bit
Vector Fractional Multiply with Rounding and Saturation
Vector Scaling Shift Instructions
Vector Narrowing Clip Instructions
80
Copyright© 2020 Andes Technology Corp.
Vector Fractional Multiply with Rounding and Saturation
vsmul.v{v,x} vd, vs2, {vs1,rs1}, vm
– Double-width multiplication vs2[i]*{vs1[i], rs1}
– Shift right by SEW-1
– Roundoff_signed according to CSR vxrm bit
– Saturate the result to SEW bits, set CSR vxsat bit if saturated
– vd[i] = clip(roundoff_signed(vs2[i]*vs1[i], SEW-1))
81
Copyright© 2020 Andes Technology Corp.
Vector Scale-Shift and Clip Instructions
vssrl.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right logical
– vd[i] = round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm})
vssra.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right arithmetic
– vd[i] = round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm})
vnclip.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, narrow clip
– vd[i] = clip(round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm}))
vnclipu.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, narrow clip
– vd[i] = clip(round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm}))
82
Copyright© 2020 Andes Technology Corp.
Vector Floating-Point Arithmetic Instructions
Vector FP Add, Subtract, Reverse Subtract (including Double Width)
Vector FP Multiply (including Double Width)
Vector FP Divide, Reverse Divide
Vector FP Multiply-Add/Sub, Multiply-Accumulate (including Double Width)
Vector FP Square-Root, Reciprocal Square-Root Estimate, Reciprocal Estimate
Vector FP Comparison Instructions, write 1/0 to destination register
Vector FP Min/Max Instructions, write min/max to destination register
Vector FP Merge, Move Instructions (only fp.scalar – f[rs1])
Vector FP Sign-Injection Instructions
Vector FP Classify Instructions
Vector FP/Integer Type-Convert (including Double Width and Narrowing)
83
Copyright© 2020 Andes Technology Corp.
Vector Reduction Instructions
Vector Integer Reduction Instructions
Reduction sum (including Double Width)
Reduction min/max (including signed and unsigned)
Reduction bitwise AND, OR, XOR
Vector FP Reduction Instructions
Reduction ordered sum (including Double Width)
((((vs1[0] + vs2[0]) + vs2[1]) + vs2[2]) + vs2[3]) + …
Reduction unordered sum (including Double Width)
Reduction min/max
vd[0] = op( vs1[0] , vs2[*] ) where * are all active elements (mask enabled)
The w form writes the double-width result
84
Copyright© 2020 Andes Technology Corp.
Vector Reduction Sum
• Reduction of VLEN/SEW=256-bit/16-bit=16 elements
• LMUL>1 can be pipelined
SWPL =
SEW = 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b
V3-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0
V1-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0
v3=v3[0]+red(v1)
V3-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0
64-bit 64-bit
64-bit 64-bit
+
+
+ +
+ + + +
+ + + + + + + +
85
Copyright© 2020 Andes Technology Corp.
Vector Mask Instructions
Vector Mask-Register Logical Instructions
AND, NAND, AND-NOT, OR, NOR, OR-NOT, XOR, XNOR
Vector mask population count vpopc
vfirst find-first-set mask bit
vmsbf.m set-before-first mask bit
vmsif.m set-including-first mask bit
vmsof.m set-only-first mask bit
Vector Iota Instruction
Vector Element Index Instruction
86
Copyright© 2020 Andes Technology Corp.
Vector Mask Logical
• Mask logical instructions, vd = vs2 (logical-op) vs1
– vmmv.m = vmand.mm vd, vs, vs // copy mask register
– vmclr.m = vmxor.mm vd, vd, vd // clear mask register
– vmset.m = vmxnor.mm vd, vd, vd // set mask register
– vmnot.m = vmnand.mm vd, vs, vs // invert mask bits
• vl and vstart are used as with normal vector operations
• Single vector register operation, LMUL is ignored, bit operations, each bit is
for an element
87
Copyright© 2020 Andes Technology Corp.
Vector Mask Instructions
[VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0]
vs2 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register
rd 14 vpopc - x[rd] = sum_i( vs2[i] && v0[i] )
rd 3 vmfirst - element index of first mask bit
vd 0 … 0 0 0 0 0 0 1 1 1 vmsbf - set before first mask bit
vd 0 … 0 0 0 0 0 1 1 1 1 vmsif - set include first mask bit
vd 0 … 0 0 0 0 0 1 0 0 0 vmsof - set only first mask bit
vd n … 3 3 2 1 1 0 0 0 0 viota - vd[i] = sum of preceding mask bits
vd m*i … 8 7 6 5 4 3 2 1 0 vid - mask index, vs2=v0
vstart=0, except for vid
+
88
Copyright© 2020 Andes Technology Corp.
Vector Permutation Instructions
Integer Scalar Move Instructions – between vector element 0 and x[rd/rs1]
Floating-Point Scalar Move Instructions – between vector element 0 and f[rd/rs1]
Vector Slide Instructions
Vector Register Gather Instructions
Vector Compress Instruction
Whole Vector Register Move
– Move instructions with decoded number of vector registers to copy, ignoring vtype LMUL
89
Copyright© 2020 Andes Technology Corp.
Vector Slide Instructions
• vslideup/down – move elements up and down a vector
− vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i]
− vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i]
− vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i]
− vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1]
− vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1]
− vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm]
90
Copyright© 2020 Andes Technology Corp.
Vector Slideup Instruction
• No source and destination registers overlapping, destination register can
be written before the source register is read (RAW)
− vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i]
− vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i]
− vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i]
− Slide1up is from XRF or FRF
− All lower elements are not modified
Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
slide1up
vslide1up v8, v4 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x
vslideup v8, v4, 10 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x x x x x x x x x x
v11 v10 v9 v8
Unchange
x[rs1], f[rs1]
v5 v4
v7 v6
91
Copyright© 2020 Andes Technology Corp.
Vector Slidedown Instruction
• Source and destination registers overlapping is okay, source register is
read before the destination register is written
− vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1]
− vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm]
− vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1]
− Slide1down is from XRF or FRF
− All the upper elements fill in with zero
Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x[rs1], f[rs1]
slide1down
vslide1down v4, v4 0 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vslidedown v8, v4, 9 0 0 0 0 0 0 0 0 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
v4
v7 v6 v5
v7 v6 v5 v4
Write 0
92
Copyright© 2020 Andes Technology Corp.
Vector Register Gather Instructions
• vrgather – read elements from a source vector with index register to write
destination vector
– vrgather.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]]
– vrgatherei16.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]], vs1 is 16b
– vrgather.vx vd, vs2, rs1, vm // vd[i] = vs2[rs1]
– vrgather.vi vd, vs2, uimm, vm // vd[i] = vs2[uimm]
– If index >= VLMAX, then vd[i] = 0
– For SEW=8, vs1 can address only 256 elements
– Can be used in combination with vector unit-stride load/store instead of index load/store
element 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
vs2 e15 e14 e13 e12 e11 e10 e9 e8 e7 e6 e5 e4 e3 e2 e1 e0
vs1 10 9 2 3 11 12 15 0 0 1 4 8 1 0 7 5
vd e10 e9 e2 e3 e11 e12 e15 e0 e0 e1 e4 e8 e1 e0 e7 e5
The element number in vd is the same as the index in vs1
93
Copyright© 2020 Andes Technology Corp.
Vector Compress Instructions
• vcompress – shift the unmask elements from source vector to destination
register
− vcompress.vm vd, vs2, vs1
// vd[i] = Compress into vd elements of vs2 where vs1(LSB) is enabled
− vm=1 (inst[25]), else illegal instruction
[VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0]
vs1 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register
vs2 ex … e8 e7 e6 e5 e4 e3 e2 e1 e0
vd … … … … … … … … e6 e5 e3 vcompress
94
Copyright© 2020 Andes Technology Corp.
Sample Codes
95
Copyright© 2020 Andes Technology Corp.
# void matmulFloat_v(
# size_t n, // a0: A columns, B rows
# size_t m, // a1: A rows
# size_t p, // a2: B columns
# const float *A, // a3: A is m x n matrix
# const float *B, // a4: B is n x p matrix
# float *C // a5: C is m x p matrix
# ) {
# int i, j, k;
#
# for (i = 0; i < m; i++) {
# for (j = 0; j < p; j++) {
# float c = 0;
# for (k = 0; k < n; k++) {
# c += A[i*n+k] * B[k*p+j];
# }
# C[i*p+j] = c;
# }
# }
# }
# void matmulFloat_v(
# size_t 3, // a0: A columns, B rows
# size_t 2, // a1: A rows
# size_t 3, // a2: B columns
# const float *A, // a3: A is 2x3 matrix
# const float *B, // a4: B is 3x3 matrix
# float *C // a5: C is 2x3 matrix
# ) {
# int i, j, k;
# for (h = 0; h< COUNT; h++) {
# for (i = 0; i < 2; i++) {
# for (j = 0; j < 3; j++) {
# float c = 0;
# for (k = 0; k < 3; k++) {
# c += A[i*n+k] * B[k*p+j];
# }
# C[i*p+j] = c;
# }
# }
# }
# }
Matrix Multiplication – Issue #495 in RVV Extension Work Group
b0 b1 b2
b3 b4 b5
b6 b7 b8
a0 a1 a2 c0 c1 c2
a3 a4 a5 c3 c4 c5
Matrix
Multiplication
96
Copyright© 2020 Andes Technology Corp.
C A B
0 0 0
0 1 3
0 2 6
1 0 1
1 1 4
1 2 7
2 0 2
2 1 5
2 2 8
3 3 0
3 4 3
3 5 6
4 3 1
4 4 4
4 5 7
5 3 2
5 4 5
5 5 8
Matrix Multiplication – Scalar Coding Style
c0=a0*b0
c1=a0*b1
c2=a0*b2
c3=a3*b0
c4=a3*b1
c5=a3*b2
c0+=a1*b3
c1+=a1*b4
c2+=a1*b5
c3+=a4*b3
c4+=a4*b4
c5+=a4*b5
c0+=a2*b6
c1+=a2*b7
c2+=a2*b8
c3+=a5*b6
c4+=a5*b7
c5+=a5*b8
vfmul.vv v16, v10, v0
vfmul.vv v17, v10, v1
vfmul.vv v18, v10, v2
vfmul.vv v19, v13, v0
vfmul.vv v20, v13, v1
vfmul.vv v21, v13, v2
vfmacc.vv v16, v11, v3
vfmacc.vv v17, v11, v4
vfmacc.vv v18, v11, v5
vfmacc.vv v19, v14, v3
vfmacc.vv v20, v14, v4
vfmacc.vv v21, v14, v5
vfmacc.vv v16, v12, v6
vfmacc.vv v17, v12, v7
vfmacc.vv v18, v12, v8
vfmacc.vv v19, v15, v6
vfmacc.vv v20, v15, v7
vfmacc.vv v21, v15, v8
b0 b1 b2
b3 b4 b5
b6 b7 b8
a0 a1 a2 c0 c1 c2
a3 a4 a5 c3 c4 c5
v0 v1 v2
v3 v4 v5
v6 v7 v8
v10 v11 v12 v16 v17 v18
v13 v14 v15 v19 v20 v21
vector registers
Matrix
Multiplication
Matrix
Multiplication
Each vector register holds 16 elements, 16 sets of matrices are operated in parallel
97
Copyright© 2020 Andes Technology Corp.
VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements
• Single precision floating point data
• 16 sets of matrices can be processed in parallel
• Vector multiplication and MACC (6 vfmul and 12 vfmac)
Assuming data are contiguous in memory with prefetched in L1
data cache (data cache hit) – stride between the matrices
• 6 stride loads for 16 A matrices (2x3=6), stride=4Bx6=24
• 9 stride loads for 16 B matrices (3x3=9), stride=4Bx9=36
• 6 stride store for 16 C matrices (2x3=6), stride=4Bx6=24
Vector Scalar-Style Coding
98
Copyright© 2020 Andes Technology Corp.
Stride Loads & Stores for Scalar-Style Code
vlse32.v v10, x2, x5 Aref 0
addi x2, x2, 4
vlse32.v v11, x2, x5 Aref 1
addi x2, x2, 4
vlse32.v v12, x2, x5 Aref 2
addi x2, x2, 4
vlse32.v v13, x2, x5 Aref 3
addi x2, x2, 4
vlse32.v v14, x2, x5 Aref 4
addi x2, x2, 4
vlse32.v v15, x2, x5 Aref 5
addi x2, x2, 364
vlse32.v v0, x3, x6 Bref 0
addi x3, x3, 4
vlse32.v v1, x3, x6 Bref 1
addi x3, x3, 4
vlse32.v v2, x3, x6 Bref 2
addi x3, x3, 4
vlse32.v v3, x3, x6 Bref 3
addi x3, x3, 4
vlse32.v v4, x3, x6 Bref 4
addi x3, x3, 4
vlse32.v v5, x3, x6 Bref 5
addi x3, x3, 4
vlse32.v v6, x3, x6 Bref 6
addi x3, x3, 4
vlse32.v v7, x3, x6 Bref 7
addi x3, x3, 4
vlse32.v v8, x3, x6 Bref 8
addi x3, x3, 540
vsse32.v v16, x4, x5 Cref 0
addi x4, x4, 4
vsse32.v v17, x4, x5 Cref 1
addi x4, x4, 4
vsse32.v v18, x4, x5 Cref 2
addi x4, x4, 4
vsse32.v v19, x4, x5 Cref 3
addi x4, x4, 4
vsse32.v v20, x4, x5 Cref 4
addi x4, x4, 4
vsse32.v v21, x4, x5 Cref 5
addi x4, x4, 364
99
Copyright© 2020 Andes Technology Corp.
Performance of Scalar-Style Codes
• Load 6 Arefs = 12*6 = 72 cycles // access 12 32B cache lines
• Load 9 Brefs = 17*9 = 153 cycles // access 17 32B cache lines
• Store 6 Crefs = 12*6= 54 cycles // access 12 32B cache lines
• Vector FP multiplications = 18 cycles // hidden by the store
• Total cycles per loop = 297 for 16 sets of matrices
• Total cycles per set of matrix = 19 cycles
• Number of vector registers used = 6+9+6 = 21 registers, efficiency = 66%
• Number of static instructions in the loop = 62 instructions
• Average instructions per set of matrices = 4 instructions
• It is 16X the performance of scalar code
100
Copyright© 2020 Andes Technology Corp.
Matrix Multiplication – Unit Load Coding Style
• VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements
– LMUL=4, 16x4 = 64 elements
– LMUL=8, 16x8 = 128 elements
• Number of Cref or Aref per matrix = 6 elements
− 10 sets of matrices are 60 elements which fit into VRF with LMUL=4
• Number of Bref per matrix = 9 elements
− 10 sets of matrices are 90 elements which fit into VRF with LMUL=8
101
Copyright© 2020 Andes Technology Corp.
• Unit load of 10 consecutive B matrices (10x9=90 elements)
− vrgather to set up 10 sets of 6 B elements = 2, 1, 0, 2, 1, 0 (yellow) in v16
− vslideup v16 to v20
− vrgather to set up 10 sets of 6 B elements = 5, 4, 3, 5, 4, 3 (blue) in v16
− vrgather to set up 10 sets of 6 B elements = 8, 7, 6, 8, 7, 6 (white) in v24
C A B
0 0 0
0 1 3
0 2 6
1 0 1
1 1 4
1 2 7
2 0 2
2 1 5
2 2 8
3 3 0
3 4 3
3 5 6
4 3 1
4 4 4
4 5 7
5 3 2
5 4 5
5 5 8
setvli t1, x5, e32, m8 vl=90, LMUL=8
vle32.v v8, x3 unit load - Bref, 90 elements
addi x3, x3, 360
setvli t1, x6, e32, m8 vl=60, LMUL=8
vsub.vi v0, v0, 6
vrgather.vv v16, v8, v0 60 elements of first Bref in v16
vadd.vi v0, v0, 3
vslideup.vi v16, v16, 64 Move first Bref to v20
vrgather.vv v16, v8, v0 60 elements of second Bref in v16
vadd.vi v0, v0, 3
vrgather.vv v24, v8, v0 60 elements of third Bref in v24
Unit Load of B elements
102
Copyright© 2020 Andes Technology Corp.
• Unit load of 10 consecutive A matrices (10x6=60 elements)
− vrgather to set up 10 sets of 6 A elements = 3, 3, 3, 0, 0, 0 (yellow) in v16
− vrgather to set up 10 sets of 6 A elements = 4, 4, 4, 1, 1, 1 (blue)
− vrgather to set up 10 sets of 6 A elements = 5, 5, 5, 2, 2, 2 (white)
• Multiplication and accumulate and unit store C elements to memory
C A B
0 0 0
0 1 3
0 2 6
1 0 1
1 1 4
1 2 7
2 0 2
2 1 5
2 2 8
3 3 0
3 4 3
3 5 6
4 3 1
4 4 4
4 5 7
5 3 2
5 4 5
5 5 8
setvli t1, x7, e32, m4 vl=60, LMUL=4
vle32.v v8, x2 unit load - Aref, 60 elements
addi x2, x2, 240
vrgather.vv v28, v8, v4 set up 60 elements of first Aref
vadd.vi v4, v4, 1
vfmul.vv v20, v20, v28
vrgather.vv v12, v8, v4 set up 60 elements, of second Aref
vadd.vi v4, v4, 1
vfmacc.vv v20, v16, v12
vrgather.vv v28, v8, v4 set up 60 elements of third Aref
vsub.vi v8, v8, 2
vfmacc.vv v20, v24, v28
vse32.v v20, x4 unit store - Cref, 60 elements
addi x4, x4, 240
bne
Unit Load of A elements
103
Copyright© 2020 Andes Technology Corp.
C A B
0 0 0
0 1 3
0 2 6
1 0 1
1 1 4
1 2 7
2 0 2
2 1 5
2 2 8
3 3 0
3 4 3
3 5 6
4 3 1
4 4 4
4 5 7
5 3 2
5 4 5
5 5 8
VRF
b8 31 a5
b7 30 a4
… 29 …
… 28 …
b3 27 a2
b2 26 a1
b1 25 a0
b0 24 a5
b8 23 a4
b7 22 a3
b6 21 a2
b5 20 a1
b4 19 a0
b3 18
b2 17
b1 16
b0 15 c5
b8 14 c4
b7 13 …
b6 12 …
b5 11 c2
b4 10 c1
b3 9 c0
b2 8 c5
b1 7 c4
b0 6 c3
5 c2
4 c1
3 c0
2
1
0
vrgather
vrgather
vslideup
Index for
B elements
Index for
A elements
vrgather
vrgather
Graphical View of Unit Load Codes
104
Copyright© 2020 Andes Technology Corp.
Performance of Unit Load Codes
• Load 60 Aref and 90 Bref elements, store 60 Cref elements = 28 cycles of 32B cache lines
o Vrgather for Aref = 8 cycles + 4 + 4 = 16 cycles
o Vrgather/vslideup for Bref = 13 cycles + 8 + 13 + 13 = 47 cycles
• Vector FP multiplications = 12 cycles
• Total cycles per loop = 103 for 10 sets of matrices
• Total cycles per set of matrix = 10 cycles (19 cycles for straight code)
• Number of static instructions in the loop = 27 instructions
• Average instructions per set of matrices = 2.7 instructions (4 instructions for straight code)
• Number of elements used in VLMAX, 60/64 and 90/128, efficiency is about 90%
• Unit load/store are much more efficient memory operation than stride load: accessing 28
instead of 41 cache lines. Shifting the complexity of stride load/store to vrgather
instructions
105
Copyright© 2020 Andes Technology Corp.
Larger Matrices – Scalar Style Coding
• Same VLEN, SEW=32b, VLMAX=16
– A matrix 6x3, 18 elements – requires 18 vector registers
– B matrix 3x6, 18 elements – requires 18 vector registers
– C matrix 6x6, 36 elements – requires 36 vector registers
– Compute 1 row of A and C elements at a time, 6 iterations to finish 16 sets of matrices
b0 b1 b2 b3 b4 b5 v0 v1 v2 v3 v4 v5
b6 b7 b8 b9 b10 b11 v6 v7 v8 v9 v10 v11
b12 b13 b14 b15 b16 b17 v12 v13 v14 v15 v16 v17
a0 a1 a2 c0 c1 c2 c3 c4 c5 v18 v19 v20 v21 v22 v23 v24 v25 v26
a3 a4 a5 c6 c7 c8 c9 c10 c11 v18 v19 v20 v21 v22 v23 v24 v25 v26
a6 a7 a8 c12 c13 c14 c15 c16 c17 v18 v19 v20 v21 v22 v23 v24 v25 v26
a9 a10 a11 c18 c19 c20 c21 c22 c23 v18 v19 v20 v21 v22 v23 v24 v25 v26
a12 a13 a14 c24 c25 c26 c27 c28 c29 v18 v19 v20 v21 v22 v23 v24 v25 v26
a15 a16 a17 c30 c31 c32 c33 c34 c35 v18 v19 v20 v21 v22 v23 v24 v25 v26
Matrix
Multiplication
Matrix
Multiplication
vector registers
106
Copyright© 2020 Andes Technology Corp.
Larger Matrices – Unit Load Codes
• VLEN=512b, SEW=32b,
VLMAX=16 elements
− LMUL=2, 32 elements, 2
vector registers are used for
each 18 elements of Aref and
Bref
− LMUL=4, 64 elements, 4
vector registers are used for 36
elements of Cref
− 1 set of matrices per loop
iteration
VRF
a17 31
a16 30
… 29
… 28
a8 27
a7 26
a6 25
a5 24
a4 23
a3 22
a2 21
a1 20
a0 19
18
b17 17
b16 16
… 15 c35
… 14 c34
b8 13 …
b7 12 …
b6 11 c8
b5 10 c7
b4 9 c6
b3 8 c5
b2 7 c4
b1 6 c3
b0 5 c2
4 c1
3 c0
2
1
0
vmul
and
vfmac
Index for
B elements
Index for
A elements
vrgather
vrgather
107
Copyright© 2020 Andes Technology Corp.
Performances of Larger Matrices
*Vector loads/stores are pipelined between iterations
**6x4 or 6x5 matrices are more efficient. 6x3 has 18 active elements with VLMAX=32
Unit loads/stores are much more efficient memory operation than stride load
Scalar style Unit load
Number of static instructions per loop 252 30
Total number of cycles 1259 22*
Number of cycles per set of matrices 79 22
Efficiency of register usages 81% 56%**
108
Copyright© 2020 Andes Technology Corp.
vsetvli t0, zero, e32, m8 // VLEN=512b, LMUL=8, SEW=32b, ELEN=4096b
vlb.v v0, t0;
add t0, t0, imm;
vadd.vx v0, v0, t1;
lfw f0, t2;
add t2, t2, imm;
vfcvt.f.x.v v0, v0;
vfmacc.vf v8, v0, f0;
bne t0, t3, imm
• Performance requirement: 16B of 8b INT every clock cycle
Loop of 8 instructions,
Load element = 8b INT, extension to 32b
Operation on FP 32b
RVV Work Group, Issue #362, Codes Based on RVV 0.8
109
Copyright© 2020 Andes Technology Corp.
Vector Processor Pipeline Example
VLEN=64B
Time 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
vadd v0, v0, t1 va va va
vadd v1, v1, t1 va va va
vadd v2, v2, t1 va va va
vadd v3, v3, t1 va va va
vadd v4, v4, t1 va va va
vadd v5, v5, t1 va va va
vadd v6, v6, t1 va va va
vadd v7, v7, t1 va va va
vfcvt v0, v0 fc fc fc fc fc fc
vfcvt v1, v1 fc fc fc fc fc fc
vfcvt v2, v2 fc fc fc fc fc fc
vfcvt v3, v3 fc fc fc fc fc fc
vfcvt v4, v4 fc fc fc fc fc
vfcvt v5, v5 fc fc fc fc
vfcvt v6, v6 fc fc fc fc
vfcvt v7, v7 fc fc fc fc
vfmacc v8, v0, f0 fm fm fm fm fm fm fm fm fm
vfmacc v9, v1, f0 fm fm fm fm fm fm fm fm
vfmacc v10, v2, f0 fm fm fm fm fm fm fm
vfmacc v11, v3, f0 fm fm fm fm fm fm
vfmacc v12, v4, f0 fm fm fm fm fm fm
vfmacc v13, v5, f0 fm fm fm fm fm fm
vfmacc v14, v6, f0 fm fm fm fm fm fm
vfmacc v15, v7, f0 fm fm fm fm fm fm
vlb v0, t0 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7
lfw f0, t2 f0 f0 f0
8 cycles
8 cycles 8 cycles
chaining
chaining
110
Copyright© 2020 Andes Technology Corp.
Vector Processor Performance
• Number of instructions and micro-ops in 8 cycles
– 3 vector arithmetic instructions = 24 micro-ops (vadd, vfcvt, vfmac)
– 1 FP load
– 3 scalar instructions + 1 vector load to data cache
– Performance at 1 GHz = 64 GOPS or 48 GFLOPS for SEW=32b
– Performance at 1 GHz = 128 GOPS or 96 GFLOPS for SEW=16b
• Unroll the loop twice in order to pipeline vector load instructions
– All 32 vector registers are used
– Performance improvement can be achieved with VLEN=1024b
111
Copyright© 2020 Andes Technology Corp.
vle8.v v6, t0;
add t0, t0, imm;
vsext.v4 v0, v6
vadd.vx v0, v0, t1;
lfw f0, t2;
add t2, t2, imm;
vfcvt.f.x.v v0, v0;
vfmacc.vf v8, v0, f0;
bne t0, t3, imm
• An extra instruction in a tight loop can degrade the performance by 12.5%
• Andes provides custom load instruction in order to achieve the required performance
• In addition, the vsext.v4 instruction needs an extra write port since all the write ports are
used
Loop of 9 instructions,
Load element = 8b INT,
Vector extension to 32b,
Operation on FP 32b
RVV Work Group, Issue #362, Codes Based on RVV 0.9
112
Copyright© 2020 Andes Technology Corp.
AndesCore™ NX27V
introduction
113
Copyright© 2020 Andes Technology Corp.
• Vector Processor is implemented as part of the RISC-V CPU
• Configurability: necessary for IP to attract wide range of customers, 128b to 512b
• Scalability: necessary for next generation
• Low power and area: necessary for embedded market
• Performance: out-of-order execution and design, 2*VLEN = double performance
• Simplicity: necessary for time to market, reducing verification time
• Modular Design: decoupling for independent design, verification, timing, and
clock gating
Andes NX27V Design Objectives
114
Copyright© 2020 Andes Technology Corp.
Overview of NX27V
• AndeStar V5 architecture:
– RV64GCN+ Andes V5 Extensions
– RV Vector extension (RVV)
• 5-stage pipeline, single-issue
– Optional branch prediction
• I/D caches
– Caches: 8KB to 64KB
– I$/D$ prefetch
– HW unaligned load/store accesses
– 16 non-blocking outstanding data accesses
• Wide data paths to feed VPU:
– Cached and uncached RVV load/stores
– Streaming Ports for ACE loads/stores
115
Copyright© 2020 Andes Technology Corp.
NX27V Pipeline
Fetch Decode Execute Memory Retire
Integer
Execution
Unit
Data
Cache
Exception
Handling
IFU GPR
V P U
Vector
scalar
V
I
Q
V/F/D
inst
Custom
Coprocessor/
Mem-Controller
command
data
A C E pipeline Execute
Streaming
Ports
Custom
Memory
AXI
116
Copyright© 2020 Andes Technology Corp.
NX27V Performance Gain
Functions Speedup1
F32 basic mathematical functions 19X
RGB CNN functions 18X
Depthwise CNN functions 23X
Pointwise CNN functions 21X
Relu CNN functions 69X
F32 filtering functions 19X
Q7 filtering functions 39X
F32 32x32x32 matrix multiplication 57X
1Compared to pure C scalar code compiled with high optimization; both vector
and scalar code ran on the NX27V FPGA with 512-bit VLEN, 256-bit bus.
117
Copyright© 2020 Andes Technology Corp.
Andes NX27V vs. Competitions
Andes NX27V CAxx CMxx
Architecture RVV/Andes VPU Popular SIMD Hxxx
Vector registers 32 32 8
Vector Length Up to 512b 128b 128b
SIMD width Up to 512b/cycle 128b/cycle 64b/cycle
LMUL Yes None None
Chaining Yes Not applicable Yes
Custom extension ACE No No
Streaming Ports Yes No No
118
Copyright© 2020 Andes Technology Corp.
Core/System Performance Comparison
CPU A64FX* NX27V**
Vector ISA ARM SVE RISC-V RVV
Technology (nm) 7 7
Core sustain Perf 16b (GFLOPS) ~56 96
Core Peak Perf 16b (GOPS) 230 320
# of cores 48 48**
System (TFLOPS) ~2.7 ~4.6
Memory BW (GB/s) 1024 3072
*Fujitsu presentation at Hot Chip 2018.
**Assume the same number of cores for comparison. 512-bit SIMD, 512-bit bus.
119
Copyright© 2020 Andes Technology Corp.
NX27V Floorplan
 Process: 7nm
 Frequency: 1GHz
(worst case)
 Size: ~0.3mm2
120
Copyright© 2020 Andes Technology Corp.
Development Tools
Standard tools in AndeSight™ IDE:
– AndeSim: Near-cycle accurate simulator
– Compiler (intrinsic functions and inline assembly)
– Assembler
– Debugger and ICE
– Computation library
Advanced tools:
– AI compiler support thru LLVM
– AndesClarity pipeline visualizer and analyzer
 Pipeline view of instructions and functional units
 Resource view corresponding to instruction usage
 Stall bubbles with different colors for data dependency
121
Copyright© 2020 Andes Technology Corp.
Clarity: NX27V Pipeline View
122
Copyright© 2020 Andes Technology Corp.
Summary
• Andes NX27V vector processor
– The world first commercial RISC-V vector processor, VLEN = 512b/256b are
available
– The 512-bit NX27V vector processor has been shipped to lead customer
– High performance in HPC and AI applications
– Flexible VPU configurations to enable a wide range of applications
– AndesClarity for performance optimization
– Work with customers on Andes Custom Extension to provide proprietary,
hardware acceleration, and marketing competitive
• Andes Technology will continue to expand vector product families to
meet the requirements of edge to cloud applications
Thank you!
#RISCVSUMMIT @risc_v

More Related Content

What's hot

AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Deepak Shankar
 
DPDK in Containers Hands-on Lab
DPDK in Containers Hands-on LabDPDK in Containers Hands-on Lab
DPDK in Containers Hands-on LabMichelle Holley
 
Basic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDLBasic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDLJoohan KIM
 
Riscv 20160507-patterson
Riscv 20160507-pattersonRiscv 20160507-patterson
Riscv 20160507-pattersonKrste Asanovic
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD
 
Soc architecture and design
Soc architecture and designSoc architecture and design
Soc architecture and designSatya Harish
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingDVClub
 
Online test program generator for RISC-V processors
Online test program generator for RISC-V processorsOnline test program generator for RISC-V processors
Online test program generator for RISC-V processorsRISC-V International
 
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUHot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUAMD
 
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion Systems
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion SystemsMIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion Systems
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion SystemsMIPI Alliance
 
The Future of Operating Systems on RISC-V
The Future of Operating Systems on RISC-VThe Future of Operating Systems on RISC-V
The Future of Operating Systems on RISC-VC4Media
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesAMD
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Deepak Shankar
 

What's hot (20)

AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
Fpga design flow
Fpga design flowFpga design flow
Fpga design flow
 
ARM Processor
ARM ProcessorARM Processor
ARM Processor
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
 
DPDK in Containers Hands-on Lab
DPDK in Containers Hands-on LabDPDK in Containers Hands-on Lab
DPDK in Containers Hands-on Lab
 
Basic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDLBasic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDL
 
RISC-V: The Open Era of Computing
RISC-V: The Open Era of ComputingRISC-V: The Open Era of Computing
RISC-V: The Open Era of Computing
 
Riscv 20160507-patterson
Riscv 20160507-pattersonRiscv 20160507-patterson
Riscv 20160507-patterson
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
 
Soc architecture and design
Soc architecture and designSoc architecture and design
Soc architecture and design
 
Introduction to RISC-V
Introduction to RISC-VIntroduction to RISC-V
Introduction to RISC-V
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference Modeling
 
RISC-V Online Tutor
RISC-V Online TutorRISC-V Online Tutor
RISC-V Online Tutor
 
Online test program generator for RISC-V processors
Online test program generator for RISC-V processorsOnline test program generator for RISC-V processors
Online test program generator for RISC-V processors
 
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APUHot Chips: AMD Next Gen 7nm Ryzen 4000 APU
Hot Chips: AMD Next Gen 7nm Ryzen 4000 APU
 
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion Systems
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion SystemsMIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion Systems
MIPI DevCon 2016: MIPI CSI-2 Application for Vision and Sensor Fusion Systems
 
The Future of Operating Systems on RISC-V
The Future of Operating Systems on RISC-VThe Future of Operating Systems on RISC-V
The Future of Operating Systems on RISC-V
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
 

Similar to Andes RISC-V vector extension demystified-tutorial

6 months/weeks training in Vlsi,jalandhar
6 months/weeks training in Vlsi,jalandhar6 months/weeks training in Vlsi,jalandhar
6 months/weeks training in Vlsi,jalandhardeepikakaler1
 
6 weeks/months summer training in vlsi,ludhiana
6 weeks/months summer training in vlsi,ludhiana6 weeks/months summer training in vlsi,ludhiana
6 weeks/months summer training in vlsi,ludhianadeepikakaler1
 
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDT
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDTEnea Freescale Webinar 12 Sep 2012 - Version 5-SDT
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDTMichael Christofferson
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
vlsi design summer training ppt
vlsi design summer training pptvlsi design summer training ppt
vlsi design summer training pptBhagwan Lal Teli
 
Ti k2 e for mission critical applications
Ti k2 e for mission critical applicationsTi k2 e for mission critical applications
Ti k2 e for mission critical applicationsHitesh Jani
 
MediaTek IoT power management webinar
MediaTek IoT power management webinarMediaTek IoT power management webinar
MediaTek IoT power management webinarMediaTek Labs
 
Onboarding and Orchestrating High Performing Networking Software
Onboarding and Orchestrating High Performing Networking SoftwareOnboarding and Orchestrating High Performing Networking Software
Onboarding and Orchestrating High Performing Networking SoftwareCloudify Community
 
Introduction to the new MediaTek LinkIt™ Development Platform for RTOS
Introduction to the new MediaTek LinkIt™ Development Platform for RTOSIntroduction to the new MediaTek LinkIt™ Development Platform for RTOS
Introduction to the new MediaTek LinkIt™ Development Platform for RTOSMediaTek Labs
 
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...Edge AI and Vision Alliance
 
Single instruction multiple data
Single instruction multiple dataSingle instruction multiple data
Single instruction multiple dataSyed Zaid Irshad
 
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112GDesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112GLeah Wilkinson
 
IoT and connected devices
IoT and connected devicesIoT and connected devices
IoT and connected devicesPascal Bodin
 
IoT support for .NET Core
IoT support for .NET CoreIoT support for .NET Core
IoT support for .NET CoreMirco Vanini
 
Vlsi final year project in jalandhar
Vlsi final year project in jalandharVlsi final year project in jalandhar
Vlsi final year project in jalandhardeepikakaler1
 
Vlsi final year project in ludhiana
Vlsi final year project in ludhianaVlsi final year project in ludhiana
Vlsi final year project in ludhianadeepikakaler1
 
Module 1 - ARM 32 Bit Microcontroller
Module 1 - ARM 32 Bit Microcontroller Module 1 - ARM 32 Bit Microcontroller
Module 1 - ARM 32 Bit Microcontroller Amogha Bandrikalli
 
IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)Mirco Vanini
 

Similar to Andes RISC-V vector extension demystified-tutorial (20)

6 months/weeks training in Vlsi,jalandhar
6 months/weeks training in Vlsi,jalandhar6 months/weeks training in Vlsi,jalandhar
6 months/weeks training in Vlsi,jalandhar
 
6 weeks/months summer training in vlsi,ludhiana
6 weeks/months summer training in vlsi,ludhiana6 weeks/months summer training in vlsi,ludhiana
6 weeks/months summer training in vlsi,ludhiana
 
Santhosh Resume
Santhosh ResumeSanthosh Resume
Santhosh Resume
 
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDT
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDTEnea Freescale Webinar 12 Sep 2012 - Version 5-SDT
Enea Freescale Webinar 12 Sep 2012 - Version 5-SDT
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
vlsi design summer training ppt
vlsi design summer training pptvlsi design summer training ppt
vlsi design summer training ppt
 
Ti k2 e for mission critical applications
Ti k2 e for mission critical applicationsTi k2 e for mission critical applications
Ti k2 e for mission critical applications
 
MediaTek IoT power management webinar
MediaTek IoT power management webinarMediaTek IoT power management webinar
MediaTek IoT power management webinar
 
Onboarding and Orchestrating High Performing Networking Software
Onboarding and Orchestrating High Performing Networking SoftwareOnboarding and Orchestrating High Performing Networking Software
Onboarding and Orchestrating High Performing Networking Software
 
Introduction to the new MediaTek LinkIt™ Development Platform for RTOS
Introduction to the new MediaTek LinkIt™ Development Platform for RTOSIntroduction to the new MediaTek LinkIt™ Development Platform for RTOS
Introduction to the new MediaTek LinkIt™ Development Platform for RTOS
 
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...
“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable T...
 
Single instruction multiple data
Single instruction multiple dataSingle instruction multiple data
Single instruction multiple data
 
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112GDesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
 
IoT and connected devices
IoT and connected devicesIoT and connected devices
IoT and connected devices
 
IoT support for .NET Core
IoT support for .NET CoreIoT support for .NET Core
IoT support for .NET Core
 
Vlsi final year project in jalandhar
Vlsi final year project in jalandharVlsi final year project in jalandhar
Vlsi final year project in jalandhar
 
Vlsi final year project in ludhiana
Vlsi final year project in ludhianaVlsi final year project in ludhiana
Vlsi final year project in ludhiana
 
Module 1 - ARM 32 Bit Microcontroller
Module 1 - ARM 32 Bit Microcontroller Module 1 - ARM 32 Bit Microcontroller
Module 1 - ARM 32 Bit Microcontroller
 
ADAM-3600 Sales kit_WATER.pptx
ADAM-3600 Sales kit_WATER.pptxADAM-3600 Sales kit_WATER.pptx
ADAM-3600 Sales kit_WATER.pptx
 
IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)
 

More from RISC-V International

London Open Source Meetup for RISC-V
London Open Source Meetup for RISC-VLondon Open Source Meetup for RISC-V
London Open Source Meetup for RISC-VRISC-V International
 
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...RISC-V International
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VRISC-V International
 
Standardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-VStandardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-VRISC-V International
 
Semi dynamics high bandwidth vector capable RISC-V cores
Semi dynamics high bandwidth vector capable RISC-V coresSemi dynamics high bandwidth vector capable RISC-V cores
Semi dynamics high bandwidth vector capable RISC-V coresRISC-V International
 
Reverse Engineering of Rocket Chip
Reverse Engineering of Rocket ChipReverse Engineering of Rocket Chip
Reverse Engineering of Rocket ChipRISC-V International
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V International
 
RISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_genRISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_genRISC-V International
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V International
 
RISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmwareRISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmwareRISC-V International
 
RISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notesRISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notesRISC-V International
 
RISC-V software state of the union
RISC-V software state of the unionRISC-V software state of the union
RISC-V software state of the unionRISC-V International
 
Ripes tracking computer architecture throught visual and interactive simula...
Ripes   tracking computer architecture throught visual and interactive simula...Ripes   tracking computer architecture throught visual and interactive simula...
Ripes tracking computer architecture throught visual and interactive simula...RISC-V International
 
Open source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeOpen source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeRISC-V International
 

More from RISC-V International (20)

WD RISC-V inliner work effort
WD RISC-V inliner work effortWD RISC-V inliner work effort
WD RISC-V inliner work effort
 
RISC-V Zce Extension
RISC-V Zce ExtensionRISC-V Zce Extension
RISC-V Zce Extension
 
London Open Source Meetup for RISC-V
London Open Source Meetup for RISC-VLondon Open Source Meetup for RISC-V
London Open Source Meetup for RISC-V
 
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
Standardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-VStandardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-V
 
Semi dynamics high bandwidth vector capable RISC-V cores
Semi dynamics high bandwidth vector capable RISC-V coresSemi dynamics high bandwidth vector capable RISC-V cores
Semi dynamics high bandwidth vector capable RISC-V cores
 
Security and functional safety
Security and functional safetySecurity and functional safety
Security and functional safety
 
Reverse Engineering of Rocket Chip
Reverse Engineering of Rocket ChipReverse Engineering of Rocket Chip
Reverse Engineering of Rocket Chip
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor Family
 
RISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_genRISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_gen
 
RISC-V 30908 patra
RISC-V 30908 patraRISC-V 30908 patra
RISC-V 30908 patra
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentor
 
RISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmwareRISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmware
 
RISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notesRISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notes
 
RISC-V software state of the union
RISC-V software state of the unionRISC-V software state of the union
RISC-V software state of the union
 
Ripes tracking computer architecture throught visual and interactive simula...
Ripes   tracking computer architecture throught visual and interactive simula...Ripes   tracking computer architecture throught visual and interactive simula...
Ripes tracking computer architecture throught visual and interactive simula...
 
Porting tock to open titan
Porting tock to open titanPorting tock to open titan
Porting tock to open titan
 
Open j9 jdk on RISC-V
Open j9 jdk on RISC-VOpen j9 jdk on RISC-V
Open j9 jdk on RISC-V
 
Open source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeOpen source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process node
 

Recently uploaded

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Andes RISC-V vector extension demystified-tutorial

  • 1. RISC-V Vector Extension Demystified December 10, 2020 Thang Tran, Ph.D. Principal Engineer
  • 2. 2 Copyright© 2020 Andes Technology Corp. Andes Corporate Overview • Core R&D from AMD, DEC, Intel, MIPS, nVidia, and Sun Silicon Valley Tie 15-Year CPU IP Company • IPO in 2017; HQ in Taiwan • AndeStar™ V1-V3, V5 (RISC-V) >1 Bn Annual Run Rate of Andes-Embedded SoC • ~300 customers in TW, CN, US, EU, JP, KR Founding Premier Member of RISC-V International • Chairing Task Groups • Contributing to GNU, LLVM, uBoot, glibc, Linux, etc.
  • 3. 3 Copyright© 2020 Andes Technology Corp. V5 Adoptions: From MCU to Datacenters • Edge to Cloud: − ADAS − AIoT − Blockchain − FPGA − MCU − Multimedia − Security − Wireless (BT/WiFi) • 1 to 1000+ core • 40nm to 7nm • Many in AI − Datacenter AI accelerators − SSD: enterprise (& consumer) − 5G macro/small cells 5G Macro
  • 4. 4 Copyright© 2020 Andes Technology Corp. Agenda • Andes overview • Vector technology background – SIMD/vector concept – Vector processor basic • RISC-V V extension ISA – Basic – CSR – Memory operations – Compute instruction format • Sample codes – Matrix multiplication – Loads with RVV versions 0.8 and 1.0 • AndesCore™ NX27V introduction • Summary
  • 5. 5 Copyright© 2020 Andes Technology Corp. • ISA: Instruction Set Architecture • GOPS: Giga Operations Per Second • GFLOPS: Giga Floating-Point OPS • XRF: Integer register file • FRF: Floating-point register file • VRF: Vector register file • SIMD: Single Instruction Multiple Data • MMX: Multi Media Extension • SSE: Streaming SIMD Extension • AVX: Advanced Vector Extension • Configurable: parameters are fixed at built time, i.e. cache size • Extensible: added instructions to ISA includes custom instructions to be added by customer • Standard extension: the reserved codes in the ISA for special purposes, i.e. FP, DSP, … • Programmable: parameters can be dynamically changed in the program • ACE: Andes Custom Extension • CSR: Control and Status Register • SEW: Element Width (8-64) • ELEN: Largest Element Width (32 or 64) • XLEN: Scalar register length in bits (64) • FLEN: FP register length in bits (16-64) • VLEN: Vector register length in bits (128-512) • LMUL: Register grouping multiple (1/8-8) • EMUL: Effective LMUL • VLMAX/MVL: Vector Length Max • AVL/VL: Application Vector Length Terminology
  • 6. 6 Copyright© 2020 Andes Technology Corp. RISC-V & RVV Background • An open processor architecture started by UC Berkeley – Compact, modular, extensible • RISC-V International (formerly RISC-V Foundation) – RISC-V Foundation formed in 2015 to govern its growth – 750+ members including industry and research institute/university • RISC-V Vector Extension – Defined in a Foundation Task Group – Vector instruction set with scalable vector – Scalable data sizes with 2x data expansion arithmetic – Over 300+ vector instructions, including load/store, integer, fixed- point/floating-point operations
  • 7. 7 Copyright© 2020 Andes Technology Corp. RISC-V V Spec Status • Version 1.0 (draft) was published in July 2020 • Note from the spec: “This is a draft of the stable proposal for the vector extension specification to be used for implementation and evaluation. Once the draft label is removed, version 1.0 is intended to be sent out for public review as part of the RISC-V International ratification process. Version 1.0 is also considered stable enough to begin developing toolchains, functional simulators, and initial implementations, including in upstream software projects, and is not expected to have major functionality changes except if serious issues are discovered during ratification. Once ratified, the spec will be given version 2.0”. • Version 1.0 is targeted for end of 2020 and ready for ratifying RISC-V V Spec Status
  • 8. 8 Copyright© 2020 Andes Technology Corp. Vector technology background
  • 9. 9 Copyright© 2020 Andes Technology Corp. • Instruction Stream − Sequence of Instructions read from memory • Data Stream − Operations performed on the data in the processor • SIMD is the most interesting which is the basic for: − MMX, SSE, SSE2-5, AVX − Altivec − Neon Single Multiple Single SISD SIMD Multiple MISD MIMD Number of Instruction Streams Number of Data Streams Flynn’s Classification Microprocessors DSP Graphic Vector
  • 10. 10 Copyright© 2020 Andes Technology Corp. Why SIMD? • Single 32-bit Microprocessor – One 32-bit element per instruction • 512-bit SIMD processor – 16 32-bit elements per instruction – 32 16-bit elements per instruction – 64 8-bit elements per instruction. If the SIMD processor clock frequency is at 1 GHz, then the performance is 64 GOPS • Increasing the SIMD size, the higher the performance • SIMD is the basic for Vector Processor – N operations per instruction where the elements in the operations are independent from each other
  • 11. 11 Copyright© 2020 Andes Technology Corp. Instruction stream LD VR <-- A[3:0] LD0 LD1 LD2 LD3 LD0 ADD VR <-- VR, 1 ADD0 ADD1 ADD2 ADD3 LD1 ADD0 MUL VR <-- VR, 2 MUL0 MUL1 MUL2 MUL3 LD2 ADD1 MUL0 ST A[3:0] <-- VR ST0 ST1 ST2 ST3 LD3 ADD2 MUL1 ST0 ADD3 MUL2 ST1 Time MUL3 ST2 ST3 Different ops at same time Space VECTOR PROCESSOR LD ADD MUL ST Same ops at same space PE0 PE1 PE3 PE2 Space ARRAY PROCESSOR Same ops at same time Different ops at same space Multiple loads are handling by hardware Parallel Processing • Time-space duality: – Array processor: Instruction operates on multiple data elements at the same time – multi-thread, multi-core – Vector processor: Instruction operates on multiple data elements in consecutive time steps – pipeline of SIMD execution
  • 12. 12 Copyright© 2020 Andes Technology Corp. Intel MMX, SSE, SSE2, AVX • MMX (1997) adds 57 new integer instructions, 64-bit registers – 8 of 8-bit, 4 of 16-bit, 2 of 32-bit, or 1 of 64-bit (reuse 8FP registers – FP and MMX cannot mix) – Number of elements is defined in the opcode, only consecutive memory load/store is allowed – Move, Add, Subtract, Shift, Logical, Mult, MACC, Compare, Pack/Unpack • SSE (1999) adds 70 new instructions for single precision FP, 2 of 32-bit • SSE2 (2001) adds 144 new instructions for FP, 4 of 32-bit, 128-bit registers • SSE3 (2004), SSE4 (2006), SSE5 (2007) add more new instructions • AVX (2008) adds new instructions for 256-bit registers • AVX2 (2013) adds more new instructions • AVX-512 (2015) adds new instructions for 512-bit registers • Issue: – Each register length has a new set of instructions – Many instructions for the 512-bit SIMD
  • 13. 13 Copyright© 2020 Andes Technology Corp. v31 f31 r31 v31 f31 r31 One Lane. Usually less Lanes than Vector Elements (Lanes/VFUs timeshared) A Vector Functional Unit (VFU). Multiple VFUs per Lane can operate in parallel A Vector Element Parallelism achieved by: • Multiple Lanes • Multiple VFUs per Lane (e.g., L/S, +, x) • Individual VFUs are pipelined RISC-V CPU Vector Meta Data VLDS VLEN Vector Extension Processor Overview VLEN >= SIMD Newell, R., “RISC-V Vector Extension Proposal Snapshot,” Microsemi, Microchip Technology Inc., 2018
  • 14. 14 Copyright© 2020 Andes Technology Corp. VLSU Vector Register File Memory MULT ADD chain chain Vector Instruction Chaining • Independent functional units are needed – Partial data of an instruction can be forwarded and operations are scheduled to be executed in parallel – Data chaining can happen only if the instructions operate on multiple SIMD widths • VLEN=SIMD width, single instruction/cycle – The operation is on multiple elements but it is basically pipelining, not chaining – The instructions must be set for a vector length that is greater then the SIMD width in order for chaining to happen
  • 15. 15 Copyright© 2020 Andes Technology Corp. Chaining – Vector Unit Structure • Four lanes, 4 functional units to fetch/execute independent data 16 elements, 4 lanes: - Elements 0-3 - Elements 4-7 - Elements 8-11 - Elements 12-15 Functional Unit
  • 16. 16 Copyright© 2020 Andes Technology Corp. Cycles 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 vld Ia0 Ia1 Ia2 Ia3 Ia4 Ia5 Ia6 Ia7 1 vadd la8 la9 la10 la11 la12 la13 la14 la15 Ib0 Ib1 Ib2 Ib3 Ib4 Ib5 Ib6 Ib7 2 vmult la16 la17 la18 la19 la20 la21 la22 la23 lb8 lb9 lb10 lb11 lb12 lb13 lb14 lb15 Ic0 Ic1 Ic2 Ic3 Ic4 Ic5 Ic6 Ic7 3 add la24 la25 la26 la27 la28 la29 la30 la31 lb16 lb17 lb18 lb19 lb20 lb21 lb22 lb23 lc8 lc9 lc10 lc11 lc12 lc13 lc14 lc15 4 vld Ie0 Ie1 Ie2 Ie3 Ie4 Ie5 Ie6 Ie7 lb24 lb25 lb26 lb27 lb28 lb29 lb30 lb31 lc16 lc17 lc18 lc19 lc20 lc21 lc22 lc23 5 vadd le8 le9 le10 le11 le12 le13 le14 le15 If0 If1 If2 If3 If4 If5 If6 If7 lc24 lc25 lc26 lc27 lc28 lc29 lc30 lc31 6 vmult le16 le17 le18 le19 le20 le12 le22 le23 lf8 lf9 lf10 lf11 lf12 lf13 lf14 lf15 Ig0 Ig1 Ig2 Ig3 Ig4 Ig5 Ig6 Ig7 7 add le24 le25 le26 le27 le28 le29 le30 le31 lf16 lf17 lf18 lf19 lf20 lf21 lf22 lf23 lg8 lg9 lg10 lg11 lg12 lg13 lg14 lg15 8 lf24 lf25 lf26 lf27 lf28 lf29 lf30 lf31 lg16 lg17 lg18 lg19 lg20 lg21 lg22 lg23 9 lg24 lg25 lg26 lg27 lg28 lg29 lg30 lg31 8 lanes, 8 LS units 8 lanes, 8 vector MUL units 8 lanes, 8 vector ADD units Chaining of 3 vector operations Chaining • 32 elements, 8 lanes = 8 units to fetch/execute 8 independent elements – vld: fetch data 32 elements in 4 clock cycles, 8 elements per cycle – Vector add/mult: each vector instruction passes through the functional units 4 times
  • 17. 17 Copyright© 2020 Andes Technology Corp. RAW Dependencies vectors scalar Vectorize Loop Iterations • X and Y are vectors, a is scalar, n is a large number • For N=64, dynamic instructions – Processor = 64x9 + 2 = 578: – Vector = 6 (5 vectors and 1 scalar) Processor/SIMD Code C Code Vector Code Operate on 64 elements From: Patterson, D., CS252, UCB
  • 18. 18 Copyright© 2020 Andes Technology Corp. Advantages of Vector Architecture • From Spec92fp* – Static Instructions: scalar/vector = 4.5X – Dynamic Instructions: scalar/vector = 23.8X – Operations: scalar/vector = 1.4X (the extra operations are loop overheads) • Most efficient way for data parallel processing – High computation throughput (operate on multiple elements at the same time) – Require high memory bandwidth, not low latency, data operation is chaining. The data memory is interleaving multiple banks for higher memory bandwidth • Orthogonal to microprocessor architecture – Scalar or superscalar microprocessor with vector unit, e.g. Cray X1, Alpha Tarantula, Andes NX27V – Parallel vector processors, e.g. NEC Earth Simulator, Broadcom Calisto *Quintana, F., et.al., “An ISA comparison between Superscalar and Vector Processors,” U. of Las Palmas de Gran Canaria
  • 19. 19 Copyright© 2020 Andes Technology Corp. Example: Cray 1 • Scalar and vector mode • 8 64-element vector registers – SEW = 64-bit – VLEN = 64x64 = 4096-bit • 8 64-bit scalar registers • 8 24-bit address registers • 16 memory banks – Bank latency = 11 cycles – Sustain 16 parallel accesses to different banks “Cray-1 Computer System Hardware Reference Manual,”1977, Cray Research, Inc.
  • 20. 20 Copyright© 2020 Andes Technology Corp. 3rd Iteration VLMAX elements 4th Iteration VLMAX elements 1st Iteration - odd size MOD(VLMAX) piece 2nd Iteration VLMAX elements Strip Mining • When the size of vector > VLMAX – size MOD(VLMAX) = the number of elements in the first iteration – n/VLMAX = number of iterations for each VLMAX time • Example size = 200 – 200 MOD(64) = 8 elements in first iteration – 200/64 = 3 iterations of 64 elements • Strip Mining is assumed to be done in software similar to loop operation for VLMAX – vl or vector mask register is used for the first iteration
  • 21. 21 Copyright© 2020 Andes Technology Corp. Technical Challenges of Vector Architecture • Complexity of vector register file (VRF) with VLEN=512-bit – Complexity of handling read and write ports with multiple lanes are much more expensive in VRF in comparison to scalar or FP register files – Performance is related to the number of read and write ports – The number of functional units are also limited by the number of read/write ports. Stalling of writing back to VRF means additional 512-bit buffers and back pressure to previous pipeline stages • Microprocessor methods of handling out-of-order execution and retirement become expensive for vector processor – Register renaming added several vector registers of 512-bit – Re-Order Buffer has a retire queue of 512-bit for vector registers • Expensive to implement precise exception and interrupt latency – In AI or embedded market, the precise exception is less important • Vector loads/stores are expensive, especially for stride and index • Andes NX27V vector processor simplifies many of the above issues
  • 22. 22 Copyright© 2020 Andes Technology Corp. Applications of Vector Processing • Scientific computations (astrophysics, atmospheric, ocean modeling, bioinformatics, physics, biomolecular simulation, chemistry, fluid dynamics, material sciences, quantum chemistry) • Engineering computations (computer vision and image, data mining and data intensive computing, CAD/CAM engineering analysis, military applications, vlsi design) • Multimedia Processing (compress., graphics, audio synth, image proc.) • Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort) • Artificial Intelligent • Lossy Compression (JPEG, MPEG video and audio) • Lossless Compression (Zero removal, RLE, Differencing, LZW) • Cryptography (RSA, DES/IDEA, SHA/MD5) • Speech and handwriting recognition • Operating systems/Networking (memcpy, memset, parity, checksum) • Databases (hash/join, data mining, image/video serving) • Language run-time support (stdlib, garbage collection) Source: Rochester Institute of Technology, “Introduction to Vector Processing,” Shaaban, EECC722
  • 23. 23 Copyright© 2020 Andes Technology Corp. Vectorizing Media Processing • Kernel − Matrix transpose/multiply − DCT (video, communication) − FFT (audio) − Motion estimation (video) − Gamma correction (video) − Haar transform (media mining) − Median filter (image processing) − Separable convolution (img. proc.) • Vector length − # vertices at once − image width − 256-1024 − image width, iw/16 − image width − image width − image width − image width (http://www.research.ibm.com/people/p/pradeep/tutor.html)
  • 24. 24 Copyright© 2020 Andes Technology Corp. RISC-V V Extension ISA Basic
  • 25. 25 Copyright© 2020 Andes Technology Corp. Vector Register ISA • Vector-Register ISA Definition: − All vector operations are between vector registers (except for load and store). This is in contrast with Vector-Memory ISA where all vector operations are on data memory − Vector load-store unit fetches and stores the memory data to/from vector registers • Basic requirements − Each vector data register hold N M-bit values  SEW – single vector element in bits, M-bit, power of 2  VLEN – number of bits in a vector register, N*M-bit, power of 2, VLEN > ELEN  VLMAX – number of vector elements, N, in the vector register, VLEN/SEW  Elements are independently executed (mostly) − Vector control registers: VMASK (mask – predication – valid element for execution) is v0 − Vector instructions  Arithmetic, logical, shift, multiply, divide, mask, permute, floating-point functional units  Vector load/store unit: load/store with stride, load/store with index, segment load/store
  • 26. 26 Copyright© 2020 Andes Technology Corp. Vector Processing ISA Basic Scalar Scalar FP ALU Mult s.int 8 .vv Macc u.int 16 .vx masked Mask fp 32 .vi unmasked Permute 64 Divide 4 8 load s.int 8 16 masked store u.int 16 32 unmasked 32 64 segment 64 Vector registers Standard scalar RISC-V ISA Saturate Overflow constant indexed 32x(128b, 256b, 512b) Standard floating-point RISC-V ISA Vector ALU Vector Memory Andes Vector processor: VLEN=512b, SEW=8,16,32,64b, elements=64,32,16,8 These numbers are used for the rest of the presentation
  • 27. 27 Copyright© 2020 Andes Technology Corp. Notations • Register references – vs1, vs2, vd – source and destination vector registers, VRF – rs1, rs2, rd – source and destination integer registers, XRF – fs1, fs2, fd – source and destination floating point registers, FRF • Register numbers – v0, v1, …, v30, v31 – vector register numbers (addresses, indices) – x0, x1, …, x30, x31 – integer register numbers (addresses, indices) – f0, f1, …, f30, f31 – floating-point register numbers (addresses, indices) – x0 is not real register; source register data = 0, destination register = no write back – Vector mask and carry bits use v0
  • 28. 28 Copyright© 2020 Andes Technology Corp. FP MAC unit 0 MVL Element FP misc unit 1 MVL Element 2 MVL Element FP divide unit 3 MVL Element … … Integer ALU … … … … Integer MAC unit … … R0 30 MVL Element R1 Integer divide unit 31 MVL Element R2 R3 Vector Mask Unit … Vector Length … Vector Permute Unit Vector Mask R28 R29 R30 R31 Memory multi banks Vector Load/Store Vector Registers Scalar Registers Vector Processor Overview - Block Diagram
  • 29. 29 Copyright© 2020 Andes Technology Corp. FP MAC unit 0 MVL Element FP misc unit 1 MVL Element 2 MVL Element FP divide unit 3 MVL Element … … Integer ALU … … … … Integer MAC unit … … R0 30 MVL Element R1 Integer divide unit 31 MVL Element R2 R3 Vector Mask Unit … Vector Length … Vector Permute Unit Vector Mask R28 R29 R30 R31 Memory multi banks Vector Load/Store Vector Registers Scalar Registers vector functional units Many types of memories can be implemented VRF read/write ports control Data dependency is handled by scoreboard Execution queues for out-of-order execution Vector Processor Microarchitecture Features
  • 30. 30 Copyright© 2020 Andes Technology Corp. RISC-V V Extension ISA CSR
  • 31. 31 Copyright© 2020 Andes Technology Corp. Vector CSR Registers Address Privilege Name Description 0x008 URW vstart Vector start position 0x009 URW vxsat Fixed-Point Saturate Flag 0x00A URW vxrm Fixed-Point Rounding Mode 0x00F URW vcsr Vector control and status register 0xC20 URO* vl Vector length 0xC21 URO* vtype Vector data type register 0xC22 URO vlenb VLEN/8 (vector register length in bytes) – configuration register *written by vsetvl(i) instruction, not by CSR access instructions
  • 32. 32 Copyright© 2020 Andes Technology Corp. CSR – Vector Type Register (vtype) • vlmul[2:0] – grouping of vector registers, 2x, 4x, 8x, or fraction of vector register, 1/2, 1/4, 1/8 – 000: no grouping, 32 vector registers, VLMAX = VLEN/SEW – 001: 16 vector registers, effective VLEN is 2X, twice number of elements – 010: 8 vector registers, effective VLEN is 4X, quadruple number of elements – 011: 4 vector registers, effective VLEN is 8X, octuple number of elements – 100: Reserved – 101: Fractional vector register, effective VLEN is 1/8, one-eighth number of elements – 110: Fractional vector register, effective VLEN is 1/4, quarter number of elements – 111: Fractional vector register, effective VLEN is 1/2, half number of elements • vsew[2:0] – vector standard element width, VLMAX = VLEN/VSEW – 000 – 111: 8, 16, 32, 64, 128, 256, 512, 1024 • vta – vector tail agnostic, implementation dependent, should the data is unknown or undisturbed • vma – vector mask agnostic, implementation dependent, should the data is unknown or undisturbed • vill – illegal value in the vsetvl(i) instruction, any subsequent vector instruction is illegal instruction 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vill vma vta vlmul vsew reserved
  • 33. 33 Copyright© 2020 Andes Technology Corp. LMUL=1 LMUL=4 v31 v28 v30 v24 v29 … v28 v4 … LMUL=2 v0 … v30 … v28 … … LMUL=8 v3 … v24 v2 … v16 v1 v2 v8 v0 v0 v0 LMUL – Registers Grouping • LMUL=2, instructions with odd register are illegal instructions • LMUL=4, vector registers are incremented by 4, else illegal instructions • LMUL=8, only v0, v8, v16, and v24 are valid vector registers
  • 34. 34 Copyright© 2020 Andes Technology Corp. elements 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 LMUL=1 LMUL=1/2 LMUL=1/4 LMUL=1/8 LMUL – Fractional Registers • LMUL=1/2, load data into half vector register, then perform vector double operations to expand data to the whole register • LMUL=1/4, vector quad instructions to expand data to the whole register • LMUL=1/8, using vector extension of 8x to expand to the whole register
  • 35. 35 Copyright© 2020 Andes Technology Corp. R31 F31 V31 . . . . . . . . . R0 F0 V0 [VLMAX-1] … [2] [1] [0] FP Register Vector Registers Vector Length Register VLEN Scalar Register Vector Register Programming Model • 32 vector registers – 5-bit • VLEN = The width of each vector register[i] in power of 2 • Version 0.7: FPU cannot shares the vector registers
  • 36. 36 Copyright© 2020 Andes Technology Corp. R31 F31 V31 . . . . . . . . . R0 F0 V0 [VLMAX-1] … [2] [1] [0] FP Register Vector Registers Vector Length Register VLEN Scalar Register 15 14 … … 2 1 0 element # VLEN=512b, SEW=32b Vector Register 32-bit Elements
  • 37. 37 Copyright© 2020 Andes Technology Corp. R31 F31 V31 . . . . . . . . . R0 F0 V0 [VLMAX-1] … [2] [1] [0] FP Register Vector Registers Vector Length Register VLEN Scalar Register 31 30 … … 2 1 0 element # VLEN=512b, SEW=16b Vector Register 16-bit Elements
  • 38. 38 Copyright© 2020 Andes Technology Corp. R31 F31 V31 . . . . . . . . . R0 F0 V0 [VLMAX-1] … [2] [1] [0] FP Register Vector Registers Vector Length Register VLEN Scalar Register 63 62 … … 2 1 0 element # VLEN=512b, SEW=8b Vector Register 8-bit Elements
  • 39. 39 Copyright© 2020 Andes Technology Corp. Vm*n 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Vm*n+1 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 Vm*n Vm*n+1 Vm*n Vm*n+1 Vm*n Vm*n+1 SEW=8 SEW=16 SEW=32 SEW=64 8 15 14 13 12 11 10 9 0 1 2 3 7 6 5 4 30 29 28 27 26 31 25 61 60 59 58 57 24 23 63 62 56 55 15 14 13 12 11 10 9 8 31 30 29 28 27 26 25 24 21 20 19 18 17 16 54 53 52 51 50 49 48 22 5 4 3 2 1 0 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 15 14 6 13 12 11 10 9 8 7 7 6 5 4 3 2 1 0 23 22 21 20 19 18 17 16 Element Numbering, LMUL=2 • VLEN=SLEN is now mandatory in the RVV version 1.0 – Straight forward numbering, increment sequentially – The numbering is important for mask and permutation operations
  • 40. 40 Copyright© 2020 Andes Technology Corp. Vector Mask – Conditional or Predicate Execution • v0 is used for vector mask, single mask for all LMUL values – 1 mask bit for each element • 0 – mask (mask-off), no write to destination vector elements, cannot have exception • 1 – unmask – The mask is enabled for each vector instruction with inst[25]=0, inst[25]=1 is unmask • vop.v v8, v16, 3, v0.t // mask is enable, each element writes back with v0_mask[i]=1 • vop.v v8, v16, 3 // unmask, all elements write result data back to VRF • Note: new vector mask load/store maybe added SEW=8 SEW=16 SEW=32 7 6 5 4 3 2 1 0 SEW=64 maskbitsforv7 maskbitsforv6 maskbitsforv5 maskbitsforv4 maskbitsforv3 maskv7 maskv6 v0vectormaskregister maskbitsforv2 maskv5 maskv4 maskbitsforv1 maskbitsforv0 maskv3 maskv2 maskv1 maskv0 mv7 mv6 mv5 mv4 mv3 mv2 mv1 mv0 Unused Unused Unused LMUL=8 LMUL=4 LMUL=2
  • 41. 41 Copyright© 2020 Andes Technology Corp. LMUL=8 and Chaining Example Cycles: 1 2 3 4 5 6 7 8 9 10 vmul v0, v8, v16 X1 X2 v1, v9, v17 X1 X2 ……… … … … … … v7, v15, v23 X1 X2 vadd v24, v0, v8 X1 v25, v1, v9 X1 ……… … … … … … … … Vector length: VLEN * LMUL • LMUL=1  512 bits • LMUL=8  4096 bits V0~V7 group 1 V8~V15 group 2 V16~V23 group 3 V24~V31 group 4 Xn: execution LMUL Chaining
  • 42. 42 Copyright© 2020 Andes Technology Corp. elements 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vl=VLMAX vl=21 CSR - Vector Length Register (vl) • vl is the number of elements for vector operations – Updated by vsetvl(i) instruction along with the vtype CSR • vl is set to VLMAX if the set value is greater than VLMAX • The default vl is VLMAX • If x0 is used to set vl, then vl is set to VLMAX – Force to an element by fault-only-first load instruction – vl=0, then no elements are updated in vector operations 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 reserved vl[VLMAX*8-1:0]
  • 43. 43 Copyright© 2020 Andes Technology Corp. CSR Vector Instruction – vsetvli/vsetvl • Two vector CSRs are updated by single instruction – vsetvli rd, rs1, vtypei // vl=rs1 (which is also AVL, application vector length), rd=AVL – vsetvl rd, rs1, rs2 // rd = avl • CSR updates: – csr.vtype = vtypei or rs2 • e8, e16, e32, e64, … setting the SEW size • {m1}, m2, m4, m8, mf2, mf4, mf8 – setting the LMUL, default is m1 if m value is not specified • {tu}, ta – setting tail elements, default is tu if ta is not specified • {mu}, ma – setting masked-off elements, default is mu if ma is not specified • csr.vl = rs1 – rd=rs1=x0 no update for csr.vl – rd!=x0, rs1=x0 vl=VLMAX, x[rd]=vl – rs1!=x0 vl=x[rs1], x[rd]=vl – x[rs1] > VLMAX vl=VLMAX, x[rd]=vl
  • 44. 44 Copyright© 2020 Andes Technology Corp. Sample Codes for LMUL and SEW Manipulation • VLEN=512b, e8 to e32 elements for 16 elements vsetvli t0, a0, e8, mf4 // set SEW=8b, LMUL=1/4, VLMAX=64/4=16 elements vle8.v v8, (a2) // load 16 8b elements vsext.vf4 v8, v8 // Signed extended 16 8b elements to 32b elements vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=16 (VLMAX) vadd.vx v8, v8, t1 // 32-bit vector add for 16 32b elements … • Notes: − Write and read to CSRs are normally serial events but for VPU a speculative value of LMUL, SEW, and vl are used for subsequent instructions until the values are changed again sew=8b lmul=1/4 sew=32b lmul=1 vl=16 sew=8
  • 45. 45 Copyright© 2020 Andes Technology Corp. Optimized codes for Load and Signed Extension • VLEN=512b, e8 to e32 elements for 16 elements vsetvli t0, zero, e32 // set SEW=32b, LMUL=1, vl=512/32=16 elements vle8.v v8, (a2) // load 16 8b elements, vle16.v or vle32.v is always 16 elements vsext.vf4 v8, v8 // Signed extended 16 8b elements to 32b elements vadd.vx v8, v8, t1 // 32-bit vector add for 16 32b elements … • Notes: − In earlier version, 0.8, of RVV ISA, the vle8.v and vsext.vf4 instructions are replace by a single instruction, (the ACE load instruction can still do this load): vlb.v v8, (a2), // load 8b elements and signed extended to 32b elements, still 16 elements
  • 46. 46 Copyright© 2020 Andes Technology Corp. Sample Code for Large Data Set • VLEN=512b, e8 to e32 elements for 128 elements vsetvli t0, zero, e32, m8 // set SEW=32b, LMUL=8, vl=512b*8/32b=128 elements vle8.v v6, (a2) // load 128 8b elements into 2 vector registers, v6-v7 vsext.vf4 v0, v6 // Signed extended 128 8b elements to 32b elements vadd.vx v0, v0, t1 // 32-bit vector add for 128 elements … • Notes: – vle8.v: load data into 2 vector registers, even register number or illegal instruction – vsext.vf4: the source operands are overlapped with destination operands • If source operand is v0, v2, v4, then it is illegal instruction • v6 and v7 are legal instructions because the vector registers are read before written by the result data
  • 47. 47 Copyright© 2020 Andes Technology Corp. CSR - Vector Byte Length • Vector register length in bytes, vlenb = VLEN/8 – Design time constant, vlenb = 512/8 = 64 – Read-only CSR, constant value, hardwired – A configuration CSR 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vlenb = 64 reserved
  • 48. 48 Copyright© 2020 Andes Technology Corp. 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vstart[VLMAX*8-1:0] reserved CSR – Vector Start Index Register (vstart) • Initialize to zero • vstart is the index of element that causes the trap on vector instruction • When return from trap, the vector execution restarts from the vstart element and reset the vstart after vector instruction execution • vstart > vl, meaning that no element operation and vstart is reset to zero For Andes vector processor, only load/store instruction can cause trap and interrupt are taken at instruction boundary. A non-zero value for vstart for any other vector instruction will cause illegal instruction.
  • 49. 49 Copyright© 2020 Andes Technology Corp. Valid Vector Range Behavior for destination element i 0 < i < vstart Unchanged vstart <= i < vl Updated if v0.mask[i] enabled, unchanged if not mta exception vl <= i < VLMAX Unchanged, vta exception VLMAX-1 vl-1 Masked elements vstart 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Elements valid elements for operation
  • 50. 50 Copyright© 2020 Andes Technology Corp. • vxsat – Vector Fixed-Point Saturation Flag Register – single bit to indicate that saturated value was written to destination vector register • vxrm – Vector Fixed-Point Rounding Mode Register: − 00: rnu – round to nearest up (add +0.5 to LSB) − 01: rne – round to nearest even − 10: rdn – round down (truncate) − 11: rod – round to odd (OR bits into LSB, “jam”) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 sat vxrm reserved 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 sat reserved 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vxrm reserved • vcsr – merging of vxsat and vxrm into single CSR, explicitly reset by software CSR – Vector Execution Fixed Point Register
  • 51. 51 Copyright© 2020 Andes Technology Corp. CSR – Vector Execution Floating-Point Register • The vector processor uses the fflags and frm from the FPU – fflags[4:0]: floating-point accrued exception flags – frm[2:0]: floating-point rounding modes
  • 52. 52 Copyright© 2020 Andes Technology Corp. RISC-V V Extension ISA Memory Operations
  • 53. 53 Copyright© 2020 Andes Technology Corp. Memory Operations • Load/store operations move groups of data between the VRF and memory – Andes implements unaligned addressing, no restriction • Four basic types of addressing – Unit stride: fetch consecutive elements in contiguous memory – Non-unit or constant stride: elements are in stride memory; stride is the distance between 2 elements of a vector in memory – Indexed (gather-scatter): elements are randomly offset from a base address – Segment: move multiple contiguous fields in memory to and from consecutively numbered vector registers – can be unit-stride, constant stride, or index • One TLB lookup per group of loads/stores (Andes) – Limit the memory access to 1 TLB page – 1 translation – 4KB, memory access fault if any element address crosses the page boundary • Atomic load/store is not implemented
  • 54. 54 Copyright© 2020 Andes Technology Corp. Extract 7 from first 32B Extract 25 from second 32B Vector Load Data Alignment Byte 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vn 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Firstpackage Secondpackage Andes processor handles all unaligned addresses
  • 55. 55 Copyright© 2020 Andes Technology Corp. 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 VL* unit-stride mw vm VLS* strided mw vm VLX* indexed mw vm VS* unit-stride mw vm VSS* strided mw vm VSX* indexed mw vm mop mop mop mop mop 0100111 rs2 rs1 width vs3 0100111 0000111 rs1 vd width 0000111 0100111 vs3 lumpop rs1 width vd nf vs2 rs1 width vs3 rs1 nf nf nf vs2 nf sumpop rs1 width nf mop rs2 width vd 0000111 mw-width[2:0] E1024 FL/FS H W D Q Vector - VLx/VSx E8 E16 E32 E64 E128 E256 E512 1110 1111 16 32 64 128 8 16 32 64 128 256 512 1024 x001 x010 x011 x100 0000 0101 0110 0111 1000 1101 00000 01000 10000 others unit-stride unit-stride, whole register load, fault-only-first reserved lumop/sumop[4:0] EEW=MEW Instruction Format – Load/Store Segment, multiple registers FP load/store Vector load/store mop[1:0] function 00 unit-stride 01 index_unordered 10 constant-stride 11 index_ordered 11 index store ls store load ls
  • 56. 56 Copyright© 2020 Andes Technology Corp. Vector load/store, different EMUL • EMUL is the effective LMUL • SEW=16, LMUL=1, VLEN=512b – elements=512/16=32, all vector instructions operate on 32 elements vle16.v v1, (t0) # Load a vector of sew=16, lmul=1, load to v1 vle32.v v2, (t1) # Load a vector of sew=32, emul=2, load to v2 and v3 vwadd.wv v4, v2, v1 # v4/v5 ← v2/v3 + sign-extend(v1), emul=2 vse32.v v4, (t1) # Store a vector of sew=32, emul=2, v4 and v5 store to memory vle8.v v1, (t1) # Load a vector of sew=8, emul=1/2, load to elements first half of v1 vle64.v v4, (t0) # Load a vector of sew=64, emul=4, load to v4, v5, v6, v7 vle128.v v8, (t0) # Load a vector of sew=64, emul=8, load to v8, v9, v10, …, v15
  • 57. 57 Copyright© 2020 Andes Technology Corp. Load with and without Extension • SEW=32b, LMUL=8, VLEN=512b, (512b/32b)*8 = 128 elements • RVV version 1.0 – vle8.v v6, (t0) & vsext.vf4 v0, v6 − Load 128 elements of 8b  512b/8b = 64 elements, write to 2 vector registers, v6-7 − Signed extension of 128 elements from 8b to 32b  Read 8b elements from 2 vector registers, v6-7  Write 32b elements to 8 vector registers, v0-7 • RVV version 0.8 – vlb.v v0, (t0) – custom Andes instruction − Load 128 elements of 8b and signed extended to 32b  Write data into 8 vector registers, v0-7
  • 58. 58 Copyright© 2020 Andes Technology Corp. Unit-Stride Load/Store Variances • Load Fault-only-First: exception only for first element – Set vl=faulted element, no exception – Use for exit condition of loop – Example: the unit load fetches to the end of the TLB page, all subsequent vector operations will be for the valid elements that were fetched. This is the reversed of the stripped mining for unknown number of application elements • Load/store of whole register: – vtype & vl are ignored, SEW=8b, uses NF bits for multiple vector registers – vm=1 (inst[25]), else illegal instruction – Vstart >= VLEN/8, no operation – Use for fill/spill, context switch, return from interrupt, …
  • 59. 59 Copyright© 2020 Andes Technology Corp. Example: strncpy, Fault-only-First Load strncpy: mv a3, a0 // Copy dst loop: vsetvli x0, a2, e8, m8, ta, ma // Vectors of bytes, VLEN=512b vle8ff.v v8, (a1) // Fault-only-First load to 8 vector registers vmseq.vi v1, v8, 0 // Flag zero bytes csrr t1, vl // Get number of bytes fetched Note: ta is used to write zero to elements after the faulted elements (vl=faulted elements)
  • 60. 60 Copyright© 2020 Andes Technology Corp. Stride Gather/Scatter Memory Access • Unit stride continuous memory access • Constant stride: gather scatter (stride can be negative or zero) – Support store memory overlapped due to stride=0 or stride<SEW 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Unit 0 1 2 3 unit stride 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 0 1 2 3 stride=2 Stride 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Gather Scatter 0 1 2 3 stride=8 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 0 1 2 3 stride=0
  • 61. 61 Copyright© 2020 Andes Technology Corp. Note: - The indices are absolute and not accumulative - The indices are not in-order Index Gather/Scatter Memory Access Base Address 7 12 8 2 vs2 Gather 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Base Address 5 8 0 15 vs2 0 1 2 3 vd/vs3 Scatter 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 &data[0] &data[0] Load Index vector Store Index vector
  • 62. 62 Copyright© 2020 Andes Technology Corp. Segment Unit Stride Load/Store • Multiple vector registers load/store, NF*EMUL <= 8 • Equivalent to 4 stride load/store micro-ops • Fault-only-first is allowed Segment stride load/store with NF=4 is four stride loads/stores 0 1 2 3 4 5 6 7 8 9 10 11 Memory V31 . . . . V3 V2 V1 V0 [0] [1] [2] … [VLMAX-1] Vector Registers Load 2 Load 1 Load 0 Unit load/store with NF=4, LMUL=1
  • 63. 63 Copyright© 2020 Andes Technology Corp. 0 1 2 3 4 5 6 7 8 9 10 11 Memory V31 . . . . V3 V2 V1 V0 [0] [1] [2] … [VLMAX-1] Unit load/store with NF=2, LMUL=2 Load 0 Load 1 Load 2 Vector Registers Segment Stride Load/Store, LMUL=2 • Multiple vector registers load/store, NF*EMUL <= 8 • Zero and negative strides are allowed First load/store, stride=2, LMUL=2 Second stride load/store, base address=1 Stride=0, write [0] into all elements of v0 & v1, [1] into all elements of v2 & v3
  • 64. 64 Copyright© 2020 Andes Technology Corp. Segment Index Load/Store • Multiple vector registers load/store, NF*EMUL <= 8 – Overlapping of store data to memory – NF number of index load/store micro-ops, the index is incremented by SEW Segment index load/store with NF=4 is four index loads/stores Base Address 7 12 8 2 Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 V31 . . . . V3 10 15 11 5 V2 9 14 10 4 V1 8 13 9 3 V0 7 12 8 2 [0] [1] [2] … [VLMAX-1] &data[0] Index vector Vector Registers
  • 65. 65 Copyright© 2020 Andes Technology Corp. RISC-V V Extension ISA Compute Instruction Format
  • 66. 66 Copyright© 2020 Andes Technology Corp. Vector Arithmetic Instruction Overview • vop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] op vs1[i] • vop.vx vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] op x[rs1] • vop.vi vd, vs2, imm, vm // vector-imm, vd[i]=vs2[i] op imm • vfop.vv vd, vs2, vs1, vm // vector-vector, vd[i]=vs2[i] fop vs1[i] • vfop.vx vd, vs2, rs1, vm // vector-scalar, vd[i]=vs2[i] fop f[rs1]
  • 67. 67 Copyright© 2020 Andes Technology Corp. Compute Instruction Format OPI – integer operation OPF – floating-point operation OFM – mask operation 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 OPIVV vector-vector m OPFVV vector-vector m OPMVV vector-vector m OPIVI vector-immed m OPIVX vector-scalar m OPFVF vector-scalar(F) m OPMVX vector-scalar m vsetvli scalar-immed 0 vsetvl scalar-scalar 1 funct6 rs1 101 funct6 vs2 vs1 001 funct6 funct6 111 vd funct6 rd 1010111 1010111 1010111 vs2 vs1 010 1010111 vs2 rs1 100 vd 1010111 vs2 vs1 000 vd 1010111 vd/rd 1010111 vd/rd 1010111 vs2 simm5 011 vd 000000 zimm[10:0] vs2 rs1 110 vd/rd rs1 111 rd funct6 funct6 vs2 1010111 rs2 rs1 Funct6 is used to encode all vector instructions, most of the time, the same funct6 code is used for all 3 forms for vector instructions (vector, scalar, immediate)
  • 68. 68 Copyright© 2020 Andes Technology Corp. Vector Widening Instructions • Widening: double width − vwop.v{v,x} vd, vs2, vs1/rs1 // 2*SEW = sign/zero(SEW) op sign/zero(SEW) − vwop.w{v,x} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op sign/zero(SEW) − vfwop.v{v,f} vd, vs2, vs1/rs1 // 2*SEW = SEW op SEW − vfwop.w{v,f} vd, vs2, vs1/rs1 // 2*SEW = 2*SEW op SEW (vfwadd & vfwsub) • Widening operations − Integer ALU: unsigned/signed extended the source operands before the operation − Integer MUL: operate on SEW-bit, then return 2*SEW − Floating-point: operate on SEW-bit, then return 2*SEW − Same number of elements rule as defined in vtype: o Double width  writing back to twice the number of vector registers
  • 69. 69 Copyright© 2020 Andes Technology Corp. Vector Widening Instructions – Extension • Vector integer extension − V{z,s}ext.vf{2,4,8} vd, vs2 // Zero/sign extend to SEW − The result is the same number of elements set by vtype − Example: SEW=32b, then only V{z,s}ext.vf{2,4} are legal instructions
  • 70. 70 Copyright© 2020 Andes Technology Corp. Shifting of Source Operands for Double-Width • The data are extended and folded into the source data from VRF before sending to execution – Cycle 1: operate on full-width data and write back to v2 shift data from left-half and extension to full width – Cycle 2: operate on full-width data and write back to v3 Pipeline in after the right half to write to v3 16b 16b 16b 16b 16b 16b 16b 16b v1 E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0 v0 E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0 src0 - v1 src1 - v0 Stage 2 v2=v1+v0 2*SEW v2 v3 Elem 2 Stage 1 Extension Elem 7 Elem 6 Elem 5 Elem 15 Elem 14 Elem 13 Elem 12 Elem 11 Elem 10 Elem 9 Elem 1 Elem 7 Elem 6 Elem 7 Elem 6 Elem 5 Elem 4 Elem 3 Elem 2 Elem 1 Elem 0 Elem 1 Elem 0 Elem 5 Elem 4 Elem 3 Elem 2 Elem 0 Elem 8 Elem 4 Elem 3 + + + + + + + +
  • 71. 71 Copyright© 2020 Andes Technology Corp. Vector Narrow Instructions • Narrowing: from double-width – vnop.w{v,x,i} vd, vs2, vs1/rs1 // SEW = 2*SEW op SEW – Logical and arithmetic right shift operations Example of operations without changing vtype, LMUL=1, SEW=16b, 32 elements in all vector instruction operations: vle16.v v1, (t0) // Load to v1 full vector registers vle32.v v2, (t1) // Load a vector of sew=32b, emul=2, load to v2 and v3 vwadd.wv v0, v2, v1 // v0/v1 ← v2/v3 + sign-extend(v1), emul=2 vnsra.wx v7, v0, t2 // v7 ← v0/v1 (shift right by t2) vse16.v v7, (t0) // Store v7 to memory
  • 72. 72 Copyright© 2020 Andes Technology Corp. Shifting of Source Operands for Narrow-Width • The data are extended and folded into the source data from VRF before sending to execution – Cycle 1: operate on full-width data – Cycle 2: Shift result data to write back to right-half of v0 Pipeline in after the right half to write to v3 v1 src0 - v1 v3 v2 v0 Elem 7 Elem 6 Elem 5 Elem 4 Elem 3 Elem 2 Elem 1 Elem 0 Elem 1 Elem 3 Elem 0 Elem 2 Elem 1 Elem 0 Elem 7 Elem 6 Elem 0 Elem 1 Elem 3 Elem 3 16b 16b 16b 16b Elem 7 Elem 6 Elem 5 v0=v2(shf)v1 Elem 4 Elem 5 Elem 4 Elem 3 Elem 2 s s s s
  • 73. 73 Copyright© 2020 Andes Technology Corp. RISC-V V Extension ISA Compute Instructions
  • 74. 74 Copyright© 2020 Andes Technology Corp. Vector Integer Arithmetic Instructions Vector Integer Add, Subtract, Reverse Subtract (including Double Width) Vector Integer Extension Vector Integer Add-with-Carry / Subtract-with-Borrow Vector Bitwise Logical Instructions Vector Bit Shift Instructions (including Narrow Width) Vector Integer Comparison Instructions, write 1/0 to destination register Vector Integer Min/Max Instructions, write min/max to destination register Vector Integer Multiply (including Double Width) Vector Integer Divide and Remain Vector Integer Multiply-Add/Sub, Multiply-Accumulate (including Double Width) Vector Integer Merge Instructions (vector – vs1, scalar – rs1, imm) Vector Integer Move Instructions (vector – vs1, scalar – rs1, imm)
  • 75. 75 Copyright© 2020 Andes Technology Corp. Vector Arithmetic with Carry Instructions vadc.vv vd, vs2, vs1, vm=0 // vd[i] = vs2[i] + vs1[i] + v0[i].mask vmadc.vvm vd, vs2, vs1, vm=0 // vd[i].mask = carry_out(vs2[i] + vs1[i] + v0[i].mask) • Vector instructions oriented to implementation : − Single vector destination register − Single write port per vector instruction • The carry-in is in v0 same as mask layout • Two vector instructions to complete the add with carry − vadc: destination and source registers cannot be overlapped − vadc: destination register cannot be v0 − vmadc: destination register can be v0 or any vector register
  • 76. 76 Copyright© 2020 Andes Technology Corp. Vector MAC Instructions v(w)madd.vv vd, vs2, vs1, vm // vd[i] = vs2[i] + (vs1[i] * vd[i]) v(w)madd.vx vd, vs2, rs1, vm // vd[i] = vs2[i] + (rs1 * vd[i]) v(w)macc.vv vd, vs2, vs1, vm // vd[i] = vd[i] + (vs1[i] * vs2[i]) v(w)macc.vx vd, vs2, rs1, vm // vd[i] = vd[i] + (rs1 * vs2[i]) v(w)msub and v(w)msac have the same format
  • 77. 77 Copyright© 2020 Andes Technology Corp. Vector Merge Instructions vmerge.vx vd, vs2, rs1, vm=0 // vd[i] = vm[i] ? rs1 : vs2[i] vmerge.vv vd, vs2, vs1, vm=0 // vd[i] = vm[i] ? vs1[i] : vs2[i] vmerge.vi vd, vs2, imm, vm=0 // vd[i] = vm[i] ? imm : vs2[i] • Each element uses vm[i] to select the input element
  • 78. 78 Copyright© 2020 Andes Technology Corp. Vector Integer Move Instructions vmv.v.{v,x,i} vd, {vs1, rs1, imm}, vm=1 • Same encoding as with Vector Merge Instructions with vm=1 and vs2=v0 • {x,i} splat a scalar register or immediate to all elements • vs1=vd is NOP • Other vs2 values are reserved
  • 79. 79 Copyright© 2020 Andes Technology Corp. Vector Fixed-Point Arithmetic Instructions Vector Saturating Add and Subtract (set CSR vxsat bit) Vector Averaging Add and Subtract (overflow is ignored) – Add/subtract, then shift right by 1, round off according to CSR vxrm bit Vector Fractional Multiply with Rounding and Saturation Vector Scaling Shift Instructions Vector Narrowing Clip Instructions
  • 80. 80 Copyright© 2020 Andes Technology Corp. Vector Fractional Multiply with Rounding and Saturation vsmul.v{v,x} vd, vs2, {vs1,rs1}, vm – Double-width multiplication vs2[i]*{vs1[i], rs1} – Shift right by SEW-1 – Roundoff_signed according to CSR vxrm bit – Saturate the result to SEW bits, set CSR vxsat bit if saturated – vd[i] = clip(roundoff_signed(vs2[i]*vs1[i], SEW-1))
  • 81. 81 Copyright© 2020 Andes Technology Corp. Vector Scale-Shift and Clip Instructions vssrl.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right logical – vd[i] = round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm}) vssra.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // scale-shift right arithmetic – vd[i] = round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm}) vnclip.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, narrow clip – vd[i] = clip(round_signed((vs2[i] + rnd) >> {vs1[i],rs1,imm})) vnclipu.v{v/x/i} vd, vs2, {vs1,rs1,imm}, vm // vs2 is 2*SEW, narrow clip – vd[i] = clip(round_unsigned((vs2[i] + rnd) >> {vs1[i],rs1,imm}))
  • 82. 82 Copyright© 2020 Andes Technology Corp. Vector Floating-Point Arithmetic Instructions Vector FP Add, Subtract, Reverse Subtract (including Double Width) Vector FP Multiply (including Double Width) Vector FP Divide, Reverse Divide Vector FP Multiply-Add/Sub, Multiply-Accumulate (including Double Width) Vector FP Square-Root, Reciprocal Square-Root Estimate, Reciprocal Estimate Vector FP Comparison Instructions, write 1/0 to destination register Vector FP Min/Max Instructions, write min/max to destination register Vector FP Merge, Move Instructions (only fp.scalar – f[rs1]) Vector FP Sign-Injection Instructions Vector FP Classify Instructions Vector FP/Integer Type-Convert (including Double Width and Narrowing)
  • 83. 83 Copyright© 2020 Andes Technology Corp. Vector Reduction Instructions Vector Integer Reduction Instructions Reduction sum (including Double Width) Reduction min/max (including signed and unsigned) Reduction bitwise AND, OR, XOR Vector FP Reduction Instructions Reduction ordered sum (including Double Width) ((((vs1[0] + vs2[0]) + vs2[1]) + vs2[2]) + vs2[3]) + … Reduction unordered sum (including Double Width) Reduction min/max vd[0] = op( vs1[0] , vs2[*] ) where * are all active elements (mask enabled) The w form writes the double-width result
  • 84. 84 Copyright© 2020 Andes Technology Corp. Vector Reduction Sum • Reduction of VLEN/SEW=256-bit/16-bit=16 elements • LMUL>1 can be pipelined SWPL = SEW = 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b V3-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0 V1-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0 v3=v3[0]+red(v1) V3-256b E15 E14 E13 E12 E11 E10 E9 E8 E7 E6 E5 E4 E3 E2 E1 E0 64-bit 64-bit 64-bit 64-bit + + + + + + + + + + + + + + + +
  • 85. 85 Copyright© 2020 Andes Technology Corp. Vector Mask Instructions Vector Mask-Register Logical Instructions AND, NAND, AND-NOT, OR, NOR, OR-NOT, XOR, XNOR Vector mask population count vpopc vfirst find-first-set mask bit vmsbf.m set-before-first mask bit vmsif.m set-including-first mask bit vmsof.m set-only-first mask bit Vector Iota Instruction Vector Element Index Instruction
  • 86. 86 Copyright© 2020 Andes Technology Corp. Vector Mask Logical • Mask logical instructions, vd = vs2 (logical-op) vs1 – vmmv.m = vmand.mm vd, vs, vs // copy mask register – vmclr.m = vmxor.mm vd, vd, vd // clear mask register – vmset.m = vmxnor.mm vd, vd, vd // set mask register – vmnot.m = vmnand.mm vd, vs, vs // invert mask bits • vl and vstart are used as with normal vector operations • Single vector register operation, LMUL is ignored, bit operations, each bit is for an element
  • 87. 87 Copyright© 2020 Andes Technology Corp. Vector Mask Instructions [VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0] vs2 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register rd 14 vpopc - x[rd] = sum_i( vs2[i] && v0[i] ) rd 3 vmfirst - element index of first mask bit vd 0 … 0 0 0 0 0 0 1 1 1 vmsbf - set before first mask bit vd 0 … 0 0 0 0 0 1 1 1 1 vmsif - set include first mask bit vd 0 … 0 0 0 0 0 1 0 0 0 vmsof - set only first mask bit vd n … 3 3 2 1 1 0 0 0 0 viota - vd[i] = sum of preceding mask bits vd m*i … 8 7 6 5 4 3 2 1 0 vid - mask index, vs2=v0 vstart=0, except for vid +
  • 88. 88 Copyright© 2020 Andes Technology Corp. Vector Permutation Instructions Integer Scalar Move Instructions – between vector element 0 and x[rd/rs1] Floating-Point Scalar Move Instructions – between vector element 0 and f[rd/rs1] Vector Slide Instructions Vector Register Gather Instructions Vector Compress Instruction Whole Vector Register Move – Move instructions with decoded number of vector registers to copy, ignoring vtype LMUL
  • 89. 89 Copyright© 2020 Andes Technology Corp. Vector Slide Instructions • vslideup/down – move elements up and down a vector − vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i] − vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i] − vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i] − vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1] − vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1] − vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm]
  • 90. 90 Copyright© 2020 Andes Technology Corp. Vector Slideup Instruction • No source and destination registers overlapping, destination register can be written before the source register is read (RAW) − vslideup.vx vd, vs2, rs1, vm // vd[i+rs1] = vs2[i] − vslideup.vi vd, vs2, uimm, vm // vd[i+uimm] = vs2[i] − vslide1up.vx vd, vs2, rs1, vm // vd[0]=x[rs1], vd[i+1] = vs2[i] − Slide1up is from XRF or FRF − All lower elements are not modified Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 slide1up vslide1up v8, v4 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x vslideup v8, v4, 10 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x x x x x x x x x x v11 v10 v9 v8 Unchange x[rs1], f[rs1] v5 v4 v7 v6
  • 91. 91 Copyright© 2020 Andes Technology Corp. Vector Slidedown Instruction • Source and destination registers overlapping is okay, source register is read before the destination register is written − vslidedown.vx vd, vs2, rs1, vm // vd[i] = vs2[i+rs1] − vslidedown.vi vd, vs2, uimm, vm // vd[i] = vs2[i+uimm] − vslide1down.vx vd, vs2, rs1, vm // vd[i] = vs2[i+1], vd[vl-1]=x[rs1] − Slide1down is from XRF or FRF − All the upper elements fill in with zero Source 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x[rs1], f[rs1] slide1down vslide1down v4, v4 0 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vslidedown v8, v4, 9 0 0 0 0 0 0 0 0 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 v4 v7 v6 v5 v7 v6 v5 v4 Write 0
  • 92. 92 Copyright© 2020 Andes Technology Corp. Vector Register Gather Instructions • vrgather – read elements from a source vector with index register to write destination vector – vrgather.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]] – vrgatherei16.vv vd, vs2, vs1, vm // vd[i] = vs2[vs1[i]], vs1 is 16b – vrgather.vx vd, vs2, rs1, vm // vd[i] = vs2[rs1] – vrgather.vi vd, vs2, uimm, vm // vd[i] = vs2[uimm] – If index >= VLMAX, then vd[i] = 0 – For SEW=8, vs1 can address only 256 elements – Can be used in combination with vector unit-stride load/store instead of index load/store element 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 vs2 e15 e14 e13 e12 e11 e10 e9 e8 e7 e6 e5 e4 e3 e2 e1 e0 vs1 10 9 2 3 11 12 15 0 0 1 4 8 1 0 7 5 vd e10 e9 e2 e3 e11 e12 e15 e0 e0 e1 e4 e8 e1 e0 e7 e5 The element number in vd is the same as the index in vs1
  • 93. 93 Copyright© 2020 Andes Technology Corp. Vector Compress Instructions • vcompress – shift the unmask elements from source vector to destination register − vcompress.vm vd, vs2, vs1 // vd[i] = Compress into vd elements of vs2 where vs1(LSB) is enabled − vm=1 (inst[25]), else illegal instruction [VLMAX-1] … [8] [7] [6] [5] [4] [3] [2] [1] [0] vs1 1 … 0 0 1 1 0 1 0 0 0 Mask layout as with v0 mask register vs2 ex … e8 e7 e6 e5 e4 e3 e2 e1 e0 vd … … … … … … … … e6 e5 e3 vcompress
  • 94. 94 Copyright© 2020 Andes Technology Corp. Sample Codes
  • 95. 95 Copyright© 2020 Andes Technology Corp. # void matmulFloat_v( # size_t n, // a0: A columns, B rows # size_t m, // a1: A rows # size_t p, // a2: B columns # const float *A, // a3: A is m x n matrix # const float *B, // a4: B is n x p matrix # float *C // a5: C is m x p matrix # ) { # int i, j, k; # # for (i = 0; i < m; i++) { # for (j = 0; j < p; j++) { # float c = 0; # for (k = 0; k < n; k++) { # c += A[i*n+k] * B[k*p+j]; # } # C[i*p+j] = c; # } # } # } # void matmulFloat_v( # size_t 3, // a0: A columns, B rows # size_t 2, // a1: A rows # size_t 3, // a2: B columns # const float *A, // a3: A is 2x3 matrix # const float *B, // a4: B is 3x3 matrix # float *C // a5: C is 2x3 matrix # ) { # int i, j, k; # for (h = 0; h< COUNT; h++) { # for (i = 0; i < 2; i++) { # for (j = 0; j < 3; j++) { # float c = 0; # for (k = 0; k < 3; k++) { # c += A[i*n+k] * B[k*p+j]; # } # C[i*p+j] = c; # } # } # } # } Matrix Multiplication – Issue #495 in RVV Extension Work Group b0 b1 b2 b3 b4 b5 b6 b7 b8 a0 a1 a2 c0 c1 c2 a3 a4 a5 c3 c4 c5 Matrix Multiplication
  • 96. 96 Copyright© 2020 Andes Technology Corp. C A B 0 0 0 0 1 3 0 2 6 1 0 1 1 1 4 1 2 7 2 0 2 2 1 5 2 2 8 3 3 0 3 4 3 3 5 6 4 3 1 4 4 4 4 5 7 5 3 2 5 4 5 5 5 8 Matrix Multiplication – Scalar Coding Style c0=a0*b0 c1=a0*b1 c2=a0*b2 c3=a3*b0 c4=a3*b1 c5=a3*b2 c0+=a1*b3 c1+=a1*b4 c2+=a1*b5 c3+=a4*b3 c4+=a4*b4 c5+=a4*b5 c0+=a2*b6 c1+=a2*b7 c2+=a2*b8 c3+=a5*b6 c4+=a5*b7 c5+=a5*b8 vfmul.vv v16, v10, v0 vfmul.vv v17, v10, v1 vfmul.vv v18, v10, v2 vfmul.vv v19, v13, v0 vfmul.vv v20, v13, v1 vfmul.vv v21, v13, v2 vfmacc.vv v16, v11, v3 vfmacc.vv v17, v11, v4 vfmacc.vv v18, v11, v5 vfmacc.vv v19, v14, v3 vfmacc.vv v20, v14, v4 vfmacc.vv v21, v14, v5 vfmacc.vv v16, v12, v6 vfmacc.vv v17, v12, v7 vfmacc.vv v18, v12, v8 vfmacc.vv v19, v15, v6 vfmacc.vv v20, v15, v7 vfmacc.vv v21, v15, v8 b0 b1 b2 b3 b4 b5 b6 b7 b8 a0 a1 a2 c0 c1 c2 a3 a4 a5 c3 c4 c5 v0 v1 v2 v3 v4 v5 v6 v7 v8 v10 v11 v12 v16 v17 v18 v13 v14 v15 v19 v20 v21 vector registers Matrix Multiplication Matrix Multiplication Each vector register holds 16 elements, 16 sets of matrices are operated in parallel
  • 97. 97 Copyright© 2020 Andes Technology Corp. VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements • Single precision floating point data • 16 sets of matrices can be processed in parallel • Vector multiplication and MACC (6 vfmul and 12 vfmac) Assuming data are contiguous in memory with prefetched in L1 data cache (data cache hit) – stride between the matrices • 6 stride loads for 16 A matrices (2x3=6), stride=4Bx6=24 • 9 stride loads for 16 B matrices (3x3=9), stride=4Bx9=36 • 6 stride store for 16 C matrices (2x3=6), stride=4Bx6=24 Vector Scalar-Style Coding
  • 98. 98 Copyright© 2020 Andes Technology Corp. Stride Loads & Stores for Scalar-Style Code vlse32.v v10, x2, x5 Aref 0 addi x2, x2, 4 vlse32.v v11, x2, x5 Aref 1 addi x2, x2, 4 vlse32.v v12, x2, x5 Aref 2 addi x2, x2, 4 vlse32.v v13, x2, x5 Aref 3 addi x2, x2, 4 vlse32.v v14, x2, x5 Aref 4 addi x2, x2, 4 vlse32.v v15, x2, x5 Aref 5 addi x2, x2, 364 vlse32.v v0, x3, x6 Bref 0 addi x3, x3, 4 vlse32.v v1, x3, x6 Bref 1 addi x3, x3, 4 vlse32.v v2, x3, x6 Bref 2 addi x3, x3, 4 vlse32.v v3, x3, x6 Bref 3 addi x3, x3, 4 vlse32.v v4, x3, x6 Bref 4 addi x3, x3, 4 vlse32.v v5, x3, x6 Bref 5 addi x3, x3, 4 vlse32.v v6, x3, x6 Bref 6 addi x3, x3, 4 vlse32.v v7, x3, x6 Bref 7 addi x3, x3, 4 vlse32.v v8, x3, x6 Bref 8 addi x3, x3, 540 vsse32.v v16, x4, x5 Cref 0 addi x4, x4, 4 vsse32.v v17, x4, x5 Cref 1 addi x4, x4, 4 vsse32.v v18, x4, x5 Cref 2 addi x4, x4, 4 vsse32.v v19, x4, x5 Cref 3 addi x4, x4, 4 vsse32.v v20, x4, x5 Cref 4 addi x4, x4, 4 vsse32.v v21, x4, x5 Cref 5 addi x4, x4, 364
  • 99. 99 Copyright© 2020 Andes Technology Corp. Performance of Scalar-Style Codes • Load 6 Arefs = 12*6 = 72 cycles // access 12 32B cache lines • Load 9 Brefs = 17*9 = 153 cycles // access 17 32B cache lines • Store 6 Crefs = 12*6= 54 cycles // access 12 32B cache lines • Vector FP multiplications = 18 cycles // hidden by the store • Total cycles per loop = 297 for 16 sets of matrices • Total cycles per set of matrix = 19 cycles • Number of vector registers used = 6+9+6 = 21 registers, efficiency = 66% • Number of static instructions in the loop = 62 instructions • Average instructions per set of matrices = 4 instructions • It is 16X the performance of scalar code
  • 100. 100 Copyright© 2020 Andes Technology Corp. Matrix Multiplication – Unit Load Coding Style • VLEN=512b, SEW=32b, VLEN/SEW = 512b/32b = 16 elements – LMUL=4, 16x4 = 64 elements – LMUL=8, 16x8 = 128 elements • Number of Cref or Aref per matrix = 6 elements − 10 sets of matrices are 60 elements which fit into VRF with LMUL=4 • Number of Bref per matrix = 9 elements − 10 sets of matrices are 90 elements which fit into VRF with LMUL=8
  • 101. 101 Copyright© 2020 Andes Technology Corp. • Unit load of 10 consecutive B matrices (10x9=90 elements) − vrgather to set up 10 sets of 6 B elements = 2, 1, 0, 2, 1, 0 (yellow) in v16 − vslideup v16 to v20 − vrgather to set up 10 sets of 6 B elements = 5, 4, 3, 5, 4, 3 (blue) in v16 − vrgather to set up 10 sets of 6 B elements = 8, 7, 6, 8, 7, 6 (white) in v24 C A B 0 0 0 0 1 3 0 2 6 1 0 1 1 1 4 1 2 7 2 0 2 2 1 5 2 2 8 3 3 0 3 4 3 3 5 6 4 3 1 4 4 4 4 5 7 5 3 2 5 4 5 5 5 8 setvli t1, x5, e32, m8 vl=90, LMUL=8 vle32.v v8, x3 unit load - Bref, 90 elements addi x3, x3, 360 setvli t1, x6, e32, m8 vl=60, LMUL=8 vsub.vi v0, v0, 6 vrgather.vv v16, v8, v0 60 elements of first Bref in v16 vadd.vi v0, v0, 3 vslideup.vi v16, v16, 64 Move first Bref to v20 vrgather.vv v16, v8, v0 60 elements of second Bref in v16 vadd.vi v0, v0, 3 vrgather.vv v24, v8, v0 60 elements of third Bref in v24 Unit Load of B elements
  • 102. 102 Copyright© 2020 Andes Technology Corp. • Unit load of 10 consecutive A matrices (10x6=60 elements) − vrgather to set up 10 sets of 6 A elements = 3, 3, 3, 0, 0, 0 (yellow) in v16 − vrgather to set up 10 sets of 6 A elements = 4, 4, 4, 1, 1, 1 (blue) − vrgather to set up 10 sets of 6 A elements = 5, 5, 5, 2, 2, 2 (white) • Multiplication and accumulate and unit store C elements to memory C A B 0 0 0 0 1 3 0 2 6 1 0 1 1 1 4 1 2 7 2 0 2 2 1 5 2 2 8 3 3 0 3 4 3 3 5 6 4 3 1 4 4 4 4 5 7 5 3 2 5 4 5 5 5 8 setvli t1, x7, e32, m4 vl=60, LMUL=4 vle32.v v8, x2 unit load - Aref, 60 elements addi x2, x2, 240 vrgather.vv v28, v8, v4 set up 60 elements of first Aref vadd.vi v4, v4, 1 vfmul.vv v20, v20, v28 vrgather.vv v12, v8, v4 set up 60 elements, of second Aref vadd.vi v4, v4, 1 vfmacc.vv v20, v16, v12 vrgather.vv v28, v8, v4 set up 60 elements of third Aref vsub.vi v8, v8, 2 vfmacc.vv v20, v24, v28 vse32.v v20, x4 unit store - Cref, 60 elements addi x4, x4, 240 bne Unit Load of A elements
  • 103. 103 Copyright© 2020 Andes Technology Corp. C A B 0 0 0 0 1 3 0 2 6 1 0 1 1 1 4 1 2 7 2 0 2 2 1 5 2 2 8 3 3 0 3 4 3 3 5 6 4 3 1 4 4 4 4 5 7 5 3 2 5 4 5 5 5 8 VRF b8 31 a5 b7 30 a4 … 29 … … 28 … b3 27 a2 b2 26 a1 b1 25 a0 b0 24 a5 b8 23 a4 b7 22 a3 b6 21 a2 b5 20 a1 b4 19 a0 b3 18 b2 17 b1 16 b0 15 c5 b8 14 c4 b7 13 … b6 12 … b5 11 c2 b4 10 c1 b3 9 c0 b2 8 c5 b1 7 c4 b0 6 c3 5 c2 4 c1 3 c0 2 1 0 vrgather vrgather vslideup Index for B elements Index for A elements vrgather vrgather Graphical View of Unit Load Codes
  • 104. 104 Copyright© 2020 Andes Technology Corp. Performance of Unit Load Codes • Load 60 Aref and 90 Bref elements, store 60 Cref elements = 28 cycles of 32B cache lines o Vrgather for Aref = 8 cycles + 4 + 4 = 16 cycles o Vrgather/vslideup for Bref = 13 cycles + 8 + 13 + 13 = 47 cycles • Vector FP multiplications = 12 cycles • Total cycles per loop = 103 for 10 sets of matrices • Total cycles per set of matrix = 10 cycles (19 cycles for straight code) • Number of static instructions in the loop = 27 instructions • Average instructions per set of matrices = 2.7 instructions (4 instructions for straight code) • Number of elements used in VLMAX, 60/64 and 90/128, efficiency is about 90% • Unit load/store are much more efficient memory operation than stride load: accessing 28 instead of 41 cache lines. Shifting the complexity of stride load/store to vrgather instructions
  • 105. 105 Copyright© 2020 Andes Technology Corp. Larger Matrices – Scalar Style Coding • Same VLEN, SEW=32b, VLMAX=16 – A matrix 6x3, 18 elements – requires 18 vector registers – B matrix 3x6, 18 elements – requires 18 vector registers – C matrix 6x6, 36 elements – requires 36 vector registers – Compute 1 row of A and C elements at a time, 6 iterations to finish 16 sets of matrices b0 b1 b2 b3 b4 b5 v0 v1 v2 v3 v4 v5 b6 b7 b8 b9 b10 b11 v6 v7 v8 v9 v10 v11 b12 b13 b14 b15 b16 b17 v12 v13 v14 v15 v16 v17 a0 a1 a2 c0 c1 c2 c3 c4 c5 v18 v19 v20 v21 v22 v23 v24 v25 v26 a3 a4 a5 c6 c7 c8 c9 c10 c11 v18 v19 v20 v21 v22 v23 v24 v25 v26 a6 a7 a8 c12 c13 c14 c15 c16 c17 v18 v19 v20 v21 v22 v23 v24 v25 v26 a9 a10 a11 c18 c19 c20 c21 c22 c23 v18 v19 v20 v21 v22 v23 v24 v25 v26 a12 a13 a14 c24 c25 c26 c27 c28 c29 v18 v19 v20 v21 v22 v23 v24 v25 v26 a15 a16 a17 c30 c31 c32 c33 c34 c35 v18 v19 v20 v21 v22 v23 v24 v25 v26 Matrix Multiplication Matrix Multiplication vector registers
  • 106. 106 Copyright© 2020 Andes Technology Corp. Larger Matrices – Unit Load Codes • VLEN=512b, SEW=32b, VLMAX=16 elements − LMUL=2, 32 elements, 2 vector registers are used for each 18 elements of Aref and Bref − LMUL=4, 64 elements, 4 vector registers are used for 36 elements of Cref − 1 set of matrices per loop iteration VRF a17 31 a16 30 … 29 … 28 a8 27 a7 26 a6 25 a5 24 a4 23 a3 22 a2 21 a1 20 a0 19 18 b17 17 b16 16 … 15 c35 … 14 c34 b8 13 … b7 12 … b6 11 c8 b5 10 c7 b4 9 c6 b3 8 c5 b2 7 c4 b1 6 c3 b0 5 c2 4 c1 3 c0 2 1 0 vmul and vfmac Index for B elements Index for A elements vrgather vrgather
  • 107. 107 Copyright© 2020 Andes Technology Corp. Performances of Larger Matrices *Vector loads/stores are pipelined between iterations **6x4 or 6x5 matrices are more efficient. 6x3 has 18 active elements with VLMAX=32 Unit loads/stores are much more efficient memory operation than stride load Scalar style Unit load Number of static instructions per loop 252 30 Total number of cycles 1259 22* Number of cycles per set of matrices 79 22 Efficiency of register usages 81% 56%**
  • 108. 108 Copyright© 2020 Andes Technology Corp. vsetvli t0, zero, e32, m8 // VLEN=512b, LMUL=8, SEW=32b, ELEN=4096b vlb.v v0, t0; add t0, t0, imm; vadd.vx v0, v0, t1; lfw f0, t2; add t2, t2, imm; vfcvt.f.x.v v0, v0; vfmacc.vf v8, v0, f0; bne t0, t3, imm • Performance requirement: 16B of 8b INT every clock cycle Loop of 8 instructions, Load element = 8b INT, extension to 32b Operation on FP 32b RVV Work Group, Issue #362, Codes Based on RVV 0.8
  • 109. 109 Copyright© 2020 Andes Technology Corp. Vector Processor Pipeline Example VLEN=64B Time 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 vadd v0, v0, t1 va va va vadd v1, v1, t1 va va va vadd v2, v2, t1 va va va vadd v3, v3, t1 va va va vadd v4, v4, t1 va va va vadd v5, v5, t1 va va va vadd v6, v6, t1 va va va vadd v7, v7, t1 va va va vfcvt v0, v0 fc fc fc fc fc fc vfcvt v1, v1 fc fc fc fc fc fc vfcvt v2, v2 fc fc fc fc fc fc vfcvt v3, v3 fc fc fc fc fc fc vfcvt v4, v4 fc fc fc fc fc vfcvt v5, v5 fc fc fc fc vfcvt v6, v6 fc fc fc fc vfcvt v7, v7 fc fc fc fc vfmacc v8, v0, f0 fm fm fm fm fm fm fm fm fm vfmacc v9, v1, f0 fm fm fm fm fm fm fm fm vfmacc v10, v2, f0 fm fm fm fm fm fm fm vfmacc v11, v3, f0 fm fm fm fm fm fm vfmacc v12, v4, f0 fm fm fm fm fm fm vfmacc v13, v5, f0 fm fm fm fm fm fm vfmacc v14, v6, f0 fm fm fm fm fm fm vfmacc v15, v7, f0 fm fm fm fm fm fm vlb v0, t0 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7 v0 v1 v2 v3 v4 v5 v6 v7 lfw f0, t2 f0 f0 f0 8 cycles 8 cycles 8 cycles chaining chaining
  • 110. 110 Copyright© 2020 Andes Technology Corp. Vector Processor Performance • Number of instructions and micro-ops in 8 cycles – 3 vector arithmetic instructions = 24 micro-ops (vadd, vfcvt, vfmac) – 1 FP load – 3 scalar instructions + 1 vector load to data cache – Performance at 1 GHz = 64 GOPS or 48 GFLOPS for SEW=32b – Performance at 1 GHz = 128 GOPS or 96 GFLOPS for SEW=16b • Unroll the loop twice in order to pipeline vector load instructions – All 32 vector registers are used – Performance improvement can be achieved with VLEN=1024b
  • 111. 111 Copyright© 2020 Andes Technology Corp. vle8.v v6, t0; add t0, t0, imm; vsext.v4 v0, v6 vadd.vx v0, v0, t1; lfw f0, t2; add t2, t2, imm; vfcvt.f.x.v v0, v0; vfmacc.vf v8, v0, f0; bne t0, t3, imm • An extra instruction in a tight loop can degrade the performance by 12.5% • Andes provides custom load instruction in order to achieve the required performance • In addition, the vsext.v4 instruction needs an extra write port since all the write ports are used Loop of 9 instructions, Load element = 8b INT, Vector extension to 32b, Operation on FP 32b RVV Work Group, Issue #362, Codes Based on RVV 0.9
  • 112. 112 Copyright© 2020 Andes Technology Corp. AndesCore™ NX27V introduction
  • 113. 113 Copyright© 2020 Andes Technology Corp. • Vector Processor is implemented as part of the RISC-V CPU • Configurability: necessary for IP to attract wide range of customers, 128b to 512b • Scalability: necessary for next generation • Low power and area: necessary for embedded market • Performance: out-of-order execution and design, 2*VLEN = double performance • Simplicity: necessary for time to market, reducing verification time • Modular Design: decoupling for independent design, verification, timing, and clock gating Andes NX27V Design Objectives
  • 114. 114 Copyright© 2020 Andes Technology Corp. Overview of NX27V • AndeStar V5 architecture: – RV64GCN+ Andes V5 Extensions – RV Vector extension (RVV) • 5-stage pipeline, single-issue – Optional branch prediction • I/D caches – Caches: 8KB to 64KB – I$/D$ prefetch – HW unaligned load/store accesses – 16 non-blocking outstanding data accesses • Wide data paths to feed VPU: – Cached and uncached RVV load/stores – Streaming Ports for ACE loads/stores
  • 115. 115 Copyright© 2020 Andes Technology Corp. NX27V Pipeline Fetch Decode Execute Memory Retire Integer Execution Unit Data Cache Exception Handling IFU GPR V P U Vector scalar V I Q V/F/D inst Custom Coprocessor/ Mem-Controller command data A C E pipeline Execute Streaming Ports Custom Memory AXI
  • 116. 116 Copyright© 2020 Andes Technology Corp. NX27V Performance Gain Functions Speedup1 F32 basic mathematical functions 19X RGB CNN functions 18X Depthwise CNN functions 23X Pointwise CNN functions 21X Relu CNN functions 69X F32 filtering functions 19X Q7 filtering functions 39X F32 32x32x32 matrix multiplication 57X 1Compared to pure C scalar code compiled with high optimization; both vector and scalar code ran on the NX27V FPGA with 512-bit VLEN, 256-bit bus.
  • 117. 117 Copyright© 2020 Andes Technology Corp. Andes NX27V vs. Competitions Andes NX27V CAxx CMxx Architecture RVV/Andes VPU Popular SIMD Hxxx Vector registers 32 32 8 Vector Length Up to 512b 128b 128b SIMD width Up to 512b/cycle 128b/cycle 64b/cycle LMUL Yes None None Chaining Yes Not applicable Yes Custom extension ACE No No Streaming Ports Yes No No
  • 118. 118 Copyright© 2020 Andes Technology Corp. Core/System Performance Comparison CPU A64FX* NX27V** Vector ISA ARM SVE RISC-V RVV Technology (nm) 7 7 Core sustain Perf 16b (GFLOPS) ~56 96 Core Peak Perf 16b (GOPS) 230 320 # of cores 48 48** System (TFLOPS) ~2.7 ~4.6 Memory BW (GB/s) 1024 3072 *Fujitsu presentation at Hot Chip 2018. **Assume the same number of cores for comparison. 512-bit SIMD, 512-bit bus.
  • 119. 119 Copyright© 2020 Andes Technology Corp. NX27V Floorplan  Process: 7nm  Frequency: 1GHz (worst case)  Size: ~0.3mm2
  • 120. 120 Copyright© 2020 Andes Technology Corp. Development Tools Standard tools in AndeSight™ IDE: – AndeSim: Near-cycle accurate simulator – Compiler (intrinsic functions and inline assembly) – Assembler – Debugger and ICE – Computation library Advanced tools: – AI compiler support thru LLVM – AndesClarity pipeline visualizer and analyzer  Pipeline view of instructions and functional units  Resource view corresponding to instruction usage  Stall bubbles with different colors for data dependency
  • 121. 121 Copyright© 2020 Andes Technology Corp. Clarity: NX27V Pipeline View
  • 122. 122 Copyright© 2020 Andes Technology Corp. Summary • Andes NX27V vector processor – The world first commercial RISC-V vector processor, VLEN = 512b/256b are available – The 512-bit NX27V vector processor has been shipped to lead customer – High performance in HPC and AI applications – Flexible VPU configurations to enable a wide range of applications – AndesClarity for performance optimization – Work with customers on Andes Custom Extension to provide proprietary, hardware acceleration, and marketing competitive • Andes Technology will continue to expand vector product families to meet the requirements of edge to cloud applications
  • 123.