Single Instruction Multiple Data
Another approach to ILP and performance
Outline
• Array Processors / “True” SIMD
• Vector Processors
• Multimedia Extensions in modern instruction sets
SIMD: Motivation
• Let’s start with an example:
• ILLIAC IV, U of Illinois, 1972 (prototype)
• Reasoning: How to Improve Performance
• Rely on Faster Circuits
• Cost/circuit increases with circuit speed
• At some point, cost/performance unfavorable
• Concurrency:
• Replicate Resources
• Do more per cycle
SIMD: Motivation contd.
• Replication to the extreme: Multi-processor
• Very flexible, but costly
• Do we need all this flexibility?
• There are middle-ground designs where only parts are replicated
[Figure: a uniprocessor (one CU, ALU, and MEM) and, by replicating all three, a multiprocessor with three CU/ALU/MEM sets]
SIMD: Motivation Contd.
• Recall:
• Part of architecture is understanding application needs
• Many Apps:
• for i = 0 to infinity
• a(i) = b(i) + c
• Same operation over many tuples of data
• Mostly independent across iterations
SIMD Architecture
• Replicate Datapath, not the control
• All PEs work in tandem
• CU orchestrates operations
[Figure: one CU driving several PE/MEM pairs; each PE contains an ALU, a μCU, and regs]
ILLIAC IV
• Goal:
• 1 Gops/sec
• 256 PEs as four partitions of 64 PEs
• What was built
• 0.2 Minsts/sec (we’ll talk about peak performance as ops)
• 64 PEs
• Prototype due date 1972
ILLIAC IV
[Figure: CU connected to several PE/PMEM pairs and to an I/O Proc]
ILLIAC IV Processing Element (PE)
• 64-bit numbers, float or fixed point
• Multiples of smaller numbers that add up to 64-bits
• Today’s multimedia extensions
• PMEM: One local memory module per PE
• 2K x 64-bits
• 188ns access / 350ns cycle (includes conflict resolution)
• 100K components per PE
PE Contd.
• PE mode: Active or Inactive, CU sets mode
• All PEs operate in lock-step
• Routing insts to move data from PE to PE
• The CU can execute instructions while PE’s are busy
• Another degree of concurrency
• Datatypes
• 64b float
• 64b logical
• 48b fixed
• 32b float
• 24b fixed
• 8b fixed
Peak Compute Bandwidth
• 64 PEs
• Each can perform:
• 1 64b, 2 32b, or 8 8b operations
• Or, in total:
• 64 elems, 128 elems, or 512 elems
• Peak:
• 150M 64b ops/sec up to 10G 32b ops/sec
• The last figure is for integer ops
• Each int op takes 66ns (4 per PE in parallel)
Control Unit (CU)
• A simple CPU
• Can execute instructions w/o PE intervention
• Coordinates all PEs
• 64 64b registers, D0-D63
• 4 64b Accumulators A0-A3
• Ops:
• Integer ops
• Shifts
• Boolean
• Loop control
• Index PMem
[Figure: CU with registers D0 to D63, accumulators A0 to A3, and an ALU]
Processing Element (PE)
• 64 bit regs
• A: Accumulator
• B: 2nd operand for binary ops
• R: Routing – Inter-PE Communication
• S: Temporary
• X: Index for PMEM, 16 bits
• D: mode, 8 bits
• Communication:
• PMEM only from local PE
• Amongst PEs with R
[Figure: PEi with registers A, B, R, S, X, D and an ALU, attached to PMEMi (2K words), with routing links to PEi-1, PEi+1, PEi-8, and PEi+8]
Datapaths
• CU Bus: Insts and Data from PMEM to CU in 8 words
• CDB: Broadcast to all PEs
• E.g., constants for adds
• Routing Network: amongst R registers
• Mode: To activate/de-activate PEs
[Figure: CU linked to each PE/PMEM pair by the Control Unit Bus, the Mode lines, and the Common Data Bus; the PEs are linked to one another by the routing network]
Routing Network
[Figure: the 64 PEs drawn as an 8 x 8 grid, numbered 0 to 63 row by row. PE i links to PEs i-1, i+1, i-8, and i+8. Rows chain end to end (7 to 8, 15 to 16, and so on) and the bottom row (56 to 63) wraps around to the top row (0 to 7). Example: PE 20 links to 19, 21, 12, and 28.]
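The neighbor rule in the routing figure can be sketched in a few lines. This is only an illustration, assuming PEs are numbered 0 to 63 and that all four links wrap modulo 64; the function name `illiac_neighbors` and its parameters are made up for this sketch.

```python
def illiac_neighbors(i, num_pes=64, row_len=8):
    """Return the four PEs that PE i can reach in one routing step.

    Assumes 0-based PE numbers and end-around (modulo) connections.
    """
    return {
        "i-1": (i - 1) % num_pes,
        "i+1": (i + 1) % num_pes,
        "i-8": (i - row_len) % num_pes,
        "i+8": (i + row_len) % num_pes,
    }


print(illiac_neighbors(20))
print(illiac_neighbors(7))
```

Longer moves are built from these single steps, for example a distance of 10 as one +8 step followed by two +1 steps.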
Using ILLIAC IV: Example #1
• DO 10 I = 1 TO 64
10 C(I) = A(I) + B(I)
• LDA a + 2 load A(i) into A (same a per PMEM)
• ADDRN a + 1 add B(i) into A
• STA a store A into C(i)
[Figure: PMEM1 through PMEM64; each PMEMi holds C(i), A(i), and B(i) in three consecutive words starting at the same local address a]
Using ILLIAC IV: Example #2
• DO 10 I = 2 TO 64
• 10 A(I) = B(I) + A(I-1)
• Expand into:
• A(N) = A(1) + Sum B(i) [i = 2 to N]
• We get:
• S = A(1)
• DO 10 N=2 TO 64
• S = S + B(N)
• 10 A(N) = S
Using ILLIAC IV: Example #2 contd.
1. Enable all PEs
2. All load A from a
3. i = 0
4. All R = A (including those inactive)
5. All route R to PE(2^i) to the right
6. j = 2^i
7. Disable all PEs 1 through j (they have no sender 2^i places to their left)
8. A = A + R (R contains a partial sum of earlier elements)
9. i = i + 1
10. if i < lg(64) goto 4
11. Enable All PEs
12. All store A at (a + 1)
Using ILLIAC IV: Example #2 contd.
• Initial State:
• PMEM(1)[a] = A(1)
• PMEM(1+i)[a] = B(i+1)
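• For example, at PE8 (data moves to the right, so PE8 receives from PE7, PE6, and PE4)
• Start: A = B(8)
• Pass i = 0: from PE7 we get B(7)
• A = B(7) + B(8)
• Pass i = 1: from PE6 we get B(5) + B(6)
• A = B(5) + B(6) + B(7) + B(8)
• Pass i = 2: from PE4 we get A(1) + B(2) + B(3) + B(4)
• A = A(1) + B(2) + ... + B(8), which is the final A(8); PE8 is disabled in later passes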
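The log-step running sum can be checked with a small simulation. This is a sketch under stated assumptions, not ILLIAC IV code: PEs are numbered from 1 (list index 0 is PE1), routing wraps around, and PEs 1 through 2^i sit out pass i because they have no sender 2^i places to their left. The name `illiac_running_sum` is invented for the example.

```python
from itertools import accumulate


def illiac_running_sum(values):
    """Simulate the lock-step running sum; values[k] is the word at a in PE k+1."""
    n = len(values)
    acc = list(values)              # every PE loads A from address a
    shift = 1                       # 2^i for pass i
    while shift < n:
        # All PEs copy A into R and route R 2^i places to the right (with wrap).
        routed = [acc[(k - shift) % n] for k in range(n)]
        # Only enabled PEs (index >= 2^i) add the routed value.
        acc = [acc[k] + routed[k] if k >= shift else acc[k] for k in range(n)]
        shift *= 2
    return acc


data = [10] + list(range(2, 65))    # A(1) = 10, B(i) = i (made-up values)
print(illiac_running_sum(data) == list(accumulate(data)))
```

The loop runs lg(64) = 6 passes, compared with 63 sequential additions in the scalar loop.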
Vector Processors
SIMD over time
Vector Processors
• Vector Datatype
• Apply same operation on all elements of the vector
• No dependences amongst elements
• Same motivation as SIMD
Properties of Vector Processors
• One Vector instruction implies lots of work
• Fewer instructions
• Each result independent of previous result
• Multiple operations in parallel
• Simpler design; no need for dependence checks
• Higher clock rate
• Compiler must help
• Fewer Branches
• Memory access pattern per vector inst known
• Prefetching effect
• Amortize mem latency
• Can exploit high-bandwidth mem system
• Less/no need for data caches
Classes of Vector Processors
• Memory to memory
• Vectors are in memory
• Load/store
• Vectors are in registers
• Load/store to communicate with memory
• This prevailed
Historical Perspective
• Mid-60s: performance concerns
• SIMD processor arrays
• Also fast Scalar machines
• CDC 6600
• Texas Instruments ASC, 1972
• Memory to memory vector
• Cray develops CRAY-1, 1976
CRAY-1
• Fast and simple scalar processor
• 80 MHz
• Vector register concept
• Much simpler ISA
• Reduced memory pressure
• Tight integration of scalar and vector units
• Cylindrical design to minimize wire lengths
• Freon Cooling
Physical Organization of CRAY-1
Components of Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch
• Vector Registers:
• Fixed length memory bank holding a single vector reg
• Typically 8-32 Vregs, up to 8Kbits per Vreg
• At least 2 read ports and 1 write port
• Can be viewed as an array of N elements
• Vector Functional Units:
• Fully pipelined. New op per cycle
• Typically 2 to 8 FUs: integer and FP
• Multiple datapaths to process multiple elements per cycle if needed
• Vector Load/Store Units (LSUs):
• Fully pipelined
• Multiple elems fetched/stored per cycle
• May have multiple LSUs
• Cross-bar:
• Connects FUs, LSUs and registers
CRAY-1 Organization
• Simple 16-bit Reg-to-Reg ISA
• Use two 16-bit parcels to get an immediate
• Natural combinations of scalar and vector
• Scalar bit-vectors match vector length
• Gather/Scatter M-R
• Cond. Merge
CRAY-1 CPU
• Scalar and vector modes
• 12.5 ns clock
• 64-bit words
• Int & FP units
• 12 FUs
• 8 24-bit A regs
• 64 B regs (temp storage for A)
• 8 64-bit S regs
• 64 T regs (temp storage for S)
• 8 V regs, each 64 elems of 64 bits
CRAY-1 CPU
• Vector Length Register
• Can use only a prefix of a vreg
• Vector Mask Register
• Can use only a subset of a vreg
• Real Time Register (counts clock cycles)
• Four instruction buffers
• 64 16-bit parcels
• 128 Basic Instructions
• Interrupt Control
• NO virtual memory system
Cray-1 Memory System
• 1M 64b words + 8 check bits (single error correction, double error detection)
• 16 banks of 64K words
• 4 clock periods per bank cycle
• 1 word per clock for B, T and Vreg
• 1 word per 2 clocks for A & S
• 4 words per clock for inst buffers
Instruction Format
• Fields g h i j k m
• Bits 0-3 4-6 7-9 10-12 13-15 16-31
• Bit counts 4 3 3 3 3 16
• X X  opcode
• Rd Rs1 Rs2
• A/S B/T
Basic Vector Instructions
• Inst Operands Operation Comment
• VADD.VV V1, V2, V3 V1=V2+V3 vector+vector
• VADD.SV V1, R0, V2 V1=R0+V2 scalar+vector
• VMUL.VV V1, V2, V3 V1=V2*V3 vector * vector
• VMUL.SV V1, R0, V2 V1=R0*V2 scalar * vector
• VLD V1, R0 V1=M[R0…R0+63] stride = 1
• VLDS V1, R1, R2 V1=M[R1…R1+63*R2] stride=R2
• VLDX V1, R1, V2 V1=M[R1+V2[i], i=0 to 63] gather
• VST store equiv of VLD
• VSTS store equiv of VLDS
• VSTX V1, R1, V2 M[R1+V2[i], i=0 to 63]=V1 scatter
Vector Memory Operations
• Load/Store move groups of data between memory and registers
• Addressing Modes
• Unit-stride: Fastest
• Non-Unit, constant stride (interleaved memory helps)
• Indexed (gather-scatter)
• Vector equiv of register indirect
• Sparse arrays
• Can vectorize more loops
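Indexed addressing is easy to picture with a toy model. The sketch below treats memory as a Python list and mirrors the VLDX and VSTX rows of the instruction table; the function names `vldx` and `vstx` and the sample values are illustrative only.

```python
def vldx(memory, base, index_vec):
    """Gather: V1[i] = M[base + index_vec[i]]."""
    return [memory[base + off] for off in index_vec]


def vstx(memory, base, index_vec, values):
    """Scatter: M[base + index_vec[i]] = values[i]."""
    for off, val in zip(index_vec, values):
        memory[base + off] = val


memory = list(range(100, 120))      # made-up memory contents
indices = [0, 5, 3]                 # positions of the nonzeros of a sparse row
gathered = vldx(memory, 2, indices)
vstx(memory, 2, indices, [v * 2 for v in gathered])
print(gathered, memory[2], memory[7], memory[5])
```

A sparse-array loop can then run as gather, dense vector arithmetic, scatter, which is why indexed addressing lets more loops vectorize.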
Vector Code Example
• Y[0:63] = Y[0:63] + a * X[0:63]
• LD R0, a
• VLD V1, Rx Load X[] in V1
• VLD V2, Ry Load Y[] in V2
• VMUL.SV V3, R0, V1 V3 = X[]*a
• VADD.VV V4, V2, V3 V4 = Y[]+X[]*a
• VST Ry, V4 store in Y[]
Scalar Equivalent
• LD R0, a
• LI R5, 512 (offset at the end of X[])
• Loop: LD R2, 0(Rx)
• MULTD R2, R0, R2
• LD R3, 0(Ry)
• ADD R4, R2, R3
• ST R4, 0(Ry)
• ADD Rx, Rx, 8
• ADD Ry, Ry, 8
• SUB R5, R5, 8
• BNE Loop
LD R0, a
VLD V1, Rx
VLD V2, Ry
VMUL.SV V3, R0, V1
VADD.VV V4, V2, V3
VST Ry, V4
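A rough count of executed instructions makes the comparison concrete. The sketch assumes the two listings above are taken literally: 2 setup instructions plus a 9-instruction loop body for the scalar version, and 6 instructions for the vector version, which covers at most one 64-element register. The function name is invented for this example.

```python
def dynamic_instruction_counts(elements=64):
    """Return (scalar, vector) executed-instruction counts for Y = Y + a*X."""
    scalar_setup = 2        # LD R0, a and LI R5, 512
    scalar_loop_body = 9    # LD, MULTD, LD, ADD, ST, ADD, ADD, SUB, BNE
    vector_sequence = 6     # LD, VLD, VLD, VMUL.SV, VADD.VV, VST
    scalar_total = scalar_setup + scalar_loop_body * elements
    return scalar_total, vector_sequence


scalar, vector = dynamic_instruction_counts()
print(scalar, vector)
```

Fewer executed instructions means less fetch and decode work and 64 fewer branches, though each vector instruction still takes many cycles to complete.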
Vector Length Register
• Allows us to vectorize code where the elements do not exactly fit within the vector register
• What if we need a vector of just 32 elems?
• Vector length register:
• Operate up to this element
• Can be anything from 0 to Maximum (64 in CRAY-1)
• Can also be used to support runtime vector length variability
Strip Mining
• Suppose (application vector length) AVL > MVL (max vector length)
• Each loop iteration handles MVL elems
• One odd-sized iteration handles the leftover AVL mod MVL elems (done first below)
• VL = (AVL mod MVL)
• For (i = 0; i < VL; i++)
• Y[i] = A*X[i] + Y[i]
• low = (AVL mod MVL)
• VL = MVL
• For (; low < AVL; low += MVL)
• For (i = low; i < low + VL; i++)
• Y[i] = A*X[i] + Y[i]
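Strip mining can be sketched as follows. This is a hedged illustration, assuming the odd-sized strip is handled first and every later strip uses the full MVL; each inner loop stands in for one vector instruction sequence executed with the vector length register set to `vl`. The name `strip_mined_axpy` is made up.

```python
def strip_mined_axpy(a, x, y, mvl=64):
    """Compute y = a*x + y in strips of at most mvl elements.

    Returns the new y and the vector length used for each strip.
    """
    avl = len(x)
    y = list(y)
    strip_lengths = []
    low = 0
    vl = avl % mvl                      # odd-sized first strip (may be empty)
    for _ in range(avl // mvl + 1):
        for i in range(low, low + vl):  # one vector op with VL = vl
            y[i] = a * x[i] + y[i]
        strip_lengths.append(vl)
        low += vl
        vl = mvl                        # all remaining strips are full length
    return y, strip_lengths


result, strips = strip_mined_axpy(2, list(range(150)), [1] * 150)
print(strips)
```

With 150 elements and MVL = 64 this runs one strip of 22 elements followed by two full strips of 64.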
Optimization #1: Chaining
• A subsequent vector op can be initiated as soon as a preceding vector op it depends upon produces its first result
• Example
• VADD.VV V1, V2, V3
• VADD.SV V4, R0, V1
[Figure: timelines. Unchained: the second add is initiated only after V1(63) is produced, then produces V4(1) to V4(63). Chained: the second add is initiated right after V1(1), so V4(1) to V4(63) overlap with V1(2) to V1(63).]
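A simple timing model shows why chaining matters. The model is an assumption made for this sketch, not a CRAY-1 specification: each vector op pays a fixed startup latency and then delivers one element per cycle, and the startup value used below is hypothetical.

```python
def dependent_pair_cycles(n, startup, chained):
    """Approximate cycles for two dependent vector ops over n elements."""
    if chained:
        # The second op starts as soon as the first result appears,
        # so the two element streams overlap.
        return 2 * startup + n
    # The second op waits for the whole first vector.
    return 2 * (startup + n)


n, startup = 64, 6          # startup of 6 cycles is a made-up value
print(dependent_pair_cycles(n, startup, chained=False))
print(dependent_pair_cycles(n, startup, chained=True))
```

In this model the saving is always n cycles per chained pair, so the benefit grows with vector length.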
Optimization #2: Conditional Execution
• Vector Mask Register
• Bit vector: used as predicate
• If a mask bit is 0, the operation is not performed for the corresponding pair of elements
• VLD V1, Ra
• VLD V2, Rb
• VCMP.NEQ.VV VMR, V1, V2
• VSUB.VV V1, V1, V2 (VMR)
• VST Ra, V1
• For (i = 0; i < 64; i++)
• if (A[i] != B[i]) A[i] = A[i] - B[i]
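The effect of the mask register on the loop above can be modeled directly. In this sketch the list `mask` plays the role of the vector mask register, and elements whose mask bit is 0 keep their old value; the function name is illustrative.

```python
def masked_subtract(a, b):
    """Return a[i] - b[i] where a[i] != b[i]; other elements stay unchanged."""
    mask = [x != y for x, y in zip(a, b)]            # compare sets the mask
    return [x - y if m else x for x, y, m in zip(a, b, mask)]


print(masked_subtract([5, 3, 7, 7], [5, 1, 9, 7]))
```

The branch inside the loop body disappears: the compare produces data (the mask) instead of control flow, which is what lets the loop vectorize.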
Optimization #3: Multi-lane Implementation
• Vectors are interleaved so that multiple elems can be accessed per cycle
• Replicate resources
• Equivalent of Superscalar
• Because of no intra-vector dependences and because inter-vector dependences are aligned (elem(i) to elem(i)) no need for inter-bank communications
Two Ways to View Vectorization
• Classic Approach: Inner-loop
• Think of machine as having 32 vector registers with 16 elems
• 1 instruction updates all elements of a vector
• Vectorize single dimension array operations
• A new approach: Outer-loop
• Think of machine as 16 “virtual processors” each with 32 scalar registers
• 1 instruction updates register in 16 VPs
• Good for irregular kernels
• Hardware is the same for both
• These describe the compiler’s perspective
Startup Cost
Execution Cost
Multimedia extensions
SIMD in modern CPUs
Multimedia ISA Extensions
• Intel’s MMX
• The Basics
• Instruction Set
• Examples
• Integration into Pentium
• Relationship to vector ISAs
• AMD’s 3DNow!
• Intel’s ISSE (a.k.a. KNI)
MMX: Basics
• Multimedia applications are becoming popular
• Are current ISAs a good match for them?
• Methodology:
• Consider a number of “typical” applications
• Can we do better?
• Cost vs. performance vs. utility tradeoffs
• Net Result: Intel’s MMX
• Can also be viewed as an attempt to maintain market share
• If people are going to use these kinds of applications, we had better support them
Multimedia Applications
• Most multimedia apps have lots of parallelism:
• for I = here to infinity
• out[I] = in_a[I] * in_b[I]
• At runtime:
• out[0] = in_a[0] * in_b[0]
• out[1] = in_a[1] * in_b[1]
• out[2] = in_a[2] * in_b[2]
• out[3] = in_a[3] * in_b[3]
• …..
• Also, work on short integers:
• in_a[i] is 0 to 255 for example (8-bit color)
• or, 0 to 64k (16-bit audio)
Observations
• 32-bit registers are wasted
• only using part of them and we know
• ALUs underutilized and we know
• Instruction specification is inefficient
• even though we know that a lot of the same operations will be performed, we still have to specify each of them individually
• Instruction bandwidth
• Discovering Parallelism
• Memory Ports?
• Could read four elements of an array with one 32-bit load
• Same for stores
• The hardware will have a hard time discovering this
• Coalescing and dependences
MMX Contd.
• Can do better than traditional ISA
• new data types
• new instructions
• Pack data in 64-bit words
• bytes
• “words” (16 bits)
• “double words” (32 bits)
• Operate on packed data like short vectors
• SIMD
• First used in Livermore S-1 (> 20 years ago)
MMX: Example
• Up to 8 operations (64bit) go in parallel
• Potential improvement: 8x
• In practice less but still good
• Besides another reason to think your machine is obsolete
Data Types
MMX: Instruction Set
• 57 new instructions
• Integer Arithmetic
• add/sub/mul
• multiply add
• signed/unsigned
• saturating/wraparound
• Shifts
• Compare (form mask)
• Pack/Unpack
• Move
• from/to memory
• from/to registers
Arithmetic
• Conventional: Wrap-around
• on overflow, wrap around to MININT (0 if unsigned)
• on underflow, wrap around to MAXINT
• Think of digital audio
• What happens when you turn volume to the MAX?
• Similar for pictures
• Saturating arithmetic:
• on overflow, stay at MAXINT
• on underflow, stay at MININT
• Two flavors:
• unsigned
• signed
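The difference between the two modes shows up in a few lines. The helpers below are a sketch of the arithmetic on a single 8-bit unsigned or 16-bit signed element, not MMX code, and their names are invented for this example.

```python
def add_wraparound_u8(a, b):
    """Unsigned 8-bit add that keeps only the low 8 bits."""
    return (a + b) & 0xFF


def add_saturate_u8(a, b):
    """Unsigned 8-bit add that clamps at 255."""
    return min(a + b, 0xFF)


def add_wraparound_s16(a, b):
    """Signed 16-bit add with two's-complement wraparound."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s >= 0x8000 else s


def add_saturate_s16(a, b):
    """Signed 16-bit add that clamps to [-32768, 32767]."""
    return max(-32768, min(32767, a + b))


print(add_wraparound_u8(200, 100), add_saturate_u8(200, 100))
print(add_wraparound_s16(32767, 1), add_saturate_s16(32767, 1))
```

For a bright pixel (200) plus 100, wraparound gives a dark value while saturation gives full white, which is the behavior pictures and audio usually want.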
Operations
• Mult/Add
• Compares
• Conversion
• Interpolation/Transpose
• Unpack (e.g., byte to word)
• Pack (e.g., word to byte)
Matrix Transpose 4x4
[Figure: 4x4 transpose of 16-bit words; each group is one 64-bit register, high word on the left]
• Input rows: m03 m02 m01 m00 / m13 m12 m11 m10 / m23 m22 m21 m20 / m33 m32 m31 m30
• punpcklwd on rows 1 and 0: m11 m01 m10 m00
• punpcklwd on rows 3 and 2: m31 m21 m30 m20
• punpckldq on those two results: m30 m20 m10 m00
• punpckhdq on those two results: m31 m21 m11 m01
• That’s for the first two rows
• Full result: m30 m20 m10 m00 / m31 m21 m11 m01 / m32 m22 m12 m02 / m33 m23 m13 m03
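The unpack-based transpose can be traced with a small model. In this sketch a 64-bit register is a list of four 16-bit words with index 0 as the least significant word, and the Python functions are stand-ins named after the MMX unpack instructions; using the high-word unpack for the last two result rows is an assumption, since the slide only shows the first two.

```python
def punpcklwd(dst, src):
    """Interleave the two low words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


def punpckldq(dst, src):
    """Combine the low doublewords (two words each) of dst and src."""
    return [dst[0], dst[1], src[0], src[1]]


def punpckhdq(dst, src):
    """Combine the high doublewords of dst and src."""
    return [dst[2], dst[3], src[2], src[3]]


def transpose4x4(r0, r1, r2, r3):
    lo01, lo23 = punpcklwd(r0, r1), punpcklwd(r2, r3)
    hi01, hi23 = punpckhwd(r0, r1), punpckhwd(r2, r3)
    return [punpckldq(lo01, lo23), punpckhdq(lo01, lo23),
            punpckldq(hi01, hi23), punpckhdq(hi01, hi23)]


rows = [[10 * r + c for c in range(4)] for r in range(4)]   # value rc = m_rc
for out_row in transpose4x4(*rows):
    print(out_row)
```

Eight unpack operations transpose the whole 4x4 block, with no per-element loads or stores.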
Examples
• Image Compositing
• A and B images fade-in and fade-out
• A * fade + B * (1 - fade), OR
• (A - B) * fade + B
• Image Overlay
• Sprite: e.g., mouse cursor
• Sprite: normal colors + transparent
• for i = 1 to Sprite_Length
• if A[I] = clear_color then
• Out_frame[I] = C[I]
• else Out_frame[I] = A[I]
• Matrix Transpose
• Convert from row major to column major
• Used in JPEG
Chroma Keying
• for (i=0; i<image_size; i++)
• if (x[i] == Blue) new_image[i] =y[i]
• else new_image[i] = x[i];
Chroma Keying Code
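• Movq mm3, mem1
• Load eight pixels from the person’s image
• Movq mm4, mem2
• Load eight pixels from the background image
• Pcmpeqb mm1, mm3
• mm1 is assumed to already hold eight copies of Blue; it becomes a byte mask, all ones where the person pixel is Blue
• Pand mm4, mm1
• Keep background pixels where the mask is set
• Pandn mm1, mm3
• Keep person pixels where the mask is clear
• Por mm4, mm1
• Combine the two into the new image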
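The same compare-and-mask sequence can be written out element by element. This sketch models eight-bit pixels as Python integers and follows the four MMX steps in order; the function name `chroma_key` and the sample pixel values are made up.

```python
def chroma_key(person, background, key):
    """Replace every person pixel equal to key with the background pixel."""
    # PCMPEQB: all ones where the person pixel equals the key color.
    mask = [0xFF if p == key else 0x00 for p in person]
    # PAND: background survives only where the mask is set.
    bg_part = [b & m for b, m in zip(background, mask)]
    # PANDN: person survives only where the mask is clear.
    fg_part = [p & (~m & 0xFF) for p, m in zip(person, mask)]
    # POR: merge the two partial images.
    return [b | f for b, f in zip(bg_part, fg_part)]


print(chroma_key([1, 9, 9, 4], [50, 60, 70, 80], key=9))
```

The if/else of the scalar loop turns into pure data operations, so eight pixels are selected per instruction with no branches.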
Integration into Pentium
• Major issue: OS compatibility
• Create new registers?
• Share registers with FP
• Existing OSes will save/restore
• Use 64-bit datapaths
• Pipe capable of 2 MMX IPC
• Separate MEM and Execute stage
“Recent” Multimedia Extensions
• Intel MMX: integer arithmetic only
• New algorithms -> new needs
• Need for massive amounts of FP ops
• Solution? MMX like ISA but for FP not only integer
• Example: AMD’s 3DNow!
• New data type:
• 2 packed single-precision FP
• 2 x 32-bits
• sign + exponent + significand
• New instructions
• Speedup potential: 2x
AMD’s 3DNow!
• 21 new instructions
• Average: motivated by MPEG
• Add, Sub, Reverse Sub, Mul
• Accumulate
• (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
• Comparison (create mask)
• Min, Max (pairwise)
• Reciprocal and SQRT,
• Approximation: 1st step and other steps
• Prefetch
• Integer from/to FP conversion
• All operate on packed FP data
• (-1)^sign * 1.mantissa * 2^(exponent - 127)
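Each packed element is a standard 32-bit single-precision value, so the field layout can be checked against Python's `struct` module. The decoder below is a sketch that handles normalized numbers only; the function name `decode_single` is invented here.

```python
import struct


def decode_single(bits):
    """Decode a normalized IEEE 754 single from its 32-bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # biased by 127
    fraction = bits & 0x7FFFFF              # 23 bits after the implied 1
    if exponent in (0, 255):
        raise ValueError("zero, denormal, infinity, and NaN are not handled")
    return (-1.0) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


pattern = 0x3FC00000
reference = struct.unpack(">f", struct.pack(">I", pattern))[0]
print(decode_single(pattern), reference)
```

A 64-bit 3DNow! register holds two such values side by side, which is where the 2x speedup potential comes from.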
Recent Extensions Cont.
• Intel’s ISSE
• very similar to AMD’s 3DNow!
• But has separate registers
• Lessons?
• Applications change over time
• Careful when introducing new instructions
• How useful are they?
• Cost?
• LEGACY: are they going to be useful in the future?
• Everyone has their own Multimedia Instruction set these days
• read handout
Intel’s SSE
• Multimedia/Internet?
• 70 new instructions
• Major Types:
• SIMD-FP 128-bit wide, 4 x 32-bit FP
• Data movement and re-organization
• Type conversion
• Int to Fp and vice versa
• Scalar/FP precision
• State Save/Restore
• New SSE registers not like MMX
• Memory Streaming
• Prefetch to specified hierarchy level
• New Media
• Absolute Diff, Rounded AVG, MIN/MAX
Altivec (PowerPC Mmedia Ext)
• 128-bit registers
• 8, 16, or 32 bit data types
• Scalar or single-precision FP
• 162 Instructions
• Saturation or Modulo arithmetic
• Four operand Instructions
• 3 sources, 1 target
Altivec Design Process
• Look at Mmedia Kernel
• Justify new instructions
• Video
• 8bit int LowQ, 16-bit int HighQ
• Audio
• 16bit int LowQ, SP FP HighQ
• Image Processing
• 8bit int LowQ, 16bit Int HighQ
• 3D Graphics
• 16bit int LowQ, SP FP HighQ
• Speech Recog.
• 16bit int LowQ, SP FP HighQ
• Communications/Crypto
• 8-bit or 16bit unsigned int

National Level Hackathon Participation Certificate.pdf
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 

Single Instruction Multiple Data: Another approach to ILP and performance

  • 1. Single Instruction Multiple Data Another approach to ILP and performance
  • 2. Outline • Array Processors / “True” SIMD • Vector Processors • Multimedia Extensions in modern instruction sets
  • 3. SIMD: Motivation • Let’s start with an example: • ILLIAC IV, U of Illinois, 1972 (prototype) • Reasoning: How to Improve Performance • Rely on Faster Circuits • Cost/circuit increases with circuit speed • At some point, cost/performance unfavorable • Concurrency: • Replicate Resources • Do more per cycle
  • 4. SIMD: Motivation contd. • Replication to the extreme: Multi-processor • Very flexible, but costly • Do we need all this flexibility? • There are middle-ground designs where only parts are replicated • [Figure: a uniprocessor (CU, ALU, MEM) is replicated into a multiprocessor with several CU/ALU/MEM sets]
  • 5. SIMD: Motivation Contd. • Recall: • Part of architecture is understanding application needs • Many Apps: • for i = 0 to infinity • a(i) = b(i) + c • Same operation over many tuples of data • Mostly independent across iterations
  • 6. SIMD Architecture • Replicate Datapath, not the control • All PEs work in tandem • CU orchestrates operations • [Figure: one CU driving several PEs, each with its own MEM; a PE contains an ALU, a μCU and regs]
  • 7. ILLIAC IV • Goal: • 1 Gops/sec • 256 PEs as four partitions of 64 PEs • What was built • 0.2 Minsts/sec (we’ll talk about peak performance as ops) • 64 PEs • Prototype due date 1972
  • 9. ILLIAC IV Processing Element (PE) • 64-bit numbers, float or fixed point • Multiples of smaller numbers that add up to 64-bits • Today’s multimedia extensions • PMEM: One local memory module per PE • 2K x 64-bits • 188ns access / 350ns cycle (includes conflict resolution) • 100K components per PE
  • 10. PE Contd. • PE mode: Active or Inactive, CU sets mode • All PEs operate in lock-step • Routing insts to move data from PE to PE • The CU can execute instructions while PEs are busy • Another degree of concurrency • Datatypes • 64b float • 64b logical • 48b fixed • 32b float • 24b fixed • 8b fixed
  • 11. Peak Compute Bandwidth • 64 PEs • Each can perform: • 1 64b, 2 32b, or 8 8b operations • Or, in total: • 64 elems, 128 elems, or 512 elems • Peak: • 150M 64b ops/sec up to 10G 32b ops/sec • The last figure is for integer ops • Each int op takes 66ns (4 per PE in parallel)
  • 12. Control Unit (CU) • A simple CPU • Can execute instructions w/o PE intervention • Coordinates all PEs • 64 64b registers, D0-D63 • 4 64b Accumulators A0-A3 • Ops: • Integer ops • Shifts • Boolean • Loop control • Index PMem • [Figure: CU with registers D0-D63, accumulators A0-A3 and an ALU]
  • 13. Processing Element (PE) • 64 bit regs • A: Accumulator • B: 2nd operand for binary ops • R: Routing (inter-PE communication) • S: Temporary • X: Index for PMEM, 16 bits • D: mode, 8 bits • Communication: • PMEM only from local PE • Amongst PEs with R • [Figure: PE i with registers A, B, R, S, X, D and an ALU, its local PMEM i, and routing links to PEs i-1, i+1, i-8 and i+8]
  • 14. Datapaths • CU Bus: Insts and Data from PMEM to CU in 8 words • CDB: Broadcast to all PEs • E.g., constants for adds • Routing Network: amongst R registers • Mode: To activate/de-activate PEs • [Figure: the CU connected to each PE/PMEM pair through the Control Unit Bus, Mode lines, Common Data Bus and Routing network]
  • 15. Routing Network • [Figure: the 64 PEs, numbered 0 to 63, drawn as an 8 x 8 grid; PE i is linked to PEs i-8, i+8, i-1 and i+1, and the links at the grid edges wrap around]
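The neighbor pattern on this slide can be written down in a few lines. The following is a minimal sketch, assuming that the edge links wrap around with plain modulo-64 arithmetic; `routing_neighbors` and its parameters are hypothetical names used only for illustration.

```python
def routing_neighbors(pe: int, num_pes: int = 64, row: int = 8) -> dict:
    """Return the four PEs that PE `pe` can reach in one routing step.

    Assumption: indices wrap around modulo `num_pes`, so PE 63's
    i+1 neighbor is PE 0 and PE 0's i-8 neighbor is PE 56.
    """
    return {
        "i-1": (pe - 1) % num_pes,
        "i+1": (pe + 1) % num_pes,
        "i-8": (pe - row) % num_pes,
        "i+8": (pe + row) % num_pes,
    }


if __name__ == "__main__":
    print(routing_neighbors(0))
    print(routing_neighbors(63))
```

Under this assumption a transfer over any distance can be built from a short sequence of steps of size 1 and 8.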
  • 16. Using ILLIAC IV: Example #1 • DO 10 I = 1 TO 64 • 10 C(I) = A(I) + B(I) • LDA a + 2 (load A(i) into A; same a per PMEM) • ADDRN a + 1 (add B(i) into A) • STA a (store A into C(i)) • [Figure: each PMEM i, for i = 1 to 64, holds C(i), A(i) and B(i), with C(i) at address a]
  • 17. Using ILLIAC IV: Example #2 • DO 10 I = 2 TO 64 • 10 A(I) = B(I) + A(I-1) • Expand into: • A(N) = A(1) + Sum B(i) [i = 2 to N] • We get: • DO 10 N=2 TO 64 • S = S + B(N) • 10 A(N) = S
  • 18. Using ILLIAC IV: Example #2 contd. • 1. Enable all PEs • 2. All load A from a • 3. i = 0 • 4. All R = A (including those inactive) • 5. All route R to PE(2^i) to the right • 6. j = 2^i - 1 • 7. Disable all PEs 1 through j • 8. A = A + R (R contains a partial sum of many A(i)) • 9. i = i + 1 • 10. if i < lg(64) goto 4 • 11. Enable All PEs • 12. All store A at (a + 1)
  • 19. Using ILLIAC IV: Example #2 contd. • Initial State: • PMEM(1)[a] = A(1) • PMEM(1+i)[a] = B(i+1) • For example, at PE1 • STEP 1: A = A(1) • from PE2 we get B(2) • STEP 2: A = A(1) + B(2) • from PE4 we get B(4) + B(5) • STEP 3: A = A(1) + B(2) + B(4) + B(5) • From PE8 we get B(8) + B(7) + B(12) + B(13)
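The log-step scheme of Example #2 can be simulated in ordinary Python. This is a minimal sketch under stated assumptions: PEs are numbered from 0, routing "to the right" by 2^i is modeled as a cyclic list shift, and the PEs that would receive wrapped-around data are the ones disabled in each step. The name `lockstep_prefix_sum` is hypothetical and does not come from ILLIAC IV software.

```python
def lockstep_prefix_sum(values):
    """Simulate the recurrence A(I) = B(I) + A(I-1) in lg(n) lock-step rounds.

    values[0] plays the role of A(1); values[k] plays the role of B(k+1).
    The result at index k is A(k+1).
    """
    n = len(values)
    a = list(values)          # one accumulator A per PE
    shift = 1                 # routing distance 2^i
    while shift < n:
        r = list(a)           # all PEs copy A into R
        # every PE receives the R of the PE `shift` positions to its left
        routed = [r[(pe - shift) % n] for pe in range(n)]
        for pe in range(n):
            if pe >= shift:   # PEs below `shift` are disabled this round
                a[pe] += routed[pe]
        shift *= 2
    return a


if __name__ == "__main__":
    print(lockstep_prefix_sum([1, 2, 3, 4, 5]))
```

For 64 elements the loop runs 6 rounds instead of the 63 dependent additions of the sequential DO loop.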
  • 21. Vector Processors • Vector Datatype • Apply same operation on all elements of the vector • No dependences amongst elements • Same motivation as SIMD
  • 22. Properties of Vector Processors • One Vector instruction implies lots of work • Fewer instructions • Each result independent of previous result • Multiple operations in parallel • Simpler design; no need for dependence checks • Higher clock rate • Compiler must help • Fewer Branches • Memory access pattern per vector inst known • Prefetching effect • Amortize mem latency • Can exploit high-bandwidth mem system • Less/no need for data caches
  • 23. Classes of Vector Processors • Memory to memory • Vectors are in memory • Load/store • Vectors are in registers • Load/store to communicate with memory • This prevailed
  • 24. Historical Perspective • Mid-60s: performance concerns • SIMD processor arrays • Also fast Scalar machines • CDC 6600 • Texas Instruments ASC, 1972 • Memory to memory vector • Cray Develops CRAY-1, 1978
  • 25. CRAY-1 • Fast and simple scalar processor • 80 MHz • Vector register concept • Much simpler ISA • Reduced memory pressure • Tight integration of scalar and vector units • Cylindrical design to minimize wire lengths • Freon Cooling
  • 27. Components of Vector Processor • Scalar CPU: registers, datapaths, instruction fetch • Vector Registers: • Fixed length memory bank holding a single vector reg • Typically 8-32 Vregs, up to 8Kbits per Vreg • At least 2 read and 1 write ports • Can be viewed as an array of N elements • Vector Functional Units: • Fully pipelined. New op per cycle • Typically 2 to 8 FUs: integer and FP • Multiple datapaths to process multiple elements per cycle if needed • Vector Load/Store Units (LSUs): • Fully pipelined • Multiple elems fetched/stored per cycle • May have multiple LSUs • Cross-bar: • Connects FUs, LSUs and registers
  • 28. CRAY-1 Organization • Simple 16-bit Reg-to-Reg ISA • Use two 16-bit parcels to get an immediate • Natural combinations of scalar and vector • Scalar bit-vectors match vector length • Gather/Scatter M-R • Cond. Merge
  • 29. CRAY-1 CPU • Scalar and vector modes • 12.5 ns clock • 64-bit words • Int & FP units • 12 FUs • 8 24-bit A regs • 64 B regs (temp storage for A) • 8 64-bit S regs • 64 T regs (temp storage for S) • 8 64-elem, 64-bit elem V regs
  • 30. CRAY-1 CPU • Vector Length Register • Can use only a prefix of a vreg • Vector Mask Register • Can use only a subset of a vreg • Real Time Register (counts clock cycles) • Four instruction buffers • 64 16-bit parcels • 128 Basic Instructions • Interrupt Control • NO virtual memory system
  • 31. Cray-1 Memory System • 1M 64b words + 8 check bits (single error correction, double error detection) • 16 banks of 64K words • 4 clock-period bank cycle time • 1 word per clock for B, T and Vreg • 1 word per 2 clocks for A & S • 4 words per clock for inst buffers
  • 32. Instruction Format • Fields: g, h, i, j, k, m • Bit positions: 0-3, 4-6, 7-9, 10-12, 13-15, 16-31 • Bit counts: 4, 3, 3, 3, 3, 16 • g, h: opcode • i, j, k: Rd, Rs1, Rs2 • A/S B/T
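To make the field layout concrete, here is a minimal sketch that splits the first 16-bit parcel into its g, h, i, j and k fields. It assumes that bit 0 is the most significant bit of the parcel, which the slide does not state; `decode_parcel` is a hypothetical helper, not part of any CRAY-1 tool.

```python
def decode_parcel(parcel: int) -> dict:
    """Split a 16-bit parcel into the fields g(4) h(3) i(3) j(3) k(3).

    Assumption: bit 0 in the slide's numbering is the most significant
    bit, so g occupies the top four bits of the integer.
    """
    if not 0 <= parcel < (1 << 16):
        raise ValueError("parcel must fit in 16 bits")
    return {
        "g": (parcel >> 12) & 0xF,  # bits 0-3
        "h": (parcel >> 9) & 0x7,   # bits 4-6
        "i": (parcel >> 6) & 0x7,   # bits 7-9
        "j": (parcel >> 3) & 0x7,   # bits 10-12
        "k": parcel & 0x7,          # bits 13-15
    }


if __name__ == "__main__":
    example = (0b1011 << 12) | (5 << 9) | (1 << 6) | (2 << 3) | 3
    print(decode_parcel(example))
```

The optional m field (bits 16-31) would simply be the second parcel of a two-parcel instruction.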
  • 33. Basic Vector Instructions • Inst Operands Operation Comment • VADD.VV V1, V2, V3 V1=V2+V3 vector+vector • VADD.SV V1, R0, V2 V1=R0+V2 scalar+vector • VMUL.VV V1, V2, V3 V1=V2*V3 vector * vector • VMUL.SV V1, R0, V2 V1=R0*V2 scalar * vector • VLD V1, R0 V1=M[R0…R0+63] stride = 1 • VLDS V1, R1, R2 V1=M[R1…R1+63*R2] stride=R2 • VLDX V1, R1, V2 V1=M[R1+V2[i], i=0 to 63] gather • VST store equiv of VLD • VSTS store equiv of VLDS • VSTX V1, R1 M[R1+V2[i], i=0 to 63]=V1 scatter
  • 34. Vector Memory Operations • Load/Store move groups of data between memory and registers • Addressing Modes • Unit-stride: Fastest • Non-unit, constant stride (interleaved memory helps) • Indexed (gather-scatter) • Vector equiv of register indirect • Sparse arrays • Can vectorize more loops
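The three addressing modes differ only in how element addresses are generated. The sketch below models memory as a Python list and a vector register as a list of elements; the function names are hypothetical and only mirror the VLD, VLDS, VLDX and VSTX rows of the instruction table.

```python
def load_unit_stride(mem, base, vl=64):
    """VLD-like: consecutive elements starting at `base`."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem, base, stride, vl=64):
    """VLDS-like: every `stride`-th element starting at `base`."""
    return [mem[base + i * stride] for i in range(vl)]


def load_gather(mem, base, index_vector):
    """VLDX-like: element i comes from mem[base + index_vector[i]]."""
    return [mem[base + idx] for idx in index_vector]


def store_scatter(mem, base, index_vector, values):
    """VSTX-like: element i goes to mem[base + index_vector[i]]."""
    for idx, value in zip(index_vector, values):
        mem[base + idx] = value


if __name__ == "__main__":
    memory = list(range(100))
    print(load_strided(memory, 0, 8, vl=4))
    print(load_gather(memory, 50, [3, 0, 7]))
```

A strided load is what a column access of a row-major matrix looks like, while gather and scatter cover sparse arrays whose nonzero positions are held in an index vector.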
  • 35. Vector Code Example • Y[0:63] = Y[0:63] + a * X[0:63] • LD R0, a • VLD V1, Rx Load X[] in V1 • VLD V2, Ry Load Y[] in V2 • VMUL.SV V3, R0, V1 V3 = X[]*a • VADD.VV V4, V2, V3 V4 = Y[]+X[]*a • VST Ry, V4 store in Y[]
  • 36. Scalar Equivalent • LD R0, a • LI R5, 512 (offset at the end of X[]) • Loop: LD R2, 0(Rx) • MULTD R2, R0, R2 • LD R3, 0(Ry) • ADD R4, R2, R3 • ST R4, 0(Ry) • ADD Rx, Rx, 8 • ADD Ry, Ry, 8 • SUB R5, R5, 8 • BNE Loop • Vector version, for comparison: • LD R0, a • VLD V1, Rx • VLD V2, Ry • VMUL.SV V3, R0, V1 • VADD.VV V4, V2, V3 • VST Ry, V4
  • 37. Vector Length Register • Allows us to vectorize code where the elements do not exactly fit within the vector register • What if we need a vector of just 32 elems? • Vector length register: • Operate up to this element • Can be anything from 0 to Maximum (64 in CRAY-1) • Can also be used to support runtime vector length variability
  • 38. Strip Mining • Suppose (application vector length) AVL > MVL (max vector length) • Each loop iteration handles MVL elems • One iteration handles the leftover AVL mod MVL elems (the first one in the code that follows) • low = 0 • VL = (AVL mod MVL) • for (j = 0; j <= AVL / MVL; j++) • for (i = low; i < low + VL; i++) • Y[i] = A*X[i] + Y[i] • low = low + VL • VL = MVL
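Strip mining as described on this slide can be checked with a short simulation. This is a minimal sketch, assuming the odd-size piece of AVL mod MVL elements is processed first and every later strip uses the full MVL; `strip_mined_axpy` is a hypothetical name, and the inner loop stands in for one vector instruction sequence executed with the vector length register set to `vl`.

```python
def strip_mined_axpy(a, x, y, mvl=64):
    """Compute y[i] = a * x[i] + y[i] in strips of at most `mvl` elements.

    Updates `y` in place and returns the vector length used for each
    strip, so the strip structure can be inspected.
    """
    avl = len(x)
    low = 0
    vl = avl % mvl                      # odd-size piece goes first
    strip_lengths = []
    for _ in range(avl // mvl + 1):
        for i in range(low, low + vl):  # one "vector operation" of length vl
            y[i] = a * x[i] + y[i]
        strip_lengths.append(vl)
        low += vl
        vl = mvl                        # all remaining strips are full length
    return strip_lengths


if __name__ == "__main__":
    xs = list(range(150))
    ys = [1] * 150
    print(strip_mined_axpy(2, xs, ys))
```

When AVL is an exact multiple of MVL the first strip has length 0, which matches a vector length register that is allowed to hold 0.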
  • 39. Optimization #1: Chaining • Subsequent vector op can be initiated as soon as a preceding vector op it depends upon produces its first result • Example • Vadd.vv v1, v2, v3 • Vadd.sv v4, v1, R0 • [Figure: timelines of V1(1)...V1(63) and V4(1)...V4(63). Unchained: the second add is initiated only after the last element of V1 is produced. Chained: the second add is initiated as soon as V1(1) is available]
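The benefit of chaining can be estimated with a very simple timing model. The sketch below assumes each functional unit has a fixed start-up latency and then delivers one element per cycle; the latency of 6 cycles in the usage lines is an arbitrary illustrative value, not a CRAY-1 figure, and both function names are hypothetical.

```python
def unchained_cycles(n: int, latency: int) -> int:
    """Two dependent vector ops run back to back.

    The second op starts only after the first has produced all n
    elements, so each op costs latency + n cycles.
    """
    return 2 * (latency + n)


def chained_cycles(n: int, latency: int) -> int:
    """Two dependent vector ops with chaining.

    The second op starts as soon as the first result is available,
    so the element streams overlap and only the latencies add up.
    """
    return 2 * latency + n


if __name__ == "__main__":
    print(unchained_cycles(64, 6), chained_cycles(64, 6))
```

In this model the saving is n cycles per chained pair, so it approaches a factor of two for long vectors.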
  • 40. Optimization #2: Conditional Execution • Vector Mask Register • Bit vector: used as predicate • If a bit is 0, the operation is not performed for the corresponding element pair • VLD V1, Ra • VLD V2, Rb • VCMP.NEQ.VV VMR, V1, V2 • VSUB.VV V3, V1, V2 (VMR) • VST V3, Ra • For (i = 0; i < 64; i++) • if (A[i] != B[i]) A[i] = A[i] - B[i]
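The effect of the vector mask register can be mimicked with a list of booleans. This is a minimal sketch of the loop on this slide; `masked_subtract` is a hypothetical name, and elements whose mask bit is 0 are assumed to keep their old A value.

```python
def masked_subtract(a, b):
    """Return A after: if (A[i] != B[i]) A[i] = A[i] - B[i]."""
    # VCMP.NEQ.VV: build the mask, one bit per element pair
    vmr = [x != y for x, y in zip(a, b)]
    # VSUB.VV under mask: subtract only where the mask bit is set
    return [x - y if bit else x for x, y, bit in zip(a, b, vmr)]


if __name__ == "__main__":
    print(masked_subtract([5, 3, 7, 2], [1, 3, 7, 9]))
```

The branch inside the scalar loop turns into data (the mask), which is what lets the whole loop body be issued as straight-line vector code.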
  • 41. Optimization #3: Multi-lane Implementation • Vectors are interleaved so that multiple elems can be accessed per cycle • Replicate resources • Equivalent of Superscalar • Because of no intra-vector dependences and because inter-vector dependences are aligned (elem(i) to elem(i)) no need for inter-bank communications
  • 42. Two Ways to View Vectorization • Classic Approach: Inner-loop • Think of the machine as having 32 vector registers with 16 elems • 1 instruction updates all elements of a vector • Vectorize single dimension array operations • A new approach: Outer-loop • Think of the machine as 16 “virtual processors” each with 32 scalar registers • 1 instruction updates a register in 16 VPs • Good for irregular kernels • Hardware is the same for both • These describe the compiler’s perspective
  • 46. Multimedia ISA Extensions • Intel’s MMX • The Basics • Instruction Set • Examples • Integration into Pentium • Relationship to vector ISAs • AMD’s 3DNow! • Intel’s ISSE (a.k.a. KNI)
  • 47. MMX: Basics • Multimedia applications are becoming popular • Are current ISAs a good match for them? • Methodology: • Consider a number of “typical” applications • Can we do better? • Cost vs. performance vs. utility tradeoffs • Net Result: Intel’s MMX • Can also be viewed as an attempt to maintain market share • If people are going to use this kind of application, we had better support it
  • 48. Multimedia Applications • Most multimedia apps have lots of parallelism: • for I = here to infinity • out[I] = in_a[I] * in_b[I] • At runtime: • out[0] = in_a[0] * in_b[0] • out[1] = in_a[1] * in_b[1] • out[2] = in_a[2] * in_b[2] • out[3] = in_a[3] * in_b[3] • ….. • Also, work on short integers: • in_a[i] is 0 to 255 for example (8-bit color) • or, 0 to 64K (16-bit audio)
  • 49. Observations • 32-bit registers are wasted • only using part of them and we know • ALUs underutilized and we know • Instruction specification is inefficient • even though we know that a lot of the same operations will be performed, we still have to specify each of them individually • Instruction bandwidth • Discovering Parallelism • Memory Ports? • Could read four elements of an array with one 32-bit load • Same for stores • The hardware will have a hard time discovering this • Coalescing and dependences
  • 50. MMX Contd. • Can do better than traditional ISA • new data types • new instructions • Pack data in 64-bit words • bytes • “words” (16 bits) • “double words” (32 bits) • Operate on packed data like short vectors • SIMD • First used in Livermore S-1 (> 20 years)
  • 51. MMX: Example • Up to 8 operations (64-bit) go in parallel • Potential improvement: 8x • In practice less, but still good • Besides, another reason to think your machine is obsolete
  • 53. MMX: Instruction Set • 57 new instructions • Integer Arithmetic • add/sub/mul • multiply add • signed/unsigned • saturating/wraparound • Shifts • Compare (form mask) • Pack/Unpack • Move • from/to memory • from/to registers
  • 54. Arithmetic • Conventional: Wrap-around • on overflow, wrap to MININT • on underflow, wrap to MAXINT • Think of digital audio • What happens when you turn volume to the MAX? • Similar for pictures • Saturating arithmetic: • on overflow, stay at MAXINT • on underflow, stay at MININT • Two flavors: • unsigned • signed
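The difference between the two arithmetic styles is easy to see on signed 16-bit values. The following is a minimal sketch with hypothetical helper names; it models a single lane, whereas a packed instruction would apply the same rule to every lane of the 64-bit word.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def add_wrap(a: int, b: int) -> int:
    """Signed 16-bit add with wrap-around (two's complement)."""
    s = (a + b) & 0xFFFF                         # keep the low 16 bits
    return s - 0x10000 if s & 0x8000 else s      # reinterpret as signed


def add_saturate(a: int, b: int) -> int:
    """Signed 16-bit add that clamps at INT16_MIN / INT16_MAX."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


if __name__ == "__main__":
    print(add_wrap(32767, 1), add_saturate(32767, 1))
    print(add_wrap(-32768, -1), add_saturate(-32768, -1))
```

For audio, the wrapped result turns a loud sample into one of opposite sign, while the saturated result merely clips it.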
  • 55. Operations • Mult/Add • Compares • Conversion • Interpolation/Transpose • Unpack (e.g., byte to word) • Pack (e.g., word to byte)
  • 56. Matrix Transpose 4x4 • [Figure: rows (m03 m02 m01 m00) and (m13 m12 m11 m10) are interleaved with punpcklwd into (m11 m01 m10 m00); rows (m23 m22 m21 m20) and (m33 m32 m31 m30) are interleaved with punpcklwd into (m31 m21 m30 m20); punpckldq and punpckhdq then combine these into (m30 m20 m10 m00) and (m31 m21 m11 m01)] • That’s for the first two rows
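The unpack-based transpose can be traced with lists standing in for 64-bit registers holding four 16-bit words. This is a sketch under assumptions: each list is written from the lowest word to the highest (the reverse of the figure's left-to-right order), the helpers follow the interleaving rule of the MMX unpack instructions with the first argument as destination, and using punpckhwd for the last two output rows is an extension beyond what the slide shows.

```python
def punpcklwd(dst, src):
    """Interleave the two low words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


def punpckldq(dst, src):
    """Low doubleword (two words) of dst, then low doubleword of src."""
    return dst[0:2] + src[0:2]


def punpckhdq(dst, src):
    """High doubleword of dst, then high doubleword of src."""
    return dst[2:4] + src[2:4]


def transpose4x4(rows):
    """Transpose a 4x4 matrix of words using only unpack steps."""
    r0, r1, r2, r3 = rows
    lo01, lo23 = punpcklwd(r0, r1), punpcklwd(r2, r3)
    hi01, hi23 = punpckhwd(r0, r1), punpckhwd(r2, r3)
    return [
        punpckldq(lo01, lo23),  # column 0
        punpckhdq(lo01, lo23),  # column 1
        punpckldq(hi01, hi23),  # column 2
        punpckhdq(hi01, hi23),  # column 3
    ]


if __name__ == "__main__":
    m = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
    for row in transpose4x4(m):
        print(row)
```

Eight unpack operations replace sixteen individual element moves, and no arithmetic is involved.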
  • 57. Examples • Image Compositing • A and B images fade-in and fade-out • A * fade + B * (1 - fade), OR • (A - B) * fade + B • Image Overlay • Sprite: e.g., mouse cursor • Sprite: normal colors + transparent • for i = 1 to Sprite_Length • if A[I] = clear_color then • Out_frame[I] = C[I] • else Out_frame[I] = A[I] • Matrix Transpose • Convert from row major to column major • Used in JPEG
  • 58. Chroma Keying • for (i=0; i<image_size; i++) • if (x[i] == Blue) new_image[i] =y[i] • else new_image[i] = x[i];
  • 59. Chroma Keying Code • Movq mm3, mem1 • Load eight pixels from the person’s image • Movq mm4, mem2 • Load eight pixels from the background image • Pcmpeqb mm1, mm3 (mm1 is assumed to already hold eight copies of Blue; it becomes the mask) • Pand mm4, mm1 • Pandn mm1, mm3 • Por mm4, mm1
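The branch-free selection performed by the compare, and, and-not, or sequence can be followed one byte lane at a time. This is a minimal sketch, assuming 8-bit pixels and that the key color is available in every lane before the compare; `chroma_key` is a hypothetical name.

```python
def chroma_key(x, y, blue):
    """Return y[i] where x[i] == blue, else x[i], without branching per pixel.

    x: pixels of the person's image, y: pixels of the background image.
    """
    # pcmpeqb: 0xFF where the pixel equals the key color, 0x00 elsewhere
    mask = [0xFF if px == blue else 0x00 for px in x]
    # pand: keep background pixels only where the mask is set
    from_background = [bg & m for bg, m in zip(y, mask)]
    # pandn: keep foreground pixels only where the mask is clear
    from_person = [px & (~m & 0xFF) for px, m in zip(x, mask)]
    # por: merge the two disjoint selections
    return [a | b for a, b in zip(from_background, from_person)]


if __name__ == "__main__":
    print(chroma_key([1, 9, 3, 9], [7, 8, 5, 6], blue=9))
```

Because the mask is all ones or all zeros in each lane, the and/and-not pair acts as a per-lane multiplexer.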
  • 60. Integration into Pentium • Major issue: OS compatibility • Create new registers? • Share registers with FP • Existing OSes will save/restore • Use 64-bit datapaths • Pipe capable of 2 MMX IPC • Separate MEM and Execute stage
  • 61. “Recent” Multimedia Extensions • Intel MMX: integer arithmetic only • New algorithms -> new needs • Need for massive amounts of FP ops • Solution? MMX-like ISA but for FP, not only integer • Example: AMD’s 3DNow! • New data type: • 2 packed single-precision FP • 2 x 32-bits • sign + exponent + significand • New instructions • Speedup potential: 2x
  • 62. AMD’s 3DNow! • 21 new instructions • Average: motivated by MPEG • Add, Sub, Reverse Sub, Mul • Accumulate • (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2) • Comparison (create mask) • Min, Max (pairwise) • Reciprocal and SQRT • Approximation: 1st step and other steps • Prefetch • Integer from/to FP conversion • All operate on packed FP data • (-1)^sign * 2^(exponent - 127) * 1.mantissa
  • 63. Recent Extensions Cont. • Intel’s ISSE • very similar to AMD’s 3DNow! • But has separate registers • Lessons? • Applications change over time • Careful when introducing new instructions • How useful are they? • Cost? • LEGACY: are they going to be useful in the future? • Everyone has their own Multimedia Instruction set these days • read handout
  • 64. Intel’s SSE • Multimedia/Internet? • 70 new instructions • Major Types: • SIMD-FP 128-bit wide, 4 x 32-bit FP • Data movement and re-organization • Type conversion • Int to FP and vice versa • Scalar/FP precision • State Save/Restore • New SSE registers not like MMX • Memory Streaming • Prefetch to specified hierarchy level • New Media • Absolute Diff, Rounded AVG, MIN/MAX
  • 65. Altivec (PowerPC Mmedia Ext) • 128-bit registers • 8, 16, or 32 bit data types • Scalar or single-precision FP • 162 Instructions • Saturation or Modulo arithmetic • Four operand Instructions • 3 sources, 1 target
  • 66. Altivec Design Process • Look at Mmedia Kernel • Justify new instructions • Video • 8bit int LowQ, 16-bit int HighQ • Audio • 16bit int LowQ, SP FP HighQ • Image Processing • 8bit int LowQ, 16bit Int HighQ • 3D Graphics • 16bit int LowQ, SP FP HighQ • Speech Recog. • 16bit int Low Q, Sp FP HighQ • Communications/Crypto • 8-bit or 16bit unsigned int