SlideShare a Scribd company logo
1 of 34
CS 295: Modern Systems
Modern Processors – SIMD Extensions
Sang-Woo Jun
Spring, 2019
Modern Processor Topics
 Transparent Performance Improvements
o Pipelining, Caches
o Superscalar, Out-of-Order, Branch Prediction, Speculation, …
o Covered in CS250A and others
 Explicit Performance Improvements
o SIMD extensions, AES extensions, …
o …
 Non-Performance Topics
o Virtualization extensions, secure enclaves, transactional memory, …
Flynn Taxonomy (1966) Recap
Data Stream
Single Multi
Instruction
Stream
Single SISD
(Single-Core Processors)
SIMD
(GPUs, Intel SSE/AVX extensions, …)
Multi MISD
(Systolic Arrays, …)
MIMD
(VLIW, Parallel Computers)
Flynn Taxonomy Recap
Single-Instruction
Single-Data
(Single-Core Processors)
Multi-Instruction
Single-Data
(Systolic Arrays,…)
Single-Instruction
Multi-Data
(GPUs, Intel SIMD Extensions)
Multi-Instruction
Multi-Data
(Parallel Computers)
Intel SIMD Extensions
 New instructions, new registers
 Introduced in phases/groups of functionality
o SSE – SSE4 (1999 –2006)
• 128 bit width operations
o AVX, FMA, AVX2, AVX-512 (2008 – 2015)
• 256 – 512 bit width operations
• AVX-512 chips not available yet (as of Spring, 2019), but soon!
 F16C, and more to come?
ZMM0
YMM0
Intel SIMD Registers (AVX-512)
XMM0
ZMM1
YMM1
XMM1
ZMM31
YMM31
XMM31
…
 XMM0 – XMM15
o 128-bit registers
o SSE
 YMM0 – YMM15
o 256-bit registers
o AVX, AVX2
 ZMM0 – ZMM31
o 512-bit registers
o AVX-512
SSE/AVX Data Types
YMM0
float float float float
double double
int32 int32 int32 int32
float float float float
double double
int32 int32 int32 int32
16 16
8 8
16 16
8 8 8 8 8 8
16 16
8 8
16 16
8 8 8 8 8 8
16 16
8 8
16 16
8 8 8 8 8 8
16 16
8 8
16 16
8 8 8 8 8 8 Operation on
32 8-bit values
in one instruction!
255 0
Sandy Bridge Microarchitecture
e.g., “Port 5 pressure” when code uses too much shuffle operations
Skylake Die Layout
Aside: Do I Have SIMD Capabilities?
 less /proc/cpuinfo
Processor Microarchitectural Effects on
Power Efficiency
 The majority of power consumption of a CPU is not from the ALU
o Cache management, data movement, decoding, and other infrastructure
o Adding a few more ALUs should not impact power consumption
 Indeed, 4X performance via AVX does not add 4X power consumption
o FromiI7 4770K measurements:
o Idle: 40 W
o Under load : 117 W
o Under AVX load : 128 W
 Aside: Not all AVX operations are active at the same time –
“Specialization horseman”?
Compiler Automatic Vectorization
 In gcc, flags “-O3 –mavx –mavx2” attempts automatic vectorization
 Works pretty well for simple loops
 But not for anything complex
o E.g., naïve bubblesort code not parallelized at all
Generated using GCC explorer: https://gcc.godbolt.org/
Intel SIMD Intrinsics
 Use C functions instead of inline assembly to call AVX instructions
 Compiler manages registers, etc
 Intel Intrinsics Guide
o https://software.intel.com/sites/landingpage/IntrinsicsGuide
o One of my most-visited pages…
e.g.,
__m256 a, b, c;
__m256 d = _mm256_fmadd_ps(a, b, c); // d[i] = a[i]*b[i]+c[i] for i = 0 …7
Data Types in AVX/AVX2
Type Description
__m128 128-bit vector containing 4 floats
__m128d 128-bit vector containing 2 doubles
__m128i 128-bit vector containing integers
__m256 256-bit vector containing 8 floats
__m256d 256-bit vector containing 4 doubles​
__m256i 256-bit vector containing integers
1~16 signed/unsigned integers
1~32 signed/unsigned integers
__m512 variants in the future for AVX-512
Intrinsic Naming Convention
 _mm<width>_[function]_[type]
o E.g., _mm256_fmadd_ps :
perform fmadd (floating point multiply-add) on
256 bits of
packed single-precision floating point values (8 of them)
Width Prefix
128 _mm_
256 _mm256_
512 _mm512_
Type Postfix
Single precision _ps
Double precision _pd
Packed signed integer _epiNNN (e.g., epi256)
Packed unsigned integer _epuNNN (e.g., epu256)
Scalar integer _siNNN (e.g., si256)
Not all permutations exist! Check guide
Load/Store/Initialization Operations
 Initialization
o _mm256_setzero_ps/pd/epi32/…
o _mm256_set_...
o …
 Load/Store : Address MUST be aligned by 256-bit
o _mm256_load_...
o _mm256_store_...
 And many more! (Masked read/write, strided reads, etc…)
e.g.,
__mm256d t = _mm256_load_pd(double const * mem); // loads 4 double values from mem to t
__mm256i v = _mm256_set_epi32(h,g,f,e,d,c,b,a); // loads 8 integer values to v
Vertical Vector Instructions
 Add/Subtract/Multiply
o _mm256_add/sub/mul/div_ps/pd/epi
• Mul only supported for epi32/epu32/ps/pd
• Div only supported for ps/pd
• Consult the guide!
 Max/Min/GreaterThan/Equals
 Sqrt, Reciprocal, Shift, etc…
 FMA (Fused Multiply-Add)
o (a*b)+c, -(a*b)-c, -(a*b)+c, and other permutations!
o Consult the guide!
 …
a
b
c
d
× × × ×
+ + + +
=
__m256 a, b, c;
__m256 d = _mm256_fmadd_pd(a, b, c);
= =
=
Integer Multiplication Caveat
 Integer multiplication of two N bit values require 2N bits
 E.g., __mm256_mul_epi32 and __mm256_mul_epu32
o Only use the lower 4 32 bit values
o Result has 4 64 bit values
 E.g., __mm256_mullo_epi32 and __mm256_mullo_epu32
o Uses all 8 32 bit values
o Result has 8 truncated 32 bit values
 And more options!
o Consult the guide…
Horizontal Vector Instructions
 Horizontal add/subtraction
o Adds adjacent pairs of values
o E.g., __m256d _mm256_hadd_pd (__m256d a, __m256d b)
a b
c
+
+ +
+
Shuffling/Permutation
 Within 128-bit lanes
o _mm256_shuffle_ps/pd/… (a,b, imm8)
o _mm256_permute_ps/pd
o _mm256_permutevar_ps/…
 Across 128-bit lanes
o _mm256_permute2x128/4x64 : Uses 8 bit control
o _mm256_permutevar8x32/… : Uses 256 bit control
 Not all type permutations exist, but variables can be cast back and forth
Matt Scarpino, “Crunching Numbers with AVX and AVX2,” 2016
Blend
 Merges two vectors using a control
o _mm256_blend_... : Uses 8 bit control
• e.g., _mm256_blend_epi32
o _mm256_blendv_... : Uses 256 bit control
• e.g., _mm256_blendv_epi8
a b
c
? ? ? ?
Alignr
 Right-shifts concatenated value of two registers, by byte
o Often used to implement circular shift by using two same register inputs
o _mm256_alignr_epi8 (a, b, count)
a b
c
Example of 64-bit values being shifted by 8
Helper Instructions
 Cast
o __mm256i <-> __mm256, etc…
o Syntactic sugar -- does not spend cycles
 Convert
o 4 floats <-> 4 doubles, etc…
 Movemask
o __mm256 mask to -> int imm8
 And many more…
Case Study: Sorting
 Important, fundamental application!
 Can be parallelized via divide-and-conquer
 How can SIMD help?
Reminder: Sorting Network
 Network structure for sorting fixed number of values
 Type of a “comparator network”
o comparators perform compare-and-swap
 Easily pipelined and parallelized
Example 4-element sorting network
5 comparators, 3 cycle pipelined
Remind: Sorting Network
 Simple to generate correct sorting networks, but optimal structures are
not well-known
Source: Wikipedia (Sorting Network) Some known optimal sorting networks
SIMD And Sorting Networks
 AVX compare/max/min instructions do not work internally to a register
o Typically same values copied to two different registers, permuted, compared
o Or, “two register trick”, and many more optimizations
 In this case, optimal sorting networks are typically not very helpful
o Irregular computation means bubbles in the compute structure
o Simple, odd-even sorting network works well
8-element odd-even sorting network
What if I want to sort more than 8 values?
SIMD And Sorting Networks
 Typically, we are sorting more than one set of tuples
o If we have multiple tasks, we can have task-level parallelism – Optimized networks!
o Sort multiple tuples at the same time
 Then we need to transpose the 8 8-element variables
o Each variable has a value for each sorting network instance
o Non-SIMD works, or a string of unpackhi/unpacklo/blend
…
Instruction 1, 2 (min, max)
Instruction 3, 4 (min, max)
SIMD And Sorting Networks
 Some SIMD instructions have high throughput, but high latency
o Data dependency between two consecutive max instructions can take 8 cycles on
Skylake
o If each parallel stage has less than 4 operations, pipeline may stall
• Solution: Interleave two sets of parallel 8-tuple sorting
o In reality, min/max means even for 4-tuples, pipeline is still filled
Source: Instrinsics guide
The Two Register Merge
 Sort units of two pre-sorted registers, K elements
o minv = A, maxv = B
o // Repeat K times
• minv = min(minv,maxv)
• maxv = max(minv,maxv)
• // circular shift one value down
• minv = alignr(minv,minv,imm)
Inoue et.al., “SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures,” VLDB 2015
SIMD And Merge Sort
 Hierarchically merged sorted
subsections
 Using the SIMD merger for sorting
o vector_merge is the two-register sorter
from before
 What if we want to sort structs?
o Instead of min/max, use cmpgt, followed
by permute for both key and values
o For large values, use <key, index>
Inoue et.al., “SIMD- and Cache-Friendly Algorithm for Sorting an
Array of Structures,” VLDB 2015
Typically reports 2x + performance!
Topic Under Active Research!
 Papers being written about…
o Architecture-optimized matrix transposition
o Register-level sorting algorithm
o Merge-sort
o … and more!
 Good find can accelerate your application kernel 2x
Case Study: Matrix Multiply
 Branch :
Boser & Katz,
“CS61C: Great Ideas In Computer Architecture”
Lecture 18 – Parallel Processing – SIMD
 https://www.intel.com/content/dam/www/public/us/en/documents/ma
nuals/64-ia-32-architectures-optimization-manual.pdf

More Related Content

Similar to lec2 - Modern Processors - SIMD.pptx

How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)Sławomir Zborowski
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoValeriia Maliarenko
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsDaniel Blezek
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]Aleksei Voitylov
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threadingAntonio Cesarano
 
AVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpegAVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpegKieran Kunhya
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
Optimization in Programming languages
Optimization in Programming languagesOptimization in Programming languages
Optimization in Programming languagesAnkit Pandey
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUs
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUsGOD MODE UNLOCKED - Hardware Backdoors in x86 CPUs
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUsPriyanka Aash
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0ScyllaDB
 
other-architectures.ppt
other-architectures.pptother-architectures.ppt
other-architectures.pptJaya Chavan
 
Something about SSE and beyond
Something about SSE and beyondSomething about SSE and beyond
Something about SSE and beyondLihang Li
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
Advanced High-Performance Computing Features of the OpenPOWER ISA
 Advanced High-Performance Computing Features of the OpenPOWER ISA Advanced High-Performance Computing Features of the OpenPOWER ISA
Advanced High-Performance Computing Features of the OpenPOWER ISAGanesan Narayanasamy
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD supportJoel Falcou
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory OptimizationWei Lin
 

Similar to lec2 - Modern Processors - SIMD.pptx (20)

How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threading
 
AVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpegAVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpeg
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
Optimization in Programming languages
Optimization in Programming languagesOptimization in Programming languages
Optimization in Programming languages
 
x86_1.ppt
x86_1.pptx86_1.ppt
x86_1.ppt
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUs
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUsGOD MODE UNLOCKED - Hardware Backdoors in x86 CPUs
GOD MODE UNLOCKED - Hardware Backdoors in x86 CPUs
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0
 
other-architectures.ppt
other-architectures.pptother-architectures.ppt
other-architectures.ppt
 
Something about SSE and beyond
Something about SSE and beyondSomething about SSE and beyond
Something about SSE and beyond
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Advanced High-Performance Computing Features of the OpenPOWER ISA
 Advanced High-Performance Computing Features of the OpenPOWER ISA Advanced High-Performance Computing Features of the OpenPOWER ISA
Advanced High-Performance Computing Features of the OpenPOWER ISA
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 

More from Rakesh Pogula

More from Rakesh Pogula (6)

BWE2.ppt
BWE2.pptBWE2.ppt
BWE2.ppt
 
BWE1.ppt
BWE1.pptBWE1.ppt
BWE1.ppt
 
chapter4.ppt
chapter4.pptchapter4.ppt
chapter4.ppt
 
Tomas_IWAENC_keynote10.ppt
Tomas_IWAENC_keynote10.pptTomas_IWAENC_keynote10.ppt
Tomas_IWAENC_keynote10.ppt
 
Combinational Filters.pptx
Combinational Filters.pptxCombinational Filters.pptx
Combinational Filters.pptx
 
Research_Wu.pptx
Research_Wu.pptxResearch_Wu.pptx
Research_Wu.pptx
 

Recently uploaded

Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniquesugginaramesh
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 

Recently uploaded (20)

Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniques
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 

lec2 - Modern Processors - SIMD.pptx

  • 1. CS 295: Modern Systems Modern Processors – SIMD Extensions Sang-Woo Jun Spring, 2019
  • 2. Modern Processor Topics  Transparent Performance Improvements o Pipelining, Caches o Superscalar, Out-of-Order, Branch Prediction, Speculation, … o Covered in CS250A and others  Explicit Performance Improvements o SIMD extensions, AES extensions, … o …  Non-Performance Topics o Virtualization extensions, secure enclaves, transactional memory, …
  • 3. Flynn Taxonomy (1966) Recap Data Stream Single Multi Instruction Stream Single SISD (Single-Core Processors) SIMD (GPUs, Intel SSE/AVX extensions, …) Multi MISD (Systolic Arrays, …) MIMD (VLIW, Parallel Computers)
  • 4. Flynn Taxonomy Recap Single-Instruction Single-Data (Single-Core Processors) Multi-Instruction Single-Data (Systolic Arrays,…) Single-Instruction Multi-Data (GPUs, Intel SIMD Extensions) Multi-Instruction Multi-Data (Parallel Computers)
  • 5. Intel SIMD Extensions  New instructions, new registers  Introduced in phases/groups of functionality o SSE – SSE4 (1999 –2006) • 128 bit width operations o AVX, FMA, AVX2, AVX-512 (2008 – 2015) • 256 – 512 bit width operations • AVX-512 chips not available yet (as of Spring, 2019), but soon!  F16C, and more to come?
  • 6. ZMM0 YMM0 Intel SIMD Registers (AVX-512) XMM0 ZMM1 YMM1 XMM1 ZMM31 YMM31 XMM31 …  XMM0 – XMM15 o 128-bit registers o SSE  YMM0 – YMM15 o 256-bit registers o AVX, AVX2  ZMM0 – ZMM31 o 512-bit registers o AVX-512
  • 7. SSE/AVX Data Types YMM0 float float float float double double int32 int32 int32 int32 float float float float double double int32 int32 int32 int32 16 16 8 8 16 16 8 8 8 8 8 8 16 16 8 8 16 16 8 8 8 8 8 8 16 16 8 8 16 16 8 8 8 8 8 8 16 16 8 8 16 16 8 8 8 8 8 8 Operation on 32 8-bit values in one instruction! 255 0
  • 8. Sandy Bridge Microarchitecture e.g., “Port 5 pressure” when code uses too much shuffle operations
  • 10. Aside: Do I Have SIMD Capabilities?  less /proc/cpuinfo
  • 11. Processor Microarchitectural Effects on Power Efficiency  The majority of power consumption of a CPU is not from the ALU o Cache management, data movement, decoding, and other infrastructure o Adding a few more ALUs should not impact power consumption  Indeed, 4X performance via AVX does not add 4X power consumption o FromiI7 4770K measurements: o Idle: 40 W o Under load : 117 W o Under AVX load : 128 W  Aside: Not all AVX operations are active at the same time – “Specialization horseman”?
  • 12. Compiler Automatic Vectorization  In gcc, flags “-O3 –mavx –mavx2” attempts automatic vectorization  Works pretty well for simple loops  But not for anything complex o E.g., naïve bubblesort code not parallelized at all Generated using GCC explorer: https://gcc.godbolt.org/
  • 13. Intel SIMD Intrinsics  Use C functions instead of inline assembly to call AVX instructions  Compiler manages registers, etc  Intel Intrinsics Guide o https://software.intel.com/sites/landingpage/IntrinsicsGuide o One of my most-visited pages… e.g., __m256 a, b, c; __m256 d = _mm256_fmadd_ps(a, b, c); // d[i] = a[i]*b[i]+c[i] for i = 0 …7
  • 14. Data Types in AVX/AVX2 Type Description __m128 128-bit vector containing 4 floats __m128d 128-bit vector containing 2 doubles __m128i 128-bit vector containing integers __m256 256-bit vector containing 8 floats __m256d 256-bit vector containing 4 doubles​ __m256i 256-bit vector containing integers 1~16 signed/unsigned integers 1~32 signed/unsigned integers __m512 variants in the future for AVX-512
  • 15. Intrinsic Naming Convention  _mm<width>_[function]_[type] o E.g., _mm256_fmadd_ps : perform fmadd (floating point multiply-add) on 256 bits of packed single-precision floating point values (8 of them) Width Prefix 128 _mm_ 256 _mm256_ 512 _mm512_ Type Postfix Single precision _ps Double precision _pd Packed signed integer _epiNNN (e.g., epi256) Packed unsigned integer _epuNNN (e.g., epu256) Scalar integer _siNNN (e.g., si256) Not all permutations exist! Check guide
  • 16. Load/Store/Initialization Operations  Initialization o _mm256_setzero_ps/pd/epi32/… o _mm256_set_... o …  Load/Store : Address MUST be aligned by 256-bit o _mm256_load_... o _mm256_store_...  And many more! (Masked read/write, strided reads, etc…) e.g., __mm256d t = _mm256_load_pd(double const * mem); // loads 4 double values from mem to t __mm256i v = _mm256_set_epi32(h,g,f,e,d,c,b,a); // loads 8 integer values to v
  • 17. Vertical Vector Instructions  Add/Subtract/Multiply o _mm256_add/sub/mul/div_ps/pd/epi • Mul only supported for epi32/epu32/ps/pd • Div only supported for ps/pd • Consult the guide!  Max/Min/GreaterThan/Equals  Sqrt, Reciprocal, Shift, etc…  FMA (Fused Multiply-Add) o (a*b)+c, -(a*b)-c, -(a*b)+c, and other permutations! o Consult the guide!  … a b c d × × × × + + + + = __m256 a, b, c; __m256 d = _mm256_fmadd_pd(a, b, c); = = =
  • 18. Integer Multiplication Caveat  Integer multiplication of two N bit values require 2N bits  E.g., __mm256_mul_epi32 and __mm256_mul_epu32 o Only use the lower 4 32 bit values o Result has 4 64 bit values  E.g., __mm256_mullo_epi32 and __mm256_mullo_epu32 o Uses all 8 32 bit values o Result has 8 truncated 32 bit values  And more options! o Consult the guide…
  • 19. Horizontal Vector Instructions  Horizontal add/subtraction o Adds adjacent pairs of values o E.g., __m256d _mm256_hadd_pd (__m256d a, __m256d b) a b c + + + +
  • 20. Shuffling/Permutation  Within 128-bit lanes o _mm256_shuffle_ps/pd/… (a,b, imm8) o _mm256_permute_ps/pd o _mm256_permutevar_ps/…  Across 128-bit lanes o _mm256_permute2x128/4x64 : Uses 8 bit control o _mm256_permutevar8x32/… : Uses 256 bit control  Not all type permutations exist, but variables can be cast back and forth Matt Scarpino, “Crunching Numbers with AVX and AVX2,” 2016
  • 21. Blend  Merges two vectors using a control o _mm256_blend_... : Uses 8 bit control • e.g., _mm256_blend_epi32 o _mm256_blendv_... : Uses 256 bit control • e.g., _mm256_blendv_epi8 a b c ? ? ? ?
  • 22. Alignr  Right-shifts concatenated value of two registers, by byte o Often used to implement circular shift by using two same register inputs o _mm256_alignr_epi8 (a, b, count) a b c Example of 64-bit values being shifted by 8
  • 23. Helper Instructions  Cast o __mm256i <-> __mm256, etc… o Syntactic sugar -- does not spend cycles  Convert o 4 floats <-> 4 doubles, etc…  Movemask o __mm256 mask to -> int imm8  And many more…
  • 24. Case Study: Sorting  Important, fundamental application!  Can be parallelized via divide-and-conquer  How can SIMD help?
  • 25. Reminder: Sorting Network  Network structure for sorting fixed number of values  Type of a “comparator network” o comparators perform compare-and-swap  Easily pipelined and parallelized Example 4-element sorting network 5 comparators, 3 cycle pipelined
  • 26. Remind: Sorting Network  Simple to generate correct sorting networks, but optimal structures are not well-known Source: Wikipedia (Sorting Network) Some known optimal sorting networks
  • 27. SIMD And Sorting Networks  AVX compare/max/min instructions do not work internally to a register o Typically same values copied to two different registers, permuted, compared o Or, “two register trick”, and many more optimizations  In this case, optimal sorting networks are typically not very helpful o Irregular computation means bubbles in the compute structure o Simple, odd-even sorting network works well 8-element odd-even sorting network What if I want to sort more than 8 values?
  • 28. SIMD And Sorting Networks  Typically, we are sorting more than one set of tuples o If we have multiple tasks, we can have task-level parallelism – Optimized networks! o Sort multiple tuples at the same time  Then we need to transpose the 8 8-element variables o Each variable has a value for each sorting network instance o Non-SIMD works, or a string of unpackhi/unpacklo/blend … Instruction 1, 2 (min, max) Instruction 3, 4 (min, max)
  • 29. SIMD And Sorting Networks  Some SIMD instructions have high throughput, but high latency o Data dependency between two consecutive max instructions can take 8 cycles on Skylake o If each parallel stage has less than 4 operations, pipeline may stall • Solution: Interleave two sets of parallel 8-tuple sorting o In reality, min/max means even for 4-tuples, pipeline is still filled Source: Instrinsics guide
  • 30. The Two Register Merge  Sort units of two pre-sorted registers, K elements o minv = A, maxv = B o // Repeat K times • minv = min(minv,maxv) • maxv = max(minv,maxv) • // circular shift one value down • minv = alignr(minv,minv,imm) Inoue et.al., “SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures,” VLDB 2015
  • 31. SIMD And Merge Sort  Hierarchically merged sorted subsections  Using the SIMD merger for sorting o vector_merge is the two-register sorter from before  What if we want to sort structs? o Instead of min/max, use cmpgt, followed by permute for both key and values o For large values, use <key, index> Inoue et.al., “SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures,” VLDB 2015 Typically reports 2x + performance!
  • 32. Topic Under Active Research!  Papers being written about… o Architecture-optimized matrix transposition o Register-level sorting algorithm o Merge-sort o … and more!  Good find can accelerate your application kernel 2x
  • 33. Case Study: Matrix Multiply  Branch : Boser & Katz, “CS61C: Great Ideas In Computer Architecture” Lecture 18 – Parallel Processing – SIMD