SlideShare a Scribd company logo
1 of 24
Download to read offline
SIMD Programming Introduction
Champ Yen (嚴梓鴻)
champ.yen@gmail.com
http://champyen.blogspt.com
Agenda
2
● What & Why SIMD
● SIMD in different Processors
● SIMD for Software Optimization
● What are important in SIMD?
● Q & A
Link of This Slides
https://goo.gl/Rc8xPE
What is SIMD (Single Instruction Multiple Data)
3
for(y = 0; y < height; y++){
for(x = 0; x < width; x+=8){
//process 8 point simutaneously
uin16x8_t va, vb, vout;
va = vld1q_u16(a+x);
vb = vld1q_u16(b+x);
vout = vaddq_u16(va, vb);
vst1q_u16(out, vout);
}
a+=width; b+=width; out+=width;
}
for(y = 0; y < height; y++){
for(x = 0; x < width; x++){
//process 1 point
out[x] = a[x]+b[x];
}
a+=width; b+=width; out+=width;
}
one lane
Why do we need to use SIMD?
4
Why & How do we use SIMD?
5
Image
Processing
Scientific Computing
Gaming
Deep
Neural
Network
SIMD in different Processor - CPU
6
● x86
− MMX
− SSE
− AVX/AVX-512
● ARM - Application
− v5 DSP Extension
− v6 SIMD
− v7 NEON
− v8 Advanced SIMD (NEON)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
SIMD in different Processor - GPU
7
● SIMD
− AMD GCN
− ARM Mali
● WaveFront
− Nvidia
− Imagination PowerVR Rogue
SIMD in different Processor - DSP
8
● Qualcomm Hexagon 600 HVX
● Cadence IVP P5
● Synopsys EV6x
● CEVA XM4
SIMD Optimization
9
SIMD
Hardware
Design
SIMD
Framework /
Programming
Model
SIMD
Framework /
Programming
Model
Software
SIMD Optimization
10
● Auto/Semi-Auto Method
● Compiler Intrinsics
● Specific Framework/Infrastructure
● Coding in Assembly
● What are the difficult parts of SIMD Programming?
SIMD Optimization – Auto/SemiAuto
11
● compiler
− auto-vectorization optimization options
− #pragma
− IR optimization
● CilkPlus/OpenMP/OpenACC
Serial Code
for(i = 0; i < N; i++){
A[i] = B[i] + C[i];
}
#pragma omp simd
for(i = 0; i < N; i++){
A[i] = B[i] + C[i];
}
SIMD Pragma
https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization
SIMD Optimization – Intrinsics
12
● Arch-Dependent Intrinsics
− Intel SSE/AVX
− ARM NEON/MIPS ASE
− Vector-based DSP
● Common Intrisincs
− GCC/Clang vector intrinsics
− OpenCL / SIMD.js
● take Vector Width into consideration
● Portability between compilers
SIMD.js example from:
https://01.org/node/1495
SIMD Optimization – Intrinsics
13
4x4 Matrix Multiplication ARM NEON Example
http://www.fixstars.com/en/news/?p=125
//...
//Load matrixB into four vectors
uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;
vectorB1 = vld1_u16 (B[0]);
vectorB2 = vld1_u16 (B[1]);
vectorB3 = vld1_u16 (B[2]);
vectorB4 = vld1_u16 (B[3]);
//Temporary vectors to use with calculating the dotproduct
uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;
// For each row in A...
for (i=0; i<4; i++){
//Multiply the rows in B by each value in A's row
vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
vectorT4 = vmul_n_u16(vectorB4, A[i][3]);
//Add them together
vectorT1 = vadd_u16(vectorT1, vectorT2);
vectorT1 = vadd_u16(vectorT1, vectorT3);
vectorT1 = vadd_u16(vectorT1, vectorT4);
//Output the dotproduct
vst1_u16 (C[i], vectorT1);
}
//...
A B C
A B C
SIMD Optimization –
Data Parallel Frameworks
14
● OpenCL/Cuda/C++AMP
● OpenVX/Halide
● SIMD-Optimized Libraries
− Apple Accelerate
− OpenCV
− ffmpeg/x264
− fftw/Ne10
core idea of Halide Lang
workitems in OpenCL
SIMD Optimization –
Coding in Assembly
15
● WHY !!!!???
● Extreme Performance Optimization
− precise code size/cycle/register usage
● The difficulty depends on ISA design and
assembler
SIMD Optimization –
The Difficult Parts
16
● Finding Parallelism in Algorithm
● Portability between different intrinsics
− sse-to-neon
− neon-to-sse
● Boundary handling
− Padding, Predication, Fallback
● Divergence
− Predication, Fallback to scalar
● Register Spilling
− multi-stages, # of variables + braces, assembly
● Non-Regular Access/Processing Pattern/Dependency
− Multi-stages/Reduction ISA/Enhanced DMA
● Unsupported Operations
− Division, High-level function (eg: math functions)
● Floating-Point
− Unsupported/cross-deivce Compatiblilty
What are important in SIMD?
17
TLP Programming
ILP Programming
Data Parallel Algorithm
Architecture
What are important in SIMD?
18
● ISA design
● Memory Model
● Thread / Execution Model
● Scalability / Extensibility
● The Trends
What are important in SIMD?
ISA Design
19
● SuperScalar vs VLIW
● vector register design
− Unified or Dedicated float/predication/accumulators
registers
● ALU
● Inter-Lane
− reduction, swizzle, pack/unpack
● Multiply
● Memory Access
− Load/Store, Scatter/Gather, LUT, DMA ...
● Scalar ↔ Vector
What are important in SIMD?
Memory Model
20
● DATA! DATA! DATA!
− it takes great latency to pull data from DRAM to register!
● Buffers Between Host Processor and SIMD Processor
− Handled in driver/device-side
● TCM
− DMA
− relative simple hw/control/understand
− fixed cycle (good for simulation/estimation)
● Cache
− prefetch
− transparent
− portable
− performance scalability
− simple code flow
● Mixed
− powerful
What are important in SIMD?
Thread / Execution Model
21
● The flow to trigger DSP to work
Qualcomm FastRPC
w/ IDL Compiler CEVA Link
TI DSP Link
Synopsys EV6x SW
ecosystem
What are important in SIMD?
Scalability/Extensibility
22
● What to do when ONE core doesn’t meet
performance requirement?
● HWA/Co-processor Interface
● Add another core?
− Cost
− Multicore Programming Mode
What are important in SIMD?
Trends
23
Vector Width
128/256 bit vector width
CEVA DSP/SSE/AVX
64 bit vector width
Hexagon/ARM DSP/MIPS ASE/MMX
512 bit vector width
AVX-512/IVP P5/EV6x/HVX
1024/2048 bit vector width
HVX/ARM SVE
ILP/TLP
Count
Explicit
vector
programming
Autovectorization
Time
Thank You, Q & A
24

More Related Content

What's hot

Hopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことNVIDIA Japan
 
SRAM read and write and sense amplifier
SRAM read and write and sense amplifierSRAM read and write and sense amplifier
SRAM read and write and sense amplifierSoumyajit Langal
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD
 
/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinityTakuya ASADA
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neonsean chen
 
CUDAプログラミング入門
CUDAプログラミング入門CUDAプログラミング入門
CUDAプログラミング入門NVIDIA Japan
 
ACRiウェビナー:岩渕様ご講演資料
ACRiウェビナー:岩渕様ご講演資料ACRiウェビナー:岩渕様ご講演資料
ACRiウェビナー:岩渕様ご講演資料直久 住川
 
ACRiウェビナー:小野様ご講演資料
ACRiウェビナー:小野様ご講演資料ACRiウェビナー:小野様ご講演資料
ACRiウェビナー:小野様ご講演資料直久 住川
 
Unified Memory on POWER9 + V100
Unified Memory on POWER9 + V100Unified Memory on POWER9 + V100
Unified Memory on POWER9 + V100inside-BigData.com
 
A tutorial on programming the Mits Altair computer by Mihai Pruna
A tutorial on programming the Mits Altair computer by Mihai PrunaA tutorial on programming the Mits Altair computer by Mihai Pruna
A tutorial on programming the Mits Altair computer by Mihai PrunaMihai Pruna
 
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019Unity Technologies
 
第9回ACRiウェビナー_セック/岩渕様ご講演資料
第9回ACRiウェビナー_セック/岩渕様ご講演資料第9回ACRiウェビナー_セック/岩渕様ご講演資料
第9回ACRiウェビナー_セック/岩渕様ご講演資料直久 住川
 
Enable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunEnable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunheut2008
 
라즈베리파이와자바스크립트로만드는 IoT
라즈베리파이와자바스크립트로만드는 IoT라즈베리파이와자바스크립트로만드는 IoT
라즈베리파이와자바스크립트로만드는 IoTCirculus
 
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術について
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術についてAIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術について
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術についてFixstars Corporation
 

What's hot (20)

Hopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないこと
 
SRAM read and write and sense amplifier
SRAM read and write and sense amplifierSRAM read and write and sense amplifier
SRAM read and write and sense amplifier
 
Ec8791 lpc2148 pwm
Ec8791 lpc2148 pwmEc8791 lpc2148 pwm
Ec8791 lpc2148 pwm
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Lect.10.arm soc.4 neon
Lect.10.arm soc.4 neonLect.10.arm soc.4 neon
Lect.10.arm soc.4 neon
 
CUDAプログラミング入門
CUDAプログラミング入門CUDAプログラミング入門
CUDAプログラミング入門
 
ACRiウェビナー:岩渕様ご講演資料
ACRiウェビナー:岩渕様ご講演資料ACRiウェビナー:岩渕様ご講演資料
ACRiウェビナー:岩渕様ご講演資料
 
ACRiウェビナー:小野様ご講演資料
ACRiウェビナー:小野様ご講演資料ACRiウェビナー:小野様ご講演資料
ACRiウェビナー:小野様ご講演資料
 
Unified Memory on POWER9 + V100
Unified Memory on POWER9 + V100Unified Memory on POWER9 + V100
Unified Memory on POWER9 + V100
 
A tutorial on programming the Mits Altair computer by Mihai Pruna
A tutorial on programming the Mits Altair computer by Mihai PrunaA tutorial on programming the Mits Altair computer by Mihai Pruna
A tutorial on programming the Mits Altair computer by Mihai Pruna
 
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
 
第9回ACRiウェビナー_セック/岩渕様ご講演資料
第9回ACRiウェビナー_セック/岩渕様ご講演資料第9回ACRiウェビナー_セック/岩渕様ご講演資料
第9回ACRiウェビナー_セック/岩渕様ご講演資料
 
Enable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunEnable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zun
 
Introduction to stm32-part1
Introduction to stm32-part1Introduction to stm32-part1
Introduction to stm32-part1
 
라즈베리파이와자바스크립트로만드는 IoT
라즈베리파이와자바스크립트로만드는 IoT라즈베리파이와자바스크립트로만드는 IoT
라즈베리파이와자바스크립트로만드는 IoT
 
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術について
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術についてAIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術について
AIチップ戦国時代における深層学習モデルの推論の最適化と実用的な運用を可能にするソフトウェア技術について
 
画像処理の高性能計算
画像処理の高性能計算画像処理の高性能計算
画像処理の高性能計算
 
Vlsi design
Vlsi designVlsi design
Vlsi design
 

Similar to SIMD Programming Introduction Guide

Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to knowRoberto Agostino Vitillo
 
Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformCliff Click
 
Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022ssuser866937
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Laurent Leturgez
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMDEdge AI and Vision Alliance
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Unity Technologies
 
Raspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentRaspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentCorley S.r.l.
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 
SIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cSIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cLaurent Leturgez
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APISri Ambati
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsEugenio Villar
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsPriyanka Aash
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingTom Spyrou
 
SIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemorySIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemoryLaurent Leturgez
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 

Similar to SIMD Programming Introduction Guide (20)

Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning Platform
 
Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
 
Raspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentRaspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application Development
 
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
SIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cSIMD inside and outside oracle 12c
SIMD inside and outside oracle 12c
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm Basebands
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timing
 
SIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemorySIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In Memory
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 

More from Champ Yen

Halide tutorial 2019
Halide tutorial 2019Halide tutorial 2019
Halide tutorial 2019Champ Yen
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Champ Yen
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionChamp Yen
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsChamp Yen
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionChamp Yen
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS ObservationChamp Yen
 
Play With Android
Play With AndroidPlay With Android
Play With AndroidChamp Yen
 
Linux Porting
Linux PortingLinux Porting
Linux PortingChamp Yen
 

More from Champ Yen (8)

Halide tutorial 2019
Halide tutorial 2019Halide tutorial 2019
Halide tutorial 2019
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & Introduction
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization Tips
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming Introduction
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS Observation
 
Play With Android
Play With AndroidPlay With Android
Play With Android
 
Linux Porting
Linux PortingLinux Porting
Linux Porting
 

Recently uploaded

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 

Recently uploaded (20)

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 

SIMD Programming Introduction Guide

  • 1. SIMD Programming Introduction Champ Yen (嚴梓鴻) champ.yen@gmail.com http://champyen.blogspt.com
  • 2. Agenda 2 ● What & Why SIMD ● SIMD in different Processors ● SIMD for Software Optimization ● What are important in SIMD? ● Q & A Link of This Slides https://goo.gl/Rc8xPE
  • 3. What is SIMD (Single Instruction Multiple Data) 3 for(y = 0; y < height; y++){ for(x = 0; x < width; x+=8){ //process 8 point simutaneously uin16x8_t va, vb, vout; va = vld1q_u16(a+x); vb = vld1q_u16(b+x); vout = vaddq_u16(va, vb); vst1q_u16(out, vout); } a+=width; b+=width; out+=width; } for(y = 0; y < height; y++){ for(x = 0; x < width; x++){ //process 1 point out[x] = a[x]+b[x]; } a+=width; b+=width; out+=width; } one lane
  • 4. Why do we need to use SIMD? 4
  • 5. Why & How do we use SIMD? 5 Image Processing Scientific Computing Gaming Deep Neural Network
  • 6. SIMD in different Processor - CPU 6 ● x86 − MMX − SSE − AVX/AVX-512 ● ARM - Application − v5 DSP Extension − v6 SIMD − v7 NEON − v8 Advanced SIMD (NEON) https://software.intel.com/sites/landingpage/IntrinsicsGuide/ http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
  • 7. SIMD in different Processor - GPU 7 ● SIMD − AMD GCN − ARM Mali ● WaveFront − Nvidia − Imagination PowerVR Rogue
  • 8. SIMD in different Processor - DSP 8 ● Qualcomm Hexagon 600 HVX ● Cadence IVP P5 ● Synopsys EV6x ● CEVA XM4
  • 10. SIMD Optimization 10 ● Auto/Semi-Auto Method ● Compiler Intrinsics ● Specific Framework/Infrastructure ● Coding in Assembly ● What are the difficult parts of SIMD Programming?
  • 11. SIMD Optimization – Auto/SemiAuto 11 ● compiler − auto-vectorization optimization options − #pragma − IR optimization ● CilkPlus/OpenMP/OpenACC Serial Code for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; } #pragma omp simd for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; } SIMD Pragma https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization
  • 12. SIMD Optimization – Intrinsics 12 ● Arch-Dependent Intrinsics − Intel SSE/AVX − ARM NEON/MIPS ASE − Vector-based DSP ● Common Intrisincs − GCC/Clang vector intrinsics − OpenCL / SIMD.js ● take Vector Width into consideration ● Portability between compilers SIMD.js example from: https://01.org/node/1495
  • 13. SIMD Optimization – Intrinsics 13 4x4 Matrix Multiplication ARM NEON Example http://www.fixstars.com/en/news/?p=125 //... //Load matrixB into four vectors uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4; vectorB1 = vld1_u16 (B[0]); vectorB2 = vld1_u16 (B[1]); vectorB3 = vld1_u16 (B[2]); vectorB4 = vld1_u16 (B[3]); //Temporary vectors to use with calculating the dotproduct uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4; // For each row in A... for (i=0; i<4; i++){ //Multiply the rows in B by each value in A's row vectorT1 = vmul_n_u16(vectorB1, A[i][0]); vectorT2 = vmul_n_u16(vectorB2, A[i][1]); vectorT3 = vmul_n_u16(vectorB3, A[i][2]); vectorT4 = vmul_n_u16(vectorB4, A[i][3]); //Add them together vectorT1 = vadd_u16(vectorT1, vectorT2); vectorT1 = vadd_u16(vectorT1, vectorT3); vectorT1 = vadd_u16(vectorT1, vectorT4); //Output the dotproduct vst1_u16 (C[i], vectorT1); } //... A B C A B C
  • 14. SIMD Optimization – Data Parallel Frameworks 14 ● OpenCL/Cuda/C++AMP ● OpenVX/Halide ● SIMD-Optimized Libraries − Apple Accelerate − OpenCV − ffmpeg/x264 − fftw/Ne10 core idea of Halide Lang workitems in OpenCL
  • 15. SIMD Optimization – Coding in Assembly 15 ● WHY !!!!??? ● Extreme Performance Optimization − precise code size/cycle/register usage ● The difficulty depends on ISA design and assembler
  • 16. SIMD Optimization – The Difficult Parts 16 ● Finding Parallelism in Algorithm ● Portability between different intrinsics − sse-to-neon − neon-to-sse ● Boundary handling − Padding, Predication, Fallback ● Divergence − Predication, Fallback to scalar ● Register Spilling − multi-stages, # of variables + braces, assembly ● Non-Regular Access/Processing Pattern/Dependency − Multi-stages/Reduction ISA/Enhanced DMA ● Unsupported Operations − Division, High-level function (eg: math functions) ● Floating-Point − Unsupported/cross-deivce Compatiblilty
  • 17. What are important in SIMD? 17 TLP Programming ILP Programming Data Parallel Algorithm Architecture
  • 18. What are important in SIMD? 18 ● ISA design ● Memory Model ● Thread / Execution Model ● Scalability / Extensibility ● The Trends
  • 19. What are important in SIMD? ISA Design 19 ● SuperScalar vs VLIW ● vector register design − Unified or Dedicated float/predication/accumulators registers ● ALU ● Inter-Lane − reduction, swizzle, pack/unpack ● Multiply ● Memory Access − Load/Store, Scatter/Gather, LUT, DMA ... ● Scalar ↔ Vector
  • 20. What are important in SIMD? Memory Model 20 ● DATA! DATA! DATA! − it takes great latency to pull data from DRAM to register! ● Buffers Between Host Processor and SIMD Processor − Handled in driver/device-side ● TCM − DMA − relative simple hw/control/understand − fixed cycle (good for simulation/estimation) ● Cache − prefetch − transparent − portable − performance scalability − simple code flow ● Mixed − powerful
  • 21. What are important in SIMD? Thread / Execution Model 21 ● The flow to trigger DSP to work Qualcomm FastRPC w/ IDL Compiler CEVA Link TI DSP Link Synopsys EV6x SW ecosystem
  • 22. What are important in SIMD? Scalability/Extensibility 22 ● What to do when ONE core doesn’t meet performance requirement? ● HWA/Co-processor Interface ● Add another core? − Cost − Multicore Programming Mode
  • 23. What are important in SIMD? Trends 23 Vector Width 128/256 bit vector width CEVA DSP/SSE/AVX 64 bit vector width Hexagon/ARM DSP/MIPS ASE/MMX 512 bit vector width AVX-512/IVP P5/EV6x/HVX 1024/2048 bit vector width HVX/ARM SVE ILP/TLP Count Explicit vector programming Autovectorization Time
  • 24. Thank You, Q & A 24