Richard Thomson
legalize@xmission.com
@LegalizeAdulthd
github.com/LegalizeAdulthood
SIMD
 Single
 Instruction
 Multiple
 Data
SIMD Exploits Data Parallelism
 Image Processing
 Array Processing
 Scientific Computing
 3D Graphics
Brief History of CPU SIMD
Year Extension Register Size
1997 MMX 64 bits
1999 SSE 128 bits
2001 SSE2 128 bits
2004 SSE3 128 bits
2006 SSE4 128 bits
2008 AVX 256 bits
2015 AVX-512 512 bits
Data Types
 8-bit integers
 16-bit integers
 32-bit integers
 64-bit integers
 16-bit floats
 32-bit floats
 64-bit floats
 Multiple smaller quantities are packed into registers ("multiple data")
 Alignment requirements on data
 Older extensions do not support all data types
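As a rough illustration of "multiple data": the number of values packed into one register is the register width divided by the element width. This is a minimal sketch; `lanes` is a hypothetical helper, not part of any intrinsics header, and it assumes the usual 4-byte float and 8-byte double.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many elements of type T fit in a register
// that is register_bits wide.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits)
{
    return register_bits / (sizeof(T) * 8);
}

static_assert(lanes<std::int8_t>(64) == 8, "MMX: 8 x 8-bit integers");
static_assert(lanes<float>(128) == 4, "SSE: 4 x 32-bit floats");
static_assert(lanes<double>(128) == 2, "SSE2: 2 x 64-bit floats");
static_assert(lanes<float>(256) == 8, "AVX: 8 x 32-bit floats");
static_assert(lanes<float>(512) == 16, "AVX-512: 16 x 32-bit floats");
```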
Alignment C++11
struct alignas(16) foo
{
    int i;                 // 4 bytes
    int j;                 // 4 bytes
    alignas(4) char s[3];  // 3 bytes
    short q;               // 2 bytes
};
// outputs 16:
std::cout << alignof(foo) << '\n';
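One way to check what `alignas` buys you, sketched under the assumption of a C++11 compiler: `is_aligned` is a hypothetical helper that tests an address at run time, and the `static_assert` lines check the type at compile time.

```cpp
#include <cstddef>
#include <cstdint>

struct alignas(16) foo
{
    int i;
    int j;
    alignas(4) char s[3];
    short q;
};

// Hypothetical helper: true when p is a multiple of `alignment` bytes.
// `alignment` must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

static_assert(alignof(foo) == 16, "alignas(16) sets the type's alignment");
static_assert(sizeof(foo) % 16 == 0, "size is padded to a multiple of the alignment");
```

Because the size is padded to a multiple of the alignment, every element of an array of `foo` starts on a 16-byte boundary, not just the first one.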
Alignment C++03
// pre-C++11
// MSVC:
struct __declspec(align(16)) foo
{
    // ...
};
// gcc (attribute goes right after the struct keyword):
struct __attribute__((aligned(16))) foo
{
    // ...
};
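Both compiler-specific spellings can sit directly after the `struct` keyword, so a single macro can cover both. A minimal sketch; `SIMD_ALIGNED` and `vec4` are hypothetical names, and the `#else` branch assumes a GCC-compatible compiler.

```cpp
// Hypothetical macro: picks the compiler-specific alignment spelling.
#if defined(_MSC_VER)
#  define SIMD_ALIGNED(n) __declspec(align(n))
#else  // assumption: gcc, clang, or another GCC-compatible compiler
#  define SIMD_ALIGNED(n) __attribute__((aligned(n)))
#endif

// Four floats that always start on a 16-byte boundary.
struct SIMD_ALIGNED(16) vec4
{
    float v[4];
};
```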
Boost.Align
 Handles heap allocation of aligned memory
 Query the alignment requirements of a type
 Declare alignment to the compiler portably
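The heap side of the problem can be sketched with the standard library alone. This is not the Boost.Align API; `aligned_floats` is a hypothetical class that over-allocates with `std::malloc` and uses `std::align` (C++11) to find an aligned address inside the block.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical helper, not Boost.Align: owns a raw block and exposes a
// float pointer inside it that satisfies the requested alignment.
// `alignment` must be a power of two.
class aligned_floats
{
public:
    aligned_floats(std::size_t count, std::size_t alignment)
        : raw_(std::malloc(count * sizeof(float) + alignment)), data_(nullptr)
    {
        if (raw_ != nullptr)
        {
            void* p = raw_;
            std::size_t space = count * sizeof(float) + alignment;
            // std::align moves p forward to the next aligned address that
            // still leaves room for the requested bytes.
            data_ = static_cast<float*>(
                std::align(alignment, count * sizeof(float), p, space));
        }
    }
    ~aligned_floats() { std::free(raw_); }

    aligned_floats(const aligned_floats&) = delete;
    aligned_floats& operator=(const aligned_floats&) = delete;

    float* data() const { return data_; }  // nullptr if allocation failed

private:
    void* raw_;    // what malloc returned; this is what gets freed
    float* data_;  // aligned pointer inside the raw block
};
```

Over-allocating by `alignment` bytes guarantees that an aligned address exists inside the block; the original pointer is kept so it can be passed to `std::free`.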
Compiler Intrinsics
 A function whose implementation is handled directly by the compiler.
 SIMD registers exposed as data types
 __m64, __m128, __m128d, __m128i, etc.
 SIMD instructions exposed as intrinsic functions
 _m_paddb, _m_paddd, _m_paddsb, etc.
 Register allocation, instruction scheduling and addressing modes handled by the compiler
 Proper alignment of operands is assumed
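A small sketch of what this looks like in practice, assuming an x86 target with SSE (the baseline on x86-64): `add4` is a hypothetical function name, while `__m128`, `_mm_load_ps`, `_mm_add_ps` and `_mm_store_ps` come from `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Adds four pairs of floats with a single SIMD add.
// All three pointers must be 16-byte aligned, because _mm_load_ps and
// _mm_store_ps assume aligned operands.
inline void add4(const float* a, const float* b, float* out)
{
    const __m128 va = _mm_load_ps(a);  // load 4 floats into one register
    const __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}
```

The code never names a register or an addressing mode; the compiler picks them. The caller is responsible for alignment, for example by declaring the arrays with `alignas(16)`.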
Options Available
Option                   Pros                 Cons
Assembly                 + Direct control     - Hard to program
Intrinsics               + Pure C/C++         - Hard to program
Class Library            + Easier to program  - Less control
Automatic Vectorization                       - Very little control
Proposed Boost.Simd
 https://github.com/NumScale/boost.simd
 Seems promising; easier to program without loss of control?
 I had problems using it on Windows (issue #189)
 Abstracts away the different sizes of registers as packs
 Provides facilities to deal with alignment
 Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
 Single code base can target multiple extensions
 Templates expand to calls to intrinsics
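To give a feel for the "pack" idea, here is a toy sketch. It is not the Boost.SIMD API: `pack` and its `lane` member are hypothetical, and the loop stands in for the intrinsic call a real library would select for the target extension.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD class library's pack type.
// A real library would pick N from the target's register size.
template <typename T, std::size_t N>
struct pack
{
    std::array<T, N> lane;
};

// Natural syntax: a + b adds the packs lane by lane. A real library would
// expand this to an intrinsic such as _mm_add_ps or _mm256_add_ps.
template <typename T, std::size_t N>
pack<T, N> operator+(const pack<T, N>& a, const pack<T, N>& b)
{
    pack<T, N> r;
    for (std::size_t i = 0; i < N; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

Client code written against `pack` does not mention a register width, which is what lets a single code base target several extensions.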
Group Exercise
 Convert BasicMandel to use intrinsics
 AVX packs 8 32-bit floats into a single 256-bit register
 AVX Intrinsics:
 #include <immintrin.h>
 __m256 _mm256_add_ps(__m256 a, __m256 b)
 __m256 _mm256_mul_ps(__m256 a, __m256 b)
 __m256 _mm256_sub_ps(__m256 a, __m256 b)
 __m256 _mm256_load_ps(float const *c)
 __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
 __m256i _mm256_castps_si256(__m256 a)
 Intel Intrinsics Guide
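One possible shape for the exercise, as a sketch only. The BasicMandel source is not shown in the slides, so `mandel_scalar` is a hypothetical stand-in for it and `mandel8` is a hypothetical name. The AVX path is compiled only when the compiler defines `__AVX__` (for example with `-mavx`); otherwise the function falls back to the scalar loop.

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Scalar reference: iterations of z -> z*z + c while |z| <= 2, capped at max_iter.
inline int mandel_scalar(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0f)
    {
        const float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Eight points at once. With AVX, cr and ci must be 32-byte aligned
// because _mm256_load_ps assumes aligned operands.
inline void mandel8(const float* cr, const float* ci, int max_iter, int* out)
{
#if defined(__AVX__)
    const __m256 vcr = _mm256_load_ps(cr);
    const __m256 vci = _mm256_load_ps(ci);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();  // per-lane counts, kept as floats
    for (int n = 0; n < max_iter; ++n)
    {
        const __m256 zr2 = _mm256_mul_ps(zr, zr);
        const __m256 zi2 = _mm256_mul_ps(zi, zi);
        // Lanes still inside the radius-2 circle get an all-ones mask.
        const __m256 active =
            _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LE_OQ);
        if (_mm256_movemask_ps(active) == 0)
            break;  // every lane has escaped
        // Add 1 to the count in active lanes only.
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));
        const __m256 zrzi = _mm256_mul_ps(zr, zi);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), vcr);
        zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), vci);
    }
    alignas(32) float counts[8];
    _mm256_store_ps(counts, count);
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<int>(counts[i]);
#else
    for (int i = 0; i < 8; ++i)
        out[i] = mandel_scalar(cr[i], ci[i], max_iter);
#endif
}
```

Counting in float lanes is a design choice to stay within plain AVX, which has no 256-bit integer add; float counts are exact for iteration limits below 2^24. Escaped lanes keep iterating but are masked out of the count, so the loop runs until all eight lanes escape or the limit is reached.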

SIMD Processing Using Compiler Intrinsics
