SIMD Processing Using Compiler Intrinsics

Richard Thomson
legalize@xmission.com
@LegalizeAdulthd
github.com/LegalizeAdulthood

SIMD
 Single
 Instruction
 Multiple
 Data

SIMD Exploits Data Parallelism
 Image Processing
 Array Processing
 Scientific Computing
 3D Graphics

Brief History of CPU SIMD
Year  Extension  Register Size
1997  MMX        64 bits
1999  SSE        128 bits
2001  SSE2       128 bits
2004  SSE3       128 bits
2006  SSE4       128 bits
2008  AVX        256 bits
2015  AVX-512    512 bits

Data Types
 8-bit integers
 16-bit integers
 32-bit integers
 64-bit integers
 16-bit floats
 32-bit floats
 64-bit floats
 Multiple smaller quantities are packed into registers ("multiple data")
 Alignment requirements on data
 Older extensions do not support all data types
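
As a rough illustration of the packing above, the number of lanes is the register width divided by the element width. The helper name lanes is hypothetical and exists only for this sketch:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many values of type T fit in a register
// that is registerBits wide.
template <typename T>
constexpr std::size_t lanes(std::size_t registerBits)
{
    return registerBits / (8 * sizeof(T));
}

// 128-bit register (SSE family)
static_assert(lanes<std::int8_t>(128) == 16, "sixteen 8-bit integers");
static_assert(lanes<float>(128) == 4, "four 32-bit floats");
// 256-bit register (AVX)
static_assert(lanes<float>(256) == 8, "eight 32-bit floats");
// 512-bit register (AVX-512)
static_assert(lanes<double>(512) == 8, "eight 64-bit floats");
```

The smaller the element, the more values each instruction processes at once.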

Alignment C++11
struct alignas(16) foo
{
int i; // 4 bytes
int j; // 4 bytes
alignas(4) char s[3]; // 3 bytes
short q; // 2 bytes
};
// outputs 16:
std::cout << alignof(foo) << '\n';

Alignment C++03
// pre-C++11
// MSVC:
struct __declspec(align(16)) foo
{
// ...
};
// gcc:
struct __attribute__((aligned(16))) foo
{
// ...
};

Boost.Align
 Handles heap allocation of aligned memory
 Query the alignment requirements of a type
 Declare alignment to the compiler portably

Compiler Intrinsics
 A function whose implementation is handled directly by the compiler.
 SIMD registers exposed as data types
 __m64, __m128, __m128d, __m128i, etc.
 SIMD instructions exposed as intrinsic functions
 _m_paddb, _m_paddd, _m_paddsb, etc.
 Register allocation, instruction scheduling and addressing modes handled by the compiler
 Proper alignment of operands is assumed
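
A minimal sketch of these points using SSE, assuming an x86 target. The function name add4 is hypothetical. The intrinsics _mm_load_ps, _mm_add_ps and _mm_store_ps come from <xmmintrin.h>:

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical helper: adds four pairs of floats with one SIMD add.
// All three pointers must be 16-byte aligned; the aligned load and
// store intrinsics assume this and do not check it.
void add4(const float* a, const float* b, float* out)
{
    const __m128 va = _mm_load_ps(a);       // load 4 floats into a register
    const __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));  // lane-wise a[i] + b[i]
}
```

A caller would declare its arrays as alignas(16) float a[4] to satisfy the alignment assumption. Which registers hold va and vb is left to the compiler.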

Options Available
Assembly
+ Direct control
- Hard to program
Intrinsics
+ Pure C/C++
- Hard to program
Class Library
+ Easier to program
- Less control
Automatic Vectorization
- Very little control
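
For comparison, automatic vectorization needs no special source code. The loop below is a hypothetical example of the kind of code a compiler may turn into SIMD instructions. Whether it actually does so depends on the compiler, the optimization settings and the target, which is the "very little control" part:

```cpp
#include <cstddef>

// Hypothetical example: multiplies every element by a constant.
// Plain scalar C++; any SIMD code is generated by the compiler, if at all.
void scale(float* data, std::size_t count, float factor)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        data[i] *= factor;
    }
}
```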

Proposed Boost.Simd
 https://github.com/NumScale/boost.simd
 Seems promising; easier to program without loss of control?
 I had problems using it on Windows (issue #189)
 Abstracts away the different sizes of registers as packs
 Provides facilities to deal with alignment
 Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
 Single code base can target multiple extensions
 Templates expand to calls to intrinsics

Group Exercise
 Convert BasicMandel to use intrinsics
 AVX packs 8 32-bit floats into a single 256-bit register
 AVX Intrinsics:
 #include <immintrin.h>
 __m256 _mm256_add_ps(__m256 a, __m256 b)
  __m256 _mm256_mul_ps(__m256 a, __m256 b)
  __m256 _mm256_sub_ps(__m256 a, __m256 b)
 __m256 _mm256_load_ps(float const *c)
 __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
 __m256i _mm256_castps_si256(__m256 a)
 Intel Intrinsics Guide
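
One possible starting point for the exercise, as a sketch only. BasicMandel itself is not shown here, so the function name mandel8, its parameters and the float iteration counters are all assumptions. Besides the intrinsics listed above it uses _mm256_set1_ps, _mm256_setzero_ps, _mm256_and_ps, _mm256_movemask_ps and _mm256_store_ps from the same header. The AVX_TARGET macro is also hypothetical; it lets GCC and Clang compile this one function for AVX without a command-line flag.

```cpp
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: iterates z = z*z + c for 8 points at once.
// cr, ci and countsOut must each point to 8 floats, 32-byte aligned.
// countsOut[i] receives the number of iterations for which |z|^2 <= 4.
AVX_TARGET
void mandel8(const float* cr, const float* ci, int maxIter, float* countsOut)
{
    const __m256 cRe = _mm256_load_ps(cr);
    const __m256 cIm = _mm256_load_ps(ci);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zRe = _mm256_setzero_ps();
    __m256 zIm = _mm256_setzero_ps();
    __m256 counts = _mm256_setzero_ps();

    for (int i = 0; i < maxIter; ++i)
    {
        const __m256 re2 = _mm256_mul_ps(zRe, zRe);
        const __m256 im2 = _mm256_mul_ps(zIm, zIm);
        // Lanes still inside the radius-2 circle become all-ones bits.
        const __m256 inside =
            _mm256_cmp_ps(_mm256_add_ps(re2, im2), four, _CMP_LE_OQ);
        if (_mm256_movemask_ps(inside) == 0)
        {
            break;  // every lane has escaped
        }
        // Add 1.0 only in lanes that are still inside.
        counts = _mm256_add_ps(counts, _mm256_and_ps(inside, one));
        // z = z*z + c: re = re^2 - im^2 + cRe, im = 2*re*im + cIm
        const __m256 reIm = _mm256_mul_ps(zRe, zIm);
        zRe = _mm256_add_ps(_mm256_sub_ps(re2, im2), cRe);
        zIm = _mm256_add_ps(_mm256_add_ps(reIm, reIm), cIm);
    }
    _mm256_store_ps(countsOut, counts);
}
```

Escaped lanes keep being computed but no longer count, because the ordered comparison is false for large, infinite or NaN values. The code must only run on a CPU that supports AVX.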
