Happy To Use SIMD
Weita, Wang
SIMD???
cv::Mat a(4,4,CV_32FC1);
cv::Mat b(4,4,CV_32FC1);
cv::Mat c(4,4,CV_32FC1),d;
d= a*b+c;
Eigen::MatrixXf a(3,3);
Eigen::MatrixXf b(3,3);
Eigen::MatrixXf c(3,3),d;
d=a*b+c;
Memory Register
Compiler
What is SIMD?
• Extreme optimization for C/C++
• Works with pointers only
• You must precisely define the behavior of memory,
registers, and the compiler; then you can challenge the limit.
(Diagram: SIMD sits between the C/C++ level and the assembly level.)
C=A+B ?
float arr0[4] = { 1,2,3,4 };
float arr1[4] = { 5,6,7,8 };
float arr2[4] = { 0 };
(Diagram: the four lanes of A and B are added element-wise into C.)
Result: arr2 => { 6,8,10,12 };
Why is SIMD fast?
for(int i=0;i<4;i++)
arr2[i]=arr0[i]+arr1[i];
for(int i=0;i<4;i++)
*(arr2 + i) = *(arr0 + i)+*(arr1 + i);
Assume every instruction takes 1 instruction and 1 cycle.
Cycle estimate (scalar): loop setup 1; index update 1×4;
two loads and one store, each with address arithmetic, (1+1)×4 each;
add 1×4; loop branch 1×4
37 cycles
Why is SIMD fast?
float32x4_t a,b,c;
a=*(float32x4_t *)arr0;
b=*(float32x4_t *)arr1;
c=a+b;
*(float32x4_t *)arr2=c;
4 cycles
≈9× faster (37 / 4)
SIMD, Step 1
You have to count cycles
Variable Architecture
double a; //64bits
float b; //32bits
int c; //32bits
short d; //16bits
char e; //8bits
unsigned int f; //32bits
unsigned short g; //16bits
unsigned char h; //8bits
….
• SSE
__m128d aa; // 2 x double
__m128 bb; // 4 x float
__m128i cc; // integer lanes
__m64 dd; // legacy 64-bit MMX
• NEON
float64x2_t aa;
float32x4_t bb;
int32x4_t cc;
int16x8_t dd;
int8x16_t ee;
…
Variable Architecture
(Diagram: a 64-bit scalar register holds one value.
A 128-bit vector register holds 2×64-bit, 4×32-bit, 8×16-bit, or 16×8-bit lanes.)
General Purpose Register / Scalar Register: 1×
Vector Register: 2×, 4×, 8×, or 16× the data per register
Re-Definition
Re-Definition Variable
• SSE
typedef __m128d double2;
typedef __m128 float4;
typedef __m64 float2;
typedef __m128i int4;
typedef __m64 int2;
typedef __m128i uint4;
typedef __m64 uint2;
….
• NEON
typedef float64x2_t double2;
typedef float32x4_t float4;
typedef float32x2_t float2;
typedef int32x4_t int4;
typedef int32x2_t int2;
typedef uint32x4_t uint4;
typedef uint32x2_t uint2;
….
Re-Definition Register
union reg128 {
uchar16 _uchar16;
short8 _short8;
int4 _int4;
float4 _float4;
double2 _double2;
uchar _uchar[16];
short _short[8];
int _int[4];
float _float[4];
double _double[2];
…
void print_uchar()
{
printf("%d %d..\n",
_uchar[0],_uchar[1],_uchar[2]…);
}
void print_float()
{
printf("%f %f %f %f\n",
_float[0],_float[1]...);
}
…
};
To Find Instruction Set
SIMD, Step 2
Act of Memory
Memory
• As we approach physical limits, CPU computation is
not the bottleneck; it keeps improving every year,
while memory transfer speed stays roughly constant.
• SIMD cuts CPU cycles by more than 4× by processing
multiple data in parallel.
• Latency comes from moving data along the path
Memory → L2 → L1 → register Load/Store.
Memory Level latency
Reference: https://tinyurl.com/gsnfzoy
L1 cache
• Arrays created inside a function live on the stack, i.e. in
the L1 cache.
• When more live data is in use than the registers can hold,
the excess is written back to the L1 cache through the
stack pointer.
• Function arguments transfer data. (With partial -O3
optimization, some are passed directly in registers, with
no write-back.)
• A function call saves the current register contents via the
stack pointer, then reads them back when it returns.
• An interrupt or a thread switch writes the current
registers out, depending on the OS.
I-Cache/D-Cache
• Instruction Cache:
– Holds the compiled CPU instructions of the code
(the function symbol size). The first execution of a
function prefetches it; later runs hit the cache. For a
computer-vision application, exclude the first
execution when measuring efficiency.
• Data Cache:
– This is what we usually mean by "L1 cache"; the
established methodology is to prefetch data into it.
Page Table
• A page → 4096 bytes
• A cache line → 64 bytes
• A page contains 64 cache lines
• L2 cache → 5~10 MB
• L1 cache → 512 KB~1 MB
• L1 associativity → 2-way or 4-way
• An image → 320×240 or 640×480 bytes
• Do we get heavy cache misses when memory usage
exceeds the cache size?
Cache line
• 64 bytes = 16 floats
• 128 bits = 4 floats
(Diagram: cache lines addressed with 64-bit addresses.)
Worldview
• In the SIMD world, if you want to reach the limit, a
look-up table is usually not the optimal method; if
the table is large, keeping the data in vector registers
cuts cycles by 4× and ends up faster.
• At the extreme level of optimization, every small
load/store you eliminate has a very visible effect!
Known Methodology
int arr0[100] = {1,2,3…};
void test1 (float *src,float *dst,int len)
{
int arr1[100] = {1,2,3…};
int b =4;
int *arr2 = (int *)malloc(100*sizeof(int));
int c = len + b;
…
}
(Storage: arr0 → constant data in memory (DDR3); arr1 → stack, i.e. L1 cache;
b and c → immediates in the instruction stream; arr2 → heap memory (DDR3).)
Known Methodology
class a
{
int val = 3;
int map[100] = {1,2,3,4,5};
a();
…
};
(Member data lives in memory (DDR3), the same as a struct.)
The Compiler Is Not As Smart As You Think
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 cc=*src_dst_ptr + *src_dst_ptr;
*src_dst_ptr +=cc;
…
}
Three loads, one store
Correct Writing
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 val = *src_dst_ptr;
float4 cc= val + val;
*src_dst_ptr =cc + val;
…
}
One load, one store
Use Arrays Less,
Use Pointer++ More
void test1(float *src,float *dst,int len)
{
float4 *src_ptr =(float4 *)src;
float4 *dst_ptr=(float4 *)dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=8)
{
reg0=*src_ptr++;
reg1=*src_ptr++;
reg0 = reg0+reg1;
…..
*dst_ptr++=reg0;
*dst_ptr++=reg1;
}
}
• Not recommended:
void test2(float4 *src,float4 *dst,int len)
{
int len_4 = len/4;
float4 reg0,reg1…;
for(int i=0;i<len_4;i+=2)
{
reg0=src[i]+src[i+1];
…
dst[i]=reg0;
dst[i+1]=src[i+1];
}
}
Single Source And Destination,
To Avoid Cache Miss/Page Fault
void test1(float *src_dst, int len)
{
float4 *src_dst_ptr =(float4 *)src_dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=8)
{
reg0=src_dst_ptr[0];
reg1=src_dst_ptr[1];
reg0 = reg0+reg1;
…..
src_dst_ptr[0]=reg0;
src_dst_ptr[1]=reg1;
src_dst_ptr+=2;
}
}
• A cache line is 64 bytes; vector load/store wants
16-byte address alignment.
• If a vector register load/store is not at a
multiple-of-16 address:
– latency penalty
– depends on the CPU architecture, but it will almost
always occur
Align/Unalign
(Diagram: bytes from 0x0000 to 0x0020; a 128-bit load/store starting between
the 16-byte boundaries 0x0000 and 0x0010 straddles two aligned blocks.)
• Use the unaligned load/store instructions
• Load aligned data and extract the window with the
alignr (SSE) or vext (NEON) instruction
• Declare 16-byte alignment, or malloc extra space
and round the address up
Solve Method
(Diagram: reg3 = vext(reg0, reg1, 1) takes lanes 1–3 of reg0 and lane 0 of reg1.)
float __attribute__ ((aligned (16))) a[40];
float *raw = (float *)malloc(sizeof(float)*40 + 15);
float *b = (float *)(((uintptr_t)raw + 15) & ~(uintptr_t)0x0F); // keep raw for free()
SIMD, Step 3
Act of Register
(Diagram: a 64-bit scalar register versus a 128-bit vector register
split into 2×64-bit, 4×32-bit, 8×16-bit, or 16×8-bit lanes.)
Register
• Arm64
– 32 vector registers
– 31 scalar (general-purpose) registers
• Arm32
– 16 vector registers
– 16 scalar registers
• Intel SSE (x86-64)
– 16 vector registers
– 16 scalar registers
• DSP
– ? vector registers
– ? scalar registers
Always keep track of the number of registers in use
(very important!)
Why
• Assuming all-float data, 32 vector registers can
provide
– a simulated array of 128 elements (4 × 32)
– extreme computation with no write-back to memory
• Combine this with the shuffle instructions.
– If more than 32 vector variables are live at the same
time, the excess spills back to L1 and adds latency.
Why
float arr[4*32+4] = {…};
float4 *arr_ptr = (float4 *)arr;
float4 a0,a1,a2,a3,a4,a5,a6 … a32;
a0 = *arr_ptr++;
a1 = *arr_ptr++;
…
a32 = *arr_ptr++;
With more than 32 register variables live at the
same time, extra load/store is generated during
the computation and cannot be optimized away.
About Register
• A register has no data type of its own; the type is
defined only by the instructions used on it at the
assembly level.
• The number of variables holding live data must not
exceed the CPU's register count, but you should use
all of them.
• Vector registers
– Make good use of the shuffle instructions
– Rearrange the input/output data
Act of Load/Store
float4 *src_ptr = (float4 *)src;
float4 *dst_ptr=(float4 *)dst;
reg128 reg0,reg1,reg2,reg3… reg31;
for(int i=0;i<640*480;i+=4) {
reg0._float4 = *src_ptr++;
reg1._float4 = *src_ptr++;
reg2._float4 = *src_ptr++;
….
..
*dst_ptr++=reg0._float4;
*dst_ptr++=reg1._float4;
*dst_ptr++=reg2._float4;
….
}
(Annotations: src_ptr and dst_ptr occupy 2 general-purpose registers for
addressing, the loop counter 1 more; reg0…reg31 fully utilize all 32 vector
registers. Pattern: read everything at once, run the main algorithm, write
everything at once.)
Act of Function Call
void test1(float *src,float *dst,int len) {
int a= len/4;
int b= len%4;
float4 aa = *(float4 *)src;
float4 bb = *(float4 *)dst;
float4 cc = aa + bb;
int val=test2(src,dst,len);
cc = aa + bb + cc;
int c =(a+b+len)*val;
…
}
(Annotations: around the call to test2, two general-purpose registers (a, b)
are written to L1, producing load/store; one vector register (cc) likewise.
The src and dst addresses must be re-read from L1 into general-purpose
registers; the call clears the registers, the arguments are read back from L1,
and afterwards the original data is restored from L1 into the vector and
general-purpose registers. All of this is stack-pointer management.)
Act of Function Argument
void test3(float4 aa,float4 *bb,float4 &cc) {
…
}
void test4(float a,float *b,float &c) {
…
}
int main() {
float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2};
float a = 0,b=1,c=2;
test3(aa,&bb,&cc);
test4(a,&b,&c);
}
(With -O3: the pointer and reference arguments (&bb, &cc, &b, &c) each
produce load/store through L1; the by-value arguments aa and a go
directly through registers!)
Key Points
• Call by address and call by reference both go through
the L1 cache; unless inlining succeeds, they are
bound to be slow.
• Reduce function calls, all the way to the end.
Act of Branch Instruction
float a[100],b[100];
for(int i=0;i<100;i++)
{
if(a[i]<50)
b[i]=a[i];
else
b[i] = 30;
}
Cycle estimate: loop setup 1; i++ 1×100; compare 1×100;
if-test load + compare + branch (1+1+1)×100; the two
assignment paths (1+1)×100×2; loop branch (1+1)×100
1101 cycles
Act of Branch Instruction
float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b;
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
Cycle estimate: loop overhead 1 + 2×25; body 9 instructions × 25 iterations; setup 1 + 1
278 cycles
3.96× faster
Analyse SIMD Branch
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
if(a[i]<50) b[i]=a[i];
else b[i] =30;
Each lane of the 128-bit compare result is all ones (true, 32 one-bits)
or all zeros (false, 32 zero-bits).

Worked example on abbreviated bits:
mask = 0000 1111
a = 0011 0100
a AND mask = 0000 0100
NOT mask = 1111 0000
val0 = 1011 0001
val0 AND (NOT mask) = 1011 0000
result = (a AND mask) XOR (val0 AND NOT mask)
= 0000 0100 XOR 1011 0000 = 1011 0100
Act of Branch Instruction
• Compared with a normal comparison loop, this is
more than 4× faster.
• There is no branch prediction at all: the CPU pipeline
can never mispredict and flush; it just runs straight
through to the end (explosively fast).
Act of Shuffle
• The instruction set is like the sea; find the shuffle
that fits best.
– That is the key to extreme optimization of the
mathematical model.
– If you have never written a shuffle, you cannot say
you write SIMD.
Act of Shuffle
• Ex: Matrix Transpose
Act of Shuffle
4 cycles
Step 1, vtrnq on row pairs:
reg0 = { 1, 2, 3, 4} reg1 = { 5, 6, 7, 8} → reg0 = {1, 5, 3, 7} reg1 = {2, 6, 4, 8}
reg2 = { 9,10,11,12} reg3 = {13,14,15,16} → reg2 = {9,13,11,15} reg3 = {10,14,12,16}
Step 2, swap the 64-bit halves:
reg0 = {1, 5, 9,13} reg1 = {2, 6,10,14}
reg2 = {3, 7,11,15} reg3 = {4, 8,12,16}
Act of Shuffle
for(int i=0;i<4;i++)
for(int j=i;j<4;j++)
{
int index0=i*4+j,index1=j*4+i;
float temp=a[index0];
a[index0]=a[index1];
a[index1]=temp;
}
Cycle estimate: loop overhead (4+3+2+1)×3 + 4 + 4 + 1; per the 10 swaps:
index computation (1+1+1+1)×10, two loads (1+1)×10 each, two stores (1+1)×10×2
159 cycles
Act of Shuffle
reg256 temp0,temp1;
reg128 reg0,reg1,reg2,reg3;
temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4);
temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4);
float2 temp =temp0._float2[1];
temp0._float2[1]=temp1._float2[0];
temp1._float2[0]=temp;
temp=temp0._float2[3];
temp0._float2[3]=temp1._float2[2];
temp1._float2[2]=temp;
4 cycles
39.75× faster
Act of Shuffle
• Ex: 4×4 matrix multiply via transpose
– transpose: 4 cycles
– mul: 16 cycles
– vpadd: 12 cycles
Data Type Conversion
• uchar16 → short8 → int4 → float4
• float4 → int4 → uchar16
(Diagram: a 16-byte image chunk widens from 8-bit through 16-bit
to 32-bit lanes and narrows back.)
SIMD, Step 4
Act of Compiler
Methodology of O3 Optimization
• Clang or gcc?
• A new compiler version beats an old one by far.
Latency && Throughput
• Consecutive loads or consecutive stores reduce
latency.
• Architecture-specific pipeline rearrangement can
hide latency gaps.
– VLIW, SLOT
• Register instructions carry dependency penalties.
– Rearrange loads/stores by yourself
– The compiler can deal with the dependency penalty
About inline
• always_inline still has a chance to fail; check the
function symbols in the assembly.
– In Clang, the number of lines of code is the key factor.
Vector Register Optimization
• Contemporary compilers can NOT vectorize specific
algorithms with SIMD, because roughly 90% of
algorithms need many shuffle instructions that have
to be worked out by hand on paper.
• The compiler only vectorizes simple unrolled
for loops.
for(int i=0;i<64;i++)
{
….
}
Compiler says:
I know how to do
vectorization
Read Element and Write Back
reg128 reg0;
float4 a= {0,1,2,3};
reg0._float4 = a;
float2 val1= reg0._float2[0];
reg0._float2[1]=val1;
float val0=reg0._float[2];
reg0._float[3] = val0;
1. Check whether a lane-access instruction exists; if not,
the access goes through L1 load/store, the same as an array.
2. It depends on how smart the compiler is!
(Diagram: lanes 0–3 of the register; each statement above is
marked as a read or a write of individual lanes.)
Dumping the Assembly Is Important
SIMD, Step 5
Methodology of Extreme Optimization
• Fix all of the algorithm's parameters
– Make them compile-time constants
• Remove the branches; the code becomes very
large, but fast.
• Don't doubt it: such code casually exceeds 4000
lines.
Conception of SIMD Optimization
FunctionA (Algorithm A) → FunctionB (Algorithm B) → FunctionC (Algorithm C) → FunctionEnd (Final Algorithm)
(Diagram: spending one month each optimizing Functions A, B, and C is
wasted time; that code ends up unused. Only the final combined algorithm
needs the one month of SIMD development.)
Problems Faced on a Daily Basis
About Data
• Large data has to
– be padded to a multiple of 4
– have a known maximum quantity.
• With the input data rearranged, you can fly.
• When the length is not a multiple of 4
– pad with zeros and still use SIMD
– or finish the tail with general-purpose registers.
Data Rearrangement
Interleaved → planar:

a b a b a b ...      a a a a a a ...
a b a b a b ...  →   b b b b b b ...

a b c a b c ...      a a a a a a ...
a b c a b c ...  →   b b b b b b ...
a b c a b c ...      c c c c c c ...
Unrolling by Your Hands
Image
Traditional Method
for(int i=0;i<height;i++)
{
for(int j=0;j<width;j++)
{
if(...) // top
else if(...) // bottom
else if(...) // left
else if(...) // right
// middle
}
}
SIMD
for(int i=0;i<height;i++) // top
{ ...}
for(int i=0;i<height;i++)
{
// left
for(int j=0;j<width;j++) // middle
{ ... }
// right
}
for(int i=0;i<height;i++) // bottom
{ ...}
To cooperate with SIMD: crazy, unlimited unrolling.
The I-cache is big enough (over 32 KB);
if it is not, we will deal with that then.
Conclusion
• SIMD is strongly linked to mathematics.
• It is a little-known field, with almost no courses.
• There is little material on the internet, and few
people have written up their successes.
• Do you want to develop new algorithms?
You can try it.
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 

Happy To Use SIMD

  • 1. Happy To Use SIMD Weita, Wang
  • 2. SIMD??? cv::Mat a(4,4,CV_32FC1); cv::Mat b(4,4,CV_32FC1); cv::Mat c(4,4,CV_32FC1),d; d= a*b+c; Eigen::MatrixXf a(3,3); Eigen::MatrixXf b(3,3); Eigen::MatrixXf c(3,3),d; d=a*b+c; Memory Register Compiler
  • 3. What is SIMD? • The Extreme Optimization for C/C++ • Pointer only • You have to define the exact behavior of memory, registers, and the compiler; that is how you challenge the limit. C/C++ level Assembly level SIMD
  • 4. C=A+B ? float arr0[4] = { 1,2,3,4 }; float arr1[4] = { 5,6,7,8 }; float arr2[4] = { 0 }; A B C A B C + = Result: arr2[4] => { 6,8,10,12 };
  • 5. Why is SIMD fast? for(int i=0;i<4;i++) arr2[i]=arr0[i]+arr1[i]; for(int i=0;i<4;i++) *(arr2 + i) = *(arr0 + i)+*(arr1 + i); 1 1*4 (1+1)*4 (1+1)*4 (1+1)*4 37 cycles 1*4 1*4 Assume every operation takes 1 instruction and 1 cycle
  • 6. Why is SIMD fast? float32x4_t a,b,c; a=*(float32x4_t *)arr0; b=*(float32x4_t *)arr1; c=a+b; *(float32x4_t *)arr2=c; 4 cycles 9x faster
  • 7. SIMD, Step 1 You have to calculate cycle count
  • 8. Variable Architecture double a; //64bits float b; //32bits int c; //32bits short d; //16bits char e; //8bits unsigned int f; //32bits unsigned short g; //16bits unsigned char h; //8bits …. • SSE __m128d aa; __m128 bb; __m128i cc; __m128i dd; • NEON float64x2_t aa; float32x4_t bb; int32x4_t cc; int16x8_t dd; int8x16_t ee; …
  • 9. Variable Architecture 64 32 32 32 32 32 32 64 64 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0 64bits 0 128bits General Purpose Register Scalar Register Vector Register 1x 1x 2x 4x 16x 8x
  • 11. Re-Definition Variable • SSE typedef __m128d double2; typedef __m128 float4; typedef __m64 float2; typedef __m128i int4; typedef __m64 int2; typedef __m128i uint4; typedef __m64 uint2; …. • NEON typedef float64x2_t double2; typedef float32x4_t float4; typedef float32x2_t float2; typedef int32x4_t int4; typedef int32x2_t int2; typedef uint32x4_t uint4; typedef uint32x2_t uint2; ….
  • 12. Re-Definition Register union reg128 { uchar16 _uchar16; short8 _short8; int4 _int4; float4 _float4; double2 _double2; uchar _uchar[16]; short _short[8]; int _int[4]; float _float[4]; double _double[2]; … void print_uchar() { printf("%d %d...\n", _uchar[0],_uchar[1],_uchar[2]….); } void print_float() { printf("%f %f %f %f\n", _float[0],_float[1]...); } … };
  • 14. SIMD, Step 2 Act of Memory
  • 15. Memory • In this era of approaching physical limits, CPU computation is not the bottleneck; it still improves every year, while memory transfer speed stays roughly constant. • SIMD cuts CPU cycles by 4x or more by processing multiple data in parallel. • Data pays latency along the path Memory → L2 → L1 → register on every Load/Store.
  • 16. Memory Level latency Reference: https://tinyurl.com/gsnfzoy
  • 17. L1 cache • Arrays created inside a function live here. • When more values are live than the available registers can hold, the excess is written back to the L1 cache via the stack pointer. • Function arguments transfer data through it (with -O3, some arguments pass directly in registers, with no write-back). • A function call saves the current registers to the stack; when it returns, the data is read back into registers. • An interrupt or thread switch writes the current registers out, depending on the OS.
  • 18. I-Cache/D-Cache • Instruction Cache: – Holds the compiled CPU instructions (the function symbol size). The first execution of a function fetches them; the second run hits the cache. In a computer-vision application, exclude a function's first run when measuring efficiency. • Data Cache: – This is what we usually mean by "L1 cache"; the established methodology is to pre-fetch data into it.
  • 19. Page Table • A page → 4096 bytes • A cache line → 64 bytes • A page contains 64 cache lines • L2 cache → 5~10 MB • L1 cache → 512 KB~1 MB • L1 entry ways → 2-way or 4-way • An image → 320*240 or 640*480 bytes • Will there be heavy cache misses when memory usage exceeds the cache size?
  • 20. Cache line • 64 bytes= 16 float • 128 bits = 4 float To Address 64 bits To Address 64 bits
  • 21. Worldview • In the SIMD world, if you want to reach the limit, a look-up table is usually not the optimal method; when the table is large, using vector registers instead can save 4x the cycles and run even faster. • In extreme optimization, once you keep Loads/Stores small, the effect is very obvious!
  • 22. Known Methodology int arr0[100] = {1,2,3…}; void test1 (float *src,float *dst,int len) { int arr1[100] = {1,2,3…}; int b =4; int *arr2 = (int *)malloc(100*sizeof(int)); int c = len + b; … } Memory (DDR3) L1 cache Instruction set Const Memory (DDR3)
  • 23. Known Methodology class a { int val = 3; int map[100] = {1,2,3,4,5}; a(); … }; Memory (DDR3) Same as struct
  • 24. The Compiler Is Not As Smart As You Think void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 cc=*src_dst_ptr + *src_dst_ptr; *src_dst_ptr +=cc; … } Three Loads One Store
  • 25. Correct Writing void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 val = *src_dst_ptr; float4 cc= val + val; *src_dst_ptr =cc + val; … } One Load One Store
  • 26. Use Arrays Less, Use Pointer++ More void test1(float *src,float *dst,int len) { float4 *src_ptr =(float4 *)src; float4 *dst_ptr=(float4 *)dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_ptr++; reg1=*src_ptr++; reg0 = reg0+reg1; ….. *dst_ptr++=reg0; *dst_ptr++=reg1; } } • Not recommended: void test2(float4 *src,float4 *dst,int len) { int len_4 = len/4; float4 reg0,reg1…; for(int i=0;i<len_4;i+=2) { reg0=src[i]+src[i+1]; … dst[i]=reg0; dst[i+1]=src[i+1]; } }
  • 27. Single Source And Destination, To Avoid Cache Miss/Page Fault void test1(float *src_dst, int len) { float4 *src_dst_ptr =(float4 *)src_dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_dst_ptr++; reg1=*src_dst_ptr++; reg0 = reg0+reg1; ….. *src_dst_ptr++=reg0; *src_dst_ptr++=reg1; } }
  • 28. • A cache line is 64 bytes, with 16-byte address alignment. • If a vector register Load/Store is not at a 16-byte-aligned address: – latency penalty – depending on the CPU architecture, it will almost always occur. Align/Unalign 0x0000 0x0010 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0x0020 From Here, Start to Load/Store 128bits
  • 29. • Choose the unaligned Load/Store intrinsics • Access aligned data and use the alignr or vext instruction • Declare 16-byte alignment, or malloc and shift the address (keep the original pointer if you still need to free it) Solve Method 32 32 32 32 32 32 32 32 reg0 reg3=vext(reg0,reg1,1) reg1 float __attribute__ ((aligned (16))) a[40]; float *b=(float *)malloc(sizeof(float)*40); b= (float*)(((unsigned long)b + 15) & (~0x0F))
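The address-rounding trick on this slide can be checked with a small scalar sketch; `align16` is a hypothetical helper name, not from the deck:

```c
#include <assert.h>
#include <stdint.h>

/* Round a pointer up to the next 16-byte boundary, exactly as the
   slide's (p + 15) & ~0x0F expression does, but with uintptr_t so it
   is portable across 32- and 64-bit targets. Keep the original
   pointer around if you still need to free() it. */
static float *align16(float *p) {
    return (float *)(((uintptr_t)p + 15) & ~(uintptr_t)0x0F);
}
```

Over-allocate by 15 bytes so the rounded-up pointer still lies inside the buffer.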
  • 30. SIMD, Step 3 Act of Register 64 32 32 0 64bits 32 32 32 32 64 64 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0 128bits
  • 31. Register • Arm64 – 32 Vector Registers – 32 Scalar Registers • Arm32 – 16 Vector Registers – 16 Scalar Registers • Intel SSE – 16 Vector Registers – 16 Scalar Registers • DSP – ? Vector Registers – ? Scalar Registers You have to remember the number of registers in use (Very Important!!)
  • 32. Why • Under the premise of all-float data, 32 vector registers provide – a 128-float simulated array space (4*32) – extreme operation with no need to write back to memory • With the help of the shuffle instructions – If you use more than 32 variables at the same time, the excess data is written back to L1, causing latency.
  • 33. Why float arr[4*32+4] = {…}; float4 *arr_ptr = (float4 *)arr; float4 a0,a1,a2,a3,a4,a5,a6 … a32; a0 = *arr_ptr++; a1 = *arr_ptr++; … a32 = *arr_ptr++; More than 32 register variables live at the same time causes extra Load/Store during processing and cannot be optimized away.
  • 34. About Register • A register has no data type in itself; the type is defined only by the instruction used at the assembly level. • Variables holding data must not exceed the maximum number of CPU registers, but should use them fully. • Vector Register – Make good use of the shuffle instructions – Rearrange input/output data
  • 35. Act of Load/Store float4 *src_ptr = (float4 *)src; float4 *dst_ptr=(float4 *)dst; reg128 reg0,reg1,reg2,reg3… reg31; for(int i=0;i<640*480;i+=4) { reg0._float4 = *src_ptr++; reg1._float4 = *src_ptr++; reg2._float4 = *src_ptr++; …. .. *dst_ptr++=reg0._float4; *dst_ptr++=reg1._float4; *dst_ptr++=reg2._float4; …. } 2 General Purpose Registers (Addressing) 32 Vector Registers (fully utilized) 1 General Purpose Register Read All at Once Write All at Once Main Algorithm
  • 36. Act of Function Call void test1(float *src,float *dst,int len) { int a= len/4; int b= len%4; float4 aa = *(float4 *)src; float4 bb = *(float4 *)dst; float4 cc = aa + bb; int val=test2(src,dst,len); cc = aa + bb + cc; int c =(a+b+len)*val; … } 2 General Purpose Registers write to the L1 cache, producing Load/Store 1 Vector Register writes to the L1 cache, producing Load/Store Read the src,dst addresses into General Purpose Registers from the L1 cache The registers are cleaned up; arguments are read from the L1 cache into registers The original data returns from the L1 cache to Vector/General Purpose Registers Stack Pointer management
  • 37. Act of Function Argument void test3(float4 aa,float4 *bb,float4 &cc) { … } void test4(float a,float *b,float &c) { … } int main() { float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2}; float a = 0,b=1,c=2; test3(aa,&bb,&cc); test4(a,&b,&c); } Under the premise of O3 Produces Load/Store Produces Load/Store Produces Load/Store Produces Load/Store Directly through registers! Directly through registers!
  • 38. Key Points • Call by address and call by reference both access the L1 cache; unless inlining succeeds, they must be slow. • Reduce function usage, all the way to the end.
  • 39. Act of Branch Instruction float a[100],b[100]; for(int i=0;i<100;i++) { if(a[i]<50) b[i]=a[i]; else b[i] = 30; } 1 100 100 (1+1+1)*100 (1+1)*100*2 (1+1)*100 1101 cycles
  • 40. Act of Branch Instruction float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b; float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } 1+2*25 9*25 1 1 278 cycles 3.96x
  • 41. Analyse SIMD Branch float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } if(a[i]<50) b[i]=a[i]; else b[i] =30; 11..1 00..0 11..1 00..0 If true, 32 ones If false, 32 zeros 0 128 0000 1111 0011 0100 0000 0100 AND 1111 0000 1011 0001 1011 0000 NOT AND 1011 0000 0000 0100 1011 0100 XOR
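The mask trick on slides 40 and 41 can be modeled lane-by-lane in plain C. This is a scalar sketch (the name `branchless_select` is mine, not the deck's), using the same all-ones/all-zeros mask plus AND/NOT/XOR on the raw float bits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* b = (a < 50) ? a : 30, computed the way the slide's NEON code does:
   vcltq_f32 yields an all-ones or all-zeros lane, then
   vandq / vmvnq+vandq / veorq combine the two candidate values. */
static float branchless_select(float a) {
    uint32_t abits, cbits, mask, r;
    const float c = 30.0f;
    float out;
    memcpy(&abits, &a, sizeof abits);
    memcpy(&cbits, &c, sizeof cbits);
    mask = (a < 50.0f) ? 0xFFFFFFFFu : 0u;  /* what vcltq_f32 produces per lane */
    r = (abits & mask) ^ (cbits & ~mask);   /* vandq, vmvnq+vandq, veorq */
    memcpy(&out, &r, sizeof out);
    return out;
}
```

On real NEON the whole AND/NOT/XOR tail can also be folded into the single bit-select instruction vbslq_f32.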
  • 42. Act of Branch Instruction • Compared with the normal comparison operation, it is more than 4x faster. • There is no branch prediction: the CPU pipeline never mispredicts and flushes, so it runs straight to the end (explosively fast).
  • 43. Act of Shuffle • The instruction sets are like the sea; find the best-fitting shuffle. – That is the key to extreme optimization of the mathematical model. – If you haven't written shuffles, you can't say you write SIMD.
  • 44. Act of Shuffle • Ex: Matrix Transpose
  • 45. Act of Shuffle 4 cycles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 5 3 7 2 6 4 8 reg0 reg1 reg2 reg3 reg0 reg1 vtrnq 9 13 11 15 10 14 12 16 reg2 reg3 1 5 9 13 2 6 10 14 reg0 reg1 3 7 11 15 4 8 12 16 reg2 reg3 vtrn
  • 46. Act of Shuffle for(int i=0;i<4;i++) for(int j=i;j<4;j++) { int index0=i*4+j,index1=j*4+i; float temp=a[index0]; a[index0]=a[index1]; a[index1]=temp; } (4+3+2+1)*3 4 4 1 (1+1)*10 (1+1)*10 (1+1)*10*2 159 cycles (1+1+1+1)*10
  • 47. Act of Shuffle reg256 temp0,temp1; reg128 reg0,reg1,reg2,reg3; temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4); temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4); float2 temp =temp0._float2[1]; temp0._float2[1]=temp1._float2[0]; temp1._float2[0]=temp; temp=temp0._float2[3]; temp0._float2[3]=temp1._float2[2]; temp1._float2[2]=temp; 4 cycles vtrn vtrn 39.75x
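The two-step vtrn transpose on slides 45 to 47 can be modeled in scalar C to see the data movement; `transpose4x4_trn` is a hypothetical name and a sketch of the same steps, not the deck's code:

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the slide's two-step NEON 4x4 transpose:
   vtrn on each row pair, then swapping 64-bit (two-float) halves. */
static void transpose4x4_trn(float m[16]) {
    float r[4][4];
    memcpy(r, m, sizeof(r));
    /* step 1: vtrn(r0,r1) and vtrn(r2,r3): swap the off-diagonal
       lanes of each 2x2 block, e.g. {1,2,3,4},{5,6,7,8} ->
       {1,5,3,7},{2,6,4,8} */
    for (int p = 0; p < 4; p += 2)
        for (int j = 0; j < 4; j += 2) {
            float t = r[p][j + 1];
            r[p][j + 1] = r[p + 1][j];
            r[p + 1][j] = t;
        }
    /* step 2: swap the high two floats of r0/r1 with the low two
       of r2/r3 (the slide's float2 element swaps) */
    for (int p = 0; p < 2; p++)
        for (int k = 0; k < 2; k++) {
            float t = r[p][2 + k];
            r[p][2 + k] = r[p + 2][k];
            r[p + 2][k] = t;
        }
    memcpy(m, r, sizeof(r));
}
```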
  • 48. Act of Shuffle • Ex: Matrix Transpose transpose 4 cycles mul 16 cycles vpadd 12 cycles
  • 49. Data Type Conversion • uchar16short8int4float4 • float4int4uchar16 Image 32 32 32 32 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 16 16 16 16 16 16 16 16 32 32 32 32
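The narrowing path on slide 49 (float4 → int4 → uchar16) saturates at each NEON narrowing step; here is a per-lane scalar sketch (the name `float_to_u8_sat` is hypothetical):

```c
#include <assert.h>

/* Scalar model of the slide's narrowing path float -> int -> uchar:
   convert truncates toward zero (as vcvtq_s32_f32 does), and the
   narrowing store saturates to the uchar range [0,255]. */
static unsigned char float_to_u8_sat(float v) {
    int i = (int)v;      /* truncate toward zero */
    if (i < 0)   i = 0;  /* saturate low  */
    if (i > 255) i = 255;/* saturate high */
    return (unsigned char)i;
}
```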
  • 50. SIMD, Step 4 Act of Compiler
  • 51. Methodology of O3 Optimization • Clang gcc ??? • New Version >>>>> Old Version
  • 52. Latency && Throughput • Consecutive Loads or consecutive Stores reduce latency. • Specific pipeline rearrangement can hide latency. – VLIW, SLOT • Register dependencies between instructions incur a penalty. – Rearrange Load/Store by yourself – The compiler can deal with the dependency penalty
  • 53. About inline • Always-inline still has a chance to fail; you should check the function symbols in the assembly. – Lines of code are the key factor in Clang
  • 54. Vector Register Optimization • Contemporary compilers can NOT optimize certain algorithms with SIMD, because 90% of algorithms need a lot of shuffle instructions; you have to work those out on paper. • The compiler only knows how to vectorize simple for-loop unrolling. for(int i=0;i<64;i++) { …. } Compiler says: I know how to do vectorization
  • 55. Read Element and Write Back reg128 reg0; float4 a= {0,1,2,3}; reg0._float4 = a; float2 val1= reg0._float2[0]; reg0._float2[1]=val1; float val0=reg0._float[2]; reg0._float[3] = val0; 1. If the instruction is not supported, it writes to L1 with Load/Store, the same as an array. 2. It depends on whether the compiler is smart or not!! 0 1 2 3 write read read write write
  • 56. Dump Assembly is Important
  • 57. SIMD, Step 5 Methodology of Extreme Optimization
  • 58. • Fix all of the algorithm's parameters – Make them constant values • Remove branch prediction; the code will be huge, but fast • Don't be surprised: the code easily exceeds 4000 lines.
  • 59. Conception of SIMD Optimization FunctionA Algorithm A FunctionB Algorithm B FunctionC Algorithm C FunctionEnd (Final Algorithm) Develop One Month The previous code is of no use; only the final fused algorithm needed its one month of development Waste Time Develop One Month Develop One Month
  • 60. The Problems face on a daily basis
  • 61. About Data • Large data has to – be a multiple of 4 in size. – have a known maximum quantity. • Rearranging input data can make performance soar. • When the size is not a multiple of 4 – pad with zeros and still use SIMD – or handle the tail with General Purpose Registers.
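Rounding a length up to the next multiple of 4, as the padding bullet suggests, is a single add-and-mask; `pad_to_4` is a hypothetical helper name:

```c
#include <assert.h>

/* Round len up to a multiple of 4 so a 4-wide vector loop can
   process the whole (zero-padded) buffer without a scalar tail. */
static int pad_to_4(int len) {
    return (len + 3) & ~3;
}
```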
  • 62. Data Rearrangement a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a a a a a a ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... b b b b b b ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... c c c c c c ... c c c c c c ...
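The rearrangement drawn on slide 62 is a deinterleave from an interleaved {a,b} stream into planar arrays; this scalar sketch shows the data movement that NEON's vld2 performs in one instruction (the name `deinterleave2` is hypothetical):

```c
#include <assert.h>

/* Deinterleave an a,b,a,b,... stream (array-of-structures) into two
   planar arrays (structure-of-arrays), the SIMD-friendly layout. */
static void deinterleave2(const float *src, float *a, float *b, int pairs) {
    for (int i = 0; i < pairs; i++) {
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
}
```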
  • 63. Unrolling by Your Hands Image
  • 64. Tradition Method for(int i=0;i<height;i++) { for(int j=0;j<width;j++) { if(...) // top else if(...) // bottom else if(...) // left else if(...) // right // middle } }
  • 65. SIMD for(int i=0;i<height;i++) // top { ...} for(int i=0;i<height;i++) { // left for(int j=0;j<width;j++) // middle { ... } // right } for(int i=0;i<height;i++) // bottom { ...}
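Slide 65's idea, hoisting the border cases out of the loop so the middle is branch-free, sketched on a 1-D 3-tap average (the name `smooth1d` is hypothetical; the deck shows only the loop skeleton):

```c
#include <assert.h>

/* Instead of testing for the border inside one loop (slide 64),
   handle the left and right edges separately so the main body has
   no branches and could be processed 4 lanes at a time. */
static void smooth1d(const float *src, float *dst, int len) {
    dst[0] = src[0];                     /* left border  */
    for (int i = 1; i < len - 1; i++)    /* branch-free middle */
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
    dst[len - 1] = src[len - 1];         /* right border */
}
```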
  • 66. In Order To Cooperate With SIMD, Crazy, Unlimited Unrolling The I-cache is enough (over 32KB); if it's not enough, we'll talk about it then
  • 67. Conclusion • SIMD is strongly linked to mathematics • An unknown field, with almost no courses. • Little material on the internet, and few people have documented success. • Do you want to develop new algorithms? You can try it.

Editor's Notes

  1. Simulating register behavior; the assembly level has no types
  2. o
  3. Under the premise of