Happy To Use SIMD
Weita, Wang
SIMD???
cv::Mat a(4,4,CV_32FC1);
cv::Mat b(4,4,CV_32FC1);
cv::Mat c(4,4,CV_32FC1),d;
d= a*b+c;
Eigen::MatrixXf a(3,3);
Eigen::MatrixXf b(3,3);
Eigen::MatrixXf c(3,3),d;
d=a*b+c;
Memory Register
Compiler
What is SIMD?
• The extreme optimization level for C/C++
• Pointers only
• You have to know exactly how memory, registers, and the
compiler behave — then you can push against the hardware's limits.
C/C++ level
Assembly
level
SIMD
C=A+B ?
float arr0[4] = { 1,2,3,4 };
float arr1[4] = { 5,6,7,8 };
float arr2[4] = { 0 };
(diagram: each lane of A and B is added element-wise into C)
Result: arr2[4] => { 6,8,10,12 };
Why is SIMD fast?
for(int i=0;i<4;i++)
arr2[i]=arr0[i]+arr1[i];
for(int i=0;i<4;i++)
*(arr2 + i) = *(arr0 + i)+*(arr1 + i);
Assume every instruction takes 1 cycle. Loop setup is 1 cycle;
each of the 4 iterations pays for the index increment and compare,
the two loads, the add, and the store — about 37 cycles in total.
Why is SIMD fast?
float32x4_t a,b,c;
a=*(float32x4_t *)arr0;
b=*(float32x4_t *)arr1;
c=a+b;
*(float32x4_t *)arr2=c;
4 cycles
~9x faster
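The deck's 4-cycle version uses the NEON type `float32x4_t`. As a portable sketch of the same idea — assuming a GCC/Clang compiler that supports the `vector_size` extension, with a hypothetical helper `add4` — the whole C = A + B collapses into one load, one load, one add, one store:

```c
#include <string.h>

/* Hypothetical portable stand-in for float32x4_t / __m128,
   assuming GCC/Clang support for the vector_size extension. */
typedef float float4 __attribute__((vector_size(16)));

/* add4 is an illustrative name, not from the slides. */
void add4(const float *a, const float *b, float *out)
{
    float4 va, vb, vc;
    memcpy(&va, a, sizeof va);   /* one vector load (alignment-safe) */
    memcpy(&vb, b, sizeof vb);   /* one vector load                  */
    vc = va + vb;                /* one vector add over all 4 lanes  */
    memcpy(out, &vc, sizeof vc); /* one vector store                 */
}
```

The `memcpy` loads sidestep the alignment trap discussed later in the deck; a real inner loop would use aligned pointers instead.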
SIMD, Step 1
You have to count cycles
Variable Architecture
double a; //64bits
float b; //32bits
int c; //32bits
short d; //16bits
char e; //8bits
unsigned int f; //32bits
unsigned short g; //16bits
unsigned char h; //8bits
….
• SSE
__m128d aa;
__m128 bb;
__m128i cc;
__m64 dd;
• NEON
float64x2_t aa;
float32x4_t bb;
int32x4_t cc;
int16x8_t dd;
int8x16_t ee;
…
Variable Architecture
(diagram) A general-purpose (scalar) register holds a single 64-bit or
32-bit value — 1x. A 128-bit vector register splits into 2×64-bit,
4×32-bit, 8×16-bit, or 16×8-bit lanes — 2x, 4x, 8x, or 16x the work
per instruction.
Re-Definition
Re-Definition Variable
• SSE
typedef __m128d double2;
typedef __m128 float4;
typedef __m64 float2;
typedef __m128i int4;
typedef __m64 int2;
typedef __m128i uint4;
typedef __m64 uint2;
….
• NEON
typedef float64x2_t double2;
typedef float32x4_t float4;
typedef float32x2_t float2;
typedef int32x4_t int4;
typedef int32x2_t int2;
typedef uint32x4_t uint4;
typedef uint32x2_t uint2;
….
Re-Definition Register
union reg128 {
uchar16 _uchar16;
short8 _short8;
int4 _int4;
float4 _float4;
double2 _double2;
uchar _uchar[16];
short _short[8];
int _int[4];
float _float[4];
double _double[2];
…
void print_uchar()
{
printf("%d %d..\n",
_uchar[0],_uchar[1],_uchar[2]…);
}
void print_float()
{
printf("%f %f %f %f\n",
_float[0],_float[1]...);
}
…
};
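A minimal, portable sketch of the `reg128` idea — using fixed-width scalar arrays in place of the platform-specific vector members (`uchar16`, `int4`, …), which is an assumption for illustration, not the deck's exact union:

```c
#include <stdint.h>

/* All members alias the same 128 bits; writing through one view
   and reading through another reinterprets the raw bytes. */
typedef union reg128 {
    uint8_t  _uchar[16];
    int16_t  _short[8];
    int32_t  _int[4];
    float    _float[4];
    double   _double[2];
} reg128;
</antml\>```

Writing `1.0f` into `_float[0]` and reading `_int[0]` back gives the IEEE-754 bit pattern 0x3f800000 (on a little-endian machine) — exactly the kind of reinterpretation the mask tricks later in the deck rely on.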
To Find Instruction Set
SIMD, Step 2
Act of Memory
Memory
• As we approach physical limits, CPU operation is no longer the
bottleneck — core speed still improves every year, while memory
transfer speed stays roughly constant.
• SIMD cuts CPU cycles by 4x or more when data is processed in
parallel.
• The real latency comes from moving data through Memory
→ L2 → L1 → register Load/Store.
Memory Level latency
Reference: https://tinyurl.com/gsnfzoy
L1 cache
• Create array in the function.
• The registers are used over established quantity for
reserving data, it will write back L1 cache by stack
pointer.
• Function arguments transfer data. (partial –O3 opt.
will through out by registers, No write back)
• Function call, to save current registers of data by
stack point, when it’s finished, read data out to
registers.
• Interrupt or exchanging thread, will write through
current registers of data, depend on OS capability.
I-Cache/D-Cache
• Instruction Cache:
– Holds the compiled CPU instructions (the function
symbol's code). The first execution of a function
triggers a fetch; later executions hit the cache. For a
computer-vision application, exclude the first execution
of a function when measuring efficiency.
• Data Cache:
– This is what we usually mean by "L1 cache"; the established
methodology is to pre-fetch data into L1.
Page Table
• A page → 4096 bytes
• A cache line → 64 bytes
• A page contains 64 cache lines
• L2 cache → on the order of 1~10 MB
• L1 cache → typically 32~64 KB
• L1 is 2-way or 4-way set-associative
• An image → 320*240 or 640*480 bytes
• Will there be heavy cache misses when memory usage
exceeds the cache size?????
Cache line
• 64 bytes= 16 float
• 128 bits = 4 float
Worldview
• In the SIMD world, if you want to reach the limit, a look-up
table is usually not the optimal method; when the table is
large, computing the values in vector registers instead cuts
cycles by 4x or more.
• In extreme optimization, once you keep Load/Store traffic
small, the effect is very obvious!
Known Methodology
int arr0[100] = {1,2,3…};
void test1 (float *src,float *dst,int len)
{
int arr1[100] = {1,2,3…};
int b =4;
int *arr2 = (int *)malloc(100*sizeof(int));
int c = len + b;
…
}
(arr0 — global: Memory (DDR3); arr1 — local array: L1 cache;
b — constant: immediate in the instruction stream;
arr2 — malloc: Memory (DDR3))
Known Methodology
class a
{
int val = 3;
int map[100] = {1,2,3,4,5};
a();
…
};
(the member data lives in Memory (DDR3) — same as a struct)
The Compiler Is Not As Smart As You
Think
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 cc=*src_dst_ptr + *src_dst_ptr;
*src_dst_ptr +=cc;
…
}
Three Loads
One Store
Correct Writing
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 val = *src_dst_ptr;
float4 cc= val + val;
*src_dst_ptr =cc + val;
…
} One Load
One Store
Use Arrays Less,
Use Pointer++ More
void test1(float *src,float *dst,int len)
{
float4 *src_ptr =(float4 *)src;
float4 *dst_ptr=(float4 *)dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=4)
{
reg0=*src_ptr++;
reg1=*src_ptr++;
reg0 = reg0+reg1;
…..
*dst_ptr++=reg0;
*dst_ptr++=reg1;
}
}
• Not recommended
void test2(float4 *src,float4 *dst,int len)
{
int len_4 = len/4;
float4 reg0,reg1…;
for(int i=0;i<len_4;i+=2)
{
reg0=src[i]+src[i+1];
…
dst[i]=reg0;
dst[i+1]=src[i+1];
}
}
Single Source And Destination,
To Avoid Cache Miss/Page Fault
void test1(float *src_dst, int len)
{
float4 *src_dst_ptr =(float4 *)src_dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=4)
{
reg0=*src_dst_ptr++;
reg1=*src_dst_ptr++;
reg0 = reg0+reg1;
…..
*src_dst_ptr++=reg0;
*src_dst_ptr++=reg1;
}
}
• A cache line is 64 bytes, with 16-byte address
alignment.
• If a vector register Load/Store is not at an address that is a
multiple of 16:
– Latency penalty
– Depends on the CPU architecture, but it will almost always occur
Align/Unalign
(diagram: byte lanes from 0x0000 to 0x0020; a 128-bit Load/Store
starting mid-line crosses a 16-byte boundary)
• Choose an unaligned Load/Store instruction
• Load aligned data, then use the alignr or vext
instruction to access it
• Declare 16-byte alignment, or malloc and shift the
address
Solve Method
32 32 32 32 32 32 32 32
reg0
reg3=vext(reg0,reg1,1)
reg1
float __attribute__ ((aligned (16))) a[40];
float *b=(float *)malloc(sizeof(float)*40 + 15);
b = (float *)(((unsigned long)b + 15) & (~0x0FUL));
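The round-up trick above can be packaged so the buffer is still freeable — a sketch with two fixes: allocate 15 spare bytes so rounding up stays in bounds, and keep the original pointer for `free()`. The helper name `aligned_alloc16` is illustrative, not from the slides:

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns a 16-byte-aligned float pointer; *raw_out receives the
   original malloc pointer, which is what must be passed to free(). */
float *aligned_alloc16(size_t count, void **raw_out)
{
    void *raw = malloc(count * sizeof(float) + 15);
    if (!raw)
        return NULL;
    *raw_out = raw;                                  /* keep for free() */
    uintptr_t p = ((uintptr_t)raw + 15) & ~(uintptr_t)0x0F;
    return (float *)p;
}
```

Usage: `void *raw; float *b = aligned_alloc16(40, &raw); … free(raw);`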
SIMD, Step 3
Act of Register
64
32 32
0 64bits
32 32 32 32
64 64
16 16 16 16 16 16 16 16
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
0 128bits
Register
• Arm64
– 32 vector registers
– 31 scalar (general-purpose) registers
• Arm32
– 16 vector registers (Q registers)
– 16 scalar registers
• Intel SSE (x86-64)
– 16 vector registers
– 16 scalar registers
• DSP
– ? vector registers
– ? scalar registers
You have to remember the number of registers in use
(Very Important!!)
Why
• Assuming everything is float, 32 vector
registers give you
– a simulated array of 128 floats (4*32)
– extreme operation with no need to write back
to memory
• Combined with shuffle instructions
– If you use more than 32 vector variables at the same time,
the excess data is spilled to L1, causing latency.
Why
float arr[4*32+4] = {…};
float4 *arr_ptr = (float4 *)arr;
float4 a0,a1,a2,a3,a4,a5,a6 … a32;
a0 = *arr_ptr++;
a1 = *arr_ptr++;
…
a32 = *arr_ptr++;
More than 32 register variables live at the
same time force extra Load/Store during the
computation — this cannot be
optimized away.
About Register
• A register has no data type of its own; the type is defined
only by the instructions used on it at the assembly
level.
• The number of variables holding data must NOT exceed
the CPU's register count — but should fully utilize it.
• Vector registers
– Make good use of shuffle instructions
– Rearrange input/output data
Act of Load/Store
float4 *src_ptr = (float4 *)src;
float4 *dst_ptr=(float4 *)dst;
reg128 reg0,reg1,reg2,reg3… reg31;
for(int i=0;i<640*480;i+=4) {
reg0._float4 = *src_ptr++;
reg1._float4 = *src_ptr++;
reg2._float4 = *src_ptr++;
….
..
*dst_ptr++=reg0._float4;
*dst_ptr++=reg1._float4;
*dst_ptr++=reg2._float4;
….
}
(2 general-purpose registers for addressing the source;
32 vector registers fully utilized; 1 general-purpose register
for the destination — read all at once, run the main algorithm,
write all at once)
Act of Function Call
void test1(float *src,float *dst,int len) {
int a= len/4;
int b= len%4;
float4 aa = *(float4 *)src;
float4 bb = *(float4 *)dst;
float4 cc = aa + bb;
int val=test2(src,dst,len);
cc = aa + bb + cc;
int c =(a+b+len)*val;
…
}
(at the call to test2: the 2 live general-purpose registers and the
vector register are written to L1 cache, producing Load/Store; the
src and dst addresses must be re-read from L1 into general-purpose
registers; the argument registers are cleaned up and reloaded from
L1; after the return, the original data comes back from L1 into the
vector/general-purpose registers — all managed through the stack
pointer)
Act of Function Argument
void test3(float4 aa,float4 *bb,float4 &cc) {
…
}
void test4(float a,float *b,float &c) {
…
}
int main() {
float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2};
float a = 0,b=1,c=2;
test3(aa,&bb,&cc);
test4(a,&b,&c);
}
Under the premise of O3:
(the float4 arguments — by value, pointer, and reference — produce
Load/Store; the plain scalar arguments of test4 go directly through
registers!)
Key Points
• Call by address and call by reference both go through L1
cache; unless inlining succeeds, they are bound to be slow.
• Reduce function usage — all the way to the
end.
Act of Branch Instruction
float a[100],b[100];
for(int i=0;i<100;i++)
{
if(a[i]<50)
b[i]=a[i];
else
b[i] = 30;
}
Cost sketch: loop control plus the load, compare, branch, and
store in each of the 100 iterations — about 1101 cycles.
Act of Branch Instruction
float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b;
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
Cost sketch: setup plus 9 instructions and loop control for each of
the 25 iterations — about 278 cycles, 3.96x faster.
Analyse SIMD Branch
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
if(a[i]<50) b[i]=a[i];
else b[i] =30;
If the compare is true, the lane becomes 32 one-bits; if false, 32
zero-bits (the 128-bit mask holds 4 such lanes). Worked example on
4-bit groups:
value 0011 0100 AND mask 0000 1111 → 0000 0100
constant 1011 0001 AND (NOT mask) 1111 0000 → 1011 0000
1011 0000 XOR 0000 0100 → 1011 0100
Act of Branch Instruction
• Compared with a normal scalar comparison loop, this is
more than 4x faster.
• There is no branch prediction at all: the CPU pipeline can
never mispredict and flush; it just runs straight through to
the end (explosively fast).
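The vclt/vand/vmvn/veor idiom can be sketched on a single 32-bit lane in plain C — a hedged illustration, not the deck's NEON code; the helper name `select_lt50` is made up, and the ternary only models what the vector compare produces in hardware:

```c
#include <stdint.h>
#include <string.h>

/* Branchless select for one lane: returns a if a < 50, else 30,
   using (a AND mask) XOR (30 AND NOT mask) — the same data flow
   as vcltq_f32 / vandq_u32 / vmvnq_u32 / veorq_u32. */
float select_lt50(float a)
{
    const float c = 30.0f;
    uint32_t av, cv, mask, r;
    memcpy(&av, &a, 4);
    memcpy(&cv, &c, 4);
    mask = (a < 50.0f) ? 0xFFFFFFFFu : 0u; /* what the vector compare yields */
    r = (av & mask) ^ (cv & ~mask);        /* vand, vmvn+vand, veor          */
    float out;
    memcpy(&out, &r, 4);
    return out;
}
```

In the real vector version the compare itself is branch-free, so all four lanes get this treatment in one pass with no pipeline flushes.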
Act of Shuffle
• The instruction sets are like an ocean — find the
shuffle that fits best.
– That is the key point of extreme optimization of the
mathematical model.
– If you haven't written shuffles, you can't say you've written SIMD.
Act of Shuffle
• Ex: Matrix Transpose
Act of Shuffle
4 cycles
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
1 5 3 7
2 6 4 8
reg0
reg1
reg2
reg3
reg0
reg1
vtrnq
9 13 11 15
10 14 12 16
reg2
reg3
1 5 9 13
2 6 10 14
reg0
reg1
3 7 11 15
4 8 12 16
reg2
reg3
vtrn
Act of Shuffle
for(int i=0;i<4;i++)
for(int j=i;j<4;j++)
{
int index0=i*4+j,index1=j*4+i;
float temp=a[index0];
a[index0]=a[index1];
a[index1]=temp;
}
Cost sketch: 10 swap iterations, each with index arithmetic, two
loads, and two stores, plus loop control — about 159 cycles.
Act of Shuffle
reg256 temp0,temp1;
reg128 reg0,reg1,reg2,reg3;
temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4);
temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4);
float2 temp =temp0._float2[1];
temp0._float2[1]=temp1._float2[0];
temp1._float2[0]=temp;
temp=temp0._float2[3];
temp0._float2[3]=temp1._float2[2];
temp1._float2[2]=temp;
4 cycles (two vtrn passes plus the 64-bit swaps) — 39.75x faster
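The same data movement can be emulated in scalar C — a sketch, not the NEON code: `trn_pair` models what `vtrnq_f32` does to a row pair, and the loop swaps 64-bit halves exactly as the slides' diagrams show:

```c
#include <string.h>

/* Models vtrnq_f32: swaps the odd elements of r0 with the
   even elements of r1 ([1,2,3,4],[5,6,7,8] -> [1,5,3,7],[2,6,4,8]). */
static void trn_pair(float *r0, float *r1)
{
    float t;
    t = r0[1]; r0[1] = r1[0]; r1[0] = t;
    t = r0[3]; r0[3] = r1[2]; r1[2] = t;
}

/* In-place 4x4 transpose via two trn passes plus 64-bit half swaps. */
void transpose4x4(float m[4][4])
{
    trn_pair(m[0], m[1]);
    trn_pair(m[2], m[3]);
    /* swap the high 64 bits of rows 0..1 with the low 64 bits of rows 2..3 */
    for (int i = 0; i < 2; i++) {
        float t[2];
        memcpy(t,            &m[i][2],     sizeof t);
        memcpy(&m[i][2],     &m[i + 2][0], sizeof t);
        memcpy(&m[i + 2][0], t,            sizeof t);
    }
}
```

Tracing the slides' matrix through this gives exactly the intermediate rows shown ([1,5,3,7], [2,6,4,8], …) and the transposed result.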
Act of Shuffle
• Ex: Matrix Multiplication (transpose + mul + vpadd)
transpose 4 cycles
mul 16 cycles
vpadd 12 cycles
Data Type Conversion
• uchar16 → short8 → int4 → float4
• float4 → int4 → uchar16
Image
32 32 32 32
16 16 16 16 16 16 16 16
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
16 16 16 16 16 16 16 16
32 32 32 32
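One lane of the widening/narrowing ladder above, sketched in plain C — the function names are illustrative; NEON/SSE do each arrow for a whole register with widening moves and saturating narrows (e.g. vqmovn/packus):

```c
#include <stdint.h>

/* uchar16 -> short8 -> int4 -> float4: widen one pixel step by step. */
float uchar_to_float(uint8_t u)
{
    int16_t s = (int16_t)u;   /* uchar -> short (widen) */
    int32_t i = (int32_t)s;   /* short -> int   (widen) */
    return (float)i;          /* int   -> float         */
}

/* float4 -> int4 -> uchar16: narrow back with saturation,
   as the saturating narrow instructions do. */
uint8_t float_to_uchar(float f)
{
    int32_t i = (int32_t)f;
    if (i < 0)   i = 0;       /* clamp to the uchar range */
    if (i > 255) i = 255;
    return (uint8_t)i;
}
```

The clamping step matters for image work: without saturation, a filtered value of 300 would wrap around instead of pinning to white (255).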
SIMD, Step 4
Act of Compiler
Methodology of O3 Optimization
• Clang or gcc ???
• New versions >>>>> old versions
Latency && Throughput
• Consecutive Loads or consecutive Stores
reduce latency.
• Pipeline-specific rearrangement can hide latency
gaps.
– VLIW, SLOT
• Register instructions carry
dependency penalties.
– Rearrange Load/Store yourself
– The compiler can deal with the dependency penalty
About inline
• always_inline can still fail; you should
check for the function symbol in the assembly.
– In Clang, the number of lines of code is the key factor
Vector Register Optimization
• Specific algorithms can NOT be auto-vectorized by
contemporary compilers, because roughly
90% of algorithms need lots of shuffle
instructions — you have to work them out on paper.
• The compiler only vectorizes plain unrollable
for loops.
for(int i=0;i<64;i++)
{
….
}
Compiler says:
I know how to do
vectorization
Read Element and Write Back
reg128 reg0;
float4 a= {0,1,2,3};
reg0._float4 = a;
float2 val1= reg0._float2[0];
reg0._float2[1]=val1;
float val0=reg0._float[2];
reg0._float[3] = val0;
1. Check whether the instruction is supported;
if not, it is written to L1 and Loaded/Stored just like an array.
2. It depends on whether the compiler is smart or not!!
0 1 2 3
(diagram labels: write, read, read, write, write)
Dump Assembly is Important
SIMD, Step 5
Methodology of Extreme Optimization
• Fix all of the algorithm's parameters
– Make them constant values
• Remove branch prediction — the code gets
very large, but fast
• Don't doubt it: 4000+ lines of code
is routine.
Conception of SIMD Optimization
FunctionA
Algorithm A
FunctionB
Algorithm B
FunctionC
Algorithm C
FunctionEnd
(Final Algorithm)
(each stage takes one month to develop — but the earlier codes end
up unused; only the final algorithm needed its one month of
development. The rest was wasted time.)
The problems you face on a daily basis
About Data
• For large data you have to
– satisfy multiples of 4
– know the maximum quantity.
• With input data rearrangement, performance can fly.
• When multiples of 4 are not met:
– pad with zeros and still use SIMD
– handle the tail with general-purpose registers.
Data Rearrangement
a b a b a b ...
a b a b a b ...
a b a b a b ...
a b a b a b ...
a b a b a b ...
a b a b a b ...
a a a a a a ...
a a a a a a ...
a a a a a a ...
b b b b b b ...
b b b b b b ...
b b b b b b ...
a b c a b c ...
a b c a b c ...
a b c a b c ...
a b c a b c ...
a b c a b c ...
a b c a b c ...
a a a a a a ...
a a a a a a ...
b b b b b b ...
b b b b b b ...
c c c c c c ...
c c c c c c ...
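The a,b,a,b,… → a…a / b…b rearrangement in the diagram, sketched as a plain loop — NEON can do this in one shot with the interleaved load `vld2`; the helper name `deinterleave2` is illustrative:

```c
/* Split an interleaved a,b,a,b,... stream into two planar arrays,
   so each plane can then be processed with full-width vector ops. */
void deinterleave2(const float *ab, float *a, float *b, int pairs)
{
    for (int i = 0; i < pairs; i++) {
        a[i] = ab[2 * i];
        b[i] = ab[2 * i + 1];
    }
}
```

Paying this rearrangement cost once up front is what lets the inner SIMD loops run over contiguous, same-typed lanes.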
Unrolling by Your Hands
Image
Traditional Method
for(int i=0;i<height;i++)
{
for(int j=0;j<width;j++)
{
if(...) // top
else if(...) // bottom
else if(...) // left
else if(...) // right
// middle
}
}
SIMD
for(int i=0;i<height;i++) // top
{ ...}
for(int i=0;i<height;i++)
{
// left
for(int j=0;j<width;j++) // middle
{ ... }
// right
}
for(int i=0;i<height;i++) // bottom
{ ...}
In Order To Cooperate With
SIMD: Crazy, Unlimited Unrolling
The I-cache is big enough (over 32 KB);
if it turns out not to be, we'll deal with it then
Conclusion
• SIMD is strongly linked to mathematics.
• It is an almost unknown field, with nearly no courses on it.
• There is little material on the internet, and few people
pull it off.
• Do you want to develop new algorithms?
You can try it.
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 

Happy To Use SIMD

  • 1. Happy To Use SIMD Weita, Wang
  • 2. SIMD??? cv::Mat a(4,4,CV_32FC1); cv::Mat b(4,4,CV_32FC1); cv::Mat c(4,4,CV_32FC1),d; d= a*b+c; Eigen::MatrixXf a(3,3); Eigen::MatrixXf b(3,3); Eigen::MatrixXf c(3,3),d; d=a*b+c; Memory Register Compiler
  • 3. What is SIMD? • The extreme optimization for C/C++ • Pointers only • You have to define exactly how memory, registers, and the compiler behave; that is how you push the hardware to its limit. C/C++ level Assembly level SIMD
  • 4. C=A+B ? float arr0[4] = { 1,2,3,4 }; float arr1[4] = { 5,6,7,8 }; float arr2[4] = { 0 }; A B C A B C + = Result: arr2[4] => { 6,8,10,12 };
  • 5. Why is SIMD fast? for(int i=0;i<4;i++) arr2[i]=arr0[i]+arr1[i]; for(int i=0;i<4;i++) *(arr2 + i) = *(arr0 + i)+*(arr1 + i); 1 1*4 (1+1)*4 (1+1)*4 (1+1)*4 37 cycles 1*4 1*4 Assume every instruction here takes 1 instruction slot and 1 cycle
  • 6. Why is SIMD fast? float32x4_t a,b,c; a=*(float32x4_t *)arr0; b=*(float32x4_t *)arr1; c=a+b; *(float32x4_t *)arr2=c; 4 cycles 9x faster
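The four-element add on slides 4-6 can be sketched portably with GCC/Clang vector extensions instead of NEON's float32x4_t; the float4 typedef and the add4 name are assumptions of this sketch, mirroring the deck's later re-definitions:

```c
#include <string.h>

/* GCC/Clang vector extension: float4 occupies one 128-bit vector register. */
typedef float float4 __attribute__((vector_size(16)));

/* c = a + b for 4 floats: one vector load each, one vector add, one store. */
void add4(const float *a, const float *b, float *c) {
    float4 va, vb, vc;
    memcpy(&va, a, sizeof va);   /* memcpy keeps unaligned input safe */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;                /* compiles to a single SIMD add */
    memcpy(c, &vc, sizeof vc);
}
```

With arr0 = {1,2,3,4} and arr1 = {5,6,7,8}, add4 produces {6,8,10,12} as on slide 4.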
  • 7. SIMD, Step 1 You have to calculate cycle count
  • 8. Variable Architecture double a; //64bits float b; //32bits int c; //32bits short d; //16bits char e; //8bits unsigned int f; //32bits unsigned short g; //16bits unsigned char h; //8bits …. • SSE __m128d aa; __m128 bb; __m128i cc; __m128i dd; • NEON float64x2_t aa; float32x4_t bb; int32x4_t cc; int16x8_t dd; int8x16_t ee; …
  • 9. Variable Architecture (diagram: 64-bit General Purpose and Scale Registers vs 128-bit Vector Registers, which hold 2x64-bit, 4x32-bit, 8x16-bit, or 16x8-bit lanes, i.e. 1x to 16x parallelism)
  • 11. Re-Definition Variable • SSE typedef __m128d double2; typedef __m128 float4; typedef __m64 float2; typedef __m128i int4; typedef __m64 int2; typedef __m128i uint4; typedef __m64 uint2; …. • NEON typedef float64x2_t double2; typedef float32x4_t float4; typedef float32x2_t float2; typedef int32x4_t int4; typedef int32x2_t int2; typedef uint32x4_t uint4; typedef uint32x2_t uint2; ….
  • 12. Re-Definition Register union reg128 { uchar16 _uchar16; short8 _short8; int4 _int4; float4 _float4; double2 _double2; uchar _uchar[16]; short _short[8]; int _int[4]; float _float[4]; double _double[2]; … void print_uchar() { printf("%d %d...\n", _uchar[0],_uchar[1],_uchar[2]…); } void print_float() { printf("%f %f %f %f\n", _float[0],_float[1]...); } … };
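A compilable plain-C sketch of the reg128 idea: a union that views one 128-bit register through several element types at once. The types come from GCC/Clang vector extensions and the member names follow the slide; the print helper is illustrative:

```c
#include <stdio.h>

typedef float float4 __attribute__((vector_size(16)));
typedef int   int4   __attribute__((vector_size(16)));

/* One 128-bit value viewed through several element types at once. */
union reg128 {
    float4        _float4;
    int4          _int4;
    float         _float[4];
    int           _int[4];
    unsigned char _uchar[16];
};

/* Debug helper: print the float view of a register. */
void print_float(const union reg128 *r) {
    printf("%f %f %f %f\n", r->_float[0], r->_float[1],
                            r->_float[2], r->_float[3]);
}
```

Type punning through a union is well-defined in C, which makes this a convenient way to inspect vector lanes while debugging.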
  • 14. SIMD, Step 2 Act of Memory
  • 15. Memory • As we approach physical limits, CPU computation is not the bottleneck: it still gets faster every year, while memory transfer speed stays nearly constant. • SIMD cuts more than 4x the CPU cycles when data is processed in parallel. • The latency comes from moving data through Memory → L2 → L1 → register on every Load/Store.
  • 16. Memory Level latency Reference: https://tinyurl.com/gsnfzoy
  • 17. L1 cache • Arrays created inside a function live on the stack. • When more data is live than the registers can hold, values are written back to L1 cache through the stack pointer. • Function arguments transfer data (with partial -O3 optimization they pass through registers, with no write-back). • A function call saves the current registers to the stack, and restores them when it finishes. • An interrupt or thread switch writes the current registers out, depending on the OS.
  • 18. I-Cache/D-Cache • Instruction Cache: – Holds the compiled CPU instructions (the function symbol size). The first execution of a function pre-fetches it; from the second call on it is in cache. For a computer vision application, ignore the first execution when measuring efficiency. • Data Cache: – This is the L1 cache we usually mean; the established methodology is to pre-fetch data into L1.
  • 19. Page Table • A page → 4096 bytes • A cache line → 64 bytes • A page contains 64 cache lines • L2 cache → typically 256KB~1MB • L1 cache → typically 32~64KB • L1 entry ways → 2-way or 4-way • An image → 320*240 or 640*480 bytes • Do we get heavy cache misses when the memory in use exceeds the cache size?
  • 20. Cache line • 64 bytes = 16 floats • 128 bits = 4 floats (diagram: two 64-bit address halves of one cache line)
  • 21. Worldview • In the SIMD world, if you want to reach the limit, a look-up table is usually not the optimal method; if the table is large, computing in vector registers instead saves 4x the cycles and runs even faster. • In extreme optimization, every small Load/Store you eliminate has a very visible effect!
  • 22. Known Methodology int arr0[100] = {1,2,3…}; void test1 (float *src,float *dst,int len) { int arr1[100] = {1,2,3…}; int b =4; int *arr2 = (int *)malloc(100*sizeof(int)); int c = len + b; … } Memory (DDR3) L1 cache Instruction set Const Memory (DDR3)
  • 23. Known Methodology class a { int val = 3; int map[100] = {1,2,3,4,5}; a(); … }; Memory (DDR3) Same as struct
  • 24. The Compiler Is Not As Smart As You Think void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 cc=*src_dst_ptr + *src_dst_ptr; *src_dst_ptr +=cc; … } Three Loads One Store
  • 25. Correct Writing void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 val = *src_dst_ptr; float4 cc= val + val; *src_dst_ptr =cc + val; … } One Load One Store
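The "correct writing" pattern of slide 25 (load once into a register variable, compute, store once) in the portable vector-extension style; the function name is illustrative:

```c
typedef float float4 __attribute__((vector_size(16)));

/* Hold the value in a register variable: one load, one store. */
void triple_inplace(float4 *src_dst_ptr) {
    float4 val = *src_dst_ptr;   /* single load */
    float4 cc  = val + val;
    *src_dst_ptr = cc + val;     /* single store */
}
```

Compared with slide 24's version, the compiler no longer has to re-load *src_dst_ptr for every use.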
  • 26. Use Arrays Less, Use Pointer++ More void test1(float *src,float *dst,int len) { float4 *src_ptr =(float4 *)src; float4 *dst_ptr=(float4 *)dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_ptr++; reg1=*src_ptr++; reg0 = reg0+reg1; ….. *dst_ptr++=reg0; *dst_ptr++=reg1; } } • Not recommended: void test2(float4 *src,float4 *dst,int len) { int len_4 = len/4; float4 reg0,reg1…; for(int i=0;i<len_4;i+=2) { reg0=src[i]+src[i+1]; … dst[i]=reg0; dst[i+1]=src[i+1]; } }
  • 27. Single Source And Destination, To Avoid Cache Miss/Page Fault void test1(float *src_dst, int len) { float4 *src_dst_ptr =(float4 *)src_dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_dst_ptr++; reg1=*src_dst_ptr++; reg0 = reg0+reg1; ….. *src_dst_ptr++=reg0; *src_dst_ptr++=reg1; } }
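A runnable sketch of the single source-and-destination streaming loop above, assuming (as the deck does) that len is a multiple of 4 and the buffer is 16-byte aligned; the doubling operation and function name are illustrative:

```c
typedef float float4 __attribute__((vector_size(16)));

/* In-place over one buffer, pointer++ style: src_dst[i] *= 2. */
void double_inplace(float *src_dst, int len) {
    float4 *p = (float4 *)src_dst;   /* 16-byte alignment assumed */
    for (int i = 0; i < len; i += 4) {
        float4 reg0 = *p;
        *p++ = reg0 + reg0;
    }
}
```

Using one buffer for both input and output halves the number of cache lines touched compared with separate src and dst.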
  • 28. Align/Unalign • A cache line is 64 bytes, and vector Load/Store wants 16-byte address alignment. • If a Vector Register Load/Store is not at an address that is a multiple of 16: – Latency penalty – It depends on the CPU architecture, but it will almost always occur. (diagram: a 128-bit Load/Store starting in the middle of a line, at 0x0010)
  • 29. Solve Method • Choose the unaligned Load/Store instructions • To access aligned data, use the alignr or vext assembly instructions: reg3=vext(reg0,reg1,1) • Declare 16-byte alignment, or malloc extra space and shift the address float __attribute__ ((aligned (16))) a[40]; float *b=(float *)malloc(sizeof(float)*40 + 15); b= (float*)(((unsigned long)b + 15) & (~0x0F))
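The malloc-and-shift trick on this slide needs two details it glosses over: over-allocate by 15 bytes, and keep the original pointer so free() still works. A minimal sketch; the helper name is an assumption:

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate n floats and return a 16-byte-aligned view.
   *raw receives the original pointer and must be passed to free();
   the aligned pointer itself must never be freed. */
float *alloc_aligned16(size_t n, void **raw) {
    *raw = malloc(n * sizeof(float) + 15);
    if (*raw == NULL) return NULL;
    return (float *)(((uintptr_t)*raw + 15) & ~(uintptr_t)0x0F);
}
```

On C11 systems, aligned_alloc(16, size) is a standard alternative that avoids keeping two pointers.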
  • 30. SIMD, Step 3 Act of Register 64 32 32 0 64bits 32 32 32 32 64 64 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0 128bits
  • 31. Register • Arm64 – 32 Vector Registers – 32 Scale Registers • Arm32 – 16 Vector Registers – 16 Scale Registers • Intel SSE (x86-64) – 16 Vector Registers – 16 Scale Registers • DSP – ? Vector Registers – ? Scale Registers You have to remember the number of registers in use (Very Important!!)
  • 32. Why • Assuming all-float data, 32 Vector Registers can provide – 128 floats' worth of space to simulate an array (4*32) – Extreme operation with no need to write back to memory, with good use of the shuffle instructions • If you use over 32 vector variables at the same time, the excess data is written back to L1, producing latency.
  • 33. Why float arr[4*32+4] = {…}; float4 *arr_ptr = (float4 *)arr; float4 a0,a1,a2,a3,a4,a5,a6 … a32; a0 = *arr_ptr++; a1 = *arr_ptr++; … a32 = *arr_ptr++; With over 32 register variables live at the same time, extra Load/Store appears in the computation and cannot be optimized away.
  • 34. About Register • A register has no data type in itself; the type only exists in the instructions chosen at the assembly level. • The variables you use to hold data must not exceed the maximum number of registers on the CPU, but you should utilize them fully. • Vector Register – Make good use of the shuffle instructions – Rearrange input/output data
  • 35. Act of Load/Store float4 *src_ptr = (float4 *)src; float4 *dst_ptr=(float4 *)dst; reg128 reg0,reg1,reg2,reg3… reg31; for(int i=0;i<640*480;i+=4) { reg0._float4 = *src_ptr++; reg1._float4 = *src_ptr++; reg2._float4 = *src_ptr++; …. .. *dst_ptr++=reg0._float4; *dst_ptr++=reg1._float4; *dst_ptr++=reg2._float4; …. } 2 General Purpose Registers (Addressing) 32 Vector Registers (full utilize) 1 General Purpose Register Read All at Once Write All at Once Main Algorithm
  • 36. Act of Function Call void test1(float *src,float *dst,int len) { int a= len/4; int b= len%4; float4 aa = *(float4 *)src; float4 bb = *(float4 *)dst; float4 cc = aa + bb; int val=test2(src,dst,len); cc = aa + bb + cc; int c =(a+b+len)*val; … } 2 General Purpose Registers written to L1 cache, producing Load/Store 1 Vector Register written to L1 cache, producing Load/Store The src,dst addresses are read into General Purpose Registers from L1 cache The registers are cleaned up; arguments are read from L1 cache into registers The original data returns from L1 cache to the Vector/General Purpose Registers Stack Pointer management
  • 37. Act of Function Argument void test3(float4 aa,float4 *bb,float4 &cc) { … } void test4(float a,float *b,float &c) { … } int main() { float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2}; float a = 0,b=1,c=2; test3(aa,&bb,&cc); test4(a,&b,&c); } Under the premise of O3: each pointer and reference argument produces an act of Load/Store; the by-value arguments pass directly through registers!
  • 38. Key Points • Call by address and call by reference both access the L1 cache, so unless inlining succeeds they must be slow. • Reduce function usage, all the way to the end.
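One way to push small helpers toward guaranteed inlining is the GCC/Clang always_inline attribute; whether it actually sticks should still be verified in the dumped assembly. A sketch, with an illustrative helper:

```c
typedef float float4 __attribute__((vector_size(16)));

/* Force-inline so the call never spills registers to the stack. */
static inline __attribute__((always_inline))
float4 add3(float4 a, float4 b, float4 c) {
    return a + b + c;
}
```

By-value float4 arguments to an inlined helper stay entirely in vector registers, avoiding the L1 round-trips described on slides 36-38.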
  • 39. Act of Branch Instruction float a[100],b[100]; for(int i=0;i<100;i++) { if(a[i]<50) b[i]=a[i]; else b[i] = 30; } 1 100 100 (1+1+1)*100 (1+1)*100*2 (1+1)*100 1101 cycles
  • 40. Act of Branch Instruction float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b; float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } 1+2*25 9*25 1 1 278 cycles 3.96x
  • 41. Analyse SIMD Branch float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } if(a[i]<50) b[i]=a[i]; else b[i] =30; The compare produces 32 ones per lane if true, 32 zeros if false; AND keeps a[i] in the true lanes, NOT then AND keeps 30 in the false lanes, and XOR (equivalently OR) merges the two results.
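The mask-and-merge trick of slides 40-41, sketched portably: a vector comparison yields all-ones or all-zeros lanes, which AND/NOT/OR combine without any branch. This uses GCC/Clang vector extensions rather than the NEON intrinsics of the slide, and the constants 50 and 30 follow the example:

```c
typedef float float4 __attribute__((vector_size(16)));
typedef int   int4   __attribute__((vector_size(16)));

/* Per lane: out = (a < 50) ? a : 30, with no branch at all. */
float4 select_lt50(float4 a) {
    int4 mask = (int4)(a < (float4){50, 50, 50, 50}); /* -1 where true */
    float4 alt = {30, 30, 30, 30};
    int4 bits = ((int4)a & mask) | ((int4)alt & ~mask);
    return (float4)bits;   /* reinterpret the merged bits back to float */
}
```

Because there is no conditional jump, the pipeline can never mispredict here, which is exactly the point slide 42 makes.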
  • 42. Act of Branch Instruction • Compared with normal comparison operations, this is more than 4x faster. • There is no branch prediction at all: the CPU pipeline can never mispredict and flush, it just runs straight to the end (explosively fast).
  • 43. Act of Shuffle • The instruction set is like the sea; you have to find the best-fit shuffle. – That is the key to extreme optimization of the mathematical model. – If you haven't written shuffles, you can't say you write SIMD.
  • 44. Act of Shuffle • Ex: Matrix Transpose
  • 45. Act of Shuffle 4 cycles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 5 3 7 2 6 4 8 reg0 reg1 reg2 reg3 reg0 reg1 vtrnq 9 13 11 15 10 14 12 16 reg2 reg3 1 5 9 13 2 6 10 14 reg0 reg1 3 7 11 15 4 8 12 16 reg2 reg3 vtrn
  • 46. Act of Shuffle for(int i=0;i<4;i++) for(int j=i;j<4;j++) { int index0=i*4+j,index1=j*4+i; float temp=a[index0]; a[index0]=a[index1]; a[index1]=temp; } (4+3+2+1)*3 4 4 1 (1+1)*10 (1+1)*10 (1+1)*10*2 159 cycles (1+1+1+1)*10
  • 47. Act of Shuffle reg256 temp0,temp1; reg128 reg0,reg1,reg2,reg3; temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4); temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4); float2 temp =temp0._float2[1]; temp0._float2[1]=temp1._float2[0]; temp1._float2[0]=temp; temp=temp0._float2[3]; temp0._float2[3]=temp1._float2[2]; temp1._float2[2]=temp; 4 cycles vtrn vtrn 39.75x
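The 4x4 transpose of slides 45-47 can be sketched with GCC's __builtin_shuffle (indices 0-3 select from the first vector, 4-7 from the second). This mirrors the vtrn-plus-swap sequence in spirit but is not the NEON code itself, and it assumes a GCC-compatible compiler:

```c
typedef float float4 __attribute__((vector_size(16)));
typedef int   int4   __attribute__((vector_size(16)));

/* Transpose a 4x4 matrix held in four vector registers, shuffles only. */
void transpose4x4(float4 r[4]) {
    /* interleave pairs of rows */
    float4 t0 = __builtin_shuffle(r[0], r[1], (int4){0, 4, 1, 5});
    float4 t1 = __builtin_shuffle(r[2], r[3], (int4){0, 4, 1, 5});
    float4 t2 = __builtin_shuffle(r[0], r[1], (int4){2, 6, 3, 7});
    float4 t3 = __builtin_shuffle(r[2], r[3], (int4){2, 6, 3, 7});
    /* combine 64-bit halves into the transposed rows */
    r[0] = __builtin_shuffle(t0, t1, (int4){0, 1, 4, 5});
    r[1] = __builtin_shuffle(t0, t1, (int4){2, 3, 6, 7});
    r[2] = __builtin_shuffle(t2, t3, (int4){0, 1, 4, 5});
    r[3] = __builtin_shuffle(t2, t3, (int4){2, 3, 6, 7});
}
```

Everything stays in registers: no Load/Store appears between the eight shuffles, which is why the slide counts only 4 cycles on NEON.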
  • 48. Act of Shuffle • Ex: Matrix Transpose transpose 4 cycles mul 16 cycles vpadd 12 cycles
  • 49. Data Type Conversion • uchar16 → short8 → int4 → float4 • float4 → int4 → uchar16 (diagram: an image widening from 16x8-bit lanes through 8x16-bit and 4x32-bit int to 4x32-bit float, and narrowing back)
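One step of the uchar → float ladder can be sketched with __builtin_convertvector (available in Clang and GCC 9+), which converts values lane by lane rather than reinterpreting bits; the function names are illustrative:

```c
typedef int   int4   __attribute__((vector_size(16)));
typedef float float4 __attribute__((vector_size(16)));

/* Value conversion per lane (not a bit reinterpretation). */
float4 int4_to_float4(int4 v)   { return __builtin_convertvector(v, float4); }
int4   float4_to_int4(float4 v) { return __builtin_convertvector(v, int4);  }
```

The float-to-int direction truncates toward zero, as in scalar C; rounding, if needed, has to be added explicitly.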
  • 50. SIMD, Step 4 Act of Compiler
  • 51. Methodology of O3 Optimization • Clang gcc ??? • New Version >>>>> Old Version
  • 52. Latency && Throughput • Consecutive Loads or consecutive Stores reduce latency. • Specific pipeline rearrangement can shrink the latency window. – VLIW, SLOT • Register instructions carry dependency penalties. – Rearrange Load/Store by yourself – The compiler can deal with the dependency penalty
  • 53. About inline • Always-inline still has a chance to fail; you should check the function symbols in the dumped assembly. – The number of lines of code was the key factor in Clang
  • 54. Vector Register Optimization • Specific algorithms can NOT be auto-vectorized by contemporary compilers, because 90% of algorithms need lots of shuffle instructions; you have to work them out on paper. • The compiler only knows how to vectorize simple for-loop unrolling. for(int i=0;i<64;i++) { …. } Compiler says: I know how to do vectorization
  • 55. Read Element and Write Back reg128 reg0; float4 a= {0,1,2,3}; reg0._float4 = a; float2 val1= reg0._float2[0]; reg0._float2[1]=val1; float val0=reg0._float[2]; reg0._float[3] = val0; 1. If the element-access instruction is not supported, the compiler writes to L1 with Load/Store, the same as an array. 2. It depends on whether the compiler is smart or not!! 0 1 2 3 (write, read, read, write, write)
  • 56. Dump Assembly is Important
  • 57. SIMD, Step 5 Methodology of Extreme Optimization
  • 58. • Fix all of the algorithm's parameters – Make them constant values • Remove the branch prediction; the code becomes very large, but fast • Don't doubt it: such code easily exceeds 4000 lines.
  • 59. Conception of SIMD Optimization FunctionA Algorithm A FunctionB Algorithm B FunctionC Algorithm C FunctionEnd (Final Algorithm) Spending one month each on the intermediate versions is wasted time; the previous code is of no use. Only the final algorithm needs its one month of development.
  • 60. The Problems face on a daily basis
  • 61. About Data • Large data should – Satisfy multiples of 4. – Have a known maximum quantity. • With input data rearrangement, performance can take off. • When multiples of 4 are not met – Pad with zeros and still use SIMD – Or handle the tail with General Purpose Registers.
  • 62. Data Rearrangement a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a a a a a a ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... b b b b b b ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... c c c c c c ... c c c c c c ...
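The rearrangement in slide 62 (interleaved a,b,a,b,... split into a separate a-plane and b-plane) as a minimal scalar sketch; once the planes are contiguous, each can be streamed through vector registers with the pointer++ pattern from slide 26. The function name is an assumption:

```c
/* Split {a0,b0,a1,b1,...} into contiguous planes a[] and b[]. */
void deinterleave2(const float *ab, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = ab[2 * i];
        b[i] = ab[2 * i + 1];
    }
}
```

On NEON, dedicated instructions (vld2/vst2) do this split during the load itself; the scalar version above only shows the target layout.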
  • 63. Unrolling by Your Hands Image
  • 64. Tradition Method for(int i=0;i<height;i++) { for(int j=0;j<width;j++) { if(...) // top else if(...) // bottom else if(...) // left else if(...) // right // middle } }
  • 65. SIMD for(int i=0;i<height;i++) // top { ...} for(int i=0;i<height;i++) { // left for(int j=0;j<width;j++) // middle { ... } // right } for(int i=0;i<height;i++) // bottom { ...}
  • 66. In Order To Cooperate With SIMD: Crazy, Unlimited Unrolling. The I-cache is enough (over 32KB); if it is not enough, we'll talk about it then.
  • 67. Conclusion • SIMD is strongly linked to mathematics. • It is an unknown field, with almost no courses. • There is little material on the internet, and few people have written up their successes. • Do you want to develop new algorithms? You can try it.

Editor's Notes

  1. Simulating register behavior; at the assembly level there are no types
  2. o
  3. Under the premise of