Happy To Use SIMD
Weita, Wang
SIMD???
cv::Mat a(4,4,CV_32FC1);
cv::Mat b(4,4,CV_32FC1);
cv::Mat c(4,4,CV_32FC1),d;
d= a*b+c;
Eigen::MatrixXf a(3,3);
Eigen::MatrixXf b(3,3);
Eigen::MatrixXf c(3,3),d;
d=a*b+c;
Memory Register
Compiler
What is SIMD?
• Extreme optimization for C/C++
• Works with pointers only
• You must precisely define the behavior of memory,
registers, and the compiler; then you can challenge the limit.
(Diagram: SIMD sits between the C/C++ level and the assembly level.)
C=A+B ?
float arr0[4] = { 1,2,3,4 };
float arr1[4] = { 5,6,7,8 };
float arr2[4] = { 0 };
(Diagram: the four lanes of A and B are added element-wise into C.)
Result: arr2 => { 6,8,10,12 };
Why is SIMD fast?
for(int i=0;i<4;i++)
arr2[i]=arr0[i]+arr1[i];
for(int i=0;i<4;i++)
*(arr2 + i) = *(arr0 + i)+*(arr1 + i);
Assume every instruction takes 1 instruction and 1 cycle.
Cycle estimate (scalar): loop setup 1; index update 1×4;
two loads and one store, each with address arithmetic, (1+1)×4 each;
add 1×4; loop branch 1×4
37 cycles
Why is SIMD fast?
float32x4_t a,b,c;
a=*(float32x4_t *)arr0;
b=*(float32x4_t *)arr1;
c=a+b;
*(float32x4_t *)arr2=c;
4 cycles
≈9× faster (37 / 4)
SIMD, Step 1
You have to count cycles
Variable Architecture
double a; //64bits
float b; //32bits
int c; //32bits
short d; //16bits
char e; //8bits
unsigned int f; //32bits
unsigned short g; //16bits
unsigned char h; //8bits
….
• SSE
__m128d aa; // 2 x double
__m128 bb; // 4 x float
__m128i cc; // integer lanes
__m64 dd; // legacy 64-bit MMX
• NEON
float64x2_t aa;
float32x4_t bb;
int32x4_t cc;
int16x8_t dd;
int8x16_t ee;
…
Variable Architecture
(Diagram: a 64-bit scalar register holds one value.
A 128-bit vector register holds 2×64-bit, 4×32-bit, 8×16-bit, or 16×8-bit lanes.)
General Purpose Register / Scalar Register: 1×
Vector Register: 2×, 4×, 8×, or 16× the data per register
Re-Definition
Re-Definition Variable
• SSE
typedef __m128d double2;
typedef __m128 float4;
typedef __m64 float2;
typedef __m128i int4;
typedef __m64 int2;
typedef __m128i uint4;
typedef __m64 uint2;
….
• NEON
typedef float64x2_t double2;
typedef float32x4_t float4;
typedef float32x2_t float2;
typedef int32x4_t int4;
typedef int32x2_t int2;
typedef uint32x4_t uint4;
typedef uint32x2_t uint2;
….
Re-Definition Register
union reg128 {
uchar16 _uchar16;
short8 _short8;
int4 _int4;
float4 _float4;
double2 _double2;
uchar _uchar[16];
short _short[8];
int _int[4];
float _float[4];
double _double[2];
…
void print_uchar()
{
printf("%d %d..\n",
_uchar[0],_uchar[1],_uchar[2]…);
}
void print_float()
{
printf("%f %f %f %f\n",
_float[0],_float[1]...);
}
…
};
To Find Instruction Set
SIMD, Step 2
Act of Memory
Memory
• As we approach physical limits, CPU computation is
not the bottleneck; it keeps improving every year,
while memory transfer speed stays roughly constant.
• SIMD cuts CPU cycles by more than 4× by processing
multiple data in parallel.
• Latency comes from moving data along the path
Memory → L2 → L1 → register Load/Store.
Memory Level latency
Reference: https://tinyurl.com/gsnfzoy
L1 cache
• Arrays created inside a function live on the stack, i.e. in
the L1 cache.
• When more live data is in use than the registers can hold,
the excess is written back to the L1 cache through the
stack pointer.
• Function arguments transfer data. (With partial -O3
optimization, some are passed directly in registers, with
no write-back.)
• A function call saves the current register contents via the
stack pointer, then reads them back when it returns.
• An interrupt or a thread switch writes the current
registers out, depending on the OS.
I-Cache/D-Cache
• Instruction Cache:
– Holds the compiled CPU instructions of the code
(the function symbol size). The first execution of a
function prefetches it; later runs hit the cache. For a
computer-vision application, exclude the first
execution when measuring efficiency.
• Data Cache:
– This is what we usually mean by "L1 cache"; the
established methodology is to prefetch data into it.
Page Table
• A page → 4096 bytes
• A cache line → 64 bytes
• A page contains 64 cache lines
• L2 cache → 5~10 MB
• L1 cache → 512 KB~1 MB
• L1 associativity → 2-way or 4-way
• An image → 320×240 or 640×480 bytes
• Do we get heavy cache misses when memory usage
exceeds the cache size?
Cache line
• 64 bytes = 16 floats
• 128 bits = 4 floats
(Diagram: cache lines addressed with 64-bit addresses.)
Worldview
• In the SIMD world, if you want to reach the limit, a
look-up table is usually not the optimal method; if
the table is large, keeping the data in vector registers
cuts cycles by 4× and ends up faster.
• At the extreme level of optimization, every small
load/store you eliminate has a very visible effect!
Known Methodology
int arr0[100] = {1,2,3…};
void test1 (float *src,float *dst,int len)
{
int arr1[100] = {1,2,3…};
int b =4;
int *arr2 = (int *)malloc(100*sizeof(int));
int c = len + b;
…
}
(Storage: arr0 → constant data in memory (DDR3); arr1 → stack, i.e. L1 cache;
b and c → immediates in the instruction stream; arr2 → heap memory (DDR3).)
Known Methodology
class a
{
int val = 3;
int map[100] = {1,2,3,4,5};
a();
…
};
(Member data lives in memory (DDR3), the same as a struct.)
The Compiler Is Not As Smart As You Think
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 cc=*src_dst_ptr + *src_dst_ptr;
*src_dst_ptr +=cc;
…
}
Three loads, one store
Correct Writing
void test0(float *src_dst,int len)
{
float4 *src_dst_ptr = (float4 *)src_dst;
float4 val = *src_dst_ptr;
float4 cc= val + val;
*src_dst_ptr =cc + val;
…
}
One load, one store
Use Arrays Less,
Use Pointer++ More
void test1(float *src,float *dst,int len)
{
float4 *src_ptr =(float4 *)src;
float4 *dst_ptr=(float4 *)dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=8)
{
reg0=*src_ptr++;
reg1=*src_ptr++;
reg0 = reg0+reg1;
…..
*dst_ptr++=reg0;
*dst_ptr++=reg1;
}
}
• Not recommended:
void test2(float4 *src,float4 *dst,int len)
{
int len_4 = len/4;
float4 reg0,reg1…;
for(int i=0;i<len_4;i+=2)
{
reg0=src[i]+src[i+1];
…
dst[i]=reg0;
dst[i+1]=src[i+1];
}
}
Single Source And Destination,
To Avoid Cache Miss/Page Fault
void test1(float *src_dst, int len)
{
float4 *src_dst_ptr =(float4 *)src_dst;
float4 reg0,reg1…;
for(int i=0;i<len;i+=8)
{
reg0=src_dst_ptr[0];
reg1=src_dst_ptr[1];
reg0 = reg0+reg1;
…..
src_dst_ptr[0]=reg0;
src_dst_ptr[1]=reg1;
src_dst_ptr+=2;
}
}
• A cache line is 64 bytes; vector load/store wants
16-byte address alignment.
• If a vector register load/store is not at a
multiple-of-16 address:
– latency penalty
– depends on the CPU architecture, but it will almost
always occur
Align/Unalign
(Diagram: bytes from 0x0000 to 0x0020; a 128-bit load/store starting between
the 16-byte boundaries 0x0000 and 0x0010 straddles two aligned blocks.)
• Use the unaligned load/store instructions
• Load aligned data and extract the window with the
alignr (SSE) or vext (NEON) instruction
• Declare 16-byte alignment, or malloc extra space
and round the address up
Solve Method
(Diagram: reg3 = vext(reg0, reg1, 1) takes lanes 1–3 of reg0 and lane 0 of reg1.)
float __attribute__ ((aligned (16))) a[40];
float *raw = (float *)malloc(sizeof(float)*40 + 15);
float *b = (float *)(((uintptr_t)raw + 15) & ~(uintptr_t)0x0F); // keep raw for free()
SIMD, Step 3
Act of Register
(Diagram: a 64-bit scalar register versus a 128-bit vector register
split into 2×64-bit, 4×32-bit, 8×16-bit, or 16×8-bit lanes.)
Register
• Arm64
– 32 vector registers
– 31 scalar (general-purpose) registers
• Arm32
– 16 vector registers
– 16 scalar registers
• Intel SSE (x86-64)
– 16 vector registers
– 16 scalar registers
• DSP
– ? vector registers
– ? scalar registers
Always keep track of the number of registers in use
(very important!)
Why
• Assuming all-float data, 32 vector registers can
provide
– a simulated array of 128 elements (4 × 32)
– extreme computation with no write-back to memory
• Combine this with the shuffle instructions.
– If more than 32 vector variables are live at the same
time, the excess spills back to L1 and adds latency.
Why
float arr[4*32+4] = {…};
float4 *arr_ptr = (float4 *)arr;
float4 a0,a1,a2,a3,a4,a5,a6 … a32;
a0 = *arr_ptr++;
a1 = *arr_ptr++;
…
a32 = *arr_ptr++;
With more than 32 register variables live at the
same time, extra load/store is generated during
the computation and cannot be optimized away.
About Register
• A register has no data type of its own; the type is
defined only by the instructions used on it at the
assembly level.
• The number of variables holding live data must not
exceed the CPU's register count, but you should use
all of them.
• Vector registers
– Make good use of the shuffle instructions
– Rearrange the input/output data
Act of Load/Store
float4 *src_ptr = (float4 *)src;
float4 *dst_ptr=(float4 *)dst;
reg128 reg0,reg1,reg2,reg3… reg31;
for(int i=0;i<640*480;i+=4) {
reg0._float4 = *src_ptr++;
reg1._float4 = *src_ptr++;
reg2._float4 = *src_ptr++;
….
..
*dst_ptr++=reg0._float4;
*dst_ptr++=reg1._float4;
*dst_ptr++=reg2._float4;
….
}
(Annotations: src_ptr and dst_ptr occupy 2 general-purpose registers for
addressing, the loop counter 1 more; reg0…reg31 fully utilize all 32 vector
registers. Pattern: read everything at once, run the main algorithm, write
everything at once.)
Act of Function Call
void test1(float *src,float *dst,int len) {
int a= len/4;
int b= len%4;
float4 aa = *(float4 *)src;
float4 bb = *(float4 *)dst;
float4 cc = aa + bb;
int val=test2(src,dst,len);
cc = aa + bb + cc;
int c =(a+b+len)*val;
…
}
(Annotations: around the call to test2, two general-purpose registers (a, b)
are written to L1, producing load/store; one vector register (cc) likewise.
The src and dst addresses must be re-read from L1 into general-purpose
registers; the call clears the registers, the arguments are read back from L1,
and afterwards the original data is restored from L1 into the vector and
general-purpose registers. All of this is stack-pointer management.)
Act of Function Argument
void test3(float4 aa,float4 *bb,float4 &cc) {
…
}
void test4(float a,float *b,float &c) {
…
}
int main() {
float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2};
float a = 0,b=1,c=2;
test3(aa,&bb,&cc);
test4(a,&b,&c);
}
(With -O3: the pointer and reference arguments (&bb, &cc, &b, &c) each
produce load/store through L1; the by-value arguments aa and a go
directly through registers!)
Key Points
• Call by address and call by reference both go through
the L1 cache; unless inlining succeeds, they are
bound to be slow.
• Reduce function calls, all the way to the end.
Act of Branch Instruction
float a[100],b[100];
for(int i=0;i<100;i++)
{
if(a[i]<50)
b[i]=a[i];
else
b[i] = 30;
}
Cycle estimate: loop setup 1; i++ 1×100; compare 1×100;
if-test load + compare + branch (1+1+1)×100; the two
assignment paths (1+1)×100×2; loop branch (1+1)×100
1101 cycles
Act of Branch Instruction
float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b;
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
Cycle estimate: loop overhead 1 + 2×25; body 9 instructions × 25 iterations; setup 1 + 1
278 cycles
3.96× faster
Analyse SIMD Branch
float4 cmp = {50,50,50,50};
reg128 val0;
reg128 reg0,mask,tmp0,tmp1;
val0._float4 = { 30,30,30,30 };
for(int i=0;i<100;i+=4)
{
reg0._float4 = *a_ptr++;
mask._uint4=vcltq_f32(reg0._float4,cmp);
tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4);
mask._uint4=vmvnq_u32(mask._uint4);
tmp1._uint4=vandq_u32(val0._uint4,mask._uint4);
reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4);
*b_ptr++ = reg0._float4;
}
if(a[i]<50) b[i]=a[i];
else b[i] =30;
Each lane of the 128-bit compare result is all ones (true, 32 one-bits)
or all zeros (false, 32 zero-bits).

Worked example on abbreviated bits:
mask = 0000 1111
a = 0011 0100
a AND mask = 0000 0100
NOT mask = 1111 0000
val0 = 1011 0001
val0 AND (NOT mask) = 1011 0000
result = (a AND mask) XOR (val0 AND NOT mask)
= 0000 0100 XOR 1011 0000 = 1011 0100
Act of Branch Instruction
• Compared with a normal comparison loop, this is
more than 4× faster.
• There is no branch prediction at all: the CPU pipeline
can never mispredict and flush; it just runs straight
through to the end (explosively fast).
Act of Shuffle
• The instruction set is like the sea; find the shuffle
that fits best.
– That is the key to extreme optimization of the
mathematical model.
– If you have never written a shuffle, you cannot say
you write SIMD.
Act of Shuffle
• Ex: Matrix Transpose
Act of Shuffle
4 cycles
Step 1, vtrnq on row pairs:
reg0 = { 1, 2, 3, 4} reg1 = { 5, 6, 7, 8} → reg0 = {1, 5, 3, 7} reg1 = {2, 6, 4, 8}
reg2 = { 9,10,11,12} reg3 = {13,14,15,16} → reg2 = {9,13,11,15} reg3 = {10,14,12,16}
Step 2, swap the 64-bit halves:
reg0 = {1, 5, 9,13} reg1 = {2, 6,10,14}
reg2 = {3, 7,11,15} reg3 = {4, 8,12,16}
Act of Shuffle
for(int i=0;i<4;i++)
for(int j=i;j<4;j++)
{
int index0=i*4+j,index1=j*4+i;
float temp=a[index0];
a[index0]=a[index1];
a[index1]=temp;
}
Cycle estimate: loop overhead (4+3+2+1)×3 + 4 + 4 + 1; per the 10 swaps:
index computation (1+1+1+1)×10, two loads (1+1)×10 each, two stores (1+1)×10×2
159 cycles
Act of Shuffle
reg256 temp0,temp1;
reg128 reg0,reg1,reg2,reg3;
temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4);
temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4);
float2 temp =temp0._float2[1];
temp0._float2[1]=temp1._float2[0];
temp1._float2[0]=temp;
temp=temp0._float2[3];
temp0._float2[3]=temp1._float2[2];
temp1._float2[2]=temp;
4 cycles
39.75× faster
Act of Shuffle
• Ex: 4×4 matrix multiply via transpose
– transpose: 4 cycles
– mul: 16 cycles
– vpadd: 12 cycles
Data Type Conversion
• uchar16 → short8 → int4 → float4
• float4 → int4 → uchar16
(Diagram: a 16-byte image chunk widens from 8-bit through 16-bit
to 32-bit lanes and narrows back.)
SIMD, Step 4
Act of Compiler
Methodology of O3 Optimization
• Clang or gcc?
• A new compiler version beats an old one by far.
Latency && Throughput
• Consecutive loads or consecutive stores reduce
latency.
• Architecture-specific pipeline rearrangement can
hide latency gaps.
– VLIW, SLOT
• Register instructions carry dependency penalties.
– Rearrange loads/stores by yourself
– The compiler can deal with the dependency penalty
About inline
• always_inline still has a chance to fail; check the
function symbols in the assembly.
– In Clang, the number of lines of code is the key factor.
Vector Register Optimization
• Contemporary compilers can NOT vectorize specific
algorithms with SIMD, because roughly 90% of
algorithms need many shuffle instructions that have
to be worked out by hand on paper.
• The compiler only vectorizes simple unrolled
for loops.
for(int i=0;i<64;i++)
{
….
}
Compiler says:
I know how to do
vectorization
Read Element and Write Back
reg128 reg0;
float4 a= {0,1,2,3};
reg0._float4 = a;
float2 val1= reg0._float2[0];
reg0._float2[1]=val1;
float val0=reg0._float[2];
reg0._float[3] = val0;
1. Check whether a lane-access instruction exists; if not,
the access goes through L1 load/store, the same as an array.
2. It depends on how smart the compiler is!
(Diagram: lanes 0–3 of the register; each statement above is
marked as a read or a write of individual lanes.)
Dumping the Assembly Is Important
SIMD, Step 5
Methodology of Extreme Optimization
• Fix all of the algorithm's parameters
– Make them compile-time constants
• Remove the branches; the code becomes very
large, but fast.
• Don't doubt it: such code casually exceeds 4000
lines.
Conception of SIMD Optimization
FunctionA (Algorithm A) → FunctionB (Algorithm B) → FunctionC (Algorithm C) → FunctionEnd (Final Algorithm)
(Diagram: spending one month each optimizing Functions A, B, and C is
wasted time; that code ends up unused. Only the final combined algorithm
needs the one month of SIMD development.)
Problems Faced on a Daily Basis
About Data
• Large data has to
– be padded to a multiple of 4
– have a known maximum quantity.
• With the input data rearranged, you can fly.
• When the length is not a multiple of 4
– pad with zeros and still use SIMD
– or finish the tail with general-purpose registers.
Data Rearrangement
Interleaved → planar:

a b a b a b ...      a a a a a a ...
a b a b a b ...  →   b b b b b b ...

a b c a b c ...      a a a a a a ...
a b c a b c ...  →   b b b b b b ...
a b c a b c ...      c c c c c c ...
Unrolling by Your Hands
Image
Traditional Method
for(int i=0;i<height;i++)
{
for(int j=0;j<width;j++)
{
if(...) // top
else if(...) // bottom
else if(...) // left
else if(...) // right
// middle
}
}
SIMD
for(int i=0;i<height;i++) // top
{ ...}
for(int i=0;i<height;i++)
{
// left
for(int j=0;j<width;j++) // middle
{ ... }
// right
}
for(int i=0;i<height;i++) // bottom
{ ...}
To cooperate with SIMD: crazy, unlimited unrolling.
The I-cache is big enough (over 32 KB);
if it is not, we will deal with that then.
Conclusion
• SIMD is strongly linked to mathematics.
• It is a little-known field, with almost no courses.
• There is little material on the internet, and few
people have written up their successes.
• Do you want to develop new algorithms?
You can try it.
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 

Happy To Use SIMD

  • 1. Happy To Use SIMD Weita, Wang
  • 2. SIMD??? cv::Mat a(4,4,CV_32FC1); cv::Mat b(4,4,CV_32FC1); cv::Mat c(4,4,CV_32FC1),d; d= a*b+c; Eigen::MatrixXf a(3,3); Eigen::MatrixXf b(3,3); Eigen::MatrixXf c(3,3),d; d=a*b+c; Memory Register Compiler
  • 3. What is SIMD? • The Extreme Optimization for C/C++ • Pointer only • You have to define the exact behavior of memory, registers, and the compiler; that is how you challenge the limit. C/C++ level Assembly level SIMD
  • 4. C=A+B ? float arr0[4] = { 1,2,3,4 }; float arr1[4] = { 5,6,7,8 }; float arr2[4] = { 0 }; A B C A B C + = Result: arr2[4] => { 6,8,10,12 };
  • 5. Why is SIMD fast? for(int i=0;i<4;i++) arr2[i]=arr0[i]+arr1[i]; for(int i=0;i<4;i++) *(arr2 + i) = *(arr0 + i)+*(arr1 + i); 1 1*4 (1+1)*4 (1+1)*4 (1+1)*4 37 cycles 1*4 1*4 Assume every operation takes 1 instruction and 1 cycle
  • 6. Why is SIMD fast? float32x4_t a,b,c; a=*(float32x4_t *)arr0; b=*(float32x4_t *)arr1; c=a+b; *(float32x4_t *)arr2=c; 4 cycles 9x faster
  • 7. SIMD, Step 1 You have to calculate cycle count
  • 8. Variable Architecture double a; //64bits float b; //32bits int c; //32bits short d; //16bits char e; //8bits unsigned int f; //32bits unsigned short g; //16bits unsigned char h; //8bits …. • SSE __m128d aa; __m128 bb; __m128i cc; __m128i dd; • NEON float64x2_t aa; float32x4_t bb; int32x4_t cc; int16x8_t dd; int8x16_t ee; …
  • 9. Variable Architecture 64 32 32 32 32 32 32 64 64 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0 64bits 0 128bits General Purpose Register Scalar Register Vector Register 1x 1x 2x 4x 16x 8x
  • 11. Re-Definition Variable • SSE typedef __m128d double2; typedef __m128 float4; typedef __m64 float2; typedef __m128i int4; typedef __m64 int2; typedef __m128i uint4; typedef __m64 uint2; …. • NEON typedef float64x2_t double2; typedef float32x4_t float4; typedef float32x2_t float2; typedef int32x4_t int4; typedef int32x2_t int2; typedef uint32x4_t uint4; typedef uint32x2_t uint2; ….
  • 12. Re-Definition Register union reg128 { uchar16 _uchar16; short8 _short8; int4 _int4; float4 _float4; double2 _double2; uchar _uchar[16]; short _short[8]; int _int[4]; float _float[4]; double _double[2]; … void print_uchar() { printf("%d %d...\n", _uchar[0],_uchar[1],_uchar[2]….); } void print_float() { printf("%f %f %f %f\n", _float[0],_float[1]...); } … };
  • 14. SIMD, Step 2 Act of Memory
  • 15. Memory • In this era of approaching physical limits, CPU computation is not the bottleneck; it still improves every year, while memory transfer speed stays roughly constant. • SIMD cuts CPU cycles by 4x or more by processing multiple data in parallel. • Data pays latency along the path Memory → L2 → L1 → register on every Load/Store.
  • 16. Memory Level latency Reference: https://tinyurl.com/gsnfzoy
  • 17. L1 cache • Arrays created inside a function live here. • When more values are live than the available registers can hold, the excess is written back to the L1 cache via the stack pointer. • Function arguments transfer data through it (with -O3, some arguments pass directly in registers, with no write-back). • A function call saves the current registers to the stack; when it returns, the data is read back into registers. • An interrupt or thread switch writes the current registers out, depending on the OS.
  • 18. I-Cache/D-Cache • Instruction Cache: – Holds the compiled CPU instructions (the function symbol size). The first execution of a function fetches them; the second run hits the cache. In a computer-vision application, exclude a function's first run when measuring efficiency. • Data Cache: – This is what we usually mean by "L1 cache"; the established methodology is to pre-fetch data into it.
  • 19. Page Table • A page → 4096 bytes • A cache line → 64 bytes • A page contains 64 cache lines • L2 cache → 5~10 MB • L1 cache → 512 KB~1 MB • L1 entry ways → 2-way or 4-way • An image → 320*240 or 640*480 bytes • Will there be heavy cache misses when memory usage exceeds the cache size?
  • 20. Cache line • 64 bytes= 16 float • 128 bits = 4 float To Address 64 bits To Address 64 bits
  • 21. Worldview • In the SIMD world, if you want to reach the limit, a look-up table is usually not the optimal method; when the table is large, using vector registers instead can save 4x the cycles and run even faster. • In extreme optimization, once you keep Loads/Stores small, the effect is very obvious!
  • 22. Known Methodology int arr0[100] = {1,2,3…}; void test1 (float *src,float *dst,int len) { int arr1[100] = {1,2,3…}; int b =4; int *arr2 = (int *)malloc(100*sizeof(int)); int c = len + b; … } Memory (DDR3) L1 cache Instruction set Const Memory (DDR3)
  • 23. Known Methodology class a { int val = 3; int map[100] = {1,2,3,4,5}; a(); … }; Memory (DDR3) Same as struct
  • 24. The Compiler Is Not As Smart As You Think void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 cc=*src_dst_ptr + *src_dst_ptr; *src_dst_ptr +=cc; … } Three Loads One Store
  • 25. Correct Writing void test0(float *src_dst,int len) { float4 *src_dst_ptr = (float4 *)src_dst; float4 val = *src_dst_ptr; float4 cc= val + val; *src_dst_ptr =cc + val; … } One Load One Store
  • 26. Use Arrays Less, Use Pointer++ More void test1(float *src,float *dst,int len) { float4 *src_ptr =(float4 *)src; float4 *dst_ptr=(float4 *)dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_ptr++; reg1=*src_ptr++; reg0 = reg0+reg1; ….. *dst_ptr++=reg0; *dst_ptr++=reg1; } } • Not recommended: void test2(float4 *src,float4 *dst,int len) { int len_4 = len/4; float4 reg0,reg1…; for(int i=0;i<len_4;i+=2) { reg0=src[i]+src[i+1]; … dst[i]=reg0; dst[i+1]=src[i+1]; } }
  • 27. Single Source And Destination, To Avoid Cache Miss/Page Fault void test1(float *src_dst, int len) { float4 *src_dst_ptr =(float4 *)src_dst; float4 reg0,reg1…; for(int i=0;i<len;i+=4) { reg0=*src_dst_ptr++; reg1=*src_dst_ptr++; reg0 = reg0+reg1; ….. *src_dst_ptr++=reg0; *src_dst_ptr++=reg1; } }
  • 28. • A cache line is 64 bytes, with 16-byte address alignment. • If a vector register Load/Store is not at a 16-byte-aligned address: – latency penalty – depending on the CPU architecture, it will almost always occur. Align/Unalign 0x0000 0x0010 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0x0020 From Here, Start to Load/Store 128bits
  • 29. • Choose the unaligned Load/Store intrinsics • Access aligned data and use the alignr or vext instruction • Declare 16-byte alignment, or malloc and shift the address (keep the original pointer if you still need to free it) Solve Method 32 32 32 32 32 32 32 32 reg0 reg3=vext(reg0,reg1,1) reg1 float __attribute__ ((aligned (16))) a[40]; float *b=(float *)malloc(sizeof(float)*40); b= (float*)(((unsigned long)b + 15) & (~0x0F))
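The address-rounding trick on this slide can be checked with a small scalar sketch; `align16` is a hypothetical helper name, not from the deck:

```c
#include <assert.h>
#include <stdint.h>

/* Round a pointer up to the next 16-byte boundary, exactly as the
   slide's (p + 15) & ~0x0F expression does, but with uintptr_t so it
   is portable across 32- and 64-bit targets. Keep the original
   pointer around if you still need to free() it. */
static float *align16(float *p) {
    return (float *)(((uintptr_t)p + 15) & ~(uintptr_t)0x0F);
}
```

Over-allocate by 15 bytes so the rounded-up pointer still lies inside the buffer.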
  • 30. SIMD, Step 3 Act of Register 64 32 32 0 64bits 32 32 32 32 64 64 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 0 128bits
  • 31. Register • Arm64 – 32 Vector Registers – 32 Scalar Registers • Arm32 – 16 Vector Registers – 16 Scalar Registers • Intel SSE – 16 Vector Registers – 16 Scalar Registers • DSP – ? Vector Registers – ? Scalar Registers You have to remember the number of registers in use (Very Important!!)
  • 32. Why • Under the premise of all-float data, 32 vector registers provide – a 128-float simulated array space (4*32) – extreme operation with no need to write back to memory • With the help of the shuffle instructions – If you use more than 32 variables at the same time, the excess data is written back to L1, causing latency.
  • 33. Why float arr[4*32+4] = {…}; float4 *arr_ptr = (float4 *)arr; float4 a0,a1,a2,a3,a4,a5,a6 … a32; a0 = *arr_ptr++; a1 = *arr_ptr++; … a32 = *arr_ptr++; More than 32 register variables live at the same time causes extra Load/Store during processing and cannot be optimized away.
  • 34. About Register • A register has no data type in itself; the type is defined only by the instruction used at the assembly level. • Variables holding data must not exceed the maximum number of CPU registers, but should use them fully. • Vector Register – Make good use of the shuffle instructions – Rearrange input/output data
  • 35. Act of Load/Store float4 *src_ptr = (float4 *)src; float4 *dst_ptr=(float4 *)dst; reg128 reg0,reg1,reg2,reg3… reg31; for(int i=0;i<640*480;i+=4) { reg0._float4 = *src_ptr++; reg1._float4 = *src_ptr++; reg2._float4 = *src_ptr++; …. .. *dst_ptr++=reg0._float4; *dst_ptr++=reg1._float4; *dst_ptr++=reg2._float4; …. } 2 General Purpose Registers (Addressing) 32 Vector Registers (fully utilized) 1 General Purpose Register Read All at Once Write All at Once Main Algorithm
  • 36. Act of Function Call void test1(float *src,float *dst,int len) { int a= len/4; int b= len%4; float4 aa = *(float4 *)src; float4 bb = *(float4 *)dst; float4 cc = aa + bb; int val=test2(src,dst,len); cc = aa + bb + cc; int c =(a+b+len)*val; … } 2 General Purpose Registers write to the L1 cache, producing Load/Store 1 Vector Register writes to the L1 cache, producing Load/Store Read the src,dst addresses into General Purpose Registers from the L1 cache The registers are cleaned up; arguments are read from the L1 cache into registers The original data returns from the L1 cache to Vector/General Purpose Registers Stack Pointer management
  • 37. Act of Function Argument void test3(float4 aa,float4 *bb,float4 &cc) { … } void test4(float a,float *b,float &c) { … } int main() { float4 aa = { 0,0,0,0 },bb={1,1,1,1},cc = {2,2,2,2}; float a = 0,b=1,c=2; test3(aa,&bb,&cc); test4(a,&b,&c); } Under the premise of O3 Produces Load/Store Produces Load/Store Produces Load/Store Produces Load/Store Directly through registers! Directly through registers!
  • 38. Key Points • Call by address and call by reference both access the L1 cache; unless inlining succeeds, they must be slow. • Reduce function usage, all the way to the end.
  • 39. Act of Branch Instruction float a[100],b[100]; for(int i=0;i<100;i++) { if(a[i]<50) b[i]=a[i]; else b[i] = 30; } 1 100 100 (1+1+1)*100 (1+1)*100*2 (1+1)*100 1101 cycles
  • 40. Act of Branch Instruction float4 *a_ptr = (float4 *)a,*b_ptr=(float4 *)b; float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } 1+2*25 9*25 1 1 278 cycles 3.96x
  • 41. Analyse SIMD Branch float4 cmp = {50,50,50,50}; reg128 val0; reg128 reg0,mask,tmp0,tmp1; val0._float4 = { 30,30,30,30 }; for(int i=0;i<100;i+=4) { reg0._float4 = *a_ptr++; mask._uint4=vcltq_f32(reg0._float4,cmp); tmp0._uint4=vandq_u32(reg0._uint4,mask._uint4); mask._uint4 = vmvnq_u32(mask._uint4); tmp1._uint4=vandq_u32(val0._uint4,mask._uint4); reg0._uint4=veorq_u32(tmp0._uint4,tmp1._uint4); *b_ptr++ = reg0._float4; } if(a[i]<50) b[i]=a[i]; else b[i] =30; 11..1 00..0 11..1 00..0 If true, 32 ones If false, 32 zeros 0 128 0000 1111 0011 0100 0000 0100 AND 1111 0000 1011 0001 1011 0000 NOT AND 1011 0000 0000 0100 1011 0100 XOR
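The mask trick on slides 40 and 41 can be modeled lane-by-lane in plain C. This is a scalar sketch (the name `branchless_select` is mine, not the deck's), using the same all-ones/all-zeros mask plus AND/NOT/XOR on the raw float bits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* b = (a < 50) ? a : 30, computed the way the slide's NEON code does:
   vcltq_f32 yields an all-ones or all-zeros lane, then
   vandq / vmvnq+vandq / veorq combine the two candidate values. */
static float branchless_select(float a) {
    uint32_t abits, cbits, mask, r;
    const float c = 30.0f;
    float out;
    memcpy(&abits, &a, sizeof abits);
    memcpy(&cbits, &c, sizeof cbits);
    mask = (a < 50.0f) ? 0xFFFFFFFFu : 0u;  /* what vcltq_f32 produces per lane */
    r = (abits & mask) ^ (cbits & ~mask);   /* vandq, vmvnq+vandq, veorq */
    memcpy(&out, &r, sizeof out);
    return out;
}
```

On real NEON the whole AND/NOT/XOR tail can also be folded into the single bit-select instruction vbslq_f32.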
  • 42. Act of Branch Instruction • Compared with the normal comparison operation, it is more than 4x faster. • There is no branch prediction: the CPU pipeline never mispredicts and flushes, so it runs straight to the end (explosively fast).
  • 43. Act of Shuffle • The instruction sets are like the sea; find the best-fitting shuffle. – That is the key to extreme optimization of the mathematical model. – If you haven't written shuffles, you can't say you write SIMD.
  • 44. Act of Shuffle • Ex: Matrix Transpose
  • 45. Act of Shuffle 4 cycles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 5 3 7 2 6 4 8 reg0 reg1 reg2 reg3 reg0 reg1 vtrnq 9 13 11 15 10 14 12 16 reg2 reg3 1 5 9 13 2 6 10 14 reg0 reg1 3 7 11 15 4 8 12 16 reg2 reg3 vtrn
  • 46. Act of Shuffle for(int i=0;i<4;i++) for(int j=i;j<4;j++) { int index0=i*4+j,index1=j*4+i; float temp=a[index0]; a[index0]=a[index1]; a[index1]=temp; } (4+3+2+1)*3 4 4 1 (1+1)*10 (1+1)*10 (1+1)*10*2 159 cycles (1+1+1+1)*10
  • 47. Act of Shuffle reg256 temp0,temp1; reg128 reg0,reg1,reg2,reg3; temp0._float4x2=vtrnq_f32(reg0._float4,reg1._float4); temp1._float4x2=vtrnq_f32(reg2._float4,reg3._float4); float2 temp =temp0._float2[1]; temp0._float2[1]=temp1._float2[0]; temp1._float2[0]=temp; temp=temp0._float2[3]; temp0._float2[3]=temp1._float2[2]; temp1._float2[2]=temp; 4 cycles vtrn vtrn 39.75x
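The two-step vtrn transpose on slides 45 to 47 can be modeled in scalar C to see the data movement; `transpose4x4_trn` is a hypothetical name and a sketch of the same steps, not the deck's code:

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the slide's two-step NEON 4x4 transpose:
   vtrn on each row pair, then swapping 64-bit (two-float) halves. */
static void transpose4x4_trn(float m[16]) {
    float r[4][4];
    memcpy(r, m, sizeof(r));
    /* step 1: vtrn(r0,r1) and vtrn(r2,r3): swap the off-diagonal
       lanes of each 2x2 block, e.g. {1,2,3,4},{5,6,7,8} ->
       {1,5,3,7},{2,6,4,8} */
    for (int p = 0; p < 4; p += 2)
        for (int j = 0; j < 4; j += 2) {
            float t = r[p][j + 1];
            r[p][j + 1] = r[p + 1][j];
            r[p + 1][j] = t;
        }
    /* step 2: swap the high two floats of r0/r1 with the low two
       of r2/r3 (the slide's float2 element swaps) */
    for (int p = 0; p < 2; p++)
        for (int k = 0; k < 2; k++) {
            float t = r[p][2 + k];
            r[p][2 + k] = r[p + 2][k];
            r[p + 2][k] = t;
        }
    memcpy(m, r, sizeof(r));
}
```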
  • 48. Act of Shuffle • Ex: Matrix Transpose transpose 4 cycles mul 16 cycles vpadd 12 cycles
  • 49. Data Type Conversion • uchar16short8int4float4 • float4int4uchar16 Image 32 32 32 32 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 16 16 16 16 16 16 16 16 32 32 32 32
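The narrowing path on slide 49 (float4 → int4 → uchar16) saturates at each NEON narrowing step; here is a per-lane scalar sketch (the name `float_to_u8_sat` is hypothetical):

```c
#include <assert.h>

/* Scalar model of the slide's narrowing path float -> int -> uchar:
   convert truncates toward zero (as vcvtq_s32_f32 does), and the
   narrowing store saturates to the uchar range [0,255]. */
static unsigned char float_to_u8_sat(float v) {
    int i = (int)v;      /* truncate toward zero */
    if (i < 0)   i = 0;  /* saturate low  */
    if (i > 255) i = 255;/* saturate high */
    return (unsigned char)i;
}
```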
  • 50. SIMD, Step 4 Act of Compiler
  • 51. Methodology of O3 Optimization • Clang gcc ??? • New Version >>>>> Old Version
  • 52. Latency && Throughput • Consecutive Loads or consecutive Stores reduce latency. • Specific pipeline rearrangement can hide latency. – VLIW, SLOT • Register dependencies between instructions incur a penalty. – Rearrange Load/Store by yourself – The compiler can deal with the dependency penalty
  • 53. About inline • Always-inline still has a chance to fail; you should check the function symbols in the assembly. – Lines of code are the key factor in Clang
  • 54. Vector Register Optimization • Contemporary compilers can NOT optimize certain algorithms with SIMD, because 90% of algorithms need a lot of shuffle instructions; you have to work those out on paper. • The compiler only knows how to vectorize simple for-loop unrolling. for(int i=0;i<64;i++) { …. } Compiler says: I know how to do vectorization
  • 55. Read Element and Write Back reg128 reg0; float4 a= {0,1,2,3}; reg0._float4 = a; float2 val1= reg0._float2[0]; reg0._float2[1]=val1; float val0=reg0._float[2]; reg0._float[3] = val0; 1. If the instruction is not supported, it writes to L1 with Load/Store, the same as an array. 2. It depends on whether the compiler is smart or not!! 0 1 2 3 write read read write write
  • 56. Dump Assembly is Important
  • 57. SIMD, Step 5 Methodology of Extreme Optimization
  • 58. • Fix all of the algorithm's parameters – Make them constant values • Remove branch prediction; the code will be huge, but fast • Don't be surprised: the code easily exceeds 4000 lines.
  • 59. Conception of SIMD Optimization FunctionA Algorithm A FunctionB Algorithm B FunctionC Algorithm C FunctionEnd (Final Algorithm) Develop One Month The previous code is of no use; only the final fused algorithm needed its one month of development Waste Time Develop One Month Develop One Month
  • 60. The Problems face on a daily basis
  • 61. About Data • Large data has to – be a multiple of 4 in size. – have a known maximum quantity. • Rearranging input data can make performance soar. • When the size is not a multiple of 4 – pad with zeros and still use SIMD – or handle the tail with General Purpose Registers.
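Rounding a length up to the next multiple of 4, as the padding bullet suggests, is a single add-and-mask; `pad_to_4` is a hypothetical helper name:

```c
#include <assert.h>

/* Round len up to a multiple of 4 so a 4-wide vector loop can
   process the whole (zero-padded) buffer without a scalar tail. */
static int pad_to_4(int len) {
    return (len + 3) & ~3;
}
```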
  • 62. Data Rearrangement a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a b a b a b ... a a a a a a ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... b b b b b b ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a b c a b c ... a a a a a a ... a a a a a a ... b b b b b b ... b b b b b b ... c c c c c c ... c c c c c c ...
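The rearrangement drawn on slide 62 is a deinterleave from an interleaved {a,b} stream into planar arrays; this scalar sketch shows the data movement that NEON's vld2 performs in one instruction (the name `deinterleave2` is hypothetical):

```c
#include <assert.h>

/* Deinterleave an a,b,a,b,... stream (array-of-structures) into two
   planar arrays (structure-of-arrays), the SIMD-friendly layout. */
static void deinterleave2(const float *src, float *a, float *b, int pairs) {
    for (int i = 0; i < pairs; i++) {
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
}
```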
  • 63. Unrolling by Your Hands Image
  • 64. Tradition Method for(int i=0;i<height;i++) { for(int j=0;j<width;j++) { if(...) // top else if(...) // bottom else if(...) // left else if(...) // right // middle } }
  • 65. SIMD for(int i=0;i<height;i++) // top { ...} for(int i=0;i<height;i++) { // left for(int j=0;j<width;j++) // middle { ... } // right } for(int i=0;i<height;i++) // bottom { ...}
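Slide 65's idea, hoisting the border cases out of the loop so the middle is branch-free, sketched on a 1-D 3-tap average (the name `smooth1d` is hypothetical; the deck shows only the loop skeleton):

```c
#include <assert.h>

/* Instead of testing for the border inside one loop (slide 64),
   handle the left and right edges separately so the main body has
   no branches and could be processed 4 lanes at a time. */
static void smooth1d(const float *src, float *dst, int len) {
    dst[0] = src[0];                     /* left border  */
    for (int i = 1; i < len - 1; i++)    /* branch-free middle */
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
    dst[len - 1] = src[len - 1];         /* right border */
}
```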
  • 66. In Order To Cooperate With SIMD, Crazy, Unlimited Unrolling The I-cache is enough (over 32KB); if it's not enough, we'll talk about it then
  • 67. Conclusion • SIMD is strongly linked to mathematics • An unknown field, with almost no courses. • Little material on the internet, and few people have documented success. • Do you want to develop new algorithms? You can try it.

Editor's Notes

  1. Simulating register behavior; the assembly level has no types
  2. o
  3. Under the premise of