3. What is SIMD?
• The extreme level of optimization for C/C++.
• Pointers only.
• You have to define exactly how memory, the registers, and the compiler
behave; only then can you push against the hardware limit.
(Diagram: SIMD sits between the C/C++ level and the assembly level.)
4. C = A + B ?
float arr0[4] = { 1,2,3,4 };   // A
float arr1[4] = { 5,6,7,8 };   // B
float arr2[4] = { 0 };         // C
(Diagram: each lane of C is the sum of the corresponding lanes of A and B.)
Result: arr2 => { 6,8,10,12 }
5. Why is SIMD fast?
for(int i=0;i<4;i++)
    arr2[i] = arr0[i] + arr1[i];
// equivalent pointer form
for(int i=0;i<4;i++)
    *(arr2 + i) = *(arr0 + i) + *(arr1 + i);
Assume every operation is 1 instruction and every instruction takes 1 cycle.
A rough count for the scalar loop:
1 (init) + 1*4 (loop counter) + (1+1)*4 + (1+1)*4 + (1+1)*4 (address + load/store
for arr0, arr1, arr2) + 1*4 (add) + 1*4 (compare/branch) = 37 cycles
6. Why is SIMD fast?
float32x4_t a,b,c;
a = *(float32x4_t *)arr0;
b = *(float32x4_t *)arr1;
c = a + b;
*(float32x4_t *)arr2 = c;
4 cycles (2 loads + 1 add + 1 store)
About 9x faster (37 / 4 ≈ 9). An explicit-intrinsic sketch follows below.
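For reference, the same 4-float add can be written with explicit NEON intrinsics
instead of casting through float32x4_t pointers. A minimal sketch, assuming
<arm_neon.h> is available (the function name add4 is illustrative, not from the
original slides):

#include <arm_neon.h>

void add4(const float *arr0, const float *arr1, float *arr2)
{
    float32x4_t a = vld1q_f32(arr0);   // one 128-bit load: {1,2,3,4}
    float32x4_t b = vld1q_f32(arr1);   // one 128-bit load: {5,6,7,8}
    float32x4_t c = vaddq_f32(a, b);   // one vector add, all 4 lanes at once
    vst1q_f32(arr2, c);                // one 128-bit store: {6,8,10,12}
}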
15. Memory
• In this era of approaching physical limits, CPU computation is not the
bottleneck; it still improves every year, while memory transfer speed stays
nearly constant.
• SIMD cuts CPU cycles by a factor of four or more by processing multiple
data elements in parallel.
• The remaining latency comes from moving data along the
Memory → L2 → L1 → register Load/Store path.
17. L1 cache
Situations where data goes through the L1 cache:
• Arrays created inside a function.
• Using more register variables than the CPU has registers: the excess is
written back to L1 cache via the stack pointer.
• Passing data through function arguments (with -O3, some arguments go
straight through registers and are not written back).
• A function call: the live registers are saved through the stack pointer and
read back into registers when the call finishes.
• An interrupt or a thread switch: the current registers are written out,
depending on how the OS handles it.
18. I-Cache/D-Cache
• Instruction Cache:
– Holds the compiled CPU instructions (the function symbol size). The first
execution of a function triggers a pre-fetch; from the second call on it is
already in cache. For a computer-vision application, the first run of a
function should therefore be excluded when measuring efficiency.
• Data Cache:
– This is the L1 cache we usually mean; the established methodology is to
pre-fetch data into the L1 cache.
19. Page Table
• Page size: 4096 bytes
• Cache line: 64 bytes
• A page contains 64 cache lines
• L2 cache: 5~10 MB
• L1 cache: 512 KB~1 MB
• L1 associativity: 2-way or 4-way
• An image: 320*240 or 640*480 bytes
• Will heavy cache misses occur once the memory in use exceeds the cache size?
20. Cache line
• 64 bytes = 16 floats (one cache line)
• 128 bits = 4 floats (one vector register)
(Diagram: the line shown as two 64-bit halves, each mapped to an address.)
21. Worldview
• In the SIMD world, if you want to reach the limit, a look-up table is usually
not the optimal method; on the contrary, when the table is large, recomputing
with vector registers can cut cycles by 4x or more.
• At the extreme level of optimization, once you keep Load/Store small, this
effect is very obvious!
22. Known Methodology
int arr0[100] = {1,2,3…};                        // global array: Memory (DDR3)
void test1 (float *src, float *dst, int len)
{
    int arr1[100] = {1,2,3…};                    // local array: stack, L1 cache
    int b = 4;                                   // constant: encoded in the instruction set
    int *arr2 = (int *)malloc(100*sizeof(int));  // heap allocation: Memory (DDR3)
    int c = len + b;
    …
}
24. The Compiler Is Not As Smart As You Think
void test0(float *src_dst, int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 cc = *src_dst_ptr + *src_dst_ptr;   // two loads
    *src_dst_ptr += cc;                        // one more load, then the store
    …
}
Result: three Loads, one Store
25. Correct Writing
void test0(float *src_dst, int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 val = *src_dst_ptr;                 // the only load
    float4 cc = val + val;
    *src_dst_ptr = cc + val;                   // the only store
    …
}
Result: one Load, one Store
26. Use Array Indexing Less, Use Pointer++ More
void test1(float *src, float *dst, int len)
{
    float4 *src_ptr = (float4 *)src;
    float4 *dst_ptr = (float4 *)dst;
    float4 reg0,reg1…;
    for(int i=0;i<len;i+=4)
    {
        reg0 = *src_ptr++;
        reg1 = *src_ptr++;
        reg0 = reg0 + reg1;
        …..
        *dst_ptr++ = reg0;
        *dst_ptr++ = reg1;
    }
}
• Not recommended:
void test2(float4 *src, float4 *dst, int len)
{
    int len_4 = len/4;
    float4 reg0,reg1…;
    for(int i=0;i<len_4;i+=2)
    {
        reg0 = src[i] + src[i+1];
        …
        dst[i] = reg0;
        dst[i+1] = src[i+1];
    }
}
27. Single Source And Destination, To Avoid Cache Misses / Page Faults
void test1(float *src_dst, int len)
{
    float4 *src_dst_ptr = (float4 *)src_dst;
    float4 reg0,reg1…;
    for(int i=0;i<len;i+=4)
    {
        reg0 = *src_dst_ptr++;
        reg1 = *src_dst_ptr++;
        reg0 = reg0 + reg1;
        …..
        *src_dst_ptr++ = reg0;
        *src_dst_ptr++ = reg1;
    }
}
28. • The cache line is 64 bytes; vector Load/Store wants 16-byte address
alignment.
• If a vector register Load/Store is not at an address that is a multiple of 16:
– Latency penalty
– It depends on the CPU architecture, but it occurs on almost all of them
(Diagram "Align/Unalign": byte cells between addresses 0x0000, 0x0010 and
0x0020; a 128-bit Load/Store should start from one of these 16-byte boundaries.)
A 16-byte-aligned allocation sketch follows below.
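One way to guarantee that alignment is to allocate vector buffers on a 16-byte
boundary. A minimal sketch, assuming C11's aligned_alloc is available (the
helper name alloc_aligned_floats is illustrative):

#include <stdlib.h>

float *alloc_aligned_floats(size_t count)
{
    // aligned_alloc requires the size to be a multiple of the alignment.
    size_t bytes = ((count * sizeof(float) + 15) / 16) * 16;
    return (float *)aligned_alloc(16, bytes);   // 16-byte aligned base address
}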
31. Register
• Arm64
– 32 Vector Registers
– 32 Scalar Registers
• Arm32
– 16 Vector Registers
– 32 Scalar Registers
• Intel SSE
– 16 Vector Registers
– 32 Scalar Registers
• DSP
– ? Vector Registers
– ? Scalar Registers
You have to keep track of the number of registers in use
(Very important!!)
32. Why
• Assuming everything is float, 32 Vector Registers can provide
– room for 128 values, simulating a small array (4 * 32)
– extreme computation with no need to write back to memory
• combined with shuffle instructions.
• If you use more than 32 vector variables at the same time, the excess data
is written back to L1, which adds latency.
33. Why
float arr[4*32+4] = {…};
float4 *arr_ptr = (float4 *)arr;
float4 a0,a1,a2,a3,a4,a5,a6 … a32;   // 33 vector variables, one more than the 32 registers
a0 = *arr_ptr++;
a1 = *arr_ptr++;
…
a32 = *arr_ptr++;
More than 32 register variables live at the same time: the compiler has to
spill, which inserts extra Load/Store into the computation and cannot be
optimized away.
34. About Register
• A register has no data type of its own; the type only exists in the
instructions applied to it at the assembly level.
• The variables holding data must not exceed the CPU's maximum register
count, but you should use all of the registers.
• Vector Registers
– Make good use of shuffle instructions
– Rearrange the input/output data
35. Act of Load/Store
float4 *src_ptr = (float4 *)src;        // addressing: 2 General Purpose Registers
float4 *dst_ptr = (float4 *)dst;
reg128 reg0,reg1,reg2,reg3… reg31;      // 32 Vector Registers, fully utilized
for(int i=0;i<640*480;i+=4) {           // loop counter: 1 General Purpose Register
    reg0._float4 = *src_ptr++;          // read all at once
    reg1._float4 = *src_ptr++;
    reg2._float4 = *src_ptr++;
    ….
    ..                                   // main algorithm
    *dst_ptr++ = reg0._float4;          // write all at once
    *dst_ptr++ = reg1._float4;
    *dst_ptr++ = reg2._float4;
    ….
}
36. Act of Function Call
void test1(float *src, float *dst, int len) {
    int a = len/4;
    int b = len%4;
    float4 aa = *(float4 *)src;
    float4 bb = *(float4 *)dst;
    float4 cc = aa + bb;
    int val = test2(src,dst,len);   // the call is where the hidden cost appears
    cc = aa + bb + cc;
    int c = (a+b+len)*val;
    …
}
What the call to test2() costs (all of it stack-pointer management):
• 2 General Purpose Registers are written to L1 cache, producing an act of Load/Store.
• 1 Vector Register is written to L1 cache, producing an act of Load/Store.
• The src and dst addresses are read from L1 cache into General Purpose Registers.
• The registers are cleaned up; the arguments are read from L1 cache back into registers.
• After the call returns, the original data is restored from L1 cache into the
Vector/General Purpose Registers.
37. Act of Function Argument
// Under the premise of -O3
void test3(float4 aa, float4 *bb, float4 &cc) {   // aa: directly through a register!
    …                                             // bb, cc: produce an act of Load/Store
}
void test4(float a, float *b, float &c) {         // a: directly through a register!
    …                                             // b, c: produce an act of Load/Store
}
int main() {
    float4 aa = { 0,0,0,0 }, bb = {1,1,1,1}, cc = {2,2,2,2};
    float a = 0, b = 1, c = 2;
    test3(aa, &bb, cc);
    test4(a, &b, c);
}
38. Key Points
• Call by address and call by reference both go through the L1 cache; unless
the call is successfully inlined, they are bound to be slow.
• Reduce function calls, all the way to the end.
42. Act of Branch Instruction
• Compared with a normal scalar comparison plus branch, a SIMD compare/select
is more than 4x faster.
• There is no branch prediction involved: the CPU pipeline never mispredicts
and gets flushed; it simply runs to the end (explosively fast). A branchless
compare-and-select sketch follows below.
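A minimal branchless sketch using NEON compare + select intrinsics, assuming
<arm_neon.h> (the function name clamp_to_limit is illustrative): every lane
where v > limit is replaced by limit, with no branch at all.

#include <arm_neon.h>

float32x4_t clamp_to_limit(float32x4_t v, float32x4_t limit)
{
    uint32x4_t mask = vcgtq_f32(v, limit);   // per-lane compare: v > limit ?
    return vbslq_f32(mask, limit, v);        // per-lane select: limit where true, v otherwise
}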
43. Act of Shuffle
• The instruction sets are like the sea; you have to find the best-fitting
shuffle.
– That is the key point of extreme optimization of the mathematical model.
– If you have never written a shuffle, you cannot claim you can write SIMD.
A small shuffle sketch follows below.
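A minimal shuffle sketch, assuming <arm_neon.h> (the function name reverse_lanes
is illustrative): reverse {a,b,c,d} into {d,c,b,a} with two shuffle instructions
and no memory traffic.

#include <arm_neon.h>

float32x4_t reverse_lanes(float32x4_t v)
{
    float32x4_t swapped = vrev64q_f32(v);    // {a,b,c,d} -> {b,a,d,c}
    return vextq_f32(swapped, swapped, 2);   // {b,a,d,c} -> {d,c,b,a}
}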
51. Methodology of O3 Optimization
• Clang or gcc ???
• New version >>>>> old version
52. Latency && Throughput
• Consecutive Loads or consecutive Stores reduce latency.
• Specific pipeline rearrangement can hide latency gaps.
– VLIW, SLOT
• Instruction sequences that reuse the same registers pay a dependency penalty.
– Rearrange the Loads/Stores yourself
– The compiler can also deal with the dependency penalty
A sketch of breaking a dependency chain follows below.
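A minimal sketch of breaking a register dependency chain, assuming float4 is a
4-float vector type as in the earlier slides (the function name
sum_two_accumulators is illustrative): two independent accumulators let
consecutive loads and adds overlap in the pipeline instead of every add waiting
on the previous one.

void sum_two_accumulators(float *src, float *dst, int len)
{
    float4 *src_ptr = (float4 *)src;
    float4 acc0 = {0,0,0,0};
    float4 acc1 = {0,0,0,0};
    for (int i = 0; i < len; i += 8) {   // 8 floats = two float4 per iteration
        acc0 = acc0 + *src_ptr++;        // chain 0
        acc1 = acc1 + *src_ptr++;        // chain 1, independent of chain 0
    }
    *(float4 *)dst = acc0 + acc1;        // combine the two chains once at the end
}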
53. About inline
• Even "always inline" has a chance to fail; you should check whether the
function symbol is still there in the assembly.
– The number of lines of code is the key factor in Clang.
A sketch follows below.
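A minimal sketch of forcing and then verifying inlining, assuming GCC/Clang (the
function name add_inline is illustrative): if inlining succeeded, the symbol
should no longer show up when you inspect the object file, e.g. with nm or
objdump -d.

static inline __attribute__((always_inline))
float add_inline(float a, float b)
{
    return a + b;
}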
54. Vector Register Optimization
• Specific algorithms can NOT be optimized with SIMD by contemporary
compilers, because about 90% of them need many shuffle instructions, which
you have to work out on paper.
• The compiler only knows how to unroll and vectorize plain for loops.
for(int i=0;i<64;i++)
{
….
}
Compiler says: "I know how to do vectorization."
55. Read Element and Write Back
reg128 reg0;
float4 a = {0,1,2,3};
reg0._float4 = a;                  // write the whole register
float2 val1 = reg0._float2[0];     // read the low 64-bit half
reg0._float2[1] = val1;            // write the high 64-bit half
float val0 = reg0._float[2];       // read lane 2
reg0._float[3] = val0;             // write lane 3
1. If the lane read/write instruction is supported, this stays in registers;
if not, it goes through L1 with Load/Store, the same as an array.
2. It depends on whether the compiler is smart or not!!
(Diagram: lanes 0 1 2 3 with the read and write accesses marked.)
58. • Fix all of the algorithm's parameters
– Turn them into constant values
• Remove branches entirely; the code becomes very large, but fast.
• Don't doubt it: such code easily runs past 4000 lines.
59. Conception of SIMD Optimization
(Diagram: FunctionA / Algorithm A, FunctionB / Algorithm B, FunctionC /
Algorithm C, each "develop one month" and marked as wasted time; FunctionEnd,
the final algorithm, is the only one that mattered. The previous code was of no
use; only the final algorithm needed its one month of development.)
61. About Data
• Large data sets should
– have a length that is a multiple of 4.
– have a known maximum quantity.
• Rearranging the input data lets the code fly.
• When the length is not a multiple of 4:
– pad with zeros and still use SIMD, or
– handle the tail with General Purpose Registers at the end.
A zero-padding sketch follows below.
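A minimal zero-padding sketch, assuming C11's aligned_alloc as in the earlier
sketch (the helper name alloc_padded is illustrative): round the length up to
the next multiple of 4 so the whole buffer can be processed 4-wide.

#include <stdlib.h>
#include <string.h>

float *alloc_padded(const float *src, int len, int *padded_len)
{
    int len4 = (len + 3) & ~3;                           // round up to a multiple of 4
    float *buf = (float *)aligned_alloc(16, len4 * sizeof(float));
    memcpy(buf, src, len * sizeof(float));               // copy the real data
    memset(buf + len, 0, (len4 - len) * sizeof(float));  // zero the padding lanes
    *padded_len = len4;
    return buf;
}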
62. Data Rearrangement
(Diagram: interleaved rows "a b a b a b ..." are rearranged into planar rows
"a a a a ..." and "b b b b ..."; likewise interleaved "a b c a b c ..." is
split into separate a, b and c planes.)
A de-interleaving sketch follows below.
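A minimal de-interleaving sketch with NEON structured loads, assuming
<arm_neon.h> and that len is a multiple of 8 (the function name deinterleave_ab
is illustrative): one vld2q_f32 splits "a b a b ..." into an a lane-group and a
b lane-group.

#include <arm_neon.h>

void deinterleave_ab(const float *src, float *a_plane, float *b_plane, int len)
{
    for (int i = 0; i < len; i += 8) {            // 8 interleaved floats = 4 (a,b) pairs
        float32x4x2_t ab = vld2q_f32(src + i);    // val[0] = a a a a, val[1] = b b b b
        vst1q_f32(a_plane + i / 2, ab.val[0]);    // store the a plane
        vst1q_f32(b_plane + i / 2, ab.val[1]);    // store the b plane
    }
}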
65. SIMD
Split the image into border regions so the middle can be processed with
full-width SIMD and the borders handled separately:
for(int i=0;i<height;i++) // top
{ ...}
for(int i=0;i<height;i++)
{
    // left
    for(int j=0;j<width;j++) // middle
    { ... }
    // right
}
for(int i=0;i<height;i++) // bottom
{ ...}
66. In Order To Cooperate With SIMD: Crazy, Unlimited Unrolling
The I-cache is big enough (over 32 KB);
if it turns out not to be, we'll talk about it then.
67. Conclusion
• SIMD is strongly linked to mathematics.
• It is a little-known field, with almost no courses.
• There is little material on the internet, and few people have organized it
successfully.
• Do you want to develop new algorithms? You can try it.