AA-sort with SSE4.1

Cybozu Labs
2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)

Agenda
• Introduction to AA-sort
• classic combsort
• vectorized combsort
• vectorized merge
• benchmark

2012/6/16 #x86opti 4 2 /29

AA-sort
• Aligned-Access sort
• proposed by Hiroshi Inoue et al. in
  "A high-performance sorting algorithm for multicore
  single-instruction multiple-data processors," 2011
  • http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
  • http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
• For SIMD
  fewer conditional branches, no unaligned data access
• For multicore processors
  they implemented it for PowerPC and Cell BE
• O(n log n) complexity
• I tried it on Intel CPUs (not complete)
  • https://github.com/herumi/opti/blob/master/intsort.hpp
  the current version uses only one processor

AA-sort
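• vectorized combsort for each block (block size <= L2 cache?)
• vectorized merge of the sorted blocks

input array : | block 0 | block 1 | block 2 | block 3 | ...
sort        : |  sort   |  sort   |  sort   |  sort   | ...
merge       : |       merge       |       merge       | ...
merge       : |                 merge                 | ...

(each block, and each merged run, is sorted: "<")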

AA-sort algorithm
• sort each block
  • O(n log n)
• merge sorted blocks
  • O(n) per merge pass
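
The two phases above can be sketched in scalar code. This is only a structural illustration, not the author's implementation: `std::sort` stands in for the vectorized combsort, `std::inplace_merge` stands in for the vectorized merge, and `twoPhaseSort` and `blockSize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar sketch of the two phases: sort each block, then merge sorted runs.
// blockSize is a hypothetical tuning parameter (number of elements per block).
void twoPhaseSort(std::vector<uint32_t>& a, std::size_t blockSize) {
    const std::size_t n = a.size();
    if (blockSize == 0) blockSize = 1;
    // Phase 1: sort every block on its own
    // (std::sort is a stand-in for the vectorized combsort).
    for (std::size_t begin = 0; begin < n; begin += blockSize) {
        const std::size_t end = std::min(begin + blockSize, n);
        std::sort(a.begin() + begin, a.begin() + end);
    }
    // Phase 2: merge neighbouring sorted runs, doubling the run width each pass
    // (std::inplace_merge is a stand-in for the vectorized merge).
    for (std::size_t width = blockSize; width < n; width *= 2) {
        for (std::size_t begin = 0; begin + width < n; begin += 2 * width) {
            const std::size_t mid = begin + width;
            const std::size_t end = std::min(begin + 2 * width, n);
            std::inplace_merge(a.begin() + begin, a.begin() + mid, a.begin() + end);
        }
    }
}
```

Each merge pass touches every element once, and the number of passes grows with the logarithm of the number of blocks.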


classic combsort(1/2)
• improved bubble sort
• unstable
• O(n log n)
• compares two elements separated by a gap (>= 1)
  the gap is divided by a shrink factor (about 1.3) after each pass

size_t nextGap(size_t N) { return (N * 10) / 13; }

void combsort(uint32_t *a, size_t N) {
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i < N - gap; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
        gap = nextGap(gap);
    }
    …

classic combsort(2/2)
• gap = 1 means bubble sort
• loop until the array is fully sorted

    …
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i < N - 1; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}


gap function
• Combsort11
• a gap sequence that ends with [11, 8, 6, 4, 3, 2, 1] seems good
  according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm

size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // (*)
    return n;
}

• slightly faster when line (*) is added
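
To see what line (*) does to the gaps, the sketch below lists the gap sequence for a given array size and wraps the two combsort slides into one scalar routine. `gapSequence` and `combsort11` are hypothetical names introduced for illustration; `nextGap` is copied from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// nextGap as on the slide, including the Combsort11 adjustment (*).
std::size_t nextGap(std::size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // (*)
    return n;
}

// Hypothetical helper: the gaps visited for an array of N elements, down to 1.
std::vector<std::size_t> gapSequence(std::size_t N) {
    std::vector<std::size_t> gaps;
    for (std::size_t gap = nextGap(N); gap >= 1; gap = nextGap(gap)) {
        gaps.push_back(gap);
    }
    return gaps;
}

// Scalar combsort using the gap function above (both slides combined).
void combsort11(std::vector<uint32_t>& a) {
    const std::size_t N = a.size();
    if (N < 2) return;
    for (std::size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (std::size_t i = 0; i + gap < N; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
    }
    // gap == 1: bubble sort passes until no swap happens.
    for (bool swapped = true; swapped;) {
        swapped = false;
        for (std::size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                swapped = true;
            }
        }
    }
}
```

For N = 100 the gaps are 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1: the value 10 that plain division would produce after 14 is replaced by 11.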


vectorized combsort
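• step1 : sort values within each vector (32bit x 4)
• step2 : SIMD version of combsort
• step3 : reorder data

input : 6 8 9 3 5 7 12 14 0 4 1 20 11 ...

step1 : the input is loaded into vectors v0, v1, v2, v3, ... (lanes +0 .. +3)
        and the values within each vector are sorted
step2 : the vectors are sorted so that each lane holds one sorted run
        +0 : 0 1 3 … 101
        +1 : 102 104 105 … 380
        +2 : 389 391 392 … 502
        +3 : 511 515 612 … 973
step3 : the lanes are written out one after another
        0 1 3 … 101 102 104 105 … 380 389 391 392 …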


step1
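• step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3
• step1.2 : transpose

input (columns are the vectors, rows are the lanes j = 0..3)
     v0  v1  v2  v3
      3   5   0   8
      2   7   1   2
      8  12   4  13
      9  14  20  15

after step1.1 (sort: each row is sorted across v0..v3)
     v0  v1  v2  v3
      0   3   5   8
      1   2   2   7
      4   8  12  13
      9  14  15  20

after step1.2 (transpose: each vector is sorted)
     v0  v1  v2  v3
      0   1   4   9
      3   2   8  14
      5   2  12  15
      8   7  13  20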


sort of 4 items
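• use pmaxud and pminud for uint32_t x 4
  "<" is a compare-exchange : (a, b) -> (min(a,b), max(a,b))

stage 1 : (v0, v1) -> (min01, max01)       (v2, v3) -> (min23, max23)
stage 2 : min0123 = min(min01, min23)      max0123 = max(max01, max23)
          s = max(min01, min23)            t = min(max01, max23)
stage 3 : (s, t) -> (min(s,t), max(s,t))

sorted : min0123, min(s,t), max(s,t), max0123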


source of step1.1
• V128 is a vector type of 32-bit integer x 4
• pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

void sort_step1_vec(V128 x[4])
{
    V128 min01 = pminud(x[0], x[1]);
    V128 max01 = pmaxud(x[0], x[1]);
    V128 min23 = pminud(x[2], x[3]);
    V128 max23 = pmaxud(x[2], x[3]);
    x[0] = pminud(min01, min23);
    x[3] = pmaxud(max01, max23);
    V128 s = pmaxud(min01, min23);
    V128 t = pminud(max01, max23);
    x[1] = pminud(s, t);
    x[2] = pmaxud(s, t);
}


transpose of 4x4 matrix
• use unpcklps and unpckhps
  (columns are the vectors, rows are the lanes +0..+3)

     x0  x1  x2  x3
+0    3   5   0   8
+1    2   7   1   2
+2    8  12   4  13
+3    9  14  20  15

t0 = unpcklps(x0, x2)    t1 = unpcklps(x1, x3)
t2 = unpckhps(x0, x2)    t3 = unpckhps(x1, x3)

     t0  t1  t2  t3
+0    3   5   8  12
+1    0   8   4  13
+2    2   7   9  14
+3    1   2  20  15

x0 = unpcklps(t0, t1)    x1 = unpckhps(t0, t1)
x2 = unpcklps(t2, t3)    x3 = unpckhps(t2, t3)

     x0  x1  x2  x3
+0    3   2   8   9
+1    5   7  12  14
+2    0   1   4  20
+3    8   2  13  15


source of transpose and step1
void transpose(V128 x[4])
{
    V128 x0 = x[0];
    V128 x1 = x[1];
    V128 x2 = x[2];
    V128 x3 = x[3];
    V128 t0 = unpcklps(x0, x2);
    V128 t1 = unpcklps(x1, x3);
    V128 t2 = unpckhps(x0, x2);
    V128 t3 = unpckhps(x1, x3);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}

void sort_step1(V128 *va, size_t N)
{
    for (size_t i = 0; i < N; i += 4) {
        sort_step1_vec(&va[i]);
        transpose(&va[i]);
    }
}
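
The effect of step1 on one group of four vectors can be checked with a scalar model. This is a sketch under assumptions: `Vec4` is a plain array standing in for V128, and `vmin`, `vmax`, `sortStep1VecModel` and `transposeModel` are hypothetical helpers that mimic pminud, pmaxud, sort_step1_vec and transpose without using SSE instructions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for V128

// Lane-wise minimum and maximum, modelling pminud and pmaxud.
Vec4 vmin(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (int j = 0; j < 4; j++) r[j] = std::min(a[j], b[j]);
    return r;
}
Vec4 vmax(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (int j = 0; j < 4; j++) r[j] = std::max(a[j], b[j]);
    return r;
}

// step1.1: for every lane j, sort x[0][j], x[1][j], x[2][j], x[3][j].
void sortStep1VecModel(Vec4 x[4]) {
    const Vec4 min01 = vmin(x[0], x[1]);
    const Vec4 max01 = vmax(x[0], x[1]);
    const Vec4 min23 = vmin(x[2], x[3]);
    const Vec4 max23 = vmax(x[2], x[3]);
    const Vec4 s = vmax(min01, min23);
    const Vec4 t = vmin(max01, max23);
    x[0] = vmin(min01, min23);
    x[1] = vmin(s, t);
    x[2] = vmax(s, t);
    x[3] = vmax(max01, max23);
}

// step1.2: 4x4 transpose, so that each vector ends up sorted.
void transposeModel(Vec4 x[4]) {
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) std::swap(x[i][j], x[j][i]);
    }
}
```

Feeding in the values from the step1 slide (v0 = 3 2 8 9, v1 = 5 7 12 14, v2 = 0 1 4 20, v3 = 8 2 13 15) gives v0 = 0 3 5 8, v1 = 1 2 2 7, v2 = 4 8 12 13, v3 = 9 14 15 20 after both sub-steps.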


SIMD version combsort
• the first half of the code uses
  • vector_cmpswap
  • vector_cmpswap_skew

bool sort_step2(V128 *va, size_t N) {
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i < N - gap; i++) {
            vector_cmpswap(va[i], va[i + gap]);
        }
        for (size_t i = N - gap; i < N; i++) {
            vector_cmpswap_skew(va[i], va[i + gap - N]);
        }
        gap = nextGap(gap);
    }
    ...


vector_cmpswap
• no conditional branch
  (a, b) -> (min(a,b), max(a,b))
• the scalar compare-and-swap

  if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

  is vectorised as

void vector_cmpswap(V128& a, V128& b)
{
    V128 t = pmaxud(a, b);
    a = pminud(a, b);
    b = t;
}


vector_cmpswap_skew
• for the boundary of the array

a  = [a3 : a2 : a1 : a0]
b  = [b3 : b2 : b1 : b0]

(a', b') = vector_cmpswap_skew(a, b)

a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]
b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]
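
The two compare-and-swap primitives can be modelled lane by lane. This is a scalar sketch, not the SSE4.1 code from the repository: `Vec4` stands in for V128 with index 0 meaning lane 0 (a0), and `cmpswapModel` and `cmpswapSkewModel` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for V128, index = lane

// Model of vector_cmpswap: lane j of a gets the minimum, lane j of b the maximum.
void cmpswapModel(Vec4& a, Vec4& b) {
    for (int j = 0; j < 4; j++) {
        if (a[j] > b[j]) std::swap(a[j], b[j]);
    }
}

// Model of vector_cmpswap_skew: lane j of a is compared with lane j + 1 of b.
// a3 and b0 have no partner and stay unchanged.
void cmpswapSkewModel(Vec4& a, Vec4& b) {
    for (int j = 0; j < 3; j++) {
        if (a[j] > b[j + 1]) std::swap(a[j], b[j + 1]);
    }
}
```

The skewed variant is needed when the index i + gap runs past the last vector: in the step2 layout the logical successor of lane j in the last vectors is lane j + 1 in the first vectors. With a = (5, 6, 7, 8) and b = (1, 2, 3, 4), listed from lane 0 to lane 3, the skewed version yields a' = (2, 3, 4, 8) and b' = (1, 5, 6, 7).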


isSortedVec
• check whether the array is sorted
• ptest_zf(a, b) is true if (a & b) == 0
• a <= b ⇔ max(a,b) == b ⇔ c := max(a,b) - b == 0
• pcmpgtd compares signed int32_t, so it cannot be used for uint32_t

bool isSortedVec(const V128 *va, size_t N) {
    for (size_t i = 0; i < N - 1; i++) {
        V128 a = va[i];
        V128 b = va[i + 1];
        V128 c = pmaxud(a, b);
        c = psubd(c, b);
        if (!ptest_zf(c, c)) {
            return false;
        }
    }
    return true;
}

loop for gap == 1
• vectorised bubble sort for gap == 1
• give up if the loop count reaches maxLoop
  and fall back to std::sort
• this rarely happens

const int maxLoop = 10;
for (int loop = 0; loop < maxLoop; loop++) {
    for (size_t i = 0; i < N - 1; i++) {
        vector_cmpswap(va[i], va[i + 1]);
    }
    vector_cmpswap_skew(va[N - 1], va[0]);
    if (isSortedVec(va, N)) return true;
}
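
Steps 2 and 3 can be put together in a scalar model to check that the lane-wise combsort, the skewed boundary swap and the final reorder really yield a sorted block. This is a sketch under assumptions, not the repository code: `Vec4` stands in for V128, all function names are hypothetical, step1 is left out, and the input length is assumed to be a multiple of 4 with at least two vectors (anything else goes straight to `std::sort`).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for V128

std::size_t nextGap(std::size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;
    return n;
}
void cmpswap(Vec4& a, Vec4& b) {      // lane j of a vs lane j of b
    for (int j = 0; j < 4; j++) if (a[j] > b[j]) std::swap(a[j], b[j]);
}
void cmpswapSkew(Vec4& a, Vec4& b) {  // lane j of a vs lane j + 1 of b
    for (int j = 0; j < 3; j++) if (a[j] > b[j + 1]) std::swap(a[j], b[j + 1]);
}
bool isSortedVecModel(const std::vector<Vec4>& va) {
    for (std::size_t i = 0; i + 1 < va.size(); i++)
        for (int j = 0; j < 4; j++)
            if (va[i][j] > va[i + 1][j]) return false;
    return true;
}

// step2: returns false when it gives up after maxLoop bubble passes.
bool sortStep2Model(std::vector<Vec4>& va) {
    const std::size_t N = va.size(); // assumed >= 2
    for (std::size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (std::size_t i = 0; i < N - gap; i++) cmpswap(va[i], va[i + gap]);
        for (std::size_t i = N - gap; i < N; i++) cmpswapSkew(va[i], va[i + gap - N]);
    }
    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop; loop++) {
        for (std::size_t i = 0; i + 1 < N; i++) cmpswap(va[i], va[i + 1]);
        cmpswapSkew(va[N - 1], va[0]);
        if (isSortedVecModel(va)) return true;
    }
    return false;
}

// step2 + step3 for one block, falling back to std::sort when step2 gives up.
std::vector<uint32_t> sortBlockModel(const std::vector<uint32_t>& in) {
    std::vector<uint32_t> out(in);
    if (in.size() % 4 != 0 || in.size() < 8) {
        std::sort(out.begin(), out.end());
        return out;
    }
    const std::size_t N = in.size() / 4;
    std::vector<Vec4> va(N);
    for (std::size_t i = 0; i < N; i++)
        for (int j = 0; j < 4; j++) va[i][j] = in[i * 4 + j];
    if (sortStep2Model(va)) {
        // step3: lane j of vector i is element j * N + i of the sorted block.
        for (std::size_t i = 0; i < N; i++)
            for (int j = 0; j < 4; j++) out[j * N + i] = va[i][j];
    } else {
        std::sort(out.begin(), out.end());
    }
    return out;
}
```

The sortedness check runs right after the skewed swap, so when every lane is sorted the boundary between lane j and lane j + 1 is also in order.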



merge two sorted vectors
• a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
• c = [b:a] = merge and sort (a, b)

a = a0 a1 a2 a3 (sorted)
b = b0 b1 b2 b3 (sorted)

[b:a] = vector_merge(a,b)

after the call, [b:a] holds all eight values in sorted order


data flow of merge
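inputs : a0 a1 a2 a3 (sorted), b0 b1 b2 b3 (sorted)

stage 1 : 4 comparisons, (a_i, b_i) -> (min_ii, max_ii) for i = 0, 1, 2, 3
          min00 max00 min11 max11 min22 max22 min33 max33
stage 2 : 2 comparisons
stage 3 : 3 comparisons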


source of vector_merge
• Too complex
• Is there a better idea?

void vector_merge(V128& a, V128& b) {
    V128 m = pminud(a, b);
    V128 M = pmaxud(a, b);
    V128 s0 = punpckhqdq(m, m);
    V128 s1 = pminud(s0, M);
    V128 s2 = pmaxud(s0, M);
    V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
    V128 s4 = punpcklqdq(s2, m);
    s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
    V128 s5 = pminud(s3, s4);
    V128 s6 = pmaxud(s3, s4);
    V128 s7 = pinsrd<2>(s5, movd(s6));
    V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
    a = pshufd<PACK(1, 2, 0, 3)>(s7);
    b = pshufd<PACK(3, 2, 0, 1)>(s8);
}
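
The comparison pattern behind vector_merge is easier to read in scalar form. The sketch below uses Batcher's odd-even merge network for two sorted 4-element inputs, which has 4, 2 and 3 comparisons per stage and so appears to match the data-flow slide. It is an assumption that the shuffle sequence above realises this same network. `Vec4`, `cmpx` and `vectorMergeModel` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for V128, index = lane

// Compare-exchange: afterwards lo <= hi.
void cmpx(uint32_t& lo, uint32_t& hi) {
    if (lo > hi) std::swap(lo, hi);
}

// Merge two sorted 4-element vectors with an odd-even merge network.
// Afterwards a holds the four smallest values and b the four largest, both sorted.
void vectorMergeModel(Vec4& a, Vec4& b) {
    // stage 1: a_i against b_i (a becomes the lane-wise min, b the lane-wise max)
    for (int i = 0; i < 4; i++) cmpx(a[i], b[i]);
    // stage 2: the upper half of the minima against the lower half of the maxima
    cmpx(a[2], b[0]);
    cmpx(a[3], b[1]);
    // stage 3: the remaining neighbouring pairs
    cmpx(a[1], a[2]);
    cmpx(a[3], b[0]);
    cmpx(b[1], b[2]);
}
```

For example a = (1, 3, 5, 7) and b = (2, 4, 6, 8) become a = (1, 2, 3, 4) and b = (5, 6, 7, 8). The network has no data-dependent branch, which is what makes it suitable for pminud and pmaxud.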

std::merge()
• merge [begin1, end1) and [begin2, end2)

template <class In1, class In2, class Out>
Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
{
    for (;;) {
        // check for an exhausted range before dereferencing its iterator
        if (begin1 == end1) return std::copy(begin2, end2, out);
        if (begin2 == end2) return std::copy(begin1, end1, out);
        *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
    }
}


vectorised merge
• merge arrays with vector_merge()

void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN) {
    uint32_t aPos = 0, bPos = 0, outPos = 0;
    V128 vMin = va[aPos++];
    V128 vMax = vb[bPos++];
    for (;;) {
        vector_merge(vMin, vMax);
        vo[outPos++] = vMin;
        if (aPos < aN) {
            if (bPos < bN) {
                V128 ta = va[aPos];
                V128 tb = vb[bPos];
                // compare ta0 with tb0
                if (movd(ta) <= movd(tb)) {
                    vMin = ta;
                    aPos++;
                } else {
                    vMin = tb;
                    bPos++;
                }
                ...
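
The slide stops before the cases where one input runs out. The scalar model below shows one possible way to complete the loop. It is a hedged sketch, not the author's code: `Vec4` stands in for V128, `vectorMergeStandIn` simply sorts the eight values with `std::sort` in place of vector_merge, `mergeModel` is a hypothetical name, and both inputs are assumed to be non-empty and sorted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for V128

// Stand-in for vector_merge: a receives the four smallest values, b the four largest.
void vectorMergeStandIn(Vec4& a, Vec4& b) {
    std::array<uint32_t, 8> t;
    std::copy(a.begin(), a.end(), t.begin());
    std::copy(b.begin(), b.end(), t.begin() + 4);
    std::sort(t.begin(), t.end());
    std::copy(t.begin(), t.begin() + 4, a.begin());
    std::copy(t.begin() + 4, t.end(), b.begin());
}

// Merge two sorted arrays of vectors (both assumed non-empty).
std::vector<Vec4> mergeModel(const std::vector<Vec4>& va, const std::vector<Vec4>& vb) {
    std::vector<Vec4> vo;
    vo.reserve(va.size() + vb.size());
    std::size_t aPos = 0, bPos = 0;
    Vec4 vMin = va[aPos++];
    Vec4 vMax = vb[bPos++];
    for (;;) {
        vectorMergeStandIn(vMin, vMax);
        vo.push_back(vMin); // the four smallest values are final
        const bool aLeft = aPos < va.size();
        const bool bLeft = bPos < vb.size();
        if (aLeft && (!bLeft || va[aPos][0] <= vb[bPos][0])) {
            vMin = va[aPos++]; // a has the smaller head, or b is exhausted
        } else if (bLeft) {
            vMin = vb[bPos++];
        } else {
            vo.push_back(vMax); // nothing left to load: flush the upper half
            return vo;
        }
    }
}
```

Only the first lane of the next vectors is compared, so the loop has one branch per four output values instead of one per value.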


block size and rate of sort
• What is a good block size for the vectorised sort?
• half the size of L2 is recommended for PowerPC 970MP
  L2 = 1MiB => 512KiB => block size = 128Ki elements of uint32_t
• BS = 32Ki seems good for Xeon, Core i7
• profile of sort and merge

[Figure: share of time spent in merge(%) and sort(%), on an axis from 0 to 100]
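
The block-size arithmetic above can be written down as a tiny helper. `blockElements` is a hypothetical name, and the "half of L2" rule is the PowerPC 970MP recommendation quoted on the slide, not a measured optimum for other CPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements per block when a block should fill half of the L2 cache.
constexpr std::size_t blockElements(std::size_t l2Bytes, std::size_t elemSize) {
    return (l2Bytes / 2) / elemSize;
}

// L2 = 1 MiB => 512 KiB per block => 128Ki uint32_t elements.
static_assert(blockElements(1024 * 1024, sizeof(uint32_t)) == 128 * 1024,
              "half of a 1 MiB L2 holds 128Ki uint32_t values");
```

By the same arithmetic, BS = 32Ki uint32_t elements corresponds to a 128 KiB block.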


Benchmark(1/3)
• AA-sort vs std::sort for random data
• Xeon X5650 + gcc-4.6.3
  4 times faster for fewer than 64Ki elements, 2.85 times faster for 4Mi elements

[Figure: clock cycles (log scale, 1 to 10000000) versus # of uint32_t (16 to 4Mi) for std::sort and AA-sort]


Benchmark(2/3)
• sort 64Ki uint on Xeon + gcc-4.6.3
• AA-sort speed does not depend strongly on the input pattern

[Figure: std::sort and AA-sort compared for several input patterns, on an axis from 0 to 25000]


Benchmark(3/3)
• sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11

[Figure: std::sort(gcc), AA-sort(gcc), std::sort(VC) and AA-sort(VC) compared, on an axis from 0 to 16000]
