AA-sort with SSE4.1
Cybozu Labs, 2012/6/16, MITSUNARI Shigeo (@herumi)
x86/x64 optimization seminar 4 (#x86opti)
Agenda Introduction of AA-sort   classic combsort   vectorized combsort   vectorized merge benchmark2012/6/16 #x86opt...
AA-sort Aligned-Access sort   proposed by Hiroshi Inoue, etc. in    "A high-performance sorting algorithm for multicore ...
AA-sort vectorized combsort for a block (<= L2cache?) vectorized merge sorted block                                     ...
AA-sort algorithm sort each block   O(n log n) merge sorted block   O(n)2012/6/16 #x86opti 4      5 /29
classic combsort(1/2) improved bubble sort   unstable   O(n log n)   compare two elements having a gap(>=1)     gap i...
classic combsort(2/2) gap = 1 means bubble sort   loop until the array is fully sorted           …           for (;;) { ...
gap function Combsort11   last pattern of gap [11, 8, 6, 4, 3, 2, 1] seems good    by http://cs.clackamas.cc.or.us/molat...
vectorized combsort step1 : sort values within each vector(32bitx4) step2 : SIMD version combsort step3 : reorder data ...
step1 step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0,  1, 2, 3 step1.2 : transpose  3       5      0     8  2       7   ...
sort of 4 items use max ud, minud for uint32_t x 4        a                 b                  <                 v0      ...
source of step1.1 V128 is a type of 32-bit integer x 4   pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3                ...
transpose of 4x4 matrix use unpcklps and unpckhps                                 t0=unpcklps(x0,x2)+0     3       5     ...
source of transpose and step1  void transpose(V128 x[4])       void sort_step1(V128 *va, size_t N)  {                     ...
SIMD version combsort first half code use   vector_cmpswap   vector_cmpswap_skew     bool sort_step2(V128 *va, size_t N...
vector_cmpswap no conditional branch           a              b                  <       min(a,b)        max(a,b)     if ...
vector_cmpswap_skew for boundary of array       a               a3      a2           a1           a0       b             ...
isSortedVec check whether array is sorted   ptest_zf(a, b) is true if (a & b) == 0   a <= b  max(a,b) == b  c := max(...
loop for gap == 1 vectorised bubble sort for gap == 1   retire if loop count reaches maxLoop     fall to std::sort     ...
AA-sort algorithm sort each block   O(n log n) merge sorted block   O(n)2012/6/16 #x86opti 4      20 /29
merge two sorted vector   a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are soreted   c = [b:a] = merge and sort (a, b)          ...
data flow of merge                                   sorted                                          sorted     a0        ...
source of vector_merge Too complex   good idea?          void vector_merge(V128& a, V128& b) {                         V...
std::merge() merge [begin1, end1) and [begin2, end2) template <class In1, class In2, class Out> Out merge(In1 begin1, In1...
vectorised merge merge arrays with vector_merge() void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t ...
block size and rate of sort What is good size for vectorised sort?   half size of L2 is recommended for PowerPC 970MP   ...
Benchmark(1/3) AA-sort vs std::sort for random data   Xeon X5650 + gcc-4.6.3      4 times faster for # < 64Ki, 2.85 tim...
Benchmark(2/3) sort 64Ki uint on Xeon + gcc-4.6.3   AA-sort speed does not strongly depend on pattern    25000          ...
Benchmark(3/3) sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11   16000                                        fast   14000  ...


  1. AA-sort with SSE4.1
     - Cybozu Labs, 2012/6/16
     - MITSUNARI Shigeo (@herumi)
     - x86/x64 optimization seminar 4 (#x86opti)
  2. Agenda
     - Introduction of AA-sort
     - classic combsort
     - vectorized combsort
     - vectorized merge
     - benchmark
  3. 3. AA-sort Aligned-Access sort  proposed by Hiroshi Inoue, etc. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011  http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm  http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm  For SIMD less conditional branch, no unaligned data access  For multicore processors they implemented it for PowerPC and Cell BE  O(n log n) complexity I tried it for Intel CPU(not complete)  https://github.com/herumi/opti/blob/master/intsort.hpp current version is for only one processor2012/6/16 #x86opti 4 3 /29
  4. 4. AA-sort vectorized combsort for a block (<= L2cache?) vectorized merge sorted block input array block 0 block 1 block 2 block3 ... sort sort sort sort < < < < ... merge merge < < ... merge < ...2012/6/16 #x86opti 4 4 /29
  5. AA-sort algorithm
     - sort each block: O(n log n)
     - merge sorted blocks: O(n) per merge pass
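The two-phase structure on slides 4 and 5 can be sketched in scalar code. This is only an illustration of the control flow, not the deck's implementation: `std::sort` and `std::inplace_merge` stand in for the vectorized combsort and the vectorized merge, and the name `blockSortThenMerge` and the `blockSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the block structure only. std::sort and std::inplace_merge are
// scalar stand-ins (assumptions) for the vectorized combsort and merge.
inline void blockSortThenMerge(std::vector<uint32_t>& v, size_t blockSize) {
    const size_t n = v.size();
    if (blockSize == 0 || n == 0) return;
    uint32_t* p = v.data();
    // Phase 1: sort each block independently.
    for (size_t i = 0; i < n; i += blockSize) {
        std::sort(p + i, p + std::min(i + blockSize, n));
    }
    // Phase 2: merge neighbouring sorted runs; the run width doubles per pass.
    for (size_t width = blockSize; width < n; width *= 2) {
        for (size_t i = 0; i + width < n; i += 2 * width) {
            std::inplace_merge(p + i, p + i + width,
                               p + std::min(i + 2 * width, n));
        }
    }
}
```

Each merge pass touches every element once, and the number of passes grows with the logarithm of the number of blocks.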
  6. classic combsort (1/2)
     - improved bubble sort
       - unstable
       - about O(n log n) in practice (the worst case is O(n^2))
     - compares two elements separated by a gap (>= 1)
     - the gap is divided by a shrink factor (about 1.3) after each pass

         size_t nextGap(size_t N) { return (N * 10) / 13; }

         void combsort(uint32_t *a, size_t N) {
             size_t gap = nextGap(N);
             while (gap > 1) {
                 for (size_t i = 0; i < N - gap; i++) {
                     if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
                 }
                 gap = nextGap(gap);
             }
             …
  7. classic combsort (2/2)
     - gap = 1 means bubble sort
       - loop until the array is fully sorted

             …
             for (;;) {
                 bool isSwapped = false;
                 for (size_t i = 0; i < N - 1; i++) {
                     if (a[i] > a[i + 1]) {
                         std::swap(a[i], a[i + 1]);
                         isSwapped = true;
                     }
                 }
                 if (!isSwapped) return;
             }
         }
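Joining the two halves shown on slides 6 and 7 gives one complete function. The sketch below follows the slide code; the early return for `N < 2` and the loop bounds written as `i + gap < N` are small additions of mine so that the unsigned subtraction cannot wrap around.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

inline size_t nextGap(size_t n) { return (n * 10) / 13; }

inline void combsort(uint32_t* a, size_t N) {
    if (N < 2) return;  // added guard: nothing to sort
    // Phase 1: compare-and-swap passes with a shrinking gap.
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i + gap < N; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
        gap = nextGap(gap);
    }
    // Phase 2: gap == 1 is bubble sort; repeat until a pass swaps nothing.
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}
```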
  8. gap function
     - Combsort11
       - the final gap pattern [11, 8, 6, 4, 3, 2, 1] seems good
       - from http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm

         size_t nextGap(size_t n) {
             n = (n * 10) / 13;
             if (n == 9 || n == 10) return 11; // (*)
             return n;
         }

     - a little faster when line (*) is added
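To see the effect of line (*), the gap sequence can be listed for a given array length. The helper `gapSequence` below is a hypothetical name introduced only for this illustration; `nextGap` is the function from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combsort11 gap function from the slide.
inline size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;  // (*)
    return n;
}

// Hypothetical helper: all gaps used for an array of N elements,
// ending with the final gap of 1 (the bubble sort phase).
inline std::vector<size_t> gapSequence(size_t N) {
    std::vector<size_t> gaps;
    size_t gap = nextGap(N);
    while (gap > 1) {
        gaps.push_back(gap);
        gap = nextGap(gap);
    }
    gaps.push_back(1);
    return gaps;
}
```

For N = 100 this yields 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1. Without line (*), the gap after 14 would be 10 and the tail would be 10, 7, 5, 3, 2, 1 instead.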
  9. vectorized combsort
     - step1 : sort values within each vector (32bit x 4)
     - step2 : SIMD version of combsort
     - step3 : reorder data
     - [Figure: the input 6 8 9 3 5 7 12 14 0 4 1 20 11 ... is loaded into vectors v0, v1, v2, v3, ... with lanes +0 to +3. After step1 and step2 the data forms four sorted runs (0 1 3 … 101), (102 104 105 … 380), (389 391 392 … 502) and (511 515 612 … 973). step3 reorders them into 0 1 3 … 101 102 104 105 … 380 389 391 392 …]
  10. 10. step1 step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3 step1.2 : transpose 3 5 0 8 2 7 1 2 step1.1 8 12 4 13 9 14 20 15 sort v0 v1 v2 v3 0 3 5 8 step1.2 1 2 2 7 4 8 12 13 transpose 9 14 15 20 0 1 4 9 3 2 8 14 5 2 12 15 8 7 13 202012/6/16 #x86opti 4 10 /29
  11. sort of 4 items
      - use pmaxud and pminud for uint32_t x 4
      - each "<" box maps (a, b) to (min(a,b), max(a,b))
      - data flow for the inputs v0, v1, v2, v3:
        - (min01, max01) from (v0, v1); (min23, max23) from (v2, v3)
        - min0123 = min(min01, min23); s = max(min01, min23)
        - t = min(max01, max23); max0123 = max(max01, max23)
        - sorted output: min0123, min(s,t), max(s,t), max0123
  12. source of step1.1
      - V128 is a vector type holding four 32-bit integers
      - pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

          void sort_step1_vec(V128 x[4])
          {
              V128 min01 = pminud(x[0], x[1]);
              V128 max01 = pmaxud(x[0], x[1]);
              V128 min23 = pminud(x[2], x[3]);
              V128 max23 = pmaxud(x[2], x[3]);
              x[0] = pminud(min01, min23);
              x[3] = pmaxud(max01, max23);
              V128 s = pmaxud(min01, min23);
              V128 t = pminud(max01, max23);
              x[1] = pminud(s, t);
              x[2] = pmaxud(s, t);
          }
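The min/max network can be checked without SSE4.1 by modelling a vector as four scalar lanes. In this sketch `V4`, `vmin` and `vmax` are hypothetical scalar stand-ins for `V128`, `pminud` and `pmaxud`; the body of `sort_step1_vec` is the slide code unchanged.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for V128 (assumption): four uint32_t lanes.
using V4 = std::array<uint32_t, 4>;

// Lane-wise minimum, modelling pminud.
inline V4 vmin(const V4& a, const V4& b) {
    V4 r;
    for (size_t i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}

// Lane-wise maximum, modelling pmaxud.
inline V4 vmax(const V4& a, const V4& b) {
    V4 r;
    for (size_t i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}

// For each lane j, sorts x[0][j], x[1][j], x[2][j], x[3][j] ascending.
inline void sort_step1_vec(V4 x[4]) {
    V4 min01 = vmin(x[0], x[1]);
    V4 max01 = vmax(x[0], x[1]);
    V4 min23 = vmin(x[2], x[3]);
    V4 max23 = vmax(x[2], x[3]);
    x[0] = vmin(min01, min23);  // smallest of the four
    x[3] = vmax(max01, max23);  // largest of the four
    V4 s = vmax(min01, min23);
    V4 t = vmin(max01, max23);
    x[1] = vmin(s, t);          // the two middle values, ordered
    x[2] = vmax(s, t);
}
```

The network uses five compare-exchange steps and no branches, which is why it maps directly onto `pminud` and `pmaxud`.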
  13. transpose of 4x4 matrix
      - use unpcklps and unpckhps

               x0  x1  x2  x3          t0  t1  t2  t3          x0  x1  x2  x3
          +0    3   5   0   8           3   5   8  12           3   2   8   9
          +1    2   7   1   2    =>     0   8   4  13    =>     5   7  12  14
          +2    8  12   4  13           2   7   9  14           0   1   4  20
          +3    9  14  20  15           1   2  20  15           8   2  13  15

          first stage:   t0 = unpcklps(x0, x2)   t1 = unpcklps(x1, x3)
                         t2 = unpckhps(x0, x2)   t3 = unpckhps(x1, x3)
          second stage:  x0 = unpcklps(t0, t1)   x1 = unpckhps(t0, t1)
                         x2 = unpcklps(t2, t3)   x3 = unpckhps(t2, t3)
  14. source of transpose and step1

          void transpose(V128 x[4])
          {
              V128 x0 = x[0];
              V128 x1 = x[1];
              V128 x2 = x[2];
              V128 x3 = x[3];
              V128 t0 = unpcklps(x0, x2);
              V128 t1 = unpcklps(x1, x3);
              V128 t2 = unpckhps(x0, x2);
              V128 t3 = unpckhps(x1, x3);
              x[0] = unpcklps(t0, t1);
              x[1] = unpckhps(t0, t1);
              x[2] = unpcklps(t2, t3);
              x[3] = unpckhps(t2, t3);
          }

          void sort_step1(V128 *va, size_t N)
          {
              for (size_t i = 0; i < N; i += 4) {
                  sort_step1_vec(&va[i]);
                  transpose(&va[i]);
              }
          }
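The two-stage unpack transpose can be traced with a scalar model. Here `V4`, `unpackLo` and `unpackHi` are hypothetical stand-ins for `V128`, `unpcklps` and `unpckhps`, written from the documented behaviour of those instructions (interleave the low or high halves of two vectors); the body of `transpose` follows the slide.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar stand-in for V128 (assumption): four uint32_t lanes.
using V4 = std::array<uint32_t, 4>;

// Models unpcklps: interleave the two low lanes of a and b.
inline V4 unpackLo(const V4& a, const V4& b) { return {a[0], b[0], a[1], b[1]}; }

// Models unpckhps: interleave the two high lanes of a and b.
inline V4 unpackHi(const V4& a, const V4& b) { return {a[2], b[2], a[3], b[3]}; }

// Transposes the 4x4 matrix whose columns are x[0]..x[3].
inline void transpose(V4 x[4]) {
    V4 x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    V4 t0 = unpackLo(x0, x2);  // (x0[0], x2[0], x0[1], x2[1])
    V4 t1 = unpackLo(x1, x3);  // (x1[0], x3[0], x1[1], x3[1])
    V4 t2 = unpackHi(x0, x2);  // (x0[2], x2[2], x0[3], x2[3])
    V4 t3 = unpackHi(x1, x3);  // (x1[2], x3[2], x1[3], x3[3])
    x[0] = unpackLo(t0, t1);   // lane 0 of x0, x1, x2, x3
    x[1] = unpackHi(t0, t1);   // lane 1 of x0, x1, x2, x3
    x[2] = unpackLo(t2, t3);   // lane 2 of x0, x1, x2, x3
    x[3] = unpackHi(t2, t3);   // lane 3 of x0, x1, x2, x3
}
```

With the values from slide 13, x0 = (3, 2, 8, 9) becomes (3, 5, 0, 8), which is lane 0 of the four input vectors.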
  15. SIMD version combsort
      - first half of the code
      - uses
        - vector_cmpswap
        - vector_cmpswap_skew

          bool sort_step2(V128 *va, size_t N)
          {
              size_t gap = nextGap(N);
              while (gap > 1) {
                  for (size_t i = 0; i < N - gap; i++) {
                      vector_cmpswap(va[i], va[i + gap]);
                  }
                  for (size_t i = N - gap; i < N; i++) {
                      vector_cmpswap_skew(va[i], va[i + gap - N]);
                  }
                  gap = nextGap(gap);
              }
              ...
  16. vector_cmpswap
      - no conditional branch
      - the "<" box maps (a, b) to (min(a,b), max(a,b))
      - vectorised form of the scalar compare-and-swap

          if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

          void vector_cmpswap(V128& a, V128& b)
          {
              V128 t = pmaxud(a, b);
              a = pminud(a, b);
              b = t;
          }
  17. vector_cmpswap_skew
      - for the boundary of the array
      - (a, b) = vector_cmpswap_skew(a, b)

                      lane 3        lane 2        lane 1        lane 0
          input   a   a3            a2            a1            a0
                  b   b3            b2            b1            b0
          output  a   a3            min(a2,b3)    min(a1,b2)    min(a0,b1)
                  b   max(a2,b3)    max(a1,b2)    max(a0,b1)    b0
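Both compare-and-swap variants can be modelled lane by lane in scalar code. This is a sketch of the behaviour described on slides 16 and 17, not the SSE implementation: `V4` is a hypothetical stand-in for `V128`, and the skewed version is written directly from the slide's output table. The wrap-around index `i + gap - N` in `sort_step2` suggests that lane j of the vector array continues into lane j + 1, which would explain why a lane of `a` is paired with the next lane of `b`; that reading is my assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for V128 (assumption): four uint32_t lanes.
using V4 = std::array<uint32_t, 4>;

// Lane-wise compare-and-swap: a receives the minima, b the maxima.
inline void vector_cmpswap(V4& a, V4& b) {
    for (size_t i = 0; i < 4; i++) {
        uint32_t lo = std::min(a[i], b[i]);
        uint32_t hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// Skewed compare-and-swap: lane i of a is paired with lane i + 1 of b.
// Lane 3 of a and lane 0 of b have no partner and stay unchanged.
inline void vector_cmpswap_skew(V4& a, V4& b) {
    for (size_t i = 0; i < 3; i++) {
        uint32_t lo = std::min(a[i], b[i + 1]);
        uint32_t hi = std::max(a[i], b[i + 1]);
        a[i] = lo;
        b[i + 1] = hi;
    }
}
```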
  18. isSortedVec
      - check whether the array is sorted
        - ptest_zf(a, b) is true if (a & b) == 0
        - a <= b  <=>  max(a,b) == b  <=>  c := max(a,b) - b == 0
        - pcmpgtd is for int32_t, so we can't use it

          bool isSortedVec(const V128 *va, size_t N)
          {
              for (size_t i = 0; i < N - 1; i++) {
                  V128 a = va[i];
                  V128 b = va[i + 1];
                  V128 c = pmaxud(a, b);
                  c = psubd(c, b);
                  if (!ptest_zf(c, c)) {
                      return false;
                  }
              }
              return true;
          }
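The unsigned comparison trick, a <= b exactly when max(a,b) - b == 0, can be illustrated with a scalar model. `V4` is a hypothetical stand-in for `V128`; the inner loop plays the role of `pmaxud`, `psubd` and `ptest_zf`, and the loop bound is written as `i + 1 < N` (my change) so that N = 0 is safe.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for V128 (assumption): four uint32_t lanes.
using V4 = std::array<uint32_t, 4>;

// True if, in every lane, va[i] <= va[i + 1] for all consecutive vectors.
inline bool isSortedVec(const V4* va, size_t N) {
    for (size_t i = 0; i + 1 < N; i++) {
        const V4& a = va[i];
        const V4& b = va[i + 1];
        for (size_t lane = 0; lane < 4; lane++) {
            // c = max(a, b) - b is zero exactly when a <= b (unsigned).
            uint32_t c = std::max(a[lane], b[lane]) - b[lane];
            if (c != 0) return false;
        }
    }
    return true;
}
```

Using max and subtract avoids a signed comparison, so values with the top bit set are still ordered correctly.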
  19. loop for gap == 1
      - vectorised bubble sort for gap == 1
        - give up if the loop count reaches maxLoop
          - fall back to std::sort
          - this rarely happens

          const int maxLoop = 10;
          for (int loop = 0; loop < maxLoop; loop++) {
              for (size_t i = 0; i < N - 1; i++) {
                  vector_cmpswap(va[i], va[i + 1]);
              }
              vector_cmpswap_skew(va[N - 1], va[0]);
              if (isSortedVec(va, N)) return true;
          }
  20. AA-sort algorithm
      - sort each block: O(n log n)
      - merge sorted blocks: O(n) per merge pass
  21. merge two sorted vectors
      - a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
      - c = [b:a] = merge and sort (a, b)
      - [Figure: [b:a] = vector_merge(a, b) takes the sorted a (a0 a1 a2 a3) and the sorted b (b0 b1 b2 b3) and produces the eight values of c in sorted order.]
  22. data flow of merge
      - [Figure: the sorted inputs a0 a1 a2 a3 and b0 b1 b2 b3 are first compared lane by lane, giving (min00, max00), (min11, max11), (min22, max22), (min33, max33); further compare stages follow.]
  23. source of vector_merge
      - Too complex
        - any good ideas?

          void vector_merge(V128& a, V128& b)
          {
              V128 m = pminud(a, b);
              V128 M = pmaxud(a, b);
              V128 s0 = punpckhqdq(m, m);
              V128 s1 = pminud(s0, M);
              V128 s2 = pmaxud(s0, M);
              V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
              V128 s4 = punpcklqdq(s2, m);
              s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
              V128 s5 = pminud(s3, s4);
              V128 s6 = pmaxud(s3, s4);
              V128 s7 = pinsrd<2>(s5, movd(s6));
              V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
              a = pshufd<PACK(1, 2, 0, 3)>(s7);
              b = pshufd<PACK(3, 2, 0, 1)>(s8);
          }
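A scalar model makes the shuffles in `vector_merge` easier to follow. The sketch below mirrors the slide line by line, with one loud assumption: `PACK(d, c, b, a)` is taken to list the source lanes for destination lanes 3, 2, 1, 0, in the same order as the `_MM_SHUFFLE` macro. The slide does not define `PACK`, so this is my reading; under it the network merges correctly. `V4`, `vmin`, `vmax`, `unpackLo64`, `unpackHi64` and `shuffle` are hypothetical scalar stand-ins.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<uint32_t, 4>;  // scalar stand-in for V128 (assumption)

inline V4 vmin(const V4& a, const V4& b) {  // models pminud
    V4 r;
    for (size_t i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}
inline V4 vmax(const V4& a, const V4& b) {  // models pmaxud
    V4 r;
    for (size_t i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}
// Models punpcklqdq / punpckhqdq: low (or high) 64-bit halves of a and b.
inline V4 unpackLo64(const V4& a, const V4& b) { return {a[0], a[1], b[0], b[1]}; }
inline V4 unpackHi64(const V4& a, const V4& b) { return {a[2], a[3], b[2], b[3]}; }
// Models pshufd<PACK(s3, s2, s1, s0)>: destination lane k takes v[sk] (assumed order).
inline V4 shuffle(const V4& v, size_t s3, size_t s2, size_t s1, size_t s0) {
    return {v[s0], v[s1], v[s2], v[s3]};
}

// a and b must each be sorted ascending from lane 0 to lane 3.
// Afterwards a holds the four smallest values and b the four largest, both sorted.
inline void vector_merge(V4& a, V4& b) {
    V4 m = vmin(a, b);
    V4 M = vmax(a, b);
    V4 s0 = unpackHi64(m, m);
    V4 s1 = vmin(s0, M);
    V4 s2 = vmax(s0, M);
    V4 s3 = unpackLo64(s1, unpackHi64(M, M));
    V4 s4 = unpackLo64(s2, m);
    s4 = shuffle(s4, 2, 1, 0, 3);
    V4 s5 = vmin(s3, s4);
    V4 s6 = vmax(s3, s4);
    V4 s7 = s5; s7[2] = s6[0];  // pinsrd<2>(s5, movd(s6))
    V4 s8 = s6; s8[0] = s5[2];  // pinsrd<0>(s6, pextrd<2>(s5))
    a = shuffle(s7, 1, 2, 0, 3);
    b = shuffle(s8, 3, 2, 0, 1);
}
```

The first min/max stage compares the inputs lane by lane; the remaining stages compare (m2, M0) and (m3, M1), then three neighbouring pairs, and the final shuffles put the results back in lane order.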
  24. std::merge()
      - merge [begin1, end1) and [begin2, end2)

          template <class In1, class In2, class Out>
          Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
          {
              for (;;) {
                  *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
                  if (begin1 == end1) return copy(begin2, end2, out);
                  if (begin2 == end2) return copy(begin1, end1, out);
              }
          }
  25. vectorised merge
      - merge arrays with vector_merge()

          void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
          {
              uint32_t aPos = 0, bPos = 0, outPos = 0;
              V128 vMin = va[aPos++];
              V128 vMax = vb[bPos++];
              for (;;) {
                  vector_merge(vMin, vMax);
                  vo[outPos++] = vMin;
                  if (aPos < aN) {
                      if (bPos < bN) {
                          V128 ta = va[aPos];
                          V128 tb = vb[bPos];
                          // compare ta0 with tb0
                          if (movd(ta) <= movd(tb)) {
                              vMin = ta;
                              aPos++;
                          } else {
                              vMin = tb;
                              bPos++;
                          }
                          …
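The slide's code is cut off before the end-of-input handling, so the following is a hedged sketch of how such a merge loop can be completed, not the author's code. The handling of an exhausted input and the final store of `vMax` are my assumptions. `V4` is a scalar stand-in for `V128`, `vector_merge` is replaced by a simple `std::merge` stand-in, and the function is named `mergeSortedVectors` (hypothetical) to avoid confusion with `std::merge`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using V4 = std::array<uint32_t, 4>;  // scalar stand-in for V128 (assumption)

// Stand-in for vector_merge: a gets the four smallest, b the four largest.
inline void vector_merge(V4& a, V4& b) {
    std::array<uint32_t, 8> c;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    std::copy(c.begin(), c.begin() + 4, a.begin());
    std::copy(c.begin() + 4, c.end(), b.begin());
}

// Merges two sorted vector arrays (aN >= 1, bN >= 1) into vo,
// which must have room for aN + bN vectors.
inline void mergeSortedVectors(V4* vo, const V4* va, size_t aN,
                               const V4* vb, size_t bN) {
    size_t aPos = 0, bPos = 0, outPos = 0;
    V4 vMin = va[aPos++];
    V4 vMax = vb[bPos++];
    for (;;) {
        vector_merge(vMin, vMax);
        vo[outPos++] = vMin;  // the four smallest values are final
        const bool hasA = aPos < aN;
        const bool hasB = bPos < bN;
        if (hasA && (!hasB || va[aPos][0] <= vb[bPos][0])) {
            vMin = va[aPos++];  // a's next vector starts with the smaller value
        } else if (hasB) {
            vMin = vb[bPos++];
        } else {
            vo[outPos++] = vMax;  // both inputs exhausted (assumed tail step)
            return;
        }
    }
}
```

The invariant is that `vMax` always holds the four largest values seen so far, so loading the vector with the smaller first element keeps every stored `vMin` in its final position.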
  26. block size and rate of sort
      - What is a good block size for the vectorised sort?
        - half the size of L2 is recommended for PowerPC 970MP
          - L2 = 1MiB => 512KiB => block size = 128Ki uint32_t elements (512KiB / 4 bytes)
        - BS = 32Ki seems good for Xeon, Core i7
      - [Figure: profile of sort and merge, showing the shares sort(%) and merge(%) on a 0 to 100 scale.]
  27. Benchmark (1/3)
      - AA-sort vs std::sort for random data
        - Xeon X5650 + gcc-4.6.3
        - 4 times faster for # < 64Ki, 2.85 times faster for # = 4Mi
      - [Figure: log-scale plot of clock cycles (1 to 10000000) against # of uint32_t (16, 64, 256, 1Ki, 4Ki, 16Ki, 64Ki, 256Ki, 1Mi, 4Mi) for std::sort and AA-sort.]
  28. Benchmark (2/3)
      - sort 64Ki uint on Xeon + gcc-4.6.3
        - AA-sort speed does not strongly depend on the input pattern
      - [Figure: chart comparing std::sort and AA-sort across input patterns, vertical axis from 0 to 25000.]
  29. Benchmark (3/3)
      - sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11
      - [Figure: chart comparing std::sort(gcc), AA-sort(gcc), std::sort(VC) and AA-sort(VC), vertical axis from 0 to 16000.]
