AA-sort with SSE4.1


              Cybozu Labs
  2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)
Agenda
 Introduction of AA-sort
   classic combsort
   vectorized combsort
   vectorized merge
 benchmark




2012/6/16 #x86opti 4        2 /29
AA-sort
 Aligned-Access sort
   proposed by Hiroshi Inoue et al. in
    "A high-performance sorting algorithm for multicore
    single-instruction multiple-data processors," 2011
      http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
      http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
   For SIMD
     fewer conditional branches, no unaligned data accesses
   For multicore processors
     they implemented it for PowerPC and Cell BE
   O(n log n) complexity
 I tried it on Intel CPUs (not complete)
   https://github.com/herumi/opti/blob/master/intsort.hpp
     the current version uses only one processor
AA-sort
 vectorized combsort for each block (<= L2 cache?)
 vectorized merge of the sorted blocks

                                         input array

          block 0          block 1        block 2        block3   ...

             sort             sort           sort          sort

             <               <               <             <      ...

                           merge                        merge
                       <                            <             ...
                                         merge
                                     <                            ...
AA-sort algorithm
 sort each block
   O(n log n)
 merge sorted blocks
   O(n) per merge pass
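
The two phases can be sketched in scalar C++ as below. This is only an illustration of the block structure, not the author's code: `std::sort` and `std::inplace_merge` stand in for the vectorized combsort and the vectorized merge, and `blockSortMerge` and `blockSize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the AA-sort structure: sort fixed-size blocks,
// then merge neighbouring sorted runs until a single run remains.
void blockSortMerge(std::vector<uint32_t>& a, size_t blockSize) {
  assert(blockSize > 0);
  const size_t n = a.size();
  // phase 1: sort each block independently
  for (size_t pos = 0; pos < n; pos += blockSize) {
    const size_t end = std::min(pos + blockSize, n);
    std::sort(a.begin() + pos, a.begin() + end);
  }
  // phase 2: merge runs of width w into runs of width 2w
  for (size_t w = blockSize; w < n; w *= 2) {
    for (size_t pos = 0; pos + w < n; pos += 2 * w) {
      const size_t mid = pos + w;
      const size_t end = std::min(pos + 2 * w, n);
      std::inplace_merge(a.begin() + pos, a.begin() + mid, a.begin() + end);
    }
  }
}
```

Each pass of phase 2 touches every element once, and the number of passes grows with the logarithm of the number of blocks.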




classic combsort(1/2)
 improved bubble sort
   unstable
   about O(n log n) on average (O(n^2) in the worst case)
   compares two elements separated by a gap (>= 1)
     the gap is divided by the shrink factor (about 1.3) after each pass
     size_t nextGap(size_t N) { return (N * 10) / 13; }

     void combsort(uint32_t *a, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
         }
         gap = nextGap(gap);
       }
       …
classic combsort(2/2)
 gap = 1 means bubble sort
   loop until the array is fully sorted

           …
           for (;;) {
             bool isSwapped = false;
             // i + 1 < N avoids size_t wrap-around of N - 1 when N == 0
             for (size_t i = 0; i + 1 < N; i++) {
               if (a[i] > a[i + 1]) {
                 std::swap(a[i], a[i + 1]);
                 isSwapped = true;
               }
             }
             if (!isSwapped) return;
           }
       }
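
Joining the two halves of the classic combsort gives the self-contained sketch below. It keeps the slide's `nextGap` and `combsort` names. The early return for N < 2 and the `i + gap < N` / `i + 1 < N` loop bounds are additions that keep `size_t` arithmetic from wrapping around.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

size_t nextGap(size_t n) { return (n * 10) / 13; }

void combsort(uint32_t* a, size_t N) {
  if (N < 2) return;  // nothing to sort
  // phase 1: passes with a shrinking gap
  for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
    for (size_t i = 0; i + gap < N; i++) {
      if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
    }
  }
  // phase 2: gap == 1, i.e. bubble sort until a pass makes no swap
  for (;;) {
    bool isSwapped = false;
    for (size_t i = 0; i + 1 < N; i++) {
      if (a[i] > a[i + 1]) {
        std::swap(a[i], a[i + 1]);
        isSwapped = true;
      }
    }
    if (!isSwapped) return;
  }
}
```

Calling `combsort(v.data(), v.size())` on a `std::vector<uint32_t>` sorts it in place.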


gap function
 Combsort11
   ending the gap sequence with [11, 8, 6, 4, 3, 2, 1] seems to work well,
    according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm


      size_t nextGap(size_t n) {
          n = (n * 10) / 13;
          if (n == 9 || n == 10) return 11; // (*)
          return n;
      }



   slightly faster when line (*) is added
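
To see what line (*) does, the sketch below lists the gaps produced for a given array size. `nextGap11` is the slide's function under a different name, and `gapSequence` is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

size_t nextGap11(size_t n) {
  n = (n * 10) / 13;
  if (n == 9 || n == 10) return 11;  // (*)
  return n;
}

// Collect the gaps used for an array of N elements, down to and including 1.
std::vector<size_t> gapSequence(size_t N) {
  std::vector<size_t> gaps;
  for (size_t gap = nextGap11(N); gap >= 1; gap = nextGap11(gap)) {
    gaps.push_back(gap);
    if (gap == 1) break;
  }
  return gaps;
}
```

For N = 100 this gives 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1: the gap 10 that would follow 14 is replaced by 11, so the sequence ends with the pattern named on the slide.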



vectorized combsort
 step1 : sort values within each vector(32bitx4)
 step2 : SIMD version combsort
 step3 : reorder data
       6       8        9    3    5      7       12    14    0    4        1        20     11    ...

                                 step1
                                                      sort                                sort
  +0       3       5        0     …          …                        0         1          3       …   101
  +1       9       7        1     …          …                    102          104        105      …   380
  +2       6       12       4     …          …                    389          391        392      …   502
  +3       8       14       20    …          …
                                                        step2
                                                                  511          515        612      …   973
        v0         v1       v2    v3
                                                                                    step3

       0       1        3    …   101   102   104       105   …   380      389       391   392    …
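
Step 3 can be modelled in scalar code as below. This is a sketch based on my reading of the figure, in which lane j of the vectors holds the j-th quarter of the sorted sequence after step 2. `Vec4` is a hypothetical scalar stand-in for a 128-bit vector and `reorder` is a made-up name. The real implementation may reorder differently, for example in place.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for a 32bit x 4 vector

// After step 2, lane j of va[0..N-1] holds the j-th quarter of the sorted
// sequence. Step 3 copies the lanes out one after another.
std::vector<uint32_t> reorder(const std::vector<Vec4>& va) {
  const size_t N = va.size();
  std::vector<uint32_t> out(4 * N);
  for (size_t lane = 0; lane < 4; lane++) {
    for (size_t i = 0; i < N; i++) {
      out[lane * N + i] = va[i][lane];
    }
  }
  return out;
}
```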

step1
 step1.1 : sort [v[i][j] | i <- [0..3]] for j = 0, 1, 2, 3
 step1.2 : transpose

        input             after step1.1          after step1.2
                        (each row sorted)         (transposed)
   3    5    0    8      0    3    5    8       0    1    4    9
   2    7    1    2      1    2    2    7       3    2    8   14
   8   12    4   13      4    8   12   13       5    2   12   15
   9   14   20   15      9   14   15   20       8    7   13   20
  v0   v1   v2   v3     v0   v1   v2   v3      v0   v1   v2   v3

sort of 4 items
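 use pmaxud, pminud for uint32_t x 4
   compare-exchange: (a, b) -> (min(a,b), max(a,b))

   layer 1: (v0, v1) -> min01, max01
            (v2, v3) -> min23, max23
   layer 2: min0123 = min(min01, min23)
            max0123 = max(max01, max23)
            s = max(min01, min23)
            t = min(max01, max23)
   layer 3: (s, t) -> min(s,t), max(s,t)

   sorted: min0123, min(s,t), max(s,t), max0123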

source of step1.1
 V128 is a type of 32-bit integer x 4
   pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

                 void sort_step1_vec(V128 x[4])
                 {
                     V128 min01 = pminud(x[0], x[1]);
                     V128 max01 = pmaxud(x[0], x[1]);
                     V128 min23 = pminud(x[2], x[3]);
                     V128 max23 = pmaxud(x[2], x[3]);
                     x[0] = pminud(min01, min23);
                     x[3] = pmaxud(max01, max23);
                     V128 s = pmaxud(min01, min23);
                     V128 t = pminud(max01, max23);
                     x[1] = pminud(s, t);
                     x[2] = pmaxud(s, t);
                 }
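
A scalar model makes the network easy to check without SSE4.1 hardware. In the sketch below, `Vec4` is a hypothetical stand-in for `V128`, and `pminud` / `pmaxud` are modelled lane by lane. With intrinsics they would correspond to `_mm_min_epu32` / `_mm_max_epu32`. The body of `sort_step1_vec` is the one from the slide.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// lane-wise unsigned minimum, modelling pminud
Vec4 pminud(const Vec4& a, const Vec4& b) {
  Vec4 r;
  for (int i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
  return r;
}

// lane-wise unsigned maximum, modelling pmaxud
Vec4 pmaxud(const Vec4& a, const Vec4& b) {
  Vec4 r;
  for (int i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
  return r;
}

// For each lane j, sorts x[0][j], x[1][j], x[2][j], x[3][j] in ascending order.
void sort_step1_vec(Vec4 x[4]) {
  Vec4 min01 = pminud(x[0], x[1]);
  Vec4 max01 = pmaxud(x[0], x[1]);
  Vec4 min23 = pminud(x[2], x[3]);
  Vec4 max23 = pmaxud(x[2], x[3]);
  x[0] = pminud(min01, min23);
  x[3] = pmaxud(max01, max23);
  Vec4 s = pmaxud(min01, min23);
  Vec4 t = pminud(max01, max23);
  x[1] = pminud(s, t);
  x[2] = pmaxud(s, t);
}
```

Applied to the 4x4 example from the step1 slide, v0 = (3, 2, 8, 9), v1 = (5, 7, 12, 14), v2 = (0, 1, 4, 20), v3 = (8, 2, 13, 15), this gives v0 = (0, 1, 4, 9), v1 = (3, 2, 8, 14), v2 = (5, 2, 12, 15), v3 = (8, 7, 13, 20).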


transpose of 4x4 matrix
 use unpcklps and unpckhps

       x0   x1   x2   x3                          t0   t1   t2   t3
  +0    3    5    0    8   t0=unpcklps(x0,x2)      3    5    8   12
  +1    2    7    1    2   t1=unpcklps(x1,x3)      0    8    4   13
  +2    8   12    4   13   t2=unpckhps(x0,x2)      2    7    9   14
  +3    9   14   20   15   t3=unpckhps(x1,x3)      1    2   20   15

       t0   t1   t2   t3                          x0   x1   x2   x3
        3    5    8   12   x0=unpcklps(t0,t1)      3    2    8    9
        0    8    4   13   x1=unpckhps(t0,t1)      5    7   12   14
        2    7    9   14   x2=unpcklps(t2,t3)      0    1    4   20
        1    2   20   15   x3=unpckhps(t2,t3)      8    2   13   15

source of transpose and step1
  // transpose the 4x4 matrix held in x[0..3]
  void transpose(V128 x[4])
  {
    V128 x0 = x[0];
    V128 x1 = x[1];
    V128 x2 = x[2];
    V128 x3 = x[3];
    V128 t0 = unpcklps(x0, x2);
    V128 t1 = unpcklps(x1, x3);
    V128 t2 = unpckhps(x0, x2);
    V128 t3 = unpckhps(x1, x3);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
  }

  // N is the number of vectors; the loop assumes it is a multiple of 4
  void sort_step1(V128 *va, size_t N)
  {
    for (size_t i = 0; i < N; i += 4) {
      sort_step1_vec(&va[i]);
      transpose(&va[i]);
    }
  }
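
The unpack-based transpose can be checked with a scalar model. In the sketch below, `Vec4` is a hypothetical stand-in for `V128`, `unpcklps(a, b)` is modelled as (a0, b0, a1, b1) and `unpckhps(a, b)` as (a2, b2, a3, b3). With intrinsics these would be `_mm_unpacklo_ps` and `_mm_unpackhi_ps`, plus casts between integer and float vector types.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// interleave the low halves: (a0, b0, a1, b1)
Vec4 unpcklps(const Vec4& a, const Vec4& b) {
  return Vec4{{a[0], b[0], a[1], b[1]}};
}

// interleave the high halves: (a2, b2, a3, b3)
Vec4 unpckhps(const Vec4& a, const Vec4& b) {
  return Vec4{{a[2], b[2], a[3], b[3]}};
}

// transpose the 4x4 matrix whose columns are x[0..3]
void transpose(Vec4 x[4]) {
  Vec4 t0 = unpcklps(x[0], x[2]);
  Vec4 t1 = unpcklps(x[1], x[3]);
  Vec4 t2 = unpckhps(x[0], x[2]);
  Vec4 t3 = unpckhps(x[1], x[3]);
  x[0] = unpcklps(t0, t1);
  x[1] = unpckhps(t0, t1);
  x[2] = unpcklps(t2, t3);
  x[3] = unpckhps(t2, t3);
}
```

With the matrix from the transpose slide, x0 = (3, 2, 8, 9) becomes (3, 5, 0, 8), which is the former row +0.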



SIMD version combsort
 the first half of the code uses
   vector_cmpswap
   vector_cmpswap_skew
     bool sort_step2(V128 *va, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           vector_cmpswap(va[i], va[i + gap]);
         }
         for (size_t i = N - gap; i < N; i++) {
           vector_cmpswap_skew(va[i], va[i + gap - N]);
         }
         gap = nextGap(gap);
       }
       ...


vector_cmpswap
 no conditional branch
           a              b

                  <

       min(a,b)        max(a,b)


     if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

                                  vectorised

     void vector_cmpswap(V128& a, V128& b)
     {
       V128 t = pmaxud(a, b);
       a = pminud(a, b);
       b = t;
     }

vector_cmpswap_skew
 for boundary of array

       a               a3      a2           a1           a0



       b               b3      b2           b1           b0


                                           (a',b') = vector_cmpswap_skew(a,b)

       a'              a3   min(a2,b3)   min(a1,b2)   min(a0,b1)



       b'        max(a2,b3) max(a1,b2) max(a0,b1)        b0
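
The lane pairing in the figure can be written out as a scalar model. This is a sketch of the semantics only: `Vec4` is a hypothetical stand-in for `V128`, and the real code would use shuffles or shifts together with `pminud` / `pmaxud` rather than a loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// Compare lane i of a with lane i + 1 of b (i = 0, 1, 2).
// a[3] and b[0] have no partner and are left untouched.
void vector_cmpswap_skew(Vec4& a, Vec4& b) {
  for (int i = 0; i < 3; i++) {
    const uint32_t lo = std::min(a[i], b[i + 1]);
    const uint32_t hi = std::max(a[i], b[i + 1]);
    a[i] = lo;
    b[i + 1] = hi;
  }
}
```

For a = (9, 20, 30, 40) and b = (5, 7, 35, 25), written from lane 0 to lane 3, the result is a' = (7, 20, 25, 40) and b' = (5, 9, 35, 30).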




isSortedVec
 check whether array is sorted
   ptest_zf(a, b) is true if (a & b) == 0
   a <= b <=> max(a,b) == b <=> c := max(a,b) - b == 0
   pcmpgtd compares signed int32_t, so it cannot be used directly for uint32_t
          bool isSortedVec(const V128 *va, size_t N) {
            // i + 1 < N avoids size_t wrap-around of N - 1 when N == 0
            for (size_t i = 0; i + 1 < N; i++) {
              V128 a = va[i];
              V128 b = va[i + 1];
              V128 c = pmaxud(a, b);
              c = psubd(c, b);
              if (!ptest_zf(c, c)) {
                return false;
              }
            }
            return true;
          }
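
The reason for the max-and-subtract test is easiest to see on single values. The sketch below compares it with what a signed comparison would conclude. Both function names are hypothetical, and the cast to `int32_t` assumes the usual two's-complement wrap-around (guaranteed only from C++20 on).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// a <= b  <=>  max(a, b) == b  <=>  max(a, b) - b == 0   (unsigned values)
bool lessEqByMaxSub(uint32_t a, uint32_t b) {
  const uint32_t c = std::max(a, b) - b;
  return c == 0;
}

// What a signed 32-bit comparison (as done by pcmpgtd) would conclude.
bool lessEqSigned(uint32_t a, uint32_t b) {
  return static_cast<int32_t>(a) <= static_cast<int32_t>(b);
}
```

For a = 1 and b = 0x80000000 the unsigned test correctly reports a <= b, while the signed comparison sees b as a negative number and reports the opposite.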
loop for gap == 1
 vectorised bubble sort for gap == 1
   give up if the loop count reaches maxLoop
     fall back to std::sort
         this rarely happens
            const int maxLoop = 10;
            for (int loop = 0; loop < maxLoop; loop++) {
              for (size_t i = 0; i + 1 < N; i++) {
                vector_cmpswap(va[i], va[i + 1]);
              }
              vector_cmpswap_skew(va[N - 1], va[0]);
              if (isSortedVec(va, N)) return true;
            }
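
The pieces of the in-block sort can be put together in a scalar model, shown below as a sketch rather than the author's implementation. `Vec4` stands in for `V128`. `cmpswap`, `cmpswapSkew`, `sortStep2` and `sortViaVectors` are hypothetical names. Step 1 is replaced by `std::sort` on each 4-element vector, and the final copy assumes that lane j of the vectors holds the j-th quarter of the sorted data after step 2.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

size_t nextGap(size_t n) {
  n = (n * 10) / 13;
  if (n == 9 || n == 10) return 11;
  return n;
}

void cmpswap(Vec4& a, Vec4& b) {  // lane i of a against lane i of b
  for (int i = 0; i < 4; i++)
    if (a[i] > b[i]) std::swap(a[i], b[i]);
}

void cmpswapSkew(Vec4& a, Vec4& b) {  // lane i of a against lane i + 1 of b
  for (int i = 0; i < 3; i++)
    if (a[i] > b[i + 1]) std::swap(a[i], b[i + 1]);
}

bool isSortedVec(const std::vector<Vec4>& va) {
  for (size_t i = 0; i + 1 < va.size(); i++)
    for (int j = 0; j < 4; j++)
      if (va[i][j] > va[i + 1][j]) return false;
  return true;
}

// Returns true if va ended up sorted in transposed order; false means give up.
bool sortStep2(std::vector<Vec4>& va, int maxLoop = 10) {
  const size_t N = va.size();
  if (N < 2) return false;  // keep the sketch simple: the caller falls back
  for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
    for (size_t i = 0; i + gap < N; i++) cmpswap(va[i], va[i + gap]);
    for (size_t i = N - gap; i < N; i++) cmpswapSkew(va[i], va[i + gap - N]);
  }
  for (int loop = 0; loop < maxLoop; loop++) {
    for (size_t i = 0; i + 1 < N; i++) cmpswap(va[i], va[i + 1]);
    cmpswapSkew(va[N - 1], va[0]);
    if (isSortedVec(va)) return true;
  }
  return false;
}

// in.size() is assumed to be a multiple of 4
std::vector<uint32_t> sortViaVectors(const std::vector<uint32_t>& in) {
  const size_t N = in.size() / 4;
  std::vector<Vec4> va(N);
  for (size_t i = 0; i < N; i++) {
    for (int j = 0; j < 4; j++) va[i][j] = in[4 * i + j];
    std::sort(va[i].begin(), va[i].end());  // stand-in for step 1
  }
  std::vector<uint32_t> out(in);
  if (sortStep2(va)) {
    for (size_t j = 0; j < 4; j++)  // step 3: read the lanes out in order
      for (size_t i = 0; i < N; i++) out[j * N + i] = va[i][j];
  } else {
    std::sort(out.begin(), out.end());  // fallback, as on the slide
  }
  return out;
}
```

The skewed compare at the end of each gap == 1 pass orders the last vector against the first one, so a successful `isSortedVec` check right after it means the whole transposed sequence is sorted.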




AA-sort algorithm
 sort each block
   O(n log n)
 merge sorted blocks
   O(n) per merge pass




merge two sorted vectors
   a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
   c = [b:a] = merge and sort (a, b)
                                                      sorted

                   a        a0   a1     a2       a3

                                                       sorted
                   b        b0   b1     b2       b3


                                      [b:a] = vector_merge(a,b)


        c0             c1   c2   c3     c0       c1       c2      c3

                                                                  sorted



data flow of merge
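   [figure: merge network for a0..a3 (sorted) and b0..b3 (sorted).
    The first layer has 4 compare-exchanges and produces
    min00, max00, min11, max11, min22, max22, min33, max33;
    it is followed by a layer of 2 and a layer of 3 compare-exchanges.]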

source of vector_merge
 Too complex
   any better idea?

 void vector_merge(V128& a, V128& b) {
   V128 m = pminud(a, b);
   V128 M = pmaxud(a, b);
   V128 s0 = punpckhqdq(m, m);
   V128 s1 = pminud(s0, M);
   V128 s2 = pmaxud(s0, M);
   V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
   V128 s4 = punpcklqdq(s2, m);
   s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
   V128 s5 = pminud(s3, s4);
   V128 s6 = pmaxud(s3, s4);
   V128 s7 = pinsrd<2>(s5, movd(s6));
   V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
   a = pshufd<PACK(1, 2, 0, 3)>(s7);
   b = pshufd<PACK(3, 2, 0, 1)>(s8);
 }
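
What `vector_merge` has to compute can be stated as a scalar model. The sketch below uses Batcher's odd-even merge network for two sorted 4-element inputs, which has layers of 4, 2 and 3 compare-exchanges. Whether the shuffle sequence above realizes exactly these comparators is an assumption. `Vec4` and `cmpx` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

// compare-exchange: afterwards lo <= hi
void cmpx(uint32_t& lo, uint32_t& hi) {
  if (lo > hi) std::swap(lo, hi);
}

// a and b must each be sorted in ascending lane order.
// Afterwards a holds the 4 smallest and b the 4 largest values, both sorted.
void vector_merge(Vec4& a, Vec4& b) {
  // layer 1: lane-wise min / max (pminud / pmaxud in the SIMD version)
  for (int i = 0; i < 4; i++) cmpx(a[i], b[i]);
  // layer 2
  cmpx(a[2], b[0]);
  cmpx(a[3], b[1]);
  // layer 3
  cmpx(a[1], a[2]);
  cmpx(a[3], b[0]);
  cmpx(b[1], b[2]);
}
```

For a = (1, 4, 6, 9) and b = (2, 3, 7, 8) the result is a = (1, 2, 3, 4) and b = (6, 7, 8, 9).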
std::merge()
 merge [begin1, end1) and [begin2, end2)
 // simplified: both ranges are assumed to be non-empty
 template <class In1, class In2, class Out>
 Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
 {
   for (;;) {
     *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
     if (begin1 == end1) return std::copy(begin2, end2, out);
     if (begin2 == end2) return std::copy(begin1, end1, out);
   }
 }




vectorised merge
 merge arrays with vector_merge()
 void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN){
   uint32_t aPos = 0, bPos = 0, outPos = 0;
   V128 vMin = va[aPos++];
   V128 vMax = vb[bPos++];
   for (;;) {
     vector_merge(vMin, vMax);
     vo[outPos++] = vMin;
     if (aPos < aN) {
       if (bPos < bN) {
         V128 ta = va[aPos];
         V128 tb = vb[bPos];
         // compare ta0 with tb0
         if (movd(ta) <= movd(tb)) {
           vMin = ta;
           aPos++;
         } else {
           vMin = tb;
           bPos++;
         }
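
The code on the slide is cut off. The scalar model below shows one common way to complete such a merge loop. It is a sketch and not the author's code: `Vec4`, `cmpx` and `mergeVec` are hypothetical names, `vector_merge` is modelled with an odd-even merge network, and both inputs are assumed to be non-empty and sorted in linear order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // scalar stand-in for V128

void cmpx(uint32_t& lo, uint32_t& hi) {
  if (lo > hi) std::swap(lo, hi);
}

// a, b sorted on entry; afterwards a = 4 smallest, b = 4 largest, both sorted
void vector_merge(Vec4& a, Vec4& b) {
  for (int i = 0; i < 4; i++) cmpx(a[i], b[i]);
  cmpx(a[2], b[0]);
  cmpx(a[3], b[1]);
  cmpx(a[1], a[2]);
  cmpx(a[3], b[0]);
  cmpx(b[1], b[2]);
}

std::vector<Vec4> mergeVec(const std::vector<Vec4>& va, const std::vector<Vec4>& vb) {
  std::vector<Vec4> vo;
  size_t aPos = 0, bPos = 0;
  Vec4 vMin = va[aPos++];
  Vec4 vMax = vb[bPos++];
  for (;;) {
    vector_merge(vMin, vMax);  // vMin: 4 smallest, vMax: 4 largest
    vo.push_back(vMin);
    if (aPos < va.size() && (bPos == vb.size() || va[aPos][0] <= vb[bPos][0])) {
      vMin = va[aPos++];  // next vector comes from a
    } else if (bPos < vb.size()) {
      vMin = vb[bPos++];  // next vector comes from b
    } else {
      vo.push_back(vMax);  // both inputs consumed: flush the remaining vector
      return vo;
    }
  }
}
```

The only data-dependent branch per output vector is the comparison of the two head elements, which is what keeps the number of conditional branches low compared with an element-by-element merge.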

block size and rate of sort
 What is a good block size for the vectorised sort?
   half the L2 size is recommended for PowerPC 970MP
     L2 = 1MiB => 512KiB => block size = 128Ki elements of uint32_t
 BS = 32Ki seems good for Xeon and Core i7
 profile of sort and merge
   [chart: sort(%) vs merge(%), y-axis 0 to 100]
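
The block-size arithmetic for the PowerPC 970MP case can be written down as a compile-time check. `blockElems` is a hypothetical helper, not part of intsort.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t KiB = 1024;
constexpr size_t MiB = 1024 * KiB;

// number of elements of type T that fit in half of an L2 cache of l2Bytes
template <class T>
constexpr size_t blockElems(size_t l2Bytes) {
  return (l2Bytes / 2) / sizeof(T);
}

// 1 MiB of L2 -> 512 KiB -> 128Ki elements of uint32_t
static_assert(blockElems<uint32_t>(1 * MiB) == 128 * KiB, "128Ki elements");
```

The value BS = 32Ki reported for Xeon and Core i7 was found by measurement, so it does not follow from this formula.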




Benchmark(1/3)
 AA-sort vs std::sort for random data
   Xeon X5650 + gcc-4.6.3
      4 times faster for # < 64Ki, 2.85 times faster for # = 4Mi
   [chart: clock cycle (log scale, 1 to 10000000) against # of uint32_t (16 to 4Mi) for std::sort and AA-sort]


Benchmark(2/3)
 sort 64Ki uint on Xeon + gcc-4.6.3
   AA-sort speed does not depend strongly on the input pattern
   [chart: std::sort vs AA-sort, y-axis 0 to 25000]




Benchmark(3/3)
 sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11
   [chart: std::sort(gcc), AA-sort(gcc), std::sort(VC), AA-sort(VC), y-axis 0 to 16000]





More Related Content

What's hot

深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言Simen Li
 
Exercices corriges application_lineaire_et_determinants
Exercices corriges application_lineaire_et_determinantsExercices corriges application_lineaire_et_determinants
Exercices corriges application_lineaire_et_determinantssarah Benmerzouk
 
Insertion sort analysis
Insertion sort analysisInsertion sort analysis
Insertion sort analysisKumar
 
Binary Heap Tree, Data Structure
Binary Heap Tree, Data Structure Binary Heap Tree, Data Structure
Binary Heap Tree, Data Structure Anand Ingle
 
用十分鐘 向jserv學習作業系統設計
用十分鐘  向jserv學習作業系統設計用十分鐘  向jserv學習作業系統設計
用十分鐘 向jserv學習作業系統設計鍾誠 陳鍾誠
 
[系列活動] 手把手打開Python資料分析大門
[系列活動] 手把手打開Python資料分析大門[系列活動] 手把手打開Python資料分析大門
[系列活動] 手把手打開Python資料分析大門台灣資料科學年會
 
Exercices pascal fenni_2018
Exercices pascal fenni_2018Exercices pascal fenni_2018
Exercices pascal fenni_2018salah fenni
 
指数時間アルゴリズム入門
指数時間アルゴリズム入門指数時間アルゴリズム入門
指数時間アルゴリズム入門Yoichi Iwata
 
Calculating critical values of t distributions using tables of percentage points
Calculating critical values of t distributions using tables of percentage pointsCalculating critical values of t distributions using tables of percentage points
Calculating critical values of t distributions using tables of percentage pointsmodelos-econometricos
 
Exercices avec les solutions d'analyse complexe
Exercices avec les solutions d'analyse complexeExercices avec les solutions d'analyse complexe
Exercices avec les solutions d'analyse complexeKamel Djeddi
 
how to calclute time complexity of algortihm
how to calclute time complexity of algortihmhow to calclute time complexity of algortihm
how to calclute time complexity of algortihmSajid Marwat
 
06.第六章用Matlab计算二重积分
06.第六章用Matlab计算二重积分06.第六章用Matlab计算二重积分
06.第六章用Matlab计算二重积分Xin Zheng
 
Amortized Analysis
Amortized Analysis Amortized Analysis
Amortized Analysis sathish sak
 
Automata theory - NFA to DFA Conversion
Automata theory - NFA to DFA ConversionAutomata theory - NFA to DFA Conversion
Automata theory - NFA to DFA ConversionAkila Krishnamoorthy
 
Binary search tree deletion
Binary search tree deletionBinary search tree deletion
Binary search tree deletionKousalya M
 

What's hot (20)

深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言
 
Exercices corriges application_lineaire_et_determinants
Exercices corriges application_lineaire_et_determinantsExercices corriges application_lineaire_et_determinants
Exercices corriges application_lineaire_et_determinants
 
Insertion sort analysis
Insertion sort analysisInsertion sort analysis
Insertion sort analysis
 
Binary Heap Tree, Data Structure
Binary Heap Tree, Data Structure Binary Heap Tree, Data Structure
Binary Heap Tree, Data Structure
 
用十分鐘 向jserv學習作業系統設計
用十分鐘  向jserv學習作業系統設計用十分鐘  向jserv學習作業系統設計
用十分鐘 向jserv學習作業系統設計
 
Valgrind
ValgrindValgrind
Valgrind
 
Lec9
Lec9Lec9
Lec9
 
[系列活動] 手把手打開Python資料分析大門
[系列活動] 手把手打開Python資料分析大門[系列活動] 手把手打開Python資料分析大門
[系列活動] 手把手打開Python資料分析大門
 
Exercices pascal fenni_2018
Exercices pascal fenni_2018Exercices pascal fenni_2018
Exercices pascal fenni_2018
 
04 cours matrices_suites
04 cours matrices_suites04 cours matrices_suites
04 cours matrices_suites
 
指数時間アルゴリズム入門
指数時間アルゴリズム入門指数時間アルゴリズム入門
指数時間アルゴリズム入門
 
Calculating critical values of t distributions using tables of percentage points
Calculating critical values of t distributions using tables of percentage pointsCalculating critical values of t distributions using tables of percentage points
Calculating critical values of t distributions using tables of percentage points
 
Exercices avec les solutions d'analyse complexe
Exercices avec les solutions d'analyse complexeExercices avec les solutions d'analyse complexe
Exercices avec les solutions d'analyse complexe
 
Hashing
HashingHashing
Hashing
 
how to calclute time complexity of algortihm
how to calclute time complexity of algortihmhow to calclute time complexity of algortihm
how to calclute time complexity of algortihm
 
Shell sort
Shell sortShell sort
Shell sort
 
06.第六章用Matlab计算二重积分
06.第六章用Matlab计算二重积分06.第六章用Matlab计算二重积分
06.第六章用Matlab计算二重积分
 
Amortized Analysis
Amortized Analysis Amortized Analysis
Amortized Analysis
 
Automata theory - NFA to DFA Conversion
Automata theory - NFA to DFA ConversionAutomata theory - NFA to DFA Conversion
Automata theory - NFA to DFA Conversion
 
Binary search tree deletion
Binary search tree deletionBinary search tree deletion
Binary search tree deletion
 

Similar to AA-sort with SSE4.1

Mongodb debugging-performance-problems
Mongodb debugging-performance-problemsMongodb debugging-performance-problems
Mongodb debugging-performance-problemsMongoDB
 
Idea for ineractive programming language
Idea for ineractive programming languageIdea for ineractive programming language
Idea for ineractive programming languageLincoln Hannah
 
R graphics260809
R graphics260809R graphics260809
R graphics260809lizbethfdz
 
Matrix by suman sir
Matrix by suman sirMatrix by suman sir
Matrix by suman sirsumandandal
 
Compact and safely: static DSL on Kotlin
Compact and safely: static DSL on KotlinCompact and safely: static DSL on Kotlin
Compact and safely: static DSL on KotlinDmitry Pranchuk
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlabkrishna_093
 
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...Soumen Santra
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for SpeedYung-Yu Chen
 
Maximizing performance of 3 d user generated assets in unity
Maximizing performance of 3 d user generated assets in unityMaximizing performance of 3 d user generated assets in unity
Maximizing performance of 3 d user generated assets in unityWithTheBest
 
Incremental statistics for partitioned tables in 11g by wwf from ebay COC
Incremental statistics for partitioned tables in 11g  by wwf from ebay COCIncremental statistics for partitioned tables in 11g  by wwf from ebay COC
Incremental statistics for partitioned tables in 11g by wwf from ebay COCLouis liu
 
Current Score – 0 Due Wednesday, November 19 2014 0400 .docx
Current Score  –  0 Due  Wednesday, November 19 2014 0400 .docxCurrent Score  –  0 Due  Wednesday, November 19 2014 0400 .docx
Current Score – 0 Due Wednesday, November 19 2014 0400 .docxfaithxdunce63732
 
MATLAB Questions and Answers.pdf
MATLAB Questions and Answers.pdfMATLAB Questions and Answers.pdf
MATLAB Questions and Answers.pdfahmed8651
 
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)Austin Benson
 
Reducing the time of heuristic algorithms for the Symmetric TSP
Reducing the time of heuristic algorithms for the Symmetric TSPReducing the time of heuristic algorithms for the Symmetric TSP
Reducing the time of heuristic algorithms for the Symmetric TSPgpolo
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentationManchireddy Reddy
 

Similar to AA-sort with SSE4.1 (20)

Mongodb debugging-performance-problems
Mongodb debugging-performance-problemsMongodb debugging-performance-problems
Mongodb debugging-performance-problems
 
Learn Matlab
Learn MatlabLearn Matlab
Learn Matlab
 
Idea for ineractive programming language
Idea for ineractive programming languageIdea for ineractive programming language
Idea for ineractive programming language
 
R graphics260809
R graphics260809R graphics260809
R graphics260809
 
Matrix by suman sir
Matrix by suman sirMatrix by suman sir
Matrix by suman sir
 
Compact and safely: static DSL on Kotlin
Compact and safely: static DSL on KotlinCompact and safely: static DSL on Kotlin
Compact and safely: static DSL on Kotlin
 
Time complexity
Time complexityTime complexity
Time complexity
 
Doc 20180130-wa0006
Doc 20180130-wa0006Doc 20180130-wa0006
Doc 20180130-wa0006
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...
Traveling salesman problem: Game Scheduling Problem Solution: Ant Colony Opti...
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
 
Es272 ch1
Es272 ch1Es272 ch1
Es272 ch1
 
Maximizing performance of 3 d user generated assets in unity
Maximizing performance of 3 d user generated assets in unityMaximizing performance of 3 d user generated assets in unity
Maximizing performance of 3 d user generated assets in unity
 
Incremental statistics for partitioned tables in 11g by wwf from ebay COC
Incremental statistics for partitioned tables in 11g  by wwf from ebay COCIncremental statistics for partitioned tables in 11g  by wwf from ebay COC
Incremental statistics for partitioned tables in 11g by wwf from ebay COC
 
Current Score – 0 Due Wednesday, November 19 2014 0400 .docx
Current Score  –  0 Due  Wednesday, November 19 2014 0400 .docxCurrent Score  –  0 Due  Wednesday, November 19 2014 0400 .docx
Current Score – 0 Due Wednesday, November 19 2014 0400 .docx
 
MATLAB Questions and Answers.pdf
MATLAB Questions and Answers.pdfMATLAB Questions and Answers.pdf
MATLAB Questions and Answers.pdf
 
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
 
Seeing Like Software
Seeing Like SoftwareSeeing Like Software
Seeing Like Software
 
Reducing the time of heuristic algorithms for the Symmetric TSP
Reducing the time of heuristic algorithms for the Symmetric TSPReducing the time of heuristic algorithms for the Symmetric TSP
Reducing the time of heuristic algorithms for the Symmetric TSP
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentation
 

More from MITSUNARI Shigeo

暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学MITSUNARI Shigeo
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコルMITSUNARI Shigeo
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advancedMITSUNARI Shigeo
 
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenMITSUNARI Shigeo
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装MITSUNARI Shigeo
 
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化MITSUNARI Shigeo
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用MITSUNARI Shigeo
 

More from MITSUNARI Shigeo (20)

暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advanced
 
暗認本読書会12
暗認本読書会12暗認本読書会12
暗認本読書会12
 
暗認本読書会11
暗認本読書会11暗認本読書会11
暗認本読書会11
 
暗認本読書会10
暗認本読書会10暗認本読書会10
暗認本読書会10
 
暗認本読書会9
暗認本読書会9暗認本読書会9
暗認本読書会9
 
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
 
暗認本読書会8
暗認本読書会8暗認本読書会8
暗認本読書会8
 
暗認本読書会7
暗認本読書会7暗認本読書会7
暗認本読書会7
 
暗認本読書会6
暗認本読書会6暗認本読書会6
暗認本読書会6
 
暗認本読書会5
暗認本読書会5暗認本読書会5
暗認本読書会5
 
暗認本読書会4
暗認本読書会4暗認本読書会4
暗認本読書会4
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
私とOSSの25年
私とOSSの25年私とOSSの25年
私とOSSの25年
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装
 
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
 
楕円曲線と暗号
楕円曲線と暗号楕円曲線と暗号
楕円曲線と暗号
 
HPC Phys-20201203
HPC Phys-20201203HPC Phys-20201203
HPC Phys-20201203
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用
 

AA-sort with SSE4.1

  • 1. AA-sort with SSE4.1 Cybozu Labs 2012/6/16 MITSUNARI Shigeo(@herumi) x86/x64 optimization seminar 4(#x86opti)
  • 2. Agenda  Introduction of AA-sort  classic combsort  vectorized combsort  vectorized merge  benchmark
  • 3. AA-sort  Aligned-Access sort  proposed by Hiroshi Inoue et al. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011  http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm  http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm  For SIMD: fewer conditional branches, no unaligned data access  For multicore processors: they implemented it for PowerPC and Cell BE  O(n log n) complexity  I tried it for Intel CPU (not complete)  https://github.com/herumi/opti/blob/master/intsort.hpp (the current version is for only one processor)
  • 4. AA-sort  vectorized combsort for a block (<= L2cache?)  vectorized merge of sorted blocks [Diagram: the input array is split into block 0, block 1, block 2, block 3, ...; each block is sorted, then sorted blocks are merged pairwise, level by level]
  • 5. AA-sort algorithm  sort each block  O(n log n)  merge sorted block  O(n)
  • 6. classic combsort(1/2)  improved bubble sort  unstable  O(n log n)  compare two elements having a gap(>=1) gap is divided by shrink factor (about 1.3) size_t nextGap(size_t N) { return (N * 10) / 13; } void combsort(uint32_t *a, size_t N) { size_t gap = nextGap(N); while (gap > 1) { for (size_t i = 0; i < N - gap; i++) { if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]); } gap = nextGap(gap); } …
  • 7. classic combsort(2/2)  gap = 1 means bubble sort  loop until the array is fully sorted … for (;;) { bool isSwapped = false; for (size_t i = 0; i < N - 1; i++) { if (a[i] > a[i + 1]) { std::swap(a[i], a[i + 1]); isSwapped = true; } } if (!isSwapped) return; } }
  • 8. gap function  Combsort11  ending the gap sequence with [11, 8, 6, 4, 3, 2, 1] seems good, according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm size_t nextGap(size_t n) { n = (n * 10) / 13; if (n == 9 || n == 10) return 11; /* (*) */ return n; }  a little faster if line (*) is added
  • 9. vectorized combsort  step1 : sort values within each vector(32bitx4)  step2 : SIMD version combsort  step3 : reorder data [Diagram: the input array 6 8 9 3 5 7 12 14 0 4 1 20 11 ... is held in vectors v0 v1 v2 v3 ... with lanes +0..+3; after step2 each lane holds a sorted run (+0: 0 1 3 … 101, +1: 102 104 105 … 380, +2: 389 391 392 … 502, +3: 511 515 612 … 973); step3 reorders these into the linear array 0 1 3 … 101 102 104 105 … 380 389 391 392 …]
  • 10. step1  step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3  step1.2 : transpose. Example (columns are v0 v1 v2 v3, one row per lane): input [3 5 0 8 / 2 7 1 2 / 8 12 4 13 / 9 14 20 15]; after step1.1 (sort) [0 3 5 8 / 1 2 2 7 / 4 8 12 13 / 9 14 15 20]; after step1.2 (transpose) [0 1 4 9 / 3 2 8 14 / 5 2 12 15 / 8 7 13 20]
  • 11. sort of 4 items  use pmaxud, pminud for uint32_t x 4 [Diagram: sorting network built from compare-exchange (a, b) -> (min(a,b), max(a,b)); v0, v1 give min01, max01 and v2, v3 give min23, max23; then min0123 = min(min01, min23), max0123 = max(max01, max23), s = max(min01, min23), t = min(max01, max23); the sorted output is min0123, min(s,t), max(s,t), max0123]
  • 12. source of step1.1  V128 is a type of 32-bit integer x 4  pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3 void sort_step1_vec(V128 x[4]) { V128 min01 = pminud(x[0], x[1]); V128 max01 = pmaxud(x[0], x[1]); V128 min23 = pminud(x[2], x[3]); V128 max23 = pmaxud(x[2], x[3]); x[0] = pminud(min01, min23); x[3] = pmaxud(max01, max23); V128 s = pmaxud(min01, min23); V128 t = pminud(max01, max23); x[1] = pminud(s, t); x[2] = pmaxud(s, t); }
  • 13. transpose of 4x4 matrix  use unpcklps and unpckhps. First pass: t0=unpcklps(x0,x2), t1=unpcklps(x1,x3), t2=unpckhps(x0,x2), t3=unpckhps(x1,x3). Second pass: x0=unpcklps(t0,t1), x1=unpckhps(t0,t1), x2=unpcklps(t2,t3), x3=unpckhps(t2,t3). Example (one column per vector, rows +0..+3): x0 x1 x2 x3 = [3 5 0 8 / 2 7 1 2 / 8 12 4 13 / 9 14 20 15]; t0 t1 t2 t3 = [3 5 8 12 / 0 8 4 13 / 2 7 9 14 / 1 2 20 15]; result x0 x1 x2 x3 = [3 2 8 9 / 5 7 12 14 / 0 1 4 20 / 8 2 13 15]
  • 14. source of transpose and step1 void transpose(V128 x[4]) { V128 x0 = x[0]; V128 x1 = x[1]; V128 x2 = x[2]; V128 x3 = x[3]; V128 t0 = unpcklps(x0, x2); V128 t1 = unpcklps(x1, x3); V128 t2 = unpckhps(x0, x2); V128 t3 = unpckhps(x1, x3); x[0] = unpcklps(t0, t1); x[1] = unpckhps(t0, t1); x[2] = unpcklps(t2, t3); x[3] = unpckhps(t2, t3); } void sort_step1(V128 *va, size_t N) { for(size_t i = 0; i < N; i+= 4) { sort_step1_vec(&va[i]); transpose(&va[i]); } }
  • 15. SIMD version combsort  the first half of the code uses  vector_cmpswap  vector_cmpswap_skew bool sort_step2(V128 *va, size_t N) { size_t gap = nextGap(N); while (gap > 1) { for (size_t i = 0; i < N - gap; i++) { vector_cmpswap(va[i], va[i + gap]); } for (size_t i = N - gap; i < N; i++) { vector_cmpswap_skew(va[i], va[i + gap - N]); } gap = nextGap(gap); } ...
  • 16. vector_cmpswap  no conditional branch [Diagram: compare-exchange of a and b yields min(a,b) and max(a,b)] scalar version: if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]); vectorised version: void vector_cmpswap(V128& a, V128& b) { V128 t = pmaxud(a, b); a = pminud(a, b); b = t; }
  • 17. vector_cmpswap_skew  for boundary of array. Input: a = [a3:a2:a1:a0], b = [b3:b2:b1:b0]; (a',b') = vector_cmpswap_skew(a,b); a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]; b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]
  • 18. isSortedVec  check whether array is sorted  ptest_zf(a, b) is true if (a & b) == 0  a <= b ⇔ max(a,b) == b ⇔ c := max(a,b) - b == 0  pcmpgtd is for int32_t, so we can't use it bool isSortedVec(const V128 *va, size_t N) { for (size_t i = 0; i < N - 1; i++) { V128 a = va[i]; V128 b = va[i + 1]; V128 c = pmaxud(a, b); c = psubd(c, b); if (!ptest_zf(c, c)) { return false; } } return true; }
  • 19. loop for gap == 1  vectorised bubble sort for gap == 1  give up if the loop count reaches maxLoop and fall back to std::sort  this rarely happens const int maxLoop = 10; for (int loop = 0; loop < maxLoop; loop++) { for (size_t i = 0; i < N - 1; i++) { vector_cmpswap(va[i], va[i + 1]); } vector_cmpswap_skew(va[N - 1], va[0]); if (isSortedVec(va, N)) return true; }
  • 20. AA-sort algorithm  sort each block  O(n log n)  merge sorted block  O(n)
  • 21. merge two sorted vectors  a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are sorted  c = [b:a] = merge and sort (a, b) [Diagram: sorted a (a0 a1 a2 a3) and sorted b (b0 b1 b2 b3) are passed to [b:a] = vector_merge(a,b); the two output vectors together hold the sorted result c]
  • 22. data flow of merge [Diagram: comparison network on sorted a0 a1 a2 a3 and sorted b0 b1 b2 b3; the first stage compares a_i with b_i and yields min00, max00, min11, max11, min22, max22, min33, max33, which further compare stages then combine]
  • 23. source of vector_merge  Too complex  good idea? void vector_merge(V128& a, V128& b) { V128 m = pminud(a, b); V128 M = pmaxud(a, b); V128 s0 = punpckhqdq(m, m); V128 s1 = pminud(s0, M); V128 s2 = pmaxud(s0, M); V128 s3 = punpcklqdq(s1, punpckhqdq(M, M)); V128 s4 = punpcklqdq(s2, m); s4 = pshufd<PACK(2, 1, 0, 3)>(s4); V128 s5 = pminud(s3, s4); V128 s6 = pmaxud(s3, s4); V128 s7 = pinsrd<2>(s5, movd(s6)); V128 s8 = pinsrd<0>(s6, pextrd<2>(s5)); a = pshufd<PACK(1, 2, 0, 3)>(s7); b = pshufd<PACK(3, 2, 0, 1)>(s8); }
  • 24. std::merge()  merge [begin1, end1) and [begin2, end2) template <class In1, class In2, class Out> Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out) { for (;;) { *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++; if (begin1 == end1) return copy(begin2, end2, out); if (begin2 == end2) return copy(begin1, end1, out); } }
  • 25. vectorised merge  merge arrays with vector_merge() void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN){ uint32_t aPos = 0, bPos = 0, outPos = 0; V128 vMin = va[aPos++]; V128 vMax = vb[bPos++]; for (;;) { vector_merge(vMin, vMax); vo[outPos++] = vMin; if (aPos < aN) { if (bPos < bN) { V128 ta = va[aPos]; V128 tb = vb[bPos]; /* compare ta0 with tb0 */ if (movd(ta) <= movd(tb)) { vMin = ta; aPos++; } else { vMin = tb; bPos++; }
  • 26. block size and rate of sort  What is a good block size for vectorised sort?  half the size of L2 is recommended for PowerPC 970MP L2 = 1MiB => 512KiB => block size = 128Ki uint32_t elements  BS = 32Ki seems good for Xeon, Core i7  profile of sort and merge [Chart: percentage of time spent in sort(%) and merge(%)]
  • 27. Benchmark(1/3)  AA-sort vs std::sort for random data  Xeon X5650 + gcc-4.6.3: 4 times faster for # < 64Ki, 2.85 times faster for # = 4Mi [Chart: clock cycles (1 to 10000000) against # of uint32_t (16 to 4Mi) for std::sort and AA-sort]
  • 28. Benchmark(2/3)  sort 64Ki uint on Xeon + gcc-4.6.3  AA-sort speed does not strongly depend on pattern [Chart: std::sort vs AA-sort for several input patterns]
  • 29. Benchmark(3/3)  sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11 [Chart: std::sort(gcc), AA-sort(gcc), std::sort(VC), AA-sort(VC)]
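
The step1 data flow from slides 10, 12 and 14 (sort across four vectors with a min/max network, then transpose) can be tried without SSE hardware using a scalar model. The sketch below is an illustration under stated assumptions, not the code in intsort.hpp: `V128` is modelled as `std::array<std::uint32_t, 4>`, and `pminud`, `pmaxud`, `unpcklps` and `unpckhps` are plain C++ stand-ins that mimic the lane behaviour of the instructions with the same names. `demo_step1` is a hypothetical helper added only for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a 128-bit register is modelled as four uint32_t lanes.
using V128 = std::array<std::uint32_t, 4>;

// Lane-wise unsigned minimum, standing in for pminud.
inline V128 pminud(const V128& a, const V128& b) {
    V128 r{};
    for (std::size_t i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}

// Lane-wise unsigned maximum, standing in for pmaxud.
inline V128 pmaxud(const V128& a, const V128& b) {
    V128 r{};
    for (std::size_t i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}

// Interleave the two low lanes of a and b: (a0, b0, a1, b1).
inline V128 unpcklps(const V128& a, const V128& b) {
    return V128{{a[0], b[0], a[1], b[1]}};
}

// Interleave the two high lanes of a and b: (a2, b2, a3, b3).
inline V128 unpckhps(const V128& a, const V128& b) {
    return V128{{a[2], b[2], a[3], b[3]}};
}

// step1.1: for every lane j, sort x[0][j], x[1][j], x[2][j], x[3][j]
// with the branch-free network shown on slide 12.
inline void sort_step1_vec(V128 x[4]) {
    V128 min01 = pminud(x[0], x[1]);
    V128 max01 = pmaxud(x[0], x[1]);
    V128 min23 = pminud(x[2], x[3]);
    V128 max23 = pmaxud(x[2], x[3]);
    x[0] = pminud(min01, min23);
    x[3] = pmaxud(max01, max23);
    V128 s = pmaxud(min01, min23);
    V128 t = pminud(max01, max23);
    x[1] = pminud(s, t);
    x[2] = pmaxud(s, t);
}

// step1.2: 4x4 transpose, so that each vector becomes sorted internally.
inline void transpose(V128 x[4]) {
    V128 t0 = unpcklps(x[0], x[2]);
    V128 t1 = unpcklps(x[1], x[3]);
    V128 t2 = unpckhps(x[0], x[2]);
    V128 t3 = unpckhps(x[1], x[3]);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}

// step1 over N vectors; N is assumed to be a multiple of 4.
inline void sort_step1(V128* va, std::size_t N) {
    for (std::size_t i = 0; i < N; i += 4) {
        sort_step1_vec(&va[i]);
        transpose(&va[i]);
    }
}

// Hypothetical helper: runs step1 on the example values of slide 10.
inline void demo_step1() {
    V128 x[4] = {V128{{3, 2, 8, 9}}, V128{{5, 7, 12, 14}},
                 V128{{0, 1, 4, 20}}, V128{{8, 2, 13, 15}}};
    sort_step1(x, 4);
    for (const V128& v : x) assert(std::is_sorted(v.begin(), v.end()));
}
```

Calling `demo_step1()` from `main` exercises the slide 10 example: the four vectors come out as (0, 3, 5, 8), (1, 2, 2, 7), (4, 8, 12, 13) and (9, 14, 15, 20), each sorted internally. A real SSE4.1 version would presumably replace the four stand-in helpers with the corresponding instructions and keep the two functions unchanged.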