3. AA-sort
Aligned-Access sort
proposed by Hiroshi Inoue et al. in
"A high-performance sorting algorithm for multicore
single-instruction multiple-data processors," 2011
http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
Designed for SIMD
  fewer conditional branches, no unaligned data accesses
Designed for multicore processors
  the authors implemented it for PowerPC and Cell BE
O(n log n) complexity
I tried it on an Intel CPU (not complete yet)
https://github.com/herumi/opti/blob/master/intsort.hpp
the current version uses only one processor (no multicore support yet)
2012/6/16 #x86opti 4 3 /29
6. classic combsort(1/2)
improved bubble sort
unstable
about O(n log n) in practice (worst case O(n^2))
compare two elements that are gap (>= 1) positions apart
after each pass the gap is divided by a shrink factor (about 1.3)
size_t nextGap(size_t N) { return (N * 10) / 13; }
void combsort(uint32_t *a, size_t N) {
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i < N - gap; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
        gap = nextGap(gap);
    }
…
7. classic combsort(2/2)
with gap == 1 each pass is a plain bubble sort pass
repeat the pass until no swap occurs, i.e. the array is fully sorted
…
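    for (;;) {
        bool isSwapped = false;
        // i + 1 < N instead of i < N - 1: N - 1 would wrap around when N == 0
        for (size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return; // a full pass without a swap: sorted
    }
}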
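The two halves of the classic combsort above can be joined into one compilable unit. This is a minimal sketch, not the code from intsort.hpp: the `N < 2` guard and the `combsorted` helper are additions made here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Shrink the gap by a factor of about 1.3 (same rule as on the slide).
inline std::size_t nextGap(std::size_t n) { return (n * 10) / 13; }

// Classic combsort: gapped passes first, then bubble passes until sorted.
inline void combsort(std::uint32_t* a, std::size_t N) {
    if (N < 2) return;  // guard added in this sketch: N - 1 wraps around for N == 0
    for (std::size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (std::size_t i = 0; i < N - gap; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
    }
    for (;;) {  // gap == 1: plain bubble sort passes
        bool isSwapped = false;
        for (std::size_t i = 0; i < N - 1; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;  // a full pass without a swap: sorted
    }
}

// Hypothetical helper for trying it out: sorts a copy and returns it.
inline std::vector<std::uint32_t> combsorted(std::vector<std::uint32_t> v) {
    combsort(v.data(), v.size());
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

Calling `combsorted({5, 1, 4, 0})` returns the values in ascending order; values above 2^31 are handled as unsigned because the element type is `uint32_t`.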
8. gap function
Combsort11
a gap sequence that ends with [11, 8, 6, 4, 3, 2, 1] seems to work well
source: http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm
size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // (*)
    return n;
}
slightly faster when the line marked (*) is added
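To see what the (*) line does, the gaps can simply be listed. A small sketch follows; `gapSequence` is a hypothetical helper introduced here, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combsort11 gap rule from the slide: shrink by about 1.3, but never use 9 or 10.
inline std::size_t nextGap(std::size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;  // (*)
    return n;
}

// Hypothetical helper: list every gap used for an array of N elements,
// including the final gap of 1 that the bubble sort loop handles.
inline std::vector<std::size_t> gapSequence(std::size_t N) {
    std::vector<std::size_t> gaps;
    for (std::size_t gap = nextGap(N); gap >= 1; gap = nextGap(gap)) {
        assert(gap < N);  // every gap stays inside the array
        gaps.push_back(gap);
        if (gap == 1) break;
    }
    return gaps;
}
```

For N = 100 this yields 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1. Without the (*) line the tail would be 14, 10, 7, 5, 3, 2, 1 instead.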
15. SIMD version combsort
the first half of the code uses
  vector_cmpswap
  vector_cmpswap_skew
bool sort_step2(V128 *va, size_t N) {
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i < N - gap; i++) {
            vector_cmpswap(va[i], va[i + gap]);
        }
        for (size_t i = N - gap; i < N; i++) {
            vector_cmpswap_skew(va[i], va[i + gap - N]);
        }
        gap = nextGap(gap);
    }
...
16. vector_cmpswap
no conditional branch
compare a and b lane by lane: (a, b) -> (min(a,b), max(a,b))
the scalar statement
    if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
is vectorised as
void vector_cmpswap(V128& a, V128& b)
{
    V128 t = pmaxud(a, b);
    a = pminud(a, b);
    b = t;
}
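The lane-wise effect of vector_cmpswap can be checked without SIMD hardware. The sketch below is a portable scalar model, not the SSE code itself: `V4` is a stand-in type assumed here for a 128-bit register with four uint32_t lanes. In real SSE4.1 code, pminud and pmaxud are available as the intrinsics `_mm_min_epu32` and `_mm_max_epu32`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar stand-in (an assumption of this sketch) for four uint32_t lanes.
using V4 = std::array<std::uint32_t, 4>;

// Lane-wise model of vector_cmpswap: afterwards a[k] <= b[k] in every lane k.
// The SIMD version does all four lanes with one pminud and one pmaxud.
inline void vector_cmpswap(V4& a, V4& b) {
    for (int k = 0; k < 4; k++) {
        const std::uint32_t lo = std::min(a[k], b[k]);
        const std::uint32_t hi = std::max(a[k], b[k]);
        a[k] = lo;
        b[k] = hi;
        assert(a[k] <= b[k]);
    }
}
```

With a = {5, 1, 4000000000, 7} and b = {3, 2, 1, 7}, the call leaves a = {3, 1, 1, 7} and b = {5, 2, 4000000000, 7}.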
17. vector_cmpswap_skew
used at the boundary of the array (wrap-around)
(a', b') = vector_cmpswap_skew(a, b)
a  = [ a3          a2          a1          a0         ]
b  = [ b3          b2          b1          b0         ]
a' = [ a3          min(a2,b3)  min(a1,b2)  min(a0,b1) ]
b' = [ max(a2,b3)  max(a1,b2)  max(a0,b1)  b0         ]
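The skewed pairing above is easier to follow as a scalar model. This is a sketch under the assumption that lane 0 is the rightmost element (a0, b0) in the diagram; `V4` is a stand-in type introduced here, and the real implementation would use shuffles plus pminud/pmaxud instead of a loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar stand-in (an assumption of this sketch) for four uint32_t lanes;
// index 0 is lane 0, i.e. a0 / b0 in the diagram.
using V4 = std::array<std::uint32_t, 4>;

// Model of vector_cmpswap_skew: lane k of a is ordered against lane k + 1 of b.
// a[3] and b[0] have no partner and stay untouched.
inline void vector_cmpswap_skew(V4& a, V4& b) {
    assert(&a != &b);  // this sketch expects two distinct vectors
    for (int k = 0; k < 3; k++) {
        const std::uint32_t lo = std::min(a[k], b[k + 1]);
        const std::uint32_t hi = std::max(a[k], b[k + 1]);
        a[k] = lo;
        b[k + 1] = hi;
    }
}
```

For a = {9, 2, 8, 100} and b = {50, 4, 6, 1} (lanes 0 to 3), the result is a = {4, 2, 1, 100} and b = {50, 9, 6, 8}.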
18. isSortedVec
check whether the array is sorted
ptest_zf(a, b) is true if (a & b) == 0
a <= b  <=>  max(a,b) == b  <=>  c := max(a,b) - b == 0
pcmpgtd compares signed int32_t values, so we can't use it for uint32_t
bool isSortedVec(const V128 *va, size_t N) {
    for (size_t i = 0; i < N - 1; i++) {
        V128 a = va[i];
        V128 b = va[i + 1];
        V128 c = pmaxud(a, b);
        c = psubd(c, b);
        if (!ptest_zf(c, c)) {
            return false;
        }
    }
    return true;
}
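The max/subtract/zero-test trick, and the reason a signed compare is not enough, can be tried in scalar code. A minimal sketch: `V4`, `allLanesLessEqual` and `signedGreater` are names made up for this example, and the loop bound is written as `i + 1 < N` so that N == 0 is harmless.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in (an assumption of this sketch) for four uint32_t lanes.
using V4 = std::array<std::uint32_t, 4>;

// True when a[k] <= b[k] in every lane, using only max, subtract and a zero test,
// the same three steps as pmaxud, psubd and ptest on the slide.
inline bool allLanesLessEqual(const V4& a, const V4& b) {
    std::uint32_t anyNonZero = 0;
    for (int k = 0; k < 4; k++) {
        const std::uint32_t c = std::max(a[k], b[k]) - b[k];  // 0 iff a[k] <= b[k]
        anyNonZero |= c;
    }
    return anyNonZero == 0;
}

// What a signed comparison would report for the same bit patterns
// (two's complement conversion, guaranteed since C++20).
inline bool signedGreater(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a) > static_cast<std::int32_t>(b);
}

// Scalar model of isSortedVec: each vector must be <= the next one in every lane.
inline bool isSortedVec(const V4* va, std::size_t N) {
    assert(va != nullptr || N == 0);
    for (std::size_t i = 0; i + 1 < N; i++) {
        if (!allLanesLessEqual(va[i], va[i + 1])) return false;
    }
    return true;
}
```

For example, 2 <= 0x80000000 holds for unsigned values, yet `signedGreater(2, 0x80000000u)` is true, which is why the signed compare would give the wrong answer here.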
19. loop for gap == 1
vectorised bubble sort for gap == 1
give up if the loop count reaches maxLoop
  then fall back to std::sort
  this rarely happens
const int maxLoop = 10;
for (int loop = 0; loop < maxLoop; loop++) { // renamed: the inner i shadowed this counter
    for (size_t i = 0; i < N - 1; i++) {
        vector_cmpswap(va[i], va[i + 1]);
    }
    vector_cmpswap_skew(va[N - 1], va[0]);
    if (isSortedVec(va, N)) return true;
}
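Putting the pieces together, the whole in-vector combsort can be modelled in scalar code. This is a sketch under stated assumptions, not the SSE implementation: `V4` stands in for V128, `combsortModel` and `flattenTransposed` are hypothetical names, N >= 2 is required, and the "transposed" result order (lane 0 of every vector, then lane 1, and so on) is inferred from how vector_cmpswap_skew pairs lane k with lane k + 1.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using V4 = std::array<std::uint32_t, 4>;  // scalar stand-in for one 128-bit vector

inline std::size_t nextGap(std::size_t n) {
    n = (n * 10) / 13;
    return (n == 9 || n == 10) ? 11 : n;
}

// Order lane k of a against lane k of b.
inline void vector_cmpswap(V4& a, V4& b) {
    for (int k = 0; k < 4; k++) {
        const std::uint32_t lo = std::min(a[k], b[k]);
        b[k] = std::max(a[k], b[k]);
        a[k] = lo;
    }
}

// Order lane k of a against lane k + 1 of b (wrap-around at the array boundary).
inline void vector_cmpswap_skew(V4& a, V4& b) {
    for (int k = 0; k < 3; k++) {
        const std::uint32_t lo = std::min(a[k], b[k + 1]);
        b[k + 1] = std::max(a[k], b[k + 1]);
        a[k] = lo;
    }
}

inline bool isSortedVec(const std::vector<V4>& va) {
    for (std::size_t i = 0; i + 1 < va.size(); i++) {
        for (int k = 0; k < 4; k++) {
            if (va[i][k] > va[i + 1][k]) return false;
        }
    }
    return true;
}

// Returns true if va ended up sorted in transposed order,
// false if maxLoop was reached (the caller would then fall back to std::sort).
inline bool combsortModel(std::vector<V4>& va) {
    const std::size_t N = va.size();
    assert(N >= 2);  // assumption of this sketch
    for (std::size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (std::size_t i = 0; i < N - gap; i++) vector_cmpswap(va[i], va[i + gap]);
        for (std::size_t i = N - gap; i < N; i++) vector_cmpswap_skew(va[i], va[i + gap - N]);
    }
    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop; loop++) {
        for (std::size_t i = 0; i + 1 < N; i++) vector_cmpswap(va[i], va[i + 1]);
        vector_cmpswap_skew(va[N - 1], va[0]);
        if (isSortedVec(va)) return true;
    }
    return false;
}

// Read the elements in transposed order: lane 0 of every vector, then lane 1, ...
inline std::vector<std::uint32_t> flattenTransposed(const std::vector<V4>& va) {
    std::vector<std::uint32_t> out;
    for (int k = 0; k < 4; k++) {
        for (const V4& v : va) out.push_back(v[k]);
    }
    return out;
}
```

When `combsortModel` returns true, `flattenTransposed(va)` is in ascending order: isSortedVec covers neighbours inside each lane, and the final vector_cmpswap_skew covers the step from the end of one lane to the start of the next. A real implementation still needs a reordering step to get back to the normal memory layout.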