Aa sort-v4


Aligned Access Sort

Published in: Education, Technology

  1. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors
     By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
     Presented by: M. Edirisinghe, H. Nawarathna
  2. Content
     • Introduction
     • SIMD instruction set
     • AA-sort algorithm
     • In-core algorithm
     • Out-of-core algorithm
     • Sorting scheme in AA-sort
     • Experimental results
  3. Introduction
     • High-performance processors provide multiple hardware threads within one physical processor through multiple cores and simultaneous multithreading
     • Many processors provide Single Instruction Multiple Data (SIMD) instructions
  4. SIMD Instructions
     • Advantages:
       – Data parallelism
       – Fewer conditional branches in programs (vector compare and vector select can be used instead)
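The compare-and-select idea can be illustrated with a small emulation. This is a minimal sketch in plain Python, not VMX code: the function names `vector_compare_gt`, `vector_select`, and `vector_min_max` are hypothetical, and a four-element list stands in for one vector register. On real hardware each helper would map to a single instruction with no data-dependent branch; the Python conditional expression only imitates that lane-wise behavior.

```python
def vector_compare_gt(a, b):
    """Lane-wise a > b; returns one boolean mask entry per lane."""
    return [x > y for x, y in zip(a, b)]


def vector_select(a, b, mask):
    """Lane-wise select: take the lane from b where the mask is set, else from a."""
    return [y if m else x for x, y, m in zip(a, b, mask)]


def vector_min_max(a, b):
    """Return (lane-wise minimum, lane-wise maximum) using compare + select only."""
    mask = vector_compare_gt(a, b)
    vmin = vector_select(a, b, mask)  # where a > b, the minimum is b
    vmax = vector_select(b, a, mask)  # where a > b, the maximum is a
    return vmin, vmax
```

For example, `vector_min_max([5, 1, 7, 3], [2, 6, 7, 0])` returns `([2, 1, 7, 0], [5, 6, 7, 3])`: every lane is ordered without a per-element `if` on the control path of the caller.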
  5. SIMD Instruction Set
     • Uses Vector Multimedia eXtension (VMX or AltiVec) instructions
     • Provides a set of 128-bit vector registers
       – Each register holds four 32-bit values
     • Useful VMX instructions for sorting:
       – Vector compare
       – Vector select
       – Vector permute
  6. Sorting Algorithms and SIMD
     • Many sorting algorithms require unaligned or element-wise memory access (e.g., quicksort)
     • Such accesses incur additional overhead and attenuate the benefits of SIMD instructions
  7. Paper’s Contribution
     • Proposes Aligned-Access sort (AA-sort), a new parallel sorting algorithm that exploits both SIMD instructions and the thread-level parallelism available on today’s multi-core processors, with a computational complexity of O(N log(N))
  8. AA-Sort Algorithm
     • Assumptions:
       – The first element of the array to be sorted is aligned on a 128-bit boundary
       – The number of elements in the array, N, is a multiple of four
  9. AA-Sort Algorithm
     • An array of integer values a[N] is equivalent to an array of vector integers va[N/4]
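The a[N] to va[N/4] correspondence can be shown with a short sketch. The helper name `as_vectors` is hypothetical, and Python lists stand in for 128-bit registers; in C the same effect would come from reinterpreting the aligned array, not from copying.

```python
def as_vectors(a, width=4):
    """View a flat list as consecutive vectors of `width` elements each.

    Element a[i] ends up in lane i % width of vector i // width.
    """
    if len(a) % width != 0:
        raise ValueError("length must be a multiple of the vector width")
    return [a[i:i + width] for i in range(0, len(a), width)]
```

With `a = [0, 1, 2, 3, 4, 5, 6, 7]`, `as_vectors(a)` gives `[[0, 1, 2, 3], [4, 5, 6, 7]]`, so `a[6]` is lane 2 of `va[1]`.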
  10. AA-Sort Algorithm
      • Consists of two algorithms:
        1. In-core sorting algorithm
        2. Out-of-core sorting algorithm
      • Phases of execution:
        – Divide all of the data into blocks that fit into the cache of the processor
        – Sort each block with the in-core sorting algorithm
        – Merge the sorted blocks with the out-of-core sorting algorithm
  11. Combsort
      • Extension of bubble sort that eliminates "turtles" (small values near the end of the array)
      • Compares and swaps non-adjacent elements
      • Improves performance over bubble sort
      • Computational complexity: O(N log(N)) on average
      • Problems with SIMD instructions:
        – Unaligned memory access
        – Loop-carried dependencies
  12. Combsort
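For reference, the scalar combsort that the slides build on can be sketched as follows. This is the textbook form, not code from the paper; the default shrink factor of 1.3 is the commonly quoted value and is an assumption here.

```python
def combsort(values, shrink=1.3):
    """Return a sorted copy of `values` using scalar combsort."""
    a = list(values)
    gap = len(a)
    swapped = True
    # Keep going until the gap has shrunk to 1 and a full pass makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            # Compare elements `gap` apart; large gaps move "turtles" quickly.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```

The inner loop shows both SIMD obstacles named above: `a[i + gap]` is generally not aligned to a vector boundary, and when the gap is smaller than the vector width one iteration may read a value written by the previous one.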
  13. In-Core Algorithm
      • Execution steps:
        1. Sort the values within each vector in ascending order
        2. Execute combsort to sort the values into the transposed order
  14. In-Core Algorithm
      • Uses an extended combsort
  15. In-Core Algorithm
        3. Reorder the values from the transposed order into the original order
  16. In-Core Algorithm
      • All three steps can be executed using SIMD instructions without unaligned memory access
      • Computational complexity is dominated by step 2
        – Average: O(N log N)
        – Worst case: O(N^2)
      • Poor memory access locality
        – Performance degrades if the data cannot fit into the cache of the processor
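The three in-core steps can be sketched in plain Python, with four-element lists emulating vector registers. Several details are assumptions that the slides do not spell out: the names `cmpswap` and `in_core_sort` are hypothetical, the "transposed order" is taken to mean that the k-th smallest value sits in lane k // (N/4) of vector k % (N/4), and the wrap-around at the end of the array is handled with a "skewed" comparison that pairs lane k of one vector with lane k + 1 of the other. Python's `sorted` stands in for the in-register sort of step 1.

```python
LANES = 4  # four 32-bit values per 128-bit vector


def cmpswap(va, i, j, skew=False):
    """Compare-exchange vector i against vector j, lane by lane.

    Straight: lane k of va[i] with lane k of va[j].
    Skewed:   lane k of va[i] with lane k + 1 of va[j] (used at the wrap-around).
    The smaller value always stays in va[i]. Returns True if anything moved.
    """
    shift = 1 if skew else 0
    changed = False
    for k in range(LANES - shift):
        lo, hi = va[i][k], va[j][k + shift]
        if lo > hi:
            va[i][k], va[j][k + shift] = hi, lo
            changed = True
    return changed


def in_core_sort(a, shrink=1.28):
    """Return a sorted copy of `a`; len(a) must be a multiple of LANES."""
    n = len(a)
    if n % LANES != 0:
        raise ValueError("length must be a multiple of the vector width")
    if n == 0:
        return []
    nv = n // LANES
    # Step 1: sort the values inside each vector.
    va = [sorted(a[i:i + LANES]) for i in range(0, n, LANES)]
    # Step 2: combsort on whole vectors, producing the transposed order.
    gap = int(nv / shrink)
    while gap > 1:
        for i in range(nv - gap):
            cmpswap(va, i, i + gap)
        for i in range(nv - gap, nv):
            cmpswap(va, i, i + gap - nv, skew=True)
        gap = int(gap / shrink)
    changed = True
    while changed:  # gap == 1 passes until nothing moves
        changed = False
        for i in range(nv - 1):
            changed |= cmpswap(va, i, i + 1)
        changed |= cmpswap(va, nv - 1, 0, skew=True)
    # Step 3: read the values back out of the transposed order.
    return [va[k % nv][k // nv] for k in range(n)]
```

Every comparison touches whole vectors at vector-aligned indices, which is the property the slide claims. The final loop repeats until a pass makes no exchange, which is where the O(N^2) worst case comes from.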
  17. Out-of-Core Algorithm
      • Built on merging two sorted vectors
        – a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
        – c = [b:a] = merge and sort (a, b)
      • [Figure: sorted a (a0..a3) and sorted b (b0..b3) are combined by [b:a] = vector_merge(a, b) into the sorted sequence c0..c7]
  18. Dataflow of Merge
      • [Figure: network of compare (<) and min/max operations that merges sorted a0..a3 with sorted b0..b3]
      • lg(P) + 1 stages, where P is the number of elements in a vector
      • Here P = 4, so lg(P) + 1 = 3
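A three-stage merge of two sorted four-element vectors can be sketched with Batcher's odd-even merge network. The figure on this slide is not recoverable from the transcript, so whether its wiring matches this network exactly is an assumption; the function name `vector_merge` follows the previous slide, while `STAGES` and the list-based emulation are illustrative.

```python
# Comparator positions per stage, indexing the 8 values [a0..a3, b0..b3].
# All comparators within one stage are independent, so each stage could be
# executed as one vector min plus one vector max (with permutes in between).
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],
    [(2, 4), (3, 5)],
    [(1, 2), (3, 4), (5, 6)],
]


def vector_merge(a, b):
    """Merge sorted 4-vectors a and b; return (lower four, upper four)."""
    x = list(a) + list(b)
    for stage in STAGES:
        for i, j in stage:
            if x[i] > x[j]:  # compare-exchange: smaller value to the lower index
                x[i], x[j] = x[j], x[i]
    return x[:4], x[4:]
```

For P = 4 the network has lg(4) + 1 = 3 stages, matching the count on the slide.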
  19. Merge Operation
  20. Out-of-Core Algorithm
      • No unaligned memory accesses
      • Better memory access locality than the in-core sorting algorithm
        – Higher performance when the data cannot fit in the cache
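One way to merge two long sorted arrays while reading them only in whole, aligned vectors is sketched below. It is a sketch under stated assumptions, not the paper's code: the name `merge_sorted_arrays` is hypothetical, both inputs are assumed non-empty with lengths that are multiples of four, and `sorted` on eight values is a stand-in for the SIMD merge of two vector registers.

```python
WIDTH = 4  # elements per vector


def merge_sorted_arrays(a, b):
    """Merge two sorted lists, loading one whole vector at a time."""
    if not a or not b or len(a) % WIDTH or len(b) % WIDTH:
        raise ValueError("inputs must be non-empty multiples of the vector width")
    out = []
    ia = ib = WIDTH
    vmin, vmax = list(a[:WIDTH]), list(b[:WIDTH])
    while True:
        # Stand-in for the in-register merge of two sorted vectors.
        both = sorted(vmin + vmax)
        vmin, vmax = both[:WIDTH], both[WIDTH:]
        out.extend(vmin)  # the lower half is final and can be stored
        # Refill from the array whose next unread element is smaller.
        take_a = ia < len(a) and (ib >= len(b) or a[ia] <= b[ib])
        if take_a:
            vmin = list(a[ia:ia + WIDTH])
            ia += WIDTH
        elif ib < len(b):
            vmin = list(b[ib:ib + WIDTH])
            ib += WIDTH
        else:
            break
    out.extend(vmax)  # the last upper half is already sorted
    return out
```

Each array is read sequentially and only at multiples of the vector width, which is what gives the out-of-core phase its aligned accesses and good locality. The only scalar comparison per step is the one that picks which array to load from next.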
  21. Overall AA Sort Scheme
      • Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor
      • Sort each block with the in-core sorting algorithm in parallel using multiple threads, where each thread processes an independent block
      • Merge the sorted blocks with the out-of-core sorting algorithm using multiple threads
  22. Overall AA Sort Scheme Contd.
      • Number of elements of data = N
      • Number of elements per block = B
      • Number of blocks = N/B
      • In-core sorting phase:
        – Computation time for the in-core sorting of each block is proportional to B log(B)
        – Complexity of in-core sorting = O(N), because B is a constant fixed by the cache size
      • Out-of-core sorting phase:
        – Merging sorted blocks in out-of-core sorting involves log(N/B) stages
        – Computational complexity of each stage = O(N)
        – Complexity of out-of-core sorting = O(N log(N))
      • Hence, the computational complexity of the entire AA-sort = O(N log(N))
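The complexity argument on this slide can be written out as a short derivation. The constants c_1 and c_2 are notation introduced here for illustration, and the in-core cost per block uses the average-case combsort bound quoted on the slide.

```latex
\begin{align*}
T_{\text{in}}(N)  &= \frac{N}{B}\cdot c_1\, B \log B = c_1\, N \log B = O(N)
                     \quad \text{for constant } B,\\
T_{\text{out}}(N) &= c_2\, N \log\frac{N}{B} = O(N \log N),\\
T(N)              &= T_{\text{in}}(N) + T_{\text{out}}(N) = O(N \log N).
\end{align*}
```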
  23. Overall AA Sort Scheme Contd.
      • An example of the entire AA-sort process, where the number of blocks (N/B) = 8 and the number of threads = 4
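The overall scheme (split into blocks, sort blocks in parallel, merge pairs of sorted runs stage by stage) can be sketched with the Python standard library. This shows only the structure: the name `aa_sort_scheme` is hypothetical, `sorted` stands in for the in-core algorithm, `heapq.merge` stands in for the out-of-core merge, and Python threads illustrate the task layout without giving real SIMD or multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def aa_sort_scheme(data, block_size, threads=4):
    """Sort `data` (a list) by block-wise sorting followed by pairwise merging."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 1: each thread sorts independent blocks (in-core stand-in).
        runs = list(pool.map(sorted, blocks))
        # Phase 2: log2(number of blocks) merge stages (out-of-core stand-in).
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(p[0], p[1])), pairs))
            if len(runs) % 2:
                merged.append(runs[-1])  # odd run out waits for the next stage
            runs = merged
    return runs[0] if runs else []
```

With 8 blocks and 4 threads, the first merge stage has 4 independent merges, the second has 2, and the last has 1, so some threads sit idle in the later stages unless several threads cooperate on one merge.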
  24. Experimental Setup
      • PowerPC 970MP system
        – Two 2.5 GHz dual-core processors
        – 8GB system memory
        – Each core had 1MB L2 cache memory
        – Linux kernel 2.6.20
      • System with Cell BE processors
        – Two 2.4 GHz processors
        – 1GB system memory
        – Only SPE cores were used (16 SPE cores with 256KB local memory each)
        – Linux kernel 2.6.15
  25. Implementation
      • Block size: half of the size of the L2 cache or local memory
        – 512KB (128K 32-bit values) on the PowerPC 970MP
        – 128KB (32K 32-bit values) on the SPE
      • Shrink factor: 1.28
      • Multiway merge technique with out-of-core sorting
        – 4-way merge
        – Number of merging stages reduced from log2(N/B) to log4(N/B)
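The effect of the 4-way merge on the number of stages can be checked with a little arithmetic. The helper `merge_stages` is hypothetical, and the example assumes that "32 million" integers means 32 × 2^20 values and uses the PowerPC block size of 128K values.

```python
def merge_stages(num_blocks, ways):
    """Number of merge stages needed to reduce `num_blocks` sorted runs to one."""
    stages = 0
    runs = num_blocks
    while runs > 1:
        runs = -(-runs // ways)  # ceiling division: runs left after one stage
        stages += 1
    return stages


n = 32 * 2**20        # assumed: 32 million 32-bit integers
b = 128 * 2**10       # block size on the PowerPC 970MP, in values
num_blocks = n // b   # 256 blocks
two_way = merge_stages(num_blocks, 2)   # log2(256) = 8 stages
four_way = merge_stages(num_blocks, 4)  # log4(256) = 4 stages
```

Under these assumptions the 4-way merge halves the number of passes over the data, from 8 to 4.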
  26. Effects of Using SIMD Instructions
      • Branch misprediction rate
      • Acceleration by SIMD instructions for sorting 16 K random integers on one core of PowerPC 970MP
  27. Performance for 32-bit Integers
      • Performance of the sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes
  28. Performance for 32-bit Integers Contd.
      • Performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers
  29. Performance for 32-bit Integers Contd.
      • The execution time of parallel versions of AA-sort and GPUTeraSort on up to 4 cores of PowerPC 970MP
  30. Performance for 32-bit Integers Contd.
      • Scalability with increasing number of cores on Cell BE for 32 million integers
  31. Conclusions
      • The paper describes a new parallel sorting algorithm called Aligned-Access Sort
      • The algorithm does not involve any unaligned memory accesses
      • Evaluated on PowerPC 970MP and Cell Broadband Engine processors
      • Demonstrated better scalability and performance in both sequential and parallel versions
  32. Conclusions Contd.
      • Evaluation was performed only on 32-bit integers
      • Performance comparison was performed on a limited number of architectures
        – Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand
      • Does not discuss how multiple threads cooperate on one merge operation when the number of blocks becomes smaller than the number of threads
  33. Thank You.