REGISTER LEVEL SORT
ALGORITHM ON MULTI-CORE SIMD
PROCESSORS
Xiaochen Tian, Kamil Rocki and Reiji Suda
The University of Tokyo
Suda Lab
Xeon Phi and AVX-512
• Xeon Phi:
• 60 cores.
• 240 threads with Hyperthreading.
• AVX-512 supported
• AVX-512
• 512-bit registers
• SIMD operations on 16 x 32-bit integers
Comparison Sorting on Xeon Phi
• Purpose:
• Utilize 60 cores
• Utilize AVX-512
• Focus on: sorting in registers
• Challenge:
• Restricted access patterns within SIMD registers (explained later)
• Proposed Algorithms:
• Based on the bitonic merge sorting network
• 1 register method
• 2 register method
Bitonic Merge Sorting Network
• Numbers in the box represent the indices of elements
[Figure: bitonic merge sorting network for 8 elements, indices 1 to 8; steps 1 to 6 grouped into stages 1 to 3]
Example of Bitonic Merge Sort
Position  Input  Step 1  Step 2  Step 3  Step 4  Step 5  Step 6
1         24     24      24      19      19      19      13
2         76     76      19      24      24      13      19
3         19     67      67      67      21      21      21
4         67     19      76      76      13      24      24
5         53     13      37      53      53      53      37
6         13     53      53      37      37      37      53
7         21     37      13      21      67      67      67
8         37     21      21      13      76      76      76
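The example above can be reproduced with a scalar model of the network. This is a minimal sketch, not the authors' implementation: it runs the standard bitonic sorting network one compare-exchange at a time instead of with SIMD instructions, and the names `compare_exchange` and `bitonic_sort` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange on positions lo < hi: afterwards a[lo] <= a[hi] if
   ascending is nonzero, otherwise a[lo] >= a[hi]. */
static void compare_exchange(int *a, size_t lo, size_t hi, int ascending)
{
    if ((a[lo] > a[hi]) == (ascending != 0)) {
        int t = a[lo];
        a[lo] = a[hi];
        a[hi] = t;
    }
}

/* Scalar bitonic sorting network for n elements (n must be a power of two).
   Returns the number of steps executed. */
size_t bitonic_sort(int *a, size_t n)
{
    size_t steps = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t k = 2; k <= n; k <<= 1) {            /* stage: run length k */
        for (size_t j = k >> 1; j > 0; j >>= 1) {    /* step: partner distance j */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i)                     /* visit each pair once */
                    compare_exchange(a, i, partner, (i & k) == 0);
            }
            steps++;
        }
    }
    return steps;
}
```

For n = 8 the loops run 1 + 2 + 3 = 6 steps, one per column of the example, and the first step compares the pairs (1,2) and (5,6) ascending and (3,4) and (7,8) descending.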
Restriction of Memory Accessing
• R1, R2 are two registers
• Comparison between elements in the same register is invalid!
[Figure: elements 1 to 8 held in registers R1 and R2]
R1’ = _mm512_min_epi32(R1,R2);
R2’ = _mm512_max_epi32(R1,R2);
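The restriction can be illustrated with a scalar model of the two instructions. This sketch assumes 16 lanes of 32-bit integers per register; the type `reg512` and the `model_` functions are hypothetical stand-ins for `__m512i` and the intrinsics, written as plain loops so the example runs without AVX-512 hardware.

```c
#include <stdint.h>

#define LANES 16  /* 32-bit lanes in one 512-bit register */

/* Hypothetical scalar stand-in for a 512-bit register. */
typedef struct { int32_t lane[LANES]; } reg512;

/* Model of a lane-wise minimum: lane i of the result is min(a[i], b[i]). */
reg512 model_min_epi32(reg512 a, reg512 b)
{
    reg512 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] < b.lane[i] ? a.lane[i] : b.lane[i];
    return r;
}

/* Model of a lane-wise maximum: lane i of the result is max(a[i], b[i]). */
reg512 model_max_epi32(reg512 a, reg512 b)
{
    reg512 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] > b.lane[i] ? a.lane[i] : b.lane[i];
    return r;
}

/* Compare-exchange across two registers. Both results are computed from the
   old values before either register is overwritten (R1' and R2' above). */
void compare_exchange_regs(reg512 *r1, reg512 *r2)
{
    reg512 lo = model_min_epi32(*r1, *r2);
    reg512 hi = model_max_epi32(*r1, *r2);
    *r1 = lo;
    *r2 = hi;
}
```

Lane i of one register only ever meets lane i of the other, which is why two elements stored in the same register cannot be compared directly.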
Restriction of Memory Accessing
• Permutation and mask instructions
[Figure: elements 1 to 8 held in registers R1 and R2]
R2 = _mm512_shuffle_epi32(R2,<2 1 3 4>);
R1’ = _mm512_min_epi32(R1,R2);
R2’ = _mm512_max_epi32(R1,R2);
R1’ = _mm512_mask_min_epi32(R1,0b1010,R1,R2);
R2’ = _mm512_mask_max_epi32(R2,0b0101,R1,R2);
R1’ = _mm512_mask_max_epi32(R1,0b0101,R1,R2);
R2’ = _mm512_mask_min_epi32(R2,0b1010,R1,R2);
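The masked forms can be modelled the same way. The sketch below assumes the usual write-mask behaviour: a lane whose mask bit is set receives the min (or max) of the two inputs, and every other lane is copied from the first argument. The type `reg512` and all function names are hypothetical; bit i of the mask is taken to control lane i.

```c
#include <stdint.h>

#define LANES 16  /* 32-bit lanes in one 512-bit register */

/* Hypothetical scalar stand-in for a 512-bit register. */
typedef struct { int32_t lane[LANES]; } reg512;

/* Model of a masked minimum: lanes with mask bit 1 get min(a, b),
   lanes with mask bit 0 keep the value from src. */
reg512 model_mask_min_epi32(reg512 src, uint16_t k, reg512 a, reg512 b)
{
    reg512 dst;
    for (int i = 0; i < LANES; i++) {
        int32_t m = a.lane[i] < b.lane[i] ? a.lane[i] : b.lane[i];
        dst.lane[i] = ((k >> i) & 1u) ? m : src.lane[i];
    }
    return dst;
}

/* Model of a masked maximum, with the same mask convention. */
reg512 model_mask_max_epi32(reg512 src, uint16_t k, reg512 a, reg512 b)
{
    reg512 dst;
    for (int i = 0; i < LANES; i++) {
        int32_t m = a.lane[i] > b.lane[i] ? a.lane[i] : b.lane[i];
        dst.lane[i] = ((k >> i) & 1u) ? m : src.lane[i];
    }
    return dst;
}

/* Lanes selected by min_mask receive min(r1, r2); all other lanes receive
   max(r1, r2). Two masked operations with complementary masks. */
reg512 min_max_by_mask(reg512 r1, reg512 r2, uint16_t min_mask)
{
    reg512 out = model_mask_min_epi32(r1, min_mask, r1, r2);
    return model_mask_max_epi32(out, (uint16_t)~min_mask, r1, r2);
}
```

With complementary masks, one pair of masked instructions lets different lanes of the same register sort in different directions, which a bitonic network needs in its early stages.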
1 Register Method
• Sort 𝐾 = 16 elements
• 𝐾: number of elements stored in one register
• Idea: copy R1 to R2 so that comparison becomes valid
[Figure, left: 4-element sorting network of bitonic merge, steps 1 to 3]
[Figure, right: 4-element sorting network of the 1 Register Method; R1 (elements 1 to 4) is copied to R2]
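The idea can be sketched in scalar C. This is an assumed reading of the method, not the authors' code: each step builds a permuted copy of R1 in which lane i holds the partner of element i, then each lane keeps either the minimum or the maximum according to a mask. The function name `one_register_sort` and the mask construction are illustrative.

```c
#include <stdint.h>

#define LANES 16  /* K: elements held in one register */

/* Scalar model of the 1 register method: sorts r1 ascending in place.
   Each step corresponds to one permutation plus masked min and max. */
void one_register_sort(int32_t r1[LANES])
{
    for (unsigned k = 2; k <= LANES; k <<= 1) {          /* stage */
        for (unsigned j = k >> 1; j > 0; j >>= 1) {      /* step */
            int32_t r2[LANES];
            unsigned min_mask = 0;
            for (unsigned i = 0; i < LANES; i++) {
                r2[i] = r1[i ^ j];   /* permuted copy: lane i sees its partner */
                /* Lane i keeps the minimum if it is the lower lane of an
                   ascending pair or the upper lane of a descending pair. */
                if (((i & k) == 0) == ((i & j) == 0))
                    min_mask |= 1u << i;
            }
            for (unsigned i = 0; i < LANES; i++) {
                int32_t lo = r1[i] < r2[i] ? r1[i] : r2[i];
                int32_t hi = r1[i] < r2[i] ? r2[i] : r1[i];
                r1[i] = ((min_mask >> i) & 1u) ? lo : hi;
            }
        }
    }
}
```

Every pair is evaluated in both of its lanes, once to produce the minimum and once to produce the maximum, so half of the comparison work is redundant.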
2 Register Method: Invalid Situation
• Sort 2𝐾 elements
• Data are loaded into R1 and R2
[Figure: first 2 steps of bitonic merge sort. The loaded data can be considered as R1 = (1, 3, 5, 7) and R2 = (2, 4, 6, 8); step 1 is then valid, but step 2 is invalid]
Compare and Exchange
• Exchange: swap the data in two positions
• Compare + Exchange = opposite direction compare
[Figure: a compare followed by an exchange on elements 1 to 4 is equivalent to a compare in the opposite direction. Legend: exchange; opposite direction compare]
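The equivalence is easy to check on a single pair of positions. The sketch below uses hypothetical names (`slot_pair`, `compare_ascending`, `compare_descending`, `exchange_slots`) and plain integers in place of register lanes.

```c
/* Two positions holding one value each. */
typedef struct { int first; int second; } slot_pair;

/* Swap the contents of the two positions. */
slot_pair exchange_slots(slot_pair p)
{
    slot_pair r = { p.second, p.first };
    return r;
}

/* Ascending compare: the smaller value ends up in .first. */
slot_pair compare_ascending(slot_pair p)
{
    return (p.first > p.second) ? exchange_slots(p) : p;
}

/* Descending (opposite direction) compare: the larger value ends up in .first. */
slot_pair compare_descending(slot_pair p)
{
    return (p.first < p.second) ? exchange_slots(p) : p;
}
```

For any input pair, `exchange_slots(compare_ascending(p))` produces the same result as `compare_descending(p)`, so an exchange can be folded into the compare by reversing its direction.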
2 Register Method: Solving the Invalid Situation
• Look one step ahead.
• Exchange positions to prepare for the next step.
[Figure: R1 = (1, 3, 5, 7) and R2 = (2, 4, 6, 8), shown against elements 1 to 8, before and after the exchange. Legend: old direction; actual direction; opposite direction compare]
2 Register Method Sorting Network
• Sorting network for 8 elements.
• Red line represents permutation.
• Dashed line means opposite direction compare.
[Figure: 2 register method sorting network for 8 elements, with R1 = (1, 3, 5, 7) and R2 = (2, 4, 6, 8) at the start; steps 1 to 6 plus one additional step]
Summary of two methods
Algorithm                        1 register method         2 register method
Sort length                      𝐾                         2𝐾
# of steps                       log 𝐾 (log 𝐾 + 1) / 2     log 2𝐾 (log 2𝐾 + 1) / 2 + 1
# of instructions per step       3                         5
# of instructions when sorting
32 elements, 𝐾 = 16              86                        77
• 𝐾: the number of elements processed by one SIMD instruction
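The step counts in the table can be evaluated directly. The sketch below assumes the logarithm is base 2 and that 𝐾 is a power of two; the function names are illustrative. It computes step counts only and does not attempt to derive the instruction totals (86 and 77) reported in the table.

```c
#include <assert.h>

/* Base-2 logarithm of a power of two. */
static unsigned log2_exact(unsigned n)
{
    unsigned l = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        l++;
    }
    return l;
}

/* Steps of the 1 register method for K elements: log K (log K + 1) / 2. */
unsigned steps_one_register(unsigned k)
{
    unsigned l = log2_exact(k);
    return l * (l + 1) / 2;
}

/* Steps of the 2 register method for 2K elements:
   log 2K (log 2K + 1) / 2 + 1, the +1 being the additional final step. */
unsigned steps_two_register(unsigned k)
{
    unsigned l = log2_exact(2 * k);
    return l * (l + 1) / 2 + 1;
}
```

For 𝐾 = 16 this gives 10 steps for the 1 register method and 16 steps for the 2 register method. For 𝐾 = 4 the 2 register method takes 7 steps, matching the 6 steps plus one additional step of the 8-element network.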
Whole diagram of the algorithm
[Diagram: the input sequence is divided among cores #1 to #4. Each core runs Register Level Sort followed by Recursive Merge. The sorted sequences are then merged in parallel by 2 cores, and finally in parallel by 4 cores, giving the sorted sequence]
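The overall structure is a merge sort: sort small blocks, then merge sorted runs into longer ones. The sketch below shows that structure only, under clearly different assumptions from the diagram: it is sequential, it uses `qsort` as a stand-in for the register level sort, and it merges with a plain scalar loop rather than a register level or multi-core parallel merge. The name `block_merge_sort` and the `block` parameter are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Merge sorted src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, o = lo;
    while (i < mid && j < hi)
        dst[o++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[o++] = src[i++];
    while (j < hi) dst[o++] = src[j++];
}

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Sorts data[0..n) ascending. Returns 0 on success, -1 on bad arguments
   or if the scratch buffer cannot be allocated. */
int block_merge_sort(int *data, size_t n, size_t block)
{
    if (block == 0) return -1;
    if (n == 0) return 0;
    int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;

    /* Phase 1: sort each block on its own (stand-in for register level sort). */
    for (size_t lo = 0; lo < n; lo += block) {
        size_t len = (n - lo < block) ? n - lo : block;
        qsort(data + lo, len, sizeof *data, cmp_int);
    }

    /* Phase 2: merge neighbouring sorted runs, doubling the run width. */
    int *src = data, *dst = tmp;
    for (size_t width = block; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *t = src; src = dst; dst = t;   /* newest runs are now in src */
    }
    if (src != data) memcpy(data, src, n * sizeof *data);
    free(tmp);
    return 0;
}
```

In the diagram, the merges inside one pass are independent of each other, which is what allows them to be spread over cores; the final merges are themselves split across 2 and then 4 cores.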
Performance
• Experiment on Xeon Phi 5100
• AVX-512 contributes a 3.7 times speedup.
[Chart: million elements sorted per second (0 to 600) against log of input data size (18 to 26). Series: Xeon Phi Our Approach, CPU i7 radix, CPU i7 merge, Xeon Phi Radix. Data labels shown: 525.6 and 560]
Some data are taken from [Nadathur Satish et al. 2010]
Performance
[Chart: million elements sorted per second (0 to 1,200) against log of input data size (18 to 26). Series: Xeon Phi Our Approach, CPU i7 radix, CPU i7 merge, Xeon Phi Radix, GPU GTX TITAN Merge, GPU GTX TITAN radix. Data labels shown: 525.6 and 560]
Some data are taken from [Nadathur Satish et al. 2010]
Conclusion
• Proposed two register level sorts:
• 1 register method
• 2 register method
• A parallel sort implementation on Xeon Phi
• Q&A
Contact: xchen.tian@gmail.com


Editor's Notes

  1. Good morning everyone, I am Tian Xiaochen from the University of Tokyo. Today I am going to introduce a register level sort algorithm on multi-core SIMD processors, mainly on Xeon Phi.
  2. Xeon Phi is one of the cutting-edge accelerators. It has 60 cores and supports AVX-512 instructions. AVX-512 is a very new SIMD instruction set. It allows processing 16 integers simultaneously.
  3. Sorting is a fundamental algorithm. We believe it is necessary to parallelize it on this new hardware. Obviously, utilizing the 60 cores and AVX-512 is important. Our study focuses especially on sorting with AVX-512 in registers only. One challenge is that the memory access pattern within the SIMD registers is not exactly the same as in PRAM and is quite different from GPUs. We will explain it later. We proposed two algorithms for sorting in registers on this kind of new instruction set: the 1 register method and the 2 register method, both based on the bitonic merge sorting network.
  4. This is the sorting network for 8 elements. An arrow means: compare two elements and put the minimum on one side and the maximum on the other.
  5. Here is one example of the sorting network. The elements should be sorted from small to large. For instance, 19 is compared with 67; the larger one, 67, is put in position 3 and the smaller one, 19, in position 4.
  6. Next let’s talk about the restriction of memory accessing. In AVX-512 an operation can only be done between registers. An operation between elements within one register is invalid.
  7. AVX-512 also supports permutation and mask instructions. With their help, we can do compares like this.
  8. Now we introduce the 1 register method. It does not mean we use only 1 register. It means we sort only K elements, the contents of one register. In the case of AVX-512, K is 16. The idea is very simple: make a copy of the elements in another register so that the elements can be compared with each other. You can see that the network on the right side follows the same logic as the one on the left side. The weakness is obvious: each pair is compared twice.
  9. To reduce redundant compares, we developed the 2 register method. This time we sort 2K elements without copying. Initially the data are loaded into R1 and R2. The first step would be an invalid situation because of the restriction. However, the initial index of each element can be considered to be in any order, since the elements are not sorted at all. Then the first step is valid, but the 2nd step is invalid again.
  10. To solve that problem, exchanging positions is necessary. This slide shows the relationship between exchange and compare. If we apply a compare and then an exchange, it is equivalent to comparing the elements in the opposite direction.
  11. The solution is to look one step ahead. If we find that some elements should be compared in the next step but are stored in the same register, we exchange them so that no invalid situation occurs. Here we exchange 3 with 4 and 7 with 8, and the 2nd step becomes valid.
  12. Then we derived the 2 register method sorting network. Notice that one additional step is required at the end, because we need to place the elements in the order 1, 2, 3, 4 up to 8.
  13. This is the whole diagram of our algorithm. Basically it is a merge sort algorithm. We employed a partition and parallel merge algorithm at the multi-core level. All the merge procedures employ register level merge. Register level merge is just like the register level sort we introduced, but uses only the last few steps of the sorting network.
  14. Then we ran the experiment on a Xeon Phi 5100, sorting integers.
  15. Performance compared with GPU.
  16. So today I introduced two methods of applying SIMD instructions to a sorting algorithm. We also implemented the algorithm.