6. • R1, R2 are two registers
1
2
3
4
5
6
7
8
comparison between
elements in the same register
is invalid!
R1
R2
R1’ = _m512_min_epi32(R1,R2);
R2’ = _m512_max_epi32(R1,R2);
Restriction of Memory Accessing
8. 1 Register Method
• Sort 𝐾 = 16 elements,
• 𝐾 : number of elements stored in one register
• Idea: copy R1 to R2 so that comparison becomes valid
1
2
3
4
1
2
3
4
Copy
1
2
3
4
4 elements
sorting network
of bitonic merge
Step 1 2 3
4 elements sorting network of 1 Register Method
9. 2 Register Method: Invalid Situation
• Sort 2𝐾 elements
• Data are loaded into R1 and R2
Can be
considered
as
Step 1 2(invalid)
First 2 steps of Bitonic merge
sort
R1
R2
1
3
5
7
2
4
6
8
1
2
3
4
5
6
7
8
10. Compare and Exchange
• Exchange: exchange two positions’ data
• Compare + Exchange = opposite direction compare
1
2
3
4
1
2
3
4 equivalent
to
1
2
3
4
1
2
3
4
:exchange : opposite direction compare
11. 2 Register Method: Solution of Invalid
• Look one step ahead.
• Exchange the position and prepare for next step.
1
3
5
7
2
4
6
8
R1
R2
1
2
3
4
5
6
7
8
:old direction :actual direction :opposite direction compare
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
Good morning everyone, I am Tian Xiaochen from the university of Tokyo. Today I am going to introduce a Register Level Sort Algorithm on Multi-Core SIMDProcessors, mainly on Xeon Phi.
Xeon Phi is one of the cutting edge accelerators. It has 60 cores and supports AVX-512 instructions.
AVX-512 is a very new SIMD instruction set. It allows processing 16 integers simultaneously.
Sorting is a fundamental algorithms. We believe it is necessary to parallelize it on this new hardware. Obliviously utilizing the 60 cores and AVX-512 is important.
Our study especially focus on sorting with AVX-512 only in registers.
One challenge is that Memory Accessing pattern within those SIMD registers is not exactly the same as PRAM and quite different from GPU. We will explain it later.
We proposed two algorithms for sorting in registers on this kind of new instruction set: 1 register method and 2 register method, based on bitonic merge sorting network.
This is a the sorting network for 8 elements. The arrow means compare two elements and put minimum on one side and maximum on the other.
Here is one example of sorting network. The elements should be sorted from small to large. For instance 19 compares with 67, put the larger one 67 in position 3, and smaller one 19 in position 4.
Next let’s talk about the restriction of memory accessing. In AVX-512 the calculation can only be done between registers, within one register is not OK. It is invalid
AVX512 also supports permutation and mask instructions. With their help, we can do compare like this.
Now we introduce the 1 register method. It does not mean we only use 1 register. It means only to sort one K elements. In the case of AVX-512 it is 16. The idea is very simple. Make a copy of elements to another register so that elements can be compared with each other. You can see the network on the right side follow the same logic on the left side. The weak ness is oblivious, each pair is compared twice.
To reduce redundant compare, we developed 2 register method. This time we sort 2K elements without copying. Initially the data are loaded into R1 and R2. But the first step is the invalid situation due to the restriction. However the initially index of each element can be consider as any order since they are not sorted at all. Then first step is OK, but invalid again at 2nd step.
To solve that problem, exchanging positions is necessary. This slide show the relationship between exchange and compare.
If we apply compare and then exchange, it is equivalent to compare some elements in opposite direction.
The solution is looking one step ahead. If we find that some elements should be compared in next step but be stored in the same register, exchange them, so that no invalid occurs. Here we exchange 3 with 4, 7 with 8, the 2nd step become valid.
Then we derived 2 register method sorting network. Notice that at last one additional step is required. Because we need to place the elements in the order of 1 2 3 4 to 8.
This is the whole diagram of our algorithm. Basically it is a merge sort algorithm. We employed partition and parallel merge algorithm in the multicore level. All the merge procedures employs register level merge. Register level merge is just like register level sort we introduced, but the last few steps of sorting network.
Then we do the experiment on Xeon Phi 5100 of sorting integers.
Performance compare with GPU.
So today I introduced two method of applying SIMD instruction on sort algorithm.
We also implement the algorithm.