Register Sort Xeon Phi final

REGISTER LEVEL SORT
ALGORITHM ON MULTI-CORE SIMD
PROCESSORS
Xiaochen Tian, Kamil Rocki and Reiji Suda
The University of Tokyo
Suda Lab

Xeon Phi and AVX-512
• Xeon Phi:
• 60 cores.
• 240 threads with Hyperthreading.
• AVX-512 supported
• AVX-512
• 512-bit registers
• SIMD 16 x 32-bit integers operation

Comparison Sorting on Xeon Phi
• Purpose:
• Utilize 60 cores
• Utilize AVX-512
• Focus on: sorting in registers
• Challenge:
• Restriction of registers’ memory accessing.(explained later)
• Proposed Algorithm:
• based on Bitonic Merge Sorting network
• 1 register method

Bitonic Merge Sorting Network
• Numbers in the box represent the indices of elements
2
3
4
5
6
7
8
step: 1 2 3 4 5 6
stage: 1 2 3
1

Example of Bitonic Merge Sort
24
step: 1 2 3 4 5 6
76
19
67
53
13
21
37
24
76
67
19
13
53
37
21
24
19
67
76
37
53
13
21
19
24
67
76
53
37
21
13
19
24
21
13
53
37
67
76
19
13
21
24
53
37
67
76
13
19
21
24
37
53
67
76
2
3
4
5
6
7
8
1

• R1, R2 are two registers
1
2
3
4
5
6
7
8
comparison between
elements in the same register
is invalid!
R1
R2
R1’ = _m512_min_epi32(R1,R2);
R2’ = _m512_max_epi32(R1,R2);
Restriction of Memory Accessing

• Permutation and mask instruction
1
2
3
4
5
6
7
8
R1
R2
R2 = _mm512_shuffle_epi32(R2,<2 1 3 4>);
R1’ = _mm512_min_epi32(R1,R2);
R2’ = _mm512_max_epi32(R1,R2);
Restriction of Memory Accessing
R1’ = _mm512_mask_min_epi32
(R1,0b1010,R1,R2);
R2’ = _mm512_mask_max_epi32
(R2,0b0101,R1,R2);
R1’ = _mm512_mask_max_epi32
(R1,0b0101,R1,R2);
R2’ = _mm512_mask_min_epi32
(R2,0b1010,R1,R2);

1 Register Method
• Sort 𝐾 = 16 elements,
• 𝐾 : number of elements stored in one register
• Idea: copy R1 to R2 so that comparison becomes valid
1
2
3
4
1
2
3
4
Copy
1
2
3
4
4 elements
sorting network
of bitonic merge
Step 1 2 3
4 elements sorting network of 1 Register Method

2 Register Method: Invalid Situation
• Sort 2𝐾 elements
• Data are loaded into R1 and R2
Can be
considered
as
Step 1 2(invalid)
First 2 steps of Bitonic merge
sort
R1
R2
1
3
5
7
2
4
6
8
1
2
3
4
5
6
7
8

Compare and Exchange
• Exchange: exchange two positions’ data
• Compare + Exchange = opposite direction compare
1
2
3
4
1
2
3
4 equivalent
to
1
2
3
4
1
2
3
4
:exchange : opposite direction compare

2 Register Method: Solution of Invalid
• Look one step ahead.
• Exchange the position and prepare for next step.
1
3
5
7
2
4
6
8
R1
R2
1
2
3
4
5
6
7
8
:old direction :actual direction :opposite direction compare
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8

2 Register Method Sorting Network
• Sort network of 8 elements.
• Red line represents permutation.
• Dashed line means opposite direction compare.
1
3
5
7
2
4
6
8
Step 1 2 3 4 5 6 additional step
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
1
3
5
7
2
4
6
8
1
7
5
3
6
4
2
8
1
7
5
3
6
4
2
8

Summary of two methods
Algorithm 1 register method 2 register method
Sort length 𝐾 2𝐾
# of steps log 𝐾 log 𝐾 + 1
2
log 2𝐾 log 2𝐾 + 1
2
+ 1
# of instructions per step 3 5
# of instructions when
sort 32 elements, 𝐾=16
86 77
• K: the length of SIMD instruction

Whole diagram of the algorithm
Core #2
Input sequence
Core #1
Register Level Sort
Recursive
Merge
Parallel merge sorted sequence
by 2 cores
Sorted sequence
Recursive
Merge
Core #3
Recursive
Merge
Core #4
Recursive
Merge
Parallel merge sorted sequence
by 2 cores
Parallel merge sorted sequence by 4 cores

Performance
• Experiment on Xeon Phi 5100
• AVX-512 contributes 3.7 times speed up.
525.6
560
0.0
100.0
200.0
300.0
400.0
500.0
600.0
18 20 22 24 26
MillionElementsSortedpersecond
Xeon Phi Our Approach
CPU i7 radix
CPU i7 merge
Xeon Phi Radix
log of Input data sizeSome data refers to [Nadathur Satish et al. 2010]

Performance
525.6
560
0.0
200.0
400.0
600.0
800.0
1,000.0
1,200.0
18 20 22 24 26
MillionElementsSortedpersecond
Xeon Phi Our Approach
CPU i7 radix
CPU i7 merge
Xeon Phi Radix
GPU GTX TITAN Merge
GPU GTX TITAN radix
log of Input data size
Some data refers to [Nadathur Satish et al. 2010]

Conclusion
• Proposed two register level sort:
• An parallel sort implementation on Xeon Phi

•Q&A
Contact: xchen.tian@gmail.com

Register Sort Xeon Phi final

Recommended

Recommended

More Related Content

Similar to Register Sort Xeon Phi final

Similar to Register Sort Xeon Phi final (20)

Register Sort Xeon Phi final

Editor's Notes