A Methodology for Automatic GPU Kernel Optimization

POLITECNICO DI MILANO
Master of Science in Engineering of Computing Systems
Department of Electronics, Informatics and Bioengineering
Master thesis of:
Alberto Zeni - 884540
A Methodology for Automatic GPU Kernel Optimization
Advisor:
Ing. Marco D. Santambrogio
Co-advisor:
Dott. Ing. Lorenzo Di Tucci
December 18th 2019
Room 5.0.2 - Politecnico di Milano

Context Definition
[Figure: 40 Years of Microprocessor Trend Data, 1980 to 2020, logarithmic vertical axis from 10^2 to 10^7. Callout: 1000x by 2025]

Thesis Contributions
● We propose a methodology that guides the user in developing highly optimized GPU kernels
● We demonstrate the usefulness of our methodology by implementing it in a semi-automatic tool for kernel optimization
● We show the results of applying our methodology to two highly computationally intensive algorithms

Methodology application
[Figure: Optimization Flow diagram. Blocks: Unoptimized source code, Roofline Generator, Roofline and Performance Analyzer, Compiler Optimizer, Source Code Parser, Optimized source code]

Rooﬂine Model Adaptation
● Model built on the characteristics of the GPU and of the algorithm executed, independent of the algorithm implementation
Quantities used by the model:
● number of iterations
● GPU core frequency
● number of operations to be computed at iteration i
● number of blocks
● number of scheduled threads per block
● number of integer cores
● number of streaming multiprocessors
● maximum number of blocks per streaming multiprocessor
[Equations: the symbols and the two formulas of the model are not recoverable from the slide]
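For reference, the classic roofline bound that this adaptation builds on can be sketched in a few lines of Python. This is the textbook model, not the adapted formulas of the thesis, which are not reproduced in these slides. The function names and the peak and bandwidth figures are illustrative assumptions, not measurements of any specific GPU.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Classic roofline: a kernel is capped either by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity at which a kernel stops being memory bound."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical device figures, chosen only to keep the arithmetic readable.
PEAK = 10e12       # integer operations per second (assumed)
BANDWIDTH = 1e12   # bytes per second (assumed)

memory_bound = attainable_performance(PEAK, BANDWIDTH, 2)    # left of the ridge
compute_bound = attainable_performance(PEAK, BANDWIDTH, 50)  # right of the ridge
```

A kernel whose measured performance sits far below the bound at its operational intensity has headroom, which is the signal the analysis step looks for.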

Source Code Parser
● Automatically unrolls loops if possible
● Automatically changes the memory hierarchy
● Automatically changes the number of scheduled threads
● Automatically changes the number of scheduled blocks
● Creates a report of the optimizations that can be applied manually
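The slides do not describe how the parser rewrites the source. As a purely illustrative sketch of the first item, the Python function below inserts the `#pragma unroll` compiler hint before each `for` loop of a CUDA source string. The function name and the line-based matching are assumptions; a real tool would need a proper parser to decide whether unrolling is legal and worthwhile.

```python
import re

# Matches a line that starts a C-style for loop, capturing its indentation.
FOR_LOOP = re.compile(r"(\s*)for\s*\(")


def add_unroll_pragmas(cuda_source):
    """Insert '#pragma unroll' before every for loop that is not already annotated."""
    out = []
    previous = ""
    for line in cuda_source.splitlines():
        match = FOR_LOOP.match(line)
        if match and previous.strip() != "#pragma unroll":
            out.append(match.group(1) + "#pragma unroll")
        out.append(line)
        previous = line
    return "\n".join(out)


kernel = (
    "__global__ void add_one(int *a) {\n"
    "    for (int i = 0; i < 4; ++i)\n"
    "        a[i] += 1;\n"
    "}"
)
annotated = add_unroll_pragmas(kernel)
```

Running the function on its own output leaves the source unchanged, so the transformation can be applied repeatedly without stacking pragmas.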

Smith-Waterman Algorithm
● Optimal algorithm for local sequence alignment
● Execution time scales with the length of the aligned sequences
Score matrix for the sequences AGGGTCAA (rows) and ACCTAGGA (columns):
        A  C  C  T  A  G  G  A
     0  0  0  0  0  0  0  0  0
  A  0  1  0  0  0  1  0  0  1
  G  0  0  0  0  0  0  2  1  0
  G  0  0  0  0  0  0  1  3  1
  G  0  0  0  0  0  0  1  2  2
  T  0  0  0  0  1  0  0  0  1
  C  0  0  1  1  0  0  0  0  0
  A  0  1  0  0  0  1  0  0  1
  A  0  1  0  0  0  1  0  0  1
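A minimal Python sketch of the Smith-Waterman recurrence follows. The scoring values (match +1, mismatch -1, linear gap -2) are an assumption: the slide does not state them, and they were chosen because they reproduce the score matrix for AGGGTCAA and ACCTAGGA shown on this slide. The function name is illustrative.

```python
def smith_waterman(rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill the local-alignment score matrix H with a linear gap penalty."""
    m, n = len(rows), len(cols)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            h[i][j] = max(
                0,                      # local alignment never goes negative
                h[i - 1][j - 1] + step,  # align the two characters
                h[i - 1][j] + gap,       # gap in the column sequence
                h[i][j - 1] + gap,       # gap in the row sequence
            )
    return h


matrix = smith_waterman("AGGGTCAA", "ACCTAGGA")
best_score = max(max(row) for row in matrix)
```

Every cell depends only on its upper, left, and upper-left neighbours, which is the property the parallel versions exploit.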

X-Drop Algorithm
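[Figure: X-Drop score matrix for the sequences AGGGTCAA (rows) and ACCTAGGA (columns). Cells marked -∞ are the ones dropped by the X-Drop condition. Visible values, row by row: (0, -1, -2, -∞), (-1, 1, -1, -∞), (-2, -1, 0, -∞), (-∞, -∞, -∞)]
● X-Drop termination offers a good trade-off between speed and accuracy
● Very efficient if the two sequences do not align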
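The pruning idea can be sketched in Python as follows. This is a simplified illustration, not the thesis's GPU kernel: it extends an alignment from the start of both sequences one anti-diagonal at a time, drops every cell whose score falls more than `x` below the best score seen so far, and stops as soon as a whole anti-diagonal has been dropped. The function name and the scoring values are assumptions.

```python
import math

NEG_INF = -math.inf


def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Return (best score, last anti-diagonal visited) with X-drop pruning."""
    m, n = len(a), len(b)
    score = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    best, last = 0, 0
    for d in range(1, m + n + 1):            # cells with i + j == d
        last = d
        alive = False
        new_best = best
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            cand = NEG_INF
            if i > 0 and j > 0:
                step = match if a[i - 1] == b[j - 1] else mismatch
                cand = max(cand, score[i - 1][j - 1] + step)
            if i > 0:
                cand = max(cand, score[i - 1][j] + gap)
            if j > 0:
                cand = max(cand, score[i][j - 1] + gap)
            if cand < best - x:
                cand = NEG_INF                # dropped: too far below the best
            score[i][j] = cand
            if cand > NEG_INF:
                alive = True
                new_best = max(new_best, cand)
        best = new_best
        if not alive:
            break                             # every cell dropped: stop early
    return best, last
```

With identical inputs the extension runs to the end; with sequences that share nothing, it gives up after a few anti-diagonals, which is where the speed advantage on non-aligning pairs comes from.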

GPU implemented optimizations
● The two algorithms follow the same computational pattern
● We started with a simple implementation of the algorithms using a single thread and a single block
● We followed our methodology, with the help of our tool, to optimize the algorithms at different levels and introduce inter-level and intra-level parallelism
[Figure: two score matrices for the sequences ACGG (columns) and ATTC (rows). Visible values of the first matrix, row by row: (1, -1, -3, -4), (-1, 0, -2), (-3)]

Intra Level Parallelism
● Cells on the same anti-diagonal are computed in parallel
● Each GPU thread computes a single cell, as our methodology suggested
● Anti-diagonals are split into segments so that sequences of any length can be aligned
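A sequential Python emulation of this traversal is sketched below. The outer loop walks the anti-diagonals, the middle loop splits each anti-diagonal into segments of at most `threads` cells, and the inner loop stands in for the threads that would each compute one cell at the same time on the GPU. The function name, the `threads` parameter, and the scoring values (match +1, mismatch -1, gap -2) are assumptions made for illustration.

```python
def wavefront_score(a, b, threads=4, match=1, mismatch=-1, gap=-2):
    """Best local-alignment score, computed anti-diagonal by anti-diagonal."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):                  # cells with i + j == d
        lo, hi = max(1, d - n), min(m, d - 1)      # rows present on this diagonal
        for start in range(lo, hi + 1, threads):   # one segment per pass
            # On a GPU, the cells of a segment would be computed concurrently.
            for i in range(start, min(start + threads, hi + 1)):
                j = d - i
                step = match if a[i - 1] == b[j - 1] else mismatch
                h[i][j] = max(
                    0,
                    h[i - 1][j - 1] + step,
                    h[i - 1][j] + gap,
                    h[i][j - 1] + gap,
                )
                best = max(best, h[i][j])
    return best
```

Each cell reads only from the two previous anti-diagonals, so the cells inside a segment never depend on one another, and the result does not change with the segment size.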

Inter Level parallelism
● Parallel execution of the alignments with multiple blocks
● Each block is assigned one alignment

GPU memory optimizations
● To ensure coalesced memory accesses, one of the sequences is stored backwards on the GPU
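The effect of storing one sequence backwards can be seen from the indices alone. In the Python sketch below, consecutive threads on anti-diagonal `d` handle rows `i`, `i + 1`, and so on, so the column index `j = d - i` decreases from thread to thread. Reading the second sequence from a reversed copy turns that into an increasing index, which matches the direction of the thread index. The function name is an assumption, and the sketch only illustrates the index arithmetic, not actual GPU memory behaviour.

```python
def diagonal_indices(d, m, n):
    """Indices into the second sequence read by consecutive threads on diagonal d."""
    lo, hi = max(1, d - n), min(m, d - 1)
    forward, backward = [], []
    for i in range(lo, hi + 1):      # one iteration per thread
        j = d - i
        forward.append(j - 1)        # index into b as stored normally
        backward.append(n - j)       # index of the same character in b[::-1]
    return forward, backward


b = "ACGT"
forward, backward = diagonal_indices(5, 4, len(b))
same_characters = [b[k] for k in forward] == [b[::-1][k] for k in backward]
```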

Evaluation Settings
Benchmarked Applications:
● SeqAn: state-of-the-art library that includes a highly optimized version of X-Drop, run on 176 threads
● ksw2: state-of-the-art CPU SIMD implementation of Z-drop, run on 80 threads
● Bowtie2: state-of-the-art Smith-Waterman implementation, run on 64 threads
● CUDASW++ 3.0: state-of-the-art Smith-Waterman GPU + CPU SIMD implementation

Evaluation Settings
Platforms:
● Intel Haswell nodes: 2 Intel® Xeon™ E5-2698 processors (64 threads) with 128 GB of RAM
● Intel Skylake nodes: 2 Intel® Xeon™ Gold 6148 processors (80 threads) with 384 GB of RAM
● IBM Power 9 nodes: 2 IBM Power 9 processors (168 threads) with 512 GB of RAM
● GPU: NVIDIA Tesla V100 GPUs with 16 GB of HBM2 memory

Smith-Waterman Unoptimized Roofline
[Figure: roofline plot]

Smith-Waterman Optimized Roofline
[Figure: roofline plot]

Smith-Waterman Comparison
[Figure: speed-up comparison chart. Bar labels: 34x, 3x, 1x, 11x, 1x]

X-drop Unoptimized Roofline
[Figure: roofline plot]

X-drop GPU and SeqAn Comparison
[Figure: speed-up comparison chart. Labels: 2x, 6x]

X-drop GPU and ksw2 Comparison
[Figure: speed-up comparison chart. Labels: 1.5x, 120x]

Conclusions
We presented a methodology for automatic GPU kernel optimization and its implementation in a tool for automatic kernel optimization
We applied our methodology to two highly computationally intensive algorithms
Optimized GPU X-drop implementation with:
● More than 6.6x speed-up with respect to SeqAn
● More than 120x speed-up with respect to ksw2
Optimized GPU Smith-Waterman implementation with:
● More than 34x speed-up with respect to Bowtie2
● More than 3x speed-up with respect to CUDASW++ 3.0

Thank you for your attention

Algorithm Computation
● Computation flow of the Smith-Waterman and X-drop algorithms

Intra Level Parallelism
● Parallel reduction to find the max of the anti-diagonal using warp instructions
-1 2 5 0 -1 -2 3 1
-1 2 5 1
5 2
5
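The reduction tree above can be reproduced with a short sequential emulation in Python. At each step, element `k` is combined with the element half a level away, which is the access pattern usually expressed on NVIDIA GPUs with warp shuffle intrinsics such as `__shfl_down_sync`. The function name is an assumption, and the sketch assumes a power-of-two input length.

```python
def tree_max(values):
    """Pairwise max reduction; returns every level of the tree, input first."""
    level = list(values)
    levels = [level]
    while len(level) > 1:
        half = len(level) // 2
        # Lane k keeps the larger of its own value and the value of lane k + half.
        level = [max(level[k], level[k + half]) for k in range(half)]
        levels.append(level)
    return levels


levels = tree_max([-1, 2, 5, 0, -1, -2, 3, 1])
maximum = levels[-1][0]
```

The number of steps grows with the logarithm of the number of values, so eight values need three steps.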

Methodology Overview
[Figure: Analysis Flow diagram. Steps: unoptimized source code; plot Roofline and collect performance metrics; analyze kernel performance and generate optimizations; apply optimizations to the kernel; optimized source code]

GPU Architecture
● Thousands of cores that operate in a SIMD fashion, as opposed to multiple sequential cores

GPU Programming Model
[Figure: the host machine launches a kernel on the GPU as a grid of blocks, Block (0,0) to Block (1,6). Each block contains a set of threads, Thread (0,0) to Thread (4,2)]

X-Drop Algorithm
● Optimized to follow the alignment of the two sequences and stop when the two do not align
