Recent developments in Graphics Processing Units (GPUs) have opened new opportunities for harnessing their computing power as a general-purpose computing paradigm through the CUDA parallel programming model. However, porting applications to CUDA remains a challenge for average programmers. We have developed a restructuring software compiler (RT-CUDA) with the best possible kernel optimizations to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA is based on a set of compiler optimizations: it takes a C-like program and converts it into an optimized CUDA kernel, guided by user directives in a configuration file. While invoking external libraries is not possible with the OpenACC commercial compiler, RT-CUDA allows transparent invocation of highly optimized external math libraries such as cuSparse and cuBLAS. For this, RT-CUDA uses interfacing APIs, error-handling interpretation, and user-transparent programming, enabling the efficient design of linear algebra solvers (LAS). RT-CUDA has been evaluated on a Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as solvers of systems of linear equations such as Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers such as OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) that are critical for the Oil & Gas industry. We expect RT-CUDA to be useful to many industries performing science and engineering simulation on massively parallel computers such as NVIDIA GPUs.
RT-CUDA: A Software Tool for CUDA Code Restructuring
1. RT-CUDA (A SOFTWARE TOOL FOR CUDA CODE RESTRUCTURING)
By
Dr. Ayaz ul Hassan Khan
Email: ayazhk@gmail.com
2. FEATURES
Simplifies writing high-performance CUDA programs
Modular Approach: based on ANTLR framework
Tested on Fermi and Kepler Architectures
Easy to extend for supporting various architectures
Provides:
GPU Memory Optimizations, Kernel Configurations, Synchronization, and Data Transfer Mechanisms
GPU Resource Optimization:
Auto-tuning to find optimal set of CUDA kernel parameters
Generates an optimized CUDA parallel program from a given sequential C program
API functions to call highly optimized library routines for dense and sparse matrices
Synchronization primitives for inter-block synchronization
Supports multi-kernel conversions
4. OPTIMIZATION SPECIFICATIONS
Input/Output GPU Memory Allocation
Allocating memory for GPU input and output
Explicit transfer of data between host (CPU) and device (GPU)
Computation Partitioning and Decomposition
Problem iteration space partitioning
Block-level and thread-level parallelism
Appropriate block/tile size to fit in the cache/shared memory
Perform related transformations
5. OPTIMIZATION SPECIFICATIONS
Locality optimizations and Datacopy Transformations
Explicit copy of data into lower levels of the memory hierarchy
Utilize special memories such as constant and texture caches
Efficient shared memory and register file usages per thread block
Parallel Memory Bandwidth
Increased memory bandwidth by:
Coalesced global memory access
Bank-conflict-free shared memory access
6. OPTIMIZATION SPECIFICATIONS
Optimization of Architectural Parameters
To set optimal thread granularity, block size, grid size
Better resource management and machine occupancy
Required auto-tuning mechanism
Use of automatic compiler optimization and/or programmer-guided optimization
User choices for compiler optimizations
Synchronization across SMs
Avoiding expensive inter-block synchronization
No global synchronization mechanism exists in CUDA except kernel termination
7. OPTIMIZATION SPECIFICATIONS
Invocation of Optimized External Libraries
Routines optimized at a lower programming level
Examples:
cuBLAS for dense linear algebra
cuSparse for sparse arrays
Library details are hidden from the user
But this requires a full understanding of the parameters and the related implementation logic
9. RT-CUDA CODE TRANSFORMATION STRATEGY
Input/Output GPU Memory Allocation
Configuration File
Computation Partitioning and Decomposition
Locality Optimizations and Datacopy Transformations
Parallel Memory Bandwidth
Optimization of Architectural Parameters
Use of Automatic Compiler Optimization and/or Programmer-Guided Optimization
Synchronization across SMs
Invocation of Optimized External Libraries
C-Loop Optimizations (Loop Collapsing)
Array Transformations
Loop Partitioning
Block Merging
Block Skewing
Prefetching using Shared Memory
Parameters Tuning
Custom API Functions
Final Code Generation
10. RESTRUCTURING ALGORITHM
C-Function
C-Loop Optimizations (Loop Collapsing): merge the nested loops if they are independent and calculate array indices based on the new loop variable
Array Transformations (nD → 1D): map the array representation to the GPU's linear addressing space
Loop Partitioning: distribute the work among all CUDA threads based on block ID and thread ID → Naïve CUDA Kernels
CUDA Kernel Optimizations: transform each naïve CUDA kernel into a parameterized CUDA kernel by applying a set of optimizations → Parametric CUDA Kernels
Parameters Tuning: determine optimal parameter values for the generated parametric CUDA kernels → Optimized CUDA Kernel
11. CUDA KERNEL OPTIMIZATIONS IN RT-CUDA
2/28/2018 PHD DISSERTATION DEFENSE 11
Naïve CUDA Kernel
Block Merging: increased thread granularity by mapping one thread block to multiple resultant blocks vertically
2D matrices? If yes, do tiling:
Prefetching Using Shared Memory: effective usage of shared memory and coalesced access in global memory
Block Skewing: increased thread access locality by mapping one thread block to multiple resultant blocks horizontally
If no:
Remove Redundant Array Access in Loop Body: pre-fetch array loads that are independent of the loop indices
Parameterized CUDA Kernel
12. RT-CUDA DESIGN
Inputs: C-Program; Configuration File (defining the basic structure of the target kernels, array dimensions, selected optimizations, and the range of kernel parameters for auto-tuning); RT-CUDA Input Parameters
Pre-Processing: identifies CUDA kernels by partitioning the source program into a DAG of loops; data dependence is enforced → C Functions
Parameters Tuning → Optimal Parameters
Final Code Generation: kernel file with the optimized CUDA kernels, main file containing the main function that invokes the CUDA kernels, parameters file with the optimal values, and definitions of the RT-CUDA API functions → Optimized CUDA Program
RT-CUDA API Functions:
For Dense Matrix Operations: RTdSMM, RTdDMM, RTdSMV, RTdDMV, RTdSMT, RTdDMT, RTdSVV, RTdDVV, RTdSDOT, RTdDDOT
For Sparse Matrix Operations: RTspSMM, RTspDMM, RTspdSMM, RTspdDMM, RTspSMV, RTspDMV
For Synchronization: RTSync, RTRelaxedSync
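The deck describes what the configuration file contains (kernel structure, array dimensions, selected optimizations, parameter ranges for auto-tuning) but not its syntax, so the fragment below is purely hypothetical; every key name is illustrative.

```
# Hypothetical RT-CUDA configuration file (syntax and key names assumed)
kernel             = matmul            # target kernel found in the C source
array_dims         = A:2D, B:2D, C:2D  # array dimensions for nD -> 1D mapping
optimizations      = tiling, block_merging, prefetch_shared
block_size         = 8:32:8            # auto-tuning range: min:max:step
thread_granularity = 1:8:1
```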
13. RT-CUDA IMPLEMENTATION
Source Code → Parse Tree Generation: a parser, produced by the ANTLR parser generator from the ANTLR C grammar, builds the parse tree
Traverse Parse Tree: a ParseTreeWalker walks the parse tree; on each node event, the node's payload is modified according to the RT-CUDA transformations
Generate Code: the transformed code is emitted from the modified parse tree
30. CONCLUSION AND FUTURE WORK
Performance evaluation of the tool has been performed using basic linear algebra operations, including the LAPACK BLAS benchmark, the Jacobi iterative solver with different inter-block synchronization primitives, and dense and sparse matrix operations
Testing of the tool has been performed by graduate students using a set of 10 test cases of progressive difficulty, ranging from simple vector and matrix operations to a full solver of a linear system of equations
RT-CUDA Possible Enhancements:
Add more optimizations suitable for emerging GPU architectures such as Maxwell
Add more API functions from the cuBLAS and cuSparse libraries, supporting different sparse matrix formats