Recent developments in Graphics Processing Units (GPUs) have opened new opportunities for harnessing their computing power as a general-purpose computing paradigm through the CUDA parallel programming model. However, porting applications to CUDA remains a challenge for average programmers. We have developed a restructuring software compiler (RT-CUDA) with the best possible kernel optimizations to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA is based on a set of compiler optimizations: it takes a C-like program and converts it into an optimized CUDA kernel, guided by user directives in a configuration file. While invoking external libraries is not possible with the OpenACC commercial compiler, RT-CUDA allows transparent invocation of highly optimized external math libraries such as cuSparse and cuBLAS. For this, RT-CUDA uses interfacing APIs, error-handling interpretation, and user-transparent programming, enabling the efficient design of linear algebra solvers (LAS). RT-CUDA has been evaluated on a Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as solvers of systems of linear equations such as Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers such as OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) that are critical for the Oil & Gas industry. We expect RT-CUDA to be useful to many industries performing science and engineering simulation on massively parallel computers such as NVIDIA GPUs.
RT-CUDA: A Software Tool for CUDA Code Restructuring
1. RT-CUDA (A SOFTWARE TOOL FOR CUDA CODE RESTRUCTURING)
By
Dr. Ayaz ul Hassan Khan
Email: ayazhk@gmail.com
2. FEATURES
Simplifies writing high-performance CUDA programs
Modular Approach: based on ANTLR framework
Tested on Fermi and Kepler Architectures
Easy to extend for supporting various architectures
Provides:
GPU Memory Optimizations, Kernel Configurations, Synchronization, and Data Transfer Mechanisms
GPU Resource Optimization:
Auto-tuning to find optimal set of CUDA kernel parameters
Generates an optimized CUDA parallel program from a given sequential C program
API functions to call highly optimized library routines for dense and sparse matrices
Synchronization primitives for inter-block synchronization
Supports multi-kernel conversions
4. OPTIMIZATION SPECIFICATIONS
Input/Output GPU Memory Allocation
Allocating memory for GPU input and output
Explicit transfer of data between host (CPU) and device (GPU)
Computation Partitioning and Decomposition
Problem iteration space partitioning
Block-level and thread-level parallelism
Appropriate block/tile size to fit in the cache/shared memory
Perform related transformations
5. OPTIMIZATION SPECIFICATIONS
Locality optimizations and Datacopy Transformations
Explicit copy of data into lower levels of the memory hierarchy
Utilize special memories such as constant and texture caches
Efficient shared memory and register file usages per thread block
Parallel Memory Bandwidth
Increased memory bandwidth by:
Coalesced global memory access
Bank-conflict-free shared memory access
6. OPTIMIZATION SPECIFICATIONS
Optimization of Architectural Parameters
To set optimal thread granularity, block size, grid size
Better resource management and machine occupancy
Required auto-tuning mechanism
Use of automatic compiler optimization and/or programmer-guided optimization
User choices for compiler optimizations
Synchronization across SMs
Avoiding expensive inter-block synchronization
No global synchronization mechanism exists in CUDA except kernel termination
7. OPTIMIZATION SPECIFICATIONS
Invocation of Optimized External Libraries
Routines optimized at a lower programming level
Examples:
cuBLAS for dense linear algebra
cuSparse for sparse arrays
Library details are hidden from the user
But this requires a full understanding of the parameters and the related implementation logic
9. RT-CUDA CODE TRANSFORMATION STRATEGY
Input/Output GPU Memory Allocation
Configuration File
Computation Partitioning and Decomposition
Locality Optimizations and Datacopy Transformations
Parallel Memory Bandwidth
Optimization of Architectural Parameters
Use of Automatic Compiler Optimization and/or Programmer-Guided Optimization
Synchronization across SMs
Invocation of Optimized External Libraries
C-Loop Optimizations (Loop Collapsing)
Array Transformations
Loop Partitioning
Block Merging
Block Skewing
Prefetching using Shared Memory
Parameters Tuning
Custom API Functions
Final Code Generation
10. RESTRUCTURING ALGORITHM
C-Function
C-Loop Optimizations (Loop Collapsing): merge the nested loops if they are independent and calculate array indices based on the new loop variable
Array Transformations (nD → 1D): map the array representation to the GPU's linear addressing space
Loop Partitioning: distribute the work among all CUDA threads based on block ID and thread ID → Naïve CUDA Kernels
CUDA Kernel Optimizations: transform each naïve CUDA kernel into a parameterized CUDA kernel by applying a set of optimizations → Parametric CUDA Kernels
Parameters Tuning: determine optimal parameter values for the generated parametric CUDA kernels → Optimized CUDA Kernel
11. CUDA KERNEL OPTIMIZATIONS IN RT-CUDA
2/28/2018 PHD DISSERTATION DEFENSE 11
Naïve CUDA Kernel
Block Merging: increased thread granularity by mapping one thread block to multiple resultant blocks vertically
2D matrices? If yes, do tiling:
Prefetching Using Shared Memory: effective usage of shared memory and coalesced access in global memory
Block Skewing: increased thread access locality by mapping one thread block to multiple resultant blocks horizontally
If no:
Remove Redundant Array Access in Loop Body: pre-fetch array loads that are independent of the loop indices
Parameterized CUDA Kernel
12. RT-CUDA DESIGN
Inputs: C-Program; Configuration File (defining the basic structure of the target kernels, array dimensions, selected optimizations, and the range of kernel parameters for auto-tuning); RT-CUDA Input Parameters
Pre-Processing: identifies CUDA kernels by partitioning the source program into a DAG of loops; data dependence is enforced → C Functions
Parameters Tuning → Optimal Parameters
Final Code Generation: kernel file with the optimized CUDA kernels, main file containing the main function that invokes the CUDA kernels, parameters file with the optimal values, and definitions of the RT-CUDA API functions → Optimized CUDA Program
RT-CUDA API Functions:
For Dense Matrix Operations: RTdSMM, RTdDMM, RTdSMV, RTdDMV, RTdSMT, RTdDMT, RTdSVV, RTdDVV, RTdSDOT, RTdDDOT
For Sparse Matrix Operations: RTspSMM, RTspDMM, RTspdSMM, RTspdDMM, RTspSMV, RTspDMV
For Synchronization: RTSync, RTRelaxedSync
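The deck describes what the configuration file contains (kernel structure, array dimensions, selected optimizations, parameter ranges for auto-tuning) but not its syntax, so the fragment below is purely hypothetical; every key name is illustrative.

```
# Hypothetical RT-CUDA configuration file (syntax and key names assumed)
kernel             = matmul            # target kernel found in the C source
array_dims         = A:2D, B:2D, C:2D  # array dimensions for nD -> 1D mapping
optimizations      = tiling, block_merging, prefetch_shared
block_size         = 8:32:8            # auto-tuning range: min:max:step
thread_granularity = 1:8:1
```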
13. RT-CUDA IMPLEMENTATION
Source Code → Parse Tree Generation: a parser, produced by the ANTLR parser generator from the ANTLR C grammar, builds the parse tree
Traverse Parse Tree: a ParseTreeWalker walks the parse tree; on each node event, the node's payload is modified according to the RT-CUDA transformations
Generate Code: the transformed code is emitted from the modified parse tree
30. CONCLUSION AND FUTURE WORK
Performance evaluation of the tool has been performed using basic linear algebra operations, including the LAPACK BLAS benchmark, the Jacobi iterative solver with different inter-block synchronization primitives, and dense and sparse matrix operations
Testing of the tool has been performed by graduate students using a set of 10 test cases of progressive difficulty, ranging from simple vector and matrix operations to a full solver of a linear system of equations
RT-CUDA Possible Enhancements:
Add more optimizations suitable for emerging GPU architectures such as Maxwell
Add more API functions from the cuBLAS and cuSparse libraries, supporting different sparse matrix formats