RT-CUDA
(A SOFTWARE TOOL FOR CUDA CODE RESTRUCTURING)
By Dr. Ayaz ul Hassan Khan
Email: ayazhk@gmail.com
FEATURES
Simplifies writing high-performance CUDA programs
Modular approach: based on the ANTLR framework
Tested on Fermi and Kepler architectures
  • Easy to extend to support other architectures
Provides:
  • GPU memory optimizations, kernel configurations, synchronization, and data transfer mechanisms
GPU resource optimization:
  • Auto-tuning to find the optimal set of CUDA kernel parameters
Generates an optimized CUDA parallel program from a given sequential C program
API functions to call highly optimized library routines for dense and sparse matrices
Synchronization primitives for inter-block synchronization
Supports multi-kernel conversions
OPTIMIZATION SPECIFICATIONS
Input/Output GPU Memory Allocation
  • Allocating memory for GPU input and output
  • Explicit transfer of data between host (CPU) and device (GPU)
Computation Partitioning and Decomposition
  • Partitioning of the problem iteration space
  • Block-level and thread-level parallelism
  • Appropriate block/tile size to fit in the cache/shared memory
  • Performing related transformations
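The idea of partitioning the iteration space into tiles can be sketched in plain sequential C. This is only an illustration, not RT-CUDA output: the function name `matmul_tiled`, the helper `min_int`, and the fixed `TILE` value are assumptions made for the example. On a GPU, each tile would typically be sized so that its working set fits in shared memory.

```c
#define TILE 4  /* hypothetical tile edge; a real tool would tune this */

static int min_int(int x, int y) { return x < y ? x : y; }

/* C = A * B for n x n row-major matrices, processed tile by tile so that
   each inner triple loop touches only a TILE x TILE working set. */
void matmul_tiled(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                /* The min_int bounds handle n not divisible by TILE. */
                for (int i = ii; i < min_int(ii + TILE, n); ++i)
                    for (int j = jj; j < min_int(jj + TILE, n); ++j) {
                        float sum = c[i * n + j];
                        for (int k = kk; k < min_int(kk + TILE, n); ++k)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

Multiplying any matrix by the identity with `matmul_tiled` should return the original matrix, which makes a convenient sanity check for sizes that are not a multiple of `TILE`.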
Locality Optimizations and Datacopy Transformations
  • Explicit copy of data into lower levels of the memory hierarchy
  • Use of special memories such as the constant and texture caches
  • Efficient shared memory and register file usage per thread block
Parallel Memory Bandwidth
  • Increased memory bandwidth through:
      • Coalesced global memory access
      • Bank-conflict-free shared memory access
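To make the bank-conflict point concrete, the following C sketch counts how many threads of a warp land on the same shared memory bank for a given access stride. It assumes 32 banks of 4-byte words and a warp of 32 threads; the function name `bank_conflict_degree` is hypothetical and the model ignores the broadcast case where threads read the same word.

```c
#define WARP_SIZE 32   /* threads per warp */
#define NUM_BANKS 32   /* shared memory banks of 4-byte words (assumed) */

/* Worst-case number of threads in one warp that hit the same bank when
   thread t reads the 4-byte word at index t * stride (stride >= 1).
   A result of 1 means the access pattern is conflict-free. */
int bank_conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        int bank = (t * stride) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model a stride of 32 words serializes the whole warp on one bank, while padding the row to 33 words spreads the accesses over all banks again, which is the usual reason for padding shared memory tiles by one column.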
Optimization of Architectural Parameters
  • Setting optimal thread granularity, block size, and grid size
  • Better resource management and machine occupancy
  • Requires an auto-tuning mechanism
Use of Automatic Compiler Optimization and/or Programmer-Guided Optimization
  • User choices for compiler optimizations
Synchronization across SMs
  • Avoiding expensive inter-block synchronization
  • No global synchronization mechanism except at kernel boundaries
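The auto-tuning requirement mentioned above can be illustrated with a minimal exhaustive search over candidate block sizes. This is a sketch under stated assumptions, not the tool's actual tuner: `TuneResult`, `tune_block_size`, and `toy_cost` are hypothetical names, the search visits only powers-of-two multiples of the lower bound, and the cost callback stands in for timing a real kernel run.

```c
#include <float.h>

/* Hypothetical result record: the best block size found and its cost. */
typedef struct {
    int block_size;
    double cost;
} TuneResult;

/* Tries block sizes lo, 2*lo, 4*lo, ... up to hi and keeps the one with
   the lowest measured cost. Returns block_size == -1 if lo < 1. */
TuneResult tune_block_size(int lo, int hi, double (*measure)(int block_size))
{
    TuneResult best = { -1, DBL_MAX };
    if (lo < 1)
        return best;
    for (int bs = lo; bs <= hi; bs *= 2) {
        double c = measure(bs);
        if (c < best.cost) {
            best.block_size = bs;
            best.cost = c;
        }
    }
    return best;
}

/* Toy cost model for illustration only: its minimum is at block size 128. */
double toy_cost(int block_size)
{
    double d = (double)block_size - 128.0;
    return d * d;
}
```

Calling `tune_block_size(32, 1024, toy_cost)` selects 128 for this toy model. With a real kernel, the callback would launch and time the kernel for each candidate configuration.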
Invocation of Optimized External Libraries
  • Libraries optimized through lower-level programming
Examples:
  • cuBLAS for dense linear algebra
  • cuSPARSE for sparse arrays
Library details are hidden from the user, but using the libraries directly requires a full understanding of their parameters and the related implementation logic
OPTIMIZATION SPECIFICATIONS: COMPARISON AMONG DIFFERENT TOOLS
RT-CUDA CODE TRANSFORMATION STRATEGY
Optimization specifications:
  • Input/Output GPU Memory Allocation
  • Computation Partitioning and Decomposition
  • Locality Optimizations and Datacopy Transformations
  • Parallel Memory Bandwidth
  • Optimization of Architectural Parameters
  • Use of Automatic Compiler Optimization and/or Programmer-Guided Optimization
  • Synchronization across SMs
  • Invocation of Optimized External Libraries
RT-CUDA components:
  • Configuration File
  • C-Loop Optimizations (Loop Collapsing)
  • Array Transformations
  • Loop Partitioning
  • Block Merging
  • Block Skewing
  • Prefetching using Shared Memory
  • Parameters Tuning
  • Custom API Functions
  • Final Code Generation
RESTRUCTURING ALGORITHM
Input: C function
C-Loop Optimizations (Loop Collapsing)
  • Merges nested loops if they are independent and calculates array indices from the new loop variable
Array Transformations (nD → 1D)
  • Maps the array representation to the GPU's linear address space
Loop Partitioning
  • Distributes tasks among all CUDA threads based on each thread's block ID and thread ID
  • Produces naïve CUDA kernels
CUDA Kernel Optimizations
  • Transforms a naïve CUDA kernel into a parameterized CUDA kernel by applying a set of optimizations
  • Produces parametric CUDA kernels
Parameters Tuning
  • Determines optimal parameter values for the generated parametric CUDA kernels
Output: Optimized CUDA kernel
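The first three steps (loop collapsing, nD to 1D array transformation, and loop partitioning) can be sketched in plain C by simulating the block and thread IDs on the host. The example uses an element-wise matrix addition; the function names `add_partition` and `add_all` and the strided assignment of iterations to threads are assumptions for illustration and may differ from what RT-CUDA actually generates.

```c
/* Original sequential form (for reference):
     for (i = 0; i < rows; ++i)
       for (j = 0; j < cols; ++j)
         c[i][j] = a[i][j] + b[i][j];                                  */

/* Work done by one simulated CUDA thread after the three steps. */
void add_partition(const float *a, const float *b, float *c,
                   int rows, int cols,
                   int block_id, int thread_id,
                   int num_blocks, int threads_per_block)
{
    int total    = rows * cols;  /* loop collapsing: one loop of rows*cols */
    int tid      = block_id * threads_per_block + thread_id;
    int nthreads = num_blocks * threads_per_block;

    /* Loop partitioning: this thread takes iterations tid, tid + nthreads, ... */
    for (int k = tid; k < total; k += nthreads) {
        int i = k / cols;        /* row index recovered from the new loop variable */
        int j = k % cols;        /* column index recovered likewise */
        int idx = i * cols + j;  /* nD -> 1D: row-major linear address */
        c[idx] = a[idx] + b[idx];
    }
}

/* Host-side stand-in for a kernel launch: runs every (block, thread) pair. */
void add_all(const float *a, const float *b, float *c, int rows, int cols,
             int num_blocks, int threads_per_block)
{
    for (int blk = 0; blk < num_blocks; ++blk)
        for (int t = 0; t < threads_per_block; ++t)
            add_partition(a, b, c, rows, cols, blk, t,
                          num_blocks, threads_per_block);
}
```

In an actual CUDA kernel, `block_id` and `thread_id` would come from `blockIdx.x` and `threadIdx.x`, and the outer loops in `add_all` would be replaced by the kernel launch configuration.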
CUDA KERNEL OPTIMIZATIONS IN RTA-CUDA
2/28/2018 PHD DISSERTATION DEFENSE
Input: Naïve CUDA Kernel
Decision points (Yes/No): 2D Matrices; Do tiling
Block Merging
  • Increases thread granularity by mapping one thread block to multiple resultant blocks vertically
Prefetching Using Shared Memory
  • Effective usage of shared memory and coalesced access to global memory
Block Skewing
  • Increases thread access locality by mapping one thread block to multiple resultant blocks horizontally
Remove Redundant Array Access in Loop Body
  • Prefetches array loads that are independent of the loop indices
Output: Parameterized CUDA Kernel
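The last optimization, removing redundant array accesses from the loop body, can be illustrated with a small before/after pair in C. The row-scaling computation and the function names `scale_rows_naive` and `scale_rows_hoisted` are hypothetical and chosen only to show the transformation; they are not taken from the tool.

```c
/* Before: scale[i] is re-read on every inner iteration even though it
   does not depend on the inner loop index j. */
void scale_rows_naive(const float *m, const float *scale, float *out,
                      int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = m[i * cols + j] * scale[i];
}

/* After: the load that is independent of j is fetched once per row into
   a local variable, which on a GPU would normally live in a register. */
void scale_rows_hoisted(const float *m, const float *scale, float *out,
                        int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        float s = scale[i];
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = m[i * cols + j] * s;
    }
}
```

Both versions produce identical results. A compiler can sometimes perform this hoisting itself, but possible aliasing between `out` and `scale` may prevent it, which is one reason to apply the transformation at the source level.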
RT-CUDA DESIGN
Input: C Program
Output: Optimized CUDA Program
Pre-Processing
  • Identifies CUDA kernels by partitioning the source program into a DAG of loops; data dependence is enforced
Configuration File
  • Defines the basic structure of the target kernels, array dimensions, selected optimizations, and the range of kernel parameters for auto-tuning
RTA-CUDA
Final Code Generation
  • Kernel file with optimized CUDA kernels
  • Main file containing the main function that invokes the CUDA kernels
  • Parameters file with optimal values
  • Definitions of the RT-CUDA API functions
Data passed between stages: C Functions, RT-CUDA Input Parameters, Optimal Parameters
RT-CUDA API Functions
  • For dense matrix operations: RTdSMM, RTdDMM, RTdSMV, RTdDMV, RTdSMT, RTdDMT, RTdSVV, RTdDVV, RTdSDOT, RTdDDOT
  • For sparse matrix operations: RTspSMM, RTspDMM, RTspdSMM, RTspdDMM, RTspSMV, RTspDMV
  • For synchronization: RTSync, RTRelaxedSync
RT-CUDA IMPLEMENTATION
ANTLR C Grammar → Parser Generator → Parser
Source Code → Parser (Parse Tree Generation) → Parse Tree
Parse Tree → Traverse Parse Tree using ParseTreeWalker → Node Event
Node Event → Modify Payload based on the RT-CUDA Transformations → Modified Parse Tree
Modified Parse Tree → Generate Code → Transformed Code
RT-CUDA EXAMPLES
RT-CUDA EXAMPLE: MATRIX-MATRIX MULTIPLICATION (INPUT)
RT-CUDA EXAMPLE: MATRIX-MATRIX MULTIPLICATION (OUTPUT)
RT-CUDA EXAMPLE: SPARSE MATRIX OPERATORS USING RT-CUDA API (INPUT)
RT-CUDA EXAMPLE: SPARSE MATRIX OPERATORS USING RT-CUDA API (OUTPUT)
RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS (INPUT) (MULTI-KERNEL CONVERSIONS)
RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS (CONFIGURATIONS) (MULTI-KERNEL CONVERSIONS)
RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS (OUTPUT) (MULTI-KERNEL CONVERSIONS)
PERFORMANCE EVALUATION
EVALUATION OF BASIC LINEAR ALGEBRA OPERATIONS
EVALUATION OF INTER-BLOCK SYNCHRONIZATION PRIMITIVES
  [Plot panels: Single Precision, Double Precision]
EFFECTS OF CALLING EXTERNAL CUBLAS FUNCTIONS
EFFECTS OF SPARSE MATRIX OPERATIONS USING CUDA SPARSE LIBRARY ROUTINES
Matrix      Dimension   Non-zeros
bcsstm13    2003        11973
cavity10    2597        76367
cavity17    4562        138187
cavity18    4562        138187
cdde1       961         4681
cdde2       961         4681
cdde3       961         4681
coater1     1348        19457
CONCLUSION AND FUTURE WORK
  • The tool's performance has been evaluated using basic linear algebra operations, including the LAPACK BLAS benchmark, a Jacobi iterative solver with different inter-block synchronization primitives, and dense and sparse matrix operations
  • The tool has been tested by graduate students on a set of 10 test cases of progressive difficulty, ranging from simple vector-matrix operations to a full solver for a linear system of equations
RT-CUDA Possible Enhancements:
  • Add more optimizations suitable for emerging GPU architectures such as Maxwell
  • Add more API functions from the cuBLAS and cuSPARSE libraries, with different sparse matrix formats

More Related Content

What's hot

Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116ksk_ha
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAShien-Chun Luo
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
A synchronous scheduling service for distributed real-time Java
A synchronous scheduling service for distributed real-time JavaA synchronous scheduling service for distributed real-time Java
A synchronous scheduling service for distributed real-time JavaUniversidad Carlos III de Madrid
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
 
State of Linux Containers for HPC
State of Linux Containers for HPCState of Linux Containers for HPC
State of Linux Containers for HPCinside-BigData.com
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportLinaro
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Tokyo Institute of Technology
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...IDES Editor
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 

What's hot (20)

Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116
 
2020 icldla-updated
2020 icldla-updated2020 icldla-updated
2020 icldla-updated
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Basanta jtr2009
Basanta jtr2009Basanta jtr2009
Basanta jtr2009
 
A synchronous scheduling service for distributed real-time Java
A synchronous scheduling service for distributed real-time JavaA synchronous scheduling service for distributed real-time Java
A synchronous scheduling service for distributed real-time Java
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
State of Linux Containers for HPC
State of Linux Containers for HPCState of Linux Containers for HPC
State of Linux Containers for HPC
 
Chap05 gtp 03_kh
Chap05 gtp 03_khChap05 gtp 03_kh
Chap05 gtp 03_kh
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
2011.jtr.pbasanta.
2011.jtr.pbasanta.2011.jtr.pbasanta.
2011.jtr.pbasanta.
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...
A High Speed Pipelined Dynamic Circuit Implementation Using Modified TSPC Log...
 
Progress_190130
Progress_190130Progress_190130
Progress_190130
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 

Similar to RT-CUDA: A Software Tool for CUDA Code Restructuring

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track fAlona Gradman
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensAlona Gradman
 
Cisco crs1
Cisco crs1Cisco crs1
Cisco crs1wjunjmt
 
isca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxisca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxssuser30e7d2
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdwKohei KaiGai
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5Steen Larsen
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
 
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PROIDEA
 

Similar to RT-CUDA: A Software Tool for CUDA Code Restructuring (20)

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
0507036
05070360507036
0507036
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track f
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert Goossens
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
 
Cisco crs1
Cisco crs1Cisco crs1
Cisco crs1
 
isca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxisca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptx
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
ate_full_paper
ate_full_paperate_full_paper
ate_full_paper
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
Stress your DUT
Stress your DUTStress your DUT
Stress your DUT
 
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

RT-CUDA: A Software Tool for CUDA Code Restructuring

  • 1. RTCUDA (A SOFTWARE TOOL FOR CUDA CODE RESTRUCTURING) By Dr. Ayaz ul Hassan Khan Email: ayazhk@gmail.com 1
  • 2. FEATURES Simplify writing high performing CUDA program Modular Approach: based on ANTLR framework Tested on Fermi and Kepler Architectures  Easy to extend for supporting various architectures Provides:  GPU Memory Optimizations, Kernel Configurations, Synchronization, and Data Transfer Mechanisms GPU Resource Optimization:  Auto-tuning to find optimal set of CUDA kernel parameters Generate Optimized CUDA parallel program from a given sequential C program API functions to call highly optimized library routines for dense and sparse matrices Synchronization primitives for inter-block synchronization Supports multi-kernel conversions 2
  • 4. OPTIMIZATIONS SPECIFICATIONS Input/Output GPU Memory Allocation  Allocating memory for GPU input and output  Explicit transfer of data between host (CPU) and device (GPU) Computation Partitioning and Decomposition  Problem iteration space partitioning  Block – level and Thread – level Parallelism  Appropriate block/tile size to fit in the cache/shared memory  Perform related transformations 4
  • 5. OPTIMIZATION SPECIFICATIONS
      • Locality optimizations and datacopy transformations
          • Explicit copy of data into lower-level portions of the memory hierarchy
          • Utilize special memories such as constant and texture caches
          • Efficient shared memory and register file usage per thread block
      • Parallel memory bandwidth
          • Increased memory bandwidth through coalesced global memory access
          • Increased memory bandwidth through bank-conflict-free shared memory access
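
To make the "bank-conflict-free shared memory access" item concrete, here is a minimal host-side sketch in plain C. It is not taken from the slides. It assumes the common layout of 32 shared memory banks, each one 4-byte word wide (Kepler can also be configured for 8-byte banks), and the function name `column_read_conflict_degree` is hypothetical. It counts how many threads of a 32-thread warp land on the same bank when thread `t` reads element `[t][col]` of a row-major tile.

```c
#include <assert.h>

#define NUM_BANKS 32   /* assumption: 32 banks, one 4-byte word per bank */
#define WARP_SIZE 32

/* Worst-case number of threads in one warp that hit the same bank when
 * thread t reads tile[t][col]; `pitch` is the row length of the tile in
 * 4-byte words. A result of 1 means the access is conflict free. */
int column_read_conflict_degree(int pitch, int col) {
    assert(pitch > 0 && col >= 0 && col < pitch);
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        int word = t * pitch + col;     /* linear word offset inside the tile */
        int bank = word % NUM_BANKS;    /* consecutive words map to consecutive banks */
        if (++hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under these assumptions, a 32-word pitch sends every thread of a column read to the same bank, while padding each row by one word (pitch 33) spreads the 32 reads over 32 distinct banks. Whether RT-CUDA itself uses padding is not stated in the slides.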
  • 6. OPTIMIZATION SPECIFICATIONS
      • Optimization of architectural parameters
          • Set optimal thread granularity, block size, and grid size
          • Better resource management and machine occupancy
          • Requires an auto-tuning mechanism
      • Use of automatic compiler optimization and/or programmer-guided optimization
          • User choices for compiler optimizations
      • Synchronization across SMs
          • Avoiding expensive inter-block synchronization
          • No global synchronization mechanisms except the kernel
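
The auto-tuning idea on this slide can be sketched as an exhaustive sweep over candidate block sizes. The C code below is an illustration under stated assumptions, not RT-CUDA's tuner: the names `grid_size_for`, `tune_block_size`, `cost_fn`, and `padding_waste` are hypothetical, the candidates are multiples of the 32-thread warp size up to the 1024 threads-per-block limit of Fermi and Kepler, and the cost function is a toy stand-in for what would normally be a measured kernel run time.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32
#define MAX_THREADS_PER_BLOCK 1024

/* Number of blocks needed so that grid * block_size covers n elements. */
size_t grid_size_for(size_t n, size_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;   /* ceiling division */
}

/* Hypothetical cost callback: lower is better. */
typedef double (*cost_fn)(size_t n, size_t block_size, size_t grid_size);

/* Try every warp-multiple block size and keep the cheapest one.
 * Ties are resolved in favour of the smaller block size. */
size_t tune_block_size(size_t n, cost_fn cost) {
    assert(cost != NULL);
    size_t best = WARP_SIZE;
    double best_cost = cost(n, best, grid_size_for(n, best));
    for (size_t b = 2 * WARP_SIZE; b <= MAX_THREADS_PER_BLOCK; b += WARP_SIZE) {
        double c = cost(n, b, grid_size_for(n, b));
        if (c < best_cost) {
            best_cost = c;
            best = b;
        }
    }
    return best;
}

/* Toy cost for illustration only: threads launched beyond the n useful ones. */
double padding_waste(size_t n, size_t block_size, size_t grid_size) {
    return (double)(grid_size * block_size - n);
}
```

A call such as `tune_block_size(1000, padding_waste)` sweeps all 32 candidates. In a real tuner the callback would launch and time the parameterized kernel instead.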
  • 7. OPTIMIZATION SPECIFICATIONS
      • Invocation of optimized external libraries
          • Libraries are optimized at a lower programming level
          • Examples: cuBLAS for dense linear algebra, cuSPARSE for sparse arrays
          • Library details are hidden from the user, but use requires a full understanding of the parameters and related implementation logic
  • 9. RT-CUDA CODE TRANSFORMATION STRATEGY
      • Input: configuration file
      • Optimization specifications addressed:
          • Input/output GPU memory allocation
          • Computation partitioning and decomposition
          • Locality optimizations and datacopy transformations
          • Parallel memory bandwidth
          • Optimization of architectural parameters
          • Use of automatic compiler optimization and/or programmer-guided optimization
          • Synchronization across SMs
          • Invocation of optimized external libraries
      • RT-CUDA transformations:
          • C-loop optimizations (loop collapsing)
          • Array transformations
          • Loop partitioning
          • Block merging
          • Block skewing
          • Prefetching using shared memory
          • Parameters tuning
          • Custom API functions
          • Final code generation
  • 10. RESTRUCTURING ALGORITHM
      • Input: C function
      • C-loop optimizations (loop collapsing): merge the nested loops if they are independent and calculate array indices based on the new loop variable
      • Array transformations (nD → 1D): map the array representation to the GPU's linear addressing space
      • Loop partitioning: distribute tasks among all CUDA threads based on block ID and thread ID, producing naïve CUDA kernels
      • CUDA kernel optimizations: transform each naïve CUDA kernel into a parameterized CUDA kernel by applying a set of optimizations
      • Parameters tuning: determine optimal parameter values for the generated parametric CUDA kernels
      • Output: optimized CUDA kernel
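
The first three steps of the restructuring algorithm can be illustrated in plain C. The sketch below is not RT-CUDA output, and the function names are hypothetical. It shows a nested loop over a row-major array, the same loop collapsed to a single index, and a host-side emulation of loop partitioning in which the two outer loops stand in for CUDA block and thread IDs. The stride-based work distribution is an assumption; the slides do not say which distribution RT-CUDA generates.

```c
#include <assert.h>
#include <stddef.h>

/* Starting point: nested loops over a 2D array stored as a 1D buffer.
 * The nD -> 1D mapping is (i, j) -> i * cols + j (row-major). */
void scale_nested(float *a, size_t rows, size_t cols, float s) {
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            a[i * cols + j] *= s;
        }
    }
}

/* Loop collapsing: one loop variable k, with (i, j) recomputed from k. */
void scale_collapsed(float *a, size_t rows, size_t cols, float s) {
    for (size_t k = 0; k < rows * cols; ++k) {
        size_t i = k / cols;
        size_t j = k % cols;
        a[i * cols + j] *= s;
    }
}

/* Loop partitioning, emulated on the host: every (block_id, thread_id)
 * pair handles the elements global_id, global_id + stride, and so on. */
void scale_partitioned(float *a, size_t n, float s,
                       size_t grid_dim, size_t block_dim) {
    assert(grid_dim > 0 && block_dim > 0);
    size_t stride = grid_dim * block_dim;        /* total emulated threads */
    for (size_t block_id = 0; block_id < grid_dim; ++block_id) {
        for (size_t thread_id = 0; thread_id < block_dim; ++thread_id) {
            size_t global_id = block_id * block_dim + thread_id;
            for (size_t k = global_id; k < n; k += stride) {
                a[k] *= s;
            }
        }
    }
}
```

All three functions touch each element exactly once, so they produce identical results. In a real kernel the two outer loops of `scale_partitioned` disappear and `block_id` and `thread_id` come from `blockIdx.x` and `threadIdx.x`.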
  • 11. CUDA KERNEL OPTIMIZATIONS IN RTA-CUDA (2/28/2018, PhD dissertation defense)
      • Input: naïve CUDA kernel
      • Flowchart decision nodes (yes/no branches): "2D matrices" and "Do tiling"
      • Block merging: increased thread granularity by mapping one thread block to multiple resultant blocks vertically
      • Prefetching using shared memory: effective usage of shared memory and coalesced access in global memory
      • Block skewing: increased thread access locality by mapping one thread block to multiple resultant blocks horizontally
      • Remove redundant array access in loop body: pre-fetch array loads that are independent of the loop indices
      • Output: parameterized CUDA kernel
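
The effect of raising thread granularity and of removing redundant loads can be sketched on the host with matrix multiplication. The C code below is an illustration, not the kernel RT-CUDA generates: `matmul_naive` and `matmul_merged` are hypothetical names, `MERGE` is an assumed merge factor, and each iteration of the two outer loops stands in for one CUDA thread. In the merged version one emulated thread produces `MERGE` vertically adjacent outputs and loads each element of `B` once for all of them.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for n x n row-major matrices; one emulated thread per output. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                sum += A[row * n + k] * B[k * n + col];
            }
            C[row * n + col] = sum;
        }
    }
}

#define MERGE 2   /* assumed merge factor: output rows per emulated thread */

/* Each emulated thread computes MERGE outputs in the same column. */
void matmul_merged(const float *A, const float *B, float *C, size_t n) {
    assert(n % MERGE == 0);
    for (size_t row0 = 0; row0 < n; row0 += MERGE) {
        for (size_t col = 0; col < n; ++col) {
            float sum[MERGE] = {0.0f};
            for (size_t k = 0; k < n; ++k) {
                float b = B[k * n + col];        /* loaded once, reused MERGE times */
                for (size_t m = 0; m < MERGE; ++m) {
                    sum[m] += A[(row0 + m) * n + k] * b;
                }
            }
            for (size_t m = 0; m < MERGE; ++m) {
                C[(row0 + m) * n + col] = sum[m];
            }
        }
    }
}
```

Both versions accumulate each output in the same order, so their results match exactly. The merged version performs n/MERGE times fewer loads of `B` per output column, which is the kind of saving a higher thread granularity aims for.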
  • 12. RT-CUDA DESIGN
      • Input: C program; output: optimized CUDA program
      • Pre-processing: identifies CUDA kernels by partitioning the source program into a DAG of loops, with data dependence enforced; produces C functions
      • Configuration file: defines the basic structure of the target kernels, array dimensions, selected optimizations, and the range of kernel parameters for auto-tuning; supplies the RT-CUDA input parameters
      • RTA-CUDA: produces the optimal parameters
      • Final code generation: kernel file with optimized CUDA kernels, main file containing the main function that invokes the CUDA kernels, parameters file with optimal values, and definitions of the RT-CUDA API functions
      • RT-CUDA API functions
          • Dense matrix operations: RTdSMM, RTdDMM, RTdSMV, RTdDMV, RTdSMT, RTdDMT, RTdSVV, RTdDVV, RTdSDOT, RTdDDOT
          • Sparse matrix operations: RTspSMM, RTspDMM, RTspdSMM, RTspdDMM, RTspSMV, RTspDMV
          • Synchronization: RTSync, RTRelaxedSync
  • 13. RT-CUDA IMPLEMENTATION
      • ANTLR C grammar → parser generator → parser
      • Source code → parse tree generation → parse tree
      • Traverse the parse tree using ParseTreeWalker; each node event modifies the payload based on the RT-CUDA transformations
      • Modified parse tree → generate code → transformed code
  • 15. RT-CUDA EXAMPLE: MATRIX-MATRIX MULTIPLICATION (INPUT)
  • 16. RT-CUDA EXAMPLE: MATRIX-MATRIX MULTIPLICATION (OUTPUT)
  • 17. RT-CUDA EXAMPLE: SPARSE MATRIX OPERATORS USING RT-CUDA API (INPUT)
  • 18. RT-CUDA EXAMPLE: SPARSE MATRIX OPERATORS USING RT-CUDA API (OUTPUT)
  • 19. RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS – (INPUT) (MULTI-KERNEL CONVERSIONS)
  • 20. RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS – (CONFIGURATIONS) (MULTI-KERNEL CONVERSIONS)
  • 21. RT-CUDA: CONJUGATE GRADIENT USING RT-CUDA API AND CUSTOM MERGE OPERATIONS – (OUTPUT) (MULTI-KERNEL CONVERSIONS)
  • 23. EVALUATION OF BASIC LINEAR ALGEBRA OPERATIONS
  • 24. EVALUATION OF INTER-BLOCK SYNCHRONIZATION PRIMITIVES (panels: single precision, double precision)
  • 25. EFFECTS OF CALLING EXTERNAL CUBLAS FUNCTIONS
  • 26. EFFECTS OF CALLING EXTERNAL CUBLAS FUNCTIONS
  • 27. EFFECTS OF CALLING EXTERNAL CUBLAS FUNCTIONS
  • 28. EFFECTS OF SPARSE MATRIX OPERATIONS USING CUDA SPARSE LIBRARY ROUTINES
  • 29. EFFECTS OF SPARSE MATRIX OPERATIONS USING CUDA SPARSE LIBRARY ROUTINES
      • Test matrices (matrix: dimension, non-zeros; the plot column is an image and is not reproduced)
          • bcsstm13: 2003, 11973
          • cavity10: 2597, 76367
          • cavity17: 4562, 138187
          • cavity18: 4562, 138187
          • cdde1: 961, 4681
          • cdde2: 961, 4681
          • cdde3: 961, 4681
          • coater1: 1348, 19457
  • 30. CONCLUSION AND FUTURE WORK
      • The tool's performance was evaluated using basic linear algebra operations, including the LAPACK BLAS benchmark, a Jacobi iterative solver with different inter-block synchronization primitives, and dense and sparse matrix operations
      • The tool was tested by graduate students on a set of 10 test cases of progressive difficulty, ranging from simple vector-matrix operations to a full solver for a linear system of equations
      • Possible RT-CUDA enhancements:
          • Add more optimizations suitable for emerging GPU architectures such as Maxwell
          • Add more API functions from the cuBLAS and cuSPARSE libraries, with different sparse matrix formats