1. A Quantitative Study of Irregular Programs on GPUs
By
Prashant Momale
IIT Kanpur
Guided By
Prof. S. K. Aggarwal
2. Introduction
Regular vs Irregular Algorithms
- Regular Programs
(i) operate on large vectors or matrices
(ii) access them in statically predictable ways
- These codes often have high computational demands
- exhibit extensive data parallelism
- access memory in a streaming fashion and require little synchronization
e.g. Matrix Multiplication
3. Introduction (Continued....)
Irregular Programs
- build, traverse, and update irregular data structures such as trees, graphs, and priority
queues
e.g. domains like n-body simulation, data mining, decision problems that use Boolean
satisfiability, optimization theory, social networks
- more difficult to parallelize
- more challenging to map to GPUs than regular programs
4. Introduction (Continued....)
Many Open Questions
- Several GPU implementations of irregular programs have been published, but very little
is known about their behavior
- Some questions do not have clear answers, such as
(i) Does irregularity really manifest itself as a binary property?
(ii) How is the irregularity behavior of an application influenced by its input, if at all?
(iii) Does an increase in irregularity necessarily degrade performance or might it help in
certain cases?
- Answers to the above questions are essential for understanding the behavior of irregular
programs
5. Irregularity
Regular Programs
- Control flow and memory access are not data dependent
Ex. In matrix multiplication, given the source code, the starting addresses, and the input
size, we can predict the behavior without knowing any matrix elements
Irregular Programs
- Control flow and memory access are data dependent
- Input values determine the program's behavior
Ex. Binary Search Tree implementation
The values and the order in which they are processed affect the control flow and
memory references
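The contrast between the two examples above can be illustrated with a small Python sketch. It is not taken from the paper; the function names `matmul_access_trace` and `bst_insert_path` are hypothetical. The first function lists the element indices a naive matrix multiplication reads, which depend only on the size `n`. The second records the left/right decisions made while inserting keys into an unbalanced binary search tree, which depend on the key values and their order.

```python
def matmul_access_trace(n):
    """Index pairs (A[i][k], B[k][j]) read by a naive n x n multiplication.

    The trace is a function of n alone; no matrix element is inspected.
    """
    return [((i, k), (k, j))
            for i in range(n) for j in range(n) for k in range(n)]


def bst_insert_path(keys):
    """Insert keys into an unbalanced BST and return the path of each key.

    Each path is a string of 'L'/'R' decisions, so it reflects the
    data-dependent control flow and memory references of the insertion.
    """
    root = None          # a node is a list: [key, left_child, right_child]
    paths = []
    for key in keys:
        if root is None:
            root = [key, None, None]
            paths.append("")
            continue
        node, path = root, ""
        while True:
            side = 1 if key < node[0] else 2   # 1 = left, 2 = right
            path += "L" if side == 1 else "R"
            if node[side] is None:
                node[side] = [key, None, None]
                break
            node = node[side]
        paths.append(path)
    return paths


if __name__ == "__main__":
    print(len(matmul_access_trace(2)))   # 8 reads, whatever the matrices hold
    print(bst_insert_path([2, 1, 3]))    # ['', 'L', 'R']
    print(bst_insert_path([1, 2, 3]))    # ['', 'R', 'RR']
```

The same three keys produce different paths when their order changes, while the matrix-multiplication trace never changes for a fixed size.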
6. Irregularity (Continued....)
Warp Concept
- A GPU contains processing elements (PEs); tightly coupled PEs form a streaming
multiprocessor (SM).
- Each PE in an SM can run an independent thread of instructions
- The PEs in each SM execute vector instructions that conditionally operate on 32 data
items.
- A set of 32 threads that run together in this fashion is called a warp.
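As a rough illustration of the warp grouping, the Python sketch below maps a linear thread index to a warp number and a lane within that warp. It assumes threads are grouped into warps by consecutive linear index; the helper names `warp_and_lane` and `warps_needed` are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the slide


def warp_and_lane(thread_index):
    """Return (warp number, lane within the warp) for a linear thread index."""
    return divmod(thread_index, WARP_SIZE)


def warps_needed(num_threads):
    """Number of warps needed to hold num_threads threads (rounded up)."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    print(warp_and_lane(33))   # (1, 1): second warp, second lane
    print(warps_needed(100))   # 4: the last warp is only partly filled
```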
7. Irregularity (Continued....)
Control Flow Irregularity
- Sometimes the threads in a warp cannot all perform the same instruction.
- The threads are then automatically subdivided into sets.
- Threads within a set perform the same instruction.
- But the sets are executed serially until they re-converge.
A situation where not all threads in a warp follow the same control flow is called Thread
Divergence.
This is Control Flow Irregularity.
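The subdivision described above can be mimicked in Python as a simplified model of one two-way branch. It is not a description of actual hardware scheduling, and the function name `execute_branch` is hypothetical. Threads are grouped by the branch outcome they compute, and each group stands for one serial pass of the warp.

```python
WARP_SIZE = 32


def execute_branch(predicates):
    """Group the lanes of one warp by their branch outcome.

    predicates[lane] is True if that thread takes the 'if' side.
    Returns a list of lane groups; each group models one serial pass,
    with the 'taken' group first. One group means no divergence.
    """
    assert len(predicates) == WARP_SIZE
    groups = {}
    for lane, taken in enumerate(predicates):
        groups.setdefault(taken, []).append(lane)
    return [groups[outcome] for outcome in sorted(groups, reverse=True)]


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE
    mixed = [lane % 2 == 0 for lane in range(WARP_SIZE)]
    print(len(execute_branch(uniform)))              # 1 pass, no divergence
    print([len(g) for g in execute_branch(mixed)])   # [16, 16]: two passes
```

In the divergent case each pass keeps only half of the lanes busy, which is why divergence can cost performance.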
8. Irregularity (Continued....)
Memory Access Irregularity
- Coalesced memory transaction: accesses by the threads of a warp to adjacent memory
locations are combined into one or a few transactions.
- When memory access is not coalesced, the hardware has to perform many memory
transactions, one after the other, where a coalesced access needs far fewer.
This is how Memory Access Irregularity can lower performance.
- Bank Conflict: A warp can simultaneously access 32 words in shared memory as long
as they reside in different banks. If more than one word is touched within a bank, a
bank conflict occurs.
Bank conflicts are another source of Memory Access Irregularity.
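Both effects can be estimated with a small Python model. The constants are assumptions, not values from the slides: 128-byte global-memory segments, 4-byte words, and a bank chosen as word index modulo 32. The function names `global_transactions` and `bank_conflict_degree` are hypothetical.

```python
SEGMENT_BYTES = 128  # assumed size of one global-memory transaction
NUM_BANKS = 32       # one bank per lane, matching the 32 words in the slide


def global_transactions(byte_addresses):
    """Count the distinct memory segments touched by one warp access."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def bank_conflict_degree(word_indices):
    """Largest number of distinct words that fall into the same bank.

    1 means conflict-free; k means the access is serialized into k steps.
    Threads reading the very same word do not conflict in this model.
    """
    words_per_bank = {}
    for word in set(word_indices):
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    print(global_transactions([4 * i for i in range(32)]))     # 1 (coalesced)
    print(global_transactions([128 * i for i in range(32)]))   # 32 (strided)
    print(bank_conflict_degree([2 * i for i in range(32)]))    # 2
    print(bank_conflict_degree([32 * i for i in range(32)]))   # 32
```

Under these assumptions, a unit-stride access needs a single transaction, while a stride of one segment per thread needs 32.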
9. Metrics of Irregularity
(i) Control Flow Irregularity
CFI = (divergent branches) / (executed instructions)
(ii) Memory-Access Irregularity
MAI = (replayed instructions) / (issued instructions)
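The two ratios can be computed directly from profiler counters. The Python sketch below expresses them as percentages; the counter values in the usage example are made up for illustration and are not measurements from the study.

```python
def cfi(divergent_branches, executed_instructions):
    """Control-flow irregularity as a percentage."""
    if executed_instructions == 0:
        return 0.0
    return 100.0 * divergent_branches / executed_instructions


def mai(replayed_instructions, issued_instructions):
    """Memory-access irregularity as a percentage."""
    if issued_instructions == 0:
        return 0.0
    return 100.0 * replayed_instructions / issued_instructions


if __name__ == "__main__":
    # Hypothetical counters for one kernel run.
    print(cfi(divergent_branches=41, executed_instructions=1000))   # 4.1
    print(mai(replayed_instructions=250, issued_instructions=1000)) # 25.0
```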
10. Metrics of Irregularity (Continued...)
- Both metrics range from 0% to 100%
- The higher the value, the higher the irregularity
- CFI is usually low
- They are independent of runtime
- Both metrics measure irregularity at the warp level
These metrics do not classify a program as regular or irregular. Rather, they
measure the Degree of Irregularity
11. Results and Analysis
- Presents an analysis of the irregularity exhibited by various CUDA kernels
- Investigates the effect of different program inputs
- Effect of optimizations on programs
- Variability of the results between different runs
(i) on the same GPU
(ii) on different GPUs
Benchmarks Used:
Irregular - BFS, Barnes Hut, Data Compression, Delaunay Mesh Refinement,
Points-to Analysis, Survey Propagation, Single Source Shortest Path, TSP
Regular - Black Scholes, Histogram, Monte Carlo, Matrix Multiplication, N-Body
12. Results and Analysis (Continued....)
Amount of Irregularity
- CFI is usually very low. For the above benchmarks it is less than 4.1%
- Most of the programs cannot be strictly classified as regular or irregular
- The two irregularities appear to be independent of each other
- Irregular control flow generally implies irregular memory access
13. Results and Analysis (Continued....)
Input Sensitivity
- Input sensitivity is very difficult to predict
- Difficult to do it in application independent way
(i) Input Oblivious - Irregularity remains largely constant for different inputs
(ii) Input-type Dependent - Irregularity varies largely across different types of inputs
rather than within a single type
(iii) Input Dependent - Irregularity varies as the size of the input varies
14. Results and Analysis (Continued....)
(iii) Arithmetic Precision -
Changing from single precision to double precision increases CFI and MAI for small
inputs but decreases both for medium and large inputs.
However, the change is very small.
- This indicates that a change in arithmetic precision has little effect on the irregularity of
a program.
15. Results and Analysis (Continued....)
Variability
- Observed for several kernels over multiple runs on the same GPU and on different GPUs
Irregularities are quite stable on the same GPU and vary somewhat between distinct
GPUs
16. Conclusion
- Programs cannot be strictly classified as regular or irregular
- Irregularity is not necessarily bad for performance
- By definition, irregular programs are data dependent, but different inputs often yield similar
degrees of irregularity
- Irregularity does not vary much between distinct GPUs
These conclusions are expected to hold across a broad range of CUDA-capable GPUs,
and it is hoped that they will increase the understanding of the behavior of irregular GPU
applications.
17. References
Paper: A Quantitative Study of Irregular Programs on GPUs
By: Rupesh Nasre, Keshav Pingali, Martin Burtscher
Texas State University
Published in: IEEE International Symposium on Workload Characterization (IISWC '13)
18. Results and Analysis (Continued....)
Effect of Optimizations and Arithmetic Precision
(i) The unoptimized version of one program reads records from global memory, whereas the
optimized version calculates the record values on the fly.
- This actually increases the Control Flow Irregularity
- But performance improves because the computations are cheaper than reading the values
from global memory.
(ii) In the optimized Single Source Shortest Path algorithm, nodes that are logically close
to each other are kept close in memory.
- This increases the Memory-Access Irregularity but improves spatial locality