The programmer specifies the global dimensions and the work-group size. The work-group size can often be determined automatically, but it can also be tuned for peak performance.
Wavefront – a hardware-specific concept, similar to the “vector width” of a machine (for example, SSE = 128 bits, AVX = 256 bits).
We want to run a Sobel edge-detect filter on Peaches the dog. The equation represents the kernel that is run on each pixel in the image. Note that the equation examines surrounding pixels to determine the rate of change, i.e., an edge.
To use the HSA Parallel Execution Model, we map the image to a 2D grid. Each pixel is mapped to a work-item, and the grid spans all the pixels in the image. The same kernel is run for each work-item in the grid, but each work-item gets its unique (x, y) coordinate. Using that coordinate, work-items can read the values of the surrounding pixels.
The key observation is that the programmer writes the code for a single pixel in what looks like a single thread, yet the execution model exposes a huge amount of parallelism.
Work-groups provide an additional, optional optimization opportunity. HSA supports special “group” memory that is shared by all work-items in the work-group and is typically very high-performance. In this case, we have read-sharing between neighboring pixels, so we have picked a relatively large square work-group so that each work-item can pull from neighboring pixels.
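As a sketch of the per-work-item computation described above, assuming a simple grayscale image stored as a 2D double array (names like `img` and `edgeAt` are illustrative, not from the presentation):

```java
// Hedged sketch: the Sobel magnitude computation one work-item would run
// at its (x, y) coordinate, reading surrounding pixels to measure the
// rate of change. The HSA grid would launch one such work-item per pixel.
public class Sobel {
    // 3x3 Sobel kernels for horizontal and vertical gradients.
    static final int[][] GX = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static final int[][] GY = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    // One "work-item": edge strength at interior pixel (x, y).
    static double edgeAt(double[][] img, int x, int y) {
        double gx = 0, gy = 0;
        for (int j = -1; j <= 1; j++) {
            for (int i = -1; i <= 1; i++) {
                double p = img[y + j][x + i];
                gx += GX[j + 1][i + 1] * p;
                gy += GY[j + 1][i + 1] * p;
            }
        }
        return Math.sqrt(gx * gx + gy * gy); // large value = strong edge
    }

    public static void main(String[] args) {
        // Tiny test image with a vertical edge: columns 0-2 dark, 3-4 bright.
        double[][] img = new double[5][5];
        for (int y = 0; y < 5; y++)
            for (int x = 0; x < 5; x++)
                img[y][x] = (x >= 3) ? 1.0 : 0.0;
        System.out.println(edgeAt(img, 2, 2)); // strong response at the edge
        System.out.println(edgeAt(img, 1, 2)); // no response in the flat region
    }
}
```

In the HSA model, the nested `x`/`y` loops over the image disappear: the grid supplies the coordinates, and each work-item executes only the body of `edgeAt`.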
Syntax is used to specify the parallel region. The host code is contained in the same binary file as well (i.e., a “fat binary”).
A parallel_for_each program; the runtime is similar to OpenMP.
Java 8 was introduced in March 2014.
Key points: the code computes the percentage of total scores achieved by each player. This is standard Java 8. It uses the Java Stream class and a lambda function. The Player class has a pointer (object reference) to a parent Team.
Stream supports parallel execution, including both multi-core CPUs and GPU accelerators. This is the goal of HSA – make GPU programming as easy as multi-core CPU programming, and easier than multi-core + vector-ISA programming.
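A sketch of the kind of code the slides describe: standard Java 8, a parallel Stream with a lambda, and each Player holding a reference to its parent Team. The class and field names here are illustrative, not the presenter's exact code:

```java
import java.util.Arrays;
import java.util.List;

public class ScorePct {
    static class Team {
        int totalScore;
        Team(int totalScore) { this.totalScore = totalScore; }
    }
    static class Player {
        Team team;          // object reference to the parent Team
        int score;
        double pctOfTotal;  // output: this player's share of the team total
        Player(Team team, int score) { this.team = team; this.score = score; }
    }

    public static void main(String[] args) {
        Team team = new Team(200);
        List<Player> players =
            Arrays.asList(new Player(team, 50), new Player(team, 150));

        // parallelStream() runs on multiple cores today; under HSA the same
        // source could be compiled to HSAIL and offloaded to the GPU.
        players.parallelStream()
               .forEach(p -> p.pctOfTotal = 100.0 * p.score / p.team.totalScore);

        players.forEach(p -> System.out.println(p.pctOfTotal)); // 25.0 then 75.0
    }
}
```

Note how the lambda body dereferences `p.team` – this is the pointer chase that requires shared virtual memory to offload cleanly, as discussed later.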
Note that all functions have been inlined.
This is the code for the Java forEach loop only – we are not translating the entire program into HSAIL.
Line 4 – the kernarg signature. Line 5 – returns the coordinate of this work-item, so we know which player to operate on. Line 10 – multiply, destination operand first. Line 19 – dereference of the team variable.
“Faraway accelerator” → separate address spaces, no shared pointers, and the perf/power cost of copies. With shared virtual memory and platform atomics, HSA systems look more like CPUs with fast vector units than like discrete accelerators.
ISCA Final Presentation - HSAIL
HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): HSAIL VIRTUAL ISA
BEN SANDER, AMD
STATE OF GPU COMPUTING
Separate address spaces
Can’t share pointers
New language required for compute kernel
EX: OpenCL™ runtime API
Compute kernel compiled separately from the host code
Single address space
Fast access from all components
Can share pointers
Bring GPU computing to existing, popular programming languages
Single-source, fully supported by compiler
HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per mm² and per watt
• But: Can be hard to program