2. TOPICS
Introduction and Motivation
HSAIL – what makes it special?
HSAIL Execution Model
How to program in HSAIL?
Conclusion
© Copyright 2014 HSA Foundation. All Rights Reserved
3. STATE OF GPU COMPUTING
Today’s Challenges
Separate address spaces
Copies
Can’t share pointers
New language required for compute kernel
E.g.: OpenCL™ runtime API
Compute kernel compiled separately from host code
Emerging Solution
HSA Hardware
Single address space
Coherent
Virtual
Fast access from all components
Can share pointers
Bring GPU computing to existing, popular,
programming models
Single-source, fully supported by compiler
HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient : high compute density per-mm and per-watt
• But: Can be hard to program
4. THE PORTABILITY CHALLENGE
CPU ISAs
ISA innovations added incrementally (e.g., NEON, AVX)
ISA retains backwards-compatibility with previous generation
Two dominant instruction-set architectures: ARM and x86
GPU ISAs
Massive diversity of architectures in the market
Each vendor has its own ISA – and often several in the market at the same time
No commitment (or attempt!) to provide any backwards compatibility
Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
6. WHAT IS HSAIL?
Intermediate language for parallel compute in HSA
Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
Expresses parallel regions of code
Binary format of HSAIL is called “BRIG”
Goal: Bring parallel acceleration to mainstream programming languages
main() {
…
#pragma omp parallel for
for (int i=0;i<N; i++) {
}
…
}
[Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
7. KEY HSAIL FEATURES
Parallel
Shared virtual memory
Portable across vendors in HSA Foundation
Stable across multiple product generations
Consistent numerical results (IEEE-754 with defined min accuracy)
Fast, robust, simple finalization step (no monthly updates)
Good performance (little need to write in ISA)
Supports all of OpenCL™
Supports Java, C++, and other languages as well
8. HSAIL INSTRUCTION SET - OVERVIEW
Similar to assembly language for a RISC CPU
Load-store architecture
Destination register first, then source registers
140 opcodes (Java™ bytecode has 200)
Floating point (single, double, half (f16))
Integer (32-bit, 64-bit)
Some packed operations
Branches
Function calls
Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
Synchronize host CPU and HSA Component!
Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6+120)
add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24
9. SEGMENTS AND MEMORY (1/2)
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
Memory instructions can (optionally) specify a segment
Control data sharing properties and communicate intent
Global Segment
Visible to all HSA agents (including host CPU)
Group Segment
Provides high-performance memory shared in the work-group.
Group memory can be read and written by any work-item in the work-group
HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
10. SEGMENTS AND MEMORY (2/2)
Spill, Private, Arg Segments
Represent different regions of a per-work-item stack
Typically generated by compiler, not specified by programmer
Compiler can use these to convey intent – e.g., spills
Kernarg Segment
Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment
Remains constant during execution of kernel
11. FLAT ADDRESSING
Each segment mapped into virtual address space
Flat addresses can map to segments based on virtual address
Instructions with no explicit segment use flat addressing
Very useful for high-level language support (e.g., classes, libraries)
Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
12. REGISTERS
Four classes of registers:
S: 32-bit, Single-precision FP or Int
D: 64-bit, Double-precision FP or Long Int
Q: 128-bit, Packed data.
C: 1-bit, Control Registers (Compares)
Fixed number of registers
S, D, Q share a single pool of resources
S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)
Register allocation done in high-level compiler
Finalizer doesn’t perform expensive register allocation
[Register file diagram: $c0–$c7 control registers; $s0–$s127, $d0–$d63, and $q0–$q31 alias one shared register pool]
13. SIMT EXECUTION MODEL
HSAIL presents a “SIMT” execution model to the programmer
“Single Instruction, Multiple Thread”
Programmer writes program for a single thread of execution
Each work-item appears to have its own program counter
Branch instructions look natural
Hardware Implementation
Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
Actually one program counter for the entire SIMD instruction
Branches implemented with predication
SIMT Advantages
Easier to program (branch code in particular)
Natural path for mainstream programming models and existing compilers
Scales across a wide variety of hardware (programmer doesn’t see vector width)
Cross-lane operations available for those who want peak performance
14. WAVEFRONTS
Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
Lanes in wavefront can be “active” or “inactive”
Inactive lanes consume hardware resources but don’t do useful work
Tradeoffs
“Wavefront-aware” programming can be useful for peak performance
But results in less portable code (since wavefront width is encoded in algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
15. CROSS-LANE OPERATIONS
Example HSAIL cross-lane operation: “activelaneid”
Dest set to count of earlier work-items that are active for this instruction
Useful for compaction algorithms
Example HSAIL cross-lane operation: “activelaneshuffle”
Each work-item reads a value from another lane in the wavefront
Supports selection of “identity” element for inactive lanes
Useful for wavefront-level reductions
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0
// s0 = dest, s1= source, s2=lane select, no identity
activelaneid_u32 $s0
16. HSAIL MODES
The working group strove to limit optional modes and features in HSAIL
Minimize differences between HSA target machines
Better for compiler vendors and application developers
Two modes survived
Machine Models
Small: 32-bit pointers, 32-bit data
Large: 64-bit pointers, 32-bit or 64-bit data
Vendors can support one or both models
“Base” and “Full” Profiles
Two sets of requirements for FP accuracy, rounding, exception reporting, hard preemption
17. HSA PROFILES
Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
19. HSA PARALLEL EXECUTION MODEL
Basic idea:
Programmer supplies an HSAIL “kernel” that is run on each work-item.
Kernel is written as a single thread of execution.
Programmer specifies grid dimensions (scope of problem) when launching the kernel.
Each work-item has a unique coordinate in the grid.
Programmer optionally specifies work-group dimensions (for optimized communication).
20. CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
[ -2 0 +2 ]
[ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
[ 0 0 0 ]
[ +1 +2 +1 ]
G = sqrt(Gx^2 + Gy^2)
21. CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
[ -2 0 +2 ]
[ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
[ 0 0 0 ]
[ +1 +2 +1 ]
G = sqrt(Gx^2 + Gy^2)
[Diagram: the image mapped to a 2D grid of work-items, each running the kernel]
22. CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
[ -2 0 +2 ]
[ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
[ 0 0 0 ]
[ +1 +2 +1 ]
G = sqrt(Gx^2 + Gy^2)
[Diagram: the 2D grid tiled by 2D work-groups; each work-item runs the kernel]
23. HOW TO PROGRAM HSA?
WHAT DO I TYPE?
24. HSA PROGRAMMING MODELS : CORE PRINCIPLES
Single source
Host and device code side-by-side in same source file
Written in same programming language
Single unified coherent address space
Freely share pointers between host and device
Similar memory model as multi-core CPU
Parallel regions identified with existing language syntax
Typically same syntax used for multi-core CPU
HSAIL is the compiler IR that supports these programming models
25. GCC OPENMP : COMPILATION FLOW
SUSE GCC Project
Adding HSAIL code generator to GCC compiler infrastructure
Supports OpenMP 3.1 syntax
No data movement directives required!
main() {
…
// Host code.
#pragma omp parallel for
for (int i=0;i<N; i++) {
C[i] = A[i] + B[i];
}
…
}
[Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
26. GCC OpenMP flow
C/C++/Fortran OpenMP application
e.g., #pragma omp for
for( j = 0; j<n;j++) { b[j] = a[j]; }
Compile time – GNU Compiler (GCC):
Compiles host code + emits runtime calls with kernel name, parameters, launch attributes
Lowers OpenMP directives, converts GIMPLE to BRIG
Embeds BRIG into host code
Run time:
Pragmas map to calls into HSA Runtime
Dispatch kernel to GPU
Finalize kernel from BRIG->ISA; kernels finalized once and cached
27. MCW C++AMP : COMPILATION FLOW
C++AMP : Single-source C++ template parallel programming model
MCW compiler based on CLANG/LLVM
Open-source and runs on Linux
Leverage open-source LLVM->HSAIL code generator
main() {
…
parallel_for_each(grid<1>(extent<256>(…)
…
}
[Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
28. JAVA: RUNTIME FLOW
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings Stream + Lambda API.
‒ More natural way of expressing data parallel algorithms
‒ Initially targeted at multi-core.
APARAPI will:
‒ Support Java 8 Lambdas
‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
[Stack: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU/GPU]
Future Java – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to the Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses GRAAL compiler to generate HSAIL
[Stack: Java Application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → HSA Finalizer & Runtime → CPU/GPU]
29. AN EXAMPLE (IN JAVA 8)
//Example computes the percentage of total scores achieved by each player on a team.
class Player {
private Team team; // Note: Reference to the parent Team.
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume "allPlayers" is an initialized array of Players.
Arrays.stream(allPlayers). // wrap the array in a stream
parallel(). // developer indication that lambda is thread-safe
forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});
30. HSAIL CODE EXAMPLE
01: version 0:95: $full : $large;
02: // static method HotSpotMethod<Main.lambda$2(Player)>
03: kernel &run (
04: kernarg_u64 %_arg0 // Kernel signature for lambda method
05: ) {
06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord
08:
09: cvt_u64_s32 $d2, $s2; // Convert X gid to long
10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
11: add_u64 $d2, $d2, 24; // Adjust for actual elements start
12: add_u64 $d2, $d2, $d6; // Add to array ref ptr
13: ld_global_u64 $d6, [$d2]; // Load from array element into reg
14: @L0:
15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
16: mov_b64 $d3, $d0;
17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()
18: cvt_f32_s32 $s16, $s3;
19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
20: cvt_f32_s32 $s17, $s0;
21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
23: ret;
24: };
31. HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
32. HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
[Flow: HSAIL Assembler → BRIG → Finalizer → Machine ISA]
• HSAIL has a text format and an assembler
33. OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
[Flow: CLOC → BRIG → Finalizer → Machine ISA]
• OpenCL split-source model cleanly isolates kernel
• Can express many HSAIL features in OpenCL Kernel Language
• Higher productivity than writing in HSAIL assembly
• Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
• Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
34. KEY TAKEAWAYS
HSAIL
Thin, robust, fast finalizer
Portable (multiple HW vendors and parallel architectures)
Supports shared virtual memory and platform atomics
HSA brings GPU computing to mainstream programming models
Shared and coherent memory bridges “faraway accelerator” gap
HSAIL provides the common IL for high-level languages to benefit from
parallel computing
Languages and Compilers
HSAIL support in GCC, LLVM, Java JVM
Leverage same language syntax designed for multi-core CPUs
Can use pointer-containing data structures
Editor's Notes
Proofpoints: Green500 – the top end is now heterogeneous computing solutions.
Components = CPU Cores, 3rd Party IP, GPU cores. CPU “Spill” = register spill.
Some tension between programmability and HW implementation. Explain how SIMD hardware uses masks to run both halves of the if-then-else statements. Virtual ISA is SIMT (one thread), but can use cross-lane operations.
Non-portable code.
Other cross-lane operations: countlane, countuplane, masklane, sendlane, receivelane
HSA supports IEEE-defined precisions.
F16 supported in both profiles
F64 supported only in Full profile.
Base relaxes some precision and rounding requirements.
FMA operation not required. Familiar to folks used to GPUs.
Grid contains work-groups; work-group contains workitems.
Programmer specifies the global dims and the work-group ;
Work-group can often be determined automatically but also can be tuned for peak performance.
Wavefront – hardware-specific concept, similar to the “vector width” of a machine (for example SSE=128, AVX=256). We want to run a Sobel edge detect filter on Peaches the dog. The equation represents the kernel that is run on each pixel in the image. Note the equation examines surrounding pixels to determine the rate of change, i.e. an edge. To use the HSA Parallel Execution Model, we map the image to a 2D grid. Each pixel is mapped to a work-item, and the grid specifies all the pixels in the image. The same kernel is run for each work-item in the grid, but each work-item gets its unique (x,y) coordinate. By using the coordinate, work-items can read the values of the surrounding pixels.
Key observation is that the programmer writes the code for a single pixel in what looks like a single thread, and the execution model exposes a huge amount of parallelism.
Work-groups provide an additional optional optimization opportunity. HSA supports special “group” memory that is shared by all work-items in the work-group, and is typically very high-performance. In this case, we have read-sharing between neighboring pixels, so we have picked a relatively large square grid so that each work-item can pull from neighboring pixels.
This syntax is used to specify the parallel region.
Host code is contained in the binary file as well (i.e. a fat binary). Parallel_for_each program.
Runtime similar to OpenMP. Java 8 introduced in March 2014. Key points:
Code computes percentage of total scores achieved by each player.
This is in Java.
Standard Java8. Uses Java Stream class, and a lambda function.
Player class has a pointer (object reference) to a parent Team.
Stream supports parallel execution, including both multi-core and GPU accelerators. This is the goal of HSA – make GPU programming as easy as multi-core CPU programming, and easier than multi-core + vector ISA programming.
Note all functions have been inlined. This is the code for the Java ForEach loop – not translating the entire code into HSAIL.
Line4 – kernarg signature.
Line7 – returns the coordinate of this work-item, so we know which player to operate on.
Line10 – multiply, destination first.
Line19 – dereference of the team variable.
“Faraway accelerator” -> Separate address spaces, no pointers, perf/power cost of copies
Shared virtual mem and platform atomics – HSA systems looks more like CPUs with fast vector units than discrete accelerators.