SlideShare a Scribd company logo
1 of 32
Download to read offline
HSAIL PROGRAMMER REF GUIDE:
UNCOVERED
BEN SANDER
AMD SENIOR FELLOW
STATE OF GPU COMPUTING
• GPUs are fast and power efficient : high compute density per-mm and per-watt
• But: Can be hard to program
Today’s Challenges


Emerging Solution

Separate address spaces



HSA Hardware



Copies



Single address space



Can’t share pointers



Coherent



Virtual



Fast access from all components



Can share pointers

PCIe



New language required for compute kernel



EX: OpenCL™ runtime API
Compute kernel compiled separately than
host code



Bring GPU computing to existing, popular,
programming models


Single-source, fully supported by
compiler



HSAIL compiler IR (Cross-platform!)
WHAT IS HSAIL?


HSAIL is the intermediate language for parallel compute in HSA








Generated by a high level compiler (LLVM, gcc, Java VM, etc)
Low-level IR, close to machine ISA level
Compiled down to target ISA by an IHV “Finalizer”
Finalizer may execute at run time, install time, or build time

Example: OpenCL™ Compilation Stack using HSAIL

High-Level Compiler Flow (Developer)
OpenCL™ Kernel
EDG or CLANG
SPIR
LLVM
HSAIL

© Copyright 2012 HSA Foundation. All Rights Reserved.

Finalizer Flow (Runtime)

HSAIL
Finalizer
Hardware ISA

3
KEY HSAIL FEATURES


Parallel



Shared virtual memory



Portable across vendors in HSA Foundation



Stable across multiple product generations



Consistent numerical results (IEEE-754 with defined min accuracy)



Fast, robust, simple finalization step (no monthly updates)



Good performance (little need to write in ISA)



Supports all of OpenCL™ and C++ AMP



Support Java, C++, and other languages as well

© Copyright 2012 HSA Foundation. All Rights Reserved.

4
AGENDA


Introduction



HSA Parallel Execution Model



HSAIL Instruction Set



Example – in Java!



Key Takeaways
HSA PARALLEL EXECUTION
MODEL

© Copyright 2012 HSA Foundation. All Rights Reserved.

6
CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1
[ -2
[ -1

0 +1 ]
0 +2 ]
0 +1 ]

Gy = [ -1 -2 -1 ]
[ 0 0 0]
[ +1 +2 +1 ]

G = sqrt(Gx2 + Gy2)
CONVOLUTION / SOBEL EDGE FILTER
2D grid

workitem
kernel
Gx = [ -1
[ -2
[ -1

0 +1 ]
0 +2 ]
0 +1 ]

Gy = [ -1 -2 -1 ]
[ 0 0 0]
[ +1 +2 +1 ]

G = sqrt(Gx2 + Gy2)
CONVOLUTION / SOBEL EDGE FILTER
2D grid

2D work-group
workitem
kernel
Gx = [ -1
[ -2
[ -1

0 +1 ]
0 +2 ]
0 +1 ]

Gy = [ -1 -2 -1 ]
[ 0 0 0]
[ +1 +2 +1 ]

G = sqrt(Gx2 + Gy2)
HSA PARALLEL EXECUTION MODEL
Basic Idea:
Programmer supplies a
“kernel” that is run on
each work-item. Kernel is
written as a single thread
of execution.
Each work-item has a
unique coordinate.

Programmer specifies grid
dimensions (for scope of
problem).
Programmer optionally
specifies work-group
dimensions (for optimized
communication).
© Copyright 2012 HSA Foundation. All Rights Reserved.

10
HSAIL INSTRUCTION SET

© Copyright 2012 HSA Foundation. All Rights Reserved.

11
HSAIL INSTRUCTION SET - OVERVIEW


Similar to assembly language for a RISC CPU


Load-store architecture
ld_global_u64
$d0, [$d6 + 120]

; $d0= load($d6+120)



add_u64

; $d1= $d2+24





$d1, $d2, 24

136 opcodes (Java bytecode has 200)


Floating point (single, double, half (f16))



Integer (32-bit, 64-bit)



Some packed operations



Branches



Function calls



Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas




Synchronize host CPU and HSA Component!

Text and Binary formats (“BRIG”)
SEGMENTS AND MEMORY (1/2)


7 segments of memory






global, readonly, group, spill, private, arg, kernarg,

Memory instructions can (optionally) specify a segment

Global Segment




Visible to all HSA agents (including host CPU)

ld_global_u64 $d0, [$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]

Group Segment



Group memory can be read and written by any work-item in the work-group



HSAIL provides sync operations to control visibility of group memory





Provides high-performance memory shared in the work-group.

Useful for expert programmers

Spill, Private, Arg Segments


Represent different regions of a per-work-item stack



Typically generated by compiler, not specified by programmer



Compiler can use these to convey intent – ie spills

© Copyright 2012 HSA Foundation. All Rights Reserved.

13
SEGMENTS AND MEMORY (2/2)


Kernarg Segment




Read-Only Segment




Programmer writes kernarg segment to pass arguments to a kernel

Remains constant during execution of kernel

Flat Addressing


Each segment mapped into virtual address space


Flat addresses can map to segments based on virtual address



Instructions with no explicit segment use flat addressing



Very useful for high-level language support (ie classes, libraries)



Aligns well with OpenCL™ 2.0 “generic” addressing feature

ld_kernarg_u64 $d6, [%_arg0]
ld_u64 $d0,[$d6+24] ; flat

© Copyright 2012 HSA Foundation. All Rights Reserved.

14
REGISTERS


Four classes of registers



S: 32-bit, Single-precision FP or Int



D: 64-bit, Double-precision FP or Long Int





C: 1-bit, Control Registers

Q: 128-bit, Packed data.

Fixed number of registers:


8C



S, D, Q share a single pool of resources





S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)

Register allocation done in high-level compiler


Finalizer doesn’t have to perform expensive register allocation

© Copyright 2012 HSA Foundation. All Rights Reserved.

15
SIMT EXECUTION MODEL


HSAIL Presents a “SIMT” execution model to the programmer



Programmer writes program for a single thread of execution



Each work-item appears to have its own program counter





“Single Instruction, Multiple Thread”

Branch instructions look natural

Hardware Implementation



Actually one program counter for the entire SIMD instruction





Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
Branches implemented with HW-supported predication

SIMT Advantages


Easier to program (branch code in particular)



Natural path for mainstream programming models



Scales across a wide variety of hardware (programmer doesn’t see vector width)



Cross-lane operations available for those who want peak performance

© Copyright 2012 HSA Foundation. All Rights Reserved.

16
WAVEFRONTS


Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, or 64 “lanes”



Workgroups mapped to one or more wavefronts




For efficient execution the workgroup size is an integer multiple of the wavefront size

Lanes in wavefront can be “active” or “inactive”
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}



Inactive lanes consume hardware resources but don’t do useful work



Tradeoffs


“Wavefront-aware” programming can be useful for peak performance



But results in less portable code (since wavefront width is encoded in algorithm)

© Copyright 2012 HSA Foundation. All Rights Reserved.

17
CROSS-LANE OPERATIONS



Example HSAIL operation: “countlane”


Dest set to the number of work-items in current wavefront that have non-zero source

countlane_u32


$s0, $s6

Example HSAIL operation: “countuplane”


Dest set to count of earlier work-items that are active for this instruction



Useful for compaction algorithms

countuplane_u32 $s0

© Copyright 2012 HSA Foundation. All Rights Reserved.

18
HSAIL MODES


Working group strived to limit optional modes and features in HSAIL



Better for compiler vendors and application developers





Minimize differences between HSA target machines
Two modes survived

Machine Models



Large: 64-bit pointers, 32-bit or 64-bit data





Small: 32-bit pointers, 32-bit data
Vendors can support one or both models

“Base” and “Full” Profiles


Two sets of requirements for FP accuracy, rounding, exception reporting

© Copyright 2012 HSA Foundation. All Rights Reserved.

19
HSA PROFILES
Feature
Addressing Modes
Load/store conversion of all floating point
types (f16, f32, f64)
All 32-bit HSAIL operations according to the
declared profile
F16 support (IEEE 754 or better)
F64 support

Base
Small, Large

Full
Small, Large

Yes

Yes

Yes
Yes
No

Yes
Yes
Yes

Precision for add/sub/mul
Precision for div
Precision for sqrt
HSAIL Rounding: Near
HSAIL Rounding: Up
HSAIL Rounding: Down
HSAIL Rounding: Zero
Subnormal floating-point
Propagate NaN Payloads
FMA

1/2 ULP
2.5 ULP
1 ULP
Yes
No
No
No
Flush-to-zero
No
No

Arithmetic Exception reporting
Debug trap

DETECT
Yes

1/2 ULP
1/2 ULP
1/2 ULP
Yes
Yes
Yes
Yes
Supported
Yes
Yes
DETECT or
BREAK
Yes

© Copyright 2012 HSA Foundation. All Rights Reserved.

20
EXAMPLE

© Copyright 2012 HSA Foundation. All Rights Reserved.

21
AN EXAMPLE (IN JAVA 8)
class Player {
private Team team;
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(int pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume “allPlayers’ is an initialized array of Players..
Stream<Player> s = Arrays.stream(allPlayers).parallel();
s.forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});

© Copyright 2012 HSA Foundation. All Rights Reserved.

22
HSAIL CODE EXAMPLE
01:
02:
03:
04:
05:
06:
07:
08:
09:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:

version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
kernarg_u64 %_arg0
// Kernel signature for lambda method
) {
ld_kernarg_u64 $d6, [%_arg0];
// Move arg to an HSAIL register
workitemabsid_u32 $s2, 0;
// Read the work-item global “X” coord
cvt_u64_s32
mul_u64 $d2,
add_u64 $d2,
add_u64 $d2,
ld_global_u64
@L0:
ld_global_u64
mov_b64
ld_global_s32
cvt_f32_s32
ld_global_s32
cvt_f32_s32
div_f32
st_global_f32
ret;
};

$d2,
$d2,
$d2,
$d2,
$d6,

$s2;
8;
24;
$d6;
[$d2];

$d0, [$d6 + 120];
$d3, $d0;
$s3, [$d6 + 40];
$s16, $s3;
$s0, [$d0 + 24];
$s17, $s0;
$s16, $s16, $s17;
$s16, [$d6 + 100];

© Copyright 2012 HSA Foundation. All Rights Reserved.

//
//
//
//
//

Convert X gid to long
Adjust index for sizeof ref
Adjust for actual elements start
Add to array ref ptr
Load from array element into reg

// p.getTeam()
// p.getScores ()
// Team getScores()

// p.getScores()/teamScores
// p.setPctOfTeamScores()

23
WHAT IS HSAIL?


HSAIL is the intermediate language for parallel compute in HSA








Generated by a high level compiler (LLVM, gcc, Java VM, etc)
Low-level IR, close to machine ISA level
Compiled down to target ISA by an IHV “Finalizer”
Finalizer may execute at run time, install time, or build time

Example: OpenCL™ Compilation Stack using HSAIL

High-Level Compiler Flow (Developer)
OpenCL™ Kernel
EDG or CLANG
SPIR
LLVM
HSAIL

© Copyright 2012 HSA Foundation. All Rights Reserved.

Finalizer Flow (Runtime)

HSAIL
Finalizer
Hardware ISA

24
HSAIL AND SPIR
Feature

Intended Users

SPIR
Compiler developers who want a fast
Compiler developers who want to path to acceleration across a wide
control their own code generation. variety of devices.

IR Level

Low-level, just above the machine
instruction set

Back-end code generation

Thin, fast, robust.

Where are compiler
optimizations performed?
Registers
SSA Form
Binary format
Code generator for LLVM

Most done in high-level compiler,
before HSAIL generation.
Fixed-size register pool
No
Yes
Yes

Back-end device targets

Memory Model

HSAIL

Modern GPU architectures
supported by members of the HSA
Foundation
Relaxed consistency with
acquire/release, barriers, and finegrained barriers

© Copyright 2012 HSA Foundation. All Rights Reserved.

High-level, just below LLVM-IR
Flexible. Can include many
optimizations and compiler
transformation including register
allocation.
Most done in back-end code generator,
between SPIR and device machine
instruction set
Infinite
Yes
Yes
Yes
Any OpenCL device including GPUs,
CPUs, FPGAs
Flexible. Can support the OpenCL 1.2
Memory Model
25
TAKEAWAYS


HSAIL Key Points



Portable (multiple HW vendors and parallel architectures)





Thin, robust, fast finalizer
Supports shared virtual memory and platform atomics

HSA brings GPU computing to mainstream programming models


Shared and coherent memory bridges “faraway accelerator” gap







HSAIL provides the common IL for high-level languages to benefit from parallel
computing. HSAIL is a building block.
HSA programmers will use popular programming models, with pointers & memory
coherency

Java Example


Unmodified Java8 accelerated on the GPU!



Can use pointer-containing data structures

© Copyright 2012 HSA Foundation. All Rights Reserved.

26
TOOLS ARE AVAILABLE NOW


HSA Programmer’s Reference Manual: HSAIL Virtual ISA and
Programming Model, Compiler Writer’s Guide, and Object Format (BRIG)





http://hsafoundation.com/standards/
https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal

Tools now at GitHUB – HSA Foundation


libHSA Assembler and Disassembler




https://github.com/HSAFoundation/HSAIL-Tools

HSAIL Instruction Set Simulator


https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator



Soon: LLVM Compilation stack which outputs HSAIL and BRIG



Java compiler for HSAIL (preliminary)


http://openjdk.java.net/projects/sumatra/)



http://openjdk.java.net/projects/graal/

© Copyright 2012 HSA Foundation. All Rights Reserved.

27
BACKUP

© Copyright 2012 HSA Foundation. All Rights Reserved.

28
AN EXAMPLE (IN OPENCL™)
//Vector add
// A[0:N-1] = B[0:N-1] + C[0:N-1]
__kernel void vec_add (
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
// Get our global thread ID
int id = get_global_id(0);
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}

© Copyright 2012 HSA Foundation. All Rights Reserved.

29
HSAIL VECTOR ADD
version 1:0:$full:$small;
function &get_global_id(arg_u32 %ret_val)
(arg_u32 %arg_val0);
function &abort() ();
kernel &__OpenCL_vec_add_kernel(
kernarg_u32 %arg_val0,
kernarg_u32 %arg_val1,
kernarg_u32 %arg_val2,
kernarg_u32 %arg_val3)
{
@__OpenCL_vec_add_kernel_entry:
// BB#0: // %entry
ld_kernarg_u32 $s0, [%arg_val3];
workitemabsid_u32 $s1, 0;
cmp_lt_b1_u32 $c0, $s1, $s0;
ld_kernarg_u32 $s0, [%arg_val2];
ld_kernarg_u32 $s2, [%arg_val1];
ld_kernarg_u32 $s3, [%arg_val0];
cbr $c0, @BB0_2;
brn @BB0_1;

© Copyright 2012 HSA Foundation. All Rights Reserved.

@BB0_1: // %if.end
ret;
@BB0_2: // %if.then
shl_u32 $s1, $s1, 2;
add_u32 $s2, $s2, $s1;
ld_global_f32 $s2, [$s2];
add_u32 $s3, $s3, $s1;
ld_global_f32 $s3, [$s3];
add_f32 $s2, $s3, $s2;
add_u32 $s0, $s0, $s1;
st_global_f32 $s2, [$s0];
brn @BB0_1;
};

30
SEGMENTS

© Copyright 2012 HSA Foundation. All Rights Reserved.

31
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical
inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including
but not limited to product and roadmap changes, component and motherboard version changes, new model and/or
product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware
upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However,
AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR
PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are
trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a trademark
of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

32 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013

More Related Content

What's hot

ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - IntroHSA Foundation
 
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...HSA Foundation
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
ISCA final presentation - Queuing Model
ISCA final presentation - Queuing ModelISCA final presentation - Queuing Model
ISCA final presentation - Queuing ModelHSA Foundation
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...AMD Developer Central
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...AMD Developer Central
 
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...AMD Developer Central
 
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...AMD Developer Central
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMDHSA Foundation
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningAMD Developer Central
 
HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA HSA Foundation
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...AMD Developer Central
 
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...AMD Developer Central
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahAMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & FutureOfer Rosenberg
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overviewinside-BigData.com
 
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...AMD Developer Central
 

What's hot (20)

ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - Intro
 
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
ISCA final presentation - Queuing Model
ISCA final presentation - Queuing ModelISCA final presentation - Queuing Model
ISCA final presentation - Queuing Model
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
Hsa10 whitepaper
Hsa10 whitepaperHsa10 whitepaper
Hsa10 whitepaper
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
 
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
 
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
 
HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA HSAemu a Full System Emulator for HSA
HSAemu a Full System Emulator for HSA
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overview
 
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
 

Viewers also liked

An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 

Viewers also liked (6)

Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 

Similar to HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander

HSA From A Software Perspective
HSA From A Software Perspective HSA From A Software Perspective
HSA From A Software Perspective HSA Foundation
 
Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glchangehee lee
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio Owen Wu
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasAMD Developer Central
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Johan Andersson
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Igalia
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8AbdullahMunir32
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Johan Andersson
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Fel Flyer F11
Fel Flyer F11Fel Flyer F11
Fel Flyer F11chitlesh
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 

Similar to HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander (20)

HSA From A Software Perspective
HSA From A Software Perspective HSA From A Software Perspective
HSA From A Software Perspective
 
Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_gl
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Fel Flyer F11
Fel Flyer F11Fel Flyer F11
Fel Flyer F11
 
Gpgpu
GpgpuGpgpu
Gpgpu
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Software defined radio
Software defined radioSoftware defined radio
Software defined radio
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 

More from AMD Developer Central

Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14AMD Developer Central
 
Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14AMD Developer Central
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...AMD Developer Central
 

More from AMD Developer Central (20)

Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14
 
Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander

  • 1. HSAIL PROGRAMMER REF GUIDE: UNCOVERED BEN SANDER AMD SENIOR FELLOW
  • 2. STATE OF GPU COMPUTING • GPUs are fast and power efficient : high compute density per-mm and per-watt • But: Can be hard to program Today’s Challenges  Emerging Solution Separate address spaces  HSA Hardware  Copies  Single address space  Can’t share pointers  Coherent  Virtual  Fast access from all components  Can share pointers PCIe  New language required for compute kernel   EX: OpenCL™ runtime API Compute kernel compiled separately than host code  Bring GPU computing to existing, popular, programming models  Single-source, fully supported by compiler  HSAIL compiler IR (Cross-platform!)
  • 3. WHAT IS HSAIL?  HSAIL is the intermediate language for parallel compute in HSA      Generated by a high level compiler (LLVM, gcc, Java VM, etc) Low-level IR, close to machine ISA level Compiled down to target ISA by an IHV “Finalizer” Finalizer may execute at run time, install time, or build time Example: OpenCL™ Compilation Stack using HSAIL High-Level Compiler Flow (Developer) OpenCL™ Kernel EDG or CLANG SPIR LLVM HSAIL © Copyright 2012 HSA Foundation. All Rights Reserved. Finalizer Flow (Runtime) HSAIL Finalizer Hardware ISA 3
  • 4. KEY HSAIL FEATURES  Parallel  Shared virtual memory  Portable across vendors in HSA Foundation  Stable across multiple product generations  Consistent numerical results (IEEE-754 with defined min accuracy)  Fast, robust, simple finalization step (no monthly updates)  Good performance (little need to write in ISA)  Supports all of OpenCL™ and C++ AMP  Support Java, C++, and other languages as well © Copyright 2012 HSA Foundation. All Rights Reserved. 4
  • 5. AGENDA  Introduction  HSA Parallel Execution Model  HSAIL Instruction Set  Example – in Java!  Key Takeaways
  • 6. HSA PARALLEL EXECUTION MODEL © Copyright 2012 HSA Foundation. All Rights Reserved. 6
  • 7. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 [ -2 [ -1 0 +1 ] 0 +2 ] 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0] [ +1 +2 +1 ] G = sqrt(Gx2 + Gy2)
  • 8. CONVOLUTION / SOBEL EDGE FILTER 2D grid workitem kernel Gx = [ -1 [ -2 [ -1 0 +1 ] 0 +2 ] 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0] [ +1 +2 +1 ] G = sqrt(Gx2 + Gy2)
  • 9. CONVOLUTION / SOBEL EDGE FILTER 2D grid 2D work-group workitem kernel Gx = [ -1 [ -2 [ -1 0 +1 ] 0 +2 ] 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0] [ +1 +2 +1 ] G = sqrt(Gx2 + Gy2)
  • 10. HSA PARALLEL EXECUTION MODEL Basic Idea: Programmer supplies a “kernel” that is run on each work-item. Kernel is written as a single thread of execution. Each work-item has a unique coordinate. Programmer specifies grid dimensions (for scope of problem). Programmer optionally specifies work-group dimensions (for optimized communication). © Copyright 2012 HSA Foundation. All Rights Reserved. 10
  • 11. HSAIL INSTRUCTION SET © Copyright 2012 HSA Foundation. All Rights Reserved. 11
  • 12. HSAIL INSTRUCTION SET - OVERVIEW  Similar to assembly language for a RISC CPU  Load-store architecture ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)  add_u64 ; $d1= $d2+24   $d1, $d2, 24 136 opcodes (Java bytecode has 200)  Floating point (single, double, half (f16))  Integer (32-bit, 64-bit)  Some packed operations  Branches  Function calls  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas   Synchronize host CPU and HSA Component! Text and Binary formats (“BRIG”)
  • 13. SEGMENTS AND MEMORY (1/2)  7 segments of memory    global, readonly, group, spill, private, arg, kernarg, Memory instructions can (optionally) specify a segment Global Segment   Visible to all HSA agents (including host CPU) ld_global_u64 $d0, [$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] Group Segment   Group memory can be read and written by any work-item in the work-group  HSAIL provides sync operations to control visibility of group memory   Provides high-performance memory shared in the work-group. Useful for expert programmers Spill, Private, Arg Segments  Represent different regions of a per-work-item stack  Typically generated by compiler, not specified by programmer  Compiler can use these to convey intent – ie spills © Copyright 2012 HSA Foundation. All Rights Reserved. 13
  • 14. SEGMENTS AND MEMORY (2/2)  Kernarg Segment   Read-Only Segment   Programmer writes kernarg segment to pass arguments to a kernel Remains constant during execution of kernel Flat Addressing  Each segment mapped into virtual address space  Flat addresses can map to segments based on virtual address  Instructions with no explicit segment use flat addressing  Very useful for high-level language support (ie classes, libraries)  Aligns well with OpenCL™ 2.0 “generic” addressing feature ld_kernarg_u64 $d6, [%_arg0] ld_u64 $d0,[$d6+24] ; flat © Copyright 2012 HSA Foundation. All Rights Reserved. 14
  • 15. REGISTERS  Four classes of registers   S: 32-bit, Single-precision FP or Int  D: 64-bit, Double-precision FP or Long Int   C: 1-bit, Control Registers Q: 128-bit, Packed data. Fixed number of registers:  8C  S, D, Q share a single pool of resources    S + 2*D + 4*Q <= 128 Up to 128 S or 64 D or 32 Q (or a blend) Register allocation done in high-level compiler  Finalizer doesn’t have to perform expensive register allocation © Copyright 2012 HSA Foundation. All Rights Reserved. 15
  • 16. SIMT EXECUTION MODEL  HSAIL Presents a “SIMT” execution model to the programmer   Programmer writes program for a single thread of execution  Each work-item appears to have its own program counter   “Single Instruction, Multiple Thread” Branch instructions look natural Hardware Implementation   Actually one program counter for the entire SIMD instruction   Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency Branches implemented with HW-supported predication SIMT Advantages  Easier to program (branch code in particular)  Natural path for mainstream programming models  Scales across a wide variety of hardware (programmer doesn’t see vector width)  Cross-lane operations available for those who want peak performance © Copyright 2012 HSA Foundation. All Rights Reserved. 16
  • 17. WAVEFRONTS  Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, or 64 “lanes”  Workgroups mapped to one or more wavefronts   For efficient execution the workgroup size is an integer multiple of the wavefront size Lanes in wavefront can be “active” or “inactive” if (cond) { operationA; // cond=True lanes active here } else { operationB; // cond=False lanes active here }  Inactive lanes consume hardware resources but don’t do useful work  Tradeoffs  “Wavefront-aware” programming can be useful for peak performance  But results in less portable code (since wavefront width is encoded in algorithm) © Copyright 2012 HSA Foundation. All Rights Reserved. 17
  • 18. CROSS-LANE OPERATIONS  Example HSAIL operation: “countlane”  Dest set to the number of work-items in current wavefront that have non-zero source countlane_u32  $s0, $s6 Example HSAIL operation: “countuplane”  Dest set to count of earlier work-items that are active for this instruction  Useful for compaction algorithms countuplane_u32 $s0 © Copyright 2012 HSA Foundation. All Rights Reserved. 18
  • 19. HSAIL MODES  Working group strived to limit optional modes and features in HSAIL   Better for compiler vendors and application developers   Minimize differences between HSA target machines Two modes survived Machine Models   Large: 64-bit pointers, 32-bit or 64-bit data   Small: 32-bit pointers, 32-bit data Vendors can support one or both models “Base” and “Full” Profiles  Two sets of requirements for FP accuracy, rounding, exception reporting © Copyright 2012 HSA Foundation. All Rights Reserved. 19
  • 20. HSA PROFILES Feature Addressing Modes Load/store conversion of all floating point types (f16, f32, f64) All 32-bit HSAIL operations according to the declared profile F16 support (IEEE 754 or better) F64 support Base Small, Large Full Small, Large Yes Yes Yes Yes No Yes Yes Yes Precision for add/sub/mul Precision for div Precision for sqrt HSAIL Rounding: Near HSAIL Rounding: Up HSAIL Rounding: Down HSAIL Rounding: Zero Subnormal floating-point Propagate NaN Payloads FMA 1/2 ULP 2.5 ULP 1 ULP Yes No No No Flush-to-zero No No Arithmetic Exception reporting Debug trap DETECT Yes 1/2 ULP 1/2 ULP 1/2 ULP Yes Yes Yes Yes Supported Yes Yes DETECT or BREAK Yes © Copyright 2012 HSA Foundation. All Rights Reserved. 20
  • 21. EXAMPLE © Copyright 2012 HSA Foundation. All Rights Reserved. 21
  • 22. AN EXAMPLE (IN JAVA 8) class Player { private Team team; private int scores; private float pctOfTeamScores; public Team getTeam() {return team;} public int getScores() {return scores;} public void setPctOfTeamScores(int pct) { pctOfTeamScores = pct; } }; // “Team” class not shown // Assume “allPlayers’ is an initialized array of Players.. Stream<Player> s = Arrays.stream(allPlayers).parallel(); s.forEach(p -> { int teamScores = p.getTeam().getScores(); float pctOfTeamScores = (float)p.getScores()/(float) teamScores; p.setPctOfTeamScores(pctOfTeamScores); }); © Copyright 2012 HSA Foundation. All Rights Reserved. 22
  • 23. HSAIL CODE EXAMPLE 01: 02: 03: 04: 05: 06: 07: 08: 09: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: version 0:95: $full : $large; // static method HotSpotMethod<Main.lambda$2(Player)> kernel &run ( kernarg_u64 %_arg0 // Kernel signature for lambda method ) { ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord cvt_u64_s32 mul_u64 $d2, add_u64 $d2, add_u64 $d2, ld_global_u64 @L0: ld_global_u64 mov_b64 ld_global_s32 cvt_f32_s32 ld_global_s32 cvt_f32_s32 div_f32 st_global_f32 ret; }; $d2, $d2, $d2, $d2, $d6, $s2; 8; 24; $d6; [$d2]; $d0, [$d6 + 120]; $d3, $d0; $s3, [$d6 + 40]; $s16, $s3; $s0, [$d0 + 24]; $s17, $s0; $s16, $s16, $s17; $s16, [$d6 + 100]; © Copyright 2012 HSA Foundation. All Rights Reserved. // // // // // Convert X gid to long Adjust index for sizeof ref Adjust for actual elements start Add to array ref ptr Load from array element into reg // p.getTeam() // p.getScores () // Team getScores() // p.getScores()/teamScores // p.setPctOfTeamScores() 23
  • 24. WHAT IS HSAIL?  HSAIL is the intermediate language for parallel compute in HSA      Generated by a high level compiler (LLVM, gcc, Java VM, etc) Low-level IR, close to machine ISA level Compiled down to target ISA by an IHV “Finalizer” Finalizer may execute at run time, install time, or build time Example: OpenCL™ Compilation Stack using HSAIL High-Level Compiler Flow (Developer) OpenCL™ Kernel EDG or CLANG SPIR LLVM HSAIL © Copyright 2012 HSA Foundation. All Rights Reserved. Finalizer Flow (Runtime) HSAIL Finalizer Hardware ISA 24
  • 25. HSAIL AND SPIR Feature Intended Users SPIR Compiler developers who want a fast Compiler developers who want to path to acceleration across a wide control their own code generation. variety of devices. IR Level Low-level, just above the machine instruction set Back-end code generation Thin, fast, robust. Where are compiler optimizations performed? Registers SSA Form Binary format Code generator for LLVM Most done in high-level compiler, before HSAIL generation. Fixed-size register pool No Yes Yes Back-end device targets Memory Model HSAIL Modern GPU architectures supported by members of the HSA Foundation Relaxed consistency with acquire/release, barriers, and finegrained barriers © Copyright 2012 HSA Foundation. All Rights Reserved. High-level, just below LLVM-IR Flexible. Can include many optimizations and compiler transformation including register allocation. Most done in back-end code generator, between SPIR and device machine instruction set Infinite Yes Yes Yes Any OpenCL device including GPUs, CPUs, FPGAs Flexible. Can support the OpenCL 1.2 Memory Model 25
  • 26. TAKEAWAYS  HSAIL Key Points   Portable (multiple HW vendors and parallel architectures)   Thin, robust, fast finalizer Supports shared virtual memory and platform atomics HSA brings GPU computing to mainstream programming models  Shared and coherent memory bridges “faraway accelerator” gap    HSAIL provides the common IL for high-level languages to benefit from parallel computing. HSAIL is a building block. HSA programmers will use popular programming models, with pointers & memory coherency Java Example  Unmodified Java8 accelerated on the GPU!  Can use pointer-containing data structures © Copyright 2012 HSA Foundation. All Rights Reserved. 26
  • 27. TOOLS ARE AVAILABLE NOW  HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG)    http://hsafoundation.com/standards/ https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal Tools now at GitHUB – HSA Foundation  libHSA Assembler and Disassembler   https://github.com/HSAFoundation/HSAIL-Tools HSAIL Instruction Set Simulator  https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator  Soon: LLVM Compilation stack which outputs HSAIL and BRIG  Java compiler for HSAIL (preliminary)  http://openjdk.java.net/projects/sumatra/)  http://openjdk.java.net/projects/graal/ © Copyright 2012 HSA Foundation. All Rights Reserved. 27
  • 28. BACKUP © Copyright 2012 HSA Foundation. All Rights Reserved. 28
  • 29. AN EXAMPLE (IN OPENCL™) //Vector add // A[0:N-1] = B[0:N-1] + C[0:N-1] __kernel void vec_add ( __global const float *a, __global const float *b, __global float *c, const unsigned int n) { // Get our global thread ID int id = get_global_id(0); // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } © Copyright 2012 HSA Foundation. All Rights Reserved. 29
  • 30. HSAIL VECTOR ADD version 1:0:$full:$small; function &get_global_id(arg_u32 %ret_val) (arg_u32 %arg_val0); function &abort() (); kernel &__OpenCL_vec_add_kernel( kernarg_u32 %arg_val0, kernarg_u32 %arg_val1, kernarg_u32 %arg_val2, kernarg_u32 %arg_val3) { @__OpenCL_vec_add_kernel_entry: // BB#0: // %entry ld_kernarg_u32 $s0, [%arg_val3]; workitemabsid_u32 $s1, 0; cmp_lt_b1_u32 $c0, $s1, $s0; ld_kernarg_u32 $s0, [%arg_val2]; ld_kernarg_u32 $s2, [%arg_val1]; ld_kernarg_u32 $s3, [%arg_val0]; cbr $c0, @BB0_2; brn @BB0_1; © Copyright 2012 HSA Foundation. All Rights Reserved. @BB0_1: // %if.end ret; @BB0_2: // %if.then shl_u32 $s1, $s1, 2; add_u32 $s2, $s2, $s1; ld_global_f32 $s2, [$s2]; add_u32 $s3, $s3, $s1; ld_global_f32 $s3, [$s3]; add_f32 $s2, $s3, $s2; add_u32 $s0, $s0, $s1; st_global_f32 $s2, [$s0]; brn @BB0_1; }; 30
  • 31. SEGMENTS © Copyright 2012 HSA Foundation. All Rights Reserved. 31
  • 32. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners. 32 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013