HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): ARCHITECTURE
AND ALGORITHMS
ISCA TUTORIAL - JUNE 15, 2014
TOPICS
 Introduction
 HSAIL Virtual Parallel ISA
 HSA Runtime
 HSA Memory Model
 HSA Queuing Model
 HSA Applications
 HSA Compilation
© Copyright 2014 HSA Foundation. All Rights Reserved
The HSA Specifications are not yet at 1.0 Final, so all content is subject to change
SCHEDULE
Time | Topic | Speaker
8:45am | Introduction to HSA | Phil Rogers, AMD
9:30am | HSAIL Virtual Parallel ISA | Ben Sander, AMD
10:30am | Break |
10:50am | HSA Runtime | Yeh-Ching Chung, National Tsing Hua University
12 noon | Lunch |
1pm | HSA Memory Model | Benedict Gaster, Qualcomm
2pm | HSA Queuing Model | Hakan Persson, ARM
3pm | Break |
3:15pm | HSA Compilation Technology | Wen-mei Hwu, University of Illinois
4pm | HSA Application Programming | Wen-mei Hwu, University of Illinois
4:45pm | Questions | All presenters
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW &
PRESIDENT OF HSA FOUNDATION
HSA FOUNDATION
 Founded in June 2012
 Developing a new platform for heterogeneous
systems
 www.hsafoundation.com
 Specifications under development in working
groups to define the platform
 Membership consists of 43 companies and 16
universities
 Adding 1-2 new members each month
DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
[Figure: member company logos grouped by membership level — Founders, Promoters, Supporters, Contributors, Academic]
MEMBERSHIP TABLE
Membership Level | Number | Members
Founder | 6 | AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter | 1 | LG Electronics
Contributor | 25 | Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Research Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter | 13 | Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic | 17 | Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
 Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
 SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
 How do we make them even better?
 Easier to program
 Easier to optimize
 Higher performance
 Lower power
 HSA unites accelerators architecturally
 Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
INFLECTIONS IN PROCESSOR DESIGN
[Figure: three performance-over-time curves, one per era, each marked "we are here"]
 Single-Core Era (single-thread performance over time)
 Enabled by: Moore's Law, voltage scaling
 Constrained by: power, complexity
 Multi-Core Era (throughput performance over number of processors)
 Enabled by: Moore's Law, SMP architecture
 Constrained by: power, parallel SW, scalability
 Heterogeneous Systems Era (modern application performance through data-parallel exploitation)
 Enabled by: abundant data parallelism, power-efficient GPUs
 Temporarily constrained by: programming models, communication overhead
 Programming evolution along the way: Assembly → C/C++ → Java …; pthreads → OpenMP / TBB …; Shader → CUDA → OpenCL → C++ and Java
LEGACY GPU COMPUTE
[Figure: CPUs with coherent system memory connected over PCIe™ to a discrete GPU (compute units) with its own non-coherent GPU memory]
The limiters:
 Multiple memory pools
 Multiple address spaces
 High overhead dispatch
 Data copies across PCIe
 New languages for programming
 Dual source development
 Proprietary environments
 Expert programmers only
 Need to fix all of this to unleash our programmers
EXISTING APUS AND SOCS
[Figure: physical integration — CPU cores 1…N and GPU compute units 1…M on one die, with coherent system memory and separate non-coherent GPU memory]
 Physical Integration
 Good first step
 Some copies gone
 Two memory pools remain
 Still queue through the OS
 Still requires expert programmers
 Need to finish the job
AN HSA ENABLED SOC
[Figure: CPU cores 1…N and compute units 1…M sharing a single unified coherent memory]
 Unified Coherent Memory enables data sharing across all processors
 Processors architected to operate cooperatively
 Designed to enable the application to run on different processors at different times
PILLARS OF HSA*
 Unified addressing across all processors
 Operation into pageable system memory
 Full memory coherency
 User mode dispatch
 Architected queuing language
 Scheduling and context switching
 HSA Intermediate Language (HSAIL)
 High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
 HSA System Architecture Specification
 Version 1.0 Provisional, Released April 2014
 Defines discovery, memory model, queue management, atomics, etc
 HSA Programmers Reference Specification
 Version 1.0 Provisional, Released June 2014
 Defines the HSAIL language and object format
 HSA Runtime Software Specification
 Version 1.0 Provisional, expected to be released in July 2014
 Defines the APIs through which an HSA application uses the platform
 All released specifications can be found at the HSA Foundation web site:
 www.hsafoundation.com/standards
HSA - AN OPEN PLATFORM
 Open Architecture, membership open to all
 HSA Programmers Reference Manual
 HSA System Architecture
 HSA Runtime
 Delivered via royalty free standards
 Royalty Free IP, Specifications and APIs
 ISA agnostic for both CPU and GPU
 Membership from all areas of computing
 Hardware companies
 Operating Systems
 Tools and Middleware
 Applications
 Universities
HSA INTERMEDIATE LAYER — HSAIL
 HSAIL is a virtual ISA for parallel programs
 Finalized to ISA by a JIT compiler or “Finalizer”
 ISA independent by design for CPU & GPU
 Explicitly parallel
 Designed for data parallel programming
 Support for exceptions, virtual functions,
and other high level language features
 Lower level than OpenCL SPIR
 Fits naturally in the OpenCL compilation stack
 Suitable to support additional high level languages and programming models:
 Java, C++, OpenMP, Python, etc.
HSA MEMORY MODEL
 Defines visibility ordering between all
threads in the HSA System
 Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
 Relaxed consistency memory model
for parallel compute performance
 Visibility controlled by:
 Load.Acquire
 Store.Release
 Fences
HSA QUEUING MODEL
 User mode queuing for low latency dispatch
 Application dispatches directly
 No OS or driver required in the dispatch path
 Architected Queuing Layer
 Single compute dispatch path for all hardware
 No driver translation, direct to hardware
 Allows for dispatch to queue from any agent
 CPU or GPU
 GPU self enqueue enables lots of solutions
 Recursion
 Tree traversal
 Wavefront reforming
HSA SOFTWARE
EVOLUTION OF THE SOFTWARE STACK
[Figure: today's driver stack beside the HSA software stack; components are marked as user mode, kernel mode, or contributed by third parties]
 Driver stack: Apps → Domain Libraries → OpenCL™, DX Runtimes, User Mode Drivers → Graphics Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
 HSA software stack: Apps → Task Queuing Libraries → HSA Domain Libraries, OpenCL™ 2.x Runtime → HSA Runtime, HSA JIT → HSA Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
OPENCL™ AND HSA
 HSA is an optimized platform architecture
for OpenCL
 Not an alternative to OpenCL
 OpenCL on HSA will benefit from
 Avoidance of wasteful copies
 Low latency dispatch
 Improved memory model
 Pointers shared between CPU and GPU
 OpenCL 2.0 leverages HSA Features
 Shared Virtual Memory
 Platform Atomics
ADDITIONAL LANGUAGES ON HSA
 In development
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | MulticoreWare | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
 AMD/Oracle sponsored Open Source (OpenJDK) project
 Targeted at Java 9 (2015 release)
 Allows developers to efficiently represent data-parallel algorithms in Java
 Sumatra 'repurposes' Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing
 At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch 'selected' constructs to available HSA-enabled devices
 Developers of Java libraries are already refactoring their library code to use these same constructs
 So developers using existing libraries should see GPU acceleration without any code changes
 http://openjdk.java.net/projects/sumatra/
 https://wikis.oracle.com/display/HotSpotInternals/Sumatra
 http://mail.openjdk.java.net/pipermail/sumatra-dev/
[Figure: development — Application.java is compiled by the Java Compiler to Application.class; runtime — a Sumatra-enabled JVM runs the application, sending Lambda/Stream API constructs through the HSA Finalizer to GPU ISA, or compiling them to CPU ISA]
HSA OPEN SOURCE SOFTWARE
 HSA will feature an open-source Linux execution and compilation stack
 Allows a single shared implementation for many components
 Enables university research and collaboration in all areas
 Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION
CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
 Suffix Arrays are a fundamental data structure
 Designed for efficient searching of a large text
 Quickly locate every occurrence of a substring S in a text T
 Suffix Arrays are used to accelerate in-memory cloud workloads
 Full text index search
 Lossless data compression
 Bio-informatics
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", submitted to Principles and Practice of Parallel Programming (PPoPP'13), February 2013.
AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM
[Figure: skew algorithm pipeline for Compute SA — Radix Sort (GPU) → Compute SA (CPU) → Lexical Rank (CPU) → Radix Sort (GPU) → Merge Sort (GPU); measured results: +5.8x increased performance, 5x decreased energy]
By offloading data-parallel computations to the GPU, HSA increases performance and reduces energy for Suffix Array Construction.
By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies.
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV "Hessian" Kernel)
[Figure: bar chart of lines of code — broken down into init, compile, copy, launch, algorithm, and copy-back phases — and relative performance for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt; the higher-level models need only the algorithm and launch code]
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4 GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
THE HSA FUTURE
 Architected heterogeneous processing on the SOC
 Programming of accelerators becomes much easier
 Accelerated software that runs across multiple hardware vendors
 Scalability from smart phones to super computers on a common architecture
 GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
 Heterogeneous software ecosystem evolves at a much faster pace
 Lower power, more capable devices in your hand, on the wall, in the cloud
JOIN US!
WWW.HSAFOUNDATION.COM
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): HSAIL VIRTUAL
PARALLEL ISA
BEN SANDER, AMD
TOPICS
 Introduction and Motivation
 HSAIL – what makes it special?
 HSAIL Execution Model
 How to program in HSAIL?
 Conclusion
STATE OF GPU COMPUTING
• GPUs are fast and power efficient: high compute density per-mm and per-watt
• But: can be hard to program
Today's Challenges
 Separate address spaces
 Copies
 Can't share pointers
 New language required for compute kernels (e.g. the OpenCL™ runtime API)
 Compute kernel compiled separately from host code
Emerging Solution
 HSA Hardware
 Single address space
 Coherent
 Virtual
 Fast access from all components
 Can share pointers
 Bring GPU computing to existing, popular programming models
 Single-source, fully supported by compiler
 HSAIL compiler IR (cross-platform!)
THE PORTABILITY CHALLENGE
 CPU ISAs
 ISA innovations added incrementally (e.g. NEON, AVX)
 ISA retains backwards compatibility with the previous generation
 Two dominant instruction-set architectures: ARM and x86
 GPU ISAs
 Massive diversity of architectures in the market
 Each vendor has its own ISA – and often several in the market at the same time
 No commitment (or attempt!) to provide any backwards compatibility
 Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
HSAIL :
WHAT MAKES IT SPECIAL?
WHAT IS HSAIL?
 Intermediate language for parallel compute in HSA
 Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
 Expresses parallel regions of code
 Binary format of HSAIL is called “BRIG”
 Goal: Bring parallel acceleration to mainstream programming languages
main() {
…
#pragma omp parallel for
for (int i = 0; i < N; i++) {
}
…
}
[Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
KEY HSAIL FEATURES
 Parallel
 Shared virtual memory
 Portable across vendors in HSA Foundation
 Stable across multiple product generations
 Consistent numerical results (IEEE-754 with defined min accuracy)
 Fast, robust, simple finalization step (no monthly updates)
 Good performance (little need to write in ISA)
 Supports all of OpenCL™
 Supports Java, C++, and other languages as well
HSAIL INSTRUCTION SET - OVERVIEW
 Similar to assembly language for a RISC CPU
 Load-store architecture
 Destination register first, then source registers
 140 opcodes (Java™ bytecode has 200)
 Floating point (single, double, half (f16))
 Integer (32-bit, 64-bit)
 Some packed operations
 Branches
 Function calls
 Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
 Synchronize host CPU and HSA Component!
 Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)
add_u64 $d1, $d0, 24 ; $d1= $d0+24
SEGMENTS AND MEMORY (1/2)
 7 segments of memory
 global, readonly, group, spill, private, arg, kernarg
 Memory instructions can (optionally) specify a segment
 Control data sharing properties and communicate intent
 Global Segment
 Visible to all HSA agents (including host CPU)
 Group Segment
 Provides high-performance memory shared in the work-group.
 Group memory can be read and written by any work-item in the work-group
 HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
SEGMENTS AND MEMORY (2/2)
 Spill, Private, Arg Segments
 Represent different regions of a per-work-item stack
 Typically generated by compiler, not specified by programmer
 Compiler can use these to convey intent – e.g. spills
 Kernarg Segment
 Programmer writes kernarg segment to pass arguments to a kernel
 Read-Only Segment
 Remains constant during execution of kernel
FLAT ADDRESSING
 Each segment mapped into virtual address space
 Flat addresses can map to segments based on virtual address
 Instructions with no explicit segment use flat addressing
 Very useful for high-level language support (e.g. classes, libraries)
 Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
REGISTERS
 Four classes of registers:
 S: 32-bit, Single-precision FP or Int
 D: 64-bit, Double-precision FP or Long Int
 Q: 128-bit, Packed data.
 C: 1-bit, Control Registers (Compares)
 Fixed number of registers
 S, D, Q share a single pool of resources
 S + 2*D + 4*Q <= 128
 Up to 128 S or 64 D or 32 Q (or a blend)
 Register allocation done in high-level compiler
 Finalizer doesn’t perform expensive register allocation
[Figure: the shared register pool — an S register occupies one slot, a D register two, and a Q register four (e.g. q0 overlays d0–d1, which overlay s0–s3), up through s127/d63/q31; the eight 1-bit control registers c0–c7 are a separate set]
SIMT EXECUTION MODEL
 HSAIL Presents a “SIMT” execution model to the programmer
 “Single Instruction, Multiple Thread”
 Programmer writes program for a single thread of execution
 Each work-item appears to have its own program counter
 Branch instructions look natural
 Hardware Implementation
 Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
 Actually one program counter for the entire SIMD instruction
 Branches implemented with predication
 SIMT Advantages
 Easier to program (branch code in particular)
 Natural path for mainstream programming models and existing compilers
 Scales across a wide variety of hardware (programmer doesn’t see vector width)
 Cross-lane operations available for those who want peak performance
WAVEFRONTS
 Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
 Lanes in wavefront can be “active” or “inactive”
 Inactive lanes consume hardware resources but don’t do useful work
 Tradeoffs
 “Wavefront-aware” programming can be useful for peak performance
 But results in less portable code (since wavefront width is encoded in algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
CROSS-LANE OPERATIONS
 Example HSAIL cross-lane operation: “activelaneid”
 Dest set to count of earlier work-items that are active for this instruction
 Useful for compaction algorithms
activelaneid_u32 $s0
 Example HSAIL cross-lane operation: “activelaneshuffle”
 Each workitem reads value from another lane in the wavefront
 Supports selection of “identity” element for inactive lanes
 Useful for wavefront-level reductions
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0
// s0 = dest, s1 = source, s2 = lane select, no identity
HSAIL MODES
 Working group strove to limit optional modes and features in HSAIL
 Minimize differences between HSA target machines
 Better for compiler vendors and application developers
 Two modes survived
 Machine Models
 Small: 32-bit pointers, 32-bit data
 Large: 64-bit pointers, 32-bit or 64-bit data
 Vendors can support one or both models
 “Base” and “Full” Profiles
 Two sets of requirements for FP accuracy, rounding, exception reporting, and hard preemption
HSA PROFILES
Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
HSA PARALLEL EXECUTION
MODEL
HSA PARALLEL EXECUTION MODEL
Basic idea:
 Programmer supplies an HSAIL “kernel” that is run on each work-item. The kernel is written as a single thread of execution.
 Programmer specifies grid dimensions (the scope of the problem) when launching the kernel.
 Each work-item has a unique coordinate in the grid.
 Programmer optionally specifies work-group dimensions (for optimized communication).
CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
     [ -2 0 +2 ]
     [ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]
G = sqrt(Gx^2 + Gy^2)
[Figure, built up across three slides: the image is the 2D grid; each pixel is a work-item running the kernel; work-items are grouped into 2D work-groups]
HOW TO PROGRAM HSA?
WHAT DO I TYPE?
HSA PROGRAMMING MODELS : CORE PRINCIPLES
 Single source
 Host and device code side-by-side in same source file
 Written in same programming language
 Single unified coherent address space
 Freely share pointers between host and device
 Similar memory model to a multi-core CPU
 Parallel regions identified with existing language syntax
 Typically same syntax used for multi-core CPU
 HSAIL is the compiler IR that supports these programming models
GCC OPENMP : COMPILATION FLOW
 SUSE GCC Project
 Adding HSAIL code generator to GCC compiler infrastructure
 Supports OpenMP 3.1 syntax
 No data movement directives required!
main() {
…
// Host code.
#pragma omp parallel for
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
…
}
[Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
GCC OpenMP flow
Application: C/C++/Fortran OpenMP application, e.g.
#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }
Compile time – GNU Compiler (GCC):
 Compiles host code and emits runtime calls with kernel name, parameters, and launch attributes
 Lowers OpenMP directives, converts GIMPLE to BRIG
 Embeds BRIG into host code
Run time:
 Pragmas map to calls into the HSA Runtime, which dispatches the kernel to the GPU
 Kernel finalized from BRIG to ISA; kernels are finalized once and cached
MCW C++AMP : COMPILATION FLOW
 C++AMP : Single-source C++ template parallel programming model
 MCW compiler based on CLANG/LLVM
 Open-source and runs on Linux
 Leverage open-source LLVM->HSAIL code generator
main() {
…
parallel_for_each(grid<1>(extent<256>(…)
…
}
[Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
JAVA: RUNTIME FLOW
JAVA 8 – HSA ENABLED APARAPI
 Java 8 brings the Stream + Lambda API
‒ A more natural way of expressing data parallel algorithms
‒ Initially targeted at multi-core
 APARAPI will:
‒ Support Java 8 lambdas
‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
[Stack: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU / GPU]
FUTURE JAVA – HSA ENABLED JAVA (SUMATRA)
 Adds native GPU acceleration to the Java Virtual Machine (JVM)
 Developer uses the JDK Lambda and Stream APIs
 JVM uses the GRAAL compiler to generate HSAIL
[Stack: Java Application → Java JDK Stream + Lambda API → JVM → Java GRAAL JIT backend → HSA Finalizer & Runtime → CPU / GPU]
AN EXAMPLE (IN JAVA 8)
// Example computes the percentage of total scores achieved by each player on a team.
class Player {
    private Team team; // Note: reference to the parent Team.
    private int scores;
    private float pctOfTeamScores;
    public Team getTeam() { return team; }
    public int getScores() { return scores; }
    public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
}
// "Team" class not shown.
// Assume "allPlayers" is an initialized array of Players.
Arrays.stream(allPlayers) // wrap the array in a stream
    .parallel() // developer indication that the lambda is thread-safe
    .forEach(p -> {
        int teamScores = p.getTeam().getScores();
        float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
        p.setPctOfTeamScores(pctOfTeamScores);
    });
HSAIL CODE EXAMPLE
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
    kernarg_u64 %_arg0               // Kernel signature for lambda method
) {
    ld_kernarg_u64 $d6, [%_arg0];    // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0;        // Read the work-item global "X" coord
    cvt_u64_s32 $d2, $s2;            // Convert X gid to long
    mul_u64 $d2, $d2, 8;             // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24;            // Adjust for actual elements start
    add_u64 $d2, $d2, $d6;           // Add to array ref ptr
    ld_global_u64 $d6, [$d2];        // Load from array element into reg
@L0:
    ld_global_u64 $d0, [$d6 + 120];  // p.getTeam()
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d6 + 40];   // p.getScores()
    cvt_f32_s32 $s16, $s3;
    ld_global_s32 $s0, [$d0 + 24];   // Team getScores()
    cvt_f32_s32 $s17, $s0;
    div_f32 $s16, $s16, $s17;        // p.getScores()/teamScores
    st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
    ret;
};
HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
[Flow: HSAIL Assembler → BRIG → Finalizer → Machine ISA]
• HSAIL has a text format and an assembler
OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
[Flow: CLOC → BRIG → Finalizer → Machine ISA]
•OpenCL split-source model cleanly isolates kernel
•Can express many HSAIL features in OpenCL Kernel Language
•Higher productivity than writing in HSAIL assembly
•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
KEY TAKEAWAYS
 HSAIL
 Thin, robust, fast finalizer
 Portable (multiple HW vendors and parallel architectures)
 Supports shared virtual memory and platform atomics
 HSA brings GPU computing to mainstream programming models
 Shared and coherent memory bridges “faraway accelerator” gap
 HSAIL provides the common IL for high-level languages to benefit from
parallel computing
 Languages and Compilers
 HSAIL support in GCC, LLVM, Java JVM
 Leverage same language syntax designed for multi-core CPUs
 Can use pointer-containing data structures
HSA RUNTIME
YEH-CHING CHUNG, NATIONAL TSING HUA
UNIVERSITY
OUTLINE
 Introduction
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Initialization and Shut Down
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Summary
INTRODUCTION (1)
 The HSA core runtime is a thin, user-mode API that provides the interface necessary for
the host to launch compute kernels to the available HSA components.
 The overall goal of the HSA core runtime design is to provide a high-performance dispatch
mechanism that is portable across multiple HSA vendor architectures.
 The dispatch mechanism differentiates the HSA runtime from other language runtimes by
architected argument setting and kernel launching at the hardware and specification level.
 The HSA core runtime API is standard across all HSA vendors, such that languages which use the
HSA runtime can run on different vendor’s platforms that support the API.
 The implementation of the HSA runtime may include kernel-level components (required for some hardware, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations).
INTRODUCTION (2)
 The software architecture stack without the HSA runtime: OpenCL, Java, OpenMP, and DSL apps sit on their own language runtimes, each of which talks to a separate per-vendor driver (vendor 1 … vendor m) for that vendor's components.
 The software architecture stack with the HSA runtime: the same apps and language runtimes target a common HSA runtime (one per HSA vendor 1 … m, each with its HSA finalizer) for that vendor's components.
[Figure: the two stacks side by side — programming model and language runtime layers on top, then per-vendor drivers vs. per-vendor HSA runtimes with finalizers, down to the components]
INTRODUCTION (3)
[Figure: program flow through the OpenCL runtime and the HSA runtime — at program start: platform, device, and context initialization; HSA runtime initialization and topology discovery; build kernel; HSAIL finalization and linking; SVM allocation and kernel argument setting; HSA memory allocation; during execution: the agent's dispatch packets are enqueued on a command queue; at program exit: resource deallocation and HSA runtime close]
INTRODUCTION (4)
 HSA Platform System Architecture Specification support
 Runtime initialization and shutdown
 Notifications (synchronous/asynchronous)
 Agent information
 Signals and synchronization (memory-based)
 Queues and Architected dispatch
 Memory management
 HSAIL support
 Finalization, linking, and debugging
 Image and Sampler support
[Figure: HSA runtime phases — HSA Runtime Initialization and Topology Discovery, HSA Memory Allocation, HSAIL Finalization and Linking, Enqueue Dispatch Packet, HSA Runtime Close]
RUNTIME INITIALIZATION AND
SHUTDOWN
OUTLINE
 Runtime Initialization API
 hsa_init
 Runtime Shut Down API
 hsa_shut_down
 Examples
HSA RUNTIME INITIALIZATION
 When the API is invoked for the first time in a given process, a runtime
instance is created.
 A typical runtime instance may contain information of platform, topology, reference
count, queues, signals, etc.
 The API can be called multiple times by applications
 Only a single runtime instance will exist for a given process.
 Whenever the API is invoked, the reference count is increased by one.
HSA RUNTIME SHUT DOWN
 When the API is invoked, the reference count is decreased by 1.
 When the reference count < 1
 All the resources associated with the runtime instance (queues, signals, topology
information, etc.) are considered invalid and any attempt to reference them in
subsequent API calls results in undefined behavior.
 The user might call hsa_init to initialize the HSA runtime again.
 The HSA runtime might release resources associated with it.
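The reference-counting behavior described above can be sketched in plain C. This is a mock, not the real runtime: the actual hsa_init/hsa_shut_down come from the HSA runtime headers, and the status enum here is abbreviated for illustration.

```c
/* Minimal sketch of reference-counted runtime open/close, assuming a
 * simplified hsa_status_t and one process-wide runtime instance. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum {
    HSA_STATUS_SUCCESS = 0,
    HSA_STATUS_ERROR_NOT_INITIALIZED = 1
} hsa_status_t;

typedef struct {
    int ref_count;   /* number of outstanding hsa_init calls          */
    void *topology;  /* placeholder for topology table, queues, etc.  */
} runtime_instance_t;

static runtime_instance_t *g_runtime = NULL;

hsa_status_t hsa_init(void) {
    if (g_runtime == NULL) {
        /* First call in this process: create the single runtime instance. */
        g_runtime = malloc(sizeof(*g_runtime));
        g_runtime->ref_count = 1;
        g_runtime->topology = NULL;  /* topology discovery would go here */
    } else {
        /* Subsequent calls only bump the reference count. */
        g_runtime->ref_count++;
    }
    return HSA_STATUS_SUCCESS;
}

hsa_status_t hsa_shut_down(void) {
    if (g_runtime == NULL)
        return HSA_STATUS_ERROR_NOT_INITIALIZED;
    if (--g_runtime->ref_count == 0) {
        /* Last reference: resources become invalid and may be released. */
        free(g_runtime);
        g_runtime = NULL;
    }
    return HSA_STATUS_SUCCESS;
}

int runtime_ref_count(void) {  /* helper for inspecting the sketch */
    return g_runtime ? g_runtime->ref_count : 0;
}
```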
EXAMPLE – RUNTIME INITIALIZATION (1)
Data structure for
runtime instance
If hsa_init is called more than once,
increase the ref_count by 1
EXAMPLE – RUNTIME INITIALIZATION (2)
When hsa_init is called for the first time, allocate
resources and set the reference count
Get the number of HSA agents
Initialize agents
Create an empty agent list
If initialization fails, release resources
Create topology table
EXAMPLE - RUNTIME INSTANCE (1)
Platform Name: Generic (Agent: 2, Memory: 1, Cache: 1)
Agent-0: node_id 0, id 0, type CPU, vendor Generic, name Generic, wavefront_size 0, queue_size 200, group_memory 0, fbarrier_max_count 1, is_pic_supported 0
Agent-1: node_id 0, id 0, type GPU, vendor Generic, name Generic, wavefront_size 64, queue_size 200, group_memory 64, fbarrier_max_count 1, is_pic_supported 1
Memory: node_id 0, id 0, segment_type 111111, address_base 0x0001, size 2048 MB, peak_bandwidth 6553.6 mbps
Cache: node_id 0, id 0, levels 1, associativity 1, cache size 64KB, cache line size 4, is_inclusive 1
EXAMPLE - RUNTIME INSTANCE (2)
Platform Header File: *base_address = 0x00001, size = 248, system_timestamp_frequency_mhz = 200, signal_maximum_wait = 1/200, *node_id (no_nodes = 1), *agent_list (no_agent = 2), *memory_descriptor_list (no_memory_descriptor = 1), *cache_descriptor_list (no_cache_descriptor = 1)
Agent-0: node_id = 0, id = 0, agent_type = 1 (CPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 0, queue_size = 200, group_memory_size_bytes = 0, fbarrier_max_count = 1, is_pic_supported = 0
Agent-1: node_id = 0, id = 0, agent_type = 2 (GPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 64, queue_size = 200, group_memory_size_bytes = 64, fbarrier_max_count = 1, is_pic_supported = 1
Memory: node_id = 0, id = 0, supported_segment_type_mask = 111111, virtual_address_base = 0x0001, size_in_bytes = 2048MB, peak_bandwidth_mbps = 6553.6
Cache: node_id = 0, id = 0, levels = 1, *associativity = 1, *cache_size = 64KB, *cache_line_size = 4, *is_inclusive = 1
EXAMPLE – RUNTIME SHUT DOWN
If ref_count < 1, then free the list;
Otherwise decrease the ref_count
by 1.
NOTIFICATIONS
(SYNCHRONOUS/ASYNCHRONOUS)
OUTLINE
 Synchronous Notifications
 hsa_status_t
 hsa_status_string
 Asynchronous Notifications
 Example
SYNCHRONOUS NOTIFICATIONS
 Notifications (errors, events, etc.) reported by the runtime can be synchronous or
asynchronous
 The HSA runtime uses the return values of API functions to pass notifications
synchronously.
 A status code is defined as an enumeration, hsa_status_t, to capture the return value
of any API function that has been executed, except accessors/mutators.
 The notification is a status code that indicates success or error.
 Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.
 An error status is assigned a positive integer and its identifier starts with the
HSA_STATUS_ERROR prefix.
 The status code can help to determine a cause of the unsuccessful execution.
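A minimal sketch of this convention, with a mocked-up hsa_status_t enum and an hsa_status_string lookalike. The enumerator values and message strings here are illustrative; the real ones are defined by the HSA runtime specification.

```c
/* Sketch of synchronous notifications: every API call returns a status
 * code, and hsa_status_string maps a code to a readable description. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    HSA_STATUS_SUCCESS = 0,                 /* success is zero            */
    HSA_STATUS_ERROR = 1,                   /* errors are positive        */
    HSA_STATUS_ERROR_INVALID_ARGUMENT = 2
} hsa_status_t;

hsa_status_t hsa_status_string(hsa_status_t status,
                               const char **status_string) {
    if (status_string == NULL)
        return HSA_STATUS_ERROR_INVALID_ARGUMENT;
    switch (status) {
    case HSA_STATUS_SUCCESS:
        *status_string = "success"; break;
    case HSA_STATUS_ERROR:
        *status_string = "generic error"; break;
    case HSA_STATUS_ERROR_INVALID_ARGUMENT:
        *status_string = "invalid argument"; break;
    default:
        return HSA_STATUS_ERROR_INVALID_ARGUMENT;
    }
    return HSA_STATUS_SUCCESS;
}
```

A caller checks `status != HSA_STATUS_SUCCESS` after each API call and can query the string for diagnostics.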
STATUS CODE QUERY
 Query additional information on status code
 Parameters
 status (input): Status code that the user is seeking more information on
 status_string (output): An ISO/IEC 646 encoded English language string that potentially
describes the error status
ASYNCHRONOUS NOTIFICATIONS
 The runtime passes asynchronous notifications by calling user-defined
callbacks.
 For instance, queues are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. Callbacks are associated with queues when they are created. When the
runtime detects an error in a queue, it invokes the callback associated with that
queue and passes it an error flag (indicating what happened) and a pointer to the
erroneous queue.
 The HSA runtime does not implement any default callbacks.
 When using blocking functions within the callback implementation, a callback that
does not return can leave the runtime in an undefined state.
EXAMPLE - CALLBACK
Pass the callback function
when creating the queue
If the queue is empty, set the
event and invoke callback
AGENT INFORMATION
OUTLINE
 Agent information
 hsa_node_t
 hsa_agent_t
 hsa_agent_info_t
 hsa_component_feature_t
 Agent Information manipulation APIs
 hsa_iterate_agents
 hsa_agent_get_info
 Example
INTRODUCTION
 The runtime exposes a list of agents that are available in the system.
 An HSA agent is a hardware component that participates in the HSA memory model.
 An HSA agent can submit AQL packets for execution.
 An HSA agent may also be, but is not required to be, an HSA component. It is possible for
a system to include HSA agents that are neither an HSA component nor a host CPU.
 HSA agents are defined as opaque handles of type hsa_agent_t .
 The HSA runtime provides APIs for applications to traverse the list of available
agents and query attributes of a particular agent.
AGENT INFORMATION (1)
 Opaque agent handle
 Opaque NUMA node handle
 An HSA memory node is a node that delineates a set of
system components (host CPUs and HSA Components) with
“local” access to a set of memory resources attached to the
node's memory controller and appropriate HSA-compliant
access attributes.
AGENT INFORMATION (2)
 Component features
 An HSA component is a hardware or software component that can be a target of the AQL queries
and conforms to the memory model of the HSA.
 Values
 HSA_COMPONENT_FEATURE_NONE = 0
 No component capabilities. The device is an agent, but not a component.
 HSA_COMPONENT_FEATURE_BASIC = 1
 The component supports the HSAIL instruction set and all the AQL packet types except Agent
dispatch.
 HSA_COMPONENT_FEATURE_ALL = 2
 The component supports the HSAIL instruction set and all the AQL packet types.
AGENT INFORMATION (3)
 Agent attributes
 Values
 HSA_AGENT_INFO_MAX_GRID_DIM
 HSA_AGENT_INFO_MAX_WORKGROUP_DIM
 HSA_AGENT_INFO_QUEUE_MAX_PACKETS
 HSA_AGENT_INFO_CLOCK
 HSA_AGENT_INFO_CLOCK_FREQUENCY
 HSA_AGENT_INFO_MAX_SIGNAL_WAIT
 HSA_AGENT_INFO_NAME
 HSA_AGENT_INFO_NODE
 HSA_AGENT_INFO_COMPONENT_FEATURES
 HSA_AGENT_INFO_VENDOR_NAME
 HSA_AGENT_INFO_WAVEFRONT_SIZE
 HSA_AGENT_INFO_CACHE_SIZE
AGENT INFORMATION MANIPULATION (1)
 Iterate over the available agents, and invoke an application-defined callback on
every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular
iteration, the traversal stops and the function returns that status value.
 Parameters
 callback (input): Callback to be invoked once per agent
 data (input): Application data that is passed to callback on every iteration. Can be
NULL.
AGENT INFORMATION MANIPULATION (2)
 Get the current value of an attribute for a given agent
 Parameters
 agent (input): A valid agent
 attribute (input): Attribute to query
 value (output): Pointer to a user-allocated buffer where to store the value of the
attribute. If the buffer passed by the application is not large enough to hold the value
of attribute, the behavior is undefined.
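The iterate-and-stop contract of hsa_iterate_agents can be sketched with a mocked agent list. Here hsa_agent_t is a fake handle, g_agents stands in for the discovered topology, and HSA_STATUS_INFO_BREAK is an illustrative non-success code.

```c
/* Sketch of agent traversal: invoke a user callback once per agent and
 * stop early when the callback returns a non-success status, which the
 * traversal function then forwards to its caller. */
#include <assert.h>
#include <stddef.h>

typedef enum {
    HSA_STATUS_SUCCESS = 0,
    HSA_STATUS_INFO_BREAK = 1
} hsa_status_t;

typedef unsigned long hsa_agent_t;  /* opaque handle in the real API */

static const hsa_agent_t g_agents[] = { 100, 101, 102 };  /* mock topology */

hsa_status_t hsa_iterate_agents(
        hsa_status_t (*callback)(hsa_agent_t agent, void *data),
        void *data) {
    for (size_t i = 0; i < sizeof g_agents / sizeof g_agents[0]; ++i) {
        hsa_status_t s = callback(g_agents[i], data);
        if (s != HSA_STATUS_SUCCESS)
            return s;  /* traversal stops, status value is returned */
    }
    return HSA_STATUS_SUCCESS;
}

/* Example callback: count visited agents, stop after handle 101. */
static hsa_status_t find_101(hsa_agent_t agent, void *data) {
    int *visited = data;
    ++*visited;
    return agent == 101 ? HSA_STATUS_INFO_BREAK : HSA_STATUS_SUCCESS;
}
```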
EXAMPLE - AGENT ATTRIBUTE QUERY
Copy agent attribute information
Get the agent handle of Agent 0
SIGNALS AND SYNCHRONIZATION
(MEMORY-BASED)
OUTLINE
 Signal
 Signal manipulation API
 Create/Destroy
 Query
 Send
 Atomic Operations
 Signal wait
 Get time out
 Signal Condition
 Example
SIGNAL (1)
 HSA agents can communicate with each other by using coherent global memory,
or by using signals.
 A signal is represented by an opaque signal handle
 A signal carries a value, which can be updated or conditionally waited upon via
an API call or HSAIL instruction.
 The value occupies four or eight bytes depending on the machine model in use.
SIGNAL (2)
 Updating the value of a signal is equivalent to sending the signal.
 In addition to the update (store) of signals, the API for sending signals must
support other atomic operations with specific memory-order semantics
 Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS
 Memory-order semantics: Release and Relaxed
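Using C11 atomics as a stand-in for the signal value, these send operations map directly onto explicit memory orders: the _screl/_relaxed suffixes correspond to memory_order_release/memory_order_relaxed. The hsa_signal_sim_t type and helper names are illustrative, not the runtime API.

```c
/* Sketch of signal sends as atomic operations on the signal value. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct { _Atomic int64_t value; } hsa_signal_sim_t;  /* mock handle */

void signal_store_release(hsa_signal_sim_t *s, int64_t v) {
    atomic_store_explicit(&s->value, v, memory_order_release);
}

void signal_add_relaxed(hsa_signal_sim_t *s, int64_t v) {
    atomic_fetch_add_explicit(&s->value, v, memory_order_relaxed);
}

/* CAS returns the previously observed value in both outcomes. */
int64_t signal_cas_release(hsa_signal_sim_t *s, int64_t expected, int64_t v) {
    atomic_compare_exchange_strong_explicit(
        &s->value, &expected, v, memory_order_release, memory_order_relaxed);
    return expected;
}

int64_t signal_load_acquire(hsa_signal_sim_t *s) {
    return atomic_load_explicit(&s->value, memory_order_acquire);
}
```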
SIGNAL CREATE/DESTROY
 Create a signal
 Parameters
 initial_value (input): Initial value of the
signal.
 signal_handle (output): Signal handle.
 Destroy a signal previously created by
hsa_signal_create
 Parameter
 signal_handle (input): Signal handle.
SIGNAL LOAD/STORE
 Atomically read the current signal value with
acquire semantics
 Atomically read the current signal value with
relaxed semantics
 Send and atomically set the value of a signal
with release semantics
 Send and atomically set the value of a signal with
relaxed semantics
SIGNAL ADD/SUBTRACT
 Send and atomically increment the value of a
signal by a given amount with release semantics
 Send and atomically decrement the value of a
signal by a given amount with release semantics
 Send and atomically increment the value of a
signal by a given amount with relaxed semantics
 Send and atomically decrement the value of a
signal by a given amount with relaxed semantics
SIGNAL AND (OR, XOR)/EXCHANGE
 Send and atomically perform a logical AND operation
on the value of a signal and a given value with
release semantics
 Send and atomically perform a logical AND operation
on the value of a signal and a given value with
relaxed semantics
 Send and atomically set the value of a signal and
return its previous value with release semantics
 Send and atomically set the value of a signal and
return its previous value with relaxed semantics
SIGNAL WAIT (1)
 The application may wait on a signal, with a condition specifying the terms of
wait.
 Signal wait condition operator
 Values
 HSA_EQ: The two operands are equal.
 HSA_NE: The two operands are not equal.
 HSA_LT: The first operand is less than the second operand.
 HSA_GTE: The first operand is greater than or equal to the second operand.
SIGNAL WAIT (2)
 The wait can be done either in the HSA component via an HSAIL wait instruction
or via a runtime API defined here.
 Waiting on a signal returns the current value at the opaque signal object;
 The wait may have a runtime defined timeout which indicates the maximum amount of time that an
implementation can spend waiting.
 The signal infrastructure allows for multiple senders/waiters on a single signal.
 Wait reads the value, hence acquire synchronizations may be applied.
SIGNAL WAIT (3)
 Signal wait
 Parameters
 signal_handle (input): A signal handle
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
SIGNAL WAIT (4)
 Signal wait with timeout
 Parameters
 signal_handle (input): A signal handle
 timeout (input): Maximum wait duration (A value of zero indicates no maximum)
 long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in
a short period of time. The HSA runtime may use this hint to optimize the wait implementation.
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
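A polling sketch of the wait-with-condition semantics. To keep the sketch deterministic the timeout is counted in loop iterations rather than time, and the names are illustrative; a real implementation may sleep or use the HSAIL wait instruction instead of spinning.

```c
/* Sketch of signal wait: spin until the condition on the signal value
 * holds or the (iteration-based) timeout expires. Zero means no limit. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef enum { HSA_EQ, HSA_NE, HSA_LT, HSA_GTE } hsa_signal_condition_t;
typedef enum { WAIT_OK = 0, WAIT_TIMEOUT = 1 } wait_status_t;

static int condition_met(hsa_signal_condition_t c,
                         int64_t observed, int64_t cmp) {
    switch (c) {
    case HSA_EQ:  return observed == cmp;
    case HSA_NE:  return observed != cmp;
    case HSA_LT:  return observed <  cmp;
    case HSA_GTE: return observed >= cmp;
    }
    return 0;
}

wait_status_t signal_wait_timeout(_Atomic int64_t *signal,
                                  uint64_t timeout_iters,
                                  hsa_signal_condition_t condition,
                                  int64_t compare_value,
                                  int64_t *return_value) {
    for (uint64_t i = 0; timeout_iters == 0 || i < timeout_iters; ++i) {
        /* Wait reads the value, so an acquire load is used. */
        int64_t v = atomic_load_explicit(signal, memory_order_acquire);
        if (condition_met(condition, v, compare_value)) {
            *return_value = v;  /* the wait returns the current value */
            return WAIT_OK;
        }
    }
    return WAIT_TIMEOUT;
}
```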
EXAMPLE – SIGNAL WAIT (1)
Timeline (initially value = 0):
 thread_1 calls hsa_signal_wait_timeout_acquire (condition: value == 2); thread_1 is blocked
 thread_2 calls hsa_signal_add_relaxed (value = value + 3); value = 3
 thread_2 calls hsa_signal_subtract_relaxed (value = value - 1); value = 2
 Condition satisfied: the wait returns the signal value and the execution of thread_1 continues
EXAMPLE – SIGNAL WAIT (2)
If signal_handle is invalid, then return signal invalid status
Compare tmp->value with compare_value to see if the
condition is satisfied
If timeout = 0 then return signal time out status
Signal wait condition function
If the condition is satisfied, then return signal and status
QUEUES AND ARCHITECTED
DISPATCH
OUTLINE
 Queues
 Queue Types and Structure
 HSA runtime API for Queue Manipulations
 Architected Queuing Language (AQL) Support
 Packet type
 Packet header
 Examples
 Enqueue Packet
 Packet Processor
INTRODUCTION (1)
 An HSA-compliant platform supports allocation of multiple user-level command queues.
 A user-level command queue is characterized as runtime-allocated, user-level accessible virtual
memory of a certain size, containing packets defined in the Architected Queuing Language (AQL
packets).
 Queues are allocated by HSA applications through the HSA runtime.
 HSA software receives memory-based structures to configure the hardware queues to
allow for efficient software management of the hardware queues of the HSA agents.
 This queue memory shall be processed by the HSA Packet Processor as a ring buffer.
 Queues are read-only data structures.
 Writing values directly to a queue structure results in undefined behavior.
 But HSA agents can directly modify the contents of the buffer pointed by base_address, or use
runtime APIs to access the doorbell signal or the service queue.
 Two queue types, AQL and Service Queues, are supported
 AQL Queue consumes AQL packets that are used to specify the information of kernel functions
that will be executed on the HSA component
 Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user
registered functions that will be executed on the agent (typically, the host CPU)
INTRODUCTION (2)
INTRODUCTION (3)
 AQL queue structure
INTRODUCTION (4)
 In addition to the data held in the queue structure, the queue also defines two
properties (readIndex and writeIndex) that define the location of “head” and “tail”
of the queue.
 readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet to be consumed by the packet processor.
 writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet slot to be allocated.
 Both indices are not directly exposed to the user, who can only access them by using
dedicated HSA core runtime APIs.
 The available index functions differ on the index of interest (read or write), action to be
performed (addition, compare and swap, etc.), and memory consistency model
(relaxed, release, etc.).
INTRODUCTION (5)
 The read index is automatically advanced when a packet is read by the packet
processor.
 When the packet processor observes that
 The read index matches the write index, the queue can be considered empty;
 The write index is greater than or equal to the sum of the read index and the size of
the queue, then the queue is full.
 The doorbell_signal field of a queue contains a signal that is used by the agent
to inform the packet processor to process the packets it writes.
 The value the doorbell is signaled with is equal to the ID of the packet that is ready to be
launched.
INTRODUCTION (6)
 The new task might be consumed by the packet processor even before the
doorbell signal has been signaled by the agent.
 This is because the packet processor might be already processing some other
packets and observes that there is new work available, so it processes the new
packets.
 In any case, the agent must ring the doorbell for every batch of packets it writes.
QUEUE CREATE/DESTROY
 Create a user mode queue
 When a queue is created, the runtime also
allocates the packet buffer and the completion
signal.
 The application should only rely on the status
code returned to determine if the queue is valid
 Destroy a user mode queue
 A queue must not be accessed after it has been
destroyed.
 When a queue is destroyed, the state of the AQL packets
that have not been yet fully processed becomes undefined.
GET READ/WRITE INDEX
 Atomically retrieve read index of a queue with
acquire semantics
 Atomically retrieve write index of a queue with
acquire semantics
 Atomically retrieve read index of a queue with
relaxed semantics
 Atomically retrieve write index of a queue with
relaxed semantics
SET READ/WRITE INDEX
 Atomically set the read index of a queue with
release semantics
 Atomically set the read index of a queue with
relaxed semantics
 Atomically set the write index of a queue with
release semantics
 Atomically set the write index of a queue with
relaxed semantics
COMPARE AND SWAP WRITE INDEX
 Atomically compare and set the write index of a
queue with acquire/release/relaxed/acquire-
release semantics
 Parameters
 queue (input): A queue
 expected (input): The expected index value
 val (input): Value to copy to the write index if expected
matches the observed write index
 Return value
 Previous value of the write index
ADD WRITE INDEX
 Atomically increment the write index of a
queue by an offset with
release/acquire/relaxed/acquire-release
semantics
 Parameters
 queue (input): A queue
 val (input): The value to add to the write index
 Return value
 Previous value of the write index
ARCHITECTED QUEUING LANGUAGE (AQL)
 An HSA-compliant system provides a command interface for the dispatch of
HSA agent commands.
 This command interface is provided by the Architected Queuing Language (AQL).
 AQL allows HSA agents to build and enqueue their own command packets,
enabling fast and low-power dispatch.
 AQL also provides support for HSA component queue submissions
 The HSA component kernel can write commands in AQL format.
AQL PACKET (1)
 AQL packet format
 Values
 Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.
 Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the
packet slot available to the HSA agents.
 Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by
HSA agents.
 Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent
packets. All queues support barrier packets.
 Agent dispatch packet (4): Agent dispatch packets contain jobs for an HSA agent and are created by
HSA agents.
AQL PACKET (2)
HSA signaling object handle used to indicate completion of the job
EXAMPLE - ENQUEUE AQL PACKET (1)
 An HSA agent submits a task to a queue by performing the following steps:
 Allocate a packet slot (by incrementing the writeIndex)
 Initialize the packet and copy packet to a queue associated with the Packet Processor
 Mark packet as valid
 Notify the Packet Processor of the packet (With doorbell signal)
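These four steps can be sketched with C11 atomics. The packet and queue types are heavily simplified stand-ins (a fixed 8-slot ring and a single payload field); the format field is flipped to valid last, with release ordering, so the packet body is published before the packet processor can observe it.

```c
/* Sketch of a producer submitting a dispatch packet:
 * reserve slot, fill packet, mark valid, ring doorbell. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { FMT_ALWAYS_RESERVED = 0, FMT_INVALID = 1, FMT_DISPATCH = 2 };

typedef struct {
    _Atomic uint16_t format;   /* zero-init = always reserved        */
    uint64_t kernel_object;    /* stand-in for the real packet body  */
} packet_t;

typedef struct {
    packet_t buffer[8];        /* ring of 8 packet slots             */
    _Atomic uint64_t write_index;
    _Atomic uint64_t doorbell; /* last packetID made visible         */
} queue_t;

uint64_t enqueue_dispatch(queue_t *q, uint64_t kernel_object) {
    /* 1. Allocate a packet slot by atomically advancing the write index. */
    uint64_t id = atomic_fetch_add_explicit(&q->write_index, 1,
                                            memory_order_relaxed);
    packet_t *p = &q->buffer[id % 8];
    /* 2. Initialize the packet body in the reserved slot. */
    p->kernel_object = kernel_object;
    /* 3. Mark the packet valid; release ordering publishes the body. */
    atomic_store_explicit(&p->format, FMT_DISPATCH, memory_order_release);
    /* 4. Notify the packet processor with the ready packetID. */
    atomic_store_explicit(&q->doorbell, id, memory_order_release);
    return id;
}
```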
EXAMPLE - ENQUEUE AQL PACKET (2)
Dispatch queue (readIndex/writeIndex):
 Allocate an AQL packet slot
 Initialize the packet
 Copy the packet into the queue. Note that a lock can be held here to
prevent race conditions in a multithreaded environment
 Send the doorbell signal
EXAMPLE - PACKET PROCESSOR
Packet processor (readIndex/writeIndex):
 Receive the doorbell signal
 If there is any packet in the queue, process the packet
 Get the packet content
 Check if it is a barrier packet
 Update the readIndex, change the packet state to invalid,
and send the completion signal
MEMORY MANAGEMENT
OUTLINE
 Memory registration and deregistration
 Memory region and memory segment
 APIs for memory region manipulation
 APIs for memory registration and deregistration
INTRODUCTION
 One of the key features of HSA is its ability to share global pointers between the
host application and code executing on the HSA component.
 This ability means that an application can directly pass a pointer to memory allocated on the host
to a kernel function dispatched to a component without an intermediate copy
 When a buffer created in the host is also accessed by a component,
programmers are encouraged to register the corresponding address range
beforehand.
 Registering memory expresses an intention to access (read or write) the passed buffer from a
component other than the host. This is a performance hint that allows the runtime implementation
to know which buffers will be accessed by some of the components ahead of time.
 When an HSA program no longer needs to access a registered buffer in a device,
the user should deregister that virtual address range.
MEMORY REGION/SEGMENT
 A memory region represents a virtual memory interval that is visible to a particular agent,
and contains properties about how memory is accessed or allocated from that agent.
 Memory segments
 Values
 HSA_SEGMENT_GLOBAL = 1
 HSA_SEGMENT_PRIVATE = 2
 HSA_SEGMENT_GROUP = 4
 HSA_SEGMENT_KERNARG = 8
 HSA_SEGMENT_READONLY = 16
 HSA_SEGMENT_IMAGE = 32
MEMORY REGION INFORMATION
 Attributes of a memory region
 Values
 HSA_REGION_INFO_BASE_ADDRESS
 HSA_REGION_INFO_SIZE
 HSA_REGION_INFO_NODE
 HSA_REGION_INFO_MAX_ALLOCATION_SIZE
 HSA_REGION_INFO_SEGMENT
 HSA_REGION_INFO_BANDWIDTH
 HSA_REGION_INFO_CACHED
MEMORY REGION MANIPULATION (1)
 Get the current value of an attribute of a region
 Iterate over the memory regions that are visible to an agent, and invoke an
application-defined callback on every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the
traversal stops and the function returns that status value.
MEMORY REGION MANIPULATION (2)
 Allocate a block of memory
 Deallocate a block of memory previously allocated
using hsa_memory_allocate
 Copy block of memory
 Copying a number of bytes larger than the size of the
memory regions pointed to by dst or src results in
undefined behavior.
MEMORY REGISTRATION/DEREGISTRATION
 Register memory
 Parameters
 address (input): A pointer to the base of
the memory region to be registered. If a
NULL pointer is passed, no operation is
performed.
 size (input): Requested registration size
in bytes. A size of zero is only allowed if
address is NULL.
 Deregister memory previously registered
using hsa_memory_register
 Parameter
 address (input): A pointer to the base of the
memory region to be deregistered. If a NULL
pointer is passed, no operation is performed.
EXAMPLE
Allocate a memory space
Use hsa_region_get_info to get the
size in bytes of this memory space
Register this memory space as a
performance hint
When the operation finishes, deregister and
free this memory space
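The allocate → register → use → deregister → free flow can be sketched with mocked registration calls. The _sim names and the single-entry registration table are illustrative: in the real API, hsa_memory_register only records a performance hint inside the runtime and both calls return hsa_status_t.

```c
/* Sketch of the registration lifecycle for a host-allocated buffer. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { void *address; size_t size; int live; } registration_t;

static registration_t g_reg;  /* single-entry mock registration table */

int hsa_memory_register_sim(void *address, size_t size) {
    if (address == NULL)
        return 0;             /* NULL address: no operation performed */
    g_reg.address = address;  /* remember the hinted buffer           */
    g_reg.size = size;
    g_reg.live = 1;
    return 0;
}

int hsa_memory_deregister_sim(void *address) {
    if (address == g_reg.address)
        g_reg.live = 0;       /* range no longer hinted for devices   */
    return 0;
}
```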
SUMMARY
SUMMARY
 Covered
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Runtime Initialization and Shutdown (Open/Close)
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Memory Management
 Not covered
 Extension of Core Runtime
 HSAIL Finalization, Linking, and Debugging
 Images and Samplers
QUESTIONS?
HSA MEMORY MODEL
BEN GASTER, ENGINEER, QUALCOMM
OUTLINE
 HSA Memory Model
 OpenCL 2.0
 Has a memory model too
 Obstruction-free bounded deques
 An example using the HSA memory model
HSA MEMORY MODEL
TYPES OF MODELS
 Shared memory computers and programming languages, divide complexity into
models:
1. Memory model specifies safety
 e.g. what value can a load return?
 This is what this section of the tutorial will focus on
2. Execution model specifies liveness
 Described in Ben Sander’s tutorial section on HSAIL
 e.g. can a work-item prevent others from progressing
3. Performance model specifies the big picture
 e.g. caches or branch divergence
 Specific to particular implementations and outside the scope of today’s tutorial
THE PROBLEM
 Assume all locations (a, b, …) are initialized to 0
 What are the values of $s2 and $s4 after execution?
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
*a = 1;
int x = *b;
*b = 1;
int y = *a;
initially *a = 0 && *b = 0
THE SOLUTION
 The memory model tells us:
 Defines the visibility of writes to memory at any given point
 Provides us with a set of possible executions
WHAT MAKES A GOOD MEMORY MODEL*
 Programmability: A good model should make it (relatively) easy to write multi-
work-item programs. The model should be intuitive to most users, even to those
who have not read the details
 Performance: A good model should facilitate high-performance implementations
at reasonable power, cost, etc. It should give implementers broad latitude in
options
 Portability: A good model would be adopted widely or at least provide backward
compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,
University of Wisconsin–Madison, Nov. 1993.
SEQUENTIAL CONSISTENCY (SC)*
 Axiomatic Definition
 A single processor (core) is sequential if “the result of an execution is the same as if the
operations had been executed in the order specified by the program.”
 A multiprocessor is sequentially consistent if “the result of any execution is the same as if the
operations of all processors (cores) were executed in some sequential order, and the
operations of each individual processor (core) appear in this sequence in the order specified by
its program.”
 But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on Computers,
C-28(9):690–91, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 = 0 && $s4 = 1
BUT WHAT ABOUT ACTUAL HARDWARE
 Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
 Many modern processors implement many reordering optimizations
 Store buffers (TSO*): work-items can see their own stores early
 Reorder buffers (XC*): work-items can see other work-items’ stores early
*TSO – Total Store Order as implemented by Sparc and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 = 0 && $s4 = 0
WHAT ARE OUR 3 Ps?
 Programmability: XC makes it really quite hard for the programmer to reason about
what will be visible when
 many memory model experts have been known to get it wrong!
 Performance: XC is good for performance; the hardware (compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
 Portability: XC is very portable as it places very few constraints
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
 To have their cake and eat it!
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA Provides: The ability for
programmers to reason with the (relatively)
intuitive model of SC, while still achieving the
benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
 HSA adopts the same approach as Java, C++11, and OpenCL 2.0, adopting SC for Data-Race-Free (DRF)
 plus some new capabilities!
 (Informally) A data race occurs when two (or more) work-items access the same memory
location such that:
 At least one of the accesses is a WRITE
 There are no intervening synchronization operations
 SC for DRF asks:
 Programmers to ensure programs are DRF under SC
 Implementers to ensure that all executions of DRF programs on the relaxed model are also SC
executions
© Copyright 2014 HSA Foundation. All Rights Reserved
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May
1990
HSA SUPPORTS RELEASE CONSISTENCY
 HSA’s memory model is based on RCsc:
 All atomic_ld_scacq and atomic_st_screl are SC
 Means coherence on all atomic_ld_scacq and atomic_st_screl to a single address
 All atomic_ld_scacq and atomic_st_screl are program ordered per work-item (actually: sequence-ordered by language constraints)
 Similar model adopted by ARMv8
 HSA extends RCSC to SC for HRF*, to access the full capabilities of
modern heterogeneous systems, containing CPUs, GPUs, and DSPs,
for example.
© Copyright 2014 HSA Foundation. All Rights Reserved
*Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann,
B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
$s2 = 0 && $s4 = 1
SEQUENTIAL CONSISTENCY FOR DRF
 Two memory accesses participate in a data race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 A program is data-race-free if no possible execution results in a data race.
 Sequential consistency for data-race-free programs
 Avoid everything else
HSA: Not good enough!
© Copyright 2014 HSA Foundation. All Rights Reserved
ALL ARE NOT EQUAL – OR SOME CAN SEE
BETTER THAN OTHERS
 Remember the HSAIL
Execution Model
© Copyright 2014 HSA Foundation. All Rights Reserved
[Diagram: nested scopes — wave scope within group scope within device scope within platform scope]
DATA-RACE-FREE IS NOT ENOUGH
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
...
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
ld_global (??), [&x]
group #1-2 group #3-4
 Two ordinary memory accesses participate in a data race if they
 Access same location
 At least one is a store
 Can occur simultaneously
Not a data race…
Is it SC?
Well that depends
[Diagram: work-items t1, t2 synchronize at scope S12; t3, t4 at scope S34; both within SGlobal — is visibility implied by causality?]
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY FOR
HETEROGENEOUS-RACE-FREE
 Two memory accesses participate in a heterogeneous race if
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 Are not synchronized with “enough” scope
 A program is heterogeneous-race-free if no possible execution results in a
heterogeneous race.
 Sequential consistency for heterogeneous-race-free programs
 Avoid everything else
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA HETEROGENEOUS RACE FREE
 HRF0: Basic Scope Synchronization
 “enough” = both threads synchronize using identical scope
 Recall example:
 Contains a heterogeneous race in HSA
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_wg 0, [&flag]
...
atomic_cas_global_scar_wg 1, 0, [&flag]
ld_global (??), [&x]
Workgroup #1-2 Workgroup #3-4
HSA Conclusion:
This is bad. Don’t do it.
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO USE HSA WITH SCOPES
Use smallest scope that includes all
producers/consumers of shared data
HSA Scope Selection Guideline
Implication:
Producers/consumers must be known at synchronization time
 Want: For performance, use smallest scope possible
 What is safe in HSA?
Is this a valid assumption?
© Copyright 2014 HSA Foundation. All Rights Reserved
REGULAR GPGPU WORKLOADS
[Diagram: define problem space → partition hierarchically → communicate locally (N times) → communicate globally (M times)]
Well defined (regular) data partitioning +
Well defined (regular) synchronization pattern =
 Producer/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
© Copyright 2014 HSA Foundation. All Rights Reserved
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
...
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
ld $s1, [&x]
IRREGULAR WORKLOADS
 HSA: example is race
 Must upgrade wg (workgroup) -> plat (platform)
 HSA memory model says:
 ld $s1, [&x], will see value (1)!
Workgroup #1-2 Workgroup #3-4
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL
HAS MEMORY MODELS TOO
MAPPING ONTO HSA’S MEMORY MODEL
 It is straightforward to provide a mapping from OpenCL 1.x to the
proposed model
 OpenCL 1.x atomics are unordered and so map to atomic_op_X
 Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation  HSA Memory Model Operation
Atomic load       ld_global_wg / ld_group_wg
Atomic store      atomic_st_global_wg / atomic_st_group_wg
atomic_op         atomic_op_global_comp / atomic_op_group_wg
barrier(…)        fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 BACKGROUND
 Provisional specification released at SIGGRAPH’13, July 2013.
 Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular workloads)
 Key features:
 Shared virtual memory, including platform atomics
 Formally defined memory model based on C11 plus support for scopes
 Includes an extended set of C11 atomic operations
 Generic address space, that subsumes global, local, and private
 Device to device enqueue
 Out-of-order device side queuing model
 Backwards compatible with OpenCL 1.x
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation             HSA Memory Model Operation
Load, memory_order_relaxed   atomic_ld_[global | group]_relaxed_scope
Store, memory_order_relaxed  atomic_st_[global | group]_relaxed_scope
Load, memory_order_acquire   atomic_ld_[global | group]_scacq_scope
Load, memory_order_seq_cst   atomic_ld_[global | group]_scacq_scope
Store, memory_order_release  atomic_st_[global | group]_screl_scope
Store, memory_order_seq_cst  atomic_st_[global | group]_screl_scope
memory_order_acq_rel         atomic_op_[global | group]_scar_scope
memory_order_seq_cst         atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope                  HSA Scope
memory_scope_sub_group        _wave
memory_scope_work_group       _wg
memory_scope_device           _component
memory_scope_all_svm_devices  _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
 Why do we need such a memory model in practice?
 One important application of memory consistency is in the development and use
of concurrent data-structures
 In particular, there is a class of data-structure implementations that provide non-blocking guarantees:
 wait-free; an algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
 In practice it is very hard to build efficient data-structures that meet this requirement
 lock-free; an algorithm is lock-free if, given enough time, at least one of the
work-items (or threads) makes progress
 In practice lock-free algorithms are implemented by work-items cooperating with one
another to allow progress
 obstruction-free; an algorithm is obstruction-free if a work-item, running in isolation, can
make progress
© Copyright 2014 HSA Foundation. All Rights Reserved
BUT WHY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2014 HSA Foundation. All Rights Reserved
[Diagram: emerging compute cluster — four Krait CPUs sharing a 2MB L2, an Adreno GPU, and a Hexagon DSP, each behind its own MMU, connected through a fabric & memory controller]
Diversity in a heterogeneous system, such as
different clock speeds, different scheduling
policies, and more can mean traditional
mutual exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
 Emerging heterogeneous compute clusters mean we need:
 To adapt existing concurrent data-structures
 To develop new concurrent data-structures
 Lock-based programming may still be useful, but often these algorithms will need
to be lock-free
 Of course, this is a key application of the HSA memory model
 To showcase this we highlight the development of a well-known (HLM)
obstruction-free deque*
© Copyright 2014 HSA Foundation. All Rights Reserved
*M. Herlihy, V. Luchangco, and M. Moir.
Obstruction-Free Synchronization: Double-Ended
Queues as an Example. ICDCS 2003, pp. 522–529.
HLM - OBSTRUCTION-FREE DEQUE
 Uses a fixed length circular queue
 At any given time, reading from left to right, the array will contain:
 Zero or more left-null (LN) values
 Zero or more dummy-null (DN) values
 Zero or more right-null (RN) values
 At all times there must be:
 At least two different null values
 At least one LN or DN, and at least one DN or RN
 Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2014 HSA Foundation. All Rights Reserved
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2014 HSA Foundation. All Rights Reserved
[Diagram: circular array holding LN values, stored values v, and RN values, with left and right hint indices]
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
uint64_t type : 2; // null type (LN, RN, DN)
uint64_t counter : 8 ; // version counter to avoid ABA
uint64_t value : 54 ; // index value stored in queue
}
struct queue {
unsigned int size; // size of bounded buffer
node * array; // backing store for deque itself© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL REPRESENTATION
 Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2014 HSA Foundation. All Rights Reserved
ORACLE
 Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
 Which given a deque
 returns (%k) the position of the left-most RN
 atomic_ld_global_scacq used to read node from array
 Makes one if necessary (i.e. if there are only LN or DN)
 atomic_cas_global_scar, required to make new RN
 returns (%left) the left node (i.e. the value to the left of the left most RN position)
 returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP
function &right_pop (arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle (%k, %current, %next) (%queue);
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.type($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1, 0x3FC0000000000000; shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
brn @loop_forever ;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TEST FOR EMPTY
// empty if current.type($d5) == LN || current.type($d5) == DN; otherwise branch on
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address (%deque($d0) + (%k($s0) - 1) * 16)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16;
cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = (RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1; shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d8;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$d3+16], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = (RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1; shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d6;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$d3], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
TAKE AWAYS
 HSA provides a powerful and modern memory model
 Based on the well-known SC for DRF
 Defined as Release Consistency
 Extended with scopes as defined by HRF
 OpenCL 2.0 introduces a new memory model
 Also based on SC for DRF
 Also defined in terms of Release Consistency
 Also extended with scopes as defined in HRF
 Has a well defined mapping to HSA
 Concurrent algorithm development for emerging heterogeneous compute
clusters can benefit from the HSA and OpenCL 2.0 memory models
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER,
ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
 Three key technologies are used to build the user mode queueing
mechanism
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL (Architected Queueing Language) enables any agent to
enqueue tasks
SHARED VIRTUAL MEMORY
PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
 Multiple Virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
CPU0 GPU
VIRTUAL MEMORY1
PHYSICAL MEMORY
VA1->PA1 VA2->PA1
VIRTUAL MEMORY2
PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (HSA)
 Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
CPU0 GPU
VIRTUAL MEMORY
PHYSICAL MEMORY
VA->PA VA->PA
SHARED VIRTUAL MEMORY
 Advantages
 No mapping tricks, no copying back-and-forth between different PA
addresses
 Send pointers (not data) back and forth between HSA agents.
 Implications
 Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
 Common mechanisms for address translation (and servicing address
translation faults)
 Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
 Specifics
 Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
 HSA agents may reserve VA ranges for internal use via system
software.
 All HSA agents other than the host unit must use the lowest privilege
level
 If present, read/write access flags for page tables must be
maintained by all agents.
 Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
 Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
 Advantages
 Composability
 Reduced SW complexity when communicating between agents
 Lower barrier to entry when porting software
 Implications
 Hardware coherency support between all HSA agents
 Can take many forms
 Stand alone Snoop Filters / Directories
 Combined L3/Filters
 Snoop-based systems (no filter)
 Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
 Specifics
 No requirement for instruction memory accesses to be
coherent
 Only applies to the Primary memory type.
 No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the
same memory attributes
 Read-only image data is required to remain static during the
execution of an HSA kernel.
 No double mapping (via different attributes) in order to
modify. Must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
SIGNALING
SIGNALING (1/3)
 HSA agents support the ability to use signaling objects
 All creation/destruction of signaling objects occurs via HSA
runtime APIs
 From an HSA agent you can directly access signaling objects:
 Signal a signal object (this will wake up HSA agents
waiting upon the object)
 Query the current object value
 Wait on the current object value (various conditions supported).
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
 Advantages
 Enables asynchronous events between HSA agents,
without involving the kernel
 Common idiom for work offload
 Low power waiting
 Implications
 Runtime support required
 Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
 Specifics
 Only supported within a PASID
 Supported wait conditions are =, !=, < and >=
 Wait operations may return sporadically (no guarantee against
false positives)
 Programmer must test.
 Wait operations have a maximum duration before returning.
 The HSAIL atomic operations are supported on signal objects.
 Signal objects are opaque
 Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
USER MODE QUEUING
ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
USER MODE QUEUEING (1/3)
 User mode Queueing
 Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA
agents.
 Queues are created/destroyed via calls to the HSA
runtime.
 One (or many) agents enqueue packets, a single agent
dequeues packets.
 Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
 Advantages
 Avoid involving the kernel/driver when dispatching work for an Agent.
 Lower latency job dispatch enables finer granularity of offload
 Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
 Implications
 Packet formats/fields are Architected – standard across vendors!
 Guaranteed backward compatibility
 Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
 More on this later……
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Queue Job
Start Job
Finish Job
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
 HSA Queues look just like standard shared
memory queues, supporting multi-producer,
single-consumer
 Single producer variant defined with some
optimizations possible.
 Queues consist of storage, read/write indices, ID,
etc.
 Queues are created/destroyed via calls to the
HSA runtime
 “Packets” are placed in queues directly from user
mode, via an architected protocol
 Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
Producer Producer
Consumer
Read Index
Write Index
Storage in
coherent, shared
memory
Packets
ARCHITECTED QUEUING LANGUAGE
 Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
 There is no guarantee that more than one packet will be processed in parallel at a
time
 There may be many queues. A single agent may also consume from several
queues.
 Any HSA agent may enqueue packets
 CPUs
 GPUs
 Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes)  Size (bytes)  Field           Notes
0               4             queueType       Differentiates queue types
4               4             queueFeatures   Indicates supported features
8               8             baseAddress     Pointer to packet array
16              8             doorbellSignal  HSA signaling object handle
24              4             size            Packet array cardinality
28              4             queueId         Unique per process
32              8             serviceQueue    Queue for callback services
intrinsic       8             writeIndex      Packet array write index
intrinsic       8             readIndex       Packet array read index
QUEUE VARIANTS
 queueType and queueFeatures together define queue semantics and
capabilities
 Two queueType values defined, other values reserved:
 MULTI – queue supports multiple producers
 SINGLE – queue supports single producer
 queueFeatures is a bitfield indicating capabilities
 DISPATCH (bit 0) if set then queue supports DISPATCH packets
 AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
 All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
 Queue doorbells are HSA signaling objects with restrictions
 Created as part of the queue – lifetime tied to queue object
 Atomic read-modify-write not allowed
 size field value must be aligned to a power of 2
 serviceQueue can be used by HSA kernel for callback services
 Provided by application when queue is created
 Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
 readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
 Accessed through HSA runtime API and HSAIL operations
 HSA runtime/HSAIL operations defined to
 Read readIndex or writeIndex property
 Write readIndex or writeIndex property
 Add constant to writeIndex property (returns previous writeIndex value)
 CAS on writeIndex property
 readIndex & writeIndex operations treated as atomic in memory model
 relaxed, acquire, release and acquire-release variants defined as applicable
 readIndex and writeIndex never wrap
 PacketID – the index of a particular packet
 Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET ENQUEUE
 Packet enqueue follows a few simple steps:
 Reserve space
 Multiple packets can be reserved at a time
 Write packet to queue
 Mark packet as valid
 Producer no longer allowed to modify packet
 Consumer is allowed to start processing packet
 Notify consumer of packet through the queue doorbell
 Multiple packets can be notified at a time
 Doorbell signal should be signaled with last packetID notified
 On small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
 Two flows envisaged
 Atomic add writeIndex with number of packets to reserve
 Producer must wait until packetID < readIndex + size before writing to packet
 Queue can be sized so that wait is unlikely (or impossible)
 Suitable when many threads use one queue
 Check queue not full first, then use atomic CAS to update writeIndex
 Can be inefficient if many threads use the same queue
 Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE OPTIMIZATIONS
 Queue behavior is loosely defined to allow optimizations
 Some potential producer behavior optimizations:
 Keep local copy of readIndex, update when required
 For single producer queues:
 Keep local copy of writeIndex
 Use store operation rather than add/cas atomic to update writeIndex
 Some potential consumer behavior optimizations:
 Use packet format field to determine whether a packet has been submitted rather than writeIndex
property
 Speculatively read multiple packets from the queue
 Not update readIndex for each packet processed
 Rely on value used for doorbellSignal to notify new packets
 Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet, the format field is INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
 Packets come in three main task types, plus reserved/invalid placeholders, all with architected layouts
 Always Reserved & Invalid
 Do not contain a valid task and are not processed (the queue will not progress past them)
 Dispatch
 Specifies kernel execution over a grid
 Agent Dispatch
 Specifies a single function to perform with a set of parameters
 Barrier
 Used for task dependencies
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial


  • 1. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): ARCHITECTURE AND ALGORITHMS ISCA TUTORIAL - JUNE 15, 2014
  • 2. TOPICS  Introduction  HSAIL Virtual Parallel ISA  HSA Runtime  HSA Memory Model  HSA Queuing Model  HSA Applications  HSA Compilation © Copyright 2014 HSA Foundation. All Rights Reserved The HSA Specifications are not at 1.0 final so all content is subject to change
  • 3. SCHEDULE © Copyright 2014 HSA Foundation. All Rights Reserved Time Topic Speaker 8:45am Introduction to HSA Phil Rogers, AMD 9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD 10:30am Break 10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University 12 noon Lunch 1pm HSA Memory Model Benedict Gaster, Qualcomm 2pm HSA Queuing Model Hakan Persson, ARM 3pm Break 3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois 4pm HSA Application Programming Wen Mei Hwu, University of Illinois 4:45pm Questions All presenters
  • 4. INTRODUCTION PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
  • 5. HSA FOUNDATION  Founded in June 2012  Developing a new platform for heterogeneous systems  www.hsafoundation.com  Specifications under development in working groups to define the platform  Membership consists of 43 companies and 16 universities  Adding 1-2 new members each month © Copyright 2014 HSA Foundation. All Rights Reserved
  • 6. DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING © Copyright 2014 HSA Foundation. All Rights Reserved Founders Promoters Supporters Contributors Academic
  • 7. MEMBERSHIP TABLE Membership Level Number List
Founder 6 AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter 1 LG Electronics
Contributor 25 Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc., Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter 13 Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic 17 Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab, National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 8. HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER  Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms  SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory  How do we make them even better?  Easier to program  Easier to optimize  Higher performance  Lower power  HSA unites accelerators architecturally  Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU © Copyright 2014 HSA Foundation. All Rights Reserved
  • 9. INFLECTIONS IN PROCESSOR DESIGN © Copyright 2014 HSA Foundation. All Rights Reserved ? Single-thread Performance Time we are here Enabled by:  Moore’s Law  Voltage Scaling Constrained by: Power Complexity Single-Core Era Modern Application Performance Time (Data-parallel exploitation) we are here Heterogeneous Systems Era Enabled by:  Abundant data parallelism  Power efficient GPUs Temporarily Constrained by: Programming models Comm. overhead Throughput Performance Time (# of processors) we are here Enabled by:  Moore’s Law  SMP architecture Constrained by: Power Parallel SW Scalability Multi-Core Era Assembly  C/C++  Java … pthreads  OpenMP / TBB … Shader  CUDA OpenCL  C++ and Java
  • 10. LEGACY GPU COMPUTE PCIe ™ System Memory (Coherent) CPU CPU CPU . . . CU CU CU CU CU CU CU CU GPU Memory (Non-Coherent) GPU  Multiple memory pools  Multiple address spaces  High overhead dispatch  Data copies across PCIe  New languages for programming  Dual source development  Proprietary environments  Expert programmers only  Need to fix all of this to unleash our programmers The limiters © Copyright 2014 HSA Foundation. All Rights Reserved
  • 11. EXISTING APUS AND SOCS CPU 1 CPU N… CPU 2 Physical Integration CU 1 … CU 2 CU 3 CU M-2 CU M-1 CU M System Memory (Coherent) GPU Memory (Non-Coherent) GPU  Physical Integration  Good first step  Some copies gone  Two memory pools remain  Still queue through the OS  Still requires expert programmers  Need to finish the job
  • 12. AN HSA ENABLED SOC  Unified Coherent Memory enables data sharing across all processors  Processors architected to operate cooperatively  Designed to enable the application to run on different processors at different times Unified Coherent Memory CPU 1 CPU N… CPU 2 CU 1 CU 2 CU 3 CU M-2 CU M-1 CU M…
  • 13. PILLARS OF HSA*  Unified addressing across all processors  Operation into pageable system memory  Full memory coherency  User mode dispatch  Architected queuing language  Scheduling and context switching  HSA Intermediate Language (HSAIL)  High level language support for GPU compute processors © Copyright 2014 HSA Foundation. All Rights Reserved * All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
  • 14. HSA SPECIFICATIONS  HSA System Architecture Specification  Version 1.0 Provisional, Released April 2014  Defines discovery, memory model, queue management, atomics, etc  HSA Programmers Reference Specification  Version 1.0 Provisional, Released June 2014  Defines the HSAIL language and object format  HSA Runtime Software Specification  Version 1.0 Provisional, expected to be released in July 2014  Defines the APIs through which an HSA application uses the platform  All released specifications can be found at the HSA Foundation web site:  www.hsafoundation.com/standards © Copyright 2014 HSA Foundation. All Rights Reserved
  • 15. HSA - AN OPEN PLATFORM  Open Architecture, membership open to all  HSA Programmers Reference Manual  HSA System Architecture  HSA Runtime  Delivered via royalty free standards  Royalty Free IP, Specifications and APIs  ISA agnostic for both CPU and GPU  Membership from all areas of computing  Hardware companies  Operating Systems  Tools and Middleware  Applications  Universities © Copyright 2014 HSA Foundation. All Rights Reserved
  • 16. HSA INTERMEDIATE LAYER — HSAIL  HSAIL is a virtual ISA for parallel programs  Finalized to ISA by a JIT compiler or “Finalizer”  ISA independent by design for CPU & GPU  Explicitly parallel  Designed for data parallel programming  Support for exceptions, virtual functions, and other high level language features  Lower level than OpenCL SPIR  Fits naturally in the OpenCL compilation stack  Suitable to support additional high level languages and programming models:  Java, C++, OpenMP, Python, etc © Copyright 2014 HSA Foundation. All Rights Reserved
  • 17. HSA MEMORY MODEL  Defines visibility ordering between all threads in the HSA System  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models  Relaxed consistency memory model for parallel compute performance  Visibility controlled by:  Load.Acquire  Store.Release  Fences © Copyright 2014 HSA Foundation. All Rights Reserved
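The Load.Acquire / Store.Release visibility controls above map directly onto C11/C++11 acquire/release atomics. A minimal single-threaded C sketch of the publish/consume pattern (function names are illustrative; on a real HSA system the producer and consumer would run on different agents, e.g. CPU and GPU):

```c
#include <stdatomic.h>

/* Producer: write the payload with an ordinary store, then publish it
 * with a store-release (HSAIL st.release).  Consumer: spin on a
 * load-acquire (HSAIL ld.acquire), after which the payload store is
 * guaranteed visible.  Single-threaded demo of the ordering API. */
static int payload;
static atomic_int ready;

void produce(int value) {
    payload = value;                                        /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release); /* st.release */
}

int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire)) /* ld.acquire */
        ;
    return payload;   /* visible: ordered after the acquiring load */
}
```

The same relaxed-consistency discipline applies in HSAIL: data written before a release is visible to any agent that observes the release via an acquire.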
  • 18. HSA QUEUING MODEL  User mode queuing for low latency dispatch  Application dispatches directly  No OS or driver required in the dispatch path  Architected Queuing Layer  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Allows for dispatch to queue from any agent  CPU or GPU  GPU self enqueue enables lots of solutions  Recursion  Tree traversal  Wavefront reforming © Copyright 2014 HSA Foundation. All Rights Reserved
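The user-mode queue idea can be sketched as a ring buffer whose write index is bumped atomically, so any agent can claim a packet slot without entering the OS. This is an illustrative toy, not the Architected Queuing Language packet format (which the HSA specs define); all names and the full-queue handling are simplified:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy user-mode dispatch queue: a power-of-two ring of packet slots
 * plus an atomically incremented write index.  Real AQL packets carry
 * kernel object, kernarg address, grid dims, completion signal, etc. */
#define QUEUE_SIZE 8   /* must be a power of two */

typedef struct { uint64_t kernel_object; uint64_t kernarg; } packet_t;

typedef struct {
    packet_t slots[QUEUE_SIZE];
    atomic_uint_fast64_t write_index;
    atomic_uint_fast64_t read_index;
} user_queue_t;

/* Any agent (CPU or GPU) can enqueue; no driver call in the path.
 * Simplified: a real implementation must not lose the claimed slot
 * when the queue is full. */
int enqueue(user_queue_t *q, packet_t p) {
    uint64_t w = atomic_fetch_add(&q->write_index, 1);   /* claim slot */
    if (w - atomic_load(&q->read_index) >= QUEUE_SIZE)
        return -1;                                       /* queue full */
    q->slots[w & (QUEUE_SIZE - 1)] = p;                  /* write packet */
    return 0;
}
```

Because enqueue is just atomics on shared memory, a GPU kernel can use the same code path to self-enqueue work (recursion, tree traversal), exactly the scenarios listed above.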
  • 20. Hardware - APUs, CPUs, GPUs Driver Stack Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers Graphics Kernel Mode Driver Apps Apps Apps Apps Apps Apps HSA Software Stack Task Queuing Libraries HSA Domain Libraries, OpenCL ™ 2.x Runtime HSA Kernel Mode Driver HSA Runtime HSA JIT Apps Apps Apps Apps Apps Apps User mode component Kernel mode component Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK © Copyright 2014 HSA Foundation. All Rights Reserved
  • 21. OPENCL™ AND HSA  HSA is an optimized platform architecture for OpenCL  Not an alternative to OpenCL  OpenCL on HSA will benefit from  Avoidance of wasteful copies  Low latency dispatch  Improved memory model  Pointers shared between CPU and GPU  OpenCL 2.0 leverages HSA Features  Shared Virtual Memory  Platform Atomics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 22. ADDITIONAL LANGUAGES ON HSA  In development © Copyright 2014 HSA Foundation. All Rights Reserved
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | Multicoreware | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP (GCC) | AMD, Suse | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
  • 23. SUMATRA PROJECT OVERVIEW  AMD/Oracle sponsored Open Source (OpenJDK) project  Targeted at Java 9 (2015 release)  Allows developers to efficiently represent data parallel algorithms in Java  Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to enable both CPU or GPU computing  At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch ‘selected’ constructs to available HSA enabled devices  Developers of Java libraries are already refactoring their library code to use these same constructs  So developers using existing libraries should see GPU acceleration without any code changes  http://openjdk.java.net/projects/sumatra/  https://wikis.oracle.com/display/HotSpotInternals/Sumatra  http://mail.openjdk.java.net/pipermail/sumatra-dev/ © Copyright 2014 HSA Foundation. All Rights Reserved Application.java Java Compiler GPU CPU Sumatra Enabled JVM Application GPU ISA Lambda/Stream API CPU ISA Application.class Development Runtime HSA Finalizer
  • 24. HSA OPEN SOURCE SOFTWARE  HSA will feature an open source linux execution and compilation stack  Allows a single shared implementation for many components  Enables university research and collaboration in all areas  Because it’s the right thing to do © Copyright 2014 HSA Foundation. All Rights Reserved
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in linux distros
  • 25. WORKLOAD EXAMPLE SUFFIX ARRAY CONSTRUCTION CLOUD SERVER WORKLOAD
  • 26. SUFFIX ARRAYS  Suffix Arrays are a fundamental data structure  Designed for efficient searching of a large text  Quickly locate every occurrence of a substring S in a text T  Suffix Arrays are used to accelerate in-memory cloud workloads  Full text index search  Lossless data compression  Bio-informatics © Copyright 2014 HSA Foundation. All Rights Reserved
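To make the data structure concrete, here is a naive O(n² log n) suffix array builder in C: sort the suffix start offsets by comparing suffixes directly. This is only a sketch of what the output is; the accelerated version on the next slide replaces this with the parallel skew algorithm (radix/merge sorts on the GPU):

```c
#include <stdlib.h>
#include <string.h>

/* Naive suffix array construction: sa[k] is the start offset of the
 * k-th suffix of text in lexicographic order.  qsort's comparator has
 * no context argument, so the text is passed via a file-scope pointer
 * (fine for a sketch, not for threaded code). */
static const char *g_text;

static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(g_text + i, g_text + j);
}

/* Fills sa[0..strlen(text)-1]. */
void build_suffix_array(const char *text, int *sa) {
    int n = (int)strlen(text);
    for (int i = 0; i < n; i++) sa[i] = i;   /* all suffix starts */
    g_text = text;
    qsort(sa, n, sizeof(int), cmp_suffix);   /* sort by suffix */
}
```

With the suffix array built, every occurrence of a substring S in T can then be located by binary search over the sorted suffixes, which is what makes the structure useful for full-text indexing.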
  • 27. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA © Copyright 2014 HSA Foundation. All Rights Reserved M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to “Principles and Practice of Parallel Programming (PPoPP’13)”, February 2013. AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM. By offloading data parallel computations to the GPU, HSA increases performance (+5.8x) and reduces energy (-5x) for Suffix Array Construction. By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies. Skew Algorithm for Compute SA phases: Radix Sort::GPU, Lexical Rank::CPU, Radix Sort::GPU, Compute SA::CPU, Merge Sort::GPU
  • 28. EASE OF PROGRAMMING CODE COMPLEXITY VS. PERFORMANCE
  • 29. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS (Exemplary ISV “Hessian” Kernel): chart comparing lines of code (broken into Init, Compile, Copy, Launch, Algorithm, and Copy-back phases) and performance for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt. AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 30. THE HSA FUTURE  Architected heterogeneous processing on the SOC  Programming of accelerators becomes much easier  Accelerated software that runs across multiple hardware vendors  Scalability from smart phones to super computers on a common architecture  GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model  Heterogeneous software ecosystem evolves at a much faster pace  Lower power, more capable devices in your hand, on the wall, in the cloud © Copyright 2014 HSA Foundation. All Rights Reserved
  • 32. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): HSAIL VIRTUAL PARALLEL ISA BEN SANDER, AMD
  • 33. TOPICS  Introduction and Motivation  HSAIL – what makes it special?  HSAIL Execution Model  How to program in HSAIL?  Conclusion © Copyright 2014 HSA Foundation. All Rights Reserved
  • 34. STATE OF GPU COMPUTING Today’s Challenges  Separate address spaces  Copies  Can’t share pointers  New language required for compute kernel  EX: OpenCL™ runtime API  Compute kernel compiled separately from host code Emerging Solution  HSA Hardware  Single address space  Coherent  Virtual  Fast access from all components  Can share pointers  Bring GPU computing to existing, popular, programming models  Single-source, fully supported by compiler  HSAIL compiler IR (Cross-platform!) • GPUs are fast and power efficient: high compute density per-mm and per-watt • But: Can be hard to program
  • 35. THE PORTABILITY CHALLENGE  CPU ISAs  ISA innovations added incrementally (e.g., NEON, AVX)  ISA retains backwards-compatibility with previous generation  Two dominant instruction-set architectures: ARM and x86  GPU ISAs  Massive diversity of architectures in the market  Each vendor has its own ISA, and often several in market at same time  No commitment (or attempt!) to provide any backwards compatibility  Traditionally graphics APIs (OpenGL, DirectX) provide the necessary abstraction © Copyright 2014 HSA Foundation. All Rights Reserved
  • 36. HSAIL : WHAT MAKES IT SPECIAL?
  • 37. WHAT IS HSAIL?  Intermediate language for parallel compute in HSA  Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)  Expresses parallel regions of code  Binary format of HSAIL is called “BRIG”  Goal: Bring parallel acceleration to mainstream programming languages © Copyright 2014 HSA Foundation. All Rights Reserved main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler BRIG Finalizer Component ISA Host ISA
  • 38. KEY HSAIL FEATURES  Parallel  Shared virtual memory  Portable across vendors in HSA Foundation  Stable across multiple product generations  Consistent numerical results (IEEE-754 with defined min accuracy)  Fast, robust, simple finalization step (no monthly updates)  Good performance (little need to write in ISA)  Supports all of OpenCL™  Supports Java, C++, and other languages as well © Copyright 2014 HSA Foundation. All Rights Reserved
  • 39. HSAIL INSTRUCTION SET - OVERVIEW  Similar to assembly language for a RISC CPU  Load-store architecture  Destination register first, then source registers  140 opcodes (Java™ bytecode has 200)  Floating point (single, double, half (f16))  Integer (32-bit, 64-bit)  Some packed operations  Branches  Function calls  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas  Synchronize host CPU and HSA Component!  Text and Binary formats (“BRIG”) ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120) add_u64 $d1, $d0, 24 ; $d1= $d0+24 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 40. SEGMENTS AND MEMORY (1/2)  7 segments of memory  global, readonly, group, spill, private, arg, kernarg  Memory instructions can (optionally) specify a segment  Control data sharing properties and communicate intent  Global Segment  Visible to all HSA agents (including host CPU)  Group Segment  Provides high-performance memory shared in the work-group.  Group memory can be read and written by any work-item in the work-group  HSAIL provides sync operations to control visibility of group memory ld_global_u64 $d0,[$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 41. SEGMENTS AND MEMORY (2/2)  Spill, Private, Arg Segments  Represent different regions of a per-work-item stack  Typically generated by compiler, not specified by programmer  Compiler can use these to convey intent – ie spills  Kernarg Segment  Programmer writes kernarg segment to pass arguments to a kernel  Read-Only Segment  Remains constant during execution of kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  • 42. FLAT ADDRESSING  Each segment mapped into virtual address space  Flat addresses can map to segments based on virtual address  Instructions with no explicit segment use flat addressing  Very useful for high-level language support (ie classes, libraries)  Aligns well with OpenCL 2.0 “generic” addressing feature ld_global_u64 $d6, [%_arg0] ; global ld_u64 $d0,[$d6+24] ; flat © Copyright 2014 HSA Foundation. All Rights Reserved
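The flat-to-segment mapping can be sketched as a simple range check: each non-global segment occupies a fixed aperture in the flat virtual address space, and anything outside the apertures is global. The aperture bases and sizes below are made up for illustration; real values come from the platform:

```c
#include <stdint.h>

/* Illustrative flat-address resolution: a flat ld/st is steered to
 * group/private/global hardware by comparing against segment
 * apertures.  Aperture placement here is invented for the sketch. */
enum segment { SEG_GLOBAL, SEG_GROUP, SEG_PRIVATE };

#define GROUP_BASE   0x100000000ULL
#define GROUP_SIZE   0x000010000ULL
#define PRIVATE_BASE 0x200000000ULL
#define PRIVATE_SIZE 0x000010000ULL

enum segment classify_flat_address(uint64_t addr) {
    if (addr >= GROUP_BASE && addr < GROUP_BASE + GROUP_SIZE)
        return SEG_GROUP;
    if (addr >= PRIVATE_BASE && addr < PRIVATE_BASE + PRIVATE_SIZE)
        return SEG_PRIVATE;
    return SEG_GLOBAL;   /* everything else is the global segment */
}
```

This is why flat addressing suits compiled high-level code: a pointer of unknown origin (class member, library argument) can be dereferenced without the compiler proving which segment it lives in.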
  • 43. REGISTERS  Four classes of registers:  S: 32-bit, Single-precision FP or Int  D: 64-bit, Double-precision FP or Long Int  Q: 128-bit, Packed data.  C: 1-bit, Control Registers (Compares)  Fixed number of registers  S, D, Q share a single pool of resources  S + 2*D + 4*Q <= 128  Up to 128 S or 64 D or 32 Q (or a blend)  Register allocation done in high-level compiler  Finalizer doesn’t perform expensive register allocation c0 c1 c2 c3 c4 c5 c6 c7 s0 d0 q0 s1 s2 d1 s3 s4 d2 q1 s5 s6 d3 s7 s8 d4 q2 s9 s10 d5 s11 … s120 d60 q30 s121 s122 d61 s123 s124 d62 q31 s125 s126 d63 s127 © Copyright 2014 HSA Foundation. All Rights Reserved
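The shared-pool constraint S + 2*D + 4*Q <= 128 is something a high-level compiler (which owns register allocation) must respect. A minimal sketch of the check, with the per-class caps from the slide:

```c
#include <stdbool.h>

/* S (32-bit), D (64-bit) and Q (128-bit) registers draw from one
 * pool: S + 2*D + 4*Q <= 128.  Caps: up to 128 S, 64 D, 32 Q. */
bool register_budget_ok(int s_regs, int d_regs, int q_regs) {
    if (s_regs < 0 || d_regs < 0 || q_regs < 0) return false;
    if (s_regs > 128 || d_regs > 64 || q_regs > 32) return false;
    return s_regs + 2 * d_regs + 4 * q_regs <= 128;
}
```

Because the finalizer only validates rather than allocates, it stays fast and simple, one of the design goals listed earlier.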
  • 44. SIMT EXECUTION MODEL  HSAIL Presents a “SIMT” execution model to the programmer  “Single Instruction, Multiple Thread”  Programmer writes program for a single thread of execution  Each work-item appears to have its own program counter  Branch instructions look natural  Hardware Implementation  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency  Actually one program counter for the entire SIMD instruction  Branches implemented with predication  SIMT Advantages  Easier to program (branch code in particular)  Natural path for mainstream programming models and existing compilers  Scales across a wide variety of hardware (programmer doesn’t see vector width)  Cross-lane operations available for those who want peak performance © Copyright 2014 HSA Foundation. All Rights Reserved
  • 45. WAVEFRONTS  Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”  Lanes in wavefront can be “active” or “inactive”  Inactive lanes consume hardware resources but don’t do useful work  Tradeoffs  “Wavefront-aware” programming can be useful for peak performance  But results in less portable code (since wavefront width is encoded in algorithm) if (cond) { operationA; // cond=True lanes active here } else { operationB; // cond=False lanes active here } © Copyright 2014 HSA Foundation. All Rights Reserved
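How the hardware runs the if/else above with a single program counter can be sketched in scalar C: both sides of the branch execute over the whole wavefront, and a per-lane predicate mask decides which lanes commit results. The wavefront width and names are illustrative:

```c
#include <stdint.h>

#define WAVE 8   /* illustrative wavefront width */

/* SIMD execution of a SIMT if/else via predication: lanes where
 * cond is true commit operationA's result (1), the rest commit
 * operationB's result (2).  Inactive lanes still occupy the ALU. */
void simt_if_else(const int cond[WAVE], int out[WAVE]) {
    uint8_t active = 0;
    for (int lane = 0; lane < WAVE; lane++)     /* build predicate mask */
        if (cond[lane]) active |= (uint8_t)(1u << lane);

    for (int lane = 0; lane < WAVE; lane++)     /* "operationA" pass */
        if (active & (1u << lane)) out[lane] = 1;

    for (int lane = 0; lane < WAVE; lane++)     /* "operationB" pass */
        if (!(active & (1u << lane))) out[lane] = 2;
}
```

The cost model follows directly: a divergent branch pays for both paths, which is why inactive lanes "consume hardware resources but don't do useful work."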
  • 46. CROSS-LANE OPERATIONS  Example HSAIL cross-lane operation: “activelaneid”  Dest set to count of earlier work-items that are active for this instruction  Useful for compaction algorithms  activelaneid_u32 $s0  Example HSAIL cross-lane operation: “activelaneshuffle”  Each workitem reads value from another lane in the wavefront  Supports selection of “identity” element for inactive lanes  Useful for wavefront-level reductions  activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1= source, s2=lane select, no identity © Copyright 2014 HSA Foundation. All Rights Reserved
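The semantics of activelaneid (count of lower-numbered active lanes) amount to a prefix population count of the execution mask, which is exactly what stream compaction needs: each surviving lane gets a dense output slot. A scalar sketch over a 64-bit lane mask:

```c
#include <stdint.h>

/* activelaneid for one lane: popcount of the active-lane mask
 * restricted to lanes numbered below this one.  An active lane can
 * use the result directly as its compacted output index. */
uint32_t activelaneid(uint64_t active_mask, unsigned lane) {
    uint64_t below = active_mask & ((1ULL << lane) - 1); /* lanes < lane */
    uint32_t count = 0;
    while (below) {                                      /* popcount */
        count += (uint32_t)(below & 1);
        below >>= 1;
    }
    return count;
}
```

For example, with lanes 0, 1 and 3 active, lane 3 receives id 2 and can write its element to slot 2 of the compacted array.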
  • 47. HSAIL MODES  Working group strived to limit optional modes and features in HSAIL  Minimize differences between HSA target machines  Better for compiler vendors and application developers  Two modes survived  Machine Models  Small: 32-bit pointers, 32-bit data  Large: 64-bit pointers, 32-bit or 64-bit data  Vendors can support one or both models  “Base” and “Full” Profiles  Two sets of requirements for FP accuracy, rounding, exception reporting, hard pre-emption © Copyright 2014 HSA Foundation. All Rights Reserved
  • 48. HSA PROFILES
Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 49. HSA PARALLEL EXECUTION MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
  • 50. HSA PARALLEL EXECUTION MODEL Basic Idea: Programmer supplies an HSAIL “kernel” that is run on each work-item. Kernel is written as a single thread of execution. Programmer specifies grid dimensions (scope of problem) when launching the kernel. Each work-item has a unique coordinate in the grid. Programmer optionally specifies work-group dimensions (for optimized communication). © Copyright 2014 HSA Foundation. All Rights Reserved
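A work-item's unique grid coordinate is derived from its work-group id and its local id within the group. Shown for one axis (the same formula applies per axis in 2D/3D grids); this is a sketch of the arithmetic, not runtime API code:

```c
#include <stdint.h>

/* Flat absolute work-item id along one grid axis:
 *   abs_id = workgroup_id * workgroup_size + local_id
 * which is what HSAIL's workitemabsid exposes to the kernel. */
uint64_t workitem_abs_id(uint32_t workgroup_id,
                         uint32_t workgroup_size,
                         uint32_t local_id) {
    return (uint64_t)workgroup_id * workgroup_size + local_id;
}
```

So work-item 5 of work-group 2 with 256-wide groups processes grid element 517, and the programmer's kernel simply indexes its data with that coordinate.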
  • 51. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 52. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  • 53. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) 2D work-group 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
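Each work-item in the 2D grid applies the two 3x3 masks at its own pixel. A sketch of the per-work-item computation in C (for simplicity it returns the squared magnitude Gx² + Gy², the G on the slide is its square root; (x, y) must be an interior pixel so the neighborhood is in bounds):

```c
/* Per-work-item Sobel: img is row-major with the given width.
 * Returns Gx^2 + Gy^2 for interior pixel (x, y). */
float sobel_mag2(const float *img, int width, int x, int y) {
    #define P(dx, dy) img[(y + (dy)) * width + (x + (dx))]
    float gx = -P(-1,-1) + P(1,-1)                 /* horizontal mask */
             - 2*P(-1,0) + 2*P(1,0)
             - P(-1,1)  + P(1,1);
    float gy = -P(-1,-1) - 2*P(0,-1) - P(1,-1)     /* vertical mask */
             + P(-1,1)  + 2*P(0,1)  + P(1,1);
    #undef P
    return gx * gx + gy * gy;
}
```

On a vertical edge Gx dominates and Gy vanishes, which is what makes the filter an edge detector.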
  • 54. HOW TO PROGRAM HSA? WHAT DO I TYPE? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 55. HSA PROGRAMMING MODELS : CORE PRINCIPLES  Single source  Host and device code side-by-side in same source file  Written in same programming language  Single unified coherent address space  Freely share pointers between host and device  Similar memory model as multi-core CPU  Parallel regions identified with existing language syntax  Typically same syntax used for multi-core CPU  HSAIL is the compiler IR that supports these programming models © Copyright 2014 HSA Foundation. All Rights Reserved
  • 56. GCC OPENMP : COMPILATION FLOW  SUSE GCC Project  Adding HSAIL code generator to GCC compiler infrastructure  Supports OpenMP 3.1 syntax  No data movement directives required! main() { … // Host code. #pragma omp parallel for for (int i=0;i<N; i++) { C[i] = A[i] + B[i]; } … } GCC OpenMP Compiler BRIG Finalizer Component ISA Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
  • 57. GCC OpenMP flow C/C++/Fortran OpenMP application e.g., #pragma omp for for( j = 0; j<n;j++) { b[j] = a[j]; } GNU Compiler(GCC) Compiles host code + Emits runtime calls with kernel name, parameters, launch attributes Lowers OpenMP directives, converts GIMPLE to BRIG. Embeds BRIG into host code Dispatch kernel to GPU Pragmas map to calls into HSA Runtime Application Compiler Run time Finalize kernel from BRIG->ISA Kernels finalized once and cached. Compile time © Copyright 2014 HSA Foundation. All Rights Reserved
  • 58. MCW C++AMP : COMPILATION FLOW  C++AMP : Single-source C++ template parallel programming model  MCW compiler based on CLANG/LLVM  Open-source and runs on Linux  Leverage open-source LLVM->HSAIL code generator main() { … parallel_for_each(grid<1>(extent<256>(…) … } C++AMP Compiler BRIG Finalizer Component ISA Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
  • 59. JAVA: RUNTIME FLOW © Copyright 2014 HSA Foundation. All Rights Reserved JAVA 8 – HSA ENABLED APARAPI  Java 8 brings Stream + Lambda API. ‒ More natural way of expressing data parallel algorithms ‒ Initially targeted at multi-core.  APARAPI will: ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA enabled devices at runtime via HSAIL JVM Java Application HSA Finalizer & Runtime APARAPI + Lambda API GPU CPU Future Java – HSA ENABLED JAVA (SUMATRA)  Adds native GPU acceleration to Java Virtual Machine (JVM)  Developer uses JDK Lambda, Stream API  JVM uses GRAAL compiler to generate HSAIL JVM Java Application HSA Finalizer & Runtime Java JDK Stream + Lambda API Java GRAAL JIT backend GPU CPU
  • 60. AN EXAMPLE (IN JAVA 8) © Copyright 2014 HSA Foundation. All Rights Reserved //Example computes the percentage of total scores achieved by each player on a team. class Player { private Team team; // Note: Reference to the parent Team. private int scores; private float pctOfTeamScores; public Team getTeam() {return team;} public int getScores() {return scores;} public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; } }; // “Team” class not shown // Assume “allPlayers” is an initialized array of Players.. Arrays.stream(allPlayers). // wrap the array in a stream parallel(). // developer indication that lambda is thread-safe forEach(p -> { int teamScores = p.getTeam().getScores(); float pctOfTeamScores = (float)p.getScores()/(float) teamScores; p.setPctOfTeamScores(pctOfTeamScores); });
  • 61. HSAIL CODE EXAMPLE © Copyright 2014 HSA Foundation. All Rights Reserved 01: version 0:95: $full : $large; 02: // static method HotSpotMethod<Main.lambda$2(Player)> 03: kernel &run ( 04: kernarg_u64 %_arg0 // Kernel signature for lambda method 05: ) { 06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register 07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord 08: 09: cvt_u64_s32 $d2, $s2; // Convert X gid to long 10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref 11: add_u64 $d2, $d2, 24; // Adjust for actual elements start 12: add_u64 $d2, $d2, $d6; // Add to array ref ptr 13: ld_global_u64 $d6, [$d2]; // Load from array element into reg 14: @L0: 15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam() 16: mov_b64 $d3, $d0; 17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores () 18: cvt_f32_s32 $s16, $s3; 19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores() 20: cvt_f32_s32 $s17, $s0; 21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores 22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores() 23: ret; 24: };
  • 62. HOW TO PROGRAM HSA? OTHER PROGRAMMING TOOLS © Copyright 2014 HSA Foundation. All Rights Reserved
  • 63. HSAIL ASSEMBLER kernel &run (kernarg_u64 %_arg0) { ld_kernarg_u64 $d6, [%_arg0]; workitemabsid_u32 $s2, 0; cvt_u64_s32 $d2, $s2; mul_u64 $d2, $d2, 8; add_u64 $d2, $d2, 24; add_u64 $d2, $d2, $d6; ld_global_u64 $d6, [$d2]; . . . HSAIL Assembler BRIG Finalizer Machine ISA • HSAIL has a text format and an assembler © Copyright 2014 HSA Foundation. All Rights Reserved
  • 64. OPENCL™ OFFLINE COMPILER (CLOC) __kernel void vec_add( __global const float *a, __global const float *b, __global float *c, const unsigned int n) { int id = get_global_id(0); // Bounds check if (id < n) c[id] = a[id] + b[id]; } CLOC BRIG Finalizer Machine ISA •OpenCL split-source model cleanly isolates kernel •Can express many HSAIL features in OpenCL Kernel Language •Higher productivity than writing in HSAIL assembly •Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware) •Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 65. KEY TAKEAWAYS  HSAIL  Thin, robust, fast finalizer  Portable (multiple HW vendors and parallel architectures)  Supports shared virtual memory and platform atomics  HSA brings GPU computing to mainstream programming models  Shared and coherent memory bridges “faraway accelerator” gap  HSAIL provides the common IL for high-level languages to benefit from parallel computing  Languages and Compilers  HSAIL support in GCC, LLVM, Java JVM  Leverage same language syntax designed for multi-core CPUs  Can use pointer-containing data structures © Copyright 2014 HSA Foundation. All Rights Reserved
  • 66. HSA RUNTIME YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
  • 67. OUTLINE  Introduction  HSA Core Runtime API (Pre-release 1.0 provisional)  Initialization and Shut Down  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Summary © Copyright 2014 HSA Foundation. All Rights Reserved
  • 68. INTRODUCTION (1)  The HSA core runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.  The overall goal of the HSA core runtime design is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures.  The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level.  The HSA core runtime API is standard across all HSA vendors, such that languages which use the HSA runtime can run on different vendor’s platforms that support the API.  The implementation of the HSA runtime may include kernel-level components (required for some hardware components, ex: AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations). © Copyright 2014 HSA Foundation. All Rights Reserved
  • 69. Component 1 Driver Component N… Vendor m … Component 1 Driver Component N… Vendor 1 Component 1 HSA Runtime Component N… HSA Vendor 1 HSA Finalizer Component 1 HSA Runtime Component N… HSA Vendor m HSA Finalizer INTRODUCTION (2) Programming Model Language Runtime  The software architecture stack without HSA runtime OpenCL App Java App OpenMP App DSL App OpenCL Runtime Java Runtime OpenMP Runtime DSL Runtime … …  The software architecture stack with HSA runtime … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 70. INTRODUCTION (3) Agent program flow, with OpenCL Runtime steps and their HSA Runtime counterparts:
 Start Program
 Platform, Device, and Context Initialization → HSA Runtime Initialization and Topology Discovery
 SVM Allocation and Kernel Arguments Setting → HSA Memory Allocation
 Build Kernel → HSAIL Finalization and Linking
 Command Queue → Enqueue Dispatch Packet
 Resource Deallocation → HSA Runtime Close
 Exit Program
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 71. INTRODUCTION (4)  HSA Platform System Architecture Specification support  Runtime initialization and shutdown  Notifications (synchronous/asynchronous)  Agent information  Signals and synchronization (memory-based)  Queues and Architected dispatch  Memory management  HSAIL support  Finalization, linking, and debugging  Image and Sampler support HSA Runtime HSA Memory Allocation Enqueue Dispatch Packet HSA Runtime Close HSA Runtime Initialization and Topology Discovery HSAIL Finalization and Linking © Copyright 2014 HSA Foundation. All Rights Reserved
  • 73. OUTLINE  Runtime Initialization API  hsa_init  Runtime Shut Down API  hsa_shut_down  Examples © Copyright 2014 HSA Foundation. All Rights Reserved
  • 74. HSA RUNTIME INITIALIZATION  When the API is invoked for the first time in a given process, a runtime instance is created.  A typical runtime instance may contain information of platform, topology, reference count, queues, signals, etc.  The API can be called multiple times by applications  Only a single runtime instance will exist for a given process.  Whenever the API is invoked, the reference count is increased by one. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 75. HSA RUNTIME SHUT DOWN  When the API is invoked, the reference count is decreased by 1.  When the reference count < 1  All the resources associated with the runtime instance (queues, signals, topology information, etc.) are considered invalid and any attempt to reference them in subsequent API calls results in undefined behavior.  The user might call hsa_init to initialize the HSA runtime again.  The HSA runtime might release resources associated with it. © Copyright 2014 HSA Foundation. All Rights Reserved
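The init/shut-down lifetime described on these two slides is classic reference counting. A toy single-threaded C sketch of the behavior (the my_ prefix marks these as illustrative stand-ins for hsa_init/hsa_shut_down, which additionally return hsa_status_t codes and must be thread-safe):

```c
#include <stddef.h>

/* First init call creates the runtime instance; later calls only bump
 * the reference count.  Shut down decrements, and releases the
 * instance when the count reaches zero. */
static int ref_count = 0;
static void *runtime_instance = NULL;  /* stands in for queues, topology, ... */

int my_hsa_init(void) {
    if (ref_count == 0) {
        static int instance_storage;       /* pretend allocation */
        runtime_instance = &instance_storage;
    }
    ref_count++;
    return 0;                              /* success */
}

int my_hsa_shut_down(void) {
    if (ref_count <= 0) return 1;          /* not initialized: error */
    if (--ref_count == 0)
        runtime_instance = NULL;           /* release all resources */
    return 0;
}

int my_hsa_is_live(void) { return runtime_instance != NULL; }
```

After the final shut down, any handle into the old instance is dangling, which is why the spec says subsequent use is undefined behavior until hsa_init is called again.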
  • 76. EXAMPLE – RUNTIME INITIALIZATION (1) Data structure for runtime instance If hsa_init is called more than once, increase the ref_count by 1 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 77. EXAMPLE – RUNTIME INITIALIZATION (2) When hsa_init is called the first time, allocate resources and set the reference count Get the number of HSA agents Initialize agents Create an empty agent list If initialization failed, release resources Create topology table © Copyright 2014 HSA Foundation. All Rights Reserved
  • 78. Agent-0 node_id 0 id 0 type CPU vendor Generic name Generic wavefront_size 0 queue_size 200 group_memory 0 fbarrier_max_count 1 is_pic_supported 0 … … EXAMPLE - RUNTIME INSTANCE (1) Platform Name: Generic Memory node_id 0 id 0 segment_type 111111 address_base 0x0001 size 2048 MB peak_bandwidth 6553.6 mpbs Agent-1 node_id 0 id 0 type GPU vendor Generic name Generic wavefront_size 64 queue_size 200 group_memory 64 fbarrier_max_count 1 is_pic_supported 1 Cache node_id 0 id 0 levels 1 associativity 1 cache size 64KB cache line size 4 is_inclusive 1 Agent: 2 Memory: 1 Cache: 1 … … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 79. Agent-0 node_id = 0 id = 0 agent_type = 1 (CPU) vendor[16] = Generic name[16] = Generic wavefront_size = 0 queue_size =200 group_memory_size_bytes =0 fbarrier_max_count = 1 is_pic_supported = 0 Platform Header File *base_address = 0x00001 Size = 248 system_timestamp_frequency_ mhz = 200 signal_maximum_wait = 1/200 *node_id no_nodes = 1 *agent_list no_agent = 2 *memory_descriptor_list no_memory_descriptor = 1 *cache_descriptor_list no_cache_descriptor = 1 EXAMPLE - RUNTIME INSTANCE (2) … … cache node_id = 0 Id = 0 Levels = 1 * associativity * cache_size * cache_line_size * is_inclusive 1 NULL 64KB NULL 1 NULL 4 NULL Memory node_id = 0 Id = 0 supported_segment_type_mask = 111111 virtual_address_base = 0x0001 size_in_bytes = 2048MB peak_bandwidth_mbps = 6553.6 0 NULL 45 165 NULL 285 NULL 325 NULL Agent-1 node_id = 0 id = 0 agent_type = 2 (GPU) vendor[16] = Generic name[16] = Generic wavefront_size = 64 queue_size =200 group_memory_size_bytes =64 fbarrier_max_count = 1 is_pic_supported = 1 … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 80. EXAMPLE – RUNTIME SHUT DOWN © Copyright 2014 HSA Foundation. All Rights Reserved If ref_count < 1, then free the list; Otherwise decrease the ref_count by 1.
  • 82. OUTLINE  Synchronous Notifications  hsa_status_t  hsa_status_string  Asynchronous Notifications  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  • 83. SYNCHRONOUS NOTIFICATIONS  Notifications (errors, events, etc.) reported by the runtime can be synchronous or asynchronous  The HSA runtime uses the return values of API functions to pass notifications synchronously.  A status code is defined as an enumeration, hsa_status_t, to capture the return value of any API function that has been executed, except accessors/mutators.  The notification is a status code that indicates success or error.  Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.  An error status is assigned a positive integer and its identifier starts with the HSA_STATUS_ERROR prefix.  The status code can help to determine the cause of an unsuccessful execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 84. STATUS CODE QUERY  Query additional information on status code  Parameters  status (input): Status code that the user is seeking more information on  status_string (output): An ISO/IEC 646 encoded English language string that potentially describes the error status © Copyright 2014 HSA Foundation. All Rights Reserved
  • 85. ASYNCHRONOUS NOTIFICATIONS  The runtime passes asynchronous notifications by calling user-defined callbacks.  For instance, queues are a common source of asynchronous events because the tasks queued by an application are asynchronously consumed by the packet processor. Callbacks are associated with queues when they are created. When the runtime detects an error in a queue, it invokes the callback associated with that queue and passes it an error flag (indicating what happened) and a pointer to the erroneous queue.  The HSA runtime does not implement any default callbacks.  Be careful when using blocking functions within the callback implementation: a callback that does not return can leave the runtime state undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
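The callback association described above can be sketched as follows. This is a hedged Python sketch of the pattern only, not the runtime's C signatures; the error-flag string and all names are illustrative:

```python
# Sketch: a callback is bound to a queue at creation time; when the runtime
# detects an error in that queue, it invokes the callback with an error flag
# and a reference to the erroneous queue.
class Queue:
    def __init__(self, error_callback):
        self.error_callback = error_callback

def runtime_detects_error(queue, error_flag):
    # Runtime side: report what happened and which queue it happened on.
    queue.error_callback(error_flag, queue)

seen = []
q = Queue(lambda flag, queue: seen.append(flag))
runtime_detects_error(q, "HSA_QUEUE_ERROR")  # illustrative flag value
assert seen == ["HSA_QUEUE_ERROR"]
```

Per the slide's warning, a real callback should avoid blocking calls and return promptly.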
  • 86. EXAMPLE - CALLBACK Pass the callback function when create queue If the queue is empty, set the event and invoke callback © Copyright 2014 HSA Foundation. All Rights Reserved
  • 88. OUTLINE  Agent information  hsa_node_t  hsa_agent_t  hsa_agent_info_t  hsa_component_feature_t  Agent Information manipulation APIs  hsa_iterate_agents  hsa_agent_get_info  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  • 89. INTRODUCTION  The runtime exposes a list of agents that are available in the system.  An HSA agent is a hardware component that participates in the HSA memory model.  An HSA agent can submit AQL packets for execution.  An HSA agent may be, but is not required to be, an HSA component. It is possible for a system to include HSA agents that are neither an HSA component nor a host CPU.  HSA agents are defined as opaque handles of type hsa_agent_t.  The HSA runtime provides APIs for applications to traverse the list of available agents and query attributes of a particular agent. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 90. AGENT INFORMATION (1)  Opaque agent handle  Opaque NUMA node handle  An HSA memory node is a node that delineates a set of system components (host CPUs and HSA Components) with “local” access to a set of memory resources attached to the node's memory controller and appropriate HSA-compliant access attributes. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 91. AGENT INFORMATION (2)  Component features  An HSA component is a hardware or software component that can be a target of AQL queues and conforms to the HSA memory model.  Values  HSA_COMPONENT_FEATURE_NONE = 0  No component capabilities. The device is an agent, but not a component.  HSA_COMPONENT_FEATURE_BASIC = 1  The component supports the HSAIL instruction set and all the AQL packet types except Agent dispatch.  HSA_COMPONENT_FEATURE_ALL = 2  The component supports the HSAIL instruction set and all the AQL packet types. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 92. AGENT INFORMATION (3)  Agent attributes  Values  HSA_AGENT_INFO_MAX_GRID_DIM  HSA_AGENT_INFO_MAX_WORKGROUP_DIM  HSA_AGENT_INFO_QUEUE_MAX_PACKETS  HSA_AGENT_INFO_CLOCK  HSA_AGENT_INFO_CLOCK_FREQUENCY  HSA_AGENT_INFO_MAX_SIGNAL_WAIT  HSA_AGENT_INFO_NAME  HSA_AGENT_INFO_NODE  HSA_AGENT_INFO_COMPONENT_FEATURES  HSA_AGENT_INFO_VENDOR_NAME  HSA_AGENT_INFO_WAVEFRONT_SIZE  HSA_AGENT_INFO_CACHE_SIZE © Copyright 2014 HSA Foundation. All Rights Reserved
  • 93. AGENT INFORMATION MANIPULATION (1)  Iterate over the available agents, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value.  Parameters  callback (input): Callback to be invoked once per agent  data (input): Application data that is passed to callback on every iteration. Can be NULL. © Copyright 2014 HSA Foundation. All Rights Reserved
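The iteration contract above (invoke the callback per agent; stop on the first non-success status and return it) can be sketched as below. This is a Python stand-in for the C API; the agent dictionaries are invented, and HSA_STATUS_INFO_BREAK is used as an illustrative non-success code for early exit:

```python
# Sketch of the hsa_iterate_agents contract.
HSA_STATUS_SUCCESS = 0
HSA_STATUS_INFO_BREAK = 1  # illustrative non-success status used to stop

def hsa_iterate_agents(agents, callback, data):
    for agent in agents:
        status = callback(agent, data)
        if status != HSA_STATUS_SUCCESS:
            return status  # traversal stops here, status propagates out
    return HSA_STATUS_SUCCESS

# Typical use: find the first GPU agent, then stop iterating.
found = []
def find_gpu(agent, data):
    if agent["type"] == "GPU":
        found.append(agent)
        return HSA_STATUS_INFO_BREAK
    return HSA_STATUS_SUCCESS

agents = [{"type": "CPU"}, {"type": "GPU"}, {"type": "GPU"}]
status = hsa_iterate_agents(agents, find_gpu, None)
assert status == HSA_STATUS_INFO_BREAK and len(found) == 1
```

Returning a non-success code from the callback is how an application expresses "I found what I needed" without walking the whole list.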
  • 94. AGENT INFORMATION MANIPULATION (2)  Get the current value of an attribute for a given agent  Parameters  agent (input): A valid agent  attribute (input): Attribute to query  value (output): Pointer to a user-allocated buffer where to store the value of the attribute. If the buffer passed by the application is not large enough to hold the value of attribute, the behavior is undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 95. EXAMPLE - AGENT ATTRIBUTE QUERY Copy agent attribute information Get the agent handle of Agent 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 97. OUTLINE  Signal  Signal manipulation API  Create/Destroy  Query  Send  Atomic Operations  Signal wait  Get time out  Signal Condition  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  • 98. SIGNAL (1)  HSA agents can communicate with each other by using coherent global memory, or by using signals.  A signal is represented by an opaque signal handle  A signal carries a value, which can be updated or conditionally waited upon via an API call or HSAIL instruction.  The value occupies four or eight bytes depending on the machine model in use. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 99. SIGNAL (2)  Updating the value of a signal is equivalent to sending the signal.  In addition to the update (store) of signals, the API for sending signals must support other atomic operations with specific memory order semantics  Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS  Memory order semantics: Release and Relaxed © Copyright 2014 HSA Foundation. All Rights Reserved
  • 100. SIGNAL CREATE/DESTROY  Create a signal  Parameters  initial_value (input): Initial value of the signal.  signal_handle (output): Signal handle.  Destroy a signal previously created by hsa_signal_create  Parameter  signal_handle (input): Signal handle. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 101. SIGNAL LOAD/STORE  Send and atomically set the value of a signal with release semantics  Atomically read the current signal value with acquire semantics  Atomically read the current signal value with relaxed semantics  Send and atomically set the value of a signal with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 102. SIGNAL ADD/SUBTRACT  Send and atomically increment the value of a signal by a given amount with release semantics  Send and atomically decrement the value of a signal by a given amount with release semantics  Send and atomically increment the value of a signal by a given amount with relaxed semantics  Send and atomically decrement the value of a signal by a given amount with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 103. SIGNAL AND (OR, XOR)/EXCHANGE  Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics  Send and atomically set the value of a signal and return its previous value with release semantics  Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics  Send and atomically set the value of a signal and return its previous value with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
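The exchange and CAS operations above both update the signal and hand back the previous value, which is what makes them usable for hand-offs between agents. The single-threaded Python sketch below models only the value semantics (real signals are atomic across agents and carry memory-order semantics that a sketch like this cannot show); all names are stand-ins:

```python
# Value-semantics sketch of signal exchange / CAS / add.
class Signal:
    def __init__(self, initial_value):
        self.value = initial_value

    def exchange(self, new_value):
        # Set the value, return the previous one.
        prev, self.value = self.value, new_value
        return prev

    def cas(self, expected, new_value):
        # Set the value only if it matches expected; always return previous.
        prev = self.value
        if prev == expected:
            self.value = new_value
        return prev

    def add(self, amount):
        self.value += amount

s = Signal(0)
assert s.exchange(5) == 0                 # returns previous value
assert s.cas(5, 7) == 5 and s.value == 7  # match: updated
assert s.cas(5, 9) == 7 and s.value == 7  # no match: unchanged
s.add(3)
assert s.value == 10
```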
  • 104. SIGNAL WAIT (1)  The application may wait on a signal, with a condition specifying the terms of wait.  Signal wait condition operator  Values  HSA_EQ: The two operands are equal.  HSA_NE: The two operands are not equal.  HSA_LT: The first operand is less than the second operand.  HSA_GTE: The first operand is greater than or equal to the second operand. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 105. SIGNAL WAIT (2)  The wait can be done either in the HSA component via an HSAIL wait instruction or via a runtime API defined here.  Waiting on a signal returns the current value at the opaque signal object;  The wait may have a runtime defined timeout which indicates the maximum amount of time that an implementation can spend waiting.  The signal infrastructure allows for multiple senders/waiters on a single signal.  Wait reads the value, hence acquire synchronizations may be applied. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 106. SIGNAL WAIT (3)  Signal wait  Parameters  signal_handle (input): A signal handle  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
  • 107. SIGNAL WAIT (4)  Signal wait with timeout  Parameters  signal_handle (input): A signal handle  timeout (input): Maximum wait duration (A value of zero indicates no maximum)  long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in a short period of time. The HSA runtime may use this hint to optimize the wait implementation.  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
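The wait-condition semantics above can be sketched as a polling loop over the four condition operators. This is a conceptual Python model: the timeout is counted in polling iterations rather than real time, and the function names are invented, not the runtime API:

```python
# Sketch of signal wait with condition operators and a zero-means-unbounded
# timeout (simplified: timeout is a polling-iteration count here).
HSA_EQ, HSA_NE, HSA_LT, HSA_GTE = range(4)

_CONDS = {
    HSA_EQ:  lambda a, b: a == b,
    HSA_NE:  lambda a, b: a != b,
    HSA_LT:  lambda a, b: a < b,
    HSA_GTE: lambda a, b: a >= b,
}

def signal_wait_timeout(read_value, condition, compare_value, timeout):
    """read_value is a callable standing in for an atomic read of the signal.
    Returns (condition_satisfied, last_observed_value)."""
    iterations = 0
    while True:
        v = read_value()
        if _CONDS[condition](v, compare_value):
            return True, v
        iterations += 1
        if timeout != 0 and iterations >= timeout:
            return False, v  # timed out; caller still gets the current value

values = iter([0, 1, 2])       # signal value as updated by another agent
ok, v = signal_wait_timeout(lambda: next(values), HSA_GTE, 2, timeout=10)
assert ok and v == 2
```

Note that, as the slide says, the wait returns the observed value in both the satisfied and the timed-out case.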
  • 108. EXAMPLE – SIGNAL WAIT (1) thread_1 thread_2 thread_1 is blocked hsa_signal_add_relaxed (value = value + 3) Return signal value Condition satisfied, the execution of thread_1 continues value = 0 Timeline Timeline value = 3 hsa_signal_subtract_relaxed (value = value - 1) value = 2 hsa_signal_wait_timeout_acquire (value == 2) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 109. EXAMPLE – SIGNAL WAIT (2) If signal_handle is invalid, then return a signal-invalid status Compare tmp->value with compare_value to check whether the condition is satisfied If timeout = 0 then return a signal-timeout status Signal wait condition function If the condition is satisfied, then return the signal value and status © Copyright 2014 HSA Foundation. All Rights Reserved
  • 111. OUTLINE  Queues  Queue Types and Structure  HSA runtime API for Queue Manipulations  Architected Queuing Language (AQL) Support  Packet type  Packet header  Examples  Enqueue Packet  Packet Processor © Copyright 2014 HSA Foundation. All Rights Reserved
  • 112. INTRODUCTION (1)  An HSA-compliant platform supports the allocation of multiple user-level command queues.  A user-level command queue is characterized as runtime-allocated, user-accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language (AQL packets).  Queues are allocated by HSA applications through the HSA runtime.  HSA software receives memory-based structures to configure the hardware queues, allowing for efficient software management of the hardware queues of the HSA agents.  This queue memory shall be processed by the HSA Packet Processor as a ring buffer.  Queue structures are read-only.  Writing values directly to a queue structure results in undefined behavior.  But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use runtime APIs to access the doorbell signal or the service queue. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 113.  Two queue types, AQL and Service Queues, are supported  AQL Queue consumes AQL packets that are used to specify the information of kernel functions that will be executed on the HSA component  Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user registered functions that will be executed on the agent (typically, the host CPU) INTRODUCTION (2) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 114. INTRODUCTION (3)  AQL queue structure © Copyright 2014 HSA Foundation. All Rights Reserved
  • 115. INTRODUCTION (4)  In addition to the data held in the queue structure, the queue also defines two properties (readIndex and writeIndex) that mark the location of the “head” and “tail” of the queue.  readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet to be consumed by the packet processor.  writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet slot to be allocated.  Neither index is directly exposed to the user, who can only access them by using dedicated HSA core runtime APIs.  The available index functions differ in the index of interest (read or write), the action to be performed (addition, compare and swap, etc.), and the memory consistency semantics (relaxed, release, etc.). © Copyright 2014 HSA Foundation. All Rights Reserved
  • 116. INTRODUCTION (5)  The read index is automatically advanced when a packet is read by the packet processor.  When the packet processor observes that  The read index matches the write index, the queue can be considered empty;  The write index is greater than or equal to the sum of the read index and the size of the queue, the queue is full.  The doorbell_signal field of a queue contains a signal that is used by the agent to inform the packet processor to process the packets it writes.  The value the doorbell is signaled with is equal to the ID of the packet that is ready to be launched. © Copyright 2014 HSA Foundation. All Rights Reserved
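The empty/full tests above fall straight out of the monotonically increasing 64-bit indices. A minimal sketch (Python, with the ring-buffer size passed explicitly; function names are invented):

```python
# Empty/full tests on a ring buffer addressed by ever-increasing indices.
def queue_empty(read_index, write_index):
    # No unconsumed packets: the processor has caught up with the producers.
    return read_index == write_index

def queue_full(read_index, write_index, size):
    # 'size' outstanding packets already occupy every slot of the ring.
    return write_index >= read_index + size

assert queue_empty(0, 0)
assert not queue_empty(0, 1)
assert queue_full(0, 4, size=4)        # 4 outstanding packets, 4-slot queue
assert not queue_full(1, 4, size=4)    # one slot has been consumed
```

Because the indices never wrap (only their position `index % size` does), the full test is a simple comparison rather than modular arithmetic.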
  • 117. INTRODUCTION (6)  A new task might be consumed by the packet processor even before the doorbell signal has been signaled by the agent.  This is because the packet processor might already be processing some other packets when it observes that there is new work available, so it processes the new packets.  In any case, the agent must ring the doorbell for every batch of packets it writes. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 118. QUEUE CREATE/DESTROY  Create a user mode queue  When a queue is created, the runtime also allocates the packet buffer and the completion signal.  The application should only rely on the status code returned to determine if the queue is valid  Destroy a user mode queue  A queue must not be accessed after being destroyed.  When a queue is destroyed, the state of the AQL packets that have not yet been fully processed becomes undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 119. GET READ/WRITE INDEX  Atomically retrieve read index of a queue with acquire semantics  Atomically retrieve write index of a queue with acquire semantics  Atomically retrieve read index of a queue with relaxed semantics  Atomically retrieve write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 120. SET READ/WRITE INDEX  Atomically set the read index of a queue with release semantics  Atomically set the read index of a queue with relaxed semantics  Atomically set the write index of a queue with release semantics  Atomically set the write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 121. COMPARE AND SWAP WRITE INDEX  Atomically compare and set the write index of a queue with acquire/release/relaxed/acquire-release semantics  Parameters  queue (input): A queue  expected (input): The expected index value  val (input): Value to copy to the write index if expected matches the observed write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  • 122. ADD WRITE INDEX  Atomically increment the write index of a queue by an offset with release/acquire/relaxed/acquire-release semantics  Parameters  queue (input): A queue  val (input): The value to add to the write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
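The fact that the atomic add returns the previous write index is what makes it a slot allocator: each concurrent caller receives a distinct packetID. A Python sketch of that property (a lock stands in for the hardware atomicity; the class name is invented):

```python
import threading

# Sketch of hsa_queue_add_write_index-style slot allocation.
class WriteIndex:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # stand-in for hardware atomicity

    def add(self, val):
        # Atomically add and return the PREVIOUS value: that previous value
        # is the packetID of the slot the caller now owns.
        with self._lock:
            prev = self._value
            self._value += val
            return prev

wi = WriteIndex()
slots = [wi.add(1) for _ in range(3)]
assert slots == [0, 1, 2]        # each caller gets a distinct packetID
assert wi.add(1) % 4 == 3        # ring-buffer position in a 4-slot queue
```

The `% size` step maps the ever-increasing packetID onto a physical slot of the ring buffer.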
  • 123. ARCHITECTED QUEUING LANGUAGE (AQL)  An HSA-compliant system provides a command interface for the dispatch of HSA agent commands.  This command interface is provided by the Architected Queuing Language (AQL).  AQL allows HSA agents to build and enqueue their own command packets, enabling fast and low-power dispatch.  AQL also provides support for HSA component queue submissions  The HSA component kernel can write commands in AQL format. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 124. AQL PACKET (1)  AQL packet format  Values  Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.  Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.  Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.  Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing of subsequent packets. All queues support barrier packets.  Agent dispatch packet (4): Agent dispatch packets contain jobs for an HSA agent and are created by HSA agents. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 125. AQL PACKET (2) HSA signaling object handle used to indicate completion of the job © Copyright 2014 HSA Foundation. All Rights Reserved
  • 126. EXAMPLE - ENQUEUE AQL PACKET (1)  An HSA agent submits a task to a queue by performing the following steps:  Allocate a packet slot (by incrementing the writeIndex)  Initialize the packet and copy it to the queue associated with the Packet Processor  Mark the packet as valid  Notify the Packet Processor of the packet (with the doorbell signal) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 127. EXAMPLE - ENQUEUE AQL PACKET (2) Dispatch Queue Allocate an AQL packet slot Copy the packet into the queue. Note that a lock may be needed here to prevent race conditions in a multithreaded environment WriteIndex ReadIndex Initialize packet Send doorbell signal © Copyright 2014 HSA Foundation. All Rights Reserved
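The four enqueue steps can be sketched end to end. This is a single-threaded Python model of the figure above: the real runtime uses an atomic add for step 1 and memory ordering so that flipping the format field publishes the packet; the format codes come from the AQL packet table, the rest of the names are invented:

```python
# Sketch of the AQL enqueue sequence.
INVALID, DISPATCH = 1, 2  # packet format codes from the AQL packet table

class Queue:
    def __init__(self, size):
        self.size = size
        self.buffer = [{"format": INVALID} for _ in range(size)]
        self.write_index = 0
        self.doorbell = -1

def enqueue(queue, kernel):
    packet_id = queue.write_index          # step 1: allocate a slot
    queue.write_index += 1                 # (atomic add in the real runtime)
    slot = queue.buffer[packet_id % queue.size]
    slot["kernel"] = kernel                # step 2: initialize and copy
    slot["format"] = DISPATCH              # step 3: mark the packet valid LAST
    queue.doorbell = packet_id             # step 4: ring the doorbell
    return packet_id

q = Queue(size=4)
pid = enqueue(q, "vector_add")
assert pid == 0 and q.doorbell == 0
assert q.buffer[0]["format"] == DISPATCH
```

Writing the format field last matters: the packet processor may poll the queue before the doorbell rings, so it must never observe a half-written packet marked as dispatchable.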
  • 128. EXAMPLE - PACKET PROCESSOR WriteIndex ReadIndex Get packet content Check if barrier packet Update readIndex, change packet state to invalid, and send completion signal. Receive doorbell Dispatch Queue If there is any packet in queue, process the packet. © Copyright 2014 HSA Foundation. All Rights Reserved
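The packet-processor side shown above can be sketched in the same style: consume while the queue is non-empty, execute each dispatch, return the slot to the invalid state, and advance the read index. A Python model (kernel "execution" is just recorded; names are invented):

```python
# Sketch of the packet-processor consume loop.
INVALID, DISPATCH = 1, 2

def process_packets(buffer, size, read_index, write_index, completions):
    while read_index != write_index:              # queue not empty
        slot = buffer[read_index % size]
        if slot["format"] == DISPATCH:
            completions.append(slot["kernel"])    # stand-in for launching it
        slot["format"] = INVALID                  # slot reusable by agents
        read_index += 1                           # advance the read index
    return read_index

buf = [{"format": DISPATCH, "kernel": "k0"},
       {"format": DISPATCH, "kernel": "k1"},
       {"format": INVALID}, {"format": INVALID}]
done = []
new_read = process_packets(buf, 4, 0, 2, done)
assert new_read == 2 and done == ["k0", "k1"]
```

In the real flow, finishing a packet also fires its completion signal, which is how the submitting agent learns the job is done.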
  • 130. OUTLINE  Memory registration and deregistration  Memory region and memory segment  APIs for memory region manipulation  APIs for memory registration and deregistration © Copyright 2014 HSA Foundation. All Rights Reserved
  • 131. INTRODUCTION  One of the key features of HSA is its ability to share global pointers between the host application and code executing on the HSA component.  This ability means that an application can directly pass a pointer to memory allocated on the host to a kernel function dispatched to a component without an intermediate copy  When a buffer created in the host is also accessed by a component, programmers are encouraged to register the corresponding address range beforehand.  Registering memory expresses an intention to access (read or write) the passed buffer from a component other than the host. This is a performance hint that allows the runtime implementation to know which buffers will be accessed by some of the components ahead of time.  When an HSA program no longer needs to access a registered buffer in a device, the user should deregister that virtual address range. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 132. MEMORY REGION/SEGMENT  A memory region represents a virtual memory interval that is visible to a particular agent, and contains properties about how memory is accessed or allocated from that agent.  Memory segments  Values  HSA_SEGMENT_GLOBAL = 1  HSA_SEGMENT_PRIVATE = 2  HSA_SEGMENT_GROUP = 4  HSA_SEGMENT_KERNARG = 8  HSA_SEGMENT_READONLY = 16  HSA_SEGMENT_IMAGE = 32 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 133. MEMORY REGION INFORMATION  Attributes of a memory region  Values  HSA_REGION_INFO_BASE_ADDRESS  HSA_REGION_INFO_SIZE  HSA_REGION_INFO_NODE  HSA_REGION_INFO_MAX_ALLOCATION_SIZE  HSA_REGION_INFO_SEGMENT  HSA_REGION_INFO_BANDWIDTH  HSA_REGION_INFO_CACHED © Copyright 2014 HSA Foundation. All Rights Reserved
  • 134. MEMORY REGION MANIPULATION (1)  Get the current value of an attribute of a region  Iterate over the memory regions that are visible to an agent, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 135. MEMORY REGION MANIPULATION (2)  Allocate a block of memory  Deallocate a block of memory previously allocated using hsa_memory_allocate  Copy a block of memory  Copying a number of bytes larger than the size of the memory regions pointed to by dst or src results in undefined behavior. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 136. MEMORY REGISTRATION/DEREGISTRATION  Register memory  Parameters  address (input): A pointer to the base of the memory region to be registered. If a NULL pointer is passed, no operation is performed.  size (input): Requested registration size in bytes. A size of zero is only allowed if address is NULL.  Deregister memory previously registered using hsa_memory_register  Parameter  address (input): A pointer to the base of the memory region to be deregistered. If a NULL pointer is passed, no operation is performed. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 137. EXAMPLE Allocate a memory space Use hsa_region_get_info to get the size in byte of this memory space Register this memory space for a performance hint Finish operation, deregister and free this memory space © Copyright 2014 HSA Foundation. All Rights Reserved
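The allocate/register/deregister/free flow in this example can be sketched with stand-in functions. This is a Python mock of the registration-hint bookkeeping only, not the real allocator or the C signatures; the NULL-pointer no-op behavior mirrors the parameter descriptions above:

```python
# Mock of the registration-hint bookkeeping.
registered = {}

def hsa_memory_register(address, size):
    if address is None:
        return  # NULL pointer: no operation, per the API description
    registered[address] = size   # record the performance hint

def hsa_memory_deregister(address):
    if address is None:
        return
    registered.pop(address, None)

buf = bytearray(64)
base = id(buf)                  # stand-in for the buffer's base address
hsa_memory_register(base, len(buf))   # hint: a component will access this
assert registered[base] == 64
# ... kernel dispatches using buf would go here ...
hsa_memory_deregister(base)           # done: withdraw the hint
assert base not in registered
```

The key point from the slides survives the mock: registration is a hint covering an address range, and deregistration uses the same base address.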
  • 139. SUMMARY  Covered  HSA Core Runtime API (Pre-release 1.0 provisional)  Runtime Initialization and Shutdown (Open/Close)  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Memory Management  Not covered  Extension of Core Runtime  HSAIL Finalization, Linking, and Debugging  Images and Samplers © Copyright 2014 HSA Foundation. All Rights Reserved
  • 140. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 141. HSA MEMORY MODEL BEN GASTER, ENGINEER, QUALCOMM
  • 142. OUTLINE  HSA Memory Model  OpenCL 2.0  Has a memory model too  Obstruction-free bounded deques  An example using the HSA memory model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 143. HSA MEMORY MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
  • 144. TYPES OF MODELS  Shared memory computers and programming languages divide complexity into models: 1. Memory model specifies safety  e.g. what values can a load return?  This is what this section of the tutorial will focus on 2. Execution model specifies liveness  Described in Ben Sander’s tutorial section on HSAIL  e.g. can a work-item prevent others from progressing? 3. Performance model specifies the big picture  e.g. caches or branch divergence  Specific to particular implementations and outside the scope of today’s tutorial © Copyright 2014 HSA Foundation. All Rights Reserved
  • 145. THE PROBLEM  Assume all locations (a, b, …) are initialized to 0  What are the values of $s2 and $s4 after execution? © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; *a = 1; int x = *b; *b = 1; int y = *a; initially *a = 0 && *b = 0
  • 146. THE SOLUTION  The memory model:  Defines the visibility of writes to memory at any given point  Provides us with a set of possible executions © Copyright 2014 HSA Foundation. All Rights Reserved
  • 147. WHAT MAKES A GOOD MEMORY MODEL*  Programmability ; A good model should make it (relatively) easy to write multi- work-item programs. The model should be intuitive to most users, even to those who have not read the details  Performance ; A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options  Portability ; A good model would be adopted widely or at least provide backward compatibility or the ability to translate among models * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 148. SEQUENTIAL CONSISTENCY (SC)*  Axiomatic Definition  A single processor (core) is sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.”  A multiprocessor is sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” © Copyright 2014 HSA Foundation. All Rights Reserved  But HW/Compiler actually implements more relaxed models, e.g. ARMv7 * L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–91, Sept. 1979.
  • 149. SEQUENTIAL CONSISTENCY (SC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 150. BUT WHAT ABOUT ACTUAL HARDWARE  Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform  Many modern processors implement many reordering optimizations  Store buffers (TSO*): work-items can see their own stores early  Reorder buffers (XC*): work-items can see other work-items’ stores early © Copyright 2014 HSA Foundation. All Rights Reserved *TSO – Total Store Order as implemented by Sparc and x86 *XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
  • 151. RELAXED CONSISTENCY (XC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; ld_global_u32 $s2, [&b] ; ld_global_u32 $s4, [&a] ; st_global_u32 $s1, [&a] ; st_global_u32 $s3, [&b] ; $s2 = 0 && $s4 = 0
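The contrast between the two slides, SC permitting only (0,1), (1,0), (1,1) while XC also permits (0,0), can be checked mechanically by enumerating interleavings of the litmus test. A Python sketch (register and location names follow the HSAIL example; the modeling itself is an illustration):

```python
from itertools import permutations

# Each work-item: store 1 to its location, then load the other location.
def run(order):
    mem = {"a": 0, "b": 0}
    regs = {}
    ops = {
        "st_a": lambda: mem.__setitem__("a", 1),   # WI0: st_global [&a]
        "ld_b": lambda: regs.__setitem__("s2", mem["b"]),  # WI0: ld [&b]
        "st_b": lambda: mem.__setitem__("b", 1),   # WI1: st_global [&b]
        "ld_a": lambda: regs.__setitem__("s4", mem["a"]),  # WI1: ld [&a]
    }
    for op in order:
        ops[op]()
    return regs["s2"], regs["s4"]

# SC keeps each work-item's program order: st_a before ld_b, st_b before ld_a.
results = {run(p) for p in permutations(["st_a", "ld_b", "st_b", "ld_a"])
           if p.index("st_a") < p.index("ld_b")
           and p.index("st_b") < p.index("ld_a")}
assert (0, 0) not in results            # unreachable under SC
assert results == {(0, 1), (1, 0), (1, 1)}
```

Dropping the two program-order constraints from the filter admits the load-first interleaving shown on the XC slide, which is exactly the ($s2, $s4) = (0, 0) outcome.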
  • 152. WHAT ARE OUR 3 Ps?  Programmability ; XC makes it really pretty hard for the programmer to reason about what will be visible when  many memory model experts have been known to get it wrong!  Performance ; XC is good for performance: the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements  Portability ; XC is very portable as it places very few constraints © Copyright 2014 HSA Foundation. All Rights Reserved
  • 153. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT  To have their cake and eat it! © Copyright 2014 HSA Foundation. All Rights Reserved HSA Provides: The ability to enable programmers to reason with a (relatively) intuitive model of SC, while still achieving the benefits of XC!
  • 154. SEQUENTIAL CONSISTENCY FOR DRF*  HSA takes the same approach as Java, C++11, and OpenCL 2.0, adopting SC for Data Race Free (DRF)  plus some new capabilities!  (Informally) A data race occurs when two (or more) work-items access the same memory location such that:  At least one of the accesses is a WRITE  There are no intervening synchronization operations  SC for DRF asks:  Programmers to ensure programs are DRF under SC  Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions © Copyright 2014 HSA Foundation. All Rights Reserved *S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
  • 155. HSA SUPPORTS RELEASE CONSISTENCY  HSA’s memory model is based on RCSC:  All atomic_ld_scacq and atomic_st_screl are SC  Means coherence on all atomic_ld_scacq and atomic_st_screl to a single address.  All atomic_ld_scacq and atomic_st_screl are program ordered per work-item (actually: sequenced order by language constraints)  Similar model adopted by ARMv8  HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example. © Copyright 2014 HSA Foundation. All Rights Reserved *Sequential Consistency for Heterogeneous-Race-Free Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
  • 156. MAKING RELAXED CONSISTENCY WORK © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 157. SEQUENTIAL CONSISTENCY FOR DRF  Two memory accesses participate in a data race if they  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  A program is data-race-free if no possible execution results in a data race.  Sequential consistency for data-race-free programs  Avoid everything else HSA: Not good enough! © Copyright 2014 HSA Foundation. All Rights Reserved
  • 158. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS  Remember the HSAIL Execution Model © Copyright 2014 HSA Foundation. All Rights Reserved (Diagram: nested memory scopes, with wave scope inside group scope, inside device scope, inside platform scope)
  • 159. DATA-RACE-FREE IS NOT ENOUGH t1 t2 t3 t4 st_global 1, [&X] atomic_st_global_screl 0, [&flag] atomic_cas_global_scar 1, 0, [&flag] ... atomic_st_global_screl 0, [&flag] atomic_cas_global_scar 1, 0, [&flag] ld_global (??), [&x] group #1-2 group #3-4  Two ordinary memory accesses participate in a data race if they  Access same location  At least one is a store  Can occur simultaneously Not a data race… Is it SC? Well that depends (Diagram: scopes S12 {t1, t2} and S34 {t3, t4} nested inside SGlobal; visibility implied by causality?) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 160. SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE  Two memory accesses participate in a heterogeneous race if  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  Are not synchronized with “enough” scope  A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.  Sequential consistency for heterogeneous-race-free programs  Avoid everything else © Copyright 2014 HSA Foundation. All Rights Reserved
• 161. HSA HETEROGENEOUS RACE FREE  HRF0: Basic Scope Synchronization  “enough” = both threads synchronize using identical scope  Recall the example:
Workgroup #1-2:
    t1: st_global 1, [&X] ; atomic_st_global_screl_wg 0, [&flag]
Workgroup #3-4:
    t4: atomic_cas_global_scar_wg 1, 0, [&flag] ; ld_global (??), [&x]
 Contains a heterogeneous race in HSA
HSA Conclusion: This is bad. Don’t do it. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 162. HOW TO USE HSA WITH SCOPES Use smallest scope that includes all producers/consumers of shared data HSA Scope Selection Guideline Implication: Producers/consumers must be known at synchronization time  Want: For performance, use smallest scope possible  What is safe in HSA? Is this a valid assumption? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 163. REGULAR GPGPU WORKLOADS N M Define Problem Space Partition Hierarchically Communicate Locally N times Communicate Globally M times Well defined (regular) data partitioning + Well defined (regular) synchronization pattern =  Producer/consumers are always known Generally: HSA works well with regular data-parallel workloads © Copyright 2014 HSA Foundation. All Rights Reserved
• 164. IRREGULAR WORKLOADS
Workgroup #1-2:
    t1: st_global 1, [&X] ; atomic_st_global_screl_plat 0, [&flag]
    t2: atomic_cas_global_scar_plat 1, 0, [&flag] ; ... ; atomic_st_global_screl_plat 0, [&flag]
Workgroup #3-4:
    t3: atomic_cas_global_scar_plat 1, 0, [&flag]
    t4: ld $s1, [&x]
 HSA: the earlier example was a race
 Must upgrade wg (workgroup) -> plat (platform)
 HSA memory model says: ld $s1, [&x] will see value (1)!
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 165. OPENCL HAS MEMORY MODELS TOO MAPPING ONTO HSA’S MEMORY MODEL
• 166. OPENCL 1.X MEMORY MODEL MAPPING  It is straightforward to provide a mapping from OpenCL 1.x to the proposed model  OpenCL 1.x atomics are unordered and so map to atomic_op_X  Mapping for fences not shown but straightforward
OpenCL Operation -> HSA Memory Model Operation
    Atomic load -> ld_global_wg / ld_group_wg
    Atomic store -> atomic_st_global_wg / atomic_st_group_wg
    atomic_op -> atomic_op_global_comp / atomic_op_group_wg
    barrier(…) -> fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
• 167. OPENCL 2.0 BACKGROUND  Provisional specification released at SIGGRAPH’13, July 2013.  Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular workloads)  Key features:  Shared virtual memory, including platform atomics  Formally defined memory model based on C11, plus support for scopes  Includes an extended set of C11 atomic operations  Generic address space that subsumes global, local, and private  Device-to-device enqueue  Out-of-order device-side queuing model  Backwards compatible with OpenCL 1.x © Copyright 2014 HSA Foundation. All Rights Reserved
• 168. OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation -> HSA Memory Model Operation
    Load, memory_order_relaxed -> atomic_ld_[global | group]_relaxed_scope
    Store, memory_order_relaxed -> atomic_st_[global | group]_relaxed_scope
    Load, memory_order_acquire -> atomic_ld_[global | group]_scacq_scope
    Load, memory_order_seq_cst -> atomic_ld_[global | group]_scacq_scope
    Store, memory_order_release -> atomic_st_[global | group]_screl_scope
    Store, memory_order_seq_cst -> atomic_st_[global | group]_screl_scope
    memory_order_acq_rel -> atomic_op_[global | group]_scar_scope
    memory_order_seq_cst -> atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
• 169. OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope -> HSA Scope
    memory_scope_sub_group -> _wave
    memory_scope_work_group -> _wg
    memory_scope_device -> _component
    memory_scope_all_svm_devices -> _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
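One row of the mapping table, the memory_order_acq_rel read-modify-write, can be illustrated with plain C++11 atomics. Note this is only an analogy: C++ memory orders carry no scope parameter, so it corresponds to the widest (_platform-like) scope.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> counter{0};

// An acq_rel RMW: the analogue of the table's atomic_op_..._scar_scope row.
void worker() {
    for (int i = 0; i < 1000; ++i)
        counter.fetch_add(1, std::memory_order_acq_rel);
}

int run() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter.load(std::memory_order_seq_cst);
}
```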
  • 170. OBSTRUCTION-FREE BOUNDED DEQUES AN EXAMPLE USING THE HSA MEMORY MODEL
• 171. CONCURRENT DATA-STRUCTURES  Why do we need such a memory model in practice?  One important application of memory consistency is in the development and use of concurrent data-structures  In particular, there is a class of data-structure implementations that provide non-blocking guarantees:  Wait-free: an algorithm is wait-free if every operation has a bound on the number of steps it will take before completing  In practice it is very hard to build efficient data-structures that meet this requirement  Lock-free: an algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress  In practice lock-free algorithms are implemented by work-items cooperating with one another enough to allow progress  Obstruction-free: an algorithm is obstruction-free if a work-item, running in isolation, can make progress © Copyright 2014 HSA Foundation. All Rights Reserved
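A minimal C++ sketch of the lock-free guarantee: in a CAS retry loop, a thread's compare-exchange can fail, but every failure means some other thread's CAS succeeded, so the system as a whole always makes progress.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<long> value{0};

// Lock-free increment: a thread may have its CAS fail (another thread won),
// but each failed CAS implies that some other thread made progress.
void add(long n) {
    long old = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(old, old + n,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // 'old' was refreshed by the failed CAS; retry with the new value
    }
}

long run() {
    std::thread a([]{ for (int i = 0; i < 500; ++i) add(1); });
    std::thread b([]{ for (int i = 0; i < 500; ++i) add(2); });
    a.join();
    b.join();
    return value.load();
}
```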
• 172. BUT WHY NOT JUST USE MUTUAL EXCLUSION? © Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: an emerging compute cluster: Adreno GPU, four Krait CPUs sharing a 2MB L2, Hexagon DSP, per-device MMUs, all attached to a fabric & memory controller)
Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice
• 173. CONCURRENT DATA-STRUCTURES  Emerging heterogeneous compute clusters mean we need:  To adapt existing concurrent data-structures  To develop new concurrent data-structures  Lock-based programming may still be useful, but often these algorithms will need to be lock-free  Of course, this is a key application of the HSA memory model  To showcase this we highlight the development of the well-known HLM obstruction-free deque* © Copyright 2014 HSA Foundation. All Rights Reserved *Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.
• 174. HLM - OBSTRUCTION-FREE DEQUE  Uses a fixed length circular queue  At any given time, reading from left to right, the array will contain:  Zero or more left-null (LN) values  Zero or more dummy-null (DN) values  Zero or more right-null (RN) values  At all times there must be:  At least two different null values  At least one LN or DN, and at least one DN or RN  Memory consistency is required to allow multiple producers and multiple consumers, potentially operating in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents) © Copyright 2014 HSA Foundation. All Rights Reserved
• 175. HLM - OBSTRUCTION-FREE DEQUE © Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: circular array containing LN values on the left, stored values v in the middle, and RN values on the right, with the left and right hint indices marking the boundaries)
Key: LN – left null value, RN – right null value, v – value, left – left hint index, right – right hint index
• 176. C REPRESENTATION OF DEQUE

struct node {
    uint64_t type    : 2;  // null type (LN, RN, DN)
    uint64_t counter : 8;  // version counter to avoid ABA
    uint64_t value   : 54; // index value stored in queue
};

struct queue {
    unsigned int size; // size of bounded buffer
    node * array;      // backing store for deque itself
};

© Copyright 2014 HSA Foundation. All Rights Reserved
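The same 64-bit layout can be sketched with explicit pack/unpack helpers, using the shift and mask constants that the HSAIL listings on the later slides rely on (type in bits 62-63, counter in bits 54-61, value in bits 0-53). The LN/DN/RN numeric encodings here are assumptions for illustration; the slides do not specify them.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encodings, illustrative only (not specified by the slides).
enum NullType : uint64_t { LN = 0, DN = 1, RN = 2 };

// Node layout: [63:62] type, [61:54] counter, [53:0] value. This matches
// the HSAIL's shr 62 for type, and the 0x3FC0000000000000 (counter) and
// 0x3FFFFFFFFFFFFF (value) masks.
uint64_t pack(uint64_t type, uint64_t counter, uint64_t value) {
    return (type << 62) | ((counter & 0xFFull) << 54)
         | (value & 0x3FFFFFFFFFFFFFull);
}

uint64_t node_type(uint64_t n)    { return n >> 62; }
uint64_t node_counter(uint64_t n) { return (n & 0x3FC0000000000000ull) >> 54; }
uint64_t node_value(uint64_t n)   { return n & 0x3FFFFFFFFFFFFFull; }
```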
• 177. HSAIL REPRESENTATION  Allocate a deque in global memory using HSAIL

@deque_instance:
    align 64 global_u32 &size;
    align 8 global_u64 &array;

© Copyright 2014 HSA Foundation. All Rights Reserved
• 178. ORACLE  Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);  Which, given a deque:  returns (%k) the position of the left-most RN  atomic_ld_global_scacq is used to read each node from the array  makes one if necessary (i.e. if there are only LN or DN values)  atomic_cas_global_scar is required to make a new RN  returns (%left) the left node (i.e. the value to the left of the left-most RN position)  returns (%right) the right node (i.e. the value at position (%k)) © Copyright 2014 HSA Foundation. All Rights Reserved
• 179. RIGHT POP

function &right_pop (arg_u32 %err, arg_u64 %result) (arg_u64 %deque)
{
    // load queue address
    ld_arg_u64 $d0, [%deque];

@loop_forever:
    // set up and call the right oracle to get the next RN
    {
        arg_u32 %k;
        arg_u64 %current;
        arg_u64 %next;
        call &rcheck_oracle (%k, %current, %next) (%deque);
        ld_arg_u32 $s0, [%k];
        ld_arg_u64 $d1, [%current];
        ld_arg_u64 $d2, [%next];
    }

    // current.type($d5)
    shr_u64 $d5, $d1, 62;
    // current.counter($d6)
    and_u64 $d6, $d1, 0x3FC0000000000000;
    shr_u64 $d6, $d6, 54;
    // current.value($d7)
    and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
    // next.counter($d8)
    and_u64 $d8, $d2, 0x3FC0000000000000;
    shr_u64 $d8, $d8, 54;

    // … empty test and CAS steps shown on the following slides …
    brn @loop_forever;
}

© Copyright 2014 HSA Foundation. All Rights Reserved
• 180. RIGHT POP – TEST FOR EMPTY

    // empty if current.type($d5) == LN || current.type($d5) == DN,
    // and the node re-read from the array is unchanged
    cmp_neq_b1_u64 $c0, $d5, LN;
    cmp_neq_b1_u64 $c1, $d5, DN;
    and_b1 $c0, $c0, $c1;   // type is neither LN nor DN
    cbr $c0, @not_empty;
    // current node address: %deque($d0) + (%k($s0) - 1) * 16
    add_u32 $s1, $s0, -1;
    mul_u32 $s1, $s1, 16;
    cvt_u64_u32 $d3, $s1;
    add_u64 $d3, $d0, $d3;
    atomic_ld_global_scacq_u64 $d4, [$d3];
    cmp_neq_b1_u64 $c0, $d4, $d1;
    cbr $c0, @not_empty;
    st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
    ret;
@not_empty:

© Copyright 2014 HSA Foundation. All Rights Reserved
• 181. RIGHT POP – TRY READ/REMOVE NODE

    // $d9 = node(RN, next.cnt+1, 0)
    add_u64 $d8, $d8, 1;
    shl_u64 $d8, $d8, 54;
    shl_u64 $d9, RN, 62;
    or_u64  $d9, $d9, $d8;
    // address of deq+k: %deque($d0) + %k($s0) * 16
    mul_u32 $s2, $s0, 16;
    cvt_u64_u32 $d10, $s2;
    add_u64 $d10, $d0, $d10;
    // cas(deq+k, next, node(RN, next.cnt+1, 0))
    atomic_cas_global_scar_u64 $d11, [$d10], $d2, $d9;
    cmp_neq_b1_u64 $c0, $d11, $d2;
    cbr $c0, @cas_failed;
    // $d9 = node(RN, current.cnt+1, 0)
    add_u64 $d6, $d6, 1;
    shl_u64 $d6, $d6, 54;
    shl_u64 $d9, RN, 62;
    or_u64  $d9, $d9, $d6;
    // address of deq+(k-1): one node (16 bytes) below deq+k
    sub_u64 $d12, $d10, 16;
    // cas(deq+(k-1), current, node(RN, current.cnt+1, 0))
    atomic_cas_global_scar_u64 $d11, [$d12], $d1, $d9;
    cmp_neq_b1_u64 $c0, $d11, $d1;
    cbr $c0, @cas_failed;
    st_arg_u32 SUCCESS, [%err];
    st_arg_u64 $d7, [%result]; // return the popped value
    ret;
@cas_failed:
    // loop back around and try again

© Copyright 2014 HSA Foundation. All Rights Reserved
• 182. TAKE AWAYS  HSA provides a powerful and modern memory model  Based on the well-known SC for DRF  Defined as release consistency  Extended with scopes as defined by HRF  OpenCL 2.0 introduces a new memory model  Also based on SC for DRF  Also defined in terms of release consistency  Also extended with scopes as defined in HRF  Has a well-defined mapping to HSA  Concurrent algorithm development for emerging heterogeneous compute clusters can benefit from the HSA and OpenCL 2.0 memory models © Copyright 2014 HSA Foundation. All Rights Reserved
  • 183. HSA QUEUING MODEL HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
  • 185. MOTIVATION (TODAY’S PICTURE) © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
• 187. REQUIREMENTS  Three key technologies are used to build the user mode queueing mechanism  Shared Virtual Memory  System Coherency  Signaling  AQL (Architected Queueing Language) enables any agent to enqueue tasks © Copyright 2014 HSA Foundation. All Rights Reserved
  • 189. PHYSICAL MEMORY SHARED VIRTUAL MEMORY (TODAY)  Multiple Virtual memory address spaces © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY1 PHYSICAL MEMORY VA1->PA1 VA2->PA1 VIRTUAL MEMORY2
  • 190. PHYSICAL MEMORY SHARED VIRTUAL MEMORY (HSA)  Common Virtual Memory for all HSA agents © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY PHYSICAL MEMORY VA->PA VA->PA
  • 191. SHARED VIRTUAL MEMORY  Advantages  No mapping tricks, no copying back-and-forth between different PA addresses  Send pointers (not data) back and forth between HSA agents.  Implications  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).  Common mechanisms for address translation (and servicing address translation faults)  Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 192. SHARED VIRTUAL MEMORY  Specifics  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.  HSA agents may reserve VA ranges for internal use via system software.  All HSA agents other than the host unit must use the lowest privilege level  If present, read/write access flags for page tables must be maintained by all agents.  Read/write permissions apply to all HSA agents, equally. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 193. GETTING THERE … © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 195. CACHE COHERENCY DOMAINS (1/3)  Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 196. CACHE COHERENCY DOMAINS (2/3)  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Implications  Hardware coherency support between all HSA agents  Can take many forms  Stand alone Snoop Filters / Directories  Combined L3/Filters  Snoop-based systems (no filter)  Etc … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 197. CACHE COHERENCY DOMAINS (3/3)  Specifics  No requirement for instruction memory accesses to be coherent  Only applies to the Primary memory type.  No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes  Read-only image data is required to remain static during the execution of an HSA kernel.  No double mapping (via different attributes) in order to modify. Must remain static © Copyright 2014 HSA Foundation. All Rights Reserved
  • 198. GETTING CLOSER … © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
• 200. SIGNALING (1/3)  HSA agents support the ability to use signaling objects  All creation/destruction of signaling objects occurs via HSA runtime APIs  From an HSA agent you can directly access signaling objects:  Signal a signal object (this will wake up HSA agents waiting upon the object)  Query the current object value  Wait on the current object (various conditions supported) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 201. SIGNALING (2/3)  Advantages  Enables asynchronous events between HSA agents, without involving the kernel  Common idiom for work offload  Low power waiting  Implications  Runtime support required  Commonly implemented on top of cache coherency flows © Copyright 2014 HSA Foundation. All Rights Reserved
  • 202. SIGNALING (3/3)  Specifics  Only supported within a PASID  Supported wait conditions are =, !=, < and >=  Wait operations may return sporadically (no guarantee against false positives)  Programmer must test.  Wait operations have a maximum duration before returning.  The HSAIL atomic operations are supported on signal objects.  Signal objects are opaque  Must use dedicated HSAIL/HSA runtime operations © Copyright 2014 HSA Foundation. All Rights Reserved
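Because waits may return sporadically and have a maximum duration, the condition must always be re-tested after a wait returns. A minimal C++ sketch of that discipline, using a std::atomic as a stand-in for the opaque signal object (the real object is accessible only through HSA runtime/HSAIL operations; try_wait here is a hypothetical wrapper that simply polls):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for an HSA signal object.
std::atomic<long> signal_value{0};

// Mimics a wait primitive that may wake early; here it just polls once.
long try_wait() { return signal_value.load(std::memory_order_acquire); }

// Correct usage: loop and re-test the condition after every return, since
// the spec gives no guarantee against false positives or timeouts.
long wait_until_ge(long threshold) {
    long v;
    while ((v = try_wait()) < threshold) { /* wake-up may be spurious */ }
    return v;
}

long run() {
    std::thread t([]{ signal_value.store(5, std::memory_order_release); });
    long v = wait_until_ge(5);
    t.join();
    return v;
}
```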
  • 203. ALMOST THERE… © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 205. ONE BLOCK LEFT © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 206. USER MODE QUEUEING (1/3)  User mode Queueing  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.  Queues are created/destroyed via calls to the HSA runtime.  One (or many) agents enqueue packets, a single agent dequeues packets.  Requires coherency and shared virtual memory. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 207. USER MODE QUEUEING (2/3)  Advantages  Avoid involving the kernel/driver when dispatching work for an Agent.  Lower latency job dispatch enables finer granularity of offload  Standard memory protection mechanisms may be used to protect communication with the consuming agent.  Implications  Packet formats/fields are Architected – standard across vendors!  Guaranteed backward compatibility  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signaling)  More on this later…… © Copyright 2014 HSA Foundation. All Rights Reserved
  • 208. SUCCESS! © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 209. SUCCESS! © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Queue Job Start Job Finish Job
  • 211. ARCHITECTED QUEUEING LANGUAGE  HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer  Single producer variant defined with some optimizations possible.  Queues consist of storage, read/write indices, ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected © Copyright 2014 HSA Foundation. All Rights Reserved Producer Producer Consumer Read Index Write Index Storage in coherent, shared memory Packets
  • 212. ARCHITECTED QUEUING LANGUAGE  Packets are read and dispatched for execution from the queue in order, but may complete in any order.  There is no guarantee that more than one packet will be processed in parallel at a time  There may be many queues. A single agent may also consume from several queues.  Any HSA agent may enqueue packets  CPUs  GPUs  Other accelerators © Copyright 2014 HSA Foundation. All Rights Reserved
• 213. QUEUE STRUCTURE © Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes) | Size (bytes) | Field          | Notes
0              | 4            | queueType      | Differentiates different queues
4              | 4            | queueFeatures  | Indicates supported features
8              | 8            | baseAddress    | Pointer to packet array
16             | 8            | doorbellSignal | HSA signaling object handle
24             | 4            | size           | Packet array cardinality
28             | 4            | queueId        | Unique per process
32             | 8            | serviceQueue   | Queue for callback services
intrinsic      | 8            | writeIndex     | Packet array write index
intrinsic      | 8            | readIndex      | Packet array read index
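A C++ sketch of the structure from the table, with the field offsets checked at the positions the table gives. The writeIndex/readIndex properties are intrinsic (accessible only via runtime/HSAIL operations), so they do not appear in the struct; the type names here are illustrative, not the HSA runtime's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative layout of the user-mode queue structure. baseAddress and
// the handles are modeled as raw 64-bit values for layout purposes.
struct hsa_queue_layout {
    uint32_t queueType;      // offset 0: differentiates queues
    uint32_t queueFeatures;  // offset 4: supported features bitfield
    uint64_t baseAddress;    // offset 8: pointer to packet array
    uint64_t doorbellSignal; // offset 16: HSA signaling object handle
    uint32_t size;           // offset 24: packet array cardinality (power of 2)
    uint32_t queueId;        // offset 28: unique per process
    uint64_t serviceQueue;   // offset 32: queue for callback services
};
```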
  • 214. QUEUE VARIANTS  queueType and queueFeatures together define queue semantics and capabilities  Two queueType values defined, other values reserved:  MULTI – queue supports multiple producers  SINGLE – queue supports single producer  queueFeatures is a bitfield indicating capabilities  DISPATCH (bit 0) if set then queue supports DISPATCH packets  AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets  All other bits are reserved and must be 0 © Copyright 2014 HSA Foundation. All Rights Reserved
• 215. QUEUE STRUCTURE DETAILS  Queue doorbells are HSA signaling objects with restrictions  Created as part of the queue – lifetime tied to queue object  Atomic read-modify-write not allowed  size field value must be a power of 2  serviceQueue can be used by an HSA kernel for callback services  Provided by application when queue is created  Can be mapped to the HSA runtime provided serviceQueue, an application serviced queue, or NULL if no serviceQueue is required © Copyright 2014 HSA Foundation. All Rights Reserved
  • 216. READ/WRITE INDICES  readIndex and writeIndex properties are part of the queue, but not visible in the queue structure  Accessed through HSA runtime API and HSAIL operations  HSA runtime/HSAIL operations defined to  Read readIndex or writeIndex property  Write readIndex or writeIndex property  Add constant to writeIndex property (returns previous writeIndex value)  CAS on writeIndex property  readIndex & writeIndex operations treated as atomic in memory model  relaxed, acquire, release and acquire-release variants defined as applicable  readIndex and writeIndex never wrap  PacketID – the index of a particular packet  Uniquely identifies each packet of a queue © Copyright 2014 HSA Foundation. All Rights Reserved
  • 217. PACKET ENQUEUE  Packet enqueue follows a few simple steps:  Reserve space  Multiple packets can be reserved at a time  Write packet to queue  Mark packet as valid  Producer no longer allowed to modify packet  Consumer is allowed to start processing packet  Notify consumer of packet through the queue doorbell  Multiple packets can be notified at a time  Doorbell signal should be signaled with last packetID notified  On small machine model the lower 32 bits of the packetID are used © Copyright 2014 HSA Foundation. All Rights Reserved
  • 218. PACKET RESERVATION  Two flows envisaged  Atomic add writeIndex with number of packets to reserve  Producer must wait until packetID < readIndex + size before writing to packet  Queue can be sized so that wait is unlikely (or impossible)  Suitable when many threads use one queue  Check queue not full first, then use atomic CAS to update writeIndex  Can be inefficient if many threads use the same queue  Allows different failure model if queue is congested © Copyright 2014 HSA Foundation. All Rights Reserved
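The second flow (check for full, then CAS the writeIndex) can be sketched in C++ with std::atomic stand-ins for the queue's intrinsic index properties (the real operations are HSA runtime/HSAIL intrinsics; names and the -1 failure code here are illustrative). Unlike the atomic-add flow, a congested queue makes the reservation fail instead of waiting.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-ins for the queue's intrinsic index properties.
std::atomic<uint64_t> writeIndex{0};
std::atomic<uint64_t> readIndex{0};
const uint64_t queueSize = 8; // packet array cardinality, a power of 2

// Check-then-CAS reservation: returns the reserved packetID, or -1 if the
// queue is full or another producer won the CAS (caller may retry).
int64_t try_reserve() {
    uint64_t wr = writeIndex.load(std::memory_order_relaxed);
    if (wr >= readIndex.load(std::memory_order_relaxed) + queueSize)
        return -1; // queue full: fail instead of waiting
    if (writeIndex.compare_exchange_strong(wr, wr + 1,
                                           std::memory_order_relaxed))
        return (int64_t)wr; // reserved packetID == previous writeIndex
    return -1; // lost the race to another producer
}
```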
  • 219. QUEUE OPTIMIZATIONS  Queue behavior is loosely defined to allow optimizations  Some potential producer behavior optimizations:  Keep local copy of readIndex, update when required  For single producer queues:  Keep local copy of writeIndex  Use store operation rather than add/cas atomic to update writeIndex  Some potential consumer behavior optimizations:  Use packet format field to determine whether a packet has been submitted rather than writeIndex property  Speculatively read multiple packets from the queue  Not update readIndex for each packet processed  Rely on value used for doorbellSignal to notify new packets  Especially useful for single producer queues © Copyright 2014 HSA Foundation. All Rights Reserved
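The single-producer optimization above (local writeIndex copy, plain store instead of an atomic add/cas) can be sketched as follows; the function name and the omitted packet-write step are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-producer optimization: the producer is the only writer of
// writeIndex, so it can keep a private shadow copy and publish with a
// store rather than a read-modify-write.
std::atomic<uint64_t> writeIndex{0};
uint64_t localWriteIndex = 0; // producer-private shadow copy

uint64_t enqueue_one() {
    uint64_t packetID = localWriteIndex++;       // reserve locally, no RMW
    // ... write the packet and set its format field here ...
    writeIndex.store(localWriteIndex,
                     std::memory_order_release); // publish with a plain store
    return packetID;
}
```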
• 220. POTENTIAL MULTI-PRODUCER ALGORITHM

// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);

// Wait until the queue is no longer full
uint64_t rdIdx;
do {
    rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));

// Calculate index
uint32_t arrayIdx = packetID & (q->size - 1);

// Copy over the packet; the format field is still INVALID
q->baseAddress[arrayIdx] = pkt;

// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);

// Ring doorbell; relaxed is sufficient after the release store above
// (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);

© Copyright 2014 HSA Foundation. All Rights Reserved
• 221. POTENTIAL CONSUMER ALGORITHM

// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);

// Calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);

// Spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }

// Copy over the packet
pkt = q->baseAddress[arrayIdx];

// Set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);

// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);

// Now process <pkt>!

© Copyright 2014 HSA Foundation. All Rights Reserved
• 223. PACKETS © Copyright 2014 HSA Foundation. All Rights Reserved  Packets come in three main types with architected layouts:  Dispatch  Specifies kernel execution over a grid  Agent Dispatch  Specifies a single function to perform with a set of parameters  Barrier  Used for task dependencies  In addition, Always Reserved & Invalid packets do not contain any valid tasks and are not processed (the queue will not progress past them)
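The producer and consumer listings above access q->baseAddress[i].hdr.format; a hedged C++ sketch of such a header is below. The numeric format values and remaining fields are assumptions for illustration, since this section does not define the full architected packet layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed format encodings, illustrative only.
enum Format : uint8_t { INVALID = 0, DISPATCH = 1, AGENT_DISPATCH = 2,
                        BARRIER = 3 };

// Minimal header sketch: the format field doubles as the packet's valid
// flag, so it is the one field accessed atomically by producer/consumer.
struct packet_header {
    std::atomic<uint8_t> format{INVALID};
    // ... remaining architected header fields omitted ...
};

bool is_ready(const packet_header& h) {
    return h.format.load(std::memory_order_acquire) != INVALID;
}

// A packet is ignored while INVALID and processed once its format is set
// with release semantics by the producer.
bool demo() {
    packet_header h;
    bool before = is_ready(h);
    h.format.store(DISPATCH, std::memory_order_release);
    return !before && is_ready(h);
}
```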