ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial



  1. 1. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): ARCHITECTURE AND ALGORITHMS ISCA TUTORIAL - JUNE 15, 2014
  2. 2. TOPICS  Introduction  HSAIL Virtual Parallel ISA  HSA Runtime  HSA Memory Model  HSA Queuing Model  HSA Applications  HSA Compilation © Copyright 2014 HSA Foundation. All Rights Reserved The HSA Specifications are not yet at 1.0 final, so all content is subject to change
  3. 3. SCHEDULE © Copyright 2014 HSA Foundation. All Rights Reserved
     8:45am   Introduction to HSA (Phil Rogers, AMD)
     9:30am   HSAIL Virtual Parallel ISA (Ben Sander, AMD)
     10:30am  Break
     10:50am  HSA Runtime (Yeh-Ching Chung, National Tsing Hua University)
     12 noon  Lunch
     1pm      HSA Memory Model (Benedict Gaster, Qualcomm)
     2pm      HSA Queuing Model (Hakan Persson, ARM)
     3pm      Break
     3:15pm   HSA Compilation Technology (Wen Mei Hwu, University of Illinois)
     4pm      HSA Application Programming (Wen Mei Hwu, University of Illinois)
     4:45pm   Questions (All presenters)
  4. 4. INTRODUCTION PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
  5. 5. HSA FOUNDATION  Founded in June 2012  Developing a new platform for heterogeneous systems  www.hsafoundation.com  Specifications under development in working groups to define the platform  Membership consists of 43 companies and 16 universities  Adding 1-2 new members each month © Copyright 2014 HSA Foundation. All Rights Reserved
  6. 6. DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING © Copyright 2014 HSA Foundation. All Rights Reserved Founders Promoters Supporters Contributors Academic
  7. 7. MEMBERSHIP TABLE
     Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
     Promoter (1): LG Electronics
     Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
     Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
     Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
     © Copyright 2014 HSA Foundation. All Rights Reserved
  8. 8. HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER  Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms  SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory  How do we make them even better?  Easier to program  Easier to optimize  Higher performance  Lower power  HSA unites accelerators architecturally  Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU © Copyright 2014 HSA Foundation. All Rights Reserved
  9. 9. INFLECTIONS IN PROCESSOR DESIGN © Copyright 2014 HSA Foundation. All Rights Reserved
     Single-Core Era (single-thread performance over time): enabled by Moore's Law and voltage scaling; constrained by power and complexity. Programming: Assembly, then C/C++, then Java.
     Multi-Core Era (throughput performance vs. number of processors): enabled by Moore's Law and SMP architecture; constrained by power, parallel software, and scalability. Programming: pthreads, then OpenMP/TBB.
     Heterogeneous Systems Era (modern application performance through data-parallel exploitation), where we are now: enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Programming: shaders, then CUDA/OpenCL, then C++ and Java.
  10. 10. LEGACY GPU COMPUTE PCIe ™ System Memory (Coherent) CPU CPU CPU . . . CU CU CU CU CU CU CU CU GPU Memory (Non-Coherent) GPU  Multiple memory pools  Multiple address spaces  High overhead dispatch  Data copies across PCIe  New languages for programming  Dual source development  Proprietary environments  Expert programmers only  Need to fix all of this to unleash our programmers The limiters © Copyright 2014 HSA Foundation. All Rights Reserved
  11. 11. EXISTING APUS AND SOCS CPU 1 CPU N… CPU 2 Physical Integration CU 1 … CU 2 CU 3 CU M-2 CU M-1 CU M System Memory (Coherent) GPU Memory (Non-Coherent) GPU  Physical Integration  Good first step  Some copies gone  Two memory pools remain  Still queue through the OS  Still requires expert programmers  Need to finish the job
  12. 12. AN HSA ENABLED SOC  Unified Coherent Memory enables data sharing across all processors  Processors architected to operate cooperatively  Designed to enable the application to run on different processors at different times Unified Coherent Memory CPU 1 CPU N… CPU 2 CU 1 CU 2 CU 3 CU M-2 CU M-1 CU M…
  13. 13. PILLARS OF HSA*  Unified addressing across all processors  Operation into pageable system memory  Full memory coherency  User mode dispatch  Architected queuing language  Scheduling and context switching  HSA Intermediate Language (HSAIL)  High level language support for GPU compute processors © Copyright 2014 HSA Foundation. All Rights Reserved * All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
  14. 14. HSA SPECIFICATIONS  HSA System Architecture Specification  Version 1.0 Provisional, Released April 2014  Defines discovery, memory model, queue management, atomics, etc  HSA Programmers Reference Specification  Version 1.0 Provisional, Released June 2014  Defines the HSAIL language and object format  HSA Runtime Software Specification  Version 1.0 Provisional, expected to be released in July 2014  Defines the APIs through which an HSA application uses the platform  All released specifications can be found at the HSA Foundation web site:  www.hsafoundation.com/standards © Copyright 2014 HSA Foundation. All Rights Reserved
  15. 15. HSA - AN OPEN PLATFORM  Open Architecture, membership open to all  HSA Programmers Reference Manual  HSA System Architecture  HSA Runtime  Delivered via royalty free standards  Royalty Free IP, Specifications and APIs  ISA agnostic for both CPU and GPU  Membership from all areas of computing  Hardware companies  Operating Systems  Tools and Middleware  Applications  Universities © Copyright 2014 HSA Foundation. All Rights Reserved
  16. 16. HSA INTERMEDIATE LAYER — HSAIL  HSAIL is a virtual ISA for parallel programs  Finalized to ISA by a JIT compiler or “Finalizer”  ISA independent by design for CPU & GPU  Explicitly parallel  Designed for data parallel programming  Support for exceptions, virtual functions, and other high level language features  Lower level than OpenCL SPIR  Fits naturally in the OpenCL compilation stack  Suitable to support additional high level languages and programming models:  Java, C++, OpenMP, Python, etc © Copyright 2014 HSA Foundation. All Rights Reserved
  17. 17. HSA MEMORY MODEL  Defines visibility ordering between all threads in the HSA System  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models  Relaxed consistency memory model for parallel compute performance  Visibility controlled by:  Load.Acquire  Store.Release  Fences © Copyright 2014 HSA Foundation. All Rights Reserved
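Since the model is designed to be compatible with C++11, the Load.Acquire/Store.Release visibility controls above map directly onto C11/C++11 atomics. A minimal sketch (a hypothetical producer/consumer pair, not code from the tutorial) of the pattern HSAIL's acquire/release operations express:

```c
#include <stdatomic.h>

static atomic_int flag = 0;  /* synchronization variable */
static int payload = 0;      /* ordinary data guarded by the flag */

/* Producer: plain store to the payload, then a release store to the
 * flag (analogous to an HSAIL Store.Release). */
void producer(void) {
    payload = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Consumer: an acquire load of the flag (analogous to an HSAIL
 * Load.Acquire) makes the earlier payload store visible once the
 * flag reads 1. */
int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* spin until the producer publishes */
    return payload;
}
```

Under a relaxed consistency model, it is exactly this acquire/release pairing (or a fence) that makes the payload write visible; a plain load of `flag` would not be enough.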
  18. 18. HSA QUEUING MODEL  User mode queuing for low latency dispatch  Application dispatches directly  No OS or driver required in the dispatch path  Architected Queuing Layer  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Allows for dispatch to queue from any agent  CPU or GPU  GPU self enqueue enables lots of solutions  Recursion  Tree traversal  Wavefront reforming © Copyright 2014 HSA Foundation. All Rights Reserved
  19. 19. HSA SOFTWARE
  20. 20. Hardware - APUs, CPUs, GPUs Driver Stack Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers Graphics Kernel Mode Driver Apps Apps Apps Apps Apps Apps HSA Software Stack Task Queuing Libraries HSA Domain Libraries, OpenCL ™ 2.x Runtime HSA Kernel Mode Driver HSA Runtime HSA JIT Apps Apps Apps Apps Apps Apps User mode component Kernel mode component Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK © Copyright 2014 HSA Foundation. All Rights Reserved
  21. 21. OPENCL™ AND HSA  HSA is an optimized platform architecture for OpenCL  Not an alternative to OpenCL  OpenCL on HSA will benefit from  Avoidance of wasteful copies  Low latency dispatch  Improved memory model  Pointers shared between CPU and GPU  OpenCL 2.0 leverages HSA Features  Shared Virtual Memory  Platform Atomics © Copyright 2014 HSA Foundation. All Rights Reserved
  22. 22. ADDITIONAL LANGUAGES ON HSA  In development © Copyright 2014 HSA Foundation. All Rights Reserved
      Java (Sumatra): OpenJDK; http://openjdk.java.net/projects/sumatra/
      LLVM: code generator for HSAIL
      C++ AMP: Multicoreware; https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
      OpenMP (GCC): AMD, Suse; https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
  23. 23. SUMATRA PROJECT OVERVIEW  AMD/Oracle sponsored Open Source (OpenJDK) project  Targeted at Java 9 (2015 release)  Allows developers to efficiently represent data parallel algorithms in Java  Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to enable both CPU or GPU computing  At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch ‘selected’ constructs to available HSA enabled devices  Developers of Java libraries are already refactoring their library code to use these same constructs  So developers using existing libraries should see GPU acceleration without any code changes  http://openjdk.java.net/projects/sumatra/  https://wikis.oracle.com/display/HotSpotInternals/Sumatra  http://mail.openjdk.java.net/pipermail/sumatra-dev/ © Copyright 2014 HSA Foundation. All Rights Reserved Application.java Java Compiler GPUCPU Sumatra Enabled JVM Application GPU ISA Lambda/Stream API CPU ISA Application.class Development Runtime HSA Finalizer
  24. 24. HSA OPEN SOURCE SOFTWARE  HSA will feature an open source linux execution and compilation stack  Allows a single shared implementation for many components  Enables university research and collaboration in all areas  Because it’s the right thing to do © Copyright 2014 HSA Foundation. All Rights Reserved
      HSA Bolt Library (Common): enable understanding and debug
      HSAIL Code Generator (Common): enable research
      LLVM Contributions (Common): industry and academic collaboration
      HSAIL Assembler (Common): enable understanding and debug
      HSA Runtime (Common): standardize on a single runtime
      HSA Finalizer (IHV): enable research and debug
      HSA Kernel Driver (IHV): for inclusion in linux distros
  25. 25. WORKLOAD EXAMPLE SUFFIX ARRAY CONSTRUCTION CLOUD SERVER WORKLOAD
  26. 26. SUFFIX ARRAYS  Suffix Arrays are a fundamental data structure  Designed for efficient searching of a large text  Quickly locate every occurrence of a substring S in a text T  Suffix Arrays are used to accelerate in-memory cloud workloads  Full text index search  Lossless data compression  Bio-informatics © Copyright 2014 HSA Foundation. All Rights Reserved
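As a concrete illustration of the data structure (a naive serial sketch, not the tutorial's parallel skew algorithm): a suffix array stores the starting positions of all suffixes of T in lexicographic order, so every occurrence of a substring S occupies a contiguous range that binary search can locate.

```c
#include <stdlib.h>
#include <string.h>

/* Naive serial suffix array construction: sort the suffix start
 * positions lexicographically with qsort. O(n^2 log n) worst case;
 * the skew algorithm on the next slide does this work in parallel. */
static const char *g_text;  /* text being indexed (qsort has no context arg) */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

void build_suffix_array(const char *text, int *sa, int n) {
    g_text = text;
    for (int i = 0; i < n; i++) sa[i] = i;  /* every suffix start position */
    qsort(sa, n, sizeof(int), cmp_suffix);  /* sort suffixes lexicographically */
}
```

For "banana", the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", giving sa = {5, 3, 1, 0, 4, 2}.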
  27. 27. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA © Copyright 2014 HSA Foundation. All Rights Reserved M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to “Principles and Practice of Parallel Programming (PPoPP’13)”, February 2013. AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM. By offloading data parallel computations to the GPU, HSA increases performance (+5.8x) and decreases energy (-5x) for Suffix Array Construction. By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies. Skew algorithm for computing the SA, with stages split across devices: Radix Sort (GPU), Merge Sort (GPU), Compute SA (CPU), Lexical Rank (CPU).
  28. 28. EASE OF PROGRAMMING CODE COMPLEXITY VS. PERFORMANCE
  29. 29. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS (exemplary ISV “Hessian” kernel) © Copyright 2014 HSA Foundation. All Rights Reserved
      Chart compares lines of code (broken down into init, compile, copy, launch, algorithm, and copy-back) and performance for: Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt.
      AMD A10-5800K APU with Radeon™ HD Graphics; CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta.
  30. 30. THE HSA FUTURE  Architected heterogeneous processing on the SOC  Programming of accelerators becomes much easier  Accelerated software that runs across multiple hardware vendors  Scalability from smart phones to super computers on a common architecture  GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model  Heterogeneous software ecosystem evolves at a much faster pace  Lower power, more capable devices in your hand, on the wall, in the cloud © Copyright 2014 HSA Foundation. All Rights Reserved
  31. 31. JOIN US! WWW.HSAFOUNDATION.COM
  32. 32. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): HSAIL VIRTUAL PARALLEL ISA BEN SANDER, AMD
  33. 33. TOPICS  Introduction and Motivation  HSAIL – what makes it special?  HSAIL Execution Model  How to program in HSAIL?  Conclusion © Copyright 2014 HSA Foundation. All Rights Reserved
  34. 34. STATE OF GPU COMPUTING • GPUs are fast and power efficient: high compute density per mm and per watt • But: can be hard to program Today’s Challenges  Separate address spaces  Copies  Can’t share pointers  New language required for compute kernel  e.g., OpenCL™ runtime API  Compute kernel compiled separately from host code Emerging Solution  HSA Hardware  Single address space  Coherent  Virtual  Fast access from all components  Can share pointers  Bring GPU computing to existing, popular programming models  Single-source, fully supported by compiler  HSAIL compiler IR (cross-platform!)
  35. 35. THE PORTABILITY CHALLENGE  CPU ISAs  ISA innovations added incrementally (e.g. NEON, AVX)  ISA retains backwards-compatibility with previous generation  Two dominant instruction-set architectures: ARM and x86  GPU ISAs  Massive diversity of architectures in the market  Each vendor has its own ISA, often several in the market at the same time  No commitment (or attempt!) to provide any backwards compatibility  Traditionally graphics APIs (OpenGL, DirectX) provide the necessary abstraction © Copyright 2014 HSA Foundation. All Rights Reserved
  36. 36. HSAIL : WHAT MAKES IT SPECIAL?
  37. 37. WHAT IS HSAIL?  Intermediate language for parallel compute in HSA  Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)  Expresses parallel regions of code  Binary format of HSAIL is called “BRIG”  Goal: Bring parallel acceleration to mainstream programming languages © Copyright 2014 HSA Foundation. All Rights Reserved main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler BRIG Finalizer Component ISA Host ISA
  38. 38. KEY HSAIL FEATURES  Parallel  Shared virtual memory  Portable across vendors in HSA Foundation  Stable across multiple product generations  Consistent numerical results (IEEE-754 with defined min accuracy)  Fast, robust, simple finalization step (no monthly updates)  Good performance (little need to write in ISA)  Supports all of OpenCL™  Supports Java, C++, and other languages as well © Copyright 2014 HSA Foundation. All Rights Reserved
  39. 39. HSAIL INSTRUCTION SET - OVERVIEW  Similar to assembly language for a RISC CPU  Load-store architecture  Destination register first, then source registers  140 opcodes (Java™ bytecode has 200)  Floating point (single, double, half (f16))  Integer (32-bit, 64-bit)  Some packed operations  Branches  Function calls  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas  Synchronize host CPU and HSA Component!  Text and Binary formats (“BRIG”) ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6+120) add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24 © Copyright 2014 HSA Foundation. All Rights Reserved
  40. 40. SEGMENTS AND MEMORY (1/2)  7 segments of memory  global, readonly, group, spill, private, arg, kernarg  Memory instructions can (optionally) specify a segment  Control data sharing properties and communicate intent  Global Segment  Visible to all HSA agents (including host CPU)  Group Segment  Provides high-performance memory shared in the work-group.  Group memory can be read and written by any work-item in the work-group  HSAIL provides sync operations to control visibility of group memory ld_global_u64 $d0,[$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] © Copyright 2014 HSA Foundation. All Rights Reserved
  41. 41. SEGMENTS AND MEMORY (2/2)  Spill, Private, Arg Segments  Represent different regions of a per-work-item stack  Typically generated by compiler, not specified by programmer  Compiler can use these to convey intent, e.g. spills  Kernarg Segment  Programmer writes kernarg segment to pass arguments to a kernel  Read-Only Segment  Remains constant during execution of kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  42. 42. FLAT ADDRESSING  Each segment mapped into virtual address space  Flat addresses can map to segments based on virtual address  Instructions with no explicit segment use flat addressing  Very useful for high-level language support (e.g. classes, libraries)  Aligns well with OpenCL 2.0 “generic” addressing feature ld_global_u64 $d6, [%_arg0] ; global ld_u64 $d0,[$d6+24] ; flat © Copyright 2014 HSA Foundation. All Rights Reserved
  43. 43. REGISTERS  Four classes of registers:  S: 32-bit, single-precision FP or int  D: 64-bit, double-precision FP or long int  Q: 128-bit, packed data  C: 1-bit, control registers (compares)  Fixed number of registers  S, D, Q share a single pool of resources  S + 2*D + 4*Q <= 128  Up to 128 S or 64 D or 32 Q (or a blend)  Register allocation done in high-level compiler  Finalizer doesn’t perform expensive register allocation (Register-file diagram: s0..s127 overlap d0..d63, which overlap q0..q31; c0..c7 are separate) © Copyright 2014 HSA Foundation. All Rights Reserved
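The shared-pool constraint above can be checked mechanically. A tiny helper (hypothetical, not part of any HSA API) makes the arithmetic explicit:

```c
/* S, D and Q registers draw on one resource pool: an S costs 1 slot,
 * a D costs 2, a Q costs 4, and 128 slots exist in total. */
int register_mix_fits(int num_s, int num_d, int num_q) {
    return num_s + 2 * num_d + 4 * num_q <= 128;
}
```

128 S, 64 D, or 32 Q each exactly fill the pool; a blend such as 64 S + 16 D + 8 Q (64 + 32 + 32 = 128) also fits.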
  44. 44. SIMT EXECUTION MODEL  HSAIL Presents a “SIMT” execution model to the programmer  “Single Instruction, Multiple Thread”  Programmer writes program for a single thread of execution  Each work-item appears to have its own program counter  Branch instructions look natural  Hardware Implementation  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency  Actually one program counter for the entire SIMD instruction  Branches implemented with predication  SIMT Advantages  Easier to program (branch code in particular)  Natural path for mainstream programming models and existing compilers  Scales across a wide variety of hardware (programmer doesn’t see vector width)  Cross-lane operations available for those who want peak performance © Copyright 2014 HSA Foundation. All Rights Reserved
  45. 45. WAVEFRONTS  Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”  Lanes in wavefront can be “active” or “inactive”  Inactive lanes consume hardware resources but don’t do useful work  Tradeoffs  “Wavefront-aware” programming can be useful for peak performance  But results in less portable code (since wavefront width is encoded in algorithm) if (cond) { operationA; // cond=True lanes active here } else { operationB; // cond=False lanes active here } © Copyright 2014 HSA Foundation. All Rights Reserved
  46. 46. CROSS-LANE OPERATIONS  Example HSAIL cross-lane operation: “activelaneid”  Dest set to count of earlier work-items that are active for this instruction  Useful for compaction algorithms  Example HSAIL cross-lane operation: “activelaneshuffle”  Each workitem reads value from another lane in the wavefront  Supports selection of “identity” element for inactive lanes  Useful for wavefront-level reductions activelaneid_u32 $s0 activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity © Copyright 2014 HSA Foundation. All Rights Reserved
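A scalar emulation of the activelaneid semantics described above (an illustrative sketch, not how a finalizer implements it): each active lane receives the count of lower-numbered active lanes, which is exactly its slot in a compacted output.

```c
#define WAVE_WIDTH 8  /* assumed wavefront width for this sketch */

/* Emulated activelaneid: dest[lane] = number of active lanes with a
 * lower lane id. Inactive lanes get -1 here, since the real
 * instruction only defines results for active lanes. */
void emulate_activelaneid(const int active[WAVE_WIDTH], int dest[WAVE_WIDTH]) {
    int earlier_active = 0;
    for (int lane = 0; lane < WAVE_WIDTH; lane++) {
        dest[lane] = active[lane] ? earlier_active : -1;
        if (active[lane]) earlier_active++;
    }
}
```

With the active mask {1,0,1,1,0,0,1,0}, the four active lanes receive 0, 1, 2, 3: their write positions after compaction.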
  47. 47. HSAIL MODES  Working group strived to limit optional modes and features in HSAIL  Minimize differences between HSA target machines  Better for compiler vendors and application developers  Two modes survived  Machine Models  Small: 32-bit pointers, 32-bit data  Large: 64-bit pointers, 32-bit or 64-bit data  Vendors can support one or both models  “Base” and “Full” Profiles  Two sets of requirements for FP accuracy, rounding, exception reporting, hard pre-emption © Copyright 2014 HSA Foundation. All Rights Reserved
  48. 48. HSA PROFILES (Feature: Base | Full) © Copyright 2014 HSA Foundation. All Rights Reserved
      Addressing modes: Small, Large | Small, Large
      All 32-bit HSAIL operations according to the declared profile: Yes | Yes
      F16 support (IEEE 754 or better): Yes | Yes
      F64 support: No | Yes
      Precision for add/sub/mul: 1/2 ULP | 1/2 ULP
      Precision for div: 2.5 ULP | 1/2 ULP
      Precision for sqrt: 1 ULP | 1/2 ULP
      HSAIL rounding, near: Yes | Yes
      HSAIL rounding, up/down/zero: No | Yes
      Subnormal floating-point: flush-to-zero | supported
      Propagate NaN payloads: No | Yes
      FMA: Yes | Yes
      Arithmetic exception reporting: None | DETECT or BREAK
      Debug trap: Yes | Yes
      Hard preemption: No | Yes
  49. 49. HSA PARALLEL EXECUTION MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
  50. 50. HSA PARALLEL EXECUTION MODEL Basic Idea: Programmer supplies an HSAIL “kernel” that is run on each work-item. Kernel is written as a single thread of execution. Programmer specifies grid dimensions (scope of problem) when launching the kernel. Each work-item has a unique coordinate in the grid. Programmer optionally specifies work-group dimensions (for optimized communication). © Copyright 2014 HSA Foundation. All Rights Reserved
  51. 51. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) © Copyright 2014 HSA Foundation. All Rights Reserved
  52. 52. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  53. 53. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx 2 + Gy 2) 2D work-group 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
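The per-work-item kernel sketched in these slides can be written out as a scalar C function (a host-side illustration, not HSAIL): each work-item at grid coordinate (x, y) applies Gx and Gy to its 3x3 neighborhood and produces the gradient magnitude.

```c
/* One Sobel work-item over an 8-bit grayscale image stored row-major
 * with width w. In an HSAIL kernel, x and y would come from the
 * work-item's grid coordinates; border handling is omitted, so callers
 * must keep 1 <= x < w-1 and 1 <= y < h-1. Returns the squared
 * magnitude Gx^2 + Gy^2 (the slide's G is its square root). */
int sobel_workitem(const unsigned char *img, int w, int x, int y) {
    int gx = -  img[(y-1)*w + (x-1)] +   img[(y-1)*w + (x+1)]
             - 2*img[ y   *w + (x-1)] + 2*img[ y   *w + (x+1)]
             -  img[(y+1)*w + (x-1)] +   img[(y+1)*w + (x+1)];
    int gy = -  img[(y-1)*w + (x-1)] - 2*img[(y-1)*w + x] - img[(y-1)*w + (x+1)]
             +  img[(y+1)*w + (x-1)] + 2*img[(y+1)*w + x] + img[(y+1)*w + (x+1)];
    return gx*gx + gy*gy;
}
```

Launching this over a 2D grid, one invocation per pixel, is exactly the grid/work-group dispatch shown on the slide.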
  54. 54. HOW TO PROGRAM HSA? WHAT DO I TYPE? © Copyright 2014 HSA Foundation. All Rights Reserved
  55. 55. HSA PROGRAMMING MODELS : CORE PRINCIPLES  Single source  Host and device code side-by-side in same source file  Written in same programming language  Single unified coherent address space  Freely share pointers between host and device  Similar memory model as multi-core CPU  Parallel regions identified with existing language syntax  Typically same syntax used for multi-core CPU  HSAIL is the compiler IR that supports these programming models © Copyright 2014 HSA Foundation. All Rights Reserved
  56. 56. GCC OPENMP : COMPILATION FLOW  SUSE GCC Project  Adding HSAIL code generator to GCC compiler infrastructure  Supports OpenMP 3.1 syntax  No data movement directives required! main() { … // Host code. #pragma omp parallel for for (int i=0;i<N; i++) { C[i] = A[i] + B[i]; } … } GCC OpenMP Compiler BRIG Finalizer Component ISA Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
  57. 57. GCC OPENMP FLOW
      Application: C/C++/Fortran OpenMP application, e.g., #pragma omp for for (j = 0; j < n; j++) { b[j] = a[j]; }
      Compile time: GNU Compiler (GCC) compiles host code and emits runtime calls with kernel name, parameters, and launch attributes; lowers OpenMP directives and converts GIMPLE to BRIG; embeds BRIG into the host code.
      Run time: pragmas map to calls into the HSA Runtime; kernels are dispatched to the GPU; kernels are finalized from BRIG to ISA once and cached.
      © Copyright 2014 HSA Foundation. All Rights Reserved
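Filled out as a complete function (a sketch of the slide's vector-add loop; on the SUSE GCC branch the parallel region would be finalized to BRIG, while a stock compiler simply runs it on host threads):

```c
/* Vector add in plain OpenMP 3.1 syntax. Note there are no data-movement
 * directives: under HSA, host and device share one coherent address
 * space, so A, B and C are passed by pointer with no copies. */
void vec_add(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```

The same source compiles unchanged for multi-core CPUs, which is the "single source, same syntax" principle from the programming-model slides.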
  58. 58. MCW C++AMP : COMPILATION FLOW  C++AMP : Single-source C++ template parallel programming model  MCW compiler based on CLANG/LLVM  Open-source and runs on Linux  Leverage open-source LLVM->HSAIL code generator main() { … parallel_for_each(grid<1>(extent<256>(…) … } C++AMP Compiler BRIG Finalizer Component ISA Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
  59. 59. JAVA: RUNTIME FLOW © Copyright 2014 HSA Foundation. All Rights Reserved JAVA 8 – HSA ENABLED APARAPI  Java 8 brings Stream + Lambda API. ‒ More natural way of expressing data parallel algorithms ‒ Initially targeted at multi-core.  APARAPI will : ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA enabled devices at runtime via HSAIL JVM Java Application HSA Finalizer & Runtime APARAPI + Lambda API GPUCPU Future Java – HSA ENABLED JAVA (SUMATRA)  Adds native GPU acceleration to Java Virtual Machine (JVM)  Developer uses JDK Lambda, Stream API  JVM uses GRAAL compiler to generate HSAIL JVM Java Application HSA Finalizer & Runtime Java JDK Stream + Lambda API Java GRAAL JIT backend GPUCPU
  60. 60. AN EXAMPLE (IN JAVA 8) © Copyright 2014 HSA Foundation. All Rights Reserved
      // Example computes the percentage of total scores achieved by each player on a team.
      class Player {
          private Team team; // Note: Reference to the parent Team.
          private int scores;
          private float pctOfTeamScores;
          public Team getTeam() { return team; }
          public int getScores() { return scores; }
          public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
      }; // “Team” class not shown
      // Assume “allPlayers” is an initialized array of Players.
      Arrays.stream(allPlayers).   // wrap the array in a stream
          parallel().              // developer indication that lambda is thread-safe
          forEach(p -> {
              int teamScores = p.getTeam().getScores();
              float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
              p.setPctOfTeamScores(pctOfTeamScores);
          });
  61. 61. HSAIL CODE EXAMPLE © Copyright 2014 HSA Foundation. All Rights Reserved
      01: version 0:95: $full : $large;
      02: // static method HotSpotMethod<Main.lambda$2(Player)>
      03: kernel &run (
      04:     kernarg_u64 %_arg0 // Kernel signature for lambda method
      05: ) {
      06:     ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
      07:     workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord
      08:
      09:     cvt_u64_s32 $d2, $s2; // Convert X gid to long
      10:     mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
      11:     add_u64 $d2, $d2, 24; // Adjust for actual elements start
      12:     add_u64 $d2, $d2, $d6; // Add to array ref ptr
      13:     ld_global_u64 $d6, [$d2]; // Load from array element into reg
      14: @L0:
      15:     ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
      16:     mov_b64 $d3, $d0;
      17:     ld_global_s32 $s3, [$d6 + 40]; // p.getScores()
      18:     cvt_f32_s32 $s16, $s3;
      19:     ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
      20:     cvt_f32_s32 $s17, $s0;
      21:     div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
      22:     st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
      23:     ret;
      24: };
  62. 62. HOW TO PROGRAM HSA? OTHER PROGRAMMING TOOLS © Copyright 2014 HSA Foundation. All Rights Reserved
  63. 63. HSAIL ASSEMBLER kernel &run (kernarg_u64 %_arg0) { ld_kernarg_u64 $d6, [%_arg0]; workitemabsid_u32 $s2, 0; cvt_u64_s32 $d2, $s2; mul_u64 $d2, $d2, 8; add_u64 $d2, $d2, 24; add_u64 $d2, $d2, $d6; ld_global_u64 $d6, [$d2]; . . . HSAIL Assembler BRIG Finalizer Machine ISA • HSAIL has a text format and an assembler © Copyright 2014 HSA Foundation. All Rights Reserved
  64. 64. OPENCL™ OFFLINE COMPILER (CLOC) __kernel void vec_add( __global const float *a, __global const float *b, __global float *c, const unsigned int n) { int id = get_global_id(0); // Bounds check if (id < n) c[id] = a[id] + b[id]; } CLOC BRIG Finalizer Machine ISA  OpenCL split-source model cleanly isolates kernel  Can express many HSAIL features in OpenCL Kernel Language  Higher productivity than writing in HSAIL assembly  Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)  Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model © Copyright 2014 HSA Foundation. All Rights Reserved
  65. 65. KEY TAKEAWAYS  HSAIL  Thin, robust, fast finalizer  Portable (multiple HW vendors and parallel architectures)  Supports shared virtual memory and platform atomics  HSA brings GPU computing to mainstream programming models  Shared and coherent memory bridges “faraway accelerator” gap  HSAIL provides the common IL for high-level languages to benefit from parallel computing  Languages and Compilers  HSAIL support in GCC, LLVM, Java JVM  Leverage same language syntax designed for multi-core CPUs  Can use pointer-containing data structures © Copyright 2014 HSA Foundation. All Rights Reserved
  66. 66. HSA RUNTIME YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
  67. 67. OUTLINE  Introduction  HSA Core Runtime API (Pre-release 1.0 provisional)  Initialization and Shut Down  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Summary © Copyright 2014 HSA Foundation. All Rights Reserved
  68. 68. INTRODUCTION (1)  The HSA core runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.  The overall goal of the HSA core runtime design is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures.  The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level.  The HSA core runtime API is standard across all HSA vendors, so that languages which use the HSA runtime can run on different vendors’ platforms that support the API.  The implementation of the HSA runtime may include kernel-level components (required for some hardware, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations). © Copyright 2014 HSA Foundation. All Rights Reserved
  69. 69. Component 1 Driver Component N… Vendor m … Component 1 Driver Component N… Vendor 1 Component 1 HSA Runtime Component N… HSA Vendor 1 HSA Finalizer Component 1 HSA Runtime Component N… HSA Vendor m HSA Finalizer INTRODUCTION (2) Programming Model Language Runtime  The software architecture stack without HSA runtime OpenCL App Java App OpenMP App DSL App OpenCL Runtime Java Runtime OpenMP Runtime DSL Runtime … …  The software architecture stack with HSA runtime … © Copyright 2014 HSA Foundation. All Rights Reserved
  70. 70. INTRODUCTION (3) OpenCL Runtime HSA RuntimeAgent Start Program HSA Memory Allocation Enqueue Dispatch Packet Exit Program Resource Deallocation Command Queue Platform, Device, and Context Initialization SVM Allocation and Kernel Arguments Setting Build Kernel HSA Runtime Close HSA Runtime Initialization and Topology Discovery HSAIL Finalization and Linking © Copyright 2014 HSA Foundation. All Rights Reserved
  71. 71. INTRODUCTION (4)  HSA Platform System Architecture Specification support  Runtime initialization and shutdown  Notifications (synchronous/asynchronous)  Agent information  Signals and synchronization (memory-based)  Queues and Architected dispatch  Memory management  HSAIL support  Finalization, linking, and debugging  Image and Sampler support HSA Runtime HSA Memory Allocation Enqueue Dispatch Packet HSA Runtime Close HSA Runtime Initialization and Topology Discovery HSAIL Finalization and Linking © Copyright 2014 HSA Foundation. All Rights Reserved
  72. 72. RUNTIME INITIALIZATION AND SHUTDOWN
  73. 73. OUTLINE  Runtime Initialization API  hsa_init  Runtime Shut Down API  hsa_shut_down  Examples © Copyright 2014 HSA Foundation. All Rights Reserved
74. 74. HSA RUNTIME INITIALIZATION  When the API is invoked for the first time in a given process, a runtime instance is created.  A typical runtime instance may contain information about the platform, topology, reference count, queues, signals, etc.  The API can be called multiple times by an application  Only a single runtime instance will exist for a given process.  Whenever the API is invoked, the reference count is increased by one. © Copyright 2014 HSA Foundation. All Rights Reserved
75. 75. HSA RUNTIME SHUT DOWN  When the API is invoked, the reference count is decreased by 1.  When the reference count reaches 0  All the resources associated with the runtime instance (queues, signals, topology information, etc.) are considered invalid and any attempt to reference them in subsequent API calls results in undefined behavior.  The user may call hsa_init to initialize the HSA runtime again.  The HSA runtime may release resources associated with it. © Copyright 2014 HSA Foundation. All Rights Reserved
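The reference-counted runtime lifetime described on these two slides can be sketched in plain C. This is a hypothetical model (the `sketch_` names are invented for illustration), not the real `hsa_init`/`hsa_shut_down` implementation:

```c
#include <assert.h>

/* Hypothetical sketch of the reference-counted runtime lifetime
 * described in the slides; not the real HSA runtime implementation. */
static int ref_count = 0;   /* one runtime instance per process */

int sketch_init(void) {
    /* First call creates the instance; later calls only bump the count. */
    if (ref_count == 0) {
        /* ... allocate topology table, agent list, queues, signals ... */
    }
    ref_count++;
    return 0;               /* HSA_STATUS_SUCCESS is 0 */
}

int sketch_shut_down(void) {
    ref_count--;
    if (ref_count < 1) {
        /* ... release queues, signals, topology information ... */
        ref_count = 0;
    }
    return 0;
}
```

Calling `sketch_init` twice and `sketch_shut_down` once leaves the instance alive, matching the "single instance per process" rule above.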
  76. 76. EXAMPLE – RUNTIME INITIALIZATION (1) Data structure for runtime instance If hsa_init is called more than once, increase the ref_count by 1 © Copyright 2014 HSA Foundation. All Rights Reserved
  77. 77. EXAMPLE – RUNTIME INITIALIZATION (2) hsa_init is called the first time, allocate resources and set the reference count Get the number of HSA agent Initialize agents Create an empty agent list If initialization failed, release resources Create topology table © Copyright 2014 HSA Foundation. All Rights Reserved
  78. 78. Agent-0 node_id 0 id 0 type CPU vendor Generic name Generic wavefront_size 0 queue_size 200 group_memory 0 fbarrier_max_count 1 is_pic_supported 0 … … EXAMPLE - RUNTIME INSTANCE (1) Platform Name: Generic Memory node_id 0 id 0 segment_type 111111 address_base 0x0001 size 2048 MB peak_bandwidth 6553.6 mpbs Agent-1 node_id 0 id 0 type GPU vendor Generic name Generic wavefront_size 64 queue_size 200 group_memory 64 fbarrier_max_count 1 is_pic_supported 1 Cache node_id 0 id 0 levels 1 associativity 1 cache size 64KB cache line size 4 is_inclusive 1 Agent: 2 Memory: 1 Cache: 1 … … © Copyright 2014 HSA Foundation. All Rights Reserved
  79. 79. Agent-0 node_id = 0 id = 0 agent_type = 1 (CPU) vendor[16] = Generic name[16] = Generic wavefront_size = 0 queue_size =200 group_memory_size_bytes =0 fbarrier_max_count = 1 is_pic_supported = 0 Platform Header File *base_address = 0x00001 Size = 248 system_timestamp_frequency_ mhz = 200 signal_maximum_wait = 1/200 *node_id no_nodes = 1 *agent_list no_agent = 2 *memory_descriptor_list no_memory_descriptor = 1 *cache_descriptor_list no_cache_descriptor = 1 EXAMPLE - RUNTIME INSTANCE (2) … … cache node_id = 0 Id = 0 Levels = 1 * associativity * cache_size * cache_line_size * is_inclusive 1 NULL 64KB NULL 1 NULL 4 NULL Memory node_id = 0 Id = 0 supported_segment_type_mask = 111111 virtual_address_base = 0x0001 size_in_bytes = 2048MB peak_bandwidth_mbps = 6553.6 0 NULL 45 165 NULL 285 NULL 325 NULL Agent-1 node_id = 0 id = 0 agent_type = 2 (GPU) vendor[16] = Generic name[16] = Generic wavefront_size = 64 queue_size =200 group_memory_size_bytes =64 fbarrier_max_count = 1 is_pic_supported = 1 … © Copyright 2014 HSA Foundation. All Rights Reserved
80. 80. EXAMPLE – RUNTIME SHUT DOWN © Copyright 2014 HSA Foundation. All Rights Reserved Decrease the ref_count by 1; if it drops below 1, free the list and release the runtime’s resources.
  81. 81. NOTIFICATIONS (SYNCHRONOUS/ASYNCHRONOUS)
  82. 82. OUTLINE  Synchronous Notifications  hsa_status_t  hsa_status_string  Asynchronous Notifications  Example © Copyright 2014 HSA Foundation. All Rights Reserved
83. 83. SYNCHRONOUS NOTIFICATIONS  Notifications (errors, events, etc.) reported by the runtime can be synchronous or asynchronous  The HSA runtime uses the return values of API functions to pass notifications synchronously.  A status code is defined as an enumeration, hsa_status_t, to capture the return value of any API function that has been executed, except accessors/mutators.  The notification is a status code that indicates success or error.  Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.  An error status is assigned a positive integer and its identifier starts with the HSA_STATUS_ERROR prefix.  The status code can help to determine the cause of an unsuccessful execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  84. 84. STATUS CODE QUERY  Query additional information on status code  Parameters  status (input): Status code that the user is seeking more information on  status_string (output): An ISO/IEC 646 encoded English language string that potentially describes the error status © Copyright 2014 HSA Foundation. All Rights Reserved
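A minimal sketch of the status-code convention just described: success equal to zero, positive error codes carrying the HSA_STATUS_ERROR prefix, and a string-query helper in the spirit of hsa_status_string. The `sketch_` names and the specific error enumerators beyond the prefix rule are illustrative assumptions:

```c
#include <string.h>

/* Sketch of the status-code convention described above: success is 0,
 * errors are positive and use the HSA_STATUS_ERROR prefix.  The
 * enumerator list here is illustrative, not the full specification. */
typedef enum {
    HSA_STATUS_SUCCESS = 0,
    HSA_STATUS_ERROR = 1,                 /* generic error */
    HSA_STATUS_ERROR_INVALID_ARGUMENT = 2 /* illustrative entry */
} sketch_status_t;

/* Maps a status code to a short English description, in the spirit of
 * the status string query API. */
const char *sketch_status_string(sketch_status_t status) {
    switch (status) {
    case HSA_STATUS_SUCCESS:                return "success";
    case HSA_STATUS_ERROR_INVALID_ARGUMENT: return "invalid argument";
    default:                                return "error";
    }
}
```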
85. 85. ASYNCHRONOUS NOTIFICATIONS  The runtime passes asynchronous notifications by calling user-defined callbacks.  For instance, queues are a common source of asynchronous events because the tasks queued by an application are asynchronously consumed by the packet processor. Callbacks are associated with queues when they are created. When the runtime detects an error in a queue, it invokes the callback associated with that queue and passes it an error flag (indicating what happened) and a pointer to the erroneous queue.  The HSA runtime does not implement any default callbacks.  Take care when using blocking functions within a callback implementation: a callback that never returns can leave the runtime in an undefined state. © Copyright 2014 HSA Foundation. All Rights Reserved
  86. 86. EXAMPLE - CALLBACK Pass the callback function when create queue If the queue is empty, set the event and invoke callback © Copyright 2014 HSA Foundation. All Rights Reserved
  87. 87. AGENT INFORMATION
  88. 88. OUTLINE  Agent information  hsa_node_t  hsa_agent_t  hsa_agent_info_t  hsa_component_feature_t  Agent Information manipulation APIs  hsa_iterate_agents  hsa_agent_get_info  Example © Copyright 2014 HSA Foundation. All Rights Reserved
89. 89. INTRODUCTION  The runtime exposes a list of agents that are available in the system.  An HSA agent is a hardware component that participates in the HSA memory model.  An HSA agent can submit AQL packets for execution.  An HSA agent may also, but is not required to, be an HSA component. It is possible for a system to include HSA agents that are neither an HSA component nor a host CPU.  HSA agents are represented as opaque handles of type hsa_agent_t.  The HSA runtime provides APIs for applications to traverse the list of available agents and query attributes of a particular agent. © Copyright 2014 HSA Foundation. All Rights Reserved
  90. 90. AGENT INFORMATION (1)  Opaque agent handle  Opaque NUMA node handle  An HSA memory node is a node that delineates a set of system components (host CPUs and HSA Components) with “local” access to a set of memory resources attached to the node's memory controller and appropriate HSA-compliant access attributes. © Copyright 2014 HSA Foundation. All Rights Reserved
91. 91. AGENT INFORMATION (2)  Component features  An HSA component is a hardware or software component that can be a target of AQL queues and conforms to the memory model of the HSA.  Values  HSA_COMPONENT_FEATURE_NONE = 0  No component capabilities. The device is an agent, but not a component.  HSA_COMPONENT_FEATURE_BASIC = 1  The component supports the HSAIL instruction set and all the AQL packet types except Agent dispatch.  HSA_COMPONENT_FEATURE_ALL = 2  The component supports the HSAIL instruction set and all the AQL packet types. © Copyright 2014 HSA Foundation. All Rights Reserved
  92. 92. AGENT INFORMATION (3)  Agent attributes  Values  HSA_AGENT_INFO_MAX_GRID_DIM  HSA_AGENT_INFO_MAX_WORKGROUP_DIM  HSA_AGENT_INFO_QUEUE_MAX_PACKETS  HSA_AGENT_INFO_CLOCK  HSA_AGENT_INFO_CLOCK_FREQUENCY  HSA_AGENT_INFO_MAX_SIGNAL_WAIT  HSA_AGENT_INFO_NAME  HSA_AGENT_INFO_NODE  HSA_AGENT_INFO_COMPONENT_FEATURES  HSA_AGENT_INFO_VENDOR_NAME  HSA_AGENT_INFO_WAVEFRONT_SIZE  HSA_AGENT_INFO_CACHE_SIZE © Copyright 2014 HSA Foundation. All Rights Reserved
  93. 93. AGENT INFORMATION MANIPULATION (1)  Iterate over the available agents, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value.  Parameters  callback (input): Callback to be invoked once per agent  data (input): Application data that is passed to callback on every iteration. Can be NULL. © Copyright 2014 HSA Foundation. All Rights Reserved
  94. 94. AGENT INFORMATION MANIPULATION (2)  Get the current value of an attribute for a given agent  Parameters  agent (input): A valid agent  attribute (input): Attribute to query  value (output): Pointer to a user-allocated buffer where to store the value of the attribute. If the buffer passed by the application is not large enough to hold the value of attribute, the behavior is undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
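The traversal-with-early-stop contract of hsa_iterate_agents described above can be modeled in a few lines of C. The agent handles, the agent list, and the `sketch_` names are mock stand-ins, not the real runtime:

```c
/* Sketch of the iterate-agents contract: invoke the callback once per
 * agent and stop the traversal as soon as it returns non-success,
 * propagating that status.  Agent handles are mocked as integers. */
typedef unsigned long sketch_agent_t;          /* opaque handle stand-in */
typedef int (*sketch_agent_cb)(sketch_agent_t agent, void *data);

static sketch_agent_t agent_list[] = { 101, 102, 103 };

int sketch_iterate_agents(sketch_agent_cb cb, void *data) {
    for (unsigned i = 0; i < sizeof agent_list / sizeof agent_list[0]; i++) {
        int status = cb(agent_list[i], data);
        if (status != 0)        /* non-HSA_STATUS_SUCCESS stops traversal */
            return status;
    }
    return 0;                   /* HSA_STATUS_SUCCESS */
}

/* Example callback: count agents, aborting the traversal after two. */
int count_two(sketch_agent_t agent, void *data) {
    (void)agent;
    int *n = data;
    return (++*n >= 2) ? 99 : 0;  /* 99: arbitrary non-success code */
}
```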
  95. 95. EXAMPLE - AGENT ATTRIBUTE QUERY Copy agent attribute information Get the agent handle of Agent 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  96. 96. SIGNALS AND SYNCHRONIZATION (MEMORY-BASED)
97. 97. OUTLINE  Signal  Signal manipulation API  Create/Destroy  Query  Send  Atomic Operations  Signal wait  Get time out  Signal Condition  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  98. 98. SIGNAL (1)  HSA agents can communicate with each other by using coherent global memory, or by using signals.  A signal is represented by an opaque signal handle  A signal carries a value, which can be updated or conditionally waited upon via an API call or HSAIL instruction.  The value occupies four or eight bytes depending on the machine model in use. © Copyright 2014 HSA Foundation. All Rights Reserved
  99. 99. SIGNAL (2)  Updating the value of a signal is equivalent to sending the signal.  In addition to the update (store) of signals, the API for sending signal must support other atomic operations with specific memory order semantics  Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS  Memory order semantics : Release and Relaxed © Copyright 2014 HSA Foundation. All Rights Reserved
  100. 100. SIGNAL CREATE/DESTROY  Create a signal  Parameters  initial_value (input): Initial value of the signal.  signal_handle (output): Signal handle.  Destroy a signal previous created by hsa_signal_create  Parameter  signal_handle (input): Signal handle. © Copyright 2014 HSA Foundation. All Rights Reserved
  101. 101.  Send and atomically set the value of a signal with release semantics SIGNAL LOAD/STORE  Atomically read the current signal value with acquire semantics  Atomically read the current signal value with relaxed semantics  Send and atomically set the value of a signal with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  102. 102.  Send and atomically increment the value of a signal by a given amount with release semantics SIGNAL ADD/SUBTRACT  Send and atomically decrement the value of a signal by a given amount with release semantics  Send and atomically increment the value of a signal by a given amount with relaxed semantics  Send and atomically decrement the value of a signal by a given amount with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  103. 103.  Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics SIGNAL AND (OR, XOR)/EXCHANGE  Send and atomically set the value of a signal and return its previous value with release semantics  Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics  Send and atomically set the value of a signal and return its previous value with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
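The send operations above map naturally onto C11 atomics, which is one way to picture the release/relaxed distinction. This sketch models only the signal value; a real HSA signal is an opaque, runtime-managed handle, and the `sketch_` names are invented:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of a signal value updated with the atomic operations and
 * memory orders listed above, modeled with C11 atomics. */
typedef struct { _Atomic int64_t value; } sketch_signal_t;

/* Send: atomically set the value with release semantics. */
void sketch_signal_store_release(sketch_signal_t *s, int64_t v) {
    atomic_store_explicit(&s->value, v, memory_order_release);
}

/* Read the current value with acquire semantics. */
int64_t sketch_signal_load_acquire(sketch_signal_t *s) {
    return atomic_load_explicit(&s->value, memory_order_acquire);
}

/* Send: atomically increment by a given amount with relaxed semantics. */
void sketch_signal_add_relaxed(sketch_signal_t *s, int64_t v) {
    atomic_fetch_add_explicit(&s->value, v, memory_order_relaxed);
}

/* Send: atomically set the value and return the previous one,
 * with release semantics. */
int64_t sketch_signal_exchange_release(sketch_signal_t *s, int64_t v) {
    return atomic_exchange_explicit(&s->value, v, memory_order_release);
}
```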
  104. 104. SIGNAL WAIT (1)  The application may wait on a signal, with a condition specifying the terms of wait.  Signal wait condition operator  Values  HSA_EQ: The two operands are equal.  HSA_NE: The two operands are not equal.  HSA_LT: The first operand is less than the second operand.  HSA_GTE: The first operand is greater than or equal to the second operand. © Copyright 2014 HSA Foundation. All Rights Reserved
105. 105. SIGNAL WAIT (2)  The wait can be done either in the HSA component via an HSAIL wait instruction or via a runtime API defined here.  Waiting on a signal returns the current value at the opaque signal object.  The wait may have a runtime-defined timeout which indicates the maximum amount of time that an implementation can spend waiting.  The signal infrastructure allows for multiple senders/waiters on a single signal.  Since a wait reads the signal value, acquire synchronization may be applied. © Copyright 2014 HSA Foundation. All Rights Reserved
106. 106. SIGNAL WAIT (3)  Signal wait  Parameters  signal_handle (input): A signal handle  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
107. 107. SIGNAL WAIT (4)  Signal wait with timeout  Parameters  signal_handle (input): A signal handle  timeout (input): Maximum wait duration (A value of zero indicates no maximum)  long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in a short period of time. The HSA runtime may use this hint to optimize the wait implementation.  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
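The four wait conditions (HSA_EQ, HSA_NE, HSA_LT, HSA_GTE) can be captured in a small comparison helper. The enumerator values and the `sketch_` name are assumptions, and the operand order (signal value as the first operand) is an assumed reading of the descriptions above:

```c
#include <stdint.h>

/* Sketch of the wait-condition test: does the current signal value
 * satisfy the condition against compare_value?  The signal value is
 * taken as the first operand (an assumption, not spec text). */
typedef enum { HSA_EQ, HSA_NE, HSA_LT, HSA_GTE } sketch_condition_t;

int sketch_condition_met(sketch_condition_t cond,
                         int64_t signal_value, int64_t compare_value) {
    switch (cond) {
    case HSA_EQ:  return signal_value == compare_value;
    case HSA_NE:  return signal_value != compare_value;
    case HSA_LT:  return signal_value <  compare_value;
    case HSA_GTE: return signal_value >= compare_value;
    }
    return 0;
}
```

A waiting implementation would loop (or sleep on the doorbell) until this predicate becomes true or the timeout expires.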
108. 108. EXAMPLE – SIGNAL WAIT (1) thread_1 thread_2 thread_1 is blocked hsa_signal_add_relaxed (value = value + 3) Return signal value Condition satisfied, the execution of thread_1 continues value = 0 Timeline Timeline value = 3 hsa_signal_subtract_relaxed (value = value - 1) value = 2 hsa_signal_wait_timeout_acquire (value == 2) © Copyright 2014 HSA Foundation. All Rights Reserved
  109. 109. EXAMPLE – SIGNAL WAIT (2) If signal_handle is invalid, then return signal invalid status Compare tmp->value with compare_value to see if the condition is satisfied? If timeout = 0 then return signal time out status Signal wait condition function If the condition is satisfied, then return signal and status © Copyright 2014 HSA Foundation. All Rights Reserved
  110. 110. QUEUES AND ARCHITECTED DISPATCH
  111. 111. OUTLINE  Queues  Queue Types and Structure  HSA runtime API for Queue Manipulations  Architected Queuing Language (AQL) Support  Packet type  Packet header  Examples  Enqueue Packet  Packet Processor © Copyright 2014 HSA Foundation. All Rights Reserved
112. 112. INTRODUCTION (1)  An HSA-compliant platform supports allocation of multiple user-level command queues.  A user-level command queue is characterized as runtime-allocated, user-level accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language (AQL packets).  Queues are allocated by HSA applications through the HSA runtime.  HSA software receives memory-based structures used to configure the hardware queues, allowing for efficient software management of the hardware queues of the HSA agents.  This queue memory shall be processed by the HSA Packet Processor as a ring buffer.  Queues are read-only data structures.  Writing values directly to a queue structure results in undefined behavior.  But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use runtime APIs to access the doorbell signal or the service queue. © Copyright 2014 HSA Foundation. All Rights Reserved
  113. 113.  Two queue types, AQL and Service Queues, are supported  AQL Queue consumes AQL packets that are used to specify the information of kernel functions that will be executed on the HSA component  Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user registered functions that will be executed on the agent (typically, the host CPU) INTRODUCTION (2) © Copyright 2014 HSA Foundation. All Rights Reserved
  114. 114. INTRODUCTION (3)  AQL queue structure © Copyright 2014 HSA Foundation. All Rights Reserved
115. 115. INTRODUCTION (4)  In addition to the data held in the queue structure, the queue also defines two properties (readIndex and writeIndex) that define the location of the “head” and “tail” of the queue.  readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet to be consumed by the packet processor.  writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet slot to be allocated.  Neither index is directly exposed to the user, who can only access them by using dedicated HSA core runtime APIs.  The available index functions differ in the index of interest (read or write), the action to be performed (addition, compare and swap, etc.), and the memory consistency model (relaxed, release, etc.). © Copyright 2014 HSA Foundation. All Rights Reserved
116. 116. INTRODUCTION (5)  The read index is automatically advanced when a packet is read by the packet processor.  When the packet processor observes that  The read index matches the write index, the queue can be considered empty;  The write index is greater than or equal to the sum of the read index and the size of the queue, the queue is full.  The doorbell_signal field of a queue contains a signal that is used by the agent to inform the packet processor to process the packets it writes.  The value signaled on the doorbell is equal to the ID of the packet that is ready to be launched. © Copyright 2014 HSA Foundation. All Rights Reserved
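The empty/full rules just stated are easy to express directly, because both indices are monotonically increasing packet IDs (reduced modulo the queue size only when addressing slots). A hedged sketch, with invented `sketch_` names:

```c
#include <stdint.h>

/* Sketch of the ring-buffer occupancy rules described above. */
typedef struct {
    uint64_t read_index;   /* packetID of next packet to be consumed */
    uint64_t write_index;  /* packetID of next slot to be allocated */
    uint64_t size;         /* number of packet slots in the queue */
} sketch_queue_t;

/* Empty: the read index matches the write index. */
int sketch_queue_empty(const sketch_queue_t *q) {
    return q->read_index == q->write_index;
}

/* Full: the write index is >= read index + queue size. */
int sketch_queue_full(const sketch_queue_t *q) {
    return q->write_index >= q->read_index + q->size;
}
```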
  117. 117. INTRODUCTION (6)  The new task might be consumed by the packet processor even before the doorbell signal has been signaled by the agent.  This is because the packet processor might be already processing some other packets and observes that there is new work available, so it processes the new packets.  In any case, the agent must ring the doorbell for every batch of packets it writes. © Copyright 2014 HSA Foundation. All Rights Reserved
118. 118. QUEUE CREATE/DESTROY  Create a user mode queue  When a queue is created, the runtime also allocates the packet buffer and the completion signal.  The application should only rely on the status code returned to determine if the queue is valid  Destroy a user mode queue  A queue must not be accessed after being destroyed.  When a queue is destroyed, the state of the AQL packets that have not yet been fully processed becomes undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
  119. 119. GET READ/WRITE INDEX  Atomically retrieve read index of a queue with acquire semantics  Atomically retrieve write index of a queue with acquire semantics  Atomically retrieve read index of a queue with relaxed semantics  Atomically retrieve write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  120. 120. SET READ/WRITE INDEX  Atomically set the read index of a queue with release semantics  Atomically set the read index of a queue with relaxed semantics  Atomically set the write index of a queue with release semantics  Atomically set the write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
121. 121. COMPARE AND SWAP WRITE INDEX  Atomically compare and set the write index of a queue with acquire/release/relaxed/acquire-release semantics  Parameters  queue (input): A queue  expected (input): The expected index value  val (input): Value to copy to the write index if expected matches the observed write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  122. 122. ADD WRITE INDEX  Atomically increment the write index of a queue by an offset with release/acquire/relaxed/acquire-release semantics  Parameters  queue (input): A queue  val (input): The value to add to the write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  123. 123. ARCHITECTED QUEUING LANGUAGE (AQL)  An HSA-compliant system provides a command interface for the dispatch of HSA agent commands.  This command interface is provided by the Architected Queuing Language (AQL).  AQL allows HSA agents to build and enqueue their own command packets, enabling fast and low-power dispatch.  AQL also provides support for HSA component queue submissions  The HSA component kernel can write commands in AQL format. © Copyright 2014 HSA Foundation. All Rights Reserved
124. 124. AQL PACKET (1)  AQL packet format  Values  Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.  Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.  Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.  Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing of subsequent packets. All queues support barrier packets.  Agent dispatch packet (4): Agent dispatch packets contain jobs for an HSA agent (typically the host CPU) and are created by HSA agents. © Copyright 2014 HSA Foundation. All Rights Reserved
  125. 125. AQL PACKET (2) HSA signaling object handle used to indicate completion of the job © Copyright 2014 HSA Foundation. All Rights Reserved
  126. 126. EXAMPLE - ENQUEUE AQL PACKET (1)  An HSA agent submits a task to a queue by performing the following steps:  Allocate a packet slot (by incrementing the writeIndex)  Initialize the packet and copy packet to a queue associated with the Packet Processor  Mark packet as valid  Notify the Packet Processor of the packet (With doorbell signal) © Copyright 2014 HSA Foundation. All Rights Reserved
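The four enqueue steps above can be sketched with a mock queue. In real code the slot would be claimed with the add-write-index runtime call described earlier and the doorbell would be an HSA signal; everything here (the `sketch_` types, the 8-slot size, format code 2 for a dispatch packet) is illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mock packet: a format field plus an opaque payload. */
typedef struct { int format; int payload; } sketch_packet_t;

typedef struct {
    sketch_packet_t slots[8];       /* ring buffer of packet slots */
    _Atomic uint64_t write_index;   /* next packetID to allocate */
    _Atomic int64_t doorbell;       /* last packetID made ready */
} sketch_aql_queue_t;

void sketch_enqueue(sketch_aql_queue_t *q, int payload) {
    /* 1. Allocate a packet slot by atomically bumping the write index. */
    uint64_t id = atomic_fetch_add_explicit(&q->write_index, 1,
                                            memory_order_relaxed);
    sketch_packet_t *slot = &q->slots[id % 8];
    /* 2. Initialize the packet in the allocated slot. */
    slot->payload = payload;
    /* 3. Mark the packet as valid only after its body is written
     *    (format 2 = dispatch, per the packet-format list above). */
    slot->format = 2;
    /* 4. Ring the doorbell with the ID of the packet just made ready. */
    atomic_store_explicit(&q->doorbell, (int64_t)id, memory_order_release);
}
```

The release store on the doorbell models the ordering requirement that the packet contents be visible before the packet processor is notified.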
  127. 127. EXAMPLE - ENQUEUE AQL PACKET (2) Dispatch Queue Allocate an AQL packet slot Copy the packet into queue. Note that, we can have a lock here to prevent race condition in multithread environment WriteIndex ReadIndex Initialize packet Send doorbell signal © Copyright 2014 HSA Foundation. All Rights Reserved
  128. 128. EXAMPLE - PACKET PROCESSOR WriteIndex ReadIndex Get packet content Check if barrier packet Update readIndex, change packet state to invalid, and send completion signal. Receive doorbell Dispatch Queue If there is any packet in queue, process the packet. © Copyright 2014 HSA Foundation. All Rights Reserved
  129. 129. MEMORY MANAGEMENT
  130. 130. OUTLINE  Memory registration and deregistration  Memory region and memory segment  APIs for memory region manipulation  APIs for memory registration and deregistration © Copyright 2014 HSA Foundation. All Rights Reserved
  131. 131. INTRODUCTION  One of the key features of HSA is its ability to share global pointers between the host application and code executing on the HSA component.  This ability means that an application can directly pass a pointer to memory allocated on the host to a kernel function dispatched to a component without an intermediate copy  When a buffer created in the host is also accessed by a component, programmers are encouraged to register the corresponding address range beforehand.  Registering memory expresses an intention to access (read or write) the passed buffer from a component other than the host. This is a performance hint that allows the runtime implementation to know which buffers will be accessed by some of the components ahead of time.  When an HSA program no longer needs to access a registered buffer in a device, the user should deregister that virtual address range. © Copyright 2014 HSA Foundation. All Rights Reserved
  132. 132. MEMORY REGION/SEGMENT  A memory region represents a virtual memory interval that is visible to a particular agent, and contains properties about how memory is accessed or allocated from that agent.  Memory segments  Values  HSA_SEGMENT_GLOBAL = 1  HSA_SEGMENT_PRIVATE = 2  HSA_SEGMENT_GROUP = 4  HSA_SEGMENT_KERNARG = 8  HSA_SEGMENT_READONLY = 16  HSA_SEGMENT_IMAGE = 32 © Copyright 2014 HSA Foundation. All Rights Reserved
  133. 133. MEMORY REGION INFORMATION  Attributes of a memory region  Values  HSA_REGION_INFO_BASE_ADDRESS  HSA_REGION_INFO_SIZE  HSA_REGION_INFO_NODE  HSA_REGION_INFO_MAX_ALLOCATION_SIZE  HSA_REGION_INFO_SEGMENT  HSA_REGION_INFO_BANDWIDTH  HSA_REGION_INFO_CACHED © Copyright 2014 HSA Foundation. All Rights Reserved
  134. 134. MEMORY REGION MANIPULATION (1)  Get the current value of an attribute of a region  Iterate over the memory regions that are visible to an agent, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value. © Copyright 2014 HSA Foundation. All Rights Reserved
  135. 135. MEMORY REGION MANIPULATION (2)  Allocate a block of memory  Deallocate a block of memory previously allocated using hsa_memory_allocate  Copy block of memory  Copying a number of bytes larger than the size of the memory regions pointed by dst or src results in undefined behavior. © Copyright 2014 HSA Foundation. All Rights Reserved
  136. 136. MEMORY REGISTRATION/DEREGISTRATION  Register memory  Parameters  address (input): A pointer to the base of the memory region to be registered. If a NULL pointer is passed, no operation is performed.  size (input): Requested registration size in bytes. A size of zero is only allowed if address is NULL.  Deregister memory previously registered using hsa_memory_register  Parameter  address (input): A pointer to the base of the memory region to be registered. If a NULL pointer is passed, no operation is performed. © Copyright 2014 HSA Foundation. All Rights Reserved
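The register/deregister hinting flow can be modeled with a tiny bookkeeping table. This is a stand-in sketch for hsa_memory_register and its deregistration counterpart (whose exact name the slides do not give); the `sketch_` names, the table size, and the return codes are assumptions:

```c
#include <stddef.h>

/* Sketch: track which address ranges have been hinted to the runtime
 * as accessible from a non-host component.  Purely illustrative. */
#define MAX_REGIONS 16
static struct { void *addr; size_t size; } regions[MAX_REGIONS];

int sketch_register(void *address, size_t size) {
    if (address == NULL) return 0;       /* NULL pointer: no operation */
    for (int i = 0; i < MAX_REGIONS; i++)
        if (regions[i].addr == NULL) {
            regions[i].addr = address;
            regions[i].size = size;
            return 0;                    /* success */
        }
    return 1;                            /* out of registry slots */
}

int sketch_deregister(void *address) {
    if (address == NULL) return 0;       /* NULL pointer: no operation */
    for (int i = 0; i < MAX_REGIONS; i++)
        if (regions[i].addr == address) {
            regions[i].addr = NULL;
            return 0;                    /* success */
        }
    return 1;                            /* range was not registered */
}
```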
  137. 137. EXAMPLE Allocate a memory space Use hsa_region_get_info to get the size in byte of this memory space Register this memory space for a performance hint Finish operation, deregister and free this memory space © Copyright 2014 HSA Foundation. All Rights Reserved
  138. 138. SUMMARY
  139. 139. SUMMARY  Covered  HSA Core Runtime API (Pre-release 1.0 provisional)  Runtime Initialization and Shutdown (Open/Close)  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Memory Management  Not covered  Extension of Core Runtime  HSAIL Finalization, Linking, and Debugging  Images and Samplers © Copyright 2014 HSA Foundation. All Rights Reserved
  140. 140. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  141. 141. HSA MEMORY MODEL BEN GASTER, ENGINEER, QUALCOMM
  142. 142. OUTLINE  HSA Memory Model  OpenCL 2.0  Has a memory model too  Obstruction-free bounded deques  An example using the HSA memory model © Copyright 2014 HSA Foundation. All Rights Reserved
  143. 143. HSA MEMORY MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
144. 144. TYPES OF MODELS  Shared memory computers and programming languages divide complexity into models: 1. Memory model specifies safety  e.g. what value can a load of a shared location return?  This is what this section of the tutorial will focus on 2. Execution model specifies liveness  Described in Ben Sander’s tutorial section on HSAIL  e.g. can a work-item prevent others from progressing? 3. Performance model specifies the big picture  e.g. caches or branch divergence  Specific to particular implementations and outside the scope of today’s tutorial © Copyright 2014 HSA Foundation. All Rights Reserved
  145. 145. THE PROBLEM  Assume all locations (a, b, …) are initialized to 0  What are the values of $s2 and $s4 after execution? © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; *a = 1; int x = *b; *b = 1; int y = *a; initially *a = 0 && *b = 0
  146. 146. THE SOLUTION  The memory model tells us:  Defines the visibility of writes to memory at any given point  Provides us with a set of possible executions © Copyright 2014 HSA Foundation. All Rights Reserved
147. 147. WHAT MAKES A GOOD MEMORY MODEL*  Programmability ; A good model should make it (relatively) easy to write multi- work-item programs. The model should be intuitive to most users, even to those who have not read the details  Performance ; A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options  Portability ; A good model should be adopted widely or at least provide backward compatibility or the ability to translate among models * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993. © Copyright 2014 HSA Foundation. All Rights Reserved
148. 148. SEQUENTIAL CONSISTENCY (SC)*  Axiomatic Definition  A single processor (core) is sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.”  A multiprocessor is sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” © Copyright 2014 HSA Foundation. All Rights Reserved  But HW/Compiler actually implements more relaxed models, e.g. ARMv7 * L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–91, Sept. 1979.
  149. 149. SEQUENTIAL CONSISTENCY (SC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; $s2 = 0 && $s4 = 1
  150. 150. BUT WHAT ABOUT ACTUAL HARDWARE  Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform  Many modern processors implement many reordering optimizations  Store buffers (TSO*), work-items can see their own stores early  Reorder buffers (XC*), work-items can see other work-items’ stores early © Copyright 2014 HSA Foundation. All Rights Reserved *TSO – Total Store Order as implemented by Sparc and x86 *XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
  151. 151. RELAXED CONSISTENCY (XC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; ld_global_u32 $s2, [&b] ; ld_global_u32 $s4, [&a] ; st_global_u32 $s1, [&a] ; st_global_u32 $s3, [&b] ; $s2 = 0 && $s4 = 0
  152. 152. WHAT ARE OUR 3 Ps?  Programmability ; XC makes it really pretty hard for the programmer to reason about what will be visible when  many memory model experts have been known to get it wrong!  Performance ; XC is good for performance, the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements  Portability ; XC is very portable as it places very few constraints © Copyright 2014 HSA Foundation. All Rights Reserved
  153. 153. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT  To have their cake and eat it! © Copyright 2014 HSA Foundation. All Rights Reserved HSA Provides: The ability to enable programmers to reason with a (relatively) intuitive model of SC, while still achieving the benefits of XC!
  154. 154. SEQUENTIAL CONSISTENCY FOR DRF*  HSA adopts the same approach as Java, C++11, and OpenCL 2.0 adopting SC for Data Race Free (DRF)  plus some new capabilities !  (Informally) A data race occurs when two (or more) work-items access the same memory location such that:  At least one of the accesses is a WRITE  There are no intervening synchronization operations  SC for DRF asks:  Programmers to ensure programs are DRF under SC  Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions © Copyright 2014 HSA Foundation. All Rights Reserved *S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
  155. 155. HSA SUPPORTS RELEASE CONSISTENCY  HSA’s memory model is based on RCSC:  All atomic_ld_scacq and atomic_st_screl are SC  Means coherence on all atomic_ld_scacq and atomic_st_screl to a single address  All atomic_ld_scacq and atomic_st_screl are program ordered per work-item (actually: sequenced order by language constraints)  Similar model adopted by ARMv8  HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example. © Copyright 2014 HSA Foundation. All Rights Reserved *Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
  156. 156. MAKING RELAXED CONSISTENCY WORK © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; $s2 = 0 && $s4 = 1
  157. 157. SEQUENTIAL CONSISTENCY FOR DRF  Two memory accesses participate in a data race if they  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  A program is data-race-free if no possible execution results in a data race.  Sequential consistency for data-race-free programs  Avoid everything else  HSA: Not good enough! © Copyright 2014 HSA Foundation. All Rights Reserved
  158. 158. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS  Remember the HSAIL Execution Model © Copyright 2014 HSA Foundation. All Rights Reserved device scope group scope wave scope platform scope
  159. 159. DATA-RACE-FREE IS NOT ENOUGH t1 t2 t3 t4 st_global 1, [&X] atomic_st_global_screl 0, [&flag] atomic_cas_global_scar 1, 0, [&flag] ... atomic_st_global_screl 0, [&flag] atomic_cas_global_scar 1, 0, [&flag] ld_global (??), [&x] group #1-2 group #3-4  Two ordinary memory accesses participate in a data race if they  Access same location  At least one is a store  Can occur simultaneously Not a data race… Is it SC? Well that depends (Diagram: scope SGlobal contains scopes S12 {t1, t2} and S34 {t3, t4}: is visibility implied by causality?) © Copyright 2014 HSA Foundation. All Rights Reserved
  160. 160. SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE  Two memory accesses participate in a heterogeneous race if  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  Are not synchronized with “enough” scope  A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.  Sequential consistency for heterogeneous-race-free programs  Avoid everything else © Copyright 2014 HSA Foundation. All Rights Reserved
  161. 161. HSA HETEROGENEOUS RACE FREE  HRF0: Basic Scope Synchronization  “enough” = both threads synchronize using identical scope  Recall example:  Contains a heterogeneous race in HSA t1 t2 t3 t4 st_global 1, [&X] atomic_st_global_screl_wg 0, [&flag] ... atomic_cas_global_scar_wg 1, 0, [&flag] ld_global (??), [&x] Workgroup #1-2 Workgroup #3-4 HSA Conclusion: This is bad. Don’t do it. © Copyright 2014 HSA Foundation. All Rights Reserved
  162. 162. HOW TO USE HSA WITH SCOPES Use smallest scope that includes all producers/consumers of shared data HSA Scope Selection Guideline Implication: Producers/consumers must be known at synchronization time  Want: For performance, use smallest scope possible  What is safe in HSA? Is this a valid assumption? © Copyright 2014 HSA Foundation. All Rights Reserved
  163. 163. REGULAR GPGPU WORKLOADS N M Define Problem Space Partition Hierarchically Communicate Locally N times Communicate Globally M times Well defined (regular) data partitioning + Well defined (regular) synchronization pattern =  Producer/consumers are always known Generally: HSA works well with regular data-parallel workloads © Copyright 2014 HSA Foundation. All Rights Reserved
  164. 164. t1 t2 t3 t4 st_global 1, [&X] atomic_st_global_screl_plat 0, [&flag] atomic_cas_global_scar_plat 1, 0, [&flag] ... atomic_st_global_screl_plat 0, [&flag] atomic_cas_global_scar_plat 1, 0, [&flag] ld $s1, [&x] IRREGULAR WORKLOADS  HSA: example is race free  Must upgrade wg (workgroup) -> plat (platform)  HSA memory model says:  ld $s1, [&x], will see value (1)! Workgroup #1-2 Workgroup #3-4 © Copyright 2014 HSA Foundation. All Rights Reserved
  165. 165. OPENCL HAS MEMORY MODELS TOO MAPPING ONTO HSA’S MEMORY MODEL
  166. 166.  It is straightforward to provide a mapping from OpenCL 1.x to the proposed model  OpenCL 1.x atomics are unordered and so map to atomic_op_X  Mapping for fences not shown but straightforward OPENCL 1.X MEMORY MODEL MAPPING OpenCL Operation HSA Memory Model Operation Atomic load atomic_ld_global_wg atomic_ld_group_wg Atomic store atomic_st_global_wg atomic_st_group_wg atomic_op atomic_op_global_comp atomic_op_group_wg barrier(…) fence ; barrier_wg © Copyright 2014 HSA Foundation. All Rights Reserved
  167. 167. OPENCL 2.0 BACKGROUND  Provisional specification released at SIGGRAPH’13, July 2013.  Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular workloads)  Key features:  Shared virtual memory, including platform atomics  Formally defined memory model based on C11 plus support for scopes  Includes an extended set of C11 atomic operations  Generic address space, that subsumes global, local, and private  Device to device enqueue  Out-of-order device side queuing model  Backwards compatible with OpenCL 1.x © Copyright 2014 HSA Foundation. All Rights Reserved
  168. 168. OPENCL 2.0 MEMORY MODEL MAPPING OpenCL Operation HSA Memory Model Operation Load memory_order_relaxed atomic_ld_[global | group]_relaxed_scope Store memory_order_relaxed atomic_st_[global | group]_relaxed_scope Load memory_order_acquire atomic_ld_[global | group]_scacq_scope Load memory_order_seq_cst atomic_ld_[global | group]_scacq_scope Store memory_order_release atomic_st_[global | group]_screl_scope Store memory_order_seq_cst atomic_st_[global | group]_screl_scope memory_order_acq_rel atomic_op_[global | group]_scar_scope memory_order_seq_cst atomic_op_[global | group]_scar_scope © Copyright 2014 HSA Foundation. All Rights Reserved
  169. 169. OPENCL 2.0 MEMORY SCOPE MAPPING OpenCL Scope HSA Scope memory_scope_sub_group _wave memory_scope_work_group _wg memory_scope_device _component memory_scope_all_svm_devices _platform © Copyright 2014 HSA Foundation. All Rights Reserved
  170. 170. OBSTRUCTION-FREE BOUNDED DEQUES AN EXAMPLE USING THE HSA MEMORY MODEL
  171. 171. CONCURRENT DATA-STRUCTURES  Why do we need such a memory model in practice?  One important application of memory consistency is in the development and use of concurrent data-structures  In particular, there is a class of data-structure implementations that provide non-blocking guarantees:  wait-free; An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes  In practice it is very hard to build efficient data-structures that meet this requirement  lock-free; An algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress  In practice lock-free algorithms are implemented by work-items cooperating with one another enough to allow progress  obstruction-free; An algorithm is obstruction-free if a work-item, running in isolation, can make progress © Copyright 2014 HSA Foundation. All Rights Reserved
  172. 172. Emerging Compute Cluster BUT WHY NOT JUST USE MUTUAL EXCLUSION? © Copyright 2014 HSA Foundation. All Rights Reserved Fabric & Memory Controller Krait CPU Adreno GPU Krait CPU Krait CPU Krait CPU MMU MMUs 2MB L2 Hexagon DSP MMU ?? ?? Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice
  173. 173. CONCURRENT DATA-STRUCTURES  Emerging heterogeneous compute clusters mean we need:  To adapt existing concurrent data-structures  To develop new concurrent data-structures  Lock based programming may still be useful but often these algorithms will need to be lock-free  Of course, this is a key application of the HSA memory model  To showcase this we highlight the development of a well known (HLM) obstruction-free deque* © Copyright 2014 HSA Foundation. All Rights Reserved *Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-Free Synchronization: Double-Ended Queues as an Example. (2003), 522–529.
  174. 174. HLM - OBSTRUCTION-FREE DEQUE  Uses a fixed length circular queue  At any given time, reading from left to right, the array will contain:  Zero or more left-null (LN) values  Zero or more dummy-null (DN) values  Zero or more right-null (RN) values  At all times there must be:  At least two different null values  At least one LN or DN, and at least one DN or RN  Memory consistency is required to allow multiple producers and multiple consumers, potentially happening in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents) © Copyright 2014 HSA Foundation. All Rights Reserved
  175. 175. HLM - OBSTRUCTION-FREE DEQUE © Copyright 2014 HSA Foundation. All Rights Reserved LNLN vLN RNv RNRN left right Key: LN – left null value RN – right null value v – value left – left hint index right – right hint index
  176. 176. C REPRESENTATION OF DEQUE struct node { uint64_t type : 2; // null type (LN, RN, DN) uint64_t counter : 8 ; // version counter to avoid ABA uint64_t value : 54 ; // index value stored in queue }; struct queue { unsigned int size; // size of bounded buffer node * array; // backing store for deque itself }; © Copyright 2014 HSA Foundation. All Rights Reserved
  177. 177. HSAIL REPRESENTATION  Allocate a deque in global memory using HSAIL @deque_instance: align 64 global_u32 &size; align 8 global_u64 &array; © Copyright 2014 HSA Foundation. All Rights Reserved
  178. 178. ORACLE  Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);  Which given a deque  returns (%k) the position of the left-most RN  atomic_ld_global_scacq used to read node from array  Makes one if necessary (i.e. if there are only LN or DN)  atomic_cas_global_scar required to make new RN  returns (%left) the left node (i.e. the value to the left of the left-most RN position)  returns (%right) the right node (i.e. the value at position (%k)) © Copyright 2014 HSA Foundation. All Rights Reserved
  179. 179. RIGHT POP function &right_pop(arg_u32 %err, arg_u64 %result) (arg_u64 %deque) { // load queue address ld_arg_u64 $d0, [%deque]; @loop_forever: // setup and call right oracle to get next RN arg_u32 %k; arg_u64 %current; arg_u64 %next; call &rcheck_oracle (%k, %current, %next) (%deque); ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next]; // current.type($d5) shr_u64 $d5, $d1, 62; // current.counter($d6) and_u64 $d6, $d1, 0x3FC0000000000000; shr_u64 $d6, $d6, 54; // current.value($d7) and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF; // next.counter($d8) and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54; brn @loop_forever ; } © Copyright 2014 HSA Foundation. All Rights Reserved
  180. 180. RIGHT POP – TEST FOR EMPTY // current.type($d5) == LN || current.type($d5) == DN cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN; or_b1 $c0, $c0, $c1; cbr $c0, @not_empty ; // current node index (%deque($d0) + (%k($s1) - 1) * 16) add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; add_u64 $d3, $d0, $s1; atomic_ld_global_scacq_u64 $d4, [$d3]; cmp_neq_b1_u64 $c0, $d4, $d1; cbr $c0, @not_empty; st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY ret; @not_empty: © Copyright 2014 HSA Foundation. All Rights Reserved
  181. 181. RIGHT POP – TRY READ/REMOVE NODE // $d9 = (RN, next.cnt+1, 0) add_u64 $d8, $d8, 1; shl_u64 $d8, $d8, 54; shl_u64 $d9, RN, 62; or_u64 $d9, $d9, $d8; // cas(deq+k, next, node(RN, next.cnt+1, 0)) atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9; cmp_neq_u64 $c0, $d9, $d2; cbr $c0, @cas_failed; // $d9 = (RN, current.cnt+1, 0) add_u64 $d6, $d6, 1; shl_u64 $d6, $d6, 54; shl_u64 $d9, RN, 62; or_u64 $d9, $d9, $d6; // cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0)) atomic_cas_global_scar_u64 $d9, [$s1], $d1, $d9; cmp_neq_u64 $c0, $d9, $d1; cbr $c0, @cas_failed; st_arg_u32 SUCCESS, [%err]; st_arg_u64 $d7, [%result]; ret; @cas_failed: // loop back around and try again © Copyright 2014 HSA Foundation. All Rights Reserved
  182. 182. TAKE AWAYS  HSA provides a powerful and modern memory model  Based on the well-known SC for DRF  Defined as Release Consistency  Extended with scopes as defined by HRF  OpenCL 2.0 introduces a new memory model  Also based on SC for DRF  Also defined in terms of Release Consistency  Also extended with scopes as defined in HRF  Has a well defined mapping to HSA  Concurrent algorithm development for emerging heterogeneous compute clusters can benefit from HSA and OpenCL 2.0 memory models © Copyright 2014 HSA Foundation. All Rights Reserved
  183. 183. HSA QUEUING MODEL HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
  184. 184. HSA QUEUEING, MOTIVATION
  185. 185. MOTIVATION (TODAY’S PICTURE) © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  186. 186. HSA QUEUEING: REQUIREMENTS
  188. 188. REQUIREMENTS  Three key technologies are used to build the user mode queueing mechanism  Shared Virtual Memory  System Coherency  Signaling  AQL (Architected Queueing Language) enables any agent to enqueue tasks © Copyright 2014 HSA Foundation. All Rights Reserved
  188. 188. SHARED VIRTUAL MEMORY
  189. 189. PHYSICAL MEMORY SHARED VIRTUAL MEMORY (TODAY)  Multiple Virtual memory address spaces © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY1 PHYSICAL MEMORY VA1->PA1 VA2->PA1 VIRTUAL MEMORY2
  190. 190. PHYSICAL MEMORY SHARED VIRTUAL MEMORY (HSA)  Common Virtual Memory for all HSA agents © Copyright 2014 HSA Foundation. All Rights Reserved CPU0 GPU VIRTUAL MEMORY PHYSICAL MEMORY VA->PA VA->PA
  191. 191. SHARED VIRTUAL MEMORY  Advantages  No mapping tricks, no copying back-and-forth between different PA addresses  Send pointers (not data) back and forth between HSA agents.  Implications  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).  Common mechanisms for address translation (and servicing address translation faults)  Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system. © Copyright 2014 HSA Foundation. All Rights Reserved
  192. 192. SHARED VIRTUAL MEMORY  Specifics  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.  HSA agents may reserve VA ranges for internal use via system software.  All HSA agents other than the host unit must use the lowest privilege level  If present, read/write access flags for page tables must be maintained by all agents.  Read/write permissions apply to all HSA agents, equally. © Copyright 2014 HSA Foundation. All Rights Reserved
  193. 193. GETTING THERE … © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  194. 194. CACHE COHERENCY
  195. 195. CACHE COHERENCY DOMAINS (1/3)  Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance. © Copyright 2014 HSA Foundation. All Rights Reserved
  196. 196. CACHE COHERENCY DOMAINS (2/3)  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Implications  Hardware coherency support between all HSA agents  Can take many forms  Stand alone Snoop Filters / Directories  Combined L3/Filters  Snoop-based systems (no filter)  Etc … © Copyright 2014 HSA Foundation. All Rights Reserved
  197. 197. CACHE COHERENCY DOMAINS (3/3)  Specifics  No requirement for instruction memory accesses to be coherent  Only applies to the Primary memory type.  No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes  Read-only image data is required to remain static during the execution of an HSA kernel.  No double mapping (via different attributes) in order to modify. Must remain static © Copyright 2014 HSA Foundation. All Rights Reserved
  198. 198. GETTING CLOSER … © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  199. 199. SIGNALING
  200. 200. SIGNALING (1/3)  HSA agents support the ability to use signaling objects  All creation/destruction of signaling objects occurs via HSA runtime APIs  From an HSA Agent you can directly access signaling objects:  Signaling a signal object (this will wake up HSA agents waiting upon the object)  Query current object value  Wait on the current object value (various conditions supported). © Copyright 2014 HSA Foundation. All Rights Reserved
  201. 201. SIGNALING (2/3)  Advantages  Enables asynchronous events between HSA agents, without involving the kernel  Common idiom for work offload  Low power waiting  Implications  Runtime support required  Commonly implemented on top of cache coherency flows © Copyright 2014 HSA Foundation. All Rights Reserved
  202. 202. SIGNALING (3/3)  Specifics  Only supported within a PASID  Supported wait conditions are =, !=, < and >=  Wait operations may return sporadically (no guarantee against false positives)  Programmer must test.  Wait operations have a maximum duration before returning.  The HSAIL atomic operations are supported on signal objects.  Signal objects are opaque  Must use dedicated HSAIL/HSA runtime operations © Copyright 2014 HSA Foundation. All Rights Reserved
  203. 203. ALMOST THERE… © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  204. 204. USER MODE QUEUING
  205. 205. ONE BLOCK LEFT © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  206. 206. USER MODE QUEUEING (1/3)  User mode Queueing  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.  Queues are created/destroyed via calls to the HSA runtime.  One (or many) agents enqueue packets, a single agent dequeues packets.  Requires coherency and shared virtual memory. © Copyright 2014 HSA Foundation. All Rights Reserved
  207. 207. USER MODE QUEUEING (2/3)  Advantages  Avoid involving the kernel/driver when dispatching work for an Agent.  Lower latency job dispatch enables finer granularity of offload  Standard memory protection mechanisms may be used to protect communication with the consuming agent.  Implications  Packet formats/fields are Architected – standard across vendors!  Guaranteed backward compatibility  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signaling)  More on this later…… © Copyright 2014 HSA Foundation. All Rights Reserved
  208. 208. SUCCESS! © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  209. 209. SUCCESS! © Copyright 2014 HSA Foundation. All Rights Reserved Application OS GPU Queue Job Start Job Finish Job
  210. 210. ARCHITECTED QUEUEING LANGUAGE, QUEUES
  211. 211. ARCHITECTED QUEUEING LANGUAGE  HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer  Single producer variant defined with some optimizations possible.  Queues consist of storage, read/write indices, ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected © Copyright 2014 HSA Foundation. All Rights Reserved Producer Producer Consumer Read Index Write Index Storage in coherent, shared memory Packets
  212. 212. ARCHITECTED QUEUING LANGUAGE  Packets are read and dispatched for execution from the queue in order, but may complete in any order.  There is no guarantee that more than one packet will be processed in parallel at a time  There may be many queues. A single agent may also consume from several queues.  Any HSA agent may enqueue packets  CPUs  GPUs  Other accelerators © Copyright 2014 HSA Foundation. All Rights Reserved
  213. 213. QUEUE STRUCTURE © Copyright 2014 HSA Foundation. All Rights Reserved Offset (bytes) Size (bytes) Field Notes 0 4 queueType Differentiate different queues 4 4 queueFeatures Indicate supported features 8 8 baseAddress Pointer to packet array 16 8 doorbellSignal HSA signaling object handle 24 4 size Packet array cardinality 28 4 queueId Unique per process 32 8 serviceQueue Queue for callback services intrinsic 8 writeIndex Packet array write index intrinsic 8 readIndex Packet array read index
  214. 214. QUEUE VARIANTS  queueType and queueFeatures together define queue semantics and capabilities  Two queueType values defined, other values reserved:  MULTI – queue supports multiple producers  SINGLE – queue supports single producer  queueFeatures is a bitfield indicating capabilities  DISPATCH (bit 0) if set then queue supports DISPATCH packets  AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets  All other bits are reserved and must be 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  215. 215. QUEUE STRUCTURE DETAILS  Queue doorbells are HSA signaling objects with restrictions  Created as part of the queue – lifetime tied to queue object  Atomic read-modify-write not allowed  size field value must be aligned to a power of 2  serviceQueue can be used by HSA kernel for callback services  Provided by application when queue is created  Can be mapped to HSA runtime provided serviceQueue, an application serviced queue, or NULL if no serviceQueue required © Copyright 2014 HSA Foundation. All Rights Reserved
  216. 216. READ/WRITE INDICES  readIndex and writeIndex properties are part of the queue, but not visible in the queue structure  Accessed through HSA runtime API and HSAIL operations  HSA runtime/HSAIL operations defined to  Read readIndex or writeIndex property  Write readIndex or writeIndex property  Add constant to writeIndex property (returns previous writeIndex value)  CAS on writeIndex property  readIndex & writeIndex operations treated as atomic in memory model  relaxed, acquire, release and acquire-release variants defined as applicable  readIndex and writeIndex never wrap  PacketID – the index of a particular packet  Uniquely identifies each packet of a queue © Copyright 2014 HSA Foundation. All Rights Reserved
  217. 217. PACKET ENQUEUE  Packet enqueue follows a few simple steps:  Reserve space  Multiple packets can be reserved at a time  Write packet to queue  Mark packet as valid  Producer no longer allowed to modify packet  Consumer is allowed to start processing packet  Notify consumer of packet through the queue doorbell  Multiple packets can be notified at a time  Doorbell signal should be signaled with last packetID notified  On small machine model the lower 32 bits of the packetID are used © Copyright 2014 HSA Foundation. All Rights Reserved
  218. 218. PACKET RESERVATION  Two flows envisaged  Atomic add writeIndex with number of packets to reserve  Producer must wait until packetID < readIndex + size before writing to packet  Queue can be sized so that wait is unlikely (or impossible)  Suitable when many threads use one queue  Check queue not full first, then use atomic CAS to update writeIndex  Can be inefficient if many threads use the same queue  Allows different failure model if queue is congested © Copyright 2014 HSA Foundation. All Rights Reserved
  219. 219. QUEUE OPTIMIZATIONS  Queue behavior is loosely defined to allow optimizations  Some potential producer behavior optimizations:  Keep local copy of readIndex, update when required  For single producer queues:  Keep local copy of writeIndex  Use store operation rather than add/cas atomic to update writeIndex  Some potential consumer behavior optimizations:  Use packet format field to determine whether a packet has been submitted rather than writeIndex property  Speculatively read multiple packets from the queue  Not update readIndex for each packet processed  Rely on value used for doorbellSignal to notify new packets  Especially useful for single producer queues © Copyright 2014 HSA Foundation. All Rights Reserved
  220. 220. POTENTIAL MULTI-PRODUCER ALGORITHM // Allocate packet uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1); // Wait until the queue is no longer full. uint64_t rdIdx; do { rdIdx = hsa_queue_load_read_index_relaxed(q); } while (packetID >= (rdIdx + q->size)); // calculate index uint32_t arrayIdx = packetID & (q->size-1); // copy over the packet, the format field is INVALID q->baseAddress[arrayIdx] = pkt; // Update format field with release semantics q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release); // ring doorbell, with release semantics (could also amortize over multiple packets) hsa_signal_send_relaxed(q->doorbellSignal, packetID); © Copyright 2014 HSA Foundation. All Rights Reserved
  221. 221. POTENTIAL CONSUMER ALGORITHM // Get location of next packet uint64_t readIndex = hsa_queue_load_read_index_relaxed(q); // calculate the index uint32_t arrayIdx = readIndex & (q->size-1); // spin while empty (could also perform low-power wait on doorbell) while (INVALID == q->baseAddress[arrayIdx].hdr.format) { } // copy over the packet pkt = q->baseAddress[arrayIdx]; // set the format field to invalid q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed); // Update the readIndex using HSA intrinsic hsa_queue_store_read_index_relaxed(q, readIndex+1); // Now process <pkt>! © Copyright 2014 HSA Foundation. All Rights Reserved
  222. 222. ARCHITECTED QUEUEING LANGUAGE, PACKETS
  223. 223. PACKETS © Copyright 2014 HSA Foundation. All Rights Reserved  Packets come in three main types with architected layouts, plus two non-task formats  Always Reserved & Invalid  Do not contain any valid tasks and are not processed (queue will not progress)  Dispatch  Specifies kernel execution over a grid  Agent Dispatch  Specifies a single function to perform with a set of parameters  Barrier  Used for task dependencies
