HC-4017, HSA Compilers Technology, by Debyendu Das


Presentation HC-4017 by Debyendu Das from the AMD Developer Summit (APU13) November 11-13, 2013.

Published in: Technology


  2. OUTLINE
     • H(eterogeneous) S(ystem) A(rchitecture) SW Stack
     • Architecture of HSA Compilers
     • Performance
     • HSA Compiler Deliverables
     • OpenCL™ 2.0 features
     • Conclusions and Future Direction
     2 | HSA COMPILER TECHNOLOGY | NOVEMBER 19, 2013 | PUBLIC
  3. HSA SOFTWARE STACK
     How do we deliver the HSA value proposition?
     • Make the GPU easily accessible
       ‒ Support mainstream languages
       ‒ Expandable to domain-specific languages
     • Make compute offload efficient
       ‒ Eliminate memory copying
       ‒ Low-latency dispatch
     • Make it ubiquitous
       ‒ Drive standards through the HSA Foundation
       ‒ Open-source key components
     • Optimized compiler technology
       ‒ Leverage the LLVM framework
       ‒ HSAIL as a new IR for heterogeneous computing
     Diagram labels (software stack): Applications; Application and system languages, domain-specific languages, etc. (e.g. OpenCL™, Java™, C++ AMP, Python, R, …); LLVM IR; HSAIL; HSA Runtime (HSA RT); HSA Hardware
  4. HSAIL
     • HSAIL (HSA Intermediate Language) as the SW interface
       ‒ A virtual ISA for parallel programs
       ‒ Finalized to a native ISA by a finalizer/JIT
       ‒ Accommodates rapid innovation in native GPU architectures
       ‒ HSAIL is expected to be stable and backward compatible across implementations
       ‒ Enables multiple hardware vendors to support HSA
     • Key design points and benefits for HSA compilers
       ‒ Adopt a thin finalizer approach
       ‒ Enable fast translation time and robustness in the finalizer
       ‒ Drive performance optimizations through high-level compilers (HLC)
       ‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations
     High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL
     Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA
     EDG – Edison Design Group; CLANG – LLVM FE; SPIR – Standard Portable Intermediate Representation
  5. Architecture of HSA Compilers
  6. TECHNOLOGY BASE FOR COMPILER COMPONENTS
     • H(igh) L(evel) C(ompiler) front-end
       ‒ C++ FE from Edison Design Group (EDG) under a proprietary license
       ‒ May support CLANG FE in the future
       ‒ Generates LLVM IR
     • HLC back-end
       ‒ LLVM optimizer and code-gen
       ‒ Generates HSAIL from LLVM IR
     • Finalizer
       ‒ Converts HSAIL to GPU ISA
       ‒ SSA-based optimizer
     • HSAIL assembler/disassembler (libHSAIL)
       ‒ Assembling, disassembling, validating HSAIL and BRIG (binary format of HSAIL)
     • Libraries
       ‒ Optimized implementation of OpenCL™ builtins
  7. OPENCL™ COMPILER ARCHITECTURE
     • Minimize architectural changes.
     • The OpenCL™ compiler is expected to continue evolving based on new specs from Khronos.
     • The HSA OpenCL™ compiler leverages the existing and evolving compiler architecture of LLVM.
     • Shifting aggressive optimizations toward the HLC
     • Thin finalizer
     Diagram labels: OpenCL™ Host Compiler; OpenCL™ Device Compiler; C/C++ Front End; EDG for OpenCL™ Kernels; Compiler Optimizations; LLVM Optimizer; X86 code generation; LLVM HSAIL code generation; Finalizer; Host Linker; x86 Executable with OpenCL™ API Calls; GPU ISA
  8. DEVICE COMPILER
     • Based on the LLVM optimizer
     • Custom HSAIL back-end
     • Parallel-aware compiler optimizations
     • SIMT-friendly code generation
     • GPU-specific optimizations
     • DWARF generation from BRIGContainer
     • Direct binary object generation
     Diagram (device compilation pipeline): Device code in LLVM-IR → LLVM optimizer → Optimized device code in LLVM-IR → LLVM HSAIL code generator → Device code in binary BRIG → libHSAIL BRIGStreamer → BRIG binary object (ELF with BRIG sections)
  9. libHSAIL – assembler/disassembler/validator for HSA
     • HIDEL - High Level HSAIL Description Language
     • Automatically generated code to:
       ‒ Access BRIG fields in a safe and effective way
       ‒ Validate BRIG and HSAIL conformance to spec
       ‒ Encapsulate BRIG version differences
     • Brigantine API to ease creation of BRIG on the fly
     • HSAIL<->BRIG assembler and disassembler
     • HSAIL->BRIG debug information generator
     • BRIG streaming routines
     • HSAIL test generation framework
     • HSAIL instruction-level simulation
     Diagram labels: libHSAIL clients (HLC (LLVM), Finalizer, Device linker, Loader, Test Generation); ASCII HSAIL; Disassembler; HSAILAsm; Scanner; Parser; libBrigDwarf; Brigantine; BrigContainer; Validator; Proxy classes; BifStreamer; BrigStreamer; Direct binary object generation; BRIG, BIF files
  10. FINALIZER
     • Fast optimizations for translation efficiency
     • Expects the HLC to perform heavyweight optimizations
     • Supports unstructured control flow
     • Dynamic calling convention
     • Optimized ISA libraries
     • Indirect branches
     • Exception handling
     • Offline mode available for caching ISA translation
     • Debugging support: mapping between BRIG and GPU ISA
     Diagram labels (finalizer stages): HSAIL; HSAIL-to-IR; IR (SSA); Optimizations on IR; Scheduler; Allocator; Assembler; GPU ISA
  11. HSA RT COMPILER INTERACTION
     Diagram labels:
     • High-level Models/Runtimes: OpenCL™, Java™, C++ AMP, …
     • Debugger/Profiler
     • HSA RT API Categories: Topology, Memory, Queues, Images, Signals, Tools, Dispatch, Syscall, Compiler Library, Direct3D, Compilation, Interop, OpenGL™
     • Other labels: KFD Thunk API, Kernel Mode, KFD, User-Mode, KMD
  12. HSA DEBUG INFORMATION
     • Two layers of debug information
       ‒ Source to BRIG
       ‒ BRIG to ISA
     • The "source line number" in the ISA DWARF line table is a BRIG code offset. The two line tables (source -> BRIG code offset, BRIG code offset -> ISA program counter) therefore chain together to map kernel source to an ISA program counter value.
     • Relocation support, to be used with BRIG linking
     • HSAIL assembly source -> BRIG mapping in DWARF is supported in libHSAIL
     • HSA-specific attributes that identify the ISA memory region of variables (global, group, etc.)
     • Allows:
       ‒ Setting breakpoints on kernel/HSAIL/ISA source lines
       ‒ Inspecting and modifying kernel source variables
       ‒ Stepping through kernel/HSAIL/ISA source
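The two-layer lookup described on this slide can be sketched as the composition of two tables. This is only an illustration: the table contents, offsets, addresses and function names below are made up, and real DWARF line tables are encoded as line-number programs rather than dictionaries.

```python
from bisect import bisect_right

# Hypothetical line tables; all offsets and addresses are invented for illustration.
# Layer 1: kernel source line -> BRIG code offset
SOURCE_TO_BRIG = {10: 0x00, 11: 0x18, 12: 0x40}
# Layer 2: BRIG code offset -> ISA program counter
BRIG_TO_ISA = {0x00: 0x1000, 0x18: 0x1024, 0x40: 0x1060}


def source_line_to_isa_pc(line):
    """Compose the two tables: source line -> BRIG offset -> ISA PC."""
    brig_offset = SOURCE_TO_BRIG[line]
    return BRIG_TO_ISA[brig_offset]


def isa_pc_to_source_line(pc):
    """Reverse lookup: find the source line whose ISA range contains pc."""
    # Sort rows by ISA PC so that each row covers [its pc, next row's pc).
    rows = sorted((BRIG_TO_ISA[off], line) for line, off in SOURCE_TO_BRIG.items())
    pcs = [row_pc for row_pc, _ in rows]
    index = bisect_right(pcs, pc) - 1
    if index < 0:
        raise ValueError("pc precedes the first line-table row")
    return rows[index][1]
```

A debugger setting a breakpoint on a source line would use the forward direction; mapping a stopped program counter back to a source line uses the reverse direction.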
  13. Performance
  14. PERFORMANCE
     • Avoid memory copying and use system buffers
     • Device memory can be used at the developer's choice
     • Flat pointer support allows advanced data structures, such as trees, to be used to optimize algorithms
     • Genuine 64-bit support provides access to more memory, so tasks need not be split and reduction code can be avoided
     • Reduced user-mode dispatch cost
     • The new HSAIL standard makes it possible to leverage modern HW features
     • Evolving compiler optimizations give better performance than the previous SW, even without changes
     • Platform atomics provide an improved way to exploit parallelism for lock-free programs
  15. EVOLVING COMPILER
     [Chart: SHOC benchmark, level1 OpenCL™ set on “Kaveri” HW. Series: “OpenCL™ with HSA” and “Previous OpenCL™”. Benchmarks: FFT, MD, SGEMM, Sort, Spmv, Stencil2D. Vertical axis: 0% to 300%.]
  16. HSA Compiler Release
  17. HSA COMPILER DELIVERABLES
     • Q2 2014: OpenCL™/LLVM/HSAIL compiler with HSA support enabled
       ‒ OpenCL™ 1.2 + AMD extensions on Windows® and Linux®
       ‒ HSA RT API 1.0 with HSAIL and AQL inputs
       ‒ SVM and platform atomics (OCL 2.0 features)
     • Q1 2015: Second release of the OpenCL™/LLVM/HSAIL compiler, with higher performance and support for additional hardware
       ‒ OpenCL™ 2.0 on Windows and Linux
       ‒ One single compiler stack for OpenCL™ on AMD platforms
     • Compiler components to be delivered:
       ‒ High-level compilers (HLC)
       ‒ HSA Finalizer
       ‒ libHSAIL
       ‒ Libraries: language-specific & math
       ‒ DWARF generation for debugging
     • Open Source
  18. OpenCL™ 2.0 features
  19. OPENCL™ 2.0 SUPPORT FOR SVM (SHARED-VIRTUAL MEMORY)
     • Shared-Virtual Memory (SVM)
       ‒ Address space exposed to both host and device
       ‒ Makes a ‘pointer’ meaningful to both host and device
       ‒ Logically extends a portion of the global memory into the host address space, giving work-items access to the host address space
     • Three types of SVM supported
       ‒ Coarse-Grained Buffer
         ‒ Can be used to share linked lists and similar data structures between CPU and GPU, but memory synchronization happens only at kernel entry/exit points and at the level of the entire buffer
         ‒ Map/unmap calls are used as synchronization points
         ‒ Requires the clSVMAlloc() call
       ‒ Fine-Grained Buffer
         ‒ Can be used to share individual bytes in a buffer. Memory synchronization happens at kernel entry/exit as well as at atomic call points
         ‒ Requires the clSVMAlloc() call
       ‒ Fine-Grained System
         ‒ Can be used to share individual bytes appearing anywhere in system memory. Memory synchronization happens at kernel entry/exit as well as at atomic call points
         ‒ A ‘normal malloc’ is able to provide access to SVM
  20. OPENCL™ 2.0 SUPPORT FOR ‘PLATFORM ATOMICS’
     • Follows the C11 and C++11 specs on atomics, adding a memory_scope argument alongside memory_order
     • Load/store
       ‒ void atomic_store_explicit(volatile global A *object, C desired, memory_order order)
       ‒ C atomic_load_explicit(volatile A *object, memory_order order, memory_scope scope)
     • Exchange/Compare-Exchange
       ‒ C atomic_exchange_explicit(volatile global A *object, C desired, memory_order order)
     • Fetch-and-modify
       ‒ C atomic_fetch_add(sub)_explicit(volatile global A *object, M operand, memory_order order, memory_scope scope)
     • Fence
       ‒ void atomic_work_item_fence(cl_mem_fence_flags flags, memory_order order, memory_scope scope)
     • Flag
       ‒ bool atomic_flag_test_and_set_explicit(volatile atomic_flag *object, memory_order order, memory_scope scope)
  21. DEVICE ENQUEUE
     • The OpenCL™ 2.0 spec introduces the concept of enqueuing by the device (GPU). The idea is to launch a new kernel from the running (parent) kernel.
     • Helpful in cases where there is “enough” data parallelism within the kernel that can be exploited by launching a new kernel. Doing so without going back to the host can lead to better performance.
     • The new kernel is launched by the device WITHOUT support from the HSA RT.
     • The compiler generates the code in BRIG to enqueue the “child” kernel. This includes creating the AQL queue element, filling the queue structure and finally enqueuing the kernel (by using AQL commands).
     • The challenges are
       ‒ To create new buffers for filling the data into the kernel (without RT support)
       ‒ To enqueue the new kernel in a thread-safe manner (multiple GPU threads may be enqueuing concurrently). For this, we are using platform atomics.
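The thread-safety challenge above can be illustrated with a small model in which each enqueuing thread reserves its own queue slot through an atomic fetch-and-add. This is a sketch under stated assumptions, not the compiler's generated code: `AtomicCounter`, `DispatchQueue` and `parent_kernel` are hypothetical names, Python threads stand in for GPU work-items, and a lock emulates the hardware atomic. Publishing the packet to the packet processor is left out.

```python
import threading


class AtomicCounter:
    """Emulates an atomic fetch-add; a lock stands in for the hardware atomic."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount=1):
        with self._lock:
            old = self._value
            self._value += amount
            return old


class DispatchQueue:
    """Hypothetical fixed-size queue of dispatch packets."""

    def __init__(self, capacity):
        self.packets = [None] * capacity
        self.write_index = AtomicCounter()

    def enqueue(self, kernel_name, args):
        # Reserve a unique slot first, so concurrent writers never collide.
        slot = self.write_index.fetch_add(1)
        if slot >= len(self.packets):
            raise OverflowError("queue full")
        # Only the thread that reserved this slot fills it in.
        self.packets[slot] = {"kernel": kernel_name, "args": args}
        return slot


def parent_kernel(queue, work_items):
    """Each simulated work-item enqueues one child kernel concurrently."""
    threads = [
        threading.Thread(target=queue.enqueue, args=("childKernel", (i,)))
        for i in range(work_items)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because the slot index is handed out atomically, no two work-items write the same packet, whatever order the threads run in.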
  22. DEVICE ENQUEUE – AN EXAMPLE
         kernel void childKernel(global int *a)
         {
             ……
         }

         kernel void parentKernel(global int *b)
         {
             ndrange_t ndrange;
             /* Divide the work ‘b’ into many parts */
             if (more_work_available(b)) {
                 void (^myblockChild)(void) = ^{ childKernel(b); };
                 /* Flag name as on the slide; the final OpenCL™ 2.0 spec calls it CLK_ENQUEUE_FLAGS_WAIT_KERNEL */
                 enqueue_kernel(get_default_queue(), CLK_WAIT_KERNEL, ndrange, myblockChild);
             }
         }
     • OpenCL™ 2.0 supports many more sophisticated ways of enqueuing, using various events (wait for child), various ndranges, etc.
     • The HSA compiler has implemented some of the features of the OpenCL™ 2.0 enqueue kernel. The CLANG blocks shown above may not be implemented in the first version.
  23. Conclusions and Future Direction
  24. CONCLUSIONS AND FUTURE DIRECTIONS
     • Controlled alpha release of the first HSA compiler
       ‒ Supports OpenCL™ 1.2 and a few features from OpenCL™ 2.0
       ‒ Performance tuning
     • OpenCL™ 2.0 support
     • Open source
       ‒ May contribute to LLVM
       ‒ May open-source the backend
  25. DISCLAIMER & ATTRIBUTION
     The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
     AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
     ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a registered trademark of the Khronos Group. Windows® is a trademark of Microsoft and Linux® is a trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
  27. ABI, LINKING, & LOADING
     • The HSAIL spec enables traditional linking tasks (e.g. symbol resolution) to be spread across static and dynamic stages
       ‒ Static host and device linking
         ‒ Merge multiple object files (including host and device) into a single executable
         ‒ Device linker created to resolve symbols across multiple compilation units
         ‒ Host linker unmodified
       ‒ Pre-ISA loading
         ‒ Load statically allocated, globally scoped global memory data in HSAIL
         ‒ Track the addresses of globally scoped data symbols
       ‒ ISA linking and loading
         ‒ Finalizer resolves all local code and data symbols
         ‒ Finalizer and RT collectively resolve function symbols
         ‒ Resolve globally scoped data symbols by getting addresses from the pre-ISA loader
         ‒ Allocate/resolve globally scoped group and private memory data per dispatch
         ‒ RT loads the ISA binary for execution after translation of the kernel closure is done
     • The compiler lib drives the invocations of compiler components and functionality from the OpenCL™ RT and HSA Core RT
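As a rough illustration of the static device-linking step (resolving symbols across multiple compilation units), the sketch below merges per-unit symbol definitions into one table and reports what remains unresolved for the later stages. The unit layout, symbol names, sizes and the `link_device_units` function are hypothetical and do not reflect the actual BRIG object format or device linker.

```python
def link_device_units(units):
    """Merge symbol definitions from several compilation units.

    Each unit is a dict with "defines" (symbol -> size in bytes) and
    "uses" (list of referenced symbols). Returns the merged symbol table
    (symbol -> offset in the linked image) and the sorted list of symbols
    that no unit defines.
    """
    symbol_table = {}
    next_offset = 0
    for unit in units:
        for symbol, size in unit["defines"].items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate definition of {symbol}")
            # Lay definitions out back to back in the linked image.
            symbol_table[symbol] = next_offset
            next_offset += size
    # References that no unit satisfies are left for a later stage.
    unresolved = sorted(
        {s for unit in units for s in unit["uses"] if s not in symbol_table}
    )
    return symbol_table, unresolved
```

In this simplified model, anything left in the unresolved list would be handed on to the dynamic stages listed on the slide (pre-ISA loading, then ISA linking and loading).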
  28. KEY HSAIL FEATURES
     • Parallel
     • Shared virtual memory
     • Portable across vendors in the HSA Foundation
     • Stable across multiple product generations
     • Consistent numerical results (IEEE-754 with defined min accuracy)
     • Fast, robust, simple finalization step (no monthly updates)
     • Good performance (little need to write in ISA)
     • Supports all of OpenCL™ and C++ AMP
     • Supports Java™, C++, and other languages as well
  29. REGISTERS
     • Four classes of registers
       ‒ C: 1-bit, control registers
       ‒ S: 32-bit, single-precision FP or int
       ‒ D: 64-bit, double-precision FP or long int
       ‒ Q: 128-bit, packed data
     • Fixed number of registers:
       ‒ 8 C registers
       ‒ S, D, Q share a single pool of resources
       ‒ S + 2*D + 4*Q <= 128
       ‒ Up to 128 S or 64 D or 32 Q (or a blend)
     • Register allocation done in the high-level compiler
       ‒ Finalizer doesn’t have to perform expensive register allocation
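The shared-pool constraint S + 2*D + 4*Q <= 128 can be checked with a few lines of arithmetic. The function name and the `pool` parameter below are illustrative only; the weights and the limit of 128 come from the slide.

```python
def fits_register_pool(s, d, q, pool=128):
    """Return True if s S-registers, d D-registers and q Q-registers
    fit in the shared pool, using the slide's rule S + 2*D + 4*Q <= 128."""
    if min(s, d, q) < 0:
        raise ValueError("register counts must be non-negative")
    # A D register costs two pool units and a Q register costs four.
    return s + 2 * d + 4 * q <= pool
```

For example, a blend of 64 S, 16 D and 8 Q registers uses 64 + 32 + 32 = 128 units and exactly fills the pool, while adding one more Q register would exceed it.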
  30. HSAIL INSTRUCTION SET - OVERVIEW
     • Similar to assembly language for a RISC CPU
       ‒ Load-store architecture
         ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)
         add_u64 $d1, $d2, 24 ; $d1= $d2+24
     • 136 opcodes (Java™ bytecode has 200)
       ‒ Floating point (single, double, half (f16))
       ‒ Integer (32-bit, 64-bit)
       ‒ Some packed operations
       ‒ Branches
       ‒ Function calls
       ‒ Platform atomic operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
         ‒ Synchronize host CPU and HSA Component!
     • Text and binary formats (“BRIG”)
  31. SEGMENTS AND MEMORY
     • 7 segments of memory
       ‒ global, readonly, group, spill, private, arg, kernarg
       ‒ Memory instructions can (optionally) specify a segment
         ld_global_u64 $d0, [$d6]
         ld_group_u64 $d0,[$d6+24]
         st_spill_f32 $s1,[$d6+4]
         ld_kernarg_u64 $d6, [%_arg0]
         ld_u64 $d0,[$d6+24] ; flat
     • Global Segment
       ‒ Visible to all HSA agents (including host CPU)
     • Group Segment
       ‒ Provides high-performance memory shared in the work-group by every work-item
     • Spill, Private, Arg Segments
       ‒ Represent different regions of a per-work-item stack, typically generated by the compiler
     • Kernarg Segment
       ‒ Programmer writes the kernarg segment to pass arguments to a kernel
     • Read-Only Segment
       ‒ Remains constant during execution of the kernel
     • Flat Addressing
       ‒ Each segment mapped into virtual address space
       ‒ Flat addresses can map to segments based on virtual address
       ‒ Instructions with no explicit segment use flat addressing
       ‒ Very useful for high-level language support (e.g. classes, libraries)
       ‒ Aligns well with the OpenCL™ 2.0 “generic” addressing feature
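The idea that a flat address maps to a segment based on its virtual address can be sketched as a range lookup. The address windows below are invented for illustration (real segment apertures are implementation-defined), and treating every address outside those windows as global is an assumption of this sketch, as are the names `SEGMENT_WINDOWS` and `segment_of`.

```python
# Hypothetical virtual-address windows: segment -> (start, end), end exclusive.
SEGMENT_WINDOWS = {
    "group": (0x1000_0000, 0x1001_0000),
    "private": (0x2000_0000, 0x2010_0000),
}


def segment_of(flat_address):
    """Map a flat address to (segment name, offset within that segment)."""
    for name, (start, end) in SEGMENT_WINDOWS.items():
        if start <= flat_address < end:
            return name, flat_address - start
    # Assumption: anything outside the windows is treated as a global address.
    return "global", flat_address
```

A flat load such as `ld_u64 $d0,[$d6+24]` would, in this model, first classify the address and then access the matching segment.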
  32. HSAIL AND SPIR
     Feature | HSAIL | SPIR
     Intended Users | Compiler developers who want to control their own code generation. | Compiler developers who want a fast path to acceleration across a wide variety of devices.
     IR Level | Low-level, just above the machine instruction set | High-level, just below LLVM-IR
     Back-end code generation | Thin, fast, robust. | Flexible. Can include many optimizations and compiler transformations, including register allocation.
     Where are compiler optimizations performed? | Most done in high-level compiler, before HSAIL generation. | Most done in back-end code generator, between SPIR and device machine instruction set
     Registers | Fixed-size register pool | Infinite
     SSA Form | No | Yes
     Binary format | Yes | Yes
     Code generator for LLVM | Yes | Yes
     Back-end device targets | Modern GPU architectures supported by members of the HSA Foundation | Any OpenCL™ device including GPUs, CPUs, FPGAs
     Memory Model | Relaxed consistency with acquire/release, barriers, and fine-grained barriers | Flexible. Can support the OpenCL™ 1.2 Memory Model