
Heterogeneous System Architecture Overview



In this video from SC13, Vinod Tipparaju presents a Heterogeneous System Architecture overview.

"The HSA Foundation seeks to create applications that seamlessly blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing on the DSP via high bandwidth shared memory access enabling greater application performance at low power consumption. The Foundation is defining key interfaces for parallel computation utilizing CPUs, GPUs, DSPs, and other programmable and fixed-function devices, thus supporting a diverse set of high-level programming languages and creating the next generation in general-purpose computing."


Published in: Technology


  3. SOME TERMINOLOGY
     • HSA is Heterogeneous System Architecture, not just GPUs
     • HSA Component: IP that satisfies the architecture requirements and provides identified features
     • SoC: System on Chip, a collection of various IPs
       • E.g. the AMD APU (Accelerated Processing Unit) integrates AMD/ARM CPU cores and graphics IP
       • It is possible to conceive of companies building just parts of the IP
     • HSAIL: HSA Intermediate Language, a very low-level SIMT language
     • HSA Agent: something that can participate in the HSA memory subsystem (i.e. respects page sizes, memory properties, atomics, etc.)
     AMD Confidential - NDA Required
  4. WHAT IS HSA?
     • Systems Architecture
       • From a hardware point of view, system architecture requirements necessary
       • Specifies shared memory, cache coherence domains, concept of clocks, context switching, memory based signaling, topology
       • Rules governing design and agent behavior
     • Programmers Reference (HSAIL)
       • An intermediate representation, very low level
       • Vendor independence, device compiler optimizations
       • Abstracts HW, or can serve as the lowest level instruction set
     • Runtime
       • API that wraps features like user mode queues, clocks, signalling, etc.
       • Provides execution control
     • Tools
       • Supporting profilers, debuggers and compilers
       • Supports tools
         • Unique debugging support that greatly simplifies implementing debuggers
         • Excellent profiling support with some user mode access
  5. HSA ORIGINS, EVOLUTION IN COMPUTE
     • Next step from AMD in general purpose compute
     • Evolutionary step
       • Exceptional graphics IP
       • Lot of experience in building general purpose CPUs
       • Natural to utilize graphics IP for doing compute
     • Prior step was HW integration phase
       • GPU was pre-GCN (Graphics Core Next)
       • Did not have all features to support HSA
       • Memory management unit was still evolving
  6. TAKING THE HW INTEGRATION TO ITS NATURAL CONCLUSION
     • Architectural and system integration
     • Extend architecture to make the component a first class citizen on the SoC
     • Fully-evolved MMU
     • Provide same level of support for tools as CPU
     • Provide context switching, preemption, full coherence
       • Helps simulators, migrations, checkpoints, etc.
     • Future, other HSA IP
  7. SOCS HAVE PROLIFERATED: MAKE THEM BETTER
     • SOCs have arrived and are a tremendous advance over previous platforms
     • SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
     • How do we make them even better?
       • Higher performance
       • Easier to optimize
       • Easier to program
       • Lower power
     • HSA unites accelerators architecturally
     • Early focus is APU (CPU with GPU compute accelerator), but HSA goes well beyond the GPU
  8. HIGH LEVEL USAGE SCENARIOS
     • Bulk-Synchronous Parallelism-like concurrent computation
       • Rather large parallel sections followed by synchronization
     • Outstanding support for task-based parallelism
       • 256 threads sufficient to fully fill the pipeline
         • Wavefront is 64 threads
       • Launch is quick
       • Support for execution schedules: excellent compiler target
         • Architected Queueing Language (AQL), dependencies
     • Advanced language support
       • Function calls
       • Virtual functions
       • Exception handling (throw-catch)
  10. HSA FOUNDATION
     • Founded in June 2012
     • Developing a new platform for heterogeneous systems
     • Specifications under development in working groups
     • Our first specification, the HSA Programmers Reference Manual, is already published and available on our web site
     • Additional specifications for System Architecture, Runtime Software and Tools are in process
  11. HSA FOUNDATION MEMBERSHIP: AUGUST 2013
     Founders Promoters Supporters Contributors Academic Associates
  12. HSA: AN OPEN PLATFORM
     • Open Architecture, membership open to all
       • HSA Programmers Reference Manual
       • HSA System Architecture
       • HSA Runtime
     • Delivered via royalty free standards
       • Royalty Free IP, Specifications and APIs
     • ISA agnostic for both CPU and GPU
     • Membership from all areas of computing
       • Hardware Companies
       • Operating Systems
       • Tools and Middleware
  14. HSA MEMORY MODEL
     • Defines visibility ordering between all threads in the HSA System
     • Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models
     • Relaxed consistency memory model for parallel compute performance
     • Visibility controlled by:
       • Load.Acquire
       • Store.Release
       • Barriers
  15. HSA QUEUING MODEL
     • User mode queuing for low latency dispatch
       • Application dispatches directly
       • No OS or driver in the dispatch path
     • Architected Queuing Layer
       • Single compute dispatch path for all hardware
       • No driver translation, direct to hardware
     • Allows for dispatch to queue from any agent
       • CPU or GPU
     • GPU self enqueue enables lots of solutions
       • Recursion
       • Tree traversal
       • Wavefront reforming
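A minimal sketch of the self-enqueue idea on the queuing slide, assuming a plain Python `deque` as a stand-in for a dispatch queue. The function name `traverse_with_self_enqueue` and the example tree are hypothetical and not part of any HSA API; the loop runs sequentially and only illustrates how a running task can enqueue follow-up work (here, a tree traversal) without returning to the host.

```python
from collections import deque


def traverse_with_self_enqueue(tree, root):
    """Sum node values; each task enqueues tasks for its own children.

    `tree` maps a node name to (value, list_of_child_names).
    Returns (total_value, number_of_tasks_dispatched).
    """
    queue = deque([root])  # stand-in for a dispatch queue (assumption)
    total = 0
    dispatched = 0
    while queue:
        node = queue.popleft()  # the consuming agent picks up one task
        dispatched += 1
        value, children = tree[node]
        total += value
        # The running task enqueues its follow-up work itself,
        # instead of asking the host to dispatch it.
        queue.extend(children)
    return total, dispatched


# Hypothetical example tree: a -> (b -> d), c
example_tree = {
    "a": (1, ["b", "c"]),
    "b": (2, ["d"]),
    "c": (3, []),
    "d": (4, []),
}
```

Only the root task comes from the host in this sketch; the other three tasks are enqueued by tasks that are already running.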
  16. HSAIL
  17. HSA INTERMEDIATE LAYER: HSAIL
     • HSAIL is a virtual ISA for parallel programs
       • Finalized to ISA by a JIT compiler or “Finalizer”
       • ISA independent by design for CPU & GPU
     • Explicitly parallel
       • Designed for data parallel programming
     • Support for exceptions, virtual functions, and other high level language features
     • Lower level than OpenCL SPIR
       • Fits naturally in the OpenCL compilation stack
     • Suitable to support additional high level languages and programming models:
       • Java, C++, OpenMP, Fortran etc.
  18. WHAT IS HSAIL?
     • HSAIL is the intermediate language for parallel compute in HSA
       • Generated by a high level compiler (LLVM, gcc, Java VM, etc.)
       • Low-level IR, close to machine ISA level
       • Compiled down to target ISA by an IHV “Finalizer”
       • Finalizer may execute at run time, install time, or build time
     • Example: OpenCL™ Compilation Stack using HSAIL
       • High-Level Compiler Flow (Developer): OpenCL™ Kernel -> EDG or CLANG -> SPIR -> LLVM -> HSAIL
       • Finalizer Flow (Runtime): HSAIL -> Finalizer -> Hardware ISA
  19. KEY HSAIL FEATURES
     • Parallel
     • Shared virtual memory
     • Portable across vendors in HSA Foundation
     • Stable across multiple product generations
     • Consistent numerical results (IEEE-754 with defined min accuracy)
     • Fast, robust, simple finalization step (no monthly updates)
     • Good performance (little need to write in ISA)
     • Supports all of OpenCL™ and C++ AMP™
     • Supports Java, C++, and other languages as well
  20. SIMT EXECUTION MODEL
     • HSAIL presents a “SIMT” execution model to the programmer
       • “Single Instruction, Multiple Thread”
       • Programmer writes program for a single thread of execution
       • Each work-item appears to have its own program counter
       • Branch instructions look natural
     • Hardware Implementation
       • Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
       • Actually one program counter for the entire SIMD instruction
       • Branches implemented with predication
     • SIMT Advantages
       • Easier to program (branch code in particular)
       • Natural path for mainstream programming models
       • Scales across a wide variety of hardware (programmer doesn’t see vector width)
       • Cross-lane operations available for those who want peak performance
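To make "branches implemented with predication" concrete, here is a small simulation in plain Python, written under the assumption that a list stands in for one SIMD vector and each list index is a lane. The function `simt_abs_diff` is a hypothetical illustration, not HSAIL and not any vendor's hardware behavior. The per-thread program is simply `if a >= b: out = a - b else: out = b - a`; the sketch shows how both sides of that branch can be issued for the whole vector while a mask decides which lanes commit results.

```python
def simt_abs_diff(a, b):
    """Compute |a[i] - b[i]| per lane using a predicate mask.

    Models one program counter for the whole vector: both branch
    sides are issued in turn, and the mask selects the active lanes.
    """
    lanes = len(a)
    # One vector-wide compare produces a per-lane predicate.
    mask = [a[i] >= b[i] for i in range(lanes)]
    out = [0] * lanes

    # "then" side: issued once, only lanes with the predicate set commit.
    for i in range(lanes):
        if mask[i]:
            out[i] = a[i] - b[i]

    # "else" side: issued once, only lanes with the predicate clear commit.
    for i in range(lanes):
        if not mask[i]:
            out[i] = b[i] - a[i]

    return out
```

When lanes disagree on the branch, both sides are issued, which is why divergent branches cost more than uniform ones even though the source code looks like ordinary scalar code.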
  22. OPPORTUNITIES WITH LLVM BASED COMPILATION
     Diagram labels: C99, C++11, C++ AMP, Objective C, OpenCL, OpenMP, KL, OSL, Renderscript, UPC, Halide, Rust, Julia, Mono, Fortran, Haskell; CLANG; LLVM
  24. HIGH LEVEL FEATURES OF HSA
     • Features currently being defined in the HSA Working Groups**
       • Unified addressing across all processors
       • Operation into pageable system memory
       • Full memory coherency
       • User mode dispatch
       • Architected queuing language
       • High level language support for GPU compute processors
       • Preemption and context switching
     ** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
  25. STATE OF GPU COMPUTING
     • GPUs are fast and power efficient: high compute density per-mm and per-watt
     • But: can be hard to program
     • Today’s Challenges
       • Separate address spaces
         • Copies
         • Can’t share pointers
       • New language required for compute kernel
         • EX: OpenCL™ runtime API
         • Compute kernel compiled separately from host code
     • Emerging Solution
       • HSA Hardware
         • Single address space
         • Coherent
         • Virtual address space
         • Fast access from all components
         • Can share pointers
       • Bring GPU computing to existing, popular programming models
         • Single-source, fully supported by compiler
         • HSAIL compiler IR (cross-platform!)
     Diagram label: PCIe
  26. MOTIVATION (TODAY’S PICTURE)
     Diagram with Application, OS and GPU lanes. Steps shown: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory
  27. SHARED VIRTUAL MEMORY (TODAY)
     • Multiple virtual memory address spaces
     Diagram: CPU0 uses VIRTUAL MEMORY1 (VA1->PA1) and the GPU uses VIRTUAL MEMORY2 (VA2->PA1), both over PHYSICAL MEMORY
  28. SHARED VIRTUAL MEMORY (HSA)
     • Common Virtual Memory for all HSA agents
     Diagram: CPU0 and the GPU use one VIRTUAL MEMORY (VA->PA) over PHYSICAL MEMORY
  29. SHARED VIRTUAL MEMORY
     • Advantages
       • No mapping tricks, no copying back-and-forth between different PA addresses
       • Send pointers (not data) back and forth between HSA agents
     • Implications
       • Common page tables (and common interpretation of architectural semantics such as shareability, protection, etc.)
       • Common mechanisms for address translation (and servicing address translation faults)
       • Concept of a process address space ID (PASID) to allow multiple, per-process virtual address spaces within the system
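The difference between the "today" and "HSA" pictures of shared virtual memory can be sketched with a toy page table. Everything here is an assumption made for illustration: the 4 KiB `PAGE_SIZE`, the `AddressSpace` and `Memory` classes, and the page numbers are all hypothetical, and a `KeyError` stands in for a translation fault. The point is only that with separate page tables a CPU pointer is meaningless to the GPU, while with a common page table the same pointer can be passed between agents.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class AddressSpace:
    """Toy page table: virtual page number -> physical page number."""

    def __init__(self, page_table):
        self.page_table = page_table

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        # A missing entry raises KeyError, standing in for a fault.
        return self.page_table[vpn] * PAGE_SIZE + offset


class Memory:
    """Toy physical memory addressed through an AddressSpace."""

    def __init__(self):
        self.cells = {}

    def store(self, space, va, value):
        self.cells[space.translate(va)] = value

    def load(self, space, va):
        return self.cells[space.translate(va)]


def legacy_demo():
    """Separate address spaces: VA1->PA1 on the CPU, VA2->PA1 on the GPU."""
    mem = Memory()
    cpu = AddressSpace({0x10: 7})
    gpu = AddressSpace({0x80: 7})
    cpu_ptr = 0x10 * PAGE_SIZE + 8
    mem.store(cpu, cpu_ptr, 42)
    try:
        mem.load(gpu, cpu_ptr)
        pointer_usable = True
    except KeyError:
        pointer_usable = False
    gpu_ptr = 0x80 * PAGE_SIZE + 8  # software must hand over a remapped address
    return pointer_usable, mem.load(gpu, gpu_ptr)


def hsa_demo():
    """Common page table: both agents dereference the same pointer."""
    mem = Memory()
    cpu = gpu = AddressSpace({0x10: 7})
    ptr = 0x10 * PAGE_SIZE + 8
    mem.store(cpu, ptr, 42)
    return mem.load(gpu, ptr)
```

In `legacy_demo` the data is reachable from the GPU only through a second, remapped address; in `hsa_demo` the pointer itself is what gets shared.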
  30. GETTING THERE …
     Diagram with Application, OS and GPU lanes. Steps shown: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory
  31. SHARED VIRTUAL MEMORY
     • Specifics
       • Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems
       • HSA agents may reserve VA ranges for internal use via system software
       • All HSA agents other than the host unit must use the lowest privilege level
       • If present, read/write access flags for page tables must be maintained by all agents
       • Read/write permissions apply to all HSA agents equally
  33. CACHE COHERENCY DOMAINS (1/2)
     • Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance
  34. CACHE COHERENCY DOMAINS (2/2)
     • Advantages
       • Composability
       • Reduced SW complexity when communicating between agents
       • Lower barrier to entry when porting software
     • Implications
       • Hardware coherency support between all HSA agents
       • Can take many forms
         • Stand-alone snoop filters / directories
         • Combined L3/filters
         • Snoop-based systems (no filter)
         • Etc.
  35. GETTING CLOSER …
     Diagram with Application, OS and GPU lanes. Steps shown: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory
  36. SIGNALING
  37. SIGNALING (1/2)
     • HSA agents support the ability to use signaling objects
       • All creation/destruction of signaling objects occurs via HSA runtime APIs
         • Object creation/destruction
       • From an HSA agent you can directly access signaling objects
         • Signal a signal object (this wakes up HSA agents waiting on the object)
         • Query the current object
         • Wait on the current object (various conditions supported)
  38. SIGNALING (2/2)
     • Advantages
       • Enables asynchronous interrupts between HSA agents, without involving the kernel
       • Common idiom for work offload
       • Low power waiting
     • Implications
       • Runtime support required
       • Commonly implemented on top of cache coherency flows
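The signal, query and wait operations on the signaling slides map naturally onto a value guarded by a condition variable. The sketch below uses Python's `threading.Condition` purely as an analogy; the `Signal` class, its method names, and the convention of a completion signal that starts at 1 and drops to 0 are assumptions for illustration, not the HSA runtime API, and real implementations are typically built on coherent memory rather than OS locks.

```python
import threading


class Signal:
    """Toy signaling object: a value that agents can set, query and wait on."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()

    def store(self, value):
        """Signal the object and wake every agent waiting on it."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def load(self):
        """Query the current value."""
        with self._cond:
            return self._value

    def wait_until(self, predicate, timeout=None):
        """Block until predicate(value) holds or the timeout expires."""
        with self._cond:
            self._cond.wait_for(lambda: predicate(self._value), timeout)
            return self._value


def run_offload():
    """Offload a small job and wait on its completion signal."""
    completion = Signal(1)  # assumed convention: 1 = pending, 0 = done
    result = []

    def agent():
        result.append(sum(range(10)))  # the offloaded work
        completion.store(0)            # signal completion

    worker = threading.Thread(target=agent)
    worker.start()
    completion.wait_until(lambda v: v == 0, timeout=5)
    worker.join()
    return result[0]
```

The waiter sleeps inside `wait_for` instead of spinning, which is the software counterpart of the "low power waiting" advantage listed on the slide.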
  39. ALMOST THERE…
     Diagram with Application, OS and GPU lanes. Steps shown: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory
  41. USER MODE QUEUEING (1/3)
     • User mode queueing
       • Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents
         • Dispatch packet is a job of work
       • Support for multiple queues per PASID
       • Multiple threads/agents within a PASID may enqueue packets in the same queue
       • Dependency mechanisms created for ensuring ordering between packets
  42. USER MODE QUEUEING (2/3)
     • Advantages
       • Avoid involving the kernel/driver when dispatching work for an agent
       • Lower latency job dispatch enables finer granularity of offload
       • Standard memory protection mechanisms may be used to protect communication with the consuming agent
     • Implications
       • Packet formats/fields are Architected: standard across vendors!
         • Guaranteed backward compatibility
       • Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signalling)
       • More on this later…
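A rough sketch of what "enqueued/dequeued via memory accesses" can look like: a fixed-size ring buffer with a write index advanced by the producer and a read index advanced by the consuming agent. The class `UserModeQueue`, its fields, and the string packets are hypothetical; the real architected packet format and doorbell signalling are not modelled, and this version is single-threaded, whereas multiple producers sharing a queue would need atomic index updates.

```python
class UserModeQueue:
    """Toy ring buffer standing in for a user mode dispatch queue."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.ring = [None] * size
        self.write_index = 0  # advanced by the producer (application)
        self.read_index = 0   # advanced by the consumer (agent)

    def enqueue(self, packet):
        """Write a packet into the next free slot; return its id or None if full."""
        if self.write_index - self.read_index == self.size:
            return None
        packet_id = self.write_index
        self.ring[packet_id & (self.size - 1)] = packet
        # A real producer would now notify the agent through a signal;
        # that step is left out of this sketch.
        self.write_index += 1
        return packet_id

    def dequeue(self):
        """Consume the oldest packet, or return None if the queue is empty."""
        if self.read_index == self.write_index:
            return None
        packet = self.ring[self.read_index & (self.size - 1)]
        self.read_index += 1
        return packet
```

Both operations are plain reads and writes on memory that the application and the agent share, with no system call on the dispatch path, which is the property the slide highlights.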
  43. SUCCESS!
     Diagram with Application, OS and GPU lanes. Steps shown: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory
  44. SUCCESS!
     Diagram with Application, OS and GPU lanes. Steps shown: Queue Job, Start Job, Finish Job
  46. SUFFIX ARRAYS
     • Suffix arrays are a fundamental data structure
       • Designed for efficient searching of a large text
         • Quickly locate every occurrence of a substring S in a text T
     • Suffix arrays are used to accelerate in-memory cloud workloads
       • Full text index search
       • Lossless data compression
       • Bio-informatics
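For readers unfamiliar with the data structure, the following sketch builds a suffix array and uses it to locate every occurrence of a substring S in a text T. It deliberately uses a naive sort-based construction, which is simple but slow for large texts; it is not the skew algorithm or the GPU implementation discussed in the talk. The function names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction: sorts the suffixes directly.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Suffixes that begin with pattern form one contiguous run in the
    suffix array, so two binary searches find the run's boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

Once the array exists, each query costs a logarithmic number of string comparisons plus the size of the answer, which is what makes the structure attractive for full text search over large in-memory data.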
  47. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA
     • By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies.
     • By offloading data parallel computations to the GPU, HSA increases performance and reduces energy for suffix array construction versus a single-threaded CPU.
     • Skew Algorithm for Compute SA (diagram stages): Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU
     • Results: +5.8x increased performance, -5x decreased energy
     • Source: M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to “Principles and Practice of Parallel Programming (PPoPP’13)”, February 2013.
     • Test system: AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM
  48. THE HSA FUTURE
     • Architected heterogeneous processing on the SOC
     • Programming of accelerators becomes much easier
     • Accelerated software that runs across multiple hardware vendors
     • Scalability from smart phones to super computers on a common architecture
     • GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model
     • Heterogeneous software ecosystem evolves at a much faster pace
     • Lower power, more capable devices in your hand, on the wall, in the cloud or at your supercomputing center