HSA Introduction
A presentation which I gave at the TCE Systems day on June 9th 2014

Link: http://events-tce.technion.ac.il/systems-day-2014/

Presentation Transcript

  • HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW Ofer Rosenberg
  • DISCLAIMER: This presentation is not an official HSA Foundation presentation. Most of the material is taken from the HSA presentations at HotChips 2013; some slides contain my own insights and opinions.
  • CONTENT  Introduction  hUMA  hQ  HSAIL  HSA Software  HSA Challenges  HSA Availability
  • INTRODUCTION
  • HISTORIC PERSPECTIVE  Accelerated System  Program runs on CPU  API to access accelerators  ASIC or firmware  Configurable, but operation is fixed  Heterogeneous System  Program runs on CPU  Offloads work to accelerators  GPU, DSP, etc.  Offloaded work is JITed (compiled at runtime)  (Diagram: evolution from distributed systems to SoC-based systems)
  • HSA FOUNDATION  Originated from AMD’s FSA – Fusion System Architecture  HSA Foundation Founded in June 2012 6
  • HSA FOUNDATION MEMBERS  Member tiers: Founders, Promoters, Supporters, Contributors, Academic, Associates (logo chart)  Slide taken from Phil Rogers' HSA Overview, HotChips 2013
  • WHAT IS HSA ALL ABOUT? (MY TAKE)  “Bring accelerators forward as a first-class processor”  Unified address space, pageable memory, coherency  Eliminate drivers from the dispatch path (user-mode queues)  Standardized SW stack built on top of a set of HW requirements  Improve interoperability between IP vendors  Unified architecture for accelerators  Start from the GPU, extend to DSP / FPGA / fixed-function accelerators, etc.  SoC-centric  Major features are optimal for a SoC environment (same memory/die)  Support for distributed systems is possible, yet inefficient (PCI atomics, others)  Slide taken from Phil Rogers' HSA Overview, HotChips 2013
  • HSA WORKING GROUPS  HSA Systems Architecture  hUMA – Unified Memory Model  hQ – HSA Queuing Model  HSA Programmer Reference Specification  HSAIL – HSA Intermediate Language  HSA System Runtime  HSA Compliance  HSA Tools  http://hsafoundation.com/standards/
  • OPENCL™ AND HSA  HSA is an optimized platform architecture for OpenCL™  Not an alternative to OpenCL™  OpenCL™ on HSA will benefit from  Avoidance of wasteful copies  Low-latency dispatch  Improved memory model  Pointers shared between CPU and GPU  OpenCL™ 2.0 shows considerable alignment with HSA  Many HSA member companies are also active with Khronos in the OpenCL™ working group  Slide taken from Phil Rogers' HSA Overview, HotChips 2013
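    To make the "pointers shared between CPU and GPU / no wasteful copies" point concrete, here is a minimal OpenCL 2.0 shared-virtual-memory sketch. It is not code from the presentation: it assumes a platform whose device supports fine-grained SVM buffers (which HSA-class hUMA hardware provides), and the kernel source, variable names, and omitted error checking / cleanup are mine.

    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

        // One allocation visible to both CPU and GPU: no clEnqueueWriteBuffer/ReadBuffer copies.
        const size_t n = 1024;
        int* data = static_cast<int*>(clSVMAlloc(
            ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(int), 0));
        for (size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);   // CPU writes directly

        const char* src = "__kernel void inc(__global int* d) { d[get_global_id(0)] += 1; }";
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, "-cl-std=CL2.0", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "inc", nullptr);

        clSetKernelArgSVMPointer(k, 0, data);            // pass the pointer itself, not a copy
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(q);

        std::printf("data[5] = %d\n", data[5]);          // CPU reads the GPU's result in place
        clSVMFree(ctx, data);                            // object releases omitted for brevity
        return 0;
    }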
  • hUMA
  • hUMA  HSA Unified Memory Architecture  Evolution of CPU/GPU memory systems: 1. CPU uses virtual addresses, GPU uses physical addresses  Memory had to be pinned  GPU can access only a limited area of CPU memory (aperture)  Requires a copy from system memory to GPU-visible memory  Pointer-based data structures can't be shared 2. CPU uses virtual addresses, GPU uses virtual addresses (but not the same ones)  Memory still has to be pinned  GPU can access the entire system memory  Copy is not required  Pointer-based data structures still can't be shared 3. hUMA
  • hUMA HSA Unified Memory Architecture  Shared Virtual Memory  CPU & GPU see the same addresses  Pageable Memory  GPU can (somehow) initiate a page fault  Cache coherency 13
  • SHARED VIRTUAL MEMORY  Advantages  No mapping tricks, no copying back and forth between different physical addresses  Send pointers (not data) back and forth between HSA agents  Note the hardware implications …  Common page tables (and common interpretation of architectural semantics such as shareability, protection, etc.)  Common mechanisms for address translation (and for servicing address-translation faults)  Concept of a process address space ID (PASID) to allow multiple per-process virtual address spaces within the system  Slide taken from Ian Bratt's HSA QUEUEING, HotChips 2013
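    A small conceptual C++ sketch of why "send pointers, not data" matters (my own illustration, not code from the deck; the Node/FlatNode types are hypothetical): without shared virtual memory a pointer-based structure has to be flattened into an index-based buffer before it is copied to GPU-visible memory, whereas with hUMA the host can hand the accelerator the head pointer as-is.

    #include <cstdio>
    #include <vector>

    struct Node {                 // a pointer-based structure: `next` is a real virtual address
        float value;
        Node* next;
    };

    // Without shared virtual memory the GPU cannot follow `next`, so the list must be
    // flattened into an index-based buffer and copied to GPU-visible memory.
    struct FlatNode {
        float value;
        int   next_index;         // -1 terminates the list
    };

    std::vector<FlatNode> flatten(const Node* head) {
        std::vector<FlatNode> out;
        for (const Node* n = head; n != nullptr; n = n->next) {
            int next = n->next ? static_cast<int>(out.size()) + 1 : -1;
            out.push_back({n->value, next});
        }
        return out;
    }

    int main() {
        Node c{3.0f, nullptr}, b{2.0f, &c}, a{1.0f, &b};
        std::vector<FlatNode> copyable = flatten(&a);   // the pre-hUMA workaround
        std::printf("flattened %zu nodes\n", copyable.size());
        // With hUMA / shared virtual memory none of this is needed: the same Node*
        // values are valid on every HSA agent, so the host simply passes &a and the
        // accelerator-side kernel walks n->next directly.
        return 0;
    }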
  • CACHE COHERENCY DOMAINS  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Note the hardware implications …  Hardware coherency support between all HSA agents  Can take many forms  Stand-alone snoop filters / directories  Combined L3/filters  Snoop-based systems (no filter)  Etc.  Slide taken from Ian Bratt's HSA QUEUEING, HotChips 2013
  • hQ
  • hQ Motivation 1. GPU Dispatch has a lot of overhead  SW/Driver stack overhead  User mode to Kernel mode switch 17
  • hQ Motivation 2. Master/Slave pattern is limiting (and has a lot of overhead)  CPU schedules work to the GPU  Communication overhead (report results → next kernel grid size)  Slide from “Introduction to Dynamic Parallelism”, Stephen Jones, NVIDIA Corporation
  • hQ HSA QUEUING MODEL  User mode queuing for low latency dispatch  Application dispatches directly  No OS or driver in the dispatch path  Architected Queuing Layer  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Allows for dispatch to queue from any agent  CPU or GPU  GPU can spawn its own work 19 Picture from AMD Blog: hQ: From Master/Slave to Masterpiece
  • ARCHITECTED QUEUEING LANGUAGE  HSA queues look just like standard shared-memory queues, supporting multi-producer, single-consumer  Support is also allowed for single-producer, single-consumer  Queues consist of storage, read/write indices, an ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected  (Diagram: producers and a consumer sharing packet storage in coherent, shared memory, with read and write indices)  Slide taken from Ian Bratt's HSA QUEUEING, HotChips 2013
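    A minimal C++ sketch of the structure the slide describes: packet storage plus read/write indices in ordinary memory (coherent and shared on an HSA system), drained by a single consumer. This is my illustration in the simpler single-producer, single-consumer form the slide also permits; the Packet fields are placeholders rather than the architected AQL packet format, and the real multi-producer protocol additionally marks each slot valid through the packet header before the consumer may read it.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct Packet {                      // placeholder fields, not the architected AQL layout
        uint64_t kernel_object;          // what to run
        uint64_t kernarg_address;        // where its arguments live
        uint32_t grid_size;              // how many work-items to launch
    };

    constexpr size_t QUEUE_SIZE = 256;   // power of two so the modulo stays cheap

    struct UserModeQueue {
        Packet storage[QUEUE_SIZE];                // on HSA: coherent, shared memory
        std::atomic<uint64_t> write_index{0};
        std::atomic<uint64_t> read_index{0};

        // Producer side: an application thread (or, on HSA, a GPU kernel) enqueues
        // work directly -- no driver call, no user/kernel mode switch.
        bool enqueue(const Packet& p) {
            uint64_t w = write_index.load(std::memory_order_relaxed);
            if (w - read_index.load(std::memory_order_acquire) >= QUEUE_SIZE)
                return false;                                  // queue is full
            storage[w % QUEUE_SIZE] = p;
            write_index.store(w + 1, std::memory_order_release);  // publish the packet
            // A real HSA queue would now ring the queue's doorbell signal.
            return true;
        }

        // Consumer side: the packet processor drains packets in order.
        bool dequeue(Packet& out) {
            uint64_t r = read_index.load(std::memory_order_relaxed);
            if (r == write_index.load(std::memory_order_acquire))
                return false;                                  // queue is empty
            out = storage[r % QUEUE_SIZE];
            read_index.store(r + 1, std::memory_order_release);
            return true;
        }
    };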
  • HSAIL
  • WHAT IS HSAIL?  HSAIL is the intermediate language for parallel compute in HSA  Generated by a high-level compiler (LLVM, gcc, Java VM, etc.)  Low-level IR, close to machine ISA level  Compiled down to the target ISA by an IHV “Finalizer”  Finalizer may execute at run time, install time, or build time  Example: OpenCL™ compilation stack using HSAIL  (Diagram: high-level compiler flow (developer): OpenCL™ kernel → EDG or Clang → SPIR → LLVM → HSAIL; finalizer flow (runtime): HSAIL → Finalizer → hardware ISA)  Slide taken from Ben Sander's HSAIL: Portable Compiler IR for HSA, HotChips 2013
  • HSAIL INSTRUCTION SET HIGHLIGHTS  “SIMT” – Single Instruction, Multiple Threads  ISA is scalar and describes one serial thread; parallelism is provided by the hardware  RISC-like  Load-store architecture  136 opcodes  Fixed number of registers  1 control  Pool of 512 bytes  Single  Double  Quad  7 segments of memory  global, readonly, group, spill, private, arg, kernarg  Example (Java lambda compiled to HSAIL):
    version 0:95: $full : $large;
    // static method HotSpotMethod<Main.lambda$2(Player)>
    kernel &run (
        kernarg_u64 %_arg0               // Kernel signature for lambda method
    ) {
        ld_kernarg_u64 $d6, [%_arg0];    // Move arg to an HSAIL register
        workitemabsid_u32 $s2, 0;        // Read the work-item global “X” coord
        cvt_u64_s32 $d2, $s2;            // Convert X gid to long
        mul_u64 $d2, $d2, 8;             // Adjust index for sizeof ref
        add_u64 $d2, $d2, 24;            // Adjust for actual elements start
        add_u64 $d2, $d2, $d6;           // Add to array ref ptr
        ld_global_u64 $d6, [$d2];        // Load from array element into reg
    @L0:
        ld_global_u64 $d0, [$d6 + 120];  // p.getTeam()
        mov_b64 $d3, $d0;
        ld_global_s32 $s3, [$d6 + 40];   // p.getScores()
        cvt_f32_s32 $s16, $s3;
        ld_global_s32 $s0, [$d0 + 24];   // Team getScores()
        cvt_f32_s32 $s17, $s0;
        div_f32 $s16, $s16, $s17;        // p.getScores()/teamScores
        st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
        ret;
    };
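    For orientation, the listing above was generated from a Java lambda over an array of Player objects. Roughly the logic it encodes, written as a plain C++ sketch (the Player/Team types and field names are hypothetical stand-ins; the hard-coded offsets in the HSAIL correspond to the objects' field layout):

    struct Team   { int scores; };                       // read via ld_global_s32 [$d0 + 24]
    struct Player { Team* team; int scores; float pctOfTeamScores; };

    // One HSAIL work-item handles one element; the kernel receives the array pointer
    // as its single kernarg and indexes it with the work-item's global X id.
    void run(Player** players, int n) {
        for (int i = 0; i < n; ++i) {                    // i plays the role of workitemabsid
            Player* p = players[i];
            p->pctOfTeamScores = static_cast<float>(p->scores)
                               / static_cast<float>(p->team->scores);
        }
    }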
  • HSA SOFTWARE
  • HIGH-LEVEL SOFTWARE STACK  Programming Languages  OpenCL 2.0  C++ AMP  Java (Aparapi/Sumatra)  HSA Runtime (user-mode driver)  System query  Access to JIT compilers  Access to queues  JIT Compilers  Offline or online (JIT)  LLVM compiler (LLVM → HSAIL)  HSAIL Finalizer (HSAIL → BIN)  Kernel Mode Driver  http://www.hsafoundation.com/hsa-developer-tools/
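    The "system query" role of the HSA runtime can be pictured with a short sketch. Note the caveat from the availability slide: the stack was not yet public when this deck was given, so the header and names below (hsa.h, hsa_init, hsa_iterate_agents, hsa_agent_get_info, hsa_shut_down) follow the HSA runtime interface as later published; treat this as an assumption-laden sketch rather than a tested program.

    #include <hsa.h>        // HSA runtime header (shipped as hsa/hsa.h in some distributions)
    #include <cstdio>

    // Callback invoked once per HSA agent (CPU, GPU, ...) found in the system.
    static hsa_status_t print_agent(hsa_agent_t agent, void*) {
        char name[64] = {0};
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        std::printf("agent: %s (%s)\n", name,
                    type == HSA_DEVICE_TYPE_GPU ? "GPU" :
                    type == HSA_DEVICE_TYPE_CPU ? "CPU" : "other");
        return HSA_STATUS_SUCCESS;      // keep iterating
    }

    int main() {
        if (hsa_init() != HSA_STATUS_SUCCESS) return 1;   // bring up the runtime
        hsa_iterate_agents(print_agent, nullptr);         // the "system query" step
        hsa_shut_down();
        return 0;
    }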
  • HSA OPEN SOURCE SOFTWARE  HSA will feature an open-source Linux execution and compilation stack  Allows a single shared implementation for many components  Enables university research and collaboration in all areas  Because it's the right thing to do
    Component Name – IHV or Common – Rationale:
    HSA Bolt Library – Common – Enable understanding and debug
    HSAIL Code Generator – Common – Enable research
    LLVM Contributions – Common – Industry and academic collaboration
    HSAIL Assembler – Common – Enable understanding and debug
    HSA Runtime – Common – Standardize on a single runtime
    HSA Finalizer – IHV – Enable research and debug
    HSA Kernel Driver – IHV – For inclusion in Linux distros
    Slide taken from Phil Rogers' “Heterogeneous System Architecture Overview”, HotChips 2013
  • JAVA HETEROGENEOUS ENABLEMENT ROADMAP  (Diagram, four stages: 1. JVM + APARAPI dispatching to the GPU via OpenCL™; 2. JVM + APARAPI targeting HSAIL through the HSA Finalizer; 3. JVM + APARAPI using the HSA Runtime, with an LLVM optimizer producing HSAIL for the Finalizer; 4. a Sumatra-enabled JVM generating HSAIL directly for the HSA Finalizer)  Slide taken from Phil Rogers' “Heterogeneous System Architecture Overview”, HotChips 2013
  • HSA Challenges (My Take)
  • HSA CHALLENGES – VENDOR SUPPORT  (Member logo chart: Founders, Promoters, Supporters, Contributors, Academic – slide taken from Phil Rogers' HSA Overview, HotChips 2013)  Missing some key players: Intel, NVIDIA, Apple, Microsoft, Google, …
  • HSA CHALLENGES – LANGUAGES SUPPORT  HSAIL (or LLVM) is not an attractive level to code at…  Leverage existing parallel languages/paradigms to exploit HSA features:  C++ AMP  OpenCL 2.0 (done!)  OpenMP  Add your favorite …  Extend popular languages to exploit HSA:  Scripting languages: Python  Web languages: HTML5, RoR, JavaScript, …  DSL languages
  • HSA CHALLENGES – SECURITY  HSA design had some security measures in mind:  Accelerator supports privilege levels, with user and privileged memory  Execute, read and write are protected by page-table entries  Support for fixed-time context scheduling (DoS protection)  But:  Advanced features such as hUMA & hQ are potential back doors  OS & security apps currently do not monitor the accelerators  Monitoring may require OS changes  A detailed specification can be used to find attack vectors  Some accelerator architectures may introduce a security flaw  Example: local memory on GPU
  • HSA Availability
  • HSA AVAILABILITY  AMD released “Kaveri”, the first HSA-capable SoC  HW supports hUMA, hQ, etc.  The HSA software stack is not publicly available yet (expected this year)  http://www.tomshardware.com/reviews/a88x-socket-fm2-motherboard,3764.html
  • HSA AVAILABILITY  Simulators:  HSAEMU – a full-system emulator for HSA platforms  Work done by the System SW Lab at NTHU (National Tsing Hua University)  http://hsaemu.org/  Code available on GitHub – https://github.com/SSLAB-HSA/HSAemu  HSAIL Simulator  Code available on GitHub – https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
  • THANK YOU 35
  • BACKUPS 36
  • REFERENCES
    HSA Foundation: http://hsafoundation.com/
    HSA whitepaper: http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf
    hUMA:
    http://www.slideshare.net/AMD/amd-heterogeneous-uniform-memory-access
    http://www.pcper.com/reviews/Processors/AMD-Details-hUMA-HSA-Action
    http://www.bit-tech.net/news/hardware/2013/04/30/amd-huma-heterogeneous-unified-memory-acces/
    http://www.amd.com/us/products/technologies/hsa/Pages/hsa.aspx#3
    AnandTech Hawaii architecture: http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/3
    hQ:
    http://community.amd.com/community/amd-blogs/amd-business/blog/2013/10/21/hq-from-masterslave-to-masterpiece
    http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf
    HSA purpose analysis by Moor Insights & Strategy: http://developer.amd.com/apu/wordpress/wp-content/uploads/2012/01/HSAF-Purpose-and-Outlook-by-Moor-Insights-Strategy.pdf
    IOMMUv2 spec: http://developer.amd.com/wordpress/media/2012/10/48882.pdf
  • hUMA & Discrete GPUs  hUMA can be extended beyond SoC, if the proper HW exists (such as Hawaii GPU…) 38 Slide from “IOMMUv2: the Ins and Outs of Heterogeneous GPU use”, AFDS 2012
  • HSAIL AND SPIR  (Comparison table)
    Intended users: HSAIL – compiler developers who want to control their own code generation; SPIR – compiler developers who want a fast path to acceleration across a wide variety of devices
    IR level: HSAIL – low-level, just above the machine instruction set; SPIR – high-level, just below LLVM-IR
    Back-end code generation: HSAIL – thin, fast, robust; SPIR – flexible, can include many optimizations and compiler transformations, including register allocation
    Where compiler optimizations are performed: HSAIL – mostly in the high-level compiler, before HSAIL generation; SPIR – mostly in the back-end code generator, between SPIR and the device machine instruction set
    Registers: HSAIL – fixed-size register pool; SPIR – infinite
    SSA form: HSAIL – no; SPIR – yes
    Binary format: HSAIL – yes; SPIR – yes
    Code generator for LLVM: HSAIL – yes; SPIR – yes
    Back-end device targets: HSAIL – modern GPU architectures supported by members of the HSA Foundation; SPIR – any OpenCL device, including GPUs, CPUs, FPGAs
    Memory model: HSAIL – relaxed consistency with acquire/release, barriers, and fine-grained barriers; SPIR – flexible, can support the OpenCL 1.2 memory model
    Slide taken from Ben Sander's HSAIL: Portable Compiler IR for HSA, HotChips 2013
  • HSA SOFTWARE STACK  (Diagram comparing the two stacks. Today's driver stack: Apps → Domain Libraries → OpenCL™/DX Runtimes and User Mode Drivers → Graphics Kernel Mode Driver → Hardware (APUs, CPUs, GPUs). HSA software stack: Apps → Task Queuing Libraries, HSA Domain Libraries, OpenCL™ 2.x Runtime → HSA Runtime and HSA JIT → HSA Kernel Mode Driver → Hardware. The legend distinguishes user-mode components, kernel-mode components, and components contributed by third parties.)
  • BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA  Easily leverage the inherent power efficiency of GPU computing  Common routines such as scan, sort, reduce, transform  More advanced routines like heterogeneous pipelines  Bolt library works with OpenCL and C++ AMP  Enjoy the unique advantages of the HSA platform  Move the computation, not the data  Finally, a single source-code base for the CPU and GPU!  Developers can focus on core algorithms  Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt  Slide taken from Phil Rogers' HSA Overview, HotChips 2013
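    A flavor of Bolt's STL-style, single-source interface, as a hedged sketch: the header paths and exact overloads shown here are as I recall them from the Bolt 1.x documentation and may differ by version, so treat this as illustrative rather than copy-paste ready.

    #include <bolt/cl/sort.h>       // Bolt headers as documented for the 1.x releases
    #include <bolt/cl/reduce.h>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v(1024);
        for (size_t i = 0; i < v.size(); ++i)
            v[i] = static_cast<int>(v.size() - i);         // descending values

        // Same source for CPU and GPU; Bolt dispatches through its OpenCL path.
        bolt::cl::sort(v.begin(), v.end());
        int sum = bolt::cl::reduce(v.begin(), v.end(), 0); // sum, initial value 0

        std::cout << "front=" << v.front() << " sum=" << sum << "\n";
        return 0;
    }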
  • LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS  (Chart: lines of code and relative performance of an exemplary ISV “Hessian” kernel implemented as Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt; LOC is broken down into init, compile, copy, launch, algorithm, and copy-back portions.)  Test system: AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4 GB RAM. Software: Windows 7 Professional SP1 (64-bit), AMD OpenCL™ 1.2 AMD-APP (937.2), Microsoft Visual Studio 11 Beta
  • AMD’S FIRST HSA SOC