
Introduction to HSA

Introduction to HSA, presented at a group meeting.



  1. INTRODUCTION TO HETEROGENEOUS SYSTEM ARCHITECTURE Presenter: BingRu Wu
  2. Outline ◻ Introduction ◻ Goal ◻ Concept ◻ Memory Model ◻ System Components
  3. Introduction ◻ HSA: Heterogeneous System Architecture ◻ Promising future: ◻ ARM processor vendors ◻ GPU vendors: AMD, Imagination ◻ Fully utilizes computation resources ◻ Supporting HSA may connect our system to a major application base
  4. Goal of HSA ◻ Remove programmability barriers ◻ Memory space barrier ◻ Access latency among devices ◻ Backward compatible ◻ Utilize existing programming models
  5. Concept of HSA
  6. Abstract ◻ Two kinds of compute units ◻ LCU: Latency Compute Unit (e.g. CPU) ◻ TCU: Throughput Compute Unit (e.g. GPU) ◻ Merged memory space
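Every LCU and TCU appears to software as an HSA "agent". As a minimal sketch (assuming an installed HSA 1.0 runtime and its hsa.h header; error handling omitted), the agents and their kinds can be enumerated like this:

    #include <hsa.h>
    #include <stdio.h>

    /* Print each agent and whether it is an LCU (CPU) or a TCU (GPU). */
    static hsa_status_t print_agent(hsa_agent_t agent, void *data) {
        hsa_device_type_t type;
        char name[64];
        (void)data;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
        printf("%s: %s\n", name,
               type == HSA_DEVICE_TYPE_CPU ? "LCU (CPU)" :
               type == HSA_DEVICE_TYPE_GPU ? "TCU (GPU)" : "other");
        return HSA_STATUS_SUCCESS;
    }

    int main(void) {
        hsa_init();                            /* bring up the HSA runtime */
        hsa_iterate_agents(print_agent, NULL);
        hsa_shut_down();
        return 0;
    }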
  7. Memory Management (1/2) ◻ Shared page table ◻ Memory is shared by all devices ◻ No more host-to-device copies, or vice versa ◻ Supports pointer-based data structures (e.g. lists) ◻ Page faulting ◻ Virtual memory space for all devices ◻ e.g. the GPU can now use memory as if it had the whole memory space
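To see why the shared page table makes pointer-based structures usable, consider an ordinary linked list. The plain-C sketch below is only illustrative: the traversal stands in for what a TCU kernel could do directly under HSA, whereas in a copy-based model every next pointer would have to be rewritten after the transfer.

    #include <stdlib.h>
    #include <stdio.h>

    struct node { int value; struct node *next; };

    /* Stand-in for a TCU kernel: with a shared page table the device
     * could walk these host pointers as-is; in a copy-based model each
     * `next` field would need fixing up after the copy. */
    static int sum(const struct node *head) {
        int s = 0;
        for (; head; head = head->next)
            s += head->value;
        return s;
    }

    int main(void) {
        struct node *head = NULL;
        for (int i = 1; i <= 3; i++) {          /* build 3 -> 2 -> 1 */
            struct node *n = malloc(sizeof *n);
            n->value = i;
            n->next = head;
            head = n;
        }
        printf("%d\n", sum(head));              /* prints 6 */
        return 0;
    }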
  8. Memory Management (2/2) ◻ Coherent memory regions ◻ The memory is coherent ◻ Shared among all devices (CUs) ◻ Unified address space ◻ Memory type determined by address ◻ Private / local / global memory decided by memory region ◻ No special instructions are required
  9. User-Level Command Queue ◻ Queues for communication ◻ User to device ◻ Device to device ◻ HSA runtime handles the queues ◻ Allocation & destruction ◻ One queue per application ◻ Vendor-dependent implementation ◻ Direct access to devices ◻ No OS syscalls ◻ No OS task management
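A sketch of what user-level queuing looks like with the HSA 1.0 runtime API: a queue is created for a TCU agent, an AQL dispatch packet is written into it from user space, and the doorbell signal makes the packet visible to the device with no syscall per dispatch. Here kernel_object and kernarg are assumed to come from a finalized HSAIL module (see the Runtime slide below); error handling is omitted.

    #include <hsa.h>
    #include <stdint.h>
    #include <string.h>

    static hsa_status_t find_gpu(hsa_agent_t agent, void *data) {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {        /* a TCU */
            *(hsa_agent_t *)data = agent;
            return HSA_STATUS_INFO_BREAK;         /* stop iterating */
        }
        return HSA_STATUS_SUCCESS;
    }

    void dispatch(uint64_t kernel_object, void *kernarg, uint32_t n) {
        hsa_agent_t gpu;
        hsa_queue_t *queue;
        hsa_signal_t done;

        hsa_init();
        hsa_iterate_agents(find_gpu, &gpu);
        /* User-level queue: allocated by the runtime, then written to
         * directly from user space. */
        hsa_queue_create(gpu, 256, HSA_QUEUE_TYPE_SINGLE,
                         NULL, NULL, UINT32_MAX, UINT32_MAX, &queue);
        hsa_signal_create(1, 0, NULL, &done);

        /* Claim a packet slot by bumping the queue's write index. */
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
        hsa_kernel_dispatch_packet_t *pkt =
            (hsa_kernel_dispatch_packet_t *)queue->base_address
            + (index & (queue->size - 1));

        memset(pkt, 0, sizeof *pkt);
        pkt->setup = 1;                           /* 1-dimensional grid */
        pkt->workgroup_size_x = 64;
        pkt->workgroup_size_y = pkt->workgroup_size_z = 1;
        pkt->grid_size_x = n;
        pkt->grid_size_y = pkt->grid_size_z = 1;
        pkt->kernel_object = kernel_object;       /* finalized kernel   */
        pkt->kernarg_address = kernarg;           /* kernarg memory     */
        pkt->completion_signal = done;
        /* Publish the packet last by writing its header. */
        pkt->header =
            (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
            (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
            (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);

        /* Ring the doorbell: the TCU picks up the packet directly. */
        hsa_signal_store_release(queue->doorbell_signal, index);
        hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                                UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

        hsa_signal_destroy(done);
        hsa_queue_destroy(queue);
        hsa_shut_down();
    }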
  10. Hardware Scheduler (1/3) ◻ No real scheduling on the TCU (GPU) ◻ Task scheduling ◻ Task preemption ◻ Current implementation ◻ Executing without a lock: ◻ All threads execute ◻ Multiple tasks cause erroneous results
  11. Hardware Scheduler (2/3) ◻ Current implementation ◻ Executing with a lock: ◻ A code exception may leave the resource locked up ◻ Long-running tasks prevent others from executing ◻ We may fail to finish critical jobs
  12. Hardware Scheduler (3/3) HSA runtime guarantees: ◻ Bounded execution time ◻ Any process ceases in reasonable time ◻ Fast switching among applications ◻ Uses hardware to save time ◻ Application-level parallelism
  13. HSAIL (1/2) ◻ HSA Intermediate Language ◻ The language for TCUs ◻ Similar to “PTX” code ◻ No graphics-specific instructions ◻ Further translated to the HW ISA (by the Finalizer) ◻ The abstract platform is similar to OpenCL ◻ Work item (thread) ◻ Work group (block) ◻ NDRange (grid)
  14. HSAIL (2/2)
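To make the OpenCL analogy from the previous slide concrete, here is an illustrative C sketch of how a work-item's absolute ID within the NDRange is composed from its work-group coordinates. The function name is hypothetical; in HSAIL itself these values are read with instructions such as workitemabsid, workgroupid, and workitemid.

    #include <stddef.h>

    /* Absolute (global) work-item ID per dimension: which work group
     * we are in, times the work-group size, plus our local ID. */
    static size_t work_item_abs_id(size_t workgroup_id,
                                   size_t workgroup_size,
                                   size_t local_id) {
        return workgroup_id * workgroup_size + local_id;
    }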
  15. Memory Model
  16. Virtual Memory Address ◻ All types of memory use the same space ◻ Memory access behavior ◻ Not all regions are accessible by all devices ◻ The OS kernel should not be accessible ◻ Mapping to a region in kernel space is still possible ◻ Accessing an identical address may give different values ◻ Work-item private memory ◻ Work-group local memory ◻ Accessing another item's / group's memory is not valid
  17. Memory Region ◻ Global ◻ The memory shared by all LCUs & TCUs ◻ Accessible by work items / groups ◻ Group ◻ The memory shared by all work items in the same group ◻ Private ◻ The memory visible only to a single work item
  18. Memory Region ◻ Kernarg ◻ The memory for kernel arguments ◻ A kernel is the code fragment we ask a device to run ◻ Readonly ◻ A read-only type of global memory ◻ Spill ◻ Memory for register spills ◻ Arg ◻ Memory for function call arguments
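A sketch of how a region such as kernarg is obtained in practice with the HSA 1.0 runtime: iterate the agent's regions, pick the global region flagged for kernel arguments, and allocate from it. The gpu argument is assumed to be a TCU agent found as in the earlier sketches; error handling is omitted.

    #include <hsa.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Callback: remember the global region flagged for kernel arguments. */
    static hsa_status_t find_kernarg(hsa_region_t region, void *data) {
        hsa_region_segment_t seg;
        uint32_t flags;
        hsa_region_get_info(region, HSA_REGION_INFO_SEGMENT, &seg);
        if (seg != HSA_REGION_SEGMENT_GLOBAL)
            return HSA_STATUS_SUCCESS;
        hsa_region_get_info(region, HSA_REGION_INFO_GLOBAL_FLAGS, &flags);
        if (flags & HSA_REGION_GLOBAL_FLAG_KERNARG) {
            *(hsa_region_t *)data = region;
            return HSA_STATUS_INFO_BREAK;       /* found it, stop */
        }
        return HSA_STATUS_SUCCESS;
    }

    /* Allocate space for kernel arguments from the kernarg region. */
    void *alloc_kernarg(hsa_agent_t gpu, size_t size) {
        hsa_region_t region;
        void *ptr = NULL;
        hsa_agent_iterate_regions(gpu, find_kernarg, &region);
        hsa_memory_allocate(region, size, &ptr);
        return ptr;
    }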
  19. Memory Consistency ◻ LCU ◻ An LCU maintains its own consistency ◻ Shares global memory ◻ Work item ◻ Memory operations to the same address by a single work item are in order ◻ Memory operations to different addresses may be reordered ◻ Other than that, nothing is guaranteed
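The same discipline applies in ordinary C11 code, which makes it a convenient stand-in here: ordering between operations on different addresses must be requested explicitly. A small host-side illustration with a hypothetical flag/data pair (release pairs with acquire; with relaxed ordering, nothing would be guaranteed, just as the slide says):

    #include <stdatomic.h>

    int data;                         /* plain payload               */
    atomic_int ready;                 /* flag at a different address */

    void producer(void) {
        data = 42;
        /* Release: orders the data write before the flag write. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        /* Acquire: once the flag is seen, the data write is visible. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return data;                  /* guaranteed to read 42 */
    }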
  20. System Components
  21. HSA System
  22. Compilation ◻ Frontend ◻ LLVM IR ◻ No data dependency ◻ Backend ◻ Converts IR to HSAIL ◻ Optimization happens here ◻ Binary format ◻ ELF format ◻ Embedded container for HSAIL (BRIG)
  23. Runtime ◻ HSA runtime ◻ Issues tasks to devices via the queue protocol ◻ Device ◻ Converts HSAIL to the device ISA with the Finalizer
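A sketch of the finalization step with the HSA 1.0 finalizer extension, under the assumption that the compiler backend has already produced a BRIG module (not shown): the module is added to a program, finalized to the agent's ISA, and loaded into a frozen executable whose kernel symbol yields the kernel_object used in the dispatch sketch above. Exact options are left at their defaults; error handling is omitted.

    #include <hsa.h>
    #include <hsa_ext_finalize.h>
    #include <stdint.h>

    uint64_t finalize_kernel(hsa_agent_t gpu, hsa_ext_module_t brig,
                             const char *kernel_name) {
        /* Which ISA the Finalizer should target. */
        hsa_isa_t isa;
        hsa_agent_get_info(gpu, HSA_AGENT_INFO_ISA, &isa);

        /* Build a program from the BRIG module and finalize it. */
        hsa_ext_program_t program;
        hsa_ext_program_create(HSA_MACHINE_MODEL_LARGE, HSA_PROFILE_FULL,
                               HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT,
                               NULL, &program);
        hsa_ext_program_add_module(program, brig);

        hsa_code_object_t code_object;
        hsa_ext_control_directives_t directives = {0};
        hsa_ext_program_finalize(program, isa, 0, directives, NULL,
                                 HSA_CODE_OBJECT_TYPE_PROGRAM, &code_object);

        /* Load the device code and look up the kernel symbol. */
        hsa_executable_t executable;
        hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN,
                              NULL, &executable);
        hsa_executable_load_code_object(executable, gpu, code_object, NULL);
        hsa_executable_freeze(executable, NULL);

        hsa_executable_symbol_t symbol;
        hsa_executable_get_symbol(executable, NULL, kernel_name, gpu, 0,
                                  &symbol);
        uint64_t kernel_object;
        hsa_executable_symbol_get_info(
            symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel_object);
        return kernel_object;
    }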
  24. HSAIL Program Features ◻ Backward compatible ◻ A system without HSA support should still run the executable ◻ Function invocation ◻ LCU functions may call LCU ones ◻ TCU functions may call TCU ones, with Finalizer support ◻ LCU-to-TCU / TCU-to-LCU calls are supported by using queues ◻ C++ compatible
  25. Conclusion ◻ HSA is an open, standard layer between software and hardware ◻ The cardinal feature of HSA is the unified virtual memory space ◻ It is no replacement for current programming frameworks; no new language is required
  26. References ◻ Heterogeneous System Architecture: A Technical Review ◻ HSA Programmer’s Reference Manual ◻ HSAIL: Write-Once-Run-Everywhere for Heterogeneous Systems
