HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Presentation HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding at the AMD Developer Summit (APU13) November 11-13, 2013


  1. HSAemu - A Full System Emulator for HSA Platform
     Prof. Yeh-Ching Chung, System Software Laboratory, Department of Computer Science, National Tsing Hua University
     ® copyright OIA National Tsing Hua University
  2. Outline
       – Introduction to HSA
       – Design of HSAemu
       – Performance Evaluation
       – Conclusions and Future Work
  3. Introduction to HSA
     HSA is an industry standard that defines a next-generation hardware/software architecture for heterogeneous computing.
  4. Hardware Platform of HSA
  5. Simplified HSA Software Stack
     [Figure: software stack diagram. Box labels: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); Renderscript/OpenCL Runtime; OpenGL-ES Runtime; Other Runtime; HSA Runtime; HSAIL; HSA Finalizer; Legacy Driver; Kernel Driver; GPU ISA; CPU(s); GPU(s); Other Accelerators. Layer labels: Application SW; HSA Software; Drivers; Differentiated HW.]
  6. Specification of Simple HSA Platform
     Hardware
       – Memory
         • Shared Virtual Memory (hUMA)
         • Cache Coherency Domains
         • Memory-Based Signaling and Synchronization for CPU and GPU
       – Task Control
         • Architected Queuing Language (AQL)
         • Efficient Syscall Infrastructure
         • Preemptive Context Switching
       – Debugging Infrastructure
         • Allow system software to set instruction, memory, conditional, etc., breakpoints
     Software
       – HSA Runtime APIs
         • Initialization of HSA components
         • Topology discovery
         • Manage AQL packets
         • Dispatch application tasks
         • Signal HW and wait for result
         • Recycle available resources
       – User Mode Queue
         • Store AQL packets
       – Virtual ISA - HSAIL
         • A low-level instruction set designed for parallel computing
       – Exception Handling
         • GPU trap handler to trigger GPU interrupt for GPU exception
  7. What Is HSAemu
     HSAemu is a full system emulator that supports the following HSA features:
       – Shared virtual memory between CPU and GPU
       – Memory-based signaling and synchronization
       – Multiple user-level command queues
       – Preemptive GPU context switching
       – Concurrent execution of CPU threads and GPU threads
       – HSA runtime
       – Finalizer
     A project sponsored by MediaTek (MTK).
     Currently, it supports simple HSA platform simulation:
       – Functional-accurate simulation
       – Cycle-accurate simulation
  8. Architecture of HSAemu
     HSAemu consists of 6 components:
       – HSA Runtime
       – CPU Simulation Module
       – GPU Task Dispatcher
       – Functional-Accurate GPU Simulator (Fast-GPU Simulator)
       – Cycle-Accurate GPU Simulator (Multi2Sim)
       – GPU Helper Functions
  9. HSAemu Runtime
     User Mode Queue
       – Store AQL packets
     AQL Queue Manager
       – Manage AQL packets in User Mode Queue
     AQL Command Dispatcher
       – Launch the execution of kernel jobs on HSAemu
     Support OpenCL runtime
  10. CPU Simulation Module (1)
      PQEMU
        – Perform multicore CPU simulation
      HSA Signal Handler
        – Receive AQL command from HSA Runtime and launch GPU simulation
  11. CPU Simulation Module (2)
      PQEMU
        – A parallel system emulator based on QEMU
        – Two efficient synchronization models (UCC/SCC)
        – Dynamic binary translation (DBT) technique
        – A project sponsored by MTK
      Agent code, HSA runtime, and operating system are run on PQEMU.
      [Figure: Unified Code Cache (UCC) Model. Labels: Code Cache; DBT (×4); CPU (×4).]
      Reference: "PQEMU: A Parallel System Emulator Based on QEMU" (ICPADS 2011)
  12. GPU Task Dispatcher (1)
      AQL Command Monitor
        – Receive signal from HSA Signal Handler
        – Copy AQL packets from User Mode Queue to HW AQL Queue
        – Launch AQL Packet Worker
      AQL Packet Worker
        – Dequeue AQL packets from HW AQL Queue
        – Parse AQL packet
        – Dispatch kernel jobs to Fast-GPU Simulator or M2S-GPU Simulator according to the kernel information
  13. GPU Task Dispatcher (2): Execution Flow
  14. GPU Task Dispatcher (3): Signal from HSA Signal Handler
  15. GPU Task Dispatcher (4): Copy AQL packets from User Mode Queue
  16. GPU Task Dispatcher (5): Ask AQL Packet Worker to parse AQL Packet
  17. GPU Task Dispatcher (6): Launch Fast-GPU Simulator
  18. GPU Task Dispatcher (7): Launch M2S-GPU Simulation
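The dispatcher flow on slides 12 to 18 can be sketched roughly as follows. This is only an illustration under stated assumptions, not HSAemu's code: the `AqlPacket` fields, the `cycle_accurate` flag used to pick a simulator, and the two simulator callables are hypothetical stand-ins, since the slides only say that dispatch depends on "the kernel information".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AqlPacket:
    """Hypothetical, heavily simplified stand-in for an AQL dispatch packet."""
    kernel_name: str
    grid_size: int
    cycle_accurate: bool  # assumed selector; not a real AQL field


class GpuTaskDispatcher:
    def __init__(self, fast_sim, m2s_sim):
        self.hw_queue = deque()  # models the HW AQL Queue
        self.fast_sim = fast_sim
        self.m2s_sim = m2s_sim

    def on_signal(self, user_queue):
        """AQL Command Monitor: copy packets to the HW queue, then launch the worker."""
        while user_queue:
            self.hw_queue.append(user_queue.popleft())
        return self.run_worker()

    def run_worker(self):
        """AQL Packet Worker: dequeue, parse, and dispatch each packet."""
        results = []
        while self.hw_queue:
            packet = self.hw_queue.popleft()
            simulator = self.m2s_sim if packet.cycle_accurate else self.fast_sim
            results.append(simulator(packet))
        return results


def fast_gpu_sim(packet):
    # Placeholder for the functional-accurate simulator.
    return ("fast", packet.kernel_name)


def m2s_gpu_sim(packet):
    # Placeholder for the cycle-accurate simulator.
    return ("m2s", packet.kernel_name)


user_mode_queue = deque([
    AqlPacket("vec_add", 1024, cycle_accurate=False),
    AqlPacket("reduce", 256, cycle_accurate=True),
])
dispatcher = GpuTaskDispatcher(fast_gpu_sim, m2s_gpu_sim)
dispatched = dispatcher.on_signal(user_mode_queue)
```

Keeping the user-mode queue and the HW queue as separate objects mirrors the copy step on slide 15; a real implementation would run the monitor and worker concurrently with the CPU simulation rather than in one call.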
  19. Fast-GPU Simulator
      A functional-accurate simulator for generic GPU model simulation
        – HSAIL Translator
          • Act as a Finalizer
          • Use static binary translation technique to translate BRIG file to host executable binary file (x86) based on LLVM
          • Host SSE instruction optimization
        – GPU Thread Scheduler
          • Simulate a generic GPU model
  20. HSAIL Translator (1): Architecture
  21. HSAIL Translator (2): Launch LLVM HSAIL Translator
  22. HSAIL Translator (3): Construct Control Flow Graph of HSAIL
  23. HSAIL Translator (4): Translate HSAIL to LLVM IR
  24. HSAIL Translator (5): Translate LLVM IR to Host Executable Object File
  25. HSAIL Translator (6): Load Host Executable Object File to memory
  26. HSAIL Translator (7): Link to GPU Helper Functions
  27. HSAIL Translator (8): Store the translation result to GPU Code Cache
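The point of the final step, storing the result in the GPU Code Cache, is that a kernel is translated once and reused on later dispatches. A minimal sketch of that idea follows. Everything here is hypothetical: `translate_kernel` only records the stage names from slides 22 to 26 instead of invoking LLVM, and the cache key (kernel name plus BRIG bytes) is an assumption.

```python
# Stage names paraphrase slides 22-26; no real code generation happens here.
STAGES = [
    "build_control_flow_graph",
    "translate_hsail_to_llvm_ir",
    "emit_host_object_file",
    "load_object_into_memory",
    "link_gpu_helper_functions",
]


def translate_kernel(brig_bytes):
    """Placeholder pipeline: walks the stages in order and returns a dummy artifact."""
    artifact = {"source_size": len(brig_bytes), "stages": []}
    for stage in STAGES:
        artifact["stages"].append(stage)
    return artifact


class GpuCodeCache:
    """Caches translation results so each kernel is translated at most once."""

    def __init__(self, translate):
        self._translate = translate
        self._entries = {}
        self.translations = 0  # counts cache misses, i.e. real translations

    def lookup(self, kernel_name, brig_bytes):
        key = (kernel_name, brig_bytes)  # assumed key; bytes objects are hashable
        if key not in self._entries:
            self._entries[key] = self._translate(brig_bytes)
            self.translations += 1
        return self._entries[key]


code_cache = GpuCodeCache(translate_kernel)
first = code_cache.lookup("foo", b"\x00brig-placeholder")
second = code_cache.lookup("foo", b"\x00brig-placeholder")
```

Here the second `lookup` returns the cached artifact without running the pipeline again.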
  28. HSAIL Translator (2)
      Host SSE instruction optimization
        – Reconstruct the control flow graph of the kernel function
        – Use bitmap masking and packing/unpacking algorithms to generate host SSE instructions
  29. HSAIL Translator (3)
      Example: the control flow graph for kernel function $foo
  30. HSAIL Translator (4)
        – Reconstruct the control flow graph by depth-first traversal
        – Perform bitmap masking and packing/unpacking algorithms
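The slides do not spell out the masking and packing algorithms, so the following is a generic illustration of the underlying idea rather than HSAemu's method: several work items are packed into one vector, both sides of a branch are evaluated for every lane, and a per-lane bitmap selects which result each lane keeps. The lane count of 4 reflects four 32-bit values in a 128-bit SSE register; the example kernel and all function names are made up, and plain Python lists stand in for SSE registers.

```python
LANES = 4  # four 32-bit values fit in one 128-bit SSE register


def kernel_scalar(x):
    """Reference kernel with a data-dependent branch (made up for this example)."""
    return x * 2 if x > 0 else -x


def pack(values, start):
    """Pack up to LANES consecutive work-item inputs into one vector, zero padded."""
    chunk = values[start:start + LANES]
    return chunk + [0] * (LANES - len(chunk))


def kernel_masked(vec):
    """Evaluate both branch sides for all lanes, then blend using a bitmap."""
    bitmap = 0
    for lane, x in enumerate(vec):
        if x > 0:
            bitmap |= 1 << lane  # bit set: this lane takes the "then" side
    then_vals = [x * 2 for x in vec]
    else_vals = [-x for x in vec]
    return [
        then_vals[lane] if (bitmap >> lane) & 1 else else_vals[lane]
        for lane in range(LANES)
    ]


def run_vectorized(values):
    """Process all work items LANES at a time and unpack the results."""
    out = []
    for start in range(0, len(values), LANES):
        result = kernel_masked(pack(values, start))
        out.extend(result[:len(values) - start])  # drop padding lanes
    return out


inputs = [3, -1, 0, 5, -7, 2]
vector_result = run_vectorized(inputs)
scalar_result = [kernel_scalar(x) for x in inputs]
```

Both paths should agree lane by lane, which is the property a masked translation has to preserve when work items in the same vector diverge.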
  31. GPU Thread Scheduler
      Simulate a generic GPU model
        – GPU Thread Scheduler assigns work groups to free CU threads in the GPU Thread Pool
        – Each CU thread executes all work items in a work group
        – The maximum number of CU threads is limited by the host operating system
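A rough sketch of this scheduling model, assuming a one-dimensional grid and using Python's standard thread pool in place of the GPU Thread Pool. The function names and the `max_cu_threads` parameter are illustrative, not taken from HSAemu.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(kernel, global_size, group_size, max_cu_threads=4):
    """Assign each work group to a pool thread; that thread runs all its work items."""
    num_groups = (global_size + group_size - 1) // group_size
    output = [None] * global_size

    def cu_thread(group_id):
        first = group_id * group_size
        last = min(first + group_size, global_size)
        for global_id in range(first, last):  # work items run sequentially
            output[global_id] = kernel(global_id)
        return group_id

    # The pool hands each work group to whichever CU thread is free.
    with ThreadPoolExecutor(max_workers=max_cu_threads) as pool:
        finished_groups = list(pool.map(cu_thread, range(num_groups)))
    return output, finished_groups


def square(global_id):
    return global_id * global_id


squares, groups = run_kernel(square, global_size=10, group_size=4, max_cu_threads=2)
```

Each work group writes to a disjoint slice of `output`, so no locking is needed in this sketch. In CPython the global interpreter lock prevents real parallel speedup for pure-Python kernels, so this shows the assignment scheme only, not the scalability reported later in the deck.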
  32. M2S-GPU Simulator (1)
      A cycle-accurate simulator for AMD Southern Islands GPU model simulation
        – HSAIL Translator
          • Translate BRIG file to GPU binary
        – M2S Bridge
          • Bridge Multi2Sim GPU Model to HSAemu
        – M2S GPU Module
          • Simulate a cycle-accurate GPU model
  33. M2S-GPU Simulator (2)
      HSAIL Translator
        – Act as a Finalizer
        – Translate HSAIL to AMD Southern Islands GPU binary
        – Use static binary translation technique based on LLVM
  34. M2S-GPU Simulator (3)
      M2S Bridge: an interface to launch M2S GPU Module
        – Initialize the data structures used by AMD Southern Islands GPU, including a memory register for AMD Southern Islands GPU to access the shared system memory in HSAemu
        – Invoke M2S GPU Module (the AMD Southern Islands GPU module in Multi2Sim)
  35. M2S-GPU Simulator (4)
      M2S GPU Module
        – A cycle-accurate AMD Southern Islands GPU simulator in Multi2Sim
      Memory access is performed by the HSAemu memory helper function to comply with the hUMA model.
  36. GPU Helper Functions (1)
      Memory Helper Function
        – A soft-MMU for the GPU, with a page table walker and a TLB, to enable the hUMA model
        – Support redirecting accesses to a local memory segment to non-shared private memory in the GPU
      Kernel Information Helper Function
        – Collect and return information about the GPU simulation and the current execution state
        – Retrieve kernel information such as work item ID, work group size, etc., from the AQL packet
  37. GPU Helper Functions (2)
      Mathematic Helper Function
        – Simulate special mathematical instructions, such as trigonometric instructions, by calling the corresponding mathematical functions in the standard library
      Synchronization Helper Function
        – Barrier synchronization implementation for generic GPU model simulation
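The math helper idea, forwarding a special instruction to the host's standard math library, can be sketched as a small lookup table. The opcode strings below are hypothetical labels chosen for the example, not actual HSAIL mnemonics, and `call_math_helper` is an invented name.

```python
import math

# Hypothetical opcode labels mapped to host standard-library functions.
MATH_HELPERS = {
    "sin": math.sin,
    "cos": math.cos,
    "sqrt": math.sqrt,
}


def call_math_helper(opcode, operand):
    """Simulate a special math instruction by calling the host library function."""
    try:
        helper = MATH_HELPERS[opcode]
    except KeyError:
        raise ValueError(f"no math helper registered for {opcode!r}") from None
    return helper(operand)


root = call_math_helper("sqrt", 9.0)
```

In a statically translated kernel the lookup would more plausibly be resolved at link time (slide 26 mentions linking to GPU Helper Functions) rather than per call; the table is used here only to keep the example short.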
  38. hUMA Model in HSAemu
      Unified coherent address space
        – GPU can access a virtual memory page allocated by the CPU
      Soft-MMU is simulated for the GPU
        – TLB hit/miss events can be traced
      Memory segment access
        – Global memory segment access is handled by the memory helper function
        – Group memory segment access is handled by host ld/st instructions
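To make the soft-MMU and TLB tracing concrete, here is a small sketch of address translation with a page table and a tiny TLB that counts hits and misses. The page size, the FIFO eviction policy, the TLB capacity, and the class name are all assumptions for illustration; the slides do not describe HSAemu's actual TLB organization.

```python
PAGE_SIZE = 4096  # assumed page size


class SoftMmu:
    """Toy software MMU: page table lookup fronted by a small FIFO TLB."""

    def __init__(self, page_table, tlb_entries=2):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # dicts keep insertion order, used for FIFO
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
        else:
            self.misses += 1
            frame = self.page_table[vpn]  # page table walk; KeyError models a fault
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.pop(next(iter(self.tlb)))  # evict the oldest entry
            self.tlb[vpn] = frame
        return self.tlb[vpn] * PAGE_SIZE + offset


mmu = SoftMmu({0: 7, 1: 3, 2: 9})
addresses = [0x10, 0x20, PAGE_SIZE, 2 * PAGE_SIZE + 5, 0]
physical = [mmu.translate(a) for a in addresses]
```

Because the GPU side walks the same page table the CPU side populated, a pointer allocated by the CPU resolves to the same physical page on both, which is the property the hUMA bullet describes. The `hits` and `misses` counters stand in for the traceable TLB events.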
  39. Recall: Hardware Simulation of HSAemu
      HSA hardware components simulated:
        – Multicore CPU: a parallel multicore CPU model simulation
        – Functional-Accurate GPU: a generic GPU model simulation
        – Cycle-Accurate GPU: AMD Southern Islands GPU model simulation
        – hUMA: a unified address space between CPU and GPU simulation
        – Synchronization Primitive: barrier instruction simulation
        – Hardware AQL Queue: a HW dispatch queue for GPU simulation
  40. Recall: Software Utilities of HSAemu
      HSA software utilities designed:
        – HSA Runtime: HSA runtime library (OpenCL runtime)
        – Topology Discovery: discover the current platform topology
        – User Mode Queue: a queue for each user application
        – Signal Event: notify GPU to work
        – HSAIL Generator: a PTX to HSAIL source-level translator
        – BRIG Generator: generate a binary format from a kernel file
        – HSAIL Translator: translate HSAIL to host executable binary
        – GPU Code Cache: store translated host binaries
  41. Performance Evaluation
      Experimental Environment
      Benchmarks:
        – Nearest Neighbor (NN), K-Means, FFT, FWT, N-Body
        – Binary Search, Bitonic Sort, Reduction, FWT
  42. Scalability of Fast-GPU Simulator
        – Comparison of NN, K-Means and FWT benchmarks on 32 physical cores
        – The speedup scales while the number of CU threads is less than the number of host physical cores
  43. SSE Optimization of Fast-GPU Simulator
      Performance comparison of FFT with SSE optimization turned on and off
  44. N-Body Simulation by Fast-GPU Simulator
        – N-Body simulation
        – All host physical CPUs are running
  45. Comparison of HSAemu and Multi2Sim
      Benchmark speedup, with Multi2Sim as the baseline (1). Slide annotation: Fast-GPU Sim > M2S-GPU Sim > Multi2Sim.

      Benchmark            Multi2Sim   HSAemu      Hybrid
      BinarySearch         1           2.931317    2.873768
      BitonicSort          1           18.88827    0.921835
      FastWalshTransform   1           8.645516    2.407809
      Reduction            1           6.294213    2.105663
  46. Conclusions
      An HSA-compliant full system emulator has been implemented
        – A functional-accurate simulator for generic GPU model
        – A cycle-accurate simulator for AMD Southern Islands GPU model (from Multi2Sim)
      The HSAIL Translator acts as a finalizer that enables the integration of HSAemu with existing simulators, for example, Multi2Sim
      Open source
        – Nov. 12, 2013
        – http://hsaemu.org/
  47. Future work
        – Enhance HSAemu by implementing more HSA features
        – Integrate HSAemu with some existing cycle-accurate GPU simulators
        – Design a cycle-accurate simulator based on PQEMU for generic CPU model
        – Design a cycle-accurate simulator based on PQEMU for big.LITTLE CPU model
  48. Q & A
