ISCA Final Presentation - Applications

Slide notes
  • Heap allocation (hUMA – virtual memory) – dGPU version would need to page large portions of the model to the GPU – CPU version would be slow.
    Data pointers (hUMA – unified addresses) – non-HSA version would need to “serialize” tree into array (with indices for pointers) for GPU.
    Recursion (hQ – GPU enqueuing) – non-HSA version would suffer from load imbalance, because CPU has to wait and spawn 1 kernel to process all secondary rays, whereas with HSA, the GPU threads can dynamically spawn kernels to process secondary rays.
    Callbacks (hUMA – platform atomics) – In the non-HSA version, the CPU has to wait until the first kernel exits to begin processing callbacks, and can't launch the second kernel until all callbacks have completed.
    Atomics (hUMA – memory coherence & platform atomics) – In the non-HSA version, the CPU and GPU processing is serialized.
  • The kernel takes an input buffer containing the list of keys being searched for; the outputs are the values from the matching key-value pairs (a minimal search sketch appears after these notes).
  • This case study implements a dynamic task scheduling scheme that aims to balance load among work-groups (a minimal sketch of the HSA-style queue appears after these notes).
    Traditional heterogeneous approach:
    The host system enqueues tasks in several queues located in GPU memory.
    Two variables per queue are used to synchronize CPU and GPU: The number of tasks that have been written in the queue, and the number of tasks that have been already consumed from the queue.
    These variables are duplicated in CPU and GPU memory.
    The GPU runs a number of persistent work-groups. A work-group can dequeue one task and update the number of consumed tasks.
  • A group of tasks are asynchronously transferred to one queue in GPU memory.
  • Then, the host updates the number of written tasks in CPU memory.
  • The number of written tasks is updated in GPU memory by an asynchronous transfer.
  • A work-group dequeues one task from a queue.
  • The work-group updates the number of consumed tasks by using a global memory atomic operation.
    Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  • Then, a different work-group dequeues the next task.
  • Work-group 2 updates the number of consumed tasks.
    Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  • Work-group 3 dequeues the next task.
  • Work-group 3 updates the number of consumed tasks.
    Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  • Work-group 4 dequeues the next task.
  • Work-group 4 updates the number of consumed tasks. Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks and the number of written tasks.
  • Since the number of consumed tasks and the number of written tasks are equal, the queue is empty.
    Then, the number of consumed tasks should be updated in CPU memory. This is implemented by using the zero-copy feature.
    Once the number of consumed tasks in CPU memory is updated, the host thread will detect that this number is equal to the number of written tasks. More tasks can be then enqueued in queue 1.
  • Using HSA and full OpenCL 2.0, queues and synchronization variables can be allocated in host coherent memory.
  • Moving tasks to a queue is as simple as using memcpy.
  • No copies of the number of written tasks and the number of consumed tasks are needed in GPU memory.
  • A work-group can dequeue one task from a queue in host coherent memory.
  • The number of consumed tasks is updated by using platform atomics.
  • The function that inserts tasks into the queues alone needs about 5x fewer lines of code than in the legacy implementation (20 vs. 102 lines).
  • This slide presents 8 tests. The total number of tasks in the tasks pool is 4096 or 16384. The number of queues is 4 in every test.
    Each time the host inserts tasks in a queue, the number of tasks per insertion is 64, 128, 256 or 512.
  • An atomic is used as a lock on a parent node before adding a child to it; a semaphore on the tree struct is taken with CAS (see the CAS-lock sketch after these notes).
    The tree has 2M nodes; 0.5M nodes are added.
    Timed three ways: CPU only, GPU only, and both together.
  • Only the dividing planes for the first two levels are loaded into GPU memory.
  • BVH – Bounding Volume Hierarchy.
    Each leaf holds a collection of primitives (spheres).
    The search looks for the first sphere that intersects a given point (a minimal BVH traversal sketch appears after these notes).
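
The binary tree search note above calls for a kernel that takes a buffer of keys and returns the matching values. Below is a minimal C++ sketch of that per-work-item search logic, assuming unified coherent memory so the pointer-based nodes built by the CPU can be chased directly; the Node layout and function names are illustrative, not the case study's actual code, and the driver loop stands in for an NDRange launch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout: the CPU builds this tree with ordinary pointers,
// and on an HSA system the GPU kernel can chase the same pointers directly.
struct Node {
    uint32_t key;
    uint32_t value;
    Node*    left;
    Node*    right;
};

// One "work-item": search the tree for keys[i] and record the matching value.
// On the GPU this body would be the kernel; i would be the global work-item id.
void search_one(const Node* root, const uint32_t* keys, uint32_t* values, size_t i) {
    const Node* n = root;
    while (n != nullptr) {
        if (keys[i] == n->key) { values[i] = n->value; return; }
        n = (keys[i] < n->key) ? n->left : n->right;
    }
    values[i] = UINT32_MAX;  // sentinel: key not found
}

// Host-side driver standing in for an NDRange launch over all keys.
void search_all(const Node* root, const std::vector<uint32_t>& keys,
                std::vector<uint32_t>& values) {
    for (size_t i = 0; i < keys.size(); ++i)
        search_one(root, keys.data(), values.data(), i);
}
```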
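
For the dynamic task management note above, here is a minimal C++ sketch of the HSA-style scheme in which the queue and both counters live in a single host-coherent allocation: the host inserts tasks with memcpy plus an atomic update of the written count, and persistent work-groups (simulated here by host threads, with std::atomic standing in for platform atomics) claim tasks by atomically bumping the consumed count. The Task type, queue capacity, and batch size are made up for illustration, and the claim-then-check loop is a simplification of the compare-the-counters emptiness test described above.

```cpp
#include <atomic>
#include <cstring>
#include <thread>
#include <vector>

// Illustrative task type; the real case study's task layout is not shown here.
struct Task { int id; };

constexpr int kQueueCapacity = 512;

// With HSA / full OpenCL 2.0 the queue and both counters live in one
// host-coherent allocation visible to CPU and GPU alike.
struct Queue {
    Task tasks[kQueueCapacity];
    std::atomic<int> num_written{0};   // bumped by the host after inserting tasks
    std::atomic<int> num_consumed{0};  // bumped by work-groups as they dequeue
};

// Host side: inserting tasks is just a memcpy plus an atomic update of the
// written counter; no duplicated GPU-memory copies are needed.
void host_enqueue(Queue& q, const std::vector<Task>& batch) {
    int start = q.num_written.load(std::memory_order_relaxed);
    std::memcpy(&q.tasks[start], batch.data(), batch.size() * sizeof(Task));
    q.num_written.store(start + static_cast<int>(batch.size()),
                        std::memory_order_release);
}

// "Work-group" side: each persistent work-group (a thread here) claims tasks
// with an atomic add on the consumed counter until the queue is drained.
// Note: this claim-then-check overshoots the consumed counter when the queue
// drains; the real scheme compares the two counters before claiming.
void work_group(Queue& q, std::atomic<int>& done_count) {
    for (;;) {
        int idx = q.num_consumed.fetch_add(1, std::memory_order_acq_rel);
        if (idx >= q.num_written.load(std::memory_order_acquire)) break;
        done_count.fetch_add(1, std::memory_order_relaxed);  // "process" task idx
    }
}

int main() {
    Queue q;
    std::vector<Task> batch(64);
    for (int i = 0; i < 64; ++i) batch[i].id = i;
    host_enqueue(q, batch);

    std::atomic<int> done{0};
    std::vector<std::thread> groups;
    for (int g = 0; g < 4; ++g)
        groups.emplace_back(work_group, std::ref(q), std::ref(done));
    for (auto& t : groups) t.join();
    return done.load() == 64 ? 0 : 1;
}
```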
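
The binary tree update note above mentions a semaphore on the tree struct taken with CAS so that CPU and GPU can modify the tree at the same time. The sketch below shows that CAS-taken lock in plain C++, with the GPU inserter played by a second host thread; the per-parent atomic lock also mentioned in the note is simplified away, and all names are illustrative.

```cpp
#include <atomic>
#include <thread>

// A CAS-taken semaphore on the tree: whoever wants to insert first takes the
// lock with compare-and-swap. On an HSA system the GPU side would use a
// platform atomic on the same flag in shared virtual memory; here both
// "devices" are plain host threads.
struct TreeLock {
    std::atomic<int> flag{0};   // 0 = free, 1 = held

    void lock() {
        int expected = 0;
        while (!flag.compare_exchange_weak(expected, 1,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            expected = 0;  // CAS failed: reset and retry (spin)
        }
    }
    void unlock() { flag.store(0, std::memory_order_release); }
};

struct Node {
    int key;
    Node* left;
    Node* right;
};

// Insert under the lock; both the CPU thread and the "GPU" thread call this.
void insert(Node*& root, TreeLock& lk, int key) {
    lk.lock();
    Node** slot = &root;
    while (*slot) slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
    *slot = new Node{key, nullptr, nullptr};
    lk.unlock();
}

int main() {
    Node* root = nullptr;
    TreeLock lk;
    auto worker = [&](int base) {
        for (int i = 0; i < 1000; ++i) insert(root, lk, base + i);
    };
    std::thread cpu(worker, 0);       // stands in for the CPU inserter
    std::thread gpu(worker, 100000);  // stands in for the GPU-side inserter
    cpu.join();
    gpu.join();
    return 0;
}
```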
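
For the BVH note above, here is a minimal C++ sketch of the traversal each work-item would perform: walk the hierarchy in place over unified memory and return the first leaf sphere that contains the query point. The node layout and the use of std::vector in leaves are illustrative assumptions, not the case study's actual data structure.

```cpp
#include <vector>

// Minimal BVH sketch: internal nodes hold a bounding box and child pointers,
// leaves hold a list of spheres. With unified coherent memory the GPU kernel
// can walk these pointers directly in system memory.
struct Sphere { float x, y, z, r; };

struct AABB {
    float lo[3], hi[3];
    bool contains(float px, float py, float pz) const {
        return px >= lo[0] && px <= hi[0] &&
               py >= lo[1] && py <= hi[1] &&
               pz >= lo[2] && pz <= hi[2];
    }
};

struct BVHNode {
    AABB box;
    BVHNode* left = nullptr;            // null in a leaf
    BVHNode* right = nullptr;           // null in a leaf
    std::vector<Sphere> primitives;     // filled only in leaves
};

// Return the first sphere found that contains the query point, or nullptr.
// This is the per-work-item traversal; on the GPU each work-item would run
// it for a different query point.
const Sphere* find_first_hit(const BVHNode* node, float px, float py, float pz) {
    if (!node || !node->box.contains(px, py, pz)) return nullptr;
    if (!node->left && !node->right) {                 // leaf: test its spheres
        for (const Sphere& s : node->primitives) {
            float dx = px - s.x, dy = py - s.y, dz = pz - s.z;
            if (dx * dx + dy * dy + dz * dz <= s.r * s.r) return &s;
        }
        return nullptr;
    }
    if (const Sphere* hit = find_first_hit(node->left, px, py, pz)) return hit;
    return find_first_hit(node->right, px, py, pz);
}
```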
Transcript

    1. HSA APPLICATIONS. WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS, WITH J.P. BORDES AND JUAN GOMEZ
    2. USE CASES SHOWING HSA ADVANTAGE (Programming Technique / Use Case / Description / HSA Advantage):
       Pointer-based Data Structures / Binary tree searches / GPU performs parallel searches in a CPU-created binary tree. / CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.
       Platform Atomics / Work-group dynamic task management; binary tree updates / GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads; CPU and GPU operate simultaneously on the tree, both making modifications. / CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.
       Large Data Sets / Hierarchical data searches / Applications include object recognition, collision detection, global illumination, BVH. / CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel launch overhead.
       CPU Callbacks / Middleware user-callbacks / GPU processes work items, some of which require a call to a CPU function to fetch new data. / GPU can invoke CPU functions from within a GPU kernel; simpler programming (no "split kernels") and higher performance through parallel operations.
       © Copyright 2014 HSA Foundation. All Rights Reserved
    3. UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES
    4.–10. UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES (Legacy). Diagram sequence: the pointer-based TREE and a RESULT BUFFER sit in SYSTEM MEMORY; to run the GPU KERNEL, the tree must be flattened into a FLAT TREE plus RESULT BUFFER in GPU MEMORY. © Copyright 2014 HSA Foundation. All Rights Reserved
    11.–15. UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES (HSA and full OpenCL 2.0). Diagram sequence: the GPU KERNEL traverses the TREE and fills the RESULT BUFFER directly in SYSTEM MEMORY; no flat copy in GPU memory is needed. © Copyright 2014 HSA Foundation. All Rights Reserved
    16. POINTER DATA STRUCTURES - CODE COMPLEXITY: side-by-side HSA and Legacy code listings. © Copyright 2014 HSA Foundation. All Rights Reserved
    17. POINTER DATA STRUCTURES - PERFORMANCE: Binary Tree Search, search rate (nodes/ms) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU. Measured in AMD labs Jan 1-3 on the system shown in the backup slide. © Copyright 2014 HSA Foundation. All Rights Reserved
    18. PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT
    19.–31. PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (Legacy*). Diagram sequence: QUEUE 1, QUEUE 2 and the NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters are duplicated in SYSTEM MEMORY and GPU MEMORY; the host moves tasks from the TASKS POOL into QUEUE 1 and updates the written count by asynchronous transfers, then WORK-GROUPs 1–4 each dequeue a task and bump the consumed count with a GPU atomic add (0 through 4), and the final count is propagated back to CPU memory by zero-copy. *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010. © Copyright 2014 HSA Foundation. All Rights Reserved
    32.–42. PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (HSA and full OpenCL 2.0). Diagram sequence: the queues, TASKS POOL, and NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters all live in HOST COHERENT MEMORY; the host inserts tasks with a plain memcpy, and WORK-GROUPs 1–4 dequeue tasks and update the consumed count with platform atomic adds (0 through 4); nothing is duplicated in GPU MEMORY. © Copyright 2014 HSA Foundation. All Rights Reserved
    43. PLATFORM ATOMICS – CODE COMPLEXITY: HSA host enqueue function: 20 lines of code; Legacy host enqueue function: 102 lines of code. © Copyright 2014 HSA Foundation. All Rights Reserved
    44. PLATFORM ATOMICS - PERFORMANCE: execution time (ms) vs. tasks per insertion (64, 128, 256, 512) for task pool sizes of 4096 and 16384, comparing the legacy and HSA implementations. © Copyright 2014 HSA Foundation. All Rights Reserved
    45. PLATFORM ATOMICS FOR CPU/GPU COLLABORATION
    46.–48. PLATFORM ATOMICS ENABLING EFFICIENT GPU/CPU COLLABORATION (Legacy). Diagram sequence: only the GPU KERNEL can work on the INPUT BUFFER and TREE; concurrent CPU processing is not possible. © Copyright 2014 HSA Foundation. All Rights Reserved
    49.–50. PLATFORM ATOMICS ENABLING EFFICIENT GPU/CPU COLLABORATION (HSA and full OpenCL 2.0). Diagram sequence: the GPU KERNEL and CPU 0 / CPU 1 operate on the same INPUT BUFFER and TREE concurrently. © Copyright 2014 HSA Foundation. All Rights Reserved
    51. UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
    52. PROCESSING LARGE DATA SETS. The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU. © Copyright 2014 HSA Foundation. All Rights Reserved
    53. PROCESSING LARGE DATA SETS. A large 3D spatial data structure (Levels 1–5) in SYSTEM MEMORY; the CPU creates it and computations are offloaded to the GPU. Compare the HSA and Legacy methods. © Copyright 2014 HSA Foundation. All Rights Reserved
    54.–70. LEGACY ACCESS TO LARGE STRUCTURES (Legacy). Diagram sequence: GPU memory is smaller than the large 3D spatial data structure (Levels 1–5) in SYSTEM MEMORY, so the data has to be copied and processed one chunk at a time: copy the top 2 levels of the hierarchy and run the FIRST KERNEL, copy the bottom 3 levels of one branch and run the SECOND KERNEL, then copy the bottom 3 levels of a different branch and run the Nth KERNEL, and so on. © Copyright 2014 HSA Foundation. All Rights Reserved
    71.–76. GPU CAN TRAVERSE ENTIRE HIERARCHY (HSA and full OpenCL 2.0). Diagram sequence: the HSA KERNEL on the GPU walks the entire large 3D spatial data structure (Levels 1–5) in place in SYSTEM MEMORY. © Copyright 2014 HSA Foundation. All Rights Reserved
    77. CALLBACKS
    78. CALLBACKS: A COMMON SITUATION IN HC. Parallel processing algorithm with branches; a seldom-taken branch requires new data from the CPU. On legacy systems, the algorithm must be split: process Kernel 1 on the GPU, check for CPU callbacks and, if any, process them on the CPU, then process Kernel 2 on the GPU. Example algorithm from image processing: perform a filter, calculate the average LUMA in each tile, compare the LUMA against a threshold and call a CPU callback if it is exceeded (rare), then perform special processing on the tiles with callbacks. (Input Image / Output Image; a minimal sketch of the callback handshake appears after this transcript.) © Copyright 2014 HSA Foundation. All Rights Reserved
    79. CALLBACKS (Legacy): GPU THREADS 0, 1, 2, ..., N; a continuation kernel finishes up the remaining kernel work, which results in poor GPU utilization. © Copyright 2014 HSA Foundation. All Rights Reserved
    80. CALLBACKS: Input Image (1 Tile = 1 OpenCL Work Item) and Output Image. GPU: work items compute the average RGB value of all the pixels in a tile; work items also compute the average Luma from the average RGB; if the average Luma > threshold, the work-group invokes a CPU CALLBACK and, in parallel with the callback, continues computing. CPU: for selected tiles, update the average Luma value (set to RED). GPU: work items apply the Luma value to all pixels in the tile. GPU-to-CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using platform atomic compare-and-swap. © Copyright 2014 HSA Foundation. All Rights Reserved
    81. CALLBACKS (HSA and full OpenCL 2.0): GPU THREADS 0, 1, 2, ..., N; a few kernel threads need CPU callback services, but they are serviced immediately by CPU callbacks. © Copyright 2014 HSA Foundation. All Rights Reserved
    82. SUMMARY - HSA ADVANTAGE: repeats the use-case table from slide 2 (Pointer-based Data Structures, Platform Atomics, Large Data Sets, CPU Callbacks). © Copyright 2014 HSA Foundation. All Rights Reserved
    83. QUESTIONS?
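
Slide 80 states that the GPU-to-CPU callbacks are built on Shared Virtual Memory semaphores taken with platform atomic compare-and-swap. The sketch below shows that handshake in plain C++, with the GPU work-group played by a host thread, std::atomic standing in for platform atomics, and the tile's Luma reduced to a single float; the state values and function names are illustrative, not the case study's actual code.

```cpp
#include <atomic>
#include <thread>

// States of the shared "semaphore" word that CPU and GPU poll.
enum : int { IDLE = 0, REQUEST = 1, DONE = 2 };

struct CallbackSlot {
    std::atomic<int> state{IDLE};
    float luma = 0.0f;   // argument going up, corrected value coming back
};

// "GPU" side: a work-group whose tile exceeded the threshold raises a request
// with CAS and then spins until the CPU has serviced it. Other work-groups
// (not shown) would keep computing meanwhile.
void gpu_work_group(CallbackSlot& slot, float tile_luma) {
    slot.luma = tile_luma;
    int expected = IDLE;
    while (!slot.state.compare_exchange_weak(expected, REQUEST,
                                             std::memory_order_acq_rel)) {
        expected = IDLE;   // slot busy; retry
    }
    while (slot.state.load(std::memory_order_acquire) != DONE) { /* spin */ }
    // slot.luma now holds the CPU-updated value; apply it to the tile here.
    slot.state.store(IDLE, std::memory_order_release);
}

// CPU side: poll for requests while the kernel is still running, service them,
// and hand the result back by flipping the state word.
void cpu_service(CallbackSlot& slot, std::atomic<bool>& kernel_running) {
    while (kernel_running.load(std::memory_order_acquire)) {
        if (slot.state.load(std::memory_order_acquire) == REQUEST) {
            slot.luma = 1.0f;   // stand-in for "set selected tile to RED"
            slot.state.store(DONE, std::memory_order_release);
        }
    }
}

int main() {
    CallbackSlot slot;
    std::atomic<bool> running{true};
    std::thread cpu(cpu_service, std::ref(slot), std::ref(running));
    std::thread gpu([&] { gpu_work_group(slot, 0.9f); running.store(false); });
    gpu.join();
    cpu.join();
    return 0;
}
```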
