Dynamic Load-balancing On Graphics Processors

  1. On Dynamic Load Balancing on Graphics Processors. Daniel Cederman and Philippas Tsigas, Chalmers University of Technology
  2. Overview: Motivation; Methods; Experimental evaluation; Conclusion
  3. The problem setting (diagram labels: Work; Offline, seven tasks; Online, four tasks)
  4. Static Load Balancing (diagram: four processors)
  5. Static Load Balancing (diagram: four processors, four tasks)
  6. Static Load Balancing (diagram: four processors, four tasks)
  7. Static Load Balancing (diagram: four processors, four tasks, four subtasks)
  8. Static Load Balancing (diagram: four processors, four tasks, four subtasks)
  9. Dynamic Load Balancing (diagram: four processors, four tasks, four subtasks)
  10. Task sharing (flowchart labels: Check condition, Work done?, Done; Task Set; Acquire Task, Try to get task, Got task?, No, retry; Perform task, New tasks?, No, continue; Add Task, Add task)
  11. System Model: CUDA; Global Memory; Gather and scatter; Compare-And-Swap; Fetch-And-Inc; Multiprocessors; Maximum number of concurrent thread blocks (diagram: global memory, three multiprocessors, nine thread blocks)
  12. Synchronization. Blocking: uses mutual exclusion to allow only one process at a time to access the object. Lock-free: multiple processes can access the object concurrently; at least one operation in a set of concurrent operations finishes in a finite number of its own steps. Wait-free: multiple processes can access the object concurrently; every operation finishes in a finite number of its own steps.
  13. Load Balancing Methods: Blocking Task Queue; Non-blocking Task Queue; Task Stealing; Static Task List
  14. Blocking queue (diagram labels: Free, Head, Tail; TB 1, TB 2, TB n)
  15. Blocking queue (diagram labels: Free, Head, Tail; TB 1, TB 2, TB n)
  16. Blocking queue (diagram labels: Free, Head, Tail; T1; TB 1, TB 2, TB n)
  17. Blocking queue (diagram labels: Free, Head, Tail; T1; TB 1, TB 2, TB n)
  18. Blocking queue (diagram labels: Free, Head, Tail; T1; TB 1, TB 2, TB n)
  19. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4; TB 1, TB 2, TB n). Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
  20. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4; TB 1, TB 2, TB n)
  21. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4; TB 1, TB 2, TB n)
  22. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4; TB 1, TB 2, TB n)
  23. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4, T5; TB 1, TB 2, TB n)
  24. Non-blocking Queue (diagram labels: Head, Tail; T1, T2, T3, T4, T5; TB 1, TB 2, TB n)
  25. Task stealing (diagram labels: T1, TB 1; T3, T2, TB 2; TB n). Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
  26. Task stealing (diagram labels: T1, T4, TB 1; T3, T2, TB 2; TB n)
  27. Task stealing (diagram labels: T1, T4, T5, TB 1; T3, T2, TB 2; TB n)
  28. Task stealing (diagram labels: T1, T4, TB 1; T3, T2, TB 2; TB n)
  29. Task stealing (diagram labels: T1, TB 1; T3, T2, TB 2; TB n)
  30. Task stealing (diagram labels: TB 1; T3, T2, TB 2; TB n)
  31. Task stealing (diagram labels: TB 1; T2, TB 2; TB n)
  32. Static Task List (diagram labels: In; T1, T2, T3, T4)
  33. Static Task List (diagram labels: In; T1, TB 1; T2, TB 2; T3, TB 3; T4, TB 4)
  34. Static Task List (diagram labels: In, Out; T1, TB 1; T2, TB 2; T3, TB 3; T4, TB 4)
  35. Static Task List (diagram labels: In, Out; T1, T5, TB 1; T2, TB 2; T3, TB 3; T4, TB 4)
  36. Static Task List (diagram labels: In, Out; T1, T5, TB 1; T2, T6, TB 2; T3, TB 3; T4, TB 4)
  37. Static Task List (diagram labels: In, Out; T1, T5, TB 1; T2, T6, TB 2; T3, T7, TB 3; T4, TB 4)
  38. Octree Partitioning: Bandwidth bound
  39. Octree Partitioning: Bandwidth bound
  40. Octree Partitioning: Bandwidth bound
  41. Octree Partitioning: Bandwidth bound
  42. Four-in-a-row: Computation intensive
  43. Graphics Processors. 8800GT: 14 multiprocessors, 57 GB/sec bandwidth. 9600GT: 8 multiprocessors, 57 GB/sec bandwidth.
  44. Blocking Queue – Octree/9600GT
  45. Blocking Queue – Octree/8800GT
  46. Blocking Queue – Four-in-a-row
  47. Non-blocking Queue – Octree/9600GT
  48. Non-blocking Queue – Octree/8800GT
  49. Non-blocking Queue – Four-in-a-row
  50. Task stealing – Octree/9600GT
  51. Task stealing – Octree/8800GT
  52. Task stealing – Four-in-a-row
  53. Static List
  54. Octree Comparison
  55. Previous work. Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003. Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998. Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005.
  56. Conclusion: Synchronization plays a significant role in dynamic load balancing. Lock-free data structures and synchronization scale well and also look promising for general-purpose GPU programming. Locks perform poorly. It is good that operations such as CAS and FAA have been introduced in the new GPUs. Work stealing could outperform static load balancing.
  57. Thank you! http://www.cs.chalmers.se/~dcs
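
As a rough CPU-side illustration of the task stealing scheme on slides 25 to 31, the sketch below gives each worker its own double-ended queue: a worker pushes and pops at the back of its own deque and, when that is empty, tries to steal from the front of another worker's deque. It is written in C++ with `std::thread` standing in for thread blocks, and it protects each deque with a mutex to stay short, so it is not the lock-free deque of Arora, Blumofe and Plaxton cited on slide 25. The `Task` type, the binary-tree workload, and the names `WorkerDeque` and `run_work_stealing` are assumptions made for this example only.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical task: a node of a binary tree, identified only by its depth.
struct Task { int depth; };

// One deque per worker. A mutex keeps the sketch short; the deque cited on
// the slide is lock-free instead.
struct WorkerDeque {
    std::mutex m;
    std::deque<Task> d;

    void push(Task t) {
        std::lock_guard<std::mutex> g(m);
        d.push_back(t);
    }

    // The owner takes its newest task from the back.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        out = d.back();
        d.pop_back();
        return true;
    }

    // A thief takes the oldest task from the front.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        out = d.front();
        d.pop_front();
        return true;
    }
};

// Processes a full binary tree of depth `max_depth` with `workers` threads
// and returns the number of tasks performed.
int run_work_stealing(int workers, int max_depth) {
    std::vector<WorkerDeque> deques(workers);
    std::atomic<int> pending{1};   // tasks created but not yet finished
    std::atomic<int> performed{0};
    deques[0].push(Task{0});       // all work starts on worker 0

    auto body = [&](int id) {
        while (pending.load() > 0) {
            Task t;
            bool got = deques[id].pop(t);
            // Own deque empty: try to steal from the other workers in turn.
            for (int k = 1; !got && k < workers; ++k)
                got = deques[(id + k) % workers].steal(t);
            if (!got) { std::this_thread::yield(); continue; }

            performed.fetch_add(1);
            if (t.depth < max_depth) {
                pending.fetch_add(2);  // count the children before publishing them
                deques[id].push(Task{t.depth + 1});
                deques[id].push(Task{t.depth + 1});
            }
            pending.fetch_sub(1);      // this task is finished
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) pool.emplace_back(body, i);
    for (auto& th : pool) th.join();
    return performed.load();
}
```

Calling `run_work_stealing(4, 3)` performs 15 tasks, one per node of a depth-3 binary tree (1 + 2 + 4 + 8). The `pending` counter plays the role of the "Work done?" check on slide 10: it is raised before children are published and lowered only after a task completes, so it reaches zero only when no work remains.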

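
The static task list on slides 32 to 37 can be sketched in a similar spirit. In each round, every task in the `In` array is handled by its own worker, new tasks are appended to the `Out` array by reserving slots with a fetch-and-add on a shared counter, and the two arrays are swapped before the next round. This is a minimal C++ sketch under stated assumptions: `std::thread` stands in for a thread block, one loop iteration stands in for one kernel launch, and the `Task` type, the binary-tree workload, and the name `run_static_task_list` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical task: a node of a binary tree, identified only by its depth.
struct Task { int depth; };

// Runs rounds until no tasks remain and returns the total number performed.
int run_static_task_list(int max_depth) {
    std::vector<Task> in{Task{0}};  // the "In" array of the current round
    int performed = 0;

    while (!in.empty()) {
        // Each task emits at most two new tasks, so this capacity suffices.
        std::vector<Task> out(2 * in.size());
        std::atomic<std::size_t> out_count{0};

        // One worker per task in this round.
        std::vector<std::thread> round;
        for (std::size_t i = 0; i < in.size(); ++i) {
            round.emplace_back([&, i] {
                Task t = in[i];
                if (t.depth < max_depth) {
                    // Reserve two consecutive slots in "Out" atomically.
                    std::size_t slot = out_count.fetch_add(2);
                    out[slot] = Task{t.depth + 1};
                    out[slot + 1] = Task{t.depth + 1};
                }
            });
        }
        for (auto& th : round) th.join();  // end of the round

        performed += static_cast<int>(in.size());
        out.resize(out_count.load());
        in = std::move(out);               // "Out" becomes the next "In"
    }
    return performed;
}
```

With `run_static_task_list(3)` the rounds process 1, 2, 4 and 8 tasks, 15 in total. The only synchronization inside a round is the atomic counter; workers never wait for one another until the round ends.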