
Dynamic Load-balancing On Graphics Processors

  1. On Dynamic Load Balancing on Graphics Processors. Daniel Cederman and Philippas Tsigas, Chalmers University of Technology.
  2. Overview: Motivation, Methods, Experimental evaluation, Conclusion.
  3. The problem setting (diagram: work divided into tasks, either offline, with all tasks known in advance, or online, with tasks appearing while the work is processed).
  4.–8. Static Load Balancing (animation: four processors are each statically assigned one task; subtasks created during execution stay with the processor that created them).
  9. Dynamic Load Balancing (diagram: the same processors, tasks, and subtasks, with subtasks now redistributed across processors at runtime).
  10. Task sharing (flowchart: check whether all work is done; if not, try to get a task from the shared task set, retrying until one is obtained; perform the task; add any newly created tasks back to the task set; repeat). A CUDA sketch of this loop follows below.
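
The loop on slide 10 maps naturally onto CUDA. Below is a minimal sketch of that loop, not the authors' code: for illustration the task set is just an input array drained with Fetch-And-Inc plus a separate output array for new tasks; the following slides are about replacing this simple task set with a blocking queue, a non-blocking queue, task stealing, or a static task list.

```cuda
// One representative thread per thread block runs the slide-10 loop.
// The "task set" here is deliberately simple: inTasks[0..inCount) is drained
// via Fetch-And-Inc, and new tasks are appended to a separate output array.
__global__ void workLoop(const int *inTasks, int inCount, int *head,
                         int *outTasks, int *outCount)
{
    if (threadIdx.x != 0) return;              // in a real kernel the whole block
                                               // would cooperate on each task
    while (true) {
        int h = atomicAdd(head, 1);            // "Try to get task"
        if (h >= inCount)                      // "Work done?": input drained
            return;
        int task = inTasks[h];

        // "Perform task": placeholder work that sometimes spawns a subtask.
        if (task > 1) {
            int slot = atomicAdd(outCount, 1); // "Add task": publish the new task
            outTasks[slot] = task / 2;
        }
    }
}
```

With this trivial task set, failing to get a task is the same as all work being done; with the queues and stealing methods on the later slides, the "got task?" and "work done?" checks are separate.
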
  11. System Model: CUDA exposes a global memory with gather and scatter access and with Compare-And-Swap and Fetch-And-Inc atomic primitives; the GPU consists of multiprocessors, each running up to a maximum number of concurrent thread blocks.
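
The two primitives named on the slide correspond directly to CUDA's atomicCAS and atomicAdd on global memory. A tiny self-contained example (my illustration, not from the slides):

```cuda
#include <cuda_runtime.h>

// Fetch-And-Inc and Compare-And-Swap on global memory.
__global__ void primitives(int *counter, int *flag, int *winner)
{
    int ticket = atomicAdd(counter, 1);   // Fetch-And-Inc: returns the previous value
    if (atomicCAS(flag, 0, 1) == 0)       // Compare-And-Swap: exactly one thread
        *winner = ticket;                 // sees the 0 -> 1 transition succeed
}

int main()
{
    int *counter, *flag, *winner;
    cudaMalloc(&counter, sizeof(int));
    cudaMalloc(&flag, sizeof(int));
    cudaMalloc(&winner, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));

    primitives<<<4, 64>>>(counter, flag, winner);   // a few concurrent thread blocks
    cudaDeviceSynchronize();

    cudaFree(counter); cudaFree(flag); cudaFree(winner);
    return 0;
}
```
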
  12. Synchronization. Blocking: uses mutual exclusion so that only one process at a time can access the object. Lock-free: multiple processes can access the object concurrently, and at least one operation in a set of concurrent operations finishes in a finite number of its own steps. Wait-free: multiple processes can access the object concurrently, and every operation finishes in a finite number of its own steps.
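
To make the distinction concrete, here is a small illustrative contrast (my sketch, not from the slides): the blocking version serializes thread blocks behind a spinlock built from Compare-And-Swap, while the lock-free version completes the same update in a single atomic step, so some block always makes progress.

```cuda
__device__ int lock = 0;                            // 0 = free, 1 = taken

// Blocking: mutual exclusion; the block holding the lock makes all others wait.
__global__ void blockingIncrement(volatile int *value)
{
    if (threadIdx.x == 0) {                         // one thread per block
        while (atomicCAS(&lock, 0, 1) != 0) { }     // spin until the lock is free
        __threadfence();
        *value = *value + 1;                        // critical section
        __threadfence();                            // publish before releasing
        atomicExch(&lock, 0);
    }
}

// Lock-free (here even wait-free): one atomic step, no waiting on other blocks.
__global__ void lockFreeIncrement(int *value)
{
    if (threadIdx.x == 0)
        atomicAdd(value, 1);
}
```
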
  13. Load Balancing Methods: Blocking Task Queue, Non-blocking Task Queue, Task Stealing, Static Task List.
  14.–18. Blocking queue (animation: a single shared queue with head and tail pointers and a lock; thread blocks TB 1 ... TB n take the lock one at a time to remove a task at the head or to add a task, such as T1, at the tail). A sketch of such a lock-protected queue follows below.
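
A guess at what a queue of this shape can look like, not the authors' implementation: one representative thread per block takes a single global lock before touching the head or the tail (bounds checks omitted).

```cuda
// Illustrative blocking task queue: one lock guards both ends.
struct BlockingQueue {
    int *lock;                 // 0 = free, 1 = taken
    volatile int *head;        // index of the next task to remove
    volatile int *tail;        // index of the next free slot
    volatile int *tasks;       // task identifiers in global memory
};

__device__ void acquire(int *lock) { while (atomicCAS(lock, 0, 1) != 0) { } __threadfence(); }
__device__ void release(int *lock) { __threadfence(); atomicExch(lock, 0); }

// Called by one thread of a thread block ("TB 1 ... TB n" on the slides).
__device__ bool dequeue(BlockingQueue q, int *out)
{
    bool gotTask = false;
    acquire(q.lock);
    if (*q.head < *q.tail) {               // queue not empty
        *out = q.tasks[*q.head];           // take the task at the head
        *q.head = *q.head + 1;
        gotTask = true;
    }
    release(q.lock);
    return gotTask;
}

__device__ void enqueue(BlockingQueue q, int task)
{
    acquire(q.lock);
    q.tasks[*q.tail] = task;               // add the new task at the tail
    *q.tail = *q.tail + 1;
    release(q.lock);
}
```
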
  19.–24. Non-blocking Queue (animation: an array-based queue holding tasks T1 ... T5 with head and tail pointers that thread blocks TB 1 ... TB n update concurrently using Compare-And-Swap). Reference: P. Tsigas and Y. Zhang, "A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems," SPAA 2001. A sketch of the lock-free dequeue idea follows below.
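
The cited Tsigas and Zhang queue is more involved than fits here; the sketch below only illustrates the lock-free flavor of the dequeue side, with thread blocks racing on a shared head index over an already-filled task array. It is not the SPAA 2001 algorithm itself; in particular, concurrent enqueue needs the extra machinery described in the paper.

```cuda
// Lock-free task acquisition from a pre-filled task array (illustrative only).
// Competing thread blocks race with Compare-And-Swap on the shared head index;
// when several collide, at least one wins, so the system always makes progress.
__device__ bool tryDequeue(int *head, const int *tail, const int *tasks, int *out)
{
    int h = *head;                               // may be slightly stale; CAS corrects it
    while (h < *tail) {                          // tasks remaining?
        int prev = atomicCAS(head, h, h + 1);    // try to claim slot h
        if (prev == h) {
            *out = tasks[h];                     // we own slot h
            return true;
        }
        h = prev;                                // lost the race: retry with the value
    }                                            // reported by the failed CAS
    return false;                                // queue empty
}
```
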
  25.–31. Task stealing (animation: each thread block TB 1 ... TB n keeps its own set of tasks, adding and removing at one end, e.g. T4 and T5; a block that runs out of work steals a task such as T2 or T3 from the other end of another block's set). Reference: N. S. Arora, R. D. Blumofe, C. G. Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors," SPAA 1998. A schematic sketch of push and steal follows below.
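
The reference is the Arora, Blumofe, and Plaxton work-stealing deque. Below is only a schematic sketch of two of its operations, the owner's push and a thief's steal, to show where Compare-And-Swap comes in; the owner's pop and the ABA and bounds handling that the real algorithm requires are omitted.

```cuda
// Schematic work-stealing deque, one per thread block (sketch only).
struct Deque {
    int *tasks;        // this block's task array in global memory
    int *top;          // steal end (oldest task)
    int *bottom;       // owner end (newest task)
};

// Owner: the block that owns the deque adds a new task at the bottom.
__device__ void push(Deque d, int task)
{
    int b = *d.bottom;
    d.tasks[b] = task;          // write the task first...
    __threadfence();            // ...and make it visible device-wide...
    *d.bottom = b + 1;          // ...before publishing the new bottom
}

// Thief: a block that ran out of work takes the oldest task of another block.
__device__ bool steal(Deque d, int *out)
{
    int t = *d.top;
    if (t >= *d.bottom)                        // deque looks empty
        return false;
    if (atomicCAS(d.top, t, t + 1) == t) {     // claim the task at the top
        *out = d.tasks[t];
        return true;
    }
    return false;                              // another thief got there first
}
```
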
  32.–37. Static Task List (animation: an "in" array holds tasks T1 ... T4, one per thread block TB 1 ... TB 4; newly created tasks T5 ... T7 are written to a separate "out" array, which serves as the input of the next pass). A sketch of this two-array pattern follows below.
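
A sketch of the two-array pattern these slides suggest (assumed structure and names, not the authors' code): each pass reads one task per thread block from the in array, appends any newly created tasks to the out array with Fetch-And-Inc, and the host swaps the two arrays between kernel launches, so no locks or CAS retry loops are needed while a pass is running.

```cuda
#include <algorithm>   // std::swap
#include <cuda_runtime.h>

// One pass: thread block i processes in[i] and may append subtasks to out[].
__global__ void processTasks(const int *in, int *out, int *outCount)
{
    if (threadIdx.x != 0) return;                 // one representative thread per block
    int task = in[blockIdx.x];
    // Placeholder work: a task spawns one smaller subtask until it reaches zero.
    if (task > 0)
        out[atomicAdd(outCount, 1)] = task - 1;   // Fetch-And-Inc claims a slot in out[]
}

// Host side: relaunch until a pass produces no new tasks.
void run(int *d_in, int *d_out, int *d_outCount, int numTasks, int threadsPerBlock)
{
    while (numTasks > 0) {
        cudaMemset(d_outCount, 0, sizeof(int));
        processTasks<<<numTasks, threadsPerBlock>>>(d_in, d_out, d_outCount);
        cudaMemcpy(&numTasks, d_outCount, sizeof(int), cudaMemcpyDeviceToHost);
        std::swap(d_in, d_out);                // this pass's "out" is the next pass's "in"
    }
}
```
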
  38.–41. Octree Partitioning, a bandwidth-bound application (animation).
  42. Four-in-a-row, a computation-intensive application.
  43. Graphics Processors: 8800GT with 14 multiprocessors and 57 GB/s bandwidth; 9600GT with 8 multiprocessors and 57 GB/s bandwidth.
  44. Blocking Queue – Octree/9600GT (graph).
  45. Blocking Queue – Octree/8800GT (graph).
  46. Blocking Queue – Four-in-a-row (graph).
  47. Non-blocking Queue – Octree/9600GT (graph).
  48. Non-blocking Queue – Octree/8800GT (graph).
  49. Non-blocking Queue – Four-in-a-row (graph).
  50. Task stealing – Octree/9600GT (graph).
  51. Task stealing – Octree/8800GT (graph).
  52. Task stealing – Four-in-a-row (graph).
  53. Static List (graph).
  54. Octree Comparison (graph).
  55. Previous work:
      M. Korch, T. Rauber, "A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms," Concurrency and Computation: Practice and Experience, 16, 2003.
      A. Heirich, J. Arvo, "A Competitive Analysis of Load Balancing Strategies for Parallel Ray Tracing," The Journal of Supercomputing, 12, 1998.
      T. Foley, J. Sugerman, "KD-Tree Acceleration Structures for a GPU Raytracer," Graphics Hardware 2005.
  56. Conclusion: Synchronization plays a significant role in dynamic load balancing. Lock-free data structures and synchronization scale well and look promising for general-purpose programming on GPUs as well. Locks perform poorly. It is good that operations such as CAS and FAA have been introduced in the new GPUs. Work stealing could outperform static load balancing.
  57. Thank you! http://www.cs.chalmers.se/~dcs
