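Task sharing
[Flowchart: every thread block repeats the following loop against a shared Task Set.]
Check condition: is the work done? If so, the block is done.
Acquire task: try to get a task from the Task Set. Got a task? If not, retry.
Perform task: did it produce new tasks? If not, continue with the next check.
Add task: add the new tasks to the Task Set.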
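The loop above can be sketched on the CPU as follows. This is a minimal single-worker illustration, not the presenters' code: `Task`, `TaskSet`, `kLeafSize` and `run_worker` are hypothetical names, and the "work" is simply splitting an index range until the pieces are small enough to process.

```cpp
#include <cassert>
#include <vector>

// Hypothetical task: a half-open range [begin, end) of work items.
struct Task { int begin; int end; };

// Assumed threshold: ranges larger than this are split into two new tasks.
constexpr int kLeafSize = 4;

// Hypothetical shared task set. It is NOT thread-safe; a real task set
// would need one of the synchronization schemes discussed in this talk.
class TaskSet {
public:
    void add(Task t) { tasks_.push_back(t); }
    bool try_get(Task& out) {
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    bool empty() const { return tasks_.empty(); }
private:
    std::vector<Task> tasks_;
};

// Runs the acquire / perform / add loop and returns the number of items processed.
int run_worker(TaskSet& set) {
    int processed = 0;
    for (;;) {
        Task t;
        if (!set.try_get(t)) {
            // With a single worker an empty set means the work is done.
            // With several workers this branch would retry until a global
            // "work done" condition holds.
            break;
        }
        assert(t.begin <= t.end);
        int size = t.end - t.begin;
        if (size > kLeafSize) {
            // Performing the task produced new tasks: add them to the set.
            int mid = t.begin + size / 2;
            set.add({t.begin, mid});
            set.add({mid, t.end});
        } else {
            processed += size;  // leaf task: do the actual work
        }
    }
    return processed;
}
```

Seeding the set with one task covering 100 items and calling `run_worker` processes all 100 items and leaves the set empty.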
System Model
CUDA global memory: gather and scatter; Compare-And-Swap and Fetch-And-Inc.
Multiprocessors: maximum number of concurrent thread blocks.
[Diagram: one global memory shared by several multiprocessors, each running several thread blocks.]
Synchronization
Blocking: uses mutual exclusion to allow only one process at a time to access the object.
Lock-free: multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
Wait-free: multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
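To make the three categories concrete, here is a small host-side sketch of a shared counter written in each style. It uses C++ `std::atomic` as a stand-in for the GPU atomics; the class names are hypothetical, and the wait-free variant assumes the hardware executes fetch-and-add as a single atomic instruction.

```cpp
#include <atomic>
#include <cassert>

// Blocking: a spinlock gives one caller at a time exclusive access.
// If the lock holder is delayed, everyone else waits.
class BlockingCounter {
public:
    int increment() {
        while (lock_.test_and_set(std::memory_order_acquire)) { /* spin */ }
        int result = ++value_;
        lock_.clear(std::memory_order_release);
        return result;
    }
private:
    std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
    int value_ = 0;
};

// Lock-free: a compare-and-swap retry loop. A caller only retries when
// another caller's CAS succeeded, so some operation always makes progress.
class LockFreeCounter {
public:
    int increment() {
        int old = value_.load();
        while (!value_.compare_exchange_weak(old, old + 1)) {
            // On failure `old` now holds the current value; try again.
        }
        return old + 1;
    }
private:
    std::atomic<int> value_{0};
};

// Wait-free (under the assumption stated above): one fetch-and-add,
// no retry loop, so every caller finishes in a bounded number of steps.
class WaitFreeCounter {
public:
    int increment() { return value_.fetch_add(1) + 1; }
private:
    std::atomic<int> value_{0};
};
```

All three return the new counter value; they differ only in what happens to other callers when one caller is delayed in the middle of the operation.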
Non-blocking Queue
[Animation, 6 frames: thread blocks TB 1, TB 2, ..., TB n share a single queue with Head and Tail pointers. The queue holds tasks T1 to T4; in the last two frames a new task T5 appears at the Tail end.]
Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA 01]
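As a rough illustration of a shared array queue whose head and tail are advanced with compare-and-swap, here is a bounded queue sketch. It is deliberately NOT the Tsigas-Zhang algorithm cited on the slide: it is a simpler, well-known design that keeps a sequence number in each slot, and strictly speaking it is not lock-free, because an enqueuer that stalls between claiming a slot and publishing it can delay dequeuers. The class name `BoundedTaskQueue` and the use of `int` as the task type are assumptions for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

class BoundedTaskQueue {
public:
    // Capacity must be a power of two and at least 2.
    explicit BoundedTaskQueue(std::size_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
        for (std::size_t i = 0; i < capacity; ++i) slots_[i].seq.store(i);
    }

    // Returns false if the queue is full.
    bool try_enqueue(int task) {
        std::size_t pos = tail_.load();
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // Slot is free for this position: claim it by advancing the tail.
                if (tail_.compare_exchange_weak(pos, pos + 1)) {
                    s.task = task;
                    s.seq.store(pos + 1, std::memory_order_release);  // publish
                    return true;
                }
            } else if (diff < 0) {
                return false;        // slot still holds an unread task: full
            } else {
                pos = tail_.load();  // another enqueuer got ahead; reload
            }
        }
    }

    // Returns false if the queue is empty.
    bool try_dequeue(int& task) {
        std::size_t pos = head_.load();
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos + 1);
            if (diff == 0) {
                // Slot holds a published task: claim it by advancing the head.
                if (head_.compare_exchange_weak(pos, pos + 1)) {
                    task = s.task;
                    // Mark the slot free for the enqueuer one lap later.
                    s.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;        // nothing published here yet: empty
            } else {
                pos = head_.load();  // another dequeuer got ahead; reload
            }
        }
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq;
        int task;
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

With a capacity of 4, enqueuing T1 to T4 succeeds, a fifth enqueue is rejected until one task has been dequeued, and tasks come out in FIFO order.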
Task stealing
[Animation, 7 frames: each thread block has its own task list. TB 1 starts with T1 and TB 2 with T3 and T2. TB 1's list grows to T1, T4, T5 and then shrinks again until it is empty; in the final frame T3 is gone from TB 2's list.]
Reference: N. S. Arora, R. D. Blumofe and C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
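The per-block task list used for stealing can be sketched as a double-ended queue in the style of the cited Arora, Blumofe and Plaxton paper: the owner pushes and pops at the bottom, thieves take from the top, and a tag packed next to the top index guards against stale compare-and-swaps after the deque has been reset. This is a host-side approximation with default (sequentially consistent) atomics, not the presenters' GPU code; `WorkStealingDeque`, the `int` task type and the sentinel values are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class WorkStealingDeque {
public:
    enum : int { kEmpty = -1, kAbort = -2 };  // assumed sentinels; tasks are >= 0

    explicit WorkStealingDeque(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Owner only. Space is reclaimed only when the deque becomes empty.
    bool push_bottom(int task) {
        std::uint32_t b = bottom_.load();
        if (b >= slots_.size()) return false;
        slots_[b].store(task, std::memory_order_relaxed);
        bottom_.store(b + 1);
        return true;
    }

    // Any other block. Returns kAbort if it lost a race and should retry.
    int steal() {
        std::uint64_t old_age = age_.load();
        std::uint32_t b = bottom_.load();
        std::uint32_t top = top_of(old_age);
        if (b <= top) return kEmpty;
        int task = slots_[top].load(std::memory_order_relaxed);
        std::uint64_t new_age = make_age(tag_of(old_age), top + 1);
        if (age_.compare_exchange_strong(old_age, new_age)) return task;
        return kAbort;
    }

    // Owner only.
    int pop_bottom() {
        std::uint32_t b = bottom_.load();
        if (b == 0) return kEmpty;
        --b;
        bottom_.store(b);
        int task = slots_[b].load(std::memory_order_relaxed);
        std::uint64_t old_age = age_.load();
        std::uint32_t top = top_of(old_age);
        if (b > top) return task;  // more than one task was left: no conflict
        // At most one task was left: reset the deque and bump the tag.
        bottom_.store(0);
        std::uint64_t new_age = make_age(tag_of(old_age) + 1, 0);
        if (b == top && age_.compare_exchange_strong(old_age, new_age)) {
            return task;           // won the race for the last task
        }
        age_.store(new_age);       // a thief took the last task
        return kEmpty;
    }

private:
    static std::uint32_t top_of(std::uint64_t age) { return static_cast<std::uint32_t>(age); }
    static std::uint32_t tag_of(std::uint64_t age) { return static_cast<std::uint32_t>(age >> 32); }
    static std::uint64_t make_age(std::uint32_t tag, std::uint32_t top) {
        return (static_cast<std::uint64_t>(tag) << 32) | top;
    }

    std::vector<std::atomic<int>> slots_;
    std::atomic<std::uint32_t> bottom_{0};
    std::atomic<std::uint64_t> age_{0};  // high 32 bits: tag, low 32 bits: top
};
```

After the owner pushes 1, 2 and 3, `pop_bottom` returns 3 (the newest task) while `steal` returns 1 (the oldest), so owner and thieves mostly work at opposite ends and only contend for the last remaining task.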
Static Task List
[Animation, 6 frames: an In array holds T1 to T4, and each task is assigned to one thread block (T1 to TB 1, T2 to TB 2, T3 to TB 3, T4 to TB 4). An Out array then appears and new tasks T5, T6 and T7 are added to it, shown next to TB 1, TB 2 and TB 3.]
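One way to read the animation is as a two-array scheme: block i reads task i from a read-only In array, new tasks are appended to an Out array by reserving a slot with fetch-and-add, and the arrays are swapped between passes. The sketch below follows that reading on the host; it is an interpretation rather than the presenters' implementation, and `StaticTaskList`, `task_for`, `add` and `next_pass` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class StaticTaskList {
public:
    StaticTaskList(std::vector<int> initial, std::size_t out_capacity)
        : in_(std::move(initial)), out_(out_capacity), out_count_(0) {}

    // Number of tasks in the current pass (one per block).
    std::size_t size() const { return in_.size(); }

    // Block `block_index` simply reads its own entry; no synchronization needed.
    int task_for(std::size_t block_index) const {
        assert(block_index < in_.size());
        return in_[block_index];
    }

    // Any block may add a new task: fetch-and-add reserves a unique slot.
    // Returns false if the Out array is full.
    bool add(int task) {
        std::size_t slot = out_count_.fetch_add(1);
        if (slot >= out_.size()) return false;
        out_[slot] = task;
        return true;
    }

    // Call between passes, when no block is running: Out becomes the new In.
    void next_pass() {
        std::size_t capacity = out_.size();
        std::size_t produced = std::min(out_count_.load(), capacity);
        out_.resize(produced);
        in_.swap(out_);
        out_.assign(capacity, 0);
        out_count_.store(0);
    }

private:
    std::vector<int> in_;
    std::vector<int> out_;
    std::atomic<std::size_t> out_count_;
};
```

Starting from tasks 1 to 4, adding 5, 6 and 7 during the first pass and then calling `next_pass` yields a second pass of three tasks. The only contended operation is the single fetch-and-add in `add`.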
Previous work
Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003
Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998
Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005
Conclusion
Synchronization plays a significant role in dynamic load balancing.
Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming too.
Locks perform poorly.
Atomic operations such as CAS and FAA, introduced in the new GPUs, are a welcome addition.
Work stealing could outperform static load balancing.