Distribute the work between threads in order to put as many cores as possible to work and maximize throughput
[Slide diagram: a workload is split into work items of 1KB and more instructions each, distributed to the cores by frameworks such as OpenMP, TPL, and Parallel Studio]
Why? The Free Lunch is Over
The number of transistors never stopped climbing; however, clock speed stopped somewhere near 3GHz.
The Solution: Re-Enable the Free Lunch
Use the Thread-Pool to execute your work asynchronously. Add a concurrency control mechanism that adjusts the number of work items thrown into the pool according to the workload and the machine architecture, in order to put the maximum number of cores to work with minimal contention.
How many callbacks to put in the pool? How to separate the work? (A minimal sketch follows.)
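As an illustration, here is a minimal sketch of partitioning work into Thread-Pool callbacks. The partition count, the ProcessChunk helper, and the use of CountdownEvent are assumptions made for the example, not something prescribed by the deck:

```csharp
using System;
using System.Threading;

class PoolDemo
{
    static void Main()
    {
        // Hypothetical choice: one chunk per core keeps every core busy
        // without flooding the pool's queue.
        int partitions = Environment.ProcessorCount;
        using (var done = new CountdownEvent(partitions))
        {
            for (int i = 0; i < partitions; i++)
            {
                int chunk = i; // capture a stable copy for the closure
                // Each callback should be coarse enough ("1KB and more
                // instructions") to amortize the cost of queuing it.
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    ProcessChunk(chunk);
                    done.Signal();
                });
            }
            done.Wait(); // block until every partition has completed
        }
    }

    // Hypothetical helper standing in for the real per-chunk work.
    static void ProcessChunk(int chunk) { /* ... */ }
}
```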
The Future: Lock-Free Thread-Pool
The increasing number of work items and worker threads results in problematic contention on the pool. Instead of using a linked list, use the array-style, lock-free, GC-friendly ConcurrentQueue<T> class.
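A short sketch of what using ConcurrentQueue<T> directly looks like; the choice of Action as the work-item type is an assumption for the example:

```csharp
using System;
using System.Collections.Concurrent;

class QueueDemo
{
    static void Main()
    {
        // The array-based ConcurrentQueue<T> replaces a linked-list queue:
        // enqueue/dequeue are lock-free and its segments are GC-friendly.
        var workItems = new ConcurrentQueue<Action>();

        // Any thread can enqueue without taking a lock.
        workItems.Enqueue(() => Console.WriteLine("work item"));

        // TryDequeue never blocks; it fails fast when the queue is empty.
        Action work;
        if (workItems.TryDequeue(out work))
            work();
    }
}
```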
The Future: Work-Stealing Queues
Each worker thread in the pool has its own private work-stealing queue (WSQ). A WSQ has two ends: it allows lock-free pushes and pops from one end (the "private" end), but requires synchronization on the other end (the "public" end).
- When work is queued from a non-pool thread, it goes into the global queue; a worker thread is created or assigned to grab work from the global queue.
- When work is queued from a pool worker thread, it goes into that thread's WSQ, most of the time avoiding all locking.
- A worker thread grabs work from its own WSQ in a LIFO fashion, avoiding all locking.
- Worker threads steal work from other WSQs in a FIFO fashion; synchronization is required.
The sketch below illustrates this two-ended policy.
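A minimal sketch of the policy just described. The WorkStealingQueue type and its members are hypothetical, and for simplicity it takes a lock on both ends; the real pool's WSQ is lock-free on the private end:

```csharp
using System.Collections.Generic;

// Hypothetical sketch of a work-stealing queue. It demonstrates only the
// LIFO-pop / FIFO-steal policy; unlike the real pool's WSQ, it locks on
// both ends rather than being lock-free on the private end.
class WorkStealingQueue<T>
{
    private readonly LinkedList<T> _items = new LinkedList<T>();
    private readonly object _gate = new object();

    // Private end: only the owning worker thread pushes here.
    public void LocalPush(T item)
    {
        lock (_gate) _items.AddLast(item);
    }

    // Owner pops the most recently pushed item (LIFO): it is the work
    // most likely to still be hot in that core's cache.
    public bool LocalPop(out T item)
    {
        lock (_gate)
        {
            if (_items.Count == 0) { item = default(T); return false; }
            item = _items.Last.Value;
            _items.RemoveLast();
            return true;
        }
    }

    // Public end: other workers steal the oldest item (FIFO), so cold
    // work migrates to an idle core.
    public bool TrySteal(out T item)
    {
        lock (_gate)
        {
            if (_items.Count == 0) { item = default(T); return false; }
            item = _items.First.Value;
            _items.RemoveFirst();
            return true;
        }
    }
}
```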
The Task Parallel Library (TPL) aims to lower the cost of fine-grained parallelism by executing asynchronous work (tasks) in a way that fits the number of available cores, and by giving developers more control over how tasks get scheduled and executed.
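For example, Parallel.For and Task from the TPL; the data array and the per-element work here are placeholders invented for the example:

```csharp
using System;
using System.Threading.Tasks;

class TplDemo
{
    static void Main()
    {
        var data = new int[1000]; // hypothetical input

        // Parallel.For partitions the range across the pool's worker
        // threads, scaling with the number of available cores.
        Parallel.For(0, data.Length, i =>
        {
            data[i] = data[i] * 2; // stand-in for real per-element work
        });

        // Tasks expose scheduling and composition directly.
        Task<long> sum = Task.Factory.StartNew(() =>
        {
            long total = 0;
            foreach (int x in data) total += x;
            return total;
        });
        sum.ContinueWith(t => Console.WriteLine(t.Result)).Wait();
    }
}
```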
Software Transactional Memory (STM): wrap all the code that accesses shared memory in a transaction and let the runtime execute it atomically and in isolation, doing the appropriate synchronization behind the scenes. In that way it maintains a single-lock illusion.
With STM we can write code that looks sequential and can be reasoned about sequentially, yet runs concurrently in a scalable fashion.
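A sketch of what STM-style code might look like. The Atomic.Do API below is hypothetical (the experimental STM.NET exposed a similar delegate-based atomic block), and a single global lock stands in for the real optimistic runtime:

```csharp
using System;

// Hypothetical STM-style API; this sketch only simulates the semantics.
static class Atomic
{
    private static readonly object _globalLock = new object();

    // A real STM runtime would run the body optimistically and retry it
    // on conflict; the single global lock used here is exactly the
    // "single lock illusion" the transaction must preserve.
    public static void Do(Action body)
    {
        lock (_globalLock) body();
    }
}

class Account { public decimal Balance; }

class StmDemo
{
    static void Main()
    {
        var from = new Account { Balance = 100m };
        var to = new Account();

        // The transfer reads as plain sequential code, yet it is safe
        // to run concurrently with any other Atomic.Do block.
        Atomic.Do(() =>
        {
            from.Balance -= 25m;
            to.Balance += 25m;
        });

        Console.WriteLine(from.Balance + " / " + to.Balance);
    }
}
```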