Building Scalable Producer-Consumer Pools based on Elimination-Diffraction Trees

We present the ED-Tree, a distributed pool structure based on a combination of the elimination-tree and diffracting-tree paradigms, allowing high degrees of parallelism with reduced contention


Usage Rights

© All Rights Reserved

  • The theme is building a data structure that serves as a pool, one that scales under high loads and performs no worse than existing implementations under low loads.
  • What is a pool? A collection of items, which may be objects or tasks: a resource pool, whose objects are used and then returned to the pool; a pool of jobs to perform; and so on. Producers and consumers approach the pool and perform Put/Get (Push/Pop, Enqueue/Dequeue) actions. These actions can implement different semantics, blocking or non-blocking, depending on how the pool is defined. (Explanation of blocking/non-blocking.)
  • The data structure we present is called the ED-Tree, a highly scalable pool to be used in multithreaded applications. We reach high performance and scalability by combining two paradigms: elimination and diffraction. The ED-Tree is implemented in Java.
  • If we look in the Java JDK for data structures that can be used as a pool, we find the following…
  • All the mentioned data structures are problematic: they are based on centralized structures. The head or tail of the queue/stack becomes a hot spot, and with a large number of threads performance gets worse instead of improving.
  • If we think about it, we don’t care about the order in which items are inserted into or removed from the pool. All we want is to avoid starvation (if an item is inserted into the pool, it will eventually be removed). Therefore we can avoid using a centralized structure and distribute the pool in memory.
  • A single level of an elimination array was also used in implementing shared concurrent stacks. However, elimination trees and diffracting trees were never used to implement real-world structures. This is mostly due to the fact that there was no need for them: machines with a sufficient level of concurrency and low enough interconnect latency to benefit from them did not exist. Today, multi-core machines present the necessary combination of high levels of parallelism and low interconnection costs. Indeed, this paper is the first to show that ED-Tree based implementations of data structures from java.util.concurrent scale impressively on a real machine (a Sun Maramba multicore machine with 2x8 cores and 128 hardware threads), delivering throughput that at high concurrency levels is 10 times that of the newly proposed JDK 6.0 algorithms.
  • A balancer is usually implemented as a toggle bit: a bit that holds a binary value. Each thread flips the value and picks a direction to exit according to the bit value. For example: 0, go left; 1, go right.
  • The diffraction tree is constructed from a set of balancers. You can say that the tree counts the elements, i.e. distributes them equally across the leaves.
  • If we connect a lock-free queue/stack to each leaf and use two toggle bits in each balancer, we get a data structure that obeys pool semantics.
  • We can see that we have just moved our contention source from a single queue/stack to the balancers, starting from the entrance to the tree.
  • The problem is solved by diffraction. What we get eventually is that each thread that approaches the pool traverses the whole tree and eventually reaches one of the queues at the leaves.
  • Actually, if at some point during the tree traversal a producer thread and a consumer thread meet, they don’t have to continue traversing the tree. The consumer can take the producer’s value, and both can leave the tree.
  • Under high loads, according to our statistics, 50% of the threads are successfully eliminated on each level. That is, in a 3-level tree, 50% are eliminated at the first level, another 25% at the second, and 12.5% at the third, meaning that only about 12.5% of the requests (one in eight) survive to reach the leaves.
  • We also use two toggle bits at each balancer, one for producers and one for consumers, to ensure fair distribution.
  • In the described implementation, another problem we can encounter is starvation…
  • Each balancer is composed of an EliminationArray, a pair of toggle bits, and two references, one to each of its child nodes.
  • The implementation of the EliminationArray is based on an array of Exchangers. Each Exchanger contains a single AtomicReference, which is used as an atomic placeholder for exchanging an ExchangerPackage, an object that wraps the actual data and marks its state and type.
  • At its peak, at 64 threads, the ED-Tree delivers more than 10 times the performance of the JDK. Beyond 64 threads the threads are no longer bound to a single CPU, and traffic across the interconnect causes a moderate performance decline for the ED-Tree version (the performance of the JDK is already very low).

Presentation Transcript

  • Building Scalable Producer-Consumer Pools based on Elimination-Diffraction Trees. Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit
  • The Pool: Producer-consumer pools, that is, collections of unordered objects or tasks, are a fundamental element of modern multiprocessor software and a target of extensive research and development. [Figure: producers P1 … Pn call Put(x), Put(y), Put(z) on the pool; consumers C1 … Cn call Get().]
  • ED-Tree Pool: We present the ED-Tree, a distributed pool structure based on a combination of the elimination-tree and diffracting-tree paradigms, allowing high degrees of parallelism with reduced contention.
  • Java JDK 6.0:
    • SynchronousQueue/Stack (Lea, Scott, and Shearer): a pairing-up function without buffering. Producers and consumers wait for one another.
    • LinkedBlockingQueue: producers put their value and leave; consumers wait for a value to become available.
    • ConcurrentLinkedQueue: producers put their value and leave; consumers return null if the pool is empty.
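The three behaviours listed above can be observed directly on the JDK classes. The following is a minimal sketch; the class `JdkPoolSemantics` and its method names are illustrative and do not come from the presentation.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;

class JdkPoolSemantics {
    /** SynchronousQueue has no buffer: a non-blocking offer fails unless a consumer is already waiting. */
    static boolean synchronousOfferWithoutConsumer() {
        return new SynchronousQueue<String>().offer("x");
    }

    /** ConcurrentLinkedQueue: a consumer gets null immediately when the pool is empty. */
    static String nonBlockingGetOnEmpty() {
        return new ConcurrentLinkedQueue<String>().poll();
    }

    /** LinkedBlockingQueue: the producer puts and leaves; the consumer blocks in take() until a value arrives. */
    static String blockingGet() {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        new Thread(() -> queue.add("job")).start();
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

These are the three pool specifications (synchronous, blocking, non-blocking) that the ED-Tree variants are later compared against.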
  • Drawback: All these structures are based on centralized structures, such as a lock-free queue or stack, and are thus limited in their scalability: the head of the stack or queue is a sequential bottleneck and a source of contention.
  • Some Observations: A pool does not have to obey either LIFO or FIFO semantics. Therefore, no centralized structure is needed to hold the items and serve producer and consumer requests.
  • New approach: The ED-Tree is a combined variant of the diffracting-tree structure (Shavit and Zemach) and the elimination-tree structure (Shavit and Touitou). The basic idea: use randomization to distribute the concurrent requests of threads onto many locations, so that they collide with one another and can exchange values, thus avoiding a central place through which all threads pass. The result: a pool that allows both parallelism and reduced contention.
  • A little history:
    • Both diffraction and elimination were presented years ago and were claimed to be effective through simulation.
    • However, elimination trees and diffracting trees were never used to implement real-world structures.
    • Elimination and diffraction were never combined in a single data structure.
  • Diffraction trees: A binary tree of objects called balancers [Aspnes-Herlihy-Shavit], each with a single input wire and two output wires. [Figure: five threads, numbered 1 to 5, pass through a balancer b and leave on its two output wires.] Threads arrive at a balancer and it sends them alternately left and right, so its top wire always carries at most one more thread than the bottom one.
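A balancer of this kind can be sketched as a single atomically flipped bit. This is a minimal illustration, not the authors' code; the class name `ToggleBalancer` and the convention that 0 means the top (left) wire are assumptions.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toggle-bit balancer sketch: alternates callers between two output wires. */
class ToggleBalancer {
    private final AtomicBoolean toggle = new AtomicBoolean(false);

    /** Atomically flips the bit and returns 0 (top/left wire) or 1 (bottom/right wire). */
    int traverse() {
        while (true) {
            boolean old = toggle.get();
            // Retry if another thread flipped the bit between the read and the CAS.
            if (toggle.compareAndSet(old, !old)) {
                return old ? 1 : 0;
            }
        }
    }
}
```

Because every caller performs a compare-and-set on the same bit, this is exactly the hot spot that the later slides address with a diffraction array.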
  • Diffraction trees [Shavit-Zemach]: [Figure: a tree of balancers b distributing ten items, numbered 1 to 10, across its output wires.] In any quiescent state (when there are no threads in the tree), the tree preserves the step property: the output items are balanced so that the top leaves have output at most one more element than the bottom ones, and there are no gaps.
  • Diffraction trees: Connect each output wire to a lock-free queue. [Figure: a tree of balancers b with a queue at each leaf.] To perform a push, threads traverse the balancers from the root to the leaves and then push the item onto the appropriate queue. To perform a pop, threads traverse the balancers from the root to the leaves and then pop from the appropriate queue, or block if the queue is empty.
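The push/pop traversal above can be sketched with toggle-only balancers stored in heap order. This is an illustrative sketch under several assumptions: the class `ToggleTreePool` is hypothetical, it keeps separate toggles for pushes and pops (one toggle per thread type, as the presentation describes elsewhere), and `pop` returns null on an empty leaf instead of blocking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Pool sketch: a complete binary tree of toggle bits with a lock-free queue at each leaf. */
class ToggleTreePool<T> {
    private final int depth;
    private final AtomicInteger[] pushToggles; // one per internal node, heap order
    private final AtomicInteger[] popToggles;
    private final List<ConcurrentLinkedQueue<T>> leaves = new ArrayList<>();

    ToggleTreePool(int depth) {
        this.depth = depth;
        int leafCount = 1 << depth;
        pushToggles = newToggles(leafCount - 1);
        popToggles = newToggles(leafCount - 1);
        for (int i = 0; i < leafCount; i++) {
            leaves.add(new ConcurrentLinkedQueue<>());
        }
    }

    private static AtomicInteger[] newToggles(int n) {
        AtomicInteger[] toggles = new AtomicInteger[n];
        for (int i = 0; i < n; i++) {
            toggles[i] = new AtomicInteger();
        }
        return toggles;
    }

    /** Walks from the root to a leaf; the low bit of each counter acts as the toggle bit. */
    private int route(AtomicInteger[] toggles) {
        int node = 0;
        for (int level = 0; level < depth; level++) {
            int bit = toggles[node].getAndIncrement() & 1;
            node = 2 * node + 1 + bit; // left child on 0, right child on 1
        }
        return node - toggles.length; // heap index -> leaf index
    }

    void push(T item) {
        leaves.get(route(pushToggles)).add(item);
    }

    /** Non-blocking variant: returns null if the chosen leaf queue is empty. */
    T pop() {
        return leaves.get(route(popToggles)).poll();
    }

    int leafSize(int leaf) {
        return leaves.get(leaf).size();
    }
}
```

Since pushes and pops follow the same toggle sequence, a pop is routed to the leaf that received the matching push when operations run one after another.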
  • Diffraction trees: Problem: each toggle bit is a hot spot. [Figure: threads 1, 2, and 3 contend on the 0/1 toggle bits of the balancers in the tree.]
  • Diffraction trees: Observation: if an even number of threads pass through a balancer, the outputs are evenly balanced on the top and bottom wires, and the balancer's state remains unchanged. The approach: add a diffraction array (prism array) in front of each toggle bit. [Figure: a prism array placed in front of a 0/1 toggle bit.]
  • Elimination:
    • At any point while traversing the tree, if a producer and a consumer collide, there is no need for them to diffract and continue traversing the tree.
    • The producer can hand its item to the consumer, and both can leave the tree.
  • Adding elimination: [Figure: a Put(x) and a Get() meet in an array of k slots placed in front of the 0/1 toggle bits; Get() returns x and Put(x) returns ok.]
  • Using elimination-diffraction balancers: Let the array at each balancer be a diffraction-elimination array:
    • If two producer (or two consumer) threads meet in the array, they leave on opposite wires without needing to touch the bit, since it would remain in its original state anyway.
    • If a producer and a consumer meet, they eliminate, exchanging items.
    • If a producer or consumer call does not manage to meet another in the array, it toggles the respective bit of the balancer and moves on.
  • ED-tree
  • What about low concurrency levels?
    • We show that elimination and diffraction techniques can be combined to work well at both high and low loads.
    • To ensure good performance at low loads we use several techniques that make the algorithm adapt to the current contention level.
  • Adaptation mechanisms:
    • Use backoff in space:
      • Randomly choose a cell in a certain range of the array.
      • If the cell is busy (already occupied by two threads), increase the range and repeat.
      • Else spin and wait for a collision.
      • If timed out (no collision), decrease the range and repeat.
    • If a certain number of timeouts is reached, spin on the first cell of the array for a period, and then move on to the toggle bit and the next level.
    • If a certain number of timeouts is reached, don’t try to diffract on any of the next levels; just go straight to the toggle bit.
    • Each thread remembers the last range it used at the current balancer and starts from this range next time.
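The range bookkeeping behind backoff in space can be sketched as a small per-thread object. The slides do not say by how much the range grows or shrinks, so the doubling and halving below, the initial range of 1, and the class name `SlotRange` are all assumptions made for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Per-thread slot range for one balancer's array (not thread-safe by design). */
class SlotRange {
    private final int arrayLength;
    private int range = 1; // assumed starting point: only the first cell

    SlotRange(int arrayLength) {
        if (arrayLength < 1) {
            throw new IllegalArgumentException("arrayLength must be positive");
        }
        this.arrayLength = arrayLength;
    }

    /** Picks a random cell index in [0, range). */
    int chooseSlot() {
        return ThreadLocalRandom.current().nextInt(range);
    }

    /** The chosen cell was already occupied by two threads: spread out (assumed: double). */
    void onBusy() {
        range = Math.min(arrayLength, range * 2);
    }

    /** Nobody showed up before the timeout: concentrate (assumed: halve). */
    void onTimeout() {
        range = Math.max(1, range / 2);
    }

    int range() {
        return range;
    }
}
```

Keeping one such object per thread and per balancer, for example in a `ThreadLocal`, gives the "remember the last range" behaviour from the last bullet.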
  • Starvation avoidance:
    • Threads that failed to eliminate and propagated all the way to the leaves can wait a long time for their requests to complete, while new threads that enter the tree and eliminate finish faster.
    • To avoid starvation we limit the time a thread can be blocked in the queues before it retries the whole traversal.
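The bounded wait can be sketched with a timed `poll` on the leaf queue. In this illustration the tree traversal is abstracted as a `Supplier` that returns the leaf reached; the class `BoundedWaitGet`, the `maxTraversals` cap, and the parameter names are hypothetical and were added only to keep the example finite.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class BoundedWaitGet {
    /**
     * Waits at most maxWaitMillis at a leaf, then repeats the whole traversal.
     * Returns null if nothing was found after maxTraversals attempts or on interruption.
     */
    static <T> T get(Supplier<BlockingQueue<T>> traverseTree, long maxWaitMillis, int maxTraversals) {
        for (int attempt = 0; attempt < maxTraversals; attempt++) {
            BlockingQueue<T> leaf = traverseTree.get(); // fresh traversal from the root
            try {
                T item = leaf.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
                if (item != null) {
                    return item;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

A fresh traversal gives the thread another chance to eliminate on the way down rather than staying parked at one leaf.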
  • Implementation: Each balancer is composed of an elimination array, a pair of toggle bits, and two references, one to each of its child nodes.

        public class Balancer {
            ToggleBit producerToggle, consumerToggle; // one toggle bit per thread type
            Exchanger[] eliminationArray;             // cells in which threads try to collide
            Balancer leftChild, rightChild;
            ThreadLocal<Integer> lastSlotRange;       // last array range used by each thread
        }

  • Implementation:

        import java.util.concurrent.atomic.AtomicReference;

        public class Exchanger {
            AtomicReference<ExchangerPackage> slot;   // atomic placeholder for one package
        }

        public class ExchangerPackage {
            Object value;
            State state; // WAITING/ELIMINATION/DIFFRACTION
            Type type;   // PRODUCER/CONSUMER
        }
  • Implementation: Starting from the root of the tree:
    • Enter the balancer.
    • Choose a cell in the array and try to collide with another thread, using the backoff mechanism described earlier.
    • If a collision with another thread occurred:
      • If both threads are of the same type, leave to the next-level balancer (each in a separate direction).
      • If the threads are of different types, exchange values and leave.
    • Else (no collision), use the appropriate toggle bit and move to the next level.
    • If one of the leaves is reached, go to the appropriate queue and insert/remove an item according to the thread type.
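The per-balancer decision in the steps above can be sketched for an array with a single cell. This is not the authors' implementation: it uses the JDK's `java.util.concurrent.Exchanger` as a stand-in for the slides' own `Exchanger` class, and the names `EdBalancerSketch`, `Offer`, `Result`, and the ticket-based choice of wires are assumptions for illustration.

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** One elimination-diffraction balancer with a single collision cell (sketch). */
class EdBalancerSketch {
    enum Type { PRODUCER, CONSUMER }
    enum Outcome { ELIMINATED, LEFT, RIGHT }

    /** What a thread places in the cell when it tries to collide. */
    static final class Offer {
        final Type type;
        final Object value;
        final long ticket; // unique per call; used only to pick opposite wires
        Offer(Type type, Object value, long ticket) {
            this.type = type;
            this.value = value;
            this.ticket = ticket;
        }
    }

    /** Where the thread goes next, plus the value it now carries. */
    static final class Result {
        final Outcome outcome;
        final Object value;
        Result(Outcome outcome, Object value) {
            this.outcome = outcome;
            this.value = value;
        }
    }

    private final Exchanger<Offer> cell = new Exchanger<>();
    private final AtomicLong tickets = new AtomicLong();
    private final AtomicInteger producerToggle = new AtomicInteger();
    private final AtomicInteger consumerToggle = new AtomicInteger();

    Result traverse(Type type, Object value, long timeoutMillis) {
        Offer mine = new Offer(type, value, tickets.getAndIncrement());
        try {
            Offer other = cell.exchange(mine, timeoutMillis, TimeUnit.MILLISECONDS);
            if (other.type != type) {
                // Producer met consumer: eliminate; the consumer leaves with the item.
                Object received = (type == Type.CONSUMER) ? other.value : null;
                return new Result(Outcome.ELIMINATED, received);
            }
            // Same type: diffract onto opposite wires without touching the toggle bit.
            Outcome wire = mine.ticket < other.ticket ? Outcome.LEFT : Outcome.RIGHT;
            return new Result(wire, value);
        } catch (TimeoutException e) {
            // No collision: fall back to the toggle bit for this thread type.
            AtomicInteger toggle = (type == Type.PRODUCER) ? producerToggle : consumerToggle;
            int bit = toggle.getAndIncrement() & 1;
            return new Result(bit == 0 ? Outcome.LEFT : Outcome.RIGHT, value);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A full tree would call `traverse` at each level, follow `LEFT` or `RIGHT` to the child balancer, and use the leaf queue only when no elimination happened on the way down.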
  • Performance evaluation: Sun UltraSPARC T2 Plus multi-core machine.
    • 2 processors, each with 8 cores.
    • Each core has 8 hardware threads.
    • 64-way parallelism on a processor and 128-way parallelism across the machine.
    • Most of the tests were done on one processor, i.e. at most 64 hardware threads.
  • Performance evaluation: A tree with 3 levels and 8 queues. The queues are SynchronousBlocking/LinkedBlocking/ConcurrentLinked, according to the pool specification. [Figure: a 3-level tree of balancers b.]
  • Performance evaluation: Synchronous stack of Lea et al. vs. ED synchronous pool
  • Performance evaluation Linked blocking queue vs ED blocking pool
  • Performance evaluation Concurrent linked queue vs ED non blocking pool
  • Adding a delay between accesses to the pool (32 consumers, 32 producers)
  • Changing the percentage of consumers out of the total number of threads (64 threads)
  • 25% producers, 75% consumers
  • Elimination rate
  • Elimination range