Distributed Coordination



A brief overview of several alternatives for Distributed Counting and Sorting.



  1. 1. Counting, Sorting and Distributed Coordination Luis Galárraga Saarland University
  2. 2. Agenda <ul><li>Justification
  3. 3. Concurrent objects and basic concepts
  4. 4. Distributed counting </li><ul><li>Software Combining Trees
  5. 5. Counting networks </li></ul><li>Distributed sorting </li><ul><li>Sorting networks
  6. 6. Sample sorting </li></ul></ul>
  7. 7. Some justification <ul><li>Counting and sorting are basic tasks in software and algorithms.
  8. 8. At any stage, our programs rely heavily on one of these tasks. </li><ul><li>Take advantage of new system architecture trends like multiprocessor environments
  9. 9. Avoid memory contention!!! </li></ul></ul>
  10. 10. Memory contention <ul><li>Defined as a scenario in which many threads on different processors try to access the same memory location.
  11. 11. Not a serious problem for reads, but it is for writes. </li></ul>
  12. 12. Concurrent Objects
  13. 13. Quiescent consistency <ul><li>Quiescent periods for concurrent objects: </li><ul><li>No pending calls </li></ul><li>The state of any quiescent object must be equivalent to some sequential order of the completed method calls.
  14. 14. A quiescent counter: </li><ul><li>Neither omissions nor duplicates!! </li></ul></ul>
  15. 15. Quiescent consistency
  16. 16. Sequential Consistency <ul><li>Calls should appear to take effect in program order.
  17. 17. Method calls by different threads are unrelated by program order. </li></ul>
  18. 18. Sequential Consistency
  19. 19. To discuss... <ul><li>What about these examples? </li></ul>
  20. 20. Linearizability <ul><li>If one call precedes another (even across different threads), then the earlier call must have taken effect before the later call.
  21. 21. If two calls overlap, then their order is ambiguous and we are free to order them in any convenient way.
  22. 22. Linearization points: points where the method seems to take effect. </li></ul>
  23. 23. Linearizability
  24. 24. Measuring the performance <ul><li>Latency </li><ul><li>Time it takes an individual call to complete </li></ul><li>Throughput </li><ul><li>Average rate at which a set of method calls complete </li></ul><li>Lock-based approaches favor latency. </li></ul>
  25. 25. Distributed Counting
  26. 26. The “classical” approach <ul><li>Counter: </li><ul><li>Shared object holding an integer value with getandIncrement(int n = 1) method which returns the value and then adds n. </li></ul><li>Do you want to increment the counter? Use a lock: </li><ul><li>Acquire the lock
  27. 27. Perform the increment
  28. 28. Free the lock </li></ul></ul>
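The three-step lock recipe above can be sketched in a few lines (a Python sketch; the counter interface follows the slides' getandIncrement, but the class and method names here are illustrative):

```python
import threading

class LockedCounter:
    """The 'classical' counter: one lock guards the shared value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def get_and_increment(self, n=1):
        with self._lock:       # acquire the lock
            old = self._value
            self._value += n   # perform the increment
            return old         # the lock is freed when the block exits
```

Every call funnels through the single lock, so all threads contend for the same memory location: exactly the contention that the combining trees and counting networks below try to avoid.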
  29. 29. Locks.. so many locks <ul><li>They come in different flavors: </li><ul><li>Spin locks (Filter and Bakery)
  30. 30. Test and set (with/without Exponential Backoff)
  31. 31. Queue locks (MCS, CLH, etc... )
  32. 32. Chapters 7 and 8 of [1] </li></ul><li>Some are better than others, but all suffer from memory contention to some degree. </li></ul>
  33. 33. Software Combining Trees <ul><li>Several threads decide to call getandIncrement at “more or less” the same time.
  34. 34. Some threads become responsible for gathering other threads' increments and combining them.
  35. 35. Some hierarchical data structure must be used.
  36. 36. Binary trees </li></ul>
  37. 37. Software Combining Trees <ul><li>p threads
  38. 38. Balanced binary tree with k levels </li><ul><li>k = min{ j | 2^j >= p }, 2^(k-1) leaf nodes </li></ul><li>Each node holds 2 values to combine and a state.
  39. 39. Each thread is assigned to a leaf. A leaf can be assigned to at most 2 threads.
  40. 40. The value of the counter is stored at the root </li></ul>
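The tree shape described above can be checked with a small helper (a hypothetical helper, not from the slides; names are illustrative):

```python
import math

def combining_tree_shape(p):
    """Depth k = min{ j | 2^j >= p } and the 2^(k-1) leaf count
    for a combining tree serving p threads (p >= 2)."""
    k = math.ceil(math.log2(p))
    leaves = 2 ** (k - 1)
    # each leaf serves at most 2 threads, so capacity 2^k covers p
    assert 2 * leaves >= p
    return k, leaves

def leaf_for_thread(thread_id):
    """Threads 0,1 share leaf 0; threads 2,3 share leaf 1; and so on."""
    return thread_id // 2
```

For p = 5 threads this yields k = 3 levels and 4 leaf nodes, with at most two threads per leaf.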
  41. 41. Software Combining Trees <ul><li>To increment the counter, threads have to traverse the tree from their leaf node to the root. </li><ul><li>They might combine their values with other threads along the way. </li></ul><li>Node states </li><ul><li>IDLE: Initial state of the nodes
  42. 42. FIRST: One thread has visited this node and becomes the master
  43. 43. SECOND: A second thread (slave) is waiting for the master to combine. </li></ul></ul>
  44. 44. Software Combining Trees
  45. 45. Performance analysis <ul><li>They favor throughput over latency: </li><ul><li>Calls take O(log p) </li></ul><li>Reduced memory contention
  46. 46. Linearizable alternative
  47. 47. Sensitive to changes in concurrency rate </li><ul><li>Threads might fail to combine immediately </li></ul><li>What about n-ary trees?
  48. 48. How long to wait for other threads to combine? </li></ul>
  49. 49. Counting networks - Balancers <ul><li>A balancer distributes tokens arriving asynchronously on its 2 input wires.
  50. 50. Balancing networks are constructed by connecting balancers' outputs to other balancers' inputs. </li></ul>
  51. 51. Balancing network <ul><li>A balancing network of width w has inputs x_0, x_1, ..., x_{w-1} and outputs y_0, y_1, ..., y_{w-1}; in quiescent periods it conserves tokens: x_0 + ... + x_{w-1} = y_0 + ... + y_{w-1}. </li></ul><ul><li>The depth d is defined as the maximum number of balancers a token can traverse starting from any input wire. </li></ul>
  52. 52. The step property <ul><li>In any quiescent state, 0 &lt;= y_i - y_j &lt;= 1 for all i &lt; j. A balancing network that satisfies the step property is called a Counting Network.
  53. 53. Threads shepherd tokens through the network. </li><ul><li>Given the step property, it is easy to see that we can use the network to count how many tokens have traversed it. </li></ul></ul>
  54. 54. Bitonic Counting Network <ul><li>For k = 1, a single balancer
  55. 55. For k > 1, k is a power of 2: </li></ul>
  56. 56. Bitonic Counting Network <ul><li>2k-merger (2k = 8): </li></ul>
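As a concrete sketch (illustrative, not taken from the slides), a width-4 bitonic network can be hand-wired and driven one token at a time. Each balancer forwards incoming tokens alternately to its top and bottom output wire; the crossed pairs (0,3) and (1,2) in the middle layer play the role of the merger's shuffled wiring:

```python
class Balancer:
    """Forwards incoming tokens alternately to its top and bottom wire."""

    def __init__(self, top, bottom):
        self.wires = (top, bottom)
        self.count = 0

    def traverse(self):
        out = self.wires[self.count % 2]  # even tokens go up, odd go down
        self.count += 1
        return out

def bitonic4():
    """Bitonic[4]: two Bitonic[2] balancers followed by a Merger[4]."""
    return [Balancer(0, 1), Balancer(2, 3),   # layer 1: Bitonic[2] x 2
            Balancer(0, 3), Balancer(1, 2),   # layer 2: merger, crossed wires
            Balancer(0, 1), Balancer(2, 3)]   # layer 3: merger, final layer

def shepherd(network, wire):
    """A thread shepherds one token from an input wire to an output wire."""
    for balancer in network:
        if wire in balancer.wires:
            wire = balancer.traverse()
    return wire
```

Sending 10 tokens through (entering on wires 0, 1, 2, 3, 0, 1, ...) leaves per-wire counts [3, 3, 2, 2]: the step property holds in the quiescent state, so inspecting the output wires counts the tokens.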
  57. 57. Periodic Counting network <ul><li>For k = 1, a single balancer
  58. 58. For k > 1, a sequence of log(2k) block networks: </li></ul>
  59. 59. Periodic Counting network <ul><li>2k-block (2k = 8) </li></ul>
  60. 60. Counting networks <ul><li>Bitonic and Periodic networks have depth O(log^2(w)), w = 2k = width </li></ul>
  61. 61. Counting networks <ul><li>Saturation S measures the ratio of tokens to balancers </li><ul><li>S > 1 oversaturated, S < 1 undersaturated </li></ul><li>2k-block and 2k-merger are used in Barrier implementations </li><ul><li>They are threshold networks </li></ul></ul>
  62. 62. Counting networks <ul><li>Periodic and Bitonic are not the only ones: </li><ul><li>Diffracting trees, O(log w)
  63. 63. BM or Busch-Mavronicolas, w inputs, p*w outputs for some p > 1
  64. 64. BM outperforms Bitonic under most conditions
  65. 65. Bitonic networks are the best in situations of low concurrency. </li></ul><li>Robust to changes in concurrency rate </li></ul>
  66. 66. Distributed Sorting
  67. 67. Sorting networks <ul><li>A comparator is to a sorting network what a balancer is to a counting network. </li><ul><li>But comparators are synchronous !!! </li></ul></ul>
  68. 68. Wow... this saves time!! <ul><li>Isomorphism: </li><ul><li>If a balancing network counts, then its comparison counterpart sorts.
  69. 69. Proof in [3] </li></ul></ul>
  70. 70. Bitonic Sorting Network <ul><li>Same structure as the Bitonic Counting Network.
  71. 71. Collection of: </li><ul><li>p threads, d layers of w/2 comparators. Layers define rounds
  72. 72. Table of size d * w/2 storing which entries (wires) compare in each layer
  73. 73. Synchronization via Barriers, see [4]
  74. 74. Each thread/processor does s comparisons in every round, p*s is a power of 2 </li></ul></ul>
  75. 75. Bitonic Sort
  76. 76. Sorting networks <ul><li>Swaps do not need synchronization
  77. 77. All threads must always be in the same stage </li><ul><li>Synchronization via a Barrier
  78. 78. barrier.await() returns when all threads have called it. </li></ul><li>Time O(s * log^2(p)), p = no. of threads
  79. 79. Suitable for small sets </li><ul><li>In every round a key will be compared by a different thread
  80. 80. Not cache efficient!!! </li></ul></ul>
  81. 81. Sample sorting <ul><li>Designed for large sets which do not fit in main memory </li><ul><li>Accessing them can be very expensive (if they are on disk... ouch!!)
  82. 82. We need more locality of reference, how? </li></ul><li>p threads, n input keys </li></ul>
  83. 83. Sample sorting – 3 magic steps <ul><li>Step 1: Choose p-1 splitter keys to divide the set evenly. </li><ul><li>But the input keys are not sorted!!
  84. 84. Each thread takes s samples (p*s in total); sort them using BitonicSort
  85. 85. Select the keys in positions s, 2s, ..., (p-1)*s as splitters </li></ul><li>Now we have divided the big set into subsets of size approx. n/p. </li></ul>
  86. 86. Sample sorting – 3 magic steps <ul><li>Step 2: Each thread sequentially processes n/p keys, moving each item to its bucket (defined by the splitters)
  87. 87. Step 3: Each thread sequentially sorts the items in its bucket. </li></ul>
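The three steps can be sketched sequentially (a Python sketch; each "thread" is just a loop iteration here, and the sample is sorted with the built-in sort instead of BitonicSort):

```python
import bisect
import random

def sample_sort(keys, p, s=8):
    """Sample sort for p 'threads'; assumes len(keys) >= p * s."""
    # Step 1: take p*s sample keys, sort them, and use the keys at
    # positions s, 2s, ..., (p-1)*s as the p-1 splitters.
    sample = sorted(random.sample(keys, p * s))
    splitters = [sample[i * s] for i in range(1, p)]
    # Step 2: move every key into the bucket the splitters define.
    buckets = [[] for _ in range(p)]
    for key in keys:
        buckets[bisect.bisect_right(splitters, key)].append(key)
    # Step 3: sort each bucket; concatenating the buckets sorts the set.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Each bucket holds roughly n/p keys, so step 3 costs about O(n/p * log(n/p)) per thread, and each thread touches only its own bucket: the locality of reference the slides ask for.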
  88. 88. Sample sorting <ul><li>Time O(n/p * log(n/p)) </li><ul><li>Assuming a comparison-based sequential algorithm
  89. 89. What about fixed-size integer keys? Radix sort!! </li></ul><li>Sampling can be avoided with prior knowledge of the data's probability distribution </li></ul>
  90. 90. Other alternatives to Sample Sorting <ul><li>Flash sorting
  91. 91. Parallel Merge Sorting
  92. 92. Parallel Radix Sorting </li><ul><li>Load Balanced Parallel Radix Sort [5]
  93. 93. Partitioned Parallel Radix Sort [6]
  94. 94. These strike a balance between fair work distribution and communication effort. </li></ul></ul>
  95. 95. Sources
[1] M. Herlihy and N. Shavit, “Concurrent objects,” in The Art of Multiprocessor Programming, pp. 45–69, Burlington, USA: Elsevier Inc., 2008.
[2] E. N. Klein, C. Busch, and D. R. Musser, “An experimental analysis of counting networks,” Journal of the ACM, pp. 1020–1048, September 1994.
[3] J. Aspnes, M. Herlihy, and N. Shavit, “Counting networks,” Journal of the ACM, 1994.
[4] M. Herlihy and N. Shavit, “Barriers,” in The Art of Multiprocessor Programming, pp. 397–415, Burlington, USA: Elsevier Inc., 2008.
[5] A. Sohn and Y. Kodama, “Load balanced parallel radix sort,” in ICS ’98: Proceedings of the 12th International Conference on Supercomputing, (New York, NY, USA), pp. 305–312, ACM, 1998.
[6] S.-J. Lee, M. Jeon, D. Kim, and A. Sohn, “Partitioned parallel radix sort,” J. Parallel Distrib. Comput., vol. 62, no. 4, pp. 656–668, 2002.