Scalable membership management
  • (a) Time until dead nodes are forgotten. (b) Number of dead links.
  • Fig. 7. (a) Number of disjoint clusters, as a result of removing a large percentage of nodes. Shows that the overlay does not break into two or more disjoint clusters unless a major percentage of the nodes is removed. (b) Number of nodes not belonging to the largest cluster. Shows that in the first steps of clustering only a few nodes are separated from the main cluster, which still connects the grand majority of the nodes.
  • Note that the graph for the experiment with cache size 100 is practically a flat line. That is, for 100,000 nodes and cache size 100, the overlay created is so robust that no matter how many nodes are removed, the remaining ones stay connected in a single cluster.
  • In-degree distribution in a converged 100,000 node overlay, for basic shuffling, enhanced shuffling, and an overlay where each node has c randomly chosen outgoing links. It is, however, clear that enhanced shuffling does a significantly better job with respect to spreading the links evenly across all nodes. For the experiment with cache size 20, 80.31% of the nodes have an in-degree of 20 ± 5%. For the experiment with cache size 50, 93.95% of the nodes have an in-degree of 50 ± 5%. The respective percentages for basic shuffling are 36.22% and 38.47%.
  • This is an average case analysis. In reality, there are noise terms in this recurrence as we pick a node whose degree is only approximately d(N). In order to prove that the argument is correct in the presence of this noise, we need to control the variance of that noise (and invoke the martingale convergence theorem!).
  • Aggregate values converge “exponentially” to the overall value in case the aggregation is an average function, and “super-exponentially” in case of a maximum function.
  • Maximum finding protocol. N = 10^5; points are averages of 50 runs. Standard deviation is not shown; it is several orders of magnitude lower than the average.

Transcript

  • 1. Scalable membership management and failure detection? Vinay Setty INF5360
  • 2. What is Gossiping?
    • Spread of information in a random manner
    • Some examples:
      – Human gossiping
      – Epidemic diseases
      – Physical phenomena: wild fire, diffusion, etc.
      – Computer viruses and worms
  • 3. Gossiping in Computer Science
    • Term first coined by Demers et al (1987)
    • Some applications of gossip protocols:
      – Peer Sampling
      – Data Aggregation
      – Clustering
      – Information Dissemination (Multicast, Pub/Sub)
      – Overlay/topology maintenance
      – Failure detection?
  • 4. Gossip-Based Protocol: Example (diagram: ten nodes, 0–9, exchanging gossip)
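As a rough illustration of the example on slide 4, here is a minimal push-gossip simulation in Python; the node count, fanout, and round count are arbitrary choices for this sketch, not parameters from the talk.

    import random

    def push_gossip(nodes, source, fanout=2, rounds=10):
        """Simulate simple push-based rumor spreading for a number of rounds."""
        informed = {source}
        for _ in range(rounds):
            newly_informed = set()
            for node in informed:
                # Each informed node picks `fanout` random peers and pushes the rumor.
                peers = random.sample([n for n in nodes if n != node],
                                      min(fanout, len(nodes) - 1))
                newly_informed.update(peers)
            informed |= newly_informed
        return informed

    # Example: 10 nodes (0..9) as in the slide's diagram, rumor starting at node 0.
    print(len(push_gossip(list(range(10)), source=0)))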
  • 5. Today’s Focus
    • Theoretical angle for gossip-based protocols [Allavena et al PODC 2005]
      – Probability of partitioning
      – Time till partitioning
      – Bounds on in-degree
      – Essential elements of gossiping
      – Simulation results
    • Cyclon [Voulgaris et al]
    • Scamp [Ganesh et al]
    • NewsCast [Jelasity et al]
  • 6. Membership Service
    • Full membership
      – Complete knowledge at each node
      – Random subset used for gossiping
      – Not scalable
      – Hard to maintain
    • Partial membership
      – Random subset at each node
      – Gossip partners chosen from the local view
  • 7. View Selection (diagram: node u collects list L1 from the local views of its neighbours and list L2 from the nodes that contacted it; the new view v is a weighted combination of L1 and L2 with weight w)
  • 8. Essential Elements of Gossiping
    • Mixing: construct a list L1 consisting of the local views of the nodes in node u’s local view
      – Guarantees non-partitioning
      – “Pull” based
    • Reinforcement: construct a list L2 consisting of the nodes that requested the local view of u
      – Balances the network
      – Removes old, possibly dead edges; adds new edges
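A sketch, under assumptions, of how mixing and reinforcement could be combined into a new view; the weighting parameter w and the sampling scheme are illustrative and not the exact rule from [Allavena et al PODC 2005].

    import random

    def select_view(l1, l2, view_size, w=0.5):
        """Build a new view by sampling from L1 (mixing) and L2 (reinforcement).

        l1: entries gathered from the local views of u's neighbours ("pull").
        l2: nodes that recently requested u's view.
        w:  weight given to reinforcement entries (illustrative parameter).
        """
        l1, l2 = list(l1), list(l2)
        new_view = []
        while len(new_view) < view_size and (l1 or l2):
            # Prefer a reinforcement entry with probability w, otherwise mix.
            source = l2 if (l2 and (not l1 or random.random() < w)) else l1
            candidate = source.pop(random.randrange(len(source)))
            if candidate not in new_view:
                new_view.append(candidate)
        return new_view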
  • 9. Partitioning and Size Estimate
    • A and B partition iff x = 1 and y = 0
    • Partitioning is least likely when x = y
    • The goal of the protocol is to maintain this balance
  • 10. Size Estimates
    • Idea:
      – Assuming edges were drawn uniformly at random, the expected x + y ∝ |A|
      – x is the estimate of the size of A by nodes in A
      – y is the estimate of the size of A by nodes in B
    • Mixing:
      – Agreeing on the estimates of x and y ensures no partition (even if x and y are not accurate)
    • Reinforcement:
      – Brings the estimates of x and y to the correct value
  • 11. K-regularity
    • View size: k
    • Number of nodes: n
    • Fraction of nodes in partition: γ
    • |A| = γn ≤ |B|
    • #edges from A to B: (1 − x)γkn
    • #edges from B to A: y(1 − γ)kn
    • Number of edges in the A–B cut:
      – (1 − x)γkn + x(1 − γ)kn (since x = y)
      – ≥ γkn (assuming γ ≤ ½)
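For reference, the algebra behind the last two sub-bullets (same notation as the slide):

    (1-x)\gamma k n + x(1-\gamma) k n
      = \gamma k n + x k n (1 - 2\gamma)
      \;\ge\; \gamma k n
      \qquad \text{for } 0 \le x \le 1,\ \gamma \le \tfrac{1}{2}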
  • 12. Time Till Partitioning
    • View size: k
    • Number of nodes: n
    • Fraction of nodes in partition: γ
    • Churn rate: μ (μn nodes leave and join)
    • Claim: the expected time before a partition of size γ happens ≈ 2γkn
      – As long as μ ≪ γkn
  • 13. Iterations until Partitioning
    • Figure 4: number of iterations until partitioning, plotted against log10 of the number of nodes (number of nodes: n, view size: k = log n, churn: n/32)
    • Simulations were run to match the theoretical results about partitioning and churn, evaluating the number of iterations until partitioning
    • With 100,000 nodes, view sizes of 17, a fanout of 3, and a loosely synchronised system, the maximum in-degree was always below 4.5 times that of a random graph, and the standard deviation was not more than 3.2 times larger than that of a random graph; these values would improve with increased fanout, but even a fanout of 2 gives satisfactory performance
  • 14. View Size vs Time until Partitioning (number of nodes: n, view size: k = log n, churn: n/32)
  • 15. Simplified Model for Proof
    – A single randomly chosen element of the view is replaced instead of the whole view
    – Assumption: the out-edges of nodes in A are identically distributed, and the same applies to B
    – a = #edges from A to A
    – c = #edges from A to B
    – b = #edges from B to A
    – d = #edges from B to B
  • 16. Proof Intuition
    – Partition state: a = γkn and b = 0
  • 17. In-Degree Analysis
    • Load balancing requires a balanced in-degree distribution
    • In-degree is governed by the way edges are created, copied, and destroyed
    • Copying some edges more than others causes variability in in-degree
    • A node that lives longer is expected to have a higher in-degree
    • Solution: increase reinforcement and keep track of timestamps, as in Cyclon
    • Simulation: max in-degree < 4.5 times that of a random graph, and standard deviation < 3.2 times
  • 18. Discussion
    • Are these theoretical guarantees practically useful?
    • The goal is not to provide failure detection
  • 19. Cyclon
    • Consists of the same elements as suggested by [Allavena et al PODC 2005]
    • The analysis of [Allavena et al PODC 2005] holds for Cyclon
    • Major differences:
      – Timestamps
      – Shuffling
  • 20. Basic Shuffling
    • Select a random subset of l neighbors (1 ≤ l ≤ c) from P’s own cache, and a random peer, Q, within this subset, where l is a system parameter called the shuffle length.
    • Replace Q’s address with P’s address.
    • Send the updated subset to Q.
    • Receive from Q a subset of no more than l of Q’s neighbors.
    • Discard entries pointing to P, and entries that are already in P’s cache.
    • Update P’s cache to include all remaining entries, by
      – firstly using empty cache slots (if any), and
      – secondly replacing entries among the ones originally sent to Q.
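A sketch of the initiating peer's side of basic shuffling, following the steps above; the cache is modelled as a plain list of addresses, and send/receive are hypothetical network primitives, not part of Cyclon itself.

    import random

    def basic_shuffle(p_addr, cache, c, l, send, receive):
        """One shuffle initiated by peer P (address p_addr).

        cache:        list of unique neighbour addresses, at most c entries.
        l:            shuffle length, 1 <= l <= c.
        send/receive: hypothetical network primitives for the exchange with Q.
        """
        cache = list(cache)
        if not cache:
            return cache
        subset = random.sample(cache, min(l, len(cache)))
        q = random.choice(subset)                            # peer to shuffle with
        send(q, [p_addr if e == q else e for e in subset])   # replace Q's address with P's
        received = receive(q)                                # at most l of Q's neighbours
        # Discard entries pointing to P or already in P's cache.
        fresh = [e for e in received if e != p_addr and e not in cache]
        replaceable = list(subset)                           # entries originally sent to Q
        for entry in fresh:
            if len(cache) < c:                               # firstly, use empty cache slots
                cache.append(entry)
            elif replaceable:                                # secondly, overwrite entries sent to Q
                cache[cache.index(replaceable.pop(0))] = entry
        return cache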
  • 21. Shuffling Example
  • 22. Enhanced Shuffling
    • Increase by one the age of all neighbors.
    • Select neighbor Q with the highest age among all neighbors, and l − 1 other random neighbors.
    • Replace Q’s entry with a new entry of age 0 and with P’s address.
    • Send the updated subset to peer Q.
    • Receive from Q a subset of no more than l of its own entries.
    • Discard entries pointing at P and entries already contained in P’s cache.
    • Update P’s cache to include all remaining entries, by firstly using empty cache slots (if any), and secondly replacing entries among the ones sent to Q.
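The enhanced variant changes only how Q and the outgoing entries are chosen; here is a sketch of that selection step, assuming each cache entry is an (address, age) pair (the tuple representation is this sketch's assumption).

    import random

    def select_for_enhanced_shuffle(p_addr, cache, l):
        """Pick the exchange partner and the subset to send (enhanced shuffling).

        cache: list of (address, age) pairs.
        Returns Q's address, the subset to send, and the aged cache.
        """
        cache = [(addr, age + 1) for addr, age in cache]   # age all neighbours by one
        q_addr, _ = max(cache, key=lambda e: e[1])         # oldest neighbour becomes Q
        others = [e for e in cache if e[0] != q_addr]
        subset = random.sample(others, min(l - 1, len(others)))
        subset.append((p_addr, 0))                         # fresh entry for P replaces Q's
        return q_addr, subset, cache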
  • 23. Time Until Dead Links Removed
  • 24. Number of Clusters
    • Fig. 7(a): number of disjoint clusters as a result of removing a large percentage of nodes; the number of clusters decreases as we approach 100% node removal because the total number of surviving nodes becomes too small
    • Fig. 7(b): number of nodes not belonging to the largest cluster, in log scale
    • These graphs show considerable robustness to node failures, especially considering that in the early stages of clustering very few nodes are outside the largest cluster, which indicates that most nodes are still connected in a single cluster
  • 25. Tolerance to Partitioning
  • 26. In-Degree Distribution
  • 27. SCAMP
    • Partial knowledge of the membership: the local view
    • Fanout is automatically set to the size of the local view
    • Fanout evolves naturally with the size of the group
      – The size of local views converges towards C·log(n)
  • 28. Join (Subscription)
    – Diagram: a subscription is sent to a random member and then forwarded from node to node; at each node, P = 1/(size of view) determines whether the subscription is kept there, and it is forwarded further with probability (1 − P)
  • 29. Join (Subscription) algorithm
    – Diagram: example local views of nodes 0–8 during a join
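A minimal sketch of the keep-or-forward decision suggested by the subscription diagram on slide 28; this covers only that probabilistic rule, not the full SCAMP join protocol, and the forward callback is a placeholder.

    import random

    def handle_subscription(local_view, new_node, forward):
        """Keep the new subscription with probability 1/len(local_view),
        otherwise forward it to a randomly chosen member of the local view."""
        if not local_view or random.random() < 1.0 / len(local_view):
            local_view.append(new_node)                       # keep: add the subscriber
        else:
            forward(random.choice(local_view), new_node)      # forward to a random member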
  • 30. Load Balancing
    • Indirection:
      – Forward the subscription instead of handling the request
    • A lease is associated with each subscription
    • Periodically, nodes have to re-subscribe
      – Nodes that have failed permanently will time out
      – Re-subscription re-balances the partial views
  • 31. Unsubscription
    – Diagram: node 0 unsubscribes by sending Unsub(0) together with replacement entries [1, 4, 5]; nodes holding 0 in their local views replace it with entries from this list
  • 32. Degree
    • The system is modelled as a random directed graph
    • D(N) = average out-degree for an N-node system
    • A subscription adds D(N) + 1 directed arcs, so
    • (N + 1) D(N + 1) = N D(N) + D(N) + 1
    • The solution of this recursion is
    • D(N) = D(1) + 1/2 + 1/3 + … + 1/N ≈ log(N)
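Written out, the recursion on slide 32 telescopes into the harmonic sum:

    (N+1)\,D(N+1) = N\,D(N) + D(N) + 1 = (N+1)\,D(N) + 1
    \;\Rightarrow\; D(N+1) = D(N) + \frac{1}{N+1}
    \;\Rightarrow\; D(N) = D(1) + \sum_{i=2}^{N} \frac{1}{i} \;\approx\; \log N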
  • 33. Distribution of view size
    – Histogram of the number of nodes per view size, for a 200,000 node system (log ≈ 12.2) and a 500,000 node system (log ≈ 13.12)
  • 34. Reliability: 5,000 node system
    – Plot of reliability against the number of failures (0 to 2,500), comparing SCAMP with global membership knowledge at fanout 8 and fanout 9
  • 35. NewsCast
    • Goal: aggregate information
      – in a large and dynamic distributed environment
      – in a robust and dependable manner
  • 36. Idea
    • Get news from the application, timestamp it, and add the local peer’s address to the cache entry
    • Find a random peer among the cache addresses
      – Send all cache entries to this peer
      – Receive all cache entries from that peer
    • Pass on cache entries (containing news items) to the application
    • Merge the old cache with the received cache
      – Keep at most the C most recent cache entries
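A hedged sketch of one NewsCast exchange from the local peer's point of view, assuming cache entries are (peer, timestamp, news) tuples and that the network transfer has already happened; keeping only the freshest entry per peer is an assumption of this sketch rather than a rule stated on the slide.

    import time

    def newscast_exchange(my_addr, my_cache, my_news, peers_cache, c=20):
        """One NewsCast round at the local peer.

        my_cache / peers_cache: lists of (peer, timestamp, news_item) entries.
        c: maximum cache size (keep at most the c most recent entries).
        """
        my_cache = my_cache + [(my_addr, time.time(), my_news)]   # add own fresh entry
        merged = my_cache + peers_cache                           # exchange and merge
        merged.sort(key=lambda entry: entry[1], reverse=True)     # newest first
        seen, new_cache = set(), []
        for peer, ts, news in merged:                             # keep freshest entry per peer
            if peer not in seen:
                seen.add(peer)
                new_cache.append((peer, ts, news))
            if len(new_cache) == c:
                break
        return new_cache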
  • 37. Aggregation
    • Each node ni maintains a single number xi
    • Every node ni selects a random node nk and sends its value xi to nk
    • nk responds with the aggregate (e.g. max(xi, xk)) of the incoming value and its own value
    • Aggregate values converge “exponentially”
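A small simulation of the aggregation rule above, using max as the aggregate; the cycle structure (every node contacts one random peer per cycle) is a simplification for this sketch.

    import random

    def gossip_max(values, cycles=10):
        """Gossip-based maximum finding: each cycle, every node contacts a
        random peer and both adopt the aggregate (max) of their two values."""
        values = list(values)
        n = len(values)
        for _ in range(cycles):
            for i in range(n):
                k = random.randrange(n)
                values[i] = values[k] = max(values[i], values[k])
        return values

    # After a few cycles every node holds the global maximum.
    result = gossip_max([random.random() for _ in range(1000)])
    print(min(result) == max(result))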
  • 38. Path length under failures
  • 39. Connectivity Under Failures
  • 40. Aggregation
    – Plot: proportion of not-reached nodes (log scale) against the cycle number, for the theoretical model and for c = 20, c = 40, c = 80