http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
  • Transcript

    • 1. A Stable Broadcast Algorithm. Kei Takahashi, Hideo Saito, Takeshi Shibata, Kenjiro Taura (The University of Tokyo, Japan). CCGrid 2008 - Lyon, France
    • 2. Broadcasting Large Messages
      • Distribute the same large data to many nodes
        • Ex: content delivery
      • Widely used in parallel processing
    • 3. Problem of Broadcast
      • Usually, in a broadcast, the source can deliver much less data to each destination than in a single transfer from the source
        • Ex: a source link of capacity 100 shared by four destinations delivers only 25 to each
    • 4. Problem of Slow Nodes
      • Pipelined transfers improve performance
      • Even in a pipelined transfer, nodes with small bandwidth (slow nodes) may degrade the receiving bandwidth of all other nodes
    • 5. Contributions
      • We propose the notion of a Stable Broadcast
      • In a stable broadcast:
        • Slow nodes never degrade the receiving bandwidth of other nodes
        • All nodes receive the maximum possible amount of data
    • 6. Contributions (cont.)
      • We propose a stable broadcast algorithm for tree topologies
        • Proved stable in a theoretical model
        • Also improves performance on general graph networks
      • In a real-machine experiment, our algorithm achieved 2.5 times the aggregate bandwidth of the previous algorithm (FPFR)
    • 7. Agenda
      • Introduction
      • Problem Settings
      • Related Work
      • Proposed Algorithm
      • Evaluation
      • Conclusion
    • 8. Problem Settings
      • Target: broadcast of large messages
      • Only computational nodes handle messages
    • 9. Problem Settings (cont.)
      • Only bandwidth matters for large messages
        • (Transfer time) = (Latency) + (Message size) / (Bandwidth)
        • Ex: for a 1 GB message on a 1 Gbps link with 50 msec latency, the bandwidth term accounts for about 99% of the transfer time (see the sketch after this slide)
      • Bandwidth is limited only by link capacities
        • Assume that nodes and switches have enough processing throughput
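A minimal sketch, in Python, of the calculation behind the 99% figure, using the illustrative numbers shown on the slide (50 msec latency, 1 Gbps link, 1 GB message):

```python
# Minimal sketch: why bandwidth dominates transfer time for large messages.
latency_s = 0.050                      # 50 msec
bandwidth_bps = 1e9                    # 1 Gbps
message_bits = 1e9 * 8                 # 1 GB = 8e9 bits

transfer_time = latency_s + message_bits / bandwidth_bps
bandwidth_share = (message_bits / bandwidth_bps) / transfer_time

print(f"transfer time  = {transfer_time:.2f} s")     # ~8.05 s
print(f"bandwidth term = {bandwidth_share:.1%}")     # ~99.4%
```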
    • 10. Problem Settings (cont.)
      • A bandwidth-annotated topology is given in advance
        • Bandwidth and topology can be inferred rapidly:
          • Shirai et al. A Fast Topology Inference: A Building Block for Network-aware Parallel Computing (HPDC 2007)
          • Naganuma et al. Improving Efficiency of Network Bandwidth Estimation Using Topology Information (SACSIS 2008, Tsukuba, Japan)
    • 11. Evaluation of Broadcast
      • Previous algorithms evaluated broadcast by completion time
      • However, completion time cannot capture the effect of slowly receiving nodes
        • It is desirable that each node receive as much data as possible
      • Aggregate bandwidth is a more reasonable evaluation criterion in many cases
    • 12. Definition of Stable Broadcast
      • A broadcast is stable if all nodes receive the maximum possible bandwidth
        • The receiving bandwidth of each node is not lessened by adding other nodes to the broadcast
        • Ex: D2 receives 120 in a single transfer, and still receives 120 in a stable broadcast
    • 13. Properties of Stable Broadcast
      • Maximizes aggregate bandwidth
      • Minimizes completion time
    • 14. Agenda
      • Introduction
      • Problem Settings
      • Related Work
      • Proposed Algorithm
      • Evaluation
      • Conclusion
    • 15. Single-Tree Algorithms
      • Flat tree:
        • The outgoing link from the source becomes a bottleneck
      • Random pipeline:
        • Links used by many transfers become bottlenecks
      • Depth-first pipeline (FPFR):
        • Each link is used only once, but fast nodes suffer from slow nodes
      • Dijkstra:
        • Fast nodes do not suffer from slow nodes, but some links are used many times
    • 16. FPFR Algorithm [†]
      • FPFR (Fast Parallel File Replication) improves aggregate bandwidth over algorithms that use only one tree
      • Idea:
        • (1) Construct multiple spanning trees
        • (2) Use these trees in parallel
      [†] Izmailov et al. Fast Parallel File Replication in Data Grid (GGF-10, March 2004)
    • 17. Tree Construction in FPFR
      • Iteratively construct spanning trees (a minimal sketch follows this slide):
        • Create a spanning tree (Tn) by tracing every destination
        • Set the throughput (Vn) to the bottleneck bandwidth in Tn
        • Subtract Vn from the remaining bandwidth of each link in Tn
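A minimal sketch of the tree peeling described on slide 17, assuming the topology is given as a per-direction capacity map {node: {neighbor: bandwidth}} and that each spanning tree is traced depth-first over links that still have capacity. The function name fpfr_trees and these details are illustrative, not taken from the FPFR paper.

```python
# Minimal sketch of FPFR-style tree construction (illustrative, not the
# authors' code). remaining[u][v] is the leftover bandwidth of the u->v
# direction of a link; a broadcast tree only consumes the outward direction.

def fpfr_trees(capacity, source, destinations):
    """Peel off spanning trees; return a list of (tree_edges, throughput)."""
    remaining = {u: dict(nbrs) for u, nbrs in capacity.items()}
    trees = []
    while True:
        tree_edges, visited = [], {source}

        def dfs(u):                      # trace every reachable node depth-first
            for v, cap in remaining[u].items():
                if cap > 0 and v not in visited:
                    visited.add(v)
                    tree_edges.append((u, v))
                    dfs(v)

        dfs(source)
        if not tree_edges or not set(destinations) <= visited:
            break                        # FPFR keeps spanning trees only
        throughput = min(remaining[u][v] for u, v in tree_edges)
        for u, v in tree_edges:          # subtract the bottleneck share
            remaining[u][v] -= throughput
        trees.append((tree_edges, throughput))
    return trees
```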
    • 18. Data Transfer with FPFR
      • Each tree sends a different fraction of the data in parallel (e.g., T1 sends the former part, T2 the latter part); a sketch of the proportional split follows this slide
        • The proportion of data sent through each tree may be optimized by linear programming (Balanced Multicasting [†])
      [†] den Burger et al. Balanced Multicasting: High-throughput Communication for Grid Applications (SC 2005)
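A minimal sketch of the parallel transfer, under the simple rule that each tree carries a contiguous share of the message proportional to its throughput; Balanced Multicasting would instead choose the proportions with a linear program. The function name split_message is illustrative.

```python
# Minimal sketch: give each tree a contiguous byte range proportional to its
# throughput, so that all trees finish at roughly the same time (illustrative).

def split_message(message_size, throughputs):
    """Return one (offset, length) range per tree."""
    total = sum(throughputs)
    ranges, offset = [], 0
    for i, v in enumerate(throughputs):
        if i == len(throughputs) - 1:
            length = message_size - offset        # last tree takes the remainder
        else:
            length = int(message_size * v / total)
        ranges.append((offset, length))
        offset += length
    return ranges

# Example: trees with throughputs V1=60, V2=40 share a 1,000,000-byte message.
print(split_message(1_000_000, [60, 40]))   # [(0, 600000), (600000, 400000)]
```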
    • 19. Problems of FPFR
      • In FPFR, slow nodes degrade the receiving bandwidth of other nodes
      • On a tree topology, FPFR outputs only one depth-first pipeline, which cannot utilize the network's potential performance
    • 20. Agenda
      • Introduction
      • Problem Settings
      • Related Work
      • Our Algorithm
      • Evaluation
      • Conclusion
    • 21. Our Algorithm
      • Modify the FPFR algorithm:
        • Create both spanning trees and partial trees
      • Stable for tree topologies whose links have the same bandwidth in both directions
    • 22. Tree Construction
      • Iteratively construct trees (a sketch of the change from FPFR follows this slide):
        • Create a tree Tn by tracing every reachable destination
        • Set the throughput Vn to the bottleneck bandwidth in Tn
        • Subtract Vn from the remaining capacity of each link in Tn
      • Ex: T1 is a spanning tree; T2 and T3 are partial trees reaching only a subset of the destinations
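A minimal sketch of this modified peeling, mirroring the FPFR sketch after slide 17 but keeping a tree as long as it still reaches at least one destination, so partial trees keep consuming leftover capacity. Illustrative only; a real implementation would presumably also prune branches that reach no destination.

```python
# Minimal sketch of the proposed tree construction (illustrative). The only
# change from the FPFR sketch above is the stopping rule: a tree is kept even
# if it does not span every destination.

def stable_trees(capacity, source, destinations):
    """Peel off spanning AND partial trees; return [(tree_edges, throughput)]."""
    remaining = {u: dict(nbrs) for u, nbrs in capacity.items()}
    trees = []
    while True:
        tree_edges, visited = [], {source}

        def dfs(u):
            for v, cap in remaining[u].items():
                if cap > 0 and v not in visited:
                    visited.add(v)
                    tree_edges.append((u, v))
                    dfs(v)

        dfs(source)
        if not tree_edges or not any(d in visited for d in destinations):
            break                        # no destination is reachable any more
        throughput = min(remaining[u][v] for u, v in tree_edges)
        for u, v in tree_edges:
            remaining[u][v] -= throughput
        trees.append((tree_edges, throughput))
    return trees
```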
    • 23. Data Transfer
      • Each tree sends data proportional to its throughput Vn
      • Example:
        • Stage 1: use T1, T2 and T3
        • Stage 2: use T1 and T2 to send the data previously sent by T3
        • Stage 3: use T1 to send the data previously sent by T2
      • (A partial tree reaches only some of the nodes, so the data it carried must be re-sent through the remaining trees in later stages)
    • 24. Properties of Our Algorithm
      • Stable for tree topologies (whose links have the same capacity in both directions)
        • Every node receives its maximum possible bandwidth
      • For any topology, it achieves greater aggregate bandwidth than the baseline algorithm (FPFR)
        • Partial trees let it fully utilize link capacity
      • The calculation cost of creating a broadcast plan is small
    • 25. Agenda
      • Introduction
      • Problem Settings
      • Related Work
      • Proposed Algorithm
      • Evaluation
      • Conclusion
    • 26. (1) Simulations
      • Simulated 5 broadcast algorithms on a real topology (clusters of 110, 81, 36 and 4 nodes)
      • Compared the aggregate bandwidth of each method under:
        • Many bandwidth distributions
        • Broadcasts to 10, 50, and 100 nodes
        • 10 different (source, destination) conditions
    • 27. Compared Algorithms: Flat Tree, Random, Dijkstra, Depth-First (FPFR), and Ours
    • 28. Result of Simulations
      • Mixed two kinds of links (100 and 1000)
        • Vertical axis: speedup over FlatTree
        • 40 times the bandwidth of Random and 3 times that of Depth-First (FPFR) with 100 nodes
    • 29. Result of Simulations (cont.)
      • Tested 8 bandwidth distributions:
        • Uniform distribution (500-1000)
        • Uniform distribution (100-1000)
        • Mixed 100 and 1000 links
        • Uniform distribution (100-1000) between switches
        • (for each distribution, two conditions were tested: the bandwidths of the two directions of a link are the same, or different)
      • Our method achieved the largest aggregate bandwidth in 7 of the 8 cases
        • The improvement was especially large when the bandwidth variance was large
        • With the uniform (100-1000) distribution and different bandwidths in the two directions, Dijkstra achieved 2% more aggregate bandwidth
    • 30. (2) Real-Machine Experiment
      • Performed broadcasts across 4 clusters
        • Number of destinations: 10, 47 and 105 nodes
        • Link bandwidths: 10 Mbps - 1 Gbps
      • Compared the aggregate bandwidth of 4 algorithms:
        • Our algorithm
        • Depth-first (FPFR)
        • Dijkstra
        • Random (best of 100 trials)
    • 31. Theoretical Maximum Aggregate Bandwidth
      • We also calculated the theoretical maximum aggregate bandwidth
        • The sum, over all destinations, of the receiving bandwidth each would get in a separate direct transfer from the source (see the sketch after this slide)
        • Ex: destinations with direct-transfer bandwidths 100, 10, 120 and 100 give a theoretical maximum of 330
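A minimal sketch of this bound for a tree topology, where each destination has a unique path from the source and its direct-transfer bandwidth is the bottleneck capacity on that path. The star-shaped example topology and the function names are illustrative assumptions, chosen only to reproduce the numbers in the slide's figure.

```python
# Minimal sketch: theoretical maximum aggregate bandwidth on a tree topology
# (illustrative; names are not from the paper).

def path_bottleneck(capacity, src, dst, seen=None):
    """Bottleneck capacity on the unique src->dst path, or None if unreachable."""
    if src == dst:
        return float("inf")
    seen = (seen or set()) | {src}
    for nxt, cap in capacity[src].items():
        if nxt not in seen:
            rest = path_bottleneck(capacity, nxt, dst, seen)
            if rest is not None:
                return min(cap, rest)
    return None

def max_aggregate_bandwidth(capacity, source, destinations):
    return sum(path_bottleneck(capacity, source, d) for d in destinations)

# Example matching the slide's figure: direct-transfer bandwidths 100, 10, 120, 100.
star = {"S": {"D0": 100, "D1": 10, "D2": 120, "D3": 100},
        "D0": {"S": 100}, "D1": {"S": 10}, "D2": {"S": 120}, "D3": {"S": 100}}
print(max_aggregate_bandwidth(star, "S", ["D0", "D1", "D2", "D3"]))  # 330
```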
    • 32. Evaluation of Aggregate Bandwidth
      • For the 105-node broadcast, 2.5 times the aggregate bandwidth of the baseline algorithm DepthFirst (FPFR)
      • However, our algorithm reached only 50-70% of the theoretical maximum aggregate bandwidth
        • Computational nodes cannot fully utilize the up and down directions of the network at the same time
    • 33. Evaluation of Stability
      • Compared the aggregate bandwidth of 9 nodes before and after adding one slow node
        • Unlike with DepthFirst (FPFR), in our algorithm the existing nodes do not suffer from the added slow node
        • Achieved 1.6 times the bandwidth of Dijkstra
    • 34. Agenda
      • Introduction
      • Problem Settings
      • Related Work
      • Our Algorithm
      • Evaluation
      • Conclusion
    • 35. Conclusion
      • Introduced the notion of Stable Broadcast
        • Slow nodes never degrade the receiving bandwidth of fast nodes
      • Proposed a stable broadcast algorithm for tree topologies
        • Proved stable theoretically
        • Achieved 2.5 times the aggregate bandwidth in real-machine experiments
        • Confirmed speedups in simulations under many different conditions
    • 36. Future Work
      • An algorithm that maximizes aggregate bandwidth on general graph topologies
      • An algorithm that adapts the relay schedule by detecting bandwidth fluctuations
    • 38. All the graphs: relative performance vs. number of destinations (10, 50, 100) for Ours, Depthfirst, Dijkstra, Random (best) and Random (avg), under (a)(b) low bandwidth variance, (c)(d) high bandwidth variance, (e)(f) mixed fast and slow links, and (g)(h) random bandwidth among clusters, each in symmetric and asymmetric variants, plus (i) multiple sources (low bandwidth variance, symmetric)
    • 39. Broadcast with BitTorrent [†]
      • BitTorrent gradually improves the transfer schedule by adaptively choosing parent nodes
      • Since the relaying structure created by BitTorrent has many branches, some links may become bottlenecks
      [†] Wei et al. Scheduling Independent Tasks Sharing Large Data Distributed with BitTorrent (GRID '05)
    • 40. Simulation 1
      • Uniform distribution (100-1000) between switches
        • Vertical axis: speedup over FlatTree
        • 36 times the bandwidth of FlatTree and 1.2 times that of DepthFirst (FPFR) for a 100-node broadcast
    • 41. Topology-Unaware Pipeline
      • Traces all the destinations from the source without considering the topology
        • Links used by many transfers become bottlenecks
    • 42. Depth-First Pipeline [†]
      • Constructs a depth-first pipeline using topology information (a sketch follows this slide)
        • Avoids link sharing by using each link only once
        • Minimizes the completion time on a tree topology
      • Slow nodes degrade the performance of the other nodes
      [†] Shirai et al. A Fast Topology Inference: A Building Block for Network-aware Parallel Computing (HPDC 2007)
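A minimal sketch of building a depth-first pipeline over a tree topology: compute nodes are chained in depth-first visit order, so that consecutive hops traverse each physical link at most once in each direction. The tree representation and names are illustrative assumptions, not the exact procedure from the cited paper.

```python
# Minimal sketch: depth-first pipeline order over a tree topology given as
# {node: [children...]} rooted at the source; `hosts` is the set of compute
# nodes that can relay data (switches only forward traffic). Illustrative.

def depth_first_pipeline(tree, source, hosts):
    """Return the relay order: source first, then compute nodes in DFS order."""
    order = []

    def dfs(node):
        if node in hosts:
            order.append(node)
        for child in tree.get(node, []):
            dfs(child)

    dfs(source)
    return order

# Example: source S under switch sw0, two racks of compute nodes.
tree = {"S": ["sw0"], "sw0": ["sw1", "sw2"],
        "sw1": ["A", "B"], "sw2": ["C", "D"]}
hosts = {"S", "A", "B", "C", "D"}
print(depth_first_pipeline(tree, "S", hosts))   # ['S', 'A', 'B', 'C', 'D']
```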
    • 43. Dijkstra Algorithm [†]
      • Constructs a relaying structure in a greedy manner (a sketch follows this slide)
        • Adds, one by one, the node reachable with the maximum bandwidth
        • The effect of slow nodes is small
      • Some links may be used by many transfers and become bottlenecks
      [†] Wang et al. A Novel Data Grid Coherence Protocol Using Pipeline-based Aggressive Copy Method (GPC 2007, pages 484-495)
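A minimal sketch of the greedy, Dijkstra-like construction the slide describes: repeatedly attach the not-yet-added node whose obtainable bandwidth from the already-built structure is largest, recording which node it receives from. The graph format and names are illustrative, not from the cited paper.

```python
# Minimal sketch of greedy (widest-path) relay-tree construction. The graph is
# {node: {neighbor: bandwidth}}; a node relayed via u receives at most
# min(u's receiving bandwidth, bandwidth of the u->v link). Illustrative only.
import heapq

def widest_path_tree(bandwidth, source):
    """Return {node: parent} describing who relays data to each node."""
    best = {source: float("inf")}       # best bottleneck bandwidth found so far
    parent = {source: None}
    heap = [(-best[source], source)]    # max-heap via negated keys
    done = set()
    while heap:
        neg_bw, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, bw in bandwidth[u].items():
            cand = min(-neg_bw, bw)     # bandwidth obtainable at v via u
            if cand > best.get(v, 0):
                best[v], parent[v] = cand, u
                heapq.heappush(heap, (-cand, v))
    return parent

# Example: a fast pair {A, B} and one slow node C attached to the source.
g = {"S": {"A": 100, "C": 10}, "A": {"S": 100, "B": 100},
     "B": {"A": 100}, "C": {"S": 10}}
print(widest_path_tree(g, "S"))   # A receives from S, B from A, C from S
```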