Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat (USENIX NSDI 2010)
Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao
Dec. 22nd, 2011
Problem
• Relying on multipathing, due to…
– Limited port densities of routers/switches
– Horizontal expansion
• Multi-rooted tree topologies
– Example: Fat-tree / Clos

Problem
• BW demand is essential and volatile
– Must route among multiple paths
– Avoid bottlenecks and deliver aggregate BW
• However, current multipath routing…
– Mostly: flow-hash-based ECMP
– Static and oblivious to link utilization
– Causes long-term large-flow collisions
• Inefficiently utilizes path diversity
– Need a protocol or a scheduler
Collisions of elephant flows
• Collisions in two ways: upward or downward
(figure: elephant flows between source-destination pairs S1-D1 through S4-D4 colliding in the fat-tree)
Equal Cost Paths
• Many equal-cost paths going up to the core switches
• Only one path down from each core switch
• Need to find a good flow-to-core mapping
Goal
• Given dynamic flow demands
– Need to find paths that maximize network bisection BW
– No end-host modifications
• However, local switch information alone cannot find a proper allocation
– Need a central scheduler
– Must use commodity Ethernet switches
– OpenFlow
Architecture
• Detect large flows
– Flows that need bandwidth but are network-limited
• Estimate flow demands
– Use max-min fairness to allocate BW among flows between S-D pairs
• Allocate flows
– Use estimated demands to heuristically find better placements of large flows on the equal-cost paths
– Update the switches and iterate again
(loop: Detect Large Flows → Estimate Flow Demands → Allocate Flows)

Architecture
• Feedback loop
• Optimize achievable bisection BW by assigning flow-to-core mappings
• Heuristics for flow demand estimation and placement
• Central scheduler
– Global knowledge of all links in the network
– Controls the flow tables of all switches (OpenFlow)
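The feedback loop above can be read as a simple periodic scheduler. Below is a minimal sketch of that loop in Python; the stage functions (poll_edge_switches, estimate_demands, place_flows, install_paths) are hypothetical placeholders for the three stages, not the authors' code or a real OpenFlow API.

```python
import time

def control_loop(poll_edge_switches, estimate_demands, place_flows,
                 install_paths, period=5.0):
    """One scheduling round per period: detect large flows, estimate their
    natural demands, compute a flow-to-core placement, and push it to the
    switches. Small flows keep using default ECMP in the meantime."""
    while True:
        large_flows = poll_edge_switches()              # detect large flows
        demands = estimate_demands(large_flows)         # max-min fair demands
        placement = place_flows(large_flows, demands)   # Global First-Fit or SA
        install_paths(placement)                        # update switch flow tables
        time.sleep(period)                              # wait for the next round
```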
Elephant Detection
Elephant Detection
• Scheduler polls edge switches
– Flows exceeding a threshold are "large"
– Threshold: 10% of the host link capacity (> 100 Mbps)
• Small flows: default ECMP hashing
• Hedera complements ECMP
– Default forwarding is ECMP
– Only schedules large flows that contribute to bisection BW bottlenecks
• Centralized functions: the essentials
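As a quick illustration of the detection rule, here is a minimal sketch assuming the scheduler has already polled per-flow counters from the edge switches and converted them to rates; flow_stats and its layout are illustrative assumptions, not an OpenFlow data structure.

```python
LINK_CAPACITY_BPS = 1_000_000_000           # 1 Gbps host links
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS    # "large" = above 10% of link capacity

def detect_large_flows(flow_stats):
    """flow_stats: {flow_id: current rate in bits per second}.
    Returns the ids of flows the scheduler should place; all other
    flows stay on default ECMP hashing."""
    return [fid for fid, rate in flow_stats.items() if rate > THRESHOLD_BPS]
```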
Demand Estimation
Demand Estimation
• Current flow rate is misleading
– It may already be constrained by the network
• Need to find a flow's "natural" BW demand when not limited by the network
– As if limited only by the NIC of the source or destination
• Allocate source/destination capacity among flows using max-min fairness
• Equals the BW allocation of optimal routing; used as input to the placement algorithm

Demand Estimation
• Given the set of large flows, modify each flow's size at sources/destinations iteratively
– Sources distribute unconverged BW among their flows
– Receiver-limited: redistribute BW among excessive-demand flows
– Repeat until all flows converge
• Guaranteed to converge in O(|F|)
– Linear in the number of flows
Demand Estimation: Example
• Senders A, B, C; receivers X, Y; large flows AX, AY, BY, CY (host capacities normalized to 1)

Step 1: senders distribute unconverged BW among their flows.

Sender | Avail. unconv. BW | Flows | Share
A      | 1                 | 2     | 1/2
B      | 1                 | 1     | 1
C      | 1                 | 1     | 1

Flow | Estimate | Conv.?
AX   | 1/2      |
AY   | 1/2      |
BY   | 1        |
CY   | 1        |

Step 2: receivers limit oversubscribed flows (RL? = receiver-limited; Non-SL = not sender-limited).

Recv | RL? | Non-SL flows | Share
X    | No  | -            | -
Y    | Yes | 3            | 1/3

Step 3: flows at Y are capped to 1/3 and converge; sender A redistributes the freed BW.

Flow | Estimate | Conv.?
AX   | 1/2      |
AY   | 1/3      | Yes
BY   | 1/3      | Yes
CY   | 1/3      | Yes

Sender | Avail. unconv. BW | Flows | Share
A      | 2/3               | 1     | 2/3
B      | 0                 | 0     | 0
C      | 0                 | 0     | 0

Step 4: no receiver is still limited; all flows have converged.

Recv | RL? | Non-SL flows | Share
X    | No  | -            | -
Y    | No  | -            | -

Flow | Estimate | Conv.?
AX   | 2/3      | Yes
AY   | 1/3      | Yes
BY   | 1/3      | Yes
CY   | 1/3      | Yes
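The iterative estimator above is easy to express in code. The following is a minimal sketch of the sender/receiver passes, assuming host capacities normalized to 1.0 and a simplified equal split at oversubscribed receivers; it is not the authors' implementation, and the flow names are only those from the example.

```python
from collections import defaultdict

def estimate_demands(flows, eps=1e-6):
    """flows: list of (src, dst) pairs for the detected large flows.
    Returns {(src, dst): estimated demand as a fraction of host capacity}."""
    demand = {f: 0.0 for f in flows}
    converged = {f: False for f in flows}
    changed = True
    while changed:
        changed = False
        # Sender pass: each source spreads its remaining capacity evenly
        # over its not-yet-converged flows.
        by_src = defaultdict(list)
        for f in flows:
            by_src[f[0]].append(f)
        for fs in by_src.values():
            fixed = sum(demand[f] for f in fs if converged[f])
            free = [f for f in fs if not converged[f]]
            if free:
                share = max(0.0, 1.0 - fixed) / len(free)
                for f in free:
                    if abs(demand[f] - share) > eps:
                        demand[f] = share
                        changed = True
        # Receiver pass: an oversubscribed destination caps every flow above
        # its fair share and marks those flows as converged.
        by_dst = defaultdict(list)
        for f in flows:
            by_dst[f[1]].append(f)
        for fs in by_dst.values():
            if sum(demand[f] for f in fs) > 1.0 + eps:
                share = 1.0 / len(fs)
                for f in fs:
                    if not converged[f] and demand[f] > share - eps:
                        demand[f] = share
                        converged[f] = True
                        changed = True
    return demand

# The example above: AX -> 2/3; AY, BY, CY -> 1/3 each.
print(estimate_demands([("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]))
```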
Placement Heuristics
Placement Heuristics
• Find a good large-flow-to-core mapping
– such that average bisection BW is maximized
• Two approaches:
• Global First-Fit: greedily choose the first path with sufficient unreserved BW
– O((ports/switch)^2)
• Simulated Annealing: iteratively search for a globally better mapping of flows to paths
– O(# flows)
Global First-Fit
• When a new large flow is detected, linearly search all paths from S to D
• Place the flow on the first path whose links can fit it
• Once the flow ends, entries and reservations time out
(figure: scheduler placing Flows A, B, C across core switches 0-3 between S and D)
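In code, Global First-Fit is a single scan over the candidate paths against a per-link reservation table. Below is a minimal sketch under that assumption; the data layout (link ids, reservation dict) is illustrative, not the authors' implementation.

```python
def global_first_fit(paths, reserved, capacity, demand):
    """paths: candidate S->D paths, each a list of link ids.
    reserved: {link id: bandwidth already reserved on that link}.
    capacity: {link id: link capacity}.
    demand: estimated demand of the new large flow.
    Returns the first path that fits the flow (and reserves it), else None."""
    for path in paths:
        if all(reserved[l] + demand <= capacity[l] for l in path):
            for l in path:
                reserved[l] += demand   # reserve bandwidth along the chosen path
            return path
    return None                         # no fit: leave the flow on default ECMP
```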
Simulated Annealing
• Annealing: letting metal cool down to obtain a better crystal structure
– Heating up enters a higher-energy state
– Cooling moves to a lower-energy state with a better structure, stopping at some temperature
• Simulated Annealing:
– Search the neighborhood for possible states
– Probabilistically accept worse states
– Always accept better states; settle gradually
– Avoids local minima

Simulated Annealing
• State / state space
– Possible solutions
• Energy
– The objective to minimize
• Neighborhood
– Other options reachable from the current state
• Boltzmann function
– Probability of moving to a higher-energy state: P(E, E1, t) = e^((E - E1) / t)
• Control temperature
– The current temperature affects the probability of moving to a higher-energy state
• Cooling schedule
– How the temperature falls
• Stopping criterion

Simulated Annealing
• State space:
– All possible large-flow-to-core mappings
– However, flows to the same destination map to the same core
– Reduces the state space, as long as there are not too many large flows and the threshold is proper
• Neighborhood:
– Swap the cores assigned to two hosts within the same pod, attached to the same edge / aggregation switch
– Avoids local minima
Simulated Annealing
• Energy:
– Computed from the estimated demands of flows
– Total exceeded BW capacity over all links; minimize it
• Temperature: the number of remaining iterations
• Probability: the Boltzmann function above
• The final state is published to the switches and used as the initial state for the next round
• Incremental calculation of exceeded capacity
– No recalculation over all links; only newly detected large flows and neighborhood swaps
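Putting the pieces together, here is a minimal generic sketch of the annealing loop described on these slides, with the remaining iterations used as the temperature and the Boltzmann function as the acceptance rule. The initial_state, neighbor, and energy arguments are placeholders: in Hedera's setting the state would be a destination-to-core mapping, neighbor() would swap the cores of two hosts in the same pod, and energy() would sum the demand in excess of capacity over all links. This is a sketch of the technique, not the authors' implementation.

```python
import math
import random

def anneal(initial_state, energy, neighbor, iterations=1000):
    """Minimize energy(state) by simulated annealing over a fixed number of
    iterations; returns the best state seen."""
    state, e = initial_state, energy(initial_state)
    best, best_e = state, e
    for i in range(iterations):
        t = iterations - i                    # temperature = remaining iterations
        candidate = neighbor(state)
        e_new = energy(candidate)
        # Always accept a better state; accept a worse one with
        # Boltzmann probability e^((e - e_new) / t).
        if e_new < e or random.random() < math.exp((e - e_new) / t):
            state, e = candidate, e_new
            if e < best_e:
                best, best_e = state, e
    return best
```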
Evaluation
Implementation
• 16 hosts, k=4 fat-tree data plane
– 20 switches: 4-port NetFPGAs / OpenFlow
– Parallel 48-port non-blocking Quanta switch
– 1 scheduler, OpenFlow control protocol
– Testbed: PortLand
Simulator
• k=32; 8,192 hosts
– Packet-level simulators are not applicable
– At 1 Gbps for 8k hosts, that would be 2.5x10^11 packets
• Model TCP flows
– TCP's AIMD when constrained by topology
– Poisson arrival of flows
– No packet-size variations
– No bursty traffic
– No inter-flow dynamics
PortLand/OpenFlow, k=4
(results figure)

Simulator
(results figure)
Reactiveness
• Demand estimation:
– 27K hosts, 250K flows: converges in < 200 ms
• Simulated annealing:
– Asymptotically dependent on the number of flows plus the number of iterations; 50K flows and 1K iterations take 11 ms
– Most of the final bisection BW is reached within a few hundred iterations
• Scheduler control loop:
– Polling + estimation + SA = 145 ms for 27K hosts
Comments
Comments
• Flows destined to the same host go through the same core
– May congest at the cores, but how severe is this?
– Large flows to/from a host: < k/2
– No proof, no evaluation
• Decreases the search space and runtime
– But is it scalable on a per-flow basis? For large k?
• No protection for mice flows and RPCs
– Only assumes they work well under ECMP
– Does not address routing them alongside large flows
Comments
• Own flow-level simulator
– Aims to saturate the network
– No breakdown of flow counts by size
– Traffic generation: average flow size and arrival rate (Poisson) with a mean
– Only the above descriptions; no specific numbers
– Too idealized, or not volatile enough?
– Reports average bisection BW, but where are the real-time graphs?
• States that per-flow VLB = per-flow ECMP
– Does not compare with other options (VL2)
– No further elaboration
Comments
• Shared responsibility
– The controller deals only with critical situations
– Switches perform the default measures
– Improves performance and saves time
– How to strike a balance?
– Can this be adapted to different problems?
• Default multipath routing
– States the problems of per-flow VLB and ECMP
– How about per-packet? The authors' future work
– How to improve the switches' default actions?
Comments
• Critical controller actions
– Considers large flows that degrade overall efficiency
– What are the critical situations?
– How to detect and react?
– How to improve reactiveness and adaptability?
• Amin Vahdat's lab
– Proposed the fat-tree topology
– Developed PortLand L2 virtualization
– Hedera: enhances multipath performance
– Integrates all of the above
References
• M. Al-Fares et al., "Hedera: Dynamic Flow Scheduling for Data Center Networks", USENIX NSDI 2010
• Tathagata Das, "Hedera: Dynamic Flow Scheduling for Data Center Networks", UC Berkeley course CS 294
• M. Al-Fares, "Hedera: Dynamic Flow Scheduling for Data Center Networks", USENIX NSDI 2010 presentation slides
Supplement
Fault-Tolerance
• Link / switch failure
– Use PortLand's fault notification protocol
– Hedera routes around failed components
(figure: scheduler and Flows A, B, C across core switches 0-3)

Fault-Tolerance
• Scheduler failure
– Soft state, not required for correctness (connectivity)
– Switches fall back to ECMP
Limitations
• Dynamic workloads where large-flow turnover is faster than the control loop
– The scheduler will be continually chasing the traffic matrix
• Need to include a penalty term for unnecessary SA flow re-assignments
(figure: flow size vs. traffic-matrix stability, with regions labeled Stable/Unstable and ECMP/Hedera)
