TCP for Data center networks
Deepti Surjyendu Ray
What is a Datacenter?
- A facility used for housing a large amount of computer and communications equipment, maintained by an organization for the purpose of handling the data necessary for its operations. (MSDN Glossary)
- A data center (sometimes spelled "datacenter") is a centralized repository, either physical or virtual, for the storage, management, and dissemination of data and information organized around a particular body of knowledge or pertaining to a particular business.
What is a Datacenter network?
- Data centers consist of:
  - server racks with servers (compute nodes/storage);
  - switches;
  - connecting links, along with their topology.
- The network architecture is typically a tree of routing and switching elements, with progressively more specialized and expensive equipment moving up the network hierarchy.
Properties of a typical Datacenter network
- Characteristics of a datacenter network:
  - High fan-in of the tree.
  - High-bandwidth, low-latency workloads.
  - Clients that issue barrier-synchronized requests in parallel.
  - A relatively small amount of data per request.
  - Network constraint: small switch buffers.
The TCP Incast problem
- Incast: TCP throughput collapse, i.e.,
  - a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP.
- Leading to:
  - gross underutilization of link capacity in many-to-one communication networks, such as data center networks.
The root of TCP Incast
- Highly bursty, fast data transmissions overfill Ethernet switch buffers (a toy model of the overflow follows below).
- The result is intense packet loss, which in turn triggers TCP timeouts.
- These TCP timeouts last hundreds of milliseconds:
  - TCP timeout ≈ 100s of ms
- But the round-trip time of a data center network is around hundreds of microseconds:
  - RTT ≈ 100s of µs
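To make the overflow concrete, here is a toy model of n servers bursting simultaneously into one shallow output-port buffer. The parameters are illustrative choices of mine, not the paper's simulator settings:

```c
/* Toy model (illustrative parameters, not the paper's simulator) of why
 * synchronized bursts overfill a shallow switch buffer: n servers each
 * inject a burst into one output port that can drain only a little of it. */
#include <stdio.h>

int main(void)
{
    int buffer_pkts  = 64;      /* shallow per-port switch buffer */
    int burst_pkts   = 8;       /* packets each server sends at once */
    int drained_pkts = 16;      /* what the output link drains meanwhile */

    for (int n = 4; n <= 64; n *= 2) {
        int arriving = n * burst_pkts;
        int dropped  = arriving - buffer_pkts - drained_pkts;
        if (dropped < 0)
            dropped = 0;
        printf("%2d servers: %3d pkts arrive, %3d dropped (%.0f%%)\n",
               n, arriving, dropped, 100.0 * dropped / arriving);
    }
    return 0;
}
```

Past a modest fan-in, almost every additional packet is dropped, which is exactly the loss that drives the flows into timeout.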
Round trips!
- RTT << TCP timeout.
- The sender must wait out the TCP timeout before retransmitting, i.e., the retransmission timeout (RTO).
- Coarse-grained RTOs reduce application throughput by 90%.
Link Idle Time Due To Timeouts
- RTT << TCP timeout.
- The sender must wait out the TCP timeout (the RTO) before retransmitting; the link carries nothing in the meantime.
- Coarse-grained RTOs reduce application throughput by 90% (see the worked estimate below).
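A back-of-the-envelope estimate of the idle-link effect, assuming, purely for illustration, that every request round suffers one timeout and that the useful transfer itself takes about 2 ms:

```latex
% Goodput fraction when every request round stalls for one full minRTO.
% T_transfer = time actually spent moving data; 2 ms is an illustrative
% value for a small striped block on a gigabit link.
\[
\frac{\text{goodput}}{\text{link capacity}}
  \approx \frac{T_{\text{transfer}}}{T_{\text{transfer}} + \text{minRTO}}
  = \frac{2\,\text{ms}}{2\,\text{ms} + 200\,\text{ms}} \approx 1\%
\]
```

Even if only a fraction of rounds time out, the reduction easily reaches the 90% figure quoted on the slide.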
Induced timeouts due to barrier synchronization
- The client cannot make forward progress until the responses from every server for the current request have been received.
- Barrier-synchronized workloads are becoming increasingly common in today's commodity clusters, e.g.:
  - parallel reads/writes in cluster file systems such as Lustre and Panasas;
  - search queries sent to dozens of nodes, with the results returned and sorted.
Barrier Synchronization: a typical request pattern in Data Centers
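A minimal sketch, hypothetical rather than the paper's actual test harness, of the barrier-synchronized pattern in the figure above: the client's round finishes only when the slowest of the n servers responds, so a single 200 ms RTO stalls the entire request:

```c
/* Minimal sketch (hypothetical, not the paper's test harness) of the
 * barrier-synchronized request pattern: a client asks n servers for one
 * stripe each and cannot issue the next request until ALL stripes arrive. */
#include <stdio.h>

#define BLOCK_BYTES (1024 * 1024)   /* 1 MB block, striped across servers */

/* Pretend "service time" per server, in microseconds: one RTT unless the
 * server's packets were dropped, in which case it stalls for a full RTO. */
static double server_response_us(int lost, double rtt_us, double rto_us)
{
    return lost ? rto_us + rtt_us : rtt_us;
}

int main(void)
{
    int n = 48;                      /* number of servers in the fan-in */
    double rtt_us = 100.0;           /* datacenter RTT ~ 100 us */
    double rto_us = 200000.0;        /* minRTO = 200 ms */

    /* The round finishes only when the SLOWEST server responds. */
    double round_us = 0.0;
    for (int i = 0; i < n; i++) {
        int lost = (i == 0);         /* one unlucky server suffers a drop */
        double t = server_response_us(lost, rtt_us, rto_us);
        if (t > round_us)
            round_us = t;
    }

    printf("stripe size: %d bytes/server\n", BLOCK_BYTES / n);
    printf("round completion time: %.1f us (vs. %.1f us loss-free)\n",
           round_us, rtt_us);
    return 0;
}
```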
Idle Link issue!! (two figure slides illustrating the link standing idle while the sender waits out the timeout)
200 ms timeouts → Throughput Collapse
- Adding more servers to the network overflows the switch buffer.
- This overflow causes severe packet loss.
- Under such packet loss, TCP experiences a timeout that lasts a minimum of 200 ms.
Proposed solution to TCP Incast
- A two-pronged attack on the problem:
  - System extensions to enable microsecond-granularity retransmissions:
    - fine-grained TCP retransmissions through high-resolution Linux kernel timers;
    - reducing RTOmin improves system throughput.
  - Removing acknowledgement delay:
    - the client no longer delays ACKs (with delayed ACKs it acknowledges only every other packet).
Motivation to resolve Incast using TCP
- TCP is well understood and mature, facilitating its use as a transport protocol in data centers.
- Commodity Ethernet switches are cost-competitive with specialized technology such as InfiniBand.
- Because TCP is well understood, we can harness the TCP stack and modify it to overcome the limitation imposed by the small buffers in the switches.
Solution Domain
Insight into fine-grained TCP
- Premise:
  - The timers must operate on a granularity close to the RTT of the network: hundreds of microseconds or less.
RTO Estimation and Minimum Bound
- Jacobson's TCP RTO estimator:
  - RTO_estimated = SRTT + (4 × RTTVAR)
- Actual RTO = max(minRTO, RTO_estimated)
- Minimum RTO bound: minRTO = 200 ms, due to:
  - TCP timer granularity;
  - safety (Allman99).
- minRTO (200 ms) >> datacenter RTT (100 µs):
  - one TCP timeout lasts 1000 datacenter RTTs! (A sketch of the estimator follows below.)
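For concreteness, a compact sketch of the estimator above with the standard RFC 6298 gains (α = 1/8, β = 1/4); the minRTO clamp is left as a parameter so its dominance at datacenter RTTs is easy to see:

```c
/* Sketch of the standard RTO estimator (Jacobson / RFC 6298), with the
 * minimum bound made explicit so its effect in a datacenter is visible. */
#include <stdio.h>

struct rto_state {
    double srtt;     /* smoothed RTT, seconds */
    double rttvar;   /* RTT variance estimate, seconds */
    int    init;     /* 0 until the first sample */
};

static double rto_update(struct rto_state *s, double r, double min_rto)
{
    if (!s->init) {                 /* first measurement (RFC 6298 2.2) */
        s->srtt = r;
        s->rttvar = r / 2.0;
        s->init = 1;
    } else {                        /* subsequent measurements (2.3) */
        double alpha = 0.125, beta = 0.25;
        double err = s->srtt - r;
        s->rttvar = (1 - beta) * s->rttvar + beta * (err < 0 ? -err : err);
        s->srtt   = (1 - alpha) * s->srtt + alpha * r;
    }
    double rto = s->srtt + 4.0 * s->rttvar;
    return rto > min_rto ? rto : min_rto;   /* RTO = max(minRTO, estimate) */
}

int main(void)
{
    struct rto_state s = {0};
    double rtt = 100e-6;                    /* 100 us datacenter RTT */
    for (int i = 0; i < 8; i++)
        rto_update(&s, rtt, 0.0);           /* warm up the estimator */
    printf("RTO with minRTO = 200 ms: %.6f s\n", rto_update(&s, rtt, 0.200));
    printf("RTO with no minRTO     : %.6f s\n", rto_update(&s, rtt, 0.0));
    return 0;
}
```

With 100 µs RTTs the estimate settles near a few hundred microseconds; the 200 ms clamp then dominates by three orders of magnitude.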
Evaluation workload
- The test client requests a block of data, striped across n servers.
- Each server thus responds with blocksize/n bytes of data.
µsecond Retransmission Timeouts (RTO)
- RTO = max(minRTO, f(RTT))
- Does eliminating minRTO help avoid TCP incast collapse?
Simulation result
- Reducing minRTO in simulation from the current default of 200 ms down to microseconds improves goodput.
Real-world cluster
- Experiments on a real cluster validate the simulation result: reducing minRTO improves goodput.
TCP Requirements for µsecond RTO
- TCP must track RTT in microseconds:
  - efficient high-resolution kernel timers are required;
  - use the HPET (High Precision Event Timer) for efficient interrupt signaling.
- The HPET is a programmable hardware timer consisting of a free-running up-counter and several comparators and registers, which modern operating systems can program. (A userspace sketch follows below.)
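As a userspace analogue, not the kernel modification itself, the following sketch shows the two ingredients using standard Linux APIs: microsecond RTT accounting via clock_gettime() and a microsecond-granularity one-shot timeout via timerfd (which is backed by hrtimers):

```c
/* Userspace analogue (a sketch, not the kernel change itself) of
 * microsecond time accounting plus a microsecond-granularity timeout. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <time.h>
#include <sys/timerfd.h>

static int64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) { perror("timerfd_create"); return 1; }

    /* Arm a 500 us one-shot "retransmission timer". */
    struct itimerspec its = {0};
    its.it_value.tv_nsec = 500 * 1000;
    timerfd_settime(tfd, 0, &its, NULL);

    int64_t t0 = now_us();
    uint64_t expirations;
    read(tfd, &expirations, sizeof(expirations));  /* blocks until expiry */
    printf("timer fired after %lld us\n", (long long)(now_us() - t0));

    close(tfd);
    return 0;
}
```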
Modifications to the TCP stack
- The minimal modifications required of the TCP stack to support hrtimers are:
  - microsecond-resolution time accounting, to track RTTs with greater precision;
  - redefinition of TCP constants;
  - replacement of low-resolution timers with hrtimers.
µsecond TCP + no minRTO
- For a 48-node cluster, providing TCP retransmissions at microsecond granularity eliminates incast collapse for up to 47 servers.
Simulation: Scaling to thousands
- In simulation, introducing a randomized component into the RTO desynchronizes retransmissions following timeouts, and avoids goodput degradation for a large number of flows (sketch below).
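One plausible form of that randomization, sketched with an illustrative 50% spread since the slide does not give the exact distribution used in the simulations:

```c
/* Sketch of a desynchronized retransmission timer: each sender scales its
 * computed RTO by an independent random factor, so flows that all timed
 * out together do not all retransmit in the same instant.
 * The 50% spread is an illustrative choice, not the paper's exact one. */
#include <stdio.h>
#include <stdlib.h>

static double randomized_rto(double rto_us)
{
    double r = (double)rand() / RAND_MAX;       /* uniform in [0, 1] */
    return rto_us * (1.0 + 0.5 * r);            /* RTO * [1.0, 1.5) */
}

int main(void)
{
    srand(42);
    double base_rto_us = 300.0;                 /* a microsecond-scale RTO */
    for (int flow = 0; flow < 5; flow++)
        printf("flow %d retransmits after %.1f us\n",
               flow, randomized_rto(base_rto_us));
    return 0;
}
```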
Conclusion
- Microsecond-granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput.
- They are safe for wide-area communication.
- This paper presented a practical, effective, and safe solution to eliminate TCP incast in data center environments:
  - microsecond-granularity TCP timeouts;
  - randomized retransmissions.
Future Work
- The practical implementation of the proposed work was shown on about 48 servers in a data center.
- Its practical implementation needs to be evaluated on thousands of machines.
- Narrow down the TCP variables of interest for introducing microsecond granularity, to shrink the problem space.
Speaker note (slides 18-19, RTO Estimation and Minimum Bound):
It all goes back to the way an RTO is calculated. Every packet sent out is associated with a timeout value, and this RTO should be roughly equal to the RTT of the network. Think of it this way: if I send a packet and do not get a response within the time it takes to reach the other end and come back, the packet is probably lost. But the RTT cannot be determined in advance for each flow, because RTTs vary with buffering at routers/switches and with network conditions. Individual RTTs are measured by attaching a timestamp to a packet when it leaves and checking the time when the ACK for that packet comes back; to simplify state maintenance, ACKs echo back the timestamp of the data packet they are acknowledging. The RTT of the network is thus estimated and smoothed over time, and the RTO is a conservative value based on the smoothed RTT and the RTT variance. The actual RTO of a packet is given by the max(...) equation on the RTO slide above. The minRTO of 200 ms exists because of TCP timer granularity (you can time out in 200 ms only if you can tell 200 ms have passed, which requires the TCP timer to be interrupted at intervals well under that) and because of safety concerns (Allman99). The thing to note is that 200 ms is three orders of magnitude greater than datacenter RTTs.
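The timestamp-echo mechanism the note describes can be sketched as follows; the struct names are hypothetical stand-ins for what real TCP carries in its timestamp option:

```c
/* Sketch of timestamp-echo RTT measurement as described in the note:
 * the sender stamps each segment; the ACK echoes that stamp back, so the
 * sender needs no per-packet state to compute the RTT sample. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static int64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

struct segment { uint32_t seq; int64_t tsval_us; };   /* hypothetical */
struct ack     { uint32_t ack_seq; int64_t ts_echo_us; };

int main(void)
{
    struct segment seg = { .seq = 1, .tsval_us = now_us() };

    /* ... the network delivers seg; the receiver echoes the stamp ... */
    struct ack a = { .ack_seq = seg.seq + 1, .ts_echo_us = seg.tsval_us };

    int64_t rtt_sample_us = now_us() - a.ts_echo_us;
    printf("RTT sample: %lld us\n", (long long)rtt_sample_us);
    return 0;
}
```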