Part 8 : TCP and Congestion control

  1. Week 8 TCP Congestion control
  2. Agenda • TCP • Connection establishment (more details) • Reliable data transfer • Connection release • Congestion control
  3. Three-way handshake • A (CONNECT.req) sends SYN(seq=x), where x is A's initial sequence number • B (CONNECT.ind, CONNECT.resp) replies with SYN+ACK(ack=x+1,seq=y), where y is B's initial sequence number • A (CONNECT.conf) sends ACK(seq=x+1, ack=y+1) • Connection established on both sides • The sequence numbers of all segments A->B will start at x+1 • The sequence numbers of all segments B->A will start at y+1 • How are x and y chosen ?
  4. Initial sequence number • First approach • Each TCP host has a clock that increments the initial sequence number (iss) every 4 microseconds • Current approach • Each TCP host picks a random number as its initial sequence number
  5. The problem with trusted addresses B T A ACK(seq=x+1, ack=y+1) SYN+ACK(ack=x+1,seq=y) SYN(seq=x) Connection comes from Alice’s IP address don’t need to ask username and password DATA(seq=x+1, ack=y+1) Can Trudy hijack this connection ?
  6. TCP and spoofing • Can Trudy create a fake TCP connection by spoofing Alice's IP when she is away ? • Trudy can send spoofed IP packets to Bob using Alice’s address • But Trudy cannot receive the packets sent by Bob to Alice
  7. TCP and spoofing • Trudy's view of the transfer SYN+ACK(Dst=A,ack=x+1,seq=y) SYN(Src=A,seq=x) ACK(seq=x+1, ack=y+1) Data(Src=A,seq=x+1) Trudy Alice Ignored if Alice is offline Can Trudy predict y ? Bob
  8. TCP establishment SYN(Src=C,seq=x) CONNECT.ind SYN+ACK(Dest=C,ack=x+1,seq=y) ACK(Src=A,seq=x) CONNECT.req
  9. DoS attack SYN(Src=A,seq=x) CONNECT.ind CONNECT.ind SYN+ACK(Dest=A,ack=x+1,seq=y) SYN+ACK(Dest=B,ack=x+1,seq=z) SYN(Src=B,seq=x) • Attacker sends 1000s of (spoofed) SYNs
  10. Countering DoS attacks • Principle of the solution • Server should not create any state before being sure that the client can receive the segments that it sends SYN(Src=C,seq=x) SYN+ACK(Dest=C,ack=x+1,seq=y) ACK(Src=A,seq=x, ack=y+1) CONNECT.req Server does not store anything Server checks that third ACK is valid and creates state
  11. SYN Cookies SYN+ACK(ack=x+1,seq=y) SYN(seq=x) ACK(seq=x+1, ack=y+1) CONNECT.req CONNECT.ind CONNECT.conf No state created y=Hash(IPClient,PortClient,Secret) Verify that ack=1+Hash(IPClient,PortClient,Secret) State is created • Server wants to verify 3rd ack without any state How should the server select y ?
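The stateless check described on the SYN Cookies slide can be sketched as follows. This is a minimal illustration, not how any real stack implements it: the slide only says y=Hash(IPClient,PortClient,Secret), and the use of HMAC-SHA256, the `SECRET` value and both function names are assumptions made for this example. Real SYN cookies also encode information such as the client's MSS.

```python
import hashlib
import hmac

SECRET = b"server-secret"  # hypothetical key; a real server would generate it randomly


def syn_cookie(client_ip: str, client_port: int, secret: bytes = SECRET) -> int:
    """Derive the server's initial sequence number y from the client address."""
    msg = f"{client_ip}:{client_port}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")  # TCP sequence numbers are 32 bits wide


def third_ack_is_valid(client_ip: str, client_port: int, ack: int,
                       secret: bytes = SECRET) -> bool:
    """Check that ack == y + 1 (mod 2^32) without any stored per-connection state."""
    expected = (syn_cookie(client_ip, client_port, secret) + 1) % 2**32
    return ack == expected
```

The server recomputes the hash when the third segment arrives, so nothing has to be stored between the SYN and the final ACK.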
  12. TCP options Source port Destination port Payload 32 bits Checksum Urgent pointer THL Reserved Flags 20 bytes Sequence number Optional header extension Window Acknowledgement number Space in the header with new fields which can be exchanged over a connection Each TCP Option encoded as: • Type • Length • Value What is the maximum length of the TCP header including options ?
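Since each option is encoded as Type, Length, Value, the options area can be walked with a small loop. The sketch below is an illustration only; the function name is hypothetical. It assumes the standard convention that kind 0 (end of option list) and kind 1 (no-operation) are single bytes without a length field. Because the THL field is 4 bits wide and counts 32-bit words, the header cannot exceed 15 × 4 = 60 bytes, which leaves at most 40 bytes for options.

```python
def parse_tcp_options(data: bytes) -> list[tuple[int, bytes]]:
    """Return the (kind, value) pairs found in the options area of a TCP header."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:      # End of option list: stop parsing
            break
        if kind == 1:      # No-operation: one byte of padding
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]  # length covers the kind and length bytes too
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        options.append((kind, data[i + 2:i + length]))
        i += length
    return options
```

For example, the bytes `02 04 05 B4` decode to option kind 2 (Maximum Segment Size) with value 1460.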
  13. TCP options • Maximum Segment Size • Selective acknowledgements • Window Scale • Timestamps • Multipath TCP • ...
  14. Negotiating the utilization of TCP Options ACK(seq=x+1, ack=y+1) CONNECT.req CONNECT.ind SYN+ACK(ack=x+1,seq=y) Option K CONNECT.resp CONNECT.conf Initial sequence number (x) Option K proposed Initial sequence number (y) Option K accepted SYN(seq=x),Option K Connection established Option accepted Connection established The sequence numbers of all segments A->B will start at x+1 The sequence numbers of all segments B->A will start at y+1
  15. The MSS Option • A (CONNECT.req) sends SYN(seq=x),MSS=1200 : A will not accept segments longer than 1200 bytes • B (CONNECT.ind, CONNECT.resp) replies with SYN+ACK(ack=x+1,seq=y),MSS=1000 : B will never send segments longer than 1200 bytes and will not accept segments longer than 1000 bytes • A (CONNECT.conf) sends ACK(seq=x+1, ack=y+1) : A will never send segments longer than 1000 bytes • Connection established on both sides • The sequence numbers of all segments A->B will start at x+1 • The sequence numbers of all segments B->A will start at y+1 • What is the usual MSS size advertised by an Internet host today ?
  16. Agenda • TCP • Connection establishment • Reliable data transfer (more details) • Connection release • Congestion control
  17. Reliable data transfer (seq=127,"ef") (seq=123,"abcd") (seq=123,"abcd") (seq=127,"ef") (ack=123) Retransmission timer (ack=129) (ack=129) unnecessary retransmission "abcdef" Retransmission of all unacked segments “ef” placed in buffer
  18. Retransmission timer • How to compute it ? • round-trip-time may change frequently during the lifetime of a TCP connection
  19. Retransmission timer • Algorithm • timer = est_mean(rtt) + 4*est_std_dev(rtt) • est_mean(rtt) = (1-α)*est_mean(rtt) + α*rtt_measured • est_std_dev(rtt) = (1-β)*est_std_dev(rtt) + β*|rtt_measured - est_mean(rtt)|
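The timer computation on this slide can be sketched in a few lines. This is an illustration under assumptions: the class name is hypothetical, the weights 1/8 and 1/4 are the usual values mentioned in the editor's notes, and the initial deviation (half of the first measurement) and the update order (deviation before mean) follow RFC 2988. The minimum timer value that real stacks enforce is left out.

```python
class RtoEstimator:
    """Smoothed rtt estimator: timer = est_mean + 4 * est_std_dev."""

    def __init__(self, first_rtt: float, alpha: float = 1 / 8, beta: float = 1 / 4):
        self.alpha = alpha
        self.beta = beta
        self.est_mean = first_rtt
        self.est_std_dev = first_rtt / 2  # initial value suggested by RFC 2988

    def update(self, rtt_measured: float) -> float:
        """Fold one new measurement into the estimates and return the new timer."""
        # The deviation is updated first, against the previous mean.
        self.est_std_dev = ((1 - self.beta) * self.est_std_dev
                            + self.beta * abs(rtt_measured - self.est_mean))
        self.est_mean = (1 - self.alpha) * self.est_mean + self.alpha * rtt_measured
        return self.timer()

    def timer(self) -> float:
        return self.est_mean + 4 * self.est_std_dev
```

With a first measurement of 100 msec the timer starts at 300 msec and shrinks as further stable measurements reduce the estimated deviation.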
  20. RTT measurements • Solution (Karn/Partridge) • Do not measure rtt of retransmitted segments (seq=123,"abcd") (seq=120,"xyz") (ack=123) (ack=127) measured rtt Timer which is the good rtt ? (seq=123,"abcd")
  21. With Timestamp option (seq=123,TS=3, TS echo=12, "abcd") (seq=120,TS=1, TS echo=7, "xyz") (ack=123, TS=12, TS echo=1) (ack=127, TS=17, TS echo=3) measured rtt timer measured rtt (seq=123,TS=5, TS echo=12, "abcd")
  22. TCP Timestamps • Two different roles • Help in rtt measurements • Protection Against Wrapped Sequence Numbers (PAWS)
  23. Fast retransmit (seq=123,"abcd") (ack=123) (ack=123) (ack=123) (ack=123) (ack=133) (seq=123,"abcd") "abcdefghij" (seq=127,"ef") Out of sequence, in buffer (seq=129,"gh") Out of sequence, in buffer (seq=131,"ij") Out of sequence, in buffer 3 duplicates was the initial specification. Modern TCP stacks adjust the number dynamically
  24. Selective Acks (seq=123,"abcd") (seq=127,"ef") (ack=123) (seq=129,"gh") (seq=131,"ij") (ack=123,sack:127-128) (ack=123, sack:127-130) (ack=123, sack:127-132) Lost (seq=123,"abcd") (ack=133) "abcdefghij" only 123-126 must be retransmitted • Receiver reports SACK blocks • Negotiated during establishment
  25. The SACK TCP Option • Negotiated during establishment • SACK-permitted TCP option • SACK option format How many SACK blocks in one TCP segment ? Kind=5 Length Left edge 1st block Right edge 1st block Left edge last block Right edge last block SACK block
  26. Delayed acks • Sending an ack per segment is costly • Tradeoff • In sequence data segment • no ack waiting, delay by up to 50 msec • one ack waiting, send immediately • Out-of-sequence data segment • send ack immediately What is the benefit of delayed acks ?
  27. When to send data ? • When should a segment be sent ? • Option 1 • After each write system call • Option 2 • When there is a full segment of data What is the solution that you would recommend for a TCP
  28. Nagle algorithm • A new data segment can be sent if either • This is a full segment (MSS bytes) • There are no unacknowledged bytes
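The two conditions of the Nagle algorithm translate directly into a predicate. The sketch below is only an illustration; the function and parameter names are hypothetical.

```python
def nagle_can_send(pending_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if a new data segment may be sent now.

    pending_bytes: data written by the application but not yet sent
    unacked_bytes: data already sent but not yet acknowledged
    """
    if pending_bytes <= 0:
        return False  # nothing to send
    # Send if a full segment is available, or if nothing is in flight.
    return pending_bytes >= mss or unacked_bytes == 0
```

A small write is therefore sent immediately on an idle connection, but held back while earlier data is still unacknowledged.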
  29. Observed IP packets http://www.caida.org/research/traffic-analysis/pkt_size_distribution/graphs.xml
  30. Limitation of TCP flow control Source port Destination port Payload 32 bits Checksum Urgent pointer THL Reserved Flags 20 bytes Sequence number Optional header extension Window Acknowledgement number 16 bits ! What is the maximum throughput of a TCP connection with a 64KB window and a 10 msec rtt (in Mbps) ?
  31. TCP flow control • Performance is a function of window size • Throughput ~= window/rtt • TCP window : 16 bits field • RFC1323 Window scale extension • Throughput by window size and rtt • Window 8 Kbytes : 65.6 Mbps (rtt 1 msec), 6.5 Mbps (rtt 10 msec), 0.66 Mbps (rtt 100 msec) • Window 64 Kbytes : 524.3 Mbps (rtt 1 msec), 52.4 Mbps (rtt 10 msec), 5.2 Mbps (rtt 100 msec)
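The relation Throughput ~= window/rtt is easy to check numerically. The helper below is a small sketch with a hypothetical name; it assumes 1 Kbyte = 1024 bytes and 1 Mbps = 10^6 bits per second.

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one window is sent per round-trip-time."""
    return window_bytes * 8 / rtt_seconds / 1e6


# A 64 Kbytes window with a 10 msec rtt
print(round(max_throughput_mbps(64 * 1024, 0.010), 1))
```

With a 64 Kbytes window and a 10 msec rtt this gives about 52.4 Mbps, which is why the 16 bits window field limits TCP on fast or long paths.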
  32. Window scaling • Window maintained as a 32 bits integer by TCP implementations • But sent as a scaled 16 bits in segments • Scaling factor announced in WScale option in SYN/SYN+ACK segments
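Window scaling amounts to a shift on each side of the connection. The sketch below is illustrative; the function names are hypothetical, and the maximum shift of 14 is the limit set by RFC 1323.

```python
MAX_SHIFT = 14  # largest shift count allowed by RFC 1323


def advertised_field(real_window: int, shift: int) -> int:
    """Value placed in the 16 bits window field for a given real window."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("invalid window scale")
    return min(real_window >> shift, 0xFFFF)


def effective_window(field: int, shift: int) -> int:
    """Window that the peer reconstructs from the 16 bits field."""
    return field << shift
```

The low-order bits are lost in the process: a 1,000,000 byte window advertised with a shift of 7 is seen by the peer as 999,936 bytes.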
  33. Agenda • TCP • Connection establishment • Reliable data transfer • Connection release (more details) • Congestion control
  34. Connection release FIN(seq=x) DISCONNECT.req (A-B) DISCONNECT.ind(A-B) ACK(ack=x+1) DISCONNECT.conf(A-B) ACK(ack=y+1) DISCONNECT.conf(A-B) DISCONNECT.req(B-A) DISCONNECT.ind(B-A) FIN(seq=y) Time WAIT Maintain state for this connection during twice MSL to be able to retransmit ACK if a segment is received from the other entity outgoing connection closed incoming connection closed incoming connection closed outgoing connection closed State can be removed Last sent data : x-1 Last sent data : y-1 Sent only after all data up to x has been received
  35. TCP connection release FIN Wait1 SYN RCVD CLOSE Wait Established FIN Wait2 LAST-ACK TIME Wait Closing Closed ?FIN/!ACK !FIN ?ACK Timeout[2MSL] ?FIN/!ACK ?ACK !FIN ?ACK ?FIN/!ACK !FIN
  36. Agenda • TCP • Congestion control • AIMD in TCP • Explicit Congestion Notification • Modern TCP congestion control
  37. The congestion problem Bottleneck link N Mbps How to efficiently share the bandwidth ?
  38. Delay versus bandwidth ∑Bwi observed rtt min rtt buffer almost overloaded huge packet losses
  39. Router buffers • Rule of thumb • Routers should have RTT * C of buffers • RTT is average rtt for flows • C is bandwidth of output link • Backbone routers • $\mathit{Buffer} \geq \frac{RTT \times C}{\sqrt{N}}$, where N is the number of flows sharing the link • N. McKeown, G. Appenzeller, I. Keslassy, Sizing Router Buffers (Redux), SIGCOMM CCR, October 2019
  40. Congestion signals • Main types of congestion signals • Packet loss • most popular signal • Explicit Congestion Notification • requires router cooperation • Increase in measured round-trip-time • can be fragile
  41. Additive Increase • No congestion ? • All acks move window • Additive increase • Increment cwnd by one MSS every rtt Cwnd Time
  42. Faster increase • How to speed up the growth of the congestion window at connection startup ? • Slow-start • Double cwnd every rtt Cwnd Slow-start exponential increase of cwnd Time Max window
  43. Multiplicative decrease • How to detect congestion ? • Three duplicate acks • mild congestion for TCP • cwnd/2 and restart additive increase • Expiration of retransmission timer • severe congestion • Reset cwnd at 1 MSS • Perform slow-start until half previous cwnd and then continue with congestion avoidance
  44. Cwnd Fast retransmit Threshold Threshold Slow-start exponential increase of cwnd Congestion avoidance linear increase of cwnd Fast retransmit Mild congestion
  45. Severe congestion Cwnd Time Timer expiration Threshold Timer expiration Threshold Slow-start exponential increase of cwnd Congestion avoidance linear increase of cwnd
  46. AIMD in TCP
      # Initialisation
      cwnd = MSS
      ssthresh = swin
      dupacks = 0
      # Ack arrival (assumes no delayed acks)
      if tcp.ack > snd.una:  # new ack, no congestion
          if dupacks == 0:  # not currently recovering from loss
              if cwnd < ssthresh:
                  # slow-start : increase cwnd quickly
                  # double cwnd every rtt
                  cwnd = cwnd + MSS
              else:
                  # congestion avoidance : increase cwnd slowly
                  # increase cwnd by one MSS every rtt
                  cwnd = cwnd + MSS*(MSS/cwnd)
          else:  # recovering from loss
              cwnd = ssthresh  # deflate cwnd rfc5681
              dupacks = 0
  47. Slow-start cwnd=1000 ssthresh=64000 0-999 ACK(1000) 1000-1999 2000-2999 ACK(2000) ACK(3000) cwnd=2000 ssthresh=64000 cwnd=3000 ssthresh=64000 cwnd=4000 ssthresh=64000 3000-3999 4000-4999 5000-5999 6000-6999 ACK(4000 ) ACK(7000 ) cwnd ?
  48. Congestion avoidance cwnd=2000 ssthresh=4000 1000-1999 2000-2999 ACK(2000) ACK(3000) cwnd=3000 ssthresh=4000 cwnd=4000 ssthresh=4000 cwnd=4000+(1000*1000/4000)=4250 3000-3999 4000-4999 5000-5999 6000-6999 ACK(4000 ) ACK(7000 ) cwnd=4250+(1000*1000/4250)=4485 cwnd=4485+(1000*1000/4485)=4707 cwnd=4707+(1000*1000/4707)=4919 cwnd ?
  49. AIMD in TCP
      else:  # duplicate or old ack
          if tcp.ack == snd.una:  # duplicate acknowledgment
              dupacks++
              if dupacks == 1 or dupacks == 2:
                  send_next_unacked_segment  # rfc3042
              if dupacks == 3:
                  retransmitsegment(snd.una)
                  ssthresh = max(cwnd/2, 2*MSS)
                  cwnd = ssthresh
              if dupacks > 3:  # rfc5681
                  cwnd = cwnd + MSS  # inflate cwnd
          else:
              # ack for old segment, ignored
      Expiration of the retransmission timer:
          send(snd.una)  # retransmit first lost segment
          ssthresh = max(cwnd/2, 2*MSS)
          cwnd = MSS
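The AIMD pseudocode of the two "AIMD in TCP" slides can be turned into a small runnable model. This is a sketch only: the class and method names are hypothetical, it tracks cwnd, ssthresh and dupacks but does not send or retransmit any segment, and resetting dupacks when the timer expires is an assumption taken from the "TCP and losses" traces rather than from the pseudocode.

```python
class AimdSender:
    """Window bookkeeping of the AIMD pseudocode (no actual transmission)."""

    def __init__(self, mss: int = 1000, swin: int = 64000):
        self.mss = mss
        self.cwnd = mss          # start with one segment
        self.ssthresh = swin     # sending window
        self.dupacks = 0
        self.snd_una = 0         # oldest unacknowledged sequence number

    def on_ack(self, ack: int) -> None:
        if ack > self.snd_una:               # new ack, no congestion
            self.snd_una = ack
            if self.dupacks == 0:
                if self.cwnd < self.ssthresh:
                    self.cwnd += self.mss    # slow-start
                else:                        # congestion avoidance
                    self.cwnd += self.mss * self.mss / self.cwnd
            else:                            # leaving loss recovery
                self.cwnd = self.ssthresh    # deflate cwnd
                self.dupacks = 0
        elif ack == self.snd_una:            # duplicate acknowledgment
            self.dupacks += 1
            if self.dupacks == 3:            # fast retransmit would happen here
                self.ssthresh = max(self.cwnd / 2, 2 * self.mss)
                self.cwnd = self.ssthresh
            elif self.dupacks > 3:
                self.cwnd += self.mss        # inflate cwnd
        # acks below snd_una refer to old segments and are ignored

    def on_timeout(self) -> None:
        """Expiration of the retransmission timer: severe congestion."""
        self.ssthresh = max(self.cwnd / 2, 2 * self.mss)
        self.cwnd = self.mss
        self.dupacks = 0  # assumption, matches the trace slides
```

Feeding it ACK(1000), ACK(2000) and ACK(3000) grows cwnd from 1000 to 4000 bytes, as in the slow-start trace; three duplicates of ACK(3000) then set both ssthresh and cwnd to 2000.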
  50. TCP and losses cwnd=1000 0-999 ACK(1000) 1000-1999 2000-2999 ACK(2000) cwnd=2000 cwnd=3000 3000-3999 4000-4999 ACK(2000 ) ACK(2000 ) cwnd=3000 dupack=1 cwnd=3000 dupack=2 write(5000 bytes)
  51. TCP and losses cwnd=3000 dupack=2 2000-2999 RTO expires cwnd=1000 dupack=0 ssthresh=2000 ACK(5000 ) cwnd=2000 dupack=0
  52. TCP and losses cwnd=1000 ssthresh=64000 0-999 ACK(1000) 1000-1999 2000-2999 ACK(2000) ACK(3000) cwnd=2000 ssthresh=64000 cwnd=3000 ssthresh=64000 cwnd=4000 ssthresh=64000 3000-3999 4000-4999 5000-5999 6000-6999 write(10000 bytes)
  53. TCP and losses cwnd=4000 3000-3999 cwnd=2000 dupack=0 ssthresh=2000 ACK(9000) 7000-7999 8000-8999 ACK(3000 ) ACK(3000 ) ACK(3000 ) dupack=1 dupack=2 dupack=3 9000-9999 ACK(10000)
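  54. Simplified model • Assume that segment losses are periodic and that one segment out of every 1/p is lost • Cwnd (in segments) follows a sawtooth between W/2 and W, with one loss every W/2 rtt; the surface under the curve is the number of segments sent during one cycle • It can be shown that the throughput of a TCP connection can be approximated by a formula (not recovered here) whose two terms are labelled : maximum throughput without losses, and throughput with losses/congestion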
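The sawtooth of this simplified model leads to the well-known square-root approximation. The derivation below is a sketch under the slide's assumptions (periodic losses, cwnd growing by one segment per rtt from W/2 to W); the symbol $S$, the number of segments sent during one cycle, is introduced here for illustration.

```latex
\begin{align*}
S &= \underbrace{\frac{W}{2}\cdot\frac{W}{2}}_{\text{rectangle}}
   + \underbrace{\frac{1}{2}\cdot\frac{W}{2}\cdot\frac{W}{2}}_{\text{triangle}}
   = \frac{3W^2}{8} \\
S &= \frac{1}{p} \;\Rightarrow\; W = \sqrt{\frac{8}{3p}} \\
\text{Throughput} &= \frac{S \times MSS}{\frac{W}{2} \times rtt}
   = \frac{3W}{4}\cdot\frac{MSS}{rtt}
   = \sqrt{\frac{3}{2}}\cdot\frac{MSS}{rtt \times \sqrt{p}}
\end{align*}
```

Without losses the connection is limited by its window, roughly window/rtt, so under these assumptions the achievable throughput is the smaller of that value and the loss-limited value above.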
  55. Tuning TCP @google • Objectives • Minimize time to receive result from search engine • HTTP GET fits a single segment • HTTP Response in <16 KBytes
  56. Initial retransmission timer • What happens if SYN or SYN+ACK is lost ?
  57. Initial congestion window • What is the impact of initial congestion window and the slow-start on the time to receive an HTTP response ?
  58. Initial TCP congestion window today Source: M. Bagnulo, tcpmp mailing list, Nov 16th, 2016
  59. TCP Fast Open • Can we reduce the overhead of the three-way handshake ? • HTTP/1.1 • Putting data inside SYN and SYN+ACK
  60. TCP Fast Open • Is this safe ? • Risk of denial of service attack SYN(Src=C,seq=x, HTTP GET) CONNECT.ind+HTTP GET SYN+ACK(Dest=C,ack=x+1,seq=y, HTTP Resp) CONNECT.req+Data ACK(Src=A,seq=x) Is this safe ?
  61. Safe TCP Fast Open • How to make TCP Fast Open safe in the presence of attackers ? • Server needs to ensure that SYN segment does not come from an attacker who sent a spoofed packet
  62. Agenda • TCP • Congestion control • AIMD in TCP • Explicit Congestion Notification • Modern Congestion Control
  63. Basic ECN • Issues • What happens if the returning ECN-echo is lost ? • How can we deploy ECN? R1 R2 A D Congestion Notification Mark the IP packet that caused congestion by setting one bit flag (CE: Congestion Experienced) TCP source behaviour Upon reception of a ECN-Echo=1 TCP ack, behave as if the corresponding segment was lost (perform congestion avoidance). TCP destination behaviour Upon reception of a CE=1 IP packet indicate the congestion to the source by setting a special flag (ECN-Echo) in the returning TCP ack
  64. Dealing with lost acks R1 R2 A D TCP receiver behavior Upon reception of a CE=1 IP packet indicate the congestion to the source by setting a special flag (ECN-Echo) in all returning TCP acks until a TCP TPDU with CWR set is received TCP sender behavior Upon reception of a ECN-Echo=1 TCP ack, perform congestion avoidance and set CWR flag in next TCP PDU
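The receiver behaviour that protects against lost acks is a one-bit state machine: keep echoing until the sender confirms with CWR. The sketch below is illustrative; the class and method names are hypothetical.

```python
class EcnReceiver:
    """Decides the ECN-Echo flag of each returning TCP ack."""

    def __init__(self) -> None:
        self.echo = False  # True while congestion must still be reported

    def on_segment(self, ce: bool, cwr: bool) -> bool:
        """Process one arriving segment and return ECN-Echo for its ack.

        ce:  the IP packet carried the Congestion Experienced mark
        cwr: the TCP header had the CWR flag set
        """
        if cwr:
            self.echo = False  # sender has reacted, stop echoing
        if ce:
            self.echo = True   # new congestion mark, (re)start echoing
        return self.echo
```

Because ECN-Echo stays set on every ack until CWR arrives, losing one ack does not hide the congestion signal from the sender.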
  65. Deploying ECN • On endhosts • Update the TCP stack to support ECN • Negotiate ECN usage in SYN • Encode ECN info in packets/segments • For other transport protocols…
  66. Deploying ECN • On routers • Routers need to distinguish between • ECN-capable hosts that react to ECN • If congestion, such packets are marked • Other hosts that do not react to ECN • If congestion, such packets are dropped
  67. ECN support on routers • Specialised buffer acceptance algorithms R1 R2 A D In case of congestion If ECT bit is set Mark the IP packet that caused congestion by setting one bit flag (CE: Congestion Experienced) If ECT bit is not set Discard the IP packet that caused congestion ECN-capable source If destination is also ECN capable Set ECT bit in all IP packets towards destination Otherwise Reset ECT bit
  68. Agenda • TCP • Congestion control • AIMD in TCP • Explicit Congestion Notification • Modern Congestion Control
  69. Issues with AIMD • Performance on high bandwidth*delay links • Each loss forces TCP in congestion avoidance and grows slowly • Bufferbloat • TCP AIMD tries to saturate buffers until it causes congestion • Inflates round-trip-times • Fairness • TCP sources with a lower rtt are favored
  70. TCP Congestion Controls • Supposed to be fair • MSS size • rtt • Many congestion control schemes • Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
  71. CUBIC • A modern congestion controller designed for high bandwidth*delay product links • Default on Linux • Principles • Use concave and convex profiles of cubic function to increase cwnd • CUBIC behaves like AIMD with small rtt/bw • CUBIC provides linear bw sharing among flows with different rtt
  72. CUBIC • Congestion window increase during congestion avoidance • $cwnd = cwnd_{max} + C \times \left(\Delta - \sqrt[3]{\frac{cwnd_{max} \times (1-\beta)}{C}}\right)^3$ • Packet loss: $cwnd = \beta \times cwnd$ • Parameters : $\beta = 0.7$, $C = 0.4$ • Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf and RFC8312
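The CUBIC window function can be evaluated directly. The sketch below follows the form given in RFC8312, with cwnd counted in segments and Δ, the time since the last loss, in seconds; the function name is hypothetical and the default parameters are the values β = 0.7 and C = 0.4 from the slide.

```python
def cubic_window(delta: float, cwnd_max: float,
                 c: float = 0.4, beta: float = 0.7) -> float:
    """Congestion window (in segments) delta seconds after the last loss."""
    # k is the time at which the window climbs back to cwnd_max.
    k = (cwnd_max * (1 - beta) / c) ** (1 / 3)
    return cwnd_max + c * (delta - k) ** 3
```

Right after a loss (Δ = 0) the function returns β × cwnd_max; it then grows along the concave part of the curve, flattens around cwnd_max, and probes beyond it along the convex part.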
  73. Bottleneck Bandwidth and Round-Trip-Time (BBR) • Recent congestion control scheme that aims at achieving high throughput and low delay • Operates in four phases • Startup (similar to slow-start, until the measured rate stops increasing) • Drain (empty the queues, send at 0.75 rate) • compute rttmin over last 10 seconds • Probe bandwidth every 8 rtt (send at 1.25 rate for one rtt and then at 0.75 rate) • Probe RTT (reduce rate for more precise rttmin)
  74. Reno, CUBIC, BBR Cwnd ssthresh Slow-start exponential increase of cwnd Congestion avoidance linear increase of cwnd
  75. Reno, CUBIC, BBR Cwnd Slow-start exponential increase of cwnd Congestion avoidance ssthresh W_max
  76. Reno, CUBIC, BBR Cwnd Startup Congestion avoidance Fast retransmit
  77. Delay-based techniques • TCP Vegas : intuition • Expected rate = $\frac{cwnd}{rtt_{min}}$ • Estimate buffer size at bottleneck every rtt : $\Delta = cwnd \times \frac{rtt_i - rtt_{min}}{rtt_i}$ • if Δ > 4 => cwnd=cwnd-MSS // congestion • if Δ < 2 => cwnd=cwnd+MSS // increase
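The Vegas rule can be sketched as a per-rtt update. This is an illustration under assumptions: it takes Δ to be the usual Vegas estimate of queued data, cwnd × (rtt_i − rtt_min)/rtt_i, counts cwnd in segments so that the thresholds 2 and 4 are numbers of segments, and uses a hypothetical function name.

```python
def vegas_update(cwnd: float, rtt_i: float, rtt_min: float,
                 low: float = 2, high: float = 4) -> float:
    """Return the new cwnd (in segments) after one rtt measurement."""
    # Estimated number of this flow's segments sitting in the bottleneck queue.
    delta = cwnd * (rtt_i - rtt_min) / rtt_i
    if delta > high:
        return cwnd - 1   # queue is building up: congestion
    if delta < low:
        return cwnd + 1   # queue almost empty: increase
    return cwnd           # between the thresholds: keep cwnd
```

Unlike loss-based AIMD, the window stops growing as soon as a few segments are queued, which keeps the round-trip-time close to its minimum.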
  78. CUBIC, Vegas and BBR • Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf and RFC8312
  79. Two TCP connections Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
  80. Two TCP connections, different rtt Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
  81. Two different congestion controllers Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf

Editor's Notes

  1. MSL in IP networks : 120 seconds
  2. Most TCP implementations today have fixes for those problems. We will discuss them later.
  3. Most TCP implementations today have fixes for those problems. We will discuss them later.
  4. Using a hash function to compute the value of the initial sequence number is usually called a SYN cookie. In practice, the computation of the SYN cookie is slightly more complex than a simple hash function because the server must also remember inside the cookie the following information : - the MSS value advertised by the client - the optional utilization of TCP options such as RFC1323 large windows or timestamps or SACK by the sender. The original discussions that led to the development of the SYN cookie solution may be found in : http://cr.yp.to/syncookies/archive
  5. Urgent pointer is rarely used and will not be described. The THL is indicated in blocks of 32 bits. The TCP header may contain options, these will be discussed later.
  6. MSL in IP networks : 120 seconds
  7. MSL in IP networks : 120 seconds
  8. The computation of TCP’s retransmission timer is described in RFC2988 Computing TCP's Retransmission Timer. V. Paxson, M. Allman. November 2000. Usual values for alpha and beta are 1/8 and 1/4.
  9. See P. Karn, C. Partridge, Improving round-trip time estimates in reliable transport protocols, Proc. ACM SIGCOMM87, August 1987
  10. TCP timestamps were introduced in : RFC1323 TCP Extensions for High Performance. V. Jacobson, R. Braden, D. Borman. May 1992. The use of these timestamps is negotiated during the establishment of the TCP connection. Most current TCP implementations support these extensions.
  11. See e.g. RFC2001 TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms. W. Stevens. January 1997.
  12. RFC2018 TCP Selective Acknowledgement Options. M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. October 1996.
  13. RFC2018 TCP Selective Acknowledgement Options. M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. October 1996.
  14. Urgent pointer is rarely used and will not be described. The THL is indicated in blocks of 32 bits. The TCP header may contain options, these will be discussed later.
  15. Some heavily loaded web servers use abrupt release to close their connections, to avoid maintaining state for 2*MSL seconds.
  16. More detailed models can be found in the scientific literature : M. Mathis,J. Semke, J. Mahdavi and T. Ott, The macroscopic behaviour of the TCP congestion avoidance algorithm, ACM Computer Communication Review, 1997