# Course on TCP Dynamic Performance

A short but packed course on TCP Dynamic Behavior. It starts by explaining TCP from scratch so the dynamic parts can be understood. Then it dives deep into how TCP behaves in real IP networks in the face of packet losses, delays and other phenomena.

Published in: Internet, Technology, Education

### Course on TCP Dynamic Performance

1. 1. Javier Araúz TCP Dynamic Behavior
2. 2. TCP: The Basics. Transmission Control Protocol, invented in 1974 by Cerf & Kahn. Provides connection-oriented, reliable, causal (in-order) octet delivery between hosts in the Internet, at the expense of potentially long delays and low throughput. One of the four possible choices to exchange data on top of IP.
4. 4. TCP: The Basics. The TCP Sliding Window (the LEN value is inferred)
    - Alice → Bob: SEQ=x, LEN=dA
    - Bob → Alice: SEQ=y, ACK=x+dA, LEN=dB
    - Alice → Bob: SEQ=x+dA, ACK=y+dB, LEN=dA
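
The sequence/acknowledgement arithmetic of the sliding-window slide can be sketched in a few lines. This is only an illustration: the `Segment` class, the `next_ack` helper and the numeric values of x, y, dA and dB are hypothetical and not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    seq: int                   # sequence number of the first octet carried
    length: int                # payload size in octets (LEN)
    ack: Optional[int] = None  # next octet expected from the peer, if any


def next_ack(received: Segment) -> int:
    """ACK value that acknowledges every octet carried by `received`."""
    return received.seq + received.length


# Hypothetical values for the symbols used on the slide.
x, y, dA, dB = 1000, 5000, 100, 40

first = Segment(seq=x, length=dA)                        # Alice -> Bob
reply = Segment(seq=y, length=dB, ack=next_ack(first))   # Bob -> Alice
third = Segment(seq=next_ack(first), length=dA,          # Alice -> Bob
                ack=next_ack(reply))
```

Each side's ACK is simply the peer's SEQ plus the peer's LEN, which is why LEN never needs to travel in the header.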
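5. 5. TCP: The Basics. Connection establishment
    - Alice → Bob: SYN, SEQ=x, LEN=0
    - Bob → Alice: SYN, SEQ=y, ACK=x+1, LEN=0
    - Alice → Bob: SEQ=x+1, ACK=y+1, LEN=dA
    - The presence of ACK allows Bob to tell this message apart from the initial one. A SYN consumes one sequence number, so each side acknowledges the peer's initial sequence number plus one.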
6. 6. TCP: The Basics. Connection release - ordered (Alice, Bob). Segments shown, in slide order: FIN, SEQ=x, LEN=0; FIN, SEQ=y, ACK=x+1, LEN=0; possibly k more octets; FIN, SEQ=y+k, ACK=x+1, LEN=0; FIN, SEQ=x, ACK=y+k+1, LEN=0. Annotations: the additional 1 allows Bob to tell this message apart from a spontaneous FIN from Alice; the additional 1 allows Alice to tell this message apart from a re-transmission of the initial FIN from Bob.
7. 7. TCP: The Basics Connection release - unilateral Alice Bob RST, SEQ=y, LEN=0
8. 8. TCP: The Basics. Segment exchange, Nagle algorithm (Alice, Bob). Segments shown, in slide order: SEQ=x, ACK=y, LEN=dA; SEQ=x+dA, ACK=y, LEN=dA; SEQ=x+2dA, ACK=y, LEN=dA/3; time-out; SEQ=y, ACK=x+dA, LEN=0. Nagle criterion #1: send when a complete segment of dA octets is available. Nagle criterion #3: send when the time-out happens.
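
A minimal sketch of the send decision described on the Nagle slide. Criteria #1 and #3 come from the slide. The slide does not show criterion #2, so the middle rule below (send when nothing is awaiting acknowledgement) is an assumption taken from the classic description of Nagle's algorithm. The function name and parameters are hypothetical.

```python
def nagle_should_send(buffered: int, mss: int, unacked: int,
                      timer_expired: bool) -> bool:
    """Decide whether the sender may emit a segment now.

    buffered      -- octets waiting in the send buffer
    mss           -- size of a complete segment (dA on the slide)
    unacked       -- octets sent but not yet acknowledged
    timer_expired -- True once the time-out of criterion #3 has fired
    """
    if buffered <= 0:
        return False        # nothing to send
    if buffered >= mss:
        return True         # criterion #1: a complete segment is available
    if unacked == 0:
        return True         # assumed rule: no data in flight, send at once
    return timer_expired    # criterion #3: send the partial segment on time-out
```

With 400 octets buffered, an MSS of 1460 and data still in flight, the function holds the segment back until the timer fires, which matches the dA/3 segment on the slide.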
9. 9. TCP: The Basics. Segment exchange, delayed ACK (Alice, Bob). Segments shown, in slide order: SEQ=x, ACK=y, LEN=dA; SEQ=y, ACK=x+dA, LEN=dB; SEQ=x+dA, ACK=y+dB, LEN=dA; SEQ=x+2dA, ACK=y+dB, LEN=dA; SEQ=x+3dA, ACK=y+dB, LEN=dA; SEQ=y+dB, ACK=x+3dA, LEN=dB; SEQ=y+2dB, ACK=x+4dA, LEN=dB. Annotations: typically >200ms (twice); a single ACK acknowledges multiple segments; one ACK for every two full segments.
10. 10. TCP: The Basics. Segment exchange, loss (Alice, Bob). Segments shown, in slide order: SEQ=x, ACK=y, LEN=dA; SEQ=x+dA, ACK=y, LEN=dA; SEQ=x+2dA, ACK=y+dB, LEN=dA; SEQ=y, ACK=x+dA, LEN=dB; SEQ=x, ACK=y, LEN=dA; SEQ=y+dB, ACK=x+2dA, LEN=0; SEQ=x+dA, ACK=y, LEN=dA. Annotation: RTO.
11. 11. TCP: The Basics. Segment exchange, loss with fast recovery (Alice, Bob). Segments shown, in slide order: SEQ=x, ACK=y, LEN=dA; SEQ=y, ACK=x+dA, LEN=dB; SEQ=x+dA, ACK=y, LEN=dA; SEQ=x+3dA, ACK=y+dB, LEN=dA; SEQ=x+dA, ACK=y+2dB, LEN=dA; SEQ=y+dB, ACK=x+dA, LEN=dB; SEQ=y+dB, ACK=x+3dA, LEN=0; SEQ=x+4dA, ACK=y+2dB, LEN=dA; SEQ=x+2dA, ACK=y, LEN=dA. Annotations: four ":O" markers.
12. 12. TCP: The Basics. Flow Control (Alice, Bob). Segments shown, in slide order: SEQ=y, ACK=x+dA, LEN=dB, W=4dA; SEQ=x+dA, ACK=y+dB, LEN=dA, W=4dB; SEQ=x+2dA, ACK=y+2dB, LEN=dA, W=3dB; SEQ=y+dB, ACK=x+2dA, LEN=dB, W=4dA; SEQ=x, ACK=y, LEN=dA, W=5dB; SEQ=y+2dB, ACK=x+3dA, LEN=0, W=3dA.
13. 13. TCP: The Basics. Flow Control: breaking deadlocks (Alice, Bob). Segments shown, in slide order: SEQ=y, ACK=x+dA, LEN=0, W=4dA; SEQ=x+dA, ACK=y, LEN=dA, W=5dB; SEQ=x+2dA, ACK=y, LEN=dA, W=5dB; SEQ=y, ACK=x+2dA, LEN=dB, W=3dA; SEQ=x, ACK=y, LEN=dA, W=5dB; SEQ=x+3dA, ACK=y, LEN=dA, W=5dB; SEQ=x+4dA, ACK=y, LEN=dA, W=5dB; SEQ=y+dB, ACK=x+3dA, LEN=0, W=2dA; SEQ=y+dB, ACK=x+4dA, LEN=dB, W=dA; SEQ=y+2dB, ACK=x+5dA, LEN=0, W=0 (!!!); SEQ=y+2dB, ACK=x+5dA, LEN=0, W=1.
14. 14. TCP: Congestion Control Congestion and how it shows up: •Congestion is the situation in which a router within the IP network is not able to route all the traffic offered to it •A router is congested when one or more of its ingress queues are full •Congestion manifests at end hosts in the form of one or more lost TCP segments, which is known as a "congestion event"
15. 15. TCP: Congestion Control How does TCP react to a congestion event? •A congestion event may take one of two shapes: 1. A burst loss, detected when no ACK for the sent bytes arrives within RTO seconds of sending them 2. A single loss, detected through duplicate ACKs (ACKs triggered by bytes sent later than a not-yet-acknowledged byte) •On detection of a congestion event at a sender, TCP throttles down its sending rate by decreasing its send window size (represented as W), entering a state known as Congestion Avoidance mode •While in Congestion Avoidance mode, the sending window is known as the congestion window and its size is represented as cWnd
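
The two shapes of congestion event can be told apart with a small classifier. This is a sketch under assumptions: the single-loss case is recognised by counting duplicate ACKs, and the threshold of three is the value commonly used for fast retransmit, not something stated on the slide. The function name and the returned labels are hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # assumption: usual fast-retransmit threshold


def classify_congestion_event(dup_acks: int, seconds_since_send: float,
                              rto: float) -> str:
    """Return 'burst loss', 'single loss' or 'none' for the oldest unacked byte."""
    if seconds_since_send >= rto:
        # No ACK arrived within RTO seconds: treated as a burst loss.
        return "burst loss"
    if dup_acks >= DUP_ACK_THRESHOLD:
        # Later bytes keep arriving but an earlier one is missing.
        return "single loss"
    return "none"
```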
16. 16. TCP: Congestion Control Congestion Avoidance mode in TCP-Reno •On a congestion event, TCP-Reno slashes its sending window as follows: •While in congestion avoidance mode, TCP-Reno increases its congestion window by n segments for each n acknowledged segments: •Additionally, in the face of a burst loss, TCP-Reno shuts down its sending window and starts a slow-start phase until it reaches the congestion window size:
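
The formulas of this slide did not survive in the transcript. As a stand-in, the sketch below uses the textbook Reno rules, which is an assumption rather than the author's exact expressions: halve the window on a loss, restart from one segment after a burst loss, double per round trip during slow start, and add one segment per round trip in congestion avoidance. The function name, the event labels and the floor of two segments are hypothetical.

```python
def reno_step(cwnd: float, ssthresh: float, event: str):
    """Advance the window by one round trip.

    event is 'ack' (all segments acknowledged), 'single_loss' or 'burst_loss'.
    Returns the new (cwnd, ssthresh), both in segments.
    """
    if event == "single_loss":
        ssthresh = max(cwnd / 2, 2)   # multiplicative decrease
        return ssthresh, ssthresh     # continue in congestion avoidance
    if event == "burst_loss":
        ssthresh = max(cwnd / 2, 2)
        return 1, ssthresh            # window shut down, slow start follows
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh  # slow start
    return cwnd + 1, ssthresh         # additive increase


# Example: a burst loss at cwnd=20, followed by loss-free round trips.
cwnd, ssthresh = reno_step(20, 64, "burst_loss")
history = [cwnd]
for _ in range(6):
    cwnd, ssthresh = reno_step(cwnd, ssthresh, "ack")
    history.append(cwnd)
```

The example shows the window climbing exponentially back to the halved value and linearly afterwards.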
17. 17. Congestion Avoidance mode in TCP-Reno TCP: Congestion Control
18. 18. TCP: Congestion Control Congestion Avoidance mode in TCP-Cubic:
19. 19. TCP: Congestion Control Congestion Avoidance mode in TCP-Vegas:
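20. 20. TCP: Congestion Control Other CC algorithms: •TCP-Ledbat •TCP-Ericsson-Akamai To know which CC algorithm your box is running: cat /proc/sys/net/ipv4/tcp_congestion_control (prints, for example, reno) To change the default CC algorithm of your box: •edit /boot/config-x.y.zz-generic •change CONFIG_DEFAULT_TCP_CONG (a build-time setting, so it only takes effect in a kernel built with that configuration) On a running system the algorithm can also be switched with sysctl -w net.ipv4.tcp_congestion_control=<name>, provided the corresponding algorithm is available in the kernel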
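
The same check can be done from a script. The sketch below reads the proc file named on the slide and returns nothing on systems where it does not exist (for example, non-Linux hosts). The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional

CC_PATH = "/proc/sys/net/ipv4/tcp_congestion_control"


def current_cc_algorithm(path: str = CC_PATH) -> Optional[str]:
    """Return the active congestion control algorithm, or None if unavailable."""
    proc_file = Path(path)
    if not proc_file.is_file():
        return None
    return proc_file.read_text().strip()


if __name__ == "__main__":
    print(current_cc_algorithm())
```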
21. 21. TCP: End host charact. Main parameters affecting host performance:

    | sysctl name (net.ipv4) | meaning | default | explanation |
    |---|---|---|---|
    | tcp_low_latency | Nagle algorithm status | 0 | Nagle enabled |
    | tcp_window_scaling | Window scaling status | 1 | Window scaling enabled |
    | tcp_adv_win_scale | Window scaling factor | 2 | W = w2 |
    | tcp_wmem | Sending buffer size | 4KB 85KB* 170KB** | min/default/max |
    | tcp_moderate_rcvbuf | Auto-tuning (2.6.7 and later) | 1*** | Auto-tuning enabled |

    - (*) tcp_wmem overrides /proc/sys/net/core/wmem_default
    - (**) tcp_wmem is overridden by /proc/sys/net/core/wmem_max
    - (***) changing buffer sizes with setsockopt() disables auto-tuning!
    - Browse with (requires super-user privileges): sysctl -a | fgrep net.ipv4.tcp
22. 22. TCP: End host charact. Measuring end-host performance:

        > netstat -natuw
        Active Internet connections (servers and established)
        Proto Recv-Q Send-Q Local Address           Foreign Address         State
        tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
        tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
        tcp        0      0 127.0.0.1:2207          0.0.0.0:*               LISTEN
        tcp        0      0 127.0.0.1:36285         127.0.0.1:12865         TIME_WAIT
        tcp        0      0 10.0.0.5:37322          10.0.0.4:33932          TIME_WAIT
        tcp        0      1 10.0.0.5:55351          10.0.0.4:33932          SYN_SENT
        tcp        0      1 10.0.0.5:55350          10.0.0.4:33932          LAST_ACK
        tcp        0      0 10.0.0.5:64093          10.0.0.4:33932          TIME_WAIT
        tcp        0      0 10.0.0.5:35122          10.0.0.4:12865          ESTABLISHED
        tcp        0      0 10.0.0.5:17318          10.0.0.4: 0 0 :::22 :::* LISTEN
23. 23. TCP: IP Network charact. Measuring end-to-end performance:

        netserver -p <portnum>
        netperf -p <portnum> -l <secs> -H <target> -t TCP_RR -r <reqsize,rspsize> -s <sndbuf,rcvbuf> -S <sndbuf,rcvbuf> -C -c
24. 24. TCP: IP Network charact. How Sprint characterized its network:
25. 25. TCP: IP Network charact.
26. 26. TCP: IP Router behavior IP router simplified model: switch fabric Routing logic Line cards Line cards
27. 27. TCP: IP Router behavior Queue management strategies: Passive: drop every incoming packet when the ingress queue is full. The router can't choose, so the Red flow gets a larger share than Green and Blue. Active: drop incoming packets selectively when the ingress queue length grows above a threshold. The router can choose, and thus shares queue space more evenly.
28. 28. TCP: IP Router behavior Active Queue Management flavors: Random Early Discard (RED), with a drop probability distribution defined between minThres and maxThres (figure labels: minThres, maxThres, 1); RED mitigates tail-drops but is unfair. Weighted Fair Queueing (WFQ), based on token buckets; WFQ mitigates tail-drops and is fair.
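
A rough sketch of a RED-style drop probability, built on the minThres and maxThres thresholds named on the slide. The linear ramp and the `max_p` ceiling follow the classic RED description and are assumptions here, as are the function name and the default value of `max_p`.

```python
def red_drop_probability(avg_queue: float, min_thres: float,
                         max_thres: float, max_p: float = 0.1) -> float:
    """Probability of dropping an arriving packet for a given average queue length."""
    if avg_queue < min_thres:
        return 0.0   # queue is short: never drop
    if avg_queue >= max_thres:
        return 1.0   # queue is too long: behaves like tail drop
    # Between the thresholds the probability grows linearly up to max_p.
    return max_p * (avg_queue - min_thres) / (max_thres - min_thres)
```

Because drops start before the queue is full and hit packets at random, flows tend to back off at different times instead of all at once.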
29. 29. TCP: IP Router behavior Packet drop probability distributions Passive RED WFQ Ideal
30. 30. TCP: Dynamic Performance Bandwidth-delay product: physical analogy. [Figure labels: h, l, 2r, P lpm @ V mps, W mps] $\pi r^2 h \ge \frac{V P l}{W} + P l = \left(1 + \frac{V}{W}\right) P l$. If $V = W$ then $\pi r^2 h \ge 2 P l$.
31. 31. TCP: Dynamic Performance Bandwidth-delay product: a TCP invariant. [Figure labels: BW bps, T s, W] $W > 2 \cdot BW \cdot T = BW \cdot D$, where $D = 2T$ is the round-trip delay.
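
The invariant translates directly into a minimum window size. The sketch below assumes BW is given in bits per second, T is the one-way delay in seconds, and the result is wanted in bytes. The function name is hypothetical.

```python
def min_window_bytes(bandwidth_bps: float, one_way_delay_s: float) -> float:
    """Smallest window (in bytes) that keeps a path of the given BW and T full."""
    round_trip_s = 2 * one_way_delay_s       # D = 2T
    return bandwidth_bps * round_trip_s / 8  # BW * D, converted from bits to bytes


# Example: a 100 Mbps path with 25 ms one-way delay.
example_window = min_window_bytes(100e6, 0.025)
```

For this example the window must exceed roughly 625,000 bytes, about ten times what a 64KB window allows.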
32. 32. TCP: Dynamic Performance Window sizes for different BW and D values: TCP w/o LFN extensions (RFC1323) limits window size to $2^{16}$ octets = 64KB. Solutions: A) open multiple TCP connections (each has its own 64KB window) B) use an LFN-friendly TCP stack (i.e. supporting RFC1323), like the one in Linux kernel 2.6.16 that comes with LOTC
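
With the RFC1323 window scale option, the 16-bit window field is shifted left by a negotiated amount. A minimal sketch of picking the smallest shift that covers a required window, assuming the RFC1323 maximum shift of 14. The function name is hypothetical.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14               # upper bound set by RFC1323


def window_scale_shift(required_window_bytes: int) -> int:
    """Smallest window-scale shift whose scaled window covers the requirement."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < required_window_bytes:
        shift += 1
        if shift > MAX_SHIFT:
            raise ValueError("window too large even for the maximum scale shift")
    return shift
```

A 625,000-byte window, for instance, needs a shift of 4, since 65535 shifted by 3 only reaches 524,280 bytes.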
33. 33. TCP: Dynamic Performance Two sources of packet loss: 1) Transmission errors: random, non-correlated 2.a) Router queues: random, correlated 2.b) Router queues with Active Queue Management (e.g. RED): random, non-correlated. The theoretical throughput model (for non-correlated loss) is given by: Max values are capped by send/receive window size limits
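
The throughput formula itself is missing from the transcript. The sketch below assumes it is the model from the "Macroscopic Behavior of the TCP Congestion Avoidance Algorithm" paper listed in the references, BW = MSS·C / (RTT·√p), together with that paper's peak window of √(8/(3p)) segments. The assumption is consistent with the value w = 51.64 quoted later for p = 10⁻³. Function names and defaults are hypothetical.

```python
import math


def model_throughput_bps(mss_bytes: int, rtt_s: float, loss_prob: float,
                         c: float = 1.0) -> float:
    """Upper bound on throughput in bits per second: MSS * C / (RTT * sqrt(p))."""
    return 8 * mss_bytes * c / (rtt_s * math.sqrt(loss_prob))


def peak_window_segments(loss_prob: float) -> float:
    """Window size (in segments) at which the periodic loss occurs."""
    return math.sqrt(8 / (3 * loss_prob))
```

For example, `peak_window_segments(1e-3)` evaluates to about 51.64 segments.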
34. 34. TCP: Dynamic Performance TCP stable state. [Figure: window size over time; axis labels s, t, w, w/2, 3w/4, D, 2D, 3D, wD/2, wD, 3wD/2]
35. 35. TCP: Dynamic Performance Congestion window size (cWnd) is independent of buffer size at the receiver (RCVBUF). [Figure labels: p, RCVBUF, D, 2D, 3D, w, 3w/2, wD/2, wD, 3wD/2] cWnd = 2*w/2 = w; cWnd' = (3w/2)/2 = 3w/4; cWnd'' = (5w/4)/2 = 5w/8
36. 36. TCP: Dynamic Performance BW limit for different p and D values (W = ∞, non-correlated loss with C=1): Solutions: Multiple TCP connections in theory allow more aggregated throughput; however, effect of increasing load on the network at congestion spots is unknown
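37. 37. TCP: Dynamic Performance Evolution of sender's buffer with time (uniform arrival rate). [Figure: queued packets over time; axis labels w, w/2, 3w/4, D, 2D, 3D, wD/2]

    $$Q_i = \frac{3w}{4} - \left[\left(\frac{w}{2} + i\right) - Q_{i-1}\right], \qquad Q_{-1} = 1$$

    $$\begin{aligned}
    Q_i &= \frac{w}{4} - i + Q_{i-1} \\
        &= \frac{w}{4} - i + \left[\frac{w}{4} - (i-1) + \left[\frac{w}{4} - (i-2) + \dots\right]\right] \\
        &= \frac{w}{4}\,i - \left[i + (i-1) + (i-2) + \dots\right] \\
        &= \frac{w}{4}\,i - \left[i \cdot i - (1 + 2 + 3 + \dots + i)\right] \\
        &= \frac{w}{4}\,i - i^2 + \frac{i(1+i)}{2} \\
        &= \frac{w}{4}\,i - i^2 + \frac{i^2 + i}{2} \\
        &= \frac{(w+2)\,i}{4} - \frac{i^2}{2}
    \end{aligned}$$

    Note: the term $1 + 2 + 3 + \dots + i$ is an arithmetic progression of $n = i$ elements with initial value $a_1 = 1$, final value $a_n = i$ and common difference $d = 1$, which sums up to $n(a_1 + a_n)/2 = i(1+i)/2 = (i^2 + i)/2$.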
38. 38. TCP: Dynamic Performance What does the TCP buffer at the sender look like?
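39. 39. TCP: Dynamic Performance Service time (S) evolution. [Figure labels: μ, λ, $Q_i$, S, $cWnd_i$, $cWnd_{i+1}$; axes i, Q]
    - When $Q_i < cWnd_i$: $S = 0$
    - When $cWnd_i < Q_i < cWnd_i + cWnd_{i+1}$: $S = D$
    - When $cWnd_i + cWnd_{i+1} < Q_i < cWnd_i + cWnd_{i+1} + cWnd_{i+2}$: $S = 2D$
    - …
    - When $\sum cWnd_{i+k-1} < Q_i < \sum cWnd_{i+k}$: $S = kD$

    Since $cWnd_i = w/2 + i$:

    $$\sum cWnd_{i+k} = \left(\frac{w}{2} + i\right) + \left(\frac{w}{2} + i + 1\right) + \dots + \left(\frac{w}{2} + i + k\right) = (k+1)\left(\frac{w}{2} + i\right) + \frac{k(1+k)}{2} = \frac{1}{2}\left[k(w + 1 + 2i + k)\right]$$

    $$i = \frac{1}{4}\left(w + 2 - 2k \pm \sqrt{(w+2)^2 + 4k(2 - 5w - 3k)}\right)$$

    $$k_{max} < \frac{1}{6}\left[(2 - 5w) + \sqrt{(2 - 5w)^2 + 3(w+2)^2}\right]$$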
40. 40. TCP: Dynamic Performance Expected throughput, latency and jitter: $p = 10^{-3} \Rightarrow w = 51.64 \Rightarrow k_{max} = 88$
41. 41. TCP: Aggregation Flows ingressing a congested back-bone router "resonate" after a few packet drops. [Figure labels: T2, T3, Tk] Resonance takes place once every flow has suffered at least one drop. The resonance period tends to the average of all periods weighted by the flow size.
42. 42. TCP: Aggregation Resonating flows with similar RTT have their own micro-resonance; this phenomenon is known as 'flocking'. [Figure labels: T2, T3] There is no theoretical approach to calculating the micro-resonance thresholds. Flows with short micro-resonance periods steal throughput from flows with long micro-resonance periods.
43. 43. TCP: Application Behavior Misbehaved applications: A host is allowed to open as many flows as its resources can afford to the same target host IP routers deal with each flow blindly, ignoring the fact that many flows might start and end in the same host pair Therefore the bandwidth share taken by an application at a congested link depends on the number of flows it puts through that link: BW = k*BWi Corollary: applications holding many flows between end-hosts are both unfair to applications holding fewer flows and a potential congestion cause
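
The relation BW = k*BWi can be illustrated numerically. The sketch assumes every flow crossing the congested link ends up with the same share BWi, which is an idealisation. The function name and the figures in the example are hypothetical.

```python
def application_share_bps(app_flows: int, total_flows: int,
                          link_bps: float) -> float:
    """Bandwidth taken by an application holding `app_flows` of the link's flows."""
    per_flow_bps = link_bps / total_flows  # BWi, assuming equal per-flow shares
    return app_flows * per_flow_bps        # BW = k * BWi


# Example: 10 flows share a 100 Mbps link; one application owns 4 of them.
greedy = application_share_bps(4, 10, 100e6)
modest = application_share_bps(1, 10, 100e6)
```

Under these assumptions the four-flow application takes 40 Mbps while a single-flow peer gets 10 Mbps.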
44. 44. TCP: Application Behavior Misbehaved applications: A host may start transmitting at a high rate (if the receiver has enough buffer), then drop the connection when stable state has been reached (the connection is trained). If the host re-connects very quickly and starts transmitting at a high rate again, on average it will take more bandwidth than its peers maintaining trained connections. Corollary: dropping a trained connection is never a good idea if there's a chance it will be used soon
45. 45. TCP: Application Behavior Well-behaved applications: A sensible application shall open more connections if needed, but close them if it perceives it is causing congestion A sensible application shall try to maintain and re-use trained connections as much as possible
46. 46. TCP: Application Behavior De-bunking send & receive window limitations: From Linux kernel 2.6.7, the TCP receiving window adjusts itself depending on the free space in the buffer. Remember the /proc/sys/net/ipv4/tcp_wmem system variable? The window starts at the middle value (default 16KB). The window grows and shrinks as needed depending on the number of queued segments. Window growth is limited by the right-most value (default 1MB), and the window never shrinks to less than the left-most value (default 4KB). The sending window has been self-adjusting since much earlier than the 2.6.7 kernel
47. 47. TCP: Research Issues Research topic #1: improving congestion control by routers. Though much less harmful than tail drops, single packet drops drive TCP into congestion avoidance mode. Even when using WFQ, traffic spikes can cause the highly undesirable tail drops. If a router could signal transmitters when congestion is coming, transmitters might adjust their transmission rates without drops and without entering congestion avoidance mode. This research field focuses on the use of the ECN bits in the IP and TCP headers
48. 48. TCP: Research Issues Research topic #2: how to reconcile the congestion-bound TCP traffic with the unbounded, heavyweight UDP traffic. TCP traffic can be throttled up and down before, during and after congestion. UDP traffic, on the contrary, cannot be throttled and potentially causes more congestion than TCP. The focus of this research area is on congestion-controlling UDP (for instance the just-created RMCAT IETF WG, see https://datatracker.ietf.org/wg/rmcat/charter/)
49. 49. TCP: Research Issues Research topic #3: congestion control algorithms for 4G radio accesses. Which of the existing CC algorithms (Reno, Vegas, Cubic...) performs best in 4G RANs? Would new CC algorithms perform better than existing ones? How can the more aggressive algorithms co-exist peacefully with the more conservative ones? Research report on CC algorithm performance in LTE networks (link)
50. 50. TCP: References. Macroscopic Behavior of the TCP Congestion Avoidance Algorithm, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.3452; Modeling TCP Throughput: A Simple Model and its Empirical Validation, http://?