Slides supporting the "Computer Networking: Principles, Protocols and Practice" ebook. The slides can be freely reused to teach an undergraduate computer networking class using the open-source ebook.
Initial sequence
number
• First approach
• Each TCP host has a clock that
increments the iss every 4 microseconds
• Current approach
• Each TCP host picks a random number
as its initial sequence number
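The two approaches can be contrasted in a short sketch. The function names are hypothetical, and the wrap-around at 2^32 simply reflects the 32-bit width of the TCP sequence number field.

```python
import secrets
import time

SEQ_SPACE = 2**32  # TCP sequence numbers are 32 bits wide


def clock_isn(now_us: int) -> int:
    """First approach: a counter that ticks once every 4 microseconds."""
    return (now_us // 4) % SEQ_SPACE


def random_isn() -> int:
    """Current approach: an unpredictable 32-bit value."""
    return secrets.randbits(32)


if __name__ == "__main__":
    now_us = time.time_ns() // 1000
    print("clock-based ISN:", clock_isn(now_us))
    print("random ISN:", random_isn())
```

A clock-based value can be guessed by anyone who knows roughly what time it is on the host, which is why the random choice matters for the spoofing discussion that follows.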
The problem with
trusted addresses
B
T
A
ACK(seq=x+1, ack=y+1)
SYN+ACK(ack=x+1,seq=y)
SYN(seq=x)
Connection comes
from Alice’s IP
address: no need
to ask for a username
and password
DATA(seq=x+1, ack=y+1)
Can Trudy hijack this
connection ?
TCP and spoofing
• Can Trudy create a fake TCP connection
by spoofing Alice's IP when she is away
?
• Trudy can send spoofed IP packets to
Bob using Alice’s address
• But Trudy cannot receive the packets
sent by Bob to Alice
TCP and spoofing
• Trudy's view of the transfer
SYN+ACK(Dst=A,ack=x+1,seq=y)
SYN(Src=A,seq=x)
ACK(seq=x+1, ack=y+1)
Data(Src=A,seq=x+1)
Trudy Alice
Ignored if Alice is offline
Can Trudy predict y ?
Bob
Countering DoS attacks
• Principle of the solution
• Server should not create any state
before being sure that the client can
receive the segments that it sends
SYN(Src=C,seq=x)
SYN+ACK(Dest=C,ack=x+1,seq=y)
ACK(Src=C,seq=x+1,
ack=y+1)
CONNECT.req
Server does not
store anything
Server checks
that third ACK is
valid
and creates state
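A minimal sketch of this stateless check, assuming the server derives y from a keyed hash of the connection's addresses and ports. The function names and the use of HMAC-SHA256 are illustrative assumptions; real SYN cookies also encode information such as the client's MSS and a time counter.

```python
import hashlib
import hmac

SEQ_SPACE = 2**32


def cookie_isn(secret: bytes, src_ip: str, src_port: int,
               dst_ip: str, dst_port: int) -> int:
    """Compute the server's initial sequence number y without storing state."""
    msg = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")  # keep 32 bits


def third_ack_is_valid(secret: bytes, src_ip: str, src_port: int,
                       dst_ip: str, dst_port: int, ack: int) -> bool:
    """Recompute y from the ACK's addresses and check that ack == y+1."""
    y = cookie_isn(secret, src_ip, src_port, dst_ip, dst_port)
    return ack == (y + 1) % SEQ_SPACE
```

Only a client that actually received the SYN+ACK knows y, so a valid third ACK shows that the client can receive segments sent to its claimed address.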
TCP options
Source port Destination port
Payload
32 bits
Checksum Urgent pointer
THL Reserved Flags
20 bytes
Sequence number
Optional header extension
Window
Acknowledgement number
Space in the
header with new
fields which can
be exchanged
over a
connection
Each TCP Option encoded as:
• Type
• Length
• Value
What is the maximum length of the
TCP header including options ?
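As a worked answer: the THL field is 4 bits wide and counts 32-bit words, so the header is at most 15 × 4 = 60 bytes, which leaves 40 bytes for options. The sketch below walks a Type/Length/Value option area; the function name is hypothetical, and the two single-byte kinds (0 for end of list, 1 for padding) are handled as special cases.

```python
MAX_HEADER_BYTES = 15 * 4                   # 4-bit THL, in 32-bit words
MAX_OPTION_BYTES = MAX_HEADER_BYTES - 20    # fixed header is 20 bytes


def parse_options(raw: bytes) -> list[tuple[int, bytes]]:
    """Split the option area of a TCP header into (type, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:          # End of Option List: single byte, stop
            break
        if kind == 1:          # No-Operation: single byte of padding
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]    # counts the type, length and value bytes
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options
```

For example, the bytes `02 04 05 b4` decode as option type 2 (MSS) with the two-byte value 0x05b4, that is 1460.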
Negotiating the utilization
of TCP Options
ACK(seq=x+1, ack=y+1)
CONNECT.req
CONNECT.ind
SYN+ACK(ack=x+1,seq=y) Option K
CONNECT.resp
CONNECT.conf
Initial sequence number (x)
Option K proposed
Initial sequence number (y)
Option K accepted
SYN(seq=x),Option K
Connection established
Option accepted
Connection established
The sequence numbers of all
segments A->B will start at x+1
The sequence numbers of all
segments B->A will start at y+1
The MSS Option
ACK(seq=x+1, ack=y+1)
CONNECT.req
CONNECT.ind
SYN+ACK(ack=x+1,seq=y) MSS=1000
CONNECT.resp
CONNECT.conf
Initial sequence number (x)
Will not accept segments longer
than 1200 bytes
Initial sequence number (y)
Will never send segments longer than
1200 bytes
Will not accept segments longer
than 1000 bytes
SYN(seq=x),MSS=1200
Connection established
Option accepted
Connection established
Will never send segments longer than
1000 bytes
The sequence numbers of all
segments A->B will start at x+1
The sequence numbers of all
segments B->A will start at y+1
What is the usual MSS size advertised
by an Internet host today ?
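A small sketch for the question above, assuming a host attached to an Ethernet link with a 1500-byte MTU and headers without options. The function name is hypothetical.

```python
def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Largest TCP payload that fits in one packet of `mtu` bytes."""
    return mtu - ip_header - tcp_header


if __name__ == "__main__":
    print(mss_for_mtu(1500))                # IPv4 over Ethernet
    print(mss_for_mtu(1500, ip_header=40))  # IPv6 over Ethernet
```

Under these assumptions the advertised value is 1460 bytes with IPv4 and 1440 bytes with IPv6.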
Agenda
• TCP
• Connection establishment
• Reliable data transfer (more details)
• Connection release
• Congestion control
RTT measurements
• Solution (Karn/Partridge)
• Do not measure rtt of retransmitted segments
(seq=123,"abcd")
(seq=120,"xyz")
(ack=123)
(ack=127)
measured rtt
Timer
which is the good rtt ?
(seq=123,"abcd")
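The rule can be sketched as a small estimator that ignores samples taken on retransmitted segments, since their acknowledgement could match either transmission. The class name is hypothetical; the smoothing weights 1/8 and 1/4 are the usual values quoted in the notes, and the timer formula follows RFC 2988 without its lower bound on the timer.

```python
class RttEstimator:
    ALPHA = 1 / 8   # weight of a new sample in the smoothed rtt
    BETA = 1 / 4    # weight of a new sample in the rtt variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def on_ack(self, sample: float, retransmitted: bool) -> None:
        """Update the estimate with one rtt sample (any time unit)."""
        if retransmitted:
            return  # Karn/Partridge: the sample is ambiguous, skip it
        if self.srtt is None:
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    @property
    def rto(self):
        """Retransmission timer, or None before the first sample."""
        if self.srtt is None:
            return None
        return self.srtt + 4 * self.rttvar
```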
The SACK TCP Option
• Negotiated during establishment
• SACK-permitted TCP option
• SACK option format
How many SACK blocks in one TCP
segment ?
Kind=5 Length
Left edge 1st block
Right edge 1st block
Left edge last block
Right edge last block
SACK block
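A worked answer to the question above, assuming the 40 bytes of option space left by a 60-byte maximum header: the SACK option uses 2 bytes for kind and length, and each block needs two 32-bit edges, that is 8 bytes. The function name is hypothetical.

```python
MAX_OPTION_BYTES = 40  # 60-byte maximum header minus the 20-byte fixed part


def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Number of SACK blocks that fit next to the other options in use."""
    available = MAX_OPTION_BYTES - other_option_bytes - 2  # kind + length
    return max(available // 8, 0)


if __name__ == "__main__":
    print(max_sack_blocks())    # SACK alone
    print(max_sack_blocks(12))  # with a padded timestamp option
```

This gives at most 4 blocks, or 3 when a timestamp option (10 bytes, usually padded to 12) is also carried.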
Delayed acks
• Sending an ack per segment is costly
• Tradeoff
• In sequence data segment
• no ack waiting, delay by up to 50 msec
• one ack waiting, send immediately
• Out-of-sequence data segment
• send ack immediately
What is the benefit of delayed
acks ?
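The tradeoff above can be written as a small decision function. This is only a sketch with hypothetical names; the 50 msec bound is the one given on the slide.

```python
MAX_ACK_DELAY = 0.050  # seconds, the bound quoted above


def ack_decision(in_sequence: bool, ack_pending: bool) -> str:
    """Decide what to do with the ack for a newly received data segment."""
    if not in_sequence:
        return "send now"      # out-of-sequence: the sender must know quickly
    if ack_pending:
        return "send now"      # one ack then covers two segments
    return "delay"             # wait up to MAX_ACK_DELAY for more data
```

With a steady stream of in-sequence segments, the receiver ends up sending one ack for every two segments.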
When to send data ?
• When should a segment be sent ?
• Option 1
• After each write system call
• Option 2
• When there is a full segment of data
What is the solution that you
would recommend for a TCP
Nagle algorithm
• A new data segment can be sent if either
• This is a full segment (MSS bytes)
• There are no unacknowledged bytes
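The two conditions translate directly into a predicate. The function and parameter names are hypothetical.

```python
def nagle_can_send(pending_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if a new data segment may be sent now."""
    if pending_bytes <= 0:
        return False                      # nothing to send
    full_segment = pending_bytes >= mss   # a full segment is available
    idle = unacked_bytes == 0             # no unacknowledged bytes
    return full_segment or idle
```

Small writes are therefore sent immediately on an idle connection, and otherwise accumulate until an ack arrives or a full segment is ready.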
Limitation of TCP flow
control
Source port Destination port
Payload
32 bits
Checksum Urgent pointer
THL Reserved Flags
20 bytes
Sequence number
Optional header extension
Window
Acknowledgement number
16 bits !
What is the maximum throughput of a TCP
connection with a 64KB window and a 10 msec rtt
(in Mbps) ?
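A worked answer, assuming the sender can have at most one window of data in flight per round-trip-time. The function name is hypothetical.

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one window per round-trip-time."""
    return window_bytes * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    # Largest value that fits in the 16-bit window field, 10 msec rtt
    print(round(window_limited_mbps(65535, 0.010), 1))
```

With a 16-bit window the connection cannot exceed about 52.4 Mbps on a 10 msec path, whatever the link bandwidth.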
Window scaling
• Window maintained as a 32-bit integer
by TCP implementations
• But sent as a scaled 16-bit value in
segments
• Scaling factor announced in WScale
option in SYN/SYN+ACK segments
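The scaling can be sketched with two bit shifts. The function names are hypothetical; the limit of 14 on the shift count comes from the window scale specification (RFC 1323).

```python
MAX_SHIFT = 14  # largest scaling factor allowed by the WScale option


def to_wire(window: int, shift: int) -> int:
    """Value placed in the 16-bit window field of a segment."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("invalid window scale")
    return min(window >> shift, 0xFFFF)


def from_wire(field: int, shift: int) -> int:
    """Window recovered by the host that receives the segment."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("invalid window scale")
    return field << shift
```

The low-order bits are lost in the round trip: a 1,000,000-byte window with a shift of 7 is announced as 7812 and understood as 999,936 bytes.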
Agenda
• TCP
• Connection establishment
• Reliable data transfer
• Connection release (more details)
• Congestion control
TCP connection
release
FIN Wait1
SYN RCVD
CLOSE Wait
Established
FIN Wait2
LAST-ACK
TIME Wait
Closing
Closed
?FIN/!ACK
!FIN
?ACK
Timeout[2MSL]
?FIN/!ACK
?ACK
!FIN
?ACK
?FIN/!ACK
!FIN
Agenda
• TCP
• Congestion control
• AIMD in TCP
• Explicit Congestion Notification
• Modern TCP congestion control
Router buffers
• Rule of thumb
• Routers should have RTT * C of buffers
• RTT is average rtt for flows
• C is bandwidth of output link
• Backbone routers
• $Buffer \geq \dfrac{RTT \times C}{\sqrt{N}}$
N. McKeown, G. Appenzeller, I. Keslassy, Sizing Router Buffers (Redux),
SIGCOMM CCR October 2019
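A numeric sketch of both rules, assuming N denotes the number of flows sharing the output link as in the cited paper. The function names and the example figures are illustrative only.

```python
import math


def rule_of_thumb_bytes(rtt_s: float, capacity_bps: float) -> float:
    """Buffer of RTT * C, converted from bits to bytes."""
    return rtt_s * capacity_bps / 8


def backbone_bytes(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    """Buffer of RTT * C / sqrt(N) for a link shared by n_flows flows."""
    return rule_of_thumb_bytes(rtt_s, capacity_bps) / math.sqrt(n_flows)


if __name__ == "__main__":
    # Hypothetical 10 Gbps link, 100 msec average rtt, 10,000 flows
    print(rule_of_thumb_bytes(0.1, 10e9) / 1e6, "MBytes")
    print(backbone_bytes(0.1, 10e9, 10_000) / 1e6, "MBytes")
```

For this example the requirement drops from about 125 MBytes to about 1.25 MBytes.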
Congestion signals
• Main types of congestion signals
• Packet loss
• most popular signal
• Explicit Congestion Notification
• requires router cooperation
• Increase in measured round-trip-time
• can be fragile
Additive Increase
• No congestion ?
• All acks move window
• Additive increase
• Increment cwnd by one MSS every rtt
Cwnd
Time
Faster increase
• How to speed up the growth of the
congestion window at connection startup
?
• Slow-start
• Double cwnd every rtt
Cwnd
Slow-start
exponential increase of cwnd
Time
Max window
Multiplicative
decrease
• How to detect congestion ?
• Three duplicate acks
• mild congestion for TCP
• cwnd/2 and restart additive increase
• Expiration of retransmission timer
• severe congestion
• Reset cwnd at 1 MSS
• Perform slow-start until half previous cwnd
and then continue with congestion
avoidance
AIMD in TCP
else: # duplicate or old ack
    if tcp.ack == snd.una: # duplicate acknowledgment
        dupacks += 1
        if dupacks == 1 or dupacks == 2:
            send_next_unacked_segment # rfc3042
        if dupacks == 3:
            retransmitsegment(snd.una)
            ssthresh = max(cwnd/2, 2*MSS)
            cwnd = ssthresh
        if dupacks > 3: # rfc5681
            cwnd = cwnd + MSS # inflate cwnd
    else: # ack for old segment, ignored
        pass
Expiration of the retransmission timer:
    send(snd.una) # retransmit first lost segment
    ssthresh = max(cwnd/2, 2*MSS)
    cwnd = MSS
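To see how these reactions shape the congestion window over time, here is a coarse per-rtt simulation. It is a sketch, not an implementation: the function names and event labels are hypothetical, cwnd is counted in segments, and capping slow-start at ssthresh is a simplifying assumption.

```python
MSS = 1.0  # cwnd is expressed in segments


def after_one_rtt(cwnd: float, ssthresh: float, event: str):
    """Return (cwnd, ssthresh) after a round-trip-time ending with `event`."""
    if event == "acked":                    # no congestion during this rtt
        if cwnd < ssthresh:
            return min(2 * cwnd, ssthresh), ssthresh  # slow-start
        return cwnd + MSS, ssthresh                   # additive increase
    if event == "3dupacks":                 # mild congestion
        ssthresh = max(cwnd / 2, 2 * MSS)
        return ssthresh, ssthresh
    if event == "timeout":                  # severe congestion
        ssthresh = max(cwnd / 2, 2 * MSS)
        return MSS, ssthresh
    raise ValueError(event)


def trace(events, cwnd: float = MSS, ssthresh: float = 64 * MSS):
    """List the successive cwnd values for a sequence of events."""
    values = [cwnd]
    for event in events:
        cwnd, ssthresh = after_one_rtt(cwnd, ssthresh, event)
        values.append(cwnd)
    return values
```

Three duplicate acks halve the window and growth resumes linearly, while a timeout sends the connection back to slow-start from one segment.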
Simplified model
• Assume all segment losses are periodic
and that one segment out of every 1/p is lost
Cwnd(segments)
W
W/2
0
0 W/2 W 3W/2 2W time(rtt)
Surface
It can be shown that the throughput of a TCP
connection can be approximated by :
Maximum throughput without losses Throughput with
losses/congestion
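A sketch of the derivation under the simplified model, assuming cwnd follows a sawtooth between W/2 and W segments, grows by one segment per rtt, and loses exactly one segment per cycle. The result is the approximation usually attributed to Mathis et al., whose paper is listed in the notes.

```latex
\begin{align*}
\text{segments per cycle}
  &= \frac{W}{2}\cdot\frac{W}{2} + \frac{1}{2}\cdot\frac{W}{2}\cdot\frac{W}{2}
   = \frac{3W^2}{8} = \frac{1}{p}
  \quad\Rightarrow\quad W = \sqrt{\frac{8}{3p}} \\
\text{throughput with losses}
  &= \frac{(1/p)\times MSS}{(W/2)\times rtt}
   = \sqrt{\frac{3}{2}}\times\frac{MSS}{rtt\times\sqrt{p}} \\
\text{throughput}
  &\approx \min\left(\frac{window}{rtt},\;
     \sqrt{\frac{3}{2}}\times\frac{MSS}{rtt\times\sqrt{p}}\right)
\end{align*}
```

The first term of the minimum is the maximum throughput without losses, and the second is the throughput when losses limit the window.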
Tuning TCP @google
• Objectives
• Minimize time to receive result from
search engine
• HTTP GET fits a single segment
• HTTP Response in <16 KBytes
TCP Fast Open
• Can we reduce the overhead of the
three-way handshake ?
• HTTP/1.1
• Putting data inside SYN and SYN+ACK
TCP Fast Open
• Is this safe ?
• Risk of denial of service attack
SYN(Src=C,seq=x, HTTP GET)
CONNECT.ind+HTTP GET
SYN+ACK(Dest=C,ack=x+1,seq=y, HTTP Resp)
CONNECT.req+Data
ACK(Src=A,seq=x)
Is this safe
?
Safe TCP Fast Open
• How to make TCP Fast Open safe in the
presence of attackers ?
• Server needs to ensure that SYN
segment does not come from an
attacker who sent a spoofed packet
Agenda
• TCP
• Congestion control
• AIMD in TCP
• Explicit Congestion Notification
• Modern Congestion Control
Basic ECN
• Issues
• What happens if the returning ECN-echo is
lost ?
• How can we deploy ECN?
R1 R2
A D
Congestion Notification
Mark the IP packet that caused congestion
by setting one bit flag (CE: Congestion Experienced)
TCP source behaviour
Upon reception of an ECN-Echo=1 TCP ack,
behave as if the corresponding segment was lost
(perform congestion avoidance).
TCP destination behaviour
Upon reception of a CE=1 IP packet indicate the
congestion to the source by setting a special
flag (ECN-Echo) in the returning TCP ack
Dealing with lost acks
R1 R2
A D
TCP receiver behavior
Upon reception of a CE=1 IP packet indicate the
congestion to the source by setting a special
flag (ECN-Echo) in all returning TCP acks until
a TCP segment with CWR set is received
TCP sender behavior
Upon reception of an ECN-Echo=1 TCP ack,
perform congestion avoidance and set CWR
flag in next TCP segment
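The receiver side of this exchange can be sketched as a two-state machine. The class and method names are hypothetical, and a segment carrying both CWR and CE is treated as a fresh congestion indication, which is a simplifying assumption.

```python
class EcnReceiver:
    """Tracks whether returning acks must carry the ECN-Echo flag."""

    def __init__(self):
        self.echo = False

    def on_segment(self, ce: bool, cwr: bool) -> bool:
        """Process one incoming segment and return the ECN-Echo flag
        to place in the ack that answers it."""
        if cwr:
            self.echo = False   # the sender reacted, stop echoing
        if ce:
            self.echo = True    # a router marked this packet
        return self.echo
```

Because the flag stays set until CWR arrives, the loss of a single ack no longer hides the congestion signal from the sender.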
Deploying ECN
• On endhosts
• Update the TCP stack to support ECN
• Negotiate ECN usage in SYN
• Encode ECN info in
packets/segments
• For other transport protocols…
Deploying ECN
• On routers
• Routers need to distinguish between
• ECN-capable hosts that react to ECN
• If congestion, such packets are
marked
• Other hosts that do not react to ECN
• If congestion, such packets are
dropped
ECN support on
routers
• Specialised buffer acceptance
algorithms
R1 R2
A D
In case of congestion
If ECT bit is set
Mark the IP packet that caused congestion
by setting one bit flag (CE: Congestion Experienced)
If ECT bit is not set
Discard the IP packet that caused congestion
ECN-capable source
If destination is also ECN capable
Set ECT bit in all IP packets towards destination
Otherwise
Reset ECT bit
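Both decisions fit in a few lines. The function names and returned labels are hypothetical.

```python
def source_sets_ect(source_ecn: bool, destination_ecn: bool) -> bool:
    """An ECN-capable source sets ECT only if the destination supports ECN."""
    return source_ecn and destination_ecn


def router_action(congested: bool, ect: bool) -> str:
    """What a router does with a packet arriving at its output queue."""
    if not congested:
        return "forward"
    return "mark CE" if ect else "discard"
```

Hosts that do not understand ECN keep seeing losses, so both kinds of hosts can share the same router during deployment.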
Agenda
• TCP
• Congestion control
• AIMD in TCP
• Explicit Congestion Notification
• Modern Congestion Control
Issues with AIMD
• Performance on high bandwidth*delay links
• Each loss forces TCP into congestion
avoidance, where cwnd grows slowly
• Bufferbloat
• TCP AIMD tries to saturate buffers until it
causes congestion
• Inflates round-trip-times
• Fairness
• TCP sources with a lower rtt are favored
TCP Congestion
Controls
• Supposed to be fair
• MSS size
• rtt
• Many congestion
control schemes
Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
CUBIC
• A modern congestion controller designed for
high bandwidth*delay product links
• Default on Linux
• Principles
• Use concave and convex profiles of cubic
function to increase cwnd
• CUBIC behaves like AIMD with small
rtt/bw
• CUBIC provides linear bw sharing among
flows with different rtt
CUBIC
• Congestion window increase during
congestion avoidance
Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf and RFC8312
$cwnd = cwnd_{max} + C \times \left(\Delta - \sqrt[3]{\dfrac{cwnd_{max} \times (1-\beta)}{C}}\right)^{3}$
Packet loss:
$cwnd = \beta \times cwnd$
Parameters
$\beta = 0.7$, $C = 0.4$
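The window growth function can be evaluated numerically. In this sketch Δ is assumed to be the time in seconds since the last window reduction, and the function name is hypothetical; the default parameters are the ones given above.

```python
def cubic_cwnd(delta: float, cwnd_max: float,
               beta: float = 0.7, c: float = 0.4) -> float:
    """cwnd = cwnd_max + C * (delta - K)^3,
    with K = cube root of (cwnd_max * (1 - beta) / C)."""
    k = (cwnd_max * (1 - beta) / c) ** (1 / 3)
    return cwnd_max + c * (delta - k) ** 3


if __name__ == "__main__":
    for t in range(0, 9, 2):
        print(t, round(cubic_cwnd(t, 100.0), 1))
```

Right after a loss (Δ = 0) the function returns β × cwnd_max. It then climbs along the concave part of the curve, flattens when it reaches cwnd_max at Δ = K, and probes for more bandwidth along the convex part.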
Bottleneck Bandwidth
and Round-Trip-Time
(BBR)
• Recent congestion control scheme that aims at
achieving high throughput and low delay
• Operates in four phases
• Startup (similar to slow-start until measured rate
stops increasing)
• Drain (empty the queues, send at 0.75 rate)
• compute rttmin over last 10 seconds
• Probe bandwidth every 8 rtt (send at 1.25 rate for
one rtt and then at 0.75 rate)
• Probe RTT (reduce rate for more precise rttmin)
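The bandwidth probing phase can be sketched as a cycle of eight pacing gains applied to the current bottleneck bandwidth estimate. The names are hypothetical, and holding the gain at 1 for the six remaining round-trip-times is an assumption based on the published BBR description rather than on the slide.

```python
# One gain per rtt: probe above the estimate, drain the queue, then cruise
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def pacing_rate(bottleneck_bw: float, rtt_index: int) -> float:
    """Sending rate during the rtt_index-th round-trip-time of probing."""
    gain = PROBE_BW_GAINS[rtt_index % len(PROBE_BW_GAINS)]
    return gain * bottleneck_bw


def bdp_bytes(bottleneck_bw_bps: float, rtt_min_s: float) -> float:
    """Bandwidth-delay product that the two estimates define."""
    return bottleneck_bw_bps * rtt_min_s / 8
```

If sending at 1.25 times the estimate raises the measured delivery rate, the bandwidth estimate is updated; the following rtt at 0.75 removes any queue the probe created.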
CUBIC, Vegas and
BBR
Source: B. Turkovic, F. Kuipers and S. Uhlig, Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf and RFC8312
Two TCP connections
Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
Two TCP connections,
different rtt
Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
Two different congestion
controllers
Source: B. Turkovic, et al., Fifty Shades of Congestion Control: A Performance
and Interactions Evaluation, https://arxiv.org/pdf/1903.03852.pdf
Editor's Notes
MSL in IP networks : 120 seconds
Most TCP implementations today have fixes for those problems. We will discuss them later.
This utilization of a hash function to compute the value of the initial sequence number is usually called a SYN cookie.
In practice, the computation of the SYN cookie is slightly more complex than a simple hash function because the server must also remember inside the cookie the following information :
- the MSS value advertised by the client
- the optional utilization of TCP options such as RFC1323 large windows or timestamps or SACK by the sender
The original discussions that led to the development of the SYN cookie solution may be found in :
http://cr.yp.to/syncookies/archive
Urgent pointer is rarely used and will not be described.
The THL is expressed in blocks of 32 bits. The TCP header may contain options, these will be discussed later.
The computation of TCP’s retransmission timer is described in
RFC2988 Computing TCP's Retransmission Timer. V. Paxson, M. Allman. November 2000.
Usual values for alpha and beta are 1/8 and 1/4.
See
P. Karn, C. Partridge, Improving round-trip time estimates in reliable transport protocols, Proc. ACM SIGCOMM87, August 1987
TCP timestamps were introduced in :
RFC1323 TCP Extensions for High Performance. V. Jacobson, R. Braden, D. Borman. May 1992.
The use of these timestamps is negotiated during the establishment of the TCP connection. Most current TCP implementations support these extensions.
See e.g.
RFC2001 TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms. W. Stevens. January 1997.
RFC2018 TCP Selective Acknowledgement Options. M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. October 1996.
Some heavily loaded web servers use abrupt release to close their connections, which avoids maintaining state for 2*MSL seconds.
More detailed models can be found in the scientific literature :
M. Mathis, J. Semke, J. Mahdavi and T. Ott, The macroscopic behaviour of the TCP congestion avoidance algorithm, ACM Computer Communication Review, 1997