2. TCP/IP Protocol Suite
• TCP/IP protocol stack is a layered architecture. TCP and IP
are the two most important protocols of this stack.
• It was originally developed by DARPA (Defense Advanced
Research Projects Agency) for an experimental packet-switched
network.
• It was later included in the Berkeley Software Distribution of
UNIX.
• It maps closely to the OSI layers and it supports all standard
physical and data link protocols.
• It also includes specifications for such common applications as
e-mail, remote login, terminal emulation, and file transfer.
3. TCP/IP Protocol Suite (Transport Layer)
[Figure: TCP/IP protocol stack, top to bottom]
• Distributed applications: HTTP, SMTP, RTP, DNS
• TCP: reliable stream service; UDP: user datagram service
• IP (with ICMP, ARP): best-effort connectionless packet transfer
• Network Interfaces 1, 2, 3: variety of network technologies
4. Transport services and protocols
• Provide logical communication
between application processes
running on end hosts
• Transport protocols run only in end
systems (not in routers/switches)
– Sender breaks application
messages into segments, passes
to network layer
– Receiver reassembles segments
into messages, passes to
application layer
• Transport protocols available are
TCP and UDP
[Figure: two end hosts, each with the stack application / transport / network / data link / physical]
Network Layer: Logical communication
between hosts
Transport Layer: Logical communication
between processes (relies on, and enhances,
network layer services)
5. TCP/IP Encapsulation
Example application: HTTP
• TCP header contains source & destination port numbers for
identifying the application
• IP header contains source and destination IP addresses and the
transport protocol type (TCP or UDP)
• Ethernet header contains source & destination MAC addresses
[Figure: the HTTP request as it is encapsulated, step by step]
HTTP Request
TCP header | HTTP Request
IP header | TCP header | HTTP Request
Ethernet header | IP header | TCP header | HTTP Request | FCS
6. Transport Layer Multiplexing/Demultiplexing
Multiplexing at Sender:
Gathering data from multiple sockets, enveloping data with
header (later used for demultiplexing)
Demultiplexing at Receiver:
Delivering received segments to correct socket
[Figure: Hosts 1, 2 and 3, each with the stack application / transport / network / link / physical; processes P1 to P4 attach to the transport layer through sockets]
7. Demultiplexing at the Transport Layer
• Host receives IP datagrams
– Each datagram has source
IP address, destination IP
address
– Each datagram carries 1
transport-layer segment
– Each segment has source,
destination port number
• Host uses IP addresses & port
numbers to direct segment to
appropriate socket
TCP/UDP Segment Format (each row is 32 bits wide):
Source Port # | Dest Port #
other header fields
Application Data (Message)
Port number ranges:
0-255 Well-known ports
256-1023 Less well-known ports
1024-65535 Ephemeral client ports
8. Connectionless Demultiplexing (UDP)
• Create sockets with port numbers:
DatagramSocket mySocket1 = new DatagramSocket(12534);
DatagramSocket mySocket2 = new DatagramSocket(12535);
• UDP socket identified by two-tuple
(dest IP address, dest port number)
• When host receives UDP
segment:
– checks destination port
number in segment
– directs UDP segment to
socket with that port
number
• IP datagrams with
different source IP
addresses and/or source
port numbers directed to
same socket if the
destination port number is
the same in the
destination host
Note that TCP does this differently,
using a 4-tuple (S-IP, SP, D-IP, DP) to
identify a socket!
9. Connectionless Demultiplexing (UDP)
Example: Server creates socket at 6428 to provide UDP service to some
application:
DatagramSocket serverSocket = new DatagramSocket(6428);
[Figure: Clients A and B each exchange UDP segments with Server C. One client sends SP: 9157, DP: 6428 and receives SP: 6428, DP: 9157; the other sends SP: 5775, DP: 6428 and receives SP: 6428, DP: 5775]
• Same socket (6428) at server for both clients in this example
• DP specifies the process to which data should be delivered at the Receiver
• SP specifies the process from which data is coming, for the specified source
IP address; acts like a return address for replies/responses if required to be
sent back
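The example above can be sketched in a few lines of Java. This is only a simulation of the lookup, not real socket code: the class `UdpDemuxSketch`, its map and its method names are hypothetical, and which client uses port 9157 and which uses 5775 is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of connectionless demultiplexing: the receiving host looks only at
// the destination port, so datagrams from different clients share one socket.
final class UdpDemuxSketch {
    // Hypothetical stand-in for the host's table of bound UDP sockets:
    // destination port -> socket name.
    private final Map<Integer, String> socketsByPort = new HashMap<>();

    void bind(int port, String socketName) {
        socketsByPort.put(port, socketName);
    }

    // Source IP and source port play no part in choosing the socket; they
    // only act as the return address for a reply.
    String deliver(String srcIp, int srcPort, int dstPort) {
        return socketsByPort.getOrDefault(dstPort, "no socket bound to this port");
    }

    public static void main(String[] args) {
        UdpDemuxSketch serverC = new UdpDemuxSketch();
        serverC.bind(6428, "serverSocket");
        System.out.println(serverC.deliver("A", 9157, 6428));
        System.out.println(serverC.deliver("B", 5775, 6428));
    }
}
```

Both calls in `main` resolve to the same socket, which is the point of the first bullet above.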
10. Connection-oriented Demultiplexing (TCP)
• TCP socket identified by 4-
tuple:
– source IP address
– source port number
– destination IP address
– destination port number
• Receiver host uses all four
values to direct segment to
appropriate socket; socket
is uniquely identified by 4-
tuple (S-IP, SP, D-IP, DP)
• Server host may support
many simultaneous TCP
sockets:
– each socket identified by its
own 4-tuple
• Web servers have different
sockets for each connecting
client
– non-persistent HTTP will
have different socket for
each request
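A minimal sketch of the 4-tuple lookup, for comparison with the UDP case. The class `TcpDemuxSketch`, the string key format and the socket names are hypothetical; a real TCP stack keeps this table inside the operating system.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of connection-oriented demultiplexing: the full 4-tuple
// (S-IP, SP, D-IP, DP) selects the socket.
final class TcpDemuxSketch {
    private final Map<String, String> socketsByTuple = new HashMap<>();

    // Illustrative key format only; any collision-free encoding would do.
    static String key(String srcIp, int srcPort, String dstIp, int dstPort) {
        return srcIp + ":" + srcPort + "->" + dstIp + ":" + dstPort;
    }

    void accept(String srcIp, int srcPort, String dstIp, int dstPort, String socketName) {
        socketsByTuple.put(key(srcIp, srcPort, dstIp, dstPort), socketName);
    }

    // Returns null when no connection matches the segment's 4-tuple.
    String deliver(String srcIp, int srcPort, String dstIp, int dstPort) {
        return socketsByTuple.get(key(srcIp, srcPort, dstIp, dstPort));
    }

    public static void main(String[] args) {
        TcpDemuxSketch serverC = new TcpDemuxSketch();
        serverC.accept("A", 9157, "C", 80, "socket-1");
        serverC.accept("B", 9157, "C", 80, "socket-2");
        serverC.accept("B", 5775, "C", 80, "socket-3");
        // Same destination port 80, three different sockets.
        System.out.println(serverC.deliver("A", 9157, "C", 80));
        System.out.println(serverC.deliver("B", 9157, "C", 80));
        System.out.println(serverC.deliver("B", 5775, "C", 80));
    }
}
```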
11. Connection-oriented Demultiplexing (TCP)
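[Figure: Clients A and B connect to Server C on port 80. Three segments arrive, each with D-IP: C and DP: 80: one with S-IP: A, SP: 9157; one with S-IP: B, SP: 9157; one with S-IP: B, SP: 5775. Each is delivered to a different server process and socket]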
• This is a Web Server example as the segments are being sent to Port 80
of the server which corresponds to the HTTP Service
• Note that in this case, the server is creating a separate process for each
of the sockets. This would be inefficient (see next slide for a more
efficient example with “threading”)
12. Connection-oriented Demultiplexing (TCP)
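Threaded Web Server
[Figure: Clients A and B connect to Server C on port 80. Three segments arrive, each with D-IP: C and DP: 80: one with S-IP: A, SP: 9157; one with S-IP: B, SP: 9157; one with S-IP: B, SP: 5775. All are handled by a single server process, with one socket per connection]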
• This is also a Web Server example as the segments are being sent to Port
80 of the server which corresponds to the HTTP Service
• Note that in this case, the server is creating one process for all the
sockets. A new thread (kind of like a sub-process) is created for each socket
13. UDP: User Datagram Protocol
• “no frills,” “bare bones”
Internet transport
protocol
• “best effort” service, UDP
segments may be:
– lost
– delivered out of order
to application
• Connectionless:
– no handshaking between
UDP sender, receiver
– each UDP segment
handled independently
of others
Why have UDP?
• No connection
establishment (which can
add delay)
• Simple: no connection state
information kept at sender
or receiver
• Small Segment Header
• No Congestion Control: UDP
can transmit as fast as it
can
14. UDP: User Datagram Protocol
• Commonly used for
streaming multimedia
applications which tend
to be loss tolerant but
rate sensitive
• UDP also used for DNS
and SNMP
• For reliable transfer
over UDP one must add
reliability at the level
of the application layer,
e.g. application-specific
error recovery!
UDP Segment Format (each row is 32 bits wide):
Source Port # | Dest Port #
Length | Checksum
Application Data (Message)
Length: length, in bytes, of the UDP segment, including header
UDP Checksum: Standard Internet Checksum added by the sender. Used by
the receiver to check for bit errors.
(See next slide)
15. UDP Checksum Calculation
• UDP checksum covers pseudoheader followed by UDP datagram
• IP addresses are included to detect misdelivery
• Receiver recalculates the checksum and silently discards the
datagram if errors detected (i.e. no error message generated)
• Using UDP checksums is optional but hosts are required to have
checksums enabled
UDP Pseudoheader (bit positions 0, 8, 16, 31 mark the field boundaries):
Source IP Address (32 bits)
Destination IP Address (32 bits)
0 0 0 0 0 0 0 0 (8 bits) | Protocol = 17 (8 bits) | UDP Length (16 bits)
(used in checksum calculation but never actually transmitted,
nor is it included in the “Length”)
Note that IP Address information will come from another layer (Network
Layer). Strictly speaking, this goes against the philosophy of keeping the
layers separate from each other.
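A sketch of the calculation described above, assuming IPv4 addresses given as 4-byte arrays and a UDP datagram whose checksum field (bytes 6 and 7) is zero while the checksum is computed. The class and method names are hypothetical.

```java
// Sketch of the Internet checksum over pseudoheader + UDP datagram.
final class UdpChecksumSketch {
    // One's-complement sum of 16-bit big-endian words, folded to 16 bits.
    static int onesComplementSum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            int lo = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0; // pad odd length with a zero byte
            sum += (hi << 8) | lo;
        }
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16); // end-around carry
        }
        return (int) sum;
    }

    // Prepends the 12-byte pseudoheader (never transmitted) to the datagram.
    static byte[] withPseudoheader(byte[] srcIp, byte[] dstIp, byte[] udp) {
        byte[] buf = new byte[12 + udp.length];
        System.arraycopy(srcIp, 0, buf, 0, 4);
        System.arraycopy(dstIp, 0, buf, 4, 4);
        buf[8] = 0;                          // zero byte
        buf[9] = 17;                         // protocol = 17 (UDP)
        buf[10] = (byte) (udp.length >> 8);  // UDP length, high byte
        buf[11] = (byte) udp.length;         // UDP length, low byte
        System.arraycopy(udp, 0, buf, 12, udp.length);
        return buf;
    }

    // Sender side: the checksum field inside udp must be zero here.
    static int checksum(byte[] srcIp, byte[] dstIp, byte[] udp) {
        int c = ~onesComplementSum(withPseudoheader(srcIp, dstIp, udp)) & 0xFFFF;
        return c == 0 ? 0xFFFF : c; // an all-zero field means "no checksum" in UDP over IPv4
    }

    // Receiver side: with the checksum filled in, the sum must be all ones.
    static boolean verify(byte[] srcIp, byte[] dstIp, byte[] udp) {
        return onesComplementSum(withPseudoheader(srcIp, dstIp, udp)) == 0xFFFF;
    }
}
```

A receiver that gets `false` from `verify` would silently discard the datagram, as stated above.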
16. UDP Destination Port Usage
[Figure: a UDP datagram arrives from the IP layer and is demultiplexed, based on its destination port #, to Port 1, Port 2 or Port 3]
• Datagram demultiplexed to its appropriate port
• Error message sent back if the Dest. Port # indicated in the
datagram does not exist!
17. UDP Port Numbers
Well Known Port Numbers
• Universally assigned and accepted port #s providing some
designated service
• Typically, lower port numbers used for this
• Examples:
37 Time
53 Domain Name Server
67 DHCP Server
68 DHCP Client
Dynamically Assigned Port Numbers
• Ports are not globally known
• When a program needs a port, it asks for and gets one from the
network software
• Destination machine needs to be queried to find the port number
at which it may be offering the service to be accessed
• Typically higher port numbers used for this
18. TCP: Transmission Control Protocol
• full duplex data:
– bi-directional data flow in
same connection
– MSS: maximum segment
size
• connection-oriented:
– handshaking (exchange of
control msgs) initializes
sender, receiver state
before data exchange
• flow controlled:
– sender will not overwhelm
receiver
• point-to-point:
– one sender, one receiver
• reliable, in-order byte
stream:
– no “message boundaries”
inside!
• pipelined:
– TCP congestion and flow
control set window size
• send & receive buffers
[Figure: the sending application writes data through its socket door into the TCP send buffer; segments carry it to the TCP receive buffer, from which the receiving application reads data through its socket door]
Important to remember though that a TCP stream is unstructured, i.e.
no boundary marks in the stream itself so application would have to
create such boundary marks if needed (e.g. separating different fields)
19. TCP Segment Format
Each TCP segment has header of 20 or more bytes + 0 or more bytes of data
Header (each row is 32 bits wide, bit positions 0 to 31):
Source Port (16 bits) | Destination Port (16 bits)
Sequence Number (32 bits)
Acknowledgment Number (32 bits)
Header Length (4 bits) | Reserved (6 bits) | URG ACK PSH RST SYN FIN (6 bits) | Window Size (16 bits)
Checksum (16 bits) | Urgent Pointer (16 bits)
Options | Padding
Data
20. TCP Header
Port Numbers
• A socket identifies a
connection endpoint
– IP address + port
• A connection specified by a
socket pair
• Well-known ports
– FTP 20
– Telnet 23
– DNS 53
– HTTP 80
Sequence Number
• Byte count
• First byte in segment
• 32 bits long
• 0 ≤ SN ≤ 2^32 - 1
• Initial sequence number
selected during connection
setup
21. TCP Header
Acknowledgement Number
• SN of next byte expected by
receiver
• Acknowledges that all prior
bytes in stream have been
received correctly
• Valid if ACK flag is set
Header length
• 4 bits
• Length of header in multiples
of 32-bit words
• Minimum header length is 20
bytes
• Maximum header length is 60
bytes
22. TCP Header
Reserved
• 6 bits
Control
• 6 bits
• URG: urgent pointer flag
– Urgent message end = SN + urgent pointer
• ACK: ACK packet flag
• PSH: override TCP buffering
• RST: reset connection
– Upon receipt of RST, connection is
terminated and application layer notified
• SYN: establish connection
• FIN: close connection
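The fixed header fields described so far can be read out of a raw segment as in the sketch below. The class `TcpHeaderSketch` and its field names are hypothetical; the byte offsets follow the standard TCP header layout (ports at bytes 0-3, sequence number at 4-7, acknowledgment number at 8-11, header length in the top 4 bits of byte 12, control bits in the low 6 bits of byte 13).

```java
import java.nio.ByteBuffer;

// Sketch: decode the first fields of a TCP header from a byte array.
final class TcpHeaderSketch {
    final int srcPort, dstPort, headerBytes, flags;
    final long seq, ack;

    TcpHeaderSketch(byte[] segment) {
        ByteBuffer b = ByteBuffer.wrap(segment);     // big-endian (network order) by default
        srcPort = b.getShort(0) & 0xFFFF;
        dstPort = b.getShort(2) & 0xFFFF;
        seq = b.getInt(4) & 0xFFFFFFFFL;             // 32-bit unsigned sequence number
        ack = b.getInt(8) & 0xFFFFFFFFL;             // meaningful only if the ACK flag is set
        headerBytes = ((b.get(12) >> 4) & 0x0F) * 4; // 4-bit length in 32-bit words: 5..15 -> 20..60 bytes
        flags = b.get(13) & 0x3F;                    // URG ACK PSH RST SYN FIN
    }

    boolean urg() { return (flags & 0x20) != 0; }
    boolean ackFlag() { return (flags & 0x10) != 0; }
    boolean psh() { return (flags & 0x08) != 0; }
    boolean rst() { return (flags & 0x04) != 0; }
    boolean syn() { return (flags & 0x02) != 0; }
    boolean fin() { return (flags & 0x01) != 0; }

    public static void main(String[] args) {
        byte[] h = new byte[20];
        h[0] = 0x23; h[1] = (byte) 0xC5; // source port 9157
        h[3] = 0x50;                     // destination port 80
        h[6] = 0x03; h[7] = (byte) 0xE8; // sequence number 1000
        h[12] = 0x50;                    // header length = 5 words
        h[13] = 0x02;                    // SYN
        TcpHeaderSketch t = new TcpHeaderSketch(h);
        System.out.println(t.srcPort + " -> " + t.dstPort + ", seq=" + t.seq
                + ", header=" + t.headerBytes + " bytes, SYN=" + t.syn());
    }
}
```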
23. TCP Header
Window Size
• 16 bits to advertise window
size
• Used for flow control
• The host advertising the window will accept bytes with
SN from ACK to ACK + window - 1
• Maximum window size is
65535 bytes
TCP Checksum
• Internet checksum method
• Computed over
TCP pseudo header + TCP
segment (header+ data)
(See next slide for TCP pseudo
header)
24. TCP Pseudo Header
(for checksum calculation)
TCP Pseudo Header (bit positions 0, 8, 16, 31 mark the field boundaries):
Source IP address (32 bits)
Destination IP address (32 bits)
0 0 0 0 0 0 0 0 (8 bits) | Protocol = 6 (8 bits) | TCP Segment Length (16 bits)
Used in checksum calculation but never actually transmitted,
nor is it included in the “Length”
Usage similar to that of the UDP Pseudoheader
25. TCP Header
Options
• Variable length
• NOP (No Operation) option is
used to pad TCP header to
multiple of 32 bits
• Time stamp option is used
for round trip measurements
Options
• Maximum Segment Size
(MSS) option specifies
largest segment a receiver
wants to receive
• Window Scale option
scales the 16-bit window
field by a factor of up to 2^14
26. TCP Services
• Provides a full duplex connection-oriented and reliable byte-
stream service using a sliding-window flow control.
• User data are broken into segments not exceeding 64 kbytes
(usually about 1500 bytes) and sent to the destination by
encapsulating them in IP datagrams
– IP provides unreliable packet delivery
– packets can get lost, duplicated or delivered out of
sequence
• Receiver sends an acknowledgment back after receiving a
segment.
• Retransmission of segment if necessary
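27. TCP Services
[Figure: the sending application hands data to a transport-layer buffer, which sends it as segments to the receiver's buffer, from which the receiving application reads. The receiver has buffer available = B and advertises a window size < B. ACKs flow back to the sender and feed its RTT estimation]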
28. TCP Round Trip Time and Timeout
How to set TCP timeout value?
• Must be set longer than the RTT, but the RTT itself varies
• If it is set too short, premature timeouts occur, leading to
unnecessary retransmissions
• If it is set too long, the response to segment loss will be
too slow
29. TCP Round Trip Time and Timeout
How is the RTT estimated?
• SampleRTT = measured time from segment transmission
until the ACK for that segment is received, ignoring retransmissions
• SampleRTT will fluctuate but we want the estimated RTT
to be “smoother”. This is done by taking a moving average
EstimatedRTT over recent measurements, rather than just
using the current SampleRTT
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
• Exponential weighted moving average
• Influence of a past sample decreases exponentially fast
• Typical value: α = 0.125
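Why the influence of a past sample decreases exponentially can be seen by unrolling the recursion. In the worked equation below, E_n is EstimatedRTT after the n-th sample, S_n is the n-th SampleRTT, α is the weight (typically 0.125) and E_0 is the initial estimate; these symbols are shorthand introduced here.

```latex
\begin{align*}
E_n &= (1-\alpha)\,E_{n-1} + \alpha\,S_n \\
    &= \alpha\,S_n + \alpha(1-\alpha)\,S_{n-1} + \alpha(1-\alpha)^2\,S_{n-2}
       + \dots + (1-\alpha)^n\,E_0
\end{align*}
```

A sample that is k measurements old therefore carries weight α(1-α)^k. With α = 0.125, a sample 8 measurements old counts for about a third as much as the newest one, since 0.875^8 ≈ 0.34.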
30. RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
[Figure: example measurements of SampleRTT and EstimatedRTT; x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350]
31. TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• First estimate how much SampleRTT deviates from
EstimatedRTT:
DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
• Then set the timeout interval as:
TimeoutInterval = EstimatedRTT + 4 * DevRTT
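A minimal sketch of the timeout computation, using weights 0.125 for the RTT average and 0.25 for the deviation. Two details are assumptions not stated on the slide: the estimator is seeded with the first sample and half of it as the deviation, and the deviation is updated before the average so that it uses the previous EstimatedRTT (both follow common practice). The class name is hypothetical.

```java
// Sketch of EstimatedRTT / DevRTT / TimeoutInterval bookkeeping (times in ms).
final class RttEstimatorSketch {
    static final double ALPHA = 0.125; // weight of the newest sample in EstimatedRTT
    static final double BETA = 0.25;   // weight of the newest deviation in DevRTT

    double estimatedRtt;
    double devRtt;

    // Assumption: seed with the first sample and half of it as deviation.
    RttEstimatorSketch(double firstSampleMs) {
        estimatedRtt = firstSampleMs;
        devRtt = firstSampleMs / 2;
    }

    void addSample(double sampleRtt) {
        // Deviation first, so it is measured against the previous estimate.
        devRtt = (1 - BETA) * devRtt + BETA * Math.abs(sampleRtt - estimatedRtt);
        estimatedRtt = (1 - ALPHA) * estimatedRtt + ALPHA * sampleRtt;
    }

    double timeoutInterval() {
        return estimatedRtt + 4 * devRtt;
    }

    public static void main(String[] args) {
        RttEstimatorSketch e = new RttEstimatorSketch(100);
        for (double sample : new double[] {100, 200, 110, 105}) {
            e.addSample(sample);
            System.out.printf("sample=%.0f estimated=%.2f timeout=%.2f%n",
                    sample, e.estimatedRtt, e.timeoutInterval());
        }
    }
}
```

A single outlier (the 200 ms sample) moves EstimatedRTT only slightly but widens the safety margin noticeably.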
32. TCP Connection Establishment
• Three-Way Handshake
– A sends a SYN segment specifying the port number of
the other party B, the initial sequence number (ISN)
that A will use, and other info (e.g. max. segment size)
– B responds with its own SYN segment containing its ISN.
B also acknowledges A’s SYN by ACKing A’s ISN plus one
– A acknowledges B’s SYN by ACKing B’s ISN plus one
• Initial Sequence Number (ISN) may be randomly chosen but
with some important considerations
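The sequence and acknowledgment numbers of the three segments can be traced with the sketch below. The ISN values and the class `HandshakeSketch` are made up for illustration; "plus one" is taken modulo 2^32 because sequence numbers are 32 bits long.

```java
// Sketch: numbers carried by the three segments of the handshake.
final class HandshakeSketch {
    // Sequence numbers wrap around at 2^32.
    static long next(long seq) {
        return (seq + 1) & 0xFFFFFFFFL;
    }

    static String[] threeWayHandshake(long isnA, long isnB) {
        return new String[] {
            "A->B SYN seq=" + isnA,
            "B->A SYN+ACK seq=" + isnB + " ack=" + next(isnA),
            "A->B ACK seq=" + next(isnA) + " ack=" + next(isnB)
        };
    }

    public static void main(String[] args) {
        for (String segment : threeWayHandshake(1000, 5000)) {
            System.out.println(segment);
        }
    }
}
```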
33. Initial Sequence Number (ISN)
• Select initial sequence numbers (ISN) to protect against
segments from prior connections (that may circulate in the
network and arrive at a much later time)
• Select ISN to avoid overlap with sequence numbers of prior
connections
• Use local clock to select the ISN
• Time for clock to go through a full cycle should be greater than
the maximum lifetime of a segment (MSL); typically MSL = 120
seconds
• High-bandwidth connections pose a problem because they run
through the sequence number space quickly
34. Three Way Handshake
(TCP Connection Setup)
[Figure: SYN, SYN + ACK and ACK segments exchanged between Host A and Host B]
Careful choice of the ISN protects against responding falsely to old segments from prior connections
35. Maximum Segment Size
• Maximum Segment Size (MSS) - largest block of data that TCP
sends to other end
• Each end can announce its MSS during connection establishment
• Default MSS is 536 bytes: a 576-byte IP datagram minus 20 bytes
for IP header and 20 bytes for TCP header
• Slight difference between the MSS of Ethernet and IEEE
802.3:
Ethernet MSS = 1460 bytes
IEEE 802.3 MSS = 1452 bytes
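The figures above follow from subtracting the two 20-byte headers from the largest IP datagram the link can carry. The sketch assumes no IP or TCP options, an Ethernet payload limit of 1500 bytes, and an IEEE 802.3 limit of 1492 bytes (8 bytes go to the LLC/SNAP header); the class name is hypothetical.

```java
// Sketch: MSS = largest IP datagram - IP header - TCP header.
final class MssSketch {
    static final int IP_HEADER_BYTES = 20;  // without options
    static final int TCP_HEADER_BYTES = 20; // without options

    static int mssForDatagramSize(int datagramBytes) {
        return datagramBytes - IP_HEADER_BYTES - TCP_HEADER_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("Default:    " + mssForDatagramSize(576));
        System.out.println("Ethernet:   " + mssForDatagramSize(1500));
        System.out.println("IEEE 802.3: " + mssForDatagramSize(1492));
    }
}
```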
36. TCP Window Flow Control
[Figure: timeline from t0 to t4 of segments and ACKs between Host A and Host B. Win = advertised window size. Annotations: 1024 bytes to transmit (four times); 128 bytes to transmit; only 512 bytes sent, as that is the advertised value of Win]
37. Nagle Algorithm
• Situation: User types one character at a time
– Transmitter sends TCP segment per character (41B)
– Receiver sends ACK (40B)
– Receiver echoes received character (41B)
– Transmitter ACKs echo (40 B)
– 162 bytes transmitted to transfer one character! Problem!
• Solution:
– TCP sends data & waits for ACK
– New characters buffered
– Send new characters when ACK arrives
– Algorithm adjusts to RTT as follows -
• Short RTT: send frequently at low efficiency
• Long RTT: send less frequently at greater efficiency
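The effect of the RTT can be seen in a small simulation. It is a simplified model, not the full algorithm: one character per keystroke, a fixed RTT, an ACK arriving exactly one RTT after each segment is sent, and no MSS limit. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: how many characters each segment carries under Nagle-style buffering.
final class NagleSketch {
    static List<Integer> segmentSizes(long[] typedAtMs, long rttMs) {
        List<Integer> sizes = new ArrayList<>();
        boolean outstanding = false; // is a segment waiting for its ACK?
        long ackDueAt = 0;           // arrival time of that ACK
        int buffered = 0;            // characters typed while waiting

        for (long t : typedAtMs) {
            // Process ACKs that arrived before this keystroke.
            while (outstanding && ackDueAt <= t) {
                if (buffered > 0) {
                    sizes.add(buffered);  // send everything buffered as one segment
                    buffered = 0;
                    ackDueAt += rttMs;    // that segment is now outstanding
                } else {
                    outstanding = false;  // nothing to send, connection idle
                }
            }
            if (!outstanding) {
                sizes.add(1);             // idle: send the character at once
                outstanding = true;
                ackDueAt = t + rttMs;
            } else {
                buffered++;               // waiting for ACK: buffer it
            }
        }
        if (buffered > 0) {
            sizes.add(buffered);          // flushed when the last ACK arrives
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] keystrokes = {0, 50, 100, 150, 200, 250};
        System.out.println("RTT 20 ms:  " + segmentSizes(keystrokes, 20));
        System.out.println("RTT 200 ms: " + segmentSizes(keystrokes, 200));
    }
}
```

With a short RTT every character goes out in its own segment; with a long RTT the same six keystrokes are carried in three segments.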
38. Silly Window Syndrome
• Situation:
– Transmitter sends large amount of data
– Receiver’s buffer is depleted slowly, so buffer fills up
– Every time a few bytes read from buffer, a new advertisement to
transmitter is generated
– Sender immediately sends data & fills buffer
– This leads to many small, inefficient segments being transmitted
• Solution:
– Receiver does not advertise window until window is at least ½ of
receiver buffer or is equal to the maximum segment size (MSS)
– Transmitter refrains from sending small segments
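The receiver-side rule can be written as a one-line check. This sketch reads the rule as "advertise only once the free space reaches the smaller of half the buffer and one MSS", and advertises a zero window otherwise; that reading and the names used are assumptions.

```java
// Sketch: receiver-side silly window avoidance.
final class SillyWindowSketch {
    // Returns the window to advertise given the free space in the receive buffer.
    static int advertisedWindow(int freeBytes, int bufferBytes, int mssBytes) {
        int threshold = Math.min(bufferBytes / 2, mssBytes);
        return freeBytes >= threshold ? freeBytes : 0;
    }

    public static void main(String[] args) {
        // 8 KB buffer, MSS 1460: a few freed bytes are not advertised.
        System.out.println(advertisedWindow(100, 8192, 1460));
        System.out.println(advertisedWindow(1460, 8192, 1460));
    }
}
```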
39. Sequence Number Wraparound
(Potential problem at high data rates)
• 2^32 = 4.29 x 10^9 bytes = 34.3 x 10^9 bits (TCP has 32-bit seq. no.)
Therefore, at 1 Gbps, sequence numbers will wrap around in just 34.3
seconds, so the transmitter can only transmit for very brief periods
before sequence numbers repeat
Solution: Use Timestamp Option in TCP option field. Transmitter inserts
32-bit timestamp in transmitted segment. Receiver echoes this in ACK.
This option must be requested in the SYN segment and is negotiated
during the Connection Setup.
– Timestamp + sequence no → 64-bit seq. no (effectively a
much larger sequence number than the original 32-bit)
– Timestamp clock must:
• Tick forward at least once every 2^31 bits
• Not complete cycle in less than one MSL
• Example: clock tick every 1 ms @ 8 Tbps wraps around in 25
days
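The wraparound time is simply the size of the sequence space in bits divided by the bit rate. A short sketch, with hypothetical names:

```java
// Sketch: time for a 32-bit byte-sequence space to wrap at a given bit rate.
final class SeqWrapSketch {
    static double wrapSeconds(double bitsPerSecond) {
        double sequenceSpaceBits = Math.pow(2, 32) * 8; // 2^32 bytes, 8 bits each
        return sequenceSpaceBits / bitsPerSecond;
    }

    public static void main(String[] args) {
        System.out.printf("10 Mbps: %.0f s%n", wrapSeconds(10e6));
        System.out.printf("1 Gbps:  %.1f s%n", wrapSeconds(1e9));
        System.out.printf("10 Gbps: %.2f s%n", wrapSeconds(10e9));
    }
}
```

At 10 Mbps the wrap takes close to an hour, well above an MSL of 120 seconds; at 1 Gbps it takes about 34 seconds, well below it.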
40. Delay-BW Product & Advertised Window Size
• Suppose RTT = 100 ms, R = 2.4 Gbps; then
amount of data in pipe = 240 Mbits = 30 Mbytes
• If a single TCP process occupies the pipe, then required
advertised window size is RTT x Bit rate = 30 Mbytes
• But, normal maximum window size is only 65535 bytes, which
is clearly inadequate
• Solution: Use the “Window Scale Option” which will allow the
window to be scaled upward by a factor of up to 2^14. Then a Window
Size up to 65535 x 2^14 = 1 Gbyte will be allowed. This window
scaling option must be requested in the SYN segment and is
negotiated during the Connection Setup.
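The arithmetic can be checked with the sketch below: 0.1 s x 2.4 Gbps = 240 Mbits, i.e. 30 Mbytes in the pipe. The helper that picks the smallest scale factor covering a given window is an illustration only; the class and method names are hypothetical.

```java
// Sketch: delay-bandwidth product and the window scale shift it calls for.
final class WindowScaleSketch {
    static final int MAX_SHIFT = 14; // window may be scaled by at most 2^14

    static long pipeBytes(double rttSeconds, double bitsPerSecond) {
        return Math.round(rttSeconds * bitsPerSecond / 8);
    }

    // Smallest shift such that 65535 * 2^shift covers the wanted window
    // (capped at MAX_SHIFT if even the largest window is too small).
    static int shiftFor(long windowBytes) {
        int shift = 0;
        while (shift < MAX_SHIFT && (65535L << shift) < windowBytes) {
            shift++;
        }
        return shift;
    }

    public static void main(String[] args) {
        long pipe = pipeBytes(0.100, 2.4e9);
        System.out.println("Bytes in pipe:     " + pipe);
        System.out.println("Scale shift:       " + shiftFor(pipe));
        System.out.println("Max scaled window: " + (65535L << MAX_SHIFT));
    }
}
```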
41. Closing a TCP Connection (Graceful Close)
[Figure: FIN and ACK segments exchanged between Host A and Host B, in the order listed below]
• Host A initiates the TCP connection termination, sends its FIN
• Host B sends ACK but does not yet send its own FIN
• Host B still delivers 150 bytes
• Host B now sends its own FIN
• Host A ACKs B’s FIN, closing its side of the connection
• Host B gets A’s ACK and closes its side of the connection
After sending FIN, Host A cannot send any more data but
cannot close the connection as B may still be sending something
42. TIME_WAIT state
The TIME_WAIT state is entered by the host that initiated the close
(e.g. Host A in previous slide) once it has received the other side’s
FIN and sent the final ACK
This protects future incarnations of connection from delayed
segments
TIME_WAIT = 2 x MSL
Maximum Segment Lifetime (MSL) is the maximum time that an
IP packet can live in the network
Only valid segment that can arrive while in TIME_WAIT
state is a FIN retransmission. If such a segment arrives,
resend ACK & restart TIME_WAIT timer
When timer expires, close TCP connection
44. Congestion Control in TCP
• Advertised window size is used to ensure that receiver’s buffer will not
overflow
• However, buffers at intermediate routers between source and
destination may still overflow (i.e. because of network congestion)
[Figure: packet flows from many sources come in to a router whose outgoing link runs at R bps]
Congestion occurs when total arrival rate from all packet flows exceeds
R over a sustained period of time
When congestion occurs, buffers at routers will fill and packets will be
lost
45. Different Phases of Congestion Behavior
1. Light traffic
– Arrival Rate << R
– Low delay
– Can accommodate more
2. Knee (congestion onset)
– Arrival rate approaches R
– Delay increases rapidly
– Throughput begins to
saturate
3. Congestion collapse
– Arrival rate > R
– Large delays, packet loss
– Useful application
throughput drops
[Figure: two plots against arrival rate, one of throughput (bps) and one of delay (sec), with R marked on the axes]
46. Window Congestion Control
• (From previous slide) Desired operating point will be just
before knee as shown there. Sources must control their sending
rates so that aggregate arrival rate is just before knee
• TCP sender maintains a congestion window “cwnd” to control
congestion at intermediate routers
• Effective window is minimum of congestion window and
advertised window
• Problem: The source does not know what its “fair” share of
available bandwidth should be and so does not know what value
to set for cwnd
• Solution: Adjust cwnd dynamically to available BW as follows
– Sources probe the network by gradually increasing cwnd
(Initially set cwnd to a low value)
– When congestion detected, sources reduce rate
– Ideally, sources’ sending rate will stabilize near ideal point
47. Congestion Window
How does the TCP congestion algorithm change congestion
window dynamically according to the most up-to-date state of
the network?
• At light traffic: each segment is ACKed quickly
– Increase cwnd aggressively
• At knee: segment ACKs arrive, but more slowly
– Slow down increase in cwnd
• At congestion: segments encounter large delays (so
retransmission timeouts occur); segments get dropped in
router buffers
– Reduce transmission rate, then probe again
48. TCP Congestion Control: Slow Start
Slow Start: Increase congestion window size by one segment upon
receiving an ACK from receiver
– initialized at 2 segments
– used at (re)start of data transfer
– congestion window increases exponentially
[Figure: segments (Seg) and ACKs exchanged over successive RTTs; cwnd grows 1, 2, 4, 8]
49. TCP Congestion Control: Congestion Avoidance
• Algorithm progressively sets
a congestion threshold
– When cwnd > threshold,
slow down rate at which
cwnd is increased
• Increase congestion window
size by one segment per
round-trip-time (RTT)
– Each time an ACK
arrives, cwnd is
increased by 1/cwnd
– In one RTT, cwnd
segments are sent, so
total increase in cwnd is
cwnd x 1/cwnd = 1
– cwnd grows linearly with
time
[Figure: cwnd vs. RTTs; cwnd grows 1, 2, 4, 8 up to the threshold, then linearly]
50. TCP Congestion Control: Congestion
• Congestion is detected upon
timeout or receipt of
duplicate ACKs
• Assume current cwnd
corresponds to available
bandwidth
• Adjust congestion threshold
= ½ x current cwnd
• Reset cwnd to 1
• Go back to slow-start
• Over several cycles expect
to converge to congestion
threshold equal to about ½
the available bandwidth
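The three phases (slow start, congestion avoidance, reaction to a timeout) can be traced round by round with the sketch below. It is a simplified model: cwnd is counted in whole segments, starts at 1 as in the 1, 2, 4, 8 progression, is updated once per RTT, and a single timeout is injected at a chosen round. The class and parameter names are hypothetical.

```java
import java.util.Arrays;

// Sketch: cwnd per round-trip time with slow start, congestion avoidance
// and one timeout.
final class CwndSketch {
    static int[] cwndPerRtt(int rounds, int initialThreshold, int timeoutRound) {
        int[] trace = new int[rounds];
        int cwnd = 1;
        int threshold = initialThreshold;
        for (int r = 0; r < rounds; r++) {
            trace[r] = cwnd;
            if (r == timeoutRound) {
                threshold = Math.max(cwnd / 2, 1);     // threshold = 1/2 x current cwnd
                cwnd = 1;                              // back to slow start
            } else if (cwnd < threshold) {
                cwnd = Math.min(cwnd * 2, threshold);  // slow start: doubles each RTT
            } else {
                cwnd += 1;                             // congestion avoidance: +1 per RTT
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        // Threshold 16, timeout detected in round 7 (0-based).
        System.out.println(Arrays.toString(cwndPerRtt(14, 16, 7)));
    }
}
```

After the timeout at cwnd = 19 the threshold drops to 9, so the second slow start stops doubling at 9 and growth becomes linear again.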
[Figure: congestion window (0 to 20) vs. round-trip times, showing slow start, congestion avoidance, a time-out and the threshold]
51. Fast Retransmit & Fast Recovery
• Congestion causes many segments to be
dropped
• If only a single segment is dropped, then
subsequent segments trigger duplicate ACKs
before timeout (as shown)
• Can avoid large decrease in cwnd as follows:
– When three duplicate ACKs arrive,
retransmit lost segment immediately
– Reset congestion threshold to ½ cwnd
– Reset cwnd to congestion threshold + 3
to account for the three segments that
triggered duplicate ACKs
– Remain in congestion avoidance phase
– However if timeout expires, reset cwnd
to 1
– In absence of timeouts, cwnd will
oscillate around optimal value
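The rules above can be sketched as a small state holder. It implements only what is listed (threshold = ½ cwnd and cwnd = threshold + 3 on the third duplicate ACK, cwnd = 1 on timeout) and leaves out any further window adjustment once new data is acknowledged. The class and method names are hypothetical, and cwnd is counted in segments.

```java
// Sketch: reaction to duplicate ACKs and timeouts.
final class FastRecoverySketch {
    int cwnd;
    int threshold;
    int dupAcks;

    FastRecoverySketch(int cwnd, int threshold) {
        this.cwnd = cwnd;
        this.threshold = threshold;
    }

    // Returns true when the lost segment should be retransmitted immediately.
    boolean onDuplicateAck() {
        dupAcks++;
        if (dupAcks == 3) {
            threshold = Math.max(cwnd / 2, 1);
            cwnd = threshold + 3; // the three segments that triggered the duplicate ACKs
            return true;
        }
        return false;
    }

    // An ACK for new data ends the run of duplicates.
    void onNewAck() {
        dupAcks = 0;
    }

    void onTimeout() {
        threshold = Math.max(cwnd / 2, 1);
        cwnd = 1;
        dupAcks = 0;
    }

    public static void main(String[] args) {
        FastRecoverySketch s = new FastRecoverySketch(20, 16);
        s.onDuplicateAck();
        s.onDuplicateAck();
        boolean retransmit = s.onDuplicateAck();
        System.out.println("retransmit=" + retransmit
                + " threshold=" + s.threshold + " cwnd=" + s.cwnd);
    }
}
```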
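[Figure: sender transmits SN=1 to SN=5; the receiver returns ACK=2 four times, once for SN=1 and then as a duplicate for each of SN=3, SN=4 and SN=5, indicating that SN=2 was lost]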
52. TCP Congestion Control: Fast Retransmit & Fast Recovery
[Figure: congestion window (0 to 20) vs. time (in units of RTT), showing slow start, congestion avoidance, a time-out and the threshold]