RESOURCE
ALLOCATION
IN
COMPUTER
NETWORKS
joão taveira araújo
@jta
A S S U M P T I O N
“How do you share a network?”
the question
TCP
an answer
(maybe)
A S S U M P T I O N
given an answer
never worked through question
can’t fully understand
T H I S T A L K
‘62
‘74
‘88
‘07
different interpretations
of the same question
foundational papers
O B J E C T I V E S
how did we get here
what assumptions
at what cost
T H I S T A L K
Let us consider the synthesis of a communication
network which will allow several hundred major
communications stations to talk with one another
after an enemy attack. As a criterion of survivability
we elect to use the percentage of stations both
surviving the physical attack and remaining in
electrical connection with the largest single group of
surviving stations. This criterion is chosen as a
conservative measure of the ability of the surviving
stations to operate together as a coherent entity after
the attack. This means that small groups of stations
isolated from the single largest group are considered
to be ineffective.
Although one can draw a wide variety of networks,
they all factor into two components: centralized (or
star) and distributed (or grid or mesh) (see Fig. 1).
The centralized network is obviously vulnerable as
destruction of a single central node destroys
communication between the end stations. In practice,
a mixture of star and mesh components is used to
form communications networks. For example, type
(b) in Fig. 1 shows the hierarchical structure of a set
Paul Baran ’62, On Distributed Communications Networks
INTRODUCTION
“Let us consider the synthesis of
a communication network which
will allow several hundred major
communications stations to talk
with one another after an enemy
attack.”
Paul Baran ’62, On Distributed Communications Networks
Each node and link in the array of Fig. 2 has the capacity and the switching flexibility to allow transmission between any ith station and any jth station, provided a path can be drawn from the ith to the jth station.
Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected.
Node Destruction
Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered.
To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000-weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already been destroyed.
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
“(…) destruction caused by
conventional hardware failure,
the failures would be randomly
distributed through the network.
But, if the disturbance were
caused by enemy attack, the
possible "worst cases" must be
considered.”
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
“To bisect a 32-link network requires
direction of 288 weapons each with
a probability of kill, pk = 0.5, or 160
with a pk = 0.7, to produce over an
0.9 probability of successfully
bisecting the network.”
4. First, extremely survivable networks can be built using a moderately low redundancy of connectivity level. Redundancy levels on the order of only three permit withstanding extremely heavy level attacks with negligible additional loss to communications. Secondly, the survivability curves have sharp break-points. A network of this type will withstand an increasing attack level until a certain point is reached, beyond which the network rapidly deteriorates. Thus, the optimum degree of redundancy can be chosen as a function of the expected level of attack. Further redundancy buys little. The redundancy level required to survive even very heavy attacks is not great--on the order of only three or four times that of the minimum span network.
Link Destruction
In the previous example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network.
Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time--assuming all nodes are working.
Combination Link and Node Destruction
The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the effect of link failure upon a network having 40 per cent of its nodes destroyed. It appears that what would today be regarded as an unreliable link can be used in a distributed network almost as effectively as perfectly reliable links. Figure 7 examines the result of 100 trial cases in order to estimate the probability density distribution of system performance for a mixture of node and link failures. This is the distribution of cases for 20 per cent nodal damage and 35 per cent link damage.
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
We will soon be living in an era in which we cannot
guarantee survivability of any single point. However,
we can still design systems in which system
destruction requires the enemy to pay the price of
destroying n of n stations. If n is made sufficiently
large, it can be shown that highly survivable system
structures can be built - even in the thermonuclear
era. In order to build such networks and systems we
will have to use a large number of elements. We are
interested in knowing how inexpensive these
elements may be and still permit the system to
operate reliably. There is a strong relationship
between element cost and element reliability. To
design a system that must anticipate a worst-case
destruction of both enemy attack and normal system
failures, one can combine the failures expected by
enemy attack together with the failures caused by
normal reliability problems, provided the enemy does
not know which elements are inoperative. Our future
systems design problem is that of building very
reliable systems out of the described set of unreliable
elements at lowest cost. In choosing the
communications links of the future, digital links
appear increasingly attractive by permitting low-cost
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“(…) highly survivable system
structures can be built - even in
the thermonuclear era.”
“(…) have to use a large number
of elements. We are interested
in knowing how inexpensive
these elements may be”
high data rate links in emergencies.[2]
Satellites
The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described.
Variable Data Rate Links
In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently.
Variable Data Rate Users
We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary.
We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands.
Common User
In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system--particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system.
Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected.
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“more economical to share a
common (…) resource optimized
for the handling of data”
Standard Message Block
Present common carrier communications networks,
used for digital transmission, use links and concepts
originally designed for another purpose--voice. These
systems are built around a frequency division
multiplexing link-to-link interface standard. The
standard between links is that of data rate. Time
division multiplexing appears so natural to data
transmission that we might wish to consider an
alternative approach--a standardized message block as
a network interface standard. While a standardized
message block is common in many computer-
communications applications, no serious attempt has
ever been made to use it as a universal standard. A
universally standardized message block would be
composed of perhaps 1024 bits. Most of the message
block would be reserved for whatever type data is to
be transmitted, while the remainder would contain
housekeeping information such as error detection and
routing data, as in Fig. 10.
As we move to the future, there appears to be an
increasing need for a standardized message block for
all-digital communications networks. As data rates
increase, the velocity of propagation over long links
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“Time division multiplexing
appears so natural to data
transmission that we might wish
to consider an alternative
approach - a standardized
message block”
Telecommunications textbooks:
calls arrive according to a Poisson distribution
“How do you share a network?”
priority marking
(defense contractor)
IP type of service field
Act I
AN EXERCISE FOR THE READER
A R P A N E T
Act II
Scientific Positivism
A protocol that supports the sharing of resources that
exist in different packet switching networks is
presented. The protocol provides for variation in
individual network packet sizes, transmission failures,
sequencing, flow control, end-to-end error checking,
and the creation and destruction of logical process-
to-process connections. Some implementation issues
are considered, and problems such as internetwork
routing, accounting, and timeouts are exposed.
In the last few years considerable effort has been
expended on the design and implementation of
packet switching networks [1]-[7],[14],[17]. A principle
reason for developing such networks has been to
facilitate the sharing of computer resources. A packet
communication network includes a transportation
mechanism for delivering data between computers or
between computers and terminals. To make the data
meaningful, computers and terminals share a common
protocol (i.e., a set of agreed upon conventions).
Several protocols have already been developed for this
purpose [8]-[12],[16]. However, these protocols have
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
ABSTRACT
INTRODUCTION
“A protocol that supports the
sharing of resources that exist in
different packet switching
networks is presented.”
packet fragmentation
transmission failures
sequencing
flow control
error checking
connection setup
[Figures from the paper, garbled by extraction. Recoverable captions:]
Fig. 2. Three networks interconnected by two GATEWAYS.
Fig. 3. Internetwork packet format (fields not shown to scale): local header, source, destination, sequence no., byte count, flag field, text, check.
Fig. 5. Creation of segments and packets from messages.
Fig. 6. Segment format (process header and text): source port, destination port, window, ACK.
Fig. 7. Assignment of sequence numbers.
Fig. 8. Internetwork header flag field (End of Message, End of Segment, Release Use of Process/Port, Synchronize to Packet Sequence Number).
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
wat?!?
SEQ and SYN in internetwork header
if there’s an internetwork header, and a
process header, what the hell is TCP?
We suppose that processes wish to communicate in
full duplex with their correspondents using
unbounded but finite length messages. A single
character might constitute the text of a message from
a process to a terminal or vice versa. An entire page of
characters might constitute the text of a message
from a file to a process. A data stream (e.g. a
continuously generated bit string) can be represented
as a sequence of finite length messages.
Within a HOST we assume the existence of a
transmission control program (TCP) which handles
the transmission and acceptance of messages on
behalf of the processes it serves. The TCP is in turn
served by one or more packet switches connected to
the HOST in which the TCP resides. Processes that
want to communicate present messages to the TCP
for transmission, and TCP’s deliver incoming
messages to the appropriate destination processes.
We allow the TCP to break up messages into
segments because the destination may restrict the
amount of data that may arrive, because the local
network may limit the maximum transmission size, or
because the TCP may need to share its resources
among many processes concurrently. Furthermore, we
constrain the length of a segment to an integral
number of 8-bit bytes. This uniformity is most helpful
in simplifying the software needed with HOST
machines of different natural word lengths.
Provision at the process level can be made for
padding a message that is not an integral number of
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
PROCESS LEVEL COMMUNICATION
“Within a HOST we assume the
existence of a transmission
control program (TCP) which
handles transmission”
TCP is a userspace networking stack
SEQ and SYN in internetwork header
No transmission can be 100 percent reliable. We
propose a timeout and positive acknowledgement
mechanism which will allow TCP’s to recover from
packet losses from one HOST to another. A TCP
transmits packets and waits for replies
(acknowledgements) that are carried in the reverse
packet stream. If no acknowledgement for a
particular packet is received, the TCP will retransmit.
It is our expectation that the HOST level
retransmission mechanism, which is described in the
following paragraphs, will not be called upon very
often in practice. Evidence already exists that
individual networks can be effectively constructed
without this feature. However, the inclusion of a
HOST retransmission capability makes it possible to
recover from occasional network problems and allows
a wide range of HOST protocol strategies to be
incorporated. We envision it will occasionally be
invoked to allow HOST accommodation to
infrequent overdemands for limited buffer resources,
and otherwise not used much.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
RETRANSMISSION AND DUPLICATE DETECTION
“No transmission can be 100
percent reliable.”
“retransmission (…) will not be
called upon very often in
practice. Evidence already exists
that individual networks can be
effectively constructed without
this feature.”
TCP is a userspace networking stack
SEQ and SYN in internetwork header
retransmissions are pathological
Any retransmission policy requires some means by
which the receiver can detect duplicate arrivals. Even
if an infinite number of distinct packet sequence
numbers were available, the receiver would still have
the problem of knowing how long to remember
previously received packets in order to detect
duplicates. Matters are complicated by the fact that
only a finite number of distinct sequence numbers are
in fact available, and if they are reused, the receiver
must be able to distinguish between new
transmissions and retransmissions.
A window strategy, similar to that used by the French
CYCLADES system (voie virtuelle transmission
mode [8]) and the ARPANET very distant HOST
connection [18]), is proposed here (see Fig. 10).
Suppose that the sequence number field in the
internetwork header permits sequence numbers to
range from 0 to n − 1. We assume that the sender will
not transmit more than w bytes without receiving an
acknowledgment. The w bytes serve as the window
(see Fig. 11). Clearly, w must be less than n. The rules
for sender and receiver are as follows.
Sender: Let L be the sequence number associated
with the left window edge.
1) The sender transmits bytes from segments whose
text lies between L and up to L + w − 1.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
RETRANSMISSION AND DUPLICATE DETECTION
[Figures from the paper, garbled by extraction. Recoverable captions:]
Fig. 10. The window concept.
Fig. 11. Conceptual TCB format.
On retransmission, the same packet might be broken
into three 200-byte packets going through a different
HOST. Since each byte has a sequence number, there
is no confusion at the receiving TCP. We leave for
later the issue of initially synchronizing the sender
and receiver left window edges and the window size.
Every segment that arrives at the destination TCP is
ultimately acknowledged by returning the sequence
number of the next segment which must be passed to
the process (it may not yet have arrived).
Earlier we described the use of a sequence number
space and window to aid in duplicate detection.
Acknowledgments are carried in the process header
(see Fig. 6) and along with them there is provision for
a “suggested window” which the receiver can use to
control the flow of data from the sender. This is
intended to be the main component of the process
flow control mechanism. The receiver is free to vary
the window size according to any algorithm it desires
so long as the window size never exceeds half the
sequence number space.
This flow control mechanism is exceedingly powerful
and flexible and does not suffer from synchronization
troubles that may be encountered by incremental
buffer allocation schemes [9], [10]. However, it relies
heavily on an effective retransmission strategy. The
receiver can reduce the window even while packets
are en route from the sender whose window is
presently larger. The net effect of this reduction will
be that the receiver may discard incoming packets
(they may be outside the window) and reiterate the
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
“a ‘suggested window’ which
the receiver can use to control
the flow of data from the sender.
This is intended to be the main
component of the process flow
control mechanism.”
FLOW CONTROL
TCP is a userspace networking stack
SEQ and SYN in internetwork header
retransmissions are pathological
the resource is the host
no UDP
“How do you share a network?”
flow control
(systems engineer)
x.25
flow control
diagnostics
connection setup
hop-by-hop reliability
Act II
Scientific Positivism
Act III
HARSH, BITTER REALITY
The authors wish to thank a number of colleagues for
helpful comments during early discussions of
international network protocols, especially R.
Metcalfe, R. Scantlebury, D. Walden, and H.
Zimmerman; D. Davies and L. Pouzin who
constructively commented on the fragmentation and
accounting issues; and S. Crocker who commented on
the creation and destruction of associations.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
ACKNOWLEDGEMENTS
“The authors wish to thank (…)
especially R. Metcalfe (…)”
BOB
METCALFE
what if instead
of all this…
x.25
flow control
diagnostics
connection setup
hop-by-hop reliability
???
…i did nothing?
ethernet
R F C 1 2 9 6
[Chart (RFC 1296): number of Internet hosts by year, 1981-1987, rising from a few hundred toward 30,000.]
In October of '86, the Internet had the first of what
became a series of 'congestion collapses'. During this
period, the data throughput from LBL to UC
Berkeley (sites separated by 400 yards and three IMP
hops) dropped from 32 Kbps to 40 bps. Mike Karels
and I were fascinated by this sudden factor-of-
thousand drop in bandwidth and embarked on an
investigation of why things had gotten so bad. We
wondered, in particular, if the 4.3BSD (Berkeley
UNIX) TCP was mis-behaving or if it could be tuned
to work better under abysmal network conditions.
The answer to both of these questions was "yes".
Since that time, we have put seven new algorithms
into the 4BSD TCP:
(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn's clamped retransmit backoff
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
Van Jacobson‘88 Congestion Avoidance and Control
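The first two items on that list can be made concrete in a few lines. The sketch below estimates the RTT mean and deviation and backs the retransmit timer off exponentially; the gains of 1/8 and 1/4 and the 4x deviation multiplier follow the now-conventional choice rather than anything quoted here, and the class name is ours.

class RtoEstimator:
    def __init__(self, first_rtt):
        self.srtt = first_rtt        # smoothed round-trip time
        self.rttvar = first_rtt / 2  # smoothed mean deviation
        self.backoff = 1

    def on_measurement(self, rtt):
        err = rtt - self.srtt
        self.srtt += err / 8                         # (i) track the mean...
        self.rttvar += (abs(err) - self.rttvar) / 4  # ...and the variance
        self.backoff = 1             # a fresh sample resets the backoff

    def on_timeout(self):
        self.backoff *= 2            # (ii) exponential retransmit backoff

    def rto(self):
        # Mean plus a multiple of deviation tolerates RTT variance
        # without retransmitting packets that are merely late.
        return (self.srtt + 4 * self.rttvar) * self.backoff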
“In October of '86, the Internet
had the first of what became a
series of 'congestion collapses’.
(…) were fascinated by this
sudden factor-of-thousand drop
in bandwidth and embarked on
an investigation of why things
had gotten so bad.”
In October of '86, the Internet had the first of what
became a series of 'congestion collapses'. During this
period, the data throughput from LBL to UC
Berkeley (sites separated by 400 yards and three IMP
hops) dropped from 32 Kbps to 40 bps. Mike Karels
and I were fascinated by this sudden factor-of-
thousand drop in bandwidth and embarked on an
investigation of why things had gotten so bad. We
wondered, in particular, if the 4.3BSD (Berkeley
UNIX) TCP was mis-behaving or if it could be tuned
to work better under abysmal network conditions.
The answer to both of these questions was "yes".
Since that time, we have put seven new algorithms
into the 4BSD TCP:
(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn's clamped retransmit backoff
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
Van Jacobson‘88 Congestion Avoidance and Control
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
aggravating
retransmissions
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
Van Jacobson‘88 Congestion Avoidance and Control
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
Van Jacobson‘88 Congestion Avoidance and Control
“(…) should obey a
‘conservation of packets’
principle”
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
“(…) for a connection 'in
equilibrium', (…) the packet flow
is what a physicist would call
'conservative': A new packet
isn't put into the network until an
old packet leaves.”
Van Jacobson‘88 Congestion Avoidance and Control
“(…) should obey a
‘conservation of packets’
principle”
Van Jacobson‘88 Congestion Avoidance and Control
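A toy model makes the principle concrete. In the sketch below (entirely illustrative; the names are ours), once the initial window is in flight a new packet enters the network only when an ACK reports that an old one has left: the ACK stream becomes the clock.

from collections import deque

def ack_clocked_send(packets, window):
    pending = deque(packets)
    in_flight = deque()

    # Start-up: fill the window (pacing this gently is slow start's job).
    while pending and len(in_flight) < window:
        in_flight.append(pending.popleft())

    # Equilibrium: conservation of packets, one in per one out.
    while in_flight:
        acked = in_flight.popleft()              # oldest packet exits the net
        if pending:
            in_flight.append(pending.popleft())  # its ACK releases a new one
        yield acked

list(ack_clocked_send(range(10), 4)) delivers all ten packets while never holding more than four in flight.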
slow start
Van Jacobson‘88 Congestion Avoidance and Control
congestion
avoidance
Van Jacobson‘88 Congestion Avoidance and Control
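In code, the two algorithms reduce to a few lines acting on one congestion window. The sketch below is a schematic of slow start plus congestion avoidance as commonly implemented, not a transcription of the paper's Appendix B; the constants and names are illustrative.

class CongestionWindow:
    def __init__(self):
        self.cwnd = 1.0        # congestion window, in packets
        self.ssthresh = 64.0   # boundary between the two regimes

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start: doubles every RTT
        else:
            self.cwnd += 1.0 / self.cwnd   # avoidance: +1 packet per RTT

    def on_timeout(self):
        self.ssthresh = max(self.cwnd / 2, 2.0)  # last size known to work
        self.cwnd = 1.0                          # restart the ACK clock

The point the paper labors is visible here: the two branches of on_ack are different algorithms with different objectives, even though both are set in motion by the same timeout.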
[Figure: packet sequence number vs. send time, 0-10 s.]
Same conditions as the previous figure (same time of day, same Suns, same network
path, same buffer and window sizes), except the machines were running the 4.3+TCP
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Same conditions as the previous figure (same time of day, same Suns, same network
path, same buffer and window sizes), except the machines were running the 4.3+TCP
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis
is the sequence number in the packet header. Thus a vertical array of dots indicate
back-to-back packets and two dots with the same y but different x indicate a retransmit.
'Desirable' behavior on this graph would be a relatively smooth line of dots extending
diagonally from the lower left to the upper right. The slope of this line would equal the
available bandwidth. Nothing in this trace resembles desirable behavior.
C O N G E S T I O N C O N T R O L
fix RTT estimator
slow start (slower than flow control)
congestion avoidance
Act IV
Van Jacobson‘88 Congestion Avoidance and Control
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
Van Jacobson‘88 Congestion Avoidance and Control
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
Van Jacobson‘88 Congestion Avoidance and Control
(These are the first two terms in a Taylor series expan-
sion of L(t). There is reason to believe one might even-
tually need a three term, second order model, but not
until the Internet has grown substantially.)
When the network is congested, γ must be large and
the queue lengths will start increasing exponentially.8
The system will stabilize only if the traffic sources throt-
tle back at least as quickly as the queues are growing.
Since a source controls load in a window-based proto-
col by adjusting the size of the window, W, we end up
with the sender policy
On congestion:
Wi = d·Wi-1    (d < 1)
I.e., a multiplicative decrease of the window size (which
becomes an exponential decrease over time if the con-
gestion persists).
If there's no congestion, γ must be near zero and the
load approximately constant. The network announces,
via a dropped packet, when demand is excessive but
says nothing if a connection is using less than its fair
share (since the network is stateless). Thus, a connec-
tion has to increase its bandwidth utilization to find out
the current limit. E.g., you could have been sharing the
path with someone else and converged to a window that
gives you each half the available bandwidth. If she shuts
down, 40% of the bandwidth will be wasted unless your
window size is increased. What should the increase pol-
icy be?
The first thought is to use a symmetric, multiplica-
tive increase, possibly with a longer time constant,
Wi = bWi-1, 1 < b < 1/d. This is a mistake. The result
will oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but it's
tedious to derive. It has to do with the fact that it is
easy to drive the net into saturation but hard for the
net to recover (what [Kle76], chap. 2.1, calls the rush-
hour effect).9 Thus overestimating the available bandwidth
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
Van Jacobson‘88 Congestion Avoidance and Control
(These are the first two terms in a Taylor series expan-
sion of L(t). There is reason to believe one might even-
tually need a three term, second order model, but not
until the Internet has grown substantially.)
When the network is congested, γ must be large and
the queue lengths will start increasing exponentially.8
The system will stabilize only if the traffic sources throt-
tle back at least as quickly as the queues are growing.
Since a source controls load in a window-based proto-
col by adjusting the size of the window, W, we end up
with the sender policy
On congestion:
Wi = d·Wi-1    (d < 1)
I.e., a multiplicative decrease of the window size (which
becomes an exponential decrease over time if the con-
gestion persists).
If there's no congestion, γ must be near zero and the
load approximately constant. The network announces,
via a dropped packet, when demand is excessive but
says nothing if a connection is using less than its fair
share (since the network is stateless). Thus, a connec-
tion has to increase its bandwidth utilization to find out
the current limit. E.g., you could have been sharing the
path with someone else and converged to a window that
gives you each half the available bandwidth. If she shuts
down, 40% of the bandwidth will be wasted unless your
window size is increased. What should the increase pol-
icy be?
The first thought is to use a symmetric, multiplica-
tive increase, possibly with a longer time constant,
Wi = bWi-1, 1 < b < 1/d. This is a mistake. The result
will oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but it's
tedious to derive. It has to do with the fact that it is
easy to drive the net into saturation but hard for the
net to recover (what [Kle76], chap. 2.1, calls the rush-
hour effect).9 Thus overestimating the available bandwidth
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
d = 0.5
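With u = 1 and d = 0.5 the policy is a one-line loop. The sketch below runs it against a hypothetical pipe that holds 16 packets, just to show the resulting sawtooth; the pipe size and starting window are made up for illustration.

u, d, pipe = 1, 0.5, 16
w, trace = 8, []
for rtt in range(20):
    trace.append(w)
    if w > pipe:                 # overflow: the network drops a packet
        w = max(int(d * w), 1)   # multiplicative decrease
    else:
        w = w + u                # additive increase
print(trace)  # [8, 9, ..., 16, 17, 8, 9, ...]: the AIMD sawtooth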
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
“There is an analytic reason for
this but it's tedious to derive.”
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
“There is an analytic reason for
this but it's tedious to derive.”
“Without justification, I’ll state
that the best increase policy
(…)”
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
A reason for using 1⁄2 as the decrease term, as op-
posed to the 7/8 in [JRC87], was the following
handwaving: When a packet is dropped, you're either
starting (or restarting after a drop) or steady-state
sending. If you're starting, you know that half the
current window size 'worked', i.e., that a window's
worth of packets were exchanged with no drops
(slow-start guarantees this). Thus on congestion you
set the window to the largest size that you know
works then slowly increase the size. If the connection
is steady-state running and a packet is dropped, it's
probably because a new connection started up and
took some of your bandwidth. We usually run our
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
Van Jacobson‘88 Congestion Avoidance and Control
WINDOW ADJUSTMENT POLICY
A reason for using 1⁄2 as the decrease term, as op-
posed to the 7/8 in [JRC87], was the following
handwaving: When a packet is dropped, you're either
starting (or restarting after a drop) or steady-state
sending. If you're starting, you know that half the
current window size 'worked', i.e., that a window's
worth of packets were exchanged with no drops
(slow-start guarantees this). Thus on congestion you
set the window to the largest size that you know
works then slowly increase the size. If the connection
is steady-state running and a packet is dropped, it's
probably because a new connection started up and
took some of your bandwidth. We usually run our
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
Van Jacobson‘88 Congestion Avoidance and Control
“A reason for using 1/2 as the
decrease term (…) was the
following handwaving (…)”
WINDOW ADJUSTMENT POLICY
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
the cost is negligible: Currently, packets are dropped
only when a large queue has formed. Even with an
[ISO86] 'congestion experienced' bit to force senders
to reduce their windows, we're stuck with the queue
because the bottleneck is running at 100% utilization
with no excess bandwidth available to dissipate the
queue. If a packet is tossed, some sender shuts up for
two RTT, exactly the time needed to empty the
queue. If that sender restarts with the correct window
size, the queue won't reform. Thus the delay has been
reduced to minimum without the system losing any
bottleneck bandwidth.
The 1 packet increase has less justification than the
0.5 decrease. In fact, it's almost certainly too large. If
the algorithm converges to a window size of w, there
are O(w2) packets between drops with an additive
increase policy. We were shooting for an average drop
rate of < 1% and found that on the Arpanet (the worst
case of the four networks we tested), windows
converged to 8-12 packets. This yields 1 packet
increments for a 1% average drop rate.
Van Jacobson‘88 Congestion Avoidance and Control
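That O(w²) figure is easy to check numerically. The sketch below counts the packets delivered in one sawtooth cycle (the window climbs from d·w back to w, sending a window's worth per RTT, and the cycle ends in a single drop); the helper and the printed figures are back-of-envelope arithmetic of ours, not the paper's.

def drop_rate(w, u=1, d=0.5):
    # Packets sent while the window climbs from d*w back up to w.
    packets_per_cycle = sum(range(int(d * w), w + 1, u))
    return 1.0 / packets_per_cycle   # one drop ends each cycle

for w in (8, 12):
    print(w, f"{drop_rate(w):.2%}")
# w = 8 -> ~3.3%, w = 12 -> ~1.6%: windows converging to 8-12 packets
# put 1-packet increments in the neighborhood of the 1% target.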
“A reason for using 1/2 as the
decrease term (…) was the
following handwaving (…)”
“The 1-packet increase has less
justification than the 0.5
decrease. In fact, it's almost
certainly too large.”
WINDOW ADJUSTMENT POLICY
“How do you share a network?”
packet conservation principle
(physicist)
flow rate fairness
(sharing buffer space)
3 0 Y E A R S
improve detection of congestion
improve RTT estimation
faster window adaptation
enforce flow rate fairness
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“This paper is deliberately
destructive.”
INTRODUCTION
flow rate fairness
flow rate fairness
shares the wrong thing
rate
x2(t)
x1(t)
bit rate
S H A R I N G W H A T ?
x1(t) = x2(t)
S H A R I N G B E N E F I T S ?
u1(x)
u2(x)
utility function
u1(t) > u2(t)
S H A R I N G C O S T S ?
S H A R I N G C O S T S ?
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
sunk cost
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
sunk cost
ephemeral commodity
S H A R I N G C O S T S ?
c1(t)
c2(t)
S H A R I N G C O S T S ?
c1(t)
c2(t)
x2(t) > x1(t)
higher rate
S H A R I N G C O S T S ?
c1(t)
c2(t)
x2(t) > x1(t)
higher rate
c1(t) = c2(t)
same cost
So in networking, the cost of one flow’s behaviour
depends on the congestion volume it causes which is
the product of its instantaneous flow rate and
congestion on its path, integrated over time. For
instance, if two users are sending at 200kbps and
300kbps into a 450kbps line for 0.5s, congestion is
(200 + 300 − 450)/(200 + 300) = 10% so the congestion
volume each causes is 200k × 10% × 0.5 = 10kb and
15kb respectively.
So cost depends not only on flow rate, but on
congestion as well. Typically congestion might be in
the fractions of a percent but it varies from zero to
tens of percent. So, flow rate can never alone serve as
a measure of cost.
To summarise so far, flow rate is a hopelessly
incorrect proxy both for benefit and for cost. Even if
the intent was to equalise benefits, equalising flow
rates wouldn’t achieve it. Even if the intent was to
equalise costs, equalising flow rates wouldn’t achieve
it.
But actually a realistic resource allocation mechanism
only needs to concern itself with costs. If we set aside
political economy for a moment and use pure
microeconomics, we can use a competitive market to
arbitrate fairness, which handles the benefits side, as
we shall now explain. Then once we have a feasible,
scalable system that at least implements one defined
form of fairness, we will show how to build other
forms of fairness within that.
In life, as long as people cover the cost of their
actions, it is generally considered fair enough. If one
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
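The quoted arithmetic drops straight into a few lines; this simply re-derives the 10 kb and 15 kb figures (variable names are ours).

rates = {"user A": 200e3, "user B": 300e3}   # offered load, bits per second
capacity, duration = 450e3, 0.5              # bps, seconds

offered = sum(rates.values())
congestion = (offered - capacity) / offered  # = 0.10, i.e. 10%

for user, rate in rates.items():
    volume_kb = rate * congestion * duration / 1e3
    print(user, f"{volume_kb:.0f} kb")       # user A: 10 kb, user B: 15 kb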
So in networking, the cost of one flow’s behaviour
depends on the congestion volume it causes which is
the product of its instantaneous flow rate and
congestion on its path, integrated over time. For
instance, if two users are sending at 200kbps and
300kbps into a 450kbps line for 0.5s, congestion is
(200 + 300 − 450)/(200 + 300) = 10% so the congestion
volume each causes is 200k × 10% × 0.5 = 10kb and
15kb respectively.
So cost depends not only on flow rate, but on
congestion as well. Typically congestion might be in
the fractions of a percent but it varies from zero to
tens of percent. So, flow rate can never alone serve as
a measure of cost.
To summarise so far, flow rate is a hopelessly
incorrect proxy both for benefit and for cost. Even if
the intent was to equalise benefits, equalising flow
rates wouldn’t achieve it. Even if the intent was to
equalise costs, equalising flow rates wouldn’t achieve
it.
But actually a realistic resource allocation mechanism
only needs to concern itself with costs. If we set aside
political economy for a moment and use pure
microeconomics, we can use a competitive market to
arbitrate fairness, which handles the benefits side, as
we shall now explain. Then once we have a feasible,
scalable system that at least implements one defined
form of fairness, we will show how to build other
forms of fairness within that.
In life, as long as people cover the cost of their
actions, it is generally considered fair enough. If one
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“(…) flow rate is a hopelessly
incorrect proxy both for benefit and
for cost. Even if the intent was to
equalise benefits, equalising flow
rates wouldn’t achieve it. Even if
the intent was to equalise costs,
equalising flow rates wouldn’t
achieve it.”
COST, NOT BENEFIT
flow rate fairness
shares the wrong thing
rate
flow rate fairness
shares the wrong thing
flow
amongst the wrong entity
x2(t)
x1(t)
bit rate
S H A R I N G A M O N G S T W H A T ?
x1(t) = x2(t)
x2(t)
x1(t)
bit rate
x1(t) = x2(t) = x3(t)
x3(t)
x2(t) + x3(t) > x1(t)
S H A R I N G A M O N G S T W H A T ?
x2(t)
x1(t)
bit rate
x1(t) = x2(t) = x3(t) = x4(t)
x3(t)
x2(t) + x3(t) + x4(t) > x1(t)
S H A R I N G A M O N G S T W H A T ?
x4(t)
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
appear when trying to allocate rate fairly to flows in a
non-co-operative environment. If at every instant a
resource is shared among the flows competing for a
share, any real-world entity can gain by i) creating
more flows than anyone else, and ii) keeping them
going longer than anyone else.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
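The first of those loopholes is easy to quantify: under per-flow allocation an entity's share grows linearly with the number of flows it opens. A minimal sketch (illustrative names):

def per_entity_share(flows_per_entity, capacity=1.0):
    total_flows = sum(flows_per_entity.values())
    return {e: capacity * k / total_flows
            for e, k in flows_per_entity.items()}

print(per_entity_share({"alice": 1, "bob": 1}))   # {'alice': 0.5, 'bob': 0.5}
print(per_entity_share({"alice": 1, "bob": 9}))   # {'alice': 0.1, 'bob': 0.9}
# The same 'fair' per-flow rule yields wildly different outcomes for the
# two principals: bob takes 90% simply by opening nine flows.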
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
appear when trying to allocate rate fairly to flows in a
non-co-operative environment. If at every instant a
resource is shared among the flows competing for a
share, any real-world entity can gain by i) creating
more flows than anyone else, and ii) keeping them
going longer than anyone else.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“It’s equivalent to claiming food
rations are fair because the boxes
are all the same size, irrespective of
how many boxes each person gets
or how often they get them.”
flow rate fairness
shares the wrong thing
flow
amongst the wrong entity
flow rate
shares the wrong thing
fairness
amongst the wrong entity
non-sequitur
Whether the prevailing notion of flow rate fairness
has been the root cause or not, there will certainly be
no solution until the networking community gets its
head out of the sand and understands how unrealistic
its view is, and how important this issue is. Certainly
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
Whether the prevailing notion of flow rate fairness
has been the root cause or not, there will certainly be
no solution until the networking community gets its
head out of the sand and understands how unrealistic
its view is, and how important this issue is. Certainly
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“Fair allocation of rates between
flows isn’t based on any respected
definition of fairness from
philosophy or the social sciences. It
has just gradually become the way
things are done in networking.”
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
restrict other transfers, given capacity constraints.
Then flow rates will depend on a deeper level of
fairness that has so far remained unnamed in the
literature, but is best termed ‘cost fairness’.
It really is only the idea of flow rate fairness that
needs destroying—nearly everything we've
engineered can remain. The Internet architecture
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“Obscured by this broken idea, we
wouldn’t know a good solution
from a bad one.”
what would fair look like?
C O S T F A I R
the cost is congestion
increase with flow rate, but the shape and size of the
function relating the two (the utility function) is
unknown, subjective and private to each user. Flow
rate itself is an extremely inadequate measure for
comparing benefits: user benefit per bit rate might be
ten orders of magnitude different for different types
of flow (e.g. SMS and video). So different applications
might derive completely different benefits from equal
flow rates and equal benefits might be derived from
very different flow rates.
Turning to the cost of a data transfer across a
network, flow rate alone is not the measure of that
either. Cost is also dependent on the level of
congestion on the path. This is counter-intuitive for
some people so we shall explain a little further. Once
a network has been provisioned at a certain size, it
doesn’t cost a network operator any more whether a
user sends more data or not. But if the network
becomes congested, each user restricts every other
user, which can be interpreted as a cost to all - an
externality in economic terms. For any level of
congestion, Kelly showed [20] that the system is
optimal if the blame for congestion is attributed
among all the users causing it, in proportion to their
bit rates. That’s exactly what routers are designed to
do anyway. During congestion, a queue randomly
distributes the losses so all flows see about the same
loss (or ECN marking) rate; if a flow has twice the bit
rate of another it should see twice the losses. In this
respect random early detection (RED [12]) is slightly
fairer than drop tail, but to a first order
approximation they both meet this criterion.
So in networking, the cost of one flow’s behaviour
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
increase with flow rate, but the shape and size of the
function relating the two (the utility function) is
unknown, subjective and private to each user. Flow
rate itself is an extremely inadequate measure for
comparing benefits: user benefit per bit rate might be
ten orders of magnitude different for different types
of flow (e.g. SMS and video). So different applications
might derive completely different benefits from equal
flow rates and equal benefits might be derived from
very different flow rates.
Turning to the cost of a data transfer across a
network, flow rate alone is not the measure of that
either. Cost is also dependent on the level of
congestion on the path. This is counter-intuitive for
some people so we shall explain a little further. Once
a network has been provisioned at a certain size, it
doesn’t cost a network operator any more whether a
user sends more data or not. But if the network
becomes congested, each user restricts every other
user, which can be interpreted as a cost to all - an
externality in economic terms. For any level of
congestion, Kelly showed [20] that the system is
optimal if the blame for congestion is attributed
among all the users causing it, in proportion to their
bit rates. That’s exactly what routers are designed to
do anyway. During congestion, a queue randomly
distributes the losses so all flows see about the same
loss (or ECN marking) rate; if a flow has twice the bit
rate of another it should see twice the losses. In this
respect random early detection (RED [12]) is slightly
fairer than drop tail, but to a first order
approximation they both meet this criterion.
So in networking, the cost of one flow’s behaviour
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“(…) if the network becomes
congested, each user restricts every
other user, which can be
interpreted as a cost to all - an
externality in economic terms.”
COST, NOT BENEFIT
time
rate V O L U M E C A P P I N G
time
rate V O L U M E C A P P I N G
time
rate V O L U M E C A P P I N G not much faster
time
rate V O L U M E C A P P I N G not much faster
waste
time
rate R A T E L I M I T I N G
time
rate R A T E L I M I T I N G
time
rate R A T E L I M I T I N G
much slower
time
rate R A T E L I M I T I N G
much slowerwaste
C O S T F A I R N E S S
c2(t)
c1(t)
congestion
rate
reflects cost
integrates correctly
verifiable across network borders
time
rate W E I G H T E D C O S T
time
rate W E I G H T E D C O S T
causes disproportionate
congestion
causes disproportionate
congestion
“protect customers” /
demand more money
causes disproportionate
congestion
“protect customers” /
demand more money
“not fair”
congestion marking starts. Such operators continually
receive information on how much real demand there
is for capacity while collecting revenue to repay their
investments. Such congestion marking controls
demand without risk of actual congestion
deteriorating service.
Once a cost is assigned to congestion that equates to
the cost of alleviating it, users will only cause
congestion if they want extra capacity enough to be
willing to pay its cost. Of course, there will be no
need to be too precise about that rule. Perhaps some
people might be allowed to get more than they pay
for and others less. Perhaps some people will be
prepared to pay for what others get, and so on. But,
in a system the size of the Internet, there has to be
some handle to arbitrate how much cost some users
cause to others. Flow rate fairness comes nowhere
near being up to the job. It just isn’t realistic to create
a system the size of the Internet and define fairness
within the system without reference to fairness
outside the system — in the real world where
everyone grudgingly accepts that fairness usually
means “you get what you pay for”.
Note that we use the phrase “you get what you pay
for” not just “you pay for what you get”. In Kelly’s
original formulation, users had to pay for the
congestion they caused, which was unlikely to be
taken up commercially. But the reason we are
revitalising Kelly’s work is that recent advances
(§4.3.2) should allow ISPs to keep their popular flat
fee pricing packages by aiming to ensure that users
cannot cause more congestion costs than their flat fee
pays for.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
congestion marking starts. Such operators continually
receive information on how much real demand there
is for capacity while collecting revenue to repay their
investments. Such congestion marking controls
demand without risk of actual congestion
deteriorating service.
Once a cost is assigned to congestion that equates to
the cost of alleviating it, users will only cause
congestion if they want extra capacity enough to be
willing to pay its cost. Of course, there will be no
need to be too precise about that rule. Perhaps some
people might be allowed to get more than they pay
for and others less. Perhaps some people will be
prepared to pay for what others get, and so on. But,
in a system the size of the Internet, there has to be
some handle to arbitrate how much cost some users
cause to others. Flow rate fairness comes nowhere
near being up to the job. It just isn’t realistic to create
a system the size of the Internet and define fairness
within the system without reference to fairness
outside the system — in the real world where
everyone grudgingly accepts that fairness usually
means “you get what you pay for”.
Note that we use the phrase “you get what you pay
for” not just “you pay for what you get”. In Kelly’s
original formulation, users had to pay for the
congestion they caused, which was unlikely to be
taken up commercially. But the reason we are
revitalising Kelly’s work is that recent advances
(§4.3.2) should allow ISPs to keep their popular flat
fee pricing packages by aiming to ensure that users
cannot cause more congestion costs than their flat fee
pays for.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“It just isn’t realistic to create a
system the size of the Internet and
define fairness within the system
without reference to fairness
outside the system”
COST, NOT BENEFIT
“How do you share a network?”
cost
(economist)
H O W M A N Y W O R K A R O U N D S ?
“TCP is bad with small flows”
batch and re-use connections
open parallel connections
artificial limits in multitenancy
we still have no idea
2 0 1 6
we know what we have is wrong
we still have no idea
2 0 1 6
we know what we have is wrong
not broken enough to fix
we still have no idea
2 0 1 6
End

  • 25.
    Paul Baran‘62 OnDistributed Communications Networks
  • 26.
    transmission between anyith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 27.
    transmission between anyith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station “(…) destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered.” Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 28.
    stations are saidto be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000- weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 29.
    “To bisect a32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network.” stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000- weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 30.
    Each node andlink in the array of Fig. 2 has the capacity and the switching flexibility to allow transmission between any ith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 31.
    4. First, extremelysurvivable networks can be built using a moderately low redundancy of connectivity level. Redundancy levels on the order of only three permit withstanding extremely heavy level attacks with negligible additional loss to communications. Secondly, the survivability curves have sharp break- points. A network of this type will withstand an increasing attack level until a certain point is reached, beyond which the network rapidly deteriorates. Thus, the optimum degree of redundancy can be chosen as a function of the expected level of attack. Further redundancy buys little. The redundancy level required to survive even very heavy attacks is not great--on the order of only three or four times that of the minimum span network. Link Destruction In the previous example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network. Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time-- assuming all nodes are working. Combination Link and Node Destruction The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 32.
    Link Destruction In theprevious example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network. Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time-- assuming all nodes are working. Combination Link and Node Destruction The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the effect of link failure upon a network having 40 per cent of its nodes destroyed. It appears that what would today be regarded as an unreliable link can be used in a distributed network almost as effectively as perfectly reliable links. Figure 7 examines the result of 100 trial cases in order to estimate the probability density distribution of system performance for a mixture of node and link failures. This is the distribution of cases for 20 per cent nodal damage and 35 per cent link damage. Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 35.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 36.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “(…) highly survivable system structures can be built - even in the thermonuclear era.”
  • 37.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “(…) have to use a large number of elements. We are interested in knowing how inexpensive these elements may be”
  • 38.
    high data ratelinks in emergencies.[2] Satellites The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described. Variable Data Rate Links In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 39.
    high data ratelinks in emergencies.[2] Satellites The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described. Variable Data Rate Links In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 40.
    Variable Data RateLinks In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 41.
    Variable Data RateUsers We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system. Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 42.
    Variable Data RateUsers We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system. Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “more economical to share a common (…) resource optimized for the handling of data”
  • 43.
    common-user system. Figure 9demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected. Standard Message Block Present common carrier communications networks, used for digital transmission, use links and concepts originally designed for another purpose--voice. These systems are built around a frequency division multiplexing link-to-link interface standard. The standard between links is that of data rate. Time division multiplexing appears so natural to data transmission that we might wish to consider an alternative approach--a standardized message block as a network interface standard. While a standardized message block is common in many computer- communications applications, no serious attempt has ever been made to use it as a universal standard. A universally standardized message block would be composed of perhaps 1024 bits. Most of the message block would be reserved for whatever type data is to be transmitted, while the remainder would contain housekeeping information such as error detection and routing data, as in Fig. 10. As we move to the future, there appears to be an increasing need for a standardized message block for all-digital communications networks. As data rates increase, the velocity of propagation over long links Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 44.
    common-user system. Figure 9demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected. Standard Message Block Present common carrier communications networks, used for digital transmission, use links and concepts originally designed for another purpose--voice. These systems are built around a frequency division multiplexing link-to-link interface standard. The standard between links is that of data rate. Time division multiplexing appears so natural to data transmission that we might wish to consider an alternative approach--a standardized message block as a network interface standard. While a standardized message block is common in many computer- communications applications, no serious attempt has ever been made to use it as a universal standard. A universally standardized message block would be composed of perhaps 1024 bits. Most of the message block would be reserved for whatever type data is to be transmitted, while the remainder would contain housekeeping information such as error detection and routing data, as in Fig. 10. As we move to the future, there appears to be an increasing need for a standardized message block for all-digital communications networks. As data rates increase, the velocity of propagation over long links Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “Time division multiplexing appears so natural to data that we might wish to consider an alternative approach - a standardized message block”
  • 45.
    Telecommunications textbooks arriveat a fire according to a Poisson distribution
  • 46.
    “How do youshare a network?”
  • 47.
  • 49.
    IP type ofservice field
  • 51.
    Act I AN EXERCISETO THE READER
  • 52.
    A R PA N E T
  • 53.
  • 55.
    A protocol thatsupports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed. In the last few years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION
  • 56.
    In the lastfew years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION A protocol that supports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed.
  • 57.
    In the lastfew years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION “A protocol that supports the sharing of resources that exist in different packet switching networks is presented.” A protocol that supports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed.
  • 58.
    A protocol thatsupports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed. In the last few years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION packet fragmentation transmission failures sequencing flow control error checking connection setup
  • 59.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A
  • 60.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A
  • 61.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A wat?!?
  • 62.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A wat?!?
  • 63.
    SEQ and SYNin internetwork header
  • 64.
    SEQ and SYNin internetwork header if there’s an internetwork header, and a process header, what the hell is TCP?
  • 65.
    We suppose thatprocesses wish to communicate in full duplex with their correspondents using unbounded but finite length messages. A single character might constitute the text of a message from a process to a terminal or vice versa. An entire page of characters might constitute the text of a message from a file to a process. A data stream (e.g. a continuously generated bit string) can be represented as a sequence of finite length messages. Within a HOST we assume that existence of a transmission control program (TCP) which handles the transmission and acceptance of messages on behalf of the processes it serves. The TCP is in turn served by one or more packet switches connected to the HOST in which the TCP resides. Processes that want to communicate present messages to the TCP for transmission, and TCP’s deliver incoming messages to the appropriate destination processes. We allow the TCP to break up messages into segments because the destination may restrict the amount of data that may arrive, because the local network may limit the maximum transmissin size, or because the TCP may need to share its resources among many processes concurrently. Furthermore, we constrain the length of a segment to an integral number of 8-bit bytes. This uniformity is most helpful in simplifying the software needed with HOST machines of different natural word lengths. Provision at the process level can be made for padding a message that is not an integral number of Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication PROCESS LEV EL COMMUNICATION
  • 66.
    We suppose thatprocesses wish to communicate in full duplex with their correspondents using unbounded but finite length messages. A single character might constitute the text of a message from a process to a terminal or vice versa. An entire page of characters might constitute the text of a message from a file to a process. A data stream (e.g. a continuously generated bit string) can be represented as a sequence of finite length messages. Within a HOST we assume that existence of a transmission control program (TCP) which handles the transmission and acceptance of messages on behalf of the processes it serves. The TCP is in turn served by one or more packet switches connected to the HOST in which the TCP resides. Processes that want to communicate present messages to the TCP for transmission, and TCP’s deliver incoming messages to the appropriate destination processes. We allow the TCP to break up messages into segments because the destination may restrict the amount of data that may arrive, because the local network may limit the maximum transmissin size, or because the TCP may need to share its resources among many processes concurrently. Furthermore, we constrain the length of a segment to an integral number of 8-bit bytes. This uniformity is most helpful in simplifying the software needed with HOST machines of different natural word lengths. Provision at the process level can be made for padding a message that is not an integral number of Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication PROCESS LEV EL COMMUNICATION “Within a HOST we assume the existence of a transmission control program (TCP) which handles transmission”
  • 67.
    TCP is auserspace networking stack SEQ and SYN in internetwork header
  • 68.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON
  • 69.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON “No transmission can be 100 percent reliable.”
  • 70.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON “No transmission can be 100 percent reliable.” “retransmission (…) will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature.”
  • 71.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological
  • 72.
    incorporated. We envisionit will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even if an infinite number of distinct packet sequence numbers were available, the receiver would still have the problem of knowing how long to remember previously received packets in order to detect duplicates. Matters are complicated by the fact that only a finite number of distinct sequence numbers are in fact available, and if they are reused, the receiver must be able to distinguish between new transmissions and retransmissions. A window strategy, similar to that used by the French CYCLADES system (voie virtuelle transmission mode [8]) and the ARPANET very distant HOST connection [18]), is proposed here (see Fig. 10). Suppose that the sequence number field in the internetwork header permits sequence numbers to range from 0 to n − 1. We assume that the sender will not transmit more than w bytes without receiving an acknowledgment. The w bytes serve as the window (see Fig. 11). Clearly, w must be less than n. The rules for sender and receiver are as follows. Sender: Let L be the sequence number associated with the left window edge. 1) The sender transmits bytes from segments whose text lies between L and up to L + w − 1. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON NETWORK INTERCOMMUNICATION 643 SSIONANDDUPLICATE DETECTION e 100 percent reliable. We d positive acknowledgment mecha- TCP’s torecover from packet losses other. A TCP transmits packets and knowledgements) that are carried in eam. If noacknowledgment for a received, theTCP will retransmit. that the HOST level retransmission s described inthe following para- called uponveryofteninpractice. sts2 that individual networks can be d without this feature. However, the retransmissioncapabilitymakes it om occasional network problems and of HOST protocol strategies to be in- ion it will occasionally be invoked to dation to infrequent overdemandsfor es, and otherwise not used much. Left Window Edge I 0 n-1a+w- 1a 1- window -4 I< packet sequence number space -1 Fig. 10. The windowconcept. Source Address I Address Destination I 6 7 8 9 10 Next Read Position End ReadPosition Timeout Fig. 11. Conceptual TCBformat.
  • 73.
    On retransmission, thesame packet might be broken into three 200-byte packets going through a different HOST. Since each byte has a sequence number, there is no confusion at the receiving TCP. We leave for later the issue of initially synchronizing the sender and receiver left window edges and the window size. Every segment that arrives at the destination TCP is ultimately acknowlegded by returning the sequence number of the next segment which must be passed to the process (it may not yet have arrived). Earlier we described the use of a sequence number space and window to aid in duplicate detection. Acknowledgments are carried in the process header (see Fig. 6) and along with them there is provision for a “suggested window” which the receiver can use to control the flow of data from the sender. This is intended to be the main component of the process flow control mechanism. The receiver is free to vary the window size according to any algorithm it desires so long as the window size never exceeds half the sequence number space. This flow control mechanism is exceedingly powerful and flexible and does not suffer from synchronization troubles that may be encountered by incremental buffer allocation schemes [9], [10]. However, it relies heavily on an effective retransmission strategy. The receiver can reduce the window even while packets are en route from the sender whose window is presently larger. The net effect of this reduction will be that the receiver may discard incoming packets (they may be outside the window) and reiterate the Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication NETWORK INTERCOMMUNICATION 643 SSIONANDDUPLICATE DETECTION e 100 percent reliable. We d positive acknowledgment mecha- TCP’s torecover from packet losses other. A TCP transmits packets and knowledgements) that are carried in eam. If noacknowledgment for a received, theTCP will retransmit. that the HOST level retransmission s described inthe following para- called uponveryofteninpractice. sts2 that individual networks can be d without this feature. However, the retransmissioncapabilitymakes it om occasional network problems and of HOST protocol strategies to be in- ion it will occasionally be invoked to dation to infrequent overdemandsfor es, and otherwise not used much. Left Window Edge I 0 n-1a+w- 1a 1- window -4 I< packet sequence number space -1 Fig. 10. The windowconcept. Source Address I Address Destination I 6 7 8 9 10 Next Read Position End ReadPosition Timeout Fig. 11. Conceptual TCBformat. “a ‘suggested window’ which the receiver can use to control the flow of data from the sender. This is intended to be the main component of the process flow control mechanism.” FLOW CONTROL
  • 74.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological the resource is the host
  • 75.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological the resource is the host no UDP
  • 77.
    “How do youshare a network?”
  • 78.
  • 79.
  • 80.
  • 82.
  • 83.
    The authors wishto thank a number of colleagues for helpful comments during early discussions of international network protocols, especially R. Metcalfe, R. Scantlebury, D. Walden, and H. Zimmerman; D. Davies and L. Pouzin who constructively commented on the fragmentation and accounting issues; and S. Crocker who commented on the creation and destruction of associations. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ACK NOWLEDGEME NT S
  • 84.
    The authors wishto thank a number of colleagues for helpful comments during early discussions of international network protocols, especially R. Metcalfe, R. Scantlebury, D. Walden, and H. Zimmerman; D. Davies and L. Pouzin who constructively commented on the fragmentation and accounting issues; and S. Crocker who commented on the creation and destruction of associations. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ACK NOWLEDGEME NT S “The authors wish to thank (…) especially R. Metcalfe (…)”
  • 85.
  • 86.
    what if instead ofall this…
  • 87.
    what if instead ofall this… x.25 flow control diagnostics connection setup hop-by-hop reliability
  • 88.
    what if instead ofall this… x.25 flow control diagnostics connection setup hop-by-hop reliability ??? …i did nothing?
  • 89.
  • 92.
    R F C1 2 9 6 1981 1982 1983 1984 1985 1986 1987 30,000 0 5000 10,000 15,000 20,000 25,000 Year Numberofhosts
  • 93.
    In October of'86, the Internet had the first of what became a series of 'congestion collapses'. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and three IMP hops) dropped from 32 Kbps to 40 bps. Mike Karels and I were fascinated by this sudden factor-of- thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. We wondered, in particular, if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was "yes". Since that time, we have put seven new algorithms into the 4BSDTCP: (i) round-trip-time variance estimation (ii) exponential retransmit timer backoff (iii) slow-start (iv) more aggressive receiver ack policy (v) dynamic window sizing on congestion (vi) Karn's clamped retransmit backoff (vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing Van Jacobson‘88 Congestion Avoidance and Control
  • 94.
    “In October of'86, the Internet had the first of what became a series of 'congestion collapses’. (…) were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad.” In October of '86, the Internet had the first of what became a series of 'congestion collapses'. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and three IMP hops) dropped from 32 Kbps to 40 bps. Mike Karels and I were fascinated by this sudden factor-of- thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. We wondered, in particular, if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was "yes". Since that time, we have put seven new algorithms into the 4BSDTCP: (i) round-trip-time variance estimation (ii) exponential retransmit timer backoff (iii) slow-start (iv) more aggressive receiver ack policy (v) dynamic window sizing on congestion (vi) Karn's clamped retransmit backoff (vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing Van Jacobson‘88 Congestion Avoidance and Control
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). what if instead of all this…
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). what if instead of all this… aggravating retransmissions
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of Van Jacobson‘88 Congestion Avoidance and Control
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of Van Jacobson‘88 Congestion Avoidance and Control “(…) should obey a ‘conservation of packets’ principle”
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of “(…) for a connection 'in equilibrium', (…) the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves.” Van Jacobson‘88 Congestion Avoidance and Control “(…) should obey a ‘conservation of packets’ principle”
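The conservation principle is mechanical enough to capture in a few lines. Here is a toy, self-contained sketch (our own construction, not code from the paper) in which a sender keeps at most a window's worth of packets in flight and injects a new one only when an old one has been acknowledged out of the pipe:

```python
import collections

def ack_clocked_transfer(n_packets, window):
    """Simulate a self-clocked sender: a new packet enters the network
    only when an ACK reports that an old packet has left it."""
    pipe = collections.deque()   # packets currently in flight
    sent = acked = 0
    while acked < n_packets:
        # Inject only while conservation allows it.
        while sent < n_packets and (sent - acked) < window:
            pipe.append(sent)
            sent += 1
        pipe.popleft()           # one packet exits the bottleneck...
        acked += 1               # ...and its ACK clocks out the next send
    return sent, acked

print(ack_clocked_transfer(100, 8))   # (100, 100)
```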
Van Jacobson‘88 Congestion Avoidance and Control slow start
Van Jacobson‘88 Congestion Avoidance and Control congestion avoidance
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec, 4.3+TCP] Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP [trace plot: packet sequence number vs. send time, 0–10 sec, 4.3BSD TCP] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit. 'Desirable' behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior.
C O N G E S T I O N C O N T R O L fix RTT estimator slow start (slower than flow control) congestion avoidance
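The last two fixes fit in a handful of lines. The sketch below is our paraphrase of the combined slow-start/congestion-avoidance behavior the paper describes (and spells out in its Appendix B), with the window measured in segments; it is illustrative, not the 4.3BSD source.

```python
def on_ack(cwnd, ssthresh):
    """Window growth per ACK: exponential below ssthresh (slow start),
    then additive, roughly +1 segment per round trip (congestion avoidance)."""
    if cwnd < ssthresh:
        return cwnd + 1            # slow start: window doubles every RTT
    return cwnd + 1.0 / cwnd       # congestion avoidance: +1 per RTT

def on_congestion(cwnd):
    """Timeout (the congestion signal): keep half the window that worked
    as the new threshold, then restart from one segment."""
    ssthresh = max(cwnd / 2.0, 2)  # multiplicative decrease, d = 0.5
    return 1, ssthresh
```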
Van Jacobson‘88 Congestion Avoidance and Control is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…)
Van Jacobson‘88 Congestion Avoidance and Control is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1
Van Jacobson‘88 Congestion Avoidance and Control (These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.) When the network is congested, γ must be large and the queue lengths will start increasing exponentially. The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy On congestion: Wi = d·Wi−1 (d < 1) I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists). If there's no congestion, γ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but (…) is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1
Van Jacobson‘88 Congestion Avoidance and Control (These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.) When the network is congested, γ must be large and the queue lengths will start increasing exponentially. The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy On congestion: Wi = d·Wi−1 (d < 1) I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists). If there's no congestion, γ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but (…) is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1 d = 0.5
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control ADAPTING TO THE PATH: CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control “There is an analytic reason for this but it's tedious to derive.” ADAPTING TO THE PATH: CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control “There is an analytic reason for this but it's tedious to derive.” “Without justification, I’ll state that the best increase policy (…)” ADAPTING TO THE PATH: CONGESTION AVOIDANCE
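That "tedious to derive" oscillation is easy to see numerically. The toy below (our construction, following the additive-increase/multiplicative-decrease analysis the paper credits to [JRC87]) runs two synchronized flows against a shared capacity: with additive increase the windows converge toward an even split, while a symmetric multiplicative increase preserves whatever imbalance the flows started with.

```python
def run(increase, w1=1.0, w2=8.0, capacity=20.0, rounds=200):
    """Two synchronized flows: both halve on congestion, both grow otherwise."""
    for _ in range(rounds):
        if w1 + w2 > capacity:               # congestion signal
            w1, w2 = 0.5 * w1, 0.5 * w2      # multiplicative decrease
        else:
            w1, w2 = increase(w1), increase(w2)
    return round(w1, 2), round(w2, 2)

print(run(lambda w: w + 1.0))   # additive increase: roughly equal shares
print(run(lambda w: 1.2 * w))   # multiplicative increase: 8:1 ratio persists
```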
A reason for using 1/2 as the decrease term, as opposed to the 7/8 in [JRC87], was the following handwaving: When a packet is dropped, you're either starting (or restarting after a drop) or steady-state sending. If you're starting, you know that half the current window size 'worked', i.e., that a window's worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it's probably because a new connection started up and took some of your bandwidth. We usually run our nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms Van Jacobson‘88 Congestion Avoidance and Control WINDOW ADJUSTMENT POLICY
A reason for using 1/2 as the decrease term, as opposed to the 7/8 in [JRC87], was the following handwaving: When a packet is dropped, you're either starting (or restarting after a drop) or steady-state sending. If you're starting, you know that half the current window size 'worked', i.e., that a window's worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it's probably because a new connection started up and took some of your bandwidth. We usually run our nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms Van Jacobson‘88 Congestion Avoidance and Control “A reason for using 1/2 as the decrease term (…) was the following handwaving (…)” WINDOW ADJUSTMENT POLICY
nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms the cost is negligible: Currently, packets are dropped only when a large queue has formed. Even with an [ISO86] 'congestion experienced' bit to force senders to reduce their windows, we're stuck with the queue because the bottleneck is running at 100% utilization with no excess bandwidth available to dissipate the queue. If a packet is tossed, some sender shuts up for two RTT, exactly the time needed to empty the queue. If that sender restarts with the correct window size, the queue won't reform. Thus the delay has been reduced to minimum without the system losing any bottleneck bandwidth. The 1-packet increase has less justification than the 0.5 decrease. In fact, it's almost certainly too large. If the algorithm converges to a window size of w, there are O(w²) packets between drops with an additive increase policy. We were shooting for an average drop rate of < 1% and found that on the Arpanet (the worst case of the four networks we tested), windows converged to 8-12 packets. This yields 1-packet increments for a 1% average drop rate. Van Jacobson‘88 Congestion Avoidance and Control “A reason for using 1/2 as the decrease term (…) was the following handwaving (…)” “The 1-packet increase has less justification than the 0.5 decrease. In fact, it's almost certainly too large.” WINDOW ADJUSTMENT POLICY
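The O(w²) claim follows from the sawtooth: the window climbs from w/2 back to w one segment per round trip, delivering about a window's worth of packets each RTT. A back-of-envelope check (our sketch, not from the paper) puts the converged windows the authors measured within a small factor of their 1% drop-rate target:

```python
def packets_between_drops(w):
    # Sawtooth from w/2 up to w, +1 segment per RTT: one window of
    # packets delivered per RTT, so roughly sum(w/2..w), about 3w^2/8,
    # packets between consecutive drops.
    return sum(range(w // 2, w + 1))

for w in (8, 12):
    n = packets_between_drops(w)
    print(w, n, f"drop rate ~ {1 / n:.1%}")   # 8 -> ~3.3%, 12 -> ~1.6%
```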
“How do you share a network?”
3 0 Y E A R S improve detection of congestion improve RTT estimation faster window adaptation enforce flow rate fairness
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “This paper is deliberately destructive.” INTRODUCTION
flow rate fairness shares the wrong thing rate
x2(t) x1(t) bit rate S H A R I N G W H A T ? x1(t) = x2(t)
S H A R I N G B E N E F I T S ? u1(x) u2(x) utility function u1(t) > u2(t)
S H A R I N G C O S T S ?
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0 sunk cost
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0 sunk cost ephemeral commodity
S H A R I N G C O S T S ? c1(t) c2(t)
S H A R I N G C O S T S ? c1(t) c2(t) x2(t) > x1(t) higher rate
S H A R I N G C O S T S ? c1(t) c2(t) x2(t) > x1(t) higher rate c1(t) = c2(t) same cost
So in networking, the cost of one flow’s behaviour depends on the congestion volume it causes which is the product of its instantaneous flow rate and congestion on its path, integrated over time. For instance, if two users are sending at 200kbps and 300kbps into a 450kbps line for 0.5s, congestion is (200 + 300 − 450)/(200 + 300) = 10% so the congestion volume each causes is 200k × 10% × 0.5 = 10kb and 15kb respectively. So cost depends not only on flow rate, but on congestion as well. Typically congestion might be in the fractions of a percent but it varies from zero to tens of percent. So, flow rate can never alone serve as a measure of cost. To summarise so far, flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it. But actually a realistic resource allocation mechanism only needs to concern itself with costs. If we set aside political economy for a moment and use pure microeconomics, we can use a competitive market to arbitrate fairness, which handles the benefits side, as we shall now explain. Then once we have a feasible, scalable system that at least implements one defined form of fairness, we will show how to build other forms of fairness within that. In life, as long as people cover the cost of their actions, it is generally considered fair enough. If one Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
So in networking, the cost of one flow’s behaviour depends on the congestion volume it causes which is the product of its instantaneous flow rate and congestion on its path, integrated over time. For instance, if two users are sending at 200kbps and 300kbps into a 450kbps line for 0.5s, congestion is (200 + 300 − 450)/(200 + 300) = 10% so the congestion volume each causes is 200k × 10% × 0.5 = 10kb and 15kb respectively. So cost depends not only on flow rate, but on congestion as well. Typically congestion might be in the fractions of a percent but it varies from zero to tens of percent. So, flow rate can never alone serve as a measure of cost. To summarise so far, flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it. But actually a realistic resource allocation mechanism only needs to concern itself with costs. If we set aside political economy for a moment and use pure microeconomics, we can use a competitive market to arbitrate fairness, which handles the benefits side, as we shall now explain. Then once we have a feasible, scalable system that at least implements one defined form of fairness, we will show how to build other forms of fairness within that. In life, as long as people cover the cost of their actions, it is generally considered fair enough. If one Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “(…) flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it.” COST, NOT BENEFIT
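Briscoe's arithmetic is worth making executable; the few lines below (our illustration, with names we chose) reproduce the worked example and make clear that cost scales with both rate and path congestion:

```python
def congestion_volume(rate_kbps, congestion, seconds):
    """Congestion volume in kilobits: rate x path congestion x time."""
    return rate_kbps * congestion * seconds

rates, capacity, duration = (200, 300), 450, 0.5
congestion = (sum(rates) - capacity) / sum(rates)         # = 0.10
print([congestion_volume(r, congestion, duration) for r in rates])
# -> [10.0, 15.0] kb: same 10% congestion, costs in proportion to rate
```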
flow rate fairness shares the wrong thing rate
flow rate fairness shares the wrong thing flow amongst the wrong entity
x2(t) x1(t) bit rate S H A R I N G A M O N G S T W H A T ? x1(t) = x2(t)
x2(t) x1(t) bit rate x1(t) = x2(t) = x3(t) x3(t) x2(t) + x3(t) > x1(t) S H A R I N G A M O N G S T W H A T ?
x2(t) x1(t) bit rate x1(t) = x2(t) = x3(t) = x4(t) x3(t) x2(t) + x3(t) + x4(t) > x1(t) S H A R I N G A M O N G S T W H A T ? x4(t)
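The loophole is pure arithmetic. In the sketch below (ours, with illustrative numbers), equal per-flow shares of a bottleneck mean a user's total throughput is simply proportional to the number of flows they open:

```python
def per_user_throughput(flows_per_user, capacity=100.0):
    """Equal share per *flow*: each user's total scales with flow count."""
    total_flows = sum(flows_per_user)
    return [capacity * n / total_flows for n in flows_per_user]

print(per_user_throughput([1, 3]))   # [25.0, 75.0]: 3 flows beat 1, 'fairly'
```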
fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors appear when trying to allocate rate fairly to flows in a non-cooperative environment. If at every instant a resource is shared among the flows competing for a share, any real-world entity can gain by i) creating more flows than anyone else, and ii) keeping them going longer than anyone else. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors appear when trying to allocate rate fairly to flows in a non-cooperative environment. If at every instant a resource is shared among the flows competing for a share, any real-world entity can gain by i) creating more flows than anyone else, and ii) keeping them going longer than anyone else. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them.”
flow rate fairness shares the wrong thing flow amongst the wrong entity
flow rate shares the wrong thing fairness amongst the wrong entity non-sequitur
Whether the prevailing notion of flow rate fairness has been the root cause or not, there will certainly be no solution until the networking community gets its head out of the sand and understands how unrealistic its view is, and how important this issue is. Certainly fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
Whether the prevailing notion of flow rate fairness has been the root cause or not, there will certainly be no solution until the networking community gets its head out of the sand and understands how unrealistic its view is, and how important this issue is. Certainly fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking.”
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers restrict other transfers, given capacity constraints. Then flow rates will depend on a deeper level of fairness that has so far remained unnamed in the literature, but is best termed ‘cost fairness’. It really is only the idea of flow rate fairness that needs destroying—nearly everything we’ve engineered can remain. The Internet architecture Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “Obscured by this broken idea, we wouldn’t know a good solution from a bad one.”
what would fair look like?
C O S T F A I R the cost is congestion
increase with flow rate, but the shape and size of the function relating the two (the utility function) is unknown, subjective and private to each user. Flow rate itself is an extremely inadequate measure for comparing benefits: user benefit per bit rate might be ten orders of magnitude different for different types of flow (e.g. SMS and video). So different applications might derive completely different benefits from equal flow rates and equal benefits might be derived from very different flow rates. Turning to the cost of a data transfer across a network, flow rate alone is not the measure of that either. Cost is also dependent on the level of congestion on the path. This is counter-intuitive for some people so we shall explain a little further. Once a network has been provisioned at a certain size, it doesn’t cost a network operator any more whether a user sends more data or not. But if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms. For any level of congestion, Kelly showed [20] that the system is optimal if the blame for congestion is attributed among all the users causing it, in proportion to their bit rates. That’s exactly what routers are designed to do anyway. During congestion, a queue randomly distributes the losses so all flows see about the same loss (or ECN marking) rate; if a flow has twice the bit rate of another it should see twice the losses. In this respect random early detection (RED [12]) is slightly fairer than drop tail, but to a first order approximation they both meet this criterion. So in networking, the cost of one flow’s behaviour Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
increase with flow rate, but the shape and size of the function relating the two (the utility function) is unknown, subjective and private to each user. Flow rate itself is an extremely inadequate measure for comparing benefits: user benefit per bit rate might be ten orders of magnitude different for different types of flow (e.g. SMS and video). So different applications might derive completely different benefits from equal flow rates and equal benefits might be derived from very different flow rates. Turning to the cost of a data transfer across a network, flow rate alone is not the measure of that either. Cost is also dependent on the level of congestion on the path. This is counter-intuitive for some people so we shall explain a little further. Once a network has been provisioned at a certain size, it doesn’t cost a network operator any more whether a user sends more data or not. But if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms. For any level of congestion, Kelly showed [20] that the system is optimal if the blame for congestion is attributed among all the users causing it, in proportion to their bit rates. That’s exactly what routers are designed to do anyway. During congestion, a queue randomly distributes the losses so all flows see about the same loss (or ECN marking) rate; if a flow has twice the bit rate of another it should see twice the losses. In this respect random early detection (RED [12]) is slightly fairer than drop tail, but to a first order approximation they both meet this criterion. So in networking, the cost of one flow’s behaviour Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “(…) if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms.” COST, NOT BENEFIT
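The proportional-blame claim is easy to check with a toy queue. In the sketch below (our construction; the uniform-random drop is an idealization of the behaviour the paper describes), a flow with twice the bit rate collects roughly twice the losses:

```python
import random

def drop_counts(rates, n_drops=10_000, seed=1):
    """Drop packets uniformly at random from a queue whose occupancy is
    proportional to each flow's bit rate; count the losses per flow."""
    rng = random.Random(seed)
    queue = [flow for flow, rate in enumerate(rates) for _ in range(rate)]
    losses = [0] * len(rates)
    for _ in range(n_drops):
        losses[rng.choice(queue)] += 1
    return losses

print(drop_counts([100, 200]))   # roughly [3333, 6667]: losses track rate
```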
time rate V O L U M E C A P P I N G
time rate V O L U M E C A P P I N G not much faster
time rate V O L U M E C A P P I N G not much faster waste
time rate R A T E L I M I T I N G
time rate R A T E L I M I T I N G much slower
time rate R A T E L I M I T I N G much slower waste
C O S T F A I R N E S S c2(t) c1(t) congestion rate reflects cost integrates correctly verifiable across network borders
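As a concrete contrast with volume caps and rate limits, a cost-fair operator would meter congestion volume against an allowance and otherwise leave rate alone. The sketch below is our own, deliberately simplified illustration of that idea, loosely inspired by Briscoe's argument; the class, its interface, and the numbers are all hypothetical:

```python
class CongestionVolumePolicer:
    """Debit each user's sent bits weighted by path congestion; traffic
    sent while the path is uncongested costs nothing."""

    def __init__(self, allowance_bits):
        self.budget = allowance_bits

    def on_send(self, bits, path_congestion):
        self.budget -= bits * path_congestion
        return self.budget >= 0   # False once the user's cost share is spent

p = CongestionVolumePolicer(allowance_bits=10_000)
print(p.on_send(1_000_000, 0.00))  # True: any volume is free when uncongested
print(p.on_send(50_000, 0.10))     # True: 5,000 bits of congestion debited
print(p.on_send(80_000, 0.10))     # False: the 10,000-bit allowance is spent
```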
time rate W E I G H T E D C O S T
congestion marking starts. Such operators continually receive information on how much real demand there is for capacity while collecting revenue to repay their investments. Such congestion marking controls demand without risk of actual congestion deteriorating service. Once a cost is assigned to congestion that equates to the cost of alleviating it, users will only cause congestion if they want extra capacity enough to be willing to pay its cost. Of course, there will be no need to be too precise about that rule. Perhaps some people might be allowed to get more than they pay for and others less. Perhaps some people will be prepared to pay for what others get, and so on. But, in a system the size of the Internet, there has to be some handle to arbitrate how much cost some users cause to others. Flow rate fairness comes nowhere near being up to the job. It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system — in the real world where everyone grudgingly accepts that fairness usually means “you get what you pay for”. Note that we use the phrase “you get what you pay for” not just “you pay for what you get”. In Kelly’s original formulation, users had to pay for the congestion they caused, which was unlikely to be taken up commercially. But the reason we are revitalising Kelly’s work is that recent advances (§4.3.2) should allow ISPs to keep their popular flat fee pricing packages by aiming to ensure that users cannot cause more congestion costs than their flat fee pays for. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
congestion marking starts. Such operators continually receive information on how much real demand there is for capacity while collecting revenue to repay their investments. Such congestion marking controls demand without risk of actual congestion deteriorating service. Once a cost is assigned to congestion that equates to the cost of alleviating it, users will only cause congestion if they want extra capacity enough to be willing to pay its cost. Of course, there will be no need to be too precise about that rule. Perhaps some people might be allowed to get more than they pay for and others less. Perhaps some people will be prepared to pay for what others get, and so on. But, in a system the size of the Internet, there has to be some handle to arbitrate how much cost some users cause to others. Flow rate fairness comes nowhere near being up to the job. It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system — in the real world where everyone grudgingly accepts that fairness usually means “you get what you pay for”. Note that we use the phrase “you get what you pay for” not just “you pay for what you get”. In Kelly’s original formulation, users had to pay for the congestion they caused, which was unlikely to be taken up commercially. But the reason we are revitalising Kelly’s work is that recent advances (§4.3.2) should allow ISPs to keep their popular flat fee pricing packages by aiming to ensure that users cannot cause more congestion costs than their flat fee pays for. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system” COST, NOT BENEFIT
“How do you share a network?”
H O W M A N Y W O R K A R O U N D S ? “TCP is bad with small flows” batch and re-use connections open parallel connections artificial limits in multitenancy
we still have no idea 2 0 1 6
we know what we have is wrong we still have no idea 2 0 1 6
we know what we have is wrong not broken enough to fix we still have no idea 2 0 1 6