RESOURCE
ALLOCATION
IN
COMPUTER
NETWORKS
joão taveira araújo
@jta
A S S U M P T I O N
“How do you share a network?”
the question
TCP
an answer
(maybe)
A S S U M P T I O N
given an answer
never worked through question
can’t fully understand
T H I S T A L K
‘62
‘74
‘88
‘07
different interpretations
of the same question
foundational papers
O B J E C T I V E S
how did we get here
what assumptions
at what cost
T H I S T A L K
Let us consider the synthesis of a communication
network which will allow several hundred major
communications stations to talk with one another
after an enemy attack. As a criterion of survivability
we elect to use the percentage of stations both
surviving the physical attack and remaining in
electrical connection with the largest single group of
surviving stations. This criterion is chosen as a
conservative measure of the ability of the surviving
stations to operate together as a coherent entity after
the attack. This means that small groups of stations
isolated from the single largest group are considered
to be ineffective.
Although one can draw a wide variety of networks,
they all factor into two components: centralized (or
star) and distributed (or grid or mesh) (see Fig. 1).
The centralized network is obviously vulnerable as
destruction of a single central node destroys
communication between the end stations. In practice,
a mixture of star and mesh components is used to
form communications networks. For example, type
(b) in Fig. 1 shows the hierarchical structure of a set
Paul Baran ’62, On Distributed Communications Networks
INTRODUCTION
“Let us consider the synthesis of
a communication network which
will allow several hundred major
communications stations to talk
with one another after an enemy
attack.”
Paul Baran ’62, On Distributed Communications Networks
Each node and link in the array of Fig. 2 has the capacity and the switching flexibility to allow transmission between any ith station and any jth station, provided a path can be drawn from the ith to the jth station.
Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected.
Node Destruction
Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered.
To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000-weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already been destroyed.
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
“(…) destruction caused by
conventional hardware failure,
the failures would be randomly
distributed through the network.
But, if the disturbance were
caused by enemy attack, the
possible "worst cases" must be
considered.”
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
“To bisect a 32-link network requires
direction of 288 weapons each with
a probability of kill, pk = 0.5, or 160
with a pk = 0.7, to produce over an
0.9 probability of successfully
bisecting the network.”
4. First, extremely survivable networks can be built using a moderately low redundancy of connectivity level. Redundancy levels on the order of only three permit withstanding extremely heavy level attacks with negligible additional loss to communications. Secondly, the survivability curves have sharp break-points. A network of this type will withstand an increasing attack level until a certain point is reached, beyond which the network rapidly deteriorates. Thus, the optimum degree of redundancy can be chosen as a function of the expected level of attack. Further redundancy buys little. The redundancy level required to survive even very heavy attacks is not great--on the order of only three or four times that of the minimum span network.
Link Destruction
In the previous example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network.
Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time--assuming all nodes are working.
Combination Link and Node Destruction
The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the effect of link failure upon a network having 40 per cent of its nodes destroyed. It appears that what would today be regarded as an unreliable link can be used in a distributed network almost as effectively as perfectly reliable links. Figure 7 examines the result of 100 trial cases in order to estimate the probability density distribution of system performance for a mixture of node and link failures. This is the distribution of cases for 20 per cent nodal damage and 35 per cent link damage.
Paul Baran ’62, On Distributed Communications Networks
EXAMINATION OF A DISTRIBUTED NETWORK
We will soon be living in an era in which we cannot
guarantee survivability of any single point. However,
we can still design systems in which system
destruction requires the enemy to pay the price of
destroying n of n stations. If n is made sufficiently
large, it can be shown that highly survivable system
structures can be built - even in the thermonuclear
era. In order to build such networks and systems we
will have to use a large number of elements. We are
interested in knowing how inexpensive these
elements may be and still permit the system to
operate reliably. There is a strong relationship
between element cost and element reliability. To
design a system that must anticipate a worst-case
destruction of both enemy attack and normal system
failures, one can combine the failures expected by
enemy attack together with the failures caused by
normal reliability problems, provided the enemy does
not know which elements are inoperative. Our future
systems design problem is that of building very
reliable systems out of the described set of unreliable
elements at lowest cost. In choosing the
communications links of the future, digital links
appear increasingly attractive by permitting low-cost
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“(…) highly survivable system
structures can be built - even in
the thermonuclear era.”
“(…) have to use a large number
of elements. We are interested
in knowing how inexpensive
these elements may be”
high data rate links in emergencies.[2]
Satellites
The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described.
Variable Data Rate Links
In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently.
Variable Data Rate Users
We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary.
We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands.
Common User
In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system--particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system.
Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected.
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“more economical to share a
common (…) resource optimized
for the handling of data”
Standard Message Block
Present common carrier communications networks,
used for digital transmission, use links and concepts
originally designed for another purpose--voice. These
systems are built around a frequency division
multiplexing link-to-link interface standard. The
standard between links is that of data rate. Time
division multiplexing appears so natural to data
transmission that we might wish to consider an
alternative approach--a standardized message block as
a network interface standard. While a standardized
message block is common in many computer-
communications applications, no serious attempt has
ever been made to use it as a universal standard. A
universally standardized message block would be
composed of perhaps 1024 bits. Most of the message
block would be reserved for whatever type data is to
be transmitted, while the remainder would contain
housekeeping information such as error detection and
routing data, as in Fig. 10.
As we move to the future, there appears to be an
increasing need for a standardized message block for
all-digital communications networks. As data rates
increase, the velocity of propagation over long links
Paul Baran ’62, On Distributed Communications Networks
ON A FUTURE SYSTEM DEVELOPMENT
“Time division multiplexing
appears so natural to data
transmission that we might wish
to consider an alternative
approach - a standardized
message block”
Telecommunications textbooks:
calls arrive according to a Poisson distribution
“How do you share a network?”
priority marking
(defense contractor)
IP type of service field
Act I
AN EXERCISE FOR THE READER
A R P A N E T
Act II
Scientific Positivism
A protocol that supports the sharing of resources that
exist in different packet switching networks is
presented. The protocol provides for variation in
individual network packet sizes, transmission failures,
sequencing, flow control, end-to-end error checking,
and the creation and destruction of logical process-
to-process connections. Some implementation issues
are considered, and problems such as internetwork
routing, accounting, and timeouts are exposed.
In the last few years considerable effort has been
expended on the design and implementation of
packet switching networks [1]-[7],[14],[17]. A principle
reason for developing such networks has been to
facilitate the sharing of computer resources. A packet
communication network includes a transportation
mechanism for delivering data between computers or
between computers and terminals. To make the data
meaningful, computers and terminals share a common
protocol (i.e., a set of agreed upon conventions).
Several protocols have already been developed for this
purpose [8]-[12],[16]. However, these protocols have
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
ABSTRACT
INTRODUCTION
“A protocol that supports the
sharing of resources that exist in
different packet switching
networks is presented.”
packet fragmentation
transmission failures
sequencing
flow control
error checking
connection setup
[Figures from the paper, garbled by extraction. Recoverable captions:]
Fig. 2. Three networks interconnected by two GATEWAYS.
Fig. 3. Internetwork packet format (fields not shown to scale): local header, source, destination, sequence no., byte count, flag field, text, check.
Fig. 5. Creation of segments and packets from messages.
Fig. 6. Segment format (process header and text): source port, destination port, window, ACK.
Fig. 7. Assignment of sequence numbers.
Fig. 8. Internetwork header flag field (End of Message, End of Segment, Release Use of Process/Port, Synchronize to Packet Sequence Number).
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
wat?!?
SEQ and SYN in internetwork header
if there’s an internetwork header, and a
process header, what the hell is TCP?
We suppose that processes wish to communicate in
full duplex with their correspondents using
unbounded but finite length messages. A single
character might constitute the text of a message from
a process to a terminal or vice versa. An entire page of
characters might constitute the text of a message
from a file to a process. A data stream (e.g. a
continuously generated bit string) can be represented
as a sequence of finite length messages.
Within a HOST we assume the existence of a
transmission control program (TCP) which handles
the transmission and acceptance of messages on
behalf of the processes it serves. The TCP is in turn
served by one or more packet switches connected to
the HOST in which the TCP resides. Processes that
want to communicate present messages to the TCP
for transmission, and TCP’s deliver incoming
messages to the appropriate destination processes.
We allow the TCP to break up messages into
segments because the destination may restrict the
amount of data that may arrive, because the local
network may limit the maximum transmission size, or
because the TCP may need to share its resources
among many processes concurrently. Furthermore, we
constrain the length of a segment to an integral
number of 8-bit bytes. This uniformity is most helpful
in simplifying the software needed with HOST
machines of different natural word lengths.
Provision at the process level can be made for
padding a message that is not an integral number of
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
PROCESS LEVEL COMMUNICATION
“Within a HOST we assume the
existence of a transmission
control program (TCP) which
handles transmission”
TCP is a userspace networking stack
SEQ and SYN in internetwork header
No transmission can be 100 percent reliable. We
propose a timeout and positive acknowledgement
mechanism which will allow TCP’s to recover from
packet losses from one HOST to another. A TCP
transmits packets and waits for replies
(acknowledgements) that are carried in the reverse
packet stream. If no acknowledgement for a
particular packet is received, the TCP will retransmit.
It is our expectation that the HOST level
retransmission mechanism, which is described in the
following paragraphs, will not be called upon very
often in practice. Evidence already exists that
individual networks can be effectively constructed
without this feature. However, the inclusion of a
HOST retransmission capability makes it possible to
recover from occasional network problems and allows
a wide range of HOST protocol strategies to be
incorporated. We envision it will occasionally be
invoked to allow HOST accommodation to
infrequent overdemands for limited buffer resources,
and otherwise not used much.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
RETRANSMISSION AND DUPLICATE DETECTION
“No transmission can be 100
percent reliable.”
“retransmission (…) will not be
called upon very often in
practice. Evidence already exists
that individual networks can be
effectively constructed without
this feature.”
TCP is a userspace networking stack
SEQ and SYN in internetwork header
retransmissions are pathological
Any retransmission policy requires some means by
which the receiver can detect duplicate arrivals. Even
if an infinite number of distinct packet sequence
numbers were available, the receiver would still have
the problem of knowing how long to remember
previously received packets in order to detect
duplicates. Matters are complicated by the fact that
only a finite number of distinct sequence numbers are
in fact available, and if they are reused, the receiver
must be able to distinguish between new
transmissions and retransmissions.
A window strategy, similar to that used by the French
CYCLADES system (voie virtuelle transmission
mode [8]) and the ARPANET very distant HOST
connection [18]), is proposed here (see Fig. 10).
Suppose that the sequence number field in the
internetwork header permits sequence numbers to
range from 0 to n − 1. We assume that the sender will
not transmit more than w bytes without receiving an
acknowledgment. The w bytes serve as the window
(see Fig. 11). Clearly, w must be less than n. The rules
for sender and receiver are as follows.
Sender: Let L be the sequence number associated
with the left window edge.
1) The sender transmits bytes from segments whose
text lies between L and up to L + w − 1.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
RETRANSMISSION AND DUPLICATE DETECTION
[Figures from the paper, garbled by extraction. Recoverable captions:]
Fig. 10. The window concept.
Fig. 11. Conceptual TCB format.
On retransmission, the same packet might be broken
into three 200-byte packets going through a different
HOST. Since each byte has a sequence number, there
is no confusion at the receiving TCP. We leave for
later the issue of initially synchronizing the sender
and receiver left window edges and the window size.
Every segment that arrives at the destination TCP is
ultimately acknowledged by returning the sequence
number of the next segment which must be passed to
the process (it may not yet have arrived).
Earlier we described the use of a sequence number
space and window to aid in duplicate detection.
Acknowledgments are carried in the process header
(see Fig. 6) and along with them there is provision for
a “suggested window” which the receiver can use to
control the flow of data from the sender. This is
intended to be the main component of the process
flow control mechanism. The receiver is free to vary
the window size according to any algorithm it desires
so long as the window size never exceeds half the
sequence number space.
This flow control mechanism is exceedingly powerful
and flexible and does not suffer from synchronization
troubles that may be encountered by incremental
buffer allocation schemes [9], [10]. However, it relies
heavily on an effective retransmission strategy. The
receiver can reduce the window even while packets
are en route from the sender whose window is
presently larger. The net effect of this reduction will
be that the receiver may discard incoming packets
(they may be outside the window) and reiterate the
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
“a ‘suggested window’ which
the receiver can use to control
the flow of data from the sender.
This is intended to be the main
component of the process flow
control mechanism.”
FLOW CONTROL
TCP is a userspace networking stack
SEQ and SYN in internetwork header
retransmissions are pathological
the resource is the host
no UDP
“How do you share a network?”
flow control
(systems engineer)
x.25
flow control
diagnostics
connection setup
hop-by-hop reliability
Act II
Scientific Positivism
Act III
HARSH, BITTER REALITY
The authors wish to thank a number of colleagues for
helpful comments during early discussions of
international network protocols, especially R.
Metcalfe, R. Scantlebury, D. Walden, and H.
Zimmerman; D. Davies and L. Pouzin who
constructively commented on the fragmentation and
accounting issues; and S. Crocker who commented on
the creation and destruction of associations.
Vinton G. Cerf and Robert E. Kahn ’74, A Protocol for Packet Network Intercommunication
ACKNOWLEDGEMENTS
“The authors wish to thank (…)
especially R. Metcalfe (…)”
BOB
METCALFE
what if instead
of all this…
x.25
flow control
diagnostics
connection setup
hop-by-hop reliability
???
…i did nothing?
ethernet
R F C 1 2 9 6
[Chart (RFC 1296): number of Internet hosts by year, 1981-1987, rising from a few hundred toward 30,000.]
In October of '86, the Internet had the first of what
became a series of 'congestion collapses'. During this
period, the data throughput from LBL to UC
Berkeley (sites separated by 400 yards and three IMP
hops) dropped from 32 Kbps to 40 bps. Mike Karels
and I were fascinated by this sudden factor-of-
thousand drop in bandwidth and embarked on an
investigation of why things had gotten so bad. We
wondered, in particular, if the 4.3BSD (Berkeley
UNIX) TCP was mis-behaving or if it could be tuned
to work better under abysmal network conditions.
The answer to both of these questions was "yes".
Since that time, we have put seven new algorithms
into the 4BSD TCP:
(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn's clamped retransmit backoff
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
Van Jacobson‘88 Congestion Avoidance and Control
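The first two items on that list can be made concrete in a few lines. The sketch below estimates the RTT mean and deviation and backs the retransmit timer off exponentially; the gains of 1/8 and 1/4 and the 4x deviation multiplier follow the now-conventional choice rather than anything quoted here, and the class name is ours.

class RtoEstimator:
    def __init__(self, first_rtt):
        self.srtt = first_rtt        # smoothed round-trip time
        self.rttvar = first_rtt / 2  # smoothed mean deviation
        self.backoff = 1

    def on_measurement(self, rtt):
        err = rtt - self.srtt
        self.srtt += err / 8                         # (i) track the mean...
        self.rttvar += (abs(err) - self.rttvar) / 4  # ...and the variance
        self.backoff = 1             # a fresh sample resets the backoff

    def on_timeout(self):
        self.backoff *= 2            # (ii) exponential retransmit backoff

    def rto(self):
        # Mean plus a multiple of deviation tolerates RTT variance
        # without retransmitting packets that are merely late.
        return (self.srtt + 4 * self.rttvar) * self.backoff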
“In October of '86, the Internet
had the first of what became a
series of 'congestion collapses’.
(…) were fascinated by this
sudden factor-of-thousand drop
in bandwidth and embarked on
an investigation of why things
had gotten so bad.”
In October of '86, the Internet had the first of what
became a series of 'congestion collapses'. During this
period, the data throughput from LBL to UC
Berkeley (sites separated by 400 yards and three IMP
hops) dropped from 32 Kbps to 40 bps. Mike Karels
and I were fascinated by this sudden factor-of-
thousand drop in bandwidth and embarked on an
investigation of why things had gotten so bad. We
wondered, in particular, if the 4.3BSD (Berkeley
UNIX) TCP was mis-behaving or if it could be tuned
to work better under abysmal network conditions.
The answer to both of these questions was "yes".
Since that time, we have put seven new algorithms
into the 4BSD TCP:
(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn's clamped retransmit backoff
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
Van Jacobson‘88 Congestion Avoidance and Control
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
what if instead
of all this…
aggravating
retransmissions
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
Van Jacobson‘88 Congestion Avoidance and Control
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
Van Jacobson‘88 Congestion Avoidance and Control
“(…) should obey a
‘conservation of packets’
principle”
(vii) fast retransmit
Our measurements and the reports of beta testers
suggest that the final product is fairly good at dealing
with congested conditions on the Internet.
This paper is a brief description of (i) - (v) and the
rationale behind them. (vi) is an algorithm recently
developed by Phil Karn of Bell Communications
Research, described in [KP87]. (vii) is described in a
soon-to-be-published RFC.
Algorithms (i) - (v) spring from one observation: The
flow on a TCP connection (or ISO TP-4 or Xerox NS
SPP connection) should obey a 'conservation of
packets' principle. And, if this principle were obeyed,
congestion collapse would become the exception
rather than the rule. Thus congestion control involves
finding places that violate conservation and fixing
them.
By 'conservation of packets' I mean that for a
connection 'in equilibrium', i.e., running stably with a
full window of data in transit, the packet flow is what
a physicist would call 'conservative': A new packet
isn't put into the network until an old packet leaves.
The physics of flow predicts that systems with this
property should be robust in the face of congestion.
Observation of the Internet suggests that it was not
particularly robust. Why the discrepancy?
There are only three ways for packet conservation to
fail:
1. The connection doesn't get to equilibrium, or
2. A sender injects a new packet before an old packet
has exited, or
3. The equilibrium can't be reached because of
“(…) for a connection 'in
equilibrium', (…) the packet flow
is what a physicist would call
'conservative': A new packet
isn't put into the network until an
old packet leaves.”
Van Jacobson‘88 Congestion Avoidance and Control
“(…) should obey a
‘conservation of packets’
principle”
Van Jacobson‘88 Congestion Avoidance and Control
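A toy model makes the principle concrete. In the sketch below (entirely illustrative; the names are ours), once the initial window is in flight a new packet enters the network only when an ACK reports that an old one has left: the ACK stream becomes the clock.

from collections import deque

def ack_clocked_send(packets, window):
    pending = deque(packets)
    in_flight = deque()

    # Start-up: fill the window (pacing this gently is slow start's job).
    while pending and len(in_flight) < window:
        in_flight.append(pending.popleft())

    # Equilibrium: conservation of packets, one in per one out.
    while in_flight:
        acked = in_flight.popleft()              # oldest packet exits the net
        if pending:
            in_flight.append(pending.popleft())  # its ACK releases a new one
        yield acked

list(ack_clocked_send(range(10), 4)) delivers all ten packets while never holding more than four in flight.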
slow start
Van Jacobson‘88 Congestion Avoidance and Control
congestion
avoidance
Van Jacobson‘88 Congestion Avoidance and Control
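In code, the two algorithms reduce to a few lines acting on one congestion window. The sketch below is a schematic of slow start plus congestion avoidance as commonly implemented, not a transcription of the paper's Appendix B; the constants and names are illustrative.

class CongestionWindow:
    def __init__(self):
        self.cwnd = 1.0        # congestion window, in packets
        self.ssthresh = 64.0   # boundary between the two regimes

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start: doubles every RTT
        else:
            self.cwnd += 1.0 / self.cwnd   # avoidance: +1 packet per RTT

    def on_timeout(self):
        self.ssthresh = max(self.cwnd / 2, 2.0)  # last size known to work
        self.cwnd = 1.0                          # restart the ACK clock

The point the paper labors is visible here: the two branches of on_ack are different algorithms with different objectives, even though both are set in motion by the same timeout.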
[Figure: packet sequence number vs. send time, 0-10 s.]
Same conditions as the previous figure (same time of day, same Suns, same network
path, same buffer and window sizes), except the machines were running the 4.3+TCP
Van Jacobson‘88 Congestion Avoidance and Control
[Figure: packet sequence number vs. send time, 0-10 s.]
Same conditions as the previous figure (same time of day, same Suns, same network
path, same buffer and window sizes), except the machines were running the 4.3+TCP
[Figure: packet sequence number vs. send time, 0-10 s.]
Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5
(the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways
driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7).
Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis
is the sequence number in the packet header. Thus a vertical array of dots indicate
back-to-back packets and two dots with the same y but different x indicate a retransmit.
'Desirable' behavior on this graph would be a relatively smooth line of dots extending
diagonally from the lower left to the upper right. The slope of this line would equal the
available bandwidth. Nothing in this trace resembles desirable behavior.
C O N G E S T I O N C O N T R O L
fix RTT estimator
slow start (slower than flow control)
congestion avoidance
Act IV
Van Jacobson‘88 Congestion Avoidance and Control
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
Van Jacobson‘88 Congestion Avoidance and Control
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
Van Jacobson‘88 Congestion Avoidance and Control
(These are the first two terms in a Taylor series expan-
sion of L(t). There is reason to believe one might even-
tually need a three term, second order model, but not
until the Internet has grown substantially.)
When the network is congested, γ must be large and
the queue lengths will start increasing exponentially.8
The system will stabilize only if the traffic sources throt-
tle back at least as quickly as the queues are growing.
Since a source controls load in a window-based proto-
col by adjusting the size of the window, W, we end up
with the sender policy
On congestion:
Wi = d·Wi-1    (d < 1)
I.e., a multiplicative decrease of the window size (which
becomes an exponential decrease over time if the con-
gestion persists).
If there's no congestion, γ must be near zero and the
load approximately constant. The network announces,
via a dropped packet, when demand is excessive but
says nothing if a connection is using less than its fair
share (since the network is stateless). Thus, a connec-
tion has to increase its bandwidth utilization to find out
the current limit. E.g., you could have been sharing the
path with someone else and converged to a window that
gives you each half the available bandwidth. If she shuts
down, 40% of the bandwidth will be wasted unless your
window size is increased. What should the increase pol-
icy be?
The first thought is to use a symmetric, multiplica-
tive increase, possibly with a longer time constant,
Wi = bWi-1, 1 < b < 1/d. This is a mistake. The result
will oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but it's
tedious to derive. It has to do with the fact that it is
easy to drive the net into saturation but hard for the
net to recover (what [Kle76], chap. 2.1, calls the rush-
hour effect).9 Thus overestimating the available bandwidth
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
Van Jacobson‘88 Congestion Avoidance and Control
(These are the first two terms in a Taylor series expan-
sion of L(t). There is reason to believe one might even-
tually need a three term, second order model, but not
until the Internet has grown substantially.)
When the network is congested, γ must be large and
the queue lengths will start increasing exponentially.8
The system will stabilize only if the traffic sources throt-
tle back at least as quickly as the queues are growing.
Since a source controls load in a window-based proto-
col by adjusting the size of the window, W, we end up
with the sender policy
On congestion:
Wi = d·Wi-1    (d < 1)
I.e., a multiplicative decrease of the window size (which
becomes an exponential decrease over time if the con-
gestion persists).
If there's no congestion, γ must be near zero and the
load approximately constant. The network announces,
via a dropped packet, when demand is excessive but
says nothing if a connection is using less than its fair
share (since the network is stateless). Thus, a connec-
tion has to increase its bandwidth utilization to find out
the current limit. E.g., you could have been sharing the
path with someone else and converged to a window that
gives you each half the available bandwidth. If she shuts
down, 40% of the bandwidth will be wasted unless your
window size is increased. What should the increase pol-
icy be?
The first thought is to use a symmetric, multiplica-
tive increase, possibly with a longer time constant,
Wi = bWi-1, 1 < b < 1/d. This is a mistake. The result
will oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but it's
tedious to derive. It has to do with the fact that it is
easy to drive the net into saturation but hard for the
net to recover (what [Kle76], chap. 2.1, calls the rush-
hour effect).9 Thus overestimating the available bandwidth
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
0.5 and 1 for reasons partially explained in appendix C.
A more complete analysis is in yet another in-progress
paper.
The preceding has probably made the congestion
control algorithm sound hairy but it's not. Like slow-
(…) to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm.11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
4.3BSD window size is eight packets (4 KB). Si-
multaneous conversations between, say, hosts at
Berkeley and hosts at MIT would exceed
the buffer capacity of this path and would
lead12 (…)
u = 1
d = 0.5
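With u = 1 and d = 0.5 the policy is a one-line loop. The sketch below runs it against a hypothetical pipe that holds 16 packets, just to show the resulting sawtooth; the pipe size and starting window are made up for illustration.

u, d, pipe = 1, 0.5, 16
w, trace = 8, []
for rtt in range(20):
    trace.append(w)
    if w > pipe:                 # overflow: the network drops a packet
        w = max(int(d * w), 1)   # multiplicative decrease
    else:
        w = w + u                # additive increase
print(trace)  # [8, 9, ..., 16, 17, 8, 9, ...]: the AIMD sawtooth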
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
“There is an analytic reason for
this but it's tedious to derive.”
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its
time constant, increases so quickly that overestimates
are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the window
size:
On no congestion:
Wi = Wi-1 + u    (u << Wmax)
where Wmax is the pipesize (the delay-bandwidth prod-
uct of the path minus protocol overhead -- i.e., the
largest sensible window for the unloaded path). This
is the additive increase / multiplicative decrease policy
suggested in [JRC87] and the policy we've implemented
in TCP. The only difference between the two implemen-
tations is the choice of constants for d and u. We used
to slow-start in addition to the above. But, because
both congestion avoidance and slow-start are triggered
by a timeout and both manipulate the congestion win-
dow, they are frequently confused. They are actually in-
dependent algorithms with completely different objec-
tives. To emphasize the difference, the two algorithms
have been presented separately even though in prac-
tise they should be implemented together. Appendix B
describes a combined slow-start/congestion avoidance
algorithm. 11
Figures 7 through 12 show the behavior of TCP con-
nections with and without congestion avoidance. Al-
though the test conditions (e.g., 16 KB windows) were
deliberately chosen to stimulate congestion, the test sce-
nario isn't far from common practice: The Arpanet
IMP end-to-end protocol allows at most eight packets
in transit between any pair of gateways. The default
The first thought is to use a symmetric, multiplicative
increase, possibly with a longer time constant, Wi =
bWi-1, 1 < b <1/d. This is a mistake. The result will
oscillate wildly and, on the average, deliver poor
throughput. There is an analytic reason for this but
it's tedious to derive. It has to do with the fact that it
is easy to drive the net into saturation but hard for
the net to recover (what [Kle76], chap. 2.1, calls the
rush-hour effect).9 Thus overestimating the available
bandwidth is costly. But an exponential, almost
regardless of its time constant, increases so quickly
that overestimates are inevitable.
Without justification, I'll state that the best increase
policy is to make small, constant changes to the
window size:
Van Jacobson‘88 Congestion Avoidance and Control
“There is an analytic reason for
this but it's tedious to derive.”
“Without justification, I’ll state
that the best increase policy
(…)”
ADAPTING TO THE PATH:
CONGESTION AVOIDANCE
A reason for using 1⁄2 as the decrease term, as op-
posed to the 7/8 in [JRC87], was the following
handwaving: When a packet is dropped, you're either
starting (or restarting after a drop) or steady-state
sending. If you're starting, you know that half the
current window size 'worked', i.e., that a window's
worth of packets were exchanged with no drops
(slow-start guarantees this). Thus on congestion you
set the window to the largest size that you know
works then slowly increase the size. If the connection
is steady-state running and a packet is dropped, it's
probably because a new connection started up and
took some of your bandwidth. We usually run our
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
Van Jacobson‘88 Congestion Avoidance and Control
WINDOW ADJUSTMENT POLICY
A reason for using 1⁄2 as the decrease term, as op-
posed to the 7/8 in [JRC87], was the following
handwaving: When a packet is dropped, you're either
starting (or restarting after a drop) or steady-state
sending. If you're starting, you know that half the
current window size 'worked', i.e., that a window's
worth of packets were exchanged with no drops
(slow-start guarantees this). Thus on congestion you
set the window to the largest size that you know
works then slowly increase the size. If the connection
is steady-state running and a packet is dropped, it's
probably because a new connection started up and
took some of your bandwidth. We usually run our
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
Van Jacobson‘88 Congestion Avoidance and Control
“A reason for using 1/2 as the
decrease term (…) was the
following handwaving (…)”
WINDOW ADJUSTMENT POLICY
nets with ρ < 0.5 so it's probable that there are now
exactly two conversations sharing the bandwidth. I.e.,
you should reduce your window by half because the
bandwidth available to you has been reduced by half.
And, if there are more than two conversations sharing
the bandwidth, halving your window is conservative -
and being conservative at high traffic intensities is
probably wise.
Although a factor of two change in window size
seems a large performance penalty, in system terms
the cost is negligible: Currently, packets are dropped
only when a large queue has formed. Even with an
[ISO86] 'congestion experienced' bit to force senders
to reduce their windows, we're stuck with the queue
because the bottleneck is running at 100% utilization
with no excess bandwidth available to dissipate the
queue. If a packet is tossed, some sender shuts up for
two RTT, exactly the time needed to empty the
queue. If that sender restarts with the correct window
size, the queue won't reform. Thus the delay has been
reduced to minimum without the system losing any
bottleneck bandwidth.
The 1 packet increase has less justification than the
0.5 decrease. In fact, it's almost certainly too large. If
the algorithm converges to a window size of w, there
are O(w2) packets between drops with an additive
increase policy. We were shooting for an average drop
rate of < 1% and found that on the Arpanet (the worst
case of the four networks we tested), windows
converged to 8-12 packets. This yields 1 packet
increments for a 1% average drop rate.
Van Jacobson‘88 Congestion Avoidance and Control
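That O(w²) figure is easy to check numerically. The sketch below counts the packets delivered in one sawtooth cycle (the window climbs from d·w back to w, sending a window's worth per RTT, and the cycle ends in a single drop); the helper and the printed figures are back-of-envelope arithmetic of ours, not the paper's.

def drop_rate(w, u=1, d=0.5):
    # Packets sent while the window climbs from d*w back up to w.
    packets_per_cycle = sum(range(int(d * w), w + 1, u))
    return 1.0 / packets_per_cycle   # one drop ends each cycle

for w in (8, 12):
    print(w, f"{drop_rate(w):.2%}")
# w = 8 -> ~3.3%, w = 12 -> ~1.6%: windows converging to 8-12 packets
# put 1-packet increments in the neighborhood of the 1% target.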
“A reason for using 1/2 as the
decrease term (…) was the
following handwaving (…)”
“The 1-packet increase has less
justification than the 0.5
decrease. In fact, it's almost
certainly too large.”
WINDOW ADJUSTMENT POLICY
“How do you share a network?”
packet conservation principle
(physicist)
flow rate fairness
(sharing buffer space)
3 0 Y E A R S
improve detection of congestion
improve RTT estimation
faster window adaptation
enforce flow rate fairness
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“This paper is deliberately
destructive.”
INTRODUCTION
flow rate fairness
flow rate fairness
shares the wrong thing
rate
x2(t)
x1(t)
bit rate
S H A R I N G W H A T ?
x1(t) = x2(t)
S H A R I N G B E N E F I T S ?
u1(x)
u2(x)
utility function
u1(t) > u2(t)
S H A R I N G C O S T S ?
S H A R I N G C O S T S ?
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
sunk cost
S H A R I N G C O S T S ?
the marginal cost of bandwidth is 0
sunk cost
ephemeral commodity
S H A R I N G C O S T S ?
c1(t)
c2(t)
S H A R I N G C O S T S ?
c1(t)
c2(t)
x2(t) > x1(t)
higher rate
S H A R I N G C O S T S ?
c1(t)
c2(t)
x2(t) > x1(t)
higher rate
c1(t) = c2(t)
same cost
So in networking, the cost of one flow’s behaviour
depends on the congestion volume it causes which is
the product of its instantaneous flow rate and
congestion on its path, integrated over time. For
instance, if two users are sending at 200kbps and
300kbps into a 450kbps line for 0.5s, congestion is
(200 + 300 − 450)/(200 + 300) = 10% so the congestion
volume each causes is 200k × 10% × 0.5 = 10kb and
15kb respectively.
So cost depends not only on flow rate, but on
congestion as well. Typically congestion might be in
the fractions of a percent but it varies from zero to
tens of percent. So, flow rate can never alone serve as
a measure of cost.
To summarise so far, flow rate is a hopelessly
incorrect proxy both for benefit and for cost. Even if
the intent was to equalise benefits, equalising flow
rates wouldn’t achieve it. Even if the intent was to
equalise costs, equalising flow rates wouldn’t achieve
it.
But actually a realistic resource allocation mechanism
only needs to concern itself with costs. If we set aside
political economy for a moment and use pure
microeconomics, we can use a competitive market to
arbitrate fairness, which handles the benefits side, as
we shall now explain. Then once we have a feasible,
scalable system that at least implements one defined
form of fairness, we will show how to build other
forms of fairness within that.
In life, as long as people cover the cost of their
actions, it is generally considered fair enough. If one
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
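The quoted arithmetic drops straight into a few lines; this simply re-derives the 10 kb and 15 kb figures (variable names are ours).

rates = {"user A": 200e3, "user B": 300e3}   # offered load, bits per second
capacity, duration = 450e3, 0.5              # bps, seconds

offered = sum(rates.values())
congestion = (offered - capacity) / offered  # = 0.10, i.e. 10%

for user, rate in rates.items():
    volume_kb = rate * congestion * duration / 1e3
    print(user, f"{volume_kb:.0f} kb")       # user A: 10 kb, user B: 15 kb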
So in networking, the cost of one flow’s behaviour
depends on the congestion volume it causes which is
the product of its instantaneous flow rate and
congestion on its path, integrated over time. For
instance, if two users are sending at 200kbps and
300kbps into a 450kbps line for 0.5s, congestion is
(200 + 300 − 450)/(200 + 300) = 10% so the congestion
volume each causes is 200k × 10% × 0.5 = 10kb and
15kb respectively.
So cost depends not only on flow rate, but on
congestion as well. Typically congestion might be in
the fractions of a percent but it varies from zero to
tens of percent. So, flow rate can never alone serve as
a measure of cost.
To summarise so far, flow rate is a hopelessly
incorrect proxy both for benefit and for cost. Even if
the intent was to equalise benefits, equalising flow
rates wouldn’t achieve it. Even if the intent was to
equalise costs, equalising flow rates wouldn’t achieve
it.
But actually a realistic resource allocation mechanism
only needs to concern itself with costs. If we set aside
political economy for a moment and use pure
microeconomics, we can use a competitive market to
arbitrate fairness, which handles the benefits side, as
we shall now explain. Then once we have a feasible,
scalable system that at least implements one defined
form of fairness, we will show how to build other
forms of fairness within that.
In life, as long as people cover the cost of their
actions, it is generally considered fair enough. If one
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“(…) flow rate is a hopelessly
incorrect proxy both for benefit and
for cost. Even if the intent was to
equalise benefits, equalising flow
rates wouldn’t achieve it. Even if
the intent was to equalise costs,
equalising flow rates wouldn’t
achieve it.”
COST, NOT BENEFIT
flow rate fairness
shares the wrong thing
rate
flow rate fairness
shares the wrong thing
flow
amongst the wrong entity
x2(t)
x1(t)
bit rate
S H A R I N G A M O N G S T W H A T ?
x1(t) = x2(t)
x2(t)
x1(t)
bit rate
x1(t) = x2(t) = x3(t)
x3(t)
x2(t) + x3(t) > x1(t)
S H A R I N G A M O N G S T W H A T ?
x2(t)
x1(t)
bit rate
x1(t) = x2(t) = x3(t) = x4(t)
x3(t)
x2(t) + x3(t) + x4(t) > x1(t)
S H A R I N G A M O N G S T W H A T ?
x4(t)
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
appear when trying to allocate rate fairly to flows in a
non-co-operative environment. If at every instant a
resource is shared among the flows competing for a
share, any real-world entity can gain by i) creating
more flows than anyone else, and ii) keeping them
going longer than anyone else.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
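The first of those loopholes is easy to quantify: under per-flow allocation an entity's share grows linearly with the number of flows it opens. A minimal sketch (illustrative names):

def per_entity_share(flows_per_entity, capacity=1.0):
    total_flows = sum(flows_per_entity.values())
    return {e: capacity * k / total_flows
            for e, k in flows_per_entity.items()}

print(per_entity_share({"alice": 1, "bob": 1}))   # {'alice': 0.5, 'bob': 0.5}
print(per_entity_share({"alice": 1, "bob": 9}))   # {'alice': 0.1, 'bob': 0.9}
# The same 'fair' per-flow rule yields wildly different outcomes for the
# two principals: bob takes 90% simply by opening nine flows.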
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
appear when trying to allocate rate fairly to flows in a
non-co-operative environment. If at every instant a
resource is shared among the flows competing for a
share, any real-world entity can gain by i) creating
more flows than anyone else, and ii) keeping them
going longer than anyone else.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“It’s equivalent to claiming food
rations are fair because the boxes
are all the same size, irrespective of
how many boxes each person gets
or how often they get them.”
flow rate fairness
shares the wrong thing
flow
amongst the wrong entity
flow rate
shares the wrong thing
fairness
amongst the wrong entity
non-sequitur
Whether the prevailing notion of flow rate fairness
has been the root cause or not, there will certainly be
no solution until the networking community gets its
head out of the sand and understands how unrealistic
its view is, and how important this issue is. Certainly
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
Whether the prevailing notion of flow rate fairness
has been the root cause or not, there will certainly be
no solution until the networking community gets its
head out of the sand and understands how unrealistic
its view is, and how important this issue is. Certainly
fairness is not a question of technical function—any
allocation ‘works’. But getting it hopelessly wrong
badly skews the outcome of conflicts between the
vested interests of real businesses and real people.
But isn’t it a basic article of faith that multiple views
of fairness should be able to co-exist, the choice
depending on policy? Absolutely correct—and we
shall return to how this can be done later. But that
doesn’t mean we have to give the time of day to any
random idea of fairness.
Fair allocation of rates between flows isn’t based on
any respected definition of fairness from philosophy
or the social sciences. It has just gradually become the
way things are done in networking. But it’s actually
self-referential dogma. Or put more bluntly, bonkers.
We expect to be fair to people, groups of people,
institutions, companies - things the security
community would call ‘principals’. But a flow is
merely an information transfer between two
applications. Where does the argument come from
that information transfers should have equal rights?
It’s equivalent to claiming food rations are fair
because the boxes are all the same size, irrespective of
how many boxes each person gets or how often they
get them.
Because flows don’t deserve rights in real life, it is not
surprising that two loopholes the size of barn doors
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“Fair allocation of rates between
flows isn’t based on any respected
definition of fairness from
philosophy or the social sciences. It
has just gradually become the way
things are done in networking.”
This paper is deliberately destructive. It sets out to
destroy an ideology that is blocking progress - the
idea that fairness between multiplexed packet traffic
can be achieved by controlling relative flow rates
alone. Flow rate fairness was the goal behind fair
resource allocation in widely deployed protocols like
weighted fair queuing (WFQ), TCP congestion
control and TCP-friendly rate control [8, 1, 11]. But it
is actually just unsubstantiated dogma to say that
equal flow rates are fair. This is why resource
allocation and accountability keep reappearing on
every list of requirements for the Internet
architecture (e.g. [2]), but never get solved. Obscured
by this broken idea, we wouldn’t know a good
solution from a bad one.
Controlling relative flow rates alone is a completely
impractical way of going about the problem. To be
realistic for large-scale Internet deployment, relative
flow rates should be the outcome of another fairness
mechanism, not the mechanism itself. That other
mechanism should share out the ‘cost’ of one user’s
actions on others—how much each user’s transfers
restrict other transfers, given capacity constraints.
Then flow rates will depend on a deeper level of
fairness that has so far remained unnamed in the
literature, but is best termed ‘cost fairness’.
It really is only the idea of flow rate fairness that
needs destroying—nearly everything we've
engineered can remain. The Internet architecture
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
INTRODUCTION
“Obscured by this broken idea, we
wouldn’t know a good solution
from a bad one.”
what would fair look like?
C O S T F A I R
the cost is congestion
increase with flow rate, but the shape and size of the
function relating the two (the utility function) is
unknown, subjective and private to each user. Flow
rate itself is an extremely inadequate measure for
comparing benefits: user benefit per bit rate might be
ten orders of magnitude different for different types
of flow (e.g. SMS and video). So different applications
might derive completely different benefits from equal
flow rates and equal benefits might be derived from
very different flow rates.
Turning to the cost of a data transfer across a
network, flow rate alone is not the measure of that
either. Cost is also dependent on the level of
congestion on the path. This is counter-intuitive for
some people so we shall explain a little further. Once
a network has been provisioned at a certain size, it
doesn’t cost a network operator any more whether a
user sends more data or not. But if the network
becomes congested, each user restricts every other
user, which can be interpreted as a cost to all - an
externality in economic terms. For any level of
congestion, Kelly showed [20] that the system is
optimal if the blame for congestion is attributed
among all the users causing it, in proportion to their
bit rates. That’s exactly what routers are designed to
do anyway. During congestion, a queue randomly
distributes the losses so all flows see about the same
loss (or ECN marking) rate; if a flow has twice the bit
rate of another it should see twice the losses. In this
respect random early detection (RED [12]) is slightly
fairer than drop tail, but to a first order
approximation they both meet this criterion.
So in networking, the cost of one flow’s behaviour
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
increase with flow rate, but the shape and size of the
function relating the two (the utility function) is
unknown, subjective and private to each user. Flow
rate itself is an extremely inadequate measure for
comparing benefits: user benefit per bit rate might be
ten orders of magnitude different for different types
of flow (e.g. SMS and video). So different applications
might derive completely different benefits from equal
flow rates and equal benefits might be derived from
very different flow rates.
Turning to the cost of a data transfer across a
network, flow rate alone is not the measure of that
either. Cost is also dependent on the level of
congestion on the path. This is counter-intuitive for
some people so we shall explain a little further. Once
a network has been provisioned at a certain size, it
doesn’t cost a network operator any more whether a
user sends more data or not. But if the network
becomes congested, each user restricts every other
user, which can be interpreted as a cost to all - an
externality in economic terms. For any level of
congestion, Kelly showed [20] that the system is
optimal if the blame for congestion is attributed
among all the users causing it, in proportion to their
bit rates. That’s exactly what routers are designed to
do anyway. During congestion, a queue randomly
distributes the losses so all flows see about the same
loss (or ECN marking) rate; if a flow has twice the bit
rate of another it should see twice the losses. In this
respect random early detection (RED [12]) is slightly
fairer than drop tail, but to a first order
approximation they both meet this criterion.
So in networking, the cost of one flow’s behaviour
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“(…) if the network becomes
congested, each user restricts every
other user, which can be
interpreted as a cost to all - an
externality in economic terms.”
COST, NOT BENEFIT
time
rate V O L U M E C A P P I N G
time
rate V O L U M E C A P P I N G
time
rate V O L U M E C A P P I N G not much faster
time
rate V O L U M E C A P P I N G not much faster
waste
time
rate R A T E L I M I T I N G
time
rate R A T E L I M I T I N G
time
rate R A T E L I M I T I N G
much slower
time
rate R A T E L I M I T I N G
much slowerwaste
C O S T F A I R N E S S
c2(t)
c1(t)
congestion
rate
reflects cost
integrates correctly
verifiable across network borders
time
rate W E I G H T E D C O S T
time
rate W E I G H T E D C O S T
causes disproportionate
congestion
causes disproportionate
congestion
“protect customers” /
demand more money
causes disproportionate
congestion
“protect customers” /
demand more money
“not fair”
congestion marking starts. Such operators continually
receive information on how much real demand there
is for capacity while collecting revenue to repay their
investments. Such congestion marking controls
demand without risk of actual congestion
deteriorating service.
Once a cost is assigned to congestion that equates to
the cost of alleviating it, users will only cause
congestion if they want extra capacity enough to be
willing to pay its cost. Of course, there will be no
need to be too precise about that rule. Perhaps some
people might be allowed to get more than they pay
for and others less. Perhaps some people will be
prepared to pay for what others get, and so on. But,
in a system the size of the Internet, there has to be
some handle to arbitrate how much cost some users
cause to others. Flow rate fairness comes nowhere
near being up to the job. It just isn’t realistic to create
a system the size of the Internet and define fairness
within the system without reference to fairness
outside the system — in the real world where
everyone grudgingly accepts that fairness usually
means “you get what you pay for”.
Note that we use the phrase “you get what you pay
for” not just “you pay for what you get”. In Kelly’s
original formulation, users had to pay for the
congestion they caused, which was unlikely to be
taken up commercially. But the reason we are
revitalising Kelly’s work is that recent advances
(§4.3.2) should allow ISPs to keep their popular flat
fee pricing packages by aiming to ensure that users
cannot cause more congestion costs than their flat fee
pays for.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
COST, NOT BENEFIT
congestion marking starts. Such operators continually
receive information on how much real demand there
is for capacity while collecting revenue to repay their
investments. Such congestion marking controls
demand without risk of actual congestion
deteriorating service.
Once a cost is assigned to congestion that equates to
the cost of alleviating it, users will only cause
congestion if they want extra capacity enough to be
willing to pay its cost. Of course, there will be no
need to be too precise about that rule. Perhaps some
people might be allowed to get more than they pay
for and others less. Perhaps some people will be
prepared to pay for what others get, and so on. But,
in a system the size of the Internet, there has to be
some handle to arbitrate how much cost some users
cause to others. Flow rate fairness comes nowhere
near being up to the job. It just isn’t realistic to create
a system the size of the Internet and define fairness
within the system without reference to fairness
outside the system — in the real world where
everyone grudgingly accepts that fairness usually
means “you get what you pay for”.
Note that we use the phrase “you get what you pay
for” not just “you pay for what you get”. In Kelly’s
original formulation, users had to pay for the
congestion they caused, which was unlikely to be
taken up commercially. But the reason we are
revitalising Kelly’s work is that recent advances
(§4.3.2) should allow ISPs to keep their popular flat
fee pricing packages by aiming to ensure that users
cannot cause more congestion costs than their flat fee
pays for.
Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion
“It just isn’t realistic to create a
system the size of the Internet and
define fairness within the system
without reference to fairness
outside the system”
COST, NOT BENEFIT
“How do you share a network?”
cost
(economist)
H O W M A N Y W O R K A R O U N D S ?
“TCP is bad with small flows”
batch and re-use connections
open parallel connections
artificial limits in multitenancy
we still have no idea
2 0 1 6
we know what we have is wrong
we still have no idea
2 0 1 6
we know what we have is wrong
not broken enough to fix
we still have no idea
2 0 1 6
End

  • 25.
    Paul Baran‘62 OnDistributed Communications Networks
  • 26.
    transmission between anyith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 27.
    transmission between anyith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station “(…) destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered.” Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 28.
    stations are saidto be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000- weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 29.
    “To bisect a32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network.” stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of such raids against highly parallel structures causes examination of alternative attack policies. Consider the following uniform raid example. Assume that 2,000 weapons are deployed against a 1000-station network. The stations are so spaced that destruction of two stations with a single weapon is unlikely. Divide the 2,000 weapons into two equal 1000- weapon salvos. Assume any probability of destruction of a single node from a single weapon less than 1.0; for example, 0.5. Each weapon on the first salvo has a 0.5 probability of destroying its target. But, each weapon of the second salvo has only a 0.25 probability, since one-half the targets have already Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 30.
    Each node andlink in the array of Fig. 2 has the capacity and the switching flexibility to allow transmission between any ith station and any jth station, provided a path can be drawn from the ith to the jth station. Starting with a network composed of an array of stations connected as in Fig. 3, an assigned percentage of nodes and links is destroyed. If, after this operation, it is still possible to draw a line to connect the ith station to the jth station, the ith and jth stations are said to be connected. Node Destruction Figure 4 indicates network performance as a function of the probability of destruction for each separate node. If the expected "noise" was destruction caused by conventional hardware failure, the failures would be randomly distributed through the network. But, if the disturbance were caused by enemy attack, the possible "worst cases" must be considered. To bisect a 32-link network requires direction of 288 weapons each with a probability of kill, pk = 0.5, or 160 with a pk = 0.7, to produce over an 0.9 probability of successfully bisecting the network. If hidden alternative command is allowed, then the largest single group would still have an expected value of almost 50 per cent of the initial stations surviving intact. If this raid misjudges complete availability of weapons, or complete knowledge of all links in the cross section, or the effects of the weapons against each and every link, the raid fails. The high risk of Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 31.
    4. First, extremelysurvivable networks can be built using a moderately low redundancy of connectivity level. Redundancy levels on the order of only three permit withstanding extremely heavy level attacks with negligible additional loss to communications. Secondly, the survivability curves have sharp break- points. A network of this type will withstand an increasing attack level until a certain point is reached, beyond which the network rapidly deteriorates. Thus, the optimum degree of redundancy can be chosen as a function of the expected level of attack. Further redundancy buys little. The redundancy level required to survive even very heavy attacks is not great--on the order of only three or four times that of the minimum span network. Link Destruction In the previous example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network. Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time-- assuming all nodes are working. Combination Link and Node Destruction The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 32.
    Link Destruction In theprevious example we have examined network performance as a function of the destruction of the nodes (which are better targets than links). We shall now re-examine the same network, but using unreliable links. In particular, we want to know how unreliable the links may be without further degrading the performance of the network. Figure 5 shows the results for the case of perfect nodes; only the links fail. There is little system degradation caused even using extremely unreliable links--on the order of 50 per cent down-time-- assuming all nodes are working. Combination Link and Node Destruction The worst case is the composite effect of failures of both the links and the nodes. Figure 6 shows the effect of link failure upon a network having 40 per cent of its nodes destroyed. It appears that what would today be regarded as an unreliable link can be used in a distributed network almost as effectively as perfectly reliable links. Figure 7 examines the result of 100 trial cases in order to estimate the probability density distribution of system performance for a mixture of node and link failures. This is the distribution of cases for 20 per cent nodal damage and 35 per cent link damage. Paul Baran‘62 On Distributed Communications Networks EXAMINATION OF A DISTRIB UTE D NETWOR K
  • 35.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 36.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “(…) highly survivable system structures can be built - even in the thermonuclear era.”
  • 37.
    We will soonbe living in an era in which we cannot guarantee survivability of any single point. However, we can still design systems in which system destruction requires the enemy to pay the price of destroying n of n stations. If n is made sufficiently large, it can be shown that highly survivable system structures can be built - even in the thermonuclear era. In order to build such networks and systems we will have to use a large number of elements. We are interested in knowing how inexpensive these elements may be and still permit the system to operate reliably. There is a strong relationship between element cost and element reliability. To design a system that must anticipate a worst-case destruction of both enemy attack and normal system failures, one can combine the failures expected by enemy attack together with the failures caused by normal reliability problems, provided the enemy does not know which elements are inoperative. Our future systems design problem is that of building very reliable systems out of the described set of unreliable elements at lowest cost. In choosing the communications links of the future, digital links appear increasingly attractive by permitting low-cost Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “(…) have to use a large number of elements. We are interested in knowing how inexpensive these elements may be”
  • 38.
    high data ratelinks in emergencies.[2] Satellites The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described. Variable Data Rate Links In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 39.
    high data ratelinks in emergencies.[2] Satellites The problem of building a reliable network using satellites is somewhat similar to that of building a communications network with unreliable links. When a satellite is overhead, the link is operative. When a satellite is not overhead, the link is out of service. Thus, such links are highly compatible with the type of system to be described. Variable Data Rate Links In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 40.
    Variable Data RateLinks In a conventional circuit switched system each of the tandem links requires matched transmission bandwidths. In order to make fullest use of a digital link, the post-error-removal data rate would have to vary, as it is a function of noise level. The problem then is to build a communication network made up of links of variable data rate to use the communication resource most efficiently. Variable Data Rate Users We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 41.
    Variable Data RateUsers We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system. Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 42.
    Variable Data RateUsers We can view both the links and the entry point nodes of a multiple-user all-digital communications system as elements operating at an ever changing data rate. From instant to instant the demand for transmission will vary. We would like to take advantage of the average demand over all users instead of having to allocate a full peak demand channel to each. Bits can become a common denominator of loading for economic charging of customers. We would like to efficiently handle both those users who make highly intermittent bit demands on the network, and those who make long-term continuous, low bit demands. Common User In communications, as in transportation, it is more economical for many users to share a common resource rather than each to build his own system-- particularly when supplying intermittent or occasional service. This intermittency of service is highly characteristic of digital communication requirements. Therefore, we would like to consider the interconnection, one day, of many all-digital links to provide a resource optimized for the handling of data for many potential intermittent users--a new common-user system. Figure 9 demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “more economical to share a common (…) resource optimized for the handling of data”
  • 43.
    common-user system. Figure 9demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected. Standard Message Block Present common carrier communications networks, used for digital transmission, use links and concepts originally designed for another purpose--voice. These systems are built around a frequency division multiplexing link-to-link interface standard. The standard between links is that of data rate. Time division multiplexing appears so natural to data transmission that we might wish to consider an alternative approach--a standardized message block as a network interface standard. While a standardized message block is common in many computer- communications applications, no serious attempt has ever been made to use it as a universal standard. A universally standardized message block would be composed of perhaps 1024 bits. Most of the message block would be reserved for whatever type data is to be transmitted, while the remainder would contain housekeeping information such as error detection and routing data, as in Fig. 10. As we move to the future, there appears to be an increasing need for a standardized message block for all-digital communications networks. As data rates increase, the velocity of propagation over long links Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT
  • 44.
    common-user system. Figure 9demonstrates the basic notion. A wide mixture of different digital transmission links is combined to form a common resource divided among many potential users. But, each of these communications links could possibly have a different data rate. Therefore, we shall next consider how links of different data rates may be interconnected. Standard Message Block Present common carrier communications networks, used for digital transmission, use links and concepts originally designed for another purpose--voice. These systems are built around a frequency division multiplexing link-to-link interface standard. The standard between links is that of data rate. Time division multiplexing appears so natural to data transmission that we might wish to consider an alternative approach--a standardized message block as a network interface standard. While a standardized message block is common in many computer- communications applications, no serious attempt has ever been made to use it as a universal standard. A universally standardized message block would be composed of perhaps 1024 bits. Most of the message block would be reserved for whatever type data is to be transmitted, while the remainder would contain housekeeping information such as error detection and routing data, as in Fig. 10. As we move to the future, there appears to be an increasing need for a standardized message block for all-digital communications networks. As data rates increase, the velocity of propagation over long links Paul Baran‘62 On Distributed Communications Networks ON A FUTURE SYST E M DEVELOPMENT “Time division multiplexing appears so natural to data that we might wish to consider an alternative approach - a standardized message block”
  • 45.
    Telecommunications textbooks arriveat a fire according to a Poisson distribution
  • 46.
    “How do youshare a network?”
  • 47.
  • 49.
    IP type ofservice field
  • 51.
    Act I AN EXERCISETO THE READER
  • 52.
    A R PA N E T
  • 53.
  • 55.
    A protocol thatsupports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed. In the last few years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION
  • 56.
    In the lastfew years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION A protocol that supports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed.
  • 57.
    In the lastfew years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION “A protocol that supports the sharing of resources that exist in different packet switching networks is presented.” A protocol that supports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed.
  • 58.
    A protocol thatsupports the sharing of resources that exist in different packet switching networks is presented. The protocol provides for variation in individual network packet sizes, transmission failures, sequencing, flow control, end-to-end error checking, and the creation and destruction of logical process- to-process connections. Some implementation issues are considered, and problems such as internetwork routing, accounting, and timeouts are exposed. In the last few years considerable effort has been expended on the design and implementation of packet switching networks [1]-[7],[14],[17]. A principle reason for developing such networks has been to facilitate the sharing of computer resources. A packet communication network includes a transportation mechanism for delivering data between computers or between computers and terminals. To make the data meaningful, computer and terminals share a common protocol (i.e, a set of agreed upon conventions). Several protocols have already been developed for this purpose [8]-[12],[16]. However, these protocols have Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ABSTRACT INTRODUCTION packet fragmentation transmission failures sequencing flow control error checking connection setup
  • 59.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A
  • 60.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A
  • 61.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A wat?!?
  • 62.
    Fig. 2. Threenetworks interconnected by two GATEWAYS. may be null) b- Internetwork Header CAL HEADER SOURCE DESTINATION SEQUENCE NO. BYTE COUNTIFLAG FIELD TEXT ICHECK g. 3. Internetworkpacketformat (fields not shown to sc orlc header, is illustrated in Fig. 3 . The source and d ation entries uniforndyand uniquely identifythe add every HOST in the composite network. Addressing is ubject of considerablecomplexitywhichisdiscussed greater detail in the nextsection. Thenext two entr Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication 643 LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header Fig. 5. Creation of segments and packets frommessages. 32 32 16 16 En SourcePortDertinatianIPort Wmdow ACK Text (Field sizes in bits1 ,+JPlOLIIl Hed Fig.6. Segment format (processheader andtext). IEEE TRANSACTIONS ON COMMUNICATIOK byte identification-sequencenumber First Message (SEQ = k) Fig. 7. Assignment of sequencenumbers. LH = Local Header IH = InternetwolX Header CK = Checksum PH = Process Header egments and packets frommessages. 32 16 16 En Wmdow ACK Text (Field sizes in bits1 Hed..LJ format (processheader andtext). the message bythe source for internetworktransmission, the first byte of segment text is used as for the packet. Thebytecount rk header accounts for all the text ocs not include the check-sum bytes ernetxork or process header). e sequence number associated with 16 bits Y E S M S N L _ . .EER I l l I LEnd of Message when set = 1 End of Segmentwhen set = 1 Release Use of ProcessIPortwhen set=l Synchronize to PacketSequence Number wh Fig. 8. Internetworkheader flag field. - 1000 bytes .100101102 . . . I TEXT OFMESSAGE A wat?!?
  • 63.
    SEQ and SYNin internetwork header
  • 64.
    SEQ and SYNin internetwork header if there’s an internetwork header, and a process header, what the hell is TCP?
  • 65.
    We suppose thatprocesses wish to communicate in full duplex with their correspondents using unbounded but finite length messages. A single character might constitute the text of a message from a process to a terminal or vice versa. An entire page of characters might constitute the text of a message from a file to a process. A data stream (e.g. a continuously generated bit string) can be represented as a sequence of finite length messages. Within a HOST we assume that existence of a transmission control program (TCP) which handles the transmission and acceptance of messages on behalf of the processes it serves. The TCP is in turn served by one or more packet switches connected to the HOST in which the TCP resides. Processes that want to communicate present messages to the TCP for transmission, and TCP’s deliver incoming messages to the appropriate destination processes. We allow the TCP to break up messages into segments because the destination may restrict the amount of data that may arrive, because the local network may limit the maximum transmissin size, or because the TCP may need to share its resources among many processes concurrently. Furthermore, we constrain the length of a segment to an integral number of 8-bit bytes. This uniformity is most helpful in simplifying the software needed with HOST machines of different natural word lengths. Provision at the process level can be made for padding a message that is not an integral number of Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication PROCESS LEV EL COMMUNICATION
  • 66.
    We suppose thatprocesses wish to communicate in full duplex with their correspondents using unbounded but finite length messages. A single character might constitute the text of a message from a process to a terminal or vice versa. An entire page of characters might constitute the text of a message from a file to a process. A data stream (e.g. a continuously generated bit string) can be represented as a sequence of finite length messages. Within a HOST we assume that existence of a transmission control program (TCP) which handles the transmission and acceptance of messages on behalf of the processes it serves. The TCP is in turn served by one or more packet switches connected to the HOST in which the TCP resides. Processes that want to communicate present messages to the TCP for transmission, and TCP’s deliver incoming messages to the appropriate destination processes. We allow the TCP to break up messages into segments because the destination may restrict the amount of data that may arrive, because the local network may limit the maximum transmissin size, or because the TCP may need to share its resources among many processes concurrently. Furthermore, we constrain the length of a segment to an integral number of 8-bit bytes. This uniformity is most helpful in simplifying the software needed with HOST machines of different natural word lengths. Provision at the process level can be made for padding a message that is not an integral number of Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication PROCESS LEV EL COMMUNICATION “Within a HOST we assume the existence of a transmission control program (TCP) which handles transmission”
  • 67.
    TCP is auserspace networking stack SEQ and SYN in internetwork header
  • 68.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON
  • 69.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON “No transmission can be 100 percent reliable.”
  • 70.
    No transmission canbe 100 percent reliable. We propose a timeout and positive acknowledgement mechanism which will allow TCP’s to recover from packet losses from one HOST to another. A TCP transmits packets and waits for replies (acknowledgements) that are carried in the reverse packet stream. If no acknowledgement for a particular packet is received, the TCP will retransmit. It is our expectation that the HOST level retransmission mechanism, which is described in the following paragraphs, will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature. However, the inclusion of a HOST retransmission capability makes it possible to recover from occasional network problems and allows a wide range of HOST protocol strategies to be incorporated. We envision it will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON “No transmission can be 100 percent reliable.” “retransmission (…) will not be called upon very often in practice. Evidence already exists that individual networks can be effectively constructed without this feature.”
  • 71.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological
  • 72.
    incorporated. We envisionit will occasionally be invoked to allow HOST accommodation to infrequent overdemands for limited buffer resources, and otherwise not used much. Any retransmission policy requires some means by which the receiver can detect duplicate arrivals. Even if an infinite number of distinct packet sequence numbers were available, the receiver would still have the problem of knowing how long to remember previously received packets in order to detect duplicates. Matters are complicated by the fact that only a finite number of distinct sequence numbers are in fact available, and if they are reused, the receiver must be able to distinguish between new transmissions and retransmissions. A window strategy, similar to that used by the French CYCLADES system (voie virtuelle transmission mode [8]) and the ARPANET very distant HOST connection [18]), is proposed here (see Fig. 10). Suppose that the sequence number field in the internetwork header permits sequence numbers to range from 0 to n − 1. We assume that the sender will not transmit more than w bytes without receiving an acknowledgment. The w bytes serve as the window (see Fig. 11). Clearly, w must be less than n. The rules for sender and receiver are as follows. Sender: Let L be the sequence number associated with the left window edge. 1) The sender transmits bytes from segments whose text lies between L and up to L + w − 1. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication RETRANSMISSION A N D DUPLICATE DETEC TI ON NETWORK INTERCOMMUNICATION 643 SSIONANDDUPLICATE DETECTION e 100 percent reliable. We d positive acknowledgment mecha- TCP’s torecover from packet losses other. A TCP transmits packets and knowledgements) that are carried in eam. If noacknowledgment for a received, theTCP will retransmit. that the HOST level retransmission s described inthe following para- called uponveryofteninpractice. sts2 that individual networks can be d without this feature. However, the retransmissioncapabilitymakes it om occasional network problems and of HOST protocol strategies to be in- ion it will occasionally be invoked to dation to infrequent overdemandsfor es, and otherwise not used much. Left Window Edge I 0 n-1a+w- 1a 1- window -4 I< packet sequence number space -1 Fig. 10. The windowconcept. Source Address I Address Destination I 6 7 8 9 10 Next Read Position End ReadPosition Timeout Fig. 11. Conceptual TCBformat.
  • 73.
    On retransmission, thesame packet might be broken into three 200-byte packets going through a different HOST. Since each byte has a sequence number, there is no confusion at the receiving TCP. We leave for later the issue of initially synchronizing the sender and receiver left window edges and the window size. Every segment that arrives at the destination TCP is ultimately acknowlegded by returning the sequence number of the next segment which must be passed to the process (it may not yet have arrived). Earlier we described the use of a sequence number space and window to aid in duplicate detection. Acknowledgments are carried in the process header (see Fig. 6) and along with them there is provision for a “suggested window” which the receiver can use to control the flow of data from the sender. This is intended to be the main component of the process flow control mechanism. The receiver is free to vary the window size according to any algorithm it desires so long as the window size never exceeds half the sequence number space. This flow control mechanism is exceedingly powerful and flexible and does not suffer from synchronization troubles that may be encountered by incremental buffer allocation schemes [9], [10]. However, it relies heavily on an effective retransmission strategy. The receiver can reduce the window even while packets are en route from the sender whose window is presently larger. The net effect of this reduction will be that the receiver may discard incoming packets (they may be outside the window) and reiterate the Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication NETWORK INTERCOMMUNICATION 643 SSIONANDDUPLICATE DETECTION e 100 percent reliable. We d positive acknowledgment mecha- TCP’s torecover from packet losses other. A TCP transmits packets and knowledgements) that are carried in eam. If noacknowledgment for a received, theTCP will retransmit. that the HOST level retransmission s described inthe following para- called uponveryofteninpractice. sts2 that individual networks can be d without this feature. However, the retransmissioncapabilitymakes it om occasional network problems and of HOST protocol strategies to be in- ion it will occasionally be invoked to dation to infrequent overdemandsfor es, and otherwise not used much. Left Window Edge I 0 n-1a+w- 1a 1- window -4 I< packet sequence number space -1 Fig. 10. The windowconcept. Source Address I Address Destination I 6 7 8 9 10 Next Read Position End ReadPosition Timeout Fig. 11. Conceptual TCBformat. “a ‘suggested window’ which the receiver can use to control the flow of data from the sender. This is intended to be the main component of the process flow control mechanism.” FLOW CONTROL
  • 74.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological the resource is the host
  • 75.
    TCP is auserspace networking stack SEQ and SYN in internetwork header retransmissions are pathological the resource is the host no UDP
  • 77.
    “How do youshare a network?”
  • 78.
  • 79.
  • 80.
  • 82.
  • 83.
    The authors wishto thank a number of colleagues for helpful comments during early discussions of international network protocols, especially R. Metcalfe, R. Scantlebury, D. Walden, and H. Zimmerman; D. Davies and L. Pouzin who constructively commented on the fragmentation and accounting issues; and S. Crocker who commented on the creation and destruction of associations. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ACK NOWLEDGEME NT S
  • 84.
    The authors wishto thank a number of colleagues for helpful comments during early discussions of international network protocols, especially R. Metcalfe, R. Scantlebury, D. Walden, and H. Zimmerman; D. Davies and L. Pouzin who constructively commented on the fragmentation and accounting issues; and S. Crocker who commented on the creation and destruction of associations. Vinton G. Cerf and Robert E. Kahn‘74 A Protocol for Packet Network Intercommunication ACK NOWLEDGEME NT S “The authors wish to thank (…) especially R. Metcalfe (…)”
  • 85.
  • 86.
    what if instead ofall this…
  • 87.
    what if instead ofall this… x.25 flow control diagnostics connection setup hop-by-hop reliability
  • 88.
    what if instead ofall this… x.25 flow control diagnostics connection setup hop-by-hop reliability ??? …i did nothing?
  • 89.
  • 92.
    R F C1 2 9 6 1981 1982 1983 1984 1985 1986 1987 30,000 0 5000 10,000 15,000 20,000 25,000 Year Numberofhosts
  • 93.
    In October of'86, the Internet had the first of what became a series of 'congestion collapses'. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and three IMP hops) dropped from 32 Kbps to 40 bps. Mike Karels and I were fascinated by this sudden factor-of- thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. We wondered, in particular, if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was "yes". Since that time, we have put seven new algorithms into the 4BSDTCP: (i) round-trip-time variance estimation (ii) exponential retransmit timer backoff (iii) slow-start (iv) more aggressive receiver ack policy (v) dynamic window sizing on congestion (vi) Karn's clamped retransmit backoff (vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing Van Jacobson‘88 Congestion Avoidance and Control
  • 94.
    “In October of'86, the Internet had the first of what became a series of 'congestion collapses’. (…) were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad.” In October of '86, the Internet had the first of what became a series of 'congestion collapses'. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and three IMP hops) dropped from 32 Kbps to 40 bps. Mike Karels and I were fascinated by this sudden factor-of- thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. We wondered, in particular, if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was "yes". Since that time, we have put seven new algorithms into the 4BSDTCP: (i) round-trip-time variance estimation (ii) exponential retransmit timer backoff (iii) slow-start (iv) more aggressive receiver ack policy (v) dynamic window sizing on congestion (vi) Karn's clamped retransmit backoff (vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing Van Jacobson‘88 Congestion Avoidance and Control
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). what if instead of all this…
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). what if instead of all this… aggravating retransmissions
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of Van Jacobson‘88 Congestion Avoidance and Control
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of Van Jacobson‘88 Congestion Avoidance and Control “(…) should obey a ‘conservation of packets’ principle”
(vii) fast retransmit Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) - (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [KP87]. (vii) is described in a soon-to-be-published RFC. Algorithms (i) - (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a 'conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them. By 'conservation of packets' I mean that for a connection 'in equilibrium', i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion. Observation of the Internet suggests that it was not particularly robust. Why the discrepancy? There are only three ways for packet conservation to fail: 1. The connection doesn't get to equilibrium, or 2. A sender injects a new packet before an old packet has exited, or 3. The equilibrium can't be reached because of “(…) for a connection 'in equilibrium', (…) the packet flow is what a physicist would call 'conservative': A new packet isn't put into the network until an old packet leaves.” Van Jacobson‘88 Congestion Avoidance and Control “(…) should obey a ‘conservation of packets’ principle”
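The conservation principle is mechanical enough to capture in a few lines. Here is a toy, self-contained sketch (our own construction, not code from the paper) in which a sender keeps at most a window's worth of packets in flight and injects a new one only when an old one has been acknowledged out of the pipe:

```python
import collections

def ack_clocked_transfer(n_packets, window):
    """Simulate a self-clocked sender: a new packet enters the network
    only when an ACK reports that an old packet has left it."""
    pipe = collections.deque()   # packets currently in flight
    sent = acked = 0
    while acked < n_packets:
        # Inject only while conservation allows it.
        while sent < n_packets and (sent - acked) < window:
            pipe.append(sent)
            sent += 1
        pipe.popleft()           # one packet exits the bottleneck...
        acked += 1               # ...and its ACK clocks out the next send
    return sent, acked

print(ack_clocked_transfer(100, 8))   # (100, 100)
```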
Van Jacobson‘88 Congestion Avoidance and Control slow start
Van Jacobson‘88 Congestion Avoidance and Control congestion avoidance
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec] Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP
Van Jacobson‘88 Congestion Avoidance and Control [trace plot: packet sequence number vs. send time, 0–10 sec, 4.3+TCP] Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP [trace plot: packet sequence number vs. send time, 0–10 sec, 4.3BSD TCP] Trace data of the start of a TCP conversation between two Sun 3/50s running SunOS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbs point-to-point link (essentially the setup shown in fig. 7). Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit. 'Desirable' behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior.
C O N G E S T I O N C O N T R O L fix RTT estimator slow start (slower than flow control) congestion avoidance
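The last two fixes fit in a handful of lines. The sketch below is our paraphrase of the combined slow-start/congestion-avoidance behavior the paper describes (and spells out in its Appendix B), with the window measured in segments; it is illustrative, not the 4.3BSD source.

```python
def on_ack(cwnd, ssthresh):
    """Window growth per ACK: exponential below ssthresh (slow start),
    then additive, roughly +1 segment per round trip (congestion avoidance)."""
    if cwnd < ssthresh:
        return cwnd + 1            # slow start: window doubles every RTT
    return cwnd + 1.0 / cwnd       # congestion avoidance: +1 per RTT

def on_congestion(cwnd):
    """Timeout (the congestion signal): keep half the window that worked
    as the new threshold, then restart from one segment."""
    ssthresh = max(cwnd / 2.0, 2)  # multiplicative decrease, d = 0.5
    return 1, ssthresh
```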
Van Jacobson‘88 Congestion Avoidance and Control is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…)
Van Jacobson‘88 Congestion Avoidance and Control is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1
Van Jacobson‘88 Congestion Avoidance and Control (These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.) When the network is congested, γ must be large and the queue lengths will start increasing exponentially. The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy On congestion: Wi = d·Wi−1 (d < 1) I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists). If there's no congestion, γ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but (…) is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1
Van Jacobson‘88 Congestion Avoidance and Control (These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.) When the network is congested, γ must be large and the queue lengths will start increasing exponentially. The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy On congestion: Wi = d·Wi−1 (d < 1) I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists). If there's no congestion, γ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but (…) is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. A more complete analysis is in yet another in-progress paper. The preceding has probably made the congestion control algorithm sound hairy but it's not. (…) u = 1 d = 0.5
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control ADAPTING TO THE PATH: CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control “There is an analytic reason for this but it's tedious to derive.” ADAPTING TO THE PATH: CONGESTION AVOIDANCE
is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: On no congestion: Wi = Wi−1 + u (u ≪ Wmax) where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead -- i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [JRC87] and the policy we've implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix C. (…) to slow-start in addition to the above. But, because both congestion avoidance and slow-start are triggered by a timeout and both manipulate the congestion window, they are frequently confused. They are actually independent algorithms with completely different objectives. To emphasize the difference, the two algorithms have been presented separately even though in practise they should be implemented together. Appendix B describes a combined slow-start/congestion avoidance algorithm. Figures 7 through 12 show the behavior of TCP connections with and without congestion avoidance. Although the test conditions (e.g., 16 KB windows) were deliberately chosen to stimulate congestion, the test scenario isn't far from common practice: The Arpanet IMP end-to-end protocol allows at most eight packets in transit between any pair of gateways. The default (…) The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = b·Wi−1, 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. There is an analytic reason for this but it's tedious to derive. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [Kle76], chap. 2.1, calls the rush-hour effect). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable. Without justification, I'll state that the best increase policy is to make small, constant changes to the window size: Van Jacobson‘88 Congestion Avoidance and Control “There is an analytic reason for this but it's tedious to derive.” “Without justification, I’ll state that the best increase policy (…)” ADAPTING TO THE PATH: CONGESTION AVOIDANCE
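That "tedious to derive" oscillation is easy to see numerically. The toy below (our construction, following the additive-increase/multiplicative-decrease analysis the paper credits to [JRC87]) runs two synchronized flows against a shared capacity: with additive increase the windows converge toward an even split, while a symmetric multiplicative increase preserves whatever imbalance the flows started with.

```python
def run(increase, w1=1.0, w2=8.0, capacity=20.0, rounds=200):
    """Two synchronized flows: both halve on congestion, both grow otherwise."""
    for _ in range(rounds):
        if w1 + w2 > capacity:               # congestion signal
            w1, w2 = 0.5 * w1, 0.5 * w2      # multiplicative decrease
        else:
            w1, w2 = increase(w1), increase(w2)
    return round(w1, 2), round(w2, 2)

print(run(lambda w: w + 1.0))   # additive increase: roughly equal shares
print(run(lambda w: 1.2 * w))   # multiplicative increase: 8:1 ratio persists
```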
A reason for using 1/2 as the decrease term, as opposed to the 7/8 in [JRC87], was the following handwaving: When a packet is dropped, you're either starting (or restarting after a drop) or steady-state sending. If you're starting, you know that half the current window size 'worked', i.e., that a window's worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it's probably because a new connection started up and took some of your bandwidth. We usually run our nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms Van Jacobson‘88 Congestion Avoidance and Control WINDOW ADJUSTMENT POLICY
A reason for using 1/2 as the decrease term, as opposed to the 7/8 in [JRC87], was the following handwaving: When a packet is dropped, you're either starting (or restarting after a drop) or steady-state sending. If you're starting, you know that half the current window size 'worked', i.e., that a window's worth of packets were exchanged with no drops (slow-start guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it's probably because a new connection started up and took some of your bandwidth. We usually run our nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms Van Jacobson‘88 Congestion Avoidance and Control “A reason for using 1/2 as the decrease term (…) was the following handwaving (…)” WINDOW ADJUSTMENT POLICY
nets with ρ < 0.5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced by half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative - and being conservative at high traffic intensities is probably wise. Although a factor of two change in window size seems a large performance penalty, in system terms the cost is negligible: Currently, packets are dropped only when a large queue has formed. Even with an [ISO86] 'congestion experienced' bit to force senders to reduce their windows, we're stuck with the queue because the bottleneck is running at 100% utilization with no excess bandwidth available to dissipate the queue. If a packet is tossed, some sender shuts up for two RTT, exactly the time needed to empty the queue. If that sender restarts with the correct window size, the queue won't reform. Thus the delay has been reduced to minimum without the system losing any bottleneck bandwidth. The 1-packet increase has less justification than the 0.5 decrease. In fact, it's almost certainly too large. If the algorithm converges to a window size of w, there are O(w²) packets between drops with an additive increase policy. We were shooting for an average drop rate of < 1% and found that on the Arpanet (the worst case of the four networks we tested), windows converged to 8-12 packets. This yields 1-packet increments for a 1% average drop rate. Van Jacobson‘88 Congestion Avoidance and Control “A reason for using 1/2 as the decrease term (…) was the following handwaving (…)” “The 1-packet increase has less justification than the 0.5 decrease. In fact, it's almost certainly too large.” WINDOW ADJUSTMENT POLICY
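The O(w²) claim follows from the sawtooth: the window climbs from w/2 back to w one segment per round trip, delivering about a window's worth of packets each RTT. A back-of-envelope check (our sketch, not from the paper) puts the converged windows the authors measured within a small factor of their 1% drop-rate target:

```python
def packets_between_drops(w):
    # Sawtooth from w/2 up to w, +1 segment per RTT: one window of
    # packets delivered per RTT, so roughly sum(w/2..w), about 3w^2/8,
    # packets between consecutive drops.
    return sum(range(w // 2, w + 1))

for w in (8, 12):
    n = packets_between_drops(w)
    print(w, n, f"drop rate ~ {1 / n:.1%}")   # 8 -> ~3.3%, 12 -> ~1.6%
```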
“How do you share a network?”
3 0 Y E A R S improve detection of congestion improve RTT estimation faster window adaptation enforce flow rate fairness
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “This paper is deliberately destructive.” INTRODUCTION
flow rate fairness shares the wrong thing rate
x2(t) x1(t) bit rate S H A R I N G W H A T ? x1(t) = x2(t)
S H A R I N G B E N E F I T S ? u1(x) u2(x) utility function u1(t) > u2(t)
S H A R I N G C O S T S ?
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0 sunk cost
S H A R I N G C O S T S ? the marginal cost of bandwidth is 0 sunk cost ephemeral commodity
S H A R I N G C O S T S ? c1(t) c2(t)
S H A R I N G C O S T S ? c1(t) c2(t) x2(t) > x1(t) higher rate
S H A R I N G C O S T S ? c1(t) c2(t) x2(t) > x1(t) higher rate c1(t) = c2(t) same cost
So in networking, the cost of one flow’s behaviour depends on the congestion volume it causes which is the product of its instantaneous flow rate and congestion on its path, integrated over time. For instance, if two users are sending at 200kbps and 300kbps into a 450kbps line for 0.5s, congestion is (200 + 300 − 450)/(200 + 300) = 10% so the congestion volume each causes is 200k × 10% × 0.5 = 10kb and 15kb respectively. So cost depends not only on flow rate, but on congestion as well. Typically congestion might be in the fractions of a percent but it varies from zero to tens of percent. So, flow rate can never alone serve as a measure of cost. To summarise so far, flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it. But actually a realistic resource allocation mechanism only needs to concern itself with costs. If we set aside political economy for a moment and use pure microeconomics, we can use a competitive market to arbitrate fairness, which handles the benefits side, as we shall now explain. Then once we have a feasible, scalable system that at least implements one defined form of fairness, we will show how to build other forms of fairness within that. In life, as long as people cover the cost of their actions, it is generally considered fair enough. If one Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
So in networking, the cost of one flow’s behaviour depends on the congestion volume it causes which is the product of its instantaneous flow rate and congestion on its path, integrated over time. For instance, if two users are sending at 200kbps and 300kbps into a 450kbps line for 0.5s, congestion is (200 + 300 − 450)/(200 + 300) = 10% so the congestion volume each causes is 200k × 10% × 0.5 = 10kb and 15kb respectively. So cost depends not only on flow rate, but on congestion as well. Typically congestion might be in the fractions of a percent but it varies from zero to tens of percent. So, flow rate can never alone serve as a measure of cost. To summarise so far, flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it. But actually a realistic resource allocation mechanism only needs to concern itself with costs. If we set aside political economy for a moment and use pure microeconomics, we can use a competitive market to arbitrate fairness, which handles the benefits side, as we shall now explain. Then once we have a feasible, scalable system that at least implements one defined form of fairness, we will show how to build other forms of fairness within that. In life, as long as people cover the cost of their actions, it is generally considered fair enough. If one Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “(…) flow rate is a hopelessly incorrect proxy both for benefit and for cost. Even if the intent was to equalise benefits, equalising flow rates wouldn’t achieve it. Even if the intent was to equalise costs, equalising flow rates wouldn’t achieve it.” COST, NOT BENEFIT
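Briscoe's arithmetic is worth making executable; the few lines below (our illustration, with names we chose) reproduce the worked example and make clear that cost scales with both rate and path congestion:

```python
def congestion_volume(rate_kbps, congestion, seconds):
    """Congestion volume in kilobits: rate x path congestion x time."""
    return rate_kbps * congestion * seconds

rates, capacity, duration = (200, 300), 450, 0.5
congestion = (sum(rates) - capacity) / sum(rates)         # = 0.10
print([congestion_volume(r, congestion, duration) for r in rates])
# -> [10.0, 15.0] kb: same 10% congestion, costs in proportion to rate
```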
flow rate fairness shares the wrong thing rate
flow rate fairness shares the wrong thing flow amongst the wrong entity
x2(t) x1(t) bit rate S H A R I N G A M O N G S T W H A T ? x1(t) = x2(t)
x2(t) x1(t) bit rate x1(t) = x2(t) = x3(t) x3(t) x2(t) + x3(t) > x1(t) S H A R I N G A M O N G S T W H A T ?
x2(t) x1(t) bit rate x1(t) = x2(t) = x3(t) = x4(t) x3(t) x2(t) + x3(t) + x4(t) > x1(t) S H A R I N G A M O N G S T W H A T ? x4(t)
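The loophole is pure arithmetic. In the sketch below (ours, with illustrative numbers), equal per-flow shares of a bottleneck mean a user's total throughput is simply proportional to the number of flows they open:

```python
def per_user_throughput(flows_per_user, capacity=100.0):
    """Equal share per *flow*: each user's total scales with flow count."""
    total_flows = sum(flows_per_user)
    return [capacity * n / total_flows for n in flows_per_user]

print(per_user_throughput([1, 3]))   # [25.0, 75.0]: 3 flows beat 1, 'fairly'
```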
fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors appear when trying to allocate rate fairly to flows in a non-cooperative environment. If at every instant a resource is shared among the flows competing for a share, any real-world entity can gain by i) creating more flows than anyone else, and ii) keeping them going longer than anyone else. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors appear when trying to allocate rate fairly to flows in a non-cooperative environment. If at every instant a resource is shared among the flows competing for a share, any real-world entity can gain by i) creating more flows than anyone else, and ii) keeping them going longer than anyone else. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them.”
flow rate fairness shares the wrong thing flow amongst the wrong entity
flow rate shares the wrong thing fairness amongst the wrong entity non-sequitur
Whether the prevailing notion of flow rate fairness has been the root cause or not, there will certainly be no solution until the networking community gets its head out of the sand and understands how unrealistic its view is, and how important this issue is. Certainly fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION
Whether the prevailing notion of flow rate fairness has been the root cause or not, there will certainly be no solution until the networking community gets its head out of the sand and understands how unrealistic its view is, and how important this issue is. Certainly fairness is not a question of technical function—any allocation ‘works’. But getting it hopelessly wrong badly skews the outcome of conflicts between the vested interests of real businesses and real people. But isn’t it a basic article of faith that multiple views of fairness should be able to co-exist, the choice depending on policy? Absolutely correct—and we shall return to how this can be done later. But that doesn’t mean we have to give the time of day to any random idea of fairness. Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking. But it’s actually self-referential dogma. Or put more bluntly, bonkers. We expect to be fair to people, groups of people, institutions, companies - things the security community would call ‘principals’. But a flow is merely an information transfer between two applications. Where does the argument come from that information transfers should have equal rights? It’s equivalent to claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. Because flows don’t deserve rights in real life, it is not surprising that two loopholes the size of barn doors Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “Fair allocation of rates between flows isn’t based on any respected definition of fairness from philosophy or the social sciences. It has just gradually become the way things are done in networking.”
This paper is deliberately destructive. It sets out to destroy an ideology that is blocking progress - the idea that fairness between multiplexed packet traffic can be achieved by controlling relative flow rates alone. Flow rate fairness was the goal behind fair resource allocation in widely deployed protocols like weighted fair queuing (WFQ), TCP congestion control and TCP-friendly rate control [8, 1, 11]. But it is actually just unsubstantiated dogma to say that equal flow rates are fair. This is why resource allocation and accountability keep reappearing on every list of requirements for the Internet architecture (e.g. [2]), but never get solved. Obscured by this broken idea, we wouldn’t know a good solution from a bad one. Controlling relative flow rates alone is a completely impractical way of going about the problem. To be realistic for large-scale Internet deployment, relative flow rates should be the outcome of another fairness mechanism, not the mechanism itself. That other mechanism should share out the ‘cost’ of one user’s actions on others—how much each user’s transfers restrict other transfers, given capacity constraints. Then flow rates will depend on a deeper level of fairness that has so far remained unnamed in the literature, but is best termed ‘cost fairness’. It really is only the idea of flow rate fairness that needs destroying—nearly everything we’ve engineered can remain. The Internet architecture Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion INTRODUCTION “Obscured by this broken idea, we wouldn’t know a good solution from a bad one.”
what would fair look like?
C O S T F A I R the cost is congestion
increase with flow rate, but the shape and size of the function relating the two (the utility function) is unknown, subjective and private to each user. Flow rate itself is an extremely inadequate measure for comparing benefits: user benefit per bit rate might be ten orders of magnitude different for different types of flow (e.g. SMS and video). So different applications might derive completely different benefits from equal flow rates and equal benefits might be derived from very different flow rates. Turning to the cost of a data transfer across a network, flow rate alone is not the measure of that either. Cost is also dependent on the level of congestion on the path. This is counter-intuitive for some people so we shall explain a little further. Once a network has been provisioned at a certain size, it doesn’t cost a network operator any more whether a user sends more data or not. But if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms. For any level of congestion, Kelly showed [20] that the system is optimal if the blame for congestion is attributed among all the users causing it, in proportion to their bit rates. That’s exactly what routers are designed to do anyway. During congestion, a queue randomly distributes the losses so all flows see about the same loss (or ECN marking) rate; if a flow has twice the bit rate of another it should see twice the losses. In this respect random early detection (RED [12]) is slightly fairer than drop tail, but to a first order approximation they both meet this criterion. So in networking, the cost of one flow’s behaviour Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
increase with flow rate, but the shape and size of the function relating the two (the utility function) is unknown, subjective and private to each user. Flow rate itself is an extremely inadequate measure for comparing benefits: user benefit per bit rate might be ten orders of magnitude different for different types of flow (e.g. SMS and video). So different applications might derive completely different benefits from equal flow rates and equal benefits might be derived from very different flow rates. Turning to the cost of a data transfer across a network, flow rate alone is not the measure of that either. Cost is also dependent on the level of congestion on the path. This is counter-intuitive for some people so we shall explain a little further. Once a network has been provisioned at a certain size, it doesn’t cost a network operator any more whether a user sends more data or not. But if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms. For any level of congestion, Kelly showed [20] that the system is optimal if the blame for congestion is attributed among all the users causing it, in proportion to their bit rates. That’s exactly what routers are designed to do anyway. During congestion, a queue randomly distributes the losses so all flows see about the same loss (or ECN marking) rate; if a flow has twice the bit rate of another it should see twice the losses. In this respect random early detection (RED [12]) is slightly fairer than drop tail, but to a first order approximation they both meet this criterion. So in networking, the cost of one flow’s behaviour Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “(…) if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all - an externality in economic terms.” COST, NOT BENEFIT
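The proportional-blame claim is easy to check with a toy queue. In the sketch below (our construction; the uniform-random drop is an idealization of the behaviour the paper describes), a flow with twice the bit rate collects roughly twice the losses:

```python
import random

def drop_counts(rates, n_drops=10_000, seed=1):
    """Drop packets uniformly at random from a queue whose occupancy is
    proportional to each flow's bit rate; count the losses per flow."""
    rng = random.Random(seed)
    queue = [flow for flow, rate in enumerate(rates) for _ in range(rate)]
    losses = [0] * len(rates)
    for _ in range(n_drops):
        losses[rng.choice(queue)] += 1
    return losses

print(drop_counts([100, 200]))   # roughly [3333, 6667]: losses track rate
```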
time rate V O L U M E C A P P I N G
time rate V O L U M E C A P P I N G not much faster
time rate V O L U M E C A P P I N G not much faster waste
time rate R A T E L I M I T I N G
time rate R A T E L I M I T I N G much slower
time rate R A T E L I M I T I N G much slower waste
C O S T F A I R N E S S c2(t) c1(t) congestion rate reflects cost integrates correctly verifiable across network borders
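As a concrete contrast with volume caps and rate limits, a cost-fair operator would meter congestion volume against an allowance and otherwise leave rate alone. The sketch below is our own, deliberately simplified illustration of that idea, loosely inspired by Briscoe's argument; the class, its interface, and the numbers are all hypothetical:

```python
class CongestionVolumePolicer:
    """Debit each user's sent bits weighted by path congestion; traffic
    sent while the path is uncongested costs nothing."""

    def __init__(self, allowance_bits):
        self.budget = allowance_bits

    def on_send(self, bits, path_congestion):
        self.budget -= bits * path_congestion
        return self.budget >= 0   # False once the user's cost share is spent

p = CongestionVolumePolicer(allowance_bits=10_000)
print(p.on_send(1_000_000, 0.00))  # True: any volume is free when uncongested
print(p.on_send(50_000, 0.10))     # True: 5,000 bits of congestion debited
print(p.on_send(80_000, 0.10))     # False: the 10,000-bit allowance is spent
```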
time rate W E I G H T E D C O S T
congestion marking starts. Such operators continually receive information on how much real demand there is for capacity while collecting revenue to repay their investments. Such congestion marking controls demand without risk of actual congestion deteriorating service. Once a cost is assigned to congestion that equates to the cost of alleviating it, users will only cause congestion if they want extra capacity enough to be willing to pay its cost. Of course, there will be no need to be too precise about that rule. Perhaps some people might be allowed to get more than they pay for and others less. Perhaps some people will be prepared to pay for what others get, and so on. But, in a system the size of the Internet, there has to be some handle to arbitrate how much cost some users cause to others. Flow rate fairness comes nowhere near being up to the job. It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system — in the real world where everyone grudgingly accepts that fairness usually means “you get what you pay for”. Note that we use the phrase “you get what you pay for” not just “you pay for what you get”. In Kelly’s original formulation, users had to pay for the congestion they caused, which was unlikely to be taken up commercially. But the reason we are revitalising Kelly’s work is that recent advances (§4.3.2) should allow ISPs to keep their popular flat fee pricing packages by aiming to ensure that users cannot cause more congestion costs than their flat fee pays for. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion COST, NOT BENEFIT
congestion marking starts. Such operators continually receive information on how much real demand there is for capacity while collecting revenue to repay their investments. Such congestion marking controls demand without risk of actual congestion deteriorating service. Once a cost is assigned to congestion that equates to the cost of alleviating it, users will only cause congestion if they want extra capacity enough to be willing to pay its cost. Of course, there will be no need to be too precise about that rule. Perhaps some people might be allowed to get more than they pay for and others less. Perhaps some people will be prepared to pay for what others get, and so on. But, in a system the size of the Internet, there has to be some handle to arbitrate how much cost some users cause to others. Flow rate fairness comes nowhere near being up to the job. It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system — in the real world where everyone grudgingly accepts that fairness usually means “you get what you pay for”. Note that we use the phrase “you get what you pay for” not just “you pay for what you get”. In Kelly’s original formulation, users had to pay for the congestion they caused, which was unlikely to be taken up commercially. But the reason we are revitalising Kelly’s work is that recent advances (§4.3.2) should allow ISPs to keep their popular flat fee pricing packages by aiming to ensure that users cannot cause more congestion costs than their flat fee pays for. Bob Briscoe‘07 Flow Rate Fairness: Dismantling A Religion “It just isn’t realistic to create a system the size of the Internet and define fairness within the system without reference to fairness outside the system” COST, NOT BENEFIT
“How do you share a network?”
H O W M A N Y W O R K A R O U N D S ? “TCP is bad with small flows” batch and re-use connections open parallel connections artificial limits in multitenancy
we still have no idea 2 0 1 6
we know what we have is wrong we still have no idea 2 0 1 6
we know what we have is wrong not broken enough to fix we still have no idea 2 0 1 6