arXiv:1509.05393v2 [cs.DC] 18 Sep 2015
A Critique of the CAP Theorem
Martin Kleppmann
Abstract
The CAP Theorem is a frequently cited impossibility
result in distributed systems, especially among NoSQL
distributed databases. In this paper we survey some
of the confusion about the meaning of CAP, includ-
ing inconsistencies and ambiguities in its definitions,
and we highlight some problems in its formalization.
CAP is often interpreted as proof that eventually con-
sistent databases have better availability properties than
strongly consistent databases; although there is some
truth in this, we show that more careful reasoning is
required. These problems cast doubt on the utility of
CAP as a tool for reasoning about trade-offs in practi-
cal systems. As an alternative to CAP, we propose a delay-
sensitivity framework, which analyzes the sensitivity of
operation latency to network delay, and which may help
practitioners reason about the trade-offs between con-
sistency guarantees and tolerance of network faults.
1 Background
Replicated databases maintain copies of the same data
on multiple nodes, potentially in disparate geographical
locations, in order to tolerate faults (failures of nodes or
communication links) and to provide lower latency to
users (requests can be served by a nearby site). How-
ever, implementing reliable, fault-tolerant applications
in a distributed system is difficult: if there are multi-
ple copies of the data on different nodes, they may be
inconsistent with each other, and an application that is
not designed to handle such inconsistencies may pro-
duce incorrect results.
In order to provide a simpler programming model
to application developers, the designers of distributed
data systems have explored various consistency guar-
antees that can be implemented by the database infras-
tructure, such as linearizability [30], sequential consis-
tency [38], causal consistency [4] and pipelined RAM
(PRAM) [42]. When multiple processes execute opera-
tions on a shared storage abstraction such as a database,
a consistency model describes what values are allowed
to be returned by operations accessing the storage, de-
pending on other operations executed previously or
concurrently, and the return values of those operations.
Similar concerns arise in the design of multipro-
cessor computers, which are not geographically dis-
tributed, but nevertheless present inconsistent views of
memory to different threads, due to the various caches
and buffers employed by modern CPU architectures.
For example, x86 microprocessors provide a level of
consistency that is weaker than sequential, but stronger
than causal consistency [48]. However, in this paper we
focus our attention on distributed systems that must tol-
erate partial failures and unreliable network links.
A strong consistency model like linearizability pro-
vides an easy-to-understand guarantee: informally, all
operations behave as if they executed atomically on a
single copy of the data. However, this guarantee comes
at the cost of reduced performance [6] and fault toler-
ance [22] compared to weaker consistency models. In
particular, as we discuss in this paper, algorithms that
ensure stronger consistency properties among replicas
are more sensitive to message delays and faults in the
network. Many real computer networks are prone to
unbounded delays and lost messages [10], making the
fault tolerance of distributed consistency algorithms an
important issue in practice.
A network partition is a particular kind of communi-
cation fault that splits the network into subsets of nodes
such that nodes in one subset cannot communicate with
nodes in another. As long as the partition exists, any
data modifications made in one subset of nodes cannot
be visible to nodes in another subset, since all messages
between them are lost. Thus, an algorithm that main-
tains the illusion of a single copy may have to delay
operations until the partition is healed, to avoid the risk
of introducing inconsistent data in different subsets of
nodes.
This trade-off was already known in the 1970s [22,
23, 32, 40], but it was rediscovered in the early 2000s,
when the web's growing commercial popularity made
geographic distribution and high availability important
to many organizations [18, 50]. It was originally called
the CAP Principle by Fox and Brewer [16, 24], where
CAP stands for Consistency, Availability and Partition
tolerance. After the principle was formalized by Gilbert
and Lynch [25, 26] it became known as the CAP Theo-
rem.
CAP became an influential idea in the NoSQL move-
ment [50], and was adopted by distributed systems prac-
titioners to critique design decisions [31]. It provoked
a lively debate about trade-offs in data systems, and
encouraged system designers to challenge the received
wisdom that strong consistency guarantees were essen-
tial for databases [17].
The rest of this paper is organized as follows: in sec-
tion 2 we compare various definitions of consistency,
availability and partition tolerance. We then examine
the formalization of CAP by Gilbert and Lynch [25]
in section 3. Finally, in section 4 we discuss some al-
ternatives to CAP that are useful for reasoning about
trade-offs in distributed systems.
2 CAP Theorem Definitions
CAP was originally presented in the form of "consis-
tency, availability, partition tolerance: pick any two"
(i.e. you can have CA, CP or AP, but not all three). Sub-
sequent debates concluded that this formulation is mis-
leading [17, 28, 47], because the distinction between
CA and CP is unclear, as detailed later in this section.
Many authors now prefer the following formulation: if
there is no network partition, a system can be both con-
sistent and available; when a network partition occurs,
a system must choose between either consistency (CP)
or availability (AP).
Some authors [21, 41] define a CP system as one in
which a majority of nodes on one side of a partition
can continue operating normally, and a CA system as
one that may fail catastrophically under a network par-
tition (since it is designed on the assumption that parti-
tions are very rare). However, this definition is not uni-
versally agreed, since it is counter-intuitive to label a
system as "available" if it fails catastrophically under a
partition, while a system that continues partially operat-
ing in the same situation is labelled "unavailable" (see
section 2.3).
Disagreement about the definitions of terms like
availability is the source of many misunderstandings
about CAP, and unclear definitions lead to problems
with its formalization as a theorem. In sections 2.1 to
2.3 we survey various definitions that have been pro-
posed.
2.1 Availability
In practical engineering terms, availability usually
refers to the proportion of time during which a service
is able to successfully handle requests, or the propor-
tion of requests that receive a successful response. A
response is usually considered successful if it is valid
(not an error, and satisfies the database's safety proper-
ties) and it arrives at the client within some timeout,
which may be specified in a service level agreement
(SLA). Availability in this sense is a metric that is em-
pirically observed during a period of a service's oper-
ation. A service may be available (up) or unavailable
(down) at any given time, but it is nonsensical to say
that some software package or algorithm is 'available'
or 'unavailable' in general, since the uptime percentage
is only known in retrospect, after a period of operation
(during which various faults may have occurred).
There is a long tradition of highly available and fault-
tolerant systems, whose algorithms are designed such
that the system can remain available (up) even when
some part of the system is faulty, thus increasing the
expected mean time to failure (MTTF) of the system as
a whole. Using such a system does not automatically
make a service 100% available, but it may increase the
observed availability during operation, compared to us-
ing a system that is not fault-tolerant.
2.1.1 The A in CAP
Does the A in CAP refer to a property of an algorithm,
or to an observed metric during system operation? This
distinction is unclear. Brewer does not offer a precise
deļ¬nition of availability, but states that ā€œavailability is
obviously continuous from 0 to 100 percentā€ [17], sug-
gesting an observed metric. Fox and Brewer also use
the term yield to refer to the proportion of requests that
are completed successfully [24] (without specifying any
timeout).
On the other hand, Gilbert and Lynch [25] write: "For
a distributed system to be continuously available, ev-
ery request received by a non-failing node in the sys-
tem must result in a response".1 In order to prove a re-
1This sentence appears to define a property of continuous avail-
ability, but the rest of the paper does not refer to this "continuous"
aspect.
sult about systems in general, this definition interprets
availability as a property of an algorithm, not as an ob-
served metric during system operation – i.e. they de-
fine a system as being "available" or "unavailable" stat-
ically, based on its algorithms, not its operational status
at some point in time.
One particular execution of the algorithm is avail-
able if every request in that execution eventually re-
ceives a response. Thus, an algorithm is "available" un-
der Gilbert and Lynch's definition if all possible exe-
cutions of the algorithm are available. That is, the al-
gorithm must guarantee that requests always result in
responses, no matter what happens in the system (see
section 2.3.1).
Note that Gilbert and Lynch's definition requires any
non-failed node to be able to generate valid responses,
even if that node is completely isolated from the other
nodes. This definition is at odds with Fox and Brewer's
original proposal of CAP, which states that "data is con-
sidered highly available if a given consumer of the data
can always reach some replica" [24, emphasis original].
Many so-called highly available or fault-tolerant sys-
tems have very high uptime in practice, but are in fact
"unavailable" under Gilbert and Lynch's definition [34]:
for example, in a system with an elected leader or pri-
mary node, if a client cannot reach the leader due to a
network fault, the client cannot perform any writes, even
though it may be able to reach another replica.
2.1.2 No maximum latency
Note that Gilbert and Lynch's definition of availability
does not specify any upper bound on operation latency:
it only requires requests to eventually return a response
within some unbounded but finite time. This is conve-
nient for proof purposes, but does not closely match our
intuitive notion of availability (in most situations, a ser-
vice that takes a week to respond might as well be con-
sidered unavailable).
This deļ¬nition of availability is a pure liveness prop-
erty, not a safety property [5]: that is, at any point in
time, if the response has not yet arrived, there is still
hope that the availability property might still be ful-
ļ¬lled, because the response may yet arrive ā€“ it is never
too late. This aspect of the deļ¬nition will be impor-
tant in section 3, when we examine Gilbert and Lynchā€™s
proofs in more detail.
(In section 4 we will discuss a definition of availabil-
ity that takes latency into account.)
2.1.3 Failed nodes
Another noteworthy aspect of Gilbert and Lynch's def-
inition of availability is the proviso of applying only to
non-failed nodes. This allows the aforementioned defi-
nition of a CA system as one that fails catastrophically
if a network partition occurs: if the partition causes all
nodes to fail, then the availability requirement does not
apply to any nodes, and thus it is trivially satisfied, even
if no node is able to respond to any requests. This defini-
tion is logically sound, but somewhat counter-intuitive.
2.2 Consistency
Consistency is also an overloaded word in data sys-
tems: consistency in the sense of ACID is a very differ-
ent property from consistency in CAP [17]. In the dis-
tributed systems literature, consistency is usually under-
stood as not one particular property, but as a spectrum of
models with varying strengths of guarantee. Examples
of such consistency models include linearizability [30],
sequential consistency [38], causal consistency [4] and
PRAM [42].
There is some similarity between consistency mod-
els and transaction isolation models such as serializ-
ability [15], snapshot isolation [14], repeatable read and
read committed [3, 27]. Both describe restrictions on
the values that operations may return, depending on
other (prior or concurrent) operations. The difference is
that transaction isolation models are usually formalized
assuming a single replica, and operate at the granular-
ity of transactions (each transaction may read or write
multiple objects). Consistency models assume multi-
ple replicas, but are usually defined in terms of single-
object operations (not grouped into transactions). Bailis
et al. [12] demonstrate a unified framework for reason-
ing about both distributed consistency and transaction
isolation in terms of CAP.
2.2.1 The C in CAP
Fox and Brewer [24] define the C in CAP as one-
copy serializability (1SR) [15], whereas Gilbert and
Lynch [25] define it as linearizability. Those defini-
tions are not identical, but fairly similar.2 Both are
2Linearizability is a recency guarantee, whereas 1SR is not. 1SR
requires isolated execution of multi-object transactions, which lin-
earizability does not. Both require coordination, in the sense of sec-
tion 4.4 [12].
safety properties [5], i.e. restrictions on the possible ex-
ecutions of the system, ensuring that certain situations
never occur.
In the case of linearizability, the situation that may
not occur is a stale read: stated informally, once a write
operation has completed or some read operation has re-
turned a new value, all following read operations must
return the new value, until it is overwritten by another
write operation. Gilbert and Lynch observe that if the
write and read operations occur on different nodes, and
those nodes cannot communicate during the time when
those operations are being executed, then the safety
property cannot be satisfied, because the read operation
cannot know about the value written.
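To make the stale-read condition concrete, the following sketch (our own illustration with hypothetical values and timestamps, not taken from Gilbert and Lynch) shows a two-operation history that violates linearizability: the read is invoked only after the write has completed, yet it still returns the old value.

    # A minimal two-operation history that violates linearizability.
    # Each operation records its invocation and completion times; the read
    # starts after the write has completed, so a linearizable register
    # would be obliged to return the newly written value "v2".
    history = [
        {"op": "write", "arg": "v2", "invoked": 0, "completed": 10},
        {"op": "read", "result": "v1", "invoked": 15, "completed": 20},  # stale read
    ]

    def has_stale_read(history, old_value, new_value):
        """True if some read returns old_value although it was invoked only
        after a write of new_value had already completed."""
        writes = [op for op in history
                  if op["op"] == "write" and op["arg"] == new_value]
        reads = [op for op in history
                 if op["op"] == "read" and op["result"] == old_value]
        return any(r["invoked"] > w["completed"] for w in writes for r in reads)

    assert has_stale_read(history, "v1", "v2")  # hence not linearizable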
The C of CAP is sometimes referred to as strong con-
sistency (a term that is not formally defined), and con-
trasted with eventual consistency [9, 49, 50], which is
often regarded as the weakest level of consistency that
is useful to applications. Eventual consistency means
that if a system stops accepting writes and sufficient3
communication occurs, then eventually all replicas will
converge to the same value. However, as the aforemen-
tioned list of consistency models indicates, it is overly
simplistic to cast 'strong' and eventual consistency as
the only possible choices.
2.2.2 Probabilistic consistency
It is also possible to define consistency as a quantitative
metric rather than a safety property. For example, Fox
and Brewer [24] define harvest as "the fraction of the
data reflected in the response, i.e. the completeness of
the answer to the query," and probabilistically bounded
staleness [11] studies the probability of a read returning
a stale value, given various assumptions about the distri-
bution of network latencies. However, these stochastic
definitions of consistency are not the subject of CAP.
2.3 Partition Tolerance
A network partition has long been defined as a com-
munication failure in which the network is split into
disjoint sub-networks, with no communication possible
across sub-networks [32]. This is a fairly narrow class
of fault, but it does occur in practice [10], so it is worth
studying.
3It is not clear what amount of communication is 'sufficient'. A
possible formalization would be to require all replicas to converge to
the same value within finite time, assuming fair-loss links (see section
2.3.2).
2.3.1 Assumptions about system model
It is less clear what partition tolerance means. Gilbert
and Lynch [25] define a system as partition-tolerant
if it continues to satisfy the consistency and availabil-
ity properties in the presence of a partition. Fox and
Brewer [24] define partition-resilience as "the system
as a whole can survive a partition between data replicas"
(where survive is not defined).
At ļ¬rst glance, these deļ¬nitions may seem redundant:
if we say that an algorithm provides some guarantee
(e.g. linearizability), then we expect all executions of
the algorithm to satisfy that property, regardless of the
faults that occur during the execution.
However, we can clarify the definitions by observing
that the correctness of a distributed algorithm is always
subject to assumptions about the faults that may occur
during its execution. If you take an algorithm that as-
sumes fair-loss links and crash-stop processes, and sub-
ject it to Byzantine faults, the execution will most likely
violate safety properties that were supposedly guaran-
teed. These assumptions are typically encoded in a sys-
tem model, and non-Byzantine system models rule out
certain kinds of fault as impossible (so algorithms are
not expected to tolerate them).
Thus, we can interpret partition tolerance as mean-
ing "a network partition is among the faults that are
assumed to be possible in the system." Note that this
definition of partition tolerance is a statement about the
system model, whereas consistency and availability are
properties of the possible executions of an algorithm. It
is misleading to say that an algorithm "provides parti-
tion tolerance," and it is better to say that an algorithm
"assumes that partitions may occur."
If an algorithm assumes the absence of partitions, and
is nevertheless subjected to a partition, it may violate
its guarantees in arbitrarily undefined ways (including
failing to respond even after the partition is healed, or
deleting arbitrary amounts of data). Even though it may
seem that such arbitrary failure semantics are not very
useful, various systems exhibit such behavior in prac-
tice [35, 36]. Making networks highly reliable is very
expensive [10], so most distributed programs must as-
sume that partitions will occur sooner or later [28].
2.3.2 Partitions and fair-loss links
Further confusion arises due to the fact that network
partitions are only one of a wide range of faults that can
occur in distributed systems, including nodes failing or
restarting, nodes pausing for some amount of time (e.g.
due to garbage collection), and loss or delay of mes-
sages in the network. Some faults can be modeled in
terms of other faults (for example, Gilbert and Lynch
state that the loss of an individual message can be mod-
eled as a short-lived network partition).
In the design of distributed systems algorithms,
a commonly assumed system model is fair-loss
links [19]. A network link has the fair-loss property if
the probability of a message not being lost is non-zero,
i.e. the link sometimes delivers messages. The link may
have intervals of time during which all messages are
dropped, but those intervals must be of finite duration.
On a fair-loss link, message delivery can be made re-
liable by retrying a message an unbounded number of
times: the message is guaranteed to be eventually deliv-
ered after a finite number of attempts [19].
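As an illustration of this retransmission argument (a simplified simulation we add here, not the formal construction in [19]), the sketch below resends a message until an acknowledgement arrives; because each attempt over a fair-loss link succeeds with non-zero probability, the loop terminates after a finite but unbounded number of attempts with probability 1.

    import random

    def fair_loss_attempt(message, loss_probability=0.9):
        """One send attempt over a fair-loss link: the message (and its
        acknowledgement) gets through with non-zero probability."""
        return random.random() >= loss_probability

    def reliable_send(message):
        """Retry until acknowledged. On a fair-loss link this terminates
        after a finite, but unbounded, number of attempts."""
        attempts = 0
        while True:
            attempts += 1
            if fair_loss_attempt(message):
                return attempts

    print(reliable_send("hello"))  # e.g. 7 -- finite, but with no fixed upper bound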
We argue that fair-loss links are a good model of
most networks in practice: faults occur unpredictably;
messages are lost while the fault is occurring; the fault
lasts for some finite duration (perhaps seconds, perhaps
hours), and eventually it is healed (perhaps after human
intervention). There is no malicious actor in the network
who can cause systematic message loss over unlimited
periods of time – such malicious actors are usually only
assumed in the design of Byzantine fault tolerant algo-
rithms.
Is "partitions may occur" equivalent to assuming fair-
loss links? Gilbert and Lynch [25] define partitions as
"the network will be allowed to lose arbitrarily many
messages sent from one node to another." In this defini-
tion it is unclear whether the number of lost messages
is unbounded but finite, or whether it is potentially infi-
nite.
Partitions of a finite duration are possible with fair-
loss links, and thus an algorithm that is correct in a sys-
tem model of fair-loss links can tolerate partitions of a
finite duration. Partitions of an infinite duration require
some further thought, as we shall see in section 3.
3 The CAP Proofs
In this section, we build upon the discussion of defini-
tions in the last section, and examine the proofs of the
theorems of Gilbert and Lynch [25]. We highlight some
ambiguities in the reasoning of the proofs, and then sug-
gest a more precise formalization.
3.1 Theorems 1 and 2
Gilbert and Lynch's Theorem 1 is stated as follows:
It is impossible in the asynchronous network model
to implement a read/write data object that guarantees
the following properties:
• Availability
• Atomic consistency4
in all fair executions (including those in which mes-
sages are lost).
Theorem 2 is similar, but specified in a system model
with bounded network delay. The discussion in this sec-
tion 3.1 applies to both theorems.
3.1.1 Availability of failed nodes
The ļ¬rst problem with this proof is the deļ¬nition of
availability. As discussed in section 2.1.3, only non-
failing nodes are required to respond.
If it is possible for the algorithm to declare nodes as
failed (e.g. if a node may crash itself), then the avail-
ability property can be trivially satisfied: all nodes can
be crashed, and thus no node is required to respond. Of
course, such an algorithm would not be useful in prac-
tice. Alternatively, if a minority of nodes is permanently
partitioned from the majority, an algorithm could define
the nodes in the minority partition as failed (by crash-
ing them), while the majority partition continues imple-
menting a linearizable register [7].
This is not the intention of CAP – the raison d'ĆŖtre
of CAP is to characterize systems in which a minority
partition can continue operating independently of the
rest – but the present formalization of availability does
not exclude such trivial solutions.
3.1.2 Finite and infinite partitions
Gilbert and Lynch's proofs of theorems 1 and 2 con-
struct an execution of an algorithm A in which a write
is followed by a read, while simultaneously a partition
exists in the network. By showing that the execution is
not linearizable, the authors derive a contradiction.
Note that this reasoning is only correct if we assume
a system model in which partitions may have infinite
duration.
If the system model is based on fair-loss links, then
all partitions may be assumed to be of unbounded but
4In this context, atomic consistency is synonymous with lineariz-
ability, and it is unrelated to the A in ACID.
ļ¬nite duration (section 2.3.2). Likewise, Gilbert and
Lynchā€™s availability property does not place any upper
bound on the duration of an operation, as long as it is ļ¬-
nite (section 2.1.2). Thus, if a linearizable algorithm en-
counters a network partition in a fair-loss system model,
it is acceptable for the algorithm to simply wait for the
partition to be healed: at any point in time, there is still
hope that the partition will be healed in future, and so
the availability property may yet be satisfied. For exam-
ple, the ABD algorithm [7] can be used to implement a
linearizable read-write register in an asynchronous net-
work with fair-loss links.5
On the other hand, in an execution where a partition
of inļ¬nite duration occurs, the algorithm is forced to
make a choice between waiting until the partition heals
(which never happens, thus violating availability) and
exhibiting the execution in the proof of Theorem 1 (thus
violating linearizability). We can conclude that Theo-
rem 1 is only valid in a system model where infinite
partitions are possible.
3.1.3 Linearizability vs. eventual consistency
Note that in the case of an infinite partition, no infor-
mation can ever flow from one sub-network to the other.
Thus, even eventual consistency (replica convergence in
finite time, see section 2.2.1) is not possible in a system
with an infinite partition.
Theorem 1 demonstrated that in a system model with
inļ¬nite partitions, no algorithm exists which ensures
linearizability and availability in all executions. How-
ever, we can also see that in the same system model, no
algorithm exists which ensures eventual consistency in
all executions.
The CAP theorem is often understood as demonstrat-
ing that linearizability cannot be achieved with high
availability, whereas eventual consistency can. How-
ever, the results so far do not differentiate between lin-
earizable and eventually consistent algorithms: both are
possible if partitions are always finite, and both are im-
possible in a system model with infinite partitions.
To distinguish between linearizability and eventual
consistency, a more careful formalization of CAP is re-
quired, which we give in section 3.2.
5ABD [7] is an algorithm for a single-writer multi-reader regis-
ter. It was extended to the multi-writer case by Lynch and Shvarts-
man [44].
3.2 The partitionable system model
In this section we suggest a more precise formulation of
CAP, and derive a result similar to Gilbert and Lynch's
Theorem 1 and Corollary 1.1. This formulation will
help us gain a better understanding of CAP and its con-
sequences.
3.2.1 Definitions
Define a partitionable link to be a point-to-point link
with the following properties:
1. No duplication: If a process p sends a message m
once to process q, then m is delivered at most once
by q.
2. No creation: If some process q delivers a message
m with sender p, then m was previously sent to q
by process p.
(A partitionable link is allowed to drop an infinite num-
ber of messages and cause unbounded message delay.)
Define the partitionable model as a system model
in which processes can only communicate via parti-
tionable links, in which processes never crash,6 and in
which every process has access to a local clock that is
able to generate timeouts (the clock progresses mono-
tonically at a rate approximately equal to real time, but
clocks of different processes are not synchronized).
Deļ¬ne an execution E as admissible in a system
model M if the processes and links in E satisfy the prop-
erties defined by M.
Define an algorithm A as terminating in a system
model M if, for every execution E of A, if E is admissi-
ble in M, then every operation in E terminates in finite
time.7
Deļ¬ne an execution E as loss-free if for every mes-
sage m sent from p to q during E, m is eventually deliv-
ered to q. (There is no delay bound on delivery.)
An execution E is partitioned if it is not loss-free.
Note: we may assume that links automatically resend
lost messages an unbounded number of times. Thus, an
execution in which messages are transiently lost during
6The assumption that processes never crash is of course unrealis-
tic, but it makes the impossibility results in section 3.2.2 stronger. It
also rules out the trivial solution of section 3.1.1.
7Our definition of terminating corresponds to Gilbert and Lynch's
definition of available. We prefer to call it terminating because the
word available is widely understood as referring to an empirical met-
ric (see section 2.1). There is some similarity to wait-free data struc-
tures [29], although these usually assume reliable communication and
unreliable processes.
some ļ¬nite time period is not partitioned, because the
links will eventually deliver all messages that were lost.
An execution is only partitioned if the message loss per-
sists forever.
For the definition of linearizability we refer to Her-
lihy and Wing [30].
There is no generally agreed formalization of even-
tual consistency, but the following corresponds to a
liveness property that has been proposed [8, 9]: even-
tually, every read operation read(q) at process q must
return a set of all the values v ever written by any pro-
cess p in an operation write(p,v). For simplicity, we
assume that values are never removed from the read set,
although an application may only see one of the values
(e.g. the one with the highest timestamp).
More formally, an infinite execution E is eventually
consistent if, for all processes p and q, and for ev-
ery value v such that operation write(p,v) occurs in E,
there are only finitely many operations in E such that
v ∉ read(q).
3.2.2 Impossibility results
Assertion 1. If an algorithm A implements a termi-
nating read-write register R in the partitionable model,
then there exists a loss-free execution of A in which R is
not linearizable.
Proof. Consider an execution E1 in which the initial
value of R is v1, and no messages are delivered (all
messages are lost, which is admissible for partitionable
links). In E1, p first performs an operation write(p,v2)
where v2 ≠ v1. This operation must terminate in finite
time due to the termination property of A.
After the write operation terminates, q performs an
operation read(q), which must return v1, since there is
no way for q to know the value v2, due to all messages
being lost. The read must also terminate. This execution
is not linearizable, because the read did not return v2.
Now consider an execution E2 which extends E1 as
follows: after the termination of the read(q) operation,
every message that was sent during E1 is delivered (this
is admissible for partitionable links). These deliveries
cannot affect the execution of the write and read op-
erations, since they occur after the termination of both
operations, so E2 is also non-linearizable. Moreover, E2
is loss-free, since every message was delivered.
Corollary 2. There is no algorithm that implements
a terminating read-write register in the partitionable
model that is linearizable in all loss-free executions.
This corresponds to Gilbert and Lynchā€™s Corollary 1,
and follows directly from the existence of a loss-free,
non-linearizable execution (assertion 1).
Assertion 3. There is no algorithm that implements
a terminating read-write register in the partitionable
model that is eventually consistent in all executions.
Proof. Consider an execution E in which no messages
are delivered, and in which process p performs opera-
tion write(p,v). This write must terminate, since the al-
gorithm is terminating. Process q with p ≠ q performs
read(q) infinitely many times. However, since no mes-
sages are delivered, q can never learn about the value
written, so read(q) never returns v. Thus, E is not even-
tually consistent.
3.2.3 Opportunistic properties
Note that corollary 2 is about loss-free executions,
whereas assertion 3 is about all executions. If we limit
ourselves to loss-free executions, then eventual consis-
tency is possible (e.g. by maintaining a replica at each
process, and broadcasting every write to all processes).
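A minimal sketch of such a replica (our own illustration, with a simplified message-passing stand-in for the partitionable links) is shown below: every write is applied locally and broadcast to all peers, each replica accumulates the values it has delivered, and read returns the accumulated set. In a loss-free execution every broadcast message is eventually delivered, so every written value eventually appears in every read set; if messages are lost forever, convergence fails, as assertion 3 states.

    import random

    def send(replica, value, loss_probability=0.5):
        """Point-to-point delivery over a partitionable link: the message may
        be dropped, but is never duplicated or invented."""
        if random.random() >= loss_probability:
            replica.deliver(value)

    class Replica:
        """One process of a broadcast-based, eventually consistent register."""

        def __init__(self, process_id):
            self.process_id = process_id
            self.peers = []          # other replicas, wired up after construction
            self.values = set()      # every value delivered so far

        def write(self, value):
            # Apply locally and broadcast asynchronously; the operation returns
            # without waiting, so it terminates even if every message is lost.
            self.deliver(value)
            for peer in self.peers:
                send(peer, value)

        def deliver(self, value):
            self.values.add(value)

        def read(self):
            return set(self.values)

    a, b = Replica("a"), Replica("b")
    a.peers, b.peers = [b], [a]
    a.write("v1")
    print(b.read())  # {"v1"} if the message got through, set() if it was lost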
However, everything we have discussed in this sec-
tion pertains to the partitionable model, in which we
cannot assume that all executions are loss-free. For clar-
ity, we should specify the properties of an algorithm
such that they hold for all admissible executions of a
given system model, not only selected executions.
To this end, we can transform a property P into an
opportunistic property P′ such that:

    ∀E: (E ⊨ P′) ⇔ (lossfree(E) ⇒ (E ⊨ P))

or, equivalently:

    ∀E: (E ⊨ P′) ⇔ (partitioned(E) ∨ (E ⊨ P)).

In other words, P′ is trivially satisfied for executions
that are partitioned. Requiring P′ to hold for all executions
is equivalent to requiring P to hold for all loss-free
executions.
Hence we define an execution E as opportunistically
eventually consistent if E is partitioned or if E is eventu-
ally consistent. (This is a weaker liveness property than
eventual consistency.)
Similarly, we define an execution E as opportunisti-
cally terminating linearizable if E is partitioned, or if
E is linearizable and every operation in E terminates in
finite time.
From the results above, we can see that opportunistic
terminating linearizability is impossible in the partition-
able model (corollary 2), whereas opportunistic even-
tual consistency is possible. This distinction can be un-
derstood as the key result of CAP. However, it is ar-
guably not a very interesting or insightful result.
3.3 Mismatch between formal model and
practical systems
Many of the problems in this section are due to the
fact that availability is defined by Gilbert and Lynch as
a liveness property (section 2.1.2). Liveness properties
make statements about something happening eventually
in an infinite execution, which is confusing to practi-
tioners, since real systems need to get things done in a
finite (and usually short) amount of time.
Quoting Lamport [39]: "Liveness properties are in-
herently problematic. The question of whether a real
system satisfies a liveness property is meaningless; it
can be answered only by observing the system for an
infinite length of time, and real systems don't run for-
ever. Liveness is always an approximation to the prop-
erty we really care about. We want a program to termi-
nate within 100 years, but proving that it does would re-
quire the addition of distracting timing assumptions. So,
we prove the weaker condition that the program eventu-
ally terminates. This doesn't prove that the program will
terminate within our lifetimes, but it does demonstrate
the absence of infinite loops."
Brewer [17] and some commercial database ven-
dors [1] state that "all three properties [consistency,
availability, and partition tolerance] are more contin-
uous than binary". This is in direct contradiction to
Gilbert and Lynch's formalization of CAP (and our
restatement thereof), which expresses consistency and
availability as safety and liveness properties of an algo-
rithm, and partitions as a property of the system model.
Such properties either hold or they do not hold; there is
no degree of continuity in their definition.
Brewer's informal interpretation of CAP is intuitively
appealing, but it is not a theorem, since it is not ex-
pressed formally (and thus cannot be proved or dis-
proved) – it is, at best, a rule of thumb. Gilbert and
Lynch's formalization can be proved correct, but it does
not correspond to practitioners' intuitions for real sys-
tems. This contradiction suggests that although the for-
mal model may be true, it is not useful.
4 Alternatives to CAP
In section 2 we explored the definitions of the terms
consistency, availability and partition tolerance, and
noted that a wide range of ambiguous and mutually
incompatible interpretations have been proposed, lead-
ing to widespread confusion. Then, in section 3 we ex-
plored Gilbert and Lynch's definitions and proofs in
more detail, and highlighted some problems with the
formalization of CAP.
All of these misunderstandings and ambiguities lead
us to assert that CAP is no longer an appropriate
tool for reasoning about systems. A better framework
for describing trade-offs is required. Such a framework
should be simple to understand, match most people's in-
tuitions, and use definitions that are formal and correct.
In the rest of this paper we develop a first draft of
an alternative framework called delay-sensitivity, which
provides tools for reasoning about trade-offs between
consistency and robustness to network faults. It is based
on several existing results from the distributed sys-
tems literature (most of which in fact predate CAP).
4.1 Latency and availability
As discussed in section 2.1.2, the latency (response
time) of operations is often important in practice, but
it is deliberately ignored by Gilbert and Lynch.
The problem with latency is that it is more difficult
to model. Latency is influenced by many factors, espe-
cially the delay of packets on the network. Many com-
puter networks (including Ethernet and the Internet) do
not guarantee bounded delay, i.e. they allow packets
to be delayed arbitrarily. Latencies and network delays
are therefore typically described as probability distribu-
tions.
On the other hand, network delay can model a wide
range of faults. In network protocols that automatically
retransmit lost packets (such as TCP), transient packet
loss manifests itself to the application as temporarily
increased delay. Even when the period of packet loss
exceeds the TCP connection timeout, application-level
protocols often retry failed requests until they succeed,
so the effective latency of the operation is the time from
the ļ¬rst attempt until the successful completion. Even
network partitions can be modelled as large packet de-
lays (up to the duration of the partition), provided that
the duration of the partition is finite and lost packets are
retransmitted an unbounded number of times.
Abadi [2] argues that there is a trade-off between
consistency and latency, which applies even when there
is no network partition, and which is as important as the
consistency/availability trade-off described by CAP. He
proposes a "PACELC" formulation to reason about this
trade-off.
We go further, and assert that availability should be
modeled in terms of operation latency. For example, we
could define the availability of a service as the propor-
tion of requests that meet some latency bound (e.g. re-
turning successfully within 500 ms, as defined by an
SLA). This empirically-founded definition of availabil-
ity closely matches our intuitive understanding.
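For instance, the following sketch (illustrative only; the request log format and the 500 ms bound are hypothetical) computes this notion of availability from a log of observed requests.

    def availability(request_log, sla_latency_ms=500):
        """Fraction of requests that returned a non-error response within
        the SLA latency bound, over some period of operation."""
        successful = sum(
            1 for r in request_log
            if not r["error"] and r["latency_ms"] <= sla_latency_ms
        )
        return successful / len(request_log)

    log = [
        {"latency_ms": 120, "error": False},
        {"latency_ms": 480, "error": False},
        {"latency_ms": 900, "error": False},   # too slow: counts as unavailable
        {"latency_ms": 50, "error": True},     # error response
    ]
    print(availability(log))  # 0.5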
We can then reason about a service's tolerance of net-
work problems by analyzing how operation latency is
affected by changes in network delay, and whether this
pushes operation latency over the limit set by the SLA.
If a service can sustain low operation latency, even as
network delay increases dramatically, it is more toler-
ant of network problems than a service whose latency
increases.
4.2 How operation latency depends on
network delay
To ļ¬nd a replacement for CAP with a latency-centric
viewpoint we need to examine how operation latency is
affected by network latency at different levels of con-
sistency. In practice, this depends on the algorithms and
implementation of the particular software being used.
However, CAP demonstrated that there is also interest
in theoretical results identifying the fundamental lim-
its of what can be achieved, regardless of the particular
algorithm in use.
Several existing impossibility results establish lower
bounds on the operation latency as a function of the net-
work delay d. These results show that any algorithm
guaranteeing a particular level of consistency cannot
perform operations faster than some lower bound. We
summarize these results in table 1 and in the following
sections.
Our notation is similar to that used in complexity the-
ory to describe the running time of an algorithm. How-
ever, rather than being a function of the size of input,
we describe the latency of an operation as a function of
network delay.
In this section we assume unbounded network delay,
and unsynchronized clocks (i.e. each process has access
to a clock that progresses monotonically at a rate ap-
proximately equal to real time, but the synchronization
error between clocks is unbounded).

    Consistency level        write latency   read latency
    -----------------------  --------------  ------------
    linearizability          O(d)            O(d)
    sequential consistency   O(d)            O(1)
    causal consistency       O(1)            O(1)

    Table 1: Lowest possible operation latency at various
    consistency levels, as a function of network delay d.
4.2.1 Linearizability
Attiya and Welch [6] show that any algorithm imple-
menting a linearizable read-write register must have an
operation latency of at least u/2, where u is the uncer-
tainty of delay in the network between replicas.8
In this proof, network delay is assumed to be at most
d and at least d − u, so u is the difference between
the minimum and maximum network delay. In many
networks, the maximum possible delay (due to net-
work congestion or retransmitting lost packets) is much
greater than the minimum possible delay (due to the
speed of light), so u ≈ d. If network delay is unbounded,
operation latency is also unbounded.
For the purposes of this survey, we can simplify the
result to say that linearizability requires the latency of
read and write operations to be proportional to the net-
work delay d. This is indicated in table 1 as O(d) la-
tency for reads and writes. We call these operations
delay-sensitive, as their latency is sensitive to changes
in network delay.
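This simplification can be written out as a small worked bound (our restatement of the cited results, writing dmax and dmin for the assumed maximum and minimum network delay):

    \mathrm{latency}_{\mathrm{read}},\ \mathrm{latency}_{\mathrm{write}}
      \;\geq\; \frac{u}{2}
      \;=\; \frac{d_{\max} - d_{\min}}{2}
      \;\approx\; \frac{d}{2}
      \qquad \text{when } d_{\max} \gg d_{\min},

i.e. both read and write latency grow in proportion to the network delay, which Table 1 summarizes as O(d).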
4.2.2 Sequential consistency
Lipton and Sandberg [42] show that any algorithm im-
plementing a sequentially consistent read-write register
must have |r| + |w| ≥ d, where |r| is the latency of a
read operation, |w| is the latency of a write operation,
and d is the network delay. Mavronicolas and Roth [46]
further develop this result.
This lower bound provides a degree of choice for the
application: for example, an application that performs
8Attiya and Welch [6] originally proved a bound of u/2 for write
operations (assuming two writer processes and one reader), and a
bound of u/4 for read operations (two readers, one writer). The u/2
bound for read operations is due to Mavronicolas and Roth [46].
more reads than writes can reduce the average opera-
tion latency by choosing |r| = 0 and |w| ≥ d, whereas
a write-heavy application might choose |r| ≥ d and
|w| = 0. Attiya and Welch [6] describe algorithms for
both of these cases (the |r| = 0 case is similar to the
Zab algorithm used by Apache ZooKeeper [33]).
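A highly simplified sketch of the |r| = 0 flavour is given below. It is our own illustration of the latency trade-off, not the algorithm of [6]: it assumes an external total-order broadcast primitive (named publish_and_wait_for_local_delivery here purely for exposition) through which all writes are disseminated, so that a write blocks for O(d) time while a read returns the local copy in O(1).

    class FastReadRegister:
        """Sketch of the |r| = 0, |w| >= d flavour of sequential consistency:
        writes go through an assumed total-order broadcast and block until the
        writer delivers its own update (O(d)); reads return the local copy
        immediately (O(1))."""

        def __init__(self, total_order_broadcast):
            self.broadcast = total_order_broadcast  # assumed primitive
            self.value = None

        def write(self, value):
            # Blocks for roughly one broadcast round, i.e. O(d).
            self.broadcast.publish_and_wait_for_local_delivery(value)

        def on_deliver(self, value):
            # Invoked in the same order at every replica.
            self.value = value

        def read(self):
            # O(1): no network communication on the critical path.
            return self.value

    class LocalBroadcast:
        """Trivial single-machine stand-in for the assumed total-order
        broadcast, just to make the sketch executable."""
        def __init__(self):
            self.subscribers = []
        def publish_and_wait_for_local_delivery(self, value):
            for s in self.subscribers:
                s.on_deliver(value)

    bcast = LocalBroadcast()
    reg = FastReadRegister(bcast)
    bcast.subscribers.append(reg)
    reg.write("x")
    print(reg.read())  # "x"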
Choosing |r| = 0 or |w| = 0 means the operation can
complete without waiting for any network communica-
tion (it may still send messages, but need not wait for
a response from other nodes). The latency of such an
operation thus only depends on the local database algo-
rithms: it might be constant-time O(1), or it might be
O(log n) where n is the size of the database, but either
way it is independent of the network delay d, so we call
it delay-independent.
In table 1, sequential consistency is described as hav-
ing fast reads and slow writes (constant-time reads, and
write latency proportional to network delay), although
these roles can be swapped if an application prefers fast
writes and slow reads.
4.2.3 Causal consistency
If sequential consistency allows the latency of some op-
erations to be independent of network delay, which level
of consistency allows all operation latencies to be inde-
pendent of the network? Recent results [8, 45] show that
causal consistency [4] with eventual convergence is the
strongest possible consistency guarantee with this prop-
erty.9
Read Your Writes [49], PRAM [42] and other weak
consistency models (all the way down to eventual con-
sistency, which provides no safety property [9]) are
weaker than causal consistency, and thus achievable
without waiting for the network.
If tolerance of network delay is the only considera-
tion, causal consistency is the optimal consistency level.
There may be other reasons for choosing weaker con-
sistency levels (for example, the metadata overhead of
tracking causality [8, 20]), but these trade-offs are out-
side of the scope of this discussion, as they are also out-
side the scope of CAP.
9There are a few variants of causal consistency, such as real time
causal [45], causal+ [43] and observable causal [8] consistency.
They have subtle differences, but we do not have space in this paper
to compare them in detail.
4.3 Heterogeneous delays
A limitation of the results in section 4.2 is that they as-
sume the distribution of network delays is the same be-
tween every pair of nodes. This assumption is not true
in general: for example, network delay between nodes
in the same datacenter is likely to be much lower than
between geographically distributed nodes communicat-
ing over WAN links.
If we model network faults as periods of increased
network delay (section 4.1), then a network partition
is a situation in which the delay between nodes within
each partition remains small, while the delay across par-
titions increases dramatically (up to the duration of the
partition).
For O(d) algorithms, which of these different delays
do we need to assume for d? The answer depends on
the communication pattern of the algorithm.
4.3.1 Modeling network topology
For example, a replication algorithm that uses a sin-
gle leader or primary node requires all write requests
to contact the primary, and thus d in this case is the net-
work delay between the client and the leader (possibly
via other nodes). In a geographically distributed system,
if client and leader are in different locations, d includes
WAN links. If the client is temporarily partitioned from
the leader, d increases up to the duration of the partition.
By contrast, the ABD algorithm [7] waits for re-
sponses from a majority of replicas, so d is the largest
among the majority of replicas that are fastest to re-
spond. If a minority of replicas is temporarily parti-
tioned from the client, the operation latency remains in-
dependent of the partition duration.
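To illustrate why waiting for a majority makes the effective delay insensitive to a slow or partitioned minority, the sketch below (with hypothetical delay values) computes d for a quorum-based operation as the slowest response among the fastest majority of replicas.

    def quorum_delay(replica_delays_ms):
        """Effective delay for an operation that waits for a majority of
        replicas: the largest delay among the majority that respond fastest."""
        majority = len(replica_delays_ms) // 2 + 1
        return sorted(replica_delays_ms)[majority - 1]

    # One replica is partitioned (modelled as a multi-hour delay), but with
    # five replicas the operation still completes after 25 ms.
    print(quorum_delay([10, 25, 40, 3_600_000, 15]))  # 25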
Another possibility is to treat network delay within
the same datacenter, dlocal, differently from net-
work delay over WAN links, dremote, because usu-
ally dlocal ≪ dremote. Systems such as COPS [43],
which place a leader in each datacenter, provide lin-
earizable operations within one datacenter (requiring
O(dlocal) latency), and causal consistency across da-
tacenters (making the request latency independent of
dremote).
4.4 Delay-independent operations
The big-O notation for operation latency ignores con-
stant factors (such as the number of network round-trips
required by an algorithm), but it captures the essence of
what we need to know for building systems that can
tolerate network faults: what happens if network delay
dramatically degrades? In a delay-sensitive O(d) algo-
rithm, operation latency may increase to be as large as
the duration of the network interruption (i.e. minutes or
even hours), whereas a delay-independent O(1) algo-
rithm remains unaffected.
If the SLA calls for operation latencies that are signif-
icantly shorter than the expected duration of network in-
terruptions, delay-independent algorithms are required.
In such algorithms, the time until replica convergence
is still proportional to d, but convergence is decou-
pled from operation latency. Put another way, delay-
independent algorithms support disconnected or offline
operation. Disconnected operation has long been used
in network ļ¬le systems [37] and automatic teller ma-
chines [18].
For example, consider a calendar application running
on a mobile device: a user may travel through a tunnel
or to a remote location where there is no cellular net-
work coverage. For a mobile device, regular network
interruptions are expected, and they may last for days.
During this time, the user should still be able to inter-
act with the calendar app, checking their schedule and
adding events (with any changes asynchronously prop-
agated when an internet connection is next available).
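A minimal sketch of this pattern (our illustration of disconnected operation, not tied to any particular calendar implementation) applies each write to local state immediately and queues it for asynchronous propagation once connectivity returns.

    from collections import deque

    class OfflineCalendar:
        """Delay-independent (O(1)) writes: changes are applied locally and
        queued; synchronization with the server happens later, decoupled
        from operation latency."""

        def __init__(self):
            self.events = []            # local replica state
            self.pending = deque()      # writes not yet propagated

        def add_event(self, event):
            # O(1): no network communication on the critical path.
            self.events.append(event)
            self.pending.append(event)

        def sync(self, send_to_server):
            # Called whenever a connection is available; convergence time
            # depends on network delay, but operation latency does not.
            while self.pending:
                send_to_server(self.pending.popleft())

    cal = OfflineCalendar()
    cal.add_event("dentist, Tuesday 10:00")       # works with no connectivity
    cal.sync(lambda e: print("uploading:", e))    # later, when back online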
However, even in environments with fast and re-
liable network connectivity, delay-independent algo-
rithms have been shown to have performance and scal-
ability benefits: in this context, they are known as
coordination-free [13] or ALPS [43] systems. Many
popular database integrity constraints can be imple-
mented without synchronous coordination between
replicas [13].
4.5 Proposed terminology
Much of the confusion around CAP is due to the am-
biguous, counter-intuitive and contradictory definitions
of terms such as availability, as discussed in section 2.
In order to improve the situation and reduce misunder-
standings, there is a need to standardize terminology
with simple, formal and correct definitions that match
the intuitions of practitioners.
Building upon the observations above and the results
cited in section 4.2, we propose the following defini-
tions as a first draft of a delay-sensitivity framework for
reasoning about consistency and availability trade-offs.
These definitions are informal and intended as a starting
point for further discussion.
Availability is an empirical metric, not a property of
an algorithm. It is defined as the percentage of
successful requests (returning a non-error response
within a predefined latency bound) over some pe-
riod of system operation.
Delay-sensitive describes algorithms or operations
that need to wait for network communication to
complete, i.e. which have latency proportional to
network delay. The opposite is delay-independent.
Systems must specify the nature of the sen-
sitivity (e.g. an operation may be sensitive to
intra-datacenter delay but independent of inter-
datacenter delay). A fully delay-independent sys-
tem supports disconnected (offline) operation.
Network faults encompass packet loss (both transient
and long-lasting) and unusually large packet delay.
Network partitions are just one particular type of
network fault; in most cases, systems should plan
for all kinds of network fault, and not only parti-
tions. As long as lost packets or failed requests are
retried, they can be modeled as large network de-
lay.
Fault tolerance is used in preference to high availabil-
ity or partition tolerance. The maximum fault that
can be tolerated must be specified (e.g. "the al-
gorithm can tolerate up to a minority of replicas
crashing or disconnecting"), and the description
must also state what happens if more faults occur
than the system can tolerate (e.g. all requests return
an error, or a consistency property is violated).
Consistency refers to a spectrum of different con-
sistency models (including linearizability and
causal consistency), not one particular consistency
model. When a particular consistency model such
as linearizability is intended, it is referred to by its
usual name. The term strong consistency is vague,
and may refer to linearizability, sequential consis-
tency or one-copy serializability.
5 Conclusion
In this paper we discussed several problems with the
CAP theorem: the definitions of consistency, availabil-
ity and partition tolerance in the literature are somewhat
contradictory and counter-intuitive, and the distinction
that CAP draws between "strong" and "eventual" con-
sistency models is less clear than widely believed.
CAP has nevertheless been very influential in the
design of distributed data systems. It deserves credit
for catalyzing the exploration of the design space of
systems with weak consistency guarantees, e.g. in the
NoSQL movement. However, we believe that CAP has
now reached the end of its usefulness; we recommend
that it should be relegated to the history of distributed
systems, and no longer be used for justifying design de-
cisions.
As an alternative to CAP, we propose a simple delay-
sensitivity framework for reasoning about trade-offs be-
tween consistency guarantees and tolerance of network
faults in a replicated database. Every operation is cat-
egorized as either O(d), if its latency is sensitive to
network delay, or O(1), if it is independent of net-
work delay. On the assumption that lost messages are
retransmitted an unbounded number of times, we can
model network faults (including partitions) as periods
of greatly increased delay. The algorithmā€™s sensitivity to
network delay determines whether the system can still
meet its service level agreement (SLA) when a network
fault occurs.
The actual sensitivity of a system to network de-
lay depends on its implementation, but – in keeping
with the goal of CAP – we can prove that certain lev-
els of consistency cannot be achieved without making
operation latency proportional to network delay. These
theoretical lower bounds are summarized in Table 1.
We have not proved any new results in this paper, but
merely drawn on existing distributed systems research
dating mostly from the 1990s (and thus predating CAP).
For future work, it would be interesting to model the
probability distribution of latencies for different con-
currency control and replication algorithms (e.g. by ex-
tending PBS [11]), rather than modeling network delay
as just a single number d. It would also be interesting
to model the network communication topology of dis-
tributed algorithms more explicitly.
We hope that by being more rigorous about the impli-
cations of different consistency levels on performance
and fault tolerance, we can encourage designers of dis-
tributed data systems to continue the exploration of the
design space. And we also hope that by adopting sim-
ple, correct and intuitive terminology, we can help guide
application developers towards the storage technologies
that are most appropriate for their use cases.
Acknowledgements
Many thanks to Niklas Ekström, Seth Gilbert and Henry
Robinson for fruitful discussions around this topic.
References
[1] ACID support in Aerospike. Aerospike, Inc., June 2014.
URL http://www.aerospike.com/docs/architecture/
assets/AerospikeACIDSupport.pdf.
[2] Daniel J Abadi. Consistency tradeoffs in modern distributed
database system design. IEEE Computer Magazine, 45(2):37–
42, February 2012. doi:10.1109/MC.2012.33.
[3] Atul Adya. Weak Consistency: A Generalized Theory and Opti-
mistic Implementations for Distributed Transactions. PhD the-
sis, Massachusetts Institute of Technology, March 1999.
[4] Mustaque Ahamad, Gil Neiger, James E Burns, Prince Kohli,
and Phillip W Hutto. Causal memory: definitions, implemen-
tation, and programming. Distributed Computing, 9(1):37–49,
March 1995. doi:10.1007/BF01784241.
[5] Bowen Alpern and Fred B Schneider. Defining liveness. In-
formation Processing Letters, 21(4):181–185, October 1985.
doi:10.1016/0020-0190(85)90056-0.
[6] Hagit Attiya and Jennifer L Welch. Sequential con-
sistency versus linearizability. ACM Transactions on
Computer Systems (TOCS), 12(2):91–122, May 1994.
doi:10.1145/176575.176576.
[7] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing mem-
ory robustly in message-passing systems. Journal of the ACM,
42(1):124–142, January 1995. doi:10.1145/200836.200869.
[8] Hagit Attiya, Faith Ellen, and Adam Morrison. Limitations of
highly-available eventually-consistent data stores. In ACM Sym-
posium on Principles of Distributed Computing (PODC), July
2015. doi:10.1145/2767386.2767419.
[9] Peter Bailis and Ali Ghodsi. Eventual consistency today: Lim-
itations, extensions, and beyond. ACM Queue, 11(3), March
2013. doi:10.1145/2460276.2462076.
[10] Peter Bailis and Kyle Kingsbury. The network is reliable. ACM
Queue, 12(7), July 2014. doi:10.1145/2639988.2639988.
[11] Peter Bailis, Shivaram Venkataraman, Michael J Franklin,
Joseph M Hellerstein, and Ion Stoica. Probabilistically bounded
staleness for practical partial quorums. Proceedings of the
VLDB Endowment, 5(8):776–787, April 2012.
[12] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi,
Joseph M Hellerstein, and Ion Stoica. Highly available transac-
tions: Virtues and limitations. In 40th International Conference
on Very Large Data Bases (VLDB), September 2014.
[13] Peter Bailis, Alan Fekete, Michael J Franklin, Ali Ghodsi,
Joseph M Hellerstein, and Ion Stoica. Coordination-avoiding
database systems. Proceedings of the VLDB Endowment, 8(3):
185–196, November 2014.
[14] Hal Berenson, Philip A Bernstein, Jim N Gray, Jim Melton,
Elizabeth O'Neil, and Patrick O'Neil. A critique of
ANSI SQL isolation levels. In ACM International Con-
ference on Management of Data (SIGMOD), May 1995.
doi:10.1145/568271.223785.
[15] Philip A Bernstein, Vassos Hadzilacos, and Nathan Good-
man. Concurrency Control and Recovery in Database
Systems. Addison-Wesley, 1987. ISBN 0201107155.
URL http://research.microsoft.com/en-us/people/
philbe/ccontrol.aspx.
[16] Eric A Brewer. Towards robust distributed systems (keynote). In
19th ACM Symposium on Principles of Distributed Computing
(PODC), July 2000.
[17] Eric A Brewer. CAP twelve years later: How the "rules" have
changed. IEEE Computer Magazine, 45(2):23–29, February
2012. doi:10.1109/MC.2012.37.
[18] Eric A Brewer. NoSQL: Past, present, future. In QCon San
Francisco, November 2012. URL http://www.infoq.com/
presentations/NoSQL-History.
[19] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. In-
troduction to Reliable and Secure Distributed Programming.
Springer, 2nd edition, February 2011. ISBN 978-3-642-15259-
7.
[20] Bernadette Charron-Bost. Concerning the size of logical clocks
in distributed systems. Information Processing Letters, 39(1):
11–16, July 1991. doi:10.1016/0020-0190(91)90055-M.
[21] Jeff Darcy. When partitions attack, October 2010. URL
http://pl.atyp.us/wordpress/index.php/2010/10/
when-partitions-attack/.
[22] Susan B Davidson, Hector Garcia-Molina, and Dale Skeen.
Consistency in partitioned networks. ACM Computing Surveys,
17(3):341ā€“370, September 1985. doi:10.1145/5505.5508.
[23] Michael J Fischer and Alan Michael. Sacriļ¬cing serializability
to attain high availability of data in an unreliable network. In 1st
ACM Symposium on Principles of Database Systems (PODS),
pages 70ā€“75, March 1982. doi:10.1145/588111.588124.
[24] Armando Fox and Eric A Brewer. Harvest, yield, and scal-
able tolerant systems. In 7th Workshop on Hot Topics in
Operating Systems (HotOS), pages 174ā€“178, March 1999.
doi:10.1109/HOTOS.1999.798396.
[25] Seth Gilbert and Nancy Lynch. Brewerā€™s conjecture
and the feasibility of consistent, available, partition-tolerant
web services. ACM SIGACT News, 33(2):51ā€“59, 2002.
doi:10.1145/564585.564601.
[26] Seth Gilbert and Nancy Lynch. Perspectives on the CAP theo-
rem. IEEE Computer Magazine, 45(2):30ā€“36, February 2012.
doi:10.1109/MC.2011.389.
[27] Jim N Gray, Raymond A Lorie, Gianfranco R Putzolu, and Irv-
ing L Traiger. Granularity of locks and degrees of consistency in
a shared data base. In G M Nijssen, editor, Modelling in Data
Base Management Systems: Proceedings of the IFIP Working
Conference on Modelling in Data Base Management Systems,
pages 364ā€“394. Elsevier/North Holland Publishing, 1976.
[28] Coda Hale. You canā€™t sacriļ¬ce partition tolerance, Oc-
tober 2010. URL http://codahale.com/you-cant-
sacrifice-partition-tolerance/.
[29] Maurice P Herlihy. Impossibility and universality results for
wait-free synchronization. In 7th ACM Symposium on Princi-
ples of Distributed Computing (PODC), pages 276ā€“290, August
1988. doi:10.1145/62546.62593.
[30] Maurice P Herlihy and Jeannette M Wing. Linearizability: A
correctness condition for concurrent objects. ACM Transactions
on Programming Languages and Systems, 12(3):463ā€“492, July
1990. doi:10.1145/78969.78972.
[31] Jeff Hodges. Notes on distributed systems for young bloods,
January 2013. URL http://www.somethingsimilar.com/
2013/01/14/notes-on-distributed-systems-for-
young-bloods/.
[32] Paul R Johnson and Robert H Thomas. RFC 677: The mainte-
nance of duplicate databases. Network Working Group, January
1975. URL https://tools.ietf.org/html/rfc677.
[33] Flavio P Junqueira, Benjamin C Reed, and Marco Seraļ¬ni.
Zab: High-performance broadcast for primary-backup sys-
tems. In 41st IEEE International Conference on Depend-
able Systems and Networks (DSN), pages 245ā€“256, June 2011.
doi:10.1109/DSN.2011.5958223.
[34] Won Kim. Highly available systems for database applica-
tion. ACM Computing Surveys, 16(1):71ā€“98, March 1984.
doi:10.1145/861.866.
[35] Kyle Kingsbury. Call me maybe: RabbitMQ, June 2014.
URL https://aphyr.com/posts/315-call-me-maybe-
rabbitmq.
[36] Kyle Kingsbury. Call me maybe: Elasticsearch 1.5.0, April
2015. URL https://aphyr.com/posts/323-call-me-
maybe-elasticsearch-1-5-0.
[37] James J Kistler and M Satyanarayanan. Disconnected
operation in the Coda ļ¬le system. ACM Transactions
on Computer Systems (TOCS), 10(1):3ā€“25, February 1992.
doi:10.1145/146941.146942.
[38] Leslie Lamport. How to make a multiprocessor com-
puter that correctly executes multiprocess programs. IEEE
Transactions on Computers, 28(9):690ā€“691, September 1979.
doi:10.1109/TC.1979.1675439.
[39] Leslie Lamport. Fairness and hyperfairness. Dis-
tributed Computing, 13(4):239ā€“245, November 2000.
doi:10.1007/PL00008921.
[40] Bruce G Lindsay, Patricia Grifļ¬ths Selinger, C Galtieri, Jim N
Gray, Raymond A Lorie, Thomas G Price, Gianfranco R Put-
zolu, Irving L Traiger, and Bradford W Wade. Notes on dis-
tributed databases. Technical Report RJ2571(33471), IBM Re-
search, July 1979.
[41] Nicolas Liochon. You do it too: Forfeiting network partition
tolerance in distributed systems, July 2015. URL http://
blog.thislongrun.com/2015/07/Forfeit-Partition-
Tolerance-Distributed-System-CAP-Theorem.html.
[42] Richard J Lipton and Jonathan S Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Princeton University Department of Computer Science, September 1988.
[43] Wyatt Lloyd, Michael J Freedman, Michael Kaminsky, and David G Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In 23rd ACM Symposium on Operating Systems Principles (SOSP), pages 401–416, October 2011. doi:10.1145/2043556.2043593.
[44] Nancy Lynch and Alex Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), pages 272–281, June 1997. doi:10.1109/FTCS.1997.614100.
[45] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. Consistency, availability, and convergence. Technical Report UTCS TR-11-22, University of Texas at Austin, Department of Computer Science, May 2011.
[46] Marios Mavronicolas and Dan Roth. Linearizable read/write objects. Theoretical Computer Science, 220(1):267–319, June 1999. doi:10.1016/S0304-3975(98)90244-4.
[47] Henry Robinson. CAP confusion: Problems with 'partition tolerance', April 2010. URL http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/.
[48] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O Myreen. x86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. Communications of the ACM, 53(7):89–97, July 2010. doi:10.1145/1785414.1785443.
[49] Douglas B Terry, Alan J Demers, Karin Petersen, Mike J Spreitzer, Marvin M Theimer, and Brent B Welch. Session guarantees for weakly consistent replicated data. In 3rd International Conference on Parallel and Distributed Information Systems (PDIS), pages 140–149, September 1994. doi:10.1109/PDIS.1994.331722.
[50] Werner Vogels. Eventually consistent. ACM Queue, 6(6):14–19, October 2008. doi:10.1145/1466443.1466448.
A Critique of the CAP Theorem by Martin Kleppmann

  • 1. arXiv:1509.05393v2[cs.DC]18Sep2015 A Critique of the CAP Theorem Martin Kleppmann Abstract The CAP Theorem is a frequently cited impossibility result in distributed systems, especially among NoSQL distributed databases. In this paper we survey some of the confusion about the meaning of CAP, includ- ing inconsistencies and ambiguities in its deļ¬nitions, and we highlight some problems in its formalization. CAP is often interpreted as proof that eventually con- sistent databases have better availability properties than strongly consistent databases; although there is some truth in this, we show that more careful reasoning is required. These problems cast doubt on the utility of CAP as a tool for reasoning about trade-offs in practi- cal systems. As alternative to CAP, we propose a delay- sensitivity framework, which analyzes the sensitivity of operation latency to network delay, and which may help practitioners reason about the trade-offs between con- sistency guarantees and tolerance of network faults. 1 Background Replicated databases maintain copies of the same data on multiple nodes, potentially in disparate geographical locations, in order to tolerate faults (failures of nodes or communication links) and to provide lower latency to users (requests can be served by a nearby site). How- ever, implementing reliable, fault-tolerant applications in a distributed system is difļ¬cult: if there are multi- ple copies of the data on different nodes, they may be inconsistent with each other, and an application that is not designed to handle such inconsistencies may pro- duce incorrect results. In order to provide a simpler programming model to application developers, the designers of distributed data systems have explored various consistency guar- antees that can be implemented by the database infras- tructure, such as linearizability [30], sequential consis- tency [38], causal consistency [4] and pipelined RAM (PRAM) [42]. When multiple processes execute opera- tions on a shared storage abstraction such as a database, a consistency model describes what values are allowed to be returned by operations accessing the storage, de- pending on other operations executed previously or concurrently, and the return values of those operations. Similar concerns arise in the design of multipro- cessor computers, which are not geographically dis- tributed, but nevertheless present inconsistent views of memory to different threads, due to the various caches and buffers employed by modern CPU architectures. For example, x86 microprocessors provide a level of consistency that is weaker than sequential, but stronger than causal consistency [48]. However, in this paper we focus our attention on distributed systems that must tol- erate partial failures and unreliable network links. A strong consistency model like linearizability pro- vides an easy-to-understand guarantee: informally, all operations behave as if they executed atomically on a single copy of the data. However, this guarantee comes at the cost of reduced performance [6] and fault toler- ance [22] compared to weaker consistency models. In particular, as we discuss in this paper, algorithms that ensure stronger consistency properties among replicas are more sensitive to message delays and faults in the network. Many real computer networks are prone to unbounded delays and lost messages [10], making the fault tolerance of distributed consistency algorithms an important issue in practice. 
A network partition is a particular kind of communi- cation fault that splits the network into subsets of nodes such that nodes in one subset cannot communicate with nodes in another. As long as the partition exists, any data modiļ¬cations made in one subset of nodes cannot be visible to nodes in another subset, since all messages between them are lost. Thus, an algorithm that main- tains the illusion of a single copy may have to delay operations until the partition is healed, to avoid the risk of introducing inconsistent data in different subsets of nodes. This trade-off was already known in the 1970s [22, 23, 32, 40], but it was rediscovered in the early 2000s, when the webā€™s growing commercial popularity made 1
  • 2. geographic distribution and high availability important to many organizations [18, 50]. It was originally called the CAP Principle by Fox and Brewer [16, 24], where CAP stands for Consistency, Availability and Partition tolerance. After the principle was formalized by Gilbert and Lynch [25, 26] it became known as the CAP Theo- rem. CAP became an inļ¬‚uential idea in the NoSQL move- ment [50], and was adopted by distributed systems prac- titioners to critique design decisions [31]. It provoked a lively debate about trade-offs in data systems, and encouraged system designers to challenge the received wisdom that strong consistency guarantees were essen- tial for databases [17]. The rest of this paper is organized as follows: in sec- tion 2 we compare various deļ¬nitions of consistency, availability and partition tolerance. We then examine the formalization of CAP by Gilbert and Lynch [25] in section 3. Finally, in section 4 we discuss some al- ternatives to CAP that are useful for reasoning about trade-offs in distributed systems. 2 CAP Theorem Deļ¬nitions CAP was originally presented in the form of ā€œconsis- tency, availability, partition tolerance: pick any twoā€ (i.e. you can have CA, CP or AP, but not all three). Sub- sequent debates concluded that this formulation is mis- leading [17, 28, 47], because the distinction between CA and CP is unclear, as detailed later in this section. Many authors now prefer the following formulation: if there is no network partition, a system can be both con- sistent and available; when a network partition occurs, a system must choose between either consistency (CP) or availability (AP). Some authors [21, 41] deļ¬ne a CP system as one in which a majority of nodes on one side of a partition can continue operating normally, and a CA system as one that may fail catastrophically under a network par- tition (since it is designed on the assumption that parti- tions are very rare). However, this deļ¬nition is not uni- versally agreed, since it is counter-intuitive to label a system as ā€œavailableā€ if it fails catastrophically under a partition, while a system that continues partially operat- ing in the same situation is labelled ā€œunavailableā€ (see section 2.3). Disagreement about the deļ¬nitions of terms like availability is the source of many misunderstandings about CAP, and unclear deļ¬nitions lead to problems with its formalization as a theorem. In sections 2.1 to 2.3 we survey various deļ¬nitions that have been pro- posed. 2.1 Availability In practical engineering terms, availability usually refers to the proportion of time during which a service is able to successfully handle requests, or the propor- tion of requests that receive a successful response. A response is usually considered successful if it is valid (not an error, and satisļ¬es the databaseā€™s safety proper- ties) and it arrives at the client within some timeout, which may be speciļ¬ed in a service level agreement (SLA). Availability in this sense is a metric that is em- pirically observed during a period of a serviceā€™s oper- ation. A service may be available (up) or unavailable (down) at any given time, but it is nonsensical to say that some software package or algorithm is ā€˜availableā€™ or ā€˜unavailableā€™ in general, since the uptime percentage is only known in retrospect, after a period of operation (during which various faults may have occurred). 
There is a long tradition of highly available and fault- tolerant systems, whose algorithms are designed such that the system can remain available (up) even when some part of the system is faulty, thus increasing the expected mean time to failure (MTTF) of the system as a whole. Using such a system does not automatically make a service 100% available, but it may increase the observed availability during operation, compared to us- ing a system that is not fault-tolerant. 2.1.1 The A in CAP Does the A in CAP refer to a property of an algorithm, or to an observed metric during system operation? This distinction is unclear. Brewer does not offer a precise deļ¬nition of availability, but states that ā€œavailability is obviously continuous from 0 to 100 percentā€ [17], sug- gesting an observed metric. Fox and Brewer also use the term yield to refer to the proportion of requests that are completed successfully [24] (without specifying any timeout). On the other hand, Gilbert and Lynch [25] write: ā€œFor a distributed system to be continuously available, ev- ery request received by a non-failing node in the sys- tem must result in a responseā€.1 In order to prove a re- 1This sentence appears to deļ¬ne a property of continuous avail- ability, but the rest of the paper does not refer to this ā€œcontinuousā€ aspect. 2
  • 3. sult about systems in general, this deļ¬nition interprets availability as a property of an algorithm, not as an ob- served metric during system operation ā€“ i.e. they de- ļ¬ne a system as being ā€œavailableā€ or ā€œunavailableā€ stat- ically, based on its algorithms, not its operational status at some point in time. One particular execution of the algorithm is avail- able if every request in that execution eventually re- ceives a response. Thus, an algorithm is ā€œavailableā€ un- der Gilbert and Lynchā€™s deļ¬nition if all possible exe- cutions of the algorithm are available. That is, the al- gorithm must guarantee that requests always result in responses, no matter what happens in the system (see section 2.3.1). Note that Gilbert and Lynchā€™s deļ¬nition requires any non-failed node to be able to generate valid responses, even if that node is completely isolated from the other nodes. This deļ¬nition is at odds with Fox and Brewerā€™s original proposal of CAP, which states that ā€œdata is con- sidered highly available if a given consumer of the data can always reach some replicaā€ [24, emphasis original]. Many so-called highly available or fault-tolerant sys- tems have very high uptime in practice, but are in fact ā€œunavailableā€ under Gilbert and Lynchā€™s deļ¬nition [34]: for example, in a system with an elected leader or pri- mary node, if a client that cannot reach the leader due to a network fault, the client cannot perform any writes, even though it may be able to reach another replica. 2.1.2 No maximum latency Note that Gilbert and Lynchā€™s deļ¬nition of availability does not specify any upper bound on operation latency: it only requires requests to eventually return a response within some unbounded but ļ¬nite time. This is conve- nient for proof purposes, but does not closely match our intuitive notion of availability (in most situations, a ser- vice that takes a week to respond might as well be con- sidered unavailable). This deļ¬nition of availability is a pure liveness prop- erty, not a safety property [5]: that is, at any point in time, if the response has not yet arrived, there is still hope that the availability property might still be ful- ļ¬lled, because the response may yet arrive ā€“ it is never too late. This aspect of the deļ¬nition will be impor- tant in section 3, when we examine Gilbert and Lynchā€™s proofs in more detail. (In section 4 we will discuss a deļ¬nition of availabil- ity that takes latency into account.) 2.1.3 Failed nodes Another noteworthy aspect of Gilbert and Lynchā€™s def- inition of availability is the proviso of applying only to non-failed nodes. This allows the aforementioned deļ¬- nition of a CA system as one that fails catastrophically if a network partition occurs: if the partition causes all nodes to fail, then the availability requirement does not apply to any nodes, and thus it is trivially satisļ¬ed, even if no node is able to respond to any requests. This deļ¬ni- tion is logically sound, but somewhat counter-intuitive. 2.2 Consistency Consistency is also an overloaded word in data sys- tems: consistency in the sense of ACID is a very differ- ent property from consistency in CAP [17]. In the dis- tributed systems literature, consistency is usually under- stood as not one particular property, but as a spectrum of models with varying strengths of guarantee. Examples of such consistency models include linearizability [30], sequential consistency [38], causal consistency [4] and PRAM [42]. 
There is some similarity between consistency mod- els and transaction isolation models such as serializ- ability [15], snapshot isolation [14], repeatable read and read committed [3, 27]. Both describe restrictions on the values that operations may return, depending on other (prior or concurrent) operations. The difference is that transaction isolation models are usually formalized assuming a single replica, and operate at the granular- ity of transactions (each transaction may read or write multiple objects). Consistency models assume multi- ple replicas, but are usually deļ¬ned in terms of single- object operations (not grouped into transactions). Bailis et al. [12] demonstrate a uniļ¬ed framework for reason- ing about both distributed consistency and transaction isolation in terms of CAP. 2.2.1 The C in CAP Fox and Brewer [24] deļ¬ne the C in CAP as one- copy serializability (1SR) [15], whereas Gilbert and Lynch [25] deļ¬ne it as linearizability. Those deļ¬ni- tions are not identical, but fairly similar.2 Both are 2Linearizability is a recency guarantee, whereas 1SR is not. 1SR requires isolated execution of multi-object transactions, which lin- earizability does not. Both require coordination, in the sense of sec- tion 4.4 [12]. 3
  • 4. safety properties [5], i.e. restrictions on the possible ex- ecutions of the system, ensuring that certain situations never occur. In the case of linearizability, the situation that may not occur is a stale read: stated informally, once a write operation has completed or some read operation has re- turned a new value, all following read operations must return the new value, until it is overwritten by another write operation. Gilbert and Lynch observe that if the write and read operations occur on different nodes, and those nodes cannot communicate during the time when those operations are being executed, then the safety property cannot be satisļ¬ed, because the read operation cannot know about the value written. The C of CAP is sometimes referred to as strong con- sistency (a term that is not formally deļ¬ned), and con- trasted with eventual consistency [9, 49, 50], which is often regarded as the weakest level of consistency that is useful to applications. Eventual consistency means that if a system stops accepting writes and sufļ¬cient3 communication occurs, then eventually all replicas will converge to the same value. However, as the aforemen- tioned list of consistency models indicates, it is overly simplistic to cast ā€˜strongā€™ and eventual consistency as the only possible choices. 2.2.2 Probabilistic consistency It is also possible to deļ¬ne consistency as a quantitative metric rather than a safety property. For example, Fox and Brewer [24] deļ¬ne harvest as ā€œthe fraction of the data reļ¬‚ected in the response, i.e. the completeness of the answer to the query,ā€ and probabilistically bounded staleness [11] studies the probability of a read returning a stale value, given various assumptions about the distri- bution of network latencies. However, these stochastic deļ¬nitions of consistency are not the subject of CAP. 2.3 Partition Tolerance A network partition has long been deļ¬ned as a com- munication failure in which the network is split into disjoint sub-networks, with no communication possible across sub-networks [32]. This is a fairly narrow class of fault, but it does occur in practice [10], so it is worth studying. 3It is not clear what amount of communication is ā€˜sufļ¬cientā€™. A possible formalization would be to require all replicas to converge to the same value within ļ¬nite time, assuming fair-loss links (see section 2.3.2). 2.3.1 Assumptions about system model It is less clear what partition tolerance means. Gilbert and Lynch [25] deļ¬ne a system as partition-tolerant if it continues to satisfy the consistency and availabil- ity properties in the presence of a partition. Fox and Brewer [24] deļ¬ne partition-resilience as ā€œthe system as whole can survive a partition between data replicasā€ (where survive is not deļ¬ned). At ļ¬rst glance, these deļ¬nitions may seem redundant: if we say that an algorithm provides some guarantee (e.g. linearizability), then we expect all executions of the algorithm to satisfy that property, regardless of the faults that occur during the execution. However, we can clarify the deļ¬nitions by observing that the correctness of a distributed algorithm is always subject to assumptions about the faults that may occur during its execution. If you take an algorithm that as- sumes fair-loss links and crash-stop processes, and sub- ject it to Byzantine faults, the execution will most likely violate safety properties that were supposedly guaran- teed. 
These assumptions are typically encoded in a sys- tem model, and non-Byzantine system models rule out certain kinds of fault as impossible (so algorithms are not expected to tolerate them). Thus, we can interpret partition tolerance as mean- ing ā€œa network partition is among the faults that are assumed to be possible in the system.ā€ Note that this deļ¬nition of partition tolerance is a statement about the system model, whereas consistency and availability are properties of the possible executions of an algorithm. It is misleading to say that an algorithm ā€œprovides parti- tion tolerance,ā€ and it is better to say that an algorithm ā€œassumes that partitions may occur.ā€ If an algorithm assumes the absence of partitions, and is nevertheless subjected to a partition, it may violate its guarantees in arbitrarily undeļ¬ned ways (including failing to respond even after the partition is healed, or deleting arbitrary amounts of data). Even though it may seem that such arbitrary failure semantics are not very useful, various systems exhibit such behavior in prac- tice [35, 36]. Making networks highly reliable is very expensive [10], so most distributed programs must as- sume that partitions will occur sooner or later [28]. 2.3.2 Partitions and fair-loss links Further confusion arises due to the fact that network partitions are only one of a wide range of faults that can occur in distributed systems, including nodes failing or 4
  • 5. restarting, nodes pausing for some amount of time (e.g. due to garbage collection), and loss or delay of mes- sages in the network. Some faults can be modeled in terms of other faults (for example, Gilbert and Lynch state that the loss of an individual message can be mod- eled as a short-lived network partition). In the design of distributed systems algorithms, a commonly assumed system model is fair-loss links [19]. A network link has the fair-loss property if the probability of a message not being lost is non-zero, i.e. the link sometimes delivers messages. The link may have intervals of time during which all messages are dropped, but those intervals must be of ļ¬nite duration. On a fair-loss link, message delivery can be made re- liable by retrying a message an unbounded number of times: the message is guaranteed to be eventually deliv- ered after a ļ¬nite number of attempts [19]. We argue that fair-loss links are a good model of most networks in practice: faults occur unpredictably; messages are lost while the fault is occurring; the fault lasts for some ļ¬nite duration (perhaps seconds, perhaps hours), and eventually it is healed (perhaps after human intervention).There is no malicious actor in the network who can cause systematic message loss over unlimited periods of time ā€“ such malicious actors are usually only assumed in the design of Byzantine fault tolerant algo- rithms. Is ā€œpartitions may occurā€ equivalent to assuming fair- loss links? Gilbert and Lynch [25] deļ¬ne partitions as ā€œthe network will be allowed to lose arbitrarily many messages sent from one node to another.ā€ In this deļ¬ni- tion it is unclear whether the number of lost messages is unbounded but ļ¬nite, or whether it is potentially inļ¬- nite. Partitions of a ļ¬nite duration are possible with fair- loss links, and thus an algorithm that is correct in a sys- tem model of fair-loss links can tolerate partitions of a ļ¬nite duration. Partitions of an inļ¬nite duration require some further thought, as we shall see in section 3. 3 The CAP Proofs In this section, we build upon the discussion of deļ¬ni- tions in the last section, and examine the proofs of the theorems of Gilbert and Lynch [25]. We highlight some ambiguities in the reasoning of the proofs, and then sug- gest a more precise formalization. 3.1 Theorems 1 and 2 Gilbert and Lynchā€™s Theorem 1 is stated as follows: It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: ā€¢ Availability ā€¢ Atomic consistency4 in all fair executions (including those in which mes- sages are lost). Theorem 2 is similar, but speciļ¬ed in a system model with bounded network delay. The discussion in this sec- tion 3.1 applies to both theorems. 3.1.1 Availability of failed nodes The ļ¬rst problem with this proof is the deļ¬nition of availability. As discussed in section 2.1.3, only non- failing nodes are required to respond. If it is possible for the algorithm to declare nodes as failed (e.g. if a node may crash itself), then the avail- ability property can be trivially satisļ¬ed: all nodes can be crashed, and thus no node is required to respond. Of course, such an algorithm would not be useful in prac- tice. Alternatively, if a minority of nodes is permanently partitioned from the majority, an algorithm could deļ¬ne the nodes in the minority partition as failed (by crash- ing them), while the majority partition continues imple- menting a linearizable register [7]. 
3 The CAP Proofs

In this section, we build upon the discussion of definitions in the last section, and examine the proofs of the theorems of Gilbert and Lynch [25]. We highlight some ambiguities in the reasoning of the proofs, and then suggest a more precise formalization.

3.1 Theorems 1 and 2

Gilbert and Lynch's Theorem 1 is stated as follows:

It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:
• Availability
• Atomic consistency (in this context, atomic consistency is synonymous with linearizability, and it is unrelated to the A in ACID)
in all fair executions (including those in which messages are lost).

Theorem 2 is similar, but specified in a system model with bounded network delay. The discussion in this section 3.1 applies to both theorems.

3.1.1 Availability of failed nodes

The first problem with this proof is the definition of availability. As discussed in section 2.1.3, only non-failing nodes are required to respond.

If it is possible for the algorithm to declare nodes as failed (e.g. if a node may crash itself), then the availability property can be trivially satisfied: all nodes can be crashed, and thus no node is required to respond. Of course, such an algorithm would not be useful in practice. Alternatively, if a minority of nodes is permanently partitioned from the majority, an algorithm could define the nodes in the minority partition as failed (by crashing them), while the majority partition continues implementing a linearizable register [7]. This is not the intention of CAP – the raison d'ĆŖtre of CAP is to characterize systems in which a minority partition can continue operating independently of the rest – but the present formalization of availability does not exclude such trivial solutions.

3.1.2 Finite and infinite partitions

Gilbert and Lynch's proofs of theorems 1 and 2 construct an execution of an algorithm A in which a write is followed by a read, while simultaneously a partition exists in the network. By showing that the execution is not linearizable, the authors derive a contradiction.

Note that this reasoning is only correct if we assume a system model in which partitions may have infinite duration.

If the system model is based on fair-loss links, then all partitions may be assumed to be of unbounded but
  • 6. ļ¬nite duration (section 2.3.2). Likewise, Gilbert and Lynchā€™s availability property does not place any upper bound on the duration of an operation, as long as it is ļ¬- nite (section 2.1.2). Thus, if a linearizable algorithm en- counters a network partition in a fair-loss system model, it is acceptable for the algorithm to simply wait for the partition to be healed: at any point in time, there is still hope that the partition will be healed in future, and so the availability property may yet be satisļ¬ed. For exam- ple, the ABD algorithm [7] can be used to implement a linearizable read-write register in an asynchronous net- work with fair-loss links.5 On the other hand, in an execution where a partition of inļ¬nite duration occurs, the algorithm is forced to make a choice between waiting until the partition heals (which never happens, thus violating availability) and exhibiting the execution in the proof of Theorem 1 (thus violating linearizability). We can conclude that Theo- rem 1 is only valid in a system model where inļ¬nite partitions are possible. 3.1.3 Linearizability vs. eventual consistency Note that in the case of an inļ¬nite partition, no infor- mation can ever ļ¬‚ow from one sub-network to the other. Thus, even eventual consistency (replica convergencein ļ¬nite time, see section 2.2.1) is not possible in a system with an inļ¬nite partition. Theorem 1 demonstrated that in a system model with inļ¬nite partitions, no algorithm exists which ensures linearizability and availability in all executions. How- ever, we can also see that in the same system model, no algorithm exists which ensures eventual consistency in all executions. The CAP theorem is often understood as demonstrat- ing that linearizability cannot be achieved with high availability, whereas eventual consistency can. How- ever, the results so far do not differentiate between lin- earizable and eventually consistent algorithms: both are possible if partitions are always ļ¬nite, and both are im- possible in a system model with inļ¬nite partitions. To distinguish between linearizability and eventual consistency, a more careful formalization of CAP is re- quired, which we give in section 3.2. 5ABD [7] is an algorithm for a single-writer multi-reader regis- ter. It was extended to the multi-writer case by Lynch and Shvarts- man [44]. 3.2 The partitionable system model In this section we suggest a more precise formulation of CAP, and derive a result similar to Gilbert and Lynchā€™s Theorem 1 and Corollary 1.1. This formulation will help us gain a better understanding of CAP and its con- sequences. 3.2.1 Deļ¬nitions Deļ¬ne a partitionable link to be a point-to-point link with the following properties: 1. No duplication: If a process p sends a message m once to process q, then m is delivered at most once by q. 2. No creation: If some process q delivers a message m with sender p, then m was previously sent to q by process p. (A partitionable link is allowed to drop an inļ¬nite num- ber of messages and cause unbounded message delay.) Deļ¬ne the partitionable model as a system model in which processes can only communicate via parti- tionable links, in which processes never crash,6 and in which every process has access to a local clock that is able to generate timeouts (the clock progresses mono- tonically at a rate approximately equal to real time, but clocks of different processes are not synchronized). 
Deļ¬ne an execution E as admissible in a system model M if the processes and links in E satisfy the prop- erties deļ¬ned by M. Deļ¬ne an algorithm A as terminating in a system model M if, for every execution E of A, if E is admissi- ble in M, then every operation in E terminates in ļ¬nite time.7 Deļ¬ne an execution E as loss-free if for every mes- sage m sent from p to q during E, m is eventually deliv- ered to q. (There is no delay bound on delivery.) An execution E is partitioned if it is not loss-free. Note: we may assume that links automatically resend lost messages an unbounded number of times. Thus, an execution in which messages are transiently lost during 6The assumption that processes never crash is of course unrealis- tic, but it makes the impossibility results in section 3.2.2 stronger. It also rules out the trivial solution of section 3.1.1. 7Our deļ¬nition of terminating corresponds to Gilbert and Lynchā€™s deļ¬nition of available. We prefer to call it terminating because the word available is widely understood as referring to an empirical met- ric (see section 2.1). There is some similarity to wait-free data struc- tures [29], although these usually assume reliable communication and unreliable processes. 6
  • 7. some ļ¬nite time period is not partitioned, because the links will eventually deliver all messages that were lost. An execution is only partitioned if the message loss per- sists forever. For the deļ¬nition of linearizability we refer to Her- lihy and Wing [30]. There is no generally agreed formalization of even- tual consistency, but the following corresponds to a liveness property that has been proposed [8, 9]: even- tually, every read operation read(q) at process q must return a set of all the values v ever written by any pro- cess p in an operation write(p,v). For simplicity, we assume that values are never removed from the read set, although an application may only see one of the values (e.g. the one with the highest timestamp). More formally, an inļ¬nite execution E is eventually consistent if, for all processes p and q, and for ev- ery value v such that operation write(p,v) occurs in E, there are only ļ¬nitely many operations in E such that v /āˆˆ read(q). 3.2.2 Impossibility results Assertion 1. If an algorithm A implements a termi- nating read-write register R in the partitionable model, then there exists a loss-free execution of A in which R is not linearizable. Proof. Consider an execution E1 in which the initial value of R is v1, and no messages are delivered (all messages are lost, which is admissible for partitionable links). In E1, p ļ¬rst performs an operation write(p,v2) where v2 = v1. This operation must terminate in ļ¬nite time due to the termination property of A. After the write operation terminates, q performs an operation read(q), which must return v1, since there is no way for q to know the value v2, due to all messages being lost. The read must also terminate. This execution is not linearizable, because the read did not return v2. Now consider an execution E2 which extends E1 as follows: after the termination of the read(q) operation, every message that was sent during E1 is delivered (this is admissible for partitionable links). These deliveries cannot affect the execution of the write and read op- erations, since they occur after the termination of both operations, so E2 is also non-linearizable. Moreover, E2 is loss-free, since every message was delivered. Corollary 2. There is no algorithm that implements a terminating read-write register in the partitionable model that is linearizable in all loss-free executions. This corresponds to Gilbert and Lynchā€™s Corollary 1, and follows directly from the existence of a loss-free, non-linearizable execution (assertion 1). Assertion 3. There is no algorithm that implements a terminating read-write register in the partitionable model that is eventually consistent in all executions. Proof. Consider an execution E in which no messages are delivered, and in which process p performs opera- tion write(p,v). This write must terminate, since the al- gorithm is terminating. Process q with p = q performs read(q) inļ¬nitely many times. However, since no mes- sages are delivered, q can never learn about the value written, so read(q) never returns v. Thus, E is not even- tually consistent. 3.2.3 Opportunistic properties Note that corollary 2 is about loss-free executions, whereas assertion 3 is about all executions. If we limit ourselves to loss-free executions, then eventual consis- tency is possible (e.g. by maintaining a replica at each process, and broadcasting every write to all processes). 
However, everything we have discussed in this section pertains to the partitionable model, in which we cannot assume that all executions are loss-free. For clarity, we should specify the properties of an algorithm such that they hold for all admissible executions of a given system model, not only selected executions. To this end, we can transform a property P into an opportunistic property P′ such that:

    ∀E: (E |= P′) ⇔ (lossfree(E) ⇒ (E |= P))

or, equivalently:

    ∀E: (E |= P′) ⇔ (partitioned(E) ∨ (E |= P)).

In other words, P′ is trivially satisfied for executions that are partitioned. Requiring P′ to hold for all executions is equivalent to requiring P to hold for all loss-free executions.

Hence we define an execution E as opportunistically eventually consistent if E is partitioned or if E is eventually consistent. (This is a weaker liveness property than eventual consistency.)

Similarly, we define an execution E as opportunistically terminating linearizable if E is partitioned, or if E is linearizable and every operation in E terminates in finite time.
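Operationally, the transformation from P to P′ can be thought of as a simple weakening, as in the following hypothetical sketch (the predicate representation is ours, chosen only to illustrate the definition above):

    def opportunistic(property_p, is_partitioned):
        """Weaken property_p so that it holds trivially for partitioned executions.

        property_p and is_partitioned are predicates over executions; the returned
        predicate P' satisfies  P'(E) <=> partitioned(E) or P(E),  so requiring P'
        of all executions is the same as requiring P of all loss-free executions.
        """
        def property_p_prime(execution):
            return is_partitioned(execution) or property_p(execution)
        return property_p_prime

    # Example with trivial placeholder predicates:
    always_false = lambda execution: False
    partitioned = lambda execution: execution.get("partitioned", False)
    opp = opportunistic(always_false, partitioned)
    print(opp({"partitioned": True}))   # True: trivially satisfied under a partition
    print(opp({"partitioned": False}))  # False: the underlying property is required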
From the results above, we can see that opportunistic terminating linearizability is impossible in the partitionable model (corollary 2), whereas opportunistic eventual consistency is possible. This distinction can be understood as the key result of CAP. However, it is arguably not a very interesting or insightful result.

3.3 Mismatch between formal model and practical systems

Many of the problems in this section are due to the fact that availability is defined by Gilbert and Lynch as a liveness property (section 2.1.2). Liveness properties make statements about something happening eventually in an infinite execution, which is confusing to practitioners, since real systems need to get things done in a finite (and usually short) amount of time.

Quoting Lamport [39]: "Liveness properties are inherently problematic. The question of whether a real system satisfies a liveness property is meaningless; it can be answered only by observing the system for an infinite length of time, and real systems don't run forever. Liveness is always an approximation to the property we really care about. We want a program to terminate within 100 years, but proving that it does would require the addition of distracting timing assumptions. So, we prove the weaker condition that the program eventually terminates. This doesn't prove that the program will terminate within our lifetimes, but it does demonstrate the absence of infinite loops."

Brewer [17] and some commercial database vendors [1] state that "all three properties [consistency, availability, and partition tolerance] are more continuous than binary". This is in direct contradiction to Gilbert and Lynch's formalization of CAP (and our restatement thereof), which expresses consistency and availability as safety and liveness properties of an algorithm, and partitions as a property of the system model. Such properties either hold or they do not hold; there is no degree of continuity in their definition.

Brewer's informal interpretation of CAP is intuitively appealing, but it is not a theorem, since it is not expressed formally (and thus cannot be proved or disproved) – it is, at best, a rule of thumb. Gilbert and Lynch's formalization can be proved correct, but it does not correspond to practitioners' intuitions for real systems. This contradiction suggests that although the formal model may be true, it is not useful.

4 Alternatives to CAP

In section 2 we explored the definitions of the terms consistency, availability and partition tolerance, and noted that a wide range of ambiguous and mutually incompatible interpretations have been proposed, leading to widespread confusion. Then, in section 3 we explored Gilbert and Lynch's definitions and proofs in more detail, and highlighted some problems with the formalization of CAP.

All of these misunderstandings and ambiguities lead us to assert that CAP is no longer an appropriate tool for reasoning about systems. A better framework for describing trade-offs is required. Such a framework should be simple to understand, match most people's intuitions, and use definitions that are formal and correct.

In the rest of this paper we develop a first draft of an alternative framework called delay-sensitivity, which provides tools for reasoning about trade-offs between consistency and robustness to network faults. It is based on several existing results from the distributed systems literature (most of which in fact predate CAP).
4.1 Latency and availability

As discussed in section 2.1.2, the latency (response time) of operations is often important in practice, but it is deliberately ignored by Gilbert and Lynch.

The problem with latency is that it is more difficult to model. Latency is influenced by many factors, especially the delay of packets on the network. Many computer networks (including Ethernet and the Internet) do not guarantee bounded delay, i.e. they allow packets to be delayed arbitrarily. Latencies and network delays are therefore typically described as probability distributions.

On the other hand, network delay can model a wide range of faults. In network protocols that automatically retransmit lost packets (such as TCP), transient packet loss manifests itself to the application as temporarily increased delay. Even when the period of packet loss exceeds the TCP connection timeout, application-level protocols often retry failed requests until they succeed, so the effective latency of the operation is the time from the first attempt until the successful completion. Even network partitions can be modelled as large packet delays (up to the duration of the partition), provided that the duration of the partition is finite and lost packets are retransmitted an unbounded number of times.

Abadi [2] argues that there is a trade-off between consistency and latency, which applies even when there is no network partition, and which is as important as the consistency/availability trade-off described by CAP. He proposes a "PACELC" formulation to reason about this trade-off.

We go further, and assert that availability should be modeled in terms of operation latency. For example, we could define the availability of a service as the proportion of requests that meet some latency bound (e.g. returning successfully within 500 ms, as defined by an SLA). This empirically-founded definition of availability closely matches our intuitive understanding.

We can then reason about a service's tolerance of network problems by analyzing how operation latency is affected by changes in network delay, and whether this pushes operation latency over the limit set by the SLA. If a service can sustain low operation latency, even as network delay increases dramatically, it is more tolerant of network problems than a service whose latency increases.
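As a concrete illustration of this empirically-founded notion of availability, the following sketch computes the fraction of requests that returned successfully within an SLA latency bound. The 500 ms bound and the request-log format are assumptions made for the example, not prescriptions.

    def availability(requests, sla_latency_ms=500.0):
        """Fraction of requests that succeeded within the SLA latency bound.

        `requests` is a list of (latency_ms, succeeded) pairs observed over
        some period of system operation.
        """
        if not requests:
            return 1.0
        ok = sum(1 for latency, succeeded in requests
                 if succeeded and latency <= sla_latency_ms)
        return ok / len(requests)

    # A burst of increased network delay pushes some requests over the bound,
    # which shows up directly as reduced availability.
    observed = [(12.0, True), (45.0, True), (830.0, True), (120.0, True), (None, False)]
    print(f"availability = {availability(observed):.0%}")  # 60%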
4.2 How operation latency depends on network delay

To find a replacement for CAP with a latency-centric viewpoint we need to examine how operation latency is affected by network latency at different levels of consistency. In practice, this depends on the algorithms and implementation of the particular software being used. However, CAP demonstrated that there is also interest in theoretical results identifying the fundamental limits of what can be achieved, regardless of the particular algorithm in use.

Several existing impossibility results establish lower bounds on the operation latency as a function of the network delay d. These results show that any algorithm guaranteeing a particular level of consistency cannot perform operations faster than some lower bound. We summarize these results in table 1 and in the following sections.

    Consistency level        write latency    read latency
    linearizability          O(d)             O(d)
    sequential consistency   O(d)             O(1)
    causal consistency       O(1)             O(1)

    Table 1: Lowest possible operation latency at various consistency levels, as a function of network delay d.

Our notation is similar to that used in complexity theory to describe the running time of an algorithm. However, rather than being a function of the size of input, we describe the latency of an operation as a function of network delay.

In this section we assume unbounded network delay, and unsynchronized clocks (i.e. each process has access to a clock that progresses monotonically at a rate approximately equal to real time, but the synchronization error between clocks is unbounded).

4.2.1 Linearizability

Attiya and Welch [6] show that any algorithm implementing a linearizable read-write register must have an operation latency of at least u/2, where u is the uncertainty of delay in the network between replicas. (Attiya and Welch [6] originally proved a bound of u/2 for write operations, assuming two writer processes and one reader, and a bound of u/4 for read operations, with two readers and one writer; the u/2 bound for read operations is due to Mavronicolas and Roth [46].)

In this proof, network delay is assumed to be at most d and at least d − u, so u is the difference between the minimum and maximum network delay. In many networks, the maximum possible delay (due to network congestion or retransmitting lost packets) is much greater than the minimum possible delay (due to the speed of light), so u ≈ d. If network delay is unbounded, operation latency is also unbounded.

For the purposes of this survey, we can simplify the result to say that linearizability requires the latency of read and write operations to be proportional to the network delay d.
This is indicated in table 1 as O(d) latency for reads and writes. We call these operations delay-sensitive, as their latency is sensitive to changes in network delay.

4.2.2 Sequential consistency

Lipton and Sandberg [42] show that any algorithm implementing a sequentially consistent read-write register must have |r| + |w| ≥ d, where |r| is the latency of a read operation, |w| is the latency of a write operation, and d is the network delay. Mavronicolas and Roth [46] further develop this result.

This lower bound provides a degree of choice for the application: for example, an application that performs more reads than writes can reduce the average operation latency by choosing |r| = 0 and |w| ≥ d, whereas a write-heavy application might choose |r| ≥ d and |w| = 0. Attiya and Welch [6] describe algorithms for both of these cases (the |r| = 0 case is similar to the Zab algorithm used by Apache ZooKeeper [33]).

Choosing |r| = 0 or |w| = 0 means the operation can complete without waiting for any network communication (it may still send messages, but need not wait for a response from other nodes). The latency of such an operation thus only depends on the local database algorithms: it might be constant-time O(1), or it might be O(log n) where n is the size of the database, but either way it is independent of the network delay d, so we call it delay-independent.

In table 1, sequential consistency is described as having fast reads and slow writes (constant-time reads, and write latency proportional to network delay), although these roles can be swapped if an application prefers fast writes and slow reads.
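The following toy sketch illustrates the shape of the |r| = 0 choice (it is not the algorithm of Attiya and Welch, nor Zab): reads return the local replica's state without any network communication, while a write does not return until every replica has applied it, so write latency grows with the network delay d. The single-writer setting and the simulated delay value are assumptions made for the example; establishing a total order of writes from multiple writers, as a leader-based protocol would, is omitted here.

    import time

    NETWORK_DELAY = 0.05   # d, in seconds; hypothetical value for the simulation

    class Replica:
        def __init__(self):
            self.value = None

        def read(self):
            # |r| = 0: answer from local state, no network communication.
            return self.value

        def apply(self, value):
            self.value = value

    class WriteThroughRegister:
        """Toy register with local reads and writes that wait for all replicas.

        Writes are applied in the order they are issued through this single
        object, standing in for the total order that a leader-based protocol
        such as Zab would establish; with that caveat, |r| = 0 and |w| >= d.
        """
        def __init__(self, replicas):
            self.replicas = replicas

        def write(self, value):
            for replica in self.replicas:
                # Stand-in for waiting for this replica's acknowledgement; a real
                # system overlaps these waits, but |w| remains at least d.
                time.sleep(NETWORK_DELAY)
                replica.apply(value)

    replicas = [Replica(), Replica(), Replica()]
    register = WriteThroughRegister(replicas)

    start = time.time()
    register.write("x = 1")
    print(f"write latency ~ {time.time() - start:.2f}s (grows with d)")

    start = time.time()
    value = replicas[0].read()
    print(value, f"read latency ~ {time.time() - start:.4f}s (independent of d)")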
4.2.3 Causal consistency

If sequential consistency allows the latency of some operations to be independent of network delay, which level of consistency allows all operation latencies to be independent of the network? Recent results [8, 45] show that causal consistency [4] with eventual convergence is the strongest possible consistency guarantee with this property. (There are a few variants of causal consistency, such as real time causal [45], causal+ [43] and observable causal [8] consistency. They have subtle differences, but we do not have space in this paper to compare them in detail.)

Read Your Writes [49], PRAM [42] and other weak consistency models (all the way down to eventual consistency, which provides no safety property [9]) are weaker than causal consistency, and thus achievable without waiting for the network.

If tolerance of network delay is the only consideration, causal consistency is the optimal consistency level. There may be other reasons for choosing weaker consistency levels (for example, the metadata overhead of tracking causality [8, 20]), but these trade-offs are outside of the scope of this discussion, as they are also outside the scope of CAP.

4.3 Heterogeneous delays

A limitation of the results in section 4.2 is that they assume the distribution of network delays is the same between every pair of nodes. This assumption is not true in general: for example, network delay between nodes in the same datacenter is likely to be much lower than between geographically distributed nodes communicating over WAN links.

If we model network faults as periods of increased network delay (section 4.1), then a network partition is a situation in which the delay between nodes within each partition remains small, while the delay across partitions increases dramatically (up to the duration of the partition). For O(d) algorithms, which of these different delays do we need to assume for d? The answer depends on the communication pattern of the algorithm.

4.3.1 Modeling network topology

For example, a replication algorithm that uses a single leader or primary node requires all write requests to contact the primary, and thus d in this case is the network delay between the client and the leader (possibly via other nodes). In a geographically distributed system, if client and leader are in different locations, d includes WAN links. If the client is temporarily partitioned from the leader, d increases up to the duration of the partition.
By contrast, the ABD algorithm [7] waits for responses from a majority of replicas, so d is the largest delay among the majority of replicas that are fastest to respond. If a minority of replicas is temporarily partitioned from the client, the operation latency remains independent of the partition duration.

Another possibility is to treat network delay within the same datacenter, d_local, differently from network delay over WAN links, d_remote, because usually d_local ≪ d_remote. Systems such as COPS [43], which place a leader in each datacenter, provide linearizable operations within one datacenter (requiring O(d_local) latency), and causal consistency across datacenters (making the request latency independent of d_remote).
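To illustrate the point about majorities (a simplified latency model, not the ABD protocol itself), the following sketch computes the latency of an operation that waits for the fastest majority of replica responses, and contrasts it with an operation that must reach a partitioned leader. The delay values are hypothetical.

    def quorum_latency(replica_delays):
        """Latency of an operation that waits for responses from a majority.

        The operation completes as soon as the fastest majority has replied,
        so its latency is the largest delay within that majority.
        """
        n = len(replica_delays)
        majority = n // 2 + 1
        return sorted(replica_delays)[majority - 1]

    # Three replicas in the local datacenter, two remote replicas that are
    # currently unreachable (modelled as a very large delay).
    delays_ms = [2.1, 3.4, 2.8, 60_000, 60_000]
    print(quorum_latency(delays_ms))   # 3.4 ms: the partitioned minority is ignored

    # A leader-based design whose leader sits behind the partition instead
    # inherits the full delay to the leader:
    print(max(2.1, 60_000))            # 60000 ms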
4.4 Delay-independent operations

The big-O notation for operation latency ignores constant factors (such as the number of network round-trips required by an algorithm), but it captures the essence of what we need to know for building systems that can tolerate network faults: what happens if network delay dramatically degrades? In a delay-sensitive O(d) algorithm, operation latency may increase to be as large as the duration of the network interruption (i.e. minutes or even hours), whereas a delay-independent O(1) algorithm remains unaffected.

If the SLA calls for operation latencies that are significantly shorter than the expected duration of network interruptions, delay-independent algorithms are required. In such algorithms, the time until replica convergence is still proportional to d, but convergence is decoupled from operation latency. Put another way, delay-independent algorithms support disconnected or offline operation. Disconnected operation has long been used in network file systems [37] and automatic teller machines [18].

For example, consider a calendar application running on a mobile device: a user may travel through a tunnel or to a remote location where there is no cellular network coverage. For a mobile device, regular network interruptions are expected, and they may last for days. During this time, the user should still be able to interact with the calendar app, checking their schedule and adding events (with any changes asynchronously propagated when an internet connection is next available).

However, even in environments with fast and reliable network connectivity, delay-independent algorithms have been shown to have performance and scalability benefits: in this context, they are known as coordination-free [13] or ALPS [43] systems. Many popular database integrity constraints can be implemented without synchronous coordination between replicas [13].
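The calendar example can be made concrete with the following sketch of delay-independent (offline-capable) operation: writes are applied to local state immediately and queued for later propagation, which is attempted whenever connectivity is available. This is a hypothetical minimal design; a real system would additionally need conflict handling when replicas converge.

    class OfflineCalendar:
        """Delay-independent writes: apply locally, propagate asynchronously."""

        def __init__(self):
            self.events = {}        # local replica state
            self.outbox = []        # changes awaiting propagation
            self.online = False

        def add_event(self, event_id, description):
            # O(1): completes without waiting for the network.
            self.events[event_id] = description
            self.outbox.append(("add", event_id, description))

        def schedule(self):
            # Reads are served from local state even while disconnected.
            return dict(self.events)

        def sync(self, server):
            # Called when a connection is available; convergence time depends on
            # network delay, but the operation latency above never did.
            if not self.online:
                return
            while self.outbox:
                server.apply(self.outbox.pop(0))

    class Server:
        def __init__(self):
            self.log = []

        def apply(self, change):
            self.log.append(change)

    calendar, server = OfflineCalendar(), Server()
    calendar.add_event("e1", "dentist")   # works in a tunnel, no coverage needed
    print(calendar.schedule())            # user still sees their schedule
    calendar.online = True
    calendar.sync(server)                 # changes propagate once back online
    print(server.log)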
4.5 Proposed terminology

Much of the confusion around CAP is due to the ambiguous, counter-intuitive and contradictory definitions of terms such as availability, as discussed in section 2. In order to improve the situation and reduce misunderstandings, there is a need to standardize terminology with simple, formal and correct definitions that match the intuitions of practitioners.

Building upon the observations above and the results cited in section 4.2, we propose the following definitions as a first draft of a delay-sensitivity framework for reasoning about consistency and availability trade-offs. These definitions are informal and intended as a starting point for further discussion.

Availability is an empirical metric, not a property of an algorithm. It is defined as the percentage of successful requests (returning a non-error response within a predefined latency bound) over some period of system operation.

Delay-sensitive describes algorithms or operations that need to wait for network communication to complete, i.e. which have latency proportional to network delay. The opposite is delay-independent. Systems must specify the nature of the sensitivity (e.g. an operation may be sensitive to intra-datacenter delay but independent of inter-datacenter delay). A fully delay-independent system supports disconnected (offline) operation.

Network faults encompass packet loss (both transient and long-lasting) and unusually large packet delay. Network partitions are just one particular type of network fault; in most cases, systems should plan for all kinds of network fault, and not only partitions. As long as lost packets or failed requests are retried, they can be modeled as large network delay.

Fault tolerance is used in preference to high availability or partition tolerance. The maximum fault that can be tolerated must be specified (e.g. "the algorithm can tolerate up to a minority of replicas crashing or disconnecting"), and the description must also state what happens if more faults occur than the system can tolerate (e.g. all requests return an error, or a consistency property is violated).

Consistency refers to a spectrum of different consistency models (including linearizability and causal consistency), not one particular consistency model. When a particular consistency model such as linearizability is intended, it is referred to by its usual name. The term strong consistency is vague, and may refer to linearizability, sequential consistency or one-copy serializability.

5 Conclusion

In this paper we discussed several problems with the CAP theorem: the definitions of consistency, availability and partition tolerance in the literature are somewhat contradictory and counter-intuitive, and the distinction that CAP draws between "strong" and "eventual" consistency models is less clear than widely believed.
CAP has nevertheless been very influential in the design of distributed data systems. It deserves credit for catalyzing the exploration of the design space of systems with weak consistency guarantees, e.g. in the NoSQL movement. However, we believe that CAP has now reached the end of its usefulness; we recommend that it should be relegated to the history of distributed systems, and no longer be used for justifying design decisions.

As an alternative to CAP, we propose a simple delay-sensitivity framework for reasoning about trade-offs between consistency guarantees and tolerance of network faults in a replicated database. Every operation is categorized as either O(d), if its latency is sensitive to network delay, or O(1), if it is independent of network delay. On the assumption that lost messages are retransmitted an unbounded number of times, we can model network faults (including partitions) as periods of greatly increased delay. The algorithm's sensitivity to network delay determines whether the system can still meet its service level agreement (SLA) when a network fault occurs.

The actual sensitivity of a system to network delay depends on its implementation, but – in keeping with the goal of CAP – we can prove that certain levels of consistency cannot be achieved without making operation latency proportional to network delay. These theoretical lower bounds are summarized in Table 1. We have not proved any new results in this paper, but merely drawn on existing distributed systems research dating mostly from the 1990s (and thus predating CAP).

For future work, it would be interesting to model the probability distribution of latencies for different concurrency control and replication algorithms (e.g. by extending PBS [11]), rather than modeling network delay as just a single number d. It would also be interesting to model the network communication topology of distributed algorithms more explicitly.

We hope that by being more rigorous about the implications of different consistency levels on performance and fault tolerance, we can encourage designers of distributed data systems to continue the exploration of the design space. And we also hope that by adopting simple, correct and intuitive terminology, we can help guide application developers towards the storage technologies that are most appropriate for their use cases.

Acknowledgements

Many thanks to Niklas Ekström, Seth Gilbert and Henry Robinson for fruitful discussions around this topic.

References

[1] ACID support in Aerospike. Aerospike, Inc., June 2014. URL http://www.aerospike.com/docs/architecture/assets/AerospikeACIDSupport.pdf.

[2] Daniel J Abadi. Consistency tradeoffs in modern distributed database system design. IEEE Computer Magazine, 45(2):37–42, February 2012. doi:10.1109/MC.2012.33.

[3] Atul Adya. Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions. PhD thesis, Massachusetts Institute of Technology, March 1999.

[4] Mustaque Ahamad, Gil Neiger, James E Burns, Prince Kohli, and Phillip W Hutto. Causal memory: definitions, implementation, and programming. Distributed Computing, 9(1):37–49, March 1995. doi:10.1007/BF01784241.

[5] Bowen Alpern and Fred B Schneider. Defining liveness. Information Processing Letters, 21(4):181–185, October 1985. doi:10.1016/0020-0190(85)90056-0.

[6] Hagit Attiya and Jennifer L Welch. Sequential consistency versus linearizability. ACM Transactions on Computer Systems (TOCS), 12(2):91–122, May 1994. doi:10.1145/176575.176576.
[7] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):124–142, January 1995. doi:10.1145/200836.200869.

[8] Hagit Attiya, Faith Ellen, and Adam Morrison. Limitations of highly-available eventually-consistent data stores. In ACM Symposium on Principles of Distributed Computing (PODC), July 2015. doi:10.1145/2767386.2767419.

[9] Peter Bailis and Ali Ghodsi. Eventual consistency today: Limitations, extensions, and beyond. ACM Queue, 11(3), March 2013. doi:10.1145/2460276.2462076.

[10] Peter Bailis and Kyle Kingsbury. The network is reliable. ACM Queue, 12(7), July 2014. doi:10.1145/2639988.2639988.

[11] Peter Bailis, Shivaram Venkataraman, Michael J Franklin, Joseph M Hellerstein, and Ion Stoica. Probabilistically bounded staleness for practical partial quorums. Proceedings of the VLDB Endowment, 5(8):776–787, April 2012.

[12] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica. Highly available transactions: Virtues and limitations. In 40th International Conference on Very Large Data Bases (VLDB), September 2014.

[13] Peter Bailis, Alan Fekete, Michael J Franklin, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica. Coordination-avoiding database systems. Proceedings of the VLDB Endowment, 8(3):185–196, November 2014.
[14] Hal Berenson, Philip A Bernstein, Jim N Gray, Jim Melton, Elizabeth O'Neil, and Patrick O'Neil. A critique of ANSI SQL isolation levels. In ACM International Conference on Management of Data (SIGMOD), May 1995. doi:10.1145/568271.223785.

[15] Philip A Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN 0201107155. URL http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx.

[16] Eric A Brewer. Towards robust distributed systems (keynote). In 19th ACM Symposium on Principles of Distributed Computing (PODC), July 2000.

[17] Eric A Brewer. CAP twelve years later: How the "rules" have changed. IEEE Computer Magazine, 45(2):23–29, February 2012. doi:10.1109/MC.2012.37.

[18] Eric A Brewer. NoSQL: Past, present, future. In QCon San Francisco, November 2012. URL http://www.infoq.com/presentations/NoSQL-History.

[19] Christian Cachin, Rachid Guerraoui, and LuĆ­s Rodrigues. Introduction to Reliable and Secure Distributed Programming. Springer, 2nd edition, February 2011. ISBN 978-3-642-15259-7.

[20] Bernadette Charron-Bost. Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39(1):11–16, July 1991. doi:10.1016/0020-0190(91)90055-M.

[21] Jeff Darcy. When partitions attack, October 2010. URL http://pl.atyp.us/wordpress/index.php/2010/10/when-partitions-attack/.

[22] Susan B Davidson, Hector Garcia-Molina, and Dale Skeen. Consistency in partitioned networks. ACM Computing Surveys, 17(3):341–370, September 1985. doi:10.1145/5505.5508.

[23] Michael J Fischer and Alan Michael. Sacrificing serializability to attain high availability of data in an unreliable network. In 1st ACM Symposium on Principles of Database Systems (PODS), pages 70–75, March 1982. doi:10.1145/588111.588124.

[24] Armando Fox and Eric A Brewer. Harvest, yield, and scalable tolerant systems. In 7th Workshop on Hot Topics in Operating Systems (HotOS), pages 174–178, March 1999. doi:10.1109/HOTOS.1999.798396.

[25] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002. doi:10.1145/564585.564601.

[26] Seth Gilbert and Nancy Lynch. Perspectives on the CAP theorem. IEEE Computer Magazine, 45(2):30–36, February 2012. doi:10.1109/MC.2011.389.

[27] Jim N Gray, Raymond A Lorie, Gianfranco R Putzolu, and Irving L Traiger. Granularity of locks and degrees of consistency in a shared data base. In G M Nijssen, editor, Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, pages 364–394. Elsevier/North Holland Publishing, 1976.

[28] Coda Hale. You can't sacrifice partition tolerance, October 2010. URL http://codahale.com/you-cant-sacrifice-partition-tolerance/.

[29] Maurice P Herlihy. Impossibility and universality results for wait-free synchronization. In 7th ACM Symposium on Principles of Distributed Computing (PODC), pages 276–290, August 1988. doi:10.1145/62546.62593.

[30] Maurice P Herlihy and Jeannette M Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990. doi:10.1145/78969.78972.

[31] Jeff Hodges. Notes on distributed systems for young bloods, January 2013. URL http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/.
[32] Paul R Johnson and Robert H Thomas. RFC 677: The maintenance of duplicate databases. Network Working Group, January 1975. URL https://tools.ietf.org/html/rfc677.

[33] Flavio P Junqueira, Benjamin C Reed, and Marco Serafini. Zab: High-performance broadcast for primary-backup systems. In 41st IEEE International Conference on Dependable Systems and Networks (DSN), pages 245–256, June 2011. doi:10.1109/DSN.2011.5958223.

[34] Won Kim. Highly available systems for database application. ACM Computing Surveys, 16(1):71–98, March 1984. doi:10.1145/861.866.

[35] Kyle Kingsbury. Call me maybe: RabbitMQ, June 2014. URL https://aphyr.com/posts/315-call-me-maybe-rabbitmq.

[36] Kyle Kingsbury. Call me maybe: Elasticsearch 1.5.0, April 2015. URL https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0.

[37] James J Kistler and M Satyanarayanan. Disconnected operation in the Coda file system. ACM Transactions on Computer Systems (TOCS), 10(1):3–25, February 1992. doi:10.1145/146941.146942.

[38] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, September 1979. doi:10.1109/TC.1979.1675439.

[39] Leslie Lamport. Fairness and hyperfairness. Distributed Computing, 13(4):239–245, November 2000. doi:10.1007/PL00008921.

[40] Bruce G Lindsay, Patricia Griffiths Selinger, C Galtieri, Jim N Gray, Raymond A Lorie, Thomas G Price, Gianfranco R Putzolu, Irving L Traiger, and Bradford W Wade. Notes on distributed databases. Technical Report RJ2571(33471), IBM Research, July 1979.

[41] Nicolas Liochon. You do it too: Forfeiting network partition tolerance in distributed systems, July 2015. URL http://blog.thislongrun.com/2015/07/Forfeit-Partition-Tolerance-Distributed-System-CAP-Theorem.html.
[42] Richard J Lipton and Jonathan S Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Princeton University Department of Computer Science, September 1988.

[43] Wyatt Lloyd, Michael J Freedman, Michael Kaminsky, and David G Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In 23rd ACM Symposium on Operating Systems Principles (SOSP), pages 401–416, October 2011. doi:10.1145/2043556.2043593.

[44] Nancy Lynch and Alex Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), pages 272–281, June 1997. doi:10.1109/FTCS.1997.614100.

[45] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. Consistency, availability, and convergence. Technical Report UTCS TR-11-22, University of Texas at Austin, Department of Computer Science, May 2011.

[46] Marios Mavronicolas and Dan Roth. Linearizable read/write objects. Theoretical Computer Science, 220(1):267–319, June 1999. doi:10.1016/S0304-3975(98)90244-4.

[47] Henry Robinson. CAP confusion: Problems with 'partition tolerance', April 2010. URL http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/.

[48] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O Myreen. x86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. Communications of the ACM, 53(7):89–97, July 2010. doi:10.1145/1785414.1785443.

[49] Douglas B Terry, Alan J Demers, Karin Petersen, Mike J Spreitzer, Marvin M Theimer, and Brent B Welch. Session guarantees for weakly consistent replicated data. In 3rd International Conference on Parallel and Distributed Information Systems (PDIS), pages 140–149, September 1994. doi:10.1109/PDIS.1994.331722.

[50] Werner Vogels. Eventually consistent. ACM Queue, 6(6):14–19, October 2008. doi:10.1145/1466443.1466448.