The document discusses various database consistency models including ACID, BASE, and eventual consistency. It describes ACID which provides atomicity, consistency, isolation, and durability but has performance limitations. BASE sacrifices consistency for availability. Eventual consistency guarantees that if no new updates are made, all accesses will eventually return the last value. It also discusses solutions like ACID 2.0 and CRDTs which allow for ACID-like consistency in distributed systems through principles like commutativity and idempotence.
2. ACID
● Atomicity: each transaction is "all or nothing" (Commit or
rollback)
● Consistency: any transaction will bring the database from one
valid state to another (Preserves relational integrity)
● Isolation: concurrent execution of transactions results in a system
state that would be obtained if transactions were executed serially
● Durability: once committed, a transaction is persisted to disk
(rebooting doesn't cause data loss, for example)
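A minimal sketch of the "all or nothing" property, using Python's built-in sqlite3 module. The accounts table, the names and the transfer() helper are made up for illustration; the CHECK constraint stands in for an integrity rule.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so that a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the credit above was rolled back as well

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))")
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])

ok = transfer(conn, "alice", "bob", 30)       # valid: commits
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolls back
final = balances(conn)
```

The second transfer credits bob before the debit violates the constraint, and the rollback removes that credit too, so no half-applied transfer is ever visible.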
4. Deficiencies of ACID
● Difficult to maintain high availability & fault
tolerance in distributed scenarios
● CAP Theorem
● Huge performance overhead in distributed
synchronization
● Huge performance overhead to maintain integrity
6. CAP Theorem
(Brewer's conjecture)
● In plain English:
"...during a network partition, a distributed system
must choose either Consistency or Availability." --
foundationdb.com
7. CAP Theorem
(Brewer's conjecture)
● Assume that you want strong consistency.
● This implies synchronous, blocking updates.
● Assume you also want availability.
● This implies multiple nodes with redundancies.
● When you update one node, you need to broadcast
synchronously to all other nodes, waiting for
successful confirmations (very slow!!!)
● So far so good... But now a node failed to connect to
the others (network failure)!
● If you don't wait for it to come back, you've
sacrificed consistency. If you block on it, you've
sacrificed availability.
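The choice described above can be sketched with a toy two-node cluster. Everything here (Cluster, Replica, Unavailable, the "CP"/"AP" mode flag) is hypothetical and only models the decision, not a real replication protocol.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""

class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None

class Cluster:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Block/refuse: consistency kept, availability sacrificed.
                raise Unavailable("peer unreachable, write refused")
            # Accept locally: availability kept, replicas now diverge.
            replica.value = value
            return
        # No partition: synchronous replication to the peer.
        replica.value = value
        peer.value = value

cp = Cluster("CP")
cp.write(cp.a, 1)
cp.partitioned = True
try:
    cp.write(cp.a, 2)
    cp_refused = False
except Unavailable:
    cp_refused = True

ap = Cluster("AP")
ap.write(ap.a, 1)
ap.partitioned = True
ap.write(ap.a, 2)
```

After the partition, the CP cluster still agrees on 1 but rejected a request, while the AP cluster answered every request but holds 2 on one node and 1 on the other.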
9. BASE
● Basically available: there will be a response to any request, but
that response could still be ‘failure’ to obtain the requested data or
the data may be in an inconsistent or changing state.
● Soft state: even during times without input there may be changes
going on due to ‘eventual consistency,’ thus the state of the
system is always ‘soft.’
● Eventually consistent: "the storage system guarantees that if no
new updates are made to the object, eventually all accesses will
return the last updated value." -- Werner Vogels, CTO of Amazon.com
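To make "eventually all accesses will return the last updated value" concrete, here is a toy sketch of three replicas that converge through background synchronization. LWWReplica and anti_entropy_round are hypothetical names, the ring-shaped pull is an arbitrary choice, and timestamp ties are simply ignored.

```python
class LWWReplica:
    """A single register that keeps the value with the highest timestamp."""
    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, value, timestamp):
        if timestamp > self.timestamp:  # last write wins
            self.timestamp, self.value = timestamp, value

    def sync_from(self, other):
        self.write(other.value, other.timestamp)

def anti_entropy_round(replicas):
    """Each replica pulls state from its right-hand neighbour in a ring."""
    for i, replica in enumerate(replicas):
        replica.sync_from(replicas[(i + 1) % len(replicas)])

replicas = [LWWReplica() for _ in range(3)]
replicas[0].write("v1", timestamp=1)          # update lands on one node only

reads_before = [r.value for r in replicas]    # other nodes are still stale
anti_entropy_round(replicas)
reads_mid = [r.value for r in replicas]       # partially propagated
anti_entropy_round(replicas)
reads_after = [r.value for r in replicas]     # converged
```

Between the write and the last round, a read can return either the old or the new value depending on which node answers; once updates stop and synchronization finishes, every node returns "v1".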
10. Safety versus Liveness
● Liveness: a value distributed across systems eventually converges
to be the same across those same systems (generally the last
update value).
● "Something good eventually happens"
● Safety: the system is at all times consistent.
● "Nothing bad ever happens"
● Eventual consistency is purely a liveness guarantee (reads
eventually return the same value) and does not make safety
guarantees: an eventually consistent system can return any value
before it converges.
11. Safety versus Liveness
● To be clear: in eventual consistency, by default, two
concurrent read/write increments of a standard
counter can potentially increase it by only 1.
● The last write wins, but there is no guarantee with
regard to what happened in between (and both may
have read the value before it was consistent)
● This is what happens when you don't have any safety
guarantee, as in eventual consistency.
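The lost increment can be reproduced with a plain dictionary standing in for the store. This is only a sketch of the interleaving; the two "clients" are simulated sequentially and the function names are made up.

```python
def interleaved_increments(store):
    """Both clients read before either one writes back."""
    read_a = store["counter"]       # client A reads 0
    read_b = store["counter"]       # client B also reads 0
    store["counter"] = read_a + 1   # A writes 1
    store["counter"] = read_b + 1   # B writes 1: last write wins, A's update is lost
    return store["counter"]

def serial_increments(store):
    """The same two increments, executed one after the other."""
    for _ in range(2):
        store["counter"] = store["counter"] + 1
    return store["counter"]

lost = interleaved_increments({"counter": 0})
serial = serial_increments({"counter": 0})
```

Two increments were requested in both runs, but only the serial run ends at 2; the interleaved run ends at 1.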
12. Examples
● Most big social media websites
● Google Cloud Datastore
● Most NoSQL databases:
● Riak, Redis, Hadoop (without HBase), Couchbase,
MongoDB (in some configurations), Cassandra (in some
configurations)
● Etc.
● Amazon's DynamoDB
● DNS (Domain Name System)
13. Deficiencies of BASE
● Delay in convergence
● No safety guarantee
● You don't have the same update semantics as in ACID
transactions
14. Solutions to BASE's Problems
● Application developers can write compensation logic
● Okay in small, simple applications
● Quickly becomes unmanageable in complex applications
● ACID 2.0: design principles that give ACID-like
consistency even on top of an eventual consistency
mechanism.
16. ACID 2.0
● Associativity & Commutativity: the messages in the queue can
be processed in any order.
● Idempotence: the message queue can use at-least-once-delivery
guarantees (retry logic). Duplicate processing of the same
message doesn't matter.
● Distributed: refers to the fact that ACID 2.0 applies to distributed
systems.
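A small sketch of why these properties help, assuming messages are (message_id, amount) deposits and that a given message_id always carries the same amount. The apply_deposits function is hypothetical.

```python
import random

def apply_deposits(messages):
    """Fold (message_id, amount) deposits into a balance.

    Keying by message_id makes redelivery harmless (idempotence);
    summing the amounts does not depend on arrival order
    (associativity and commutativity).
    """
    applied = {}
    for message_id, amount in messages:
        applied[message_id] = amount  # re-applying the same message changes nothing
    return sum(applied.values())

deposits = [("m1", 10), ("m2", 25), ("m3", 5)]

# Simulate at-least-once delivery: two duplicates, arbitrary order.
redelivered = deposits + [deposits[0], deposits[2]]
random.Random(0).shuffle(redelivered)

in_order_balance = apply_deposits(deposits)
shuffled_balance = apply_deposits(redelivered)
```

Both runs produce the same balance, so the queue is free to retry and reorder without any coordination between consumers.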
17. What does it mean?
● Unlike ACID and BASE, ACID 2.0 doesn't tell you
what the guarantees are; instead it tells you that
certain design principles are immune to
transactional integrity issues.
● In particular, immutable data structures that you
transform are easier to handle than mutable shared
states (as most functional programming languages
have understood)
18. The CALM Theorem
● Consistency as Logical Monotonicity
● Logically monotonic: intuitively, a monotonic program
(or data structure) makes forward progress over time: it
never "retracts" an earlier conclusion in the face of new
information.
● Implementation is usually through a class of data
structures referred to as CRDTs (conflict-free
replicated data types)
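One way to picture "never retracts an earlier conclusion" is to re-evaluate a query as facts arrive and watch whether a True answer can flip back to False. The helper names below are made up for this sketch.

```python
def answers_over_time(query, arrivals):
    """Evaluate `query` on the growing set of facts after each arrival."""
    facts, answers = set(), []
    for fact in arrivals:
        facts.add(fact)
        answers.append(query(facts))
    return answers

def never_retracts(answers):
    """True if no True answer is ever followed by a False one."""
    return all(not (prev and not cur)
               for prev, cur in zip(answers, answers[1:]))

def contains_b(facts):
    # Monotonic: more facts can only turn the answer True.
    return "b" in facts

def fewer_than_three(facts):
    # Non-monotonic: depends on facts that have NOT arrived yet.
    return len(facts) < 3

arrivals = ["a", "b", "c"]
monotonic_answers = answers_over_time(contains_b, arrivals)
non_monotonic_answers = answers_over_time(fewer_than_three, arrivals)
```

The membership query can safely be answered early on any replica; the size query cannot, because a late-arriving fact invalidates an answer that was already given.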
19. Example: the PN-Counter
● Counts the number of increment and decrement calls
per transaction (or "actor", or "node")
● When the value is read, it's calculated on the fly by
summing up the number of increment "marks" and
subtracting the number of decrement "marks"
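A minimal state-based sketch of such a counter. The class and method names are illustrative; the merge step (element-wise maximum of the per-node tallies) is not described on the slide but is the usual way replicas of this kind of counter are reconciled.

```python
class PNCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.incs = {}  # node_id -> number of increments seen from that node
        self.decs = {}  # node_id -> number of decrements seen from that node

    def increment(self, n=1):
        self.incs[self.node_id] = self.incs.get(self.node_id, 0) + n

    def decrement(self, n=1):
        self.decs[self.node_id] = self.decs.get(self.node_id, 0) + n

    def value(self):
        # Computed on read: total increments minus total decrements.
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        # Per-node tallies only grow, so taking the max is commutative,
        # associative and idempotent.
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for node, count in theirs.items():
                mine[node] = max(mine.get(node, 0), count)

a, b = PNCounter("a"), PNCounter("b")
a.increment()
a.increment()
b.increment()
b.decrement()

a.merge(b)
b.merge(a)
a.merge(b)  # merging again is harmless
```

Because each node only ever touches its own tally, concurrent increments on different nodes cannot overwrite each other, and both replicas end up at 2 whatever the merge order.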
21. Example: Bitcoin
● The Bitcoin transaction ledger can be viewed as a
CRDT-like structure: it's append only.
● The ledger contains the history of all transactions
ever made: and it's a replicated dataset, updated by
appending new transactions in a peer-to-peer
"eventual consistency" framework.
22. Example: Apache Spark RDDs
● Spark is a high-performance distributed computing
framework
● Big Data analytics
● Machine learning (MLlib)
● Distributed graph processing (GraphX)
● Spark SQL
● It is commonly used in place of Hadoop MapReduce
(reported as roughly 30 to 100 times faster, depending
on the workload)
● The essence of the Spark framework is a type of data
structure called a Resilient Distributed Dataset
(which can be thought of as a kind of CRDT).
23. Example: Apache Spark RDDs
● RDD features:
● Immutable
● Distributed / Replicated
● Expose map(), filter(), reduce(), join() operations to
produce new derived RDDs (very "functional"
rather than object-oriented – written in Scala)
● Logs "lineage" information (how the RDD was
constructed) across partitions, rather than the data
itself, for efficiency. If a network fault occurs, it
can reconstruct the data through that lineage. This
way the cost of data replication isn't generally
incurred (only in fault recovery scenarios).
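The lineage idea can be sketched in a few lines of plain Python. ToyRDD is a hypothetical, single-machine stand-in written for this example; it is not Spark's API and ignores partitioning entirely.

```python
class ToyRDD:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # only set on the root dataset
        self._parent = parent        # lineage: the dataset this one derives from
        self._transform = transform  # lineage: how to derive it
        self._cache = None           # materialized rows, may be lost

    @classmethod
    def from_list(cls, items):
        return cls(source=tuple(items))

    def map(self, fn):
        return ToyRDD(parent=self,
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:  # first use, or data was lost: rebuild
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_data(self):
        """Simulate a fault that wipes the materialized rows."""
        self._cache = None

base = ToyRDD.from_list(range(6))
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.collect()
even_squares.lose_data()
recovered = even_squares.collect()  # recomputed from lineage, not from a copy
```

Each transformation returns a new ToyRDD and leaves its parent untouched, which is what makes replaying the lineage after a fault safe.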
25. Other examples
● Apache Kafka message queue
● Riak vector clocks for synchronization
● The game League of Legends uses Riak CRDTs for its
in-game chat system
● TreeDoc and Logoot: for collaborative text editing
● SoundCloud uses a CRDT set for streaming,
implemented on top of Redis
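As background for the vector clock entry in the list above, here is a generic sketch of how vector clocks detect concurrent updates. The function names are made up and this is not Riak's API; clocks are plain dictionaries mapping a node name to a counter.

```python
def vc_increment(clock, node):
    """Return a new clock with `node`'s entry advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock

def vc_merge(a, b):
    """Element-wise maximum: the clock that has seen everything a and b saw."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_compare(a, b):
    """Return 'equal', 'before', 'after' or 'concurrent' for a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither saw the other: a real conflict

v1 = vc_increment({}, "x")    # first write, handled by node x
v2a = vc_increment(v1, "x")   # a later write via node x
v2b = vc_increment(v1, "y")   # a write via node y that never saw v2a
merged = vc_merge(v2a, v2b)
```

Here vc_compare(v2a, v2b) reports "concurrent", which is the signal that the two versions need a merge rule (for example a CRDT) rather than a silent last-write-wins overwrite.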
26. Deficiencies of CRDTs
● Not a universal solution: they don't cover all possible
applications
● Garbage collection issues (append-only means it
consumes increasing amounts of space!)
● Complex to design
27. Some solutions
● Bloom programming language
● Provides a "framework" to develop in a commutative,
order-insensitive way that favors CRDT-style data
structures.
● Existing distributed computing platforms do the
complicated work for us (Apache Spark, for example)
● We still need to accept locking ACID or weakly
consistent BASE for some parts of the system. We
can also resort to better "compromises" such as
causal consistency.