Put Your Thinking CAP On

A talk given at JDay Lviv 2015 in Ukraine; originally developed by Yoav Abrahami, and based on the works of Kyle "Aphyr" Kingsbury.

Consistency, availability and partition tolerance: these seemingly innocuous concepts have been giving engineers and researchers of distributed systems headaches for over 15 years. But despite how important they are to the design and architecture of modern software, they are still poorly understood by many engineers.

This session covers the definition and practical ramifications of the CAP theorem; you may think that this has nothing to do with you because you "don't work on distributed systems", or possibly that it doesn't matter because you "run over a local network." Yet even traditional enterprise CRUD applications must obey the laws of physics, which are exactly what the CAP theorem describes. Know the rules of the game and they'll serve you well, or ignore them at your own peril...

Put Your Thinking CAP On

  1. Put Your Thinking CAP On Tomer Gabel, Wix JDay Lviv, 2015
  2. Credits Originally a talk by Yoav Abrahami (Wix) Based on “Call Me Maybe” by Kyle “Aphyr” Kingsbury
  3. Brewer’s CAP Theorem Partition Tolerance Consistency Availability
  4. Brewer’s CAP Theorem Partition Tolerance Consistency Availability
  5. By Example • I want this book! – I add it to the cart – Then continue browsing • There’s only one copy in stock!
  6. By Example • I want this book! – I add it to the cart – Then continue browsing • There’s only one copy in stock! • … and someone else just bought it.
  7. Consistency
  8. Consistency: Defined • In a consistent system: All participants see the same value at the same time • “Do you have this book in stock?”
  9. Consistency: Defined • If our book store is an inconsistent system: – Two customers may buy the book – But there’s only one item in inventory! • We’ve just violated a business constraint.
  10. Availability
  11. Availability: Defined • An available system: – Is reachable – Responds to requests (within SLA) • Availability does not guarantee success! – The operation may fail – “This book is no longer available”
  12. Availability: Defined • What if the system is unavailable? – I complete the checkout – And click on “Pay” – And wait – And wait some more – And… • Did I purchase the book or not?!
  13. Partition Tolerance
  14. Partition Tolerance: Defined • Partition: one or more nodes are unreachable • No practical system runs on a single node • So all systems are susceptible!
  15. “The Network is Reliable” • Drops, delays, duplicates and reordering all happen in an IP network • To a client, delays and drops are the same • Perfect failure detection is provably impossible¹! ¹ “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson
  16. Partition Tolerance: Reified • External causes: – Bad network config – Faulty equipment – Scheduled maintenance • Even software causes partitions: – Bad network config. – GC pauses – Overloaded servers • Plenty of war stories! – Netflix – Twilio – GitHub – Wix :-) • Some hard numbers¹: – 5.2 failed devices/day – 59K lost packets/day – Adding redundancy only improves by 40% ¹ “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al
  17. “Proving” CAP
  18. In Pictures • Let’s consider a simple system: – Service A writes values – Service B reads values – Values are replicated between nodes • These are “ideal” systems – Bug-free, predictable
  19. In Pictures • “Sunny day scenario”: – A writes a new value V1 – The value is replicated to node 2 – B reads the new value
  20. In Pictures • What happens if the network drops? – A writes a new value V1 – Replication fails – B still sees the old value – The system is inconsistent
  21. In Pictures • Possible mitigation is synchronous replication – A writes a new value V1 – Cannot replicate, so write is rejected – Both A and B still see V0 – The system is logically unavailable
  22. What does it all mean?
  23. The network is not reliable • Distributed systems must handle partitions • Any modern system runs on >1 nodes… • … and is therefore distributed • Ergo, you have to choose: – Consistency over availability – Availability over consistency
  24. Granularity • Real systems comprise many operations – “Add book to cart” – “Pay for the book” • Each has different properties • It’s a spectrum, not a binary choice! (the shopping cart leans toward availability, checkout toward consistency)
  25. CAP IN THE REAL WORLD Kyle “Aphyr” Kingsbury Breaking consistency guarantees since 2013
  26. PostgreSQL • Traditional RDBMS – Transactional – ACID compliant • Primarily a CP system – Writes against a master node • “Not a distributed system” – Except with a client at play!
  27. PostgreSQL • Writes are a simplified 2PC: – Client votes to commit – Server validates transaction – Server stores changes – Server acknowledges commit – Client receives acknowledgement
  28. PostgreSQL • But what if the ack is never received? • The commit is already stored… • … but the client has no indication! • The system is in an inconsistent state
  29. PostgreSQL • Let’s experiment! • 5 clients write to a PostgreSQL instance • We then drop the server from the network • Results: – 1000 writes – 950 acknowledged – 952 survivors
  30. So what can we do? 1. Accept false-negatives – May not be acceptable for your use case! 2. Use idempotent operations 3. Apply unique transaction IDs – Query state after partition is resolved • These strategies apply to any RDBMS
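
     A minimal sketch of strategies 2 and 3 using plain JDBC (not code from the talk): the client generates its own transaction ID, the insert is a no-op if that ID was already stored, and after a timeout the client re-queries by that ID to learn whether the write actually survived. The table, columns and connection details are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath.

     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.PreparedStatement;
     import java.sql.ResultSet;
     import java.sql.SQLException;
     import java.util.UUID;

     public class IdempotentWrite {
         // Hypothetical schema: CREATE TABLE orders (tx_id TEXT PRIMARY KEY, payload TEXT);
         private static final String URL = "jdbc:postgresql://localhost:5432/shop";

         public static void main(String[] args) throws Exception {
             String txId = UUID.randomUUID().toString(); // client-generated, survives retries
             try (Connection c = DriverManager.getConnection(URL, "shop", "secret")) {
                 write(c, txId, "1 x 'Call Me Maybe'");
             } catch (SQLException ackLostOrPartition) {
                 // The ack may have been lost rather than the write itself. Once the
                 // partition heals, reconnect and query by our own ID to find out.
                 try (Connection c = DriverManager.getConnection(URL, "shop", "secret")) {
                     System.out.println("committed = " + exists(c, txId));
                 }
             }
         }

         static void write(Connection c, String txId, String payload) throws SQLException {
             // ON CONFLICT DO NOTHING makes a blind retry of the same txId harmless.
             try (PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO orders (tx_id, payload) VALUES (?, ?) ON CONFLICT DO NOTHING")) {
                 ps.setString(1, txId);
                 ps.setString(2, payload);
                 ps.executeUpdate();
             }
         }

         static boolean exists(Connection c, String txId) throws SQLException {
             try (PreparedStatement ps = c.prepareStatement("SELECT 1 FROM orders WHERE tx_id = ?")) {
                 ps.setString(1, txId);
                 try (ResultSet rs = ps.executeQuery()) {
                     return rs.next();
                 }
             }
         }
     }
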
  31. MongoDB • A document-oriented database • Availability/scale via replica sets – Client writes to a master node – Master replicates writes to n replicas • User-selectable consistency guarantees
  32. MongoDB • When a partition occurs: – If the master is in the minority, it is demoted – The majority promotes a new master… – … selected by the highest optime
  33. MongoDB • The cluster “heals” after partition resolution: – The “old” master rejoins the cluster – Acknowledged minority writes are reverted!
  34. MongoDB • Let’s experiment! • Set up a 5-node MongoDB cluster • 5 clients write to the cluster • We then partition the cluster • … and restore it to see what happens
  35. MongoDB • With write concern unacknowledged: – Server does not ack writes (except TCP) – The default prior to November 2012 • Results: – 6000 writes – 5700 acknowledged – 3319 survivors – 42% data loss!
  36. MongoDB • With write concern acknowledged: – Server acknowledges writes (after store) – The default guarantee • Results: – 6000 writes – 5900 acknowledged – 3692 survivors – 37% data loss!
  37. MongoDB • With write concern replica acknowledged: – Client specifies minimum replicas – Server acks after writes to replicas • Results: – 6000 writes – 5695 acknowledged – 3768 survivors – 33% data loss!
  38. MongoDB • With write concern majority: – For an n-node cluster, requires more than n/2 replicas – Also called “quorum” • Results: – 6000 writes – 5700 acknowledged – 5701 survivors – No data loss
  39. So what can we do? 1. Keep calm and carry on – As Aphyr puts it, “not all applications need consistency” – Have a reliable backup strategy – … and make sure you drill restores! 2. Use write concern majority – And take the performance hit
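
     As an illustration of option 2, a minimal sketch with the current MongoDB Java driver (which postdates this talk): raising the write concern to MAJORITY on the collection handle means an insert is only acknowledged once a majority of the replica set has it. The connection string, database and collection names are placeholders.

     import com.mongodb.WriteConcern;
     import com.mongodb.client.MongoClient;
     import com.mongodb.client.MongoClients;
     import com.mongodb.client.MongoCollection;
     import org.bson.Document;

     public class MajorityWrites {
         public static void main(String[] args) {
             // Placeholder replica-set connection string.
             try (MongoClient client = MongoClients.create(
                     "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")) {
                 MongoCollection<Document> carts = client
                         .getDatabase("shop")
                         .getCollection("carts")
                         // Acknowledge only after a majority of the replica set has the write,
                         // trading latency for not losing acknowledged writes in a partition.
                         .withWriteConcern(WriteConcern.MAJORITY);

                 carts.insertOne(new Document("customer", "tomer")
                         .append("item", "Call Me Maybe"));
             }
         }
     }
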
  40. The prime suspects • Aphyr’s Jepsen tests include: – Redis – Riak – Zookeeper – Kafka – Cassandra – RabbitMQ – etcd (and consul) – ElasticSearch • If you’re considering them, go read his posts • In fact, go read his posts regardless http://aphyr.com/tags/jepsen
  41. STRATEGIES FOR DISTRIBUTED SYSTEMS
  42. Immutable Data • Immutable (adj.): “Unchanging over time or unable to be changed.” • Meaning: – No deletes – No updates – No merge conflicts – Replication is trivial
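
     A small illustrative sketch of this approach, assuming we model a shopping cart as an append-only event log rather than a mutable row: every change is a new immutable fact, nothing is updated in place, and current state is derived by folding over the log. The types are invented for this example (Java 16+ records).

     import java.time.Instant;
     import java.util.ArrayList;
     import java.util.List;

     public class ImmutableCart {
         /** An immutable fact: once written it is never updated or deleted. */
         record CartEvent(String cartId, String sku, int quantityDelta, Instant at) { }

         public static void main(String[] args) {
             // An append-only log; "removing" a book is just another event.
             List<CartEvent> log = new ArrayList<>();
             log.add(new CartEvent("cart-42", "book-1", +1, Instant.now()));
             log.add(new CartEvent("cart-42", "book-1", -1, Instant.now()));

             // Current state is derived by folding over the log, never by mutation.
             int copiesInCart = log.stream()
                     .filter(e -> e.sku().equals("book-1"))
                     .mapToInt(CartEvent::quantityDelta)
                     .sum();
             System.out.println("book-1 in cart: " + copiesInCart);
         }
     }
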
  43. Idempotence • An idempotent operation: – Can be applied one or more times with the same effect • Enables retries • Not always possible – Side-effects are key – Consider: payments
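
     A hedged sketch of one common way to retrofit idempotence onto a side-effecting operation such as a payment: the caller supplies an idempotency key, the service remembers which keys it has already charged, and a retry after a timeout therefore cannot charge twice. The names and the in-memory store are purely illustrative.

     import java.util.Map;
     import java.util.concurrent.ConcurrentHashMap;

     public class PaymentService {
         // Remembers which requests were already processed (a real system would persist this).
         private final Map<String, Long> chargedByKey = new ConcurrentHashMap<>();

         /**
          * Charges at most once per idempotency key; repeating the call with the
          * same key has the same effect as calling it once.
          */
         public long charge(String idempotencyKey, long amountCents) {
             return chargedByKey.computeIfAbsent(idempotencyKey, key -> {
                 // The side effect happens only the first time this key is seen.
                 System.out.println("charging " + amountCents + " cents for " + key);
                 return amountCents;
             });
         }

         public static void main(String[] args) {
             PaymentService payments = new PaymentService();
             payments.charge("order-42", 2599);
             payments.charge("order-42", 2599); // client retry after a timeout: no double charge
         }
     }
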
  44. Eventual Consistency • A design which prefers availability • … but guarantees that clients will eventually see consistent reads • Consider git: – Always available locally – Converges via push/pull – Human conflict resolution
  45. Eventual Consistency • The system expects data to diverge • … and includes mechanisms to regain convergence – Partial ordering to minimize conflicts – A merge function to resolve conflicts
  46. Vector Clocks • A technique for partial ordering • Each node has a logical clock – The clock increases on every write – Track the last observed clocks for each item – Include this vector on replication • When observed and inbound vectors have no common ancestor, we have a conflict • This lets us know when history diverged
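
     A compact sketch of the technique (not from the talk): each replica keeps a map from node ID to a counter, bumps its own entry on every local write, and two such clocks can then be compared to tell whether one write happened before the other or whether history diverged.

     import java.util.HashMap;
     import java.util.Map;

     public class VectorClock {
         private final Map<String, Long> entries = new HashMap<>();

         /** Called on every local write by the node that owns this replica. */
         public void increment(String nodeId) {
             entries.merge(nodeId, 1L, Long::sum);
         }

         /** True if every entry of this clock is <= the other's (this happened before, or equals). */
         public boolean happenedBefore(VectorClock other) {
             return entries.entrySet().stream()
                     .allMatch(e -> e.getValue() <= other.entries.getOrDefault(e.getKey(), 0L));
         }

         /** Neither clock dominates the other: the replicas diverged and we have a conflict. */
         public boolean conflictsWith(VectorClock other) {
             return !this.happenedBefore(other) && !other.happenedBefore(this);
         }

         public static void main(String[] args) {
             VectorClock a = new VectorClock();
             VectorClock b = new VectorClock();
             a.increment("node-1");                  // write on node 1
             b.increment("node-2");                  // concurrent write on node 2
             System.out.println(a.conflictsWith(b)); // true: history diverged
         }
     }
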
  47. CRDTs • Commutative Replicated Data Types¹ • A CRDT is a data structure that: – Eventually converges to a consistent state – Guarantees no conflicts on replication ¹ “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al
  48. CRDTs • CRDTs provide specialized semantics: – G-Counter: Monotonically increasing counter – PN-Counter: Also supports decrements – G-Set: A set that only supports adds – 2P-Set: Supports removals, but only once • OR-Sets are particularly useful – Keep track of both additions and removals – Can be used for shopping carts
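
     To show how simple some of these are, here is an illustrative G-Counter (not tied to any particular library): each node increments only its own slot, replicas merge by taking the per-node maximum, and because that merge is commutative, associative and idempotent, replicas converge no matter how often or in what order they exchange state.

     import java.util.HashMap;
     import java.util.Map;

     public class GCounter {
         // One monotonically increasing slot per node; a node only ever bumps its own slot.
         private final Map<String, Long> counts = new HashMap<>();

         public void increment(String nodeId) {
             counts.merge(nodeId, 1L, Long::sum);
         }

         /** The counter's value is the sum over all nodes' slots. */
         public long value() {
             return counts.values().stream().mapToLong(Long::longValue).sum();
         }

         /** Merge by per-node maximum: applying it twice, or in any order, changes nothing. */
         public void merge(GCounter other) {
             other.counts.forEach((node, n) -> counts.merge(node, n, Math::max));
         }

         public static void main(String[] args) {
             GCounter a = new GCounter(), b = new GCounter();
             a.increment("node-1");
             b.increment("node-2");
             b.increment("node-2");
             a.merge(b);                                      // replicate b's state into a
             b.merge(a);                                      // and a's state into b
             System.out.println(a.value() + " " + b.value()); // both print 3
         }
     }
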
  49. Questions? Complaints?
  50. WE’RE DONE HERE! Thank you for listening tomer@tomergabel.com @tomerg http://il.linkedin.com/in/tomergabel Aphyr’s “Call Me Maybe” blog posts: http://aphyr.com/tags/jepsen
