Put Your Thinking CAP On
Tomer Gabel, Wix
JDay Lviv, 2015
Credits
Originally a talk by
Yoav Abrahami (Wix)
Based on “Call Me Maybe” by
Kyle “Aphyr” Kingsbury
Brewer’s CAP Theorem
[Diagram: a triangle with vertices Consistency, Availability and Partition Tolerance]
By Example
• I want this book!
– I add it to the cart
– Then continue
browsing
• There’s only one copy
in stock!
By Example
• I want this book!
– I add it to the cart
– Then continue
browsing
• There’s only one copy
in stock!
• … and someone else
just bought it.
Consistency
Consistency: Defined
• In a consistent
system:
All participants
see the same value
at the same time
• “Do you have this
book in stock?”
Consistency: Defined
• If our book store is an
inconsistent system:
– Two customers may
buy the book
– But there’s only one
item in inventory!
• We’ve just violated a
business constraint.
Availability
Availability: Defined
• An available system:
– Is reachable
– Responds to requests
(within SLA)
• Availability does not
guarantee success!
– The operation may fail
– “This book is no longer
available”
Availability: Defined
• What if the system is
unavailable?
– I complete the
checkout
– And click on “Pay”
– And wait
– And wait some more
– And…
• Did I purchase the
book or not?!
Partition
Tolerance
Partition Tolerance: Defined
• Partition: one or
more nodes are
unreachable
• No practical
system runs on a
single node
• So all systems are
susceptible!
[Diagram: a cluster of five nodes, A through E]
“The Network is Reliable”
• All four happen in an
IP network
• To a client, delays
and drops are the
same
• Perfect failure
detection is provably
impossible1!
[Diagram: messages between nodes A and B over time, showing the four failure modes: drop, delay, duplicate, reorder]
1 “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson
Partition Tolerance: Reified
• External causes:
– Bad network config
– Faulty equipment
– Scheduled
maintenance
• Even software causes
partitions:
– Bad network config.
– GC pauses
– Overloaded servers
• Plenty of war stories!
– Netflix
– Twilio
– GitHub
– Wix :-)
• Some hard numbers1:
– 5.2 failed devices/day
– 59K lost packets/day
– Adding redundancy
only improves by 40%
1 “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al
“Proving” CAP
In Pictures
• Let’s consider a simple
system:
– Service A writes values
– Service B reads values
– Values are replicated
between nodes
• These are “ideal”
systems
– Bug-free, predictable
[Diagram: Node 1 (written by service A) and Node 2 (read by service B), both holding V0]
In Pictures
• “Sunny day scenario”:
– A writes a new value V1
– The value is replicated
to node 2
– B reads the new value
[Diagram: A writes V1 to Node 1, the value is replicated to Node 2, and B reads V1]
In Pictures
• What happens if the
network drops?
– A writes a new value V1
– Replication fails
– B still sees the old value
– The system is
inconsistent
[Diagram: A writes V1 to Node 1, replication to Node 2 fails, and B still reads V0]
In Pictures
• A possible mitigation:
synchronous replication
– A writes a new value V1
– Cannot replicate, so write is
rejected
– Both A and B still see V0
– The system is logically
unavailable
[Diagram: A’s write of V1 is rejected; both Node 1 and Node 2 remain at V0]
What does it all mean?
The network is not reliable
• Distributed systems must handle partitions
• Any modern system runs on more than one node…
• … and is therefore distributed
• Ergo, you have to choose:
– Consistency over availability
– Availability over consistency
Granularity
• Real systems comprise many operations
– “Add book to cart”
– “Pay for the book”
• Each has different properties
• It’s a spectrum, not a binary choice!
[Spectrum from Consistency to Availability: Checkout sits toward the Consistency end, the Shopping Cart toward Availability]
CAP IN THE REAL
WORLD
Kyle “Aphyr” Kingsbury
Breaking consistency
guarantees since 2013
PostgreSQL
• Traditional RDBMS
– Transactional
– ACID compliant
• Primarily a CP system
– Writes against a
master node
• “Not a distributed
system”
– Except with a client at
play!
PostgreSQL
• Writes are a simplified
2PC:
– Client votes to commit
– Server validates
transaction
– Server stores changes
– Server acknowledges
commit
– Client receives
acknowledgement
[Diagram: commit message exchange between Client and Server, with the Store step on the server]
PostgreSQL
• But what if the ack is
never received?
• The commit is already
stored…
• … but the client has
no indication!
• The system is in an
inconsistent state
[Diagram: the Server stores the commit, but the acknowledgement back to the Client is lost]
PostgreSQL
• Let’s experiment!
• 5 clients write to a
PostgreSQL instance
• We then drop the server
from the network
• Results:
– 1000 writes
– 950 acknowledged
– 952 survivors
So what can we do?
1. Accept false-negatives
– May not be acceptable for your use case!
2. Use idempotent operations
3. Apply unique transaction IDs (see the sketch below)
– Query state after the partition is resolved
• These strategies apply to any RDBMS
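
Strategies 2 and 3 combine naturally: the client picks a unique transaction ID up front, and if the acknowledgement is lost it can ask the database, once the partition heals, whether the write actually landed. A minimal sketch in Scala over plain JDBC; the orders table, its columns and the connection details are illustrative assumptions, not part of the talk.

import java.sql.DriverManager
import java.util.UUID

object SafeWrites {
  private val url = "jdbc:postgresql://db.example.com/shop"

  /** Attempt the write under a client-generated, unique transaction ID. */
  def purchase(txnId: UUID, bookId: String): Unit = {
    val conn = DriverManager.getConnection(url, "shop", "secret")
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO orders (txn_id, book_id) VALUES (?, ?)")
      stmt.setString(1, txnId.toString)
      stmt.setString(2, bookId)
      stmt.executeUpdate() // the acknowledgement for this may never arrive
    } finally conn.close()
  }

  /** After the partition is resolved, use the ID to learn what really happened. */
  def didCommit(txnId: UUID): Boolean = {
    val conn = DriverManager.getConnection(url, "shop", "secret")
    try {
      val stmt = conn.prepareStatement("SELECT 1 FROM orders WHERE txn_id = ?")
      stmt.setString(1, txnId.toString)
      stmt.executeQuery().next() // a row means the write survived
    } finally conn.close()
  }
}

If purchase throws or times out, the outcome is unknown; keeping txnId around lets the client call didCommit once connectivity returns, and retry only if the answer is false.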
MongoDB
• A document-oriented database
• Availability/scale via replica sets
– Client writes to a master node
– Master replicates writes to n replicas
• User-selectable consistency guarantees
MongoDB
• When a partition occurs:
– If the master is in the
minority, it is demoted
– The majority promotes a
new master…
– … selected by the highest
optime
MongoDB
• The cluster “heals” after partition resolution:
– The “old” master rejoins the cluster
– Acknowledged minority writes are reverted!
MongoDB
• Let’s experiment!
• Set up a 5-node
MongoDB cluster
• 5 clients write to
the cluster
• We then partition
the cluster
• … and restore it to
see what happens
MongoDB
• With write concern
unacknowledged:
– Server does not ack
writes (except TCP)
– The default prior to
November 2012
• Results:
– 6000 writes
– 5700 acknowledged
– 3319 survivors
– 42% data loss!
MongoDB
• With write concern
acknowledged:
– Server acknowledges
writes (after store)
– The default guarantee
• Results:
– 6000 writes
– 5900 acknowledged
– 3692 survivors
– 37% data loss!
MongoDB
• With write concern
replica acknowledged:
– Client specifies
minimum replicas
– Server acks after
writes to replicas
• Results:
– 6000 writes
– 5695 acknowledged
– 3768 survivors
– 33% data loss!
MongoDB
• With write concern
majority:
– For an n-node cluster, requires acknowledgement from a majority of nodes (more than n/2)
– Also called “quorum”
• Results:
– 6000 writes
– 5700 acknowledged
– 5701 survivors
– No data loss
So what can we do?
1. Keep calm and carry on
– As Aphyr puts it, “not all applications need
consistency”
– Have a reliable backup strategy
– … and make sure you drill restores!
2. Use write concern majority
– And take the performance hit
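
If majority is the right trade-off for your data, it is a small change on the client side. A minimal sketch, assuming the MongoDB Java driver’s synchronous API (3.7+) called from Scala; the replica-set address, database, collection and document fields are illustrative.

import com.mongodb.WriteConcern
import com.mongodb.client.MongoClients
import org.bson.Document

object MajorityWrite {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create(
      "mongodb://node1,node2,node3,node4,node5/?replicaSet=books")

    val orders = client
      .getDatabase("shop")
      .getCollection("orders")
      .withWriteConcern(WriteConcern.MAJORITY) // ack only once a majority holds the write

    // Blocks until more than half of the replica set has acknowledged the insert,
    // trading write latency for data that survives partitions and failovers.
    orders.insertOne(new Document("bookId", "B42").append("customer", "alice"))

    client.close()
  }
}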
The prime suspects
• Aphyr’s Jepsen tests
include:
– Redis
– Riak
– Zookeeper
– Kafka
– Cassandra
– RabbitMQ
– etcd (and consul)
– ElasticSearch
• If you’re
considering them,
go read his posts
• In fact, go read his
posts regardless
http://aphyr.com/tags/jepsen
STRATEGIES FOR
DISTRIBUTED SYSTEMS
Immutable Data
• Immutable (adj.):
“Unchanging over
time or unable to be
changed.”
• Meaning:
– No deletes
– No updates
– No merge conflicts
– Replication is trivial
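
A minimal sketch of what immutability buys in practice: state lives in an append-only log of facts, and the current value is derived by folding over it. The event and field names are illustrative, not from the talk.

// Events are never updated or deleted; even a removal from the cart is
// recorded as a new immutable fact.
sealed trait CartEvent
final case class ItemAdded(cartId: String, bookId: String)   extends CartEvent
final case class ItemRemoved(cartId: String, bookId: String) extends CartEvent

final case class EventLog(events: Vector[CartEvent] = Vector.empty) {
  // "Writing" returns a new log; replicating is just shipping the events
  // the other replica has not seen yet.
  def append(e: CartEvent): EventLog = copy(events = events :+ e)

  // Current state is derived from history.
  def cartContents(cartId: String): Set[String] =
    events.foldLeft(Set.empty[String]) {
      case (books, ItemAdded(`cartId`, book))   => books + book
      case (books, ItemRemoved(`cartId`, book)) => books - book
      case (books, _)                           => books
    }
}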
Idempotence
• An idempotent
operation:
– Can be applied one or
more times with the
same effect
• Enables retries
• Not always possible
– Side-effects are key
– Consider: payments
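
A minimal sketch of retry-safety via a client-chosen unique ID; the service, names and in-memory store are illustrative assumptions (a real payment processor would also need to deduplicate on its side).

import java.util.UUID
import scala.collection.concurrent.TrieMap

final case class Charge(paymentId: UUID, customer: String, amountCents: Long)

class PaymentService {
  private val processed = TrieMap.empty[UUID, Charge]

  // Applying the same charge once or many times has exactly the same effect:
  // only the first call for a given paymentId reaches the card processor.
  def charge(c: Charge): Charge =
    processed.getOrElseUpdate(c.paymentId, {
      // ... call the card processor exactly once for this paymentId ...
      c
    })
}

// A client that never received an acknowledgement can simply retry:
//   service.charge(Charge(id, "alice", 2499)) // second call changes nothing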
Eventual Consistency
• A design which prefers
availability
• … but guarantees that
clients will eventually see
consistent reads
• Consider git:
– Always available locally
– Converges via push/pull
– Human conflict resolution
Eventual Consistency
• The system expects
data to diverge
• … and includes
mechanisms to regain
convergence
– Partial ordering to
minimize conflicts
– A merge function to
resolve conflicts
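
The simplest merge function is last-writer-wins: each replica keeps the value together with the time it was written, and merging two divergent copies deterministically keeps the later one. A minimal sketch; the timestamp source and tie-breaking rule are illustrative assumptions.

final case class LwwRegister[A](value: A, writtenAt: Long, replica: String) {
  def write(newValue: A, now: Long, at: String): LwwRegister[A] =
    LwwRegister(newValue, now, at)

  // Deterministic on every node: the later timestamp wins, the replica ID
  // breaks ties, so all replicas converge after exchanging state.
  def merge(other: LwwRegister[A]): LwwRegister[A] =
    if (writtenAt > other.writtenAt) this
    else if (writtenAt < other.writtenAt) other
    else if (replica.compareTo(other.replica) >= 0) this
    else other
}

Last-writer-wins silently discards one of two concurrent writes, which is exactly why the partial ordering (vector clocks) and richer merge semantics (CRDTs) on the following slides are worth the effort.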
Vector Clocks
• A technique for partial ordering
• Each node has a logical clock
– The clock increases on every write
– Track the last observed clocks for each item
– Include this vector on replication
• When neither the observed nor the inbound vector descends from the other, the writes are concurrent and we have a conflict
• This lets us know when history diverged
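
A minimal sketch of the bookkeeping described above, with plain strings as node identifiers (an illustrative choice):

final case class VectorClock(entries: Map[String, Long] = Map.empty) {
  // A node ticks its own component on every write.
  def tick(node: String): VectorClock =
    VectorClock(entries.updated(node, entries.getOrElse(node, 0L) + 1))

  // This clock descends from the other if it is at least as late on every component.
  def descendsFrom(other: VectorClock): Boolean =
    other.entries.forall { case (node, t) => entries.getOrElse(node, 0L) >= t }

  // Neither side descends from the other: concurrent writes, i.e. a conflict.
  def conflictsWith(other: VectorClock): Boolean =
    !descendsFrom(other) && !other.descendsFrom(this)

  // After resolving a conflict, the merged clock is the pointwise maximum.
  def merge(other: VectorClock): VectorClock =
    VectorClock((entries.keySet ++ other.entries.keySet).map { node =>
      node -> math.max(entries.getOrElse(node, 0L), other.entries.getOrElse(node, 0L))
    }.toMap)
}

// val a = VectorClock().tick("node1")  // Map(node1 -> 1)
// val b = a.tick("node2")              // Map(node1 -> 1, node2 -> 1)
// val c = a.tick("node1")              // Map(node1 -> 2)
// b.conflictsWith(c)                   // true: the histories diverged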
CRDTs
• Commutative Replicated Data Types1
• A CRDT is a data structure that:
– Eventually converges to a consistent state
– Guarantees no conflicts on replication
1 “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al
CRDTs
• CRDTs provide specialized semantics:
– G-Counter: Monotonically increasing counter
– PN-Counter: Also supports decrements
– G-Set: A set that only supports adds
– 2P-Set: Supports removals, but each element can only be removed once (never re-added)
• OR-Sets are particularly useful
– Keep track of both additions and removals
– Can be used for shopping carts
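
A G-Counter’s merge is simply the pointwise maximum of per-node counts, much like the vector-clock merge above. The OR-Set is richer; a minimal sketch, using random UUIDs as the unique tags (an illustrative choice):

import java.util.UUID

// Every add carries a unique tag; a remove only covers the tags it has observed,
// so an add that happened concurrently on another replica always survives.
final case class ORSet[A](adds: Map[A, Set[UUID]] = Map.empty[A, Set[UUID]],
                          removes: Map[A, Set[UUID]] = Map.empty[A, Set[UUID]]) {

  def add(a: A): ORSet[A] =
    copy(adds = adds.updated(a, adds.getOrElse(a, Set.empty[UUID]) + UUID.randomUUID()))

  def remove(a: A): ORSet[A] =
    copy(removes = removes.updated(a, removes.getOrElse(a, Set.empty[UUID]) ++ adds.getOrElse(a, Set.empty[UUID])))

  def contains(a: A): Boolean =
    (adds.getOrElse(a, Set.empty[UUID]) -- removes.getOrElse(a, Set.empty[UUID])).nonEmpty

  // Merging never conflicts: it is just the union of everything both sides know.
  def merge(other: ORSet[A]): ORSet[A] =
    ORSet(union(adds, other.adds), union(removes, other.removes))

  private def union(x: Map[A, Set[UUID]], y: Map[A, Set[UUID]]): Map[A, Set[UUID]] =
    (x.keySet ++ y.keySet).map(k => k -> (x.getOrElse(k, Set.empty[UUID]) ++ y.getOrElse(k, Set.empty[UUID]))).toMap
}

Adding the same book on two partitioned replicas and removing it on one resolves cleanly after merge: the remove cancels only the tag it observed, so the concurrently added copy survives, which is usually the right behaviour for a shopping cart.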
Questions?
Complaints?
WE’RE DONE
HERE!
Thank you for listening
tomer@tomergabel.com
@tomerg
http://il.linkedin.com/in/tomergabel
Aphyr’s “Call Me Maybe” blog posts:
http://aphyr.com/tags/jepsen

Editor's Notes

  • #6 Image source: http://en.wikipedia.org/wiki/File:Seuss-cat-hat.gif
  • #7 Image source: http://en.wikipedia.org/wiki/File:Seuss-cat-hat.gif
  • #9 Photo source: http://pixabay.com/en/meerkat-zoo-animal-sand-desert-363051/
  • #10 Photo source: Unknown
  • #12 Image source: https://www.flickr.com/photos/framesofmind/8541529818/
  • #13 Image source: http://duelingcouches.blogspot.com/2008/12/patiently-waiting.html
  • #14 Image source: http://anapt.deviantart.com/art/together-157107893
  • #42 Image source: https://www.flickr.com/photos/infocux/8450190120/in/set-72157632701634780
  • #43 Image source: http://en.wikipedia.org/wiki/Great_Pyramid_of_Giza#mediaviewer/File:Kheops-Pyramid.jpg
  • #44 Image source: http://2.bp.blogspot.com/--VVPUQ06BaQ/TzmEacERFoI/AAAAAAAAEzE/e2QPIrRWQAg/s1600/washrinse.jpg
  • #46 Photo source: https://www.flickr.com/photos/luschei/1569384007