Scaling out and the CAP Theorem


On Friday, 4 June 1976, the Sex Pistols played a gig that is considered to have changed Western music culture forever and sparked the genesis of punk rock.

Wednesday, 19 July 2000 had a similar impact on internet-scale companies as the Sex Pistols had on music, with Eric Brewer's keynote at the ACM Symposium on Principles of Distributed Computing (PODC). Brewer claimed that as applications become more web-based, we should stop worrying about data consistency: if we want high availability in those new distributed applications, then we cannot have guaranteed data consistency. Two years later, in 2002, Seth Gilbert and Nancy Lynch formally proved Brewer's claim, which is known today as Brewer's theorem, or the CAP theorem.

The CAP theorem states that a distributed system cannot simultaneously provide all three of Consistency, Availability, and Partition tolerance. In the database ecosystem, many tools claim to solve our data-persistence problems while scaling out, offering different capabilities (document stores, key/value stores, SQL, graph, etc.).

In this talk we will explore the CAP theorem:

+ We will define Consistency, Availability and Partition Tolerance
+ We will explore what CAP means for our applications (ACID vs BASE)
+ We will explore practical applications with MySQL with a read slave, MongoDB and Riak, based on the work of Kyle Kingsbury (Aphyr)


Scaling out and the CAP Theorem

1. CAP Theorem - Reversim Summit 2014

2. The CAP theorem, or Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
   • Consistency
     – All nodes see the same data at the same time
   • Availability
     – A guarantee that every request receives a response about whether it was successful or failed
   • Partition Tolerance
     – The system continues to operate despite arbitrary message loss or failure of part of the system

3. It means that for internet-scale companies we should stop worrying about data consistency. If we want high availability in such distributed systems, then guaranteed consistency of data is something we cannot have.

4. An Example
   Consider an online bookstore:
   • You want to buy the book "The tales of the CAP theorem"
     – The store has only one copy in stock
     – You add it to your cart and continue browsing, looking for another book ("ACID vs BASE, a love story?")
   • As you browse the shop, someone else buys "The tales of the CAP theorem"
     – They add the book to their cart and complete the checkout process

5. Consistency

6. Consistency
   A service that is consistent operates fully or not at all. In our bookstore example:
   • There is only one copy in stock and only one person will get it
   • If both customers can continue through the order process (payment), the lack of consistency becomes a business issue
   • Scale this inconsistency and you have a major business issue
   • You can solve this issue by using a database to manage the inventory
     – The first checkout operates fully, the second not at all

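As an illustration of that last point, here is a minimal sketch of my own (not from the slides) using SQLite from Python; the table and book title are illustrative. A conditional UPDATE inside a single database lets exactly one checkout succeed.

```python
import sqlite3

# Illustrative schema: one title, one copy in stock.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (book TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('The tales of the CAP theorem', 1)")
conn.commit()

def checkout(book):
    # Decrement stock only if a copy is left; the database serializes the two
    # checkouts, so the first operates fully and the second not at all.
    cur = conn.execute(
        "UPDATE inventory SET stock = stock - 1 WHERE book = ? AND stock > 0",
        (book,),
    )
    conn.commit()
    return cur.rowcount == 1

print(checkout("The tales of the CAP theorem"))  # True:  first customer gets the copy
print(checkout("The tales of the CAP theorem"))  # False: second checkout is rejected
```
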
7. Consistency
   Note: CAP consistency is the Atomicity in ACID
   • CAP consistency is a constraint that multiple values of the same data are not allowed
   • ACID Atomicity requires that each transaction is "all or nothing"
     – Which implies that multiple values of the same data are not allowed
   • ACID Consistency means that any transaction brings the database from one consistent state to another
     – Global consistency, of the whole database

8. Availability

9. Availability
   Availability means just that: the service is available.
   • When you purchase a book you want to get a response
     – Not some Schrödinger message about the site being uncommunicative
   • Availability most often deserts you when you need it the most
     – Services tend to go down at busy periods
   • A service that's available but cannot be reached is of no benefit to anyone

10. Partition Tolerance

11. Partition Tolerance
   A partition happens when a node in your system cannot communicate with another node:
   • Say, because a network cable gets chopped
   • Partitions are equivalent to a server crash
     – If nothing can connect to it, it may as well not be there
   • If your application and database run on one box, then your server acts as a kind of atomic processor
     – It either works or it doesn't
     – How far can you scale on one host?
   • Once you scale to multiple hosts, you need partition tolerance

12. Partitions
   But wait, are partitions real? Our infrastructure is reliable, right?
   Formally, in any network, messages can be dropped, delayed, duplicated or reordered:
   • IP networks do all four
   • TCP means no dupes and no reordering
     – Unless you retry!
   • Delays are indistinguishable from drops (after a timeout)
     – There is no perfect failure detector in an async network
   [Diagram: messages between nodes A and B being dropped, delayed, duplicated and reordered over time]

13. Partitions are real!
   Some causes:
   • GC pause (which is actually a delay)
   • Network maintenance
   • Segfaults & crashes
   • Faulty NICs
   • Bridge loops
   • VLAN problems
   • Hosted networks
   • The cloud
   • WAN links & backhoes
   Published examples:
   • Netflix
   • Twilio
   • Fog Creek
   • AWS
   • GitHub
   • Wix
   • Microsoft datacenter study
     – Average failure rate of 5.2 devices per day and 40.8 links per day
     – Median packet loss of 59,000 packets
     – Network redundancy improves median traffic by 43%
   More examples in Aphyr's post "The network is reliable".

14. The CAP Theorem proof

15. Proof in Pictures
   • Consider a system with two nodes, N1 and N2
   • They both share the same data V
   • On N1 runs program A, on N2 runs program B
     – We consider both A and B to be ideal: safe, bug free, predictable and reliable
   • In this example, A writes a new value of V and B reads the value of V

16. Proof in Pictures
   Sunny-day scenario:
   1. A writes a new value of V, denoted V1
   2. A message M is passed from N1 to N2, which updates the copy of V there
   3. Any read by B of V will return V1

17. Proof in Pictures
   In the case of a network partition:
   • Messages from N1 to N2 are not delivered
     – Even if we use guaranteed delivery of M, N1 has no way of knowing whether the message is delayed by the partitioning event or by a failure on N2
     – So N2 contains an inconsistent value of V when step 3 occurs
   • We have lost consistency!

18. Proof in Pictures
   In the case of a network partition:
   • We can make M synchronous
     – Which means the write of A on N1 and the update from N1 to N2 form a single atomic operation
     – A write will fail in case of a partition
   • We have lost availability!

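The same two-node argument in code, as a toy sketch of my own (not from the talk): with asynchronous replication a dropped message leaves B reading a stale value, and with synchronous replication the same partition makes the write fail.

```python
# Toy two-node model: N1 runs writer A, N2 runs reader B, both hold a copy of V.

class Node:
    def __init__(self, value="V0"):
        self.value = value

def send(src, dst, partitioned):
    """Deliver the replication message M from src to dst, unless partitioned."""
    if partitioned:
        raise ConnectionError("message M was lost")
    dst.value = src.value

n1, n2 = Node(), Node()

# Asynchronous replication during a partition: the write succeeds (available),
# but B reads a stale value on N2, so consistency is lost.
n1.value = "V1"
try:
    send(n1, n2, partitioned=True)
except ConnectionError:
    pass  # N1 cannot tell a delayed message from a failed N2
print(n2.value)  # "V0": B does not see V1

# Synchronous replication during a partition: the write and the update form
# one atomic operation, so the write is rejected and availability is lost.
def synchronous_write(src, dst, new_value, partitioned):
    if partitioned:
        raise ConnectionError("write rejected: cannot reach the replica")
    src.value = dst.value = new_value

try:
    synchronous_write(n1, n2, "V2", partitioned=True)
except ConnectionError as err:
    print(err)  # the system stays consistent but is unavailable for writes
```
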
19. What does it all mean?

20. In practical terms
   For a distributed system not to require partition tolerance, it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. Such systems do not exist.
   Make your choice:
   • Choose consistency over availability
   • Choose availability over consistency
   • Choose neither

21. CAP Locality
   • The theorem holds per operation, independently
     – A system can be both CP and AP, for different operations
     – Different operations can be modeled with different CAP properties
   • An operation can be
     – CP: consistent and partition tolerant
     – AP: available and partition tolerant
     – P with mixed A & C: trading off between A and C
       • Eventual consistency, for example
   [Diagram: a consistency/availability spectrum, with "add item to cart" on the availability side and "checkout" on the consistency side]

22. Let's look at some examples

23. Using the findings of Kyle Kingsbury

24. Postgres
   • A classic open source database
   • We think of it as a CP system
     – It accepts writes only on a single primary node
     – Ensuring writes reach the slaves as well
   • If a partition occurs
     – We cannot talk to the server and the system is unavailable
     – Because transactions are ACID, we're always consistent
   However:
   • The distributed system composed of the server and the client together may not be consistent
     – They may not agree on whether a transaction took place

25. Postgres
   • Postgres' commit protocol is a two-phase commit (2PC):
     1. The client votes to commit and sends a message to the server
     2. The server checks for consistency and votes to commit (or reject) the transaction
     3. It writes the transaction to storage
     4. The server informs the client that a commit took place
   • What happens if the acknowledgment message is dropped?
     – The client doesn't know whether the commit succeeded or not!
     – The 2PC protocol requires the client to wait for the ack
     – The client will eventually get a timeout (or deadlock)

26. Postgres
   The experiment:
   • Install and run Postgres on one host
   • Run 5 clients that write to Postgres within a transaction
   • During the experiment, drop the network for one of the nodes
   The findings:
   • Out of 1000 write operations
   • 950 were successfully acknowledged, and all of them are in the database
   • 2 writes succeeded, but the client got an exception claiming an error occurred!
     – Note that the client has no way to know whether the write succeeded or failed

27. Postgres
   2PC strategies:
   • Accept false negatives
     – Just ignore the exception on the client; those errors happen only for in-flight writes at the time the partition began
   • Use idempotent operations
     – On a network error, just retry
   • Use a transaction ID
     – When a partition is resolved, the client checks whether the transaction was committed using the transaction ID
   Note: these strategies apply to most SQL engines

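A hedged sketch of the last two strategies combined, assuming psycopg2 and a hypothetical `transfers` table with a unique `op_id` column (both illustrative, not from the talk): a client-generated ID makes the write idempotent, so an ambiguous failure can simply be retried.

```python
import uuid
import psycopg2  # assumes a reachable Postgres; the DSN and table are illustrative

def write_once(dsn, amount):
    """Retry an ambiguous 2PC outcome safely by making the write idempotent."""
    op_id = str(uuid.uuid4())  # generated once and reused on every retry
    for _ in range(5):
        try:
            conn = psycopg2.connect(dsn)
            with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                cur.execute(
                    "INSERT INTO transfers (op_id, amount) VALUES (%s, %s) "
                    "ON CONFLICT (op_id) DO NOTHING",  # already committed? then a no-op
                    (op_id, amount),
                )
            return op_id
        except psycopg2.OperationalError:
            continue  # dropped ack or timeout: outcome unknown, but retrying is safe
    raise RuntimeError("could not confirm the write")
```
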
28. MongoDB
   • MongoDB is a document-oriented database
   • Replicated using a replica set
     – Single writable primary node
     – Asynchronously replicates writes as an oplog to N secondaries
   • MongoDB supports different levels of guarantees
     – Asynchronous replication
     – Confirm a successful write to its disk log
     – Confirm successful replication of a write to secondary nodes
   • Is MongoDB consistent?
     – MongoDB is promoted as a CP system
     – However, it may "revert operations" on a network partition in some cases

29. MongoDB
   What happens when the primary becomes unavailable?
   • The remaining secondaries will detect the failed connection
     – They will try to reach consensus on a new leader
     – If they have a majority, they'll select the node with the highest optime
   • The minority nodes will detect they no longer have a quorum
     – And will demote the primary to a secondary
   • If our primary is on n1 and we cut n1 & n2 off from the rest, we expect n3, n4 or n5 to become the new primary

30. MongoDB
   The experiment:
   • Install and run MongoDB on 5 hosts
   • With 5 clients writing data to the cluster
   • During the experiment, partition the network so that the primary is on the minority side
   • Then restore the network
   • Check what happened
     – Which writes survived?

31. MongoDB
   Write concern: unacknowledged
   • The default at the time Kyle ran the experiment
   The findings:
   • 6000 total writes
   • 5700 acknowledged
   • 3319 survivors
   • 2381 acknowledged writes lost (42% write loss)
   Not surprising: we have data loss.

32. MongoDB
   42% data loss? What happened?
   • When the partition started
     – The original primary (N1) continued to accept writes
     – But those writes never made it to the new primary (N5)
   • When the partition ended
     – The original primary (N1) and the new primary (N5) compared notes
     – They figured out that N5's optime is higher
     – N1 found the last point in the oplog the two agreed on and rolled back to that point
   • During a rollback, all writes the old primary accepted after the common point in the oplog are removed from the database

33. MongoDB
   Write concern: safe or acknowledged
   • The current default
   • Allows clients to catch network, duplicate-key and other errors
   The findings:
   • 6000 total writes
   • 5900 acknowledged
   • 3692 survivors
   • 2208 acknowledged writes lost (37% write loss)
   Write concern acknowledged only verifies that the write was accepted on the primary. We need to ensure replicas also see the write.

34. MongoDB
   Write concern: replicas_safe or replica_acknowledged
   • Waits for at least 2 servers to acknowledge the write operation
   The findings:
   • 6000 total writes
   • 5695 acknowledged
   • 3768 survivors
   • 1927 acknowledged writes lost (33% write loss)
   Mongo only verifies that the write took place against two nodes. A new primary can be elected without having seen those writes. In that case, Mongo will roll back those writes.

35. MongoDB
   Write concern: majority
   • Waits for a majority of servers to acknowledge the write operation
   The findings:
   • 6000 total writes
   • 5700 acknowledged
   • 5701 survivors
   • 2 acknowledged writes lost
   • 3 unacknowledged writes found
   The 2 lost writes are due to a bug in Mongo that caused it to treat network failures as successful writes; the bug was fixed in 2.4.3 (or 2.4.4).
   The 3 unacknowledged writes that were found are not a problem: similar arguments to the Postgres case.

36. MongoDB
   Takeaways for MongoDB. You can either:
   • Accept data loss
     – At most WriteConcern levels, Mongo can get to a point where it rolls back data
   • Use WriteConcern.Majority
     – With a performance impact

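A minimal sketch of that takeaway with pymongo (the connection string, database and collection names are illustrative): request a majority write concern on the collection, and treat a connection error as an indeterminate outcome, just as in the Postgres case.

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import ConnectionFailure

# Illustrative replica-set connection string.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Ask for acknowledgement from a majority of replica-set members.
orders = client.shop.orders.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
)

try:
    orders.insert_one({"book": "The tales of the CAP theorem", "qty": 1})
except ConnectionFailure:
    # Outcome unknown: retry an idempotent version of the write, or surface the error.
    pass
```
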
37. Other distributed systems Kyle tested
   • ZooKeeper, Kafka
   • All have different caveats
   • Worth a read on Aphyr's site

38. Strategies for distributed data & systems

39. Immutable Data
   • Immutable data means
     – No updates
     – No deletes
     – No need for data merges
     – Easier to replicate
   • Immutable data solves the problems that cause distributed systems to delete data (MongoDB, Riak, Cassandra, etc.)
     – However, even if your data is immutable, existing tools assume it is mutable and may still delete your data
   • Can you model all your data to be immutable?
   • How do you model inventory using immutable data?

40. Idempotent Operations
   An operation is idempotent if, whenever it is applied twice, it gives the same result as if it were applied once.
   • It enables recovering from availability problems
     – A way to introduce fault tolerance
     – The Postgres client-server ack issue, for example
   • In case of any failure, just retry
     – Undetermined response
     – Failure to write
   • However, it does not solve the CAP constraints
   • Can you model all your operations to be idempotent?

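A small generic sketch of the "just retry" idea (my own helper, not from the talk): because the operation is idempotent, an undetermined response can always be answered with another attempt.

```python
import time

def retry_idempotent(op, attempts=5, delay=0.5):
    """Retry an idempotent operation: applying it twice gives the same result
    as applying it once, so a timeout or dropped ack can safely be retried."""
    last_error = None
    for _ in range(attempts):
        try:
            return op()
        except Exception as err:  # network error, timeout, unknown outcome
            last_error = err
            time.sleep(delay)
    raise last_error

# Set-style writes ("the user's email is X") retry safely; blind counter
# increments generally do not, because a retry may apply them twice.
```
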
41. BASE
   • Defined by Eric Brewer
   • Basically Available
     – The system guarantees availability, in terms of the CAP theorem
   • Soft state
     – The system state is statistically consistent; it may change over time without external input
   • Eventual consistency
     – The system will converge to be consistent over time
   • Considered the contrast to ACID (Atomicity, Consistency, Isolation, Durability)
     – Not really :)
     – Both are actually contrived

42. Eventual Consistency
   For AP systems, we can make the system regain consistency afterwards
   • We all know such a system: Git
     – Available on each node, fully partition tolerant
     – Regains consistency using git push & pull
     – Human merge of data
   • Can we take those ideas to other distributed systems?
   • How can we track history?
     – Identify conflicts?
   • Can we make the merge automatic?

43. Vector Clocks
   • A way to track the ordering of events in a distributed system
   • Enables detecting conflicting writes
     – And the shared point in history where the divergence started
   • Each write includes a logical clock
     – A clock per node
     – Each time a node writes data, it increments its clock
   • Nodes sync with each other using a gossip protocol
   • Multiple implementations
     – Node based
     – Operation based

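A minimal node-based vector clock sketch (my own illustration of the idea): one logical counter per node, incremented on every local write and merged on sync; two clocks that do not dominate each other mark concurrent, conflicting writes.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one write."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Combine two clocks when nodes sync: element-wise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def concurrent(a, b):
    """Neither clock dominates: the writes conflict and must be merged."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "n1")   # write on n1            -> {"n1": 1}
v2 = increment(v1, "n2")   # later write on n2      -> {"n1": 1, "n2": 1}
v3 = increment(v1, "n3")   # divergent write on n3  -> {"n1": 1, "n3": 1}
print(concurrent(v2, v3))  # True: conflicting values, a merge is needed
print(merge(v2, v3))       # {"n1": 1, "n2": 1, "n3": 1}
```
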
44. Eventual Consistency
   • A system that expects data to diverge
     – For small intervals of time
     – For as long as a partition exists
   • Built to regain consistency
     – Using some sync protocol (gossip)
     – Using vector clocks or timestamps to compare values
   • Needs to handle value merges
     – Minimize merges using vector clocks
       • Merge only if values actually diverge
     – Using a timestamp to select the newer value
     – Using business-specific merge functions
     – Using CRDTs

45. CRDTs
   Commutative Replicated Data Types (also known as Conflict-free Replicated Data Types)
   • Not a lot of data types available to select from
     – G-Counter, PN-Counter, G-Set, 2P-Set, OR-Set, U-Set, graphs
   • OR-Set
     – For social graphs
     – Can be used for a shopping cart (with some modifications)

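As a taste of how small these data types are, here is a sketch of a G-Counter, the simplest CRDT on that list (my own illustration): each replica increments only its own slot, and merge takes the element-wise maximum, so replicas can exchange state in any order and still converge.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per node."""

    def __init__(self, node):
        self.node = node
        self.counts = {}

    def increment(self, n=1):
        # A replica only ever advances its own slot.
        self.counts[self.node] = self.counts.get(self.node, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so merges converge regardless of order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("n1"), GCounter("n2")
a.increment(3)  # 3 events seen by n1
b.increment(2)  # 2 events seen by n2 during a partition
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5: both replicas converge to the same total
```
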
46. Questions? Anyone?