Tuning Consistency:                                    N: number of nodes to replicate each item to;                      ...
Previous Experiences...
Eventually Consistent Systems                               Banks                               EAI Integrations          ...
Amazon Dynamo Lessons          (according to the paper)                 Data returned to Shopping Cart 24h profiling:      ...
NoSQL Java being used at                Cassandra: at Facebook, being introduced on                Twitter, persistent cac...
Is NoSQL better than SQL?                The NoSQL vs. SQL database debate is really about                ACID vs. BASE da...
Some interesting techniques... (...which we could all be using...)
Wikipedia image                                               Vector Clocks                           On each internal eve...
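As an illustration of the vector clock idea named on this slide, here is a minimal sketch of the textbook rules (tick on a local event, element-wise maximum on receive). The function names and the dictionary representation are assumptions made for this example and are not taken from the slides or from Dynamo.

```python
def tick(clock, node):
    """Local event or send: advance this node's own counter."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local, received, node):
    """Receive: element-wise maximum of both clocks, then tick locally."""
    merged = {n: max(local.get(n, 0), received.get(n, 0))
              for n in set(local) | set(received)}
    return tick(merged, node)


def happened_before(a, b):
    """True if a causally precedes b (a <= b everywhere and a < b somewhere)."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a, b):
    """True if each clock is ahead of the other on some node (a conflict)."""
    nodes = set(a) | set(b)
    return (any(a.get(n, 0) > b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


a1 = tick({}, "A")          # node A writes
b1 = tick({}, "B")          # node B writes without having seen A's write
b2 = merge(b1, a1, "B")     # node B then receives A's clock
```

Here `a1` and `b1` are concurrent, so a store would have to keep both versions for reconciliation, while `a1` happened before `b2`.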
Wikipedia image                                    Merkle Tree / Hash Tree                                Used to verify /...
ACID and FAST (Lowest Latency - read/write - hardest stuff)
Immediately Consistent Systems                                                               Data-grids:                  ...
Tools (Most with source code to pick from)
NoSQL Taxonomy          by Steve Yen [PG]                key‐value‐cache: memcached, repcached, coherence, infinispan, eXtr...
Some related Solutions I find interesting...                Zookeeper                (use it, configuration, elasticity, gro...
Opportunities (...to use these tools)
Some cases we could talk about...                               EAI Integrations                               (Should use...
Q&A
Distributed Programming and Data Consistency w/ Notes - June 2010 update

June 2010 update. Several URLs were updated in the notes.

Presentation with NOTES.

Tuning Data Consistency to obtain efficient Distributed Computing solutions.

It covers the solutions that the academic world and the new NoSQL trend are making available to the IT industry in general.

Published in: Technology

  1. Distributed Programming and Data Consistency, by Paulo Gaspar (@paulogaspar7 on Twitter). This will be placed at: http://www.slideshare.net/paulogaspar7
     Thursday, 24 June 2010
     Notes: Twitter: @paulogaspar7 - http://twitter.com/paulogaspar7 Blog: http://paulogaspar7.blogspot.com/
  2. This presentation is about...
     - Awareness of how most of us do some kind of Distributed Computing these days
     - Tuning Data Consistency for Fun and Profit
     - Tools and sources of knowledge related to Distributed Computing
     - Where to get some source code too
  3. Consistency Perception
  4. What is Consistency?
     Notes: Our perception of consistency is related to what we know about the system and its state. That is how we figure out what might fit...
  5. What isn’t?
     Notes: ...and what does not fit. Obviously a person will have a different degree of precision and tolerance than an automated system.
  6. Consistency across time
     Notes: Consistency also has a time axis, with state sequences that make sense... 1 of 3 => Expected event sequence (3 slide animation which SlideShare won’t handle)
  7. Consistency across time
     Notes: 2 of 3 => Expected event sequence (3 slide animation which SlideShare won’t handle)
  8. Consistency across time
     Notes: 3 of 3 => Expected event sequence (3 slide animation which SlideShare won’t handle)
  9. Inconsistency across time
     Notes: ...and state sequences that do NOT make sense. 1 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  10. Inconsistency across time
     Notes: 2 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  11. Inconsistency across time
     Notes: 3 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  12. Consistency is perception ...and time matters...
     Notes: Again, each (type of) observer will have a different degree of evaluation precision and tolerance to inconsistencies.
  13. Cache Consistency (Low Latency - high read performance)
  14. The Case for LB Caches
     - Memcached at FB: You HAVE TO Replicate to Scale-Out
     Notes: An example of how you still might have to replicate in order to scale, even with a very high performance store. The reason for FB’s issue (might lack some detail): http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html
     “What happens when you add more servers is that the number of requests is not reduced, only the number of keys in each request is reduced. The number keys returned in a request only matters if you are bandwidth limited. The server is still on the hook for processing the same number of requests. Adding more machines doesn’t change the number of request a server has to process and since these servers are already CPU bound they simply can’t handle more load. So adding more servers doesn’t help you handle more requests. Not what we usually expect. This is another example of why architecture matters.”
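The arithmetic behind the quoted argument can be sketched as follows. This is a simplified model, assuming keys are hashed uniformly and every multiget carries enough keys to touch every server; the function name and the numbers are hypothetical.

```python
def per_server_load(page_requests, keys_per_request, servers):
    """Return (requests hitting each server, keys per request on each server).

    Assumes uniformly hashed keys, so a multiget fans out to
    min(servers, keys_per_request) servers.
    """
    touched = min(servers, keys_per_request)
    requests_per_server = page_requests * touched / servers
    keys_per_server_request = keys_per_request / touched
    return requests_per_server, keys_per_server_request


# Sharding over more servers: same request count per server, fewer keys each.
ten_servers = per_server_load(page_requests=1000, keys_per_request=100, servers=10)
twenty_servers = per_server_load(page_requests=1000, keys_per_request=100, servers=20)

# Replicating instead: two pools of 10 servers, each pool takes half the pages.
one_replica_pool = per_server_load(page_requests=500, keys_per_request=100, servers=10)

print(ten_servers)       # (1000.0, 10.0)
print(twenty_servers)    # (1000.0, 5.0)
print(one_replica_pool)  # (500.0, 10.0)
```

Under these assumptions, doubling the shards leaves each CPU-bound server with the same 1000 requests, while replication halves them.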
  15. So, now it “Loadbalances”...
     Notes: ...and with LB, inconsistencies along the time axis can happen (e.g. by reading from alternate out-of-sync backends).
  16. ...but then you can have...
     Notes: With the possibility of state sequences that do NOT make sense. 1 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  17. Inconsistency across time
     Notes: 2 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  18. Inconsistency across time
     Notes: 3 of 3 => UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  19. ...now it can pick >1 versions!
     Notes: Why you can have inconsistencies along the time axis.
  20. Data Caching Consistency
     - Multi-layer and/or Load Balanced caches
     - Changes to cached data => can cause => Inconsistency Across Time
     - Some candidate solutions:
       - All in Memory with update Push (instead of TTL + Pull)
       - Cache Replication/Synchronization
       - The “Schrodinger” Consistency Model...
     Notes: Even on a “live” site you can use a short-lived cache. If the user can NOT observe the exact time of each server state change, are any server-to-client delays (due to caching) really there? Moreover, it is often a choice between small update-until-view delays due to caching and really big ones (or the site being down) due to overload.
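A short-lived “TTL + Pull” cache of the kind discussed in these notes can be sketched as follows. The class name, the 3 second TTL and the injectable clock (used here only to make the example deterministic) are assumptions made for this sketch.

```python
import time


class ShortLivedCache:
    """TTL + pull cache: an entry is reloaded from the backend once it expires."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]  # possibly stale, but by at most ttl_seconds
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value


backend = {"headline": "v1"}
fake_now = [0.0]
cache = ShortLivedCache(backend.get, ttl_seconds=3, clock=lambda: fake_now[0])

first = cache.get("headline")   # miss: loads "v1" from the backend
backend["headline"] = "v2"      # the backend changes
fake_now[0] = 2.0
stale = cache.get("headline")   # still "v1": inside the 3 second window
fake_now[0] = 3.5
fresh = cache.get("headline")   # expired: reloads "v2"
```

The staleness is bounded by the TTL, and in exchange the backend sees at most one load per key per TTL window from this cache instance.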
  21. The “Schrodinger” Consistency Model
     - A Schrodinger’s Cache?
     - Data Inconsistencies only matter if they can be observed
     - ...but the observer might just be another system if its work quality is affected
     - Parallelism x Accuracy of State Evaluation
     - The case for the 3" Cache on a “Live Site”
     Notes: Parallelism x Accurate State Eval.:
     - Cliff Click’s Non Blocking Counter
     - How many loaves of bread exist at a given moment in the stores of a large supermarket network?
     We might have to live without an at-the-moment state evaluation. Accurate evaluation of a past moment’s state might still be possible.
  22. Java Caching Solutions
     - http://java-source.net/open-source/cache-solutions
     - EhCache, OSCache, JBoss Cache, Apache JCS, Terracotta, etc
     - EhCache now at Terracotta
     - Oracle Coherence
     - GigaSpaces XAP Data Grid
     - IBM WebSphere eXtreme Scale
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
  23. Other Interesting Caching Solutions
     - Varnish (HTTP Cache)
     - Redis
     - Memcached
     - There are many, many others...
  24. Slow and Big Consistency (The Higher Latency - BigData)
  25. MapReduce is for embarrassingly parallel problems with some time...
     Notes: Consistency scenarios, starting from the most “sexy” (Web, Petabytes of Data):
     * MapReduce works like vote counting - votes mapped to voting tables, counted, “reduced” to stats;
     * MR is appropriate for "embarrassingly parallel" tasks, like indexing the Internet and other huge processing tasks;
     * We should use it whenever possible;
     * There is a lot to be learned about Map Reduce:
       - Evaluation and expression of candidate problems;
       - Building and managing its infrastructure;
       - etc.
     * Even MR has coordination needs;
     * Even MR should have SLAs (Service Level Agreements).
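The vote counting analogy from these notes can be sketched in a few lines. This is a single-process toy, assuming made-up ballots and hypothetical function names; a real MapReduce framework would run the map step on separate machines.

```python
from collections import Counter
from functools import reduce


def map_table(ballots):
    """Map step: one voting table counts its own ballots independently."""
    return Counter(ballots)


def reduce_counts(left, right):
    """Reduce step: merge two partial tallies into one."""
    return left + right


tables = [
    ["alice", "bob", "alice"],
    ["bob", "bob"],
    ["alice", "carol"],
]

partials = [map_table(t) for t in tables]            # independent, parallelizable
totals = reduce(reduce_counts, partials, Counter())  # combined statistics
```

No table needs to know about any other table while counting, which is what makes the problem embarrassingly parallel; only the final merge needs coordination.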
  26. Coordination
     - Consensus needed for Map Reduce
     - Consensus is the process of agreeing on one result among a group of participants
     - Consensus is not as easy as it seems
     - Byzantine Generals Problem + 2 Generals Problem
     - All solutions are probabilistic
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     http://en.wikipedia.org/wiki/Two_Generals_Problem
     http://en.wikipedia.org/wiki/Byzantine_fault_tolerance#Origin
     The thought experiment involves considering how they might go about coming to consensus. In its simplest form one general (referred to as the "first general" below) is known to be the leader, decides on the time of attack, and must communicate this time to the other general. The requirement that causes the "problem" is that both generals must attack at the agreed upon time to succeed. Having a solitary general attack is considered a disastrous failure. The problem is to come up with algorithms that the generals can use, including sending messages and processing received messages, that can allow them to correctly conclude: Yes, we will both attack at the agreed upon time.
     Note that it is quite simple for the generals to come to an agreement on the time to attack. One successful message with a successful acknowledgement suffices for that. The subtlety of the Two Generals Problem is in the impossibility of designing algorithms for the generals to use to safely agree to the above statement.
     Illustrating the problem: The first general may start by sending a message "Let us attack at 9 o’clock in the morning." However, once dispatched, the first general has no idea whether or not the messenger got through. Any amount of uncertainty may lead the first general to hesitate to attack, since if the second general does not also attack at that time, the city’s garrison will repel the advance, leading to the destruction of that attacking general’s forces.
     Knowing this, the second general may send a confirmation back to the first: "I received your message and will attack at 9 o’clock." However, what if the confirmation messenger were captured? The second general, knowing that the first will hesitate without the confirmation, may himself hesitate. A solution might seem to be to have the first general send a second confirmation: "I received your confirmation of the planned attack." However, what if that messenger were captured? It quickly becomes evident that no matter how many rounds of confirmation are made there is no way to guarantee the second requirement that both generals agree the message was delivered.
  27. MapReduce + Consensus (Google + Hadoop Implementations)
     - Google, coordination by Chubby using Paxos. Used only at Google;
     - Google BigTable is a Wide Column Store which works on top of GoogleFS. Used only at Google;
     - Hadoop, used at Amazon, Facebook, Rackspace, Twitter, Yahoo!, etc.;
     - Hadoop ZooKeeper implements a Paxos variation and is used at Rackspace, Yahoo!, etc.;
     - Hadoop HBase is a Wide Column Store, on top of HDFS and now uses ZooKeeper. Used at Yahoo! etc.
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     Parallel between Google’s internally developed systems and their Hadoop counterparts: http://hadoop.apache.org/ http://labs.google.com/papers/
     The very interesting “coordinators”: http://labs.google.com/papers/chubby.html http://hadoop.apache.org/zookeeper/
     ZooKeeper sure looks like a very interesting and reusable piece of software.
     Curiosity: HBase is faster since using ZooKeeper... is it also because of ZooKeeper??? http://hadoop.apache.org/hbase/
  28. Apache Hadoop Projects
     - HBase (Distributed DB)
     - HDFS (Distributed File System)
     - MapReduce
     - Pig (Query Language++)
     - ZooKeeper (Coordination)
     - others (Common, Hive, Chukwa)
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     http://hadoop.apache.org/
  29. Apache Zookeeper
     - Distributed Coordination for Distributed Applications
     - Design Goals: Simple, Replicated, Ordered, Fast (very resilient too)
     - Other Properties: Thousands of clients, better for 10:1 reads:writes
     Notes: http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
  30. Apache Zookeeper API
     - Simple API:
       - Filesystem-like node tree
       - Conditional Updates
       - Watches (notifications)
       - Ephemeral and Sequence Nodes
     - Out of the box + recipes for: Name Service, Configuration, Group Membership, Barriers, Queues, Locks, 2P Commit, Leader Election
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
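The “conditional update” idea can be illustrated with a toy in-memory store. This is deliberately not the ZooKeeper client API: the class, its method names and the paths are hypothetical, and the sketch only shows the version check that makes a write conditional.

```python
class VersionedStore:
    """Toy in-memory node tree with per-node versions (not a ZooKeeper client)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"node exists: {path}")
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]  # (data, version)

    def set_if_version(self, path, data, expected_version):
        """Write only if nobody else has written since the caller's read."""
        _, current = self._nodes[path]
        if current != expected_version:
            return False
        self._nodes[path] = (data, current + 1)
        return True


store = VersionedStore()
store.create("/config/limit", 10)

_, seen_by_a = store.get("/config/limit")
_, seen_by_b = store.get("/config/limit")

a_ok = store.set_if_version("/config/limit", 20, seen_by_a)  # accepted
b_ok = store.set_if_version("/config/limit", 30, seen_by_b)  # stale version, rejected
```

Client B has to re-read and retry, which is how optimistic coordination avoids holding a lock between the read and the write.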
  31. Consistency w/ Interaction (Low Latency - read/write - harder stuff)
  32. Two “High”/Sexy reasons for Distributing Data Storage (not just cache)
     - High Performance Data Access (Read / Write)
     - High Availability (HA)
  33. Why care about HA?
     - 1.7% HDDs fail in the 1st year, 8.6% in the 3rd (Google)
     - Unrecoverable RAM errors/year: 1.3% machines, 0.22% DIMM (Google)
     - Router, Rack, PDU, misc. network failures
     - Over 4 nines only through redundancy, best hardware never good enough (James Hamilton - MS and Amazon)
     - Hey! This might affect smaller fish like us!!!
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     Sources:
     For Google’s numbers check the slideware at: http://videolectures.net/wsdm09_dean_cblirs/
     For the James Hamilton quote: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf
     Another much quoted paper with Google’s DRAM failure stats and patterns: http://research.google.com/pubs/pub35162.html
     You can find other HA and Systems related papers from Google and James Hamilton at: http://mvdirona.com/jrh/work/ http://research.google.com/pubs/DistributedSystemsandParallelComputing.html
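The “over 4 nines only through redundancy” point can be illustrated with simple arithmetic. The 99% single-machine availability below is a made-up figure, and the formula assumes replicas fail independently, which shared racks, PDUs and networks only approximate.

```python
def combined_availability(single, replicas):
    """Availability of a group that is up while at least one replica is up.

    Assumes independent failures between replicas.
    """
    return 1 - (1 - single) ** replicas


def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, f"{a:.6f}", round(downtime_minutes_per_year(a), 1))
```

Under these assumptions one machine at 99% is down for days per year, two replicas reach four nines, and a third goes beyond that. Correlated failures make real numbers worse than this model suggests.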
  34. Why care about Latency?
     - Google: Half a second delay caused a 20% drop in traffic (30 results instead of 10, via Marissa Mayer);
     - Amazon found every 100ms of latency costs 1% sales (via Greg Linden);
     - A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 ms behind the competition (via NYT).
     - Hey! This affects anyone online!
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     You can find all these references through this page (if you follow the links): http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
     Including these: http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  35. Fallacies of Distributed Computing (What can go wrong?)
     1. The network is reliable;
     2. Latency is zero;
     3. Bandwidth is infinite;
     4. The network is secure;
     5. Topology doesn’t change;
     6. There is one administrator;
     7. Transport cost is zero;
     8. The network is homogeneous.
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     Just to remember this classic on the HA challenges. A few more details at: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
  36. Other Distributed Data Contexts (the less sexy daily stuff?)
     - EAI / B2B / Systems Integration
     - Geographic Distribution (e.g.: Health System + Hospitals)
     - Systems with n-tier / SOA Architectures
     - Elasticity on Peaks (still sexy...)
     Notes: The daily jobs of so many IT professionals are much more related to this type of common distributed system than to the sexier kind we talked about before. But these fields too would benefit from learning the lessons and using the technologies we are talking about.
  37. The CAP Theorem and Eventual Consistency
  38. CAP Theorem History
     - 1999: 1st mention in the “Harvest, Yield and Scalable Tolerant Systems” paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)
     - 2000-07-19: Brewer’s CAP Conjecture is part of Brewer’s keynote to the PODC Conference
     - 2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and Nancy Lynch (MIT)
     - 2007-10-02: “Amazon’s Dynamo” post by Werner Vogels (Amazon’s CTO) quoting a paper (by him + others)
     - 2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)
     - 2010-04-23: With the PACELC model, Daniel Abadi recalls and explains on his blog the obvious importance of latency in BASE vs. ACID and other tuning decisions over designs which revolve around CAP.
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual Consistency” chapter: http://books.couchdb.org/relax/intro/eventual-consistency
     Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in truly industrial sites, even with stats describing real life behavior: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
     ...and the now famous Eventually Consistent post by Werner Vogels: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
     If you dislike the introductory (justifiable) drama, just jump to the next part because this article, by Julian Browne, is the best I found about Brewer’s CAP Theorem and its history: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
     You should still take a look at:
     * The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf
     * The 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
     * The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
     * You might also see with your own eyes how CAP became a proved Theorem: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
     * The PACELC model was described by Daniel Abadi on his blog at: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
     Definition of ACID: see the notes of slide 40.
  39. The CAP Theorem
     - strong Consistency, high Availability, Partition-resilience: pick at most 2
     Notes: I simply had to put The Diagram, of course. According to http://books.couchdb.org/relax/intro/eventual-consistency ...examples of what goes in the intersections:
     CP => Classic RDBMS => Enforced Consistency
     CA => Paxos / Consensus => Consensus Protocols for HA Consistency
     AP => CouchDB + Eventually Consistent DBs => Eventual Consistency
  40. Eventual Consistency for Availability
     BASE (Basically Available, Soft-state, Eventual consistency) | ACID (Atomicity, Consistency, Isolation, Durability)
     Weak Consistency (stale data ok)                             | Strong consistency (NO stale data)
     Availability first                                           | Isolation
     Best effort                                                  | Focus on “commit”
     Approximate answers OK                                       | Availability?
     Aggressive (optimistic)                                      | Conservative (pessimistic)
     Faster / Lower Latency                                       | Safer
     Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     You can find a variation of this slide in Brewer’s 2000 PODC keynote at: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
     I skipped these rather controversial bits:
     ACID: * Nested transactions; * Difficult evolution (e.g. schema)
     BASE: * Simpler! * Easier evolution
     I already tried both ways (data stores with and without schema) and I would rather have some schema mechanism for the most complex stuff.
     ACID:
     A)tomicity: Either all of the tasks of a transaction are performed or none of them are.
     C)onsistency: A database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful or not).
     I)solation: Other operations cannot access or see the data in an intermediate state during a transaction.
     D)urability: Once the user has been notified of success, the transaction will persist. This means it will survive system failure, and that the database system has checked the integrity constraints and won’t need to abort the transaction.
  41. CAP Trade-offs
     - CA without P: Databases providing distributed transactions can only do it while their network is ok;
     - CP without A: While there is a partition, transactions to an ACID database may be blocked until the partition heals (to avoid merge conflicts -> inconsistency);
     - AP without C: Caching provides client-server partition resilience by replicating data, even if the partition prevents verifying if a replica is fresh.
     - In general, any distributed DB problem can be solved with either:
       - expiration-based caching to get AP;
       - or replicas and majority voting to get PC (minority is unavailable).
     Notes: * VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     Concept introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
     I should probably skip this slide during a live presentation. This is stuff you have to read about.
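The “replicas and majority voting” option can be sketched with a toy replicated register. Everything here is an assumption made for illustration: the class, the way a partition is modelled as a list of reachable replica indexes, and the single-writer-at-a-time simplification.

```python
class MajorityRegister:
    """Toy replicated register using majority quorums (the 'PC' option)."""

    def __init__(self, replica_count):
        self.replicas = [(0, None)] * replica_count  # (version, value) per replica

    def _require_majority(self, reachable):
        if len(reachable) <= len(self.replicas) // 2:
            raise RuntimeError("minority side of the partition: unavailable")

    def write(self, value, reachable):
        self._require_majority(reachable)
        version = max(self.replicas[i][0] for i in reachable) + 1
        for i in reachable:
            self.replicas[i] = (version, value)

    def read(self, reachable):
        self._require_majority(reachable)
        # Any two majorities overlap, so the newest version is always visible.
        newest = max((self.replicas[i] for i in reachable), key=lambda r: r[0])
        return newest[1]


reg = MajorityRegister(3)
reg.write("v1", reachable=[0, 1])      # replica 2 is cut off during the write
value = reg.read(reachable=[1, 2])     # overlaps with the write on replica 1

try:
    reg.read(reachable=[2])            # only a minority is reachable
    minority_blocked = False
except RuntimeError:
    minority_blocked = True
```

The read stays consistent because its majority shares replica 1 with the write, and the price is that the minority side refuses to answer at all.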
  42. Living with CAP
     - All systems are probabilistic, whether they realize it or not
     - And so are Distributed Transactions (2 Generals Problem)
     - Life is Eventually Consistent
     - Weak CAP Principle: The stronger the guarantees made about any two of C, A and P, the weaker the guarantees that can be made about the third
     - Systems should degrade gracefully, instead of all or nothing (e.g.: displaying data from available partitions)
     Notes: * #1 => Explain why Life is Eventually Consistent
     Steve Yen clearly illustrates the “Life is Eventually Consistent” idea in the slideware (slides 40 to 45) he used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009: http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf
     The Weak CAP Principle was introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
     To understand how hard (ACID) Distributed Transactions are, you have an excellent history of the concepts related to this problem here: http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html
     The difficulties of (ACID) Distributed Transactions are well illustrated by the classic Two Generals’ Problem: http://en.wikipedia.org/wiki/Two_Generals_Problem
     Leslie Lamport et al further explore the problem (and its solutions) in the classic “The Byzantine Generals Problem” paper: http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf
     And if you think that Two Phase Commit is a 100% reliable mechanism... think again: http://www.cs.cornell.edu/courses/cs614/2004sp/papers/Ske81.pdf
     This is just to illustrate the difficulty of the problem. There are more reliable mechanisms, like Three Phase Commit: http://en.wikipedia.org/wiki/Three-phase_commit_protocol http://ei.cs.vt.edu/~cs5204/fall99/distributedDBMS/sreenu/3pc.html
     ...or the so called Paxos Commit: http://research.microsoft.com/pubs/64636/tr-2003-96.pdf
  43. CAP Theorem History
     Notes: * VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
     * Repeated slide, repeated notes (to pass focus from CAP to Dynamo and Eventual Consistency): the slide and its notes are the same as for slide 38.
  44. 44. Amazon’s Dynamo DB
      Also a “Wide Column Store”

      Problem                            | Technique
      Partitioning                       | Consistent Hashing
      High Availability for writes       | Vector clocks with reconciliation during reads
      Handling temporary failures        | Sloppy Quorum and hinted handoff (NRW)
      Recovering from permanent failures | Anti-entropy using Merkle trees
      Membership and failure detection   | Gossip-based membership protocol and failure detection

      (on the original slide, bold marks some techniques which could improve many “enterprise” / “every-day” solutions)

      * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
      The source here is the already mentioned Dynamo paper:
      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      Strict distributed DBs, rather than dealing with the uncertainty of the correctness of an answer, make data unavailable until it is absolutely certain that it is correct.
      At Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution - the average or median is not good enough to provide a good experience for all. The choice of 99.9% over an even higher percentile was made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much.
      Experiences with Amazon’s production systems have shown that this approach provides a better overall experience compared to systems that meet SLAs defined on the mean or median.
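The “Consistent Hashing” technique listed on this slide can be sketched in a few lines. This is only an illustration of the general idea (a hash ring with virtual nodes, walked clockwise to pick replicas), not Dynamo's actual code: the class name `HashRing`, the `vnodes` parameter, and the use of MD5 to spread keys are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a ring position (MD5 is used only to spread keys, not for security)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node is placed on the ring at several positions.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        # Only the positions of the removed node disappear; others stay put.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def preference_list(self, key, n=3):
        """Return the first n distinct nodes found walking clockwise from the key."""
        if not self._points:
            return []
        start = bisect.bisect(self._points, _hash(key))
        result = []
        for i in range(len(self._points)):
            node = self._owner[self._points[(start + i) % len(self._points)]]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result
```

The point of the design is that removing (or adding) a node only moves the keys that node owned; every other key keeps its owner, which a plain `hash(key) % node_count` scheme would not give you.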
  45. 45. Tuning Consistency:
      N: number of nodes to replicate each item to;
      W: number of nodes required for write success;
      R: number of nodes required for read success.
      W < N = remaining nodes will receive the write later.
      R < N = remaining nodes are ignored.

      Also based on the already mentioned Dynamo paper:
      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      ...but you can find a similar diagram and similar mechanisms described for several (NoSQL) databases that partially clone Dynamo.
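A small sketch can make the N/W/R tuning more concrete. The Dynamo paper notes that choosing R + W > N yields a quorum-like system, because every read set then shares at least one node with every write set. The code below checks that rule by brute force and uses a toy replica set; `Replicas`, `acked_by` and `asked` are hypothetical names for this example, and the explicit version numbers stand in for real versioning such as vector clocks.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read set of size r must intersect every write set of size w."""
    return r + w > n


def brute_force_overlap(n: int, r: int, w: int) -> bool:
    """Check the same property by enumerating all read/write node sets."""
    nodes = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(nodes, r)
               for write_set in combinations(nodes, w))


class Replicas:
    """Toy in-memory replica set; each replica holds a (version, value) pair."""

    def __init__(self, n: int):
        self.data = [(0, None)] * n

    def write(self, value, version, acked_by):
        # Only the W nodes that acknowledged hold the new version for now;
        # the remaining nodes would receive it later.
        for i in acked_by:
            self.data[i] = (version, value)

    def read(self, asked):
        # Ask R nodes and keep the answer with the highest version.
        return max((self.data[i] for i in asked), key=lambda pair: pair[0])


replicas = Replicas(3)                           # N = 3
replicas.write("cart-v1", 1, acked_by=[0, 1])    # W = 2
fresh = replicas.read(asked=[1, 2])              # R = 2, overlaps the write set
stale = replicas.read(asked=[2])                 # R = 1, may miss the write
```

With N = 3, W = 2, R = 2 the read always touches a node that saw the write; with R = 1 it can land on the one node that has not received it yet, which is the latency vs. consistency trade-off being tuned.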
  46. 46. Previous Experiences...
  47. 47. Eventually Consistent Systems
      Banks
      EAI Integrations
      Many messaging based (SOA) systems
      Google
      Amazon
      Etc.

      Unlike what many examples say, Banks often use Eventual Consistency on many (limited value/risk) transactions - or use “large” periodic fixed windows for transactions / compensation to process large numbers of larger value movements. So much for those ACID transaction examples...
  48. 48. Amazon Dynamo Lessons (according to the paper)
      Data returned to Shopping Cart, 24h profiling: 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions and 0.00009% of requests saw 4 versions.
      In two years applications have received successful responses (without timing out) for 99.9995% of their requests and no data loss event has occurred to date;
      With coordination via Gossip protocol it is harder to scale further than a few hundred nodes. (Could be better w/ Chubby / ZK like coordinators?)

      Also based on the already mentioned Dynamo paper:
      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      Wikipedia has an article on Gossip Protocols (although, at the date I write this, not as precise as other Wikipedia articles I just quoted):
      http://en.wikipedia.org/wiki/Gossip_protocol
      The solution I mention as a possibly more scalable alternative to Gossip Protocols for consensus is the use of Paxos (or derivatives) Coordinators, like Google’s proprietary Chubby or the open source Apache Hadoop Zookeeper.
      When I first wrote and used these slides (at my SAPO Codebits 2009 talk), the only support I had for my (then intuitive) belief that these more directed approaches should be more efficient than Gossip Protocols was section 6.6 of the Dynamo paper - the paper even mentions the possibility of “introducing hierarchical extensions to Dynamo”.
      Thanks to my SAPO Codebits talk I met Henrique Moniz, then a Ph.D. student at the University of Lisbon. After I discussed this issue (consensus scalability) with him he pointed me to a couple of interesting papers, one of which immediately captured my attention:
      * Gossip-based broadcast protocols by João Leitão
      http://www.gsd.inesc-id.pt/~jleitao/pdf/masterthesis-leitao.pdf
      This paper offers a more complete description of the overhead of gossip protocols and, to my surprise, also pointed out a few reliability weak spots in known Gossip Protocols. The paper goes on to present a more robust and efficient Gossip Protocol called “HyParView”, which uses a more “directed” approach.
      HyParView sure looks like an interesting solution in terms of robustness for environments with a high incidence of system/network failures, but I still believe that using coordinators will be more efficient in a well controlled data center.
      Not that using coordinators and making them scale out BIG is exactly trivial, as you can read here:
      - In the “Vertical Paxos and Primary-Backup Replication” paper, by Leslie Lamport et al., that Henrique Moniz pointed me to:
      http://research.microsoft.com/pubs/80907/podc09v6.pdf
      - Or in this interesting article from Cloudera’s blog about the (now upcoming) Observers feature of Apache
  49. 49. NoSQL Java being used at
      Cassandra: at Facebook, being introduced on Twitter, persistent cache at reddit, replacing MySQL at digg, etc.
      Voldemort: at LinkedIn, Gilt Groupe e-commerce site (check Geir Magnusson’s QCon presentation);
      HBase: Yahoo!, Twitter, Adobe, Ning, Stumbleupon, Meetup, etc. Often used for high volume analytics but also for other high volume stores and M-R tasks.

      * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
      http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
  50. 50. Is NoSQL better than SQL?
      The NoSQL vs. SQL database debate is really about ACID vs. BASE databases
      An indicator of the advantage of having a query language is given by Hadoop Pig use at Twitter (via Kevin Weil). The Pig version is:
      5% of the code
      5% of the time
      Within 50% of the execution time
      Anyone who has used c-tree-like DB APIs can say the same

      http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/
      http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
  51. 51. Some interesting techniques... (...which we could all be using...)
  52. 52. Vector Clocks (Wikipedia image)
      On each internal event a process increments its logical clock;
      Before sending a message, it increments its own clock in the vector and sends the vector with the message;
      On receiving a message, it increments its own clock and updates each element of its own vector to max(own, msg).

      * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
      Also based on the already mentioned Dynamo paper:
      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      ...and on the Wikipedia article about this algorithm:
      http://en.wikipedia.org/wiki/Vector_clock
      Vector Clocks (and other similar algorithms) have a predecessor in Lamport timestamps:
      http://en.wikipedia.org/wiki/Lamport_timestamps
      Introduced in the classic paper “Time, Clocks, and the Ordering of Events in a Distributed System” by Leslie Lamport:
      http://en.wikipedia.org/wiki/Lamport_timestamps
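The three rules on this slide translate almost directly into code. The sketch below is a minimal illustration, not the implementation used by Dynamo, Voldemort or Cassandra: the class `VectorClock`, its method names and the `compare` helper are assumptions made for the example, and clocks are plain dictionaries from node id to counter.

```python
class VectorClock:
    """Hypothetical minimal vector clock for one process (node_id)."""

    def __init__(self, node_id, clock=None):
        self.node_id = node_id
        self.clock = dict(clock or {})  # node id -> logical counter

    def tick(self):
        """Rule 1: on each internal event, increment our own entry."""
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1

    def send(self):
        """Rule 2: increment our own entry, then ship a copy with the message."""
        self.tick()
        return dict(self.clock)

    def receive(self, msg_clock):
        """Rule 3: take the element-wise max(own, msg), then increment our own entry."""
        for node, count in msg_clock.items():
            self.clock[node] = max(self.clock.get(node, 0), count)
        self.tick()


def compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent' for two clock dicts."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


node_a, node_b, node_c = VectorClock("a"), VectorClock("b"), VectorClock("c")
message = node_a.send()   # a's clock travels with the message
node_b.receive(message)   # b merges it and counts the receive event
node_c.tick()             # c works on its own, unaware of a and b
```

Here `compare(message, node_b.clock)` reports that the send happened before the receive, while `compare(node_c.clock, node_b.clock)` reports the two as concurrent; that second case is exactly the one where a system like Dynamo has to reconcile versions during reads.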
  53. 53. Merkle Tree / Hash Tree (Wikipedia image)
      Used to verify / compare a set of data blocks and efficiently find where the mismatches are.

      * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
      Also based on the already mentioned Dynamo paper:
      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      ...and on the Wikipedia article about this algorithm:
      http://en.wikipedia.org/wiki/Hash_tree
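To show how a hash tree finds mismatches without comparing every block, here is a small sketch. It is an illustration under simplifying assumptions, not the anti-entropy code of any real store: the function names `build_tree` and `diff` are made up, SHA-256 is an arbitrary choice of hash, an odd node is paired with itself, and both replicas are assumed to hold the same number of blocks.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: leaf hashes first, [root] last."""
    level = [_h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # odd count: pair the last node with itself
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(levels_a, levels_b):
    """Indices of the blocks whose hashes differ (sketch: equal block counts only)."""
    if len(levels_a[0]) != len(levels_b[0]):
        raise ValueError("this sketch only compares trees with the same block count")
    if not levels_a[0]:
        return []
    mismatched = []
    stack = [(len(levels_a) - 1, 0)]  # start at the root
    while stack:
        depth, idx = stack.pop()
        if levels_a[depth][idx] == levels_b[depth][idx]:
            continue  # identical hash: the whole subtree matches, skip it
        if depth == 0:
            mismatched.append(idx)
        else:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(levels_a[depth - 1]):
                    stack.append((depth - 1, child))
    return sorted(mismatched)


replica_a = [b"alpha", b"bravo", b"charlie", b"delta", b"echo"]
replica_b = [b"alpha", b"bravo", b"charlie", b"DELTA", b"echo"]
out_of_sync = diff(build_tree(replica_a), build_tree(replica_b))
```

When the two roots match, a single hash comparison proves the replicas agree; when they differ, only the branches with differing hashes are followed, so the work is proportional to the number of mismatched blocks times the tree depth rather than to the total data size.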
  54. 54. ACID and FAST (Lowest Latency - read/write - hardest stuff)
  55. 55. Immediately Consistent Systems
      Data-grids: Oracle Coherence, Gigaspaces
      Used for: Trading, Online Gambling
      All Data in RAM
      Can do ACID
      Very High Speed
      Max. Scale-out

      * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
      Trading and Online Gambling really need to do large volumes of fast ACID transactions and are the big customers of Data Grids.
      Why Online Gambling needs ACID transactions has everything to do with the type of game and the type of rules/assets (some virtual) it involves.
      Why Trading really needs ACID is a bit more obvious: you might be able to compensate an overdraft at a bank (more so for limited values) but you really cannot sell shares you do not have for sale.
      The performance needs are obvious for both too. For Trading there are even some new reasons, like (again):
      http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  56. 56. Tools (Most with source code to pick from)
  57. 57. NoSQL Taxonomy by Steve Yen [PG]
      key-value-cache: memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracota [???]
      key-value-store: keyspace [w/Paxos], flare, schema-free, RAMCloud [, Mnesia (Erlang), Chordless]
      eventually-consistent key-value-store: dynamo, Voldemort, Dynomite, SubRecord, MotionDb, Dovetaildb
      ordered-key-value-store: tokyo tyrant [, BerkeleyDB, JDBM], lightcloud, NMDB, luxio, memcachedb, actord
      data-structures server: redis
      tuple-store: gigaspaces [?], coord, apache river
      object database: ZopeDB, db4o, Shoal
      document store: CouchDB [evC, MVCC], MongoDB [evC], Jackrabbit, XML Databases, ThruDB, CloudKit, Persevere, Riak Basho [evC], Scalaris [Erlang, w/Paxos]
      wide columnar store: BigTable, Hadoop HBase [w/ Zookeeper], [Amazon Dynamo-evC, ] Cassandra [evC], Hypertable, KAI, OpenNeptune, Qbase, KDI
      [graph database: Neo4J, Sones, etc.]

      From the slideware (slide 54) Steve Yen used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009:
      http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf
      I do not completely understand or agree with Steve’s criteria but it sure is a possible starting point for building a database/storage taxonomy.
      The stuff in square brackets is mine. “evC” means Eventually Consistent and “?” just means I have doubts about / don’t understand some specific classification.
  58. 58. Some related Solutions I find interesting...
      Zookeeper (use it: configuration, elasticity, group membership, leader election, notification...)
      JDBM, BerkeleyDB (careful w/ the OS license) (just use them for very fast persistent storage)
      Voldemort and Cassandra (use them or pick code for Vector Clocks, Merkle Trees, data compression, communications and other code - nice code bases)
      Redis (not Java, but usable from it and a kind of Swiss Army knife)
      The Riak Basho Bitcask store idea. Used something similar (but not generic) in Java: http://downloads.basho.com/papers/bitcask-intro.pdf
      EhCache (the pre-Terracotta version shows how simplistic some stuff can be and still work)
      Just use RMI and native Java serialization (as EhCache does)
      JBoss Netty (if you want to do seriously fast network communication)
      Varnish (an HTTP cache which knows how to use Virtual Memory)
  59. 59. Opportunities (...to use these tools)
  60. 60. Some cases we could talk about...
      EAI Integrations (Should use Vector Clocks?)
      Zookeeper at the “Farm” (Config./Coord.)
      Live soccer game site
      Web sites in general
      Log like / timeline systems (forums, healthcare, Twitter, etc.)
      Analytics
      Logistics Planning across EU case
      Trading

      This is the placeholder slide to exercise the ideas and discuss possible applications of some of the mechanisms which were presented in this talk (had no time at Codebits... still tuning this not-so-easy presentation).
      Except for the last two scenarios (and the Twitter alternative in the “Log like” one), all others represent quite common types of problems which you can meet without having to work for a Fortune Top 50 company or for a mega web portal / service. Even an “Analytics” case with enough data to justify using MapReduce is common enough.
      Many large (but not necessarily huge) companies often give up on doing more with the data they have just because of the trouble of finding a way to do it (“more”).
      * “Analytics” (high data volume + easy on consistency as it is) currently seems to be the playground of MapReduce, with Hadoop stuff being used “everywhere”. Look at how many times you can find the words “analytics” or “analysis” (and “MapReduce”) on these “Powered by” Hadoop web pages:
      http://wiki.apache.org/hadoop/PoweredBy
      http://wiki.apache.org/hadoop/Hbase/PoweredBy
      * “Live soccer game...” is a nice problem to discuss short-lived caching and its consistency issues;
      * “Log like / timeline systems...” are systems where information is mostly “insert only” and most of the effort to keep consistency is related to keeping proper ordering information (with timestamps usually being enough), properly merging the data from different sources and respecting the explicit or implicit SLAs on data synchronizations. Obviously, there are different difficulties across the several cases mentioned here, depending on data flow, necessary performance, etc.;
      * “EAI Integrations” often need better knowledge about ordering and are not as simple as the previous scenario. Due to factors like the use of asynchronous and event-driven mechanisms and the possibility of having updates for a given document across multiple steps of a (multiple) process(es), a timestamp is often too limited as ordering information... but is often the most you get. IMO this is a good scenario for using Vector Clocks and company;
      * “Zookeeper” is a great system even if “just” to configure the simplest web (or webservice) farm, to coordinate the simplest cross-farm operations (e.g.: cache related) or just for each server to know which are its peers;
      * “Logistics Planning” is a complex scenario which demands a mix of solutions. It revolves around a logistics company which transports goods across Europe, with planning offices in different countries. I will probably have to remove it from this slide for any future talk I might give on this topic even if it is the most interesting of them all. So, it does not make much sense to develop it here (maybe a blog post since, to me, this is a >10 year old
  61. 61. Q&A
