Lorenzo Alberton                            @lorenzoalberton NoSQL Databases:Why, what and when      NoSQL Databases Demys...
NoSQL: WhyScalability, Concurrency, New trends                                       2
New Trends 2002   2004   2006   2008   2010   2012           Big data                                           3
New Trends 2002   2004   2006   2008   2010   2012           Big data   Concurrency                                       ...
New Trends 2002   2004   2006   2008   2010   2012           Big data                        Connectivity   Concurrency   ...
New Trends 2002   2004   2006   2008   2010   2012           Big data                        Connectivity   Concurrency   ...
New Trends 2002   2004   2006   2008   2010   2012           Big data                        Connectivity   P2P Knowledge ...
New Trends 2002   2004   2006   2008   2010   2012           Big data                        Connectivity   P2P Knowledge ...
What’s the problem with RDBMS’s?  http://www.codefutures.com/database-sharding                                            ...
What’s the problem with RDBMS’s?                                                 Caching                                  ...
What’s the problem with RDBMS’s?                                   5
What’s the problem with RDBMS’s?http://www.flickr.com/photos/dimi3/3096166092   5
Quick Comparison            Overview from 10,000 feet  (random impressions from the interwebs)                            ...
Quick Comparison            Overview from 10,000 feet  (random impressions from the interwebs)           http://www.flickr....
MongoDB is web-scale        ...but /dev/null          is even better!                            7
Cassandra is teh schnitz        ..Love,v/null          .but /de          is even better                           8
CouchDB: Relax!         .buve/a LO n ?             t de      ulfree space       ..Love,v/T of!l !?      You harenaieth?r R...
No, seriously...*(*) Not another “Mine is bigger” comparison, please                                                      10
A little theory      Fundamental Principles  of (Distributed) Databases     http://www.timbarcz.com/blog/PassionInProgramm...
ACIDATOMICITY: All or nothingCONSISTENCY: Any transaction will take the db from oneconsistent state to another, with no br...
Isolation Levels, Locking & MVCCIsolation                   nounProperty that defines how/when the changes made by one oper...
Isolation Levels, Locking & MVCCIsolation                    nounProperty that defines how/when the changes made by one ope...
Isolation Levels, Locking & MVCCIsolation                    nounProperty that defines how/when the changes made by one ope...
Isolation Levels, Locking & MVCCIsolation                    nounProperty that defines how/when the changes made by one ope...
Isolation Levels, Locking & MVCCIsolation                    nounProperty that defines how/when the changes made by one ope...
Isolation Levels, Locking & MVCC                                            Non-repeatableIsolation Level       Dirty Read...
Isolation Levels, Locking & MVCCIsolation Level   Range Lock   Read Lock   Write LockSerializableRepeatable               ...
Multi-Version Concurrency Control                                            Root                                    Index...
Multi-Version Concurrency Control       obsolete                             Root       new version                       ...
Multi-Version Concurrency Control       obsolete                             Root                                 atomic p...
Multi-Version Concurrency Control       obsolete                             Root                                 atomic p...
Multi-Version Concurrency Control       obsolete                             Root                                 atomic p...
Distributed Transactions - 2PC       Coordinator                          1) COMMIT                             REQUEST   ...
Distributed Transactions - 2PC       Coordinator                                        1) COMMIT                         ...
Distributed Transactions - 2PC              Coordinator                                                1) COMMIT          ...
Distributed Transactions - 2PC       Coordinator                              1) COMMIT                                 RE...
Distributed Transactions - 2PC       Coordinator                          2) COMMIT                             PHASE     ...
Distributed Transactions - 2PC       Coordinator                               2) COMMIT                                  ...
Distributed Transactions - 2PC        Coordinator                             2) COMMIT                                PHA...
Distributed Transactions - 2PC       Coordinator                                    2) COMMIT                             ...
Distributed Transactions - 2PC       Coordinator                                             2) COMMIT                    ...
Distributed Transactions - 2PC       Coordinator                          2) COMMIT                             PHASE     ...
Distributed Transactions - 2PC       Coordinator                                 2) COMMIT                                ...
Distributed Transactions - 2PC        Coordinator                          2) COMMIT                             PHASE    ...
Distributed Transactions - 2PC       Coordinator                                    2) COMMIT                             ...
Distributed Transactions - 2PC       Coordinator                                         2) COMMIT                        ...
Problems with 2PC                   Blocking ProtocolRisk of indefinite cohort      Conservative behaviourblocks if coordin...
Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs:  Number of proce...
[PSE image alert]                    22
ACID & Distributed Systems    http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb   23
ACID & Distributed Systems         ACID properties are always desirable                           But what about:         ...
CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM S...
CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM S...
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to an...
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to an...
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to an...
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to an...
Consistency: Client-side viewA service that is consistent operates fully or not at all.  Strong consistency (as in ACID)  ...
Consistency: Client-side viewA service that is consistent operates fully or not at all.  Strong consistency (as in ACID)  ...
Consistency: Server-side (Quorum)N = number of nodes with a replica of the data                                           ...
Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repairhttp://s3.amazonaws.com/Al...
Modulo-based Hashing    N1      N2         N3   N4                                 29
Modulo-based Hashing    N1      N2         N3   N4                 ?                                 29
Modulo-based Hashing    N1           N2        N3          N4         partition = key % n_servers                         ...
Modulo-based Hashing    N1           N2        N3        N4         partition = key % n_servers - 1)                      ...
Modulo-based Hashing      N1                N2              N3             N4           partition = key % n_servers - 1)  ...
Consistent Hashing                2160    0                            A     F                                       B    ...
Consistent Hashing                2160    0                            A     F                                       B    ...
Consistent Hashing                2160    0                            A                                                  ...
Consistent Hashing                2160    0                            A     F                                       B    ...
Consistent Hashing                2160    0                            A     F                                       B    ...
Consistent Hashing                2160    0                                                                  only the keys...
Consistent Hashing - Replication                       A     F                                   B                Ring E  ...
Consistent Hashing - Replication                                               Key hosted                                 ...
Consistent Hashing - Node Changes             A     F               B E         D                 C                       ...
Consistent Hashing - Node Changes                                                 Key membership                        A ...
Consistent Hashing - Load Distribution                    2160   0                                               Different...
Consistent Hashing - Load Distribution         2160   0                     Different Strategies                          ...
Vector Clocks & Conflict DetectionA       B      C                            write handled by A                           ...
Vector Clocks & Conflict DetectionA       B      C                            write handled by A                           ...
Vector Clocks & Conflict Detection A       B      C                            write handled by A                          ...
Vector Clocks & Conflict Detection A       B      C                            write handled by A                          ...
Vector Clocks & Conflict Detection A       B      C                            write handled by A                          ...
Vector Clocks & Conflict DetectionA       B      C                            write handled by A                           ...
Vector Clocks & Conflict DetectionA       B      C                            write handled by A                           ...
Vector Clocks & Conflict Detection A       B      C                            write handled by A                          ...
Vector Clocks & Conflict Detection A       B      C                            write handled by A                          ...
Gossip Protocol + Hinted Handoff             A                         periodic, pairwise,     F               B     inter...
Gossip Protocol + Hinted Handoff                                  A              I can’t see B, it might be               ...
Gossip Protocol + Hinted Handoff                     My canonical node is                     supposed to be B.           ...
Merkle Trees (Hash Trees)                                                                             Leaves: hashes of   ...
Read Repair              A     F                B                          GET(k, R=2) E         D                  C     ...
Read Repair              A     F                B                          GET(k, R=2) E         D                  C     ...
Read Repair                      k=XYZ (v.2)              A                            k=XYZ (v.2)     F                  ...
Read Repair              A     F                B                                      k=XYZ (v.2) E                      ...
NoSQL Break-down      Key-value stores, Column Families, Document-oriented dbs, Graph databases                           ...
Focus Of Different Data Models              Key-Value               StoresSize                                     Column ...
1) Key-value stores                 Amazon Dynamo Paper Data model: collection of key-value pairs                         ...
Voldemort                                AP                                     LICENSEDynamo DHT implementationConsistent...
Voldemort                                                 AP                                                      LICENSED...
Voldemort                                                        AP                                                       ...
Voldemort                                                             AP                                                  ...
Voldemort                                                        AP                                                       ...
Voldemort                                AP                                     LICENSEDynamo DHT implementationConsistent...
Membase                                                                               CP                                  ...
Membase                                                                                CP                                 ...
Membase                                                                                 CP                                ...
Riak                                                  AP                                                  LICENSE         ...
Redis                                                                     CP                                              ...
2) Column Families                 Google BigTable paper   Data model: big table, column families                         ...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestam...
Google BigTable: Data StructureSSTableSmallest building blockPersistent immutable Map[k,v]Operations: lookup by key / key ...
Google BigTable: Data StructureSSTableTabletSmallest building block range of rowsDynamically partitionedPersistent immutab...
Google BigTable: Data StructureSSTableTableTabletSmallest Tablets (table segments) make up a tableMultiple building blockD...
Google BigTable: I/O           memtable              readmemoryGFS           tablet log                        SSTable SST...
Google BigTable: I/O           memtable                  read                          minor                        compac...
Google BigTable: I/O           memtable                  read                          minor                        compac...
Google BigTable: I/O           memtable                    read                            minor                          ...
Google BigTable: Location Dereferencing                                           Metadata Tablets   User Tables          ...
Google BigTable: Architecture                                                               fs metadata, ACL,             ...
HBase                                  CP                                   LICENSEOSS implementation of BigTable         ...
HBase                                  CP                                   LICENSEOSS implementation of BigTable         ...
HBase                                                     CP                                                      LICENSEO...
HBase                                  CP                                   LICENSEOSS implementation of BigTable         ...
HBase                                            CP                                             LICENSEOSS implementation ...
HBase                                                     CP                                                      LICENSEO...
HBase                                  CP                                   LICENSEOSS implementation of BigTable         ...
Hypertable                              CP                                    LICENSEOSS BigTable implementationFaster tha...
Hypertable                              CP                                    LICENSEOSS BigTable implementationFaster tha...
Hypertable                                            CP                                                  LICENSEOSS BigTa...
Hypertable                              CP                                    LICENSEOSS BigTable implementationFaster tha...
Cassandra                                                                                                AP               ...
Cassandra                                                                                                      AP         ...
Cassandra                                                                                                      AP         ...
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
Upcoming SlideShare
Loading in...5
×

NoSQL Databases: Why, what and when

117,685

Published on

NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).

Published in: Design, Technology
51 Comments
659 Likes
Statistics
Notes
No Downloads
Views
Total Views
117,685
On Slideshare
0
From Embeds
0
Number of Embeds
84
Actions
Shares
0
Downloads
0
Comments
51
Likes
659
Embeds 0
No embeds

No notes for slide
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • NoSQL Databases: Why, what and when

    1. 1. Lorenzo Alberton @lorenzoalberton NoSQL Databases:Why, what and when NoSQL Databases Demystified PHP UK Conference, 25th February 2011 1
    2. 2. NoSQL: WhyScalability, Concurrency, New trends 2
    3. 3. New Trends 2002 2004 2006 2008 2010 2012 Big data 3
    4. 4. New Trends 2002 2004 2006 2008 2010 2012 Big data Concurrency 3
    5. 5. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency 3
    6. 6. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency Diversity 3
    7. 7. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity 3
    8. 8. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity Cloud-Grid 3
    9. 9. What’s the problem with RDBMS’s? http://www.codefutures.com/database-sharding 4
    10. 10. What’s the problem with RDBMS’s? Caching Master/Slave Master/Master Cluster Table Partitioning Federated Tables Sharding http://www.codefutures.com/database-sharding Distributed DBs 4
    11. 11. What’s the problem with RDBMS’s? 5
    12. 12. What’s the problem with RDBMS’s?http://www.flickr.com/photos/dimi3/3096166092 5
    13. 13. Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) 6
    14. 14. Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) http://www.flickr.com/photos/42433826@N00/4914337851 6
    15. 15. MongoDB is web-scale ...but /dev/null is even better! 7
    16. 16. Cassandra is teh schnitz ..Love,v/null .but /de is even better 8
    17. 17. CouchDB: Relax! .buve/a LO n ? t de ulfree space ..Love,v/T of!l !? You harenaieth?r Right? l, r x te? isyevay b g t an we 9
    18. 18. No, seriously...*(*) Not another “Mine is bigger” comparison, please 10
    19. 19. A little theory Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
    20. 20. ACIDATOMICITY: All or nothingCONSISTENCY: Any transaction will take the db from oneconsistent state to another, with no broken constraints(referential integrity)ISOLATION: Other operations cannot access data that hasbeen modified during a transaction that has not yet completedDURABILITY: Ability to recover the committed transactionupdates against any kind of system failure (transaction log) 12
    21. 21. Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations 13
    22. 22. Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE All transactions occur in a completely isolated fashion, as if they were executed serially 13
    23. 23. Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result 13
    24. 24. Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result READ COMMITTED A lock is acquired only on the rows currently read/updated 13
    25. 25. Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result READ COMMITTED READ UNCOMMITTED A lock is acquired only on the A transaction can access rows currently read/updated uncommitted changes made by other transactions 13
    26. 26. Isolation Levels, Locking & MVCC Non-repeatableIsolation Level Dirty Reads Phantoms readsSerializable - - -Repeatable - -ReadRead -CommittedReadUncommitted http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/ 14
    27. 27. Isolation Levels, Locking & MVCCIsolation Level Range Lock Read Lock Write LockSerializableRepeatable -ReadRead - -CommittedRead - - -Uncommitted 15
    28. 28. Multi-Version Concurrency Control Root Index Index Index IndexIndex Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data 16
    29. 29. Multi-Version Concurrency Control obsolete Root new version Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
    30. 30. Multi-Version Concurrency Control obsolete Root atomic pointer update new version Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
    31. 31. Multi-Version Concurrency Control obsolete Root atomic pointer update new version marked for compaction Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
    32. 32. Multi-Version Concurrency Control obsolete Root atomic pointer update new version marked for compaction Index Index Reads: never blocked Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
    33. 33. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 17
    34. 34. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Query to commit Participants 17
    35. 35. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 1) Exec Transaction up to the COMMIT request 2) Write entry to undo and redo logs 17
    36. 36. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Agree or Abort Participants 17
    37. 37. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 18
    38. 38. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Commit a) SUCCESS (agreement from all) Participants 18
    39. 39. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 1) Complete operation 2) Release locks 18
    40. 40. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge a) SUCCESS (agreement from all) Participants 18
    41. 41. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Complete transaction (completion phase) a) SUCCESS (agreement from all) Participants 18
    42. 42. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 19
    43. 43. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Rollback b) FAILURE (abort from any) Participants 19
    44. 44. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 1) Undo operation 2) Release locks 19
    45. 45. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge b) FAILURE (abort from any) Participants 19
    46. 46. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Undo transaction (completion phase) b) FAILURE (abort from any) Participants 19
    47. 47. Problems with 2PC Blocking ProtocolRisk of indefinite cohort Conservative behaviourblocks if coordinator fails biased to the abort case 20
    48. 48. Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs: Number of processors Number of message delays Activity level of participants Number of messages sent Types of failures http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
    49. 49. [PSE image alert] 22
    50. 50. ACID & Distributed Systems http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
    51. 51. ACID & Distributed Systems ACID properties are always desirable But what about: Latency Partition Tolerance High Availability ? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
    52. 52. CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
    53. 53. CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
    54. 54. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
    55. 55. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
    56. 56. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum AP: requests can complete at any live node, possibly violating strong consistencyhttp://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
    57. 57. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum HIGH LATENCY AP: requests can complete at any ≈ live node, possiblyPARTITION NETWORK violating strong consistency http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.htmlhttp://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
    58. 58. Consistency: Client-side viewA service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
    59. 59. Consistency: Client-side viewA service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window Eventual* consistency (e.g. DNS) Causal consistency Read-your-writes consistency (the least surprise) Session consistency (*) Temporary inconsistencies (e.g. in data constraints or Monotonic read consistency replica versions) are accepted, but they’re resolved Monotonic write consistency at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
    60. 60. Consistency: Server-side (Quorum)N = number of nodes with a replica of the data (*)W = number of replicas that must acknowledge the updateR = minimum number of replicas that must participate in a successful read operation (*) but the data will be written to N nodes no matter whatW+R>N Strong consistency (usually N=3, W=R=2)W = N, R =1 Optimised for readsW = 1, R = N Optimised for writes (durability not guaranteed in presence of failures)W + R <= N Weak consistency 27
    61. 61. Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repairhttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
    62. 62. Modulo-based Hashing N1 N2 N3 N4 29
    63. 63. Modulo-based Hashing N1 N2 N3 N4 ? 29
    64. 64. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers 29
    65. 65. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers - 1) (n_servers 29
    66. 66. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers - 1) (n_servers Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node) 29
    67. 67. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    68. 68. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    69. 69. Consistent Hashing 2160 0 A canonical home (coordinator node) for key range A-B F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    70. 70. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    71. 71. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    72. 72. Consistent Hashing 2160 0 only the keys in this range change location A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
    73. 73. Consistent Hashing - Replication A F B Ring E (key space) D C http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
    74. 74. Consistent Hashing - Replication Key hosted AB A in B, C, D F B Data replicated in Ring the N-1 clockwise E (key space) successor nodes D C Node hosting Key , Key , Key FA AB BC http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
    75. 75. Consistent Hashing - Node Changes A F B E D C 32
    76. 76. Consistent Hashing - Node Changes Key membership A and replicas are updated when a F B node joins or leaves Copy Key the network. Range AB The number of E Copy Key replicas for all data Range FA is kept consistent. D C Copy Key Range EF 32
    77. 77. Consistent Hashing - Load Distribution 2160 0 Different Strategies A I Virtual Nodes H B Random tokens per each Ring physical node, partition by C G (key space) token value D Node 1: tokens A, E, G F Node 2: tokens C, F, H E Node 3: tokens B, D, I 33
    78. 78. Consistent Hashing - Load Distribution 2160 0 Different Strategies Virtual Nodes Q equal-sized partitions, Ring S nodes, Q/S tokens per (key space) node (with Q >> S) Node 1 Node 2 Node 3 Node 4 ... 34
    79. 79. Vector Clocks & Conflict DetectionA B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    80. 80. Vector Clocks & Conflict DetectionA B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    81. 81. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    82. 82. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    83. 83. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in D5 ([A, 3], [B, 1], [C,1]) ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    84. 84. Vector Clocks & Conflict DetectionA B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
    85. 85. Vector Clocks & Conflict DetectionA B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
    86. 86. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or withD3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
    87. 87. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or withD3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). version mismatch D3 ⊇ D4, conflict detected resolved automatically Vector clocks can grow D5 ([A, 3], [B, 1]) quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
    88. 88. Gossip Protocol + Hinted Handoff A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D C 37
    89. 89. Gossip Protocol + Hinted Handoff A I can’t see B, it might be periodic, pairwise, F down but I need some B inter-process ACK. My Merkle Tree root for range XY is interactions of “ab031dab4a385afda” bounded size E among randomly- I can’t see B either. My Merkle Tree root for chosen peers range XY is different! D C B must be down then. Let’s disable it. 37
    90. 90. Gossip Protocol + Hinted Handoff My canonical node is supposed to be B. A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D I see. Well, I’ll take care of it for now, and let B know C when B is available again 37
    91. 91. Merkle Trees (Hash Trees) Leaves: hashes of ROOT hash(A, B) data blocks. Nodes: hashes of their children. A B hash(C, D) hash(E, F) Used to detect inconsistencies C D E F between replicas hash(001) hash(002) hash(003) hash(004) (anti-entropy) and to minimise the Data Data Data Data Block Block Block Block amount of 001 002 003 004 transferred data http://en.wikipedia.org/wiki/Hash_tree 38
    92. 92. Read Repair A F B GET(k, R=2) E D C 39
    93. 93. Read Repair A F B GET(k, R=2) E D C 39
    94. 94. Read Repair k=XYZ (v.2) A k=XYZ (v.2) F B GET(k, R=2) E D C k=ABC (v.1) 39
    95. 95. Read Repair A F B k=XYZ (v.2) E UPDATE(k = XYZ) D C 39
    96. 96. NoSQL Break-down Key-value stores, Column Families, Document-oriented dbs, Graph databases 40
    97. 97. Focus Of Different Data Models Key-Value StoresSize Column Families Document Databases Graph Databases Complexity http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 41
    98. 98. 1) Key-value stores Amazon Dynamo Paper Data model: collection of key-value pairs 42
    99. 99. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
    100. 100. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java HTTP / Sockets API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
    101. 101. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL Conflicts resolved at read HTTP Java and write time Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
    102. 102. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Json, Java String, byte[], Avro Protobuf Thrift, Avro, ProtoBuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
    103. 103. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC Simple optimistic locking for multi-row updates, pluggable storage engine 43
    104. 104. Voldemort AP LICENSEDynamo DHT implementationConsistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
    105. 105. Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
    106. 106. Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookup http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
    107. 107. Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookupAll metadata kept in memory (high throughput / low latency).Manual/Programmatic failover via the Management REST API. http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
    108. 108. Riak AP LICENSE Apache 2 LANGUAGE C, Erlang API/PROTOCOL REST HTTP * ProtoBuf Buckets → K-V “Links” (~relations) Targeted JS Map/Reduce Tune-able consistency (one-quorum-all) 45
    109. 109. Redis CP LICENSEK-V store “Data Structures Server” BSDMap, Set, Sorted Set, Linked List LANGUAGESet/Queue operations, Counters, Pub-Sub, Volatile keys ANSI C API * + PROTOCOL Telnet- like PERSISTENCE 10-100K op/s (whole dataset in RAM + VM) in memory bg snapshots Persistence via snapshotting (tunable fsync freq.) REPLICATION master-slave Distributed if client supports consistent hashing http://redis.io/presentation/Redis_Cluster.pdf 46
    110. 110. 2) Column Families Google BigTable paper Data model: big table, column families 47
    111. 111. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column columnrow_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
    112. 112. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column columnrow_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
    113. 113. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column columnrow_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
    114. 114. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column columnrow_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
    115. 115. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column columnrow_key row column family http://labs.google.com/papers/bigtable-osdi06.pdf 48
    116. 116. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column columnrow_key row column family ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
    117. 117. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column columnrow_key row column familyAtomic updates ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
    118. 118. Google BigTable PaperSparse, distributed, persistent multi-dimensional sorted mapindexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column columnrow_key row column familyAtomic updates Automatic GC ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
    119. 119. Google BigTable: Data StructureSSTableSmallest building blockPersistent immutable Map[k,v]Operations: lookup by key / key range scan SSTable 64KB 64KB 64KB block block block lookup index 49
    120. 120. Google BigTable: Data StructureSSTableTabletSmallest building block range of rowsDynamically partitionedPersistent immutable Map[k,v]Built from multiple SSTablesOperations: lookup and loadkey range scanUnit of distribution by key / balancingTablet (range Aaa → Bar) SSTable SSTable 64KB 64KB 64KB 64KB 64KB 64KB block block block lookup block block block lookup index index 49
    121. 121. Google BigTable: Data StructureSSTableTableTabletSmallest Tablets (table segments) make up a tableMultiple building blockDynamically partitioned range of rowsPersistent immutable Map[k,v]Built from multiple SSTablesOperations: lookup and loadkey range scanUnit of distribution by key / balancingTableTablet (range Aaa → Bar) SSTable SSTable 64KB 64KB 64KB 64KB 64KB 64KB block block block lookup block block block lookup index index 49
    122. 122. Google BigTable: I/O memtable readmemoryGFS tablet log SSTable SSTable SSTable write 50
    123. 123. Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write 50
    124. 124. Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write BMDiff Zippy 50
    125. 125. Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write BMDiff Zippy merging / major compaction (GC) 50
    126. 126. Google BigTable: Location Dereferencing Metadata Tablets User Tables ... ... Root Tablet Master File ... Chubby ... ... Replicated, persisted Root of thelock service; maintains metadata treetablet server locations5 replicas, one elected ... master (via quorum) Up to 3 levels ...Paxos algorithm used in the metadata to keep consistency hierarchy 51
    127. 127. Google BigTable: Architecture fs metadata, ACL, GC, load balancing BigTable metadata operations BigTable client master data R/W heartbeat operations messages, GC, chunk migration Tablet Tablet Tablet Chubby Server Server Server track master lock, log of live servers Tablet Tablet Tablet 52
    128. 128. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
    129. 129. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java ZooKeeper as API/PROTOCOL coordinator REST HTTP Thrift(instead of Chubby) PERSISTENCE memtable/ SSTable 53
    130. 130. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Support for Java multiple masters API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
    131. 131. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTableHDFS, S3, S3N, EBS (with GZip/LZO CF compression) 53
    132. 132. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ Data sorted by key SSTable but evenly distributed across the cluster 53
    133. 133. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable Batch Streaming, Map/Reduce 53
    134. 134. HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
    135. 135. Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift PERSISTENCE memtable/ SSTable CONCURRENCY MVCC HQL (~SQL) 54
    136. 136. Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Hyperspace Thrift (paxos) used PERSISTENCE instead of memtable/ SSTable ZooKeeper CONCURRENCY MVCC HQL (~SQL) 54
    137. 137. Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift Dynamically PERSISTENCE adapts to memtable/ changes in SSTable workload CONCURRENCY MVCC HQL (~SQL) 54
    138. 138. Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift PERSISTENCE memtable/ SSTable CONCURRENCY MVCC HQL (~SQL) 54
    139. 139. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE Java PROTOCOL B col_name Thrift Avro col_value PERSISTENCE timestamp memtable/ SSTable Column CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
    140. 140. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE super_column_name Java PROTOCOL B col_name col_name Thrift ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
    141. 141. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Column Family Apache 2 LANGUAGE Java PROTOCOL B col_name col_name Thriftrow_key ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55

    ×