NoSQL Databases: Why, what and when

NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them: in which situations they work better than a relational database, and how to choose one product over another. This talk gives an overview of the NoSQL landscape and a classification of the different architectural categories, clarifying the basic concepts and the terminology, and provides a comparison of the features, strengths and drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).


    1. Lorenzo Alberton @lorenzoalberton NoSQL Databases: Why, what and when. NoSQL Databases Demystified. PHP UK Conference, 25th February 2011 1
    2. NoSQL: Why. Scalability, Concurrency, New trends 2
    3-8. New Trends 2002 2004 2006 2008 2010 2012: Big data, Concurrency, Connectivity, Diversity, P2P, Knowledge, Cloud-Grid 3
    9-10. What’s the problem with RDBMS’s? Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs. http://www.codefutures.com/database-sharding 4
    11-12. What’s the problem with RDBMS’s? http://www.flickr.com/photos/dimi3/3096166092 5
    13-14. Quick Comparison: overview from 10,000 feet (random impressions from the interwebs). http://www.flickr.com/photos/42433826@N00/4914337851 6
    15. MongoDB is web-scale ...but /dev/null is even better! 7
    16. Cassandra is teh schnitz ...but /dev/null is even better! Love, /dev/null 8
    17. CouchDB: Relax! ...but /dev/null is way better! You have a LOT of free space anyway, right? Love, /dev/null 9
    18. No, seriously...* (*) Not another “Mine is bigger” comparison, please 10
    19. A little theory Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
    20. ACID. ATOMICITY: All or nothing. CONSISTENCY: Any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity). ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed. DURABILITY: Ability to recover the committed transaction updates against any kind of system failure (transaction log). 12
    21-25. Isolation Levels, Locking & MVCC. Isolation (noun): property that defines how/when the changes made by one operation become visible to other concurrent operations. SERIALIZABLE: all transactions occur in a completely isolated fashion, as if they were executed serially. REPEATABLE READ: multiple SELECT statements issued in the same transaction will always yield the same result. READ COMMITTED: a lock is acquired only on the rows currently read/updated. READ UNCOMMITTED: a transaction can access uncommitted changes made by other transactions. 13
    26. Isolation Levels, Locking & MVCC ( - = anomaly not possible)

        Isolation Level     Dirty Reads   Non-repeatable Reads   Phantoms
        Serializable        -             -                      -
        Repeatable Read     -             -                      possible
        Read Committed      -             possible               possible
        Read Uncommitted    possible      possible               possible

        http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/ 14
    27. Isolation Levels, Locking & MVCC ( - = lock not acquired)

        Isolation Level     Range Lock   Read Lock   Write Lock
        Serializable        held         held        held
        Repeatable Read     -            held        held
        Read Committed      -            -           held
        Read Uncommitted    -            -           -            15
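The lock tables map directly onto SQL: a client picks one of these levels per transaction. A minimal sketch, assuming PostgreSQL and the psycopg2 driver (neither is named in the deck; the `accounts` table is hypothetical):

```python
import psycopg2

# Assumptions: a local PostgreSQL and an 'accounts' table; both are
# illustrative, not from the talk.
conn = psycopg2.connect("dbname=demo user=demo")
cur = conn.cursor()

# First statement of the transaction: choose the isolation level.
cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
first = cur.fetchone()
# Concurrent commits are invisible to this transaction's snapshot:
cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
assert cur.fetchone() == first    # repeatable read, per the tables above
conn.commit()
```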
    28-32. Multi-Version Concurrency Control. A tree of index and data blocks (Root → Index → Data). A writer creates a new version of the blocks it touches and publishes it with an atomic pointer update at the root; the old version becomes obsolete and is marked for compaction. Reads: never blocked. 16
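A toy copy-on-write store in the spirit of these MVCC diagrams (a single-node "tree" for brevity; a sketch, not any particular engine's implementation): readers dereference the current root and see an immutable snapshot, while a writer copies what it changes and publishes the new version with one atomic pointer update.

```python
import threading

class Snapshot:
    def __init__(self, data):
        self.data = data                      # treated as immutable

class MVCCStore:
    def __init__(self):
        self.root = Snapshot({})              # the published version
        self._writer = threading.Lock()       # one writer at a time

    def get(self, key):
        return self.root.data.get(key)        # reads: never blocked

    def put(self, key, value):
        with self._writer:
            new_data = dict(self.root.data)   # copy-on-write
            new_data[key] = value
            self.root = Snapshot(new_data)    # atomic pointer update
            # the old Snapshot is now obsolete; reclaiming it is the
            # "marked for compaction" step from the slides

store = MVCCStore()
store.put("a", 1)
print(store.get("a"))
```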
    33-36. Distributed Transactions - 2PC. 1) COMMIT REQUEST PHASE (voting phase): the Coordinator sends a query to commit to all Participants; each Participant 1) executes the transaction up to the COMMIT request and 2) writes an entry to the undo and redo logs, then answers Agree or Abort. 17
    37-41. Distributed Transactions - 2PC. 2) COMMIT PHASE (completion phase), a) SUCCESS (agreement from all): the Coordinator sends Commit to all Participants; each Participant 1) completes the operation and 2) releases its locks, then acknowledges; the Coordinator completes the transaction. 18
    42-46. Distributed Transactions - 2PC. 2) COMMIT PHASE (completion phase), b) FAILURE (abort from any): the Coordinator sends Rollback to all Participants; each Participant 1) undoes the operation and 2) releases its locks, then acknowledges; the Coordinator undoes the transaction. 19
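To make the two phases concrete, a minimal sketch of the coordinator loop; the Participant class and its method names are illustrative stand-ins, not a real transaction-manager API:

```python
class Participant:
    def prepare(self, tx):
        # Phase 1: execute the transaction up to the COMMIT request and
        # write undo/redo log entries, then vote.
        self.tx = tx
        return True                      # Agree (False would mean Abort)

    def commit(self):                    # complete operation, release locks
        print("committed", self.tx)

    def rollback(self):                  # undo operation, release locks
        print("rolled back", self.tx)

def two_phase_commit(tx, participants):
    # 1) COMMIT REQUEST (voting) phase: query all participants to commit.
    votes = [p.prepare(tx) for p in participants]
    # 2) COMMIT (completion) phase.
    if all(votes):                       # a) SUCCESS: agreement from all
        for p in participants:
            p.commit()
        return True
    for p in participants:               # b) FAILURE: abort from any
        p.rollback()
    return False

two_phase_commit("tx-1", [Participant(), Participant()])
```

Note what the sketch leaves out: if the coordinator dies between the phases, participants sit on their locks indefinitely, which is exactly the weakness the next slide calls out.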
    47. Problems with 2PC: it is a blocking protocol, with a risk of indefinite cohort blocks if the coordinator fails, and conservative behaviour biased to the abort case. 20
    48. Paxos Algorithm (Consensus). Family of fault-tolerant, distributed implementations. Spectrum of trade-offs: number of processors, number of message delays, activity level of participants, number of messages sent, types of failures. http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
    49. [PSE image alert] 22
    50-51. ACID & Distributed Systems. ACID properties are always desirable. But what about: Latency, Partition Tolerance, High Availability? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
    52-53. CAP Theorem (Brewer’s conjecture). 2000: Prof. Eric Brewer, PoDC Conference Keynote. 2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2). Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
    54-57. Partition Tolerance - Availability. “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002. CP: requests can complete at nodes that have quorum. AP: requests can complete at any live node, possibly violating strong consistency. HIGH LATENCY ≈ NETWORK PARTITION. http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html 25
    58-59. Consistency: Client-side view. A service that is consistent operates fully or not at all. Strong consistency (as in ACID). Weak consistency (no guarantee), with an inconsistency window. Eventual* consistency (e.g. DNS), in variants: causal consistency, read-your-writes consistency (the least surprise), session consistency, monotonic read consistency, monotonic write consistency. (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity. http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
    60. Consistency: Server-side (Quorum). N = number of nodes with a replica of the data (*). W = number of replicas that must acknowledge the update. R = minimum number of replicas that must participate in a successful read operation. (*) but the data will be written to N nodes no matter what. W + R > N: strong consistency (usually N=3, W=R=2). W = N, R = 1: optimised for reads. W = 1, R = N: optimised for writes (durability not guaranteed in presence of failures). W + R <= N: weak consistency. 27
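The quorum rules above reduce to one inequality: a read and a write set must overlap in at least one replica. A minimal sketch (function name and values are illustrative):

```python
# N replicas, W write acks required, R read participants required.
def classify(n, w, r):
    if w + r > n:
        return "strong consistency"      # read and write sets must overlap
    return "weak consistency"            # a read may miss the latest write

assert classify(n=3, w=2, r=2) == "strong consistency"   # usual N=3, W=R=2
assert classify(n=3, w=3, r=1) == "strong consistency"   # W=N: optimised for reads
assert classify(n=3, w=1, r=3) == "strong consistency"   # W=1: optimised for writes
assert classify(n=3, w=1, r=1) == "weak consistency"     # W + R <= N
```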
    61. Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repairhttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
    62-66. Modulo-based Hashing. N1 N2 N3 N4. partition = key % n_servers; after removing a node, partition = key % (n_servers - 1). Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node). 29
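A quick demonstration of that full-redistribution problem, counting how many keys change partition when a node is added (numbers are illustrative):

```python
# Modulo placement: count how many keys move when going from 4 to 5 nodes.
keys = range(10_000)

def placement(n_servers):
    return {k: k % n_servers for k in keys}

before, after = placement(4), placement(5)
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys move")   # ~80% relocate for 4 -> 5 nodes
```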
    67-72. Consistent Hashing. Ring (key space) from 0 to 2^160, nodes A-F. Same hash function for data and nodes; idx = hash(key). Coordinator: next available clockwise node, e.g. B is the canonical home (coordinator node) for key range A-B. If B leaves, C becomes the canonical home for key range A-C, and only the keys in that range change location. http://en.wikipedia.org/wiki/Consistent_hashing 30
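A minimal hash ring in the spirit of the diagram: the same hash function places both nodes and keys on the ring, and the coordinator for a key is the next node clockwise (the 160-bit SHA-1 space mirrors the slides; class and names are illustrative):

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)  # 0 .. 2**160 - 1

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)   # node positions on the ring

    def coordinator(self, key: str) -> str:
        idx = bisect.bisect(self.ring, (h(key), ""))   # next position clockwise
        return self.ring[idx % len(self.ring)][1]      # wrap past 2**160 back to 0

ring = Ring(["A", "B", "C", "D", "E", "F"])
print(ring.coordinator("user:1001"))
# Removing a node only reassigns the keys it owned to its clockwise successor:
ring.ring = [(pos, n) for pos, n in ring.ring if n != "B"]
print(ring.coordinator("user:1001"))
```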
    73-74. Consistent Hashing - Replication. Key range AB is hosted in B and replicated in C, D: data is replicated in the N-1 clockwise successor nodes, so (in the diagram) node C hosts the keys of ranges FA, AB and BC. http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
    75-76. Consistent Hashing - Node Changes. Key membership and replicas are updated when a node joins or leaves the network (copy key range AB, copy key range FA, copy key range EF); the number of replicas for all data is kept consistent. 32
    77-78. Consistent Hashing - Load Distribution. Different strategies. Virtual Nodes: random tokens per each physical node, partition by token value (e.g. Node 1: tokens A, E, G; Node 2: tokens C, F, H; Node 3: tokens B, D, I). Or: Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S). 33-34
    79-83. Vector Clocks & Conflict Detection. Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. Example: write handled by A → D1 ([A,1]); write handled by A → D2 ([A,2]); write handled by B → D3 ([A,2],[B,1]) and write handled by C → D4 ([A,2],[C,1]); conflict detected, reconciliation handled by A → D5 ([A,3],[B,1],[C,1]). http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
    84-87. Vector Clocks & Conflict Detection. Vector Clocks can detect a conflict; the conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!). Example with an un-modified replica: write handled by B → D3 ([A,2],[B,1]) while D4 stays at ([A,2]): version mismatch, but D3 ⊇ D4, so no real conflict and it is resolved automatically → D5 ([A,3],[B,1]). http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
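The partial order from these slides fits in a few lines. A sketch of the comparison rule, replaying the D2/D3/D4/D5 example (function names are illustrative):

```python
# V1 precedes V2 iff every per-node counter in V1 is <= the one in V2.
def precedes(v1: dict, v2: dict) -> bool:
    return all(v1.get(node, 0) <= v2.get(node, 0) for node in v1)

def conflict(v1: dict, v2: dict) -> bool:
    return not precedes(v1, v2) and not precedes(v2, v1)

d2 = {"A": 2}
d3 = {"A": 2, "B": 1}          # write handled by B, descends from D2
d4 = {"A": 2, "C": 1}          # write handled by C, also descends from D2

assert precedes(d2, d3) and precedes(d2, d4)
assert conflict(d3, d4)        # concurrent: the application must reconcile
d5 = {"A": 3, "B": 1, "C": 1}  # reconciliation handled by A
assert precedes(d3, d5) and precedes(d4, d5)
```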
    88-90. Gossip Protocol + Hinted Handoff. Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers. E.g.: “I can’t see B, it might be down but I need some ACK. My Merkle Tree root for range XY is ‘ab031dab4a385afda’” / “I can’t see B either. My Merkle Tree root for range XY is different!” / “B must be down then. Let’s disable it.” Hinted Handoff: “My canonical node is supposed to be B.” / “I see. Well, I’ll take care of it for now, and let B know when B is available again.” 37
    91. Merkle Trees (Hash Trees). Leaves: hashes of data blocks. Nodes: hashes of their children (ROOT = hash(A, B); A = hash(C, D), B = hash(E, F); C..F = hash(data blocks 001..004)). Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data. http://en.wikipedia.org/wiki/Hash_tree 38
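A toy Merkle tree over data blocks, as described above: comparing root hashes tells two replicas whether a key range diverges at all, and walking mismatching subtrees pinpoints the blocks to transfer (a sketch; SHA-256 and the odd-leaf duplication rule are my choices, not the talk's):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    level = [sha(b) for b in blocks]           # leaves: hashes of data blocks
    while len(level) > 1:
        if len(level) % 2:                     # duplicate last hash if odd
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])  # nodes: hashes of their children
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"001", b"002", b"003", b"004"]
replica_b = [b"001", b"002", b"XXX", b"004"]
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: out of sync
```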
    92-95. Read Repair. GET(k, R=2) on the ring: two replicas answer k=XYZ (v.2) while another holds a stale k=ABC (v.1); the newest version wins, and the stale replica receives UPDATE(k = XYZ). 39
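A minimal sketch of that flow: read from R replicas, return the highest version, and push it back to any replica that answered with a stale one (the Replica class and version layout are illustrative):

```python
class Replica:
    def __init__(self, version, value):
        self.store = {"k": (version, value)}     # key -> (version, value)

def get_with_read_repair(key, replicas, r=2):
    answers = [(node, node.store[key]) for node in replicas[:r]]
    newest = max(version_value for _, version_value in answers)
    for node, (version, _) in answers:
        if version < newest[0]:
            node.store[key] = newest             # repair: UPDATE(k = XYZ)
    return newest[1]

replicas = [Replica(2, "XYZ"), Replica(1, "ABC"), Replica(2, "XYZ")]
print(get_with_read_repair("k", replicas))       # "XYZ"; stale replica fixed
```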
    96. NoSQL Break-down Key-value stores, Column Families, Document-oriented dbs, Graph databases 40
    97. Focus Of Different Data Models. Along a Size vs Complexity axis: Key-Value Stores → Column Families → Document Databases → Graph Databases (key-value stores target the largest data sizes, graph databases the most complexity). http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 41
    98. 1) Key-value stores Amazon Dynamo Paper Data model: collection of key-value pairs 42
    99-104. Voldemort. AP. Dynamo DHT implementation: consistent hashing, vector clocks. Conflicts resolved at read and write time. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP / Sockets; Java, Thrift, Avro, Protobuf (data: JSON, Java String, byte[], Thrift, Avro, ProtoBuf). PERSISTENCE: pluggable, BDB/MySQL. CONCURRENCY: MVCC; simple optimistic locking for multi-row updates, pluggable storage engine. 43
    105-107. Membase. CP. DHT (K-V), no SPoF. “VBuckets”: unit of consistency and replication, owner of a subset of the cluster key space (hash function + table lookup). membase = memcached + persistence + distributed replication (fail-over HA) + rebalancing. All metadata kept in memory (high throughput / low latency). Manual/programmatic failover via the Management REST API. LICENSE: Apache 2. LANGUAGE: C/C++, Erlang. API/PROTOCOL: REST/JSON, memcached. http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
    108. Riak. AP. LICENSE: Apache 2. LANGUAGE: C, Erlang. API/PROTOCOL: REST HTTP, ProtoBuf. Buckets → K-V. “Links” (~relations). Targeted JS Map/Reduce. Tune-able consistency (one-quorum-all). 45
    109. Redis. CP. K-V store, “Data Structures Server”: Map, Set, Sorted Set, Linked List. Set/Queue operations, counters, Pub-Sub, volatile keys. 10-100K op/s (whole dataset in RAM + VM). Persistence via background snapshotting (tunable fsync freq.). REPLICATION: master-slave. Distributed if the client supports consistent hashing. LICENSE: BSD. LANGUAGE: ANSI C. API/PROTOCOL: Telnet-like. http://redis.io/presentation/Redis_Cluster.pdf 46
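As a concrete taste of the “data structures server” idea, a minimal sketch assuming a local Redis server and the redis-py client (the deck itself shows no client code; keys are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server

r.set("page:home:hits", 0)
r.incr("page:home:hits")                        # atomic counter
r.lpush("queue:jobs", "job-1", "job-2")         # linked list used as a queue
r.sadd("online:users", "alice", "bob")          # set operations
r.expire("online:users", 60)                    # volatile key: 60s TTL
print(r.rpop("queue:jobs"))                     # b"job-1"
```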
    110. 2) Column Families Google BigTable paper Data model: big table, column families 47
    111-118. Google BigTable Paper. Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp). Example row “com.cnn.www”: column “contents:html” holds <html>... at t3, t5, t6; columns “anchor:cnnsi.com” = “CNN” (t9) and “anchor:my.look.ca” = “CNN.com” (t8) belong to the “anchor” column family. Column families carry ACLs; updates are atomic per row; old versions are garbage-collected automatically. http://labs.google.com/papers/bigtable-osdi06.pdf 48
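A toy model of that data shape, showing only the indexing (real BigTable keeps the map sorted by key and adds the per-family ACLs, atomic row updates and version GC noted above):

```python
# Map indexed by the triple (row_key, column_key, timestamp).
table = {
    ("com.cnn.www", "contents:html",     6): "<html>...",
    ("com.cnn.www", "contents:html",     5): "<html>...",
    ("com.cnn.www", "anchor:cnnsi.com",  9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
}

def latest(row_key, column_key):
    """Return the most recent version of a cell (highest timestamp)."""
    versions = [ts for (r, c, ts) in table if r == row_key and c == column_key]
    return table[(row_key, column_key, max(versions))]

print(latest("com.cnn.www", "contents:html"))   # the value written at t6
```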
    119-121. Google BigTable: Data Structure. SSTable: smallest building block; persistent immutable Map[k,v] made of 64KB blocks plus a lookup index; operations: lookup by key / key range scan. Tablet: dynamically partitioned range of rows (e.g. Aaa → Bar), built from multiple SSTables; unit of distribution and load balancing. Table: multiple tablets (table segments) make up a table. 49
    122-125. Google BigTable: I/O. Writes go to a tablet log (on GFS) and to an in-memory memtable; reads merge the memtable and the SSTables. Minor compaction: the memtable is frozen and written out as a new SSTable. Merging / major compaction: SSTables are merged and garbage-collected. Compression with BMDiff and Zippy. 50
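A minimal sketch of that write/read path (class name, flush threshold and the lack of real durability are all simplifications of mine): writes append to a log and land in a memtable, a “minor compaction” freezes the memtable into an immutable SSTable, and reads consult the memtable first, then SSTables newest-first.

```python
class TinyTablet:
    def __init__(self, flush_at=3):
        self.log, self.memtable, self.sstables = [], {}, []
        self.flush_at = flush_at

    def write(self, key, value):
        self.log.append((key, value))           # tablet log first (durability)
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable, self.log = {}, []    # minor compaction
        # (a major compaction would merge self.sstables into one and GC)

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables): # newest SSTable first
            if key in sstable:
                return sstable[key]
        return None

t = TinyTablet()
for i in range(4):
    t.write(f"k{i}", i)
print(t.read("k0"), t.read("k3"))   # 0 (from an SSTable), 3 (from the memtable)
```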
    126. Google BigTable: Location Dereferencing. Chubby master file → Root Tablet (root of the metadata tree) → Metadata Tablets → User Tables: up to 3 levels in the metadata hierarchy. Chubby: replicated, persisted lock service; maintains tablet server locations; 5 replicas, one elected master (via quorum); Paxos algorithm used to keep consistency. 51
    127. Google BigTable: Architecture. The BigTable client talks to the master for metadata operations and to tablet servers for data R/W operations. Master: fs metadata, ACL, GC, load balancing; exchanges heartbeat messages, GC and chunk migration with the Tablet Servers. Chubby: tracks the master lock and the log of live servers. 52
    128-134. HBase. CP. OSS implementation of BigTable. ZooKeeper as coordinator (instead of Chubby); support for multiple masters. Data sorted by key but evenly distributed across the cluster. Batch streaming, Map/Reduce. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable on HDFS, S3, S3N, EBS (with GZip/LZO CF compression). 53
    135-138. Hypertable. CP. OSS BigTable implementation, faster than HBase (10-30K/s). Hyperspace (Paxos) used instead of ZooKeeper. Dynamically adapts to changes in workload. HQL (~SQL). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC. 54
    139-141. Cassandra. AP. Data model of BigTable, infrastructure of Dynamo. Column: (col_name, col_value, timestamp). SuperColumn: super_column_name → list of columns. Column Family: row_key → columns. LICENSE: Apache 2. LANGUAGE: Java. PROTOCOL: Thrift, Avro. PERSISTENCE: memtable/SSTable. CONSISTENCY: tunable R/W/N. http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
