NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over ...
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
mchmarnyOne of the better presentations on issues related to data scalability.7 months ago
Are you sure you want to
Chandra Sekhar A at Progress SoftwareFantastic presentation. I am a DB internal expert. Can you share the slides. arveti.chandrashekar@gmail.com10 months ago
NoSQL Databases: Why, what and whenPresentation Transcript
Lorenzo Alberton @lorenzoalberton NoSQL Databases:Why, what and when NoSQL Databases Demystified PHP UK Conference, 25th February 2011 1
NoSQL: WhyScalability, Concurrency, New trends 2
New Trends 2002 2004 2006 2008 2010 2012 Big data 3
New Trends 2002 2004 2006 2008 2010 2012 Big data Concurrency 3
New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency 3
New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency Diversity 3
New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity 3
New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity Cloud-Grid 3
What’s the problem with RDBMS’s? http://www.codefutures.com/database-sharding 4
What’s the problem with RDBMS’s? Caching Master/Slave Master/Master Cluster Table Partitioning Federated Tables Sharding http://www.codefutures.com/database-sharding Distributed DBs 4
What’s the problem with RDBMS’s? 5
What’s the problem with RDBMS’s?http://www.flickr.com/photos/dimi3/3096166092 5
Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) 6
Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) http://www.flickr.com/photos/42433826@N00/4914337851 6
MongoDB is web-scale ...but /dev/null is even better! 7
Cassandra is teh schnitz ..Love,v/null .but /de is even better 8
CouchDB: Relax! .buve/a LO n ? t de ulfree space ..Love,v/T of!l !? You harenaieth?r Right? l, r x te? isyevay b g t an we 9
No, seriously...*(*) Not another “Mine is bigger” comparison, please 10
A little theory Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
ACIDATOMICITY: All or nothingCONSISTENCY: Any transaction will take the db from oneconsistent state to another, with no broken constraints(referential integrity)ISOLATION: Other operations cannot access data that hasbeen modified during a transaction that has not yet completedDURABILITY: Ability to recover the committed transactionupdates against any kind of system failure (transaction log) 12
Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations 13
Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE All transactions occur in a completely isolated fashion, as if they were executed serially 13
Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result 13
Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result READ COMMITTED A lock is acquired only on the rows currently read/updated 13
Isolation Levels, Locking & MVCCIsolation nounProperty that defines how/when the changes made by one operationbecome visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result READ COMMITTED READ UNCOMMITTED A lock is acquired only on the A transaction can access rows currently read/updated uncommitted changes made by other transactions 13
Multi-Version Concurrency Control Root Index Index Index IndexIndex Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data 16
Multi-Version Concurrency Control obsolete Root new version Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
Multi-Version Concurrency Control obsolete Root atomic pointer update new version Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
Multi-Version Concurrency Control obsolete Root atomic pointer update new version marked for compaction Index Index Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
Multi-Version Concurrency Control obsolete Root atomic pointer update new version marked for compaction Index Index Reads: never blocked Index Index Index IndexIndex Index Index Index Index Index Index IndexData Data Data Data Data Data Data Data Data Data Data Data Data 16
Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge b) FAILURE (abort from any) Participants 19
Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Undo transaction (completion phase) b) FAILURE (abort from any) Participants 19
Problems with 2PC Blocking ProtocolRisk of indefinite cohort Conservative behaviourblocks if coordinator fails biased to the abort case 20
Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs: Number of processors Number of message delays Activity level of participants Number of messages sent Types of failures http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
[PSE image alert] 22
ACID & Distributed Systems http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
ACID & Distributed Systems ACID properties are always desirable But what about: Latency Partition Tolerance High Availability ? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum AP: requests can complete at any live node, possibly violating strong consistencyhttp://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum HIGH LATENCY AP: requests can complete at any ≈ live node, possiblyPARTITION NETWORK violating strong consistency http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.htmlhttp://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
Consistency: Client-side viewA service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
Consistency: Client-side viewA service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window Eventual* consistency (e.g. DNS) Causal consistency Read-your-writes consistency (the least surprise) Session consistency (*) Temporary inconsistencies (e.g. in data constraints or Monotonic read consistency replica versions) are accepted, but they’re resolved Monotonic write consistency at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
Consistency: Server-side (Quorum)N = number of nodes with a replica of the data (*)W = number of replicas that must acknowledge the updateR = minimum number of replicas that must participate in a successful read operation (*) but the data will be written to N nodes no matter whatW+R>N Strong consistency (usually N=3, W=R=2)W = N, R =1 Optimised for readsW = 1, R = N Optimised for writes (durability not guaranteed in presence of failures)W + R <= N Weak consistency 27
Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repairhttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers - 1) (n_servers Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node) 29
Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing 2160 0 A canonical home (coordinator node) for key range A-B F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing 2160 0 only the keys in this range change location A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
Consistent Hashing - Replication A F B Ring E (key space) D C http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
Consistent Hashing - Replication Key hosted AB A in B, C, D F B Data replicated in Ring the N-1 clockwise E (key space) successor nodes D C Node hosting Key , Key , Key FA AB BC http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
Consistent Hashing - Node Changes A F B E D C 32
Consistent Hashing - Node Changes Key membership A and replicas are updated when a F B node joins or leaves Copy Key the network. Range AB The number of E Copy Key replicas for all data Range FA is kept consistent. D C Copy Key Range EF 32
Consistent Hashing - Load Distribution 2160 0 Different Strategies A I Virtual Nodes H B Random tokens per each Ring physical node, partition by C G (key space) token value D Node 1: tokens A, E, G F Node 2: tokens C, F, H E Node 3: tokens B, D, I 33
Consistent Hashing - Load Distribution 2160 0 Different Strategies Virtual Nodes Q equal-sized partitions, Ring S nodes, Q/S tokens per (key space) node (with Q >> S) Node 1 Node 2 Node 3 Node 4 ... 34
Vector Clocks & Conflict DetectionA B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
Vector Clocks & Conflict DetectionA B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document.D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in D5 ([A, 3], [B, 1], [C,1]) ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
Vector Clocks & Conflict DetectionA B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
Vector Clocks & Conflict DetectionA B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or withD3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or withD3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). version mismatch D3 ⊇ D4, conflict detected resolved automatically Vector clocks can grow D5 ([A, 3], [B, 1]) quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
Gossip Protocol + Hinted Handoff A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D C 37
Gossip Protocol + Hinted Handoff A I can’t see B, it might be periodic, pairwise, F down but I need some B inter-process ACK. My Merkle Tree root for range XY is interactions of “ab031dab4a385afda” bounded size E among randomly- I can’t see B either. My Merkle Tree root for chosen peers range XY is different! D C B must be down then. Let’s disable it. 37
Gossip Protocol + Hinted Handoff My canonical node is supposed to be B. A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D I see. Well, I’ll take care of it for now, and let B know C when B is available again 37
Merkle Trees (Hash Trees) Leaves: hashes of ROOT hash(A, B) data blocks. Nodes: hashes of their children. A B hash(C, D) hash(E, F) Used to detect inconsistencies C D E F between replicas hash(001) hash(002) hash(003) hash(004) (anti-entropy) and to minimise the Data Data Data Data Block Block Block Block amount of 001 002 003 004 transferred data http://en.wikipedia.org/wiki/Hash_tree 38
Read Repair A F B GET(k, R=2) E D C 39
Read Repair A F B GET(k, R=2) E D C 39
Read Repair k=XYZ (v.2) A k=XYZ (v.2) F B GET(k, R=2) E D C k=ABC (v.1) 39
Read Repair A F B k=XYZ (v.2) E UPDATE(k = XYZ) D C 39
Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookup http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
Membase CP LICENSEDHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookupAll metadata kept in memory (high throughput / low latency).Manual/Programmatic failover via the Management REST API. http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
Google BigTable: Data StructureSSTableSmallest building blockPersistent immutable Map[k,v]Operations: lookup by key / key range scan SSTable 64KB 64KB 64KB block block block lookup index 49
Google BigTable: Data StructureSSTableTabletSmallest building block range of rowsDynamically partitionedPersistent immutable Map[k,v]Built from multiple SSTablesOperations: lookup and loadkey range scanUnit of distribution by key / balancingTablet (range Aaa → Bar) SSTable SSTable 64KB 64KB 64KB 64KB 64KB 64KB block block block lookup block block block lookup index index 49
Google BigTable: Data StructureSSTableTableTabletSmallest Tablets (table segments) make up a tableMultiple building blockDynamically partitioned range of rowsPersistent immutable Map[k,v]Built from multiple SSTablesOperations: lookup and loadkey range scanUnit of distribution by key / balancingTableTablet (range Aaa → Bar) SSTable SSTable 64KB 64KB 64KB 64KB 64KB 64KB block block block lookup block block block lookup index index 49
Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write 50
Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write BMDiff Zippy 50
Google BigTable: I/O memtable read minor compactionmemoryGFS tablet log SSTable SSTable SSTable write BMDiff Zippy merging / major compaction (GC) 50
Google BigTable: Location Dereferencing Metadata Tablets User Tables ... ... Root Tablet Master File ... Chubby ... ... Replicated, persisted Root of thelock service; maintains metadata treetablet server locations5 replicas, one elected ... master (via quorum) Up to 3 levels ...Paxos algorithm used in the metadata to keep consistency hierarchy 51
Google BigTable: Architecture fs metadata, ACL, GC, load balancing BigTable metadata operations BigTable client master data R/W heartbeat operations messages, GC, chunk migration Tablet Tablet Tablet Chubby Server Server Server track master lock, log of live servers Tablet Tablet Tablet 52
HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java ZooKeeper as API/PROTOCOL coordinator REST HTTP Thrift(instead of Chubby) PERSISTENCE memtable/ SSTable 53
HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Support for Java multiple masters API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ Data sorted by key SSTable but evenly distributed across the cluster 53
HBase CP LICENSEOSS implementation of BigTable Apache 2 LANGUAGE Java API/PROTOCOL REST HTTP Thrift PERSISTENCE memtable/ SSTable 53
Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift PERSISTENCE memtable/ SSTable CONCURRENCY MVCC HQL (~SQL) 54
Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Hyperspace Thrift (paxos) used PERSISTENCE instead of memtable/ SSTable ZooKeeper CONCURRENCY MVCC HQL (~SQL) 54
Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift Dynamically PERSISTENCE adapts to memtable/ changes in SSTable workload CONCURRENCY MVCC HQL (~SQL) 54
Hypertable CP LICENSEOSS BigTable implementationFaster than HBase (10-30K/s) GPLv2 LANGUAGE C++ API/PROTOCOL C++ Thrift PERSISTENCE memtable/ SSTable CONCURRENCY MVCC HQL (~SQL) 54
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE Java PROTOCOL B col_name Thrift Avro col_value PERSISTENCE timestamp memtable/ SSTable Column CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE super_column_name Java PROTOCOL B col_name col_name Thrift ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Column Family Apache 2 LANGUAGE Java PROTOCOL B col_name col_name Thriftrow_key ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Super Column Family Apache 2 LANGUAGE super_column_name super_column_name Java PROTOCOL B col_name col_name col_name col_name Thriftrow_key ... ... ... Avro col_value col_value col_value col_value PERSISTENCE timestamp timestamp timestamp timestamp memtable/ SSTable CONSISTENCYkeyspace.get("column_family", key, ["super_column",] "column") Tunable R/W/N xhttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Super Column Family Apache 2 LANGUAGE super_column_name super_column_name Java PROTOCOL B col_name col_name col_name col_name Thriftrow_key ... ... ... Avro col_value col_value col_value col_value PERSISTENCE timestamp timestamp timestamp timestamp memtable/ SSTable CONSISTENCYkeyspace.get("column_family", key, ["super_column",] "column") Tunable B R/W/N A C P2P F Gossip x D Ehttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Super Column Family Apache 2 LANGUAGE super_column_name super_column_name Java PROTOCOL B col_name col_name col_name col_name Thriftrow_key ... ... ... Avro col_value col_value col_value col_value PERSISTENCE timestamp timestamp timestamp timestamp memtable/ SSTable CONSISTENCYkeyspace.get("column_family", key, ["super_column",] "column") Tunable B R/W/N A C ALL P2P ONE F Gossip x QUORUM D Ehttp://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Super Column Family Apache 2 LANGUAGE super_column_name super_column_name Java PROTOCOL B col_name col_name col_name col_name Thriftrow_key ... ... ... Avro col_value col_value col_value col_value PERSISTENCE timestamp timestamp timestamp timestamp memtable/ SSTable CONSISTENCYkeyspace.get("column_family", key, ["super_column",] "column") Tunable B R/W/N A RandomPartitioner (MD5) C ALL P2P OrderPreservingPartitioner ONE F Gossip x QUORUM E D Range Scans, Fulltext Index (Solandra)http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place B+Tree, Snapshots(no versioning, no CONCURRENCYappend-only log) In-place updates REPLICATION master-slave replica sets 59
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place B+Tree, Geo-Spatial Indexes Snapshots(no versioning, no CONCURRENCYappend-only log) In-place updates REPLICATION master-slave replica sets 59
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place Persistence via B+Tree, Geo-Spatial Indexes Snapshots(no versioning, no Replication + CONCURRENCYappend-only log) Snapshotting In-place updates REPLICATION master-slave replica sets 59
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place Persistence via B+Tree, Geo-Spatial Indexes Snapshots(no versioning, no Replication + CONCURRENCYappend-only log) Snapshotting In-place updates GROUP BY REPLICATION master-slave Map/Reduce replica sets (well, aggregation) 59
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place Persistence via B+Tree, Geo-Spatial Indexes Snapshots(no versioning, no Replication + CONCURRENCYappend-only log) Snapshotting In-place updates GROUP BY REPLICATION master-slave Map/Reduce No ACK on Updates replica sets (well, aggregation) (or ensure N replicas) 59
MongoDB CP LICENSE bson_encode() bson_decode() AGPLv3 Auto-Sharding, LANGUAGEBSON serialisation B-Tree Indexes Master-Slave,(storage & transfer) (on different cols too) C++ Auto-Failover API/PROTOCOL REST/BSON v.1 * v.2 PERSISTENCEUpdate in place Persistence via B+Tree, Geo-Spatial Indexes Snapshots(no versioning, no Replication + CONCURRENCYappend-only log) Snapshotting In-place updates GROUP BY REPLICATION New! Improved! master-slave Map/Reduce --dur flag No ACK on Updates replica sets (well, aggregation) (or ensure N replicas) 59
4) Graph databases Graph Theory 60
Neo4j LICENSE AGPLv3 / Commercial LANGUAGE Java Nodes, Vertical Scalability API/PROTOCOL Graph Data Structure Relationships, (1000s times faster, but not distributed) REST Properties on both Java SPARQL PERSISTENCE On-disk Physical structure: linked-list LinkedList stored on disk HA cluster with ZooKeeper High-performance SPARQL (nodes = exact replicas) node traversalhttp://docs.neo4j.org/chunked/stable/ha-how.html 61
Neo4j LICENSE NeoService neo = ... // factory AGPLv3 / Commercial LANGUAGE Transaction tx = neo.beginTx(); Java Nodes, Vertical Scalability Node Structure n1 = neo.createNode(); (1000s times faster, API/PROTOCOL Graph Data Relationships, n1.setProperty(“name”, “John”); but not distributed) REST Properties on both Java n1.setProperty(“age”, 35); SPARQL PERSISTENCE Node n2 = neo.createNode(); n2.setProperty(“name”, “Mary”); structure: On-disk Physical linked-list n2.setProperty(“age”, 29); LinkedList stored on disk n2.setProperty(“job”, “engineer”); n1.createRelationshipTo(n2, RelTypes.KNOWS); HA cluster with ZooKeeper tx.commit(); High-performance SPARQL (nodes = exact replicas) node traversalhttp://docs.neo4j.org/chunked/stable/ha-how.html 61
Neo4j LICENSE Traverser neo = ... // factory NeoServicefriendTraverser = n1.traverse( AGPLv3 / Traverser.order.BREADTH_FIRST, Commercial LANGUAGE Transaction tx = neo.beginTx(); StopEvaluator.END_OF_GRAPH, ReturnableEvaluator.ALL_BUT_START_NODE, Java Nodes, Vertical Scalability GraphNodeRelTypes.KNOWS,Relationships, n1 = neo.createNode(); API/PROTOCOL Data Structure (1000s times faster, n1.setProperty(“name”, “John”); but not distributed) Direction.OUTGOING REST Properties on both Java ); n1.setProperty(“age”, 35); SPARQL // Traverse the node space PERSISTENCE System.out.println(“John’s Node n2 = neo.createNode();friends: ”); for (Node friend : friendsTraverser) { n2.setProperty(“name”, “Mary”); structure: On-disk Physical linked-list n2.setProperty(“age”, 29);depth stored on disk System.out.printf(“At LinkedList %d => %s%n”, n2.setProperty(“job”, “engineer”); friendTraverser.currentPosition(). getDepth(), n1.createRelationshipTo(n2, RelTypes.KNOWS); friendTraverser.getProperty(“name”) ); HA cluster with ZooKeeper } tx.commit(); High-performance SPARQL (nodes = exact replicas) node traversalhttp://docs.neo4j.org/chunked/stable/ha-how.html 61
Final Considerations Query modes Achievements and Problems 62
Query Modes: a new “SQL”? Map/Reduce 63
SQL vs. Map/Reduce SELECT 19OPQ db.runCommand({ A*2=*LR Dim1, Dim2, ! mapreduce: "DenormAggCollection", SUM(Measure1) AS MSum, query: { " COUNT(*) AS RecordCount, filter1: { $in: [ A, B ] }, AVG(Measure2) AS MAvg, # filter2: C, MIN(Measure1) AS MMin filter3: { $gt: 123 } MAX(CASE }, WHEN Measure2 < 100 $ map: function() { emit( THEN Measure2 { d1: this.Dim1, d2: this.Dim2 }, END) AS MMax { msum: this.measure1, recs: 1, mmin: this.measure1, FROM DenormAggTable mmax: this.measure2 < 100 ? this.measure2 : 0 } WHERE (Filter1 IN (’A’,’B’)) );}, AND (Filter2 = ‘C’) % reduce: function(key, vals) { AND (Filter3 > 123) var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 }; GROUP BY Dim1, Dim2 ! for(var i = 0; i < vals.length; i++) { HAVING (MMin > 0) ret.msum += vals[i].msum; ORDER BY RecordCount DESC ret.recs += vals[i].recs; LIMIT 4, 8 if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin; if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax)) ret.mmax = vals[i].mmax; } ! ()*+,-./.01-230*2/4*5+123/6)-/,+55-./ return ret; *+7/63/8-93/02/7:-/16,/;+2470*2</ }, )-.+402=/7:-/30>-/*;/7:-/?*)802=/3-7@ finalize: function(key, val) { " A-63+)-3/1+37/B-/162+6559/6==)-=67-.@ & val.mavg = val.msum / val.recs; return val; # C==)-=67-3/.-,-2.02=/*2/)-4*)./4*+273/ }, G-E030*2/$</M)-67-./"N!NIN#IN G048/F3B*)2-</)048*3B*)2-@*)= 1+37/?607/+2705/;02650>670*2@ $ A-63+)-3/462/+3-/,)*4-.+)65/5*=04@ out: result1, verbose: true % D057-)3/:6E-/62/FGAHC470E-G-4*).I }); 5**802=/3795-@ db.result1. C==)-=67-/;057-)02=/1+37/B-/6,,50-./7*/ 7:-/)-3+57/3-7</2*7/02/7:-/16,H)-.+4-@ find({ mmin: { $gt: 0 } }). & C34-2.02=J/!K/L-34-2.02=J/I! sort({ recs: -1 }). skip(4). limit(8); http://rickosborne.org/download/SQL-to-MongoDB.pdf 64
Pig vs. Map/Reduce 65
Pig vs. Map/Reduce 65
Data model, Relations and Consistency A step backwards? 66
Data model, Relations and Consistency A step backwards? Scalability, availability and resilience come at a cost 66
Big Data collect store organise analyse share Werner Vogels, CTO, Amazon - STRATA Conf 2011 67
Big Data collect we don’t always know up-front store which questionswe’re going to ask organise analyse share Werner Vogels, CTO, Amazon - STRATA Conf 2011 67
We’re Hiring!http://mediasift.com/careers 68
Lorenzo Alberton @lorenzoalberton Thank you! lorenzo@alberton.infohttp://www.alberton.info/talks http://joind.in/talk/view/2517 69
1–10 of 30 previous next Post a comment
1–10 of 30 previous next