NoSQL Databases: Why, what and when

NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).



NoSQL Databases: Why, what and when - Presentation Transcript

  • Lorenzo Alberton @lorenzoalberton - NoSQL Databases: Why, what and when (NoSQL Databases Demystified) - PHP UK Conference, 25th February 2011 1
  • NoSQL: Why - Scalability, Concurrency, New trends 2
  • New Trends, 2002-2012: Big data, Concurrency, Connectivity, Diversity, P2P Knowledge, Cloud/Grid 3
  • What’s the problem with RDBMS’s? Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs http://www.codefutures.com/database-sharding 4
  • What’s the problem with RDBMS’s? http://www.flickr.com/photos/dimi3/3096166092 5
  • Quick Comparison: overview from 10,000 feet (random impressions from the interwebs) http://www.flickr.com/photos/42433826@N00/4914337851 6
  • MongoDB is web-scale ...but /dev/null is even better! 7
  • Cassandra is teh schnitz ...but /dev/null is even better! 8
  • CouchDB: Relax! ...but /dev/null is even better! You have a LOT of free space, right? 9
  • No, seriously...*(*) Not another “Mine is bigger” comparison, please 10
  • A little theory Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
  • ACID - ATOMICITY: all or nothing. CONSISTENCY: any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity). ISOLATION: other operations cannot access data that has been modified during a transaction that has not yet completed. DURABILITY: ability to recover the committed transaction updates against any kind of system failure (transaction log). 12
  • Isolation Levels, Locking & MVCC. Isolation (noun): property that defines how/when the changes made by one operation become visible to other concurrent operations. SERIALIZABLE: all transactions occur in a completely isolated fashion, as if they were executed serially. REPEATABLE READ: multiple SELECT statements issued in the same transaction will always yield the same result. READ COMMITTED: a lock is acquired only on the rows currently read/updated. READ UNCOMMITTED: a transaction can access uncommitted changes made by other transactions. 13
  • Isolation Levels, Locking & MVCC. Anomalies allowed per isolation level: Serializable - no dirty reads, no non-repeatable reads, no phantoms; Repeatable Read - phantoms possible; Read Committed - non-repeatable reads and phantoms possible; Read Uncommitted - dirty reads, non-repeatable reads and phantoms possible. http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/ 14
  • Isolation Levels, Locking & MVCC. Locks held per isolation level: Serializable - range, read and write locks; Repeatable Read - read and write locks; Read Committed - write locks only; Read Uncommitted - no locks. 15
  • Multi-Version Concurrency Control. The index is a tree of index nodes over data blocks (Root → Index → Data). An update writes a new version of the affected nodes next to the obsolete ones, the root is switched with an atomic pointer update, and the obsolete nodes are marked for compaction. Reads are never blocked. 16
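
The copy-on-write behaviour described above can be sketched in a few lines of Python (an illustrative toy, not any particular engine's storage format): an update rebuilds only the path from the root, the new root is published with a single reference swap, and readers holding the old root keep an unchanged snapshot.

    # Minimal copy-on-write binary tree: an update never mutates existing
    # nodes, it rebuilds the path from the root, so readers of the old root
    # always see a consistent snapshot (illustrative sketch only).

    class Node:
        def __init__(self, key, value, left=None, right=None):
            self.key, self.value, self.left, self.right = key, value, left, right

    def insert(root, key, value):
        """Return the root of a NEW version of the tree; old version untouched."""
        if root is None:
            return Node(key, value)
        if key < root.key:
            return Node(root.key, root.value, insert(root.left, key, value), root.right)
        if key > root.key:
            return Node(root.key, root.value, root.left, insert(root.right, key, value))
        return Node(key, value, root.left, root.right)   # replace value

    # "atomic pointer update": publishing the new version is a single reference swap
    v1 = insert(None, "a", 1)
    v2 = insert(v1, "b", 2)       # v1 is still fully readable (old snapshot)
    current = v2                  # readers that started on v1 are never blocked
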
  • Distributed Transactions - 2PC. 1) COMMIT REQUEST PHASE (voting phase): the coordinator sends a query to commit to all participants; each participant executes the transaction up to the COMMIT request, writes an entry to its undo and redo logs, and answers Agree or Abort. 17
  • Distributed Transactions - 2PC. 2) COMMIT PHASE (completion phase), a) SUCCESS (agreement from all): the coordinator sends Commit to all participants; each participant completes the operation, releases its locks and acknowledges; the coordinator completes the transaction. 18
  • Distributed Transactions - 2PC. 2) COMMIT PHASE (completion phase), b) FAILURE (abort from any): the coordinator sends Rollback to all participants; each participant undoes the operation, releases its locks and acknowledges; the coordinator undoes the transaction. 19
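
A toy Python sketch of the two-phase commit flow just described (Coordinator logic only; the Participant class is an illustrative stand-in, not a real resource manager): the coordinator collects votes, then broadcasts commit or rollback.

    # Toy two-phase commit: voting phase, then completion phase.

    class Participant:
        def __init__(self, name, will_agree=True):
            self.name, self.will_agree = name, will_agree

        def prepare(self):            # exec transaction up to COMMIT, write undo/redo logs
            return self.will_agree    # vote: agree (True) or abort (False)

        def commit(self):   print(f"{self.name}: commit, release locks")
        def rollback(self): print(f"{self.name}: undo, release locks")

    def two_phase_commit(participants):
        # 1) commit-request (voting) phase
        votes = [p.prepare() for p in participants]
        # 2) commit (completion) phase
        if all(votes):
            for p in participants: p.commit()
            return "committed"
        for p in participants: p.rollback()
        return "aborted"

    print(two_phase_commit([Participant("db1"), Participant("db2", will_agree=False)]))
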
  • Problems with 2PC: it is a blocking protocol, with a risk of indefinite cohort blocks if the coordinator fails, and a conservative behaviour biased to the abort case. 20
  • Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs: Number of processors Number of message delays Activity level of participants Number of messages sent Types of failures http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
  • ACID & Distributed Systems. ACID properties are always desirable, but what about Latency, Partition Tolerance, High Availability? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
  • CAP Theorem (Brewer’s conjecture). 2000: Prof. Eric Brewer, PoDC Conference Keynote; 2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2). Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
  • Partition Tolerance - Availability. “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002. CP: requests can complete only at nodes that have quorum. AP: requests can complete at any live node, possibly violating strong consistency. HIGH LATENCY ≈ NETWORK PARTITION. http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html 25
  • Consistency: Client-side view. A service that is consistent operates fully or not at all. Strong consistency (as in ACID); weak consistency (no guarantee), with an inconsistency window(*); eventual consistency (e.g. DNS); causal consistency; read-your-writes consistency (the least surprise); session consistency; monotonic read consistency; monotonic write consistency. (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity. http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
  • Consistency: Server-side (Quorum). N = number of nodes with a replica of the data (*); W = number of replicas that must acknowledge the update; R = minimum number of replicas that must participate in a successful read operation. (*) but the data will be written to N nodes no matter what. W + R > N: strong consistency (usually N=3, W=R=2). W = N, R = 1: optimised for reads. W = 1, R = N: optimised for writes (durability not guaranteed in presence of failures). W + R <= N: weak consistency. 27
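
The N/W/R relationships above can be checked mechanically; a small Python helper (names are illustrative) makes the trade-offs explicit:

    # Quorum arithmetic for N replicas, W write acks, R read participants.

    def consistency_mode(n, w, r):
        if w + r > n:
            return "strong consistency (read and write sets always overlap)"
        return "weak/eventual consistency (a read may miss the latest write)"

    print(consistency_mode(3, 2, 2))  # typical strong setup: N=3, W=R=2
    print(consistency_mode(3, 3, 1))  # optimised for reads
    print(consistency_mode(3, 1, 3))  # optimised for writes (durability at risk on failures)
    print(consistency_mode(3, 1, 1))  # weak consistency
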
  • Amazon Dynamo Paper: Consistent Hashing, Vector Clocks, Gossip Protocol, Hinted Handoffs, Read Repair. http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
  • Modulo-based Hashing. With nodes N1 N2 N3 N4, which node holds a key? partition = key % n_servers (becoming key % (n_servers - 1) when a node is removed). The hashes for all the entries must be recalculated if n_servers changes (i.e. full data redistribution when adding/removing a node). 29
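
A quick way to see the cost of modulo-based placement (a Python sketch with made-up keys): count how many keys change partition when one server is added.

    # With partition = hash(key) % n_servers, almost every key moves
    # when the cluster grows from 4 to 5 nodes.

    import hashlib

    def partition(key, n_servers):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % n_servers

    keys = [f"user:{i}" for i in range(10_000)]
    moved = sum(partition(k, 4) != partition(k, 5) for k in keys)
    print(f"{moved / len(keys):.0%} of keys relocated")   # typically around 80%
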
  • Consistent Hashing. A ring (key space, 0 to 2^160) using the same hash function for data and nodes: idx = hash(key). The coordinator for a key is the next available node clockwise (e.g. node B is the canonical home for key range A-B). When B leaves, C becomes the canonical home for key range A-C, and only the keys in that range change location. http://en.wikipedia.org/wiki/Consistent_hashing 30
  • Consistent Hashing - Replication. Data is replicated in the N-1 clockwise successor nodes: the key range AB is hosted in B and replicated in C and D, and a node hosts its own range plus replicas of its predecessors’ ranges (e.g. Key FA, Key AB, Key BC). http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
  • Consistent Hashing - Node Changes. Key membership and replicas are updated when a node joins or leaves the network (copying key ranges AB, FA, EF to their new hosts); the number of replicas for all data is kept consistent. 32
  • Consistent Hashing - Load Distribution. Different strategies: Virtual Nodes - random tokens per each physical node, partition by token value (e.g. Node 1: tokens A, E, G; Node 2: tokens C, F, H; Node 3: tokens B, D, I). 33
  • Consistent Hashing - Load Distribution. Different strategies: Virtual Nodes - Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S). 34
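
A compact Python sketch of a consistent-hash ring with virtual nodes (tokens), along the lines of the strategies above (node names and token counts are illustrative): the coordinator for a key is the first token clockwise.

    # Consistent hashing with virtual nodes: each physical node owns several
    # tokens on the ring; a key is served by the first token clockwise.

    import bisect, hashlib

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=8):
            self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
            self.tokens = [t for t, _ in self.ring]

        def coordinator(self, key):
            i = bisect.bisect(self.tokens, h(key)) % len(self.ring)   # wrap around
            return self.ring[i][1]

    ring = Ring(["A", "B", "C"])
    print(ring.coordinator("user:42"))
    # Adding a node only remaps the key ranges adjacent to its tokens:
    ring2 = Ring(["A", "B", "C", "D"])
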
  • Vector Clocks & Conflict Detection. Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. Example: a write handled by A creates D1 ([A, 1]); a second write handled by A creates D2 ([A, 2]); concurrent writes handled by B and C create D3 ([A, 2], [B, 1]) and D4 ([A, 2], [C, 1]); a conflict is detected, and the reconciliation handled by A creates D5 ([A, 3], [B, 1], [C, 1]). http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • Vector Clocks & Conflict Detection. Vector clocks can detect a conflict; the conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!). Example: after D2 ([A, 2]), a write handled by B creates D3 ([A, 2], [B, 1]) while D4 ([A, 2]) is an un-modified replica; the version mismatch (D3 ⊇ D4) is detected and resolved automatically as D5 ([A, 3], [B, 1]). http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
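
The ordering rule on the slides ("V1 precedes V2 if every counter in V1 is smaller or equal to the corresponding counter in V2") translates directly into a few lines of Python (sketch only):

    # Vector clocks as dicts of node -> update counter.

    def descends(v2, v1):
        """True if v2 is a descendant of v1 (v1 precedes or equals v2)."""
        return all(v2.get(node, 0) >= n for node, n in v1.items())

    def conflict(v1, v2):
        return not descends(v1, v2) and not descends(v2, v1)

    d2 = {"A": 2}
    d3 = {"A": 2, "B": 1}          # write handled by B
    d4 = {"A": 2, "C": 1}          # concurrent write handled by C
    print(descends(d3, d2))        # True: D3 descends from D2
    print(conflict(d3, d4))        # True: siblings, the application must reconcile
    d5 = {"A": 3, "B": 1, "C": 1}  # reconciled version written by A
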
  • Gossip Protocol + Hinted Handoff. Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers. Example: F can’t see B but needs some ACK, so it asks E and they compare their Merkle Tree roots for range XY; E can’t see B either and its root differs, so B must be down and gets disabled. Hinted handoff: a node receiving a write whose canonical node is B takes care of it for now, and lets B know when B is available again. 37
  • Merkle Trees (Hash Trees). Leaves: hashes of data blocks. Nodes: hashes of their children (e.g. ROOT = hash(A, B), A = hash(C, D), B = hash(E, F), with C..F = hash(Data Block 001..004)). Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data. http://en.wikipedia.org/wiki/Hash_tree 38
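
A minimal Python sketch of the hash-tree idea (not any specific database's anti-entropy code): leaves hash the data blocks, inner nodes hash their children, and comparing roots tells two replicas whether they need to walk down and exchange only the diverging blocks.

    # Merkle (hash) tree over data blocks.

    import hashlib

    def sha(b):
        return hashlib.sha1(b).hexdigest()

    def merkle_root(blocks):
        level = [sha(b) for b in blocks]
        while len(level) > 1:
            if len(level) % 2:                      # duplicate last hash if odd
                level.append(level[-1])
            level = [sha((level[i] + level[i + 1]).encode())
                     for i in range(0, len(level), 2)]
        return level[0]

    replica1 = [b"001", b"002", b"003", b"004"]
    replica2 = [b"001", b"002", b"XXX", b"004"]     # one block diverged
    print(merkle_root(replica1) == merkle_root(replica2))   # False -> walk the subtrees
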
  • Read Repair. GET(k, R=2): the value is read from R replicas; if one replica returns the latest version (k=XYZ, v.2) and another returns a stale one (k=ABC, v.1), the latest value is returned to the client and an UPDATE(k = XYZ) is pushed to the stale replica. 39
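
The read-repair flow above, sketched in Python (Replica and the version field are illustrative, not a real client API): read from R replicas, return the newest version, and push it back to any replica that returned a stale copy.

    # Read repair sketch.

    class Replica:
        def __init__(self, store):
            self.store = store

    def get_with_read_repair(replicas, key, r=2):
        answers = [(node, node.store.get(key)) for node in replicas[:r]]
        newest = max(answers, key=lambda a: a[1]["version"])[1]
        for node, value in answers:
            if value["version"] < newest["version"]:
                node.store[key] = newest          # repair the stale replica
        return newest

    a = Replica({"k": {"version": 2, "value": "XYZ"}})
    c = Replica({"k": {"version": 1, "value": "ABC"}})
    print(get_with_read_repair([a, c], "k"))      # returns v.2 and repairs replica c
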
  • NoSQL Break-down Key-value stores, Column Families, Document-oriented dbs, Graph databases 40
  • Focus Of Different Data Models (size vs. complexity): Key-Value Stores, Column Families, Document Databases, Graph Databases, in decreasing order of size and increasing order of complexity handled. http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 41
  • 1) Key-value stores Amazon Dynamo Paper Data model: collection of key-value pairs 42
  • Voldemort (AP). Dynamo DHT implementation: consistent hashing, vector clocks, conflicts resolved at read and write time. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP / Sockets; Java, Thrift, Avro, Protobuf (values as JSON, Java String, byte[], Thrift, Avro, ProtoBuf). PERSISTENCE: pluggable storage engine (BDB/MySQL). CONCURRENCY: MVCC, simple optimistic locking for multi-row updates. 43
  • Membase (CP). DHT (K-V), no SPoF. LICENSE: Apache 2. LANGUAGE: C/C++, Erlang. API/PROTOCOL: REST/JSON, memcached. membase = memcached + persistence + distributed replication (fail-over HA) + in-memory + rebalancing. “VBuckets”: unit of consistency and replication, owner of a subset of the cluster key space, located via hash function + table lookup. All metadata kept in memory (high throughput / low latency). Manual/programmatic failover via the Management REST API. http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • Riak (AP). LICENSE: Apache 2. LANGUAGE: C, Erlang. API/PROTOCOL: REST HTTP*, ProtoBuf. Buckets → K-V; “Links” (~relations); targeted JS Map/Reduce; tune-able consistency (one-quorum-all). 45
  • Redis (CP). K-V store / “Data Structures Server”: Map, Set, Sorted Set, Linked List; Set/Queue operations, Counters, Pub-Sub, Volatile keys. LICENSE: BSD. LANGUAGE: ANSI C. API/PROTOCOL: Telnet-like*. PERSISTENCE: in memory with bg snapshots, 10-100K op/s (whole dataset in RAM + VM), persistence via snapshotting (tunable fsync freq.). REPLICATION: master-slave; distributed if the client supports consistent hashing. http://redis.io/presentation/Redis_Cluster.pdf 46
  • 2) Column Families Google BigTable paper Data model: big table, column families 47
  • Google BigTable Paper. Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp). Example: row “com.cnn.www” has a column “contents:html” with versions <html>... at t3, t5, t6, and columns “anchor:cnnsi.com” and “anchor:my.look.ca” (same column family) with “CNN” at t9 and “CNN.com” at t8. The column family is the unit of ACLs; row updates are atomic; old versions are removed by automatic GC. http://labs.google.com/papers/bigtable-osdi06.pdf 48
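
The BigTable data model above is essentially a nested sorted map; a small Python sketch (illustrative, not Google's implementation) keyed by (row_key, column, timestamp):

    # BigTable's data model as a sparse, multi-dimensional map:
    # (row_key, column_family:qualifier, timestamp) -> value.

    from collections import defaultdict

    table = defaultdict(lambda: defaultdict(dict))

    def put(row, column, timestamp, value):
        table[row][column][timestamp] = value

    def get_latest(row, column):
        versions = table[row][column]
        return versions[max(versions)] if versions else None

    put("com.cnn.www", "contents:html", 6, "<html>...")
    put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
    put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")
    print(get_latest("com.cnn.www", "anchor:cnnsi.com"))   # "CNN"
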
  • Google BigTable: Data Structure. SSTable: the smallest building block, a persistent immutable Map[k,v] made of 64KB blocks plus a lookup index; operations: lookup by key / key range scan. Tablet: a dynamically partitioned range of rows (e.g. Aaa → Bar), built from multiple SSTables; the unit of distribution and load balancing. Table: multiple tablets (table segments) make up a table. 49
  • Google BigTable: I/O. Writes go to the tablet log and to an in-memory memtable; reads merge the memtable with the SSTables stored on GFS. A minor compaction flushes the memtable to a new SSTable (BMDiff / Zippy compression); merging / major compaction garbage-collects the SSTables. 50
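
A toy Python sketch of that write path (class and method names are illustrative only): append to the log, apply to the memtable, freeze the memtable into an immutable sorted SSTable on "minor compaction", and answer reads by consulting the memtable first, then the SSTables newest-first.

    class TabletStore:
        def __init__(self):
            self.log, self.memtable, self.sstables = [], {}, []

        def write(self, key, value):
            self.log.append((key, value))     # durability first (tablet log)
            self.memtable[key] = value

        def minor_compaction(self):
            self.sstables.append(dict(sorted(self.memtable.items())))  # immutable, sorted
            self.memtable, self.log = {}, []

        def read(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for sstable in reversed(self.sstables):                    # newest first
                if key in sstable:
                    return sstable[key]
            return None

    t = TabletStore()
    t.write("a", 1); t.minor_compaction(); t.write("a", 2)
    print(t.read("a"))   # 2 (memtable shadows the older SSTable entry)
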
  • Google BigTable: Location Dereferencing. Chubby (a replicated, persisted lock service: 5 replicas, one elected master via quorum, Paxos algorithm used to keep consistency) maintains the tablet server locations and the master file pointing to the Root Tablet, the root of the metadata tree; Metadata Tablets point to the User Tables, with up to 3 levels in the metadata hierarchy. 51
  • Google BigTable: Architecture. The BigTable client sends metadata operations to the BigTable master (fs metadata, ACL, GC, load balancing) and data R/W operations to the Tablet Servers; master and tablet servers exchange heartbeat messages (GC, chunk migration); Chubby tracks the master lock and the log of live servers. 52
  • HBase (CP). OSS implementation of BigTable. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable on HDFS, S3, S3N, EBS (with GZip/LZO CF compression). ZooKeeper as coordinator (instead of Chubby); support for multiple masters. Data sorted by key but evenly distributed across the cluster. Batch streaming, Map/Reduce. 53
  • Hypertable (CP). OSS BigTable implementation, faster than HBase (10-30K/s). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC. HQL (~SQL). Hyperspace (Paxos) used instead of ZooKeeper; dynamically adapts to changes in workload. 54
  • Cassandra (AP). Data model of BigTable, infrastructure of Dynamo. LICENSE: Apache 2. LANGUAGE: Java. PROTOCOL: Thrift, Avro. PERSISTENCE: memtable/SSTable. CONSISTENCY: tunable R/W/N (ONE, QUORUM, ALL). Column Family: row_key → columns (col_name, col_value, timestamp); Super Column Family: row_key → super_columns → columns. Access: keyspace.get("column_family", key, ["super_column",] "column"). P2P with Gossip; RandomPartitioner (MD5) or OrderPreservingPartitioner; range scans, fulltext index (Solandra). http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
  • 3) Document DBs. Lotus Notes. Data model: collection of K-V collections 56
  • CouchDB (AP). LICENSE: Apache 2. LANGUAGE: Erlang. API/PROTOCOL: REST/JSON. PERSISTENCE: append-only B+Tree. CONCURRENCY: MVCC. CONSISTENCY: crash-only design. REPLICATION: multi-master. JSON docs; map-reduce “views” (materialised resultsets); storage + view indexes (B+Tree) [by_id_index, by_seqnum_index]. Replication used as a way to scale transaction volume; conflict resolution at application level; MVCC (copy-on-modify) with volatile versioning; online compaction (very primitive VACUUM); update validation / auth triggers; delayed commits (write performance). http://horicky.blogspot.com/2008/10/couchdb-implementation.html 57
  • CouchDB (AP). MVCC consequences: compaction load, disk space (cf. PgSQL VACUUM). http://enda.squarespace.com/tech/2009/12/8/couchdb-compaction-big-impacts.html http://chesnok.com/talks/mvcc_couchcamp.pdf http://horicky.blogspot.com/2008/10/couchdb-implementation.html 58
  • MongoDB (CP). LICENSE: AGPLv3. LANGUAGE: C++. API/PROTOCOL: REST/BSON*. PERSISTENCE: B+Tree, snapshots. CONCURRENCY: in-place updates. REPLICATION: master-slave, replica sets. BSON serialisation (storage & transfer): bson_encode() / bson_decode(). Auto-sharding, master-slave, auto-failover. B-Tree indexes (on different cols too), geo-spatial indexes. Update in place (no versioning, no append-only log); persistence via replication + snapshotting. GROUP BY via Map/Reduce (well, aggregation). No ACK on updates (or ensure N replicas); new, improved --dur flag. 59
  • 4) Graph databases Graph Theory 60
  • Neo4j. LICENSE: AGPLv3 / Commercial. LANGUAGE: Java. API/PROTOCOL: REST, Java, SPARQL. PERSISTENCE: on-disk linked list. Graph data structure: nodes, relationships, properties on both. Vertical scalability (1000s of times faster, but not distributed); physical structure: linked-list stored on disk; HA cluster with ZooKeeper (nodes = exact replicas); high-performance node traversal; SPARQL. http://docs.neo4j.org/chunked/stable/ha-how.html 61
  • Neo4j - creating nodes and relationships (Java example from the slide):
        NeoService neo = ... // factory
        Transaction tx = neo.beginTx();
        Node n1 = neo.createNode();
        n1.setProperty("name", "John");
        n1.setProperty("age", 35);
        Node n2 = neo.createNode();
        n2.setProperty("name", "Mary");
        n2.setProperty("age", 29);
        n2.setProperty("job", "engineer");
        n1.createRelationshipTo(n2, RelTypes.KNOWS);
        tx.commit();
    61
  • Neo4j - traversing the node space (Java example from the slide):
        Traverser friendsTraverser = n1.traverse(
            Traverser.Order.BREADTH_FIRST,
            StopEvaluator.END_OF_GRAPH,
            ReturnableEvaluator.ALL_BUT_START_NODE,
            RelTypes.KNOWS,
            Direction.OUTGOING
        );
        // Traverse the node space
        System.out.println("John's friends:");
        for (Node friend : friendsTraverser) {
            System.out.printf("At depth %d => %s%n",
                friendsTraverser.currentPosition().getDepth(),
                friend.getProperty("name"));
        }
    61
  • Final Considerations Query modes Achievements and Problems 62
  • Query Modes: a new “SQL”? Map/Reduce 63
  • SQL vs. Map/Reduce (example from http://rickosborne.org/download/SQL-to-MongoDB.pdf):
        SQL:
            SELECT Dim1, Dim2,
                   SUM(Measure1) AS MSum,
                   COUNT(*) AS RecordCount,
                   AVG(Measure2) AS MAvg,
                   MIN(Measure1) AS MMin,
                   MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax
            FROM DenormAggTable
            WHERE (Filter1 IN ('A','B'))
              AND (Filter2 = 'C')
              AND (Filter3 > 123)
            GROUP BY Dim1, Dim2
            HAVING (MMin > 0)
            ORDER BY RecordCount DESC
            LIMIT 4, 8
        MongoDB:
            db.runCommand({
              mapreduce: "DenormAggCollection",
              query: {
                filter1: { $in: ['A', 'B'] },
                filter2: 'C',
                filter3: { $gt: 123 }
              },
              map: function() { emit(
                { d1: this.Dim1, d2: this.Dim2 },
                { msum: this.Measure1, recs: 1, mmin: this.Measure1,
                  mmax: this.Measure2 < 100 ? this.Measure2 : 0 }
              );},
              reduce: function(key, vals) {
                var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
                for (var i = 0; i < vals.length; i++) {
                  ret.msum += vals[i].msum;
                  ret.recs += vals[i].recs;
                  if (vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
                  if ((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax)) ret.mmax = vals[i].mmax;
                }
                return ret;
              },
              finalize: function(key, val) {
                val.mavg = val.msum / val.recs;
                return val;
              },
              out: 'result1',
              verbose: true
            });
            db.result1.find({ mmin: { $gt: 0 } }).sort({ recs: -1 }).skip(4).limit(8);
        Notes: grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set; measures must be manually aggregated; aggregates depending on record counts must wait until finalization; measures can use procedural logic; filters have an ORM/ActiveRecord-looking style; aggregate filtering must be applied to the result set, not in the map-reduce; sort order: ascending 1, descending -1. 64
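
For readers less familiar with the JavaScript shell, the same map / reduce / finalize pipeline can be sketched in plain Python over a list of dicts (illustrative only, independent of MongoDB; field names are made up to echo the example above):

    from collections import defaultdict

    rows = [
        {"dim1": "x", "dim2": "y", "measure1": 10, "measure2": 50,  "filter1": "A"},
        {"dim1": "x", "dim2": "y", "measure1": 30, "measure2": 200, "filter1": "B"},
        {"dim1": "z", "dim2": "y", "measure1": 5,  "measure2": 80,  "filter1": "A"},
    ]

    def map_phase(row):                       # WHERE filter + GROUP BY key + partial aggregates
        if row["filter1"] in ("A", "B"):
            yield (row["dim1"], row["dim2"]), {"msum": row["measure1"], "recs": 1}

    def reduce_phase(values):                 # SUM / COUNT over each group
        out = {"msum": 0, "recs": 0}
        for v in values:
            out["msum"] += v["msum"]
            out["recs"] += v["recs"]
        return out

    groups = defaultdict(list)
    for row in rows:
        for key, value in map_phase(row):
            groups[key].append(value)

    result = {k: reduce_phase(v) for k, v in groups.items()}
    for k, v in result.items():
        v["mavg"] = v["msum"] / v["recs"]     # finalize: AVG derived from SUM / COUNT
    print(result)
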
  • Pig vs. Map/Reduce 65
  • Data model, Relations and Consistency. A step backwards? Scalability, availability and resilience come at a cost 66
  • Big Data: collect, store, organise, analyse, share - we don’t always know up-front which questions we’re going to ask. Werner Vogels, CTO, Amazon - STRATA Conf 2011 67
  • We’re Hiring! http://mediasift.com/careers 68
  • Lorenzo Alberton @lorenzoalberton - Thank you! lorenzo@alberton.info http://www.alberton.info/talks http://joind.in/talk/view/2517 69