
NoSQL - how it works (@pavlobaron)

Slides of my OOP'12 talk



  1. NoSQL. How it works. Pavlo Baron
  2. Geek’s Guide To The Working Life. Pavlo Baron, pavlo.baron@codecentric.de, @pavlobaron
  3. NoSQL is not about … <140’000 things NoSQL is not about> … NoSQL is about choice. (Jan Lehnardt on NoSQL)
  4. (John Muellerleile on NoSQL)
  5. NoSQL addresses the issue of poorly structured data
  6. NoSQL addresses the issue of data management simplicity
  7. NoSQL addresses the issue of data flood
  8. NoSQL addresses the issue of extremely frequent reads/writes
  9. NoSQL addresses the issue of big data streams
  10. NoSQL addresses the issue of real-time data processing and analysis
  11. NoSQL addresses the issue of huge data storage
  12. NoSQL addresses the issue of fast data filtering
  13. NoSQL addresses the issue of complex, deep relations
  14. NoSQL addresses the issue of pure web existences
  15. NoSQL addresses the issue of picking the right tool for the job
  16. How?
  17. Chop in smaller pieces
  18. Chop in bite-size, manageable pieces
  19. Separate reading from writing
  20. Caching. Variations: eager write, append only, lazy write, eventual consistency
  21. Write through (diagram: reads and writes for products/users go through the cache to the data store; on a miss, the cache reads through from the store)
  22. Write back / write snapshotting (diagram: reads and writes hit only the cache; on a miss or a snapshot, the cache writes back to the data store)
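The two caching variations above can be sketched in a few lines. A minimal, hypothetical in-memory sketch (a plain dict stands in for the data store; a real cache would sit in front of a database):

```python
class WriteThroughCache:
    """Eager write: every write goes to the cache AND the store."""
    def __init__(self, store):
        self.store = store   # backing data store (a plain dict here)
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value      # synchronous write-through

    def read(self, key):
        if key not in self.cache:    # miss: read through to the store
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBackCache:
    """Lazy write: writes stay in the cache until a later flush."""
    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()           # keys not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)          # store is only eventually updated

    def flush(self):
        """Write back, e.g. periodically or as a snapshot."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The write-back variant trades durability for write latency: until `flush` runs, the store lags behind the cache, which is exactly the eventual consistency the slide names.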
  23. Design for a theoretically unlimited amount of data
  24. Append, update, mark, recycle; don’t delete and restructure
  25. Minimize hard relations
  26. Parallelize and distribute
  27. Avoid single bottlenecks
  28. Decentralize with “equal” nodes
  29. Build upon consensus, agreement, voting, quorum
  30. Gossip – RM (diagram: replica managers RM1 and RM2 exchange updates; each keeps a replica clock, a value, an update log, an executed-operation table and a stable clock)
  31. Gossip – node down/up (diagram: four nodes gossip updates and reads; when Node 4 goes down the others record “4 down”, and when it comes back up it catches up via gossip)
  32. Don’t trust time and timestamps
  33. Clocks. V(i), V(j): competing. Conflict resolution: 1) siblings, client; 2) merge, system; 3) voting, system
  34. Timestamps (diagram: three nodes with drifting wall clocks; identical timestamps on different nodes make ordering ambiguous)
  35. Logical clocks (diagram: Lamport counters per node; equal counters on different nodes still leave the ordering ambiguous)
  36. Vector clocks (diagram: three nodes; each event carries a per-node counter vector, e.g. 1,0,0 → 2,2,0 → 3,2,0 → 4,3,3)
  37. Vector clocks (diagram: the same scheme across four nodes)
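The vector-clock bookkeeping sketched on slides 36 and 37 boils down to three operations: bump your own counter on a local event, take the element-wise maximum when replicas sync, and compare clocks to detect conflicts. A minimal sketch with hypothetical node names:

```python
def increment(clock, node):
    """A local event on `node` bumps that node's counter."""
    clock = dict(clock)                      # clocks are plain dicts
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Element-wise maximum: taken when replicas synchronize."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen every event that clock b has."""
    return all(a.get(n, 0) >= count for n, count in b.items())

def concurrent(a, b):
    """Neither descends from the other: conflicting siblings that
    must be resolved by the client, a merge, or voting (slide 33)."""
    return not descends(a, b) and not descends(b, a)
```

Two clocks that each only saw their own local event are concurrent; after a merge, the result descends from both, so the conflict is resolved.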
  38. Strive for O(1) for data lookups (# = hash)
  39. Merkle Trees. N, M: nodes; HT(N), HT(M): hash trees. M needs update: obtain HT(N), calc delta(HT(M), HT(N)), pull keys(delta)
  40. Merkle Trees (diagram: the hash trees of nodes a.1 and a.2; differing subtrees such as ab/ad point to the keys that diverge)
  41. Merkle Trees (diagram: comparing the two trees narrows the difference down to a few leaves)
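The sync procedure from slide 39 can be sketched with flat leaf hashes (a real implementation hashes ranges upward into a tree so whole matching subtrees can be skipped); `sync` and its helpers are illustrative names, not any particular product's API:

```python
import hashlib

def leaf_hashes(data):
    """Hash each key's value; a real tree hashes ranges upward."""
    return {k: hashlib.sha256(repr(v).encode()).hexdigest()
            for k, v in data.items()}

def root_hash(leaves):
    """Root of the (flattened) tree: hash of all leaf hashes."""
    return hashlib.sha256(
        "".join(h for _, h in sorted(leaves.items())).encode()
    ).hexdigest()

def delta(mine, theirs):
    """Keys whose hashes differ, or that only one side has."""
    return {k for k in set(mine) | set(theirs)
            if mine.get(k) != theirs.get(k)}

def sync(m_data, n_data):
    """M needs update: obtain HT(N), compute the delta, pull keys."""
    ht_m, ht_n = leaf_hashes(m_data), leaf_hashes(n_data)
    if root_hash(ht_m) == root_hash(ht_n):
        return set()                       # identical: nothing to do
    keys = delta(ht_m, ht_n)
    for k in keys:
        if k in n_data:
            m_data[k] = n_data[k]          # pull only the delta
    return keys
```

The payoff is the early exit: if the roots match, no keys are transferred at all, which is what makes anti-entropy between large replicas cheap.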
  42. Vertical sharding (diagram: users, addresses, contracts and orders live on Node 1; invoices, products and items on Node 2; “read contract” for user=foo is routed to Node 1)
  43. Range based sharding (diagram: Node 1 holds users id(1-N), addresses zip(1234-2345) and products; Node 2 holds users id(1-M) and addresses zip(2346-9999); reads and writes are routed by range)
  44. Hash based sharding. Start with 3 nodes: node N = # mod 3. Add 2 nodes: N = # mod 5. Kill 2 nodes: N = # mod 3
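Slide 44's warning can be made concrete: with naive `# mod N` placement, changing the node count remaps most keys. A small sketch, using integer keys as stand-ins for hash values:

```python
def node_for(key, n_nodes):
    return key % n_nodes        # integer keys stand in for hash(key)

keys = range(1000)
before = {k: node_for(k, 3) for k in keys}    # start with 3 nodes
after = {k: node_for(k, 5) for k in keys}     # add 2 nodes
moved = sum(1 for k in keys if before[k] != after[k])
print(moved)    # 799: almost 80% of all keys must be rehashed and moved
```

A key stays put only when its residue mod 3 equals its residue mod 5, i.e. for 3 residues out of every 15. This mass migration on every topology change is the problem the ring on slide 49 solves.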
  45. Insert key (diagram: Key = “foo” is hashed to # = N and stored on the owning node)
  46. Add 2 nodes (diagram: all keys are rehashed; most leave their node and move)
  47. Lookup (diagram: Key = “foo” is hashed to # = N, routed to its node, and Value = “bar” is returned)
  48. Remove node (diagram: again all keys are rehashed and most move)
  49. The ring. X-bit integer space: 0 <= N <= 2^X. Or angles: 0 <= A <= 2 x Pi, x(N) = cos(A), y(N) = sin(A)
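The ring on slide 49 is what consistent hashing builds on: nodes and keys hash into the same integer space, and a key belongs to the first node clockwise from its hash. Adding or removing a node then only moves the keys in one arc, not most of them. A minimal sketch (no vnodes, node names hypothetical):

```python
import bisect
import hashlib

def h(s, bits=32):
    """Map a string into the 0 .. 2^bits - 1 ring."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, nodes):
        # each node occupies one point on the ring, sorted by hash
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        # first node clockwise from the key's hash, wrapping at 2^X
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]
```

Production systems place each physical node at many points (the vnodes of slide 50) so that load spreads evenly and a joining node takes small slices from everyone.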
  50. Clustering. 12 partitions (constant): 3 nodes with 4 vnodes each; after adding a node: 4 nodes with 3 vnodes each. Alternatives: 3 nodes with 2 x 5 + 1 x 2 vnodes; container based
  51. Quorum. V: vnodes holding a key; W: write quorum; R: read quorum; DW: durable write quorum. W > 0.5 * V; R + W > V
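The quorum rules on slide 51 are easy to check mechanically; a tiny sketch:

```python
def quorum_ok(v, r, w):
    """V vnodes hold the key; R/W are the read/write quorum sizes."""
    # W > V/2 forbids two disjoint write majorities;
    # R + W > V forces every read set to overlap every write set.
    return w > 0.5 * v and r + w > v

print(quorum_ok(3, 2, 2))   # True:  the common V=3, R=W=2 setup
print(quorum_ok(3, 1, 1))   # False: a read may miss the latest write
```

Tuning R and W within these bounds is one of the latency/availability knobs slides 82-83 refer to: lower quorums answer faster but tolerate less.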
  52. Insert key (sloppy quorum) (diagram: Key = “foo”, # = N, W = 2; the write is replicated and acknowledged once 2 vnodes confirm)
  53. Add node (diagram: the affected partitions are copied to the new node, then the old holders leave)
  54. Lookup key (sloppy quorum) (diagram: Key = “foo”, # = N, R = 2; Value = “bar” is returned once 2 vnodes answer)
  55. Remove node (diagram: its partitions are copied to the remaining nodes, then it leaves)
  56. Minimize the distance between the data and its processors
  57. Utilize commodity hardware
  58. MapReduce. Model: functional map/fold. Out-of-database MR: irrelevant here. In-database MR: data locality, no splitting needed, distributed querying, distributed processing
  59. In-database MapReduce (diagram: the query “Alice” is mapped on Nodes A, B and C, where the data lives; Node X reduces the partial results into a hit list)
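Slide 59's in-database MapReduce can be sketched as: each node maps over its local data (data locality, no splitting), and a coordinator reduces the per-node partial results into one hit list. Node contents here are hypothetical:

```python
nodes = {                       # data already distributed per node
    "A": ["Alice", "Bob"],
    "B": ["Alice", "Carol"],
    "C": ["Dave", "Alice"],
}

def map_phase(records, query):
    """Runs where the data lives; emits the local matches."""
    return [r for r in records if r == query]

def reduce_phase(partials):
    """The coordinator merges the per-node hit lists."""
    return [hit for part in partials for hit in part]

partials = [map_phase(data, "Alice") for data in nodes.values()]
hits = reduce_phase(partials)
print(len(hits))   # 3: one match per node for "Alice"
```

The point is that only the small partial results cross the network, never the raw data, which is why in-database MR scales where shipping data to a query engine does not.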
  60. Design with eventual actuality/consistency in mind
  61. BASE: Basically Available, Soft-state, Eventually consistent. The opposite of ACID
  62. Read-your-writes consistency (diagram: after FE1 writes v2, its own subsequent reads return v2, never the older v1)
  63. Session consistency (diagram: read-your-writes guaranteed within one session against the same FE)
  64. Monotonic read consistency (diagram: once FE2 has read v3, no later read returns an older version)
  65. Monotonic write consistency (diagram: writes from one FE are applied in the order they were issued)
  66. Eventual consistency (diagram: reads may return stale versions for a while, but all replicas converge on v3)
  67. Implement redundancy and replication
  68. Replication – state transfer (diagram: the target node takes a snapshot of the source node’s state: addresses, products, users)
  69. Replication – operational transfer (diagram: the target node takes the source node’s operations (deletes, inserts, updates) and runs them)
  70. Eager replication – 3PC (diagram: the coordinator asks the cohorts “can commit?”, collects “yes”, sends pre-commit, collects ACKs, then commits)
  71. Eager replication – 3PC (failure) (diagram: the same exchange, but a missing ACK makes the coordinator abort)
  72. Eager replication – Paxos Commit. 2F + 1 acceptors overall, F + 1 correct ones to achieve consensus. Stability, Consistency, Non-Triviality, Non-Blocking
  73. Eager replication – Paxos Commit (diagram: RM1 begins the commit; the initial leader sends phase-2a “prepare” to the other RMs, and the acceptors record phase-2b “prepared”)
  74. Eager replication – Paxos Commit (failure) (diagram: the acceptors time out without a decision, so the leader aborts the commit)
  75. Lazy replication – master/slave (diagram: writes go to the master node holding addresses, products and users; the slave nodes serve reads)
  76. Lazy replication – master/master (diagram: each master accepts writes for its key ranges (users id(1-N)/id(1-M), items id(1-K)/id(1-L)) and serves reads)
  77. Hinted handoff. N: node, G: group including N. While node(N) is unavailable: replicate to G or store data(N) locally, with a hint for later handoff. Once node(N) is alive: hand the data off to node(N)
  78. Direct handoff (diagram: Key = “foo”, # = N; one replica fails, so the write is replicated with handoff hint = true)
  79. Replica recovers (diagram: the hinted data is handed off to it)
  80. All replicas fail (diagram: Key = “foo”, # = N is stored locally with handoff hint = true)
  81. All replicas recover (diagram: the hinted data is handed off and replicated)
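The hinted-handoff protocol from slide 77 in miniature (all names hypothetical): a write destined for a dead replica is parked on a fallback node with a hint, and replayed when the target recovers.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}       # locally stored key/values
        self.hints = []      # (intended target, key, value) triples

def write(key, value, target, fallback):
    """Write to the target replica, or park the write with a hint."""
    if target.alive:
        target.data[key] = value
    else:
        fallback.data[key] = value
        fallback.hints.append((target, key, value))  # handoff hint

def handoff(node):
    """Called when peers recover: replay the hinted writes."""
    remaining = []
    for target, key, value in node.hints:
        if target.alive:
            target.data[key] = value     # hand the data off
        else:
            remaining.append((target, key, value))
    node.hints = remaining
```

This is what makes the AP side of slide 89 work: the write is accepted even during the failure, at the price of the recovering replica serving stale data until the handoff completes.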
  82. Consider latency a tuning knob
  83. Consider availability a tuning knob
  84. CAP – the variations. CA: irrelevant. CP: eventually unavailable, offering maximum consistency. AP: eventually inconsistent, offering maximum availability
  85. CAP – the tradeoff (diagram: consistency vs. availability)
  86. CP (diagram: a write of v2 on Replica 1 is propagated synchronously; reads on both replicas return v2)
  87. CP (partition) (diagram: during a partition the write of v2 cannot reach Replica 2; to stay consistent, the system gives up availability)
  88. AP (diagram: a write of v2 on Replica 1 is replicated lazily; reads return v2 once replication completes)
  89. AP (partition) (diagram: during a partition Replica 1 accepts the write of v2 with a handoff hint, while Replica 2 keeps serving the stale v1)
  90. Build upon an appropriate storage strategy, not upon a general one
  91. Design for frequent structure changes
  92. Most queries are known up front. Ad-hoc queries are seldom necessary. Prepared queries can speed up data retrieval enormously. An index can help ad-hoc querying, and can be externalized. Indexes should be incremental
  93. Store as: Document (semi-structured), Key/Value (unstructured), Graph (special case), ... Externalize relations and properties
  94. The graph case. Saving a graph in a table leads to: limited depth, fixed relation types, expensive nested subselects, a tendency toward full table scans. Graph data stores store graph data optimally
  95. Thank you
  96. Many graphics I’ve created myself. Some images originate from istockphoto.com, except a few taken from Wikipedia and product pages
