NoSQL - how it works (@pavlobaron)


Slides of my OOP'12 talk

Transcript

  • 1. NoSQL. How it works. Pavlo Baron
  • 2. Geek's Guide To The Working Life. Pavlo Baron, pavlo.baron@codecentric.de, @pavlobaron
  • 3. NoSQL is not about … <140,000 things NoSQL is not about> … NoSQL is about choice (Jan Lehnardt on NoSQL)
  • 4. (John Muellerleile on NoSQL)
  • 5. NoSQL addresses the issue of poorly structured data
  • 6. NoSQL addresses the issue of data management simplicity
  • 7. NoSQL addresses the issue of data flood
  • 8. NoSQL addresses the issue of extremely frequent reads/writes
  • 9. NoSQL addresses the issue of big data streams
  • 10. NoSQL addresses the issue of real-time data processing and analysis
  • 11. NoSQL addresses the issue of huge data storage
  • 12. NoSQL addresses the issue of fast data filtering
  • 13. NoSQL addresses the issue of complex, deep relations
  • 14. NoSQL addresses the issue of pure web existences
  • 15. NoSQL addresses the issue of picking the right tool for the job
  • 16. How?
  • 17. Chop into smaller pieces
  • 18. Chop into bite-size, manageable pieces
  • 19. Separate reading from writing
  • 20. Caching. Variations: eager write, append only; lazy write, eventual consistency
  • 21. Write through (diagram: reads and writes go to the cache; writes pass through to the data store; read misses are read through from it)
  • 22. Write back / write snapshotting (diagram: reads and writes go to the cache; the cache writes back to the data store asynchronously)
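The two cache variations above can be sketched in a few lines. This is a minimal illustration, not any particular product's API; WriteThroughCache and WriteBackCache are hypothetical names, and backing_store stands in for any slow data store:

class WriteThroughCache:
    """Eager write: every write hits the data store immediately."""
    def __init__(self, backing_store):
        self.store = backing_store   # any dict-like slow store
        self.cache = {}

    def write(self, key, value):
        self.store[key] = value      # write through to the store first
        self.cache[key] = value

    def read(self, key):
        if key not in self.cache:    # miss: read through from the store
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBackCache:
    """Lazy write: writes stay in the cache and are flushed later."""
    def __init__(self, backing_store):
        self.store = backing_store
        self.cache = {}
        self.dirty = set()

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)          # the store is updated only on flush

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]
        return self.cache[key]

    def flush(self):
        for key in self.dirty:       # until this runs, the store lags behind
            self.store[key] = self.cache[key]
        self.dirty.clear()

The write-back variant is what makes the store only eventually consistent: until flush() runs, readers going directly to the store see stale data.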
  • 23. Design for a theoretically unlimited amount of data
  • 24. Append, update, mark, recycle; don't delete and restructure
  • 25. Minimize hard relations
  • 26. Parallelize and distribute
  • 27. Avoid single bottlenecks
  • 28. Decentralize with “equal” nodes
  • 29. Build upon consensus, agreement, voting, quorum
  • 30. Gossip – RM (diagram: replica managers RM1 and RM2, each holding a replica clock, clock table, value, stable clock, update log and executed operation table; writes and updates propagate between the RMs as gossip)
  • 31. Gossip – node down/up (diagram: four nodes exchanging updates and reads; node 4 goes down, comes back up and catches up via gossip)
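A toy version of such gossip-based anti-entropy, just to make the mechanism concrete (all names hypothetical; real systems use richer clocks than a per-key counter):

import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {}              # key -> (version, value)

    def update(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def gossip_with(self, peer):
        # exchange state; the higher version wins on both sides
        for key in set(self.state) | set(peer.state):
            newest = max(self.state.get(key, (0, None)),
                         peer.state.get(key, (0, None)))
            self.state[key] = newest
            peer.state[key] = newest

nodes = [GossipNode("node%d" % i) for i in range(4)]
nodes[0].update("x", "v1")
for _ in range(20):                  # random pairings spread the update
    a, b = random.sample(nodes, 2)
    a.gossip_with(b)
print([n.state.get("x") for n in nodes])   # with high probability, (1, 'v1') everywhere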
  • 32. Don't trust time and timestamps
  • 33. Clocks. V(i), V(j): competing. Conflict resolution: 1: siblings, client; 2: merge, system; 3: voting, system
  • 34. Timestamps (diagram: wall-clock timestamps per node drift and collide – Node 1: 10:00, 10:10, 10:20; Node 2: 10:01, 10:11, 10:20; Node 3: 9:59, 10:09, 10:18, 10:19)
  • 35. Logical clocks? (diagram: per-node counters – Node 1: 1, 4, 5, 6, 7; Node 2: 2, 3, 6, 7, ?; Node 3: 2, 4, 5, 6, 7)
  • 36. Vector clocks (diagram: three nodes advancing vectors – Node 1: 1,0,0 → 2,2,0 → 3,2,0 → 4,3,3; Node 2: 1,1,0 → 1,2,0 → 1,3,3 → 4,4,3; Node 3: 1,0,1 → 1,2,2 → 1,2,3 → 4,3,4)
  • 37. Vector clocks (diagram: the same scheme across four nodes – vectors such as 1,0,0,0 → 1,1,0,0 → 1,2,0,0 → 1,3,0,3 advancing per node)
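In code, the essence of slides 33–37 is a partial-order comparison: a clock that is greater or equal in every component dominates, and two clocks where neither dominates are concurrent, producing the siblings the client has to resolve. A minimal sketch:

def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def compare(a, b):
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a dominates"         # b happened before a
    if b_ge:
        return "b dominates"
    return "concurrent"              # competing versions: siblings

v1 = increment({}, "node1")          # {'node1': 1}
v2 = increment(v1, "node2")          # descends from v1
v3 = increment(v1, "node3")          # also descends from v1
print(compare(v2, v1))               # a dominates
print(compare(v2, v3))               # concurrent -> client must resolve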
  • 38. Strive for O(1) for data lookups
  • 39. Merkle Trees. N, M: nodes; HT(N), HT(M): hash trees. M needs an update: obtain HT(N), calc delta(HT(M), HT(N)), pull keys(delta)
  • 40. Merkle Trees (diagram: the hash trees of nodes a.1 and a.2 compared top-down, subtree by subtree)
  • 41. Merkle Trees (diagram: only the subtrees whose hashes differ are walked, so only the stale keys are exchanged)
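A simplified sketch of the comparison on slides 39–41, flattening the tree to one level of hash buckets under a root hash (a real Merkle tree recurses, but the idea is the same): equal root hashes mean the nodes are in sync, and only the buckets whose hashes differ need their keys pulled.

import hashlib

def h(data):
    return hashlib.sha1(data.encode()).hexdigest()

def bucket_hashes(store, buckets=4):
    hashes = [""] * buckets
    for key in sorted(store):
        b = int(h(key), 16) % buckets        # deterministic key -> bucket
        hashes[b] = h(hashes[b] + key + store[key])
    return hashes

def differing_buckets(store_a, store_b, buckets=4):
    ha = bucket_hashes(store_a, buckets)
    hb = bucket_hashes(store_b, buckets)
    if h("".join(ha)) == h("".join(hb)):     # equal roots: nothing to sync
        return []
    return [i for i in range(buckets) if ha[i] != hb[i]]

a = {"ab": "1", "ac": "2", "ad": "3"}
b = {"ab": "1", "ac": "2", "ad": "9"}        # one stale value on the peer
print(differing_buckets(a, b))               # pull only that bucket's keys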
  • 42. Vertical sharding (diagram: Node 1 holds users, addresses, contracts, orders; Node 2 holds invoices, products, items; a "read contract" for user=foo is routed to the node holding contracts)
  • 43. Range based sharding (diagram: Node 1 holds users id(1-N) and addresses zip(1234-2345); Node 2 holds users id(1-M), addresses zip(2346-9999) and products; reads and writes are routed by range)
  • 44. Hash based sharding. Start with 3 nodes: N = # mod 3. Add 2 nodes: N = # mod 5. Kill 2 nodes: N = # mod 3
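The weakness of plain mod-N sharding that slide 44 hints at is easy to demonstrate: changing the node count remaps almost every key.

import hashlib

def node_for(key, node_count):
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % node_count

keys = ["key%d" % i for i in range(1000)]
before = {k: node_for(k, 3) for k in keys}   # start with 3 nodes
after  = {k: node_for(k, 5) for k in keys}   # add 2 nodes
moved = sum(1 for k in keys if before[k] != after[k])
print("%.0f%% of keys moved" % (100.0 * moved / len(keys)))

Roughly 80% of the keys move when going from 3 to 5 nodes; consistent hashing on the ring (next slides) limits the churn to the ranges a new node actually takes over.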
  • 45. Insert key (diagram: key "foo" hashes to # = N, which determines the target node)
  • 46. Add 2 nodes (diagram: each existing key is either rehashed to a new node or left where it is)
  • 47. Lookup (diagram: key "foo" hashes to the same # = N, the lookup lands on the same node, value "bar" is returned)
  • 48. Remove node (diagram: the removed node's keys are rehashed onto the remaining nodes; the rest stay put)
  • 49. The ring. X-bit integer space: 0 <= N <= 2^X. Or 2·Pi: 0 <= A <= 2·Pi, x(N) = cos(A), y(N) = sin(A)
  • 50. Clustering. 12 partitions (constant): 3 nodes with 4 vnodes each; add a node: 4 nodes with 3 vnodes each. Alternatives: 3 nodes with 2 x 5 + 1 x 2 vnodes; container based
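A compact sketch of the ring with vnodes from slides 49–50 (hypothetical names; real implementations add replication and partition ownership on top): each node claims several positions on the ring, and a key belongs to the first vnode clockwise from its hash.

import bisect
import hashlib

def position(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=4):
        self.ring = sorted((position("%s-vnode%d" % (node, v)), node)
                           for node in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # first vnode clockwise from the key's position, wrapping around
        i = bisect.bisect(self.points, position(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node1", "node2", "node3"])
print(ring.node_for("foo"))          # stable: same key, same node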
  • 51. Quorum. V: vnodes holding a key; W: write quorum; R: read quorum; DW: durable write quorum. W > 0.5 · V; R + W > V
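The two inequalities on this slide are the whole trick: if R + W > V, any read quorum overlaps any write quorum in at least one vnode, so a read always touches at least one replica that saw the latest acknowledged write. As a check:

def quorum_ok(V, R, W):
    return W > 0.5 * V and R + W > V

print(quorum_ok(V=3, R=2, W=2))      # True: the usual 3-replica configuration
print(quorum_ok(V=3, R=1, W=1))      # False: reads may miss the last write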
  • 52. Insert key, sloppy quorum (diagram: key "foo", # = N, W = 2; the write is replicated and acknowledged once 2 vnodes report ok)
  • 53. Add node (diagram: some partitions are copied over to the new node, the others stay where they are)
  • 54. Lookup key, sloppy quorum (diagram: key "foo", # = N, R = 2; value "bar" is returned once 2 vnodes answer)
  • 55. Remove node (diagram: the leaving node's partitions are copied to the remaining nodes)
  • 56. Minimize the distance between the data and its processors
  • 57. Utilize commodity hardware
  • 58. MapReduce. Model: functional map/fold. Out-database MR: irrelevant here. In-database MR: data locality, no splitting needed, distributed querying, distributed processing
  • 59. In-database MapReduce (diagram: the query "Alice" is mapped on Nodes A, B and C, each over its local data; Node X reduces the partial results into one hit list)
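A toy rendering of the query on slide 59, with made-up node contents: the map phase runs on each node against its local data (that is the data-locality point), and the coordinator only reduces the partial hit lists.

nodes = {                            # made-up local data per node
    "A": [{"name": "Alice", "id": 1}, {"name": "Bob", "id": 2}],
    "B": [{"name": "Alice", "id": 3}],
    "C": [{"name": "Carol", "id": 4}],
}

def map_phase(records, query):
    # runs on each node, next to its local data
    return [r for r in records if r["name"] == query]

def reduce_phase(partials):
    # runs on the coordinating node, merging the per-node hit lists
    return [hit for partial in partials for hit in partial]

partials = [map_phase(records, "Alice") for records in nodes.values()]
print(reduce_phase(partials))        # hits from nodes A and B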
  • 60. Design with eventual actuality/consistency in mind
  • 61. BASE: Basically Available, Soft-state, Eventually consistent. The opposite of ACID
  • 62. Read your write consistency (diagram: the front end that wrote v2 subsequently reads v2, while another front end may still read v1)
  • 63. Session consistency (diagram: read-your-write guarantees hold within a session; a different session may still see an older version)
  • 64. Monotonic read consistency (diagram: successive reads of a front end never return a version older than one it has already seen)
  • 65. Monotonic write consistency (diagram: writes from the same front end are applied to the data store in the order they were issued)
  • 66. Eventual consistency (diagram: after a write, reads on all front ends converge to the new version over time)
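Read-your-write and session consistency can be layered on an eventually consistent store by tracking versions client-side. A hypothetical sketch, not any specific product's API: the session remembers the highest version it wrote per key and skips replicas that lag behind it.

class Session:
    def __init__(self, replicas):
        self.replicas = replicas     # list of dicts: key -> (version, value)
        self.floor = {}              # key -> lowest version this session accepts

    def write(self, key, value, replica):
        version = replica.get(key, (0, None))[0] + 1
        replica[key] = (version, value)
        self.floor[key] = version    # read-your-write: never go below this

    def read(self, key):
        for replica in self.replicas:
            version, value = replica.get(key, (0, None))
            if version >= self.floor.get(key, 0):
                self.floor[key] = version   # monotonic reads, too
                return value
        return None                  # every replica lags: wait or retry

r1, r2 = {}, {}
session = Session([r1, r2])
session.write("k", "v2", r1)         # r2 has not replicated yet
print(session.read("k"))             # 'v2' - a lagging replica is skipped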
  • 67. Implement redundancy and replication
  • 68. Replication – state transfer (diagram: the target node takes a copy of the source node's state: users, addresses, products)
  • 69. Replication – operational transfer (diagram: the target node takes the source node's operations – inserts, updates, deletes – and runs them)
  • 70. Eager replication – 3PC (diagram: the coordinator asks both cohorts "can commit?", collects "yes", sends pre-commit, collects ACKs, then commits)
  • 71. Eager replication – 3PC (failure) (diagram: a cohort fails to acknowledge, so the coordinator aborts instead of committing)
  • 72. Eager replication – Paxos Commit. 2F + 1 acceptors overall, F + 1 correct ones to achieve consensus. Properties: Stability, Consistency, Non-Triviality, Non-Blocking
  • 73. Eager replication – Paxos Commit (diagram: RM1 begins the commit, the leader and other RMs prepare, and the acceptors record the phase 2a/2b "prepared" messages)
  • 74. Eager replication – Paxos Commit (failure) (diagram: prepares time out without a decision at the acceptors, so the leader aborts the transaction)
  • 75. Lazy replication – master/slave (diagram: writes go to the master node holding users, addresses, products; the slave nodes serve reads)
  • 76. Lazy replication – master/master (diagram: several master nodes each accept writes for their own key ranges – users id(1-N)/id(1-M), items id(1-K)/id(1-L) – and all of them serve reads)
  • 77. Hinted handoff. N: node, G: group including N. If node(N) is unavailable: replicate to G or store data(N) locally, and hint the handoff for later. Once node(N) is alive: hand the data off to node(N). (A code sketch follows the diagrams below.)
  • 78. Direct handoff (diagram: one replica for key "foo", # = N, fails; the write goes to a stand-in node with handoff hint = true and is still replicated)
  • 79. (diagram: the failed replica recovers and the hinted data is handed off to it)
  • 80. (diagram: all replicas for key "foo", # = N, fail; the data is stored elsewhere with handoff hint = true)
  • 81. (diagram: all replicas recover; the hinted data is handed off and re-replicated)
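The hinted-handoff procedure from slide 77, as a sketch (hypothetical names): writes for a dead home node are parked on a stand-in together with a hint, and handed back once the home node reports alive.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}
        self.hints = []              # (home_node_name, key, value)

def write(key, value, home, group):
    if home.alive:
        home.data[key] = value
    else:
        stand_in = next(n for n in group if n.alive and n is not home)
        stand_in.data[key] = value   # keep the data durable somewhere
        stand_in.hints.append((home.name, key, value))

def handoff(node, group):
    # run when we learn that home nodes may be alive again
    remaining = []
    for home_name, key, value in node.hints:
        home = next(n for n in group if n.name == home_name)
        if home.alive:
            home.data[key] = value   # hand the hinted data back home
        else:
            remaining.append((home_name, key, value))
    node.hints = remaining

a, b = Node("a"), Node("b")
a.alive = False
write("foo", "bar", home=a, group=[a, b])   # lands on b with a hint
a.alive = True
handoff(b, [a, b])
print(a.data)                        # {'foo': 'bar'}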
  • 82. Treat latency as a tuning knob
  • 83. Treat availability as a tuning knob
  • 84. CAP – the variations. CA: irrelevant. CP: eventually unavailable, offering maximum consistency. AP: eventually inconsistent, offering maximum availability
  • 85. CAP – the tradeoff (diagram: A, availability, traded against C, consistency)
  • 86. CP (diagram: a write of v2 is applied to both replicas before either serves it, so every read sees the same version)
  • 87. CP (partition) (diagram: during a partition the write cannot reach Replica 2, so consistency is preserved at the cost of availability)
  • 88. AP (diagram: the write of v2 is accepted at Replica 1 and replicated to Replica 2 afterwards; a read in between may still see v1)
  • 89. AP (partition) (diagram: during a partition the write is accepted and recorded as a handoff hint; Replica 2 keeps serving the stale v1 until the hint is handed off)
  • 90. Build upon an appropriate storage strategy, not upon a general one
  • 91. Design for frequent structure changes
  • 92. Most queries are known up front. Ad-hoc queries are seldom necessary. Prepared queries can dramatically speed up data retrieval. An index can help ad-hoc querying, and can be externalized. An index should be incremental
  • 93. Store as: Document (semi-structured), Key/Value (unstructured), Graph (special case), … Externalize relations and properties
  • 94. The graph case. Saving a graph in a table leads to: limited depth, fixed relation types, expensive nested subselects, a tendency to full table scans. Graph data stores store graph data optimally
  • 95. Thank you
  • 96. Many graphics I've created myself. Some images originate from istockphoto.com, and a few are taken from Wikipedia and product pages