Pavlo  Baron http://www.pbit.org [email_address] @pavlobaron
Agenda Blah-blah More blah-blah Color pics Standing ovations
So, come on,  sell  this to me
Agenda Blah-blah More blah-blah Color pics Standing ovations
Somewhere  a mosquito coughs…
…  and somewhere else a data center gets  flooded with  data (PB)
Big Data describes datasets that grow so large that they become awkward to work with using on-hand database management tools (Wikipedia)
NoSQL is not about … <140’000 things NoSQL is not about>… NoSQL   is about  choice (Jan Lehnardt, CouchDB)
Look here brother, who you jivin‘ with that Cosmik Debris ?
(John Muellerleile)
Agenda Blah-blah More blah-blah Color pics Standing ovations
So, you think you can tell heaven  from  hell ...
Where  does your data actually come  from ?
Do you have a million well structured records?
Or a couple of  Gigabytes  of storage?
Does your data get modified every now and then ?
Do you look at your data once a month to create a management report?
Or is your data an  unstructured  chaos?
Do you get flooded by  tera-/petabytes  of data?
Or do you simply get bombed  with data?
Does your data flow on  streams  at a very  high rate  from different locations?
Or do you have to read The Matrix ?
Do you need to distribute your data over the whole world?
Or does your  existence depend on (the quality of) your data?
Look  back and turn back. Look at yourself
Is it the  storage  that you need to focus on?
Or are you mostly preparing data?
Or do you have your customers  spread all  over the world ?
Or do you have complex  statistical analysis  to do?
Or do you have to  filter  data as it comes?
Or is it necessary to  visualize  the data?
...every  blade  is sharp, the  arrows  fly...
Chop  in smaller pieces
Chop in  bite-size , manageable pieces
Separate  reading from writing
Update  and  mark, don’t delete  physically
Minimize  hard relations
Separate archive  from accessible data
Trash everything that only has to be analyzed in real time
Parallelize  and distribute
Avoid single bottlenecks
Decentralize with “equal” nodes
Design with  Byzantine faults in mind
Build upon  consensus , agreement ,  voting ,  quorum
Don’t trust time and timestamps
Strive for O(1)  for data lookups #
Minimize the distance between the data and its processors
Utilize  commodity hardware
Consider  hardware fallibility
Relax  new hardware startup procedure
Bring  data to  its  users
Build upon  asynchronous message passing
Consider  network unreliability
Consider  asynchronous message passing unreliability
Design with  eventual actuality/consistency in mind
Implement  redundancy and  replication
Consider latency a tuning knob
Consider availability a tuning knob
Be prepared for disaster
Utilize the fog/clouds
Design  for theoretically unlimited amount of data
Design  for frequent structure changes
Design  for the all-in-one mix
Agenda Blah-blah More blah-blah Color pics Standing ovations
Why can we never be  sure till we  die . Or have  killed  for an answer
CAP – Consistency, Availability, Partition tolerance
CAP – the variations: CA – irrelevant; CP – eventually unavailable, offering maximum consistency; AP – eventually inconsistent, offering maximum availability
CAP – the tradeoff (diagram: A vs. C)
CP (diagram): Replica 1 / Replica 2, both at v1; a write of v2 reaches both replicas; reads return v1 before and v2 after
CP, partition (diagram): write v2, but it cannot be applied to both replicas; reads still return v1
AP (diagram): v2 is written to one replica and replicated to the other; reads return v1 or v2
AP, partition (diagram): v2 is written to one replica with a handoff hint for the other; reads may return v1 or v2
BASE
BASE – Basically Available, Soft-state, Eventually consistent. Opposite of ACID
Causal  ordering / consistency RM1 RM2 RM3
Read-your-writes consistency (diagram): data store with v1–v3; FE1 writes v2 and reads v2 back, FE2 writes v1 and reads v1 back
Session consistency (diagram): one FE, data store with v1–v3; within Session 1 a write of v1 is followed by a read of v1, within Session 2 a write of v2 by a read of v2
FIFO  ordering RM1 RM2 RM3
Monotonic read consistency (diagram): data store with v1–v4; FE1 reads v2 and then v2 again, FE2 reads v3 then v4 – a read never returns an older version than one already seen
Total  ordering RM1 RM2 RM3
Monotonic write consistency (diagram): FE1 writes v1 then v4, FE2 writes v2 then v3; the data store applies each front end's writes in the order they were issued
Eventual consistency (diagram): FE2 writes v3; FE1's reads return v1, then v2, and eventually v3 once the write has propagated
Run, rabbit, run. Dig that  hole , forget the sun
Logical  sharding
Vertical sharding (diagram): tables split across Node 1 and Node 2 – users, products, contracts on one, items, orders, addresses, invoices on the other; a “read contract” for user=foo goes to the node holding contracts
Range-based sharding (diagram): data split by key range across Node 1 and Node 2 – users id(1-N) and addresses zip(1234-2345) on one, users id(1-M) and addresses zip(2346-9999) on the other (plus a products table); reads and writes are routed to the node owning the range
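To make the routing idea concrete, here is a minimal sketch of range-based sharding; the node names and the exact id ranges are made up for illustration:
```python
# Hypothetical range table: each shard owns a contiguous key range.
USER_SHARDS = [
    (range(1, 1_000_001), "node1"),         # users id 1 .. 1,000,000
    (range(1_000_001, 2_000_001), "node2"),  # users id 1,000,001 .. 2,000,000
]

def route_user(user_id: int) -> str:
    """Return the node that owns this user id."""
    for id_range, node in USER_SHARDS:
        if user_id in id_range:
            return node
    raise KeyError(f"no shard owns user id {user_id}")

print(route_user(42))         # -> node1
print(route_user(1_500_000))  # -> node2
```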
Hash-based sharding: start with 3 nodes – node N = # mod 3; add 2 nodes – N = # mod 5; kill 2 nodes – N = # mod 3
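A small sketch (not from the slides) of why plain mod-N sharding is painful: changing the node count from 3 to 5 remaps most keys.
```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    """Naive hash-based sharding: node = hash(key) mod N."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

keys = [f"key{i}" for i in range(10_000)]

# Start with 3 nodes, then add 2 (mod 3 -> mod 5): how many keys move?
before = {k: node_for(k, 3) for k in keys}
after = {k: node_for(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed their node")  # roughly 80%
```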
Insert key (diagram): Key = “foo”, # = N, stored on node N
Add 2 nodes (diagram): some keys leave (stay where they are), most rehash to other nodes
Lookup key (diagram): Key = “foo”, # = N, node N answers Value = “bar”
Remove node (diagram): again some keys leave, the rest rehash
Consistent  hashing
The ring: an X-bit integer space, 0 <= N <= 2^X; or as an angle 0 <= A <= 2·Pi with x(N) = cos(A), y(N) = sin(A)
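A minimal consistent-hash ring along the lines of this slide; the node names are illustrative, and vnodes and replication are left out:
```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    """Position on the ring: an X-bit integer (128-bit via md5 here)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring: no vnodes, no replication."""
    def __init__(self, nodes):
        self.points = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node."""
        positions = [p for p, _ in self.points]
        i = bisect.bisect_right(positions, ring_position(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node1", "node2", "node3"])
print(ring.node_for("foo"))
# Adding a node only affects the keys between it and its predecessor;
# all other keys keep their node.
```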
Insert key (diagram): Key = “foo”, # = N, stored at ring position N
Add node (diagram): only the keys between the new node and its predecessor are copied/rehashed, all others leave (stay put)
Lookup key (diagram): Key = “foo”, # = N, Value = “bar”
Remove node (diagram): only the removed node's keys are copied/rehashed (a lookup there may miss), all others leave
Clustering: 12 partitions (constant); 3 nodes with 4 vnodes each; add a node: 4 nodes with 3 vnodes each. Alternatives: 3 nodes with 2 x 5 + 1 x 2 vnodes; container based
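A toy sketch of the vnode idea: the partition count stays constant and only the partition-to-node assignment changes (round-robin here is just one possible claim strategy, not how any particular store does it):
```python
def assign_partitions(n_partitions: int, nodes: list) -> dict:
    """Round-robin a constant number of partitions (vnodes) over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(n_partitions)}

print(assign_partitions(12, ["n1", "n2", "n3"]))        # 4 vnodes per node
print(assign_partitions(12, ["n1", "n2", "n3", "n4"]))  # 3 vnodes per node
```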
Quorum – V: vnodes holding a key, W: write quorum, R: read quorum, DW: durable write quorum; W > 0.5 * V and R + W > V
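The two quorum conditions from this slide, checked on a small example (the values are illustrative):
```python
def quorum_ok(v: int, r: int, w: int) -> bool:
    """The two conditions from the slide: W > 0.5 * V and R + W > V."""
    return w > 0.5 * v and r + w > v

print(quorum_ok(v=3, r=2, w=2))  # True: every read overlaps the latest write
print(quorum_ok(v=3, r=1, w=1))  # False: a read may miss the latest write
```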
Insert key, sloppy quorum (diagram): Key = “foo”, # = N, W = 2; node N replicates and answers ok once the write quorum is reached
Add node (diagram): the affected partitions are copied to the new node, the rest leave (stay)
Lookup key, sloppy quorum (diagram): Key = “foo”, # = N, R = 2; Value = “bar” once the read quorum is reached
Remove node (diagram): its partitions are copied to the remaining nodes, the rest leave
Inside out, outside in. Perpetual  change
Clocks – V(i), V(j): competing versions. Conflict resolution: 1: siblings (client), 2: merge (system), 3: voting (system)
Timestamps (diagram): Node 1 / Node 2 / Node 3 stamp events with wall-clock times (9:59 … 10:20) that drift between nodes
Logical clocks (diagram): Node 1 / Node 2 / Node 3 stamp events with monotonically increasing counters; equal counters on different nodes (the “?” events) cannot be ordered
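A minimal Lamport-clock sketch matching this slide: counters only ever grow, and a receive pushes the local clock past the sender's:
```python
class LamportClock:
    """Counters only grow; equal counters on different nodes say nothing
    about the real order of the events."""
    def __init__(self):
        self.time = 0

    def tick(self):            # local event or send
        self.time += 1
        return self.time

    def receive(self, remote_time):
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
sent = a.tick()          # a's clock: 1
print(b.receive(sent))   # b's clock: 2
```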
Vector clocks (diagram): Node 1 / Node 2 / Node 3, each event stamped with a per-node counter vector such as (1,0,0), (1,2,0), …, (4,3,4)
Vector clocks (diagram): the same idea with four nodes – (1,0,0,0), (1,2,0,0), …, (1,3,0,3)
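A small vector-clock sketch (the dictionary representation and node names are assumptions, not from the slides): comparing two clocks detects concurrent siblings, and an element-wise max merges them:
```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a: dict, b: dict) -> dict:
    """Element-wise max: the clock of a value that resolves both versions."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

v1 = {"node1": 1, "node2": 2}
v2 = {"node1": 1, "node3": 1}
print(descends(v1, v2), descends(v2, v1))  # False False -> concurrent siblings
print(merge(v1, v2))  # {'node1': 1, 'node2': 2, 'node3': 1} (key order may vary)
```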
Merkle trees – N, M: nodes; HT(N), HT(M): hash trees. When M needs an update: obtain HT(N), calc delta(HT(M), HT(N)), pull keys(delta)
Merkle trees (diagram): Node a.1 with hash tree a, ab, ac, abc, abd, acb, acc vs. Node a.2 with hash tree a, ab, ad, abe, abd, ada, adb
Merkle trees (diagram): only the differing subtrees are compared – Node a.1: a, ab, abc, abd; Node a.2: a, ab, ad, abd, ada, adb
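A toy two-level version of the delta computation described above; the bucket layout and helper names are made up:
```python
import hashlib

def md5_int(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def hash_tree(data: dict, buckets: int = 4) -> dict:
    """Tiny two-level 'Merkle tree': per-key hashes grouped into buckets,
    with one hash per bucket on top."""
    leaves = {b: {} for b in range(buckets)}
    for key, value in data.items():
        leaves[md5_int(key) % buckets][key] = md5_int(f"{key}={value}")
    tops = {b: md5_int("".join(str(h) for h in sorted(kv.values())))
            for b, kv in leaves.items()}
    return {"tops": tops, "leaves": leaves}

def delta(ht_m: dict, ht_n: dict) -> set:
    """Keys node M must pull from node N: descend only into buckets
    whose top-level hash differs."""
    out = set()
    for b, top in ht_n["tops"].items():
        if ht_m["tops"][b] != top:
            for key, leaf in ht_n["leaves"][b].items():
                if ht_m["leaves"][b].get(key) != leaf:
                    out.add(key)
    return out

n = {"a": "1", "b": "2", "c": "3"}
m = {"a": "1", "b": "stale"}
print(delta(hash_tree(m), hash_tree(n)))  # {'b', 'c'} - only these get transferred
```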
Sudden call  shouldn't take away  the startled  memory
Replication – state transfer (diagram): the target node takes the current state (users, products, addresses) from the source node
Replication – operational transfer (diagram): the target node takes the operations (updates, inserts, deletes) from the source node and runs them
Eager replication – 3PC (diagram): the coordinator asks both cohorts “can commit?”, they answer yes, it sends pre-commit, they ACK, it sends commit, ok
Eager replication – 3PC, failure (diagram): the same flow, but when a cohort's answer or ACK is missing the coordinator aborts
Eager replication – Paxos Commit: 2F + 1 acceptors overall, F + 1 correct ones to achieve consensus. Stability, Consistency, Non-Triviality, Non-Blocking
Eager replication – Paxos Commit (diagram): RM1 issues begin commit, the initial leader sends prepare to the other RMs, each RM sends phase 2a “prepared” to the acceptors, the acceptors answer 2b “prepared”, commit
Eager replication – Paxos Commit, failure (diagram): begin commit and prepare as before, but one RM times out with no decision; the acceptors see only partial 2a “prepared” messages and the transaction is aborted
Lazy replication – master/slave (diagram): writes go to the master node, reads go to the master and the slave node(s); users, products, addresses
Lazy replication – master/master (diagram): several master node(s), each accepting reads and writes for its part of the data – users id(1-N) / id(1-M), items id(1-K) / id(1-L)
Gossip – RM (diagram): RM1 keeps a clock table, replica clock, update log, value clock, value and an executed-operation table; writes become stable updates and are exchanged with RM2 via gossip
Gossip – node down/up (diagram): Node 1–4 exchange updates and reads; membership changes (“4 down”, “4 up”) travel along with the gossiped updates
Hinted handoff – N: node, G: group including N. While node(N) is unavailable: replicate to G or store data(N) locally, keep a handoff hint for later. Once node(N) is alive again: hand the data off to node(N)
Direct replica fails (diagram): Key = “foo”, # = N; the replicated copy is stored with handoff hint = true
Replica recovers (diagram): the hinted data is handed off to it
All replicas fail (diagram): Key = “foo”, # = N is stored elsewhere with handoff hint = true
All replicas recover (diagram): handoff, then normal replication resumes
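A toy sketch of the hinted-handoff flow shown in the last few diagrams (the class and method names are invented for illustration):
```python
class Cluster:
    """Toy hinted handoff: writes meant for a down node are kept on another
    node together with a hint and handed off once the node is back."""
    def __init__(self, nodes):
        self.up = {n: True for n in nodes}
        self.data = {n: {} for n in nodes}
        self.hints = {n: {} for n in nodes}    # hints[holder][target] = {key: value}

    def write(self, target, holder, key, value):
        if self.up[target]:
            self.data[target][key] = value
        else:                                   # store locally, hint for later
            self.hints[holder].setdefault(target, {})[key] = value

    def node_up(self, target):
        self.up[target] = True
        for holder in self.hints:               # hand off the hinted writes
            for key, value in self.hints[holder].pop(target, {}).items():
                self.data[target][key] = value

cluster = Cluster(["n1", "n2"])
cluster.up["n2"] = False
cluster.write("n2", holder="n1", key="foo", value="bar")
cluster.node_up("n2")
print(cluster.data["n2"])  # {'foo': 'bar'}
```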
I’m a  speed  king, see me  fly
MapReduce
MapReduce model: functional map/fold. Out-database MR: irrelevant. In-database MR: data locality, no splitting needed, distributed querying, distributed processing
In-database MapReduce (diagram): the query “Alice” is mapped on Node A, Node B and Node C where N = "Alice"; the results are reduced on Node X into a hit list
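The “functional map/fold” model from the MapReduce slide as a tiny single-process sketch; the documents and the word-count task are made up:
```python
from collections import Counter
from functools import reduce

docs = ["alice meets bob", "bob meets alice again", "alice"]

# map: each document -> partial word counts (in-database MR would run this
# on the node that already holds the document)
mapped = [Counter(doc.split()) for doc in docs]

# reduce (fold): merge the partial results into one final result
def combine(acc: Counter, part: Counter) -> Counter:
    acc.update(part)
    return acc

total = reduce(combine, mapped, Counter())
print(total.most_common(2))  # [('alice', 3), ('meets', 2)]
```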
Caching
Caching variations: eager write, append only; lazy write, eventual consistency
Write-through (diagram): writes hit the cache and are written through to the data store (users, products); reads are served from the cache, a read miss falls back to the data store
Write-back / snapshotting (diagram): writes hit the cache and are written back to the data store later; reads are served from the cache, a read miss falls back to the data store
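A minimal sketch of the write-through vs. write-back variants from these slides (in-memory dictionaries stand in for the cache and the data store):
```python
class WriteThroughCache:
    """Writes hit the cache and the backing store in the same call."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def put(self, key, value):
        self.cache[key] = value
        self.store[key] = value            # write through immediately

    def get(self, key):
        if key not in self.cache:          # read miss -> fall back to the store
            self.cache[key] = self.store[key]
        return self.cache[key]

class WriteBackCache(WriteThroughCache):
    """Writes only touch the cache; the store catches up on flush."""
    def put(self, key, value):
        self.cache[key] = value            # lazy write

    def flush(self):                       # snapshot the cache into the store
        self.store.update(self.cache)

store = {}
wb = WriteBackCache(store)
wb.put("user:1", "Anna")
print(store)   # {} - nothing persisted yet
wb.flush()
print(store)   # {'user:1': 'Anna'}
```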
Physical  storage
Physical storage – row based: irrelevant; column based: many rows, few columns; value based: ad-hoc querying
Column-based storage (diagram): the table ID/Name/City = (1, Peter, London), (2, Anna, Paris) is stored column-wise as 1, 2 | Peter, Anna | London, Paris
Value-based storage (diagram): the same table stored as numbered values 1:1, 3:Peter, 5:London, 2:2, 4:Anna, 6:Paris with row pointers 7:[1, 3, 5], 8:[2, 4, 6]
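The column-based and value-based layouts from the last two slides, spelled out as plain Python structures (this mirrors the slide data, not any particular store's on-disk format):
```python
# The example table from the slides, laid out three ways.
rows = [(1, "Peter", "London"), (2, "Anna", "Paris")]   # row based

# Column based: one list per column - cheap when queries touch few columns
# of many rows.
columns = {
    "ID":   [1, 2],
    "Name": ["Peter", "Anna"],
    "City": ["London", "Paris"],
}

# Value based (as on the slide): every cell gets an id, rows are id lists -
# handy for ad-hoc querying over loosely structured data.
values = {1: 1, 3: "Peter", 5: "London",
          2: 2, 4: "Anna", 6: "Paris",
          7: [1, 3, 5], 8: [2, 4, 6]}

print(columns["Name"])                 # ['Peter', 'Anna']
print([values[i] for i in values[7]])  # [1, 'Peter', 'London']
```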
Agenda Blah-blah More blah-blah Color pics Standing ovations
Thank  you
Many graphics I’ve created myself, though I’d better have asked @mononcqc for help ‘cause his drawings are awesome. Some images originate from istockphoto.com, except a few taken from Wikipedia and product pages.
Big Data & NoSQL - EFS'11 (Pavlo Baron)

These are the slides of my half-day workshop at EFS'11 in Stuttgart, covering some theoretical aspects of NoSQL data stores relevant for dealing with large amounts of data.
