Introduction to Cassandra
Wellington Python User Group
Aaron Morton @aaronmorton
1/12/2010
Disclaimer. This is an introduction, not a reference.
I may, from time to time and for the best possible reasons, bullshit you.
What do you already know about Cassandra?
Get ready.
The next slide has a lot on it.
Cassandra is a distributed, fault tolerant, scalable, column oriented data store.
A word about “column oriented”.
Relax.
It’s different from a row-oriented DB like MySQL. So...
For now, think about keys and values, where each value is a hash / dict.
{'foo' : {'bar' : 'baz'}}   {key : {col_name : col_value}}
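A minimal Python sketch of that mental model, using nothing but plain dicts (same made-up names as the slide):

    # A "row" keyed by 'foo', whose value is itself a dict of column name -> column value.
    rows = {'foo': {'bar': 'baz'}}

    print(rows['foo'])          # {'bar': 'baz'}
    print(rows['foo']['bar'])   # baz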
Cassandra’s data model and on-disk storage are based on the Google Bigtable paper from 2006.
The distributed cluster design is based on the Amazon Dynamo paper from 2007.
Easy. Let’s store ‘foo’ somewhere.
foo
But I want to be able to read it back if one machine fails.
Let’s distribute it on 3 of the 5 nodes I have.
This is the Replication Factor, called RF or N.
Each node has a token that identifies the upper value of the key range it is responsible for.
[Ring diagram: node #1 <= E, #2 <= J, #3 <= O, #4 <= T, #5 <= Z]
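Roughly how a token ring maps a key to its node, as a Python sketch (the letter tokens just mirror the diagram; real tokens are much bigger):

    from bisect import bisect_left

    # Each node owns the key range up to and including its token.
    ring = [('E', '#1'), ('J', '#2'), ('O', '#3'), ('T', '#4'), ('Z', '#5')]

    def node_for(key):
        tokens = [token for token, node in ring]
        i = bisect_left(tokens, key[0].upper())
        return ring[i % len(ring)][1]   # wrap around past the last token

    print(node_for('foo'))  # 'foo' starts with F, so node #2 (<= J)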
The client connects to a random node and asks it to coordinate storing the ‘foo’ key.
Each node knows about all other nodes in the cluster, including their tokens.
This is achieved using a Gossip protocol. Every second each node shares its full view of the cluster with 1 to 3 other nodes.
Our coordinator is node 5. It knows node 2 is responsible for the ‘foo’ key.
[Ring diagram: the client talks to coordinator #5, which sends ‘foo’ to node #2]
But there is a problem...
What if we have lots of values between F and J?
We end up with a “hot” section in our ring of nodes.
That’s bad, mmmkay?
You shouldn’t have a hot section in your ring, mmmkay?
A Partitioner is used to apply a transform to the key. The transformed values are also used to define the range for a node.
The Random Partitioner applies an MD5 transform. The range of all possible key values is changed to a 128-bit number.
There are other Partitioners, such as the Order Preserving Partitioner. But start with the Random Partitioner.
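To see what an MD5-style transform does to keys, a quick sketch (illustrative only, not Cassandra’s token code):

    import hashlib

    def token(key):
        # Map any key onto a 128-bit integer, spreading keys evenly around the ring.
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

    print(token('foo'))  # a huge number somewhere in the 128-bit range
    print(token('fop'))  # a similar key lands somewhere completely different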
Let’s pretend all keys are now transformed to an integer between 0 and 9.
Our 5 node cluster now looks like this.
[Ring diagram: node #1 <= 2, #2 <= 4, #3 <= 6, #4 <= 8, #5 <= 0]
Pretend our ‘foo’ key transforms to 3.
[Ring diagram: ‘foo’ transforms to "3" and is sent to node #2]
Good start.
But where are the replicas? We want to replicate the ‘foo’ key 3 times.
A Replication Strategy is used to determine which nodes should store replicas.
It’s also used to work out which nodes should have a value when reading.
Simple Strategy orders the nodes by their token and places the replicas clockwise around the ring.
Network Topology Strategy is aware of the racks and Data Centres your servers are in. It can split replicas between DCs.
Simple Strategy will do in most cases.
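A sketch of the Simple Strategy idea: start at the node that owns the key’s token and walk clockwise until you have RF nodes (tokens made up to match the pretend 0-9 ring):

    from bisect import bisect_left

    # Pretend ring: #1 <= 2, #2 <= 4, #3 <= 6, #4 <= 8, #5 wraps around (shown here as 10).
    ring = [(2, '#1'), (4, '#2'), (6, '#3'), (8, '#4'), (10, '#5')]

    def replicas(key_token, rf=3):
        tokens = [token for token, node in ring]
        start = bisect_left(tokens, key_token) % len(ring)
        # Walk clockwise from the owning node until RF distinct nodes are chosen.
        return [ring[(start + i) % len(ring)][1] for i in range(rf)]

    print(replicas(3))  # ['#2', '#3', '#4'] for the 'foo' key that hashed to 3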
Our coordinator will send the write to all 3 nodes at once.
[Ring diagram: "3" is written to replicas #2, #3, and #4]
Once the 3 replicas tell the coordinator they have finished, it will tell the client the write completed.
Done. Let’s go home.
Hang on. What about fault tolerance? What if node #4 is down?
[Ring diagram: the same write, with node #4 down]
After all, two out of three ain’t bad.
The client must specify a Consistency Level for each operation.
Consistency Level specifies how many nodes must agree before the operation is a success.
For reads it is known as R. For writes it is known as W.
Here are the simple ones (there are a few more)...
One. The coordinator will only wait for one node to acknowledge the write.
Quorum. N/2 + 1
All.
The cluster will work to eventually make all copies of the data consistent.
To get consistent behaviour make sure that R + W > N. You can do this by...
Always using Quorum for reads and writes. Or...
Use All for writes and One for reads. Or...
Use All for reads and One for writes.
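The R + W > N rule as a tiny helper (just arithmetic, not an API):

    def strongly_consistent(r, w, n):
        # Reads and writes overlap on at least one replica when R + W > N.
        return r + w > n

    n = 3
    quorum = n // 2 + 1                             # 2 when N is 3
    print(strongly_consistent(quorum, quorum, n))   # True: Quorum reads + Quorum writes
    print(strongly_consistent(1, 1, n))             # False: One + One can miss the latest write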
Try our write again, using Quorum consistency level.
The coordinator will wait for 2 nodes to complete the write before telling the client it has completed.
[Ring diagram: the Quorum write of "3" to replicas #2, #3, and #4]
What about when node 4 comes back online?
It will not have our “foo” key.
Won’t somebody please think of the “foo” key!?
During our write the coordinator will send a Hinted Handoff to one of the online replicas.
Hinted Handoff tells the node that one of the replicas was down and needs to be updated later.
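One way to picture a hint, as a toy sketch (not Cassandra’s implementation): the coordinator parks the missed write, tagged with the down replica’s name, on a live node and replays it later:

    # Hints held by a live node: {down_node: [(row_key, columns), ...]}
    hints = {}

    def store_hint(down_node, row_key, columns):
        # Remember a write that a down replica missed.
        hints.setdefault(down_node, []).append((row_key, columns))

    def replay_hints(node, deliver):
        # When the node comes back, hand it every write it missed.
        for row_key, columns in hints.pop(node, []):
            deliver(node, row_key, columns)

    store_hint('#4', 'foo', {'bar': 'baz'})
    replay_hints('#4', lambda node, key, cols: print('sending', key, cols, 'to', node))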
[Ring diagram: "3" written to #2 and #3, plus a hint on #3: send "3" to #4]
When node 4 comes back up, node 3 will eventually process the Hinted Handoffs and send the “foo” key to it.
[Ring diagram: node #4 back up and holding "3" once the hint is processed]
What if the “foo” key is read before the Hinted Handoff is processed?
[Ring diagram: node #4 has no value for “foo” while the hint is still pending]
At our Quorum CL the coordinator asks all nodes that should have replicas to perform the read.
The Replication Strategy is used by the coordinator to find out which nodes should have replicas.
Once CL nodes have returned, their values are compared.
If they do not match, a Read Repair process is kicked off.
A timestamp provided by the client during the write is used to determine the “latest” value.
The “foo” key is written to node 4, and consistency achieved, before the coordinator returns to the client.
At lower CLs the Read Repair happens in the background and is probabilistic.
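The timestamp rule during Read Repair, sketched: compare what each replica returned and keep the value with the newest client-supplied timestamp (illustrative values):

    def resolve(responses):
        # Each response is (value, timestamp); the newest timestamp wins.
        return max(responses, key=lambda pair: pair[1])

    responses = [('baz', 100), ('', 0), ('baz', 100)]  # one replica missed the write
    print(resolve(responses))  # ('baz', 100) is what gets pushed back to the stale replica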
We can force Cassandra to repair everything using the Anti Entropy feature.
Anti Entropy is the main feature for achieving consistency. RR and HH are optimisations.
Anti Entropy is started manually via the command line or Java JMX.
Great so far.
But ratemylolcats.com is going to be huge. How do I store 100 million pictures of cats?
Add more nodes.
More disk capacity, disk IO, memory, CPU, network IO. More everything.
Except.
Less “work” per node.
Linear scaling.
Clusters of 100+ TB.
And now for the data model.
From the outside in.
A Keyspace is the container for everything in your application.
Keyspaces can be thought of as Databases.
A Column Family is a container for ordered and indexed Columns.
Columns have a name, a value, and a timestamp provided by the client.
The CF indexes the columns by name and supports get operations by name.
CFs do not define which columns can be stored in them.
Column Families have a large memory overhead.
You typically have few (<10) CFs in your Keyspace. But there is no limit.
We have Rows. Rows have a key.
Rows store columns in one or more Column Families.
Different rows can store different columns in the same Column Family.
User CF
  key => fred    username => fred, d_o_b => 04/03
  key => bob     username => bob, city => wellington
A key can store different columns in different Column Families.
User CF
  key => fred    username => fred, d_o_b => 04/03
Timeline CF
  key => fred    09:01 => tweet_60, 09:02 => tweet_70
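The same two Column Families sketched as plain Python dicts (a mental model only, not an API):

    keyspace = {
        'User': {        # Column Family: row key -> {column name -> column value}
            'fred': {'username': 'fred', 'd_o_b': '04/03'},
            'bob':  {'username': 'bob', 'city': 'wellington'},
        },
        'Timeline': {
            'fred': {'09:01': 'tweet_60', '09:02': 'tweet_70'},
        },
    }

    print(keyspace['User']['fred']['username'])   # fred
    print(sorted(keyspace['Timeline']['fred']))   # column names, in order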
Here comes the Super Column Family to ruin it all.
Arrgggghhhhh.
A Super Column Family is a container for ordered and indexed Super Columns.
A Super Column has a name and an ordered and indexed list of Columns.
So the Super Column Family just gives another level to our hash.
Social Super CF
  key => fred
    following => { bob => 01/01/2010, tom => 01/02/2010 }
    followers => { bob => 01/01/2010 }
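And the Super Column Family as one more level of nesting (again, just a sketch):

    social = {
        'fred': {                                                      # row key
            'following': {'bob': '01/01/2010', 'tom': '01/02/2010'},   # Super Column -> Columns
            'followers': {'bob': '01/01/2010'},
        },
    }

    print(social['fred']['following']['bob'])  # 01/01/2010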
How about some code?
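Something along these lines with the pycassa client, assuming a local node, a 'Keyspace1' keyspace and a 'User' column family (the names and setup are assumptions; check the pycassa docs for your version):

    import pycassa

    # Connect to the cluster and pick a column family (assumed names).
    pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
    user_cf = pycassa.ColumnFamily(pool, 'User')

    # Write a row: key 'fred' with two columns.
    user_cf.insert('fred', {'username': 'fred', 'd_o_b': '04/03'})

    # Read the whole row back, then a single column by name.
    print(user_cf.get('fred'))
    print(user_cf.get('fred', columns=['username']))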