Introduction to
Cassandra
Wellington Ruby on Rails User Group
Aaron Morton @aaronmorton
24/11/2010
Disclaimer.
This is an introduction, not
a reference.
I may, from time to time
and for the best possible
reasons, bullshit you.
What do you already know
about Cassandra?
Get ready.
The next slide has a lot on
it.
Cassandra is a distributed,
fault tolerant, scalable,
column oriented data
store.
A word about “column
oriented”.
Relax.
It’s different to a row
oriented DB like MySQL.
So...
For now, think about keys
and values. Where each
value is a hash / dict.
Cassandra’s data model and
on disk storage are based
on the Google Bigtable
paper from 2006.
The distributed cluster
design is based on the
Amazon Dynamo paper
from 2007.
{'foo' => {'bar' => 'baz'}}
{key => {col_name => col_value}}
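That notation is just a nested Ruby Hash: the row key points at a hash of column names to column values. A minimal sketch:

```ruby
# The data model as a plain Ruby Hash: row key => { column name => column value }.
row = { 'foo' => { 'bar' => 'baz' } }

row['foo']         # => { 'bar' => 'baz' }   all columns for the row key 'foo'
row['foo']['bar']  # => 'baz'                one column value, looked up by name
```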
Easy.
Let's store ‘foo’ somewhere.
'foo'
But I want to be able to
read it back if one machine
fails.
Let's distribute it on 3 of
the 5 nodes I have.
This is the Replication
Factor.
Called RF or N.
Each node has a token that
identifies the upper value of
the key range it is
responsible for.
#1 <= E
#2 <= J
#3 <= O
#4 <= T
#5 <= Z
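A rough sketch of that lookup (illustrative only, not Cassandra's code): walk the nodes in token order and pick the first one whose token is greater than or equal to the key, wrapping around if none is.

```ruby
# Illustrative only: map a key to the node that owns its range.
# Tokens are the upper bound of each node's range, as on the slide above.
TOKENS = { '#1' => 'E', '#2' => 'J', '#3' => 'O', '#4' => 'T', '#5' => 'Z' }

def node_for(key)
  match = TOKENS.find { |_node, token| key.upcase <= token }
  match ? match.first : TOKENS.keys.first  # wrap around to the first node
end

node_for('foo')  # => "#2"  ('FOO' sorts after 'E' and before 'J')
```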
Client connects to a
random node and asks it to
coordinate storing the ‘foo’
key.
Each node knows about all
other nodes in the cluster,
including their tokens.
This is achieved using a
Gossip protocol. Every
second each node shares
its full view of the cluster
with 1 to 3 other nodes.
Our coordinator is node 5.
It knows node 2 is
responsible for the ‘foo’
key.
#1 <= E
#2 'foo'
#3 <= O
#4 <= T
#5 <= Z
Client
But there is a problem...
What if we have lots of
values between F and J?
We end up with a “hot”
section in our ring of
nodes.
That’s bad mmmkay?
You shouldn't have a hot
section in your ring.
mmmkay?
A Partitioner is used to
apply a transform to the
key. The transformed values
are also used to define a
node's range.
The Random Partitioner
applies an MD5 transform.
The range of all possible
key values is changed to a
128-bit number.
There are other
Partitioners, such as the
Order Preserving Partitioner.
But start with the Random
Partitioner.
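Roughly what the Random Partitioner does, sketched with Ruby's standard Digest library (an illustration of the idea, not Cassandra's implementation):

```ruby
require 'digest/md5'

# Hash the key with MD5 and treat the 16-byte digest as a 128-bit integer.
# Node tokens then live in this 0 .. 2**128 - 1 space instead of the raw key
# space, which spreads keys evenly around the ring.
def random_partitioner_token(key)
  Digest::MD5.hexdigest(key).to_i(16)
end

random_partitioner_token('foo')  # => a large integer between 0 and 2**128 - 1
```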
Let’s pretend all keys are
now transformed to an
integer between 0 and 9.
Our 5 node cluster now
looks like this.
#1 <= 2
#2 <= 4
#3 <= 6
#4 <= 8
#5 <= 0
Pretend our ‘foo’ key
transforms to 3.
#1 <= 2
#2 "3"
#3 <= 6
#4 <= 8
#5 <= 0
Client
Good start.
But where are the replicas?
We want to replicate the
‘foo’ key 3 times.
A Replication Strategy is
used to determine which
nodes should store replicas.
It’s also used to work out
which nodes should have a
value when reading.
Simple Strategy orders the
nodes by their token and
places the replicas around
the ring.
Network Topology Strategy
is aware of the racks and
Data Centres your servers
are in. Can split replicas
between DC’s.
Simple Strategy will do in
most cases.
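A sketch of what Simple Strategy does with the pretend 0-9 tokens (illustrative only): the first replica goes on the node that owns the key's token, and the remaining RF - 1 replicas go on the next nodes clockwise around the ring.

```ruby
# Nodes ordered by token, using the pretend 0..9 ring from the earlier slides.
# Node #5 owns the wrap-around range (9 and 0), represented here as token 10.
RING = [
  { name: '#1', token: 2 },
  { name: '#2', token: 4 },
  { name: '#3', token: 6 },
  { name: '#4', token: 8 },
  { name: '#5', token: 10 },
]

# Illustrative Simple Strategy: primary replica first, then the next RF - 1
# nodes clockwise around the ring.
def replicas_for(key_token, rf)
  primary = RING.index { |node| key_token <= node[:token] } || 0
  (0...rf).map { |i| RING[(primary + i) % RING.size][:name] }
end

replicas_for(3, 3)  # => ["#2", "#3", "#4"]  (matches the 'foo' example)
```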
Our coordinator will send
the write to all 3 nodes at
once.
#1 <= 2
#2 "3"
#3 "3"
#4 "3"
#5 <= 0
Client
Once the 3 replicas tell the
coordinator they have
finished, it will tell the client
the write completed.
Done.
Let’s go home.
Hang on.
What about fault tolerance?
What if node #4 is down?
#1 <= 2
#2 "3"
#3 "3"
#4 "3"
#5 <= 0
Client
The client must specify a
Consistency Level for each
operation.
Consistency Level specifies
how many nodes must
agree before the operation
is a success.
For reads it is known as R.
For writes it is known as W.
Here are the simple ones
(there are a few more)...
One.
The coordinator will only
wait for one node to
acknowledge the write.
Quorum.
N/2 + 1
All.
The coordinator waits for all
replicas to acknowledge the
write.
The cluster will work to
eventually make all copies
of the data consistent.
To get consistent behaviour
make sure that R + W > N.
You can do this by...
Always using Quorum for
read and writes.
Or...
Use All for writes and One
for reads.
Or...
Use All for reads and One
for writes.
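A quick sanity check of those combinations for RF = 3 (so Quorum = N/2 + 1 = 2), sketched below:

```ruby
N = 3               # replication factor (RF)
QUORUM = N / 2 + 1  # => 2

# Consistent behaviour needs the read and write node sets to overlap: R + W > N.
def consistent?(r, w)
  r + w > N
end

consistent?(QUORUM, QUORUM)  # => true   Quorum reads + Quorum writes (2 + 2 > 3)
consistent?(1, N)            # => true   One for reads, All for writes
consistent?(N, 1)            # => true   All for reads, One for writes
consistent?(1, 1)            # => false  One + One gives no such guarantee
```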
Try our write again, using
Quorum consistency level.
The coordinator will wait
for 2 nodes to complete the
write before telling the
client it has completed.
#1 <= 2
#2 "3"
#3 "3"
#4 "3"
#5 <= 0
Client
What about when node 4
comes online?
It will not have our “foo”
key.
Won’t somebody please
think of the “foo” key!?
During our write the
coordinator will send a
Hinted Handoff to one of
the online replicas.
Hinted Handoff tells the
node that one of the
replicas was down and
needs to be updated later.
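Conceptually (a very rough sketch of the idea, not Cassandra's internals), a hint is just a note saying "this write belongs to a replica that was down; replay it when that node is back":

```ruby
# Very rough sketch of the idea behind Hinted Handoff.
hints = []

# During the write, node #4 is down, so an online replica stores a hint.
hints << { target: '#4', key: 'foo', value: '3' }

# Later, when #4 is seen as up again, its hints are replayed and discarded.
hints.select { |h| h[:target] == '#4' }.each do |hint|
  # deliver hint[:key] => hint[:value] to node #4 here
end
hints.reject! { |h| h[:target] == '#4' }
```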
#1 <= 2
#2 "3"
#3 "3"
#4 "3"
#5 <= 0
Client
send "3" to #4
When node 4 comes back
up, node 3 will eventually
process the Hinted
Handoffs and send the
“foo” key to it.
#1 <= 2
#2 "3"
#3 "3"
#4 "3"
#5 <= 0
Client
What if the “foo” key is
read before the Hinted
Handoff is processed?
#1 <= 2
#2 "3"
#3 "3"
#4 ""
#5 <= 0
Client
send "3" to #4
At our Quorum CL the
coordinator asks all nodes
that should have replicas to
perform the read.
Once CL nodes have
returned, their values are
compared.
If they do not match, a Read
Repair process is kicked off.
A timestamp provided by
the client during the write
is used to determine the
“latest” value.
The “foo” key is written to
node 4, and consistency
achieved, before the
coordinator returns to the
client.
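The "pick the latest" step is easy to picture (an illustrative sketch, not Cassandra's code): each replica returns its version of the column along with the client-supplied timestamp, and the highest timestamp wins.

```ruby
# Illustrative only: resolve divergent replica responses by client timestamp.
responses = [
  { node: '#2', value: '3', timestamp: 1290556800 },
  { node: '#3', value: '3', timestamp: 1290556800 },
  { node: '#4', value: '',  timestamp: 0 },  # the stale replica
]

winner = responses.max_by { |r| r[:timestamp] }
winner[:value]  # => "3" -- this value is written back to the stale replica
```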
At a lower CL the Read
Repair happens in the
background and is
probabilistic.
We can force Cassandra to
repair everything using the
Anti Entropy feature.
Anti Entropy is the main
feature for achieving
consistency. RR and HH are
optimisations.
Anti Entropy is started
manually via the command line
or Java JMX.
Great so far.
But ratemylolcats.com is
going to be huge.
How do I store 100 Million
pictures of cats?
Add more nodes.
More disk capacity, disk IO,
memory, CPU, network IO.
More everything.
Linear scaling.
Clusters of 100+ TB.
And now for the data
model.
From the outside in.
A Keyspace is the container
for everything in your
application.
Keyspaces can be thought
of as Databases.
A Column Family is a
container for ordered and
indexed Columns.
Columns have a name,
value, and timestamp
provided by the client.
The CF indexes the
columns by name and
supports get operations by
name.
CF’s do not define which
columns can be stored in
them.
Column Families have a
large memory overhead.
You typically have few (<10)
CF’s in your Keyspace. But
there is no limit.
We have Rows.
Rows have a key.
Rows store columns in one
or more Column Families.
Different rows can store
different columns in the
same Column Family.
User CF
key => fred: {username => fred, d_o_b => 04/03}
key => bob: {username => bob, city => wellington}
A key can store different
columns in different
Column Families.
User CF
key => fred: {username => fred, d_o_b => 04/03}
Timeline CF
key => fred: {09:01 => tweet_60, 09:02 => tweet_70}
Here comes the Super
Column Family to ruin it all.
Arrgggghhhhh.
A Super Column Family is a
container for ordered and
indexed Super Columns.
A Super Column has a
name and an ordered and
indexed list of Columns.
So the Super Column
Family just gives another
level to our hash.
Social Super CF
key => fred: {
  following => {bob => 01/01/2010, tom => 01/02/2010},
  followers => {bob => 01/01/2010}}
How about some code?
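For a Rails crowd the natural starting point at the time was the Thrift-based 'cassandra' gem. A minimal sketch, assuming that gem, a local node, and a keyspace named 'Lolcats' with a User CF already defined; exact method names and options vary by gem and Cassandra version, so treat it as a sketch rather than a reference:

```ruby
require 'cassandra'

# Connect to a keyspace on a local node (the keyspace name is an assumption here).
client = Cassandra.new('Lolcats', '127.0.0.1:9160')

# Write a row: key 'fred' in the User CF, with two columns.
client.insert(:User, 'fred', { 'username' => 'fred', 'd_o_b' => '04/03' })

# Read the row back: columns come back as a hash, ordered by column name.
client.get(:User, 'fred')  # => roughly {'d_o_b' => '04/03', 'username' => 'fred'}
```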