Introduction to Cassandra

Recent talk I gave at the Wellington Rails User Group.

I tried to build up the model of how and why Cassandra does things.

Transcript

  • 1. Introduction to Cassandra Wellington Ruby on Rails User Group Aaron Morton @aaronmorton 24/11/2010
  • 2. Disclaimer. This is an introduction not a reference.
  • 3. I may, from time to time and for the best possible reasons, bullshit you.
  • 4. What do you already know about Cassandra?
  • 5. Get ready.
  • 6. The next slide has a lot on it.
  • 7. Cassandra is a distributed, fault tolerant, scalable, column oriented data store.
  • 8. A word about “column oriented”.
  • 9. Relax.
  • 10. It’s different to a row oriented DB like MySQL. So...
  • 11. For now, think about keys and values, where each value is a hash / dict.
  • 12. Cassandra’s data model and on disk storage are based on the Google Bigtable paper from 2006.
  • 13. The distributed cluster design is based on the Amazon Dynamo paper from 2007.
  • 14. {‘foo’ => {‘bar’ => ‘baz’,},} {key => {col_name => col_value,},}
  • 15. Easy. Let’s store ‘foo’ somewhere.
  • 16. 'foo'
  • 17. But I want to be able to read it back if one machine fails.
  • 18. Let’s distribute it across 3 of the 5 nodes I have.
  • 19. This is the Replication Factor. Called RF or N.
  • 20. Each node has a token that identifies the upper value of the key range it is responsible for.
  • 21. #1 <= E #2 <= J #3 <= O #4 <= T #5 <= Z
  • 22. Client connects to a random node and asks it to coordinate storing the ‘foo’ key.
  • 23. Each node knows about all other nodes in the cluster, including their tokens.
  • 24. This is achieved using a Gossip protocol. Every second each node shares its full view of the cluster with 1 to 3 other nodes.
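
A toy Ruby sketch of the idea (an illustration only, not the real gossip message format): each node holds a hash describing the cluster as it knows it, and one exchange with a peer merges the two views.

    # Toy gossip round: two nodes merge their views of the cluster.
    my_view   = { 1 => 'E', 2 => 'J', 5 => 'Z' }   # what this node knows: node => token
    peer_view = { 2 => 'J', 3 => 'O', 4 => 'T' }   # what a randomly chosen peer knows

    my_view.merge!(peer_view)                       # after one exchange this node knows all 5
    puts my_view.keys.sort.inspect                  # => [1, 2, 3, 4, 5]
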
  • 25. Our coordinator is node 5. It knows node 2 is responsible for the ‘foo’ key.
  • 26. #1 <= E #2 'foo' #3 <= O #4 <= T #5 <= Z Client
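
A rough Ruby sketch of how a coordinator could route a key on that lettered ring (an illustration of the idea, not Cassandra’s actual code):

    # Each node owns keys up to and including its token (the ring from slide 21).
    TOKENS = { 1 => 'E', 2 => 'J', 3 => 'O', 4 => 'T', 5 => 'Z' }

    def node_for(key)
      # The first node, in token order, whose token is >= the key owns it.
      owner = TOKENS.sort_by { |_node, token| token }
                    .find { |_node, token| key.upcase <= token }
      owner ? owner.first : TOKENS.keys.min  # wrap around if nothing matched
    end

    puts node_for('foo')  # => 2, because 'F' falls into node 2's (E..J] range
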
  • 27. But there is a problem...
  • 28. What if we have lots of values between F and J?
  • 29. We end up with a “hot” section in our ring of nodes.
  • 30. That’s bad mmmkay?
  • 31. You shouldn't have a hot section in your ring. mmmkay?
  • 32. A Partitioner is used to apply a transform to the key. The transformed values are also used to define a node’s range.
  • 33. The Random Partitioner applies an MD5 transform. The range of all possible key values is changed to a 128 bit number.
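
Roughly, in Ruby (a sketch of the idea, not the exact implementation):

    require 'digest/md5'

    # Sketch of the Random Partitioner idea: hash the key with MD5 and use the
    # digest as a 128 bit integer token, so keys spread evenly around the ring.
    def random_token(key)
      Digest::MD5.hexdigest(key).to_i(16)
    end

    puts random_token('foo')
    # => a number between 0 and 2**128 - 1; similar keys land in completely different places
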
  • 34. There are other Partitioners, such as the Order Preserving Partitioner. But start with the Random Partitioner.
  • 35. Let’s pretend all keys are now transformed to an integer between 0 and 9.
  • 36. Our 5 node cluster now looks like this.
  • 37. #1 <= 2 #2 <= 4 #3 <= 6 #4 <= 8 #5 <= 0
  • 38. Pretend our ‘foo’ key transforms to 3.
  • 39. #1 <= 2 #2 "3" #3 <= 6 #4 <= 8 #5 <= 0 Client
  • 40. Good start.
  • 41. But where are the replicas? We want to replicate the ‘foo’ key 3 times.
  • 42. A Replication Strategy is used to determine which nodes should store replicas.
  • 43. It’s also used to work out which nodes should have a value when reading.
  • 44. Simple Strategy orders the nodes by their token and places the replicas around the ring.
  • 45. Network Topology Strategy is aware of the racks and Data Centres your servers are in. It can split replicas between DCs.
  • 46. Simple Strategy will do in most cases.
  • 47. Our coordinator will send the write to all 3 nodes at once.
  • 48. #1 <= 2 #2 "3" #3 "3" #4 "3" #5 <= 0 Client
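
A Ruby sketch of the Simple Strategy idea, using the toy 0..9 tokens from earlier and treating node 5’s wrap-around token 0 as 10 to keep the arithmetic simple:

    # Toy ring from slide 37: node => the upper token it owns (0 written as 10 here).
    RING = { 1 => 2, 2 => 4, 3 => 6, 4 => 8, 5 => 10 }

    def replicas_for(token, rf)
      ordered = RING.sort_by { |_node, upper| upper }           # nodes in ring order
      start   = ordered.index { |_node, upper| token <= upper } || 0
      rf.times.map { |i| ordered[(start + i) % ordered.size].first }
    end

    puts replicas_for(3, 3).inspect  # => [2, 3, 4], the three nodes that store "3"
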
  • 49. Once the 3 replicas tell the coordinator they have finished, it will tell the client the write completed.
  • 50. Done. Let’s go home.
  • 51. Hang on. What about fault tolerance? What if node #4 is down?
  • 52. #1 <= 2 #2 "3" #3 "3" #4 "3" #5 <= 0 Client
  • 53. The client must specify a Consistency Level for each operation.
  • 54. Consistency Level specifies how many nodes must agree before the operation is a success.
  • 55. For reads it is known as R. For writes it is known as W.
  • 56. Here are the simple ones (there are a few more)...
  • 57. One. The coordinator will only wait for one node to acknowledge the write.
  • 58. Quorum. N/2 + 1
  • 59. All.
  • 60. The cluster will work to eventually make all copies of the data consistent.
  • 61. To get consistent behaviour make sure that R + W > N. You can do this by...
  • 62. Always using Quorum for read and writes. Or...
  • 63. Use All for writes and One for reads. Or...
  • 64. Use All for reads and One for writes.
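
The arithmetic behind those combinations is easy to check. In the sketch below N is the replication factor from the running example:

    N = 3               # replication factor
    QUORUM = N / 2 + 1  # => 2

    def consistent?(r, w, n)
      r + w > n  # overlapping read and write sets guarantee the read sees the latest write
    end

    puts consistent?(QUORUM, QUORUM, N)  # Quorum reads, Quorum writes => true
    puts consistent?(1, N, N)            # One read, All writes        => true
    puts consistent?(N, 1, N)            # All reads, One write        => true
    puts consistent?(1, 1, N)            # One read, One write         => false (may read stale data)
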
  • 65. Try our write again, using Quorum consistency level.
  • 66. The coordinator will wait for 2 nodes to complete the write before telling the client it has completed.
  • 67. #1 <= 2 #2 "3" #3 "3" #4 "3" #5 <= 0 Client
  • 68. What about when node 4 comes online?
  • 69. It will not have our “foo” key.
  • 70. Won’t somebody please think of the “foo” key!?
  • 71. During our write the coordinator will send a Hinted Handoff to one of the online replicas.
  • 72. Hinted Handoff tells the node that one of the replicas was down and needs to be updated later.
  • 73. #1 <= 2 #2 "3" #3 "3" #4 "3" #5 <= 0 Client send "3" to #4
  • 74. When node 4 comes back up, node 3 will eventually process the Hinted Handoffs and send the “foo” key to it.
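
A toy picture of a hint (the field names here are made up for illustration, not Cassandra’s actual hint format):

    # Hypothetical hint record kept by node 3 while node 4 is down.
    hint = { target_node: 4, key: 'foo', columns: { 'bar' => 'baz' } }

    # Later, once gossip shows node 4 is back up, the stored write is replayed.
    def replay(hint)
      puts "sending key #{hint[:key].inspect} to node #{hint[:target_node]}"
    end

    replay(hint)  # => sending key "foo" to node 4
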
  • 75. #1 <= 2 #2 "3" #3 "3" #4 "3" #5 <= 0 Client
  • 76. What if the “foo” key is read before the Hinted Handoff is processed?
  • 77. #1 <= 2 #2 "3" #3 "3" #4 "" #5 <= 0 Client send "3" to #4
  • 78. At our Quorum CL the coordinator asks all nodes that should have replicas to perform the read.
  • 79. Once CL nodes have returned, their values are compared.
  • 80. If they do not match, a Read Repair process is kicked off.
  • 81. A timestamp provided by the client during the write is used to determine the “latest” value.
  • 82. The “foo” key is written to node 4, and consistency achieved, before the coordinator returns to the client.
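
A Ruby sketch of that comparison, with made-up timestamps for the running example:

    # Replica responses for the "foo" key at Quorum, while node 4 is still stale.
    responses = [
      { node: 2, value: '3', timestamp: 1290556800 },
      { node: 3, value: '3', timestamp: 1290556800 },
      { node: 4, value: '',  timestamp: 0 },           # node 4 missed the write
    ]

    winner = responses.max_by { |column| column[:timestamp] }
    puts winner[:value]  # => "3"; this value is written back to node 4 before returning
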
  • 83. At lower CL the Read Repair happens in the background and is probabilistic.
  • 84. We can force Cassandra to repair everything using the Anti Entropy feature.
  • 85. Anti Entropy is the main feature for achieving consistency. RR and HH are optimisations.
  • 86. Anti Entropy is started manually via the command line or Java JMX.
  • 87. Great so far.
  • 88. But ratemylolcats.com is going to be huge. How do I store 100 Million pictures of cats?
  • 89. Add more nodes.
  • 90. More disk capacity, disk IO, memory, CPU, network IO. More everything.
  • 91. Linear scaling.
  • 92. Clusters of 100+ TB.
  • 93. And now for the data model.
  • 94. From the outside in.
  • 95. A Keyspace is the container for everything in your application.
  • 96. Keyspaces can be thought of as Databases.
  • 97. A Column Family is a container for ordered and indexed Columns.
  • 98. Columns have a name, value, and timestamp provided by the client.
  • 99. The CF indexes the columns by name and supports get operations by name.
  • 100. CFs do not define which columns can be stored in them.
  • 101. Column Families have a large memory overhead.
  • 102. You typically have few (<10) CFs in your Keyspace. But there is no limit.
  • 103. We have Rows. Rows have a key.
  • 104. Rows store columns in one or more Column Families.
  • 105. Different rows can store different columns in the same Column Family.
  • 106. User CF username => fred d_o_b => 04/03 username => bob city => wellington key => fred key => bob
  • 107. A key can store different columns in different Column Families.
  • 108. User CF username => fred d_o_b => 04/03 09:01 => tweet_60 09:02 => tweet_70 key => fred key => fred Timeline CF
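
In Ruby terms, those two Column Families are just nested hashes, with the same ‘fred’ key holding different columns in each:

    # User CF: rows keyed by user, each row holding whatever columns it has.
    user_cf = {
      'fred' => { 'username' => 'fred', 'd_o_b' => '04/03' },
      'bob'  => { 'username' => 'bob',  'city'  => 'wellington' },
    }

    # Timeline CF: the same 'fred' key, but columns named by time and ordered by name.
    timeline_cf = {
      'fred' => { '09:01' => 'tweet_60', '09:02' => 'tweet_70' },
    }

    puts user_cf['fred']['username']            # => "fred"
    puts timeline_cf['fred'].keys.sort.inspect  # => ["09:01", "09:02"]
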
  • 109. Here comes the Super Column Family to ruin it all.
  • 110. Arrgggghhhhh.
  • 111. A Super Column Family is a container for ordered and indexed Super Columns.
  • 112. A Super Column has a name and an ordered and indexed list of Columns.
  • 113. So the Super Column Family just gives another level to our hash.
  • 114. Social Super CF following => { bob => 01/01/2010, tom => 01/02/2010} followers => { bob => 01/01/2010} key => fred
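
And the Social Super Column Family is the same picture with one more level of nesting:

    # Social Super CF: key => super column name => { column name => column value }.
    social_cf = {
      'fred' => {
        'following' => { 'bob' => '01/01/2010', 'tom' => '01/02/2010' },
        'followers' => { 'bob' => '01/01/2010' },
      },
    }

    puts social_cf['fred']['following'].keys.inspect  # => ["bob", "tom"]
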
  • 115. How about some code?
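
As a closing sketch, here is roughly what writing and reading the examples above looked like with the Ruby cassandra gem of the time; treat the keyspace name, connection string and column family names as assumptions taken from the running example rather than a reference.

    require 'rubygems'
    require 'cassandra'

    # Assumed setup: a local node on the default Thrift port and a keyspace
    # containing the Users column family shown earlier.
    client = Cassandra.new('Twitter', '127.0.0.1:9160')

    # Write the 'fred' row: column names and values are plain strings.
    client.insert(:Users, 'fred', { 'username' => 'fred', 'd_o_b' => '04/03' })

    # Read it back; returns a hash of column name => value.
    puts client.get(:Users, 'fred').inspect
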
