Cassandra's Sweet Spot - an introduction to Apache Cassandra

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.

Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880


  1. 1. Cassandra’s sweet spot – Dave Gardner, @davegardnerisme
  2. 2. jobs.hailocab.com – Looking for an expert backend Java dev – speak to me! meetup.com/Cassandra-London – next event 21st November
  3. 3. Building applications with Cassandra • Key features • Creating an application • Data modeling
  4. 4. Comparing Cassandra with X “Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I dont know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?” 27th July 2010 http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  5. 5. Comparing Cassandra with X “They have approximately nothing in common. And, no, Cassandra is definitely not dying off.” 28th July 2010 http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  6. 6. Top Tip #1 To use a NoSQL solution effectively, we need to identify its sweet spot.
  7. 7. Top Tip #1 To use a NoSQL solution effectively, we need to identify its sweet spot. This means learning about each solution; how is it designed? what algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  8. 8. Comparing Cassandra with X “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black – NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  9. 9. Headline features 1. Elastic Read and write throughput increases linearly as new machines are added http://cassandra.apache.org/
  10. 10. Headline features 2. Decentralised Fault tolerant with no single point of failure; no “master” node http://cassandra.apache.org/
  11. 11. The dynamo paper • Consistent hashing • Vector clocks • Gossip protocol • Hinted handoff • Read repair http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
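The first Dynamo idea, consistent hashing, can be sketched in a few lines. This is a toy illustration only, not Cassandra's actual partitioner: the `Ring` class, MD5 token function, and node names are all invented for the example.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: each node owns a token, and a key's
    replicas are the next RF nodes clockwise from the key's position."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        # Place every node on the ring at the hash of its name
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key):
        # Walk clockwise from the key's position to pick RF nodes
        keys = [t for t, _ in self.tokens]
        start = bisect.bisect(keys, self._hash(key)) % len(self.tokens)
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(self.rf)]

ring = Ring(["node%d" % i for i in range(1, 7)], rf=3)
print(ring.replicas("some-row-key"))  # three distinct replica nodes
```

Adding a node only moves the keys on the arc it takes over, which is what makes the cluster elastic.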
  12. 12. The dynamo paper [Diagram: a consistent-hashing ring of six nodes with RF = 3; the client sends its request to a coordinator node, which forwards it to the three replicas for the key]
  13. 13. Headline features 3. Rich data model Column based, range slices, column slices, secondary indexes, counters, expiring columns http://cassandra.apache.org/
  14. 14. The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
  15. 15. The big table paper [Diagram: a Column Family maps each row key to a set of columns, each column a name/value pair]
  16. 16. Headline features 4. You’re in control Tunable consistency, per operation http://cassandra.apache.org/
  17. 17. Consistency levels How many replicas must respond to declare success?
  18. 18. Consistency levels: write operations Level Description ANY One node, including hinted handoff ONE One node QUORUM N/2 + 1 replicas LOCAL_QUORUM N/2 + 1 replicas in local data centre EACH_QUORUM N/2 + 1 replicas in each data centre ALL All replicas http://wiki.apache.org/cassandra/API#Write
  19. 19. Consistency levels: read operations Level Description ONE 1st Response QUORUM N/2 + 1 replicas LOCAL_QUORUM N/2 + 1 replicas in local data centre EACH_QUORUM N/2 + 1 replicas in each data centre ALL All replicas http://wiki.apache.org/cassandra/API#Read
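The arithmetic behind both tables is simple integer maths, and it yields the classic overlap rule: reads see the latest write whenever R + W > RF. A hedged sketch (the function names are mine, not Cassandra's):

```python
def quorum(rf):
    # QUORUM = (RF / 2) + 1, using integer division
    return rf // 2 + 1

def overlapping(read_replicas, write_replicas, rf):
    # Strong consistency when the read set and write set must share
    # at least one replica: R + W > RF
    return read_replicas + write_replicas > rf

# With RF = 3, QUORUM reads plus QUORUM writes overlap: 2 + 2 > 3
print(quorum(3), overlapping(quorum(3), quorum(3), 3))  # 2 True
```

So QUORUM/QUORUM gives consistent reads while tolerating one replica being down (at RF = 3); ONE/ONE trades that guarantee for latency.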
  20. 20. Headline features 5. Performant Well known for high write performance http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
  21. 21. Benchmark* http://blog.cubrid.org/dev-platform/nosql-benchmarking/ * Add pinch of salt
  22. 22. Recap: headline features 1. Elastic 2. Decentralised 3. Rich data model 4. You’re in control (tunable consistency) 5. Performant
  23. 23. A simple ad-targeting application Some ads Choose which ad to show Our user knowledge
  24. 24. A simple ad-targeting application Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets) http://pixel.wehaveyourkidneys.com/add.php?add=foo
  25. 25. A simple ad-targeting application Record clicks and impressions of each ad; storing data per-ad and per-segment http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1 http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
  26. 26. A simple ad-targeting application Real-time ad performance analytics, broken down by segment (which segments are performing well?) http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
  27. 27. A simple ad-targeting application Recommendations based on best- performing ads (this is left as an exercise for the reader)
  28. 28. Additional requirements • Large number of users • High volume of impressions • Highly available – downtime is money
  29. 29. A good fit for Cassandra? Yes! Big data, high availability and lots of writes are all good signs that Cassandra will fit well. http://www.nosqldatabases.com/main/2010/10/19/what-is- cassandra-good-for.html
  30. 30. A good fit for Cassandra? That said, people use Cassandra for many other things – for example, highly available HTTP request routing (tiny data!) http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
  31. 31. Top Tip #2 Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.
  32. 32. Demo Live demo before we start
  33. 33. Data modeling Start from your queries, work backwards http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
  34. 34. Data model basics: conflict resolution Per-column timestamp-based conflict resolution { { column: foo, column: foo, value: bar, value: zing, timestamp: 1000 timestamp: 1001 } } http://cassandra.apache.org/
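The merge shown on the slide can be executed as a tiny last-write-wins function (a sketch of the rule Cassandra applies per column internally; the `resolve` helper is mine):

```python
def resolve(a, b):
    # Keep whichever version of the column carries the higher timestamp
    return a if a["timestamp"] >= b["timestamp"] else b

v1 = {"column": "foo", "value": "bar", "timestamp": 1000}
v2 = {"column": "foo", "value": "zing", "timestamp": 1001}
print(resolve(v1, v2)["value"])  # zing
```

Because timestamps are supplied by clients, conflict resolution needs no coordination between replicas: every node independently reaches the same answer.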
  35. 35. Data model basics: conflict resolution Per-column timestamp-based conflict resolution { { column: foo, column: foo, value: bar, value: zing, timestamp: 1000 timestamp: 1001 } } http://cassandra.apache.org/
  36. 36. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema { { column: zebra, column: badger, value: foo, value: foo, timestamp: 1000 timestamp: 1001 } } http://cassandra.apache.org/
  37. 37. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema { badger: foo, with AsciiType column zebra: foo schema } http://cassandra.apache.org/
  38. 38. Data modeling: user segments Add user to bucket X, with expiry time Y Which buckets is user X in? ["user"][<uuid>][<bucketId>] = 1 [CF] [rowKey] [columnName] = value
  39. 39. Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Is user in segment X? A: Single column fetch
  40. 40. Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Which segments is user X in? A: Column slice fetch
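Both queries on these two slides can be modelled with a plain dict standing in for the "user" Column Family. This is a toy in-memory sketch; the helper names are mine, not a client API:

```python
# Toy model of the "user" CF: row key (user uuid) -> ordered columns
user_cf = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}

def in_segment(uuid, segment):
    # "Is user in segment X?" – a single-column fetch
    return segment in user_cf.get(uuid, {})

def segments(uuid):
    # "Which segments is user X in?" – a column slice over one row,
    # returned in comparator (column-name) order
    return sorted(user_cf.get(uuid, {}))

print(in_segment("f97be9cc-5255-4578-8813-76701c0945bd", "foo"))  # True
print(segments("06a6f1b0-fcf2-41d9-8949-fe2d416bde8e"))  # ['baz', 'zoo']
```

Both reads touch exactly one row, so they stay cheap no matter how many users exist.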
  41. 41. Top Tip #3 With column slices, we get the columns back ordered according to our schema. We cannot do the same for rows, however, unless we use the Order Preserving Partitioner
  42. 42. Top Tip #4 Don’t use the Order Preserving Partitioner unless you absolutely have to http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  43. 43. Data modeling: user segments Add user to bucket X, with expiry time Y Which buckets is user X in? ["user"][<uuid>][<bucketId>] = 1 [CF] [rowKey] [columnName] = value
  44. 44. Expiring columns An expiring column will be automatically deleted after n seconds http://cassandra.apache.org/
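The expiry behaviour can be mimicked by storing an absolute deadline next to each value. A sketch only: Cassandra tracks the TTL internally and purges expired columns during compaction, while here `insert`/`get` are invented helpers:

```python
import time

def insert(row, column, value, ttl=None):
    # Store an optional absolute expiry alongside the value
    expiry = time.time() + ttl if ttl is not None else None
    row[column] = (value, expiry)

def get(row, column):
    value, expiry = row[column]
    if expiry is not None and time.time() >= expiry:
        return None  # expired: Cassandra would eventually purge the column
    return value

row = {}
insert(row, "foo", 1, ttl=3600)
print(get(row, "foo"))  # 1 (still within its TTL)
```

For the segments use case this means bucket membership ages out automatically, with no batch cleanup job.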
  45. 45. Data modeling: user segments
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      $users->insert(
          $userUuid,
          array($segment => 1),
          NULL,      // default timestamp
          $expires   // TTL in seconds
      );
      Using the phpcassa client: https://github.com/thobbs/phpcassa
  46. 46. Data modeling: user segments
      UPDATE users USING TTL = 3600
      SET 'foo' = '1'
      WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
      Using CQL http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language http://www.datastax.com/docs/1.0/references/cql
  47. 47. Top Tip #5 Try to exploit Cassandra’s columnar data model; avoid read-before write and locking by safely mutating individual columns
  48. 48. Data modeling: ad performance Track overall ad performance; how many clicks/impressions per ad? ["ads"][<adId>][<stamp>]["click"] = # ["ads"][<adId>][<stamp>]["impression"] = # [CF] [Row] [S.Col] [Col] = value Using super columns
  49. 49. Top Tip #6 Friends don’t let friends use Super Columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  50. 50. Data modeling: ad performance Try again using regular columns: ["ads"][<adId>][<stamp>-"click"] = # ["ads"][<adId>][<stamp>-"impression"] = # [CF] [Row] [Col] = value
  51. 51. Data modeling: ad performance ads Column Family: [1][2011103015-click] = 1 [1][2011103015-impression] = 3434 [1][2011103016-click] = 12 [1][2011103016-impression] = 5411 [1][2011103017-click] = 2 [1][2011103017-impression] = 345 Q: Get performance of ad X between two date/times A: Column slice against single row specifying a start stamp and end stamp + 1
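The "end stamp + 1" trick from this slide works because column names sort lexically under the comparator. A toy sketch (the `performance` helper is mine; a real client would issue a column-slice get with these bounds):

```python
def performance(row, start_stamp, end_stamp):
    """Slice from <start>-... up to but excluding <end + 1>-..., i.e.
    the talk's 'start stamp and end stamp + 1' inclusive hour range."""
    lo, hi = str(start_stamp), str(end_stamp + 1)
    return {name: count for name, count in row.items() if lo <= name < hi}

# The ads row from the slide
ads_row = {
    "2011103015-click": 1, "2011103015-impression": 3434,
    "2011103016-click": 12, "2011103016-impression": 5411,
    "2011103017-click": 2, "2011103017-impression": 345,
}
print(performance(ads_row, 2011103015, 2011103016))
# the four columns for hours 15 and 16 only
```

Because `"2011103016-impression" < "2011103017"` lexically, every metric suffix within the final hour is still captured by the exclusive upper bound.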
  52. 52. Think carefully about your data This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread. Other options: http://rubyscale.com/2011/basic-time-series-with-cassandra/
  53. 53. Counters • Distributed atomic counters • Easy to use • Not idempotent http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
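Non-idempotency is the important caveat: a counter update carries a delta, not a final value, so retrying a timed-out increment that actually succeeded double-counts. A minimal illustration (the in-memory counter is mine):

```python
counter = {"ad-1": 0}

def add(key, delta):
    # A counter update is a delta, not an absolute value
    counter[key] += delta

add("ad-1", 1)   # request succeeds, but the client sees a timeout
add("ad-1", 1)   # ...so blindly retrying it double-counts
print(counter["ad-1"])  # 2, though the client intended a single click
```

For ad stats that is usually an acceptable trade: a slightly inflated impression count beats read-before-write on every pixel hit.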
  54. 54. Data modeling: ad performance
      $stamp = date('YmdH');
      $ads->add(
          $adId,                // row key
          "$stamp-impression",  // column
          1                     // increment
      );
      We’ll store performance metrics in hour buckets for graphing.
  55. 55. Data modeling: ad performance
      UPDATE ads
      SET '2011103015-impression' = '2011103015-impression' + 1
      WHERE KEY = '1'
  56. 56. Data modeling: performance/segment We can add another dimension to our stats so we can break down by segment. ["ads"][<adId>] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
  57. 57. Data modeling: performance/segment ads Column Family: [1][2011103015-bar-click] = 1 [1][2011103015-bar-impression] = 3434 [1][2011103015-foo-click] = 12 [1][2011103015-foo-impression] = 5411 [1][2011103016-bar-click] = 2 Q: Get performance of ad X between two date/times, split by segment A: Column slice against single row specifying a start stamp and end stamp + 1
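Aggregating the slice by segment is then a matter of splitting the `<stamp>-<segment>-<metric>` column names client-side. A sketch over the row shown on the slide:

```python
from collections import defaultdict

# The ads row from the slide: names encode <stamp>-<segment>-<metric>
row = {
    "2011103015-bar-click": 1, "2011103015-bar-impression": 3434,
    "2011103015-foo-click": 12, "2011103015-foo-impression": 5411,
    "2011103016-bar-click": 2,
}

per_segment = defaultdict(int)
for name, count in row.items():
    stamp, segment, metric = name.split("-")
    per_segment[(segment, metric)] += count

print(per_segment[("bar", "click")])  # 3  (hour 15 + hour 16)
```

This keeps the hot path a pure counter increment and pushes the (cheap) grouping work to the analytics read.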
  58. 58. Data modeling: performance/segment $stamp = date(YmdH); $ads->add( "$adId-segments", // row key "$stamp-$segment-impression", // column 1 // incr ); We’ll store performance metrics in hour buckets for graphing.
  59. 59. Data modeling: segment stats Track overall clicks/impressions per bucket; which buckets are most clicky? ["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
  60. 60. Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
  61. 61. Recap: Common data modeling patterns 1. Using column names with no value [cf][rowKey][columnName] = 1
  62. 62. Recap: Common data modeling patterns 2. Counters [cf][rowKey][columnName]++
  63. 63. And also… 3. Serialising a whole object [cf][rowKey][columnName] = { foo: 3, bar: 11 }
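Pattern 3 just needs some serialisation format for the column value; JSON is one reasonable choice (the talk doesn't prescribe one, so this is an assumption):

```python
import json

# Serialise a whole object into a single column value...
value = json.dumps({"foo": 3, "bar": 11})

# ...and deserialise it after a single-column fetch
restored = json.loads(value)
print(restored["foo"], restored["bar"])  # 3 11
```

The trade-off versus one-column-per-field is that any change means rewriting the whole blob, so patterns 1 and 2 remain better for concurrently mutated data.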
  64. 64. There’s more: Brisk Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra DataStax now offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
  65. 65. Hive
      CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
          "cassandra.columns.mapping" = ":key,:column,:value",
          "cassandra.cf.name" = "users"
      );

      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
  66. 66. There’s more: Supercharged Cassandra Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads Includes instant snapshot of CFs http://www.acunu.com/products/choosing-cassandra/
  67. 67. In conclusion Cassandra is founded on sound design principles
  68. 68. In conclusion The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful
  69. 69. In conclusion The clients are getting better; CQL is a step forward
  70. 70. In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
  71. 71. In conclusion Cassandra’s sweet spot is highly available “big data” (especially time- series) with large numbers of writes
  72. 72. Thanks! Learn more about Cassandra: meetup.com/Cassandra-London. Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys. Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations
