Cassandra's Sweet Spot - an introduction to Apache Cassandra

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.

Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

Published in: Technology
Transcript of "Cassandra's Sweet Spot - an introduction to Apache Cassandra"

  1. Cassandra’s sweet spot. Dave Gardner, @davegardnerisme
  2. jobs.hailocab.com: looking for an expert backend Java dev, speak to me! meetup.com/Cassandra-London: next event 21st November
  3. Building applications with Cassandra
      • Key features
      • Creating an application
      • Data modeling
  4. Comparing Cassandra with X
      “Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don’t know what’s different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”
      27th July 2010
      http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  5. Comparing Cassandra with X
      “They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”
      28th July 2010
      http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  6. Top Tip #1: To use a NoSQL solution effectively, we need to identify its sweet spot.
  7. Top Tip #1: To use a NoSQL solution effectively, we need to identify its sweet spot. This means learning about each solution: how is it designed? What algorithms does it use?
      http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  8. Comparing Cassandra with X
      “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
      Benjamin Black, NoSQL Tapes (at 30:15)
      http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  9. Headline features: 1. Elastic
      Read and write throughput increases linearly as new machines are added.
      http://cassandra.apache.org/
  10. Headline features: 2. Decentralised
      Fault tolerant with no single point of failure; no “master” node.
      http://cassandra.apache.org/
  11. The Dynamo paper
      • Consistent hashing
      • Vector clocks
      • Gossip protocol
      • Hinted handoff
      • Read repair
      http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  12. The Dynamo paper
      [Diagram: a ring of six nodes, #1 to #6, with RF = 3; a Client and a Coordinator node are labelled.]
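The ring on slide 12 can be sketched in a few lines of Python. This is a toy illustration rather than Cassandra's implementation: the `Ring` class, the node names and the one-token-per-node layout are assumptions made for the example, and MD5 is used only as a convenient hash for placing nodes and keys on the ring.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring using MD5 (an arbitrary choice for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: each node owns exactly one token."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key: str):
        """Return the RF nodes responsible for row_key."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.entries)
        count = min(self.rf, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"])
owners = ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd")
print(owners)  # three distinct nodes, consecutive on the ring
```

Adding a seventh node to this sketch only changes ownership for keys that fall next to the new token, which is the property that makes growing the cluster cheap.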
  13. Headline features: 3. Rich data model
      Column based, range slices, column slices, secondary indexes, counters, expiring columns.
      http://cassandra.apache.org/
  14. The Bigtable paper
      • Sparse "columnar" data model
      • SSTable disk storage
      • Append-only commit log
      • Memtable (buffer and sort)
      • Immutable SSTable files
      • Compaction
      http://labs.google.com/papers/bigtable-osdi06.pdf
      http://www.slideshare.net/geminimobile/bigtable-4820829
  15. The Bigtable paper
      Column Family
                  Name     Name     Name
      Row Key     Value    Value    Value
                  Column   Column   Column
  16. Headline features: 4. You’re in control
      Tunable consistency, per operation.
      http://cassandra.apache.org/
  17. Consistency levels
      How many replicas must respond to declare success?
  18. Consistency levels: write operations
      Level          Description
      ANY            One node, including hinted handoff
      ONE            One node
      QUORUM         N/2 + 1 replicas
      LOCAL_QUORUM   N/2 + 1 replicas in local data centre
      EACH_QUORUM    N/2 + 1 replicas in each data centre
      ALL            All replicas
      http://wiki.apache.org/cassandra/API#Write
  19. Consistency levels: read operations
      Level          Description
      ONE            1st response
      QUORUM         N/2 + 1 replicas
      LOCAL_QUORUM   N/2 + 1 replicas in local data centre
      EACH_QUORUM    N/2 + 1 replicas in each data centre
      ALL            All replicas
      http://wiki.apache.org/cassandra/API#Read
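The "N/2 + 1" in the tables above uses integer division, where N is the replication factor. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the `overlaps` check encodes the usual rule for Dynamo-style systems (not stated on the slides) that a read is guaranteed to see the latest acknowledged write when R + W > N.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # True: QUORUM reads see QUORUM writes
print(overlaps(1, 1, rf))                    # False: ONE/ONE may return stale data
```

Because the level is chosen per operation, an application can write at ONE for speed on one code path and read at ALL on another, as long as it accepts the availability cost of the stricter level.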
  20. Headline features: 5. Performant
      Well known for high write performance.
      http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
  21. Benchmark*
      http://blog.cubrid.org/dev-platform/nosql-benchmarking/
      * Add a pinch of salt
  22. Recap: headline features
      1. Elastic
      2. Decentralised
      3. Rich data model
      4. You’re in control (tunable consistency)
      5. Performant
  23. A simple ad-targeting application
      [Diagram labels: Some ads; Choose which ad to show; Our user knowledge]
  24. A simple ad-targeting application
      Allow us to capture user behaviour/data via “pixels”, placing users into segments (different buckets).
      http://pixel.wehaveyourkidneys.com/add.php?add=foo
  25. A simple ad-targeting application
      Record clicks and impressions of each ad, storing data per-ad and per-segment.
      http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1
      http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
  26. A simple ad-targeting application
      Real-time ad performance analytics, broken down by segment (which segments are performing well?).
      http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
  27. A simple ad-targeting application
      Recommendations based on best-performing ads (this is left as an exercise for the reader).
  28. Additional requirements
      • Large number of users
      • High volume of impressions
      • Highly available: downtime is money
  29. A good fit for Cassandra?
      Yes! Big data, high availability and lots of writes are all good signs that Cassandra will fit well.
      http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html
  30. A good fit for Cassandra?
      That said, people use Cassandra for many other things, for example highly available HTTP request routing (tiny data!).
      http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
  31. Top Tip #2: Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.
  32. Demo
      Live demo before we start.
  33. Data modeling
      Start from your queries, work backwards.
      http://www.slideshare.net/mattdennis/cassandra-data-modeling
      http://blip.tv/datastax/data-modeling-workshop-5496906
  34. Data model basics: conflict resolution
      Per-column timestamp-based conflict resolution
      { column: foo, value: bar, timestamp: 1000 }
      { column: foo, value: zing, timestamp: 1001 }
      http://cassandra.apache.org/
  35. Data model basics: conflict resolution
      Per-column timestamp-based conflict resolution: the version with the higher timestamp wins, so foo resolves to zing.
      { column: foo, value: bar, timestamp: 1000 }
      { column: foo, value: zing, timestamp: 1001 }
      http://cassandra.apache.org/
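Per-column resolution means two replicas can be reconciled one column at a time, without locks. A minimal sketch using the two versions from slides 34 and 35; the `Column` tuple, `resolve` and `merge_rows` are hypothetical names, the extra `baz` column is made up for the example, and the tie-breaking behaviour (the second argument wins on equal timestamps) is an arbitrary choice for this sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Pick the winner of two versions of the same column: highest timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only resolve versions of the same column")
    return a if a.timestamp > b.timestamp else b


def merge_rows(left: dict, right: dict) -> dict:
    """Merge two replicas' views of a row, column by column."""
    merged = dict(left)
    for name, col in right.items():
        merged[name] = resolve(merged[name], col) if name in merged else col
    return merged


replica_a = {"foo": Column("foo", "bar", 1000)}
replica_b = {"foo": Column("foo", "zing", 1001), "baz": Column("baz", "x", 900)}
row = merge_rows(replica_a, replica_b)
print(row["foo"].value)  # zing
```

The result does not depend on which replica is merged into which, so the same merge can be used for read repair in either direction.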
  36. Data model basics: column ordering
      Columns ordered at time of writing, according to Column Family schema
      { column: zebra, value: foo, timestamp: 1000 }
      { column: badger, value: foo, timestamp: 1001 }
      http://cassandra.apache.org/
  37. Data model basics: column ordering
      Columns ordered at time of writing, according to Column Family schema
      { badger: foo, zebra: foo }   with AsciiType column schema
      http://cassandra.apache.org/
  38. Data modeling: user segments
      Add user to bucket X, with expiry time Y
      Which buckets is user X in?
      ["user"][<uuid>][<bucketId>] = 1
      [CF]    [rowKey][columnName] = value
  39. Data modeling: user segments
      user Column Family:
      [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
      [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
      [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
      [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
      [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
      Q: Is user in segment X?
      A: Single column fetch
  40. Data modeling: user segments
      user Column Family:
      [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
      [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
      [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
      [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
      [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
      Q: Which segments is user X in?
      A: Column slice fetch
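The two queries on slides 39 and 40 map onto two operations on a row whose column names are kept sorted. A minimal in-memory sketch of that idea; the `Row` class is hypothetical and only mimics the behaviour, with plain string comparison standing in for an AsciiType comparator and slice bounds treated as inclusive.

```python
import bisect


class Row:
    """Toy wide row: column names kept sorted, as with an AsciiType comparator."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def get(self, name):
        """Single column fetch."""
        return self._values.get(name)

    def slice(self, start="", finish=None):
        """Column slice: columns with start <= name <= finish, in sorted order."""
        lo = bisect.bisect_left(self._names, start)
        hi = len(self._names) if finish is None else bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


users = {}  # row key (user UUID) -> Row
uuid = "f97be9cc-5255-4578-8813-76701c0945bd"
users.setdefault(uuid, Row()).insert("foo", 1)
users.setdefault(uuid, Row()).insert("bar", 1)

print(users[uuid].get("foo") is not None)   # True: the user is in segment foo
print([n for n, _ in users[uuid].slice()])  # ['bar', 'foo']
```

The segment name is the column name and the value is a throwaway 1, so membership tests and listings never need to read and rewrite a serialised list.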
  41. Top Tip #3: With column slices, we get the columns back ordered, according to our schema. We cannot do the same for rows, however, unless we use the Order Preserving Partitioner.
  42. Top Tip #4: Don’t use the Order Preserving Partitioner unless you absolutely have to.
      http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  43. Data modeling: user segments
      Add user to bucket X, with expiry time Y
      Which buckets is user X in?
      ["user"][<uuid>][<bucketId>] = 1
      [CF]    [rowKey][columnName] = value
  44. Expiring columns
      An expiring column will be automatically deleted after n seconds.
      http://cassandra.apache.org/
  45. Data modeling: user segments
      $pool = new ConnectionPool(
          'whyk', array('localhost')
      );
      $users = new ColumnFamily($pool, 'users');
      $users->insert(
          $userUuid,
          array($segment => 1),
          NULL,      // default TS
          $expires   // TTL in seconds
      );
      Using phpcassa client: https://github.com/thobbs/phpcassa
  46. Data modeling: user segments
      UPDATE users USING TTL = 3600
      SET foo = 1
      WHERE KEY = f97be9cc-5255-4578-8813-76701c0945bd
      Using CQL
      http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language
      http://www.datastax.com/docs/1.0/references/cql
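The effect of an expiring column on the "which segments is user X in?" query can be sketched without a cluster. The `ExpiringRow` class below is hypothetical and uses an injected clock so the example is deterministic; it mimics only the visible behaviour (a column stops being returned once its TTL has elapsed), not how Cassandra stores or purges expired data.

```python
class ExpiringRow:
    """Toy row whose columns can carry a time-to-live in seconds."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self._columns = {}    # column name -> (value, expiry time or None)

    def insert(self, name, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._columns[name] = (value, expires_at)

    def live_columns(self):
        """Columns that have not expired yet, in name order."""
        now = self._clock()
        return {
            name: value
            for name, (value, expires_at) in sorted(self._columns.items())
            if expires_at is None or expires_at > now
        }


now = [0]                       # fake clock, advanced by hand below
row = ExpiringRow(clock=lambda: now[0])
row.insert("foo", 1, ttl=3600)  # user is in segment foo for one hour
row.insert("bar", 1)            # no TTL: never expires

print(row.live_columns())       # {'bar': 1, 'foo': 1}
now[0] = 3600
print(row.live_columns())       # {'bar': 1}
```

Re-inserting a column with a fresh TTL simply replaces the old expiry, so a returning user can be kept in a segment with a single write and no prior read.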
  47. Top Tip #5: Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns.
  48. Data modeling: ad performance
      Track overall ad performance; how many clicks/impressions per ad?
      ["ads"][<adId>][<stamp>]["click"] = #
      ["ads"][<adId>][<stamp>]["impression"] = #
      [CF]   [Row]   [S.Col]  [Col] = value
      Using super columns
  49. Top Tip #6: Friends don’t let friends use Super Columns.
      http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  50. Data modeling: ad performance
      Try again using regular columns:
      ["ads"][<adId>][<stamp>-"click"] = #
      ["ads"][<adId>][<stamp>-"impression"] = #
      [CF]   [Row]   [Col] = value
  51. Data modeling: ad performance
      ads Column Family:
      [1][2011103015-click] = 1
      [1][2011103015-impression] = 3434
      [1][2011103016-click] = 12
      [1][2011103016-impression] = 5411
      [1][2011103017-click] = 2
      [1][2011103017-impression] = 345
      Q: Get performance of ad X between two date/times
      A: Column slice against single row specifying a start stamp and end stamp + 1
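The "end stamp + 1" in the answer above follows from how the column names sort: `2011103016-click` sorts after the bare stamp `2011103016`, so the upper bound has to be the stamp of the following hour. A minimal sketch with the data from slide 51; `hour_stamp` and `slice_bounds` are hypothetical helpers, and the slice is emulated with a string comparison over a plain dict.

```python
from datetime import datetime, timedelta


def hour_stamp(moment: datetime) -> str:
    """Hour bucket in the same format as PHP's date('YmdH')."""
    return moment.strftime("%Y%m%d%H")


def slice_bounds(start: datetime, end: datetime):
    """Column-name bounds covering every hour bucket from start to end inclusive."""
    # Names look like '<stamp>-click', which sort after the bare '<stamp>',
    # so the upper bound is the stamp of the hour after `end`.
    return hour_stamp(start), hour_stamp(end + timedelta(hours=1))


# Columns of one ad row, taken from the slide.
ad_row = {
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}

lo, hi = slice_bounds(datetime(2011, 10, 30, 15), datetime(2011, 10, 30, 16))
selected = {name: n for name, n in sorted(ad_row.items()) if lo <= name <= hi}
print(lo, hi)    # 2011103015 2011103017
print(selected)  # the four columns for hours 15 and 16
```

Zero-padded stamps are what make this work: string order and chronological order agree only because every stamp has the same width.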
  52. Think carefully about your data
      This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread.
      Other options: http://rubyscale.com/2011/basic-time-series-with-cassandra/
  53. Counters
      • Distributed atomic counters
      • Easy to use
      • Not idempotent
      http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
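"Not idempotent" matters as soon as a client retries. If an increment is applied but its acknowledgement is lost, a naive retry counts the event twice, whereas repeating a plain column write leaves the same value in place. The sketch below simulates that failure with a hypothetical in-memory store; `FlakyCounterStore` and `with_retry` are made-up names for the example, not part of any client library.

```python
class FlakyCounterStore:
    """Toy store whose first write is applied but reported as a timeout."""

    def __init__(self):
        self.counters = {}
        self.flags = {}
        self._fail_next = True

    def increment(self, key, delta=1):
        self.counters[key] = self.counters.get(key, 0) + delta
        self._maybe_timeout()

    def set_flag(self, key, value):
        self.flags[key] = value
        self._maybe_timeout()

    def _maybe_timeout(self):
        if self._fail_next:
            self._fail_next = False
            raise TimeoutError("write applied, but the acknowledgement was lost")


def with_retry(operation):
    """Run operation, retrying once if it times out."""
    try:
        operation()
    except TimeoutError:
        operation()


store = FlakyCounterStore()
with_retry(lambda: store.increment("2011103015-impression"))
print(store.counters["2011103015-impression"])  # 2: one impression counted twice

store2 = FlakyCounterStore()
with_retry(lambda: store2.set_flag("foo", 1))
print(store2.flags["foo"])  # 1: repeating a plain write is harmless
```

For ad statistics a small over-count on retry may be acceptable; for anything resembling billing it is worth deciding up front whether to retry counter writes at all.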
  54. Data modeling: ad performance
      $stamp = date('YmdH');
      $ads->add(
          $adId,                 // row key
          "$stamp-impression",   // column
          1                      // increment
      );
      We’ll store performance metrics in hour buckets for graphing.
  55. Data modeling: ad performance
      UPDATE ads
      SET '2011103015-impression' = '2011103015-impression' + 1
      WHERE KEY = '1'
  56. Data modeling: performance/segment
      We can add in another dimension to our stats so we can break down by segment.
      ["ads"][<adId>][<stamp>-<segment>-"click"] = #
      [CF]   [Row]   [Col] = value
  57. Data modeling: performance/segment
      ads Column Family:
      [1][2011103015-bar-click] = 1
      [1][2011103015-bar-impression] = 3434
      [1][2011103015-foo-click] = 12
      [1][2011103015-foo-impression] = 5411
      [1][2011103016-bar-click] = 2
      Q: Get performance of ad X between two date/times, split by segment
      A: Column slice against single row specifying a start stamp and end stamp + 1
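Once the slice comes back, splitting it by segment is a matter of parsing the compound column names. A minimal sketch over the columns from slide 57; `breakdown` is a hypothetical helper, and it assumes the stamp never contains a hyphen and the metric is always the last hyphen-separated part, so segment names may themselves contain hyphens.

```python
from collections import defaultdict


def breakdown(columns):
    """Aggregate '<stamp>-<segment>-<metric>' counter columns by segment."""
    totals = defaultdict(lambda: {"click": 0, "impression": 0})
    for name, count in columns:
        _stamp, rest = name.split("-", 1)       # drop the hour bucket
        segment, metric = rest.rsplit("-", 1)   # metric is the final part
        totals[segment][metric] += count
    return dict(totals)


# Result of a column slice against one ad row, taken from the slide.
sliced = [
    ("2011103015-bar-click", 1),
    ("2011103015-bar-impression", 3434),
    ("2011103015-foo-click", 12),
    ("2011103015-foo-impression", 5411),
    ("2011103016-bar-click", 2),
]
stats = breakdown(sliced)
print(stats["bar"])  # {'click': 3, 'impression': 3434}
print(stats["foo"])  # {'click': 12, 'impression': 5411}
```

Putting the stamp first keeps the time-range slice cheap; the price is that a per-segment view has to be assembled client-side like this, or stored in a separate row keyed for that query.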
  58. Data modeling: performance/segment
      $stamp = date('YmdH');
      $ads->add(
          "$adId-segments",               // row key
          "$stamp-$segment-impression",   // column
          1                               // increment
      );
      We’ll store performance metrics in hour buckets for graphing.
  59. Data modeling: segment stats
      Track overall clicks/impressions per bucket; which buckets are most clicky?
      ["segments"][<adId>-"segments"][<stamp>-<segment>-"click"] = #
      [CF]        [Row]              [Col] = value
  60. Recap: Data modeling
      • Think about the queries, work backwards
      • Don’t overuse single rows; try to spread the load
      • Don’t use super columns
      • Ask on IRC! #cassandra
  61. Recap: Common data modeling patterns
      1. Using column names with no value
      [cf][rowKey][columnName] = 1
  62. Recap: Common data modeling patterns
      2. Counters
      [cf][rowKey][columnName]++
  63. And also…
      3. Serialising a whole object
      [cf][rowKey][columnName] = { foo: 3, bar: 11 }
  64. There’s more: Brisk
      Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra.
      DataStax now offer this functionality in their “Enterprise” product.
      http://www.datastax.com/products/enterprise
  65. Hive
      CREATE EXTERNAL TABLE tempUsers
          (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
          "cassandra.columns.mapping" = ":key,:column,:value",
          "cassandra.cf.name" = "users"
      );
      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
  66. There’s more: Supercharged Cassandra
      Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads. Includes instant snapshot of CFs.
      http://www.acunu.com/products/choosing-cassandra/
  67. In conclusion: Cassandra is founded on sound design principles.
  68. In conclusion: The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful.
  69. In conclusion: The clients are getting better; CQL is a step forward.
  70. In conclusion: Hadoop integration means we can analyse data directly from a Cassandra cluster.
  71. In conclusion: Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes.
  72. Thanks
      Learn more about Cassandra: meetup.com/Cassandra-London
      Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys
      Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations