Cassandra's Sweet Spot - an introduction to Apache Cassandra

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.

Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

Transcript

  • 1. Cassandra’s sweet spot Dave Gardner @davegardnerisme
  • 2. jobs.hailocab.com Looking for an expert backend Java dev – speak to me! meetup.com/Cassandra-London Next event 21st November
  • 3. Building applications with Cassandra • Key features • Creating an application • Data modeling
  • 4. Comparing Cassandra with X “Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I dont know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?” 27th July 2010 http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  • 5. Comparing Cassandra with X “They have approximately nothing in common. And, no, Cassandra is definitely not dying off.” 28th July 2010 http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
  • 6. Top Tip #1 To use a NoSQL solution effectively, we need to identify its sweet spot.
  • 7. Top Tip #1 To use a NoSQL solution effectively, we need to identify its sweet spot. This means learning about each solution; how is it designed? What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  • 8. Comparing Cassandra with X “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black – NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  • 9. Headline features 1. Elastic Read and write throughput increases linearly as new machines are added http://cassandra.apache.org/
  • 10. Headline features 2. Decentralised Fault tolerant with no single point of failure; no “master” node http://cassandra.apache.org/
  • 11. The dynamo paper • Consistent hashing • Vector clocks • Gossip protocol • Hinted handoff • Read repair http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  • 12. The dynamo paper [Diagram: a six-node ring with RF = 3; the client sends a request to a coordinator node, which forwards it to the three replica nodes.]
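A minimal PHP sketch of the replica-placement idea in the diagram, assuming a simplified ring of six evenly spaced nodes and ignoring real partitioner tokens and snitches:

    // Hypothetical simplification: hash the row key onto a 6-position ring,
    // then replicate to the next RF nodes clockwise.
    $nodes = array(1, 2, 3, 4, 5, 6);   // node ids around the ring
    $rf    = 3;                         // replication factor

    function replicasFor($rowKey, $nodes, $rf) {
        $position = hexdec(substr(md5($rowKey), 0, 8)) % count($nodes);
        $replicas = array();
        for ($i = 0; $i < $rf; $i++) {
            $replicas[] = $nodes[($position + $i) % count($nodes)];
        }
        return $replicas;
    }

    print_r(replicasFor('f97be9cc-5255-4578-8813-76701c0945bd', $nodes, $rf));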
  • 13. Headline features 3. Rich data model Column based, range slices, column slices, secondary indexes, counters, expiring columns http://cassandra.apache.org/
  • 14. The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
  • 15. The big table paper [Diagram: a Column Family maps each row key to a set of columns; each column is a name/value pair.]
  • 16. Headline features 4. You’re in control Tunable consistency, per operation http://cassandra.apache.org/
  • 17. Consistency levels How many replicas must respond to declare success?
  • 18. Consistency levels: write operations
    ANY – one node, including hinted handoff
    ONE – one node
    QUORUM – N/2 + 1 replicas
    LOCAL_QUORUM – N/2 + 1 replicas in the local data centre
    EACH_QUORUM – N/2 + 1 replicas in each data centre
    ALL – all replicas
    http://wiki.apache.org/cassandra/API#Write
  • 19. Consistency levels: read operations
    ONE – first response
    QUORUM – N/2 + 1 replicas
    LOCAL_QUORUM – N/2 + 1 replicas in the local data centre
    EACH_QUORUM – N/2 + 1 replicas in each data centre
    ALL – all replicas
    http://wiki.apache.org/cassandra/API#Read
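A quick worked example of what these levels buy you, assuming RF = 3: QUORUM is N/2 + 1 = 2 replicas, and picking read and write levels so that R + W > RF guarantees the read set overlaps the latest write.

    $rf     = 3;                        // replication factor
    $quorum = (int) floor($rf / 2) + 1; // N/2 + 1 = 2 for RF = 3

    $w = $quorum;                       // write at QUORUM
    $r = $quorum;                       // read at QUORUM

    // If R + W > RF, at least one replica in every read set holds the latest write.
    var_dump($r + $w > $rf);            // bool(true)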
  • 20. Headline features 5. Performant Well known for high write performance http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
  • 21. Benchmark* http://blog.cubrid.org/dev-platform/nosql-benchmarking/ * Add pinch of salt
  • 22. Recap: headline features 1. Elastic 2. Decentralised 3. Rich data model 4. You’re in control (tunable consistency) 5. Performant
  • 23. A simple ad-targeting application [Diagram: some ads + our user knowledge → choose which ad to show]
  • 24. A simple ad-targeting application Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets) http://pixel.wehaveyourkidneys.com/add.php?add=foo
  • 25. A simple ad-targeting application Record clicks and impressions of each ad; storing data per-ad and per-segment http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1 http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
  • 26. A simple ad-targeting application Real-time ad performance analytics, broken down by segment (which segments are performing well?) http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
  • 27. A simple ad-targeting application Recommendations based on best- performing ads (this is left as an exercise for the reader)
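A hypothetical sketch of what a pixel endpoint such as add.php might look like; the include paths, cookie handling, keyspace name and one-hour TTL are illustrative assumptions rather than the actual we-have-your-kidneys code, and the insert-with-TTL call is the phpcassa usage shown later in the deck.

    <?php
    // Hypothetical pixel endpoint: /add.php?add=<segment>
    require_once 'phpcassa/connection.php';     // assumed 2011-era phpcassa include paths
    require_once 'phpcassa/columnfamily.php';

    $segment  = $_GET['add'];
    $userUuid = isset($_COOKIE['uuid']) ? $_COOKIE['uuid'] : uniqid('', true); // stand-in for a real UUID cookie
    setcookie('uuid', $userUuid, time() + 86400 * 365);

    $pool  = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    $users->insert($userUuid, array($segment => 1), NULL, 3600); // column expires after an hour

    header('Content-Type: image/gif');          // respond with a 1x1 tracking gif here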
  • 28. Additional requirements • Large number of users • High volume of impressions • Highly available – downtime is money
  • 29. A good fit for Cassandra? Yes! Big data, high availability and lots of writes are all good signs that Cassandra will fit well. http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html
  • 30. A good fit for Cassandra? That said, people use Cassandra for many other things too – for example, highly available HTTP request routing (tiny data!) http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
  • 31. Top Tip #2 Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.
  • 32. Demo Live demo before we start
  • 33. Data modeling Start from your queries, work backwards http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
  • 34. Data model basics: conflict resolution Per-column timestamp-based conflict resolution: { column: foo, value: bar, timestamp: 1000 } vs { column: foo, value: zing, timestamp: 1001 } http://cassandra.apache.org/
  • 35. Data model basics: conflict resolution Per-column timestamp-based conflict resolution: { column: foo, value: bar, timestamp: 1000 } vs { column: foo, value: zing, timestamp: 1001 } – the later timestamp wins http://cassandra.apache.org/
  • 36. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema: { column: zebra, value: foo, timestamp: 1000 } then { column: badger, value: foo, timestamp: 1001 } http://cassandra.apache.org/
  • 37. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema: { badger: foo, zebra: foo } with AsciiType column schema http://cassandra.apache.org/
  • 38. Data modeling: user segments Add user to bucket X, with expiry time Y Which buckets is user X in? ["user"][<uuid>][<bucketId>] = 1 [CF] [rowKey] [columnName] = value
  • 39. Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Is user in segment X? A: Single column fetch
  • 40. Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Which segments is user X in? A: Column slice fetch
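The two questions above map onto phpcassa reads roughly as follows, assuming the old positional get() signature (key, columns, column_start, column_finish) of the 2011-era client:

    // Q: Is user X in segment 'foo'?  – single column fetch
    $result = $users->get($userUuid, array('foo'));

    // Q: Which segments is user X in?  – column slice fetch (the whole row)
    $segments = array_keys($users->get($userUuid));
    // Note: slices return a limited number of columns by default (100 in older
    // phpcassa), so very wide rows would need paging.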
  • 41. Top Tip #3 With column slices, we get the columns back ordered, according to our schema. We cannot do the same for rows, however, unless we use the Order Preserving Partitioner
  • 42. Top Tip #4 Don’t use the Order Preserving Partitioner unless you absolutely have to http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  • 43. Data modeling: user segments Add user to bucket X, with expiry time Y Which buckets is user X in? ["user"][<uuid>][<bucketId>] = 1 [CF] [rowKey] [columnName] = value
  • 44. Expiring columns An expiring column will be automatically deleted after n seconds http://cassandra.apache.org/
  • 45. Data modeling: user segments
    $pool  = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    $users->insert(
        $userUuid,
        array($segment => 1),
        NULL,      // default timestamp
        $expires   // TTL in seconds
    );
    Using phpcassa client: https://github.com/thobbs/phpcassa
  • 46. Data modeling: user segments
    UPDATE users USING TTL = 3600
    SET 'foo' = '1'
    WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
    Using CQL
    http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language
    http://www.datastax.com/docs/1.0/references/cql
  • 47. Top Tip #5 Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns
  • 48. Data modeling: ad performance Track overall ad performance; how many clicks/impressions per ad? ["ads"][<adId>][<stamp>]["click"] = # ["ads"][<adId>][<stamp>]["impression"] = # [CF] [Row] [S.Col] [Col] = value Using super columns
  • 49. Top Tip #6 Friends don’t let friends use Super Columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  • 50. Data modeling: ad performance Try again using regular columns: ["ads"][<adId>][<stamp>-"click"] = # ["ads"][<adId>][<stamp>-"impression"] = # [CF] [Row] [Col] = value
  • 51. Data modeling: ad performance ads Column Family: [1][2011103015-click] = 1 [1][2011103015-impression] = 3434 [1][2011103016-click] = 12 [1][2011103016-impression] = 5411 [1][2011103017-click] = 2 [1][2011103017-impression] = 345 Q: Get performance of ad X between two date/times A: Column slice against single row specifying a start stamp and end stamp + 1
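That slice could be expressed with phpcassa roughly as below, again assuming the old positional get() signature; the “end stamp + 1” finish bound keeps every column from the final hour:

    // Performance of ad 1 between 2011103015 and 2011103017 inclusive
    $stats = $ads->get(
        '1',            // row key: the ad id
        NULL,           // no explicit column list – slice instead
        '2011103015',   // column_start
        '2011103018'    // column_finish: end stamp + 1
    );
    // e.g. array('2011103015-click' => 1, '2011103015-impression' => 3434, ...)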
  • 52. Think carefully about your data This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread. Other options: http://rubyscale.com/2011/basic-time-series-with-cassandra/
  • 53. Counters • Distributed atomic counters • Easy to use • Not idempotent http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
  • 54. Data modeling: ad performance
    $stamp = date('YmdH');
    $ads->add(
        $adId,                // row key
        "$stamp-impression",  // column
        1                     // increment
    );
    We’ll store performance metrics in hour buckets for graphing.
  • 55. Data modeling: ad performance
    UPDATE ads
    SET '2011103015-impression' = '2011103015-impression' + 1
    WHERE KEY = '1'
  • 56. Data modeling: performance/segment We can add another dimension to our stats so we can break down by segment. ["ads"][<adId>] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
  • 57. Data modeling: performance/segment ads Column Family: [1][2011103015-bar-click] = 1 [1][2011103015-bar-impression] = 3434 [1][2011103015-foo-click] = 12 [1][2011103015-foo-impression] = 5411 [1][2011103016-bar-click] = 2 Q: Get performance of ad X between two date/times, split by segment A: Column slice against single row specifying a start stamp and end stamp + 1
  • 58. Data modeling: performance/segment
    $stamp = date('YmdH');
    $ads->add(
        "$adId-segments",              // row key
        "$stamp-$segment-impression",  // column
        1                              // increment
    );
    We’ll store performance metrics in hour buckets for graphing.
  • 59. Data modeling: segment stats Track overall clicks/impressions per bucket; which buckets are most clicky? ["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
  • 60. Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
  • 61. Recap: Common data modeling patterns 1. Using column names with no value [cf][rowKey][columnName] = 1
  • 62. Recap: Common data modeling patterns 2. Counters [cf][rowKey][columnName]++
  • 63. And also… 3. Serialising a whole object [cf][rowKey][columnName] = { foo: 3, bar: 11 }
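Pattern 3 usually just means JSON-encoding the object into a single column value; a minimal sketch using a hypothetical profiles column family:

    // Write: store the whole object as one JSON blob
    $profiles = new ColumnFamily($pool, 'profiles');   // hypothetical CF
    $profiles->insert($userUuid, array('profile' => json_encode(array('foo' => 3, 'bar' => 11))));

    // Read: decode it again
    $row     = $profiles->get($userUuid, array('profile'));
    $profile = json_decode($row['profile'], true);     // array('foo' => 3, 'bar' => 11)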
  • 64. There’s more: Brisk Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra. DataStax now offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
  • 65. Hive
    CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES (
        "cassandra.columns.mapping" = ":key,:column,:value",
        "cassandra.cf.name" = "users"
    );
    SELECT segmentId, count(1) AS total
    FROM tempUsers
    GROUP BY segmentId
    ORDER BY total DESC;
  • 66. There’s more: Supercharged Cassandra Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads. Includes instant snapshots of CFs http://www.acunu.com/products/choosing-cassandra/
  • 67. In conclusion Cassandra is founded on sound design principles
  • 68. In conclusion The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful
  • 69. In conclusion The clients are getting better; CQL is a step forward
  • 70. In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
  • 71. In conclusion Cassandra’s sweet spot is highly available “big data” (especially time- series) with large numbers of writes
  • 72. Thanks Learn more about Cassandra: meetup.com/Cassandra-London Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations