
Learning Cassandra

Context for choosing NoSQL, Cassandra basics, plus some basic data-modelling patterns and anti-patterns.



  1. 1. Learning Cassandra. Dave Gardner, @davegardnerisme
  2. 2. What I’m going to cover • How to NoSQL • Cassandra basics (dynamo and big table) • How to use the data model in real life
  3. 3. How to NoSQL 1. Find data store that doesn’t use SQL 2. Anything 3. Cram all the things into it 4. Triumphantly blog this success 5. Complain a month later when it bursts into flames http://www.slideshare.net/rbranson/how-do-i-cassandra/4
  4. 4. Choosing NoSQL “NoSQL DBs trade off traditional features to better support new and emerging use cases” http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
  5. 5. Choosing Cassandra: Tradeoffs. You trade more widely used, tested and documented software (MySQL's first open-source release was in 1998) for a relatively immature product (Cassandra was first open-sourced in 2008).
  6. 6. Choosing Cassandra: Tradeoffs. You trade ad-hoc querying (SQL's JOIN, GROUP BY, HAVING, ORDER BY) for a rich data model with limited ad-hoc querying ability: Cassandra makes you denormalise.
  7. 7. Choosing NoSQL “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black, NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  8. 8. What do we get in return? Proven horizontal scalability Cassandra scales reads and writes linearly as new nodes are added
  9. 9. Netflix benchmark: linear scaling http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  10. 10. What do we get in return? High availability Cassandra is fault-resistant with tunable consistency levels
  11. 11. What do we get in return? Low latency, solid performance Cassandra has very good write performance
  12. 12. Performance benchmark * http://blog.cubrid.org/dev-platform/nosql-benchmarking/ * Add a pinch of salt
  13. 13. What do we get in return? Operational simplicity. Homogeneous cluster, no “master” node, no single point of failure (SPOF)
  14. 14. What do we get in return? Rich data model Cassandra is more than simple key- value – columns, composites, counters, secondary indexes
  15. 15. How to NoSQL version 2 Learn about each solution • What tradeoffs are you making? • How is it designed? • What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  16. 16. Amazon Dynamo + Google Big Table
      From Dynamo: consistent hashing, vector clocks (* not in Cassandra), gossip protocol, hinted handoff, read repair (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
      From Big Table: columnar data model, SSTable storage, append-only commit log, memtable, compaction (http://labs.google.com/papers/bigtable-osdi06.pdf)
  17. 17. The dynamo paper [ring diagram: six nodes on a token ring; tokens are integers from 0 to 2^127; a client connects to the ring]
  18. 18. The dynamo paper [ring diagram: consistent hashing; the client sends its request to a coordinator node on the ring]
  19. 19. Consistency levels How many replicas must respond to declare success?
  20. 20. Consistency levels: read operations
      ONE: 1st response
      QUORUM: N/2 + 1 replicas
      LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
      EACH_QUORUM: N/2 + 1 replicas in each data centre
      ALL: all replicas
      http://wiki.apache.org/cassandra/API#Read
  21. 21. Consistency levels: write operations
      ANY: one node, including hinted handoff
      ONE: one node
      QUORUM: N/2 + 1 replicas
      LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
      EACH_QUORUM: N/2 + 1 replicas in each data centre
      ALL: all replicas
      http://wiki.apache.org/cassandra/API#Write
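      Worked example of the quorum arithmetic: with replication factor N = 3, quorum = floor(N/2) + 1 = floor(3/2) + 1 = 2, so a QUORUM read or write succeeds once 2 of the 3 replicas respond and tolerates one replica being down. Because QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), at least one replica in every read quorum has seen the latest write.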
  22. 22. The dynamo paper [ring diagram: RF = 3, CL = ONE; the coordinator sends the write to all three replicas and acknowledges after the first response]
  23. 23. The dynamo paper [ring diagram: RF = 3, CL = QUORUM; the coordinator acknowledges after two of the three replicas respond]
  24. 24. The dynamo paper [ring diagram: RF = 3, CL = ONE, plus a hint; one replica is down, so the coordinator stores a hint and replays it later (hinted handoff)]
  25. 25. The dynamo paper [ring diagram: RF = 3, CL = ONE, plus read repair; stale replicas are repaired in the background after a read]
  26. 26. The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
  27. 27. The big table paper [diagram: a Column is a name and a value, plus a timestamp]
  28. 28. The big table paper [diagram: a row can have millions of columns * (* theoretically up to 2 billion)]
  29. 29. The big table paper [diagram: a Row is a row key plus its columns]
  30. 30. The big table paper [diagram: a Column Family holds many rows, each a row key with its columns; we can have billions of rows]
  31. 31. The big table paper [diagram: a write goes to the commit log (disk) and the memtable (memory); the memtable is flushed on a time/size trigger, producing immutable SSTable files]
  32. 32. Data model basics: conflict resolution. Per-column timestamp-based conflict resolution:
      { column: foo, value: bar, timestamp: 1000 }
      { column: foo, value: zing, timestamp: 1001 }
      http://cassandra.apache.org/
  33. 33. Data model basics: conflict resolution. The column with the bigger timestamp wins:
      { column: foo, value: zing, timestamp: 1001 }
      http://cassandra.apache.org/
  34. 34. Data model basics: column ordering. Columns are ordered at time of writing, according to the Column Family schema:
      { column: zebra, value: foo, timestamp: 1000 }
      { column: badger, value: foo, timestamp: 1001 }
      http://cassandra.apache.org/
  35. 35. Data model basics: column ordering. With an AsciiType column schema the row is stored as:
      { badger: foo, zebra: foo }
      http://cassandra.apache.org/
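      A hedged sketch of declaring that ordering in the schema, using CQL 2-era options (the comparator/default_validation option names are an assumption from that dialect, not from the deck):

      -- column names compared and sorted as ASCII (AsciiType); values validated as ASCII
      CREATE COLUMNFAMILY users (KEY ascii PRIMARY KEY)
        WITH comparator = ascii AND default_validation = ascii;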
  36. 36. Key point Each “query” can be answered from a single slice of disk (once compaction has finished)
  37. 37. Data modeling – 1000ft introduction • Start from your queries and work backwards • Denormalise in the application (store data more than once) http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
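      As a hedged sketch of what “denormalise in the application” means in practice: the same fact is written once per query it has to answer. Both column families below are hypothetical, written in the CQL style the deck uses later:

      -- queries: which ads has this user seen? / which users saw this ad?
      UPDATE adsByUser SET 'ad-1' = 1 WHERE KEY = '<user uuid>';
      UPDATE usersByAd SET '<user uuid>' = 1 WHERE KEY = 'ad-1';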
  38. 38. Pattern 1: not using the value. Storing that user X is in bucket Y. Row key: f97be9cc-5255-457… Column name: foo Value: 1 (we don’t really care about the value) https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
  39. 39. Pattern 1: not using the value. Q: is user X in bucket foo? A: single column fetch.
      f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }
      06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }
      503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
  40. 40. Pattern 1: not using the value. Q: which buckets is user X in? A: column slice fetch.
      f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }
      06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }
      503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
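      In the CQL 2 dialect of the era, these two reads might look like the following; the FIRST/range-slice syntax here is an assumption, so treat it as a sketch:

      -- single column fetch: is user X in bucket foo?
      SELECT foo FROM users WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd';
      -- column slice fetch: which buckets is user X in? (empty bounds = all columns)
      SELECT FIRST 100 ''..'' FROM users WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd';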
  41. 41. Pattern 1: not using the value. We could also use expiring columns to automatically delete columns N seconds after insertion: UPDATE users USING TTL = 3600 SET foo = 1 WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
  42. 42. Pattern 2: counters Real-time analytics to count clicks/impressions of ads in hourly buckets Row key: 1 Column name: 2011103015-click Value: 34 https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
  43. 43. Pattern 2: counters. Increment by 1 using CQL: UPDATE ads SET '2011103015-impression' = '2011103015-impression' + 1 WHERE KEY = '1'
  44. 44. Pattern 2: counters. Q: how many clicks/impressions for ad 1 over a time range? A: column slice fetch, between column X and Y.
      1: { 2011103015-click: 1, 2011103015-impression: 3434, 2011103016-click: 12, 2011103016-impression: 5411, 2011103017-click: 2, 2011103017-impression: 345 }
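      In the same CQL 2 style, the time-range question could be a single range-slice read; the exact slice syntax is assumed:

      -- counters for ad 1 between hour-bucket columns X and Y
      SELECT '2011103015-click'..'2011103017-impression' FROM ads WHERE KEY = '1';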
  45. 45. Pattern 3: time series Store canonical reference of impressions and clicks Row key: 20111030 Column name: <time UUID> Value: {json} Cassandra can order columns by time http://rubyscale.com/2011/basic-time-series-with-cassandra/
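      A minimal sketch of such a write in the deck's CQL style (the column family name events is illustrative, and the time UUID placeholder is kept as the deck writes it):

      -- one column per event; TimeUUID column names keep events time-ordered within the daily row
      UPDATE events SET '<time UUID>' = '{json}' WHERE KEY = '20111030';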
  46. 46. Pattern 4: object properties as columns Store user properties such as name, email, etc. Row key: f97be9cc-5255-457… Column name: name Value: Bob Foo-Bar http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
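      Sketched in the deck's CQL style (the email column, its value and the key placeholder are illustrative, not from the deck):

      -- each object property is its own column on the user's row
      UPDATE users SET name = 'Bob Foo-Bar', email = 'bob@example.com' WHERE KEY = '<user uuid>';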
  47. 47. Anti-pattern 1: read-before-write. Don’t read a row, modify it in the application and write the whole thing back; instead store properties as independent columns and mutate them individually (see pattern 4)
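      A minimal sketch of the fix, in the deck's CQL style (column name and key are placeholders):

      -- don't: read the whole row, change one field in the app, write it all back
      -- do: blind-write only the column that changed
      UPDATE users SET email = 'new@example.com' WHERE KEY = '<user uuid>';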
  48. 48. Anti-pattern 2: super columns Friends don’t let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  49. 49. Anti-pattern 3: OPP The Order Preserving Partitioner unbalances your load and makes your life harder http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  50. 50. Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
  51. 51. There’s more: Brisk. Integrated Hadoop distribution (without HDFS installed); run Hive and Pig queries directly against Cassandra. DataStax offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
  52. 52. Hive: SQL-like interface to Hadoop
      CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.columns.mapping" = ":key,:column,:value",
        "cassandra.cf.name" = "users"
      );
      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
  53. 53. In conclusion Cassandra is founded on sound design principles
  54. 54. In conclusion The data model is incredibly powerful
  55. 55. In conclusion CQL and a new breed of clients are making it easier to use
  56. 56. In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
  57. 57. In conclusion There is a strong community and multiple companies offering professional support
  58. 58. Thanks. Looking for a job?
      Learn more about Cassandra: meetup.com/Cassandra-London
      Sample ad-targeting project on GitHub: https://github.com/davegardnerisme/we-have-your-kidneys
      Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations
