Learning Cassandra

Context for choosing NoSQL, Cassandra basics, plus some basic data modelling patterns and anti-patterns.

  • Transcript

    1. Learning Cassandra. Dave Gardner (@davegardnerisme)
    2. What I’m going to cover: • How to NoSQL • Cassandra basics (Dynamo and Bigtable) • How to use the data model in real life
    3. How to NoSQL: 1. Find a data store that doesn’t use SQL 2. Anything 3. Cram all the things into it 4. Triumphantly blog this success 5. Complain a month later when it bursts into flames. http://www.slideshare.net/rbranson/how-do-i-cassandra/4
    4. Choosing NoSQL: “NoSQL DBs trade off traditional features to better support new and emerging use cases” http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
    5. Choosing Cassandra: tradeoffs. You trade more widely used, tested and documented software (MySQL’s first open-source release was in 1998) for a relatively immature product (Cassandra was first open-sourced in 2008).
    6. Choosing Cassandra: tradeoffs. You trade ad-hoc querying (SQL’s join, group by, having, order by) for a rich data model with limited ad-hoc querying ability: Cassandra makes you denormalise.
    7. Choosing NoSQL: “They say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black, NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
    8. What do we get in return? Proven horizontal scalability: Cassandra scales reads and writes linearly as new nodes are added.
    9. Netflix benchmark: linear scaling. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
    10. What do we get in return? High availability: Cassandra is fault-resistant, with tunable consistency levels.
    11. What do we get in return? Low latency and solid performance: Cassandra has very good write performance.
    12. Performance benchmark* http://blog.cubrid.org/dev-platform/nosql-benchmarking/ (* add a pinch of salt)
    13. What do we get in return? Operational simplicity: a homogeneous cluster with no “master” node and no single point of failure (SPOF).
    14. What do we get in return? Rich data model: Cassandra is more than simple key-value, with columns, composites, counters and secondary indexes.
    15. How to NoSQL, version 2: learn about each solution. • What tradeoffs are you making? • How is it designed? • What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
    16. Amazon Dynamo + Google Bigtable.
        From Dynamo: consistent hashing, vector clocks (* not in Cassandra), gossip protocol, hinted handoff, read repair. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
        From Bigtable: columnar data model, SSTable storage, append-only commit log, memtable, compaction. http://labs.google.com/papers/bigtable-osdi06.pdf
    17. The Dynamo paper. [Ring diagram: six nodes; tokens are integers from 0 to 2^127.]
    18. The Dynamo paper. [Ring diagram: consistent hashing; the client’s request is handled by a coordinator node on the ring.]
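       A rough Python sketch of the token ring, assuming evenly spaced tokens and illustrative node names (deriving the token from an MD5 hash matches Cassandra’s RandomPartitioner; everything else here is simplified):

           import hashlib
           from bisect import bisect_left

           RING = 2 ** 127  # token space of the RandomPartitioner

           def token(key):
               # The token is derived from an MD5 hash of the row key
               return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING

           class Cluster:
               def __init__(self, node_tokens):
                   # node_tokens: {node name: token}; kept sorted for ring walks
                   self.ring = sorted((t, n) for n, t in node_tokens.items())
                   self.tokens = [t for t, _ in self.ring]

               def replicas(self, key, rf=3):
                   # First replica: the node whose token is next clockwise from
                   # the key's token; RF-1 successors complete the replica set.
                   i = bisect_left(self.tokens, token(key)) % len(self.ring)
                   return [self.ring[(i + k) % len(self.ring)][1] for k in range(rf)]

           # Six nodes with evenly spaced tokens, as in the diagram
           nodes = {"node%d" % (i + 1): i * RING // 6 for i in range(6)}
           print(Cluster(nodes).replicas("f97be9cc-5255-4578-8813-76701c0945bd"))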
    19. Consistency levels: how many replicas must respond to declare success?
    20. Consistency levels: read operations.
        ONE: first response
        QUORUM: N/2 + 1 replicas
        LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
        EACH_QUORUM: N/2 + 1 replicas in each data centre
        ALL: all replicas
        http://wiki.apache.org/cassandra/API#Read
    21. Consistency levels: write operations.
        ANY: one node, including hinted handoff
        ONE: one node
        QUORUM: N/2 + 1 replicas
        LOCAL_QUORUM: N/2 + 1 replicas in the local data centre
        EACH_QUORUM: N/2 + 1 replicas in each data centre
        ALL: all replicas
        http://wiki.apache.org/cassandra/API#Write
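       The quorum arithmetic as one small, illustrative Python function (single data centre; the function name is mine):

           def required_acks(level, rf):
               # Replica responses the coordinator needs before declaring success
               quorum = rf // 2 + 1  # N/2 + 1, with integer division
               return {"ONE": 1, "QUORUM": quorum, "ALL": rf}[level]

           # With RF=3, QUORUM needs 2 of 3 replicas. Reading and writing both
           # at QUORUM guarantees overlap (2 + 2 > 3), so a quorum read always
           # touches at least one replica holding the latest quorum write.
           for level in ("ONE", "QUORUM", "ALL"):
               print(level, required_acks(level, rf=3))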
    22. The Dynamo paper. [Ring diagram: RF = 3, CL = ONE; the coordinator declares success after one replica acknowledges.]
    23. The Dynamo paper. [Ring diagram: RF = 3, CL = QUORUM; the coordinator waits for two of the three replicas.]
    24. The Dynamo paper. [Ring diagram: RF = 3, CL = ONE, with a hint stored for a down replica (hinted handoff).]
    25. The Dynamo paper. [Ring diagram: RF = 3, CL = ONE, with read repair reconciling the replicas after a read.]
    26. The Bigtable paper. • Sparse “columnar” data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction. http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
    27. The Bigtable paper. [Diagram: a column is a name and a value, plus a timestamp.]
    28. The Bigtable paper. We can have millions of columns per row (* theoretically up to 2 billion). [Diagram: a row of many name/value columns.]
    29. The Bigtable paper. [Diagram: a row is a row key plus its columns.]
    30. The Bigtable paper. [Diagram: a column family holds many rows, each a row key with its columns.] We can have billions of rows.
    31. The Bigtable paper. Write path: a write goes to the append-only commit log (disk) and the memtable (memory); on a time/size trigger the memtable is flushed to an immutable SSTable file.
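       A toy Python model of that write path, with in-memory lists standing in for disk and all names illustrative:

           import json

           class Node:
               def __init__(self, flush_at=1000):
                   self.commit_log = []  # stand-in for the append-only log on disk
                   self.memtable = {}    # in-memory buffer: {row key: {column: value}}
                   self.sstables = []    # immutable sorted "files", oldest first
                   self.flush_at = flush_at

               def write(self, row_key, column, value):
                   self.commit_log.append(json.dumps([row_key, column, value]))
                   self.memtable.setdefault(row_key, {})[column] = value
                   if len(self.memtable) >= self.flush_at:
                       self.flush()

               def flush(self):
                   # Dump the memtable, sorted by row key and column name, as a
                   # new SSTable and never modify it again; compaction would later
                   # merge SSTables so a row ends up in one slice of disk.
                   rows = {k: dict(sorted(v.items()))
                           for k, v in sorted(self.memtable.items())}
                   self.sstables.append(rows)
                   self.memtable = {}
                   self.commit_log = []  # safe to recycle once data is flushed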
    32. Data model basics: conflict resolution. Per-column timestamp-based conflict resolution: { column: foo, value: bar, timestamp: 1000 } versus { column: foo, value: zing, timestamp: 1001 }. http://cassandra.apache.org/
    33. Data model basics: conflict resolution. The column with the bigger timestamp wins: { column: foo, value: zing, timestamp: 1001 }. http://cassandra.apache.org/
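       The same rule as a few lines of Python (the function itself is just illustrative):

           def resolve(a, b):
               # Last write wins, per column: the larger timestamp survives.
               # (On a timestamp tie Cassandra compares the values as a tie-break.)
               return a if a["timestamp"] >= b["timestamp"] else b

           older = {"column": "foo", "value": "bar", "timestamp": 1000}
           newer = {"column": "foo", "value": "zing", "timestamp": 1001}
           assert resolve(older, newer)["value"] == "zing"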
    34. Data model basics: column ordering. Columns are ordered at time of writing, according to the column family schema: { column: zebra, value: foo, timestamp: 1000 } then { column: badger, value: foo, timestamp: 1001 }. http://cassandra.apache.org/
    35. Data model basics: column ordering. With an AsciiType column schema the row is stored as { badger: foo, zebra: foo }. http://cassandra.apache.org/
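       Illustrated in Python, with a plain byte-wise sort standing in for the AsciiType comparator:

           # The comparator decides column order; AsciiType orders by the raw
           # bytes of the name, so "badger" sorts before "zebra" regardless of
           # the order the writes arrived in.
           writes = [("zebra", "foo", 1000), ("badger", "foo", 1001)]
           row = {name: value for name, value, _ in writes}
           print(sorted(row.items()))  # [('badger', 'foo'), ('zebra', 'foo')]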
    36. Key point: each “query” can be answered from a single slice of disk (once compaction has finished).
    37. Data modelling: 1,000ft introduction. • Start from your queries and work backwards • Denormalise in the application (store data more than once). http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
    38. Pattern 1: not using the value. Storing that user X is in bucket Y. Row key: f97be9cc-5255-457… Column name: foo. Value: 1 (we don’t really care about this value). https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
    39. Pattern 1: not using the value. Q: is user X in bucket foo? A: a single-column fetch.
        f97be9cc-5255-4578-8813-76701c0945bd → { bar: 1, foo: 1 }
        06a6f1b0-fcf2-41d9-8949-fe2d416bde8e → { baz: 1, zoo: 1 }
        503778bc-246f-4041-ac5a-fd944176b26d → { aaa: 1 }
    40. Pattern 1: not using the value. Q: which buckets is user X in? A: a column-slice fetch over the same rows.
    41. Pattern 1: not using the value. We could also use expiring columns to automatically delete columns N seconds after insertion: UPDATE users USING TTL = 3600 SET foo = 1 WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
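       A minimal Python sketch of this layout (a dict per row stands in for the column family; the UUIDs are the ones from the slides):

           users = {
               "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
               "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
               "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
           }

           def in_bucket(user, bucket):
               # "Is user X in bucket foo?" is a single-column fetch
               return bucket in users.get(user, {})

           def buckets(user):
               # "Which buckets is user X in?" is a column slice over one row
               return sorted(users.get(user, {}))

           assert in_bucket("f97be9cc-5255-4578-8813-76701c0945bd", "foo")
           assert buckets("06a6f1b0-fcf2-41d9-8949-fe2d416bde8e") == ["baz", "zoo"]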
    42. Pattern 2: counters. Real-time analytics: count clicks/impressions of ads in hourly buckets. Row key: 1. Column name: 2011103015-click. Value: 34. https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
    43. Pattern 2: counters. Increment by 1 using CQL: UPDATE ads SET 2011103015-impression = 2011103015-impression + 1 WHERE KEY = '1'
    44. Pattern 2: counters. Q: how many clicks/impressions for ad 1 over a time range? A: a column-slice fetch between column X and column Y.
        1 → { 2011103015-click: 1, 2011103015-impression: 3434, 2011103016-click: 12, 2011103016-impression: 5411, 2011103017-click: 2, 2011103017-impression: 345 }
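       The bucketing logic, sketched in Python (names are mine; a real cluster does the increment server-side with a counter column):

           import time

           ads = {}  # row key (ad id) -> {hour-bucket column -> count}

           def bump(ad_id, event, when=None):
               # Column names are hour buckets like "2011103015-click" (YYYYMMDDHH-event)
               hour = time.strftime("%Y%m%d%H", time.gmtime(when or time.time()))
               row = ads.setdefault(ad_id, {})
               col = "%s-%s" % (hour, event)
               row[col] = row.get(col, 0) + 1

           def clicks(ad_id, first_hour, last_hour):
               # The time-range query is a column slice between two bucket names
               return {c: n for c, n in sorted(ads.get(ad_id, {}).items())
                       if c.endswith("-click") and first_hour <= c[:10] <= last_hour}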
    45. Pattern 3: time series. Store a canonical reference of impressions and clicks. Row key: 20111030. Column name: <time UUID>. Value: {json}. Cassandra can order columns by time. http://rubyscale.com/2011/basic-time-series-with-cassandra/
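       In Python, assuming one row per day (a version-1 UUID embeds a timestamp, which is what a TimeUUIDType comparator sorts on):

           import json, time, uuid

           days = {}  # row key "YYYYMMDD" -> {time UUID column -> json value}

           def record(payload, when=None):
               when = when or time.time()
               row_key = time.strftime("%Y%m%d", time.gmtime(when))  # e.g. "20111030"
               days.setdefault(row_key, {})[uuid.uuid1()] = json.dumps(payload)

           record({"ad": 1, "event": "impression"})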
    46. Pattern 4: object properties as columns. Store user properties such as name, email, etc. Row key: f97be9cc-5255-457… Column name: name. Value: Bob Foo-Bar. http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
    47. Anti-pattern 1: read-before-write. Instead, store the object as independent columns and mutate them individually (see pattern 4, and the sketch below).
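       A contrast of the two styles in Python (ColumnFamily is an in-memory stand-in, not a real client API):

           class ColumnFamily:
               def __init__(self):
                   self.rows = {}
               def get(self, key):
                   return dict(self.rows.get(key, {}))
               def put(self, key, row):
                   self.rows[key] = row
               def insert(self, key, column, value):
                   self.rows.setdefault(key, {})[column] = value

           cf = ColumnFamily()
           user = "f97be9cc-5255-4578-8813-76701c0945bd"

           # Anti-pattern: read the whole row, change it, write it all back.
           # It costs an extra round trip and races with concurrent writers.
           row = cf.get(user)
           row["name"] = "Bob Foo-Bar"
           cf.put(user, row)

           # Pattern 4 instead: blind-write only the column that changed;
           # per-column timestamps resolve concurrent updates without a read.
           cf.insert(user, "name", "Bob Foo-Bar")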
    48. Anti-pattern 2: super columns. Friends don’t let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
    49. Anti-pattern 3: OPP. The Order Preserving Partitioner unbalances your load and makes your life harder. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
    50. Recap: data modelling. • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
    51. There’s more: Brisk. An integrated Hadoop distribution (without HDFS installed); run Hive and Pig queries directly against Cassandra. DataStax offers this functionality in their “Enterprise” product. http://www.datastax.com/products/enterprise
    52. Hive: SQL-like interface to Hadoop.
        CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES (
            "cassandra.columns.mapping" = ":key,:column,:value",
            "cassandra.cf.name" = "users"
        );

        SELECT segmentId, count(1) AS total
        FROM tempUsers
        GROUP BY segmentId
        ORDER BY total DESC;
    53. In conclusion: Cassandra is founded on sound design principles.
    54. In conclusion: the data model is incredibly powerful.
    55. In conclusion: CQL and a new breed of clients are making it easier to use.
    56. In conclusion: Hadoop integration means we can analyse data directly from a Cassandra cluster.
    57. In conclusion: there is a strong community, and multiple companies offering professional support.
    58. Thanks! Looking for a job? Learn more about Cassandra: meetup.com/Cassandra-London. Sample ad-targeting project on GitHub: https://github.com/davegardnerisme/we-have-your-kidneys. Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations
