Learning Cassandra
Context to choosing NoSQL, learning Cassandra basics plus some basic data modelling patterns and anti-patterns.


  • Transcript

    • 1. Learning Cassandra. Dave Gardner, @davegardnerisme
    • 2. What I’m going to cover • How to NoSQL • Cassandra basics (dynamo and big table) • How to use the data model in real life
    • 3. How to NoSQL 1. Find data store that doesn’t use SQL 2. Anything 3. Cram all the things into it 4. Triumphantly blog this success 5. Complain a month later when it bursts into flames http://www.slideshare.net/rbranson/how-do-i-cassandra/4
    • 4. Choosing NoSQL “NoSQL DBs trade off traditional features to better support new and emerging use cases” http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
    • 5. Choosing Cassandra: Tradeoffs. We trade more widely used, tested and documented software (MySQL’s first open-source release was in 1998) for a relatively immature product (Cassandra was first open-sourced in 2008)
    • 6. Choosing Cassandra: Tradeoffs. We trade ad-hoc querying (SQL join, group by, having, order by) for a rich data model with limited ad-hoc querying ability: Cassandra makes you denormalise
    • 7. Choosing NoSQL “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black – NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
    • 8. What do we get in return? Proven horizontal scalability Cassandra scales reads and writes linearly as new nodes are added
    • 9. Netflix benchmark: linear scaling http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
    • 10. What do we get in return? High availability Cassandra is fault-resistant with tunable consistency levels
    • 11. What do we get in return? Low latency, solid performance Cassandra has very good write performance
    • 12. Performance benchmark * http://blog.cubrid.org/dev-platform/nosql-benchmarking/ * Add pinch of salt
    • 13. What do we get in return? Operational simplicity Homogeneous cluster, no “master” node, no SPOF
    • 14. What do we get in return? Rich data model Cassandra is more than simple key-value – columns, composites, counters, secondary indexes
    • 15. How to NoSQL version 2 Learn about each solution • What tradeoffs are you making? • How is it designed? • What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
    • 16. Amazon Dynamo + Google Big Table. From Dynamo: consistent hashing, vector clocks (* not in Cassandra), gossip protocol, hinted handoff, read repair (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). From Big Table: columnar data model, SSTable storage, append-only commit log, memtable, compaction (http://labs.google.com/papers/bigtable-osdi06.pdf)
    • 17. The dynamo paper [Diagram: a ring of six nodes; tokens are integers from 0 to 2^127]
    • 18. The dynamo paper [Diagram: consistent hashing; a coordinator node receives the client’s request and routes it to the node that owns the key’s token]
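The ring in these diagrams can be sketched in a few lines of Python. This is an illustrative toy, not Cassandra's implementation: the MD5-based token function mirrors what the RandomPartitioner does, but the `Ring` class, its method names and the six evenly spaced node tokens are invented for the example.

```python
import hashlib
from bisect import bisect_right

# Tokens are integers from 0 to 2^127, as on the slide
RING_SIZE = 2 ** 127

def token(key: str) -> int:
    # MD5-based token, in the spirit of the RandomPartitioner
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    """Toy consistent-hash ring: each node owns the range up to its token."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}
        self.nodes = sorted(node_tokens.items(), key=lambda kv: kv[1])
        self.tokens = [t for _, t in self.nodes]

    def replicas(self, key: str, rf: int = 3):
        # Walk clockwise from the key's token, taking RF distinct nodes
        i = bisect_right(self.tokens, token(key)) % len(self.nodes)
        return [self.nodes[(i + n) % len(self.nodes)][0] for n in range(rf)]

ring = Ring({"node%d" % n: n * RING_SIZE // 6 for n in range(6)})
print(ring.replicas("f97be9cc", rf=3))  # three consecutive nodes on the ring
```

Adding a node only moves the keys between its token and its predecessor's, which is the property that lets Cassandra scale out incrementally.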
    • 19. Consistency levels How many replicas must respond to declare success?
    • 20. Consistency levels: read operations. ONE: 1st response; QUORUM: N/2 + 1 replicas; LOCAL_QUORUM: N/2 + 1 replicas in the local data centre; EACH_QUORUM: N/2 + 1 replicas in each data centre; ALL: all replicas. http://wiki.apache.org/cassandra/API#Read
    • 21. Consistency levels: write operations. ANY: one node, including hinted handoff; ONE: one node; QUORUM: N/2 + 1 replicas; LOCAL_QUORUM: N/2 + 1 replicas in the local data centre; EACH_QUORUM: N/2 + 1 replicas in each data centre; ALL: all replicas. http://wiki.apache.org/cassandra/API#Write
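The N/2 + 1 in these tables is simple integer arithmetic over the replication factor. A minimal sketch (plain Python, not any driver's API):

```python
def quorum(replication_factor: int) -> int:
    # QUORUM = floor(RF / 2) + 1 replicas must respond
    return replication_factor // 2 + 1

# With RF = 3 a quorum is 2, so one node can be down while
# QUORUM reads and QUORUM writes still overlap (2 + 2 > 3),
# which is what makes quorum reads see quorum writes.
for rf in (1, 2, 3, 5):
    print(rf, quorum(rf))
```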
    • 22. The dynamo paper [Diagram: RF = 3, CL = ONE; the coordinator forwards the write to three replicas and acknowledges after one responds]
    • 23. The dynamo paper [Diagram: RF = 3, CL = QUORUM; the coordinator acknowledges after two of the three replicas respond]
    • 24. The dynamo paper [Diagram: RF = 3, CL = ONE; one replica is down, so the coordinator stores a hint (hinted handoff) for later delivery]
    • 25. The dynamo paper [Diagram: RF = 3, CL = ONE; a read triggers read repair on stale replicas]
    • 26. The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
    • 27. The big table paper [Diagram: a column is a name/value pair plus a timestamp]
    • 28. The big table paper We can have millions of columns * per row, each a name/value pair (* theoretically up to 2 billion)
    • 29. The big table paper [Diagram: a row is a row key plus its columns]
    • 30. The big table paper [Diagram: a column family is many rows, each a row key with its columns; we can have billions of rows]
    • 31. The big table paper [Diagram: a write goes to the commit log (disk) and the memtable (memory); the memtable is flushed to an immutable SSTable on a time/size trigger]
    • 32. Data model basics: conflict resolution Per-column timestamp-based conflict resolution: { column: foo, value: bar, timestamp: 1000 } vs { column: foo, value: zing, timestamp: 1001 } http://cassandra.apache.org/
    • 33. Data model basics: conflict resolution Per-column timestamp-based conflict resolution: { column: foo, value: bar, timestamp: 1000 } vs { column: foo, value: zing, timestamp: 1001 } – the bigger timestamp wins http://cassandra.apache.org/
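The rule on this slide is simple to state in code. A minimal sketch of last-write-wins reconciliation (illustrative only; the dict shape mirrors the slide, not a real client API):

```python
def reconcile(a: dict, b: dict) -> dict:
    # Per-column last-write-wins: the larger timestamp is kept.
    # (Cassandra additionally breaks exact timestamp ties by
    # comparing values; that detail is omitted here.)
    return a if a["timestamp"] >= b["timestamp"] else b

older = {"column": "foo", "value": "bar", "timestamp": 1000}
newer = {"column": "foo", "value": "zing", "timestamp": 1001}
print(reconcile(older, newer)["value"])  # zing
```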
    • 34. Data model basics: column ordering Columns are ordered at time of writing, according to the Column Family schema: { column: zebra, value: foo, timestamp: 1000 }, { column: badger, value: foo, timestamp: 1001 } http://cassandra.apache.org/
    • 35. Data model basics: column ordering Columns are ordered at time of writing, according to the Column Family schema: { badger: foo, zebra: foo } with AsciiType column schema http://cassandra.apache.org/
    • 36. Key point Each “query” can be answered from a single slice of disk (once compaction has finished)
    • 37. Data modeling – 1000ft introduction • Start from your queries and work backwards • Denormalise in the application (store data more than once) http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
    • 38. Pattern 1: not using the value Storing that user X is in bucket Y. Row key: f97be9cc-5255-457… Column name: foo Value: 1 (we don’t really care about the value) https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
    • 39. Pattern 1: not using the value Q: is user X in bucket foo? A: single column fetch. f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }; 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }; 503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
    • 40. Pattern 1: not using the value Q: which buckets is user X in? A: column slice fetch. f97be9cc-5255-4578-8813-76701c0945bd: { bar: 1, foo: 1 }; 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e: { baz: 1, zoo: 1 }; 503778bc-246f-4041-ac5a-fd944176b26d: { aaa: 1 }
    • 41. Pattern 1: not using the value We could also use expiring columns to automatically delete columns N seconds after insertion UPDATE users USING TTL = 3600 SET foo = 1 WHERE KEY = f97be9cc-5255-4578-8813-76701c0945bd
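The semantics of expiring columns can be sketched as a toy in-memory row. This is illustrative only, with an invented `ExpiringRow` class; it shows the behaviour (a column vanishes on read after its TTL), not how Cassandra actually stores TTLs.

```python
import time

class ExpiringRow:
    """Toy row whose columns can carry a TTL, mimicking expiring columns."""

    def __init__(self):
        self._cols = {}  # column name -> (value, expiry timestamp or None)

    def set(self, name, value, ttl=None):
        expiry = time.time() + ttl if ttl is not None else None
        self._cols[name] = (value, expiry)

    def get(self, name):
        value, expiry = self._cols.get(name, (None, None))
        if expiry is not None and time.time() >= expiry:
            return None  # column has expired
        return value

row = ExpiringRow()
row.set("foo", 1, ttl=3600)  # bucket membership lapses after an hour
print(row.get("foo"))  # 1
```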
    • 42. Pattern 2: counters Real-time analytics to count clicks/impressions of ads in hourly buckets. Row key: 1 Column name: 2011103015-click Value: 34 https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
    • 43. Pattern 2: counters Increment by 1 using CQL: UPDATE ads SET 2011103015-impression = 2011103015-impression + 1 WHERE KEY = '1'
    • 44. Pattern 2: counters Q: how many clicks/impressions for ad 1 over a time range? A: column slice fetch, between column X and Y. Row 1: { 2011103015-click: 1, 2011103015-impression: 3434, 2011103016-click: 12, 2011103016-impression: 5411, 2011103017-click: 2, 2011103017-impression: 345 }
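The hourly bucket scheme above amounts to deriving a column name from the event time, then slicing between two names. A sketch with a plain Python dict standing in for the counter row (not a Cassandra client; the `bucket` helper is invented for the example):

```python
from datetime import datetime, timedelta

def bucket(ts: datetime, event: str) -> str:
    # e.g. an event at 2011-10-30 15:xx -> column name "2011103015-click"
    return ts.strftime("%Y%m%d%H") + "-" + event

# A row of counter columns for ad 1, keyed by hourly bucket
row = {}
t = datetime(2011, 10, 30, 15)
for hour, clicks in enumerate([1, 12, 2]):
    row[bucket(t + timedelta(hours=hour), "click")] = clicks

# "Column slice" between two bucket names: the names sort
# lexicographically, which matches chronological order here
start, end = "2011103015-click", "2011103016-click"
total = sum(v for k, v in row.items() if start <= k <= end)
print(total)  # 13
```

Because the year-month-day-hour prefix sorts the same way lexically and chronologically, a slice between two column names is exactly a time-range query.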
    • 45. Pattern 3: time series Store canonical reference of impressions and clicks Row key: 20111030 Column name: <time UUID> Value: {json} Cassandra can order columns by time http://rubyscale.com/2011/basic-time-series-with-cassandra/
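Pattern 3 relies on columns sorting by time. A sketch using version-1 (time-based) UUIDs as column names, with a plain dict standing in for the daily row (illustrative only, not a client API; sorting on the UUID's embedded timestamp approximates what a TimeUUIDType comparator does):

```python
import uuid

# Version-1 UUIDs embed a 60-bit timestamp, so ordering columns by
# that timestamp returns events chronologically within a daily row.
row_key = "20111030"
row = {}
for payload in ('{"ad": 1, "event": "click"}',
                '{"ad": 2, "event": "impression"}'):
    row[uuid.uuid1()] = payload  # column name: time UUID, value: JSON blob

# Read the row back in time order
for col in sorted(row, key=lambda u: u.time):
    print(row[col])
```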
    • 46. Pattern 4: object properties as columns Store user properties such as name, email, etc. Row key: f97be9cc-5255-457… Column name: name Value: Bob Foo-Bar http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
    • 47. Anti-pattern 1: read-before-write Avoid reading data back in order to modify and rewrite it; instead store properties as independent columns and mutate them individually (see pattern 4)
    • 48. Anti-pattern 2: super columns Friends don’t let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
    • 49. Anti-pattern 3: OPP The Order Preserving Partitioner unbalances your load and makes your life harder http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
    • 50. Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
    • 51. There’s more: Brisk Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra DataStax offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
    • 52. Hive: SQL-like interface to Hadoop
      CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.columns.mapping" = ":key,:column,:value",
        "cassandra.cf.name" = "users" );

      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
    • 53. In conclusion Cassandra is founded on sound design principles
    • 54. In conclusion The data model is incredibly powerful
    • 55. In conclusion CQL and a new breed of clients are making it easier to use
    • 56. In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
    • 57. In conclusion There is a strong community and multiple companies offering professional support
    • 58. Thanks (looking for a job?) Learn more about Cassandra: meetup.com/Cassandra-London. Sample ad-targeting project on Github: https://github.com/davegardnerisme/we-have-your-kidneys. Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations