Learning Cassandra
Learning Cassandra

Context to choosing NoSQL, learning Cassandra basics plus some basic data modelling patterns and anti-patterns.

Learning Cassandra: Presentation Transcript

  • Learning Cassandra
    Dave Gardner (@davegardnerisme)
  • What I’m going to cover:
      • How to NoSQL
      • Cassandra basics (Dynamo and Big Table)
      • How to use the data model in real life
  • How to NoSQL
      1. Find data store that doesn’t use SQL
      2. Anything
      3. Cram all the things into it
      4. Triumphantly blog this success
      5. Complain a month later when it bursts into flames
    http://www.slideshare.net/rbranson/how-do-i-cassandra/4
  • Choosing NoSQL
    “NoSQL DBs trade off traditional features to better support new and emerging use cases”
    http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
  • Choosing Cassandra: Tradeoffs
    You trade more widely used, tested and documented software (MySQL: first open-source release in 1998) for a relatively immature product (Cassandra: first open-sourced in 2008).
  • Choosing Cassandra: Tradeoffs
    You trade ad-hoc querying (SQL join, group by, having, order by) for a rich data model with limited ad-hoc querying ability: Cassandra makes you denormalise.
  • Choosing NoSQL
    “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
    Benjamin Black – NoSQL Tapes (at 30:15)
    http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  • What do we get in return? Proven horizontal scalability Cassandra scales reads and writes linearly as new nodes are added
  • Netflix benchmark: linear scaling http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • What do we get in return? High availability Cassandra is fault-resistant with tunable consistency levels
  • What do we get in return? Low latency, solid performance Cassandra has very good write performance
  • Performance benchmark* http://blog.cubrid.org/dev-platform/nosql-benchmarking/ (* add a pinch of salt)
  • What do we get in return? Operational simplicity Homogenous cluster, no “master” node, no SPOF
  • What do we get in return? Rich data model Cassandra is more than simple key-value – columns, composites, counters, secondary indexes
  • How to NoSQL, version 2: learn about each solution
      • What tradeoffs are you making?
      • How is it designed?
      • What algorithms does it use?
    http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  • Amazon Dynamo + Google Big Table
      Dynamo: consistent hashing, vector clocks (* not in Cassandra), gossip protocol, hinted handoff, read repair
      Big Table: columnar data model, SSTable storage, append-only commit log, Memtable, compaction
    http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
    http://labs.google.com/papers/bigtable-osdi06.pdf
  • The dynamo paper [diagram: six nodes on a ring; tokens are integers from 0 to 2^127]
  • The dynamo paper [diagram: a client’s request goes to a coordinator node, which uses consistent hashing to find the replicas on the ring]
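The ring in the diagrams above can be sketched in a few lines of Python. This is a hedged illustration of the consistent-hashing idea from the Dynamo paper, not Cassandra's actual implementation; the `Ring` class and node names are hypothetical. A key is hashed to a token on the ring, and the first RF nodes clockwise from that token hold its replicas.

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 127  # tokens are integers from 0 to 2^127

def token(key: str) -> int:
    """Hash a key onto the ring (MD5-based, like RandomPartitioner)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; sort once for lookups
        self.sorted = sorted((t, n) for n, t in node_tokens.items())

    def replicas(self, key, rf=3):
        """First rf nodes clockwise from the key's token hold the replicas."""
        t = token(key)
        tokens = [tok for tok, _ in self.sorted]
        start = bisect_right(tokens, t) % len(self.sorted)
        return [self.sorted[(start + i) % len(self.sorted)][1]
                for i in range(rf)]

# six nodes with evenly spaced tokens, as in the diagram
ring = Ring({f"node{i}": i * RING_SIZE // 6 for i in range(6)})
print(ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd", rf=3))
```

Because the mapping from key to nodes is deterministic, any node can act as coordinator and route a request without a master.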
  • Consistency levels How many replicas must respond to declare success?
  • Consistency levels: read operations
      ONE           1st response
      QUORUM        N/2 + 1 replicas
      LOCAL_QUORUM  N/2 + 1 replicas in local data centre
      EACH_QUORUM   N/2 + 1 replicas in each data centre
      ALL           All replicas
    http://wiki.apache.org/cassandra/API#Read
  • Consistency levels: write operations
      ANY           One node, including hinted handoff
      ONE           One node
      QUORUM        N/2 + 1 replicas
      LOCAL_QUORUM  N/2 + 1 replicas in local data centre
      EACH_QUORUM   N/2 + 1 replicas in each data centre
      ALL           All replicas
    http://wiki.apache.org/cassandra/API#Write
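The quorum sizes in the tables above follow directly from N/2 + 1 with integer division. A quick sketch (the `quorum` helper is illustrative, not a Cassandra API):

```python
def quorum(rf: int) -> int:
    """QUORUM = floor(RF / 2) + 1 replicas must respond."""
    return rf // 2 + 1

# With RF = 3, QUORUM is 2, so a QUORUM read set and a QUORUM write
# set always overlap in at least one replica: 2 + 2 > 3.
for rf in (1, 2, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, read/write overlap guaranteed: {q + q > rf}")
```

That overlap is why QUORUM reads after QUORUM writes see the latest value.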
  • The dynamo paper [diagram: RF = 3, CL = One; the coordinator returns after one replica responds]
  • The dynamo paper [diagram: RF = 3, CL = Quorum; the coordinator waits for two of the three replicas]
  • The dynamo paper [diagram: RF = 3, CL = One, with a hint stored by the coordinator for a down replica (hinted handoff)]
  • The dynamo paper [diagram: RF = 3, CL = One, with read repair updating stale replicas after a read]
  • The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
  • The big table paper [diagram: a Column is a name, value and timestamp triple]
  • The big table paper [diagram: a row can have millions of columns (theoretically up to 2 billion)]
  • The big table paper [diagram: a Row is a row key plus its columns]
  • The big table paper [diagram: a Column Family holds billions of rows, each a row key mapping to its columns]
  • The big table paper [diagram: a write goes to the append-only commit log (disk) and the Memtable (memory); the Memtable is flushed on a time/size trigger into immutable SSTable files]
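The write path above can be modelled with a toy in-memory store. This is a deliberately simplified sketch of the Big Table design (no real durability, no compaction); the `ToyStore` class and its parameters are invented for illustration:

```python
class ToyStore:
    """Toy model of the Big Table write path: append-only commit log,
    in-memory memtable, and immutable sorted SSTables on flush."""

    def __init__(self, flush_at=3):
        self.commit_log = []   # append-only, for crash recovery
        self.memtable = {}     # write buffer, sorted at flush time
        self.sstables = []     # immutable sorted segments on "disk"
        self.flush_at = flush_at

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value            # then buffer in the memtable
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        # an SSTable is a sorted, immutable snapshot of the memtable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # newest data wins: memtable first, then SSTables newest-to-oldest
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Writes never seek: they only append to the log and update memory, which is where Cassandra's write performance comes from; compaction later merges the accumulated SSTables.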
  • Data model basics: conflict resolution
    Per-column timestamp-based conflict resolution:
      { column: foo, value: bar, timestamp: 1000 }
      { column: foo, value: zing, timestamp: 1001 }
    The version with the bigger timestamp (1001) wins.
    http://cassandra.apache.org/
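The resolution rule is simple enough to state in one line of Python. A minimal sketch using the two column versions from the slide (`resolve` is a hypothetical helper, not a client API):

```python
def resolve(*versions):
    """Timestamp-based conflict resolution: the column version with
    the biggest timestamp wins."""
    return max(versions, key=lambda col: col["timestamp"])

a = {"column": "foo", "value": "bar", "timestamp": 1000}
b = {"column": "foo", "value": "zing", "timestamp": 1001}
print(resolve(a, b)["value"])  # zing
```

Note the comparison is per column, not per row, so two writers updating different columns of the same row never conflict.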
  • Data model basics: column ordering
    Columns are ordered at time of writing, according to the Column Family schema:
      written as { column: zebra, value: foo, timestamp: 1000 } then { column: badger, value: foo, timestamp: 1001 }
      stored as { badger: foo, zebra: foo } with an AsciiType column schema
    http://cassandra.apache.org/
  • Key point Each “query” can be answered from a single slice of disk (once compaction has finished)
  • Data modeling – 1000ft introduction • Start from your queries and work backwards • Denormalise in the application (store data more than once) http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
  • Pattern 1: not using the value
    Storing that user X is in bucket Y:
      Row key: f97be9cc-5255-457…
      Column name: foo
      Value: 1 (we don’t really care about this)
    https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
  • Pattern 1: not using the value
    Q: is user X in bucket foo?  A: single column fetch
      f97be9cc-5255-4578-8813-76701c0945bd → { bar: 1, foo: 1 }
      06a6f1b0-fcf2-41d9-8949-fe2d416bde8e → { baz: 1, zoo: 1 }
      503778bc-246f-4041-ac5a-fd944176b26d → { aaa: 1 }
  • Pattern 1: not using the value
    Q: which buckets is user X in?  A: column slice fetch over the same rows
  • Pattern 1: not using the value
    We could also use expiring columns to automatically delete columns N seconds after insertion:
      UPDATE users USING TTL = 3600
      SET foo = 1
      WHERE KEY = f97be9cc-5255-4578-8813-76701c0945bd
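The access pattern behind pattern 1 can be sketched with an in-memory stand-in for the wide rows (a dict per user, using the sample data from the slides; the helper names are invented):

```python
# row key -> { column name: value }; membership is encoded purely in
# the presence of the column, so the value (1) is never read
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}

def in_bucket(user, bucket):
    """Q: is user X in bucket foo?  A: single column fetch."""
    return bucket in users.get(user, {})

def buckets(user):
    """Q: which buckets is user X in?  A: column slice fetch."""
    return sorted(users.get(user, {}))

print(in_bucket("f97be9cc-5255-4578-8813-76701c0945bd", "foo"))  # True
print(buckets("06a6f1b0-fcf2-41d9-8949-fe2d416bde8e"))  # ['baz', 'zoo']
```

Both questions are answered from one row, which is the whole point: one query, one slice of disk.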
  • Pattern 2: counters
    Real-time analytics to count clicks/impressions of ads in hourly buckets:
      Row key: 1
      Column name: 2011103015-click
      Value: 34
    https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
  • Pattern 2: counters
    Increment by 1 using CQL:
      UPDATE ads
      SET 2011103015-impression = 2011103015-impression + 1
      WHERE KEY = '1'
  • Pattern 2: counters
    Q: how many clicks/impressions for ad 1 over a time range?
    A: column slice fetch, between column X and Y
      1 → { 2011103015-click: 1, 2011103015-impression: 3434,
            2011103016-click: 12, 2011103016-impression: 5411,
            2011103017-click: 2, 2011103017-impression: 345 }
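The hourly-bucket naming and the range query can be sketched in Python with a `Counter` standing in for the counter column family (the `record`/`clicks_between` helpers are hypothetical, not a client API):

```python
from collections import Counter
from datetime import datetime, timedelta

counters = Counter()  # stand-in for a Cassandra counter column family

def record(ad_id, event, when):
    """Bump the hourly bucket for this ad, e.g. column '2011103015-click'."""
    bucket = when.strftime("%Y%m%d%H")
    counters[(ad_id, f"{bucket}-{event}")] += 1

def clicks_between(ad_id, start, end):
    """Column-slice analogue: sum the click buckets in [start, end)."""
    total, t = 0, start
    while t < end:
        total += counters[(ad_id, t.strftime("%Y%m%d%H") + "-click")]
        t += timedelta(hours=1)
    return total

record("1", "click", datetime(2011, 10, 30, 15))
record("1", "click", datetime(2011, 10, 30, 16))
record("1", "impression", datetime(2011, 10, 30, 15))
print(clicks_between("1", datetime(2011, 10, 30, 15),
                     datetime(2011, 10, 30, 18)))  # 2
```

Because bucket names sort lexicographically by time, the real query is a single column slice between the start and end bucket names.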
  • Pattern 3: time series Store canonical reference of impressions and clicks Row key: 20111030 Column name: <time UUID> Value: {json} Cassandra can order columns by time http://rubyscale.com/2011/basic-time-series-with-cassandra/
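The time-series layout can be sketched with Python's `uuid.uuid1` standing in for a TimeUUID (an assumption for illustration; `day_row_key` is a hypothetical helper, and real TimeUUIDType comparison lives in Cassandra, not here):

```python
import uuid
from datetime import datetime, timedelta

def day_row_key(ts_uuid):
    """Bucket an event into a daily row, like the '20111030' row key."""
    # uuid1 stores time as 100-ns intervals since 1582-10-15
    epoch = datetime(1582, 10, 15)
    when = epoch + timedelta(microseconds=ts_uuid.time // 10)
    return when.strftime("%Y%m%d")

# columns: (time UUID, JSON payload) under the day's row key
events = [(uuid.uuid1(), '{"event": "click"}') for _ in range(3)]

# TimeUUIDType orders columns by the UUID's embedded timestamp:
ordered = sorted(events, key=lambda e: e[0].time)
```

Bucketing by day keeps any single row from growing without bound while still letting one slice return a whole day in time order.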
  • Pattern 4: object properties as columns Store user properties such as name, email, etc. Row key: f97be9cc-5255-457… Column name: name Value: Bob Foo-Bar http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
  • Anti-pattern 1: read-before-write Instead store as independent columns and mutate individually (see pattern 4)
  • Anti-pattern 2: super columns Friends don’t let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  • Anti-pattern 3: OPP The Order Preserving Partitioner unbalances your load and makes your life harder http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  • Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
  • There’s more: Brisk Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra DataStax offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
  • Hive: SQL-like interface to Hadoop
      CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.columns.mapping" = ":key,:column,:value",
        "cassandra.cf.name" = "users"
      );

      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
  • In conclusion Cassandra is founded on sound design principles
  • In conclusion The data model is incredibly powerful
  • In conclusion CQL and a new breed of clients are making it easier to use
  • In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
  • In conclusion There is a strong community and multiple companies offering professional support
  • Thanks (looking for a job?)
    Learn more about Cassandra: meetup.com/Cassandra-London
    Sample ad-targeting project on Github: https://github.com/davegardnerisme/we-have-your-kidneys
    Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations