Cassandra's Sweet Spot - an introduction to Apache Cassandra
 


Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.

Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

© All Rights Reserved

    Presentation Transcript

    • Cassandra’s sweet spot
      Dave Gardner, @davegardnerisme
    • jobs.hailocab.com – looking for an expert backend Java dev; speak to me!
      meetup.com/Cassandra-London – next event 21st November
    • Building applications with Cassandra • Key features • Creating an application • Data modeling
    • Comparing Cassandra with X “Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know what's different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?” 27th July 2010, http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
    • Comparing Cassandra with X “They have approximately nothing in common. And, no, Cassandra is definitely not dying off.” 28th July 2010 http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
    • Top Tip #1 To use a NoSQL solution effectively, we need to identify its sweet spot. This means learning about each solution: how is it designed? What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
    • Comparing Cassandra with X“they say … I can’t decide between this projectand this project even though they look nothinglike each other. And the fact that you can’tdecide indicates that you don’t actually have aproblem that requires them.”Benjamin Black – NoSQL Tapes (at 30:15)http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
    • Headline features 1. Elastic: read and write throughput increase linearly as new machines are added. http://cassandra.apache.org/
    • Headline features 2. Decentralised: fault tolerant with no single point of failure; no “master” node. http://cassandra.apache.org/
    • The Dynamo paper • Consistent hashing • Vector clocks • Gossip protocol • Hinted handoff • Read repair http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
    • The Dynamo paper [Diagram: a client sends a request to a coordinator node on a six-node ring with RF = 3; the coordinator forwards it to the three replica nodes.]
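The ring in the diagram can be sketched in a few lines. This is a toy Python model, not Cassandra's actual partitioner: the node names and the md5-based token scheme are illustrative. It shows the core idea of consistent hashing with RF = 3, where a key's replicas are found by walking clockwise around the ring from the key's token:

```python
import bisect
import hashlib


class Ring:
    """Toy consistent-hash ring: a key is stored on the first node at or
    after its token, plus the next RF-1 distinct nodes walking clockwise."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        # Derive a deterministic token for each node from a hash of its name
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key):
        """Return the RF distinct nodes responsible for this key."""
        start = bisect.bisect(self.tokens, (self._hash(key), ""))
        out = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.rf:
                break
        return out


ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"], rf=3)
reps = ring.replicas("f97be9cc-5255-4578-8813-76701c0945bd")  # 3 distinct nodes
```

Because tokens are fixed, the same key always maps to the same replica set, and adding a node only moves the keys on one arc of the ring.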
    • Headline features 3. Rich data model Column based, range slices, column slices, secondary indexes, counters, expiring columns http://cassandra.apache.org/
    • The Bigtable paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
    • The Bigtable paper [Diagram: a Column Family maps a row key to an ordered set of columns, each column holding a name and a value.]
    • Headline features 4. You're in control: tunable consistency, per operation. http://cassandra.apache.org/
    • Consistency levels How many replicas must respond to declare success?
    • Consistency levels: write operations
      ANY – one node, including hinted handoff
      ONE – one node
      QUORUM – N/2 + 1 replicas
      LOCAL_QUORUM – N/2 + 1 replicas in the local data centre
      EACH_QUORUM – N/2 + 1 replicas in each data centre
      ALL – all replicas
      http://wiki.apache.org/cassandra/API#Write
    • Consistency levels: read operations
      ONE – first response
      QUORUM – N/2 + 1 replicas
      LOCAL_QUORUM – N/2 + 1 replicas in the local data centre
      EACH_QUORUM – N/2 + 1 replicas in each data centre
      ALL – all replicas
      http://wiki.apache.org/cassandra/API#Read
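The QUORUM rows above all come down to the same arithmetic. A small sketch (the function names are mine) of the N/2 + 1 rule, together with the standard Dynamo-style observation that a read is guaranteed to overlap the latest write whenever W + R > N:

```python
def quorum(replication_factor):
    # QUORUM = floor(N / 2) + 1 replicas must respond,
    # where N is the replication factor
    return replication_factor // 2 + 1


def overlaps_latest_write(write_replicas, read_replicas, replication_factor):
    # A read set and a write set must share at least one replica
    # whenever W + R > N, so the read sees the newest value
    return write_replicas + read_replicas > replication_factor


rf = 3
w = r = quorum(rf)  # 2 replicas each way
strong = overlaps_latest_write(w, r, rf)  # 2 + 2 > 3, so True
```

This is why QUORUM writes combined with QUORUM reads give strong consistency, while ONE/ONE trades that guarantee for latency.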
    • Headline features 5. Performant: well known for high write performance. http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
    • Benchmark* http://blog.cubrid.org/dev-platform/nosql-benchmarking/ (* add a pinch of salt)
    • Recap: headline features 1. Elastic 2. Decentralised 3. Rich data model 4. You’re in control (tunable consistency) 5. Performant
    • A simple ad-targeting application [Diagram: some ads plus our user knowledge feed into choosing which ad to show.]
    • A simple ad-targeting application Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets) http://pixel.wehaveyourkidneys.com/add.php?add=foo
    • A simple ad-targeting application Record clicks and impressions of each ad; storing data per-ad and per-segment http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1 http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
    • A simple ad-targeting application Real-time ad performance analytics, broken down by segment (which segments are performing well?) http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
    • A simple ad-targeting application Recommendations based on best- performing ads (this is left as an exercise for the reader)
    • Additional requirements • Large number of users • High volume of impressions • Highly available – downtime is money
    • A good fit for Cassandra? Yes! Big data, high availability and lots of writes are all good signs that Cassandra will fit well. http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html
    • A good fit for Cassandra? Although people use Cassandra for many other things too, e.g. highly available HTTP request routing (tiny data!). http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
    • Top Tip #2 Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.
    • Demo Live demo before we start
    • Data modeling Start from your queries, work backwards http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
    • Data model basics: conflict resolution. Per-column timestamp-based conflict resolution:
      { column: foo, value: bar, timestamp: 1000 } vs { column: foo, value: zing, timestamp: 1001 }
      http://cassandra.apache.org/
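Last-write-wins resolution is easy to model. A minimal sketch, assuming columns are represented as plain dicts shaped like the ones on the slide:

```python
def resolve(*versions):
    """Last-write-wins: keep the column version with the highest timestamp.
    Cassandra resolves per column, not per row."""
    return max(versions, key=lambda col: col["timestamp"])


older = {"column": "foo", "value": "bar", "timestamp": 1000}
newer = {"column": "foo", "value": "zing", "timestamp": 1001}
winner = resolve(older, newer)  # the timestamp-1001 version wins
```

Note that the client supplies the timestamp, so two writers with skewed clocks can silently overwrite each other; this is the trade-off behind the simplicity.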
    • Data model basics: column ordering. Columns are ordered at time of writing, according to the Column Family schema:
      { column: zebra, value: foo, timestamp: 1000 } then { column: badger, value: foo, timestamp: 1001 }
      is stored as { badger: foo, zebra: foo } with an AsciiType column schema
      http://cassandra.apache.org/
    • Data modeling: user segments Add user to bucket X, with expiry time Y Which buckets is user X in? ["user"][<uuid>][<bucketId>] = 1 [CF] [rowKey] [columnName] = value
    • Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Is user in segment X? A: Single column fetch
    • Data modeling: user segments user Column Family: [f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1 [f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1 [06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1 [503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1 Q: Which segments is user X in? A: Column slice fetch
    • Top Tip #3 With column slices, we get the columns back ordered according to our schema. We cannot do the same for rows, however, unless we use the Order Preserving Partitioner.
    • Top Tip #4 Don’t use the Order Preserving Partitioner unless you absolutely have to. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
    • Expiring columns An expiring column will be automatically deleted after n seconds http://cassandra.apache.org/
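A toy in-memory model (the class and method names are mine, not a Cassandra API) of what an expiring column does: the write records an absolute deadline, and any read past that deadline behaves as if the column had been deleted:

```python
import time


class ExpiringColumns:
    """Toy model of expiring columns: insert records value plus deadline;
    reads silently drop columns whose deadline has passed."""

    def __init__(self):
        self._data = {}

    def insert(self, column, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._data[column] = (value, now + ttl_seconds)

    def get(self, column, now=None):
        now = time.time() if now is None else now
        if column in self._data:
            value, expires_at = self._data[column]
            if now < expires_at:
                return value
        return None  # expired or never written


cols = ExpiringColumns()
cols.insert("foo", 1, ttl_seconds=3600, now=0)
live = cols.get("foo", now=1800)     # still live: 1
expired = cols.get("foo", now=4000)  # past the TTL: None
```

This is exactly what makes TTLs a good fit for the segment-membership use case: membership lapses automatically with no read-before-write and no cleanup job.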
    • Data modeling: user segments
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      $users->insert(
          $userUuid,
          array($segment => 1),
          null,     // default timestamp
          $expires  // TTL in seconds
      );
      Using the phpcassa client: https://github.com/thobbs/phpcassa
    • Data modeling: user segments
      UPDATE users USING TTL 3600
      SET 'foo' = '1'
      WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
      Using CQL: http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language http://www.datastax.com/docs/1.0/references/cql
    • Top Tip #5 Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns.
    • Data modeling: ad performance Track overall ad performance; how many clicks/impressions per ad? ["ads"][<adId>][<stamp>]["click"] = # ["ads"][<adId>][<stamp>]["impression"] = # [CF] [Row] [S.Col] [Col] = value Using super columns
    • Top Tip #6 Friends don’t let friends use Super Columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
    • Data modeling: ad performance Try again using regular columns: ["ads"][<adId>][<stamp>-"click"] = # ["ads"][<adId>][<stamp>-"impression"] = # [CF] [Row] [Col] = value
    • Data modeling: ad performance ads Column Family: [1][2011103015-click] = 1 [1][2011103015-impression] = 3434 [1][2011103016-click] = 12 [1][2011103016-impression] = 5411 [1][2011103017-click] = 2 [1][2011103017-impression] = 345 Q: Get performance of ad X between two date/times A: Column slice against single row specifying a start stamp and end stamp + 1
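Because column names sort lexically under an ASCII-style schema, the hour-bucket naming gives contiguous time ranges for free. A sketch (the helper names are mine; the row is a plain dict standing in for the `ads` row above) of how a start/end stamp pair selects the slice:

```python
from datetime import datetime


def bucket(ts):
    # Hour bucket matching the slides' date('YmdH') format, e.g. "2011103015"
    return ts.strftime("%Y%m%d%H")


def slice_counters(row, start, end):
    """Emulate a column slice: column names sort lexically, so a start/end
    column pair selects a contiguous hour range from a single row."""
    lo, hi = bucket(start), bucket(end)
    return {
        name: value
        for name, value in sorted(row.items())
        if lo <= name.split("-")[0] <= hi
    }


row = {
    "2011103015-click": 1, "2011103015-impression": 3434,
    "2011103016-click": 12, "2011103016-impression": 5411,
    "2011103017-click": 2, "2011103017-impression": 345,
}
window = slice_counters(row, datetime(2011, 10, 30, 15), datetime(2011, 10, 30, 16))
```

In Cassandra the server does this filtering natively (one slice query against one row); the point is that the ordering is baked into the column names, so no secondary index is needed.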
    • Think carefully about your data This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread. Other options: http://rubyscale.com/2011/basic-time-series-with-cassandra/
    • Counters • Distributed atomic counters • Easy to use • Not idempotent http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
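Why counters are not idempotent: an increment carries a delta, not a target value, so a request that times out with an unknown outcome cannot safely be retried. A toy in-memory model (names are mine) showing the over-count:

```python
class Counters:
    """Toy counter column family: add() applies a delta to a named column."""

    def __init__(self):
        self._counts = {}

    def add(self, column, delta=1):
        self._counts[column] = self._counts.get(column, 0) + delta
        return self._counts[column]

    def get(self, column):
        return self._counts.get(column, 0)


c = Counters()
c.add("2011103015-impression")
# Suppose the first request actually succeeded but the client timed out
# before hearing back; a retry applies the delta a second time:
c.add("2011103015-impression")
total = c.get("2011103015-impression")  # over-counted: 2, not 1
```

Contrast this with the set-a-value writes earlier in the deck, which can be retried freely because rewriting the same value is harmless.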
    • Data modeling: ad performance
      $stamp = date('YmdH');
      $ads->add(
          $adId,                // row key
          "$stamp-impression",  // column
          1                     // increment
      );
      We’ll store performance metrics in hour buckets for graphing.
    • Data modeling: ad performance
      UPDATE ads
      SET '2011103015-impression' = '2011103015-impression' + 1
      WHERE KEY = '1'
    • Data modeling: performance/segment We can add another dimension to our stats so we can break down by segment. ["ads"][<adId>] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
    • Data modeling: performance/segment ads Column Family: [1][2011103015-bar-click] = 1 [1][2011103015-bar-impression] = 3434 [1][2011103015-foo-click] = 12 [1][2011103015-foo-impression] = 5411 [1][2011103016-bar-click] = 2 Q: Get performance of ad X between two date/times, split by segment A: Column slice against single row specifying a start stamp and end stamp + 1
    • Data modeling: performance/segment
      $stamp = date('YmdH');
      $ads->add(
          "$adId-segments",              // row key
          "$stamp-$segment-impression",  // column
          1                              // increment
      );
      We’ll store performance metrics in hour buckets for graphing.
    • Data modeling: segment stats Track overall clicks/impressions per bucket; which buckets are most clicky? ["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = # [CF] [Row] [Col] = value
    • Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
    • Recap: Common data modeling patterns 1. Using column names with no value [cf][rowKey][columnName] = 1
    • Recap: Common data modeling patterns 2. Counters [cf][rowKey][columnName]++
    • And also… 3. Serialising a whole object [cf][rowKey][columnName] = { foo: 3, bar: 11 }
    • There’s more: Brisk. An integrated Hadoop distribution (without HDFS installed); run Hive and Pig queries directly against Cassandra. DataStax now offers this functionality in their “Enterprise” product. http://www.datastax.com/products/enterprise
    • Hive
      CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.columns.mapping" = ":key,:column,:value",
        "cassandra.cf.name" = "users"
      );
      SELECT segmentId, count(1) AS total
      FROM tempUsers
      GROUP BY segmentId
      ORDER BY total DESC;
    • There’s more: Supercharged Cassandra Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads Includes instant snapshot of CFs http://www.acunu.com/products/choosing-cassandra/
    • In conclusion Cassandra is founded on sound design principles
    • In conclusion The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful
    • In conclusion The clients are getting better; CQL is a step forward
    • In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
    • In conclusion Cassandra’s sweet spot is highly available “big data” (especially time- series) with large numbers of writes
    • Thanks!
      Learn more about Cassandra: meetup.com/Cassandra-London
      Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys
      Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations