On Rails with Apache Cassandra
Presentation on Apache Cassandra from the April 27th Austin on Rails meetup.

  • cassandra_object moved to git clone git://github.com/NZKoz/cassandra_object.git
  • Thanks for the mention of the Cassandra storage adapter for RDF.rb, Stu. We're currently in the process of benchmarking it with the 258 million RDF triples that constitute the English edition of DBpedia, the RDFization of Wikipedia. Cassandra's doing very nicely so far, though we do have to maintain our own secondary performance indices, as you mention.
Presentation Transcript

  • On Rails with Apache Cassandra
    Austin on Rails, April 27th 2010
    Stu Hood (@stuhood) – Technical Lead, Rackspace
  • My, what a large/volatile dataset you have!
    ● Large
      ● Larger than 1 node can handle
    ● Volatile
      ● More than 25% (ish) writes
      ● (but still larger than available memory)
    ● Expensive
      ● More than you can afford with a commercial solution
  • My, what a large/volatile dataset you have!
    ● For example:
      ● Event/log data
      ● Output of batch processing or log analytics jobs
      ● Social network relationships/updates
    ● In general:
      ● Large volume of high fanout data
  • Conversely...
    ● If your pattern easily fits one RDBMS machine:
      ● Don't Use Cassandra
      ● Possibly consider MongoDB, CouchDB, Neo4j, Redis, etc
        – For schema freedom and flexibility
  • Case Study: Digg
    1. Vertical partitioning and master/slave trees
    2. Developed sharding solution
      ● IDDB
      ● Awkward replication, fragile scaling
    3. Began populating Cassandra in parallel
      ● Initial dataset for 'green badges'
        – 3 TB
        – 76 billion kv pairs
      ● Most applications being ported to Cassandra
  • Cassandra's Elders
  • Standing on the shoulders of: Amazon Dynamo
    ● No node in the cluster is special
      ● No special roles
      ● No scaling bottlenecks
      ● No single point of failure
    ● Techniques
      ● Gossip
      ● Eventual consistency
  • Standing on the shoulders of: Google Bigtable
    ● “Column family” data model
    ● Range queries for rows:
      ● Scan rows in order
    ● Memtable/SSTable structure
      ● Always writes sequentially to disk
      ● Bloom filters to minimize random reads
      ● Trounces B-Trees for big data
        – Linear insert performance
        – Log growth for reads
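The Bloom filter bullet on this slide can be illustrated with a toy Ruby sketch. It is not Cassandra's implementation: the class name `BloomFilter`, the bit-array size, and the use of salted MD5 digests as hash functions are all assumptions made for illustration. The point it shows is that a negative answer is definite, so a read can skip an SSTable without touching the disk.

```ruby
require 'digest/md5'

# Toy Bloom filter (hypothetical, for illustration only).
# Each key sets a few bit positions derived from salted MD5 digests.
class BloomFilter
  def initialize(size = 1024, hashes = 3)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  # Record a key by setting all of its bit positions.
  def add(key)
    positions(key).each { |i| @bits[i] = true }
  end

  # false means "definitely absent"; true only means "possibly present".
  def maybe_include?(key)
    positions(key).all? { |i| @bits[i] }
  end

  private

  # One bit position per salt value.
  def positions(key)
    (0...@hashes).map do |salt|
      Digest::MD5.hexdigest("#{salt}:#{key}").to_i(16) % @size
    end
  end
end

filter = BloomFilter.new
filter.add("user19:personal")
puts filter.maybe_include?("user19:personal")
```

A reader would consult one such filter per SSTable and only seek into the files whose filter answers "possibly present".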
  • Enter Cassandra
    ● Hybrid of ancestors
      ● Adopts listed features
    ● And adds:
      ● Pluggable partitioning
      ● Multi datacenter support
        – Pluggable locality awareness
      ● Datamodel improvements
  • Enter Cassandra
    ● Project status
      ● Open sourced by Facebook in 2008 (no longer active)
      ● Apache License, Version 2.0
      ● Graduated to Apache TLP February 2010
      ● Major releases: 0.3 through 0.6.1 (0.7 this summer)
      ● cassandra.apache.org
    ● Known deployments at:
      ● Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  • The Datamodel
    Cluster
    Nodes have Tokens:
      OrderPreservingPartitioner: Actual keys
      RandomPartitioner: MD5s of keys
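To make the two partitioners on this slide concrete, here is a minimal Ruby sketch. The helper names `random_token`, `order_preserving_token`, and `owner_for` are hypothetical, and the ring lookup (a key belongs to the first node whose token is at or past the key's token, wrapping around) is a simplified assumption rather than Cassandra's actual code.

```ruby
require 'digest/md5'

# RandomPartitioner-style token: the MD5 of the key, read as an integer.
def random_token(key)
  Digest::MD5.hexdigest(key).to_i(16)
end

# OrderPreservingPartitioner-style token: the key itself.
def order_preserving_token(key)
  key
end

# Simplified ring lookup: the first node token >= the key's token owns the key;
# keys past the last token wrap around to the first node.
def owner_for(token, node_tokens)
  sorted = node_tokens.sort
  sorted.find { |t| t >= token } || sorted.first
end

# Order-preserving: neighbouring keys land on the same node, so range scans work.
opp_nodes = %w[h p z]
%w[alice bob mike].each do |key|
  puts "#{key} -> node #{owner_for(order_preserving_token(key), opp_nodes)}"
end

# Random: tokens are spread over the 128-bit MD5 space, which balances load
# but scatters neighbouring keys across nodes.
rp_nodes = (1..4).map { |i| i * (2**128 / 4) }
%w[alice bob mike].each do |key|
  puts "#{key} -> node #{rp_nodes.index(owner_for(random_token(key), rp_nodes))}"
end
```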
  • The Datamodel
    Cluster > Keyspace
    Like an RDBMS schema: Keyspace per application
  • The Datamodel
    Cluster > Keyspace > Column Family
    Sorted hash: Bytes → Row
    Like an RDBMS table: Separates classes of Objects
    Row Key → Row
  • The Datamodel
    Cluster > Keyspace > Column Family > Row
    Sorted hash: Name → Value ...
  • The Datamodel
    Cluster > Keyspace > Column Family > Row > “Column”
    Not like an RDBMS column: Attribute of the row: each row can contain millions of different columns …
    Name → Value
    bytes → bytes
    +version timestamp
  • StatusApp: another Twitter clone.
  • StatusApp Example
    <ColumnFamily Name="Users">
    ● Unique id as key: name->value pairs contain user attributes
    {key: "rails_user", row: {"fullname": "Damon Clinkscales", "joindate": "back_in_the_day" … }}
  • StatusApp Example
    <ColumnFamily Name="Timelines">
    ● User id and timeline name as key: row contains list of updates from that timeline
    {key: "user19:personal", row: {<timeuuid1>: "status19", <timeuuid2>: "status21", … }}
  • Raw Client API
    ● Thrift RPC framework
      ● Generates client bindings for (almost) any language
    1. Get the most recent status in a timeline:
      ● get_slice(keyspace, key, [column_family, column_name], predicate, consistency_level)
      ● get_slice("statusapp", "user19:personal", ["Timelines"], {start: "", count: 1}, QUORUM)
        > <timeuuid1>: "status19"
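The slice call on this slide can be modelled in plain Ruby to show what it returns. This is an in-memory sketch, not the Thrift API: the `slice` method, its keyword arguments, and the use of small integers in place of time UUIDs are assumptions made for illustration.

```ruby
# In-memory stand-in for one keyspace: column family -> row key -> columns.
# Integers stand in for the time UUID column names used on the slide.
statusapp = {
  "Timelines" => {
    "user19:personal" => { 2 => "status21", 1 => "status19" }
  }
}

# Hypothetical slice: sort the row's columns by name, skip names before
# `start` (nil means "from the beginning"), and return at most `count`.
def slice(keyspace, key, column_family, start: nil, count: 100)
  row = keyspace.fetch(column_family, {}).fetch(key, {})
  columns = row.sort_by { |name, _value| name }
  columns = columns.drop_while { |name, _value| name < start } if start
  columns.first(count).to_h
end

p slice(statusapp, "user19:personal", "Timelines", count: 1)
# => {1=>"status19"}
```

Because a row is a sorted hash of columns, a "first column" query like this is a cheap ordered read rather than a search.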
  • But...
    ● Don't use the Raw Thrift API!
      ● You won't enjoy it
    ● Use high level Client APIs
      ● Many options for each language
  • Consistency Levels?
    ● Eventual consistency
      ● Synch to Washington, asynch to Hong Kong
    ● Client API Tunables
      ● Synchronously write to W replicas
      ● Confirm R replicas match at read time
      ● of N total replicas
    ● Allows for almost-strong consistency
      ● When W + R > N
  • Write Example
    Replication Factor == N == 3: 3 Copies
  • Write Example
    Client connects to arbitrary node
  • Write Example
    cl.ONE: W == 1
    Block for success on 1 replica
  • Write Example
    cl.QUORUM: W == N/2+1
    Block for success on a majority
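The arithmetic behind the write examples and the W + R > N rule can be checked with a short Ruby sketch. The method names `quorum` and `overlapping?` are hypothetical; only the formulas come from the slides.

```ruby
# Replicas a QUORUM operation waits for, given replication factor n
# (integer division, as in N/2+1 on the slide).
def quorum(n)
  n / 2 + 1
end

# When W + R > N, every read set shares at least one replica with every
# write set, so a read sees the latest acknowledged write.
def overlapping?(w, r, n)
  w + r > n
end

n = 3
puts quorum(n)                               # 2
puts overlapping?(quorum(n), quorum(n), n)   # true:  QUORUM writes + QUORUM reads
puts overlapping?(1, 1, n)                   # false: ONE writes + ONE reads
```

With N == 3, writing and reading at QUORUM gives 2 + 2 > 3, while ONE/ONE gives 1 + 1 <= 3 and is only eventually consistent.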
  • Caveat consumptor
    ● No secondary indexes:
      ● Typically implemented in client libraries
    ● No transactions
      ● But atomic increment/decrement RSN
    ● Absolutely no joins
      ● You don't really want 'em anyway
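A client-maintained secondary index, as mentioned on this slide, might look roughly like the following Ruby sketch. Plain hashes stand in for two column families, and the names `save_user`, `find_keys_by_fullname`, and the index layout are hypothetical, not taken from any client library.

```ruby
# Write the user row, then write an index row keyed by the indexed value.
# The index row's column names are the keys of matching users.
def save_user(users, index, key, attrs)
  users[key] = (users[key] || {}).merge(attrs)
  fullname = attrs["fullname"]
  (index[fullname] ||= {})[key] = "" if fullname
end

# Look up user keys through the index instead of scanning every row.
def find_keys_by_fullname(index, fullname)
  index.fetch(fullname, {}).keys
end

users = {}                # stands in for the Users column family
users_by_fullname = {}    # stands in for a hand-built index column family

save_user(users, users_by_fullname, "rails_user", "fullname" => "Damon Clinkscales")
p find_keys_by_fullname(users_by_fullname, "Damon Clinkscales")
# => ["rails_user"]
```

Since there are no transactions, the two writes are not atomic, and this sketch does not remove the old index entry when an indexed value changes; a real library has to handle both cases.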
  • “That doesn't sound worth the trouble!”
  • Cassandra Ruby Support: Cassandra Object
    ● Mostly duck-type compatible with ActiveRecord objects
    ● Transparently builds (non-)unique secondary indexes
    ● Excludes:
      – :order
      – :conditions
      – :join
      – :group
  • Cassandra Ruby Support: RDF.rb
    ● Repository implementation for RDF.rb
    ● Stores triple of (subject, predicate, object) as (rowkey, name, subname)
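The triple mapping on this slide can be pictured with nested Ruby hashes. This sketch reads "name" and "subname" as a super column name and a sub-column name, which is an assumption; `store_triple` and `objects_for` are hypothetical helpers and not the adapter's API.

```ruby
# Store (subject, predicate, object) as (row key, name, subname).
# The object is kept as a sub-column name, so its value can stay empty.
def store_triple(repo, subject, predicate, object)
  row = (repo[subject] ||= {})
  supercolumn = (row[predicate] ||= {})
  supercolumn[object] = ""
end

# All objects for a subject/predicate pair are the sub-column names.
def objects_for(repo, subject, predicate)
  repo.fetch(subject, {}).fetch(predicate, {}).keys
end

repo = {}
store_triple(repo, "http://example.org/stu", "foaf:knows", "http://example.org/damon")
store_triple(repo, "http://example.org/stu", "foaf:knows", "http://example.org/arto")
p objects_for(repo, "http://example.org/stu", "foaf:knows")
```

Queries that start from the subject are direct row reads under this layout; queries by predicate or object alone would need additional indexes.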
  • Silver linings: Ops
    ● Dead drive?
      ● Swap the drive, restart, run 'repair'
      ● Streams missing data from other replicas
    ● Dead node?
      ● Start a new node with the same IP and token, run 'repair'
  • Silver linings: Ops
    ● Need N new nodes?
      ● Start more nodes with the same config file
      ● New nodes request load information from the cluster and join with a token that balances the cluster
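One way a joining node could pick a balancing token is to split the range of the most loaded node in half. The Ruby sketch below illustrates that idea only; the method name `bootstrap_token`, the load map, and the ring size are assumptions, and the slide does not say this is exactly how Cassandra chooses.

```ruby
# loads_by_token maps each existing node's token to its reported load.
# Each node owns the range from the previous token (exclusive) up to its own.
def bootstrap_token(loads_by_token, ring_size: 2**127)
  tokens = loads_by_token.keys.sort
  busiest = loads_by_token.max_by { |_token, node_load| node_load }.first
  # Index -1 wraps to the last token, closing the ring.
  previous = tokens[tokens.index(busiest) - 1]
  span = (busiest - previous) % ring_size
  span = ring_size if span.zero?   # a single node owns the whole ring
  (previous + span / 2) % ring_size
end

# Three nodes on a tiny ring of size 300; the node at token 100 is busiest.
puts bootstrap_token({ 0 => 10, 100 => 50, 200 => 20 }, ring_size: 300)   # 50
```

The new node at token 50 takes over half of the busiest node's range, which is the effect the slide describes as joining "with a token that balances the cluster".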
  • Silver linings: Ops
    ● Adding a datacenter?
      ● Configure “dc/rack/ip” describing node location
      ● Add new nodes as before
  • Silver linings: Performance
  • Getting started
    ● `gem install cassandra`
    ● `git clone git://github.com/tritonrc/cassandra_object.git`
    ● http://cassandra.apache.org/
      ● Read "Getting Started"... Roughly:
        – Start one node
        – Test/develop app, editing node config as necessary
        – Launch cluster by starting more nodes with chosen config
  • Questions?
  • Resources
    ● http://cassandra.apache.org/
    ● http://wiki.apache.org/cassandra/
    ● Mailing Lists
    ● #cassandra on freenode.net
  • References
    ● Digg Technology Blog
      ● http://about.digg.com/blog/looking-future-cassandra
      ● http://about.digg.com/blog/introducing-digg’s-iddb-infrastructure
    ● Github Projects
      ● http://github.com/tritonrc/cassandra_object
      ● http://github.com/bendiken/rdf-cassandra
    ● Cassandra Wiki
      ● http://wiki.apache.org/cassandra/
    ● Brandon Williams's perf tests
      ● http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png