On Rails with Apache Cassandra


                 Austin on Rails
                  April 27th 2010
  Stu Hood (@stuhood) ...

Presentation on Apache Cassandra from the April 27th Austin on Rails meetup.

Published in: Technology
  • cassandra_object moved to git clone git://github.com/NZKoz/cassandra_object.git
  • Thanks for the mention of the Cassandra storage adapter for RDF.rb, Stu. We're currently in the process of benchmarking it with the 258 million RDF triples that constitute the English edition of DBpedia, the RDFization of Wikipedia. Cassandra's doing very nicely so far, though we do have to maintain our own secondary performance indices, as you mention.

On Rails with Apache Cassandra

  1. On Rails with Apache Cassandra
      Austin on Rails, April 27th 2010
      Stu Hood (@stuhood) – Technical Lead, Rackspace
  2. My, what a large/volatile dataset you have!
      ●   Large
          ●   Larger than 1 node can handle
      ●   Volatile
          ●   More than 25% (ish) writes
          ●   (but still larger than available memory)
      ●   Expensive
          ●   More than you can afford with a commercial solution
  3. My, what a large/volatile dataset you have!
      ●   For example:
          ●   Event/log data
          ●   Output of batch processing or log analytics jobs
          ●   Social network relationships/updates
      ●   In general:
          ●   Large volume of high fanout data
  4. Conversely...
      ●   If your pattern easily fits one RDBMS machine:
          ●   Don't Use Cassandra
          ●   Possibly consider MongoDB, CouchDB, Neo4j, Redis, etc
              –   For schema freedom and flexibility
  5. Case Study: Digg
      1. Vertical partitioning and master/slave trees
      2. Developed sharding solution
          ●   IDDB
          ●   Awkward replication, fragile scaling
      3. Began populating Cassandra in parallel
          ●   Initial dataset for 'green badges'
              –   3 TB
              –   76 billion kv pairs
          ●   Most applications being ported to Cassandra
  6. Cassandra's Elders
  7. Standing on the shoulders of: Amazon Dynamo
      ●   No node in the cluster is special
          ●   No special roles
          ●   No scaling bottlenecks
          ●   No single point of failure
      ●   Techniques
          ●   Gossip
          ●   Eventual consistency
  8. Standing on the shoulders of: Google Bigtable
      ●   “Column family” data model
      ●   Range queries for rows:
          ●   Scan rows in order
      ●   Memtable/SSTable structure
          ●   Always writes sequentially to disk
          ●   Bloom filters to minimize random reads
          ●   Trounces B-Trees for big data
              –   Linear insert performance
              –   Logarithmic growth for reads
  9. Enter Cassandra
      ●   Hybrid of ancestors
          ●   Adopts listed features
      ●   And adds:
          ●   Pluggable partitioning
          ●   Multi datacenter support
              –   Pluggable locality awareness
          ●   Datamodel improvements
  10. Enter Cassandra
      ●   Project status
          ●   Open sourced by Facebook in 2008 (no longer active)
          ●   Apache License, Version 2.0
          ●   Graduated to Apache TLP February 2010
          ●   Major releases: 0.3 through 0.6.1 (0.7 this summer)
          ●   cassandra.apache.org
      ●   Known deployments at:
          ●   Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  11. The Datamodel: Cluster
      Nodes have Tokens:
      OrderPreservingPartitioner: Actual keys
      RandomPartitioner: MD5s of keys
  12. The Datamodel: Cluster > Keyspace
      Like an RDBMS schema:
      Keyspace per application
  13. The Datamodel: Cluster > Keyspace > Column Family
      Sorted hash: Bytes → Row
      Like an RDBMS table: Separates classes of Objects
      Row Key → Row
  14. The Datamodel: Cluster > Keyspace > Column Family > Row
      Sorted hash: Name → Value
      ...
  15. The Datamodel: Cluster > Keyspace > Column Family > Row > “Column”
      Not like an RDBMS column:
      Attribute of the row: each row can contain millions of different columns
      …
      Name → Value
      bytes → bytes
      +version timestamp
  16. StatusApp: another Twitter clone.
  17. StatusApp Example
      <ColumnFamily Name="Users">
      ●   Unique id as key: name->value pairs contain user attributes
      {key: "rails_user", row: {"fullname": "Damon Clinkscales", "joindate": "back_in_the_day" … }}
  18. StatusApp Example
      <ColumnFamily Name="Timelines">
      ●   User id and timeline name as key: row contains list of updates from that timeline
      {key: "user19:personal", row: {<timeuuid1>: "status19", <timeuuid2>: "status21", … }}
  19. Raw Client API
      ●   Thrift RPC framework
          ●   Generates client bindings for (almost) any language
      1. Get the most recent status in a timeline:
          ●   get_slice(keyspace, key, [column_family, column_name], predicate, consistency_level)
          ●   get_slice("statusapp", "user19:personal", ["Timelines"], {start: "", count: 1}, QUORUM)
              > <timeuuid1>: "status19"
  20. But...
      ●   Don't use the Raw Thrift API!
          ●   You won't enjoy it
      ●   Use high level Client APIs
          ●   Many options for each language
  21. Consistency Levels?
      ●   Eventual consistency
          ●   Synch to Washington, asynch to Hong Kong
      ●   Client API Tunables
          ●   Synchronously write to W replicas
          ●   Confirm R replicas match at read time
          ●   of N total replicas
      ●   Allows for almost-strong consistency
          ●   When W + R > N
  22. Write Example
      Replication Factor == N == 3: 3 Copies
  23. Write Example
      Client connects to arbitrary node
  24. Write Example
      cl.ONE: W == 1
      Block for success on 1 replica
  25. Write Example
      cl.QUORUM: W == N/2+1
      Block for success on a majority
  26. Caveat consumptor
      ●   No secondary indexes:
          ●   Typically implemented in client libraries
      ●   No transactions
          ●   But atomic increment/decrement RSN
      ●   Absolutely no joins
          ●   You don't really want 'em anyway
  27. “That doesn't sound worth the trouble!”
  28. Cassandra Ruby Support: Cassandra Object
      ●   Mostly duck-type compatible with ActiveRecord objects
          ●   Transparently builds (non-)unique secondary indexes
          ●   Excludes:
              –   :order
              –   :conditions
              –   :join
              –   :group
  29. Cassandra Ruby Support: RDF.rb
      ●   Repository implementation for RDF.rb
          ●   Stores triple of (subject, predicate, object) as (rowkey, name, subname)
  30. Silver linings: Ops
      ●   Dead drive?
          ●   Swap the drive, restart, run 'repair'
          ●   Streams missing data from other replicas
      ●   Dead node?
          ●   Start a new node with the same IP and token, run 'repair'
  31. Silver linings: Ops
      ●   Need N new nodes?
          ●   Start more nodes with the same config file
          ●   New nodes request load information from the cluster and join with a token that balances the cluster
  32. Silver linings: Ops
      ●   Adding a datacenter?
          ●   Configure “dc/rack/ip” describing node location
          ●   Add new nodes as before
  33. Silver linings: Performance
  34. Getting started
      ●   `gem install cassandra`
      ●   `git clone git://github.com/tritonrc/cassandra_object.git`
      ●   http://cassandra.apache.org/
          ●   Read "Getting Started"... Roughly:
              –   Start one node
              –   Test/develop app, editing node config as necessary
              –   Launch cluster by starting more nodes with chosen config
  35. Questions?
  36. Resources
      ●   http://cassandra.apache.org/
      ●   http://wiki.apache.org/cassandra/
      ●   Mailing Lists
      ●   #cassandra on freenode.net
  37. References
      ●   Digg Technology Blog
          ●   http://about.digg.com/blog/looking-future-cassandra
          ●   http://about.digg.com/blog/introducing-digg’s-iddb-infrastructure
      ●   Github Projects
          ●   http://github.com/tritonrc/cassandra_object
          ●   http://github.com/bendiken/rdf-cassandra
      ●   Cassandra Wiki
          ●   http://wiki.apache.org/cassandra/
      ●   Brandon Williams' perf tests
          ●   http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
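
Slide 11 says nodes have tokens and that the partitioner decides how a key becomes a token: the actual key for OrderPreservingPartitioner, an MD5 of the key for RandomPartitioner. The following is a minimal in-memory sketch of that idea in plain Ruby, not Cassandra's implementation. The node names, the token values, the helper names, and the exact integer encoding of the MD5 are all assumptions made for illustration.

```ruby
require 'digest'

# RandomPartitioner-style token: the MD5 of the key, read as an integer.
# (The integer encoding here is an assumption for illustration.)
def random_token(key)
  Digest::MD5.hexdigest(key).to_i(16)
end

# OrderPreservingPartitioner-style token: the key itself.
def order_preserving_token(key)
  key
end

# Hypothetical ring lookup: the owner is the node with the smallest token
# that is >= the key's token, wrapping around to the lowest token.
def owner_for(token, ring)
  sorted = ring.sort_by { |_node, node_token| node_token }
  match = sorted.find { |_node, node_token| node_token >= token }
  (match || sorted.first).first
end

# Made-up three-node rings, one per partitioner style.
OPP_RING = { 'node_a' => 'g', 'node_b' => 'n', 'node_c' => 't' }
RANDOM_RING = {
  'node_a' => 0,
  'node_b' => 2**128 / 3,
  'node_c' => 2 * 2**128 / 3
}

puts owner_for(order_preserving_token('rails_user'), OPP_RING) # node_c
puts owner_for(order_preserving_token('zebra'), OPP_RING)      # node_a (wraps)
puts owner_for(random_token('rails_user'), RANDOM_RING)
```

With the order-preserving ring, neighbouring keys land on the same node, which is what makes the row range scans on slide 8 possible. The MD5 ring spreads keys evenly but gives up that ordering.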
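
Slides 13 to 19 describe a column family as a sorted hash of row keys, each row as a sorted hash of column names to values with a timestamp, and a slice as a read over one row. A rough way to picture this is nested Ruby hashes. The sketch below is an in-memory model only: `slice` is a hypothetical helper, not the Thrift `get_slice` call, and small integers stand in for the TimeUUID column names.

```ruby
# Keyspace -> column family -> row key -> { column name => [value, timestamp] }
STATUSAPP = {
  'Users' => {
    'rails_user' => {
      'fullname' => ['Damon Clinkscales', 1],
      'joindate' => ['back_in_the_day', 1]
    }
  },
  'Timelines' => {
    'user19:personal' => {
      # Integers stand in for time-ordered TimeUUID column names.
      1001 => ['status19', 1001],
      1002 => ['status21', 1002]
    }
  }
}

# Hypothetical in-memory slice: the columns of one row in sorted name order,
# starting at `start` (nil means the beginning), at most `count` of them.
def slice(keyspace, column_family, key, start: nil, count: 100)
  row = keyspace.fetch(column_family).fetch(key, {})
  names = row.keys.sort
  names = names.select { |name| name >= start } unless start.nil?
  names.first(count).map { |name| [name, row[name].first] }
end

p slice(STATUSAPP, 'Timelines', 'user19:personal', count: 1)
# [[1001, "status19"]]
```

Because each row keeps its columns sorted by name, a timeline read is a contiguous scan within a single row rather than a query across many rows.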
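
The arithmetic on slides 21 to 25 (N total replicas, W write acknowledgements, R read replicas, QUORUM as N/2+1, and the W + R > N rule) can be checked with a few lines of Ruby. This is a sketch of the arithmetic only. The method names are made up for illustration, and `write_succeeds?` is a toy stand-in for a coordinator, with an array of booleans marking which replicas respond.

```ruby
# Replicas needed for a majority of n copies (cl.QUORUM on the slides).
def quorum(n)
  n / 2 + 1
end

# Reads and writes are guaranteed to share at least one replica when W + R > N.
def overlapping?(w, r, n)
  w + r > n
end

# Toy coordinator: the write goes to every replica and succeeds once at
# least w of them acknowledge. `replicas_up` marks which replicas respond.
def write_succeeds?(replicas_up, w)
  replicas_up.count(true) >= w
end

n = 3
puts quorum(n)                                      # 2
puts overlapping?(quorum(n), quorum(n), n)          # true  (2 + 2 > 3)
puts overlapping?(1, 1, n)                          # false (1 + 1 <= 3)
puts write_succeeds?([true, false, false], 1)       # true  (cl.ONE)
puts write_succeeds?([true, false, false], quorum(n)) # false (cl.QUORUM)
```

With N = 3, QUORUM reads and writes tolerate one unreachable replica while still overlapping. Writing and reading at ONE tolerates two, but a read may then miss the latest write.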
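
Slide 26 notes that secondary indexes are typically implemented in client libraries, and slide 28 says Cassandra Object builds them transparently. One way such an index could work is for the client to write a second row, keyed by the indexed value, every time it writes the data row. The class below is an in-memory illustration of that pattern under those assumptions. It is not the cassandra_object implementation, and every name in it is hypothetical.

```ruby
# Two in-memory "column families": the data rows and a client-maintained index.
class IndexedUsers
  def initialize
    @users = {}                                   # row key => attributes
    @by_fullname = Hash.new { |h, k| h[k] = {} }  # fullname => { row key => true }
  end

  # Write the data row, then bring the index row in step. No transaction
  # spans the two writes, so a failure in between can leave the index stale.
  def save(key, attributes)
    old = @users[key]
    @by_fullname[old['fullname']].delete(key) if old
    @users[key] = attributes
    @by_fullname[attributes['fullname']][key] = true
  end

  # Non-unique lookup: every row whose fullname matches, in key order.
  def find_by_fullname(fullname)
    @by_fullname.fetch(fullname, {}).keys.sort.map { |key| @users[key] }
  end
end

users = IndexedUsers.new
users.save('rails_user', 'fullname' => 'Damon Clinkscales')
p users.find_by_fullname('Damon Clinkscales')
# [{"fullname"=>"Damon Clinkscales"}]
```

The index row is just another sorted hash, so a lookup by value turns into a read of one row followed by reads of the matching data rows.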
