Cassandra Overview

Short Apache Cassandra overview at the December Seattle Apache Cassandra meetup at Disney

  • Speaker note: the CQL spec is at version 3, but I believe it is still a bit raw and untested. Thrift is not going away anytime soon.


  • 1. Cassandra Overview   
  • 2. What Is It? ● It is a persistent database, but not an RDBMS – more on the API later ● It can run as a single instance or as part of a cluster ● All nodes are equal: no master, no slaves ● The cluster can be distributed within a single data center (DC) or across multiple DCs ● Multiple DCs can be Active-Active for performance or Active-Passive for disaster recovery (DR)
  • 3. Simple API ● Get, Put, Delete – all by key ● Batch put and delete – saves round trips on the wire ● Range queries (iterate over a sequence of keys) ● Target individual columns within a row – Get and Put ● Native integration available for Hadoop MapReduce ● CQL – an SQL-like query language
  • 4. Consistent Hash Ring ● Conceptually, all nodes in a cluster sit on a ring of hash values, called “tokens” ● Each node is assigned a token range on the ring ● A key's hash (token) places it on the ring, within a specific node's token range ● The hash is consistent, meaning the location of data is consistent and predictable
  • 5. [Ring diagram] Token space 0 => 2^127 (RandomPartitioner), with nodes N1 to N8 owning ranges R1 to R8 ● K1 => H1 (token) ● H1 => R4 (primary = N4) ● N = 3, so the replica set RS = N4, N5, N6
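The lookup shown on the ring slide can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the helper names (`token_for`, `build_ring`, `primary_node`) are hypothetical, the tokens are spaced evenly for simplicity, and it assumes a node owns the range that ends at its own token.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space from the slide (RandomPartitioner)


def token_for(key: str) -> int:
    """Hash a key onto the ring: MD5 digest reduced to the token space."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(node_names):
    """Assign evenly spaced tokens; returns a list of (token, node) sorted by token."""
    step = RING_SIZE // len(node_names)
    return [(i * step, name) for i, name in enumerate(node_names)]


def primary_node(ring, token: int) -> str:
    """Return the owner of `token`: the first node whose token is >= it.

    Assumption: each node owns the range ending at its own token, and the
    lookup wraps around to the first node past the highest token.
    """
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]


ring = build_ring(["N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8"])
h1 = token_for("K1")
print("K1 =>", h1, "=>", primary_node(ring, h1))
```

No query against the cluster is needed: any client that knows the token assignments can compute the owner locally.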
  • 6. Replication ● Replication Factor (N) determines how many replicas exist for each key ● The location of replicas is determined by the consistent hash ring and the “partitioner” ● Generally, N=3 means data is placed on the primary node and the next two nodes on the ring, e.g. N4, N5, N6 (this can vary based on placement strategy, but is predictable) ● Powerful because no query is required to find the node(s) containing a key
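As a rough sketch of the "primary plus the next nodes on the ring" placement described above, assuming the simplest placement strategy and a hypothetical `replicas` helper:

```python
def replicas(ring_nodes, primary_index, replication_factor):
    """Walk clockwise from the primary node, collecting N nodes in ring order."""
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor exceeds cluster size")
    return [
        ring_nodes[(primary_index + i) % len(ring_nodes)]
        for i in range(replication_factor)
    ]


nodes = ["N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8"]
print(replicas(nodes, nodes.index("N4"), 3))  # ['N4', 'N5', 'N6']
```

With N4 as primary and N = 3 this reproduces the replica set from the ring diagram; a primary near the end of the list wraps around to the start of the ring.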
  • 7. Consistency ● Consistency is “eventual” in Cassandra: it will always work to create N (Replication Factor) replicas ● Write Consistency (W) defines how many replicas must acknowledge a “put” request before it succeeds ● Read Consistency (R) defines how many replicas are consulted before responding ● W and R are tunable per request, therefore consistency is tunable as well
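One way to reason about tuning W and R is the quorum-overlap rule used by Dynamo-style systems: if R + W > N, every read consults at least one replica that acknowledged the latest write. The slide does not state this rule, so treat the sketch below as background rather than part of the talk; the function names are hypothetical.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when any R replicas read must include one of the W replicas written."""
    return r + w > n


n = 3
print(quorum(n))                       # 2
print(read_overlaps_write(n, 2, 2))    # True: quorum writes plus quorum reads
print(read_overlaps_write(n, 1, 1))    # False: fast, but only eventually consistent
```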
  • 8. Data Modeling Example   
  • 9. Schema Overview ● Keyspace (“database”) contains one or more ColumnFamilies ● ColumnFamily (“table”) contains zero or more rows ● A Row must contain one or more columns ● ColumnFamilies are indexed by key (the keyed entries are called “rows”, but a ColumnFamily behaves more like a hash map) ● Rows within the same CF may have a different number of columns, and different column names!
  • 10. Example: UserData (Keyspace)
      UserAttributes (ColumnFamily, sort = UTF8)
        Ellie: Age = 4, Sex = Female, Weight = 32
        Sammy: Age = 2, Sex = Male
        Henry: Age = 2, EyeColor = Blue, Height = 30, Sex = Male
      UserAccessLog (ColumnFamily, sort = Long)
        Sammy: 7/20/2010, 7/22/2010
        Henry: 7/22/2010, 7/23/2010, 7/24/2010
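The example keyspace maps naturally onto nested hash maps: keyspace, then ColumnFamily, then row key, then column name. The sketch below is only a mental model in plain Python, not a client API. Values are kept as strings, and the empty values in the access log are an assumption, since the slide shows only column names there.

```python
# Keyspace "UserData": ColumnFamily -> row key -> {column name: value}
user_data = {
    "UserAttributes": {
        "Ellie": {"Age": "4", "Sex": "Female", "Weight": "32"},
        "Sammy": {"Age": "2", "Sex": "Male"},
        "Henry": {"Age": "2", "EyeColor": "Blue", "Height": "30", "Sex": "Male"},
    },
    "UserAccessLog": {
        # Column names carry the data (access dates); values are left empty.
        "Sammy": {"7/20/2010": "", "7/22/2010": ""},
        "Henry": {"7/22/2010": "", "7/23/2010": "", "7/24/2010": ""},
    },
}

# Rows in the same ColumnFamily need not share column names.
for key, columns in user_data["UserAttributes"].items():
    print(key, sorted(columns))
```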
  • 11. Columns ● Column names (not values) are sorted, per key ● The number of columns per key is bounded by a 32-bit count (roughly 2 billion) ● An entire column must fit in RAM, on one machine ● Can retrieve/update/delete all columns, columns by name, or a range of columns ● A key (or row) must contain at least one column, otherwise it is considered deleted
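Because column names are kept sorted within a row, a range of columns can be read with two binary searches. The toy `Row` class below is a hypothetical in-memory model of that idea, not how Cassandra stores data; it uses integer column names, as a Long-sorted ColumnFamily would.

```python
import bisect


class Row:
    """Toy row: column names are kept sorted so ranges are cheap to slice."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def delete(self, name):
        if name in self._values:
            del self._values[name]
            self._names.remove(name)

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

    def is_deleted(self):
        """A row with no columns left is treated as deleted."""
        return not self._names


row = Row()
for ts in (30, 10, 20):          # inserted out of order
    row.put(ts, "visit")
print(row.slice(10, 20))         # [(10, 'visit'), (20, 'visit')]
```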
  • 12. Thrift Read Methods ● get – return a single column for a single key ● get_slice – return multiple columns for a single key ● multiget_slice – return multiple columns for a list of keys ● get_range_slices – return multiple columns for a “range” of keys ● Most applications use a “high level” client (Hector, Pycassa, etc.) rather than calling Thrift directly
  • 13. Thrift Write Methods ● insert – insert/update a single column for a single key (most people call this method “put”) ● batch_mutate – insert/update/remove multiple columns for multiple keys in multiple ColumnFamilies ● remove – remove a single column (or entire row) for a single key
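To make the semantics of these read and write methods concrete, here is a toy in-memory stand-in for a single ColumnFamily. It borrows the Thrift method names but not their signatures: the real calls take structures such as a column path, a slice predicate and a consistency level, all of which are omitted here. `batch_mutate` and `get_range_slices` are left out for brevity, and `ToyColumnFamily` is a hypothetical name.

```python
class ToyColumnFamily:
    """In-memory model: row key -> {column name: value}. Not a real client."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, column, value):
        """Insert or update a single column for a single key ("put")."""
        self._rows.setdefault(key, {})[column] = value

    def get(self, key, column):
        """Return a single column for a single key (KeyError if absent)."""
        return self._rows[key][column]

    def get_slice(self, key, columns):
        """Return the requested columns that exist for one key."""
        row = self._rows.get(key, {})
        return {c: row[c] for c in columns if c in row}

    def multiget_slice(self, keys, columns):
        """Return the requested columns for each key in a list of keys."""
        return {k: self.get_slice(k, columns) for k in keys}

    def remove(self, key, column=None):
        """Remove one column, or the entire row when no column is given."""
        if column is None:
            self._rows.pop(key, None)
            return
        row = self._rows.get(key, {})
        row.pop(column, None)
        if not row:  # a row with no columns is considered deleted
            self._rows.pop(key, None)


cf = ToyColumnFamily()
cf.insert("Ellie", "Age", "4")
cf.insert("Ellie", "Sex", "Female")
cf.insert("Sammy", "Age", "2")
print(cf.multiget_slice(["Ellie", "Sammy"], ["Age", "Sex"]))
```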