CQL spec is at version 3 – but I believe is still a bit raw and untested. Not getting rid of thrift anytime soon
What Is It? ● It is a persistent database, but not an RDBMS – more on API later ● It can run as a single instance or as a part of a cluster. ● All nodes are equal, no master, no slaves ● The cluster can be distributed within a single DC or across multiple DCs. ● Multiple DCs can be Active-Active for performance or Active-Passive for DR
Simple API ● Get, Put, Delete – all by key ● Batch put and delete – save wire time ● Range queries (iterate over sequence of keys) ● Target individual columns within a row – Get and Put ● Native integration available for Hadoop MapReduce ● CQL – SQL like language
Consistent Hash Ring ● Conceptually all nodes in a cluster are on a ring of hash values, “tokens” ● Each node is assigned a token range on the ring ● A keys hash (token) places it on the ring, within a specific nodes token range ● The hash is consistent, meaning the location of data is consistent and predictable
Replication ● Replication Factor (N) determines how many replicas exist for each key ● Location of replicas is determined by consistent hash ring and the “partitioner” ● Generally, N=3 means data will be placed on node N, N+1, N+2 on the ring (This can vary based on placement strategy, but is predictable) ● Powerful because no query required to find the node(s) containing a key
Consistency ● Consistency is “eventual” in Cassandra – it will always work to create N (Replication Factor) replicas ● Write Consistency (W) defines how many replicas are guaranteed per “put” request ● Read Consistency (R) defines how many replicas are consulted before responding ● W and R are tunable per request, therefore consistency is tunable as well
Schema Overview ● Keyspace (“database”) contains one or more ColumnFamilies ● ColumnFamily (“table”) contains zero or more rows ● A Row must contain one or more columns ● ColumnFamilies are indexed by key (“rows”, but more like hash map) ● Rows within the same CF may have different number of columns, and different column names!!
Example UserData (Keyspace) UserAttributes (ColumnFamily, sort = UTF8) Age Sex Weight Ellie 4 Female 32 Age Sex Sammy 2 Male Age EyeColor Height Sex Henry 2 Blue 30 Male UserAccessLog (ColumnFamily, sort = Long) 7/20/2010 7/22/2010 Sammy 7/22/2010 7/23/2010 7/24/2010 Henry
Columns ● Column names (not values) are sorted, per key ● 32 bit limit to number of columns per key – entire column must fit in RAM, on one machine ● Can retrieve/update/delete all columns, columns by name, or range of columns ● A key (or row) must contain at least one Column, otherwise considered deleted
Thrift Read Methods ● get – return a single column for a single key ● get_slice – return multiple columns for a single key ● multiget_slice – return multiple columns for a list of keys ● get_range_slices – return multiple columns for a “range” of keys ● Most use “high level” client (Hector, Pycassa, etc)
Thrift Write Methods ● insert – insert/update a single column for a single key (most call this method, “put”) ● batch_mutate – insert/update/remove multiple columns for multiple keys in multiple ColumnFamilies ● remove – remove a single column (or entire row) for a single key
Useful References ● http://www.allthingsdistributed.com/2007/1 0/amazons_dynamo.html ● http://www.allthingsdistributed.com/2008/1 2/eventually_consistent.html ● http://wiki.apache.org/cassandra/ ● - "A description of the cassandra data model" ● - "Architecture Overview" ● - “Operations” ● - "Articles and Presentations"