Cassandra - Wellington No Sql



Slides from a talk I gave on 18/10 at the Wellington No Sql group. Focused on Availability, Consistency and Partition Tolerance.

© All Rights Reserved

Cassandra - Wellington No Sql Presentation Transcript

  • Cassandra Wellington No Sql 18/10/2010 Aaron Morton, @aaronmorton
  • The Headlines • Highly Available • Tuneable Consistency • Partition Tolerance • Column Data Model
  • Data Model (quickly) • It's different to RDBMS, and requires a different approach to application design. • Consider how the data will be used, not how it should be modelled. • Simple view; a row has a key and stores an ordered hash in one or more Column Families. More later. • Gets a lot of attention but is not Cassandra's killer feature.
  • Availability • Row data is kept together and replicated by the cluster. • Replication factor is configurable. • Partitioner determines the position of a row key in the distributed hash table. • Replication Strategy determines where in the cluster to place the replicas.
  • Replication Factor • Set per Keyspace. Oops, forgot to mention them. • Known as RF or N. • Is the "target" number of row replicas to store. • A lower number of replicas may be acceptable in the short term. However the system works to eventually have N replicas.
  • Partitioners • One per cluster, cannot be changed without dropping all data. • Ships with Random, Byte Order, Order Preserving and Collating Order Preserving. • Determines the physical order of the rows in the cluster. • Random uses an MD5 hash of the key and gives the best distribution. • Order Preserving uses the actual key and is prone to hotspots.
  • Replication Strategy • Set per Keyspace. • Each node has a token and owns the keys between the previous token and its own. • Simple Strategy places the replicas clockwise around the hash ring. • Network Topology Strategy splits the RF over different DCs or racks. • Endpoint Snitch provides knowledge of the network. Ships with Simple, Rack Inferring and Property File.
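The difference between a hashing partitioner and an order-preserving one can be sketched in a few lines. This is only an illustration, not Cassandra's code: the function names and the `RING_SIZE` token space are assumptions made for the example.

```python
import hashlib

RING_SIZE = 2 ** 127  # assumed token space, for illustration only


def random_token(key: bytes) -> int:
    """Random-style partitioning: the position comes from an MD5 hash of the key."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE


def order_preserving_token(key: bytes) -> bytes:
    """Order-preserving-style partitioning: the key itself is the position."""
    return key


keys = [b"user:0001", b"user:0002", b"user:0003"]
# Neighbouring keys stay neighbours when the key is the token (hotspot risk).
assert sorted(keys, key=order_preserving_token) == keys
# Hashed tokens land wherever the hash sends them.
for k in keys:
    print(k, random_token(k))
```

Sequential keys such as timestamps or user ids all fall in the same part of the ring with the second function, which is where the hotspots come from.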
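Token ownership and clockwise placement, as described for Simple Strategy, might look like the following sketch. The ring layout, node names and the `simple_strategy_replicas` helper are hypothetical; racks, DCs and the snitch are ignored.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, key_token, rf):
    """Pick replicas by walking clockwise from the node that owns key_token.

    ring: list of (token, node_name) tuples sorted by token.
    A node owns the tokens after the previous node's token up to its own.
    """
    tokens = [token for token, _ in ring]
    # First node whose token is >= key_token; wrap to the start past the last token.
    start = bisect_left(tokens, key_token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
print(simple_strategy_replicas(ring, 30, 3))  # ['C', 'D', 'A']
print(simple_strategy_replicas(ring, 80, 3))  # ['A', 'B', 'C']
```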
  • Consistency • It's relative. • Each read and write request specifies a Consistency Level. • Individual nodes may be inconsistent with respect to others. • Reads may give consistent results while some nodes have inconsistent values. • The entire cluster will eventually move to a state where there is one version of each value.
  • Consistency Level • Specifies how many nodes must agree before a request is considered a success. • The system will attempt to get all replicas to agree, eventually. • Specified per request: R for reads, W for writes.
  • Consistency Levels • Zero. All replication is asynchronous. Very bad mojo. • Any • One, Quorum (N/2 + 1), All. • DC Quorum. Quorum within the local DC. • DC Quorum Sync. Quorum in each DC. (DC replication factor is determined by Network Topology Strategy)
  • Which nodes count? • CLs other than Any only consider the nodes the Replication Strategy identifies as targets. • A request may fail even if other nodes are online. • CL Any allows a Hinted Handoff to count as a write.
  • W+R>N • Gives consistent reads and writes. • W is Consistency Level for writes. • R is Consistency Level for reads. • N is Replication Factor.
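The rule is easy to check with numbers. A minimal sketch, where `reads_see_latest_write` is a name made up for the example and W, R and N are plain replica counts:

```python
def reads_see_latest_write(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set, i.e. W + R > N."""
    return w + r > n


n = 3
quorum = n // 2 + 1  # 2 when N is 3

print(reads_see_latest_write(quorum, quorum, n))  # True: Quorum writes, Quorum reads
print(reads_see_latest_write(1, 1, n))            # False: One and One may miss each other
print(reads_see_latest_write(1, n, n))            # True: write One, read All
```

With W + R > N at least one replica in any read set also took part in the latest successful write, so the read cannot return only stale copies.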
  • Inconsistency? • Node goes down, planned or otherwise. Writes succeed with remaining nodes. • Node is overworked and drops messages; the request will still succeed if CL is achieved. • Gypsy curse.
  • Consistent Reads • Read request asynchronously sent to all replicas, either full data or digest is requested. • Once CL nodes have returned, values are compared. • If nodes do not agree, full data is requested from Quorum nodes and repaired. • For CL One, Read Repair is probabilistic and asynchronous. It is mandatory and synchronous for higher CLs.
  • Consistent Writes • Write request sent asynchronously to all replicas. • Must be acknowledged by CL replicas. • If some nodes are down, a Hinted Handoff is sent to one of the up replicas. • For CL other than Any, Hinted Handoff does not contribute to CL. • For CL Any, if no replicas are up the coordinating node will store the Hinted Handoff.
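The compare-and-repair step could be sketched roughly as below. This is a simplification under stated assumptions: every replica returns a (value, timestamp) pair, the copy with the highest timestamp wins, and `digest` and `read_with_repair` are hypothetical helpers rather than Cassandra internals.

```python
import hashlib


def digest(value: bytes, timestamp: int) -> str:
    """Cheap fingerprint a replica could return instead of the full data."""
    return hashlib.md5(value + str(timestamp).encode()).hexdigest()


def read_with_repair(responses):
    """responses: {node: (value, timestamp)} from the replicas that answered.

    Returns (winning_value, nodes_needing_repair).
    """
    newest = max(responses.values(), key=lambda vt: vt[1])
    digests = {digest(value, ts) for value, ts in responses.values()}
    if len(digests) == 1:
        return newest[0], []  # all replicas agree, nothing to repair
    stale = sorted(node for node, vt in responses.items() if vt != newest)
    return newest[0], stale


value, to_repair = read_with_repair(
    {"A": (b"old", 1), "B": (b"new", 2), "C": (b"new", 2)}
)
print(value, to_repair)  # b'new' ['A']
```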
  • Anti Entropy • Main feature for achieving consistency. Hinted Handoff is an optimisation. • Triggered manually via JMX or command line. • Nodes exchange digests of their data using Merkle Trees. • Differences are repaired in the background.
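A toy version of the digest exchange: each node builds a hash tree over its data, and the nodes only drill into subtrees whose hashes differ. This sketch assumes both nodes cover the same ranges, that the number of leaves is a power of two, and uses SHA-256 for convenience; `build_tree` and `differing_leaves` are illustrative names.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: levels[0] is leaf hashes, levels[-1] is [root]."""
    n = len(leaves)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two leaf count"
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Walk down from the root, following only the branches whose hashes differ."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # roots match, so the data matches
    suspects = [0]
    for depth in range(top - 1, -1, -1):
        suspects = [
            child
            for i in suspects
            for child in (2 * i, 2 * i + 1)
            if a[depth][child] != b[depth][child]
        ]
    return suspects


node1 = build_tree([b"range0", b"range1", b"range2", b"range3"])
node2 = build_tree([b"range0", b"range1", b"CHANGED", b"range3"])
print(differing_leaves(node1, node2))  # [2]
```

Only the ranges reported as different need their data streamed between nodes.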
  • Partition Tolerance • Nodes can fail without stopping the system. • Nodes can run slowly without slowing the entire system.
  • How tolerant? • Depends on Replication Factor, Consistency Level and luck. • For RF 3 can lose only 1 replica for a key and still maintain Quorum operations. • For RF 5 can lose 2 replicas for a key and still maintain Quorum operations.
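The RF 3 and RF 5 figures follow directly from the Quorum formula. A small sketch, with helper names chosen for the example:

```python
def quorum(rf: int) -> int:
    """Quorum is N/2 + 1 using integer division."""
    return rf // 2 + 1


def tolerable_replica_losses(rf: int) -> int:
    """Replicas of a key that can be down while Quorum requests still succeed."""
    return rf - quorum(rf)


for rf in (3, 5):
    print(rf, quorum(rf), tolerable_replica_losses(rf))
# 3 2 1
# 5 3 2
```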
  • Failure Detection • Each node keeps track of every other node. • Based on the Phi Accrual Failure Detector. • Considers the intervals for the last 1,000 Gossip heartbeats. • Configurable eviction threshold.
  • Failing at failing • What about if a node is just running slow? • Dynamic Endpoint Snitch tracks response latency from other nodes. • Alters the "proximity" of a node based on its recent performance. • Proximity is considered when asking a node for full data or a digest.
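The idea behind an accrual detector is to output a suspicion level (phi) that grows the longer a heartbeat is overdue, instead of a yes/no answer. The sketch below assumes heartbeat intervals are exponentially distributed, so phi = elapsed / (mean interval × ln 10); the class and its method names are hypothetical and the threshold is left to the caller.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # keeps only the most recent intervals
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival time of a heartbeat."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the chance that a heartbeat is merely late."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.phi(3.5))   # low: only half an interval overdue
print(detector.phi(23.0))  # much higher: twenty intervals overdue
```

A node would be marked down once phi passes the configured threshold.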
  • Data Model • Cassandra is an index building machine. • Ideally try to serve a query by reading one or more rows from one CF. • Consider how the data will be read rather than model beauty. • Denormalise • There's information in column names.
  • Like Ogres • Keyspace • Row • Super Column Family or Column Family • Column
  • Keyspace • Typically one per application. • Just a container, little overhead. • Name, replication strategy, replication factor.
  • Row • Identified by binary key. • Intersection with Column Family. • All rows with the same key on the same nodes.
  • Super Column Family • Ordered list of Super Columns at the first level. • Super Column has a Name and ordered list of columns. • Some performance issues for large rows. • Good for future app expansion.
  • Column Family • One level of ordered columns. • Column has Name, Value, Timestamp and Time To Live (0.7 only). • Name and Value are binary. • Suggested upper limit of 2 billion columns per row.
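As a rough mental model, a Column Family behaves like a map of row key to an ordered map of columns. The sketch below is an in-memory toy, not a client API: `ColumnFamily`, `insert` and `get_slice` are names made up for the example, conflicts are assumed to resolve to the highest timestamp, and Time To Live is left out.

```python
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])


class ColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name -> Column}

    def insert(self, row_key: bytes, name: bytes, value: bytes, timestamp: int) -> None:
        """Store a column; an older timestamp never overwrites a newer one."""
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(name)
        if current is None or timestamp >= current.timestamp:
            columns[name] = Column(name, value, timestamp)

    def get_slice(self, row_key: bytes, start: bytes = b"", finish=None):
        """Return the row's columns ordered by name, optionally bounded by name."""
        columns = self.rows.get(row_key, {})
        names = sorted(
            n for n in columns if n >= start and (finish is None or n <= finish)
        )
        return [columns[n] for n in names]


# Column names carry information: here they are dates, so a slice is a date-range query.
timeline = ColumnFamily()
timeline.insert(b"aaron", b"2010-10-18", b"gave a talk", 1)
timeline.insert(b"aaron", b"2010-10-01", b"wrote slides", 2)
print([c.name for c in timeline.get_slice(b"aaron", start=b"2010-10-10")])
# [b'2010-10-18']
```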