Cassandra overview

Outline
● History/motivation
● Semi structured data in Cassandra
○ CFs and SuperCFs
● Architecture of Cassandra system
○ Distribution of content
○ Replication of content
○ Consistency level
○ Node internals
○ Gossip
● Thrift API
● Design patterns - denormalization

History/motivation
● Initially developed by Facebook for Inbox
Search
○ in late 2007/early 2008
● Designed for
○ node failure - commodity hardware
○ scale - can increase number of nodes easily to
accommodate increasing demand
○ fast write access while delivering good read
performance
● Combination of Bigtable and Dynamo
● Was operational at Facebook for over 2 years
○ Later dropped there in favour of HBase

History/motivation
● Released as open source in July 2008
● Apache liked it
○ Became Apache Incubator project in March 2009
○ Became Apache top level project in Feb 2010
● Active project with releases every few
months
○ currently on version 1.1
■ production ready, but still evolving

Why it's interesting (in this
context)...
● Has seen significant growth in last couple of
years
● Enough deployments to be credible
○ e.g. Netflix, Ooyala, Digg, Cisco
● Is scalable and robust enough for big data
problems
○ no single point of failure
● Complex system
○ perhaps excessively complex today

Cassandra - semi
structured data
● Column based database
○ has similarities to standard RDBMS
● Terminology:
○ Keyspace -> database
○ ColumnFamily -> table

Cassandra - semi
structured data
● No specific schema is required
○ although it is possible to define schema
■ can include typing information for parts of
schema to minimize data integrity problems
● Rows can have large numbers of columns
○ limit on number of columns is 2 billion
● Column values should not exceed a few MB
● SuperColumns are columns embedded
within columns
○ third level in a map
○ little discussion of SC here

Cassandra - secondary
indexing
● Columns can be indexed
○ so-called 'secondary indexing'
■ row keys form the primary index
● Some debate about the merits of secondary
indexing in Cassandra
○ secondary indexing is an atomic operation
■ unlike alternative 'manual' indexing approach
○ causes change in thinking regarding NoSQL design
■ very similar to classical RDBMS thinking

Cassandra Architecture
● Typically deployed as a cluster
● All nodes are peers
○ although some nodes are seeds, which should be
more reliable, larger nodes
● Peers have common view of tokenspace
○ tokenspace is a ring
■ of size 2^127
○ peers have responsibility for some part of ring
■ ie some range of tokens within ring
● Row key/keyspace mapped to token
○ used to determine which node is responsible for row
data
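
The ring lookup described above can be sketched in a few lines. This is a minimal illustration, not Cassandra's code: the node names and token values are made up, and it assumes a node owns the range that ends at its own token.

```python
from bisect import bisect_left

RING_SIZE = 2 ** 127  # size of the token space, as stated on the slide


def owner(token, ring):
    """Return the node responsible for `token`.

    `ring` is a list of (node_token, node_name) pairs sorted by token.
    Assumption: a node owns the range that ends at its own token, and
    tokens past the last node wrap around to the first node.
    """
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, token % RING_SIZE)
    return ring[i % len(ring)][1]


# Hypothetical four-node cluster with evenly spaced tokens.
ring = [(i * RING_SIZE // 4, "node%d" % i) for i in range(4)]
```

Every peer that holds the same `ring` list reaches the same answer, which is why any node can route a request.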

Cassandra - Cluster and
Tokenspace

Cassandra - Data
Distribution
● Map from RowKey to token determines data
distribution
● RandomPartitioner is most important map
○ generates MD5 hash of rowkey
○ distributes data evenly over nodes in cluster
○ highly preferred solution
○ constraint that it is not possible to iterate over rows
● OrderedPartitioner
○ generates token based on simple byte mapping of
row key
○ most probably results in uneven distribution of data
○ can be used to iterate over rows
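
The difference between the two partitioners can be illustrated as follows. This is a simplified sketch: the function names are hypothetical, and the way the MD5 digest is reduced to the token range here (modulo 2^127) is an assumption rather than Cassandra's exact rule.

```python
import hashlib


def random_token(row_key: bytes) -> int:
    """MD5-based token in [0, 2**127); a simplification of RandomPartitioner."""
    digest = hashlib.md5(row_key).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)


def ordered_token(row_key: bytes) -> bytes:
    """Order-preserving token: the key bytes themselves."""
    return row_key


keys = [b"user:0001", b"user:0002", b"user:0003"]
by_hash = sorted(keys, key=random_token)    # ring order under hashing
by_bytes = sorted(keys, key=ordered_token)  # ring order under byte mapping
```

Under the byte mapping, neighbouring keys stay neighbours on the ring, so a range scan is possible but similar keys pile up on the same node. Under hashing, ring order no longer reflects key order, which spreads load but rules out iterating over rows in key order.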

Cassandra - Data
Replication
● Multiple levels of replication supported
○ can support arbitrary level of replication
○ replication factors specified per keyspace
● Two replication strategies
○ RackUnaware
■ Makes replicas in next n nodes along token ring
○ RackAware
■ Makes one replica in remote data centre
■ Makes remaining replicas in next nodes along
token ring
● good ring configuration should result in diversity over data
centres
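
The RackUnaware placement can be sketched as a walk around the ring. A minimal sketch under one assumption: the replica count includes the node that owns the token. The function name is hypothetical.

```python
def rack_unaware_replicas(primary_index, node_count, replication_factor):
    """Indices of the nodes holding a row: the owner plus the following
    nodes clockwise around the ring, wrapping at the end.

    Assumption: the owner itself counts as one of the replicas.
    """
    rf = min(replication_factor, node_count)  # cannot exceed cluster size
    return [(primary_index + i) % node_count for i in range(rf)]
```

For example, in a six-node ring with replication factor 3, a row owned by node 4 would be stored on nodes 4, 5 and 0.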

Cassandra - Consistency
Level
● A mechanism to trade off latency with data
consistency
○ Write case:
■ Faster response <-> less sure data written
properly
○ Read case:
■ Faster response <-> less sure most recent data
read
● Related to data replication above
○ replication factor determines meaningful levels for
consistency level

Level - Write
Level         Behavior
ANY           Ensure that the write has been written to at least 1 node, including
              HintedHandoff recipients.
ONE           Ensure that the write has been written to at least 1 replica's commit log
              and memory table before responding to the client.
TWO           Ensure that the write has been written to at least 2 replicas before
              responding to the client.
THREE         Ensure that the write has been written to at least 3 replicas before
              responding to the client.
QUORUM        Ensure that the write has been written to N / 2 + 1 replicas before
              responding to the client.
LOCAL_QUORUM  Ensure that the write has been written to <ReplicationFactor> / 2 + 1
              nodes within the local datacenter (requires NetworkTopologyStrategy).
EACH_QUORUM   Ensure that the write has been written to <ReplicationFactor> / 2 + 1
              nodes in each datacenter (requires NetworkTopologyStrategy).
ALL           Ensure that the write is written to all N replicas before responding to
              the client. Any unresponsive replicas will fail the operation.
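
The quorum arithmetic in the table can be checked with a short sketch. The function names are hypothetical; N is the replication factor, W and R are the numbers of replicas a write and a read wait for. The overlap rule R + W > N is not stated on the slides but is the usual way to reason about which combinations of levels always read the latest write.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority: N / 2 + 1 (integer division)."""
    return replication_factor // 2 + 1


def overlaps(n, w, r):
    """True when every read set shares at least one replica with every
    write set, i.e. R + W > N."""
    return r + w > n
```

With N = 3, QUORUM writes and QUORUM reads (2 + 2 > 3) overlap, while ONE and ONE (1 + 1 > 3 is false) do not.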

Level - Read
Level         Behavior
ANY           Not supported. You probably want ONE instead.
ONE           Will return the record returned by the first replica to respond. A
              consistency check is always done in a background thread to fix any
              consistency issues when ConsistencyLevel.ONE is used. This means
              subsequent calls will have correct data even if the initial read gets
              an older value. (This is called ReadRepair.)
TWO           Will query 2 replicas and return the record with the most recent
              timestamp. Again, the remaining replicas will be checked in the
              background.
THREE         Will query 3 replicas and return the record with the most recent
              timestamp.
QUORUM        Will query all replicas and return the record with the most recent
              timestamp once at least a majority of replicas (N / 2 + 1) have
              reported. Again, the remaining replicas will be checked in the
              background.
LOCAL_QUORUM  Returns the record with the most recent timestamp once a majority of
              replicas within the local datacenter have replied.
EACH_QUORUM   Returns the record with the most recent timestamp once a majority of
              replicas within each datacenter have replied.
ALL           Will query all replicas and return the record with the most recent
              timestamp once all replicas have replied. Any unresponsive replicas
              will fail the operation.
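
The "most recent timestamp wins" rule and the background repair can be sketched as follows. This is a toy model: the data structures and the function name are hypothetical, and the repair runs inline here, whereas the real system does it in the background.

```python
def read(replicas, contacted):
    """Answer a read from the `contacted` replicas, then repair all replicas.

    `replicas` maps replica name -> (timestamp, value). The answer is the
    newest value among the contacted replicas only; afterwards every
    replica is brought up to the newest value known anywhere.
    """
    answer = max((replicas[name] for name in contacted), key=lambda tv: tv[0])
    # Read repair: propagate the newest version to every replica.
    newest = max(replicas.values(), key=lambda tv: tv[0])
    for name in replicas:
        replicas[name] = newest
    return answer[1]
```

A read at ONE that happens to hit a stale replica returns the old value once; the following read returns the repaired one.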

Cassandra - Node Internals
● Node comprises
○ commit log
■ list of pending writes
○ memtable
■ data written to system resident in memory
○ SSTables
■ per CF file containing persistent data
● Memtable is flushed to an SSTable when out of space,
when it holds too many keys or after a time period
● SSTables comprise
○ Data - sorted strings
○ Index, Bloom Filter
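
How the three components interact on a write can be sketched with a toy node. This is an illustration under stated assumptions: the class and its `memtable_limit` parameter are hypothetical, only the "too many keys" flush trigger is modelled, and the index and Bloom filter are left out.

```python
class Node:
    """Toy model of one node's write path."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # writes not yet flushed to an SSTable
        self.memtable = {}     # in-memory row key -> value
        self.sstables = []     # flushed tables, each sorted by row key
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append first
        self.memtable[key] = value            # then update memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a new sorted table and start afresh."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need replaying
```

Note that a write only appends and updates memory; nothing already on disk is read or modified.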

Cassandra - Node Internals
● Compaction occurs from time to time
○ cleans up SSTable
○ removes redundant rows
○ regenerates indexes
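
Compaction can be pictured as merging several sorted tables into one. A minimal sketch, assuming tables are lists of (row key, value) pairs ordered oldest first; the function name is hypothetical, and deletion markers and index regeneration are not modelled.

```python
def compact(sstables):
    """Merge SSTables (oldest first) into one sorted table, keeping only
    the newest value for each row key."""
    merged = {}
    for table in sstables:    # later tables override earlier ones
        merged.update(table)
    return sorted(merged.items())
```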

Cassandra - Behaviour -
Write
● Write properties:
○ No reads
○ No seeks
○ Fast!
○ Atomic within CF
○ Always writable

Cassandra - Behaviour -
Read
● Read Path:
○ Any node
○ Partitioner
○ Wait for R responses
○ Wait for N-R responses in background and perform
read repair
● Read Properties:
○ Read multiple SSTables
○ Slower than writes (but still fast)
○ Seeks can be mitigated with more RAM
○ Scales to billions of rows

Cassandra - Gossip
● Gossip protocol used to relay information
between nodes in cluster
● Proactive communications mechanism to
share information
○ nodes proactively share what they know with
random other nodes
● Token space information exchanged via
gossip
● Failure detection based on gossip
○ heartbeat mechanism

Thrift API - basic calls
● insert(key, column_parent, column,
consistency_level)
○ key is row/keyspace identifier
○ column_parent identifies where the column lives
■ column family name, optionally with a super column identifier
○ column is column data
● get(key, column_path, consistency_level)
○ returns a column corresponding to the key
● get_slice(key, column_parent,
slice_predicate, consistency_level)
○ typically returns set of columns corresponding to key
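
The shape of these three calls can be tried out against an in-memory stand-in. This is not the Thrift client: `ToyStore` is hypothetical, and it simplifies the arguments by assuming column_parent is a plain column family name, column is a (name, value) pair, column_path is a (column family, name) pair, and slice_predicate is an inclusive (start, finish) range of column names. The consistency level is accepted but ignored.

```python
from collections import defaultdict


class ToyStore:
    """In-memory stand-in that mimics the argument order of the calls above."""

    def __init__(self):
        # column family -> row key -> {column name: value}
        self.data = defaultdict(lambda: defaultdict(dict))

    def insert(self, key, column_parent, column, consistency_level="ONE"):
        name, value = column
        self.data[column_parent][key][name] = value

    def get(self, key, column_path, consistency_level="ONE"):
        column_family, name = column_path
        return self.data[column_family][key][name]

    def get_slice(self, key, column_parent, slice_predicate,
                  consistency_level="ONE"):
        start, finish = slice_predicate
        row = self.data[column_parent][key]
        # Columns come back sorted by name, restricted to the range.
        return [(n, row[n]) for n in sorted(row) if start <= n <= finish]
```

Usage: `insert` twice on the same row key adds two columns to one row, `get` fetches one of them, and `get_slice` returns the columns whose names fall in the requested range.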

Thrift API - other
operations
● get multiple rows
● delete row
● batch operations
○ important for speeding up system
○ can batch up mix of add, insert and delete
operations
● keyspace and cluster management

Denormalization
● Cassandra requires query oriented design
○ determine queries first, design data models
accordingly
○ in contrast to standard RDBMS
■ normalize data at design time
■ construct arbitrary queries usually based on joins
● Quite fundamental difference in approach
○ typically results in quite different data models
● Common use of valueless columns
○ column name contains data
■ good for time series data
○ can have very many columns in given row
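
The valueless-column pattern can be sketched as a row whose sorted column names are the data. A minimal sketch: the `WideRow` class is hypothetical and uses integer timestamps as column names.

```python
from bisect import bisect_left, bisect_right


class WideRow:
    """One row whose column names are kept sorted and carry the data;
    the column values are left empty."""

    def __init__(self):
        self.names = []

    def add(self, name):
        i = bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            self.names.insert(i, name)  # keep names sorted, no duplicates

    def slice(self, start, finish):
        """Column names in the inclusive range [start, finish]."""
        lo = bisect_left(self.names, start)
        hi = bisect_right(self.names, finish)
        return self.names[lo:hi]
```

Because the names are kept in order, "all events between two timestamps" becomes a contiguous slice of one row.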

Denormalization
● Standard SQL
○ SELECT * FROM USER WHERE CITY = 'Dublin'
● Typically create CF which groups users by
city
○ row key is city identifier
○ columns are user IDs
● Can get UID of all users in given city by
querying this CF
○ give city as row-key
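
The users-by-city layout can be sketched with two dictionaries standing in for two column families. The names `users`, `users_by_city`, `add_user` and `users_in` are hypothetical; the point is that the application writes to both structures so that the query becomes a single row lookup.

```python
users = {}          # "Users" CF: user ID -> {column name: value}
users_by_city = {}  # "UsersByCity" CF: city -> {user ID: ""} (valueless columns)


def add_user(uid, name, city):
    """Write the user row and the denormalized city row together."""
    users[uid] = {"name": name, "city": city}
    users_by_city.setdefault(city, {})[uid] = ""


def users_in(city):
    """Equivalent of SELECT * FROM USER WHERE CITY = ...: one row lookup."""
    return sorted(users_by_city.get(city, {}))


add_user("u1", "Ann", "Dublin")
add_user("u2", "Bob", "Cork")
add_user("u3", "Cat", "Dublin")
```

The cost of this design is the extra write, and the need to update both structures if a user changes city.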

Other considerations...
● SuperColumnFamily
○ when is it useful?
● Multi data centre deployments
○ Cassandra can leverage topology to maximize
resiliency
● Reaction to node failure
● Reconfiguration of system
○ introduction of new nodes into existing system

● It is a complex system with many working
parts
