2. A few dates
• Created by Facebook (around 2007)
• Open-sourced in 2008
• Became an ASF incubator project in January 2009
• Graduated as a top-level ASF project in February 2010
• 3 major releases (current is 0.6)
• In production at multiple companies (Facebook, Digg, Twitter, Reddit,
Rackspace, etc.), with the largest cluster spanning over 150 machines
3. Cassandra = Dynamo + Big Table
• From Dynamo: the distribution model
• From Big Table: the data model and the storage architecture
4. Why Cassandra?
• Fully distributed (client can connect to any node)
• No single point of failure
• Incremental scalability
• Richer data model than simple key-value
• Data center aware
• Fast reads, faster writes (“optimize for reads, writes are cheap”)
• Always writable
• Eventually consistent
5. Eventual Consistency
• Isn’t consistency important?
• Yes. Eventual consistency is:
• Not: “Let’s not be consistent”
• But rather: “Instead of designing (costly) measures to prevent inconsistency,
we acknowledge that the cluster may be in an inconsistent state for a brief
period of time, and we deal with it”
• Moreover, Cassandra allows the client to choose a trade-off between
consistency and latency
7. Data model
• A distributed multi-level hash map:
• Keyspace (one per application)
• Column Family (a few per keyspace)
• row key (as many as you want)
• column key → value (millions is ok, but limited by node capacity)
• With super columns, a row instead maps:
• super column key (in the current implementation, not too many)
• column key → value
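The multi-level map above can be sketched as nested Python dicts. This is an illustration of the model only, not the Cassandra API; keyspace, column-family, and key names are made up.

```python
# Cassandra's data model as nested dicts (illustrative only):
# keyspace -> column family -> row key -> (super) column key -> value.
keyspace = {                      # one keyspace per application
    "Users": {                    # a column family (a few per keyspace)
        "alice": {                # a row key (as many as you want)
            "email": "alice@example.com",  # column key -> value
            "city": "Paris",
        },
    },
    "UserEvents": {               # a column family using super columns
        "alice": {
            "2010-04-01": {       # super column key (not too many per row)
                "login": "09:12", # column key -> value
                "logout": "17:45",
            },
        },
    },
}

print(keyspace["Users"]["alice"]["email"])  # → alice@example.com
```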
13. Comparison
• Columns and super columns are sorted.
• This sorting is customizable and defined per column family.
• The predefined sort orders are:
• BytesType
• LongType
• AsciiType
• UTF8Type
• LexicalUUIDType
• TimeUUIDType
14. API
• Writes:
• insert() : insert/update a single column
• remove() : remove a column/super column/row
• batch_mutate() : update/remove multiple columns
• Reads:
• get() : retrieve a single column
• get_slice() : retrieve a group of columns (by names or range)
• get_range_slices() : retrieve a set of slices for a range of (row) keys
• count() : count the number of columns in a row
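Since columns in a row are kept sorted, a range slice reduces to a binary search over the sorted column keys. Here is a minimal in-memory sketch of `get_slice`-by-range semantics, not the real Thrift client; the row contents are made up.

```python
from bisect import bisect_left, bisect_right

# A row's columns, kept sorted by column key (as Cassandra stores them).
row = {"age": "30", "city": "Paris", "email": "a@b.c", "name": "alice"}
keys = sorted(row)

def get_slice(start, finish):
    """Return the columns whose keys fall in [start, finish] (range slice)."""
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, finish)
    return {k: row[k] for k in keys[lo:hi]}

print(get_slice("age", "city"))  # → {'age': '30', 'city': 'Paris'}
```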
16. Ring (Consistent Hashing)
• Data distribution:
• take a hash function (md5) and place each
node on the domain of this hash
• each node is “responsible” for the keys
that fall between its position and the
preceding node’s
• to know where to store a column, use the
node responsible for md5(row key)
• Data replication (here RF = 3):
• the cluster has a replication factor (RF)
• replicas are placed on the preceding nodes
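The placement rule above can be sketched with `hashlib` and a sorted ring. Node names and the replication factor are illustrative; following the slide, replicas go on the nodes preceding the owner on the ring.

```python
import hashlib
from bisect import bisect_left

NODES = ["node-a", "node-b", "node-c", "node-d"]
RF = 3  # replication factor (illustrative)

def ring_pos(s):
    """Position of a string on the md5 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Place every node on the ring, sorted by position.
ring = sorted((ring_pos(n), n) for n in NODES)

def replicas(row_key, rf=RF):
    """Owner of md5(row_key), plus rf-1 preceding nodes on the ring."""
    i = bisect_left(ring, (ring_pos(row_key), ""))
    return [ring[(i - k) % len(ring)][1] for k in range(rf)]

print(replicas("alice"))  # three distinct nodes, always the same for this key
```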
20. Writing - Cluster side
insert( 42 )
[figure: the new value 42 progressively replaces the old value 24 on the replicas]
• Consistency Level: how many
nodes must respond for
success?
• CL.ZERO : none
• CL.ONE : one
• CL.QUORUM : one more
than half the replicas
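The number of acknowledgements each level waits for can be written down directly; a small sketch, using the three levels from the slide and a replication factor parameter:

```python
# Acknowledgements required for success at each consistency level,
# given the replication factor (sketch of the slide's three levels).
def required_acks(consistency_level, rf):
    if consistency_level == "ZERO":
        return 0              # fire and forget
    if consistency_level == "ONE":
        return 1              # a single replica must acknowledge
    if consistency_level == "QUORUM":
        return rf // 2 + 1    # one more than half the replicas
    raise ValueError(consistency_level)

print(required_acks("QUORUM", 3))  # → 2
```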
29. Reading - Cluster side
get( )
[figure: the replicas answer with their version of the value (42)]
• Consistency Level: how many
nodes must respond for
success?
• CL.ONE : one
• CL.QUORUM : one more
than half the replicas
• If the values differ, the one with
the greater timestamp is returned
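The timestamp-resolution rule is simple enough to state as code; a sketch, with replica replies modeled as (value, timestamp) pairs:

```python
# When replicas disagree, the value with the greatest timestamp wins.
def resolve(versions):
    """versions: list of (value, timestamp) pairs; return the newest value."""
    return max(versions, key=lambda vt: vt[1])[0]

# Two replicas still hold the old value 24; one has the newer 42.
replies = [("24", 100), ("42", 250), ("24", 100)]
print(resolve(replies))  # → 42
```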
34. Failure and Consistency
To repair inconsistencies when they occur:
1. Hinted Handoff: when a node is down, insertions are sent to another
machine, which replays them to the node once it comes back alive.
2. Read Repair: on reads, if the values differ, the out-of-sync nodes are repaired
by inserting the newer value.
3. Anti Entropy: compare the versions held by two nodes using Merkle trees (a manual
operation).
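Hinted handoff (mechanism 1) can be sketched in a few lines; this models the hand-off and replay only, with made-up node names and a dict of live nodes standing in for the cluster:

```python
# Hinted-handoff sketch: a write aimed at a dead node is parked as a "hint"
# on a live node, then replayed when the dead node comes back.
hints = []  # (target_node, write) pairs parked while the target is down

def write_to(node, write, alive):
    if node in alive:
        alive[node].append(write)     # normal delivery
    else:
        hints.append((node, write))   # target is down: park a hint

def node_back_up(node, alive):
    alive[node] = alive.get(node, [])
    remaining = []
    for target, write in hints:
        if target == node:
            alive[node].append(write) # replay the hint to the revived node
        else:
            remaining.append((target, write))
    hints[:] = remaining

cluster = {"node-a": []}              # node-b is currently down
write_to("node-b", ("alice", "city", "Lyon"), cluster)
node_back_up("node-b", cluster)
print(cluster["node-b"])  # → [('alice', 'city', 'Lyon')]
```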
36. Write Path
1. write to the commit log (for durability)
2. write to the memtable (the write is
acknowledged to the client)
3. if the memtable reaches a threshold,
flush it to disk as an SSTable
4. Remark: a deletion amounts to the
insertion of a “tombstone”
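The four steps above can be sketched as follows; a toy model under stated assumptions (plain lists and dicts stand in for the durable commit log, the memtable, and the on-disk SSTables, and the flush threshold is made up):

```python
# Write-path sketch: commit log -> memtable -> flush to an immutable SSTable.
commit_log = []   # sequential and durable in the real system
memtable = {}     # in-memory map: (row key, column) -> (value, timestamp)
sstables = []     # immutable "on-disk" tables (here: plain dicts)
THRESHOLD = 3     # illustrative flush threshold

def insert(row_key, column, value, timestamp):
    commit_log.append((row_key, column, value, timestamp))  # 1. commit log
    memtable[(row_key, column)] = (value, timestamp)        # 2. memtable (ack here)
    if len(memtable) >= THRESHOLD:                          # 3. flush when full
        sstables.append(dict(memtable))
        memtable.clear()

def remove(row_key, column, timestamp):
    insert(row_key, column, "TOMBSTONE", timestamp)         # 4. delete = tombstone

insert("alice", "city", "Paris", 1)
insert("alice", "email", "a@b.c", 2)
insert("alice", "name", "Alice", 3)   # memtable hits the threshold: flush
print(len(sstables), memtable)        # → 1 {}
```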
37. Read Path
• Versions of the same column can live at
the same time:
• in the memtable
• in memtables being flushed
• in one or multiple SSTables
• We need to read all versions and
resolve them using their timestamps
• But:
• bloom filters allow skipping
unnecessary files
• SSTables are indexed
• compaction keeps things
reasonable
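A sketch of this read path, under stated assumptions: SSTables are plain dicts, and an exact per-SSTable key set stands in for the bloom filter (a real bloom filter only says a key is *probably* present, which is enough to skip files that definitely lack it).

```python
# Read-path sketch: gather every version of a column, newest timestamp wins.
memtable = {("alice", "city"): ("Lyon", 3)}
sstables = [
    {("alice", "city"): ("Paris", 1), ("alice", "email"): ("a@b.c", 1)},
]
filters = [set(t) for t in sstables]  # stand-in for per-SSTable bloom filters

def get(row_key, column):
    versions = []
    if (row_key, column) in memtable:
        versions.append(memtable[(row_key, column)])
    for flt, table in zip(filters, sstables):
        if (row_key, column) not in flt:   # "bloom filter" says: skip this file
            continue
        versions.append(table[(row_key, column)])
    return max(versions, key=lambda vt: vt[1])[0] if versions else None

print(get("alice", "city"))  # → Lyon (the memtable version is newer)
```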
38. Compaction
• Runs regularly as a background operation
• Merges SSTables together
• Gets rid of old and deleted values
But...
• Temporarily requires extra disk space
• As of today, needs to deserialize each row entirely
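The merge itself can be sketched in a few lines; a toy model with SSTables as dicts of (value, timestamp) pairs, keeping only the newest version of each column and dropping tombstones:

```python
# Compaction sketch: merge SSTables, newest timestamp wins, drop tombstones.
TOMBSTONE = "TOMBSTONE"

def compact(tables):
    merged = {}
    for table in tables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)   # keep the newest version only
    # Deleted columns (tombstones) are dropped from the merged table.
    return {k: v for k, v in merged.items() if v[0] != TOMBSTONE}

old = {("alice", "city"): ("Paris", 1), ("bob", "city"): ("Nice", 1)}
new = {("alice", "city"): (TOMBSTONE, 2)}   # alice's city was deleted later
print(compact([old, new]))  # → {('bob', 'city'): ('Nice', 1)}
```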