2. A few dates
• Created by Facebook (around 2007)
• Open-sourced in 2008
• Became an ASF incubator project in January 2009
• Graduated as a top-level ASF project in February 2010
• 3 major releases (current is 0.6)
• In production at multiple companies (Facebook, Digg, Twitter, Reddit,
Rackspace, etc.), with the largest cluster spanning over 150 machines
3. Cassandra = Dynamo + Big Table
• From Dynamo: the distribution model
• From Big Table: the data model and the storage architecture
4. Why Cassandra?
• Fully distributed (client can connect to any node)
• No single point of failure
• Incremental scalability
• Richer data model than simple key-value
• Data center aware
• Fast reads, faster writes (“optimize for reads, writes are cheap”)
• Always writable
• Eventually consistent
5. Eventual Consistency
• Isn’t consistency important?
• Yes. Eventual consistency is:
• Not: “Let’s not be consistent”
• But rather: “Instead of designing (costly) measures to prevent inconsistency,
we acknowledge that the cluster may be in an inconsistent state for a brief
period of time, and we deal with it”
• Moreover, Cassandra allows the client to choose a trade-off between
consistency and latency
7. Data model
• A distributed multi-level hash map:
• Keyspace (one per application)
• Column Family (a few per keyspace)
• row key (as many as you want)
• column key → value (millions is ok, but limited by node capacity)
• With super columns, a row instead maps:
• super column key (in the current implementation, not too many)
• column key → value
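The multi-level map above can be sketched as nested Python dicts. This is an illustration of the model only, not the Cassandra API; keyspace, column-family, and key names are made up.

```python
# Cassandra's data model as nested dicts (illustrative only):
# keyspace -> column family -> row key -> (super) column key -> value.
keyspace = {                      # one keyspace per application
    "Users": {                    # a column family (a few per keyspace)
        "alice": {                # a row key (as many as you want)
            "email": "alice@example.com",  # column key -> value
            "city": "Paris",
        },
    },
    "UserEvents": {               # a column family using super columns
        "alice": {
            "2010-04-01": {       # super column key (not too many per row)
                "login": "09:12", # column key -> value
                "logout": "17:45",
            },
        },
    },
}

print(keyspace["Users"]["alice"]["email"])  # → alice@example.com
```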
13. Comparison
• Columns and super columns are sorted.
• This sorting is customizable and defined per column family.
• The predefined sort orders are:
• BytesType
• LongType
• AsciiType
• UTF8Type
• LexicalUUIDType
• TimeUUIDType
14. API
• Writes:
• insert() : insert/update a single column
• remove() : remove a column/super column/row
• batch_mutate() : update/remove multiple columns
• Reads:
• get() : retrieve a single column
• get_slice() : retrieve a group of columns (by names or range)
• get_range_slices() : retrieve a set of slices for a range of (row) keys
• count() : count the number of columns in a row
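Since columns in a row are kept sorted, a range slice reduces to a binary search over the sorted column keys. Here is a minimal in-memory sketch of `get_slice`-by-range semantics, not the real Thrift client; the row contents are made up.

```python
from bisect import bisect_left, bisect_right

# A row's columns, kept sorted by column key (as Cassandra stores them).
row = {"age": "30", "city": "Paris", "email": "a@b.c", "name": "alice"}
keys = sorted(row)

def get_slice(start, finish):
    """Return the columns whose keys fall in [start, finish] (range slice)."""
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, finish)
    return {k: row[k] for k in keys[lo:hi]}

print(get_slice("age", "city"))  # → {'age': '30', 'city': 'Paris'}
```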
16. Ring (Consistent Hashing)
• Data distribution:
• take a hash function (md5) and place each
node on the domain of this hash
• each node is “responsible” for the keys
that fall between its position and the
preceding node’s
• to know where to store a column, use the
node responsible for md5(row key)
• Data replication (here RF = 3):
• the cluster has a replication factor (RF)
• replicas are placed on the preceding nodes
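The placement rule above can be sketched with `hashlib` and a sorted ring. Node names and the replication factor are illustrative; following the slide, replicas go on the nodes preceding the owner on the ring.

```python
import hashlib
from bisect import bisect_left

NODES = ["node-a", "node-b", "node-c", "node-d"]
RF = 3  # replication factor (illustrative)

def ring_pos(s):
    """Position of a string on the md5 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Place every node on the ring, sorted by position.
ring = sorted((ring_pos(n), n) for n in NODES)

def replicas(row_key, rf=RF):
    """Owner of md5(row_key), plus rf-1 preceding nodes on the ring."""
    i = bisect_left(ring, (ring_pos(row_key), ""))
    return [ring[(i - k) % len(ring)][1] for k in range(rf)]

print(replicas("alice"))  # three distinct nodes, always the same for this key
```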
20. Writing - Cluster side
insert( 42 )
[figure: the new value 42 progressively replaces the old value 24 on the replicas]
• Consistency Level: how many
nodes must respond for
success?
• CL.ZERO : none
• CL.ONE : one
• CL.QUORUM : one more
than half the replicas
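The number of acknowledgements each level waits for can be written down directly; a small sketch, using the three levels from the slide and a replication factor parameter:

```python
# Acknowledgements required for success at each consistency level,
# given the replication factor (sketch of the slide's three levels).
def required_acks(consistency_level, rf):
    if consistency_level == "ZERO":
        return 0              # fire and forget
    if consistency_level == "ONE":
        return 1              # a single replica must acknowledge
    if consistency_level == "QUORUM":
        return rf // 2 + 1    # one more than half the replicas
    raise ValueError(consistency_level)

print(required_acks("QUORUM", 3))  # → 2
```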
29. Reading - Cluster side
get( )
[figure: the replicas answer with their version of the value (42)]
• Consistency Level: how many
nodes must respond for
success?
• CL.ONE : one
• CL.QUORUM : one more
than half the replicas
• If the values differ, the one with
the greater timestamp is returned
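The timestamp-resolution rule is simple enough to state as code; a sketch, with replica replies modeled as (value, timestamp) pairs:

```python
# When replicas disagree, the value with the greatest timestamp wins.
def resolve(versions):
    """versions: list of (value, timestamp) pairs; return the newest value."""
    return max(versions, key=lambda vt: vt[1])[0]

# Two replicas still hold the old value 24; one has the newer 42.
replies = [("24", 100), ("42", 250), ("24", 100)]
print(resolve(replies))  # → 42
```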
34. Failure and Consistency
To repair inconsistencies when they occur:
1. Hinted Handoff: when a node is down, insertions are sent to another
machine, which replays them to the node once it comes back alive.
2. Read Repair: on reads, if the values differ, the out-of-sync nodes are repaired
by inserting the newer value.
3. Anti Entropy: compare the versions held by two nodes using Merkle trees (a manual
operation).
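Hinted handoff (mechanism 1) can be sketched in a few lines; this models the hand-off and replay only, with made-up node names and a dict of live nodes standing in for the cluster:

```python
# Hinted-handoff sketch: a write aimed at a dead node is parked as a "hint"
# on a live node, then replayed when the dead node comes back.
hints = []  # (target_node, write) pairs parked while the target is down

def write_to(node, write, alive):
    if node in alive:
        alive[node].append(write)     # normal delivery
    else:
        hints.append((node, write))   # target is down: park a hint

def node_back_up(node, alive):
    alive[node] = alive.get(node, [])
    remaining = []
    for target, write in hints:
        if target == node:
            alive[node].append(write) # replay the hint to the revived node
        else:
            remaining.append((target, write))
    hints[:] = remaining

cluster = {"node-a": []}              # node-b is currently down
write_to("node-b", ("alice", "city", "Lyon"), cluster)
node_back_up("node-b", cluster)
print(cluster["node-b"])  # → [('alice', 'city', 'Lyon')]
```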
36. Write Path
1. write to the commit log (for durability)
2. write to the memtable (the write is
acknowledged to the client)
3. if the memtable reaches a threshold,
flush it to disk as an SSTable
4. Remark: a deletion amounts to the
insertion of a “tombstone”
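The four steps above can be sketched as follows; a toy model under stated assumptions (plain lists and dicts stand in for the durable commit log, the memtable, and the on-disk SSTables, and the flush threshold is made up):

```python
# Write-path sketch: commit log -> memtable -> flush to an immutable SSTable.
commit_log = []   # sequential and durable in the real system
memtable = {}     # in-memory map: (row key, column) -> (value, timestamp)
sstables = []     # immutable "on-disk" tables (here: plain dicts)
THRESHOLD = 3     # illustrative flush threshold

def insert(row_key, column, value, timestamp):
    commit_log.append((row_key, column, value, timestamp))  # 1. commit log
    memtable[(row_key, column)] = (value, timestamp)        # 2. memtable (ack here)
    if len(memtable) >= THRESHOLD:                          # 3. flush when full
        sstables.append(dict(memtable))
        memtable.clear()

def remove(row_key, column, timestamp):
    insert(row_key, column, "TOMBSTONE", timestamp)         # 4. delete = tombstone

insert("alice", "city", "Paris", 1)
insert("alice", "email", "a@b.c", 2)
insert("alice", "name", "Alice", 3)   # memtable hits the threshold: flush
print(len(sstables), memtable)        # → 1 {}
```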
37. Read Path
• Versions of the same column can live at
the same time:
• in the memtable
• in memtables being flushed
• in one or multiple SSTables
• We need to read all versions and
resolve them using their timestamps
• But:
• bloom filters allow skipping
unnecessary files
• SSTables are indexed
• compaction keeps things
reasonable
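A sketch of this read path, under stated assumptions: SSTables are plain dicts, and an exact per-SSTable key set stands in for the bloom filter (a real bloom filter only says a key is *probably* present, which is enough to skip files that definitely lack it).

```python
# Read-path sketch: gather every version of a column, newest timestamp wins.
memtable = {("alice", "city"): ("Lyon", 3)}
sstables = [
    {("alice", "city"): ("Paris", 1), ("alice", "email"): ("a@b.c", 1)},
]
filters = [set(t) for t in sstables]  # stand-in for per-SSTable bloom filters

def get(row_key, column):
    versions = []
    if (row_key, column) in memtable:
        versions.append(memtable[(row_key, column)])
    for flt, table in zip(filters, sstables):
        if (row_key, column) not in flt:   # "bloom filter" says: skip this file
            continue
        versions.append(table[(row_key, column)])
    return max(versions, key=lambda vt: vt[1])[0] if versions else None

print(get("alice", "city"))  # → Lyon (the memtable version is newer)
```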
38. Compaction
• Runs regularly as a background operation
• Merges SSTables together
• Gets rid of old and deleted values
But...
• Temporarily requires extra disk space
• As of today, needs to deserialize each row entirely
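The merge itself can be sketched in a few lines; a toy model with SSTables as dicts of (value, timestamp) pairs, keeping only the newest version of each column and dropping tombstones:

```python
# Compaction sketch: merge SSTables, newest timestamp wins, drop tombstones.
TOMBSTONE = "TOMBSTONE"

def compact(tables):
    merged = {}
    for table in tables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)   # keep the newest version only
    # Deleted columns (tombstones) are dropped from the merged table.
    return {k: v for k, v in merged.items() if v[0] != TOMBSTONE}

old = {("alice", "city"): ("Paris", 1), ("bob", "city"): ("Nice", 1)}
new = {("alice", "city"): (TOMBSTONE, 2)}   # alice's city was deleted later
print(compact([old, new]))  # → {('bob', 'city'): ('Nice', 1)}
```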