2. What is Cassandra?
● A distributed storage system with a flexible schema and
high write throughput
● Developed at Facebook; later open-sourced and handed over to Apache
● At its core, Cassandra borrows from both:
○ Amazon's Dynamo Infrastructure
○ Google's BigTable Data Model
4. Cassandra's Data Model
● Rows (keyspace)
● Column Families
● Columns and Super Columns
○ User can specify sorting by name or timestamp
Column: Byte[] Name, Byte[] Value, Int64 Timestamp
SuperColumn: Byte[] Name, List<Column> Columns

Rows of Columns:
KeyA  ColumnA  ColumnB  ColumnC
KeyB  ColumnX  ColumnY  ColumnZ

Rows of SuperColumns:
KeyA  SuperColumnI  SuperColumnJ
KeyB  SuperColumnM  SuperColumnN
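The Column and SuperColumn layouts from the diagram can be sketched as plain Python types. This is a minimal illustration, not Cassandra's actual implementation: the `sort_columns` helper, the sample values, and the timestamps are hypothetical and only show the "sort by name or timestamp" option.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: bytes       # Byte[] Name
    value: bytes      # Byte[] Value
    timestamp: int    # Int64 Timestamp


@dataclass
class SuperColumn:
    name: bytes                                          # Byte[] Name
    columns: List[Column] = field(default_factory=list)  # List<Column> Columns


def sort_columns(columns, by="name"):
    """Hypothetical helper: order columns by name or by timestamp."""
    if by not in ("name", "timestamp"):
        raise ValueError("by must be 'name' or 'timestamp'")
    return sorted(columns, key=lambda c: getattr(c, by))


# Illustrative row; the values and timestamps are made up.
row = [
    Column(b"ColumnA", b"a", 3),
    Column(b"ColumnB", b"b", 1),
    Column(b"ColumnC", b"c", 2),
]
by_name = [c.name for c in sort_columns(row, by="name")]
by_time = [c.name for c in sort_columns(row, by="timestamp")]

# A super column simply groups columns under one more name.
super_column = SuperColumn(b"SuperColumnI", sort_columns(row, by="timestamp"))
```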
5. Cassandra's Data Model (in JSON)
● Key > Column Family > Column
{
"keyA":{
"Users":{
"emailAddress":{"timestamp":"1", "value":"foo@bar.com"},
"webSite":{"timestamp":"4", "value":"http://bar.com"}
},
"Stats":{
"visits":{"timestamp":"3", "value":"243"}
}
},
"keyB":{
"Users":{
"emailAddress":{"timestamp":"1", "value":"user2@bar.com"},
"twitter":{"timestamp":"4", "value":"user2"}
}
}
}
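The Key > Column Family > Column path above amounts to three nested lookups. A minimal sketch over the same sample data, assuming a plain in-memory dictionary rather than a real Cassandra client (`get_column` is a hypothetical helper):

```python
import json

# The sample rows from the slide, loaded as nested dictionaries.
DATA = json.loads("""
{
  "keyA": {
    "Users": {
      "emailAddress": {"timestamp": "1", "value": "foo@bar.com"},
      "webSite": {"timestamp": "4", "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"timestamp": "3", "value": "243"}
    }
  },
  "keyB": {
    "Users": {
      "emailAddress": {"timestamp": "1", "value": "user2@bar.com"},
      "twitter": {"timestamp": "4", "value": "user2"}
    }
  }
}
""")


def get_column(data, key, column_family, column):
    """Follow Key > Column Family > Column; return the value, or None if absent."""
    cell = data.get(key, {}).get(column_family, {}).get(column)
    return None if cell is None else cell["value"]


email = get_column(DATA, "keyA", "Users", "emailAddress")
# Rows need not share columns: keyB has no "webSite" column.
missing = get_column(DATA, "keyB", "Users", "webSite")
```

The second lookup returns `None` because each row carries only the columns that were written to it, which is what the flexible schema means in practice.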
6. Cassandra's Data Model (in JSON)
● Key > Column Family > Super Column > Column
{
"KeyA": {
"Tags": {
"cassandra": {
"incubator": {"value": "http://incubator.apache.org/cassandra/"},
"jira": {"value": "http://issues.apache.org/jira/browse/CASSANDRA"}
},
"thrift": {
"jira": {"value": "http://issues.apache.org/jira/browse/THRIFT"}
}
}
}
}
}
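With super columns the path gains one level: Key > Column Family > Super Column > Column. The sketch below walks that nesting and lists every column path. It is an in-memory illustration only; `column_paths` is a hypothetical helper and the column payloads are left empty because only the nesting matters here.

```python
def column_paths(data):
    """Yield (key, column_family, super_column, column) for every column."""
    for key, families in data.items():
        for family, super_columns in families.items():
            for super_column, columns in super_columns.items():
                for column in columns:
                    yield (key, family, super_column, column)


# Same shape as the slide's example; payloads omitted for brevity.
DATA = {
    "KeyA": {
        "Tags": {
            "cassandra": {"incubator": {}, "jira": {}},
            "thrift": {"jira": {}},
        }
    }
}

paths = list(column_paths(DATA))
```

Note that the column name "jira" appears under both super columns without conflict, since each super column is its own namespace.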
7. Differences from Dynamo
● Partitioning
○ Dynamo assigns each host a number of virtual nodes on the
hash ring according to the capacity of the host
○ Cassandra analyzes load information on the hash ring and
moves lightly loaded nodes along the ring to relieve
heavily loaded ones
● Replication (Cassandra's replica placement policies)
○ "Rack Unaware"
○ "Rack Aware"
○ "Datacenter Aware"
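A hash ring with the simplest placement policy can be sketched as follows. This is a toy model under stated assumptions, not Cassandra's code: the `Ring` class and `token` function are hypothetical, MD5 is used only to spread names over the ring, and "Rack Unaware" is read here as "the coordinator plus its next N-1 successors on the ring". Load-based node movement and the rack and datacenter policies are not modeled.

```python
import bisect
import hashlib


def token(name: str) -> int:
    """Map a string to a ring position (MD5 chosen only for illustration)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    """Toy hash ring: one position per host, no virtual nodes."""

    def __init__(self, nodes):
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str, n: int):
        """Return the coordinator for `key` plus its n-1 ring successors."""
        size = len(self.entries)
        # First node at or after the key's position, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token(key)) % size
        count = min(n, size)
        return [self.entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4"])
owners = ring.replicas("keyA", 3)  # three distinct nodes, coordinator first
```

A rack-aware or datacenter-aware policy would keep the same coordinator but choose the remaining replicas by topology instead of ring order.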
8. Differences from Dynamo
● Failure Detection
○ Dynamo uses a gossip-based protocol for membership
changes; a node is assumed failed if it does not respond
○ Cassandra uses the same gossip-based protocol but uses
a φ (phi) Accrual Failure Detector
■ Does not emit a boolean up or down
■ Emits a value which represents a suspicion level
■ The suspicion level adjusts dynamically to network and
load conditions, as observed through gossip messages
■ A sliding window of gossip message arrival times is
kept for each node
■ A statistical distribution is fitted to that window and
used to compute φ
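The accrual idea can be sketched in a few lines. This is a simplified illustration, not Cassandra's implementation: the class name and window size are hypothetical, and the exponential distribution of inter-arrival times is an assumption made here (the slides only say a statistical distribution model is created). Under that assumption, φ = -log10(P(next heartbeat arrives later than the elapsed time)) = elapsed / (mean × ln 10).

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy phi accrual failure detector for a single monitored node."""

    def __init__(self, window_size=1000):
        # Sliding window of heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat (for example, a gossip message) received at `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Assumed exponential model: P(later) = exp(-elapsed / mean),
        # so phi = -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in range(10):             # heartbeats at t = 0, 1, ..., 9
    detector.heartbeat(float(t))

phi_soon = detector.phi(9.5)    # half an interval of silence: low suspicion
phi_late = detector.phi(30.0)   # 21 intervals of silence: high suspicion
```

Instead of a boolean, the caller compares φ against a threshold of its own choosing; a higher threshold trades slower detection for fewer false suspicions.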
9. Differences from BigTable
● Data Model
○ BigTable stores <K,V> pairs in SSTables by Column
Family with historical versions
○ Cassandra drops historical versions and adds the super
column concept
● Storage
○ BigTable uses the Google File System (GFS)
○ Cassandra uses the local file system