Cassandra FTW

Andrew Byde
Principal Scientist
Menu

• Introduction
• Data model + storage architecture
• Partitioning + replication
• Consistency
• De-normalisation
History + design
History

• 2007: Started at Facebook for inbox search
• July 2008: Open sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: Version 0.8
What it’s good for

• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
What it’s not so good for

• Transactions
• Read-heavy workloads
• Low-latency applications
  • compared to in-memory DBs
Data model
Keyspaces and Column Families

  SQL          Cassandra
  Database     Keyspace
  Table        Column Family

[diagram: rows of (row key, columns) grouped into a column family inside a keyspace]
Column Family

rowkey: {
  column: value,
  column: value,
  ...
}

...every value is timestamped
Super Column Family

rowkey: {
  supercol: {
    column: value,
    column: value,
    ...
  },
  supercol: {
    column: value,
    column: value,
    ...
  }
}
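As a rough mental model (not Cassandra's actual data structures), the two layouts above are nested, sorted maps whose leaf values each carry a timestamp. A minimal Python sketch with invented row and column names:

    # Illustrative only: the column family and super column family models viewed
    # as nested maps in which every stored value carries a timestamp.
    import time

    # Column family: row key -> {column name -> (value, timestamp)}
    column_family = {
        "user3": {
            "name":  ("Charlie", time.time()),
            "email": ("c@example.com", time.time()),
        },
    }

    # Super column family: row key -> {super column -> {column -> (value, timestamp)}}
    super_column_family = {
        "user3": {
            "m1": {"sender": ("user1", time.time()), "subject": ("A rhyme", time.time())},
            "m2": {"sender": ("user1", time.time()), "subject": ("colours", time.time())},
        },
    }

    # Columns within a row are kept sorted by column name.
    print(sorted(column_family["user3"]))           # ['email', 'name']
    print(sorted(super_column_family["user3"]))     # ['m1', 'm2']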
Rows and columns
       col1   col2   col3   col4   col5   col6   col7
row1           x                    x      x
row2    x      x      x      x      x
row3           x      x             x      x      x
row4           x      x      x             x
row5           x             x      x      x
row6           x
row7    x      x             x
Reads

• get                -- one column of one row
• get_slice          -- one row, some columns
  • name predicate
  • slice range
• multiget_slice     -- multiple rows
• get_range_slices   -- a range of rows
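To make the operations concrete, here is an illustrative Python sketch over a plain in-memory map standing in for a column family -- not the real Thrift client API; the rows, columns and helper names are invented.

    # Illustrative model of the read operations over an in-memory column family:
    # row key -> {column -> value}. Not the real client API.
    cf = {
        "row1": {"col2": "x", "col5": "x", "col6": "x"},
        "row2": {"col1": "x", "col2": "x", "col3": "x", "col4": "x", "col5": "x"},
        "row3": {"col2": "x", "col3": "x", "col5": "x", "col6": "x", "col7": "x"},
    }

    def get(row, col):
        """One column of one row."""
        return cf[row][col]

    def get_slice_by_name(row, names):
        """One row, the named columns (name predicate)."""
        return {c: v for c, v in cf[row].items() if c in names}

    def get_slice_by_range(row, start, finish):
        """One row, a contiguous range of columns in sorted column order."""
        return {c: cf[row][c] for c in sorted(cf[row]) if start <= c <= finish}

    def multiget_slice(rows, names):
        """The same named columns across several rows."""
        return {r: get_slice_by_name(r, names) for r in rows}

    print(get("row1", "col5"))                               # x
    print(get_slice_by_range("row2", "col2", "col4"))        # col2..col4 of row2
    print(multiget_slice(["row1", "row3"], {"col2", "col6"}))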
get
[the same row/column grid as above, with a single cell -- one column of one row -- highlighted]
get_slice: name predicate
[the same grid, with a set of named columns within one row highlighted]
get_slice: slice range
[the same grid, with a contiguous range of columns within one row highlighted]
multiget_slice: name predicate

[the same grid, with the same named columns highlighted across several rows]
get_range_slices: slice range
[the same grid, with a column slice highlighted across a contiguous range of rows]
Storage architecture
Data Layout: writes

[diagram: a key-value insert is appended to an on-disk, un-ordered commit log and written into an in-memory, (key,col)-sorted memtable; full memtables are flushed to on-disk, (key,col)-sorted SSTables]
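A minimal single-node sketch of this write path, assuming an append-only commit log, a sorted in-memory memtable and sorted, immutable SSTables; class and threshold names are invented:

    # Sketch of the write path on one node: append to the commit log, insert into
    # the in-memory memtable, flush full memtables to immutable SSTables.
    import time

    class Node:
        def __init__(self, memtable_limit=4):
            self.commit_log = []          # on-disk, un-ordered, append-only (a list here)
            self.memtable = {}            # in-memory; sorted when flushed
            self.sstables = []            # on-disk, (key, col)-sorted, immutable
            self.memtable_limit = memtable_limit

        def insert(self, row, col, value):
            ts = time.time()
            self.commit_log.append((row, col, value, ts))     # durability first
            self.memtable[(row, col)] = (value, ts)           # then the in-memory table
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # Write the whole memtable out as one (key, col)-sorted SSTable.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    node = Node()
    for i in range(6):
        node.insert("row%d" % (i % 3), "col%d" % i, "x")
    print(len(node.sstables), "SSTable(s),", len(node.memtable), "memtable entries")   # 1, 2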
Data Layout: SSTables

[diagram: each SSTable consists of a Bloom filter, an index, and the data itself]
Data Layout: reads

[diagram, two slides: a read consults every SSTable in parallel; per-SSTable Bloom filters rule out SSTables that cannot contain the row (marked X), so most are never read]
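And a matching sketch of the read side: check the memtable and every SSTable, skip SSTables whose Bloom filter says the row cannot be there (faked here with a plain set), and keep the newest value per column by timestamp. Illustrative only:

    # Sketch: read one row by consulting the memtable and every SSTable, skipping
    # SSTables whose (faked) Bloom filter says the row is absent; the newest value
    # per column, by timestamp, wins.
    memtable = {("row1", "col5"): ("new", 2.0)}
    sstables = [
        # each SSTable: (bloom, sorted [((row, col), (value, timestamp)), ...]);
        # the Bloom filter is faked with a plain set of row keys
        ({"row1", "row2"}, [(("row1", "col2"), ("x", 1.0)), (("row1", "col5"), ("old", 1.0))]),
        ({"row3"}, [(("row3", "col7"), ("x", 1.0))]),
    ]

    def read_row(row):
        result = {}
        def merge(col, value, ts):
            if col not in result or ts > result[col][1]:
                result[col] = (value, ts)
        for (r, col), (value, ts) in memtable.items():
            if r == row:
                merge(col, value, ts)
        for bloom, data in sstables:
            if row not in bloom:                 # "definitely not here": skip the disk reads
                continue
            for (r, col), (value, ts) in data:
                if r == row:
                    merge(col, value, ts)
        return {col: v for col, (v, _ts) in result.items()}

    print(read_row("row1"))    # {'col2': 'x', 'col5': 'new'} -- memtable value wins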
Distribution: Partitioning + Replication
Partitioning + Replication

[diagram: given a key-value pair (k, v), which node(s) in the cluster should store it?]
Partitioning + Replication

• Partitioning data on to nodes
  • load balancing
  • row-based
• Replication
  • to protect against failure
  • better availability
Partitioning

• Random: take a hash of the row key
  • good for load balancing
  • bad for range queries
• Ordered: subdivide the key space
  • bad for load balancing
  • good for range queries
• Or build your own...
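A small sketch of the two partitioning strategies -- a hash-of-key ring for the random partitioner (the real RandomPartitioner is likewise MD5-based) and contiguous key ranges for the ordered one; tokens, ranges and node names are made up:

    # Sketch: random vs ordered partitioning of row keys onto nodes.
    import bisect, hashlib

    ring = [(2**125, "nodeA"), (2**126, "nodeB"), (2**127, "nodeC")]   # made-up tokens
    tokens = [t for t, _ in ring]

    def random_partition(row_key):
        """Random: hash the key, then find the owning token on the ring (wrapping)."""
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
        return ring[bisect.bisect_left(tokens, token) % len(ring)][1]

    ordered_ranges = [("g", "nodeA"), ("p", "nodeB"), ("~", "nodeC")]  # key-space subdivision

    def ordered_partition(row_key):
        """Ordered: nodes own contiguous key ranges, so range queries stay local."""
        for upper_bound, node in ordered_ranges:
            if row_key <= upper_bound:
                return node
        return ordered_ranges[-1][1]

    # the random choice depends on the MD5 hash; the ordered one is nodeC
    print(random_partition("user3"), ordered_partition("user3"))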
Simple Replication

[diagram, built up over three slides: nodes are arranged on a ‘ring’; (k, v) hashes to a primary location on the ring; extra copies are placed on the successor nodes on the ring]
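A sketch of this placement rule under the stated assumptions: hash the key onto the ring, take the first node at or after that position as the primary, and put the extra copies on its successors. Tokens and node names are invented:

    # Sketch: ring-based replica placement -- primary node plus successors on the ring.
    import bisect, hashlib

    ring = [(10, "n1"), (40, "n2"), (70, "n3"), (100, "n4")]    # (token, node), made up
    tokens = [t for t, _ in ring]

    def replicas(row_key, replication_factor=3, ring_size=128):
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % ring_size
        start = bisect.bisect_left(tokens, token) % len(ring)   # primary location
        # Extra copies are the successors, walking clockwise around the ring.
        return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

    print(replicas("user3"))   # e.g. ['n2', 'n3', 'n4'], depending on the hash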
Topology-aware Replication

• Snitch: node IP → (DataCenter, rack)
• EC2Snitch
  • Region → DC; availability_zone → rack
• PropertyFileSnitch
  • configured from a file
Topology-aware Replication

[diagram, built up over four slides: two data centres, DC 1 and DC 2, each with racks r1 and r2; (k, v) is stored in DC 1, extra copies go to a different data centre, and copies are spread across racks within a data centre]
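A loose sketch of the idea (not Cassandra's NetworkTopologyStrategy code): a snitch-like table maps each node to a (data centre, rack) pair, and replicas are picked preferring data centres and racks not already used. The topology below is invented:

    # Sketch: pick replicas spread across data centres and racks.
    # The "snitch" maps node -> (data centre, rack); the topology is invented.
    snitch = {
        "10.0.1.1": ("DC1", "r1"), "10.0.1.2": ("DC1", "r2"),
        "10.0.2.1": ("DC2", "r1"), "10.0.2.2": ("DC2", "r2"),
    }

    def place(candidates, replication_factor=3):
        """Prefer nodes in data centres, then racks, that are not used yet."""
        chosen = []
        def diversity(node):
            dc, rack = snitch[node]
            used_dcs = {snitch[n][0] for n in chosen}
            used_racks = {snitch[n] for n in chosen}
            return (dc not in used_dcs, (dc, rack) not in used_racks)
        while candidates and len(chosen) < replication_factor:
            best = max(candidates, key=diversity)      # most "new" DC/rack first
            chosen.append(best)
            candidates = [n for n in candidates if n != best]
        return chosen

    print(place(["10.0.1.1", "10.0.1.2", "10.0.2.1", "10.0.2.2"]))
    # ['10.0.1.1', '10.0.2.1', '10.0.1.2'] -- a second DC first, then a second rack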
Distribution: Consistency
Consistency Level

• How many replicas must respond in order to declare success
• W of the N replicas must succeed for a write to succeed
  • write with a client-generated timestamp
• R of the N replicas must succeed for a read to succeed
  • return the most recent value, by timestamp
• Tuneable per request
Consistency Level

• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in local data center
• Quorum in each data center
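A worked sketch of the arithmetic behind these levels: with N replicas, a write needs W acknowledgements and a read needs R responses, and whenever R + W > N the read set must overlap the write set -- which is what quorum reads and writes guarantee. The numbers are only an example:

    # Sketch: tuneable consistency arithmetic, per request.
    def quorum(n):
        return n // 2 + 1                      # "more than half"

    N = 3                                      # number of replicas
    for w, r, label in [(1, 1, "ONE / ONE"),
                        (quorum(N), quorum(N), "QUORUM / QUORUM"),
                        (N, 1, "ALL / ONE")]:
        overlap = (w + r) > N                  # read set guaranteed to meet the write set?
        print("%-16s W=%d R=%d -> overlapping read/write sets: %s" % (label, w, r, overlap))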
Maintaining consistency

• Read repair
• Hinted handoff
• Anti-entropy
Read repair

• If the replicas disagree on a read, send the most recent data back

[diagram, built up over four slides: a client reads key k from three replicas; n1 returns (v, t1), n2 returns “not found!”, n3 returns (v’, t2); the newest value v’ (timestamp t2) is returned to the user, and write (k, v’, t2) is sent to the out-of-date replicas]
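A small sketch of the resolution step, mirroring the replica responses in the diagram: pick the value with the newest timestamp, return it, and repair the replicas that were stale or missing it:

    # Sketch: resolve divergent replica responses by timestamp, then repair.
    responses = {                   # what each replica returned for key k
        "n1": ("v", 1),             # (value, timestamp t1)
        "n2": None,                 # not found!
        "n3": ("v'", 2),            # newest value, timestamp t2
    }

    # Pick the most recent value across all replicas that answered.
    winner_value, winner_ts = max((r for r in responses.values() if r is not None),
                                  key=lambda vt: vt[1])

    # Return the newest value to the client, and repair the out-of-date replicas.
    repairs = [node for node, r in responses.items() if r is None or r[1] < winner_ts]
    print("return to client:", winner_value)                                      # v'
    print("repair: write (k, %s, t%d) to %s" % (winner_value, winner_ts, repairs))  # n1 and n2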
Hinted handoff

• When a node is unavailable
• Writes can be stored on any other node as a hint
• Delivered when the node comes back online
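An illustrative sketch of the mechanism: while a replica is down, another node keeps the write as a hint keyed by the dead replica and replays it once that replica is seen again. Node names and structures are invented:

    # Sketch: keep hints for an unavailable replica, replay them when it recovers.
    from collections import defaultdict

    alive = {"n1": True, "n2": False, "n3": True}
    hints = defaultdict(list)                  # down replica -> [(key, value, timestamp), ...]

    def write(key, value, ts, replicas=("n1", "n2", "n3")):
        for node in replicas:
            if alive[node]:
                pass                                    # deliver the write to the live replica
            else:
                hints[node].append((key, value, ts))    # store a hint for later delivery

    def node_recovered(node):
        alive[node] = True
        return hints.pop(node, [])                      # hand these writes back to the replica

    write("k", "v", 1)
    print(node_recovered("n2"))                         # [('k', 'v', 1)]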
Anti-entropy

• Equivalent to ‘read repair all’
• Requires reading all data (woah)
  • (although only hashes are sent to calculate diffs)
• Manual process
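A toy sketch of the hash-comparison step: replicas exchange hashes of key ranges and only repair ranges whose hashes differ. Cassandra uses Merkle trees for this; the flat per-range hashing below is a deliberate simplification:

    # Sketch: compare per-range hashes between two replicas to find ranges to repair.
    import hashlib

    def range_hashes(data, ranges):
        """Hash each key range's contents; only the hashes cross the network."""
        out = {}
        for lo, hi in ranges:
            keys = sorted(k for k in data if lo <= k < hi)
            blob = "".join("%s=%s" % (k, data[k]) for k in keys)
            out[(lo, hi)] = hashlib.md5(blob.encode()).hexdigest()
        return out

    ranges = [("a", "m"), ("m", "z")]
    replica1 = {"apple": 1, "melon": 2}
    replica2 = {"apple": 1, "melon": 3}               # diverged in the second range

    h1, h2 = range_hashes(replica1, ranges), range_hashes(replica2, ranges)
    to_repair = [r for r in ranges if h1[r] != h2[r]]
    print(to_repair)                                   # [('m', 'z')]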
De-normalisation
De-normalisation

• Disk space is much cheaper than disk seeks
• Read at 100 MB/s, seek at 100 IO/s
• => copy data to avoid seeks
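Those two figures are worth turning into numbers; a quick back-of-the-envelope calculation under exactly those assumptions:

    # Back-of-the-envelope: fetch 1,000 scattered 1 KB records (one seek each)
    # versus reading them from a single contiguous, de-normalised row.
    records, record_kb = 1000, 1
    seek_time = records / 100.0                               # 100 seeks/s -> 10 s
    sequential_time = (records * record_kb) / (100 * 1024.0)  # 100 MB/s    -> ~0.01 s
    print("seek-bound: %.1f s   sequential: %.3f s" % (seek_time, sequential_time))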
Inbox query

[diagram: user1 sends msg1 to user2 and user3, msg2 to user3 and user4, msg3 to user4, ...]

Q? inbox for user3
Data-centric model

m1: {
  sender: user1
  content: “Mary had a little lamb”
  recipients: user2, user3
}

• but how to do ‘recipients’ for Inbox?
• one-to-many is modelled by a join table
To join

m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}

user2: {
  m1: true
}
user3: {
  m1: true
  m2: true
}
user4: {
  m2: true
  m3: true
}
.. or not to join
• Joins are expensive, so de-normalise to trade
  off space for time
• We can have lots of columns, so think BIG:
• Make message id a time-typed super-column.
• This makes get_slice an efficient way of
  searching for messages in a time window
Super Column Family

user2: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
}
user3: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
  m2: {
    sender: user1
    subject: “colours”
  }
}
...
De-normalisation + Cassandra

• have to write a copy of the record for each recipient ... but writes are very cheap
• get_slice fetches columns for a particular row, so it gets the received messages for a user
• on-disk column order is optimal for this query
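A sketch of the de-normalised inbox in action, using the example data above: each recipient's row gets its own copy of the message header under a time-ordered column, so a single slice of one row answers "inbox for user3 in a time window". Timestamps and helper names are invented:

    # Sketch: de-normalised inbox. One row per user; message headers are stored
    # under (timestamp, message id) columns so the row is time-ordered on disk.
    inbox = {
        "user2": {(1000, "m1"): {"sender": "user1", "subject": "A rhyme"}},
        "user3": {(1000, "m1"): {"sender": "user1", "subject": "A rhyme"},
                  (1010, "m2"): {"sender": "user1", "subject": "colours"}},
    }

    def deliver(message_id, ts, sender, subject, recipients):
        """Writes fan out: one (cheap) copy of the header per recipient."""
        for user in recipients:
            inbox.setdefault(user, {})[(ts, message_id)] = {"sender": sender, "subject": subject}

    def inbox_slice(user, start_ts, end_ts):
        """The get_slice analogue: a contiguous, time-ordered slice of one row."""
        row = inbox.get(user, {})
        return [(ts, mid, row[(ts, mid)]) for ts, mid in sorted(row) if start_ts <= ts <= end_ts]

    deliver("m3", 1020, "user1", "loyalty", ["user4"])
    print(inbox_slice("user3", 1000, 1015))    # m1 and m2 only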
Conclusion
What it’s good for

• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
Q?


Editor's Notes

  • #2 We provide Cassandra training and support and the Acunu Data Platform, high-performance storage software that incorporates Cassandra. Come and talk to us if you want to know more. We have an ebook to give away to those that want to dive into Cassandra details. You've probably heard about 'eventual consistency / scale out / de-norm'... I'm going to explain what they mean.
  • #11 But... tables have a fixed structure, described in a schema. Columns are much more flexible: no fixed schema in the RDBMS sense, little structure. Add a column whenever you want; you don't need the same columns in each row, etc.
  • #12 A two-level map. Everything in Cassandra has a timestamp, which is used to help with consistency. You might use your own timestamp as a key, but you don't normally do anything with the internal timestamps. (Of course this means your clocks need to be reasonably accurate, so you can tell people they need to use NTP.)
  • #13 A three-level map.
  • #14 A three-level map.
  • #16 Sparse. Up to 2 billion rows... but big rows are a problem (repair etc. is done per row). On a single node, data is sorted by row key.
  • #17 Queries are all key-based, i.e. the 'WHERE' is all on the key; the operations above differ in the SELECT.
  • #21 Note that the predicate is on NAME -- you can't do 'WHERE col3=x' with this.
  • #24 The memtable default is a skip list. Background compaction of SSTables. THE BENEFIT IS SEQUENTIAL WRITES.
  • #25 Data is sorted, key then value. Compactions are streaming, hence efficient.
  • #26 Reads go everywhere in parallel. Bloom filters are per row, so they help with get_slice but not multi-row range queries.
  • #28 Amazon Dynamo. Connect to any node in the cluster; nodes talk to one another using a p2p protocol called 'gossip' -- entirely symmetric.
  • #31, #32 Hash-ring based: keys are hashed; regions of the hash output space are claimed by nodes.
  • #42 PER REQUEST.
  • #50 Merkle trees.
  • #51 At scale you have to optimise for queries. De-normalisation is not specific to Cassandra.
  • #52 De-normalisation is not specific to Cassandra, but Cassandra is well suited to it because writes are relatively cheap and there is little infrastructure for queries.
  • #53 Get the inbox for user 3.
  • #55 An extra table holding recipient -> msg. You have to do a point query per message to show the inbox for a user.
  • #57 Note: content is not duplicated, only the subject -- the row would become too large. Columns need to be ordered by time, decreasing -- a custom comparator.