Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

Cassandra
storage internals
Nicolas Favre-Felix
Cassandra Europe 2012

What this talk covers

• What happens within a Cassandra node

• How Cassandra reads and writes data

• What compaction is and why we need it

• How counters are stored, modified, and read

Concepts
• Memtables

• SSTables

• Commit Log

• Key cache

• Row cache

• On heap, off-heap

• Compaction

• Bloom filters

• SSTable index

• Counters

Why is this important?
• Understand what goes on under the hood

• Understand the reasons for these choices

• Diagnose issues

• Tune Cassandra for performance

• Make your data model efficient

A word about hard drives

• Main driver behind Cassandra’s storage choices

• The last moving part

• Fast sequential I/O (150 MB/s)

• Slow random I/O (120-200 IOPS)

What SSDs bring
• Fast sequential I/O

• Fast random I/O

• Higher cost

• Limited lifetime

• Performance degradation

Disk usage with B-trees

• Important data structure in relational databases

• In-place overwrites (random I/O)

• log_B(N) random accesses for reads and writes (B: branching factor, N: number of keys)

Disk usage with Cassandra
• Made for spinning disks

• Sequential writes, much less than 1 I/O per insert

• Several layers of cache

• Random reads, approximately 1 I/O per read

• Generally “write-optimised”

Writing to Cassandra
Let’s add a row with a few columns

Row Key Column Column Column Column

The Cassandra write path
In the JVM: New data, Memtable
On disk: Commit log

The Commit Log
• Each write is added to a log file

• Guarantees durability after a crash

• 1-second window during which data is still in RAM

• Sequential I/O

• A dedicated disk is recommended
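
The commit log idea can be sketched in a few lines. This is a minimal illustration, not Cassandra's on-disk format: the JSON-lines encoding and the names `CommitLog`, `append`, `sync` and `replay` are assumptions made for the example. Writes are only ever appended, and after a crash the file is re-read to rebuild what was in memory.

```python
import json
import os
import tempfile


class CommitLog:
    """Append-only log of writes (illustrative format: one JSON object per line)."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, row_key, column, value, timestamp):
        # Sequential I/O: records are only ever added at the end of the file.
        record = {"key": row_key, "col": column, "val": value, "ts": timestamp}
        self.f.write(json.dumps(record) + "\n")

    def sync(self):
        # Until this is called, recent writes may exist only in RAM buffers.
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self):
        # After a crash, re-read the log in order to rebuild in-memory state.
        # Later records simply overwrite earlier ones in this sketch.
        self.sync()
        rows = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                r = json.loads(line)
                rows.setdefault(r["key"], {})[r["col"]] = (r["val"], r["ts"])
        return rows


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
log = CommitLog(log_path)
log.append("user:1", "name", "Ada", 1)
log.append("user:1", "city", "London", 2)
recovered = log.replay()
```

Calling `sync` on a timer rather than on every write is what creates a short window during which acknowledged data is still only in RAM.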

Memtables
• In-memory Key/Value data structure

• Implemented with ConcurrentSkipListMap

• One per column family

• Very fast inserts

• Columns are merged in memory for the same key

• Flushed at a certain threshold, into an SSTable
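
As a rough sketch of the behaviour listed above, the following keeps rows sorted by key, merges columns for the same key, and flushes in key order. It swaps in a plain dict plus a sorted key list for ConcurrentSkipListMap and is not thread-safe; the class name, the row-count threshold and the tuple layout are assumptions for illustration only.

```python
import bisect


class Memtable:
    """Sorted in-memory map: row key -> {column name: (value, timestamp)}."""

    def __init__(self, flush_threshold=3):
        self.rows = {}
        self.sorted_keys = []  # kept sorted so a flush can write sequentially
        self.flush_threshold = flush_threshold  # hypothetical: number of rows

    def put(self, row_key, column, value, timestamp):
        if row_key not in self.rows:
            bisect.insort(self.sorted_keys, row_key)
            self.rows[row_key] = {}
        current = self.rows[row_key].get(column)
        # Columns for the same key are merged in memory; newest timestamp wins.
        if current is None or timestamp >= current[1]:
            self.rows[row_key][column] = (value, timestamp)

    def is_full(self):
        return len(self.rows) >= self.flush_threshold

    def flush(self):
        # Emit rows in key order and columns in name order, then start empty.
        sstable = [(k, sorted(self.rows[k].items())) for k in self.sorted_keys]
        self.rows, self.sorted_keys = {}, []
        return sstable


mt = Memtable(flush_threshold=2)
mt.put("b", "x", 1, 10)
mt.put("a", "y", 2, 11)
mt.put("b", "x", 3, 12)  # replaces the earlier value of ("b", "x")
was_full = mt.is_full()
flushed = mt.flush()
```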

Dumping a Memtable on disk

In the JVM: Full Memtable
On disk: Commit log

Dumping a Memtable on disk

In the JVM: New Memtable
On disk: Commit log, SSTable

The SSTable

• One file, written sequentially

• Columns are in order, grouped by row

• Immutable once written, no updates!

SSTables start piling up!

In the JVM: Memtable
On disk: Commit log, nine SSTables


SSTables
• Can’t keep all of them forever

• Need to reclaim disk space

• Reads could touch several SSTables

• Scans touch all of them

• In-memory data structures per SSTable

Compaction
• Merge SSTables of similar size together

• Remove overwrites and deleted data (timestamps)

• Improve range query performance

• Major compaction creates a single SSTable

• I/O intensive operation
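
A simplified sketch of the merge step: several sorted SSTables are merged in one sequential pass, the value with the highest timestamp wins, and deleted data is dropped. The cell layout and the `TOMBSTONE` marker are assumptions for illustration; a real system has to keep deletion markers around for some time so that every replica learns about the delete, which this sketch ignores.

```python
import heapq

TOMBSTONE = None  # illustrative marker for deleted data


def compact(sstables):
    """Merge sorted SSTables into one, keeping the newest version of each cell.

    Each SSTable is a list of ((row_key, column), (value, timestamp)) entries
    sorted by (row_key, column).
    """
    merged = []
    for cell, versioned in heapq.merge(*sstables, key=lambda e: e[0]):
        if merged and merged[-1][0] == cell:
            # Same cell found in another SSTable: keep the higher timestamp.
            if versioned[1] > merged[-1][1][1]:
                merged[-1] = (cell, versioned)
        else:
            merged.append((cell, versioned))
    # Drop cells whose newest version is a deletion.
    return [(c, v) for c, v in merged if v[0] is not TOMBSTONE]


old = [(("a", "x"), (1, 10)), (("b", "x"), (2, 10))]
new = [(("a", "x"), (5, 20)), (("b", "x"), (TOMBSTONE, 30)), (("c", "x"), (7, 5))]
compacted = compact([old, new])
```

Because every input is already sorted, the merge reads each file front to back, which is why compaction is heavy on I/O but stays sequential.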

Recent improvements

• Pluggable compaction

• Different strategies, chosen per column family

• SSTable compression

• More efficient SSTable merges

Reading from Cassandra
• Reading all these SSTables would be very inefficient

• We have to read from memory as much as possible

• Otherwise we need to do 2 things efficiently:

• Find the right SSTable to read from

• Find where in that SSTable to read the data

First step for reads

• The Memtable!

• Read the most recent data

• Very fast, no need to touch the disk

Off-heap (no GC): Row cache
In the JVM: Memtable
On disk: Commit log, SSTable

Row cache

• Stores a whole row in memory

• Off-heap, not subject to Garbage Collection

• Size is configurable per column family

• Last resort before having to read from disk

Finding the right SSTable

In the JVM: Memtable
On disk: Commit log, nine SSTables

Bloom filter
• Saved with each SSTable

• Answers “contains(Key) :: boolean”

• Saved on disk but kept in memory

• Probabilistic data structure

• Configurable proportion of false positives

• No false negatives
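
To make the "no false negatives" property concrete, here is a small Bloom filter sketch. The sizes, the use of salted SHA-256 digests and the method names are choices made for this example and are not taken from Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
stored_keys = [f"row-{n}" for n in range(50)]
for k in stored_keys:
    bf.add(k)
```

Using more bits per key lowers the proportion of false positives at the cost of memory, which is the trade-off behind the configurable setting.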

Bloom filter

In the JVM: Memtable, three Bloom filters, each answering exists(key)? with true/false
On disk: Commit log

Reading from an SSTable
• We need to know where in the file our data is saved
• Keys are sorted, so why not do a binary search?
• Keys are not all the same size
• Jumping around in a file is very slow
• log2(N) random I/Os, ~20 for 1 million keys

Reading from an SSTable
Let’s index key ranges in the SSTable

Key: k-128 Key: k-256 Key: k-384

Position: 12098 Position: 23445 Position: 43678

SSTable

SSTable index
• Saved with each SSTable

• Stores key ranges and their offsets: [(Key, Offset)]

• Saved on disk but kept in memory

• Avoids searching for a key by scanning the file

• Configurable key interval (default: 128)
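
A sketch of how such a sampled index could be used, assuming it holds every 128th (Key, Offset) pair in key order. The function names and the synthetic keys and offsets are hypothetical; the lookup returns the offset from which a forward scan of at most one interval would find the key.

```python
import bisect

INDEX_INTERVAL = 128  # the default interval quoted on the slide


def build_sparse_index(entries, interval=INDEX_INTERVAL):
    """entries: list of (key, offset) sorted by key, one per row in the SSTable."""
    return entries[::interval]


def find_scan_start(sparse_index, key):
    """Return the file offset from which to scan forward for `key`."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before the first key in this SSTable
    return sparse_index[i][1]


# Synthetic SSTable: 1000 rows, each 100 bytes long.
entries = [(f"k-{n:04d}", n * 100) for n in range(1000)]
index = build_sparse_index(entries)
start = find_scan_start(index, "k-0300")
```

Here the index holds 8 entries instead of 1000, and the lookup for `k-0300` starts scanning at the offset of `k-0256`.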

SSTable index

In the JVM: Memtable, Bloom filter, SSTable index
On disk: Commit log, SSTable

Sometimes not enough

• Storing key ranges is limited

• We can do better by storing the exact offset

• This saves approximately one I/O

The key cache

In the JVM: Memtable, Bloom filter, Key cache, SSTable index
On disk: Commit log, SSTable

Key cache

• Stores the exact location in the SSTable

• Stored on the JVM heap

• Avoids having to scan a whole index interval

• Size is configurable per column family
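
A key cache can be pictured as a bounded map from (SSTable, row key) to an exact file offset. The sketch below assumes least-recently-used eviction and counts capacity in entries; both are assumptions for the example, as are the class and method names.

```python
from collections import OrderedDict


class KeyCache:
    """Bounded map from (sstable_id, row_key) to the row's exact file offset."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, sstable_id, row_key):
        k = (sstable_id, row_key)
        if k not in self.entries:
            return None  # miss: fall back to the SSTable index
        self.entries.move_to_end(k)  # mark as most recently used
        return self.entries[k]

    def put(self, sstable_id, row_key, offset):
        k = (sstable_id, row_key)
        self.entries[k] = offset
        self.entries.move_to_end(k)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


cache = KeyCache(capacity=2)
cache.put("sstable-1", "a", 12098)
cache.put("sstable-1", "b", 23445)
hit = cache.get("sstable-1", "a")
cache.put("sstable-2", "c", 43678)  # evicts ("sstable-1", "b")
```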

Off-heap (no GC): Row cache (2)
In the JVM: Memtable (1), Bloom filter, Key cache and SSTable index (3, 4, 5)
On disk: Commit log, SSTable (6)

Distributed counters

• 64-bit signed integer, replicated in the cluster

• Atomic inc and dec by an arbitrary amount

• Counting with read-inc-write would be inefficient

• Stored differently from regular columns

Consider a cluster
with 3 nodes, RF=3

Internal counter data
• List of increments received by the local node
• Summaries (Version,Sum) sent by the other nodes
• The total value is the sum of all counts

Local increments: +5 +2 -3

Received from node: version: 3, count: 5

version: 5

Incrementing a counter
• A coordinator node is chosen (blue node)

Local increments: +5 +2 -3

• Stores its increment locally

Local increments: +5 +2 -3 +1

• Reads back the sum of its increments

• Forwards a summary to other replicas: (v.4, sum 5)

• Replicas update their records:

version: 4
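
The increment steps above can be sketched as follows. This is an illustration of the flow on the slides, not Cassandra's counter implementation: the class and method names are hypothetical, and the version is assumed to grow by one per local increment, which matches the (v.4, sum 5) summary in the example.

```python
class CounterReplica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_increments = []  # increments this node coordinated
        self.remote = {}            # node_id -> (version, sum) from other nodes

    def increment(self, delta):
        """Coordinator side: store locally, read back, build a summary."""
        self.local_increments.append(delta)
        # Assumption: one version step per local increment.
        version = len(self.local_increments)
        return (self.node_id, version, sum(self.local_increments))

    def receive_summary(self, node_id, version, total):
        """Replica side: keep only the most recent summary per node."""
        known = self.remote.get(node_id)
        if known is None or version > known[0]:
            self.remote[node_id] = (version, total)


coordinator = CounterReplica("A")
replica = CounterReplica("B")
for delta in (5, 2, -3):
    coordinator.increment(delta)
summary = coordinator.increment(1)    # the +1 from the slides
replica.receive_summary(*summary)
replica.receive_summary("A", 3, 4)    # a stale summary is ignored
```

The read-back of the local sum is the "read on write" that makes counter increments more expensive than regular column writes.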

Reading a counter

• Replicas return their counts and versions

• Including what they know about other nodes

• Only the most recent versions are kept

Reading a counter

Two replica responses, each listing what it knows about the three nodes:

v. 3, count 5 | v. 6, count 12 | v. 2, count 8
v. 3, count 5 | v. 5, count 10 | v. 4, count 5

version: 6, count: 12

Counter value: 5 + 12 + 5 = 22
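
The arithmetic in this example can be reproduced with a short sketch: for each node keep the entry with the highest version, then sum the counts. The node labels "A", "B" and "C" are hypothetical, and the sketch assumes that entries in the same position of each response refer to the same node.

```python
def merge_counter_views(views):
    """views: list of {node_id: (version, count)}; returns (merged, total)."""
    merged = {}
    for view in views:
        for node_id, (version, count) in view.items():
            # Only the most recent version for each node is kept.
            if node_id not in merged or version > merged[node_id][0]:
                merged[node_id] = (version, count)
    return merged, sum(count for _, count in merged.values())


views = [
    {"A": (3, 5), "B": (6, 12), "C": (2, 8)},
    {"A": (3, 5), "B": (5, 10), "C": (4, 5)},
]
merged, total = merge_counter_views(views)
```

The merged view is A at version 3, B at version 6 and C at version 4, which gives 5 + 12 + 5 = 22.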

Tuning
• Cassandra can’t really use large amounts of RAM

• Garbage Collection pauses stop everything

• Compaction has an impact on performance

• Reading from disk is slow

• These limitations restrict the size of each node

Recap
• Fast sequential writes

• ~1 I/O for uncached reads, 0 for cached

• Counter increments require a read on write; regular column writes don’t

• Know where your time is spent (monitor!)

• Tune accordingly

Questions?


• In-kernel backend
• No Garbage Collection
• No need to plan heavy compactions
• Low and consistent latency
• Full versioning, snapshots
• No degradation with Big Data
