Cassandra and Solid State Drives

Rick Branson, DataStax

FACT

CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
FOR SPINNING DISKS

COMMIT

Client → Cassandra: insert({ cf1: { row1: { col3: foo } } })

On-Disk Node Commit Log
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }

In-Memory Memtable for “cf1”
row1 col1: [del] col2: “def” col3: “foo”
row2 col1: “xyz”
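The write path in this diagram can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the `Node` class, the `TOMBSTONE` marker, and the in-memory list standing in for the on-disk commit log are all assumptions made for the example.

```python
TOMBSTONE = "<del>"  # hypothetical marker for a deleted column


class Node:
    """Toy model of the write path: append to the log, then update the memtable."""

    def __init__(self):
        self.commit_log = []  # stands in for the sequential, append-only on-disk log
        self.memtables = {}   # column family -> {row -> {column -> value}}

    def insert(self, mutation):
        # 1. Record the mutation by appending it; nothing already logged is rewritten.
        self.commit_log.append(mutation)
        # 2. Apply it to the in-memory memtable of each column family it touches.
        for cf, rows in mutation.items():
            memtable = self.memtables.setdefault(cf, {})
            for row, columns in rows.items():
                memtable.setdefault(row, {}).update(columns)


node = Node()
# The five mutations shown in the commit log on the slide, in order.
node.insert({"cf1": {"row1": {"col1": "abc"}}})
node.insert({"cf1": {"row1": {"col2": "def"}}})
node.insert({"cf1": {"row1": {"col1": TOMBSTONE}}})
node.insert({"cf1": {"row2": {"col1": "xyz"}}})
node.insert({"cf1": {"row1": {"col3": "foo"}}})
print(node.memtables["cf1"])
```

The log keeps every mutation in arrival order, while the memtable holds only the latest state of each column, which matches the two halves of the diagram.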

FLUSH

In-Memory Memtable for “cf1”
row1 col1: [del] col2: “def” col3: “foo”
row2 col1: “xyz”

On disk: SSTable 1, SSTable 2, SSTable 3, SSTable 4
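A flush can be pictured as writing the memtable out once, in sorted order, as a structure that is never modified afterwards. The sketch below assumes a hypothetical `flush` helper and uses an immutable tuple in place of a real SSTable file; the actual on-disk format is not described in these slides.

```python
def flush(memtable):
    """Return the memtable's contents as a sorted, immutable sequence of entries."""
    return tuple(
        ((row, column), value)
        for row in sorted(memtable)
        for column, value in sorted(memtable[row].items())
    )


# Memtable contents from the slide, deliberately given out of order.
memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col3": "foo", "col1": "<del>", "col2": "def"},
}
sstable = flush(memtable)
for entry in sstable:
    print(entry)
```

Because the entries are emitted in key order in a single pass, the write to disk would be one sequential stream.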


COMPACT

SSTables 1, 2, 3, 4 → one new SSTable
SSTables are merged to maintain read performance

New SSTable is streamed to disk and old SSTables (X X X X) are erased

TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is a linear-time, O(N), merge
• SSTables are completely immutable
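Because every SSTable is already sorted, compaction amounts to a streaming merge. The sketch below is one way to picture it, under stated assumptions: `compact` is a hypothetical helper, SSTables are plain sorted lists of `((row, column), value)` pairs ordered oldest to newest, and tombstone purging is left out. Python's `heapq.merge` costs O(N log k) for k input tables, which is effectively linear in N when only a handful of tables are merged.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one, keeping the newest value per key."""
    # Tag each entry with a negated generation so newer versions of a key sort first.
    tagged = (
        [(key, -generation, value) for key, value in table]
        for generation, table in enumerate(sstables)
    )
    merged = []
    last_key = None
    for key, _neg_generation, value in heapq.merge(*tagged):
        if key != last_key:  # first occurrence of a key is its newest version
            merged.append((key, value))
            last_key = key
    return merged


older = [(("row1", "col1"), "abc"), (("row1", "col2"), "def")]
newer = [
    (("row1", "col1"), "<del>"),
    (("row1", "col3"), "foo"),
    (("row2", "col1"), "xyz"),
]
for entry in compact([older, newer]):
    print(entry)
```

The inputs are only read and the output is only appended to, so the merge never needs to modify an existing table in place.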

COMPARED
• Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.
• Most perform similar buffering of writes before flushing to disk
• ... but flushes are RANDOM writes

SPINNING DISKS
• Dirt cheap: $0.08/GB
• Seek time is limited by the time it takes the drive to rotate: IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec for modern SATA drives
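The rule of thumb above is simple enough to check directly. The helper name `rotational_iops` is made up for this sketch, and the formula is the slide's approximation (one random operation per rotation), not a measured figure.

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: roughly one random I/O per platter rotation."""
    return rpm / 60  # rotations per minute -> rotations per second


for rpm in (7_200, 15_000):
    print(rpm, "RPM ->", rotational_iops(rpm), "IOPS")
```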

THAT WAS THE WORLD
IN WHICH CASSANDRA
WAS BUILT

2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25 for spinning disks
* Based on specifications provided by Intel for the 300GB Intel 320 drive
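The cost-per-IOPS comparison can be roughly reproduced from the figures quoted in the deck. The slides do not say which IOPS number or which spinning-disk capacity was used, so the sketch below computes the SSD figure for both read and write IOPS and, as an assumption, back-solves the capacity a 7,200 RPM drive would need to land on $1.25. The helper `cost_per_iops` is hypothetical.

```python
def cost_per_iops(price_per_gb, capacity_gb, iops):
    """Drive price divided by the random I/O operations per second it can deliver."""
    return price_per_gb * capacity_gb / iops


# 300GB Intel 320 at ~$1.75/GB, using the quoted read and write IOPS.
ssd_write = cost_per_iops(1.75, 300, 23_000)
ssd_read = cost_per_iops(1.75, 300, 39_500)

# Assumption: the $1.25 figure refers to a 7,200 RPM drive (~120 IOPS) at $0.08/GB.
# The capacity that would make those numbers agree:
implied_hdd_capacity_gb = 1.25 * 120 / 0.08

print(f"SSD: ${ssd_write:.3f} per write IOPS, ${ssd_read:.3f} per read IOPS")
print(f"Implied spinning-disk capacity: {implied_hdd_capacity_gb:.0f} GB")
```

Under these assumptions the SSD figure rounds to about $0.02 per write IOPS, and the spinning-disk figure corresponds to a drive of a little under 2TB.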

WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
LSM-TREES OBSOLETE?

SOLID STATE HAS
SOME MAJOR BUTS...

... BUT
• Cannot overwrite directly: must erase first, then write
• Can write in small increments (4KB), but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC) for each erase block

WEAR LEVELING is used
to reduce the number of
total erase operations

WEAR LEVELING

Writes are appended to a block one after another: Write 1, Write 2, then Write 3

Remember: the whole block must be erased

How is data from only Write 2 modified?
Block: Write 1, Write 2, Write 3 | Empty Block
Mark Garbage (old Write 2 data), Append Modified Data

... fragmentation,
WHICH MEANS...

GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag down performance
• Modern SSDs do this in the background... as much as possible
• If no empty blocks are available, GC must be done before ANY writes can complete

WRITE AMPLIFICATION
• Occurs when only a few kilobytes are written, but fragmentation causes a whole block to be rewritten
• The smaller & more random the writes, the worse this gets
• Modern “mark and sweep” GC reduces it, but cannot eliminate it
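The effect of access pattern on write amplification can be seen in a toy simulation. Everything here is an assumption made for illustration: the `TinyFlash` class, its greedy garbage collector, and the tiny geometry (8 blocks of 16 pages, 96 logical pages) do not model any real drive's firmware. Write amplification is measured as pages physically programmed divided by pages the host asked to write.

```python
import random


class TinyFlash:
    """Toy flash device: pages are programmed out of place, blocks are erased whole."""

    def __init__(self, num_blocks, pages_per_block):
        self.ppb = pages_per_block
        # Logical page held by each slot; None means empty or garbage.
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.fill = [0] * num_blocks            # next unwritten slot per block
        self.where = {}                         # logical page -> (block, slot)
        self.active = 0                         # block currently appended to
        self.free = list(range(1, num_blocks))  # erased blocks
        self.host_writes = 0
        self.nand_writes = 0

    def _program(self, lpn):
        slot = self.fill[self.active]
        self.blocks[self.active][slot] = lpn
        self.fill[self.active] += 1
        self.where[lpn] = (self.active, slot)
        self.nand_writes += 1

    def _make_room(self):
        if self.fill[self.active] < self.ppb:
            return
        if len(self.free) > 1:
            self.active = self.free.pop()
            return
        # Only the reserve block is left: collect the block with the fewest live pages.
        in_use = [b for b in range(len(self.blocks)) if b not in self.free]
        victim = min(in_use, key=lambda b: sum(p is not None for p in self.blocks[b]))
        survivors = [p for p in self.blocks[victim] if p is not None]
        self.blocks[victim] = [None] * self.ppb  # erase the whole block
        self.fill[victim] = 0
        self.free.append(victim)
        self.active = self.free.pop(0)           # switch to the reserve block
        for lpn in survivors:                    # relocating live data costs extra writes
            self._program(lpn)

    def write(self, lpn):
        self.host_writes += 1
        if lpn in self.where:                    # mark the old copy as garbage
            block, slot = self.where[lpn]
            self.blocks[block][slot] = None
        self._make_room()
        self._program(lpn)

    def write_amplification(self):
        return self.nand_writes / self.host_writes


def run(pattern):
    flash = TinyFlash(num_blocks=8, pages_per_block=16)
    for lpn in pattern:
        flash.write(lpn)
    return flash.write_amplification()


LOGICAL_PAGES = 96  # leaves two of the eight blocks as spare capacity
sequential = [lpn for _ in range(20) for lpn in range(LOGICAL_PAGES)]
rng = random.Random(0)
scattered = list(range(LOGICAL_PAGES)) + [
    rng.randrange(LOGICAL_PAGES) for _ in range(19 * LOGICAL_PAGES)
]
wa_sequential = run(sequential)
wa_random = run(scattered)
print("sequential:", wa_sequential, "random:", wa_random)
```

In this model, sequential overwrites invalidate one whole block at a time, so the collector never has live pages to relocate; random overwrites leave live pages scattered across every block, and each collection has to copy them.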

Torture test shows massive
write performance drop-off
for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6

Some poorly designed drives
COMPLETELY fall apart

Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6

Even a well-behaved drive
suffers significantly from the
torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Post-torture, all disk blocks
were marked empty, and the
“fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

“TRIM”
• Filesystems typically don’t erase data immediately when files are deleted; they just mark it as deleted and erase later
• TRIM allows the OS to actively tell the drive when a region of disk is no longer used
• If an entire erase block is marked as unused, GC is avoided; otherwise TRIM just hastens the collection process

TRIM only reduces the
write amplification effect,
it can’t eliminate it.

AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
which causes around 10x write amplification
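A back-of-the-envelope model shows how write amplification feeds into drive lifetime. It is not AnandTech's calculation: the `lifetime_years` helper is hypothetical, the 300GB capacity and ~5,000 erase cycles are taken from earlier slides, and the host write rate of 275GB/day is an assumed value chosen only so that the 10x case lands near 1.5 years.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amplification):
    """Years until every erase block has used up its rated cycles (idealized)."""
    nand_budget_gb = capacity_gb * erase_cycles  # total data the flash can absorb
    nand_gb_per_day = host_gb_per_day * write_amplification
    return nand_budget_gb / nand_gb_per_day / 365


HOST_GB_PER_DAY = 275  # hypothetical workload, not a figure from the slides

heavy_random = lifetime_years(300, 5_000, HOST_GB_PER_DAY, write_amplification=10)
sequential = lifetime_years(300, 5_000, HOST_GB_PER_DAY, write_amplification=1)
print(f"10x amplification: {heavy_random:.1f} years")
print(f"1x amplification:  {sequential:.1f} years")
```

Whatever the real workload, the model makes lifetime inversely proportional to write amplification, so the same host writes at 1x last ten times as long as at 10x.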

CASSANDRA
ONLY WRITES
SEQUENTIALLY

“For a sequential write workload,
write amplification is equal to 1,
i.e., there is no write
amplification.”

Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
Performance: Understanding, Analysis, and Performance Modeling”
