Cassandra and Solid State Drives

CASSANDRA & SOLID
STATE DRIVES
Rick Branson, DataStax

FACT
CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
FOR SPINNING DISKS

COMMIT
Client → Cassandra node: insert({ cf1: { row1: { col3: foo } } })
On-Disk Commit Log:
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }
In-Memory Memtable for “cf1”:
row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”

FLUSH
In-Memory Memtable for “cf1”:
row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”
In-Memory Memtable for “cf1” → SSTable
SSTables: 1, 2, 3, 4
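
The commit and flush steps can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's actual code: the `Node` class, its method names, and the use of plain lists and dicts in place of real files are all assumptions made for the example. It replays the five mutations shown on the commit slide.

```python
TOMBSTONE = "<del>"  # marker for a deleted column, as drawn on the slide


class Node:
    """Hypothetical single-column-family node; lists stand in for disk files."""

    def __init__(self):
        self.commit_log = []  # append-only log (on disk in the real system)
        self.memtable = {}    # in-memory: row key -> {column: value}
        self.sstables = []    # flushed tables, each immutable and sorted

    def insert(self, row, col, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append((row, col, value))
        # 2. Apply it to the memtable.
        self.memtable.setdefault(row, {})[col] = value

    def delete(self, row, col):
        # A delete is just another write: it records a tombstone.
        self.insert(row, col, TOMBSTONE)

    def flush(self):
        # Write the memtable out in sorted order as one immutable table.
        table = tuple(
            (row, tuple(sorted(cols.items())))
            for row, cols in sorted(self.memtable.items())
        )
        self.sstables.append(table)
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed entries are no longer needed
        return table


node = Node()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.delete("row1", "col1")
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
sstable = node.flush()
```

After the flush, `sstable` holds row1 and row2 in key order, with the tombstone for col1 still present. Nothing already written is ever modified in place.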

COMPACT
SSTables: 1, 2, 3, 4
SSTables are merged to maintain read performance

New SSTable is streamed to disk and old SSTables are erased

TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction has linear complexity, O(N)
• SSTables are completely immutable
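
Because every SSTable is already sorted, compaction is a streaming merge that reads each input once. The sketch below shows the idea with Python's `heapq.merge`. The `compact` and `tagged` functions and the in-memory lists are assumptions for illustration, and dropping tombstones immediately is a simplification that real Cassandra does not make.

```python
import heapq

TOMBSTONE = "<del>"


def tagged(table, age):
    """Yield (key, -age, value) so that, for equal keys, the newest table sorts first."""
    for key, value in table:
        yield key, -age, value


def compact(sstables):
    """Merge sorted tables (oldest first) into one sorted table; newest value wins."""
    streams = [tagged(table, age) for age, table in enumerate(sstables)]
    merged = []
    last_key = None
    for key, _, value in heapq.merge(*streams):
        if key == last_key:
            continue  # an older version of a key we have already emitted or dropped
        last_key = key
        if value != TOMBSTONE:  # simplification: deleted columns are dropped right away
            merged.append((key, value))
    return merged


older = [(("row1", "col1"), "abc"), (("row1", "col2"), "def")]
newer = [(("row1", "col1"), TOMBSTONE), (("row2", "col1"), "xyz")]
result = compact([older, newer])
```

Here `result` contains only `row1/col2` and `row2/col1`. The inputs are never modified, and the output is produced in sorted order, so it can be streamed to disk as it is generated.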

IMPORTANT: SSTables are completely immutable

COMPARED
• Most popular data storage engines
rewrite modified data in-place: MySQL
(InnoDB), PostgreSQL, Oracle,
MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
writes before flushing to disk
• ... but flushes are RANDOM writes

SPINNING DISKS
• Dirt cheap: $0.08/GB
• Seek time limited by the time it takes for the drive to rotate: IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec for modern SATA drives
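
The rule of thumb above works out as follows. The function name `max_random_iops` is made up for this example, and the formula is the slide's approximation (one random operation per rotation), not a measured figure.

```python
def max_random_iops(rpm):
    """Slide's rule of thumb: roughly one random I/O per platter rotation."""
    return rpm / 60  # rotations per second


iops_7200 = max_random_iops(7_200)    # 120.0
iops_15000 = max_random_iops(15_000)  # 250.0
```

Even the fastest spinning drives top out at a few hundred random operations per second, which is why a sequential-only write path mattered so much.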

THAT WAS THE WORLD
IN WHICH CASSANDRA
WAS BUILT

2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
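
The per-IOPS comparison can be roughly reproduced from the numbers on these slides. This is a back-of-the-envelope sketch: that the $0.02 figure is based on write IOPS is an assumption, and the spinning-disk price is inferred backwards from the $1.25 figure because the slides do not state it.

```python
# Figures taken from the slides (300GB Intel 320).
SSD_CAPACITY_GB = 300
SSD_PRICE_PER_GB = 1.75
SSD_READ_IOPS = 39_500
SSD_WRITE_IOPS = 23_000

# Spinning disk: IOPS from the 7,200 RPM rule of thumb, cost per IOPS from the slide.
HDD_IOPS = 120
HDD_COST_PER_IOPS = 1.25

ssd_price = SSD_CAPACITY_GB * SSD_PRICE_PER_GB        # 525.0 dollars
ssd_cost_per_write_iops = ssd_price / SSD_WRITE_IOPS  # about 0.023
ssd_cost_per_read_iops = ssd_price / SSD_READ_IOPS    # about 0.013

# Inference, not a slide figure: the drive price that $1.25 per IOPS implies.
implied_hdd_price = HDD_COST_PER_IOPS * HDD_IOPS      # 150.0 dollars
```

On these assumptions the SSD comes out at roughly two cents per write IOPS, in line with the slide.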

WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
LSM-TREES OBSOLETE?

SOLID STATE HAS
SOME MAJOR BUTS...

... BUT
• Cannot overwrite directly: must erase
first, then write
• Can write in small increments (4KB),
but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
for each erase block

WEAR LEVELING spreads
erase operations evenly
across blocks so that no
block wears out early

WEAR LEVELING
A block holds Write 1, Write 2, Write 3
How is data from only Write 2 modified?
Remember: the whole block must be erased
Mark the old data as garbage, then append the modified data to an empty block

... fragmentation,
WHICH MEANS...

GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
background... as much as possible
• If no empty blocks are available, GC
must be done before ANY writes can
complete

WRITE AMPLIFICATION
• When only a few kilobytes are written,
but fragmentation causes a whole
block to be rewritten
• The smaller & more random the writes,
the worse this gets
• Modern “mark and sweep” GC reduces
it, but cannot eliminate it
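
Write amplification is usually expressed as the ratio of bytes physically written to flash to bytes the host asked to write. The sketch below works through the worst case using the 4KB page and ~512KB erase block sizes from the earlier slide. The function name `amplification_factor` is an assumption for the example.

```python
KB = 1024
PAGE_BYTES = 4 * KB           # smallest write increment
ERASE_BLOCK_BYTES = 512 * KB  # smallest erase unit


def amplification_factor(host_bytes, flash_bytes):
    """Bytes physically written to flash per byte written by the host."""
    return flash_bytes / host_bytes


# Worst case: one 4KB page changes and the whole block is rewritten.
worst_case = amplification_factor(PAGE_BYTES, ERASE_BLOCK_BYTES)

# Best case: the host writes a whole erase block's worth of data at once.
best_case = amplification_factor(ERASE_BLOCK_BYTES, ERASE_BLOCK_BYTES)
```

The worst case is a factor of 128, the best case a factor of 1. Real drives land somewhere in between, depending on how fragmented they are.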

Torture test shows massive
write performance drop-off
for heavily fragmented drive
Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6

Some poorly designed drives
COMPLETELY fall apart
Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6

Even a well-behaved drive
suffers significantly from the
torture test
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Post-torture, all disk blocks
were marked empty, and the
“fast” comes back...
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

“TRIM”
• Filesystems don’t typically immediately
erase data when files are deleted, they just
mark them as deleted and erase later
• TRIM allows the OS to actively tell the drive
when a region of disk is no longer used
• If an entire erase block is marked as
unused, GC is avoided, otherwise TRIM
just hastens the collection process

TRIM only reduces the
write amplification effect,
it can’t eliminate it.

AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
which causes around 10x write amplification
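
To see why write amplification matters for drive lifetime, here is a rough endurance model. It assumes perfect wear leveling, and the 275GB/day host write rate is a hypothetical number chosen for illustration. It is not taken from the AnandTech estimate. The capacity and cycle count come from the earlier slides.

```python
def lifetime_years(capacity_gb, erase_cycles, write_amplification, host_gb_per_day):
    """Years until every block has used up its erase cycles (perfect wear leveling)."""
    total_flash_gb = capacity_gb * erase_cycles           # total data the flash can absorb
    total_host_gb = total_flash_gb / write_amplification  # what the host actually gets to write
    return total_host_gb / host_gb_per_day / 365


# 300GB drive, ~5,000 cycles per block, hypothetical 275GB of host writes per day.
random_heavy = lifetime_years(300, 5_000, write_amplification=10, host_gb_per_day=275)
sequential = lifetime_years(300, 5_000, write_amplification=1, host_gb_per_day=275)
```

Under these assumptions the drive lasts about 1.5 years at 10x write amplification. With no amplification, the same workload would take ten times longer to wear it out.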

CASSANDRA
ONLY WRITES
SEQUENTIALLY

“For a sequential write workload,
write amplification is equal to 1,
i.e., there is no write
amplification.”
Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
Performance: Understanding, Analysis, and Performance Modeling”
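
The claim in the quote can be checked on a toy model. The sketch below simulates a small log-structured flash translation layer with greedy garbage collection. It is not the model from the paper: the `FlashSim` class, the device size (64 blocks of 32 pages), the 25% spare capacity, and the workloads are all assumptions chosen to keep the example short. Rewriting logical pages in order stands in for a sequential workload.

```python
import random


class FlashSim:
    """Toy flash translation layer: appends writes, reclaims space with greedy GC."""

    def __init__(self, num_blocks, pages_per_block):
        self.ppb = pages_per_block
        self.valid = [set() for _ in range(num_blocks)]  # live logical pages per block
        self.used = [0] * num_blocks                     # pages programmed since last erase
        self.free = list(range(1, num_blocks))           # erased blocks ready for writing
        self.active = 0                                  # block currently being appended to
        self.where = {}                                  # logical page -> block holding it
        self.host_writes = 0
        self.flash_writes = 0

    def _program(self, lpn):
        if self.used[self.active] == self.ppb:  # active block is full: open a fresh one
            self.active = self.free.pop()
        self.valid[self.active].add(lpn)
        self.used[self.active] += 1
        self.where[lpn] = self.active
        self.flash_writes += 1

    def _collect(self):
        # Greedy GC: pick the full block with the fewest live pages.
        full = [b for b in range(len(self.used))
                if b != self.active and self.used[b] == self.ppb]
        victim = min(full, key=lambda b: len(self.valid[b]))
        survivors = sorted(self.valid[victim])
        self.valid[victim].clear()
        self.used[victim] = 0                    # erase the block
        self.free.append(victim)
        for lpn in survivors:                    # live pages must be rewritten elsewhere
            self._program(lpn)

    def write(self, lpn):
        old = self.where.get(lpn)
        if old is not None:
            self.valid[old].discard(lpn)         # mark the previous copy as garbage
        self.host_writes += 1
        self._program(lpn)
        while len(self.free) < 2:                # keep a spare block for GC to copy into
            self._collect()


def measure_write_amplification(pattern, logical_pages=1536, rounds=10):
    sim = FlashSim(num_blocks=64, pages_per_block=32)
    rng = random.Random(42)
    for lpn in range(logical_pages):             # fill the drive once
        sim.write(lpn)
    sim.host_writes = sim.flash_writes = 0       # measure only the overwrites
    for _ in range(rounds):
        for i in range(logical_pages):
            lpn = i if pattern == "sequential" else rng.randrange(logical_pages)
            sim.write(lpn)
    return sim.flash_writes / sim.host_writes


sequential_wa = measure_write_amplification("sequential")
random_wa = measure_write_amplification("random")
```

In this model the sequential workload always finds a fully invalidated block to erase, so no live pages are copied and the factor is exactly 1. The random workload forces the collector to move live pages, so its factor is greater than 1.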
