CASSANDRA & SOLID
STATE DRIVES
Rick Branson, DataStax
FACT

CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
 FOR SPINNING DISKS
LSM-TREES
WRITE PATH
insert({ cf1: { row1: { col3: foo } } })

Client -> Cassandra

On-Disk Node Commit Log
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }

In-Memory Memtable for “cf1”
row1   col1: [del]   col2: “def”   col3: “foo”
row2   col1: “xyz”

COMMIT
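The write path on this slide can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, the in-memory list standing in for the on-disk commit log, and the `DELETED` marker are all hypothetical names chosen for illustration.

```python
DELETED = "[del]"  # hypothetical tombstone marker, mirroring the slide


class Node:
    """Toy model of one node's write path: commit log first, then memtable."""

    def __init__(self):
        self.commit_log = []  # stands in for the on-disk, append-only log
        self.memtables = {}   # column family -> {row key -> {column -> value}}

    def insert(self, mutation):
        # 1. Append the mutation to the commit log; earlier entries are never modified.
        self.commit_log.append(mutation)
        # 2. Apply the mutation to the in-memory memtable of each column family.
        for cf, rows in mutation.items():
            memtable = self.memtables.setdefault(cf, {})
            for row_key, columns in rows.items():
                memtable.setdefault(row_key, {}).update(columns)


node = Node()
node.insert({"cf1": {"row1": {"col1": "abc"}}})
node.insert({"cf1": {"row1": {"col2": "def"}}})
node.insert({"cf1": {"row1": {"col1": DELETED}}})
node.insert({"cf1": {"row2": {"col1": "xyz"}}})
node.insert({"cf1": {"row1": {"col3": "foo"}}})
print(node.memtables["cf1"])
```

After the five inserts the commit log holds five entries in arrival order, while the memtable holds only the latest value of each column, matching the slide.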
In-Memory Memtable for “cf1”
row1   col1: [del]   col2: “def”   col3: “foo”
row2   col1: “xyz”

SSTable 1   SSTable 2   SSTable 3   SSTable 4

FLUSH
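A flush can be pictured as writing the memtable out once, in sorted order, as an immutable table. The sketch below is an assumption-laden illustration: the `flush` function and the nested-tuple layout are hypothetical and say nothing about the real SSTable file format.

```python
def flush(memtable):
    """Return an immutable, key-sorted snapshot of a memtable (a toy 'SSTable')."""
    return tuple(
        (row_key, tuple(sorted(columns.items())))
        for row_key, columns in sorted(memtable.items())
    )


memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col1": "[del]", "col2": "def", "col3": "foo"},
}
sstable = flush(memtable)  # written once, front to back, in sorted order
memtable.clear()           # the memtable starts over; the snapshot is never edited
print(sstable)
```

Tuples are used so that the snapshot cannot be modified after the flush, which is the property the deck relies on later.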
SSTable 1   SSTable 2   SSTable 3   SSTable 4

SSTable

SSTables are merged to maintain read performance

COMPACT
X SSTable   X SSTable   X SSTable   X SSTable

SSTable

New SSTable is streamed to disk and old SSTables are erased
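Because every SSTable is already sorted, compaction amounts to a streaming merge. Here is a minimal sketch using Python's `heapq.merge`, assuming each table is a key-sorted list of `(key, value)` pairs and that later tables in the list are newer. The `compact` function is hypothetical, and purging of deleted data is left out.

```python
import heapq


def compact(sstables):
    """Merge key-sorted SSTables (oldest first) into one key-sorted SSTable."""
    # Tag each entry with the negated table age so that, for equal keys,
    # the entry from the newest table sorts first.
    tagged = (
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first entry seen per key is the newest
    return merged


old = [("a", 1), ("c", 3)]
new = [("a", 10), ("b", 2)]
print(compact([old, new]))  # [('a', 10), ('b', 2), ('c', 3)]
```

Each input is read once from start to finish and the output is produced in order, so the new table can be streamed to disk sequentially. For a fixed number of input tables the work grows linearly with the total number of entries.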
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
IMPORTANT
COMPARED
• Most popular data storage engines
  rewrite modified data in-place: MySQL
  (InnoDB), PostgreSQL, Oracle,
  MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
  writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Random access limited by the time it takes for the drive to rotate: IOPS ≈ RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec for modern SATA drives
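The gap between random and sequential access follows directly from these numbers. The arithmetic below is a rough sketch: the 4KB size of each random operation is an assumption, not a figure from the slide.

```python
PAGE_BYTES = 4 * 1024  # assumed size of one random operation


def rotational_iops(rpm):
    """Slide's rule of thumb: about one random operation per platter rotation."""
    return rpm / 60


random_mb_per_sec = rotational_iops(7200) * PAGE_BYTES / 1_000_000
sequential_mb_per_sec = 125  # figure from the slide

print(rotational_iops(7200))   # 120.0
print(rotational_iops(15000))  # 250.0
print(random_mb_per_sec)       # about half a megabyte per second
print(sequential_mb_per_sec / random_mb_per_sec)
```

Under that assumption, a 7,200 RPM drive moves roughly 250 times more data per second sequentially than it does with small random operations.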
THAT WAS THE WORLD
IN WHICH CASSANDRA
     WAS BUILT
2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* Based on specifications provided by Intel for the 300GB Intel 320 drive
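The cost-per-IOPS comparison can be checked with a little arithmetic. The slide does not say which figures went into it. The sketch below assumes the SSD number uses write IOPS and that the spinning disk costs $150, which is simply the price implied by $1.25 × 120 IOPS.

```python
def dollars_per_iops(drive_price, iops):
    return drive_price / iops


ssd_price = 1.75 * 300  # $/GB street price times 300GB capacity
hdd_price = 150.0       # assumed: the price implied by $1.25 x 120 IOPS

print(round(dollars_per_iops(ssd_price, 23_000), 2))  # 0.02 (write IOPS)
print(round(dollars_per_iops(ssd_price, 39_500), 2))  # 0.01 (read IOPS)
print(dollars_per_iops(hdd_price, 120))               # 1.25
```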
WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
  LSM-TREES OBSOLETE?
SOLID STATE HAS
SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase
  first, then write
• Can write in small increments (4KB),
  but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
  for each erase block
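These figures show why overwriting in place would be so costly. The sketch below assumes a naive drive that erases the whole block and rewrites every page to change one of them. It ignores the time to read the block first and any parallelism inside the drive.

```python
PAGE_KB = 4
ERASE_BLOCK_KB = 512
WRITE_US = 100   # microseconds per page write
ERASE_US = 2000  # microseconds per block erase

pages_per_block = ERASE_BLOCK_KB // PAGE_KB


def naive_overwrite_us():
    """Cost of changing one page by erasing and rewriting its whole block."""
    return ERASE_US + pages_per_block * WRITE_US


print(pages_per_block)                  # 128
print(naive_overwrite_us())             # 14800
print(naive_overwrite_us() / WRITE_US)  # 148.0
```

Under these assumptions a single in-place page update costs about 148 times as much as a plain page write, which is why drives avoid doing it.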
WEAR LEVELING is used
to spread erase operations
evenly across blocks
WEAR LEVELING
[Diagram: an erase block made up of disk pages; Write 1, Write 2, and Write 3 fill successive pages]
Remember: the whole block must be erased
[Diagram: erase block holding Write 1, Write 2, and Write 3]
How is data from only Write 2 modified?
[Diagram: the page holding Write 2 is marked as garbage, and the modified data is appended to an empty block]
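The mark-and-append behavior can be modeled with a small mapping table. This is a toy sketch, not how any particular drive's firmware works: the `ToyFlash` class and its fields are hypothetical, and garbage collection (and running out of empty pages) is deliberately left out.

```python
class ToyFlash:
    """Toy flash translation layer: never overwrite, only append and mark."""

    def __init__(self, pages):
        self.pages = [None] * pages  # physical pages; None means erased/empty
        self.garbage = set()         # physical pages holding stale data
        self.mapping = {}            # logical page -> physical page
        self.next_free = 0

    def write(self, logical, data):
        if logical in self.mapping:
            # The old copy cannot be overwritten, so mark it as garbage.
            self.garbage.add(self.mapping[logical])
        # Append the data to the next empty page and repoint the mapping.
        self.pages[self.next_free] = data
        self.mapping[logical] = self.next_free
        self.next_free += 1

    def read(self, logical):
        return self.pages[self.mapping[logical]]


flash = ToyFlash(pages=8)
flash.write(1, "write 1")
flash.write(2, "write 2")
flash.write(3, "write 3")
flash.write(2, "write 2 (modified)")
print(flash.garbage)  # {1}: the physical page that held the old Write 2
print(flash.read(2))  # write 2 (modified)
```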
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
  background... as much as possible
• If no empty blocks are available, GC
  must be done before ANY writes can
  complete
WRITE AMPLIFICATION
• When only a few kilobytes are written,
  but fragmentation causes a whole
  block to be rewritten
• The smaller & more random the writes,
  the worse this gets
• Modern “mark and sweep” GC reduces
  it, but cannot eliminate it
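Write amplification is usually expressed as the ratio of data written to flash to data written by the host. The worst and best cases can be sketched with the 4KB page and 512KB erase block sizes quoted earlier in the deck. The function name is an illustrative choice.

```python
def write_amplification(host_kb, flash_kb):
    """Ratio of data physically written to flash to data the host asked to write."""
    return flash_kb / host_kb


# One 4KB host write that forces a whole 512KB block to be rewritten.
print(write_amplification(host_kb=4, flash_kb=512))    # 128.0

# A sequential write that fills the whole block with new data.
print(write_amplification(host_kb=512, flash_kb=512))  # 1.0
```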
Torture test shows massive
             write performance drop-off
            for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives
      COMPLETELY fall apart


Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive
suffers significantly from the
         torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks
were marked empty, and the
   “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems don’t typically erase data immediately when files are deleted; they just
  mark it as deleted and erase it later
• TRIM allows the OS to actively tell the drive
  when a region of disk is no longer used
• If an entire erase block is marked as
  unused, GC is avoided; otherwise, TRIM
  just hastens the collection process
TRIM only reduces the
write amplification effect;
   it can’t eliminate it.
THEN THERE’S
 LIFETIME...
AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
   which causes around 10x write amplification
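A back-of-the-envelope model shows how write amplification eats into lifetime. This is not AnandTech's method. It is a simplified sketch that assumes perfect wear leveling, uses the 300GB capacity and ~5,000 cycles quoted earlier in the deck, and picks a hypothetical host load of 275GB/day purely for illustration.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amp):
    """Years until every block has used up its erase cycles (idealized)."""
    total_flash_gb = capacity_gb * erase_cycles   # total data the flash can absorb
    flash_gb_per_day = host_gb_per_day * write_amp
    return total_flash_gb / flash_gb_per_day / 365


HOST_GB_PER_DAY = 275  # hypothetical load, chosen only for illustration

random_heavy = lifetime_years(300, 5000, HOST_GB_PER_DAY, write_amp=10)
sequential = lifetime_years(300, 5000, HOST_GB_PER_DAY, write_amp=1)
print(random_heavy)               # roughly 1.5 years
print(sequential / random_heavy)  # about 10x longer at write amplification 1
```

In this model lifetime is inversely proportional to write amplification, so the same host workload lasts about ten times longer at 1x than at 10x.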
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
CASSANDRA
 ONLY WRITES
SEQUENTIALLY
“For a sequential write workload,
     write amplification is equal to 1,
          i.e., there is no write
              amplification.”


Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
        Performance: Understanding, Analysis, and Performance Modeling”
THANK YOU.
     ~ @rbranson
