Cassandra and Solid State Drives
 

By Rick Branson of Instagram

    Cassandra and Solid State Drives Presentation Transcript

    • CASSANDRA & SOLID STATE DRIVES. Rick Branson, DataStax
    • FACT: CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS
    • LSM-TREES
    • WRITE PATH
    • [Diagram: write path] A client sends insert({ cf1: { row1: { col3: foo } } }) to a Cassandra node. The node appends each mutation to the on-disk commit log:
      { cf1: { row1: { col1: abc } } }
      { cf1: { row1: { col2: def } } }
      { cf1: { row1: { col1: <del> } } }
      { cf1: { row2: { col1: xyz } } }
      { cf1: { row1: { col3: foo } } }
      and applies it to the in-memory memtable for “cf1”, which then holds:
      row1: col1: [del], col2: “def”, col3: “foo”
      row2: col1: “xyz”
      COMMIT
    • [Diagram: FLUSH] The in-memory memtable for “cf1” (row1: col1: [del], col2: “def”, col3: “foo”; row2: col1: “xyz”) is flushed to disk. The diagram shows SSTables numbered 1 to 4 on disk.
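The write path and flush shown in the two diagrams above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, its method names, and the use of `None` as a delete marker are all assumptions made for illustration.

```python
import json


class Node:
    """Toy single-node write path: commit log append, memtable update, flush."""

    def __init__(self):
        self.commit_log = []  # stands in for the append-only on-disk log
        self.memtable = {}    # row key -> {column: value}; None marks a delete
        self.sstables = []    # each flush adds one immutable, sorted table

    def insert(self, row, column, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(json.dumps({row: {column: value}}))
        # 2. Apply it to the in-memory memtable.
        self.memtable.setdefault(row, {})[column] = value

    def delete(self, row, column):
        # A delete is just another write: it records a marker (None here).
        self.insert(row, column, None)

    def flush(self):
        # Write the memtable out in sorted order as one immutable table.
        sstable = tuple(
            (row, tuple(sorted(cols.items())))
            for row, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log
        return sstable


node = Node()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.delete("row1", "col1")
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
print(node.memtable)
```

The five calls mirror the five commit log entries on the slide. Nothing in this sketch ever modifies data already written; every operation is an append or a whole-table write.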
    • [Diagram: COMPACT] SSTables 1, 2, 3, and 4. SSTables are merged to maintain read performance.
    • [Diagram] New SSTable is streamed to disk and old SSTables are erased (the old SSTables are crossed out).
    • TAKEAWAYS (IMPORTANT)
      • All disk writes are sequential, append-only operations
      • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
      • SSTables are completely immutable
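To see why sorted tables make compaction a single linear pass, here is a minimal sketch using a k-way merge. The `(key, value)` table layout, the `compact` function, and `TOMBSTONE = None` are illustrative assumptions. The sketch also drops delete markers outright, which a real system would only do under additional conditions.

```python
import heapq

TOMBSTONE = None  # hypothetical delete marker


def compact(sstables):
    """Merge sorted tables (oldest first) into one sorted table.

    Each table is a list of (key, value) pairs sorted by key. Inputs are
    only read, never modified; the result is a brand-new table.
    """
    # Tag each entry with its table's age so the newest version sorts first.
    tagged = (
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


old = [("a", "1"), ("b", "2"), ("c", "3")]
new = [("b", TOMBSTONE), ("c", "30"), ("d", "4")]
print(compact([old, new]))
```

Each input is read once from front to back and the output is written once from front to back, so the I/O pattern is sequential on both sides.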
    • COMPARED
      • Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.
      • Most perform similar buffering of writes before flushing to disk
      • ... but flushes are RANDOM writes
    • SPINNING DISKS
      • Dirt cheap: $0.08/GB
      • Seek time limited by time it takes for drive to rotate: IOPS = RPM/60
      • 7,200 RPM = ~120 IOPS
      • 15,000 RPM has been the max for decades
      • Sequential operations are best: 125MB/sec for modern SATA drives
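The slide's rule of thumb can be worked through directly. The 4KB size of a random I/O is an assumption added here to make the comparison with sequential throughput concrete; the slide does not state one.

```python
def iops_from_rpm(rpm):
    # The slide's rule of thumb: roughly one random I/O per rotation.
    return rpm / 60


def random_mb_per_sec(iops, io_kb=4):
    # io_kb is an assumed I/O size, not a figure from the slide.
    return iops * io_kb / 1000


SEQUENTIAL_MB_PER_SEC = 125  # from the slide

for rpm in (7_200, 15_000):
    iops = iops_from_rpm(rpm)
    random_mb = random_mb_per_sec(iops)
    print(rpm, iops, random_mb, SEQUENTIAL_MB_PER_SEC / random_mb)
```

Under these assumptions a 7,200 RPM drive moves well under 1MB/sec when doing small random I/O, which is why a storage engine built for spinning disks avoids random writes.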
    • THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT
    • 2012: MLC NAND FLASH*
      • Affordable: ~$1.75/GB street
      • Massive IOPS: 39,500/sec read, 23,000/sec write
      • Latency of less than 100µs
      • Good sequential throughput: 270MB/sec read, 205MB/sec write
      • Way cheaper per IOPS: $0.02 vs $1.25
      * based on specifications provided by Intel for 300GB Intel 320 drive
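The cost-per-IOPS comparison can be approximated from the figures on the slides. The slide does not say which IOPS figure or which spinning disk was used, so dividing by write IOPS and the "implied" disk numbers below are assumptions, shown only to make the arithmetic visible.

```python
def cost_per_iops(price_per_gb, capacity_gb, iops):
    return price_per_gb * capacity_gb / iops


# SSD: 300GB at ~$1.75/GB, using the 23,000/sec write figure (an assumption).
ssd = cost_per_iops(1.75, 300, 23_000)
print(round(ssd, 2))

# Spinning disk: the slide gives $1.25 per IOPS but no drive size. At ~120 IOPS
# and $0.08/GB that would imply roughly this drive price and capacity.
implied_price = 1.25 * 120
implied_capacity_gb = implied_price / 0.08
print(implied_price, implied_capacity_gb)
```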
    • WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?
    • SOLID STATE HAS SOME MAJOR BUTS...
    • ... BUT
      • Cannot overwrite directly: must erase first, then write
      • Can write in small increments (4KB), but only erase in ~512KB blocks
      • Latency: write is ~100µs, erase is ~2ms
      • Limited durability: ~5,000 cycles (MLC) for each erase block
    • WEAR LEVELING is used to reduce the number of total erase operations
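A back-of-envelope calculation shows what these limits would mean if a drive naively updated data in place. It uses only the slide's figures; the "erase the block, then rewrite every page" model is a hypothetical worst case that ignores read time and any controller cleverness.

```python
PAGE_KB = 4            # smallest write unit, from the slide
ERASE_BLOCK_KB = 512   # approximate erase unit, from the slide
WRITE_US = 100         # page write latency, from the slide
ERASE_US = 2_000       # block erase latency (~2ms), from the slide

pages_per_block = ERASE_BLOCK_KB // PAGE_KB


def naive_in_place_update_us():
    # Hypothetical worst case: erase the whole block, then rewrite every page.
    return ERASE_US + pages_per_block * WRITE_US


def append_update_us(pages_changed=1):
    # Writing the changed pages to already-erased space instead.
    return pages_changed * WRITE_US


print(pages_per_block, naive_in_place_update_us(), append_update_us())
```

The gap between the two numbers is the reason drives append modified data elsewhere rather than erasing in place.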
    • [Diagram sequence: WEAR LEVELING] An Erase Block is divided into Disk Pages. Write 1, Write 2, and Write 3 are appended in turn.
    • [Diagram] Write 1, Write 2, Write 3. How is data from only Write 2 modified? Remember: the whole block must be erased
    • [Diagram] Mark Garbage; Append Modified Data; Empty Block
    • Wait... GARBAGE?
    • THAT MEANS...
    • ... fragmentation, WHICH MEANS...
    • Garbage Collection!
    • GARBAGE COLLECTION
      • Compacts fragmented disk blocks
      • Erase operations drag on performance
      • Modern SSDs do this in the background... as much as possible
      • If no empty blocks are available, GC must be done before ANY writes can complete
    • WRITE AMPLIFICATION
      • When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten
      • The smaller & more random the writes, the worse this gets
      • Modern “mark and sweep” GC reduces it, but cannot eliminate it
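Write amplification can be observed in a small simulation. The sketch below is a toy flash translation layer, not how any particular drive works: the `TinySSD` class, greedy victim selection, the block and page counts, and the 25% spare capacity are all assumptions. Wear leveling is not modeled.

```python
import random


class TinySSD:
    """Toy flash model: append-only page writes, greedy garbage collection."""

    def __init__(self, num_blocks=64, pages_per_block=16):
        self.ppb = pages_per_block
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.used = [0] * num_blocks    # pages programmed since last erase
        self.valid = [0] * num_blocks   # pages still holding live data
        self.free = list(range(1, num_blocks))  # erased, unused blocks
        self.active = 0                 # block currently being appended to
        self.where = {}                 # logical page -> (block, page)
        self.host_writes = 0
        self.flash_writes = 0

    def _program(self, lpn):
        if self.used[self.active] == self.ppb:
            self.active = self.free.pop()
        old = self.where.get(lpn)
        if old is not None:             # mark the previous copy as garbage
            old_block, old_page = old
            self.blocks[old_block][old_page] = None
            self.valid[old_block] -= 1
        block, page = self.active, self.used[self.active]
        self.blocks[block][page] = lpn  # append to the active block
        self.used[block] += 1
        self.valid[block] += 1
        self.where[lpn] = (block, page)
        self.flash_writes += 1

    def _collect(self):
        # Greedy GC: pick the full block with the fewest live pages,
        # copy those pages elsewhere, then erase the block.
        full = [b for b in range(len(self.blocks))
                if b != self.active and self.used[b] == self.ppb]
        victim = min(full, key=lambda b: self.valid[b])
        for lpn in [p for p in self.blocks[victim] if p is not None]:
            self._program(lpn)
        self.blocks[victim] = [None] * self.ppb
        self.used[victim] = 0
        self.valid[victim] = 0
        self.free.append(victim)

    def write(self, lpn):
        self.host_writes += 1
        while len(self.free) < 2:       # GC must finish before the write can
            self._collect()
        self._program(lpn)


def write_amplification(pattern, num_blocks=64, pages_per_block=16,
                        passes=10, seed=0):
    ssd = TinySSD(num_blocks, pages_per_block)
    logical_pages = (num_blocks * 3 // 4) * pages_per_block  # 25% spare
    rng = random.Random(seed)
    for _ in range(passes):
        for i in range(logical_pages):
            lpn = i if pattern == "sequential" else rng.randrange(logical_pages)
            ssd.write(lpn)
    return ssd.flash_writes / ssd.host_writes


if __name__ == "__main__":
    print("sequential:", write_amplification("sequential"))
    print("random:    ", write_amplification("random"))
```

In this model, sequential overwrites invalidate whole blocks at a time, so garbage collection finds blocks with nothing left to copy. Random overwrites scatter garbage across many blocks, so every collection has to copy live pages first.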
    • Torture test shows massive write performance drop-off for heavily fragmented drive. Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
    • Some poorly designed drives COMPLETELY fall apart. Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
    • Even a well-behaved drive suffers significantly from the torture test. Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
    • Post-torture, all disk blocks were marked empty, and the “fast” comes back... Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
    • “TRIM”
      • Filesystems don’t typically immediately erase data when files are deleted, they just mark them as deleted and erase later
      • TRIM allows the OS to actively tell the drive when a region of disk is no longer used
      • If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
    • TRIM only reduces the write amplification effect; it can’t eliminate it.
    • THEN THERE’S LIFETIME...
    • AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
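A rough lifetime estimate shows how directly write amplification eats into endurance. The formula assumes perfectly even wear across all erase blocks, and the 250GB/day host write rate is a hypothetical number chosen for illustration; it is not taken from the AnandTech estimate.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amplification):
    # Total data the flash can absorb, assuming perfectly even wear.
    total_flash_gb = capacity_gb * erase_cycles
    # What the flash actually sees per day once amplification is included.
    flash_gb_per_day = host_gb_per_day * write_amplification
    return total_flash_gb / flash_gb_per_day / 365


# 300GB drive and ~5,000 cycles are from the slides; 250GB/day is hypothetical.
amplified = lifetime_years(300, 5_000, 250, write_amplification=10)
sequential = lifetime_years(300, 5_000, 250, write_amplification=1)
print(round(amplified, 2), round(sequential, 2))
```

Whatever the real write rate is, the ratio between the two results equals the write amplification factor.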
    • REMEMBER THIS?
    • TAKEAWAYS
      • All disk writes are sequential, append-only operations
      • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
      • SSTables are completely immutable
    • CASSANDRA ONLY WRITES SEQUENTIALLY
    • “For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.” Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
    • THANK YOU. ~ @rbranson