Cassandra and Solid State Drives
 

    Presentation Transcript

    • CASSANDRA & SOLID STATE DRIVES: Rick Branson, DataStax
    • FACT: CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS
    • LSM-TREES
    • WRITE PATH
    • [Diagram: write path] The client sends insert({ cf1: { row1: { col3: foo } } }) to a Cassandra node. On disk, the commit log holds the mutations in arrival order: { cf1: { row1: { col1: abc } } }, { cf1: { row1: { col2: def } } }, { cf1: { row1: { col1: <del> } } }, { cf1: { row2: { col1: xyz } } }, { cf1: { row1: { col3: foo } } }. In memory, the memtable for “cf1” holds row1 (col1: [del], col2: “def”, col3: “foo”) and row2 (col1: “xyz”). COMMIT
    • FLUSH: the in-memory memtable for “cf1” (row1 col1: [del], col2: “def”, col3: “foo”; row2 col1: “xyz”) is written to disk as an SSTable. [Diagram: SSTables 1, 2, 3, 4]
    • COMPACT: SSTables are merged to maintain read performance. [Diagram: SSTables 1, 2, 3, 4 merged into one SSTable]
    • New SSTable is streamed to disk and old SSTables are erased. [Diagram: the four old SSTables crossed out next to the new SSTable]
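The write path and flush shown on these slides can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, its method names, and the `"[del]"` tombstone marker are all assumptions made for illustration, and the "commit log" is just an in-memory list standing in for an append-only file.

```python
import json


class Node:
    """Toy single-node write path: commit log append, memtable update, flush."""

    TOMBSTONE = "[del]"  # hypothetical stand-in marker for a deleted column

    def __init__(self):
        self.commit_log = []  # stand-in for the on-disk, append-only log
        self.memtables = {}   # column family -> {row -> {column -> value}}
        self.sstables = []    # flushed, immutable tables (oldest first)

    def insert(self, mutation):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(json.dumps(mutation))
        # 2. Apply it to the in-memory memtable for its column family.
        for cf, rows in mutation.items():
            for row, cols in rows.items():
                self.memtables.setdefault(cf, {}).setdefault(row, {}).update(cols)

    def flush(self, cf):
        # Write the memtable out in sorted order as one immutable SSTable.
        memtable = self.memtables.pop(cf, {})
        sstable = tuple(
            (row, col, value)
            for row in sorted(memtable)
            for col, value in sorted(memtable[row].items())
        )
        self.sstables.append(sstable)
        return sstable


node = Node()
node.insert({"cf1": {"row1": {"col1": "abc"}}})
node.insert({"cf1": {"row1": {"col2": "def"}}})
node.insert({"cf1": {"row1": {"col1": Node.TOMBSTONE}}})
node.insert({"cf1": {"row2": {"col1": "xyz"}}})
node.insert({"cf1": {"row1": {"col3": "foo"}}})
flushed = node.flush("cf1")
print(flushed)
```

The five inserts replay the mutations from the diagram; the flushed table ends up with the same rows and columns as the memtable on the slide, in sorted order.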
    • TAKEAWAYS
        • All disk writes are sequential, append-only operations
        • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
        • SSTables are completely immutable
    • [The TAKEAWAYS slide repeats with an “IMPORTANT” callout]
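Because every SSTable is already sorted, compaction can be a streaming merge rather than a sort. The sketch below uses Python's `heapq.merge` for that; the `compact` function, the `(key, value)` tuple layout, and the "newest table wins" rule are illustrative assumptions, not Cassandra's actual code. Each input entry is read once, so for a fixed number of input tables the work grows linearly with the total number of entries.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one sorted table.

    Each table is a list of (key, value) pairs sorted by key, with unique
    keys inside a table. On duplicate keys the newest table wins.
    """
    def tagged(index, table):
        # Negate the table index so that, for equal keys, newer tables
        # (higher index) sort first in the merged stream.
        for key, value in table:
            yield key, -index, value

    streams = [tagged(i, table) for i, table in enumerate(sstables)]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams):
        if key != last_key:  # first occurrence is the newest version
            merged.append((key, value))
            last_key = key
    return merged


old = [(("row1", "col1"), "abc"), (("row1", "col2"), "def")]
new = [(("row1", "col1"), "[del]"), (("row2", "col1"), "xyz")]
result = compact([old, new])
print(result)
```

The inputs are never modified, which matches the immutability takeaway: the merged table is built as a new sequential stream and the old tables can be dropped afterwards. This toy keeps tombstones in the output; when they can be purged is outside the scope of the slides.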
    • COMPARED
        • Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.
        • Most perform similar buffering of writes before flushing to disk
        • ... but flushes are RANDOM writes
    • SPINNING DISKS
        • Dirt cheap: $0.08/GB
        • Seek time limited by time it takes for drive to rotate: IOPS = RPM/60
        • 7,200 RPM = ~120 IOPS
        • 15,000 RPM has been the max for decades
        • Sequential operations are best: 125MB/sec for modern SATA drives
    • THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT
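The rule of thumb on this slide is easy to check numerically. The sketch below applies IOPS = RPM/60 and then compares random and sequential throughput; the 4KB size per random I/O is an assumption chosen for illustration (the slide does not give one), and the function names are made up for this example.

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: roughly one random I/O per platter rotation."""
    return rpm / 60


def random_throughput_mb_per_sec(rpm, io_size_kb=4):
    # io_size_kb is an assumed request size, not a figure from the slides.
    return rotational_iops(rpm) * io_size_kb / 1000


for rpm in (7_200, 15_000):
    print(rpm, "RPM:", rotational_iops(rpm), "IOPS,",
          random_throughput_mb_per_sec(rpm), "MB/sec random at 4KB")

SEQUENTIAL_MB_PER_SEC = 125  # from the slide
ratio = SEQUENTIAL_MB_PER_SEC / random_throughput_mb_per_sec(7_200)
print("sequential / random at 7,200 RPM:", round(ratio))
```

Under these assumptions a 7,200 RPM drive moves a few hundred kilobytes per second of small random I/O against 125MB/sec sequentially, which is the gap a sequential-only storage engine is designed around.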
    • 2012: MLC NAND FLASH*
        • Affordable: ~$1.75/GB street
        • Massive IOPS: 39,500/sec read, 23,000/sec write
        • Latency of less than 100µs
        • Good sequential throughput: 270MB/sec read, 205MB/sec write
        • Way cheaper per IOPS: $0.02 vs $1.25
        * Based on specifications provided by Intel for the 300GB Intel 320 drive
    • WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?
    • SOLID STATE HAS SOME MAJOR BUTS...
    • ... BUT
        • Cannot overwrite directly: must erase first, then write
        • Can write in small increments (4KB), but only erase in ~512KB blocks
        • Latency: write is ~100µs, erase is ~2ms
        • Limited durability: ~5,000 cycles (MLC) for each erase block
    • WEAR LEVELING is used to reduce the number of total erase operations
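The cost-per-IOPS comparison can be reproduced as drive price divided by IOPS. The slides do not say which figures went into the $0.02 and $1.25 values, so the sketch below is a guess at the arithmetic: it assumes the SSD number uses the 300GB capacity and the write IOPS, and it only back-computes the drive price that $1.25 per IOPS would imply for a ~120 IOPS disk.

```python
def cost_per_iops(price_per_gb, capacity_gb, iops):
    """Drive price (price per GB times capacity) divided by IOPS."""
    return price_per_gb * capacity_gb / iops


# SSD figures from the slide: ~$1.75/GB, 300GB, 23,000 write / 39,500 read IOPS.
ssd_by_write = cost_per_iops(1.75, 300, 23_000)
ssd_by_read = cost_per_iops(1.75, 300, 39_500)
print("SSD $/IOPS (write):", round(ssd_by_write, 3))
print("SSD $/IOPS (read): ", round(ssd_by_read, 3))

# The slide's $1.25 per IOPS for spinning disk, at ~120 IOPS, implies this
# drive price. Which drive was actually used is not stated in the slides.
implied_hdd_price = 1.25 * 120
print("Implied spinning-disk price: $", implied_hdd_price)
```

With the write IOPS figure the SSD comes out at roughly two cents per IOPS, consistent with the slide.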
    • WEAR LEVELING [Diagram, built up over several slides: an erase block divided into disk pages, filled in turn by Write 1, Write 2, and Write 3]
    • Remember: the whole block must be erased. How is data from only Write 2 modified? [Diagram: block holding Write 1, Write 2, Write 3]
    • Mark Garbage
    • [Diagram labels: Empty Block; Mark Garbage; Append Modified Data]
    • Wait... GARBAGE?
    • THAT MEANS...
    • ... fragmentation, WHICH MEANS...
    • Garbage Collection!
    • GARBAGE COLLECTION
        • Compacts fragmented disk blocks
        • Erase operations drag on performance
        • Modern SSDs do this in the background... as much as possible
        • If no empty blocks are available, GC must be done before ANY writes can complete
    • WRITE AMPLIFICATION
        • When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten
        • The smaller & more random the writes, the worse this gets
        • Modern “mark and sweep” GC reduces it, but cannot eliminate it
    • Torture test shows massive write performance drop-off for heavily fragmented drive. Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
    • Some poorly designed drives COMPLETELY fall apart. Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
    • Even a well-behaved drive suffers significantly from the torture test. Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
    • Post-torture, all disk blocks were marked empty, and the “fast” comes back... Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
    • “TRIM”
        • Filesystems don’t typically immediately erase data when files are deleted, they just mark them as deleted and erase later
        • TRIM allows the OS to actively tell the drive when a region of disk is no longer used
        • If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
    • TRIM only reduces the write amplification effect, it can’t eliminate it.
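Write amplification can be made concrete with a small simulation. The sketch below is a toy flash translation layer, not a model of any real drive: the `ToyFlash` class, the 16 blocks of 128 pages (512KB / 4KB, taken from the earlier slide), the greedy "fewest live pages" victim choice, and the logical address space of half the physical pages are all assumptions made for illustration. Writes are appended to an open block, old copies are marked as garbage, and garbage collection relocates live pages before erasing a block. Write amplification is flash page writes divided by host page writes.

```python
import random


class ToyFlash:
    """Toy flash translation layer: append-only page writes, greedy block GC."""

    def __init__(self, blocks=16, pages_per_block=128):
        self.pages_per_block = pages_per_block
        self.valid = [set() for _ in range(blocks)]  # live logical pages per block
        self.used = [0] * blocks                     # pages written since last erase
        self.where = {}                              # logical page -> block
        self.free = list(range(1, blocks))           # erased blocks
        self.open = 0                                # block currently being filled
        self.host_writes = 0
        self.flash_writes = 0

    def _program(self, lpn):
        # Flash cannot overwrite in place: mark the old copy as garbage,
        # then append the new copy to the open block.
        if self.used[self.open] == self.pages_per_block:
            self.open = self.free.pop()
        old = self.where.get(lpn)
        if old is not None:
            self.valid[old].discard(lpn)
        self.valid[self.open].add(lpn)
        self.used[self.open] += 1
        self.where[lpn] = self.open
        self.flash_writes += 1

    def _collect(self):
        # Greedy GC: pick the full block with the fewest live pages,
        # relocate those pages, then erase the whole block.
        full = [b for b in range(len(self.used))
                if b != self.open and self.used[b] == self.pages_per_block]
        victim = min(full, key=lambda b: len(self.valid[b]))
        for lpn in list(self.valid[victim]):
            self._program(lpn)
        self.valid[victim].clear()
        self.used[victim] = 0
        self.free.append(victim)

    def write(self, lpn):
        # Keep one spare erased block so GC always has room to relocate into.
        while len(self.free) < 2:
            self._collect()
        self.host_writes += 1
        self._program(lpn)


def amplification(lpns):
    flash = ToyFlash()
    for lpn in lpns:
        flash.write(lpn)
    return flash.flash_writes / flash.host_writes


LOGICAL_PAGES = 1024  # half of the 16 * 128 physical pages
N = 20_000
rng = random.Random(0)
wa_sequential = amplification(i % LOGICAL_PAGES for i in range(N))
wa_random = amplification(rng.randrange(LOGICAL_PAGES) for _ in range(N))
print("sequential:", wa_sequential, "random:", wa_random)
```

In this toy model the sequential workload invalidates whole blocks at a time, so every block chosen for collection is already pure garbage and nothing has to be relocated. The random workload leaves live pages scattered across every block, so each erase drags extra page copies along with it.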
    • THEN THERE’S LIFETIME...
    • AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
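The link between write amplification and drive lifetime is simple arithmetic: total flash writes available divided by flash writes per day. The sketch below is a back-of-envelope model, not AnandTech's calculation, whose inputs the slides do not give. It assumes perfect wear leveling, reuses the 300GB capacity and ~5,000 erase cycles from the earlier slides, and picks a hypothetical host write rate (`HOST_GB_PER_DAY`) chosen so that 10x amplification lands near the quoted 1.5 years.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amplification):
    """Years until every block has used up its erase cycles.

    Assumes perfect wear leveling, so all blocks wear out together.
    """
    total_flash_gb = capacity_gb * erase_cycles
    flash_gb_per_day = host_gb_per_day * write_amplification
    return total_flash_gb / flash_gb_per_day / 365


HOST_GB_PER_DAY = 274  # hypothetical workload, not a figure from the slides

amplified = lifetime_years(300, 5_000, HOST_GB_PER_DAY, write_amplification=10)
sequential = lifetime_years(300, 5_000, HOST_GB_PER_DAY, write_amplification=1)
print("10x write amplification:", round(amplified, 1), "years")
print("no write amplification: ", round(sequential, 1), "years")
```

Whatever the real workload is, lifetime in this model scales inversely with write amplification, so the same host write rate at an amplification of 1 lasts ten times as long.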
    • REMEMBER THIS?
    • TAKEAWAYS
        • All disk writes are sequential, append-only operations
        • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
        • SSTables are completely immutable
    • CASSANDRA ONLY WRITES SEQUENTIALLY
    • “For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.” Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
    • THANK YOU. ~ @rbranson