CASSANDRA & SOLID STATE DRIVES
Rick Branson, DataStax
FACT

CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED
FOR SPINNING DISKS
LSM-TREES
WRITE PATH
insert({ cf1: { row1: { col3: foo } } })

Client -> Cassandra

On-Disk Node Commit Log:
  { cf1: { row1: { col1: abc } } }
  { cf1: { row1: { col2: def } } }
  { cf1: { row1: { col1: <del> } } }
  { cf1: { row2: { col1: xyz } } }
  { cf1: { row1: { col3: foo } } }

In-Memory Memtable for “cf1”:
  row1   col1: [del]   col2: “def”   col3: “foo”
  row2   col1: “xyz”

COMMIT
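The write path above can be sketched as a toy model. This is a minimal illustration, not Cassandra's actual code: the `Node` class, its method names, and the `TOMBSTONE` marker are hypothetical, and a Python list stands in for the on-disk commit log.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # hypothetical marker standing in for <del>


@dataclass
class Node:
    """Toy write path: append to the commit log, then update the memtable."""
    commit_log: list = field(default_factory=list)  # stands in for the on-disk, append-only log
    memtables: dict = field(default_factory=dict)   # one in-memory table per column family

    def insert(self, cf, row, col, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append((cf, row, col, value))
        # 2. Apply it to the in-memory memtable for that column family.
        self.memtables.setdefault(cf, {}).setdefault(row, {})[col] = value

    def delete(self, cf, row, col):
        # A delete is just another appended mutation carrying a tombstone.
        self.insert(cf, row, col, TOMBSTONE)


# Replay the five mutations shown in the commit log on the slide.
node = Node()
node.insert("cf1", "row1", "col1", "abc")
node.insert("cf1", "row1", "col2", "def")
node.delete("cf1", "row1", "col1")
node.insert("cf1", "row2", "col1", "xyz")
node.insert("cf1", "row1", "col3", "foo")
```

Nothing in the log is ever rewritten: the delete of col1 is a fifth-of-five appended record, and only the memtable reflects the latest state.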
In-Memory Memtable for “cf1”:
  row1   col1: [del]   col2: “def”   col3: “foo”
  row2   col1: “xyz”

SSTable 1   SSTable 2   SSTable 3   SSTable 4

FLUSH
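A flush can be pictured as sorting the memtable once and writing it out as a single run. The sketch below is an assumption-laden toy: `flush` is a hypothetical helper, the memtable is a nested dict, and an immutable tuple stands in for the on-disk SSTable file.

```python
def flush(memtable):
    """Turn a memtable into a sorted, immutable run of (row, column, value)."""
    entries = [
        (row, col, value)
        for row, cols in memtable.items()
        for col, value in cols.items()
    ]
    # Sort by (row, column) only; a tuple is used so the result cannot be mutated.
    return tuple(sorted(entries, key=lambda entry: entry[:2]))


# Memtable contents from the slide, deliberately inserted out of order.
memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col3": "foo", "col2": "def", "col1": "<del>"},
}
sstable = flush(memtable)
# sstable == (("row1", "col1", "<del>"), ("row1", "col2", "def"),
#             ("row1", "col3", "foo"), ("row2", "col1", "xyz"))
```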
SSTable 1   SSTable 2   SSTable 3   SSTable 4   ->   SSTable

SSTables are merged to maintain read performance

COMPACT
X SSTable   X SSTable   X SSTable   X SSTable

SSTable

New SSTable is streamed to disk and old SSTables are erased
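Because every SSTable is already sorted, compaction can be done as one streaming merge pass. The sketch below is illustrative only: `compact` is a hypothetical function, SSTables are modeled as sorted lists of (key, value) pairs, and tombstone purging is left out. It uses Python's `heapq.merge`, which costs O(N log k) for k input tables, so it is linear in N only when k is treated as a small constant.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one sorted SSTable.

    Each SSTable is a list of (key, value) pairs sorted by key, with unique
    keys. When a key appears in several tables, the newest table wins.
    """
    def tagged(age, table):
        # Tag entries so that, for equal keys, newer tables sort first.
        for key, value in table:
            yield key, age, value

    streams = [tagged(-index, table) for index, table in enumerate(sstables)]
    merged = []
    previous_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*streams):
        if key != previous_key:  # first occurrence is the newest version
            merged.append((key, value))
            previous_key = key
    return merged


older = [(("row1", "col1"), "abc"), (("row1", "col2"), "def")]
newer = [(("row1", "col1"), "ABC"), (("row2", "col1"), "xyz")]
new_sstable = compact([older, newer])
```

The inputs are only read and the output is only appended to, which is why the new SSTable can be streamed to disk sequentially before the old ones are deleted.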
TAKEAWAYS
• All disk writes are sequential, append-only
  operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction has linear
  complexity, O(N)
• SSTables are completely immutable
[TAKEAWAYS slide repeated, with an IMPORTANT callout]
COMPARED
• Most popular data storage engines
  rewrite modified data in-place: MySQL
  (InnoDB), PostgreSQL, Oracle,
  MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
  writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Random access limited by the time it takes for the
  drive to rotate: roughly IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec
  for modern SATA drives
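The rule of thumb above is easy to work through. In this sketch the function name is hypothetical and the 4 KB request size is an assumption chosen for illustration; the RPM values and the 125MB/sec figure come from the slide.

```python
def max_random_iops(rpm):
    """Rule of thumb from the slide: about one random I/O per revolution."""
    return rpm / 60


iops_7200 = max_random_iops(7_200)    # 120.0
iops_15000 = max_random_iops(15_000)  # 250.0

# Assumption: each random I/O moves 4 KB (1 MB taken as 1,000 KB).
IO_SIZE_KB = 4
random_mb_per_s = iops_7200 * IO_SIZE_KB / 1000  # about 0.48 MB/sec
sequential_mb_per_s = 125                        # figure from the slide

print(f"7,200 RPM:  ~{iops_7200:.0f} IOPS")
print(f"15,000 RPM: ~{iops_15000:.0f} IOPS")
print(f"random 4 KB I/O: ~{random_mb_per_s:.2f} MB/sec vs {sequential_mb_per_s} MB/sec sequential")
```

Under that assumption a 7,200 RPM drive moves well under 1 MB/sec of random 4 KB writes, which is the gap a sequential-only storage engine is designed around.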
THAT WAS THE WORLD
IN WHICH CASSANDRA
WAS BUILT
2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read,
  205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
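The cost-per-IOPS comparison can be reproduced approximately from the numbers on these slides. This is a back-of-the-envelope sketch: the slides do not say which IOPS figure or which spinning-disk price was used, so dividing by write IOPS and the implied disk price below are inferences, not the author's stated method.

```python
# SSD figures from the slide (300GB Intel 320).
SSD_CAPACITY_GB = 300
SSD_PRICE_PER_GB = 1.75
SSD_READ_IOPS = 39_500
SSD_WRITE_IOPS = 23_000

ssd_price = SSD_CAPACITY_GB * SSD_PRICE_PER_GB        # 525.0 dollars
ssd_cost_per_write_iops = ssd_price / SSD_WRITE_IOPS  # about 0.023
ssd_cost_per_read_iops = ssd_price / SSD_READ_IOPS    # about 0.013

# Spinning disk: $1.25 per IOPS at ~120 IOPS implies a drive price of
# roughly $150. This price is inferred, not stated in the slides.
HDD_IOPS = 120
HDD_COST_PER_IOPS = 1.25
hdd_implied_price = HDD_COST_PER_IOPS * HDD_IOPS      # 150.0 dollars

print(f"SSD: ${ssd_cost_per_write_iops:.2f} per write IOPS")
print(f"HDD: ${HDD_COST_PER_IOPS:.2f} per IOPS (implied drive price ${hdd_implied_price:.0f})")
```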
WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
LSM-TREES OBSOLETE?
SOLID STATE HAS
SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase
  first, then write
• Can write in small increments (4KB),
  but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
  for each erase block
WEAR LEVELING is used
to spread erase operations
evenly across blocks
WEAR LEVELING
Erase Block
 Disk Page
  Write 1
  Write 2
  Write 3
Remember: the whole block must be erased

  Write 1
  Write 2
  Write 3

How is data from only Write 2 modified?
Empty Block
Mark Garbage
Append Modified Data
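The "mark garbage, append modified data" step can be modeled with a tiny erase block. This is a simplified sketch, not how any particular SSD controller works: `EraseBlock` and its methods are hypothetical, the block has only four pages, and the modified data lands in a free page of the same block, whereas a real drive may place it in any free page.

```python
FREE, VALID, GARBAGE = "free", "valid", "garbage"


class EraseBlock:
    """Toy erase block: pages are written once and never overwritten in place."""

    def __init__(self, pages=4):
        self.state = [FREE] * pages
        self.data = [None] * pages

    def append(self, payload):
        """Program the next free page and return its index."""
        page = self.state.index(FREE)  # raises ValueError if the block is full
        self.state[page] = VALID
        self.data[page] = payload
        return page

    def update(self, page, payload):
        """Out-of-place update: mark the old page garbage, append the new data."""
        self.state[page] = GARBAGE
        return self.append(payload)


block = EraseBlock()
for write in ("write 1", "write 2", "write 3"):
    block.append(write)

new_page = block.update(1, "write 2 (modified)")
# block.state is now ["valid", "garbage", "valid", "valid"]
```

The garbage page still occupies space until the whole block is erased, which is where the fragmentation in the next slides comes from.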
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
  background... as much as possible
• If no empty blocks are available, GC
  must be done before ANY writes can
  complete
WRITE AMPLIFICATION
• Occurs when only a few kilobytes are written,
  but fragmentation causes a whole
  block to be rewritten
• The smaller & more random the writes,
  the worse this gets
• Modern “mark and sweep” GC reduces
  it, but cannot eliminate it
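Write amplification is usually expressed as the ratio of data physically written to flash to data the host asked to write. The sketch below works through the two extremes using the 4KB page and ~512KB erase block sizes quoted earlier in the deck; the function name is hypothetical and the worst case assumes one small write forces a full block rewrite.

```python
PAGE_KB = 4     # smallest write unit, from the earlier slide
BLOCK_KB = 512  # approximate erase block size, from the earlier slide


def write_amplification(logical_kb, physical_kb):
    """Ratio of data physically written to data the host asked to write."""
    return physical_kb / logical_kb


# Worst case: a single 4 KB write forces a whole 512 KB block to be rewritten.
worst_case = write_amplification(PAGE_KB, BLOCK_KB)   # 128.0

# Best case: the host writes a full block's worth of data sequentially.
sequential = write_amplification(BLOCK_KB, BLOCK_KB)  # 1.0

print(f"worst case: {worst_case:.0f}x, sequential: {sequential:.0f}x")
```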
Torture test shows massive
write performance drop-off
for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives
COMPLETELY fall apart


Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive
suffers significantly from the
torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks
were marked empty, and the
“fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems typically don’t erase data
  immediately when files are deleted; they just
  mark them as deleted and erase later
• TRIM allows the OS to actively tell the drive
  when a region of disk is no longer used
• If an entire erase block is marked as
  unused, GC is avoided, otherwise TRIM
  just hastens the collection process
TRIM only reduces the
write amplification effect;
it can’t eliminate it.
THEN THERE’S
 LIFETIME...
AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
which causes around 10x write amplification
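A rough way to see why write amplification matters for lifetime: the amount of data a host can write before the flash wears out scales inversely with it. This sketch is an idealized model, assuming perfect wear leveling and reusing the 300GB capacity and ~5,000 cycle figures from earlier slides; the function name is hypothetical and the results are not AnandTech's numbers.

```python
def host_writes_before_wearout_gb(capacity_gb, erase_cycles, write_amplification):
    """Idealized endurance: total flash writes divided by write amplification.

    Assumes perfect wear leveling, so every block reaches its cycle limit
    at the same time.
    """
    return capacity_gb * erase_cycles / write_amplification


# 300 GB drive, ~5,000 cycles per erase block (figures from earlier slides).
random_heavy = host_writes_before_wearout_gb(300, 5_000, 10)  # 150000.0 GB
sequential = host_writes_before_wearout_gb(300, 5_000, 1)     # 1500000.0 GB

print(f"10x write amplification: {random_heavy / 1000:.0f} TB of host writes")
print(f"1x write amplification:  {sequential / 1000:.0f} TB of host writes")
```

In this model, dropping write amplification from 10x to 1x lets the same drive absorb ten times as many host writes.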
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential, append-only
  operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction has linear
  complexity, O(N)
• SSTables are completely immutable
CASSANDRA
 ONLY WRITES
SEQUENTIALLY
“For a sequential write workload,
write amplification is equal to 1,
i.e., there is no write
amplification.”


Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
        Performance: Understanding, Analysis, and Performance Modeling”
THANK YOU.
     ~ @rbranson
