CASSANDRA & SOLID
STATE DRIVES
Rick Branson, DataStax
FACT

CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
 FOR SPINNING DISKS
LSM-TREES
WRITE PATH
COMMIT

insert({ cf1: { row1: { col3: foo } } })

Client -> Cassandra

On-Disk Node Commit Log
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }

In-Memory Memtable for “cf1”
row1   col1: [del]   col2: “def”   col3: “foo”
row2   col1: “xyz”
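The commit step above can be sketched in a few lines of Python. This is a toy model, not Cassandra's code: the `Node` class, its method names, and the `TOMBSTONE` marker are all hypothetical, and a single column family is assumed. It replays the five mutations from the slide and also shows how a memtable could be written out in sorted order.

```python
import json

TOMBSTONE = "[del]"  # hypothetical marker written in place of a deleted column


class Node:
    """Toy single-node write path for one column family."""

    def __init__(self):
        self.commit_log = []  # stands in for the on-disk, append-only commit log
        self.memtable = {}    # in-memory: row key -> {column: value}
        self.sstables = []    # each flush adds one immutable, sorted table

    def insert(self, row, column, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(json.dumps({row: {column: value}}))
        # 2. Apply it to the memtable; the newest value for a column wins.
        self.memtable.setdefault(row, {})[column] = value

    def delete(self, row, column):
        # A delete is just another append: it records a tombstone.
        self.insert(row, column, TOMBSTONE)

    def flush(self):
        # Write the memtable out in sorted order as a new immutable table.
        table = tuple(
            (row, tuple(sorted(columns.items())))
            for row, columns in sorted(self.memtable.items())
        )
        self.sstables.append(table)
        self.memtable = {}
        return table


# Replay the mutations shown in the commit log on the slide.
node = Node()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.delete("row1", "col1")
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
print(node.memtable)
```

Nothing in this sketch ever modifies an earlier log entry or an existing table; every operation is an append or a fresh write.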
FLUSH

In-Memory Memtable for “cf1”
row1   col1: [del]   col2: “def”   col3: “foo”
row2   col1: “xyz”

SSTable 1   SSTable 2   SSTable 3   SSTable 4
COMPACT

SSTable 1   SSTable 2   SSTable 3   SSTable 4

SSTable

SSTables are merged to maintain read performance
X SSTable   X SSTable   X SSTable   X SSTable

SSTable

New SSTable is streamed to disk and old SSTables are erased
TAKEAWAYS
• All disk writes are sequential,
  append-only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction has linear
  complexity, O(N)
• SSTables are completely immutable
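Why sorted order makes compaction linear can be sketched with a streaming merge. This is an illustration only: the `compact` function and the flat `(key, value)` table layout are assumptions, not Cassandra's on-disk format. Because each input is already sorted, one pass over all of them is enough, and the inputs are only read, never modified.

```python
import heapq


def compact(sstables):
    """Merge sorted tables (ordered oldest first) into one sorted table.

    Each table is a list of (key, value) pairs sorted by key. When the same
    key appears in several tables, the value from the newest table wins.
    """
    # Tag every entry with its table's age so equal keys arrive oldest first.
    streams = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    for key, _age, value in heapq.merge(*streams):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)  # a newer table overrides the older entry
        else:
            merged.append((key, value))
    return merged


older = [("row1:col1", "abc"), ("row1:col2", "def")]
newer = [("row1:col1", "[del]"), ("row2:col1", "xyz")]
print(compact([older, newer]))
```

The sketch keeps tombstones such as `[del]` in the output; a real compaction also has to decide when they can be dropped, which is left out here.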
(IMPORTANT: remember these takeaways)
COMPARED
• Most popular data storage engines
  rewrite modified data in-place: MySQL
  (InnoDB), PostgreSQL, Oracle,
  MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
  writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Random I/O is limited by the time it takes
  for the drive to rotate: IOPS ≈ RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec
  for modern SATA drives
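The rule of thumb above is easy to check with a little arithmetic. The function names below are hypothetical, and the 4KB request size is an assumption chosen for illustration, not a figure from the slide.

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: roughly one random I/O per platter rotation."""
    return rpm / 60


def random_throughput_mb_per_sec(rpm, request_kb=4):
    """Throughput if every request needs its own rotation (4KB is assumed)."""
    return rotational_iops(rpm) * request_kb / 1000


for rpm in (7200, 15000):
    print(rpm, "RPM:", rotational_iops(rpm), "IOPS,",
          random_throughput_mb_per_sec(rpm), "MB/sec random")
```

Under these assumptions a 7,200 RPM drive moves well under 1MB/sec of small random requests, against the 125MB/sec sequential figure quoted above.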
THAT WAS THE WORLD
IN WHICH CASSANDRA
     WAS BUILT
2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read,
  23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec
  read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
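One plausible way to arrive at a cost-per-IOPS figure is drive price divided by IOPS. The slide does not show its own calculation, so the `dollars_per_iops` helper below is an assumption. The SSD inputs come from the bullets above; the spinning-disk capacity is not given anywhere in the deck, so the sketch only works backwards from the quoted $1.25 and ~120 IOPS.

```python
def dollars_per_iops(price_per_gb, capacity_gb, iops):
    """Drive price divided by the random IOPS it can deliver."""
    return price_per_gb * capacity_gb / iops


# 300GB SSD at ~$1.75/GB, using the read and write IOPS quoted above.
ssd_read = dollars_per_iops(1.75, 300, 39_500)
ssd_write = dollars_per_iops(1.75, 300, 23_000)
print(f"SSD: ${ssd_read:.3f} per read IOPS, ${ssd_write:.3f} per write IOPS")

# Working backwards for the spinning disk: $1.25 per IOPS at ~120 IOPS
# implies a drive price, and at $0.08/GB an implied capacity.
implied_hdd_price = 1.25 * 120
implied_hdd_capacity_gb = implied_hdd_price / 0.08
print(f"HDD: implied price ${implied_hdd_price:.0f}, "
      f"implied capacity {implied_hdd_capacity_gb:.0f}GB")
```

With these inputs the SSD write-side figure rounds to about two cents per IOPS, which is in line with the number on the slide.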
WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
  LSM-TREES OBSOLETE?
SOLID STATE HAS
SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase
  first, then write
• Can write in small increments (4KB),
  but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
  for each erase block
WEAR LEVELING is used
to spread erase operations
evenly across all blocks
WEAR LEVELING

Erase Block
  Disk Page
  Write 1
  Write 2
  Write 3
Remember: the whole block must be erased

  Write 1
  Write 2
  Write 3

How is data from only Write 2 modified?

Mark Garbage    Empty Block

Mark Garbage    Append Modified Data
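The mark-garbage-and-append step can be modelled with a small simulation. Everything here is hypothetical (the `EraseBlock` class, the `rewrite` helper, the three-page block size); real drive firmware is far more involved. The point is only that a page is never overwritten in place: the old copy is flagged as garbage and the new copy lands in a free page of another block.

```python
class EraseBlock:
    """Toy erase block: pages are written once, then only erased together."""

    def __init__(self, page_count):
        self.pages = [None] * page_count  # None means the page is empty
        self.garbage = set()              # indexes of pages holding stale data
        self.erase_count = 0

    def append(self, data):
        # Writes can only go to an empty page; there is no overwrite.
        for index, page in enumerate(self.pages):
            if page is None:
                self.pages[index] = data
                return index
        raise RuntimeError("block is full: it must be erased before writing")

    def erase(self):
        # Erasing is all-or-nothing and uses up one of the block's cycles.
        self.pages = [None] * len(self.pages)
        self.garbage.clear()
        self.erase_count += 1


def rewrite(block, index, data, target):
    """Modify one page: mark the old copy garbage, append the new copy."""
    block.garbage.add(index)
    return target.append(data)


full = EraseBlock(page_count=3)
for label in ("Write 1", "Write 2", "Write 3"):
    full.append(label)

empty = EraseBlock(page_count=3)
rewrite(full, 1, "Write 2 (modified)", target=empty)
print(full.pages, full.garbage, empty.pages)
```

After the rewrite, the first block still physically holds the stale "Write 2" page. That space only comes back when the whole block is erased.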
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations are a drag on performance
• Modern SSDs do this in the
  background... as much as possible
• If no empty blocks are available, GC
  must be done before ANY writes can
  complete
WRITE AMPLIFICATION
• Happens when only a few kilobytes are
  written, but fragmentation causes a whole
  block to be rewritten
• The smaller & more random the writes,
  the worse this gets
• Modern “mark and sweep” GC reduces
  it, but cannot eliminate it
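Write amplification is usually expressed as the ratio of bytes the flash actually writes to bytes the host asked to write. A minimal sketch, using the 4KB page and ~512KB erase block sizes mentioned earlier; the function name and the two scenarios are illustrative assumptions, not measurements.

```python
PAGE_BYTES = 4 * 1024      # smallest write unit, per the earlier slide
BLOCK_BYTES = 512 * 1024   # approximate erase unit, per the earlier slide


def write_amplification(host_bytes, flash_bytes):
    """Bytes physically written to flash per byte written by the host."""
    return flash_bytes / host_bytes


# Worst case: a single 4KB update forces a whole block to be rewritten.
worst_case = write_amplification(PAGE_BYTES, BLOCK_BYTES)

# Best case: the host writes a full block's worth of data sequentially.
best_case = write_amplification(BLOCK_BYTES, BLOCK_BYTES)

print(worst_case, best_case)
```

Real drives sit somewhere between these two extremes, depending on how fragmented the blocks are and how much garbage collection has already run.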
Torture test shows massive
write performance drop-off
for a heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives
COMPLETELY fall apart


Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive
suffers significantly from the
torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks
were marked empty, and the
“fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems typically don’t erase data
  immediately when files are deleted; they just
  mark it as deleted and erase later
• TRIM allows the OS to actively tell the drive
  when a region of disk is no longer used
• If an entire erase block is marked as
  unused, GC is avoided, otherwise TRIM
  just hastens the collection process
TRIM only reduces the
write amplification effect,
   it can’t eliminate it.
THEN THERE’S
 LIFETIME...
AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
which causes around 10x write amplification
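How write amplification eats into drive lifetime can be sketched with a back-of-the-envelope model. The formula assumes perfect wear leveling, and the 250GB/day host write rate is a hypothetical workload chosen for illustration; it is not AnandTech's figure. Capacity and cycle count come from the earlier slides (300GB drive, ~5,000 MLC cycles).

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amp):
    """Rough drive lifetime, assuming wear is spread perfectly evenly."""
    total_flash_writes_gb = capacity_gb * erase_cycles
    flash_gb_per_day = host_gb_per_day * write_amp
    return total_flash_writes_gb / flash_gb_per_day / 365


# Hypothetical workload: 250GB of host writes per day.
random_heavy = lifetime_years(300, 5_000, 250, write_amp=10)
sequential = lifetime_years(300, 5_000, 250, write_amp=1)
print(f"10x amplification: {random_heavy:.1f} years")
print(f"1x amplification:  {sequential:.1f} years")
```

Whatever the real write rate is, the model makes lifetime inversely proportional to write amplification: cutting amplification from 10x to 1x stretches the same flash ten times further.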
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential,
  append-only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction has linear
  complexity, O(N)
• SSTables are completely immutable
CASSANDRA
 ONLY WRITES
SEQUENTIALLY
“For a sequential write workload,
write amplification is equal to 1,
i.e., there is no write
amplification.”

Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
THANK YOU.
     ~ @rbranson

More Related Content

What's hot

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSPythian
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
Ceph Community
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
DataStax Academy
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
DataStax Academy
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
DataStax Academy
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
Ramkumar Nottath
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
ScyllaDB
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Odinot Stanislas
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
ScyllaDB
 
Bluestore
BluestoreBluestore
Bluestore
Patrick McGarry
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Community
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
Ceph Community
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
DataStax Academy
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
Uri Cohen
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
DataStax Academy
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph Community
 

What's hot (18)

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWS
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
 
ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
 

Similar to Cassandra and Solid State Drives

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
DataStax Academy
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
Yoshinori Matsunobu
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
Tomas Vondra
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
AbdulhseynAayev1
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
JawaharPrasad3
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
I Goo Lee
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDB
Xiao Yan Li
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
Matthew Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
andrew311
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performance
jimmytruong
 
Storage structure
Storage structureStorage structure
Storage structureMohd Arif
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updatedWerner Fischer
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
Alex Zaballa
 
SSD PPT BY SAURABH
SSD PPT BY SAURABHSSD PPT BY SAURABH
SSD PPT BY SAURABH
Saurabh Kumar
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016
Alex Chistyakov
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
Javier González
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
jasonajohnson
 

Similar to Cassandra and Solid State Drives (20)

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDB
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Cassandra compaction
Cassandra compactionCassandra compaction
Cassandra compaction
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performance
 
Storage structure
Storage structureStorage structure
Storage structure
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
 
SSD PPT BY SAURABH
SSD PPT BY SAURABHSSD PPT BY SAURABH
SSD PPT BY SAURABH
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 

Recently uploaded (20)

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Cassandra and Solid State Drives

  • 1. CASSANDRA & SOLID STATE DRIVES Rick Branson, DataStax
  • 2. FACT CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS
  • 5. COMMIT: Client sends insert({ cf1: { row1: { col3: foo } } }) to Cassandra. On-Disk Node Commit Log: { cf1: { row1: { col1: abc } } }; { cf1: { row1: { col2: def } } }; { cf1: { row1: { col1: <del> } } }; { cf1: { row2: { col1: xyz } } }; { cf1: { row1: { col3: foo } } }. In-Memory Memtable for “cf1”: row1: col1: [del], col2: “def”, col3: “foo”; row2: col1: “xyz”
  • 6. FLUSH: In-Memory Memtable for “cf1” (row1: col1: [del], col2: “def”, col3: “foo”; row2: col1: “xyz”). On disk: SSTable 1, SSTable 2, SSTable 3, SSTable 4
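The commit and flush steps on slides 5 and 6 can be sketched roughly as follows. This is a toy model, not Cassandra's implementation: `ToyNode`, `TOMBSTONE`, and the in-memory list standing in for the on-disk commit log are all assumptions made for illustration.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted column ("<del>" on the slide)


class ToyNode:
    """Toy write path: append to a log, update a memtable, flush in sorted order."""

    def __init__(self):
        self.commit_log = []  # stand-in for the append-only on-disk commit log
        self.memtable = {}    # row key -> {column name: value}
        self.sstables = []    # flushed tables, never modified afterwards

    def insert(self, row, column, value):
        self.commit_log.append((row, column, value))  # sequential append first
        self.memtable.setdefault(row, {})[column] = value

    def delete(self, row, column):
        self.insert(row, column, TOMBSTONE)  # a delete is just another write

    def flush(self):
        # Rows and columns are emitted in sorted order as immutable tuples.
        sstable = tuple(
            (row, tuple(sorted(cols.items(), key=lambda kv: kv[0])))
            for row, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        return sstable


# Replay the five mutations shown in the slide's commit log.
node = ToyNode()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.delete("row1", "col1")
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
memtable_before_flush = {row: dict(cols) for row, cols in node.memtable.items()}
sstable = node.flush()
```

After the replay the memtable holds the same two rows as the slide, and the flush produces one sorted, immutable table without touching any earlier one. Commit log truncation after a flush is not modeled here.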
  • 7. COMPACT: SSTables 1, 2, 3 and 4 are combined into a single SSTable. SSTables are merged to maintain read performance
  • 8. SSTables 1 to 4 are crossed out, leaving the single merged SSTable. New SSTable is streamed to disk and old SSTables are erased
  • 9. TAKEAWAYS • All disk writes are sequential, append-only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 10. TAKEAWAYS (IMPORTANT) • All disk writes are sequential, append-only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
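Because each SSTable is already sorted, compaction can be a single streaming merge. The sketch below is a minimal illustration, assuming each entry is a hypothetical `(key, timestamp, value)` tuple and that the newest timestamp wins; it is not Cassandra's compaction code. Strictly, `heapq.merge` costs O(N log k) for k input tables, which is linear in N when k is small and fixed.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def compact(sstables):
    """Merge sorted SSTables into one, keeping only the newest version per key."""
    # Each input is sorted by key; merging on (key, timestamp) yields every
    # version of a key together, oldest first, without re-sorting anything.
    merged = heapq.merge(*sstables, key=itemgetter(0, 1))
    result = []
    for _, versions in groupby(merged, key=itemgetter(0)):
        *_, newest = versions  # the last version in the group is the newest
        result.append(newest)
    return result


older = [("row1", 1, "abc"), ("row2", 1, "xyz")]
newer = [("row1", 2, "def"), ("row3", 2, "ghi")]
compacted = compact([older, newer])
```

The inputs are only read and the output is written front to back, which is why the new SSTable can be streamed to disk sequentially.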
  • 11. COMPARED • Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc • Most perform similar buffering of writes before flushing to disk • ... but flushes are RANDOM writes
  • 12. SPINNING DISKS • Dirt cheap: $0.08/GB • Seek time limited by time it takes for drive to rotate: IOPS = RPM/60 • 7,200 RPM = ~120 IOPS • 15,000 RPM has been the max for decades • Sequential operations are best: 125MB/sec for modern SATA drives
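The slide's rule of thumb, IOPS = RPM/60, amounts to one random operation per platter rotation. A small worked example, where `rotational_iops` is a hypothetical helper name:

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: one random I/O per rotation, so IOPS = RPM / 60."""
    return rpm / 60


iops_7200 = rotational_iops(7_200)     # the slide's ~120 IOPS figure
iops_15000 = rotational_iops(15_000)   # the fastest spindle speed the slide mentions
print(iops_7200, iops_15000)
```

Even the fastest spindle speed on the slide only roughly doubles the random-I/O budget under this rule.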
  • 13. THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT
  • 14. 2012: MLC NAND FLASH* • Affordable: ~$1.75/GB street • Massive IOPS: 39,500/sec read, 23,000/sec write • Latency of less than 100µs • Good sequential throughput: 270MB/sec read, 205MB/sec write • Way cheaper per IOPS: $0.02 vs $1.25 (* based on specifications provided by Intel for 300GB Intel 320 drive)
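As a rough check on the cost-per-IOPS bullet, the SSD side can be recomputed from the figures on this slide. The spinning-disk side is only back-solved: pairing the $1.25 figure with the ~120 IOPS of a 7,200 RPM drive from slide 12 is an assumption, since the slide does not say which drive was priced.

```python
# SSD figures from slide 14 (300GB Intel 320 at ~$1.75/GB street price).
ssd_price = 300 * 1.75
ssd_cost_per_write_iops = ssd_price / 23_000
ssd_cost_per_read_iops = ssd_price / 39_500

# Assumption: the $1.25 figure refers to a ~120 IOPS (7,200 RPM) drive.
implied_hdd_price = 1.25 * 120

print(f"SSD: ${ssd_cost_per_write_iops:.3f} per write IOPS, "
      f"${ssd_cost_per_read_iops:.3f} per read IOPS")
print(f"Implied spinning-disk price: ${implied_hdd_price:.0f}")
```

Under these assumptions the write-side figure rounds to the slide's $0.02, and the $1.25 figure corresponds to a drive costing about $150.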
  • 15. WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?
  • 17. SOLID STATE HAS SOME MAJOR BUTS...
  • 18. ... BUT • Cannot overwrite directly: must erase first, then write • Can write in small increments (4KB), but only erase in ~512KB blocks • Latency: write is ~100µs, erase is ~2ms • Limited durability: ~5,000 cycles (MLC) for each erase block
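The numbers on this slide imply how lopsided an in-place update is on flash. The arithmetic below uses only the slide's figures; the "worst case" assumes a whole erase block has to be rewritten to change a single 4KB page, which is a simplification of what a real controller does.

```python
PAGE_KB = 4            # smallest write increment (slide figure)
ERASE_BLOCK_KB = 512   # approximate erase block size (slide figure)
WRITE_US = 100         # approximate write latency in microseconds
ERASE_US = 2_000       # approximate erase latency (~2ms) in microseconds

pages_per_block = ERASE_BLOCK_KB // PAGE_KB
erase_vs_write_latency = ERASE_US / WRITE_US
# Simplified worst case: rewriting the whole block to modify one page.
worst_case_amplification = ERASE_BLOCK_KB / PAGE_KB

print(pages_per_block, erase_vs_write_latency, worst_case_amplification)
```

So one erase block holds 128 writable pages, an erase takes about 20 times as long as a write, and a naive rewrite of the block for one changed page would write 128 times more data than the host asked for.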
  • 19. WEAR LEVELING spreads erase operations evenly across blocks so that no single block wears out early
  • 25. WEAR LEVELING Write 1
  • 26. WEAR LEVELING Write 1 Write 2
  • 27. WEAR LEVELING Write 1 Write 2 Write 3
  • 28. Remember: the whole block must be erased. Block contents: Write 1, Write 2, Write 3. How is data from only Write 2 modified?
  • 30. Empty Block; Mark Garbage; Append Modified Data
  • 35. GARBAGE COLLECTION • Compacts fragmented disk blocks • Erase operations drag on performance • Modern SSDs do this in the background... as much as possible • If no empty blocks are available, GC must be done before ANY writes can complete
  • 36. WRITE AMPLIFICATION • When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten • The smaller & more random the writes, the worse this gets • Modern “mark and sweep” GC reduces it, but cannot eliminate it
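The effect described on slides 35 and 36 can be reproduced with a toy flash translation layer. This is a sketch under stated assumptions, not a model of any real SSD controller: `ToyFlash`, its greedy "fewest valid pages" victim selection, and the 16-block, 8-page geometry are all hypothetical, and the workload must address fewer logical pages than the device physically holds.

```python
import random


class ToyFlash:
    """Append-only toy FTL with greedy garbage collection."""

    def __init__(self, blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        # owner[b][p] is the logical page stored there, or None (empty/garbage).
        self.owner = [[None] * pages_per_block for _ in range(blocks)]
        self.where = {}                  # logical page -> (block, page)
        self.free = list(range(blocks))  # erased blocks
        self.open = self.free.pop()      # block currently being appended to
        self.next_page = 0
        self.host_writes = 0
        self.flash_writes = 0

    def _program(self, lpn):
        old = self.where.get(lpn)
        if old is not None:
            self.owner[old[0]][old[1]] = None  # mark the stale copy as garbage
        self.owner[self.open][self.next_page] = lpn
        self.where[lpn] = (self.open, self.next_page)
        self.next_page += 1
        self.flash_writes += 1

    def _open_next_block(self):
        survivors = []
        if not self.free:
            # Garbage collection: erase the block with the fewest valid pages.
            victim = min(range(len(self.owner)),
                         key=lambda b: sum(x is not None for x in self.owner[b]))
            survivors = [x for x in self.owner[victim] if x is not None]
            for lpn in survivors:
                del self.where[lpn]
            self.owner[victim] = [None] * self.pages_per_block  # erase
            self.free.append(victim)
        self.open = self.free.pop()
        self.next_page = 0
        for lpn in survivors:  # valid data must be rewritten: amplification
            self._program(lpn)

    def write(self, lpn):
        if self.next_page == self.pages_per_block:
            self._open_next_block()
        self.host_writes += 1
        self._program(lpn)

    def write_amplification(self):
        return self.flash_writes / self.host_writes


def run(workload):
    flash = ToyFlash(blocks=16, pages_per_block=8)
    for lpn in workload:
        flash.write(lpn)
    return flash.write_amplification()


LOGICAL_PAGES = 96  # 75% of the 128 physical pages
WRITES = 9_600
rng = random.Random(0)
sequential_wa = run(i % LOGICAL_PAGES for i in range(WRITES))
random_wa = run(rng.randrange(LOGICAL_PAGES) for _ in range(WRITES))
```

In this model the sequential workload always finds a fully stale block to erase, so nothing is copied and its write amplification stays at exactly 1. The random workload leaves valid pages scattered across every block, so each collection has to rewrite survivors and the ratio rises above 1.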
  • 37. Torture test shows massive write performance drop-off for heavily fragmented drive Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
  • 38. Some poorly designed drives COMPLETELY fall apart Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
  • 39. Even a well-behaved drive suffers significantly from the torture test Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 40. Post-torture, all disk blocks were marked empty, and the “fast” comes back... Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 42. “TRIM” • Filesystems don’t typically immediately erase data when files are deleted, they just mark them as deleted and erase later • TRIM allows the OS to actively tell the drive when a region of disk is no longer used • If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
  • 43. TRIM only reduces the write amplification effect, it can’t eliminate it.
  • 47. AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
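A back-of-the-envelope endurance model shows why the amplification factor matters so much. Everything here is an idealized assumption: the formula ignores over-provisioning and uneven wear, the 300GB capacity and ~5,000 cycles are borrowed from earlier slides and may not match the drives AnandTech tested, and `HOST_GB_PER_DAY` is a hypothetical write rate picked so the model lands near the quoted 1.5 years.

```python
def lifetime_years(capacity_gb, erase_cycles, write_amplification, host_gb_per_day):
    """Idealized endurance: total flash writes divided by amplified daily writes."""
    total_flash_gb = capacity_gb * erase_cycles      # every block used to its limit
    host_gb = total_flash_gb / write_amplification   # what the host can write in total
    return host_gb / host_gb_per_day / 365


HOST_GB_PER_DAY = 274  # hypothetical sustained host write rate

heavy_random = lifetime_years(300, 5_000, 10, HOST_GB_PER_DAY)  # ~10x amplification
sequential = lifetime_years(300, 5_000, 1, HOST_GB_PER_DAY)     # no amplification
print(f"{heavy_random:.1f} years vs {sequential:.1f} years")
```

Under this model, lifetime scales inversely with write amplification, so a workload that stays at 1x would last about ten times longer at the same host write rate.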
  • 49. TAKEAWAYS • All disk writes are sequential, append-only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 51. “For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.” Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
  • 52. THANK YOU. ~ @rbranson