Cassandra
storage internals
     Nicolas Favre-Felix
   Cassandra Europe 2012
What this talk covers

• What happens within a Cassandra node

• How Cassandra reads and writes data

• What compaction is and why we need it

• How counters are stored, modified, and read
Concepts
• Memtables        • On heap, off-heap

• SSTables         • Compaction

• Commit Log       • Bloom filters

• Key cache        • SSTable index

• Row cache        • Counters
Why is this important?
• Understand what goes on under the hood

• Understand the reasons for these choices

• Diagnose issues

• Tune Cassandra for performance

• Make your data model efficient
A word about hard drives

• Main driver behind Cassandra’s storage choices

• The last moving part

• Fast sequential I/O (150 MB/s)

• Slow random I/O (120-200 IOPS)
What SSDs bring
• Fast sequential I/O

• Fast random I/O

• Higher cost

• Limited lifetime

• Performance degradation
Disk usage with B-trees

• Important data structure in relational databases

• In-place overwrites (random I/O)

• log_B(N) random accesses for reads and writes
Disk usage with Cassandra
• Made for spinning disks

• Sequential writes, much less than 1 I/O per insert

• Several layers of cache

• Random reads, approximately 1 I/O per read

• Generally “write-optimised”
Writing to Cassandra
Let’s add a row with a few columns



Row Key   Column   Column   Column   Column
The Cassandra write path
In the JVM    New data    Memtable

On disk       Commit log
The Commit Log
• Each write is added to a log file

• Guarantees durability after a crash

• 1-second window during which data is still in RAM

• Sequential I/O

• A dedicated disk is recommended
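
A minimal sketch of an append-only log with periodic sync, to illustrate the points above. The `CommitLog` class, the `replay` helper, and the JSON-lines record format are all assumptions made for illustration; they do not reflect Cassandra's actual on-disk format.

```python
import json
import os
import tempfile


class CommitLog:
    """Toy append-only log: one JSON record per line (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def append(self, row_key, column, value, timestamp):
        record = {"key": row_key, "column": column, "value": value, "ts": timestamp}
        # Records are only ever appended, so disk access stays sequential.
        self.file.write(json.dumps(record) + "\n")

    def sync(self):
        # Meant to be called periodically: writes made since the previous
        # sync are only in RAM until this point.
        self.file.flush()
        os.fsync(self.file.fileno())

    def close(self):
        self.sync()
        self.file.close()


def replay(path):
    """Re-read every record after a restart, in the order it was written."""
    with open(path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file]


log_path = os.path.join(tempfile.mkdtemp(), "commitlog.jsonl")
log = CommitLog(log_path)
log.append("user1", "name", "Alice", 1)
log.append("user1", "email", "a@example.com", 2)
log.close()
recovered = replay(log_path)
```

After a crash, replaying the log in order would rebuild whatever had not yet been flushed from memory.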
Memtables
• In-memory Key/Value data structure

• Implemented with ConcurrentSkipListMap

• One per column family

• Very fast inserts

• Columns are merged in memory for the same key

• Flushed at a certain threshold, into an SSTable
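
The behaviour listed above can be sketched roughly as follows. This is an illustration only: the `Memtable` class is hypothetical, a plain dict sorted at flush time stands in for ConcurrentSkipListMap, and the flush threshold is counted in columns purely as an assumption.

```python
class Memtable:
    """Toy memtable: row key -> {column name: (timestamp, value)}."""

    def __init__(self, flush_threshold=4):
        self.rows = {}
        self.column_count = 0
        self.flush_threshold = flush_threshold  # assumption: counted in columns

    def insert(self, row_key, column, value, timestamp):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        if current is None:
            self.column_count += 1
        # Columns for the same key are merged in memory: newest timestamp wins.
        if current is None or timestamp >= current[0]:
            columns[column] = (timestamp, value)

    def is_full(self):
        return self.column_count >= self.flush_threshold

    def flush(self):
        """Return rows and columns in sorted order, then start empty again."""
        sstable = [
            (row_key, sorted(columns.items()))
            for row_key, columns in sorted(self.rows.items())
        ]
        self.rows = {}
        self.column_count = 0
        return sstable


memtable = Memtable(flush_threshold=3)
memtable.insert("user2", "name", "Bob", timestamp=1)
memtable.insert("user1", "name", "Alice", timestamp=2)
memtable.insert("user1", "name", "Alicia", timestamp=5)  # overwrite, merged in memory
memtable.insert("user1", "email", "a@example.com", timestamp=3)
flushed = memtable.flush() if memtable.is_full() else None
```

The overwrite of `name` never reaches disk as a separate entry: only the newest version is present when the memtable is flushed.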
Dumping a Memtable on disk

In the JVM    Full Memtable
On disk       Commit log

In the JVM    New Memtable
On disk       Commit log    SSTable
The SSTable

• One file, written sequentially

• Columns are in order, grouped by row

• Immutable once written, no updates!
SSTables start piling up!

In the JVM    Memtable

On disk       Commit log    SSTable   SSTable   SSTable
                            SSTable   SSTable   SSTable
                            SSTable   SSTable   SSTable
                            SSTable   SSTable   SSTable
SSTables
• Can’t keep all of them forever

• Need to reclaim disk space

• Reads could touch several SSTables

• Scans touch all of them

• In-memory data structures per SSTable
Compacting SSTables
Compaction
• Merge SSTables of similar size together

• Remove overwrites and deleted data (timestamps)

• Improve range query performance

• Major compaction creates a single SSTable

• I/O intensive operation
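
The merge step can be sketched as below, assuming each SSTable is a list of `(key, timestamp, value)` tuples sorted by key. The `compact` function and the `TOMBSTONE` marker are hypothetical names, and dropping deleted data immediately is a simplification made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

TOMBSTONE = None  # hypothetical marker for deleted data


def compact(sstables):
    """Merge key-sorted SSTables into one, keeping the newest version of each key."""
    # heapq.merge streams the inputs in key order without loading them whole.
    merged = heapq.merge(*sstables, key=itemgetter(0))
    output = []
    for key, versions in groupby(merged, key=itemgetter(0)):
        # The highest timestamp wins; older overwrites are discarded.
        timestamp, value = max(
            ((ts, val) for _, ts, val in versions), key=itemgetter(0)
        )
        if value is not TOMBSTONE:  # deleted data is dropped
            output.append((key, timestamp, value))
    return output


older = [("a", 1, "1"), ("b", 1, "2"), ("c", 1, "3")]
newer = [("b", 5, "20"), ("c", 6, TOMBSTONE), ("d", 4, "4")]
compacted = compact([older, newer])
```

Here `b` keeps only its newer value and `c` disappears because its most recent version is a deletion.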
Recent improvements

• Pluggable compaction

• Different strategies, chosen per column family

• SSTable compression

• More efficient SSTable merges
Reading from Cassandra
• Reading all these SSTables would be very inefficient

• We have to read from memory as much as possible

• Otherwise we need to do 2 things efficiently:

 • Find the right SSTable to read from

 • Find where in that SSTable to read the data
First step for reads

• The Memtable!

• Read the most recent data

• Very fast, no need to touch the disk
Off-heap (no GC)    Row cache

In the JVM          Memtable

On disk             Commit log    SSTable
Row cache

• Stores a whole row in memory

• Off-heap, not subject to Garbage Collection

• Size is configurable per column family

• Last resort before having to read from disk
Finding the right SSTable

In the JVM    Memtable

On disk       Commit log    SSTable   SSTable   SSTable
                            SSTable   SSTable   SSTable
                            SSTable   SSTable   SSTable
Bloom ļ¬lter
• Saved with each SSTable

• Answers “contains(Key) :: boolean”

• Saved on disk but kept in memory

• Probabilistic data structure

• Configurable proportion of false positives

• No false negatives
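
A small sketch of the data structure described above. The `BloomFilter` class, its default sizes, and the use of seeded SHA-256 hashes are assumptions chosen for readability; they are not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers "might this key be present?"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # More bits or a better-chosen hash count lowers the false positive rate.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from seeded hashes of the key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bloom = BloomFilter()
stored_keys = ["k-128", "k-256", "k-384"]
for stored_key in stored_keys:
    bloom.add(stored_key)
all_found = all(bloom.might_contain(k) for k in stored_keys)
```

Every key that was added sets all of its bits, which is why a negative answer can be trusted and the SSTable skipped without touching the disk.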
Bloom ļ¬lter

In the JVM    Memtable
              Bloom filter   Bloom filter   Bloom filter    (exists(key)? true/false)

On disk       Commit log
              SSTable        SSTable        SSTable
Reading from an SSTable
• We need to know where in the file our data is saved
• Keys are sorted, why don’t we do a binary search?
 • Keys are not all the same size
 • Jumping around in a file is very slow
 • log2(N) random I/O, ~20 for 1 million keys
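
The figure of about 20 random I/Os can be checked with a short calculation. Combining it with the 120-200 IOPS quoted earlier for spinning disks gives a rough per-lookup cost; treating each step of the binary search as one full seek is an assumption.

```python
import math

keys = 1_000_000
seeks = math.ceil(math.log2(keys))  # binary search steps, one random I/O each

# Time per lookup in milliseconds at the low and high end of the IOPS range.
latencies_ms = [round(1000 * seeks / iops) for iops in (120, 200)]
```

At these rates a single binary search would take on the order of a hundred milliseconds, which is why jumping around in the file is avoided.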
Reading from an SSTable
Let’s index key ranges in the SSTable

Key      Position in the SSTable
k-128    12098
k-256    23445
k-384    43678
SSTable index
• Saved with each SSTable

• Stores key ranges and their offsets: [(Key, Offset)]

• Saved on disk but kept in memory

• Avoids searching for a key by scanning the file

• Configurable key interval (default: 128)
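
A sketch of such a sparse index, assuming the SSTable's full list of `(key, offset)` pairs is available in sorted order when the index is built. The function names and the synthetic keys and offsets are made up for the example.

```python
import bisect

INDEX_INTERVAL = 128  # default key interval quoted above


def build_index(sorted_entries, interval=INDEX_INTERVAL):
    """Keep one (key, offset) pair out of every `interval` rows."""
    return sorted_entries[::interval]


def scan_start(index, key):
    """Offset from which to start scanning for `key`, or None if it sorts
    before the first indexed key."""
    indexed_keys = [k for k, _ in index]
    i = bisect.bisect_right(indexed_keys, key) - 1
    return index[i][1] if i >= 0 else None


# 512 synthetic rows, 100 bytes apart (made-up sizes).
entries = [(f"k-{i:04d}", i * 100) for i in range(512)]
index = build_index(entries)
start = scan_start(index, "k-0200")
```

The lookup for `k-0200` lands on the entry for `k-0128`, so at most one interval of rows has to be scanned from that offset.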
SSTable index

In the JVM    Memtable
              Bloom filter    SSTable index

On disk       Commit log
              SSTable
Sometimes not enough

• Storing key ranges is limited

• We can do better by storing the exact offset

• This saves approximately one I/O
The key cache

In the JVM    Memtable
              Bloom filter    Key cache    SSTable index

On disk       Commit log
              SSTable
Key cache

• Stores the exact location in the SSTable

• Stored in heap

• Avoids having to scan a whole index interval

• Size is configurable per column family
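Read path (numbered as on the slide)

1. Memtable (in the JVM)
2. Row cache (off-heap, no GC)
3. Bloom filter (in the JVM)
4. Key cache (in the JVM)
5. SSTable index (in the JVM)
6. SSTable (on disk, alongside the commit log)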
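
The six components above can be strung together in a rough sketch. Everything here is a simplification under stated assumptions: the `SSTable` class and `read` function are hypothetical, an exact set stands in for the Bloom filter, the index covers every key rather than key ranges, and the first SSTable that holds the key answers the read without merging versions.

```python
class SSTable:
    """Toy SSTable with its in-memory helpers."""

    def __init__(self, rows):
        self.data = sorted(rows.items())            # 6. rows stored in key order
        self.bloom = {key for key, _ in self.data}  # 3. exact set, stand-in only
        self.index = {key: pos for pos, (key, _) in enumerate(self.data)}  # 5.
        self.key_cache = {}                         # 4. key -> exact position


def read(key, memtable, row_cache, sstables, trace):
    """Look a key up layer by layer, recording each layer visited in `trace`."""
    trace.append("memtable")                 # 1
    if key in memtable:
        return memtable[key]
    trace.append("row cache")                # 2
    if key in row_cache:
        return row_cache[key]
    for sstable in sstables:
        trace.append("bloom filter")         # 3
        if key not in sstable.bloom:
            continue                         # skip this SSTable entirely
        trace.append("key cache")            # 4
        position = sstable.key_cache.get(key)
        if position is None:
            trace.append("sstable index")    # 5
            position = sstable.index[key]
            sstable.key_cache[key] = position
        trace.append("sstable")              # 6
        row = sstable.data[position][1]
        row_cache[key] = row
        return row
    return None


memtable, row_cache = {}, {}
sstables = [SSTable({"k1": {"name": "Alice"}}), SSTable({"k2": {"name": "Bob"}})]
first_trace, second_trace = [], []
first = read("k2", memtable, row_cache, sstables, first_trace)
second = read("k2", memtable, row_cache, sstables, second_trace)
```

The first read walks through every layer and rejects one SSTable at its Bloom filter; the second read is answered by the row cache and never reaches the disk.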
Distributed counters

• 64-bit signed integer, replicated in the cluster

• Atomic inc and dec by an arbitrary amount

• Counting with read-inc-write would be inefficient

• Stored differently from regular columns
Consider a cluster with 3 nodes, RF=3
Internal counter data
• List of increments received by the local node
• Summaries (Version, Sum) sent by the other nodes
• The total value is the sum of all counts

Local increments              +5   +2   -3
Received from another node    version: 3, count: 5
Received from another node    version: 5, count: 10
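
With the values shown above, the total can be worked out as follows. The node names are placeholders, since the slide identified the nodes only graphically.

```python
local_increments = [5, 2, -3]

# Summaries received from the two other replicas: sender -> (version, count).
received = {"node B": (3, 5), "node C": (5, 10)}

# The counter value is the local sum plus every count received.
total = sum(local_increments) + sum(count for _version, count in received.values())
```

The local increments add up to 4, so this node would report a total of 19.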
Incrementing a counter
• A coordinator node is chosen (blue node)

• Stores its increment locally

• Reads back the sum of its increments

• Forwards a summary to other replicas: (v.4, sum 5)

• Replicas update their records

Local increments (on the coordinator)    +5   +2   -3   +1
Received from (on the other replicas)    version: 4, count: 5
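
The steps above can be sketched as follows. The `CounterReplica` class is hypothetical, and deriving the version from the number of local increments is an assumption that happens to match the (v.4, sum 5) example.

```python
class CounterReplica:
    """Toy replica state for a single counter."""

    def __init__(self, name, local_increments=()):
        self.name = name
        self.local_increments = list(local_increments)
        self.received = {}  # sender name -> (version, count)

    def increment(self, delta, other_replicas):
        """Act as coordinator: store locally, read back, forward a summary."""
        self.local_increments.append(delta)
        version = len(self.local_increments)  # assumption: one version per increment
        total = sum(self.local_increments)    # the read performed on write
        for replica in other_replicas:
            replica.receive(self.name, version, total)
        return version, total

    def receive(self, sender, version, count):
        # Keep only the most recent summary from each sender.
        known = self.received.get(sender)
        if known is None or version > known[0]:
            self.received[sender] = (version, count)


coordinator = CounterReplica("A", local_increments=[5, 2, -3])
replica_b, replica_c = CounterReplica("B"), CounterReplica("C")
summary = coordinator.increment(+1, [replica_b, replica_c])
```

The read of the local sum is what makes counter writes more expensive than regular column writes.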
Reading a counter

• Replicas return their counts and versions

• Including what they know about other nodes

• Only the most recent versions are kept
Reading a counter

version: 6, count: 12

{ v. 3, count 5  |  v. 6, count 12  |  v. 2, count 8 }
{ v. 3, count 5  |  v. 5, count 10  |  v. 4, count 5 }

Counter value: 5 + 12 + 5 = 22
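
The merge that produces 22 can be reproduced in a few lines. The node names A, B, and C and the `merge_counts` function are placeholders; the versions and counts are the ones shown above.

```python
def merge_counts(*replica_views):
    """Keep the highest-version count per node, then sum the counts."""
    latest = {}
    for view in replica_views:
        for node, (version, count) in view.items():
            if node not in latest or version > latest[node][0]:
                latest[node] = (version, count)
    return sum(count for _version, count in latest.values())


# What each replica returned: node -> (version, count).
view_1 = {"A": (3, 5), "B": (6, 12), "C": (2, 8)}
view_2 = {"A": (3, 5), "B": (5, 10), "C": (4, 5)}
counter_value = merge_counts(view_1, view_2)
```

Version 6 beats version 5 for the second node and version 4 beats version 2 for the third, giving 5 + 12 + 5.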
Storage problems
Tuning
• Cassandra can’t really use large amounts of RAM

• Garbage Collection pauses stop everything

• Compaction has an impact on performance

• Reading from disk is slow

• These limitations restrict the size of each node
Recap
• Fast sequential writes

• ~1 I/O for uncached reads, 0 for cached

• Counter increments require a read on write; regular column writes don’t

• Know where your time is spent (monitor!)

• Tune accordingly
Questions?


http://www.flickr.com/photos/kubina/326628918/sizes/l/in/photostream/
http://www.flickr.com/photos/alwarrete/5651579563/sizes/o/in/photostream/
http://www.flickr.com/photos/pio1976/3330670980/sizes/o/in/photostream/
http://www.flickr.com/photos/lwr/100518736/sizes/l/in/photostream/
• In-kernel backend
• No Garbage Collection
• No need to plan heavy compactions
• Low and consistent latency
• Full versioning, snapshots
• No degradation with Big Data

Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
Ā 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with Cassandra
Ā 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Ā 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
Ā 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Ā 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
Ā 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Ā 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Ā 

Recently uploaded

Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
Ā 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
Ā 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
Ā 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
Ā 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
Ā 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
Ā 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
Ā 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
Ā 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
Ā 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
Ā 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
Ā 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
Ā 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
Ā 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
Ā 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
Ā 
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
Ā 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-UniversitƤt
Ā 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
Ā 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
Ā 

Recently uploaded (20)

Artificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic WarfareArtificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic Warfare
Ā 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
Ā 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
Ā 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ā 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Ā 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
Ā 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Ā 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Ā 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Ā 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Ā 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Ā 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Ā 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Ā 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Ā 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Ā 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Ā 
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
ā€œHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Ā 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Ā 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
Ā 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
Ā 

Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

  • 1. Cassandra storage internals Nicolas Favre-Felix Cassandra Europe 2012
  • 2. What this talk covers • What happens within a Cassandra node • How Cassandra reads and writes data • What compaction is and why we need it • How counters are stored, modified, and read
  • 3. Concepts • Memtables • On heap, off-heap • SSTables • Compaction • Commit Log • Bloom filters • Key cache • SSTable index • Row cache • Counters
  • 4. Why is this important? • Understand what goes on under the hood • Understand the reasons for these choices • Diagnose issues • Tune Cassandra for performance • Make your data model efficient
  • 5. A word about hard drives
  • 6. A word about hard drives • Main driver behind Cassandra’s storage choices • The last moving part • Fast sequential I/O (150 MB/s) • Slow random I/O (120-200 IOPS)
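The gap between the two figures on this slide is easier to appreciate with a quick calculation. The sketch below is only back-of-the-envelope arithmetic: the 150 MB/s and 120-200 IOPS values come from the slide, while the 1 KB row size is an assumption chosen for illustration.

```python
# Back-of-the-envelope comparison of sequential and random writes on a hard drive.
SEQUENTIAL_BYTES_PER_SEC = 150 * 1000 * 1000  # 150 MB/s, from the slide
RANDOM_IOPS_LOW = 120                         # from the slide
RANDOM_IOPS_HIGH = 200                        # from the slide
ROW_SIZE_BYTES = 1000                         # assumption: roughly 1 KB per row


def rows_per_second_sequential(row_size=ROW_SIZE_BYTES):
    """Rows written per second when rows are appended back to back."""
    return SEQUENTIAL_BYTES_PER_SEC // row_size


def rows_per_second_random(iops):
    """Rows written per second when every row costs one random I/O."""
    return iops


sequential = rows_per_second_sequential()
best_random = rows_per_second_random(RANDOM_IOPS_HIGH)
worst_random = rows_per_second_random(RANDOM_IOPS_LOW)

print("sequential rows/s:", sequential)
print("random rows/s:", worst_random, "to", best_random)
print("sequential advantage:", sequential // best_random, "to", sequential // worst_random, "times")
```

Under these assumptions, appending rows sequentially sustains hundreds of times more writes per second than overwriting them in place, which is the motivation the talk gives for the storage design that follows.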
  • 7. What SSDs bring • Fast sequential I/O • Fast random I/O • Higher cost • Limited lifetime • Performance degradation
  • 8. Disk usage with B-trees • Important data structure in relational databases • In-place overwrites (random I/O) • log_B(N) random accesses for reads and writes
  • 9. Disk usage with Cassandra • Made for spinning disks • Sequential writes, much less than 1 I/O per insert • Several layers of cache • Random reads, approximately 1 I/O per read • Generally “write-optimised”
  • 11. Writing to Cassandra • Let’s add a row with a few columns [Diagram: one row, made of a row key followed by four columns]
  • 12. The Cassandra write path [Diagram: new data is written to the Memtable (in the JVM) and to the commit log (on disk)]
  • 13. The Commit Log • Each write is added to a log file • Guarantees durability after a crash • 1-second window during which data is still in RAM • Sequential I/O • A dedicated disk is recommended
  • 14. Memtables • In-memory Key/Value data structure • Implemented with ConcurrentSkipListMap • One per column family • Very fast inserts • Columns are merged in memory for the same key • Flushed at a certain threshold, into an SSTable
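The Memtable behaviour listed on this slide can be sketched in a few lines. This is an illustrative model only, not Cassandra's code: Cassandra uses a Java ConcurrentSkipListMap, which stays sorted at all times, whereas this sketch uses a plain dict and sorts at flush time. The `Memtable` class, its method names, and the column-count threshold are hypothetical.

```python
class Memtable:
    """Toy in-memory table: row key -> {column name: (timestamp, value)}."""

    def __init__(self, flush_threshold=4):
        self.rows = {}
        # Assumption: the threshold is a column count, chosen for simplicity.
        self.flush_threshold = flush_threshold
        self.column_count = 0

    def put(self, row_key, column, value, timestamp):
        """Insert a column, merging with existing columns of the same row."""
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        if current is None:
            self.column_count += 1
        # The most recent timestamp wins when a column is written twice.
        if current is None or timestamp >= current[0]:
            columns[column] = (timestamp, value)

    def is_full(self):
        return self.column_count >= self.flush_threshold

    def flush(self):
        """Return the content sorted by row key and column name, then reset.

        The returned list stands in for an SSTable written sequentially.
        """
        sstable = [(key, sorted(cols.items())) for key, cols in sorted(self.rows.items())]
        self.rows = {}
        self.column_count = 0
        return sstable


memtable = Memtable(flush_threshold=3)
memtable.put("k1", "a", "1", timestamp=10)
memtable.put("k1", "a", "2", timestamp=20)  # overwrites column "a" in memory
memtable.put("k1", "b", "x", timestamp=5)
memtable.put("k0", "a", "z", timestamp=1)
if memtable.is_full():
    print(memtable.flush())
```

Because overwrites are merged in memory, only one version of each column reaches the disk per flush, which is part of why an insert costs much less than one I/O.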
  • 15. Dumping a Memtable on disk [Diagram: a full Memtable in the JVM; the commit log on disk]
  • 16. Dumping a Memtable on disk [Diagram: a new Memtable in the JVM; on disk, the commit log and the newly written SSTable]
  • 17. The SSTable • One file, written sequentially • Columns are in order, grouped by row • Immutable once written, no updates!
  • 18. SSTables start piling up! [Diagram: the Memtable in the JVM; on disk, the commit log and twelve SSTables]
  • 19. SSTables • Can’t keep all of them forever • Need to reclaim disk space • Reads could touch several SSTables • Scans touch all of them • In-memory data structures per SSTable
  • 21. Compaction • Merge SSTables of similar size together • Remove overwrites and deleted data (timestamps) • Improve range query performance • Major compaction creates a single SSTable • I/O intensive operation
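The merge step described on this slide can be illustrated with a small sketch. It is a simplification under stated assumptions: each SSTable is modelled as a list of `(row_key, column, timestamp, value)` tuples, a deleted column is represented by a `None` value, and everything is merged in memory, whereas a real compaction streams sorted files. The sketch also drops deleted data immediately, which is a simplification. The `compact` function and `TOMBSTONE` name are hypothetical.

```python
TOMBSTONE = None  # assumption: a deleted column is stored with a None value


def compact(sstables):
    """Merge several SSTables into one, keeping the newest version of each column.

    Each SSTable is a list of (row_key, column, timestamp, value) tuples.
    Overwritten versions and deleted columns are left out of the result.
    """
    newest = {}
    for sstable in sstables:
        for row_key, column, timestamp, value in sstable:
            key = (row_key, column)
            # Timestamps decide which version of a column survives.
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)

    merged = []
    for (row_key, column), (timestamp, value) in sorted(newest.items()):
        if value is not TOMBSTONE:
            merged.append((row_key, column, timestamp, value))
    return merged


older = [("k1", "a", 1, "old"), ("k1", "b", 1, "keep"), ("k2", "a", 1, "gone")]
newer = [("k1", "a", 2, "new"), ("k2", "a", 3, TOMBSTONE)]
print(compact([older, newer]))
```

The output is a single sorted table that is smaller than its inputs, so later reads and range queries touch fewer files.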
  • 22. Recent improvements • Pluggable compaction • Different strategies, chosen per column family • SSTable compression • More efficient SSTable merges
  • 24. Reading from Cassandra • Reading all these SSTables would be very inefficient • We have to read from memory as much as possible • Otherwise we need to do 2 things efficiently: • Find the right SSTable to read from • Find where in that SSTable to read the data
  • 25. First step for reads • The Memtable! • Read the most recent data • Very fast, no need to touch the disk
  • 26. [Diagram: the row cache, off-heap (no GC); the Memtable in the JVM; on disk, the commit log and an SSTable]
  • 27. Row cache • Stores a whole row in memory • Off-heap, not subject to Garbage Collection • Size is configurable per column family • Last resort before having to read from disk
  • 28. Finding the right SSTable [Diagram: the Memtable in the JVM; on disk, the commit log and nine SSTables]
  • 29. Bloom filter • Saved with each SSTable • Answers “contains(Key) :: boolean” • Saved on disk but kept in memory • Probabilistic data structure • Configurable proportion of false positives • No false negatives
  • 30. Bloom filter [Diagram: in the JVM, the Memtable and three Bloom filters, each answering exists(key)? with true/false; on disk, the commit log and three SSTables]
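A minimal Bloom filter shows why false positives are possible while false negatives are not. This is a generic textbook sketch, not Cassandra's implementation: the `BloomFilter` class, its default sizes, and the use of SHA-256 to derive bit positions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter answering "might this key be in the SSTable?"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        """Derive num_hashes bit positions from the key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        """False means "definitely absent"; True means "possibly present"."""
        return all(self.bits[position] for position in self._positions(key))


bloom = BloomFilter()
for key in ("k-1", "k-2", "k-3"):
    bloom.add(key)

print(bloom.might_contain("k-2"))    # always True for a key that was added
print(bloom.might_contain("other"))  # usually False, occasionally a false positive
```

Adding a key only ever sets bits, so a key that was added always passes the check. A key that was never added passes only if other keys happened to set all of its positions; more bits per key make that less likely, which is the configurable trade-off mentioned on the slide.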
  • 31. Reading from an SSTable • We need to know where in the file our data is saved • Keys are sorted, why don’t we do a binary search? • Keys are not all the same size • Jumping around in a file is very slow • log2(N) random I/O, ~20 for 1 million keys
  • 32. Reading from an SSTable • Let’s index key ranges in the SSTable [Diagram: index entries pointing into the SSTable: key k-128 at position 12098, key k-256 at position 23445, key k-384 at position 43678]
  • 33. SSTable index • Saved with each SSTable • Stores key ranges and their offsets: [(Key, Offset)] • Saved on disk but kept in memory • Avoids searching for a key by scanning the file • Configurable key interval (default: 128)
  • 34. SSTable index [Diagram: in the JVM, the Memtable, a Bloom filter and the SSTable index; on disk, the commit log and an SSTable]
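The sampled index can be sketched as follows. The example is a simplified model: the SSTable is a sorted Python list, and a key's position in that list stands in for its byte offset in the file. The function names `build_index` and `find` are hypothetical; only the interval of 128 and the estimate of about 20 seeks for 1 million keys come from the slides.

```python
import bisect
import math

INDEX_INTERVAL = 128  # default key interval, from the slide


def build_index(sorted_keys, interval=INDEX_INTERVAL):
    """Sample one key out of every `interval`, with its position in the file."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]


def find(sorted_keys, index, key, interval=INDEX_INTERVAL):
    """Locate `key` with an in-memory search of the index plus one short scan."""
    sampled_keys = [sampled for sampled, _ in index]
    # Last sampled key that is <= the key we are looking for.
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # smaller than every key in this SSTable
    start = index[slot][1]
    # Sequential scan of at most one index interval.
    for position in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[position] == key:
            return position
    return None


keys = [f"k-{i:06d}" for i in range(1000)]
index = build_index(keys)
print(len(index))                       # number of index entries kept in memory
print(find(keys, index, "k-000300"))    # position of the key in the SSTable

# Binary search directly on the file would need about this many random seeks:
print(math.ceil(math.log2(1_000_000)))
```

The index search happens entirely in memory, so the only disk access is one seek followed by a short sequential read, instead of roughly 20 random seeks.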
  • 35. Sometimes not enough • Storing key ranges is limited • We can do better by storing the exact offset • This saves approximately one I/O
  • 36. The key cache [Diagram: in the JVM, the Memtable, Bloom filter, key cache and SSTable index; on disk, the commit log and an SSTable]
  • 37. Key cache • Stores the exact location in the SSTable • Stored in heap • Avoids having to scan a whole index interval • Size is configurable per column family
  • 38-44. [Diagram, repeated on slides 38 to 44: the read path in six numbered steps, involving the Memtable (in the JVM), the row cache (off-heap, no GC), the Bloom filter, the key cache, the SSTable index, and the SSTable on disk, next to the commit log]
  • 46. Distributed counters • 64-bit signed integer, replicated in the cluster • Atomic inc and dec by an arbitrary amount • Counting with read-inc-write would be inefficient • Stored differently from regular columns
  • 47. Consider a cluster with 3 nodes, RF=3
  • 48. Internal counter data • List of increments received by the local node • Summaries (Version,Sum) sent by the other nodes • The total value is the sum of all counts
  • 49. Internal counter data • List of increments received by the local node • Summaries (Version,Sum) sent by the other nodes • The total value is the sum of all counts [Diagram: local increments +5, +2, -3; received from one node: version 3, count 5; received from another node: version 5, count 10]
  • 50. Incrementing a counter • A coordinator node is chosen (blue node) [Diagram: local increments +5, +2, -3]
  • 51. Incrementing a counter • A coordinator node is chosen • Stores its increment locally [Diagram: local increments +5, +2, -3, +1]
  • 52. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments [Diagram: local increments +5, +2, -3, +1]
  • 53. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments • Forwards a summary to other replicas: (v.4, sum 5) [Diagram: local increments +5, +2, -3, +1]
  • 54. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments • Forwards a summary to other replicas • Replicas update their records [Diagram: on a replica, the record received from the coordinator reads version 4, count 5]
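The increment steps above can be sketched as a toy model. It is an illustration under assumptions, not Cassandra's code: the `CounterReplica` class and its methods are hypothetical, and the version is taken to be the number of local increments, which is consistent with the slide's example of (v.4, sum 5) after the increments +5, +2, -3 and +1.

```python
class CounterReplica:
    """Toy replica holding its own increments and summaries from other nodes."""

    def __init__(self, name):
        self.name = name
        self.local_increments = []  # increments received by this node
        self.remote = {}            # other node's name -> (version, sum)

    def local_summary(self):
        """(version, sum) of the local increments.

        Assumption: the version is simply the number of local increments.
        """
        return (len(self.local_increments), sum(self.local_increments))

    def increment(self, delta, other_replicas):
        """Act as coordinator: store locally, read back, forward a summary."""
        self.local_increments.append(delta)
        summary = self.local_summary()  # this is the read on write
        for replica in other_replicas:
            replica.receive(self.name, summary)
        return summary

    def receive(self, sender, summary):
        """Keep the summary only if it is newer than the one already known."""
        known = self.remote.get(sender)
        if known is None or summary[0] > known[0]:
            self.remote[sender] = summary


coordinator = CounterReplica("A")
replica_b = CounterReplica("B")
replica_c = CounterReplica("C")
coordinator.local_increments = [5, 2, -3]  # state shown on the slides

print(coordinator.increment(1, [replica_b, replica_c]))
print(replica_b.remote)
```

The read of the local sum inside `increment` is what the recap later calls "counter increments read on write": unlike a regular column insert, a counter update has to read existing data before it can be replicated.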
  • 55. Reading a counter • Replicas return their counts and versions • Including what they know about other nodes • Only the most recent versions are kept
  • 57. Reading a counter [Diagram: one replica returns (v. 3, count 5), (v. 6, count 12), (v. 2, count 8); another returns (v. 3, count 5), (v. 5, count 10), (v. 4, count 5); a further entry reads version 6, count 12]
  • 58. Reading a counter [Diagram: same replica responses as on slide 57] Counter value: 5 + 12 + 5 = 22
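The reconciliation on these slides can be reproduced with a short function. The `merge_counter` name and the node labels A, B and C are hypothetical, and the assignment of each (version, count) pair to a node is an assumption based on the order in which the pairs appear in the diagram.

```python
def merge_counter(views):
    """Combine the views returned by several replicas into one counter value.

    Each view maps a node name to the (version, count) known for that node.
    For every node, only the entry with the highest version is kept, and the
    counter value is the sum of the surviving counts.
    """
    latest = {}
    for view in views:
        for node, (version, count) in view.items():
            if node not in latest or version > latest[node][0]:
                latest[node] = (version, count)
    return sum(count for _, count in latest.values())


# The two replica responses from the slide, with hypothetical node names.
first_view = {"A": (3, 5), "B": (6, 12), "C": (2, 8)}
second_view = {"A": (3, 5), "B": (5, 10), "C": (4, 5)}

print(merge_counter([first_view, second_view]))
```

With these inputs, node A contributes 5 (both views agree), node B contributes 12 (version 6 beats version 5), and node C contributes 5 (version 4 beats version 2), which gives the 5 + 12 + 5 = 22 shown on the slide.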
  • 60. Tuning • Cassandra can’t really use large amounts of RAM • Garbage Collection pauses stop everything • Compaction has an impact on performance • Reading from disk is slow • These limitations restrict the size of each node
  • 61. Recap • Fast sequential writes • ~1 I/O for uncached reads, 0 for cached • Counter increments read on write, columns don’t • Know where your time is spent (monitor!) • Tune accordingly
  • 63. • In-kernel backend • No Garbage Collection • No need to plan heavy compactions • Low and consistent latency • Full versioning, snapshots • No degradation with Big Data