Cassandra Performance Evaluation with Compression
                           Schubert Zhang, May 2010
                           schubert.zhang@gmail.com

The current implementation of Cassandra’s storage layer and indexing mechanism
only allows compression at the row level.


   Column Family Row Serialization Structure:
1. The old structure:

   bloom filter         Len (int), HashCount (int), BitSet
   index size           int
   index of block 0     FirstColumnName (Len(short)+name), LastColumnName (Len(short)+name),
                        Offset (long, 0 for the first block), Block Width (long)
   index of block 1     ...
   deletion meta        localDeletionTime (int), markedForDeleteAt (long)
   column count         int
   column block 0       Column0, Column1, Column2, Column3, ...
   (uncompressed)
   column block 1       ... each column: deleteMark (bool), timestamp (long), value (byte[])
   (uncompressed)

2. The new structure (to support compression)
The new structure is suitable for both the old (uncompressed) and the new
(compressed) formats.

   format               int: -1 (old format), 0 (new, LZO compressed),
                             1 (new, GZ compressed), 2 (new, uncompressed)
   bloom filter         Len (int), HashCount (int), BitSet
   deletion meta        localDeletionTime (int), markedForDeleteAt (long)
   column count         int
   column block 0       Column0, Column1, Column2, Column3, ...
   (compressed or not)
   column block 1       ... each column: deleteMark (bool), timestamp (long), value (byte[])
   (compressed or not)
   index size           int
   index of block 0     FirstColumnName (Len(short)+name), LastColumnName (Len(short)+name),
                        Offset (long, 0 for the first block), Block Width (long),
                        Size on Disk (int)
   index of block 1     ...
   index size’

If the first int (format) is -1, the structure that follows is the same as “The
old structure”, except that the “index of block” entries use the new layout.
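Deserialization can dispatch on that leading flag. A minimal sketch in Python, assuming the flag is written as a Java-style big-endian int; the function and constants here are illustrative, not Cassandra’s actual code:

```python
import struct

# Hypothetical reader for the leading format flag of a serialized row.
# The mapping follows the layout described above.
FORMATS = {
    -1: "old (uncompressed)",
    0: "new, LZO compressed",
    1: "new, GZ compressed",
    2: "new, uncompressed",
}

def read_row_format(buf):
    """Read the first big-endian int (Java DataOutput convention)
    and map it to the row format it announces."""
    (flag,) = struct.unpack_from(">i", buf, 0)
    return FORMATS.get(flag, "unknown")
```

For example, `read_row_format(struct.pack(">i", -1))` yields `"old (uncompressed)"`, so an old-format reader can be selected before touching the rest of the row.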




     Benchmark:
1. A single node (only one disk, 4GB RAM (3GB for the JVM heap), 4 cores)
2. Dataset:
      ~200 bytes per column (Thrift compactly encoded; the original CSV string is
      ~250 bytes)
      100,000 keys
      500,000,000 columns in total
      ~5,000 columns per key on average
3. Key cache and row cache both disabled
4. The write or read client has 4 threads and executes 10,000 read operations in total.
5. Every read operation reads only the first 100 columns of the specified key.
6. The read performance is measured after a major compaction, i.e. with only one SSTable.


     Compression Performance Matrix:

Field     Criteria               Uncompressed     Compressed     Compressed
                                   (Default)         (GZ)          (LZO)
Size      Disk Space              104.545 GB      45.067 GB      54.656 GB
          Compression Ratio          1/1            1/2.3          1/1.9
Compact   Major Time (h:mm)          3:16            5:30           3:08
          Row Max Size (B)         1186948         512475         624396
Write     Throughput (ops/s)        12635           11806          11034
          Avg Latency (ms)          0.320           0.334          0.347
          Min Latency (ms)          0.079           0.083          0.089
          Max Latency (ms)          19331           5128           10227
          Local Latency (ms)        0.032           0.033          0.037
Read      Throughput (ops/s)          25              28             25
          Avg Latency (ms)           159             144            159
          Min Latency (ms)             1               2              1
          Max Latency (ms)           1038            1526            619
          Local Latency (ms)         159             144            159
Note:
1. The bottleneck for writes is CPU and memory.
      a)   In theory, we may get better performance with a more powerful CPU and
           more RAM.
      b)   If the commitlog is stored on a dedicated disk, we may also get better
           results.
2. The bottleneck for reads is disk utilization (100%).
      a)   Too many seeks.
      b)   Every read needs 2 seeks to reach the row, so a read operation spends at
           least 20ms on disk seeks. The maximum throughput is therefore 50 ops/s.
      c)   If the row is compressed, one additional seek within the row is needed.
3. The compression ratio improves as the average row size grows.
      a)   Since our dataset is very random, the ratio is only about 1/2.
4. Compaction is CPU-bound, since compaction is single-threaded. Gzip compression
   is slower.
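The throughput ceiling in note 2b is simple arithmetic. A sketch, assuming ~10 ms per seek (a typical figure for 7200 rpm disks of the time; the source only states that 2 seeks cost at least 20 ms):

```python
# Seek budget behind note 2b. SEEK_MS is an assumed per-seek cost.
SEEK_MS = 10.0

def max_read_throughput(seeks_per_read, seek_ms=SEEK_MS):
    """Upper bound on ops/s when reads are dominated by disk seeks."""
    return 1000.0 / (seeks_per_read * seek_ms)

print(max_read_throughput(2))  # uncompressed row, 2 seeks -> 50.0 ops/s
print(max_read_throughput(3))  # compressed row, 1 extra seek -> ~33.3 ops/s
```

The measured 25-28 ops/s sits at roughly half of this 50 ops/s ceiling, consistent with extra seeks per operation (index, compressed block) and queueing on the single disk.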

Configuration:
Parameter                       Value
KeysCached                      0
DiskAccessMode                  standard
SlicedBufferSizeInKB            64
FlushDataBufferSizeInMB         32
FlushIndexBufferSizeInMB        8
ColumnIndexSizeInKB             64
MemtableThroughputInMB          128
ConcurrentReads                 16
ConcurrentWrites                64
CommitLogSync                   periodic
CommitLogSyncPeriodInMS         10000


     Encoding + Compression:
1. The original text CSV column: ~250 bytes
2. With Thrift compact encoding: ~200 bytes
3. Encoding + compression, composite reduction ratio: ~1/3
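A quick sanity check of the ~1/3 figure, using the GZ numbers from the matrix above (all values are averages, so the result is approximate):

```python
# Composite reduction: Thrift compact encoding shrinks the CSV column,
# then row-level GZ compression shrinks the encoded bytes further.
csv_bytes     = 250        # original text CSV column
encoded_bytes = 200        # after Thrift compact encoding
gz_ratio      = 1 / 2.3    # GZ compression ratio measured on the encoded data

on_disk   = encoded_bytes * gz_ratio   # ~87 bytes per column on disk
composite = on_disk / csv_bytes        # ~0.35, i.e. roughly 1/3
print(f"~{on_disk:.0f} bytes on disk, composite ratio ~1/{1 / composite:.1f}")
# -> ~87 bytes on disk, composite ratio ~1/2.9
```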


     Read Throughput/Latency vs. slice size (count of columns):
      Tested on LZO compressed data; 10,000 read operations executed in total.

Criteria               Slice Size: 50        500          5000
Throughput (ops/s)               25           21            15
Avg Latency (ms)            158.865      186.571       256.837
Min Latency (ms)              1.278        5.041        60.934
Max Latency (ms)            288.307      395.427      1223.202




[Figure: Read Throughput — throughput (ops/s) vs. slice size (count of
columns): 25 at 50, 21 at 500, 15 at 5000.]

[Figure: Read Latency — avg/min/max latency (ms) vs. slice size, plotting the
values from the table above.]


     Read Throughput/Latency with KeyCache, mmap, etc.:
      Tested on LZO compressed data with the benchmark above; 10,000 read
operations executed in total.

Criteria               KeyCache=100%         KeyCache=0            KeyCache=0
                       DiskAccess=standard   DiskAccess=           DiskAccess=mmap
                                             mmap_index_only
Throughput (ops/s)            40                   40                    84
Avg Latency (ms)          100.522              101.762               47.342
Min Latency (ms)            1.566                1.453                1.270
Max Latency (ms)          278.975              267.120              239.816
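The mmap gain is consistent with repeated index and row accesses being served from the OS page cache, without a read() syscall and buffer copy per access. A generic illustration in Python (not Cassandra code; the file name and offsets are made up):

```python
import mmap
import os
import tempfile

# Generic illustration of DiskAccessMode=mmap: the file is mapped into the
# process address space, so random accesses by offset hit the page cache
# directly instead of issuing a read() per lookup.
path = os.path.join(tempfile.mkdtemp(), "sstable-like.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096 + b"row-index-entry" + b"\x00" * 4096)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    entry = mm[4096:4096 + 15]   # random access by offset, like an index lookup
    mm.close()

print(entry)  # -> b'row-index-entry'
```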

[Figure: Read Throughput — bar chart of the three configurations:
KeyCache_standard 40 ops/s, mmap_index_only 40 ops/s, mmap 84 ops/s.]

However, over a long evaluation, performance with mmap is unstable. The following
evaluation executed 1,000,000 read operations; the instability may be caused by GC.




[Figure: Read Throughput (mmap) — throughput (ops/s) sampled once per minute
over the long-running evaluation (~280 minutes), fluctuating within a 0-120
ops/s range.]




