Slides from Seattle Cassandra Users November Meetup hosted by Hulu.
Cassandra lets users specify a number of different metadata options for each table in a cluster. In this talk, we’ll discuss the Cassandra read path, the write path, and how understanding what’s going on under Cassandra's hood can help developers and operators optimize their tables for better latencies and higher throughput. We’ll cover the various compaction strategies and when each is the appropriate choice, and then touch on some of the less commonly tuned options, including compression, bloom filter false positive chance, clustering order, read repair chance, and speculative retry.
2. A Little About Me
• Cassandra in production since 2010
• Infrastructure @ Crowdstrike
• Hundreds of terabytes in Cassandra
• Occasional code contributions
• Cassandra MVP
• Cassandra Day LA: 5 years of Hindsight
• Cassandra Summit 2015: DTCS is Broken (unofficial title)
3. An Introduction
to CrowdStrike
We Are a Cybersecurity Technology Company
We Detect, Prevent, and Respond to All Attack
Types in Real Time, Protecting Organizations
from Catastrophic Breaches
We Provide Next-Generation Endpoint Protection,
Threat Intelligence & Pre- and Post-Incident Response (IR) Services
4. A Little About Tonight
• Cassandra Write paths
• Cassandra Read paths
• Knowing what Cassandra is doing helps you understand how
to tune
• It’s not just about performance, it’s also about latencies,
stability, and correctness
• Feel free to interrupt me! Ask questions before, during, after
5. Write Path, Simplified
• Writes first go to the commitlog
• Then, memtable
• Then, eventually flushed to sstables
• If RF > ONE, the coordinator sends the mutation to replicas
• Depending on CL, the coordinator waits until enough respond before reporting
success to the client
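As a sketch of the coordinator’s bookkeeping (the function and names here are illustrative, not Cassandra internals), the number of replica acknowledgements the coordinator waits for depends only on the consistency level and the replication factor:

```python
# Illustrative sketch: how many replica acks a coordinator waits for
# before reporting success, for a few common consistency levels.
def acks_required(cl: str, rf: int) -> int:
    """Replica acknowledgements needed at consistency level `cl`
    given replication factor `rf`."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1  # strict majority of the replicas
    if cl == "ALL":
        return rf
    raise ValueError(f"unhandled consistency level: {cl}")

# With RF=3, a QUORUM write succeeds once 2 of the 3 replicas respond;
# the third replica still receives the mutation, just asynchronously.
```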
6. Write Path, Simplified
• Writes first go to the commitlog
- Append only journal
- Replayed on node startup
- Purged once the node knows that all relevant data is written into sstables (nodetool flush)
- If you use spinning disks, append-only model avoids seeks (as long as commitlog is on its own
partition)
7. Write Path, Simplified
• Then the memtable
- Effectively a write-back cache of rows as they’re written
- Once a row is written to the memtable, the mutation can be counted towards the CONSISTENCY
LEVEL of the query
- Writes are batched in the memtable until it’s ready to flush
8. Write Path, Simplified
• Then flushed to sstables
- At specified thresholds (memtable_(off)heap_space_in_mb * memtable_cleanup_threshold), the
memtable is flushed to disk
- Each sstable is written exactly one time - never changed once it’s written
- If a new write comes in for the same value, it’s written to a new sstable with a new timestamp
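A toy sketch of that flush model (illustrative only, not Cassandra source): a flush turns the memtable into a brand-new immutable file, and a later write to the same key lands in a *later* sstable rather than modifying the old one:

```python
# Toy model of flush: sstables are written once and never modified;
# re-writing a key just puts a newer copy in a newer sstable.
sstables = []  # each entry stands in for one flushed, immutable file
memtable = {}

def write(key, value):
    memtable[key] = value

def flush():
    if memtable:
        sstables.append(dict(memtable))  # new file; old files untouched
        memtable.clear()

write("k", "v1"); flush()
write("k", "v2"); flush()  # the same key now lives in two sstables
```

This is exactly why compaction (next slide) has to exist: nothing else ever reconciles those two copies of `"k"` on disk.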
9. Table Option #1: Compaction Strategy
• If tables are never re-written, how do updates and deletions work? Compaction! Multiple sstables
are joined together, duplicate cells are merged, deleted data is purged (eventually)
• Each table specifies a compaction strategy. Cassandra ships with 3 by default
• SizeTieredCompactionStrategy is the oldest, most mature, tuned for writes
• LeveledCompactionStrategy is tuned for read latency
• DateTieredCompactionStrategy is meant for time series, TTL heavy workloads
10. Table Option #1: Compaction Strategy
• SizeTieredCompactionStrategy
• Every time min_threshold (4) files of the same size appear, combine them together
11. Table Option #1: Compaction Strategy
• SizeTieredCompactionStrategy
• Every time min_threshold (4) files of the same size appear, combine them together
• Over time, older data naturally ends up in larger files
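A toy sketch of that selection logic (not the real STCS code; the 50% similarity ratio is an illustrative assumption): bucket sstables of similar size, and compact any bucket that reaches min_threshold files:

```python
# Toy size-tiered selection: group sstables whose sizes are within
# `bucket_ratio` of each other; once a group reaches min_threshold
# files, those files are merged into one larger sstable.
def pick_compaction_bucket(sizes_mb, min_threshold=4, bucket_ratio=1.5):
    buckets = []
    for size in sorted(sizes_mb):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)  # similar size: same tier
        else:
            buckets.append([size])    # start a new tier
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket             # these files get compacted together
    return None                       # nothing to do yet
```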
13. Table Option #1: Compaction Strategy
• SizeTieredCompactionStrategy Advantages
• Minimizes write amplification
• Very easy to reason about
• Simple algorithm, so unlikely to cause extra CPU/memory usage at flush time
• Flushing is important – complicated compaction strategies that block flushing can be bad (if the
memtable fills before it flushes, stop accepting writes)
14. Table Option #1: Compaction Strategy
• SizeTieredCompactionStrategy Disadvantages
• Deleted data from old files may not be compacted away for a very long time
• Frequently changed cells will live in many files, and must be merged on read
• Read queries may touch a number of files, which is SLOW
15. Table Option #1: Compaction Strategy
• LeveledCompactionStrategy
• Spends extra effort compacting sstables to ensure that each row exists in at most one sstable per
‘level’
• Expected number of sstables touched per read: ~1.11
• Advantage: lower read latency
• Disadvantage: much more IO required
• Typically advantageous when you:
- Read much more than you write
- Are highly sensitive to read latency
- Have rows that change over time (values updated, or values expire)
• Prefer STCS if:
- You can’t spare the IO
- Rows are write-once
- You write far more than you read
16. Table Option #1: Compaction Strategy
• DateTieredCompactionStrategy
• Designed for time series, often TTL heavy workloads
• Assumes writes come in order
• Tries to group sstables by date
• Great in theory
18. Table Option #1: Compaction Strategy
• Takeaway: Choosing the right compaction strategy not only impacts latency, but also IO/CPU, and can
have a huge impact on disk space if you use TTLs
19. Read Path
1. Find the right server using the partition key and the partitioner (probably Murmur3Partitioner)
2. Find the sstables on disk that contain the row in question
3. Find the partition offset in the data files (use cache if possible, otherwise use the partition index
data)
4. The data is then read from the appropriate file
5. Duplicate cells are merged with timestamp resolution (last write wins)
6. If CL > ONE, the coordinator checks multiple replicas, and repairs any that are incorrect
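Step 5 above is simple enough to sketch directly: when the same cell turns up in several sstables, the copy with the highest write timestamp wins (last write wins):

```python
# Timestamp resolution for one cell that appears in multiple sstables:
# the newest write timestamp wins, regardless of file order on disk.
def merge_cells(versions):
    """versions: list of (write_timestamp, value) for a single cell."""
    return max(versions, key=lambda v: v[0])[1]

# The cell was written three times across three sstables;
# the read returns the value with the newest timestamp.
merge_cells([(100, "a"), (300, "c"), (200, "b")])  # "c"
```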
22. Table Option #2: Bloom Filters
• Off-heap data structure that tells Cassandra that the row either “might” or “does not” exist in a
given data file
• Probabilistic: bloom_filter_fp_chance
• Defaults to 0.01 on STCS, 0.1 on LCS (LCS already defragments, so false positives are less
costly)
• Cost: RAM (offheap) – 0.01 uses approximately 3x the memory as 0.1
• Tuning: Adjust based on RAM available and number of sstables.
• For slow disks or lots of sstables, lower fp chance to decrease disk IO
• If you’re memory starved and have few sstables on disk, raise the fp chance and use the RAM
for page cache
• WITH bloom_filter_fp_chance=0.01
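A toy bloom filter makes the “might” vs. “does not” semantics concrete (illustration only; Cassandra’s real implementation is off-heap and differently sized). Finding any unset bit proves the key was never added, so the sstable can be skipped without any disk IO:

```python
import hashlib

# Toy bloom filter: k hashes set k bits per key. A lookup that finds
# any bit unset means "definitely not in this sstable"; all bits set
# only means "might be" (the false-positive case you tune with
# bloom_filter_fp_chance: more bits = fewer false positives = more RAM).
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))
```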
24. Table Option #3: Key Cache
• There’s a row cache – don’t use it
• The key cache helps find the data in the sstable quickly
• If you set the key cache low, there’s a good chance the OS page cache will help, but key cache
will be faster
• WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
26. Table Option #4: Partition Summary / Index
• Maps row key to offset in data file
• It’s not every row key – it’s a sorted sampling
• You can tune the sample parameters: max_index_interval , min_index_interval
• Cassandra will adapt the sample based on sstable read hotness – more frequently read sstables will
get a denser index for more accurate locations on disk
• Again, primarily a RAM tradeoff – lower interval = more RAM = less IO
• WITH max_index_interval = 2048 AND min_index_interval = 128
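The sampling idea can be sketched in a few lines (names are illustrative, not Cassandra’s): keep every Nth key with its file offset, binary-search the sample, then scan forward from the nearest preceding offset. A smaller interval means a denser sample, so the scan starts closer to the target row:

```python
import bisect

# Sparse partition index sketch: sample every `interval`-th key and its
# data-file offset; a lookup binary-searches the sample for the last
# sampled key <= the target, then scans forward on disk from there.
def build_summary(sorted_keys, offsets, interval=128):
    return (sorted_keys[::interval], offsets[::interval])

def nearest_offset(summary, key):
    keys, offsets = summary
    i = bisect.bisect_right(keys, key) - 1  # last sampled key <= key
    return offsets[max(i, 0)]               # start scanning here
```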
28. Table Option #5: Compression
• The sstable is compressed chunk-by-chunk as it’s written (either during flush or compaction)
• Compression offsets are mapped like index offsets
• Larger chunks typically means better compression ratios for most data sets
• Smaller chunks means that if you do go to disk, you have less over-read
• Very literal tradeoff between disk IO and storage capacity – larger chunks = better ratios, but you
may have to read larger chunks off the disk when it’s not cached in RAM
• Data size dependent: 64k read for 500 bytes of data may severely limit your read performance
• WITH compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
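The over-read tradeoff is easy to demonstrate (a sketch, not Cassandra’s on-disk format): data is compressed chunk-at-a-time, so fetching even a tiny value means decompressing the entire chunk it lives in:

```python
import zlib

# Chunk-at-a-time compression sketch: larger chunks compress better,
# but a read of a few hundred bytes still pulls and decompresses the
# whole chunk containing them (the over-read described above).
def compress_chunks(data: bytes, chunk_len: int):
    return [zlib.compress(data[i:i + chunk_len])
            for i in range(0, len(data), chunk_len)]

def read_value(chunks, chunk_len, offset, length):
    # Assumes the value fits inside a single chunk.
    chunk = zlib.decompress(chunks[offset // chunk_len])  # whole chunk read
    start = offset % chunk_len
    return chunk[start:start + length]
```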
30. Table Option #6: Correctness (CRC)
• Compressed tables have a checksum embedded in the compression data
• Cassandra can verify that checksum on decompression, IF you want
• WITH crc_check_chance = 1.0
• Uncompressed files have NO CORRECTNESS VALIDATION in the read path – if you have disk
based bit rot, Cassandra won’t know unless you run manual sstable verify (nodetool verify)
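The mechanism amounts to this (sketch only; names are illustrative): each chunk carries a CRC, and on read it is verified with probability crc_check_chance, so 1.0 means every read is checked and 0.0 means corruption passes through silently:

```python
import random
import zlib

# Probabilistic checksum verification sketch: verify the chunk's CRC
# with probability `check_chance` (the crc_check_chance table option).
def read_chunk(payload: bytes, stored_crc: int, check_chance: float,
               rng=random.random) -> bytes:
    if rng() < check_chance and zlib.crc32(payload) != stored_crc:
        raise IOError("chunk failed CRC check (possible bit rot)")
    return payload
```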
32. Table Option #7: Clustering
• Each partition is written once in the file
• Values in the partition are sorted based on clustering order
• In CQL3 terms, this means values are sorted by the clustering key columns
• Because records are sorted when written, retrieving a range of clustering keys is incredibly fast
(nearly free)
• Normal sort order is ascending! If you need descending, flip the order in the schema so the read
path can do a single linear pass:
• WITH CLUSTERING ORDER BY (sensor_reading_timestamp DESC)
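The “nearly free” claim follows from the storage layout, and a sketch shows why: because rows within a partition are kept sorted by clustering key, a range query is one binary search plus a linear scan, with no per-row seeking:

```python
import bisect

# Range read over a sorted partition: binary-search both bounds, then
# return the contiguous slice - no scatter-gather across the partition.
def range_query(rows, lo, hi):
    """rows: list of (clustering_key, value), sorted by clustering_key."""
    keys = [k for k, _ in rows]
    return rows[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]
```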
34. Table Option #8: Correctness (Read Repair)
• Depending on your consistency level, the coordinator will ask multiple replicas for the data
• One will return the data; others will return a digest
• If the digest doesn’t match the data, the coordinator will choose the value with the highest
timestamp, and make sure all replicas have that value – you cannot disable this type of read
repair, except by querying with CL:ONE
• If the digest does match for the replicas returned, but you’re using CL < ALL, you can have
Cassandra read-repair that cell anyway just in case:
• WITH dclocal_read_repair_chance = 0.01 AND read_repair_chance = 0.0
36. Table Option #9: Avoiding Timeouts
• Typical Cassandra use cases have RF > 1
• You may ask for data from X nodes, where X < RF
• If one of those nodes is slow to respond (query load, compaction load, JVM GC), Cassandra can
try other replicas before waiting for the full 10s timeout
• “Speculative Retry” is configurable based on logical time limits, like 99% latencies
• WITH speculative_retry = '99PERCENTILE'
• Watch out: speculative retry may violate LOCAL_ datacenter consistency levels (for now)
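A rough sketch of how a 99PERCENTILE threshold could be derived (illustrative, not Cassandra’s internal estimator): track recent read latencies per table, and fire a request at another replica once the first replica has been quiet longer than the observed p99:

```python
# Speculative-retry sketch: if the first replica hasn't answered within
# the recently observed p99 latency, try another replica rather than
# waiting out the full request timeout.
def percentile(samples, pct):
    ordered = sorted(samples)
    idx = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[idx]

def should_speculate(elapsed_ms, recent_latencies_ms):
    return elapsed_ms > percentile(recent_latencies_ms, 99)
```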
37. Lots of Options, Lots of Flexibility
• Choose compaction based on write / read PATTERNS
• Choose bloom filter FP chance based on read latency and memory available
• Enable the key cache, but probably not the row cache
• You can tune the index interval if you have really hot and really cold sstables
• Compression chunk size can control how much data you read off of the disk at a time, or how
well your data compresses
• Compression gives you CRCs to guard against corruption, and you can tune whether or not
they’re used
• SSTables are inherently sorted, use clustering order options as it fits your data
• Foreground read repair can’t be disabled, but background read repair can be used to help speed
up ‘eventual’ consistency
• Speculative retry can help avoid timeouts and/or drop your 99.9% latencies
38. That’s it!
• You can talk to me about Cassandra on Twitter ( @jjirsa )
• There’s an active Cassandra community in IRC: irc.freenode.net #cassandra
• Crowdstrike is hiring: www.crowdstrike.com/careers/
• Huge thanks to Datastax and Hulu for making this meetup happen!