Cassandra Internals
Understanding Cassandra internals to solve real-world problems
Cassandra London Meetup – July 2013
Nicolas Favre-Felix
Software Engineer
@yowgi – @acunu
A lot to talk about
• Memtable
• SSTable
• Commit log
• Row Cache
• Key Cache
• Compaction
• Secondary indexes
• Bloom Filters
• Index samples
• Column indexes
• Thrift
• CQL
Four real-world problems
1. High latency in a read-heavy workload
2. High CPU usage with little activity on the cluster
3. nodetool repair taking too long to complete
4. Optimising for the highest insert throughput
Context
• Acunu professional services for Apache Cassandra
• 24x7 support for questions and emergencies
• Cluster “health check” sessions
• Cassandra Training & Workshop
“Reading takes too long”
Symptoms
• High latency observed in read operations
• Thousands of read requests per second
Staged Event-Driven Architecture (SEDA)
SEDA in Cassandra
• Stages in Cassandra have different roles
• MutationStage for writes
• ReadStage for reads
• ... 10 or so in total
• Each Stage is backed by a thread pool (sketched below)
• Not all task queues are bounded
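To make the stage model concrete, here is a minimal Java sketch of a stage as a named, fixed-size thread pool draining a task queue. It is illustrative only (not Cassandra's actual implementation), and all class and parameter names are invented for the example.

import java.util.concurrent.*;

// Illustrative sketch of a SEDA-style stage: a fixed pool of worker threads
// draining a task queue. Not Cassandra's actual implementation.
class ToyStage {
    private final ThreadPoolExecutor executor;

    ToyStage(String name, int threads, int queueCapacity) {
        // A bounded queue pushes back on producers; an unbounded queue lets
        // tasks pile up, which is what the "Pending" column of tpstats counts.
        BlockingQueue<Runnable> queue = (queueCapacity > 0)
                ? new LinkedBlockingQueue<Runnable>(queueCapacity)
                : new LinkedBlockingQueue<Runnable>();
        executor = new ThreadPoolExecutor(threads, threads, 60, TimeUnit.SECONDS,
                queue, r -> new Thread(r, name + "-worker"));
    }

    void submit(Runnable task) { executor.execute(task); }
    int active()               { return executor.getActiveCount(); }
    int pending()              { return executor.getQueue().size(); }
    long completed()           { return executor.getCompletedTaskCount(); }
}

The three accessors at the bottom map directly onto the Active, Pending and Completed columns that nodetool tpstats reports for every stage, as shown a couple of slides further on.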
ReadStage
• Not all reads are equal:
• Some served from in-memory data structures
• Some served from the Linux page cache
• Some need to hit disk, possibly more than once
• Read operations can be disk-bound
• Avoid saturating disk with random reads
• Recommended pool size: 16×number_of_drives
nodetool tpstats
Pool Name               Active   Pending    Completed
ReadStage                   16      3197    733819430
RequestResponseStage         0         0      3381277
MutationStage                5         0      1130984
ReadRepairStage              0         0     80095473
ReplicateOnWriteStage        0         0      4728857
GossipStage                  0         0     20252373
AntiEntropyStage             0         0         2228
MigrationStage               0         0           19
MemtablePostFlusher          0         0          839
StreamStage                  0         0           40
FlushWriter                  0         0         2349
MiscStage                    0         0            0
commitlog_archiver           0         0            0
AntiEntropySessions          0         0           11
InternalResponseStage        0         0            7
HintedHandoff                0         0         6018
Solution
• iostat: little I/O activity
• free: large amount of memory used to cache pages
• → Increased concurrent_reads to 32 (cassandra.yaml excerpt below)
• → Latency dropped to reasonable levels
• Recommendations:
• Reduce the number of reads
• Keep an eye on I/O as data grows
• Buy more disks or RAM when falling out of cache
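The change itself is a single setting in cassandra.yaml; a sketch of the relevant line, with the sizing guideline from the ReadStage slide as the starting point:

# cassandra.yaml: size of the ReadStage thread pool.
# Guideline: 16 x number_of_drives, then adjust based on iostat and observed latency.
concurrent_reads: 32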
“Cassandra is busy doing nothing”
Context
• 2-node cluster
• Little activity on the cluster
• Very high CPU usage on the nodes
• Storing metadata on published web content
nodetool cfhistograms
• Node-local histograms, kept per column family on each node (invocation example below)
• Distribution of number of files accessed per read
• Distribution of read and write latencies
• Distribution of row sizes and column counts
• Buckets are approximate but still very useful
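For reference, the histograms on the next few slides come from an invocation along these lines (keyspace and column family names are placeholders):

nodetool cfhistograms <keyspace> <column_family>

Each line of output is a bucket ("Offset") with the number of operations or rows that fell into it, covering SSTables per read, read and write latency, row size, and column count.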
SSTables accessed per read
[Histogram: number of reads (up to ~3,000,000) by number of SSTables accessed per read (0 to 10)]
Row size distribution (bytes)
[Histogram: number of rows (0 to 5) by row size in bytes (up to ~25,000,000)]
Column count distribution
[Histogram: number of rows (0 to 10) by number of columns per row (up to ~6,000,000)]
Read latency distribution (µsec)
[Histogram: number of reads (up to ~900,000) by read latency, from 1 µs to 1,000,000 µs on a log scale]
Data model issue
• Row key was “views”
• Column names were item names; column values were counters
• Cassandra stored only a few massive rows
• → Reading from many SSTables
• → De-serialising large column indexes
views → { post-1234: 77, post-1240: 8, post-1250: 3, ... }
CF read latency & column index
(taken from Aaron Morton’s talk at Cassandra SF 2012)
[Chart: read latency in microseconds at the 85th, 95th and 99th percentiles, comparing a read of the first column from a 1,200-column row vs. a 1,000,000-column row]
Solution
• “Transpose” the table:
• Make the item name the row key
• Have a few counters per item
• Distribute the rows across the whole cluster
post-123 → { views: 9078, comments: 3 }
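In CQL3, the transposed model above might look like the sketch below; table and column names are illustrative rather than taken from the original schema:

CREATE TABLE item_counters (
    item_id  text PRIMARY KEY,
    views    counter,
    comments counter
);

-- Each update now touches one small, well-distributed row:
UPDATE item_counters SET views = views + 1 WHERE item_id = 'post-123';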
“nodetool repair takes ages”
Nodetool repair
• “Active Anti-Entropy” mechanism in Cassandra
• Synchronises replicas
• Running repair is important to replicate tombstones
• Should run at least once every gc_grace_seconds (10 days by default)
• Repair was taking a week to complete
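As a reminder of the mechanics, repair is started per node with nodetool; the -pr option (available in recent 1.x releases) limits it to the node's primary range so each range is not repaired once per replica. The keyspace name is a placeholder:

nodetool repair -pr <keyspace>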
Two phases
1. Contact replicas, ask for MerkleTrees
   1. They scan their local data and send a tree back
2. Compare MerkleTrees between replicas
   1. Identify differences
   2. Stream blocks of data out to other nodes
   3. Stream data in and merge locally
MerkleTrees
• Hashes of hashes of ... data
• 2^15 = 32,768 leaf nodes

                   top hash                (in memory)
                 /          \
           hash-0            hash-1
          /      \          /      \
     hash-00   hash-01  hash-10   hash-11
        |         |        |         |
     block 0   block 1  block 2   block 3   (on disk)
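To show why the number of differing leaves drives the amount of streaming, here is a rough Java sketch of the comparison step. It is not Cassandra's MerkleTree class; the representation of leaves as plain hash arrays is an assumption made for the example.

import java.util.*;

// Illustrative only. Each replica hashes its data into 2^15 = 32,768 leaf
// buckets; every leaf whose hash differs between two replicas marks a token
// range that has to be streamed between them.
class TreeDiff {
    static List<Integer> outOfSyncLeaves(byte[][] local, byte[][] remote) {
        List<Integer> differing = new ArrayList<>();
        for (int i = 0; i < local.length; i++) {
            if (!Arrays.equals(local[i], remote[i])) {
                differing.add(i);
            }
        }
        return differing;
    }
}

With only 32,768 leaves per tree, each leaf covers a wide slice of the data, so even a modest fraction of differing leaves (the ~14% in the diagnostic that follows) can translate into hours of streaming.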
Cassandra logs
• MerkleTree requests and responses
• Check how long it took
• Differences found, in number of leaf nodes
• More differences → more data to stream
• Streaming sessions starting and ending
Diagnostic
• Building MerkleTrees: 20-30 minutes
• “4,700 ranges out of sync” (~14% of 32,768)
• Streaming session to repair the range: 4.5 hours
• Much slower rate than expected
Solutions
• Increase consistency level from ONE
• Rely on read repair to decrease entropy
• Fix problem of dropped writes
• Review data model and cluster size
• Add more disks and RAM, maybe more nodes
• Investigate network issues (speed, partitions?)
• Monitor both phases of the repair process
“How can we write faster?”
Context
• Time-series data from 1 million sensors
• 40 data points (e.g. temperature, pressure...)
• Sent in one batch every 5 minutes
• 40M cols / 5 min = 133,000 cols/sec
• One node...
Data model 1
• One row per (sensor, day)
• Metrics columns grouped by minute within the row
• Range queries between minutes A and B within a day
CREATE TABLE sensor_data (
    sensor_id text,
    day int,
    hour int,
    minute int,
    metric1 int,
    [...]
    metric40 int,
    PRIMARY KEY ((sensor_id, day), minute)
);
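The write pattern and the within-day range query described above would look roughly like this against that table; the values and the minute-of-day encoding are illustrative:

-- 40 metric columns land in the (sensor, day) row every 5 minutes
-- (only two of the 40 metrics shown):
INSERT INTO sensor_data (sensor_id, day, hour, minute, metric1, metric40)
VALUES ('sensor-000001', 20130711, 12, 725, 21, 1013);

-- Range query between two minutes of the same day:
SELECT minute, metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711
  AND minute >= 540 AND minute < 600;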
Data model 1
• At 12:00, insert 40 cols into row (sensor1, 2013-07-11)
• At 12:05, insert 40 cols into row (sensor1, 2013-07-11)
• These columns might not be written to the same file
• Compaction process needs to merge them together:
• Large amounts of overlap between SSTables
• Compaction rate is around 500 KB/sec (see nodetool compactionstats below)
• 30% CPU usage spent compacting; no issues with I/O
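One way to keep an eye on that throughput and on the compaction backlog is nodetool compactionstats, which lists pending compaction tasks and the progress of each running compaction:

nodetool compactionstats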
Data model 2
• One row per (sensor, day, minute)
• No range query within the day (need to enumerate minutes; see the queries below)
• Compaction now reaching 7 MB/sec
• Tests show a 10-20% increase in throughput
- PRIMARY KEY ((sensor_id, day), minute)
+ PRIMARY KEY ((sensor_id, day, minute))
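With the new key, every (sensor, day, minute) combination is its own partition, so reads within a day enumerate the minutes they need. A sketch, reusing the illustrative names from the previous slide:

SELECT metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711 AND minute = 725;

-- Several minutes at once, using IN on the last partition key column:
SELECT minute, metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711 AND minute IN (540, 545, 550);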
Next steps
• Workload is CPU-bound, disks are not a problem
• Larger memtables mean lower write amplification
• Managed to flush after 400k ops instead of 200k
• Track time spent in GC with jstat -gcutil (example below)
• At this rate, consider adding more nodes
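A typical way to sample GC activity once a second from the command line; the PID is a placeholder for the Cassandra process:

jstat -gcutil <cassandra-pid> 1000

The YGCT and FGCT columns accumulate time spent in young and full collections; time spent in GC is time not spent writing.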
Four problems, four solutions
1. Interactions between Cassandra and the hardware
2. Implications of a bad data model at the storage layer
3. Internal data structures and processes
4. Work involved in arranging data on disk
Guidelines
• Monitor Cassandra, OS, JVM, hardware
• Learn how to use nodetool
• Follow best practices in data modelling and sizing
• Keep an eye on the Cassandra logs
• Think of the available resources as shared between different kinds of “work”
Thank you!