Cassandra Internals
Understanding Cassandra internals to solve real-world problems
Cassandra London Meetup – July 2013
Nicolas Favre-Felix
Software Engineer
@yowgi – @acunu
A lot to talk about
• Memtable
• SSTable
• Commit log
• Row Cache
• Key Cache
• Compaction
• Secondary indexes
• Bloom Filters
• Index samples
• Column indexes
• Thrift
• CQL
Four real-world problems
1. High latency in a read-heavy workload
2. High CPU usage with little activity on the cluster
3. nodetool repair taking too long to complete
4. Optimising for the highest insert throughput
Context
• Acunu professional services for Apache Cassandra
• 24x7 support for questions and emergencies
• Cluster “health check” sessions
• Cassandra Training & Workshop
“Reading takes too long”
Symptoms
• High latency observed in read operations
• Thousands of read requests per second
Staged Event-Driven Architecture (SEDA)
SEDA in Cassandra
• Stages in Cassandra have different roles
• MutationStage for writes
• ReadStage for reads
• ... 10 or so in total
• Each Stage is backed by a thread pool (sketched below)
• Not all task queues are bounded
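To make the stage model concrete, here is a minimal Java sketch of a stage as a named, fixed-size thread pool draining a task queue. It is illustrative only (not Cassandra's actual implementation), and all class and parameter names are invented for the example.

import java.util.concurrent.*;

// Illustrative sketch of a SEDA-style stage: a fixed pool of worker threads
// draining a task queue. Not Cassandra's actual implementation.
class ToyStage {
    private final ThreadPoolExecutor executor;

    ToyStage(String name, int threads, int queueCapacity) {
        // A bounded queue pushes back on producers; an unbounded queue lets
        // tasks pile up, which is what the "Pending" column of tpstats counts.
        BlockingQueue<Runnable> queue = (queueCapacity > 0)
                ? new LinkedBlockingQueue<Runnable>(queueCapacity)
                : new LinkedBlockingQueue<Runnable>();
        executor = new ThreadPoolExecutor(threads, threads, 60, TimeUnit.SECONDS,
                queue, r -> new Thread(r, name + "-worker"));
    }

    void submit(Runnable task) { executor.execute(task); }
    int active()               { return executor.getActiveCount(); }
    int pending()              { return executor.getQueue().size(); }
    long completed()           { return executor.getCompletedTaskCount(); }
}

The three accessors at the bottom map directly onto the Active, Pending and Completed columns that nodetool tpstats reports for every stage, as shown a couple of slides further on.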
ReadStage
• Not all reads are equal:
• Some served from in-memory data structures
• Some served from the Linux page cache
• Some need to hit disk, possibly more than once
• Read operations can be disk-bound
• Avoid saturating disk with random reads
• Recommended pool size: 16×number_of_drives
nodetool tpstats
Pool Name               Active   Pending    Completed
ReadStage                   16      3197    733819430
RequestResponseStage         0         0      3381277
MutationStage                5         0      1130984
ReadRepairStage              0         0     80095473
ReplicateOnWriteStage        0         0      4728857
GossipStage                  0         0     20252373
AntiEntropyStage             0         0         2228
MigrationStage               0         0           19
MemtablePostFlusher          0         0          839
StreamStage                  0         0           40
FlushWriter                  0         0         2349
MiscStage                    0         0            0
commitlog_archiver           0         0            0
AntiEntropySessions          0         0           11
InternalResponseStage        0         0            7
HintedHandoff                0         0         6018
Solution
• iostat: little I/O activity
• free: large amount of memory used to cache pages
• → Increased concurrent_reads to 32 (cassandra.yaml excerpt below)
• → Latency dropped to reasonable levels
• Recommendations:
• Reduce the number of reads
• Keep an eye on I/O as data grows
• Buy more disks or RAM when falling out of cache
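The change itself is a single setting in cassandra.yaml; a sketch of the relevant line, with the sizing guideline from the ReadStage slide as the starting point:

# cassandra.yaml: size of the ReadStage thread pool.
# Guideline: 16 x number_of_drives, then adjust based on iostat and observed latency.
concurrent_reads: 32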
“Cassandra is busy doing nothing”
Context
• 2-node cluster
• Little activity on the cluster
• Very high CPU usage on the nodes
• Storing metadata on published web content
nodetool cfhistograms
• Node-local histograms, kept per column family on each node (invocation example below)
• Distribution of number of files accessed per read
• Distribution of read and write latencies
• Distribution of row sizes and column counts
• Buckets are approximate but still very useful
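For reference, the histograms on the next few slides come from an invocation along these lines (keyspace and column family names are placeholders):

nodetool cfhistograms <keyspace> <column_family>

Each line of output is a bucket ("Offset") with the number of operations or rows that fell into it, covering SSTables per read, read and write latency, row size, and column count.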
SSTables accessed per read
[Histogram: number of reads (up to ~3,000,000) by number of SSTables accessed per read (0 to 10)]
Row size distribution (bytes)
[Histogram: number of rows (0 to 5) by row size in bytes (up to ~25,000,000)]
Column count distribution
[Histogram: number of rows (0 to 10) by number of columns per row (up to ~6,000,000)]
Read latency distribution (µsec)
[Histogram: number of reads (up to ~900,000) by read latency, from 1 µs to 1,000,000 µs on a log scale]
Data model issue
• Row key was “views”
• Column names were item names; column values were counters
• Cassandra stored only a few massive rows
• → Reading from many SSTables
• → De-serialising large column indexes
views → { post-1234: 77, post-1240: 8, post-1250: 3, ... }
CF read latency & column index
(taken from Aaron Morton’s talk at Cassandra SF 2012)
[Chart: read latency in microseconds at the 85th, 95th and 99th percentiles, comparing a read of the first column from a 1,200-column row vs. a 1,000,000-column row]
Solution
• “Transpose” the table:
• Make the item name the row key
• Have a few counters per item
• Distribute the rows across the whole cluster
post-123 → { views: 9078, comments: 3 }
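In CQL3, the transposed model above might look like the sketch below; table and column names are illustrative rather than taken from the original schema:

CREATE TABLE item_counters (
    item_id  text PRIMARY KEY,
    views    counter,
    comments counter
);

-- Each update now touches one small, well-distributed row:
UPDATE item_counters SET views = views + 1 WHERE item_id = 'post-123';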
“nodetool repair takes ages”
Nodetool repair
• “Active Anti-Entropy” mechanism in Cassandra
• Synchronises replicas
• Running repair is important to replicate tombstones
• Should run at least once every gc_grace_seconds (10 days by default)
• Repair was taking a week to complete
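As a reminder of the mechanics, repair is started per node with nodetool; the -pr option (available in recent 1.x releases) limits it to the node's primary range so each range is not repaired once per replica. The keyspace name is a placeholder:

nodetool repair -pr <keyspace>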
Two phases
1. Contact replicas, ask for MerkleTrees
   1. They scan their local data and send a tree back
2. Compare MerkleTrees between replicas
   1. Identify differences
   2. Stream blocks of data out to other nodes
   3. Stream data in and merge locally
MerkleTrees
• Hashes of hashes of ... data
• 2^15 = 32,768 leaf nodes

                   top hash                (in memory)
                 /          \
           hash-0            hash-1
          /      \          /      \
     hash-00   hash-01  hash-10   hash-11
        |         |        |         |
     block 0   block 1  block 2   block 3   (on disk)
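To show why the number of differing leaves drives the amount of streaming, here is a rough Java sketch of the comparison step. It is not Cassandra's MerkleTree class; the representation of leaves as plain hash arrays is an assumption made for the example.

import java.util.*;

// Illustrative only. Each replica hashes its data into 2^15 = 32,768 leaf
// buckets; every leaf whose hash differs between two replicas marks a token
// range that has to be streamed between them.
class TreeDiff {
    static List<Integer> outOfSyncLeaves(byte[][] local, byte[][] remote) {
        List<Integer> differing = new ArrayList<>();
        for (int i = 0; i < local.length; i++) {
            if (!Arrays.equals(local[i], remote[i])) {
                differing.add(i);
            }
        }
        return differing;
    }
}

With only 32,768 leaves per tree, each leaf covers a wide slice of the data, so even a modest fraction of differing leaves (the ~14% in the diagnostic that follows) can translate into hours of streaming.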
Cassandra logs
• MerkleTree requests and responses
• Check how long it took
• Differences found, in number of leaf nodes
• More differences → more data to stream
• Streaming sessions starting and ending
Diagnostic
• Building MerkleTrees: 20-30 minutes
• “4,700 ranges out of sync” (~14% of 32,768)
• Streaming session to repair the range: 4.5 hours
• Much slower rate than expected
Solutions
• Increase consistency level from ONE
• Rely on read repair to decrease entropy
• Fix problem of dropped writes
• Review data model and cluster size
• Add more disks and RAM, maybe more nodes
• Investigate network issues (speed, partitions?)
• Monitor both phases of the repair process
“How can we write faster?”
Context
• Time-series data from 1 million sensors
• 40 data points (e.g. temperature, pressure...)
• Sent in one batch every 5 minutes
• 40M cols / 5 min = 133,000 cols/sec
• One node...
Data model 1
• One row per (sensor, day)
• Metrics columns grouped by minute within the row
• Range queries between minutes A and B within a day
CREATE TABLE sensor_data (
    sensor_id text,
    day int,
    hour int,
    minute int,
    metric1 int,
    [...]
    metric40 int,
    PRIMARY KEY ((sensor_id, day), minute)
);
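The write pattern and the within-day range query described above would look roughly like this against that table; the values and the minute-of-day encoding are illustrative:

-- 40 metric columns land in the (sensor, day) row every 5 minutes
-- (only two of the 40 metrics shown):
INSERT INTO sensor_data (sensor_id, day, hour, minute, metric1, metric40)
VALUES ('sensor-000001', 20130711, 12, 725, 21, 1013);

-- Range query between two minutes of the same day:
SELECT minute, metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711
  AND minute >= 540 AND minute < 600;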
Data model 1
• At 12:00, insert 40 cols into row (sensor1, 2013-07-11)
• At 12:05, insert 40 cols into row (sensor1, 2013-07-11)
• These columns might not be written to the same file
• Compaction process needs to merge them together:
• Large amounts of overlap between SSTables
• Compaction rate is around 500 KB/sec (see nodetool compactionstats below)
• 30% CPU usage spent compacting; no issues with I/O
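One way to keep an eye on that throughput and on the compaction backlog is nodetool compactionstats, which lists pending compaction tasks and the progress of each running compaction:

nodetool compactionstats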
Data model 2
• One row per (sensor, day, minute)
• No range query within the day (need to enumerate minutes; see the queries below)
• Compaction now reaching 7 MB/sec
• Tests show a 10-20% increase in throughput
- PRIMARY KEY ((sensor_id, day), minute)
+ PRIMARY KEY ((sensor_id, day, minute))
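With the new key, every (sensor, day, minute) combination is its own partition, so reads within a day enumerate the minutes they need. A sketch, reusing the illustrative names from the previous slide:

SELECT metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711 AND minute = 725;

-- Several minutes at once, using IN on the last partition key column:
SELECT minute, metric1, metric40 FROM sensor_data
WHERE sensor_id = 'sensor-000001' AND day = 20130711 AND minute IN (540, 545, 550);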
Next steps
• Workload is CPU-bound, disks are not a problem
• Larger memtables mean lower write amplification
• Managed to flush after 400k ops instead of 200k
• Track time spent in GC with jstat -gcutil (example below)
• At this rate, consider adding more nodes
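A typical way to sample GC activity once a second from the command line; the PID is a placeholder for the Cassandra process:

jstat -gcutil <cassandra-pid> 1000

The YGCT and FGCT columns accumulate time spent in young and full collections; time spent in GC is time not spent writing.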
Four problems, four solutions
1. Interactions between Cassandra and the hardware
2. Implications of a bad data model at the storage layer
3. Internal data structures and processes
4. Work involved in arranging data on disk
Guidelines
• Monitor Cassandra, OS, JVM, hardware
• Learn how to use nodetool
• Follow best practices in data modelling and sizing
• Keep an eye on the Cassandra logs
• Think of the available resources as shared between different kinds of “work”
Thank you!