Understanding Cassandra internals to solve real-world problems

  1. Cassandra Internals
     Cassandra London Meetup – July 2013
     Nicolas Favre-Felix, Software Engineer (@yowgi – @acunu)
  2. A lot to talk about
     • Memtable
     • SSTable
     • Commit log
     • Row Cache
     • Key Cache
     • Compaction
     • Secondary indexes
     • Bloom Filters
     • Index samples
     • Column indexes
     • Thrift
     • CQL
  3. Four real-world problems
     1. High latency in a read-heavy workload
     2. High CPU usage with little activity on the cluster
     3. nodetool repair taking too long to complete
     4. Optimising for the highest insert throughput
  4. Context
     • Acunu professional services for Apache Cassandra
     • 24x7 support for questions and emergencies
     • Cluster “health check” sessions
     • Cassandra Training & Workshops
  5. “Reading takes too long”
  6. Symptoms
     • High latency observed in read operations
     • Thousands of read requests per second
  7. Staged Event-Driven Architecture (SEDA)
  8. SEDA in Cassandra
     • Stages in Cassandra have different roles:
       • MutationStage for writes
       • ReadStage for reads
       • ... 10 or so in total
     • Each Stage is backed by a thread pool
     • Not all task queues are bounded
  9. ReadStage
     • Not all reads are equal:
       • Some served from in-memory data structures
       • Some served from the Linux page cache
       • Some need to hit disk, possibly more than once
     • Read operations can be disk-bound
     • Avoid saturating disk with random reads
     • Recommended pool size: 16 × number_of_drives
  10. nodetool tpstats
     Pool Name               Active  Pending   Completed
     ReadStage                   16     3197   733819430
     RequestResponseStage         0        0     3381277
     MutationStage                5        0     1130984
     ReadRepairStage              0        0    80095473
     ReplicateOnWriteStage        0        0     4728857
     GossipStage                  0        0    20252373
     AntiEntropyStage             0        0        2228
     MigrationStage               0        0          19
     MemtablePostFlusher          0        0         839
     StreamStage                  0        0          40
     FlushWriter                  0        0        2349
     MiscStage                    0        0           0
     commitlog_archiver           0        0           0
     AntiEntropySessions          0        0          11
     InternalResponseStage        0        0           7
     HintedHandoff                0        0        6018
  11. Solution
     • iostat: little I/O activity
     • free: large amount of memory used to cache pages
     • → Increased concurrent_reads to 32
     • → Latency dropped to reasonable levels
     • Recommendations:
       • Reduce the number of reads
       • Keep an eye on I/O as data grows
       • Buy more disks or RAM when falling out of cache
  12. “Cassandra is busy doing nothing”
  13. Context
     • 2-node cluster
     • Little activity on the cluster
     • Very high CPU usage on the nodes
     • Storing metadata on published web content
  14. nodetool cfhistograms
     • Node-local histogram stored per CF, per node
     • Distribution of number of files (SSTables) accessed per read
     • Distribution of read and write latencies
     • Distribution of row sizes and column counts
     • Buckets are approximate but still very useful
  15. SSTables accessed per read
     (histogram: number of reads vs. number of SSTables accessed, 0 to 10)
  16. Row size distribution (bytes)
     (histogram: number of rows vs. row size in bytes)
  17. Column count distribution
     (histogram: number of rows vs. number of columns)
  18. Read latency distribution (µsec)
     (histogram: number of reads vs. latency in µsec, 1 to 1,000,000)
  19. Data model issue
     • Row key was “views”
     • Column names were item names, values were counters
     • Cassandra stored only a few massive rows
     • → Reading from many SSTables
     • → De-serialising large column indexes
     Example row:
       views: { post-1234: 77, post-1240: 8, post-1250: 3 }
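
     A rough CQL sketch of this layout (hypothetical table and column names; the original schema is not shown on the slides and may well have been a Thrift column family):

         -- One wide counter row whose partition key is the literal string 'views',
         -- with one counter cell per item: all counters pile into a single partition.
         CREATE TABLE page_stats (
             stat_name text,       -- always 'views' in practice
             item_id   text,       -- e.g. 'post-1234'
             total     counter,
             PRIMARY KEY (stat_name, item_id)
         );

         -- Every increment lands in the same massive 'views' partition:
         UPDATE page_stats SET total = total + 1
          WHERE stat_name = 'views' AND item_id = 'post-1234';
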
  20. CF read latency & column index (taken from Aaron Morton’s talk at Cassandra SF 2012)
     (chart: read latency in microseconds at the 85th, 95th and 99th percentiles, reading the first column from a row of 1,200 columns vs. from a row of 1,000,000 columns)
  21. Solution
     • “Transpose” the table:
       • Make the item name the row key
       • Have a few counters per item
       • Distribute the rows across the whole cluster
     Example row:
       post-123: { views: 9078, comments: 3 }
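
     The transposed model, again as a hedged CQL sketch with assumed names:

         -- One small partition per item, each holding a handful of counters.
         CREATE TABLE item_stats (
             item_id  text PRIMARY KEY,   -- e.g. 'post-123'
             views    counter,
             comments counter
         );

         UPDATE item_stats SET views = views + 1 WHERE item_id = 'post-123';
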
  22. “nodetool repair takes ages”
  23. nodetool repair
     • “Active Anti-Entropy” mechanism in Cassandra
     • Synchronises replicas
     • Running repair is important to replicate tombstones
     • Should run at least once every 10 days (i.e. within gc_grace_seconds, which defaults to 10 days)
     • Here, repair was taking a week to complete
  24. Two phases
     1. Contact replicas, ask for MerkleTrees
        1. They scan their local data and send a tree back
     2. Compare MerkleTrees between replicas
        1. Identify differences
        2. Stream blocks of data out to other nodes
        3. Stream data in and merge locally
  25. MerkleTrees
     (diagram: a binary hash tree held in memory, with a top hash over hash-0 and hash-1, then hash-00 to hash-11, whose leaves hash data blocks 0 to 3 on disk)
     • Hashes of hashes of ... data
     • 2^15 = 32,768 leaf nodes
  26. Cassandra logs
     • MerkleTree requests and responses
       • Check how long they took
     • Differences found, in number of leaf nodes
       • More differences → more data to stream
     • Streaming sessions starting and ending
  27. Diagnostic
     • Building MerkleTrees: 20-30 minutes
     • “4,700 ranges out of sync” (~14% of 32,768)
     • Streaming session to repair the range: 4.5 hours
     • Much slower rate than expected
  28. Solutions
     • Increase consistency level from ONE
     • Rely on read repair to decrease entropy
     • Fix problem of dropped writes
     • Review data model and cluster size
     • Add more disks and RAM, maybe more nodes
     • Investigate network issues (speed, partitions?)
     • Monitor both phases of the repair process
  29. “How can we write faster?”
  30. Context
     • Time-series data from 1 million sensors
     • 40 data points (e.g. temperature, pressure, ...)
     • Sent in one batch every 5 minutes
     • 40M cols / 5 min ≈ 133,000 cols/sec
     • One node...
  31. Data model 1
     • One row per (sensor, day)
     • Metric columns grouped by minute within the row
     • Range queries between minutes A and B within a day

         CREATE TABLE sensor_data (
             sensor_id text,
             day       int,
             hour      int,
             minute    int,
             metric1   int,
             [...]
             metric40  int,
             PRIMARY KEY ((sensor_id, day), minute)
         );
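
     The kind of range query this model supports, as an illustrative sketch (the sensor id, day encoding and minute values are made up; minute is assumed to be the minute of the day):

         SELECT minute, metric1, metric40
           FROM sensor_data
          WHERE sensor_id = 'sensor-42'
            AND day = 20130711
            AND minute >= 600 AND minute <= 660;
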
  32. Data model 1
     • At 12:00, insert 40 cols into row (sensor1, 2013-07-11)
     • At 12:05, insert 40 cols into row (sensor1, 2013-07-11)
     • These columns might not be written to the same file
     • Compaction process needs to merge them together:
       • Large amounts of overlap between SSTables
       • Rate is around 500 KB/sec
       • 30% CPU usage spent compacting; no issues with I/O
  33. Data model 2
     • One row per (sensor, day, minute)
     • No range query within the day (need to enumerate)
     • Compaction now reaching 7 MB/sec
     • Tests show a 10-20% increase in throughput

         - PRIMARY KEY ((sensor_id, day), minute)
         + PRIMARY KEY ((sensor_id, day, minute))
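
     A hedged sketch of the full revised table and the write path it gives (the table name sensor_data_v2 and literal values are hypothetical; metrics 2 to 39 omitted for brevity, as on the slide):

         CREATE TABLE sensor_data_v2 (
             sensor_id text,
             day       int,
             minute    int,
             metric1   int,
             metric40  int,    -- metric2 ... metric39 omitted here
             PRIMARY KEY ((sensor_id, day, minute))   -- the whole key is the partition key
         );

         -- Each 5-minute batch now writes small, self-contained partitions:
         INSERT INTO sensor_data_v2 (sensor_id, day, minute, metric1, metric40)
         VALUES ('sensor-42', 20130711, 725, 21, 1013);
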
  34. Next steps
     • Workload is CPU-bound, disks are not a problem
     • Larger memtables mean lower write amplification
       • Managed to flush after 400k ops instead of 200k
     • Track time spent in GC with jstat -gcutil
     • At this rate, consider adding more nodes
  35. Four problems, four solutions
     1. Interactions between Cassandra and the hardware
     2. Implications of a bad data model at the storage layer
     3. Internal data structures and processes
     4. Work involved in arranging data on disk
  36. Guidelines
     • Monitor Cassandra, the OS, the JVM, and the hardware
     • Learn how to use nodetool
     • Follow best practices in data modelling and sizing
     • Keep an eye on the Cassandra logs
     • Think of the available resources (CPU, disks, RAM, network) as sharing the “work”
  37. Thank you!
