
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python


  1. Cassandra Meetup
  2. Monitoring C* Health at Scale - Jason Cacciatore @jasoncac
  3. How do we assess health?
     ● Node level: dmesg errors, gossip status, thresholds (heap, disk usage, …) - see the sketch below
     ● Cluster level: ring aggregate
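A minimal sketch of what a node-level probe along these lines could look like, assuming the checks named on this slide (dmesg errors, gossip status, a disk-usage threshold). The commands used (`dmesg`, `nodetool statusgossip`) are real, but the threshold value, data path, and overall structure are illustrative only, not Netflix's actual checker.

```python
import shutil
import subprocess

DISK_USAGE_THRESHOLD = 0.90  # hypothetical threshold, not an actual Netflix value


def dmesg_has_errors() -> bool:
    """Scan the kernel ring buffer for error-level messages."""
    out = subprocess.run(["dmesg", "--level=err,crit"],
                         capture_output=True, text=True)
    return bool(out.stdout.strip())


def gossip_is_running() -> bool:
    """`nodetool statusgossip` reports 'running' or 'not running' for this node."""
    out = subprocess.run(["nodetool", "statusgossip"],
                         capture_output=True, text=True)
    return out.stdout.strip().lower() == "running"


def disk_over_threshold(path: str = "/var/lib/cassandra") -> bool:
    """Check the Cassandra data volume against the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > DISK_USAGE_THRESHOLD


def node_healthy() -> bool:
    # The node counts as healthy only if all three checks pass.
    return (not dmesg_has_errors()
            and gossip_is_running()
            and not disk_over_threshold())
```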
  4. Scope
     ● Production + test
     ● Hundreds of clusters
     ● Over 9,000 nodes
  5. Current Architecture
  6. So… what’s the problem?
     ● A cron-based monitoring system is problematic
       ○ No state, just snapshots
     ● A stream processor is a better fit
  7. Mantis
     ● A reactive stream-processing system that runs on Apache Mesos
       ○ Cloud native
       ○ Provides a flexible functional programming model
       ○ Supports job autoscaling
       ○ Deep integration into the Netflix ecosystem
     ● Current scale
       ○ ~350 jobs
       ○ 8 million messages/sec processed
       ○ 250 Gb/sec data processed
     ● See the QCon NY 2015 talk on Netflix Mantis
  8. Modeled as Mantis Jobs
  9. Resources
  10. Real-time Dashboard
  11. How can I try this? (see the sketch below)
      ● Priam - JMXNodeTool
      ● Stream processor - Spark
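One rough way to approximate the pipeline without Priam or Mantis is to poll `nodetool status` and emit one record per node, which a Spark (or other) streaming job could then consume. The `nodetool` command and its state codes are real; the parsing and record shape below are simplified assumptions, not the Priam JMXNodeTool interface.

```python
import subprocess


def poll_ring_status(host: str = "127.0.0.1"):
    """Emit one record per node from `nodetool status` output.

    Data rows start with a two-letter state code: U/D (up/down) followed by
    N/L/J/M (normal/leaving/joining/moving); the second column is the address.
    """
    out = subprocess.run(["nodetool", "-h", host, "status"],
                         capture_output=True, text=True)
    records = []
    for line in out.stdout.splitlines():
        parts = line.split()
        if parts and len(parts[0]) == 2 and parts[0][0] in "UD" and parts[0][1] in "NLJM":
            records.append({
                "address": parts[1],
                "up": parts[0][0] == "U",
                "state": parts[0][1],
            })
    return records


if __name__ == "__main__":
    # Each record could be pushed to a stream processor instead of printed.
    for record in poll_ring_status():
        print(record)
```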
  12. THANK YOU
  13. C* Gossip: the good, the bad and the ugly - Minh Do @timiblossom
  14. What is the Gossip protocol (Gossip)?
      ● A peer-to-peer communication protocol for distributed systems
      ● Inspired by the form of gossip seen in human social networks
      ● Nodes spread information to whichever peers they can contact
      ● Used in C* primarily as a membership protocol and for information sharing
  15. Gossip flow in C*
      ● At startup, the Gossiper loads seed addresses from the configuration file into its gossip list
      ● Runs doShadowRound against the seeds
      ● Every 1s, gossips to up to 3 nodes among its peers: a random peer, a seed peer, and an unreachable peer (see the sketch below)
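The per-second round described above can be sketched roughly as follows. This is not Cassandra's actual Gossiper code (which lives in org.apache.cassandra.gms.Gossiper); the probabilities and bookkeeping here are simplified assumptions meant only to show the shape of the peer selection.

```python
import random


def gossip_round(live_peers, unreachable_peers, seeds, send_syn):
    """One gossip round: contact up to three peers, as described on slide 15."""
    # 1) Gossip to one random live peer, if there is one.
    if live_peers:
        send_syn(random.choice(live_peers))

    # 2) Maybe gossip to an unreachable peer, so a recovered node gets rediscovered.
    #    The chance grows with the fraction of peers currently unreachable.
    if unreachable_peers:
        if random.random() < len(unreachable_peers) / (len(live_peers) + 1):
            send_syn(random.choice(unreachable_peers))

    # 3) Maybe gossip to a seed, so seeds converge quickly on cluster state.
    if seeds:
        if not live_peers or random.random() < len(seeds) / (len(live_peers) + len(unreachable_peers)):
            send_syn(random.choice(seeds))
```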
  16. C* Gossip round in 3 stages
  17. How does Gossip help C*?
      ● Discover cluster topology (DC, rack)
      ● Discover token owners
      ● Figure out peer statuses:
        ○ moving
        ○ leaving/left
        ○ normal
        ○ down
        ○ bootstrapping
      ● Exchange schema version
      ● Share load (used disk space) / severity (CPU)
      ● Share release version / net version
  18. What does Gossip not do for C*?
      ● Detect crashes in the Thrift or native servers
      ● Manage the cluster (you need Priam or OpsCenter)
      ● Collect performance metrics: latencies, RPS, JVM stats, network stats, etc.
      ● Give C* admins a good night's sleep
  19. Gossip race conditions
      Most Gossip issues and bugs are caused by incorrect code logic in handling race conditions
      ● The larger the cluster, the higher the chance of hitting a race condition
      ● Several C* components running in different threads can affect gossip status:
        ○ Gossiper
        ○ FailureDetector
        ○ Snitches
        ○ StorageService
        ○ InboundTcpConnection
        ○ OutboundTcpConnection
  20. Pain Gossip can inflict on C*
      ● CASSANDRA-6125 - an example of the race condition
      ● CASSANDRA-10298 - Gossiper does not clean out metadata on a dead peer properly, so the dead peer stays in the ring forever
      ● CASSANDRA-10371 - dead nodes remain in gossip and prevent a replacement because the FailureDetector cannot evict a down node
      ● CASSANDRA-8072 - unable to gossip to any seeds
      ● CASSANDRA-8336 - shutdown issue where peers resurrect a down node
  21. Pain Gossip can inflict on C*, cont.
      ● CASSANDRA-8072 and CASSANDRA-7292 - problems when reusing the IP of a dead node on a new node
      ● CASSANDRA-10969 - long-running cluster (over 1 year) has trouble restarting
      ● CASSANDRA-8768 - issue when upgrading to a newer version
      ● CASSANDRA-10321 - gossiping to dead nodes drove CPU usage to 100%
      ● A lemon node or an AWS network issue can cause one node not to see another, producing a confusing gossip view
  22. What can we do?
      ● Rolling-restart the C* cluster once in a while
      ● On AWS, when there is a gossip issue, try a reboot; if the gossip view is still bad, replace the node with a new instance
      ● Node assassination (unsafe; needs a repair/clean-up afterwards - see the sketch below)
      ● Monitor network activity to take pre-emptive action
      ● Search the community for the issues reported in the system logs
      ● Fix it yourself
      ● Pray
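A hedged sketch of the "node assassination" option from the list above. `nodetool assassinate <ip>` exists in newer C* releases (older releases expose the same operation via the JMX method Gossiper.unsafeAssassinateEndpoint); the confirmation flow and parsing here are illustrative. As the slide warns, this is unsafe and should be followed by a repair/clean-up.

```python
import subprocess


def peers_seen_down(host: str):
    """Return the addresses that `host` believes are down (DN rows in nodetool status)."""
    out = subprocess.run(["nodetool", "-h", host, "status"],
                         capture_output=True, text=True)
    return {line.split()[1] for line in out.stdout.splitlines()
            if line.startswith("DN") and len(line.split()) > 1}


def assassinate(host: str, dead_ip: str):
    """Forcibly remove `dead_ip` from gossip, as seen from `host`. Last resort only."""
    subprocess.run(["nodetool", "-h", host, "assassinate", dead_ip], check=True)


if __name__ == "__main__":
    coordinator = "127.0.0.1"  # any reachable node in the cluster
    for ip in sorted(peers_seen_down(coordinator)):
        if input(f"assassinate {ip}? [y/N] ").lower() == "y":
            assassinate(coordinator, ip)
```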
  23. THANK YOU
  24. References
      ● https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
      ● https://wiki.apache.org/cassandra/ArchitectureGossip
  25. Cassandra Tickler - @chriskalan
  26. When does repair fall down?
      ● Running LCS on an old version of C*
      ● Space issues
      ● Repair gets stuck
  27. Solution - Cassandra Tickler
  28. Solution - Cassandra Tickler
  29. Solution - Cassandra Tickler: https://github.com/ckalantzis/cassTickler (see the sketch below)
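The linked repository contains the real script; below is only a minimal sketch of the underlying idea using the DataStax Python driver: read every row of a table at ConsistencyLevel.ALL so that each read triggers read repair, which is useful when anti-entropy repair keeps failing for the reasons on slide 26. The contact point, keyspace, table, and key column are placeholders.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Reading at ALL forces the coordinator to compare every replica and repair
# any that diverge; paging (fetch_size) keeps memory bounded on large tables.
stmt = SimpleStatement(
    "SELECT id FROM my_table",            # placeholder table and key column
    consistency_level=ConsistencyLevel.ALL,
    fetch_size=1000,
)

touched = 0
for _row in session.execute(stmt):
    touched += 1

print(f"touched {touched} rows")
cluster.shutdown()
```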
  30. THANK YOU
