Cassandra Meetup
Monitoring C* Health at Scale
Jason Cacciatore @jasoncac
How do we assess health?
● Node Level
- dmesg errors
- gossip status
- thresholds (heap, disk usage, …)
● Cluster Level – ring aggregate
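For reference, a minimal sketch of the node-level checks above, assuming local access to nodetool and the kernel log; the paths, thresholds, and commands are illustrative, not the production tooling:

    # Sketch: poll a node's basic health signals (gossip status, dmesg errors,
    # disk threshold). Thresholds and data path are assumptions.
    import shutil
    import subprocess

    def gossip_active() -> bool:
        # `nodetool statusgossip` prints "running" when gossip is active.
        out = subprocess.run(["nodetool", "statusgossip"],
                             capture_output=True, text=True)
        return "running" in out.stdout.lower()

    def dmesg_error_count() -> int:
        # Count kernel error lines; needs permission to read the kernel log.
        out = subprocess.run(["dmesg", "--level=err"],
                             capture_output=True, text=True)
        return len([l for l in out.stdout.splitlines() if l.strip()])

    def disk_over_threshold(path="/var/lib/cassandra", limit_pct=85) -> bool:
        usage = shutil.disk_usage(path)
        return usage.used / usage.total * 100 > limit_pct

    if __name__ == "__main__":
        print("gossip:", gossip_active(),
              "| dmesg errors:", dmesg_error_count(),
              "| disk over threshold:", disk_over_threshold())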
Scope
● Production + Test
● Hundreds of clusters
● Over 9,000 nodes
Current Architecture
So… what’s the problem?
● Cron-based monitoring is problematic
○ No state, just a snapshot
● A stream processor is a better fit
Mantis
● A reactive stream processing system that runs on Apache Mesos
○ Cloud native
○ Provides a flexible functional programming model
○ Supports job autoscaling
○ Deep integration into Netflix ecosystem
● Current Scale
○ ~350 jobs
○ 8 Million Messages/sec processed
○ 250 Gb/sec data processed
● QConNY 2015 talk on Netflix Mantis
Modeled as Mantis Jobs
Resources
Real-time Dashboard
How can I try this?
● Priam - JMXNodeTool
● Stream Processor - Spark
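A hedged sketch of the stream-processor route with Spark Structured Streaming; the event schema, socket source, and thresholds below are assumptions for illustration, not the actual pipeline:

    # Sketch: aggregate per-cluster health events with Spark Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("cstar-health").getOrCreate()

    schema = StructType([
        StructField("cluster", StringType()),
        StructField("node", StringType()),
        StructField("gossip_status", StringType()),
        StructField("disk_used_pct", DoubleType()),
    ])

    # Assumed source: health snapshots arriving as JSON lines on a socket;
    # in practice this could be Kafka, Kinesis, etc.
    raw = (spark.readStream.format("socket")
           .option("host", "localhost").option("port", 9999).load())
    events = raw.select(F.from_json(F.col("value"), schema).alias("e")).select("e.*")

    # Cluster-level aggregate: count nodes that currently look unhealthy.
    unhealthy = (events
                 .where((F.col("gossip_status") != "NORMAL") |
                        (F.col("disk_used_pct") > 85))
                 .groupBy("cluster").count())

    query = (unhealthy.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()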
THANK YOU
C* Gossip: the good, the bad
and the ugly
Minh Do @timiblossom
What is Gossip Protocol or Gossip?
● A peer-to-peer communication protocol in a distributed system
● Inspired by the form of gossip seen in human social networks
● Nodes spread information to whichever peers they can contact
● Used in C* primarily as a membership protocol and for information sharing
Gossip flow in C*
● At start, the Gossiper loads seed addresses from the configuration file into the gossip list
● doShadowRound on seeds
● Every 1s, gossip with up to 3 peers: a random live peer, a seed, and an unreachable peer (see the sketch below)
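A simplified sketch of that per-second round, modeled on the Gossiper's peer selection; the probabilities and the send_syn callback are illustrative, not the real implementation:

    # Sketch: one gossip round picks up to three targets per second.
    import random

    def gossip_round(live, unreachable, seeds, send_syn):
        # 1. Gossip to one random live peer.
        gossiped_to_seed = False
        if live:
            target = random.choice(sorted(live))
            send_syn(target)
            gossiped_to_seed = target in seeds
        # 2. Possibly gossip to one unreachable peer, so downed nodes can rejoin.
        if unreachable and live:
            if random.random() < len(unreachable) / (len(live) + 1):
                send_syn(random.choice(sorted(unreachable)))
        # 3. Gossip to a random seed if we haven't already reached one.
        if seeds and (not gossiped_to_seed or len(live) < len(seeds)):
            send_syn(random.choice(sorted(seeds)))

    # Example: gossip_round({"10.0.0.2", "10.0.0.3"}, {"10.0.0.9"}, {"10.0.0.2"}, print)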
C* Gossip round in 3 stages
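A toy, self-contained model of the three stages (SYN, ACK, ACK2); real Cassandra exchanges per-endpoint digests with generation and version numbers, whereas this sketch reduces each node to a simple versioned key/value map:

    # Sketch: three-stage gossip exchange between two toy nodes.
    class Node:
        def __init__(self, name, state):
            self.name = name
            self.state = dict(state)          # key -> (version, value)

        def digests(self):
            # Summaries sent in the SYN: versions only, no values.
            return {k: v[0] for k, v in self.state.items()}

        def newer_than(self, digests):
            # Entries where this node is more current than the digests claim.
            return {k: v for k, v in self.state.items()
                    if v[0] > digests.get(k, -1)}

    def gossip_exchange(a, b):
        syn = a.digests()                                  # stage 1: SYN
        ack = b.newer_than(syn)                            # stage 2: ACK (B's newer state)
        requested = {k for k, ver in syn.items()           # ...plus what B still needs
                     if ver > b.state.get(k, (-1,))[0]}
        a.state.update(ack)
        ack2 = {k: a.state[k] for k in requested}          # stage 3: ACK2
        b.state.update(ack2)

    a = Node("10.0.0.1", {"schema": (3, "v3"), "load": (1, "10G")})
    b = Node("10.0.0.2", {"schema": (5, "v5")})
    gossip_exchange(a, b)
    print(a.state, b.state)   # both sides now hold the freshest version of each key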
How does Gossip help C*?
● Discover cluster topology (DC, Rack)
● Discover token owners
● Figure out peer statuses:
○ moving
○ leaving/left
○ normal
○ down
○ bootstrapping
● Exchange Schema version
● Share Load (used disk space)/Severity (CPU)
● Share Release version/Net version
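One way to see this shared state from the outside is to parse nodetool gossipinfo; the sketch below assumes the common line-oriented output format, which varies slightly across C* versions:

    # Sketch: collect per-peer application state from `nodetool gossipinfo`.
    import subprocess
    from collections import defaultdict

    def gossip_info():
        out = subprocess.run(["nodetool", "gossipinfo"],
                             capture_output=True, text=True, check=True).stdout
        peers, current = defaultdict(dict), None
        for line in out.splitlines():
            if line.startswith("/"):                 # endpoint line, e.g. /10.0.0.1
                current = line.strip().lstrip("/")
            elif current and ":" in line:
                key, _, value = line.strip().partition(":")
                peers[current][key] = value
        return peers

    if __name__ == "__main__":
        for peer, states in gossip_info().items():
            print(peer, states.get("STATUS"), states.get("LOAD"),
                  states.get("SCHEMA"), states.get("RELEASE_VERSION"))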
What does Gossip not do for C*?
● Detect crashes in Thrift or Native servers
● Manage cluster (need Priam or OpsCenter)
● Collect performance metrics: latencies, RPS,
JVM stats, network stats, etc.
● Give C* admins a good night’s sleep.
Gossip race conditions
Most gossip issues and bugs are caused by incorrect code logic in handling race conditions
● The larger the cluster, the higher the chance of hitting a race condition
● Several C* components run in different threads and can affect gossip state:
○ Gossiper
○ FailureDetector
○ Snitches
○ StorageService
○ InboundTcpConnection
○ OutboundTcpConnection
Pain Gossip Can Inflict on C*
● CASSANDRA-6125 An example of such a race condition
● CASSANDRA-10298 Gossiper does not properly clean out a dead peer’s metadata, causing the dead peer to stay in the ring forever
● CASSANDRA-10371 Dead nodes remain in gossip and block replacement because the FailureDetector cannot evict a down node
● CASSANDRA-8072 Unable to gossip to any seeds
● CASSANDRA-8336 Shutdown issue where peers resurrect a down node
Pain Gossip Can Inflict on C*, cont.
● CASSANDRA-8072 and CASSANDRA-7292 Problems reusing a dead node’s IP on a new node
● CASSANDRA-10969 Long-running cluster (over 1 year) has issues restarting
● CASSANDRA-8768 Issue upgrading to a newer version
● CASSANDRA-10321 Gossip to dead nodes causes CPU usage to hit 100%
● A lemon node or an AWS network issue can cause one node not to see another, producing a confusing gossip view
What can we do?
● Rolling-restart the C* cluster once in a while
● On AWS, when there is a gossip issue, try a reboot. If the gossip view is still bad, replace the node with a new instance
● Node assassination (unsafe and needs a repair/clean-up; see the sketch after this list)
● Monitor network activity to take pre-emptive action
● Search the community for the issues reported in the system logs
● Fix it yourself
● Pray
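A hedged sketch of the node-assassination remedy from the list above: on C* 2.2+ this is nodetool assassinate, on older versions the equivalent is the Gossiper MBean’s unsafeAssassinateEndpoint over JMX. The host, port, and IP below are placeholders, and the operation is unsafe, so follow up with repair/clean-up:

    # Sketch: remove a dead endpoint from gossip when it refuses to go away.
    import subprocess

    def assassinate(dead_ip: str, host: str = "127.0.0.1", port: int = 7199):
        # Uses the nodetool subcommand available on C* 2.2+.
        result = subprocess.run(
            ["nodetool", "-h", host, "-p", str(port), "assassinate", dead_ip],
            capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"assassinate failed: {result.stderr.strip()}")

    if __name__ == "__main__":
        assassinate("10.0.0.42")   # only after confirming the node is truly gone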
THANK YOU
References
● https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
● https://wiki.apache.org/cassandra/ArchitectureGossip
Cassandra Tickler
@chriskalan
When does repair fall down?
● Running LCS on an old version of C*
● Space issues
● Repair gets stuck
Solution - Cassandra Tickler
https://github.com/ckalantzis/cassTickler
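The repo above has the real script; below is a minimal sketch of the tickler idea using the DataStax Python driver: read every row at a high consistency level (ALL here) so that read repair quietly reconciles the replicas. The keyspace, table, key column, and pacing are placeholders:

    # Sketch: "tickle" every partition of a table to trigger read repair.
    import time
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    KEYSPACE, TABLE, KEY_COLUMN = "my_ks", "my_table", "id"

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect(KEYSPACE)

    # 1. Walk all partition keys cheaply (ONE is fine for the scan itself).
    scan = SimpleStatement(f"SELECT {KEY_COLUMN} FROM {TABLE}",
                           fetch_size=1000,
                           consistency_level=ConsistencyLevel.ONE)

    # 2. Re-read each partition at ALL, which forces replicas to reconcile.
    tickle = session.prepare(
        f"SELECT {KEY_COLUMN} FROM {TABLE} WHERE {KEY_COLUMN} = ?")
    tickle.consistency_level = ConsistencyLevel.ALL

    for row in session.execute(scan):
        session.execute(tickle, [getattr(row, KEY_COLUMN)])
        time.sleep(0.01)   # pace the reads to avoid hammering the cluster

    cluster.shutdown()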
THANK YOU
