Everyone knows that monitoring is crucial, but it’s a complex subject that few get right. In this session, you will learn the best practices for monitoring your Couchbase cluster, including the metrics that matter, integrating your tool set with Couchbase APIs, determining alert thresholds, and a review of an end-to-end reference implementation.
This means proactively identifying issues or resource consumption trends that could eventually lead to downtime or poor performance. Stuff happens: sometimes a bad batch of hard drives fails at the same time, or the new network guy sets all your ports to 10 Mbps, or the application has a bug that DoSes the cluster. If we can’t prevent it, we need to be able to figure out what it was so we can fix it.
There are 120+ base stats, with additional stats added for each index and XDCR replication. Stats are viewable by bucket, and also by server or aggregated across all servers. Stats can be trended from the default minutely view out to yearly; note, however, that the data points are heavily downsampled over time.
This makes the interface great for real-time status and debugging, but historical data is only useful for high-level trending. The interface likely won’t help you identify the cause of a transient latency spike that occurred three days ago.
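To see why downsampled history hides short spikes, here is a minimal Python sketch. It assumes the downsampling is a plain average over fixed windows, which is an illustration only and not necessarily how Couchbase aggregates its stats; the latency values and the `downsample` helper are hypothetical.

```python
from statistics import mean


def downsample(samples, factor):
    """Average each consecutive group of `factor` samples into one point.

    Assumption for illustration: downsampling is a simple mean per window.
    """
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical per-second latency in milliseconds: a steady 2 ms,
# with a five-second spike to 200 ms in the first minute.
latency_ms = [2.0] * 120
latency_ms[30:35] = [200.0] * 5

# Collapse 120 one-second samples into two one-minute points.
per_minute = downsample(latency_ms, 60)

print("raw peak:", max(latency_ms))
print("per-minute points:", per_minute)
```

Under this assumption the 200 ms spike survives only as a mildly elevated one-minute average, and each further level of aggregation flattens it more. That is the argument for exporting the raw samples to an external time-series store if you need to investigate incidents after the fact.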
Most alerts are reactive: they send notifications only after a failure has already occurred.
What we want is proactive alerting so that we can avoid the failure altogether.
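One way to make an alert proactive is to alert on the trend rather than on the current value. The Python sketch below fits a straight line to recent samples and estimates how long until a threshold would be crossed. The function name, the hourly sampling interval, the disk-usage numbers, and the 24-hour warning window are all assumptions made for illustration; they are not part of Couchbase or of the reference implementation discussed in the talk.

```python
def hours_until_threshold(hourly_values, threshold):
    """Estimate hours until a linear trend through the samples hits `threshold`.

    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if there are too few samples or the trend is flat or falling.
    """
    n = len(hourly_values)
    if n < 2:
        return None
    if hourly_values[-1] >= threshold:
        return 0.0

    # Ordinary least-squares fit of value against sample index (0, 1, 2, ...).
    x_mean = (n - 1) / 2
    y_mean = sum(hourly_values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(hourly_values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None

    intercept = y_mean - slope * x_mean
    crossing_index = (threshold - intercept) / slope
    # Time remaining is measured from the most recent sample.
    return max(0.0, crossing_index - (n - 1))


# Hypothetical disk usage (percent), sampled once per hour.
disk_used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]
remaining = hours_until_threshold(disk_used_pct, threshold=90.0)

# Hypothetical policy: warn when the threshold is less than 24 hours away.
should_warn = remaining is not None and remaining < 24
```

With these sample values the disk is only 58% full, so a static 90% threshold would stay silent, while the trend-based check warns well before the limit is reached. A linear fit is the simplest possible model; bursty or seasonal workloads would need something less naive.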
Monitoring Couchbase: getting it right – Connect Silicon Valley 2017