In this presentation we cover how we managed to monitor the NATS messaging system at scale and embed critical information in our pre-existing monitoring stack.
As a team working in the SDN/NFV area, we build our solution as a distributed system of microservices on top of Kubernetes. At the heart of our architecture we use NATS as a messaging system to ensure the resiliency of the critical paths. Monitoring of our system is achieved through the EFK stack. Consequently, monitoring of NATS had to be handled with the Elastic stack as well.
Monitoring the NATS messaging system at scale with Elastic Beats
1. Monitoring the NATS messaging system at scale with Elastic Beats
MichaelKatsoulis
skatsaounis
ChrsMark
Tracking Issue:
https://github.com/elastic/beats/issues/10071
2. Who we are. What we do
SWDC
SDN/NFV Teams
Internal Projects (NFVSAP) & Open Source Contribution
Christos Markou
Stamatis Katsaounis
Michael Katsoulis
3. NFVSAP: NFV Service Assurance Platform
Delivers:
− high performance
− deterministic performance
− improved energy consumption
All via closed-loop automation
4. Challenges
Tons of data exchanged every second
(real-time visibility, real-time actions)
The NATS server can potentially become the SPOB/SPOF
5. The Problem
The heart of our Service Assurance system is NATS.
We need to assure its health
by providing actionable visibility in real-time.
Athens, March 2019
6. What is NATS
An open-source messaging system like RabbitMQ and Kafka
A CNCF incubating project written in Go
The core principles underlying NATS are performance, scalability and ease of use
7. Why NATS
Super performant (Million msgs/sec)
Extra lightweight (Docker image at 8 MB)
Simple text-based communication over TCP
Written in Go, like our services interacting with it
Supports Request/Reply → ideal for NFVSAP’s control plane
https://seroter.wordpress.com/2016/05/16/modern-open-source-messaging-apache-kafka-rabbitmq-and-nats-in-action/
8. NATS Monitoring
System Resources Utilization (CPU, Memory)
I/O throughput (Messages, Bytes)
# of Slow Consumers
# of Subscriptions, Connections, Routes
https://nats.io/documentation/managing_the_server/monitoring/
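As a rough illustration of where the metrics above come from, the sketch below queries the NATS `/varz` monitoring endpoint over HTTP and decodes a handful of its JSON fields. The address `http://localhost:8222` and the `Varz`, `parseVarz` and `fetchVarz` names are assumptions made for this example; the struct covers only a subset of the payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Varz holds a subset of the fields exposed by the /varz endpoint.
type Varz struct {
	CPU           float64 `json:"cpu"`
	Mem           int64   `json:"mem"`
	Connections   int     `json:"connections"`
	Routes        int     `json:"routes"`
	Subscriptions int64   `json:"subscriptions"`
	SlowConsumers int64   `json:"slow_consumers"`
	InMsgs        int64   `json:"in_msgs"`
	OutMsgs       int64   `json:"out_msgs"`
	InBytes       int64   `json:"in_bytes"`
	OutBytes      int64   `json:"out_bytes"`
}

// parseVarz decodes a /varz JSON payload; unknown fields are ignored.
func parseVarz(payload []byte) (Varz, error) {
	var v Varz
	err := json.Unmarshal(payload, &v)
	return v, err
}

// fetchVarz queries <baseURL>/varz and decodes the response body.
func fetchVarz(baseURL string) (Varz, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/varz")
	if err != nil {
		return Varz{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return Varz{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var v Varz
	err = json.NewDecoder(resp.Body).Decode(&v)
	return v, err
}

func main() {
	// Assumed address: a local NATS server with monitoring enabled on port 8222.
	v, err := fetchVarz("http://localhost:8222")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("cpu=%.1f mem=%d in_msgs=%d out_msgs=%d slow_consumers=%d\n",
		v.CPU, v.Mem, v.InMsgs, v.OutMsgs, v.SlowConsumers)
}
```

The other monitoring endpoints (`/connz`, `/routez`, `/subsz`) can be queried in the same way with their own structs.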
9. The Problem (depicted)
Go agents running on multiple compute hosts
Our stack runs on top of k8s
EFK monitoring stack: plaintext logs
Monitoring data
Can we ship them to Elasticsearch?
10. Our approach
Use a shipper that queries the NATS monitoring endpoints and ships the data to Elasticsearch
15. NatsBeat as a custom Metricbeat
A binary querying all NATS monitoring endpoints at a 10-second interval
All data are shipped to Elasticsearch as events
NatsBeat is listed as one of the Community Beats
https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html
17. NATS as an official Metricbeat module
Four new metricsets were introduced, one per endpoint
All monitoring data were analyzed
New metrics were created from the most meaningful data
Metrics were visualized with a pre-built Kibana Dashboard
19. Beats project with NATS Metricbeat module (2)
https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-module-nats.html
20. Enrichments
NATS is a service which also has logs.
Could we ship them to Elasticsearch?
21. NATS as an official Filebeat module
A new fileset was created for parsing NATS server logs
All meaningful data were extracted with Grok patterns
Log events were created and shipped to Elasticsearch
Log data were visualized with a new Kibana Dashboard
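To give a feel for the kind of field extraction a Grok pattern performs, here is a minimal Go sketch that uses a regular expression as a stand-in for Grok. It assumes the common NATS server log layout `[pid] date time [LEVEL] message`; the pattern, the `LogEvent` type and `parseLogLine` are illustrative and are not the pattern shipped with the Filebeat module.

```go
package main

import (
	"fmt"
	"regexp"
)

// logLine matches the assumed layout "[pid] date time [LEVEL] message".
// Groups: 1 = pid, 2 = timestamp, 3 = level, 4 = message.
var logLine = regexp.MustCompile(
	`^\[(\d+)\] (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[([A-Z]{3})\] (.*)$`)

// LogEvent holds the fields extracted from one log line.
type LogEvent struct {
	PID       string
	Timestamp string
	Level     string
	Message   string
}

// parseLogLine extracts the fields of one line; ok is false when the line
// does not match the assumed layout.
func parseLogLine(line string) (ev LogEvent, ok bool) {
	m := logLine.FindStringSubmatch(line)
	if m == nil {
		return LogEvent{}, false
	}
	return LogEvent{PID: m[1], Timestamp: m[2], Level: m[3], Message: m[4]}, true
}

func main() {
	// Sample line written for this example, not taken from a real server.
	line := "[1] 2019/03/01 12:00:00.123456 [INF] Server is ready"
	if ev, ok := parseLogLine(line); ok {
		fmt.Printf("pid=%s level=%s msg=%q\n", ev.PID, ev.Level, ev.Message)
	}
}
```

Lines that do not match are reported through `ok` rather than dropped silently, so a caller can decide whether to ship them unparsed.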