Talk by: Eric Lippmann
We recently started researching and developing a Module for Icinga to monitor Kubernetes environments. Over the past months we have learned a lot about the platform and how to monitor Kubernetes with Icinga efficiently. In this talk I will present our challenges as well as the progress we have made. The talk includes a sneak peek into the current state of the Module and outlines our vision of monitoring Kubernetes with Icinga.
8. Monitoring K8s – What to Monitor
• Hosts (where K8s components run)
• K8s itself
• Services, e.g. Deployments, *Sets, Jobs
• Pods
• Containers
• Key metrics
Not only infrastructure but also workloads
9. Challenges – Complexity
• Loads of resource types
• Multiple components and layers
• Different failure points
• Understanding of the entire stack
Via hosts, services and check plugins?
14. K8s Monitoring – Probes
Liveness probes periodically check whether a container is still alive and restart containers that fail them.
Readiness probes indicate whether a container is ready to serve traffic and remove failing containers from their service endpoints.
Startup probes defer the execution of liveness and readiness probes until the container has started successfully, and restart containers that fail them.
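For reference, a minimal sketch of how these three probes are declared, using the Go types from k8s.io/api; the image, paths, ports and thresholds are illustrative assumptions, not recommendations:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// containerWithProbes returns a container spec carrying all three
// probe types. Paths, ports and thresholds are made up for the example.
func containerWithProbes() corev1.Container {
	return corev1.Container{
		Name:  "app",
		Image: "example/app:latest", // hypothetical image
		// Liveness: restart the container once this endpoint
		// fails three times in a row.
		LivenessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
			},
			PeriodSeconds:    10,
			FailureThreshold: 3,
		},
		// Readiness: while this fails, the pod is removed from
		// its service endpoints but not restarted.
		ReadinessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{Path: "/ready", Port: intstr.FromInt(8080)},
			},
			PeriodSeconds: 5,
		},
		// Startup: liveness and readiness are deferred until this
		// succeeds once; slow starters get up to 5 minutes here.
		StartupProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
			},
			PeriodSeconds:    10,
			FailureThreshold: 30,
		},
	}
}

func main() {
	c := containerWithProbes()
	fmt.Printf("%s declares liveness, readiness and startup probes\n", c.Name)
}
```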
15. K8s Monitoring – Approaches
• Poll K8s APIs (see the sketch after this list)
• Agent per node via DaemonSet
• Agent per pod (sidecar container)
• Events
• Metrics
• Logs
• APM
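As an illustration of the first approach, here is a minimal sketch that polls the K8s API once with client-go and reports pods that are not running; kubeconfig handling and error reporting are deliberately simplified:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig; an agent running
	// inside the cluster would use the in-cluster config instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Poll the API once: list all pods and report those not running.
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != "Running" {
			fmt.Printf("%s/%s: %s\n", pod.Namespace, pod.Name, pod.Status.Phase)
		}
	}
}
```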
16. Possible K8s Metric Sources
• Node metrics from Prometheus node exporter (see the sketch after this list)
• Container metrics from cAdvisor (or metrics-server)
• K8s metrics
• API server
• etcd
• scheduler
• controller manager
• kube-state-metrics
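All of these sources expose metrics in the Prometheus text format, so a collector can consume them uniformly. A minimal sketch using the expfmt parser from github.com/prometheus/common; the node exporter URL is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Placeholder endpoint; kube-state-metrics, cAdvisor, etcd etc.
	// expose the same exposition format on their own ports.
	resp, err := http.Get("http://localhost:9100/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}
	for name, family := range families {
		fmt.Println(name, family.GetType())
	}
}
```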
18. Icinga K8s Monitoring, at the moment…
• Collects K8s resources and their health, events, certain metrics and logs
• Visualizes K8s resources and hierarchies
19. Icinga K8s Monitoring, should also…
• Correlate health, logs, metrics and events
• Provide alerts
• Of course, via icinga-notifications
• Give configuration tips
20. Icinga K8s Monitoring Architecture
• Icinga Web Module (PHP)
• View resources and hierarchies
• Daemon (Go)
• Collect resources, health, events, logs and certain metrics (see the sketch below)
• Send alerts
• Database (PostgreSQL / MySQL / MariaDB)
• Stores resources, health, …
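A rough sketch of how such a Go collector daemon might watch resources, assuming client-go shared informers; the actual module may be structured differently, and persisting to the database is stubbed out with prints:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pods via a shared informer; a real collector would
	// upsert these updates into the database instead of printing.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("add", pod.Namespace, pod.Name)
		},
		UpdateFunc: func(_, obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("update", pod.Namespace, pod.Name, pod.Status.Phase)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // run until killed
}
```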
Hosts
• Rather static
• Ping checks
Services
• Resource usage
• CPU, Memory, Storage, Network, Latencies
Apps
• Webserver
• Databases
• URLs
Check Plugins
• Contain logic
• Common understanding of what is wrong
• Not everyone has to find and configure their own rules
Cube
Business Process
vSphere
Hosts
• K8s Nodes
K8s itself
• etcd, scheduler, controller manager, API server
Services, a.k.a. K8s resources
Cluster Monitoring (infrastructure)
Every cluster should monitor the underlying server components, since problems at the server level will show up in the workloads. Metrics to look for when monitoring node resources are CPU, disk, and network bandwidth. An overview of these metrics tells you whether it is time to scale the cluster up or down (especially useful with cloud providers, where running cost matters).
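For example, per-node CPU and memory usage can be read from the metrics-server API. A sketch assuming k8s.io/metrics and a running metrics-server:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := metricsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Query metrics-server for current per-node CPU and memory usage.
	nodes, err := client.MetricsV1beta1().NodeMetricses().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		cpu := node.Usage.Cpu().MilliValue()
		mem := node.Usage.Memory().Value() / (1024 * 1024)
		fmt.Printf("%s: cpu=%dm memory=%dMi\n", node.Name, cpu, mem)
	}
}
```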
Workload Monitoring (workload)
Metrics related to deployments and their pods should be taken into consideration here. Comparing the number of pods a deployment currently has with its desired state can be revealing. In addition, we can look at health checks, container metrics, and finally application metrics.
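A sketch of that desired-vs-ready comparison for deployments, using client-go; alerting is reduced to a print:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Compare desired vs. ready replicas for every deployment.
	deployments, err := clientset.AppsV1().Deployments(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, d := range deployments.Items {
		desired := int32(1) // Kubernetes defaults replicas to 1 when unset
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		if d.Status.ReadyReplicas < desired {
			fmt.Printf("%s/%s: %d/%d replicas ready\n", d.Namespace, d.Name, d.Status.ReadyReplicas, desired)
		}
	}
}
```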
Everything is gone
Logs, metrics, events
Jobs
Configuration changes
Scaling
Name changes (not for StatefulSet)
History
Collect everything but alert on service level
In order to determine the health at every level, from the application to the operating system to the infrastructure, you need to monitor metrics in all the different layers and components: services, containers, pods, deployments, nodes, and clusters. And everyone has to understand which metrics exist, what they mean and how to interpret them.
Imagine, for example, a two-node cluster in which a single container leaks memory until its node is exhausted while the other node sits nearly idle. In this scenario, monitoring the cluster metrics would show roughly 50% memory utilization. It's not very useful information, nor is it alarming. But what would happen if you go down a level and monitor the metrics of each node? In that case, one of the nodes would show 100% memory usage, which would reveal a problem, but not its origin. Going down another level to the pod metrics would get you closer to the problem, and going down yet another level to the container metrics would allow you to isolate the culprit of the memory leak.
This simple example shows the value of monitoring the metrics of each Kubernetes layer. Yes, cluster-wide metrics provide a high-level overview of Kubernetes deployment performance, but you’ll need those lower-layer metrics to identify problems and obtain useful insights that will help you administer the cluster and optimize the resources.
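A sketch of that final drill-down step, reading per-container memory usage from metrics-server to isolate the leaking container (again assuming k8s.io/metrics):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := metricsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List per-container memory usage across all namespaces so the
	// container consuming the most memory stands out.
	pods, err := client.MetricsV1beta1().PodMetricses(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Containers {
			mem := c.Usage.Memory().Value() / (1024 * 1024)
			fmt.Printf("%s/%s/%s: %dMi\n", pod.Namespace, pod.Name, c.Name, mem)
		}
	}
}
```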
Cluster
• Kubernetes components
• Resource usage
• Underutilized / over capacity
Nodes
• Number of nodes sufficient?
• Account for node failures
• Capacity of pods, IPs and resources
Pods
• Resource usage against requests and limits
• Running vs. desired
Containers
• Logs
• Metrics
Cluster, Pods, Containers, Deployments, Sets, Applications
Expectations
• Number of replicas
Deployment
• Updated pods