SlideShare a Scribd company logo
1 of 36
Download to read offline
Monitoring Kubernetes with
Prometheus
Tom Wilkie, Oct 2018
❤
Tom Wilkie VP Product, Grafana Labs
Previously: Kausal, Weaveworks, Google, Acunu, Xensource
Twitter: @tom_wilkie Email: tom@grafana.com
Prometheus
Kubernetes
Monitoring & Alerting
Getting Started
Prometheus
• A monitoring & alerting system.

• Inspired by Google’s BorgMon

• Originally built by SoundCloud in
2012

• Open Source, now part of the CNCF

• Simple text-based metrics format

• Multidimensional datamodel

• Rich, concise query language
Prometheus’ data model is very simple:
<identifier> → [ (t0, v0), (t1, v1), ... ]
Timestamps are millisecond int64, values are float64
https://www.slideshare.net/Docker/monitoring-the-prometheus-way-julius-voltz-prometheus
Prometheus identifiers
http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“200”}
http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”}
http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“200”}
http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”}
Prometheus series selector
http_requests_total{job=“nginx”, status=~“5..”}
Building queries usually starts with a selector
PromQL: http_requests_total{job=“nginx”, status=~“5..”}
{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} 34
{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} 56
{job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} 76
{job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} 96
...
Can select vectors of values…
PromQL: http_requests_total{job=“nginx”, status=~“5..”}[1m]
{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} [30, 31, 32, 34]
{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} [4, 24, 56, 56]
{job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} [76, 76, 76, 76]
{job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} [56, 106, 5, 96]
...
And apply functions…
PromQL: rate(http_requests_total{job=“nginx”, status=~“5..”}[1m])
{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} 0.0666
{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} 0.866
{job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} 0.0
{job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} 2.43
...
And aggregate by a dimension…
PromQL: sum by (path) (rate(http_requests_total{job=“nginx”, status=~“5..”}[1m]))
{path=“/home”} 0.0666
{path=“/settings”} 3.3
...
Do binary operations…
PromQL: sum by (path) (rate(http_requests_total{job=“nginx”, status=~“5..”}[1m]))
/
sum by (path) (rate(http_requests_total{job=“nginx”}[1m]))
{path=“/home”} 0.001
{path=“/settings”} 1.0
...
Kubernetes
• Platform for managing containerized
workloads and services

• “operating system for you datacenter”

• Inspired by Google’s Borg

• Also part of the CNCF

• Distributed, fault tolerant architecture

• Rich object model for you
applications
https://thenewstack.io/myth-cloud-native-portability/
kube-state-metrics
cAdvisor
Monitoring & Alerting
What should I monitor?
USE Method
● Utilisation, Saturation, Errors…

RED Method
● Requests, Errors, Duration…

??? Method
● Expected system state…
USE Method
• cluster and node level
metrics 

• node_exporter run as a
daemonset
USE Method
CPU Utilisation:
1 - avg(rate(node_cpu{mode=“idle"}[1m]))
CPU Saturation:
sum(node_load1)/ sum(node:node_num_cpu:sum)
USE Method
• Can also look at
container level metrics
from cAdvisor…

• …and combine them
with metadata from
kube-state-metrics.
USE Method
Container CPU usage by “app” label
sum by (namespace, label_name) (
sum by (pod_name, namespace (
rate(container_cpu_usage_seconds_total[5m])
)
* on (pod_name) group_left(label_name)
label_join(kube_pod_labels, "pod_name", ",", "pod")
)
RED Method
• Metrics exposed by
components for RED-
style monitoring
RED Method
Most useful alert I’ve found:

100 * sum by(instance, job) (
rate(rest_client_requests_total{code!~”2..”}[5m])
)
/
sum by(instance, job) (
rate(rest_client_requests_total[5m])
)
??? Method
Alert expressions are invariants that describe a healthy
system
kube_deployment_spec_replicas !=
kube_deployment_status_replicas_available
rate(kube_pod_container_status_restarts_total
[15m]) > 0
??? Method
Alert expressions are invariants that describe a healthy system
(kube_pod_status_phase{phase!~”Running|Succeeded”}) > 0
sum(kube_pod_container_resource_requests_cpu_cores)
/ sum(node:node_num_cpu:sum)
>
(count(node:node_num_cpu:sum) - 1)
/ count(node:node_num_cpu:sum)
Cortex
• Horizontally scalable, HA
Prometheus

• Now part of the CNCF Sandbox

• Distributed, fault tolerant architecture

• Long term storage

• Multitenant

https://github.com/cortexproject/cortex
Getting Started
Getting setup
• github.com/coreos/prometheus-operator - Job to look after running
Prometheus on Kubernetes

• github.com/coreos/kube-prometheus - Set of configs for running all there
other things you need.

• github.com/grafana/jsonnet-libs/tree/master/prometheus-ksonnet - My
configs for running Prometheus, Alertmanager, Grafana etc

• github.com/kubernetes-monitoring/kubernetes-mixin - Joint project to
unify and improve common alerts for Kubernetes.
More reading…
https://landing.google.com/sre/book.html
https://www.youtube.com/watch?v=1oJXMdVi0mM
http://www.brendangregg.com/usemethod.html
Questions?

More Related Content

What's hot

What's hot (20)

Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
Docker swarm
Docker swarmDocker swarm
Docker swarm
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Prometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring SystemPrometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring System
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Introduction to Ansible
Introduction to AnsibleIntroduction to Ansible
Introduction to Ansible
 
Docker and Kubernetes 101 workshop
Docker and Kubernetes 101 workshopDocker and Kubernetes 101 workshop
Docker and Kubernetes 101 workshop
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 
Gitlab, GitOps & ArgoCD
Gitlab, GitOps & ArgoCDGitlab, GitOps & ArgoCD
Gitlab, GitOps & ArgoCD
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Introduction to Helm
Introduction to HelmIntroduction to Helm
Introduction to Helm
 
Automation with ansible
Automation with ansibleAutomation with ansible
Automation with ansible
 
Advanced Terraform
Advanced TerraformAdvanced Terraform
Advanced Terraform
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 

Similar to Monitoring Kubernetes with Prometheus

Oscon London 2016 - Docker from Development to Production
Oscon London 2016 - Docker from Development to ProductionOscon London 2016 - Docker from Development to Production
Oscon London 2016 - Docker from Development to Production
Patrick Chanezon
 

Similar to Monitoring Kubernetes with Prometheus (20)

What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Monitoring the Hashistack with Prometheus
Monitoring the Hashistack with PrometheusMonitoring the Hashistack with Prometheus
Monitoring the Hashistack with Prometheus
 
Kubernetes for java developers - Tutorial at Oracle Code One 2018
Kubernetes for java developers - Tutorial at Oracle Code One 2018Kubernetes for java developers - Tutorial at Oracle Code One 2018
Kubernetes for java developers - Tutorial at Oracle Code One 2018
 
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
 
Prometheus course
Prometheus coursePrometheus course
Prometheus course
 
Devoxx 2016 - Docker Nuts and Bolts
Devoxx 2016 - Docker Nuts and BoltsDevoxx 2016 - Docker Nuts and Bolts
Devoxx 2016 - Docker Nuts and Bolts
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
THE RED METHOD: HOW TO INSTRUMENT YOUR SERVICES
THE RED METHOD: HOW TO INSTRUMENT YOUR SERVICESTHE RED METHOD: HOW TO INSTRUMENT YOUR SERVICES
THE RED METHOD: HOW TO INSTRUMENT YOUR SERVICES
 
Oscon London 2016 - Docker from Development to Production
Oscon London 2016 - Docker from Development to ProductionOscon London 2016 - Docker from Development to Production
Oscon London 2016 - Docker from Development to Production
 
Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with Prometheus
 
Running & Monitoring Docker at Scale
Running & Monitoring Docker at ScaleRunning & Monitoring Docker at Scale
Running & Monitoring Docker at Scale
 
Containerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with KubernetesContainerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with Kubernetes
 
What's new in Docker - InfraKit - Docker Meetup Berlin 2016
What's new in Docker - InfraKit - Docker Meetup Berlin 2016What's new in Docker - InfraKit - Docker Meetup Berlin 2016
What's new in Docker - InfraKit - Docker Meetup Berlin 2016
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Monitoring Kubernetes with Prometheus

  • 2. Tom Wilkie VP Product, Grafana Labs Previously: Kausal, Weaveworks, Google, Acunu, Xensource Twitter: @tom_wilkie Email: tom@grafana.com
  • 4. Prometheus • A monitoring & alerting system. • Inspired by Google’s BorgMon • Originally built by SoundCloud in 2012 • Open Source, now part of the CNCF • Simple text-based metrics format • Multidimensional datamodel • Rich, concise query language
  • 5.
  • 6.
  • 7. Prometheus’ data model is very simple: <identifier> → [ (t0, v0), (t1, v1), ... ] Timestamps are millisecond int64, values are float64 https://www.slideshare.net/Docker/monitoring-the-prometheus-way-julius-voltz-prometheus
  • 8. Prometheus identifiers http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“200”} http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“200”} http_requests_total{job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} Prometheus series selector http_requests_total{job=“nginx”, status=~“5..”}
  • 9. Building queries usually starts with a selector PromQL: http_requests_total{job=“nginx”, status=~“5..”} {job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} 34 {job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} 56 {job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} 76 {job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} 96 ...
  • 10. Can select vectors of values… PromQL: http_requests_total{job=“nginx”, status=~“5..”}[1m] {job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} [30, 31, 32, 34] {job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} [4, 24, 56, 56] {job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} [76, 76, 76, 76] {job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} [56, 106, 5, 96] ...
  • 11. And apply functions… PromQL: rate(http_requests_total{job=“nginx”, status=~“5..”}[1m]) {job=“nginx”, instances=“1.2.3.4:80”, path=“/home”, status=“500”} 0.0666 {job=“nginx”, instances=“1.2.3.4:80”, path=“/settings”, status=“502”} 0.866 {job=“nginx”, instances=“2.3.4.5:80”, path=“/home”, status=“500”} 0.0 {job=“nginx”, instances=“2.3.4.5:80”, path=“/settings”, status=“502”} 2.43 ...
  • 12. And aggregate by a dimension… PromQL: sum by (path) (rate(http_requests_total{job=“nginx”, status=~“5..”}[1m])) {path=“/home”} 0.0666 {path=“/settings”} 3.3 ...
  • 13. Do binary operations… PromQL: sum by (path) (rate(http_requests_total{job=“nginx”, status=~“5..”}[1m])) / sum by (path) (rate(http_requests_total{job=“nginx”}[1m])) {path=“/home”} 0.001 {path=“/settings”} 1.0 ...
  • 14. Kubernetes • Platform for managing containerized workloads and services • “operating system for you datacenter” • Inspired by Google’s Borg • Also part of the CNCF • Distributed, fault tolerant architecture • Rich object model for you applications
  • 15.
  • 20. What should I monitor? USE Method ● Utilisation, Saturation, Errors… RED Method ● Requests, Errors, Duration… ??? Method ● Expected system state…
  • 21. USE Method • cluster and node level metrics • node_exporter run as a daemonset
  • 22. USE Method CPU Utilisation: 1 - avg(rate(node_cpu{mode=“idle"}[1m])) CPU Saturation: sum(node_load1)/ sum(node:node_num_cpu:sum)
  • 23. USE Method • Can also look at container level metrics from cAdvisor… • …and combine them with metadata from kube-state-metrics.
  • 24. USE Method Container CPU usage by “app” label sum by (namespace, label_name) ( sum by (pod_name, namespace ( rate(container_cpu_usage_seconds_total[5m]) ) * on (pod_name) group_left(label_name) label_join(kube_pod_labels, "pod_name", ",", "pod") )
  • 25. RED Method • Metrics exposed by components for RED- style monitoring
  • 26. RED Method Most useful alert I’ve found: 100 * sum by(instance, job) ( rate(rest_client_requests_total{code!~”2..”}[5m]) ) / sum by(instance, job) ( rate(rest_client_requests_total[5m]) )
  • 27. ??? Method Alert expressions are invariants that describe a healthy system kube_deployment_spec_replicas != kube_deployment_status_replicas_available rate(kube_pod_container_status_restarts_total [15m]) > 0
  • 28. ??? Method Alert expressions are invariants that describe a healthy system (kube_pod_status_phase{phase!~”Running|Succeeded”}) > 0 sum(kube_pod_container_resource_requests_cpu_cores) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
  • 29. Cortex • Horizontally scalable, HA Prometheus • Now part of the CNCF Sandbox • Distributed, fault tolerant architecture • Long term storage • Multitenant https://github.com/cortexproject/cortex
  • 31. Getting setup • github.com/coreos/prometheus-operator - Job to look after running Prometheus on Kubernetes • github.com/coreos/kube-prometheus - Set of configs for running all there other things you need. • github.com/grafana/jsonnet-libs/tree/master/prometheus-ksonnet - My configs for running Prometheus, Alertmanager, Grafana etc • github.com/kubernetes-monitoring/kubernetes-mixin - Joint project to unify and improve common alerts for Kubernetes.