Monitoring Kubernetes with Prometheus

Making sure your containers aren’t on fire
Monitoring microservices with Prometheus
Brice Fernandes
@fractallambda

1 Getting started with Kubernetes
2 The monitoring maturity ladder
3 Whitebox vs blackbox monitoring
4 Monitoring with Prometheus
5 Using PromQL

How I.T. Was
OS
App
Foo v1.1.0 Foo v1.5.0
?

How I.T. Was
Reproducible Deployment?
Continuous Deployment?
Fault Recovery?
Memory & CPU allocation?
Managing VMs?

The New Hotness
OS
Manager
Container
App
Somebody Else’s Problem (SEP)™

The New Hotness
Reproducible deployments
Fault recovery
Continuous deployment
Don’t care about machine virtualisation
Memory & CPU multiplexing
Buzzword compliance

Mo’ containers
Mo’ problems

Kubernetes: Greek for Helmsman or Pilot

Master
kube-apiserver
kube-controller-manager
kube-scheduler

This is what I want
xxx.xxx.xxx.xxx:30003

➤ minikube start
Starting local Kubernetes v1.7.5 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components…
Kubectl is now configured to use the cluster.

➤ minikube start
Starting VM...
Setting up certs...
Start a local cluster

➤ minikube start
Starting VM...
Setting up certs...
Set up the kubernetes tools
to point to our cluster

➤ kubectl get all
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.0.0.1 <none> 443/TCP 5m

➤ kubectl get all
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.0.0.1 <none> 443/TCP 5m
Default kubernetes service

➤ kubectl apply -f https://tinyurl.com/kube-prom-demo-v1

Our definition manifest

Which port to expose externally

deployment "mighty-fine-fe" created
service "mighty-fine-fe" created
Creates our pods

deployment "mighty-fine-fe" created
service "mighty-fine-fe" created
Exposes a service

➤ kubectl get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S)
kubernetes 10.0.0.1 <none> 443/TCP
mighty-fine-fe 10.0.0.223 <nodes> 3000:30001/TCP
Port 3000 of app is visible
on port 30001 of cluster

➤ open http://$(minikube ip):30001

Quality
Assurance
Continuous
Improvement

NOT about collecting data
Why vs How

Q: What’s the
most important
metric?

A: What’s the
purpose of your
organisation?

Maybe:
Educational goals
# People reached
# Papers published

Metrics come from
purpose.
Monitor your goals

The Monitoring Ladder
0 Ignorance
1 Availability
2 Collection
3 Aggregation
4 Analysis
5 Learning
6 Automation
7 Proactivity

The Monitoring Ladder
0 Ignorance
You don’t know
what’s going on.

The Monitoring Ladder
1 Availability
You know whether your
systems are available.
You may have alerts.

The Monitoring Ladder
2 Collection (Logging)
You collect logs.
Forensics is possible.
Alerts

The Monitoring Ladder
3 Aggregation
You aggregate and persist
data in a central place.
Correlation is possible.
Alerts
Logs
Forensics

The Monitoring Ladder
4 Analysis
You actually analyse
the aggregated and
correlated data.
Use it to fix issues.
Alerts
Logs
Forensics
Persistence

The Monitoring Ladder
5 Learning
Root cause analysis.
Strengthening fixes.
Antifragile.
Still responsive.
Alerts
Logs
Forensics

The Monitoring Ladder
6 Automation
Automated remedial actions.
Data collection for analysis.
No customer impact.
Alerts
Logs
Forensics
Antifragile

The Monitoring Ladder
7 Proactivity
Active strengthening
by attacking
production systems.
Alerts
Logs
Forensics
Antifragile
0-Impact

The Monitoring Ladder
0 Ignorance
1 Availability
2 Collection
3 Aggregation
4 Analysis
5 Learning
6 Automation
7 Proactivity
Alerts
Logs
Forensics
Antifragile
0-Impact
Monitoring is
a broad topic

Which one is right?
Pull or Push?
Whitebox or Blackbox?

Which one is right?
Pull or Push?
Whitebox or Blackbox?
Both

Monitoring infrastructure
Key metrics

➤ kubectl apply -f https://tinyurl.com/kube-prom-monitoring
deployment "prometheus" created
service "prometheus" created
service "internal-prometheus" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-configmap" created
Create and expose
Prometheus

configmap "prometheus-configmap" created
Create and expose
Grafana

configmap "prometheus-configmap" created
Configure Prometheus
Using a ConfigMap

➤ kubectl get svc
grafana 10.0.0.120 <nodes> 3000:30002/TCP
internal-prometheus 10.0.0.39 <none> 9090/TCP
prometheus 10.0.0.112 <nodes> 9090:30003/TCP

➤ kubectl get svc
Prometheus internal IP

➤ kubectl get svc
Prometheus external port

Internal
Prometheus
IP
and port

Proxy
instead
of data from
browser

But…
3 Aggregation
What about persistence?

Using
Weave Cloud’s
Hosted Prometheus

➤ kubectl apply -n kube-system -f "<some_url>&t=<some_token>"
serviceaccount "weave-flux" created
clusterrole "weave-flux" created
clusterrolebinding "weave-flux" created
secret "flux-git-deploy" created
deployment "weave-flux-memcached" created
service "weave-flux-memcached" created
deployment "weave-flux-agent" created
serviceaccount "weave-scope" created
clusterrole "weave-scope" created
clusterrolebinding "weave-scope" created
daemonset "weave-scope-agent" created
serviceaccount "weave-cortex" created
clusterrole "weave-cortex" created
clusterrolebinding "weave-cortex" created
deployment "weave-cortex-agent" created
service "weave-cortex-agent" created
daemonset "weave-cortex-node-exporter" created
configmap "weave-cortex-agent-config" created

➤ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS
kube-addon-manager-minikube 1/1 Running 1
kube-dns-910330662-bv35c 3/3 Running 3
kubernetes-dashboard-zj028 1/1 Running 1
weave-cortex-agent-815474457-5q0rg 1/1 Running 0
weave-cortex-node-exporter-5tf88 1/1 Running 0
weave-flux-agent-1731903026-d0gw8 1/1 Running 0
weave-flux-memcached-2601059440-f31vp 1/1 Running 0
weave-scope-agent-6fq0b 1/1 Running 0

Adding the
Prometheus
Agent to our app

➤ npm install --save epimetheus

➤ npm install --save epimetheus
Client libraries in: Go, Java, Python,
Ruby, Bash, C++, Common Lisp,
Elixir, Erlang, Haskell, Lua, .NET,
PHP, Rust…

Very straightforward
in most languages

…
Omitted for brevity:
Pushing new image to registry
Creating new manifest
…

deployment "mighty-fine-fe" configured
service "mighty-fine-fe" configured

➤ open http://$(minikube ip):30001/metrics
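
The /metrics page that opens here is plain text in the Prometheus exposition format, which is what Prometheus reads on each scrape. The sketch below is not the epimetheus library used in the talk. It is a minimal Python illustration of the counter format, and the metric names and help strings are hypothetical, borrowed from the worked example later in the deck.

```python
def render_metrics(counters):
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to a (help_text, value) pair.
    All names and values used with this function are illustrative.
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")  # human-readable description
        lines.append(f"# TYPE {name} counter")      # metric type hint for the scraper
        lines.append(f"{name} {value}")             # the sample itself
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical counters, not taken from the demo application.
    example = {
        "total_signups": ("Total number of sign-ups.", 12),
        "total_cancels": ("Total number of cancellations.", 3),
    }
    print(render_metrics(example), end="")
```

A client library does the same job behind an HTTP handler, and usually adds process and runtime metrics on top.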

Getting
Prometheus
to scrape our app

➤ kubectl apply -f https://tinyurl.com/kube-prom-monitoring-v2

deployment "prometheus" configured
service "prometheus" configured
service "internal-prometheus" configured
deployment "grafana" configured
service "grafana" configured
configmap "prometheus-configmap" configured

➤ curl -X POST http://$(minikube ip):30003/-/reload
Tell Prometheus to
reload its config

Weave
discovers
the new
Metrics
too

Joel York’s SaaS Metrics
http://chaotic-flow.com

Worked Example: Churn rate
$$\text{Churn Rate}_{\text{month}} = \frac{\Delta C_{\text{cancel}}}{C \times \Delta t}$$
ΔC_cancel: number of cancellations in interval
C: number of customers (at start of interval)
Δt: time interval

Worked Example: Churn rate
Assumed metrics:
total_signups (counter)
total_cancels (counter)

Worked Example: Churn rate
ΔC_cancel = rate(total_cancels[30d])
total_cancels: base metric (instant vector)
[30d]: data window (30 days)
total_cancels[30d]: data range (range vector), the samples at t0, t1, t2, t3, t4, t5, t6 inside the window, e.g. 0, 2, 4, 7, 9, 11, …
rate(): built-in rate function
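
To make the built-in rate function less of a black box, here is a simplified Python sketch of what a rate over a data window computes: the total increase of a counter divided by the elapsed time. It is not Prometheus's implementation. The real rate() also extrapolates towards the window boundaries, which this sketch leaves out, and the 15-second sample spacing is an assumption made for illustration.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs,
    oldest first. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None  # need at least two points to measure a change

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. the process restarted): count from zero.
            increase += value
        else:
            increase += value - previous
        previous = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Counter values from the slide, with assumed 15-second spacing.
    window = [(0, 0), (15, 2), (30, 4), (45, 7), (60, 9), (75, 11)]
    print(simple_rate(window))
```

Because the result is a per-second average, a figure for the whole window needs multiplying by the window length, which is what PromQL's increase() does.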

Worked Example: Churn rate
C = (total_signups offset 30d) -
(total_cancels offset 30d)
offset 30d: one month ago (in PromQL, 1m would mean one minute)

Worked Example: Churn rate
Churn Rate_month =
rate(total_cancels[30d]) /
((total_signups offset 30d) - (total_cancels offset 30d))
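
The arithmetic behind the query can be checked by hand. The Python sketch below applies the churn formula to the two assumed counters, taking Δt as one month. The counter readings are made-up numbers for illustration, not data from the demo. Note that PromQL's rate() returns a per-second average, so the query result would need scaling by the window length (or a switch to increase()) to match a per-month figure like this one.

```python
def churn_rate(cancels_in_interval, customers_at_start, interval=1.0):
    """Churn rate = cancellations / (customers at start * interval length)."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if interval <= 0:
        raise ValueError("interval must be positive")
    return cancels_in_interval / (customers_at_start * interval)


def churn_from_counters(signups_start, cancels_start, cancels_end, interval=1.0):
    """Derive churn from cumulative counters read at the start and end of the interval.

    signups_start / cancels_start: counter values one interval ago
    cancels_end: cancellation counter value now
    """
    customers_at_start = signups_start - cancels_start   # C
    cancels_in_interval = cancels_end - cancels_start    # ΔC_cancel
    return churn_rate(cancels_in_interval, customers_at_start, interval)


if __name__ == "__main__":
    # Hypothetical readings: 1200 sign-ups and 200 cancellations a month ago,
    # 250 cancellations today. C = 1000, ΔC_cancel = 50.
    print(churn_from_counters(1200, 200, 250))  # 0.05, i.e. 5% monthly churn
```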

Review
1 Getting started with Kubernetes
2 The monitoring maturity ladder
3 Whitebox vs blackbox monitoring
4 Monitoring with Prometheus
5 Using PromQL

References & useful links
- https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
- http://www.ncsysadmin.org/meetings/1010/Monitoring_and_Alerting.pdf
- https://www.oreilly.com/ideas/monitoring-distributed-systems
- https://www.slideshare.net/brianbrazil/monitoring-what-matters-the-prometheus-approach-to-whitebox-monitoring-berlin-ops-summit-2016
Thank You!
Brice Fernandes
@fractallambda
@weaveworks
Slides: https://tinyurl.com/prometheus-kubernetes-slides
Code: https://tinyurl.com/prometheus-kubernetes-code
Video: https://tinyurl.com/cloud-native-2017
https://weave.works
