In recent years, we have been building complex stacks made of many components, all of them backed by multiple teams. This talk will present how you can use monitoring to look at the business side and have everyone looking at the same dashboards, making cooperation a reality.
This talk is based on experience. Therefore we will
talk about the Prometheus ecosystem, but it applies
to other workflows and tools.
The DevOps principles:
(a definition of DevOps)
(Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS)
This talk is about all of it.
Who is behind the magic
Dev Ops Security Virtualization QA Networking
Sales Customers Partners ...
What do we learn?
Predict users' habits
Deviations from the norm are not normal
It means that users can not reach us/use our
Why business metrics
Good service depends on: linux health, dns,
network, ntp, disk space, cpu, open files, database,
cache systems, load balancers, partners, electricity,
virtualization stack, nfs, ... and it moves over time
Customers won't call you because your disk is full!
Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (soon to be
Everyone knows HTTP!
What do we exchange?
We are not interested in our partners' internals (and
don't want to expose ours)
We are exchanging precomputed metrics (rate
over 5 minutes, duration over 5 minutes),
excluding servers, instances, ...
Identify, in the chain, the bottlenecks and the
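A minimal sketch of what "precomputed" can mean here, assuming raw counters are available as (timestamp, value) samples per instance. The function names, the `business` label, and the sample data layout are hypothetical; this is a simplified stand-in for a 5-minute rate, not the exact computation Prometheus performs (no extrapolation at the window edges).

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5-minute window mentioned above


def increase(samples):
    """Total increase of a counter over a list of (timestamp, value) samples.

    A drop in value is treated as a counter reset (e.g. a restart),
    so the value seen after the reset counts as new increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def business_rates(series, now):
    """Aggregate per-instance counters into one per-second rate per business.

    `series` maps (business, instance) -> list of (timestamp, value).
    The instance is dropped from the result, so servers are not exposed.
    Simplification: the increase is divided by the full window length,
    which underestimates the rate when samples cover only part of it.
    """
    rates = defaultdict(float)
    for (business, _instance), samples in series.items():
        recent = [s for s in samples if now - WINDOW_SECONDS <= s[0] <= now]
        if len(recent) >= 2:
            rates[business] += increase(recent) / WINDOW_SECONDS
    return dict(rates)
```

The partner only ever sees the aggregated value per business label; which server produced which part of it stays internal.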
Kinds of dashboards
General (multiple business)
Business overview (e.g. one app)
Business focused (e.g. one process)
Technical overview (e.g. linux cluster)
Technical focus (e.g. linux host)
Even more focused (e.g. cpu usage)
We define our business dashboards in two parts:
10 graphs on top about the business: RED,
USE, Alerts, data from partners, monitoring
robots, state of the monitoring
hidden by default: Technical Health - ntp, disk,
db, network, jvm, ...
Limited number of graphs
Errors in red
Attention points in yellow/orange
How to do alerting right
Use multiple channels (chat, tickets)
Alert when really needed (non-prod: business hours only)
Send the alert to the right people (incl.
Make the alerts actionable
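The routing rules above can be sketched as a small decision function. This is an illustration only: it assumes "BH" means business hours, and the hours, the alert fields (`env`, `team`) and the channel names are all hypothetical.

```python
from datetime import datetime

BUSINESS_HOURS = range(9, 18)  # assumption: 09:00 to 17:59, Monday to Friday


def in_business_hours(when):
    """True on weekdays during the assumed business hours."""
    return when.weekday() < 5 and when.hour in BUSINESS_HOURS


def route(alert, when):
    """Return the channels an alert is sent to, or [] to hold it back."""
    is_prod = alert["env"] == "prod"
    if not is_prod and not in_business_hours(when):
        # Non-production alerts do not disturb anyone outside business hours.
        return []
    channels = ["chat:" + alert["team"]]  # reach the owning team directly
    if is_prod:
        channels.append("ticket")  # second channel, keeps a trace
    return channels
```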
Major incident in production
Affecting multiple projects
"Situation room": 2 channels: 1 for all the
alerts, 1 for the people
Bring managers, and all the relevant tech
people in the same room
Unique channel of communication for the
incident (archived after the incident)
Business monitoring allows you to know early
when things are wrong, across teams
Provides clear answers to your customers in
minutes (no more "I don't know, I will check")
Parallels to draw between technical and business
metrics (to find causes)
Is it REALLY fixed?
Until when (technical and business)?
What did I miss? What is the impact?
Because you run queries and alerts from a
You can run queries across targets/jobs
Detect faulty instances, alert for server X
based on metrics of server Y
Do not underestimate the monitoring of the
development / staging environments.
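One way to picture the cross-target idea (detecting a faulty instance from the metrics of its peers) is the sketch below. It assumes an error ratio per instance has already been collected; the function name, the `factor` and `floor` thresholds, and the instance names are hypothetical.

```python
from statistics import median


def faulty_instances(error_ratio_by_instance, factor=3.0, floor=0.01):
    """Flag instances whose error ratio is far above the median of their peers.

    `floor` avoids flagging noise when the peers have almost no errors.
    """
    flagged = []
    for instance, ratio in error_ratio_by_instance.items():
        peers = [r for i, r in error_ratio_by_instance.items() if i != instance]
        if not peers:
            continue  # nothing to compare a lone instance against
        baseline = max(median(peers), floor)
        if ratio > factor * baseline:
            flagged.append(instance)
    return sorted(flagged)
```

For example, `faulty_instances({"web1": 0.01, "web2": 0.012, "web3": 0.2})` flags only `web3`, because its ratio is judged against the other two servers rather than against a fixed threshold.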
Business metrics are good
candidates to wake up someone at
The downside is that that person must be fluent
with the business.
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empowering people
HTTP permits TLS, client auth, ... and cross-org
sharing of metrics
Becoming a standard in the industry
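To illustrate the pull model, here is a minimal sketch of a target exposing a metric it has chosen, in the Prometheus text format over plain HTTP, using only the Python standard library. The metric name, label, values and port are hypothetical, label values are not escaped, and TLS or client authentication are left out; in practice a client library such as `prometheus_client` would normally handle this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical business counter chosen by the team that owns the service.
# Layout: name -> (help text, {tuple of (label, value) pairs: sample value})
METRICS = {
    "shop_orders_total": (
        "Orders received by outcome.",
        {(("outcome", "ok"),): 42, (("outcome", "error"),): 1},
    ),
}


def render(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, samples) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        for labels, value in sorted(samples.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metrics on /metrics for a scraper to pull."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the monitoring side then decides when to scrape `/metrics`, while the team owning the service decides what appears there.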
Central point for all teams
Show current and past status
Should give you the opportunity to answer
Focusing on business metrics is hard work, but it
shows benefits across teams and provides visibility
to management, enabling you to gain trust and
move more quickly towards a DevOps model.