5. Sysadmin's view
Access to a lot of components
Range from the frontends to the databases
With 24x7 oncall shifts
6. DevOps
In a DevOps world, more data, more awareness
More changes, different scale
Evolution
How can we keep up?
7. The DevOps principles: CAMS
(a definition of DevOps)
Culture
Automation
Measurement
Sharing
(Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS)
This talk is about all of it.
17. Real world
It works; it does not work; it kinda works; it
maybe works; no one uses it; it is broken; some
things are broken; it should work but it does not;
where are my users? help me...
18. The Technical bias
By looking only at technical services, we miss the
actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you
about the traffic jams.
30. Metrics and monitoring
Metrics do not represent problems
Metrics represent a state, give insights
Metrics can be graphed
You can alert based on them
31. Exposed metrics are "raw"
In general you can just expose counters, and let
the monitoring server do the real maths.
That keeps the overhead on the apps very low.
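To illustrate what "the real maths" on the server side can look like, here is a minimal sketch that turns raw counter samples into a per-second rate. The function name and the sample layout are assumptions made for this example; it is only loosely similar in spirit to what a monitoring server does (for instance, it does no extrapolation at the window edges).

```python
from typing import List, Tuple

# One scraped sample: (unix timestamp in seconds, raw counter value).
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter over the sampled window.

    A drop in value is treated as a counter reset (the app restarted),
    so the value after the reset is counted as the increase for that step.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time window")

    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # counter restarted from zero
    return increase / elapsed
```

The app only increments a number; everything else (windows, resets, division) happens on the server that scraped the samples.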
33. What are the needs?
Ingest metrics at high frequency
React to changes
Empower people
Alert on metrics
34. Use one toolchain
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/161054138@N08/37880775085
35. Stop with:
Having one "monitoring" stack plus one "graphing" stack
Big all-in-one tools: think decentralized, think scale
Auto Discovery (use service discovery instead)
Manual config
Fragile monitoring (think HA)
45. Pull vs Push
Prometheus pulls metrics
But does not know what it will get!
The target decides what to expose
(short-lived batch jobs can still push to a
"pushgateway")
46. Exporters
Expose metrics with an HTTP API
Bindings available for many languages (for
"native" metrics)
Exporters do not store data; they are not
"proxies" and don't "cache" anything
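As a rough sketch of the pull model, the following standard-library example exposes counters over HTTP in the Prometheus text format. The metric names, the `COUNTERS` dictionary, `render_metrics` and `make_server` are hypothetical names chosen for this illustration; real applications would normally use one of the official client bindings instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process counters; a real app would increment them as it works.
COUNTERS: Dict[str, int] = {"myapp_requests_total": 0, "myapp_errors_total": 0}


def render_metrics(counters: Dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only /metrics is served; the current values are read at scrape time.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) an HTTP server exposing /metrics on localhost."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server().serve_forever()` would start serving; the monitoring server then decides when to fetch `/metrics`, and nothing is stored or cached on the target side.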
47. Common exporters
Node Exporter: Linux System Metrics
Grok Exporter: Metrics from log files
SNMP Exporter: Network devices
Blackbox Exporter: TCP, DNS, HTTP requests
49. Grafana
Open Source (Apache 2.0)
Web app
Specialized in visualization
Pluggable
Multiple datasources: Prometheus, Graphite,
InfluxDB...
Has an API!
52. What are business metrics?
Metrics that effectively tell you how you fulfill
your customers' requests
Provide quality and level of service to
customers
53. CPU usage is not money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/3960497840
54. Where to get them?
Frontends
Databases
Caching systems (sessions, ...)
...
Each one of them requires a cross-team
understanding of the business.
55. Where to start?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/
56. USE
Brendan Gregg's USE method
U = Utilisation S = Saturation E = Errors
For resources like network, CPU, memory,...
Also asynchronous processes, ...
57. RED
Tom Wilkie's RED method
R = Requests E = Errors D = Duration
HTTP requests, synchronous processes, ...
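A small sketch of how the three RED signals could be derived from a batch of request records. The `Request` type, the field names and the choice of "status >= 500" as the error criterion are assumptions made for this example, not something taken from the talk.

```python
from statistics import median
from typing import Dict, List, NamedTuple


class Request(NamedTuple):
    status: int        # HTTP status code returned to the client
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Requests, Errors and Duration over one observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "requests_per_s": len(requests) / window_s,                   # R
        "error_ratio": errors / len(requests),                        # E
        "median_duration_s": median(r.duration_s for r in requests),  # D
    }
```

In practice the raw counters and histograms would be exposed by the app and these ratios computed by the monitoring server; the sketch only shows which numbers end up on the dashboard.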
76. Timeseries
How we use time: we take the metrics for the
last 7 weeks
We take the median value (excluding the 3 highest
and the 3 lowest)
This excludes anomalies due to
incidents, holidays, ...
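The baseline described above can be sketched as follows, assuming one value per week for the same time slot (that layout and the function name are assumptions for illustration). With seven values, the median is exactly what remains after dropping the three highest and three lowest.

```python
from statistics import median
from typing import Sequence


def weekly_baseline(same_slot_values: Sequence[float]) -> float:
    """Median of the values seen in the same time slot over the past weeks."""
    if not same_slot_values:
        raise ValueError("need at least one past value")
    return median(same_slot_values)


# Hypothetical logins per minute, same weekday and time, over the last 7 weeks.
# Week 3 had an incident (5) and week 6 a marketing campaign (300).
history = [120, 118, 5, 122, 119, 300, 121]
baseline = weekly_baseline(history)
```

Neither the incident nor the campaign moves the baseline, which is the point of using the median rather than the mean.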
83. What do we learn?
Predict users' habits
Deviations from the norm are not normal
They mean that users cannot reach us or use our
services
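One possible way to turn "deviation from the norm" into a check is a relative tolerance around the expected value. The function name and the 30% default are hypothetical; the right tolerance depends on the metric.

```python
def is_abnormal(current: float, expected: float, tolerance: float = 0.3) -> bool:
    """True when `current` differs from `expected` by more than `tolerance`.

    `tolerance` is a fraction of the expected value (0.3 means 30%).
    """
    if expected <= 0:
        raise ValueError("expected value must be positive")
    return abs(current - expected) / expected > tolerance
```

For example, with 120 logins per minute expected, `is_abnormal(60, 120)` is true while `is_abnormal(110, 120)` is false.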
84. Why do business metrics matter?
Good service depends on: Linux health, DNS,
network, NTP, disk space, CPU, open files, database,
cache systems, load balancers, partners,
electricity, virtualization stack, NFS, ... and it moves
over time
Customers won't call you because your disk is full!
86. Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (soon to be
OpenMetrics)
Everyone knows HTTP!
87. What do we exchange?
We are not interested in our partners' internals (and
don't want to expose ours)
We are exchanging precomputed metrics (rate
over 5 minutes, duration over 5 minutes),
excluding servers, instances, ...
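A sketch of how internal labels could be stripped before sharing, assuming each series is a (labels, value) pair holding an already computed rate. The label names in `INTERNAL_LABELS` and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Labels = Dict[str, str]
PublicKey = Tuple[Tuple[str, str], ...]

# Hypothetical labels that describe internals and must not leave the company.
INTERNAL_LABELS: Set[str] = {"instance", "server"}


def aggregate_for_partner(
    series: Iterable[Tuple[Labels, float]],
    drop: Set[str] = INTERNAL_LABELS,
) -> Dict[PublicKey, float]:
    """Sum precomputed rates across servers, keeping only the shareable labels."""
    totals: Dict[PublicKey, float] = defaultdict(float)
    for labels, value in series:
        public = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[public] += value
    return dict(totals)
```

Summing is appropriate for rates; durations would need a different aggregation (for example total time divided by total count) before being shared.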
Identify, in the chain, the bottlenecks and the
issues
88. Dashboards
We define our dashboards in two parts:
10 graphs on top about the business: RED,
USE, alerts, data from partners, monitoring
robots, state of the monitoring
Hidden by default: technical health (NTP, disk,
DB, network, JVM, ...)
89. Limited number of graphs
Errors in red
Attention points in yellow/orange
96. Dashboards
On product launch / change / ... extract
relevant data from the service and build a
"temporary dashboard"
Share with the teams and managers, show on
big screen
101. Quick Answers
Business monitoring allows you to know early
when things are wrong
Provides clear answers to your customers in
minutes (no more "I will check")
Parallels to draw between technical and business
metrics (to find causes)
102. What happened?
Is it REALLY fixed?
When?
Until when (technical and business)?
What did I miss? What is the impact?
103. Metrics benefits
Because you run queries and alerts from a
central location
You can run queries across targets/jobs
Detect faulty instances, alert for server X
based on metrics of server Y
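Because all instances are visible from one place, each one can be compared to its peers. The sketch below flags instances whose error ratio is far above the group median; the function name, the factor of 3 and the 1% floor are hypothetical choices for illustration.

```python
from statistics import median
from typing import Dict, List


def faulty_instances(
    error_ratio_by_instance: Dict[str, float],
    factor: float = 3.0,
    floor: float = 0.01,
) -> List[str]:
    """Instances whose error ratio exceeds `factor` times the peer median.

    `floor` avoids flagging noise when the whole group is nearly error free.
    """
    if not error_ratio_by_instance:
        return []
    typical = median(error_ratio_by_instance.values())
    threshold = max(typical * factor, floor)
    return sorted(
        name
        for name, ratio in error_ratio_by_instance.items()
        if ratio > threshold
    )
```

With `{"web-1": 0.010, "web-2": 0.012, "web-3": 0.200, "web-4": 0.009}` as input, only `web-3` stands out from its peers.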
107. Prometheus benefits
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empowering people
HTTP permits TLS, Client Auth, ... and cross
org sharing of metrics
Becoming a standard in the industry
108. Grafana
Central point for all teams
Shows current and past status
Should give you the opportunity to answer
questions
109. Focusing on business metrics is hard work that
will show benefits across teams and provide
visibility to management, enabling you to gain
trust and move more quickly towards a DevOps
model.