Prometheus
From technical monitoring to business
observability
Julien Pivotto (@roidelapluie)
Sysadmin Days Paris
October 1...
user{name="roidelapluie"} 1
I like Open Source
I like monitoring
I like automation
... and all of that is my daily job at ...
inuits
Sysadmin
Creative Commons Zero https://www.flickr.com/photos/freestocks/25668265836
Sysadmin's view
Access to a lot of components
Range from the frontends to the databases
With 24x7 oncall shifts
DevOps
In a DevOps world, more data, more awareness
More changes, different scale
Evolution
How can we keep up??
The DevOps principles: CAMS
(a definition of DevOps)
Culture
Automation
Measurement
Sharing
(Damon Edwards and John Willis...
Monitoring
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
Creative Commons Public Domain https://pxhere.com/en/photo/265717
Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
Traditional Monitoring
It works - OK
It does not work - CRITICAL
It kinda works - WARNING
I don't know - UNKNOWN
Creative Commons Public Domain https://pxhere.com/fr/photo/952999
Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
Creative Commons Attribution-Share Alike 3.0 Unported
https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504...
Real world
It works ; it does not work ; it kinda works ; it
maybe works ; no one uses it ; it is broken ; some
things are...
The Technical bias
By looking at technical service, we miss the
actual point
Are we serving our users correctly?
Just look...
Observability
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
Metrics
Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
Metric
Name
Labels (Key-Value Pairs)
Value (Number)
Timestamp
Fetched at a high frequency
Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 1823
Timestamp: Thu Oct 18 10:18:06...
Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 2123
Timestamp: Thu Oct 18 10:18:36...
300 Requests in 30 s = 10 requests per second
(POST for inuits.eu with response code 200)
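The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `Sample` class and `per_second_rate` function are hypothetical names, and the two samples reuse the values and the 30 s spacing shown on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scraped value of a counter (hypothetical helper type)."""
    value: float
    timestamp: float  # seconds


def per_second_rate(first: Sample, last: Sample) -> float:
    """Average per-second growth of a counter between two samples."""
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (last.value - first.value) / elapsed


# Values from the slides: 1823 at 10:18:06 and 2123 at 10:18:36.
earlier = Sample(value=1823, timestamp=0.0)
later = Sample(value=2123, timestamp=30.0)
rate = per_second_rate(earlier, later)
print(rate)
```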
http_request_total{job="fe",instance="fe1",code="200"}
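A series such as the one above is a name plus a set of labels. As a rough sketch, this is how one sample line could be rendered in the Prometheus text format. `format_sample` is a hypothetical helper, not part of any client library, and the value 1823 is borrowed from the earlier slide purely for illustration.

```python
def format_sample(name: str, labels: dict, value) -> str:
    """Render one sample as `name{label="value",...} value`."""

    def escape(label_value: str) -> str:
        # Backslash, double quote and newline are escaped in label values.
        return (label_value.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))

    if not labels:
        return f"{name} {value}"
    rendered = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    return f"{name}{{{rendered}}} {value}"


line = format_sample(
    "http_request_total",
    {"job": "fe", "instance": "fe1", "code": "200"},
    1823,  # illustrative value
)
print(line)
```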
Types of metrics
Counters
Gauges
Histograms
Summaries
Counters
Always go up
start from zero
rate, increase
e.g. number of http requests
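Because counters only go up and start from zero, a drop between two samples means the process restarted. Below is a minimal sketch of how an increase can be computed across such a reset. It is a simplification: `counter_increase` is a hypothetical name, and the real `rate` and `increase` functions in Prometheus also extrapolate to the edges of the time window.

```python
def counter_increase(values):
    """Total growth of a counter over successive samples, tolerating resets."""
    total = 0.0
    previous = None
    for value in values:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The counter restarted from zero, so everything counted
                # since the restart is new growth.
                total += value
        previous = value
    return total


# Hypothetical samples: the process restarts between 150 and 20.
samples = [100, 150, 20, 50]
print(counter_increase(samples))
```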
Gauges
Go up and down
Average, Sum, Max, ...
^ over time
e.g. concurrent users
Histograms and summaries
Sets of requests
Using "buckets"
Useful to get duration, percentiles, SLA
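To illustrate how percentiles come out of buckets, here is a simplified sketch of linear interpolation over cumulative bucket counts, in the spirit of what `histogram_quantile` does. The function name and the bucket values are assumptions made for the example, and the lowest bucket is assumed to start at 0.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # No upper limit to interpolate towards.
                return lower_bound
            in_bucket = cumulative - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical request durations in seconds: 50 requests took <= 0.1 s, etc.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, latency_buckets))
```

The estimate is only as precise as the bucket boundaries, which is one reason to pick boundaries close to the SLA thresholds you care about.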
Metrics and monitoring
Metrics do not represent problems
Metrics represent a state, give insights
Metrics can be graphed
Y...
Exposed metrics are "raw"
In general you can just expose counters, and let
the monitoring server do the real maths.
That k...
Tooling
Creative Commons Attribution 2.0 https://www.flickr.com/photos/psd/5298483
What are the needs?
Ingest metrics at high frequency
React to changes
Empower people
Alert on metrics
Use one toolchain
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/161054138@N08/37880775085
Stop with:
Having 1 "monitoring" + 1 "graphing" stacks
Big all in one tools: think decentralize, scale
Auto Discovery (use...
Prometheus
https://prometheus.io/
Prometheus
Open Source monitoring tool
Complete Ecosystem
For cloud and on prem
Built around metrics
Cloud Native
Easy to configure, deploy, maintain
Designed in multiple services
Container ready
Orchestration ready (dynami...
Efficient
"Scrapes" millions of metrics
Scales
Manages its own optimized db
(prometheus/tsdb)
How does it work?
Pull vs Push
Prometheus pulls metrics
But does not know what it will get!
The target decides what to expose
(short term ba...
Exporters
Expose metrics with an HTTP API
Bindings available for many languages (for
"native" metrics)
Exporters do not sa...
Common exporters
Node Exporter: Linux System Metrics
Grok Exporter: Metrics from log files
SNMP Exporter: Network devices
...
Grafana
Open Source (Apache 2.0)
Web app
Specialized in visualization
Pluggable
Multiple datasources: prometheus, graphite...
Grafana and Prometheus
Prometheus shipped its own consoles
Now it recommends Grafana and deprecated
its own consoles
Business Metrics
Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
What are business metrics?
Metrics that effectively tell you how you fulfill
your customers' requests
Provide quality and ...
CPU usage is no money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/39604...
Where to get them?
Frontends
Databases
Caching systems (sessions, ...)
...
Each one of them requires a cross-team
understa...
Where to start?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/
USE
Brendan Gregg's USE method
U = Utilisation S = Saturation E = Errors
For resources like network, CPU, memory,...
Also ...
RED
Tom Wilkie's RED method
R = Requests E = Errors D = Duration
HTTP Requests, synchronous processes,...
What to get?
Request Rate
Saturation
Error Rate
Duration
Before we dig in ..
What we will see now is monitoring data. It should
not be used for precise usages, like invoicing.
Caching System Monitoring
(USE)
Caching System Monitoring
What do we learn?
Users can connect to the platform: The
authentication works
The platform is currently used
Benefits
Connected users = they can use the platform
Know when you can do maintenance
Know about your user's general habi...
Database
Database
Database
Using SQL exporters to query the data from
your database
Requires a cross team approach
Gets you fine grained, qu...
Database trap
Do not try to replace BI/Reporting
Do not take too many labels -- stay in the
monitoring area
Frontends
RED
sum by (instance, env) (
  rate(http_requests_duration_count[5m])
)
Frontends
RED
sum by (code, env) (
  rate(http_requests_duration_count{code!="200"}[5m])
) / ignoring (code) group_left
sum by (env) (
 ...
Frontends
RED
sum by (env) (
  rate(http_requests_duration_sum[5m])
) /
sum by (env) (
  rate(http_requests_duration_count[5m])
)
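The query above divides the growth of the duration sum by the growth of the request count. Both rates cover the same window, so the window length cancels out and the result is the mean duration per request. A small sketch of the same idea, with a hypothetical function name and made-up sample values:

```python
def mean_duration(sum_before, sum_after, count_before, count_after):
    """Mean request duration between two scrapes of a _sum/_count pair."""
    requests = count_after - count_before
    if requests == 0:
        return None  # no traffic in the window, so no meaningful average
    return (sum_after - sum_before) / requests


# Hypothetical scrapes: 100 requests took 30 s in total over the window.
average = mean_duration(sum_before=120.0, sum_after=150.0,
                        count_before=1000, count_after=1100)
print(average)
```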
What can we learn?
We have traffic from outside
How much traffic
Quality of the traffic
How long it really takes (end to en...
Adding Time
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/rswilson74/3375654385
Timeseries
How we use time: We take the metrics for the
last 7 weeks
We take the median value (exclude 3 top and 3
low)
Ex...
http_requests:rate5m offset 1w
offset queries data in the past
- record: past_request_rate
  expr: http_requests:rate5m offset 1w
  labels:
    when: 1w
- record: past_request_rate
  expr: http_requests:rate5m offset 2w
  labels:
    when: 2w
max without(when) (
  bottomk(1,
    topk(4,
      past_request_rate
    )
  )
)
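With seven weekly values, keeping the four largest and then taking the smallest of those yields the 4th largest value, which is the median. The sketch below checks that reasoning in Python. The weekly rates are invented for the example, including one incident week and one spike.

```python
import heapq
import statistics


def median_of_seven(values):
    """Median of 7 values using the top-4-then-bottom-1 trick."""
    if len(values) != 7:
        raise ValueError("this trick assumes exactly 7 values")
    top_four = heapq.nlargest(4, values)  # mirrors topk(4, ...)
    return min(top_four)                  # mirrors bottomk(1, ...)


# Hypothetical request rates for the same moment over the last 7 weeks.
weekly_rates = [120, 95, 130, 10, 110, 400, 105]
print(median_of_seven(weekly_rates))
print(statistics.median(weekly_rates))
```

The outliers (10 and 400) do not influence the result, which matches the goal of excluding anomalies from the baseline.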
Result
RED
Oops...
RED
What do we learn?
Predict users habits
Deviations from the norm are not normal
It means that users can not reach us/use our...
Why business metrics matter?
Good service depends on: linux health, dns,
network, ntp, disk space, cpu, open files, databa...
Partners
Creative Commons Attribution 2.0 https://www.flickr.com/photos/deanhochman/27248626739
Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (so...
What do we exchange?
We are not interested in our partners' internals (and
don't want to expose ours)
We are exchanging precomput...
Dashboards
We define our dashboards in two parts:
10 graphs on top about the business: RED,
USE, Alerts, data from partne...
Limited number of graphs
Errors in RED
Attention points in Yellow/Orange
technical view; more graphs; empty when OK
Dashboards
Duplicate the dashboard to have an historical
view
Dashboards
Easy drill down between dashboards / with pre
defined variables
Dashboard
Provide relevant help where needed
(from the haproxy documentation)
Dashboards
On product launch / change / ... extract
relevant data from the service and build a
"temporary dashboard"
Share...
Conventions
Color conventions, general:
RED = Bad
Yellow = Attention
Blue/Green = OK
Also:
RED = problem at our side
Yello...
HTTP Codes
2xx: Greens
3xx: Yellows
4xx: Blues (404: grey)
5xx: Orange/Red
! Same across all dashboards to enable easy
re...
Side note: github.com/grafana/grafonnet-lib/
A Jsonnet Library to write grafana dashboards.
Conclusion
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/willy_photoshop/34829332342/
Quick Answers
Business monitoring allows you to know early
when things are wrong
Provides clear answers to your customers i...
What happened?
Is it REALLY fixed?
When?
Until when (technical and business)?
What did I miss? What is the impact?
Metrics benefits
Because you run queries and alerts from a
central location
You can run queries across targets/jobs
Detec...
Metrics benefits
Trends
Dynamic thresholds
Predictions
Do not underestimate the monitoring of the
development / staging environments.
Business metrics are good candidates
to wake up someone at night.
Prometheus benefits
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empower...
Grafana
Central point for all teams
Show current and past status
Should give you the opportunity to answer
questions
Focusing on Business Metrics is hard work that
will show benefits across teams and provide
visibility towards hierarchy, ...
Julien Pivotto
roidelapluie
roidelapluie@inuits.eu
Inuits
https://inuits.eu
info@inuits.eu
Contact
Prometheus: From technical metrics to business observability
Talk given at Sysadmin Days Paris about monitoring, prometheus, and what to do with metrics we collect to understand how our apps are going.

Prometheus: From technical metrics to business observability

  1. 1. Prometheus From technical monitoring to business observability Julien Pivotto (@roidelapluie) Sysadmin Days Paris October 18th, 2018
  2. 2. user{name="roidelapluie"} 1 I like Open Source I like monitoring I like automation ... and all of that is my daily job at inuits
  3. 3. inuits
  4. 4. Sysadmin Creative Commons Zero https://www.flickr.com/photos/freestocks/25668265836
  5. 5. Sysadmin's view Access to a lot of components Range from the frontends to the databases With 24x7 oncall shifts
  6. 6. DevOps In a DevOps world, more data, more awareness More changes, different scale Evolution How can we keep up??
  7. 7. The DevOps principles: CAMS (a definition of DevOps) Culture Automation Measurement Sharing (Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS) This talk is about all of it..
  8. 8. Monitoring Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  9. 9. Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
  10. 10. Creative Commons Public Domain https://pxhere.com/en/photo/265717
  11. 11. Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
  12. 12. Traditional Monitoring It works - OK It does not work - CRITICAL It kinda works - WARNING I don't know - UNKNOWN
  13. 13. Creative Commons Public Domain https://pxhere.com/fr/photo/952999
  14. 14. Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
  15. 15. Creative Commons Attribution-Share Alike 3.0 Unported https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg
  16. 16. Real world It works ; it does not work ; it kinda works ; it maybe works ; no one uses it ; it is broken ; some things are broken ; it should work but it does not ; where are my users? help me...
  17. 17. The Technical bias By looking at technical service, we miss the actual point Are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  18. 18. Observability Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  19. 19. Metrics Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
  20. 20. Metric Name Labels (Key-Value Pairs) Value (Number) Timestamp Fetched at a high frequency
  21. 21. Name: Number of HTTP requests Labels: status: 200 vhost: inuits.eu method: post Value: 1823 Timestamp: Thu Oct 18 10:18:06 CEST 2018
  22. 22. Name: Number of HTTP requests Labels: status: 200 vhost: inuits.eu method: post Value: 2123 Timestamp: Thu Oct 18 10:18:36 CEST 2018
  23. 23. 300 Requests in 30 s = 10 requests per second (POST for inuits.eu with response code 200)
  24. 24. http_request_total{job="fe",instance="fe1",code="200"}
  25. 25. Types of metrics Counters Gauges Histograms Summaries
  26. 26. Counters Always go up start from zero rate, increase e.g. number of http requests
  27. 27. Gauges Go up and down Average, Sum, Max, ... ^ over time e.g. concurrent users
  28. 28. Histograms and summaries Sets of requests Using "buckets" Useful to get duration, percentiles, SLA
  29. 29. Metrics and monitoring Metrics do not represent problems Metrics represent a state, give insights Metrics can be graphed You can alert based on them
  30. 30. Exposed metrics are "raw" In general you can just expose counters, and let the monitoring server do the real maths. That keeps the overhead on the apps very low.
  31. 31. Tooling Creative Commons Attribution 2.0 https://www.flickr.com/photos/psd/5298483
  32. 32. What are the needs? Ingest metrics at high frequency React to changes Empower people Alert on metrics
  33. 33. Use one toolchain Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/161054138@N08/37880775085
  34. 34. Stop with: Having 1 "monitoring" + 1 "graphing" stacks Big all in one tools: think decentralize, scale Auto Discovery (use service discovery instead) Manual config Fragile monitoring (think HA)
  35. 35. Prometheus https://prometheus.io/
  36. 36. Prometheus Open Source monitoring tool Complete Ecosystem For cloud and on prem Built around metrics
  37. 37. Cloud Native Easy to configure, deploy, maintain Designed in multiple services Container ready Orchestration ready (dynamic config) Fuzziness
  38. 38. Efficient "Scrapes" millions of metrics Scales Manages its own optimized db (prometheus/tsdb)
  39. 39. How does it work?
  40. 40. How does it work?
  41. 41. How does it work?
  42. 42. How does it work?
  43. 43. How does it work?
  44. 44. Pull vs Push Prometheus pulls metrics But does not know what it will get! The target decides what to expose (short term batches can still push to a "pushgateway")
  45. 45. Exporters Expose metrics with an HTTP API Bindings available for many languages (for "native" metrics) Exporters do not save data ; they are not "proxies" and don't "cache" anything
  46. 46. Common exporters Node Exporter: Linux System Metrics Grok Exporter: Metrics from log files SNMP Exporter: Network devices Blackbox exporter: TCP, DNS, Http requests
  47. 47. Grafana Open Source (Apache 2.0) Web app Specialized in visualization Pluggable Multiple datasources: prometheus, graphite, influxdb... Has an API!
  48. 48. Grafana and Prometheus Prometheus shipped its own consoles Now it recommends Grafana and deprecated its own consoles
  49. 49. Business Metrics Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
  50. 50. What are business metrics? Metrics that effectively tell you how you fulfill your customers' requests Provide quality and level of service to customers
  51. 51. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/nox_noctis_silentium/3960497840
  52. 52. Where to get them? Frontends Databases Caching systems (sessions, ...) ... Each one of them requires a cross-team understanding of the business.
  53. 53. Where to start? Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/
  54. 54. USE Brendan Gregg's USE method U = Utilisation S = Saturation E = Errors For resources like network, CPU, memory,... Also asynchronous processes, ...
  55. 55. RED Tom Wilkie's RED method R = Requests E = Errors D = Duration HTTP Requests, synchronous processes,...
  56. 56. What to get? Request Rate Saturation Error Rate Duration
  57. 57. Before we dig in .. What we will see now is monitoring data. It should not be used for precise usages, like invoicing.
  58. 58. Caching System Monitoring (USE)
  59. 59. Caching System Monitoring
  60. 60. What do we learn? Users can connect to the platform: The authentication works The platform is currently used
  61. 61. Benefits Connected users = they can use the platform Know when you can do maintenance Know about your users' general habits (trends)
  62. 62. Database
  63. 63. Database
  64. 64. Database Using SQL exporters to query the data from your database Requires a cross team approach Gets you fine grained, quality data
  65. 65. Database trap Do not try to replace BI/Reporting Do not take too many labels -- stay in the monitoring area
  66. 66. Frontends RED
  67. 67. sum by (instance, env) (   rate(http_requests_duration_count[5m]) )
  68. 68. Frontends RED
  69. 69. sum by (code, env) (   rate(http_requests_duration_count{code!="200"}[5m]) ) / ignoring (code) group_left sum by (env) (   rate(http_requests_duration_count[5m]) )
  70. 70. Frontends RED
  71. 71. sum by (env) (   rate(http_requests_duration_sum[5m]) ) / sum by (env) (   rate(http_requests_duration_count[5m]) )
  72. 72. What can we learn? We have traffic from outside How much traffic Quality of the traffic How long it really takes (end to end)
  73. 73. Adding Time Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/rswilson74/3375654385
  74. 74. Timeseries How we use time: We take the metrics for the last 7 weeks We take the median value (exclude 3 top and 3 low) Excludes anomalies due to incidents/holidays...
  75. 75. http_requests:rate5m offset 1w offset queries data in the past
  76. 76. - record: past_request_rate   expr: http_requests:rate5m offset 1w   labels:     when: 1w
  77. 77. - record: past_request_rate   expr: http_requests:rate5m offset 2w   labels:     when: 2w
  78. 78. max without(when) (   bottomk(1,     topk(4,       past_request_rate     )   ) )
  79. 79. Result RED
  80. 80. Oops... RED
  81. 81. What do we learn? Predict users habits Deviations from the norm are not normal It means that users can not reach us/use our services
  82. 82. Why business metrics matter? Good service depends on: linux health, dns, network, ntp, disk space, cpu, open files, database, cache systems, load balancers, partners, electricity, virtualization stack, nfs, ... and it moves over time Customers won't call you because your disk is full!
  83. 83. Partners Creative Commons Attribution 2.0 https://www.flickr.com/photos/deanhochman/27248626739
  84. 84. Given that the End User matters We have decided to standardize metrics exchange between partners Prometheus format used (soon to be OpenMetrics) Everyone knows HTTP!
  85. 85. What do we exchange? We are not interested in our partners' internals (and don't want to expose ours) We are exchanging precomputed metrics (rate over 5 minutes, duration over 5 minutes), excluding servers, instances, ... Identify, in the chain, the bottlenecks and the issues
  86. 86. Dashboards We define our dashboards in two parts: 10 graphs on top about the business: RED, USE, Alerts, data from partners, monitoring robots, state of the monitoring hidden by default: Technical Health - ntp, disk, db, network, jvm, ...
  87. 87. Limited number of graphs Errors in RED Attention points in Yellow/Orange
  88. 88. technical view; more graphs; empty when OK
  89. 89. Dashboards Duplicate the dashboard to have an historical view
  90. 90. Dashboards Easy drill down between dashboards / with pre defined variables
  91. 91. Dashboard Provide relevant help where needed (from the haproxy documentation)
  92. 92. Dashboards On product launch / change / ... extract relevant data from the service and build a "temporary dashboard" Share with the teams and managers, show on big screen
  93. 93. Conventions Color conventions, general: RED = Bad Yellow = Attention Blue/Green = OK Also: RED = problem at our side Yellow/orange = problem at partners side
  94. 94. HTTP Codes 2xx: Greens 3xx: Yellows 4xx: Blues (404: grey) 5xx: Orange/Red ! Same across all dashboards to enable easy reading
  95. 95. Side note: github.com/grafana/grafonnet-lib/ A Jsonnet Library to write grafana dashboards.
  96. 96. Conclusion Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/willy_photoshop/34829332342/
  97. 97. Quick Answers Business monitoring allows you to know early when things are wrong Provides clear answers to your customers in minutes (no more "I will check") Parallels to draw between technical and business metrics (to find causes)
  98. 98. What happened? Is it REALLY fixed? When? Until when (technical and business)? What did I miss? What is the impact?
  99. 99. Metrics benefits Because you run queries and alerts from a central location You can run queries across targets/jobs Detect faulty instances, alert for server X based on metrics of server Y
  100. 100. Metrics benefits Trends Dynamic thresholds Predictions
  101. 101. Do not underestimate the monitoring of the development / staging environments.
  102. 102. Business metrics are good candidates to wake up someone at night.
  103. 103. Prometheus benefits Pull-based, metrics-centric The targets (e.g. developers) choose the metrics they expose => Empowering people HTTP permits TLS, Client Auth, ... and cross org sharing of metrics Becoming a standard in the industry
  104. 104. Grafana Central point for all teams Show current and past status Should give you the opportunity to answer questions
  105. 105. Focusing on Business Metrics is hard work that will show benefits across teams and provide visibility towards hierarchy, enabling you to gain trust and move on more quickly towards a DevOps model.
  106. 106. Julien Pivotto roidelapluie roidelapluie@inuits.eu Inuits https://inuits.eu info@inuits.eu Contact