Prometheus: From technical metrics to business observability


Talk given at Sysadmin Days Paris about monitoring, Prometheus, and what to do with the metrics we collect to understand how our apps are doing.

Published in: Technology

  1. 1. Prometheus From technical monitoring to business observability Julien Pivotto (@roidelapluie) Sysadmin Days Paris October 18th, 2018
  2. 2. user{name="roidelapluie"} 1 I like Open Source I like monitoring I like automation ... and all of that is my daily job at inuits
  3. 3. inuits
  4. 4. Sysadmin Creative Commons Zero
  5. 5. Sysadmin's view Access to a lot of components Range from the frontends to the databases With 24x7 oncall shifts
  6. 6. DevOps In a DevOps world, more data, more awareness More changes, different scale Evolution How can we keep up??
  7. 7. The DevOps principles: CAMS (a definition of DevOps) Culture Automation Measurement Sharing (Damon Edwards and John Willis, 2010) This talk is about all of it.
  8. 8. Monitoring Creative Commons Attribution 2.0
  9. 9. Creative Commons Attribution ShareAlike 2.0
  10. 10. Creative Commons Public Domain
  11. 11. Creative Commons Attribution 2.0
  12. 12. Traditional Monitoring It works - OK It does not work - CRITICAL It kinda works - WARNING I don't know - UNKNOWN
  13. 13. Creative Commons Public Domain
  14. 14. Creative Commons Attribution 2.0
  15. 15. Creative Commons Attribution-Share Alike 3.0 Unported
  16. 16. Real world It works ; it does not work ; it kinda works ; it maybe works ; no one uses it ; it is broken ; some things are broken ; it should work but it does not ; where are my users? help me...
  17. 17. The Technical bias By looking at technical service, we miss the actual point Are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  18. 18. Observability Creative Commons Attribution 2.0
  19. 19. Metrics Creative Commons Attribution-Share Alike 2.0
  20. 20. Metric Name Labels (Key-Value Pairs) Value (Number) Timestamp Fetched at a high frequency
  21. 21. Name: Number of HTTP requests Labels: status: 200 vhost: method: post Value: 1823 Timestamp: Thu Oct 18 10:18:06 CEST 2018
  22. 22. Name: Number of HTTP requests Labels: status: 200 vhost: method: post Value: 2123 Timestamp: Thu Oct 18 10:18:36 CEST 2018
  23. 23. 300 requests in 30 s = 10 requests per second (POST requests with response code 200)
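The arithmetic on the three slides above can be sketched in a few lines of Python. This is only an illustration of the idea: the helper name `per_second_rate` is hypothetical, and the sketch ignores counter resets, which the `rate` function in Prometheus does account for.

```python
from datetime import datetime


def per_second_rate(first, second):
    """Average per-second increase of a counter between two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = first, second
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (v1 - v0) / elapsed


# The two samples from the slides: 10:18:06 and 10:18:36 on October 18th, 2018.
sample_a = (datetime(2018, 10, 18, 10, 18, 6), 1823)
sample_b = (datetime(2018, 10, 18, 10, 18, 36), 2123)

rate = per_second_rate(sample_a, sample_b)
print(rate)  # 10.0 requests per second
```

The application only exposes the ever-growing counter; the division by time happens on the monitoring side.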
  24. 24. http_request_total{job="fe",instance="fe1",code="200"}
  25. 25. Types of metrics Counters Gauges Histograms Summaries
  26. 26. Counters Always go up start from zero rate, increase e.g. number of http requests
  27. 27. Gauges Go up and down Average, Sum, Max, ... ^ over time e.g. concurrent users
  28. 28. Histograms and summaries Sets of requests Using "buckets" Useful to get duration, percentiles, SLA
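To make the link between buckets and percentiles concrete, here is a minimal Python sketch that estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket. It follows the same general idea as `histogram_quantile` in Prometheus but is not its implementation; the function name, the bucket bounds and the counts are all made up for illustration, and the sketch assumes the last bucket has a finite upper bound.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        # First non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > lower_count:
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Hypothetical request durations: 50 requests took up to 0.1 s,
# 90 took up to 0.5 s and all 100 took up to 1 s.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(p95)  # about 0.75 seconds
```

The estimate is only as precise as the bucket boundaries, which is why buckets are usually chosen around the durations that matter for the SLA.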
  29. 29. Metrics and monitoring Metrics do not represent problems Metrics represent a state, give insights Metrics can be graphed You can alert based on them
  30. 30. Exposed metrics are "raw" In general you can just expose counters, and let the monitoring server do the real maths. That keeps the overhead on the apps very low.
  31. 31. Tooling Creative Commons Attribution 2.0
  32. 32. What are the needs? Ingest metrics at high frequency React to changes Empower people Alert on metrics
  33. 33. Use one toolchain Creative Commons Attribution-ShareAlike 2.0
  34. 34. Stop with: Having 1 "monitoring" + 1 "graphing" stacks Big all in one tools: think decentralize, scale Auto Discovery (use service discovery instead) Manual config Fragile monitoring (think HA)
  35. 35. Prometheus
  36. 36. Prometheus Open Source monitoring tool Complete Ecosystem For cloud and on prem Built around metrics
  37. 37. Cloud Native Easy to configure, deploy, maintain Designed in multiple services Container ready Orchestration ready (dynamic config) Fuzziness
  38. 38. Efficient "Scrapes" millions of metrics Scales Manages its own optimized db (prometheus/tsdb)
  39. 39. How does it work?
  40. 40. How does it work?
  41. 41. How does it work?
  42. 42. How does it work?
  43. 43. How does it work?
  44. 44. Pull vs Push Prometheus pulls metrics But does not know what it will get! The target decides what to expose (short term batches can still push to a "pushgateway")
  45. 45. Exporters Expose metrics with an HTTP API Bindings available for many languages (for "native" metrics) Exporters do not save data ; they are not "proxies" and don't "cache" anything
  46. 46. Common exporters Node Exporter: Linux System Metrics Grok Exporter: Metrics from log files SNMP Exporter: Network devices Blackbox exporter: TCP, DNS, Http requests
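What a target or exporter returns over HTTP is plain text, one sample per line. The sketch below renders a counter in the Prometheus text format using only the Python standard library. The helper names are hypothetical, label values are not escaped, and a real application would normally use an official client library rather than hand-written formatting.

```python
def render_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value."""
    # Assumption: label values contain no quotes, backslashes or newlines,
    # so no escaping is done here.
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    if label_text:
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def render_counter(name, help_text, samples):
    """Render a counter with its HELP and TYPE comment lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_request_total",
    "Number of HTTP requests.",
    [({"code": "200", "method": "post"}, 1823)],
)
print(body, end="")
```

Labels such as `job` and `instance` do not appear here because Prometheus attaches them itself when it scrapes the target.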
  47. 47. Grafana Open Source (Apache 2.0) Web app Specialized in visualization Pluggable Multiple datasources: prometheus, graphite, influxdb... Has an API!
  48. 48. Grafana and Prometheus Prometheus shipped its own consoles Now it recommends Grafana and deprecated its own consoles
  49. 49. Business Metrics Creative Commons Attribution 2.0
  50. 50. What are business metrics? Metrics that effectively tell you how you fulfill your customers' requests Provide quality and level of service to customers
  51. 51. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0
  52. 52. Where to get them? Frontends Databases Caching systems (sessions, ...) ... Each one of them requires a cross-team understanding of the business.
  53. 53. Where to start? Creative Commons Attribution 2.0
  54. 54. USE Brendan Gregg's USE method U = Utilisation S = Saturation E = Errors For resources like network, CPU, memory,... Also asynchronous processes, ...
  55. 55. RED Tom Wilkie's RED method R = Requests E = Errors D = Duration HTTP Requests, synchronous processes,...
  56. 56. What to get? Request Rate Saturation Error Rate Duration
  57. 57. Before we dig in... What we will see now is monitoring data. It should not be used where exact figures are required, such as invoicing.
  58. 58. Caching System Monitoring (USE)
  59. 59. Caching System Monitoring
  60. 60. What do we learn? Users can connect to the platform: The authentication works The platform is currently used
  61. 61. Benefits Connected users = they can use the platform Know when you can do maintenance Know about your users' general habits (trends)
  62. 62. Database
  63. 63. Database
  64. 64. Database Using SQL exporters to query the data from your database Requires a cross team approach Gets you fine grained, quality data
  65. 65. Database trap Do not try to replace BI/Reporting Do not take too many labels -- stay in the monitoring area
  66. 66. Frontends RED
  67. 67. sum by (instance, env) ( rate(http_requests_duration_count[5m]) )
  68. 68. Frontends RED
  69. 69. sum by (code, env) ( rate(http_requests_duration_count{code!="200"}[5m]) ) / ignoring (code) group_left sum by (env) ( rate(http_requests_duration_count[5m]) )
  70. 70. Frontends RED
  71. 71. sum by (env) ( rate(http_requests_duration_sum[5m]) ) / sum by (env) ( rate(http_requests_duration_count[5m]) )
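The three frontend queries boil down to simple arithmetic on counter increases over a window: requests per second, share of non-200 responses, and total time divided by number of requests. The Python sketch below reproduces that arithmetic for a single environment. The snapshot layout and all numbers are invented for illustration, and unlike the second query it folds every non-200 code into one error ratio instead of keeping one ratio per code.

```python
def red_from_counters(start, end, window_seconds):
    """Compute request rate, error ratio and mean duration from two counter snapshots.

    Each snapshot is a dict: {"count": {http_code: requests}, "duration_sum": seconds}.
    """
    delta_by_code = {
        code: end["count"][code] - start["count"].get(code, 0)
        for code in end["count"]
    }
    total = sum(delta_by_code.values())
    errors = sum(n for code, n in delta_by_code.items() if code != "200")
    duration = end["duration_sum"] - start["duration_sum"]
    return {
        "request_rate": total / window_seconds,                # R: requests per second
        "error_ratio": errors / total if total else 0.0,       # E: share of non-200
        "mean_duration": duration / total if total else 0.0,   # D: seconds per request
    }


# Hypothetical snapshots taken 5 minutes (300 s) apart.
before = {"count": {"200": 1000, "500": 10}, "duration_sum": 200.0}
after = {"count": {"200": 3850, "500": 160}, "duration_sum": 800.0}

red = red_from_counters(before, after, 300)
print(red)
```

With these numbers the window contains 3000 requests, 150 of them errors, served in 600 seconds of cumulated time.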
  72. 72. What can we learn? We have traffic from outside How much traffic Quality of the traffic How long it really takes (end to end)
  73. 73. Adding Time Creative Commons Attribution-ShareAlike 2.0
  74. 74. Timeseries How we use time: We take the metrics for the last 7 weeks We take the median value (exclude 3 top and 3 low) Excludes anomalies due to incidents/holidays...
  75. 75. http_requests:rate5m offset 1w offset queries data in the past
  76. 76. record: past_request_rate
          expr: http_requests:rate5m offset 1w
          labels:
            when: 1w
  77. 77. record: past_request_rate
          expr: http_requests:rate5m offset 2w
          labels:
            when: 2w
  78. 78. max without(when) ( bottomk(1, topk(4, past_request_rate ) ) )
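Why `bottomk(1, topk(4, ...))` yields the median of seven weekly values may not be obvious: keeping the four largest values drops the three lowest, and taking the smallest of those four drops the three highest. A small Python sketch of the same selection, with invented weekly rates that include one incident and one traffic spike:

```python
import statistics


def median_of_seven(weekly_values):
    """Pick the median of exactly seven values the way the query does."""
    if len(weekly_values) != 7:
        raise ValueError("expected one value per week for 7 weeks")
    top_four = sorted(weekly_values, reverse=True)[:4]  # like topk(4, ...)
    return min(top_four)                                # like bottomk(1, ...)


# Hypothetical request rates at the same time of day over the last 7 weeks:
# 0.4 is an incident, 25.0 an unusual spike.
past_rates = [10.2, 9.8, 10.0, 0.4, 10.5, 25.0, 9.9]

baseline = median_of_seven(past_rates)
print(baseline)  # 10.0
```

Both outliers fall outside the selected value, which is what makes the median a better baseline than an average here.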
  79. 79. Result RED
  80. 80. Oops... RED
  81. 81. What do we learn? Predict users' habits Deviations from the norm are not normal They mean that users cannot reach us or use our services
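One possible way to turn "deviation from the norm" into a check is to compare the current request rate with the historical baseline and flag large relative differences. The Python sketch below is only an assumption about how such a check could look: the function name and the 50% tolerance are hypothetical and not taken from the talk.

```python
def deviates_from_norm(current, baseline, tolerance=0.5):
    """Return True when `current` differs from `baseline` by more than `tolerance` (a fraction)."""
    if baseline == 0:
        # No traffic expected: any traffic at all is a deviation.
        return current != 0
    return abs(current - baseline) / baseline > tolerance


print(deviates_from_norm(3.0, 10.0))  # True: 70% below the usual rate
print(deviates_from_norm(9.0, 10.0))  # False: within 50% of the usual rate
```

The right tolerance depends on how stable the traffic is; a threshold derived from past data adapts to daily and weekly patterns in a way a fixed number cannot.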
  82. 82. Why do business metrics matter? Good service depends on: linux health, dns, network, ntp, disk space, cpu, open files, database, cache systems, load balancers, partners, electricity, virtualization stack, nfs, ... and it moves over time Customers won't call you because your disk is full!
  83. 83. Partners Creative Commons Attribution 2.0
  84. 84. Given that the End User matters We have decided to standardize metrics exchange between partners Prometheus format used (soon to be OpenMetrics) Everyone knows HTTP!
  85. 85. What do we exchange? We are not interested in our partners' internals (and don't want to expose ours) We are exchanging precomputed metrics (rate over 5 minutes, duration over 5 minutes), excluding servers, instances, ... Identify, in the chain, the bottlenecks and the issues
  86. 86. Dashboards We define our dashboards in two parts: 10 graphs on top about the business: RED, USE, Alerts, data from partners, monitoring robots, state of the monitoring hidden by default: Technical Health - ntp, disk, db, network, jvm, ...
  87. 87. Limited number of graphs Errors in RED Attention points in Yellow/Orange
  88. 88. technical view; more graphs; empty when OK
  89. 89. Dashboards Duplicate the dashboard to have a historical view
  90. 90. Dashboards Easy drill down between dashboards / with pre defined variables
  91. 91. Dashboard Provide relevant help where needed (from the haproxy documentation)
  92. 92. Dashboards On product launch / change / ... extract relevant data from the service and build a "temporary dashboard" Share with the teams and managers, show on big screen
  93. 93. Conventions Color conventions, general: RED = Bad Yellow = Attention Blue/Green = OK Also: RED = problem at our side Yellow/orange = problem at partners side
  94. 94. HTTP Codes 2xx: Greens 3xx: Yellows 4xx: Blues (404: grey) 5xx: Orange/Red ! Same across all dashboards to enable easy reading
  95. 95. Side note: A Jsonnet Library to write grafana dashboards.
  96. 96. Conclusion Creative Commons Attribution-ShareAlike 2.0
  97. 97. Quick Answers Business monitoring allows you to know early when things are wrong Provides clear answers to your customers in minutes (no more "I will check") Draw parallels between technical and business metrics (to find causes)
  98. 98. What happened? Is it REALLY fixed? When? Until when (technical and business)? What did I miss? What is the impact?
  99. 99. Metrics benefits Because you run queries and alerts from a central location You can run queries across targets/jobs Detect faulty instances, alert for server X based on metrics of server Y
  100. 100. Metrics benefits Trends Dynamic thresholds Predictions
  101. 101. Do not underestimate the monitoring of the development / staging environments.
  102. 102. Business metrics are good candidates to wake up someone at night.
  103. 103. Prometheus benefits Pull-based, metrics-centric The targets (e.g. developers) choose the metrics they expose => Empowering people HTTP permits TLS, Client Auth, ... and cross org sharing of metrics Becoming a standard in the industry
  104. 104. Grafana Central point for all teams Show current and past status Should give you the opportunity to answer questions
  105. 105. Focusing on Business Metrics is hard work that will show benefits across teams and provide visibility towards management, enabling you to gain trust and move more quickly towards a DevOps model.
  106. 106. Julien Pivotto roidelapluie Inuits Contact