Monitoring as an entry point for collaboration

In the last years, we have been building complex stacks, made from lots of components. All of this backed by multiple teams. This talk will present how you can use monitoring to look at the business side and have everyone looking at the same dashboards, making cooperation a reality.

Published in: Technology

  1. 1. Monitoring as an entry point for collaboration Julien Pivotto (@roidelapluie) DevOpsDays Geneva February 22nd, 2019
  2. 2. @roidelapluie I like Open Source I like monitoring I like automation ... and all of that is my daily job at inuits
  3. 3. inuits
  4. 4. This talk is based on experience. Therefore we will talk about the Prometheus ecosystem, but it applies to other workflows and tools.
  5. 5. The DevOps principles: CAMS (a definition of DevOps) Culture Automation Measurement Sharing (Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS) This talk is about all of them.
  6. 6. Who is behind the magic Dev Ops Security Virtualization QA Networking Sales Customers Partners ...
  7. 7. Monitoring Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  8. 8. Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
  9. 9. Creative Commons Public Domain https://pxhere.com/en/photo/265717
  10. 10. Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
  11. 11. Traditional Monitoring It works - OK It does not work - CRITICAL It kinda works - WARNING I don't know - UNKNOWN
  12. 12. Creative Commons Public Domain https://pxhere.com/fr/photo/952999
  13. 13. Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
  14. 14. Creative Commons Attribution-Share Alike 3.0 Unported https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg
  15. 15. Real world It works ; it does not work ; it kinda works ; it maybe works ; no one uses it ; it is broken ; some things are broken ; it should work but it does not ; where are my users? help me...
  16. 16. The Technical bias By looking only at technical services, we miss the actual point: are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  17. 17. Further questions At which speed are the cars running? How long do they stop? How many pedestrians are crossing the road?
  18. 18. Observability Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  19. 19. Observability is the ability to be inside the application, and look around to observe its world. In practice: Collecting relevant information Making it available quickly and easily
  20. 20. Metrics Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
  21. 21. Metric Name Labels (Key-Value Pairs) Value (Number) Timestamp Fetched at a high frequency
  22. 22. Name: Number of HTTP requests Labels: status: 200 vhost: inuits.eu method: post Value: 1823 Timestamp: Thu Oct 18 10:18:06 CEST 2018
  23. 23. Name: Number of HTTP requests Labels: status: 200 vhost: inuits.eu method: post Value: 2123 Timestamp: Thu Oct 18 10:18:36 CEST 2018
  24. 24. 300 requests in 30 s = 10 requests per second (POST for inuits.eu with response code 200)
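The arithmetic on slides 22 to 24 can be reproduced with a minimal Python sketch. The helper name `per_second_rate` is made up for illustration; the two samples are the ones shown on the slides. A real monitoring server does more (for example, it looks at every sample in a time window), so treat this as the basic idea only.

```python
from datetime import datetime


def per_second_rate(v1, t1, v2, t2):
    """Average per-second increase of a counter between two samples."""
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (v2 - v1) / elapsed


# Samples from slides 22 and 23 (same labels, taken 30 seconds apart).
first_value, first_time = 1823, datetime(2018, 10, 18, 10, 18, 6)
second_value, second_time = 2123, datetime(2018, 10, 18, 10, 18, 36)

rate = per_second_rate(first_value, first_time, second_value, second_time)
print(rate)  # 10.0 requests per second
```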
  25. 25. http_request_total{job="fe",instance="fe1",code="200"}
  26. 26. Types of metrics Counters Gauges Histograms Summaries
  27. 27. Counters Always go up start from zero rate, increase e.g. number of http requests
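Because a counter only goes up and restarts from zero, a drop between two samples can be read as a restart of the process. The sketch below shows one way to compute an increase that tolerates such resets. The function name `counter_increase` is hypothetical, and this is a simplification: Prometheus's own `increase` also extrapolates to the edges of the time window, which is not done here.

```python
def counter_increase(samples):
    """Total increase over a list of consecutive counter samples.

    A sample lower than its predecessor is treated as a counter reset:
    the counter restarted from zero, so the new value is itself the
    increase since the restart.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current
    return total


# The process restarted between the second and third sample.
increase = counter_increase([100, 150, 20, 60])
print(increase)  # 110.0 (50 before the restart, 20 + 40 after it)
```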
  28. 28. Gauges Go up and down Average, Sum, Max, ... over time e.g. concurrent users
  29. 29. Histograms and summaries Sets of requests Using "buckets" Useful to get duration, percentiles, SLA
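To make the "buckets" idea concrete, here is a minimal sketch of how a percentile can be estimated from cumulative bucket counts using linear interpolation inside the matching bucket. The function name, the bucket bounds and the counts are all invented for illustration; the approach is similar in spirit to PromQL's `histogram_quantile`, but it is not a reimplementation of it.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    where the last upper bound is float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate inside an open-ended bucket.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request durations in seconds: 50 requests took <= 0.1 s,
# 90 took <= 0.5 s, and all 100 took <= 1 s.
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p95 = estimate_quantile(0.95, duration_buckets)
print(p95)  # 0.75
```

The precision of the estimate depends on the bucket boundaries, so buckets are usually chosen around the durations that matter for the SLA.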
  30. 30. Metrics and monitoring Metrics do not represent problems Metrics represent a state, give insights Metrics can be graphed You can alert based on them
  31. 31. Exposed metrics are "raw" In general you can just expose counters, and let the monitoring server do the real maths. That keeps the overhead on the apps very low.
  32. 32. Tooling Creative Commons Attribution 2.0 https://www.flickr.com/photos/psd/5298483
  33. 33. What are the needs? Ingest metrics at high frequency React to changes Empower people Alert on metrics
  34. 34. Use one toolchain Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/161054138@N08/37880775085
  35. 35. Stop with: Having 1 "monitoring" + 1 "graphing" stacks Big all in one tools: think decentralize, scale Auto Discovery (use service discovery instead) Manual config Fragile monitoring (think HA)
  36. 36. Prometheus https://prometheus.io/
  37. 37. Prometheus Open Source monitoring tool Complete Ecosystem For cloud and on prem Built around metrics
  38. 38. Cloud Native Easy to configure, deploy, maintain Designed in multiple services Container ready Orchestration ready (dynamic config) Fuzziness
  39. 39. Efficient "Scrapes" millions of metrics Scales Manages its own optimized db (prometheus/tsdb)
  40. 40. How does it work?
  41. 41. How does it work?
  42. 42. How does it work?
  43. 43. How does it work?
  44. 44. How does it work?
  45. 45. Pull vs Push Prometheus pulls metrics But does not know what it will get! The target decides what to expose (short term batches can still push to a "pushgateway")
  46. 46. Exporters Expose metrics with an HTTP API Bindings available for many languages (for "native" metrics) Exporters do not save data ; they are not "proxies" and don't "cache" anything
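What a target exposes over HTTP is plain text, one sample per line. The sketch below renders the sample from slide 22 in that text format using only the standard library. The helper names are made up for illustration, and real applications would normally use an official client library rather than formatting lines by hand.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name, labels, value):
    """Render one counter sample in the Prometheus text format."""
    label_pairs = ",".join(
        '{}="{}"'.format(key, escape_label_value(val))
        for key, val in sorted(labels.items())
    )
    type_line = "# TYPE {} counter".format(name)
    sample_line = "{}{{{}}} {}".format(name, label_pairs, value)
    return type_line + "\n" + sample_line


exposition = render_counter(
    "http_request_total",  # metric name as written on slide 25
    {"status": "200", "vhost": "inuits.eu", "method": "post"},
    1823,
)
print(exposition)
```

Note that the output carries no timestamp and no computed rate: the target only states its current counter value, and the server that scrapes it does the rest.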
  47. 47. Common exporters Node Exporter: Linux System Metrics Grok Exporter: Metrics from log files SNMP Exporter: Network devices Blackbox exporter: TCP, DNS, Http requests
  48. 48. Grafana Open Source (Apache 2.0) Web app Specialized in visualization Pluggable Multiple datasources: prometheus, graphite, influxdb... Has an API!
  49. 49. Grafana and Prometheus Prometheus shipped its own consoles Now it recommends Grafana and deprecated its own consoles
  50. 50. Business Metrics Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
  51. 51. What are business metrics? Metrics that effectively tell you how you fulfill your customers' requests Provide quality and level of service to customers
  52. 52. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/nox_noctis_silentium/3960497840
  53. 53. Where to get them? Frontends Databases Caching systems (sessions, ...) ... Each one of them requires a cross-team understanding of the business.
  54. 54. Where to start? Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/
  55. 55. USE Brendan Gregg's USE method U = Utilisation S = Saturation E = Errors For resources like network, CPU, memory,... Also asynchronous processes, ...
  56. 56. RED Tom Wilkie's RED method R = Requests E = Errors D = Duration HTTP Requests, synchronous processes,...
  57. 57. What to get? Request Rate Saturation Error Rate Duration
  58. 58. Before we dig in .. What we will see now is monitoring data. It should not be used for precise usages, like invoicing.
  59. 59. Caching System Monitoring (USE)
  60. 60. Caching System Monitoring
  61. 61. What do we learn? Users can connect to the platform: The authentication works The platform is currently used
  62. 62. Benefits Connected users = they can use the platform Know when you can do maintenance Know about your users' general habits (trends)
  63. 63. Database
  64. 64. Database
  65. 65. Database Using SQL exporters to query the data from your database Requires a cross team approach Gets you fine grained, quality data
  66. 66. Database trap Do not try to replace BI/Reporting Do not take too many labels -- stay in the monitoring area
  67. 67. Frontends
  68. 68. sum by (instance, env) ( rate(http_requests_duration_count[5m]) )
  69. 69. Frontends RED
  70. 70. sum by (code, env) ( rate(http_requests_duration_count{code!="200"}[5m]) ) / ignoring (code) group_left sum by (env) ( rate(http_requests_duration_count[5m]) )
  71. 71. Frontends RED
  72. 72. sum by (env) ( rate(http_requests_duration_sum[5m]) ) / sum by (env) ( rate(http_requests_duration_count[5m]) )
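The two ratios in the queries on slides 70 and 72 boil down to simple divisions: the share of non-200 requests per response code, and the average duration obtained by dividing the rate of the summed durations by the rate of the request count. The sketch below works on already-computed per-second rates for a single environment; the function names and all numbers are invented for illustration.

```python
def error_ratio_by_code(rate_by_code):
    """Share of the total request rate taken by each non-200 code."""
    total = sum(rate_by_code.values())
    if total == 0:
        return {}
    return {
        code: rate / total
        for code, rate in rate_by_code.items()
        if code != "200"
    }


def mean_duration(duration_sum_rate, request_count_rate):
    """Average seconds per request over the window."""
    if request_count_rate == 0:
        return float("nan")
    return duration_sum_rate / request_count_rate


# Hypothetical per-second rates over the last 5 minutes for one environment.
rates = {"200": 95.0, "404": 3.0, "500": 2.0}
ratios = error_ratio_by_code(rates)
average = mean_duration(12.5, 100.0)
print(ratios)   # {'404': 0.03, '500': 0.02}
print(average)  # 0.125 seconds per request
```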
  73. 73. What can we learn? We have traffic from outside How much traffic Quality of the traffic How long it really takes (end to end)
  74. 74. Networking Utilisation Saturation Errors Multicast Broadcast Use aliases to identify ports - human readable
  75. 75. Adding Time Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/rswilson74/3375654385
  76. 76. Timeseries How we use time: We take the metrics for the last 7 weeks We take the median value (excluding the 3 highest and the 3 lowest) This excludes anomalies due to incidents/holidays...
  77. 77. http_requests:rate5m offset 1w offset queries data in the past
  78. 78. - record: past_request_rate expr: http_requests:rate5m offset 1w labels: when: 1w
  79. 79. - record: past_request_rate expr: http_requests:rate5m offset 2w labels: when: 2w
  80. 80. max without(when) ( bottomk(1, topk(4, past_request_rate ) ) )
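The query on slide 80 finds the median of seven weekly values without a dedicated median function: keeping the 4 highest values and then taking the lowest of those yields the 4th highest of 7, which is the median. The sketch below checks that reasoning in Python; the function name and the weekly values are made up, with one incident week and one spike included on purpose.

```python
import heapq
import statistics


def median_of_seven(values):
    """Median of exactly seven values via 'top 4, then the lowest of those'."""
    if len(values) != 7:
        raise ValueError("expected one value per week for 7 weeks")
    top_four = heapq.nlargest(4, values)  # plays the role of topk(4, ...)
    return min(top_four)                  # plays the role of bottomk(1, ...)


# Hypothetical request rates at the same time of day over the last 7 weeks.
# 5.0 is an incident week, 300.0 is an unusual spike.
weekly_rates = [120.0, 118.0, 5.0, 125.0, 122.0, 300.0, 119.0]
baseline = median_of_seven(weekly_rates)
print(baseline)  # 120.0
```

Both outliers are ignored, which is why the median is a more robust baseline here than the average.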
  81. 81. Result RED
  82. 82. Oops... RED
  83. 83. What do we learn? Predict users' habits Deviations from the norm are not normal They mean that users cannot reach us/use our services
  84. 84. Why do business metrics matter? Good service depends on: linux health, dns, network, ntp, disk space, cpu, open files, database, cache systems, load balancers, partners, electricity, virtualization stack, nfs, ... and it moves over time Customers won't call you because your disk is full!
  85. 85. Partners Creative Commons Attribution 2.0 https://www.flickr.com/photos/deanhochman/27248626739
  86. 86. Given that the End User matters We have decided to standardize metrics exchange between partners Prometheus format used (soon to be OpenMetrics) Everyone knows HTTP!
  87. 87. What do we exchange? We are not interested in partners' internals (and don't want to expose ours) We are exchanging precomputed metrics (rate over 5 minutes, duration over 5 minutes), excluding servers, instances, ... Identify, in the chain, the bottlenecks and the issues
  88. 88. Dashboards
  89. 89. Kinds of dashboards General (multiple business) Business overview (e.g. one app) Business focused (e.g. one process) Technical overview (e.g. linux cluster) Technical focus (e.g. linux host) Even more focused (e.g. cpu usage)
  90. 90. Dashboards We define our business dashboards in two parts: 10 graphs on top about the business: RED, USE, Alerts, data from partners, monitoring robots, state of the monitoring hidden by default: Technical Health - ntp, disk, db, network, jvm, ...
  91. 91. Limited number of graphs Errors in RED Attention points in Yellow/Orange
  92. 92. technical view; more graphs; empty when OK
  93. 93. Dashboards Duplicate some dashboards to compare with a historical view, especially when a dashboard shows business patterns that are not easy to remember.
  94. 94. Dashboards Easy drill down between dashboards / with pre defined variables
  95. 95. Dashboard Provide relevant help where needed (from the haproxy documentation)
  96. 96. Dashboards On product launch / change / ... extract relevant data from the service and build a "temporary dashboard" Share with the teams and managers, show on big screen
  97. 97. Conventions Color conventions, general: RED = Bad Yellow = Attention Blue/Green = OK Also: RED = problem on our side Yellow/orange = problem on the partners' side
  98. 98. HTTP Codes 2xx: Greens 3xx: Yellows 4xx: Blues (404: grey) 5xx: Orange/Red Same across all dashboards to enable quick/easy reading.
  99. 99. This is not only about cross-team work Newcomers People passing by or not actively looking On-Call During incidents .. lots of people For those reasons, keep your dashboards simple and intuitive!
  100. 100. Side note: github.com/grafana/grafonnet-lib/ A Jsonnet Library to write grafana dashboards.
  101. 101. Alerting Creative Commons Attribution 2.0 https://www.flickr.com/photos/calliope/234447967
  102. 102. How to do alerting right Use multiple channels (chat, tickets) Alert when really needed (non prod: BH) Send the alert to the right people (incl. partners) Make the alerts actionable
  103. 103. Crisis Major incident in production Affecting multiple projects "Situation room": 2 channels: 1 for all the alerts, 1 for the people Bring managers, and all the relevant tech people in the same room Unique channel of communication for the incident (archived after the incident)
  104. 104. Conclusion Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/willy_photoshop/34829332342/
  105. 105. Quick Answers Business monitoring allows you to know early when things are wrong, across teams Provides clear answers to your customers in minutes (no more "I don't know, I will check") Lets you draw parallels between technical and business metrics (to find causes)
  106. 106. What happened? Is it REALLY fixed? When? Until when (technical and business)? What did I miss? What is the impact?
  107. 107. Metrics benefits Because you run queries and alerts from a central location You can run queries across targets/jobs Detect faulty instances, alert for server X based on metrics of server Y
  108. 108. Metrics benefits Trends Dynamic thresholds Predictions
  109. 109. Do not underestimate the monitoring of the development / staging environments.
  110. 110. Business metrics are good candidates to wake up someone at night. The downside is that that person must be fluent with the business.
  111. 111. Prometheus benefits Pull-based, metrics-centric The targets (e.g. developers) choose the metrics they expose => Empowering people HTTP permits TLS, Client Auth, ... and cross org sharing of metrics Becoming a standard in the industry
  112. 112. Grafana Central point for all teams Show current and past status Should give you the opportunity to answer questions
  113. 113. Focusing on Business Metrics is hard work that will show benefits across teams and provide visibility towards the hierarchy, enabling you to gain trust and move more quickly towards a DevOps model.
  114. 114. Julien Pivotto roidelapluie roidelapluie@inuits.eu Inuits https://inuits.eu info@inuits.eu Contact