
Monitoring and Metrics in Wonderland


At Jimdo we collect a large number of metrics about every part of our system. Data accrues at all levels of the system: infrastructure, system, and application. It is important that all developers can look at the real-time metrics of their services at any time. To guarantee that, we spent some time integrating Prometheus into our systems.

In our talk we will cover both operating Prometheus and its integrations with the rest of the Jimdo platform. We will report on the stumbling blocks we hit and the tricks we learned, and give an insight into our tool landscape.

Published in: Internet

  1. 1. Monitoring and Metrics in Wonderland
  2. 2. Who we are: Paul Seiffert, Dennis Benkert
  3. 3. Agenda ● What is Monitoring? ● Which Metrics to Observe? ● Prometheus ● Visualization and Alerting
  4. 4. What is Monitoring?
  5. 5. Let’s start with a little story about an incident
  6. 6. Customers are complaining about slow responses. Can you please take a look?
  7. 7. Scenario A
  8. 8. Really?! Let me check this quickly and I will come back to you.
  9. 9. $ time curl -SsL https://www.mypage.com/ > /dev/null real 0m2.322s user 0m0.014s sys 0m0.010s
  10. 10. Scenario B
  11. 11. Yup, we got triggered about this 30 minutes ago and are already rolling out a fix.
  12. 12. Alerting First Use Case for Monitoring
  13. 13. What happened in the last 30 minutes?
  14. 14. Prometheus triggered
  15. 15. called
  16. 16. What is happening?!
  17. 17. What’s changed? ● A change to the application introduced a new kind of database query ● The database wasn’t optimized for this query: no index had been created
  18. 18. Fix it! ● As a quick solution, we are rolling back the change ● For a long-term solution, we inform the developer responsible for the change and help them set up the right index for the new query
  19. 19. Analysis & Debugging Second Use Case for Monitoring
  20. 20. What Monitoring is ...
  21. 21. ● Gathering metrics representing the state of an application ● Storing these in a time based context ● Triggering alerts based on threshold breaches of metrics ● Offering visualizations to display different metrics in context
  22. 22. What Monitoring is not ...
  23. 23. ● Gathering business related information of your system ● Generating annual reports for management ● The typical Data Warehouse system
  24. 24. Wonderland
  25. 25. Wonderland ● Jimdo’s internal PaaS that runs 250 services ● 3000 Docker containers at a time ● 600 deployments per Day
  26. 26. AWS Other Providers Infrastructure Automation APIs Monitoring Tools Other Tools CLI Tools Wonderland
  27. 27. Which Metrics to Observe?
  28. 28. Infrastructure Metrics System Metrics Application Metrics
  29. 29. Infrastructure Metrics ● Requests per Second ● Number of Virtual Machines ● Network Utilization ● ...
  30. 30. AWS APIs → CloudWatch Exporter → Prometheus; Wonderland APIs → Custom Exporters → Prometheus
  31. 31. Examples # Average Load Balancer Latency aws_elb_latency_average{ load_balancer_name="web-prod" } = 0.018619823587046183 ~= 20ms # Number of Cluster Nodes wonderland_cluster_scale{ cluster="crims" } = 234
  32. 32. wonderland_cluster_scale{ cluster="crims", datacenter="eu-west-1a" } = 123 metric value labels
  33. 33. System Metrics ● CPU Utilization ● Memory Utilization ● Free Disk Storage ● ...
  34. 34. Cluster Instance: /proc → collectd, cgroups → cAdvisor → Prometheus
  35. 35. Examples # Memory usage of a specific Docker container container_memory_rss{ image="registry.jimdo/jimdo/web-fat:latest", instance="10.8.4.65:9104", name="web-prod--web-https-proxy" } = 724672512 ~= 700MB # Free disk space on root volume of a specific EC2 instance collectd_df_df_complex{ instance="10.8.4.65:9103", type="free", df="root" } = 5476036608 ~= 5GB
  36. 36. Application Metrics ● Processed Queue Messages ● Sign-Ups ● Number of Requests per Route ● ...
  37. 37. Prometheus → GET /metrics → Application Container (one request per container)
  38. 38. Example # Number of Requests on specific route in specific container http_requests_total{ action="index", controller="checkout", response="200", instance="10.8.4.180:11392", service="web-prod" } = 1191
  39. 39. Implementing a Metrics Endpoint ● Many Client Libraries available (https://prometheus.io/docs/instrumenting/clientlibs/) ● Mechanism is similar in all of them
  40. 40. Implementing a Metrics Endpoint 1. Register metric (name, labels, description) 2. Listen on /metrics 3. Provide values for metrics
  41. 41. Collecting Metrics <?php use Prometheus\CollectorRegistry; $registry = CollectorRegistry::getDefault(); $numRequests = $registry->getOrRegisterCounter( 'http', 'requests_total', 'Number of HTTP requests per status code', ['status'] ); $numRequests->inc([200]);
  42. 42. Exposing Metrics <?php use Prometheus\CollectorRegistry; use Prometheus\RenderTextFormat; $registry = CollectorRegistry::getDefault(); $renderer = new RenderTextFormat(); $result = $renderer->render($registry->getMetricFamilySamples()); header('Content-type: ' . RenderTextFormat::MIME_TYPE); echo $result;
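The three steps from slide 40 (register a metric, listen on /metrics, provide values) can be sketched without any client library. The following is a minimal illustration, not the PHP client's implementation: the `Counter` class, `MetricsHandler`, `serve` and the port are hypothetical names chosen for this example, and label values are not escaped as a real client library would do.

```python
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Hypothetical counter: one running total per label value combination."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, label_values, amount=1):
        if len(label_values) != len(self.label_names):
            raise ValueError("expected one value per label name")
        self._values[tuple(str(v) for v in label_values)] += amount

    def render(self):
        """Render in the Prometheus text exposition format (no escaping)."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{k}="{v}"' for k, v in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


# Step 1: register the metric (name, description, labels).
REQUESTS = Counter(
    "http_requests_total", "Number of HTTP requests per status code", ["status"]
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Step 2: answer GET /metrics with the current values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Step 3: provide values while the application handles requests.
REQUESTS.inc([200])
REQUESTS.inc([200])
REQUESTS.inc([500])
body = REQUESTS.render()
print(body)
```

Calling `serve()` would expose the same text under /metrics for Prometheus to fetch. In production a maintained client library is the better choice, since it also handles escaping, concurrency and the other metric types.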
  43. 43. Prometheus
  44. 44. http_requests_total{service="web-prod"} = 320 (annotated parts: metric, label, value)
  45. 45. http_requests_total over time: 320, 340, 405 (a time series)
  46. 46. 1 metric = 1 time series
  47. 47. redis_cache_misses mysql_query_duration http_requests_total
  48. 48. http_requests_total{service="web-prod"} = 320 (annotated part: label)
  49. 49. http_requests_total{ service="web-prod", response="200", controller="listUsers", instance="10.0.4.83" } = 320
  50. 50. 1 metric = 1 time series ???
  51. 51. http_requests_total{service="web-prod", instance="10.0.4.4"} http_requests_total{service="web-prod", instance="10.0.4.8"} http_requests_total{service="web-prod", instance="10.0.4.1"} ...
  52. 52. 1 metric + distinct label value combination = 1 time series
  53. 53. Recording rules to declutter metrics
  54. 54. Recording Rules ● Can pre-calculate heavily aggregating queries ● Can aggregate metrics to reduce label cardinality ● Can act like a filter on the input time series
  55. 55. service:http_requests_total:sum = sum(http_requests_total) without (instance)
  56. 56. http_requests_total{instance="10.0.4.4"} http_requests_total{instance="10.0.4.8"} http_requests_total{instance="10.0.4.1"} service:http_requests_total:sum{}
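To make the effect of the recording rule on slide 55 concrete, here is a small sketch of what `sum(...) without (instance)` does to a set of time series: it drops one label and adds up the series that become identical afterwards. The helper `sum_without` and the sample values are made up for illustration (the values are chosen so that they add up to the 320 used on earlier slides); this is not how Prometheus implements aggregation internally.

```python
from collections import defaultdict


def sum_without(samples, dropped_label):
    """Drop one label from every series, then sum the series that now collide.

    `samples` is a list of (labels_dict, value) pairs, one per time series.
    The result maps the remaining label set (as a frozenset of pairs) to the sum.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        remaining = frozenset(
            (key, val) for key, val in labels.items() if key != dropped_label
        )
        totals[remaining] += value
    return dict(totals)


# Three series of the same metric that differ only in the instance label.
samples = [
    ({"service": "web-prod", "instance": "10.0.4.4"}, 120),
    ({"service": "web-prod", "instance": "10.0.4.8"}, 110),
    ({"service": "web-prod", "instance": "10.0.4.1"}, 90),
]

aggregated = sum_without(samples, "instance")
for label_set, total in aggregated.items():
    print(dict(label_set), total)
```

Three input series collapse into a single one per service. That is the cardinality reduction the recording rule stores under the new name service:http_requests_total:sum.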
  57. 57. Federation to filter down metrics
  58. 58. Federation ● Allows Prometheus instances to fetch metrics from each other ● Supports filtering via time series names ● Can be used to reduce resolution on time series ● Enables you to use different retentions
  59. 59. Prometheus → /federate → Prometheus
  60. 60. Visualization and Alerting
  61. 61. Visualization: ● Fetches metrics from Prometheus ● Multitude of visualization options ● One dashboard per service
  62. 62. Prometheus Grafana Elasticsearch ...
  63. 63. Requests per second
  64. 64. Service Dashboard
  65. 65. Alerting ● Prometheus Alertmanager ● Send alerts via Slack, Email, PagerDuty ● Send alerts to the right people
  66. 66. Defining Alerts ALERT WebHighLatency IF service:aws_elb_latency_average:max{service="web-prod"} > 1 FOR 5m LABELS { importance="wake-me-at-night", team="platform" } ANNOTATIONS { summary = "Web Prod has a high latency.", runbook = "https://jimdo-runbooks.com/RUNBOOK_WEB.md", }
  67. 67. Further Reading ● PHP Prometheus Client Library github.com/Jimdo/prometheus_client_php ● SRE Book landing.google.com/sre/book/index.html
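The `FOR 5m` clause in the WebHighLatency rule on slide 66 means the condition has to hold continuously for five minutes before the alert fires; until then it is only pending. The sketch below illustrates that behaviour under simplifying assumptions: the function `alert_states`, the one-minute sample spacing and the latency values are all hypothetical, and real Prometheus evaluates rules on its own evaluation interval rather than per sample.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each (timestamp, value) sample.

    States: "inactive" while the value is at or below the threshold,
    "pending" while it has been above the threshold for less than
    `for_seconds`, and "firing" once it has stayed above for that long.
    """
    breach_started = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            elapsed = timestamp - breach_started
            state = "firing" if elapsed >= for_seconds else "pending"
        else:
            # Any dip below the threshold resets the timer.
            breach_started = None
            state = "inactive"
        states.append((timestamp, state))
    return states


# Made-up latency samples in seconds, one per minute.
latencies = [0.2, 1.4, 1.6, 0.8, 1.2, 1.3, 1.5, 1.7, 1.9, 2.0]
samples = [(minute * 60, value) for minute, value in enumerate(latencies)]

states = alert_states(samples, threshold=1, for_seconds=300)
for timestamp, state in states:
    print(timestamp, state)
```

The short spike in minutes one and two never fires because the value drops below the threshold again. Only the second, sustained breach reaches the firing state, which is what keeps a "wake-me-at-night" alert from paging on brief blips.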
