
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData

Scaling Prometheus in Kubernetes seems easy with service discovery, but it quickly devolves into manual, snowflake DevOps setup. Worse, a single developer can overwhelm a federated Prometheus setup and impact the system as a whole, with no way to self-service debug the problem. In this talk, Chris will focus on a variety of architectures that use Telegraf to scale scraping in Kubernetes and empower developers.

He’ll describe his experience scaling /metrics across the microservices of InfluxData’s Cloud 2.0 Kubernetes system…as he was the single developer who added just one more label…



  1. Lessons from Cloud: Scaling Prometheus metrics in Kubernetes with Telegraf
  2. The curious case of the missing metrics: One label too far...
  3. The Suspects ● Prometheus ● Kubernetes ● Gateway ● Queryd
  4. Prometheus http://gateway.twodotoh.svc.cluster.local:9999/metrics
  5. Prometheus http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service
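For readability, here is the scrape configuration from this slide reconstructed with its YAML indentation (the job name and service-discovery role are taken verbatim from the slide; the layout is inferred):

```yaml
# prometheus.yml: scrape every /metrics endpoint discovered through Kubernetes services
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service   # build the target list from Kubernetes Service objects
```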
  6. Kubernetes
  7. InfluxCloud (architecture diagram: an Ingress in front of multiple Gateway and Queryd pods)
  8. Problem: Prometheus Debugging is Hard
     prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
     prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.05"} 0.012562015
     prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.5"} 0.012562015
     prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.9"} 0.012562015
     prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015
     prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
     prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
  9. Problem: Prometheus Scaling is Hard global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh_ns_a kubernetes_sd_configs: - role: service namespaces: names: - a global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh_ns_b kubernetes_sd_configs: - role: service namespaces: names: - b
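Reconstructed, the slide's point is that scaling Prometheus by hand means cloning a nearly identical scrape job per namespace, snowflake config that only grows (layout inferred from the slide; "a" and "b" are the example namespaces shown):

```yaml
# One hand-written scrape job per namespace: only the namespace and job name differ.
scrape_configs:
  - job_name: prod_twodotoh_ns_a
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - a
  - job_name: prod_twodotoh_ns_b
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - b
```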
  10. Solution: Isolation with Telegraf Sidecar
  11. Solution: Isolation with Telegraf Sidecar apiVersion: apps/v1 kind: Deployment metadata: name: "gateway" labels: spec: serviceName: "gateway" replicas: 100 template: metadata: name: "gateway" labels: app: "gateway" spec: containers: - name: "telegraf" image: "docker.io/library/telegraf:1.12" - name: "gateway" image: "quay.io/influxdb/gateway:latest" [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
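A cleaned-up sketch of the sidecar pattern on this slide. The slide's manifest mixes a Deployment with the StatefulSet-only serviceName field and omits the selector that apps/v1 requires, and the InfluxDB Cloud URL is cut off; the version below is a minimal, valid Deployment plus the Telegraf config the sidecar would run, with the URL completed to the one shown on later slides:

```yaml
# deployment.yml: every gateway pod carries its own Telegraf scraper as a sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 100
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: telegraf
          image: docker.io/library/telegraf:1.12
        - name: gateway
          image: quay.io/influxdb/gateway:latest
```

```toml
# telegraf.conf: scrape only the pod-local gateway and ship to the monitoring backends
[[inputs.internal]]                            # Telegraf's own health metrics

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]     # the gateway container in the same pod

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]                      # only Telegraf's internal metrics go here
```

With each pod scraping itself, one team's label explosion stays contained in its own sidecar instead of taking down a shared, federated Prometheus.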
  12. Solution: Isolation with Telegraf Sidecar
  13. Problem: Prom has 1 and only 1 value http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service metric_relabel_configs: - regex: user_agent action: labeldrop
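The relabel rule from this slide, reconstructed: a Prometheus sample is only a float plus labels, so the usual way to stop a label like user_agent from exploding cardinality is to drop it at scrape time and lose that context entirely:

```yaml
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    metric_relabel_configs:
      - regex: user_agent    # match the label name
        action: labeldrop    # throw the label (and its context) away
```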
  14. Solution: Influx for more context http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[processors.converter]] [processors.converter.tags] string = ["user_agent"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
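The Telegraf alternative from this slide, reconstructed: the converter processor turns the user_agent tag into a string field, so InfluxDB keeps the context without creating a new series per user agent (layout inferred; outputs trimmed to the InfluxDB v1 one for brevity):

```toml
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]

[[processors.converter]]
  [processors.converter.tags]
    string = ["user_agent"]   # tag -> string field: context kept, cardinality contained

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"
```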
  15. Problem: Is there a way to prevent this? http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service metric_relabel_configs: - regex: user_agent action: labeldrop
  16. Solution: Telegraf Guard Rails http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[processors.tag_limit]] limit = 4 ## List of tags to preferentially preserve keep = ["handler", "method", "status"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
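The guard rail reconstructed: Telegraf's tag_limit processor caps the number of tags any metric may carry, so "just one more label" from a single developer cannot blow up the shared pipeline:

```toml
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]

[[processors.tag_limit]]
  limit = 4                                # no metric keeps more than 4 tags
  ## List of tags to preferentially preserve
  keep = ["handler", "method", "status"]
```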
  17. Problem: Hard to Rotate Prom Passwords http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service bearer_token_file: /etc/hunter2
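For contrast with the next slide, the Prometheus side reconstructed: a single bearer_token_file in the central scrape config means every target shares one credential, and rotating it is a cluster-wide change:

```yaml
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    bearer_token_file: /etc/hunter2   # one shared token for every scraped target
```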
  18. Solution: Per-Pod Credentials http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] bearer_token = "/etc/telegraf/hunter2"
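Reconstructed from the slide: each Telegraf sidecar reads its own token from a file mounted into the pod, so credentials rotate with the pod rather than with a central config (in Telegraf's prometheus input, bearer_token is the path to a token file):

```toml
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]
  bearer_token = "/etc/telegraf/hunter2"   # path to this pod's own token file
```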
  19. Lessons ● Scaling is NOT More Manual Processes ● Scaling is NOT saying “You’re Doing it Wrong” ● Scaling IS Empowering Developers ● Scaling IS Predictability of Failure Modes
  20. The time when we were watching the watchers...
  21. Problem: Am I scraping all the pods? http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service
  22. Solution: Telegraf K8s Inventory [[inputs.internal]] [[inputs.kube_inventory]] url = "http://1.1.1.1:10255" [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
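A sketch of the watch-the-watchers idea on this slide: the kube_inventory input gathers the cluster's desired state (deployments, pods, and so on), so you can compare how many pods Kubernetes says exist against how many targets are actually reporting. The slide points the plugin at a bare IP and kubelet-style port; the plugin's documented default, the Kubernetes API server, is used below as an assumption:

```toml
[[inputs.internal]]

[[inputs.kube_inventory]]
  ## Kubernetes API server address (documented default; the slide uses
  ## http://1.1.1.1:10255 for its own cluster).
  url = "https://kubernetes.default.svc"

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"
```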
  23. Prometheus Scraping Designs
  24. Scaling even more
  25. Scaling even more with Influx Enterprise (Load Balancer diagram)
  26. Scaling even more with Kafka and Influx Enterprise (Kafka diagram)
  27. Core Idea ● Measure and test metrics scaling ○ Are you missing metrics? ● Decentralize metrics gathering ○ Consider metrics as part of the program ● Empower Developers ○ They know their metrics the best. Allow them local tooling control
  28. First Order Conclusion ● Too easy to shoot yourself in the foot with Prometheus metrics. ● Too much in Prometheus needs operations heroes. ● Too difficult to express vital information about your program in Prometheus without a ton of centralized control. ● One mistake can impact everyone.
  29. Second Order Conclusion ● Prometheus is not descriptive enough. ● Extremely difficult to change over time. ● The metrics game is not a solved problem. ○ OpenTelemetry? ○ SNMP? ● Probably not one answer to everything.
  30. Future ● Flux into Telegraf ○ Processor for transformation ○ Moving the program near the data ○ Flux Output ○ Monitoring and alerting at the edge ● Telegraf Flux scripts hosted in the InfluxDB API ○ Runtime plugins without recompiling ○ Sampling rules from the server side ■ Aggregation on the server with input to the client ● What else?
  31. Thank You!
  32. The time when collecting metrics impacted storage... Measure, measure, measure
  33. Problem: Prometheus metrics are heavyweight
