Observing the HashiCorp Ecosystem From Prometheus

Talk given at HashiConf Europe 2022

  1. Observing the HashiCorp Ecosystem From Prometheus. Kris Buytaert & Julien Pivotto, June 21, 2022. O11y
  2. Who are we?
  3. Kris Buytaert • I used to be a developer • Then I became an Ops person • Chief Trolling/Travel/Technical Officer @ Inuits.eu • Chief Yak Shaver @ o11y.eu • Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ... • Co-founder of all of the above • Everything is a Freaking DNS Problem • DNS: devops needs sushi • @krisbuytaert on Twitter/GitHub
  4. Julien Pivotto • Prometheus maintainer • Open Source Observability Expert • Principal Software Architect & Co-founder @ o11y.eu • DevOps believer • @roidelapluie on Twitter/GitHub
  5. O11y • Inuits.eu spinoff • Open Source Observability • Currently supporting the Prometheus ecosystem • Professional Services & Support (now) • Long-Term Enterprise Support (next month) • Prometheus Distribution (soon)
  6. Introduction: a brief history of Open Source monitoring
  7. July 2008, Ottawa Linux Symposium paper • Bloated Java tools • Dysfunctional open-core software • DBA required • Nagios was king in the Open Source world
  8. June 2011: #monitoringsucks • John Vincent (@lusis), June 2011 • A #devops sub-movement • (manual configuration, not in sync with reality, hosts only, services sometimes, applications never)
  9. October 2011: #monitoringlove • Ulf Mansson, #devopsdays Rome 2011 • A new-found love for monitoring • Triggered by { New Open Source Tools * Automation }
  10. November 2012: Prometheus
  11. What is monitoring? • A high-level overview of the state of a service/component • Availability • Technical components • Performance. What is going on?
  12. Pitfalls of traditional monitoring • Drift from reality • Total lack of automation • Total lack of automation • Total lack of automation • Total lack of automation • Partial automation • Lots of work to maintain • Binary states: it works / it does not work • Alert fatigue • Alert fatigue • Alert fatigue • Alert fatigue
  13. What is observability? • Understand how your services behave • As if you were in their place • Without incident-specific code. Why is this going on?
  14. How do monitoring and observability connect? • Monitoring is required • If lucky, monitoring is enough • Observability is removing luck <- @roidelapluie
  15. What is observability in practice? Three pillars: • Metrics • Logs • Traces
  16. Metrics: https://play.grafana.org/
  17. Logs: https://play.grafana.org/
  18. Traces: https://www.jaegertracing.io/
  19. Prometheus
  20. Prometheus • Prometheus is an Open Source CNCF project • Collects and stores metrics • Pull-based • Service discovery (including Consul) • Alerting
  21. The Prometheus ecosystem • Exporters for every piece of the infrastructure • Maintained by multiple companies • Long-Term Support release coming Q3 2022
  22. Prometheus data model • Metrics have labels • Labels differentiate metrics, e.g.: • HTTP response code • Datacenter name
  23. PromQL • Prometheus Query Language • A powerful yet simple query language: rate(http_requests_total[5m])
  24. Prometheus + Consul
  25. Observing your services • consul_sd_configs • Streams the Consul service list to Prometheus • Up-to-date service list • Use the flexibility of labels • Add relevant labels • Filter targets
  26. consul_sd_configs labels • __meta_consul_service • __meta_consul_tags • __meta_consul_node • __meta_consul_service_metadata_ • __meta_consul_dc
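A minimal sketch of how these discovery labels are typically used in a scrape job, assuming a local Consul agent; the job name, the "prometheus" tag filter, and the target label names are illustrative, not from the talk:

```yaml
scrape_configs:
  - job_name: consul-services
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      # Keep only services tagged "prometheus" in Consul
      # (tags are joined with "," and wrapped in "," by Prometheus)
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      # Carry the Consul service name over as a regular label
      - source_labels: [__meta_consul_service]
        target_label: service
      # Keep the datacenter, e.g. for alert routing and inhibition
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
```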
  27. Alerting philosophy • Page on actionable critical failures • Avoid paging on Consul health check failures • Keep “ambiance” alerts to get the atmosphere and quickly find the cause
  28. Consul
  29. consul_exporter • Exporter maintained by the Prometheus team • Exposes Consul cluster health • Optionally exposes key/values • e.g. store desired state in KV for graphing • Connects to a single instance
  30. Consul telemetry • Built-in • Runtime metrics (memory, CPU, ...) • Autopilot and raft metrics • Calls (rate, errors, latency)
  31. Configure Consul telemetry. Consul configuration:

      telemetry {
        disable_hostname = true
        prometheus_retention_time = "1h"
      }

  32. Configure Consul telemetry. Prometheus configuration:

      scrape_configs:
        - job_name: consul
          metrics_path: '/v1/agent/metrics'
          params:
            format: ['prometheus']
          static_configs:
            - targets:
                - <consulserver1>:8500
                - <consulserver2>:8500
  33. Consul alerts (consul_exporter)
      Is Consul running?
        up{job="consul_exporter"} == 0
        consul_up{job="consul_exporter"} == 0
      Is there a leader?
        consul_raft_leader != 1
      Are the peers in raft?
        sum(consul_raft_peers) != count(up{job="consul"})
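A minimal sketch of how one of these expressions becomes a Prometheus alerting rule; the rule-file layout is standard, while the alert name, `for` duration, and severity label are illustrative choices:

```yaml
groups:
  - name: consul
    rules:
      # Page when the raft cluster has had no leader for a full minute
      - alert: ConsulNoLeader
        expr: consul_raft_leader != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Consul cluster has no raft leader'
```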
  34. Consul alerts (Consul telemetry)
      Is Consul running?
        up{job="consul"} == 0
      Is my cluster healthy?
        consul_autopilot_healthy == 0
  35. Vault
  36. Configure Vault telemetry. Vault configuration:

      telemetry {
        disable_hostname = true
        prometheus_retention_time = "1h"
      }

  37. Configure Vault telemetry. Prometheus configuration:

      scrape_configs:
        - job_name: vault
          metrics_path: '/v1/sys/metrics'
          params:
            format: ['prometheus']
          static_configs:
            - targets:
                - <vaultserver1>:8200
                - <vaultserver2>:8200
  38. Vault alerting
      Is Vault up?
        up{job="vault"} == 0
      Is Vault sealed?
        vault_core_unsealed == 0
      Is the audit log working?
        rate(vault_audit_log_request_failure[5m]) > 0
        rate(vault_audit_log_response_failure[5m]) > 0
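The sealed-Vault check is one of the alerts the talk suggests can page you directly. A sketch of it as a rule, using the same standard rule-file layout; the alert name matches the one used on the inhibition slide, and the labels are illustrative:

```yaml
groups:
  - name: vault
    rules:
      # A sealed Vault cannot serve secrets: page immediately
      - alert: VaultIsSealed
        expr: vault_core_unsealed == 0
        labels:
          severity: critical
        annotations:
          summary: 'Vault instance {{ $labels.instance }} is sealed'
```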
  39. Alertmanager
  40. Alert inhibition • Suppresses notifications from some alerts while other alerts are firing • Reduces alert noise, e.g. when Vault is sealed
  41. Configuring inhibition. Alertmanager configuration:

      inhibit_rules:
        - source_match:
            alertname: VaultIsSealed
          target_match:
            alertname: ErrorRateTooHigh
          equal: ['datacenter']

  42. Conclusion
  43. Conclusion • Alerting should come from your end services • Consul- and Vault-focused alerts will pinpoint causes • Specific Vault & Consul alerts can page you (e.g. sealed) • Draft dashboards based on your needs (response times, errors, etc.)
  44. Contact O11y: https://o11y.eu • info@o11y.eu
