OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad Hoffmann

SoundCloud <3 Prometheus. However, when it comes down to hardware, getting data into Prometheus isn’t always straightforward. In this talk, I will provide a look into how we managed to port all our infrastructure monitoring, including SNMP, IPMI and more, to Prometheus, and even improve it along the way.

Published in: Software

  1. 1. Hardware-level data-center monitoring with Prometheus Conrad Hoffmann
  2. 2. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™
  3. 3. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ AMS5
  4. 4. 2118 servers 56 racks
  5. 5. 2118 servers 56 racks 200 network devices
  6. 6. 2118 servers 56 racks 200 network devices 2 * 2 generic uplinks 3 AWS Direct Connect 3 Google X-Connect
  7. 7. Where we started... & NRPE Cloud Watch Cacti
  8. 8. What’s paging you at night? Collection Visualization Alerting Cacti ✔ ✔ ✔ CloudWatch ✔ ✔ ✔ Ganglia ✔ Graphite ✔ ✔ Icinga/Nagios ✔ ✔ ✔ Smokeping ✔ ✔ ✔ Statsd ✔
  9. 9. https://xkcd.com/927/
  10. 10. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ prometheus.io
  11. 11. The Promise of Prometheus Prometheus is a reliable, scalable, flexible monitoring and alerting system that is easy to integrate and focused on real-time metrics.
  12. 12. Prometheus: reliability ● Pull-based (“scrape”) ● List of known targets ○ Can be dynamic, e.g. DNS or service discovery ● Built-in meta-monitoring ● Redundancy is easy
  13. 13. Prometheus: scalability ● Performant, efficient storage ● Scales well to available resources ● Easy to scale horizontally ● Federation
  14. 14. Prometheus: flexibility ● Multi-dimensional, label-based data model ● Each data point is defined by ○ A metric name ○ An arbitrary number of key-value pairs (labels) ○ A value ○ A timestamp (added by Prometheus) ● Data points with identical metric names and labels form a time series ● Powerful query language allows for easy aggregation based on labels
  15. 15. Prometheus: flexibility
      Target exposes:
      http_responses_total{backend="foo",code="2xx"} 804
      http_responses_total{backend="foo",code="4xx"} 3170
      http_responses_total{backend="bar",code="2xx"} 6637
      http_responses_total{backend="bar",code="4xx"} 26
      Possible query:
      sum(http_responses_total{backend="foo"})
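To make the label-based model concrete, here is a minimal Python sketch of what the query on this slide computes. It is not how Prometheus evaluates PromQL internally; `parse_samples` and `sum_matching` are hypothetical helper names, and the parser only handles the simple `name{labels} value` lines shown on the slide (no comments, escapes, or timestamps).

```python
import re

# Sample lines copied from the slide (Prometheus text exposition format).
EXPOSITION = """\
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
"""

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple 'name{labels} value' lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            name, raw_labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def sum_matching(samples, name, **matchers):
    """Sum the values of all series with this name whose labels equal the matchers."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    )


# Rough equivalent of: sum(http_responses_total{backend="foo"})
total_foo = sum_matching(parse_samples(EXPOSITION), "http_responses_total", backend="foo")
print(total_foo)
```

The selector keeps the two `backend="foo"` series regardless of their `code` label, and `sum` collapses them into a single value.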
  16. 16. Prometheus: ease of integration ● Data format is text based ● Scrapes are HTTP requests ● Many integrations exist already ● Excellent tooling/libraries to write new ones
  17. 17. Prometheus: ease of integration [Diagram: Application]
  18. 18. Prometheus: ease of integration [Diagram: Host, node exporter]
  19. 19. Prometheus: ease of integration [Diagram: Host, SNMP exporter, Network, Router A, Router B]
  20. 20. Prometheus: ease of integration [Diagram: Host, SNMP exporter, Network, Router A, Router B]
  21. 21. Nomen est omen... ● Alerting ● Silencing ● Alert grouping & routing ● High availability Alertmanager
  22. 22. Displays data from many sources: ● Prometheus ● Graphite ● Influx ● OpenTSDB ● Elasticsearch ● MySQL/Postgres ● CloudWatch ● ... Grafana grafana.com
  23. 23. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ Now with Protips!
  24. 24. Node exporter ● Exports: OS- and hardware-level metrics for running systems ● Replaces: Ganglia, some Icinga/NRPE checks ● Noteworthy: ○ Comes with many collectors built-in ○ Use WMI exporter on Windows
  25. 25. Protip I Use the node exporter’s text file collector as an easy integration point for custom metrics! Examples: Chef data, RAID controller data, SMART data, cron jobs, ... [Diagram: Host, script, text file, node exporter]
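As an illustration of Protip I, the following Python sketch shows one way a script could hand metrics to the text file collector. The metric name, file name, and directory are made up for the example. The sketch assumes the node exporter is started with `--collector.textfile.directory` pointing at the target directory and relies on the collector only reading files that end in `.prom`. Writing to a temporary file and renaming it avoids the exporter ever reading a half-written file.

```python
import os
import tempfile


def write_textfile_metrics(directory, filename, lines):
    """Write metric lines to <directory>/<filename> via an atomic rename."""
    final_path = os.path.join(directory, filename)
    # Create the temporary file in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    # mkstemp creates the file as 0600; relax it so the exporter's user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path


# Hypothetical example: a cron job records when it last succeeded.
demo_dir = tempfile.mkdtemp()  # stand-in for the real textfile directory
path = write_textfile_metrics(demo_dir, "cron_backup.prom", [
    "# HELP cron_backup_last_success_timestamp_seconds Last successful backup run.",
    "# TYPE cron_backup_last_success_timestamp_seconds gauge",
    "cron_backup_last_success_timestamp_seconds 1528790400",
])
print(open(path).read())
```

A cron job would typically call something like this as its last step, so that an alert can fire when the timestamp gets too old.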
  26. 26. Blackbox exporter ● Exports: data about probes against endpoints that don’t support Prometheus natively (DNS, HTTP(S), ICMP, TCP) ● Replaces: Smokeping, some Icinga checks ● Noteworthy: ○ Monitor TLS certificate expiry :)
  27. 27. Blackbox exporter - Smokeping replacement 1. Send ICMP probe every five seconds
  28. 28. Blackbox exporter - Smokeping replacement 2. Alert on target down and packet loss
      ALERT SmokepingTargetDown
        IF probe_success{job="smokeping"} == 0
        FOR 2m
      ALERT SmokepingTargetPacketLoss
        IF 100 * (1 - avg_over_time(probe_success{job="smokeping"}[2m])) > 20
  29. 29. Blackbox exporter - Smokeping replacement 3. Use Prometheus aggregation functions in Grafana
  30. 30. Blackbox exporter - Smokeping replacement
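The packet-loss alert from step 2 boils down to simple arithmetic over the `probe_success` samples in a two-minute window. The Python sketch below reproduces that calculation outside of Prometheus; the sample window is invented for illustration, and `packet_loss_percent` is a hypothetical helper, not part of any exporter.

```python
def packet_loss_percent(probe_results):
    """Return packet loss in percent for a window of probe_success samples (1 or 0)."""
    if not probe_results:
        raise ValueError("window contains no samples")
    average = sum(probe_results) / len(probe_results)
    # Same arithmetic as: 100 * (1 - avg_over_time(probe_success[2m]))
    return 100 * (1 - average)


# One ICMP probe every five seconds yields 24 samples per two-minute window.
window = [1] * 18 + [0] * 6  # hypothetical window in which 6 probes were lost
loss = packet_loss_percent(window)
should_alert = loss > 20  # threshold taken from the alert rule on the slide
print(loss, should_alert)
```

Because `probe_success` is either 0 or 1, its average over time is the fraction of successful probes, which is why the short scrape interval matters: more samples per window give a finer-grained loss figure.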
  31. 31. Protip II Scrape more, scrape faster! ● ~ 1M metrics ● > 5000 targets ● Mostly 10s scrape interval, some 5s, some longer ● 50 days retention time ● 250 GB storage ¯\_(ツ)_/¯
  32. 32. SNMP exporter ● Exports: SNMP data from network devices ● Replaces: Cacti ● Noteworthy: ○ a pain to configure
  33. 33. SNMP exporter - Cacti replacement Once you have got the right SNMP config, alerts and nice graphs are easy!
  34. 34. SNMP exporter - Cacti replacement Cacti’s killer feature: the weathermap plugin! https://network-weathermap.com/
  35. 35. SNMP exporter - Cacti replacement There is a diagram panel type in Grafana, but… we’re not quite there yet ¯\_(ツ)_/¯
  36. 36. Protip III Build a dedicated long-term Prometheus server: ● Scrape only a few selected metrics ● Yank retention time way up ● Make backups (hot backups possible in Prometheus >2.1) Very useful data for estimating e.g. future bandwidth needs!
  37. 37. Collins exporter - Collins? ● https://tumblr.github.io/collins ● Infrastructure management / IPAM ● Server inventory, classification and lifecycle management
  38. 38. Collins exporter ● Exports: asset inventory data from Collins ● Replaces: a bunch of scripts ● Noteworthy: ○ https://github.com/soundcloud/collins_exporter
  39. 39. Collins exporter
  40. 40. Collins exporter ● Another candidate for long-term storage ● Valuable data for capacity planning
  41. 41. Protip IV Build your own integrations! Collins exporter: ● Written in Go ● 1 source file ● 264 lines total ¯\_(ツ)_/¯
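To show how little is needed for a custom integration, here is a minimal exporter sketch in Python using only the standard library. It is not the Collins exporter (which is written in Go), and the `demo_asset_info` metric and the `ASSETS` data are invented for the example. In practice an official Prometheus client library would normally handle the rendering.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical inventory data standing in for a real backend such as Collins.
ASSETS = [
    {"tag": "ABCD1234", "nodeclass": "app-2"},
    {"tag": "EFGH5678", "nodeclass": "db-1"},
]


def render_metrics(assets):
    """Render one info-style sample per asset in the text exposition format."""
    lines = ["# TYPE demo_asset_info gauge"]
    for asset in assets:
        lines.append(
            'demo_asset_info{tag="%s",nodeclass="%s"} 1'
            % (asset["tag"], asset["nodeclass"])
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(ASSETS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Bind to port 0 so the OS picks a free port, scrape once, then shut down.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
connection = http.client.HTTPConnection("127.0.0.1", server.server_port)
connection.request("GET", "/metrics")
response = connection.getresponse()
status = response.status
scraped = response.read().decode("utf-8")
connection.close()
server.shutdown()
server.server_close()
print(scraped)
```

A real exporter would keep serving instead of shutting down after one request, and would fetch fresh data from its backend on every scrape.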
  42. 42. IPMI exporter ● Exports: IPMI data retrieved from BMCs ● Replaces: many Nagios/NRPE checks ● Noteworthy: ○ https://github.com/soundcloud/ipmi_exporter ○ Works regardless of the host’s power state
  43. 43. IPMI exporter ● Mostly sensor data: temperature, fans, power consumption ● Mostly used for alerting: ○ Fans ○ Power supplies ○ Batteries
  44. 44. Protip V Make use of techniques to ingest non-numeric data!* ● Use labels to expose (semi-)static data of interest
      ipmi_bmc_info{firmware_revision="2.52",manufacturer_id="Dell_Inc"} 1
      *...but do it with some caution!
  45. 45. Protip V Make use of techniques to ingest non-numeric data!* ● Use labels and binary values to represent state
      collins_asset_state{tag="ABCD1234",state="Allocated"} 1
      collins_asset_state{tag="ABCD1234",state="Maintenance"} 0
      collins_asset_state{tag="ABCD1234",state="Unallocated"} 0
      *...but do it with some caution!
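Both techniques from Protip V can be sketched in a few lines of Python. The helper names `info_sample` and `state_samples` are hypothetical; the metric names and label values are the ones shown on the slides. An info-style metric carries its payload in labels with a constant value of 1, while a state metric emits one series per possible state and marks the current one with 1.

```python
def info_sample(name, labels):
    """Render an info-style sample: data lives in the labels, the value is always 1."""
    rendered = ",".join('%s="%s"' % (key, value) for key, value in labels.items())
    return "%s{%s} 1" % (name, rendered)


def state_samples(name, tag, current, states):
    """Render one sample per possible state; only the current state gets value 1."""
    return [
        '%s{tag="%s",state="%s"} %d' % (name, tag, state, 1 if state == current else 0)
        for state in states
    ]


bmc_line = info_sample(
    "ipmi_bmc_info",
    {"firmware_revision": "2.52", "manufacturer_id": "Dell_Inc"},
)
state_lines = state_samples(
    "collins_asset_state",
    "ABCD1234",
    current="Allocated",
    states=["Allocated", "Maintenance", "Unallocated"],
)
print(bmc_line)
print("\n".join(state_lines))
```

The caution on the slides is well placed: every distinct label value creates a new time series, so this approach suits slowly changing values with a small set of possibilities rather than free-form strings.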
  46. 46. And now: merging data sources Example: BMC Firmware revisions of certain server types
  47. 47. And now: merging data sources
      Query: ipmi_bmc_info{firmware_revision!="2.52"}
      Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
  48. 48. And now: merging data sources
      Query: ipmi_bmc_info{firmware_revision!="2.52"}
      Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
      Query: collins_asset_details{nodeclass="app-2"}
      Result: collins_asset_details{ipmi_address="10.1.2.3",...}
  49. 49. And now: merging data sources
      Query: ipmi_bmc_info{firmware_revision!="2.52"}
      Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
      Query: collins_asset_details{nodeclass="app-2"}
      Result: collins_asset_details{ipmi_address="10.1.2.3",...}
      Query: label_replace(ipmi_bmc_info, "ipmi_address", "$1", "instance", "(.*)")
      Result: ipmi_bmc_info{firmware_revision="2.41",ipmi_address="10.1.2.3",...}
  50. 50. And now: merging data sources
      Query:
      collins_asset_details{nodeclass="app-2"}
        * on (ipmi_address) group_left(firmware_revision)
        label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
      Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234"}
  51. 51. And now: merging data sources
      Query:
      collins_asset_details{nodeclass="app-2"}
        * on (ipmi_address) group_left(firmware_revision)
        label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
        * on (tag) group_left(status)
        (collins_asset_status == 1)
      Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234",status="Allocated"}
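The final query chains a label copy and two joins, which can be hard to follow in one expression. The Python sketch below replays the same steps on the sample series from the slides. It only mimics the PromQL semantics for this simple many-to-one case; `copy_label` and `join_group_left` are hypothetical helpers, and the second `ipmi_bmc_info` series and the `Maintenance` status series are invented to show the filters doing something.

```python
# Series are modelled as (labels, value) pairs. Labels elided with "..." on the
# slides are simply left out.
ipmi_bmc_info = [
    ({"firmware_revision": "2.41", "instance": "10.1.2.3"}, 1.0),
    ({"firmware_revision": "2.52", "instance": "10.1.2.4"}, 1.0),  # hypothetical
]
collins_asset_details = [
    ({"ipmi_address": "10.1.2.3", "nodeclass": "app-2",
      "primary_address": "10.10.20.30", "tag": "ABCD1234"}, 1.0),
]
collins_asset_status = [
    ({"tag": "ABCD1234", "status": "Allocated"}, 1.0),
    ({"tag": "ABCD1234", "status": "Maintenance"}, 0.0),  # hypothetical
]


def copy_label(series, dst, src):
    """Mimic label_replace(v, dst, "$1", src, "(.*)"): copy src's value into dst."""
    return [({**labels, dst: labels.get(src, "")}, value) for labels, value in series]


def join_group_left(left, right, on, extra):
    """Mimic `left * on(on) group_left(extra) right` for a many-to-one match.

    Assumes the right side has at most one series per value of `on`; PromQL
    would report an error otherwise, while this sketch keeps the last one.
    """
    index = {labels.get(on, ""): (labels, value) for labels, value in right}
    joined = []
    for labels, value in left:
        match = index.get(labels.get(on, ""))
        if match is not None:
            right_labels, right_value = match
            joined.append(({**labels, extra: right_labels[extra]}, value * right_value))
    return joined


# ipmi_bmc_info{firmware_revision!="2.52"}
outdated = [(l, v) for l, v in ipmi_bmc_info if l["firmware_revision"] != "2.52"]
# label_replace(..., "ipmi_address", "$1", "instance", "(.*)")
with_address = copy_label(outdated, "ipmi_address", "instance")
# collins_asset_details * on (ipmi_address) group_left(firmware_revision) ...
with_firmware = join_group_left(
    collins_asset_details, with_address, "ipmi_address", "firmware_revision"
)
# ... * on (tag) group_left(status) (collins_asset_status == 1)
active_status = [(l, v) for l, v in collins_asset_status if v == 1]
result = join_group_left(with_firmware, active_status, "tag", "status")
print(result)
```

The multiplication is only a vehicle for the join: since every series involved has the value 1, the product stays 1 and the interesting output is the merged label set.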
  52. 52. Where we are now... & NRPE Cloud Watch Cacti ✘ ✘ ✘ ✘ ✘ ✘ ✘
  53. 53. What’s paging you at night? Collection Visualization Alerting CloudWatch ✔ ✔ ✔ Graphite (✔) Prometheus ✔ Grafana ✔ Alertmanager ✔
  54. 54. What’s up with this CloudWatch thing? ● There is a CloudWatch exporter ● However, CloudWatch’s internal architecture is fundamentally incompatible with Prometheus ● Using CloudWatch as a Grafana data source can incur costs
  55. 55. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™
  56. 56. So, is it working? ● Yes
  57. 57. Was it worth it? ● Yes
  58. 58. Why was it worth it? ● Many integrations readily available ● New ones are easy to write ● Quality and quantity of monitoring have increased ● Monitoring and alerting have become much more consistent ● Easy to merge data sources for alerting or graphing This is true across the entire organization, not just infrastructure!
  59. 59. Soon: long term storage ● Not a primary concern for Prometheus ● Simple solution as explained ● Remote (read/)write interface ● Some features in Prometheus 2.0 to allow external solutions ○ Check out e.g. Thanos: https://github.com/improbable-eng/thanos
  60. 60. Soon: forging a standard? OpenMetrics working group ● https://github.com/RichiH/OpenMetrics
  61. 61. This is the end... Thank you!