Maksim Vazhenin [Dell Technologies] | InfluxDB for Storage System Monitoring | InfluxDays EMEA 2021

This talk tells the story of how Dell switched the internal monitoring system shipped with Dell EMC ECS Enterprise Object Storage from a home-grown solution to an InfluxDB-based stack. The session will cover the following topics:

Lessons learned on completely changing the monitoring stack on the shipped system while doing continuous releases
Building a separate service running Flux language which connects to InfluxDB instances
Running multiple InfluxDB instances for HA
Using Flux language for Grafana dashboards and alerting rules
How to control metrics ingest rate and cardinality to have predictable resource consumption
Shipping InfluxDB with the storage system for internal monitoring and running InfluxDB under low memory constraints (3 GB)
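
To make the cardinality topic above concrete, here is a minimal Python sketch that counts series cardinality for a set of points. It assumes a series is identified by a unique measurement name plus tag set; the measurement names, tag names, and the `series_keys` helper are hypothetical and are not taken from the talk.

```python
from typing import Dict, Iterable, Set, Tuple

# A point is (measurement, tags). Field values are left out because,
# under the assumption above, they do not create new series.
Point = Tuple[str, Dict[str, str]]
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_keys(points: Iterable[Point]) -> Set[SeriesKey]:
    """Return the set of unique (measurement, sorted tag set) keys."""
    return {(m, tuple(sorted(tags.items()))) for m, tags in points}


# Hypothetical sample data.
points = [
    ("disk_io", {"node": "node1", "disk": "sda"}),
    ("disk_io", {"node": "node1", "disk": "sdb"}),
    ("disk_io", {"node": "node2", "disk": "sda"}),
    ("disk_io", {"node": "node1", "disk": "sda"}),       # repeat: no new series
    ("request", {"node": "node1", "request_id": "a1"}),  # unbounded tag value
    ("request", {"node": "node1", "request_id": "a2"}),  # every new id adds a series
]

cardinality = len(series_keys(points))
print(cardinality)  # prints 5
```

A tag with unbounded values (the hypothetical `request_id` here) adds one series per value, which is why dropping or filtering such metrics keeps resource consumption predictable.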

  1. Maksim Vazhenin, Software Sr Principal Engineer, Dell Technologies: InfluxDB for Storage System Monitoring
  2. Internal Use - Confidential | Agenda: Our journey to InfluxDB monitoring stack; High Availability for InfluxDB; Horizontally scalable query with Flux language; Deploy on low memory resources; How to switch monitoring stack; Dashboards…
  3. Agenda (repeat of slide 2)
  4. ECS [Diagram labels: Site 1, Site 2, Site 3; IoT, Financial Services, Media & Entertainment, Cloud Backup Archive, Modern Apps, Evidence Repository, Analytics]
  5. ECS is deployed on physical nodes grouped into racks [Diagram labels: Datacenter; Rack 1, Rack N; Node 1 to Node 4, Node N to Node N+3]
  6. ECS is deployed in Docker containers
  7. ECS internal monitoring data: Performance; System monitoring; Internal health metrics; Capacity (lots of complicated compute from many services)
  8. Existing monitoring solution disadvantages: Different teams involved to show data on UI; Code change to add new dashboard; No flexible query language; Slow on large queries
  9. Need for modern monitoring stack: Easy to build dashboards; System resources monitoring; Easy to create alerts; Autonomous service teams
  10. Challenges: High scale (clusters of ~300 nodes); No free resources
  11. Alternatives: ELK; Prometheus; InfluxDB
  12. Alternatives, ELK: High resource requirements; Flexible analytics
  13. Alternatives, Prometheus: High cardinality; Bad when working with rare data; Bad at counting exact values; Performance; Query language; Does not support backfilling
  14. Alternatives, Prometheus [Diagram labels: 12, 15, 17; extrapolation; increase(4m); Polling interval 2m]
  15. Alternatives, InfluxDB: High cardinality; InfluxQL; Performance; Can be used for exact compute; Supports backfilling; Flux query language
  16. Alternatives: ELK; Prometheus; InfluxDB
  17. Our journey to InfluxDB monitoring stack: Distributed storage monitoring; InfluxDB beats competitors
  18. Agenda (repeat of slide 2)
  19. Single InfluxDB [Diagram: Nodes 1 to 5; one InfluxDB instance]
  20. Telegraf on all nodes [Diagram: Nodes 1 to 5, each with Telegraf; one InfluxDB instance]
  21. Grafana on all nodes [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; one InfluxDB instance]
  22. No data if node is down [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; one InfluxDB instance]
  23. Run 3 InfluxDB instances [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances; InfluxDB datasources: Influxdb1, Influxdb2, Influxdb3]
  24. Support 2 nodes down [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
  25. After failures some data may be unavailable [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
  26. (no slide text)
  27. Recover on startup: Use backup-restore API [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
  28. All data available even in case of rolling failures [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
  29. Now we can even do node replacements [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
  30. High Availability for InfluxDB: Run multiple InfluxDB instances; Use backup-restore API
  31. Agenda (repeat of slide 2)
  32. Select datasource manually in Grafana [Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances; InfluxDB datasources: Influxdb1, Influxdb2, Influxdb3]
  33. Run Fluxd service on all nodes [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; Fluxd datasource: local fluxd]
  34. Fluxd: Offload compute from InfluxDB [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; labels: from()|>filter()|>range(), complex query]
  35. Fluxd: Load-balance complex compute [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; label: complex query]
  36. Fluxd: Stateless [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]
  37. Horizontally scalable query with Flux language: Single datasource; Offload compute from InfluxDB; Load-balance requests; Horizontally scalable
  38. Agenda (repeat of slide 2)
  39. Need to use minimal resources and avoid OOM [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]
  40. Telegraf [Diagram labels: Node 3; Telegraf]
  41. Services push metrics to Telegraf [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  42. Sometimes services may push more metrics [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  43. More metrics cause OOM [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  44. Better to drop metrics than die: Drop metrics when buffer is filled [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  45. Set buffer limit for Telegraf: metric_batch_size = 1000; metric_buffer_limit = 4000 [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  46. Telegraf has predictable memory for received metrics: metric_batch_size = 1000; metric_buffer_limit = 4000 [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  47. Telegraf has predictable memory when some InfluxDB instances are down: Buffer per output [Diagram labels: Node; Telegraf; three InfluxDB instances]
  48. But it still sometimes dies due to OOM: metric_batch_size = 1000; metric_buffer_limit = 4000 [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN]
  49. Telegraf used lots of input plugins [Diagram labels: Node; Telegraf; Inputs: InfluxDB listener, procstat, mem, …, exec]
  50. Exec plugin uses unpredictable scripts [Diagram labels: Node; Telegraf; Inputs: exec; scripts]
  51. Unpredictable scripts cause OOM [Diagram labels: Node; Telegraf; Inputs: exec; scripts]
  52. Get rid of the exec plugin [Diagram labels: Node; Telegraf; Inputs: InfluxDB listener, procstat, mem, …, exec]
  53. Telegraf never dies: Drop metrics when buffer is filled; metric_batch_size = 1000; metric_buffer_limit = 4000 [Diagram labels: Node; Telegraf; Service1, Service2, …, ServiceN; Inputs: InfluxDB listener, procstat, mem, …; New metrics]
  54. InfluxDB [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]
  55. InfluxDB memory driving factors: Number of metrics; Metrics cardinality; Retention period; Compute
  56. Number of metrics matters [Diagram: Nodes 1 to 5, each with Telegraf; InfluxDB]
  57. Drop unused metrics, prevent high cardinality: filter unused metrics with namepass [Diagram: Nodes 1 to 5, each with Telegraf; InfluxDB]
  58. Push less frequently if you can: push interval 5 min [Diagram: Nodes 1 to 5, each with Telegraf; InfluxDB]
  59. With full history InfluxDB used more memory [Diagram: Nodes 1 to 5, each with Telegraf; InfluxDB]
  60. Select shard duration carefully: Shard count < 10 [Diagram labels: Database; Retention; shards, each with an index]
  61. Resource consumption of all components is under control [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]
  62. ECS is operated by the customer [Diagram labels: Datacenter; Rack 1, Rack N; Node 1 to Node 4, Node N to Node N+3]
  63. Customers sometimes use external monitoring [Diagram labels: Datacenter; Rack 1; Node 1 to Node 4; External Monitoring]
  64. Periodically poll Fluxd for external monitoring [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]
  65. Extra resources needed on Fluxd and InfluxDB [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]
  66. Not all metrics are available in internal monitoring: filter some metrics [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]
  67. Push all metrics from Telegraf to external monitoring: Send all metrics; filter some metrics [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]
  68. Extra continuous queries and dashboards on external monitoring: Send all metrics; filter some metrics [Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]
  69. Deploy on low memory resources: Limit Telegraf buffer; Do not use exec input plugins; Offload compute from InfluxDB; Filter out unneeded metrics; Push with lower frequency; Push metrics to external monitoring
  70. Agenda (repeat of slide 2)
  71. How to switch monitoring stack [Diagram labels: UI; Alerting framework; Node; Telegraf; Services; Grafana; InfluxDB; Fluxd; Dashboard service; Statistic framework]
  72. Agenda (repeat of slide 2)
  73. Lots of new dashboards were created
  74. Performance
  75. (no slide text)
  76. System metrics
  77. (no slide text)
  78. Top N buckets
  79. (no slide text)
  80. And many more …
  81. Summary
  82. Summary: InfluxDB is a great time series database; May add High Availability on top of OSS version; May fit into low memory resources; May use as internal monitoring in on-premise products; Good luck using it in your product
  83. Questions? Feedback? Let’s connect! Email: maksim.vazhenin@dell.com LinkedIn: https://www.linkedin.com/in/maksim-vazhenin/
