Maksim Vazhenin [Dell Technologies] | InfluxDB for Storage System Monitoring | InfluxDays EMEA 2021
This talk tells the story of how Dell switched the internal monitoring system shipped with Dell EMC ECS Enterprise Object Storage from a home-grown monitoring system to an InfluxDB-based stack. The session covers the following topics:
Lessons learned from completely changing the monitoring stack on a shipped system while doing continuous releases
Building a separate service that runs the Flux language and connects to InfluxDB instances
Running multiple InfluxDB instances for HA
Using the Flux language for Grafana dashboards and alerting rules
How to control metrics ingest rate and cardinality to get predictable resource consumption
Shipping InfluxDB with the storage system for internal monitoring and running InfluxDB under tight memory constraints (3 GB)
1.
Maksim Vazhenin
Software Sr Principal Engineer
Dell Technologies
InfluxDB for Storage
System Monitoring
2.
Internal Use - Confidential
| Agenda
Our journey to the InfluxDB monitoring stack
High Availability for InfluxDB
Horizontally scalable query with Flux language
Deploy on low memory resources
How to switch monitoring stack
Dashboards…
3.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
4.
SITE 1 SITE 3
SITE 2
IoT
Financial
Services
Media &
Entertainment
Cloud
Backup
Archive
Modern
Apps
Evidence
Repository
Analytics
ECS
5.
ECS is deployed on physical nodes grouped into racks
Node 1 Node 2 Node 3 Node 4 Node N Node N+1 Node N+2 Node N+3
Rack N
Rack 1
Datacenter
6.
ECS is deployed in Docker containers
7.
ECS internal monitoring data
Performance
System monitoring
Internal health metrics
Capacity (lots of complicated computation across many services)
8.
Existing monitoring solution disadvantages
Different teams must be involved to show data in the UI
Code changes needed to add a new dashboard
No flexible query language
Slow on large queries
9.
Need for modern monitoring stack
Easy to build dashboards
System resources monitoring
Easy to create alerts
Autonomous service teams
10.
Challenges
High scale (~300-node clusters)
No free resources
11.
Alternatives
ELK Prometheus InfluxDB
12.
Alternatives
ELK
High resource requirements Flexible analytics
13.
Alternatives
Prometheus
High cardinality
Bad when working with rare data
Bad at counting exact values
Performance
Query language
Does not support backfilling
15.
Alternatives
InfluxDB
High cardinality
InfluxQL Performance
Can be used for exact compute
Supports backfilling
Flux Query language
16.
Alternatives
ELK Prometheus InfluxDB
17.
Our journey to the InfluxDB monitoring stack
Distributed storage monitoring
InfluxDB beats competitors
18.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
19.
Single InfluxDB
Node 5
Node 4
Node 3
Node 2
Node 1
Influxdb
20.
Telegraf on all nodes
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
21.
Grafana on all nodes
Node 5
Node 4
Node 3
Node 2
Node 1
Grafana Grafana Grafana Grafana Grafana
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
22.
No data if Node is down
Node 5
Node 4
Node 3
Node 2
Node 1
Grafana Grafana Grafana Grafana Grafana
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
27.
Recover on startup
Grafana Grafana Grafana Grafana Grafana
Influxdb
Influxdb Influxdb
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Use the backup/restore API
28.
All data available even in case of rolling failures
Node 5
Node 4
Node 3
Node 2
Node 1
Grafana Grafana Grafana Grafana Grafana
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
Influxdb Influxdb
29.
Now we can even do node replacements
Node 5
Node 4
Node 3
Node 2
Node 1
Grafana Grafana Grafana Grafana Grafana
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
Influxdb Influxdb
30.
High Availability for InfluxDB
Run multiple InfluxDB instances
Use the backup/restore API
31.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
37.
Horizontally scalable query with Flux language
Single datasource
Offload compute from InfluxDB
Load-balance requests
Horizontally scalable
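One way to read the "single datasource" and "load-balance requests" points is a query front end that spreads requests across the InfluxDB instances and skips any that are down. The sketch below models only the selection logic with a plain round-robin policy. The class name, the backend names, and the policy itself are assumptions for illustration; the slides do not say how the Flux service picks an instance.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class RoundRobinBalancer:
    """Hypothetical helper: cycle through backends, skipping unhealthy ones."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy: Dict[str, bool] = {b: True for b in self._backends}
        self._ring: Iterator[str] = cycle(self._backends)

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a health check (how it is probed is out of scope)."""
        self._healthy[backend] = healthy

    def next_backend(self) -> str:
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backend available")


# Backend names are placeholders, not names taken from the talk.
balancer = RoundRobinBalancer(["influxdb-1", "influxdb-2", "influxdb-3"])
first_four = [balancer.next_backend() for _ in range(4)]
```

Because every query service instance holds the same backend list, adding more instances adds query capacity without changing what Grafana sees as its datasource.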
38.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
39.
Need to use minimal resources and avoid OOM
Node 5
Node 4
Node 3
Node 2
Node 1
Grafana Grafana Grafana Grafana Grafana
Influxdb
Influxdb
Influxdb
Fluxd Fluxd Fluxd Fluxd Fluxd
Telegraf Telegraf Telegraf Telegraf Telegraf
40.
Telegraf
Node 3
Telegraf
41.
Services push metrics to Telegraf
Node
Telegraf
Service1 Service2 ServiceN
…
42.
Sometimes services push more metrics than expected
Telegraf
Service1 Service2 ServiceN
…
Node
43.
More metrics cause OOM
Telegraf
Service1 Service2 ServiceN
…
Node
44.
Better to drop metrics than to die
Service1 Service2 ServiceN
…
Telegraf
Drop metrics when the buffer is full
Node
45.
Set buffer limit for Telegraf
Service1 Service2 ServiceN
…
Telegraf
metric_batch_size = 1000
metric_buffer_limit = 4000
Node
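To see why a fixed `metric_buffer_limit` keeps memory bounded, it helps to model the buffer as a fixed-size queue. The sketch below is a simplified Python model, not Telegraf's implementation: the class name is hypothetical, metrics are plain strings, and it assumes the oldest entries are discarded first once the limit is reached.

```python
from collections import deque
from typing import Deque, List


class BoundedMetricBuffer:
    """Simplified model of a metric buffer with a hard size limit."""

    def __init__(self, limit: int) -> None:
        # deque(maxlen=...) evicts from the left (oldest) when full.
        self._buf: Deque[str] = deque(maxlen=limit)
        self.dropped = 0

    def add(self, metric: str) -> None:
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1  # the append below overwrites the oldest entry
        self._buf.append(metric)

    def take_batch(self, batch_size: int) -> List[str]:
        """Remove and return up to batch_size of the oldest buffered metrics."""
        batch: List[str] = []
        while self._buf and len(batch) < batch_size:
            batch.append(self._buf.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._buf)


# Values taken from the slide: batch of 1000, buffer limit of 4000.
buffer = BoundedMetricBuffer(limit=4000)
for i in range(5000):          # a burst larger than the buffer
    buffer.add(f"metric-{i}")
batch = buffer.take_batch(1000)
```

However large the burst, the buffer never holds more than 4000 entries, so the worst-case memory for buffered metrics is roughly the limit times the average metric size.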
46.
Telegraf memory use for received metrics is predictable
Service1 Service2 ServiceN
…
Telegraf
metric_batch_size = 1000
metric_buffer_limit = 4000
Node
47.
Telegraf memory use stays predictable when some InfluxDB instances are down
Telegraf
Influxdb
Influxdb
Influxdb
Buffer per output
Node
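The "buffer per output" point can be illustrated with a small model: each output gets its own bounded queue, so an unreachable InfluxDB instance fills only its own buffer while the healthy outputs keep draining. This is a sketch under assumptions (hypothetical class and output names, string metrics, whole-buffer flushes), not a description of Telegraf internals.

```python
from collections import deque
from typing import Deque, Dict, List


class FanOut:
    """Model: every metric is copied into one bounded buffer per output."""

    def __init__(self, outputs: List[str], buffer_limit: int) -> None:
        self.buffers: Dict[str, Deque[str]] = {
            name: deque(maxlen=buffer_limit) for name in outputs
        }

    def add(self, metric: str) -> None:
        for buf in self.buffers.values():
            buf.append(metric)  # a full deque silently evicts its oldest entry

    def flush(self, output: str, is_up: bool) -> List[str]:
        """Send everything buffered for one output, unless it is down."""
        if not is_up:
            return []  # keep buffering; the size cap still applies
        buf = self.buffers[output]
        sent = list(buf)
        buf.clear()
        return sent

    def buffered_total(self) -> int:
        return sum(len(buf) for buf in self.buffers.values())


fan = FanOut(["influxdb-a", "influxdb-b", "influxdb-c"], buffer_limit=4000)
for i in range(10_000):
    fan.add(f"metric-{i}")
    if i % 1000 == 999:                    # periodic flush
        fan.flush("influxdb-a", is_up=True)
        fan.flush("influxdb-b", is_up=True)
        fan.flush("influxdb-c", is_up=False)  # this instance is down
```

In this model the worst case is the number of outputs times the buffer limit, regardless of how long an instance stays down.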
48.
But Telegraf still sometimes dies due to OOM
Service1 Service2 ServiceN
…
Telegraf
metric_batch_size = 1000
metric_buffer_limit = 4000
Node
49.
Telegraf used lots of input plugins
Telegraf
Influxdb
listener
Inputs
procstat mem
…
exec
Node
51.
Unpredictable scripts cause OOM
Telegraf
Inputs
exec
scripts
Node
52.
Stop using the exec plugin
Telegraf
Influxdb
listener
Inputs
procstat mem
…
exec
Node
53.
Telegraf never dies
Service1 Service2 ServiceN
…
Telegraf
Influxdb
listener
Inputs
procstat mem
…
New metrics
Drop metrics when the buffer is full
metric_batch_size = 1000
metric_buffer_limit = 4000
Node
55.
Factors driving InfluxDB memory use
Number of metrics
Metrics cardinality
Retention period
Compute
56.
Number of metrics matters
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
InfluxDB
57.
Drop unused metrics, prevent high cardinality
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
filter unused metrics
namepass
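Telegraf's `namepass` option keeps only metrics whose measurement name matches one of a list of glob patterns. The Python function below imitates that allow-list behaviour so the effect is easy to see. It is an approximation: it uses Python's `fnmatch` globbing, which may differ from Telegraf's glob rules in edge cases, and the measurement names are made up.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def namepass(names: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep only measurement names that match at least one glob pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical measurement names pushed by services on a node.
incoming = ["cpu", "mem", "disk_io", "debug_trace", "disk_usage"]

# Only measurements that dashboards or alerts actually use are let through.
kept = namepass(incoming, ["cpu", "disk_*"])
```

Filtering at the Telegraf level means unused measurements never reach InfluxDB, so they cost neither ingest bandwidth nor index memory.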
58.
Push less frequently if you can
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
push interval 5 min
Influxdb
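The effect of a longer push interval is simple arithmetic: the number of points written per day scales inversely with the interval. The sketch below works through it. The series count and the 10-second baseline are hypothetical figures chosen for illustration; only the 5-minute interval and the five nodes come from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(series_per_node: int, nodes: int, interval_seconds: int) -> int:
    """Points written per day if every series reports once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    pushes_per_day = SECONDS_PER_DAY // interval_seconds
    return series_per_node * nodes * pushes_per_day


# Assumed: 1000 series per node, 5 nodes.
every_10_seconds = points_per_day(1000, 5, 10)    # hypothetical baseline
every_5_minutes = points_per_day(1000, 5, 300)    # interval from the slide

reduction_factor = every_10_seconds // every_5_minutes
```

Under these assumptions the 5-minute interval writes thirty times fewer points than the 10-second baseline.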
59.
With full history, InfluxDB used more memory
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
60.
Select shard duration carefully
Retention
…
shard shard shard shard
index index index index
Database
Shard count < 10
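The "shard count < 10" guideline ties shard duration to the retention period: the shorter each shard, the more shards (and indexes) are live at once. The helper below does that arithmetic. It is a rough model that ignores the extra shard that can exist while the oldest one waits to expire, and the 30-day retention is a hypothetical example, not a figure from the talk.

```python
import math
from datetime import timedelta


def shard_count(retention: timedelta, shard_duration: timedelta) -> int:
    """Approximate number of shards needed to cover the retention period."""
    return math.ceil(retention / shard_duration)


def min_shard_duration(retention: timedelta, max_shards: int) -> timedelta:
    """Shortest shard duration that keeps the count at or below max_shards."""
    if max_shards <= 0:
        raise ValueError("max_shards must be positive")
    return retention / max_shards


retention = timedelta(days=30)  # hypothetical retention period

daily = shard_count(retention, timedelta(days=1))    # too many shards
weekly = shard_count(retention, timedelta(days=7))   # within the guideline

# "Shard count < 10" means at most 9 shards.
shortest = min_shard_duration(retention, max_shards=9)
```

With 30 days of retention, one-day shards give 30 shards, while seven-day shards give 5, which stays under the limit from the slide.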
61.
Resource consumption of all components is under control
Grafana Grafana Grafana Grafana Grafana
Influxdb
Fluxd Fluxd Fluxd Fluxd Fluxd
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
Influxdb
62.
ECS is operated by the customer
Rack N
Rack 1
Datacenter
Node 1 Node 2 Node 3 Node 4 Node N Node N+1 Node N+2 Node N+3
66.
Not all metrics are available in internal monitoring
filter some metrics
Grafana Grafana Grafana Grafana Grafana
Fluxd Fluxd Fluxd Fluxd Fluxd
External
Monitoring
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb Influxdb
Influxdb
67.
Push all metrics from the Telegraf instances to external monitoring
External
Monitoring
Send all metrics
Grafana Grafana Grafana Grafana Grafana
Influxdb
Fluxd Fluxd Fluxd Fluxd Fluxd
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
Influxdb
filter some metrics
68.
Extra continuous queries and dashboards on external monitoring
External
Monitoring
Send all metrics
Grafana Grafana Grafana Grafana Grafana
Influxdb
Fluxd Fluxd Fluxd Fluxd Fluxd
Node 5
Node 4
Node 3
Node 2
Node 1
Telegraf Telegraf Telegraf Telegraf Telegraf
Influxdb
Influxdb
filter some metrics
69.
Deploy on low memory resources
Limit the Telegraf buffer
Do not use exec input plugins
Offload compute from InfluxDB
Filter out unneeded metrics
Push less frequently
Push metrics to external monitoring
70.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
71.
How to switch monitoring stack
UI
Alerting framework
Node
Telegraf
Services
Grafana
Influxdb
Fluxd
Dashboard service
Statistic framework
72.
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…
73.
Lots of new dashboards were created
82.
Summary
InfluxDB is a great time series database
High availability can be added on top of the OSS version
It can fit into tight memory budgets
It can serve as internal monitoring in on-premises products
Good luck using it in your product